assemble pages and then look for claims #10

Open
wants to merge 1 commit into
base: main
from

gvelez17
Member

No description provided.

@gvelez17 gvelez17 requested a review from ZiadHamdyy April 21, 2025 03:51
@TutTrue TutTrue requested a review from zeyadhessuin April 21, 2025 11:58
@@ -21,14 +22,30 @@ def process_and_visualize_claims(docmgr, output_file: str = "claims_analysis.htm
         print(f"First metadata sample: {results['metadatas'][0]}")
     else:
         print("No documents found in collection! Exiting")
-        exit
+        return
Contributor

@zeyadhessuin zeyadhessuin Apr 26, 2025

  1. If we pass a new PDF, it is processed by doc_manager.process_pdf(args.pdf) in main(), so how could this else branch be reached unless the PDF contains no text?
  2. Testing the code on a new PDF that has not been processed before doesn't work, because the function returns here instead of processing the PDF and running to completion.

def sort_key(x):
    metadata = x[1]
    if 'bbox' in metadata:
        return (metadata['page'], metadata['bbox'][1])  # Sort by page, then y-coord
Contributor

It uses the page number without checking which PDF the chunk came from, so it collects all text from page n regardless of which PDF that text belongs to.
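One possible fix is to make the source document part of the sort key. The following is only a sketch: it assumes each chunk's metadata carries a `source` field identifying the PDF (a hypothetical key name, not confirmed by this diff), that `bbox` is laid out as `(x0, y0, x1, y1)`, and that chunks without a `bbox` should sort to the top of their page.

```python
def sort_key(item):
    """Order chunks by source PDF, then page, then vertical position."""
    _text, metadata = item
    # "source" is an assumed metadata key naming the originating PDF.
    source = metadata.get("source", "")
    page = metadata.get("page", 0)
    # Chunks without a bbox are placed at the top of the page (assumption).
    y = metadata["bbox"][1] if "bbox" in metadata else 0
    return (source, page, y)


# Illustrative data only: two PDFs that both have a page 1.
chunks = [
    ("b-top", {"source": "b.pdf", "page": 1, "bbox": [0, 10, 5, 20]}),
    ("a-low", {"source": "a.pdf", "page": 1, "bbox": [0, 50, 5, 60]}),
    ("a-top", {"source": "a.pdf", "page": 1, "bbox": [0, 10, 5, 20]}),
]
ordered = [text for text, _ in sorted(chunks, key=sort_key)]
```

With this key, chunks from different PDFs are no longer interleaved just because they share a page number.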

page_text = ""
for text, metadata in pages[page_num]:
    if metadata.get('type') == 'text':  # Skip images
        page_text += text + "\n\n"
Contributor

Image text can be appended to the end of page_text once all the direct text has been added.
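A rough sketch of that idea, assuming image chunks are marked with `type == 'image'` in their metadata (the exact value is an assumption; the diff only shows the `'text'` check) and that `build_page_text` is a hypothetical helper name:

```python
def build_page_text(chunks):
    """Join direct text first, then append any text extracted from images."""
    direct, from_images = [], []
    for text, metadata in chunks:
        kind = metadata.get("type")
        if kind == "text":
            direct.append(text)
        elif kind == "image" and text:
            # Assumed marker for image-derived text; empty strings are skipped.
            from_images.append(text)
    return "\n\n".join(direct + from_images)


# Illustrative data only.
page = [
    ("Intro", {"type": "text"}),
    ("Figure caption OCR", {"type": "image"}),
    ("Body", {"type": "text"}),
]
page_text = build_page_text(page)
```

Unlike the loop in the diff, this joins with `"\n\n"` rather than appending it after every chunk, so there is no trailing separator.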

Contributor

@zeyadhessuin zeyadhessuin left a comment

Grouping the chunks doesn't work well when several processed PDFs are stored in chromadb. Chunks are grouped by page number alone, so the text on page n of every stored PDF (pdf_1, pdf_2, ...) ends up merged into a single pages[n] group.
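One way to address this would be to key the groups on the source document as well as the page. This is a sketch under assumptions: the `source` metadata key and the `group_by_document_page` helper are hypothetical, and `documents` / `metadatas` stand in for the parallel lists returned from the collection.

```python
from collections import defaultdict


def group_by_document_page(documents, metadatas):
    """Group chunks by (source PDF, page) instead of page number alone."""
    pages = defaultdict(list)
    for text, metadata in zip(documents, metadatas):
        # "source" is an assumed metadata key naming the originating PDF.
        key = (metadata.get("source", "unknown"), metadata.get("page", 0))
        pages[key].append((text, metadata))
    return dict(pages)


# Illustrative data only: page 1 exists in two different PDFs.
documents = ["A p1", "B p1", "A p1 more"]
metadatas = [
    {"source": "pdf_1", "page": 1},
    {"source": "pdf_2", "page": 1},
    {"source": "pdf_1", "page": 1},
]
pages = group_by_document_page(documents, metadatas)
```

Alternatively, if the metadata does store a source field, the results could be restricted to the PDF being analyzed at retrieval time with a `where` filter on the collection's `get` call, which would avoid mixing documents in the first place.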

2 participants