Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extraction: add extract_with_metadata method and ut #765

Merged
merged 3 commits into from
Dec 11, 2024

Conversation

unsleepy22
Copy link
Contributor

Add extract_with_metadata method in core for scenarios which require separate metadata and extracted content to be both returned.

A few extra notes:

  1. deprecated options like no_fallback and max_tree_size are removed in the new method.
  2. with_metadata option is hard-coded to True and only_with_metadata is hard-coded to False to avoid confusion.

Copy link

codecov bot commented Dec 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.27%. Comparing base (7067937) to head (d20a078).
Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #765   +/-   ##
=======================================
  Coverage   99.27%   99.27%           
=======================================
  Files          21       21           
  Lines        3572     3576    +4     
=======================================
+ Hits         3546     3550    +4     
  Misses         26       26           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@unsleepy22 unsleepy22 force-pushed the feat/extract-with-metadata branch 2 times, most recently from ca056c2 to b7e859d Compare December 9, 2024 13:06
@unsleepy22 unsleepy22 force-pushed the feat/extract-with-metadata branch from b7e859d to b534c66 Compare December 9, 2024 13:15
@adbar
Copy link
Owner

adbar commented Dec 10, 2024

Hi @unsleepy22, if I understand properly your new function is a wrapper around bare_extraction which returns a document if the extraction is successful. So it's nearly the same as extract.

This may be convenient but as such the code entails duplicate lines (extraction options and post-processing). Could you please find a way to regroup the common functionality to extract() and extract_with_metadata() ?

  • Option 1: At least the post-processing could isolated in a separate internal function, e.g. _postprocessing().
  • Option 2: Create an internal function like _extraction_process() which returns a document so that both extract() and extract_with_metadata() can act as wrappers around it

This would make future maintenance much easier. If you find other ways to simplify the code that would be nice.

What do you think?

@unsleepy22
Copy link
Contributor Author

Hi @unsleepy22, if I understand properly your new function is a wrapper around bare_extraction which returns a document if the extraction is successful. So it's nearly the same as extract.

This may be convenient but as such the code entails duplicate lines (extraction options and post-processing). Could you please find a way to regroup the common functionality to extract() and extract_with_metadata() ?

  • Option 1: At least the post-processing could isolated in a separate internal function, e.g. _postprocessing().
  • Option 2: Create an internal function like _extraction_process() which returns a document so that both extract() and extract_with_metadata() can act as wrappers around it

This would make future maintenance much easier. If you find other ways to simplify the code that would be nice.

What do you think?

Sure, your comments make sense. I felt a bit strange too when adding the method. I'll see how to regroup the code.

@unsleepy22
Copy link
Contributor Author

Would you take a look again? If this is ok I'll rebase the code.
There're still 2 regroup-extraction-options in newly added _internal_extraction and bare_extraction though, with different param numbers(18 vs 19), not too certain if I should refactor these 2 places since it won't save much code, so I just leave it as it is, what do you think?

@adbar
Copy link
Owner

adbar commented Dec 11, 2024

Thanks, I think it's fine for now, it can always be improved later.

If you write other pull requests you can always rebase the code, I don't mind.

@adbar adbar merged commit b010779 into adbar:master Dec 11, 2024
15 checks passed
@unsleepy22 unsleepy22 deleted the feat/extract-with-metadata branch December 11, 2024 11:42
@unsleepy22
Copy link
Contributor Author

By the way, would you mind bumping a new version like 2.0.1 ? As my company is eager to use the newly added method in production :)

@adbar
Copy link
Owner

adbar commented Dec 11, 2024

I understand but I don't plan a new release right now, there is another PR waiting and there might be more to come. You can always point to any given commit in your dependency file so basically you can already use the code with your changes.

Also, if your company uses the software please think about sponsoring the project.

@unsleepy22
Copy link
Contributor Author

I understand but I don't plan a new release right now, there is another PR waiting and there might be more to come. You can always point to any given commit in your dependency file so basically you can already use the code with your changes.

Also, if your company uses the software please think about sponsoring the project.

We're using trafilatura in a small team for an internal non-commercial project. Anyway I will try to talk to guys who manage open source affairs to promote this great project to see if they can sponsor this project.

@adbar
Copy link
Owner

adbar commented Dec 18, 2024

That would be great, maintenance is definitely work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants