extraction: add extract_with_metadata method and ut #765
Codecov Report
All modified and coverable lines are covered by tests ✅

Coverage diff, master vs. #765:
Coverage: 99.27% → 99.27%
Files: 21 → 21
Lines: 3572 → 3576 (+4)
Hits: 3546 → 3550 (+4)
Misses: 26 → 26
ca056c2 to b7e859d
b7e859d to b534c66
Hi @unsleepy22, if I understand properly, your new function is a wrapper around […]. This may be convenient, but as such the code entails duplicate lines (extraction options and post-processing). Could you please find a way to regroup the common functionality to […]?
This would make future maintenance much easier. If you find other ways to simplify the code, that would be nice. What do you think?
Sure, your comments make sense. It felt a bit strange to me too when I added the method. I'll see how to regroup the code.
Would you take another look? If this is OK, I'll rebase the code.
Thanks, I think it's fine for now; it can always be improved later. If you write other pull requests you can always rebase the code, I don't mind.
By the way, would you mind releasing a new version, such as 2.0.1? My company is eager to use the newly added method in production :)
I understand, but I don't plan a new release right now: there is another PR waiting and there might be more to come. You can always point to any given commit in your dependency file, so you can already use the code with your changes. Also, if your company uses the software, please think about sponsoring the project.
We're using trafilatura in a small team for an internal non-commercial project. Anyway, I will try to talk to the people who manage open source affairs and promote this great project to see if they can sponsor it.
That would be great, maintenance is definitely work.
Add an `extract_with_metadata` method in core for scenarios that need both the metadata and the extracted content returned, as separate values.
A few extra notes:
- `no_fallback` and `max_tree_size` are removed in the new method.
- The `with_metadata` option is hard-coded to True and `only_with_metadata` is hard-coded to False to avoid confusion.
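
The design discussed in this thread (one shared internal routine, with thin public wrappers that pin some options) can be sketched roughly as below. This is only an illustration under stated assumptions, not trafilatura's actual code: `Document`, `_extract_core`, `extract_sketch`, and `extract_with_metadata_sketch` are hypothetical names, and the regex-based parsing is deliberately naive.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    """Hypothetical container that keeps metadata and content separate."""
    title: Optional[str] = None
    text: str = ""


def _extract_core(html: str, with_metadata: bool,
                  only_with_metadata: bool) -> Optional[Document]:
    """Shared routine: option handling and post-processing live in one place."""
    title_match = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    title = title_match.group(1).strip() if title_match else None
    # Discard documents without metadata only when the caller asks for it.
    if only_with_metadata and title is None:
        return None
    body_match = re.search(r"<body>(.*?)</body>", html, re.S | re.I)
    body = body_match.group(1) if body_match else html
    # Naive post-processing: drop tags, then collapse whitespace.
    text = " ".join(re.sub(r"<[^>]+>", " ", body).split())
    return Document(title=title if with_metadata else None, text=text)


def extract_sketch(html: str, with_metadata: bool = False,
                   only_with_metadata: bool = False) -> Optional[str]:
    """Text-only wrapper (hypothetical): exposes both options."""
    doc = _extract_core(html, with_metadata, only_with_metadata)
    return doc.text if doc is not None else None


def extract_with_metadata_sketch(html: str) -> Optional[Document]:
    """Wrapper returning metadata and content together (hypothetical).

    Both options are pinned, mirroring the notes above: with_metadata is
    always True and only_with_metadata is always False.
    """
    return _extract_core(html, with_metadata=True, only_with_metadata=False)


html = ("<html><head><title> Hello </title></head>"
        "<body><p>First</p><p>Second</p></body></html>")
doc = extract_with_metadata_sketch(html)
print(doc.title, "|", doc.text)
```

Because both wrappers delegate to the same routine, a change to option handling or post-processing only has to be made once, which is the maintenance benefit raised in the review.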