include_images changes text extraction #194
Comments
Hi @carschno, I can reproduce the bug. Extraction with images isn't my priority but I'll try to look into it.
@adbar Thanks! I understand that this behaviour is definitely not expected, right?
No, it isn't expected, but it looks quite convoluted. The backup algorithm (an internal fork of readability-lxml, but identical here) triggers the error:
If you want to look at the code, here are the sections concerned:
trafilatura/trafilatura/external.py Line 34 in 146506a
https://github.com/adbar/trafilatura/blob/master/trafilatura/readability_lxml.py
You could maybe look into what happens to
Digging deeper into the analysis of this error, this part of the HTML looks suspicious to me, in particular the
However, this is visible to me only when I save the page locally. When it gets parsed in the browser (Firefox in my case), this part looks like this when I look at the 'Web Developer console':
I am not very familiar with how this JavaScript/HTML parsing works, but I guess that Trafilatura (or the underlying XML parser) tries to parse the plain HTML code and fails when hitting the
Does that make any sense at all?
I could be wrong, but I don't see any line in the code that could be affected by that. The vertical bars are between quotation marks, so they are part of the image source just like any other symbol.
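To illustrate the point about quoted attribute values, here is a minimal sketch using Python's standard-library html.parser. Note the assumptions: this is not the lxml-based parser Trafilatura relies on, and the markup and URL in SAMPLE are made up for illustration, not taken from the page in question. It only shows that a conforming HTML parser treats vertical bars inside a quoted src as ordinary characters.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag encountered."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.sources.append(value)


# Hypothetical markup: the URL below is invented for this example.
SAMPLE = (
    '<p>Before</p>'
    '<img src="https://example.org/a.jpg?fit=crop|center|top">'
    '<p>After</p>'
)

collector = ImgSrcCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.sources)
```

If the bars were a problem for a given parser, the collected source would be cut short or missing. Running the same kind of check with the parser actually used would be the next step.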
Trafilatura version: 1.2.0
I have noticed that adding the include_images=True argument to trafilatura.extract() changes the output text.
To reproduce it:
Note that the value for text is different. When images are included, the text stops shortly after the first (in this case: only) image.
This seems possibly related to #51, but there is no exception raised here.
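For anyone who wants to pin down where the two outputs diverge, the comparison could be scripted along these lines. This is only a sketch: first_divergence and compare_runs are hypothetical helper names, and the extractor is passed in as a parameter so the helpers themselves do not depend on Trafilatura.

```python
from os.path import commonprefix


def first_divergence(text_a, text_b):
    """Return the index where the two strings start to differ, or None if they are equal."""
    if text_a == text_b:
        return None
    # commonprefix compares character by character when given strings
    return len(commonprefix([text_a, text_b]))


def compare_runs(extract, html):
    """Run a caller-supplied extractor with and without images.

    Returns (offset of first difference, length without images, length with images).
    The extractor is assumed to accept an include_images keyword argument.
    """
    plain = extract(html, include_images=False) or ""
    with_images = extract(html, include_images=True) or ""
    return first_divergence(plain, with_images), len(plain), len(with_images)
```

With Trafilatura installed, this could be called as compare_runs(trafilatura.extract, downloaded), where downloaded is the HTML of the affected page. The output with images may contain additional image markers, so the offset alone does not prove truncation. A noticeably shorter second length is the more telling sign.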