include_images changes text extraction #194

Open
carschno opened this issue Apr 12, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@carschno

Trafilatura version: 1.2.0

I have noticed that adding the include_images=True argument to trafilatura.extract() changes the output text.

To reproduce it:

import trafilatura
from trafilatura import fetch_url

url = "https://www.tropenmuseum.nl/nl/zien-en-doen/bisjpalen-restauratie"

In [43]: trafilatura.bare_extraction(fetch_url(url), include_images=False)
Out[43]: 
{'title': 'Bisjpalen restauratie',
 'author': None,
 'url': 'https://www.tropenmuseum.nl/nl/zien-en-doen/bisjpalen-restauratie',
 'hostname': 'tropenmuseum.nl',
 'description': 'Een bijzondere restauratie van twaalf gigantische bisjpalen in de monumentale Lichthal.',
 'sitename': 'Tropenmuseum in Amsterdam',
 'date': '2022-01-01',
 'categories': [],
 'tags': [],
 'fingerprint': None,
 'id': None,
 'license': None,
 'body': None,
 'comments': '',
 'commentsbody': None,
 'raw_text': None,
 'text': 'Deze rituele palen uit de Indonesische provincie Papoea maken onderdeel uit van de wereldberoemde Nieuw-Guinea collectie van het museum. De bisjpalen collectie is bijzonder, omdat deze in het land van herkomst niet bewaard worden: de palen worden normaliter na afloop van de ceremonie in het moeras achtergelaten om weg te rotten\nBijzondere restauratie\nTropenmuseum restaureerde twaalf gigantische bisjpalen in de monumentale Lichthal.\nHerkomst\nBisjfeest\nBisjpalen zijn boomstammen waarin met het houtsnijwerk overleden dorpsgenoten zijn uitgebeeld. De palen worden gebruikt om de doden te eren tijdens een ‘bisjfeest’. Halverwege de vorige eeuw ontstond de angst dat het ritueel zou uitsterven, waardoor Tropenmuseum en Wereldmuseum besloten om grootscheeps bisjpalen aan te kopen. In Nederland bevindt zich nu de grootste collectie bisjpalen ter wereld\nRestauratie\nDe rituele palen zijn in de loop der jaren door stof, vocht en insecten aangetast. De restauratoren reinigen heel voorzichtig het kwetsbare oppervlak van de palen en zetten daarna de verf opnieuw vast. Tijdens de restauratie zijn de bisjpalen van dichtbij te bekijken. Bezoekers kunnen via een beeldscherm meekijken met het beeld van de microscoop; een uitgelezen kans om een museumobject met andere ogen te ervaren!'}

In [44]: trafilatura.bare_extraction(fetch_url(url), include_images=True)
Out[44]: 
{'title': 'Bisjpalen restauratie',
 'author': None,
 'url': 'https://www.tropenmuseum.nl/nl/zien-en-doen/bisjpalen-restauratie',
 'hostname': 'tropenmuseum.nl',
 'description': 'Een bijzondere restauratie van twaalf gigantische bisjpalen in de monumentale Lichthal.',
 'sitename': 'Tropenmuseum in Amsterdam',
 'date': '2022-01-01',
 'categories': [],
 'tags': [],
 'fingerprint': None,
 'id': None,
 'license': None,
 'body': None,
 'comments': '',
 'commentsbody': None,
 'raw_text': None,
 'text': 'Deze rituele palen uit de Indonesische provincie Papoea maken onderdeel uit van de wereldberoemde Nieuw-Guinea collectie van het museum. De bisjpalen collectie is bijzonder, omdat deze in het land van herkomst niet bewaard worden: de palen worden normaliter na afloop van de ceremonie in het moeras achtergelaten om weg te rotten\n/sites/default/files/styles/hero/public/bisjpalen.jpg?h=e442ce2f&itok=Jaeowj5G Tropenmuseum. Bisjpalen restauratie.'}

Note that the value for text is different. When images are included, the text stops shortly after the first (in this case: only) image.
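To pinpoint where the two outputs start to differ, a small comparison helper can be used. This is a minimal sketch: `first_divergence` and `compare_extractions` are hypothetical helper names, not part of Trafilatura, and the sketch assumes that `trafilatura.extract()` accepts `include_images` as described above.

```python
from typing import Optional, Tuple


def first_divergence(text_a: str, text_b: str) -> Optional[Tuple[int, str, str]]:
    """Return (line_index, line_a, line_b) for the first differing line, or None."""
    lines_a = text_a.split("\n")
    lines_b = text_b.split("\n")
    for i in range(max(len(lines_a), len(lines_b))):
        # A missing line on either side is treated as an empty string.
        line_a = lines_a[i] if i < len(lines_a) else ""
        line_b = lines_b[i] if i < len(lines_b) else ""
        if line_a != line_b:
            return i, line_a, line_b
    return None


def compare_extractions(url: str) -> Optional[Tuple[int, str, str]]:
    """Extract the same download with and without images and compare the text."""
    # Imported lazily so that first_divergence() works without trafilatura installed.
    import trafilatura

    downloaded = trafilatura.fetch_url(url)
    # extract() may return None when nothing is found, hence the fallback to "".
    without_images = trafilatura.extract(downloaded, include_images=False) or ""
    with_images = trafilatura.extract(downloaded, include_images=True) or ""
    return first_divergence(without_images, with_images)
```

Calling `compare_extractions(url)` with the URL above should return the index of the first line at which the two extractions differ, together with both versions of that line.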

This is possibly related to #51, but no exception is raised here.

@adbar added the bug (Something isn't working) label Apr 12, 2022
@adbar
Owner

adbar commented Apr 12, 2022

Hi @carschno, I can reproduce the bug. Extraction with images isn't my priority but I'll try to look into it.

@carschno
Author

@adbar Thanks!
In case you have a pointer to the potentially relevant piece of the code, I might be able to investigate myself and create a PR (depending on how deeply the issue is rooted).

I understand that this behaviour is definitely not expected, right?

@adbar
Owner

adbar commented Apr 12, 2022

No, it isn't expected, but it looks quite convoluted. The backup algorithm (an internal fork of readability-lxml, but identical here) triggers the error:

  • Without images, the backup algorithm is used and everything is fine (that's the case I'm evaluating).
  • With images, the heuristics of the backup algorithm don't work the same way, and the HTML sections around the images are (I guess) discarded. That's logical, since images are often associated with undesirable content; I assume it's an unfortunate borderline case here.

If you want to look at the code, here are the sections concerned:

def try_readability(htmlinput):

https://github.com/adbar/trafilatura/blob/master/trafilatura/readability_lxml.py

You could maybe look into what happens to img elements in the latter.

@carschno
Author

Digging deeper into this error, this part of the HTML looks suspicious to me, in particular the | symbols in the srcset attribute of the last three sources:

      <div class="field field--name-field-hero-image field--type-image field--label-hidden field__items">
              <div class="field__item">    <picture>
                  <source srcset="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/bisjpalen.webp 1x" media="screen and (max-width: 767px)" type="image/webp">
              <source srcset="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/bisjpalen_002.webp 1x" media="screen and (min-width: 768px) and (max-width: 992px)" type="image/webp">
              <source srcset="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/bisjpalen_002.webp 1x" media="screen and (min-width: 993px)" type="image/webp">
              <source srcset="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/bisjpalen.jpg 1x" media="screen and (max-width: 767px)" type="image/jpeg">
              <source srcset="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/bisjpalen_002.jpg 1x" media="screen and (min-width: 768px) and (max-width: 992px)" type="image/jpeg">
              <source srcset="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/bisjpalen_002.jpg 1x" media="screen and (min-width: 993px)" type="image/jpeg">
                  <img src="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/bisjpalen_002.jpg" alt="Tropenmuseum. Bisjpalen restauratie. " typeof="foaf:Image">

However, this is visible to me only when I save the page locally. When it gets parsed in the browser (Firefox in my case), this part looks like this when I look at the 'Web Developer console':


                  <source srcset="/sites/default/files/styles/hero_mobile/public/bisjpalen.webp?h=c7551848&amp;itok=ajmSsyac 1x" media="screen and (max-width: 767px)" type="image/webp">
              <source srcset="/sites/default/files/styles/hero/public/bisjpalen.webp?h=e442ce2f&amp;itok=Jaeowj5G 1x" media="screen and (min-width: 768px) and (max-width: 992px)" type="image/webp">
              <source srcset="/sites/default/files/styles/hero/public/bisjpalen.webp?h=e442ce2f&amp;itok=Jaeowj5G 1x" media="screen and (min-width: 993px)" type="image/webp">
              <source srcset="/sites/default/files/styles/hero_mobile/public/bisjpalen.jpg?h=c7551848&amp;itok=ajmSsyac 1x" media="screen and (max-width: 767px)" type="image/jpeg">
              <source srcset="/sites/default/files/styles/hero/public/bisjpalen.jpg?h=e442ce2f&amp;itok=Jaeowj5G 1x" media="screen and (min-width: 768px) and (max-width: 992px)" type="image/jpeg">
              <source srcset="/sites/default/files/styles/hero/public/bisjpalen.jpg?h=e442ce2f&amp;itok=Jaeowj5G 1x" media="screen and (min-width: 993px)" type="image/jpeg">
                  <img src="/sites/default/files/styles/hero/public/bisjpalen.jpg?h=e442ce2f&amp;itok=Jaeowj5G" alt="Tropenmuseum. Bisjpalen restauratie. " typeof="foaf:Image">

I am not very familiar with how this JavaScript/HTML parsing works, but I guess that Trafilatura (or the underlying XML parser) tries to parse the plain HTML code and fails when hitting the | symbols, or something similar.

Does that make any sense at all?
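One way to see what the locally saved paths contain is to percent-decode them. The sketch below decodes the `src` value copied from the saved page above; the `_files` suffix suggests a folder created by the browser when saving the page, but that reading is an assumption, not something confirmed in this thread.

```python
from urllib.parse import unquote

# Value copied verbatim from the <img> tag of the locally saved page.
saved_src = (
    "Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/"
    "bisjpalen_002.jpg"
)

# %20 decodes to a space; the vertical bar is not percent-encoded and is kept as-is.
decoded = unquote(saved_src)
folder, filename = decoded.split("/", 1)

print(folder)    # Bisjpalen restauratie | Tropenmuseum in Amsterdam_files
print(filename)  # bisjpalen_002.jpg
```

If the assumption holds, the | symbols only exist in the saved copy and are absent from the HTML that `fetch_url()` downloads, which matches the paths shown in the developer console.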

@adbar
Owner

adbar commented Apr 22, 2022

I could be wrong but I don't see any line in the code which could be affected by that. The vertical bars are between quotation marks so they are part of the image source just like any other symbol.
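The point about quoted attribute values can be checked with a small parsing sketch. It uses Python's standard-library `html.parser` as a stand-in for the lxml parser that Trafilatura relies on, so it illustrates the general HTML behaviour rather than proving what lxml does; `SrcsetCollector` is a hypothetical helper name, and the snippet is a shortened version of the saved HTML quoted above.

```python
from html.parser import HTMLParser


class SrcsetCollector(HTMLParser):
    """Collect srcset/src attribute values from <source> and <img> tags."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag in ("source", "img"):
            for name, value in attrs:
                if name in ("srcset", "src"):
                    self.values.append(value)


# Shortened excerpt of the saved page, with the | symbols left in place.
snippet = (
    "<picture>"
    '<source srcset="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/bisjpalen.webp 1x" type="image/webp">'
    '<img src="Bisjpalen%20restauratie%20|%20Tropenmuseum%20in%20Amsterdam_files/bisjpalen_002.jpg" alt="Tropenmuseum. Bisjpalen restauratie. ">'
    "</picture><p>Text after the image.</p>"
)

collector = SrcsetCollector()
collector.feed(snippet)
collector.close()

# Both attribute values come through intact, vertical bars included.
for value in collector.values:
    print(value)
```

Both values are returned unchanged, which is consistent with the bars being ordinary characters inside the attribute value.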
