Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in extracted images sources returning a base64 #92

Open
cayolblake opened this issue Feb 18, 2021 · 13 comments
Open

Bug in extracted images sources returning a base64 #92

cayolblake opened this issue Feb 18, 2021 · 13 comments
Assignees
Labels
bug Something isn't working

Comments

@cayolblake
Copy link

cayolblake commented Feb 18, 2021

Solution

Added by @martintoreilly

  1. (minimal fix): update our local copy of Readability.js to the latest file from the Mozilla package
  2. (better): stop duplicating Readability.js as a local copy and instead reference the Mozilla package directly (but double check why we opted for a local copy first)

Original issue

Raised by @cayolblake
Hello,

Any idea why original tag on site looks like the following.

<img alt="Ursula von der Leyen, who trained as a physician before going into politics, pictured with her husband and seven children in 2005. " data-src-mini="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-small-169.jpg" data-src-xsmall="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-medium-plus-169.jpg" data-src-small="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-large-169.jpg" data-src-medium="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-exlarge-169.jpg" data-src-large="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-super-169.jpg" data-src-full16x9="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-full-169.jpg" data-src-mini1x1="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-small-11.jpg" data-demand-load="loaded" data-eq-pts="mini: 0, xsmall: 221, small: 308, medium: 461, large: 781" src="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-exlarge-169.jpg" data-eq-state="mini xsmall small medium" data-src="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-exlarge-169.jpg" class="media__image media__image--responsive">

And the extracted one looks like the following with the "src" attribute messed up?

<img data-src-mini="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-small-169.jpg" data-src-xsmall="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-medium-plus-169.jpg" data-src-small="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-large-169.jpg" data-src-medium="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-exlarge-169.jpg" data-src-large="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-super-169.jpg" data-src-full16x9="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-full-169.jpg" data-src-mini1x1="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-small-11.jpg" data-demand-load="not-loaded" data-eq-pts="mini: 0, xsmall: 221, small: 308, medium: 461, large: 781" src="image/gif;base64,R0lGODlhEAAJAJEAAAAAAP///////wAAACH5BAEAAAIALAAAAAAQAAkAAAIKlI+py+0Po5yUFQA7">

Basically, src="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-exlarge-169.jpg" ends up being src="image/gif;base64,R0lGODlhEAAJAJEAAAAAAP///////wAAACH5BAEAAAIALAAAAAAQAAkAAAIKlI+py+0Po5yUFQA7"

What to do to avoid such behavior?

@martintoreilly
Copy link
Member

Hi @cayolblake. Welcome 👋 Are you using ReadabiliPy from the command line or as a Python package? If the Python package, what function are you calling? If the command line, what parameters are you passing? When called from the command line ReadbiliPy defaults to Mozilla's Readability.js package for parsing the source HTML to a simple representation, so this will help us figure out if the issue is in our code or the Mozilla code. Thanks.

@cayolblake
Copy link
Author

Hi @martintoreilly

I'm using the Python library. Code is as simple as follows.

from readabilipy import simple_json_from_html_string

simple_json_from_html_string(html, use_readability=True)

@martintoreilly
Copy link
Member

martintoreilly commented Feb 19, 2021

Hi @cayolblake. The use_readability=True parameter actually causes the python package to call an embedded local Readability.js script, which is actually Mozilla's package as it was in Nov 2018 🤔 .

We moved away from using Readability.js to writing our own python HTML simplification code because of some things that were parsed unexpectedly by Readability.js (at least back then), but we also therefore never updated the embedded copy. I can't remember why we embedded our own copy, but I suspect we'd probably be better off removing it and installing the latest copy from Mozilla via node.

I've been trying to replicate your issue using this webpage: https://edition.cnn.com/2021/02/09/europe/ursula-von-der-leyen-profile-intl/index.html. Is that the one you are having the issue with?

Unfortunately when I call simple_json_from_html_string(html, use_readability=True), I get a node error saying the package jsdom is missing (even though it should be installed when I install our ReadabiliPy python package). Explicitly installing jsdom via node didn't resolve this error either. Did you run into this issue when installing our package? I'll look into this a bit more over the weekend, but I won't be able to test anything that calls the Javascript code until I fix this missing package error I'm getting. When I do my plan will be to do the following:

  1. Replicate the issue you are seeing with our current embedded version of Readability.js
  2. Replace our >2 year old embedded Readability.js version with the latest version from Mozilla and check whether the issue you are experiencing is fixed. [Edit: I did try a quick check by viewing the CNN page in Firefox, as its Reader mode uses Readability.js. However, reader mode was not offered for this page, which I think means Readability.js judged it likely to fail to process well]
  3. Update our ReadabiliPy package to just install the latest Readability.js from Mozilla rather than keep a local copy.

I could also look at extending our pure python simplifier to retain image tags (and potentially embedded video). We currently strip these out as our use case was very focussed on extracting the text while retaining the paragraph structure in simple HTML to allow easy annotation. What's your use case?

@martintoreilly martintoreilly self-assigned this Feb 19, 2021
@martintoreilly martintoreilly added the bug Something isn't working label Feb 19, 2021
@cayolblake
Copy link
Author

Hi @martintoreilly

Appreciate your extensive involvement. I don't have the jsdom problem at all, I retried on a vanilla machine and didn't experience it at all, I can send you my pip list as well as my node version and node modules list and version if that would help.

Yes, the link you stated is the one I am testing on and experiencing such a problem. I just tried the use_readability=False but the output was nightmarish, it included almost everything inside the page.

I just tried on Firefox Nightly and Firefox Developer Edition and both got the Reader View enabled and showing actual content fine.

I believe if the Readability.js is outdated that much it could be a logical reason behind this and having it updated would be great, also, if possible, if you can provide your update script or the procedures to update it till you take a look I would highly appreciate it.

@cayolblake
Copy link
Author

So, I just tried replacing the Readability.js file inside the project and retried and it worked like a charm!

This fixed as well some other mistakenly extracted content.

The only problem now is for img tags with various size options, it chooses the smallest version which I compared as a behavior with FireFox Reader View and made sure this behavior doesn't happen however I can do some tag post processor to fix such issues I guess.

Will wait your input on that.

@martintoreilly
Copy link
Member

martintoreilly commented Feb 19, 2021

So, I just tried replacing the Readability.js file inside the project and retried and it worked like a charm!

This fixed as well some other mistakenly extracted content.

That's brilliant! Thanks very much for trying this out. We're very happy to accept a pull request from you if you'd like to make one. Otherwise, I'll make this change once I solve my node/javascript issues and tag you on the change so you get credit.

@cayolblake
Copy link
Author

Sure, I can make a pull request once I check if the image size could be fixed as well if possible.

@martintoreilly
Copy link
Member

The only problem now is for img tags with various size options, it chooses the smallest version which I compared as a behavior with FireFox Reader View and made sure this behavior doesn't happen however I can do some tag post processor to fix such issues I guess.

Will wait your input on that.

In terms of the behaviour of Readability.js, our plans only extend to making it easily usable from Python and to extract plain text in a more structured manner. I tried to dig into the Mozilla javascript when we were looking to make some changes to how lists were processed but wasn't really able to get my head round it enough to propose changes to their package.

Our original use case was focussed on extracting the text content of pages while retaining enough of the structure around paragraphs/lists etc to render sensibly for human annotation of the extracted text. We ended up writing our own HTML simplification functions in python as it was much easier for us to iterate on as we found issues with the Readability.js output for sites we were scraping for our original project. We're not actively using this library on any of our projects at the moment, so we've limited time to either extend our HTML simplifier or look at contributing changes upstream Readability.js. However, the Readability.js Github issue tracker seems to be quite active, so it might be worth raising your issues about image sizes and iframe video there. If they incorporate these into Readability.js we can update to the latest version to incorporate them here (or hopefully automatically get these with a re-install if we move to referencing the Mozilla package directly rather than maintaining a local copy).

@martintoreilly
Copy link
Member

martintoreilly commented Feb 19, 2021

Solution

  1. (minimal fix): update our local copy of Readability.js to the latest file from the Mozilla package
  2. (better): stop duplicating Readability.js as a local copy and instead reference the Mozilla package directly (but double check why we opted for a local copy first). This will install the latest published version of Readability.js from node whenever our Python package is installed.
  3. Embed the Mozilla Readability.js repo as a submodule in this repo. This would let us track unreleased changes in Readability.js, but require us to trigger updates to the submodule tracking manually.

@cayolblake
Copy link
Author

I guess the best course of action now is to link to Readability.js repo directly as a submodule which should keep depending on it as an upstream properly and about further maintenance work.

@martintoreilly
Copy link
Member

I guess the best course of action now is to link to Readability.js repo directly as a submodule which should keep depending on it as an upstream properly and about further maintenance work.

Option 2 (referencing the Mozilla node package for Readability.js from our ExtractArticle.js script and listing it as a dependency in our package.json) would install the latest version of Readability.js published to node whenever the Python package is installed. Embedding the Mozilla Readability.js repo as a submodule would require us to manually update the version we are tracking but would give us the option of choosing to track ahead of the latest published version if we wanted. I've added this as option 3 above.

1 similar comment
@martintoreilly
Copy link
Member

I guess the best course of action now is to link to Readability.js repo directly as a submodule which should keep depending on it as an upstream properly and about further maintenance work.

Option 2 (referencing the Mozilla node package for Readability.js from our ExtractArticle.js script and listing it as a dependency in our package.json) would install the latest version of Readability.js published to node whenever the Python package is installed. Embedding the Mozilla Readability.js repo as a submodule would require us to manually update the version we are tracking but would give us the option of choosing to track ahead of the latest published version if we wanted. I've added this as option 3 above.

@cayolblake
Copy link
Author

Totally agree with your suggestion to be best course of action.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants