XML Parsing breaks on valid HTML #442

Closed
Jufik opened this issue Nov 8, 2023 · 6 comments
Labels
feedback Feedback from users requested

Comments

Jufik commented Nov 8, 2023

URL: https://fastapi.tiangolo.com

Versions:

  • Trafilatura: 1.6.2
  • Python: 3.10.13

When running trafilatura --output-format xml --URL https://fastapi.tiangolo.com/, this error shows up:

~/.pyenv/versions/3.10.13/lib/python3.10/site-packages/trafilatura/xml.py:239: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
  if not parent:
ERROR: Char 0x0 out of allowed range, line 1, column 2 (<string>, line 1)
Traceback (most recent call last):
  File "~/.pyenv/versions/3.10.13/lib/python3.10/site-packages/trafilatura/cli_utils.py", line 397, in examine
    result = extract(htmlstring, url=url, no_fallback=args.fast,
  File "~/.pyenv/versions/3.10.13/lib/python3.10/site-packages/trafilatura/core.py", line 1107, in extract
    return determine_returnstring(document, output_format, include_formatting, tei_validation)
  File "~/.pyenv/versions/3.10.13/lib/python3.10/site-packages/trafilatura/core.py", line 815, in determine_returnstring
    returnstring = control_xml_output(output, output_format, tei_validation, document)
  File "~/.pyenv/versions/3.10.13/lib/python3.10/site-packages/trafilatura/xml.py", line 122, in control_xml_output
    output_tree = fromstring(control_string, CONTROL_PARSER)
  File "src/lxml/etree.pyx", line 3257, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1796, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2

Outputting txt and json works.

At this point in the code, the data has already gone through sanitization, which should remove 0x00.
On top of that, the raw response contains no NUL byte:

>>> import requests
>>> r = requests.get("https://fastapi.tiangolo.com/")
>>> b'\x00' in r.content
False
>>> set(map(chr, r.content))
{'u', ']', 'v', '|', 'ê', '¬', '%', '-', '\x9e', '¡', '.', 'N', 'è', '\x81', '¯', 'ï', '{', '·', '¿', '°', 't', 'Ã', '@', 'z', '\x96', 'd', '7', '\x80', "'", '\x9a', 'X', ')', 'r', 'y', 'g', 'S', 'Q', ':', '\x97', '\x8e', '\x83', 'e', 'n', 'f', 'b', '\x8b', '9', '\x98', '_', 'º', ';', '\x8c', 'j', '8', 'C', 'L', '+', '\x9f', 'A', 'o', 'á', '\x87', '<', 'I', 'O', '(', '4', '¥', '/', '&', 'E', 'q', '\n', '5', 'c', 'W', '\x89', '"', 'R', 'a', 'x', '¼', 'Ñ', '?', '>', '=', '´', '¨', 'ð', 'M', '¸', 'ì', '0', 'F', '»', 'l', 'K', 'Ð', 'm', 'w', '¹', 'p', '§', 'P', '±', 'ª', '!', '\x8f', ',', '}', 'i', 'T', 'æ', 'í', 'D', '6', 'h', '$', '\x8d', '1', ' ', 'â', '\x95', '2', '[', '\x99', 's', 'U', '\xad', 'Y', 'k', '\x9c', 'J', 'µ', 'V', '#', 'G', 'B', 'Z', '3', '*', 'H'}

I don't know the implementation well enough to tell where this character comes from.
Any pointers that would help me contribute are welcome.
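
One caveat about the check above: iterating over `r.content` yields integers, so `set(map(chr, r.content))` shows each raw byte as a Latin-1 character rather than the decoded UTF-8 text (hence entries such as 'Ã'). A minimal sketch of inspecting decoded text instead follows. The helper name `first_control_chars` and the sample strings are hypothetical and are not part of trafilatura.

```python
def first_control_chars(text):
    """Map each C0 control character (except tab, LF, CR) to its first index in text."""
    allowed = {"\t", "\n", "\r"}
    found = {}
    for index, char in enumerate(text):
        if ord(char) < 0x20 and char not in allowed and char not in found:
            found[char] = index
    return found


# Illustrative sample: "café" encoded as UTF-8 (not taken from the real page).
raw = b"caf\xc3\xa9"

# Per-byte view, as in the check above: each byte becomes one Latin-1 character.
byte_view = set(map(chr, raw))

# Decoded view: the actual characters of the text.
text_view = set(raw.decode("utf-8"))

print(sorted(byte_view))
print(sorted(text_view))

# A NUL hidden in a string is reported together with its position.
print(first_control_chars("<\x00doc>"))
```

Running the helper on the string that is actually handed to the XML parser, rather than on the raw download, would show whether the NUL is introduced somewhere inside the pipeline.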

Owner

adbar commented Nov 8, 2023

Hi @Jufik, I cannot reproduce the bug. Which platform are you using?

adbar added the question label (Further information is requested) Nov 8, 2023
Author

Jufik commented Nov 8, 2023

On an Apple M2 Max.
I've dug around: bypassing the sanitize call in xml.control_xml_output seems to fix the issue locally.

Owner

adbar commented Nov 8, 2023

There are sometimes problems with LXML on M1/M2 platforms. Installing trafilatura (and thus lxml) with brew could help.

We could also sanitize the output as you say.
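
As a rough illustration of what sanitizing a string before XML parsing can look like, the sketch below removes every character outside the XML 1.0 `Char` production and then parses the result. This is not trafilatura's actual sanitization code: `strip_invalid_xml_chars` is a hypothetical helper, the sample document is invented, and the standard-library `xml.etree.ElementTree` parser stands in for lxml so the example has no third-party dependencies.

```python
import re
import xml.etree.ElementTree as ET

# Complement of the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with all characters that XML 1.0 forbids removed."""
    return _INVALID_XML.sub("", text)


# Invented sample containing a NUL character inside an element.
dirty = "<doc><p>Fast\x00API</p></doc>"
clean = strip_invalid_xml_chars(dirty)

# The cleaned string is well-formed and can be parsed.
root = ET.fromstring(clean)
print(root.find("p").text)
```

Filtering by the full allowed range, rather than removing only `\x00`, also covers other control characters that an XML parser would reject.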

adbar linked a pull request that will close this issue Nov 10, 2023
Owner

adbar commented Nov 10, 2023

This ongoing PR adopts a different approach to document sanitizing. It should also solve this problem, although I can't replicate it.

adbar reopened this Nov 20, 2023
adbar removed a link to a pull request Nov 20, 2023
Owner

adbar commented Jan 26, 2024

@Jufik Is the problem solved?

adbar added the feedback label (Feedback from users requested) and removed the question label (Further information is requested) Jan 26, 2024
Author

Jufik commented Feb 4, 2024

@adbar just made a test, works like a charm with 1.7.0!

adbar closed this as completed Feb 5, 2024
2 participants