XML Parsing breaks on valid HTML #442

Jufik · 2023-11-08T08:22:42Z

Versions:

Trafilatura: 1.6.2
Python: 3.10.13

When running trafilatura --output-format xml --URL https://fastapi.tiangolo.com/, this error shows up:

~/.pyenv/versions/3.10.13/lib/python3.10/site-packages/trafilatura/xml.py:239: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
  if not parent:
ERROR: Char 0x0 out of allowed range, line 1, column 2 (<string>, line 1)
Traceback (most recent call last):
  File "~/.pyenv/versions/3.10.13/lib/python3.10/site-packages/trafilatura/cli_utils.py", line 397, in examine
    result = extract(htmlstring, url=url, no_fallback=args.fast,
  File "~/.pyenv/versions/3.10.13/lib/python3.10/site-packages/trafilatura/core.py", line 1107, in extract
    return determine_returnstring(document, output_format, include_formatting, tei_validation)
  File "~/.pyenv/versions/3.10.13/lib/python3.10/site-packages/trafilatura/core.py", line 815, in determine_returnstring
    returnstring = control_xml_output(output, output_format, tei_validation, document)
  File "~/.pyenv/versions/3.10.13/lib/python3.10/site-packages/trafilatura/xml.py", line 122, in control_xml_output
    output_tree = fromstring(control_string, CONTROL_PARSER)
  File "src/lxml/etree.pyx", line 3257, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1796, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Char 0x0 out of allowed range, line 1, column 2
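
The FutureWarning at the top of that output is separate from the crash. It is emitted because `if not parent:` truth-tests an element, and that test is true both when `parent` is None and when it is an element without children. A minimal sketch of the two explicit tests the warning asks for, using the standard-library ElementTree as a stand-in for lxml; `describe_parent` is a hypothetical helper, and which of the two meanings xml.py line 239 intends is not stated in this thread:

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


def describe_parent(parent):
    """Classify a possibly-missing parent element with explicit tests."""
    if parent is None:        # no element at all
        return "missing"
    if len(parent) == 0:      # element exists but has no child elements
        return "childless"
    return "has children"


root = ET.fromstring("<doc><p>text</p><p/></doc>")
print(describe_parent(None))     # missing
print(describe_parent(root))     # has children
print(describe_parent(root[1]))  # childless
```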

Outputting txt and json works.

At this point in the code, the data has already gone through sanitization, which should remove 0x00.
On top of that:

>>> import requests
>>> r = requests.get("https://fastapi.tiangolo.com/")
>>> b'\x00' in r.content
False
>>> set(map(chr, r.content))
{'u', ']', 'v', '|', 'ê', '¬', '%', '-', '\x9e', '¡', '.', 'N', 'è', '\x81', '¯', 'ï', '{', '·', '¿', '°', 't', 'Ã', '@', 'z', '\x96', 'd', '7', '\x80', "'", '\x9a', 'X', ')', 'r', 'y', 'g', 'S', 'Q', ':', '\x97', '\x8e', '\x83', 'e', 'n', 'f', 'b', '\x8b', '9', '\x98', '_', 'º', ';', '\x8c', 'j', '8', 'C', 'L', '+', '\x9f', 'A', 'o', 'á', '\x87', '<', 'I', 'O', '(', '4', '¥', '/', '&', 'E', 'q', '\n', '5', 'c', 'W', '\x89', '"', 'R', 'a', 'x', '¼', 'Ñ', '?', '>', '=', '´', '¨', 'ð', 'M', '¸', 'ì', '0', 'F', '»', 'l', 'K', 'Ð', 'm', 'w', '¹', 'p', '§', 'P', '±', 'ª', '!', '\x8f', ',', '}', 'i', 'T', 'æ', 'í', 'D', '6', 'h', '$', '\x8d', '1', ' ', 'â', '\x95', '2', '[', '\x99', 's', 'U', '\xad', 'Y', 'k', '\x9c', 'J', 'µ', 'V', '#', 'G', 'B', 'Z', '3', '*', 'H'}
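
Note that the session above inspects the raw response bytes (iterating over `r.content` yields one integer per byte), while the parser error concerns the string handed to `fromstring` in `control_xml_output`. A minimal sketch for checking a decoded string against the XML 1.0 character range instead; `invalid_xml_chars` is a hypothetical helper and not part of trafilatura or lxml:

```python
def invalid_xml_chars(text):
    """Return (offset, code point) pairs for characters XML 1.0 forbids."""
    def allowed(cp):
        # XML 1.0 "Char" production: tab, LF, CR, plus three ranges.
        return (
            cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF
        )
    return [(i, hex(ord(ch))) for i, ch in enumerate(text) if not allowed(ord(ch))]


sample = "<doc>ok\x00 and \x0b</doc>"
print(invalid_xml_chars(sample))  # [(7, '0x0'), (13, '0xb')]
```

Running such a check on the string just before it is parsed would show whether the NUL is really in the data or only appears inside the parser.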

I don't know the implementation well enough to point out where this character comes from.
Any pointers for contributing a fix are welcome.
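
One untested guess about where the NUL could come from: the error points at line 1, column 2, which is where a NUL byte sits when a string starting with an ASCII character is laid out in a 2-byte or 4-byte encoding and then read one byte at a time. Nothing in this thread confirms that this is what happens inside lxml on this machine; the snippet below only shows the byte layout.

```python
text = "<doc/>"
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    raw = text.encode(encoding)
    nul_at = raw.find(b"\x00")  # -1 when there is no NUL byte
    print(encoding, raw[:4], nul_at)
# utf-8 b'<doc' -1
# utf-16-le b'<\x00d\x00' 1
# utf-32-le b'<\x00\x00\x00' 1
```

Index 1 is the second byte, which matches "column 2" if columns are counted from 1.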

adbar · 2023-11-08T12:54:22Z

Hi @Jufik, I cannot reproduce the bug. Which platform are you using?

Jufik · 2023-11-08T15:18:16Z

On an Apple M2 Max.
I've dug around: bypassing the sanitize call in xml.control_xml_output seems to fix the issue locally.

adbar · 2023-11-08T15:54:11Z

There are sometimes problems with LXML on M1/M2 platforms. Installing trafilatura (and thus lxml) with brew could help.

We could also sanitize the output as you say.
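
A minimal sketch of what sanitizing the output could look like, assuming the goal is to drop every character outside the XML 1.0 range before parsing. `strip_invalid_xml_chars` is a hypothetical helper, not trafilatura's actual sanitizer, and the standard-library parser is used here in place of lxml:

```python
import re
import xml.etree.ElementTree as ET  # stdlib parser, used here instead of lxml

# Matches anything outside the XML 1.0 "Char" production.
_INVALID_XML = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Remove characters that an XML 1.0 parser is required to reject."""
    return _INVALID_XML.sub("", text)


dirty = "<doc>ok\x00</doc>"
try:
    ET.fromstring(dirty)
except ET.ParseError:
    print("raw string rejected")

clean = strip_invalid_xml_chars(dirty)
print(ET.fromstring(clean).text)  # ok
```

This only helps if the forbidden character is actually present in the string; it would not change anything if the parser misreads an otherwise clean string.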

adbar · 2023-11-10T13:06:49Z

This ongoing PR adopts a different approach to document sanitizing. It should also solve this problem, although I can't replicate it.

adbar · 2024-01-26T12:05:26Z

@Jufik Is the problem solved?

Jufik · 2024-02-04T10:35:20Z

@adbar just made a test, works like a charm with 1.7.0!

adbar added the "question" label (Further information is requested) Nov 8, 2023

adbar linked a pull request Nov 10, 2023 that will close this issue

preserve space in certain elements #429

Merged

adbar closed this as completed in #429 Nov 20, 2023

adbar reopened this Nov 20, 2023

adbar removed a link to a pull request Nov 20, 2023

preserve space in certain elements #429

Merged

adbar added the "feedback" label (Feedback from users requested) and removed the "question" label (Further information is requested) Jan 26, 2024

adbar closed this as completed Feb 5, 2024
