XML Parsing breaks on valid HTML #442
Comments
Hi @Jufik, I cannot reproduce the bug. Which platform are you using?
On an Apple M2 Max.
There are sometimes problems with LXML on M1/M2 platforms. Installing trafilatura (and thus lxml) with brew could help. We could also sanitize the output as you say.
This ongoing PR adopts a different approach to doc sanitizing; it should also solve this problem, although I can't replicate it.
@Jufik Is the problem solved?
@adbar just made a test, works like a charm with 1.7.0!
URL: https://fastapi.tiangolo.com
Versions:
1.6.2
3.10.13
When running `trafilatura --output-format xml --URL https://fastapi.tiangolo.com/`, this error shows up:

Outputting `txt` and `json` works.

At this point in the code, the data has already gone through sanitization, which should remove `0x00`.

On top of that:
I don't know the implementation well enough to point out where this character comes from.
Any pointers on how to contribute are welcome.
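
For readers following the sanitization discussion, here is a minimal sketch of stripping characters such as `0x00` before text is placed into an XML tree. It is not trafilatura's actual sanitization code: the names `strip_invalid_xml_chars` and `to_xml` are hypothetical, and the standard library's `xml.etree.ElementTree` is used as a stand-in for lxml so the example stays self-contained. The pattern covers only the C0 control characters (other than tab, newline, and carriage return) plus U+FFFE and U+FFFF; it does not handle lone surrogates.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helper pattern: C0 control characters other than tab (0x09),
# newline (0x0A) and carriage return (0x0D), plus U+FFFE and U+FFFF.
# None of these may appear in an XML 1.0 document.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that cannot appear in XML 1.0 content."""
    return _INVALID_XML_CHARS.sub("", text)


def to_xml(text: str) -> str:
    """Wrap sanitized text in a single <doc> element and serialize it."""
    root = ET.Element("doc")
    root.text = strip_invalid_xml_chars(text)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # A NUL character embedded in otherwise ordinary text.
    print(to_xml("Hello\x00 world"))
```

Applying a filter like this immediately before the tree is built, rather than only on the raw download, would also catch characters introduced by intermediate processing steps. Whether that is where the stray character originates in this issue is not established here.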