Preserve horizontal space in code blocks #553
Comments
Do you mean space before the code or space in general? Could you provide a concrete example of a code block?
Hello,
In this case, all whitespace before each line of code in the code block was removed, which is unexpected and, incidentally, unfriendly for LLM training. Here is another (possible) bug: when extracting an inline code block, a redundant '\n' is added after it.
接口,注册拦截器
Thank you.
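To illustrate why the lost leading whitespace matters, here is a minimal sketch. It does not call trafilatura; it only simulates an extractor that strips the start of every line, and the sample function `greet` and the helper `parses` are made up for this example.

```python
import ast

# A small, valid Python snippet whose meaning depends on indentation.
snippet = (
    "def greet(name):\n"
    "    if name:\n"
    "        return 'Hello ' + name\n"
    "    return 'Hello'\n"
)

# Simulate an extractor that removes leading whitespace from every line.
stripped = "\n".join(line.lstrip() for line in snippet.splitlines()) + "\n"


def parses(source: str) -> bool:
    """Return True if the source is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


print(parses(snippet))   # original block
print(parses(stripped))  # block after leading whitespace was removed
```

The original block parses, while the stripped version no longer does, so the extracted text stops being usable as code.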
Yes, spacing is not necessarily preserved in code blocks; this can be improved.
Hello,
Thanks for your continuous work on trafilatura.
Recently, while using trafilatura for code and text content extraction, we noticed that the sanitize function removes all whitespace and tabs, even inside code blocks, when using the txt output format.
We think the problem is here, where preserve_space=False is the default:
https://github.com/adbar/trafilatura/blob/2c9f20296c1c5ce9a23715a07df5b623f3016b65/trafilatura/xml.py#L315C5-L315C51
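The effect of such a flag can be sketched roughly as follows. This is not trafilatura's actual implementation: `sanitize_sketch` is a hypothetical stand-in written for this example, and only the parameter name `preserve_space` is taken from the report above.

```python
import re


def sanitize_sketch(text: str, preserve_space: bool = False) -> str:
    """Hypothetical line-wise sanitizer (an assumption, not library code)."""
    cleaned = []
    for line in text.splitlines():
        if preserve_space:
            # Keep indentation and inner spacing; drop only trailing whitespace.
            line = line.rstrip()
        else:
            # Collapse every whitespace run (spaces, tabs) and trim both ends.
            line = re.sub(r"\s+", " ", line).strip()
        cleaned.append(line)
    return "\n".join(cleaned)


code = "for i in range(3):\n\tprint(i)\n    total  +=  i"

print(sanitize_sketch(code))                       # indentation is lost
print(sanitize_sketch(code, preserve_space=True))  # indentation is kept
```

With the default, the tab and the four-space indent disappear; with `preserve_space=True` the block comes back unchanged, which is the behaviour requested here for code blocks.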