-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tidy HTML breaks result #88
Comments
Yes, I believe a tiny fix should remove the bug. #89 |
Possibly a duplicate of #31 (and see my comment there, maybe the code snippet helps?) Even untidy HTML has… issues with whitespace. |
jsm28
added a commit
to jsm28/python-markdownify
that referenced
this issue
Oct 3, 2024
This fixes various issues relating to how input whitespace is handled and how wrapping handles whitespace resulting from hard line breaks. This PR uses a branch based on that for matthewwithanm#120 to avoid conflicts with the fixes and associated test changes there. My suggestion is thus first to merge matthewwithanm#120 (which fixes two open issues), then to merge the remaining changes from this PR. Wrapping paragraphs has the effect of losing all newlines including those from `<br>` tags, contrary to HTML semantics (wrapping should be a matter of pretty-printing the output; input whitespace from the HTML input should be normalized, but `<br>` should remain as a hard line break). To fix this, we need to wrap the portions of a paragraph between hard line breaks separately. For this to work, ensure that when wrapping, all input whitespace is normalized at an early stage, including turning newlines into spaces. (Only ASCII whitespace is handled this way; `\s` is not used as it's not clear Unicode whitespace should get such normalization.) When not wrapping, there is still too much input whitespace preservation. If the input contains a blank line, that ends up as a paragraph break in the output, or breaks the header formatting when appearing in a header tag, though in terms of HTML semantics such a blank line is no different from a space. In the case of an ATX header, even a single newline appearing in the output breaks the Markdown. Thus, when not wrapping, arrange for input whitespace containing at least one `\r` or `\n` to be normalized to a single newline, and in the ATX header case, normalize to a space. Fixes matthewwithanm#130 (probably, not sure exactly what the HTML input there is) Fixes matthewwithanm#88 (a related case, anyway; the actual input in matthewwithanm#88 has already been fixed)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
On version
0.11.6
when I run:I expect this:
but I get this:
which I believe is a bug.
The text was updated successfully, but these errors were encountered: