Indent before HTML block elements causes indent in Markdown output #98

Open
chrispy-snps opened this issue Nov 26, 2023 · 2 comments · May be fixed by #120

chrispy-snps commented Nov 26, 2023

In our HTML, block elements are indented:

<html>
  <body>
    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit,
      sed do eiusmod tempor incididunt ut labore et dolore magna
      aliqua. Ut enim ad minim veniam, quis nostrud exercitation
      ullamco laboris nisi ut aliquip ex ea commodo consequat.
    </p>
  </body>
</html>

When HTML with indented block elements is converted, the indentation leaks into the Markdown output as stray spaces at the start of lines.

Converting this indented <p> element:

from markdownify import markdownify as md

print(repr(md("""\
  <p>This is
     some text.</p>
""")))

produces this:

' This is\n some text.\n\n\n'
 ^       ^^^

It happens for non-<p> elements too. Converting these indented <h1> elements with the UNDERLINED and ATX heading formats:

print(repr(md("""\
    <h1>Title</h1>
""")))

print(repr(md("""\
    <h1>Title</h1>
""", heading_style="ATX")))

produces this:

' Title\n=====\n\n\n'
 ^

' # Title\n\n\n'
 ^

As a workaround, we iterate through all text-node descendants of every text-containing block element (<p>, <entry>, <li>, etc.) and convert newlines to spaces, but this is expensive on large document sets.

Possibly related to #31.

chrispy-snps changed the title from "Indent in <p> causes indent in Markdown output" to "Indent before HTML block elements causes indent in Markdown output" on Nov 26, 2023
chrispy-snps (Collaborator, Author) commented:

This seems to be a duplicate of issue #96.

mirabilos commented:

or rather #88 perhaps

jsm28 added a commit to jsm28/python-markdownify that referenced this issue Apr 9, 2024
There are various cases in which inline text fails to be separated by
(sufficiently many) newlines from adjacent block content.  A paragraph
needs a blank line (two newlines) separating it from prior text, as
does an underlined header; an ATX header needs a single newline
separating it from prior text.  A list needs at least one newline
separating it from prior text, but in general two newlines (for an
ordered list starting other than at 1, which will only be recognized
given a blank line before).

To avoid accumulation of more newlines than necessary, take care when
concatenating the results of converting consecutive tags to remove
redundant newlines (keeping the greater of the number ending the prior
text and the number starting the subsequent text).
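
The newline-merging rule described in the paragraph above can be illustrated with a small standalone sketch. The helper names `join_chunks` and `join_all` are hypothetical; this is not the code from the pull request, only one way the "keep the greater count" rule might be expressed:

```python
from functools import reduce


def join_chunks(prev: str, nxt: str) -> str:
    """Concatenate two converted chunks, keeping at the seam only the
    greater of prev's trailing and nxt's leading newline counts."""
    prev_body = prev.rstrip("\n")
    next_body = nxt.lstrip("\n")
    trailing = len(prev) - len(prev_body)   # newlines ending the prior text
    leading = len(nxt) - len(next_body)     # newlines starting the next text
    return prev_body + "\n" * max(trailing, leading) + next_body


def join_all(chunks):
    """Fold join_chunks over a sequence of converted chunks."""
    return reduce(join_chunks, chunks, "")
```

With this rule, inline text ending in one newline followed by a paragraph that starts with two newlines is joined with exactly two newlines rather than three, so the paragraph still gets its blank line without the separators piling up.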

This is thus an alternative to matthewwithanm#108 that tries to avoid the excess
newline accumulation that was a concern there, as well as fixing more
cases than just paragraphs, and updating tests.

Fixes matthewwithanm#92

Fixes matthewwithanm#98
jsm28 linked a pull request on Apr 9, 2024 that will close this issue