Preserve attributes on HTML paragraphs #10850

Valgard · 2025-05-17T18:09:59Z

This PR implements the preservation of attributes on HTML paragraphs, addressing issue #10768.

- The HTML reader now wraps attributed `<p>` tags in a `Div` with `wrapper="1"`.
- The HTML writer unwraps these `Div`s back into attributed `<p>` tags.

This approach is similar to the Djot reader/writer as discussed in #10768, ensuring that semantic information in HTML attributes on paragraphs is preserved during conversion.
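To make the round trip concrete, here is a minimal sketch of the idea. It is not the code in this PR: `Block`, `PTag`, `readP`, and `writeP` are simplified stand-ins invented for illustration, with inline content reduced to a `String`. Only the `Attr` shape and the `wrapper="1"` marker come from the description above.

```haskell
-- Simplified stand-ins for the pandoc AST; the real types live in
-- Text.Pandoc.Definition and are richer than this.
type Attr = (String, [String], [(String, String)])

data Block
  = Para String        -- paragraph; inline content reduced to a String
  | Div Attr [Block]   -- generic container that can carry attributes
  deriving (Eq, Show)

-- A <p> element as the reader might see it: attributes plus text.
data PTag = PTag Attr String
  deriving (Eq, Show)

nullAttr :: Attr
nullAttr = ("", [], [])

-- Marker key/value pair described in the PR.
wrapperKV :: (String, String)
wrapperKV = ("wrapper", "1")

-- Reader side: a bare <p> becomes a plain Para; an attributed <p>
-- becomes a Div carrying the attributes plus the marker.
readP :: PTag -> Block
readP (PTag attr txt)
  | attr == nullAttr = Para txt
  | otherwise        = Div (ident, classes, wrapperKV : kvs) [Para txt]
  where (ident, classes, kvs) = attr

-- Writer side: a marked Div around a single Para collapses back into
-- an attributed <p>. Nothing means "not a wrapper, render as usual".
writeP :: Block -> Maybe PTag
writeP (Para txt) = Just (PTag nullAttr txt)
writeP (Div (ident, classes, kvs) [Para txt])
  | wrapperKV `elem` kvs =
      Just (PTag (ident, classes, filter (/= wrapperKV) kvs) txt)
writeP _ = Nothing
```

In this sketch, `writeP (readP p)` gives back `Just p` for any `p` that does not already carry the marker, while a `Div` without the marker is left alone, so ordinary divs are not turned into paragraphs.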

jgm

Thanks for this. As noted, I didn't understand the special treatment of "align".

Another question I have is how common it is for paragraphs to have classes or other attributes in HTML in the wild. If it is very common, then I suppose this change will lead to more cluttered HTML -> Markdown conversions, and we'd need to weigh that.

jgm · 2025-05-28T20:26:41Z

src/Text/Pandoc/Readers/HTML.hs

+pParaWithWrapper :: PandocMonad m => Attr -> TagParser m Blocks
+pParaWithWrapper (ident, classes, kvs) = do
+  guardEnabled Ext_native_divs -- Ensure native_divs is enabled for this behavior
+  pInhalt <- trimInlines <$> pInTags "p" inline


I usually reserve names beginning with `p` for parsers, so it would be better to use something like `inhalt` for this name instead.

Valgard · 2025-06-08

Good point! I've renamed it to `contents`, in keeping with your convention of reserving the `p` prefix for parsers. Thanks for catching that!

jgm · 2025-05-28T20:28:10Z

src/Text/Pandoc/Readers/HTML.hs

+    let otherKVs = filter (\(k,_) -> k /= "align") kvs
+    let validAlignKV = case alignValue of
+                         Just algn | algn `elem` ["left","right","center","justify"] -> [("align", algn)]
+                         _ -> []
+    let finalKVs = wrapperAttr : (validAlignKV ++ otherKVs)


What is the motivation for treating the "align" attribute specially in this way?

Valgard · 2025-06-08

See my comment below.

jgm · 2025-05-28T20:29:58Z

src/Text/Pandoc/Readers/HTML.hs

+    return (case alignValue of
+              Just algn | algn `elem` ["left","right","center","justify"] ->
+                            B.divWith ("", [], [("align", algn)]) paraBlock


I don't understand the motivation for this.

Valgard · 2025-06-08

See my comment below.

- HTML reader wraps attributed `p` tags in `Div` with `wrapper="1"`.
- HTML writer unwraps `Div` with `wrapper="1"` back to attributed `p` tag.
- Add tests for HTML paragraph attribute roundtrip.
- Update EPUB golden files to reflect new AST for attributed paragraphs.


Valgard · 2025-06-08T22:20:42Z

> Thanks for this. As noted, I didn't understand the special treatment of "align".
>
> Another question I have is how common it is for paragraphs to have classes or other attributes in HTML in the wild. If it is very common, then I suppose this change will lead to more cluttered HTML -> Markdown conversions, and we'd need to weigh that.

Thanks for catching that! You're right about the align logic - that was actually an idea I had initially discarded, but it seems to have somehow made its way into the pull request anyway. I'll remove that special handling.

Regarding your question about paragraph attributes in the wild - you're absolutely right to be concerned. Paragraphs with classes and other attributes are extremely common in modern HTML, especially with:

- CSS frameworks (Bootstrap's `text-center`, `lead`, `text-muted`)
- Utility-first frameworks (Tailwind's `text-lg`, `mb-4`, `text-gray-600`)
- CMS-generated content (WordPress, Drupal automatically add classes)
- JavaScript hooks (`js-expandable`, `track-click`)
- Semantic styling (`introduction`, `disclaimer`, `highlight`)

A configurable approach would be ideal here. We could add a command-line option or extension setting that controls attribute preservation behavior:

- Default mode: Strip most attributes for clean, readable Markdown
- Preserve mode: Keep essential attributes (IDs for anchor links, alt text for images)
- Full preservation: Maintain all attributes for round-trip conversion

This would allow users to choose between clean output (which most expect from HTML→Markdown conversion) and technical preservation (needed for specific use cases like documentation sites requiring anchor links). The majority of conversions prioritize readability over technical fidelity, so defaulting to clean output while providing flexibility makes the most sense.
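A rough sketch of how the three modes could be modelled is shown here. Everything in it is hypothetical: pandoc has no such option today, and `AttrMode`, `applyMode`, and `needsWrapper` are names made up for illustration. It also assumes that, for paragraphs, "essential" means the identifier only.

```haskell
-- Same shape as pandoc's Attr: (identifier, classes, key/value pairs).
type Attr = (String, [String], [(String, String)])

-- Hypothetical setting mirroring the three modes listed above.
data AttrMode
  = StripAttrs      -- default: drop paragraph attributes
  | KeepEssential   -- keep only the identifier (assumption: id is "essential")
  | KeepAll         -- keep everything for round-trip conversion
  deriving (Eq, Show)

nullAttr :: Attr
nullAttr = ("", [], [])

-- Reduce a paragraph's attributes according to the chosen mode.
applyMode :: AttrMode -> Attr -> Attr
applyMode StripAttrs    _             = nullAttr
applyMode KeepEssential (ident, _, _) = (ident, [], [])
applyMode KeepAll       attr          = attr

-- A wrapper Div is only needed when some attribute survives the mode.
needsWrapper :: AttrMode -> Attr -> Bool
needsWrapper mode attr = applyMode mode attr /= nullAttr
```

Under these assumptions, a paragraph that carries only classes would stay unwrapped in the first two modes, which keeps the Markdown output clean unless the user opts in to full preservation.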

Valgard force-pushed the main branch from 9cc9389 to f089a81 · May 17, 2025 18:13

jgm reviewed May 28, 2025


Valgard added 4 commits June 8, 2025 22:27

refactor(HTML): Improve pPara and align handling

ab22397

Split pPara into pParaWithWrapper and pParaSimple helpers. Ensure pParaWithWrapper correctly discards invalid align attributes. Add specific tests for align attribute in HTML reader and writer.

docs(HTML): Document native_divs behavior for attributed p tags

3389697

- Update MANUAL.txt to reflect `native_divs` wrapping of attributed `<p>` tags.

test(HTML): add command tests for attributed p tags

3b33de2

- Add test cases for HTML to native, native to HTML, HTML to HTML, and HTML to HTML5 conversions
- Verify preservation of id, class, and data attributes on p tags

Valgard force-pushed the main branch from f089a81 to 3b33de2 · June 8, 2025 20:28

fix: remove special handling of align attribute in HTML paragraphs

469a25e

- Treat align attribute like any other attribute
- Always wrap paragraphs with attributes in divs (including align-only)
- Remove validation logic for align values
- Update tests to reflect consistent wrapper behavior

Valgard force-pushed the main branch from d9f2474 to 469a25e · June 8, 2025 22:20
