[fuzz result] code span vanishes when link destination is ` #136

Closed · notriddle opened this issue Jan 16, 2024 · 5 comments · Fixed by #137

@notriddle (Contributor)

This markdown:

[link](`)`x`

In most engines I've tried, including GitHub, the trailing x comes out as a code span, and the rendered output looks like this:

linkx

commonmark-hs generates this:

<p><a href="`">link</a>`x`</p>

Events from pulldown-cmark:

"[^](`)`|`\n" -> [
  Start(Paragraph)
    Start(Link { link_type: Inline, dest_url: Borrowed("`"), title: Borrowed(""), id: Borrowed("") })
      Text(Borrowed("^"))
    End(Link)
    Code(Borrowed("|"))
  End(Paragraph)
]

Events from pandoc:

"[^](`)`|`\n" -> [
  Start(Paragraph)
    Start(Link { link_type: Inline, dest_url: Inlined(InlineStr { inner: [96, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 1 }), title: Inlined(InlineStr { inner: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 0 }), id: Borrowed("") })
      Text(Boxed("^"))
    End(Link)
    Text(Boxed("`|`"))
  End(Paragraph)
]

Events from commonmark.js:

"[^](`)`|`\n" -> [
  Start(Paragraph)
    Start(Link { link_type: Inline, dest_url: Inlined(InlineStr { inner: [96, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 1 }), title: Inlined(InlineStr { inner: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 0 }), id: Borrowed("") })
      Text(Boxed("^"))
    End(Link)
    Code(Boxed("|"))
  End(Paragraph)
]
@notriddle (Contributor, Author)

The code that looks like the culprit is here:

else (Chunk (Parsed (ranged
               (rangeFromToks missingtoks newpos)
               (str (untokenize missingtoks))))
             newpos missingtoks :)

In this case, the end paren is in the middle of a "parsed code" chunk, which gets split into the link destination and some plain text.

        v end paren
[link](`)`x`
      :^^^-- parsed chunk
      :| code span
      : suffix pos

The simplest solution is to make inline code bind tighter than link destinations, but this wouldn't match the reference implementation. The other simple options are to re-parse with the inline syntax parsers, or to interleave bracket matching with inline code span parsing.

@jgm (Owner) commented Jan 17, 2024

None of those are simple and obvious fixes, unfortunately.
Reparsing seems ugly but maybe it's the best idea.
I'm not sure what the last option would involve -- it sounds like a pretty major architectural change, though!

@notriddle (Contributor, Author) commented Jan 17, 2024

The three options that come to my mind are:

  1. You can't have unescaped ` in link destinations. This is the easiest, and doesn't match cmark.c.
  2. Re-parsing is the least invasive, but still ugly, and I'm not convinced it's actually correct.
  3. Add another kind of Chunk for ` runs, then turn it into code spans in processBs. This matches most closely how pulldown-cmark does it (MaybeCode, MaybeLinkOpen, and MaybeImage are all prepared in the tree-building pass, handle_inline_pass1 turns them into spans, and handle_emphasis_and_hard_break handles emphasis and hard line breaks in a second inline pass). Since this keeps a clean separation between "definitely a code span" and "a valid delimiter that might become a code span", it seems more likely to be correct; a rough sketch of the idea follows this list.
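Purely to make option 3 concrete, here is an illustrative sketch; every name in it is hypothetical and does not correspond to the real commonmark-hs internals.

    -- Hypothetical sketch of option 3 (names invented for illustration only).
    data ChunkKind
      = ParsedChunk       -- inline content that is already fully parsed
      | BacktickRun !Int  -- a run of n backticks: a *potential* code-span delimiter
      | PlainChunk        -- ordinary text tokens
      deriving (Show, Eq)

    -- The bracket-matching pass (processBs) would pair up BacktickRun chunks and
    -- only then promote the tokens between them to a code span, so a backtick
    -- that ends up inside a link destination never acts as a stray delimiter.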

notriddle added a commit to notriddle/commonmark-hs that referenced this issue Jan 18, 2024
Fixes jgm#136

This works by re-parsing the tokens that come after the link,
but only when the end delimiter isn't on a chunk boundary
(since that's the only way this problem can happen).

Re-parsing a specific chunk won't work, because the part that
needs to be re-interpreted can span more than one chunk. For example,
we can draw the bounds of the erroneous code chunk in this example:

    [x](`) <a href="`">
        ^-----------^

If we re-parse the underlined part in isolation, we'll fix the
first link, but won't find the HTML (since the closing angle
bracket is in the next chunk).

On the other hand, parsing links, code, and HTML in a single pass
would make writing extensions more complicated. For example,
LaTeX math is supposed to have the same binding strength as
code spans:

    $first[$](about)
    ^------^ this is a math span, not a link

    [first]($)$5/8$
            ^-^ this is an analogue of the original bug
                it shouldn't be a math span, but looks like one

@jgm (Owner) commented Feb 5, 2024

  1. You can't have unescaped ` in link destinations. This is the easiest, and doesn't match cmark.c.

This would require a spec change.

As for options 2 and 3, I'm not sure. I agree that 2 is ugly. So 3 has some appeal, but I'd have to see what is actually involved in going this way.

@notriddle (Contributor, Author)

I implemented option 2 in #137, but this implementation has potentially quadratic behavior, since each affected link triggers a re-parse of all the tokens that come after it.

The trouble with option 3 is that the extensions would also need to be redone, because $math$ is just as susceptible to this problem as `code` is.
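To check the math analogue from the commit message, a sketch along these lines could be used; it assumes mathSpec from commonmark-extensions and the commonmarkWith entry point, so the imports and constraints may need adjusting.

    {-# LANGUAGE OverloadedStrings, ScopedTypeVariables #-}
    import Commonmark
    import Commonmark.Extensions (mathSpec)
    import qualified Data.Text.Lazy.IO as TLIO

    main :: IO ()
    main = do
      -- "[first]($)$5/8$": the destination is "$", and "$5/8$" should be a math
      -- span; the analogue of the bug would be treating "$)$" as the math span.
      res <- commonmarkWith (mathSpec <> defaultSyntaxSpec) "repro" "[first]($)$5/8$\n"
      case res of
        Left e                  -> error (show e)
        Right (html :: Html ()) -> TLIO.putStr (renderHtml html)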

jgm closed this as completed in #137 (commit ff9fe57) on Sep 11, 2024