[fuzz result] code span vanishes when link destination is ` #136

Closed · notriddle opened this issue Jan 16, 2024 · 5 comments · Fixed by #137

@notriddle (Contributor)

This markdown:

[link](`)`x`

In most engines I've tried, including GitHub, the trailing x comes out as a code span, and the rendered output looks like this:

linkx

commonmark-hs generates this:

<p><a href="`">link</a>`x`</p>

Events from pulldown-cmark:

"[^](`)`|`\n" -> [
  Start(Paragraph)
    Start(Link { link_type: Inline, dest_url: Borrowed("`"), title: Borrowed(""), id: Borrowed("") })
      Text(Borrowed("^"))
    End(Link)
    Code(Borrowed("|"))
  End(Paragraph)
]

Events from pandoc:

"[^](`)`|`\n" -> [
  Start(Paragraph)
    Start(Link { link_type: Inline, dest_url: Inlined(InlineStr { inner: [96, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 1 }), title: Inlined(InlineStr { inner: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 0 }), id: Borrowed("") })
      Text(Boxed("^"))
    End(Link)
    Text(Boxed("`|`"))
  End(Paragraph)
]

Events from commonmark.js:

"[^](`)`|`\n" -> [
  Start(Paragraph)
    Start(Link { link_type: Inline, dest_url: Inlined(InlineStr { inner: [96, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 1 }), title: Inlined(InlineStr { inner: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], len: 0 }), id: Borrowed("") })
      Text(Boxed("^"))
    End(Link)
    Code(Boxed("|"))
  End(Paragraph)
]
@notriddle (Contributor, Author)

The code that looks like the culprit is here:

else (Chunk (Parsed (ranged
               (rangeFromToks missingtoks newpos)
               (str (untokenize missingtoks))))
             newpos missingtoks :)

In this case, the end paren is in the middle of a "parsed code" chunk, which gets split into the link destination and some plain text.

        v end paren
[link](`)`x`
      :^^^-- parsed chunk
      :| code span
      : suffix pos

The simplest solution is to make inline code bind tighter than link destinations, but this wouldn't match the reference implementation. The other simple options are to re-parse with the inline syntax parsers, or to interleave bracket matching with inline code span parsing.

@jgm (Owner) commented Jan 17, 2024

None of those are simple and obvious fixes, unfortunately.
Reparsing seems ugly but maybe it's the best idea.
I'm not sure what the last option would involve -- it sounds like a pretty major architectural change, though!

@notriddle (Contributor, Author) commented Jan 17, 2024

The three options that come to my mind are:

  1. You can't have unescaped ` in link destinations. This is the easiest, and doesn't match cmark.c.
  2. Re-parsing is the least invasive, but still ugly, and I'm not convinced it's actually correct.
  3. Add another kind of Chunk for ` runs, then turn it into code spans in processBs. This matches most closely how pulldown-cmark does it (MaybeCode, MaybeLinkOpen, and MaybeImage are all prepared in the tree-building pass, handle_inline_pass1 turns them into spans, and handle_emphasis_and_hard_break handles emphasis and hard line breaks in a second inline pass). Since this keeps a clean separation between "definitely a code span" and "a valid delimiter that might become a code span", it seems more likely to be correct; a rough sketch of the idea follows this list.
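Purely to make option 3 concrete, here is an illustrative sketch; every name in it is hypothetical and does not correspond to the real commonmark-hs internals.

    -- Hypothetical sketch of option 3 (names invented for illustration only).
    data ChunkKind
      = ParsedChunk       -- inline content that is already fully parsed
      | BacktickRun !Int  -- a run of n backticks: a *potential* code-span delimiter
      | PlainChunk        -- ordinary text tokens
      deriving (Show, Eq)

    -- The bracket-matching pass (processBs) would pair up BacktickRun chunks and
    -- only then promote the tokens between them to a code span, so a backtick
    -- that ends up inside a link destination never acts as a stray delimiter.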

notriddle added a commit to notriddle/commonmark-hs that referenced this issue Jan 18, 2024
Fixes jgm#136

This works by re-parsing the tokens that come after the link,
but only when the end delimiter isn't on a chunk boundary
(since that's the only way this problem can happen).

Re-parsing a specific chunk won't work, because the part that
needs to be re-interpreted can span more than one chunk. For example,
we can draw the bounds of the erroneous code chunk in this example:

    [x](`) <a href="`">
        ^-----------^

If we re-parse the underlined part in isolation, we'll fix the
first link, but won't find the HTML (since the closing angle
bracket is in the next chunk).

On the other hand, parsing links, code, and HTML in a single pass
would make writing extensions more complicated. For example,
LaTeX math is supposed to have the same binding strength as
code spans:

    $first[$](about)
    ^------^ this is a math span, not a link

    [first]($)$5/8$
            ^-^ this is an analogue of the original bug
                it shouldn't be a math span, but looks like one

@jgm (Owner) commented Feb 5, 2024

  1. You can't have unescaped ` in link destinations. This is the easiest, and doesn't match cmark.c.

This would require a spec change.

As for options 2 and 3, I'm not sure. I agree that 2 is ugly. So 3 has some appeal, but I'd have to see what is actually involved in going this way.

@notriddle (Contributor, Author)

I implemented option 2 in #137, but this implementation has potentially quadratic behavior, since each affected link triggers a re-parse of all the tokens that come after it.

The trouble with option 3 is that the extensions would also need to be redone, because $math$ is just as susceptible to this problem as `code` is.
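To check the math analogue from the commit message, a sketch along these lines could be used; it assumes mathSpec from commonmark-extensions and the commonmarkWith entry point, so the imports and constraints may need adjusting.

    {-# LANGUAGE OverloadedStrings, ScopedTypeVariables #-}
    import Commonmark
    import Commonmark.Extensions (mathSpec)
    import qualified Data.Text.Lazy.IO as TLIO

    main :: IO ()
    main = do
      -- "[first]($)$5/8$": the destination is "$", and "$5/8$" should be a math
      -- span; the analogue of the bug would be treating "$)$" as the math span.
      res <- commonmarkWith (mathSpec <> defaultSyntaxSpec) "repro" "[first]($)$5/8$\n"
      case res of
        Left e                  -> error (show e)
        Right (html :: Html ()) -> TLIO.putStr (renderHtml html)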

jgm closed this as completed in #137 (commit ff9fe57) on Sep 11, 2024