Skip to content

Commit

Permalink
Improve autolinks extension.
Browse files Browse the repository at this point in the history
The autolinks extension was interacting badly with explicit links,
as shown by #147:

    [link](https://baidu.com)aaa<span></span>bbb

To fix this we had to make autolink parsing a bit stricter than
the GFM spec does.  They allow unbalanced `)` except at the end
of a URL (which is defined as: followed by optional final punctuation
then whitespace or eof).  With this change, we don't allow unbalanced
`)` at all in raw URLs. This should not be a big problem in practice.

Closes #147.
  • Loading branch information
jgm committed Feb 7, 2024
1 parent 6c07e48 commit f0b9653
Show file tree
Hide file tree
Showing 2 changed files with 79 additions and 35 deletions.
68 changes: 43 additions & 25 deletions commonmark-extensions/src/Commonmark/Extensions/Autolink.hs
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ wwwAutolink :: Monad m => InlineParser m Text
wwwAutolink = try $ do
lookAhead $ satisfyWord (== "www")
validDomain
linkSuffix
linkPath 0
return "http://"

validDomain :: Monad m => InlineParser m ()
Expand All @@ -47,29 +47,47 @@ validDomain = do
domainPart
skipMany1 $ try (symbol '.' >> domainPart)

linkSuffix :: Monad m => InlineParser m ()
linkSuffix = try $ do
toks <- getInput
let possibleSuffixTok (Tok (Symbol c) _ _) =
c `notElem` ['<','>','{','}','|','\\','^','[',']','`']
possibleSuffixTok (Tok WordChars _ _) = True
possibleSuffixTok _ = False
let isDroppable (Tok (Symbol c) _ _) =
c `elem` ['?','!','.',',',':','*','_','~']
isDroppable _ = False
let numToks = case dropWhile isDroppable $
reverse (takeWhile possibleSuffixTok toks) of
(Tok (Symbol ')') _ _ : xs)
| length [t | t@(Tok (Symbol '(') _ _) <- xs] <=
length [t | t@(Tok (Symbol ')') _ _) <- xs]
-> length xs
(Tok (Symbol ';') _ _
: Tok WordChars _ _
: Tok (Symbol '&') _ _
: xs) -> length xs
xs -> length xs
count numToks anyTok
return ()
linkPath :: Monad m => Int -> InlineParser m ()
linkPath openParens =
try (symbol '&' *>
notFollowedBy
(try (satisfyWord (const True) *> symbol ';' *> linkEnd)) *>
linkPath openParens)
<|> (pathPunctuation *> linkPath openParens)
<|> (symbol '(' *> linkPath (openParens + 1))
<|> (guard (openParens > 0) *> symbol ')' *> linkPath (openParens - 1))
-- the following clause is needed to implement the GFM spec, which allows
-- unbalanced ) except at link end. However, leaving this in causes
-- problematic interaction with explicit link syntax in certain odd cases (see #147).
-- <|> (notFollowedBy linkEnd *> symbol ')' *> linkPath (openParens - 1))
<|> (satisfyTok (\t -> case tokType t of
LineEnd -> False
Spaces -> False
Symbol c -> not (isTrailingPunctuation c || c == '&' || c == ')')
_ -> True) *> linkPath openParens)
<|> pure ()

linkEnd :: Monad m => InlineParser m ()
linkEnd = try $ skipMany trailingPunctuation *> (void whitespace <|> eof)

trailingPunctuation :: Monad m => InlineParser m ()
trailingPunctuation = void $
satisfyTok (\t -> case tokType t of
Symbol c -> isTrailingPunctuation c
_ -> False)

isTrailingPunctuation :: Char -> Bool
isTrailingPunctuation =
(`elem` ['!', '"', '\'', ')', '*', ',', '.', ':', ';', '?', '_', '~', '<'])

pathPunctuation :: Monad m => InlineParser m ()
pathPunctuation = try $ do
satisfyTok (\t -> case tokType t of
Symbol c -> isTrailingPunctuation c && c /= ')' && c /= '<'
_ -> False)
void $ lookAhead (satisfyTok (\t -> case tokType t of
WordChars -> True
_ -> False))

urlAutolink :: Monad m => InlineParser m Text
urlAutolink = try $ do
Expand All @@ -78,7 +96,7 @@ urlAutolink = try $ do
symbol '/'
symbol '/'
validDomain
linkSuffix
linkPath 0
return ""

emailAutolink :: Monad m => InlineParser m Text
Expand Down
46 changes: 36 additions & 10 deletions commonmark-extensions/test/autolinks.md
Original file line number Diff line number Diff line change
Expand Up @@ -49,10 +49,37 @@ Visit www.commonmark.org/~jm/foo/bar.pdf.
<p>Visit <a href="http://www.commonmark.org/~jm/foo/bar.pdf">www.commonmark.org/~jm/foo/bar.pdf</a>.</p>
````````````````````````````````

When an autolink ends in `)`, we scan the entire autolink for the total number
of parentheses. If there is a greater number of closing parentheses than
opening ones, we don't consider the last character part of the autolink, in
order to facilitate including an autolink inside a parenthesis:
((commonmark-hs: We depart from the GFM spec here. Alternative spec and tests
can be found after these commented-out ones. For motivation for this departure from
the GFM spec, see #147.))

> When an autolink ends in `)`, we scan the entire autolink for the total number
> of parentheses. If there is a greater number of closing parentheses than
> opening ones, we don't consider the last character part of the autolink, in
> order to facilitate including an autolink inside a parenthesis:
>
> ```````````````````````````````` example
> www.google.com/search?q=Markup+(business)
>
> (www.google.com/search?q=Markup+(business))
> .
> <p><a href="http://www.google.com/search?q=Markup+(business)">www.google.com/search?q=Markup+(business)</a></p>
> <p>(<a href="http://www.google.com/search?q=Markup+(business)">www.google.com/search?q=Markup+(business)</a>)</p>
> ````````````````````````````````
>
> This check is only done when the link ends in a closing parentheses `)`, so if
> the only parentheses are in the interior of the autolink, no special rules are
> applied:
>
> ```````````````````````````````` example
> www.google.com/search?q=(business))+ok
> .
> <p><a href="http://www.google.com/search?q=(business))+ok">www.google.com/search?q=(business))+ok</a></p>
> ````````````````````````````````
Autolinks can contain balanced pairs of parentheses, or unbalanced `)`.
We don't allow unbalanced `)`, in order to facilitate including
an autolink inside a parenthesis:
```````````````````````````````` example
www.google.com/search?q=Markup+(business)
Expand All @@ -63,16 +90,15 @@ www.google.com/search?q=Markup+(business)
<p>(<a href="http://www.google.com/search?q=Markup+(business)">www.google.com/search?q=Markup+(business)</a>)</p>
````````````````````````````````
This check is only done when the link ends in a closing parentheses `)`, so if
the only parentheses are in the interior of the autolink, no special rules are
applied:

Issue #147:
```````````````````````````````` example
www.google.com/search?q=(business))+ok
[link](https://baidu.com)aaa<span></span>bbb
.
<p><a href="http://www.google.com/search?q=(business))+ok">www.google.com/search?q=(business))+ok</a></p>
<p><a href="https://baidu.com">link</a>aaa<span></span>bbb</p>
````````````````````````````````

((End of diverging section.))

If an autolink ends in a semicolon (`;`), we check to see if it appears to
resemble an [entity reference][entity references]; if the preceding text is `&`
followed by one or more alphanumeric characters. If so, it is excluded from
Expand Down

0 comments on commit f0b9653

Please sign in to comment.