Unify autolinks processing between mdpo and md2po (#165)

* Unify autolinks processing between mdpo and md2po
* Handle titles in autolinks uniformity process
mondeja · Aug 30, 2021 · 138437d
1 parent e788e6d
Showing 21 changed files with 279 additions and 152 deletions.
diff --git a/README.md b/README.md
@@ -22,8 +22,10 @@
 </h2>
 
 <p align="center">
-Fully complies with <a href="https://spec.commonmark.org/0.29">CommonMark Specification v0.29</a>,
-supporting some additional features.
+Complies with <a href="https://spec.commonmark.org/">CommonMark Specification</a>
+<a href="https://spec.commonmark.org/0.29">v0.29</a> and
+<a href="https://spec.commonmark.org/0.30">v0.30</a>, supporting some
+additional features.
 </p>
 
 ## Status

diff --git a/docs/cli.rst b/docs/cli.rst
@@ -52,6 +52,7 @@ The output produced by :ref:`po2md-cli` is compatible with the following
 
    {
      "no-blanks-blockquote": false,
+     "no-bare-urls": false,
      "ul-indent": {
        "indent": 3
      }

diff --git a/docs/implementation-notes.rst b/docs/implementation-notes.rst
@@ -0,0 +1,32 @@
+.. _implementation-notes:
+
+********************
+Implementation notes
+********************
+
+.. note::
+
+   Refer to the `CommonMark Specification v0.30`_ for descriptions of the terms
+   used by this document.
+
+Autolink vs link clash
+======================
+
+An autolink is something like ``<https://foo.bar>`` and a link is something
+like ``[foo](https://foo.bar)``.
+
+The MD4C parser doesn't distinguish between an autolink and a link whose text
+and destination are the same. So mdpo treats every link whose text and
+destination are the same as an autolink.
+
+If a link's text contains markup characters, it is treated as different from
+its target, even when the content is the same, and is rendered as a link. In
+practice, a link whose text contains markup characters can't be an autolink.
+
+Link title delimiters
+=====================
+
+Although a link title can be wrapped in different characters, mdpo always
+uses ``"`` due to MD4C parser limitations.
+
+.. _CommonMark Specification v0.30: https://spec.commonmark.org/0.30
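
The autolink vs link rule described in these notes can be sketched as a small standalone function. This is only an illustration under stated assumptions: `render_link` is a hypothetical name, not part of mdpo, and it only mirrors the text/destination comparison and the `"` title delimiter described above.

```python
def render_link(text, href, title=None):
    """Illustrative only: render a link following the notes above.

    ``render_link`` is a hypothetical helper, not an mdpo function.
    """
    # titles are always wrapped in double quotes
    title_part = f' "{title}"' if title else ''
    if text == href:
        # text and destination are the same: emit the autolink form
        return f'<{text}{title_part}>'
    # any difference (including markup in the text) keeps the link form
    return f'[{text}]({href}{title_part})'


print(render_link('https://foo.bar', 'https://foo.bar'))
print(render_link('foo', 'https://foo.bar'))
print(render_link('**https://foo.bar**', 'https://foo.bar'))
```

The third call shows why a link whose text carries markup characters is never rendered as an autolink: the text no longer equals the destination.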
diff --git a/docs/index.rst b/docs/index.rst
@@ -2,8 +2,8 @@
 mdpo's documentation
 ####################
 
-Markdown files translation using pofiles. Fully complies with
-`CommonMark Specification v0.29`_.
+Markdown files translation using pofiles. Complies with
+`CommonMark Specification`_ `v0.29`_ and `v0.30`_.
 
 .. toctree::
    :maxdepth: 2
@@ -29,6 +29,7 @@ Markdown files translation using pofiles. Fully complies with
    :caption: In depth
 
    rationale
+   implementation-notes
 
 .. raw:: html
 
@@ -40,4 +41,6 @@ Markdown files translation using pofiles. Fully complies with
 
    devref/index
 
-.. _CommonMark Specification v0.29: https://spec.commonmark.org/0.29
+.. _CommonMark Specification: https://spec.commonmark.org/
+.. _v0.29: https://spec.commonmark.org/0.29
+.. _v0.30: https://spec.commonmark.org/0.30
diff --git a/docs/rationale.rst b/docs/rationale.rst
@@ -56,7 +56,10 @@ the text that needs to be translated, including markdown characters:
 * ````Code text```` and ```Code text``` are unified to use the minimum possible
   backticks for start and end characters and dumped into msgids as
   ```Code text```.
-* ``[Link text](target)`` is not changed, is dumped into msgids as is.
+* ``[Link text](target)`` is not changed if the text differs from the target
+  and is dumped into msgids as is. If the link text and target are equal, it
+  is converted to an `autolink`_ and dumped into msgids as
+  ``<link>``.
 * Images as ``![Image alternative text](/target.ext "Image title text")``,
   are not changed, but included as is.
 * ``~~Strikethrough text~~`` is not changed, is dumped into msgids as
@@ -66,10 +69,13 @@ the text that needs to be translated, including markdown characters:
 * ``__Underline text__`` and ``_Underline text_`` are unified to
   ``__Underline text__`` into msgids if ``underline`` mode is active,
   otherwise are treated like bold text (with two characters ``__``) and dumped
-  as ``**Underline text**`` or italic text (with one character ``_``) and
+  as ``**Underline text**`` or italic text (with one character, ``_``) and
   dumped as ``*Underline text*``.
 
 
+.. seealso::
+   * :ref:`Implementation notes<implementation-notes>`
+
 Advantages
 ----------
 
@@ -86,3 +92,5 @@ Disadvantages
 * Message replacers needs to be written and depends on this specification.
 * Translation editors needs to be configured with this specification if they
   want to handle properly markup character templates.
+
+.. _autolink: https://spec.commonmark.org/0.30/#autolinks
diff --git a/mdpo/md.py b/mdpo/md.py
@@ -52,37 +52,6 @@ def escape_links_titles(text, link_start_string='[', link_end_string=']'):
     return text
 
 
-def inline_untexted_links(text, link_start_string='[', link_end_string=']'):
-    """Replace Markdown self-referenced links delimiters by ``<`` and ``>``.
-
-    Given a string like ``"Text with [self-referenced-link]"``, replaces self
-    referenced links markup characters by new ones, in this case would becomes
-    ``"Text with <self-referenced-link>"``.
-
-    Wikilinks are not replaced (strings started with ``[[`` and ended with
-    ``]]`` string chunks).
-
-    Args:
-        text (str): Text that could contain self-referenced links.
-        link_start_string (str): String that delimites the start of a link.
-        link_end_string (str): String that delimites the end of a link.
-
-    Returns:
-        str: Same text as input with replaced link delimiters characters found
-        inside titles.
-
-    Examples:
-        >>> inline_untexted_links('Text with [self-referenced-link]')
-        'Text with <self-referenced-link>'
-    """
-    return re.sub(
-        (
-            re.escape(link_start_string) + r'(\w{1,5}:\/\/[^\s]+)' +
-            re.escape(link_end_string)
-        ), r'<\g<1>>', text,
-    )
-
-
 def parse_link_references(content):
     """Parses link references found in a Markdown content.
 

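
For reference, the regular expression in the removed `inline_untexted_links` helper only rewrites bracketed text that starts with a `scheme://` prefix. A minimal standalone sketch of the same substitution follows; the name `bracketed_urls_to_autolinks` is hypothetical and is not part of mdpo.

```python
import re


def bracketed_urls_to_autolinks(text, start='[', end=']'):
    """Illustrative only: same pattern as the removed helper.

    ``bracketed_urls_to_autolinks`` is a hypothetical name.
    """
    pattern = (
        re.escape(start) + r'(\w{1,5}:\/\/[^\s]+)' + re.escape(end)
    )
    # replace the delimiters around the captured URL with < and >
    return re.sub(pattern, r'<\g<1>>', text)


print(bracketed_urls_to_autolinks('Text with [https://foo.bar]'))
print(bracketed_urls_to_autolinks('Text with [self-referenced-link]'))
```

Bracketed text without a `scheme://` prefix is left untouched by this pattern.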
diff --git a/mdpo/md2po/__init__.py b/mdpo/md2po/__init__.py
@@ -93,14 +93,16 @@ class Md2Po:
         '_inside_htmlblock',
         '_inside_codeblock',
         '_inside_pblock',
+        '_inside_aspan',
         '_inside_liblock',
         '_inside_codespan',
         '_inside_olblock',
         '_inside_hblock',
         '_quoteblocks_deep',
         '_codespan_start_index',
         '_codespan_backticks',
-        '_current_aspan_target',
+        '_current_aspan_text',
+        '_current_aspan_ref_target',
         '_current_wikilink_target',
         '_current_imgspan',
         '_uls_deep',
@@ -207,9 +209,6 @@ def __init__(self, glob_or_content, **kwargs):
                 self.code_end_string,
             )
 
-            self.link_start_string = kwargs.get('link_start_string', '[')
-            self.link_end_string = kwargs.get('link_end_string', ']')
-
             _include_xheaders = kwargs.get('xheaders', False)
 
             if _include_xheaders:
@@ -220,22 +219,18 @@ def __init__(self, glob_or_content, **kwargs):
                     'x-mdpo-italic-end': self.italic_end_string,
                     'x-mdpo-code-start': self.code_start_string,
                     'x-mdpo-code-end': self.code_end_string,
-                    'x-mdpo-link-start': self.link_start_string,
-                    'x-mdpo-link-end': self.link_end_string,
                 })
 
             self._enterspan_replacer = {
                 md4c.SpanType.STRONG.value: self.bold_start_string,
                 md4c.SpanType.EM.value: self.italic_start_string,
                 md4c.SpanType.CODE.value: self.code_start_string,
-                md4c.SpanType.A.value: self.link_start_string,
             }
 
             self._leavespan_replacer = {
                 md4c.SpanType.STRONG.value: self.bold_end_string,
                 md4c.SpanType.EM.value: self.italic_end_string,
                 md4c.SpanType.CODE.value: self.code_end_string,
-                md4c.SpanType.A.value: self.link_end_string,
             }
 
             if 'strikethrough' in self.extensions:
@@ -354,7 +349,12 @@ def __init__(self, glob_or_content, **kwargs):
         self._codespan_start_index = None
         self._codespan_backticks = None
 
-        self._current_aspan_target = None
+        self._inside_aspan = False
+        self._current_aspan_text = ''
+        # indicates the target of the current link, which is referenced and
+        # extracted without using MD4C, so we can preserve it as referenced
+        self._current_aspan_ref_target = None
+
         self._link_references = None
         self._current_wikilink_target = None
         self._current_imgspan = {}
@@ -394,7 +394,7 @@ def _save_msgid(
             #       if the user has configured an `enter_block` or
             #       `leave_block` event remembering that
             #       `_current_top_level_block_number` and
-            #       _current_top_level_block_type` properties must be handled
+            #       `_current_top_level_block_type` properties must be handled
             #       accordingly
 
             occurrence = (
@@ -650,17 +650,31 @@ def enter_span(self, span, details):
         if not self.plaintext:
             # underline spans for double '_' character enters two times
             if not self._inside_uspan:
-                try:
-                    self._current_msgid += self._enterspan_replacer[span.value]
-                except KeyError:
-                    pass
+                if self._inside_aspan:  # span inside link text
+                    try:
+                        self._current_aspan_text += self._enterspan_replacer[
+                            span.value
+                        ]
+                    except KeyError:
+                        pass
+                else:
+                    try:
+                        self._current_msgid += (
+                            self._enterspan_replacer[span.value]
+                        )
+                    except KeyError:
+                        pass
 
             if span is md4c.SpanType.A:
+                # logic that discovers whether the current link
+                # is referenced
                 if self._link_references is None:
                     self._link_references = parse_link_references(self.content)
 
+                self._inside_aspan = True
+
                 current_aspan_href = details['href'][0][1]
-                self._current_aspan_target = None
+                self._current_aspan_ref_target = None
 
                 if details['title']:
                     current_aspan_title = details['title'][0][1]
@@ -669,12 +683,12 @@ def enter_span(self, span, details):
                             href == current_aspan_href and
                             title == current_aspan_title
                         ):
-                            self._current_aspan_target = target
+                            self._current_aspan_ref_target = target
                             break
                 else:
                     for target, href, title in self._link_references:
                         if href == current_aspan_href:
-                            self._current_aspan_target = target
+                            self._current_aspan_ref_target = target
                             break
 
             elif span is md4c.SpanType.CODE:
@@ -709,30 +723,60 @@ def leave_span(self, span, details):
                 if span is md4c.SpanType.WIKILINK:
                     self._current_msgid += self._current_wikilink_target
                     self._current_wikilink_target = None
-
-                try:
-                    self._current_msgid += self._leavespan_replacer[span.value]
-                except KeyError:
-                    pass
+                if self._inside_aspan:  # span inside link text
+                    try:
+                        self._current_aspan_text += self._leavespan_replacer[
+                            span.value
+                        ]
+                    except KeyError:
+                        pass
+                else:
+                    try:
+                        self._current_msgid += (
+                            self._leavespan_replacer[span.value]
+                        )
+                    except KeyError:
+                        pass
 
             if span is md4c.SpanType.A:
-                if self._current_aspan_target:  # reference link
-                    self._current_msgid += f'[{self._current_aspan_target}]'
-                    self._current_aspan_target = None
-                else:
-                    self._current_msgid += '({}{})'.format(
-                        details['href'][0][1],
-                        '' if not details['title'] else ' "{}"'.format(
-                            details['title'][0][1],
-                        ),
+                if self._current_aspan_ref_target:  # referenced link
+                    self._current_msgid += (
+                        f'[{self._current_aspan_text}]'
+                        f'[{self._current_aspan_ref_target}]'
                     )
+                    self._current_aspan_ref_target = None
+                else:
+                    if self._current_aspan_text == details['href'][0][1]:
+                        # autolink vs link clash (see implementation notes)
+                        self._current_msgid += f'<{self._current_aspan_text}'
+                        if details['title']:
+                            self._current_msgid += ' "{}"'.format(
+                                details['title'][0][1],
+                            )
+                        self._current_msgid += '>'
+                    else:
+                        self._current_msgid += '[{}]({}{})'.format(
+                            self._current_aspan_text,
+                            details['href'][0][1],
+                            '' if not details['title'] else ' "{}"'.format(
+                                details['title'][0][1],
+                            ),
+                        )
+                self._inside_aspan = False
+                self._current_aspan_text = ''
             elif span is md4c.SpanType.CODE:
                 self._inside_codespan = False
                 self._codespan_start_index = None
+
                 # add backticks at the end for escape internal backticks
-                self._current_msgid += (
-                    self._codespan_backticks * self.code_end_string
-                )
+                if self._inside_aspan:
+                    self._current_aspan_text += (
+                        self._codespan_backticks * self.code_end_string
+                    )
+                else:
+                    self._current_msgid += (
+                        self._codespan_backticks * self.code_end_string
+                    )
                 self._codespan_backticks = None
             elif span is md4c.SpanType.IMG:
                 self._current_msgid += '![{}]({}'.format(
@@ -755,7 +799,9 @@ def text(self, block, text):
 
         if not self._inside_htmlblock:
             if not self._inside_codeblock:
-                if self._inside_liblock and text == '\n':
+                if any([  # softbreaks
+                    self._inside_liblock, self._inside_aspan,
+                ]) and text == '\n':
                     text = ' '
                 if not self.plaintext:
                     if self._current_imgspan:
@@ -773,6 +819,12 @@ def text(self, block, text):
                             self._codespan_backticks * self.code_start_string,
                             self._current_msgid[self._codespan_start_index:],
                         )
+                        if self._inside_aspan:
+                            self._current_aspan_text += text
+                            return
+                    elif self._inside_aspan:
+                        self._current_aspan_text += text
+                        return
                     elif text == self.italic_start_string:
                         text = self.italic_start_string_escaped
                     elif text == self.code_start_string: