Unify autolinks processing between mdpo and md2po (#165)
* Unify autolinks processing between mdpo and md2po

* Handle titles in autolinks uniformity process
mondeja authored Aug 30, 2021
1 parent e788e6d commit 138437d
Showing 21 changed files with 279 additions and 152 deletions.
6 changes: 4 additions & 2 deletions README.md
@@ -22,8 +22,10 @@
</h2>

<p align="center">
Fully complies with <a href="https://spec.commonmark.org/0.29">CommonMark Specification v0.29</a>,
supporting some additional features.
Complies with <a href="https://spec.commonmark.org/">CommonMark Specification</a>
<a href="https://spec.commonmark.org/0.29">v0.29</a> and
<a href="https://spec.commonmark.org/0.30">v0.30</a>, supporting some
additional features.
</p>

## Status
1 change: 1 addition & 0 deletions docs/cli.rst
@@ -52,6 +52,7 @@ The output produced by :ref:`po2md-cli` is compatible with the following
{
"no-blanks-blockquote": false,
"no-bare-urls": false,
"ul-indent": {
"indent": 3
}
32 changes: 32 additions & 0 deletions docs/implementation-notes.rst
@@ -0,0 +1,32 @@
.. _implementation-notes:

********************
Implementation notes
********************

.. note::

Refer to the `CommonMark Specification v0.30`_ for descriptions of the terms
used by this document.

Autolink vs link clash
======================

An autolink is something like ``<https://foo.bar>`` and a link is something
like ``[foo](https://foo.bar)``.

The MD4C parser doesn't distinguish between an autolink and a link whose text
and destination are the same. As a result, mdpo treats every link whose text
and destination are the same as an autolink.

If a link's text contains markup characters, it is treated as a regular link
and rendered as such, even if its content matches its target. In practice, a
link whose text contains markup characters can't be an autolink.

Link title delimiters
=====================

Although a link title can be wrapped in different characters, mdpo always
uses ``"`` because of MD4C parser limitations.

.. _CommonMark Specification v0.30: https://spec.commonmark.org/0.30
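
The autolink rule described in these notes can be sketched as a small pure function. This is an illustration only, not mdpo's code: `render_link` is a hypothetical name, and its inputs are assumed to be the already rendered link text (including any inline markup characters), the destination, and an optional title.

```python
def render_link(text, href, title=None):
    """Render a link as msgid markup (hypothetical helper, not mdpo API)."""
    # Titles are always wrapped in double quotes, as the notes describe.
    title_part = f' "{title}"' if title else ''
    if text == href:
        # Text and destination match exactly: emit an autolink.
        return f'<{text}{title_part}>'
    # Any difference, including markup characters in the text, keeps the
    # regular inline link form.
    return f'[{text}]({href}{title_part})'


print(render_link('https://foo.bar', 'https://foo.bar'))
print(render_link('**https://foo.bar**', 'https://foo.bar'))
print(render_link('foo', 'https://foo.bar', 'Foo title'))
```

The second call shows why markup in the link text prevents the autolink form: the buffered text `**https://foo.bar**` no longer equals the destination.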
9 changes: 6 additions & 3 deletions docs/index.rst
@@ -2,8 +2,8 @@
mdpo's documentation
####################

Markdown files translation using pofiles. Fully complies with
`CommonMark Specification v0.29`_.
Markdown files translation using pofiles. Complies with
`CommonMark Specification`_ `v0.29`_ and `v0.30`_.

.. toctree::
:maxdepth: 2
@@ -29,6 +29,7 @@ Markdown files translation using pofiles. Fully complies with
:caption: In depth

rationale
implementation-notes

.. raw:: html

@@ -40,4 +41,6 @@ Markdown files translation using pofiles. Fully complies with

devref/index

.. _CommonMark Specification v0.29: https://spec.commonmark.org/0.29
.. _CommonMark Specification: https://spec.commonmark.org/
.. _v0.29: https://spec.commonmark.org/0.29
.. _v0.30: https://spec.commonmark.org/0.30
12 changes: 10 additions & 2 deletions docs/rationale.rst
@@ -56,7 +56,10 @@ the text that needs to be translated, including markdown characters:
* ````Code text```` and ```Code text``` are unified to use the minimum possible
backticks for start and end characters and dumped into msgids as
```Code text```.
* ``[Link text](target)`` is not changed, is dumped into msgids as is.
* ``[Link text](target)`` is not changed if the text differs from the target;
it is dumped into msgids as is. If the link text and target are equal, the
link is converted to an `autolink`_ and dumped into msgids as
``<link>``.
* Images as ``![Image alternative text](/target.ext "Image title text")``,
are not changed, but included as is.
* ``~~Strikethrough text~~`` is not changed, is dumped into msgids as
@@ -66,10 +69,13 @@ the text that needs to be translated, including markdown characters:
* ``__Underline text__`` and ``_Underline text_`` are unified to
``__Underline text__`` into msgids if ``underline`` mode is active,
otherwise are treated like bold text (with two characters ``__``) and dumped
as ``**Underline text**`` or italic text (with one character ``_``) and
as ``**Underline text**`` or italic text (with one character, ``_``) and
dumped as ``*Underline text*``.


.. seealso::
* :ref:`Implementation notes<implementation-notes>`

Advantages
----------

@@ -86,3 +92,5 @@ Disadvantages
* Message replacers need to be written and depend on this specification.
* Translation editors need to be configured with this specification if they
want to properly handle markup character templates.

.. _autolink: https://spec.commonmark.org/0.30/#autolinks
31 changes: 0 additions & 31 deletions mdpo/md.py
@@ -52,37 +52,6 @@ def escape_links_titles(text, link_start_string='[', link_end_string=']'):
return text


def inline_untexted_links(text, link_start_string='[', link_end_string=']'):
"""Replace Markdown self-referenced links delimiters by ``<`` and ``>``.
Given a string like ``"Text with [self-referenced-link]"``, replaces
self-referenced link markup characters with new ones; in this case it becomes
``"Text with <self-referenced-link>"``.
Wikilinks are not replaced (strings started with ``[[`` and ended with
``]]`` string chunks).
Args:
text (str): Text that could contain self-referenced links.
link_start_string (str): String that delimits the start of a link.
link_end_string (str): String that delimits the end of a link.
Returns:
str: Same text as input with replaced link delimiters characters found
inside titles.
Examples:
>>> inline_untexted_links('Text with [self-referenced-link]')
'Text with <self-referenced-link>'
"""
return re.sub(
(
re.escape(link_start_string) + r'(\w{1,5}:\/\/[^\s]+)' +
re.escape(link_end_string)
), r'<\g<1>>', text,
)


def parse_link_references(content):
"""Parses link references found in a Markdown content.
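
For reference, the behaviour of the removed `inline_untexted_links` can be reproduced outside mdpo with the same regular expression. The sketch below copies the pattern from the removed function into a standalone helper; the name `bracketed_urls_to_autolinks` is made up for illustration. Note that the pattern requires a `scheme://` prefix, so a bracketed string without one, such as the one used in the removed docstring example, is left untouched by this pattern.

```python
import re


def bracketed_urls_to_autolinks(text, start='[', end=']'):
    """Rewrite ``[scheme://...]`` chunks as ``<scheme://...>`` (illustrative)."""
    # Same pattern as the removed function: up to five word characters,
    # then "://", then a run of non-whitespace characters.
    pattern = re.escape(start) + r'(\w{1,5}:\/\/[^\s]+)' + re.escape(end)
    return re.sub(pattern, r'<\g<1>>', text)


print(bracketed_urls_to_autolinks('Text with [https://foo.bar]'))
print(bracketed_urls_to_autolinks('Text with [self-referenced-link]'))
```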
122 changes: 87 additions & 35 deletions mdpo/md2po/__init__.py
@@ -93,14 +93,16 @@ class Md2Po:
'_inside_htmlblock',
'_inside_codeblock',
'_inside_pblock',
'_inside_aspan',
'_inside_liblock',
'_inside_codespan',
'_inside_olblock',
'_inside_hblock',
'_quoteblocks_deep',
'_codespan_start_index',
'_codespan_backticks',
'_current_aspan_target',
'_current_aspan_text',
'_current_aspan_ref_target',
'_current_wikilink_target',
'_current_imgspan',
'_uls_deep',
@@ -207,9 +209,6 @@ def __init__(self, glob_or_content, **kwargs):
self.code_end_string,
)

self.link_start_string = kwargs.get('link_start_string', '[')
self.link_end_string = kwargs.get('link_end_string', ']')

_include_xheaders = kwargs.get('xheaders', False)

if _include_xheaders:
@@ -220,22 +219,18 @@ def __init__(self, glob_or_content, **kwargs):
'x-mdpo-italic-end': self.italic_end_string,
'x-mdpo-code-start': self.code_start_string,
'x-mdpo-code-end': self.code_end_string,
'x-mdpo-link-start': self.link_start_string,
'x-mdpo-link-end': self.link_end_string,
})

self._enterspan_replacer = {
md4c.SpanType.STRONG.value: self.bold_start_string,
md4c.SpanType.EM.value: self.italic_start_string,
md4c.SpanType.CODE.value: self.code_start_string,
md4c.SpanType.A.value: self.link_start_string,
}

self._leavespan_replacer = {
md4c.SpanType.STRONG.value: self.bold_end_string,
md4c.SpanType.EM.value: self.italic_end_string,
md4c.SpanType.CODE.value: self.code_end_string,
md4c.SpanType.A.value: self.link_end_string,
}

if 'strikethrough' in self.extensions:
@@ -354,7 +349,12 @@ def __init__(self, glob_or_content, **kwargs):
self._codespan_start_index = None
self._codespan_backticks = None

self._current_aspan_target = None
self._inside_aspan = False
self._current_aspan_text = ''
# indicates the target of the current link, which is referenced and
# extracted without using MD4C, so we can preserve it as referenced
self._current_aspan_ref_target = None

self._link_references = None
self._current_wikilink_target = None
self._current_imgspan = {}
@@ -394,7 +394,7 @@ def _save_msgid(
# if the user has configured an `enter_block` or
# `leave_block` event remembering that
# `_current_top_level_block_number` and
# _current_top_level_block_type` properties must be handled
# `_current_top_level_block_type` properties must be handled
# accordingly

occurrence = (
@@ -650,17 +650,31 @@ def enter_span(self, span, details):
if not self.plaintext:
# underline spans with a double '_' character are entered twice
if not self._inside_uspan:
try:
self._current_msgid += self._enterspan_replacer[span.value]
except KeyError:
pass
if self._inside_aspan: # span inside link text
try:
self._current_aspan_text += self._enterspan_replacer[
span.value
]
except KeyError:
pass
else:
try:
self._current_msgid += (
self._enterspan_replacer[span.value]
)
except KeyError:
pass

if span is md4c.SpanType.A:
# here resides the logic to discover whether the current link
# is a reference link
if self._link_references is None:
self._link_references = parse_link_references(self.content)

self._inside_aspan = True

current_aspan_href = details['href'][0][1]
self._current_aspan_target = None
self._current_aspan_ref_target = None

if details['title']:
current_aspan_title = details['title'][0][1]
@@ -669,12 +683,12 @@ def enter_span(self, span, details):
href == current_aspan_href and
title == current_aspan_title
):
self._current_aspan_target = target
self._current_aspan_ref_target = target
break
else:
for target, href, title in self._link_references:
if href == current_aspan_href:
self._current_aspan_target = target
self._current_aspan_ref_target = target
break

elif span is md4c.SpanType.CODE:
@@ -709,30 +723,60 @@ def leave_span(self, span, details):
if span is md4c.SpanType.WIKILINK:
self._current_msgid += self._current_wikilink_target
self._current_wikilink_target = None

try:
self._current_msgid += self._leavespan_replacer[span.value]
except KeyError:
pass
if self._inside_aspan: # span inside link text
try:
self._current_aspan_text += self._leavespan_replacer[
span.value
]
except KeyError:
pass
else:
try:
self._current_msgid += (
self._leavespan_replacer[span.value]
)
except KeyError:
pass

if span is md4c.SpanType.A:
if self._current_aspan_target: # reference link
self._current_msgid += f'[{self._current_aspan_target}]'
self._current_aspan_target = None
else:
self._current_msgid += '({}{})'.format(
details['href'][0][1],
'' if not details['title'] else ' "{}"'.format(
details['title'][0][1],
),
if self._current_aspan_ref_target: # referenced link
self._current_msgid += (
f'[{self._current_aspan_text}]'
f'[{self._current_aspan_ref_target}]'
)
self._current_aspan_ref_target = None
else:
if self._current_aspan_text == details['href'][0][1]:
# autolink vs link clash (see implementation notes)
self._current_msgid += f'<{self._current_aspan_text}'
if details['title']:
self._current_msgid += ' "{}"'.format(
details['title'][0][1],
)
self._current_msgid += '>'
else:
self._current_msgid += '[{}]({}{})'.format(
self._current_aspan_text,
details['href'][0][1],
'' if not details['title'] else ' "{}"'.format(
details['title'][0][1],
),
)
self._inside_aspan = False
self._current_aspan_text = ''
elif span is md4c.SpanType.CODE:
self._inside_codespan = False
self._codespan_start_index = None

# add backticks at the end to escape internal backticks
self._current_msgid += (
self._codespan_backticks * self.code_end_string
)
if self._inside_aspan:
self._current_aspan_text += (
self._codespan_backticks * self.code_end_string
)
else:
self._current_msgid += (
self._codespan_backticks * self.code_end_string
)
self._codespan_backticks = None
elif span is md4c.SpanType.IMG:
self._current_msgid += '![{}]({}'.format(
@@ -755,7 +799,9 @@ def text(self, block, text):

if not self._inside_htmlblock:
if not self._inside_codeblock:
if self._inside_liblock and text == '\n':
if any([ # softbreaks
self._inside_liblock, self._inside_aspan,
]) and text == '\n':
text = ' '
if not self.plaintext:
if self._current_imgspan:
Expand All @@ -773,6 +819,12 @@ def text(self, block, text):
self._codespan_backticks * self.code_start_string,
self._current_msgid[self._codespan_start_index:],
)
if self._inside_aspan:
self._current_aspan_text += text
return
elif self._inside_aspan:
self._current_aspan_text += text
return
elif text == self.italic_start_string:
text = self.italic_start_string_escaped
elif text == self.code_start_string:
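
The core idea of the `md2po` change is to buffer link text separately and choose the output form only when the link span closes. That idea can be sketched without MD4C. The class below is a simplified stand-in, not mdpo's `Md2Po`: the names `LinkBuffer`, `enter_link`, `leave_link`, `enter_strong` and `leave_strong` are hypothetical, only bold spans are modelled, and reference links are omitted.

```python
class LinkBuffer:
    """Toy event handler that defers link rendering until the span ends."""

    def __init__(self):
        self.msgid = ''
        self._inside_link = False
        self._link_text = ''

    def _emit(self, chunk):
        # While inside a link, collect into the link buffer, not the msgid.
        if self._inside_link:
            self._link_text += chunk
        else:
            self.msgid += chunk

    def text(self, text):
        # A soft break inside link text is collapsed to a single space.
        if self._inside_link and text == '\n':
            text = ' '
        self._emit(text)

    def enter_strong(self):
        self._emit('**')

    def leave_strong(self):
        self._emit('**')

    def enter_link(self):
        self._inside_link = True

    def leave_link(self, href, title=None):
        title_part = f' "{title}"' if title else ''
        if self._link_text == href:
            # Same text and destination: emit an autolink.
            rendered = f'<{self._link_text}{title_part}>'
        else:
            rendered = f'[{self._link_text}]({href}{title_part})'
        self._inside_link = False
        self._link_text = ''
        self.msgid += rendered


handler = LinkBuffer()
handler.text('See ')
handler.enter_link()
handler.text('https://foo.bar')
handler.leave_link('https://foo.bar')
print(handler.msgid)
```

Because inline markup emitted inside the link goes into the buffer too, a link such as `[**the docs**](https://foo.bar)` never compares equal to its destination and keeps the regular link form.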