Refactor TOC sanitation #1441

waylan · 2024-02-09T16:42:58Z

All postprocessors are run on heading content (not just RawHtmlPostprocessor).
Footnote references are stripped from heading content. Fixes Footnote ref number in TOC #660.
A more robust striptags is provided to convert headings to plain text. Unlike, markupsafe's implementation, HTML entities are not unescaped.
Both the plain text name and rich html and unescaped raw data-toc-label are saved to toc_tokens, which means users can now access the full rich text content of the headings directly from the toc_tokens.
data-toc-label is sanitized separate from heading content.
A html.unescape call added to slugify and slugify_unicode, which ensures slugify operates on Unicode characters, rather than HTML entities. By including in the functions, users can override with their own slugify functions if they desire. A html.unescape call is made just prior to calling slugify so that slugify only operates on Unicode characters. Note that html.unescape is not run on the name or html.

Note that this first commit includes minimal changes to the tests to show very little change in behavior (mostly the new html attribute of the toc_tokens was added). A refactoring of the tests will be in a separate commit.

- All postprocessors are run on heading content (not just `RawHtmlPostprocessor`). - Footnote references are stripped from heading content. Fixes Python-Markdown#660. - A more robust `striptags` is provided to convert headings to plain text. Unlike, markupsafe's implementation, HTML entities are not unescaped. - Both the plain text `name` and rich `html` are saved to `toc_tokens`, which means users can now access the full rich text content of the headings directly from the `toc_tokens`. - `data-toc-label` is sanitized separate from heading content. - A `html.unescape` call added to `slugify` and `slugify_unicode`, which ensures `slugify` operates on Unicode characters, rather than HTML entities. By including in the functions, users can override with their own slugify functions if they desire. Note that this first commit includes minimal changes to the tests to show very little change in behavior (mostly the new `html` attribute of the `toc_tokens` was added). A refactoring of the tests will be in a separate commit.

oprypin

Seems like a good direction to move in :)

I did a cursory check whether the change affects performance on a large MkDocs site, and seems it doesn't!

markdown/extensions/toc.py

waylan · 2024-02-09T18:57:57Z

In my initial commit, I deleted a couple public functions which are no longer needed. I probably should have left them in and marked them as deprecated instead. Do we need to do that? Are others calling them?

Also, it occurs to me that these changes could result in various slugs being different, which would break existing links out in the wild. True, none of the slugs in our own tests are changed, but by fully rendering the HTML before generating a slug could result in additional/different content from any third-party extensions that make use of postprocessors.

tests/test_extensions.py

vedranmiletic · 2024-02-12T10:03:13Z

FWIW, I installed this using

pip install git+https://github.com/Python-Markdown/markdown.git@refs/pull/1441/head

and tested it on my group's website page that had issues, angle brackets are removed correctly and the e-mails show up nicely. Not sure I can help with actual code review, but from a user's perspective, this fixes the issue.

@vedranmiletic

Thanks to @vedranmiletic for test case.

markdown/extensions/toc.py

oprypin

The change looks good. This will be nice :)

Co-authored-by: Oleh Prypin <[email protected]>

waylan · 2024-03-07T18:32:49Z

I finished updating the tests and documentation. I think this is ready to go.

waylan · 2024-03-12T16:06:50Z

@oprypin @pawamoy I wanted to check with you before I did a release of this. As this is a change in behavior that affects your projects (in that a bug was fixed which allowed markup to be passed through to toc_token['name'] in certain circumstance), I wanted to make sure you where ready for the release before I publish it.

Presumably at a minimum, MkDocs will need to do a release which includes mkdocs/mkdocs#3578 before I do my release. And I am assuming mkdostrings will want to do a release which takes advantage of mkdocs/mkdocs#3578 as well.

For the record, I intend to release this as version 3.6, so any earlier releases of your projects will likely want to be constrained to markdown<3.6 for any users who want markup in their toc token names.

pawamoy · 2024-03-12T16:43:13Z

Thanks for the heads up. I'm not ready for this. I'll start by adding an upper bound in mkdocstrings and publish a new release. I should be able to do that today, but if not, before the end of the week for sure 🙂 I'll report back once it's done!

oprypin · 2024-03-12T17:26:14Z

I think actually there's no direct interaction with MkDocs, the release can be made here first. Extracting the primary heading has a separate implementation in MkDocs and extracting secondary headings will just change for the better with this release but still nothing to be done in MkDocs

waylan · 2024-03-13T20:05:08Z

Well, there definitely is interaction with mkdocstrings. According to my own testing, the show_symbol_type_toc feature breaks with this change. Presumably, mkdocstings will need to retrieve the value from token['data-toc-label'] rather than token['name'] as it does now. Therefore, at a minimum, I'll wait for an indication that mkdocstrings is ready before doing a release.

pawamoy · 2024-03-13T22:03:24Z

I've published a new version of mkdocstrings-python with an upper bound on Python-Markdown 3.6, you can release it :)

pawamoy · 2024-03-25T19:10:46Z

Hello again! Today someone reported that we should support version 3.6, so I made time to work on this. And it turns out I can do this very simple thing:

class _TocLabelsTreeProcessor(Treeprocessor):
    def run(self, root: Element) -> None:  # noqa: ARG002
        self._override_toc_labels(self.md.toc_tokens)  # type: ignore[attr-defined]

    def _override_toc_labels(self, tokens: list) -> None:
        for token in tokens:
            if token["name"] != token["data-toc-label"]:
                token["name"] = token["data-toc-label"]
                self._override_toc_labels(token["children"])

...and register this tree processor in our outer MkdocstringsExtension:

        md.treeprocessors.register(
            _TocLabelsTreeProcessor(md),
            "mkdocstrings_post_toc_labels",
            priority=4.5,  # Right after 'toc'.
        )

I suppose it's a bit brutal and instead of blindly overwriting names with toc-labels, I should add a way to detect when it comes from mkdocstrings and overwrite only those? For example, if token["name"] != token["data-toc-label"] and token["data-toc-label"].startswith("<code")?

Otherwise it means I'm just removing the additional safety that was brought with this PR, right? Is this really bad or is there no practical impact 🤔?

waylan · 2024-03-26T13:05:27Z

The primary concern for MkDocs is that the page title should not ever contain any markup (to avoid markup in <title> etc). As the page title comes from token['name'], then by extension, token['name'] should never contain any markup... but that only really matters for the top <h1> in a page. That said, in Markdown's own documentation, mkdocstrings does set a value to data-toc-label for the top <h1> on each page it generates.

So, yes, to avoid breaking themes (by causing them to generate invalid HTML), token['name'] should never contain any markup. The correct fix is to pull the value from token['data-toc-label'] (and why I added it). However, that would require an edit to theme templates when building the TOC. And that is something theme devs need to do, unless you can provide a way to override that block of a theme. IIRC, there is not a straightforward way for an extension to do that (it interferes with a user's overrides).

So, in the end, we have a chicken and egg problem. Theme's need to support token['data-toc-label'] directly when building a page's toc, but they are not likely to do that unless users and/or extensions are assigning values to token['data-toc-label'] which contain markup.

waylan · 2024-03-26T13:12:49Z

Two other factors come into play,

For a theme to access token['data-toc-label'], then MkDocs needs to pass that on. I don't think it does (needs confirmation).
In Full-featured text extraction from the first H1 heading mkdocs/mkdocs#3578, the same sanitation code is used by MkDocs to sanitize page titles. Therefore, assigning markup to `token['name'] may not be a problem after all as MkDocs will sanitize it before setting the value to the page title.

In other words, at this time, your suggested solution may actually be the only workable one.

pawamoy · 2024-03-26T15:14:36Z

Thank you!

I'm not sure to understand why there is a data-toc-label attribute if it is only use to override the item name which is otherwise derived from the heading's text/id 🤔 I thought data-toc-label had exactly this use of providing an alternate label for the table of contents (label as, displayable text that doesn't affect anything else). But from your answer I gather that themes actually never used this attribute and always relied on the name itself, and it only seemed to work as I expected it to work because the toc-label was not sanitized properly. Now that it is, it unveiled the true inner working of themes, which is to use the name directly. Makes sense!

Anyway, I'll go with the solution above then, thanks! 🙂

waylan · 2024-03-26T17:22:03Z

Prior to this present change, the assumption was that both name and data-toc-label never contained markup. It was a weird anomaly that this ever worked for you. Therefore, with this change only sanitized content of data-toc-label gets copied to name (no markup is preserved as name still never has markup). The thinking is that if you want the raw content of data-toc-label then you need to access it directly from data-toc-label. In other words, with this change, we are officially supporting using markup in data-toc-label but to access that markup you can't get if from name; you have to get it from data-toc-label directly.

waylan mentioned this pull request Feb 9, 2024

Strip all HTML when getting the title from the first H1 tag mkdocs/mkdocs#3564

Merged

oprypin reviewed Feb 9, 2024

View reviewed changes

markdown/extensions/toc.py Show resolved Hide resolved

markdown/extensions/toc.py Show resolved Hide resolved

markdown/extensions/toc.py Show resolved Hide resolved

waylan requested review from facelessuser and mitya57 February 9, 2024 19:01

waylan added 2 commits February 9, 2024 15:03

lint cleanup

4ef4ad6

Restore and deprecate deleted functions

dbbd038

oprypin mentioned this pull request Feb 11, 2024

Header and navigation sidebars display incorrectly arbitrary markup mkdocs/mkdocs#3357

Closed

oprypin reviewed Feb 11, 2024

View reviewed changes

tests/test_extensions.py Outdated Show resolved Hide resolved

waylan added 5 commits February 12, 2024 09:03

unescape before calling slugify for compat w/ 3rd parties

a561789

Add test for email auto links

6cb1fd7

Thanks to @vedranmiletic for test case.

lint cleanup

5377d2b

refactor testTocWithAttrList for full coverage

caf4f2e

Add comprehensive strip_tag tests

d7bff9b

oprypin reviewed Feb 18, 2024

View reviewed changes

markdown/extensions/toc.py Outdated Show resolved Hide resolved

oprypin reviewed Feb 20, 2024

View reviewed changes

markdown/extensions/toc.py Show resolved Hide resolved

waylan added 3 commits February 22, 2024 09:37

Add a test case for double footnotes

1355826

lint cleanup

7e4117b

Add data-toc-label to toc_token

b4dbb0a

oprypin mentioned this pull request Feb 24, 2024

Full-featured text extraction from the first H1 heading mkdocs/mkdocs#3578

Merged

oprypin reviewed Feb 24, 2024

View reviewed changes

markdown/extensions/toc.py Outdated Show resolved Hide resolved

markdown/extensions/toc.py Outdated Show resolved Hide resolved

oprypin reviewed Feb 24, 2024

View reviewed changes

waylan and others added 6 commits March 6, 2024 15:48

Merge branch 'master' into toc

eabab9d

Apply suggestions from code review

a7f1554

Co-authored-by: Oleh Prypin <[email protected]>

Update all TOC tests to new framework

b1021cb

cleanup

9d5d813

update docs

c4a139f

cleanup

6a39af0

waylan marked this pull request as ready for review March 7, 2024 18:29

update copyright

e02941b

waylan merged commit e4ab4a6 into Python-Markdown:master Mar 8, 2024

waylan deleted the toc branch March 8, 2024 14:05

pawamoy mentioned this pull request Apr 2, 2024

bug: dependency ceiling markdown>=3.3,<3.6 acts as a timebomb mkdocstrings/python#143

Closed

Refactor TOC sanitation #1441

Refactor TOC sanitation #1441

Uh oh!

Conversation

waylan commented Feb 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

oprypin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

waylan commented Feb 9, 2024

Uh oh!

Uh oh!

vedranmiletic commented Feb 12, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

oprypin left a comment

Choose a reason for hiding this comment

Uh oh!

waylan commented Mar 7, 2024

Uh oh!

waylan commented Mar 12, 2024

Uh oh!

pawamoy commented Mar 12, 2024

Uh oh!

oprypin commented Mar 12, 2024

Uh oh!

waylan commented Mar 13, 2024

Uh oh!

pawamoy commented Mar 13, 2024

Uh oh!

pawamoy commented Mar 25, 2024

Uh oh!

waylan commented Mar 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

waylan commented Mar 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pawamoy commented Mar 26, 2024

Uh oh!

waylan commented Mar 26, 2024

Uh oh!

Uh oh!

waylan commented Feb 9, 2024 •

edited

Loading

waylan commented Mar 26, 2024 •

edited

Loading

waylan commented Mar 26, 2024 •

edited

Loading