Refactor TOC sanitation #1441
- All postprocessors are run on heading content (not just `RawHtmlPostprocessor`).
- Footnote references are stripped from heading content. Fixes Python-Markdown#660.
- A more robust `striptags` is provided to convert headings to plain text. Unlike markupsafe's implementation, HTML entities are not unescaped.
- Both the plain text `name` and rich `html` are saved to `toc_tokens`, which means users can now access the full rich text content of the headings directly from the `toc_tokens`.
- `data-toc-label` is sanitized separately from heading content.
- An `html.unescape` call is added to `slugify` and `slugify_unicode`, which ensures `slugify` operates on Unicode characters rather than HTML entities. By including it in the functions, users can override with their own slugify functions if they desire.

Note that this first commit includes minimal changes to the tests to show very little change in behavior (mostly the new `html` attribute of the `toc_tokens` was added). A refactoring of the tests will be in a separate commit.
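To make the difference between the plain text `name` and the rich `html` concrete, here is a minimal standard-library sketch. It is not the implementation from this PR: `strip_tags_keep_entities` and `make_token` are hypothetical names, and the token dict only mirrors the `name` and `html` keys described above.

```python
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collect text content while leaving entity references untouched."""

    def __init__(self):
        # convert_charrefs=False reports `&amp;` etc. as separate events
        # instead of silently converting them to characters.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # Re-emit named entities such as `&amp;` exactly as written.
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        # Re-emit numeric references such as `&#60;` exactly as written.
        self.parts.append(f"&#{name};")


def strip_tags_keep_entities(text):
    """Hypothetical helper: drop tags, keep entities, collapse whitespace."""
    parser = _TagStripper()
    parser.feed(text)
    parser.close()
    return " ".join("".join(parser.parts).split())


def make_token(heading_html):
    """Hypothetical token holding both forms of the heading content."""
    return {"name": strip_tags_keep_entities(heading_html), "html": heading_html}


token = make_token("<em>Foo</em> &amp; <code>bar&lt;T&gt;</code>")
```

With this sketch, `token["name"]` holds tag-free text in which `&amp;` and `&lt;` are still escaped, while `token["html"]` keeps the original markup for consumers that want the rich content.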
Seems like a good direction to move in :)
I did a cursory check whether the change affects performance on a large MkDocs site, and seems it doesn't!
In my initial commit, I deleted a couple of public functions which are no longer needed. I probably should have left them in and marked them as deprecated instead. Do we need to do that? Are others calling them?

Also, it occurs to me that these changes could result in various slugs being different, which would break existing links out in the wild. True, none of the slugs in our own tests have changed, but fully rendering the HTML before generating a slug could pull in additional or different content from any third-party extensions that make use of postprocessors.
FWIW, I installed this using
and tested it on my group's website page that had issues: angle brackets are removed correctly and the e-mails show up nicely. I'm not sure I can help with actual code review, but from a user's perspective, this fixes the issue.
The change looks good. This will be nice :)
I finished updating the tests and documentation. I think this is ready to go.
@oprypin @pawamoy I wanted to check with you before I did a release of this. As this is a change in behavior that affects your projects (in that a bug was fixed which allowed markup to be passed through to Presumably at a minimum, MkDocs will need to do a release which includes mkdocs/mkdocs#3578 before I do my release. And I am assuming mkdocstrings will want to do a release which takes advantage of mkdocs/mkdocs#3578 as well. For the record, I intend to release this as version 3.6, so any earlier releases of your projects will likely want to be constrained to
Thanks for the heads up. I'm not ready for this. I'll start by adding an upper bound in mkdocstrings and publish a new release. I should be able to do that today, but if not, before the end of the week for sure 🙂 I'll report back once it's done!
I think actually there's no direct interaction with MkDocs, the release can be made here first. Extracting the primary heading has a separate implementation in MkDocs and extracting secondary headings will just change for the better with this release but still nothing to be done in MkDocs |
Well, there definitely is interaction with mkdocstrings. According to my own testing, the show_symbol_type_toc feature breaks with this change. Presumably, mkdocstrings will need to retrieve the value from
I've published a new version of mkdocstrings-python with an upper bound on Python-Markdown 3.6, so you can release it :)
Hello again! Today someone reported that we should support version 3.6, so I made time to work on this. And it turns out I can do this very simple thing:

    from xml.etree.ElementTree import Element

    from markdown.treeprocessors import Treeprocessor


    class _TocLabelsTreeProcessor(Treeprocessor):
        def run(self, root: Element) -> None:  # noqa: ARG002
            # The tree itself is left alone; only the TOC tokens are patched.
            self._override_toc_labels(self.md.toc_tokens)  # type: ignore[attr-defined]

        def _override_toc_labels(self, tokens: list) -> None:
            # Walk the nested tokens and replace each name with its toc-label.
            for token in tokens:
                if token["name"] != token["data-toc-label"]:
                    token["name"] = token["data-toc-label"]
                self._override_toc_labels(token["children"])

...and register this tree processor in our outer MkdocstringsExtension:

    md.treeprocessors.register(
        _TocLabelsTreeProcessor(md),
        "mkdocstrings_post_toc_labels",
        priority=4.5,  # Right after 'toc'.
    )

I suppose it's a bit brutal, and instead of blindly overwriting names with toc-labels, I should add a way to detect when it comes from mkdocstrings and overwrite only those? For example, Otherwise it means I'm just removing the additional safety that was brought with this PR, right? Is this really bad or is there no practical impact 🤔?
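The effect of the recursive override can be tried on plain data without Markdown installed. This is a sketch, not mkdocstrings code: the sample tokens are made up, only the `name`, `data-toc-label`, and `children` keys come from the snippet above, and the guard for empty labels assumes that an unset label is stored as an empty string.

```python
def override_toc_labels(tokens):
    """Recursively replace each token's name with its non-empty toc-label."""
    for token in tokens:
        label = token.get("data-toc-label")
        # Assumption: an unset label is "" (or missing), so it is skipped
        # rather than used to blank out the name.
        if label and token["name"] != label:
            token["name"] = label
        override_toc_labels(token.get("children", []))


# Made-up tokens shaped like nested TOC entries.
tokens = [
    {
        "name": "mod",
        "data-toc-label": "<code>mod</code> module",
        "children": [
            {"name": "func", "data-toc-label": "", "children": []},
        ],
    },
]

override_toc_labels(tokens)
```

After the call, the top-level `name` contains markup again (`<code>mod</code> module`), which is exactly the safety trade-off raised in the question, while the child without a label keeps its sanitized `name`.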
The primary concern for MkDocs is that the page title should not ever contain any markup (to avoid markup in So, yes, to avoid breaking themes (by causing them to generate invalid HTML), So, in the end, we have a chicken-and-egg problem. Themes need to support
Two other factors come into play,
In other words, at this time, your suggested solution may actually be the only workable one.
Thank you! I'm not sure I understand why there is a data-toc-label attribute if it is only used to override the item name, which is otherwise derived from the heading's text/id 🤔 I thought data-toc-label had exactly this use of providing an alternate label for the table of contents (a label as in displayable text that doesn't affect anything else). But from your answer I gather that themes actually never used this attribute and always relied on the name itself, and it only seemed to work as I expected because the toc-label was not sanitized properly. Now that it is, it has unveiled the true inner workings of themes, which is to use the name directly. Makes sense! Anyway, I'll go with the solution above then, thanks! 🙂
Prior to this present change, the assumption was that both |
- All postprocessors are run on heading content (not just `RawHtmlPostprocessor`).
- Footnote references are stripped from heading content. Fixes Python-Markdown#660.
- A more robust `striptags` is provided to convert headings to plain text. Unlike markupsafe's implementation, HTML entities are not unescaped.
- The plain text `name`, the rich `html`, and the unescaped raw `data-toc-label` are saved to `toc_tokens`, which means users can now access the full rich text content of the headings directly from the `toc_tokens`.
- `data-toc-label` is sanitized separately from heading content.
- ~~A `html.unescape` call added to `slugify` and `slugify_unicode`, which ensures `slugify` operates on Unicode characters, rather than HTML entities. By including in the functions, users can override with their own slugify functions if they desire.~~
- An `html.unescape` call is made just prior to calling `slugify` so that `slugify` only operates on Unicode characters. Note that `html.unescape` is not run on the `name` or `html`.

Note that this first commit includes minimal changes to the tests to show very little change in behavior (mostly the new `html` attribute of the `toc_tokens` was added). A refactoring of the tests will be in a separate commit.
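The point of unescaping just before the `slugify` call can be illustrated with a small standard-library sketch. `simple_slugify` is a hypothetical stand-in, not the library's `slugify`, and the heading text is invented for the example.

```python
import html
import re
import unicodedata


def simple_slugify(value, separator="-"):
    """Hypothetical ASCII slugifier, used only for illustration."""
    # Decompose accented characters, then drop anything non-ASCII.
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    # Remove punctuation, then collapse whitespace and hyphens.
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    return re.sub(r"[\s-]+", separator, value)


# Plain text heading content in which entities are still escaped.
name = "Caf&eacute; &amp; Bar"

# Slugifying the escaped text leaks entity names into the slug.
slug_from_entities = simple_slugify(name)

# Unescaping at the call site lets the slugifier see real characters,
# while `name` itself is left untouched.
slug_from_unicode = simple_slugify(html.unescape(name))
```

Here `slug_from_entities` comes out as `cafeacute-amp-bar`, whereas `slug_from_unicode` is `cafe-bar`. Because the unescape happens at the call site, a user-supplied slugify function receives Unicode text without having to handle entities itself.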