From c4a139ff3183acae4e726f05064355268db099d1 Mon Sep 17 00:00:00 2001
From: Waylan Limberg
Date: Thu, 7 Mar 2024 13:24:21 -0500
Subject: [PATCH] update docs

---
 docs/changelog.md      | 19 +++++++++++++++++++
 docs/extensions/toc.md | 11 +++++++++++
 2 files changed, 30 insertions(+)

diff --git a/docs/changelog.md b/docs/changelog.md
index 8deaefd2..0d8c38df 100644
--- a/docs/changelog.md
+++ b/docs/changelog.md
@@ -10,6 +10,25 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [unreleased]
 
+### Changed
+
+#### Refactor TOC Sanitation
+
+* All postprocessors are run on heading content.
+* Footnote references are stripped from heading content. Fixes #660.
+* A more robust `striptags` is provided to convert headings to plain text.
+  Unlike markupsafe's implementation, HTML entities are not unescaped.
+* The plain text `name`, rich `html` and unescaped raw `data-toc-label` are
+  saved to `toc_tokens`, allowing users to access the full rich text content of
+  the headings directly from `toc_tokens`.
+* `data-toc-label` is sanitized separately from heading content.
+* An `html.unescape` call is made just prior to calling `slugify` so that
+  `slugify` only operates on Unicode characters. Note that `html.unescape` is
+  not run on the `name` or `html`.
+* The `get_name` and `stashedHTML2text` functions defined in the `toc` extension
+  are both **deprecated**. Instead, use some combination of `run_postprocessors`,
+  `render_inner_html` and `striptags`.
+
 ### Fixed
 
 * Include `scripts/*.py` in the generated source tarballs (#1430).
diff --git a/docs/extensions/toc.md b/docs/extensions/toc.md
index 1f80c7ea..d1c64a9d 100644
--- a/docs/extensions/toc.md
+++ b/docs/extensions/toc.md
@@ -80,6 +80,8 @@ the following object at `md.toc_tokens`:
         'level': 1,
         'id': 'header-1',
         'name': 'Header 1',
+        'html': 'Header 1',
+        'data-toc-label': '',
         'children': [
             {'level': 2, 'id': 'header-2', 'name': 'Header 2', 'children':[]}
         ]
@@ -91,6 +93,11 @@ Note that the `level` refers to the `hn` level. In other words, `<h1>` is level
 `1` and `<h2>` is level `2`, etc. Be aware that improperly nested levels in the
 input may result in odd nesting of the output.
 
+`name` is the sanitized value which would also be used as a label for the HTML
+version of the Table of Contents. `html` contains the fully rendered HTML
+content of the heading and has not been sanitized in any way. This may be used
+with your own custom sanitation to create a custom table of contents.
+
 ### Custom Labels
 
 In most cases, the text label in the Table of Contents should match the text of
@@ -131,6 +138,10 @@ attribute list to provide a cleaner URL when linking to the header. If the ID is
 not manually defined, it is always derived from the text of the header, never
 from the `data-toc-label` attribute.
 
+The value of the `data-toc-label` attribute is sanitized and stripped of any HTML
+tags. However, `toc_tokens` will contain the raw content under
+`data-toc-label`.
+
 Usage
 -----
 