From c4a139ff3183acae4e726f05064355268db099d1 Mon Sep 17 00:00:00 2001
From: Waylan Limberg <waylan.limberg@icloud.com>
Date: Thu, 7 Mar 2024 13:24:21 -0500
Subject: [PATCH] update docs

---
 docs/changelog.md      | 19 +++++++++++++++++++
 docs/extensions/toc.md | 11 +++++++++++
 2 files changed, 30 insertions(+)
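
Reviewer note (not part of the diff): the sketch below illustrates how the new
`html` and `data-toc-label` keys documented in this patch might be consumed to
build a custom plain-text outline. It uses only the Python standard library.
`strip_tags` and `outline` are hypothetical helpers written for this note, not
the `striptags` function shipped by the `toc` extension, and the token values
are hand-written sample data modeled on the example in `docs/extensions/toc.md`.
Like the documented behavior, the tag stripper leaves HTML entities escaped.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text nodes and keep entities escaped (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False stops the parser from unescaping entities.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_tags(html_text):
    """Drop tags from html_text; entities such as &amp; are left as-is."""
    parser = _TextOnly()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


def outline(tokens, depth=0):
    """Build an indented plain-text outline from a toc_tokens-like list."""
    lines = []
    for token in tokens:
        # Prefer a custom label when present; otherwise sanitize the rich
        # `html` ourselves, falling back to `name` if `html` is absent.
        label = token.get("data-toc-label") or strip_tags(
            token.get("html", token["name"])
        )
        lines.append("  " * depth + f"{label} (#{token['id']})")
        lines.extend(outline(token["children"], depth + 1))
    return lines


# Hand-written sample data shaped like the documented `md.toc_tokens`.
toc_tokens = [
    {
        "level": 1,
        "id": "header-1",
        "name": "Header 1 &amp; more",
        "html": "Header <em>1</em> &amp; more",
        "data-toc-label": "",
        "children": [
            {"level": 2, "id": "header-2", "name": "Header 2", "children": []},
        ],
    }
]

for line in outline(toc_tokens):
    print(line)
```

Because the raw `data-toc-label` value is unsanitized, a consumer that emits
HTML rather than plain text would need to escape it before use.
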
diff --git a/docs/changelog.md b/docs/changelog.md
index 8deaefd2..0d8c38df 100644
--- a/docs/changelog.md
+++ b/docs/changelog.md
@@ -10,6 +10,25 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [unreleased]
 
+### Changed
+
+#### Refactor TOC Sanitation
+
+* All postprocessors are run on heading content.
+* Footnote references are stripped from heading content. Fixes #660.
+* A more robust `striptags` is provided to convert headings to plain text.
+  Unlike markupsafe's implementation, HTML entities are not unescaped.
+* The plain text `name`, rich `html` and unescaped raw `data-toc-label` are
+  saved to `toc_tokens`, allowing users to access the full rich text content of
+  the headings directly from `toc_tokens`.
+* `data-toc-label` is sanitized separately from heading content.
+* An `html.unescape` call is made just prior to calling `slugify` so that
+  `slugify` only operates on Unicode characters. Note that `html.unescape` is
+  not run on the `name` or `html`.
+* The `get_name` and `stashedHTML2text` functions defined in the `toc` extension
+  are both **deprecated**. Instead, use some combination of `run_postprocessors`,
+  `render_inner_html` and `striptags`.
+
 ### Fixed
 
 * Include `scripts/*.py` in the generated source tarballs (#1430).
diff --git a/docs/extensions/toc.md b/docs/extensions/toc.md
index 1f80c7ea..d1c64a9d 100644
--- a/docs/extensions/toc.md
+++ b/docs/extensions/toc.md
@@ -80,6 +80,8 @@ the following object at `md.toc_tokens`:
         'level': 1,
         'id': 'header-1',
         'name': 'Header 1',
+        'html': 'Header 1',
+        'data-toc-label': '',
         'children': [
             {'level': 2, 'id': 'header-2', 'name': 'Header 2', 'children':[]}
         ]
@@ -91,6 +93,11 @@ Note that the `level` refers to the `hn` level. In other words, `<h1>` is level
 `1` and `<h2>` is level `2`, etc. Be aware that improperly nested levels in the
 input may result in odd nesting of the output.
 
+`name` is the sanitized value which would also be used as a label for the HTML
+version of the Table of Contents. `html` contains the fully rendered HTML
+content of the heading and has not been sanitized in any way. This may be used
+with your own custom sanitation to create a custom table of contents.
+
 ### Custom Labels
 
 In most cases, the text label in the Table of Contents should match the text of
@@ -131,6 +138,10 @@ attribute list to provide a cleaner URL when linking to the header. If the ID is
 not manually defined, it is always derived from the text of the header, never
 from the `data-toc-label` attribute.
 
+The value of the `data-toc-label` attribute is sanitized and stripped of any HTML
+tags. However, `toc_tokens` will contain the raw content under
+`data-toc-label`.
+
 Usage
 -----