Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

automatically fill the id attributes of the headings #267

Closed
wants to merge 18 commits into from

Conversation

tatchi
Copy link
Collaborator

@tatchi tatchi commented May 3, 2022

Fixes: #251

This is my attempt at implementing #251

I didn't add an option to enable/disable that feature. Let me know if I should.

Copy link
Collaborator

@shonfeder shonfeder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

I've just one question, but it generally looks solid to me. 🎉

tests/blackbox/heading-id.t Outdated Show resolved Hide resolved
Copy link
Collaborator

@shonfeder shonfeder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meant to leave review status as "Comment" :)

@shonfeder
Copy link
Collaborator

Ah, I just looked at the failures in CI. There are two issues:

  1. To maintain compatibility with OCaml 4.04.2, we'll either need to avoid using the Fun module and write our implementations, or pull in https://github.com/thierry-martinez/stdcompat (or an equivalent, if available). I'm quite happy with doing the latter.
  2. The failures on recent version of OCaml show why this actually does have to be configurable: it breaks all of our specification compliance tests :). I suggest we make this configurable by CLI, by default to having it enabled.

How does that sound?

@tatchi tatchi force-pushed the auto-id-for-headings branch 2 times, most recently from d01fbf0 to eec7388 Compare May 16, 2022 19:01
@tatchi
Copy link
Collaborator Author

tatchi commented May 17, 2022

  1. To maintain compatibility with OCaml 4.04.2, we'll either need to avoid using the Fun module and write our implementations, or pull in https://github.com/thierry-martinez/stdcompat (or an equivalent, if available). I'm quite happy with doing the latter.

I just inlined the function as it was straightforward to do: 0570385

  1. The failures on recent version of OCaml show why this actually does have to be configurable: it breaks all of our specification compliance tests :). I suggest we make this configurable by CLI, by default to having it enabled.

I added an --auto-identifiers option (inspired by pandoc) in cb088d1. Since it's enabled by default it still break the tests. Should I promote them or disable that option when running the tests?

As for the identifiers, there's actually some rules described in the doc on how to derive them:

The default algorithm used to derive the identifier from the heading text is:

  • Remove all formatting, links, etc.
  • Remove all footnotes.
  • Remove all non-alphanumeric characters, except underscores, hyphens, and periods.
  • Replace all spaces and newlines with hyphens.
  • Convert all alphabetic characters to lowercase.
  • Remove everything up to the first letter (identifiers may not begin with a number or punctuation mark).
  • If nothing is left after this, use the identifier section.

I had to switch to an implementation that supports Unicode characters in eec7388 (using the Uucp and Uutf modules). It's not yet perfect as some of the test results are not correct. I'll continue to try to improve that :)

@tatchi
Copy link
Collaborator Author

tatchi commented May 18, 2022

I improved the algorithm slightly; it should be ok for a second review.

There are still some differences compared to the result I get on https://pandoc.org/try/, but pandoc can only generate id's when using markdown and not commonmark. So the outputted HTML content (from which the id is derived) might be different in some cases, hence why the result might differ.

The only thing missing I think is:

several headings have the same text; in this case, the first will get an identifier as described above; the second will get the same identifier with -1 appended; the third with -2; and so on.

So for instance in pandoc, the following

### hello
### hello

becomes:

<h3 id="hello">hello</h3>
<h3 id="hello-1">hello</h3>

I guess it means that we'll need to keep a state somewhere with all the ids we've already generated. Not sure how to best implement that and if that's something we want. Would love to have your input on that :)

We'll also have to decide how to handle "empty" id. In pandoc:

If nothing is left after this, use the identifier section.

So

### 33
### 22
### 

becomes

<h3 id="section">33</h3>
<h3 id="section-1">22</h3>
<h3 id="section-2"></h3>

@tmattio
Copy link

tmattio commented Jul 25, 2022

cc @koonwen, this would probably fix ocaml/ocaml.org#523!

@shonfeder
Copy link
Collaborator

shonfeder commented Aug 2, 2022

We'll also have to decide how to handle "empty" id. In pandoc:

I'm fine with following pandoc here.

Copy link
Collaborator

@shonfeder shonfeder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! I finally was able to address your outstanding questions. Sorry again for the long delay here!

@@ -128,9 +178,13 @@ and inline = function
| Image (attr, { label; destination; title }) ->
img label destination title attr

let rec block = function
let rec block ~auto_identifiers = function
Copy link
Collaborator

@shonfeder shonfeder Aug 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it means that we'll need to keep a state somewhere with all the ids we've already generated. Not sure how to best implement that and if that's something we want. Would love to have your input on that :)

Perhaps we can have the block function take a record for this argument that can carry some configuration/state? For the identifier numerical suffix, it could be a int StringMap.t. Of course, we could also use a mutable hash map for this, but would be nice to avoid if feasible. WDYT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, I added an int StringMap.t to keep track of the numerical suffix. Let me know what you think :)

src/omd.ml Outdated Show resolved Hide resolved
@tatchi tatchi force-pushed the auto-id-for-headings branch 3 times, most recently from 8ad77e1 to 5a2d985 Compare August 27, 2022 13:37
Thanks to the index and accumulator arguments given by
`Uutf.String.fold_utf_8`, we can avoid the need for a sentinel "start"
the need for repeated checks after we've found the suffix, and the need
to allocate a new string.
Copy link
Collaborator

@shonfeder shonfeder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few followup suggestions. I think we're just about there!

tests/blackbox/heading-id.t Show resolved Hide resolved
src/html.ml Outdated Show resolved Hide resolved
src/html.ml Outdated Show resolved Hide resolved
src/html.ml Show resolved Hide resolved
src/html.ml Outdated
Comment on lines 265 to 271
let id = slugify (to_plain_text text) in
(* Default identifier if empty. It matches what pandoc does. *)
let id = if id = "" then "section" else id in
let count, identifiers = Identifiers.touch id identifiers in
let id =
if count = 0 then id else Printf.sprintf "%s-%i" id count
in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this might better belong in the slugify function. WDYT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any strong opinion on that one. I made the changes in 820f310 (#267)

tatchi and others added 4 commits November 3, 2022 22:46
Similar to the approach with `drop_while`, we can use the accumulator
argument of the fold to track whether or not we've seen white space,
which allows us to clean up the logic a bit, IMO. However, unlike that
case, the need to construct a new string still requires some side
effects.
@tatchi tatchi requested a review from shonfeder November 4, 2022 17:53
Copy link
Collaborator

@shonfeder shonfeder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks so much, and sorry for the sluggish review loops!

bin/main.ml Outdated Show resolved Hide resolved
@shonfeder
Copy link
Collaborator

Closing this in favor of #294. I opened up the new branch purely for the purposes of making merge conflict resolution easier.

Thanks so much, @tatchi!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

(Option to) Automatically fill the id attributes of the headings
3 participants