Add compiler plugin for python-markdown #1

Merged
merged 13 commits into from
Jul 6, 2024


shaokeyibb

An extension for python-markdown that calculates document fragment offsets and writes them to attributes of the output HTML elements.

For example, the input markdown:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent vel nulla ac diam dignissim congue ut sed ligula. Pellentesque aliquet ante sit amet risus iaculis, eget tincidunt nibh volutpat. Etiam non pulvinar enim. Mauris viverra augue urna, non aliquam ligula sodales in. Duis mattis ligula pretium dui bibendum, nec tincidunt neque placerat. Pellentesque eu est malesuada, dictum nulla quis, facilisis lectus. Fusce tempor mi ac tellus dictum porta. Cras venenatis pulvinar turpis. Suspendisse consequat nulla suscipit sagittis pretium.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin sed lacus vitae neque vestibulum porttitor id et urna. Quisque nisl nisi, fermentum at justo quis, varius aliquet lorem. Ut fringilla vel purus et fermentum. Mauris ac lacinia nisi, sed ultricies dolor. Nunc ut augue quis eros iaculis tempor vel eu erat. Vestibulum efficitur porta justo. Fusce cursus magna dui, eget posuere neque tristique id. Suspendisse varius mauris arcu, nec congue metus efficitur in. Etiam ac pretium justo. Proin non ante faucibus, mattis mi et, consectetur sapien. Proin feugiat commodo euismod.

Morbi neque lectus, faucibus a mattis at, aliquam quis est. Maecenas sed luctus elit. Nam vel consequat magna, ac dictum velit. Quisque non cursus enim, at ullamcorper massa. Integer quam mauris, scelerisque eu luctus et, facilisis nec ante. Proin feugiat vehicula felis at ornare. Maecenas est risus, tempus sit amet fermentum vel, sagittis in tellus. Integer ultrices velit at nulla tincidunt cursus. Curabitur non nunc in erat imperdiet imperdiet id sed felis. Quisque euismod velit a mi pellentesque, sit amet molestie eros dignissim. Morbi tincidunt dui vitae orci viverra, vitae gravida sapien semper. Pellentesque viverra a turpis blandit ornare. Quisque tincidunt quam a est facilisis, a fringilla augue sollicitudin. Pellentesque et eros sed arcu placerat sollicitudin. Donec diam eros, auctor non risus eu, interdum interdum mi.

The output is:

<p data-original-document-end="544" data-original-document-start="0">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent vel nulla ac diam dignissim congue ut sed ligula. Pellentesque aliquet ante sit amet risus iaculis, eget tincidunt nibh volutpat. Etiam non pulvinar enim. Mauris viverra augue urna, non aliquam ligula sodales in. Duis mattis ligula pretium dui bibendum, nec tincidunt neque placerat. Pellentesque eu est malesuada, dictum nulla quis, facilisis lectus. Fusce tempor mi ac tellus dictum porta. Cras venenatis pulvinar turpis. Suspendisse consequat nulla suscipit sagittis pretium.</p>
<p data-original-document-end="1131" data-original-document-start="546">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin sed lacus vitae neque vestibulum porttitor id et urna. Quisque nisl nisi, fermentum at justo quis, varius aliquet lorem. Ut fringilla vel purus et fermentum. Mauris ac lacinia nisi, sed ultricies dolor. Nunc ut augue quis eros iaculis tempor vel eu erat. Vestibulum efficitur porta justo. Fusce cursus magna dui, eget posuere neque tristique id. Suspendisse varius mauris arcu, nec congue metus efficitur in. Etiam ac pretium justo. Proin non ante faucibus, mattis mi et, consectetur sapien. Proin feugiat commodo euismod.</p>
<p data-original-document-end="1971" data-original-document-start="1133">Morbi neque lectus, faucibus a mattis at, aliquam quis est. Maecenas sed luctus elit. Nam vel consequat magna, ac dictum velit. Quisque non cursus enim, at ullamcorper massa. Integer quam mauris, scelerisque eu luctus et, facilisis nec ante. Proin feugiat vehicula felis at ornare. Maecenas est risus, tempus sit amet fermentum vel, sagittis in tellus. Integer ultrices velit at nulla tincidunt cursus. Curabitur non nunc in erat imperdiet imperdiet id sed felis. Quisque euismod velit a mi pellentesque, sit amet molestie eros dignissim. Morbi tincidunt dui vitae orci viverra, vitae gravida sapien semper. Pellentesque viverra a turpis blandit ornare. Quisque tincidunt quam a est facilisis, a fringilla augue sollicitudin. Pellentesque et eros sed arcu placerat sollicitudin. Donec diam eros, auctor non risus eu, interdum interdum mi.</p>     

The data-original-document-start and data-original-document-end attributes are injected for further usage.
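
To make the numbers above easier to follow, here is a minimal sketch of the offset calculation for plain paragraphs. It assumes blocks are separated by blank lines and that offsets are character indices with an exclusive end, which is consistent with 0/544 and 546/1131 in the output above. The names `paragraph_offsets` and `annotate` are hypothetical and are not taken from the plugin; the real extension works inside python-markdown rather than on the raw string.

```python
import re
from html import escape


def paragraph_offsets(source: str) -> list[tuple[int, int]]:
    """Return (start, end) character offsets of each blank-line-separated block.

    `end` is exclusive, so source[start:end] is the block text.
    """
    # One or more non-empty lines joined by single newlines form a block.
    return [(m.start(), m.end()) for m in re.finditer(r"[^\n]+(?:\n[^\n]+)*", source)]


def annotate(source: str) -> str:
    """Render each block as a <p> carrying its source offsets (illustration only)."""
    parts = []
    for start, end in paragraph_offsets(source):
        text = escape(source[start:end], quote=False)
        parts.append(
            f'<p data-original-document-end="{end}" '
            f'data-original-document-start="{start}">{text}</p>'
        )
    return "\n".join(parts)
```

With three paragraphs of 544, 585 and 838 characters separated by blank lines, this sketch yields the same start/end pairs as the attributes shown in the example output.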


@Enter-tainer Enter-tainer left a comment

LGTM. Later we can set up CI; as a first step we could add automated tests and a format check.

compiler-plugin/pyproject.toml (outdated, resolved)
compiler-plugin/test/__main__.py (outdated, resolved)
compiler-plugin/test/__main__.py (outdated, resolved)
@shaokeyibb

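Here is an edge case that currently fails the tests, for reference. When the Markdown does not follow the standard and leaves no blank line between a heading and the text below it, the output is wrong: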

# Lorem ipsum             
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin sed lacus vitae neque vestibulum porttitor id et urna.

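This makes python-markdown pass the heading and the paragraph to us as a single block, which interferes with our injection logic, so we get an incorrect result: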

<h1>Lorem ipsum</h1>
<p data-original-document-end="144" data-original-document-start="0">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin sed lacus vitae neque vestibulum porttitor id et urna.</p>

One possible solution is to let python-markdown's built-in block processor split this block first and then continue processing. Unfortunately, in HashHeaderProcessor (blockprocessors.py#L476):

            before = block[:m.start()]  # All lines before header
            after = block[m.end():]     # All lines after header
            if before:
                # As the header was not the first line of the block and the
                # lines before the header must be parsed first,
                # recursively parse this lines as a block.
                self.parser.parseBlocks(parent, [before])

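the two blocks produced by the split are rendered ahead of time. As a result, once we set our priority lower than this processor's (70), we can no longer get the header's block data.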

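We may need to think more about how to solve this. One option is to write another block processor that manually splits every block that was not separated, but that feels rather costly. If we can guarantee that all the Markdown is written to the standard, we can ignore this problem.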
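
As a rough illustration of what "manually splitting" such a block could look like, the sketch below walks a block line by line and separates ATX heading lines from the surrounding text while keeping document offsets. It is standalone Python, not a python-markdown block processor; `HEADING` is a simplified pattern (an assumption, not the regex python-markdown uses) and `split_heading_block` is a hypothetical helper name.

```python
import re

# Simplified ATX heading pattern; an assumption, not python-markdown's own regex.
HEADING = re.compile(r"^#{1,6}[ \t]+\S.*$")


def split_heading_block(block: str, base: int = 0) -> list[tuple[str, int, int]]:
    """Split one block into ("h" | "p", start, end) pieces with document offsets.

    `base` is the offset of the block's first character in the source document.
    """
    pieces = []
    pos = base
    para_start = None  # start offset of the paragraph being collected, if any
    para_end = None
    for line in block.split("\n"):
        start, end = pos, pos + len(line)
        if HEADING.match(line):
            # A heading closes any paragraph collected so far.
            if para_start is not None:
                pieces.append(("p", para_start, para_end))
                para_start = None
            pieces.append(("h", start, end))
        elif line.strip():
            if para_start is None:
                para_start = start
            para_end = end
        pos = end + 1  # account for the newline removed by split()
    if para_start is not None:
        pieces.append(("p", para_start, para_end))
    return pieces
```

For a block such as `"# Lorem ipsum\nLorem ipsum dolor."` this reports the heading and the paragraph as separate pieces with their own offsets, instead of one span covering the whole block.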

@shaokeyibb shaokeyibb marked this pull request as draft July 2, 2024 17:02
@Enter-tainer

Enter-tainer commented Jul 3, 2024

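https://github.com/OI-wiki/OI-wiki/blob/master/docs/ds/2-3-4-tree.md?plain=1 It looks like, at least for OI Wiki, the current bot does (does it?) insert a blank line between a heading and the content below it. I opened a test PR: OI-wiki/OI-wiki#5691

It looks like it does do this now.

But I am not sure whether other cases could trigger this problem.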

ci: update env

ci: refactor

ci: working dir

ci: install deps

ci: set working dir
@shaokeyibb shaokeyibb force-pushed the hikarilan/feat-compiler-plugin branch from ac746ce to dac6176 Compare July 3, 2024 09:15

@Enter-tainer Enter-tainer left a comment

Everything else LGTM.

@shaokeyibb

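(I have to admit this code is getting uglier and uglier.)
I spent the whole evening trying to solve this problem. The good news is that the old problem is solved; the bad news is that a new one has appeared.
The current approach uses a heuristic to match the original document as closely as possible. It works reasonably well, but one case cannot be handled yet: embedded HTML.
If HTML is embedded directly in the Markdown, the algorithm counts that HTML as part of the text block nearest to it. For example: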

本项目受 [CTF Wiki](https://ctf-wiki.org/) 的启发,在编写过程中参考了诸多资料,在此一并致谢。

<div align="center">
<a href="https://www.netlify.com/" target="_blank" style="margin-left: 60px;"><img style="height: 40px; " src="images/netlify.png"></a>
</div>

<script>
  // #758
  document.getElementsByClassName('md-nav__title')[1].click()
</script>

is treated as if it were all inside one element, so we get an incorrect result:

{
"tag": "p",
"offset": (780, 1101),  # FIXME: Correct one is (780, 1101)
}

I really cannot intervene here. I tried many approaches, but none of them seemed to work well.

@Enter-tainer

Attributing it to the paragraph above does not seem too harmful? BTW, this MR can probably be marked as ready now?

@Enter-tainer

Could you roughly describe how the current heuristic works? I have given up on understanding the logic here for now.

@shaokeyibb

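It takes the previously computed dictionary of line offsets and tries to line it up with the list of lines after modification.
If a line lines up directly, great: we only need to call the match function once for the identical line to record it again.
If something fails to line up along the way, it looks for the next line that can be matched against the new document, and then assigns everything in between to the place that had no match.

For example, if

# Title

Text

'''javascript
// code
'''

Text 2

is rendered as

# Title

Text

[placeholder]

Text 2

then the code block in the source document cannot be matched and is skipped for the moment, until Text 2 matches exactly. At that point, the part between Text and Text 2 is considered to belong to [placeholder].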
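
A minimal sketch of the matching idea described above, under stated assumptions: lines are compared by exact equality, blank lines are ignored, and only one unmatched rendered line is tracked at a time. The function name `align_offsets` and its return shape are hypothetical; this is not the plugin's actual code and it does not model the embedded-HTML case discussed earlier.

```python
from typing import Optional


def align_offsets(
    source: str, rendered_lines: list[str]
) -> list[tuple[str, Optional[int], Optional[int]]]:
    """Assign (start, end) source offsets to each non-blank rendered line."""
    # Offsets of every source line: (text, start, exclusive end).
    src = []
    pos = 0
    for line in source.split("\n"):
        src.append((line, pos, pos + len(line)))
        pos += len(line) + 1

    result = []
    cursor = 0       # next source line that has not been consumed yet
    pending = None   # index in `result` of a rendered line still waiting for a span

    def fill_pending(upto: int) -> None:
        # Give the skipped, non-blank source lines to the unmatched rendered line.
        skipped = [s for s in src[cursor:upto] if s[0].strip()]
        if pending is not None and skipped:
            result[pending][1] = skipped[0][1]
            result[pending][2] = skipped[-1][2]

    for line in rendered_lines:
        if not line.strip():
            continue
        # Look ahead for the next source line that matches exactly.
        hit = next((k for k in range(cursor, len(src)) if src[k][0] == line), None)
        if hit is None:
            # No match (e.g. a placeholder): remember it and keep going.
            result.append([line, None, None])
            pending = len(result) - 1
            continue
        fill_pending(hit)
        pending = None
        result.append([line, src[hit][1], src[hit][2]])
        cursor = hit + 1

    fill_pending(len(src))  # a trailing unmatched line gets whatever is left
    return [tuple(r) for r in result]


source = "# Title\n\nText\n\n'''javascript\n// code\n'''\n\nText 2"
rendered = ["# Title", "", "Text", "", "[placeholder]", "", "Text 2"]
spans = align_offsets(source, rendered)
```

In this example the placeholder ends up with the span of the three code-block lines, because they were skipped between the exact matches for Text and Text 2.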

@shaokeyibb shaokeyibb marked this pull request as ready for review July 5, 2024 15:37

@Enter-tainer Enter-tainer left a comment

Thank you!

@Enter-tainer Enter-tainer merged commit 6030593 into master Jul 6, 2024
2 checks passed
@Enter-tainer Enter-tainer deleted the hikarilan/feat-compiler-plugin branch July 6, 2024 02:08