WIP: Internal tokenizer #204

Merged: 74 commits merged into master from internal-tokenizer on Jun 12, 2021

Conversation

philss
Owner

@philss philss commented Apr 3, 2019

This is a work-in-progress HTML tokenizer.

It is related to #37.

@philss philss force-pushed the internal-tokenizer branch from 311664e to 18804dd Compare May 3, 2019 01:13
@@ -0,0 +1,63 @@
defmodule Floki.HTML.TreeConstruction do
Contributor

It's been a long road to get here. Congrats!

Now the real "fun" starts. 😄

Owner Author

@mischov thank you! 💜 😄
I hope I can have some fun with all this complexity 😅

@philss philss force-pushed the internal-tokenizer branch from d06386d to f611461 Compare August 29, 2019 01:10
@philss philss force-pushed the internal-tokenizer branch from 485bac5 to a28ab2a Compare October 7, 2019 03:40
@philss philss force-pushed the internal-tokenizer branch from a28ab2a to 837c2dd Compare January 6, 2020 00:09
@philss philss force-pushed the internal-tokenizer branch from 837c2dd to 2bda94b Compare March 14, 2020 20:54
@philss philss force-pushed the internal-tokenizer branch from 29aef07 to e7906cc Compare March 22, 2020 00:48
@philss philss force-pushed the internal-tokenizer branch from cf12a09 to 6a8006a Compare May 23, 2020 22:15
@philss philss marked this pull request as draft September 30, 2020 23:57
@philss philss force-pushed the internal-tokenizer branch 2 times, most recently from 1a10a4a to dfcb4b8 Compare October 8, 2020 22:26
@philss philss force-pushed the internal-tokenizer branch 2 times, most recently from c326149 to a0ed877 Compare January 21, 2021 22:59
@philss philss force-pushed the internal-tokenizer branch from d0490a3 to 3c80baf Compare February 16, 2021 03:25
@philss philss force-pushed the internal-tokenizer branch 2 times, most recently from fdb9523 to 5c874fd Compare March 29, 2021 22:23
philss added 25 commits June 12, 2021 00:48
It also changes Document.add_node/3 to accept only the parent node
ID instead of requiring the full parent data. This should slightly
improve performance and reduce memory usage.
It adds some of the state transitions for tree construction.
This algorithm is important because it returns the place (node) where
a new node must be added.
This is intended to improve performance.
This is a big change that lets us replace fetching HTML entities
from a JSON file parsed in memory with a lookup in a module.
This is similar to the approach used in the HTML Entities package.

This also fixes many small IO data building errors.
This feature was incomplete and can be added in the future.
This replaces the strategy of loading all the tests at once
through metaprogramming.

Now the tests load much faster and are easier to debug.
These files are based on html5lib-tests and ensure that we follow,
or are close to, the HTML specification.

Test files that were not passing were not added.
This will enable easier debugging.
Fixing the warnings makes clear which functions are public (re-entrant
tokenizer states) and also fixes typespecs.
Don't worry: it is going to appear in another PR.
This is preparation for merging the pending work.
@philss philss force-pushed the internal-tokenizer branch from 8aba3f4 to 47396b0 Compare June 12, 2021 03:49
@philss philss marked this pull request as ready for review June 12, 2021 04:24
@philss philss merged commit 042b3f6 into master Jun 12, 2021
@philss philss deleted the internal-tokenizer branch June 12, 2021 05:12