
Support stream-based tokenization #14

Open · shonfeder opened this issue May 4, 2019 · 1 comment

Comments

@shonfeder (Owner)

For large inputs we want to be able to process one line at a time, so we don't have to read the entire input into memory.

@Anniepoo (Collaborator) commented May 4, 2019

Use phrase_from_file. This reads ahead only enough to perform the reduce.
If you don't leave choice points when you emit a token, it should neatly dispose of the consumed list.
It also works with lazy_list_location and friends for neatly reporting errors.
The only time it would consume a lot of memory is something like an unclosed string literal when strings may contain newlines, or trying to read the whole file as a list of lines; and a line-by-line mode breaks under the same conditions anyway.
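
For concreteness, here is a minimal sketch of this approach (an editor's illustration, not code from this thread): tokenize_file/2 and the toy tokens//1 grammar are hypothetical names, and the tokenizer merely splits on whitespace. phrase_from_file/2 comes from SWI-Prolog's library(pio), and the helper nonterminals from library(dcg/basics).

    :- use_module(library(pio)).        % phrase_from_file/2
    :- use_module(library(dcg/basics)). % blanks//0, string_without//2, eos//0

    % Hypothetical entry point: run the tokens//1 DCG over File.
    % phrase_from_file/2 applies the grammar to the file's contents as a
    % lazy list, reading incrementally rather than loading it all at once.
    tokenize_file(File, Tokens) :-
        phrase_from_file(tokens(Tokens), File).

    % Toy tokenizer: whitespace-separated words become atoms.
    % The cut in the first clause avoids leaving choice points behind,
    % which (per the advice above) lets consumed input be disposed of
    % as tokens are emitted. Backquotes denote a code list in SWI-Prolog.
    tokens([]) --> blanks, eos, !.
    tokens([T|Ts]) -->
        blanks,
        string_without(` \t\n`, Codes),
        { Codes \== [],
          atom_codes(T, Codes) },
        tokens(Ts).

A call like tokenize_file('big.txt', Tokens) then streams the file through the grammar, and lazy_list_location//1 from library(pure_input) can be threaded into the same grammar for error reporting.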

I would advocate against the line-based reading, though: it hard-codes the assumption that line breaks are token breaks into the tokenizer in a way you may come to regret. For example, if you attempt to tokenize SWI-Prolog code, you have to deal with multiline strings.
