
Support stream-based tokenization #14

Open · shonfeder opened this issue May 4, 2019 · 1 comment

Comments

@shonfeder (Owner)

For large inputs we want to be able to process one line at a time, so we don't have to read the entire input into memory.

@Anniepoo (Collaborator) commented May 4, 2019

Use phrase_from_file. This reads ahead only enough to perform the reduce.
If you don't leave choice points when you emit a token, it should neatly dispose of the consumed list.
It also works with lazy_list_location and friends for neatly reporting errors.
The only time it would consume a lot of memory is something like an unclosed string literal when strings may contain newlines, or trying to read the whole file as a list of lines; and a line-by-line mode breaks under the same conditions anyway.
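
For concreteness, here is a minimal sketch of this approach (an editor's illustration, not code from this thread): tokenize_file/2 and the toy tokens//1 grammar are hypothetical names, and the tokenizer merely splits on whitespace. phrase_from_file/2 comes from SWI-Prolog's library(pio), and the helper nonterminals from library(dcg/basics).

    :- use_module(library(pio)).        % phrase_from_file/2
    :- use_module(library(dcg/basics)). % blanks//0, string_without//2, eos//0

    % Hypothetical entry point: run the tokens//1 DCG over File.
    % phrase_from_file/2 applies the grammar to the file's contents as a
    % lazy list, reading incrementally rather than loading it all at once.
    tokenize_file(File, Tokens) :-
        phrase_from_file(tokens(Tokens), File).

    % Toy tokenizer: whitespace-separated words become atoms.
    % The cut in the first clause avoids leaving choice points behind,
    % which (per the advice above) lets consumed input be disposed of
    % as tokens are emitted. Backquotes denote a code list in SWI-Prolog.
    tokens([]) --> blanks, eos, !.
    tokens([T|Ts]) -->
        blanks,
        string_without(` \t\n`, Codes),
        { Codes \== [],
          atom_codes(T, Codes) },
        tokens(Ts).

A call like tokenize_file('big.txt', Tokens) then streams the file through the grammar, and lazy_list_location//1 from library(pure_input) can be threaded into the same grammar for error reporting.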

I would advocate against the line-based reading, though: it hard-codes the assumption that line breaks are token breaks into the tokenizer in a way you may come to regret. For example, if you attempt to tokenize SWI-Prolog code, you have to deal with multiline strings.
