parse stream? #10

Open
Laeeth opened this issue Dec 5, 2015 · 1 comment

Comments

Laeeth commented Dec 5, 2015

Hi Marco.

Small enhancement request. (Apologies if it's already implemented and I missed it.)

Quite often one wants to parse a stream of JSON documents (for example from Twitter or the Reddit comment dump). It would be nice to have that implemented as part of the library, so that it is very easy to use. I have written a small range to do this, but it's quite crude, and I haven't paid attention to efficiency. I can make a pull request if you would like (and you can refine it later), but you may prefer to implement it yourself. Let me know.

Here is some very simple code to process Reddit comments:
https://gist.github.com/Laeeth/bbd08dd576cb7aeff444

The original comments are here:
https://archive.org/details/2015_reddit_comments_corpus

On one core it takes 35 minutes to process one month's data (35 GB).
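
To illustrate the kind of range meant here (this is not the code from the gist, and it is Python rather than D), a minimal sketch follows. It assumes the input is newline-delimited JSON with one object per line. The name `iter_json_lines` and the sample records are made up for the example.

```python
import io
import json

def iter_json_lines(lines):
    """Yield one parsed JSON value per non-blank line of an iterable of text lines."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            # Report which line was malformed instead of failing without context.
            raise ValueError(f"line {line_number}: {err}") from err

# Hypothetical sample data standing in for a comment dump.
sample = io.StringIO('{"author": "a", "score": 3}\n\n{"author": "b", "score": 5}\n')
total = sum(comment["score"] for comment in iter_json_lines(sample))
print(total)
```

Because a file object yields its lines lazily, only one record is held in memory at a time, so the same loop would work on a multi-gigabyte dump opened with `open(path, encoding="utf-8")`.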

Thanks for getting in touch by email. That was about something else; I have had to figure out some other things first, but will respond shortly.

Laeeth.

mleise (Collaborator) commented Jun 8, 2016

Sorry for the late answer. The dilemma is that treating the entire JSON text as one memory block is fundamental to the code as it stands. Most data in JSON has no length limit (strings, numbers), so a lot of places would need to become aware of the sliding window that comes with streaming. I do see the use case with huge files, and think that maybe RapidJSON can shine here. I'll keep the report open anyway.
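
The window problem can be shown with a small Python sketch. It is not how this library or RapidJSON works. It assumes top-level values are separated by whitespace, and `iter_json_values` is a hypothetical name. A value that touches the end of the buffer may continue in the next chunk (a number split in two, for instance), so the sketch only accepts a value once it is followed by whitespace or the input has ended.

```python
import io
import json

def iter_json_values(stream, chunk_size=4096):
    """Yield whitespace-separated JSON values from a text stream read in chunks."""
    decoder = json.JSONDecoder()
    buf = ""
    eof = False
    while True:
        buf = buf.lstrip()  # raw_decode does not skip leading whitespace
        if buf:
            try:
                value, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                if eof:
                    raise  # no more input can complete the value
            else:
                # "12" at the end of the window might really be "12345",
                # so wait for a separator (assumed: whitespace) or end of input.
                followed_by_space = end < len(buf) and buf[end].isspace()
                if eof or followed_by_space:
                    yield value
                    buf = buf[end:]
                    continue
        elif eof:
            return
        chunk = stream.read(chunk_size)
        if chunk:
            buf += chunk  # the window grows until a whole value fits
        else:
            eof = True

# A deliberately tiny chunk size forces values to straddle chunk boundaries.
stream = io.StringIO('{"a": 1} [1, 2, 3] "long string" 12345')
values = list(iter_json_values(stream, chunk_size=4))
print(values)
```

This sidesteps the dilemma rather than solving it: the buffer grows until the largest single value fits and is re-parsed from its start after every read, so one huge string still ends up entirely in memory. A parser that is truly window-aware would have to resume in the middle of a value.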
