Add option for processing data in batches #50

Open
fl034 opened this issue Jan 27, 2020 · 1 comment

fl034 commented Jan 27, 2020

Hey there again,
your library does a great job importing data line by line to keep memory usage low.
But it seems there is no way to process the imported lines in batches. Correct me if I overlooked this feature.

In my case, my CSV file is huge so adding all the data to an array exceeds the memory limit of some devices.

It would be good to have an option to set a batch size.
Once the importer has imported that many elements, it would fire a callback in which I can process the batch (e.g. save it to a database) and free up the memory before the importer continues.

Also, there should be an onFinish callback that doesn't pass the whole array, because the array might not fit into memory.
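
To make the request concrete, here is a rough sketch of the behavior I have in mind. It is not CSVImporter code: `processInBatches`, `onBatch`, and the payload-free `onFinish` are hypothetical names, and a plain sequence stands in for the lines read from the CSV file.

```swift
/// Hypothetical helper, not part of CSVImporter: feeds `lines` to `onBatch`
/// in chunks of at most `batchSize`, then calls `onFinish` with no payload.
func processInBatches<S: Sequence>(
    _ lines: S,
    batchSize: Int,
    onBatch: ([S.Element]) -> Void,
    onFinish: () -> Void
) {
    precondition(batchSize > 0, "batchSize must be positive")
    var batch: [S.Element] = []
    batch.reserveCapacity(batchSize)

    for line in lines {
        batch.append(line)
        if batch.count == batchSize {
            onBatch(batch)
            // Drop the processed elements so memory stays bounded by batchSize.
            batch.removeAll(keepingCapacity: true)
        }
    }

    // Flush the final, possibly smaller, batch.
    if !batch.isEmpty {
        onBatch(batch)
    }
    onFinish()
}

// Usage: seven "lines" with a batch size of three.
var batchSizes: [Int] = []
var finished = false
processInBatches(
    1...7,
    batchSize: 3,
    onBatch: { batchSizes.append($0.count) },
    onFinish: { finished = true }
)
```

With this shape, at most `batchSize` elements are held at any time, and the last batch may be smaller than the others.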

Jeehut (Member) commented Jan 30, 2020

@fl034 Thank you for reporting this issue you've had with CSVImporter. I think both of the suggestions make total sense and should be added to this library.

Please note that you can already simply skip the onFinish call, so the second part is more or less implemented. But we should definitely document it with an example.

For the first part to work properly, I think we need to change the following:

  1. We need a new method that works similarly to startImportingRecords, named something like importInBatches, which receives a batch size and creates a new array after each batch.
  2. Something similar to shouldReportProgress could be implemented, named something like shouldReportBatch, which returns true exactly when the batch size or the end of the file is reached while the new importInBatches feature is in use.
  3. We need a new onNextBatch callback method, which could work similarly to onProgress or onFinish but provide only the next batch.
  4. Tests should be written to ensure this works as expected.
  5. We need to add this variant to the documentation in the README.
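
As a starting point for a PR, steps 1 to 3 could look roughly like the sketch below. Everything in it is an assumption rather than existing library code: `BatchImporterSketch` is a hypothetical stand-in that takes its lines from an in-memory array, runs synchronously, and splits naively on commas, whereas the real importer reads a file on a background queue. Callbacks are therefore registered before the import starts here.

```swift
import Foundation

/// Hypothetical stand-in for the importer, for illustration only.
final class BatchImporterSketch<T> {
    private let lines: [String]
    private var batchSize = 1
    private var nextBatchClosure: (([T]) -> Void)?
    private var finishClosure: (() -> Void)?

    init(lines: [String]) {
        self.lines = lines
    }

    /// Step 3: callback that receives only the next batch.
    @discardableResult
    func onNextBatch(_ closure: @escaping ([T]) -> Void) -> Self {
        nextBatchClosure = closure
        return self
    }

    /// Finish callback that deliberately passes no records.
    @discardableResult
    func onFinish(_ closure: @escaping () -> Void) -> Self {
        finishClosure = closure
        return self
    }

    /// Step 2: true exactly when a batch is full, or when the input has
    /// ended with records still pending.
    private func shouldReportBatch(pendingCount: Int, isAtEnd: Bool) -> Bool {
        return pendingCount == batchSize || (isAtEnd && pendingCount > 0)
    }

    /// Step 1: maps each line like startImportingRecords would, but starts
    /// a new array after each reported batch.
    func importInBatches(batchSize: Int, mapper: ([String]) -> T) {
        precondition(batchSize > 0, "batchSize must be positive")
        self.batchSize = batchSize
        var batch: [T] = []

        for (index, line) in lines.enumerated() {
            // Naive comma split; real CSV parsing must handle quoting.
            let values = line.components(separatedBy: ",")
            batch.append(mapper(values))

            let isAtEnd = index == lines.count - 1
            if shouldReportBatch(pendingCount: batch.count, isAtEnd: isAtEnd) {
                nextBatchClosure?(batch)
                batch = []  // new array for the next batch
            }
        }
        finishClosure?()
    }
}

// Usage: three lines, batch size two, keeping only the first column.
var savedBatches: [[String]] = []
var didFinish = false

BatchImporterSketch<String>(lines: ["a,1", "b,2", "c,3"])
    .onNextBatch { batch in savedBatches.append(batch) }
    .onFinish { didFinish = true }
    .importInBatches(batchSize: 2) { values in values[0] }
```

In this example `savedBatches` ends up as `[["a", "b"], ["c"]]`: one full batch, then the remainder reported at the end of the input. The same two cases are what the tests in step 4 would need to cover, along with an empty file.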

Unfortunately, I won't have time to implement this myself any time soon, but I hope I have laid out the approach well enough that it shouldn't be too hard for someone who needs it to implement. I'd love to review a PR that implements this. :)
