
Improve all aspects of compute performance (save disk space cost) for pytorch datasets by pre-caching processed items. #76

Closed
wants to merge 16 commits into dev

Conversation

mmcdermott
Owner

This adds a new PytorchDataset class with the same interface as the old class, but it now pre-computes the outputs of the old "process-on-the-fly" logic and caches them to disk. For generative pre-training, as has been done with BERT and other NLP models, this caching pass should be run over a series of epochs rather than only one, since pre-training items are sampled stochastically; for fine-tuning task models, a single epoch suffices, as those items are normally not stochastic per item.

The resulting files stored to disk are fully padded, tensorized *.pt files written with torch.save, and the model can load these files much more efficiently and iterate through them with minimal memory and time cost. This will also make it much easier to pivot to a chunked solution when datasets grow too large to fit in memory.
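The pattern described above can be sketched roughly as follows. This is a minimal, hypothetical illustration of the cache-to-disk idea, not the PR's actual implementation; the `CachedDataset` class name and the fixed-length padding scheme are assumptions for the example.

```python
# Hypothetical sketch of pre-caching "process-on-the-fly" dataset items as
# fully padded tensors saved with torch.save (not the PR's actual code).
from pathlib import Path

import torch
from torch.utils.data import Dataset


class CachedDataset(Dataset):
    """Wraps an expensive on-the-fly dataset and caches padded items as *.pt files."""

    def __init__(self, base_dataset, cache_dir: str, max_len: int):
        self.base = base_dataset
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.max_len = max_len

    def _path(self, idx: int) -> Path:
        return self.cache_dir / f"{idx}.pt"

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        p = self._path(idx)
        if p.exists():
            # Fast path: the fully padded, tensorized item is loaded directly.
            return torch.load(p)
        # Slow path: run the expensive processing once, pad, and cache to disk.
        item = self.base[idx]
        padded = torch.zeros(self.max_len, dtype=item.dtype)
        n = min(item.shape[0], self.max_len)
        padded[:n] = item[:n]
        torch.save(padded, p)
        return padded
```

Because every cached item has the same shape, a DataLoader over such a dataset needs no custom collation or per-item padding work at iteration time.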

@mmcdermott
Owner Author

@juancq this is the PR I mentioned.


codecov bot commented Nov 10, 2023

Codecov Report

Attention: 19 lines in your changes are missing coverage. Please review.

Comparison is base (834c850) 86.04% compared to head (3a1821c) 86.30%.
Report is 13 commits behind head on dev.

Files Patch % Lines
EventStream/data/pytorch_dataset.py 91.44% 13 Missing ⚠️
EventStream/data/config.py 93.50% 5 Missing ⚠️
EventStream/transformer/config.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev      #76      +/-   ##
==========================================
+ Coverage   86.04%   86.30%   +0.25%     
==========================================
  Files          34       34              
  Lines        6401     6608     +207     
==========================================
+ Hits         5508     5703     +195     
- Misses        893      905      +12     


@juancq
Contributor

juancq commented Nov 13, 2023

I'll give it a try on my dataset.

@juancq
Contributor

juancq commented Nov 13, 2023

@mmcdermott I tested it on a subset of my dataset. This subset includes 100,000 patients. The file sizes of this subset are:

  1. subjects_df - 1.1MB
  2. events_df - 18MB
  3. dynamic_measurements_df - 38MB

The positive: this pull request solves the increased memory consumption issue.
The negative: the memory consumption with this pull request was about 32.5GB, with an epoch taking about 100 seconds to run. The pull request I submitted gives me memory consumption around 4.5GB, with an epoch taking about 95 seconds to run.

@mmcdermott
Owner Author

mmcdermott commented Nov 13, 2023 via email

@juancq
Contributor

juancq commented Nov 15, 2023

@mmcdermott The only change was replacing data/config.py and data/pytorch_dataset.py with the copies in this pull request.

I'll test again.

@juancq
Contributor

juancq commented Nov 27, 2023

@mmcdermott I ran pretraining a couple more times to test memory consumption again and how long each epoch takes (I'm ignoring the pre-caching process and focusing entirely on time per epoch). I selected a 100,000-patient cohort fairly similar to the one I mentioned above (minor differences in inclusion/exclusion criteria).

The precached solution in this pull request uses about 3x more memory (average of 12.8 GB vs 4.4 GB) and takes slightly longer per epoch (1 minute 26 seconds vs 1 minute 14 seconds).

I ran it multiple times, and subsequent times were faster during start-up, but the time per epoch did not change.
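For context, per-epoch time and peak memory of the kind reported above might be measured with a small harness like this. This is a hypothetical sketch, not the benchmark code actually used here; `profile_epoch` is an invented helper, and note that `ru_maxrss` is reported in KiB on Linux but bytes on macOS.

```python
# Hypothetical sketch of measuring per-epoch wall time and peak RSS
# (not the benchmarking code used in this thread).
import resource
import time


def profile_epoch(dataloader, step_fn):
    """Return (elapsed_seconds, peak_rss_gb) for one pass over the data."""
    start = time.perf_counter()
    for batch in dataloader:
        step_fn(batch)
    elapsed = time.perf_counter() - start
    # ru_maxrss is KiB on Linux (bytes on macOS); assume Linux here.
    peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2
    return elapsed, peak_gb
```

Measuring only the iteration loop, as above, matches the methodology described here of excluding the one-time pre-caching cost from the per-epoch comparison.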

@mmcdermott
Owner Author

Thanks for profiling this so extensively, @juancq . I'll dig deeper on my end; it's possible this approach is uniquely suitable for fine-tuning and not pre-training, though I can't imagine why that would be the case.

mmcdermott closed this Nov 27, 2023
@mmcdermott
Owner Author

@juancq quick question -- were your model runs being done on CPU or GPU? Trying to debug the slowdown you observed. Thanks!

@juancq
Contributor

juancq commented Dec 6, 2023

@mmcdermott the model was running on a GPU.
