
Improve all aspects of compute performance (save disk space cost) for pytorch datasets by pre-caching processed items. #76

Closed
wants to merge 16 commits into dev

Conversation

mmcdermott
Owner

This adds a new PytorchDataset class with the same interface as the old class, but it now pre-computes the outputs of the old "process-on-the-fly" logic and caches them to disk. For generative pre-training, as has been done with BERT and other NLP models, this caching pass should be run over a series of epochs rather than only one, since pre-training items are sampled stochastically; for fine-tuning task models, a single epoch suffices, as those items are normally not stochastic per item.

The resulting files stored to disk are fully padded, tensorized *.pt files written with torch.save, and the model can load these files much more efficiently and iterate through them with minimal memory and time cost. This will also make it much easier to pivot to a chunked solution when datasets grow too large to fit in memory.
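The pattern described above can be sketched roughly as follows. This is a minimal, hypothetical illustration of the cache-to-disk idea, not the PR's actual implementation; the `CachedDataset` class name and the fixed-length padding scheme are assumptions for the example.

```python
# Hypothetical sketch of pre-caching "process-on-the-fly" dataset items as
# fully padded tensors saved with torch.save (not the PR's actual code).
from pathlib import Path

import torch
from torch.utils.data import Dataset


class CachedDataset(Dataset):
    """Wraps an expensive on-the-fly dataset and caches padded items as *.pt files."""

    def __init__(self, base_dataset, cache_dir: str, max_len: int):
        self.base = base_dataset
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.max_len = max_len

    def _path(self, idx: int) -> Path:
        return self.cache_dir / f"{idx}.pt"

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        p = self._path(idx)
        if p.exists():
            # Fast path: the fully padded, tensorized item is loaded directly.
            return torch.load(p)
        # Slow path: run the expensive processing once, pad, and cache to disk.
        item = self.base[idx]
        padded = torch.zeros(self.max_len, dtype=item.dtype)
        n = min(item.shape[0], self.max_len)
        padded[:n] = item[:n]
        torch.save(padded, p)
        return padded
```

Because every cached item has the same shape, a DataLoader over such a dataset needs no custom collation or per-item padding work at iteration time.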

@mmcdermott
Owner Author

@juancq this is the PR I mentioned.


codecov bot commented Nov 10, 2023

Codecov Report

Attention: 19 lines in your changes are missing coverage. Please review.

Comparison is base (834c850) 86.04% compared to head (3a1821c) 86.30%.
Report is 13 commits behind head on dev.

Files Patch % Lines
EventStream/data/pytorch_dataset.py 91.44% 13 Missing ⚠️
EventStream/data/config.py 93.50% 5 Missing ⚠️
EventStream/transformer/config.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev      #76      +/-   ##
==========================================
+ Coverage   86.04%   86.30%   +0.25%     
==========================================
  Files          34       34              
  Lines        6401     6608     +207     
==========================================
+ Hits         5508     5703     +195     
- Misses        893      905      +12     


@juancq
Contributor

juancq commented Nov 13, 2023

I'll give it a try on my dataset.

@juancq
Contributor

juancq commented Nov 13, 2023

@mmcdermott I tested it on a subset of my dataset. This subset includes 100,000 patients. The file sizes of this subset are:

  1. subjects_df - 1.1MB
  2. events_df - 18MB
  3. dynamic_measurements_df - 38MB

The positive: this pull request solves the increased memory consumption issue.
The negative: the memory consumption with this pull request was about 32.5GB, with an epoch taking about 100 seconds to run. The pull request I submitted gives me memory consumption around 4.5GB, with an epoch taking about 95 seconds to run.

@mmcdermott
Owner Author

mmcdermott commented Nov 13, 2023 via email

@juancq
Contributor

juancq commented Nov 15, 2023

@mmcdermott The only change was replacing data/config.py and data/pytorch_dataset.py with the copies in this pull request.

I'll test again.

@juancq
Contributor

juancq commented Nov 27, 2023

@mmcdermott I ran pretraining a couple more times to test memory consumption again and how long each epoch takes (I'm ignoring the pre-caching process and focusing entirely on time per epoch). I selected a 100,000-patient cohort fairly similar to the one I mentioned above (minor differences in inclusion/exclusion criteria).

The precached solution in this pull request uses about 3x more memory (average of 12.8 GB vs 4.4 GB) and takes slightly longer per epoch (1 minute 26 seconds vs 1 minute 14 seconds).

I ran it multiple times, and subsequent times were faster during start-up, but the time per epoch did not change.
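For context, per-epoch time and peak memory of the kind reported above might be measured with a small harness like this. This is a hypothetical sketch, not the benchmark code actually used here; `profile_epoch` is an invented helper, and note that `ru_maxrss` is reported in KiB on Linux but bytes on macOS.

```python
# Hypothetical sketch of measuring per-epoch wall time and peak RSS
# (not the benchmarking code used in this thread).
import resource
import time


def profile_epoch(dataloader, step_fn):
    """Return (elapsed_seconds, peak_rss_gb) for one pass over the data."""
    start = time.perf_counter()
    for batch in dataloader:
        step_fn(batch)
    elapsed = time.perf_counter() - start
    # ru_maxrss is KiB on Linux (bytes on macOS); assume Linux here.
    peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2
    return elapsed, peak_gb
```

Measuring only the iteration loop, as above, matches the methodology described here of excluding the one-time pre-caching cost from the per-epoch comparison.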

@mmcdermott
Owner Author

Thanks for profiling this so extensively, @juancq . I'll dig deeper on my end; it's possible this approach is uniquely suitable for fine-tuning and not pre-training, though I can't imagine why that would be the case.

mmcdermott closed this Nov 27, 2023
@mmcdermott
Owner Author

@juancq quick question -- were your model runs being done on CPU or GPU? Trying to debug the slowdown you observed. Thanks!

@juancq
Contributor

juancq commented Dec 6, 2023

@mmcdermott the model was running on a GPU.
