Improve all aspects of compute performance (save disk space cost) for pytorch datasets by pre-caching processed items. #76
Conversation
@juancq this is the PR I mentioned.
Codecov Report
Attention:
Additional details and impacted files

@@            Coverage Diff             @@
##              dev      #76      +/-   ##
==========================================
+ Coverage   86.04%   86.30%   +0.25%
==========================================
  Files          34       34
  Lines        6401     6608     +207
==========================================
+ Hits         5508     5703     +195
- Misses        893      905      +12

☔ View full report in Codecov by Sentry.
I'll give it a try on my dataset.
@mmcdermott I tested it on a subset of my dataset. This subset includes 100,000 patients. The file sizes of this subset are:
1. subjects_df - 1.1MB
2. events_df - 18MB
3. dynamic_measurements_df - 38MB
The positive: this pull request solves the increased memory consumption issue.
The negative: the memory consumption with this pull request was about 32.5GB, with an epoch taking about 1 minute 40 seconds to run. The pull request I submitted still gives me memory consumption around 4.5GB, with an epoch taking about 1 minute 35 seconds to run.
Did you modify your PyTorch data configs at all when running this version? E.g., did you put in a >1 number of cached epochs?
Additionally, try running it one more time: the first time this dataset is created, it does the pre-caching process, which takes longer, and only on subsequent passes will it be able to use those pre-cached files to best effect.
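For reference, a hypothetical sketch of the kind of config tweak being suggested here; the option name `cached_epochs` is an assumption for illustration, not necessarily the actual attribute in data/config.py:

```python
# Hypothetical sketch only: `cached_epochs` is an assumed option name,
# not necessarily the real one in data/config.py.
from dataclasses import dataclass

@dataclass
class PytorchDatasetConfig:
    """Trimmed stand-in for the project's dataset config."""
    cached_epochs: int = 1  # epochs of stochastic samples to pre-cache to disk

# Pre-cache several distinct epochs so generative pre-training does not
# re-use a single deterministic pass over the data.
config = PytorchDatasetConfig(cached_epochs=5)
```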
@mmcdermott The only change was replacing data/config.py and data/pytorch_dataset.py with the copies in this pull request. I'll test again.
@mmcdermott I ran pretraining a couple more times to test memory consumption and per-epoch time again (I'm ignoring the pre-caching process and focusing entirely on time per epoch). I selected a 100,000-patient cohort fairly similar to the one I mentioned above (minor differences in inclusion/exclusion criteria). The pre-cached solution in this pull request uses about 3x more memory (an average of 12.8 GB vs 4.4 GB) and takes slightly longer per epoch (1 minute 26 seconds vs 1 minute 14 seconds). I ran it multiple times; subsequent runs were faster during start-up, but the time per epoch did not change.
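For context, a minimal sketch of how per-epoch time and resident memory figures like those above could be collected; psutil, the `dataloader` name, and the loop structure are assumptions for illustration, not the actual benchmarking code used here:

```python
import time

import psutil  # assumed third-party dependency for memory readings


def profile_epochs(dataloader, n_epochs=3):
    """Time each full pass over the data and report resident memory after it."""
    proc = psutil.Process()
    for epoch in range(n_epochs):
        start = time.perf_counter()
        for batch in dataloader:
            pass  # forward/backward pass elided
        elapsed = time.perf_counter() - start
        rss_gb = proc.memory_info().rss / 1e9
        print(f"epoch {epoch}: {elapsed:.0f}s, RSS {rss_gb:.1f} GB")
```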
Thanks for profiling this so extensively, @juancq. I'll dig deeper on my end; it's possible this approach is uniquely suitable for fine-tuning and not pre-training, though I can't imagine why that would be the case.
@juancq quick question -- were your model runs being done on CPU or GPU? Trying to debug the slowdown you observed. Thanks!
@mmcdermott the model was running on a GPU.
This adds a new PytorchDataset class which has the same interface as the old class, but now pre-computes the outputs of the "process-on-the-fly" class and caches them to disk. For generative pre-training, as has been done with BERT and other NLP models, this should be done over a series of epochs rather than only one; for fine-tuning task models, a single epoch suffices, as these will normally not be stochastic per item.

The resulting files stored to disk are fully padded, tensorized *.pt files written with torch.save, and the model can load these files much more efficiently and iterate through them with minimal memory and time cost. This will also make it much easier to pivot to a chunked solution when datasets grow too large with respect to memory cost.
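As a rough illustration of the mechanism described above (a sketch only, not this PR's actual implementation; the class name, cache layout, and `base_dataset` argument are assumptions):

```python
import os

import torch
from torch.utils.data import Dataset


class CachedDataset(Dataset):
    """Illustrative sketch: compute each item once, cache it to disk as a
    fully padded, tensorized .pt file, and serve the cached tensors on
    subsequent epochs."""

    def __init__(self, base_dataset, cache_dir):
        self.base = base_dataset
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _cache_path(self, idx):
        return os.path.join(self.cache_dir, f"{idx}.pt")

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        path = self._cache_path(idx)
        if os.path.exists(path):
            # Fast path: load the pre-tensorized item directly from disk.
            return torch.load(path)
        # Slow path (first epoch): run the process-on-the-fly logic once...
        item = self.base[idx]
        torch.save(item, path)  # ...and cache the result as a *.pt file.
        return item
```

For stochastic pre-training items, the same idea extends to caching several epochs' worth of files (e.g., keyed by epoch as well as index) and rotating among them, which is why the description above suggests caching over a series of epochs for generative pre-training.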