
refactor and optimize dataprep_connect_abcd_with_scenario() #7

Open
cjyetman opened this issue Nov 26, 2022 · 10 comments

Comments

@cjyetman
Member

cjyetman commented Nov 26, 2022

dataprep_connect_abcd_with_scenario() is the elephant in the room. It's hundreds of lines of code, difficult to understand, and torturously long to run. There's got to be a better way.

AB#10867

@cjyetman
Member Author

cjyetman commented Jul 2, 2024

After a lot of experimentation, I found that a significant contributor to the slow behavior of data.prep is memory fragmentation. Every time one of the very large datasets is loaded into memory, R tries to find space in RAM for it. When the object is removed, R "releases" the space it was occupying, but after a while there are not enough contiguous blocks of memory left to efficiently load another large dataset. Additionally, while R "releases" the memory, the OS does not reclaim it, so the process's memory footprint continually grows, likely leading to swapping.

To test a mitigation, I tried wrapping chunks of the data.prep process in callr::r() calls, which run the code in a separate R process that exits completely when it finishes, fully releasing its memory back to the OS. This gave a significant performance advantage, even though there is some overhead in starting multiple R subprocesses.
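A minimal sketch of the idea, for anyone picking this up: the function passed to callr::r() runs in a fresh R subprocess, and only its return value comes back to the parent session. The dataset paths and the merge step here are hypothetical placeholders, not the actual data.prep logic.

```r
library(callr)

# Run a heavy chunk in an isolated R subprocess.
# `abcd.rds`, `scenario.rds`, and the join column are illustrative only.
result <- callr::r(
  func = function(abcd_path, scenario_path) {
    abcd <- readRDS(abcd_path)         # large object lives only in the subprocess
    scenario <- readRDS(scenario_path)
    merge(abcd, scenario, by = "company_id")
  },
  args = list(
    abcd_path = "abcd.rds",
    scenario_path = "scenario.rds"
  )
)
# When callr::r() returns, the subprocess has exited and all memory it
# used is returned to the OS; only `result` remains in this session.
```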

With this in mind, I think it may be a good strategy to isolate the various chunks of data.prep using callr::r(), but with a well-thought-out plan for when and where each chunk runs, with the aim of starting each dataprep_connect_abcd_with_scenario() run from an R environment unburdened by heavy memory usage.

[Screenshots: two memory-usage captures from 2024-05-06, at 16:20 and 15:45]

@AlexAxthelm

Thanks for investigating. In those screenshots, the first is "as-is", and the second is with callr?

@cjyetman
Member Author

cjyetman commented Jul 2, 2024

Thanks for investigating. In those screenshots, the first is "as-is", and the second is with callr?

roughly, yes

@jdhoffa
Member

jdhoffa commented Jul 2, 2024

@cjyetman I'm happy to move forward with the callr strategy you proposed. What do you think the next steps would be? Shall we have a call to discuss what/where/how these callr chunks should be defined?

(The call doesn't need to be soon/ urgent of course)

@cjyetman
Member Author

cjyetman commented Jul 2, 2024

@cjyetman I'm happy to move forward with the callr strategy you proposed. What do you think the next steps would be? Shall we have a call to discuss what/where/how these callr chunks should be defined?

(The call doesn't need to be soon/ urgent of course)

I'm struggling to find the scripts I experimented with, which complicates things a bit, but... I think some serious strategizing of how to implement it would be good, like deciding in what order different chunks can/should be done.

@jdhoffa
Member

jdhoffa commented Jul 2, 2024

If you'd like to schedule a call with @AlexAxthelm and me, or make use of a Tech Review to discuss this sometime in the next few weeks, I'm open to it!

@jdhoffa
Member

jdhoffa commented Aug 15, 2024

Shall we call this closed by #240?

@AlexAxthelm

RMI-PACTA/workflow.data.preparation#240? I don't think so. That doesn't seem to actually reduce the memory consumption of the function; it just prevents leakage (from objects in the top-level environment).

I think that in order to actually close this one, we need to not process the "big blocks" of financial/scenario data, and instead process individual elements ("on-demand").
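To illustrate the "on-demand" idea: rather than loading the full financial/scenario blocks up front, each iteration would read and process only one slice, so peak memory is bounded by the largest single slice. All names below (the sector list, the directory layout, and the per-slice connector function) are hypothetical.

```r
# Hypothetical per-slice processing; nothing here is actual data.prep code.
sector_ids <- c("power", "oil_and_gas", "automotive")

results <- lapply(sector_ids, function(sector) {
  # Read just this sector's slice of the ABCD data, process it,
  # and let it be garbage-collected before the next iteration.
  slice <- readRDS(file.path("abcd_by_sector", paste0(sector, ".rds")))
  connect_slice_with_scenario(slice)  # hypothetical per-slice connector
})

combined <- do.call(rbind, results)
```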

@cjyetman
Member Author

Yeah, this is still something worth doing, hypothetically anyway.

@jdhoffa
Member

jdhoffa commented Aug 15, 2024

Sounds good!
