
refactor and optimize dataprep_connect_abcd_with_scenario() #7

Open
cjyetman opened this issue Nov 26, 2022 · 10 comments

Comments

@cjyetman
Member

cjyetman commented Nov 26, 2022

dataprep_connect_abcd_with_scenario() is the elephant in the room. It's hundreds of lines of code, difficult to understand, and torturously long to run. There's got to be a better way.

AB#10867

@cjyetman
Member Author

cjyetman commented Jul 2, 2024

After a lot of experimentation, I found that a significant contributor to the slow behavior of data.prep is memory fragmentation. Every time one of the very large datasets is loaded into memory, R tries to find space in RAM for it. When the object is removed, R "releases" the space it was occupying, but after a while there are not enough contiguous blocks of memory left to efficiently load another large dataset. Additionally, while R "releases" the memory, the OS does not reclaim it, so the process's memory footprint continually grows, likely leading to swapping.

To test a mitigation, I tried wrapping chunks of the data.prep process in callr::r() calls, which run the code in a separate R process that exits completely when it finishes, fully releasing its memory back to the OS. This gave a significant performance advantage, even though there is some overhead in starting multiple R subprocesses.
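A minimal sketch of the idea, for anyone picking this up: the function passed to callr::r() runs in a fresh R subprocess, and only its return value comes back to the parent session. The dataset paths and the merge step here are hypothetical placeholders, not the actual data.prep logic.

```r
library(callr)

# Run a heavy chunk in an isolated R subprocess.
# `abcd.rds`, `scenario.rds`, and the join column are illustrative only.
result <- callr::r(
  func = function(abcd_path, scenario_path) {
    abcd <- readRDS(abcd_path)         # large object lives only in the subprocess
    scenario <- readRDS(scenario_path)
    merge(abcd, scenario, by = "company_id")
  },
  args = list(
    abcd_path = "abcd.rds",
    scenario_path = "scenario.rds"
  )
)
# When callr::r() returns, the subprocess has exited and all memory it
# used is returned to the OS; only `result` remains in this session.
```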

With this in mind, I think it may be a good strategy to isolate the various chunks of data.prep using callr::r(), but with a well-thought-out plan for when and where each chunk runs, with the aim of starting each dataprep_connect_abcd_with_scenario() run from an R environment unburdened by heavy memory usage.

[Screenshots: two memory-usage captures from 2024-05-06, at 16:20 and 15:45]

@AlexAxthelm

Thanks for investigating. In those screenshots, the first is "as-is", and the second is with callr?

@cjyetman
Member Author

cjyetman commented Jul 2, 2024

Thanks for investigating. In those screenshots, the first is "as-is", and the second is with callr?

roughly, yes

@jdhoffa
Member

jdhoffa commented Jul 2, 2024

@cjyetman I'm happy to move forward with the callr strategy you proposed. What do you think the next steps would be? Shall we have a call to discuss what/where/how these callr chunks should be defined?

(The call doesn't need to be soon/ urgent of course)

@cjyetman
Member Author

cjyetman commented Jul 2, 2024

@cjyetman I'm happy to move forward with the callr strategy you proposed. What do you think the next steps would be? Shall we have a call to discuss what/where/how these callr chunks should be defined?

(The call doesn't need to be soon/ urgent of course)

I'm struggling to find the scripts I experimented with, which complicates things a bit, but... I think some serious strategizing of how to implement it would be good, like deciding in what order different chunks can/should be done.

@jdhoffa
Member

jdhoffa commented Jul 2, 2024

If you'd like to schedule a call with @AlexAxthelm and me, or make use of a Tech Review to discuss this sometime in the next few weeks, I'm open to it!

@jdhoffa
Member

jdhoffa commented Aug 15, 2024

Shall we call this closed by #240?

@AlexAxthelm

RMI-PACTA/workflow.data.preparation#240? I don't think so. That doesn't seem to actually reduce the memory consumption of the function; it just prevents leakage (from objects in the top-level environment).

I think that in order to actually close this one, we need to not process the "big blocks" of financial/scenario data, and instead process individual elements ("on-demand").
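To illustrate the "on-demand" idea: rather than loading the full financial/scenario blocks up front, each iteration would read and process only one slice, so peak memory is bounded by the largest single slice. All names below (the sector list, the directory layout, and the per-slice connector function) are hypothetical.

```r
# Hypothetical per-slice processing; nothing here is actual data.prep code.
sector_ids <- c("power", "oil_and_gas", "automotive")

results <- lapply(sector_ids, function(sector) {
  # Read just this sector's slice of the ABCD data, process it,
  # and let it be garbage-collected before the next iteration.
  slice <- readRDS(file.path("abcd_by_sector", paste0(sector, ".rds")))
  connect_slice_with_scenario(slice)  # hypothetical per-slice connector
})

combined <- do.call(rbind, results)
```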

@cjyetman
Member Author

Yeah, this is still something worth doing, hypothetically anyway.

@jdhoffa
Member

jdhoffa commented Aug 15, 2024

Sounds good!
