handle long limnigraph regarding speed (loading/saving), memory usage and file sizes #52

Open
IvanHeriver opened this issue Feb 8, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@IvanHeriver
Collaborator

Currently, large limnigraphs with uncertainties result in:

  • long loading time
  • very long project saving time
  • very long import time (when importing from v2 project)
  • large error matrix file
  • long computation time by BaM
  • huge BaM result files
  • huge project file when saved.

I had a project where a long limnigraph (more than 44,000 time steps) was used to compute two different hydrographs:

  • importing the v2 project took more than 10 minutes on my machine
  • saving the project to v3 .bam format took a few minutes as well
  • the resulting project file was more than 550 MB (unzipped, it is more than 2 GB)

I think these issues call for the following measures:

  • have monitoring frames when importing a limnigraph and when saving a project
  • not keeping spag files: currently, the full BaM results are saved along with a project, and spaghetti files are the main reason behind the large size of the project file. Spaghetti files should probably be removed after each BaM run, or at least before saving a project. In all the cases that I can imagine, the envelopes are sufficient within BaRatinAGE. However, this implies that raw BaM prediction results would no longer be available to BaRatinAGE users.

Let me know if other options to better manage long limnigraphs within BaRatinAGE can be implemented.

@IvanHeriver IvanHeriver added the enhancement New feature or request label Feb 8, 2024
@benRenard
Member

related to issue #40

I agree with your last point that Q(t) spaghettis are the main problem. In v2 there was an option to enable/disable the saving of spaghettis within the project bar.zip file, so it's probably an approach we could re-implement. Maybe it could be made more flexible, e.g. by asking the user for a maximum file size above which the saving of spaghettis is disabled. Or, on the contrary, it could be made less flexible, by never saving the spaghettis (only the envelopes). But in any case, a useful feature would be to ask the user if she/he wishes to export the spaghettis: this way they are still re-usable downstream of BaRatinAGE, but they don't bloat the bar.zip project file.

There are a few tricks in issue #40 to improve memory or CPU time, but I'm not sure it should be the job of BaRatinAGE to implement them, and in any case there will always be instances with massive spaghetti files, so we should find a way to handle it properly.

@IvanHeriver
Collaborator Author

Exporting the spaghettis of a prediction in BaRatinAGE v2 is not possible, right? The user had to go and look for them in the bar.zip file. In v3, this is also currently not possible.

The approaches you suggest are interesting, but they seem a bit overly complicated for a feature not many people use (I might be wrong).

Here is another, simpler idea:

  1. spaghettis of predictions are simply never saved in the project file (except if they have only one column, which is the case for maxpost predictions)
  2. in the RC and Qt panels, add a result tab with a button to download the spaghettis of each prediction experiment
  3. if the project is reopened, spaghettis are lost, buttons are greyed out, and a message says: "to retrieve the prediction samples, BaM needs to be rerun" or something similar.
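
As a rough illustration of point 1 (a sketch only, not BaRatinAGE code): the function below zips a set of project files and skips every spaghetti file that has more than one column. The `.spag` suffix, the whitespace-separated layout and the names `count_columns` and `save_project` are assumptions made for this example.

```python
import io
import zipfile


def count_columns(text: str) -> int:
    """Return the number of whitespace-separated columns in the first non-empty line."""
    for line in text.splitlines():
        if line.strip():
            return len(line.split())
    return 0


def save_project(files: dict[str, str]) -> bytes:
    """Zip project files, skipping multi-column spaghetti files.

    `files` maps a file name to its text content. The '.spag' suffix is an
    assumed naming convention used only for this illustration.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            if name.endswith(".spag") and count_columns(content) > 1:
                # Prediction samples are dropped; BaM has to be rerun to get them back.
                continue
            archive.writestr(name, content)
    return buffer.getvalue()
```

Single-column files (the maxpost case) pass through unchanged, so only the bulky sample matrices are left out of the archive.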

Point 1 could be implemented quite simply for version 3.0.0, and points 2 and 3 could be implemented in a future version.

I tested not zipping the spaghetti files: the resulting files are much smaller and more manageable (e.g. from 26 MB to 3.5 MB).

However, this doesn't fix the big project file issue when there are long time series with stage errors because the stage error matrix is still saved. Maybe a possible fix would be to use a seed with the random number generator and save the seed instead of the matrix. But the problem will then be the project loading time (it is very intensive to build such an error matrix).
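
To make the seed idea concrete, here is a minimal sketch (not how BaRatinAGE currently works): the error matrix is rebuilt from a stored seed instead of being stored itself. Independent Gaussian errors and the name `build_stage_error_matrix` are assumptions made for the illustration; the real stage error model may differ.

```python
import random


def build_stage_error_matrix(seed: int, n_steps: int, n_samples: int, std: float) -> list[list[float]]:
    """Rebuild an n_steps x n_samples error matrix from a seed.

    Assumption: errors are independent and Gaussian with standard deviation `std`.
    """
    rng = random.Random(seed)  # dedicated generator, so the global random state is untouched
    return [[rng.gauss(0.0, std) for _ in range(n_samples)] for _ in range(n_steps)]
```

Only the seed and the error parameters would need to go into the project file. The matrix comes out identical only as long as the generator and the sampling code stay the same, and the rebuild cost grows with `n_steps * n_samples`.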

@benRenard
Member

OK with your approach for RC and Qt spaghettis.

Why is it so important to save the stage spaghettis? Couldn't we use the same approach and not save them, while offering a way to download them if the user wishes?

In particular, I don't understand why you need to generate the stage spaghettis when loading the project: in my eyes they only need to be generated before performing a prediction experiment that requires it (e.g. total uncertainty on Qt). And even if it is a bit intensive, I don't think it's as intensive as passing them through the RC equation to compute Qt spaghettis.

@IvanHeriver
Collaborator Author

Currently, stage spaghettis are computed when loading a stage time series to (1) compute the uncertainty envelope, (2) be visible (and exportable) in a table within the limnigraph panel and (3) be used any time a discharge time series is computed.
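
For reference, use (1) can be sketched as follows: an envelope is just a pair of quantiles per time step, so it is far smaller than the spaghettis it comes from. The 2.5% / 97.5% bounds, the simple rank-based quantile and the name `envelope` are assumptions made for this example, not the actual BaRatinAGE or BaM computation.

```python
def envelope(spaghetti, lower=0.025, upper=0.975):
    """Return one (low, high) pair per time step.

    `spaghetti` is a list of rows, one row per time step, each row holding
    the sampled values for that step. Quantiles are taken by rank in the
    sorted row, without interpolation.
    """
    result = []
    for row in spaghetti:
        ordered = sorted(row)
        last = len(ordered) - 1
        result.append((ordered[round(lower * last)], ordered[round(upper * last)]))
    return result
```

Whatever the number of samples, the output has two values per time step, which is why keeping only the envelopes shrinks the project file so much.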

I chose this approach because I wanted the sampled stage errors to remain the same after saving and reloading the project, and for all the child discharge time series. However, as I stated in my previous comment, using a seed (probably possible, but I haven't checked) might solve this particular issue.

Computing the stage errors only when required (e.g. to compute Qt spaghettis, or if the user requests the spaghettis to export them), as you suggest, seems to be the way to go.

It is indeed less intensive than computing Qt spaghettis, but it can still take a few seconds. Still, I find that less problematic than spending time on project load (which is already pretty slow) or storing the entire matrix in the project file, which takes a lot of space AND is slow to unzip and read.
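
A minimal sketch of this "compute only when required" behaviour, combined with a seed so that every request yields the same samples. The class `Limnigraph`, its attributes and the Gaussian error model are hypothetical and only serve the illustration.

```python
import random
from functools import cached_property


class Limnigraph:
    """Stage time series whose spaghettis are built on first access only."""

    def __init__(self, stages, stage_std, n_samples, seed):
        self.stages = stages
        self.stage_std = stage_std
        self.n_samples = n_samples
        self.seed = seed  # stored in the project instead of the full matrix

    @cached_property
    def stage_spaghetti(self):
        # Runs once, on first access (e.g. before a Qt prediction or an export);
        # later accesses reuse the cached matrix.
        rng = random.Random(self.seed)
        return [
            [h + rng.gauss(0.0, self.stage_std) for _ in range(self.n_samples)]
            for h in self.stages
        ]
```

Loading a project would only create the object, which is cheap; the few seconds of sampling are paid the first time a prediction or an export actually needs the matrix.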

Changing how stage errors are managed within BaRatinAGE is not that straightforward, I think. But it might still be worth the effort for version 3.0.0, since large project files are a significant issue in my opinion. It might also affect the project file structure (and I know how painful it can be to handle several file versions).


2 participants