Improve concatenation of netCDF files with simulation pickup #3818
Conversation
Let me try to reframe the problem, and tell me if you agree. I think it's a real issue that the output files are created when the output writers are instantiated. Because of this, we find ourselves having to write … This is a hack for sure. It's a failure of the checkpointer design --- the whole point of the design is to make checkpointing easy. We should be able to change just one line. I even think we should be able to pick up from a checkpoint with an environment variable, so we don't have to change the run script at all. That would be more robust. What you're describing sounds like an additional problem of this deficiency.

The fix seems straightforward. We just need to introduce the concept of output "initialization". Then we can delay creating the output file to … It seems like if we introduce initialization we can also handle file splitting. What do you think? If you want to help we can get started on it. I was putting it off myself because I don't have immediate needs for checkpointing, but that will probably change pretty soon.

Separately, I don't like … I also don't like … |
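To make the "initialization" idea concrete, here is a minimal sketch of a lazily-initialized writer. Everything here (`DelayedWriter`, `initialize!`, `write_output!`, and the plain-text file stand-in for NetCDF) is hypothetical illustration, not existing Oceananigans API: the point is only that file creation moves from the constructor to the first output event.

```julia
# Hypothetical sketch: delay file creation until the first output actuation,
# instead of creating the file when the writer is constructed.
mutable struct DelayedWriter
    filepath::String
    initialized::Bool
end

DelayedWriter(filepath) = DelayedWriter(filepath, false)

# Called lazily on the first output event, not at construction time.
function initialize!(writer::DelayedWriter)
    writer.initialized && return writer
    open(writer.filepath, "w") do io   # real code would create a NetCDF file here
        write(io, "")
    end
    writer.initialized = true
    return writer
end

function write_output!(writer::DelayedWriter, data)
    initialize!(writer)   # lazy initialization makes pickup and splitting easier
    open(writer.filepath, "a") do io
        println(io, data)
    end
end
```

With this shape, a pickup could decide at initialization time whether to create a fresh file or append to an existing one, since nothing has touched the disk yet when the writer is built.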
I see the interest of creating a handler for all the outputs of the model. Is the idea to make a wrapper that depends on parsing preexisting writers, i.e. …? I agree that keeping a simple way to create … I agree that the initialisation of the writer should not create a file; I find that quite confusing. I believe it will be simple to change by only executing the … Despite the implementation of this, I still see value in having a flag … Regarding my changes in this PR, the function … |
If you are referring to #3793, my intent is just to introduce an additional wrapper on top of the existing writers. It's merely an alternative to adding output writers manually to …
I agree that it's simple to design.
Why would it result in (unintended) data loss? Because it's common to mistakenly re-run a simulation?
Okay, I'll take a look. |
Trying to understand this workflow. You're saying that for a single run, you would create multiple directories, with a new directory for each time a simulation is restarted? |
Currently in my workflow, I define an environment variable that contains a string like …
then the next pickup will result in files:
Of these files, the first two parts (…) … If I don't change the name each time I pick up, with the … Another workflow would be to create a new folder for each pickup, keeping the same filename, but splitting the files would result in the same issue of file duplication. I hope this description of the workflow is clearer. |
Perhaps there is no need for an environment variable that modifies the filename; instead, this is something that could be done when the user is using a checkpoint. But rather than using this convoluted workflow that requires renaming and deleting files, I thought the optimal implementation is to append to the previously existing file when using checkpoints. If the filename is …
|
A possible issue with this implementation may occur if the simulation crashes after already writing some output to the last file. However, we could test, in the first call of the writer, whether the time to store is smaller than the current time to be appended. |
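One hedged way to implement that first-call check: compare the times already stored in the existing file against the simulation clock at pickup, and drop any entries written after the checkpoint. The helper name and the in-memory vector of stored times below are made up for illustration; real code would read the time coordinate from the NetCDF file.

```julia
# Hypothetical sketch: before appending after a pickup, trim any output that a
# crashed run wrote past the checkpoint time, so the file and the clock agree.
function trim_stale_output!(stored_times::Vector{Float64}, current_time::Float64)
    # Keep only entries strictly earlier than the pickup time.
    keep = findall(t -> t < current_time, stored_times)
    stale = length(stored_times) - length(keep)
    keepat!(stored_times, keep)
    return stale   # number of stale entries removed
end
```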
It's clear!
Of course I agree with this. I think our basic principle should be that there is no difference in workflow, either for running or analysis, between scripts that use checkpointing and those that don't. Also, I think that …
If we have a system that does not overwrite files when picking up from a checkpoint, do you think it will be necessary to have …?

Assume for the following discussion that the "overwriting issue" is solved for checkpointing: In my experience, I typically set up a script with … My thought is that it makes more sense to ask people to intentionally request …

Also, even for this purpose I don't really want a "non-overwriting guarantee" as a feature. I would prefer something like "unique_filenames = true" or something, which I could then "turn on" if I got into a situation where I was re-running expensive simulations. |
Also curious --- do you split files to make them easier to download? I'm wondering if there is a better way. For example, it would not be hard to design a utility that would automatically separate a file along some axis (such as the time axis) to produce smaller files for the purpose of downloading. For example, something like `download_chunk(download_spec, time_window=(0, 60days))`, where … |
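The core of the proposed utility is just index selection along the time axis. A sketch of that piece, with all names (`chunk_indices`, the eventual `download_chunk`) hypothetical as in the comment above:

```julia
# Hypothetical sketch: given the per-snapshot times stored in a file and a
# requested time window, select the record indices to extract into a smaller
# file for download.
function chunk_indices(times::AbstractVector, time_window::Tuple)
    t0, t1 = time_window
    return findall(t -> t0 <= t <= t1, times)
end

# A real `download_chunk(download_spec, time_window)` would then copy only
# those records (e.g. with NCDatasets.jl) into a new, smaller NetCDF file.
```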
Yes, I completely agree, if the simulation picks up from a checkpoint it should not delete data.
Now it is clear. If we handle picking up simulations properly, I completely agree with you that the best option is for the user to manually change …
I mostly do it to keep a consistent size and chunking across files for post-processing, but supporting both things would make the user workflow more flexible, particularly when running long simulations with multiple pickups. The post-processing you suggest is likely to be useful, but in that case it may be more useful to include support for Zarr files among the possible simulation outputs (https://github.com/JuliaIO/Zarr.jl). |
Ok, now that this is clear, I'll clone and modify PR #3793. |
A |
Ok! Curious how this will evolve; we eventually hope to support FieldTimeSeries for NetCDF, which will give us some options to design various productivity features... |
Sorry, I don't understand. Why are we closing this? I prefer small PRs to large ones and this one seems good. |
@ali-ramadhan might be good for you to take a look |
Do we test this?
```julia
# Update schedule based on user input
is_output_splitted!(schedule, filepath) = nothing

function is_output_splitted!(schedule::TimeInterval, filepath, overwrite_existing)
```
I think my only comment is that I find this function name confusing. `is_output_splitted!` suggests it will return a boolean `true` or `false`, but it actually modifies a schedule and returns a part number and a filepath?
Thanks for working on this @josuemtzmo! I do tend to avoid file splitting since one file, even if it is huge, simplifies data analysis. And most of the time, the data analysis can be done on the fly, alleviating the need for huge outputs.
I don't think so. It's definitely a good idea to do so, since a wrong implementation can result in data loss. Usually Julia would stop, then re-run the entire script. So would a test look something like: set up a simulation with a checkpointer, run it for some iterations with some file-splitting output, then set up the exact same simulation (copy-paste) and run it for some more iterations and more file splitting, then check that the output is all correct? There may be some edge cases too, e.g. zero or only one output actuation after picking up, or before the initial simulation ends. |
@ali-ramadhan Thanks for your suggestion; I finally had some time to work on this again. I agree that the name is not ideal. Would it be better to call it …? |
What does the function do, and what does it return? |
It returns an int with the number of file divisions already performed, together with the file name. If the files are `file_part_1.nc`, `file_part_2.nc`, `file_part_3.nc`, the output will be: … |
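For concreteness, here is one way the counting logic just described could look. The name `last_file_part` and its exact signature are invented for illustration, following the `file_part_N.nc` naming convention from the examples above; this is not the PR's actual code.

```julia
# Hypothetical sketch: scan a directory for files named like
# "<stem>_part_1.nc", "<stem>_part_2.nc", ... and return the highest part
# number together with the corresponding filename.
function last_file_part(dir::String, stem::String; ext::String=".nc")
    pattern = Regex("^" * stem * "_part_(\\d+)\\Q" * ext * "\\E\$")
    parts = Int[]
    for f in readdir(dir)
        m = match(pattern, f)
        m === nothing || push!(parts, parse(Int, m.captures[1]))
    end
    isempty(parts) && return 0, stem * ext   # no parts yet: fall back to the base name
    n = maximum(parts)
    return n, string(stem, "_part_", n, ext)
end
```

Returning both the count and the filename in one call is what the review comments below push back on; splitting it into two functions trades a little duplication for a clearer contract.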
How about calling it …? It could also improve clarity to split it into two functions: one to return the number of split files, and the other to return … |
I've been thinking about how to divide the function, but it would require duplicating the lines … within the code. How important is it to avoid duplication vs. clarity? |
Clarity matters most. But usually there are not strong trade-offs: duplication improves clarity only if the duplicated code is short and easy to understand. How is `filename = first(split(basename(filepath), ".")) * "_part"` different from …?

For clarity, I think it's best to use more lines for complicated operations rather than concatenating many operations into a single line. That line does a lot of things at once: `basename`, `split` with `"."`, `first`, and then finally concatenation with `"_part"`. That's quite a few things for one line. |
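The one-liner under discussion can be unpacked into one operation per line, with the same behavior (the example path is made up):

```julia
filepath = "output/file.nc"   # example input

# The one-liner from the PR:
filename = first(split(basename(filepath), ".")) * "_part"

# The same operations, one step per line:
file = basename(filepath)           # strip the directory: "file.nc"
stem = first(split(file, "."))      # drop the extension:  "file"
filename_steps = stem * "_part"     # append the suffix:   "file_part"
```

Each intermediate gets a name, so a reader can check every step without mentally composing four operations.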
Hello,

I've been working on implementing this automatic concatenation of the output when a simulation is picked up from a checkpoint, with the flag `splitting_files` on. This feature addresses the fact that otherwise the simulation output filename needs to be changed manually each time the simulation is restarted.

In other words, if a simulation is run to output in a `file.nc` with the flag `splitting_files`, different files will be created, such as `file_part1.nc`, `file_part2.nc`. If `overwrite_output` is true, these files will be rewritten and the data will be deleted. If `overwrite_output` is false, the simulation will crash, since it will not find the original `file.nc`. The new code ensures that if `overwrite_output` is false, the model will append the output to the last file, i.e. `file_part2.nc`.

Before working more on this (i.e. including joining output in JLD2), I'm wondering whether this will be useful to implement in the other schedulers and merge to master. What do you think @glwagner @tomchor @navidcy?