Increasingly slow to add events to ASDF dataset #72
Comments
Just wanted to say I observed the same issue. I have a list of a few thousand events, each with a QuakeML file and a MiniSEED waveform file, and I wanted to loop over all of them and add every event and its waveforms to a single H5 file. For further processing, I wanted each waveform stored with an event_id tag, so the corresponding event could easily be extracted for each waveform. My code was something like:
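(A minimal sketch of such a loop; the file layout, paths, and the `"raw_recording"` tag are hypothetical, not the original snippet.)

```python
# Minimal sketch of the per-event loop (paths and tag are hypothetical)
import glob
import os

import obspy
import pyasdf

ds = pyasdf.ASDFDataSet("detections.h5")
for qml_path in sorted(glob.glob("events/*.xml")):
    event = obspy.read_events(qml_path)[0]
    stream = obspy.read(os.path.splitext(qml_path)[0] + ".mseed")
    ds.add_quakeml(event)
    # event_id ties the waveform to its event for later extraction
    ds.add_waveforms(stream, tag="raw_recording", event_id=event)
```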
Each iteration was progressively slower, and running this for thousands of events was intractable.
Hi Seth and @savardge, I think you will find the reason for this problem if you look at what add_quakeml is doing. The key part is here:
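(Roughly paraphrased as a sketch rather than pyasdf's literal source, with `new_catalog` standing for the catalog being added, the logic behaves like this:)

```python
# Rough paraphrase of add_quakeml's deduplication step (a sketch, not
# pyasdf's actual source). Reading ds.events parses the *entire*
# QuakeML document already stored in the file.
existing = ds.events  # obspy Catalog holding every event stored so far
existing_ids = {str(ev.resource_id) for ev in existing}
# keep only events whose resource_id is not already present
for ev in new_catalog:
    if str(ev.resource_id) not in existing_ids:
        existing.append(ev)
ds.events = existing  # re-serializes the full catalog back to the file
```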
As you can see, pyasdf computes an intersection between the previously stored catalog and the new one, making sure there is no overlap between the two catalogs. The intersection itself is not really a problem, since Python handles it efficiently. If we look deeper at which step in add_quakeml costs the most time, here is the test: add 500 events one by one.
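(A sketch of such a test, using minimal synthetic events and a hypothetical output file; absolute numbers will vary, but the per-call time grows with the number of stored events:)

```python
# Sketch of the timing test: add 500 minimal synthetic events one by
# one and time each add_quakeml() call. File name is hypothetical.
import time

import pyasdf
from obspy import UTCDateTime
from obspy.core.event import Catalog, Event, Origin

ds = pyasdf.ASDFDataSet("timing_test.h5")
per_call = []
for i in range(500):
    ev = Event(origins=[Origin(time=UTCDateTime(2020, 1, 1) + i,
                               latitude=0.0, longitude=0.0)])
    t0 = time.perf_counter()
    ds.add_quakeml(Catalog(events=[ev]))
    per_call.append(time.perf_counter() - t0)

# later calls take far longer than early ones
print(per_call[0], per_call[-1])
```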
Here you can see that reading/writing the catalog from/to pyasdf is where most of the time is spent. Note that the y-axis is log scale, so the reading time soon explodes... Looking further at how pyasdf reads the catalog from ASDF, it seems the obspy.read_events function is the ultimate reason for the increasing time: ObsPy parses the entire QuakeML document every time. I hope there will be a better way to append new events, such as storing each event in its own dataset. Until then, you may still want to save the entire catalog in a single call. Hope this helps! Best,
Also see this issue at obspy/obspy#1910
I am using pyasdf to store catalogs of detected events, which can include thousands of obspy events and waveforms.
When adding these objects to an ASDF dataset with ds.add_quakeml(), I found that adding events becomes progressively slower as the number of events already in the dataset grows. I suspect this is due to checking for redundancy between each new event and all the events already in the dataset.
I had initially designed a workflow that looped through a list of detections and, for each detection, added an obspy event and the waveforms associated with it via the event_id. This makes some intuitive sense, but becomes very slow when adding thousands of waveforms and events. The slowdown can be circumvented by first loading all the events into an obspy catalog and adding the entire catalog to the ASDF dataset in one call, as in the sketch below. This is a fine solution, but it should probably be mentioned in the documentation as the preferred method.
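(Not the original attachment; a minimal sketch of the batched workflow just described, with hypothetical file paths and tag name:)

```python
# Sketch of the faster workflow: add all events in a single
# add_quakeml() call, then attach waveforms per event. Paths, file
# layout, and the "raw_recording" tag are hypothetical.
import glob
import os

import obspy
import pyasdf
from obspy.core.event import Catalog

detections = []  # (event, stream) pairs, one per detection
for qml_path in sorted(glob.glob("detections/*.xml")):
    event = obspy.read_events(qml_path)[0]
    stream = obspy.read(os.path.splitext(qml_path)[0] + ".mseed")
    detections.append((event, stream))

ds = pyasdf.ASDFDataSet("detections.h5")
# one catalog write instead of thousands of incremental ones
ds.add_quakeml(Catalog(events=[ev for ev, _ in detections]))
for event, stream in detections:
    ds.add_waveforms(stream, tag="raw_recording", event_id=event)
```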
The attached snippet of code should simply demonstrate the issue.