RADPS (NRAO) Requirements and Feature Requests #64
Here is an IPython notebook that compares the performance of casa-formats-io and python-casacore:
IPython notebook: casa_formats_io_vs_python_casacore.ipynb
Dataset (3.36 GB): VLASS3.2.sb45755730.eb46170641.60480.16266136574_spw10_split.ms.zip
On my Mac M3, casa-formats-io takes approximately 11 seconds and python-casacore takes approximately 3 seconds to read all of the main table data. Initial tests show that the time taken to read the data and perform reshaping (using np.fromfile and casa_formats_io._casa_chunking._combine_chunks) is comparable between the two libraries. Therefore, the performance difference is likely related to how the data gets organized.
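One way to check where the time goes is to time each stage of the read separately, as a rough sketch. The harness below is an illustration only: `time_stages`, `read_bytes`, and `split_rows` are hypothetical names, the input is a synthetic temporary file rather than the measurement set, and the "organise" stage is a simple stand-in, not the reshaping done by either library.

```python
import os
import tempfile
import time


def time_stages(stages, first_input, repeats=3):
    """Run named stages as a pipeline and return the best wall time per stage.

    `stages` is a list of (name, function) pairs; each function receives the
    previous stage's output. Taking the minimum over `repeats` runs reduces
    noise from caching and background load.
    """
    best = {name: float("inf") for name, _ in stages}
    result = None
    for _ in range(repeats):
        value = first_input
        for name, func in stages:
            start = time.perf_counter()
            value = func(value)
            best[name] = min(best[name], time.perf_counter() - start)
        result = value
    return best, result


def read_bytes(path):
    # Stage 1 (hypothetical): raw I/O only.
    with open(path, "rb") as handle:
        return handle.read()


def split_rows(buffer, row_size=8):
    # Stage 2 (hypothetical): stand-in for reorganising the raw buffer into rows.
    return [buffer[i:i + row_size] for i in range(0, len(buffer), row_size)]


if __name__ == "__main__":
    # Synthetic stand-in file; a real test would point at the main table data.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 64)
    try:
        timings, rows = time_stages(
            [("read", read_bytes), ("organise", split_rows)], tmp.name
        )
        for name, seconds in timings.items():
            print(f"{name}: {seconds:.6f} s")
    finally:
        os.remove(tmp.name)
```

Swapping the two placeholder stages for each library's actual read and reorganise steps would show which stage accounts for the gap.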
Thanks, this is very useful! I haven't really done any performance optimisation in casa-formats-io at this point so I am sure there is a lot of low-hanging fruit. I will have a think about the requirements and will follow up soon.
@astrofrog, any update on your thoughts about the requirements?
Sorry for not getting back to you sooner, I was off work for a significant fraction of the summer. I have had a chance to think about the requirements you mention, and have a few follow-up questions/comments. First, do you need to be able to access just part of the data, or would you always load an entire column into memory? Second, you mention 'Single-threaded (no Dask)' - note that it is possible to use dask in single-threaded mode, so just to make sure we are on the same page, do you object in general to making use of the dask API (specifically the fact that the astropy table we currently return has dask arrays that require The high-level API I was striving for here aims to completely hide away the details of a table to a user, and they would inspect the table using e.g. If the use of the dask API is a deal breaker, maybe we could agree on a public lower-level API that both you and the dask interface could use.
@astrofrog, we now have some developer time available and plan to start looking into this. Have you had a chance to consider it further?
@Jan-Willem - sorry for the delay, I'll try and reply tonight! I'm sure we can find a way forward to avoid duplicating efforts, I'll write up some thoughts/suggestions this evening. In any case, one thing that definitely needs doing is documenting the format, so that would be worthwhile starting if you have immediate time available.
I spoke with our developers and we agree that having a clear specification of the layout of the components of casacore tables would be very useful. While significant documentation exists for the table system, it falls short by being neither formal enough nor specific enough for a new implementation to be created based solely on its description. We will begin work on a specification. Our plan is to use Kaitai Struct to create a specification of casacore tables. Kaitai has a number of advantages:
@Jan-Willem - Kaitai Struct seems nice, I didn't know about it! Just to understand, would this also be used to generate the actual parsing code? Is the idea that this parsing code would then be wrapped by a slightly higher level API that would then be the public API, or would the generated code be the public API? I suspect it would be the former as for example I doubt the generated parsing code would understand what to do with say incremental columns and so on, and so we'd need a layer that transforms the very low level parts of the file into meaningful e.g. numpy arrays and so on? Would it make sense to collaborate on this under this present repository or a separate one in radio-astro-tools since ultimately the specification is language-independent and could potentially be used in parsers for other languages? For instance, we could make another repository called
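The layering raised in this comment (a low-level parser that mirrors the bytes on disk, wrapped by a thin public API that returns meaningful arrays) might look roughly like the sketch below. Everything in it is an assumption for illustration: the `DEMO` magic, the header fields, and the names `RawTable`, `parse_raw`, and `get_column` are hypothetical and do not describe the casacore table format or code generated by Kaitai Struct. The low-level layer is hand-written with the standard `struct` module as a stand-in for generated parsing code.

```python
import struct
from dataclasses import dataclass
from typing import List, Tuple

MAGIC = b"DEMO"  # hypothetical marker, not the casacore magic value
HEADER = struct.Struct(">4sII")  # magic, nrow, ncol (big-endian, hypothetical)


@dataclass
class RawTable:
    """Low-level view: fields exactly as they appear in the (toy) file."""
    nrow: int
    ncol: int
    values: Tuple[float, ...]  # flat, row-major


def parse_raw(buffer: bytes) -> RawTable:
    # Low-level layer: the part a generated parser would normally provide.
    magic, nrow, ncol = HEADER.unpack_from(buffer, 0)
    if magic != MAGIC:
        raise ValueError("unexpected magic bytes")
    values = struct.unpack_from(f">{nrow * ncol}d", buffer, HEADER.size)
    return RawTable(nrow, ncol, values)


def get_column(raw: RawTable, index: int) -> List[float]:
    # Higher-level layer: turn the raw layout into something user-facing.
    if not 0 <= index < raw.ncol:
        raise IndexError("column index out of range")
    return list(raw.values[index::raw.ncol])


if __name__ == "__main__":
    payload = HEADER.pack(MAGIC, 2, 3) + struct.pack(">6d", 1, 2, 3, 4, 5, 6)
    table = parse_raw(payload)
    print(get_column(table, 1))
```

Keeping the two layers separate means the low-level part could be regenerated from a specification without changing the functions users call.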
I do think we should use Kaitai to generate the parsing code since that will validate that we have recorded the schema correctly and reduce some work we would have to do. A separate repository casa-formats-specification sounds like a good idea, and I agree we can discuss the API and integration into casa-formats-io once we have a rudimentary implementation. Having some CI and doing the work using pull requests sounds good. Can you please add the following people as maintainers:
Done: https://github.com/radio-astro-tools/casa-formats-specification - I will set up a bare-bones CI job that doesn't do much for now
As mentioned in #63 here is what the NRAO would be interested in working on (casa-formats-io might already have some of these features):
@astrofrog, @keflavich, @e-koch, please let us know if you think something like this is feasible. We would, of course, be happy to contribute developer effort to achieve this.