AGASC supplement HDF5 is bloated #127

Open
taldcroft opened this issue Dec 2, 2021 · 3 comments
@taldcroft
Member

taldcroft commented Dec 2, 2021

For some reason, maybe related to the PyTables update process, the agasc_supplement.h5 file is about a factor of 10 larger than I would expect. Most of the size comes from the mags table, which should be about 1.7 MB uncompressed:

In [35]: mags.info                                                                                                                             
Out[35]: 
<Table length=85885>
     name      dtype 
------------- -------
     agasc_id   int32
      mag_aca float32
  mag_aca_err float32
last_obs_time float64

In [36]: 85885 * (4 + 4 + 4 + 8)                                                                                                               
Out[36]: 1717700

The other two tables are less than 100 kB. I confirmed this size estimate by writing an H5 file of mags using astropy. With compression on, it is close to 1.0 MB.
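A minimal sketch of the same estimate, derived from the column dtypes rather than typed in by hand. It assumes rows are stored packed with no padding between columns; the names `ROW_FORMAT`, `N_ROWS`, and `expected_table_bytes` are illustrative only and do not come from the agasc code.

```python
import struct

# Column layout from the mags.info output above:
# agasc_id int32, mag_aca float32, mag_aca_err float32, last_obs_time float64.
# "<" selects standard sizes with no alignment padding (an assumption about
# how the rows are laid out on disk).
ROW_FORMAT = "<iffd"
N_ROWS = 85885


def expected_table_bytes(n_rows, row_format=ROW_FORMAT):
    """Return the uncompressed size in bytes of n_rows packed rows."""
    return n_rows * struct.calcsize(row_format)


expected = expected_table_bytes(N_ROWS)
print(expected)                    # 1717700
print(f"{expected / 1e6:.2f} MB")  # 1.72 MB
```

This only counts the raw row data, so a real HDF5 file will be slightly larger because of format overhead.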

I was just wondering about tossing a version of the supplement in the backstop tarball, but 2 MB might be a bit much. Either way, we should get back to the idea of maintaining a git-based time machine of previous versions (#128).

@taldcroft changed the title from "AGASC supplement HDF5 is bloated and the time machine idea" to "AGASC supplement HDF5 is bloated" on Dec 2, 2021
@javierggt
Contributor

Question: How does the "time machine" idea relate to the bloating?

@taldcroft
Member Author

> Question: How does the "time machine" idea relate to the bloating?

I had a thought that we might mitigate problems with reproducibility by storing the actual AGASC supplement used in load generation (by the FOT) in the backstop load products. But the current size is far too large for that, so I started thinking about another solution to the reproducibility problem: a time machine / snapshots. Anyway, the time machine idea got moved to a new issue.

@taldcroft
Member Author

Getting back to this again in relation to #128, I read in each of the tables using astropy and wrote them back out with no compression:

# The first write replaces any existing junk.h5; later writes append new paths
mags.write('junk.h5', path='mags', overwrite=True)
obs.write('junk.h5', path='obs', append=True, overwrite=True)
bad.write('junk.h5', path='bad', append=True, overwrite=True)

This gives a file that is 1.8 MB instead of the current 19 MB. I'd suggest that the code that finally writes out the new supplement h5 file be modified to write the tables as above.
