
fix data loss issue + add test #184

Closed
Conversation

maxwest-uw
Contributor

Change Description

Fixes a bug that causes data loss when splitting data.

In the current code, when we gather the unique destination pixel tuples and row counts, two different origin pixels can share the same destination pixel while carrying different row counts. For instance, in an nside=2**10 alignment where

alignment[1306154] = (4, 318, 1)
alignment[1305941] = (4, 318, 2)

the split_pixels function will try to write both chunks to the same file name, overwriting any previously split rows, so only the 2 rows at pixel 1305941 above would end up in the final product.

This data loss becomes a major issue when many high-order pixels map to the same destination pixel, each with a different row count (which should be a very common case, probably the typical case in fact).
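The collision described above can be sketched as follows. This is an illustrative toy, not the actual hipscat-import code: `alignment`, `shard_name`, and the file-name pattern are hypothetical stand-ins for how a destination-only shard name collides.

```python
# Hypothetical sketch (names are illustrative, not the real
# hipscat-import API): deriving the shard file name from the
# destination pixel alone collides when two origin pixels map
# to the same destination.
alignment = {
    1306154: (4, 318, 1),  # origin pixel -> (order, destination pixel, row count)
    1305941: (4, 318, 2),
}

def shard_name(order, pixel):
    # destination-only name: both origin pixels produce the same file
    return f"shard_{order}_{pixel}.parquet"

names = {origin: shard_name(order, pix)
         for origin, (order, pix, _count) in alignment.items()}
assert names[1306154] == names[1305941]  # collision: second write clobbers the first
```

Whichever chunk is written last wins, which is why only the rows from one origin pixel survive in the final product.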

Data Integrity

I'm worried that this bug may have affected all of our already-generated beta data products. I think we should take some time soon to go back over them, assess the extent of the data loss, and possibly regenerate a lot of those catalogs.

Solution Description

Added the unique_index to the shard file name to differentiate the shards and avoid overwriting data.
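The fix can be sketched the same way. Again a hypothetical illustration, not the actual implementation: including a per-origin unique_index in the file name makes every shard name distinct, so no write clobbers another.

```python
# Illustrative sketch of the fix: a per-origin unique_index in the
# shard file name keeps writes from colliding (names are hypothetical).
alignment = {
    1306154: (4, 318, 1),
    1305941: (4, 318, 2),
}

def shard_name(order, pixel, unique_index):
    return f"shard_{order}_{pixel}_{unique_index}.parquet"

names = {origin: shard_name(order, pix, idx)
         for idx, (origin, (order, pix, _count)) in enumerate(alignment.items())}
assert len(set(names.values())) == len(names)  # every shard gets its own file
```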

Code Quality

  • I have read the Contribution Guide
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

Bug Fix Checklist

  • My fix includes a new test that breaks as a result of the bug (if possible)


codecov bot commented Dec 5, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (2a476a8) 99.91% compared to head (ddd58a2) 99.91%.
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #184   +/-   ##
=======================================
  Coverage   99.91%   99.91%           
=======================================
  Files          27       27           
  Lines        1164     1164           
=======================================
  Hits         1163     1163           
  Misses          1        1           


@delucchi-cmu
Contributor

I'm a little confused by this PR.

The suggested alignment is not a valid one, and should instead be something like:

alignment[1306154] = (4, 318, 3)
alignment[1305941] = (4, 318, 3)

The mapped row count (third tuple element) should be the same for all instances of the same destination pixel, as this row count represents the TOTAL expected number of rows in the destination pixel, once fully-merged.

Then both 1306154 and 1305941 would map to the same unique index.

Do we see alignments like the one you've represented above in the wild? That would indeed create data loss issues. If, however, each destination tuple is identical, then such overwriting would be caught later in the pipeline: the reduce stage checks that the final output size equals the third tuple element (https://github.com/astronomy-commons/hipscat-import/blob/main/src/hipscat_import/catalog/map_reduce.py#L268). I've never seen this error trip in any import.
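The invariant and the reduce-stage check described above can be sketched like this. This is a minimal illustration of the idea, not the actual map_reduce.py code; `check_reduced_size` and its signature are hypothetical.

```python
# Minimal sketch (not the real map_reduce.py code) of the invariant:
# every origin pixel mapped to the same destination carries that
# destination's TOTAL expected row count, so the reduce stage can
# compare the merged output size against one expected value.
alignment = {
    1306154: (4, 318, 3),  # both origins expect 3 total rows at (4, 318)
    1305941: (4, 318, 3),
}

def check_reduced_size(destination, rows_written, alignment):
    expected = {tup[2] for tup in alignment.values()
                if tup[:2] == destination}
    assert len(expected) == 1, "inconsistent alignment for destination"
    assert rows_written == expected.pop(), "row count mismatch after reduce"

check_reduced_size((4, 318), 3, alignment)  # a valid alignment passes
```

With a valid alignment, a silent overwrite during splitting would surface here as a row-count mismatch rather than going unnoticed.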

@maxwest-uw
Contributor Author

Seems like I misunderstood how the alignment worked; I thought the row count in the alignment tuple referred to the number of rows in the origin pixel. Not a problem then, I'll close out this PR.

@maxwest-uw maxwest-uw closed this Dec 5, 2023
@delucchi-cmu delucchi-cmu deleted the chunking_data_loss branch April 2, 2024 13:05