Change Description
Fixes a bug that causes data loss when splitting data.
In the current code, when we gather the unique (destination pixel, row count) tuples, two different origin pixels can share the same destination pixel but have different row counts, for instance in an `nside=2**10` alignment. When that happens, the `split_pixels` function will try to write both chunks to the same file name, overwriting any previously split rows, meaning that only the 2 rows at pixel 1305941 above would end up in the final product.
This data loss becomes a major issue when many high order pixels map to the same destination pixel, each with a different row count (which should be a very common case, probably the ideal case in fact).
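The collision can be illustrated with a minimal sketch. This is not the project's actual code: the pixel numbers, row counts, file-name patterns, and the helper names `shard_name_by_pixel` and `shard_name_with_index` are all hypothetical, chosen only to show why keying a shard file on the destination pixel alone lets one chunk overwrite another, and why including the index of the unique tuple in the name keeps the chunks apart.

```python
# Hypothetical alignment: one (destination pixel, row count) entry per
# origin pixel. The values are invented for illustration only.
origin_to_dest = [
    (7, 5),
    (7, 2),
    (7, 5),
    (9, 3),
]

# Unique (destination pixel, row count) tuples. Destination pixel 7
# appears twice because its origin pixels have different row counts.
unique_pairs = sorted(set(origin_to_dest))


def shard_name_by_pixel(pixel):
    # Hypothetical naming scheme keyed on the destination pixel only.
    return f"pixel_{pixel}/shard.parquet"


def shard_name_with_index(pixel, unique_index):
    # Hypothetical naming scheme that also includes the tuple's index.
    return f"pixel_{pixel}/shard_{unique_index}.parquet"


naive_names = [shard_name_by_pixel(pixel) for pixel, _ in unique_pairs]
indexed_names = [
    shard_name_with_index(pixel, unique_index)
    for unique_index, (pixel, _) in enumerate(unique_pairs)
]

# Each repeated name is a shard that would overwrite an earlier one.
num_collisions = len(naive_names) - len(set(naive_names))
print("naive names:  ", naive_names, "collisions:", num_collisions)
print("indexed names:", indexed_names)
```

With the pixel-only names, two of the three unique tuples resolve to the same path, so whichever chunk is written last survives. With the index in the name, every tuple gets its own file.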
Data Integrity
I'm worried that this bug may have affected all of our already generated beta data products. I think we should take some time soon to go back over them, assess the extent of the data loss, and possibly regenerate many of those catalogs.
Solution Description
Added `unique_index` to the shard file name, to differentiate the shards and avoid overwriting data.
Code Quality
Project-Specific Pull Request Checklists
Bug Fix Checklist