
Add verification tests #379

Draft · wants to merge 6 commits into main

Conversation

@troyraen (Collaborator) commented Aug 14, 2024

Related to #373, #374

Change Description

  • My PR includes a link to the issue that I am addressing

Solution Description

Code Quality

  • I have read the Contribution Guide and LINCC Frameworks Code of Conduct
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

Bug Fix Checklist

  • My fix includes a new test that breaks as a result of the bug (if possible)
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

New Feature Checklist

  • I have added or updated the docstrings associated with my feature using the NumPy docstring format
  • I have updated the tutorial to highlight my new feature (if appropriate)
  • I have added unit/End-to-End (E2E) test cases to cover my new feature
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

Documentation Change Checklist

Build/CI Change Checklist

  • If required or optional dependencies have changed (including version numbers), I have updated the README to reflect this
  • If this is a new CI setup, I have added the associated badge to the README

Other Change Checklist

  • Any new or updated docstrings use the NumPy docstring format.
  • I have updated the tutorial to highlight my new feature (if appropriate)
  • I have added unit/End-to-End (E2E) test cases to cover any changes
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)


codecov bot commented Aug 14, 2024

Codecov Report

Attention: Patch coverage is 29.62963% with 19 lines in your changes missing coverage. Please review.

Project coverage is 98.47%. Comparing base (e7f9b4c) to head (029196d).

| Files | Patch % | Lines |
|---|---|---|
| ...rc/hipscat_import/verification/run_verification.py | 29.62% | 19 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #379      +/-   ##
==========================================
- Coverage   99.72%   98.47%   -1.26%     
==========================================
  Files          26       26              
  Lines        1481     1508      +27     
==========================================
+ Hits         1477     1485       +8     
- Misses          4       23      +19     


@troyraen (Collaborator, Author) commented

@delucchi-cmu here are some options. Take a look and let me know what you think. It's obviously not fully integrated yet. I added some data that will make the tests fail; I'm not sure if you want those extra files or if the tests should somehow use data that's already here. It probably also needs to use your custom file pointers and the user-supplied storage kwargs that you support, but I haven't added them yet. I also haven't written docstrings yet (these probably aren't the functions you actually want anyway).

There are four tests. The one checking the file schemas against _common_metadata is the thing you actually asked for. It feels a little incomplete if _common_metadata isn't itself checked against a schema supplied by the user, but if you don't want to rely on user input here I can understand that. The other tests are code I had handy that's in line with checking individual parquet files. One is row counts, which I know you're taking care of in a different repo, but I think you said that doesn't check the individual files? In any case, I'm happy to take those three extra tests out if you just want to focus on the schema right now.
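For illustration only (this is not the code in the PR), a check like the _common_metadata test could be sketched with pyarrow as follows; the function name, directory layout, and glob pattern are my assumptions:

```python
# Illustrative sketch, not the PR's implementation: compare each parquet file's
# footer schema against the catalog's _common_metadata file using pyarrow.
from pathlib import Path

import pyarrow.parquet as pq


def schemas_match_common_metadata(catalog_dir: str) -> bool:
    """Return True if every parquet file's schema matches _common_metadata."""
    catalog_dir = Path(catalog_dir)
    expected = pq.read_schema(catalog_dir / "_common_metadata")
    all_match = True
    for fpath in catalog_dir.glob("**/*.parquet"):
        actual = pq.read_schema(fpath)
        if not actual.equals(expected, check_metadata=False):
            print(f"Schema mismatch: {fpath}")
            all_match = False
    return all_match
```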

I'm not sure how you want this integrated with the main run function. Right now, the functions I wrote return a boolean indicating pass/fail. What do you want to happen on failure? I'm guessing it should not raise an error. Should it just print messages to stdout, create a report file, or something else?
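If the report-file option is preferred, one possible shape (purely illustrative; the helper name, column names, and output path are assumptions, not part of this PR) is to accumulate one pass/fail record per test and write them out, for example with pandas:

```python
# Hypothetical sketch of a verification report: one row per test with a
# pass/fail flag. Names and output path are assumptions, not part of this PR.
import pandas as pd


def write_verification_report(results: list, output_path: str = "verifier_results.csv") -> None:
    """Write one row per test, e.g.
    [{"test": "schemas", "passed": True, "note": "footers vs _common_metadata"}]."""
    pd.DataFrame(results).to_csv(output_path, index=False)
```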


github-actions bot commented Sep 13, 2024

| Before [829fe47] | After [2405014] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| failed | failed | n/a | benchmarks.BinningSuite.time_read_histogram |


@troyraen changed the title from Raen/verify/files to Add verification tests on Sep 18, 2024
@troyraen self-assigned this on Sep 18, 2024
@troyraen (Collaborator, Author) commented

I made decisions on some of the questions above and fleshed out the code a little more. Right now there is a Verifier class that handles the tests, and it's called in the run function with:

verifier = Verifier.from_args(args)
verifier.test_is_valid_catalog()  # run hipscat.io.validation.is_valid_catalog
verifier.test_schemas()  # user-provided schema vs _common_metadata, _metadata, and file footers
verifier.test_num_rows()  # file footers vs _metadata (per file), and user-provided total
verifier.record_results()  # write a verification report

verifier.record_distributions()  # calculate distributions (min/max of all fields) and write a file
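As a rough sketch of the row-count comparison (under my own assumptions, not the code in this PR; the test_num_rows described above also compares counts per file against _metadata, while this only compares totals):

```python
# Hypothetical sketch: sum per-file footer row counts and compare against the
# _metadata total and a user-supplied expected total.
from pathlib import Path
from typing import Optional

import pyarrow.parquet as pq


def check_num_rows(catalog_dir: str, expected_total: Optional[int] = None) -> bool:
    """Return True if footer row counts agree with _metadata and the expected total."""
    catalog_dir = Path(catalog_dir)
    footer_total = sum(
        pq.read_metadata(fpath).num_rows for fpath in catalog_dir.glob("**/*.parquet")
    )
    metadata_total = pq.read_metadata(catalog_dir / "_metadata").num_rows
    passed = footer_total == metadata_total
    if expected_total is not None:
        passed = passed and footer_total == expected_total
    return passed
```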
