Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More informative consistency checks #283

Merged
merged 17 commits into from
Feb 21, 2024
Merged

Conversation

nevrome
Copy link
Member

@nevrome nevrome commented Dec 8, 2023

... and other ideas to make modifying .janno files less painful with better validation.

Work in progress. Not sure if all of these ideas are useful.

Copy link

codecov bot commented Dec 8, 2023

Codecov Report

Attention: 42 lines in your changes are missing coverage. Please review.

Comparison is base (46c7114) 68.63% compared to head (b7acba7) 68.32%.
Report is 4 commits behind head on master.

Files Patch % Lines
src/Poseidon/Janno.hs 53.24% 30 Missing and 6 partials ⚠️
src/Poseidon/SequencingSource.hs 58.33% 4 Missing and 1 partial ⚠️
src/Poseidon/Utils.hs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #283      +/-   ##
==========================================
- Coverage   68.63%   68.32%   -0.31%     
==========================================
  Files          25       25              
  Lines        3341     3375      +34     
  Branches      382      376       -6     
==========================================
+ Hits         2293     2306      +13     
- Misses        666      693      +27     
+ Partials      382      376       -6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nevrome
Copy link
Member Author

nevrome commented Dec 10, 2023

As indicated above this PR introduces a number of changes to improve the error reporting for broken .janno files. @AyGhal recently contacted me about various error messages she encountered when preparing packages for the PCA. Some of them were not terribly useful. So here's what I did to mitigate this issue:

  1. When cassava encounters a parsing issue in an individual .janno file field it returns an error. For many more specific fields with custom types we created smart constructors with explicit error reporting (called by the respective parseField functions). For them the error messages are good and useful. But for types without this additional specification, e.g. simple Int or Double fields, the messages are very simple, e.g.:

    parse error (Failed reading: conversion error: expected Int, got "430;" (incomplete field parse, leftover: [59]))
    

    Especially the list of ASCII codes for the unparsed leftover is disappointing. cassava unfortunately does not expose the components of this message in a useful error type. It only returns a string. I thus went ahead and parsed this string to compile something slightly more readable:

    parse error in one column (expected data type: Int, broken value: "430;", problematic characters: ";")
    

    A pretty disgusting solution and incomplete at best (I don't see how to get the column name into this message), but a bit better than what we had so far.

  2. A problem that was raised multiple times by @AyGhal was the exact behaviour of the consistency check for radiocarbon dates in checkC14ColsConsistent. I reworked that now into a tree of allowed and forbidden cases, whose leaves feature clear error messages (beyond the pretty useless The Date_* columns are not consistent). For example if the the Date_Type is not set to C14, but we have radiocarbon date information in the data columns, we now get this:

    Date_Type is not "C14", but either Date_C14_Uncal_BP or Date_C14_Uncal_BP_Err are not empty.
    

    The check's logic was changed in this process and I would like to ask you to take a close look into it in the review. I implemented the same system for Contamination_* and the Relation_* columns, but the validation logic is much simpler there.

  3. The most interesting/controversial change, finally, is the way these consistency checks are called. So far they have been part of an Either based system that could only fail or forward the valid .janno row. checkJannoRowConsistency and the functions below it now live in a little monad stack of Except and Writer, which allows to record both failure and (!) warnings.

    type JannoRowLog = E.ExceptT String (W.Writer JannoRowWarnings)
    

    I created this, because I encountered a critical omission in checkC14ColsConsistent, which allowed the Date_Type to be set to C14 even if Date_C14_Uncal_BP and Date_C14_Uncal_BP_Err are empty. Two packages in the PCA (2022_Gretzinger_AngloSaxons and 2020_Nagele_Caribbean) feature this invalid case and this brings us exactly in the kind of trouble discussed in Breaking changes and maintaining reproducibility #276. Through the new monad stack I now report only a warning for these packages instead of refusing to load them, thus propagating solution 1. mentioned there. What do you think?

I also quickly checked SequencingSource.hs and removed some unused code there to keep the two modules for .janno and .ssf files somewhat consistent.

@stschiff
Copy link
Member

Wow, OK, I will need some time to review this. Happy to discuss priorities, but I can see how this will be a useful addition!

@nevrome
Copy link
Member Author

nevrome commented Feb 15, 2024

@stschiff Just a quick reminder that I'm waiting for a review here 👍

@stschiff
Copy link
Member

On it! Likely not finish today though.

Copy link
Member

@stschiff stschiff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this looks all good to me, more or less. I have left a few comments, perhaps check them and I don't think we have to do much back-and-forth. I will give this an "Approve" flag, since none of my comments I consider critical. Specifically about your three points above:

  1. Very good, even if nasty to code, but it's not too bad in terms of code and if it helps I'm all for it. I looked into alternatives to provide better errors, and I think they are all much more complicated (as in recoding parseNamedRecord to include custom parsers that don't use the FromField instance).

  2. The new consistency decision tree is very nice and easy to review. I couldn't find a glitch, other than a cosmetic comment about a sentence there.

  3. The Monad-Stack thing is - in my view and as discussed - unnecessary. I would just log them straight out as they come. There is no need to collect warnings and show as a combined report, as all of this logging happens per line. It would be different if we were to collect stuff through multiple calls of the consistency-check functions across multiple rows or even multiple files. But as long as we're dealing with a single row of a Janno file I would have just logged and thrown straight into PoseidonIO. But on the other hand the more complex solution with Write and ExceptT is still reasonably easy, so I'm OK with this in the end.

src/Poseidon/Janno.hs Show resolved Hide resolved
src/Poseidon/Janno.hs Show resolved Hide resolved
src/Poseidon/Janno.hs Outdated Show resolved Hide resolved
@nevrome nevrome merged commit e7e1996 into master Feb 21, 2024
2 of 4 checks passed
@nevrome nevrome deleted the moreInformativeConsistencyChecks branch February 21, 2024 13:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants