Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jannocoalesce #282

Merged
merged 37 commits into from
Feb 26, 2024
Merged

Jannocoalesce #282

merged 37 commits into from
Feb 26, 2024

Conversation

stschiff
Copy link
Member

OK, I've finished the CLI API and functionality, and it compiles. I haven't tested it yet.

Copy link

codecov bot commented Nov 30, 2023

Codecov Report

Attention: Patch coverage is 64.51613% with 33 lines in your changes are missing coverage. Please review.

Project coverage is 68.22%. Comparing base (0cdea0d) to head (04c557e).

❗ Current head 04c557e differs from pull request most recent head d81a951. Consider uploading reports for the commit d81a951 to get more accurate results

Files Patch % Lines
src/Poseidon/CLI/Jannocoalesce.hs 64.51% 19 Missing and 14 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #282      +/-   ##
==========================================
- Coverage   68.32%   68.22%   -0.11%     
==========================================
  Files          25       26       +1     
  Lines        3375     3468      +93     
  Branches      376      390      +14     
==========================================
+ Hits         2306     2366      +60     
- Misses        693      712      +19     
- Partials      376      390      +14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@stschiff stschiff marked this pull request as ready for review December 5, 2023 08:53
@stschiff
Copy link
Member Author

stschiff commented Dec 5, 2023

OK, @nevrome, this is ready for review.

@stschiff stschiff requested a review from nevrome December 5, 2023 08:54
Copy link
Member

@nevrome nevrome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty cool. I will run some tests on the command line now.

src/Poseidon/CLI/Jannocoalesce.hs Show resolved Hide resolved
test/PoseidonGoldenTests/GoldenTestsRunCommands.hs Outdated Show resolved Hide resolved
@nevrome
Copy link
Member

nevrome commented Dec 7, 2023

Ok - so I tried to run it for the open minotaur-archive PR here: poseidon-framework/minotaur-archive#2

Some minor observations:

  1. Overwriting in place worked there, so maybe my comment above was unnecessary.
  2. It would be nice to have some more CLI output, not just [Info] Writing to file (directory will be created if missing): 2021_CarlhoffNature/2021_CarlhoffNature.janno. Maybe a message for each major step, like loading, merging and finally writing.
  3. The columns of the output .janno file are reordered according to the column order specified in poseidon-hs. This is sensible, but maybe a bit surprising. Maybe we should announce this somewhere in the process or the documentation.

Beyond that and most importantly ❗:
Merging did nothing in this case, because the Poseidon_IDs differ. Generally because of the _MNT suffixes we asked @TCLamnidis to add for each sample, but also more specifically because of other suffixes/insets like _SG and _ss or _TF Thiseas added for different versions of the data. This differences render jannocoalesce in its current form functionally useless for its intended purpose.

I see two ways to fix this. Either we go with your initial idea and allow coalescing (?) by additional columns, e.g. the Alternative_IDs column, where we would then have to add the community-archive IDs for the specific minotaur-archive usecase. Or we add a feature to apply some regex to the Poseidon_IDs before merging.

Funny how this issue goes back to the question what exactly a Poseidon_ID is. Note that I wrote about this for the paper in Supp. Text 5.2.

@TCLamnidis
Copy link
Member

I was thinking that the suffixes would cause issue. I wonder if a Main_ID / Master_ID column could be used for the matching to the Poseidon_ID?

@nevrome
Copy link
Member

nevrome commented Dec 7, 2023

Potentially yes. If it was an Individual_ID we could coalesce all columns that are relevant on the individual-level. A Poseidon_ID refers to something pretty specific (see my write-up for the paper) and a Main_ID would have to refer to an individual to be helpful and not muddle the waters even more.

Edit: I still think the suffix is a good thing, btw.

@stschiff
Copy link
Member Author

OK, I think we could easily put a flag to ignore suffixes in the target, as well as custom key-lookup-columns. I can implement both of these changes easily. Thanks for the feedback.

@stschiff
Copy link
Member Author

OK, so I've added quite a number of features:

--sourceKey ARG          the janno column to use as the source key
                           (default: "Poseidon_ID")
  --targetKey ARG          the janno column to use as the target key
                           (default: "Poseidon_ID")
  --stripIdSuffix ARG      an optional regular expression to identify parts of
                           the IDs to strip before matching between source and
                           target

this should now help with the cases we discussed above. Note that I had to put in one safety check: If Regexes are given, it is possible that Janno rows differ in the Poseidon_ID, which would then be copied over with --force. So I now added a check that columns titled Poseidon_ID, or the given key-name in --targetKey, respectively, are never overwritten.

It would be nice if you could try things out. I've added a number of automatic tests now to JannocoalesceSpec.hs, which should cover the main functionality. I have not expanded the golden tests yet.

@stschiff
Copy link
Member Author

Note that I think we need a bit more output on the statistics how many rows have been edited. I will work on that now.

@stschiff
Copy link
Member Author

OK, I've added some log-output. Here is an example run on the test-data in the package:

stephan_schiffels@BIONB100-2 poseidon-hs % trident jannocoalesce -s test/testDat/testJannoFiles/normal_full.janno -t test/testDat/testJannoFiles/normal_subset.janno -o ~/Desktop/test.txt
trident v1.4.1.0 for poseidon v2.5.0, v2.6.0, v2.7.0, v2.7.1
https://poseidon-framework.github.io

[Info]    matched target individual with ID XXX011 with source individual with ID XXX011
[Info]    -- copied "DE" from column Country_ISO
[Info]    -- copied "Paul;Peter" from column Alternative_IDs
[Info]    -- copied "xxx" from column Country
[Info]    -- copied "yyy" from column Relation_Note
[Info]    -- copied "father_of;grandfather_of" from column Relation_Type
[Info]    -- copied "first;second" from column Relation_Degree
[Info]    -- copied "XXX012;I1234" from column Relation_To
[Info]    matched target individual with ID XXX012 with source individual with ID XXX012
[Info]    -- copied "FR" from column Country_ISO
[Info]    -- copied "xxx" from column Country
[Info]    -- copied "daughter_of" from column Relation_Type
[Info]    -- copied "first" from column Relation_Degree
[Info]    -- copied "XXX011" from column Relation_To
[Info]    matched target individual with ID XXX013 with source individual with ID XXX013
[Info]    -- copied "EG" from column Country_ISO
[Info]    -- copied "Skeleton Joe" from column Alternative_IDs
[Info]    -- copied "xxx" from column Country
[Info]    -- copied "xxx" from column Relation_Note
[Info]    -- copied "sixthToTenth" from column Relation_Degree
[Info]    -- copied "XXX011" from column Relation_To
[Info]    Writing to file (directory will be created if missing): /Users/stephan_schiffels/Desktop/test.txt

And I also checked whether the in-place writing works, and it does (if you omit the -o output file).

@stschiff
Copy link
Member Author

So I think this is ready for another try, @nevrome

@nevrome
Copy link
Member

nevrome commented Dec 20, 2023

I looked at the code and found it quite brilliant. Only two comments:

  • I think the -- copied messages are a bit too verbose for logInfo, especially for large packages. Maybe it's better to report them via logDebug.
  • You pointed out to me that --stripIdSuffix requires a special kind of regex syntax. Maybe it would be better to mention this in the command line documentation.

I see that the tests are failing because of some issue with the base monad. I'll look into that now to see if I can easily fix it. Afterwards I'll go back to my use-case in and for the minotaur archive.

@nevrome
Copy link
Member

nevrome commented Dec 20, 2023

So the first thing I tried was this

trident jannocoalesce -s ../../community-archive/2021_CarlhoffNature/2021_CarlhoffNature.janno -t 2021_CarlhoffNature.janno --stripIdSuffix "_*$"

This is obviously wrong (missing qualifier before the * quantifier in --stripIdSuffix) and nothing was matched. The target file was reordered a bit, but no information was added from the source. trident jannocoalesce returned nothing but

[Info]    Writing to file (directory will be created if missing): 2021_CarlhoffNature.janno

in this case, which left me wondering if it actually worked or not. I think we should report this (probably common) failure more clearly, maybe with a message like this:

[Warning] No matches across source (key: Poseidon_ID) and target (key: Poseidon_ID) with --stripIdSuffix "_*$"

When I used a correct regex ("_.*$") things worked like a charm ❤️

I wonder, though, if we should rename --stripIdSuffix to (for example) --stripIdRegex. Suffixes are certainly the most common mismatch, but with this implementation also prefixes or infixes can be removed.

I'll now prepare a PR with the changes I suggested here and in the comment above.

@nevrome
Copy link
Member

nevrome commented Dec 20, 2023

See my PR #284.

When running jannocoalesce for 2020_Margaryan_Viking (poseidon-framework/minotaur-archive#4 (comment))) I realized that the info messages for each target row (in this case 441) are still pretty verbose... But I guess people can always turn them off with --logMode NoLog, so maybe it's OK.

I want to point out how well the regex solution works here. Still a big fan of that.

@stschiff
Copy link
Member Author

stschiff commented Jan 9, 2024

Great, thanks. I'll take a look at your PR at once. Sorry for letting this hang stale for so long!

…ustments

Jannocoalesce minor adjustments
@stschiff
Copy link
Member Author

OK, just for the record, I think we discussed that you, @nevrome, would perhaps add a few small things to this PR based on your experience with poseidon-framework/minotaur-archive#5. So I will stand back for now unless told otherwise

@nevrome nevrome merged commit f9e6464 into master Feb 26, 2024
2 checks passed
@nevrome nevrome deleted the jannocoalesce branch February 26, 2024 11:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants