Added a feature to remove empty .janno columns with rectify #326
base: master
9807c81
17352c0
94c8e5d
0684309
0f6ac12
fe58949
3aaccbd
bfdcdbc
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,6 +24,7 @@ | |
JannoRelationDegree (..), | ||
JannoLibraryBuilt (..), | ||
writeJannoFile, | ||
writeJannoFileWithoutEmptyCols, | ||
readJannoFile, | ||
createMinimalJanno, | ||
createMinimalSample, | ||
|
@@ -64,7 +65,7 @@ | |
import qualified Data.HashMap.Strict as HM | ||
import Data.List (elemIndex, foldl', | ||
intercalate, nub, sort, | ||
(\\)) | ||
transpose, (\\)) | ||
import Data.Maybe (fromJust) | ||
import qualified Data.Text as T | ||
import qualified Data.Vector as V | ||
|
@@ -394,15 +395,41 @@ | |
|
||
-- Janno file writing | ||
|
||
-- | A helper function to replace empty bytestring values in janno files with explicit "n/a" | |
explicitNA :: Bch.ByteString -> Bch.ByteString | ||
explicitNA = replaceInJannoBytestring Bch.empty "n/a" | ||
|
||
replaceInJannoBytestring :: Bch.ByteString -> Bch.ByteString -> Bch.ByteString -> Bch.ByteString | ||
replaceInJannoBytestring from to tsv = | ||
let tsvRows = Bch.lines tsv | ||
tsvCells = map (Bch.splitWith (=='\t')) tsvRows | ||
tsvCellsUpdated = map (map (\y -> if y == from || y == Bch.append from "\r" then to else y)) tsvCells | ||
Review comment: What's going on with "\r"? Why would that happen?
Reply: No idea - I didn't touch this code beyond moving it around. I'll have a look.
||
tsvRowsUpdated = map (Bch.intercalate (Bch.pack "\t")) tsvCellsUpdated | ||
in Bch.unlines tsvRowsUpdated | ||
|
||
makeHeaderWithAdditionalColumns :: [JannoRow] -> Csv.Header | ||
makeHeaderWithAdditionalColumns rows = | ||
V.fromList $ jannoHeader ++ sort (HM.keys (HM.unions (map (getCsvNR . jAdditionalColumns) rows))) | ||
|
||
writeJannoFile :: FilePath -> JannoRows -> IO () | ||
writeJannoFile path (JannoRows rows) = do | ||
let jannoAsBytestring = Csv.encodeByNameWith encodingOptions makeHeaderWithAdditionalColumns rows | ||
let jannoAsBytestringwithNA = explicitNA jannoAsBytestring | ||
let jannoAsBytestring = Csv.encodeByNameWith encodingOptions (makeHeaderWithAdditionalColumns rows) rows | ||
jannoAsBytestringwithNA = explicitNA jannoAsBytestring | ||
Bch.writeFile path jannoAsBytestringwithNA | ||
where | ||
makeHeaderWithAdditionalColumns :: Csv.Header | ||
makeHeaderWithAdditionalColumns = | ||
V.fromList $ jannoHeader ++ sort (HM.keys (HM.unions (map (getCsvNR . jAdditionalColumns) rows))) | ||
|
||
writeJannoFileWithoutEmptyCols :: FilePath -> JannoRows -> IO () | ||
writeJannoFileWithoutEmptyCols path (JannoRows rows) = do | ||
let jannoAsBytestring = Csv.encodeByNameWith encodingOptions (makeHeaderWithAdditionalColumns rows) rows | ||
jannoAsBytestringwithNA = explicitNA jannoAsBytestring | ||
case Csv.decodeWith decodingOptions Csv.NoHeader jannoAsBytestringwithNA :: Either String (V.Vector (V.Vector Bch.ByteString)) of | ||
Left _ -> error "internal error, please report" | ||
Right x -> do | ||
let janno = V.toList $ V.map V.toList x | ||
jannoTransposed = transpose janno | ||
Review comment: Clever! I didn't know Cassava can simply parse into a vector of vectors. That's great. I think in that case we could just improve
Reply: Right - that would be one possible solution.
||
jannoTransposedFiltered = filter (any (/= "n/a") . tail) jannoTransposed | ||
jannoBackTransposed = transpose jannoTransposedFiltered | ||
jannoConcat = Bch.intercalate "\n" $ map (Bch.intercalate "\t") jannoBackTransposed | ||
Bch.writeFile path (jannoConcat <> "\n") | ||
|
||
encodingOptions :: Csv.EncodeOptions | ||
encodingOptions = Csv.defaultEncodeOptions { | ||
|
@@ -528,18 +555,6 @@ | |
"broken value: " ++ actual ++ ", " ++ | ||
"problematic characters: " ++ show leftover ++ ")" | ||
|
||
-- | A helper function to replace empty bytestring values in janno files with explicit "n/a" | |
explicitNA :: Bch.ByteString -> Bch.ByteString | ||
explicitNA = replaceInJannoBytestring Bch.empty "n/a" | ||
|
||
replaceInJannoBytestring :: Bch.ByteString -> Bch.ByteString -> Bch.ByteString -> Bch.ByteString | ||
replaceInJannoBytestring from to tsv = | ||
let tsvRows = Bch.lines tsv | ||
tsvCells = map (Bch.splitWith (=='\t')) tsvRows | ||
tsvCellsUpdated = map (map (\y -> if y == from || y == Bch.append from "\r" then to else y)) tsvCells | ||
tsvRowsUpdated = map (Bch.intercalate (Bch.pack "\t")) tsvCellsUpdated | ||
in Bch.unlines tsvRowsUpdated | ||
|
||
-- Global janno consistency checks | ||
|
||
checkJannoConsistency :: FilePath -> JannoRows -> Either PoseidonException JannoRows | ||
|
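The column-dropping step in `writeJannoFileWithoutEmptyCols` can be looked at in isolation. The following is only a minimal sketch of that transpose, filter, transpose idea, not the PR's code: it works on plain `String` cells instead of bytestrings, and the names `dropEmptyColumns` and `renderTsv` are hypothetical. It assumes a rectangular table whose first row is the header.

```haskell
import Data.List (intercalate, transpose)

-- | Hypothetical helper: drop every column whose data cells are all "n/a".
-- The first row is treated as the header and is never inspected.
-- Assumes all rows have the same number of cells.
dropEmptyColumns :: [[String]] -> [[String]]
dropEmptyColumns = transpose . filter hasData . transpose
  where
    -- 'drop 1' skips the header cell; unlike 'tail' it is total
    hasData column = any (/= "n/a") (drop 1 column)

-- | Hypothetical helper: render rows as tab-separated text,
-- one line per row, with a trailing newline.
renderTsv :: [[String]] -> String
renderTsv = unlines . map (intercalate "\t")

-- | A tiny example table with one column that is entirely "n/a".
exampleTable :: [[String]]
exampleTable =
  [ ["Poseidon_ID", "Genetic_Sex", "Group_Name", "Country"]
  , ["XXX001", "M", "POP1", "n/a"]
  , ["XXX002", "F", "POP2", "n/a"]
  ]
```

In this sketch, `dropEmptyColumns exampleTable` keeps the first three columns and removes `Country`. Note that a table with a header but no data rows loses every column, because `any` over an empty list is `False`; whether that matters depends on how such files should be handled.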
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,11 +1,11 @@ | ||
Poseidon_ID Genetic_Sex Group_Name Alternative_IDs Relation_To Relation_Degree Relation_Type Relation_Note Collection_ID Country Country_ISO Location Site Latitude Longitude Date_Type Date_C14_Labnr Date_C14_Uncal_BP Date_C14_Uncal_BP_Err Date_BC_AD_Start Date_BC_AD_Median Date_BC_AD_Stop Date_Note MT_Haplogroup Y_Haplogroup Source_Tissue Nr_Libraries Library_Names Capture_Type UDG Library_Built Genotype_Ploidy Data_Preparation_Pipeline_URL Endogenous Nr_SNPs Coverage_on_Target_SNPs Damage Contamination Contamination_Err Contamination_Meas Contamination_Note Genetic_Source_Accession_IDs Primary_Contact Publication Note Keywords | ||
XXX001 M POP1 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX002 F POP2 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX003 M POP1 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX004 F POP2 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX005 M POP2 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX006 F POP2 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX007 M POP1 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX008 F POP3 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX009 F POP1 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX010 M POP3 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
Poseidon_ID Genetic_Sex Group_Name | ||
XXX001 M POP1 | ||
XXX002 F POP2 | ||
XXX003 M POP1 | ||
XXX004 F POP2 | ||
XXX005 M POP2 | ||
XXX006 F POP2 | ||
XXX007 M POP1 | ||
XXX008 F POP3 | ||
XXX009 F POP1 | ||
XXX010 M POP3 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,29 +1,29 @@ | ||
title: Chronicle title | ||
description: Chronicle description | ||
chronicleVersion: 0.2.0 | ||
lastModified: 2024-11-13 | ||
lastModified: 2025-01-03 | ||
packages: | ||
- title: Lamnidis_2018 | ||
version: 1.0.0 | ||
commit: c59bfb82fec3f2742cc0e10ceb2932ee06e56aa1 | ||
commit: e59bbf7865a783e78979e2bf9f757a8aa9020656 | ||
path: Lamnidis_2018 | ||
- title: Lamnidis_2018 | ||
version: 1.0.1 | ||
commit: c59bfb82fec3f2742cc0e10ceb2932ee06e56aa1 | ||
commit: e59bbf7865a783e78979e2bf9f757a8aa9020656 | ||
path: Lamnidis_2018_newVersion | ||
- title: Schiffels | ||
version: 1.1.1 | ||
commit: a32a46cf82b8895af72c8920be4ca4843cd5e7f7 | ||
commit: cf3deedf474ef0a651fdcfe5e92085e7810cb816 | ||
path: Schiffels | ||
- title: Schiffels_2016 | ||
version: 1.0.1 | ||
commit: c59bfb82fec3f2742cc0e10ceb2932ee06e56aa1 | ||
commit: e59bbf7865a783e78979e2bf9f757a8aa9020656 | ||
path: Schiffels_2016 | ||
- title: Schmid_2028 | ||
version: 1.0.0 | ||
commit: c59bfb82fec3f2742cc0e10ceb2932ee06e56aa1 | ||
commit: e59bbf7865a783e78979e2bf9f757a8aa9020656 | ||
path: Schmid_2028 | ||
- title: Wang_2020 | ||
version: 0.1.0 | ||
commit: c59bfb82fec3f2742cc0e10ceb2932ee06e56aa1 | ||
commit: e59bbf7865a783e78979e2bf9f757a8aa9020656 | ||
path: Wang_2020 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,11 +1,11 @@ | ||
Poseidon_ID Genetic_Sex Group_Name Alternative_IDs Relation_To Relation_Degree Relation_Type Relation_Note Collection_ID Country Country_ISO Location Site Latitude Longitude Date_Type Date_C14_Labnr Date_C14_Uncal_BP Date_C14_Uncal_BP_Err Date_BC_AD_Start Date_BC_AD_Median Date_BC_AD_Stop Date_Note MT_Haplogroup Y_Haplogroup Source_Tissue Nr_Libraries Library_Names Capture_Type UDG Library_Built Genotype_Ploidy Data_Preparation_Pipeline_URL Endogenous Nr_SNPs Coverage_on_Target_SNPs Damage Contamination Contamination_Err Contamination_Meas Contamination_Note Genetic_Source_Accession_IDs Primary_Contact Publication Note Keywords | ||
XXX001 M POP1 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX002 F POP2 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX003 M POP1 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX004 F POP2 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX005 M POP2 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX006 F POP2 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX007 M POP1 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX008 F POP3 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX009 F POP1 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
XXX010 M POP3 n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a n/a | ||
Poseidon_ID Genetic_Sex Group_Name | ||
XXX001 M POP1 | ||
XXX002 F POP2 | ||
XXX003 M POP1 | ||
XXX004 F POP2 | ||
XXX005 M POP2 | ||
XXX006 F POP2 | ||
XXX007 M POP1 | ||
XXX008 F POP3 | ||
XXX009 F POP1 | ||
XXX010 M POP3 |
Review comment: This will be problematic if a tab is hidden inside quotes, like "a\tweird\tbut\tlegal\tfield-value". Maybe that's OK. It's a bit tragic that we have all this fancy machinery to parse TSV and don't use it here. I understand why (this is so much simpler), but if we wanted to be semantically 100% correct it would have to be more complicated. Not sure.
Is this shortcut actually needed? I think the only client that uses this function is `explicitNA`, so I suppose we could get rid of these two functions and simply augment our various janno-writing functions to make sure empty strings are always output as n/a? It would then be a matter of parsing and writing a Janno.
Reply: I know - I didn't write this function for this PR, but only moved it within the code. It's old.
I'll think about how to replace it - either with the more clever decoding-encoding mechanism introduced in this PR, or with more targeted encoding somewhere upstream. I'll propose something.
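The quoted-tab concern raised in this thread can be illustrated with a small sketch. It is not taken from the code base: `splitNaive` and `splitQuoted` are hypothetical names, the cells are plain `String`s, and it assumes a convention where double quotes protect tabs inside a field.

```haskell
-- | Hypothetical: split on every tab, as a line-based replacement does.
splitNaive :: String -> [String]
splitNaive s = case break (== '\t') s of
  (cell, [])       -> [cell]
  (cell, _ : rest) -> cell : splitNaive rest

-- | Hypothetical: split only on tabs outside double quotes.
-- Quote characters are kept in the cells; a doubled quote ("") toggles
-- the state twice and therefore leaves it unchanged.
splitQuoted :: String -> [String]
splitQuoted = go False ""
  where
    go _ acc [] = [reverse acc]
    go inQuotes acc (c : cs)
      | c == '"'                  = go (not inQuotes) (c : acc) cs
      | c == '\t' && not inQuotes = reverse acc : go False "" cs
      | otherwise                 = go inQuotes (c : acc) cs

-- | One row with a quoted field containing tabs and an empty last cell.
exampleRow :: String
exampleRow = "XXX001\t\"a\tweird\tfield\"\t"
```

Here `splitNaive exampleRow` yields five cells and breaks the quoted value apart, while `splitQuoted exampleRow` yields three cells with the quoted value intact. Going through a real TSV decoder, as suggested above, avoids maintaining this kind of splitting logic by hand.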