add .ssf for many packages #114

Merged
merged 57 commits into from
May 9, 2023

Conversation

93Boy
Contributor

@93Boy 93Boy commented Feb 17, 2023

Adding ENA data tables to current packages

@nevrome
Member

nevrome commented Feb 18, 2023

Great! Here are some things I think should be changed:

  • Please remove all .py files. The scripts are not supposed to be in the package library -> Dana.
  • Please reference the newly added ena tables in the respective POSEIDON.yml files as defined in the upcoming poseidon schema release (also including a checksum). Please make sure the fields are named correctly (sequencingSourceFile) -> Dana.
  • Please remove superfluous quotes from the .janno files where you accidentally added them. We should not change file checksums if there is no need. -> Thiseas will provide sed-magic. Dana will do it.
  • Column names in SSF files need to be adapted to lower-case versions. Please check the schema of the sequencingSourceFile. -> Thiseas can provide sed, please, and Dana may use it.

I think we could also remove the quotes from the ena table .tsv files, if they're not necessary. In libre office, for example, you can specify the writing behaviour for .csv/.tsv files and turn off wrapping everything in quotes.
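The POSEIDON.yml entry requested in the second point above could be generated along these lines. This is only a sketch: it assumes the checksum is an MD5 hash, that GNU `md5sum` is available, that the table sits next to the POSEIDON.yml file, and that the checksum field is called `sequencingSourceFileChkSum`. `ssf_yml_entry` is a hypothetical helper name, not part of any Poseidon tool.

```shell
# Hypothetical helper: print the POSEIDON.yml lines for one sequencing source file.
# Assumes an MD5 checksum and GNU md5sum; check the schema before relying on it.
ssf_yml_entry() {
    ssf_path=$1
    # md5sum prints "<hash>  <path>"; keep only the hash.
    checksum=$(md5sum "$ssf_path" | cut -d ' ' -f 1)
    printf 'sequencingSourceFile: %s\n' "$(basename "$ssf_path")"
    printf 'sequencingSourceFileChkSum: %s\n' "$checksum"
}

# Usage (path is an example): ssf_yml_entry 2019_Nikitin_LBK/ENAtable.ssf
```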

@TCLamnidis
Member

Some ENA tables have almost no quotes (e.g. 2020_AgranatTamir_LevantBA/ENAtable.tsv), while others have quotes in every field (e.g. 2020_Nagele_Caribbean/ENAtable.tsv).

Why is that?
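A quick way to see which tables are affected is to count the lines that contain a double quote in each file. This is a minimal sketch, assuming the tables are all named `ENAtable.tsv` and live somewhere below one directory; `quoted_line_counts` is a hypothetical helper.

```shell
# Hypothetical helper: print "<number of lines with a double quote><TAB><path>"
# for every ENAtable.tsv below the given directory.
quoted_line_counts() {
    find "$1" -type f -name 'ENAtable.tsv' | sort | while IFS= read -r table; do
        # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
        n=$(grep -c '"' "$table" || true)
        printf '%s\t%s\n' "$n" "$table"
    done
}

# Usage: quoted_line_counts .
```

A count close to the number of rows points to a file that was saved with every text cell quoted.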

Comment on lines 493 to 1673
"SAMEA10556698" "PRJEB47891" "ERR7195387" "I16611.MT" "I16611" "ERS8208682" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208682" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/007/ERR7195387/ERR7195387.fastq.gz" 841043 "1c32f55bdb42435f948e4306d3deff5a" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/007/ERR7195387/ERR7195387.fastq.gz" 45470 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195387/I16611.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195387/I16611.MT.bam.bai"
"SAMEA10556735" "PRJEB47891" "ERR7195424" "I17145" "I17145" "ERS8208719" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208719" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/004/ERR7195424/ERR7195424.fastq.gz" 295911199 "f64d265c4f93f4ed630209214896a9f8" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/004/ERR7195424/ERR7195424.fastq.gz" 9572979 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195424/I17145.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195424/I17145.bam.bai"
"SAMEA10556448" "PRJEB47891" "ERR7195138" "I15048" "I15048" "ERS8208433" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208433" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/008/ERR7195138/ERR7195138.fastq.gz" 463718032 "e6a707435a6e2115a043a24acc5a8681" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/008/ERR7195138/ERR7195138.fastq.gz" 13918874 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195138/I15048.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195138/I15048.bam.bai"
"SAMEA10556451" "PRJEB47891" "ERR7195141" "I15049.MT" "I15049" "ERS8208436" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208436" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/001/ERR7195141/ERR7195141.fastq.gz" 497601 "7a519df57f99d136149a6f21db59f8ed" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/001/ERR7195141/ERR7195141.fastq.gz" 23871 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195141/I15049.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195141/I15049.MT.bam.bai"
"SAMEA10556452" "PRJEB47891" "ERR7195142" "I15071" "I15071" "ERS8208437" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208437" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/002/ERR7195142/ERR7195142.fastq.gz" 83898699 "51bdda20dda4e552e4fbdff6aa2d2a0a" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/002/ERR7195142/ERR7195142.fastq.gz" 2770837 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195142/I15071.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195142/I15071.bam.bai"
"SAMEA10556453" "PRJEB47891" "ERR7195143" "I15071.MT" "I15071" "ERS8208438" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208438" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/003/ERR7195143/ERR7195143.fastq.gz" 198416 "3fb03c345ba6064aa853f3609e260908" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/003/ERR7195143/ERR7195143.fastq.gz" 10107 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195143/I15071.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195143/I15071.MT.bam.bai"
"SAMEA10556446" "PRJEB47891" "ERR7195136" "I15047" "I15047" "ERS8208431" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208431" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/006/ERR7195136/ERR7195136.fastq.gz" 264000464 "da78777b6aff2fdb6cec6a3be618dcb7" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/006/ERR7195136/ERR7195136.fastq.gz" 8189015 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195136/I15047.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195136/I15047.bam.bai"
"SAMEA10556450" "PRJEB47891" "ERR7195140" "I15049" "I15049" "ERS8208435" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208435" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/000/ERR7195140/ERR7195140.fastq.gz" 129131737 "ba9c46c580eb755adc598c698523dfc0" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/000/ERR7195140/ERR7195140.fastq.gz" 3881065 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195140/I15049.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195140/I15049.bam.bai"
"SAMEA10556447" "PRJEB47891" "ERR7195137" "I15047.MT" "I15047" "ERS8208432" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208432" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/007/ERR7195137/ERR7195137.fastq.gz" 581955 "eae85b702d2a7ab8955f9485ff3a6885" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/007/ERR7195137/ERR7195137.fastq.gz" 31235 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195137/I15047.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195137/I15047.MT.bam.bai"
"SAMEA10556449" "PRJEB47891" "ERR7195139" "I15048.MT" "I15048" "ERS8208434" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208434" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/009/ERR7195139/ERR7195139.fastq.gz" 1270588 "ca8c532ac405b9415b9cfcbd22f3ef9d" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/009/ERR7195139/ERR7195139.fastq.gz" 62681 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195139/I15048.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195139/I15048.MT.bam.bai"
"SAMEA10556516" "PRJEB47891" "ERR7195206" "I16099" "I16099" "ERS8208501" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208501" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/006/ERR7195206/ERR7195206.fastq.gz" 263845245 "a07b76f9d4619b296283e59d1a3b888e" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/006/ERR7195206/ERR7195206.fastq.gz" 8379837 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195206/I16099.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195206/I16099.bam.bai"
"SAMEA10556469" "PRJEB47891" "ERR7195159" "I15819.MT" "I15819" "ERS8208454" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208454" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/009/ERR7195159/ERR7195159.fastq.gz" 233827 "f29f003aec54a4f825c3e2f572e9c4a9" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/009/ERR7195159/ERR7195159.fastq.gz" 12443 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195159/I15819.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195159/I15819.MT.bam.bai"
"SAMEA10556467" "PRJEB47891" "ERR7195157" "I15818.MT" "I15818" "ERS8208452" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208452" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/007/ERR7195157/ERR7195157.fastq.gz" 1448272 "6dbb80b2856a0e60aeb3911d1ab73ab0" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/007/ERR7195157/ERR7195157.fastq.gz" 72915 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195157/I15818.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195157/I15818.MT.bam.bai"
"SAMEA10556471" "PRJEB47891" "ERR7195161" "I15821.MT" "I15821" "ERS8208456" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208456" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/001/ERR7195161/ERR7195161.fastq.gz" 5959 "287610ebd1dbc4847599e3c592637ec3" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/001/ERR7195161/ERR7195161.fastq.gz" 200 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195161/I15821.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195161/I15821.MT.bam.bai"
"SAMEA10556466" "PRJEB47891" "ERR7195156" "I15818" "I15818" "ERS8208451" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208451" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/006/ERR7195156/ERR7195156.fastq.gz" 326507936 "c09007cb4425843742de07d18c458d52" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/006/ERR7195156/ERR7195156.fastq.gz" 10760686 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195156/I15818.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195156/I15818.bam.bai"
"SAMEA10556476" "PRJEB47891" "ERR7195166" "I15825" "I15825" "ERS8208461" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208461" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/006/ERR7195166/ERR7195166.fastq.gz" 158049239 "0fa8c9dca6002be7ef0bc17034d15ae9" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/006/ERR7195166/ERR7195166.fastq.gz" 5578142 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195166/I15825.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195166/I15825.bam.bai"
"SAMEA10556475" "PRJEB47891" "ERR7195165" "I15824.MT" "I15824" "ERS8208460" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208460" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/005/ERR7195165/ERR7195165.fastq.gz" 1350290 "a0d22f5546b83e6c1a0b71605b26f852" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/005/ERR7195165/ERR7195165.fastq.gz" 64314 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195165/I15824.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195165/I15824.MT.bam.bai"
"SAMEA10556478" "PRJEB47891" "ERR7195168" "I15826" "I15826" "ERS8208463" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208463" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/008/ERR7195168/ERR7195168.fastq.gz" 81931038 "d96270dda60f454bb3ff289c05350999" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/008/ERR7195168/ERR7195168.fastq.gz" 3177628 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195168/I15826.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195168/I15826.bam.bai"
"SAMEA10556479" "PRJEB47891" "ERR7195169" "I15826.MT" "I15826" "ERS8208464" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208464" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/009/ERR7195169/ERR7195169.fastq.gz" 303016 "278bed47346cf86e40b58d8683f6978e" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/009/ERR7195169/ERR7195169.fastq.gz" 17065 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195169/I15826.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195169/I15826.MT.bam.bai"
"SAMEA10556456" "PRJEB47891" "ERR7195146" "I15643" "I15643" "ERS8208441" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208441" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/006/ERR7195146/ERR7195146.fastq.gz" 225094788 "0f7b869de6b056f63ab14619c8df7dd3" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/006/ERR7195146/ERR7195146.fastq.gz" 7250726 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195146/I15643.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195146/I15643.bam.bai"
"SAMEA10556454" "PRJEB47891" "ERR7195144" "I15642" "I15642" "ERS8208439" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208439" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/004/ERR7195144/ERR7195144.fastq.gz" 101009164 "cc8b1cdcabc65b124e6ee3c1ea445669" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/004/ERR7195144/ERR7195144.fastq.gz" 3385990 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195144/I15642.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195144/I15642.bam.bai"
"SAMEA10556457" "PRJEB47891" "ERR7195147" "I15643.MT" "I15643" "ERS8208442" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208442" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/007/ERR7195147/ERR7195147.fastq.gz" 1074204 "684fdbdddb54ed566ce964e3be85664b" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/007/ERR7195147/ERR7195147.fastq.gz" 55345 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195147/I15643.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195147/I15643.MT.bam.bai"
"SAMEA10556455" "PRJEB47891" "ERR7195145" "I15642.MT" "I15642" "ERS8208440" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208440" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/005/ERR7195145/ERR7195145.fastq.gz" 377928 "f9ba82a9d43787c7c89891fa0a04cc21" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/005/ERR7195145/ERR7195145.fastq.gz" 20761 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195145/I15642.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195145/I15642.MT.bam.bai"
"SAMEA10556737" "PRJEB47891" "ERR7195426" "I17146" "I17146" "ERS8208721" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208721" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/006/ERR7195426/ERR7195426.fastq.gz" 198413526 "4152129d3db844c2d45683f7cac05527" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/006/ERR7195426/ERR7195426.fastq.gz" 6489897 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195426/I17146.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195426/I17146.bam.bai"
"SAMEA10556570" "PRJEB47891" "ERR7195260" "I16395" "I16395" "ERS8208555" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208555" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/000/ERR7195260/ERR7195260.fastq.gz" 37166054 "1abc5c021409a319d38267f43eec4f5f" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/000/ERR7195260/ERR7195260.fastq.gz" 1339727 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195260/I16395.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195260/I16395.bam.bai"
"SAMEA10556569" "PRJEB47891" "ERR7195259" "I16394.MT" "I16394" "ERS8208554" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208554" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/009/ERR7195259/ERR7195259.fastq.gz" 50180 "4b9d3103ccb839b160949588ffec20f7" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/009/ERR7195259/ERR7195259.fastq.gz" 2235 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195259/I16394.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195259/I16394.MT.bam.bai"
"SAMEA10556464" "PRJEB47891" "ERR7195154" "I15650" "I15650" "ERS8208449" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208449" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/004/ERR7195154/ERR7195154.fastq.gz" 229150373 "b6a8e9db45afc20ece7c0f7947282180" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/004/ERR7195154/ERR7195154.fastq.gz" 7282809 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195154/I15650.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195154/I15650.bam.bai"
"SAMEA10556465" "PRJEB47891" "ERR7195155" "I15650.MT" "I15650" "ERS8208450" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208450" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/005/ERR7195155/ERR7195155.fastq.gz" 1400862 "d489bb81189288dd5176f165fe856b72" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/005/ERR7195155/ERR7195155.fastq.gz" 70228 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195155/I15650.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195155/I15650.MT.bam.bai"
"SAMEA10556461" "PRJEB47891" "ERR7195151" "I15646.MT" "I15646" "ERS8208446" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208446" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/001/ERR7195151/ERR7195151.fastq.gz" 1037465 "233f23426d235168d3fe9426171ae4d7" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/001/ERR7195151/ERR7195151.fastq.gz" 46183 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195151/I15646.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195151/I15646.MT.bam.bai"
"SAMEA10556463" "PRJEB47891" "ERR7195153" "I15648.MT" "I15648" "ERS8208448" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208448" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/003/ERR7195153/ERR7195153.fastq.gz" 1186006 "a2f04c5ee6650ef0bd0943502aa88af1" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/003/ERR7195153/ERR7195153.fastq.gz" 63912 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195153/I15648.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195153/I15648.MT.bam.bai"
"SAMEA10556459" "PRJEB47891" "ERR7195149" "I15644.MT" "I15644" "ERS8208444" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208444" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/009/ERR7195149/ERR7195149.fastq.gz" 1672570 "713575499afa6cc5b9ead3289c2e0dfc" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/009/ERR7195149/ERR7195149.fastq.gz" 83270 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195149/I15644.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195149/I15644.MT.bam.bai"
"SAMEA10556460" "PRJEB47891" "ERR7195150" "I15646" "I15646" "ERS8208445" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208445" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/000/ERR7195150/ERR7195150.fastq.gz" 223956785 "b9e22ee1e1b5a18a11aa536610ca1195" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/000/ERR7195150/ERR7195150.fastq.gz" 6397394 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195150/I15646.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195150/I15646.bam.bai"
"SAMEA10556458" "PRJEB47891" "ERR7195148" "I15644" "I15644" "ERS8208443" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208443" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/008/ERR7195148/ERR7195148.fastq.gz" 261188898 "1303ca29e0feb962cedbe01ece9f979a" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/008/ERR7195148/ERR7195148.fastq.gz" 8051253 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195148/I15644.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195148/I15644.bam.bai"
"SAMEA10556462" "PRJEB47891" "ERR7195152" "I15648" "I15648" "ERS8208447" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208447" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/002/ERR7195152/ERR7195152.fastq.gz" 152343117 "13c199363b66330b78bd1f55cd473d8e" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/002/ERR7195152/ERR7195152.fastq.gz" 4945218 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195152/I15648.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195152/I15648.bam.bai"
"SAMEA10556480" "PRJEB47891" "ERR7195170" "I15950" "I15950" "ERS8208465" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208465" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/000/ERR7195170/ERR7195170.fastq.gz" 271200408 "e16da13da0b536a7df76345f04a7945c" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/000/ERR7195170/ERR7195170.fastq.gz" 8599743 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195170/I15950.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195170/I15950.bam.bai"
"SAMEA10556477" "PRJEB47891" "ERR7195167" "I15825.MT" "I15825" "ERS8208462" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208462" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/007/ERR7195167/ERR7195167.fastq.gz" 1366875 "fdf28bec874b24fa9574b66891fdcca2" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/007/ERR7195167/ERR7195167.fastq.gz" 75489 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195167/I15825.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195167/I15825.MT.bam.bai"
"SAMEA10556470" "PRJEB47891" "ERR7195160" "I15821" "I15821" "ERS8208455" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208455" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/000/ERR7195160/ERR7195160.fastq.gz" 15559676 "127d77ba49b8b7c3b3689f349b7a2c6d" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/000/ERR7195160/ERR7195160.fastq.gz" 765748 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195160/I15821.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195160/I15821.bam.bai"
"SAMEA10556473" "PRJEB47891" "ERR7195163" "I15823.MT" "I15823" "ERS8208458" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208458" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/003/ERR7195163/ERR7195163.fastq.gz" 686534 "7f370e855af72fb6d31deeb7237bb3be" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/003/ERR7195163/ERR7195163.fastq.gz" 36412 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195163/I15823.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195163/I15823.MT.bam.bai"
"SAMEA10556468" "PRJEB47891" "ERR7195158" "I15819" "I15819" "ERS8208453" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208453" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/008/ERR7195158/ERR7195158.fastq.gz" 84310556 "edf465349858ea68d563f28e15bd3379" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/008/ERR7195158/ERR7195158.fastq.gz" 2878062 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195158/I15819.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195158/I15819.bam.bai"
"SAMEA10556688" "PRJEB47891" "ERR7195377" "I16599.MT" "I16599" "ERS8208672" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208672" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/007/ERR7195377/ERR7195377.fastq.gz" 3288870 "5023f1828fedc6188f96ab232455293f" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/007/ERR7195377/ERR7195377.fastq.gz" 165782 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195377/I16599.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195377/I16599.MT.bam.bai"
"SAMEA10556474" "PRJEB47891" "ERR7195164" "I15824" "I15824" "ERS8208459" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208459" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/004/ERR7195164/ERR7195164.fastq.gz" 293783349 "5b9b26a06951caf34eb83988846517e0" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/004/ERR7195164/ERR7195164.fastq.gz" 8978099 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195164/I15824.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195164/I15824.bam.bai"
"SAMEA10556472" "PRJEB47891" "ERR7195162" "I15823" "I15823" "ERS8208457" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208457" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/002/ERR7195162/ERR7195162.fastq.gz" 184289966 "eb93013dd2912f27bb76c77107e94d16" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/002/ERR7195162/ERR7195162.fastq.gz" 6321488 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195162/I15823.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195162/I15823.bam.bai"
"SAMEA10556517" "PRJEB47891" "ERR7195207" "I16099.MT" "I16099" "ERS8208502" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208502" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/007/ERR7195207/ERR7195207.fastq.gz" 1161043 "1306c543782fe2f3be63cb126c5bbd20" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/007/ERR7195207/ERR7195207.fastq.gz" 58358 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195207/I16099.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195207/I16099.MT.bam.bai"
"SAMEA10556547" "PRJEB47891" "ERR7195237" "I16271.MT" "I16271" "ERS8208532" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208532" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/007/ERR7195237/ERR7195237.fastq.gz" 1142943 "cf9e569b43bb1be1cd22dfd1cfcada5a" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/007/ERR7195237/ERR7195237.fastq.gz" 58806 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195237/I16271.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195237/I16271.MT.bam.bai"
"SAMEA10556545" "PRJEB47891" "ERR7195235" "I16270.MT" "I16270" "ERS8208530" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208530" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/005/ERR7195235/ERR7195235.fastq.gz" 295319 "7e5d1332ae293760a8ccb3983527d57b" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/005/ERR7195235/ERR7195235.fastq.gz" 16459 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195235/I16270.MT.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195235/I16270.MT.bam.bai"
"SAMEA10556546" "PRJEB47891" "ERR7195236" "I16271" "I16271" "ERS8208531" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208531" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/006/ERR7195236/ERR7195236.fastq.gz" 217592310 "458543373a612ed8a0b1a2e3f4452a6e" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/006/ERR7195236/ERR7195236.fastq.gz" 6905093 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195236/I16271.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195236/I16271.bam.bai"
"SAMEA10556552" "PRJEB47891" "ERR7195242" "I16326" "I16326" "ERS8208537" 2021-10-29 2021-10-29 "Illumina HiSeq X" "SINGLE" "GENOMIC" "ILLUMINA" "ERS8208537" "OTHER" "fasp.sra.ebi.ac.uk:/vol1/fastq/ERR719/002/ERR7195242/ERR7195242.fastq.gz" 125308855 "b7318a43f3ae98cb0bff76fa1bbc6de4" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR719/002/ERR7195242/ERR7195242.fastq.gz" 4198104 "ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195242/I16326.bam;ftp.sra.ebi.ac.uk/vol1/run/ERR719/ERR7195242/I16326.bam.bai"
"SAMEA105565
Member

@TCLamnidis TCLamnidis Mar 1, 2023


Lines 1626-1673 look strange. (Github seems to preview different ones 😕 )
sample_alias looks like ena-SAMPLE-TAB-28-10-2021-18:52:59:284-23886, and Poseidon_ID field is empty

Contributor Author

I was unable to match the 2020_AgranatTamir_LevantBA package with the Poseidon data, as it uses completely different naming. I have opened this draft PR #116 for this

Member

Re: 2021_PattersonNature
I manually curated the SSF file for this package for my own uses, which you could copy over so we avoid doing work twice. I took the table you created and, using the paper's supplementary table as well as the AADR, filled in the library_built and udg columns. I also fixed the sample_alias and poseidon_IDs for the lines that I complained about above, using the BAM name as the IID.
I have added some markings to the sample_accession (see this PR for explanations). These will be removed after we discuss things in our next meeting, after which the SSF should be valid and can be copied over.

For future reference, you can find the updated file here. 😃

@stschiff
Member

stschiff commented Mar 2, 2023

Right, I think the quotes were also mentioned by Clemens in some of the Janno tables. @93Boy do you know how to take it from here? Or shall we help with some scripting to get this PR in order?

@TCLamnidis
Member

This sed command should remove any double quotes from a given file.

sed -e 's/"//g' file.ssf >updated_file.ssf

You can edit the files in place with the following command instead (no new file created):

sed -i -e 's/"//g' file.ssf
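To apply the same substitution to every `.ssf` file in a checkout, a loop like the following could work. It is a sketch only: `strip_quotes_in_tree` is a hypothetical helper, and it writes through a temporary file instead of using `sed -i`, because `-i` expects a suffix argument on BSD/macOS sed. Deleting every quote is only safe as long as no field legitimately contains one.

```shell
# Hypothetical helper: remove all double quotes from every .ssf file below a directory.
strip_quotes_in_tree() {
    find "$1" -type f -name '*.ssf' | while IFS= read -r ssf; do
        # Skip files without quotes so they are not rewritten needlessly.
        grep -q '"' "$ssf" || continue
        # Same substitution as above, written via a temporary file.
        sed -e 's/"//g' "$ssf" > "$ssf.tmp" && mv "$ssf.tmp" "$ssf"
    done
}

# Usage: strip_quotes_in_tree .
```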

@nevrome
Member

nevrome commented Apr 16, 2023

After a request for feedback here's what I see:

  • There is still one .py file in 2018_VeeramahPNAS.
  • The field names in the POSEIDON.yml files still have to be changed: enatableFile -> sequencingSourceFile, enatableFileChkSum/.enatableFileChkSum -> sequencingSourceFileChkSum. Please see the schema definition here.
  • There are superfluous, empty lines in the beginning of the POSEIDON.yml files. While they shouldn't be a problem I think it would be better to remove them. But if the validation doesn't complain, then I don't care too much.
  • The column poseidon_id in the .ssf files should be called poseidon_IDs. Please see the schema definition here.
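The renames listed above could be scripted roughly as follows. Both functions are hypothetical helpers that print the corrected file to standard output; they assume the fields sit unindented at the top level of POSEIDON.yml and that the `.ssf` header is tab-separated. The first one also drops empty lines at the beginning of the file.

```shell
# Hypothetical helper: rename the old ENA table fields in a POSEIDON.yml
# and delete leading empty lines.
fix_poseidon_yml() {
    sed -E \
        -e 's/^\.?enatableFileChkSum:/sequencingSourceFileChkSum:/' \
        -e 's/^enatableFile:/sequencingSourceFile:/' \
        "$1" | sed -e '/./,$!d'
}

# Hypothetical helper: rename the poseidon_id column, touching the header line only.
fix_ssf_header() {
    awk 'BEGIN { FS = OFS = "\t" }
         NR == 1 { for (i = 1; i <= NF; i++) if ($i == "poseidon_id") $i = "poseidon_IDs" }
         { print }' "$1"
}

# Usage: fix_poseidon_yml POSEIDON.yml > POSEIDON.yml.new
```

Anchoring the patterns at the start of the line and at the colon keeps the `enatableFile` rule from matching the checksum field.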

It's hard to systematically check the .ssf files. Good that we have the automatic validation to test everything in the end. For now I saw the following issues in some random files I inspected:

  • In 2019_Nikitin_LBK/ENAtable.ssf the columns udg and library_built are swapped. UDG treatment is documented in the strandedness column and vice versa.
  • In the same file I saw the entry mixed in the udg column. This entry is allowed in .janno files, but not in .ssf files, where the column can only have one of the values minus;half;plus.

The first issue I observed in at least one more file. The automatic validation will highlight all files with these two issues. What it will not show are the cases where one or both of these columns are missing. But that is fine.
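Until the automatic validation is run, both issues can be spotted with a small check on the udg column. The sketch below is not the trident validation; `report_bad_udg` is a hypothetical helper that assumes a tab-separated file with a header line and treats empty or n/a entries as missing.

```shell
# Hypothetical helper: list rows whose udg entry is not minus, half or plus.
# Empty and n/a entries are treated as missing and skipped (an assumption).
report_bad_udg() {
    awk 'BEGIN { FS = "\t" }
         # Find the position of the udg column in the header line.
         NR == 1 { for (i = 1; i <= NF; i++) if ($i == "udg") col = i; next }
         col && $col != "" && $col != "n/a" && $col !~ /^(minus|half|plus)$/ {
             printf "%s:%d: unexpected udg value: %s\n", FILENAME, NR, $col
         }' "$1"
}

# Usage: report_bad_udg 2019_Nikitin_LBK/ENAtable.ssf
```

A file with swapped columns shows up here as well, because the library_built entries then appear as unexpected udg values.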

Beyond that I saw the NA value #N/A in at least one .ssf file (2020_Marcus_Sardinia/ENAtable.ssf). I don't like that, but I admit that we didn't specify an NA value for .ssf files. I think we should just stick to what we use for the .janno files: n/a or empty strings.
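Replacing that value with n/a could be done per field, so that text which merely contains the string is left alone. A minimal sketch, assuming a tab-separated file; `normalise_na` is a hypothetical helper name.

```shell
# Hypothetical helper: print the table with every field that is exactly #N/A
# replaced by n/a.
normalise_na() {
    awk 'BEGIN { FS = OFS = "\t" }
         { for (i = 1; i <= NF; i++) if ($i == "#N/A") $i = "n/a"; print }' "$1"
}

# Usage: normalise_na ENAtable.ssf > ENAtable.ssf.new
```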

@93Boy
Contributor Author

93Boy commented Apr 20, 2023

@nevrome I have made all the changes except the mixed udg treatment concern. What should be the substitute for mixed values?

@nevrome
Member

nevrome commented Apr 20, 2023

Great! Almost all ToDos I saw above are resolved. What I still see:

  • A merge conflict with 2020_Nakatsuka_SouthPatagonia/Nakatsuka_SouthPatagonia.janno, which has to be resolved. Can be done with the Resolve conflicts feature here on GitHub.
  • The remnants of an unresolved merge conflict in 2021_CarlhoffNature/ENAtable.ssf. Search for <<<<<<< HEAD to find it. I guess that has something to do with what @TCLamnidis was doing.
  • 2020_Nakatsuka_SouthPatagonia/Nakatsuka_SouthPatagonia.janno and 2021_PattersonNature/2021_PattersonNature.janno got some unnecessary quotes when the Genetic_Source_Accession_IDs were added (I assume). They should ideally be removed again. Also in 2019_Nikitin_LBK/ENAtable.ssf.
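Leftover conflict markers like the one mentioned above can be searched for across the whole checkout. A minimal sketch, assuming a grep that supports `-r` and `--exclude-dir` (GNU and BSD grep do); `find_conflict_markers` is a hypothetical helper.

```shell
# Hypothetical helper: list files below a directory that still contain
# "<<<<<<< " or ">>>>>>> " at the start of a line.
find_conflict_markers() {
    grep -rlE --exclude-dir=.git '^(<<<<<<<|>>>>>>>) ' "$1" | sort
}

# Usage: find_conflict_markers .
```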

@dhananjaya93: As this keeps coming up: Maybe you could try setting Quote all text cells to False in the LibreOffice Calc .csv saving dialogue, after selecting Edit filter settings when you set the output name and format. Afaik this option stays like that after setting it once.

Unlike the other packages, 2021_Zegarac_SoutheasternEurope already has a new entry in the CHANGELOG file. Be careful with that when you apply the update to everything in the end.

Regarding the mixed in the udg column: I think in this case it's OK to leave the column empty for now, right, @TCLamnidis? This is information that can not be easily obtained from the data we have atm and has to be added later.

@stschiff
Member

OK, good job @93Boy @dhananjaya93 (which one is it now?) resolving most. I think we're happy to help resolving these last outstanding points, unless you want to resolve them yourself. Please let us know briefly whether you're working on it, otherwise I suggest that someone else resolves these quickly so we can finally merge this.

@93Boy
Contributor Author

93Boy commented Apr 26, 2023

Yes, I would really appreciate it if someone could kindly fix these issues, so I can focus on adding the rest of the ENA information quickly and complete this task asap. For future packages, I will set Quote all text cells to False, as @nevrome mentioned. Btw, I am using only the @93Boy account for the commits.

@nevrome
Member

nevrome commented May 5, 2023

I added a warning to trident to identify .ssf files with no poseidon_IDs columns/entries. Here's what it returns:

[Warning] Potential consistency issues in file ./2019_BraceNatureEcologyEvolution/ENAtable.ssf: The poseidon_IDs column is completely empty. Package and .ssf file are not linked
[Warning] Potential consistency issues in file ./2020_Bergstrom_HGDP/ENAtable.ssf: The poseidon_IDs column is completely empty. Package and .ssf file are not linked
[Warning] Potential consistency issues in file ./2019_Haber_Crusaders/ENAtable.ssf: The poseidon_IDs column is completely empty. Package and .ssf file are not linked
[Warning] Potential consistency issues in file ./2019_Jeong_InnerEurasia/ENAtable.ssf: The poseidon_IDs column is completely empty. Package and .ssf file are not linked
[Warning] Potential consistency issues in file ./2019_Brace_Britain/ENAtable.ssf: The poseidon_IDs column is completely empty. Package and .ssf file are not linked
[Warning] Potential consistency issues in file ./2021_Yaka_Anatolia/ENAtable.ssf: The poseidon_IDs column is completely empty. Package and .ssf file are not linked

This renders these .ssf files a bit pointless, as already mentioned by @TCLamnidis in our discussion today. Not easy to fix, though, so maybe we should just remove them for this PR. Then @93Boy can work them in with another PR later.
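For files not yet covered by that warning, a comparable check can be approximated outside of trident. The sketch below is not trident's implementation; `poseidon_ids_empty` is a hypothetical helper that assumes a tab-separated file with a header line and counts n/a as empty.

```shell
# Hypothetical check: succeed (exit status 0) if the poseidon_IDs column is
# missing or has no filled entry, fail (exit status 1) otherwise.
poseidon_ids_empty() {
    awk 'BEGIN { FS = "\t" }
         # Find the position of the poseidon_IDs column in the header line.
         NR == 1 { for (i = 1; i <= NF; i++) if ($i == "poseidon_IDs") col = i; next }
         col && $col != "" && $col != "n/a" { filled = 1 }
         END { exit (filled ? 1 : 0) }' "$1"
}

# Usage: for f in */ENAtable.ssf; do poseidon_ids_empty "$f" && echo "$f"; done
```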

@nevrome nevrome changed the title from "Ena data" to "add .ssf for many packages" May 5, 2023
@nevrome
Member

nevrome commented May 5, 2023

I did this and removed the pointless .ssf files here. They live now in #118, #119, #120, #121, #122 and fall under your responsibility, @93Boy 👍

@stschiff
Member

stschiff commented May 8, 2023

OK, I have looked briefly into the directory structure of this PR and had a coarse look at some SSFs. What's the status regarding the automatic checking? I am happy to approve this PR in principle!

@nevrome
Member

nevrome commented May 8, 2023

Automatic checking should be done with a trident version from this PR: poseidon-framework/poseidon-hs#245
For me this validation passes alright now.

@nevrome nevrome merged commit 4c3106d into master May 9, 2023
@nevrome nevrome deleted the ENA_Data branch May 9, 2023 16:44
4 participants