Simplify how known results specified for classifier assessment #425

peterjc · 2022-01-13T10:04:17Z

Currently (as of v0.10.6), the thapbi_pict assess command (also invoked from the thapbi_pict pipeline command) requires multiple <sample>.known.tsv input files, which are more complicated than current needs justify. This in turn complicates the input parsing of the assess command in general.

Up until v0.7.11 (2021-03-30), the assess command could work at the level of sequences in samples, unique sequences across all samples, or at sample level. Since then it only works at sample level.

Reflecting that history, the <sample>.known.tsv input files were based on the intermediate <sample>.<method>.tsv files, and could contain species listings for each unique sequence in the sample (no longer used), or a wildcard entry applied to the sample as a whole (currently used). e.g.

*(tab)taxid(tab)species name1;species name2;etc

In practice the assess command has not used the NCBI taxid, so all we really need now is the species list expected for each sample. This might more simply be given as a column of a metadata table.

Possible interface might build on the existing options to the pipeline:

  -t METADATAFILE, --metadata METADATAFILE
                        Optional tab separated table containing metadata
                        indexed by sample name. Must also specify the columns
                        with -c / --metacols.
  -x COL, --metaindex COL
                        If using metadata, which column contains the
                        sequenced sample names. Default 1. Field can contain
                        multiple semi-colon separated names catering to the
                        fact that a field sample could be sequenced multiple
                        times with technical replicates

e.g. could add:

  -X COL, --expected COL
                        If using metadata, which column contains the known
                        species list present in each sample for classifier
                        performance assessment. Field can contain multiple
                        semi-colon separated names. Default not used.

(The obvious one-letter codes for the expected/known species, -x, -e and -k, are already used for the index column, metadata encoding, and marker respectively.)
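A minimal sketch of how the proposed -X / --expected column might be read alongside the existing -x / --metaindex column. The function name `load_expected_species`, the sample names, and the treatment of lines starting with `#` as headers are all assumptions for illustration, not existing thapbi_pict code:

```python
import csv
import io


def load_expected_species(handle, index_col=1, expected_col=None):
    """Map each sequenced sample name to its set of expected species.

    Columns are counted from one, as with -x / --metaindex.
    Hypothetical helper, not part of thapbi_pict.
    """
    expected = {}
    if expected_col is None:
        return expected  # proposed default: column not used
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # assumption: header/comment lines start with '#'
        # Pad short rows so a missing trailing field reads as empty
        row += [""] * (max(index_col, expected_col) - len(row))
        species = {
            sp.strip() for sp in row[expected_col - 1].split(";") if sp.strip()
        }
        # One metadata row may cover several technical replicates
        for sample in row[index_col - 1].split(";"):
            if sample.strip():
                expected[sample.strip()] = species
    return expected


# Made-up sample names; species taken from the fungal_mock list
demo = io.StringIO(
    "#Samples\tType\tExpected\n"
    "S1a;S1b\tmock\tAlternaria alternata;Aspergillus flavus\n"
    "NC1\tcontrol\t\n"
)
result = load_expected_species(demo, index_col=1, expected_col=3)
```

Here both replicates S1a and S1b would share the same expected species set, and NC1 would get an empty set.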


peterjc · 2022-01-13T10:44:23Z

An empty column field could be ambiguous.

Might have to reject the empty value, and insist on something like - for no species expected (negative control) and ? for unknown (environmental sample)?

Or maybe the empty string meaning no expected species and ? for unknown is OK?
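The two options could be sketched as one parser with a strictness switch. This is only an illustration of the choice under discussion: the function name `parse_expected_field` and the `strict` flag are hypothetical, and neither convention has been decided here.

```python
from typing import Optional, Set


def parse_expected_field(field: str, strict: bool = False) -> Optional[Set[str]]:
    """Return the expected species set, or None if the sample is unknown.

    Hypothetical helper illustrating the two conventions discussed above.
    """
    value = field.strip()
    if value == "?":
        return None  # unknown (environmental sample): exclude from assessment
    if value == "-":
        return set()  # explicitly no species expected (negative control)
    if not value:
        if strict:
            # First option: reject the ambiguous empty value outright
            raise ValueError("Empty expected species field; use '-' or '?'")
        return set()  # second option: empty means no species expected
    return {sp.strip() for sp in value.split(";") if sp.strip()}
```

Returning None for unknown samples keeps them distinct from negative controls, whose empty set should still be scored (any species call there is a false positive).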

peterjc · 2022-01-20T16:13:49Z

Notes to self...

Looking at the fungal_mock example, the mock community list currently in file mock_community.known.tsv is as follows (where setup.sh makes per-sample symlinks to this):

Alternaria alternata;Aspergillus flavus;Neosartorya fischeri;Penicillium expansum;Candida apicola;Saccharomyces cerevisiae;Claviceps purpurea;Trichoderma reesei;Fusarium graminearum;Fusarium oxysporum;Fusarium verticillioides;Saitoella complicata;Rhizoctonia solani;Naganishia albida;Ustilago maydis;Chytriomyces hyalinus;Rhizophagus irregularis;Mortierella verticillata;Rhizomucor miehei

That is a very long string to repeat 27 times in a new metadata column. Thinking like a database designer, that would be redundant and better done with a linking table. For instance, matching the existing "Sample-type" column to a species list (the above, or an empty list for the negative controls). That could be done with a second TSV file...

  -K FILENAME, --knownsamples FILENAME
                        Optional tab separated table containing expected species
                        list for each sample type (matching column -X / --expected
                        in the main metadata file).
  -X COL, --expected COL
                        Which metadata column contains the sample type listed in the
                        -K / --knownsamples TSV file.

So in this example that would be -X 5 for the "Sample-type" column in the metadata, with -K pointing to a new 2-column TSV file like this:

#Sample-type (tab) Expected-species
fungal mock community (tab) Alternaria alternata;Aspergillus flavus;Neosartorya fischeri;Penicillium expansum;Candida apicola;Saccharomyces cerevisiae;Claviceps purpurea;Trichoderma reesei;Fusarium graminearum;Fusarium oxysporum;Fusarium verticillioides;Saitoella complicata;Rhizoctonia solani;Naganishia albida;Ustilago maydis;Chytriomyces hyalinus;Rhizophagus irregularis;Mortierella verticillata;Rhizomucor miehei
negative control (tab) (blank)
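Loading such a linking table could look roughly like the following. The function name `load_known_samples` is hypothetical, the species list is truncated to two entries for brevity, and skipping lines that start with `#` is an assumption based on the header shown above:

```python
import csv
import io


def load_known_samples(handle):
    """Map each sample type to its set of expected species.

    Hypothetical reader for the proposed -K / --knownsamples TSV file.
    """
    known = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # assumption: '#' marks the header line
        sample_type = row[0].strip()
        field = row[1] if len(row) > 1 else ""  # blank means nothing expected
        known[sample_type] = {sp.strip() for sp in field.split(";") if sp.strip()}
    return known


# Two-column example, with the mock community list cut short
table = io.StringIO(
    "#Sample-type\tExpected-species\n"
    "fungal mock community\tAlternaria alternata;Aspergillus flavus\n"
    "negative control\t\n"
)
known = load_known_samples(table)
```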

Most of the worked examples have a similar section in their setup.sh parsing the sample type from the metadata in order to symlink to the relevant known species list TSV file.

Sometimes that could be too heavy (e.g. just one or two controls), thus maybe both are useful? Could do something like this:

  -K FILENAME, --knownsamples FILENAME
                        Optional tab separated table containing expected species
                        list for each sample type (matching column -X / --expected
                        in the main metadata file).
  -X COL, --expected COL
                        Which metadata column contains the sample type listed in the
                        -K / --knownsamples TSV file, or an expected species list.
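One way the combined behaviour might work is to try the field as a key in the linking table first, and fall back to reading it as a species list. This is a sketch under that assumption only; `resolve_expected` is a hypothetical name and the lookup order is a design guess rather than anything settled above:

```python
def resolve_expected(field, known_samples):
    """Interpret a metadata field as a sample type or a direct species list.

    Hypothetical helper: known_samples maps sample type to a species set,
    as might be loaded from the -K / --knownsamples file (empty if not given).
    """
    value = field.strip()
    if value in known_samples:
        return known_samples[value]  # field names a sample type
    # Otherwise treat the field itself as a semi-colon separated species list
    return {sp.strip() for sp in value.split(";") if sp.strip()}


# Linking table with a truncated mock community list, for illustration
known_samples = {
    "fungal mock community": {"Alternaria alternata", "Aspergillus flavus"},
    "negative control": set(),
}
via_table = resolve_expected("fungal mock community", known_samples)
direct = resolve_expected("Ustilago maydis;Rhizomucor miehei", known_samples)
```

A drawback of this fallback is that a misspelt sample type would silently be read as a species name, so a real implementation would probably want to validate species names somehow.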

peterjc · 2022-01-20T16:40:24Z

Examples where a linking table seems useful:

fungal_mock (merging replicates in metadata would help)
fecal_sequel (merging replicates in metadata would help, but all samples have same sp.)
drained_ponds (merging replicates in metadata would help)
soil_nematodes
great_lakes

Examples which would be fine adding an expected species column directly to the metadata:

endangered_species (already merged replicates in metadata)
woody_hosts with only 4 known samples, most being unknowns

Examples with no known samples:

recycled_water

peterjc added the enhancement (New feature or request) label Jan 13, 2022

peterjc changed the title ~~Simplify how known results as specified for classifier assessment~~ Simplify how known results specified for classifier assessment Jan 21, 2022

peterjc mentioned this issue Jan 6, 2023

Discussion for soil_nematodes example #539

Closed

peterjc mentioned this issue Feb 24, 2023

Append classifier columns to sample-tally TSV #549

Merged
