Simplify how known results specified for classifier assessment #425

Open
peterjc opened this issue Jan 13, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@peterjc
Owner

peterjc commented Jan 13, 2022

Currently (as of v0.10.6), the thapbi_pict assess command (also invoked from the thapbi_pict pipeline command) requires multiple <sample>.known.tsv input files, which are more complicated than current needs justify. This also complicates the assess command's input parsing in general.

Up until v0.7.11 (2021-03-30), the assess command could work at the level of sequences in samples, unique sequences across all samples, or at sample level. Since then it only works at sample level.

Reflecting that history, the <sample>.known.tsv input files were based on the intermediate <sample>.<method>.tsv files, and could contain species listings for each unique sequence in the sample (no longer used), or a wildcard entry applied to the sample as a whole (currently used). e.g.

*(tab)taxid(tab)species name1;species name2;etc

In practice the assess command has not used the NCBI taxid, so all we really need now is the species list expected for each sample. This might more simply be given as a column of a metadata table.
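As a rough illustration of how little of that wildcard line is actually needed, here is a minimal sketch that reads one such entry and keeps only the species list. The function name `parse_wildcard_known` is hypothetical (not part of thapbi_pict), and the taxid value in the demo line is a placeholder.

```python
def parse_wildcard_known(line):
    """Return the expected species from a wildcard line of a known.tsv file.

    Assumes the three-column layout shown above: '*', taxid, species list.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3 or parts[0] != "*":
        raise ValueError(f"Not a wildcard entry: {line!r}")
    # The taxid column is read but ignored, since assess does not use it
    _wildcard, _taxid, species = parts
    return [name.strip() for name in species.split(";") if name.strip()]


# Placeholder taxid of 0; species names taken from the fungal_mock list
print(parse_wildcard_known("*\t0\tAlternaria alternata;Aspergillus flavus\n"))
```

An empty species column (as for a negative control) gives an empty list.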

A possible interface might build on the existing options to the pipeline:

  -t METADATAFILE, --metadata METADATAFILE
                        Optional tab separated table containing metadata
                        indexed by sample name. Must also specify the columns
                        with -c / --metacols.
  -x COL, --metaindex COL
                        If using metadata, which column contains the
                        sequenced sample names. Default 1. Field can contain
                        multiple semi-colon separated names catering to the
                        fact that a field sample could be sequenced multiple
                        times with technical replicates

e.g. could add:

  -X COL, --expected COL
                        If using metadata, which column contains the known
                        species list present in each sample for classifier
                        performance assessment. Field can contain multiple
                        semi-colon separated names. Default not used.

(The obvious one-letter codes for the expected/known species, -x, -e and -k, are already taken by the index column, metadata encoding, and marker options respectively.)
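To make the proposed -X / --expected idea concrete, here is a minimal sketch of reading the expected species from a metadata column. It is not the thapbi_pict implementation: the function `load_expected`, its argument names, the skipping of a `#` header line, and the demo table are all assumptions for illustration. Columns are counted from one, as with -x / --metaindex.

```python
import csv
import io


def load_expected(handle, index_col=1, expected_col=2):
    """Map each sequenced sample name to its set of expected species."""
    expected = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and a '#' header line (an assumption)
        species = {s.strip() for s in row[expected_col - 1].split(";") if s.strip()}
        # One field sample may have been sequenced several times (replicates),
        # so the index field can hold several semi-colon separated names
        for sample in row[index_col - 1].split(";"):
            if sample.strip():
                expected[sample.strip()] = species
    return expected


# Made-up table: sequenced sample names, then expected species
demo = io.StringIO(
    "#Samples\tExpected\n"
    "S1;S2\tAlternaria alternata;Aspergillus flavus\n"
    "NC1\t\n"
)
print(load_expected(demo))
```

Here both replicates S1 and S2 share one species set, and the empty field for NC1 becomes an empty set.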

@peterjc peterjc added the enhancement New feature or request label Jan 13, 2022
@peterjc
Owner Author

peterjc commented Jan 13, 2022

An empty column field could be ambiguous.

Might have to reject the empty value, and insist on something like - for no species expected (negative control) and ? for unknown (environmental sample)?

Or maybe the empty string meaning no expected species and ? for unknown is OK?
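A minimal sketch of the stricter first option (reject the empty value, - for no species, ? for unknown) might look like this. The function name `parse_expected_field` and the choice of None to represent "unknown" are assumptions, not a settled design.

```python
def parse_expected_field(value):
    """Interpret one expected-species field.

    Returns None for an unknown sample, otherwise a (possibly empty) set.
    """
    value = value.strip()
    if value == "?":
        return None  # unknown, e.g. an environmental sample
    if value == "-":
        return set()  # negative control, no species expected
    if not value:
        raise ValueError("Empty expected species field; use '-' or '?'")
    return {name.strip() for name in value.split(";") if name.strip()}
```

The more lenient second option would return an empty set for the empty string instead of raising.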

@peterjc
Owner Author

peterjc commented Jan 20, 2022

Notes to self...

Looking at the fungal_mock example, the mock community list currently in file mock_community.known.tsv is as follows (where setup.sh makes per-sample symlinks to this):

Alternaria alternata;Aspergillus flavus;Neosartorya fischeri;Penicillium expansum;Candida apicola;Saccharomyces cerevisiae;Claviceps purpurea;Trichoderma reesei;Fusarium graminearum;Fusarium oxysporum;Fusarium verticillioides;Saitoella complicata;Rhizoctonia solani;Naganishia albida;Ustilago maydis;Chytriomyces hyalinus;Rhizophagus irregularis;Mortierella verticillata;Rhizomucor miehei

That is a very long string to repeat 27 times in a new metadata column. Thinking like a database designer, that would be redundant and better done with a linking table. For instance, matching the existing "Sample-type" column to a species list (the above, or an empty list for the negative controls). That could be done with a second TSV file...

  -K FILENAME, --knownsamples FILENAME
                        Optional tab separated table containing expected species
                        list for each sample type (matching column -X / --expected
                        in the main metadata file).
  -X COL, --expected COL
                        Which metadata column contains the sample type listed in the
                        -K / --knownsamples TSV file.

So in this example that would be -X 5 for the "Sample-type" column in the metadata, with -K pointing to a new 2-column TSV file like this:

#Sample-type (tab) Expected-species
fungal mock community (tab) Alternaria alternata;Aspergillus flavus;Neosartorya fischeri;Penicillium expansum;Candida apicola;Saccharomyces cerevisiae;Claviceps purpurea;Trichoderma reesei;Fusarium graminearum;Fusarium oxysporum;Fusarium verticillioides;Saitoella complicata;Rhizoctonia solani;Naganishia albida;Ustilago maydis;Chytriomyces hyalinus;Rhizophagus irregularis;Mortierella verticillata;Rhizomucor miehei
negative control (tab) (blank)
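A minimal sketch of loading such a two-column linking table follows. The function name `load_known_samples` is hypothetical, and the demo uses a shortened species list rather than the full mock community.

```python
import csv
import io


def load_known_samples(handle):
    """Map each sample type to its expected species set from a two-column TSV."""
    known = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the '#' header line
        sample_type = row[0]
        # A missing or blank second column means no species are expected
        species = row[1] if len(row) > 1 else ""
        known[sample_type] = {s.strip() for s in species.split(";") if s.strip()}
    return known


demo = io.StringIO(
    "#Sample-type\tExpected-species\n"
    "fungal mock community\tAlternaria alternata;Aspergillus flavus\n"
    "negative control\t\n"
)
known = load_known_samples(demo)
print(sorted(known))
```

Each sample would then look up its "Sample-type" value in this mapping, so the long species list is stored only once.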

Most of the worked examples have a similar section in their setup.sh parsing the sample type from the metadata in order to symlink to the relevant known species list TSV file.

Sometimes that could be too heavy (e.g. just one or two controls), thus maybe both are useful? Could do something like this:

  -K FILENAME, --knownsamples FILENAME
                        Optional tab separated table containing expected species
                        list for each sample type (matching column -X / --expected
                        in the main metadata file).
  -X COL, --expected COL
                        Which metadata column contains the sample type listed in the
                        -K / --knownsamples TSV file, or an expected species list.
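One way the dual-purpose -X column could be resolved is sketched below: treat the field as a sample type if it appears in the linking table, and otherwise as a literal species list. The function name `resolve_expected` and the lookup-first precedence are assumptions; a sample type that happened to match a species name would be ambiguous under this rule.

```python
def resolve_expected(field, known_samples):
    """Return the expected species for one -X / --expected metadata field.

    known_samples maps sample type to species set (from -K / --knownsamples).
    """
    field = field.strip()
    if field in known_samples:
        return known_samples[field]  # the field names a sample type
    # Otherwise fall back to reading the field as a species list
    return {name.strip() for name in field.split(";") if name.strip()}


# Made-up linking table with a shortened species list
table = {
    "fungal mock community": {"Alternaria alternata", "Aspergillus flavus"},
    "negative control": set(),
}
print(resolve_expected("negative control", table))
print(resolve_expected("Ustilago maydis;Rhizomucor miehei", table))
```

With no -K file given, the mapping would simply be empty and every field would be read as a species list.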

@peterjc
Owner Author

peterjc commented Jan 20, 2022

Examples where a linking table seems useful:

  • fungal_mock (merging replicates in metadata would help)
  • fecal_sequel (merging replicates in metadata would help, but all samples have same sp.)
  • drained_ponds (merging replicates in metadata would help)
  • soil_nematodes
  • great_lakes

Examples which would be fine adding an expected species column directly to the metadata:

  • endangered_species (already merged replicates in metadata)
  • woody_hosts (only 4 known samples, most being unknowns)

Examples with no known samples:

  • recycled_water

@peterjc peterjc changed the title Simplify how known results as specified for classifier assessment Simplify how known results specified for classifier assessment Jan 21, 2022