Simplify how known results specified for classifier assessment #425
Comments
An empty column field could be ambiguous. Might have to reject the empty value, and insist on something like Or maybe the empty string meaning no expected species and
Notes to self... Looking at the
That is a very long string to repeat 27 times in a new metadata column. Thinking like a database designer, that would be redundant and better done with a linking table. For instance, matching the existing "Sample-type" column to a species list (the above, or an empty list for the negative controls). That could be done with a second TSV file...
So in this example that would be
Most of the worked examples have a similar section in their Sometimes that could be too heavy (e.g. just one or two controls), thus maybe both are useful? Could do something like this:
Examples where a linking table seems useful:
Examples which would be fine adding an expected species column directly to the metadata:
Examples with no known samples:
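The linking-table idea from the notes above (matching the existing "Sample-type" column to a species list, with an empty list for the negative controls) could be sketched roughly as follows. This is only an illustration: the file contents, the `Sample` and `Expected-species` column names, the `;` separator, and both helper functions are assumptions, not anything the tool currently provides.

```python
import csv
import io

# Hypothetical metadata table: one row per sample, reusing the existing
# "Sample-type" column. Sample names and types are made up for illustration.
METADATA_TSV = """\
Sample\tSample-type
S01\tmock-community
S02\tmock-community
S03\tnegative-control
"""

# Hypothetical linking table: one row per sample type. The column names and
# the ";" separator are assumptions; the species names are placeholders.
LINKING_TSV = """\
Sample-type\tExpected-species
mock-community\tGenus one;Genus two
negative-control\t
"""


def load_linking_table(handle, sep=";"):
    """Map each sample type to its list of expected species."""
    expected = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        field = (row["Expected-species"] or "").strip()
        # An empty field is read here as "no species expected".
        expected[row["Sample-type"]] = field.split(sep) if field else []
    return expected


def expected_per_sample(metadata_handle, linking):
    """Resolve the expected species list for every sample in the metadata."""
    return {
        row["Sample"]: linking[row["Sample-type"]]
        for row in csv.DictReader(metadata_handle, delimiter="\t")
    }


linking = load_linking_table(io.StringIO(LINKING_TSV))
per_sample = expected_per_sample(io.StringIO(METADATA_TSV), linking)
```

The long species string is then written once per sample type rather than once per sample, which is the redundancy the database-design argument is about.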
Currently (as of v0.10.6), the `thapbi_pict assess` command (also invoked from the `thapbi_pict pipeline` command) requires multiple `<sample>.known.tsv` input files, and they are over-complicated for current needs. This also overcomplicates the assess command input parsing in general.

Up until v0.7.11 (2021-03-30), the assess command could work at the level of sequences in samples, unique sequences across all samples, or at sample level. Since then it only works at sample level.
Reflecting that history, the `<sample>.known.tsv` input files were based on the intermediate `<sample>.<method>.tsv` files, and could contain species listings for each unique sequence in the sample (no longer used), or a wildcard entry applied to the sample as a whole (currently used). e.g.

In practice the assess command has not used the NCBI taxid, so all we really need now is the species list expected for each sample. This might more simply be given as a column of a metadata table.
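As a rough illustration of reading the expected species straight from a metadata column, here is a minimal Python sketch. The table contents, the `Sample` and `Expected-species` column names, the `;` separator, and the `parse_expected_species` helper are all hypothetical, and treating an empty field as "no species expected" is just one possible convention.

```python
import csv
import io

# Hypothetical metadata table with an added expected-species column.
# Sample names, column names and species names are placeholders.
METADATA_TSV = """\
Sample\tSample-type\tExpected-species
S01\tmock-community\tGenus one;Genus two
S02\tnegative-control\t
"""


def parse_expected_species(handle, index_col, species_col, sep=";"):
    """Return a dict mapping each sample name to its set of expected species.

    An empty field is read as "no species expected" (an assumption; a
    stricter parser could reject it instead).
    """
    expected = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        sample = row[index_col]
        if sample in expected:
            raise ValueError(f"Duplicate sample name: {sample!r}")
        field = (row[species_col] or "").strip()
        expected[sample] = {sp.strip() for sp in field.split(sep)} if field else set()
    return expected


expected = parse_expected_species(
    io.StringIO(METADATA_TSV), index_col="Sample", species_col="Expected-species"
)
```

Since the assessment only works at sample level, a mapping of sample name to species set like this would carry everything the per-sample files are currently used for.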
Possible interface might build on the existing options to the pipeline:
e.g. could add:
(The obvious one letter codes for the expected/known species of `-x`, `-e` and `-k` are currently used for the index column, metadata encoding, and marker.)