Filter HMMs update #1881

Open
wants to merge 28 commits into
base: master

Contributor

@mschecht mschecht commented Feb 1, 2022

Hi Everyone,

I would like to add a new self attribute called "HMM_hits_domain_table_was_filtered" to contigsDBs after they have been processed by anvi-script-filter-hmm-hits-table. I believe this will be important for keeping track of how the hmm_hits table was changed. Additionally, it will prevent cases where anvi-script-filter-hmm-hits-table is re-run in snakemake workflows.

Will this require a migration script?

Cheers,
Matt

@meren
Member

meren commented Feb 1, 2022

Hey Matt, yes, it certainly will require a migration script. But are you sure that HMM_hits_domain_table_was_filtered alone will be enough? What about how a given table was filtered? Should we not also store the values of the --target-coverage and --query-coverage params for each source, if we want to keep track of which sources were filtered?

I can propose these variable names for consistency (as an example for two sources):

HMM_dom_filter_sources ...........: Bacteria_71,MY_MODEL
HMM_dom_filter_target_coverage ...: 0.50,0.90
HMM_dom_filter_query_coverage ....: 0.00,0.50
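
To make the proposal concrete, here is a minimal sketch of how three parallel comma-separated attributes could be read back into per-source settings. The helper name `parse_dom_filter_attrs` and the use of a plain dict in place of the contigs-db self table are assumptions for illustration, not anvi'o code; only the three attribute names come from the suggestion above.

```python
def parse_dom_filter_attrs(self_table):
    """Turn the three parallel CSV attributes into {source: (target_cov, query_cov)}.

    `self_table` is a plain dict standing in for the self table (an assumption).
    """
    sources = self_table.get("HMM_dom_filter_sources", "")
    if not sources:
        # Nothing has been filtered yet.
        return {}

    names = sources.split(",")
    target = [float(v) for v in self_table["HMM_dom_filter_target_coverage"].split(",")]
    query = [float(v) for v in self_table["HMM_dom_filter_query_coverage"].split(",")]

    # The three lists are only meaningful if they line up item by item.
    if not (len(names) == len(target) == len(query)):
        raise ValueError("The HMM_dom_filter_* attributes must have the same number of items.")

    return {n: (t, q) for n, t, q in zip(names, target, query)}


example = {
    "HMM_dom_filter_sources": "Bacteria_71,MY_MODEL",
    "HMM_dom_filter_target_coverage": "0.50,0.90",
    "HMM_dom_filter_query_coverage": "0.00,0.50",
}
settings = parse_dom_filter_attrs(example)
```

Keeping the order of the three lists identical is what ties a coverage value to its source, so the length check is the one invariant worth enforcing when reading them.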

@mschecht
Contributor Author

mschecht commented Feb 3, 2022

Thanks for the suggestion @meren!

anvi-script-filter-hmm-hits-table now records the self attributes HMM_dom_filter_sources, HMM_dom_filter_target_coverage, and HMM_dom_filter_query_coverage as CSVs in a contigsDB!

What are the next steps for making a migration script?

@meren
Member

meren commented Feb 3, 2022

I'm testing this with the contigs-db in the infant gut dataset, but this is what I'm getting:

anvi-run-hmms -c CONTIGS.db \
              -I Bacteria_71 \
              -T 4 --domain-hits-table \
              --hmmer-output-dir HMM_OUTPUT

After which I find these files in the output directory:

ls HMM_OUTPUT/
hmm.domtable  hmm.table  hmm.table.fixed

And then,

anvi-script-filter-hmm-hits-table -c CONTIGS.db \
                                  --domain-hits-table HMM_OUTPUT/hmm.domtable \
                                  --hmm-source Bacteria_71 \
                                  --target-coverage 95 \
                                  --query-coverage 95

Database Path ................................: CONTIGS.db
Domtblout Path ...............................: HMM_OUTPUT/hmm.domtable


Config Error: Doesn't look like a --domtblout... anvi'o can't even... Please look at this
              error message to find out what happened: invalid literal for int() with base 10:
              'Ribosomal_S16'

The same happens for any of the other output files. What am I doing wrong?

By the way, I think the help menu for --target-coverage and --query-coverage (as well as their description in the program help online) is not very helpful :( It should say at some point "this is for the coverage of your gene by the model" and "this is the coverage of your model by your gene" to make sure everyone is comfortable with it.
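
As an illustration of the two definitions, here is a rough sketch that computes both coverages for a single domain from one domtblout line. It assumes the file was produced by hmmsearch (so the target is the gene and the query is the model) and uses the standard domtblout column order (tlen, qlen, hmm from/to, ali from/to). The function name, the made-up example line, and the inclusive `+ 1` in the lengths are assumptions; the script may compute these values differently.

```python
def domain_coverages(domtblout_line):
    """Return (target_coverage, query_coverage) for one domain hit.

    Assumes hmmsearch output: target = gene sequence, query = HMM model.
    """
    fields = domtblout_line.split()
    tlen, qlen = int(fields[2]), int(fields[5])          # gene length, model length
    hmm_from, hmm_to = int(fields[15]), int(fields[16])  # coordinates on the model
    ali_from, ali_to = int(fields[17]), int(fields[18])  # coordinates on the gene

    # Coverage of your gene by the model.
    target_coverage = (ali_to - ali_from + 1) / tlen
    # Coverage of your model by your gene.
    query_coverage = (hmm_to - hmm_from + 1) / qlen
    return target_coverage, query_coverage


# A made-up line (not real output): gene 1234 of length 100 hit by a model of length 60.
line = ("1234 - 100 Ribosomal_S16 PF00886.18 60 1e-20 70.0 0.1 "
        "1 1 1e-22 1e-20 69.5 0.1 1 60 5 84 3 86 0.95 -")
target_cov, query_cov = domain_coverages(line)
```

In this example the alignment spans 80 of the gene's 100 residues and all 60 positions of the model, so the target coverage is 0.8 and the query coverage is 1.0.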

Please also see the changes I just made :)

@mschecht
Contributor Author

mschecht commented Feb 4, 2022

@meren and I spoke offline about his comment above, and here are the action items to address it:

  • anvi-run-hmms:
    • If a user requests a domtblout, there needs to be a warning that in order to use anvi-script-filter-hmm-hits-table they must create the domtblout with the program hmmsearch.
    • Need to check the domtblout path earlier in the program, before loading the contigsDB etc.
  • anvi-script-filter-hmm-hits-table:
    • Need a sanity check for coverage values, e.g. 0.0-1.0.
    • Need a sanity check that incoming domtblout genes match the HMM source. This can be solved by checking whether the set(genes) in the domtblout matches the HMM source genes.
    • Improve the --target-coverage and --query-coverage documentation.
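
The two sanity checks for anvi-script-filter-hmm-hits-table could look roughly like the sketch below. The function names and error messages are hypothetical and the code does not use anvi'o's own error classes; it only shows the shape of the checks.

```python
def check_coverage(param_name, value):
    """Reject coverage values outside 0.0-1.0 (e.g. 95 given instead of 0.95)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{param_name} must be between 0.0 and 1.0, but got {value}.")
    return value


def check_genes_match(domtblout_genes, source_genes):
    """Make sure every model name seen in the domtblout belongs to the HMM source."""
    unknown = set(domtblout_genes) - set(source_genes)
    if unknown:
        raise ValueError(
            "These models in the domtblout are not in the HMM source: "
            + ", ".join(sorted(unknown))
        )
```

This sketch uses a subset test rather than strict set equality, on the assumption that not every model in a source has to produce a hit in a given contigs-db.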

@mschecht
Contributor Author

mschecht commented Feb 7, 2022

@meren I attempted to make the sanity check that compares the genes in the incoming domain-hits-table and the HMM_hits table in the contigsDB being filtered. Please let me know what you think. Here's a test below with infant gut where the domain-hits-table is from Archaea_76 and anvi-script-filter-hmm-hits-table now throws an error with Bacteria_71:

rm -rf HMM_OUTPUT_hmmsearch; \
anvi-run-hmms -c CONTIGS.db -I Archaea_76 -T 4 --domain-hits-table --hmmer-output-dir HMM_OUTPUT_hmmsearch --hmmer-program hmmsearch --just-do-it; \
anvi-script-filter-hmm-hits-table -c CONTIGS.db --hmm-source Bacteria_71  --domain-hits-table HMM_OUTPUT_hmmsearch/hmm.domtable --target-coverage 0.9 --query-coverage 0.9 

Config Error: The genes in HMM_OUTPUT_hmmsearch/hmm.domtable don't seem to be in the hmm_hits
              table from your contigsDB: CONTIGS.db. Please double check you are filtering
              with the same HMM_source that you used to create
              HMM_OUTPUT_hmmsearch/hmm.domtable when you ran anvi-run-hmms.

@meren
Member

meren commented Feb 15, 2022

Sorry for the late response to this, Matt. I'm just catching up with this. So I guess this is ready for me to test? Should I just implement a migration script and move on?

BTW, here is a heartbreaking detail for laughs:

12 days ago:

image

8 days ago:

image

@meren
Member

meren commented Feb 15, 2022

Database Path ................................: CONTIGS.db
Domtblout Path ...............................: HMM_OUTPUT/hmm.domtable
Num hits before filtering ....................: 685
Num hits after filtering .....................: 542
Num filtered .................................: 143

WARNING
===============================================
This is the base parser class--a part of the code you should never hear from.
PLEASE READ THIS CAREFULLY. While anvi'o was trying to parse some files
associated with the program `hmmsearch`, it found that 1 of the lines in this
file were not able to made sense of. This part of the code does not know
anything more than that. It doesn't even know what file it is. But in general
this error occurs when the mapping function does not find what its looking for
in a line. For instance, a value that was supposed to be an integer ends up
being actually a piece of text or something. Well. Here are the line numbers if
you care and can make sense of this information: 1

Number of weak hits removed by HMMER parser ..: 0
Number of hits in annotation dict  ...........: 542

In the output above, the number of weak hits removed by the parser is reported as 0, when it is actually 143 according to the "Num filtered" line.

@meren
Member

meren commented Feb 15, 2022

This warning was coming from an issue,

WARNING

This is the base parser class--a part of the code you should never hear from.
PLEASE READ THIS CAREFULLY. While anvi'o was trying to parse some files
associated with the program hmmsearch, it found that 1 of the lines in this
file were not able to made sense of. This part of the code does not know
anything more than that. It doesn't even know what file it is. But in general
this error occurs when the mapping function does not find what its looking for
in a line. For instance, a value that was supposed to be an integer ends up
being actually a piece of text or something. Well. Here are the line numbers if
you care and can make sense of this information: 1

That is now fixed by the last two comments.

Please take a look at the reports on how many hits are filtered and so on.

@meren
Member

meren commented Feb 15, 2022

I found a bug. You run anvi-script-filter-hmm-hits-table on a contigs-db with some --target-coverage and --query-coverage values. You check anvi-db-info and see them there. Then you run anvi-script-filter-hmm-hits-table on the same contigs-db with some other --target-coverage and --query-coverage values. When you check anvi-db-info again, you still see the old ones.
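
One way to get the expected behavior is to overwrite the stored values when a source has already been recorded, and to append only when the source is new. The sketch below shows that logic on a plain dict standing in for the self table. The function name `record_dom_filter` is hypothetical, and this is not how the script is actually implemented.

```python
KEYS = (
    "HMM_dom_filter_sources",
    "HMM_dom_filter_target_coverage",
    "HMM_dom_filter_query_coverage",
)


def record_dom_filter(self_table, source, target_coverage, query_coverage):
    """Store the filter settings for `source`, replacing any earlier entry for it."""
    # Read the current CSV values; empty or missing attributes become empty lists.
    sources, target, query = (
        [item for item in self_table.get(key, "").split(",") if item] for key in KEYS
    )
    entry = (f"{target_coverage:.2f}", f"{query_coverage:.2f}")

    if source in sources:
        # Re-run on the same source: overwrite the old values in place.
        i = sources.index(source)
        target[i], query[i] = entry
    else:
        sources.append(source)
        target.append(entry[0])
        query.append(entry[1])

    for key, values in zip(KEYS, (sources, target, query)):
        self_table[key] = ",".join(values)
    return self_table


table = {}
record_dom_filter(table, "Bacteria_71", 0.50, 0.00)
record_dom_filter(table, "MY_MODEL", 0.90, 0.50)
record_dom_filter(table, "Bacteria_71", 0.95, 0.95)  # second run with new values
```

After the third call the table still lists Bacteria_71 once, now with the new coverage values, which is what anvi-db-info should show after a re-run.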

@meren changed the title from "Filter hm ms update" to "Filter HMMs update" on Mar 24, 2022