Odd Representative Cluster Behaviour / Selection #902

Open
rpalmavejares opened this issue Nov 14, 2024 · 0 comments


Expected Behavior

A sequence that is a cluster representative (core sequence) should be listed only in its own cluster, not as a member of a different one.

The example is pretty simple:

TSC000_k99_1536813_gene1 TSC000_k99_1536813_gene1
TSC000_k99_1536813_gene1 TSC002_k99_986141_gene1
TSC000_k99_319273_gene1 TSC000_k99_319273_gene1
TSC000_k99_1362901_gene1 TSC000_k99_1362901_gene1
TSC000_k99_143397_gene1 TSC000_k99_143397_gene1

In this case, every sequence in the first column is a cluster representative (core sequence), and each one appears in both columns on the first line of its cluster.
A representative should appear in the second column only on its own self-pairing line, as in the first, third, fourth, and fifth lines above. Anywhere else, it would indicate that the representative is also a member of a different cluster.
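The expected layout can be checked mechanically. The following is a minimal Python sketch, not part of MMseqs2, assuming the TSV holds one whitespace-separated `representative member` pair per line as in the example above. The helper names `group_clusters` and `missing_self_pair` are made up for illustration.

```python
def group_clusters(lines):
    """Group 'representative member' pairs into {representative: [members]}."""
    clusters = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        rep, member = fields
        clusters.setdefault(rep, []).append(member)
    return clusters


def missing_self_pair(clusters):
    """Return representatives whose member list does not include themselves."""
    return [rep for rep, members in clusters.items() if rep not in members]


# The five rows from the example above.
expected_rows = [
    "TSC000_k99_1536813_gene1 TSC000_k99_1536813_gene1",
    "TSC000_k99_1536813_gene1 TSC002_k99_986141_gene1",
    "TSC000_k99_319273_gene1 TSC000_k99_319273_gene1",
    "TSC000_k99_1362901_gene1 TSC000_k99_1362901_gene1",
    "TSC000_k99_143397_gene1 TSC000_k99_143397_gene1",
]

clusters = group_clusters(expected_rows)
print(len(clusters))                # 4
print(missing_self_pair(clusters))  # []
```

For the full file, the same functions could be fed an open file handle instead of the list, since iterating over a file yields its lines.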

Current Behavior

I see the following odd behavior and don't know how to interpret the result.

Some cluster representatives also appear as members of other clusters.

TSC053_k99_1024271_gene1 TSC040_k99_1291964_gene1
TSC053_k99_1024271_gene1 TSC045_k99_976664_gene1
TSC047_k99_1354130_gene1 TSC053_k99_1024271_gene1

Notice that sequence TSC053_k99_1024271_gene1 has no self-pairing line of the form:
TSC053_k99_1024271_gene1 TSC053_k99_1024271_gene1

In addition, sequence TSC053_k99_1024271_gene1 is output as a member of another cluster:

TSC047_k99_1354130_gene1 TSC047_k99_1354130_gene1
TSC047_k99_1354130_gene1 TSC053_k99_1024271_gene1

As you can see, TSC047_k99_1354130_gene1 has the normal expected output.

The problem is that both TSC053_k99_1024271_gene1 and TSC047_k99_1354130_gene1 appear in the representative sequence output file.
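To list every case like this one, a small script can scan the cluster TSV for sequences that occur as a representative (first column) and also as a member under a different representative (second column). This is only a sketch under the assumption that each line is a whitespace-separated `representative member` pair. The function name `reps_listed_under_other_reps` is hypothetical and not an MMseqs2 feature.

```python
def reps_listed_under_other_reps(lines):
    """Return sorted (sequence, other_representative) pairs where `sequence`
    is itself a representative but is also listed as a member of
    `other_representative`'s cluster."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            pairs.append((fields[0], fields[1]))
    representatives = {rep for rep, _ in pairs}
    conflicts = {
        (member, rep)
        for rep, member in pairs
        if member != rep and member in representatives
    }
    return sorted(conflicts)


# Rows quoted in this issue.
rows = [
    "TSC053_k99_1024271_gene1 TSC040_k99_1291964_gene1",
    "TSC053_k99_1024271_gene1 TSC045_k99_976664_gene1",
    "TSC047_k99_1354130_gene1 TSC047_k99_1354130_gene1",
    "TSC047_k99_1354130_gene1 TSC053_k99_1024271_gene1",
]

for sequence, other_rep in reps_listed_under_other_reps(rows):
    print(sequence, "is a representative but is also a member of", other_rep)
```

On the rows above this reports TSC053_k99_1024271_gene1 under TSC047_k99_1354130_gene1. Run over the whole TSV, the length of the returned list gives a count of affected sequences.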

Steps to Reproduce (for bugs)

These are the commands that I ran.

$mmseqs createdb $1 NEW_GENE_CATALOG/tara_source.mmseqs.db --dbtype 2 --shuffle 0

$mmseqs cluster NEW_GENE_CATALOG/tara_source.mmseqs.db NEW_GENE_CATALOG/tara_source.mmseqs.cluster ./tmp --remove-tmp-files 0 --kmer-per-seq-scale 0 --cluster-mode 2 --min-seq-id 0.95 --threads 20 --cov-mode 1 -c 0.9 --split-memory-limit 700G

$mmseqs createsubdb NEW_GENE_CATALOG/tara_source.mmseqs.cluster NEW_GENE_CATALOG/tara_source.mmseqs.db NEW_GENE_CATALOG/tara_source.mmseqs.rep

$mmseqs convert2fasta NEW_GENE_CATALOG/tara_source.mmseqs.rep NEW_GENE_CATALOG/tara_source.mmseqs.rep.fasta

$mmseqs createtsv NEW_GENE_CATALOG/tara_source.mmseqs.db NEW_GENE_CATALOG/tara_source.mmseqs.db NEW_GENE_CATALOG/tara_source.mmseqs.cluster NEW_GENE_CATALOG/tara_source.mmseqs.cluster.tsv

Your Environment

I'm using MMseqs2 Version: 15.6f452

How should I interpret this result: as a bug, or as an outlier? I have 288 confirmed instances of this same issue in the TSV output. Should I keep the good cluster and merge the bad one into it?

rpalmavejares changed the title from "Odd Representative Cluster Behaviour / Selecction" to "Odd Representative Cluster Behaviour / Selection" on Nov 14, 2024