Refactorization of Analyze schema #11

Merged 6 commits on Jan 5, 2024

luissian (Member) commented Dec 19, 2023

Test results for 70 loci.

gene_not_found
num_genes_per_allele
quality_of_locus

statistics.csv

luissian changed the title from "Refactorization" to "Refactorization of Analyze schema" on Jan 4, 2024
luissian marked this pull request as ready for review on January 4, 2024 16:59

from setuptools import setup, find_packages

version = "2.2.0"
Member

Maybe version 3.0.0?

Member Author

Yes, changed to version 3.0.0

# for schema_file in schema_files:
results = []
start = time.perf_counter()
with concurrent.futures.ProcessPoolExecutor() as executor:
Member

Please add a parameter for setting the number of threads to use, so we can test it on the HPC.

Member Author

Added a parameter for the number of CPUs to use. Prokka's CPU usage was set to 3. More testing is required to find the optimal value.
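
One way such an option could be wired in, as a minimal sketch only: the `--cpus` flag and the helpers `parse_args`, `resolve_workers`, `analyze_locus` and `analyze_all` are hypothetical names for illustration, not the ones used in this PR. It relies on `ProcessPoolExecutor` accepting `max_workers`, which caps the number of worker processes.

```python
import argparse
import concurrent.futures
import os


def parse_args(argv=None):
    """Parse a hypothetical --cpus option (the flag name is an assumption)."""
    parser = argparse.ArgumentParser(description="Analyze schema (sketch)")
    parser.add_argument(
        "--cpus",
        type=int,
        default=1,
        help="number of worker processes to use",
    )
    return parser.parse_args(argv)


def resolve_workers(cpus):
    """Clamp the request to at least 1 and at most the CPUs the OS reports."""
    return max(1, min(cpus, os.cpu_count() or 1))


def analyze_locus(schema_file):
    """Placeholder for the real per-locus analysis (hypothetical)."""
    return schema_file, len(schema_file)


def analyze_all(schema_files, cpus):
    """Run analyze_locus on every file using at most `cpus` processes."""
    workers = resolve_workers(cpus)
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as executor:
        # executor.map preserves the order of schema_files in the results.
        return list(executor.map(analyze_locus, schema_files))
```

On the HPC this would be driven by something like `analyze_all(schema_files, parse_args().cpus)`, with `--cpus` matched to the cores requested from the scheduler.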

a_quality[record.id] = {"quality": "Good quality", "reason": "-"}
allele_seq[record.id] = str(record.seq)
a_quality[record.id]["length"] = len(str(record.seq))
if len(record.seq) % 3 != 0:
Member

Have you checked that Biopython's translate function only fails when the length is not a multiple of 3? It would make sense to me, but just in case. Also, I think keeping the protein info output could be interesting, as it was in the previous code.
Can you run the previous code for analyze schema so we can check the output and see if we are missing something?

Member Author

This part of the code has now been replaced to use Biopython.
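
On the question above: as far as I know, Biopython's `Seq.translate(cds=True)` raises `TranslationError` not only when the length is not a multiple of three, but also for a missing start codon, a missing final stop codon, or an in-frame internal stop. The sketch below mirrors those four checks in plain Python so they can be read at a glance. It is not the PR's code: `cds_problem` is a hypothetical helper, the reason strings are paraphrased rather than Biopython's exact messages, and accepting only `ATG` as a start codon is a simplification (bacterial codon tables allow alternative starts).

```python
# Standard stop codons; ATG as the only accepted start is a simplification.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG"}


def cds_problem(seq):
    """Return a reason string if seq is not a complete CDS, else None."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        return "Sequence length is not a multiple of three"
    # Split the sequence into consecutive, non-overlapping codons.
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if not codons or codons[0] not in START_CODONS:
        return "First codon is not a start codon"
    if codons[-1] not in STOP_CODONS:
        return "Final codon is not a stop codon"
    # Any stop codon before the last position truncates the protein.
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return "Extra in-frame stop codon found"
    return None
```

A `None` result corresponds to "Good quality"; any other value could be stored as the `reason` for a "Bad quality" allele.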

a_quality[record.id]["length"] = len(str(record.seq))
if len(record.seq) % 3 != 0:
    a_quality[record.id]["quality"] = "Bad quality"
    a_quality[record.id]["reason"] = "Can not be converted to protein"
Member

I would word this as in the previous code: "Not a CDS, cannot be converted to protein", or something similar.

Member Author

Error messages are now taken from the Biopython error message text.

):
bad_quality_record.append(record.id)

if self.remove_duplicated:
Member

Duplicate and subset checks must always be run. Options such as remove_duplicated or remove_subset should only control whether those alleles are removed from the schema; the statistics and the quality flag should always indicate that an allele is a duplicate or a subset.

Member Author

Duplicated and subset alleles are now shown in the quality data.
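
The separation requested above (always flag, optionally remove) could be sketched as follows. This is an illustration, not the merged implementation: `flag_duplicates_and_subsets` and `filter_schema` are hypothetical helpers, the reason strings are made up, and the subset search is a simple quadratic scan. Only the `remove_duplicated` and `remove_subset` option names come from the discussion.

```python
def flag_duplicates_and_subsets(allele_seq):
    """Return {allele_id: reason} for duplicated or contained alleles."""
    flags = {}
    first_seen = {}  # sequence -> first allele id carrying it
    for allele_id, seq in allele_seq.items():
        if seq in first_seen:
            flags[allele_id] = f"Duplicate of {first_seen[seq]}"
        else:
            first_seen[seq] = allele_id
    # Compare unique sequences pairwise; O(n^2), fine for a sketch.
    for seq, allele_id in first_seen.items():
        for other_seq, other_id in first_seen.items():
            if len(seq) < len(other_seq) and seq in other_seq:
                flags[allele_id] = f"Subset of {other_id}"
                break
    return flags


def filter_schema(allele_seq, flags, remove_duplicated=False, remove_subset=False):
    """Drop flagged alleles only when the matching option is enabled."""
    kept = {}
    for allele_id, seq in allele_seq.items():
        reason = flags.get(allele_id, "")
        if remove_duplicated and reason.startswith("Duplicate"):
            continue
        if remove_subset and reason.startswith("Subset"):
            continue
        kept[allele_id] = seq
    return kept
```

With this split, the flags are always computed and can feed the quality report, while the options only change which alleles are written back to the schema.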

saramonzon (Member)

After checking the new and old code, I would also add information about the number of different proteins found per gene, plus some graphs and statistics on the length variability of the alleles for each gene, as in the previous code.
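
As a rough idea of what per-gene length statistics could contain, here is a minimal sketch using only the standard library. `length_stats` and its output keys are hypothetical, not taken from the previous or the new code; the input would be the allele lengths of one locus, for example the `length` values already stored in `a_quality`.

```python
import statistics


def length_stats(allele_lengths):
    """Summarise the allele lengths of one locus."""
    if not allele_lengths:
        raise ValueError("no alleles to summarise")
    return {
        "num_alleles": len(allele_lengths),
        "min_length": min(allele_lengths),
        "max_length": max(allele_lengths),
        "mean_length": statistics.mean(allele_lengths),
        # On Python 3.8+ the first mode is returned when several values tie.
        "mode_length": statistics.mode(allele_lengths),
        # pstdev treats the alleles as the whole population; 0.0 for one allele.
        "stdev_length": statistics.pstdev(allele_lengths),
    }
```

A large `stdev_length`, or a wide gap between `min_length` and `max_length`, would point to loci whose alleles vary a lot in length.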

saramonzon (Member)

In the pie graph:
blue should be labelled with the known gene name
red should be "hypothetical protein". I assume Prokka labels them like this? (please check this)

saramonzon (Member)

In the good quality/bad quality pie, I assume that all the possible bad-quality reasons (if any) would appear, right?
No start codon, no stop codon, etc.

luissian merged commit d212ab4 into BU-ISCIII:develop on Jan 5, 2024