Refactorization of Analyze schema #11
from setuptools import setup, find_packages

version = "2.2.0"
Maybe version 3.0.0?
Yes, changed to version 3.0.0
# for schema_file in schema_files:
results = []
start = time.perf_counter()
with concurrent.futures.ProcessPoolExecutor() as executor:
Please add a parameter to set the number of threads/CPUs to use, so we can test it on the HPC.
Added a parameter for the number of CPUs to use. The number of CPUs for Prokka was set to 3; more testing is required to find the optimal value.
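For reference, a minimal sketch of how a CPU parameter could be passed through to the executor. The names `analyze_one`, `analyze_all`, `--cpus` and `executor_cls` are hypothetical and not taken from this PR; `max_workers` is the standard `concurrent.futures` argument that limits the pool size.

```python
import argparse
import concurrent.futures


def analyze_one(schema_file):
    """Hypothetical per-locus worker; here it only returns the name length."""
    return schema_file, len(schema_file)


def analyze_all(schema_files, cpus=1,
                executor_cls=concurrent.futures.ProcessPoolExecutor):
    """Run analyze_one over all files with at most `cpus` workers."""
    max_workers = max(1, cpus)  # guard against 0 or negative values
    with executor_cls(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as schema_files
        return list(executor.map(analyze_one, schema_files))


def parse_args(argv=None):
    """Hypothetical command-line interface exposing the CPU count."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--cpus", type=int, default=1,
                        help="number of worker processes to use")
    return parser.parse_args(argv)
```

On the HPC this could then be called as `analyze_all(schema_files, cpus=parse_args().cpus)`, ideally with the value matching the cores reserved for the job.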
a_quality[record.id] = {"quality": "Good quality", "reason": "-"}
allele_seq[record.id] = str(record.seq)
a_quality[record.id]["length"] = len(str(record.seq))
if len(record.seq) % 3 != 0:
Have you checked that Biopython's translate function only fails when the length is not a multiple of 3? It would make sense to me, but just in case. Also, I think keeping the protein info output could be interesting, as it was in the previous code.
Can you run the previous code for analyze schema so we can check the output and see if we are missing something?
This part of the code has now been replaced to use Biopython.
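On the question of which sequences fail to translate: as far as I know, Biopython's `Seq.translate(cds=True)` raises a `TranslationError` for more than one reason (length not a multiple of three, first codon not a start codon, last codon not a stop codon, extra in-frame stop codon), while a plain `translate()` call only warns about a partial codon. The plain-Python sketch below mirrors those four checks so the cases can be compared against the old output. It is not Biopython's code; the function name, the start codon set and the message wording are assumptions.

```python
# Assumption: a small set of bacterial start codons; the real set depends
# on the translation table in use.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problem(seq):
    """Return a reason why seq is not a complete CDS, or None if it is."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        return "length is not a multiple of three"
    # Split the sequence into consecutive, non-overlapping codons
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if not codons or codons[0] not in START_CODONS:
        return "first codon is not a start codon"
    if codons[-1] not in STOP_CODONS:
        return "final codon is not a stop codon"
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return "extra in-frame stop codon"
    return None
```

An allele whose length is a multiple of three can therefore still be flagged, for example when it lacks a stop codon.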
a_quality[record.id]["length"] = len(str(record.seq))
if len(record.seq) % 3 != 0:
a_quality[record.id]["quality"] = "Bad quality"
a_quality[record.id]["reason"] = "Can not be converted to protein"
I would word this as in the previous code: "Not a CDS, cannot be converted to protein", or something similar.
Error messages are now taken from Biopython's own message text.
):
bad_quality_record.append(record.id)

if self.remove_duplicated:
Duplicate and subset checks must always be done. The remove_duplicated, remove_subset, etc. options are only for removing those alleles from the schema; the statistics and the quality flag should still indicate that there is a duplicate or a subset.
Duplicated and subset alleles are now shown in the quality data.
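To make the requested behaviour concrete, here is a rough sketch that separates flagging from removal: every allele is always checked, and the remove options only decide what stays in the schema. The function names and flag strings are hypothetical, and the pairwise subset check is a simple quadratic comparison, not necessarily what the PR implements.

```python
def flag_duplicates_and_subsets(allele_seq):
    """Always-on checks: return {allele_id: [reasons]} for every allele."""
    flags = {allele_id: [] for allele_id in allele_seq}
    ids = list(allele_seq)
    first_seen = {}  # sequence -> id of the first allele carrying it
    for allele_id in ids:
        seq = allele_seq[allele_id]
        if seq in first_seen:
            flags[allele_id].append(f"duplicate of {first_seen[seq]}")
        else:
            first_seen[seq] = allele_id
    for a in ids:
        for b in ids:
            shorter = len(allele_seq[a]) < len(allele_seq[b])
            if a != b and shorter and allele_seq[a] in allele_seq[b]:
                flags[a].append(f"subset of {b}")
                break  # one containing allele is enough to flag
    return flags


def filter_schema(allele_seq, flags, remove_duplicated=False,
                  remove_subset=False):
    """Drop flagged alleles only when the matching option is set."""
    kept = {}
    for allele_id, seq in allele_seq.items():
        reasons = flags[allele_id]
        if remove_duplicated and any(r.startswith("duplicate") for r in reasons):
            continue
        if remove_subset and any(r.startswith("subset") for r in reasons):
            continue
        kept[allele_id] = seq
    return kept
```

With both options left at `False` the schema is returned unchanged, but the flags are still available for the statistics and quality report.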
After checking the new and old code, I would also add information about the number of different proteins found per gene, and some graphs and stats about the length variability of the alleles for each gene, as in the previous code.
In the pie graph:
In the good quality/bad quality pie, I assume that all the possible bad-quality reasons (if any) would appear, right?
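As a starting point for the length variability stats, something along these lines could be computed per gene with the standard library only. The function name and the output keys are hypothetical, and the sketch assumes the locus has at least one allele.

```python
import statistics


def length_stats(allele_seq):
    """Summarise allele lengths for one locus ({allele_id: sequence})."""
    lengths = [len(seq) for seq in allele_seq.values()]
    return {
        "n_alleles": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "mode": statistics.mode(lengths),
        # stdev needs at least two values; report 0.0 for a single allele
        "stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
    }
```

The resulting dictionaries could be written as extra columns of the statistics file or used as input for the per-gene graphs.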
Test results for 70 loci.
statistics.csv