consider to_dict or index for pj.records. IMPORTANT #43

Open
szitenberg opened this issue Nov 26, 2014 · 1 comment
@szitenberg
Member

Records are stored as a list. This means that fetching a record by its accession, or a feature by its feature_id, requires iterating over the list.

Records should be stored as a dictionary (SeqIO.to_dict(SeqIO.parse(...))). This way, getting to a feature will be much faster, because the record is looked up by key instead of by scanning the list.
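A minimal sketch of the difference, using only the standard library so it runs without Biopython. The `Record` class, the helper names and the accessions are made up for illustration; with Biopython, `SeqIO.to_dict(SeqIO.parse(...))` builds the same kind of mapping keyed by `record.id` and raises `ValueError` on duplicate ids.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Stand-in for a SeqRecord; only the `id` attribute matters here."""
    id: str
    features: list = field(default_factory=list)


def find_in_list(records, accession):
    """Linear scan, as needed when records are kept in a list."""
    for record in records:
        if record.id == accession:
            return record
    raise KeyError(accession)


def index_by_id(records):
    """Build an id -> record mapping, refusing duplicate ids."""
    indexed = {}
    for record in records:
        if record.id in indexed:
            raise ValueError(f"Duplicate key '{record.id}'")
        indexed[record.id] = record
    return indexed


# Hypothetical accessions, for illustration only.
records = [Record("AB000001.1"), Record("AB000002.1"), Record("AB000003.1")]
records_by_id = index_by_id(records)

# Both lookups return the same object; the dict lookup does not iterate.
same_object = find_in_list(records, "AB000003.1") is records_by_id["AB000003.1"]
```

The dict holds references to the same record objects as the list, so the cost of the index is mainly the extra mapping, not a second copy of each sequence.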

get_qualifiers_dictionary can then be done by fetching the record by its key and locating the feature by the index given in the _f suffix of the feature_id.
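The lookup described above could be sketched like this. It assumes a feature_id has the form `<record_id>_f<index>` and that the index is the feature's position in the record's feature list; both are assumptions drawn from the description, not confirmed details. The function names and the nested-dict stand-in data are hypothetical.

```python
def split_feature_id(feature_id):
    """Split '<record_id>_f<index>' into (record_id, index)."""
    # rpartition splits on the LAST '_f', so record ids containing '_f' still work.
    record_id, sep, index = feature_id.rpartition("_f")
    if not sep or not index.isdigit():
        raise ValueError(f"Unexpected feature_id format: {feature_id!r}")
    return record_id, int(index)


def get_feature_qualifiers(records_by_id, feature_id):
    """Fetch a feature's qualifiers with one dict lookup and one list index."""
    record_id, index = split_feature_id(feature_id)
    return records_by_id[record_id]["features"][index]["qualifiers"]


# Stand-in data: plain dicts instead of SeqRecord / SeqFeature objects.
records_by_id = {
    "AB000001.1": {
        "features": [
            {"qualifiers": {"gene": ["18S"]}},
            {"qualifiers": {"gene": ["28S"]}},
        ]
    }
}

qualifiers = get_feature_qualifiers(records_by_id, "AB000001.1_f1")
```

If features can be removed or reordered, the suffix number may stop matching the list position, in which case a per-record mapping from feature_id to feature would be the safer choice.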

Metadata editing methods that take a feature_id will become much faster.

Any input file writer will be much faster as well.

This requires some work changing the way pj.records is iterated throughout.

@szitenberg
Member Author

Many things depend on Project.records being a list of SeqRecord objects. For now, I have a private function __get_qualifiers_dictionary__ in parallel to the public one, get_qualifiers_dictionary.

To use it, one needs to first do:

Project.__records_list_to_dict__()
This is not done automatically, i) to avoid a duplicate representation of the data when the private version is not used, and ii) so that the dict is built from the latest data at the moment it is needed.

Then
__get_qualifiers_dictionary__(Project, 'feature_id')

can be used.
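Point ii) matters because the dict is a snapshot of the list at the time of conversion. A small sketch of that behaviour, using a hypothetical `TinyProject` class rather than the real Project (whose internals are not shown here):

```python
class TinyProject:
    """Hypothetical stand-in for a project that keeps records as a list."""

    def __init__(self, records):
        self.records = list(records)  # list of (record_id, payload) tuples
        self.records_dict = None      # built only on request

    def records_list_to_dict(self):
        """Rebuild the id -> payload snapshot from the current list."""
        self.records_dict = {rid: payload for rid, payload in self.records}


project = TinyProject([("AB000001.1", "first")])
project.records_list_to_dict()

# A record added after the conversion is not visible in the snapshot...
project.records.append(("AB000002.1", "second"))
stale = "AB000002.1" not in project.records_dict

# ...until the conversion is run again.
project.records_list_to_dict()
fresh = "AB000002.1" in project.records_dict
```

Calling the conversion right before the dict-based functions, as in the steps above, keeps the snapshot in line with the list.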

get_qualifiers_dictionary is a dependency of several functions and methods. My plan is to add a private version of each of them that uses __get_qualifiers_dictionary__ instead of the public version. For now I've written __make_concatenation_alignments__. This step was the most noticeable bottleneck, and it can now be avoided.

Instead of doing:

concat = Concatenation(...)
Project.add_concatenation(concat)
Project.make_concatenation_alignment()

You can do:

concat = Concatenation(...)
Project.add_concatenation(concat)
Project.__records_list_to_dict__()
Project.__make_concatenation_alignment__()

This will make the concatenation process much faster. Downside: you'll now have all the data twice in your Project, once as a list and again as a dict.

I am working towards making the changes throughout and eliminating the duplication.
