abPOA user specifiable seeds #37

Open
benedictpaten opened this issue Mar 24, 2022 · 6 comments

@benedictpaten

Hi @yangao07, I've been experimenting a little with the seeding in abPOA and am wondering whether it would be possible to add an option for users to provide their own alignment seeds. My issue is that minimizers are not ideal for anchoring more divergent sequences. I have had more luck using maximal unique matches (MUMs) with a chaining process more like that in the original MUMmer program. Looking ahead, I also expect we will want to anchor alignments on unique markers to facilitate the alignment of highly repetitive sequences (e.g. satellite arrays). I'm interested in your perspective on this.
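
To make the idea concrete, here is a minimal Python sketch of unique-match anchoring followed by collinear chaining. It is an illustration under stated assumptions, not abPOA's or MUMmer's code: unique shared k-mers stand in for true MUMs (they are not extended to maximal length), the chaining is a simple O(n^2) longest-increasing-chain dynamic program, and the function names `unique_kmer_anchors` and `chain_collinear` are hypothetical.

```python
from collections import Counter


def unique_kmer_anchors(a, b, k):
    """Return sorted (pos_a, pos_b) pairs for k-mers occurring exactly once in a and once in b.

    Simplification: real MUMs are extended to maximal length; fixed-size
    k-mers are used here only to keep the sketch short.
    """
    def unique_positions(s):
        kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
        counts = Counter(kmers)
        return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}

    ua, ub = unique_positions(a), unique_positions(b)
    return sorted((ua[kmer], ub[kmer]) for kmer in ua.keys() & ub.keys())


def chain_collinear(anchors):
    """Return a longest subset of anchors strictly increasing in both coordinates."""
    anchors = sorted(anchors)
    if not anchors:
        return []
    best = [1] * len(anchors)   # best[i]: length of the best chain ending at anchor i
    prev = [-1] * len(anchors)  # prev[i]: predecessor of anchor i in that chain
    for i, (ai, bi) in enumerate(anchors):
        for j in range(i):
            aj, bj = anchors[j]
            if aj < ai and bj < bi and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    # Trace back from the end of the longest chain.
    i = max(range(len(anchors)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]
```

Anchors that are not collinear with the best chain (for example, matches caused by rearrangements or spurious repeats) are dropped, so only the surviving chain would be used to guide the alignment.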

@benedictpaten benedictpaten changed the title abPOA seeds abPOA user specifiable seeds Mar 24, 2022
@yangao07
Owner

Yes, in theory abPOA could take any type of seeding and chaining result to guide the POA process.
I chose minimizers simply for speed.
Using a more mature seeding method such as MUMs is definitely preferable for divergent sequences.

I think adding an option to take MUM seeds/anchors as input is much easier than implementing MUM finding inside abPOA directly.
The only concern is that we need a well-defined input format.

@benedictpaten
Author

benedictpaten commented Mar 25, 2022 via email

@yangao07
Owner

PAF format is nice. To feed abPOA, we only need to record which anchor comes from which sequence in the PAF file.
Anchors across multiple sequences may be too stringent and could lead to too few seeds.
I think pairwise anchors should be just fine. Specifically, we just need the anchors between every two adjacent sequences.
The order could be the input order or the order determined by a progressive guide tree (as you already know).
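
As a rough sketch of what this could look like, the Python below reads anchors from PAF lines and keeps only those between sequences that are adjacent in a given order. It relies only on the 12 mandatory tab-separated PAF columns (query name, length, start, end, strand, target name, length, start, end, matches, block length, mapping quality). The `Anchor` type and both function names are hypothetical, and this is not an input format that abPOA currently defines.

```python
from collections import defaultdict
from typing import NamedTuple


class Anchor(NamedTuple):
    query: str
    q_start: int
    q_end: int
    strand: str
    target: str
    t_start: int
    t_end: int


def parse_paf_anchors(lines):
    """Parse PAF lines into Anchor records using the mandatory columns only."""
    anchors = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        f = line.split("\t")
        if len(f) < 12:
            raise ValueError(f"expected at least 12 PAF columns, got {len(f)}")
        anchors.append(Anchor(f[0], int(f[2]), int(f[3]), f[4],
                              f[5], int(f[7]), int(f[8])))
    return anchors


def adjacent_pair_anchors(anchors, order):
    """Group anchors by (earlier, later) sequence pair, keeping only pairs adjacent in `order`."""
    rank = {name: i for i, name in enumerate(order)}
    by_pair = defaultdict(list)
    for a in anchors:
        if a.query not in rank or a.target not in rank:
            continue  # anchor refers to a sequence outside this alignment
        qi, ti = rank[a.query], rank[a.target]
        if abs(qi - ti) == 1:
            by_pair[(order[min(qi, ti)], order[max(qi, ti)])].append(a)
    # Sort each pair's anchors by query start so they can be consumed in order.
    return {pair: sorted(hits, key=lambda a: a.q_start)
            for pair, hits in by_pair.items()}
```

Here `order` can be either the input order or an order taken from a progressive guide tree; anchors between non-adjacent sequences are simply ignored.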

@benedictpaten
Author

benedictpaten commented Mar 28, 2022 via email

@glennhickey
Contributor

I think for Cactus, it's important to have an API to pass the anchors in via a struct (as opposed to a FILE*). Whether that struct is PAF-based or not is less important.

Also, if we are going to keep using abPOA's progressive ordering, then we'd need an API to get that order (if it's not already there) before computing the MUM anchors. Something like:

[abpoa] get_progressive_order(sequences)
[cactus] compute_mum_anchors(sequences, order)
[abpoa] get_msa(sequences, anchors)
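
The three steps above could be wired together roughly as follows, with anchors passed in memory rather than through a file. This is only a sketch in Python for brevity: `Anchor`, `anchored_msa`, and the three callables are hypothetical placeholders for whatever abPOA and Cactus would actually expose, not existing functions in either project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    """One anchor between two sequences, held in memory (hypothetical layout)."""
    pos_a: int   # start in the earlier sequence of the pair
    pos_b: int   # start in the later sequence of the pair
    length: int  # anchor length in bases


def anchored_msa(sequences, get_order, compute_anchors, get_msa):
    """Run the order -> anchors -> MSA workflow with caller-supplied steps.

    get_order(sequences)        -> list of indices into `sequences`
    compute_anchors(seq_a, seq_b) -> list of Anchor for one adjacent pair
    get_msa(sequences, order, anchors) -> the alignment result
    """
    order = get_order(sequences)
    anchors = {}
    # Only adjacent sequences in the progressive order need anchors.
    for prev, cur in zip(order, order[1:]):
        anchors[(prev, cur)] = compute_anchors(sequences[prev], sequences[cur])
    return get_msa(sequences, order, anchors)
```

Keeping the steps as separate callables means the ordering and the final MSA can stay inside abPOA while the anchor computation lives in the caller.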

thanks!

@benedictpaten
Author

benedictpaten commented Mar 28, 2022 via email
