Removing sequences with manual overlap #47

Open
bruceamurphy opened this issue Jan 10, 2019 · 3 comments

@bruceamurphy

The option to remove sequences using -seqoverlap and -resoverlap produces results I find unexpected. From what I can tell, when -resoverlap checks whether a residue matches the other sequences, it does not consider base identity (for a DNA alignment), only whether the position holds a gap character or any DNA base. Therefore, for a gap-free alignment, if I change all bases in a sequence to e.g. "T", the sequence will not be removed from the alignment, even with strict settings (e.g. -resoverlap 0.9 -seqoverlap 95).

Is this how it is supposed to work? If so, I'm curious why it works this way rather than as I expected. Thanks

@Vicfero
Contributor

Vicfero commented Feb 8, 2019

Hi @bruceamurphy

First of all, thanks for pointing out this issue. Indeed, this is not the desired or expected behaviour. We've been looking at the issue and think we have a possible solution.

Could you please apply this patch and re-make trimal to see if it works as expected? HOW-TO

Patch: patch_issue#47.txt

@Vicfero
Contributor

Vicfero commented Feb 19, 2019

Hi @bruceamurphy

@scapella and I have been discussing the behaviour of this method, and it seems it works as expected. Sorry for the misunderstanding in my last message.

And thanks @scapella for the clarification here.

Let me explain what it is used for:

Overlap is defined as having a gap in both positions, an indetermination in both positions, or a residue in both positions.
Its main purpose is to remove sequences that share only a small region with the rest of the alignment, while their remaining regions are not shared and are filled with gaps.

It may be clearer with an example:

Sp8    -----GLG-----------TKSD---NNNNNNNNNNNNNNNNWV-----------------
Sp17   --FAYTAPDLLL-IGFLLKTV-ATFG-----------------DTWFQLWQGLDLNKMPVX
Sp10   ------DPAVL--FVIMLGTI-TKFS-----------------SEWFFAWLGLEINMMVIX
Sp26   AAAAAAAAALLTYLGLFLGTDYENFA-----------------AAAANAWLGLEINMMAQX

In this case, Sp8 may be removed, depending on the thresholds, as it contains:

  • Blocks that are filled with Ns (asparagine), whereas the rest of the alignment has gaps there.
  • Blocks that contain gaps, whereas the rest of the alignment has residues there.
  • Blocks that contain gaps, whereas the rest of the alignment has an indetermination (X) there.

So, which residue occupies a position is not relevant for this trimming method, as it only checks whether the two sequences being compared contain:

  • Both a residue (whatever residue it is),
  • Both a gap,
  • Both an indetermination,

and it compares the sequences pair by pair to obtain a score for each sequence.
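The comparison described above can be sketched in a few lines of Python. This is only an illustration, not trimAl's actual code: the function names, the toy sequences, and the way the two thresholds are combined (a position counts as "good" when the fraction of other sequences in the same class reaches the -resoverlap value, and a sequence is kept when its percentage of good positions reaches the -seqoverlap value) are assumptions made for the example.

```python
# Illustrative sketch only; names and threshold handling are assumptions,
# not taken from the trimAl source code.

def classify(symbol, indet="X"):
    """Reduce a symbol to its class: gap, indetermination, or residue."""
    if symbol == "-":
        return "gap"
    if symbol.upper() == indet:
        return "indet"
    return "residue"  # any residue counts; its identity is ignored


def good_position_percent(seqs, index, res_overlap, indet="X"):
    """Percentage of positions in seqs[index] that overlap well with the rest."""
    target = seqs[index]
    others = [s for i, s in enumerate(seqs) if i != index]
    good = 0
    for col, symbol in enumerate(target):
        kind = classify(symbol, indet)
        # Count the other sequences holding the same class in this column.
        hits = sum(1 for other in others if classify(other[col], indet) == kind)
        if hits / len(others) >= res_overlap:
            good += 1
    return 100.0 * good / len(target)


def keep_sequence(seqs, index, res_overlap, seq_overlap, indet="X"):
    """True when the sequence has enough good positions to stay in the alignment."""
    return good_position_percent(seqs, index, res_overlap, indet) >= seq_overlap


# Made-up DNA examples (indetermination symbol "N").
all_t = ["ACGTACGT", "ACGAACGT", "TTTTTTTT"]
gappy = ["ACGTACGT", "ACGAACGT", "--GT----"]
print(keep_sequence(all_t, 2, 0.9, 95, indet="N"))
print(keep_sequence(gappy, 2, 0.9, 95, indet="N"))
```

In this sketch the all-"T" sequence is kept, because every position is still a residue facing residues, while the mostly-gap sequence falls below the threshold.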

So, when you change positions to 'T' in your alignment, each of them still counts as a residue, so the score for the sequence stays high and it is not removed from the alignment. This is the desired behaviour.

@bruceamurphy
Author

Hi,
Thank you for explaining; I'm sorry I misunderstood the intended functionality.
I did try the patch you made (apologies, I forgot to reply) and it seems to do what I originally hoped the command did, i.e. remove sequences that do not match the rest of the alignment due to mismatched bases OR gaps/Ns.
Perhaps you might consider incorporating both functions. Or is there a reason you feel this is not desirable?
