Exclude circular synthetic (or chimeric) sequences #29

Merged
j23414 merged 2 commits into main from exclude-circular-synthetic-strains
Feb 26, 2024

Conversation

j23414
Contributor

@j23414 j23414 commented Feb 16, 2024

Description of proposed changes

For now, exclude the circular synthetic sequences from the phylogenetic build flagged by #28.

Alternatively, we could try trimming the plasmid sequence from the ends of these records, if that proves feasible.

I ran a quick check of the other records in phylogenetic/data/metadata_all.tsv for sequences longer than 15,000 nt and found none. Please flag any records I may have missed.
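For anyone who wants to repeat the length check, here is a minimal sketch. It assumes the metadata TSV has `accession` and `length` columns; those column names are guesses, not confirmed against the actual file, so adjust them as needed. The function name `find_long_records` is made up for this example.

```python
import csv
import io


def find_long_records(tsv_handle, id_column="accession",
                      length_column="length", max_length=15000):
    """Return (id, length) pairs for records longer than max_length.

    The column names are assumptions about the metadata layout.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    long_records = []
    for row in reader:
        raw = (row.get(length_column) or "").strip()
        if not raw.isdigit():
            continue  # skip rows with a missing or non-numeric length
        length = int(raw)
        if length > max_length:
            long_records.append((row[id_column], length))
    return long_records


# Tiny in-memory example with made-up accessions.
example = "accession\tlength\nA1\t10700\nB2\t15894\n"
print(find_long_records(io.StringIO(example)))
```

To run it on the real file, pass an open handle instead of the in-memory example, e.g. `open("phylogenetic/data/metadata_all.tsv", newline="")`.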

Related issue(s)

Checklist

  • Checks pass

@j23414 j23414 requested a review from a team February 16, 2024 20:48
@j23414 j23414 marked this pull request as draft February 20, 2024 19:43
@j23414 j23414 force-pushed the exclude-circular-synthetic-strains branch 3 times, most recently from 7bccd57 to 4ac5318 Compare February 22, 2024 00:54
@j23414 j23414 marked this pull request as ready for review February 22, 2024 00:55
@j23414
Contributor Author

j23414 commented Feb 22, 2024

re: 4ac5318

Duplicates (identical sequences that may or may not come from distinct samples) were highlighted in this comment: #28 (comment). Further discussion is in the thread starting here: #28 (comment).

Note that some of the excluded duplicates are patent (PAT) records for vaccine candidates, so they are omitted from the phylogenetic analysis of current dengue diversity.

When a set of duplicates included a reference sequence (accession prefix NC_), the reference was retained and the other duplicates were excluded.

When multiple VRL records shared the same nucleotide sequence, the record that sorts first alphabetically was retained and the others were excluded.

The rationale for each exclusion is documented in the respective comments in the exclude.txt file.
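The selection rules above can be sketched roughly as follows. This is an illustration, not the procedure actually used to build exclude.txt: it assumes records are compared by accession, that "alphabetical order" means accession order, and that sequences match case-insensitively. The function name and the accessions in the example are made up, and the PAT handling is not modeled.

```python
from collections import defaultdict


def choose_duplicates_to_exclude(records):
    """Split records into (kept, excluded) accession lists.

    records: iterable of (accession, sequence) pairs.
    Within each group of identical sequences, an NC_ reference is
    preferred; otherwise the alphabetically first accession is kept.
    """
    groups = defaultdict(list)
    for accession, sequence in records:
        groups[sequence.upper()].append(accession)

    kept, excluded = [], []
    for accessions in groups.values():
        # False sorts before True, so NC_ accessions come first,
        # then ties are broken alphabetically.
        ranked = sorted(accessions, key=lambda a: (not a.startswith("NC_"), a))
        kept.append(ranked[0])
        excluded.extend(ranked[1:])
    return sorted(kept), sorted(excluded)


# Made-up accessions and sequences for illustration only.
records = [
    ("NC_000000", "ACGT"),
    ("AB000001", "acgt"),
    ("MK000002", "GGCC"),
    ("MK000001", "GGCC"),
]
kept, excluded = choose_duplicates_to_exclude(records)
print(kept)
print(excluded)
```

The excluded accessions would still need a comment giving the reason for each exclusion before being added to exclude.txt.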

Future work: we may be able to establish deduplication guidelines in #30.

@j23414 j23414 force-pushed the exclude-circular-synthetic-strains branch from 6932f11 to b45bec5 Compare February 24, 2024 00:14
@j23414 j23414 force-pushed the exclude-circular-synthetic-strains branch from b45bec5 to af38fb0 Compare February 26, 2024 18:00
@j23414 j23414 merged commit 5385608 into main Feb 26, 2024
32 checks passed
@j23414 j23414 deleted the exclude-circular-synthetic-strains branch February 26, 2024 18:08