Spurious proteins from CRISPR sequences

Scripts used in the article "CRISPR sequences contaminate public databases with spurious proteins containing spaced repeats"

Introduction

These scripts can be used to search for spurious protein sequences originating from CRISPR sequences. We used two different approaches, which are described below.

First approach (search for already-annotated repeats)

This approach searches the protein sequences for translations of repeat sequences from the CRISPRCasdb database separated by putative spacers.

Script: crispr_spurious1_initial.pl proteins.fasta crisprcasdb.fasta

You can pass an additional argument (1) to obtain only one line per peptide dataset.
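The idea behind the first approach can be illustrated with a minimal Python sketch. It is not the logic of crispr_spurious1_initial.pl: the function names, the standard genetic code, exact peptide matching, and the spacer bounds (5 to 25 amino acids) are all assumptions made for illustration. A nucleotide repeat is translated in six reading frames, and a protein is flagged when two translated repeat peptides occur separated by a spacer-sized gap.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no alternative tables).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_peptides(dna):
    """Translate a repeat in all six reading frames; unknown codons become 'X'."""
    dna = dna.upper()
    peptides = set()
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            peptides.add("".join(CODON_TABLE.get(c, "X") for c in codons))
    return peptides


def spaced_repeat_hits(protein, peptides, min_spacer=5, max_spacer=25):
    """Return (start1, start2) pairs of consecutive repeat-peptide hits whose
    gap, in amino acids, lies within the hypothetical spacer bounds."""
    hits = sorted(
        (m.start(), m.start() + len(p))
        for p in peptides
        if p and "*" not in p  # a frame with a stop codon cannot appear inside a protein
        for m in re.finditer("(?=%s)" % re.escape(p), protein)
    )
    pairs = []
    for (s1, e1), (s2, _e2) in zip(hits, hits[1:]):
        if min_spacer <= s2 - e1 <= max_spacer:
            pairs.append((s1, s2))
    return pairs
```

Hits from any frame are pooled before checking the gaps because the repeat plus spacer length is not necessarily a multiple of three, so consecutive repeats may be translated in different frames.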

Second approach (search for peptide repeats)

This approach searches for amino acid repeats separated by putative spacers directly in the protein sequences of the UniProtKB database.

Script: crispr_spurious2_initial.pl proteins.fasta output.tsv
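A minimal Python sketch of this kind of search is shown here. It is not the algorithm of crispr_spurious2_initial.pl: the function name, the fixed peptide length k, the spacer bounds, and the minimum number of copies are hypothetical parameters chosen for illustration. It indexes every k-mer of a protein and reports those that recur with spacer-sized gaps between consecutive copies.

```python
from collections import defaultdict


def spaced_peptide_repeats(protein, k=7, min_spacer=5, max_spacer=25, min_copies=3):
    """Return {kmer: [start positions]} for k-mers that occur at least
    min_copies times, with each consecutive pair of copies separated by
    min_spacer..max_spacer residues (all thresholds are illustrative)."""
    # Index the start position of every k-mer in the protein.
    positions = defaultdict(list)
    for i in range(len(protein) - k + 1):
        positions[protein[i:i + k]].append(i)

    results = {}
    for kmer, starts in positions.items():
        best, run = [], []
        for start in starts:
            # Extend the current run only if the gap to the previous copy is spacer-sized.
            if run and min_spacer <= start - (run[-1] + k) <= max_spacer:
                run.append(start)
            else:
                run = [start]
            if len(run) > len(best):
                best = list(run)
        if len(best) >= min_copies:
            results[kmer] = best
    return results
```

Exact k-mer matching is a deliberate simplification; a tolerant search would also have to allow mismatches between repeat copies.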

Search for cas genes (discovery of Putative False Proteins)

Finally, the initial candidates from the two approaches are mapped to their corresponding genomic sequences, and cas genes are searched for within 15 kb of each candidate.

Script: crispr_spurious_pfp.pl proteins.dat initial_candidates.tsv path

You can choose thresholds for both cas domain coverage and identity. The path is a folder where the cas domain profiles are stored (see the crispr GitHub repository by UPOBioinfo for more details).
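The neighbourhood filter can be sketched in Python as follows. This is an illustration only, not the code of crispr_spurious_pfp.pl: the Gene record, its coverage and identity fields, the default thresholds, and the choice to count any gene overlapping the window are assumptions. Only the 15 kb window comes from the text.

```python
from dataclasses import dataclass

WINDOW_BP = 15_000  # 15 kb on each side of the candidate, as stated in the text


@dataclass
class Gene:
    name: str
    start: int       # 1-based genomic start
    end: int         # 1-based genomic end
    coverage: float  # fraction of the cas domain profile covered (hypothetical field)
    identity: float  # fractional identity of the match (hypothetical field)


def cas_neighbours(cand_start, cand_end, genes,
                   min_coverage=0.5, min_identity=0.3, window=WINDOW_BP):
    """Return genes overlapping the window around a candidate that also
    pass the coverage and identity thresholds (defaults are illustrative)."""
    lo = max(1, cand_start - window)
    hi = cand_end + window
    return [
        g for g in genes
        if g.end >= lo and g.start <= hi
        and g.coverage >= min_coverage
        and g.identity >= min_identity
    ]
```

Under these assumptions, a candidate with at least one gene returned by cas_neighbours would be the kind of sequence that lies next to a cas gene.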

When there are more than 500 sequences, the following scripts, which use the NCBI API, should be run first:

Script: download_NCBI_faa_files.pl

Script: crispr_spurious_pfp_multifasta_gff3.pl proteins.fasta

Files from the article

Database	First approach (initial)	First approach (final)	Second approach (initial)	Second approach (final)
sprot_archaea	sprot_archaea1.1	sprot_archaea1.2	sprot_archaea2.1	sprot_archaea2.2
sprot_bacteria	sprot_bacteria1.1	sprot_bacteria1.2	sprot_bacteria2.1	sprot_bacteria2.2
trembl_archaea	trembl_archaea1.1	trembl_archaea1.2	trembl_archaea2.1	trembl_archaea2.2
trembl_bacteria	trembl_bacteria1.1	trembl_bacteria1.2	trembl_bacteria2.1	trembl_bacteria2.2
