Scripts used in the article "CRISPR sequences contaminate public databases with spurious proteins containing spaced repeats"
These scripts can be used to search for spurious protein sequences originating from CRISPR sequences. We used two different approaches, which are described below.
The first approach consisted of searching for translations of repeat sequences from the CRISPRCasdb database separated by putative spacers.
Script: crispr_spurious1_initial.pl proteins.fasta crisprcasdb.fasta
Pass an additional argument (1) to obtain only one line per peptide dataset.
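The idea behind the first approach can be sketched in Python as follows. This is only an illustration, not the logic of crispr_spurious1_initial.pl: the function names, the six-frame translation strategy, and the gap limits (`min_gap`, `max_gap`) are assumptions made for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_peptides(dna):
    """Translate a DNA repeat in all six reading frames."""
    dna = dna.upper()
    peptides = set()
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            # Codons with ambiguous bases are translated as "X".
            peptides.add("".join(CODON_TABLE.get(c, "X") for c in codons))
    return peptides


def spaced_hits(protein, peptide, min_gap=5, max_gap=20):
    """Return pairs of start positions of consecutive peptide copies whose
    separating gap (in residues) lies within [min_gap, max_gap].
    The gap limits are hypothetical placeholders."""
    starts = []
    pos = protein.find(peptide)
    while pos != -1:
        starts.append(pos)
        pos = protein.find(peptide, pos + 1)
    return [(a, b) for a, b in zip(starts, starts[1:])
            if min_gap <= b - (a + len(peptide)) <= max_gap]
```

For instance, `six_frame_peptides("ATGGCCAAA")` includes the peptide `MAK`, and `spaced_hits("MAKQQQQQQMAKRR", "MAK")` reports the two copies separated by a six-residue putative spacer.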
The second approach consisted of searching for amino acid repeats separated by putative spacers directly in the protein sequences of the UniProtKB database.
Script: crispr_spurious2_initial.pl proteins.fasta output.tsv
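A minimal Python sketch of the second approach is shown below. It is not the algorithm of crispr_spurious2_initial.pl: the k-mer based search, the function name, and every default value (`k`, `min_gap`, `max_gap`, `min_copies`) are assumptions chosen only to illustrate what "amino acid repeats separated by putative spacers" could look like in code.

```python
from collections import defaultdict


def find_spaced_repeats(protein, k=6, min_gap=5, max_gap=20, min_copies=3):
    """Find exact k-mers that recur with spacer-like gaps between copies.

    Returns a dict mapping each repeated k-mer to the start positions of its
    longest run of copies whose consecutive gaps lie in [min_gap, max_gap].
    All thresholds are hypothetical placeholders.
    """
    # Index every k-mer by its start positions (already in ascending order).
    positions = defaultdict(list)
    for i in range(len(protein) - k + 1):
        positions[protein[i:i + k]].append(i)

    hits = {}
    for kmer, starts in positions.items():
        run = [starts[0]]
        best = run
        for prev, cur in zip(starts, starts[1:]):
            gap = cur - (prev + k)  # residues between two copies
            # Extend the current run, or restart it at this copy.
            run = run + [cur] if min_gap <= gap <= max_gap else [cur]
            if len(run) > len(best):
                best = run
        if len(best) >= min_copies:
            hits[kmer] = best
    return hits
```

With the toy sequence `"GSLKRN" + "A" * 7 + "GSLKRN" + "T" * 8 + "GSLKRN"`, the function returns `{"GSLKRN": [0, 13, 27]}`. A real search would likely also tolerate mismatches between repeat copies, which this sketch does not.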
Finally, the initial candidates from the two approaches are mapped to their corresponding genomic sequences, and cas genes are searched for within 15 kb around each candidate.
Script: crispr_spurious_pfp.pl proteins.dat initial_candidates.tsv path
You can choose thresholds for both cas domain coverage and identity. The path is the folder where the cas domain profiles are stored (see the crispr GitHub repository by UPOBioinfo for more details).
When the number of sequences exceeds 500, the following scripts, which use the NCBI API, should be run first:
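The 15 kb neighbourhood search can be pictured with the short Python sketch below. It assumes that "within 15 kb around the candidate" means 15 kb of flanking sequence on each side, that coordinates are 1-based and inclusive, and that gene annotations are available as `(name, start, end)` tuples. These conventions and the function names are illustrative and are not taken from crispr_spurious_pfp.pl.

```python
def window_around(start, end, flank=15_000, contig_length=None):
    """Return the (low, high) coordinates of the search window,
    clipped to the contig boundaries when the length is known."""
    low = max(1, start - flank)
    high = end + flank
    if contig_length is not None:
        high = min(high, contig_length)
    return low, high


def genes_in_window(genes, start, end, flank=15_000, contig_length=None):
    """List the names of genes that overlap the window around a candidate.

    `genes` is an iterable of (name, gene_start, gene_end) tuples.
    """
    low, high = window_around(start, end, flank, contig_length)
    return [name for name, g_start, g_end in genes
            if g_start <= high and g_end >= low]
```

For a candidate at positions 20000 to 20500, the window spans 5000 to 35500, so a gene annotated at 8000 to 9000 is reported while one at 40000 to 41000 is not. Filtering the reported genes by cas domain coverage and identity would be a separate step.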
Script: download_NCBI_faa_files.pl
Script: crispr_spurious_pfp_multifasta_gff3.pl proteins.fasta
Database | First approach (initial) | First approach (final) | Second approach (initial) | Second approach (final) |
---|---|---|---|---|
sprot_archaea | sprot_archaea1.1 | sprot_archaea1.2 | sprot_archaea2.1 | sprot_archaea2.2 |
sprot_bacteria | sprot_bacteria1.1 | sprot_bacteria1.2 | sprot_bacteria2.1 | sprot_bacteria2.2 |
trembl_archaea | trembl_archaea1.1 | trembl_archaea1.2 | trembl_archaea2.1 | trembl_archaea2.2 |
trembl_bacteria | trembl_bacteria1.1 | trembl_bacteria1.2 | trembl_bacteria2.1 | trembl_bacteria2.2 |