Initial Steps:
- Load the Python module
module load python/3.8.5
- Create a virtual environment
python3 -m venv venv && . venv/bin/activate
- Upgrade pip, wheel and pytest
pip install --upgrade pip wheel pytest
- Clone the repo
git clone [email protected]:UCLOrengoGroup/cath-alphaflow.git
- Create a .env file (use prod settings) - see config.env
- Export the Oracle Instant Client library path
export LD_LIBRARY_PATH=~/instantclient_21_7
Time: 7 minutes
Command:
cath-af-cli create-dataset-uniprot-ids --uniprot_ids_csv af_uniprot_id_100k_from_g3d.list \
--dbname gene3d_21 \
--max_evalue 1e-50 \
--max_records 100000
Output: af_uniprot_id_100k_from_g3d.list
Time: 4 minutes
Input: af_uniprot_id_100k_from_g3d.list
Command:
cath-af-cli create-cath-dataset-from-db \
--csv_uniprot_ids af_uniprot_id_100k_from_g3d.list \
--dbname gene3d_21 \
--af_cath_annotations af_100k_cath_annotations.tsv \
--af_chainlist_ids af_100k_chainlist_ids.tsv \
--af_domainlist_ids af_100k_domainlist_ids.tsv \
--gene3d_crh_output af_100k.crh \
--csv_uniprot_md5 af_100k_md5.tsv
Output:
- af_100k_cath_annotations.tsv
- af_100k_chainlist_ids.tsv
- af_100k_domainlist_ids.tsv
- af_100k.crh
- af_100k_md5.tsv
Time: less than 1 min
Input: af_100k_chainlist_ids.tsv
Output: af_100k_manifest.txt
grep -v 'chain_id' af_100k_chainlist_ids.tsv | sort | uniq | tr -d '\r' | awk '{print "gs://public-datasets-deepmind-alphafold-v4/"$1".cif"}' > af_100k_manifest.txt
time: 2hrs and 20mins
Input: af_100k_manifest.txt
Output: af_cif_raw
folder containing unchopped AlphaFold chains
mkdir af_cif_raw
cat af_100k_manifest.txt | (/share/apps/google-cloud-sdk/bin/gsutil -o GSUtil:parallel_process_count=1 -m cp -I af_cif_raw/ || echo "Ignoring non-zero exit code: $?")
The step outputs the number of UniProt IDs not available in AlphaFoldDB.
CommandException: 6532 files/objects could not be transferred. Total downloaded CIF: 62,212
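As a sanity check, the failed transfers can be listed by diffing the manifest against what actually landed on disk. A toy version with stand-in `demo_*` filenames; in the real run, substitute af_100k_manifest.txt and af_cif_raw/:

```shell
# Toy version of the post-gsutil check: which manifest entries never arrived?
# (demo_* names are stand-ins; in the real run use af_100k_manifest.txt and af_cif_raw/)
mkdir -p demo_cif_raw
printf 'gs://bucket/AF-P1-F1-model_v4.cif\ngs://bucket/AF-P2-F1-model_v4.cif\n' > demo_manifest.txt
touch demo_cif_raw/AF-P1-F1-model_v4.cif                   # pretend only the first chain downloaded
sed 's#.*/##' demo_manifest.txt | sort > demo_expected.txt # strip the gs:// prefix -> bare filenames
ls demo_cif_raw | sort > demo_present.txt
comm -23 demo_expected.txt demo_present.txt                # in manifest but not on disk
```

The line count of the `comm -23` output should match gsutil's reported failure count.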
Rationale:
The cif_to_fasta module in cath-af-cli expects a list of existing files, so we need to feed it only the chains that are in AFDB and available on the filesystem.
cd af_cif_raw
find *.cif > ../af_100k_chainlist_after_gsutil.txt
sed -i 's/\.cif//g' af_100k_chainlist_after_gsutil.txt
echo af_chain_id | cat - af_100k_chainlist_after_gsutil.txt > temp && mv temp af_100k_chainlist_after_gsutil.txt
Process: 62,212 CIF files -> 62,212 FASTA files + merged.fasta
Thoughts: Too slow on a single CPU (~30 mins for 10k sequences), so the input is split and run in parallel.
Time: 1hr 9 mins
Input:
af_cif_raw/
af_100k_chainlist_after_gsutil.txt
Output:
af_fasta_raw
- folder containing a FASTA file for each chain available in the filesystem
merged.fasta
- file containing a multiFASTA of all processed CIF files.
mkdir af_fasta_raw
split -l 10000 af_100k_chainlist_after_gsutil.txt splitlist
find splitlista* | xargs -P4 -I XXX sh -c 'cath-af-cli convert-cif-to-fasta --cif_in_dir af_cif_raw/ \
--id_file XXX \
--fasta_out_dir af_fasta_raw/'
Careful: the module treats the first line of each input file as a header, so when the list is split into chunks, the first structure of every chunk after the first is silently skipped unless a header line is prepended to each chunk.
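One workaround (a sketch, not what was run here) is to prepend the `af_chain_id` header to every chunk that lacks one before the xargs step; `demo_*` names stand in for the real splitlista* chunks:

```shell
# Toy fix: give every split chunk its own 'af_chain_id' header line,
# so convert-cif-to-fasta does not skip the chunk's first chain.
# (demo_* names are stand-ins for the real splitlista* chunks.)
printf 'af_chain_id\nchain1\n' > demo_splitlistaa   # first chunk keeps the original header
printf 'chain2\nchain3\n'      > demo_splitlistab   # later chunks lost theirs in the split
for f in demo_splitlista*; do
    head -n1 "$f" | grep -q '^af_chain_id' || sed -i '1i af_chain_id' "$f"
done
head -n1 demo_splitlistab                           # now also af_chain_id
```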
Time: seconds
Input: merged.fasta
Output: af_100k_cif_raw_md5.txt
cath-af-cli create-md5 --fasta merged.fasta \
--uniprot_md5_csv af_100k_cif_raw_md5.txt
Rationale:
- Extract and further process only matches between CIF-md5 and chopping MD5
- Remove all CRH entries that rely on a missing AFDB entry.
Time: 1 second
Issue: The files generated in step 6 contain carriage returns (^M) that need removing with tr -d '\r'
Temporary fix: Strip these characters before doing the grep command.
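A more permanent alternative (a sketch, assuming GNU sed) is to strip the carriage returns from the file once, in place, instead of piping through tr in every downstream command:

```shell
# Toy demo: strip Windows carriage returns in place with GNU sed.
# In the real run, apply once to af_100k_cif_raw_md5.txt.
printf 'P12345\tabc\r\nP67890\tdef\r\n' > demo_md5.txt
sed -i 's/\r$//' demo_md5.txt
cat demo_md5.txt    # lines now end in a plain newline
```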
Command:
grep -v 'af_chain_id' af_100k_cif_raw_md5.txt | awk '{print $3}' | tr -d '\r' | grep -F -f - af_100k.crh > af_100k_crh_after_filtering.crh
sort af_100k_crh_after_filtering.crh | uniq > af_100k_crh_after_filtering_uniq.crh
awk '{print $1}' af_100k_crh_after_filtering_uniq.crh | grep -F -f - af_100k_cif_raw_md5.txt | awk '{print $1}' | grep -F -f - af_100k_domainlist_ids.tsv > af_100k_domainlist_after_md5_filter.txt
echo af_domain_id | cat - af_100k_domainlist_after_md5_filter.txt > temp && mv temp af_100k_domainlist_after_md5_filter.txt
Output: af_100k_crh_after_filtering.crh
Unnecessary step: optimise-domain-boundaries acts on chains, not on domains, so we can optimise first and chop once.
Running via xargs with 12 processes on a uniq-ed crh file and domain list (120,129 domains)
Time: 1hr 45 minutes
Command:
split -l 7500 af_100k_domainlist_after_md5_filter.txt split_crh_uniq_ -a 5 -d
find split_crh_uniq_000* | xargs -P12 -I XXX sh -c 'cath-af-cli chop-cif --cif_in_dir af_cif_raw/ \
--id_file XXX \
--cif_out_dir chopped_cif_before_optimization/'
Issues found: remove the af_domain_id header line before running (or add an exception in the program) - chop-cif skips the first entry as a header, but when the header is present it crashes because it cannot find a structure named af_domain_id.
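Until the parser handles it, the header can simply be dropped from the id file before splitting (a sketch with stand-in `demo_*` names):

```shell
# Toy demo: strip the header line from a domain-list chunk before chop-cif sees it.
# (demo_* names are stand-ins; apply to the real split chunks.)
printf 'af_domain_id\nAF-P12345-F1-model_v4/12-100\n' > demo_domainlist.txt
tail -n +2 demo_domainlist.txt > demo_domainlist_noheader.txt   # or: grep -v '^af_domain_id'
cat demo_domainlist_noheader.txt
```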
Stats: Chopped 120,129 domains
On 16 threads, 7500 entries each
Time: 1hr 1 min
Comments/TODO:
- Include percentage of delta length (i.e. how much do the boundaries change?)
- Remove af_domain_id from header, it crashes the process.
- Optimise boundaries works on chains, not on domains.
The initial chopping is redundant; it is kept in this instance as it's useful for debugging.
Commands:
split -l 7500 af_100k_domainlist_after_md5_filter_uniq.txt split_af_100k -a 5 -d
find split_af_100k000* | xargs -P16 -I XXX sh -c 'cath-af-cli optimise-domain-boundaries --af_domain_list XXX \
--af_chain_mmcif_dir af_cif_raw/ \
--af_domain_list_post_tailchop af_100k_domainlist_post_tailchop_XXX.txt \
--af_domain_mapping_post_tailchop af_100k_domain_mapping_post_tailchop_XXX.tsv \
--status_log optimise_boundaries_status_log_XXX'
Afterwards:
- Concatenate logs
- Concatenate af_domain_list_post_tailchop
- Concatenate af_domain_mapping_post_tailchop
- Remove duplicated headers
cat af_100k_domain_mapping_post_tailchop_split_af_100k000* | grep -v 'af_domain_id' > af_100k_domain_mapping_post_tailchop.tsv
cat af_100k_domainlist_post_tailchop_split_af_100k000* | grep -v 'af_domain_id' > af_100k_domainlist_post_tailchop.txt
cat optimise_boundaries_status_log_split_af_100k000* > optimise_boundaries_status_log
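Before deleting the chunk files, a quick sanity check (a sketch with toy data) that the concatenated output accounts for every non-header chunk line:

```shell
# Toy sanity check before removing the chunks: data lines in the concatenation
# should equal the sum over all chunk files, minus the per-chunk headers.
printf 'af_domain_id\ndom1\ndom2\n' > demo_chunk_aa
printf 'af_domain_id\ndom3\n'       > demo_chunk_ab
cat demo_chunk_a* | grep -vc '^af_domain_id'    # count of data lines across chunks
```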
rm af_100k_domain_mapping_post_tailchop_split_af_100k000*
rm optimise_boundaries_status_log_split_af_100k000*
rm af_100k_domainlist_post_tailchop_split_af_100k000*
rm split_af_100k000*
Stats:
- 84,604 have unchanged boundaries
- 35,378 have adjusted boundaries
- 129 have failed due to not having residues above pLDDT 70
Time: 1hr 46 mins on 12 threads
split -l 7500 af_100k_domainlist_post_tailchop.txt af_100k_domainlist -a 5 -d
find af_100k_domainlist00* | xargs -P12 -I XXX sh -c 'cath-af-cli chop-cif --cif_in_dir af_cif_raw/ \
--id_file XXX \
--cif_out_dir chopped_cif_after_optimisation/'
TODO: Add StatusLog to ALL MODULES
Chopped 120,112 domains with optimised boundaries
Time: 59 mins on 16 threads
split -l 7500 af_100k_domainlist_post_tailchop.txt split_af_100k_plddt_ -a 5 -d
find split_af_100k_plddt_000* | xargs -P16 -I XXX sh -c 'cath-af-cli convert-cif-to-plddt-summary --cif_in_dir af_cif_raw/ \
--id_file XXX \
--plddt_stats_file af_100k_plddt_summary_XXX.tsv'
cat af_100k_plddt_summary_split_af_100k_plddt_*.tsv > af_100k_plddt_summary.tsv
Time: 4 hrs
Requirements: the Boost library. On the CS HPC it can be loaded as a module.
module load boost/1.71.0
mkdir af_dssp_dir
split -l 7500 af_100k_domainlist_post_tailchop.txt split_af_100k_dssp_ -a 5 -d
find split_af_100k_dssp_000* | xargs -P16 -I XXX sh -c 'cath-af-cli convert-cif-to-dssp --cif_in_dir af_cif_raw/ \
--id_file XXX \
--dssp_out_dir af_dssp_dir/'
Comments: Implement a skip if the output file is already present. Maybe switch to IUPRED or some other sequence-based secondary structure predictor?
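Until that skip exists in the tool, it can be approximated in the driver script by filtering the id file down to chains with no DSSP output yet. A sketch with stand-in `demo_*` names; it assumes the output naming is `<id>.dssp`, which may not match the tool's actual convention:

```shell
# Toy demo: keep only ids whose DSSP output does not exist yet,
# so a rerun only processes the remaining chains.
mkdir -p demo_dssp_dir
printf 'chainA\nchainB\n' > demo_ids.txt
touch demo_dssp_dir/chainA.dssp                 # already processed on a previous run
while read -r id; do
    [ -e "demo_dssp_dir/$id.dssp" ] || echo "$id"   # assumes <id>.dssp naming
done < demo_ids.txt > demo_ids_todo.txt
cat demo_ids_todo.txt                           # only the unprocessed chain remains
```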
Time: 4 mins
Command:
find split_af_100k_dssp_000* | xargs -P16 -I XXX sh -c 'cath-af-cli convert-dssp-to-sse-summary --dssp_dir af_dssp_dir/ \
--id_file XXX \
--sse_out_file af_100k_sse_XXX.tsv'
Time: 2 hrs 7 mins on a single thread, RAM:1G
Command:
cath-af-cli convert-cif-to-foldseek-db --cif_dir chopped_cif_after_optimisation/ \
--fs_querydb_dir af_foldseek_db/ \
--fs_bin_path /SAN/biosciences/alphafold/foldseek/bin/foldseek
Comment:
- Added an option to generate a database only from a few structures. Needs to be specified as a CLI option using id_list
- Investigate why m_core on CS is not seen by Foldseek. Increasing RAM available speeds up the process significantly.
Time: 2hrs 30
Command:
cath-af-cli run-foldseek --fs_querydb af_foldseek_db/af_query_foldseek.db \
--fs_targetdb /SAN/cath/cath_v4_3_0/databases/foldseek/cath_s95/cath_s95_db \
--fs_bin_path /SAN/biosciences/alphafold/foldseek/bin/foldseek
Time: 5 mins
Command:
cath-af-cli convert-foldseek-output-to-summary --id_file af_100k_domainlist_post_tailchop.txt \
--fs_input_file fs_query_results.m8 \
--fs_results fs_query_results.txt
Starting from 110,250 unique domain ids in the Foldseek output, 110,218 hits pass the thresholds.
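The unique-id count can be reproduced directly from the m8 file (a sketch with toy data; assumes the default Foldseek tabular layout with the query id in column 1):

```shell
# Toy demo: count unique query domain ids in a Foldseek .m8 tabular file.
# (Column 1 is the query id in the default output format.)
printf 'dom1\thitA\ndom1\thitB\ndom2\thitC\n' > demo_results.m8
awk '{print $1}' demo_results.m8 | sort -u | wc -l
```

In the real run, point this at fs_query_results.m8 to cross-check the figure above.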
Time: 2hrs 6 mins
Split into 16 processes. Processing 120,113 domains
Command:
split -l 7500 af_100k_domainlist_post_tailchop.txt split_af_100k_globularity_ -a 5 -d
find split_af_100k_globularity_000* | xargs -P16 -I XXX sh -c 'cath-af-cli measure-globularity --af_domain_list XXX \
--af_chain_mmcif_dir af_cif_raw/ \
--af_domain_globularity af_100k_globularity_XXX.txt'
cat af_100k_globularity_split_af_100k_globularity_000* > af_100k_globularity.txt
rm split_af_100k_globularity_000*
rm af_100k_globularity_split_af_100k_globularity_000*
As this was done in chunks, remove the 16 additional occurrences of the header from af_100k_globularity.txt
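One way to do that (a sketch with toy data) is an awk filter that lets the header line through only once:

```shell
# Toy demo: after concatenating chunks, keep only the first header occurrence.
printf 'af_domain_id g1\ndom1 0.9\naf_domain_id g1\ndom2 0.8\n' > demo_glob.txt
# Print every non-header line; print a header line only the first time it is seen.
awk '!/^af_domain_id/ || !seen++' demo_glob.txt > demo_glob_dedup.txt
cat demo_glob_dedup.txt
```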
TBD