This Python script, process_af3_outputs.py, is designed to clean, analyze, and organize output files generated by AlphaFold 3 (AF3). It lets you batch-analyse similar AlphaFold3 outputs, focussing on protein-protein interactions. For each prediction, the script looks for residues of the candidate partner protein whose Predicted Aligned Error (PAE) relative to your protein of interest is below a cutoff you specify, and identifies which residues of the protein of interest they interact with. The script requires some understanding of the output AlphaFold3 produces; if you are unsure, you can refer to the AlphaFold Server FAQs (https://alphafoldserver.com/faq) or this EMBL course on using AlphaFold (https://www.ebi.ac.uk/training/online/courses/alphafold/).
It also ensures compatibility with external hard drives (especially those formatted as exFAT) and performs various operations such as cleaning hidden macOS dot files (e.g., .DS_Store), parsing protein structure CIF files, and generating comprehensive logs of its activities.
Do you want to use AlphaFold3 to understand how your Protein of interest (POI) interacts with other proteins and understand if there are any shared features in those predicted interactions that have previously been undiscovered?
Are you afraid of the analysis that would be involved if you tried to do said interaction screen because you wouldn't want to individually analyse hundreds of structure predictions and/or you aren't a structural biologist?
Do you have very limited coding experience (Python3 in particular) and wouldn't know where to start a script to analyse all the AF3 outputs?
If you answer yes to any of these questions, this is the script for you.
- This script works on the basis that you have ONE protein of interest and many potential partner proteins.
- When you run your AF3 jobs on the AlphaFold Server (https://alphafoldserver.com/), keep your protein of interest in the same position in every job, e.g. always first, always second, or always third in the list.
- Give each AF3 job a descriptive title so that when you download the job after it is done you know what you ran.
- Download all your AlphaFold outputs and store them in a designated folder somewhere.
When you run the script, you can (or, where marked REQUIRED, must) specify several inputs, listed below. The script mainly considers the ranked_0 prediction.
-id or --input_dir is the input directory that contains subfolders with AlphaFold3 output (REQUIRED).
-poi or --poi_chain is the chain on which the protein of interest is located. If you put your POI first when running the AF3 job it is chain A, so you would pass A; if you put it second you would pass B, and so on (default: A).
-partner or --partner_chain is the same as -poi, but for the chain of the potentially interacting partner protein (default: B).
-pae or --max_pae_cutoff is the maximum Predicted Aligned Error cutoff (default: 15). This is the highest PAE value a residue of the partner protein is allowed to have and still be classified as interacting with your POI.
-iptm or --min_iptm_cutoff is the minimum ipTM value a model must have to be processed further (default: 0).
-ptm or --min_ptm_cutoff is the equivalent minimum cutoff for the pTM value (default: 0).
-min_residues or --min_residues_cutoff is the minimum number of POI residues against which a potentially interacting residue of the partner protein needs to fulfil the PAE cutoff (default: 5). Recommendation, based on NO empirical evidence: I would set this to about a third of the length of your POI in amino acids.
-max_dist or --max_dist is the maximum distance (in Angstrom) a partner residue is allowed to be from a POI residue to be considered in contact with that POI residue (default: 8).
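As a rough illustration of how the -pae, -min_residues, -iptm and -ptm cutoffs could fit together, here is a minimal sketch. It is not the script's actual code. It assumes a square PAE matrix with one row and column per residue, plus parallel lists of chain IDs and residue numbers. The function names, the toy data, the "iptm" and "ptm" dictionary keys, the use of "at or below" for the cutoff, and the choice of reading the PAE from the partner residue's row are all assumptions made for the example.

```python
def passes_model_filters(summary, min_iptm=0.0, min_ptm=0.0):
    """Return True if a model's ipTM and pTM both reach the minimum cutoffs.

    `summary` is a plain dict; the "iptm" and "ptm" keys are assumed here.
    """
    return summary["iptm"] >= min_iptm and summary["ptm"] >= min_ptm


def select_partner_residues(pae, chain_ids, res_ids, poi_chain="A",
                            partner_chain="B", max_pae=15.0, min_residues=5):
    """Map each qualifying partner residue to its low-PAE POI residues.

    A partner residue qualifies when its PAE is at or below `max_pae`
    against at least `min_residues` residues of the POI chain.
    """
    poi_rows = [i for i, chain in enumerate(chain_ids) if chain == poi_chain]
    partner_rows = [i for i, chain in enumerate(chain_ids) if chain == partner_chain]

    selected = {}
    for p in partner_rows:
        low_pae_poi = [res_ids[q] for q in poi_rows if pae[p][q] <= max_pae]
        if len(low_pae_poi) >= min_residues:
            selected[res_ids[p]] = low_pae_poi
    return selected


# Toy data: POI chain A has three residues, partner chain B has two.
chain_ids = ["A", "A", "A", "B", "B"]
res_ids = [1, 2, 3, 1, 2]
pae = [
    [0.5, 2.0, 3.0, 5.0, 28.0],
    [2.0, 0.5, 2.5, 7.0, 29.0],
    [3.0, 2.5, 0.5, 22.0, 8.0],
    [4.0, 6.0, 20.0, 0.5, 3.0],
    [25.0, 30.0, 9.0, 3.0, 0.5],
]

hits = select_partner_residues(pae, chain_ids, res_ids, max_pae=10, min_residues=2)
print(hits)  # {1: [1, 2]}
```

In this toy case only partner residue 1 is kept, because it is the only one with a PAE of 10 or less against at least two POI residues.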
An example of how to run this code:
python3 process_alphafold_outputs.py -id AlphaFold3/ -poi B -partner A -pae 10 -min_residues 100 -max_dist 4
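For readers new to Python, the sketch below shows how a command-line interface with the options listed above could be declared using the standard argparse module. The flag names and defaults are taken from the list above; the argument types (float or int) and the helper name build_parser are assumptions, and the real script may be organised differently.

```python
import argparse


def build_parser():
    """Declare the command-line options described in this README."""
    parser = argparse.ArgumentParser(
        description="Batch-analyse AlphaFold3 protein-protein predictions."
    )
    parser.add_argument("-id", "--input_dir", required=True,
                        help="directory containing AF3 output subfolders")
    parser.add_argument("-poi", "--poi_chain", default="A",
                        help="chain of the protein of interest")
    parser.add_argument("-partner", "--partner_chain", default="B",
                        help="chain of the candidate partner protein")
    parser.add_argument("-pae", "--max_pae_cutoff", type=float, default=15,
                        help="maximum PAE for a partner residue")
    parser.add_argument("-iptm", "--min_iptm_cutoff", type=float, default=0,
                        help="minimum ipTM for a model to be processed")
    parser.add_argument("-ptm", "--min_ptm_cutoff", type=float, default=0,
                        help="minimum pTM for a model to be processed")
    parser.add_argument("-min_residues", "--min_residues_cutoff", type=int,
                        default=5,
                        help="minimum number of residues meeting the PAE cutoff")
    parser.add_argument("-max_dist", "--max_dist", type=float, default=8,
                        help="maximum contact distance in Angstrom")
    return parser


# Parse the same arguments as in the example command above.
args = build_parser().parse_args(
    ["-id", "AlphaFold3/", "-poi", "B", "-partner", "A",
     "-pae", "10", "-min_residues", "100", "-max_dist", "4"]
)
print(args.poi_chain, args.max_pae_cutoff, args.min_residues_cutoff)
```

Options that are not given on the command line, such as -iptm and -ptm here, fall back to their defaults.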
- A CSV file called interaction_analysis_PAE_{max_pae_cutoff}max_dist{max_dist}.csv which, for each predicted interaction, lists the POI residues that are predicted to interact with the partner residues.
- Structure (.cif) files that contain the entirety of your POI and only the region of the partner protein that is predicted to interact with your POI. They are all stored in a folder called Interaction_cif_files_PAE_{max_pae_cutoff}maxdist{max_dist}. This allows you to focus on the structure of high-confidence predictions.
- The same kind of structure (.cif) files as in the previous item, but for all 5 ranked models, together with a PyMOL script that overlays all of these structures when you run it. How similar the 5 models that AlphaFold predicts are to one another is sometimes used as another measure of confidence in the prediction. These files are stored in a folder called Overlays_Interaction_cif_files_PAE_{max_pae_cutoff}maxdist{max_dist}.
The AF3 Processing Script is a tool aimed at researchers working with a lot of AlphaFold3 protein structure predictions. It automates the process of:
- Cleaning hidden files: Cleans up unnecessary macOS hidden files (like .DS_Store) that clutter directories. This allows you to have your AlphaFold3 folders stored on a hard drive and run the script from the hard drive.
- Reading and processing protein structures: Analyzes and extracts data from CIF (Crystallographic Information File) files containing protein structure information.
- Comparative structure analysis: Uses BioPython to perform structural comparisons between protein models to detect similarities or differences between predicted and actual structures.
- Neighbor Searching: Performs 3D neighbor searches on atoms within the protein structures to identify local interactions.
- Logging: Generates detailed logs of all processes, allowing users to track which files were successfully processed and identify any errors.
This step is specific to users on macOS. The operating system tends to generate hidden files (such as .DS_Store) that can clutter directories, especially when transferring files to external drives. The script uses the dot_clean command to remove these files, ensuring that only relevant protein structure data is transferred or stored.
- Function: clean_dot_files(directory)
- Purpose: Cleans up hidden files from the specified directory before processing the protein structure data, ensuring a clean dataset without unwanted files.
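A cleaning step of this kind could look roughly like the sketch below. It is an illustration under stated assumptions, not the script's own implementation. The macOS dot_clean utility is documented as merging or deleting "._*" companion files, so this sketch calls it only when it is available and additionally deletes .DS_Store files explicitly. That explicit deletion, and the return value, are assumptions made for the example.

```python
import shutil
import subprocess
from pathlib import Path


def clean_dot_files(directory):
    """Remove macOS clutter files below `directory`.

    Returns the number of .DS_Store files that were deleted.
    """
    directory = Path(directory)

    # dot_clean only exists on macOS; skip it quietly elsewhere.
    if shutil.which("dot_clean"):
        # -m: always delete the "._*" files instead of merging them.
        subprocess.run(["dot_clean", "-m", str(directory)], check=False)

    # Collect the paths first so files are not deleted while iterating.
    ds_store_files = [p for p in directory.rglob(".DS_Store") if p.is_file()]
    for path in ds_store_files:
        path.unlink()
    return len(ds_store_files)
```

Calling clean_dot_files("AlphaFold3/") before the analysis would leave the .cif and .json files untouched and only remove the hidden clutter.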
The core part of the script revolves around reading and processing CIF (Crystallographic Information File) files, which contain detailed 3D structural information of proteins predicted by AlphaFold 3.
Using BioPython’s MMCIFParser, the script extracts atomic details, chain information, and residue sequences from these files. The CIF format is crucial for molecular biologists as it stores crystallographic and structural information.
- Function: read_cif_file(file_path, retries=3)
- Purpose: Reads CIF files, applying robust error handling and retries in case of encoding or read errors.
- Details:
  - The script attempts to read CIF files using different encodings (UTF-8, ISO-8859-1), which are common in biological data.
  - If a file cannot be read due to encoding issues, the script logs the error and retries up to 3 times before moving on.
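The retry-and-fallback behaviour described above can be sketched with the standard library alone. This is a simplified illustration, not the script's code: it returns the raw file text rather than a parsed BioPython structure, and the `encodings` parameter and the log messages are assumptions. Note that ISO-8859-1 can decode any byte sequence, so in practice the retries mainly help with transient read errors (for example on a flaky external drive).

```python
import logging


def read_cif_file(file_path, retries=3, encodings=("utf-8", "iso-8859-1")):
    """Return the text of a CIF file, or None if every attempt fails."""
    for attempt in range(1, retries + 1):
        for encoding in encodings:
            try:
                with open(file_path, encoding=encoding) as handle:
                    return handle.read()
            except UnicodeDecodeError:
                # Wrong encoding: fall through to the next one in the list.
                logging.warning("Could not decode %s as %s", file_path, encoding)
            except OSError as exc:
                # Read problem (missing file, drive hiccup): try again.
                logging.error("Attempt %d/%d failed for %s: %s",
                              attempt, retries, file_path, exc)
                break
    return None
```

A caller can then skip files for which the function returns None and carry on with the rest of the batch.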
For comparative analysis of protein structures, the script utilizes BioPython’s Superimposer module. This function compares two 3D structures by superimposing them and calculating the root-mean-square deviation (RMSD), which is a standard measure of structural similarity.
- Function: Superimposer()
- Purpose: Aligns protein models from CIF files and calculates the RMSD between them, providing insights into structural deviations between different protein conformations or predictions.
- Analysis:
  - This functionality is useful for comparing predicted structures from AF3 against experimental or previously known structures to validate the quality of predictions.
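To make the superimposition step concrete, here is a small sketch using BioPython's Superimposer on hand-made CA atoms. In real use the atoms would come from structures parsed from the .cif files; the synthetic coordinates and the helper names make_ca_atoms and superimpose_rmsd are assumptions for illustration only, and the example requires BioPython and NumPy to be installed.

```python
import numpy as np
from Bio.PDB.Atom import Atom
from Bio.PDB.Superimposer import Superimposer


def make_ca_atoms(coords):
    """Build bare CA atoms from (x, y, z) tuples; for illustration only."""
    return [
        Atom("CA", np.array(xyz, dtype="f"), 0.0, 1.0, " ", " CA ", serial,
             element="C")
        for serial, xyz in enumerate(coords, start=1)
    ]


def superimpose_rmsd(fixed_atoms, moving_atoms):
    """Superimpose moving_atoms onto fixed_atoms and return the RMSD.

    Both lists must have the same length and be in matching order.
    The moving atoms are transformed in place.
    """
    sup = Superimposer()
    sup.set_atoms(fixed_atoms, moving_atoms)
    sup.apply(moving_atoms)
    return sup.rms


reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 3.8)]
# The same shape rotated 90 degrees about z and shifted, so the RMSD after
# superposition should be close to zero.
rotated = [(-y + 5.0, x - 3.0, z + 2.0) for x, y, z in reference]

rmsd = superimpose_rmsd(make_ca_atoms(reference), make_ca_atoms(rotated))
print(f"RMSD after superposition: {rmsd:.3f}")
```

The same function could be applied to matching atoms from two ranked models of one prediction; a low RMSD would then indicate that the models agree.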
The script performs neighbor searching using BioPython’s NeighborSearch module, which allows the identification of nearby atoms within a specific distance threshold. This is useful for understanding local interactions within the protein structure, such as hydrogen bonds or van der Waals interactions.
- Function: NeighborSearch(atoms)
- Purpose: Finds atoms within a certain radius of each other to identify potential interactions or bonding sites.
- Details:
  - By analyzing neighboring atoms, the script helps biologists understand how protein folding impacts local atomic interactions, which could be crucial for understanding biological function.
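The sketch below shows one way NeighborSearch could be used to list which partner residues lie within a distance cutoff of which POI residues, in the spirit of the -max_dist option. It builds two toy chains with one CA atom per residue instead of parsing a .cif file. The helper names, the toy coordinates, and the decision to search around every partner atom are assumptions for illustration; the example requires BioPython and NumPy.

```python
import numpy as np
from Bio.PDB.Atom import Atom
from Bio.PDB.Chain import Chain
from Bio.PDB.NeighborSearch import NeighborSearch
from Bio.PDB.Residue import Residue


def make_chain(chain_id, ca_coords):
    """Build a toy chain with one CA atom per residue (illustration only)."""
    chain = Chain(chain_id)
    for number, xyz in enumerate(ca_coords, start=1):
        residue = Residue((" ", number, " "), "GLY", " ")
        residue.add(Atom("CA", np.array(xyz, dtype="f"), 0.0, 1.0, " ",
                         " CA ", number, element="C"))
        chain.add(residue)
    return chain


def find_contacts(poi_chain, partner_chain, max_dist=8.0):
    """Return sorted (partner residue number, POI residue number) pairs."""
    # Index the POI atoms once, then query around each partner atom.
    search = NeighborSearch(list(poi_chain.get_atoms()))
    contacts = set()
    for partner_atom in partner_chain.get_atoms():
        for poi_atom in search.search(partner_atom.coord, max_dist):
            partner_number = partner_atom.get_parent().id[1]
            poi_number = poi_atom.get_parent().id[1]
            contacts.add((partner_number, poi_number))
    return sorted(contacts)


poi = make_chain("A", [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (30.0, 0.0, 0.0)])
partner = make_chain("B", [(0.0, 5.0, 0.0), (50.0, 50.0, 50.0)])

print(find_contacts(poi, partner, max_dist=8.0))  # [(1, 1), (1, 2)]
```

With the default cutoff of 8 Angstrom, partner residue 1 is in contact with POI residues 1 and 2 (5.0 and about 6.3 Angstrom away), while partner residue 2 is too far from everything.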
The script incorporates a robust error-handling mechanism. If a CIF file cannot be read due to encoding issues or other errors, it retries up to three times before logging the failure. This ensures that most files are processed correctly, while the errors are captured for later investigation.
- Function: logging
- Purpose: Logs successes and errors in a log file (process_af3_outputs.log). Each attempt to read a file or perform structural analysis is logged with a timestamp for auditing and troubleshooting purposes.
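A logger that writes timestamped entries to a file can be set up with Python's standard logging module, as in the sketch below. The helper name setup_logger, the logger name, and the exact format string are assumptions; the format was chosen to produce entries of the form "timestamp - LEVEL - message".

```python
import logging


def setup_logger(log_path="process_af3_outputs.log"):
    """Return a logger that appends timestamped entries to `log_path`."""
    logger = logging.getLogger("process_af3_outputs")
    logger.setLevel(logging.INFO)

    # Avoid adding a second handler if the function is called twice.
    if not logger.handlers:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(logging.Formatter(
            "%(asctime)s - %(levelname)s - %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        ))
        logger.addHandler(handler)
    return logger
```

After `logger = setup_logger()`, a call such as `logger.info("Successfully cleaned dot files")` appends an INFO line to the log file, and `logger.error(...)` records failures in the same file.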
- Dot File Cleaning: Automatically removes hidden macOS files like .DS_Store that can cause clutter, especially on shared or external drives.
- CIF File Parsing: Utilizes BioPython to read and manipulate CIF files containing protein structures.
- Superimposition Analysis: Performs 3D superimposition of protein structures to calculate their RMSD (root-mean-square deviation).
- Neighbor Searching: Identifies atomic interactions within a specified radius using BioPython’s NeighborSearch.
- Error Handling: Retries failed file reads and handles various file encodings (UTF-8, ISO-8859-1).
- Detailed Logging: Tracks the status of each operation, generating a detailed log of successes and errors.
- Cross-Platform Compatibility: While developed on macOS, it can be used on other systems, provided they have the required tools installed.
The script creates a log file (process_af3_outputs.log) in the directory where the script is run. This log includes:
- Timestamps for all operations.
- Success messages for successfully processed files.
- Error messages if an operation (like reading a file) fails.
Example log entries:
2024-09-23 12:00:00 - INFO - Successfully cleaned dot files in the directory: /path/to/af3/outputs
2024-09-23 12:00:05 - ERROR - Error reading CIF file: /path/to/file.cif
This log can help you track down issues with specific files or operations.
If you're running the script on macOS and encounter an error related to the dot_clean command, ensure that it is available in your system. This command is built into macOS, but older versions may require updates.
The script attempts to read CIF files using different encodings. If files fail to read after retries, check the encoding of the CIF files. You can manually convert them to UTF-8 or ISO-8859-1.
If the log file is not being created, ensure that you have write permissions for the directory where the script is being executed.
You can find the required Python dependencies in the requirements.txt file. I have only run this program on macOS, so please let me know if there are any issues running it on Windows.
If you use Process_AlphaFold3_Outputs in your work e.g. during analysis, please cite it as follows: Willich, S. (2024) Process_AlphaFold3_Outputs doi.org/10.5281/zenodo.13925934
This project is licensed under the GNU License.
This script was written with help from ChatGPT.