GitHub - ciarajudge/SCAMP_Proteomics: Automated pipeline for proteomic analysis of PRIDE raw files using maxquant

AssayApi.py
Fetch2.py
FileApi.py
Flowchart.pdf
PeptideApi.py
ProjectApi.py
ProteinApi.py
README.txt
__init__.py
download.py
emailtest.py
finishedprojects.txt
merge_proteomics_sqlites.py
mqpar.xml
scamp.py
search.py
swagger.py
testxml.py
wget.py


1 Overview and Batch System
The scamp.py script can be called from the command line with any number of PRIDE project accessions as arguments. For each accession passed to SCAMP, the script calls the fetch and visualisation script, download.py. Projects are processed one at a time in order to avoid crashing MaxQuant, which is a possibility if the program is asked to analyse multiple projects at once. Built into download.py is a batch system, which downloads and analyses the files of a given project in batches; the batch size is set by the user by modifying the download script. This is achieved by placing the primary features of the pipeline (as outlined below) inside a loop which completes the process for every file in a batch before moving on to the next batch.
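As a rough illustration of the two loops described above, the sketch below splits a file list into batches and calls download.py once per accession. The function names, the way download.py receives its argument, and the injectable runner are assumptions made for this example; they are not taken from scamp.py.

```python
import subprocess
import sys


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def run_projects(accessions, runner=subprocess.run):
    """Process one accession at a time so MaxQuant never sees two projects at once."""
    for accession in accessions:
        # Assumed invocation: the real arguments of download.py may differ.
        runner([sys.executable, "download.py", accession], check=True)
```

Calling run_projects(sys.argv[1:]) from the command-line entry point would then hand each accession to download.py in turn, and make_batches(file_names, batch_size) would give the per-project batches.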

2 Pipeline Structure
2.1 Retrieval of Project Metadata
The PRIDE Archive API module ‘ProjectApi.py’ is used to obtain the project metadata, including the project title, abstract, list of file names, species, and post-translational modifications (PTMs). The species retrieved from PRIDE determines the proteome fasta file that is written into the xml file fed to MaxQuant. This is achieved with an organism dictionary, ‘orgdict’, whose keys are the species labels returned by ProjectApi.py and whose values are the file paths to the corresponding proteomes. The PTMs retrieved from PRIDE (in list format) are named differently from those used by MaxQuant. This was overcome by reviewing a large number of PRIDE projects that both display their PTMs on the site and provide an mqpar.xml file, and creating a dictionary to correlate the two.
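The two lookups might be structured along these lines. Every key, path, and pairing below is a placeholder chosen for illustration, as are the name ‘ptmdict’ and the helper functions; the real entries live in the pipeline scripts and on the server.

```python
# Placeholder entries only: the real labels and paths are defined by the pipeline.
orgdict = {
    "Homo sapiens (Human)": "/proteomes/human.fasta",
    "Mus musculus (Mouse)": "/proteomes/mouse.fasta",
}

# Assumed name and illustrative pairings of PRIDE labels with MaxQuant labels.
ptmdict = {
    "monohydroxylated residue": "Oxidation (M)",
    "acetylated residue": "Acetyl (Protein N-term)",
    "phosphorylated residue": "Phospho (STY)",
}


def fasta_for(species):
    """Return the proteome fasta path for a species label from PRIDE."""
    return orgdict[species]


def maxquant_mods(pride_ptms):
    """Translate a list of PRIDE modification labels into MaxQuant labels."""
    return [ptmdict[ptm] for ptm in pride_ptms]
```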

2.2 Data Fetch
In the data fetch stage, the wget module is used to download all of the files associated with the project that have the extension ‘.raw’. The files are downloaded to a folder created by the script and named with the project accession and batch number. During the download a list called ‘pathlist’ is created, containing the file path of each file downloaded in the batch. A further list, ‘jnames’, contains the file names without their extension; it is used later in the pipeline.
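A minimal sketch of this stage is given below. The folder naming pattern, the function name, and the ‘root’ and ‘download’ parameters are assumptions for the example; only ‘pathlist’, ‘jnames’, and the use of wget come from the description above.

```python
import os


def fetch_raw_files(file_urls, accession, batch_number, download=None, root="."):
    """Download the .raw files of one batch and return (pathlist, jnames)."""
    if download is None:
        import wget  # third-party package, imported only when actually needed
        download = wget.download
    # Assumed naming scheme for the batch folder.
    folder = os.path.join(root, f"{accession}_batch{batch_number}")
    os.makedirs(folder, exist_ok=True)
    pathlist, jnames = [], []
    for url in file_urls:
        filename = url.rsplit("/", 1)[-1]
        # Case-insensitive match is an assumption; the text only names '.raw'.
        if not filename.lower().endswith(".raw"):
            continue
        path = os.path.join(folder, filename)
        download(url, out=path)
        pathlist.append(path)
        jnames.append(os.path.splitext(filename)[0])
    return pathlist, jnames
```

Passing a stand-in for ‘download’ makes it possible to exercise the list building without touching the network.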

2.3 XML file generation
The next step in the pipeline is the creation of a .xml file that can be fed to MaxQuant. ElementTree is used to modify a number of parameters in a template .xml file and write the result to a new file specific to the batch. The parameters modified are the raw file paths, the modifications, the number of threads, and the path to the corresponding fasta file.
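The template editing could be sketched as follows. The inline template is a heavily cut-down stand-in for mqpar.xml, and its element names are assumptions modelled on typical MaxQuant parameter files; they may not match the template shipped with the pipeline or a given MaxQuant version.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for mqpar.xml; element names are assumptions.
TEMPLATE = """<MaxQuantParams>
  <fastaFiles><FastaFileInfo><fastaFilePath></fastaFilePath></FastaFileInfo></fastaFiles>
  <numThreads>1</numThreads>
  <filePaths></filePaths>
  <parameterGroups><parameterGroup>
    <variableModifications></variableModifications>
  </parameterGroup></parameterGroups>
</MaxQuantParams>"""


def _set_strings(parent, values):
    """Replace all children of parent with one <string> element per value."""
    for child in list(parent):
        parent.remove(child)
    for value in values:
        ET.SubElement(parent, "string").text = value


def build_mqpar(template_xml, pathlist, mods, threads, fasta_path):
    """Fill a template with batch-specific settings and return an ElementTree."""
    root = ET.fromstring(template_xml)
    root.find(".//fastaFilePath").text = fasta_path
    root.find("numThreads").text = str(threads)
    _set_strings(root.find("filePaths"), pathlist)
    _set_strings(root.find(".//variableModifications"), mods)
    return ET.ElementTree(root)
```

The returned tree can be saved with tree.write("mqpar_batch1.xml"), where the output name is again only an example. A full parameter file is likely to hold further per-file lists that need one entry per raw file, which this sketch leaves out.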

2.4 MaxQuant Visualisation and SQLite File Generation
Once the .xml file with the MaxQuant parameters has been created, MaxQuant is called from the command line via a subprocess call. When the MaxQuant analysis is complete, a further subprocess call runs a separate script which parses the information of interest from the MaxQuant output (contained in evidence.txt and peptides.txt) for the SQLite file.
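The parsing step might look something like the sketch below, which loads selected columns of a tab-delimited evidence.txt into an SQLite table. The table name, the function name, and the default column selection are assumptions for the example; the columns the pipeline actually keeps are not stated here.

```python
import csv
import sqlite3

# Assumed selection of evidence.txt columns, for illustration only.
DEFAULT_COLUMNS = ("Sequence", "Modifications", "Proteins", "Raw file", "Intensity")


def evidence_to_sqlite(evidence_path, db_path, columns=DEFAULT_COLUMNS):
    """Copy the chosen columns of a tab-delimited file into an SQLite table."""
    with open(evidence_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = [tuple(row[name] for name in columns) for row in reader]
    quoted = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(f"CREATE TABLE IF NOT EXISTS evidence ({quoted})")
        conn.executemany(
            f"INSERT INTO evidence ({quoted}) VALUES ({placeholders})", rows
        )
        conn.commit()
    finally:
        conn.close()
    return len(rows)
```

Column names are quoted because some MaxQuant headers contain spaces. The same function could be pointed at peptides.txt with a different column list.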

3 Checkpoints and Warning Messages
3.1 Checkpoints
Checkpoints are placed throughout the pipeline. Each one ensures that the previous step has taken place and checks that a given step has not already occurred before undertaking it. The most commonly used checkpoint checks for the existence of a certain file or directory before carrying out the action that creates it. For example, before initiating the download of a raw file associated with a project, the script checks that neither the file itself nor its corresponding sqlite file is already present. Before passing the parameter .xml file to MaxQuant, the script checks that the ‘combined’ folder (which is generated by the MaxQuant analysis) is not present in the directory.
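The two example checkpoints can be expressed as small existence tests. The function names and the way paths are passed in are assumptions; only the conditions themselves follow the description above.

```python
import os


def needs_download(raw_path, sqlite_path):
    """True only if neither the raw file nor its sqlite file already exists."""
    return not (os.path.exists(raw_path) or os.path.exists(sqlite_path))


def needs_maxquant(batch_dir):
    """True only if MaxQuant has not yet produced a 'combined' folder here."""
    return not os.path.isdir(os.path.join(batch_dir, "combined"))
```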

3.2 Error Proofing and Warning Messages
While the pipeline has been built to be as robust to errors as possible, the diverse nature of the projects available on the PRIDE Archive means that it may not be able to deal with every eventuality. To account for this, the areas of highest risk were identified and handled in the pipeline. If the organism for a given project accession does not have a corresponding fasta file on the server to feed to MaxQuant, the pipeline reports that the organism is not in the pipeline dictionary and aborts the script. Occasionally on the PRIDE Archive, multiple organisms are associated with a given project; in this case the script exits and returns the message ‘There are multiple organisms associated with this project’. At present there does not appear to be a way to automate the sorting of files by organism for a multispecies project.

While an attempt was made to catalogue post-translational modifications as they are returned from PRIDE and correlate them to their MaxQuant counterparts, the customisable nature of modifications in both PRIDE and MaxQuant means that a modification obtained from PRIDE may not be in the modification dictionary, which would crash the script. To prevent this, a novel modification that appears in the list from PRIDE is removed from the list, and the user is presented with the warning in the terminal that ‘One of the modifications retrieved from PRIDE is not in our dictionary and has been removed from the list’. It would be preferable to account for all modifications, but there is little to no consensus in the literature surrounding MaxQuant on the effect of running the analysis without all relevant modifications, and an informal assessment indicates that the modifications may not be as consequential as originally believed.
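The failure handling described in this section could be sketched as below. The function names and the exact wording of the missing-organism message are assumptions; the multiple-organism message and the modification warning are quoted from the text above.

```python
import sys


def resolve_fasta(organisms, orgdict):
    """Return the fasta path for a single-organism project, or abort the script."""
    if not organisms:
        sys.exit("No organism is associated with this project")  # assumed message
    if len(organisms) > 1:
        sys.exit("There are multiple organisms associated with this project")
    organism = organisms[0]
    if organism not in orgdict:
        # Assumed wording; the text only says the organism is reported as missing.
        sys.exit(f"The organism '{organism}' is not in the pipeline dictionary")
    return orgdict[organism]


def translate_mods(pride_ptms, ptmdict):
    """Map PRIDE modifications to MaxQuant names, dropping unknown ones."""
    kept = []
    for ptm in pride_ptms:
        if ptm in ptmdict:
            kept.append(ptmdict[ptm])
        else:
            print("One of the modifications retrieved from PRIDE is not in "
                  "our dictionary and has been removed from the list")
    return kept
```

Dropping an unknown modification rather than raising keeps the analysis running, which matches the trade-off described above.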