This pipeline performs PCA-based population inference, using PLINK for variant extraction and R for classification, with the 1000 Genomes Project as a reference.
- run_population_classifier.sh: Main shell script that orchestrates the entire pipeline
- prepare_pcs.sh: Script to prepare principal components
- classify.R: R script to classify populations based on PCA results and save plots
- random_forest_model.RData: Pre-trained random forest model for population inference
- KGP_0.3.prune.in: Reference variants file
- KGP_pca.acount: Reference allele frequency file
- KGP_pca.eigenvec.allele: Reference PCA eigenvectors
- KGP_pca.eigenval: Reference PCA eigenvalues
Step 1: Unzip the folder
Unzip the folder to any directory on your system. The pipeline will run in the unzipped directory, so no additional configuration is required.
unzip run_classifier.zip
cd run_classifier
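Before editing anything, it can help to confirm that the archive unpacked completely. The sketch below checks for the files listed at the top of this README. The function name check_pipeline_files is hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the pipeline): verify that the files
# listed in this README exist in the given directory (default: current one).
check_pipeline_files() {
    local dir="${1:-.}"
    local missing=0
    local f
    for f in run_population_classifier.sh prepare_pcs.sh classify.R \
             random_forest_model.RData KGP_0.3.prune.in KGP_pca.acount \
             KGP_pca.eigenvec.allele KGP_pca.eigenval; do
        if [ ! -e "$dir/$f" ]; then
            echo "Missing: $f" >&2
            missing=1
        fi
    done
    # Returns 0 when every file is present, 1 otherwise.
    return "$missing"
}

# Example usage, run from inside run_classifier:
# check_pipeline_files . && echo "All pipeline files found"
```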
Step 2: Edit run_population_classifier.sh
First, edit the Slurm details to match your cluster. Then provide the study name and the path to your input files by editing lines 21 and 22 of run_population_classifier.sh.
The input data must be in PLINK binary format (.bed, .bim, .fam).
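Since the pipeline expects a complete PLINK binary fileset, a quick check before submitting can save a failed job. The following is a minimal sketch. The function name check_plink_fileset and the example path are assumptions for illustration, and the prefix is the input path without its extension.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): confirm that the .bed, .bim
# and .fam files for a given prefix all exist and are non-empty.
check_plink_fileset() {
    local prefix="$1"
    local missing=0
    local ext
    for ext in bed bim fam; do
        if [ ! -s "${prefix}.${ext}" ]; then
            echo "Missing or empty: ${prefix}.${ext}" >&2
            missing=1
        fi
    done
    # Returns 0 when the fileset is complete, 1 otherwise.
    return "$missing"
}

# Example usage (placeholder path, replace with your own prefix):
# check_plink_fileset /path/to/my_study && echo "Input fileset looks complete"
```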
Step 3: Run the pipeline
Submit the pipeline using the run_population_classifier.sh script. First, ensure you have the necessary permissions to run the script:
chmod +x run_population_classifier.sh
Then submit it to the scheduler:
sbatch run_population_classifier.sh
If you are not submitting to a scheduler, run it with bash instead, or edit the script to suit your system's job scheduler:
bash run_population_classifier.sh
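The choice between sbatch and bash described above can also be made automatically. This is only a sketch: choose_launcher is a hypothetical helper, and it assumes that finding sbatch on the PATH means Slurm is the scheduler you want to use.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "sbatch" when the Slurm submit command is
# available on the PATH, otherwise fall back to "bash".
choose_launcher() {
    if command -v sbatch >/dev/null 2>&1; then
        echo "sbatch"
    else
        echo "bash"
    fi
}

# Example usage:
# launcher=$(choose_launcher)
# "$launcher" run_population_classifier.sh
```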
This will run the entire pipeline, starting with preparing PCs from your data and then classifying populations using the pre-trained model.
- Population group classifications will be saved in .tsv files (e.g. EUR.tsv, AFR.tsv).
- Population plots will be saved as .png images (e.g. prob0.5.png).
- All output files will be saved in the same directory as the scripts.
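Once the job has finished, the outputs listed above can be reviewed with a short listing. The sketch below reports line counts for the .tsv files and names the .png plots. The function name summarise_outputs is hypothetical, and because this README does not say whether the .tsv files include a header row, the counts are raw line counts rather than sample counts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list .tsv outputs with their line counts and .png
# outputs as plots, for the given directory (default: current one).
summarise_outputs() {
    local dir="${1:-.}"
    local f
    for f in "$dir"/*.tsv "$dir"/*.png; do
        # Skip unmatched glob patterns when no such files exist.
        [ -e "$f" ] || continue
        case "$f" in
            *.tsv) printf '%s\t%s lines\n' "$(basename "$f")" "$(wc -l < "$f" | tr -d ' ')" ;;
            *.png) printf '%s\tplot\n' "$(basename "$f")" ;;
        esac
    done
}

# Example usage, run from the directory containing the scripts:
# summarise_outputs .
```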
If you encounter any issues or have questions, feel free to contact Ritah via [email protected].
Best wishes!