This repository contains the files and analysis pipeline for the paper 'Colorectal cancer risk stratification using a polygenic risk score in symptomatic primary care patients – a UK Biobank retrospective cohort study'.
DOI: https://doi.org/10.1038/s41431-024-01654-3
analysis.R contains all the code used for the analysis. It mostly runs on R version 4.1.1.
The code is split into the sections listed below. Some sections require input files, which are often hosted securely on UKBB and can't be shared publicly; for reproducibility, I've tried to describe the relevant details of these files (e.g. column headers) in code annotations. Some analysis decisions were made based on descriptive graphs of the data. In summary, the code isn't designed to run from beginning to end without stopping, and it will need adapting if applied to different datasets.
Contents:
1. Identify a list of CRC symptoms.
2. Find UKBB participants with symptoms and make a table of the earliest symptom for each participant.
3. Find the earliest occurrence of CRC for participants (identify cases and controls) and remove participants with hereditary syndromes that increase the risk of CRC.
4. Add all lifestyle/symptom/health variables to the participant data frame.
5. Check case/control numbers by ancestry. Analysis continued with the European-ancestry cohort only, because of case numbers, and with unrelated individuals only, to avoid bias.
6. Generate the polygenic risk score for all participants and work out quintiles.
7. Split the cohort 80:20 into training and testing groups for validation. Stratify both the training and testing cohorts by age and sex.
8. In the training cohort: logistic regression analysis to find variables associated with case or control groups.
9. In the training cohort: calculate the ROCAUC of each variable and build an integrated risk model iteratively based on ROCAUC values, with 5-fold cross-validation.
10. In the training cohort: compare all possible integrated risk models with AIC.
11. The results of steps 9 and 10 agreed that a 6-variable integrated risk model performed best in the training cohort. Evaluate this model in the testing cohort.
Abbreviations: AIC = Akaike information criterion; CRC = colorectal cancer; ROCAUC = receiver operating characteristic area under the curve; UKBB = UK Biobank
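
To give a feel for the quintile, train/test split, ROCAUC and AIC steps in the contents list, here is a minimal base-R sketch on simulated data. It is not taken from analysis.R: the column names (age, sex, prs, case), the age bands, the simulated effect sizes and the two toy models are all assumptions made for illustration, and cross-validation is left out for brevity.

```r
# Illustrative sketch only: simulated data and hypothetical column names.
set.seed(1)
n <- 2000
dat <- data.frame(
  age = sample(40:79, n, replace = TRUE),
  sex = sample(c("F", "M"), n, replace = TRUE),
  prs = rnorm(n)
)
# Simulate case status so that risk rises with PRS and age (made-up effects)
lp <- -3 + 0.5 * dat$prs + 0.03 * (dat$age - 60)
dat$case <- rbinom(n, 1, plogis(lp))

# PRS quintiles: 1 = lowest fifth, 5 = highest fifth
breaks <- quantile(dat$prs, probs = seq(0, 1, 0.2))
dat$prs_quintile <- cut(dat$prs, breaks = breaks, labels = FALSE,
                        include.lowest = TRUE)

# 80:20 split, sampled within each age-band-by-sex stratum
dat$age_band <- cut(dat$age, breaks = c(39, 49, 59, 69, 79))
strata <- interaction(dat$age_band, dat$sex, drop = TRUE)
train_idx <- unlist(lapply(split(seq_len(n), strata), function(i) {
  i[sample.int(length(i), size = round(0.8 * length(i)))]
}), use.names = FALSE)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# ROCAUC from the rank-sum (Mann-Whitney) formula; label must be coded 0/1
auc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Two candidate logistic models fitted in the training cohort
m_age <- glm(case ~ age, family = binomial, data = train)
m_age_prs <- glm(case ~ age + prs, family = binomial, data = train)

# Compare the models by AIC (lower is better), then score one in the test cohort
aic_table <- AIC(m_age, m_age_prs)
test_auc <- auc(predict(m_age_prs, newdata = test, type = "response"), test$case)
```

Sampling within each stratum keeps the age and sex make-up of the two cohorts close to each other. The rank-sum formula is used here only to avoid a package dependency; a dedicated ROC package would work equally well.
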
Please note that the code to generate a polygenic risk score in this pipeline no longer works, because the rbgen package it relies on is no longer available online. New code for this can be found at: https://github.com/hdg204/GRS-Nexus - please contact the author of this repository with any questions.
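
For orientation, a polygenic risk score is commonly computed as a weighted sum of effect-allele dosages across variants. The sketch below shows only that arithmetic in base R. It is not the GRS-Nexus code and does not read genotype files; the participant IDs, variant names, dosages and weights are all made up for illustration.

```r
# Hypothetical dosage matrix: rows = participants, columns = variants,
# entries = expected number (0 to 2) of copies of the effect allele.
dosage <- matrix(
  c(0, 1, 2,
    1, 1, 0,
    2, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("id1", "id2", "id3"), c("snpA", "snpB", "snpC"))
)

# Hypothetical per-variant weights (for example, log odds ratios)
weights <- c(snpA = 0.10, snpB = -0.05, snpC = 0.20)

# Align dosage columns to the weights by name, then take the weighted sum
prs <- drop(dosage[, names(weights), drop = FALSE] %*% weights)
prs
```

Matching columns to weights by name, rather than by position, guards against the two being stored in different orders.
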
The find_read_codes folder contains an R function which takes Read codes as input and returns similar Read codes.
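
As a rough idea of what "similar" could mean, the sketch below assumes 5-character hierarchical Read v2 codes padded with trailing dots, and treats two codes as similar when they share their first few characters. This is an assumption for illustration, not the function in find_read_codes; the helper name find_similar_codes, its depth parameter and the example codes are all made up.

```r
# Hypothetical helper, NOT the function shipped in find_read_codes.
# Returns dictionary codes that share the first `depth` characters
# with any query code, excluding the query codes themselves.
find_similar_codes <- function(query, dictionary, depth = 3) {
  stems <- substr(query, 1, depth)
  hits <- dictionary[substr(dictionary, 1, depth) %in% stems]
  setdiff(hits, query)
}

# Made-up codes in the 5-character, dot-padded layout
dictionary <- c("AB1..", "AB11.", "AB12.", "AB2..", "XY1..")
similar <- find_similar_codes("AB11.", dictionary)
similar
```

A smaller depth widens the search to a broader branch of the hierarchy, so any candidate codes returned this way would still need manual review before use.
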
CRC_read_codes contains some of the 227 Read codes for CRC symptoms which were used to include participants in this study (others are available upon request - see folder readme file), and the list of 49 Read codes used to identify cases of CRC in participants' GP records.