Plasma cell-free DNA methylomes for hepatocellular carcinoma (HCC) detection and monitoring of recurrence after liver resection or transplantation
The detection of hepatocellular carcinoma (HCC) via mutational profiling of cell-free DNA (cfDNA) in blood plasma is hampered by mutational diversity and the challenge of acquiring tumor tissue for targeted gene panels. Alternatively, DNA methylation patterns can illuminate biological processes without requiring prior mutation-specific knowledge. Our investigation evaluated the effectiveness of cell-free methylated DNA immunoprecipitation and sequencing (cfMeDIP-Seq) in identifying HCC and monitoring its recurrence after surgery.
This research represents the first application of cfMeDIP-Seq to HCC for both its identification and the tracking of residual disease post-operatively. We collected baseline plasma samples from HCC patients undergoing curative interventions or liver transplants, as well as from healthy individuals, including living liver donors. Utilizing cfMeDIP-Seq and machine learning, we identified HCC-specific differentially methylated regions (DMRs) and developed a HCC methylation score (HMS). This score accurately distinguished HCC with high specificity and sensitivity, and importantly, predicted the likelihood of postoperative recurrence. Analysis of post-surgery follow-up samples using cfMeDIP-Seq revealed that changes in HMS were indicative of future recurrence, demonstrating cfMeDIP-Seq’s utility for both precise HCC detection and recurrence prediction without the need for tumor-specific mutations.
Our findings demonstrate the potential of cfMeDIP-Seq for sensitive and specific HCC detection irrespective of tumor type, as well as for predicting recurrent disease post-surgery. The comprehensive analysis of tumor-agnostic cfDNA methylomes accurately detected HCC and forecasted recurrence following liver resection or transplantation. This innovative approach may address current challenges in minimally invasive HCC diagnosis using liquid biopsy, offering significant implications for screening, early detection, treatment decisions, and disease progression monitoring and therapy response assessment.
The repository contains bash scripts, data processing scripts, and R scripts designed to reproduce figures for the manuscript "Plasma cell-free DNA methylomes for hepatocellular carcinoma detection and monitoring after liver resection or transplantation"
R scripts were executed in either R version 3.5.0 or 4.0.1, depending on the specific package dependencies (listeded in Table 1), utilizing RStudio versions (2022.12.0+353 ~ 2023.12.0+369).
The R packages required are detailed within each script.
Tool | Function | Version | Running_Platform | Language | Alternative_tool | Link |
---|---|---|---|---|---|---|
FastQC | FASTQ QC | 0.11.5 | H4H_shell | Java | http://www.bioinformatics.babraham.ac.uk/projects/fastqc | |
MultiQC | ❶ FASTQ QC ❷ count reads |
1.7 | H4H_shell | Python | https://multiqc.info | |
fastp | ❶ ISD QC3 ❷ trim |
0.23.1 | H4H_shell | C++ | ❶ Trim Galore ❷ Trimmomatic ❸ Cutadapt |
https://github.com/OpenGene/fastp |
bowtie2 | align | 2.4.5 | H4H_shell | C++ | BWA-mem | https://bowtie-bio.sourceforge.net/bowtie2/index.shtml |
qualimap | BAM QC | 2.2 | H4H_shell | Java | http://qualimap.conesalab.org | |
SAMtools | sort - index | 1.14 | H4H_shell | C | Picard | http://www.htslib.org |
sambamba | deduplicate | 0.7.0 | H4H_shell | D | ❶ SAMtools ❷ Picard |
https://lomereiter.github.io/sambamba |
MEDIPS | ❶ QC2 ❷ export wig ❸ DMRs/ROIs(edgeR) |
1.50.0 | H4H_R/3.5.0 | R | DESeq2 | https://doi.org/doi:10.18129/B9.bioc.MEDIPS |
MeDEStrand | absolute_methylation_level | 0.0.0.9000 | H4H_R/3.5.0 | R | https://github.com/jxu1234/MeDEStrand | |
sva/ComBat_seq | reduce batch effect | 3.46.0 | H4H_R/4.0.1 | R | https://github.com/zhangyuqing/ComBat-seq | |
edgeR | ❶ TMM normalization ❷ CPM counts |
3.28.0 | H4H_R/4.0.1 | R | https://bioconductor.org/packages/release/bioc/html/edgeR.html | |
limma | ❶ vCounts ❷ DMRs(CPM-trend) |
3.42.0 | ❶ H4H_R/4.0.1 ❷ RStudio |
R | https://bioconductor.org/packages/release/bioc/html/limma.html | |
FactoMineR/factoextra | PCA | 2.8/1.0.7 | ❶ H4H_R/4.0.1 ❷ RStudio |
R | https://rpkgs.datanovia.com/factoextra/index.html | |
randomForest | ❶ build & train models ❷ classification prediction |
4.7.1.1 | ❶ H4H_R/3.5.0 ❷ RStudio |
R | https://www.stat.berkeley.edu/users/breiman/RandomForests | |
glmnet | regularized linear modeling | 4.1.7 | ❶ H4H_R/3.5.0 ❷ RStudio |
R | https://glmnet.stanford.edu/index.html | |
caret | model training configuration | 6.0 | ❶ H4H_R/3.5.0 ❷ RStudio |
R | https://topepo.github.io/caret | |
pROC | AUROC | 1.18.0 | RStudio | R | https://xrobin.github.io/pROC | |
ggplot2 | data visualization | 3.4.2 | RStudio | R | https://ggplot2.tidyverse.org |
Step | Script | Analysis | Panels Generated |
---|---|---|---|
1 | step_1_fastq_to_bam.sh | ||
1s | step_1s_ISD.sh | QC | Figure S2B/2C |
2 | step_2_bam_to_wig.R | ||
3 | step_3_wig_to_txt.sh | ||
4 | step_4_matrices_processing.R | ||
5 | step_5_medips_qc.R | QC | Figure S2A |
6 | step_6_hcc_classifiy_hms.R | kFold modeling AUROC HMS PCA Confusion matrix Simply moving average Postoperative HMS trajectory |
Figure 2 Figure 3 Figure 4 Figure 5 Figure S3C Figure S4 |
7 | step_7_hcc_subtyping.R | HCC subgrouping | Figure S3A/B |
7s | step_7s_one_vs_each_hcc.R | OnevsEach.hcc() required by hcc_subgroup.R |
Name | Size | Script using this file | Description |
---|---|---|---|
hg19_chr1_22_m_coord.rds | 90 MB | step_4_matrices_processing.R | genomic coordinates for 4801145 300bp bins (hg19, chr:1-22 and M) |
hg19_chr1_22_coord.rds | 94.4 MB | - | optional if only chr:1-22 needed |
hg19_chr1_22_x_y_coord.rds |
101.4 MB | - | optional if only chr:X&Y needed |
hg19_chr1_22_m_x_y_coord.rds | 101.4 MB | - | optional if only chr:M needed |
black_bin_v2.RData | 7.8 MB | step_6_hcc_classifiy_hms.R | 109120 bins falling in ENCODE blacklist regions |
n236_cpm.rds | 1.31 GB | - | cfMeDIP signals (summed CPM) analysis |
n236_lcpm.rds | 1.33 GB | step_6_hcc_classifiy_hms.R | matrix for kFold cross-validation limma-trend |
vCount_n236.RData | 1.33 GB | - | option for limma-voom |
sample.rds | 1KB | step_6_hcc_classifiy_hms.R | Clinical metadata |
sample_sq.rds | 1KB | step_6_hcc_classifiy_hms.R | Clinical metadata |
sample_124.rds | 1KB | step_7_hcc_subtyping.R | Clinical metadata |
kFold_ML_res.RData | 1.5 MB | step_6_hcc_classifiy_hms.R | machine learning result of discovery cohort |
Val.prob.list.rds | 11.1 MB | step_6_hcc_classifiy_hms.R | machine learning result of validation cohort |
File Type | Storage Path | Operation/Function |
---|---|---|
Merged FastQ Files | ../bam |
cat for merging L001 and L002 |
FastQC Pre-Processing | ../fastqc_pre |
fastqc for initial QC |
Fastp Reports | ../fastqc_pre/fastp_report |
fastp for quality and adapter trimming |
FastQC Post-Processing | ../fastqc_post |
fastqc for QC after fastp |
Aligned SAM Files | Temporary storage | bowtie2 for alignment |
Unsorted BAM Files | Temporary storage | samtools view for SAM to BAM conversion |
Sorted BAM Files | ../bam |
samtools sort for sorting BAM |
BAM Index Files | ../bam |
samtools index for indexing BAM |
Pre-QC Reports | ../fastq_bam_QC |
samtools flagstat and qualimap |
Deduplicated BAM Files | ../bam |
sambamba markdup for deduplication |
Post-QC Reports | ../fastq_bam_QC |
samtools flagstat and qualimap |
Please note that temporary files, such as unsorted BAM and aligned SAM files, are removed after they are no longer needed to save storage space.
This table clearly delineates the output files from the MEDIPS QC analysis, where to find them, and which specific function in the MEDIPS package or custom script is responsible for generating each file. This comprehensive overview aids in understanding the workflow and accessing specific results for further analysis or reporting.
Result File Description | File Pattern/Type | Storage Path | Generating Function |
---|---|---|---|
Saturation Plots | _SaturationPlot.pdf |
outdir/pdf/ |
MEDIPS.plotSaturation() |
Sequence Coverage Pie Charts | _SeqCoveragePlot_Pie.pdf |
outdir/pdf/ |
MEDIPS.plotSeqCoverage() (pie) |
Sequence Coverage Histograms | _SeqCoveragePlot_Hist.pdf |
outdir/pdf/ |
MEDIPS.plotSeqCoverage() (hist) |
Sequence Coverage Pie Charts (Including Non-Unique Reads) | _SeqCoveragePlot_Pie_uniqF.pdf |
outdir/pdf/ |
MEDIPS.plotSeqCoverage() (pie) |
Sequence Coverage Histograms (Including Non-Unique Reads) | _SeqCoveragePlotHist_uniqF.pdf |
outdir/pdf/ |
MEDIPS.plotSeqCoverage() (hist) |
Saturation Metrics (FigureS2A, BAM QC) | _saturation.txt |
outdir/QCstats/ |
MEDIPS.saturation() |
Coverage Metrics | _coverage.txt |
outdir/QCstats/ |
MEDIPS.seqCoverage() |
CpG Enrichment Analysis (FigureS2A, BAM QC) | _enrichment.txt |
outdir/QCstats/ |
MEDIPS.CpGenrich() |
Comprehensive QC Statistics | _QCstats.txt |
outdir/QCstats/ |
Custom aggregation of MEDIPS data |
Window Coordinates per Chromosome | _window_per_chr.csv |
outdir/ or specified subdirectory |
Custom script using MEDIPS data |
PNG versions of Saturation, Coverage, and Calibration Plots | Various PNG files | outdir/png/ |
MEDIPS.plotSaturation() MEDIPS.plotSeqCoverage() MEDIPS.plotCalibrationPlot() |
Package | Version | H4H/R3.5 | H4H/R4.0 | H4H/R4.2 | --mem | Running Time |
---|---|---|---|---|---|---|
MEDIPS | 1.50.0 | ✅ | ✅ | ❌ | ≥ 180G | ~1 day |
MeDEStrand | 0.0.0.9000 | ✅ | ❌ | ❌ | ≥ 180G | 1~2 days |
ChAMP | 2.28.0 | ❌ | ✅ | NA | ≥ 300G | 1~2 days |
DESeq2 | 1.38.2 | ✅ | ❌ | ✅ | ≥ 180G | ~1 day |
limma | 3.42.0 | ✅ | ✅ | ✅ | 1~2 days | |
edgeR | 3.28.0 | ✅ | ✅ | ❌ | 1~2 days | |
FactoMineR | 2.8 | ❌ | ✅ | ✅ | ||
factoextra | 1.0.7 | ❌ | ✅ | ✅ | ||
dplyr | 1.1.1 | ✅ | ❌ | ✅ | ||
tidyr | 1.3.0 | ✅ | ❌ | ✅ | ||
sva | 3.46.0 | ❌ | ✅ | ❌ | ≥ 500G | ~1 day |
caret | 6.0 | ✅ | ❌ | ❌ | ≥ 80G | ~2 days |
randomForest | 4.7.1.1 | ✅ | ✅ | ❌ | ≥ 80G | ~2 days |
glmnet | 4.1.7 | ✅ | ❌ | ❌ | ≥ 80G | ~2 days |
Tools | Arguments | Support Data | Dependencies |
---|---|---|---|
fastp | -w 10 -g -f 5 -F 5 --adapor_F== --adapor_R== --correction --html=$.html |
Adaptor Sequences Table | |
Bowtie2 | -p 10 --minins 50 --maxins 500 -x hg19 |
hg19.1.bt2 hg19.2.bt2 hg19.3.bt2 hg19.4.bt2 hg19.rev.1.bt2 hg19.rev.2.bt2 |
|
samtools view | -@ 10 -bS -f 2 -F 4 |
||
qualimap | -c --java-mem-size=30G -gff hg19.gtf -outformat pdf -nt 10 |
hg19.gtf | |
sambamba markdup | -r -p -t 10 --overflow-list-size 600000 |
||
MEDIPS/QC | - BSgenome="BSgenome.Hsapiens.UCSC.hg19" - uniq <- 1 - extend <- 300 - shift <- 0 -ws <- 300 - paired <- TRUE - chr.select <- paste0("chr",c(1:22,"X","Y","M")) |
NA | BSgenome.Hsapiens.UCSC.hg19 |
MEDIPS/wig_to_counts | - BSgenome="BSgenome.Hsapiens.UCSC.hg19" - uniq <- 1e-3 - extend <- 300 - shift <- 0 - ws <- 300 - paired <- TRUE - chr.select <- paste0("chr",c(1:22,"M")) |
hg19_chr1_22_m_coord.rds hg19_all_chr_coord.rds hg19_chr1_22_x_coord.rds hg19_chr1_22_x_y_coord.rds |
BSgenome.Hsapiens.UCSC.hg19 edgeR |
MeDEStrand | - BSgenome="BSgenome.Hsapiens.UCSC.hg19" - uniq <- 1 - extend <- 300 - shift <- 0 - ws <- 300 - paired <- TRUE - chr.select <- paste0("chr",c(1:22,"M")) |
BSgenome.Hsapiens.UCSC.hg19 BSgenome GenomicRanges MEDIPSData |
Raw sequencing data for all 203 HCC samples and 23 healthy controls are deposited at the European Genome-phenome Archive (EGA) under the accession number EGAS50000000450. Raw sequencing files for another 12 healthy normal controls can be requested at https://www.ontariohealthstudy.ca/for-researchers/data-access-formsand-templates. Processed intermediate data and objects used to replicate the results in this study are available at https://doi.org/10.5281/zenodo.11251606.
Kui Chen, [email protected]