README

Aurélie Gabriel 2023-10-11

This README describes how the code in the scripts/ folder was used to perform the analyses described in the following paper: AAG Gabriel, J Racle, M Falquet, C Jandus, D Gfeller. 2024. Robust estimation of cancer and immune cell-type proportions from bulk tumor ATAC-Seq data. DOI: https://doi.org/10.7554/eLife.94833.3

The following sections refer to a data/ folder containing files that were too large to be hosted in the GitHub repository. The complete data/ folder can be retrieved from Zenodo (https://zenodo.org/record/13132868, additional_data.zip).

Preprocessing of the bulk ATAC-Seq data

The following command line was used to process the ATAC-Seq samples listed in data/markers_identification_input_files/samples/Ref/SRA_samples_metadata.txt. The git_repo_path variable is the path to the location where this GitHub repository has been cloned. The genome_folder_path variable is the folder containing the refgenie genome assets, which are PEPATAC prerequisites for ATAC-Seq data processing; please refer to the PEPATAC documentation to download these data (http://pepatac.databio.org/en/latest/assets/). The sra_output_folder variable is the folder where the SRA data will be downloaded.

git_repo_path=user_defined_path # path to the folder where this GitHub repository was cloned
path_to_data_folder=user_defined_path # path to the data folder downloaded from Zenodo
genome_folder_path=user_defined_path # folder containing the refgenie genome assets required by PEPATAC
sra_output_folder=user_defined_path # folder where the SRA data are downloaded
pepatac_output_folder=user_defined_path # folder where the processed data are saved

nextflow run ${git_repo_path}scripts/bulk_preprocessing_scripts/ATAC_processing.nf \
-c ${git_repo_path}scripts/bulk_preprocessing_scripts/nextflow.config -profile singularity \
--samples_metadata ${path_to_data_folder}markers_identification_input_files/samples/Ref/SRA_samples_metadata.txt \
--sra_folder ${sra_output_folder} \
--output_folder ${pepatac_output_folder} \
--genome_version hg38 \
--genome_folder ${genome_folder_path} \
--multiqc_config ${git_repo_path}scripts/bulk_preprocessing_scripts/multiqc_config.yaml \
--blacklist_file ${git_repo_path}scripts/bulk_preprocessing_scripts/hg38-blacklist.v2.bed \
--original_config ${git_repo_path}scripts/bulk_preprocessing_scripts/pepatac_original.yaml \
--samples_origin SRA --adapters ${git_repo_path}scripts/bulk_preprocessing_scripts/all_adapters_PE.fa \
--preprocess true --getcounts false -bg -resume

The same pipeline was used to process the ENCODE samples listed in data/markers_identification_input_files/samples/ENCODE/SRA_samples_metadata.txt.
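
The commands in this README concatenate variables such as ${git_repo_path} directly with relative paths (for example ${git_repo_path}scripts/...), so each path variable is expected to end with a slash. A minimal sketch of a helper that normalizes this is shown below; ensure_trailing_slash is a hypothetical name introduced here for illustration and is not part of the repository, and /path/to/repo is a placeholder.

```shell
# Hypothetical helper (not part of the repository): append a trailing "/"
# to a path unless it already ends with one.
ensure_trailing_slash() {
    case "$1" in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}

# Example: normalize a placeholder path before using it in the commands above
git_repo_path=$(ensure_trailing_slash "/path/to/repo")
echo "${git_repo_path}scripts/bulk_preprocessing_scripts/ATAC_processing.nf"
```

The same helper can be applied to path_to_data_folder and to the output folder variables.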

Definition of the set of consensus peaks and counts extraction

The same pipeline can then be used to identify a set of consensus peaks across studies and cell types and to extract the raw counts for each peak and sample, using the options --preprocess false and --getcounts true.

raw_counts_output_folder=user_defined_path # folder where the counts are saved

nextflow run ${git_repo_path}scripts/bulk_preprocessing_scripts/ATAC_processing.nf \
-c ${git_repo_path}scripts/bulk_preprocessing_scripts/nextflow.config -profile singularity \
--samples_metadata ${path_to_data_folder}markers_identification_input_files/samples/Ref/SRA_samples_metadata.txt \
--output_folder ${pepatac_output_folder} \
--genome_version hg38 --peak_score_thr 2 \
--genome_folder ${genome_folder_path} \
--output_folder_counts ${raw_counts_output_folder} \
--pepatac_config ${git_repo_path}scripts/bulk_preprocessing_scripts/pepatac_config.yaml \
--preprocess false --getcounts true -bg -resume

Raw counts for each peak and sample, together with the metadata associated with each sample, are written to raw_counts_output_folder (raw_counts.txt and metadata.txt). These two files are used later in “Computing reference profiles and identifying marker peaks” and are also provided in the data/markers_identification_input_files/ folder.
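
Before moving on to the next step, it can be useful to confirm that both files were written. A minimal sketch, assuming the folder variable ends with a slash as in the commands above; check_outputs is a hypothetical helper and is not part of the repository.

```shell
# Hypothetical helper (not part of the repository): given a folder and a list
# of file names, report every file that is missing or empty.
# Returns 0 if all files are present and non-empty, 1 otherwise.
check_outputs() {
    folder=$1
    shift
    missing=0
    for f in "$@"; do
        if [ ! -s "${folder}${f}" ]; then
            echo "missing or empty: ${folder}${f}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (run once the nextflow job has finished):
# check_outputs "${raw_counts_output_folder}" raw_counts.txt metadata.txt
```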

We extracted the raw counts from the ENCODE data for the peaks identified in the reference samples (listed in the .saf file) using the following command line. The resulting counts and the ENCODE metadata are located in data/markers_identification_input_files/encode_counts.txt and data/markers_identification_input_files/encode_metadata.txt.

nextflow run ${git_repo_path}scripts/bulk_preprocessing_scripts/extract_peaks_counts.nf \
-c ${git_repo_path}scripts/bulk_preprocessing_scripts/nextflow.config \
-profile singularity \
--bam_folder bam_files_ENCODE/ \
--output_folder_counts ${raw_counts_output_folder}ENCODE/ \
--saf_file ${raw_counts_output_folder}all_consensusPeaks.saf -bg -resume

# bam_files_ENCODE/ contains all bam files obtained from the preprocessing of the ENCODE ATAC-Seq data
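
Since extract_peaks_counts.nf reads its input from --bam_folder, a quick count of the BAM files in that folder can catch an incomplete preprocessing run before the extraction is launched. A minimal sketch; count_bam_files is a hypothetical helper that is not part of the repository, and it assumes the BAM files sit directly in the folder rather than in subfolders.

```shell
# Hypothetical helper (not part of the repository): print the number of
# *.bam files found directly inside the given folder (subfolders are ignored).
count_bam_files() {
    find "$1" -maxdepth 1 -type f -name '*.bam' | wc -l | tr -d ' '
}

# Usage:
# count_bam_files bam_files_ENCODE/
```

The printed number can then be compared by hand with the number of ENCODE samples that were preprocessed.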

Computing reference profiles and identifying marker peaks

Markers are identified in 10 subsets of the reference samples (the list of samples in each subset is located in data/markers_identification_input_files/samples/Ref/) and a consensus of these markers is retrieved. The following command line generates reference profiles and identifies cell-type-specific marker peaks for EPIC-ATAC. It also uses CIBERSORTx and DeconPeaker to build reference profiles. CIBERSORTx requires a singularity image, which can be obtained by requesting an access token on the CIBERSORTx website. The singularity image, the token and the associated username must be provided in the workflow parameters (CIBERSORTx_singularity, cibersortx_token, CIBERSORTx_username).

git_repo_path=user_defined_path # path to the folder where this github repository was cloned 
path_to_data_folder=user_defined_path # path to the data folder downloaded from Zenodo
my_cibersortx_token=user_defined_path # path to CIBERSORTx token (obtained from CIBERSORTx website)
cibersortx_container=user_defined_path # path to CIBERSORTx container image (obtained from CIBERSORTx website)
cibersortx_username=user_defined # CIBERSORTx username associated with the token file

# Without considering T cell subtypes
ref_output_path=user_defined_path # path where the output files are saved

nextflow run ${git_repo_path}scripts/reference_profiles_scripts/build_references.nf \
-c ${git_repo_path}scripts/reference_profiles_scripts/nextflow.config -profile singularity \
--counts_path ${path_to_data_folder}markers_identification_input_files/raw_counts.txt \
--metadata_path ${path_to_data_folder}markers_identification_input_files/metadata.txt \
--output_folder ${ref_output_path} \
--cibersortx_token ${my_cibersortx_token} \
--CIBERSORTx_singularity ${cibersortx_container} \
--CIBERSORTx_username ${cibersortx_username} \
--cross_validation_files ${path_to_data_folder}markers_identification_input_files/samples/Ref/subsample \
--encode_count_file ${path_to_data_folder}markers_identification_input_files/encode_counts.txt \
--encode_metadata_file ${path_to_data_folder}markers_identification_input_files/encode_metadata.txt \
--TCGA_path ${path_to_data_folder}markers_identification_input_files/TCGA_data.rds \
--HA_path ${path_to_data_folder}markers_identification_input_files/Human_Atlas_peaks.txt \
-bg -resume


# Considering T cell subtypes
ref_output_path=user_defined_path # path where the output files are saved

nextflow run ${git_repo_path}scripts/reference_profiles_scripts/build_references.nf \
-c ${git_repo_path}scripts/reference_profiles_scripts/nextflow.config -profile singularity \
--counts_path ${path_to_data_folder}markers_identification_input_files/raw_counts.txt \
--metadata_path ${path_to_data_folder}markers_identification_input_files/metadata.txt \
--output_folder ${ref_output_path} \
--cibersortx_token ${my_cibersortx_token} \
--CIBERSORTx_singularity ${cibersortx_container} \
--CIBERSORTx_username ${cibersortx_username} \
--cross_validation_files ${path_to_data_folder}markers_identification_input_files/samples/Ref/subsample \
--encode_count_file ${path_to_data_folder}markers_identification_input_files/encode_counts.txt \
--encode_metadata_file ${path_to_data_folder}markers_identification_input_files/encode_metadata.txt \
--TCGA_path ${path_to_data_folder}markers_identification_input_files/TCGA_data.rds \
--HA_path ${path_to_data_folder}markers_identification_input_files/Human_Atlas_peaks.txt \
--major_groups FALSE \
-bg -resume
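
The two commands above differ only in the --major_groups FALSE option (and in the value chosen for ref_output_path). A minimal sketch of a wrapper that keeps the shared arguments in one place is shown below; build_references, input_dir and NEXTFLOW_CMD are hypothetical names introduced here for illustration, and the variables defined at the start of this section are assumed to be set.

```shell
# Hypothetical wrapper (not part of the repository). The first argument is the
# output folder; any remaining arguments are appended to the nextflow command.
# Setting NEXTFLOW_CMD=echo prints the command instead of running it.
build_references() {
    ref_output_path=$1
    shift
    input_dir="${path_to_data_folder}markers_identification_input_files/"
    ${NEXTFLOW_CMD:-nextflow} run "${git_repo_path}scripts/reference_profiles_scripts/build_references.nf" \
        -c "${git_repo_path}scripts/reference_profiles_scripts/nextflow.config" -profile singularity \
        --counts_path "${input_dir}raw_counts.txt" \
        --metadata_path "${input_dir}metadata.txt" \
        --output_folder "${ref_output_path}" \
        --cibersortx_token "${my_cibersortx_token}" \
        --CIBERSORTx_singularity "${cibersortx_container}" \
        --CIBERSORTx_username "${cibersortx_username}" \
        --cross_validation_files "${input_dir}samples/Ref/subsample" \
        --encode_count_file "${input_dir}encode_counts.txt" \
        --encode_metadata_file "${input_dir}encode_metadata.txt" \
        --TCGA_path "${input_dir}TCGA_data.rds" \
        --HA_path "${input_dir}Human_Atlas_peaks.txt" \
        "$@" -bg -resume
}

# Without considering T cell subtypes:
# build_references user_defined_path
# Considering T cell subtypes:
# build_references user_defined_path --major_groups FALSE
```

Running the wrapper once with NEXTFLOW_CMD=echo is a cheap way to inspect the expanded paths before launching the workflow.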

Evaluating EPIC-ATAC and reproducing the manuscript figures

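git_repo_path=user_defined_path # path to the folder where this GitHub repository was cloned
path_to_data_folder=user_defined_path # path to the data folder downloaded from Zenodo
my_cibersortx_token=user_defined_path # path to CIBERSORTx token (obtained from CIBERSORTx website)
cibersortx_container=user_defined_path # path to CIBERSORTx container image (obtained from CIBERSORTx website)
cibersortx_username=user_defined # CIBERSORTx username associated with the token file
output_path=user_defined_path # path where the output files are saved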

# Without considering the T cell subtypes
nextflow run ${git_repo_path}scripts/main_analyses/main_analyses.nf \
-c ${git_repo_path}scripts/main_analyses/nextflow.config \
-profile singularity \
--cibersortx_token ${my_cibersortx_token} \
--CIBERSORTx_singularity ${cibersortx_container} \
--CIBERSORTx_username ${cibersortx_username} \
--data_path ${path_to_data_folder} --withSubtypes FALSE \
--output_path ${output_path} -bg -resume

# Considering the T cell subtypes
nextflow run ${git_repo_path}scripts/main_analyses/main_analyses.nf \
-c ${git_repo_path}scripts/main_analyses/nextflow.config \
-profile singularity \
--cibersortx_token ${my_cibersortx_token} \
--CIBERSORTx_singularity ${cibersortx_container} \
--CIBERSORTx_username ${cibersortx_username} \
--data_path ${path_to_data_folder} --withSubtypes TRUE \
--output_path ${output_path} -bg -resume