
HaplotypR pipeline

Monica-Golumbeanu edited this page Jan 6, 2025 · 1 revision

Pipeline for analysis of AmpSeq data with the classic protocol using HaplotypR

This pipeline is based on the HaplotypR package and does not require the demultiplexing-by-sample step. All processing up to haplotype calling runs in parallel.

Key steps for running the pipeline:

  1. First, set up an input folder and an output folder. The input folder holds only the few files the pipeline needs (described below), while the output folder will hold every file and folder created during the run, as well as the final results. You provide the paths to both folders when you run the pipeline.

  2. Next, create the files the pipeline needs, which are the same as for HaplotypR: markerFile.txt and sample_table.txt. You can create markerFile.txt by hand. To create sample_table.txt, you need a table of sample names and their corresponding file names; the script create_sample_file.R can be used for this purpose. Examples of these files and their formatting can be found here.

  3. Once you have the resource files, you can run the pipeline as follows:

  • First, run the preprocessing part (script submit_AmpSeqPreprocess.sh), which demultiplexes reads by marker and merges the forward and reverse reads. Sample preprocessing is an array job; to run it, use the command sbatch --array=1-N_samples submit_AmpSeqPreprocess.sh input_folder output_folder, replacing input_folder and output_folder with the paths to your input and output folders and N_samples with the total number of samples. This step is based entirely on HaplotypR functions, so it should create the same folders and files as HaplotypR. To debug this step, look at the haplotypR_demultiplex_marker_sample.R and haplotypR_merge_reads_sample.R scripts, which contain commented sections you can use for debugging.
  • After preprocessing, run haplotype calling with submit_AmpSeqCallHaplotypes.sh. When it finishes, the file finalHaplotypeTab.RDS in your output folder will contain the final haplotypes.
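The preprocessing command above needs N_samples, which is easy to get wrong by hand. The following is a minimal sketch, not part of the pipeline itself, that derives N_samples from sample_table.txt and prints the sbatch command for review. It assumes that sample_table.txt sits in the input folder and has one header line followed by one line per sample; the helper names count_samples and preprocess_cmd are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the sample_table.txt layout (one header line, then one
# line per sample) and its location in the input folder are assumptions.

# Count sample rows: skip the header line and ignore blank lines.
count_samples() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Print (do not run) the array-job command described above,
# with N_samples filled in from the sample table.
preprocess_cmd() {
    local input_folder=$1 output_folder=$2
    local n
    n=$(count_samples "$input_folder/sample_table.txt")
    echo "sbatch --array=1-$n submit_AmpSeqPreprocess.sh $input_folder $output_folder"
}
```

Calling preprocess_cmd with your input and output folder paths prints the full command, which you can check and then run yourself. Printing rather than submitting keeps the sketch safe to try on a machine without Slurm.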

Do not forget to adapt the header information in submit_AmpSeqPreprocess.sh and submit_AmpSeqCallHaplotypes.sh to match your cluster and file system.
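For orientation, the header of a Slurm batch script is usually a block of #SBATCH directives at the top of the file. The sketch below shows the kind of lines you would typically adapt. It is illustrative only: the actual headers in the two submit scripts may look different, and the job name, time limit, memory, and log file name here are placeholders rather than recommended values.

```shell
#!/bin/bash
# Illustrative header only; every value below is a placeholder.
# Hypothetical job name shown in the queue:
#SBATCH --job-name=ampseq_preprocess
# Wall-time limit and memory per task, to be set for your cluster:
#SBATCH --time=01:00:00
#SBATCH --mem=8G
#SBATCH --cpus-per-task=1
# Log file name: %A is the array job ID, %a is the array task index.
#SBATCH --output=preprocess_%A_%a.out

# Slurm sets SLURM_ARRAY_TASK_ID for each task of an array job.
# Fall back to 1 so the script can be tried outside Slurm.
task_id=${SLURM_ARRAY_TASK_ID:-1}
echo "Processing sample index $task_id"
```

Clusters often also require a partition or account directive; check your site documentation for the options that apply.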