From 6282c49c9e9347e3eab15725b717fe33b6b997a2 Mon Sep 17 00:00:00 2001
From: Matteo Schiavinato <matteo.schiavinato.90@gmail.com>
Date: Thu, 2 Dec 2021 13:17:01 +0100
Subject: [PATCH] Fixed output files and folders; Fixed README

---
 README.md |  66 ++++++++++----------------
 jloh      | 135 ++++++++++++++++++++++--------------------------------
 2 files changed, 79 insertions(+), 122 deletions(-)
diff --git a/README.md b/README.md
index f6ce1c6..06f36ef 100644
--- a/README.md
+++ b/README.md
@@ -24,7 +24,7 @@ And it's ready to go!
 The basic usage of the program is as simple as:
 
 ```
-./jloh --vcf <VCF> --genome-file <GENOME_FILE> [options]
+./jloh --vcf <VCF> --genome-file <GENOME_FILE> --bam <BAM> [options]
 ```
 
 To produce a genome file, simply calculate the length of each sequence in your reference FASTA file and produce a file containing their name + length, structured in a tab-separated format that looks like this:
@@ -39,20 +39,8 @@ chr3  90804782
 The program has many other options:
 
 ```
-J LOH
-Still the one from the block
--
-Extact LOH blocks from SNPs and a reference genome
---------------------------------------------------------------------------------
-by Matteo Schiavinato
-based on the work of Leszek Pryszcz and Veronica Mixao
-DOIs:
-Pryszcz et al., 2014	https://doi.org/10.1093/gbe/evu082
-Mixao et al., 2019		https://doi.org/10.3389/fgene.2019.00383
---------------------------------------------------------------------------------
-
 Usage:
-./jloh --vcf <VCF> --genome-file <GENOME_FILE> [options]
+./jloh --vcf <VCF> --genome-file <GENOME_FILE> --bam <BAM> [options]
 
 [mandatory]
 --vcf               Input VCF file containing all types of variants             [!]
@@ -61,44 +49,40 @@ Usage:
 [optional]
 --sample            Sample name / Strain name for output files                  [loh_blocks]
 --output-dir        Output directory                                            [loh_blocks_out]
---window            Window for heterozygous blocks definition                   [100]
+--t0-vcf            VCF with variants to ignore from --vcf                      [off]
+--min-het-snps      Min. num. of heterozygous SNPs in heterozygous region       [2]
+--snp-distance      Min. distance (bp between SNPs for blocks definition        [100]
 --min-size          Min. LOH block size                                         [100]
---block-distances   Comma-sep. list of desired distances between LOH blocks     [100,1000,5000]
---min-af            Min. allele frequency to consider a variant heterozygous    [off]
---max-af            Max. allele frequency to consider a variant heterozygous    [off]
+--min-af            Min. allele frequency to consider a variant heterozygous    [0.3]
+--max-af            Max. allele frequency to consider a variant heterozygous    [0.7]
 --bam               BAM file (only required when filtering by coverage)         [off]
---min-frac-cov      Min. mean coverage fraction for LOH blocks                  [off]
---max-frac-cov      Max. mean coverage fraction for LOH blocks                  [off]
+--min-frac-cov      Min. mean coverage fraction for LOH blocks                  [0.75]
+                    (used only if --bam specified)
+--max-frac-cov      Max. mean coverage fraction for LOH blocks                  [1.25]
+                    (used only if --bam specified)
+--block-lengths     Comma-sep. list of desired distances between LOH blocks     [100,1000,5000]
 --bedtools          Path to the bedtools executable                             [bedtools]
 --print-info        Show authors and edits with dates                           [off]
 ```
 
 ## Output
 
-The program produces the following output files:
+#### selection of relevant variants
 
-```
-Sc_Sk_t1000.d100bp.bed
-Sc_Sk_t1000.d100bp.complement.bed
-Sc_Sk_t1000.d100bp_provisory.bed
-Sc_Sk_t1000.het_snps.vcf
-Sc_Sk_t1000.homo.d100bp.bed
-Sc_Sk_t1000.homo.d100bp.cov_flt.100bp.bed
-Sc_Sk_t1000.homo.d100bp.cov_flt.1000bp.bed
-Sc_Sk_t1000.homo.d100bp.cov_flt.5000bp.bed
-Sc_Sk_t1000.homo.d100bp.cov_flt.bed
-```
+The program produces many output files. First, the variants passed as input VCF file are compared to those (if any) passed as `--t0-vcf`. If the latter option is used, all variants contained in both VCF files are excluded, as a means to exclude variation preexisting the experiment. If the `--t0-vcf` option is not used, then all variants passed as input are retained. In both cases, the retained variants are saved in a VCF file called `<sample>.relevant.vcf`.
+
+#### selection of heterozygous SNPs
+
+The relevant variants are then filtered, retaining only heterozygous SNPs. Indels and homozygous SNPs are filtered out as they aren't normally used to extract LOH blocks. The selection of heterozygous SNPs is conducted based on their FORMAT field (field number 9 and 10 of a VCF file). The first column of this field carries a series of annotations separated by colons (e.g. GT:AF) the values of which are annotated the same way on the second column (e.g.` 0/1:0.60`). If a SNP is annotated as heterozygous, it will carry a genotype (`GT`) such as `0/1` or `1/2`. It should also have an allele frequency (`AF`) annotation. SNPs are considered heterozygous if their `AF` annotation falls between the values specified with `--min-af` and `--max-af`. The heterozygous SNPs are then saved in a VCF file called `<sample>.het_snps.vcf`.
+
+#### extraction of heterozygosity regions
 
-Of these, the main output file containing all the detected **LOH blocks** is: `Sc_Sk_t1000.homo.d100bp.cov_flt.bed`
+The putative heterozygosity regions are extracted based on the number of heterozygous SNPs they contain, and the maximum distance between these SNPs for them to be considered part of the same region. These parameters are controlled with the `--min-het-snps` and the `--snp-distance` parameters. The default values for these two parameters (`--min-het-snps 2 --snp-distance 100`) are those used in many studies. This step uses **bedtools merge** to create BED intervals that fit these criteria. The number of SNPs is assessed with the `bedtools merge -c 1 -o count` option, while the distance is assessed with the `-d` parameter. This step produces three output files: 1) a file called `<sample>.d<snp_distance>bp_provisory.bed` where sample and snp distance depend on the parameters the used passed; 2) output BED file that contains the length of each interval, called the same way but with a `*w_len.bed` ending; 3) the actual output file, filtered by `--min-het-snps` and `--snp-distance`, called `<sample>.d<snp_distance>bp.bed`.
 
-The other *n* files named similarly are subsets of the main output file, with a different minimum distance between LOH blocks (indicated in the filename). For example, the `Sc_Sk_t1000.homo.d100bp.cov_flt.5000bp.bed` file contains only LOH blocks that are at least 5000 bp apart in the genome.
+#### extract complementary homozygous regions
 
-Note that the filenames depend on the parameters declared: using `--window 100` will result in a file called `*.d<winsize>bp.*`, and using `--block-distances` will result in one file per block distance specified (e.g. 100,1000,5000 in this case).
+The objective of this tool is to extract LOH blocks, which are defined by the *loss* of  heterozygosity. Hence, the tool at this point selects the complementary intervals to those that were labelled as heterozygous. This is done with **bedtools complement**. The output file is called `<sample>.d<snp_distance>bp.complement.bed`. These regions are then filtered by their length, retaining only those with a length larger or equal to `--min-size`. This produces an output file called `<sample>.homo.d<snp_distance>bp.bed`.
 
-Besides these files, other output files are:
+#### filter by coverage
 
-- `Sc_Sk_t1000.het_snps.vcf`: the heterozygous SNPs extracted and used to calculate LOH blocks
-- `Sc_Sk_t1000.d100bp_provisory.bed`: the BED file produced from those SNPs, considering windows of `--window` size, indicated in the file name (`*.d<winsize>bp.*`)
-- `Sc_Sk_t1000.d100bp.bed`: a merged version of this BED file, where overlapping heterozygous regions are merged together
-- `Sc_Sk_t1000.d100bp.complement.bed`: the complement of the merged BED file, obtained with `bedtools complement`
-- `Sc_Sk_t1000.homo.d100bp.bed`: candidate homozygous regions in BED format
+Each region that is considered as a candidate LOH region is screened by coverage using the BAM file passed with `--bam`. First, a mean coverage is computed for the whole BAM file. To do that, JLOH checks if the BAM file is indexed, and if not, it indexes it using the **pysam** module. Then, reads mapping inside each region are extracted using the **pysam** module, and compared against the general mean coverage. Candidate LOH blocks are retained in the output only if they have a coverage *below* or *above* what is considered "normal" coverage. The "normal" coverage range is defined via two parameters: `--min-frac-cov` and `--max-frac-cov`. The defaults of these two parameters (0.75 and 1.25) are used in many studies. The global mean coverage computed from the BAM file is multiplied by these two values to obtained the two boundaries of "normal" coverage. Every candidate LOH block falling above or below these two thresholds will be considered a true LOH block. True LOH blocks are placed in an output file called `<sample>.LOH_blocks.bed`. This is the true output of the program, and one of the only two that are not placed in the `process` folder, where temporary files are stored.
diff --git a/jloh b/jloh
index 9a12c60..c898d25 100755
--- a/jloh
+++ b/jloh
@@ -16,7 +16,7 @@ Last change: 11/05/2018
 v3
 Matteo Schiavinato
 Notes: upgrade to python3, integrated three scripts in one, improved code readability
-Last change: 26/11/2021
+Last change: 02/12/2021
 ###
 """
 
@@ -39,7 +39,7 @@ Still the one from the block
 -
 Extact LOH blocks from SNPs and a reference genome
 -
-v 0.1
+v 0.1.0
 --------------------------------------------------------------------------------
 by Matteo Schiavinato
 based on the work of Leszek Pryszcz and Veronica Mixao
@@ -49,22 +49,23 @@ Mixao et al., 2019		https://doi.org/10.3389/fgene.2019.00383
 --------------------------------------------------------------------------------
 
 Usage:
-./jloh --vcf <VCF> --genome-file <GENOME_FILE> [options]
+./jloh --vcf <VCF> --genome-file <GENOME_FILE> --bam <BAM> [options]
 
 [mandatory]
 --vcf               Input VCF file containing all types of variants             [!]
 --genome-file       File with chromosome lengths (chromosome <TAB> size)        [!]
+--bam               BAM file used to call the --vcf variants                    [!]
 
 [optional]
 --sample            Sample name / Strain name for output files                  [loh_blocks]
 --output-dir        Output directory                                            [loh_blocks_out]
 --t0-vcf            VCF with variants to ignore from --vcf                      [off]
+--min-het-snps      Min. num. of heterozygous SNPs in heterozygous region       [2]
 --snp-distance      Min. distance (bp between SNPs for blocks definition        [100]
 --min-size          Min. LOH block size                                         [100]
 --block-lengths     Comma-sep. list of desired distances between LOH blocks     [100,1000,5000]
 --min-af            Min. allele frequency to consider a variant heterozygous    [0.3]
 --max-af            Max. allele frequency to consider a variant heterozygous    [0.7]
---bam               BAM file (only required when filtering by coverage)         [off]
 --min-frac-cov      Min. mean coverage fraction for LOH blocks                  [0.75]
                     (used only if --bam specified)
 --max-frac-cov      Max. mean coverage fraction for LOH blocks                  [1.25]
@@ -82,6 +83,7 @@ p.add_argument("--genome-file")
 p.add_argument("--sample", default="loh_blocks")
 p.add_argument("--output-dir", default="loh_blocks_out")
 p.add_argument("--t0-vcf", type=str, default=None)
+p.add_argument("--min-het-snps", type=float, default=2)
 p.add_argument("--snp-distance", default=100, type=int)
 p.add_argument("--min-size", default=100, type=int)
 p.add_argument("--block-lengths", default="100,1000,5000", type=str)
@@ -106,6 +108,9 @@ def organize_workspace(output_dir, min_frac_cov, max_frac_cov, bam):
 		os.makedirs(output_dir)
 		sys.stderr.write(f"Created output directory {output_dir}\n")
 
+	if os.path.exists(f"{output_dir}/process") == False:
+		os.mkdir(f"{output_dir}/process")
+
 	# check if bam is used or not
 	if bam and not (min_frac_cov or max_frac_cov):
 		sys.stderr.write("ERROR: If you specify a --bam file you may want to use --min-frac-cov or --max-frac-cov\n")
@@ -118,16 +123,17 @@ def organize_workspace(output_dir, min_frac_cov, max_frac_cov, bam):
 	return (True, output_dir)
 
 
-def select_relevant_variants(output_dir, sample, t0_vcf, bedtools, vcf):
+def select_relevant_variants(output_dir, sample, t0_vcf, bedtools, vcf, filt_vcf_out):
 
 	"""
+	v1
+	Author: Matteo Schiavinato
+	Last change: 02/12/2021
 	Removing variants annotated in the declared t0_vcf file
 	If not declared, all variants are retained
 	Which means that the whole input --vcf is assigned as filt_vcf_out
 	"""
 
-	filt_vcf_out = f"{output_dir}/{sample}.relevant.vcf"
-
 	if t0_vcf != None:
 		sys.stderr.write("Removing T0 variants...\n")
 		cmd = f"{bedtools} intersect -header -u -a {vcf} -b {t0_vcf} > {filt_vcf_out}"
@@ -148,18 +154,16 @@ def select_relevant_variants(output_dir, sample, t0_vcf, bedtools, vcf):
 	return (True, filt_vcf_out, sample)
 
 
-def extract_heterozygous_snps(vcf, min_af, max_af, output_dir, sample):
+def extract_heterozygous_snps(vcf, min_af, max_af, output_dir, sample, het_snps_vcf):
 
 	"""
 	v1
 	Author: Matteo Schiavinato
-	Last change: 22/11/2021
+	Last change: 02/12/2021
 	Input: the VCF file containing SNPs and indels
 	Output: only heterozygous SNPs, including number of kept and removed lines
 	"""
 
-	het_snps_vcf = f"{output_dir}/{sample}.het_snps.vcf"
-
 	# iterate over variants
 	removed, kept, Lines, New_lines = 0, 0, [], []
 
@@ -382,7 +386,7 @@ def filter_by_coverage(bam_file, bed_file, covfr, cmax):
 		#unload bed coordinate
 		chrom, start, end, length = bed.rstrip("\n\b\r").split('\t')
 		start, end = int(start), int(end)
-		#get read count within range of bed file 
+		#get read count within range of bed file
 		c = sam.count(chrom, start, end)
 		#check if correct coverage
 		cov = c *1.0 / (end-start)
@@ -416,27 +420,34 @@ def main():
 	# ---------------------------------------
 	# filter out variants from t0 if required
 	sys.stderr.write("Selecting relevant variants...\n")
+	filt_vcf_out = f"{output_dir}/process/{args.sample}.relevant.vcf"
+
 	(result, filt_vcf_out, sample) = select_relevant_variants(	output_dir,
 																args.sample,
 																args.t0_vcf,
 																args.bedtools,
-																args.vcf)
+																args.vcf,
+																filt_vcf_out)
 
 	# -------------------------------------------
 	# Extract heterozygous snps from raw VCF file
 	sys.stderr.write("Extracting heterozygous SNPs from VCF file...\n")
 	sys.stderr.write(f"Considering only SNPs with {args.min_af} <= AF <= {args.max_af}\n")
+	het_snps_vcf = f"{output_dir}/{sample}.het_snps.vcf"
+
 	(result, het_snps_vcf) = extract_heterozygous_snps(	filt_vcf_out,
 														args.min_af,
 														args.max_af,
 														output_dir,
-														sample)
+														sample,
+														het_snps_vcf)
 
 	# -------------------------------------------
 	# Get heterozygous blocks with bedtools merge
 	# merge bed file
 	sys.stderr.write(f"Defining heterozygous blocks for {sample}...\n")
-	merged_bed_out = f"{output_dir}/{sample}.d{args.snp_distance}bp_provisory.bed"
+	merged_bed_out = f"{output_dir}/process/{sample}.d{args.snp_distance}bp_provisory.bed"
+
 	(result, snp_distance) = merge_bed_intervals(sample,
 												output_dir,
 												args.snp_distance,
@@ -446,14 +457,16 @@ def main():
 
 	# add length to BED file
 	sys.stderr.write(f"Calculating length of BED intervals...\n")
-	bed_with_length = f"{output_dir}/{sample}.d{args.snp_distance}bp_provisory_w_len.bed"
+	bed_with_length = f"{output_dir}/process/{sample}.d{args.snp_distance}bp_provisory_w_len.bed"
+
 	result = add_length(merged_bed_out, bed_with_length)
 
 	# ---------------------------------------------
 	# run filtering function on the merged bed file
 	sys.stderr.write("Keeping only blocks with > 1 SNP...\n")
-	minC, maxC, minL = float(2), float("inf"), float(0)
-	filt_bed = f"{output_dir}/{sample}.d{snp_distance}bp.bed"
+	minC, maxC, minL = float(args.min_het_snps), float("inf"), float(0)
+	filt_bed = f"{output_dir}/process/{sample}.d{snp_distance}bp.bed"
+
 	(result, minC, maxC, minL) = filter_by_snp_density(bed_with_length,
 														filt_bed,
 														output_dir,
@@ -467,7 +480,8 @@ def main():
 	# Obtaining complementary regions to heterozygous blocks
 	# get complementary bed file
 	sys.stderr.write("Getting complementary regions...\n")
-	complement_bed = f"{output_dir}/{sample}.d{snp_distance}bp.complement.bed"
+	complement_bed = f"{output_dir}/process/{sample}.d{snp_distance}bp.complement.bed"
+
 	result = get_complementary_regions(complement_bed,
 										args.bedtools,
 										filt_bed,
@@ -476,74 +490,33 @@ def main():
 	# --------------------------------------------------
 	# Filter the complement to get the homozygous blocks
 	sys.stderr.write("Getting candidate LOH blocks...\n")
-	filt_compl_bed = f"{output_dir}/{sample}.homo.d{snp_distance}bp.bed"
+	filt_compl_bed = f"{output_dir}/process/{sample}.homo.d{snp_distance}bp.bed"
+
 	result = filter_by_length(complement_bed,
 							filt_compl_bed,
 							args.min_size)
 
 	# -----------------------------------------
 	# filter by real coverage from the BAM file
-	if (args.min_frac_cov or args.max_frac_cov) and args.bam:
-
-		# check what values were passed
-		if args.min_frac_cov:
-			min_frac_cov = float(args.min_frac_cov)
-		else:
-			min_frac_cov = float(0)
-
-		if args.max_frac_cov:
-			max_frac_cov = float(args.max_frac_cov)
-		else:
-			max_frac_cov = float("inf")
-
-		# index bam file
-		if os.path.exists(f"{args.bam}.bai") == False:
-			sys.stderr.write("BAM index not found. Indexing BAM file...\n")
-			pysam.index(args.bam)
-
-		# use the filtered bed file as input for next step
-		Final_bed_lines = filter_by_coverage(args.bam, filt_compl_bed, min_frac_cov, max_frac_cov)
-		final_bed_file = f"{output_dir}/{sample}.homo.d{snp_distance}bp.cov_flt.bed"
-
-		# write final bed file lines to an actual bed file output
-		OUT = open(final_bed_file, "w")
-		for line in Final_bed_lines:
-			OUT.write(line)
-		OUT.close()
-
-	else:
-		final_bed_file = f"{output_dir}/{sample}.homo.d{snp_distance}bp.cov_flt.bed"
-		try:
-		    subprocess.run([f"cp {filt_compl_bed} {final_bed_file}"], check=True, shell=True)
-		except subprocess.CalledProcessError:
-			sys.stderr.write("ERROR: file copying didn't work\n")
-			sys.exit(1)
-
-	# --------------------------------------------------------
-	# write output files with specific window range thresholds
-
-	# open input file
-	blocks_file = open(final_bed_file, "r")
-	blocks = blocks_file.readlines()
-	blocks_file.close()
-
-	# iterate over predefined ranges
-	for block_length in [int(i) for i in args.block_lengths.split(",")]:
-
-		sys.stderr.write(f"Generating output file filtered at min. {block_length} range\n")
-
-		# open output file
-		OUTPUT = open(f"{output_dir}/{sample}.homo.d{snp_distance}bp.cov_flt.{block_length}bp.bed", "w")
-
-		# extract corresponding regions
-		for line in blocks:
-			chrom, start, end, cov, length = line.rstrip("\b\r\n").split("\t")
-			length = int(length)
-			if length >= block_length:
-				OUTPUT.write(line)
-
-		# close output file
-		OUTPUT.close()
+	# index bam file
+	if os.path.exists(f"{args.bam}.bai") == False:
+		sys.stderr.write("BAM index not found. Indexing BAM file...\n")
+		pysam.index(args.bam)
+
+	# use the filtered bed file as input for next step
+	Final_bed_lines = filter_by_coverage(	args.bam,
+											filt_compl_bed,
+											args.min_frac_cov,
+											args.max_frac_cov)
+
+	final_bed_file = f"{output_dir}/{sample}.LOH_blocks.bed"
+
+	# write final bed file lines to an actual bed file output
+	OUT = open(final_bed_file, "w")
+	OUT.write("Chromosome\tStart\tEnd\tCoverage\tLength\n")
+	for line in Final_bed_lines:
+		OUT.write(line)
+	OUT.close()
 
 	sys.stderr.write("Done!\n")