From 6282c49c9e9347e3eab15725b717fe33b6b997a2 Mon Sep 17 00:00:00 2001 From: Matteo Schiavinato Date: Thu, 2 Dec 2021 13:17:01 +0100 Subject: [PATCH] Fixed output files and folders; Fixed README --- README.md | 66 ++++++++++---------------- jloh | 135 ++++++++++++++++++++++-------------------------------- 2 files changed, 79 insertions(+), 122 deletions(-) diff --git a/README.md b/README.md index f6ce1c6..06f36ef 100644 --- a/README.md +++ b/README.md @@ -24,7 +24,7 @@ And it's ready to go! The basic usage of the program is as simple as: ``` -./jloh --vcf --genome-file [options] +./jloh --vcf --genome-file --bam [options] ``` To produce a genome file, simply calculate the length of each sequence in your reference FASTA file and produce a file containing their name + length, structured in a tab-separated format that looks like this: @@ -39,20 +39,8 @@ chr3 90804782 The program has many other options: ``` -J LOH -Still the one from the block -- -Extact LOH blocks from SNPs and a reference genome --------------------------------------------------------------------------------- -by Matteo Schiavinato -based on the work of Leszek Pryszcz and Veronica Mixao -DOIs: -Pryszcz et al., 2014 https://doi.org/10.1093/gbe/evu082 -Mixao et al., 2019 https://doi.org/10.3389/fgene.2019.00383 --------------------------------------------------------------------------------- - Usage: -./jloh --vcf --genome-file [options] +./jloh --vcf --genome-file --bam [options] [mandatory] --vcf Input VCF file containing all types of variants [!] @@ -61,44 +49,40 @@ Usage: [optional] --sample Sample name / Strain name for output files [loh_blocks] --output-dir Output directory [loh_blocks_out] ---window Window for heterozygous blocks definition [100] +--t0-vcf VCF with variants to ignore from --vcf [off] +--min-het-snps Min. num. of heterozygous SNPs in heterozygous region [2] +--snp-distance Min. distance (bp between SNPs for blocks definition [100] --min-size Min. LOH block size [100] ---block-distances Comma-sep. list of desired distances between LOH blocks [100,1000,5000] ---min-af Min. allele frequency to consider a variant heterozygous [off] ---max-af Max. allele frequency to consider a variant heterozygous [off] +--min-af Min. allele frequency to consider a variant heterozygous [0.3] +--max-af Max. allele frequency to consider a variant heterozygous [0.7] --bam BAM file (only required when filtering by coverage) [off] ---min-frac-cov Min. mean coverage fraction for LOH blocks [off] ---max-frac-cov Max. mean coverage fraction for LOH blocks [off] +--min-frac-cov Min. mean coverage fraction for LOH blocks [0.75] + (used only if --bam specified) +--max-frac-cov Max. mean coverage fraction for LOH blocks [1.25] + (used only if --bam specified) +--block-lengths Comma-sep. list of desired distances between LOH blocks [100,1000,5000] --bedtools Path to the bedtools executable [bedtools] --print-info Show authors and edits with dates [off] ``` ## Output -The program produces the following output files: +#### selection of relevant variants -``` -Sc_Sk_t1000.d100bp.bed -Sc_Sk_t1000.d100bp.complement.bed -Sc_Sk_t1000.d100bp_provisory.bed -Sc_Sk_t1000.het_snps.vcf -Sc_Sk_t1000.homo.d100bp.bed -Sc_Sk_t1000.homo.d100bp.cov_flt.100bp.bed -Sc_Sk_t1000.homo.d100bp.cov_flt.1000bp.bed -Sc_Sk_t1000.homo.d100bp.cov_flt.5000bp.bed -Sc_Sk_t1000.homo.d100bp.cov_flt.bed -``` +The program produces many output files. First, the variants passed as input VCF file are compared to those (if any) passed as `--t0-vcf`. If the latter option is used, all variants contained in both VCF files are excluded, as a means to exclude variation preexisting the experiment. If the `--t0-vcf` option is not used, then all variants passed as input are retained. In both cases, the retained variants are saved in a VCF file called `.relevant.vcf`. + +#### selection of heterozygous SNPs + +The relevant variants are then filtered, retaining only heterozygous SNPs. Indels and homozygous SNPs are filtered out as they aren't normally used to extract LOH blocks. The selection of heterozygous SNPs is conducted based on their FORMAT field (field number 9 and 10 of a VCF file). The first column of this field carries a series of annotations separated by colons (e.g. GT:AF) the values of which are annotated the same way on the second column (e.g.` 0/1:0.60`). If a SNP is annotated as heterozygous, it will carry a genotype (`GT`) such as `0/1` or `1/2`. It should also have an allele frequency (`AF`) annotation. SNPs are considered heterozygous if their `AF` annotation falls between the values specified with `--min-af` and `--max-af`. The heterozygous SNPs are then saved in a VCF file called `.het_snps.vcf`. + +#### extraction of heterozygosity regions -Of these, the main output file containing all the detected **LOH blocks** is: `Sc_Sk_t1000.homo.d100bp.cov_flt.bed` +The putative heterozygosity regions are extracted based on the number of heterozygous SNPs they contain, and the maximum distance between these SNPs for them to be considered part of the same region. These parameters are controlled with the `--min-het-snps` and the `--snp-distance` parameters. The default values for these two parameters (`--min-het-snps 2 --snp-distance 100`) are those used in many studies. This step uses **bedtools merge** to create BED intervals that fit these criteria. The number of SNPs is assessed with the `bedtools merge -c 1 -o count` option, while the distance is assessed with the `-d` parameter. This step produces three output files: 1) a file called `.dbp_provisory.bed` where sample and snp distance depend on the parameters the used passed; 2) output BED file that contains the length of each interval, called the same way but with a `*w_len.bed` ending; 3) the actual output file, filtered by `--min-het-snps` and `--snp-distance`, called `.dbp.bed`. -The other *n* files named similarly are subsets of the main output file, with a different minimum distance between LOH blocks (indicated in the filename). For example, the `Sc_Sk_t1000.homo.d100bp.cov_flt.5000bp.bed` file contains only LOH blocks that are at least 5000 bp apart in the genome. +#### extract complementary homozygous regions -Note that the filenames depend on the parameters declared: using `--window 100` will result in a file called `*.dbp.*`, and using `--block-distances` will result in one file per block distance specified (e.g. 100,1000,5000 in this case). +The objective of this tool is to extract LOH blocks, which are defined by the *loss* of heterozygosity. Hence, the tool at this point selects the complementary intervals to those that were labelled as heterozygous. This is done with **bedtools complement**. The output file is called `.dbp.complement.bed`. These regions are then filtered by their length, retaining only those with a length larger or equal to `--min-size`. This produces an output file called `.homo.dbp.bed`. -Besides these files, other output files are: +#### filter by coverage -- `Sc_Sk_t1000.het_snps.vcf`: the heterozygous SNPs extracted and used to calculate LOH blocks -- `Sc_Sk_t1000.d100bp_provisory.bed`: the BED file produced from those SNPs, considering windows of `--window` size, indicated in the file name (`*.dbp.*`) -- `Sc_Sk_t1000.d100bp.bed`: a merged version of this BED file, where overlapping heterozygous regions are merged together -- `Sc_Sk_t1000.d100bp.complement.bed`: the complement of the merged BED file, obtained with `bedtools complement` -- `Sc_Sk_t1000.homo.d100bp.bed`: candidate homozygous regions in BED format +Each region that is considered as a candidate LOH region is screened by coverage using the BAM file passed with `--bam`. First, a mean coverage is computed for the whole BAM file. To do that, JLOH checks if the BAM file is indexed, and if not, it indexes it using the **pysam** module. Then, reads mapping inside each region are extracted using the **pysam** module, and compared against the general mean coverage. Candidate LOH blocks are retained in the output only if they have a coverage *below* or *above* what is considered "normal" coverage. The "normal" coverage range is defined via two parameters: `--min-frac-cov` and `--max-frac-cov`. The defaults of these two parameters (0.75 and 1.25) are used in many studies. The global mean coverage computed from the BAM file is multiplied by these two values to obtained the two boundaries of "normal" coverage. Every candidate LOH block falling above or below these two thresholds will be considered a true LOH block. True LOH blocks are placed in an output file called `.LOH_blocks.bed`. This is the true output of the program, and one of the only two that are not placed in the `process` folder, where temporary files are stored. diff --git a/jloh b/jloh index 9a12c60..c898d25 100755 --- a/jloh +++ b/jloh @@ -16,7 +16,7 @@ Last change: 11/05/2018 v3 Matteo Schiavinato Notes: upgrade to python3, integrated three scripts in one, improved code readability -Last change: 26/11/2021 +Last change: 02/12/2021 ### """ @@ -39,7 +39,7 @@ Still the one from the block - Extact LOH blocks from SNPs and a reference genome - -v 0.1 +v 0.1.0 -------------------------------------------------------------------------------- by Matteo Schiavinato based on the work of Leszek Pryszcz and Veronica Mixao @@ -49,22 +49,23 @@ Mixao et al., 2019 https://doi.org/10.3389/fgene.2019.00383 -------------------------------------------------------------------------------- Usage: -./jloh --vcf --genome-file [options] +./jloh --vcf --genome-file --bam [options] [mandatory] --vcf Input VCF file containing all types of variants [!] --genome-file File with chromosome lengths (chromosome size) [!] +--bam BAM file used to call the --vcf variants [!] [optional] --sample Sample name / Strain name for output files [loh_blocks] --output-dir Output directory [loh_blocks_out] --t0-vcf VCF with variants to ignore from --vcf [off] +--min-het-snps Min. num. of heterozygous SNPs in heterozygous region [2] --snp-distance Min. distance (bp between SNPs for blocks definition [100] --min-size Min. LOH block size [100] --block-lengths Comma-sep. list of desired distances between LOH blocks [100,1000,5000] --min-af Min. allele frequency to consider a variant heterozygous [0.3] --max-af Max. allele frequency to consider a variant heterozygous [0.7] ---bam BAM file (only required when filtering by coverage) [off] --min-frac-cov Min. mean coverage fraction for LOH blocks [0.75] (used only if --bam specified) --max-frac-cov Max. mean coverage fraction for LOH blocks [1.25] @@ -82,6 +83,7 @@ p.add_argument("--genome-file") p.add_argument("--sample", default="loh_blocks") p.add_argument("--output-dir", default="loh_blocks_out") p.add_argument("--t0-vcf", type=str, default=None) +p.add_argument("--min-het-snps", type=float, default=2) p.add_argument("--snp-distance", default=100, type=int) p.add_argument("--min-size", default=100, type=int) p.add_argument("--block-lengths", default="100,1000,5000", type=str) @@ -106,6 +108,9 @@ def organize_workspace(output_dir, min_frac_cov, max_frac_cov, bam): os.makedirs(output_dir) sys.stderr.write(f"Created output directory {output_dir}\n") + if os.path.exists(f"{output_dir}/process") == False: + os.mkdir(f"{output_dir}/process") + # check if bam is used or not if bam and not (min_frac_cov or max_frac_cov): sys.stderr.write("ERROR: If you specify a --bam file you may want to use --min-frac-cov or --max-frac-cov\n") @@ -118,16 +123,17 @@ def organize_workspace(output_dir, min_frac_cov, max_frac_cov, bam): return (True, output_dir) -def select_relevant_variants(output_dir, sample, t0_vcf, bedtools, vcf): +def select_relevant_variants(output_dir, sample, t0_vcf, bedtools, vcf, filt_vcf_out): """ + v1 + Author: Matteo Schiavinato + Last change: 02/12/2021 Removing variants annotated in the declared t0_vcf file If not declared, all variants are retained Which means that the whole input --vcf is assigned as filt_vcf_out """ - filt_vcf_out = f"{output_dir}/{sample}.relevant.vcf" - if t0_vcf != None: sys.stderr.write("Removing T0 variants...\n") cmd = f"{bedtools} intersect -header -u -a {vcf} -b {t0_vcf} > {filt_vcf_out}" @@ -148,18 +154,16 @@ def select_relevant_variants(output_dir, sample, t0_vcf, bedtools, vcf): return (True, filt_vcf_out, sample) -def extract_heterozygous_snps(vcf, min_af, max_af, output_dir, sample): +def extract_heterozygous_snps(vcf, min_af, max_af, output_dir, sample, het_snps_vcf): """ v1 Author: Matteo Schiavinato - Last change: 22/11/2021 + Last change: 02/12/2021 Input: the VCF file containing SNPs and indels Output: only heterozygous SNPs, including number of kept and removed lines """ - het_snps_vcf = f"{output_dir}/{sample}.het_snps.vcf" - # iterate over variants removed, kept, Lines, New_lines = 0, 0, [], [] @@ -382,7 +386,7 @@ def filter_by_coverage(bam_file, bed_file, covfr, cmax): #unload bed coordinate chrom, start, end, length = bed.rstrip("\n\b\r").split('\t') start, end = int(start), int(end) - #get read count within range of bed file + #get read count within range of bed file c = sam.count(chrom, start, end) #check if correct coverage cov = c *1.0 / (end-start) @@ -416,27 +420,34 @@ def main(): # --------------------------------------- # filter out variants from t0 if required sys.stderr.write("Selecting relevant variants...\n") + filt_vcf_out = f"{output_dir}/process/{args.sample}.relevant.vcf" + (result, filt_vcf_out, sample) = select_relevant_variants( output_dir, args.sample, args.t0_vcf, args.bedtools, - args.vcf) + args.vcf, + filt_vcf_out) # ------------------------------------------- # Extract heterozygous snps from raw VCF file sys.stderr.write("Extracting heterozygous SNPs from VCF file...\n") sys.stderr.write(f"Considering only SNPs with {args.min_af} <= AF <= {args.max_af}\n") + het_snps_vcf = f"{output_dir}/{sample}.het_snps.vcf" + (result, het_snps_vcf) = extract_heterozygous_snps( filt_vcf_out, args.min_af, args.max_af, output_dir, - sample) + sample, + het_snps_vcf) # ------------------------------------------- # Get heterozygous blocks with bedtools merge # merge bed file sys.stderr.write(f"Defining heterozygous blocks for {sample}...\n") - merged_bed_out = f"{output_dir}/{sample}.d{args.snp_distance}bp_provisory.bed" + merged_bed_out = f"{output_dir}/process/{sample}.d{args.snp_distance}bp_provisory.bed" + (result, snp_distance) = merge_bed_intervals(sample, output_dir, args.snp_distance, @@ -446,14 +457,16 @@ def main(): # add length to BED file sys.stderr.write(f"Calculating length of BED intervals...\n") - bed_with_length = f"{output_dir}/{sample}.d{args.snp_distance}bp_provisory_w_len.bed" + bed_with_length = f"{output_dir}/process/{sample}.d{args.snp_distance}bp_provisory_w_len.bed" + result = add_length(merged_bed_out, bed_with_length) # --------------------------------------------- # run filtering function on the merged bed file sys.stderr.write("Keeping only blocks with > 1 SNP...\n") - minC, maxC, minL = float(2), float("inf"), float(0) - filt_bed = f"{output_dir}/{sample}.d{snp_distance}bp.bed" + minC, maxC, minL = float(args.min_het_snps), float("inf"), float(0) + filt_bed = f"{output_dir}/process/{sample}.d{snp_distance}bp.bed" + (result, minC, maxC, minL) = filter_by_snp_density(bed_with_length, filt_bed, output_dir, @@ -467,7 +480,8 @@ def main(): # Obtaining complementary regions to heterozygous blocks # get complementary bed file sys.stderr.write("Getting complementary regions...\n") - complement_bed = f"{output_dir}/{sample}.d{snp_distance}bp.complement.bed" + complement_bed = f"{output_dir}/process/{sample}.d{snp_distance}bp.complement.bed" + result = get_complementary_regions(complement_bed, args.bedtools, filt_bed, @@ -476,74 +490,33 @@ def main(): # -------------------------------------------------- # Filter the complement to get the homozygous blocks sys.stderr.write("Getting candidate LOH blocks...\n") - filt_compl_bed = f"{output_dir}/{sample}.homo.d{snp_distance}bp.bed" + filt_compl_bed = f"{output_dir}/process/{sample}.homo.d{snp_distance}bp.bed" + result = filter_by_length(complement_bed, filt_compl_bed, args.min_size) # ----------------------------------------- # filter by real coverage from the BAM file - if (args.min_frac_cov or args.max_frac_cov) and args.bam: - - # check what values were passed - if args.min_frac_cov: - min_frac_cov = float(args.min_frac_cov) - else: - min_frac_cov = float(0) - - if args.max_frac_cov: - max_frac_cov = float(args.max_frac_cov) - else: - max_frac_cov = float("inf") - - # index bam file - if os.path.exists(f"{args.bam}.bai") == False: - sys.stderr.write("BAM index not found. Indexing BAM file...\n") - pysam.index(args.bam) - - # use the filtered bed file as input for next step - Final_bed_lines = filter_by_coverage(args.bam, filt_compl_bed, min_frac_cov, max_frac_cov) - final_bed_file = f"{output_dir}/{sample}.homo.d{snp_distance}bp.cov_flt.bed" - - # write final bed file lines to an actual bed file output - OUT = open(final_bed_file, "w") - for line in Final_bed_lines: - OUT.write(line) - OUT.close() - - else: - final_bed_file = f"{output_dir}/{sample}.homo.d{snp_distance}bp.cov_flt.bed" - try: - subprocess.run([f"cp {filt_compl_bed} {final_bed_file}"], check=True, shell=True) - except subprocess.CalledProcessError: - sys.stderr.write("ERROR: file copying didn't work\n") - sys.exit(1) - - # -------------------------------------------------------- - # write output files with specific window range thresholds - - # open input file - blocks_file = open(final_bed_file, "r") - blocks = blocks_file.readlines() - blocks_file.close() - - # iterate over predefined ranges - for block_length in [int(i) for i in args.block_lengths.split(",")]: - - sys.stderr.write(f"Generating output file filtered at min. {block_length} range\n") - - # open output file - OUTPUT = open(f"{output_dir}/{sample}.homo.d{snp_distance}bp.cov_flt.{block_length}bp.bed", "w") - - # extract corresponding regions - for line in blocks: - chrom, start, end, cov, length = line.rstrip("\b\r\n").split("\t") - length = int(length) - if length >= block_length: - OUTPUT.write(line) - - # close output file - OUTPUT.close() + # index bam file + if os.path.exists(f"{args.bam}.bai") == False: + sys.stderr.write("BAM index not found. Indexing BAM file...\n") + pysam.index(args.bam) + + # use the filtered bed file as input for next step + Final_bed_lines = filter_by_coverage( args.bam, + filt_compl_bed, + args.min_frac_cov, + args.max_frac_cov) + + final_bed_file = f"{output_dir}/{sample}.LOH_blocks.bed" + + # write final bed file lines to an actual bed file output + OUT = open(final_bed_file, "w") + OUT.write("Chromosome\tStart\tEnd\tCoverage\tLength\n") + for line in Final_bed_lines: + OUT.write(line) + OUT.close() sys.stderr.write("Done!\n")