Skip to content

Commit

Permalink
Fixed output files and folders; Fixed README
Browse files Browse the repository at this point in the history
  • Loading branch information
MatteoSchiavinato committed Dec 2, 2021
1 parent 9c66199 commit 6282c49
Show file tree
Hide file tree
Showing 2 changed files with 79 additions and 122 deletions.
66 changes: 25 additions & 41 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ And it's ready to go!
The basic usage of the program is as simple as:

```
./jloh --vcf <VCF> --genome-file <GENOME_FILE> [options]
./jloh --vcf <VCF> --genome-file <GENOME_FILE> --bam <BAM> [options]
```

To produce a genome file, simply calculate the length of each sequence in your reference FASTA file and produce a file containing their name + length, structured in a tab-separated format that looks like this:
Expand All @@ -39,20 +39,8 @@ chr3 90804782
The program has many other options:

```
J LOH
Still the one from the block
-
Extact LOH blocks from SNPs and a reference genome
--------------------------------------------------------------------------------
by Matteo Schiavinato
based on the work of Leszek Pryszcz and Veronica Mixao
DOIs:
Pryszcz et al., 2014 https://doi.org/10.1093/gbe/evu082
Mixao et al., 2019 https://doi.org/10.3389/fgene.2019.00383
--------------------------------------------------------------------------------
Usage:
./jloh --vcf <VCF> --genome-file <GENOME_FILE> [options]
./jloh --vcf <VCF> --genome-file <GENOME_FILE> --bam <BAM> [options]
[mandatory]
--vcf Input VCF file containing all types of variants [!]
Expand All @@ -61,44 +49,40 @@ Usage:
[optional]
--sample Sample name / Strain name for output files [loh_blocks]
--output-dir Output directory [loh_blocks_out]
--window Window for heterozygous blocks definition [100]
--t0-vcf VCF with variants to ignore from --vcf [off]
--min-het-snps Min. num. of heterozygous SNPs in heterozygous region [2]
--snp-distance Min. distance (bp between SNPs for blocks definition [100]
--min-size Min. LOH block size [100]
--block-distances Comma-sep. list of desired distances between LOH blocks [100,1000,5000]
--min-af Min. allele frequency to consider a variant heterozygous [off]
--max-af Max. allele frequency to consider a variant heterozygous [off]
--min-af Min. allele frequency to consider a variant heterozygous [0.3]
--max-af Max. allele frequency to consider a variant heterozygous [0.7]
--bam BAM file (only required when filtering by coverage) [off]
--min-frac-cov Min. mean coverage fraction for LOH blocks [off]
--max-frac-cov Max. mean coverage fraction for LOH blocks [off]
--min-frac-cov Min. mean coverage fraction for LOH blocks [0.75]
(used only if --bam specified)
--max-frac-cov Max. mean coverage fraction for LOH blocks [1.25]
(used only if --bam specified)
--block-lengths Comma-sep. list of desired distances between LOH blocks [100,1000,5000]
--bedtools Path to the bedtools executable [bedtools]
--print-info Show authors and edits with dates [off]
```

## Output

The program produces the following output files:
#### selection of relevant variants

```
Sc_Sk_t1000.d100bp.bed
Sc_Sk_t1000.d100bp.complement.bed
Sc_Sk_t1000.d100bp_provisory.bed
Sc_Sk_t1000.het_snps.vcf
Sc_Sk_t1000.homo.d100bp.bed
Sc_Sk_t1000.homo.d100bp.cov_flt.100bp.bed
Sc_Sk_t1000.homo.d100bp.cov_flt.1000bp.bed
Sc_Sk_t1000.homo.d100bp.cov_flt.5000bp.bed
Sc_Sk_t1000.homo.d100bp.cov_flt.bed
```
The program produces many output files. First, the variants passed as input VCF file are compared to those (if any) passed as `--t0-vcf`. If the latter option is used, all variants contained in both VCF files are excluded, as a means to exclude variation preexisting the experiment. If the `--t0-vcf` option is not used, then all variants passed as input are retained. In both cases, the retained variants are saved in a VCF file called `<sample>.relevant.vcf`.

#### selection of heterozygous SNPs

The relevant variants are then filtered, retaining only heterozygous SNPs. Indels and homozygous SNPs are filtered out as they aren't normally used to extract LOH blocks. The selection of heterozygous SNPs is conducted based on their FORMAT field (field number 9 and 10 of a VCF file). The first column of this field carries a series of annotations separated by colons (e.g. GT:AF) the values of which are annotated the same way on the second column (e.g.` 0/1:0.60`). If a SNP is annotated as heterozygous, it will carry a genotype (`GT`) such as `0/1` or `1/2`. It should also have an allele frequency (`AF`) annotation. SNPs are considered heterozygous if their `AF` annotation falls between the values specified with `--min-af` and `--max-af`. The heterozygous SNPs are then saved in a VCF file called `<sample>.het_snps.vcf`.

#### extraction of heterozygosity regions

Of these, the main output file containing all the detected **LOH blocks** is: `Sc_Sk_t1000.homo.d100bp.cov_flt.bed`
The putative heterozygosity regions are extracted based on the number of heterozygous SNPs they contain, and the maximum distance between these SNPs for them to be considered part of the same region. These parameters are controlled with the `--min-het-snps` and the `--snp-distance` parameters. The default values for these two parameters (`--min-het-snps 2 --snp-distance 100`) are those used in many studies. This step uses **bedtools merge** to create BED intervals that fit these criteria. The number of SNPs is assessed with the `bedtools merge -c 1 -o count` option, while the distance is assessed with the `-d` parameter. This step produces three output files: 1) a file called `<sample>.d<snp_distance>bp_provisory.bed` where sample and snp distance depend on the parameters the used passed; 2) output BED file that contains the length of each interval, called the same way but with a `*w_len.bed` ending; 3) the actual output file, filtered by `--min-het-snps` and `--snp-distance`, called `<sample>.d<snp_distance>bp.bed`.

The other *n* files named similarly are subsets of the main output file, with a different minimum distance between LOH blocks (indicated in the filename). For example, the `Sc_Sk_t1000.homo.d100bp.cov_flt.5000bp.bed` file contains only LOH blocks that are at least 5000 bp apart in the genome.
#### extract complementary homozygous regions

Note that the filenames depend on the parameters declared: using `--window 100` will result in a file called `*.d<winsize>bp.*`, and using `--block-distances` will result in one file per block distance specified (e.g. 100,1000,5000 in this case).
The objective of this tool is to extract LOH blocks, which are defined by the *loss* of heterozygosity. Hence, the tool at this point selects the complementary intervals to those that were labelled as heterozygous. This is done with **bedtools complement**. The output file is called `<sample>.d<snp_distance>bp.complement.bed`. These regions are then filtered by their length, retaining only those with a length larger or equal to `--min-size`. This produces an output file called `<sample>.homo.d<snp_distance>bp.bed`.

Besides these files, other output files are:
#### filter by coverage

- `Sc_Sk_t1000.het_snps.vcf`: the heterozygous SNPs extracted and used to calculate LOH blocks
- `Sc_Sk_t1000.d100bp_provisory.bed`: the BED file produced from those SNPs, considering windows of `--window` size, indicated in the file name (`*.d<winsize>bp.*`)
- `Sc_Sk_t1000.d100bp.bed`: a merged version of this BED file, where overlapping heterozygous regions are merged together
- `Sc_Sk_t1000.d100bp.complement.bed`: the complement of the merged BED file, obtained with `bedtools complement`
- `Sc_Sk_t1000.homo.d100bp.bed`: candidate homozygous regions in BED format
Each region that is considered as a candidate LOH region is screened by coverage using the BAM file passed with `--bam`. First, a mean coverage is computed for the whole BAM file. To do that, JLOH checks if the BAM file is indexed, and if not, it indexes it using the **pysam** module. Then, reads mapping inside each region are extracted using the **pysam** module, and compared against the general mean coverage. Candidate LOH blocks are retained in the output only if they have a coverage *below* or *above* what is considered "normal" coverage. The "normal" coverage range is defined via two parameters: `--min-frac-cov` and `--max-frac-cov`. The defaults of these two parameters (0.75 and 1.25) are used in many studies. The global mean coverage computed from the BAM file is multiplied by these two values to obtained the two boundaries of "normal" coverage. Every candidate LOH block falling above or below these two thresholds will be considered a true LOH block. True LOH blocks are placed in an output file called `<sample>.LOH_blocks.bed`. This is the true output of the program, and one of the only two that are not placed in the `process` folder, where temporary files are stored.
135 changes: 54 additions & 81 deletions jloh
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ Last change: 11/05/2018
v3
Matteo Schiavinato
Notes: upgrade to python3, integrated three scripts in one, improved code readability
Last change: 26/11/2021
Last change: 02/12/2021
###
"""

Expand All @@ -39,7 +39,7 @@ Still the one from the block
-
Extact LOH blocks from SNPs and a reference genome
-
v 0.1
v 0.1.0
--------------------------------------------------------------------------------
by Matteo Schiavinato
based on the work of Leszek Pryszcz and Veronica Mixao
Expand All @@ -49,22 +49,23 @@ Mixao et al., 2019 https://doi.org/10.3389/fgene.2019.00383
--------------------------------------------------------------------------------
Usage:
./jloh --vcf <VCF> --genome-file <GENOME_FILE> [options]
./jloh --vcf <VCF> --genome-file <GENOME_FILE> --bam <BAM> [options]
[mandatory]
--vcf Input VCF file containing all types of variants [!]
--genome-file File with chromosome lengths (chromosome <TAB> size) [!]
--bam BAM file used to call the --vcf variants [!]
[optional]
--sample Sample name / Strain name for output files [loh_blocks]
--output-dir Output directory [loh_blocks_out]
--t0-vcf VCF with variants to ignore from --vcf [off]
--min-het-snps Min. num. of heterozygous SNPs in heterozygous region [2]
--snp-distance Min. distance (bp between SNPs for blocks definition [100]
--min-size Min. LOH block size [100]
--block-lengths Comma-sep. list of desired distances between LOH blocks [100,1000,5000]
--min-af Min. allele frequency to consider a variant heterozygous [0.3]
--max-af Max. allele frequency to consider a variant heterozygous [0.7]
--bam BAM file (only required when filtering by coverage) [off]
--min-frac-cov Min. mean coverage fraction for LOH blocks [0.75]
(used only if --bam specified)
--max-frac-cov Max. mean coverage fraction for LOH blocks [1.25]
Expand All @@ -82,6 +83,7 @@ p.add_argument("--genome-file")
p.add_argument("--sample", default="loh_blocks")
p.add_argument("--output-dir", default="loh_blocks_out")
p.add_argument("--t0-vcf", type=str, default=None)
p.add_argument("--min-het-snps", type=float, default=2)
p.add_argument("--snp-distance", default=100, type=int)
p.add_argument("--min-size", default=100, type=int)
p.add_argument("--block-lengths", default="100,1000,5000", type=str)
Expand All @@ -106,6 +108,9 @@ def organize_workspace(output_dir, min_frac_cov, max_frac_cov, bam):
os.makedirs(output_dir)
sys.stderr.write(f"Created output directory {output_dir}\n")

if os.path.exists(f"{output_dir}/process") == False:
os.mkdir(f"{output_dir}/process")

# check if bam is used or not
if bam and not (min_frac_cov or max_frac_cov):
sys.stderr.write("ERROR: If you specify a --bam file you may want to use --min-frac-cov or --max-frac-cov\n")
Expand All @@ -118,16 +123,17 @@ def organize_workspace(output_dir, min_frac_cov, max_frac_cov, bam):
return (True, output_dir)


def select_relevant_variants(output_dir, sample, t0_vcf, bedtools, vcf):
def select_relevant_variants(output_dir, sample, t0_vcf, bedtools, vcf, filt_vcf_out):

"""
v1
Author: Matteo Schiavinato
Last change: 02/12/2021
Removing variants annotated in the declared t0_vcf file
If not declared, all variants are retained
Which means that the whole input --vcf is assigned as filt_vcf_out
"""

filt_vcf_out = f"{output_dir}/{sample}.relevant.vcf"

if t0_vcf != None:
sys.stderr.write("Removing T0 variants...\n")
cmd = f"{bedtools} intersect -header -u -a {vcf} -b {t0_vcf} > {filt_vcf_out}"
Expand All @@ -148,18 +154,16 @@ def select_relevant_variants(output_dir, sample, t0_vcf, bedtools, vcf):
return (True, filt_vcf_out, sample)


def extract_heterozygous_snps(vcf, min_af, max_af, output_dir, sample):
def extract_heterozygous_snps(vcf, min_af, max_af, output_dir, sample, het_snps_vcf):

"""
v1
Author: Matteo Schiavinato
Last change: 22/11/2021
Last change: 02/12/2021
Input: the VCF file containing SNPs and indels
Output: only heterozygous SNPs, including number of kept and removed lines
"""

het_snps_vcf = f"{output_dir}/{sample}.het_snps.vcf"

# iterate over variants
removed, kept, Lines, New_lines = 0, 0, [], []

Expand Down Expand Up @@ -382,7 +386,7 @@ def filter_by_coverage(bam_file, bed_file, covfr, cmax):
#unload bed coordinate
chrom, start, end, length = bed.rstrip("\n\b\r").split('\t')
start, end = int(start), int(end)
#get read count within range of bed file
#get read count within range of bed file
c = sam.count(chrom, start, end)
#check if correct coverage
cov = c *1.0 / (end-start)
Expand Down Expand Up @@ -416,27 +420,34 @@ def main():
# ---------------------------------------
# filter out variants from t0 if required
sys.stderr.write("Selecting relevant variants...\n")
filt_vcf_out = f"{output_dir}/process/{args.sample}.relevant.vcf"

(result, filt_vcf_out, sample) = select_relevant_variants( output_dir,
args.sample,
args.t0_vcf,
args.bedtools,
args.vcf)
args.vcf,
filt_vcf_out)

# -------------------------------------------
# Extract heterozygous snps from raw VCF file
sys.stderr.write("Extracting heterozygous SNPs from VCF file...\n")
sys.stderr.write(f"Considering only SNPs with {args.min_af} <= AF <= {args.max_af}\n")
het_snps_vcf = f"{output_dir}/{sample}.het_snps.vcf"

(result, het_snps_vcf) = extract_heterozygous_snps( filt_vcf_out,
args.min_af,
args.max_af,
output_dir,
sample)
sample,
het_snps_vcf)

# -------------------------------------------
# Get heterozygous blocks with bedtools merge
# merge bed file
sys.stderr.write(f"Defining heterozygous blocks for {sample}...\n")
merged_bed_out = f"{output_dir}/{sample}.d{args.snp_distance}bp_provisory.bed"
merged_bed_out = f"{output_dir}/process/{sample}.d{args.snp_distance}bp_provisory.bed"

(result, snp_distance) = merge_bed_intervals(sample,
output_dir,
args.snp_distance,
Expand All @@ -446,14 +457,16 @@ def main():

# add length to BED file
sys.stderr.write(f"Calculating length of BED intervals...\n")
bed_with_length = f"{output_dir}/{sample}.d{args.snp_distance}bp_provisory_w_len.bed"
bed_with_length = f"{output_dir}/process/{sample}.d{args.snp_distance}bp_provisory_w_len.bed"

result = add_length(merged_bed_out, bed_with_length)

# ---------------------------------------------
# run filtering function on the merged bed file
sys.stderr.write("Keeping only blocks with > 1 SNP...\n")
minC, maxC, minL = float(2), float("inf"), float(0)
filt_bed = f"{output_dir}/{sample}.d{snp_distance}bp.bed"
minC, maxC, minL = float(args.min_het_snps), float("inf"), float(0)
filt_bed = f"{output_dir}/process/{sample}.d{snp_distance}bp.bed"

(result, minC, maxC, minL) = filter_by_snp_density(bed_with_length,
filt_bed,
output_dir,
Expand All @@ -467,7 +480,8 @@ def main():
# Obtaining complementary regions to heterozygous blocks
# get complementary bed file
sys.stderr.write("Getting complementary regions...\n")
complement_bed = f"{output_dir}/{sample}.d{snp_distance}bp.complement.bed"
complement_bed = f"{output_dir}/process/{sample}.d{snp_distance}bp.complement.bed"

result = get_complementary_regions(complement_bed,
args.bedtools,
filt_bed,
Expand All @@ -476,74 +490,33 @@ def main():
# --------------------------------------------------
# Filter the complement to get the homozygous blocks
sys.stderr.write("Getting candidate LOH blocks...\n")
filt_compl_bed = f"{output_dir}/{sample}.homo.d{snp_distance}bp.bed"
filt_compl_bed = f"{output_dir}/process/{sample}.homo.d{snp_distance}bp.bed"

result = filter_by_length(complement_bed,
filt_compl_bed,
args.min_size)

# -----------------------------------------
# filter by real coverage from the BAM file
if (args.min_frac_cov or args.max_frac_cov) and args.bam:

# check what values were passed
if args.min_frac_cov:
min_frac_cov = float(args.min_frac_cov)
else:
min_frac_cov = float(0)

if args.max_frac_cov:
max_frac_cov = float(args.max_frac_cov)
else:
max_frac_cov = float("inf")

# index bam file
if os.path.exists(f"{args.bam}.bai") == False:
sys.stderr.write("BAM index not found. Indexing BAM file...\n")
pysam.index(args.bam)

# use the filtered bed file as input for next step
Final_bed_lines = filter_by_coverage(args.bam, filt_compl_bed, min_frac_cov, max_frac_cov)
final_bed_file = f"{output_dir}/{sample}.homo.d{snp_distance}bp.cov_flt.bed"

# write final bed file lines to an actual bed file output
OUT = open(final_bed_file, "w")
for line in Final_bed_lines:
OUT.write(line)
OUT.close()

else:
final_bed_file = f"{output_dir}/{sample}.homo.d{snp_distance}bp.cov_flt.bed"
try:
subprocess.run([f"cp {filt_compl_bed} {final_bed_file}"], check=True, shell=True)
except subprocess.CalledProcessError:
sys.stderr.write("ERROR: file copying didn't work\n")
sys.exit(1)

# --------------------------------------------------------
# write output files with specific window range thresholds

# open input file
blocks_file = open(final_bed_file, "r")
blocks = blocks_file.readlines()
blocks_file.close()

# iterate over predefined ranges
for block_length in [int(i) for i in args.block_lengths.split(",")]:

sys.stderr.write(f"Generating output file filtered at min. {block_length} range\n")

# open output file
OUTPUT = open(f"{output_dir}/{sample}.homo.d{snp_distance}bp.cov_flt.{block_length}bp.bed", "w")

# extract corresponding regions
for line in blocks:
chrom, start, end, cov, length = line.rstrip("\b\r\n").split("\t")
length = int(length)
if length >= block_length:
OUTPUT.write(line)

# close output file
OUTPUT.close()
# index bam file
if os.path.exists(f"{args.bam}.bai") == False:
sys.stderr.write("BAM index not found. Indexing BAM file...\n")
pysam.index(args.bam)

# use the filtered bed file as input for next step
Final_bed_lines = filter_by_coverage( args.bam,
filt_compl_bed,
args.min_frac_cov,
args.max_frac_cov)

final_bed_file = f"{output_dir}/{sample}.LOH_blocks.bed"

# write final bed file lines to an actual bed file output
OUT = open(final_bed_file, "w")
OUT.write("Chromosome\tStart\tEnd\tCoverage\tLength\n")
for line in Final_bed_lines:
OUT.write(line)
OUT.close()

sys.stderr.write("Done!\n")

Expand Down

0 comments on commit 6282c49

Please sign in to comment.