gist indices to bioinformatics-related scripts

hurrialice/qutils

This repo is a collection of simple code snippets that I often forget.

General python

Pandas

  • Assign a column based on multiple conditions
  • chromosome sorting done right: raw['chr'] = pd.Categorical(raw['chr'], categories=['chr'+str(x) for x in list(range(1,23))+list("XY")], ordered=False)
  • TODO: taking care of indices during merge / concat operations
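  • x in pd.Series tests membership in the index labels, not the values, so it usually returns False for a value; use x in s.values or s.isin([x]).any() instead
  • pd.Series([True, False, np.nan]).astype(bool) turns NaN into True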
  • MultiIndex handling
    • adat.index = pd.MultiIndex.from_frame(<pd.DataFrame>)
    • adat.sort_index(level=['relationship','Subject ID']).sort_index(level=["Dilution"], axis=1) # reorder a MultiIndex DataFrame
    • adat2.reindex_like(adat1) # reorder adat2 to adat1-matching indices with certain filling logic
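
A minimal sketch of three of the pandas notes above, on made-up toy data (the Series `s` and the frame `raw` with its values are assumptions for illustration only). It shows that `in` on a Series looks at the index, that `astype(bool)` maps NaN to True, and that a Categorical with an explicit category order gives natural chromosome order on sort.

```python
import numpy as np
import pandas as pd

# `in` on a Series tests index labels, not values.
s = pd.Series(["a", "b"], index=[10, 11])
in_index = 10 in s            # 10 is an index label
value_via_in = "a" in s       # "a" is a value, not a label
value_via_values = "a" in s.values

# NaN is truthy, so astype(bool) silently turns it into True.
flags = pd.Series([True, False, np.nan]).astype(bool)

# Chromosome sorting through a categorical with an explicit category order.
chrom_order = ["chr" + str(x) for x in list(range(1, 23)) + list("XY")]
raw = pd.DataFrame({"chr": ["chr10", "chrX", "chr2", "chr1"]})
raw["chr"] = pd.Categorical(raw["chr"], categories=chrom_order, ordered=True)
sorted_chroms = raw.sort_values("chr")["chr"].astype(str).tolist()
```

Here `ordered=True` is used only to make the intent explicit; sorting follows the category order either way.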

Numpy

Matplotlib

General R (to be filled)

  • Vanilla R examples for several statistical tests, proficient use of the lapply family, split, unnest, Map.
  • R stats functions, different glm designs
  • Tidyverse family of libraries (dplyr, purrr, ...)
  • SummarizedExperiment for efficient organization of assays, other reliable Bioconductor packages
  • ggplot2 family of bioinformatics-related plotting
  • .rs.restartR() to restart the R session from the console (with proper backup of environment variables)

NGS manipulation

Variants (MAF/VCF)

  • Several useful commands with awk/sed/grep
  • Reheader VCF with another reference fasta index (e.g. different contig names) by bcftools reheader -f XXX.fasta.fai old.vcf -o new.vcf
  • Appending header lines to VCF by bcftools annotate -h <hdr> -a <annot> -s <sample>
  • Change sample names of a VCF: bcftools reheader --samples <samples.list>, with new names listed in the same order as in the original VCF
  • bcftools cheatsheet
  • Convert site/sample level information from a VCF to a table (note, it is not MAF! so be careful about the position changes) with GATK VariantsToTable, or bcftools query with -f. Both ways work to tabulate INFO/FORMAT fields and can be piped to awk
  • Matching indel positions between VCF vs MAF
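    # insertion, e.g. (vref, valt) = (A, AG)
    ms = vs
    me = ms + 1
    # mask the first (anchor) base of (vref, valt)
    mref, malt = '-', valt[1:]   # (-, G)
    
    # deletion, e.g. (vref, valt) = (ATC, A)
    ms = vs + 1
    # mask the first (anchor) base of (vref, valt)
    mref, malt = vref[1:], '-'   # (TC, -)
    me = ms + len(mref) - 1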
    
  • bcftools expression syntax
  • Working with VCF info flags (binary tag):
    • with bcftools annotate, explicitly include an additional column with 1/0/., where . means keeping the existing flag
    • with bcftools annotate, use -m <TAG> to denote the presence of a TAG
    • Correct header for a binary flag looks like:
      echo '##INFO=<ID=pathogenic_intron,Number=0,Type=Flag,Description="Intronic variants that are marked P/LP in ClinVar">' > pathogenic_intron.hdr
      
    • Filtering by bcftools filter -i 'pathogenic_intron=1' for presence of that tag
  • bcftools merge combines files with non-overlapping samples (be careful with FILTER info loss); bcftools concat combines files with the same samples in the same order but covering different regions / variant types (different FORMAT headers are handled properly too); however, the concat output needs to be sorted
  • GVCF: The key difference between a regular VCF and a GVCF is that the GVCF has records for all sites, whether there is a variant call there or not. Related GATK Joint calling framework
  • Compression (TODO: this part needs more work)
    • Compress and index a vcf
      bgzip file.vcf # or bcftools view file.vcf -Oz -o file.vcf.gz
      tabix file.vcf.gz # or bcftools index file.vcf.gz
      
    • bcftools convert vcf/bcf/gvcf
    • -O: output compressed BCF (b), uncompressed BCF (u), compressed VCF (z), uncompressed VCF (v). Use the -Ou option when piping between bcftools subcommands to speed up performance by removing unnecessary compression/decompression and VCF<->BCF conversion. Size comparison
  • consensus sequence from VCF, usually better to have input vcf normalized (indel left aligned with bcftools norm --fasta-ref)
    samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz > out.fa
    
  • filter hom-ref from a VCF, gives you any site where at least one sample has an alt.
    bcftools view -i 'GT[*]="alt"' input.vcf
    
  • bcftools +mendelian plugin option: -l + will remove any inconsistent records, -d will only mask GT field to ./.
  • make a bedtools-compatible genome file by awk -v OFS='\t' '{print $1,$2}' ${reffa}.fai > ${reffa}.genome
  • bcftools region option:
  -r, --regions <region>              restrict to comma-separated list of regions
  -R, --regions-file <file>           restrict to regions listed in a file
  -t, --targets [^]<region>           similar to -r but streams rather than index-jumps. Exclude regions with "^" prefix
  -T, --targets-file [^]<file>        similar to -R but streams rather than index-jumps. Exclude regions with "^" prefix

Yet another difference between the two is that -r checks both start and end positions of indels, whereas -t checks start positions only.
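
As a toy illustration of that last point, the sketch below models the two matching rules in plain Python. It is not bcftools code: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and a record is assumed to span POS to POS + len(REF) - 1.

```python
def record_span(pos, ref):
    """1-based inclusive span covered by the REF allele of a VCF record."""
    return pos, pos + len(ref) - 1


def hit_like_regions(pos, ref, start, end):
    """Overlap test: match if any REF base falls inside [start, end]."""
    rec_start, rec_end = record_span(pos, ref)
    return rec_start <= end and rec_end >= start


def hit_like_targets(pos, ref, start, end):
    """Start-only test: match only if POS itself falls inside [start, end]."""
    return start <= pos <= end


# A 5-bp deletion (REF=ACGTAC, ALT=A) that starts upstream of 105-110
# but whose deleted bases reach into it.
overlap_hit = hit_like_regions(102, "ACGTAC", 105, 110)
start_only_hit = hit_like_targets(102, "ACGTAC", 105, 110)
```

Under these assumptions the deletion is kept by the overlap rule and dropped by the start-only rule, which is the kind of discrepancy to watch for when switching between -r and -t.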

  • MiniBAM

  • Phasing

    • Ways to phase variants:
      • statistical(popgen) phasing: beagle
      • read-backed phasing: whatshap
      • pedigree (parental scaffold) phasing: shapeit4??
    • Compare two phased VCF with whatshap compare:
      whatshap compare --names ped,single \
        --tsv-pairwise ${fam_id}_eval.tsv \
        --sample ${proband_id} \
        --longest-block-tsv ${fam_id}_longest-block.tsv \
        <pvcf1> <pvcf2>
      
      And check phase_agreeing column from longest-block-tsv.
    • switch error vs flip error (a flip is defined as two switch errors in a row); rows are sites, columns are phasings:
      A   B   C   D
      0|1 0|1 0|1 0|1
      0|1 0|1 0|1 0|1
      0|1 1|0 1|0 1|0
      1|0 0|1 1|0 0|1
      1|0 0|1 1|0 1|0
      
      In this case, A<->B = 1 switch; A<->C = 1 flip; A<->D = 2 switches
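
The A/B/C/D example above can be checked with a small sketch. This is one simple way to count, assuming both phasings list the same heterozygous sites in the same order; the function name is hypothetical and whatshap's own accounting may differ in edge cases.

```python
def switch_and_flip_counts(phasing_a, phasing_b):
    """Count (switches, flips) between two phasings of the same het sites.

    Each phasing is a list of phased genotypes such as "0|1" or "1|0".
    Two switches at adjacent sites are counted as one flip; any other
    switch is counted on its own.
    """
    if len(phasing_a) != len(phasing_b):
        raise ValueError("phasings must cover the same sites")
    agree = [a == b for a, b in zip(phasing_a, phasing_b)]
    # A switch sits between site i-1 and site i when agreement changes.
    switch_sites = [i for i in range(1, len(agree)) if agree[i] != agree[i - 1]]
    switches = flips = 0
    k = 0
    while k < len(switch_sites):
        if k + 1 < len(switch_sites) and switch_sites[k + 1] == switch_sites[k] + 1:
            flips += 1      # two switches in a row
            k += 2
        else:
            switches += 1
            k += 1
    return switches, flips


# Columns of the table above, read top to bottom.
A = ["0|1", "0|1", "0|1", "1|0", "1|0"]
B = ["0|1", "0|1", "1|0", "0|1", "0|1"]
C = ["0|1", "0|1", "1|0", "1|0", "1|0"]
D = ["0|1", "0|1", "1|0", "0|1", "1|0"]
```

Calling `switch_and_flip_counts(A, B)`, `(A, C)` and `(A, D)` gives the (switches, flips) pairs for the three comparisons described above.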
  • Structural variants VCF with symbolic ALT (VCF4.3) symbolify ALT allele

    Symbolic alternate alleles are described as follows:
    
    ##ALT=<ID=type,Description=description>
    
    Structural Variants
    In symbolic alternate alleles for imprecise structural variants, the ID field indicates the type of structural variant, and can be a colon-separated list of types and subtypes. ID values are case sensitive strings and must not contain whitespace or angle brackets. The first level type must be one of the following:
    
    DEL Deletion relative to the reference
    INS Insertion of novel sequence relative to the reference
    DUP Region of elevated copy number relative to the reference
    INV Inversion of reference sequence
    CNV Copy number variable region (may be both deletion and duplication)
    BND Breakend
    The CNV category should not be used when a more specific category can be applied. Reserved subtypes include:
    
    DUP:TANDEM Tandem duplication
    DEL:ME Deletion of mobile element relative to the reference
    INS:ME Insertion of a mobile element relative to the reference
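
Returning to the VCF vs MAF indel position rules listed earlier in this section, here is a minimal Python sketch of the same arithmetic. The function name is hypothetical, and it assumes a simple insertion or deletion whose REF and ALT share the first (anchor) base, as is typical after normalization.

```python
def vcf_indel_to_maf(vpos, vref, valt):
    """Convert a left-anchored VCF indel to MAF-style (start, end, ref, alt)."""
    if vref[0] != valt[0]:
        raise ValueError("expected REF and ALT to share the anchor base")
    if len(vref) == 1 and len(valt) > 1:
        # Insertion: start stays on the anchor base, end is the next base.
        start = vpos
        end = start + 1
        return start, end, "-", valt[1:]
    if len(valt) == 1 and len(vref) > 1:
        # Deletion: start moves past the anchor base, end covers the deleted bases.
        start = vpos + 1
        mref = vref[1:]
        end = start + len(mref) - 1
        return start, end, mref, "-"
    raise ValueError("not a simple insertion or deletion")


insertion = vcf_indel_to_maf(100, "A", "AG")
deletion = vcf_indel_to_maf(100, "ATC", "A")
```

Complex substitutions and multi-allelic records are deliberately rejected here; they would need to be split and normalized first.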
    

Alignments (BAM/CRAM)

  • Explaining read flags
  • Filtering by read tag with samtools view -d TAG:VAL or samtools view -d TAG for binary flag
  • export GCS_OAUTH_TOKEN=$(gcloud auth application-default print-access-token) so that samtools will stream gs://
  • Note that -r in samtools view denotes read group instead of region. Be aware...
  • samtools mpileup with regex patterns in python:
  • parsing outputs from samtools view, however inflating BAM should generally be avoided
    • Max MAPQ with samtools view ${bam} chr1:20000000-21000000 | awk 'BEGIN{a= 0}{if ($5>0+a) a=$5} END{print a}'
    • Mean MAPQ with samtools view aln.bam chr22:20000000-21000000 | awk '{sum+=$5} END { print "Mean MAPQ =",sum/NR}'
    • bioawk functions similar to bcftools query
  • TODO: curl only a subset of a (super-large) BAM by region with BAI
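
For the read flags item at the top of this list, a small sketch of decoding the SAM FLAG bit field in Python. The bit values follow the SAM specification; the labels mirror the names printed by samtools flags, and the helper name `decode_flag` is hypothetical.

```python
# FLAG bits from the SAM specification, lowest bit first.
SAM_FLAG_BITS = {
    0x1: "PAIRED",
    0x2: "PROPER_PAIR",
    0x4: "UNMAP",
    0x8: "MUNMAP",
    0x10: "REVERSE",
    0x20: "MREVERSE",
    0x40: "READ1",
    0x80: "READ2",
    0x100: "SECONDARY",
    0x200: "QCFAIL",
    0x400: "DUP",
    0x800: "SUPPLEMENTARY",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


flag_99 = decode_flag(99)      # 99 = 0x1 + 0x2 + 0x20 + 0x40
flag_1028 = decode_flag(1028)  # 1028 = 0x4 + 0x400
```

The same bits are what samtools view -f / -F filter on, so a decoded list is a quick sanity check before writing a filter.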

IGV

Misc
