The repo is a collection of simple code snippets that I often forget..
- ipdb cheatsheet
- Conda basic commands; conda config with basic R4 + bioconductor
- Enable
%%R
magic by%load_ext rpy2.ipython
; more on rmagic - python requests library
- advanced use
- pagination example with generator
- pytest framework
- Assign a column based on multiple conditions
- chromosome sorting done right:
raw['chr'] = pd.Categorical(raw['chr'], categories=['chr'+str(x) for x in list(range(1,23))+list("XY")], ordered=False)
- TODO: taking care of indices during merge / concat operations
x in pd.Series
always returnsFalse
pd.Series([True, False, np.nan]).astype(bool)
will turn NA to True- MultiIndex handling
adat.index = pd.MultiIndex.from_frame(<pd.dataframe>)
adat.sort_index(level=['relationship','Subject ID']).sort_index(level = ["Dilution"], axis=1)
# reorder a multindex dataframeadat2.reindex_like(adat1)
# reorder adat2 to adat1-matching indices with certain filling logic
- Slicing
- Broadcasting
- GaddyGram from aggregated MAF
- Upset plot comparing calls from different sources
- aws cli settings
- Add y=x to existing figure without changing axis
x = np.linspace(*ax.get_xlim()) ax.plot(x, x)
- Q-Q plot
- centering diverging cmap
- adding hatches to a heatmap/clustermap; note the use of
np.ma
module! - discrete color codes. I am still trying to find an elegant way to annotate colors but this question might be relevant. If to draw a scatterplot with categorized colors from a column, just use
col.map(color_dict)
in whichcolor_dict
is a preset mapping of colors, usually taking from company's visual guide.ilmn = ['#E91207','#ED8B00', '#FFB81C', '#DE1B76', '#7D55C7', '#0077C8'] sns.palplot(sns.color_palette(ilmn)) # show colors plt.scatter(...., c=[sns.color_palette(ilmn)[x] for x in df.colname.map(CMAPDICT)]))
- custom legend
- Extendable quick QC plot on tabular metrics file
- twin axis with seaborn gridplot
- TODO: pairwise alignment dotplot plot for given read name
- Vannila R example for several statistical tests, proficient use of
lapply
families,split
,unnest
,Map
. - R stats functions, different glm designs
- Tidyverse family of libraries (
dplyr
,purrr
, ...) SummerizedExpriments
for efficient organization of assays, other reliableBioconductor
packagesggplot2
family of bioinformatics related plotting.rs.restartR()
to restart R session from console (with proper backup of enviroment variables)
- Cytoband file parser Credit to Julian Hess
- Simulate reads from haplotig with NEAT from give error model
- Several useful commands with awk/sed/grep
- Reheader VCF with another reference fasta index (e.g. different contig names) by
bcftools reheader -f XXX.fasta.fai old.vcf -o new.vcf
- Appending header lines to VCF by
bcftools annotate -h <hdr> -a <annot> -s <sample>
- Change sample name of VCF:
bcftools reheader -samples <samples.list>
, as in the order of original VCF - bcftools cheatsheet
- convert a site/sample level information from a VCF to a table (note, it is not MAF! so be careful abut the position changes) with GATK VariantsToTable, or bcftools query with
-f
. Both ways work to tabulate info/format fields and can be piped toawk
- Matching indel positions between VCF vs MAF
# insertion ms = vs me = ms + 1 # mask the first base for (vref, valt) mref,malt = (-,G) # deletion ms = vs + 1 # mask the first base for (vref, valt) malt = '-' me = ms + len(mref) -1
- bcftools expression syntax
- Working with VCF info flags (binary tag):
- with
bcftools annotate
, explicitly include an additional column for 1/0/.; where as.
denotes keeping existing flag - with
bcftools annotate
, use-m <TAG>
to denote the presence of a TAG - Correct header for a binary flag looks like:
echo -e "##INFO=<ID=pathogenic_intron,Number=0,Type=Flag,Description="Intronic variants that is marked P/LP from clinvar"" > pathogenic_intron.hdr
- Filtering by
bcftools filter -i 'pathogenic_intron=1'
for presence of that tag
- with
bcftools merge
for non-overlapping sample (be careful with FILTER info loss);bcftools concat
for identical order of sample with different regions / variant types (different FORMAT headers will be properly handled too) , however output needs to be sorted- GVCF: The key difference between a regular VCF and a GVCF is that the GVCF has records for all sites, whether there is a variant call there or not. Related GATK Joint calling framework
- Compression (TODO: this part needs more work)
- Compress and index a vcf
bgzip file.vcf # or bcftools view file.vcf -Oz -o file.vcf.gz tabix file.vcf.gz # or bcftools index file.vcf.gz
bcftools convert
vcf/bcf/gvcf-O
: Output compressed BCF (b), uncompressed BCF (u), compressed VCF (z), uncompressed VCF (v). Use the -Ou option when piping between bcftools; subcommands to speed up performance by removing unnecessary compression/decompression and VCF<->BCF conversion. Size comparison
- Compress and index a vcf
- consensus sequence from VCF, usually better to have input vcf normalized (indel left aligned with
bcftools norm --fasta-ref
)samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz > out.fa
- filter hom-ref from a VCF, gives you any site where at least one sample has an alt.
bcftools view -i 'GT[*]="alt"' input.vcf
bcftools +mendelian
plugin option:-l +
will remove any inconsistent records,-d
will only maskGT
field to./.
- make a
bedtools
-compatible genome file byawk -v OFS='\t' {'print $1,$2'} ${reffa}.fai > ${reffa}.genome
- bcftools region option:
-r, --regions <region> restrict to comma-separated list of regions
-R, --regions-file <file> restrict to regions listed in a file
-t, --targets [^]<region> similar to -r but streams rather than index-jumps. Exclude regions with "^" prefix
-T, --targets-file [^]<file> similar to -R but streams rather than index-jumps. Exclude regions with "^" prefix
Yet another difference between the two is that -r checks both start and end positions of indels, whereas -t checks start positions only.
-
Phasing
- Ways to phase variants:
- statistical(popgen) phasing: beagle
- read-backed phasing: whatshap
- pedigree (parental scaffold) phasing: shapeit4??
- Compare two phased VCF with whatshap compare:
And check
whatshap compare --names ped,single\ --tsv-pairwise ${fam_id}_eval.tsv \ --sample ${proband_id} \ --longest-block-tsv ${fam_id}_longest-block.tsv \ <pvcf1> <pvcf2>
phase_agreeing
column fromlongest-block-tsv
. - switch error vs flip error (def=two switch errors in a row):
In this case, A<->B = 1 switch; A<->C = 1 flip; A<->D = 2 switches
A B C D 0|1 0|1 0|1 0|1 0|1 0|1 0|1 0|1 0|1 1|0 1|0 1|0 1|0 0|1 1|0 0|1 1|0 0|1 1|0 1|0
- Ways to phase variants:
-
Structural variants VCF with symbolic ALT (VCF4.3) symbolify ALT allele
Symbolic alternate alleles are described as follows: ##ALT=<ID=type,Description=description> Structural Variants In symbolic alternate alleles for imprecise structural variants, the ID field indicates the type of structural variant, and can be a colon-separated list of types and subtypes. ID values are case sensitive strings and must not contain whitespace or angle brackets. The first level type must be one of the following: DEL Deletion relative to the reference INS Insertion of novel sequence relative to the reference DUP Region of elevated copy number relative to the reference INV Inversion of reference sequence CNV Copy number variable region (may be both deletion and duplication) BND Breakend The CNV category should not be used when a more specific category can be applied. Reserved subtypes include: DUP:TANDEM Tandem duplication DEL:ME Deletion of mobile element relative to the reference INS:ME Insertion of a mobile element relative to the reference
- Explaining read flags
- Filtering by read tag with
samtools view -d TAG:VAL
orsamtools view -d TAG
for binary flag export GCS_OAUTH_TOKEN=$(gcloud auth application-default print-access-token)
so that samtools will streamgs://
- Note that
-r
insamtools view
denotes read group instead of region. Be aware... - samtools mpileup with regrex patterns in python:
- parsing outputs from samtools view, however inflating BAM should generally be avoided
- Max MAPQ with
samtools view ${bam} chr1:20000000-21000000 | awk 'BEGIN{a= 0}{if ($5>0+a) a=$5} END{print a}'
- Mean MAPQ with
samtools view aln.bam chr22:20000000-21000000 | awk '{sum+=$5} END { print "Mean MAPQ =",sum/NR}'
- bioawk functions similar to
bcftools query
- TODO: curl only a subset of a (super-large) BAM by region with BAI
- Get signed URL from GDC UUID
- visualizing Pacbio long reads in IGV
- Note: IGV handles long reads hardclip differently from short reads (appear to be a truncated magneta arrow) and it can be misleading! It makes more sense to make dotplot aligning reads to ref.
- Tricks for IGV for long reads
- write IGV xml from a list of bam paths with
bash /usr/writeIGV.sh ~{reference_version} ~{sep=" " input_files} ~{sep=" " input_names_prefix} > "~{file_name}.xml"
; python equivalent - Starting from IGV 2.13 it will stream s3 buckets from
[default]
profile - Starting from IGV 2.14 it displays 5mC with better color code [TODO; add explanation here], quick visuals can be done with methylation-artist
- TODO: linked-reads display
- Kill Vscode remote on host to emulate log out and log back in, otherwise it still runs as a background process
- Homebrewing without sudo
- Missing link to libcrypto openssl on OSX
- Length for chr1 is
249,250,621
in hg19; and248,956,422
in hg38 - Misc notes on gene annotations: Transcript Biotype and gnomad gene constraints
- HG002 phased Truthsets