GitHub - hurrialice/qutils: gist indices to bioinformatics-related scripts

The repo is a collection of simple code snippets that I often forget..

General python

ipdb cheatsheet
Conda basic commands; conda config with basic R4 + bioconductor
Enable %%R magic by %load_ext rpy2.ipython; more on rmagic
- Useful Bioconductor utilities 😪
python requests library
- advanced use
- pagination example with generator
pytest framework

Pandas

Assign a column based on multiple conditions
chromosome sorting done right: raw['chr'] = pd.Categorical(raw['chr'], categories=['chr'+str(x) for x in list(range(1,23))+list("XY")], ordered=False)
TODO: taking care of indices during merge / concat operations
x in pd.Series always returns False
pd.Series([True, False, np.nan]).astype(bool) will turn NA to True
MultiIndex handling
- adat.index = pd.MultiIndex.from_frame(<pd.dataframe>)
- adat.sort_index(level=['relationship','Subject ID']).sort_index(level = ["Dilution"], axis=1) # reorder a multindex dataframe
- adat2.reindex_like(adat1) # reorder adat2 to adat1-matching indices with certain filling logic

Numpy

Matplotlib

GaddyGram from aggregated MAF
Upset plot comparing calls from different sources
aws cli settings
Add y=x to existing figure without changing axis
```
x = np.linspace(*ax.get_xlim())
ax.plot(x, x)
```
Q-Q plot
centering diverging cmap
adding hatches to a heatmap/clustermap; note the use of np.ma module!
discrete color codes. I am still trying to find an elegant way to annotate colors but this question might be relevant. If to draw a scatterplot with categorized colors from a column, just use col.map(color_dict) in which color_dict is a preset mapping of colors, usually taking from company's visual guide.
```
ilmn = ['#E91207','#ED8B00', '#FFB81C', '#DE1B76', '#7D55C7', '#0077C8']
sns.palplot(sns.color_palette(ilmn)) # show colors
plt.scatter(...., c=[sns.color_palette(ilmn)[x] for x in df.colname.map(CMAPDICT)]))
```
custom legend
Extendable quick QC plot on tabular metrics file
twin axis with seaborn gridplot
TODO: pairwise alignment dotplot plot for given read name

General R (to be filled)

Vannila R example for several statistical tests, proficient use of lapply families, split, unnest, Map.
R stats functions, different glm designs
Tidyverse family of libraries (dplyr, purrr, ...)
SummerizedExpriments for efficient organization of assays, other reliable Bioconductor packages
ggplot2 family of bioinformatics related plotting
.rs.restartR() to restart R session from console (with proper backup of enviroment variables)

NGS manipulation

Cytoband file parser Credit to Julian Hess
Simulate reads from haplotig with NEAT from give error model

Variants (MAF/VCF)

Several useful commands with awk/sed/grep
Reheader VCF with another reference fasta index (e.g. different contig names) by bcftools reheader -f XXX.fasta.fai old.vcf -o new.vcf
Appending header lines to VCF by bcftools annotate -h <hdr> -a <annot> -s <sample>
Change sample name of VCF: bcftools reheader -samples <samples.list>, as in the order of original VCF
bcftools cheatsheet
convert a site/sample level information from a VCF to a table (note, it is not MAF! so be careful abut the position changes) with GATK VariantsToTable, or bcftools query with -f. Both ways work to tabulate info/format fields and can be piped to awk

Matching indel positions between VCF vs MAF

# insertion
ms = vs
me = ms + 1
# mask the first base for (vref, valt)
mref,malt = (-,G)

# deletion
ms = vs + 1
# mask the first base for (vref, valt)
malt = '-'
me = ms + len(mref) -1

bcftools expression syntax
Working with VCF info flags (binary tag):
- with bcftools annotate, explicitly include an additional column for 1/0/.; where as . denotes keeping existing flag
- with bcftools annotate, use -m <TAG> to denote the presence of a TAG
- Correct header for a binary flag looks like:
```
echo -e "##INFO=<ID=pathogenic_intron,Number=0,Type=Flag,Description="Intronic variants that is marked P/LP from clinvar"" > pathogenic_intron.hdr
```
- Filtering by bcftools filter -i 'pathogenic_intron=1' for presence of that tag
bcftools merge for non-overlapping sample (be careful with FILTER info loss); bcftools concat for identical order of sample with different regions / variant types (different FORMAT headers will be properly handled too) , however output needs to be sorted
GVCF: The key difference between a regular VCF and a GVCF is that the GVCF has records for all sites, whether there is a variant call there or not. Related GATK Joint calling framework
Compression (TODO: this part needs more work)
- Compress and index a vcf
```
bgzip file.vcf # or bcftools view file.vcf -Oz -o file.vcf.gz
tabix file.vcf.gz # or bcftools index file.vcf.gz
```
- bcftools convert vcf/bcf/gvcf
- -O: Output compressed BCF (b), uncompressed BCF (u), compressed VCF (z), uncompressed VCF (v). Use the -Ou option when piping between bcftools; subcommands to speed up performance by removing unnecessary compression/decompression and VCF<->BCF conversion. Size comparison
consensus sequence from VCF, usually better to have input vcf normalized (indel left aligned with bcftools norm --fasta-ref)
```
samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz > out.fa
```
filter hom-ref from a VCF, gives you any site where at least one sample has an alt.
```
bcftools view -i 'GT[*]="alt"' input.vcf
```
bcftools +mendelian plugin option: -l + will remove any inconsistent records, -d will only mask GT field to ./.
make a bedtools-compatible genome file by awk -v OFS='\t' {'print $1,$2'} ${reffa}.fai > ${reffa}.genome
bcftools region option:

  -r, --regions <region>              restrict to comma-separated list of regions
  -R, --regions-file <file>           restrict to regions listed in a file
  -t, --targets [^]<region>           similar to -r but streams rather than index-jumps. Exclude regions with "^" prefix
  -T, --targets-file [^]<file>        similar to -R but streams rather than index-jumps. Exclude regions with "^" prefix

Yet another difference between the two is that -r checks both start and end positions of indels, whereas -t checks start positions only.

MiniBAM
Phasing
- Ways to phase variants:
  - statistical(popgen) phasing: beagle
  - read-backed phasing: whatshap
  - pedigree (parental scaffold) phasing: shapeit4??
- Compare two phased VCF with whatshap compare:
```
whatshap compare --names ped,single\
  --tsv-pairwise ${fam_id}_eval.tsv \
  --sample ${proband_id} \
  --longest-block-tsv ${fam_id}_longest-block.tsv \
  <pvcf1> <pvcf2>
```
  And check phase_agreeing column from longest-block-tsv.
- switch error vs flip error (def=two switch errors in a row):
```
A   B   C   D
0|1 0|1 0|1 0|1
0|1 0|1 0|1 0|1
0|1 1|0 1|0 1|0
1|0 0|1 1|0 0|1
1|0 0|1 1|0 1|0
```
  In this case, A<->B = 1 switch; A<->C = 1 flip; A<->D = 2 switches

Structural variants VCF with symbolic ALT (VCF4.3) symbolify ALT allele

Symbolic alternate alleles are described as follows:

##ALT=<ID=type,Description=description>

Structural Variants
In symbolic alternate alleles for imprecise structural variants, the ID field indicates the type of structural variant, and can be a colon-separated list of types and subtypes. ID values are case sensitive strings and must not contain whitespace or angle brackets. The first level type must be one of the following:

DEL Deletion relative to the reference
INS Insertion of novel sequence relative to the reference
DUP Region of elevated copy number relative to the reference
INV Inversion of reference sequence
CNV Copy number variable region (may be both deletion and duplication)
BND Breakend
The CNV category should not be used when a more specific category can be applied. Reserved subtypes include:

DUP:TANDEM Tandem duplication
DEL:ME Deletion of mobile element relative to the reference
INS:ME Insertion of a mobile element relative to the reference

Alignments (BAM/CRAM)

Explaining read flags
Filtering by read tag with samtools view -d TAG:VAL or samtools view -d TAG for binary flag
export GCS_OAUTH_TOKEN=$(gcloud auth application-default print-access-token) so that samtools will stream gs://
Note that -r in samtools view denotes read group instead of region. Be aware...
samtools mpileup with regrex patterns in python:
parsing outputs from samtools view, however inflating BAM should generally be avoided
- Max MAPQ with samtools view ${bam} chr1:20000000-21000000 | awk 'BEGIN{a= 0}{if ($5>0+a) a=$5} END{print a}'
- Mean MAPQ with samtools view aln.bam chr22:20000000-21000000 | awk '{sum+=$5} END { print "Mean MAPQ =",sum/NR}'
- bioawk functions similar to bcftools query
TODO: curl only a subset of a (super-large) BAM by region with BAI

IGV

Get signed URL from GDC UUID
visualizing Pacbio long reads in IGV
- Note: IGV handles long reads hardclip differently from short reads (appear to be a truncated magneta arrow) and it can be misleading! It makes more sense to make dotplot aligning reads to ref.
- Tricks for IGV for long reads
write IGV xml from a list of bam paths with bash /usr/writeIGV.sh ~{reference_version} ~{sep=" " input_files} ~{sep=" " input_names_prefix} > "~{file_name}.xml"; python equivalent
Starting from IGV 2.13 it will stream s3 buckets from [default] profile
Starting from IGV 2.14 it displays 5mC with better color code [TODO; add explanation here], quick visuals can be done with methylation-artist
TODO: linked-reads display

Misc

Kill Vscode remote on host to emulate log out and log back in, otherwise it still runs as a background process
Homebrewing without sudo
Missing link to libcrypto openssl on OSX
Length for chr1 is 249,250,621 in hg19; and 248,956,422 in hg38
Misc notes on gene annotations: Transcript Biotype and gnomad gene constraints
HG002 phased Truthsets

Name		Name	Last commit message	Last commit date
Latest commit History 89 Commits
README.md		README.md
Rutils.md		Rutils.md
git-notes.md		git-notes.md
kitchensink.md		kitchensink.md
lib-rna-seq.md		lib-rna-seq.md
online_tools.md		online_tools.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

General python

Pandas

Numpy

Matplotlib

General R (to be filled)

NGS manipulation

Variants (MAF/VCF)

Alignments (BAM/CRAM)

IGV

Misc

About

Releases

Packages

hurrialice/qutils

Folders and files

Latest commit

History

Repository files navigation

General python

Pandas

Numpy

Matplotlib

General R (to be filled)

NGS manipulation

Variants (MAF/VCF)

Alignments (BAM/CRAM)

IGV

Misc

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages