Formatting data files for JBrowse

Kim Rutherford edited this page Aug 19, 2022 · 26 revisions

Chromosome IDs

All features in data files need to use these chromosome IDs:

  • I
  • II
  • III
  • chr_II_telomeric_gap
  • mitochondrial
  • mating_type_region
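A quick way to spot files that still use other names (for example chr1 or MT) is to count the values in column 1 that are not in the list above. This is only a sketch: the function name check_chr_ids is made up for illustration, and it assumes a whitespace-separated format with the chromosome ID in the first column (BED, bedGraph, GFF).

```shell
# Print each unexpected chromosome ID in column 1, with the number of lines using it.
# Prints nothing when every ID is one of the six allowed IDs.
# check_chr_ids is an illustrative name, not an installed command.
check_chr_ids() {
  awk '
    BEGIN {
      n = split("I II III chr_II_telomeric_gap mitochondrial mating_type_region", ids, " ")
      for (i = 1; i <= n; i++) ok[ids[i]] = 1
    }
    # skip comments, blank lines and BED header lines
    /^#/ || /^track/ || /^browser/ || NF == 0 { next }
    !($1 in ok) { bad[$1]++ }
    END { for (id in bad) print id "\t" bad[id] }
  ' "$1" | sort
}
```

For example, check_chr_ids BranchPoint_v2.bed should print nothing if the file is ready for processing.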

File processing software

Commands like samtools and bgzip are installed on oliver1. If something is missing, please let me (kmr) know.

Data directories

On oliver1:

  • main directory: /data/pombase/external_datasets/
  • original files, one sub-directory per dataset: /data/pombase/external_datasets/originals
  • processed files, with correct chromosome IDs, sorted and indexed: /data/pombase/external_datasets/processed/

The processed directory is copied to the Babraham server as part of the nightly load:

  • using rsync
  • the files are in /home/ftp/pombase/external_datasets/ on the Babraham server

BED / TabixBED

BED format files need to be sorted, compressed with bgzip and then indexed with tabix for use by JBrowse. Example:

sort -k1,1 -k2,2n -k3,3n BranchPoint_v2.bed > BranchPoint_v2.sorted.bed
bgzip BranchPoint_v2.sorted.bed
tabix -p bed BranchPoint_v2.sorted.bed.gz
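If a BED file comes from elsewhere and it is unclear whether it was sorted, a check along the following lines may help before compressing it. It is a sketch under assumptions: bed_is_sorted is a made-up name, and the check only tests what the sort command above produces (each chromosome in one contiguous block, start positions non-decreasing within a chromosome), which is believed to be what tabix needs.

```shell
# Exit 0 if the BED file is grouped by chromosome and sorted by start within each
# chromosome; otherwise print the first problem found and exit 1.
# bed_is_sorted is an illustrative name, not an installed command.
bed_is_sorted() {
  awk '
    # skip comments, blank lines and BED header lines
    /^#/ || /^track/ || /^browser/ || NF == 0 { next }
    # a new chromosome starts: it must not have appeared earlier in the file
    !have || ($1 "") != chr {
      if ($1 in seen) { print "chromosome " $1 " reappears at line " NR; bad = 1; exit }
      seen[$1] = 1; chr = $1; start = -1; have = 1
    }
    $2 + 0 < start { print "start out of order at line " NR; bad = 1; exit }
    { start = $2 + 0 }
    END { exit bad + 0 }
  ' "$1"
}
```

It could be used as: bed_is_sorted BranchPoint_v2.sorted.bed && bgzip BranchPoint_v2.sorted.bed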

WIG

WIG files are converted to bigWig with wigToBigWig:

for i in $(find . -name '*.wig')
do
  BW=${i%.wig}.bw
  echo "$i"
  wigToBigWig "$i" ~/pombe/chromosome.sizes "$BW"
done

Where chromosome.sizes is:

chr_II_telomeric_gap    20000
mitochondrial   19431
mating_type_region      20128
III     2452883
II      4539804
I       5579133
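If the sizes file ever needs to be regenerated, the lengths can be computed from a genome FASTA file. The following is a minimal sketch, assuming a FASTA file whose header lines start with the chromosome ID as the first word; fasta_sizes is a made-up name and no particular FASTA file on oliver1 is implied.

```shell
# Print "ID<TAB>length" for each sequence in a FASTA file.
# fasta_sizes is an illustrative name, not an installed command.
fasta_sizes() {
  awk '
    /^>/ {
      # header line: emit the previous sequence, then start a new one
      if (name != "") print name "\t" len
      name = substr($1, 2); len = 0
      next
    }
    # sequence line: add its length to the running total
    { len += length($0) }
    END { if (name != "") print name "\t" len }
  ' "$1"
}
```

The output can be redirected to a file and compared against the existing chromosome.sizes before use.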

SAM/BAM

JBrowse needs sorted, indexed BAM files with correct chromosome IDs. Both the .bam and .bam.bai need to be on the server.
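Sorting and indexing can be done with samtools sort and samtools index. The sketch below wraps the two steps in a function; the function name, the .sorted.bam naming and the SAMTOOLS variable are assumptions made for illustration, and it does not change chromosome IDs.

```shell
# SAMTOOLS can be overridden to point at a specific install (illustrative variable).
SAMTOOLS=${SAMTOOLS:-samtools}

# Coordinate-sort a BAM file and index the result.
# sort_and_index_bam is an illustrative name, not an installed command.
sort_and_index_bam() {
  bam=$1
  sorted=${bam%.bam}.sorted.bam
  "$SAMTOOLS" sort -o "$sorted" "$bam" &&
  "$SAMTOOLS" index "$sorted"   # writes $sorted.bai next to the sorted BAM
}
```

Calling sort_and_index_bam reads.bam would be expected to leave reads.sorted.bam and reads.sorted.bam.bai in the same directory (reads.bam is a placeholder name).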

For generating bigWig coverage graphs for use at lower zoom levels:

for i in *.bam
do
   base=$(basename "$i" .bam)
   # per-base coverage in bedGraph format, run at low priority
   nice -n 19 bedtools genomecov -ibam "$base.bam" -bg -scale 1.0 > "$base.bedgraph"
   # bedGraphToBigWig needs the input sorted by chromosome, then start
   LC_COLLATE=C sort -k1,1 -k2,2n "$base.bedgraph" > "$base.sorted_bedgraph"
   /usr/local/bin/ucsc-utils/bedGraphToBigWig "$base.sorted_bedgraph" /var/pomcur/sources/pombe-embl/supporting_files/chromosome_sizes.txt "$base.bw"
   # remove the intermediate files
   rm "$base.sorted_bedgraph" "$base.bedgraph"
done