Formatting data files for JBrowse

Kim Rutherford edited this page Aug 19, 2022 · 26 revisions

Chromosome IDs

All features in data files need to use these chromosome IDs:

  • I
  • II
  • III
  • chr_II_telomeric_gap
  • mitochondrial
  • mating_type_region
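A quick way to spot files that still need their chromosome IDs fixed is to list every ID in the first column that is not in the list above. The following is a minimal sketch, assuming a tab-delimited file (BED, bedGraph or similar) with the chromosome in column 1; the function name `unexpected_chromosome_ids` is made up for this example and is not an existing script on oliver1.

```shell
# Sketch: print each chromosome ID in column 1 that is NOT one of the
# accepted IDs. The function name is hypothetical.
unexpected_chromosome_ids() {
  awk -F'\t' '
    BEGIN {
      n = split("I II III chr_II_telomeric_gap mitochondrial mating_type_region", ids, " ")
      for (k = 1; k <= n; k++) ok[ids[k]] = 1
    }
    # skip comment, track and browser lines, and blank lines
    /^#/ || /^track/ || /^browser/ || /^$/ { next }
    # report each unexpected ID once, in the order first seen
    !($1 in ok) && !seen[$1]++ { print $1 }
  ' "$1"
}
```

Running `unexpected_chromosome_ids BranchPoint_v2.bed` prints nothing when every feature already uses an accepted ID.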

File processing software

Commands like samtools and bgzip are installed on oliver1. If something is missing, please let me (kmr) know.

Data directories

On oliver1:

  • main directory: /data/pombase/external_datasets/
  • original files, one sub-directory per dataset: /data/pombase/external_datasets/originals
  • processed files, with correct chromosome IDs, sorted and indexed: /data/pombase/external_datasets/processed/

In most cases, each dataset/publication has a pair of sub-directories with the same name: one in originals and one in processed. For example:

  • /data/pombase/external_datasets/originals/Marguerat_2012_PMID_23101633

and

  • /data/pombase/external_datasets/processed/Marguerat_2012_PMID_23101633

We have been moving to standardise the dataset directory names on the pattern AUTHOR_YEAR_PMID_<PubMed ID>, for example "Marguerat_2012_PMID_23101633".
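As an illustration of the naming pattern, the pair of sub-directories for a new dataset could be created in one step. This is only a sketch: `make_dataset_dirs` is a hypothetical helper, and the `BASE` variable is an assumption added so the function can be tried somewhere other than the real data directory.

```shell
# Sketch: create the matching originals/processed pair for a dataset.
# make_dataset_dirs is a hypothetical helper, not an installed command.
# BASE defaults to the main directory but can be overridden for a dry run.
make_dataset_dirs() {
  local author=$1 year=$2 pmid=$3
  local base=${BASE:-/data/pombase/external_datasets}
  local name="${author}_${year}_PMID_${pmid}"
  mkdir -p "$base/originals/$name" "$base/processed/$name"
  # print the directory name so the caller can reuse it
  echo "$name"
}
```

For example, `make_dataset_dirs Marguerat 2012 23101633` creates both Marguerat_2012_PMID_23101633 directories if they do not already exist.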

New sub-directories and files in the processed directory are automatically copied to the Babraham server as part of the nightly load using rsync. The files are in /home/ftp/pombase/external_datasets/ on the Babraham server.

BED / TabixBED

BED format files need to be sorted, then compressed with bgzip, and then indexed to be usable by JBrowse.

Example commands:

sort -k1,1 -k2,2n -k3,3n BranchPoint_v2.bed > BranchPoint_v2.sorted.bed
bgzip BranchPoint_v2.sorted.bed
tabix -p bed BranchPoint_v2.sorted.bed.gz

The tabix command creates a .tbi index file for the compressed BED file (e.g. BranchPoint_v2.sorted.bed.gz.tbi). Both the .sorted.bed.gz file and the .sorted.bed.gz.tbi file need to be copied to the processed directory.

Use the .sorted.bed.gz filename in the URL column of the JBrowse metadata table.

See /data/pombase/external_datasets/processed/Yadav_2012_PMID_23163955_SIDD/ for an example.
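If the original BED file uses different chromosome names, they need to be changed before sorting. The sketch below assumes a two-column, tab-separated mapping file (old ID, new ID) that you write by hand for each dataset; both that file and the function name `rename_and_sort_bed` are hypothetical. It also drops track, browser and comment lines, on the assumption that they are not wanted in the sorted file.

```shell
# Sketch: rewrite column 1 of a BED file using a mapping file, then sort
# with the same keys as the example commands above.
# rename_and_sort_bed and the mapping file are hypothetical.
rename_and_sort_bed() {
  local map=$1 bed=$2
  awk -F'\t' -v OFS='\t' '
    NR == FNR { new_id[$1] = $2; next }      # first file: load old -> new IDs
    /^#/ || /^track/ || /^browser/ { next }  # assumption: header lines are dropped
    ($1 in new_id) { $1 = new_id[$1] }       # rename only when a mapping exists
    { print }
  ' "$map" "$bed" | sort -k1,1 -k2,2n -k3,3n
}
```

Redirect the output to a file such as BranchPoint_v2.sorted.bed, then run bgzip and tabix on it as shown in the example commands.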

WIG

Wiggle (.wig) format files need to be converted to bigWig (.bw) format for use by JBrowse. No other processing is needed for bigWig files. The .bw file can be used in the URL column of the JBrowse metadata table.

We have wigToBigWig installed on oliver1 for conversion.

Example commands:

for i in *.wig
do
  # make a file name ending in .bw from one ending in .wig:
  BW=${i%.wig}.bw
  echo "$i"
  wigToBigWig "$i" /var/pomcur/sources/pombe-embl/supporting_files/chromosome_sizes.txt "$BW"
done

where chromosome_sizes.txt contains:

chr_II_telomeric_gap    20000
mitochondrial   19431
mating_type_region      20128
III     2452883
II      4539804
I       5579133

Fixing bigWig files

If the dataset is in .bw format (which is binary), we may need to convert to .wig (text) format so we can fix the chromosome IDs. We have the bigWigToWig command installed for this.

Example command, where GSE110976_EMM_exp_minus.bw exists and GSE110976_EMM_exp_minus.wig will be created:

bigWigToWig GSE110976_EMM_exp_minus.bw GSE110976_EMM_exp_minus.wig
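Once the data is in text form, the chromosome IDs can be rewritten with awk. The sketch below is one possible approach, not an installed tool: `fix_wig_chromosome_ids` and the two-column mapping file (old ID, new ID) are hypothetical. It assumes the .wig output contains either fixedStep/variableStep declaration lines with a chrom= field or four-column bedGraph-style lines, and it drops "#" comment lines on the assumption that they are not needed for the conversion back.

```shell
# Sketch: rename chromosome IDs in a text .wig file.
# fix_wig_chromosome_ids and the mapping file are hypothetical.
fix_wig_chromosome_ids() {
  local map=$1 wig=$2
  awk '
    NR == FNR { new_id[$1] = $2; next }   # first file: load old -> new IDs
    /^#/ { next }                         # assumption: comment lines are dropped
    /^(fixedStep|variableStep)/ {         # declaration lines carry chrom=ID
      for (k = 2; k <= NF; k++)
        if ($k ~ /^chrom=/) {
          id = substr($k, 7)              # text after "chrom="
          if (id in new_id) $k = "chrom=" new_id[id]
        }
      print; next
    }
    # bedGraph-style data lines: chromosome, start, end, value
    NF == 4 && ($1 in new_id) { print new_id[$1] "\t" $2 "\t" $3 "\t" $4; next }
    { print }                             # everything else passes through unchanged
  ' "$map" "$wig"
}
```

Write the output to a new .wig file and convert it back to .bw with wigToBigWig and chromosome_sizes.txt, as in the WIG section.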

SAM/BAM

JBrowse needs sorted, indexed BAM files with correct chromosome IDs. Both the .bam file and its .bam.bai index need to be on the server.
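Sorting and indexing can be done with samtools. The following is a minimal sketch rather than the exact procedure used for existing datasets: `prepare_bam` is a hypothetical helper name and the .sorted.bam output name is an assumption. It also prints the reference sequence names so they can be compared with the chromosome IDs.

```shell
# Sketch: sort and index one BAM file, then list its reference names.
# prepare_bam is a hypothetical helper name.
prepare_bam() {
  local bam=$1
  local out=${bam%.bam}.sorted.bam
  samtools sort -o "$out" "$bam" || return 1
  samtools index "$out" || return 1        # writes $out.bai
  # idxstats prints one line per reference sequence; column 1 is the name
  # and the final "*" line counts reads with no reference
  samtools idxstats "$out" | cut -f1 | grep -v '^\*$'
}
```

If `prepare_bam sample.bam` prints names other than the accepted chromosome IDs, the file still needs its IDs fixed before it goes in the processed directory.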

To generate bigWig coverage graphs for use at lower zoom levels:

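for i in *.bam
do
   # strip the .bam suffix to get the base name for the output files
   base=$(basename "$i" .bam)
   # coverage in bedGraph format, run at low priority
   nice -n 19 bedtools genomecov -ibam "$base.bam" -bg -scale 1.0 > "$base.bedgraph"
   # bedGraphToBigWig needs input sorted by chromosome, then start position
   LC_COLLATE=C sort -k1,1 -k2,2n "$base.bedgraph" > "$base.sorted_bedgraph"
   /usr/local/bin/ucsc-utils/bedGraphToBigWig "$base.sorted_bedgraph" /var/pomcur/sources/pombe-embl/supporting_files/chromosome_sizes.txt "$base.bw"
   rm "$base.sorted_bedgraph" "$base.bedgraph"
done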