Skip to content

Formatting data files for JBrowse

Kim Rutherford edited this page Aug 21, 2022 · 26 revisions

Chromosome IDs

All features in data files need to use these chromosome IDs:

  • I
  • II
  • III
  • chr_II_telomeric_gap
  • mitochondrial
  • mating_type_region

File processing software

Commands like samtools and bgzip are installed on oliver1. If something is missing please let me (kmr) know.

Data directories

On oliver1:

  • main directory: /data/pombase/external_datasets/
  • original files, one sub-directory per dataset: /data/pombase/external_datasets/originals
  • processed files, with correct chromosome IDs, sorted and indexed: /data/pombase/external_datasets/processed/

In most cases, each dataset/publication has a pair of sub-directories. One in originals and one in processed. For example:

  • /data/pombase/external_datasets/originals/Marguerat_2012_PMID_23101633

and

  • /data/pombase/external_datasets/processed/Marguerat_2012_PMID_23101633

We have been moving to standardise the dataset directory names with AUTHOR_DATE_PMID_PMID, like "Marguerat_2012_PMID_23101633".

New sub-directories and files in the processed directory are automatically copied to the Babraham server as part of the nightly load using rsync. The files are in /home/ftp/pombase/external_datasets/ on the Babraham server.

BED / TabixBED

BED format files need to be sorted, then compressed with bgzip, and then indexed to be usable by JBrowse.

Example commands:

sort -k1,1 -k2,2n -k3,3n BranchPoint_v2.bed > BranchPoint_v2.sorted.bed
bgzip BranchPoint_v2.sorted.bed
tabix -p bed BranchPoint_v2.sorted.bed.gz

The tabix command creates a .tbi index file for the compressed .bed file. (eg. BranchPoint_v2.sorted.bed.gz.tbi) Both the .sorted.bed.gz and the .sorted.bed.gz.tbi file need to be copied to the processed directory.

Use the .sorted.bed.gz filename in the URL column of the JBrowse metadata table.

See /data/pombase/external_datasets/processed/Yadav_2012_PMID_23163955_SIDD/ for an example.

WIG

Wiggle (.wig) format files need to converted to bigWig (.bw) format for use by JBrowse. No other processing is needed for bigWig files. The .bw file can be used in the URL column of the JBrowse metadata table.

We have wigToBigWig installed on oliver1 for conversion.

Example commands:

for i in *.wig
do
  # make a file name ending in .bw from one ending in .wig:
  BW=${i%.wig}.bw
  echo $i
  wigToBigWig $i /var/pomcur/sources/pombe-embl/supporting_files/chromosome_sizes.txt $BW
done

Where chromosome.sizes contains:

chr_II_telomeric_gap    20000
mitochondrial   19431
mating_type_region      20128
III     2452883
II      4539804
I       5579133

Fixing bigWig files

If the dataset is in .bw format (which is binary), we may need to convert to .wig (text) format so we can fix the chromosome IDs. We have the bigWigToWig command installed for this.

Example command, where GSE110976_EMM_exp_minus.bw exists and GSE110976_EMM_exp_minus.wig will be created:

bigWigToWig GSE110976_EMM_exp_minus.bw GSE110976_EMM_exp_minus.wig

SAM/BAM

SAM format is a plain text short read alignment format. BAM format is the compressed, binary equivalent.

JBrowse needs sorted, indexed BAM files with correct chromosome IDs. Both the .bam and .bam.bai (index) files are needed.

Most alignment datasets are provided in BAM format, but we need to check and possibly check chromosome IDs before use.

Samtools is the commonest BAM/SAM conversion program. It does many other things too.

Quick check to see which chromosome IDs a dataset uses:

for i in *.bam; do samtools view $i ; done | awk '{print $3}' | sort | uniq

If the files aren't using I, II, etc. the BAM files will need to be converted to SAM format for easier fixing. SAM files are much bigger than the equivalent BAM files. SAM format is text, but it's not very human readable. Luckily for us it's mostly a tab-delimited file and we only need to edit the header lines and column 3 of the main data to fix chromosome ID problems.

Converting BAM to SAM

Example command:

samtools view -h -o ERS146824.sam ERS146824.bam

Example SAM file:

@SQ     SN:mitochondrial        LN:19431
@SQ     SN:chr_II_telomeric_gap LN:20000
@SQ     SN:I    LN:5579133
@SQ     SN:II   LN:4539804
@SQ     SN:III  LN:2452883
@SQ     SN:mating_type_region   LN:20128
ERR135907.18940180      16      I       6       3       51M     *       0       0       CGTACATCACCTTGTAAGAATTTATCTGCAATAGTTCTTCGGTATTGTACA     D8:;=57>+?:D@:DDDBDDDDBBBB463556;7:'6>777555/56676;     MD:Z:35C15      NH:i:2  HI:i:1  NM:i:1  SM:i:3  XQ:i:40 X2:i:40
ERR135907.35403634      16      I       18      3       51M     *       0       0       TGTAAGAATTTATCTGCAATAGTCCTTCGGTATTGTACATTGTTCCAAGCA     GDGHGGFHFGHEEEF8F@EGGGEGFCHHHHGHHHHGHHBHHGHHHGHHGGH     MD:Z:51 NH:i:2  HI:i:1  NM:i:0  SM:i:3  XQ:i:40 X2:i:40
ERR135907.34855373      16      I       118     3       51M     *       0       0       CAGAAGTGTAAGCCATATCACTGTCGGCATGTTCAAACTTTGTCAAACCAC     DAAAGA=?C?A>>DGG?6;>ABE?B?B?CFEEC>GFEE<BGEBDEFGEBE@     MD:Z:51 NH:i:2  HI:i:1  NM:i:0  SM:i:3  XQ:i:40 X2:i:40
ERR135907.25522986      16      I       217     3       51M     *       0       0       AGTTGTGGTCGGCCTTGCCACATTTATAACAAGTAGATAAGCGTACGGGGC     EHEGHGHHDGHHHHHHHHHHHHDHHHGHHDHGHBHHHHGBHHHHHHHHHHH     MD:Z:51 NH:i:2  HI:i:1  NM:i:0  SM:i:3  XQ:i:40 X2:i:40
ERR135907.7153698       16      I       348     3       51M     *       0       0       AAGAATCGCTCGAGTTGTGATGAAATGTCAGTTGAGTCTACCCATTGTTTT     GGEGEEGDEGEGGGGGGBG>DHFHHHHHHHHHHHHEHHHHHHDHDHHDHHH     MD:Z:19C31      NH:i:2  HI:i:1  NM:i:1  SM:i:3  XQ:i:40 X2:i:40
ERR135907.22957655      16      I       350     3       51M     *       0       0       GAATCGCTCGAGTTGTGCTGAAATGTCAGTTGAGTCTACCCATTGTTTTTT     IIIGIHIFIIHFIGHIIIFIIIIIIIIIHDHIIGGGIIIIIIFIIGIIIII     MD:Z:51 NH:i:2  HI:i:1  NM:i:0  SM:i:3  XQ:i:40 X2:i:40
ERR135907.29163716      16      I       352     3       51M     *       0       0       ATCGCTCGAGTTGTGCTGAAATGTCAGTTGAGTCTACCCATTGTTTTTTGA     GGIHHIHIHCIIHIIIIIIHIEIIIIIIIIIIIHIIFIIIIIIIIIIIIII     MD:Z:51 NH:i:2  HI:i:1  NM:i:0  SM:i:3  XQ:i:40 X2:i:40
ERR135907.32129662      16      I       357     3       48M3S   *       0       0       TCGAGTTGTGCTGAAATGTCAGTTGAGTCTACCCATTGTTTTTTGACGCCT     HIBIIHGIGGIDBIDIDIIIF>GDGGDDEBDGGD@DD9HIHGGGGGGGDGG     MD:Z:48 NH:i:2  HI:i:1  NM:i:0  SM:i:3  XQ:i:40 X2:i:40
...

SAM to BAM conversion

BAM files need to be sorted and index for use by JBrowse. samtools can do both of those tasks.

Example command for converting SAM to BAM, without sorting:

samtools view -b -o ERS146824.bam ERS146824.sam

Example command to convert SAM to sorted BAM:

samtools view -u ERS146824.sam | samtools sort -o ERS146824.sorted.bam

Indexing

BAM files need to be indexed for use by JBrowse using samtools index.

Example command:

samtools index ERS146824.sorted.bam

Which generates ERS146824.sorted.bam.bai

Adding to the metadata file

For the example above we'd need to copy ERS146824.sorted.bam and ERS146824.sorted.bam.bai to the appropriate sub-directory of the processed directory. The URL in the metadata file should be for ERS146824.sorted.bam. JBrowse will automatically read the index file, ERS146824.sorted.bam.bai.

pysam

Another option for BAM/SAM manipulation is pysam.

Generating bigWig coverage graphs for use at lower zoom levels

for i in *.bam
do
   base=`basename $i .bam`
   nice -19 bedtools genomecov -ibam $base.bam -bg -scale 1.0 > $base.bedgraph
   LC_COLLATE=C sort -k1,1 -k2,2n $base.bedgraph > $base.sorted_bedgraph
   /usr/local/bin/ucsc-utils/bedGraphToBigWig $base.sorted_bedgraph /var/pomcur/sources/pombe-embl/supporting_files/chromosome_sizes.txt $base.bw
   rm $base.sorted_bedgraph $base.bedgraph
done