Formatting data files for JBrowse
All features in data files need to use these chromosome IDs:
- I
- II
- III
- chr_II_telomeric_gap
- mitochondrial
- mating_type_region
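A quick way to spot files that still use other chromosome names is to list the distinct values in the first column and compare them with the IDs above. The following is a minimal sketch, not an existing tool on oliver1: the function name `check_chromosome_ids` is made up for illustration, and it assumes a tab- or space-separated file (BED, bedGraph) with the chromosome ID in column 1.

```shell
#!/bin/sh
# check_chromosome_ids FILE   (hypothetical helper, for illustration only)
# Prints each chromosome ID found in column 1 that is not in the allowed
# list, once per ID, and returns non-zero if any were found.
check_chromosome_ids() {
    awk '
        BEGIN {
            n = split("I II III chr_II_telomeric_gap mitochondrial mating_type_region", ids, " ")
            for (i = 1; i <= n; i++) allowed[ids[i]] = 1
        }
        # skip blank lines, comments and track/browser header lines
        /^[ \t]*$/ || /^#/ || /^track/ || /^browser/ { next }
        # report each unexpected ID the first time it is seen
        !($1 in allowed) && !seen[$1]++ { print $1; bad = 1 }
        END { exit bad }
    ' "$1"
}

# Example use:
#   check_chromosome_ids BranchPoint_v2.bed || echo "chromosome IDs need fixing"
```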
Commands like samtools and bgzip are installed on oliver1. If something is missing, please let me (kmr) know.
On oliver1:
- main directory:
/data/pombase/external_datasets/
- original files, one sub-directory per dataset:
/data/pombase/external_datasets/originals
- processed files, with correct chromosome IDs, sorted and indexed:
/data/pombase/external_datasets/processed/
In most cases, each dataset/publication has a pair of sub-directories: one in originals and one in processed. For example:
/data/pombase/external_datasets/originals/Marguerat_2012_PMID_23101633
and
/data/pombase/external_datasets/processed/Marguerat_2012_PMID_23101633
We have been moving to standardise the dataset directory names as AUTHOR_YEAR_PMID_<PubMed ID>, like "Marguerat_2012_PMID_23101633".
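As an illustration of the naming convention, the paired sub-directories could be created in one step. This is only a sketch: `make_dataset_dirs` is a hypothetical helper, and the base directory is passed in as an argument rather than hard-coded.

```shell
#!/bin/sh
# make_dataset_dirs BASE AUTHOR YEAR PMID   (hypothetical helper)
# Creates BASE/originals/<name> and BASE/processed/<name>, where <name>
# is AUTHOR_YEAR_PMID_<PMID>, and prints the name.
make_dataset_dirs() {
    base=$1
    name="${2}_${3}_PMID_${4}"
    mkdir -p "$base/originals/$name" "$base/processed/$name" && echo "$name"
}

# Example use:
#   make_dataset_dirs /data/pombase/external_datasets Marguerat 2012 23101633
```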
New sub-directories and files in the processed directory are automatically copied to the Babraham server using rsync as part of the nightly load. On the Babraham server the files are in /home/ftp/pombase/external_datasets/.
BED format files need to be sorted, then compressed with bgzip, and then indexed with tabix to be usable by JBrowse.
Example commands:
sort -k1,1 -k2,2n -k3,3n BranchPoint_v2.bed > BranchPoint_v2.sorted.bed
bgzip BranchPoint_v2.sorted.bed
tabix -p bed BranchPoint_v2.sorted.bed.gz
The tabix command creates a .tbi index file for the compressed .bed file (e.g. BranchPoint_v2.sorted.bed.gz.tbi).
Both the .sorted.bed.gz and the .sorted.bed.gz.tbi files need to be copied to the processed directory.
Use the .sorted.bed.gz filename in the URL column of the JBrowse metadata table.
See /data/pombase/external_datasets/processed/Yadav_2012_PMID_23163955_SIDD/ for an example.
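Before compressing, it can be worth confirming that a BED file really is in the order the sort command above produces. A minimal sketch using `sort -c`, which checks the order without writing any output; the function name `bed_is_sorted` is an assumption for illustration, and the check uses whatever locale the shell is running in, so run it in the same environment as the original sort.

```shell
#!/bin/sh
# bed_is_sorted FILE   (hypothetical helper)
# Succeeds if FILE is already ordered by chromosome, then start, then end,
# using the same keys as: sort -k1,1 -k2,2n -k3,3n
bed_is_sorted() {
    sort -c -k1,1 -k2,2n -k3,3n "$1" 2>/dev/null
}

# Example use:
#   bed_is_sorted BranchPoint_v2.sorted.bed && bgzip BranchPoint_v2.sorted.bed
```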
Wiggle (.wig) format files need to be converted to bigWig (.bw) format for use by JBrowse. No other processing is needed for bigWig files. The .bw file can be used in the URL column of the JBrowse metadata table.
We have wigToBigWig installed on oliver1 for conversion.
Example commands:
for i in *.wig
do
    # make a file name ending in .bw from one ending in .wig:
    BW="${i%.wig}.bw"
    echo "$i"
    wigToBigWig "$i" /var/pomcur/sources/pombe-embl/supporting_files/chromosome_sizes.txt "$BW"
done
Where chromosome_sizes.txt contains:
chr_II_telomeric_gap 20000
mitochondrial 19431
mating_type_region 20128
III 2452883
II 4539804
I 5579133
If the dataset is in .bw format (which is binary), we may need to convert it to .wig (text) format so we can fix the chromosome IDs. We have the bigWigToWig command installed for this.
Example command, where GSE110976_EMM_exp_minus.bw exists and GSE110976_EMM_exp_minus.wig will be created:
bigWigToWig GSE110976_EMM_exp_minus.bw GSE110976_EMM_exp_minus.wig
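Once the data is in text form, the chromosome IDs can be rewritten with a small awk script. The sketch below is one possible approach, not the procedure actually used for existing datasets. It assumes a hypothetical two-column mapping file (old ID, then new ID, such as "chr1 I"), and it handles both fixedStep/variableStep declaration lines (chrom=NAME) and four-column bedGraph-style data lines. The function name and the mapping file are made up for illustration.

```shell
#!/bin/sh
# rename_wig_chromosomes MAP_FILE WIG_FILE > FIXED_WIG   (hypothetical helper)
# MAP_FILE: one "old_id new_id" pair per line, whitespace-separated and non-empty.
rename_wig_chromosomes() {
    awk '
        # First file: load the old -> new ID mapping.
        NR == FNR { map[$1] = $2; next }
        # Declaration lines carry the ID as chrom=NAME.
        $1 == "fixedStep" || $1 == "variableStep" {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^chrom=/) {
                    name = substr($i, 7)
                    if (name in map) $i = "chrom=" map[name]
                }
            }
            print
            next
        }
        # bedGraph-style data lines: the ID is the first of four columns.
        NF == 4 && ($1 in map) { sub(/^[^ \t]+/, map[$1]) }
        # Everything else (values, comments, track lines) passes through.
        { print }
    ' "$1" "$2"
}

# Example use (chrom_map.txt is a hypothetical mapping file):
#   rename_wig_chromosomes chrom_map.txt GSE110976_EMM_exp_minus.wig > fixed.wig
```

The fixed .wig file can then be converted back with wigToBigWig as described above. IDs that are not in the mapping file are left unchanged, so it is worth checking the output for any names that were missed.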
JBrowse needs sorted, indexed BAM files with correct chromosome IDs. Both the .bam and .bam.bai files need to be on the server.
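Sorting and indexing can be done with samtools. A minimal sketch, assuming a samtools version that supports `sort -o` (1.3 or later); the function name and the `.sorted.bam` output naming are choices made here for illustration, and the sketch does not fix chromosome IDs.

```shell
#!/bin/sh
# sort_and_index_bam INPUT.bam   (hypothetical helper)
# Writes INPUT.sorted.bam (coordinate-sorted) and INPUT.sorted.bam.bai.
sort_and_index_bam() {
    sorted="${1%.bam}.sorted.bam"
    samtools sort -o "$sorted" "$1" &&
    samtools index "$sorted"
}

# Example use:
#   for i in *.bam; do sort_and_index_bam "$i"; done
```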
For generating bigWig coverage graphs for use at lower zoom levels:
for i in *.bam
do
    base=$(basename "$i" .bam)
    # per-base coverage in bedGraph format, run at low priority
    nice -n 19 bedtools genomecov -ibam "$base.bam" -bg -scale 1.0 > "$base.bedgraph"
    # bedGraphToBigWig needs input sorted by chromosome, then start position
    LC_COLLATE=C sort -k1,1 -k2,2n "$base.bedgraph" > "$base.sorted_bedgraph"
    /usr/local/bin/ucsc-utils/bedGraphToBigWig "$base.sorted_bedgraph" /var/pomcur/sources/pombe-embl/supporting_files/chromosome_sizes.txt "$base.bw"
    # remove the intermediate files
    rm "$base.sorted_bedgraph" "$base.bedgraph"
done
PomBase is funded by the Wellcome Trust