-
Notifications
You must be signed in to change notification settings - Fork 1
Formatting data files for JBrowse
All features in data files need to use these chromosome IDs:
- I
- II
- III
- chr_II_telomeric_gap
- mitochondrial
- mating_type_region
Commands like samtools
and bgzip
are installed on oliver1. If something is missing please let me (kmr) know.
On oliver1:
- main directory:
/data/pombase/external_datasets/
- original files, one sub-directory per dataset:
/data/pombase/external_datasets/originals
- processed files, with correct chromosome IDs, sorted and indexed:
/data/pombase/external_datasets/processed/
In most cases, each dataset/publication has a pair of sub-directories. One in originals
and one in processed
. For example:
/data/pombase/external_datasets/originals/Marguerat_2012_PMID_23101633
and
/data/pombase/external_datasets/processed/Marguerat_2012_PMID_23101633
We have been moving to standardise the dataset directory names with AUTHOR_DATE_PMID_PMID, like "Marguerat_2012_PMID_23101633".
New sub-directories and files in the processed
directory are automatically copied to the Babraham server as part of the nightly load using rsync
. The files are in /home/ftp/pombase/external_datasets/
on the Babraham server.
BED format files need to be sorted, then compressed with bgzip
, and then indexed to be usable by JBrowse.
Example commands:
sort -k1,1 -k2,2n -k3,3n BranchPoint_v2.bed > BranchPoint_v2.sorted.bed
bgzip BranchPoint_v2.sorted.bed
tabix -p bed BranchPoint_v2.sorted.bed.gz
The tabix
command creates a .tbi
index file for the compressed .bed
file. (eg. BranchPoint_v2.sorted.bed.gz.tbi
)
Both the .sorted.bed.gz
and the .sorted.bed.gz.tbi
file need to be copied to the processed
directory.
Use the .sorted.bed.gz
filename in the URL column of the JBrowse metadata table.
See /data/pombase/external_datasets/processed/Yadav_2012_PMID_23163955_SIDD/
for an example.
Wiggle (.wig) format files need to converted to bigWig (.bw) format for use by JBrowse. No other processing is needed for bigWig files. The .bw
file can be used in the URL column of the JBrowse metadata table.
We have wigToBigWig
installed on oliver1 for conversion.
Example commands:
for i in *.wig
do
# make a file name ending in .bw from one ending in .wig:
BW=${i%.wig}.bw
echo $i
wigToBigWig $i /var/pomcur/sources/pombe-embl/supporting_files/chromosome_sizes.txt $BW
done
Where chromosome.sizes
contains:
chr_II_telomeric_gap 20000
mitochondrial 19431
mating_type_region 20128
III 2452883
II 4539804
I 5579133
If the dataset is in .bw
format (which is binary), we may need to convert to .wig
(text) format so we can fix the chromosome IDs. We have the bigWigToWig
command installed for this.
Example command, where GSE110976_EMM_exp_minus.bw
exists and GSE110976_EMM_exp_minus.wig
will be created:
bigWigToWig GSE110976_EMM_exp_minus.bw GSE110976_EMM_exp_minus.wig
SAM format is a plain text short read alignment format. BAM format is the compressed, binary equivalent.
JBrowse needs sorted, indexed BAM files with correct chromosome IDs. Both the .bam
and .bam.bai
(index) files are needed.
Most alignment datasets are provided in BAM format, but we need to check and possibly check chromosome IDs before use.
Samtools is the commonest BAM/SAM conversion program. It does many other things too.
Quick check to see which chromosome IDs a dataset uses:
for i in *.bam; do samtools view $i ; done | awk '{print $3}' | sort | uniq
If the files aren't using I, II, etc. the BAM files will need to be converted to SAM format for easier fixing. SAM files are much bigger than the equivalent BAM files. SAM format is text, but it's not very human readable. Luckily for us it's mostly a tab-delimited file and we only need to edit the header lines and column 3 of the main data to fix chromosome ID problems.
Example command:
samtools view -h -o ERS146824.sam ERS146824.bam
Example SAM file:
@SQ SN:mitochondrial LN:19431
@SQ SN:chr_II_telomeric_gap LN:20000
@SQ SN:I LN:5579133
@SQ SN:II LN:4539804
@SQ SN:III LN:2452883
@SQ SN:mating_type_region LN:20128
ERR135907.18940180 16 I 6 3 51M * 0 0 CGTACATCACCTTGTAAGAATTTATCTGCAATAGTTCTTCGGTATTGTACA D8:;=57>+?:D@:DDDBDDDDBBBB463556;7:'6>777555/56676; MD:Z:35C15 NH:i:2 HI:i:1 NM:i:1 SM:i:3 XQ:i:40 X2:i:40
ERR135907.35403634 16 I 18 3 51M * 0 0 TGTAAGAATTTATCTGCAATAGTCCTTCGGTATTGTACATTGTTCCAAGCA GDGHGGFHFGHEEEF8F@EGGGEGFCHHHHGHHHHGHHBHHGHHHGHHGGH MD:Z:51 NH:i:2 HI:i:1 NM:i:0 SM:i:3 XQ:i:40 X2:i:40
ERR135907.34855373 16 I 118 3 51M * 0 0 CAGAAGTGTAAGCCATATCACTGTCGGCATGTTCAAACTTTGTCAAACCAC DAAAGA=?C?A>>DGG?6;>ABE?B?B?CFEEC>GFEE<BGEBDEFGEBE@ MD:Z:51 NH:i:2 HI:i:1 NM:i:0 SM:i:3 XQ:i:40 X2:i:40
ERR135907.25522986 16 I 217 3 51M * 0 0 AGTTGTGGTCGGCCTTGCCACATTTATAACAAGTAGATAAGCGTACGGGGC EHEGHGHHDGHHHHHHHHHHHHDHHHGHHDHGHBHHHHGBHHHHHHHHHHH MD:Z:51 NH:i:2 HI:i:1 NM:i:0 SM:i:3 XQ:i:40 X2:i:40
ERR135907.7153698 16 I 348 3 51M * 0 0 AAGAATCGCTCGAGTTGTGATGAAATGTCAGTTGAGTCTACCCATTGTTTT GGEGEEGDEGEGGGGGGBG>DHFHHHHHHHHHHHHEHHHHHHDHDHHDHHH MD:Z:19C31 NH:i:2 HI:i:1 NM:i:1 SM:i:3 XQ:i:40 X2:i:40
ERR135907.22957655 16 I 350 3 51M * 0 0 GAATCGCTCGAGTTGTGCTGAAATGTCAGTTGAGTCTACCCATTGTTTTTT IIIGIHIFIIHFIGHIIIFIIIIIIIIIHDHIIGGGIIIIIIFIIGIIIII MD:Z:51 NH:i:2 HI:i:1 NM:i:0 SM:i:3 XQ:i:40 X2:i:40
ERR135907.29163716 16 I 352 3 51M * 0 0 ATCGCTCGAGTTGTGCTGAAATGTCAGTTGAGTCTACCCATTGTTTTTTGA GGIHHIHIHCIIHIIIIIIHIEIIIIIIIIIIIHIIFIIIIIIIIIIIIII MD:Z:51 NH:i:2 HI:i:1 NM:i:0 SM:i:3 XQ:i:40 X2:i:40
ERR135907.32129662 16 I 357 3 48M3S * 0 0 TCGAGTTGTGCTGAAATGTCAGTTGAGTCTACCCATTGTTTTTTGACGCCT HIBIIHGIGGIDBIDIDIIIF>GDGGDDEBDGGD@DD9HIHGGGGGGGDGG MD:Z:48 NH:i:2 HI:i:1 NM:i:0 SM:i:3 XQ:i:40 X2:i:40
...
BAM files need to be sorted and index for use by JBrowse. samtools
can do both of those tasks.
Example command for converting SAM to BAM, without sorting:
samtools view -b -o ERS146824.bam ERS146824.sam
Example command to convert SAM to sorted BAM:
samtools view -u ERS146824.sam | samtools sort -o ERS146824.sorted.bam
BAM files need to be indexed for use by JBrowse using samtools index
.
Example command:
samtools index ERS146824.sorted.bam
Which generates ERS146824.sorted.bam.bai
For the example above we'd need to copy ERS146824.sorted.bam
and ERS146824.sorted.bam.bai
to the appropriate sub-directory of the processed
directory. The URL in the metadata file should be for ERS146824.sorted.bam
. JBrowse will automatically read the index file, ERS146824.sorted.bam.bai
.
Another option for BAM/SAM manipulation is pysam.
for i in *.bam
do
base=`basename $i .bam`
nice -19 bedtools genomecov -ibam $base.bam -bg -scale 1.0 > $base.bedgraph
LC_COLLATE=C sort -k1,1 -k2,2n $base.bedgraph > $base.sorted_bedgraph
/usr/local/bin/ucsc-utils/bedGraphToBigWig $base.sorted_bedgraph /var/pomcur/sources/pombe-embl/supporting_files/chromosome_sizes.txt $base.bw
rm $base.sorted_bedgraph $base.bedgraph
done
PomBase is funded by the Wellcome Trust