Skip to content

Latest commit

 

History

History
173 lines (127 loc) · 4.94 KB

README.md

File metadata and controls

173 lines (127 loc) · 4.94 KB

AGP generation steps

Process GoldenGate Markers and generate mapped marker file

This is done for each NAM, replace the NAM with the actual NAM name and run it on each NAM scaffolds (except B73 and B73Ab10) The map files for each NAM should be placed in their own folder

Mo18W example run:

Step 0: Build index

build index for your scaffolds:

00_build_index_hisat2.sh Mo18W.scaffolds.fasta

Step 1: Process markers

for f in *.map; do
  gg1_ProcessMaizeGDBmaps.sh $f M018W;
done
stdout:

The example output for Mo18W

For M018W Chr 1, orignal file had 113 markers, but only 103 had information and 103 sequences were able to extract
For M018W Chr 2, orignal file had 75 markers, but only 65 had information and 65 sequences were able to extract
For M018W Chr 3, orignal file had 85 markers, but only 72 had information and 72 sequences were able to extract
For M018W Chr 4, orignal file had 76 markers, but only 64 had information and 64 sequences were able to extract
Feature (5:229199657-229199757) beyond the length of 5 size (217959525 bp).  Skipping.
For M018W Chr 5, orignal file had 88 markers, but only 79 had information and 78 sequences were able to extract
For M018W Chr 6, orignal file had 49 markers, but only 39 had information and 39 sequences were able to extract
For M018W Chr 7, orignal file had 46 markers, but only 39 had information and 39 sequences were able to extract
For M018W Chr 8, orignal file had 56 markers, but only 43 had information and 43 sequences were able to extract
For M018W Chr 9, orignal file had 57 markers, but only 46 had information and 46 sequences were able to extract
For M018W Chr 10, orignal file had 35 markers, but only 31 had information and 31 sequences were able to extract

You will see .fasta and .tsv for each chr separately

Step 2: Pool markers

gg2_MergeMarkers.sh M018W

No stdout, but you will see:

M018W_merged.fasta
M018W_merged.txt

as output

Step 3a: Map markers

Map markers

gg3_MapMarkers.sh \
   /work/LAS/mhufford-lab/arnstrm/Canu_1.8/genetic_maps/Mo18W/Mo18W_index/Mo18W \
   M018W_merged.fasta \
   M018W_merged.txt M018W

No stdout, but few files are generated by this script.

M018W_mapped.sam
mapping_stats.txt
M018W_mapped.bam
M018W_sorted.bam
M018W_sorted-q30.bam
M018W_sorted_noMismatch.bam
M018W_part1.txt
M018W_part2.txt
M018W_GG-mapped.csv

The file M018W_GG-mapped.csv is needed for the ALLMAPS step later, but we will renamed this file:

mv M018W_GG-mapped.csv goldengate.csv

Process PanGenome anchor markers and generate mapped marker file

First we need to prepare PG marker file (once for all NAM lines). File is located in CyVerse Data Commons that is accessible via iRods

icd /iplant/home/shared/panzea/genotypes/GBS/v27
iget Lu_2015_NatCommun_panGenomeAnchors20150219.txt.gz

or through direct link:

https://datacommons.cyverse.org/browse/iplant/home/shared/panzea/genotypes/GBS/v27/Lu_2015_NatCommun_panGenomeAnchors20150219.txt.gz
# wget/curl will not work

Once downloaded process the file downloaded as follows:

for f in {1..10}; do
 awk -v x=$f '$3==x && $7==0 {print $5"\tpg_"NR"\t"$8}' Lu_2015_NatCommun_panGenomeAnchors20150219.txt;
done > pb_anchors.txt
# bed file to extract sequence
for f in {1..10}; do
 awk -v x=$f '$3==x && $7==0 {print $5"\t"$6-50"\t"$6+50"\tpg_"NR"\t.\t+"}' Lu_2015_NatCommun_panGenomeAnchors20150219.txt;
done > pb_anchors.bed
# genome
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-22/fasta/zea_mays/dna/Zea_mays.AGPv3.22.dna.genome.fa.gz
gunzip Zea_mays.AGPv3.22.dna.genome.fa.gz
# marker sequence
ml bedtools2
awk '$2>0' pb_anchors.bed > temp
mv temp pb_anchors.bed
bedtools getfasta -fi Zea_mays.AGPv3.22.dna.genome.fa -fo pb_anchors.fasta -bed pb_anchors.bed  -name

Mo18W example run:

Step 3b: Map pangenome markers

gg3_MapMarkers.sh \
   /work/LAS/mhufford-lab/arnstrm/Canu_1.8/genetic_maps/Mo18W/Mo18W_index/Mo18W \
    pb_anchors.fasta \
    pb_anchors.txt \
    M018W

same set of files are created as before. But you will only need M018W_GG-mapped.csv. Rename this file

M018W_GG-mapped.csv M018W_pangenome.csv

Step 4: Clean markers/ Remove hets

gg4_clean-markers.sh M018W_pangenome.csv

this will create pangeome.csv file.

Only if you find heterozygous scaffolds in your genome (based on optical map), you can construct a file with just the names of scaffolds that are heterozygous and remove them form the marker file. This will prevent them from including those scaffolds in the AGP.

remove-hets.sh hets-to-remove.txt

Step 5: Run ALLMAPS to create AGP file and Pseudomolecules

Run ALLMAPS

singularity exec --bind $PWD /work/LAS/mhufford-lab/arnstrm/Canu_1.8/genetic_maps/jcvi.simg \
     /work/LAS/mhufford-lab/arnstrm/Canu_1.8/genetic_maps/98_runALLMAPS.sh \
     pangenome.csv \
     goldengate.csv \
     Mo18W.scaffolds.fasta