%%%%%%%%% PROPOSAL -- 15 pages (including Prior NSF Support)
% The research and training plan presents the research that you will conduct and the training
% that you will receive during the fellowship period and how they relate to your career goals.
% Include in the research and training plan: 1) a brief and informative introduction or background
% section; 2) a statement of research objectives, methods, and significance; 3) training objectives
% and plan for achieving them (these may include scientific as well as other career preparation activities);
% 4) an explanation of how the fellowship activities will enhance your career development and future
% research directions as well as describing how this research differs from your dissertation research; 5)
% a justification of the choice of sponsoring scientist(s) and host institution(s); 6) a timetable with yearly
% goals with benchmarks for major anticipated outcomes. As with all NSF proposals, broader impacts
% must also be addressed.
%
% Some applications may require other documentation before the final decision can be made, e.g.,
% animal care and use, human subjects, government permits, letters of collaboration, and commitments
% from private sources. Their existence should be noted in the research and training plan, but they
% should not be included in the application. NSF may request them later.
\required{Project Description}
% From the NSF Grants Proposal Guide:
% "The Project Description should provide a clear statement of the work
% to be undertaken and must include: objectives for the period of the proposed
% work and expected significance; relation to longer-term goals of the PI's
% project; and relation to the present state of knowledge in the field,
% to work in progress by the PI under other support and to work in progress
% elsewhere."
\begin{wrapfigure}{r}{0.5\textwidth}
\setlength{\abovecaptionskip}{5pt}
\vspace{-30pt}
\centering
\includegraphics[scale=0.65]{rangemap}
\caption{Natural ranges for four study species}
\label{f:range}
\vspace{-5pt}
\end{wrapfigure}
Fire has been a crucial influence on global ecosystem processes and on forested ecosystems, driving not only locally
adapted phenotypes in plant populations \citep{Lamont:1991js,Vega:2008vk,Midgley:2011dw,Keeley:2011jw,
He:2012bz,Parchman:2012ca}, but also impacting carbon storage and climate \citep{Bowman:2009kp}.
Understanding the genetic architecture
underlying fire-associated traits is therefore critical to dealing with the environmental impacts of increased carbon inputs and
changing climate. \citet{He:2012bz} recently investigated five fire-adapted traits for 101 species of \emph{Pinus} and found evidence,
from the Cretaceous, for a strong influence of fire on trait evolution. However, they considered only presence or absence of the traits in
their study. \textbf{There have been no studies to date which capture, as in this proposal, the quantitative nature of fire-associated trait
evolution in these species.}
Plants display strong patterns of adaptation, especially local adaptation, when population sizes are large \citep{Leimu:2008fb}.
This characteristic is especially important in conifers, which display a lack of population structure and large effective population
sizes \citep{Neale:2004hi}; this is partly why association genetic studies have been so successful in these species
\citep{Gupta:2005fx,GonzalezMartinez:2006ij,GonzalezMartinez:2007gy, Eckert:2009hh, Wegrzyn:2010dd, Eckert:2010hd,Eckert:2012cw}.
The PI will investigate the genetic architecture of a complex phenotype in 2000 trees from four economically important
pine species whose ranges span the eastern and southeastern United States (25 trees from each of 20 populations per species).
Bark thickness is a highly heritable trait for forest trees (e.g., $H^2 = 0.65$ for \emph{P. taeda}; \citet{Pederick1970}), and, as shown by
\citet{He:2012bz}, is involved with adaptation to fire by species of \emph{Pinus}, including the four focal species.
The PI will study bark thickness, as a quantitative, fire-adapted phenotypic trait, in populations of slash
pine (\emph{Pinus elliottii}), pond pine (\emph{P.\ serotina}), loblolly pine (\emph{P.\ taeda}), and longleaf pine
(\emph{P.\ palustris}). \textbf{Using next-generation, high-throughput sequencing, the PI will dissect bark thickness as a complex
and adaptive, quantitative trait across populations of these four closely related species with the goal of uncovering shared genetic
architecture at multiple evolutionary time scales.} These findings have direct economic impact. For example, \citet{Marshall:2006wl}
demonstrated the negative financial impact that bark mass has on the forest industry. Linking model-based
estimates of bark thickness with markers linked to the causative polymorphisms underlying variation for this trait can help to
inform breeding plans to minimize this loss.
The application of next-generation sequencing technology to conifer genomes, as proposed in this study, promises
a way to dissect the genetic architecture of an adaptive trait. Until recently, techniques
to consider adaptation on a genomic scale have been no match for the size and complexity of conifer genomes \citep{Mackay:2012hr}.
\begin{wrapfigure}{r}{0.5\textwidth}
\setlength{\abovecaptionskip}{5pt}
\begin{ganttchart}[vgrid]{1}{12}
\gantttitle[title/.style={fill=green!60}]{2013}{4}
\gantttitle[title/.style={fill=yellow!60}]{2014}{4}
\gantttitle[title/.style={fill=red!60}]{2015}{4}\\
%\ganttbar[bar/.style={fill=green}]{Phase 1}{2.5}{4.5}\\
\ganttbar[bar/.style={fill=green!60}]{Phase 1}{2.5}{4}
\ganttbar[bar/.style={fill=yellow!60}]{}{5}{4.5}\\
%\ganttbar{Phase 2}{3.5}{8.5}\\
\ganttbar[bar/.style={fill=green!60}]{Phase 2}{3.5}{4}
\ganttbar[bar/.style={fill=yellow!60}]{}{5}{8}
\ganttbar[bar/.style={fill=red!60}]{}{9}{9.5}\\
\ganttbar[bar/.style={fill=yellow!60}]{Phase 3}{8.5}{12}
\ganttbar[bar/.style={fill=red!60}]{}{9}{12}
\end{ganttchart}
\caption{Project timeline}
\vspace{-10pt}
\label{f:timeline}
\end{wrapfigure}
However, advances in genomic enrichment techniques such as those by \citet{Parchman:2012ca} and \citet{Willing:2011jb}
coupled with the ever-decreasing cost of DNA sequencing have facilitated the application of new genotyping-by-sequencing (GBS)
approaches to answering fundamental questions in evolutionary biology in non-model species. \textbf{The PI is well-positioned
to utilize the sophistication of the sequencing facilities at VCU, coupled with his background in evolutionary, molecular, and
computational biology, to study how bark thickness has evolved across \emph{Pinus} and how this knowledge may serve to
inform ecologically relevant and economically important decisions related to this important genus}.
Dr. Andrew J. Eckert was chosen as a sponsor of this project due to his experience with association mapping
and evolutionary genetics in conifers. He has published over 30 papers in this field and is recognized as an authority
on complex trait dissection in forest tree populations. The PI has elected to stay at VCU for this research due to the availability
of excellent faculty collaborators with extensive experience in phylogenetics and molecular evolution (Dr. Maria C. Rivera) and
landscape genetics (Dr. Rodney J. Dyer). The research and teaching experience gained in this project will position the PI to
competitively seek a future tenure-track position at a tier 1 research institution.
This study is laid out below in three general phases: (1) field sampling and DNA preparation, (2) DNA sequencing and bioinformatics,
and (3) statistical analysis and hypothesis testing. The quarterly timeline for this project is shown in Figure \ref{f:timeline}.
\subsection*{Phase 1: Field sampling and DNA preparation}
\paragraph{Field sampling}
The PI will sample four species in their natural ranges, with 25 individuals from 20 natural populations of each
species (Figure \ref{f:range}). The locations of each population will be determined randomly. Within each population, a representative
of the focal species will be chosen randomly. To avoid sampling close relatives, all other trees sampled from the population will
be at least \SI{50}{m} from the first sampling location. Only trees with diameter at breast height (\SI{1.25}{m}
above ground; DBH) of at least \SI{15}{cm} will be sampled. All efforts will be made to sample adult trees with similar DBHs,
so as to standardize ages within sites. Three to five needle fascicles will be collected from each tree for use in DNA
extraction. Needles, with a desiccant, will be stored in \SI{15}{\ml} Falcon\texttrademark\ tubes at ambient
temperatures in the field and at \SI{-80}{\celsius} long-term in the laboratory.
Bark thickness will be treated as a quantitative character in all downstream analyses, and as such, will be measured for each tree at five
random locations around DBH. Measurements of bark thickness will be taken using a
Hagl\"{o}f\textsuperscript{\textregistered} bark gauge, and the height of the tree will be measured using a clinometer. These values
will also be used to obtain the best possible regression model for future studies in these species, using the
regression equation from \citet{Cao:1986th}; including these covariates has been shown to significantly decrease estimation
bias by up to \SI{43}{\percent} in some species \citep{Li:2010bl}. At the end of sampling, \num{2000} individual tubes of
needles (20--25 needles/tube) will be banked along with GPS coordinates and measurements of bark thickness and height
for all trees for all species.
\paragraph{DNA preparation}
Genomic DNA will be extracted from needle tissue using Qiagen DNeasy extraction kits in a 96-well format following an established
tissue preparation protocol in the Eckert lab located at VCU. A single DNA extraction from multiple needles per individual
will be performed for each sampled tree.
\paragraph{Criteria for completion}
Phase 1 will be designated complete when genomic DNA has been extracted for all \num{2000} individuals.
DNA quantitation data as well as sample metadata will be stored in a relational database management system (RDBMS)
that will also house sequencing and genotyping data from phase 2. It is estimated that phase 1 will be complete by June 2014.
\subsection*{Phase 2: DNA sequencing and bioinformatics}
The PI will perform genotyping-by-sequencing (GBS) using highly multiplexed libraries on the Illumina HiSeq 2000 platform.
Genomic enrichment will be performed using the approach outlined in
\citet{Parchman:2012ca}, in order to ensure high coverage of the library given the complexity and size of conifer genomes
\citep{Mackay:2012hr}. Briefly, the genomic DNA from an individual is digested with two different restriction enzymes coupled
with Illumina-specific adaptor ligation, including a computationally-correctable sequencing barcode \citep{Roche454MID} tied
to an individual tree.
The product is then amplified by PCR, size-selected on an agarose gel, purified, and pooled for multiplex sequencing. For
2000 trees, the amount of sequence data will approach 8 Tb (terabases). All costs for sequencing will be absorbed by the Eckert lab.
The Nucleic Acids Research Facility (NARF) at VCU is currently producing paired-end reads of 150 bases in length. Using
software that the PI has already written \citep{code:2008wq} coupled with existing libraries (e.g., BioPython \citep{Cock:2009hj},
NGS QC Toolkit \citep{Patel:2012fq}), the processed sequencing data will be computationally divided into individuals and purged of
low-quality reads. Additionally, reads that meet acceptance criteria globally, but possess bases (most often at the 3' end)
that have low quality scores, will be trimmed and retained.
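For a rough sense of scale, assume (purely for illustration) that the approximately 8 terabases of sequence noted above are divided evenly among the 2000 trees; actual yield per barcode will vary. The expected yield per tree is then
\[
\frac{8 \times 10^{12}\ \text{bases}}{2000\ \text{trees}} = 4 \times 10^{9}\ \text{bases per tree},
\]
which corresponds to roughly $2.7 \times 10^{7}$ reads at the 150-base read length noted above.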
Once quality-controlled reads are obtained for each individual (and each population), they will be aligned (mapped) to the
most current draft genome of \emph{P. taeda}, available from Dendrome (\url{dendrome.ucdavis.edu}), using the
Burrows-Wheeler Aligner (BWA) \citep{Li:2009fi}, and SNPs will be called. A SNP is called at a particular alignment position
given enough high-quality reads to indicate a difference between the sample and the reference genome given a model of error.
Sequence data from NGS technology requires the use of new SNP calling algorithms that properly model all the sources of errors
(e.g., base calling errors, mapping errors, and sample preparation errors, particularly PCR errors, which can have a significant
impact on the analysis). Several tools have been developed to perform this task, such as SAMtools \citep{Li:2009ka},
GATK \citep{McKenna:2010bv}, and, more recently, GeMS \citep{You:2012iy}.
The result of this phase will be a sample-by-SNP matrix for each of the four species ($n = 500$ trees per species, \num{2000} trees in total; thousands of SNPs per matrix).
\paragraph{Criteria for completion} Phase 2 will be complete when all libraries have been sequenced and processed. Processing
will include read demultiplexing, quality control, and storage as compressed files and in a relational database system. Additionally,
mapping to the loblolly genome and SNP calling will result in a sample by SNP matrix. Sequence data will also be released at this time.
This phase is the most computationally complex, and should be complete in early summer 2015.
\subsection*{Phase 3: Statistical analysis and hypothesis testing}
This phase will focus on uncovering the underlying genetic architecture of bark thickness, summarized as four main
questions. Each question will be answered by testing a null hypothesis, the rejection of which leads to the subsequent question.
\subsubsection*{Question 1: What is the genetic architecture of bark thickness for the four focal pine species that
are adapted to fire?}
\paragraph{Null hypothesis} ${H_0}^1$: There is no association between genotypic variation at surveyed SNPs and bark thickness
in any of the four species.
\paragraph{Methods} The PI will employ standard linear models that correct for kinship and population structure that are
commonly used in genome-wide association studies (GWAS; \citet{Yu:2006ij}). Association analysis has been widely
used for dissecting the genetic basis of phenotypes for conifers \citep{Neale:2011jh, Ingvarsson:2011fg}.
In addition, the PI will explore the use of the Bayesian models presented by \citet{Parchman:2012ca} in
the first GWAS for a conifer, and the regression tree approaches presented by \citet{Holliday:2012fz}.
Linear models will be used to explicitly test the effects of genotypic class of each discovered SNP on quantitative
measures of bark thickness for each species. Multiple tests will be accounted for using a minimum \emph{P}-value approach
\citep{Conneely:2007ga}.
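For reference, a linear model corrected for kinship and population structure is often written in the general form below. This is a sketch of the standard unified mixed model, not necessarily the exact model that will be fit, and all symbols are notation introduced here for illustration:
\[
\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{S}\mathbf{a} + \mathbf{Q}\mathbf{v} + \mathbf{Z}\mathbf{u} + \mathbf{e},
\qquad \mathrm{Var}(\mathbf{u}) = 2\mathbf{K}\sigma^2_g, \quad \mathrm{Var}(\mathbf{e}) = \mathbf{I}\sigma^2_e .
\]
Here $\mathbf{y}$ is the vector of bark thickness measurements for one species, $\mathbf{b}$ contains fixed effects other than the SNP under test (tree height, for example, is an assumed covariate), $\mathbf{a}$ is the effect of the SNP, $\mathbf{v}$ contains population-structure effects, $\mathbf{u}$ is the polygenic background effect with covariance proportional to the kinship matrix $\mathbf{K}$, and $\mathbf{e}$ is the residual. $\mathbf{X}$, $\mathbf{S}$, $\mathbf{Q}$, and $\mathbf{Z}$ are the corresponding design matrices.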
\paragraph{Expected outcomes and relevance} The null hypothesis will be rejected when at least one SNP in each
of the four pine species is significantly associated with quantitative variation in bark thickness. These results will classify
SNPs discovered in each species into two classes: those associated with bark thickness
(which rejects the null hypothesis for this question) and those unassociated with
bark thickness. The latter will be used as \emph{controls} for questions two through four. The former establishes the
genetic architecture of bark thickness for each focal species, and will be referred to as \emph{candidates}.
\subsubsection*{Question 2: Are candidate SNPs shared across the four focal species, and if so, are they shared to a
greater degree than randomly sampled SNPs from the controls?}
\paragraph{Null hypotheses} ${H_0}^2$: There is no shared genetic architecture for bark thickness among
the four focal species. ${H_0}^3$: The degree of allele sharing is no higher for candidate SNPs than for randomly
selected control SNPs.
\paragraph{Methods} To evaluate ${H_0}^2$, the PI will leverage the underlying, queryable data storage infrastructure
to produce the intersection of SNPs associated with bark thickness in all four focal species (${H_0}^1$). To
evaluate ${H_{0}}^3$, the union of all SNPs associated with bark thickness will be created. The fraction of this list
that is shared across all four species will be used as a test statistic, the magnitude of which will be tested via a simple
permutation analysis. The PI will construct a null distribution for the fraction of alleles shared across all four species
for sets of SNPs randomly selected from those not associated with bark thickness.
Allele sharing across species is common among conifer species and has been attributed largely to incomplete lineage
sorting resulting from recent divergence times and large effective population sizes \citep{Syring:2007gd,Willyard:2009ez},
ancient admixture \citep{Liston:2007cx}, and long-term gene flow \citep{Zhou:2010hk}. For example, an analysis of
levels of long-term gene flow between loblolly and slash pines, using multiple random sets of 50 nuclear genes and an
isolation-with-migration model \citep{Becquet:2007js}, detected low but significant levels of gene flow over the divergence history
of these species (unpublished data, $4 N_{e}m = 2.5$ with a \SI{95}{\percent} confidence interval of 0.52--5.14). The sets of loci
randomly sampled from the unassociated loci will be assumed to largely reflect these processes, so that the
enrichment of allele sharing at loci associated with a trait such as bark thickness, with clear adaptive relevance
\citep{He:2012bz}, would imply additional processes such as natural selection, as in \citet{Segurel:vf} and
\citet{Roux:2012eb}, in promoting allele sharing among species.
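One way to formalize the permutation analysis described above is sketched here; $B$, $f_{\mathrm{obs}}$, and $f_b$ are notation introduced for illustration, and equal-sized random sets are an assumption. Let $f_{\mathrm{obs}}$ be the fraction of trait-associated SNPs shared across all four species, and let $f_b$ ($b = 1, \ldots, B$) be the same fraction computed for the $b$-th random set of unassociated SNPs. An empirical one-sided \emph{P}-value is then
\[
P = \frac{1 + \#\{\, b : f_b \ge f_{\mathrm{obs}} \,\}}{B + 1}.
\]
Adding one to both the numerator and the denominator keeps the estimate above zero for any finite $B$.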
\paragraph{Expected outcomes and relevance} ${H_0}^2$ will be rejected when multiple SNPs associated
with bark thickness are shared across species. The relevance of this result would be to establish the shared genetic
architecture of an adaptive trait. ${H_0}^3$ will be rejected when the fraction of SNPs associated with bark thickness that are
shared across the four focal species lies in the upper \SI{1}{\percent} tail of the null distribution based upon random sets of
unassociated SNPs. A significant result would support the existence of a process beyond incomplete lineage sorting,
admixture, and gene flow that is contributing to the maintenance of shared alleles across the four focal species.
\subsubsection*{Question 3: Are reconstructed allele frequencies at shared candidate SNPs more strongly correlated with
reconstructed bark thickness than those at shared control SNPs across the phylogeny for these four species?}
\paragraph{Null hypothesis} ${H_0}^4$: Correlations between reconstructed allele frequencies at shared candidate SNPs
and reconstructed values of bark thickness are no larger than those for reconstructed allele frequencies at randomly-chosen,
shared control SNPs.
\paragraph{Methods} Phylogenies and associated data matrices for genus \emph{Pinus} will be obtained from
TreeBASE \citep{Morell:to} and GenBank \citep{Benson:2012kf}. Most data matrices are constructed primarily from
cpDNA coding sequences \citep{Eckert:2006iw, Gernandt:2008df} and cpDNA genome sequences \citep{Parks:2009bd}.
Chronograms will be obtained using the fossil calibration points from \citet{Willyard:2007in} as presented in \citet{He:2012bz}.
Once a set of plausible chronograms has been
generated, bark thickness and allele frequencies within lineages will be reconstructed as quantitative
characters using a Bayesian method \citep{Pagel:2004ic}.
Specifically, the PI will use (1) a set of plausible chronograms ($n = \num{10000}$) to account for uncertainty in the
phylogenetic tree and (2) the character states, in combination with phylogenetically informed multiple
regression analyses available in BayesTraits as BayesContinuous \citep{Pagel:2004ic}, to test the association between
reconstructed SNP allele frequencies and bark thickness for shared candidate and shared control SNPs.
Allele frequencies and bark thickness will be specified for each species as averages and variances across populations.
\paragraph{Expected outcomes and relevance} ${H_0}^4$ will be rejected when the multiple regression model for candidate
SNPs better predicts bark thickness than for control SNPs. This suggests that the correlation structure of allele frequencies
with bark thickness differs between the candidates and controls, as defined by answering questions 1 and 2. Since controls
are matched in number and levels of heterozygosity with the candidates, this would suggest non-neutral processes driving
this difference.
\subsubsection*{Question 4: To what extent is natural selection affecting rates of molecular evolution at candidate genes?}
\paragraph{Null hypothesis} ${H_0}^5$: There is no evidence of selection in genes containing SNPs that are significantly associated
with bark thickness.
\paragraph{Methods} As the sequence resource, including annotations, for the loblolly pine genome continues to improve,
establishing the genomic context of bark thickness-associated SNPs will enable the PI to test for evidence of selection in
polymorphic, protein coding genes. From the results stemming from testing ${H_0}^2$ and ${H_0}^3$, the PI will select
10 to 20 SNPs from subsets of candidate and control loci from each population for each species exhibiting similar attributes
(e.g., average heterozygosity). Using genomic position of SNPs and the loblolly genome annotations, the PI will design
PCR primers to amplify the protein coding regions from each subsample ($n = 40\ \text{trees}$), and each amplicon will be
sent for Sanger sequencing. Once the sequences have been obtained, they will be aligned using standard tools such as
MAFFT \citep{Katoh:2005ia} or MUSCLE \citep{Edgar:2004ic}. From the alignment, $d_N/d_S$ will be
estimated, per locus and per site per locus, using established maximum likelihood methods \citep{Yang:2007ki}.
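As an illustration of how such estimates are commonly evaluated (an assumption here, not a commitment to a particular test), nested codon models with and without a class of sites having $\omega = d_N/d_S > 1$ can be compared using a likelihood ratio statistic,
\[
2\Delta\ell = 2\,(\ell_1 - \ell_0),
\]
where $\ell_0$ and $\ell_1$ are the maximized log-likelihoods under the null and alternative models. The statistic is typically compared against a $\chi^2$ distribution with degrees of freedom equal to the difference in the number of free parameters between the two models.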
\paragraph{Expected outcome and relevance} Knowing whether or not genes containing SNPs are under positive ($d_N/d_S > 1$) or
purifying selection ($d_N/d_S < 1$) addresses a fundamental question surrounding maintenance of these polymorphisms in
natural populations. It is expected that $d_N/d_S$ should be higher for candidates than controls.
\required{Intellectual Merit and Broader Impacts}
% as in the project summary, broader impacts must be called out separately
% in the project description. You may be able to give more specific
% examples, or discuss how you've previously achieved these impacts.
% It should be similar, but not identical, to the Broader Impacts statement
% in the project summary
\textbf{By answering these four questions, the PI will have broken down bark thickness as an adaptive trait into
its genetic components, uncovering any shared genetic architecture in the focal species, at multiple evolutionary time scales}.
This project will benefit the local and global scientific communities, as well as help train
the next generation of multidisciplinary scientists in the world of big data. First, data will be made
available for download and exploration from FTP and web sites hosted at VCU. Second, the PI will develop and teach a new
three-credit course, BIOL 591 Applied Ecological Genomics, which will train advanced undergraduates and
graduate students, using the data generated in this project, in skills necessary to both understand the
technology available at VCU and to process, manage, and analyze the deluge of data generated from current genomics
projects. This course will run in conjunction with an existing course, BIOL 693 Ecological Genomics taught by sponsor Eckert.
Given the diverse student body at VCU, the proposed research and
teaching activities will increase research opportunities for underrepresented groups. VCU is among the top 20 universities in the
nation for Biology degrees awarded to ethnic and racial minority students. In 2009, 18.3\% of baccalaureate Biology degrees were
conferred to African American students, exceeding the national average in Biology by 11.5\% (NSF, 2010). A high percentage of
VCU Biology (39.4\%) baccalaureate degrees were also earned by Asian/Pacific Islanders, a value that far exceeds the
national average in Biology of 6.2\%. Caucasian students were awarded 34.9\% of VCU baccalaureate Biology degrees as
compared with the national average in Biology of 66.4\% (NSF, 2010).
\required{Results From Prior NSF Support}
% 5 pages or fewer of the 15 pages for entire description document.
% include results from NSF grants received in the past 5 years.
% if supported by more than one grant, choose the most relevant one
% for each grant, include: NSF award number, amount, dates of
% support, and publications resulting from this research.
% due to space limitations, it is often advisable to use citations rather
% than putting the titles of the publications in the body
% of this section
% e.g.: "My prior grant, "Uses of Coffee in Mathematical Research" (DMS-0123456,
% $100,000, 2005-2008), resulted in 3 papers [1],[2],[3], demonstrating..."
% if requesting postdoctoral research salary, a supplemental 1-page document
% called "Postdoc Mentoring Plan" will be required
PI Friedline has yet to receive funding from the National Science Foundation.