\chapter{Questions}
\label{questions-chapter}
The state of syntax measures in dialectometry described above leaves
several research questions unresolved. It is not yet clear whether $R$
is a good measure of syntax distance. Previous results have shown that
it can obtain significant distances, but has either failed to do so
reliably, as in my work on British English \cite{sanders08b}, or has
not compared traditional dialect areas, as in
\namecite{nerbonne06}. Neither study showed that a statistical method
could adequately reproduce existing knowledge about some dialect area,
which is necessary before $R$, and statistical methods as a whole, can
contribute to dialectometry's study of syntax.

This leads to the first question: will the features found by
dialectologists agree with the highly ranked features used by a
statistical method for classification? I will investigate this
question by comparing statistical dialectometry results to the
syntactic dialectology literature on Swedish. A secondary, but related
question is whether the regions of Sweden accepted by dialectology
will be reproduced by a statistical method. For example, my previous
research on British English reproduced the well-known North
England--South England dialect regions. However, this dissertation
eliminates the corpus variability in that research, where a forty-year
gap separated the phonology and syntax corpora, and the syntax corpus
was not collected with dialectology in mind \cite{sanders08b}. With a
corpus collected for the purpose of dialect research, and with a
phonological corpus transcribed from the same interviews, more precise
comparisons should be possible, both between regions and between
syntax and phonology.

A further question, relevant once the utility of a statistical
measure for syntax is established, is what variations of the two
functions comprising the measure produce the best results. This
involves variation of both the feature extraction function and the
distance function. Choice of feature set is almost as important as
choice of distance. My previous work on British English showed
that leaf-ancestor paths provide a small advantage over part-of-speech
(POS) trigrams, presumably by capturing syntactic structure higher in
the parse tree. And, whereas development of a statistical distance
measure is difficult, new feature sets can be developed relatively
quickly.

% on average, each committee meeting results in 2.1 new feature sets
% being proposed, and only 0.7 new distance measures.
In this dissertation, I evaluate several feature sets besides POS trigrams and
leaf-ancestor paths, such as phrase structure rules, leaf-head paths,
and lexical trigrams. I also evaluate variants of these feature sets,
for example varying the POS tagger or POS tag set, as well as
combined feature sets.

Feature sets can be evaluated by comparing performance of different
feature sets on a fixed corpus and with a fixed distance
measure. Here, performance is measured using the same criteria as for
distance measures: the number of significant distances between
interview sites and the similarity of the results to those found by
dialectologists.

Besides feature sets, this dissertation evaluates a number of measures
beyond the $R$ of previous work, such as Kullback-Leibler divergence
and cosine dissimilarity. $R$ is one way to aggregate features that
are created by decomposing sentences. It treats features as atomic,
and does not manipulate them in any syntax-specific ways. As such, $R$
differs from Goebl's WIV only in being designed for larger feature
sets and larger corpora. Both assume that independent, atomic features
derived from a sentence can adequately capture dialect differences. If
this is not the case, then a more syntax-aware way of comparing
individual features will be needed.

A final question is whether the syntactic dialectometry practiced here
agrees with phonological dialectometry on the same corpus. Unlike the
previous questions, which use agreement between syntactic
dialectometry and dialectology, there is no {\it a priori} reason to
expect syntax/phonology agreement; it is quite possible that
phonological features create one set of boundaries while syntactic
features create another set. However, agreement between the two would
be further evidence that statistical methods are useful for
syntactic dialectometry.
\section{Question 1 : Agreement with Dialectology}
The first question is whether a statistical dialectometry measure
agrees with dialectology. On closer inspection, this question covers a
number of more specific questions, each dealing with a specific
comparison to dialectology. First, and most important, is whether the
features the measure ranks most highly are the same as the features
discussed in the dialectology literature. Three other questions are whether
regions, region boundaries, and distances found by this measure agree with
dialectology. Therefore, question 1 has a four-part
answer: agreement between dialectometry and dialectology on regions,
boundaries, distances, and features.

First, however, these terms from dialectology must be defined
precisely. Then the methods used to compare the dialectometry results
with dialectology can be developed.
\subsection{Definition of Dialectology Terms}
Definition of terms from dialectology is appropriate here, along with
an explanation of how they fit together. The basic unit in
dialectology is the feature, such as ``pronunciation of the word
`cow' '' or ``adjective placement in noun phrases''. During analysis,
the linguist may suspect that a certain variant of a feature is
characteristic of a particular region, but more information, usually
from a survey, is needed to make certain.

Given a survey or other source of geographical mapping information, a
boundary for a feature can be drawn. This boundary is called an
isogloss. In simple cases, isoglosses are easy to
determine, giving a clear line between two dialects. On the other
hand, complicated cases lead to more complicated geometry; for
example, a few occurrences of a feature variant
can be stranded in the middle of the other variant.

If a number of isoglosses coincide, they form an isogloss bundle,
which separates one region from another. Isogloss bundles are simple
in theory, but in practice they are difficult to find because
isoglosses rarely coincide perfectly. In practice, undisputed isogloss
bundles only occur between well-known dialects, such as the
boundaries between Low and High German or Northern and Southern
English of England. In cases where more precision is required, there
is not usually a sufficient number of coincident isoglosses. Even
though there may be plenty of isoglosses in the area, isoglosses so
rarely coincide that only a few may be construed as forming an
isogloss bundle.

Dialectology does not have a clear equivalent to dialectometry's
distance. The closest analog is size of isogloss bundle; dialect maps
typically indicate size of isogloss bundle by thickness of boundary
line. Additionally, regions that have many specific features known in the
dialectology literature can be inferred to be distant from the rest.
\subsection{Features}
The first aspect of dialectology to compare is the feature. To match the
features of dialectology to the features that a statistical
dialectometric method uses to produce a distance, I first need to find
discussion of Swedish dialect features in the dialectology
literature. For example, \namecite{rosenkvist07} discusses the South
Swedish apparent cleft. Here, the sentence contains an embedded clause
with similar surface appearance to a true cleft. Unlike a true cleft,
however, there is no clefted constituent in the matrix clause. The
apparent cleft appears in southern Sweden, but its precise
distribution is not known; Rosenkvist finds some uses everywhere
except Norrland (northern Sweden), but finds heaviest use in the
former Danish provinces in the south.
%% This stuff makes no sense!
% This feature is best analyzed as a
% single feature; Rosenkvist mentions that it occurs in southern and
% central Swedish and not northern Swedish, but does not give a more
% precise location than ``Svealand and G\"otaland''. So we will compare
% this feature directly to features found in any part of the corpus.

Next, the feature should be expressed formally. This
formal description can then be translated to the feature representation
used by dialectometry. Ideally, the two would be the same, but
the dialectology study may not be complete; for example, Rosenkvist's
2007 paper does not yet include a syntactic analysis. And even in
cases where the dialectology gives a formal description, the syntactic
features of dialectometry, for example those described in the next
chapter, are based on more primitive formalisms at present. Therefore
the translation may lose information. Once translated, the features
discussed by dialectologists should appear in the high-ranked features
on which the statistical dialectometry method bases its distance.

In the apparent cleft example, the construction is realized as an
additional use of the word {\it som}, ordinarily a
complementizer. Typically, the next step is to identify the minimalist
structure for this, but Rosenkvist's 2007 paper does not yet provide
this analysis. Although there is no structure to translate to a
phrase-structure skeleton, his analysis provides enough clues to
produce some features directly. Part-of-speech n-grams are easiest; he
mentions that his corpus search used the strings {\it det \"ar som}
(``It is that'') and {\it det \"ar bara som} (``It's just that''). These
words only need part-of-speech annotation to be n-gram
features. Leaf-head paths can also use these parts of speech for the
local dependencies between {\it det,\"ar}, and {\it som}. Rosenkvist
also mentions some syntactic properties of apparent clefts that are
useful for specifying leaf-head path features: the subject of the {\it
som}-clause must be a pronoun, so we should expect to see leaf-head
paths of the form {\it ROOT-som-PRON} in the regions that have the
apparent cleft.
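
The step from Rosenkvist's diagnostic strings to n-gram features can be
sketched in a few lines of code. The part-of-speech tags below are
illustrative assumptions, not the tag set actually used in the corpus:

```python
# Sketch: deriving part-of-speech trigram features from the strings
# Rosenkvist used to search for apparent clefts. The tags (PRON, VB,
# KOMP) are hypothetical labels, not the corpus's actual tag set.

def pos_trigrams(tagged_sentence):
    """Return all part-of-speech trigrams of one tagged sentence."""
    tags = [tag for _, tag in tagged_sentence]
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]

# "det är som han ..." with assumed tags
cleft = [("det", "PRON"), ("är", "VB"), ("som", "KOMP"), ("han", "PRON")]
print(pos_trigrams(cleft))
# [('PRON', 'VB', 'KOMP'), ('VB', 'KOMP', 'PRON')]
```

Counting such trigrams per interview site would let the apparent cleft
surface as a highly ranked feature in the south, if Rosenkvist's
distribution holds.
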
Once dialectometric features have been specified from some linguistic
analysis, the comparison consists of the following questions: in what
regions do these features appear? Do these regions match the expected
distribution (if any) from the linguistic analysis? How much do
the features contribute to distance from other regions? If there are
other features that contribute more, what are they?
\subsection{Isogloss Boundaries}
After individual features, the next point of comparison is the
isogloss boundary. Isogloss boundaries are intermediate in complexity
between features unspecified
for location and regions demarcated by isogloss bundles. For the
purposes of this dissertation, however, there is not much difference
between a feature with some documented locations and an isogloss
boundary. An isogloss makes the regions of interest clearer, but it is
a difference in degree and not in quality. The real difference in
analysis occurs when dialectology has identified an isogloss bundle.
\subsection{Isogloss Bundles}
Beyond single boundaries, the strongest dialectological evidence comes
from isogloss bundles, which compare straightforwardly to dialectometry, once
regions have been identified from the dialectometric distances between
sites. There are two primary methods: hierarchical clustering and
multi-dimensional scaling. Neither method is perfect; as with isogloss
bundles, some human input is still needed to determine whether an
inter-region boundary truly exists at some point.

Hierarchical clustering produces well-delineated regions by
recursively merging sites into regions, at the cost of some
uncertainty---the results tend to vary quite a bit from feature set to
feature set. Only clusters that persist between results from multiple
feature sets should be considered valid. Consensus trees aggregate
multiple cluster dendrograms into a stable tree; see figure
\ref{consensus-example-small} for an example. However, because of
the recursive, nested nature of the grouping, there can still be a
question of which level of nesting is appropriate to treat as
a region.

In contrast, multi-dimensional scaling (MDS) is a mathematical
transformation of the high-dimensional space created by measuring
distances between all sites in the corpus; see figure
\ref{mds-example-small} for an example and section \ref{mds} for a
complete discussion. Although MDS does not produce
spurious information, its results are often hard to analyze because it
produces boundaries of varying strength. Very different regions stand
out, but similar regions appear similar even if they contain some
differences. This similarity can make it difficult to decide whether
an area should be considered one region or two.
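
The clustering step can be illustrated with a toy single-linkage
implementation. The sites and distances below are invented, and a real
analysis would use standard clustering software rather than this
sketch:

```python
# Toy single-linkage agglomerative clustering over a symmetric
# site-distance matrix, showing how regions emerge from
# dialectometric distances. Sites and distances are invented.

def single_linkage(sites, dist):
    """Merge the closest pair of clusters until one remains,
    recording each merge (a dendrogram in list form)."""
    clusters = [frozenset([s]) for s in sites]
    merges = []
    while len(clusters) > 1:
        # Find the cluster pair with the smallest minimum
        # inter-site distance (the single-linkage criterion).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

sites = ["Lund", "Malmö", "Umeå"]
dist = {
    "Lund":  {"Lund": 0.0, "Malmö": 0.1, "Umeå": 0.8},
    "Malmö": {"Lund": 0.1, "Malmö": 0.0, "Umeå": 0.7},
    "Umeå":  {"Lund": 0.8, "Malmö": 0.7, "Umeå": 0.0},
}
for a, b, d in single_linkage(sites, dist):
    print(a, "+", b, "at", d)
```

The two southern sites merge first at a small distance, and the
northern site joins only at a much larger one; deciding whether that
late merge constitutes a region boundary is exactly the human judgment
discussed above.
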
\begin{figure}
\includegraphics[scale=0.4]{Sverigekarta-Landskap-consensus-5-1000}
\caption{Swedia, Consensus Tree Map}
\label{consensus-example-small}
\end{figure}
\begin{figure}
\includegraphics[scale=0.4]{Sverigekarta-mds-1-1000-js-trigram-ratio}
\caption{Swedia, Multi-Dimensional Scaling of Trigrams measured by
Jensen-Shannon divergence}
\label{mds-example-small}
\end{figure}
Once both dialectologic and dialectometric regions have been
identified, comparison is straightforward. Each region can be checked
for overlap---regions with a greater overlap area are better matches.
\subsection{Distances}
Although comparing distances from dialectometry to qualitative
research in dialectology is possible, it is not very precise, because
the dialectometric distances must first be translated to something
like the isogloss bundles of dialectology. Composite cluster maps
provide this translation by drawing dark boundaries when large
distances separate regions; see figure
\ref{composite-example-small} and discussion in section
\ref{methods-composite-clustering}. Alternatively, statements like ``in
general, Southern Swedish is syntactically identical to Standard
Swedish'' \cite{rosenkvist07} can be construed as saying, roughly,
that there is very little distance between Southern and Standard
Swedish. Ultimately, though, the distances from a quantitative
analysis do not have a clear analogue in qualitative analyses.
\begin{figure}
\includegraphics[scale=0.4]{Sverigekarta-cluster-1-full}
\caption{Swedia, Composite Cluster Map}
\label{composite-example-small}
\end{figure}
\section{Question 2 : Variations on the Measure}
The second question of this dissertation reflects the fact that the
distance measures in dialectometry have two
parts. The first part is the function used to extract features from a
corpus and the second is the distance measure that produces a distance
between the features of two corpora. This dissertation investigates a
number of implementations for both functions. The question is which
combination provides the best performance, as measured by agreement
with dialectology.

Specification of feature sets is not difficult; feature sets are
easier to create than distance measure algorithms, as discussion of
distance measures below will show. In addition, feature sets are
easier to combine and to tweak. The real problem is not in
specification of feature sets, but that new feature sets must be
evaluated, since it is not currently possible to produce features
based on a linguistic theory as with phonology's distinctive features.
For example, in previous work, I showed that leaf-ancestor paths have
a small advantage over trigrams \cite{sanders07} in terms of finding
significant distances. Therefore, Question 2 breaks into two smaller
questions: (1) how can new variations be proposed? and (2) how can
they be evaluated? However, before these questions are explored, a
definition of terms related to distance measures is in order.
\subsection{Definition of dialectometry terms}
There are several terms related to distance in mathematics. In order
from least restrictive to most restrictive, they are `divergence',
`dissimilarity' and `distance'. In this dissertation, a `measure' is
used to refer to any of these three functions. All three kinds of
functions return nonnegative numbers, and return 0 only for
corpora that are equal. A symmetric function returns the same number
whether measuring from point X to point Y or from point Y to point
X. The triangle inequality means that distance from point X to point Y
plus point Y to point Z is at least as long as traveling straight from
point X to point Z. In other words, the two-leg path is never
shorter than the single-leg path. Equations
\ref{distance-properties-positive}-\ref{distance-properties-triangle}
list the properties formally.
\begin{equation}
d(x,y) \ge 0
\label{distance-properties-positive}
\end{equation}
\begin{equation}
d(x,y) = 0 \textrm{ iff } x=y
\label{distance-properties-eqq}
\end{equation}
\begin{equation}
d(x,y) = d(y,x)
\label{distance-properties-symmetric}
\end{equation}
\begin{equation}
d(x,y) + d(y,z) \ge d(x,z)
\label{distance-properties-triangle}
\end{equation}

A divergence satisfies equations \ref{distance-properties-positive}
and \ref{distance-properties-eqq}: it is always positive and only zero
when two sites are equal. It is less restrictive than the other two
kinds of measures, and is the only one that can capture the common
dialect situation where speakers of dialect X can understand speakers
of dialect Y better than speakers of Y understand those of
X. Unfortunately, the methods used for aggregate comparison in
dialectology, such as hierarchical dendrograms and multi-dimensional
scaling (MDS), require more restrictive measures. Specifically,
dissimilarities are divergences that are in addition
symmetric. Dissimilarities can be used in hierarchical dendrograms and
MDS because multiple dissimilarity comparisons can be mapped into
distance space by situating each pair of sites in its own orthogonal
dimension. This high dimensionality avoids violating the triangle
inequality. Finally, distances are dissimilarities that additionally
satisfy the triangle inequality without special consideration, so
multiple pairwise comparisons can inhabit the same
dimensions. However, this is not necessary for the analyses in this
dissertation.

Therefore, the measures described in the rest of the dissertation will
be dissimilarities, but not necessarily distances. In the rest of the
dissertation, `distance' will usually be used as a generic term to
refer to a dissimilarity; exceptions where the term `distance' implies
all three properties will be noted. In addition, some of the
dissimilarities have common names that contain other terms. For
example, Kullback-Leibler divergence is augmented here to behave as a
dissimilarity, but it retains its original name when mentioned.
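
As an illustration of such augmentation, Kullback-Leibler divergence
can be made symmetric, and hence a dissimilarity, by averaging the two
directions. Whether this is the exact variant used later is not
settled here; the sketch below only shows the general idea, with
smoothing of unseen features omitted:

```python
import math

# Hedged sketch: symmetrizing Kullback-Leibler divergence by averaging
# both directions. Jensen-Shannon divergence is another standard
# choice; which augmentation the experiments use is an assumption.

def kl(p, q):
    """KL(p || q) over feature-relative-frequency dicts. Assumes every
    feature of p also occurs in q (smoothing omitted for brevity)."""
    return sum(p[f] * math.log(p[f] / q[f]) for f in p if p[f] > 0)

def sym_kl(p, q):
    """Symmetrized KL: the same number in both directions."""
    return 0.5 * (kl(p, q) + kl(q, p))

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
print(sym_kl(p, q))  # same value as sym_kl(q, p)
```

The symmetrized function is nonnegative, zero only for equal
distributions, and symmetric, so it meets the dissimilarity
requirements above without necessarily satisfying the triangle
inequality.
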
\subsection{Feature Sets}
New feature sets are easy to propose. All that is needed is some way to
condense or divide the information about the sentence into symbols
that can be used as input to a statistical distance
measure. Specifically, the feature sets used in this dissertation use
per-word information, word-order information, and syntactic
information. They attach some information from the constituent tree or
dependency graph to each word, dividing the information according to
the word's position in the sentence. Trigrams attach the leaves to
each word, along with the leaves to the left and right.
Leaf-ancestor paths attach vertical slices of the tree to each
word. Leaf-head paths attach the path to the root to each word.

Feature sets that use other information might also be useful;
convolution kernels give a single number that captures the difference
between two trees \cite{collins01}; a similar feature that captures
aspects of a single tree such as depth, branching degree or
homogeneity might be useful. Besides this, there are numerous simple
features used in other computational linguistic work that attempt to
capture the most important characteristics of a sentence in a simple,
ad-hoc way, such as the first or last $n$ words of a sentence, a
certain number of words surrounding the predicate, or sentence length.

Even before looking at results, it seems that each of these has its
own advantages and disadvantages. Leaf-ancestor paths capture upper
structure of the constituent parse, but no left and right
context. Leaf-head paths capture some of the sentence structure, but
some of the surrounding context as well. Trigrams capture only
immediate left-right context, but include word order information. They
are also less influenced by annotator error since they require only
part-of-speech annotation.
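
The contrast between these feature sets is easiest to see in code.
Below is a sketch of leaf-ancestor path extraction for a toy
constituent tree; the tree, its labels, and the path format are
invented for illustration and may differ from those used in the
experiments:

```python
# Sketch of leaf-ancestor path extraction from a toy constituent
# tree: each word receives the chain of node labels from the root
# down to its preterminal. Tree structure and labels are assumed.

def leaf_ancestor_paths(tree, above=()):
    """tree = (label, children) for internal nodes, a string for a
    leaf. Yields (word, path-of-labels-from-root) pairs."""
    label, children = tree
    path = above + (label,)
    for child in children:
        if isinstance(child, str):          # a word
            yield child, path
        else:                               # a subtree
            yield from leaf_ancestor_paths(child, path)

# "det är som ..." as a toy tree (structure assumed, not attested)
tree = ("S", [("NP", [("PRON", ["det"])]),
              ("VP", [("VB", ["är"]), ("SBAR", [("KOMP", ["som"])])])])
for word, path in leaf_ancestor_paths(tree):
    print(word, "->", "-".join(path))
```

Each word's feature here records the upper structure of the parse,
while a trigram feature for the same word would instead record its
immediate left and right neighbors.
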
Because evaluation of feature set performance is necessarily evaluation
of the overall combination of feature set and measure, the previously
discussed measures of agreement with dialectology should all be used
as measures of performance. With the distance measure held constant,
the different feature sets can be evaluated against one another.

Before comparison, though, the distances produced for a given
combination of feature set and measure must be checked for
significance. For example, a very sensitive combination could be
inappropriate for small data sets if it can only achieve significance
with large data sets. The significance test ensures that subsequent
evaluation is valid.
\subsection{Distance Measures}
Of the measures considered in this dissertation, $R$ and $R^2$ have
been tested in previous work. $R$ is quite simple; it is a sum of
differences of features. It treats features as opaque symbols; it is
not necessarily limited to syntax. Perhaps because of its simplicity, $R$
performs more consistently than other measures tested in this
dissertation: It gives significant results across a larger variety of
feature sets than more complicated measures do.
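
Under this description, $R$ can be read as a sum of per-feature
differences between two sites' relative feature frequencies. The
sketch below is one plausible reading under that assumption; the
actual normalization and the accompanying significance test may
differ:

```python
# Minimal sketch of an R-style measure: a sum of per-feature
# differences of relative frequencies between two sites. The exact
# normalization of the measure used in the experiments is assumed.

def relative(counts):
    """Convert raw feature counts to relative frequencies."""
    total = sum(counts.values())
    return {f: c / total for f, c in counts.items()}

def r_distance(counts_a, counts_b):
    """Sum over all features of |freq_a - freq_b|. Features are
    opaque symbols, so nothing here is syntax-specific."""
    fa, fb = relative(counts_a), relative(counts_b)
    features = set(fa) | set(fb)
    return sum(abs(fa.get(f, 0.0) - fb.get(f, 0.0)) for f in features)

a = {("PRON", "VB", "KOMP"): 6, ("VB", "KOMP", "PRON"): 4}
b = {("PRON", "VB", "KOMP"): 1, ("VB", "KOMP", "PRON"): 9}
print(r_distance(a, b))  # close to 1.0
```

Because the features are treated as atomic symbols, swapping in a
different feature set requires no change to the measure itself, which
is exactly the modularity the two-function model provides.
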
% The question of measure is more important than feature set because
% measures are harder to construct than feature sets. The measure also
% has a greater effect on the results, and the relation between measure
% and the quality of its results is not as obvious as the same relation
% between feature set and quality of results.
There are two obvious directions to explore when creating a distance
measure to replace $R$. The first direction is to address $R$'s
simplicity by defining a more complex measure that uses sophisticated
ways to measure difference over still-opaque symbolic features. The
second direction is to address $R$'s ignorance of syntax by defining a
measure with specific knowledge of syntax. Finding candidates for the
first direction is easier, given the number of statistical measures
commonly used in computational linguistics. Additionally, the
dialectometric model that divides a measure into distance measure and
feature set is powerful enough that most syntax-specific knowledge can
be represented in terms of features instead of integrated into the
distance measure's algorithm.

Indeed, this makes syntax-aware measures difficult to
specify---they must incorporate knowledge of syntax in a way that
cannot be reified as features. Unlike dialect surveys of phonology,
dialect interviews do not consist of aligned lists of sentences. That
means that pairwise sentence-to-sentence comparison are impossible;
comparison must occur at a lower level. This constraint makes it
difficult to encode any useful awareness of syntax into a syntax-aware
distance measure that cannot be easily represented in the feature set
for a syntax-ignorant measure instead.

It is so difficult to define a useful syntax-aware distance measure
that none are presented in this dissertation. Syntax awareness is
restricted to the feature sets. However, a number of more complicated
statistical measures similar to $R$ are presented. Evaluation of the
distance measures proceeds similarly to evaluation of feature sets;
results for various measures are compared, holding the feature set
constant. The results are checked for significance, then for agreement
with dialectology.
% The hidden structure available for syntax is the parse---whether this
% is a constituent parse, dependency parse or some variant of shallow
% parse. Working by analogy from phonology, segments have hidden
% structure in the form of distinctive features. Segment order in
% phonology corresponds to word order in syntax. Unfortunately, as just
% mentioned, the analogy does not extend to corpus order. However, there
% is one piece of information that is not available to phonology (at
% least pre-autosegmental phonology): the upper structure is
% connected. For the current set of experiments, the upper structure is
% processed, divided and assigned as features attached to an individual
% word in the form of leaf-ancestor paths or dependency paths. This
% approach allows the features to be given to $R$ and treated as if they
% are independent, which is of course not true.
\section{Question 3 : Agreement with Phonological Dialectometry}
Finally, agreement with phonological dialectometry is a useful
indicator of quality. Agreement with phonology is evidence for a good
feature set, but disagreement does not demonstrate a bad one. Phonological
boundaries need not agree with syntactic boundaries, but it seems {\it
a priori} likely that they do. Note that agreement with phonology
has the reverse implication of statistical
significance---a test for significance can only indicate a bad
feature set, not prove a good feature set.

There is very little phonological dialectometry for Swedish, so this
comparison may not be valid yet. The only published paper, to my
knowledge, is \namecite{leinonen08}. Leinonen has extended this work
to a dissertation, which is currently unpublished.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "dissertation.tex"
%%% End: