GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.
Spring 2017: Introduction to Natural Language Processing
[TOC]
- applications : translation, writing assistance, information retrieval (Google), information classification (emails), vocal interfaces
- language
- natural : implicit, ambiguous
- formal : explicit, non-ambiguous
- communication : conciseness (D.R.Y.), shared knowledge, representation (unlimited expressive power)
- real-world NLP :
- task specification essential
- usefulness/drawbacks ratio (not always the best solution)
- constraints
- real-time communication : 180 words/min
- knowledge representation formalism : huge amount of data
- fast processing : polynomial algorithms
- corresponding sub-language : sufficient linguistic resources
- grammar : expressive power vs real time, production rule $A\to\alpha$
- regular, LR(k) : $O(n)$ , identity, constant, symbol
- context-free : $O(n^3)$ , rule one-to-many/one
- tree adjoining : $O(n^6)$ , rewriting tree rather than symbol
- difficulties
- lack of linguistic competence : hard to find, to produce, semantic level less developed
- power laws : Zipf distribution, illusion of a not-too-bad system
- curse of dimensionality : high dimension, sparseness
- subjectivity : inter-annotator (dis)agreement
- multi-scale : ambiguity
- corpus-based linguistics : reproduce, for a given task, the corresponding linguistic behaviour with models semi-automatically trained from large amounts of textual data representative of the considered task; does not try to explain, performance oriented
- linguistic processing levels
- morpho-lexical : recognize words, morphological rules, electronic lexica
- cases : non-alphabetical scripts (Chinese), languages without separators (Thai), ambiguous separators (U.N.O.), out-of-vocabulary forms (.bat files)
- morphology : structure of words
- lexicography : inventory/classification of accepted forms
- paradigmatic : dimension of language, word choice, concepts
- syntagmatic : structure
- syntactic : structure of word sequences, reduce ambiguities from lower levels (lexical/phonetic), formal grammars
- cases : solve ambiguities, help extraction/description/use of semantic/pragmatic facts
- syntax : constraints (selectional/agreement, positional) to be verified for a word sequence to be considered syntactically correct in a given language
- semantic : understand the meaning of word sequences, out of any context, literal, formal models of knowledge
- meaning space : numerical representation, distance
- knowledge representation : formal representation, symbolic operators
- pragmatic : contextualize the literal meaning, within the elocution context
- NLP architecture
- lexicon
- grammar
- level 1 representation
- semantic information : dictionary
- semantic knowledge
- level 2 representation
- interdependencies between processing levels
- spelling error correction : syntax, semantics, pragmatics
- syntactic-semantic
- syntactic-pragmatic
- lexicon : list of references with surface form, part-of-speech, lemma, probability, pronunciation
- operations
- insertion/deletion
- existence
- extraction : all plural nouns ending in
- list whole content
- field representation
- internal structure : efficient access method (by value to many references, by reference to value)
- external structure
- surface forms
- lists/tables : ordered, by reference $O(1)$ , access $O(\log n)$ , insertion $O(n)$ , large size
- hash tables : no by-reference access, extra inversion table, large size
- tries : require end-of-word markers, require adjunction of reference labels for access, code is the list of numbers of sons, access $O(1)$ , insertion $O(1)$ , good for dynamic lexica (see the sketch after this list)
- finite-state automata : $(Q,\Sigma,\delta:Q\times\Sigma\mapsto Q,q_0\in Q,F\subset Q)$ , access $O(1)$ , optimal size, hard implementation and insertion
- DFSA equivalent to NFSA, equivalent to NFSA without $\varepsilon$ , equivalent to regular expressions, existence of a unique minimal DFSA
- regular expression
- concatenation : $L_1L_2$
- repetition : $L_1^i$ with $L_1^0=\varepsilon$
- infinite : $\cup_{i=0}^\infty L_1^i$
- regexp : neither empty nor $\varepsilon$ , for $a\in\Sigma$ we have $\{a\}$ , or $r|s$ , and $rs$
- several references : represent the same string several times (same surface form but different record)
- association of a reference : a priori order on $\Sigma$ , reference = rank of the string in alphabetical order
- by-reference access
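A minimal sketch of the trie option above: a lexicon mapping each surface form to an integer reference (the entries and references are made up for the example):

```python
# Minimal trie-based lexicon sketch: surface form -> integer reference.
# Entries and references are illustrative, not from the course material.

class TrieNode:
    def __init__(self):
        self.children = {}     # symbol -> TrieNode
        self.reference = None  # set only at end of a word

class TrieLexicon:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, reference):
        node = self.root
        for symbol in word:
            node = node.children.setdefault(symbol, TrieNode())
        node.reference = reference  # end-of-word marker carries the reference

    def lookup(self, word):
        node = self.root
        for symbol in word:
            if symbol not in node.children:
                return None
            node = node.children[symbol]
        return node.reference

lexicon = TrieLexicon()
for ref, form in enumerate(["cat", "cats", "can"]):
    lexicon.insert(form, ref)
print(lexicon.lookup("cats"))  # 1
print(lexicon.lookup("car"))   # None (out of vocabulary)
```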
- summary
- morphology : internal structure and variability of words in a language, use a priori knowledge to decompose
- cases : verbs conjugation, plurals, nominalization
- inflectional morphology : preserve grammatical category (give, given, gave)
- derivational morphology : change in category (process, processing, processor)
- morpheme : roots (stems) and affixes
- affixes : prefixes, suffixes, infixes, circumfixes
- concatenative morphology : only prefixes and suffixes
- pattern-based morphology : root with vocalic structure (insertions, LMD -> LAMAD)
- flexional morphology : plurals, conjugation with exception
- computational morphology
- analysis : find canonical representation corresponding to surface form
- generation : produce the surface form described by a canonical representation
- transducers
- string association : $\forall i,|X_i|=|X_i'|$ , pairs $(X_n,X_n')$
- different lengths : require the introduction of $\varepsilon$ , $(ab,ABC)\approx (\varepsilon ab,ABC)$ , ${n\choose m}$ possibilities, need a convention
- definition : $\Sigma=((\Sigma_1\cup\{\varepsilon\})\times(\Sigma_2\cup\{\varepsilon\}))\setminus \{(\varepsilon,\varepsilon)\}$
- operations
- association checking : $(abba,baaa)\in\Sigma^*$ , as FSA
- generation : $string_1\to string_2$
- analysis : $string_2\to string_1$
- transduction : walk through the FSA following one or the other element of each pair
- non-deterministic evaluation : backtracking on wrong solutions (not constant time)
- deterministic : sequential transducer, not possible in general, especially when the language is not bounded
- usage : text-to-speech (grapheme to phoneme), specific lexicon representation (composition of some access and inverse functions), filters (remove/add marks), text segmentation (a toy transducer walk is sketched below)
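A toy sketch of a non-deterministic transduction walk with backtracking; the transitions (a made-up "ph"-to-"f" grapheme-to-phoneme fragment) are labelled with (input, output) symbol pairs, and the empty string stands for $\varepsilon$ :

```python
# Toy non-deterministic transducer: transitions labelled (input, output, next state).
# The tiny grapheme-to-phoneme table is invented for illustration only.
TRANSITIONS = {
    0: [("p", "", 1), ("p", "p", 2)],   # non-deterministic on "p"
    1: [("h", "f", 2)],                 # "ph" -> "f"
    2: [("a", "a", 2), ("t", "t", 2)],
}
FINAL = {2}

def generate(state, remaining, output):
    """Return all outputs reachable from `state` after consuming `remaining`."""
    results = []
    if not remaining and state in FINAL:
        results.append(output)
    for inp, out, nxt in TRANSITIONS.get(state, []):
        if inp == "":                      # epsilon input: consume nothing
            results += generate(nxt, remaining, output + out)
        elif remaining.startswith(inp):    # consume the input symbol
            results += generate(nxt, remaining[len(inp):], output + out)
    return results

print(generate(0, "phat", ""))  # ['fat']
print(generate(0, "pat", ""))   # ['pat']
```

Dead-end branches simply return an empty list, which plays the role of the backtracking on wrong solutions mentioned above.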
- out of vocabulary : $10\%$ in newspapers
- spelling errors : modelled by transformations (rewriting rules, sometimes weighted), edit space
- transposition : distortion, $XY\to YX$
- tripling : distortion, $XX\to XXX$
- name deviation : $ize:INF\to ization:N$
- neologisms : Internetization
- borrowings : rendez-vous, identification of the source language, computation of a frequency matrix ( $ngram\times language$ ), language likelihood product (see the sketch after this list)
- $n$ -grams : $(n-1)$ -Markov assumption, $P(x_1\cdots x_N)=P(x_1\cdots x_{n-1})\prod_{i=n}^N P(x_i\mid x_{i-n+1}\cdots x_{i-1})$
- forms difficult to exhaustively lexicalize : abbreviations, proper names
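A sketch of source-language identification for borrowings using character bigram frequencies; the training words are toy data, and raw relative frequencies replace the conditional probabilities of the $n$ -gram formula for brevity:

```python
from collections import Counter

def bigrams(word):
    padded = "#" + word + "#"          # word-boundary padding
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def train(words):
    """Relative frequencies of character bigrams (one row of the frequency matrix)."""
    counts = Counter(bg for w in words for bg in bigrams(w))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def likelihood(word, model, floor=1e-6):
    """Likelihood product over the word's bigrams, with a floor for unseen bigrams."""
    p = 1.0
    for bg in bigrams(word):
        p *= model.get(bg, floor)
    return p

# toy training data, purely illustrative
english = train(["the", "meeting", "point", "house"])
french = train(["le", "rendez", "vous", "point"])

word = "rendez"
print("en:", likelihood(word, english), "fr:", likelihood(word, french))
```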
- spelling error correction
- statistical : probability metric
- lexicon : edit distance metric
- Levenshtein distance : minimal number of transformations needed to change one string into the other (sketch after this list)
- spelling error FSA correction : approximation, within a maximum distance range, fault-tolerant recognition for regular languages, find all accepting paths such that the corresponding string is within a distance lower than the threshold from the given input string
- pruning criteria : cut-off edit distance,
- FSA walk within distance
- prefix-compatible depth-first version
- weighted edit distance : diacritics (é), uppercase, specific transformation (keyboard shift), whitespace
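A standard dynamic-programming sketch of the Levenshtein distance (a weighted variant would replace the unit costs by transformation-specific weights):

```python
def levenshtein(a, b):
    """Minimal number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```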
- morpho-lexical level : ambiguity resolution, reduction of the vocabulary size, removal of lexical variability that is not useful
- PoS tagging : right tag according to the context among (word, tag) pairs
- lemmatization : reduce word to canonical form within context
- stemming : without PoS
- rule based : Brill algorithm
- probabilistic : HMM tagging, conditional random fields (CRF)
- Brill's tagger : error-driven transformation-based (rules) tagger, supervised learning
- learning : once, slow, iteratively compute error of each candidate rule, select best rule, add to rule set and apply, repeat until threshold
- application : quick, simple, initialize known words with their most probable tag, unknown words with either Proper noun (if capitalized), Noun, or learned guessing rules, then apply all rules (toy application sketch after the rules below)
- rules
- lexical : $word\to tag$ if condition (e.g. current word has a given suffix, removing a prefix gives a known word, contains character $x$ )
- contextual : $tag_1\to tag_2$ if condition (e.g. the following tag at distance $2$ is $z$ , the current word is preceded by tag $Z$ )
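A minimal sketch of the application phase of a Brill-style tagger with two invented contextual rules (the lexicon of most probable tags and the rules would normally come from the learning phase):

```python
# Toy Brill-style application phase: initial tagging, then ordered contextual rules.
# The lexicon and the two rules below are invented for illustration only.
MOST_PROBABLE_TAG = {"time": "N", "flies": "N", "like": "V", "an": "Det", "arrow": "N"}

# each rule: (from_tag, to_tag, condition on the neighbouring tags)
RULES = [
    ("N", "V", lambda tags, i: i > 0 and tags[i - 1] == "N"),                    # N -> V after N
    ("V", "Prep", lambda tags, i: i + 1 < len(tags) and tags[i + 1] == "Det"),   # V -> Prep before Det
]

def tag(words):
    # initialisation: most probable tag, capitalised unknown words as proper nouns
    tags = [MOST_PROBABLE_TAG.get(w, "NP" if w[0].isupper() else "N") for w in words]
    for old, new, cond in RULES:  # rules applied in the order they were learned
        tags = [new if t == old and cond(tags, i) else t for i, t in enumerate(tags)]
    return list(zip(words, tags))

print(tag("time flies like an arrow".split()))
# time/N flies/V like/Prep an/Det arrow/N
```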
- hidden Markov model
- sequence of $T$ words : $O_1^T=O_1\cdots O_T$
- tags : $\xi^T_1=\xi_1\cdots\xi_T$
- tagging : find the tags of $O^T_1$ that maximize $P(\xi_1,\ldots,\xi_T\mid O_1,\ldots,O_T)$
- maximization : $\arg\max_{\xi_1^T}(P(O^T_1\mid\xi_1^T)P(\xi_1^T))$ with $P(O_1^T\mid\xi_1^T)=P(O_1\mid\xi^T_1)\cdots P(O_T\mid O^{T-1}_1,\xi_1^T)$ and $P(\xi_1^T)=P(\xi_1)P(\xi_2\mid\xi_1)\cdots P(\xi_T\mid\xi_1^{T-1})$
- hypotheses
- limited lexical conditioning : $P(O_i\mid O_1,\ldots,O_{i-1},\xi_1,\ldots,\xi_i,\ldots,\xi_T)=P(O_i\mid\xi_i)$
- limited scope for syntactic dependencies : $k$ neighbors, $P(\xi_i\mid\xi_1,\ldots,\xi_{i-1})=P(\xi_i\mid \xi_{i-k},\ldots,\xi_{i-1})$
- $k$ -order model : $P(O^T_1\mid\xi_1^T)P(\xi_1^T)=P(O_1^k\mid\xi_1^k)P(\xi_1^k)\prod_{i=k+1}^{T}(P(O_i\mid\xi_i)P(\xi_i\mid\xi_{i-k}^{i-1}))$ with $P(O_1^T\mid\xi_1^T)=P(O_1\mid\xi_1)\cdots P(O_T\mid \xi_T)$ and $P(\xi_1^T)=P(\xi_1^k)P(\xi_{k+1}\mid\xi_1,\ldots,\xi_k)\cdots P(\xi_T\mid\xi_{T-k},\ldots,\xi_{T-1})$
- formally
- set of states : $C=\{C_1,\ldots,C_n\}$
- transition matrix : $A$ with $a_{ij}=P(\xi_{t+1}=C_j\mid\xi_t = C_i)$ , shortened $P(C_j\mid C_i)$
- initial probabilities vector : $I$ with $I_i=P_I(\xi_1=C_i)$ , shortened $P_I(C_i)$
- alphabet : $\Sigma$ (not necessarily finite)
- $n$ emission probabilities on $\Sigma$ : $B_i(w)=P(O_t=w\mid\xi_t=C_i)$ , shortened $P(w\mid C_i)$
- Viterbi : linear, computes the $\xi_1^T$ maximizing $P(\xi_1^T\mid O_1^T)$
- Baum-Welch : iterative algorithm for estimating the parameters from unsupervised data (words only), $P(w\mid C_i)$ , $P(C_j\mid C_{i_1}^{i_k})$ , $P_I(C_{i_1}\cdots C_{i_k})$
- parameter estimation : supervised has missing data, unsupervised is highly sensitive to initial conditions, compromise of hybrid methods (discriminative methods, CRF, averaged perceptrons, semi-supervised)
- example : $P(\text{time/N flies/V like/Adv an/Det arrow/N}) = P_I(N)\cdot P(\text{time}\mid N)\cdot P(V\mid N)\cdot P(\text{flies}\mid V)\cdot P(Adv\mid V)\cdot P(\text{like}\mid Adv)\cdot P(Det\mid Adv)\cdot P(\text{an}\mid Det)\cdot P(N\mid Det)\cdot P(\text{arrow}\mid N)$
- problems
- given the parameters $\theta$ of the HMM, get the probability of an observation $P(O\mid\theta)$ (language identification)
- given the parameters $\theta$ of the HMM, find the most likely state sequence $\xi$ that produces $O$ : $\arg\max_\xi P(\xi\mid O,\theta)$ (PoS tagging, speech recognition)
- find the parameters that maximize the probability of producing $O$ : $\arg\max_\theta P(\theta\mid O)$ (unsupervised learning)
- remarks
- $\theta =(I,A,B)$ has $n(n+p-1)-1$ free parameters with $n=|C|$ and $p=|\Sigma|$
- supervised learning is easy : $\arg\max_\theta P(\theta\mid O,\xi)$
- forward/backward algorithm : $P(O\mid \theta)=\sum_\xi P(O,\xi\mid\theta)=\sum_\xi P(O\mid\xi,\theta)P(\xi\mid\theta)$ , complexity $O(|O|n^2)$
- Viterbi : maximize $P(\xi,O\mid\theta)$ (sketch below)
- Baum-Welch : EM for estimating the HMM parameters
- conditional random fields : discriminative generalization of HMMs where the features no longer need to be state-conditional probabilities
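A compact Viterbi sketch for the tagging maximization above; the toy tag set and the parameters $I$, $A$, $B$ are invented for illustration:

```python
import math

# toy HMM parameters (invented): states are PoS tags
STATES = ["N", "V", "Det"]
I = {"N": 0.5, "V": 0.2, "Det": 0.3}                  # initial probabilities
A = {"N": {"N": 0.3, "V": 0.5, "Det": 0.2},           # transitions P(next tag | current tag)
     "V": {"N": 0.3, "V": 0.2, "Det": 0.5},
     "Det": {"N": 0.8, "V": 0.1, "Det": 0.1}}
B = {"N": {"time": 0.4, "flies": 0.2, "arrow": 0.4},  # emissions P(word | tag)
     "V": {"flies": 0.6, "like": 0.4},
     "Det": {"an": 1.0}}

def viterbi(words):
    def emit(tag, w):
        return B[tag].get(w, 1e-8)  # small floor for unseen (tag, word) pairs
    # delta[t][s]: best log-probability of a tag sequence ending in state s at position t
    delta = [{s: math.log(I[s]) + math.log(emit(s, words[0])) for s in STATES}]
    back = []
    for w in words[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: delta[-1][p] + math.log(A[p][s]))
            scores[s] = delta[-1][best_prev] + math.log(A[best_prev][s]) + math.log(emit(s, w))
            pointers[s] = best_prev
        delta.append(scores)
        back.append(pointers)
    # backtrack from the best final state
    last = max(STATES, key=lambda s: delta[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

print(viterbi("time flies like an arrow".split()))
```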
- protocol
- define control task
- regroup large amount of typical data
- test and compare NLP systems on similar data
- publish and discuss
- difficulty for automatic classification : the relation between linked propositions may or may not be explicitly marked
- gold standard : set correct answers to task for a sample of correct inputs
- inter-annotator agreement : subtract chance agreement
- evaluation : training, validating, testing set
- confusion matrix
- accuracy = number of correctly classified items / total number of items × 100, misleading on unbalanced classes
- precision = correctly retrieved documents / total number of retrieved document = true positives / (true positives + false positives)
- recall = correctly retrieved documents / total number of relevant documents = true positives / (true positives + false negatives)
- F-score : harmonic mean of precision and recall, penalizes a large divergence between the two, $F_\beta=(1+\beta^2)\frac{precision\times recall}{(\beta^2\times precision)+ recall}$ (sketch below)
- receiver operating characteristic : ROC curve, true positive rate = recall vs false positive rate = false positives / (true negatives + false positives)
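A quick sketch of the precision/recall/F-score computations from toy true/false positive counts:

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall and F_beta from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

# toy counts read off a confusion matrix
print(precision_recall_f(tp=40, fp=10, fn=20))  # (0.8, ~0.667, ~0.727)
```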
- statistically significant evaluation : confidence intervals, confidence box, compute standard deviations, t-test $t=\frac{\mu\sqrt{T}}{s}$ (sketch below)
- mean : $\mu=\frac{1}{T}\sum_{i=1}^T\Delta_i$
- standard deviation : $s=\sqrt{\frac{1}{T-1}\sum_{i=1}^T(\Delta_i-\mu)^2}$
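And a matching sketch of the t statistic computed from per-item score differences between two systems (the differences are toy values):

```python
import math

def paired_t(differences):
    """t statistic for per-item score differences Delta_i between two systems."""
    T = len(differences)
    mu = sum(differences) / T
    s = math.sqrt(sum((d - mu) ** 2 for d in differences) / (T - 1))
    return mu * math.sqrt(T) / s

# toy per-document score differences
print(paired_t([0.02, 0.05, -0.01, 0.03, 0.04, 0.00, 0.02, 0.03]))
```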
- syntactic : analysis of sentence structure
- parsing algorithms : procedural, generic algorithms
- formal grammars : declarative, data
- symbolic grammars
- phrase-structure grammars : recursively decompose sentences into constituents, words are atomic parts, good for ordered languages, express structural dependencies, not adapted to free-order languages
- dependency grammars : focus on words (rather than constituents) and their functional roles, lexically oriented, provide simpler structures
- formally
- phrase-structure grammar $G$ : finite set $C$ of non-terminals, finite set $L$ of terminal symbols, upper level symbol $S\in C$ , finite set $R$ of rewriting rules $R\subset C^+\times(C\cup L)^*$
- context-free grammar CFG : set $C$ of syntactic categories (non-terminals), set $W$ of words (terminals), element $S$ of $C$ top level category, proper subset $C_0$ of $C$ of morpho-syntactic categories (part-of-speech), set $R$ of rewriting rules (syntactic rules) $X\to X_1\cdots X_n$ where $X\in C-C_0$ and $X_1\cdots X_n\in C$ , set $L$ (lexicon) of rewriting rules (lexical rules) $X\to w$ where $X\in C_0$ and $w$ a word
- parsing
- recognizing : syntactically correct, can be derived from upper symbol in finite number of rewriting step following the rule, output yes/no
- analysing : give all possible interpretation, output syntactic trees
- derivation : any sequence of rules deriving a given sentence, by convention restricted to left-most rule application
- Cocke-Younger-Kasami CYK : always $O(n^3)$ , $O(|C|^2)$ , simple, particular grammar format required (sketch below)
- Earley parser : same complexity as CYK but adapts to less complex languages, no way to correct/reconstruct non-parsable sentences, not intuitive
- bottom-up chart parsing : $O(n^3)$ , inference, online binarization, keeps the best of Earley and CYK
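A CYK recognizer sketch for a tiny invented grammar already in the required binary format (Chomsky normal form):

```python
from itertools import product

# toy grammar in Chomsky normal form: binary syntactic rules + lexical rules
RULES = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
LEXICON = {"the": {"Det"}, "cat": {"N"}, "dog": {"N"}, "saw": {"V"}}

def cyk_recognize(words):
    n = len(words)
    # chart[i][l]: categories spanning words[i:i+l]
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][1] = set(LEXICON.get(w, set()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for left, right in product(chart[i][split], chart[i + split][length - split]):
                    chart[i][length] |= RULES.get((left, right), set())
    return "S" in chart[0][n]

print(cyk_recognize("the cat saw the dog".split()))  # True
print(cyk_recognize("saw the the cat".split()))      # False
```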
- probabilize
- recognizer : sentence probabilities, $P(w_1^n)$ , n-gram $P(w_1,\ldots, w_n)=P(w_1,\ldots, w_{N-1})\prod_{i=N}^n P(w_i\mid w_{i-N+1},\ldots, w_{i-1})$
- analyzer : parse-tree probabilities, $P(T\mid w_1^n)$ , SCFG
- stochastic context-free grammars SCFG : CFG extension, for each rule a stochastic coefficient $p(R)$ , $T=\arg\max_{W\subset T_i}P(T_i)$
- process : $\zeta$ , $P(\zeta=R_0,\ldots,R_n)=P(R_0)\prod_{i=1}^n P(R_i\mid R_1,\ldots, R_{i-1})$
- assumption : only conditioned by $left(R_i)=lmnt(R_0\circ \cdots\circ R_{i-1})$
- stochastic coefficient : $p(R_i)$
- probability : $P(T)=\prod_{i=1}^k p(R_i)$ , needs to be normalized $\hat P(T)=\frac{P(T)}{\sum_{T\in T(G)} P(T)}$ if not consistent
- sentences : $P(W)=\sum_{F(T)=W, T\in T(G)} \hat P(T)$
- implementation : argmax during the CYK analysis
- estimate probabilities (supervised MLE sketched below)
- supervised : tree-bank (annotated corpus), MLE, $p(R)=\frac{\text{nb. occurrences of }R}{\text{nb. occurrences of }R'\text{ s.t. } left(R')=left(R)}$
- unsupervised : EM, inside-outside algorithm, local minimum, sensitive to initial values
- hybrid : small tree-bank, large corpus
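A sketch of the supervised MLE estimation of the rule coefficients on a toy tree-bank, assuming trees are encoded as nested (category, children) tuples:

```python
from collections import Counter

# toy tree-bank: trees as (category, children...) tuples, leaves are words
TREEBANK = [
    ("S", ("NP", ("Det", "the"), ("N", "cat")), ("VP", ("V", "sleeps"))),
    ("S", ("NP", ("Det", "a"), ("N", "dog")), ("VP", ("V", "sleeps"))),
]

def rules(tree):
    """Yield (left-hand side, right-hand side) for every internal node of a tree."""
    cat, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield (cat, (children[0],))  # lexical rule
        return
    yield (cat, tuple(child[0] for child in children))  # syntactic rule
    for child in children:
        yield from rules(child)

counts = Counter(r for tree in TREEBANK for r in rules(tree))
lhs_totals = Counter()
for (lhs, rhs), c in counts.items():
    lhs_totals[lhs] += c

# p(R) = count(R) / count of all rules with the same left-hand side
p = {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}
for rule, prob in sorted(p.items()):
    print(rule, round(prob, 2))

# probability of the first tree as the product of its rule coefficients
tree_prob = 1.0
for r in rules(TREEBANK[0]):
    tree_prob *= p[r]
print(tree_prob)  # 0.25
```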
- compositional semantics : study of meaning of linguistic sentences, no meaning for words alone, too complex at large scale
- usual representations
- symbolic : logical formula, graph-based
- vectorial : word embedding, large scale but too simplistic
- triangle of signification : minds grasp senses, words express them, objects are referred to by them
- lexical semantics : study of the meaning of words, structured, context-sensitive
- lexeme : individual entry in the lexicon, pairing of orthographic/phonological form and meaning representation
- lexicon : finite list of lexemes, includes compound nouns and non-compositional phrases (proper names)
- semantic relations
- homonymy : same surface form but different meanings and different etymologies, two different entries (vs polysemy)
- homophony : distinct lexemes with same pronunciation
- homography : distinct lexemes with same orthographic forms
- polysemy : multiple related meanings within a single lexeme
- metaphor
- metonymy
- synonymy : same sense, map same concept, same value for all semantic features
- Leibniz substitution theory : substitution of one for the other never changes the truth
- hyponymy/hyperonymy : word whose meaning contains the entire meaning of another, known as superordinate (hypernym), transitive, asymmetric, generates a taxonomic hierarchy
- overlap : same value for some of semantic features
- meronymy/holonymy : a word is a meronym of another word (its holonym) if the is-part-of relation holds, different relationships
- genus-differentia : each word meaning is associated to a hypernym through the hyponymy/hypernymy relation, then each word meaning is uniquely differentiated from the other hyponyms of its hypernym by additional relations associating it with other word meanings
- synsets : sets of word forms that share the same sense, signify that the concept exists (they are not the concept itself)
- differential approach : WordNet, lists of word forms that distinguish their meanings (e.g. adjectives use antonymy relations); a WordNet lookup is sketched below
- constructive approach : conventional dictionaries, meaning representation contain sufficient information to accurately define a concept, often cyclic
- semantic primes : small number of generic concepts to start with
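A quick illustration of synsets, hypernyms and lemmas with NLTK's WordNet interface (assumes the nltk package and its wordnet corpus are installed; not part of the course material):

```python
# Requires: pip install nltk ; python -m nltk.downloader wordnet
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank")[:3]:              # a few senses of a polysemous form
    print(synset.name(), "-", synset.definition())
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])
    print("  lemmas   :", synset.lemma_names())    # word forms sharing this sense
```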
- DIS
- classification : $R^{N\times m}$
- similarity matrix : $d(x,y)\ge 0$ , $d(x,y)=d(y,x)$ , $d(x,y)\le d(x,z)+d(z,y)$ , dissimilarity (without the last one)
- Parzen window : weighted nearest neighbors
- dendrograms : binary cluster tree, $O(N^2\log N)$
- evaluation
- multidimensional scaling : MDS, Sammon mapping criterion $C(d,\tilde d)=\sum_{x\not = y}\frac{(d(x,y)-\tilde d(\tilde x,\tilde y))^2}{d(x,y)}$ (sketch below)
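A direct sketch of the Sammon mapping criterion, taking the original and projected pairwise distances as toy dictionaries indexed by point pairs:

```python
def sammon_stress(d, d_tilde):
    """Sammon mapping criterion between original (d) and projected (d_tilde) distances."""
    return sum((d[pair] - d_tilde[pair]) ** 2 / d[pair] for pair in d)

# toy pairwise distances for 3 points, before and after projection
d = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 1.5}
d_tilde = {(0, 1): 1.1, (0, 2): 1.7, (1, 2): 1.5}
print(sammon_stress(d, d_tilde))
```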