CS431

GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.

Spring 2017: Introduction to Natural Language Processing

[TOC]

Introduction

  • applications : translation, writing assistance, information retrieval (Google), information classification (emails), interface (vocal)
  • language
    • natural : implicit, ambiguous
    • formal : explicit, non-ambiguous
  • communication : conciseness (D.R.Y.), shared knowledge, representation (unlimited expressive power)
  • real-world NLP :
    • task specification essential
    • usefulness/drawbacks ratio (not always the best solution)
    • constraints
      • real-time communication : 180 words/min
      • knowledge representation formalism : huge amount of data
      • fast processing : polynomial algorithms
      • correspondence to a sub-language : sufficient linguistic resources
  • grammar : expressive power vs real time, production rule $A\to\alpha$
    • regular, LR(k) : $O(n)$, identity, constant, symbol
    • context-free : $O(n^3)$, rule one-to-many/one
    • tree adjoining : $O(n^6)$, rewriting tree rather than symbol
  • difficulties
    • lack of linguistic competence : hard to find, to produce, semantic level less developed
    • power laws : Zipf distribution, illusion of a not-too-bad system (see the sketch after this list)
    • curse of dimensionality : high dimension, sparseness
    • subjectivity : inter-annotator (dis)agreement
    • multi-scale : ambiguity
  • corpus-based linguistics : reproduce, for a given task, the corresponding linguistic behaviour with models semi-automatically trained from large amounts of textual data representative of the considered task; does not try to explain, performance oriented
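
A minimal sketch of the Zipf effect mentioned above; the corpus file name and the naive whitespace tokenization are illustrative assumptions. For a power-law distribution, frequency × rank stays roughly constant across the most frequent words.

```python
from collections import Counter

# Zipf's law: the frequency of the r-th most frequent word is roughly
# proportional to 1/r, so freq * rank should stay roughly constant.
text = open("corpus.txt", encoding="utf-8").read()  # any large plain-text corpus (assumed file)
tokens = text.lower().split()                        # naive whitespace tokenization
counts = Counter(tokens)

for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(f"{rank:2d}  {word:15s}  freq={freq:6d}  freq*rank={freq * rank}")
```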

Processing levels in NLP

  • linguistic processing levels
    • morpho-lexical : recognize words, morphological rules, electronic lexica
      • cases : non-alphabetical (Chinese), languages without separator (Thai), ambiguous separators (U.N.O.), out-of-vocabulary forms (.bat files)
      • morphology : structure of words
      • lexicography : inventory/classification of accepted forms
      • paradigmatic : dimension of language, word choice, concepts
      • syntagmatic : structure
    • syntactic : structure of word sequences, reduce ambiguities from lower levels (lexical/phonetic), formal grammars
      • cases : solve ambiguities, help extraction/description/use of semantic/pragmatic facts
      • syntax : constraints (selectional/agreement, positional) to be verified for a word sequence to be considered syntactically correct in a given language
    • semantic : understand meaning of word sequences, out of any context, literal, formal models of knowledge
      • meaning space : numerical representation, distance
      • knowledge representation : formal representation, symbolic operators
    • pragmatic : contextualize the literal meaning, within the elocution context
  • NLP architecture
  • lexicon
  • grammar
  • level 1 representation
  • semantic information : dictionary
  • semantic knowledge
  • level 2 representation
  • interdependencies between processing levels
    • spelling error correction : syntax, semantics, pragmatics
    • syntactic-semantic
    • syntactic-pragmatic

Electronic lexica

  • lexicon : list of references with surface form, part-of-speech, lemma, probability, pronunciation
  • operations
    • insertion/deletion
    • existence
    • extraction : all plural nouns ending in
    • list whole content
  • field representation
    • internal structure : efficient access method (by value to many references, by reference to value)
    • external structure
  • surface forms
    • lists/tables : sorted order, access by reference $O(1)$, access by value $O(\log n)$, insertion $O(n)$, large size
    • hash tables : no access by reference (requires an extra inversion table), large size
    • tries : require end-of-word markers, require adjunction of reference labels for access, code is a list of the numbers of sons, access $O(1)$, insertion $O(1)$, good for dynamic lexica (see the trie sketch after this list)
    • finite-state automata : $(Q,\Sigma,\delta:Q\times\Sigma\mapsto Q,q_0\in Q,F\subset Q)$, access $O(1)$, optimal size, hard implementation and insertion
      • DFSA equivalent to NFSA equivalent to NFSA without $\varepsilon$ equivalent to regular expression, existence of unique minimal DFSA
      • regular expression
        • concat : $L_1L_2$
        • repetition : $L_1^i$ with $L_1^0=\varepsilon$
        • infinite : $\cup_{i=0}^\infty L_1^i$
        • regexp : no empty nor $\varepsilon$, for $a\in\Sigma$ we have $\{a\}$, or $r|s$, and $rs$
      • several references : represent several times the same string (same surface form but different record)
      • association of a reference : a priori order on $\Sigma$, reference = rank of string in alphabetical order
      • by reference access
  • summary
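
A minimal trie-based lexicon sketch, assuming each surface form carries one record; the class names and record fields are illustrative, not the course's exact representation.

```python
# Trie lexicon: access and insertion proportional to the word length,
# each accepted surface form holds a reference (here a small record dict).
class TrieNode:
    def __init__(self):
        self.children = {}   # symbol -> TrieNode
        self.record = None   # reference attached to an accepted surface form

class TrieLexicon:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, record):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.record = record

    def lookup(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return None
            node = node.children[ch]
        return node.record

lexicon = TrieLexicon()
lexicon.insert("cats", {"lemma": "cat", "pos": "N", "number": "plural"})
print(lexicon.lookup("cats"))   # {'lemma': 'cat', 'pos': 'N', 'number': 'plural'}
print(lexicon.lookup("catz"))   # None (out-of-vocabulary form)
```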

Morphology - transducers

  • morphology : internal structure and variability of words in a language, use a priori knowledge to decompose
    • cases : verbs conjugation, plurals, nominalization
    • inflectional morphology : preserve grammatical category (give, given, gave)
    • derivational morphology : change in category (process, processing, processor)
    • morpheme : roots (stems) and affixes
    • affixes : prefixes, suffixes, infixes, circumfixes
    • concatenative morphology : only prefixes and suffixes
    • pattern-based morphology : root with vocalic structure (insertions, LMD -> LAMAD)
    • flexional morphology : plurals, conjugation with exception
  • computational morphology
    • analysis : find canonical representation corresponding to surface form
    • generation : produce surface form described by canonical representation
  • transducers
    • string association : $\forall i,|X_i|=|X_i'|$, pairs $(X_n,X_n')$
    • different lengths : require introduction of $\varepsilon$, $(ab,ABC)\approx (\varepsilon ab,ABC)$, ${n\choose m}$ possibilities, need a convention
    • definition : $\Sigma=(\Sigma_1\cup\{\varepsilon\})\times(\Sigma_2\cup\{\varepsilon\})\setminus\{(\varepsilon,\varepsilon)\}$
    • operations
      • association checking : $(abba,baaa)\in\Sigma^*$, as FSA
      • generation : $string_1\to string_2$
      • analysis : $string_2\to string_1$
  • transduction : walk through FSA following one or other element of the couple
    • non-deterministic evaluation : backtracking on wrong solutions (not constant time)
    • deterministic : sequential transducer, not always possible, especially when the language is not bounded
    • usage : text-to-speech (grapheme to phoneme), specific lexicon representation (composition of some access and inverse functions), filters (remove/add marks), text segmentation (see the sketch after this list)
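
A minimal sketch of transduction as a non-deterministic walk with backtracking over (input, output) symbol pairs. The toy machine and its final "y" → "ies" rewriting (e.g. cherry → cherries) are illustrative assumptions, not a transducer from the course.

```python
# Transitions are labelled with (input, output) pairs; "" stands for epsilon.
EPS = ""

# transitions[state] = list of ((input_symbol, output_symbol), next_state)
transitions = {
    "q0": [((c, c), "q0") for c in "abcdefghijklmnopqrstuvwxyz" if c != "y"]
          + [(("y", "i"), "q1")],
    "q1": [((EPS, "e"), "q2")],
    "q2": [((EPS, "s"), "q3")],
}
final_states = {"q3"}

def generate(word, state="q0", pos=0, output=""):
    """Depth-first walk with backtracking: yield every output associated with `word`."""
    if pos == len(word) and state in final_states:
        yield output
    for (inp, out), nxt in transitions.get(state, []):
        if inp == EPS:                                # consume no input symbol
            yield from generate(word, nxt, pos, output + out)
        elif pos < len(word) and word[pos] == inp:    # consume one input symbol
            yield from generate(word, nxt, pos + 1, output + out)

print(list(generate("cherry")))   # ['cherries']
```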

Out of vocabulary forms - spelling error correction

  • out of vocabulary : $10\%$ in newspapers
    • spelling errors : modelling by transformation (rewriting rules, sometimes weighted), edit space
      • transposition : distortion, $XY\to YX$
      • tripling : distortion, $XX\to XXX$
      • noun derivation : $ize:INF\to ization:N$
    • neologisms : Internetization
    • borrowings : rendez-vous, identification of source language, computation of frequency matrix ($ngram\times language$), language likelihood product
      • $n$-grams : $(n-1)$-order Markov assumption, $P(x_1\cdots x_N)=P(x_1\cdots x_{n-1})\Pi_{i=n}^N P(x_i\mid x_{i-n+1}\cdots x_{i-1})$
    • forms difficult to exhaustively lexicalize : abbreviations, proper names
  • spelling error correction
    • statistical : probability metric
    • lexicon : edit distance metric
  • Levenshtein distance : minimal number of transformations to change one string into the other
    • insertion
    • deletion
    • substitution
    • transposition
    • non overlapping transformation
    • on a non-finite language : $\forall X,Y,\ \exists f:Y=f(X)$
    • dynamic programming : $m_{ij}=D(X_1^i,Y_1^j)$ (see the sketch after this list)
  • spelling error FSA correction : approximation, within a maximum distance range, fault-tolerant recognition for a regular language, find all accepting paths whose corresponding string is within a distance less than the threshold from the given input string
  • pruning criteria : cut-off edit distance,
  • FSA walk within distance
  • prefix-compatible depth-first version
  • weighted edit distance : diacritics (é), uppercase, specific transformation (keyboard shift), whitespace
    • semantic integrity, coherence constraints : introduction of a new $f$ s.t. $f=\circ_i f_i$, useful iff $C(f)<\sum_i C(f_i)$
      • $C(Del)+C(Ins(x))>C(Sub(x))$
      • $C(Split)<C(Ins(x))$, $C(Merge)< C(Del)$
      • $C(Transp)<C(Ins(s))+C(Del)$
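
A dynamic-programming sketch of the edit distance with insertions, deletions, substitutions and adjacent transpositions (Damerau-Levenshtein), using unit costs; a weighted variant as described above would replace the 1's with per-operation costs.

```python
def edit_distance(x, y):
    """d[i][j] = D(X_1^i, Y_1^j): distance between the i- and j-prefixes of x and y."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                                    # i deletions
    for j in range(m + 1):
        d[0][j] = j                                    # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,             # deletion
                          d[i][j - 1] + 1,             # insertion
                          d[i - 1][j - 1] + cost)      # substitution / match
            if i > 1 and j > 1 and x[i - 1] == y[j - 2] and x[i - 2] == y[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # transposition
    return d[n][m]

print(edit_distance("langage", "language"))  # 1 (one insertion)
print(edit_distance("hte", "the"))           # 1 (one transposition)
```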

Part of speech tagging

  • morpho-lexical level : ambiguity resolution, reduction of vocabulary size, remove lexical variability that is not useful
  • PoS tagging : right tag according to the context among (word, tag) pairs
  • lemmatization : reduce word to canonical form within context
  • stemming : without PoS
  • rule based : Brill algorithm
  • probabilistic : HMM tagging, conditional random fields (CRF)
  • Brill's tagger : error-driven transformation-based (rules) tagger, supervised learning
    • learning : once, slow, iteratively compute error of each candidate rule, select best rule, add to rule set and apply, repeat until threshold
    • application : quick, simple, initialize known words to their most probable tag, unknown words to either proper noun (capitalized), noun, or via learned guessing rules, then apply all rules
    • rules
      • lexical : $words\to tag$ if condition (e.g. current word has suffix, removing prefix give a known word, contain character $x$)
      • contextual : $tag_1\to tag_2$ if condition (e.g. following tag at distance $2$ is $z$, current word is preceded by tag $Z$)
  • hidden markov model
    • sequence of $T$ words : $O_1^T=O_1\cdots O_T$
    • tags : $\xi^T_1=\xi_1\cdots\xi_T$
    • tagging : find $\xi^T_1$ maximizing $P(\xi_1,\ldots,\xi_T\mid O_1,\ldots,O_T)$
    • maximization : $\arg\max_{\xi_1^T}(P(O^T_1\mid\xi_1^T)P(\xi_1^T))$ with $P(O_1^T\mid\xi_1^T)=P(O_1\mid\xi^T_1)\cdots P(O_T\mid O^{T-1}_1,\xi_1^T)$ and $P(\xi_1^T)=P(\xi_1)P(\xi_2\mid\xi_1)\cdots P(\xi_T\mid\xi_1^{T-1})$
    • hypotheses
      • limited lexical conditioning : $P(O_i\mid O_1,\ldots,O_{i-1},\xi_1,\ldots,\xi_i,\ldots,\xi_T)=P(O_i\mid\xi_i)$
      • limited scope for syntactic dependencies : $k$ neighbors, $P(\xi_i\mid\xi_1,\ldots,\xi_{i-1})=P(\xi_i\mid \xi_{i-k},\ldots,\xi_{i-1})$
    • $k$-order model : $P(O^T_1\mid\xi_1^T)P(\xi_1^T)=P(O_1^k\mid\xi_1^k)P(\xi_1^k)\Pi_{i=k+1}^{T}(P(O_i\mid\xi_i)P(\xi_i\mid\xi_{i-k}^{i-1}))$ with $P(O_1^T\mid\xi_1^T)=P(O_1\mid\xi_1)\cdots P(O_T\mid \xi_T)$ and $P(\xi_1^T)=P(\xi_1^k)P(\xi_{k+1}\mid\xi_1,\ldots,\xi_k)\cdots P(\xi_T\mid\xi_{T-k},\ldots,\xi_{T-1})$
    • formally
      • set of states : $C=\{C_1,\ldots,C_n\}$
      • transition matrix : $A$ with $a_{ij}=P(\xi_{t+1}=C_j\mid\xi_t = C_i)$, shortened to $P(C_j\mid C_i)$
      • initial probabilities vector : $I$ with $I_i=P_I(\xi_1=C_i)$, shortened to $P_I(C_i)$
      • alphabet : $\Sigma$ (not necessarily finite)
      • $n$ emission probabilities on $\Sigma$ : $B_i(w)=P(O_t=w\mid\xi_t=C_i)$, shortened to $P(w\mid C_i)$
    • Viterbi : linear, compute $\xi_1^T$ maximizing $P(\xi_1^T\mid O_1^T)$ (see the sketch after this list)
    • Baum-Welch : iterative algorithm for estimating parameters from unsupervised data (words only), $P(w\mid C_i)$, $P(C_j\mid C_{i_1}^{i_k})$, $P_I(C_{i_1}\cdots C_{i_k})$
      • parameter estimation : supervised has missing data, unsupervised has high sensitivity to initial conditions, compromise of hybrid methods (discriminative methods, CRF, averaged perceptrons, semi-supervised)
      • example : $P(\text{time/N flies/V like/Adv an/Det arrow/N}) = P_I(N)\cdot P(\text{time}\mid N)\cdot P(V\mid N)\cdot P(\text{flies}\mid V)\cdot P(Adv\mid V)\cdot P(\text{like}\mid Adv)\cdot P(Det\mid Adv)\cdot P(\text{an}\mid Det)\cdot P(N\mid Det)\cdot P(\text{arrow}\mid N)$
    • problems
      • given parameters $\theta$ of HMM, get probability of observation $P(O\mid\theta)$ (language identification)
      • given parameters $\theta$ of HMM, find most likely state sequence $\xi$ that produces $O$ : $\arg\max_\xi P(\xi\mid O,\theta)$ (PoS tagging, speech recognition)
      • find parameters that maximize probability of producing $O$ $\arg\max_\theta P(\theta\mid O)$ (unsupervised learning)
    • remarks
      • $\theta =(I,A,B)$ has $n(n+p-1)-1$ free parameters with $n=|C|$ and $p=|\Sigma|$
      • supervised learning is easy : $\arg\max_\theta P(\theta\mid O,\xi)$
  • forward/backward algorithm : $P(O\mid \theta)=\sum_\xi P(O,\xi\mid\theta)=\sum_\xi P(O\mid\xi,\theta)P(\xi\mid\theta)$, complexity $O(|O|n^2)$
  • viterbi : maximize $P(\xi,O\mid\theta)$
  • baum-welch : EM for estimating HMM parameters
  • conditional random fields : discriminative generalization of HMMs where features no longer need to be state-conditional probabilities
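
A minimal Viterbi sketch for a first-order HMM tagger on the "time flies like an arrow" example; the tiny tag set and the probabilities are illustrative assumptions (there is no Adv tag here, so "like" comes out as V), not parameters from the course.

```python
import math

states = ["N", "V", "Det"]
P_init = {"N": 0.5, "V": 0.2, "Det": 0.3}
P_trans = {  # P(next_tag | tag)
    "N":   {"N": 0.2, "V": 0.5, "Det": 0.3},
    "V":   {"N": 0.3, "V": 0.2, "Det": 0.5},
    "Det": {"N": 0.8, "V": 0.1, "Det": 0.1},
}
P_emit = {   # P(word | tag); unseen (word, tag) pairs get a tiny floor
    "N":   {"time": 0.3, "flies": 0.2, "arrow": 0.2},
    "V":   {"flies": 0.4, "like": 0.3},
    "Det": {"an": 0.6},
}
FLOOR = 1e-8

def viterbi(words):
    # delta[t][s] = best log-probability of a tag sequence ending in state s at time t
    delta = [{s: math.log(P_init[s]) + math.log(P_emit[s].get(words[0], FLOOR))
              for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        delta.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: delta[t - 1][p] + math.log(P_trans[p][s]))
            delta[t][s] = (delta[t - 1][prev] + math.log(P_trans[prev][s])
                           + math.log(P_emit[s].get(words[t], FLOOR)))
            back[t][s] = prev
    tags = [max(states, key=lambda s: delta[-1][s])]   # best final state
    for t in range(len(words) - 1, 0, -1):             # follow back-pointers
        tags.append(back[t][tags[-1]])
    return list(reversed(tags))

print(viterbi(["time", "flies", "like", "an", "arrow"]))  # ['N', 'V', 'V', 'Det', 'N']
```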

NLP evaluation

  • protocol
    • define control task
    • regroup large amount of typical data
    • test and compare NLP systems on similar data
    • publish and discuss
  • difficulty for automatic classification : relation can be marked or not for finding linked propositions
  • gold standard : set of correct answers to the task for a sample of correct inputs
  • inter-annotator agreement : subtract chance agreement
    • raw agreement = nb items agreed / total nb of items
    • Cohen's $\kappa$ : 2 graders, positive better than chance, $1$ perfect agreement, $0$ statistical independence, more than $0.6$ usually okay
    • Fleiss' $\kappa$ : $n$ graders
  • evaluation : training, validating, testing set
  • confusion matrix (metrics computed in the sketch after this list)
    • accuracy = number of correctly classified items / total number of items × 100, bad on uneven classes
    • precision = correctly retrieved documents / total number of retrieved document = true positives / (true positives + false positives)
    • recall = correctly retrieved documents / total number of relevant documents = true positives / (true positives + false negatives)
  • F-score : harmonic mean of precision and recall, penalize large divergence between numbers $F_\beta=(1+\beta^2)\frac{precision\times recall}{(\beta^2\times precision)+ recall}$
  • receiver operating characteristic : ROC curve, true positive rate = recall vs false positive rate = false positives / (true negatives + false positives)
  • statistically significant evaluation : confidence intervals, confidence box, compute standard deviations, t-test $t=\frac{\mu\sqrt{T}}{s}$
    • mean : $\mu=\frac{1}{T}\sum_{i=1}^T\Delta_i$
    • standard deviation : $s=\sqrt{\frac{1}{T-1}\sum_{i=1}^T(\Delta_i-\mu)^2}$
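
A small sketch computing the metrics above from confusion-matrix counts; the counts are illustrative. Note how accuracy stays high on uneven classes even though recall is mediocre.

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    return tp / (tp + fp)        # correctly retrieved / all retrieved

def recall(tp, fn):
    return tp / (tp + fn)        # correctly retrieved / all relevant

def f_score(p, r, beta=1.0):
    return (1 + beta**2) * p * r / (beta**2 * p + r)

tp, fp, fn, tn = 40, 10, 20, 930          # illustrative counts
p, r = precision(tp, fp), recall(tp, fn)
print(f"accuracy={accuracy(tp, fp, fn, tn):.3f}  "
      f"precision={p:.3f}  recall={r:.3f}  F1={f_score(p, r):.3f}")
# accuracy=0.970  precision=0.800  recall=0.667  F1=0.727
```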

Parsing & formal grammars

  • syntactic : analysis of sentences structure
    • parsing algorithms : procedural, generic algorithms
    • formal grammars : declarative, data
  • symbolic grammars
    • phrase-structure grammars : recursively decompose sentences into constituents, words are atomic parts, good for ordered languages, express structural dependencies, not adapted to free-order languages
    • dependency grammars : focus on words and the relations between them rather than constituents, functional role of words, lexically oriented, dependency grammars provide simpler structures
    • formally
      • phrase-structure grammar $G$ : finite set $C$ non-terminal, finite set $L$ of terminal symbols, upper level symbol $S\in C$, finite set $R$ of rewriting rules $R\subset C^+\times(C\cup L)^*$
      • context-free grammar CFG : set $C$ syntactic categories (non-terminals), set $W$ words (terminal), element $S$ of $C$ top level category, proper subset $C_0$ of $C$ morpho-syntactic categories (part-of-speech), set $R$ rewriting rules (syntactic rules) $X\to X_1\cdots X_n$ where $X\in C-C_0$ and $X_1\cdots X_n\in C$, set $L$ (lexicon) rewriting rule (lexical rules) $X\to w$ where $X\in C_0$ and $w$ word
  • parsing
    • recognizing : syntactically correct, can be derived from upper symbol in finite number of rewriting step following the rule, output yes/no
    • analysing : give all possible interpretation, output syntactic trees
  • derivation : any sequence of rules deriving a given sentence, by convention restricted to left-most rule application
  • Cocke-Younger-Kasami CYK : always $O(n^3)$, $O(|C|^2)$, simple, particular grammar format (see the recognizer sketch after this list)
    • Chomsky normal form : any CFG can be converted, all rules $X\to X_1X_2$
    • extended Chomsky normal form : allow also $X\to X_1$
  • Earley parser : same complexity as CYK in the worst case but adapts to less complex languages, no way to correct/reconstruct non-parsable sentences, not intuitive
    • top down : predictive, search
    • scanning, completion ($i\to i$), prediction ($i\to i$)
  • bottom up chart parsing : $O(n^3)$, inference, online binarization, keeps the best of Earley and CYK
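
A CYK recognizer sketch for a CFG in Chomsky normal form; the toy grammar and lexicon are illustrative assumptions.

```python
from collections import defaultdict

lexical_rules = {            # X -> w
    ("Det", "the"), ("N", "cat"), ("N", "mouse"), ("V", "chases"),
}
syntactic_rules = {          # X -> X1 X2
    ("S", "NP", "VP"), ("NP", "Det", "N"), ("VP", "V", "NP"),
}

def cyk_recognize(words, top="S"):
    n = len(words)
    chart = defaultdict(set)              # chart[(i, l)] = categories deriving words[i:i+l]
    for i, w in enumerate(words):         # length-1 spans come from the lexicon
        for cat, word in lexical_rules:
            if word == w:
                chart[(i, 1)].add(cat)
    for length in range(2, n + 1):        # build longer spans from shorter ones
        for i in range(n - length + 1):
            for split in range(1, length):
                for lhs, r1, r2 in syntactic_rules:
                    if r1 in chart[(i, split)] and r2 in chart[(i + split, length - split)]:
                        chart[(i, length)].add(lhs)
    return top in chart[(0, n)]           # recognized iff S derives the whole sentence

print(cyk_recognize("the cat chases the mouse".split()))   # True
print(cyk_recognize("cat the chases mouse the".split()))   # False
```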

Stochastic parsing

  • probabilize
    • recognizer : sentence probabilities, $P(w_1^n)$, n-gram $P(w_1,\ldots, w_n)=P(w_1,\ldots, w_{N-1})\Pi_{i=N}^n P(w_i\mid w_{i-N+1},\ldots, w_{i-1})$
    • analyzer : parse-tree analyzer, $P(T\mid w_1^n)$, scfg
  • stochastic context-free grammars SCFG : CFG extension, for each rule a stochastic coefficient $p(R)$, $T=\arg\max_{T_i:F(T_i)=W}P(T_i)$
    • process : $\zeta$, $P(\zeta=R_0,\ldots,R_n)=P(R_0)\Pi_{i=1}^n P(R_i\mid R_1,\ldots, R_{i-1})$
      • assumption : only conditioned by $left(R_i)$, the left-most non-terminal of $R_0\circ \cdots\circ R_{i-1}$
      • stochastic coefficient : $p(R_i)=P(R_i\mid left(R_i))$
    • probability : $P(T)=\Pi_{i=1}^k p(R_i)$, need to be normalized $\hat P(T)=\frac{P(T)}{\sum_{T\in T(G)} P(T)}$ if not consistent
    • sentences : $P(W)=\sum_{F(T)=W, T\in T(G)} \hat P(T)$
    • implementation : argmax while CYK analysis
    • estimate probabilities
      • supervised : tree-bank (annotated corpus), MLE, $p(R)=\frac{\text{nb. occurrences of }R}{\text{nb. occurrences of rules }R'\text{ s.t. }left(R')=left(R)}$ (see the sketch after this list)
      • unsupervised : EM, inside-outside algo, local minimum, sensitive to initial values
      • hybrid : small tree-bank, large corpus
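
A sketch of the supervised (MLE) estimation of the stochastic coefficients; the toy treebank is given directly as a flat list of rule applications, an illustrative shortcut (in practice the rules are read off annotated parse trees). A tree's probability is then the product of the coefficients of the rules it uses.

```python
from collections import Counter

# Each entry is one rule application (lhs, rhs) observed in the treebank.
treebank_rules = [
    ("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V", "NP")), ("NP", ("Det", "N")),
    ("S", ("NP", "VP")), ("NP", ("NP", "PP")), ("NP", ("Det", "N")),
    ("VP", ("V",)), ("PP", ("Prep", "NP")), ("NP", ("Det", "N")),
]

rule_counts = Counter(treebank_rules)
lhs_counts = Counter(lhs for lhs, _ in treebank_rules)

# p(R) = count(R) / count(rules R' with left(R') = left(R))
p = {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}
for (lhs, rhs), prob in sorted(p.items()):
    print(f"{lhs} -> {' '.join(rhs)} : {prob:.2f}")
# e.g. NP -> Det N : 0.80, NP -> NP PP : 0.20, S -> NP VP : 1.00, ...
```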

Lexical semantics

  • compositional semantics : study of meaning of linguistic sentences, no meaning for words alone, too complex at large scale
    • usual representations
      • symbolic : logical formula, graph-based
      • vectorial : word embedding, large scale but too simplistic
  • triangle of signification : minds grasp senses, words express them, objects are referred to by them
  • lexical semantics : study of the meaning of words, structured, context-sensitive
  • lexeme : individual entry in the lexicon, pairing of an orthographic/phonological form with a meaning representation
  • lexicon : finite list of lexemes, includes compound nouns and non-compositional phrases (proper names)
  • semantic relations
    • homonymy : same surface form but different meanings and different etymologies, two different entries (vs polysemy)
    • homophony : distinct lexemes with same pronunciation
    • homography : distinct lexemes with same orthographic forms
    • polysemy : multiple related meanings within a single lexeme
      • metaphor
      • metonymy
    • synonymy : same sense, map same concept, same value for all semantic features
      • Leibniz substitution theory : substitution of one for the other never changes the truth
    • hyponymy/hyperonymy : word whose meaning contains the entire meaning of another, known as the superordinate (hypernym), transitive, asymmetric, generates a taxonomic hierarchy
    • overlap : same value for some of semantic features
    • meronymy/holonymy : a word is a meronym of another word (its holonym) if the is-part-of relation holds, several different relationships
    • genus differentia : each word meaning is associated to a hypernym through the hyponymy/hypernymy relation, then each word meaning is uniquely differentiated from the other hyponyms of its hypernym by additional relations associating it with other word meanings
  • synsets : sets of word forms that share the same sense, signify that concepts exist (not the concept itself)
    • differential approach : WordNet, list of word forms that distinguish their meanings (e.g. adjectives use antonymy relations), see the WordNet sketch after this list
    • constructive approach : conventional dictionaries, meaning representation contain sufficient information to accurately define a concept, often cyclic
    • semantic primes : small number of generic concepts to start with
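
A quick look at synsets and a few lexical relations through NLTK's WordNet interface; this assumes nltk is installed and the wordnet corpus has been downloaded (e.g. with nltk.download("wordnet")).

```python
from nltk.corpus import wordnet as wn

# Homonymy/polysemy: one surface form, several senses (synsets).
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())

dog = wn.synset("dog.n.01")
print("synonyms :", dog.lemma_names())                        # word forms sharing this sense
print("hypernyms:", [s.name() for s in dog.hypernyms()])      # is-a relation
print("meronyms :", [s.name() for s in dog.part_meronyms()])  # part-of relation
```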

Information retrieval

  • DIS

Text classification

  • classification : $R^{N\times m}$
  • similarity matrix : $d(x,y)\ge 0$, $d(x,y)=d(y,x)$, $d(x,y)\le d(x,z)+d(z,y)$; a dissimilarity does not require the last property
  • Parzen window : weighted nearest neighbors
  • dendrograms : binary cluster tree, $O(N^2\log N)$
  • evaluation
  • multidimensional scaling : MDS, Sammon mapping criterion $C(d,\tilde d)=\sum_{x\not = y}\frac{(d(x,y)-\tilde d(\tilde x,\tilde y))^2}{d(x,y)}$
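
A sketch of the Sammon criterion above, comparing the original pairwise distances $d$ with the distances $\tilde d$ obtained after projecting to a lower dimension; the random data and the plain random projection are illustrative assumptions (an actual MDS would optimize the low-dimensional coordinates to minimize this stress).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))                  # 20 items in a 50-dimensional space
P = rng.normal(size=(50, 2)) / np.sqrt(50)     # illustrative projection to 2 dimensions
Y = X @ P

def pairwise_distances(M):
    diff = M[:, None, :] - M[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_stress(d, d_tilde):
    mask = ~np.eye(len(d), dtype=bool)         # sum over x != y (skip zero distances)
    return (((d[mask] - d_tilde[mask]) ** 2) / d[mask]).sum()

d, d_tilde = pairwise_distances(X), pairwise_distances(Y)
print(sammon_stress(d, d_tilde))
```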