
Commit

update docs/
EmilHvitfeldt committed Sep 14, 2021
1 parent 88af8d6 commit 7f56d72
Showing 61 changed files with 3,890 additions and 3,300 deletions.
50 changes: 21 additions & 29 deletions docs/01_model_language.md
Expand Up @@ -21,49 +21,42 @@ Throughout the course of this book, we will discuss creating predictors or featu
\index{semantics}
\index{pragmatics}

\begin{table}

\caption{(\#tab:lingsubfields)Some subfields of linguistics, moving from smaller structures to broader structures}
\centering
\begin{tabular}[t]{ll}
\toprule
Linguistics subfield & What does it focus on?\\
\midrule
Phonetics & Sounds that people use in language\\
Phonology & Systems of sounds in particular languages\\
Morphology & How words are formed\\
Syntax & How sentences are formed from words\\
Semantics & What sentences mean\\
Pragmatics & How language is used in context\\
\bottomrule
\end{tabular}
\end{table}

Table: (\#tab:lingsubfields)Some subfields of linguistics, moving from smaller structures to broader structures

|Linguistics subfield |What does it focus on? |
|:--------------------|:-----------------------------------------|
|Phonetics |Sounds that people use in language |
|Phonology |Systems of sounds in particular languages |
|Morphology |How words are formed |
|Syntax |How sentences are formed from words |
|Semantics |What sentences mean |
|Pragmatics |How language is used in context |

These fields each study a different level at which language exhibits organization. When we build supervised machine learning models for text data, we use these levels of organization to create _natural language features_, i.e., predictors or inputs for our models. These features often depend on the morphological characteristics of language, such as when text is broken into sequences of characters for a recurrent neural network deep learning model\index{neural network!recurrent}. Sometimes these features depend on the syntactic characteristics of language, such as when models use part-of-speech information. These roughly hierarchical levels of organization are key to the process of transforming unstructured language to a mathematical representation that can be used in modeling.
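As a rough, illustrative sketch (not part of the original chapter, which stays conceptual at this point), the **tokenizers** package used later in this book can split the same string at two of these levels of organization, characters and words, which is the kind of raw material such features are built from:

```r
library(tokenizers)

text <- "Enraged cow injures farmer with ax"

# Character-level units, as a character-based recurrent neural network consumes them
tokenize_characters(text)

# Word-level units, the most common starting point for the models in this book
tokenize_words(text)
```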

At the same time, \index{linguistics}this organization and the rules of language can be ambiguous; our ability to create text features for machine learning is constrained by the very nature of language. Beatrice Santorini, a linguist at the University of Pennsylvania, compiles examples of linguistic ambiguity from [news headlines](https://www.ling.upenn.edu/~beatrice/humor/headlines.html):


> Include Your Children When Baking Cookies
> March Planned For Next August
> Enraged Cow Injures Farmer with Ax
> Wives Kill Most Spouses In Chicago

- Include Your Children When Baking Cookies
- March Planned For Next August
- Enraged Cow Injures Farmer with Ax
- Wives Kill Most Spouses In Chicago

If you don't have knowledge about what linguists study and what they know about language, these news headlines are just hilarious. To linguists, these are hilarious because they exhibit certain kinds of semantic ambiguity.

Notice also that the first two subfields on this list are about sounds, i.e., speech.\index{speech} Most linguists view speech as primary, and writing down language as text as a technological step.

\begin{rmdnote}
Remember that some language is signed, not spoken, so the description
laid out here is itself limited.
\end{rmdnote}
<div class="rmdnote">
<p>Remember that some language is signed, not spoken, so the description laid out here is itself limited.</p>
</div>
\index{language!signed}

Written text is typically less creative and further from the primary language than we would wish. This points out how fundamentally limited modeling from written text is. Imagine that the abstract language data we want exists in some high-dimensional latent space; we would like to extract that information using the text somehow, but it just isn't completely possible. Any features we create or model we build are inherently limited.
Expand Down Expand Up @@ -95,10 +88,9 @@ The concept of differences in language is relevant for modeling beyond only the

Language is also changing over time. This is a known characteristic of language; if you notice the evolution of your own language, don't be depressed or angry, because it means that people are using it! Teenage girls are especially effective at language innovation and have been for centuries [@McCulloch15]; innovations spread from groups such as young women to other parts of society. This is another difference that impacts modeling.

\begin{rmdwarning}
Differences in language relevant for models also include the use of
slang, and even the context or medium of that text.
\end{rmdwarning}
<div class="rmdwarning">
<p>Differences in language relevant for models also include the use of slang, and even the context or medium of that text.</p>
</div>

Consider two bodies of text, both mostly standard written English, but one made up of tweets and one made up of medical documents. If an NLP practitioner trains a model on the data set of tweets to predict some characteristics of the text, it is very possible (in fact, likely, in our experience) that the model will perform poorly if applied to the data set of medical documents[^mednote]. Like machine learning in general, text modeling is exquisitely sensitive to the data used for training. This is why we are somewhat skeptical of AI products such as sentiment analysis APIs, not because they *never* work well, but because they work well only when the text you need to predict from is a good match to the text such a product was trained on.

Expand Down
24 changes: 12 additions & 12 deletions docs/02_tokenization.md
Expand Up @@ -53,7 +53,7 @@ These elements don't contain any metadata or information to tell us which charac
In tokenization, we take an input (a string) and a token type (a meaningful unit of text, such as a word) and split the input into pieces (tokens) that correspond to the type [@Manning:2008:IIR:1394399]. Figure \@ref(fig:tokenizationdiag) outlines this process.\index{tokenization!definition}

<div class="figure" style="text-align: center">
<img src="diagram-files/tokenization-black-box.png" alt="A black box representation of a tokenizer. The text of these three example text fragments has been converted to lowercase and punctuation has been removed before the text is split." width="90%" />
<img src="diagram-files/tokenization-black-box.pdf" alt="A black box representation of a tokenizer. The text of these three example text fragments has been converted to lowercase and punctuation has been removed before the text is split." width="90%" />
<p class="caption">(\#fig:tokenizationdiag)A black box representation of a tokenizer. The text of these three example text fragments has been converted to lowercase and punctuation has been removed before the text is split.</p>
</div>
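A minimal sketch of this input/type/output relationship using `tokenize_words()` from the **tokenizers** package; the two fragments below are illustrative, not the exact ones shown in the figure:

```r
library(tokenizers)

# Input: a character vector of raw text
fragments <- c("Here is some text.", "And here is some more!")

# Token type: words; output: a list with one character vector per input element
tokenize_words(fragments)
#> [[1]]
#> [1] "here" "is"   "some" "text"
#> 
#> [[2]]
#> [1] "and"  "here" "is"   "some" "more"
```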

Expand Down Expand Up @@ -142,7 +142,7 @@ Thinking of a token as a word is a useful way to start understanding tokenizatio

- paragraphs, and

- n-grams.
- n-grams

In the following sections, we will explore how to tokenize text using the **tokenizers** package. These functions take a character vector as the input and return lists of character vectors as output. This same tokenization can also be done using the **tidytext** [@Silge16] package, for workflows using tidy data principles where the input and output are both in a dataframe.
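A hedged sketch of the two workflows described in this paragraph; the column names and example strings below are illustrative assumptions, not taken from the chapter:

```r
library(tokenizers)
library(tidytext)
library(tibble)
library(dplyr)

raw <- c("Far far away, in a warm and sunny land,",
         "there lived a little fir tree.")

# tokenizers: character vector in, list of character vectors out
tokenize_words(raw)

# tidytext: data frame in, tidy data frame out, one token per row
tibble(line = seq_along(raw), text = raw) %>%
  unnest_tokens(word, text)
```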

Expand Down Expand Up @@ -181,7 +181,7 @@ sample_tibble %>%
```

```
#> # A tibble: 11 x 1
#> # A tibble: 11 × 1
#> word
#> <chr>
#> 1 far
Expand Down Expand Up @@ -210,7 +210,7 @@ sample_tibble %>%
```

```
#> # A tibble: 12 x 1
#> # A tibble: 12 × 1
#> word
#> <chr>
#> 1 far
Expand Down Expand Up @@ -279,7 +279,7 @@ tokenize_characters(x = the_fir_tree,

The results have more elements because the spaces and punctuation have not been removed.
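The full call is elided by this diff hunk; the following is only a plausible reconstruction, assuming the `strip_non_alphanum` argument is what keeps spaces and punctuation in the output (`the_fir_tree` here is a stand-in for the book's text of "The Fir-Tree"):

```r
library(tokenizers)

# Stand-in for the book's `the_fir_tree` character vector
the_fir_tree <- "Far down in the forest, where the warm sun made a sweet resting-place, grew a pretty little fir-tree."

# With strip_non_alphanum = FALSE, spaces and punctuation are kept,
# so each becomes its own single-character token
head(tokenize_characters(x = the_fir_tree, strip_non_alphanum = FALSE)[[1]], 20)
```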

Depending on the format you have your text data in, it might contain ligatures.\index{tokenization!ligatures} Ligatures are when multiple graphemes or letters are combined as a single character. The graphemes "f" and "l" are combined into "ﬂ", or "s" and "t" into "ﬆ". When we apply normal tokenization rules, the ligatures will not be split up.
Depending on the format you have your text data in, it might contain ligatures.\index{tokenization!ligatures} Ligatures are when multiple graphemes or letters are combined as a single character. The graphemes "f" and "l" are combined into "ﬂ", or "f" and "f" into "ﬀ". When we apply normal tokenization rules, the ligatures will not be split up.
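A small, hedged sketch of one way to deal with this, assuming you want ligatures split before tokenizing: Unicode NFKC normalization (here via `stringi::stri_trans_nfkc()`) decomposes compatibility ligatures such as "ﬂ" into their component letters. This normalization step is an illustration, not necessarily the approach the chapter takes next.

```r
library(tokenizers)
library(stringi)

text <- "ﬂour and ﬁr"  # contains the single-character ligatures U+FB02 and U+FB01

# Ordinary word tokenization leaves the ligature characters intact
tokenize_words(text)

# NFKC compatibility normalization decomposes them into plain "fl" and "fi"
tokenize_words(stri_trans_nfkc(text))
```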


```r
Expand Down Expand Up @@ -340,7 +340,7 @@ hcandersen_en %>%
```

```
#> # A tibble: 10 x 3
#> # A tibble: 10 × 3
#> # Groups: book [2]
#> book word n
#> <chr> <chr> <int>
Expand Down Expand Up @@ -863,14 +863,14 @@ bench::mark(check = FALSE, iterations = 10,
```

```
#> # A tibble: 5 x 6
#> # A tibble: 5 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 corpus 118ms 128ms 7.76 4.59MB 0.863
#> 2 tokenizers 151ms 154ms 6.48 1.01MB 1.62
#> 3 text2vec 130ms 134ms 7.33 20.64MB 0.815
#> 4 quanteda 241ms 243ms 4.11 8.7MB 2.74
#> 5 base R 471ms 499ms 2.02 10.51MB 0.505
#> 1 corpus 77.1ms 85.6ms 11.3 4.58MB 1.26
#> 2 tokenizers 96.7ms 100.1ms 9.80 1.01MB 1.09
#> 3 text2vec 79.5ms 81.1ms 11.9 20.64MB 1.32
#> 4 quanteda 157.2ms 164.4ms 6.07 8.7MB 1.52
#> 5 base R 325.7ms 332.3ms 3.00 10.51MB 2.00
```

The corpus package [@Perry2020] offers excellent performance for tokenization, and other options are not much worse. One exception is using a base R function as a tokenizer; you will see significant performance gains by instead using a package built specifically for text tokenization.
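The `bench::mark()` call itself is elided in this hunk; the sketch below shows how such a comparison could be set up, but the specific tokenizing functions and input text are assumptions rather than the book's actual benchmark code:

```r
library(bench)

# Hypothetical input: many copies of a sentence standing in for the benchmark text
docs <- rep("The fir tree grew in the forest in the warm sun and the fresh air.", 1000)

bench::mark(
  check = FALSE, iterations = 10,
  corpus     = corpus::text_tokens(docs),
  tokenizers = tokenizers::tokenize_words(docs),
  text2vec   = text2vec::word_tokenizer(docs),
  quanteda   = quanteda::tokens(docs),
  `base R`   = strsplit(docs, "\\s")
)
```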
Expand Down