
Commit

update docs/
EmilHvitfeldt committed Sep 14, 2021
1 parent 88af8d6 commit 7f56d72
Showing 61 changed files with 3,890 additions and 3,300 deletions.
50 changes: 21 additions & 29 deletions docs/01_model_language.md
Expand Up @@ -21,49 +21,42 @@ Throughout the course of this book, we will discuss creating predictors or featu
\index{semantics}
\index{pragmatics}

\begin{table}

\caption{(\#tab:lingsubfields)Some subfields of linguistics, moving from smaller structures to broader structures}
\centering
\begin{tabular}[t]{ll}
\toprule
Linguistics subfield & What does it focus on?\\
\midrule
Phonetics & Sounds that people use in language\\
Phonology & Systems of sounds in particular languages\\
Morphology & How words are formed\\
Syntax & How sentences are formed from words\\
Semantics & What sentences mean\\
Pragmatics & How language is used in context\\
\bottomrule
\end{tabular}
\end{table}

Table: (\#tab:lingsubfields)Some subfields of linguistics, moving from smaller structures to broader structures

|Linguistics subfield |What does it focus on? |
|:--------------------|:-----------------------------------------|
|Phonetics |Sounds that people use in language |
|Phonology |Systems of sounds in particular languages |
|Morphology |How words are formed |
|Syntax |How sentences are formed from words |
|Semantics |What sentences mean |
|Pragmatics |How language is used in context |

These fields each study a different level at which language exhibits organization. When we build supervised machine learning models for text data, we use these levels of organization to create _natural language features_, i.e., predictors or inputs for our models. These features often depend on the morphological characteristics of language, such as when text is broken into sequences of characters for a recurrent neural network deep learning model\index{neural network!recurrent}. Sometimes these features depend on the syntactic characteristics of language, such as when models use part-of-speech information. These roughly hierarchical levels of organization are key to the process of transforming unstructured language to a mathematical representation that can be used in modeling.
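As a rough, illustrative sketch (not part of the original chapter, which stays conceptual at this point), the **tokenizers** package used later in this book can split the same string at two of these levels of organization, characters and words, which is the kind of raw material such features are built from:

```r
library(tokenizers)

text <- "Enraged cow injures farmer with ax"

# Character-level units, as a character-based recurrent neural network consumes them
tokenize_characters(text)

# Word-level units, the most common starting point for the models in this book
tokenize_words(text)
```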

At the same time, \index{linguistics}this organization and the rules of language can be ambiguous; our ability to create text features for machine learning is constrained by the very nature of language. Beatrice Santorini, a linguist at the University of Pennsylvania, compiles examples of linguistic ambiguity from [news headlines](https://www.ling.upenn.edu/~beatrice/humor/headlines.html):


> Include Your Children When Baking Cookies
> March Planned For Next August
> Enraged Cow Injures Farmer with Ax
> Wives Kill Most Spouses In Chicago

- Include Your Children When Baking Cookies
- March Planned For Next August
- Enraged Cow Injures Farmer with Ax
- Wives Kill Most Spouses In Chicago

If you don't have knowledge about what linguists study and what they know about language, these news headlines are just hilarious. To linguists, these are hilarious because they exhibit certain kinds of semantic ambiguity.

Notice also that the first two subfields on this list are about sounds, i.e., speech.\index{speech} Most linguists view speech as primary, and writing down language as text as a technological step.

\begin{rmdnote}
Remember that some language is signed, not spoken, so the description
laid out here is itself limited.
\end{rmdnote}
<div class="rmdnote">
<p>Remember that some language is signed, not spoken, so the description laid out here is itself limited.</p>
</div>
\index{language!signed}

Written text is typically less creative and further from the primary language than we would wish. This points out how fundamentally limited modeling from written text is. Imagine that the abstract language data we want exists in some high-dimensional latent space; we would like to extract that information using the text somehow, but it just isn't completely possible. Any features we create or model we build are inherently limited.
Expand Down Expand Up @@ -95,10 +88,9 @@ The concept of differences in language is relevant for modeling beyond only the

Language is also changing over time. This is a known characteristic of language; if you notice the evolution of your own language, don't be depressed or angry, because it means that people are using it! Teenage girls are especially effective at language innovation and have been for centuries [@McCulloch15]; innovations spread from groups such as young women to other parts of society. This is another difference that impacts modeling.

\begin{rmdwarning}
Differences in language relevant for models also include the use of
slang, and even the context or medium of that text.
\end{rmdwarning}
<div class="rmdwarning">
<p>Differences in language relevant for models also include the use of slang, and even the context or medium of that text.</p>
</div>

Consider two bodies of text, both mostly standard written English, but one made up of tweets and one made up of medical documents. If an NLP practitioner trains a model on the data set of tweets to predict some characteristics of the text, it is very possible (in fact, likely, in our experience) that the model will perform poorly if applied to the data set of medical documents[^mednote]. Like machine learning in general, text modeling is exquisitely sensitive to the data used for training. This is why we are somewhat skeptical of AI products such as sentiment analysis APIs, not because they *never* work well, but because they work well only when the text you need to predict from is a good match to the text such a product was trained on.

Expand Down
24 changes: 12 additions & 12 deletions docs/02_tokenization.md
Expand Up @@ -53,7 +53,7 @@ These elements don't contain any metadata or information to tell us which charac
In tokenization, we take an input (a string) and a token type (a meaningful unit of text, such as a word) and split the input into pieces (tokens) that correspond to the type [@Manning:2008:IIR:1394399]. Figure \@ref(fig:tokenizationdiag) outlines this process.\index{tokenization!definition}

<div class="figure" style="text-align: center">
<img src="diagram-files/tokenization-black-box.png" alt="A black box representation of a tokenizer. The text of these three example text fragments has been converted to lowercase and punctuation has been removed before the text is split." width="90%" />
<img src="diagram-files/tokenization-black-box.pdf" alt="A black box representation of a tokenizer. The text of these three example text fragments has been converted to lowercase and punctuation has been removed before the text is split." width="90%" />
<p class="caption">(\#fig:tokenizationdiag)A black box representation of a tokenizer. The text of these three example text fragments has been converted to lowercase and punctuation has been removed before the text is split.</p>
</div>
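A minimal sketch of this input/type/output relationship using `tokenize_words()` from the **tokenizers** package; the two fragments below are illustrative, not the exact ones shown in the figure:

```r
library(tokenizers)

# Input: a character vector of raw text
fragments <- c("Here is some text.", "And here is some more!")

# Token type: words; output: a list with one character vector per input element
tokenize_words(fragments)
#> [[1]]
#> [1] "here" "is"   "some" "text"
#> 
#> [[2]]
#> [1] "and"  "here" "is"   "some" "more"
```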

Expand Down Expand Up @@ -142,7 +142,7 @@ Thinking of a token as a word is a useful way to start understanding tokenizatio

- paragraphs, and

- n-grams.
- n-grams

In the following sections, we will explore how to tokenize text using the **tokenizers** package. These functions take a character vector as the input and return lists of character vectors as output. This same tokenization can also be done using the **tidytext** [@Silge16] package, for workflows using tidy data principles where the input and output are both in a dataframe.
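A hedged sketch of the two workflows described in this paragraph; the column names and example strings below are illustrative assumptions, not taken from the chapter:

```r
library(tokenizers)
library(tidytext)
library(tibble)
library(dplyr)

raw <- c("Far far away, in a warm and sunny land,",
         "there lived a little fir tree.")

# tokenizers: character vector in, list of character vectors out
tokenize_words(raw)

# tidytext: data frame in, tidy data frame out, one token per row
tibble(line = seq_along(raw), text = raw) %>%
  unnest_tokens(word, text)
```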

Expand Down Expand Up @@ -181,7 +181,7 @@ sample_tibble %>%
```

```
#> # A tibble: 11 x 1
#> # A tibble: 11 × 1
#> word
#> <chr>
#> 1 far
Expand Down Expand Up @@ -210,7 +210,7 @@ sample_tibble %>%
```

```
#> # A tibble: 12 x 1
#> # A tibble: 12 × 1
#> word
#> <chr>
#> 1 far
Expand Down Expand Up @@ -279,7 +279,7 @@ tokenize_characters(x = the_fir_tree,

The results have more elements because the spaces and punctuation have not been removed.
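The full call is elided by this diff hunk; the following is only a plausible reconstruction, assuming the `strip_non_alphanum` argument is what keeps spaces and punctuation in the output (`the_fir_tree` here is a stand-in for the book's text of "The Fir-Tree"):

```r
library(tokenizers)

# Stand-in for the book's `the_fir_tree` character vector
the_fir_tree <- "Far down in the forest, where the warm sun made a sweet resting-place, grew a pretty little fir-tree."

# With strip_non_alphanum = FALSE, spaces and punctuation are kept,
# so each becomes its own single-character token
head(tokenize_characters(x = the_fir_tree, strip_non_alphanum = FALSE)[[1]], 20)
```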

Depending on the format you have your text data in, it might contain ligatures.\index{tokenization!ligatures} Ligatures are when multiple graphemes or letters are combined as a single character. The graphemes "f" and "l" are combined into "ﬂ", or "s" and "t" into "ﬆ". When we apply normal tokenization rules, the ligatures will not be split up.
Depending on the format you have your text data in, it might contain ligatures.\index{tokenization!ligatures} Ligatures are when multiple graphemes or letters are combined as a single character. The graphemes "f" and "l" are combined into "ﬂ", or "f" and "f" into "ﬀ". When we apply normal tokenization rules, the ligatures will not be split up.
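A small, hedged sketch of one way to deal with this, assuming you want ligatures split before tokenizing: Unicode NFKC normalization (here via `stringi::stri_trans_nfkc()`) decomposes compatibility ligatures such as "ﬂ" into their component letters. This normalization step is an illustration, not necessarily the approach the chapter takes next.

```r
library(tokenizers)
library(stringi)

text <- "ﬂour and ﬁr"  # contains the single-character ligatures U+FB02 and U+FB01

# Ordinary word tokenization leaves the ligature characters intact
tokenize_words(text)

# NFKC compatibility normalization decomposes them into plain "fl" and "fi"
tokenize_words(stri_trans_nfkc(text))
```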


```r
Expand Down Expand Up @@ -340,7 +340,7 @@ hcandersen_en %>%
```

```
#> # A tibble: 10 x 3
#> # A tibble: 10 × 3
#> # Groups: book [2]
#> book word n
#> <chr> <chr> <int>
Expand Down Expand Up @@ -863,14 +863,14 @@ bench::mark(check = FALSE, iterations = 10,
```

```
#> # A tibble: 5 x 6
#> # A tibble: 5 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 corpus 118ms 128ms 7.76 4.59MB 0.863
#> 2 tokenizers 151ms 154ms 6.48 1.01MB 1.62
#> 3 text2vec 130ms 134ms 7.33 20.64MB 0.815
#> 4 quanteda 241ms 243ms 4.11 8.7MB 2.74
#> 5 base R 471ms 499ms 2.02 10.51MB 0.505
#> 1 corpus 77.1ms 85.6ms 11.3 4.58MB 1.26
#> 2 tokenizers 96.7ms 100.1ms 9.80 1.01MB 1.09
#> 3 text2vec 79.5ms 81.1ms 11.9 20.64MB 1.32
#> 4 quanteda 157.2ms 164.4ms 6.07 8.7MB 1.52
#> 5 base R 325.7ms 332.3ms 3.00 10.51MB 2.00
```

The corpus package [@Perry2020] offers excellent performance for tokenization, and other options are not much worse. One exception is using a base R function as a tokenizer; you will see significant performance gains by instead using a package built specifically for text tokenization.
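The `bench::mark()` call itself is elided in this hunk; the sketch below shows how such a comparison could be set up, but the specific tokenizing functions and input text are assumptions rather than the book's actual benchmark code:

```r
library(bench)

# Hypothetical input: many copies of a sentence standing in for the benchmark text
docs <- rep("The fir tree grew in the forest in the warm sun and the fresh air.", 1000)

bench::mark(
  check = FALSE, iterations = 10,
  corpus     = corpus::text_tokens(docs),
  tokenizers = tokenizers::tokenize_words(docs),
  text2vec   = text2vec::word_tokenizer(docs),
  quanteda   = quanteda::tokens(docs),
  `base R`   = strsplit(docs, "\\s")
)
```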
Expand Down