From 94aebb4c3236768047fb7d584e9d94676da7f20e Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Mon, 19 Aug 2024 16:12:07 +0200
Subject: [PATCH 01/22] updates: intro

---
 vignettes/workflow.Rmd | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/vignettes/workflow.Rmd b/vignettes/workflow.Rmd
index 4d66c96..94b8db4 100644
--- a/vignettes/workflow.Rmd
+++ b/vignettes/workflow.Rmd
@@ -20,21 +20,29 @@ library(talkr)

## Data

-We will be using the IFADV corpus as example data for the workflow of `talkr`.
-A prepared dataset can be downloaded by installing the `ifadv` package:
+We will be using the IFADV corpus as example data for the workflow of `talkr`. This is a corpus consisting of 20 dyadic conversations in Dutch, published by the Nederlandse Taalunie in 2007 ([source](https://fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFADVcorpus/)). A prepared dataset can be downloaded by installing the `ifadv` package:

```{r install data package}
# install.packages("devtools")
devtools::install_github("elpaco-escience/ifadv")
```
-
We will initialize the talkr dataset using the ifadv data, as follows:

```{r}
data <- init(ifadv::ifadv)
```

-For the `talkr` workflow, the columns named `source`, `begin`, `end`, `participant` and `utterance` are essential.
+Essential to any `talkr` workflow is a minimal set of data fields. These are the following:
+* `source`: the source conversation (a corpus can consist of multiple sources)
+* `begin`: begin time (in ms) of an utterance
+* `end`: end time (in ms) of an utterance
+* `utterance`: content of an utterance
+* `participant`: the person who produced the utterance
+
+The `init()` function takes these minimal fields and generates a few more based on them. These are:
+* `uid`: a unique identifier at utterance-level, used to identify, select and filter specific utterances
+* `duration`: the duration (in ms) of the utterance, generated by subtracting `begin` from `end`
+
The `init()` function can be used to rename columns if necessary.
For example, if the column `participant` is named `speaker`, we can rename it as follows:

@@ -43,11 +51,13 @@
talkr_data <- init(data, participant = "speaker")
```

+A dataset can contain additional fields. For instance, the IFADV sample dataset also contains `language` (which is Dutch) and `utterance_raw` (a fuller, less processed version of the utterance content). It also contains measures related to turn-taking and timing, including `FTO` (floor transfer offset, the offset between current turn and that of a prior participant, in milliseconds) and `freq` and `rank`, frequency measures of the utterance content.
+
## Summaries

The `report_summaries` function provides a summary of the data, including the
total number of utterances, the total duration of the conversation, the number
-of speakers, and the number
+of participants, and the number
of sources.
```{r}
report_stats(data)
```

From 6575ef506efb1109113c3f5bd6132deebf43eeed Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Tue, 20 Aug 2024 12:03:43 +0200
Subject: [PATCH 02/22] rewrite in progress

---
 vignettes/workflow.Rmd | 50 +++++++++++++++++++-----------------------
 1 file changed, 22 insertions(+), 28 deletions(-)

diff --git a/vignettes/workflow.Rmd b/vignettes/workflow.Rmd
index 94b8db4..99b4690 100644
--- a/vignettes/workflow.Rmd
+++ b/vignettes/workflow.Rmd
@@ -3,14 +3,17 @@ title: "Basic workflow with talkr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{workflow}
-  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
+  %\VignetteEngine{knitr::rmarkdown}
+editor_options:
+  chunk_output_type: console
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
-  comment = "#>"
+  comment = "#>",
+  fig.width = 8
)
```

```{r setup}
library(talkr)
```

-## Data
+## Loading the data

We will be using the IFADV corpus as example data for the workflow of `talkr`. T

```{r install data package}
# install.packages("devtools")
devtools::install_github("elpaco-escience/ifadv")
```
+
We will initialize the talkr dataset using the ifadv data, as follows:

```{r}
data <- init(ifadv::ifadv)
```

Essential to any `talkr` workflow is a minimal set of data fields. These are the following: \* `source`: the source conversation (a corpus can consist of multiple sources) \* `begin`: begin time (in ms) of an utterance \* `end`: end time (in ms) of an utterance \* `utterance`: content of an utterance \* `participant`: the person who produced the utterance

The `init()` function takes these minimal fields and generates a few more based on them. These are: \* `uid`: a unique identifier at utterance-level, used to identify, select and filter specific utterances \* `duration`: the duration (in ms) of the utterance, generated by subtracting `begin` from `end`

The `init()` function can be used to rename columns if necessary. For example, if the column `participant` is named `speaker`, we can rename it as follows:

``` r
talkr_data <- init(data, participant = "speaker")
```

A dataset can contain additional fields. For instance, the IFADV sample dataset also contains `language` (which is Dutch) and `utterance_raw` (a fuller, less processed version of the utterance content).
It also contains measures related to turn-taking and timing, including `FTO` (floor transfer offset, the offset between current turn and that of a prior participant, in milliseconds) and `freq` and `rank`, frequency measures of the utterance content.

-## Summaries
+## Workflow 1: Quality control
+
+### Summary statistics

-The `report_summaries` function provides a summary of the data, including the
-total number of utterances, the total duration of the conversation, the number
-of participants, and the number of sources.
+The `report_stats` function provides a simple summary of a dataset, including the total number of utterances, the total duration of the conversation, the number of participants, and the number of sources.

```{r}
report_stats(data)
```

-## Visual quality checks
+### Visual quality checks

The `plot_quality` function provides a visual check of the quality of the data, by visualizing the distribution of turn duration, and transition timing.

Transition timing is similar to FTO, but calculated without additional quality checks: transitions are identified when the participant changes from one turn to the next. The transition time is then calculated as the difference between the beginning of the turn of the new participant, and the end of the turn of the previous one.

By default, `plot_quality()` will plot the quality of the entire dataset:

```{r}
plot_quality(data)
```

Quality plots can also be run for a specific source:

```{r}
plot_quality(data, source = "/dutch2/DVA9M")

```

-## Plot conversations
+## Workflow 2: Plot conversations
+
+Another key use of `talkr` is to visualize conversational patterns.

-Individual conversations can be plotted quickly using `plot_turns_tokens()`.
-The default setting is to plot the first 60 seconds of the first source in the data,
-overlaying the 10 most frequent tokens.
+Individual conversations can be plotted quickly using `plot_turns_tokens()`. The default setting is to plot the first 60 seconds of the first source in the data, overlaying the 10 most frequent tokens.

```{r}
plot_turns_tokens(data)
```

We can set other defaults; e.g. a specific source, a different time window, and a different number of tokens:

```{r}
plot_turns_tokens(data, source = "/dutch2/DVA9M",
                  begin = 120,
                  duration = 120,
                  maxrank = 20)
```

-For more control over the plot, two specific geometries are available: `geom_turn` and `geom_token`.
-In addition, there is a `talkr`-specific theme provided.
+For more control over the plot, two specific geometries are available: `geom_turn` and `geom_token`. In addition, there is a `talkr`-specific theme provided.

```{r}
library(ggplot2)

From c509aab6e6175492b8227bd229dcd27ef987c48c Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Tue, 20 Aug 2024 12:56:12 +0200
Subject: [PATCH 03/22] + describe quality plots

---
 vignettes/workflow.Rmd | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/vignettes/workflow.Rmd b/vignettes/workflow.Rmd
index 99b4690..a924c51 100644
--- a/vignettes/workflow.Rmd
+++ b/vignettes/workflow.Rmd
@@ -25,7 +25,7 @@ library(talkr)

We will be using the IFADV corpus as example data for the workflow of `talkr`. T

-```{r install data package}
+```{r install_data_package}
# install.packages("devtools")
devtools::install_github("elpaco-escience/ifadv")
```
@@ -42,7 +42,7 @@
The `init()` function can be used to rename columns if necessary.
For example, if the column `participant` is named `speaker`, we can rename it as follows:

-``` r
+```{r init_demo}
talkr_data <- init(data, participant = "speaker")
```
@@ -55,17 +55,17 @@ A dataset can contain additional fields. For instance, the IFADV sample dataset

The `report_stats` function provides a simple summary of a dataset, including the total number of utterances, the total duration of the conversation, the number of participants, and the number of sources.

-```{r}
+```{r report_stats}
report_stats(data)
```

### Visual quality checks

-The `plot_quality` function provides a visual check of the quality of the data, by visualizing the distribution of turn duration, and transition timing.
+The `plot_quality` function provides a visual check of the nature of the data, by visualizing the distribution of turn durations, and transition timing.

Transition timing is similar to FTO, but calculated without additional quality checks: transitions are identified when the participant changes from one turn to the next. The transition time is then calculated as the difference between the beginning of the turn of the new participant, and the end of the turn of the previous one.

-By default, `plot_quality()` will plot the quality of the entire dataset:
+By default, `plot_quality()` will plot the entire dataset:

```{r}
plot_quality(data)
@@ -75,10 +75,15 @@ plot_quality(data)

Quality plots can also be run for a specific source:

```{r}
-plot_quality(data, source = "/dutch2/DVA9M")
-
+plot_quality(data, source = "/dutch2/DVA8K")
```

+A quality plot consists of three separate visualizations, all designed to allow rapid visual inspection and spotting oddities:
+1. A density plot of turn durations. This is normally expected to look like a distribution that has a peak around 2000ms (2 seconds) and maximum lengths that do not far exceed 10000ms (10 seconds) (Liesenfeld & Dingemanse 2022). The goal of this plot is to allow eyeballing of oddities like turns of extreme durations or sets of turns with the exact same duration (unlikely in carefully segmented conversational data).
+2. A density plot of turn transition times. A plot like this is expected to look like a normal distribution centered around 0-200ms (Stivers et al. 2009). Deviations from this may signal problems in the dataset, for instance due to imprecise or automated annotation methods.
+3. A scatterplot of turn transition (x) by turn duration. This combines both distributions and is expected to look like a cloud of datapoints that is thickest in the middle region. Any standout patterns (for instance, turns whose duration is equal to their transition time) are indicative of problems in the segmentation or timing data.
+
+
## Workflow 2: Plot conversations

From ed1d2b0738ad865e67f977ea2025502e542a1cd0 Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Tue, 20 Aug 2024 14:47:10 +0200
Subject: [PATCH 04/22] updates to Workflow part B

---
 vignettes/workflow.Rmd | 29 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/vignettes/workflow.Rmd b/vignettes/workflow.Rmd
index a924c51..26a9283 100644
--- a/vignettes/workflow.Rmd
+++ b/vignettes/workflow.Rmd
@@ -86,24 +86,27 @@

## Workflow 2: Plot conversations

-Another key use of `talkr` is to visualize conversational patterns.
+Another key use of `talkr` is to visualize conversational patterns.
A first key way to do so is `geom_turn()`, a ggplot2-compatible geom that visualizes the timing and duration of turns in a conversation.

```{r geom_turn_demo_1}
library(ggplot2)

p <- data |>
  dplyr::filter(source == "/dutch2/DVA9M") |>
  dplyr::filter(end < 60000) |>
  ggplot(aes(x = end, y = participant)) +
  geom_turn(aes(
    begin = begin,
    end = end)) +
  xlab("Time (ms)") +
  ylab("") +
  theme_turnPlot()

p
```

two specific geometries are available: `geom_turn` and `geom_token`. In addition, there is a `talkr`-specific theme provided.

```{r}
library(ggplot2)

p <- data |>
  dplyr::filter(source == "/dutch2/DVA9M") |>
  dplyr::filter(end < 60000) |>
  ggplot(aes(x = end, y = participant)) +
  geom_turn(aes(
    begin = begin,
    end = end)) +
  xlab("Time (ms)") +
  ylab("") +
  theme_turnPlot()

p
```

This plot can be overlayed with plotted occurrences of tokens.

To do so, we first need to calculate the token frequencies:

```{r}
tokens <- tokenize(data)

tokens
```

Token frequencies are calculated over the entire dataset. For source-specific data, it is recommended to filter the source prior to tokenization:

```{r}
tokens <- data |>
  dplyr::filter(source == "/dutch2/DVA9M") |>
  tokenize()

tokens
```

Before we plot the tokens over the turns, we need to select the tokens we want to plot (e.g. the top 10 ranked), and the time window they occur in:

```{r}
tokenselection <- tokens |>
  dplyr::filter(relative_time < 60000) |>
  dplyr::filter(rank <= 10)
```

We can plot the tokens over the turns.

```{r}
p +
geom_token(data = tokenselection,
           aes(x = relative_time,
               y = participant,
               color = rank)) +
  viridis::scale_color_viridis(option = "plasma", direction = -1, begin = 0.2, end = 0.8)
```

From c85b33e6b2e74ae343edc010c2eb04eb658209dc Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Wed, 21 Aug 2024 09:32:31 +0200
Subject: [PATCH 05/22] keep strip label for facets

---
 R/theme_turnPlot.R | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/R/theme_turnPlot.R b/R/theme_turnPlot.R
index 1e7cdf1..d4c64e5 100644
--- a/R/theme_turnPlot.R
+++ b/R/theme_turnPlot.R
@@ -15,7 +15,7 @@ theme_turnPlot <- function(base_size = 11, base_family = "serif", ticks = TRUE)
    theme(
      legend.position = "none",
      axis.text.y = element_text(),
-      strip.text = element_blank(),
+      strip.text.x = element_text(hjust = 0, margin=margin(l=0)),
      axis.ticks.y = element_blank(),
      plot.title.position = "plot",
      complete = TRUE)

From d55c808f8ca49805da6a85413dbb01f7efc4082b Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Wed, 21 Aug 2024 09:32:47 +0200
Subject: [PATCH 06/22] + first version of plots

---
 vignettes/workflow.Rmd | 128 +++++++++++++++++++++++++++++------------
 1 file changed, 90 insertions(+), 38 deletions(-)

diff --git a/vignettes/workflow.Rmd b/vignettes/workflow.Rmd
index 0ba6848..66ad57f 100644
--- a/vignettes/workflow.Rmd
+++ b/vignettes/workflow.Rmd
@@ -28,7 +28,8 @@ We will be using the IFADV corpus as example data for the workflow of `talkr`. T

The snippet below initializes the talkr dataset using the ifadv data. For more information about the IFADV dataset, see the [repository link](https://github.com/elpaco-escience/ifadv).

```{r}
-data <- init(ifadv::ifadv)
+data <- init(ifadv::ifadv)
+
```

Essential to any `talkr` workflow is a minimal set of data fields. These are the following: \* `source`: the source conversation (a corpus can consist of multiple sources) \* `begin`: begin time (in ms) of an utterance \* `end`: end time (in ms) of an utterance \* `utterance`: content of an utterance \* `participant`: the person who produced the utterance

@@

## Workflow 2: Plot conversations

-Another key use of `talkr` is to visualize conversational patterns. A first key way to do so is `geom_turn()`, a ggplot2-compatible geom that visualizes the timing and duration of turns in a conversation.
+Another key use of `talkr` is to visualize conversational patterns.
A first way to do so is `geom_turn()`, a ggplot2-compatible geom that visualizes the timing and duration of turns in a conversation.

We can start by simply visualizing some of the conversations in the dataset. Here we sample the first four and plot the first minute of each. We display them together using `facet_wrap()` by `source`.

```{r geom_turn_demo_1}
library(ggplot2)

# we simplify participant names
conv <- data |>
  dplyr::group_by(source) |>
  dplyr::mutate(participant = as.character(factor(participant, labels=c("A","B"))))

# select first four conversations
these_sources <- unique(data$source)[1:4]

conv |>
  dplyr::filter(end < 60000, # select first 60 seconds
                source %in% these_sources) |> # filter to keep only these conversations
  ggplot(aes(x = end, y = participant)) +
  geom_turn(aes(
    begin = begin,
    end = end)) +
  xlab("Time (ms)") +
  ylab("") +
  theme_turnPlot() +
  facet_wrap(~source) # facet to show the conversations side by side
```

More often, we will want to plot a single conversation and explore it in some more detail. Let's zoom in on one of these first four. If we plot it without further tweaking, it is not the most helpful: the conversation is 15 minutes long and it is hard to appreciate its structure when we put it all on a single line.

```{r geom_turn_demo_2}

conv |>
  dplyr::filter(source == "/dutch2/DVA12S") |>
  ggplot(aes(x = end, y = participant)) +
  geom_turn(aes(
    begin = begin,
    end = end)) +
  xlab("Time (ms)") +
  ylab("") +
  theme_turnPlot()

```

So what we do is cut up the conversation into lines using `add_lines()`.
To do so, we first need to calculate the token frequencies:

```{r}
conv_tokens <- tokenize(conv)

conv_tokens
```

With information about tokens in hand, we can start asking questions. For instance, what are words that are quite frequent and that appear in utterance-initial position?

```{r}

p <- conv |>
  add_lines() |>
  dplyr::filter(source == "/dutch2/DVA12S",
                line_id < 4) |> # let's look at the first five lines
  ggplot(aes(x = line_end, y = line_participant)) +
  geom_turn(aes(
    begin = line_begin,
    end = line_end)) +
  scale_y_reverse(breaks = seq(1, max(conv_token$line_id))) +
  xlab("Time (ms)") +
  ylab("") +
  theme_turnPlot()

p +
  geom_token(aes(data=conv_tokens |> filter(source == "/dutch2/DVA12S"),
                 begin=line_begin,
                 end = line_end))


```

### Notes & orphaned text

Token frequencies are calculated over the entire dataset. If you want source-specific data, you can filter the source prior to tokenization:

```{r}
tokens_DVA9M <- data |>
  dplyr::filter(source == "/dutch2/DVA9M") |>
  tokenize()

tokens_DVA9M
```

From 8d2de919c478e3b8ec5d7e827b2cdf7908abf677 Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Wed, 21 Aug 2024 12:23:01 +0200
Subject: [PATCH 07/22] workflow updates

---
 R/theme_turnPlot.R     |  2 +-
 vignettes/workflow.Rmd | 50 +++++++++++++++++++----------------------
 2 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/R/theme_turnPlot.R b/R/theme_turnPlot.R
index d4c64e5..4b1ad77 100644
--- a/R/theme_turnPlot.R
+++ b/R/theme_turnPlot.R
@@ -13,7 +13,7 @@ theme_turnPlot <- function(base_size = 11, base_family = "serif", ticks = TRUE)
    ticks = ticks
  ) %+replace%
    theme(
-      legend.position = "none",
+#      legend.position = "none",
      axis.text.y = element_text(),
      strip.text.x = element_text(hjust = 0, margin=margin(l=0)),
      axis.ticks.y = element_blank(),
      plot.title.position = "plot",
      complete = TRUE)

diff --git a/vignettes/workflow.Rmd b/vignettes/workflow.Rmd
index 66ad57f..289874c 100644
--- a/vignettes/workflow.Rmd
+++ b/vignettes/workflow.Rmd
@@

So what we do is cut up the conversation into lines using `add_lines()`.
By defa ```{r geom_turn_demo_3} conv |> - add_lines() |> + add_lines() |> # add lines dplyr::filter(source == "/dutch2/DVA12S", - line_id < 6) |> # let's look at the first five lines + line_id < 6) |> # limit to the first five lines ggplot(aes(x = line_end, y = line_participant)) + geom_turn(aes( - begin = line_begin, + begin = line_begin, # the begin and end aesthetics are now line-relative end = line_end)) + scale_y_reverse(breaks = seq(1, max(conv_token$line_id))) + xlab("Time (ms)") + @@ -152,15 +152,31 @@ We can style a plot like this using any available variables For instance, let's ```{r step9} p + + ggtitle("Turns coloured by duration") + geom_turn(aes( begin = line_begin, end = line_end, fill=duration )) + - viridis::scale_fill_viridis(option="A") + viridis::scale_fill_viridis(option="A",direction=-1) ``` +Or we can highlight turns that are produced in overlap: + +```{r step10} + +p + + ggtitle("Turns produced in overlap") + + geom_turn(aes( + begin = line_begin, + end = line_end, + fill=overlap, + colour=overlap + )) + + +``` So far we have just visualized the temporal structure. But conversational turns typically consist of words and other elements. We can start looking into the internal structure of turns by plotting occurrence of tokens. @@ -178,15 +194,16 @@ With information about tokens in hand, we can start asking questions. For instan ```{r} -p <- conv |> - add_lines() |> +conv |> + add_lines(line_duration=15000) |> dplyr::filter(source == "/dutch2/DVA12S", - line_id < 4) |> # let's look at the first five lines + line_id < 4) |> # let's look at the first three lines ggplot(aes(x = line_end, y = line_participant)) + + scale_y_reverse(breaks = seq(1, max(conv_token$line_id))) + geom_turn(aes( begin = line_begin, - end = line_end)) + - scale_y_reverse(breaks = seq(1, max(conv_token$line_id))) + + end = line_end, + fill = nwords)) + xlab("Time (ms)") + ylab("") + theme_turnPlot() From d20e534c96714cc7d3d027622e3067c32b0768e2 Mon Sep 17 00:00:00 2001 From: mdingemanse Date: Wed, 21 Aug 2024 12:25:41 +0200 Subject: [PATCH 08/22] rename to reflect multiple workflows --- vignettes/{workflow.Rmd => workflows.Rmd} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename vignettes/{workflow.Rmd => workflows.Rmd} (98%) diff --git a/vignettes/workflow.Rmd b/vignettes/workflows.Rmd similarity index 98% rename from vignettes/workflow.Rmd rename to vignettes/workflows.Rmd index 289874c..74717d1 100644 --- a/vignettes/workflow.Rmd +++ b/vignettes/workflows.Rmd @@ -199,7 +199,7 @@ conv |> dplyr::filter(source == "/dutch2/DVA12S", line_id < 4) |> # let's look at the first three lines ggplot(aes(x = line_end, y = line_participant)) + - scale_y_reverse(breaks = seq(1, max(conv_token$line_id))) + + scale_y_reverse(breaks = seq(1, max(conv_token$line_id))) + # we reverse the axis because lines run top to bottom geom_turn(aes( begin = line_begin, end = line_end, From 5d338be08677d095ffd884495d9381e91729100a Mon Sep 17 00:00:00 2001 From: mdingemanse Date: Wed, 21 Aug 2024 12:28:21 +0200 Subject: [PATCH 09/22] add kableExtra to get nicer table output in vignette --- DESCRIPTION | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/DESCRIPTION b/DESCRIPTION index 76cc671..c588883 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -35,6 +35,7 @@ Imports: Suggests: rmarkdown, testthat (>= 3.0.0), - ifadv + ifadv, + kableExtra Remotes: git::https://github.com/elpaco-escience/ifadv.git Config/testthat/edition: 3 From aef9f72a4aa29fd876aaf41ab8e59862c4c2e8cf Mon Sep 17 00:00:00 2001 From: 
mdingemanse
Date: Wed, 21 Aug 2024 12:31:13 +0200
Subject: [PATCH 10/22] intro

---
 vignettes/workflows.Rmd | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/vignettes/workflows.Rmd b/vignettes/workflows.Rmd
index 74717d1..1ad467f 100644
--- a/vignettes/workflows.Rmd
+++ b/vignettes/workflows.Rmd
@@ -1,5 +1,5 @@
---
-title: "Basic workflow with talkr"
+title: "Basic workflows for talkr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{workflow}
@@ -21,6 +21,9 @@ knitr::opts_chunk$set(
library(talkr)
```

+# Introduction
+`talkr` is a package designed for working with conversational data in R.
+
## Loading the data

From 90d9ed38a8d879962e25ee8d159de5741c69599d Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Wed, 21 Aug 2024 13:49:16 +0200
Subject: [PATCH 11/22] ,

---
 DESCRIPTION | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/DESCRIPTION b/DESCRIPTION
index 4fbe1fc..192b793 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -35,7 +35,7 @@ Imports:
Suggests:
    rmarkdown,
    testthat (>= 3.0.0),
-    kableExtra
+    kableExtra,
    pkgdown,
    ifadv
Remotes: git::https://github.com/elpaco-escience/ifadv.git

From fb0e8766b0873332b35eb2fbcf9bbfa070136d23 Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Wed, 21 Aug 2024 18:43:30 +0200
Subject: [PATCH 12/22] plots

---
 vignettes/workflows.Rmd | 42 +++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/vignettes/workflows.Rmd b/vignettes/workflows.Rmd
index 1ad467f..4c08f28 100644
--- a/vignettes/workflows.Rmd
+++ b/vignettes/workflows.Rmd
@@ -24,13 +24,15 @@ library(talkr)
# Introduction
`talkr` is a package designed for working with conversational data in R.

-## Loading the data
+## Loading some data

We will be using the IFADV corpus as example data for the workflow of `talkr`. This is a corpus consisting of 20 dyadic conversations in Dutch, published by the Nederlandse Taalunie in 2007 ([source](https://fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFADVcorpus/))

The snippet below initializes the talkr dataset using the ifadv data. For more information about the IFADV dataset, see the [repository link](https://github.com/elpaco-escience/ifadv).

+
```{r}
+
data <- init(ifadv::ifadv)

```
@@

```{r geom_turn_demo_3}

+conv <- conv |> add_lines(line_duration = 60000) # adding lines to the dataset
+
conv |>
-  add_lines() |> # add lines
  dplyr::filter(source == "/dutch2/DVA12S",
                line_id < 6) |> # limit to the first five lines
  ggplot(aes(x = line_end, y = line_participant)) +
  geom_turn(aes(
    begin = line_begin, # the begin and end aesthetics are now line-relative
    end = line_end)) +
-  scale_y_reverse(breaks = seq(1, max(conv_token$line_id))) +
+  scale_y_reverse(breaks = seq(1, max(conv$line_id))) +
  xlab("Time (ms)") +
  ylab("") +
  theme_turnPlot()

@@

We can start looking into the internal structure of turns by plotting occurrence of tokens.

-To do so, we first need to calculate the token frequencies:
+To do so, we first need to calculate the token frequencies.
```{r}

conv_tokens <- conv |> tokenize()


```

With information about tokens in hand, we can start asking questions. For instance, what are words that are quite frequent and that appear in utterance-initial position?

```{r}

this_conv <- conv |>
  add_lines(line_duration=15000) |>
  dplyr::filter(source == "/dutch2/DVA12S",
                line_id < 5) # let's look at the first four lines

these_tokens <- conv_tokens |>
  add_lines(line_duration=15000, time_columns = "relative_time") |>
  dplyr::filter(source == "/dutch2/DVA12S",
                line_id < 5)


this_conv |>
  ggplot(aes(x = line_end, y = line_participant)) +
  scale_y_reverse() + # we reverse the axis because lines run top to bottom
  geom_turn(aes(
    begin = line_begin,
    end = line_end)) +
  geom_label(data = these_tokens |> dplyr::filter(order=="first"),
             aes(x = line_relative_time,
                 y=line_participant,
                 label=token)) +
  xlab("Time (ms)") +
  ylab("") +
  theme_turnPlot()

```

From a6c06fa8736f38e370946ffed2d2d7aa589d0a30 Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Wed, 21 Aug 2024 19:57:58 +0200
Subject: [PATCH 13/22] + final set of examples using geom_token

---
 vignettes/workflows.Rmd | 114 +++++++++++++++++++++++++++++-----------
 1 file changed, 82 insertions(+), 32 deletions(-)

diff --git a/vignettes/workflows.Rmd b/vignettes/workflows.Rmd
index 4c08f28..45f22d7 100644
--- a/vignettes/workflows.Rmd
+++ b/vignettes/workflows.Rmd
@@ -21,7 +21,7 @@ knitr::opts_chunk$set(
library(talkr)
```

-# Introduction
+## Introduction
`talkr` is a package designed for working with conversational data in R.

## Loading some data
@@

A quality plot consists of three separate visualizations, all designed to allow rapid visual inspection and spotting oddities:

1. A density plot of turn durations. This is normally expected to look like a distribution that has a peak around 2000ms (2 seconds) and maximum lengths that do not far exceed 10000ms (10 seconds) (Liesenfeld & Dingemanse 2022). The goal of this plot is to allow eyeballing of oddities like turns of extreme durations or sets of turns with the exact same duration (unlikely in carefully segmented conversational data).

2. A density plot of turn transition times. A plot like this is expected to look like a normal distribution centered around 0-200ms (Stivers et al. 2009). Deviations from this may signal problems in the dataset, for instance due to imprecise or automated annotation methods.

3. A scatterplot of turn transition (x) by turn duration (y). This combines both distributions and is expected to look like a cloud of datapoints that is thickest in the middle region. Any standout patterns (for instance, turns whose duration is equal to their transition time) are indicative of problems in the segmentation or timing data.
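
Beyond eyeballing, it can help to flag candidate problem turns programmatically. The snippet below is an illustrative sketch rather than part of the `talkr` API: it assumes only the `duration` and `FTO` columns described above, and the 10-second ceiling is a heuristic taken from the turn-duration literature cited here.

```r
# illustrative only: flag turns with implausibly long durations, or turns
# whose transition time exactly equals their duration (a segmentation red flag)
suspect_turns <- data |>
  dplyr::filter(duration > 10000 | duration == FTO)

nrow(suspect_turns)
```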
Each of the three plots can also be generated separately:

```r

plot_density(data, colname="duration", title="Turn durations",xlab="duration (ms)")

plot_density(data, colname="FTO", title="Turn transitions (FTO)",xlab="FTO (ms)")

plot_scatter(data, colname_x="FTO",colname_y="duration",title="Turn transitions and durations",xlab="transition (ms)", ylab="duration (ms)")

```

## Workflow 2: Plot conversations

Another key use of `talkr` is to visualize conversational patterns. A first way to do so is `geom_turn()`, a ggplot2-compatible geom that visualizes the timing and duration of turns in a conversation.

We can start by simply visualizing some of the conversations in the dataset. Here we sample the first four and plot the first minute of each. We display them together using `facet_wrap()` by `source`.

```{r geom_turn_demo_1}
library(ggplot2)

# we simplify participant names
conv <- data |>
  dplyr::group_by(source) |>
  dplyr::mutate(participant = as.character(factor(participant, labels=c("A","B"),ordered=T)))

# select first four conversations
these_sources <- unique(data$source)[1:4]

conv |>
  dplyr::filter(end < 60000, # select first 60 seconds
                source %in% these_sources) |> # filter to keep only these conversations
  ggplot(aes(x = end, y = participant)) +
  geom_turn(aes(
    begin = begin,
    end = end)) +
  xlab("Time (ms)") +
  ylab("") +
  theme_turnPlot() +
  facet_wrap(~source) # let's facet to show the conversations side by side
```

More often, we will want to plot a single conversation and explore it in some more detail. Let's zoom in on one of these first four. If we plot it without further tweaking, it is not the most helpful: the conversation is 15 minutes long and it is hard to appreciate its structure when we put it all on a single line.

```r

conv |>
  dplyr::filter(source == "/dutch2/DVA12S") |>
  ggplot(aes(x = begin, y = participant)) +
  geom_turn(aes(
    begin = begin,
    end = end)) +
  xlab("Time (ms)") +
  ylab("") +
  theme_turnPlot()

```

So what we do is similar to conversational transcripts: we present the conversation in a left-to-right, top-to-bottom grid. To do so, we first need to divide the long conversation into a number of shorter lines. We do this using `add_lines()`. By default, this will divide the conversation into lines of 60000ms each (1 minute), creating as many lines as needed.

For now, let's focus on the first 4 minutes, which we can do by filtering for `line_id < 5` after we've added lines.

```{r geom_turn_demo_3, fig.height=2.5}

conv <- conv |> add_lines(line_duration = 60000) # adding lines to the dataset

conv |>
  dplyr::filter(source == "/dutch2/DVA12S",
                line_id < 5) |> # limit to the first four lines
  ggplot(aes(x = line_end, y = line_participant)) +
  ggtitle("The first four minutes from DVA12S") +
  geom_turn(aes(
    begin = line_begin, # the begin and end aesthetics are now line-relative
    end = line_end)) +
  scale_y_reverse(breaks = seq(1, max(conv$line_id))) +
  xlab("Time (ms)") +
  ylab("") +
  theme_turnPlot()

p <- last_plot()

```

We can style a plot like this using any available variables.
For instance, we can highlight turns that are produced in overlap:

```{r step10, fig.height=2.5}

p +
  ggtitle("Turns produced in overlap") +
  geom_turn(aes(
    begin = line_begin,
    end = line_end,
    fill=overlap,
    colour=overlap)) +
  scale_fill_discrete(na.translate=F) + # stop NA value from showing up in legend
  scale_colour_discrete(na.translate=F) # stop NA value from showing up in legend


```

So far we have just visualized the temporal structure. But conversational turns typically consist of words and other elements.

### Looking into tokens

We can start looking into the internal structure of turns by plotting occurrence of tokens.

To do so, we first need to calculate the token frequencies.

```{r}

conv_tokens <- conv |> tokenize()


```

With information about tokens in hand, we can start asking questions. For instance, how does the relative frequency of words relate to their position in the turn?

To explore this question, let's look at a shorter excerpt: 1 minute in total, divided over 4 lines. To do this, we create a dataframe `this_conv`, dividing it into 4 lines of 15 seconds each. We also create a dataframe `these_tokens` with tokenized turn elements for the same conversation, divided up in the same way.

```{r}

this_conv <- conv |>
  add_lines(line_duration=15000) |>
  filter(source == "/dutch2/DVA12S",
         line_id < 5) # let's look at the first four lines

these_tokens <- conv_tokens |>
  add_lines(line_duration=15000, time_columns = "relative_time") |>
  filter(source == "/dutch2/DVA12S",
         line_id < 5)

this_conv |>
  ggplot(aes(x = line_end, y = line_participant)) +
  ggtitle("Relative frequency of elements within turns") +
  scale_y_reverse() + # we reverse the axis because lines run top to bottom
  geom_turn(aes(
    begin = line_begin,
    end = line_end)) +
  geom_token(data=these_tokens,
             aes(x=line_relative_time,
                 size=frequency)) +
  xlab("Time (ms)") +
  ylab("") +
  theme_turnPlot()

p <- last_plot()

```

Finally, we can also print the content of some of the elements.
Here, we pick the most frequent turn-initial elements for plotting, highlight them with another layer of `geom_token()` and plot the text using `geom_label_repel()`:

```{r}

these_tokens_first <- these_tokens |>
  filter(order=="first",
         rank < 10)

p +
  ggtitle("Some frequent turn-initial elements") +
  geom_token(data=these_tokens_first,
             aes(x=line_relative_time),
             color="red") +
  ggrepel::geom_label_repel(data=these_tokens_first,
                            aes(x=line_relative_time,
                                label=token),
                            direction="y")

```

### Notes

Token frequencies are calculated over the entire dataset. If you want source-specific data, you can filter the source prior to tokenization:

```r
tokens_DVA9M <- data |>
  filter(source == "/dutch2/DVA9M") |>
  tokenize()

tokens_DVA9M
```

## References

* Liesenfeld, Andreas, and Mark Dingemanse. 2022. 'Building and Curating Conversational Corpora for Diversity-Aware Language Science and Technology'. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 1178-92. Marseille. doi:10.48550/arXiv.2203.03399.
* Stivers, Tanya, N. J. Enfield, Penelope Brown, C. Englert, Makoto Hayashi, Trine Heinemann, Gertie Hoymann, Federico Rossano, J. P. de Ruiter, Kyung-Eun Yoon, and Stephen C. Levinson. 2009. 'Universals and Cultural Variation in Turn-Taking in Conversation'. Proceedings of the National Academy of Sciences 106 (26): 10587-92. doi:10.1073/pnas.0903616106.

From 08a4429f78f03feb567f02b4760ea61514599bea Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Wed, 21 Aug 2024 22:53:06 +0200
Subject: [PATCH 14/22] Suggests: we don't currently use kableExtra but we do use ggrepel

---
 DESCRIPTION | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/DESCRIPTION b/DESCRIPTION
index 2ad0185..8542920 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -31,9 +31,9 @@ Imports:
Suggests:
    rmarkdown,
    testthat (>= 3.0.0),
-    kableExtra,
+    ifadv,
    pkgdown,
    viridis,
-    ifadv
+    ggrepel
Remotes: git::https://github.com/elpaco-escience/ifadv.git
Config/testthat/edition: 3

From 72d32f569221d448062629f900eed8296b24d217 Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Thu, 22 Aug 2024 08:25:38 +0200
Subject: [PATCH 15/22] Update R/theme_turnPlot.R

Co-authored-by: Barbara Vreede
---
 R/theme_turnPlot.R | 1 -
 1 file changed, 1 deletion(-)

diff --git a/R/theme_turnPlot.R b/R/theme_turnPlot.R
index 4b1ad77..da0cd88 100644
--- a/R/theme_turnPlot.R
+++ b/R/theme_turnPlot.R
@@ -13,7 +13,6 @@ theme_turnPlot <- function(base_size = 11, base_family = "serif", ticks = TRUE)
    ticks = ticks
  ) %+replace%
    theme(
-# legend.position = "none",
      axis.text.y = element_text(),
      strip.text.x = element_text(hjust = 0, margin=margin(l=0)),
      axis.ticks.y = element_blank(),
      plot.title.position = "plot",
      complete = TRUE)

From f867c58070ea4a110435ca93b5b07482e6a9ce6c Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Thu, 22 Aug 2024 08:25:57 +0200
Subject: [PATCH 16/22] change to list

Co-authored-by: Barbara Vreede
---
 vignettes/workflows.Rmd | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/vignettes/workflows.Rmd b/vignettes/workflows.Rmd
index 45f22d7..c7bcbed 100644
--- a/vignettes/workflows.Rmd
+++ b/vignettes/workflows.Rmd
@@ -37,7 +37,13 @@ data <- init(ifadv::ifadv)

```

-Essential to any `talkr` workflow is a minimal set of data fields. These are the following: \* `source`: the source conversation (a corpus can consist of multiple sources) \* `begin`: begin time (in ms) of an utterance \* `end`: end time (in ms) of an utterance \* `utterance`: content of an utterance \* `participant`: the person who produced the utterance
+Essential to any `talkr` workflow is a minimal set of data fields.
These are the following:
+
+- `source`: the source conversation (a corpus can consist of multiple sources)
+- `begin`: begin time (in ms) of an utterance
+- `end`: end time (in ms) of an utterance
+- `utterance`: content of an utterance
+- `participant`: the person who produced the utterance

The `init()` function takes these minimal fields and generates a few more based on them. These are: \* `uid`: a unique identifier at utterance-level, used to identify, select and filter specific utterances \* `duration`: the duration (in ms) of the utterance, generated by subtracting `begin` from `end`

From b582a27febe0123ceb0700d769302a890cb10b93 Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Thu, 22 Aug 2024 08:29:27 +0200
Subject: [PATCH 17/22] simplify

Co-authored-by: Barbara Vreede
---
 vignettes/workflows.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/workflows.Rmd b/vignettes/workflows.Rmd
index c7bcbed..5108881 100644
--- a/vignettes/workflows.Rmd
+++ b/vignettes/workflows.Rmd
@@ -157,7 +157,7 @@ For now, let's focus on the first 4 minutes, which we can do by filtering for `l

```{r geom_turn_demo_3, fig.height=2.5}

-conv <- conv |> add_lines(line_duration = 60000) # adding lines to the dataset
+conv <- conv |> add_lines(line_duration = 60000)

conv |>
  filter(source == "/dutch2/DVA12S",

From f47693cf20954619fece26cb49880568d35958dd Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Thu, 22 Aug 2024 11:03:14 +0200
Subject: [PATCH 18/22] fix init description

---
 vignettes/workflows.Rmd | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/vignettes/workflows.Rmd b/vignettes/workflows.Rmd
index 5108881..e968c55 100644
--- a/vignettes/workflows.Rmd
+++ b/vignettes/workflows.Rmd
@@ -45,7 +45,7 @@ Essential to any `talkr` workflow is a minimal set of data fields. These are the
- `utterance`: content of an utterance
- `participant`: the person who produced the utterance

-The `init()` function takes these minimal fields and generates a few more based on them. These are: \* `uid`: a unique identifier at utterance-level, used to identify, select and filter specific utterances \* `duration`: the duration (in ms) of the utterance, generated by subtracting `begin` from `end`
+The `init()` function takes these minimal fields and generates a `uid`: a unique identifier at utterance-level that can be used as a reference to select and filter specific utterances.

The `init()` function can be used to rename columns if necessary.
For example, if the column `participant` is named `speaker`, we can rename it as follows:
@@ -282,6 +282,12 @@

### Notes

+The `init` function can also be used to reformat timestamps. Default is "ms", which expects milliseconds. '%H:%M:%OS' will format e.g. 00:00:00.010 to milliseconds (10). See '?strptime' for more format examples.
+
+```r
+init(format_timestamps="ms")
+```
+
Token frequencies are calculated over the entire dataset.
If you want source-specific data, you can filter the source prior to tokenization:

```r
tokens_DVA9M <- data |>
  dplyr::filter(source == "/dutch2/DVA9M") |>
  tokenize()

tokens_DVA9M
```

From 2cb511858861beabc727f53dbd32a676d172a6fc Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Thu, 22 Aug 2024 11:10:59 +0200
Subject: [PATCH 19/22] load dplyr and remove dplyr:: from function calls in
 sample code

---
 vignettes/workflows.Rmd | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/vignettes/workflows.Rmd b/vignettes/workflows.Rmd
index e968c55..264a007 100644
--- a/vignettes/workflows.Rmd
+++ b/vignettes/workflows.Rmd
@@ -113,17 +113,18 @@ We can start by simply visualizing some of the conversations in the dataset. Her

```{r geom_turn_demo_1}
library(ggplot2)
+library(dplyr)

# we simplify participant names
conv <- data |>
-  dplyr::group_by(source) |>
-  dplyr::mutate(participant = as.character(factor(participant, labels=c("A","B"),ordered=T)))
+  group_by(source) |>
+  mutate(participant = as.character(factor(participant, labels=c("A","B"),ordered=T)))

# select first four conversations
these_sources <- unique(data$source)[1:4]

conv |>
-  dplyr::filter(end < 60000, # select first 60 seconds
+  filter(end < 60000, # select first 60 seconds
         source %in% these_sources) |> # filter to keep only these conversations
  ggplot(aes(x = end, y = participant)) +
@@

conv |>
-  dplyr::filter(source == "/dutch2/DVA12S") |>
+  filter(source == "/dutch2/DVA12S") |>
  ggplot(aes(x = begin, y = participant)) +
@@

conv |>
-  dplyr::filter(source == "/dutch2/DVA12S",
+  filter(source == "/dutch2/DVA12S",
         line_id < 5) |> # limit to the first four lines
  ggplot(aes(x = line_end, y = line_participant)) +
@@

this_conv <- conv |>
  add_lines(line_duration=15000) |>
-  dplyr::filter(source == "/dutch2/DVA12S",
+  filter(source == "/dutch2/DVA12S",
         line_id < 5) # let's look at the first four lines

these_tokens <- conv_tokens |>
  add_lines(line_duration=15000, time_columns = "relative_time") |>
-  dplyr::filter(source == "/dutch2/DVA12S",
+  filter(source == "/dutch2/DVA12S",
         line_id < 5)
@@

these_tokens_first <- these_tokens |>
-  dplyr::filter(order=="first",
+  filter(order=="first",
         rank < 10)
@@

Token frequencies are calculated over the entire dataset.
If you want source-specific data, you can filter the source prior to tokenization:

```r
tokens_DVA9M <- data |>
-  dplyr::filter(source == "/dutch2/DVA9M") |>
+  filter(source == "/dutch2/DVA9M") |>
  tokenize()

tokens_DVA9M
```

From ded2d8722334aa2ebd483993589a3fc5017e4cd4 Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Thu, 22 Aug 2024 11:12:44 +0200
Subject: [PATCH 20/22] begin > end

---
 vignettes/workflows.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/workflows.Rmd b/vignettes/workflows.Rmd
index 264a007..fd8e318 100644
--- a/vignettes/workflows.Rmd
+++ b/vignettes/workflows.Rmd
@@ -142,7 +142,7 @@ More often, we will want to plot a single conversation and explore it in some mo

conv |>
  filter(source == "/dutch2/DVA12S") |>
-  ggplot(aes(x = begin, y = participant)) +
+  ggplot(aes(x = end, y = participant)) +
  geom_turn(aes(
    begin = begin,
    end = end)) +

From 2ac722fe6791978fa7f910413818afc2e4d73baf Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Thu, 22 Aug 2024 11:15:55 +0200
Subject: [PATCH 21/22] rm viridis

closes #103
---
 DESCRIPTION | 1 -
 1 file changed, 1 deletion(-)

diff --git a/DESCRIPTION b/DESCRIPTION
index 8542920..fb93368 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -33,7 +33,6 @@ Suggests:
    testthat (>= 3.0.0),
    ifadv,
    pkgdown,
-    viridis,
    ggrepel
Remotes: git::https://github.com/elpaco-escience/ifadv.git
Config/testthat/edition: 3

From 458eee579bc0ca785f78abcc4946b51354cb0888 Mon Sep 17 00:00:00 2001
From: mdingemanse
Date: Thu, 22 Aug 2024 11:19:29 +0200
Subject: [PATCH 22/22] provide more info on `tokenize`, closes #102

---
 vignettes/workflows.Rmd | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/vignettes/workflows.Rmd b/vignettes/workflows.Rmd
index fd8e318..a851861 100644
--- a/vignettes/workflows.Rmd
+++ b/vignettes/workflows.Rmd
@@ -213,7 +213,10 @@ We can start looking into the internal structure of turns by plotting occurrence

-To do so, we first need to calculate the token frequencies.
+To do so, we first need to generate a token-specific dataframe with `tokenize()`.
+This calculates token frequencies for all tokens in the selected dataset (all data by default).
+It also calculates relative positions in time for individual tokens in a turn.
+Finally, it provides a simple positional classification into `only` (the token appears on its own), `first` (the token is turn-initial), `last` (the token is turn-final), and `middle` (the token is neither first nor last).

```{r}
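# a minimal sketch of how this chunk might continue (an assumption, not the
# committed code): tokenize the `conv` dataset prepared above and inspect
# the most frequent turn-initial tokens using the columns described in the text
conv_tokens <- conv |> tokenize()

conv_tokens |>
  filter(order == "first") |> # keep turn-initial tokens only
  arrange(rank)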