Mandarin (Taiwan) WG full Child-by-Word data issue #328

liao961120 · 2024-12-03T03:10:47Z

Hi there,

Thank you for making CDI data publicly available!

I am currently working with the Mandarin (Taiwan) CDI data and have encountered some issues.

Specifically, I plotted the proportions of children who understand (i.e., understand or produce) a word across all age groups. I expected the resulting distribution to be roughly monotonically increasing, as shown in the plot below (created using the American English WG data). However, I found that many words in the Mandarin (Taiwan) WG data exhibit U-shaped distributions.

I suspect there may be errors in the Mandarin WG data, potentially stemming from coding errors in the original contributed data or pre-processing issues before it was uploaded to the Wordbank project.

Do the maintainers of the Wordbank project have any insights into this?

Below is the code for reproducing what I've mentioned:

```r
library(dplyr)
library(tidyr)

# Proportion of children, per age group, whose response to each word is in `values`
get_age_distr_dat = function(fp, values=c("understands","produces") ) {
    d = readr::read_csv(fp) %>%
        filter(item_kind == "word") %>%
        select(child_id, item_definition, english_gloss, age, value)
    # Number of children in each age group (denominator of the proportions)
    base_age_num = d %>% select(child_id, age) %>% distinct() %>% .$age %>% table()
    d %>%
        filter(value %in% {{ values }}) %>%
        ungroup() %>%
        group_by(age, english_gloss, item_definition) %>%
        summarise(n = n()) %>%
        ungroup() %>%
        mutate(prop = n / base_age_num[as.character(age)]) %>%
        arrange(age, desc(prop))
}
# Plot one vertical bar per age group for a single word
plot_word = function(d, english_gloss, ...) {
    dat = d %>%
        filter(english_gloss == {{ english_gloss }}) %>%
        arrange(age)

    plot( 1, type="n", xlim=c(0,nrow(dat)), ylim=c(0,1), xaxt="n",
          xlab="Age", ylab="Proportion", ... )
    for (i in 1:nrow(dat)) {
        x_ = rep(i,2)
        y_ = c(0, dat$prop[i])
        lines( x_, y_, lwd=7, col=2)
    }
    axis(1, at=1:nrow(dat), labels=dat$age)
}

# CDI data retrieved from <https://wordbank.stanford.edu/data/?name=instrument_data> (Full Child-by-Word Data)
temp = tempfile()
download.file("https://raw.githubusercontent.com/liao961120/MCDI/refs/heads/main/raw/issue/wordbank_instrument_data.zip",temp)
dat = list(
    "Mandarin (Taiwan)"  = unz(temp, "wordbank_instrument_data_MandarinTW_WG.csv") %>% get_age_distr_dat(),
    "English (American)" = unz(temp, "wordbank_instrument_data_Eng_WG.csv") %>% get_age_distr_dat()
)

# English (American) WG data
dat[["English (American)"]]
#> # A tibble: 4,356 × 5
#>      age english_gloss    item_definition      n prop       
#>    <dbl> <chr>            <chr>            <int> <table[1d]>
#>  1     8 mommy*           mommy*             197 0.7725490  
#>  2     8 daddy*           daddy*             182 0.7137255  
#>  3     8 child's own name child's own name   169 0.6627451  
#>  4     8 hi               hi                 120 0.4705882  
#>  5     8 peekaboo         peekaboo           112 0.4392157  
#>  6     8 grandma*         grandma*           104 0.4078431  
#>  7     8 no               no                 101 0.3960784  
#>  8     8 bye              bye                 98 0.3843137  
#>  9     8 ball             ball                94 0.3686275  
#> 10     8 bath             bath                94 0.3686275  
#> # ℹ 4,346 more rows

# Mandarin (Taiwan) WG data
dat[["Mandarin (Taiwan)"]]
#> # A tibble: 3,182 × 5
#>      age english_gloss    item_definition     n prop       
#>    <dbl> <chr>            <chr>           <int> <table[1d]>
#>  1     8 mommy            媽媽/媽咪          83 0.9651163  
#>  2     8 milk             牛奶(ㄋㄟㄋㄟ)     82 0.9534884  
#>  3     8 bye or bye bye   再見(byebye)       81 0.9418605  
#>  4     8 clap             拍(手)             81 0.9418605  
#>  5     8 daddy            爸爸/爹地          81 0.9418605  
#>  6     8 clap             拍拍手             77 0.8953488  
#>  7     8 hug              抱(抱)             77 0.8953488  
#>  8     8 child's own name 自己的名字         73 0.8488372  
#>  9     8 do not           不要               72 0.8372093  
#> 10     8 water            水(茶茶)           72 0.8372093  
#> # ℹ 3,172 more rows

#### Plot words ####
par(mfrow=c(2,4))
for (nm in names(dat)) {
    for (word in c("ball", "mouth", "dog", "nose") )
        plot_word(dat[[nm]], word, main=paste0(nm, ": ", word) )
}
```
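The visual impression can be backed by a simple numeric check. The sketch below is not part of the reprex above: it assumes a data frame with the same `age`, `english_gloss`, and `prop` columns that `get_age_distr_dat()` returns, and the helper name `flag_non_monotone()`, its `tol` argument, and the toy values are all hypothetical. It flags words whose proportion falls between consecutive age groups by more than the tolerance.

```r
# Hypothetical helper: return the words whose proportion drops between
# consecutive age groups by more than `tol`.
flag_non_monotone <- function(d, tol = 0.05) {
    d <- d[order(d$english_gloss, d$age), ]
    # Largest drop between consecutive ages, computed separately per word
    max_drop <- vapply(
        split(as.numeric(d$prop), d$english_gloss),
        function(p) if (length(p) < 2) 0 else max(0, -min(diff(p))),
        numeric(1)
    )
    names(max_drop)[max_drop > tol]
}

# Toy values for illustration only (not Wordbank data)
toy <- data.frame(
    age           = rep(c(8, 12, 16), times = 2),
    english_gloss = rep(c("dog", "ball"), each = 3),
    prop          = c(0.80, 0.40, 0.70,   # "dog": dips, then recovers
                      0.20, 0.45, 0.75)   # "ball": increases with age
)
flagged <- flag_non_monotone(toy)
print(flagged)
```

With real data, the tolerance would need tuning, since small age groups produce noisy proportions even for well-behaved items.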

mcfrank · 2024-12-03T17:50:03Z

Hi, we noticed this too. It's highlighted in the Wordbank book in the section on "difficult datasets": https://langcog.github.io/wordbank-book/methods-and-data.html#difficult-datasets

We think that most likely the parents of young children didn't understand what it means to mark "comprehension"; we've seen this pattern before in other WG datasets.

…

liao961120 · 2024-12-04T09:15:49Z

Thank you for your prompt reply! The over-responding of parents of particularly young children seems like a plausible explanation for the observed pattern.

However, I'm still concerned about the coding of the raw data found at https://github.com/langcog/wordbank/tree/master/raw_data/Mandarin_Taiwanese_WG. Specifically, I noticed the following:

- Some of the item responses are missing (NA), so three kinds of values are found: NA, 0, and 1.
- All four response combinations are observed for many items. For instance, among the 757 subjects in the WG dataset, the counts of the response combinations for the "dog" item (i.e., the d01_01u and d01_01p columns in MandarinTaiwaneseWG_Liu_data.csv) are as follows (with NAs recoded as -1):
```
        produces
understands  -1   0   1
        -1   0   9   3
        0    4 159  40
        1    8 437  97
```
In the processed version of the data (downloaded from the Wordbank website), the above table reduces to:
```
produces understands 
     140         445 
```

Given the design of the items in the WG form, isn't it only possible to observe a maximum of three response combinations ((produce=1, understand=1), (produce=0, understand=1), and (produce=0, understand=0))? Or have I misunderstood something?

I'd greatly appreciate any clarification or insights you could provide. Thank you for your time and assistance!

Code and data for reproducing what I've mentioned:

[Mandarin_Taiwanese_WG].csv
MandarinTaiwaneseWG_Liu_data.csv
wordbank_instrument_data_MandarinWG.csv

```r
library(dplyr)
# Raw data at https://github.com/langcog/wordbank/tree/master/raw_data/Mandarin_Taiwanese_WG
d_raw = readr::read_csv('MandarinTaiwaneseWG_Liu_data.csv')
for ( c_ in names(d_raw)[17:724] )  # Recode NA responses as -1
    d_raw[[c_]] = ifelse(is.na(d_raw[[c_]]), -1L, d_raw[[c_]])
str(d_raw)
#> spc_tbl_ [757 × 724] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ sub     : chr [1:757] "0001" "0002" "0003" "0004" ...
#>  $ age     : num [1:757] 16 16 16 16 16 16 16 16 16 16 ...
#>  $ sex     : num [1:757] 2 2 2 2 2 2 2 1 1 1 ...
#>  $ sib     : num [1:757] 0 0 0 1 0 1 1 0 0 0 ...
#>  $ bir     : num [1:757] 1 1 1 2 1 2 2 1 1 1 ...
#>  $ way     : num [1:757] 1 2 2 1 1 2 1 1 2 2 ...
#>  $ con     : num [1:757] 1 1 3 1 1 1 1 1 1 1 ...
#>  $ wei     : num [1:757] 3270 3100 3960 3200 2979 ...
#>  $ dis     : num [1:757] 0 0 0 0 1 0 0 0 0 0 ...
#>  $ filler_r: num [1:757] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ mot_age : num [1:757] 39 33 31 40 42 28 35 32 34 39 ...
#>  $ mot_edu : num [1:757] 14 16 16 18 14 12 16 16 14 16 ...
#>  $ fat_age : num [1:757] 44 32 32 42 42 28 37 35 45 43 ...
#>  $ fat_edu : num [1:757] 16 16 16 18 14 9 16 16 16 18 ...
#>  $ fam     : num [1:757] 3 2 2 3 1 2 3 2 3 3 ...
#>  $ inc     : num [1:757] 4 4 5 5 6 3 3 5 2 6 ...
#>  $ d01_01u : num [1:757] 0 0 0 1 1 0 1 1 0 0 ...
#>  $ d01_02u : num [1:757] 0 0 0 1 0 0 0 0 0 0 ...
#>  $ d01_03u : num [1:757] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ d01_04u : num [1:757] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ d01_05u : num [1:757] 0 0 0 1 0 1 0 1 0 0 ...
#>  $ d01_06u : num [1:757] 0 0 0 1 0 0 0 1 0 0 ...
#>  $ d01_07u : num [1:757] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ d01_08u : num [1:757] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ d01_09u : num [1:757] 0 0 0 0 0 1 0 0 0 0 ...
#>  $ d01_10u : num [1:757] 0 0 0 1 0 0 0 0 0 0 ...
#>  $ d01_11u : num [1:757] 0 0 0 0 0 0 0 0 0 0 ...
#>  ...

# Wordbank Child-by-item data
d_wb = readr::read_csv("wordbank_instrument_data_MandarinWG.csv") %>%
    select(child_id, item_definition, english_gloss, age, value)
str(d_wb)
#> tibble [267,978 × 5] (S3: tbl_df/tbl/data.frame)
#>  $ child_id       : num [1:267978] 70719 70719 70719 70719 70719 ...
#>  $ item_definition: chr [1:267978] "小狗(汪汪)" "小貓(喵喵)" "魚" "小鳥" ...
#>  $ english_gloss  : chr [1:267978] "dog" "cat" "fish" "bird" ...
#>  $ age            : num [1:267978] 16 16 16 16 16 16 16 16 16 16 ...
#>  $ value          : chr [1:267978] NA NA NA NA ...

# "dog" results (Wordbank processed data)
dog.wb = d_wb %>% filter(english_gloss == "dog")
table(dog.wb$value)
#> 
#>    produces understands 
#>         140         445

# "dog" results (raw data, -1s are NAs)
table(understands=d_raw$d01_01u, produces=d_raw$d01_01p)
#>             produces
#>  understands  -1   0   1
#>           -1   0   9   3
#>           0    4 159  40
#>           1    8 437  97
```

Created on 2024-12-04 with reprex v2.1.1

mcfrank · 2024-12-04T20:13:30Z

Thanks for all this. The raw data for Taiwanese Mandarin is what we were
given by the original investigators. It does look like there is some
inconsistency (produces without understands checked) and missing data.

We don't recommend that investigators use those raw data, because we do a
lot of processing in the Wordbank import process.

When data show production, the wordbankr R package coerces the data to show
comprehension as well, since the "dogma" of the CDI is that "produces" implies "understands." (This is probably not true 100% of the time, but it's a good approximation).

```r
library(dplyr)  # for filter() and summarise()

d_m <- wordbankr::get_instrument_data(language = "Mandarin (Taiwanese)",
                                      form = "WG", item_info = TRUE)

d_m |>
  filter(english_gloss == "dog") |>
  summarise(produces = sum(produces),
            understands = sum(understands))
```

This code returns the desired values, I believe:

```
# A tibble: 1 × 2
  produces understands
     <int>       <int>
1      140         585
```

We recommend working with the columns produces and understands rather
than the value column.
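The coercion can be checked by hand against the raw "dog" cross-tabulation quoted earlier in this thread. The sketch below is only arithmetic on those counts, not the Wordbank import code; the matrix layout and variable names are assumptions made for illustration, and NA (coded -1) is treated as "not checked". Counting a child as understanding whenever either box is ticked is consistent with both the 140 / 585 totals and the 140 / 445 split in the `value` column.

```r
# Raw "dog" counts from the thread: rows = understands, columns = produces,
# with "-1" standing for NA.
raw <- matrix(
    c(0,   9,  3,
      4, 159, 40,
      8, 437, 97),
    nrow = 3, byrow = TRUE,
    dimnames = list(understands = c("-1", "0", "1"),
                    produces    = c("-1", "0", "1"))
)

# Children with "produces" ticked, whatever "understands" says
produces_n <- sum(raw[, "1"])
# Children with "understands" ticked but "produces" not ticked (0 or NA)
understands_only_n <- sum(raw["1", c("-1", "0")])
# "produces" implies "understands", so the two groups are added
understands_n <- produces_n + understands_only_n

print(c(produces         = produces_n,
        understands_only = understands_only_n,
        understands      = understands_n))
```

Here `understands_only_n` corresponds to the rows labelled "understands" in the `value` column, which is why a plain `table()` on `value` undercounts comprehension.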
