Mandarin (Taiwan) WG full Child-by-Word data issue #328

Open
liao961120 opened this issue Dec 3, 2024 · 3 comments
@liao961120

Hi there,

Thank you for making CDI data publicly available!

I am currently working with the Mandarin (Taiwan) CDI data and have encountered some issues.

Specifically, I plotted the proportions of children who understand (i.e., understand or produce) a word across all age groups. I expected the resulting distribution to be roughly monotonically increasing, as shown in the plot below (created using the American English WG data). However, I found that many words in the Mandarin (Taiwan) WG data exhibit U-shaped distributions.

I suspect there may be errors in the Mandarin WG data, potentially stemming from coding errors in the original contributed data or pre-processing issues before it was uploaded to the Wordbank project.

Do the maintainers of the Wordbank project have any insights into this?

[Image: plot referenced above]


Below is the code for reproducing what I've mentioned:

library(dplyr)
library(tidyr)

# Proportion of children, per age and word, whose response is one of `values`.
get_age_distr_dat = function(fp, values=c("understands","produces") ) {
    d = readr::read_csv(fp) %>% 
        filter(item_kind == "word") %>% 
        select(child_id, item_definition, english_gloss, age, value)
    # Number of distinct children at each age (the denominator of `prop`)
    base_age_num = d %>% select(child_id, age) %>% distinct() %>% .$age %>% table()
    d %>% 
        filter(value %in% .env$values) %>% 
        group_by(age, english_gloss, item_definition) %>% 
        summarise(n = n(), .groups = "drop") %>% 
        mutate(prop = n / base_age_num[as.character(age)]) %>% 
        arrange(age, desc(prop))
}
# Draw one vertical bar per age showing the proportion for a single word.
plot_word = function(d, english_gloss, ...) {
    dat = d %>% 
        # `.env$` picks the function argument, not the column of the same name
        filter(english_gloss == .env$english_gloss) %>% 
        arrange(age)
    
    plot( 1, type="n", xlim=c(0,nrow(dat)), ylim=c(0,1), xaxt="n", 
          xlab="Age", ylab="Proportion", ... )
    for (i in seq_len(nrow(dat))) {
        x_ = rep(i,2)
        y_ = c(0, dat$prop[i])
        lines( x_, y_, lwd=7, col=2)
    }
    axis(1, at=1:nrow(dat), labels=dat$age)
}

# CDI data retrieved from <https://wordbank.stanford.edu/data/?name=instrument_data> (Full Child-by-Word Data)
temp = tempfile()
download.file("https://raw.githubusercontent.com/liao961120/MCDI/refs/heads/main/raw/issue/wordbank_instrument_data.zip",temp)
dat = list(
    "Mandarin (Taiwan)"  = unz(temp, "wordbank_instrument_data_MandarinTW_WG.csv") %>% get_age_distr_dat(),
    "English (American)" = unz(temp, "wordbank_instrument_data_Eng_WG.csv") %>% get_age_distr_dat()
)

# English (American) WG data
dat[["English (American)"]]
#> # A tibble: 4,356 × 5
#>      age english_gloss    item_definition      n prop       
#>    <dbl> <chr>            <chr>            <int> <table[1d]>
#>  1     8 mommy*           mommy*             197 0.7725490  
#>  2     8 daddy*           daddy*             182 0.7137255  
#>  3     8 child's own name child's own name   169 0.6627451  
#>  4     8 hi               hi                 120 0.4705882  
#>  5     8 peekaboo         peekaboo           112 0.4392157  
#>  6     8 grandma*         grandma*           104 0.4078431  
#>  7     8 no               no                 101 0.3960784  
#>  8     8 bye              bye                 98 0.3843137  
#>  9     8 ball             ball                94 0.3686275  
#> 10     8 bath             bath                94 0.3686275  
#> # ℹ 4,346 more rows

# Mandarin (Taiwan) WG data
dat[["Mandarin (Taiwan)"]]
#> # A tibble: 3,182 × 5
#>      age english_gloss    item_definition     n prop       
#>    <dbl> <chr>            <chr>           <int> <table[1d]>
#>  1     8 mommy            媽媽/媽咪          83 0.9651163  
#>  2     8 milk             牛奶(ㄋㄟㄋㄟ)     82 0.9534884  
#>  3     8 bye or bye bye   再見(byebye)       81 0.9418605  
#>  4     8 clap             拍(手)             81 0.9418605  
#>  5     8 daddy            爸爸/爹地          81 0.9418605  
#>  6     8 clap             拍拍手             77 0.8953488  
#>  7     8 hug              抱(抱)             77 0.8953488  
#>  8     8 child's own name 自己的名字         73 0.8488372  
#>  9     8 do not           不要               72 0.8372093  
#> 10     8 water            水(茶茶)           72 0.8372093  
#> # ℹ 3,172 more rows

#### Plot words ####
par(mfrow=c(2,4))
for (nm in names(dat)) {
    for (word in c("ball", "mouth", "dog", "nose") )
        plot_word(dat[[nm]], word, main=paste0(nm, ": ", word) )
}

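[Image: output of the plotting code above, a 2 × 4 grid of panels showing "ball", "mouth", "dog", and "nose" for each language; x-axis "Age", y-axis "Proportion"]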
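A quick way to screen every word for the pattern described above, rather than plotting words one at a time, is to look at the largest drop in proportion between consecutive ages. The following is only a sketch on toy data shaped like the output of `get_age_distr_dat()`; the helper name `largest_drop` and the 0.1 threshold are illustrative assumptions, not part of the original analysis.

```r
# Toy data in the same shape as the output of get_age_distr_dat():
# one row per (age, english_gloss) with a proportion column.
toy <- data.frame(
    age           = rep(c(8, 10, 12, 14, 16), times = 2),
    english_gloss = rep(c("ball", "dog"), each = 5),
    prop          = c(0.10, 0.20, 0.35, 0.50, 0.70,   # increasing
                      0.80, 0.40, 0.30, 0.45, 0.75)   # U-shaped
)

# Hypothetical helper: for each word, the largest decrease in proportion
# between consecutive ages (0 if the curve never decreases).
largest_drop <- function(d) {
    words <- split(d, d$english_gloss)
    sapply(words, function(w) {
        p <- as.numeric(w$prop[order(w$age)])
        if (length(p) < 2) return(0)
        max(0, -min(diff(p)))
    })
}

drops   <- largest_drop(toy)
flagged <- names(drops)[drops > 0.1]   # 0.1 is an arbitrary illustrative threshold
print(drops)
print(flagged)
```

With real data, small drops are expected from sampling noise alone, so any threshold should be chosen with the per-age sample sizes in mind.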

@mcfrank
Member

mcfrank commented Dec 3, 2024 via email

@liao961120
Author

Thank you for your prompt reply! The over-responding of parents of particularly young children seems like a plausible explanation for the observed pattern.

However, I'm still concerned about the coding of the raw data found at https://github.com/langcog/wordbank/tree/master/raw_data/Mandarin_Taiwanese_WG. Specifically, I noticed the following:

  1. Some item responses are missing (NA), so three kinds of values occur: NA, 0, and 1.

  2. All four response combinations are observed for many items. For instance, among the 757 subjects in the WG dataset, the counts of the response combinations for the "dog" item (i.e., the d01_01u and d01_01p columns in MandarinTaiwaneseWG_Liu_data.csv) are as follows (with NAs recoded as -1):

            produces
    understands  -1   0   1
            -1   0   9   3
            0    4 159  40
            1    8 437  97
    

    In the processed version of the data (downloaded from the Wordbank website), the above table reduces to:

    produces understands 
         140         445 
    

Given the design of the items in the WG form, isn't it only possible to observe a maximum of three response combinations ((produce=1, understand=1), (produce=0, understand=1), and (produce=0, understand=0))? Or have I misunderstood something?

I'd greatly appreciate any clarification or insights you could provide. Thank you for your time and assistance!
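To see how widespread the unexpected combination (produce=1, understand=0) is, the same check can be run over every item rather than "dog" alone. The sketch below uses a small toy data frame in place of the raw file and assumes that every item has a pair of columns ending in `u` and `p`, as the `d01_01u`/`d01_01p` pair does; that naming assumption should be verified against the full file.

```r
# Toy stand-in for the raw file: columns ending in "u" (understands) and
# "p" (produces) for the same item, coded 0/1 with NA for missing.
raw <- data.frame(
    d01_01u = c(1, 1, 0, 0, NA, 1),
    d01_01p = c(1, 0, 0, 1, 0, NA),
    d01_02u = c(0, 1, 1, 0, 0, 1),
    d01_02p = c(0, 0, 1, 0, 0, 0)
)

u_cols <- grep("u$", names(raw), value = TRUE)
p_cols <- sub("u$", "p", u_cols)
stopifnot(all(p_cols %in% names(raw)))   # every "u" column needs a "p" partner

# Per item: rows with production marked but comprehension unmarked,
# and rows where either response is missing.
summary_tbl <- data.frame(
    item          = sub("u$", "", u_cols),
    produces_only = mapply(function(u, p) sum(raw[[p]] == 1 & raw[[u]] == 0, na.rm = TRUE),
                           u_cols, p_cols),
    any_missing   = mapply(function(u, p) sum(is.na(raw[[u]]) | is.na(raw[[p]])),
                           u_cols, p_cols),
    row.names     = NULL
)
print(summary_tbl)
```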


Code and data for reproducing what I've mentioned:

[Mandarin_Taiwanese_WG].csv
MandarinTaiwaneseWG_Liu_data.csv
wordbank_instrument_data_MandarinWG.csv

library(dplyr)
# Raw data at https://github.com/langcog/wordbank/tree/master/raw_data/Mandarin_Taiwanese_WG
d_raw = readr::read_csv('MandarinTaiwaneseWG_Liu_data.csv')
for ( c_ in names(d_raw)[17:724] )  # Recode NA responses as -1
    d_raw[[c_]] = ifelse(is.na(d_raw[[c_]]), -1L, d_raw[[c_]])
str(d_raw)
#> spc_tbl_ [757 × 724] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ sub     : chr [1:757] "0001" "0002" "0003" "0004" ...
#>  $ age     : num [1:757] 16 16 16 16 16 16 16 16 16 16 ...
#>  $ sex     : num [1:757] 2 2 2 2 2 2 2 1 1 1 ...
#>  $ sib     : num [1:757] 0 0 0 1 0 1 1 0 0 0 ...
#>  $ bir     : num [1:757] 1 1 1 2 1 2 2 1 1 1 ...
#>  $ way     : num [1:757] 1 2 2 1 1 2 1 1 2 2 ...
#>  $ con     : num [1:757] 1 1 3 1 1 1 1 1 1 1 ...
#>  $ wei     : num [1:757] 3270 3100 3960 3200 2979 ...
#>  $ dis     : num [1:757] 0 0 0 0 1 0 0 0 0 0 ...
#>  $ filler_r: num [1:757] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ mot_age : num [1:757] 39 33 31 40 42 28 35 32 34 39 ...
#>  $ mot_edu : num [1:757] 14 16 16 18 14 12 16 16 14 16 ...
#>  $ fat_age : num [1:757] 44 32 32 42 42 28 37 35 45 43 ...
#>  $ fat_edu : num [1:757] 16 16 16 18 14 9 16 16 16 18 ...
#>  $ fam     : num [1:757] 3 2 2 3 1 2 3 2 3 3 ...
#>  $ inc     : num [1:757] 4 4 5 5 6 3 3 5 2 6 ...
#>  $ d01_01u : num [1:757] 0 0 0 1 1 0 1 1 0 0 ...
#>  $ d01_02u : num [1:757] 0 0 0 1 0 0 0 0 0 0 ...
#>  $ d01_03u : num [1:757] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ d01_04u : num [1:757] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ d01_05u : num [1:757] 0 0 0 1 0 1 0 1 0 0 ...
#>  $ d01_06u : num [1:757] 0 0 0 1 0 0 0 1 0 0 ...
#>  $ d01_07u : num [1:757] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ d01_08u : num [1:757] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ d01_09u : num [1:757] 0 0 0 0 0 1 0 0 0 0 ...
#>  $ d01_10u : num [1:757] 0 0 0 1 0 0 0 0 0 0 ...
#>  $ d01_11u : num [1:757] 0 0 0 0 0 0 0 0 0 0 ...
#>  ...

# Wordbank Child-by-item data
d_wb = readr::read_csv("wordbank_instrument_data_MandarinWG.csv") %>% 
    select(child_id, item_definition, english_gloss, age, value)
str(d_wb)
#> tibble [267,978 × 5] (S3: tbl_df/tbl/data.frame)
#>  $ child_id       : num [1:267978] 70719 70719 70719 70719 70719 ...
#>  $ item_definition: chr [1:267978] "小狗(汪汪)" "小貓(喵喵)" "魚" "小鳥" ...
#>  $ english_gloss  : chr [1:267978] "dog" "cat" "fish" "bird" ...
#>  $ age            : num [1:267978] 16 16 16 16 16 16 16 16 16 16 ...
#>  $ value          : chr [1:267978] NA NA NA NA ...

# "dog" results (Wordbank processed data)
dog.wb = d_wb %>% filter(english_gloss == "dog")
table(dog.wb$value)
#> 
#>    produces understands 
#>         140         445

# "dog" results (raw data, -1s are NAs)
table(understands=d_raw$d01_01u, produces=d_raw$d01_01p)
#>             produces
#>  understands  -1   0   1
#>           -1   0   9   3
#>           0    4 159  40
#>           1    8 437  97

Created on 2024-12-04 with reprex v2.1.1

@mcfrank
Member

mcfrank commented Dec 4, 2024

Thanks for all this. The raw data for Taiwanese Mandarin is what we were
given by the original investigators. It does look like there is some
inconsistency (produces without understands checked) and missing data.

We don't recommend that investigators use those raw data, because we do a
lot of processing in the Wordbank import process.

When data show production, the wordbankr R package coerces the data to show
comprehension as well, since the "dogma" of the CDI is that "produces" implies "understands." (This is probably not true 100% of the time, but it's a good approximation.)

library(dplyr)

d_m <- wordbankr::get_instrument_data(language = "Mandarin (Taiwanese)",
                                      form = "WG", item_info = TRUE)

# Count children marked as producing / understanding "dog"
d_m |>
  filter(english_gloss == "dog") |>
  summarise(produces = sum(produces),
            understands = sum(understands))

This code returns the desired values, I believe:

# A tibble: 1 × 2
  produces understands
     <int>       <int>
1      140         585
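These totals appear consistent with the raw cross-tabulation for "dog" posted earlier in the thread. The sketch below re-enters that table by hand and applies an assumed labelling rule (any child with produces = 1 counts as "produces"; the remaining children with understands = 1 count as "understands" only). The rule is inferred from the numbers in this thread and is not a confirmed description of the Wordbank import code.

```r
# Raw cross-tabulation for "dog" as posted earlier in the thread
# (rows: understands, columns: produces; "NA" marks missing responses).
raw_tab <- matrix(
    c(0,   9,  3,
      4, 159, 40,
      8, 437, 97),
    nrow = 3, byrow = TRUE,
    dimnames = list(understands = c("NA", "0", "1"),
                    produces    = c("NA", "0", "1"))
)

# Assumed rule (inferred, not confirmed): produces == 1 always yields "produces";
# otherwise understands == 1 yields "understands".
n_produces         <- sum(raw_tab[, "1"])                  # 3 + 40 + 97
n_understands_only <- sum(raw_tab["1", c("NA", "0")])      # 8 + 437
n_understands_all  <- n_produces + n_understands_only      # produces implies understands

print(c(produces         = n_produces,
        understands_only = n_understands_only,
        understands      = n_understands_all))
```

Under this rule the 140 and 445 in the `value` column and the 140 and 585 in the `produces`/`understands` columns all come from the same 757 raw rows.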

We recommend working with the columns produces and understands rather
than the value column.
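As a rough illustration of that recommendation, per-age proportions can be computed directly from the two logical columns. The toy data frame below only mimics the `age`, `produces`, and `understands` columns seen in the code above; the values are made up for illustration.

```r
# Toy data: one row per child for a single word, with logical response columns.
toy <- data.frame(
    child_id    = c(1, 2, 3, 4, 5, 6),
    age         = c(8, 8, 8, 16, 16, 16),
    produces    = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE),
    understands = c(FALSE, TRUE,  TRUE, TRUE, TRUE,  TRUE)
)

# The mean of a logical column is the proportion of TRUE values,
# so this gives the proportion of children per age for each response type.
by_age <- aggregate(toy[c("produces", "understands")],
                    by = list(age = toy$age), FUN = mean)
print(by_age)
```

Because `understands` already includes children who produce the word, no recombination of categories is needed before computing the proportion.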
