Mandarin (Taiwan) WG full Child-by-Word data issue #328
Hi, we noticed this too - it's highlighted in the Wordbank book in the
section on "difficult datasets":
https://langcog.github.io/wordbank-book/methods-and-data.html#difficult-datasets
We think that most likely the parents of young children didn't understand
what it means to mark "comprehension"... we've seen this pattern before in
other WG datasets.
Thank you for your prompt reply! The over-responding of parents of particularly young children seems like a plausible explanation for the observed pattern. However, I'm still concerned about the coding of the raw data found at https://github.com/langcog/wordbank/tree/master/raw_data/Mandarin_Taiwanese_WG. Specifically, I noticed the following:
Given the design of the items in the WG form, isn't it only possible to observe a maximum of three response combinations (
I'd greatly appreciate any clarification or insights you could provide. Thank you for your time and assistance!
Code and data for reproducing what I've mentioned: [Mandarin_Taiwanese_WG].csv
library(dplyr)
# Raw data at https://github.com/langcog/wordbank/tree/master/raw_data/Mandarin_Taiwanese_WG
d_raw = readr::read_csv('MandarinTaiwaneseWG_Liu_data.csv')
for ( c_ in names(d_raw)[17:724] ) # Recode NA responses as -1
d_raw[[c_]] = ifelse(is.na(d_raw[[c_]]), -1L, d_raw[[c_]])
str(d_raw)
#> spc_tbl_ [757 × 724] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ sub : chr [1:757] "0001" "0002" "0003" "0004" ...
#> $ age : num [1:757] 16 16 16 16 16 16 16 16 16 16 ...
#> $ sex : num [1:757] 2 2 2 2 2 2 2 1 1 1 ...
#> $ sib : num [1:757] 0 0 0 1 0 1 1 0 0 0 ...
#> $ bir : num [1:757] 1 1 1 2 1 2 2 1 1 1 ...
#> $ way : num [1:757] 1 2 2 1 1 2 1 1 2 2 ...
#> $ con : num [1:757] 1 1 3 1 1 1 1 1 1 1 ...
#> $ wei : num [1:757] 3270 3100 3960 3200 2979 ...
#> $ dis : num [1:757] 0 0 0 0 1 0 0 0 0 0 ...
#> $ filler_r: num [1:757] 1 1 1 1 1 1 1 1 1 1 ...
#> $ mot_age : num [1:757] 39 33 31 40 42 28 35 32 34 39 ...
#> $ mot_edu : num [1:757] 14 16 16 18 14 12 16 16 14 16 ...
#> $ fat_age : num [1:757] 44 32 32 42 42 28 37 35 45 43 ...
#> $ fat_edu : num [1:757] 16 16 16 18 14 9 16 16 16 18 ...
#> $ fam : num [1:757] 3 2 2 3 1 2 3 2 3 3 ...
#> $ inc : num [1:757] 4 4 5 5 6 3 3 5 2 6 ...
#> $ d01_01u : num [1:757] 0 0 0 1 1 0 1 1 0 0 ...
#> $ d01_02u : num [1:757] 0 0 0 1 0 0 0 0 0 0 ...
#> $ d01_03u : num [1:757] 0 0 0 0 0 0 0 0 0 0 ...
#> $ d01_04u : num [1:757] 0 0 0 0 0 0 0 0 0 0 ...
#> $ d01_05u : num [1:757] 0 0 0 1 0 1 0 1 0 0 ...
#> $ d01_06u : num [1:757] 0 0 0 1 0 0 0 1 0 0 ...
#> $ d01_07u : num [1:757] 0 0 0 0 0 0 0 0 0 0 ...
#> $ d01_08u : num [1:757] 0 0 0 0 0 0 0 0 0 0 ...
#> $ d01_09u : num [1:757] 0 0 0 0 0 1 0 0 0 0 ...
#> $ d01_10u : num [1:757] 0 0 0 1 0 0 0 0 0 0 ...
#> $ d01_11u : num [1:757] 0 0 0 0 0 0 0 0 0 0 ...
#> ...
# Wordbank Child-by-item data
d_wb = readr::read_csv("wordbank_instrument_data_MandarinWG.csv") %>%
select(child_id, item_definition, english_gloss, age, value)
str(d_wb)
#> tibble [267,978 × 5] (S3: tbl_df/tbl/data.frame)
#> $ child_id : num [1:267978] 70719 70719 70719 70719 70719 ...
#> $ item_definition: chr [1:267978] "小狗(汪汪)" "小貓(喵喵)" "魚" "小鳥" ...
#> $ english_gloss : chr [1:267978] "dog" "cat" "fish" "bird" ...
#> $ age : num [1:267978] 16 16 16 16 16 16 16 16 16 16 ...
#> $ value : chr [1:267978] NA NA NA NA ...
# "dog" results (Wordbank processed data)
dog.wb = d_wb %>% filter(english_gloss == "dog")
table(dog.wb$value)
#>
#> produces understands
#> 140 445
# "dog" results (raw data, -1s are NAs)
table(understands=d_raw$d01_01u, produces=d_raw$d01_01p)
#> produces
#> understands -1 0 1
#> -1 0 9 3
#> 0 4 159 40
#>           1   8 437  97
Created on 2024-12-04 with reprex v2.1.1
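The two "dog" tables above can be reconciled with a simple recoding rule. What follows is a minimal base R sketch, not Wordbank's actual import code: it assumes that a marked production response takes precedence, that a marked comprehension response without production is coded "understands", and that every other combination becomes NA. The cell counts are copied from the raw cross-tabulation above; `counts`, `cells`, and `recoded` are illustrative names.

```r
# Cell counts copied from the raw "dog" cross-tabulation above
# (rows: understands, columns: produces; -1 marks a missing response).
counts <- matrix(
  c(0,   9,  3,
    4, 159, 40,
    8, 437, 97),
  nrow = 3, byrow = TRUE,
  dimnames = list(understands = c("-1", "0", "1"),
                  produces    = c("-1", "0", "1"))
)

# One row per (understands, produces) combination, with its count in Freq.
cells <- as.data.frame(as.table(counts), stringsAsFactors = FALSE)

# Assumed rule (inferred from the two tables, not confirmed by the maintainers):
# produces == 1 gives "produces"; otherwise understands == 1 gives "understands";
# all remaining combinations stay NA.
cells$value <- ifelse(cells$produces == "1", "produces",
               ifelse(cells$understands == "1", "understands", NA))

# Total children per recoded value (NA cells are dropped by tapply).
recoded <- tapply(cells$Freq, cells$value, sum)
print(recoded)
```

Under this assumed rule, the nine raw combinations collapse to 140 "produces" and 445 "understands", the same counts as the Wordbank table, with the remaining 172 of the 757 children left as NA.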
Thanks for all this. The raw data for Taiwanese Mandarin is what we were We don't recommend that investigators use those raw data, because we do a When data show production, the
This code returns the desired values, I believe:
We recommend working with the columns
Hi there,
Thank you for making CDI data publicly available!
I am currently working with the Mandarin (Taiwan) CDI data and have encountered some issues.
Specifically, I plotted the proportions of children who understand (i.e., understand or produce) a word across all age groups. I expected the resulting distribution to be roughly monotonically increasing, as shown in the plot below (created using the American English WG data). However, I found that many words in the Mandarin (Taiwan) WG data exhibit U-shaped distributions.
I suspect there may be errors in the Mandarin WG data, potentially stemming from coding errors in the original contributed data or pre-processing issues before it was uploaded to the Wordbank project.
Do the maintainers of the Wordbank project have any insights into this?
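A minimal base R sketch of the monotonicity check described above. It uses toy data with hypothetical values, and assumes the Wordbank child-by-word column names `english_gloss`, `age`, and `value`; `d_toy`, `knows`, `prop`, and `non_monotone` are illustrative names.

```r
# Toy child-by-word data (hypothetical values), shaped like the Wordbank export:
# one row per child and word; `value` is NA, "understands", or "produces".
d_toy <- data.frame(
  english_gloss = rep(c("dog", "cat"), each = 6),
  age           = rep(c(12, 12, 14, 14, 16, 16), times = 2),
  value         = c(NA, "understands", "understands",
                    "produces", "produces", "produces",   # dog: rises with age
                    "understands", "understands", NA,
                    NA, "understands", "produces"),       # cat: U-shaped
  stringsAsFactors = FALSE
)

# A child counts as understanding a word if it is marked understood or produced.
d_toy$knows <- d_toy$value %in% c("understands", "produces")

# Proportion of children who understand each word at each age.
prop <- aggregate(knows ~ english_gloss + age, data = d_toy, FUN = mean)
prop <- prop[order(prop$english_gloss, prop$age), ]

# Flag words whose proportion drops between consecutive ages.
non_monotone <- tapply(prop$knows, prop$english_gloss,
                       function(p) any(diff(p) < 0))
print(non_monotone)
```

With real data, sampling noise alone can produce small dips between adjacent age groups, so a tolerance on `diff(p)` may be more useful than a strict `< 0` test.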
Below is the code for reproducing what I've mentioned: