-
Notifications
You must be signed in to change notification settings - Fork 6
/
009-items-demographics.Rmd
902 lines (744 loc) · 54.5 KB
/
009-items-demographics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
# Demographic Variation in Individual Words {#items-demographics}
Note:
~ *The analyses below were presented at the 2019 Biennial Meeting of the Society for Research in Child Development; an earlier version of the sex analyses was presented at the Boston University Conference on Language Development in 2016.*
In Chapter \@ref(demographics), we documented demographic differences in total vocabulary size. But where do these differences come from? Concretely, if girls say more words than boys, *which words* do they say more frequently? Is it the case that they are simply more likely to be producing each word with some relatively uniform probability, or are there individual words that they are much more likely to produce? In this chapter, we consider the possibility that the probability of comprehending or producing individual words is related to demographic variables. We assess which words are learned differentially earlier or later by girls vs. boys, by first-born vs. later-born children, and by children with different levels of maternal education.
```{r itemsdemo-params}
sample_cutoff <- 50
```
```{r itemsdemo-raw_data, eval=FALSE}
get_inst_data <- function(inst_items) {
inst_language <- unique(inst_items$language)
print(inst_language)
inst_form <- unique(inst_items$form)
inst_admins <- filter(admins, language == inst_language, form == inst_form)
get_instrument_data(language = inst_language,
form = inst_form,
items = inst_items$item_id,
administrations = inst_admins,
iteminfo = inst_items) %>%
filter(!is.na(age)) %>%
mutate(produces = !is.na(value) & value == "produces",
understands = !is.na(value) &
(value == "understands" | value == "produces")) %>%
select(-value) %>%
gather(measure, value, produces, understands) %>%
mutate(language = inst_language,
form = inst_form)
}
get_lang_data <- function(lang_items) {
lang_items %>%
split(.$form) %>%
map_df(get_inst_data) %>%
## production for WS & WG, comprehension for WG only
filter(measure == "produces" | form == "WG")
}
demo_insts <- admins %>%
filter(!is.na(birth_order) | !is.na(mom_ed) | !is.na(sex)) %>%
distinct(language, form)
demo_words <- items %>%
right_join(demo_insts) %>%
filter(type == "word")
raw_data <- demo_words %>%
split(.$language) %>%
map(get_lang_data)
write_feather(bind_rows(raw_data), "data/items-demographics/_demo_data.feather")
```
```{r itemsdemo-raw_data_coded, eval=FALSE}
raw_data <- read_feather("data/items-demographics/_demo_data.feather") %>%
split(.$language)
raw_data_coded <- raw_data %>%
map(~.x %>%
mutate(birth_order = birth_order %>%
fct_collapse("Third+" = c("Third", "Fourth", "Fifth", "Sixth",
"Seventh", "Eighth")),
mom_ed = mom_ed %>%
fct_collapse(`Below Secondary` = c("None","Primary",
"Some Secondary"),
`Secondary` = c("Secondary", "Some College"),
`College & Above` = c("College",
"Some Graduate",
"Graduate"))))
samples <- raw_data_coded %>%
map_df(~.x %>%
distinct(language, measure, age, sex, birth_order, mom_ed, data_id) %>%
select(-data_id) %>%
gather(demo, value, sex, birth_order, mom_ed) %>%
count(language, measure, demo, value) %>%
filter(!is.na(value)))
```
```{r itemsdemo-plot_samples, eval=FALSE}
plot_samples <- function(plot_measure, plot_demo) {
ggplot(filter(samples, measure == plot_measure, demo == plot_demo),
aes(x = value, y = log(n), fill = value)) +
facet_wrap(~language) +
geom_col() +
geom_hline(yintercept = log(sample_cutoff), linetype = "dashed",
colour = "darkgrey") +
.scale_fill_discrete() +
ggtitle(paste(plot_measure, plot_demo, sep = " "))
}
plot_samples("produces", "birth_order")
plot_samples("produces", "mom_ed")
plot_samples("produces", "sex")
plot_samples("understands", "birth_order")
plot_samples("understands", "mom_ed")
plot_samples("understands", "sex")
```
```{r itemsdemo-get_demo_props, eval=FALSE}
get_demo_props <- function(lang_data_coded, demo) {
demo_str <- as.character(demo)[2]
message(unique(lang_data_coded$language), " - ", demo_str)
if (lang_data_coded %>% pull(!!demo) %>% is.na() %>% all()) return(NULL)
demo_lang_data <- lang_data_coded %>%
filter(!is.na(!!demo))
demo_samples <- demo_lang_data %>%
distinct(!!demo, data_id, measure) %>%
count(!!demo, measure)
if (any(demo_samples$n < sample_cutoff)) return(NULL)
demo_lang_data %>%
ungroup() %>%
group_by(language, measure, !!demo, age, uni_lemma, definition) %>%
summarise(n = n(),
num_true = sum(value, na.rm = TRUE),
num_false = n - num_true) %>%
ungroup() %>%
mutate(unscaled_age = age,
age = scale(age),
demo = demo_str) %>%
group_by(language, measure, demo) %>%
nest()
}
demos <- c(quo(birth_order), quo(mom_ed), quo(sex))
get_lang_props <- function(lang_data) {
demos %>% map_df(~get_demo_props(lang_data, .x))
}
demo_props <- raw_data_coded %>%
map_df(get_lang_props)
save(demo_props, file = "data/items-demographics/demo_props.Rds")
#write_feather(unnest(demo_props), "data/items-demographics/demo_props.feather")
```
## Methods
### Data
As reviewed in Chapter \@ref(methods-and-data), subsets of the datasets in Wordbank are coded for one or more demographic variables. Here as in previous analyses, we examine three: birth order, level of maternal education, and sex. For these analyses, we extract all of the instruments with demographically coded data and combine them into two datasets: comprehension from WG forms, and production across both WG and WS forms (using the "by item stitching" approach described in Appendix \@ref(appendix-stitching)). To avoid sparsity, we coded demographic variables are coded into the values First / Second / Third+ for birth order, Below Secondary / Secondary / College and Above for maternal education, and Female / Male for assigned sex at birth. This approach creates six different analyses, one for each combination of measure (comprehension/production) demographic variable. We exclude a language from a given analysis if it has fewer than `r sample_cutoff` children for each level of that demographic variable.
Each dataset yields a trajectory for each word, created by fitting a logistic curve to the proportion of children that are reported to understand or produce the word over age. These trajectories can be computed separately for each value of the demographic variable. For example, Figure \@ref(fig:itemsdemo-props-plot-item) illustrates the trajectories for some sample items in English for production data split by birth order. Note that the word _brother_ is spoken much earlier by second-born and third-born children than by first-born children, while _green_ is spoken much earlier by first-born children. And the word _dog_ is produced only slightly earlier by first-born or second-born children than by later-born children. Averaging all of these trajectories together reproduces the demographic vocabulary curves reported in Chapter \@ref(demographics).
```{r itemsdemo-plot_props}
load("data/items-demographics/demo_props.Rds")
demo_labels <- list("sex" = "Sex",
"birth_order" = "Birth order",
"mom_ed" = "Maternal education")
plot_demo_props <- demo_props %>%
unnest() %>%
mutate(prop = num_true / (num_true + num_false)) %>%
split(.$demo)
plot_props <- function(plot_demo, plot_measure, ncol = NULL) {
plot_demo_props[[plot_demo]] %>% filter(measure == plot_measure) %>%
ggplot(aes_string(x = "unscaled_age", y = "prop", colour = plot_demo)) +
facet_wrap(~language, ncol = ncol) +
geom_smooth(method = "glm", method.args = list(family = "binomial"),
se = FALSE, size = 0.8) +
.scale_colour_discrete() +
labs(x = "Age (months)", y = glue("Proportion {plot_measure}"),
colour = demo_labels[[plot_demo]]) +
lims(y = c(0, 1)) +
theme(legend.position = "top")
}
plot_item_props <- function(plot_demo, plot_measure, plot_items, ncol = NULL) {
plot_demo_props[[plot_demo]] %>%
filter(measure == plot_measure, definition %in% plot_items) %>%
mutate(total = num_true + num_false) %>%
ggplot(aes_string(x = "unscaled_age", y = "prop", colour = plot_demo)) +
facet_wrap(~definition, ncol = ncol) +
geom_smooth(aes(weight = total), se = FALSE, size = 0.8, method = "glm",
method.args = list(family = "binomial")) +
.scale_colour_discrete() +
labs(x = "Age (months)", y = glue("Proportion {plot_measure}"),
colour = demo_labels[[plot_demo]]) +
lims(y = c(0, 1)) +
theme(legend.position = "top")
}
```
```{r itemsdemo-props-plot-item, dependson="itemsdemo-plot_props", fig.height=3, fig.cap="Developmental trajectories for the production of three example American English words by birth order."}
plot_item_props("birth_order", "produces", c("brother", "dog", "green"), 4)
```
```{r itemsdemo-fit_models, eval=FALSE}
contr_back_diff <- function(k) {
column <- function(i) c(rep(-(k - i) / k, i), rep(i / k, k - i))
map(1:(k - 1), column) %>% as.data.frame() %>% as.matrix() %>% unname()
}
fit_demo_model <- function(language, measure, demo, demo_data) {
message(language, " - ", measure, " - ", demo)
if (length(levels(demo_data[[demo]])) < 2) return(NULL)
demo_contrasts <- contr_back_diff(length(levels(demo_data[[demo]])))
contrasts(demo_data[[demo]]) <- demo_contrasts
group_formula <- as.formula(glue(
"cbind(num_true, num_false) ~
(age + {demo} | definition) + {demo} + age"
))
safe_model <- possibly(function() {
model <- lme4::glmer(group_formula, family = binomial, data = demo_data)
save(model, file = glue("data/items-demographics/_models/{language}_{measure}_{demo}.Rds"))
return(model)
}, otherwise = NULL)
safe_model()
}
cores <- parallel::detectCores()
cluster <- multidplyr::create_cluster(cores = cores)
demo_groups <- demo_props %>%
mutate(group = rep(1:cores, length.out = n())) %>%
multidplyr::partition(group, cluster = cluster)
demo_groups %>%
multidplyr::cluster_library("tidyverse") %>%
multidplyr::cluster_library("glue") %>%
multidplyr::cluster_library("lme4") %>%
multidplyr::cluster_assign_value("fit_demo_model", fit_demo_model) %>%
multidplyr::cluster_assign_value("contr_back_diff", contr_back_diff)
demo_model_setup <- demo_groups %>%
mutate(model = pmap(list(language, measure, demo, data), fit_demo_model))
demo_models <- demo_model_setup %>%
collect() %>%
as_tibble()
```
```{r itemsdemo-demo_fits, eval=FALSE}
get_coef <- function(model) {
model %>% broom::tidy() %>% filter(group == "fixed") %>% select(-group)
}
get_ranef <- function(model) {
lme4::ranef(model)$definition %>% as_tibble(rownames = "definition")
}
demo_fits <- demo_models %>%
ungroup() %>%
filter(!map_lgl(model, is.null)) %>%
mutate(coefs = map(model, get_coef),
ranef = map(model, get_ranef))
demo_coefs <- demo_fits %>%
select(language, measure, demo, coefs) %>%
unnest()
demo_ranef <- demo_fits %>%
select(language, measure, demo, ranef) %>%
unnest()
write_feather(demo_coefs, "data/items-demographics/demo_coefs.feather")
write_feather(demo_ranef, "data/items-demographics/demo_ranef.feather")
```
```{r itemsdemo-update, eval=FALSE}
demo_coefs <- read_feather("data/items-demographics/demo_coefs.feather")
demo_ranef <- read_feather("data/items-demographics/demo_ranef.feather")
update_language <- function(lang) {
updated_models <- demo_props %>%
filter(language == lang) %>%
mutate(model = pmap(list(language, measure, demo, data), fit_demo_model))
updated_fits <- updated_models %>%
ungroup() %>%
filter(!map_lgl(model, is.null)) %>%
mutate(coefs = map(model, get_coef),
ranef = map(model, get_ranef))
updated_coefs <- updated_fits %>%
select(language, measure, demo, coefs) %>%
unnest()
updated_ranef <- updated_fits %>%
select(language, measure, demo, ranef) %>%
unnest()
demo_coefs <- demo_coefs %>%
filter(language != lang) %>%
bind_rows(updated_coefs)
write_feather(demo_coefs, "data/items-demographics/demo_coefs.feather")
demo_ranef <- demo_ranef %>%
filter(language != lang) %>%
bind_rows(updated_ranef)
write_feather(demo_ranef, "data/items-demographics/demo_ranef.feather")
}
```
### Models
There are a number of complementary methods to estimate individual item effects of the type visualized in Figure \@ref(fig:itemsdemo-props-plot-item). In Chapter \@ref(demographics), we explored relatively model-free approaches to estimating demographic effects across groups mostly using descriptive statistics. Here we are interested in estimating these effects for individual items, and data are sparser for each individual item. Thus, estimating an independent statistic for each item would be noisier and more variable. In exploring this issue, we found that it was much more effective to reduce this variance by using a model in which demographic effects are estimated simultaneously at the level of all items and specifically for individual items.
In particular, in the analysis below we use mixed-effects logistic regression to predict whether children understand or produce each item based on their age and their level of a single demographic variable, with random effects of age and demographic variable by item. A model of this type is fit separately to the data for each language and demographic measure, for example specified for birth order as:
```
produces ~ age + birth_order + (age + birth_order | item)
```
For each demographic variable, we specify the contrasts such that their coefficient compares each level of the variable to the previous level. For example, the coefficients for birth order reflect the overall difference between first-born children as compared to second-born children and the overall difference between second-born children as compared to later-born children. Our key coefficients of interests are the fitted slopes for each demographic group and item. These indicate the size of the demographic differences for that item, over and above the main effect.
Since these effects are coefficients in a logistic regression, they represent the odds ratio between the demographic levels they are comparing. For example, the birth order model yields the odds ratio between first-born and second-born children and the odds ratio between second-born children and later-born children. Thus an effect of 1 indicates no difference (e.g., first-born and second-born children are equally likely to know a word). An effect greater than 1 indicates an advantage for first-born children (e.g., an effect of 2 would mean that the odds of first-born children knowing a word are twice that of second-born children). Conversely, an effect lower than 1 indicates an advantage for second-born children (e.g. an effect of 0.5 would mean that the odds of first-born children knowing a word are half that of second-born children).
```{r itemsdemo-demo_coefs}
demo_coefs <- read_feather("data/items-demographics/demo_coefs.feather")
demo_levels <- function(dem) {
demo_props %>% filter(demo == dem) %>% slice(1) %>% pull(data) %>% .[[1]] %>%
pull(!!dem) %>% levels()
}
demo_coefs_coded <- demo_coefs %>%
filter(!(term %in% c("(Intercept)", "age"))) %>%
mutate(estimate = if_else(term %in% c("sexMale", "birth_order1", "birth_order2"),
-estimate, estimate),
ci_lower = estimate - 1.96 * std.error,
ci_upper = estimate + 1.96 * std.error,
exp_estimate = exp(estimate),
term = term %>%
fct_relevel("sexMale", "birth_order1", "birth_order2", "mom_ed2", "mom_ed1") %>%
fct_recode("Female : Male" = "sexMale",
"First : Second" = "birth_order1",
"Second : Third+" = "birth_order2",
"College & Above : Secondary" = "mom_ed2",
"Secondary : Below Secondary" = "mom_ed1",
NULL = "mom_edCollege and Above"),
measure = measure %>% fct_relevel("understands", "produces"),
demo = demo %>%
fct_relevel("sex", "birth_order", "mom_ed") %>%
fct_recode("Sex" = "sex", "Birth order" = "birth_order",
"Maternal education" = "mom_ed")) %>%
filter(!is.na(term))
terms <- levels(demo_coefs_coded$term)
term_colours <- ggthemes::palette_pander(length(terms)) %>% set_names(terms)
exp_breaks <- function(base, power) c(rev(1 / base^(1:power)), base^(0:power)) %>% log()
```
```{r itemsdemo-demo_coefs_plot, dependson="itemsdemo-demo_coefs"}
coef_breaks <- exp_breaks(1.25, 3)
plot_demo_fixed <- function(dem) {
plot_coefs <- demo_coefs_coded %>%
filter(demo == dem) %>%
arrange(measure, term, estimate) %>%
mutate(order = 1:n()) %>%
group_by(measure, language) %>%
mutate(group = factor(group_indices())) %>%
ungroup() %>%
mutate(group = fct_reorder(group, order))
ggplot(plot_coefs, aes(x = estimate, y = group, colour = term)) +
facet_grid(measure ~ ., scales = "free_y", space = "free_y", drop = TRUE,
labeller = label_caps) +
geom_vline(xintercept = 0, linetype = .refline, colour = .grey) +
ggstance::geom_pointrangeh(aes(xmin = ci_lower, xmax = ci_upper)) +
scale_x_continuous(breaks = coef_breaks, labels = round(exp(coef_breaks), 2)) +
scale_y_discrete(breaks = plot_coefs$group, labels = plot_coefs$language) +
scale_colour_manual(values = term_colours, name = "") +
labs(x = "Odds ratio (fixed effect)", y = "") +
theme(legend.position = "top", legend.direction = "horizontal", legend.justification = c(0, 0),
legend.margin = margin(b = -10, unit = "pt"),
panel.grid.major.y = .coef_line)
}
```
```{r itemsdemo-demo_ranef}
demo_ranef <- read_feather("data/items-demographics/demo_ranef.feather")
demo_ranef_coded <- demo_ranef %>%
gather(term, estimate, -language, -measure, -demo, -definition) %>%
filter(!is.na(estimate),
!(term %in% c("(Intercept)", "age"))) %>%
mutate(estimate = if_else(term %in% c("sexMale", "birth_order1", "birth_order2"),
-estimate, estimate),
exp_estimate = exp(estimate),
measure = measure %>% fct_relevel("understands", "produces"),
term = term %>%
fct_relevel("sexMale", "birth_order1", "birth_order2", "mom_ed2", "mom_ed1") %>%
fct_recode("Female : Male" = "sexMale",
"First : Second" = "birth_order1",
"Second : Third+" = "birth_order2",
"College & Above : Secondary" = "mom_ed2",
"Secondary : Below Secondary" = "mom_ed1",
NULL = "mom_edCollege and Above")) %>%
filter(!is.na(term),
!(language == "English (British)" & demo == "mom_ed"))
cis <- demo_ranef_coded %>%
summarise(ci_lower_exp = ci_lower(exp_estimate),
ci_upper_exp = ci_upper(exp_estimate))
```
```{r itemsdemo-plot_ranef-old, dependson="itemsdemo-demo_ranef"}
# num_extremes <- 2
#
# plot_demo_ranef <- function(plot_demo, plot_measure) {
#
# demo_data <- demo_ranef_coded %>%
# filter(measure == plot_measure, demo == plot_demo) %>%
# mutate(definition = str_remove(definition, " \\(.*\\)$"))
#
# demo_labelled <- bind_rows(
# demo_data %>% group_by(language, term) %>% top_n(num_extremes, estimate),
# demo_data %>% group_by(language, term) %>% top_n(num_extremes, -estimate)
# )
#
# plot_breaks <- exp_breaks(2, 3)
#
# ggplot(demo_data, aes(x = estimate, y = term, fill = term)) +
# facet_wrap(vars(language), ncol = 4) +
# geom_vline(xintercept = 0, colour = .grey, alpha = 0.6) +
# ggridges::geom_density_ridges(alpha = .5, scale = 0.9) +
# ggrepel::geom_label_repel(aes(label = definition, colour = term),
# data = demo_labelled,
# segment.size = 0.3,
# label.padding = 0.15,
# point.padding = unit(0.2, "lines"),
# arrow = arrow(length = unit(0.01, "npc")),
# nudge_y = 0.2,
# size = 2.5, fill = "white",
# family = .uni_font) +
# scale_x_continuous(breaks = plot_breaks, labels = round(exp(plot_breaks), 2)) +
# scale_y_discrete(expand = expand_scale(mult = c(0, 0.01))) +
# scale_colour_manual(values = term_colours, guide = FALSE) +
# scale_fill_manual(values = term_colours, name = "") +
# guides(fill = guide_legend(direction = "vertical", reverse = TRUE)) +
# labs(x = "Odds ratio", y = "") +
# theme(axis.text.y = element_blank(),
# axis.ticks.y = element_blank(),
# legend.position = "top")
#
# }
```
```{r itemsdemo-plot_ranef, dependson="itemsdemo-demo_ranef"}
demo_combined <- demo_ranef_coded %>%
mutate(demo = demo %>%
fct_relevel("sex", "birth_order", "mom_ed") %>%
fct_recode("Birth order" = "birth_order",
"Maternal education" = "mom_ed",
"Sex" = "sex")) %>%
select(language, measure, demo, term, definition, estimate) %>%
rename(estimate_ranef = estimate) %>%
left_join(demo_coefs_coded %>% select(language, measure, demo, term, estimate) %>% rename(estimate_fixed = estimate)) %>%
mutate(estimate = estimate_fixed + estimate_ranef)
plot_demo_ranef <- function(plot_demo, plot_measure, lims = c(-1, 1)) {
demo_data <- demo_combined %>%
filter(measure == plot_measure, demo == plot_demo) %>%
mutate(definition = str_remove(definition, " \\(.*\\)$"))
plot_breaks <- exp_breaks(2, 3)
plot_labels <- demo_data %>%
distinct(language, term) %>%
mutate(term = fct_drop(term)) %>%
filter(term == levels(term)[length(levels(term))])
plt <- ggplot(demo_data,
aes(x = estimate, y = fct_rev(language), fill = term)) +
coord_cartesian(clip = "off") +
geom_vline(xintercept = 0, colour = .grey, alpha = 0.6) +
ggridges::geom_density_ridges(quantile_lines = TRUE, quantiles = 2,
alpha = .5, panel_scaling = FALSE) +
# ggridges::geom_density_ridges(aes(height = ..density..), stat = "density",
# alpha = .5, scale = 0.95) +
scale_x_continuous(expand = c(0.01, 0), breaks = plot_breaks,
labels = round(exp(plot_breaks), 2), limits = lims) +
scale_y_discrete(expand = expand_scale(mult = c(0, 0.01))) +
scale_fill_manual(values = term_colours, guide = FALSE) +
labs(x = "Odds ratio", y = "") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
legend.position = "top",
panel.border = element_blank(),
strip.text.y = element_blank())
if (n_distinct(demo_data$term) > 1) {
plt +
facet_grid(rows = vars(language), cols = vars(term), scales = "free_y",
drop = TRUE) +
geom_text(aes(label = language), x = -1, hjust = 0.5, vjust = 0,
nudge_y = 1, family = .font, data = plot_labels) +
theme(strip.text.x = element_text(face = "bold"))
} else {
plt +
facet_wrap(vars(language), scales = "free_y", drop = TRUE, ncol = 4,
dir = "v") +
# geom_text(aes(label = language), x = -1, y = Inf, hjust = 0, vjust = 0,
# nudge_y = -10,
# family = .font, data = plot_labels) +
# theme(strip.text.x = element_blank())
theme(strip.text.x = element_text(hjust = 0, size = rel(0.8),
margin = margin(b = 2, unit = "pt")))
}
}
```
```{r itemsdemo-tops_ranef, dependson="itemsdemo-demo_ranef"}
num_tops <- 10
tops <- demo_ranef_coded %>%
select(language, measure, demo, definition, term, estimate) %>%
mutate(neg_estimate = -estimate) %>%
rename(positive = estimate, negative = neg_estimate) %>%
gather(direction, estimate, positive, negative) %>%
group_by(language, measure, demo, term, direction) %>%
top_n(num_tops, estimate) %>%
arrange(measure, language, demo, term, desc(estimate)) %>%
mutate(rank = row_number(), exp_estimate = exp(estimate)) %>%
separate(term, c("numerator", "denominator"), sep = " : ",
remove = FALSE) %>%
mutate(effect = if_else(direction == "positive",
glue("{numerator} > {denominator}"),
glue("{denominator} > {numerator}")))
dir_codes <- list(
sex = c("Male biased" = "negative",
"Female biased" = "positive"),
birth_order = c("Later born biased" = "negative",
"Earlier born biased" = "positive"),
mom_ed = c("Lower maternal education biased" = "negative",
"Higher maternal education biased" = "positive")
)
top_plots <- function(plot_demo, plot_measure, max_rows = 5) {
plt_tops <- tops %>%
ungroup() %>%
filter(measure == plot_measure, demo == plot_demo) %>%
select(-measure, -demo) %>%
mutate(definition = str_remove(definition, " \\(.*\\)$"),
direction = direction %>%
fct_relevel("negative", "positive") %>%
fct_recode(!!!dir_codes[[plot_demo]]))
if (n_distinct(plt_tops$effect) <= n_distinct(plt_tops$direction)) {
max_rows <- max_rows * 2
}
plt_tops %>%
group_by(language) %>%
nest() %>%
ungroup() %>%
mutate(i = rep(1:ceiling((nrow(.) / max_rows)),
each = max_rows)[1:nrow(.)]) %>%
unnest() %>%
group_by(i) %>%
nest() %>%
mutate(rows = ceiling(map_int(data, nrow) / 40),
plt = map(data, function(plt_section) {
plt <- ggplot(plt_section,
aes(x = effect, y = rank, colour = term)) +
geom_text(aes(label = roundp(exp_estimate, 1)),
colour = .grey, hjust = "right",
nudge_x = -.25, family = .font, size = 2.25) +
geom_text(aes(label = definition), family = .uni_font,
hjust = "left", size = 2.5, nudge_x = -.2) +
scale_colour_manual(values = term_colours, guide = FALSE) +
scale_x_discrete(position = "top") +
scale_y_reverse() +
labs(x = "") +
theme_mikabr(base_size = 10, base_family = .font) +
theme(axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks = element_blank(),
panel.border = element_blank(),
strip.placement = "outside",
strip.text = element_text(face = "bold",
margin = margin(b = 3)),
axis.text.x = element_text(size = rel(0.7)))
if (n_distinct(plt_tops$effect) > n_distinct(plt_tops$direction)) {
plt +
facet_wrap(~language + direction, scales = "free", ncol = 2)
} else {
plt +
facet_wrap(~language, scales = "free", ncol = 2)
}
}))
}
```
```{r, results="asis"}
# lang_top_table <- function(plot_demo, plot_measure) {
#
# plt_tops <- tops %>%
# ungroup() %>%
# filter(measure == plot_measure, demo == plot_demo) %>%
# select(-measure, -demo) %>%
# mutate(definition = str_remove(definition, " \\(.*\\)$"),
# direction = direction %>%
# fct_relevel("negative", "positive") %>%
# fct_recode(!!!dir_codes[[plot_demo]])) %>%
# group_by(language) %>%
# nest()
#
# walk2(plt_tops$data, plt_tops$language, function(lang_data, lang) {
# wide_tops <- lang_data %>%
# select(direction, effect, exp_estimate, definition) %>%
# group_by(direction) %>%
# nest() %>%
# mutate(data = map(data, ~.x %>% group_by(effect) %>% nest() %>% spread(effect, data) %>% unnest(.sep = ","))) %>%
# spread(direction, data) %>%
# unnest(.sep = ",")
#
# directions <- names(wide_tops) %>% str_split(",") %>% map_chr(~.[1]) %>% unique()
# effects <- names(wide_tops) %>% str_split(",") %>% map_chr(~.[2]) %>% unique()
# wide_tops %>%
# kable(col.names = rep("", 8), digits = 1) %>%
# kableExtra::add_header_above(set_names(rep(2, 4), effects)) %>%
# kableExtra::add_header_above(set_names(rep(4, 2), directions)) %>%
# kableExtra::add_header_above(set_names(8, lang)) %>%
# cat()
# })
# }
#
# lang_top_table("birth_order", "produces")
```
```{r itemsdemo-thresholds}
threshold <- log(1.5)
threshold_props <- demo_ranef_coded %>%
mutate(large = abs(estimate) > threshold) %>%
group_by(measure, demo, language) %>%
summarise(n_large = sum(large),
prop_large = mean(large)) %>%
summarise(mean_prop = mean(prop_large))
threshold_lang <- function(lang, meas, dem) {
demo_ranef_coded %>%
mutate(large = abs(estimate) > threshold) %>%
group_by(measure, demo, language) %>%
summarise(prop_large = mean(large)) %>%
filter(language == lang, measure == meas, demo == dem) %>%
pull(prop_large)
}
threshold_prop <- function(meas, dem) {
threshold_props %>%
filter(measure == meas, demo == dem) %>%
pull(mean_prop)
}
```
```{r itemsdemo-get_effect}
odds <- function(p) p / (1 - p)
prob <- function(o) o / (1 + o)
effect <- function(meas, dem, trm) {
demo_coefs_coded %>%
filter(measure == meas, demo == dem, term == trm) %>%
pull(exp_estimate) %>%
mean()
}
effect_item <- function(lang, meas, dem, trm, item) {
demo_ranef_coded %>%
filter(language == lang, measure == meas, demo == dem, term == trm, definition == item) %>%
pull(exp_estimate)
}
apply_effect <- function(p, effect) prob(odds(p) * effect)
max_item <- function(lang, meas, dem, trm) {
demo_ranef_coded %>%
filter(language == lang, measure == meas, demo == dem, term == trm) %>%
filter(estimate == max(estimate)) %>%
pull(definition)
}
min_item <- function(lang, meas, dem, trm) {
demo_ranef_coded %>%
filter(language == lang, measure == meas, demo == dem, term == trm) %>%
filter(estimate == min(estimate)) %>%
pull(definition)
}
```
## Results
As discussed above, the primary target of our analysis is the item random effects for each demographic variable, indicating our best estimate of the specific effect of a particular demographic on a particular item. TThe first question we address is the distribution of these random effects. Figure \@ref(fig:itemsdemo-demo-ranef-distrib) shows the distribution of demographic random effects across all languages and measures, using a quantile-quantile (QQ) plot. Points on a diagonal line indicate conformity to the standard normal distribution, while deviations suggest differences in distributional form.
The resulting plots show a broad, low-slope diagonal with skewed tails. The majority of coefficients are within a very tight range: 95% of the item effects (across all languages, measures, and demographic factors) are within (`r roundp(cis$ci_lower_exp, 2)`, `r roundp(cis$ci_upper_exp, 2)`), i.e. the odds for a child in one demographic group knowing the word are not more than `r roundp(min(1/cis$ci_lower_exp, cis$ci_upper_exp),1)` times higher than for another group, for 95% of words. Most of the action is in the tails of the distribution: a few words vary much more in how often they are produced according to some demographic feature.
```{r itemsdemo-demo-ranef-distrib, dependson="itemsdemo-demo_ranef", fig.height=4, fig.cap="Quantiles of item random effects compared to theoretical quantiles of a normal distribution, for both production and comprehension."}
demo_ranef_coded %>%
mutate(demo = demo %>%
fct_relevel("sex", "birth_order", "mom_ed") %>%
fct_recode("Birth order" = "birth_order",
"Maternal education" = "mom_ed",
"Sex" = "sex")) %>%
ggplot(aes(sample = estimate)) +
# facet_wrap(vars(demo)) +
facet_grid(rows = vars(measure), cols = vars(demo), labeller = label_caps) +
geom_qq() +
geom_qq_line(linetype = .refline, colour = .grey) +
labs(x = "Theoretical quantiles", y = "Data quantiles")
```
In the following subsections, we examine these coefficients. For each demographic factor, we first give the fixed effects of the demographic levels, and then show the distribution of item random effects and the top `r num_tops` largest effects in each direction. It is these extreme items that we are most interested in.
### Sex
As shown in Chapter \@ref(demographics) and reflected here in Figure \@ref(fig:itemsdemo-demo-coefs-sex), there is a highly consistent advantage for girls in language production. This advantage is slightly less pronounced for comprehension but still present. Independently of this advantage, we also see specific items emerge as understood differentially for boys or girls.
```{r itemsdemo-demo-coefs-sex, fig.height=8, fig.width=8, dependson="itemsdemo-demo_coefs_plot", fig.cap="Main effect of sex for each language and measure."}
plot_demo_fixed("Sex")
```
Figure \@ref(fig:itemsdemo-ranef-plots-sex-comp) gives the full distribution of item effects for comprehension, and Figure \@ref(fig:itemsdemo-tops-sex-comp-2) shows the top `r num_tops` items most biased in each direction, in each languages. These are almost exclusively traditionally gendered items of clothing and toys, plus genital words. For English, for example, the words with a substantial male advantage are mostly vehicle- and tool-related, while the female advantage words are mostly clothing. Thus, our first impression is that most of these tend to be specific content items associated with gendered play.
```{r itemsdemo-ranef-plots-sex-comp, dependson="itemsdemo-plot_ranef", fig.height=5, out.width="80%", fig.cap=glue("Distribution of sex item random effects for comprehension data in each language.")}
plot_demo_ranef("Sex", "understands")
```
```{r itemsdemo-tops-sex-comp-1, dependson="itemsdemo-tops_ranef", fig.height=2*top_plots("sex", "understands")$rows[[1]], fig.cap=""}
top_plots("sex", "understands")$plt[[1]]
```
```{r itemsdemo-tops-sex-comp-2, dependson="itemsdemo-tops_ranef", fig.height=2*top_plots("sex", "understands")$rows[[2]], fig.cap=glue("Top {num_tops} most sex biased words in each language for comprehension data."), fig.pos="t"}
top_plots("sex", "understands")$plt[[2]]
```
Figures \@ref(fig:itemsdemo-ranef-plot-sex-prod) and \@ref(fig:itemsdemo-tops-sex-prod-3) give the same measures for production. There are considerably more words per language with large sex biases (at least `r exp(threshold)` times higher in either direction) for production (mean across languages `r roundp(100 * threshold_prop("produces", "sex"))`%) than for comprehension (mean across languages `r roundp(100 * threshold_prop("understands", "sex"))`%). The content of these sex biased words is extremely similar across languages. For English, we again see the largest biases in each direction for genital terms, the largest male biases for vehicles and objects associated with traditionally male activities (e.g., sports), and the largest female biases for female-coded clothing and toys. This pattern is replicated quite robustly across languages.
```{r itemsdemo-ranef-plot-sex-prod, dependson="itemsdemo-plot_ranef", fig.height=7, out.width="80%", fig.cap=glue("Distribution of sex item random effects for production data in each language.")}
plot_demo_ranef("Sex", "produces")
```
```{r itemsdemo-tops-sex-prod-1, dependson="itemsdemo-tops_ranef", fig.height=2*top_plots("sex", "produces")$rows[[1]], fig.cap=""}
top_plots("sex", "produces")$plt[[1]]
```
```{r itemsdemo-tops-sex-prod-2, dependson="itemsdemo-tops_ranef", fig.height=2*top_plots("sex", "produces")$rows[[2]], fig.cap=""}
top_plots("sex", "produces")$plt[[2]]
```
```{r itemsdemo-tops-sex-prod-3, dependson="itemsdemo-tops_ranef", fig.height=2*top_plots("sex", "produces")$rows[[3]], fig.cap=glue("Top {num_tops} most sex biased words in each language for production data."), fig.pos="t"}
top_plots("sex", "produces")$plt[[3]]
```
In sum, there appear to be two different processes at work in the sex effects we observe. The first is a general shift in the probability that _any_ word will be produced or understood such that girls are slightly more likely to produce or understand it than boys. The average magnitude of this fixed effect across languages is `r roundp(effect("understands", "Sex", "Female : Male"), 1)` for comprehension and `r roundp(effect("produces", "Sex", "Female : Male"), 1)` for production. In other words, if a male child had a 50% chance of saying a word (odds 1:1), a female child would on average have `r roundp(effect("produces", "Sex", "Female : Male"), 1)` times higher odds of saying it, i.e. a `r roundp(100 * (apply_effect(0.5, effect("produces", "Sex", "Female : Male"))), 0)`% chance. However, beyond this fixed effect, there are also variable effects for individual words. Most of these effects are small, but a few of them are quite large. For example, if an English-speaking male child has a 50% chance of saying the word _dress_, a female child would have a `r roundp(100 * (apply_effect(0.5, effect_item("English (American)", "produces", "sex", "Female : Male", "dress (object)"))), 0)`% chance; if a female child has a 50% chance of saying the word _hammer_, a male child would have a `r roundp(100 * (apply_effect(0.5, 1 / effect_item("English (American)", "produces", "sex", "Female : Male", "hammer"))), 0)`% chance.
### Birth order
We next consider individual items that are more or less likely in the vocabularies of first-born vs. later-born children. Here we consider both the contrast between second-born and first-born children as well as between later-born and second-born children. As shown in Figure \@ref(fig:itemsdemo-demo-coefs-bo), across languages, second-born children are advantaged over later-born children in both comprehension and production, while first-born children are advantaged above second-born children in production. Somewhat surprisingly, across languages the reverse is true of comprehension -- later-born children have somewhat bigger comprehension vocabulary. We can only speculate as to the source of this pattern, especially since it is not present in the production data. The number of languages for which we have birth order data is small, so conclusions are somewhat tentative.
```{r itemsdemo-demo-coefs-bo, fig.height=5, fig.width=8, dependson="itemsdemo-demo_coefs_plot", fig.cap="Main effect of birth order for each language and measure."}
plot_demo_fixed("Birth order")
```
Figures \@ref(fig:itemsdemo-ranef-plot-bo-comp) and \@ref(fig:itemsdemo-tops-bo-comp) again represent random effects coefficients for particular items in comprehension. In general, there are few surprises here: the words for _brother_ and _sister_ are much more likely for second-born children to understand, and even more likely for later-born children.^[Of course, children could be marked as understanding a term like _brother_ even without a true understanding of its relational structure!] Several languages additionally show a few other words that second-born and later-born children might be more likely to be exposed to via their siblings, such as _school_ in English and Norwegian (_skole_).
```{r itemsdemo-ranef-plot-bo-comp, dependson="itemsdemo-plot_ranef", fig.height=4, fig.width=5, out.width="50%", fig.cap=glue("Distribution of birth order item random effects for comprehension data in each language.")}
plot_demo_ranef("Birth order", "understands")
```
```{r itemsdemo-tops-bo-comp, dependson="itemsdemo-tops_ranef", fig.height=2*top_plots("birth_order", "understands")$rows[[1]], fig.cap=glue("Top {num_tops} most birth order biased words in each language for comprehension data.")}
top_plots("birth_order", "understands")$plt[[1]]
```
The same general patterns are present in the production data (Figures \@ref(fig:itemsdemo-ranef-plot-bo-prod) and \@ref(fig:itemsdemo-tops-bo-prod-2)), with additional evidence that having elder siblings appears to be related exposure to sweets, at least in some cultures: _popsicle_, _donut_, and _candy_ all appear in the English data, and _tyggegummi_ (gum) and several soda- and candy-related words appear in the Norwegian data. _Hate_ also appears in the English data, suggesting some emotional expressions due to having a sibling. We interpret this pattern with caution, however, as birth order is likely partially confounded with socioeconomic status, in that families from lower socioeconomic status populations tend to have more children [@huber2010]. So later-born children might also be more likely to come from low-SES families, who have more environmental exposure to "junk foods" like soda and candy [@ghosh2014; see below].
```{r itemsdemo-ranef-plot-bo-prod, dependson="itemsdemo-plot_ranef", fig.height=7, fig.width=5, out.width="50%", fig.cap=glue("Distribution of birth order item random effects for production data in each language.")}
plot_demo_ranef("Birth order", "produces")
```
```{r itemsdemo-tops-bo-prod-1, dependson="itemsdemo-tops_ranef", fig.height=2*top_plots("birth_order", "produces")$rows[[1]], fig.cap=""}
top_plots("birth_order", "produces")$plt[[1]]
```
```{r itemsdemo-tops-bo-prod-2, dependson="itemsdemo-tops_ranef", fig.height=2*top_plots("birth_order", "produces")$rows[[2]], fig.cap=glue("Top {num_tops} most birth order biased words in each language for production data."), fig.pos="t"}
top_plots("birth_order", "produces")$plt[[2]]
```
```{r itemdemo-bo-examples}
bo_max <- max_item("English (American)", "produces", "birth_order", "First : Second")
bo_max_effect <- effect_item("English (American)", "produces", "birth_order", "First : Second", bo_max)
bo_min <- min_item("English (American)", "produces", "birth_order", "First : Second")
bo_min_effect <- effect_item("English (American)", "produces", "birth_order", "First : Second", bo_min)
```
In sum, across languages, a given word has lower odds of being understood by a first-born child than a second-born child (by `r roundp(effect("understands", "Birth order", "First : Second"), 1)`) and higher odds of being understood by a second-born child than a later-born child (by `r roundp(effect("understands", "Birth order", "Second : Third+"), 1)`). It also has higher odds of being produced by both first-born compared to second-born (by `r roundp(effect("produces", "Birth order", "First : Second"), 1)`) and second-born compared to later-born (by `r roundp(effect("produces", "Birth order", "First : Second"), 1)`). Additionally, a handful of individual items show some substantial differences by birth order: in American English, a first-born child having a 50% chance of saying the word _`r bo_max`_ corresponds to a second-born child-born child having a `r roundp(100 * apply_effect(0.5, bo_max_effect), 0)`% chance of saying it; conversely, a second-born child having a 50% chance of saying the word _`r bo_min`_ corresponds to a first-born child-born child having a `r roundp(100 * apply_effect(0.5, 1/bo_min_effect), 0)`% chance of saying it.
A specific claim has been made in the literature regarding effects of birth order on language development -- @oshima-takane1996 reported that second-born children learn second-person personal pronouns (e.g., *you*) earlier, likely due to the disambiguating effect of having siblings addressed using such pronouns in overheard speech. We examined this pattern in our own American English data in Figure \@ref(fig:itemsdemo-props-plot-pronouns), but did not find global support for it. The major trend that emerged was a smaller first-born advantage for *me* and *mine* than average (and smaller than for *my*, *you*, and *your*). *Prima facie*, this finding is less consistent with a global increase in second-person disambiguation -- which would have predicted a reversal of the global pattern for these terms -- than with later-born children needing to assert "mine" to declare ownership.
```{r itemsdemo-props-plot-pronouns, dependson="itemsdemo-plot_props", fig.height=5, fig.cap="(ref:plot-pronouns-cap)"}
pronouns <- c("me", "my", "mine", "you","your", "yours")
plot_demo_props[["birth_order"]] %>%
filter(measure == "produces", definition %in% pronouns,
language == "English (American)") %>%
mutate(total = num_true + num_false) %>%
ggplot(aes(x = unscaled_age, y = prop, colour = birth_order)) +
facet_wrap(~uni_lemma) +
geom_smooth(aes(weight = total), se = FALSE, size = 0.8, method = "glm",
method.args = list(family = "binomial")) +
.scale_colour_discrete() +
labs(x = "Age (months)", y = glue("Proportion Production"),
colour = demo_labels[["birth_order"]]) +
lims(y = c(0, 1)) +
theme(legend.position = "top")
```
(ref:plot-pronouns-cap) Developmental trajectories for personal pronouns in American English words by birth order. Note that lines are overplotted, especially in the case of _mine_.
Examining the data across languages in Figure \@ref(fig:itemsdemo-plot-pronouns), we see a similar pattern. *Me*, *my*, and *mine* are all learned earlier for second-born children than first-born. In contrast, *you* and *your*, while numerically positive, have confidence intervals that overlap with zero. These data thus support the hypothesis that first-person possessives are learned slightly earlier for later-born children (perhaps related to property conflicts, e.g. "that's mine!").
```{r itemsdemo-plot-pronouns, fig.height=1.5, fig.cap="Histogram showing item random effects for personal pronouns by birth order, with each point representing the effect in a given language.Red lines marks no effect, blue lines marks the mean, and blue bands shows bootstrapped 95\\% confidence intervals."}
pronoun_ranef <- demo_ranef_coded %>%
filter(demo == "birth_order", measure == "produces", str_detect(term, "First")) %>%
mutate(estimate = -estimate) %>%
left_join(distinct(items, language, definition, uni_lemma)) %>%
filter(uni_lemma %in% pronouns) %>%
select(language, item = definition, translation = uni_lemma, estimate)
pronoun_cis <- pronoun_ranef %>%
group_by(translation) %>%
multi_boot_standard(col = "estimate")
plot_breaks <- exp_breaks(1.2, 3)
ggplot() +
facet_wrap(~translation, nrow = 1) +
geom_dotplot(aes(x = estimate), data = pronoun_ranef, binwidth = .05, stackratio = 1.15,
dotsize = 0.7) +
geom_vline(xintercept = 0, colour = .pal()(2)[2]) +
geom_rect(aes(xmin = ci_lower, xmax = ci_upper, ymin = 0, ymax = Inf), data = pronoun_cis,
fill = .pal()(2)[1], alpha = .3) +
geom_vline(aes(xintercept = mean), data = pronoun_cis, colour = .pal()(2)[1]) +
scale_y_continuous(expand = expand_scale(mult = c(0.01, 0.05))) +
scale_x_continuous(breaks = plot_breaks, labels = round(exp(plot_breaks), 2)) +
labs(x = "Odds ratio (Second : First)", y = "") +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(),
axis.title.y = element_blank())
```
### Maternal education
Our final set of analyses examines vocabulary items that are differentially present in the vocabulary of children with differing levels of maternal education. As noted in Chapter \@ref(demographics), there are substantial cross-linguistic differences in how large the overall socioeconomic stratification is. For example, we observe large differences in children's vocabulary size in the American English data, with children of less educated mothers reporting substantially lower production vocabulary. Fixed effects from this analyses are shown in Figure \@ref(fig:itemsdemo-demo-coefs-me). Children with higher levels of maternal education generally have larger productive vocabularies, but perhaps surprisingly, smaller comprehension vocabularies. We discussed this finding in depth in Chapter \@ref(demographics); here we reiterate that we believe it is plausibly due to reporting biases.
```{r itemsdemo-demo-coefs-me, fig.height=5, fig.width=8, dependson="itemsdemo-demo_coefs_plot", fig.cap="Main effect of maternal education for each language and measure."}
plot_demo_fixed("Maternal education")
```
Figures \@ref(fig:itemsdemo-ranef-plots-me-comp) and \@ref(fig:itemsdemo-tops-me-comp) show item random effects for comprehension. Many more words are strongly affected by maternal education for American English than for other languages: `r roundp(100 * threshold_lang("English (American)", "understands", "mom_ed"), 0)`% of American English words have effects of at least `r exp(threshold)` times in either direction, as compared to `r roundp(100 * threshold_prop("understands", "mom_ed"), 0)`% averaged over languages. This finding is consistent with the idea that maternal education shows a larger effect on total vocabulary size in the American English data than in other datasets, whether because of true cross-cultural differences in SES effects, the composition of the sample, or (most likely) both.
The words that are more likely to be understood by children of college-educated and secondary-educated mothers are often animal-related (e.g., Danish _får_ [sheep], _tiger_; English _cow_, _quack quack_) and may speculatively be related to reading books about animals (since most of these animals are not prominent in most children's experience). Some of the largest differences are for _read_ in English and _livro_ [book] in Portuguese, perhaps also related to reading practices (or the perception of the importance of these practices). Negatively linked words include some kinship terms (e.g., Danish _faster_ and _moster_ [aunt], _morbror_ and _fabror_ [uncle]; English _aunt_, _uncle_, _brother_), common sweets (e.g., English _candy_; Portuguese _chupa-chupa_ [brand of lollipop]; Spanish _soda_), and money-related words (e.g., Danish _småpenge_ [small change]; English _money_, _penny_).
```{r itemsdemo-ranef-plots-me-comp, dependson="itemsdemo-plot_ranef", fig.height=4, fig.width=5, out.width="50%", fig.cap=glue("Distribution of maternal education item random effects for comprehension data in each language.")}
plot_demo_ranef("Maternal education", "understands")
```
```{r itemsdemo-tops-me-comp, dependson="itemsdemo-tops_ranef", fig.height=2*top_plots("mom_ed", "understands")$rows[[1]], fig.cap=glue("Top {num_tops} most maternal education biased words in each language for production data.")}
top_plots("mom_ed", "understands")$plt[[1]]
```
Production data show a similar but more extreme picture (Figures \@ref(fig:itemsdemo-ranef-plot-me-prod) and \@ref(fig:itemsdemo-tops-me-prod)), with an even larger number of words linked to maternal education in American English (`r roundp(100 * threshold_lang("English (American)", "produces", "mom_ed"), 0)`%). Across languages, animal vocabulary is again more prevalent for children of more educated mothers (e.g., Danish _zebra_; English _sheep_; Portuguese _hipopótamo_), as is _babysitter's name_. Children of less education mothers are again more likely to say words for relatives (e.g., Czech _bráška_ [brother]; Danish _oldefar_ [great-grandfather]), junk foods (e.g., English _gum_, _candy_, _soda_), and money (e.g. German _arm_ [poor]; Spanish _dinero_ [money]), along with other harder to categorize items.
```{r itemsdemo-ranef-plot-me-prod, dependson="itemsdemo-plot_ranef", fig.height=6, fig.width=5, out.width="50%", fig.cap=glue("Distribution of maternal education item random effects for production data in each language")}
plot_demo_ranef("Maternal education", "produces", c(-1, 1.5))
```
```{r itemsdemo-tops-me-prod, dependson="itemsdemo-tops_ranef", fig.height=2*top_plots("mom_ed", "produces")$rows[[1]], fig.cap=glue("Top {num_tops} most maternal education biased words in each language for production data.")}
top_plots("mom_ed", "produces")$plt[[1]]
```
## Conclusions
Demographic factors like sex, birth order, and maternal education are related to children's vocabulary size. But, in addition to these global associations, they appear to be specifically associated with particular vocabulary items. Many of these are straightforwardly explicable in terms of differences in the environmental frequency (and importance) of particular lexical items for children in different circumstances. For example, there are many reasons why second-born children should say _brother_ or _sister_ more frequently than first-born children!
More generally, item level variation relates to two issues of interest within the context of our project. The first is the validity of CDI-based measurement. From a psychometric perspective, the sort of variation reported here is known as "differential item function" [@hambleton1991] and is a negative characteristic of tests that impairs their validity. Thus, from a purely psychometric perspective, items like _babysitter's name_ (or even _brother_) should probably not be included in global estimates of vocabulary size. (See Chapter \@ref(psychometrics) for more details on this issue). On the other hand, in instruments with more than 500 items, this handful of items probably cause minimal decreases in reliability or validity.
The second broader issue is the question of mechanisms responsible for the demographic associations documented in Chapter \@ref(demographics). Sex differences in vocabulary appear quite consistent across languages. Why is this? The analyses in this chapter allow us to gain one small piece of leverage on the issue by noticing that there appear to be two qualitatively different processes involved in the demographic effects we observed: first, girls have a small bump in their probability of producing almost every word, and second, there are a small number of particular words for which their production probability is substantially different (higher or lower). To the extent these processes are separable and differ in magnitude, we might look for causal mechanisms that would provide a broader boost to language (rather than trying to explain the small number of specifically sex-linked items identified above). Such hypotheses might appeal to dyadic factors like differences in amount and nature of language input directed to girls, to learner-internal factors like stronger social cognition, or to biological differences. A similar argument could be made for birth order and maternal education variables.
In sum, demographic differences in vocabulary likely have multiple sources. Average changes in vocabulary size could be a result of differences in the way children learn across groups, differences in the amount of language children in different groups hear, or an interaction between these two (for example, female children on average perhaps eliciting more lanugage in interaction). On the other hand, specific differences in which words are part of the vocabulary are more probably due to differences in input environment -- thus we see in the prevalence of words like _brother_ or _soda_ the effects of specific changes in children's environment on their learning outcomes.