-
Notifications
You must be signed in to change notification settings - Fork 89
/
10-wt-longitudinal-analysis.Rmd
1410 lines (1147 loc) · 53.4 KB
/
10-wt-longitudinal-analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
editor_options:
markdown:
wrap: 72
---
# Walkthrough 4: Longitudinal analysis with federal students with disabilities data {#c10}
```{r load custom themes, echo = FALSE, message = FALSE, eval=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```
**Abstract**
This chapter explores cleaning, visualizing, and modeling aggregate
data. Data scientists in education frequently work with public aggregate
data when student-level data is not available. By analyzing aggregate
datasets, data scientists in education uncover context for other
analyses. Using a freely available federal government dataset that
includes students in special education, this chapter compares the gender
variable over time in the United States.
Analysis on this scale provides useful context for district and school
level analysis. It encourages questions about the experiences of
students in special education at the local level by offering a basis for
comparison at a national level. Data science tools in this chapter
include importing data, preparing data for analysis, visualizing data,
and selecting plots for communicating results.
## Topics emphasized
- Importing data
- Tidying data
- Transforming data
- Visualizing data
- Modeling data
- Communicating results
## Functions introduced
- `list.files()`
- `download.file()`
- `lubridate::ymd()`
- `identical()`
- `dplyr::top_n()`
- `ggplot2::geom_jitter()`
- `dplyr::arrange()`
## Vocabulary
- aggregate data
- file path
- list
- read in
- tidy format
- statistical model
- student-level data
- longitudinal analysis
- ratio
- subset
- vector
## Chapter overview
Data scientists working in education don't always have access to student
level data. Knowing how to model publicly available datasets, as in
[*the previous chapter*](#c9), empowers data scientists to do analysis
when individual-level data is not available. This walkthrough builds on
the last chapter by focusing on students with disabilities data over
time. Note that this kind of analysis goes by a number of names, such as
longitudinal analyses, time series analyses, or change over time.
*A note on the gender variable in this dataset*
The dataset in this chapter uses a binary approach to data collection
about gender, which does not reflect the experiences of students who do
not identify with either of these categories. It's important that data
scientists in education consider data governance to improve how gender
data is collected. It's also important for data scientists to interpret
existing binary data with the proper nuance and context. The need for an
inclusive approach to documenting gender identity is discussed in a
paper by @park2016 of The Williams Institute at UCLA.
### Background
In this chapter, you'll learn to explore a publicly available dataset
and analyze its data over time. Like most public datasets (see [*the
previous chapter*](#c9)), this one contains aggregate data. This means
that someone totaled up the student counts so that it doesn't reveal
individual and personally identifiable information.
You can download the datasets for this walkthrough on the United States
Department of Education website (see Department
@usdoe2019)[^10-wt-longitudinal-analysis-1]. They are also available in
the {dataedu} package that accompanies this book, described in the
"Importing the Data From the {dataedu} Package" section below.
[^10-wt-longitudinal-analysis-1]: The documentation for the dataset is
available here:
<https://www2.ed.gov/programs/osepidea/618-data/collection-documentation/data-documentation-files/part-b/child-count-and-educational-environment/idea-partb-childcountandedenvironment-2017-18.pdf>
### Methods
In this walkthrough, you'll learn how to read multiple datasets in using
the `map()` function. Next, you'll prepare the data for analysis by
cleaning the variable names. Finally, you'll explore the datasets by
visualizing student counts and comparing male-to-female ratios over
time.
## Load packages
The function `here()` from the {here} package can cause conflicts with
other functions called `here()`. Prevent problems by loading that
package last and including the package name for every call to `here()`,
like this: `here::here()`. This is called "including the namespace".
If you have not installed these packages, do so using the
`install.packages()` function; see the ["Packages" section](#c06p) of
the ["Foundational Skills"](#c06) chapter for instructions.
Load the packages for this walkthrough using this code:
```{r load packages, message=FALSE, warning=FALSE}
library(tidyverse)
library(dataedu)
library(lubridate)
library(here)
```
## Import data
In this analysis, you'll be importing and combining six datasets that
describe the number of students with disabilities in a given year.
First, you'll review how to get the `.csv` files you'll need for the
task. Alternatively, we make these data files available through the
{dataedu} package. You'll find more about the {dataedu} package below.
Note that if you use this walkthrough on different datasets or store
datasets in different locations on your computer, you'll need to your
code accordingly. We suggest only doing this if you have previous
experience using R.
### What to download
In this walkthrough, you'll be using six separate datasets of child
counts, one for each year between 2012 and 2017. If you're copying and
pasting the code in this walkthrough, downloading the datasets from our
GitHub repository for the most reliable results.
You can also access this data through the {dataedu} package; see the
"Importing the Data From the {dataedu} Package" section of this chapter.
Here's a link to each file; we also include a short URL via the
URL-shortener website *bit.ly*:
- [2012
data](https://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2012.csv)
(https[
]()://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2012.csv)
(<https://bit.ly/3dCtVtf>)
- [2013
data](https://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2013.csv)
(https[
]()://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2013.csv)
(<https://bit.ly/33WXnFX>)
- [2014
data](https://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2014.csv)
(https[
]()://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2014.csv)
(<https://bit.ly/2UvSwbx>)
- [2015
data](https://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2015.csv)
(https[
]()://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2015.csv)
(<https://bit.ly/39wQAUg>)
- [2016
data](https://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2016.csv)
(https[
]()://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2016.csv)
(<https://bit.ly/2JubWHC>)
- [2017
data](https://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2017-18.csv)
(https[
]()://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2017-18.csv)
(<https://bit.ly/2wPLu8w>)
For reference, we originally downloaded these files from the [United
States Department of Education
website](https://www2.ed.gov/programs/osepidea/618-data/state-level-data-files/index.html)
(https[
]()://www2.ed.gov/programs/osepidea/618-data/state-level-data-files/index.html)
### A note on file paths
When you download these files, store them in a folder in your working
directory. To get to the data in this walkthrough, use this file path in
your working directory: "data/longitudinal_data".
You'll be using the `here()` *function* from the {here} *package*, which
conveniently fills in the folders in the file path of your working
directory all the way up to the folders you specify in the arguments.
So, when referencing the file path "data/longitudinal_data", you'll use
code like this:
```{r demo here, eval = FALSE}
here::here("data",
"longitudinal_data",
"bchildcountandedenvironments2012.csv")
```
You can use a different file path if you like but take note of where
your downloaded files are so you can use the correct file path when
writing code to import the data.
### How to download the files
One way to download the files is manually---saving them to a working
directory. Another way is to read them directly into R, using the
`download.file()` function using the same file path described in the
previous section. This functionality works for any CSV files that
download from webpages; the key is that the URL must point to the CSV
file itself. One way to check this is to ensure the URL ends in `.csv`.
Here is how to do it for the first dataset, which contains 2012 data:
```{r, eval = FALSE}
download.file(
# the url argument takes a URL for a CSV file
url = 'https://bit.ly/3dCtVtf',
# destfile specifies where the file should be saved
destfile = here::here("data",
"longitudinal_data",
"bchildcountandedenvironments2012.csv"),
mode = "wb")
```
Do this for the remaining five datasets:
```{r, eval = FALSE}
download.file(
url = 'https://bit.ly/33WXnFX',
destfile = here::here("data",
"longitudinal_data",
"bchildcountandedenvironments2013.csv"),
mode = "wb")
download.file(
url = 'https://bit.ly/2UvSwbx',
destfile = here::here("data",
"longitudinal_data",
"bchildcountandedenvironments2014.csv"),
mode = "wb")
download.file(
url = 'https://bit.ly/39wQAUg',
destfile = here::here("data",
"longitudinal_data",
"bchildcountandedenvironments2015.csv"),
mode = "wb")
download.file(
url = 'https://bit.ly/2JubWHC',
destfile = here::here("data",
"longitudinal_data",
"bchildcountandedenvironments2016.csv"),
mode = "wb")
download.file(
url = 'https://bit.ly/2wPLu8w',
destfile = here::here("data",
"longitudinal_data",
"bchildcountandedenvironments2017-18.csv"),
mode = "wb")
```
Now that the files are downloaded you're ready to read the data into R.
If you were unable to download these files, they are also available
through the {dataedu} package, as we describe after the "Reading in Many
Datasets" section.
### Reading in one dataset
You'll be learning how to read in more than one dataset using the
`map()` function. Try it first with one dataset, then try it with
multiple datasets. When you are analyzing multiple datasets that have
the same structure, you can read in each dataset using one code chunk.
This code chunk will store each dataset as an element of a list.
Before doing that, explore one of the datasets to see what you can learn
about its structure. Clues from this exploration inform how you read in
all the datasets at once later on. For example, you can see that the
first dataset has some lines at the top that contain no data:
```{r read 2012 data, echo = FALSE, message = F, warning = F}
read_csv(here::here(
"data",
"longitudinal_data",
"bchildcountandedenvironments2012.csv"
))
```
The rows containing "Extraction Date:", "Updated:", and "Revised:"
are not actually rows. They are notes the authors left at the top of the
dataset to show when the dataset was changed.
`read_csv()` uses the first row as the variable names unless told
otherwise, so you need to tell `read_csv()` to skip those lines using
the `skip` argument. If you don't, `read_csv()` assumes the very first
line---the one that says "Extraction Date:"---is the correct row of
variable names. That's why calling `read_csv()` without the `skip`
argument results in column names like `X4`. When there's no obvious
column name to read in, `read_csv()` names them `X[...]` and lets you
know in a warning message.
Try using `skip = 4` in your call to `read_csv()`:
```{r read 2012 data with skip, warning = FALSE, message = FALSE}
read_csv(here::here(
"data",
"longitudinal_data",
"bchildcountandedenvironments2012.csv"
),
skip = 4)
```
The `skip` argument told `read_csv()` to make the line containing
"Year", "State Name", and so on as the first line. The result is a
dataset that has "Year", "State Name", and so on as variable names.
### Reading in many datasets
Will the `read_csv()` and `skip = 4` combination work on all our
datasets? To find out, use this strategy:
- Store a vector of filenames and paths in a list. These paths point
to our datasets
- Pass the list of filenames as arguments to `read_csv()` using
`purrr::map()`, including `skip = 4`, in our `read_csv()` call
- Examine the new list of datasets to see if the variable names are
correct
To understand what `purrr::map()` does, imagine a widget-making machine
that works by acting on raw materials it receives on a conveyer belt.
This machine executes one set of instructions on each of the raw
materials it receives. You are the operator of the machine, and you
design instructions to get a widget out of the raw materials. Your plan
might look something like this:
- **Raw materials**: a list of filenames and their paths
- **Widget-making machine**: `purrr:map()`
- **Widget-making instructions**: \`read_csv(path, skip = 4)
- **Expected widgets**: a list of datasets
Create the raw materials first. These will be file paths to each of the
CSVs you want to read. Use `list.files` to make a vector of filename
paths and name that vector `filenames`. `list.files` returns a vector of
file names in the folder specified in the `path` argument. When you set
the `full.names` argument to "TRUE", you get a full path of these
filenames. This will be useful later when you need the file names and
their paths to read the data in.
```{r get filenames, results='hide'}
# Get filenames from the data folder
filenames <-
list.files(path = here::here("data", "longitudinal_data"),
full.names = TRUE)
# A list of filenames and paths
filenames
```
That made a vector of six filenames, one for each year of child count
data stored in the data folder. Now pass the raw materials, the vector
called `filenames`, to the widget-making machine called `map()` and give
the machine the instructions `read_csv(., skip = 4)`. Name the list of
widgets it cranks out `all_files`:
```{r read list of CSVs, , message=FALSE, warning=FALSE}
# Pass filenames to map and read_csv
all_files <-
filenames %>%
# Apply the function read_csv to each element of filenames
map(., ~ read_csv(., skip = 4))
```
It is important to think ahead here. The goal is to combine the datasets
in `all_files` into one dataset using `bind_rows()`. But that will only
work if all the datasets in your list have the same number of columns
and the same column names. Check the column names by using `map()` and
`names()`:
```{r check column names, include=FALSE}
all_files %>%
# Apply the function names to each element of all_files
map(names)
```
Use `identical()` to see if the variables from the two datasets match.
You'll see that the variable names of the first and second datasets
don't match, but the variables from the second and third do.
```{r check names with identical}
# Variables of first and second dataset don't match
identical(names(all_files[[1]]), names(all_files[[2]]))
# Variables of third and second files match
identical(names(all_files[[2]]), names(all_files[[3]]))
```
Check the number of columns by using `map()` and `ncol()`:
```{r check cols}
all_files %>%
# apply the function ncol to each element of all_files
map(ncol)
```
This is an extremely common problem in education data. Neither the
number of columns nor the column names match. With different column
names you won't be able to combine the datasets in a later step.
You'll see that if you try `bind_rows()`, which returns a dataset with
100 columns instead of the expected 50.
```{r combine datasets 1}
# combining the datasets at this stage results in the incorrect
# number of columns
bind_rows(all_files) %>%
# check the number of columns
ncol()
```
You'll correct this in the next section by selecting and renaming the
variables, but it's good to notice this problem early in the process so
you know to work on it later.
### Loading the data from {dataedu}
If you were unable to download the files, you can read in the
`all_files` list of six data frames through the {dataedu} package with
the following line of code:
```{r, eval = FALSE}
all_files <- dataedu::all_files
```
## Process data
Transforming your dataset before visualizing it and fitting models is
critical. It's easier to write code when variable names are concise and
informative. Many functions in R---especially those in the {ggplot2}
package---work best when datasets are in a "tidy" format. It's easier to
do an analysis when you have just the variables you need because unused
variables can confuse your thought process.
Here are the steps you'll be taking to process the data:
1. Fix the variable names in the 2016 data
2. Combine the datasets
3. Pick variables
4. Filter for the desired categories
5. Rename the variables
6. Standardize the state names
7. Transform the column formats from wide to long using `pivot_longer`
8. Change the data types of variables
9. Explore `NA`'s
In real life, data scientists don't always know the cleaning steps until
they dive into the work. Learning what cleaning steps are needed
requires exploration, trial and error, and clarity on the analytic
questions you want to answer.
These are just some initial steps for processing data. When you do your
own projects, you will find different things to transform. As you do
more and more data analysis, your instincts for what to transform will
improve.
### Fix the variable names in the 2016 data
When you print the 2016 dataset, notice that the variable names are
incorrect. Verify that by looking at the first ten variable names of the
2016 dataset, which is the fifth element of `all_files`:
```{r 2016 dataset}
# Look at the first 10 column names of 2016
names(all_files[[5]])[1:10]
```
You want the variable names to be `Year` and `State Name`, not `2016`
and `Alabama`. But first, go back and review how to get at the 2016
dataset from `all_files`. You need to identify which element the 2016
dataset was in the list. The order of the list elements was set when you
fed `map()` the list of file names. If you look at `filenames` again,
notice that its fifth element is the 2016 dataset. Try looking at the
first and fifth elements of `filenames`:
```{r show filenamess, eval=FALSE}
filenames[[1]]
filenames[[5]]
```
Now that you know the 2016 dataset is the fifth element in the list,
pluck it out by using double brackets:
```{r pluck 2016 data}
all_files[[5]]
```
You used `skip = 4` when you read in the datasets in the list. That
worked for all datasets except the fifth one. In that one, skipping four
lines left out the variable name row. To fix it, read the 2016 dataset
again using `read_csv()` and the fifth element of `filenames`. This
time, use the argument `skip = 3`. You'll assign the newly read dataset
to the fifth element of the `all_files` list:
```{r fix 2016 colnames, message=FALSE, warning=FALSE}
all_files[[5]] <-
# Skip the first 3 lines instead of the first 4
read_csv(filenames[[5]], skip = 3)
```
Try printing `all_files` now. You can confirm you fixed the problem by
checking that the variable names are correct.
### Select variables
Now that you know the datasets have the correct variable names, simplify
them by picking the variables you need. This is a good place to think
carefully about which variables to pick. This usually requires trial and
error, but here's a good start:
- The analytic questions are about gender, so pick the gender variable
- Later, you'll need to filter the dataset by disability category and
program location so you'll want `SEA Education Environment` and
`SEA Disability Category`
- You'll be making comparisons by state and reporting year, so also
pick `State Name` and `Year`
Combining `select()` and `contains()` is a convenient way to pick these
variables without writing a lot of code. Knowing that you want variables
that contain the acronym "SEA" and variables that contain "male" in
their names, you can pass those characters to `contains()`:
```{r pick vars}
all_files[[1]] %>%
select(
Year,
contains("State", ignore.case = FALSE),
contains("SEA", ignore.case = FALSE),
contains("male")
)
```
That code chunk verifies that you selected the variables you wanted. Now turn
the code chunk into a function called `pick_vars()`. Then use `map()` to
apply `pick_vars()` to each dataset in the `all_files` list.
In this function, you'll use a special version of `select()` called
`select_at()`. This conveniently picks variables based on criteria you
give it. The argument
`vars(Year, contains("State", ignore.case = FALSE), contains("SEA", ignore.case = FALSE), contains("male"))`
tells R you want to keep any column whose name has "State" in upper or
lower case letters, has "SEA" in the title, and has "male" in the title.
This will result in a newly transformed `all_files` list that contains
six datasets, all with the desired variables.
```{r pick vars function}
# build the function
pick_vars <-
function(df) {
df %>%
select_at(vars(
Year,
contains("State", ignore.case = FALSE),
contains("SEA", ignore.case = FALSE),
contains("male")
))
}
# use the function with `all_files`
all_files <-
all_files %>%
map(pick_vars)
```
### Combine six datasets into one
Now it's time to combine the datasets in the list `all_files` into one.
You'll use `bind_rows()`, which combines datasets by adding each one to
the bottom of the one before it. The first step is to check and see if
the datasets have the same number of variables and the same variable
names. When you use `names()` on our list of newly changed datasets,
you'll see that each dataset's variable names are the same:
```{r verify var names}
# check variable names
all_files %>%
map(names)
```
That means you can combine all six datasets into one using `bind_rows()`.
You'll call this newly combined dataset `child_counts`:
```{r combine datasets 2}
child_counts <-
all_files %>%
# combine all datasets in `all_files`
bind_rows()
```
Now you know the following:
1. Each of the six datasets had eight variables
2. The combined dataset also has eight variables
So you can conclude that all the rows have combined together correctly. Just
to be safe, use `str()` to verify:
```{r check structure}
str(child_counts)
```
### Importing the data from the {dataedu} package
If you would like to load this processed dataset (`child_counts`) using
the {dataedu} package, run the following code:
```{r, eval = FALSE}
longitudinal_data <- dataedu::child_counts
```
### Filter for the desired disabilities and age groups
You want to explore gender-related variables, but the dataset has
additional aggregate data for other student groups. For example, you can
use `count()` to explore all the different disability groups in the
dataset. Here's the number of times an `SEA Disability Category` appears in the
dataset. Note that it is
essential to type the name of this variable using backticks because it includes
spaces---as you can see in the code below):
```{r count disabilities}
child_counts %>%
# count number of times the category appears in the dataset
count(`SEA Disability Category`)
```
Since you'll be visualizing and modeling gender variables for all
students in the dataset, you'll filter out all subgroups except "All
Disabilities" and the age totals:
```{r filter child_counts}
child_counts <-
child_counts %>%
filter(
# filter all but the All Disabilities category
`SEA Disability Category` == "All Disabilities",
# filter all but the age totals
`SEA Education Environment` %in% c("Total, Age 3-5", "Total, Age 6-21")
)
```
### Rename the variables
In the next section, you'll prepare the dataset for visualization and
modeling by "tidying" it. When you write code to transform datasets,
you'll be typing the column names a lot. Do yourself a favor and change
the names to more convenient names:
```{r rename vars}
child_counts <-
child_counts %>%
rename(
# change these columns to more convenient names
year = Year,
state = "State Name",
age = "SEA Education Environment",
disability = "SEA Disability Category",
f_3_5 = "Female Age 3 to 5",
m_3_5 = "Male Age 3 to 5",
f_6_21 = "Female Age 6 to 21",
m_6_21 = "Male Age 6 to 21"
)
```
### Clean state names
Have noticed that some state names are in uppercase letters and some are
in lowercase letters?
```{r count states}
child_counts %>%
count(state) %>%
head()
```
If you leave it like this, R will treat state values like "CALIFORNIA"
and "California" as two different states. You can use `mutate` and
`tolower` to transform all the state names to lowercase letters.
```{r fix caps}
child_counts <-
child_counts %>%
mutate(state = tolower(state))
```
### Tidy the dataset
Visualizing and modeling the data will be easier if it's in a "tidy"
format. In *Tidy Data* [@wickham2014], defines tidy datasets as
possessing the following characteristics:
> 1. Each variable forms a column
> 2. Each observation forms a row
> 3. Each type of observational unit forms a table
The gender variables in this dataset are spread across four columns,
with each representing a combination of gender and age range. Use
`pivot_longer()` to bring the gender variable into one column. In this
transformation, you'll create two new columns: a `gender` column and a
`total` column. The `total` column will contain the number of students
in each row's gender and age category:
```{r pivot_longer gender col}
child_counts <-
child_counts %>%
pivot_longer(cols = f_3_5:m_6_21,
names_to = "gender",
values_to = "total")
```
To make the values of the `gender` column more intuitive, use
`case_when()` to transform the values to either "f" or "m":
```{r replace gender values}
child_counts <-
child_counts %>%
mutate(
gender = case_when(
gender == "f_3_5" ~ "f",
gender == "m_3_5" ~ "m",
gender == "f_6_21" ~ "f",
gender == "m_6_21" ~ "m",
TRUE ~ as.character(gender)
)
)
```
### Convert data types
The values in the `total` column represent the number of students from a
specific year, state, gender, and age group. You know from the `chr`
under their variable names that R treats these values like characters
instead of numbers. While R does a decent job of deciding this, it's
much safer to prepare the dataset by changing these character columns to
numeric columns. Use `mutate()` to change the count columns:
```{r convert to numeric}
child_counts <-
child_counts %>%
mutate(total = as.numeric(total))
child_counts
```
Converting these count columns from character classes to number classes
results in two changes. First, the `chr` under these variable names
changed to `dbl`, short for "double-precision". This lets you know that
R recognizes these values as numbers with decimal points. Second, the
blank values changed to `NA`. When R sees a character class value like
`"4"`, it knows to change it to numeric class `4`. But there is no
obvious number represented by a value like `""` or `-`, so it changes it
to `NA`:
```{r demo as.numeric}
# Convert a character to a number
as.numeric("4")
# Convert a blank character or a symbol to a number
as.numeric("")
as.numeric("-")
```
Similarly, the variable `year` needs to be changed from the character
format to the date format. This will make sure R treats this variable
like a point in time when you plot the dataset. The package {lubridate}
has a function called `ymd` that does this. Use the `truncated` argument
to let R know you don't have a month and date to convert.
```{r convert date}
child_counts <-
child_counts %>%
mutate(year = ymd(year, truncated = 2))
```
### Explore and address NAs
You'll notice that some rows in the `total` column contain an `NA`. When
you used `pivot_longer()` to create a `gender` column, R created unique
rows for every year, state, age, disability, and gender combination.
Since the original dataset had both gender and age range stored in a
column like `Female Age 3 to 5`, R made rows where the `total` value is
`NA`. For example, there is no student count for the `age` value "Total,
Age 3--5" that also has the `gender` value for female students who were
age 6--21. You can see that more clearly by sorting the dataset by year,
state, and gender.
In the Foundational Skills chapter, you learned a {dplyr} function
called `arrange()` to sort the rows of a dataset by the values in a
column. Use `arrange()` here sort the dataset by the `year`, `state` and
`gender` columns. When you pass a variable to `arrange()`, it sorts by the
order of the values in that variable. When you pass it multiple
variables, `arrange()` sorts by the first variable, then by the second,
and so on. See see what it does on `child_counts` when you pass it the
`year`, `state`, and `gender` variables:
```{r show NAs}
child_counts %>%
arrange(year, state, gender)
```
You can simplify the dataset by removing the rows with `NA`, leaving one
row for each category:
- females age 3--5
- females age 6--21
- males age 3--5
- males age 6--21
Each of these categories will be associated with a state and reporting
year:
```{r remove NAs}
child_counts <-
child_counts %>%
filter(!is.na(total))
```
Verify you have the categories you want by sorting again:
```{r verify rows}
child_counts %>%
arrange(year, state, gender)
```
## Analysis
In the last section, you imported the dataset for analysis. In this
section, you'll ask, "How have child counts changed over time?"
First, you'll use visualization to explore the number of students in
special education over time. In particular, you'll compare the count of
gender categories. Next, you'll use what you learn from the
visualizations to quantify any differences.
### Visualize the dataset
Showing a lot of states in a plot will be overwhelming, so start by
making a subset of the dataset. Use a function in the {dplyr} package
called `top_n()` to learn which states have the highest mean count of
students with disabilities:
```{r filter dataset with top_n}
child_counts %>%
group_by(state) %>%
summarize(mean_count = mean(total)) %>%
# which six states have the highest mean count of students with disabilities
top_n(6, mean_count)
```
These six states have the highest mean count of students in special
education over the six years you are analyzing. For reasons you'll see
later, exclude outlying areas and freely associated states. That leaves
you with five states: California, Florida, New York, Pennsylvania, and
Texas. You can remove all other states but these by using `filter()`.
Call this new dataset `high_count`:
```{r filter most populated}
high_count <-
child_counts %>%
filter(state %in% c("california", "florida", "new york", "pennsylvania", "texas"))
```
Now, you can use `high_count` to do some initial exploration. This
analysis compares counts of gender categories in special education, but
visualization is also a great way to explore related curiosities.
You may surprise yourself with what you find when visualizing datasets.
For example, you might discover interesting hypotheses, find better ways
to transform the data or find interesting subsets. Recall that in the
high `mean_count` of freely associated states in the `state` column.
Let your curiosity and intuition drive this part of the analysis. It's
one of the activities that makes data analysis a creative process.
In that spirit, you'll start by visualizing specific gender categories
and age groups. Feel free to try these, but also try other student
groups for further exploration.
Start by running this code in your console to see what it does:
```{r fig10-1, results = 'hide', message = FALSE, warning = F, fig.cap = "Count of Female Students in Special Education Over Time", fig.showtext = TRUE}
high_count %>%
filter(gender == "f", age == "Total, Age 6-21") %>%
ggplot(aes(x = year, y = total, color = state)) +
geom_freqpoly(stat = "identity", size = 1) +
labs(title = "Count of Female Students in Special Education Over Time",
subtitle = "Ages 6-21") +
scale_color_dataedu() +
theme_dataedu()
```
The result is a plot that has the years on the x-axis and a count of
female students on the y-axis. Each line takes a different color based
on the state it represents.
Let's look closer: You used `filter()` to subset the dataset for
students who are female and ages 6 to 21. You used `aes` to connect
visual elements of our plot to our data. You connected the x-axis to
`year`, the y-axis to `total`, and the color of the line to `state`.
Note here that, instead of storing a new dataset in a new variable, you
filtered the dataset, then used the pipe operator `%>%` to feed it to
{ggplot2}. While exploring, you don't need to create a new variable if
you don't think you'll need them later.
Now try the same plot but subsetting for male students. Use the same
code from the last plot, but filter for the value "m" in the `gender`
field:
```{r fig10-2, results = 'hide', message = FALSE, fig.cap="Count of Male Students in Special Education Over Time", fig.showtext = TRUE}
high_count %>%
filter(gender == "m", age == "Total, Age 6-21") %>%
ggplot(aes(x = year, y = total, color = state)) +
geom_freqpoly(stat = "identity", size = 1) +
labs(title = "Count of Male Students in Special Education Over Time",
subtitle = "Ages 6-21") +
scale_color_dataedu() +
theme_dataedu()
```
You've looked at each gender separately. What do these lines look like
if you visualized the total number of students each year per state? To
do that, you'll need to add the gender values together, and both age
group values together. You'll do this using a common combination of
functions: `group_by()` and `summarize()`:
```{r fig10-3, results = 'hide', message = FALSE, fig.cap="Total Count of Students in Special Education Over Time", fig.showtext = TRUE}
high_count %>%
group_by(year, state) %>%
summarize(n = sum(total)) %>%
ggplot(aes(x = year, y = n, color = state)) +
geom_freqpoly(stat = "identity", size = 1) +
labs(title = "Total Count of Students in Special Education Over Time",
subtitle = "Ages 3-21") +
scale_color_dataedu() +
theme_dataedu()
```
So far, you've looked at ways to count students over time. In each plot,
you saw that while counts have grown overall, each state has different
sized populations. Now, you'll summarize that difference by looking at
the median student count for each state over time:
```{r fig10-4, results = 'hide', message = FALSE, fig.cap="Median Students with Disabilities Count", fig.showtext = TRUE}
high_count %>%
group_by(year, state) %>%
summarize(n = sum(total)) %>%
ggplot(aes(x = state, y = n)) +
geom_boxplot(fill = dataedu_colors("yellow")) +
labs(title = "Median Students with Disabilities Count",
subtitle = "All ages and genders, 2012-2017") +
theme_dataedu()
```
The boxplots show what you might have expected from our `freqpoly` plots
before it. The highest median student count over time is California and
the lowest is Pennsylvania.
What have you learned about the data so far? The five states in the US
with the highest total student counts (not including outlying areas and
freely associated states) do not have similar counts. The student counts
for each state also appear to have grown over time.
But how might you compare the male category count to the female category
count? One way is to use a ratio. For example, if Variable A is equal to
14, and Variable B is equal to 7, the ratio between Variable A and
Variable B is 2.00, indicating that the first number contains twice the
number of the second.
Use the count of male students in each state and divide it by the count
of each female student. The result is the number of times male students
are in special education more or less than female students in the
same state and year. The coding strategy will be to: