# Baby Names Example

Here's a fun example of a common problem. You download a zip archive with a bunch of files that need to be combined. Then you want to quickly see what's in them.

Here's a quick way to do that all from the terminal. And since you're working under the hood of the computer it's pretty fast work.

We're going to work with the Baby Names from Social Security Card Applications-National Level Data available on Data.gov.

First, let's make a folder where we want to put the data we're about to download:

$ mkdir my_folder

Move into that folder.

$ cd my_folder

Then grab the data from Data.gov to your machine. (A quick footnote: wget doesn't come preinstalled on every machine, so you may need to install it on your personal computer first.)

$ wget https://www.ssa.gov/oact/babynames/names.zip

Extract the data.

$ unzip names.zip

Check out the new files.

$ ls

How many rows we talkin in the first file?

$ wc -l yob1880.txt

How many rows we talkin in all the files in this folder?

$ wc -l *.txt
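If you're wondering about the last line of that output: when 'wc -l' gets several files, it prints one count per file plus a combined total. A tiny sketch with throwaway files (the wc_demo folder and its contents are made up for the demo):

```shell
# make two tiny files, then count lines in both at once
mkdir -p wc_demo
printf 'a\nb\n' > wc_demo/one.txt      # 2 lines
printf 'c\nd\ne\n' > wc_demo/two.txt   # 3 lines
wc -l wc_demo/*.txt                    # per-file counts plus a "total" row
```

The final row reads `5 total`; it's the sum, not a file.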

Let's look at the first few rows of one.

$ head yob1880.txt

And the last few rows.

$ tail yob1880.txt

Smush it all into one file quickly. We use the 'cat' command to read all the files in our folder ending in .txt and then send the output to a new file.

$ cat *.txt > baby_names_1880_2013.txt
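One caution worth knowing (it won't bite on the first run): the combined file also ends in .txt, so running that cat line a second time sweeps the combined file into the *.txt glob too. Narrowing the glob to the yobYYYY.txt pattern sidesteps that. A sketch with made-up stand-in files:

```shell
# stand-in data files named like the real ones
mkdir -p cat_demo
printf 'Mary,F,7065\n' > cat_demo/yob1880.txt
printf 'Anna,F,2604\n' > cat_demo/yob1881.txt
# yob*.txt can never match the combined file, so re-running is harmless
cat cat_demo/yob*.txt > cat_demo/baby_names_combined.txt
wc -l cat_demo/baby_names_combined.txt   # 2 lines
```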

How many lines of data we got in the smushed file?

$ wc -l baby_names_1880_2013.txt

Let's smush it all together again, but this time record which file each row came from.

First, kill the file we just created.

$ rm baby_names_1880_2013.txt

To smush it back together and record what file each row came from, we need to talk about a command called 'grep.' The grep command can search for a specific string. When you're searching across multiple files, it has the added benefit of telling you which file it found the string in.

A simple use of this command would be:

$ grep "A*gustin" *.txt
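A detail that trips people up: in grep's pattern language, A* means "zero or more A characters," not "A followed by anything" like a shell wildcard. So this pattern matches gustin and Augustin as well as Agustin. A quick sanity check on some throwaway names:

```shell
# A* = zero or more A's, so all three of these lines match
printf 'Agustin\ngustin\nAugustin\n' | grep 'A*gustin'
```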

Let's smush together all the files in our current directory that end in .txt. The empty string "" matches every line, so 'grep' prints every row along with the file it came from.

$ grep "" *.txt > baby_names_1880_2013.txt
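To see why this records the filename, here's the shape of grep's multi-file output on two tiny stand-in files (the grep_demo folder and its rows are made up, in the style of the real data):

```shell
mkdir -p grep_demo
printf 'Mary,F,7065\n' > grep_demo/yob1880.txt
printf 'John,M,9655\n' > grep_demo/yob1881.txt
# with more than one file, grep prefixes every match with its filename
grep "" grep_demo/*.txt
```

Each row comes out as filename:original_line, e.g. grep_demo/yob1880.txt:Mary,F,7065 — exactly the provenance we wanted.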

Check out the new file.

$ ls

How many lines of data we got?

$ wc -l baby_names_1880_2013.txt

What do the first few rows look like?

$ head baby_names_1880_2013.txt

Maybe we want to export one of these slices to look at in Excel:

$ head baby_names_1880_2013.txt > baby_names_first_few_rows.csv
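head grabs 10 lines by default; -n lets you pick the size of the slice:

```shell
# take the first 3 lines instead of the default 10
printf '1\n2\n3\n4\n5\n' | head -n 3
```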

Okay, but what's the baby name most often given? I gotta know!

Since the file is large, let's do a test on a smaller batch.

When you're combining two commands you use a pipe (|). Let's print the third column to the screen. The -d, tells the cut command that the file is comma-separated. The -f3 tells it to grab the third column.

$ head baby_names_1880_2013.txt | cut -d, -f3
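One wrinkle worth noticing: grep glued the filename on with a colon, so in the smushed file the "first column" is really filename:name. cut only splits on commas here, which is why -f3 still lands on the count. On a single line (the row shown is a stand-in in the real files' format):

```shell
# field 1 = "yob1880.txt:Mary", field 2 = "F", field 3 = "7065"
echo 'yob1880.txt:Mary,F,7065' | cut -d, -f3
# → 7065
```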

And if we wanted to grab the third column from the whole file?

$ cut -d, -f3 baby_names_1880_2013.txt

And sort it too? The -n says 'sort by numerical value' instead of treating the numbers as text. Try it without the -n if you want to see the difference.

$ cut -d, -f3 baby_names_1880_2013.txt | sort -n
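If the difference between plain sort and sort -n isn't obvious, a three-number example shows it (any small numbers will do):

```shell
printf '9\n10\n2\n' | sort     # text order: 10, 2, 9
printf '9\n10\n2\n' | sort -n  # numeric order: 2, 9, 10
```

As text, "10" sorts before "2" because the comparison goes character by character.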

Okay, there's the biggest number, but what name is it associated with?

$ egrep 99674 baby_names_1880_2013.txt

Let's look at a few of the big names at once. Quoting the pattern keeps the shell from treating each | as a pipe.

$ egrep "90512|90629|90994|91652|92711|94758|96210|99674" baby_names_1880_2013.txt
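Chasing each big number with egrep works, but sort can keep the name attached to its count in one shot: -t, sets comma as the field separator and -k3,3 -n sorts by the third field numerically, so the biggest count rises to the bottom. A sketch on a few made-up rows in the smushed file's format:

```shell
# sort rows by the count column, then peek at the biggest one
printf 'a.txt:Anna,F,2604\nb.txt:John,M,9655\na.txt:Mary,F,7065\n' \
  | sort -t, -k3,3 -n | tail -n 1
# → b.txt:John,M,9655
```

On the real smushed file, `sort -t, -k3,3 -n baby_names_1880_2013.txt | tail` answers the question directly.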