Bash allows you to redirect the output of a command to a file using > and >>. The former overwrites pre-existing content, the latter appends to pre-existing content.
curl -s http://example.com > some_file.html
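To see the difference between the two operators without touching the network, here is a minimal sketch that works in a throwaway directory. The file name notes.txt and the demo_dir variable are made up for illustration.

```shell
# Work in a throwaway directory so no real files are touched
demo_dir=$(mktemp -d)

echo "first"  >  "$demo_dir/notes.txt"   # creates the file with one line
echo "second" >  "$demo_dir/notes.txt"   # overwrites: only "second" remains
echo "third"  >> "$demo_dir/notes.txt"   # appends: "second", then "third"

cat "$demo_dir/notes.txt"
```

The first line never shows up in the final file because the second > replaced it.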
Bash pipes allow you to feed the output from one command as the input to another command:
# find the rows with CA in them and
# pipe to wc to get a count of the matching rows
grep CA state_data.txt | wc -l
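Pipes can be chained as many times as you like. The sketch below builds a small made-up stand-in for state_data.txt (the rows and the comma-separated layout are assumptions, not the real file) and runs both the count above and a longer pipeline.

```shell
demo_dir=$(mktemp -d)

# Hypothetical sample data standing in for state_data.txt
printf 'CA,Sacramento\nNY,Albany\nCA,Fresno\nTX,Austin\n' > "$demo_dir/state_data.txt"

# Same idea as above: count the rows that contain CA
# (tr strips the padding some versions of wc add)
ca_count=$(grep CA "$demo_dir/state_data.txt" | wc -l | tr -d ' ')
echo "Rows with CA: $ca_count"

# A longer chain: take the first column, sort it,
# count repeats, then sort by count, largest first
cut -d, -f1 "$demo_dir/state_data.txt" | sort | uniq -c | sort -rn

# Grab just the most frequent value from that chain
top_state=$(cut -d, -f1 "$demo_dir/state_data.txt" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')
echo "Most frequent: $top_state"
```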
Note: some of the commands below may need to be installed with Homebrew on Macs.
Check if you have Homebrew already:
brew -v
If not, install it with:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Downloading files with curl.
Mac users should run brew install curl if they don't have the command.
# Stash a url in a shell variable
URL=http://example.com
# Download the URL data
# This will print the file contents to the command line
curl $URL
# You can redirect the output to a file
curl $URL > example.html
# Silence metadata info while downloading
# This can be useful in a shell scripting context
curl --silent $URL > example.html # or just use -s
# Or download to an identically named file
# Note the actual file is called index.html
curl -O $URL/index.html
You can wrap up your commands into a shell script.
Drop the following commands into a shell script called do_stuff.sh
#!/bin/bash
date >> /tmp/doing-stuff.txt
Run the script by typing:
sh do_stuff.sh
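Running a script with sh ignores its #!/bin/bash line and uses whatever sh is on your machine. An alternative is to make the script executable and run it directly, so the shebang picks the interpreter. This is a sketch under a couple of assumptions: the script is written to a throwaway directory, and it logs to a file next to itself rather than to /tmp/doing-stuff.txt, so the demo leaves nothing behind in /tmp.

```shell
demo_dir=$(mktemp -d)

# Write a copy of the script; the quoted 'EOF' keeps the contents literal
cat > "$demo_dir/do_stuff.sh" <<'EOF'
#!/bin/bash
# Append the current date to a log file beside this script (demo-only path)
date >> "$(dirname "$0")/doing-stuff.txt"
EOF

# Mark the script as executable, then run it twice by path
chmod +x "$demo_dir/do_stuff.sh"
"$demo_dir/do_stuff.sh"
"$demo_dir/do_stuff.sh"

# Each run appended one line
cat "$demo_dir/doing-stuff.txt"
```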
You can automate shell scripts or even individual Unix commands with cron, a tool built into Unix machines for scheduling tasks. Each scheduled task is called a cronjob.
Let's automate our do_stuff.sh shell script from above.
To create a cronjob, you must edit your user's crontab file using crontab -e.
This will drop you into the crontab file using the default shell editor. Once you make changes and save, the cronjob will be active.
It's important to note that cron runs jobs in a very limited environment. It can only "see" programs or files in a limited number of directories, so it's a good idea to provide full paths to built-in shell utilities, custom scripts, and input/output files and directories.
You may also need to configure the PATH environment variable in a cron context. See cronjobs for more background and details.
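One way to get a feel for this before scheduling anything is to run a script under a stripped-down environment. The sketch below uses env -i to clear the environment and hands back only a short PATH. The PATH value is an assumption about what cron provides (it varies by system), and check_env.sh and DATA_DIR are hypothetical names used only for this demo.

```shell
demo_dir=$(mktemp -d)

# A tiny script that reports what it can see
cat > "$demo_dir/check_env.sh" <<'EOF'
#!/bin/sh
echo "PATH=$PATH"
echo "DATA_DIR=${DATA_DIR:-unset}"
EOF

# A variable you might have set in your interactive shell (hypothetical)
export DATA_DIR="$demo_dir"

# env -i wipes the environment; only the minimal PATH is passed through
env -i PATH=/usr/bin:/bin /bin/sh "$demo_dir/check_env.sh" > "$demo_dir/env.log" 2>&1

cat "$demo_dir/env.log"
```

DATA_DIR is reported as unset inside the script even though it was exported beforehand, which is the same surprise a cronjob gets when it relies on settings from your login shell.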
To set up your cronjob, first determine the location of your script:
pwd
Then, in crontab, paste the full path to your script and set it to run every minute. You'll need to modify the path to do_stuff.sh to match the location on your machine!
* * * * * /bin/sh /Users/tumgoren/code/unix-workbench/do_stuff.sh
After the script runs the first time, you should see the content:
cat /tmp/doing-stuff.txt
To watch as new content is added, you can continuously "tail" the file:
tail -f /tmp/doing-stuff.txt
# Hit "CTRL c" to exit
We can get feedback from our script by piping any logging and errors to a separate file.
Let's update our cronjob as below:
* * * * * /bin/sh /Users/tumgoren/code/unix-workbench/do_stuff.sh > /tmp/doing-stuff.log 2>&1
Now let's break our script by changing date to dat in do_stuff.sh. This should produce an error that gets sent to our /tmp/doing-stuff.log.
After saving the file, wait a minute and then check the content of /tmp/doing-stuff.log:
cat /tmp/doing-stuff.log
You can "tail" this file continuously once it's created, which can be quite handy when debugging a script:
tail -f /tmp/doing-stuff.log
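The > file 2>&1 redirection used in the crontab entry can be tried outside cron as well. In this sketch, noisy.sh is a made-up stand-in that writes one line to standard output and one to standard error, and all file names are assumptions for the demo.

```shell
demo_dir=$(mktemp -d)

# A stand-in script: one normal line, one error line
cat > "$demo_dir/noisy.sh" <<'EOF'
#!/bin/sh
echo "normal output"
echo "something went wrong" >&2
EOF

# Send stdout and stderr to two separate files
/bin/sh "$demo_dir/noisy.sh" > "$demo_dir/out.log" 2> "$demo_dir/err.log"

# Send stdout to a file, then point stderr (2) at the same place as stdout (1)
/bin/sh "$demo_dir/noisy.sh" > "$demo_dir/combined.log" 2>&1

cat "$demo_dir/combined.log"
```

Order matters: 2>&1 has to come after the > redirection, otherwise errors go wherever standard output pointed before it was redirected.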
Lastly, you can disable cronjobs by "commenting them out" with a hash (#), as below. Or you can of course delete them.
Run crontab -e and update as below:
#* * * * * /bin/sh /Users/tumgoren/code/unix-workbench/do_stuff.sh > /tmp/doing-stuff.log 2>&1
Let's work through a more real-world example of creating a shell script and automating it.
We'll use the failed_banks_ca.sh script, which does the following:
- Downloads the FDIC Failed Banks list
- Creates a new CSV containing only CA banks
- Prints out the number of failed banks
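The last two steps can be sketched on a small made-up file. This is not the contents of failed_banks_ca.sh: the three-column layout, the bank names, and the use of awk are all assumptions chosen to keep the example short, and the download step is skipped so it runs offline.

```shell
demo_dir=$(mktemp -d)

# Made-up stand-in for banklist.csv: bank name, city, state
cat > "$demo_dir/banklist.csv" <<'EOF'
Bank Name,City,State
Example Bank One,Fresno,CA
Example Bank Two,Albany,NY
Example Bank Three,Oakland,CA
EOF

# Keep the header row, then append the rows whose third column is CA
# (a plain comma split; it would not handle quoted commas in real data)
head -n 1 "$demo_dir/banklist.csv" > "$demo_dir/failed_banks_ca.csv"
awk -F, '$3 == "CA"' "$demo_dir/banklist.csv" >> "$demo_dir/failed_banks_ca.csv"

# Count the data rows, i.e. everything after the header
ca_count=$(tail -n +2 "$demo_dir/failed_banks_ca.csv" | wc -l | tr -d ' ')
echo "Failed banks in CA: $ca_count"
```

For real CSVs with quoted fields, a CSV-aware tool is a safer choice than splitting on commas.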
Download the script and try running it:
sh failed_banks_ca.sh
If the script ran correctly, you should see two new files: banklist.csv and failed_banks_ca.csv.
Delete these files:
rm banklist.csv failed_banks_ca.csv
Now, we'll try automating the script by adding the following to crontab:
# Changing the working directory simplifies things for this example
# NOTE:
# - double-arrow redirection appends to log file
# - We use FDIC_DIR to make the command more readable
FDIC_DIR=/Users/tumgoren/code/unix-workbench/fdic
* * * * * cd $FDIC_DIR && /bin/sh failed_banks_ca.sh >> /tmp/failed_banks.log 2>&1
You should see banklist.csv and failed_banks_ca.csv in the directory containing the script. And /tmp/failed_banks.log should display a message showing the count of failed banks in CA.
Here's a smattering of tools and examples that might be useful.
See Power Tools for Data Wrangling for more.
The tree command lists all directories and files and is quite handy when futzing about on the command line.
Mac users should run brew install tree.
cd /some/directory
tree
wget is another tool that helps download files. In some ways it resembles curl, but it also has some key differentiating features such as the ability to mirror an entire website.
wget --mirror https://data-driven.news/bna/2021
cd data-driven.news/
# fire up a local Python 3 web server
# (use "python" instead if that is how Python 3 is named on your machine)
python3 -m http.server
Go to http://localhost:8000 and view the site.
socrata2sql allows you to easily import data sets from Socrata-backed government data sites into databases such as SQLite.
NOTE: socrata2sql only works on Python 3.x
To install:
pip install socrata2sql
List the data sets available on the San Francisco open data portal.
socrata2sql ls data.sfgov.org
# Or redirect the list to a file
socrata2sql ls data.sfgov.org > sfdatasets.txt
Create a SQLite database of SF eviction notices.
socrata2sql insert data.sfgov.org 5cei-gny5
Depending on the site and the size of the data, you may need to register for an API key in order to pull the data down.
You can view and query SQLite databases using tools such as DB Browser.
Need state metadata or related GIS files?
Try the Python us library, which ships with a basic command-line utility:
pip install us
states ca
Stanford offers free VMs in their Farmshare cloud for you to experiment with.
You can use ssh to connect via "secure shell" to a machine.
# ssh [email protected]
ssh [email protected]
These Ubuntu Linux VMs offer all the standard bash commands mentioned above.
Beware that you're not guaranteed to end up on the same machine, so installing software can get tricky. You can target a specific VM with a bit of a two-step:
# Get the specific machine name
tumgoren@rice03:~$ hostname
rice03
# Quit the machine
exit
# connect to specific machine
ssh [email protected]
tumgoren@rice03:~$ hostname
rice03
exit
These machines are free, but long term, you'll likely want to learn how to set up your own virtual machine in the cloud.
Both Amazon Web Services and Google Cloud Platform offer free beginner tiers for spinning up virtual machines:
csvkit is a collection of command-line utilities that allows you to more easily wrangle data.
Here's an example script that merges yearly budget files into a single CSV and adds population data.
NOTE: The tool is written in Python and the install can be flaky at times. It's worth the headaches, so reach out if you have trouble installing.
- The Unix Shell - great starter tutorial