These are essential if you spend a lot of time in the shell. See the preface readme for more details on how I set up my terminal (some of these changes are necessary for the following to work):
- `control-a`: move cursor to beginning of terminal line.
- `control-e`: move cursor to end of terminal line.
- `control-k`: delete (or kill) all text from cursor to end of line.
- `option-delete`: delete an entire word.
- `option-b`: move cursor backwards an entire word.
- `option-f`: move cursor forwards an entire word.
- `control-c`: cancel input text or, when a command is running, stop it.
- up arrow: access last entered command.
- `control-r`: start searching shell history. Start typing to search; `enter` will enter the current command and `control-c` will cancel.
I recommend you learn all of these -- they will make working in the shell much easier and more enjoyable.
I use this quote from Gary Bernhardt's excellent talk The UNIX Chainsaw. It's entertaining and includes some very nice examples of how powerful Unix is (in the context of software development, but still generally applicable).
The "garden hose" quote is from an October 1964 Bell Labs memo. There's a cool image of the memo here: http://doc.cat-v.org/unix/pipes/
Brian Kernighan gives a great interview about the pipeline concept on the Computerphile YouTube channel.
There has been some debate about the Unix approach to bioinformatics versus alternatives (e.g. giant monolithic programs). One counterargument (incorrect in my opinion) is spelled out in The iniquities of the Unix shell (note this writer has left bioinformatics). I've also written about this topic, in Bioinformatics and Interface Design.
The prevalence of Unix in not only bioinformatics, but also more generally data science and statistics, indicates its success. I describe why this approach is a good one both in this chapter and in chapter 7 when I discuss Unix data tools.
I use Z shell in my daily bioinformatics work -- if you're an advanced reader, you may want to try it. Configuring your shell to your exact needs is one of the great joys of being a nerd; with Z shell, this is made much easier through a project called Oh My Zsh. I would recommend using Oh My Zsh if you're just getting started. You can take a closer look at my configurations in my dotfiles repository. Below are some other resources to get started with Z shell:
- Why Zsh is Cooler than Your Shell is a very nice introduction (and also includes a reference to the Knuth Programmer Pearls story I teach in chapter 7).
- zsh: The last shell you’ll ever need.
You can learn more about the latency statistics from this chapter from Peter Norvig's terrific classic essay, Teach Yourself Programming in Ten Years and this awesome interactive explanation, Latency Numbers Every Programmer Should Know.
Quick note: I have not had time to proof-read this section extensively
When we start running computationally intensive tasks, we want to keep track of how they're running. One way of doing this is by following the output they create, perhaps by using `tail -f` on a log file or even `ls -lrt` (list all files in reverse time order) to see if the program is writing to disk (you would see the output files' sizes increasing). Programs that aren't writing to output files or actively logging what they're doing require a different way of monitoring their activity, and the two most common programs for this are `top` and `ps`.
`ps` stands for process status, as it gives you the status of all running processes. Without any arguments, it's not too useful; systems administrators and bioinformaticians usually run it as `ps aux`. Note that `aux` isn't a special keyword, but rather merged options which display processes for all users (from `-a`), add a column indicating the user (from `-u`), and output processes that are running even if they weren't started from a terminal (`-x`).
A feud between different Unix variants from UC Berkeley (BSD) and AT&T (System V) and their different `ps` variants in the 1980s means that the cryptic, ingrained `aux` option is still widely supported (i.e. even in Apple's OS X).
Since `ps aux` gives us a lot of processes to sort through, it's common to pipe the output to `grep` to give us a more powerful little combination (there's also a tool called `psgrep` just for this task). Below, let's take a look at the top of the `ps` output (so we can see its column names) and then look for "samtools":
$ ps aux | head -n3
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
todd 6384 12.7 4.0 4047300 166032 pts/7 S Thu10AM 37:42.69 samtools
todd 90710 8.9 4.6 1403080 193584 ?? S Thu08PM 57:15.12 fastq_stats
$ ps aux | grep "samtools"
todd 6384 12.7 4.0 4047300 166032 pts/7 S Thu10AM 42:16.59 samtools
We use `ps` and `grep` to search for our particular processes, which is useful to see how much of our CPU (the third column above) and memory (the fourth column) a process is using. The process ID, or PID, in the second column is a unique identifier given to each of our processes. This allows us to interact with running processes by terminating them or adjusting their priority (topics we cover in the next section). Since `ps aux` is such a common idiom for monitoring processes, a table of the columns and their meanings is included below.
Columns in `ps aux`:
- `USER`: user running the process
- `PID`: process ID
- `%CPU`: percentage of CPU used
- `%MEM`: percentage of physical memory used
- `VSZ`: virtual memory size (in kilobytes)
- `RSS`: resident set size (in kilobytes)
- `TT`: controlling terminal
- `STAT`: process state code
- `START`: time command was started (sometimes this is `STARTED`)
- `TIME`: accumulated CPU time
- `COMMAND`: command that started process
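
One small annoyance with `ps aux | grep` is that the header line with the column names is lost. The sketch below shows one possible way around that, assuming `awk` is available; the function name `ps_filter` is made up for illustration, and the sample rows are copied from the output above so that the example runs without a live samtools process.

```shell
# Print the header line plus every row that matches a pattern.
ps_filter() {
    awk -v pat="$1" 'NR == 1 || $0 ~ pat'
}

# Rows copied from the example output above stand in for a live system.
sample='USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
todd 6384 12.7 4.0 4047300 166032 pts/7 S Thu10AM 37:42.69 samtools
todd 90710 8.9 4.6 1403080 193584 ?? S Thu08PM 57:15.12 fastq_stats'

matches=$(printf '%s\n' "$sample" | ps_filter '[s]amtools')
printf '%s\n' "$matches"
```

On a real system this would be run as `ps aux | ps_filter '[s]amtools'`. Writing the pattern as `[s]amtools` is a common trick: it still matches "samtools", but it does not match the filtering command's own entry in the process list.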
When interacting with processes, it's common to read output and see columns that look cryptic. With `ps aux`, what the columns `VSZ`, `RSS`, `TT`, and `STAT` mean isn't exactly clear. First, `TT` is the controlling terminal, which you or another user is using. Occasionally the process was started by another process or your operating system, so this may appear as `??`. `STAT` is a "state code", a jargon term for a letter that tells you if your process is running (`R`), sleeping (`S`), stopped (`T`), or in another state (see `man ps` for a full list). `VSZ` and `RSS` are more interesting to us, so we'll explore them in more detail.
Occasionally your system runs low on physical memory (RAM), and your operating system does its best to manage. Unfortunately the only way your operating system can give a process more memory than is physically available is by taking a chunk of less-used memory (these chunks are called pages, a word you may see in the `ps` and `top` manuals), writing it to disk (this is slow!), and then using that now-free page for your process. Since you're swapping an in-memory physical page for one on a hard drive, the part of your disk that manages this type of activity is called swap space. This may seem like a lot of technical detail, but it has a very real result in day-to-day bioinformatics work: if you run out of physical memory, your processes will be forced to start swapping memory to the hard disk, and hard disks are really slow. High-memory tasks like assembly (and in some cases alignment) on machines with insufficient physical memory will bring even the fastest machines to a standstill.
With this information, `VSZ` and `RSS` will now make more sense. `VSZ` is the amount of virtual memory and `RSS` is the amount of physical memory. Virtual memory includes both swap and physical memory, so `VSZ` is larger than `RSS`. `ps aux` gives you a quick glance at these values, but the way an operating system allocates memory can be a very baffling process to decode. When we want to interrogate our processes to see which are using the most memory, CPU, or swap space, `ps` becomes less useful and `top` becomes our tool of choice. But remember, for searching for processes, `ps` and `grep` can be combined in great ways.
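
As a rough illustration of combining `ps` with other tools, the sketch below ranks `ps aux`-style rows by `RSS` and converts the kilobyte values to megabytes. The function name `top_rss` is hypothetical, and the sample rows are the ones from the earlier example rather than live data.

```shell
# Rank `ps aux`-style rows by RSS (column 6, in kilobytes), largest first,
# and report the PID, the RSS in megabytes, and the command name.
top_rss() {
    tail -n +2 |                # drop the header line
        sort -k6,6nr |          # numeric sort on the RSS column, descending
        head -n "${1:-5}" |     # keep the N largest rows (default 5)
        awk '{ printf "%s %.1f MB %s\n", $2, $6 / 1024, $11 }'
}

sample='USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
todd 6384 12.7 4.0 4047300 166032 pts/7 S Thu10AM 37:42.69 samtools
todd 90710 8.9 4.6 1403080 193584 ?? S Thu08PM 57:15.12 fastq_stats'

ranked=$(printf '%s\n' "$sample" | top_rss 2)
printf '%s\n' "$ranked"
```

With a live process list this would be `ps aux | top_rss 10`. The column positions are an assumption based on the `ps aux` layout shown above; check the header on your own system before relying on them.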
Unlike `ps`, `top` keeps refreshing your list of processes. Enter `top` at a command line, and it will update (Linux versions usually update every three seconds and OS X every one second). To exit `top`, just press 'q'. But constantly updating isn't `top`'s greatest strength; being able to interactively look at which processes are using the most resources is.
Unfortunately, `top` differs considerably between Linux and Apple's OS X (and other Unix variants that have UC Berkeley's BSD as their ancestor). Since most bioinformatics servers we interact with over SSH are Linux-based, I will cover the Linux version. If you have a Mac workstation, consult the table <> and the `top` manual.
Let's start `top` with the handy `-M` option, which displays memory in units larger than kilobytes where appropriate. You should see something like the below:
$ top -M
top - 19:49:58 up 22 min, 3 users, load average: 0.88, 0.45, 0.23
Tasks: 62 total, 3 running, 59 sleeping, 0 stopped, 0 zombie
Cpu(s): 70.9%us, 2.0%sy, 0.0%ni, 0.0%id, 27.1%wa, 0.0%hi, 0.0%si
Mem: 594.219M total, 304.293M used, 289.926M free, 4268.000k buffers
Swap: 1983.992M total, 88.387M used, 1895.605M free, 248.055M cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1978 dalilah 20 0 65896 48m 864 R 86.7 8.2 0:10.17 bwa
1979 dalilah 20 0 17940 1240 812 S 11.3 0.2 0:01.95 samtools
1980 dalilah 20 0 66492 48m 724 S 1.7 8.1 0:00.25 samtools
1 root 20 0 19356 1544 1240 S 0.0 0.3 0:00.69 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
There's undoubtedly a lot going on here, but let's take a look at the most important parts. The manual for `top` is quite dense, so I'll cover the most commonly used sections. The very first line covers the uptime (22 minutes), how many users are on the system, and the load averages. Loads in Unix systems are given as three numbers: the load in the past one minute, five minutes, and fifteen minutes. The numbers start at 0 and can exceed 1, but the interpretation depends on how many cores your processor has, and what's holding up processes. When we run bioinformatics programs, we can hit different bottlenecks: memory running out so that we need swap space, reading large files from a slow disk, waiting for BLAST results to be received over the network from another machine, or actual processing done in the CPU. In cases where we have many processes all trying to use our CPU, we will see the load average rise. Processes will have to wait for their turn to use our CPUs, and this can lead to slow behavior. This is why, if we have a single processor machine, we don't want to use GNU Parallel or `xargs` in parallel: it would create multiple processes that would have to wait on each other to share the only CPU.
On a two core system, a load average of 2 means that both CPU cores are constantly working over a given time period (in the load average case, one, five, or fifteen minutes). A load average over 2 means that processes are having to wait for each other. Under 2 means that our CPU has periods where it's not doing processing. Load averages are a very important statistic to look at when running large bioinformatics jobs because they can hint that you're doing too much at once and maxing out your CPU (load average above the number of cores you have). If processes have been running for a while and the load average is low, this could indicate that something else (disk, memory, or network) is the bottleneck.
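
The rule of thumb above (compare the load average with the number of cores) can be sketched in a few lines of shell. The function name `load_status` and its two labels are invented for this example; reading `/proc/loadavg` is Linux-specific, and `getconf _NPROCESSORS_ONLN` is assumed to report the number of online cores on your system.

```shell
# Compare a load average against the number of CPU cores.
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        ratio = load / cores
        if (ratio > 1) print "overloaded"
        else           print "within capacity"
    }'
}

load_status 0.88 1    # the one-minute load from the top output above, one core
load_status 2.50 2    # more runnable processes than a two core machine can serve

# On Linux, the live one-minute load average is the first field of /proc/loadavg.
if [ -r /proc/loadavg ]; then
    read -r one_min rest < /proc/loadavg
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    load_status "$one_min" "$cores"
fi
```

A low ratio on a job that still feels slow is the cue to look at disk, memory, or network instead of the CPU.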
The next line gives you the number of processes broken down by state (running, sleeping, etc.), which we saw per process in `ps aux`'s output. The third line is another important one: it specifies what the CPU spent its time on since the last update of `top`. In our example, we see it spent 70.9% of its time in "us", 2% of its time in "sy", and 27.1% of its time in "wa". There are many fields that we don't need to go into too much detail about (see the manual if you are curious), but it's worth knowing that Unix systems divide time into userland ("us") and system ("sy"), and most of our data crunching bioinformatics work will tax our userland resources. More importantly, Linux `top` gives "wa", which is the percent of time the CPU is waiting on I/O, and a consistently high percent "wa" is a good warning sign that something other than the CPU is the bottleneck.
The next two lines cover memory and swap space usage. This machine has physically more than 594 megabytes of memory, but this is the amount accessible to the operating system. The amount of free memory can be an indicator as to whether we're running out of memory and our machine is about to hit the much slower swap space.
The next lines list each running process, with a summary of its process ID (`PID`), who started it (`USER`), its priority (`PR`), the amount of virtual and physical memory (`VIRT` and `RES`, respectively), the state (`S`), percentage CPU and memory usage (`%CPU` and `%MEM`), the CPU time used (`TIME+`), and finally the actual command (`COMMAND`). We had many of these same columns with `ps aux`, but a big advantage with `top` is that we can interactively sort them and watch them update. Recall that your bioinformatics processes may need lots of memory one minute, then start getting CPU-hungry the next, and finally start reading gigabytes of information off the disk. We can watch our process's resource requirements live with `top`.
Particularly useful is being able to sort this live in `top` (and for completeness, note this is possible with `ps` too). To sort by memory, just press `O` (capital letter "o"), which brings up a list of possible sort fields. One of these fields is `%MEM`, which corresponds to the letter "n" (recall, this is only Linux `top`; consult `man top` if your version is different). Pressing "n", then enter will re-sort your `top` process list by memory usage. Other options I commonly use are "K" for CPU usage, "p" for swap size, "e" for user name, and "l" for CPU time. If your machine is running slowly, using `top` interactively to find greedy processes should be the first place to start.
If you've used `ps aux` and `grep` to find a particular program, or you've spotted a greedy process with `top`, you can kill it or set its priority using the Unix tools `kill` and `renice` respectively.
The command `kill` is used to send signals to running processes. A signal is a way to communicate with a running process, and the default signal `kill` sends tells a program to terminate. This termination signal is called `SIGTERM`, and we can terminate a program called "greedy_cmd" by first getting its process ID with `ps aux` and `grep` (or through `top`) and then using `kill`:
$ ps aux | grep "greedy_cmd"
vinceb 10141 99.0 0.0 9235248 428 s004 U+ 1:48PM 0:00.00 greedy_cmd
$ kill 10141
However, programs can choose how they wish to handle a `SIGTERM`. Some programs could even ignore a termination signal entirely, although this is not common practice with most programs we'll use. Still, sometimes you'll need to send a more forceful signal like `SIGKILL`, which can't be ignored by programs. To specify the signal with `kill`, we use `kill -s SIGKILL 10141`.
If you've used Unix for a while, you've probably run across pressing control-C to stop a program. This sends another common signal called `SIGINT`, in which the "int" is short for interrupt. `kill -s SIGINT` is the same as pressing control-C in the terminal where a program is running. If you've ever used control-Z (for suspend), this is very similar too, as it sends the signal `SIGTSTP`. `SIGTSTP` is a signal that suspends a process. Suspending a process pauses it, and this paused process can then be changed to run in the background or foreground with the `jobs`, `bg`, and `fg` commands we learned earlier.
Most of the time when we're using `kill`, we're not sending these other signals to processes; we're trying to quickly kill an out-of-control command that's using too many resources. For this, `kill -s SIGKILL <pid>` is the standard tool we reach for. Signals also have numeric shortcuts, and 9 corresponds to `SIGKILL`, so `kill -9 <pid>` does the same thing.
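
To see a signal being delivered without risking a real job, here is a minimal sketch that uses a throwaway `sleep` process as the target (the `sleep 300` stand-in is an assumption of this example, not something from the chapter). It uses the signal name without the `SIG` prefix, which most shells accept.

```shell
# Start a long-running stand-in process in the background and record its PID.
sleep 300 &
pid=$!

# Ask it to terminate (SIGTERM is also what a plain `kill "$pid"` sends).
kill -s TERM "$pid"

# Collect the exit status; `|| status=$?` keeps the script going even though
# the waited-for process did not exit cleanly.
status=0
wait "$pid" || status=$?
echo "sleep exited with status $status"
```

Shells conventionally report a process ended by a signal as 128 plus the signal number, so a process stopped by `SIGTERM` (signal 15) shows up with status 143, and one stopped by `SIGKILL` (signal 9) with 137.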
Resources are finite on even our most powerful servers, so it's necessary to mind how many resources are being used by programs. This is especially the case with Unix machines which multitask, running many processes simultaneously. Since multitasking is such a core part of Unix (and Unix comes from an era of comparatively slower machines with fewer resources), there's a way to prioritize certain processes: through a process's nice value.
The nice value of a process ranges from -20 to 20 (19 on some systems), where a lower nice value gives a process more priority. A good way to remember this is that a lower nice value means a program isn't being nice to other processes, and is instead using all the CPU resources for itself. A very high nice value like 19 tells your operating system that this process is pretty low priority, so it should run it whenever resources are available. Note that the nice value only affects how much CPU priority a process gets. Memory or disk-bound processes will not gain much from getting a lower nice value.
The default nice value of a process is 0, but this can be changed by using the command `nice` to run a program under a specified nice value. Good candidates for a higher nice value include tasks like backups, system updates, or archiving old projects. For example, if you want to run gzip on a large FASTQ file in the background, with lower priority, you would use `nice`:
$ nice -n 10 gzip zmaysA_R1.fastq
This runs the command `gzip zmaysA_R1.fastq`, incrementing the default nice value of 0 to 10. If we have an already running process, we could adjust its nice value with the command `renice`, which takes a nice value and a process ID, like: `renice 10 <pid>`. This sets the nice value of the process with ID `<pid>` to 10.
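
A quick way to convince yourself that `nice -n` adds to the current nice value is the small experiment below. It relies on an assumption about your system: that running `nice` with no command prints the current nice value, which is how the GNU coreutils version behaves.

```shell
# Nice value the current shell is running at (usually 0).
base=$(nice)

# Nice value seen by a command started through `nice -n 10`.
nested=$(nice -n 10 nice)

echo "shell: $base, under nice -n 10: $nested"
```

If the shell itself already runs at a raised nice value, the second number is capped at the maximum your system allows rather than being exactly ten higher.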
As more cores are packed into modern CPUs, CPUs are less likely to be the bottleneck than the disk or memory. Nice is handy, but it can't work miracles on heavily-taxed systems. If a program is hanging, or your system feels sluggish, it's very important to use `top` to monitor CPU and memory usage. Somewhat surprisingly for beginners in large data processing, the disk is usually the culprit for processing bottlenecks, so in the next section we'll look at a way to monitor disk input and output.
It's not uncommon for the disk to be the bottleneck in bioinformatics. Unfortunately, monitoring disks is a bit tricky, as interpreting the results can depend quite a bit on your actual hardware. Some readers may wish to skip this section, and revisit it if they experience a sluggish system that appears to have free memory and CPU. Also, as with `top`, we'll focus on the Linux `iostat`, which is the version more likely to be found on the large Linux servers we do bioinformatics on.
Below is an example of a single disk that's likely facing too much usage. Using the `iostat` command without any arguments, we see that there's around 42% user CPU usage (this is a single core system), and around 51% of the time the CPU is waiting for I/O tasks to complete. In this case, we see that the disk, and not the CPU, would likely be the cause of a sluggish process.
$ iostat
avg-cpu: %user %nice %system %iowait %steal %idle
41.52 0.00 7.59 50.89 0.00 0.00
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
xvdap1 59.83 528.73 4310.89 1002588 8174440
Additionally, for each disk, `iostat` outputs the transfers per second (`tps`), block reads and writes per second (`Blk_read/s` and `Blk_wrtn/s`), and total blocks read and written (`Blk_read` and `Blk_wrtn`). For simple monitoring, we primarily need to look at `%iowait` (to see if the CPU is waiting for disk I/O), and then `Blk_read/s` and `Blk_wrtn/s` to see if the disk usage is reading or writing. Additionally, if you wish to continually monitor disk activity, `iostat` can be run with two optional arguments: the interval at which a report is generated, and the number of reports to generate. For example, we could generate three reports at 10 second intervals with `iostat 10 3`; leaving off the count makes `iostat` report continually until we exit with Control-C.
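
When `iostat` runs repeatedly, it can help to pull out just the `%iowait` figure. The sketch below does this with `awk`, looking the column up by name in the `avg-cpu` header; the function name `iowait_percent` is hypothetical and the sample text is the CPU portion of the report above.

```shell
# Extract the %iowait value from iostat-style output read on standard input.
iowait_percent() {
    awk '/^avg-cpu:/ {
        # Find the position of %iowait among the header labels. The values
        # line has no "avg-cpu:" label, so its columns are shifted by one.
        for (i = 2; i <= NF; i++) if ($i == "%iowait") col = i - 1
        if ((getline) > 0 && col) print $col
        exit
    }'
}

sample='avg-cpu: %user %nice %system %iowait %steal %idle
41.52 0.00 7.59 50.89 0.00 0.00'

printf '%s\n' "$sample" | iowait_percent
```

On a machine with `iostat` installed, `iostat | iowait_percent` would report the value from the first report, which on Linux covers the time since boot.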
If it's unclear which processes are the cause of increased I/O, another useful Linux program can help: `iotop`. Like the `top` we used earlier to monitor memory and CPU usage, `iotop` updates at a fixed interval, and indicates which processes have the highest disk I/O usage:
$ iotop
Total DISK READ: 37.67 M/s | Total DISK WRITE: 35.43 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
85 be/4 violet 2.37 M/s 0.00 B/s 0.00 % 94.97 % biocmd in.fa -o out.txt
89 be/4 lauren 35.30 M/s 34.72 M/s 0.00 % 56.23 % gzip ref.fa
1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init
6 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/0]
7 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [cpuset]
8 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [khelper]
9 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kdevtmpfs]
Here, the imaginary program `biocmd` appears to be competing with gzip for disk input and output. Gzip is reading and writing a fair amount, whereas `biocmd` is spending nearly 95% of its time waiting for disk I/O.
Finally, a word about disks. Monitoring can only help so much, and it's possible that under normal bioinformatics loads, some servers' disk hardware won't be able to keep up. Often systems administrators use RAID (Redundant Array of Independent Disks) systems, which can increase disk I/O performance and redundancy against failure. These topics are outside the scope of this book, and in cases where disk I/O bottlenecks are common and continual, it may be worth discussing disk hardware options with the person managing your servers (if it's not you).
When processing bioinformatics data, it's not uncommon to fill up entire disk volumes. As disks fill up, they also become more fragmented, meaning that they write data to the disk in non-consecutive chunks. Disk fragmentation leads to slower disk performance; disks pushing 80% full not only run a risk of being filled up during data processing, but even the performance of tasks not requiring lots of disk space will suffer. Thus, it's useful to monitor disk usage periodically.
The two tools used to look at disk usage are `df` and `du`. The first, `df`, simply gives you a terminal-based display of your disk usage, broken down by volume. Since I don't like having to count the digits in figures like "51228178" (the default unit is blocks of 512 or 1024 bytes, depending on the system), I almost always use the `-h` option to display the units in human readable format. On the machine I'm on now, this looks like:
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 854G 32G 779G 4% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
udev 32G 4.0K 32G 1% /dev
tmpfs 6.3G 372K 6.3G 1% /run
nas-8-0:/export/1/vinceb 11T 2.6G 11T 1% /home/vinceb
Note that there are some file systems in this output that look strange, like `none` and `tmpfs`. Many Unix-like operating systems use some virtual disks. This all goes back to how files can be abstracted under Unix-like operating systems. Real disks, like `/dev/sda1`, are interfaced with device files in `/dev` on Unix (as well as the pseudo-devices like `/dev/null` we saw in the redirection section above).
The command `df` is relatively straightforward: it lists the file system size, the amount used, the amount available, and the percent used, as well as where the file system is mounted. Mounting file systems allows us to access them through a certain point on the Unix filesystem. For example, in the output above a directory on nas-8-0, a Network Attached Storage device, is mounted at `/home/vinceb`. In day-to-day work, it's usually good to check that disks aren't getting too full before running large data processing tasks.
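
Checking for nearly-full disks before a big job is easy to script. The following is a minimal sketch, assuming `awk` and a `df` layout like the one shown above (usage percentage in column five, mount point in column six); `full_filesystems` is a made-up name, and mount points containing spaces would need more careful parsing.

```shell
# List mount points whose usage is at or above a threshold percentage.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)                  # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# The df -h output from above, used here in place of a live system.
sample='Filesystem Size Used Avail Use% Mounted on
/dev/sda1 854G 32G 779G 4% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
udev 32G 4.0K 32G 1% /dev
tmpfs 6.3G 372K 6.3G 1% /run
nas-8-0:/export/1/vinceb 11T 2.6G 11T 1% /home/vinceb'

# A threshold of 4 is used only so that the sample data produces a match.
printf '%s\n' "$sample" | full_filesystems 4
```

In practice the call would look like `df -P | full_filesystems 80`, where `-P` asks `df` for its portable output format so that each file system stays on a single line.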
However, now assume you've found that your disk is getting too full, and you want to figure out which files are using the most disk space. One command that's useful in finding large files is `du`, which recursively lists the sizes of the directories below the directory it's run on. For example, if you suspect that there are some large files in a project directory named `~/Projects/tarsier_genes`, you could use `du` to find which directories contain the largest files:
$ du -h ~/Projects/tarsier_genes
20M /home/vinceb/Projects/tarsier_genes
64K /home/vinceb/Projects/tarsier_genes/notes
20M /home/vinceb/Projects/tarsier_genes/data/
640G /home/vinceb/Projects/tarsier_genes/data/alignments
Clearly, there are some big files in our project's `data/alignments` directory that we may want to delete or compress with a program like `gzip`.
The program `du` can also be combined with `sort` and `head` to find the largest directories on a Unix system. This is a good example of how Unix pipes allow many small programs to be connected to create other useful programs. If we run `du /`, it will output the size of the contents of each directory below `/` (recursively). With `/`, this will be a lot of directories, and since we only care about the largest ones, we use `sort` with the `-n` option to sort the lines numerically and the `-r` option to do so in reverse order (so the largest directories are at the top). Then, we pipe all output to `head -n 5` to give us only the top five directories containing the most content. Note that we can't use human readable formats (`-h`) anymore, since numeric sorting doesn't understand suffixes like "K" and "M" for kilobytes and megabytes. On some BSD-variant systems, it may be necessary to explicitly say that you want sizes in one unit (with `-m`). The basic command would look like:
$ du / | sort -r -n | head -n 5
2036816991542 /
411030308497 /share/data/genomes
19390538953 /Users
1042908919 /Users/lauren
1041811032 /Users/lauren/s_cereale_genome
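
Running `du /` on a real server can take a long time, so here is a small self-contained version of the same pipeline that works on a scratch directory. The directory and file names are invented for the example, and `-k` is used so that every size is in kilobytes regardless of the system's default unit.

```shell
# Build a scratch directory containing one small and one large file.
workdir=$(mktemp -d)
mkdir -p "$workdir/notes" "$workdir/alignments"
dd if=/dev/zero of="$workdir/notes/readme.txt" bs=1024 count=4 2>/dev/null
dd if=/dev/zero of="$workdir/alignments/reads.bam" bs=1024 count=2048 2>/dev/null

# Sizes in kilobytes, sorted numerically with the largest directories first.
largest=$(du -k "$workdir" | sort -r -n | head -n 3)
printf '%s\n' "$largest"

# Clean up the scratch directory.
rm -rf "$workdir"
```

The exact sizes depend on the file system, but the scratch directory itself and its `alignments` subdirectory should appear above `notes`.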
Note that `du` is hierarchical, so there will be some redundancy: the directory containing many large files, the directory containing this directory, and so on are all included. Finding large files is a common problem,
and specialized programs are also u