This repository has been archived by the owner on Jan 3, 2018. It is now read-only.

What tool(s) do you use to manage data analysis pipelines? #375

Open
gvwilson opened this issue Mar 11, 2014 · 23 comments

Comments

@gvwilson
Contributor

We all have our favorite tools for getting, crunching, and visualizing data; what I'd like to know is, how do you stitch them together? Do you write a shell script? Do you use a Makefile? Do you drive everything from Python, Perl, or R (and if so, how do you handle tools written in other languages)? Do you use web-based workflow tools, and if so, which ones?

@selik
Contributor

selik commented Mar 11, 2014

In general, I prototype everything in an IPython Notebook, then gradually refactor into Python scripts that can chain together via command line pipes. I like to end up with something like this:

$ python collect.py | python wrangle.py | python analyze.py | python report.py

If useful, I'll break out longer parsing, wrangling, and analysis code into their own libraries.
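For anyone unfamiliar with the pattern, each stage in such a pipe is just a script that reads from stdin and writes to stdout; a minimal sketch of one stage is below (the wrangling logic here is a made-up placeholder):

# wrangle.py -- hypothetical middle stage of the pipe: read records from
# stdin, tidy them up, and write them to stdout for the next stage.
import sys

for line in sys.stdin:
    cleaned = line.strip().lower()      # stand-in for real wrangling logic
    if cleaned:                         # drop blank records
        sys.stdout.write(cleaned + "\n")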

@LJWilliams

I do most of my data visualization in R. I like that it makes pretty graphs without too much work. Sometimes I use Python, but the final graphs from R are generally more attractive. If I need to stitch things together, I tend to use a shell script.

@sveme

sveme commented Mar 11, 2014

For bioinformatics I started out using shell scripts; however, I'm trying to move most of my workflows to Julia, as it provides a neat way of shelling out to external commands, something that is unfortunately a big part of bioinformatics. Something like this:

using Report, JSON, Gadfly, DataFrames

# Build the Picard command with Julia's backtick syntax; $picard, $infile,
# $outfile and $refseq are ordinary Julia variables interpolated into the command.
picardmultiple = `java -jar -Xmx4g $picard/CollectMultipleMetrics.jar INPUT=$infile OUTPUT=$outfile ASSUME_SORTED=false REFERENCE_SEQUENCE=$refseq`

# Run the command and wait for it to finish.
run(picardmultiple)

Nasty, I know, but that's bioinformatics, I guess. You construct a shell command using the backtick operator and then run it or, and that's the major advantage, detach it or run it in some subprocess.

@gvwilson
Contributor Author

@selik What do you then use to manage those pipelines? Do you store each one in a single-line shell script with a descriptive name for later re-use? Or do you trust yourself to reconstruct the necessary pipelines on the fly?

@gvwilson
Contributor Author

@LJWilliams @sveme So if one of your input data files changes, or if you tweak a parameter, you regenerate affected results manually?

@jkitzes

jkitzes commented Mar 11, 2014

I do basically what I teach here, wrapped in some exploratory analysis. The process goes sort of like this (although the real world is, of course, messy and iterative, not linear like the list below):

  1. Look at everything in an IPython notebook, using pylab magic to quickly look at the data and make/examine graphs.
  2. Once I nail down the analysis steps, refactor the main "scientific" functions/classes into a Python module. Add scripts: do_analysis.py (loads the module, loads the data file, crunches the numbers, and saves the output results in some format) and make_tables.py and/or make_figs.py (load the output results from do_analysis and make them attractive).
  3. Use a controller runall (shell or Python) to run all scripts, in order, and save into a results directory. In some sense this is the pipeline part, although it's often pretty lightweight.
  4. Write manuscript in LaTeX - every time it compiles, it loads updated figures. (Tables are a problem that I've never effectively solved.)

If my inputs change or parameters change, I just change them and run the runall again. This requires, of course, that the entire pipeline not take too long - if it does, I break up the steps and rerun those that change (manually - a Makefile works in principle here, but I don't generally bother). As a failsafe, I always delete my entire results directory and rerun everything before writing/submitting results somewhere, just to make sure I haven't mucked something up.

For alternate simultaneous parameter sets, I'll create multiple results subdirectories and basically run the entire set of steps above within each subdirectory, loading the appropriate parameter file each time. This process is controlled by runall.
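To make the controller idea concrete, a minimal Python runall along these lines might look like the sketch below (the --output-dir flag is a made-up convention, not something from the comment above):

# runall.py -- hypothetical controller that runs each stage in order and
# stops immediately if any stage fails.
import subprocess
from pathlib import Path

RESULTS = Path("results")
RESULTS.mkdir(exist_ok=True)

# Each script is assumed to accept a hypothetical --output-dir option.
for script in ["do_analysis.py", "make_tables.py", "make_figs.py"]:
    subprocess.run(["python", script, "--output-dir", str(RESULTS)], check=True)

Deleting the results directory and re-running this one script is then all it takes to regenerate everything from scratch.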

@LJWilliams

At first I regenerate manually. Once I'm happy with the pipeline, I try to turn the parts of the script that change into functions and parameters so that I can run the code from the command line. This last step doesn't always happen (even though it should).

@selik
Contributor

selik commented Mar 11, 2014

@gvwilson I make extensive use of argparse to document usage. Like @jkitzes, I store intermediate steps as files and could make use of Makefiles in principle. I rely on memory to track what needs to be re-run.
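As an illustration, a script whose usage is documented with argparse might start like this (the option names and defaults are made up, not taken from @selik's scripts):

# analyze.py -- hypothetical pipeline stage; `python analyze.py --help`
# prints the documented usage, which is what makes the script self-describing.
import argparse

parser = argparse.ArgumentParser(
    description="Run the main analysis on the wrangled data.")
parser.add_argument("--input", default="wrangled.csv",
                    help="intermediate file produced by the previous step")
parser.add_argument("--alpha", type=float, default=0.05,
                    help="hypothetical significance threshold for the analysis")
args = parser.parse_args()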

@sveme

sveme commented Mar 11, 2014

@gvwilson No, bioinformatics data often come in files that I put into one folder with a certain naming scheme. I then run the whole analysis, or parts of it, on the whole folder or on files with certain tags. I have written some functions to parse the (file) output from the commands (such as picard and samtools) into Julia and then create plots or tables. I also started to write a small Julia package, Report.jl, that generates Markdown documents with tables and figures within my workflow scripts. In the end I run pandoc to create a nicer-looking PDF or ODT file.

All in all, quite similar to @jkitzes's workflow.

@DamienIrving
Contributor

@gvwilson My data analysis involves a mix of python scripts (that parse the command line using argparse) and numerous command line utilities that manipulate netCDF files, most of which are specific to the weather/climate sciences. To stitch them all together I use Make. In fact, I've found your seed teaching material on Make to be very useful.

@karinlag

@gvwilson I usually write Python functions for each step in my analysis and end up with a wrapper script that runs the whole shebang. Depending on the complexity of things, I can then add sanity checking for each step, so that it doesn't just run ahead with bad results. While developing, I usually work with a small test set to help with debugging. If there are important parameters that could or should change, I usually end up adding a config file too.
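A bare-bones sketch of that kind of wrapper, with a sanity check between steps (the module, functions, and file name are hypothetical):

# pipeline.py -- hypothetical wrapper that runs each analysis step in turn
# and refuses to continue if an intermediate result looks wrong.
from mymodule import load_data, clean, analyse   # hypothetical analysis functions

def check(condition, message):
    if not condition:
        raise RuntimeError("Sanity check failed: " + message)

raw = load_data("input.csv")
check(len(raw) > 0, "no input records were read")

tidy = clean(raw)
check(len(tidy) <= len(raw), "cleaning should never add records")

results = analyse(tidy)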

@TomRoche

@DamienIrving "numerous command line utilities that manipulate netCDF files"

Try NCL: basically a "little language" for netCDF (plus some dubious graphical utilities). Friendly syntax, plus (importantly, IMHO) a REPL (unfortunately not as good as Python's or R's).

For too long I wrangled netCDF using NCO and various IOAPI CLUs, then switched to David Pierce's R packages (notably ncdf4, which I still use whenever getting in and out of R is more pain than gain). But for straight-up netCDF twiddling (put this in, take this out, simple math), NCL rules. Plus it has a Python API.

@dpshelio

In astrophysics we have a lot of tools that communicate with each other through the SAMP protocol. Through this protocol one can transfer tables directly from a website (like the VizieR catalogue) to a table visualiser (e.g. TOPCAT) with just one click. After some merging/filtering there with other tables, I can transfer the result to Python (with astropy). The same workflow can be done with images by connecting different imaging software (e.g. Aladin).

Also, there's a plugin for Taverna (a workflow management system) that connects through SAMP to all these tools.

The flaw I see in this is how hard it makes reproducing the whole process. I find it awesome for testing things quickly, because everything is connected seamlessly. However, if I had to save it all to publish my results or to repeat the analysis later, I would probably write it all in a Python script (I identify with this blog post).
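For what it's worth, the scripted version of that hand-off can also go through SAMP: a rough sketch using astropy's SAMP client is below (the module lived at astropy.vo.samp in older astropy releases, and the table path is hypothetical).

# Broadcast a local VOTable to every connected SAMP client (e.g. TOPCAT, Aladin).
from astropy.samp import SAMPIntegratedClient

client = SAMPIntegratedClient()
client.connect()                                # requires a running SAMP hub

message = {
    "samp.mtype": "table.load.votable",         # standard SAMP message type
    "samp.params": {"url": "file:///path/to/table.vot", "name": "my table"},
}
client.notify_all(message)
client.disconnect()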

@mkcor
Contributor

mkcor commented Mar 12, 2014

At work we use cron.
My intermediate outputs are CSV files so they can be inputs to programs in either language (R or Python, in my case).

@synapticarbors
Contributor

A website that I've enjoyed over the years is http://usesthis.com/. It's composed of short interviews with four basic questions posed to people in various fields:

  • Who are you, and what do you do?
  • What hardware do you use?
  • And what software?
  • What would be your dream setup?

I've found it to be a great resource for finding out about software and services, and there are already a number of spin-offs: http://usesthis.com/community/

Maybe it would be interesting to make a variation for scientists, host it off of the SWC website, formalize something like what is being posted here, and frame it as a way for scientists in a range of disciplines to learn about the tools that people are using that promote productivity and reproducibility.

@drio
Contributor

drio commented Mar 12, 2014

@synapticarbors
I have built my own version. The engine is Python-based instead of Ruby, and the look is also substantially different from the usesthis version.

I am trying to make it oriented towards scientists and data scientists (and also developers). I am waiting to collect interviews from people so that I have a buffer and can release an interview every week. With the amazing pool of interesting people at Software Carpentry I should be able to get enough interviews to start releasing fairly quickly.

Let me know what you think.

@joonro
Contributor

joonro commented Mar 12, 2014

I do everything in Python. Using one language for everything streamlines the transition between the various components of research (if you use a different language for each task, you have to constantly change your mindset, and data transfer is a pain as well).

I usually have scripts that process the raw data and generate the dataset (in HDF5 format) for analysis, and then another script that reads the dataset and actually runs the analyses on it.

For reproducibility, it is crucial to have command-line switches for the different options of these scripts, instead of modifying the actual source code for each option. I use docopt for that.
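To illustrate, a docopt-based script along these lines might look like the following (the script name and options are borrowed from the notebook commands below; the defaults are made up):

"""Generate the analysis dataset from the raw data.

Usage:
  data_generation.py [--option1=<n>] [--option2=<n>]

Options:
  --option1=<n>  First (hypothetical) tuning parameter [default: 1].
  --option2=<n>  Second (hypothetical) tuning parameter [default: 2].
"""
from docopt import docopt

if __name__ == "__main__":
    args = docopt(__doc__)           # the usage string above is the parser
    print(args)                      # e.g. {'--option1': '1', '--option2': '2'}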

Finally, I put actual commands in an IPython notebook. In the notebook, I have sections for data and analysis, and each cell has a command to run a script. For example:

Datawork

%run data_generation.py --option1=1 --option2=2

Analysis

%run do_analysis.py --option1=1 --option2=2

The IPython notebook is very convenient since it stores the command-line commands you used to generate data and results, so you can remind yourself of the workflow.

@lexnederbragt
Contributor

Funny you should ask, as we're now starting to look into this ourselves. We are at the beginning of a large project: multiple datasets generated over time, some samples yielding data files multiple times, different naming schemes from the data providers, the same analysis run on each data file, collective analyses by sample, sample origin, and other combinations of samples, and the need to redo parts of the analysis or add analyses or new data. I just learned how to use make (not easy), and came across ruffus, which I like as it is written in Python, with make-like capabilities (see the sketch below). Others point me to Snakemake, or ... or ...
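For anyone curious what ruffus's make-like behaviour looks like, a minimal sketch is below (the file patterns and steps are hypothetical, not from this project):

# A hypothetical two-step ruffus pipeline: every *.fastq file is trimmed,
# then each trimmed file is summarised; up-to-date outputs are not rebuilt.
from ruffus import transform, suffix, pipeline_run

@transform("*.fastq", suffix(".fastq"), ".trimmed.fastq")
def trim_reads(input_file, output_file):
    open(output_file, "w").close()   # stand-in for the real trimming step

@transform(trim_reads, suffix(".trimmed.fastq"), ".stats.txt")
def summarise(input_file, output_file):
    open(output_file, "w").close()   # stand-in for the real summary step

pipeline_run([summarise])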

@iglpdc
Contributor

iglpdc commented Mar 13, 2014

I'm very interested in this. I found that lacking a good workflow to run simulations and analyze results was my main source of frustration as a scientist. Also, as happens with all of this, the root of all evil was the lack of proper instruction in basic software skills. For most of the people in my field, at least, there is huge room for improvement in efficiency if they adopt a good (or, at least, "one") automated workflow. I've also found that most tools are domain-specific or make assumptions that are incompatible with your needs. So I set up my own thing, and it more or less worked.

I use make to compile my C++ code, and Python and shell scripts to create input files and analyze the results. After many years of struggle, I managed to automate most of the process, so that from my workstation or laptop I can create the parameter files, submit the jobs, and build an automatic directory structure to keep the results organized (based on this paper). Basically, I removed the possibility of making decisions about how to name the files, where to store them, etc. My setup included a hook that auto-commits the results to an svn repo that lives on each cluster (that was in the old days, before I knew about git), and an sqlite database to keep track of the results.

It had many pitfalls, but it was a huge improvement over my previous workflow of "just type everything on the command line each time", which, sadly, is what most people around me did and still do.
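A rough sketch of that kind of bookkeeping, creating a uniquely named results directory and recording the run in an sqlite database (the directory layout, table schema, and parameter names here are all made up):

# Hypothetical helper: make a time-stamped results directory for a run and
# record it in a small sqlite database, so no ad-hoc naming decisions are needed.
import sqlite3
from datetime import datetime
from pathlib import Path

def register_run(base_dir, params):
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    run_dir = Path(base_dir) / ("run-" + stamp)
    run_dir.mkdir(parents=True)

    db = sqlite3.connect(str(Path(base_dir) / "runs.db"))
    db.execute("CREATE TABLE IF NOT EXISTS runs (dir TEXT, params TEXT, started TEXT)")
    db.execute("INSERT INTO runs VALUES (?, ?, ?)", (str(run_dir), repr(params), stamp))
    db.commit()
    db.close()
    return run_dir

results_dir = register_run("results", {"coupling": 0.5, "lattice_size": 32})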

Some things I would like to improve are:

  • easy deployment of the code to the clusters. Once I have a repo with my code, I need to be able to deploy it often to all my clusters. Note that the code is changed and tuned often during the same project. I think I could incorporate this into my Python scripts using Fabric and make.
  • use autotools, or something similar, to create Makefiles automatically for each cluster, instead of keeping a ton of them, all slightly different, in my repo.
  • using a repo to auto-collect the results was a big improvement; git instead of svn will improve it even more.

@joonro
Contributor

joonro commented Mar 15, 2014

I attended a talk about Sumatra at SciPy 2013, and it seems to be a tool aimed at exactly this:

Sumatra: automated tracking of scientific computations
Sumatra is a tool for managing and tracking projects based on numerical simulation and/or analysis, with the aim of supporting reproducible research. It can be thought of as an automated electronic lab notebook for computational projects.

@mbjones

mbjones commented Mar 18, 2014

Here are some of the analysis tools and systems in use in the ecological and environmental science community:

For analysis, common tools include:

  1. Excel (ugh, but yes)
  2. JMP
  3. R, Matlab, SAS, IDL, Mathematica, for general stats and models
  4. ArcGIS, rgeos, GDAL, GRASS, QGIS for GIS analysis
  5. Specialty stats packages such as Primer, MetaWin, HydroDesktop, etc.

A small portion of the community has begun to think about reproducibility and scriptability, provenance, and re-execution. That small segment uses various workflow tools, such as:

  1. R, SAS, Matlab, as scripted analysis environments
  2. Kepler, Taverna, VisTrails, and other dedicated workflow systems
  3. Bash scripts
  4. Python, perl, and other scripting languages
  5. Pegasus, Condor, and related batch computing workflow systems
  6. Make (but far less common, except in its traditional use in building code for models)

There are a variety of publications on these issues and usages, especially in the scientific workflow community. The Taylor et al. (2007) book is a nice overview of a number of systems, many of which like Kepler, Taverna, and VisTrails have persisted over time (Workflows for e-Science: Scientific Workflows for Grids; http://www.springer.com/computer/communication+networks/book/978-1-84628-519-6). The dedicated scientific workflow community has thought through and implemented many advanced features (such as model versions, provenance tracking, data derivation, model abstraction). For example, Kepler and VisTrails support provenance tracking, keep track of model versions as users change them, and allow users to archive specific versions of model runs along with full provenance traces. Current work is on a shared, interoperable provenance model for scientific workflows that derives from PROV. There is an extensive literature on these systems, partly arising from annual workshops such as IPAW.

Is this ticket meant to be the start of a comprehensive survey? I'm curious what the intent is. I suspect that GitHub/Software Carpentry users are not a random sample of the science community, and it could be argued that they are especially unrepresentative of the users in many scientific disciplines. So I would be cautious about using these data as a survey of the relative usage of various approaches in any particular discipline. But as a (biased, partial) list of tools in use in various communities, it would be useful to know what is out there. There have been other, more comprehensive tool surveys (see e.g. the EBM Tools database and the DataONE software tools catalog). Hope this is helpful.

@rbeagrie
Contributor

I'm a little late to the party here, but I'll add my two cents anyway.

Like @selik, @jkitzes, and others have mentioned, I prototype and do exploratory analyses in an IPython notebook. If the analysis is something I want to reuse, I'll then refactor it into a standalone Python script, and I always make heavy use of argparse to try to make these scripts self-documenting. I used to use Make for tying these pipelines together, but I didn't like the fact that I couldn't make the Makefiles self-documenting in terms of how and why to specify the different parameters. For this reason, I switched to a Python build tool called doit, as it allows me to use argparse to document the parameters for the whole pipeline.

There's an example of one of my doit pipelines for handling sequencing data at:

https://github.com/rbeagrie/cookiecutter-tophat2mapping/blob/master/%7B%7Bcookiecutter.repo_name%7D%7D/make_bigwig.py

My impression is that the doit syntax, being much more verbose than make, is also easier to read. That's just my impression, though, and I'd love to hear people's impressions of how easy or hard it is to figure out what the above file is doing without necessarily knowing how the doit library works in advance.
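For readers who haven't used doit, a minimal task file looks roughly like the sketch below (the file names and commands are hypothetical, not taken from the pipeline linked above):

# dodo.py -- doit discovers functions named task_* and re-runs their actions
# whenever the listed file dependencies change.

def task_align():
    """Align reads (hypothetical command) whenever the input file changes."""
    return {
        "file_dep": ["reads.fastq"],
        "targets": ["aligned.bam"],
        "actions": ["python align.py reads.fastq aligned.bam"],
    }

def task_summarise():
    """Summarise the alignment produced by task_align."""
    return {
        "file_dep": ["aligned.bam"],
        "targets": ["summary.txt"],
        "actions": ["python summarise.py aligned.bam summary.txt"],
    }

Running doit in the same directory executes only the tasks whose dependencies have changed, which doit detects via checksums by default.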

Anyway, for anyone who is really interested, I've made some lessons for teaching doit to learners in #419 which also has a direct comparison of doit and Make.

@joschkazj

Based on the features presented here (simple Python script, MD5 change detection), doit seemed like a perfect fit for my Python-based workflows, so I was eager to try it after I read @rbeagrie's lesson.
Until now I have been using waf, based on the project template by Hans-Martin von Gaudecker (GitHub repository), which I modified for my needs.

After converting a recent project workflow from waf to doit, all seemed well until I tried the parallel execution option, which fails on Windows (pydoit/doit#50). As I have to work on Windows and rely on parallel execution for things like parameter optimization and classifier comparisons, I am stuck with waf.

@gvwilson gvwilson self-assigned this Jun 26, 2016
@gvwilson gvwilson removed their assignment Apr 23, 2017