What tool(s) do you use to manage data analysis pipelines? #375
Comments
In general, I prototype everything in an IPython Notebook, then gradually refactor into Python scripts that can chain together via command line pipes. I like to end up with something like this:
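The kind of pipe-friendly script this describes might look like the following; this is an illustrative sketch rather than the original example, and the file format and field handling are invented:

```python
# clean.py -- illustrative pipe stage: read CSV rows on stdin, drop incomplete
# records, and write the cleaned rows to stdout, so it can be chained like
#   cat raw.csv | python clean.py | python summarize.py > report.csv
import csv
import sys

def main():
    reader = csv.reader(sys.stdin)
    writer = csv.writer(sys.stdout)
    for row in reader:
        if row and all(field.strip() for field in row):  # keep complete rows only
            writer.writerow(row)

if __name__ == "__main__":
    main()
```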
If it's useful, I'll break longer parsing, wrangling, and analysis code out into its own libraries.
I tend to do my data visualization in R. I like that it makes pretty graphs without too much work. Sometimes I will use Python, but the final graphs from R are generally more attractive. If I need to stitch things together, I tend to use a shell script.
For bioinformatics I started out using shell scripts; however, I'm trying to move most of my workflows to Julia, as it provides a neat way of shelling out commands, something that is unfortunately really common in bioinformatics. Something like this:

```julia
using Report, JSON, Gadfly, DataFrames

picardmultiple = `java -jar -Xmx4g $picard/CollectMultipleMetrics.jar INPUT=$infile OUTPUT=$outfile ASSUME_SORTED=false REFERENCE_SEQUENCE=$refseq`
run(picardmultiple)
```

Nasty, I know, but that's bioinformatics I guess. You construct shell commands with the backtick operator and then execute them with `run`.
@selik What do you then use to manage those pipelines? Do you store each one in a single-line shell script with a descriptive name for later re-use? Or do you trust yourself to reconstruct the necessary pipelines on the fly?
@LJWilliams @sveme So if one of your input data files changes, or if you tweak a parameter, you regenerate the affected results manually?
I do basically what I teach here, wrapped in some exploratory analysis. The process goes roughly like this (although the real world is, of course, messy and iterative rather than linear). If my inputs or parameters change, I just change them and run the whole thing again. For alternate simultaneous parameter sets, I'll create multiple run scripts.
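As an illustration of that pattern (not the original code; the parameter names and stand-in stages are hypothetical), a driver script might gather the parameters at the top and call each stage in order, so a change means editing one value and re-running a single command:

```python
# runall.py -- hypothetical driver script: parameters at the top, stages in order.
# Re-running the whole analysis after a change is just `python runall.py`;
# an alternate parameter set gets its own copy of this file.

PARAMS = {
    "input_file": "data/raw_counts.csv",
    "threshold": 0.05,
}

def clean(path):
    # stand-in for the real data-cleaning stage
    return [line.strip() for line in open(path) if line.strip()]

def analyze(rows, threshold):
    # stand-in for the real analysis stage
    return {"n_rows": len(rows), "threshold": threshold}

def report(results):
    # stand-in for figure/table generation
    print(results)

def main():
    rows = clean(PARAMS["input_file"])
    results = analyze(rows, threshold=PARAMS["threshold"])
    report(results)

if __name__ == "__main__":
    main()
```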
At first I regenerate manually. Once I'm happy with the pipeline, I try to turn the elements of the script that change into functions and parameters so that I can run the code from the command line. This last step doesn't always happen (even though it should).
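In Python, for example, that last step often amounts to exposing the pieces that change as command line arguments with the standard argparse module; a minimal sketch, with invented parameter names:

```python
# analysis.py -- hypothetical example of pulling the changing pieces out into CLI options
import argparse

def run_analysis(infile, threshold):
    # the real analysis/plotting code would go here
    print(f"analysing {infile} with threshold {threshold}")

def main():
    parser = argparse.ArgumentParser(description="Run one analysis step")
    parser.add_argument("infile", help="input data file")
    parser.add_argument("--threshold", type=float, default=0.05,
                        help="cut-off used in the analysis")
    args = parser.parse_args()
    run_analysis(args.infile, args.threshold)

if __name__ == "__main__":
    main()
```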
@gvwilson No, bioinformatics data often come in files that I put into one folder with a certain naming scheme. I then run the whole analysis, or parts of it, on the whole folder or on files with certain tags. I have written some functions to parse the (file) output from the commands. All in all, quite similar to @jkitzes' workflow.
@gvwilson My data analysis involves a mix of Python scripts (that parse the command line) and numerous command line utilities that manipulate netCDF files.
@gvwilson I usually make Python functions for each step in my analysis and end up with a wrapper script that runs the whole shebang. Depending on the complexity of things, I can then add sanity checking for each step, so that it doesn't just run ahead with bad results. While developing, I usually work with a small test set to help with debugging. If there are important parameters that could or should change, I usually end up adding a config file too.
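A sketch of that wrapper pattern, assuming a JSON config file and stand-in step functions (none of these names come from the original comment):

```python
# run_all.py -- hypothetical wrapper: read a config, run each step, sanity-check outputs
import json
import os

def check_output(path, step_name):
    """Fail early instead of running ahead with bad results."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        raise RuntimeError(f"{step_name} produced no usable output at {path}")

def preprocess(cfg):
    # stand-in for the real preprocessing step; writes cfg["clean_file"]
    ...

def analyze(cfg):
    # stand-in for the real analysis step; writes cfg["results_file"]
    ...

def main(config_file="config.json"):
    with open(config_file) as f:
        cfg = json.load(f)

    preprocess(cfg)
    check_output(cfg["clean_file"], "preprocess")

    analyze(cfg)
    check_output(cfg["results_file"], "analyze")

if __name__ == "__main__":
    main()
```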
@DamienIrving Re "numerous command line utilities that manipulate netCDF files": try NCL, basically a "little language" for netCDF (plus some dubious graphical utilities). Friendly syntax, plus (importantly, IMHO) a REPL (unfortunately not as good as Python's or R's). For too long I wrangled netCDF using NCO and various IOAPI CLUs, then switched to David Pierce's R packages (notably ncdf4, which I still use whenever getting in and out of R is more pain than gain). But for straight-up netCDF twiddling (put this in, take this out, simple math), NCL rules. Plus it has a Python API.
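For comparison, the same sort of simple twiddling from Python with the netCDF4 package (not NCL; the variable name here is invented):

```python
# Hedged sketch: read a variable from one netCDF file, do simple math,
# and write the result to a new file.
import numpy as np
from netCDF4 import Dataset

with Dataset("input.nc") as src, Dataset("anomaly.nc", "w") as dst:
    # copy the dimensions from the source file
    for name, dim in src.dimensions.items():
        dst.createDimension(name, None if dim.isunlimited() else len(dim))

    temp = src.variables["temperature"]           # hypothetical variable name
    anom = dst.createVariable("temperature_anom", temp.dtype, temp.dimensions)
    anom[:] = temp[:] - np.mean(temp[:], axis=0)  # anomaly relative to the time mean
```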
In astrophysics we have a lot of tools that communicate with each other through the SAMP protocol. Through this protocol one can transfer tables directly from a website (like the VizieR catalogue service) to a table visualiser (e.g. TOPCAT) with just one click. After some merging/filtering there with other tables, I can transfer them to Python (with astropy). The same workflow can be done with images by connecting different imaging software (e.g. Aladin). Also, there's a plugin for Taverna (a workflow management system) that connects through SAMP to all these tools. The flaw I see in this is the difficulty of reproducing the whole process. I find it awesome for testing things quickly, because the tools are all connected in a seamless way. However, if I had to save it all to publish my results or to repeat it later, I would probably write it all in a Python script (I identify with this blog post).
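The astropy end of that workflow, once a table has been saved out of TOPCAT or VizieR, might look something like this (a sketch; the file and column names are invented):

```python
# Hypothetical example: read a saved VOTable, filter it, and write the result,
# so the interactive TOPCAT/Aladin steps can be replayed from a script later.
from astropy.table import Table

catalog = Table.read("vizier_export.vot", format="votable")
bright = catalog[catalog["Vmag"] < 15.0]   # hypothetical magnitude column
bright.write("bright_sources.fits", overwrite=True)
```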
At work we use
A website that I've enjoyed over the years is http://usesthis.com/. It's composed of short interviews built around four basic questions posed to people in various fields. I've found it to be a great resource for finding out about software and services, and there are already a number of spin-offs. Maybe it would be interesting to make a variation for scientists, host it off of the SWC website, formalize something like what is being posted here, and frame it as a way for scientists in a range of disciplines to learn about the tools people are using that promote productivity and reproducibility.
@synapticarbors I am trying to make it oriented towards scientists and data scientists (and also developers). I am waiting to get interviews from people so that I have a buffer and can release an interview every week. Let me know what you think.
I do everything in Python. Using one language for everything streamlines the transition between the various components of research (if you use a different language for each task, you have to constantly change your mindset, and data transfer is a pain as well). I usually have scripts for processing the raw data and generating the dataset (in HDF5 format) for analysis, and then another script for reading the dataset and actually running the analyses on it. For reproducibility, it is crucial to have command line switches for the different options of these scripts, instead of modifying the source code for each option; I use docopt for that. Finally, I put the actual commands in an IPython notebook. In the notebook, I have sections for data and analysis, and each cell has a command to run a script. For example:

Data work:

```
%run data_generation.py --option1=1 --option2=2
```

Analysis:

```
%run do_analysis.py --option1=1 --option2=2
```

The IPython notebook is very convenient since you can store the command line commands that you use to generate data and results, and you can remind yourself of the workflow.
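A minimal docopt sketch of the kind of script those cells drive; the option names mirror the commands above, but the script body is invented:

```python
"""Generate the analysis dataset.

Usage:
  data_generation.py [--option1=<n>] [--option2=<n>]

Options:
  --option1=<n>  First tunable parameter [default: 1].
  --option2=<n>  Second tunable parameter [default: 2].
"""
from docopt import docopt

if __name__ == "__main__":
    args = docopt(__doc__)
    option1 = int(args["--option1"])
    option2 = int(args["--option2"])
    # ... build the HDF5 dataset here using option1/option2 ...
    print(f"generating dataset with option1={option1}, option2={option2}")
```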
Funny you should ask, as we're now starting to look into this ourselves. We are at the beginning of a large project: multiple datasets generated over time, some samples yielding data files multiple times, different naming schemes from the data providers, the same analysis run on each data file, collective analyses by sample, sample origin, and other combinations of samples, and the need to redo parts of the analysis or add new analyses or new data. I just learned how to use make (not easy), and came across ruffus, which I like as it is written in Python with make-like capabilities. Others point me to Snakemake, or ... or ...
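For readers who haven't seen ruffus, a single step looks roughly like this (the file names and the counting step are invented; ruffus re-runs a step when its outputs are missing or older than its inputs):

```python
# Hypothetical ruffus pipeline with a single transform step.
from ruffus import transform, suffix, pipeline_run

starting_files = ["sampleA.fastq", "sampleB.fastq"]

@transform(starting_files, suffix(".fastq"), ".counts")
def count_reads(input_file, output_file):
    # stand-in analysis step: a FASTQ record is four lines
    with open(input_file) as inp, open(output_file, "w") as out:
        n_lines = sum(1 for _ in inp)
        out.write(f"{n_lines // 4}\n")

if __name__ == "__main__":
    pipeline_run([count_reads])
```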
I'm very interested in this. I found that lacking a good workflow to run simulations and analyze results was my main source of frustration as a scientist. Also, as happens with all of this, the root of all evil was the lack of proper instruction in basic software skills. At least for most of the people in my field, there is huge room for improvement in efficiency if they adopt a good (or, at least, "one") automated workflow. I've also found that most tools are domain-specific or make assumptions that are incompatible with your needs. So I set up my own thing, and it more or less worked. It had many pitfalls, but it was a huge improvement over my previous workflow of "just type everything in the command line each time", which, sadly, is what most people around me did and still do. There are still some things I would like to improve.
I attended a talk about Sumatra at SciPy 2013, and it seems to be a tool aiming for exactly this.
Here are some of the analysis tools and systems in use in the ecological and environmental science community. For analysis, common tools include:
A small portion of the community has begun to think about reproducibility and scriptability, provenance, and re-execution. That small segment uses various workflow tools, such as:
There are a variety of publications on these issues and usages, especially in the scientific workflow community. The Taylor et al. (2007) book is a nice overview of a number of systems, many of which, like Kepler, Taverna, and VisTrails, have persisted over time (Workflows for e-Science: Scientific Workflows for Grids; http://www.springer.com/computer/communication+networks/book/978-1-84628-519-6). The dedicated scientific workflow community has thought through and implemented many advanced features (such as model versions, provenance tracking, data derivation, and model abstraction). For example, Kepler and VisTrails support provenance tracking, keep track of model versions as users change them, and allow users to archive specific versions of model runs along with full provenance traces. Current work is on a shared, interoperable provenance model for scientific workflows that derives from PROV. There is an extensive literature on these systems, partly arising from annual workshops such as IPAW.

Is this ticket meant to be the start of a comprehensive survey? I'm curious what the intent is. I suspect that GitHub/Software Carpentry users do not represent a random sample of the science community, and it could be argued that they are specifically unrepresentative of the users in many scientific disciplines. So I would be cautious not to use this data as a survey of the relative usage of various approaches in any particular discipline. But as a (biased, partial) list of tools in use in various communities, it would be useful to know what is out there. There have been other, more comprehensive tool surveys (see, e.g., the EBM Tools database and the DataONE software tools catalog). Hope this is helpful.
I'm a little late to the party here, but I'll add my two cents anyway. Like @selik, @jkitzes and others have mentioned, I prototype and do exploratory analyses in an IPython notebook. If the analysis is something I want to reuse, I'll then refactor it into a standalone Python script, and I always make heavy use of doit. There's an example of one of my doit pipelines for handling sequencing data online. My impression is that the doit syntax, while much more verbose than make, is also easier to read. That's just my impression though, and I'd love to hear how easy or hard people find it to figure out what such a file is doing without knowing how the doit library works in advance. Anyway, for anyone who is really interested, I've made some lessons for teaching doit to learners in #419, which also has a direct comparison of doit and Make.
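For anyone who hasn't seen a doit file, a dodo.py along these lines is roughly what such a pipeline contains (the sample names and the wc-based action are invented for illustration):

```python
# dodo.py -- hypothetical doit file; run with `doit` in the same directory.
SAMPLES = ["sample1", "sample2"]

def task_count_reads():
    """One sub-task per sample; doit re-runs it only when the input file changes."""
    for sample in SAMPLES:
        yield {
            "name": sample,
            "file_dep": [f"{sample}.fastq"],
            "targets": [f"{sample}.counts"],
            "actions": [f"wc -l {sample}.fastq > {sample}.counts"],
        }
```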
Based on the features presented here (simple Python script, MD5 change detection), doit seems like a perfect fit for my Python-based workflows. So I was eager to try it after I read @rbeagrie's lesson. After converting a recent project workflow from waf to doit all seemed well, until I tried the parallel execution option, which fails on Windows (pydoit/doit#50). As I have to work on Windows and rely on parallel execution for things like parameter optimization and classifier comparisons, I am stuck with waf.
We all have our favorite tools for getting, crunching, and visualizing data; what I'd like to know is, how do you stitch them together? Do you write a shell script? Do you use a Makefile? Do you drive everything from Python, Perl, or R (and if so, how do you handle tools written in other languages)? Do you use web-based workflow tools, and if so, which ones?