% Title
% Daniel Wheeler
% 2014-04-27

{#overview .step data-scale=8}

{#title .step data-y=-2500 data-scale=4}

Simulation and Metadata Management

Daniel Wheeler • April 29, 2014 • Diffusion Workshop

{#automate .step data-y=200 data-x=-800}


Automate

About me {#aboutme .step data-y=200 data-x=800}


scientific/academic code developer

run/manage simulations (code monkey)

an epic Pythonista (according to OSRC)

FiPy developer

interested in reproducible research, see @wd15dan

Imagine... {.step .alwaysshow data-y=200 data-x=2400}


A declarative metadata standard

that you can use to tell a Linux VM how to download your data, execute your computational analysis, and spin up an interface to a literate computing environment with the analysis preloaded. Then we can provide buttons on scientific papers that say "run this analysis on Rackspace! or Amazon! Estimated cost: $25".


Automated integration tests for papers

where you provide the metadata to run your analysis while you're working on your paper and a service automatically pulls down your analysis source and data, runs it, and generates your figures for you to check. Then when the paper is ready to submit, the journal takes your metadata format and verifies it themselves, and passes it on to reviewers with a little "reproducible!" tick mark.


ideas by *C. Titus Brown*

Orthogonal Issues {#orthogonal .step .alwaysshow data-y=2500}

Workflow Control {.step .alwaysshow data-y=2720 data-x=180}

Scientific Development Process {.step data-y=2000 data-x=500 data-scale=0.2}


{.step .alwaysshow data-y=2500}

Version Control {.step .alwaysshow data-y=3110 data-x=-280 data-rotate-z="45"}

{#vc .step data-y=3110 data-x=-280 data-rotate-z="45"}


maintains history of workflow changes

but not workflow usage

already integrated into the scientific development process

Easy to use {.step data-y=3200 data-x=-500 data-rotate-z="45" data-scale=0.2}

$ git init
$ git add file.txt
$ git commit -m "add file.txt"
$ edit file.txt
$ git commit -am "edit file.txt"
$ git log --oneline --abbrev=12
e00433e69a43 edit file.txt
12e3c2618143 add file.txt
$ git push github master

Manage Complexity {#managecomplexity .step data-y=3350 data-x=-350 data-rotate-z="-45" data-scale=0.2}

{.step .alwaysshow data-y=2500}

Event Control {.step .alwaysshow data-rotate-z="-45" data-y=2700 data-x=560 }

{.step data-rotate-z="-45" data-y=2700 data-x=560 }


provide a **unique ID (SHA checksum)** for every workflow execution

capture **metadata**, not data

**not** workflow control or version control

partial solution: **Sumatra**, a simulation management tool (not workflow)
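
As a rough illustration of the first bullet, one way to derive a unique ID is to hash a record of the execution's metadata. This is only a sketch under that assumption, not how Sumatra generates its labels; the helper `execution_id` and the metadata field names are hypothetical.

```python
import hashlib
import json


def execution_id(metadata, length=12):
    """Return a short hex ID derived from a metadata dictionary.

    Hypothetical helper: hashes a canonical JSON encoding with SHA-1.
    """
    # sort_keys makes the encoding independent of dictionary ordering
    canonical = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return hashlib.sha1(canonical).hexdigest()[:length]


# Illustrative metadata only; the field names are assumptions.
record = {
    "timestamp": "2014-04-21 16:07:52.100838",
    "main_file": "script.py",
    "parameters": {"wait": 3},
}
run_id = execution_id(record)
print(run_id)
```

Because the timestamp is part of the hashed record, two executions with identical parameters would still get different IDs.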

{.step .alwaysshow data-y=2500}

Sumatra {.step data-x=-1000 data-y=1200 data-scale=0.5}


**doesn't change my workflow**

records the **metadata** (not the data): parameters, environment, data location, time stamps, commit message, duration, data hash

generates **unique ID** for each simulation
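
To make the list of recorded fields concrete, here is a minimal sketch of capturing that kind of metadata around a single run using only the Python standard library. It is not Sumatra's implementation; `run_and_record`, `simulate`, `file_sha1` and the dictionary keys are hypothetical names chosen for illustration.

```python
import hashlib
import os
import platform
import sys
import tempfile
import time
from datetime import datetime


def file_sha1(path):
    """Hash a file's contents so the record can reference data without storing it."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()


def run_and_record(func, parameters, output_path, reason=""):
    """Run func(parameters, output_path) and return a metadata record (hypothetical)."""
    timestamp = datetime.now().isoformat()
    start = time.perf_counter()
    func(parameters, output_path)
    duration = time.perf_counter() - start
    return {
        "timestamp": timestamp,
        "reason": reason,
        "parameters": parameters,
        "duration": duration,
        "executable": sys.executable,
        "python_version": platform.python_version(),
        "output_path": output_path,  # data location, not the data itself
        "output_sha1": file_sha1(output_path),
    }


def simulate(parameters, output_path):
    """Stand-in for a real simulation: writes one line of output."""
    with open(output_path, "wb") as f:
        f.write("wait={}\n".format(parameters["wait"]).encode("utf-8"))


output = os.path.join(tempfile.mkdtemp(), "result.txt")
record = run_and_record(simulate, {"wait": 3}, output, reason="create demo record")
print(record["output_sha1"], record["duration"])
```

The record stores only the output's path and hash, which is what keeps the store small while still allowing a later check that the data has not changed.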

Easy to use {#easytouse1 .step data-x=-200 data-y=1200 data-scale=0.5}

$ smt init smt-demo
$ smt configure --executable=python --main=script.py
$ # python script.py params.json
$ smt run --tag=demo --reason="create demo record" params.json wait=3
Record label for this run: '0c50797f1e3f'
No data produced.
Created Django record store using SQLite

Easy to use {#easytouse2 .step data-x=-200 data-y=1200 data-scale=0.5}

$ smt list --long
------------------------------------------------------------------------
Label            : 6c9c7cd2bbc2
Timestamp        : 2014-04-21 16:07:52.100838
Reason           : create demo record
Outcome          : 
Duration         : 3.26091217995
Repository       : GitRepository at /home/wd15/git/diffusion-worksho ...
Main_File        : script.py
Version          : 08d04df6a9b561eb146d3a7461f763869fdc48a7
Script_Arguments : <parameters>
Executable       : Python (version: 2.7.6) at /home/wd15/anaconda/bi ...
Parameters       : {
                 :     "wait": 3
                 : }
Input_Data       : []
Launch_Mode      : serial
Output_Data      : []
User             : Daniel Wheeler <[email protected]>
Tags             : demo
Repeats          : None

Web Interface {#webinterface .step data-x=800 data-y=1200 data-scale=0.5}

<iframe width="100%" height="100%" src="http://127.0.0.1:8000/" frameborder="0" border="0"> </iframe>

Sumatra + IPython + Pandas { .step data-x=1800 data-y=1200 data-scale=0.5}


high level data manipulation

quickly mix parameters, metadata and output data in a dataframe

save Sumatra records as HDF file

disseminate instantly using [nbviewer.ipython.org](http://nbviewer.ipython.org/)

Using Pandas {#usingpandas .step data-x=2800 data-y=1200 data-scale=0.5}

$ smt export
$ ipython
>>> import json, pandas
>>> with open('.smt/records_export.json') as f:
...     data = json.load(f)     
>>> df = pandas.DataFrame(data)
>>> custom_df = df[['label', 'duration', 'tags']]
>>> custom_df
   label         duration  tags
0  6c9c7cd2bbc2  3.260912  [demo]
1  db8610f0c51f  3.248754  [demo]
2  0fdaf12e0cb2  3.247553  [demo]
...
>>> custom_df.to_hdf('records.h5', key='records')

Using IPython {#webinterface2 .step data-x=3800 data-y=1200 data-scale=0.5}

<iframe width="100%" height="100%" src="http://wd15.github.io/2013/05/07/extremefill2d/" frameborder="0" border="0"> </iframe>

{.step data-y=200 data-x=2400}

The Fantasy {.step data-x=3400 data-y=200 data-scale=0.5}


cloud service for Sumatra

integrated with GitHub, Buildbot and a VM provider

**sumatra-server 0.1.0** is out!

Thanks! {.step data-x=4400 data-y=200 data-scale=0.5}


slides: [wd15.github.io/diffusion-workshop-2014](http://wd15.github.io/diffusion-workshop-2014/)

parallel demo: [github.com/wd15/smt-demo](https://github.com/wd15/smt-demo)