Documentation complete
DamienIrving committed Jun 26, 2018
1 parent 453ec72 commit f2dd8fc
Showing 3 changed files with 73 additions and 22 deletions.
5 changes: 5 additions & 0 deletions docs/api_reference.rst
@@ -0,0 +1,5 @@
+API reference
+-------------
+
+.. automodule:: cmdline_provenance
+   :members:
10 changes: 5 additions & 5 deletions docs/conf.py
@@ -17,9 +17,9 @@
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
-# import os
-# import sys
-# sys.path.insert(0, os.path.abspath('.'))
+import os
+import sys
+sys.path.insert(0, os.path.abspath('../cmdline_provenance/'))

# -- General configuration ------------------------------------------------

@@ -30,7 +30,7 @@
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
-extensions = []
+extensions = ['sphinx.ext.autodoc']

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
@@ -175,7 +175,7 @@

# Custom sidebar templates, maps document names to template names.
#
-# html_sidebars = {}
+html_sidebars = {'**': ['relations.html', 'sourcelink.html', 'searchbox.html']}

# Additional templates that should be rendered to pages, maps page names to
# template names.
80 changes: 63 additions & 17 deletions docs/index.rst
@@ -1,6 +1,13 @@
Command Line Provenance
=======================

+.. toctree::
+   :maxdepth: 3
+   :hidden:
+
+   index
+   api_reference
+

Introduction
------------
@@ -9,9 +16,11 @@ Introduction

It was inspired by the popular `NCO <http://nco.sourceforge.net/>`_ and
`CDO <https://code.mpimet.mpg.de/projects/cdo>`_ command line tools,
-which automatically generate a record of what was executed at the command line,
-append that record to the history attribute from the input (netCDF) data file,
-and then set the new extended record as the history attribute of the output (netCDF) data file.
+which are used to manipulate the data and/or metadata contained in netCDF files
+(a self-describing file format that is popular in the weather and climate sciences).
+These tools generate a record of what was executed at the command line,
+append that record to the history attribute from the input data file,
+and then set this command log as the history attribute of the output data file.

For example, after selecting the 2001-2005 time period from a rainfall data file
and then deleting the ``long_name`` file attribute,
@@ -24,7 +33,7 @@ the command log would look as follows:
Following this simple approach to data provenance,
it is possible to maintain a record of all data processing steps
-from intial download/creation of your data files to the end result (e.g. a .png image).
+from initial download/creation of the data files to the end result (e.g. a .png image).

``cmdline_provenance`` contains a series of functions for generating history records in the NCO/CDO format,
and for combining the current record with previous records to maintain a complete command log.
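To give a sense of the basic workflow, here is a minimal sketch (the no-argument call and the printed record are illustrative, reusing the command from the examples below rather than real output):

.. code-block:: python

   >>> import cmdline_provenance as cmdprov
   >>> my_log = cmdprov.new_log()  # a record of the current command line entry
   >>> print(my_log)
   Tue Jun 26 11:24:46 2018: /Applications/anaconda/bin/python ocean_analysis.py temperature_data.nc salinity_data.nc output.nc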
@@ -82,17 +91,17 @@ and the log entry will specify the precise version of ``ocean_analysis.py`` that
Each commit in a git repository is associated with a unique 40-character identifier known as a hash.
The ``new_log`` function has included the first 7 characters
of the hash associated with the latest commit to the repository,
-which is sufficient information to revert back to that previous version of ``ocean_analysis.py``.
+which is sufficient information to revert to that previous version of ``ocean_analysis.py`` if need be.
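For instance, here is a sketch (not from the library's docs, and using a hypothetical hash ``026301f``) of restoring the recorded version of the script by shelling out to git:

.. code-block:: python

   >>> import subprocess
   >>> # check out ocean_analysis.py as it was at the commit recorded in the log
   >>> subprocess.run(['git', 'checkout', '026301f', '--', 'ocean_analysis.py'], check=True)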


-outputting a log
-^^^^^^^^^^^^^^^^
+writing a log to file
+^^^^^^^^^^^^^^^^^^^^^

If our output file is a self-describing file format (i.e. a format that carries its metadata with it),
-then we would include our new log in the file metadata.
+then we can include our new log in the file metadata.
For instance, a common convention in weather and climate science is to include the command log
in the global history attribute of netCDF data files.
-If we were using the iris library (for instance)
+If we were using the `iris <https://scitools.org.uk/iris/docs/latest/>`_ library (for instance)
to read and write netCDF files using its cube data structure,
the process would look something like this:

@@ -107,22 +116,59 @@ the process would look something like this:
   >>> output_cube.attributes['history'] = my_log
   >>> iris.save(output_cube, 'output.nc')

-If the output file was not a self-describing format (e.g. a .png image),
+If the output file was not a self-describing format (e.g. ``output.png``),
then we can write a separate log file (i.e. a simple text file with the log in it)
using the ``write_log`` function.

.. code-block:: python

-   >>> outfile = 'output.png'
-   >>> logfile = 'output.log'
-   >>> cmdprov.write_log(logfile, my_log)
+   >>> cmdprov.write_log('output.log', my_log)

While it's not a formal requirement of the ``write_log`` function,
it's good practice to make the name of the log file exactly the same as the name of the output file,
-just with a different file extension such as .log or .txt.
+just with a different file extension (such as .log or .txt).
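As a quick sketch of that naming convention (``cmdprov`` and ``my_log`` are assumed to exist from the steps above):

.. code-block:: python

   >>> import os
   >>> outfile = 'output.png'
   >>> logfile = os.path.splitext(outfile)[0] + '.log'  # 'output.log'
   >>> cmdprov.write_log(logfile, my_log)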

+including logs from input files
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In order to capture the complete provenance of the output file,
+we need to include the command logs from the input files in our new log.
+We can do this using the ``infile_history`` keyword argument associated with the
+``new_log`` function.
+
+If our input files are in a self-describing format,
+then similar to the iris example above,
+we can extract the input file logs from the metadata of the input file(s):
+
+.. code-block:: python
+
+   >>> inlogs = {}
+   >>> inlogs['temperature_data.nc'] = temperature_cube.attributes['history']
+   >>> inlogs['salinity_data.nc'] = salinity_cube.attributes['history']
+   >>> my_log = cmdprov.new_log(infile_history=inlogs, git_repo='/path/to/git/repo/')
+   >>> print(my_log)
+   Tue Jun 26 11:24:46 2018: /Applications/anaconda/bin/python ocean_analysis.py temperature_data.nc salinity_data.nc output.nc (Git hash: 026301f)
+   History of temperature_data.nc: Tue Jun 26 09:24:03 2018: cdo daymean temperature_data.nc
+   History of salinity_data.nc: Tue Jun 26 09:22:10 2018: cdo daymean salinity_data.nc
+   Tue Jun 26 09:21:54 2018: ncatted -O -a standard_name,so,o,c,"ocean_salinity" salinity_data.nc
+
+If the input files aren't self-describing,
+you can use the ``read_log`` function to read the log files associated with the input data files
+(these log files may have been previously written using the ``write_log`` function):
+
+.. code-block:: python
+
+   >>> inlogs = {}
+   >>> inlogs['temperature_data.csv'] = cmdprov.read_log('temperature_data.log')
+   >>> inlogs['salinity_data.csv'] = cmdprov.read_log('salinity_data.log')
+   >>> my_log = cmdprov.new_log(infile_history=inlogs, git_repo='/path/to/git/repo/')
-input file history
-^^^^^^^^^^^^^^^^^^
-If we want a complete log...
+
+For scripts that take many input files,
+the resulting log files can become very long and unwieldy.
+To keep them manageable, think about ways to avoid repetition.
+For instance, if you've got one input file that contains data for the years 1999-2003
+and another equivalent file with data for 2004-2008,
+it's probably only necessary to include the log from the 1999-2003 file
+(because essentially identical data processing steps were taken to produce both files).
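A sketch of that idea (the file and cube names here are hypothetical):

.. code-block:: python

   >>> # both rainfall files were produced by essentially the same processing steps,
   >>> # so include the history of the 1999-2003 file as representative of both
   >>> inlogs = {'rainfall_1999-2003.nc': cube_1999_2003.attributes['history']}
   >>> my_log = cmdprov.new_log(infile_history=inlogs)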
