Documentation complete
DamienIrving committed Jun 26, 2018
1 parent 453ec72 commit f2dd8fc
Showing 3 changed files with 73 additions and 22 deletions.
5 changes: 5 additions & 0 deletions docs/api_reference.rst
@@ -0,0 +1,5 @@
+API reference
+-------------
+
+.. automodule:: cmdline_provenance
+   :members:
10 changes: 5 additions & 5 deletions docs/conf.py
@@ -17,9 +17,9 @@
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
-# import os
-# import sys
-# sys.path.insert(0, os.path.abspath('.'))
+import os
+import sys
+sys.path.insert(0, os.path.abspath('../cmdline_provenance/'))

# -- General configuration ------------------------------------------------

@@ -30,7 +30,7 @@
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
-extensions = []
+extensions = ['sphinx.ext.autodoc']

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
@@ -175,7 +175,7 @@

# Custom sidebar templates, maps document names to template names.
#
-# html_sidebars = {}
+html_sidebars = {'**': ['relations.html', 'sourcelink.html', 'searchbox.html']}

# Additional templates that should be rendered to pages, maps page names to
# template names.
80 changes: 63 additions & 17 deletions docs/index.rst
@@ -1,6 +1,13 @@
Command Line Provenance
=======================

+.. toctree::
+   :maxdepth: 3
+   :hidden:
+
+   index
+   api_reference
+

Introduction
------------
@@ -9,9 +16,11 @@ Introduction

It was inspired by the popular `NCO <http://nco.sourceforge.net/>`_ and
`CDO <https://code.mpimet.mpg.de/projects/cdo>`_ command line tools,
-which automatically generate a record of what was executed at the command line,
-append that record to the history attribute from the input (netCDF) data file,
-and then set the new extended record as the history attribute of the output (netCDF) data file.
+which are used to manipulate the data and/or metadata contained in netCDF files
+(a self-describing file format that is popular in the weather and climate sciences).
+These tools generate a record of what was executed at the command line,
+append that record to the history attribute from the input data file,
+and then set this command log as the history attribute of the output data file.

For example, after selecting the 2001-2005 time period from a rainfall data file
and then deleting the ``long_name`` file attribute,
@@ -24,7 +33,7 @@ the command log would look as follows:
Following this simple approach to data provenance,
it is possible to maintain a record of all data processing steps
-from intial download/creation of your data files to the end result (e.g. a .png image).
+from initial download/creation of the data files to the end result (e.g. a .png image).

``cmdline_provenance`` contains a series of functions for generating history records in the NCO/CDO format,
and for combining the current record with previous records to maintain a complete command log.
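To give a sense of the basic workflow, here is a minimal sketch (the no-argument call and the printed record are illustrative, reusing the command from the examples below rather than real output):

.. code-block:: python

   >>> import cmdline_provenance as cmdprov
   >>> my_log = cmdprov.new_log()  # a record of the current command line entry
   >>> print(my_log)
   Tue Jun 26 11:24:46 2018: /Applications/anaconda/bin/python ocean_analysis.py temperature_data.nc salinity_data.nc output.nc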
@@ -82,17 +91,17 @@ and the log entry will specify the precise version of ``ocean_analysis.py`` that
Each commit in a git repository is associated with a unique 40-character identifier known as a hash.
The ``new_log`` function has included the first 7 characters
of the hash associated with the latest commit to the repository,
-which is sufficient information to revert back to that previous version of ``ocean_analysis.py``.
+which is sufficient information to revert to that previous version of ``ocean_analysis.py`` if need be.
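For instance, here is a sketch (not from the library's docs, and using a hypothetical hash ``026301f``) of restoring the recorded version of the script by shelling out to git:

.. code-block:: python

   >>> import subprocess
   >>> # check out ocean_analysis.py as it was at the commit recorded in the log
   >>> subprocess.run(['git', 'checkout', '026301f', '--', 'ocean_analysis.py'], check=True)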


-outputting a log
-^^^^^^^^^^^^^^^^
+writing a log to file
+^^^^^^^^^^^^^^^^^^^^^

If our output file is a self-describing file format (i.e. a format that carries its metadata with it),
-then we would include our new log in the file metadata.
+then we can include our new log in the file metadata.
For instance, a common convention in weather and climate science is to include the command log
in the global history attribute of netCDF data files.
-If we were using the iris library (for instance)
+If we were using the `iris <https://scitools.org.uk/iris/docs/latest/>`_ library (for instance)
to read and write netCDF files using its cube data structure,
the process would look something like this:

@@ -107,22 +116,59 @@ the process would look something like this:
   >>> output_cube.attributes['history'] = my_log
   >>> iris.save(output_cube, 'output.nc')

-If the output file was not a self-describing format (e.g. a .png image),
+If the output file was not a self-describing format (e.g. ``output.png``),
then we can write a separate log file (i.e. a simple text file with the log in it)
using the ``write_log`` function.

.. code-block:: python

-   >>> outfile = 'output.png'
-   >>> logfile = 'output.log'
-   >>> cmdprov.write_log(logfile, my_log)
+   >>> cmdprov.write_log('output.log', my_log)

While it's not a formal requirement of the ``write_log`` function,
it's good practice to make the name of the log file exactly the same as the name of the output file,
-just with a different file extension such as .log or .txt.
+just with a different file extension (such as .log or .txt).
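As a quick sketch of that naming convention (``cmdprov`` and ``my_log`` are assumed to exist from the steps above):

.. code-block:: python

   >>> import os
   >>> outfile = 'output.png'
   >>> logfile = os.path.splitext(outfile)[0] + '.log'  # 'output.log'
   >>> cmdprov.write_log(logfile, my_log)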

+including logs from input files
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In order to capture the complete provenance of the output file,
+we need to include the command logs from the input files in our new log.
+We can do this using the ``infile_history`` keyword argument associated with the
+``new_log`` function.
+
+If our input files are in a self-describing format,
+then similar to the iris example above,
+we can extract the input file logs from the metadata of the input file(s):
+
+.. code-block:: python
+
+   >>> inlogs = {}
+   >>> inlogs['temperature_data.nc'] = temperature_cube.attributes['history']
+   >>> inlogs['salinity_data.nc'] = salinity_cube.attributes['history']
+   >>> my_log = cmdprov.new_log(infile_history=inlogs, git_repo='/path/to/git/repo/')
+   >>> print(my_log)
+   Tue Jun 26 11:24:46 2018: /Applications/anaconda/bin/python ocean_analysis.py temperature_data.nc salinity_data.nc output.nc (Git hash: 026301f)
+   History of temperature_data.nc: Tue Jun 26 09:24:03 2018: cdo daymean temperature_data.nc
+   History of salinity_data.nc: Tue Jun 26 09:22:10 2018: cdo daymean salinity_data.nc
+   Tue Jun 26 09:21:54 2018: ncatted -O -a standard_name,so,o,c,"ocean_salinity" salinity_data.nc
+
+If the input files aren't self-describing,
+you can use the ``read_log`` function to read the log files associated with the input data files
+(these log files may have been previously written using the ``write_log`` function):
+
+.. code-block:: python
+
+   >>> inlogs = {}
+   >>> inlogs['temperature_data.csv'] = cmdprov.read_log('temperature_data.log')
+   >>> inlogs['salinity_data.csv'] = cmdprov.read_log('salinity_data.log')
+   >>> my_log = cmdprov.new_log(infile_history=inlogs, git_repo='/path/to/git/repo/')
-input file history
-^^^^^^^^^^^^^^^^^^
-If we want a complete log...
+
+For scripts that take many input files,
+the resulting log files can become very long and unwieldy.
+To keep them manageable, think about ways to avoid repetition.
+For instance, if you've got one input file that contains data for the years 1999-2003
+and another equivalent file with data for 2004-2008,
+it's probably only necessary to include the log from the 1999-2003 file
+(because essentially identical data processing steps were taken to produce both files).
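A sketch of that idea (the file and cube names here are hypothetical):

.. code-block:: python

   >>> # both rainfall files were produced by essentially the same processing steps,
   >>> # so include the history of the 1999-2003 file as representative of both
   >>> inlogs = {'rainfall_1999-2003.nc': cube_1999_2003.attributes['history']}
   >>> my_log = cmdprov.new_log(infile_history=inlogs)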
