From f33bb72ad8158b65c3d3c3cf560d0756b00cff43 Mon Sep 17 00:00:00 2001
From: David Belanger
Date: Thu, 5 Jan 2017 11:46:51 -0500
Subject: [PATCH] updates to docs

---
 Applications.md             | 65 +++++++++++++++++++++++++++++--------
 Denoising.md                | 33 ++++++++++---------
 MultiLabelClassification.md | 15 ++++++---
 README.md                   | 39 +++++++++++++---------
 4 files changed, 103 insertions(+), 49 deletions(-)

diff --git a/Applications.md b/Applications.md
index ed6023c..eaf06de 100644
--- a/Applications.md
+++ b/Applications.md
@@ -1,26 +1,65 @@
-# The SPENProblem API
+# Implementing New SPEN Applications
-SPEN applications, such as SPENMultilabelClassification and SPENDenoise extend the SPENProblem class. You will need to implement the abstract methods defined towards the top of SPENProblem.lua for data loading, preprocessing, evaluation, etc. You will also need to make sure that your new class contains the following members:
+See main.lua for examples of various SPEN applications. The SPEN code is quite modular: the only thing that needs to be implemented is the load_problem method called in main.lua, which returns the following application-specific items.
-`problem.inference_net`: the energy network E_x(y), but using pre-computed features rather than the raw value of x. This takes {labels, features} and returns a single number per minibatch element
+`model` is an object that obeys the SPEN API, described below.
-`problem.fixed_features_net`: Feature mapping F(x). This may be pretrained using classification, or loaded from file. If the training mode is 'clampFeatures', then we don't update its parameters, and don't even backprop through it during training.
-
-`problem.learned_features_net`: (Optional) The overall feature mapping is fixed_features_net followed by learned_features_net. These features are learned even in 'clampFeatures' mode.
+`y_shape` is a table containing the shape of y, the iterates for gradient-based SPEN optimization.
-`problem.initialization_net`: This network takes {x,F(x)} and returns an initial guess y_0 for the labels. The reason it takes the raw input x is that this might be important for getting the size of the inputs when the problem can have variable-sized inputs. See SPENProblem.lua to see how this interacts with the --initAtLocalPrediction flag. Generally, you don't implement initialization_net directly. Instead, you decide to init the labels with the outputs of a local classifier, or initialize them to some fixed hard-coded value (eg. 0).
+`evaluator_factory` is a function that takes a batcher and a soft predictor and returns an object that implements an evaluate(timestep) method used for evaluating and logging performance.
-`problem.iterate_transform`: problem.iterate_transform --Everything is set up for unconstrained optimization. This maps things onto the constrain set at the end of optimization. For example, it converts logits to probabilities. Set to Identity() if you don't need a transformation.
+`preprocess_func` is a function that takes (y,x,num_examples) and returns optionally preprocessed versions of the data (e.g., expanding int indices for y to a one-hot representation). If preprocess_func is nil, then no such transformation is applied.
+`train_batcher` is an object that provides two methods: get_iterator() and get_ongoing_iterator(). The first is an iterator typically used for test data, which returns {nil,nil,nil} when it reaches the end of the data. The second is an infinite iterator (e.g., it loops back to the beginning of the dataset once it reaches the end). Each method returns a lua iterator that can be called and will return {y, x, num_actual_examples}. The outer dimension of y and x is always expected to be params.batchsize. If there isn't enough data to fill a tensor of this size, the batcher may zero-pad the data, in which case num_actual_examples refers to the number of actual examples. This is useful at test time to make sure that over-inflated accuracy numbers are not computed on the padding.
+
+`test_batcher` is a similar batcher, but for test data.
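+
+A hedged sketch of what a load_problem implementation might look like. The names MyApplicationSPEN, MyBatcher, MyEvaluator, and the params fields used below are placeholders, not code in this repository, and the exact return signature should be checked against main.lua:
+
+```lua
+local function load_problem(params)
+  local config = {y_shape = {params.batchsize, params.label_dim}} -- application-specific config (assumed fields)
+
+  local model = MyApplicationSPEN(config, params)        -- an object obeying the SPEN API described below
+  local y_shape = config.y_shape
+
+  local train_batcher = MyBatcher(params.train_file, params.batchsize)
+  local test_batcher  = MyBatcher(params.test_file, params.batchsize)
+
+  local preprocess_func = nil                            -- e.g., expand int label indices to one-hot
+
+  local evaluator_factory = function(batcher, soft_predictor)
+    return MyEvaluator(batcher, soft_predictor)          -- must implement evaluate(timestep)
+  end
+
+  return model, y_shape, evaluator_factory, preprocess_func, train_batcher, test_batcher
+end
+```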
-`problem.iterateRange`: A table of 2 numbers. If we are doing projected gradient descent for test-time optimization, this is the upper and lower bound on the value for each prediction variable, eg. {0,1} or {0,255}.
-`problem.input_features_pretraining_net`: This is a simple feed-forward classifier, often used for pretraining the features used by the inference_net. Also, for problems where the energy function has terms analogous to the 'unary potentials' of a graphical model, this classifier may provide these per-label terms.
+## The SPEN API
-`problem.structured_training_loss.loss_criterion`: The training criterion. Used for pretraining the 'unaries' and also for the RNN net.
+SPEN applications extend the SPEN class, given in model/SPEN.lua. See model/ for various examples.
-###Block-Structured Y
-For some problems, there are multiple blocks of optimization variables. This was supported in earlier versions of SPEN, but not anymore. If you need this functionality, let me know and maybe we can reboot it.
+
+
+### Methods that SPEN Subclasses Must Implement
+
+`SPEN:features_net()` returns a network that takes in x and returns features F(x).
+
+`SPEN:unary_energy_net()` returns a network that takes F(x) and returns a set of 'local potentials' such that the local energy term is given by the inner product between the output of this network and self:convert_y_for_local_potentials(y). This network is used as a term in the SPEN energy. It is also used as the local classifier for pretraining the features, and optionally as a means to initialize a guess for y0 when performing iterative optimization to form predictions.
+
+`SPEN:convert_y_for_local_potentials(y)` takes an nngraph node and returns an nngraph node used for taking inner products with the 'local potentials' of the unary energy network. Typically, this can be set to the identity.
+
+`SPEN:global_energy_net()` returns a network that takes {y,F(x)} and returns a number. The total SPEN energy is the sum of this and the unary_energy_net, where the global term is weighted by config.global_term_weight.
+
+`SPEN:uniform_initialization_net()` returns a network that takes no inputs and returns an initial guess y0 for iterative optimization. A default implementation, using nn.Constant, is provided in SPEN.lua. Only override this if necessary.
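+
+A hedged, minimal sketch of a subclass implementing these methods. The class name, constructor, and the self.config fields (input_dim, feature_dim, label_dim, energy_hid_dim) are illustrative assumptions; see model/MLCSPEN.lua and model/DepthSPEN.lua for real implementations:
+
+```lua
+require 'nn'
+require 'nngraph'
+
+-- assumes model/SPEN.lua has been required so that the 'SPEN' torch class is registered
+local MySPEN, parent = torch.class('MySPEN', 'SPEN')
+
+function MySPEN:features_net()
+  -- F(x): any feature extractor; here one hidden layer with a smooth nonlinearity.
+  return nn.Sequential()
+    :add(nn.Linear(self.config.input_dim, self.config.feature_dim))
+    :add(nn.SoftPlus())
+end
+
+function MySPEN:unary_energy_net()
+  -- Maps F(x) to one 'local potential' per label; the local energy is their inner product with y.
+  return nn.Linear(self.config.feature_dim, self.config.label_dim)
+end
+
+function MySPEN:convert_y_for_local_potentials(y)
+  return y -- the identity is typically sufficient
+end
+
+function MySPEN:global_energy_net()
+  -- Takes {y, F(x)} and returns a single number per minibatch element.
+  local y = nn.Identity()()
+  local features = nn.Identity()()
+  local joined = nn.JoinTable(2)({y, features})
+  local hidden = nn.SoftPlus()(nn.Linear(self.config.label_dim + self.config.feature_dim, self.config.energy_hid_dim)(joined))
+  local energy = nn.Linear(self.config.energy_hid_dim, 1)(hidden)
+  return nn.gModule({y, features}, {energy})
+end
+```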
+
+## SPEN Members and Methods that Outside Code Accesses
+`spen.initialization_network` takes x and returns a guess for y for iterative optimization. May or may not be the same as the classifier network.
+
+`spen.features_network` takes x and returns features F(x).
+
+`spen.energy_network` takes {y,F(x)} and returns an energy value (to be minimized with respect to y).
+
+`spen.classifier_network` takes x and returns a guess for y. This is used for pretraining.
+
+`spen.global_potentials_network` takes {y,F(x)} and returns the value of the global energy terms.
+
+`spen:set_feature_backprop(value)` takes a boolean value. If value is true, then no backprop will be performed through the features network during training. This prevents the parameters of the features network from being updated.
+
+`spen:set_unary_backprop(value)` similarly prevents any updates to both the features network and the local potentials, which are a term in the energy function and may also be used by the initialization_network.
+
+### Config options for SPEN
+
+The SPEN constructor takes two tables, config and params, where the first is for application-specific options and the second contains general options for the entire SPEN software package.
+`params.use_cuda` whether to use the GPU.
+
+`params.use_cudnn` whether to use cudnn implementations for certain nn layers.
+
+`params.init_at_local_prediction` whether gradient-based prediction should initialize y0 uniformly or using the local classifier network.
+
+`config.y_shape` a table giving the shape of y, the input to the energy function.
+
+`config.logit_iterates` whether gradient-based optimization of the energy is done in logit space or in normalized space.
+
+`config.global_term_weight` the weight to place on global energy terms vs. local energy terms when composing the full energy function.
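+
+A hedged example of what these two tables might look like for a small multi-label problem. The values are purely illustrative; see flags/*.lua and main.lua for how they are actually populated:
+
+```lua
+local params = {
+  use_cuda = true,                 -- run on the GPU
+  use_cudnn = false,               -- use cudnn implementations for certain nn layers
+  init_at_local_prediction = true, -- initialize y0 from the local classifier rather than uniformly
+  batchsize = 32,                  -- general option referenced by the batcher contract above
+}
+
+local config = {
+  y_shape = {32, 25},              -- e.g., batchsize x number of labels
+  logit_iterates = true,           -- optimize the energy in logit space rather than normalized space
+  global_term_weight = 1.0,        -- weight on global vs. local energy terms
+}
+```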
diff --git a/Denoising.md b/Denoising.md
index f32b641..62cfab2 100644
--- a/Denoising.md
+++ b/Denoising.md
@@ -1,30 +1,26 @@
# Image Denoising with SPENs
-A SPEN architecture for Image Denoising is implemented in Denoising.lua, with some general functionality added in SPENProblem.lua.
+To run a self-contained example of image denoising, cd to the base directory for SPEN, and then execute:
-Let x be the input blurry image and y be the sharpened image we seek to predict. We recover y by MAP inference, where we find y that maximizes P(x | y ) P(y). We assume a Gaussian noise model, so that P(x|y) is scaled mean squared error. There are various parametrizations for the prior distribution P(y). Many previous works have employed a 'field of experts' model: P(y) \propto exp(\sum_i \sum_xy w_i \rho (f_i(x,y))), where f_1(\cdot,\cdot), \ldots, f_k(\cdot,\cdot) are a set of localized linear filter responses and \rho is a nonlinearity.
-
-Early work estimated the the weights w_i and the filters by maximizing the likelihood of a dataset of sharp images. Inference in the field of experts model is intractable, and thus practitioners employed approximate methods such as contrastive divergence.
-
-An alternative line of work estimated the parameters using end-to-end approaches, by applying automatic differentiation to the procedure of iteratively solving the MAP objective, for a fixed number of iterations.
+`wget https://www.cics.umass.edu/~belanger/depth_denoise.tar.gz`
+`tar -xvf depth_denoise.tar.gz`
+`depth_cmd.sh`
-
-We employ this end-to-end approach, but consider substantially more expressive prior distributions over y than a field of experts from linear filters. Namely, we consider an arbitrary deep network: P(y) \propto exp(D(y)). We also support functionality where D can have terms that operate in the frequency domain.
+This downloads a preprocessed version of a small amount of the depth denoising data from this [paper](http://www.cs.toronto.edu/~slwang/proximalnet.pdf), made available [here](https://bitbucket.org/shenlongwang/), and then fits a SPEN. The associated SPEN architecture is defined in model/DepthSPEN.lua.
+Note that this isn't a traditional denoising task where we assume a parametric noise model that could be used to produce training pairs of noisy and clean images.
-### Data Processing
-You will need a large number of sharp images, which you can then add noise to using some simple code. The denoising code assumes a Gaussian likelihood, so to avoid model mis-specification you should add white noise. However, feel free to use alternative image corruptions. Some helpful utility code is:
-
-`scripts/im_pairs_to_torch.lua `
+Let x be the input blurry image and y be the sharpened image we seek to predict. We recover y by MAP inference, where we find the y that maximizes P(x|y)P(y). We assume a Gaussian noise model, so that -log P(x|y) is a scaled mean squared error. There are various parametrizations for the prior distribution P(y). Many previous works have employed a 'field of experts' model: P(y) \propto exp(\sum_i \sum_{x,y} w_i \rho(f_i(x,y))), where f_1(\cdot,\cdot), \ldots, f_k(\cdot,\cdot) are a set of localized linear filter responses and \rho is a nonlinearity.
-This depends on the torch gm package, will requires you to install graphicsmagick. The images can be any format loadable by graphicsmagick.
Early work estimated the weights w_i and the filters by maximizing the likelihood of a dataset of sharp images. Inference in the field of experts model is intractable, and thus practitioners employed approximate methods such as contrastive divergence.
-Each line of file_list is of the form `\s`
An alternative line of work estimated the parameters using end-to-end approaches, by applying automatic differentiation to the procedure of iteratively solving the MAP objective, for a fixed number of iterations.
+We employ this end-to-end approach, but consider substantially more expressive prior distributions over y than a field of experts built from linear filters. Namely, we consider an arbitrary deep network: P(y) \propto exp(D(y)) (a toy sketch of the resulting energy appears at the end of this file).
### Related Work
@@ -40,6 +36,11 @@ Each line of file_list is of the form `\s`
> Justin Domke. "Generic Methods for Optimization-Based Modeling." AISTATS 2012.
-### Applications
+### Data Processing
+You will need a large number of pairs of noisy and clean images. Some helpful utility code is:
-Besides providing an effective image denoising network, this learning procedure produces a standalong network P(y), which returns the prior log-probability of a given image. This may be useful in various downstream tasks. You could even sample from the space of images using, for example, Hamiltonian Monte Carlo.
+`scripts/im_pairs_to_torch.lua `
+
+This depends on the torch gm package, which requires you to install graphicsmagick. The images can be any format loadable by graphicsmagick.
+
+Each line of file_list is of the form `\s`
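+
+A hedged sketch of the energy implied by the model above, i.e. the negative log-posterior -log P(y|x) = data_weight * ||y - x||^2 - D(y) + const. The function and argument names are illustrative only; the actual architecture used for the depth data is in model/DepthSPEN.lua:
+
+```lua
+-- prior_net is any smooth deep network D(y) returning a single number; data_weight plays the
+-- role of the inverse noise variance in the Gaussian likelihood.
+local function denoising_energy(y, x, prior_net, data_weight)
+  local data_term = data_weight * torch.sum(torch.pow(y - x, 2)) -- Gaussian likelihood => scaled MSE
+  local prior_term = prior_net:forward(y)[1]                     -- deep prior D(y)
+  return data_term - prior_term
+end
+```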
diff --git a/MultiLabelClassification.md b/MultiLabelClassification.md
index 70c6d18..6063842 100644
--- a/MultiLabelClassification.md
+++ b/MultiLabelClassification.md
@@ -1,15 +1,22 @@
# Multi-Label Classification with SPENs
-The SPEN architecture for MLC is described in detail in our [paper](https://people.cs.umass.edu/~belanger/belanger_spen_icml.pdf). It is implemented in SPENMultiLabelClassification.lua, with some general functionality added in SPENProblem.lua. See main.lua for a description of the command line arguments.
+To run a self-contained example of multi-label classification, cd to the base directory for SPEN, and then execute:
-SPENMultilabelClassification calls MultiLabelEvaluation.lua, which computes F1 score. This depends on a threshold, between 0 and 1, for converting soft decisions to hard decisions. If you use the -predictionThresh argument (eg., when evaluating on your test set), then we use a single threshold. Otherwise, it tries a bunch of thresholds and finds the best F1.
+`wget http://www.cics.umass.edu/~belanger/icml_mlc_data.tar.gz`
+`tar -xvf icml_mlc_data.tar.gz`
-See ml_cmd.sh for an example script for running the code.
+`sh mlc_cmd.sh`
+
+The SPEN architecture for MLC is described in detail in our [paper](https://people.cs.umass.edu/~belanger/belanger_spen_icml.pdf). It is implemented in MLCSPEN.lua. See main.lua for the load_problem implementation for MLC. This also instantiates data loading, evaluation, etc.
+
+We evaluate using evaluate/MultiLabelEvaluation.lua, which computes F1 score. This depends on a threshold between 0 and 1 for converting soft decisions into hard decisions. If you use the -predictionThresh argument (e.g., when evaluating on your test set), then a single threshold is used. Otherwise, the code tries a range of thresholds and reports the best F1 (a toy sketch of this search appears at the end of this file).
+
+Note that our new code does not reproduce the configuration of the ICML experiments. The evaluation is the same, but the training method is substantially different. Even if you train with an SSVM loss, there are various configuration differences (e.g., how we detect convergence of the inner prediction problem).
### Data Processing
-It will be useful to use the conversion script
+For new data, it will be useful to use the conversion script
`scripts/ml2torch.lua `
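+
+A toy sketch of the threshold search mentioned above. This is illustrative only; the actual logic lives in evaluate/MultiLabelEvaluation.lua and may differ, e.g. in how F1 is aggregated across examples:
+
+```lua
+local function best_f1_threshold(soft_predictions, ground_truth, thresholds)
+  local best_f1, best_thresh = -1, nil
+  for _, t in ipairs(thresholds) do
+    local hard = soft_predictions:ge(t):typeAs(ground_truth) -- soft scores -> 0/1 decisions
+    local tp = torch.sum(torch.cmul(hard, ground_truth))
+    local precision = tp / math.max(torch.sum(hard), 1)
+    local recall = tp / math.max(torch.sum(ground_truth), 1)
+    local f1 = 2 * precision * recall / math.max(precision + recall, 1e-8)
+    if f1 > best_f1 then best_f1, best_thresh = f1, t end
+  end
+  return best_thresh, best_f1
+end
+```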
diff --git a/README.md b/README.md
index 9bf46d5..567f27f 100644
--- a/README.md
+++ b/README.md
@@ -1,41 +1,48 @@
-# Structured Prediction Energy Network Training Code
+# SPEN Code Version 2
Structured Prediction Energy Networks (SPENs) are a flexible, expressive approach to structured prediction. See our paper:
[David Belanger](https://people.cs.umass.edu/~belanger/) and [Andrew McCallum](https://people.cs.umass.edu/~mccallum/pubs.html) "Structured Prediction Energy Networks." ICML 2016. [link](https://people.cs.umass.edu/~belanger/belanger_spen_icml.pdf)
-This project contains [torch](http://torch.ch/) code for SPENs. We provide code for two use cases: multi-label classification and image denoising. We also provide a generic API for which it should be easy to prototype additional applications. If you would like to do so, feel free to contact David Belanger for advice.
+
+## Updates in Version 2
Basically everything. The code is substantially more modular: it now provides proper abstractions between models, prediction methods, training losses, etc. We have added a considerable number of tests, added back a structured SVM training method (as was used in the ICML paper), and added examples for sequence tagging. Algorithmically, there are a number of improvements, including backpropagation through a broader selection of optimization methods, dynamic unrolling of the computation graph for iterative prediction (to account for variable numbers of iterations), and explicit regularization to encourage the iterative prediction to converge quickly.
-## New End-to-End Training Method
-The ICML paper trains the energy network using a structured SVM (SSVM) loss. As we discuss in the paper, this approach does not gracefully handle situations where inexact optimization is performed in the inner loop of training. Since our energy functions are non-convex with respect to the output labels, this is a key concern in both in theory and practice.
+## Differences Between this Code and the ICML Paper Code
-In response, we have recently switched to more straightforward, 'end-to-end' training approach, based on:
+The ICML paper trains the energy network using a structured SVM (SSVM) loss. As we discuss in the paper, this approach does not gracefully handle situations where inexact optimization is performed in the inner loop of training. Since our energy functions are non-convex with respect to the output labels, this is a key concern both in theory and in practice. In response, we have recently switched to a more straightforward 'end-to-end' training approach, based on:
[Justin Domke](http://users.cecs.anu.edu.au/~jdomke/) "Generic Methods for Optimization-Based Modeling." AISTATS 2012. [link](http://www.jmlr.org/proceedings/papers/v22/domke12/domke12.pdf).
Here, we construct a long computation graph corresponding to running gradient descent on the energy function for a fixed number of iterations. With this, prediction amounts to a feed-forward pass through this recurrent neural network, and training can be performed using backprop. There are some technical details regarding how to backpropagate through the process of taking gradient steps, and we employ Domke's finite differences technique. The advantage of this end-to-end approach is that we directly optimize the empirical loss: the computation graph used at train time is an exact implementation of the gradient-based inference (for a fixed number of steps) that we use at test time.
-The only restriction on the family of energy functions optimizable with this approach vs. the structured SVM approach is that we need our energy function to be smooth (with respect to both the parameters and the inputs). Rather than using ReLUs, we recommend using a SoftPlus approximation.
+The only restriction on the family of energy functions optimizable with this approach vs. the structured SVM approach is that we need our energy function to be smooth (with respect to both the parameters and the inputs); rather than using ReLUs, we recommend using a SoftPlus. An ironic downside of the end-to-end approach fitting the training data much better is that it is more prone to overfitting, so it does not necessarily yield better test performance on the relatively small multi-label classification datasets we considered in the ICML paper.
-Our end-to-end approach is much more simple code-wise than the approach in the ICML paper and is less sensitive to hyperparameters. For example, the SSVM is very sensitive to stopping criteria for the inner optimization problem. End-to-end training also produces substantially better training losses on our multi-label classification data. In response, we are not releasing code for SSVM training. An ironic downside of the end-to-end approach fitting the training data much better is that it is more prone to overfitting. Therefore, it does not necessarily generate better test performance on the relatively small multi-label classification datasets we considered in the ICML paper.
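+
+A toy sketch of the unrolled gradient-based prediction described above. This is illustrative only; the real implementation is optimize/UnrolledGradientOptimizer.lua, which also supports momentum, line search, logit-space iterates, and backpropagation through the unrolled steps:
+
+```lua
+-- energy_net takes {y, features} and returns one energy value per minibatch element.
+local function predict(energy_net, y0, features, num_iters, step_size)
+  local y = y0:clone()
+  for t = 1, num_iters do
+    local energy = energy_net:forward({y, features})
+    local grad_output = energy:clone():fill(1)                      -- d(energy)/d(energy)
+    local grad = energy_net:backward({y, features}, grad_output)[1] -- dE/dy at the current iterate
+    y:add(-step_size, grad)                                         -- plain gradient step, no projection
+  end
+  return y
+end
+```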
+
+## Useful Library Code
We provide various bits of stand-alone code that might be useful in other applications. See their respective files for documentation.
-## Code Dependencies
-You'll need to install the following torch packages, which can all be installed using 'luarocks install X:' torch, nn, cutorch, cunn, optim, nngraph.
+`optimize/UnrolledGradientOptimizer.lua` takes an energy network E(y,x) and a network for guessing an initial value y0, and constructs a recurrent neural network that performs gradient-based minimization of the energy with respect to y. It provides various options for doing gradient descent with momentum, line search, etc.
-The 'deep mean-field' part of the code also depends on the autograd package. If you're doing stuff with images, we recommend configuring cudnn and using the -cudnn flag to main.lua.
+`optimize/GradientDirection.lua` takes an energy network E(y,x) and returns an nn module whose forward pass returns the gradient of the energy with respect to y. In the backward pass, the required Hessian-vector product is approximated using finite differences (a toy sketch of this trick appears at the end of this file).
-Finally, we use various utility functions from David's [torch-util](https://github.com/davidBelanger/torch-util) project. You will need to clone torch-util such that its relative path to this project is ../torch-util.
+`infer1d/*.lua` and `model/ChainCRF.lua` provide useful code for inference and learning in linear-chain CRFs. See the various tests for examples of how to use these.
## Applications
-We are releasing code for two applications: [Multi-Label Classification](MultiLabelClassification.md) and [Image Denoising](Denoising.md).
+We are releasing code for three applications: [Multi-Label Classification](MultiLabelClassification.md), [Sequence Tagging](Tagging.md), and [Image Denoising](Denoising.md). All of these contain quick-start scripts.
It is straightforward to implement new structured prediction applications using our code. See our [API](Applications.md) documentation.
-## Options
-See the top of main.lua for a long list of command line options and their explanations.
+## Quick Start
+We recommend running the sequence tagging example `quick_start_tagging.sh`. This uses main.lua, which has lots of functionality. For a simpler example, you can use test/test_chain_spen_learn.lua.
+
+## Code Dependencies
+You'll need to install the following torch packages, which can all be installed using 'luarocks install X': torch, nn, cutorch, cunn, optim, nngraph. If you're working with images, we recommend configuring cudnn and using the -cudnn flag to main.lua.
-## Coding Style
-Some time this year I adopted a terrible habit of interweaving camelCase and separated\_by\_underscores coding styles. I apologize. I will fix this at some point.
+Finally, we use various utility functions from David's [torch-util](https://github.com/davidBelanger/torch-util) project. You will need to clone torch-util such that its relative path to this project is ../torch-util.
+
+
+## Options
+See ./flags/*.lua for the various command line options and their explanations. See the example applications described above for how some of the flags are used.
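+
+As a footnote to the `optimize/GradientDirection.lua` entry above, here is a toy sketch of the finite-difference approximation it refers to, which avoids forming the Hessian of the energy explicitly. The function grad_wrt_y is an assumed helper returning dE/dy, not part of this repository:
+
+```lua
+-- Approximates the Hessian-vector product H(y) * v, where H is the Hessian of E(y,x) with
+-- respect to y, using one extra gradient evaluation at a perturbed point.
+local function hessian_vector_product(grad_wrt_y, y, v, eps)
+  eps = eps or 1e-6
+  local g0 = grad_wrt_y(y)            -- dE/dy at y
+  local g1 = grad_wrt_y(y + v * eps)  -- dE/dy at y + eps * v
+  return (g1 - g0):div(eps)
+end
+```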