Define the Optimizer.py #7

Open · 9 of 16 tasks
MarcCote opened this issue May 22, 2015 · 3 comments

@MarcCote

Definitely, the notion of an optimizer is somewhat fuzzy, and so is the Optimizer class.

We should clarify the definitions we are going to use in the library.

Definitions (to be added in a wiki; a rough interface sketch follows the list)

  • Trainer: manages the optimization procedure of a model on a particular dataset.
  • Optimizer: optimizes a given objective function (e.g. a loss) by updating some parameters (the ones used in the computation of the objective function, i.e. the model's parameters).
  • UpdateRule: something that modifies a direction (often the gradient) used to update some parameters.
  • BatchScheduler: manages the batches (number of examples, order of the examples) given to the learn function.
  • Loss: responsible for outputting the Theano graph corresponding to the loss function to be optimized by the Optimizer. It takes a Model and a Dataset as inputs.
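
To make the split concrete, here is a rough interface sketch of how these five pieces could fit together. All class and method names below are placeholders for discussion, not the library's current API.

```python
# Rough interface sketch (hypothetical names, for discussion only).

class Trainer:
    """Manages the optimization procedure of a model on a particular dataset."""
    def __init__(self, optimizer, batch_scheduler):
        self.optimizer = optimizer
        self.batch_scheduler = batch_scheduler

    def train(self, nb_epochs):
        for _ in range(nb_epochs):
            for batch in self.batch_scheduler:
                self.optimizer.update(batch)

class Optimizer:
    """Optimizes an objective (a Loss) by updating the model's parameters."""
    def __init__(self, loss):
        self.loss = loss

    def update(self, batch):
        raise NotImplementedError("Implemented by subclasses (SGD, L-BFGS, ...).")

class UpdateRule:
    """Modifies a direction (often the gradient) used to update the parameters."""
    def apply(self, direction):
        raise NotImplementedError

class BatchScheduler:
    """Decides the number of examples per batch and the order they are visited in."""
    def __iter__(self):
        raise NotImplementedError

class Loss:
    """Builds the Theano graph of the objective from a Model and a Dataset."""
    def __init__(self, model, dataset):
        self.model = model
        self.dataset = dataset
```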

Some questions

  • What should an optimizer take as inputs? The loss function to optimize, now represented by a Loss class.
  • What are the different kinds of optimizers? (bold: already available in the library)
    • Zeroth order (needs only the function value) (not really used in practice)
    • First order (needs only the gradient):
      • GD, SGD, Adam, Adadelta, ADAGRAD, NAG, SVRG, SDCA, SAG, SAGVR, ...
    • Quasi-Newton (needs only the gradient, builds a Hessian approximation):
      • L-BFGS, ...
    • Second order (needs the gradient and the Hessian, or a Hessian-vector product):
      • Newton, Newton-Trust Region, Hessian-Free, ARC, ...
  • Should an optimizer be agnostic to the notion of batch, batch size, batch ordering, etc.? Yes, we created a BatchScheduler for that.
  • What do we call ADAGRAD, Adam, Adadelta, etc.? Right now those are called UpdateRule.
  • Should we trivially allow multiple UpdateRules, or create a special UpdateRule that combines them as the user wants? Right now, we blindly apply them one after the other.
  • Is SGD really something in our framework? Yes, otherwise we would need a SMART-optim module.
  • Is L-BFGS simply what we call an update rule? No. It requires the current and past parameters as well as the past gradients.
  • Can using the Hessian (e.g. in Newton's method) be seen as an update rule? No, using exact second-order information should be done in a dedicated subclass of Optimizer, which would then call the necessary method of the model (e.g. hessian or Rop, the Hessian-vector product).
  • Should the Optimizer be the one computing nb_updates_per_epoch? No, a BatchScheduler should do it (see the sketch after this list).
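
Here is a minimal sketch of what such a BatchScheduler could look like, assuming it owns the batch size, the ordering and nb_updates_per_epoch (the class name and constructor arguments are made up for illustration):

```python
import numpy as np

class MiniBatchScheduler:
    """Hypothetical BatchScheduler: manages batch size, ordering and
    nb_updates_per_epoch so the Optimizer never has to know about batches."""

    def __init__(self, dataset_size, batch_size, shuffle=True, seed=1234):
        self.dataset_size = dataset_size
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = np.random.RandomState(seed)

    @property
    def nb_updates_per_epoch(self):
        # Ceiling division: the last batch may be smaller.
        return int(np.ceil(self.dataset_size / float(self.batch_size)))

    def __iter__(self):
        indices = np.arange(self.dataset_size)
        if self.shuffle:
            self.rng.shuffle(indices)
        for start in range(0, self.dataset_size, self.batch_size):
            yield indices[start:start + self.batch_size]
```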

Suggestions

  • We could define a Loss class that would be provided to the optimizer. This class could know about the model and the dataset and provide the necessary symbolic variables (maybe it should build the givens for the Theano function); see the sketch after this list.
  • All calls to update_rules.apply currently in SGD should be moved inside Optimizer. The same goes for the calls to param_modifier.apply.
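
A minimal sketch of such a Loss class, assuming the model exposes a get_output method and the dataset stores its inputs and targets in Theano shared variables (both are assumptions, not the current API):

```python
import theano.tensor as T

class NegativeLogLikelihoodLoss:
    """Hypothetical Loss: builds the symbolic objective from a model and a
    dataset, plus the `givens` needed to compile the Theano update function."""

    def __init__(self, model, dataset):
        self.model = model
        self.dataset = dataset
        self.input = T.matrix('input')
        self.target = T.ivector('target')

    def build_graph(self):
        # Mean negative log-likelihood of the target class.
        probs = self.model.get_output(self.input)  # assumed model method
        return -T.mean(T.log(probs)[T.arange(self.target.shape[0]), self.target])

    def build_givens(self, batch_indices):
        # Substitute the symbolic inputs by slices of the shared dataset,
        # so the compiled function only needs the batch indices.
        return {self.input: self.dataset.inputs_shared[batch_indices],
                self.target: self.dataset.targets_shared[batch_indices]}
```
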
@MarcCote MarcCote changed the title from "Rethink the Optimizer" to "Define the Optimizer.py" May 22, 2015
MarcCote referenced this issue in MarcCote/smartmodels May 25, 2015
@ASalvail

Some questions

  • Should an optimizer be agnostic to the notion of batch, batch size, batch ordering, etc.? Yes, we created a BatchScheduler for that. An algorithm such as SAG would need to keep a dataset-like memory of its gradients. It might require the optimizer to have access to the current example id for an efficient implementation.
  • What do we call ADAGRAD, Adam, Adadelta, etc.? Right now those are called UpdateRule. Optimizer. If you want to clip a gradient, that's an UpdateRule. You want ADAGRAD? It should subclass SGD (or be its own optimizer).

@MarcCote

@ASalvail I moved some of your comments into the original post (because it seems I can do that!).

I'm not familiar with SAG. Knowing the example is not enough; you need its id because you keep a history of the past gradients for each example. Is that it?

I think the term UpdateRule is unclear and refers to many different parts of the optimization process. This is probably why we have a hard time drawing the line between UpdateRule and Optimizer. A couple of months ago @mgermain proposed the terms DirectionModifier and ParamModifier. To me, we should be able to combine multiple DirectionModifiers, and I don't see that working with ADAGRAD, Adam and Adadelta.

For instance, some reusable and combinable DirectionModifiers (a rough sketch of two of them follows this list):

  • Learning rate (changes the direction length)
  • Decreasing learning rate
  • Momentum
  • Direction/gradient clipping
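
A rough sketch of two such modifiers, assuming a DirectionModifier receives a dict mapping each parameter (a Theano shared variable) to its symbolic direction (the interface is hypothetical):

```python
import theano.tensor as T

class ConstantLearningRate:
    """Hypothetical DirectionModifier: scales every direction by a fixed factor."""
    def __init__(self, lr=0.01):
        self.lr = lr

    def apply(self, directions):
        return {param: self.lr * d for param, d in directions.items()}

class DirectionClipping:
    """Hypothetical DirectionModifier: rescales a direction whose norm is too large."""
    def __init__(self, threshold=1.0):
        self.threshold = threshold

    def apply(self, directions):
        clipped = {}
        for param, d in directions.items():
            norm = T.sqrt(T.sum(d ** 2))
            clipped[param] = T.switch(norm > self.threshold,
                                      d * self.threshold / norm, d)
        return clipped
```

Because both take and return the same kind of dict, they can be chained in any order.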

So, what I have in mind for the core of a first-order optimizer (e.g. SGD) is something that looks like this (a sketch follows the list):

  1. Get an initial descent direction (usually the gradient)
  2. Apply some DirectionModifiers
  3. Update parameters
  4. Apply some ParamModifiers
  5. Rinse and repeat
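
In code, the core could look roughly like this. It is only a sketch: loss.build_graph() and loss.model.parameters are assumed from the Loss suggestion above, and the returned update list would be handed to theano.function.

```python
import theano.tensor as T

class SGD:
    """Sketch of the proposed first-order core: get a direction, chain the
    DirectionModifiers, update the parameters, then chain the ParamModifiers."""

    def __init__(self, loss, direction_modifiers=(), param_modifiers=()):
        self.loss = loss
        self.direction_modifiers = list(direction_modifiers)
        self.param_modifiers = list(param_modifiers)

    def build_updates(self):
        objective = self.loss.build_graph()
        params = self.loss.model.parameters  # list of shared variables (assumed)

        # 1. Initial descent direction: the negative gradient.
        directions = {p: -T.grad(objective, p) for p in params}

        # 2. Apply the DirectionModifiers one after the other.
        for modifier in self.direction_modifiers:
            directions = modifier.apply(directions)

        # 3. Update the parameters.
        new_params = {p: p + directions[p] for p in params}

        # 4. Apply the ParamModifiers (e.g. norm constraints on the weights).
        for modifier in self.param_modifiers:
            new_params = modifier.apply(new_params)

        # 5. Rinse and repeat: calling the compiled Theano function built from
        #    these updates, once per batch, is the repeat part.
        return [(p, new_params[p]) for p in params]
```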

I can see ADAGRAD, Adam, Adadelta being called optimizers. They would inherit from SGD (or maybe a new class FirstOrderOptimizer) and use a custom DirectionModifier class (that may or may not be reusable).

So users would only have to specify --optimizer ADAGRAD to use it. In addition, users who want to do something funky could still specify the ADAGRAD optimizer and provide additional DirectionModifiers.
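
Building on the SGD sketch above, ADAGRAD could then be a thin subclass that prepends its own (non-reusable) DirectionModifier. Again, all names here are hypothetical:

```python
import numpy as np
import theano
import theano.tensor as T

class AdaGradDirection:
    """Hypothetical ADAGRAD-specific DirectionModifier: per-parameter scaling
    by the accumulated squared gradients."""
    def __init__(self, lr=0.01, eps=1e-6):
        self.lr = lr
        self.eps = eps
        # Accumulator updates; the optimizer would also have to apply these.
        self.extra_updates = []

    def apply(self, directions):
        scaled = {}
        for param, d in directions.items():
            acc = theano.shared(np.zeros(param.get_value().shape, dtype=param.dtype))
            new_acc = acc + d ** 2
            self.extra_updates.append((acc, new_acc))
            scaled[param] = self.lr * d / (T.sqrt(new_acc) + self.eps)
        return scaled

class AdaGrad(SGD):
    """ADAGRAD exposed as its own optimizer: SGD plus a built-in modifier.
    DirectionModifiers supplied by the user are applied afterwards."""
    def __init__(self, loss, lr=0.01, direction_modifiers=(), param_modifiers=()):
        modifiers = [AdaGradDirection(lr)] + list(direction_modifiers)
        super(AdaGrad, self).__init__(loss, modifiers, param_modifiers)
```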

What do you think?

@ASalvail

@MarcCote That's exactly how SAG proceeds: it stores the gradient of every example in order to get its gradient-average computation right.
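
For reference, a toy illustration of SAG's bookkeeping in plain NumPy (nothing to do with the library's API): one stored gradient per example, hence the need for the example id.

```python
import numpy as np

n_examples, n_params = 1000, 50
stored_grads = np.zeros((n_examples, n_params))  # one gradient per example
grad_sum = np.zeros(n_params)                    # running sum of stored gradients

def sag_update(params, example_id, grad, lr=0.01):
    global grad_sum
    # Replace this example's old gradient by the new one in the running sum.
    grad_sum += grad - stored_grads[example_id]
    stored_grads[example_id] = grad
    # Step along the average of all stored gradients.
    return params - lr * grad_sum / n_examples
```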

Those modifiers could be useful as building blocks for the optimizer, but I don't think it'd be useful to use them outside of it. If you want a fancy new optimizer, subclass it.
