Nevergrad for machine learning: which optimizer should I use?

Let us assume that you have defined an objective function as in:

def myfunction(lr, num_layers, arg3, arg4, other_anything):
    ...
    return -accuracy  # something to minimize

You should define how it must be instrumented, i.e. which arguments you want to optimize and on which space they are defined. If you have both continuous and discrete parameters and a good initial guess, a simple starting point is to use OrderedDiscrete for all discrete variables (yes, even if they are not ordered), Array for all your continuous variables, and PortfolioDiscreteOnePlusOne as the optimizer.

import nevergrad as ng
# instrument learning rate and number of layers, keep arg3 to 3 and arg4 to 4
lr = ng.var.Array(1).asscalar().bounded(0, 3).exponentiated(base=10, coeff=-1)  # log distributed between 0.001 and 1
num_layers = ng.var.OrderedDiscrete([4, 5, 6])
instrumentation = ng.Instrumentation(lr, num_layers, 3., arg4=4)
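
To make the learning-rate transformation above concrete, here is a minimal pure-Python sketch of the same two steps: squash an unbounded real into [0, 3], then apply base ** (coeff * u) with base=10 and coeff=-1. The helper names (squash, to_learning_rate) are hypothetical, and the tanh squashing is an assumption made for illustration; the library may use a different bounding transform.

```python
import math


def squash(z, low, high):
    """Map an unbounded real z into [low, high]; tanh is an assumed choice."""
    return low + (high - low) * (math.tanh(z) + 1.0) / 2.0


def to_learning_rate(z):
    """Mimic bounded(0, 3) followed by exponentiated(base=10, coeff=-1)."""
    u = squash(z, 0.0, 3.0)       # u lies between 0 and 3
    return 10.0 ** (-1.0 * u)     # base ** (coeff * u), between 0.001 and 1


for z in (-3.0, 0.0, 3.0):
    print(z, to_learning_rate(z))
```

Under this assumption, a data value of 0 maps to the middle of the exponent range (u = 1.5), and larger data values give smaller learning rates.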

Just take care that the default value (your initial guess) is in the middle of the list of possible values for OrderedDiscrete, and 0 for Array (you can modify this with Array methods). You can verify this by checking that an all-zero data vector yields the default:

args, kwargs = instrumentation.data_to_arguments([0] * instrumentation.dimension)
print(args, kwargs)

The fact that you use ordered discrete variables is not a big deal because by nature PortfolioDiscreteOnePlusOne will ignore the order. This algorithm is quite stable.

If you have more budget, a cool possibility is to use SoftmaxCategorical for all discrete variables and then apply TwoPointsDE. You might also compare this to DE (classical differential evolution). This might need a budget in the hundreds.
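
TwoPointsDE is described further down as Differential Evolution equipped with a 2-points crossover. The sketch below shows what a generic two-point crossover and a classical DE donor vector look like. It is a textbook illustration, not Nevergrad's implementation; the function names and the scale value 0.8 are assumptions.

```python
import random


def de_donor(a, b, c, scale=0.8):
    """Classical DE donor vector: a + scale * (b - c), coordinate-wise."""
    return [ai + scale * (bi - ci) for ai, bi, ci in zip(a, b, c)]


def two_point_crossover(parent, donor, rng):
    """Copy one contiguous, non-empty segment of donor into a copy of parent."""
    n = len(parent)
    start, stop = sorted(rng.sample(range(n + 1), 2))  # two distinct cut points
    child = list(parent)
    child[start:stop] = donor[start:stop]
    return child


rng = random.Random(0)
parent = [0.0] * 6
donor = de_donor([1.0] * 6, [2.0] * 6, [1.0] * 6)
print(two_point_crossover(parent, donor, rng))
```

Compared with the coordinate-wise crossover of classical DE, a two-point crossover keeps neighbouring coordinates together, which can matter when adjacent coordinates encode the same variable (for instance the softmax weights of one categorical choice).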

If you want to double-check that you are not worse than random search, you might use RandomSearch.

If you want something fully parallel (the number of workers can be equal to the budget), then you might use ScrHammersleySearch, which includes the discrete case. In that case, you should use OrderedDiscrete rather than SoftmaxCategorical. This does not have the traditional drawback of grid search and should still be more uniform than random. By nature, ScrHammersleySearch will deal correctly with the OrderedDiscrete type for discrete variables.
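
As an illustration of why such a search is fully parallel, here is a minimal sketch of a plain Hammersley point set built from the van der Corput radical inverse. It omits the scrambling that the "Scr" prefix refers to and is not Nevergrad's code; the function names and the pick helper for discrete values are hypothetical.

```python
def radical_inverse(index, base):
    """Van der Corput radical inverse: mirror the digits of index in the given base."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def hammersley(num_points, primes=(2, 3)):
    """Plain (unscrambled) Hammersley set in dimension len(primes) + 1."""
    return [
        [i / num_points] + [radical_inverse(i, p) for p in primes]
        for i in range(num_points)
    ]


def pick(u, values):
    """Map a coordinate u in [0, 1) to one entry of an ordered list of values."""
    return values[min(int(u * len(values)), len(values) - 1)]


for point in hammersley(4):
    print(point, pick(point[0], [4, 5, 6]))
```

All points are known before any evaluation takes place, so every training can be launched at once. Unlike a grid, each coordinate takes many distinct values across the set.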

If you are optimizing weights in reinforcement learning, you might use TBPSA (high noise) or CMA (low noise).

Nevergrad applied to Machine Learning: 3 examples.

The first example is simply the optimization of continuous hyperparameters. It is also presented in an asynchronous setting. All other examples are based on the ask and tell interface, which can be synchronous or not but relies on the user for setting up asynchronicity.

The second example is the optimization of mixed (continuous and discrete) hyperparameters.

The third example is the optimization of parameters in a noisy setting, typically as in reinforcement learning.

First example: optimization of continuous hyperparameters with CMA, PSO, DE, Random and QuasiRandom

Let's first define our test case:

import nevergrad as ng
import numpy as np


# Optimization of continuous hyperparameters.
print("Optimization of continuous hyperparameters =========")


def train_and_return_test_error(x):
    return np.linalg.norm([int(50. * abs(x_ - 0.2)) for x_ in x])

instrumentation = ng.Instrumentation(ng.var.Array(300))  # optimize on R^300

budget = 1200  # How many trainings we will do before concluding.

names = ["RandomSearch", "TwoPointsDE", "CMA", "PSO", "ScrHammersleySearch"]

We will compare several algorithms (defined in names). RandomSearch is well known, and ScrHammersleySearch is a quasirandom method; these two methods are fully parallel, i.e. we can perform the 1200 trainings in parallel. CMA and PSO are classical optimization algorithms, and TwoPointsDE is Differential Evolution equipped with a 2-points crossover. A complete list is available in ng.optimizers.registry.

Ask and tell version

for name in names:
    optim = ng.optimizers.registry[name](instrumentation=instrumentation, budget=budget)
    for u in range(budget // 3):
        x1 = optim.ask()
        # Ask and tell can be asynchronous.
        # Just be careful that you "tell" something that was asked.
        # Here we ask 3 times and tell 3 times in order to fake asynchronicity
        x2 = optim.ask()
        x3 = optim.ask()
        # The three following lines could be parallelized.
        # We could also do things asynchronously, i.e. do one more ask
        # as soon as a training is over.
        y1 = train_and_return_test_error(*x1.args, **x1.kwargs)  # here we only defined an arg, so we could omit kwargs
        y2 = train_and_return_test_error(*x2.args, **x2.kwargs)  # (keeping it here for the sake of consistency)
        y3 = train_and_return_test_error(*x3.args, **x3.kwargs)
        optim.tell(x1, y1)
        optim.tell(x2, y2)
        optim.tell(x3, y3)
    recommendation = optim.recommend()
    print("* ", name, " provides a vector of parameters with test error ",
          train_and_return_test_error(*recommendation.args, **recommendation.kwargs))
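
The ask and tell contract used above can be illustrated without the library. The following toy class is a hypothetical (1+1) evolution strategy written only to show the protocol: ask returns a candidate, tell reports its loss, recommend returns the current best guess. It is a sketch under these assumptions, not how Nevergrad's optimizers are implemented.

```python
import random


class OnePlusOne:
    """Toy (1+1) evolution strategy exposing ask / tell / recommend."""

    def __init__(self, dimension, sigma=0.1, seed=0):
        self._rng = random.Random(seed)
        self._sigma = sigma
        self._best_x = [0.0] * dimension
        self._best_y = float("inf")

    def ask(self):
        # Propose a Gaussian perturbation of the best point seen so far.
        return [v + self._sigma * self._rng.gauss(0.0, 1.0) for v in self._best_x]

    def tell(self, x, y):
        # Keep the candidate only if it improves on the best known loss.
        if y < self._best_y:
            self._best_x, self._best_y = list(x), y

    def recommend(self):
        return list(self._best_x)


def loss(x):
    return sum((v - 0.2) ** 2 for v in x)


optim = OnePlusOne(dimension=5)
for _ in range(100):
    batch = [optim.ask() for _ in range(3)]  # three pending candidates
    values = [loss(x) for x in batch]        # this step could run in parallel
    for x, y in zip(batch, values):
        optim.tell(x, y)
print(loss(optim.recommend()))
```

Because the optimizer only sees (candidate, loss) pairs, the order in which results come back does not matter to this toy class, as long as every value passed to tell belongs to a candidate obtained from ask.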

Asynchronous version with concurrent.futures

from concurrent import futures

for name in names:
    optim = ng.optimizers.registry[name](instrumentation=instrumentation, budget=budget)

    with futures.ThreadPoolExecutor(max_workers=optim.num_workers) as executor:  # the executor will evaluate the function in multiple threads
        recommendation = optim.minimize(train_and_return_test_error, executor=executor)
    print("* ", name, " provides a vector of parameters with test error ",
          train_and_return_test_error(*recommendation.args, **recommendation.kwargs))
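
For readers who want to see what "do one more ask as soon as a training is over" can look like, here is a minimal sketch built on concurrent.futures with a toy random-search optimizer. The RandomSearch class and the minimize_async helper are hypothetical names written for this illustration and are not part of Nevergrad; ask and tell are only ever called from the main thread.

```python
import random
from concurrent import futures


class RandomSearch:
    """Toy optimizer with ask / tell / recommend, used only for illustration."""

    def __init__(self, dimension, seed=0):
        self._rng = random.Random(seed)
        self._dimension = dimension
        self._best = (float("inf"), None)

    def ask(self):
        return [self._rng.uniform(-1.0, 1.0) for _ in range(self._dimension)]

    def tell(self, x, y):
        if y < self._best[0]:
            self._best = (y, list(x))

    def recommend(self):
        return self._best[1]


def train(x):
    return sum((v - 0.2) ** 2 for v in x)


def minimize_async(optim, function, budget, num_workers):
    """Keep num_workers evaluations in flight; ask again as soon as one finishes."""
    with futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        pending = {}  # future -> candidate that was asked
        submitted = 0
        while submitted < budget or pending:
            while submitted < budget and len(pending) < num_workers:
                x = optim.ask()
                pending[executor.submit(function, x)] = x
                submitted += 1
            done, _ = futures.wait(pending, return_when=futures.FIRST_COMPLETED)
            for future in done:
                optim.tell(pending.pop(future), future.result())
    return optim.recommend()


best = minimize_async(RandomSearch(dimension=3), train, budget=60, num_workers=4)
print(train(best))
```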

Second example: optimization of mixed (continuous and discrete) hyperparameters.

Let's define our function:

import numpy as np

# Let us define a function.
def myfunction(arg1, arg2, arg3, value=3):
    return np.abs(value) + (1 if arg1 != "a" else 0) + (1 if arg2 != "e" else 0)

This function must then be instrumented in order to let the optimizer know what the arguments are:

import nevergrad as ng
# argument transformation
# Optimization of mixed (continuous and discrete) hyperparameters.
arg1 = ng.var.OrderedDiscrete(["a", "b"])  # 1st arg. = positional discrete argument
# We apply a softmax for converting real numbers to discrete values.
arg2 = ng.var.SoftmaxCategorical(["a", "c", "e"])  # 2nd arg. = positional discrete argument
value = ng.var.Gaussian(mean=1, std=2)  # the 4th arg. is a keyword argument with Gaussian prior

# create the instrumentation
# the 3rd arg. is a positional arg. which will be kept constant to "blublu"
instrumentation = ng.Instrumentation(arg1, arg2, "blublu", value=value)

print(instrumentation.dimension)  # 5 dimensional space

The dimension is 5 because:

the 1st discrete var. has 2 possible values, represented by a hard thresholding in a 1-dimensional space, i.e. we add 1 coordinate to the continuous problem
the 2nd discrete var. has 3 possible values, represented by softmax, i.e. we add 3 coordinates to the continuous problem
the 3rd var. has no uncertainty, so it does not introduce any coordinate in the continuous problem
the 4th var. is a real number, represented by single coordinate.

args, kwargs = instrumentation.data_to_arguments([1, -80, -80, 80, 3])
print(args, kwargs)
>>> ('b', 'e', 'blublu') {'value': 7}
myfunction(*args, **kwargs)
>>> 8

In this case:

args[0] == "b" because 1 > 0 (the threshold is 0 here since there are 2 values).
args[1] == "e" is selected because proba(e) = exp(80) / (exp(80) + exp(-80) + exp(-80)) = 1
args[2] == "blublu" because it is kept constant
value == 7 because std * 3 + mean = 2 * 3 + 1 = 7

The function therefore returns 7 + 1 = 8.
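
The arithmetic above can be reproduced by hand. The sketch below decodes the same 5 coordinates in pure Python: a threshold at 0 for the two-valued variable, a softmax over three weights for the categorical one, and std * x + mean for the Gaussian one. The decode helper is hypothetical, and picking the most probable choice is a simplification assumed here; the library may instead sample from the softmax probabilities.

```python
import math


def softmax_probabilities(weights):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    top = max(weights)
    exps = [math.exp(w - top) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def decode(data):
    """Hand-written decoding of the 5 coordinates of the example above."""
    arg1 = "b" if data[0] > 0 else "a"                  # hard threshold at 0
    probas = softmax_probabilities(data[1:4])           # one weight per choice
    arg2 = ["a", "c", "e"][probas.index(max(probas))]   # most probable choice
    value = 2 * data[4] + 1                             # std * x + mean
    return (arg1, arg2, "blublu"), {"value": value}


def myfunction(arg1, arg2, arg3, value=3):
    return abs(value) + (1 if arg1 != "a" else 0) + (1 if arg2 != "e" else 0)


args, kwargs = decode([1, -80, -80, 80, 3])
print(args, kwargs, myfunction(*args, **kwargs))
```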

Then you can run the optimization as usual. PortfolioDiscreteOnePlusOne is quite a natural choice when you have a good initial guess and a mix of discrete and continuous variables; in this case, it might be better to use OrderedDiscrete rather than SoftmaxCategorical. TwoPointsDE is often excellent in the large scale case (budget in the hundreds).

import nevergrad as ng
budget = 1200  # How many episodes we will do before concluding.
for name in ["RandomSearch", "ScrHammersleySearch", "TwoPointsDE", "PortfolioDiscreteOnePlusOne", "CMA", "PSO"]:
    optim = ng.optimizers.registry[name](instrumentation=instrumentation, budget=budget)
    for u in range(budget // 3):
        x1 = optim.ask()
        # Ask and tell can be asynchronous.
        # Just be careful that you "tell" something that was asked.
        # Here we ask 3 times and tell 3 times in order to fake asynchronicity
        x2 = optim.ask()
        x3 = optim.ask()
        # The three following lines could be parallelized.
        # We could also do things asynchronously, i.e. do one more ask
        # as soon as a training is over.
        y1 = myfunction(*x1.args, **x1.kwargs)  # "value" is a keyword argument here, so kwargs are required
        y2 = myfunction(*x2.args, **x2.kwargs)
        y3 = myfunction(*x3.args, **x3.kwargs)
        optim.tell(x1, y1)
        optim.tell(x2, y2)
        optim.tell(x3, y3)
    recommendation = optim.recommend()
    print("* ", name, " provides a vector of parameters with test error ",
          myfunction(*recommendation.args, **recommendation.kwargs))

Manual instrumentation

You always have the possibility to define your own instrumentation inside your function (not recommended):

def softmax(x, possible_values=None):
    expx = [np.exp(x_ - max(x)) for x_ in x]
    probas = [e / sum(expx) for e in expx]
    return np.random.choice(len(x) if possible_values is None
            else possible_values, size=1, p=probas)


def train_and_return_test_error_mixed(x):
    cx = [x_ - 0.1 for x_ in x[3:]]
    activation = softmax(x[:3], ["tanh", "sigmoid", "relu"])
    return np.linalg.norm(cx) + (1. if activation != "tanh" else 0.)

instrumentation = 10  # you can just provide the size of your input in this case

# This version is bigger.
def train_and_return_test_error_mixed(x):
    cx = x[:(len(x) // 2)]  # continuous part.
    presoftmax_values = x[(len(x) // 2):]  # discrete part.
    values_for_this_softmax = []
    dx = []
    for g in presoftmax_values:
        values_for_this_softmax += [g]
        if len(values_for_this_softmax) > 4:  # one discrete choice per group of 5 coordinates
            dx += list(softmax(values_for_this_softmax))
            values_for_this_softmax = []
    # continuous loss plus a penalty of 1 for each discrete choice different from 1
    return np.linalg.norm([int(50. * abs(x_ - 0.2)) for x_ in cx]) + sum(
            1 if d != 1 else 0 for d in dx)

instrumentation = 300

Third example: optimization of parameters for reinforcement learning.

We do not average evaluations over multiple episodes - the algorithm is in charge of averaging, if need be. TBPSA, based on population-control mechanisms, performs quite well in this case.

import nevergrad as ng
import numpy as np

# Similar, but with a noisy case: typically a case in which we train in reinforcement learning.
# This is about parameters rather than hyperparameters. TBPSA is a strong candidate in this case.
# We do *not* manually average over multiple evaluations; the algorithm will take care
# of averaging or reevaluate whatever it wants to reevaluate.


print("Optimization of parameters in reinforcement learning ===============")


def simulate_and_return_test_error_with_rl(x, noisy=True):
    return np.linalg.norm([int(50. * abs(x_ - 0.2)) for x_ in x]) + noisy * len(x) * np.random.normal()


budget = 1200  # How many trainings we will do before concluding.


for tool in ["TwoPointsDE", "RandomSearch", "TBPSA", "CMA", "NaiveTBPSA",
        "PortfolioNoisyDiscreteOnePlusOne"]:

    optim = ng.optimizers.registry[tool](instrumentation=300, budget=budget)

    for u in range(budget // 3):
        # Ask and tell can be asynchronous.
        # Just be careful that you "tell" something that was asked.
        # Here we ask 3 times and tell 3 times in order to fake asynchronicity
        x1 = optim.ask()
        x2 = optim.ask()
        x3 = optim.ask()
        # The three following lines could be parallelized.
        # We could also do things asynchronously, i.e. do one more ask
        # as soon as a training is over.
        y1 = simulate_and_return_test_error_with_rl(*x1.args)
        y2 = simulate_and_return_test_error_with_rl(*x2.args)
        y3 = simulate_and_return_test_error_with_rl(*x3.args)
        optim.tell(x1, y1)
        optim.tell(x2, y2)
        optim.tell(x3, y3)

    recommendation = optim.recommend()
    print("* ", tool, " provides a vector of parameters with test error ",
          simulate_and_return_test_error_with_rl(*recommendation.args, noisy=False))
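
To give a feeling for what the optimizer has to cope with in this noisy setting, the following sketch compares the spread of single noisy evaluations with the spread of manual averages over 25 repetitions. The toy loss and helper names are hypothetical. It only illustrates the statistical effect: averaging by hand reduces noise at 25 times the cost per point, which is the kind of decision the text above leaves to the algorithm.

```python
import random
import statistics


def noisy_loss(x, rng, noise_std=1.0):
    """Deterministic loss plus Gaussian noise, standing in for one episode."""
    return sum((v - 0.2) ** 2 for v in x) + rng.gauss(0.0, noise_std)


def averaged_loss(x, rng, repetitions):
    """Mean of several independent noisy evaluations of the same point."""
    return statistics.mean([noisy_loss(x, rng) for _ in range(repetitions)])


rng = random.Random(0)
x = [0.2] * 10  # the noiseless loss is 0 at this point
single = [noisy_loss(x, rng) for _ in range(2000)]
averaged = [averaged_loss(x, rng, repetitions=25) for _ in range(2000)]
# The second spread should be roughly 1 / sqrt(25) = 1 / 5 of the first.
print(statistics.pstdev(single), statistics.pstdev(averaged))
```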
