Model Exploration

Overview

The model exploration task provides a way to try out different types of machine learning models and sets of parameters to those models. It tests those models on splits of the training data and outputs information on the performance of the models. The purpose of model exploration is to help you choose a model that performs well without having to test each model individually on the entire input datasets. If you’re interested in the exact workings of the model exploration algorithm, see the Details section below.

Model exploration uses several configuration attributes listed in the training section because it is closely related to training.

Searching for Model Parameters

Part of the process of model exploration is searching for model parameters which give good results on the training data. Hlink supports three strategies for model parameter searches, controlled by the training.model_parameter_search table.

Explicit Search (strategy = "explicit")

An explicit model parameter search lists out all of the parameter combinations to be tested. Each element of the training.model_parameters list becomes one set of parameters to evaluate. This is the simplest search strategy and is hlink’s default behavior.

This example training section uses an explicit search over two sets of model parameters. Model exploration will train two random forest models. The first will have a maxDepth of 3 and numTrees of 50, and the second will have a maxDepth of 3 and numTrees of 20.

[training.model_parameter_search]
strategy = "explicit"

[[training.model_parameters]]
type = "random_forest"
maxDepth = 3
numTrees = 50

[[training.model_parameters]]
type = "random_forest"
maxDepth = 3
numTrees = 20

Grid Search (strategy = "grid")

A grid search takes multiple values for each model parameter and generates one model for each possible combination of the given parameters. This is often much more compact than writing out all of the possible combinations in an explicit search.

For example, this training section generates 30 combinations of model parameters for testing. The first has a maxDepth of 1 and numTrees of 20, the second has a maxDepth of 1 and numTrees of 30, and so on.

[training.model_parameter_search]
strategy = "grid"

[[training.model_parameters]]
type = "random_forest"
maxDepth = [1, 2, 3, 5, 10]
numTrees = [20, 30, 40, 50, 60, 70]
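
To make the expansion concrete, here is a minimal Python sketch of the Cartesian-product behavior described above. It only illustrates what a grid search does with the parameter lists from the example; it is not hlink’s internal implementation.

from itertools import product

# The parameter value lists from the grid search example above.
param_grid = {
    "maxDepth": [1, 2, 3, 5, 10],
    "numTrees": [20, 30, 40, 50, 60, 70],
}

# A grid search evaluates one model per element of the Cartesian product
# of the value lists: 5 * 6 = 30 combinations here.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))  # 30
print(combinations[0])    # {'maxDepth': 1, 'numTrees': 20}
print(combinations[1])    # {'maxDepth': 1, 'numTrees': 30}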

Although grid search is more compact than explicitly listing out all of the model parameters, it can be quite time-consuming to check every possible combination of model parameters. Randomized search, described below, can be a more efficient way to evaluate models with large numbers of parameters or large parameter ranges.

Randomized Search (strategy = "randomized")

Added in version 4.0.0.

A randomized parameter search generates model parameter settings by sampling each parameter from a distribution or set. The number of samples is an additional parameter to the strategy. This separates the size of the search space from the number of samples taken, making a randomized search more flexible than a grid search. The downside is that, unlike a grid search, a randomized search is not exhaustive: it does not necessarily test every value given for each parameter.

In a randomized search, each model parameter may take one of 3 forms:

  • A list, which is a set of values to sample from with replacement. Each value has an equal chance of being chosen for each sample.

    [[training.model_parameters]]
    type = "random_forest"
    numTrees = [20, 30, 40]

  • A single value, which “pins” the model parameter to always be that value. This is syntactic sugar for sampling from a list with one element.

    [[training.model_parameters]]
    type = "random_forest"
    # numTrees will always be 30.
    # This is equivalent to numTrees = [30].
    numTrees = 30

  • A table defining a distribution from which to sample the parameter. The available distributions are "randint", to choose a random integer from a range; "uniform", to choose a random floating-point number from a range; and "normal", to choose a floating-point number from a normal distribution with a given mean and standard deviation.

For example, this training section generates 20 model parameter combinations for testing, using a randomized search. Each of the three given model parameters uses a different type of distribution.

[training.model_parameter_search]
strategy = "randomized"
num_samples = 20

[[training.model_parameters]]
type = "random_forest"
numTrees = {distribution = "randint", low = 20, high = 70}
minInfoGain = {distribution = "uniform", low = 0.0, high = 0.3}
subsamplingRate = {distribution = "normal", mean = 1.0, standard_deviation = 0.2}
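
As an illustration of the sampling semantics, here is a minimal Python sketch that draws num_samples parameter combinations in a similar way, using the standard library’s random module. This is not hlink’s implementation, and details such as whether the "randint" bounds are inclusive are assumptions here.

import random

num_samples = 20  # from training.model_parameter_search.num_samples

samples = []
for _ in range(num_samples):
    samples.append({
        # "randint": a random integer from a range
        "numTrees": random.randint(20, 70),
        # "uniform": a random floating-point number from a range
        "minInfoGain": random.uniform(0.0, 0.3),
        # "normal": a float drawn from a normal distribution with the
        # given mean and standard deviation
        "subsamplingRate": random.gauss(1.0, 0.2),
    })

print(samples[0])  # one randomly generated parameter combination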

The training.param_grid Attribute

As of version 4.0.0, the training.param_grid attribute is deprecated. Please use training.model_parameter_search instead, as it is more flexible and supports additional parameter search strategies. Prior to version 4.0.0, you will need to use training.param_grid.

param_grid has a direct mapping to model_parameter_search.

[training]
param_grid = true

is equivalent to

[training]
model_parameter_search = {strategy = "grid"}

and

[training]
param_grid = false

is equivalent to

[training]
model_parameter_search = {strategy = "explicit"}

Types and Thresholds

There are three attributes which are hlink-specific and are not passed through as model parameters.

  • type is the name of the model type.

  • threshold and threshold_ratio control how hlink classifies potential matches based on the probabilistic output of the models. They may each be either a float or a list of floats, and hlink will always use a grid strategy to generate the set of test combinations for these parameters, as in the sketch below.
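
Here is a minimal Python sketch of that grid behavior. The threshold values are hypothetical; the sketch only illustrates how list-valued threshold and threshold_ratio attributes expand into combinations and is not hlink’s actual code.

from itertools import product

# Hypothetical list values, as they might appear in one
# [[training.model_parameters]] entry.
thresholds = [0.5, 0.6, 0.7]
threshold_ratios = [1.0, 1.3]

# hlink combines these with a grid strategy, so the model is evaluated
# at every (threshold, threshold_ratio) pair: 3 * 2 = 6 combinations.
for threshold, ratio in product(thresholds, threshold_ratios):
    print(f"threshold={threshold}, threshold_ratio={ratio}")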

For more details, please see the Models page and the Details section below.

The Details

The current model exploration implementation uses a technique called nested cross-validation to evaluate each model which the search strategy generates. The algorithm follows this basic outline.

Let N be the value of training.n_training_iterations. Let J be 3. (Currently J is hard-coded.)

  1. Split the prepared training data into N outer folds. This forms a partition of the training data into N distinct pieces, each of roughly equal size.

  2. Choose the first outer fold.

  3. Combine the N - 1 other outer folds into the set of outer training data.

  4. Split the outer training data into J inner folds. This forms a partition of the outer training data into J distinct pieces, each of roughly equal size.

  5. Choose the first inner fold.

  6. Combine the J - 1 other inner folds into the set of inner training data.

  7. Train, test, and score all of the models using the inner training data and the first inner fold as the test data.

  8. Repeat steps 5 - 7 for each other inner fold.

  9. After finishing all of the inner folds, choose the single model with the best aggregate score over those folds.

  10. For each setting of threshold and threshold_ratio, train the best model on the outer training data and test it on the chosen outer fold. Collect metrics on the performance of the model based on its confusion matrix.

  11. Repeat steps 2 - 10 for each other outer fold.

  12. Report on all of the metrics gathered for the best-scoring models.
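
The Python sketch below restates this outline as runnable code. It is a simplification under stated assumptions, not hlink’s actual implementation: train_and_score and evaluate_with_thresholds are hypothetical stand-ins for training a model, scoring it, and computing confusion-matrix metrics, and the fold-splitting details are illustrative.

import random
from statistics import mean

def train_and_score(params, train_rows, test_rows):
    # Hypothetical stand-in: train a model configured by `params` on
    # train_rows and return its score on test_rows. Dummy value here.
    return random.random()

def evaluate_with_thresholds(params, threshold, threshold_ratio, train_rows, test_rows):
    # Hypothetical stand-in for training the best model and collecting
    # confusion-matrix metrics at one threshold setting.
    return {"params": params, "threshold": threshold,
            "threshold_ratio": threshold_ratio, "score": random.random()}

def model_exploration(training_data, model_combinations, threshold_settings, n, j=3):
    # Step 1: partition the training data into N outer folds.
    random.shuffle(training_data)
    outer_folds = [training_data[i::n] for i in range(n)]

    all_metrics = []
    # Steps 2 and 11: visit each outer fold in turn.
    for k, outer_fold in enumerate(outer_folds):
        # Step 3: the other N - 1 folds form the outer training data.
        outer_train = [row for i, fold in enumerate(outer_folds) if i != k for row in fold]
        # Step 4: partition the outer training data into J inner folds.
        inner_folds = [outer_train[i::j] for i in range(j)]

        # Steps 5 - 8: score every model on every inner fold.
        scores = [[] for _ in model_combinations]
        for m, inner_fold in enumerate(inner_folds):
            inner_train = [row for i, fold in enumerate(inner_folds) if i != m for row in fold]
            for idx, params in enumerate(model_combinations):
                scores[idx].append(train_and_score(params, inner_train, inner_fold))

        # Step 9: choose the model with the best aggregate (mean) score.
        best = max(range(len(model_combinations)), key=lambda idx: mean(scores[idx]))

        # Step 10: evaluate the best model on the outer fold at each
        # threshold setting and collect the metrics.
        for threshold, ratio in threshold_settings:
            all_metrics.append(evaluate_with_thresholds(
                model_combinations[best], threshold, ratio, outer_train, outer_fold))

    # Step 12: report all of the gathered metrics.
    return all_metrics

# Toy usage: 100 rows, two hypothetical parameter combinations, one threshold setting.
metrics = model_exploration(list(range(100)), [{"maxDepth": 3}, {"maxDepth": 7}],
                            [(0.8, 1.2)], n=5)
print(len(metrics))  # 5 outer folds * 1 threshold setting = 5 metric reports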