Version 4.0.0 #186

Draft · wants to merge 142 commits into base: main

Commits (142)
5507b4b
Messing around with refactoring model exploration
ccdavis Nov 14, 2024
3b84f26
Fixed failures due to bad code
ccdavis Nov 15, 2024
62ff6e6
No errors, use model exploration approach that should get pr_auc mean…
ccdavis Nov 15, 2024
3477b71
remove cache() and typo
ccdavis Nov 15, 2024
c0397c5
Renaming for clarity
ccdavis Nov 16, 2024
1fe6224
wip
Nov 16, 2024
28c6cde
giving up for now
ccdavis Nov 16, 2024
1f70f66
wip
ccdavis Nov 18, 2024
8e5415f
refactoring
ccdavis Nov 19, 2024
941bd06
finished refactoring sketch
ccdavis Nov 19, 2024
1f2bd49
Fixed some typos
ccdavis Nov 19, 2024
21cac61
correctly save suspicious data
ccdavis Nov 19, 2024
c9576e8
Debugging _get_aggregates in test. It looks like the test data just d…
ccdavis Nov 20, 2024
319129f
Use all splits on thresholding
Nov 15, 2024
9a90143
Adjust test to account for results with only the best hyper parameter…
ccdavis Nov 21, 2024
a14ccdf
Clean up stdout and make a model-param selection report.
ccdavis Nov 21, 2024
2facf41
model exploration tests pass; need more
ccdavis Nov 21, 2024
3bbac41
Separate each fold test run output.
Nov 22, 2024
3b22f14
Clean up output
ccdavis Nov 25, 2024
efa67f7
Tests pass
ccdavis Nov 25, 2024
38c1006
fixed some tests, the FNS count test is broken because of the single …
ccdavis Nov 25, 2024
c5f5b13
[#167] Pull _custom_param_grid_builder() out of the LinkStepTrainTest…
riley-harper Nov 26, 2024
605369b
[#167] Simplify the interface to _custom_param_grid_builder()
riley-harper Nov 26, 2024
2204152
[#167] Pull _get_model_parameters() out of the LinkStep class
riley-harper Nov 26, 2024
7d48380
[#167] Add a few tests for _get_model_parameters()
riley-harper Nov 26, 2024
bc0bf7d
[#167] Just pass the training section of the config to _get_model_par…
riley-harper Nov 26, 2024
8be8806
[#167] Add a couple of tests for the new training.model_parameter_sea…
riley-harper Nov 26, 2024
a939ec2
[#167] Look for training.model_parameter_search in _get_model_paramet…
riley-harper Nov 26, 2024
801582e
[#167] Make sure that model_parameter_search takes precedence over pa…
riley-harper Nov 26, 2024
a94250c
wip
ccdavis Nov 27, 2024
667d322
Possibly working nested cv
Nov 22, 2024
a476884
[#167] Print a deprecation warning for training.param_grid
riley-harper Nov 27, 2024
8c72446
[#167] Refactor _get_model_parameters()
riley-harper Nov 27, 2024
896ad67
[#167] Improve an error condition in _get_model_parameters()
riley-harper Nov 27, 2024
46da4cb
[#167] Start supporting a randomized strategy which can randomly samp…
riley-harper Nov 27, 2024
51b4144
[#167] Support some simple distributions for randomized parameter search
riley-harper Nov 27, 2024
907818e
[#167] Use isinstance instead of directly checking types
riley-harper Nov 27, 2024
65cb5ff
[#167] Pull the edge case logic for "type" out of _choose_randomized_…
riley-harper Nov 27, 2024
1692c87
[#167] Support "pinned" parameters with model_parameter_search strate…
riley-harper Nov 27, 2024
f4a42f7
fix typo, testing
ccdavis Dec 2, 2024
0becd32
[#167] Respect training.seed when the search strategy is ""randomized"
riley-harper Dec 2, 2024
5d0ea0b
[#167] Add a normal distribution to randomized parameter search
riley-harper Dec 2, 2024
943fc0a
[#167] Improve the "unknown distribution" error message
riley-harper Dec 2, 2024
0f99e1b
[#167] Don't randomize threshold or threshold_ratio
riley-harper Dec 2, 2024
7fed016
[#167] Add a test for the unknown strategy error condition
riley-harper Dec 2, 2024
761e38f
reformatted
Dec 2, 2024
3e0cb90
better output for tracking progress of train-test
Dec 2, 2024
c7e7ba2
better messages
Dec 2, 2024
fdd402c
Better logging
ccdavis Dec 3, 2024
3500e7c
correctly group threshold metrics by outer fold iteration.
ccdavis Dec 3, 2024
1ea05d0
Try fewer shuffle partitions
ccdavis Dec 3, 2024
10ab7b4
set shuffle partitions back to 200
ccdavis Dec 3, 2024
47e28a6
Added nested-cv algo description in comments.
ccdavis Dec 3, 2024
b5e128f
Added seed on inner fold splitter; Update tests to at least pass.
ccdavis Dec 3, 2024
b123dbf
assert the logistic regression gives a decent result
ccdavis Dec 3, 2024
1ead1e7
Temporary commented out asserts due to different results presentation…
ccdavis Dec 3, 2024
45f3649
another test passes
ccdavis Dec 3, 2024
40f075d
all tests should pass
ccdavis Dec 3, 2024
0f5deb6
Merge branch 'main' into randomized_parameter_search
riley-harper Dec 3, 2024
b9c2123
fixed quote indent
ccdavis Dec 3, 2024
40f344e
Merge branch 'main' into refactor-nested-cross-validation
ccdavis Dec 3, 2024
c6d3a81
Merge branch 'main' into randomized_parameter_search
riley-harper Dec 3, 2024
1e55384
Address PR comments
ccdavis Dec 3, 2024
02d5f96
Merge branch 'main' into refactor-nested-cross-validation
ccdavis Dec 3, 2024
11bdfd4
Merge pull request #169 from ipums/refactor-nested-cross-validation
ccdavis Dec 4, 2024
73e6adc
Merge branch 'v4-dev' into randomized_parameter_search
riley-harper Dec 4, 2024
85802d3
Merge pull request #168 from ipums/randomized_parameter_search
riley-harper Dec 4, 2024
77a58c0
HH model exploration test passes; needed to adjust the expected colum…
ccdavis Dec 4, 2024
7e7baa0
Merge branch 'v4-dev' of github.com:ipums/hlink into v4-dev
ccdavis Dec 4, 2024
9542800
Merge branch 'main' into v4-dev
riley-harper Dec 4, 2024
e57dad6
[#172] Add type hints and docs to linking.core.classifier
riley-harper Dec 5, 2024
a736dd0
[#172] Don't handle threshold and threshold_ratio in choose_classifier()
riley-harper Dec 5, 2024
49bda13
[#174] Add type hints to linking.core.threshold
riley-harper Dec 5, 2024
28bcd03
[#174] Add a couple of unit tests for linking.core.threshold
riley-harper Dec 5, 2024
ad6ce10
[#174] Pass just decision into predict_with_thresholds() instead of t…
riley-harper Dec 5, 2024
5424513
[#174] Do some minor refactoring and cleanup of linking.core.threshold
riley-harper Dec 5, 2024
dd16360
[#174] Replace a SQL query with the equivalent spark expression
riley-harper Dec 5, 2024
647a751
[#174] Rewrite some thresholding code to use PySpark exprs instead of…
riley-harper Dec 5, 2024
b5c8ae9
[#174] Use withColumn() instead of select("*", ...)
riley-harper Dec 6, 2024
1ffb6d1
[#174] Improve the error message when there's no probability column
riley-harper Dec 6, 2024
d32c2bf
[#174] Update documentation and add a few logging debug statements
riley-harper Dec 6, 2024
3c9043c
Merge pull request #175 from ipums/core-arguments
riley-harper Dec 6, 2024
93a5c4e
WIP: refactor to combine threshold test results from all outer folds.…
ccdavis Dec 6, 2024
dd49937
WIP on correct metrics output; some tests break because of not enough…
ccdavis Dec 9, 2024
a041274
Cleaning up metrics
Dec 9, 2024
f083378
Tests pass
ccdavis Dec 10, 2024
1f162dc
Adjust hh model exploration test for new column names, no training co…
ccdavis Dec 10, 2024
bde173d
Merge pull request #177 from ipums/model-exploration-metrics
ccdavis Dec 10, 2024
b7f821c
[#176] Remove output_suspicious_TD and "suspicious traininig data" su…
riley-harper Dec 10, 2024
9755f73
[#176] Add a unit test for _get_confusion_matrix()
riley-harper Dec 10, 2024
c43b57d
[#176] Rewrite _get_confusion_matrix() to avoid using 4 filters + counts
riley-harper Dec 10, 2024
4aad62e
[#176] Add a unit test for _get_aggregate_metrics()
riley-harper Dec 10, 2024
3efbb0c
[#176] Lowercase tp/fp/fn/tn variable names
riley-harper Dec 10, 2024
627eed8
Try requiring scikit-learn<1.6 when xgboost is installed
riley-harper Dec 10, 2024
c1f0d8c
Merge pull request #178 from ipums/no-suspicious-data
riley-harper Dec 11, 2024
c166ace
[#179] Create a new core.model_metrics module and move _calc_mcc() there
riley-harper Dec 11, 2024
df9b463
[#179] Create precision() and recall() functions in core.model_metrics
riley-harper Dec 11, 2024
7817ed5
[#179] Factor away _get_aggregate_metrics()
riley-harper Dec 11, 2024
b93ab6f
[#179] Add hypothesis and some property tests for core.model_metrics
riley-harper Dec 11, 2024
8604767
[#179] Add a library function for F-measure, also known as F1-score
riley-harper Dec 11, 2024
75b4414
[#179] Unify variable and argument names
riley-harper Dec 11, 2024
ae59da3
[#179] Return math.nan from core.model_metrics
riley-harper Dec 11, 2024
fd40c35
[#179] Add .hypothesis/ to .gitignore
riley-harper Dec 11, 2024
1ecef81
[#179] Filter with math.isnan() instead of is not np.nan
riley-harper Dec 12, 2024
7f0c48c
[#179] Include F-measure in ThresholdTestResults
riley-harper Dec 12, 2024
a53c120
[#179] Put the raw confusion matrix counts in the ThresholdTestResults
riley-harper Dec 12, 2024
d87c5de
[#179] Simplify _aggregate_per_threshold_results()
riley-harper Dec 12, 2024
74a7dd9
[#179] Add F-measure to the output thresholded metrics data frame
riley-harper Dec 12, 2024
b454276
[#179] Return math.nan from core.model_metrics.mcc where it makes sense
riley-harper Dec 12, 2024
bd934f5
[#179] Don't automatically add or drop columns from thresholded metri…
riley-harper Dec 12, 2024
b2cf14c
[#179] Add documentation to core.model_metrics and refactor a bit
riley-harper Dec 13, 2024
7f8b49d
Merge pull request #180 from ipums/model_metrics
riley-harper Dec 13, 2024
4c6e602
[#181] Return a tuple (path, config) from load_conf_file
riley-harper Dec 13, 2024
46f79e3
[#181] Don't use load_conf() to set extra attributes on the configura…
riley-harper Dec 13, 2024
1f99c93
[#181] Remove the scripts.main.load_conf() function
riley-harper Dec 13, 2024
e0bf86e
[#181] Add a new checkpoint_dir argument to SparkConnection()
riley-harper Dec 13, 2024
3dbc75b
[#181] Implement checkpoint_dir behavior for SparkConnection and Spar…
riley-harper Dec 13, 2024
3f0d62f
Merge pull request #182 from ipums/checkpoint_directory_rework
riley-harper Dec 13, 2024
8bfe87e
Bump the version to 4.0.0a1
riley-harper Dec 13, 2024
7f802db
Run black
riley-harper Mar 5, 2025
0dd3d65
[#98] Remove hlink.linking.transformers.interaction_transformer
riley-harper Mar 5, 2025
305358a
[#127] Update test to avoid using blocking_steps
riley-harper Mar 5, 2025
3543afc
[#127] Remove support for "blocking_steps"
riley-harper Mar 5, 2025
08ac712
[#127] Inline matching._helpers.get_blocking()
riley-harper Mar 5, 2025
1a14cea
[#127] Remove support for old column_mappings format
riley-harper Mar 5, 2025
9c99a44
[#127] Remove support for deprecated form of mapping transforms
riley-harper Mar 5, 2025
727373f
[#127] Add tests for the "mapping" column mapping transform
riley-harper Mar 6, 2025
2004b2e
[#127] Update documentation for the mapping transform
riley-harper Mar 6, 2025
6769131
Merge pull request #184 from ipums/remove-deprecated
riley-harper Mar 6, 2025
7d44f8b
[#45] Use the tomli package instead of toml by default
riley-harper Mar 6, 2025
8518029
[#45] Add tests and docs for use_legacy_toml_parser
riley-harper Mar 6, 2025
d9d43cd
Merge pull request #185 from ipums/use_tomli
riley-harper Mar 6, 2025
94c7c8c
[#187] Fix a bug where model_metrics.mcc() < -1.0
riley-harper Mar 6, 2025
5152468
Merge pull request #188 from ipums/mcc-out-of-range
riley-harper Mar 6, 2025
4eda17d
[#183] Add a new model exploration docs page
riley-harper Mar 7, 2025
2a75c7d
[#183] Update the training and model exploration config docs
riley-harper Mar 7, 2025
d9ebc7d
[#183] Document the fine-grained details of model exploration
riley-harper Mar 7, 2025
b05ab22
Merge pull request #190 from ipums/model-exploration-docs
riley-harper Mar 7, 2025
8a86664
Merge branch 'main' into v4-dev
riley-harper Mar 7, 2025
be90274
[#183] Update docs for training.param_grid
riley-harper Mar 10, 2025
f6a4c47
Merge pull request #191 from ipums/param-grid-docs
riley-harper Mar 10, 2025
27a07c9
Bump the version to 4.0.0b1
riley-harper Mar 10, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -9,6 +9,7 @@ venv
sphinx-docs/_*
.coverage
coverage_*
.hypothesis/

# Scala
scala_jar/target
2 changes: 1 addition & 1 deletion docs/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 3d084ea912736a6c4043e49bc2b58167
config: 51aa15e7a138f908be12c347931eec38
tags: 645f666f9bcd5a90fca523b33c5a78b7
4 changes: 2 additions & 2 deletions docs/.buildinfo.bak
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 96d8a216541a8e03e59f47f661841dd9
# This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 346c22873853f51d4bd34095fc5e3354
tags: 645f666f9bcd5a90fca523b33c5a78b7
24 changes: 13 additions & 11 deletions docs/_sources/column_mappings.md.txt
@@ -288,25 +288,27 @@ transforms = [

### mapping

Map single or multiple values to a single output value, otherwise known as a "recoding."
Explicitly map from input values to output values. This is also known as a "recoding".
Input values which do not appear in the mapping are unchanged. By default, the output
column is of type string, but you can set `output_type = "int"` to cast the output
column to type integer instead.

Maps T → U.

```
```toml
[[column_mappings]]
column_name = "birthyr"
alias = "clean_birthyr"
transforms = [
{
type = "mapping",
values = [
{"from"=[9999,1999], "to" = ""},
{"from" = -9998, "to" = 9999}
]
}
]

[[column_mappings.transforms]]
type = "mapping"
mappings = {9999 = "", 1999 = "", "-9998" = "9999"}
output_type = "int"
```

*Changed in version 4.0.0: The deprecated `values` key is no longer supported.
Please use the `mappings` key documented above instead.*

### substring

Replace a column with a substring of the data in the column.
32 changes: 14 additions & 18 deletions docs/_sources/config.md.txt
@@ -13,8 +13,8 @@
12. [Household Comparisons](#household-comparisons)
13. [Comparison Features](#comparison-features)
14. [Pipeline-Generated Features](#pipeline-generated-features)
15. [Training and Models](#training-and-models)
16. [Household Training and Models](#household-training-and-models)
15. [Training and Model Exploration](#training-and-model-exploration)
16. [Household Training and Model Exploration](#household-training-and-model-exploration)

## Basic Config File

@@ -334,8 +334,7 @@ split_by_id_a = true
decision = "drop_duplicate_with_threshold_ratio"

n_training_iterations = 2
output_suspicious_TD = true
param_grid = true
model_parameter_search = {strategy = "grid"}
model_parameters = [
{ type = "random_forest", maxDepth = [7], numTrees = [100], threshold = [0.05, 0.005], threshold_ratio = [1.2, 1.3] },
{ type = "logistic_regression", threshold = [0.50, 0.65, 0.80], threshold_ratio = [1.0, 1.1] }
@@ -361,8 +360,7 @@ split_by_id_a = true
decision = "drop_duplicate_with_threshold_ratio"

n_training_iterations = 10
output_suspicious_TD = true
param_grid = false
model_parameter_search = {strategy = "explicit"}
model_parameters = [
{ type = "random_forest", maxDepth = 6, numTrees = 50, threshold = 0.5, threshold_ratio = 1.0 },
{ type = "probit", threshold = 0.5, threshold_ratio = 1.0 }
@@ -730,7 +728,7 @@ categorical = true
splits = [-1,0,6,11,9999]
```

## Training and [models](models)
## Training and [Model Exploration](model_exploration)

* Header name: `training`
* Description: Specifies the training data set as well as a myriad of attributes related to training a model including the dependent variable within that dataset, the independent variables created from the `comparison_features` section, and the different models you want to use for either model exploration or scoring.
@@ -740,21 +738,21 @@ splits = [-1,0,6,11,9999]
* `dataset` -- Type: `string`. Location of the training dataset. Must be a csv file.
* `dependent_var` -- Type: `string`. Name of dependent variable in training dataset.
* `independent_vars` -- Type: `list`. List of independent variables to use in the model. These must be either part of `pipeline_features` or `comparison_features`.
* `chosen_model` -- Type: `object`. The model to train with in the `training` task and score with in the `matching` task. See the [models](models) section for more information on model specifications.
* `chosen_model` -- Type: `object`. The model to train with in the `training` task and score with in the `matching` task. See the [Models](models) section for more information on model specifications.
* `threshold` -- Type: `float`. The threshold for which to accept model probability values as true predictions. Can be used to specify a threshold to use for all models, or can be specified within each `chosen_model` and `model_parameters` specification.
* `decision` -- Type: `string`. Optional. Specifies which decision function to use to create the final prediction. The first option is `drop_duplicate_a`, which drops any links for which a record in the `a` data set has a predicted match more than one time. The second option is `drop_duplicate_with_threshold_ratio` which only takes links for which the `a` record has the highest probability out of any other potential links, and the second best link for the `a` record is less than the `threshold_ratio`.
* `threshold_ratio` -- Type: `float`. Optional. For use when `decision` is `drop_duplicate_with_threshold_ratio` . Specifies the smallest possible ratio to accept between a best and second best link for a given record. Can be used to specify a threshold ratio (beta threshold) to use for all models. Alternatively, unique threshold ratios can be specified in each individual `chosen_model` and `model_parameters` specification.
* `model_parameters` -- Type: `list`. Specifies models to test out in the `model_exploration` task. See the [models](models) section for more information on model specifications.
* `param_grid` -- Type: `boolean`. Optional. If you would like to evaluate multiple hyper-parameters for a single model type in your `model_parameters` specification, you can give hyper-parameter inputs as arrays of length >= 1 instead of integers to allow one model per row specification with multiple model eval outputs.
* `decision` -- Type: `string`. Optional. Specifies which decision function to use to create the final prediction. The first option is `drop_duplicate_a`, which drops any links for which a record in the `a` data set has a predicted match more than one time. The second option is `drop_duplicate_with_threshold_ratio` which only takes links for which the `a` record has the highest probability out of any other potential links, and the second best link for the `a` record is less than the `threshold_ratio`.
* `score_with_model` -- Type: `boolean`. If set to false, will skip the `apply_model` step of the matching task. Use this if you want to use the `run_all_steps` command and are just trying to generate potential links, such as for the creation of training data.
* `n_training_iterations` -- Type: `integer`. Optional; default value is 10. The number of training iterations to use during the `model_exploration` task.
* `scale_data` -- Type: `boolean`. Optional. Whether to scale the data as part of the machine learning pipeline.
* `use_training_data_features` -- Type: `boolean`. Optional. If the identifiers in the training data set are not present in your raw input data, you will need to set this to `true`, or training features will not be able to be generated, giving null column errors. For example, if the training data set you are using has individuals from 1900 and 1910, but you are about to train a model to score the 1930-1940 potential matches, you need this to be set to `true` or it will fail, since the individual IDs are not present in the 1930 and 1940 raw input data. If you were about to train a model to score the 1900-1910 potential matches with this same training set, it would be best to set this to `false`, so you can be sure the training features are created from scratch to match your exact current configuration settings, although if you know the features haven't changed, you could set it to `true` to save a small amount of processing time.
* `output_suspicious_TD` -- Type: `boolean`. Optional. Used in the `model_exploration` link task. Outputs tables of potential matches that the model repeatedly scores differently than the match value given by the training data. Helps to identify false positives/false negatives in the training data, as well as areas that need additional training feature coverage in the model, or need increased representation in the training data set.
* `split_by_id_a` -- Type: `boolean`. Optional. Used in the `model_exploration` link task. When set to true, ensures that all potential matches for a given individual with ID_a are grouped together in the same train-test-split group. For example, if individual histid_a "A304BT" has three potential matches in the training data, one each to histid_b "B200", "C201", and "D425", all of those potential matches would either end up in the "train" split or the "test" split when evaluating the model performance.
* `feature_importances` -- Type: `boolean`. Optional. Whether to record
feature importances or coefficients for the training features when training
the ML model. Set this to true to enable training step 3.
* `model_parameters` -- Type: `list`. Specifies models to test out in the `model_exploration` task. See the [Model Exploration](model_exploration) page for a detailed description of how this works.
* `model_parameter_search` -- Type: `object`. Specifies which strategy hlink should
use to generate test models for [Model Exploration](model_exploration).
* `n_training_iterations` -- Type: `integer`. Optional; default value is 10. The number of outer folds to use during the `model_exploration` task. See [here](model_exploration.html#the-details) for more details.


```
@@ -764,7 +762,7 @@
dataset = "/path/to/1900_1910_training_data_20191023.csv"
dependent_var = "match"
use_training_data_features = false
output_suspicious_TD = true
split_by_id_a = true

score_with_model = true
@@ -773,7 +770,7 @@ feature_importances = true
decision = "drop_duplicate_with_threshold_ratio"

n_training_iterations = 10
param_grid = false
model_parameter_search = {strategy = "explicit"}
model_parameters = [
{ type = "random_forest", maxDepth = 6, numTrees = 50 },
{ type = "probit", threshold = 0.5}
@@ -782,7 +779,7 @@ model_parameters = [
chosen_model = { type = "logistic_regression", threshold = 0.5, threshold_ratio = 1.0 }
```

## Household training and models
## Household Training and [Model Exploration](model_exploration)

* Header name: `hh_training`
* Description: Specifies the household training data set as well as a myriad of attributes related to training a model including the dependent var within that data set, the independent vars created from the `comparison_features` section, and the different models you want to use.
@@ -804,13 +801,12 @@ scale_data = false
dataset = "/path/to/hh_training_data_1900_1910.csv"
dependent_var = "match"
use_training_data_features = false
output_suspicious_TD = true
split_by_id_a = true
score_with_model = true
feature_importances = true
decision = "drop_duplicate_with_threshold_ratio"

param_grid = true
model_parameter_search = {strategy = "grid"}
n_training_iterations = 10
model_parameters = [
{ type = "logistic_regression", threshold = [0.5], threshold_ratio = [1.1]},
1 change: 1 addition & 0 deletions docs/_sources/index.rst.txt
@@ -30,4 +30,5 @@ Configuration API
Feature Selection <feature_selection_transforms.md>
Pipeline Features <pipeline_features.md>
substitutions
model_exploration
models
195 changes: 195 additions & 0 deletions docs/_sources/model_exploration.md.txt
@@ -0,0 +1,195 @@
# Model Exploration

## Overview

The model exploration task provides a way to try out different types of machine
learning models and sets of parameters to those models. It tests those models
on splits of the training data and outputs information on the performance of
the models. The purpose of model exploration is to help you choose a model that
performs well without having to test each model individually on the entire
input datasets. If you're interested in the exact workings of the model exploration
algorithm, see the [Details](#the-details) section below.

Because model exploration is closely related to training, it reads several of its
configuration attributes from the `training` section.

## Searching for Model Parameters

Part of the process of model exploration is searching for model parameters which
give good results on the training data. Hlink supports three strategies for model
parameter searches, controlled by the `training.model_parameter_search` table.

### Explicit Search (`strategy = "explicit"`)

An explicit model parameter search lists out all of the parameter combinations
to be tested. Each element of the `training.model_parameters` list becomes one
set of parameters to evaluate. This is the simplest search strategy and is hlink's
default behavior.

This example `training` section uses an explicit search over two sets of model parameters.
Model exploration will train two random forest models. The first will have a
`maxDepth` of 3 and `numTrees` of 50, and the second will have a `maxDepth` of 3
and `numTrees` of 20.

```toml
[training.model_parameter_search]
strategy = "explicit"

[[training.model_parameters]]
type = "random_forest"
maxDepth = 3
numTrees = 50

[[training.model_parameters]]
type = "random_forest"
maxDepth = 3
numTrees = 20
```

### Grid Search (`strategy = "grid"`)

A grid search takes multiple values for each model parameter and generates one
model for each possible combination of the given parameters. This is often much more
compact than writing out all of the possible combinations in an explicit search.

For example, this `training` section generates 30 combinations of model
parameters for testing. The first has a `maxDepth` of 1 and `numTrees` of 20,
the second has a `maxDepth` of 1 and `numTrees` of 30, and so on.

```toml
[training.model_parameter_search]
strategy = "grid"

[[training.model_parameters]]
type = "random_forest"
maxDepth = [1, 2, 3, 5, 10]
numTrees = [20, 30, 40, 50, 60, 70]
```
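A grid search is just the Cartesian product of the value lists. As a rough sketch (the variable names here are illustrative, not hlink's internals), the 5 × 6 lists in the example above expand to 30 combinations:

```python
from itertools import product

# Parameter lists, mirroring the TOML example above.
param_lists = {
    "maxDepth": [1, 2, 3, 5, 10],
    "numTrees": [20, 30, 40, 50, 60, 70],
}

# One dict per combination: the Cartesian product of all value lists.
combinations = [
    dict(zip(param_lists.keys(), values))
    for values in product(*param_lists.values())
]

print(len(combinations))  # 30
print(combinations[0])    # {'maxDepth': 1, 'numTrees': 20}
```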

Although grid search is more compact than explicitly listing out all of the model
parameters, it can be quite time-consuming to check every possible combination of
model parameters. Randomized search, described below, can be a more efficient way
to evaluate models with large numbers of parameters or large parameter ranges.


### Randomized Search (`strategy = "randomized"`)

*Added in version 4.0.0.*

A randomized parameter search generates model parameter settings by sampling each
parameter from a distribution or set. The number of samples is an additional parameter
to the strategy. This separates the size of the search space from the number of samples
taken, making a randomized search more flexible than a grid search. The downside is
that, unlike a grid search, a randomized search does not necessarily test every
value given for each parameter: the search is non-exhaustive by design.

In a randomized search, each model parameter may take one of 3 forms:

* A list, which is a set of values to sample from with replacement. Each value has an equal chance
of being chosen for each sample.

```toml
[[training.model_parameters]]
type = "random_forest"
numTrees = [20, 30, 40]
```

* A single value, which "pins" the model parameter to always be that value. This
is syntactic sugar for sampling from a list with one element.

```toml
[[training.model_parameters]]
type = "random_forest"
# numTrees will always be 30.
# This is equivalent to numTrees = [30].
numTrees = 30
```

* A table defining a distribution from which to sample the parameter. The available
distributions are `"randint"`, to choose a random integer from a range, `"uniform"`,
to choose a random floating-point number from a range, and `"normal"`, to choose
a floating-point number from a normal distribution with a given mean and standard
deviation.

For example, this `training` section generates 20 model parameter combinations
for testing, using a randomized search. Each of the three given model parameters
uses a different type of distribution.

```toml
[training.model_parameter_search]
strategy = "randomized"
num_samples = 20

[[training.model_parameters]]
type = "random_forest"
numTrees = {distribution = "randint", low = 20, high = 70}
minInfoGain = {distribution = "uniform", low = 0.0, high = 0.3}
subsamplingRate = {distribution = "normal", mean = 1.0, standard_deviation = 0.2}
```
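The three parameter forms above can be sketched with Python's standard `random` module. This is a hedged approximation of the sampling rules, not hlink's actual implementation; the helper name `sample_parameter` is an assumption.

```python
import random

def sample_parameter(spec, rng):
    """Sample one value for a parameter spec (assumed semantics)."""
    if isinstance(spec, list):
        # A list: sample uniformly, with replacement across draws.
        return rng.choice(spec)
    if isinstance(spec, dict):
        # A table: sample from the named distribution.
        dist = spec["distribution"]
        if dist == "randint":
            return rng.randint(spec["low"], spec["high"])
        if dist == "uniform":
            return rng.uniform(spec["low"], spec["high"])
        if dist == "normal":
            return rng.normalvariate(spec["mean"], spec["standard_deviation"])
        raise ValueError(f"unknown distribution: {dist}")
    # A single value is "pinned": always returned as-is.
    return spec

rng = random.Random(2024)
params = {
    "numTrees": {"distribution": "randint", "low": 20, "high": 70},
    "minInfoGain": {"distribution": "uniform", "low": 0.0, "high": 0.3},
    "subsamplingRate": 1.0,
}
sample = {name: sample_parameter(spec, rng) for name, spec in params.items()}
```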

### The `training.param_grid` Attribute

As of version 4.0.0, the `training.param_grid` attribute is deprecated. Please use
`training.model_parameter_search` instead, as it is more flexible and supports additional
parameter search strategies. Prior to version 4.0.0, you will need to use `training.param_grid`.

`param_grid` has a direct mapping to `model_parameter_search`.

```toml
[training]
param_grid = true
```

is equivalent to

```toml
[training]
model_parameter_search = {strategy = "grid"}
```

and

```toml
[training]
param_grid = false
```

is equivalent to

```toml
[training]
model_parameter_search = {strategy = "explicit"}
```

### Types and Thresholds


There are 3 attributes which are hlink-specific and are not passed through as model parameters.
* `type` is the name of the model type.
* `threshold` and `threshold_ratio` control how hlink classifies potential matches
based on the probabilistic output of the models. They may each be either a float
or a list of floats, and hlink will always use a grid strategy to generate the
set of test combinations for these parameters.

For more details, please see the [Models](models) page and the [Details](#the-details)
section below.
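Since lists of `threshold` and `threshold_ratio` always expand as a grid, a model given two values for each is tested at four threshold settings. A minimal sketch (the exact pairing logic is an assumption):

```python
from itertools import product

# Threshold lists as they might appear in a model_parameters entry.
thresholds = [0.5, 0.8]
threshold_ratios = [1.0, 1.2]

# hlink grids over threshold settings, so every pair is tested.
combos = list(product(thresholds, threshold_ratios))
# combos -> [(0.5, 1.0), (0.5, 1.2), (0.8, 1.0), (0.8, 1.2)]
```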

## The Details

The current model exploration implementation uses a technique called nested cross-validation to evaluate each model which the search strategy generates. The algorithm follows this basic outline.

Let `N` be the value of `training.n_training_iterations`.
Let `J` be 3. (Currently `J` is hard-coded).

1. Split the prepared training data into `N` **outer folds**. This forms a partition of the training data into `N` distinct pieces, each of roughly equal size.
2. Choose the first **outer fold**.
3. Combine the `N - 1` other **outer folds** into the set of outer training data.
4. Split the outer training data into `J` **inner folds**. This forms a partition of the outer training data into `J` distinct pieces, each of roughly equal size.
5. Choose the first **inner fold**.
6. Combine the `J - 1` other **inner folds** into the set of inner training data.
7. Train, test, and score all of the models using the inner training data and the first **inner fold** as the test data.
8. Repeat steps 5 - 7 for each other **inner fold**.
9. After finishing all of the **inner folds**, choose the single model with the best aggregate score over those folds.
10. For each setting of `threshold` and `threshold_ratio`, train the best model on the outer training data and test it on the chosen **outer fold**. Collect metrics on the performance of the model based on its confusion matrix.
11. Repeat steps 2-10 for each other **outer fold**.
12. Report on all of the metrics gathered for the best-scoring models.
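The outline above can be sketched with plain Python lists standing in for Spark DataFrames. The fold-assignment scheme and the `score` callback here are placeholders, not hlink's actual implementation:

```python
def nested_cv(records, n_outer, n_inner, models, score):
    """Nested cross-validation sketch: pick the best model per outer fold.

    score(model, train, test) returns a scalar; higher is better.
    """
    # Partition the records into n_outer roughly equal outer folds.
    outer = [records[i::n_outer] for i in range(n_outer)]
    results = []
    for k, outer_test in enumerate(outer):
        outer_train = [r for j, fold in enumerate(outer) if j != k for r in fold]
        # Partition the outer training data into n_inner inner folds.
        inner = [outer_train[i::n_inner] for i in range(n_inner)]
        totals = {m: 0.0 for m in models}
        for i, inner_test in enumerate(inner):
            inner_train = [r for j, fold in enumerate(inner) if j != i for r in fold]
            for m in models:
                totals[m] += score(m, inner_train, inner_test)
        # Best aggregate score across the inner folds wins.
        best = max(totals, key=totals.get)
        # Evaluate the winner on the held-out outer fold.
        results.append((best, score(best, outer_train, outer_test)))
    return results
```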