
Version 4.0.0 #186

Draft: riley-harper wants to merge 142 commits into main
Conversation

riley-harper (Contributor) commented Mar 6, 2025

This is a tracking PR for the changes in version 4.0.0 and the issues those changes relate to.

Changes

ccdavis and others added 30 commits November 14, 2024 15:12
… and test all threshold matrix members against that set of params. Still has a failure.
…oesn't give good results, making no matches in the test data, so precision is NaN.
…split used to test all thresholds isn't a good one.
We can just pass the list of model_parameters from the config file to this
function.
This will make this piece of code easier to understand and test.
…rch setting

One of these tests is failing because we haven't implemented this logic in the
_get_model_parameters() function yet.
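For context, a minimal sketch of what "just pass the list of model_parameters from the config file" can look like. Only the name `_get_model_parameters()` comes from the commit messages; the config shape and return type here are illustrative assumptions, not hlink's actual code.

```python
# Hypothetical sketch: when no search strategy is configured, the explicit
# model_parameters list from the parsed config is returned unchanged.
def _get_model_parameters(training_config: dict) -> list[dict]:
    return training_config.get("model_parameters", [])

params = _get_model_parameters(
    {"model_parameters": [{"type": "random_forest", "maxDepth": 5}]}
)
print(params)  # [{'type': 'random_forest', 'maxDepth': 5}]
```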
riley-harper and others added 21 commits December 13, 2024 19:37
Instead of using this function to get the config and add attributes to it, we
now separately get the config with load_conf_file() and pass attributes to
Spark. I've translated some of the tests for load_conf() to tests for
load_conf_file().
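To illustrate the split: load_conf_file() now only parses the config, and Spark attributes go to Spark separately. This is a hedged sketch; the stub below is a toy stand-in for the real function, and the Spark settings are examples only.

```python
import tomllib  # stdlib on Python 3.11+; the project itself defaults to tomli

from pyspark.sql import SparkSession

def load_conf_file(path: str) -> dict:
    """Toy stand-in: parse the TOML config file and nothing else."""
    with open(path, "rb") as f:
        return tomllib.load(f)

config = load_conf_file("config.toml")  # example path
spark = (
    SparkSession.builder
    .config("spark.local.dir", "/tmp/spark")  # attributes passed to Spark directly
    .getOrCreate()
)
```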
Previously we always set the checkpoint directory to be the same as
spark.local.dir, which we call "tmp_dir". However, this doesn't make sense
because tmp_dir should be on a disk local to each executor, and the checkpoint
directory has to be on shared storage to work correctly.
Allow setting the checkpoint directory through SparkConnection
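The practical consequence is that the two locations are configured independently. Setting a checkpoint directory on the SparkContext is standard PySpark; the paths below are examples, not hlink defaults.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.local.dir", "/local/fast-disk/tmp")  # executor-local scratch space
    .getOrCreate()
)

# The checkpoint directory must be on storage every executor can reach,
# e.g. HDFS or another shared filesystem.
spark.sparkContext.setCheckpointDir("hdfs:///shared/checkpoints")
```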
This is an alpha release of 4.0.0. It's a pre-release, so pip shouldn't
download it unless you specifically request it. Until we release 4.0.0 for
real, the latest official release remains 3.8.0.
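Since pip skips pre-releases by default, installing the alpha requires opting in explicitly. The package name and version string below are assumptions for illustration:

```
pip install --pre hlink        # allow pre-releases when resolving
pip install "hlink==4.0.0a1"   # or pin an exact alpha version (string assumed)
```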
This module has been deprecated for more than a year and is ready for removal.
pyspark.ml.feature.Interaction provides the same interface, and users should
use that class instead.
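For anyone migrating, this is what the replacement looks like with pyspark.ml.feature.Interaction; the DataFrame and column names are made up for the example.

```python
from pyspark.ml.feature import Interaction
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["a", "b"])

# Interaction multiplies its input columns elementwise into a single
# output vector column.
interaction = Interaction(inputCols=["a", "b"], outputCol="a_x_b")
interaction.transform(df).show()
```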
This is an old, deprecated way of specifying blocking.
Now that blocking_steps isn't supported, it's simpler to inline this private
helper function.
This has been deprecated in favor of the current column_mappings format.
This documentation was unfortunately still using the old, deprecated form, so
I've updated it to use the new form instead.
Remove deprecated code for version 4
To support backwards compatibility, there is a "use_legacy_toml_parser"
argument. Setting this tells load_conf_file() to use the toml library.
Use tomli instead of the toml package by default
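The main user-visible API difference is that tomli reads files opened in binary mode, while the older toml package also accepted text mode or a plain path. A small sketch (the file name is an example):

```python
import tomli

with open("config.toml", "rb") as f:  # tomli requires binary mode
    config = tomli.load(f)
```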
In some rare cases with very large inputs, mcc() could return values outside of
the range [-1, 1] due to floating-point precision limitations. To fix this,
I've just added a clamp() function and called it to force the return value into
the acceptable range.
Fix a bug where model_metrics.mcc() < -1.0
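The shape of the fix, as a hedged sketch: the standard MCC computed from confusion-matrix counts, with the result clamped into [-1.0, 1.0]. Only mcc() and clamp() are named in the commits; the signatures here are assumptions, not hlink's actual code.

```python
import math

def clamp(value: float, low: float, high: float) -> float:
    """Force value into the closed interval [low, high]."""
    return max(low, min(high, value))

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, clamped so floating-point error
    on very large counts can't push the result outside [-1.0, 1.0]."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return clamp((tp * tn - fp * fn) / denominator, -1.0, 1.0)
```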
riley-harper marked this pull request as draft March 7, 2025 14:44
riley-harper and others added 8 commits March 7, 2025 20:01
So far, this has information on model parameter searches.
Because of the changes on main, I needed to regenerate the Sphinx docs.
Since this is now deprecated, replace most of the references to
training.param_grid with equivalent references to
training.model_parameter_search.
Update docs for training.param_grid
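As a hedged illustration of the config change: the exact keys and strategy name below are assumptions based on the commit messages, not copied from the hlink docs.

```toml
# Deprecated:
# [training]
# param_grid = true

# Replacement:
[training.model_parameter_search]
strategy = "grid"
```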
This is the beta pre-release for version 4. At this point, we expect all
feature work and breaking changes to be done. There may be bug fixes,
documentation improvements, and code cleanup still happening, but the general
behavior should be pretty close to stable if all goes well.