Version 4.0.0 #186

Draft
wants to merge 142 commits into
base: main
5507b4b
Messing around with refactoring model exploration
ccdavis Nov 14, 2024
3b84f26
Fixed failures due to bad code
ccdavis Nov 15, 2024
62ff6e6
No errors, use model exploration approach that should get pr_auc mean…
ccdavis Nov 15, 2024
3477b71
remove cache() and typo
ccdavis Nov 15, 2024
c0397c5
Renaming for clarity
ccdavis Nov 16, 2024
1fe6224
wip
Nov 16, 2024
28c6cde
giving up for now
ccdavis Nov 16, 2024
1f70f66
wip
ccdavis Nov 18, 2024
8e5415f
refactoring
ccdavis Nov 19, 2024
941bd06
finished refactoring sketch
ccdavis Nov 19, 2024
1f2bd49
Fixed some typos
ccdavis Nov 19, 2024
21cac61
correctly save suspicious data
ccdavis Nov 19, 2024
c9576e8
Debugging _get_aggregates in test. It looks like the test data just d…
ccdavis Nov 20, 2024
319129f
Use all splits on thresholding
Nov 15, 2024
9a90143
Adjust test to account for results with only the best hyper parameter…
ccdavis Nov 21, 2024
a14ccdf
Clean up stdout and make a model-param selection report.
ccdavis Nov 21, 2024
2facf41
model exploration tests pass; need more
ccdavis Nov 21, 2024
3bbac41
Separate each fold test run output.
Nov 22, 2024
3b22f14
Clean up output
ccdavis Nov 25, 2024
efa67f7
Tests pass
ccdavis Nov 25, 2024
38c1006
fixed some tests, the FNS count test is broken because of the single …
ccdavis Nov 25, 2024
c5f5b13
[#167] Pull _custom_param_grid_builder() out of the LinkStepTrainTest…
riley-harper Nov 26, 2024
605369b
[#167] Simplify the interface to _custom_param_grid_builder()
riley-harper Nov 26, 2024
2204152
[#167] Pull _get_model_parameters() out of the LinkStep class
riley-harper Nov 26, 2024
7d48380
[#167] Add a few tests for _get_model_parameters()
riley-harper Nov 26, 2024
bc0bf7d
[#167] Just pass the training section of the config to _get_model_par…
riley-harper Nov 26, 2024
8be8806
[#167] Add a couple of tests for the new training.model_parameter_sea…
riley-harper Nov 26, 2024
a939ec2
[#167] Look for training.model_parameter_search in _get_model_paramet…
riley-harper Nov 26, 2024
801582e
[#167] Make sure that model_parameter_search takes precedence over pa…
riley-harper Nov 26, 2024
a94250c
wip
ccdavis Nov 27, 2024
667d322
Possibly working nested cv
Nov 22, 2024
a476884
[#167] Print a deprecation warning for training.param_grid
riley-harper Nov 27, 2024
8c72446
[#167] Refactor _get_model_parameters()
riley-harper Nov 27, 2024
896ad67
[#167] Improve an error condition in _get_model_parameters()
riley-harper Nov 27, 2024
46da4cb
[#167] Start supporting a randomized strategy which can randomly samp…
riley-harper Nov 27, 2024
51b4144
[#167] Support some simple distributions for randomized parameter search
riley-harper Nov 27, 2024
907818e
[#167] Use isinstance instead of directly checking types
riley-harper Nov 27, 2024
65cb5ff
[#167] Pull the edge case logic for "type" out of _choose_randomized_…
riley-harper Nov 27, 2024
1692c87
[#167] Support "pinned" parameters with model_parameter_search strate…
riley-harper Nov 27, 2024
f4a42f7
fix typo, testing
ccdavis Dec 2, 2024
0becd32
[#167] Respect training.seed when the search strategy is "randomized"
riley-harper Dec 2, 2024
5d0ea0b
[#167] Add a normal distribution to randomized parameter search
riley-harper Dec 2, 2024
943fc0a
[#167] Improve the "unknown distribution" error message
riley-harper Dec 2, 2024
0f99e1b
[#167] Don't randomize threshold or threshold_ratio
riley-harper Dec 2, 2024
7fed016
[#167] Add a test for the unknown strategy error condition
riley-harper Dec 2, 2024
761e38f
reformatted
Dec 2, 2024
3e0cb90
better output for tracking progress of train-test
Dec 2, 2024
c7e7ba2
better messages
Dec 2, 2024
fdd402c
Better logging
ccdavis Dec 3, 2024
3500e7c
correctly group threshold metrics by outer fold iteration.
ccdavis Dec 3, 2024
1ea05d0
Try fewer shuffle partitions
ccdavis Dec 3, 2024
10ab7b4
set shuffle partitions back to 200
ccdavis Dec 3, 2024
47e28a6
Added nested-cv algo description in comments.
ccdavis Dec 3, 2024
b5e128f
Added seed on inner fold splitter; Update tests to at least pass.
ccdavis Dec 3, 2024
b123dbf
assert the logistic regression gives a decent result
ccdavis Dec 3, 2024
1ead1e7
Temporary commented out asserts due to different results presentation…
ccdavis Dec 3, 2024
45f3649
another test passes
ccdavis Dec 3, 2024
40f075d
all tests should pass
ccdavis Dec 3, 2024
0f5deb6
Merge branch 'main' into randomized_parameter_search
riley-harper Dec 3, 2024
b9c2123
fixed quote indent
ccdavis Dec 3, 2024
40f344e
Merge branch 'main' into refactor-nested-cross-validation
ccdavis Dec 3, 2024
c6d3a81
Merge branch 'main' into randomized_parameter_search
riley-harper Dec 3, 2024
1e55384
Address PR comments
ccdavis Dec 3, 2024
02d5f96
Merge branch 'main' into refactor-nested-cross-validation
ccdavis Dec 3, 2024
11bdfd4
Merge pull request #169 from ipums/refactor-nested-cross-validation
ccdavis Dec 4, 2024
73e6adc
Merge branch 'v4-dev' into randomized_parameter_search
riley-harper Dec 4, 2024
85802d3
Merge pull request #168 from ipums/randomized_parameter_search
riley-harper Dec 4, 2024
77a58c0
HH model exploration test passes; needed to adjust the expected colum…
ccdavis Dec 4, 2024
7e7baa0
Merge branch 'v4-dev' of github.com:ipums/hlink into v4-dev
ccdavis Dec 4, 2024
9542800
Merge branch 'main' into v4-dev
riley-harper Dec 4, 2024
e57dad6
[#172] Add type hints and docs to linking.core.classifier
riley-harper Dec 5, 2024
a736dd0
[#172] Don't handle threshold and threshold_ratio in choose_classifier()
riley-harper Dec 5, 2024
49bda13
[#174] Add type hints to linking.core.threshold
riley-harper Dec 5, 2024
28bcd03
[#174] Add a couple of unit tests for linking.core.threshold
riley-harper Dec 5, 2024
ad6ce10
[#174] Pass just decision into predict_with_thresholds() instead of t…
riley-harper Dec 5, 2024
5424513
[#174] Do some minor refactoring and cleanup of linking.core.threshold
riley-harper Dec 5, 2024
dd16360
[#174] Replace a SQL query with the equivalent spark expression
riley-harper Dec 5, 2024
647a751
[#174] Rewrite some thresholding code to use PySpark exprs instead of…
riley-harper Dec 5, 2024
b5c8ae9
[#174] Use withColumn() instead of select("*", ...)
riley-harper Dec 6, 2024
1ffb6d1
[#174] Improve the error message when there's no probability column
riley-harper Dec 6, 2024
d32c2bf
[#174] Update documentation and add a few logging debug statements
riley-harper Dec 6, 2024
3c9043c
Merge pull request #175 from ipums/core-arguments
riley-harper Dec 6, 2024
93a5c4e
WIP: refactor to combine threshold test results from all outer folds.…
ccdavis Dec 6, 2024
dd49937
WIP on correct metrics output; some tests break because of not enough…
ccdavis Dec 9, 2024
a041274
Cleaning up metrics
Dec 9, 2024
f083378
Tests pass
ccdavis Dec 10, 2024
1f162dc
Adjust hh model exploration test for new column names, no training co…
ccdavis Dec 10, 2024
bde173d
Merge pull request #177 from ipums/model-exploration-metrics
ccdavis Dec 10, 2024
b7f821c
[#176] Remove output_suspicious_TD and "suspicious training data" su…
riley-harper Dec 10, 2024
9755f73
[#176] Add a unit test for _get_confusion_matrix()
riley-harper Dec 10, 2024
c43b57d
[#176] Rewrite _get_confusion_matrix() to avoid using 4 filters + counts
riley-harper Dec 10, 2024
4aad62e
[#176] Add a unit test for _get_aggregate_metrics()
riley-harper Dec 10, 2024
3efbb0c
[#176] Lowercase tp/fp/fn/tn variable names
riley-harper Dec 10, 2024
627eed8
Try requiring scikit-learn<1.6 when xgboost is installed
riley-harper Dec 10, 2024
c1f0d8c
Merge pull request #178 from ipums/no-suspicious-data
riley-harper Dec 11, 2024
c166ace
[#179] Create a new core.model_metrics module and move _calc_mcc() there
riley-harper Dec 11, 2024
df9b463
[#179] Create precision() and recall() functions in core.model_metrics
riley-harper Dec 11, 2024
7817ed5
[#179] Factor away _get_aggregate_metrics()
riley-harper Dec 11, 2024
b93ab6f
[#179] Add hypothesis and some property tests for core.model_metrics
riley-harper Dec 11, 2024
8604767
[#179] Add a library function for F-measure, also known as F1-score
riley-harper Dec 11, 2024
75b4414
[#179] Unify variable and argument names
riley-harper Dec 11, 2024
ae59da3
[#179] Return math.nan from core.model_metrics
riley-harper Dec 11, 2024
fd40c35
[#179] Add .hypothesis/ to .gitignore
riley-harper Dec 11, 2024
1ecef81
[#179] Filter with math.isnan() instead of is not np.nan
riley-harper Dec 12, 2024
7f0c48c
[#179] Include F-measure in ThresholdTestResults
riley-harper Dec 12, 2024
a53c120
[#179] Put the raw confusion matrix counts in the ThresholdTestResults
riley-harper Dec 12, 2024
d87c5de
[#179] Simplify _aggregate_per_threshold_results()
riley-harper Dec 12, 2024
74a7dd9
[#179] Add F-measure to the output thresholded metrics data frame
riley-harper Dec 12, 2024
b454276
[#179] Return math.nan from core.model_metrics.mcc where it makes sense
riley-harper Dec 12, 2024
bd934f5
[#179] Don't automatically add or drop columns from thresholded metri…
riley-harper Dec 12, 2024
b2cf14c
[#179] Add documentation to core.model_metrics and refactor a bit
riley-harper Dec 13, 2024
7f8b49d
Merge pull request #180 from ipums/model_metrics
riley-harper Dec 13, 2024
4c6e602
[#181] Return a tuple (path, config) from load_conf_file
riley-harper Dec 13, 2024
46f79e3
[#181] Don't use load_conf() to set extra attributes on the configura…
riley-harper Dec 13, 2024
1f99c93
[#181] Remove the scripts.main.load_conf() function
riley-harper Dec 13, 2024
e0bf86e
[#181] Add a new checkpoint_dir argument to SparkConnection()
riley-harper Dec 13, 2024
3dbc75b
[#181] Implement checkpoint_dir behavior for SparkConnection and Spar…
riley-harper Dec 13, 2024
3f0d62f
Merge pull request #182 from ipums/checkpoint_directory_rework
riley-harper Dec 13, 2024
8bfe87e
Bump the version to 4.0.0a1
riley-harper Dec 13, 2024
7f802db
Run black
riley-harper Mar 5, 2025
0dd3d65
[#98] Remove hlink.linking.transformers.interaction_transformer
riley-harper Mar 5, 2025
305358a
[#127] Update test to avoid using blocking_steps
riley-harper Mar 5, 2025
3543afc
[#127] Remove support for "blocking_steps"
riley-harper Mar 5, 2025
08ac712
[#127] Inline matching._helpers.get_blocking()
riley-harper Mar 5, 2025
1a14cea
[#127] Remove support for old column_mappings format
riley-harper Mar 5, 2025
9c99a44
[#127] Remove support for deprecated form of mapping transforms
riley-harper Mar 5, 2025
727373f
[#127] Add tests for the "mapping" column mapping transform
riley-harper Mar 6, 2025
2004b2e
[#127] Update documentation for the mapping transform
riley-harper Mar 6, 2025
6769131
Merge pull request #184 from ipums/remove-deprecated
riley-harper Mar 6, 2025
7d44f8b
[#45] Use the tomli package instead of toml by default
riley-harper Mar 6, 2025
8518029
[#45] Add tests and docs for use_legacy_toml_parser
riley-harper Mar 6, 2025
d9d43cd
Merge pull request #185 from ipums/use_tomli
riley-harper Mar 6, 2025
94c7c8c
[#187] Fix a bug where model_metrics.mcc() < -1.0
riley-harper Mar 6, 2025
5152468
Merge pull request #188 from ipums/mcc-out-of-range
riley-harper Mar 6, 2025
4eda17d
[#183] Add a new model exploration docs page
riley-harper Mar 7, 2025
2a75c7d
[#183] Update the training and model exploration config docs
riley-harper Mar 7, 2025
d9ebc7d
[#183] Document the fine-grained details of model exploration
riley-harper Mar 7, 2025
b05ab22
Merge pull request #190 from ipums/model-exploration-docs
riley-harper Mar 7, 2025
8a86664
Merge branch 'main' into v4-dev
riley-harper Mar 7, 2025
be90274
[#183] Update docs for training.param_grid
riley-harper Mar 10, 2025
f6a4c47
Merge pull request #191 from ipums/param-grid-docs
riley-harper Mar 10, 2025
27a07c9
Bump the version to 4.0.0b1
riley-harper Mar 10, 2025
31 changes: 27 additions & 4 deletions hlink/configs/load_config.py
@@ -7,11 +7,14 @@
 from typing import Any
 import json
 import toml
+import tomli

 from hlink.errors import UsageError


-def load_conf_file(conf_name: str) -> tuple[Path, dict[str, Any]]:
+def load_conf_file(
+    conf_name: str, *, use_legacy_toml_parser: bool = False
+) -> tuple[Path, dict[str, Any]]:
     """Flexibly load a config file.

     Given a path `conf_name`, look for a file at that path. If that file
@@ -20,8 +23,18 @@ def load_conf_file(conf_name: str) -> tuple[Path, dict[str, Any]]:
     name with a '.toml' extension added and load it if it exists. Then do the
     same for a file with a '.json' extension added.

+    `use_legacy_toml_parser` tells this function to use the legacy TOML library
+    which hlink used to use instead of the current default. This is provided
+    for backwards compatibility. Some previously written config files may
+    depend on bugs in the legacy TOML library, making it hard to migrate to the
+    new TOML v1.0 compliant parser. It is strongly recommended that new code
+    and config files use the default parser. Old code and config files should
+    also try to migrate to the default parser when possible.
+
     Args:
         conf_name: the file to look for
+        use_legacy_toml_parser: (Not Recommended) Use the legacy, buggy TOML
+            parser instead of the default parser.

     Returns:
         a tuple (absolute path to the config file, contents of the config file)
@@ -40,9 +53,19 @@ def load_conf_file(conf_name: str) -> tuple[Path, dict[str, Any]]:

     for file in existing_files:
         if file.suffix == ".toml":
-            with open(file) as f:
-                conf = toml.load(f)
-                return file.absolute(), conf
+            # Legacy support for using the "toml" library instead of "tomli".
+            #
+            # Eventually we should remove use_legacy_toml_parser and just use
+            # tomli or Python's standard library tomllib, which is available in
+            # Python 3.11+.
+            if use_legacy_toml_parser:
+                with open(file) as f:
+                    conf = toml.load(f)
+                    return file.absolute(), conf
+            else:
+                with open(file, "rb") as f:
+                    conf = tomli.load(f)
+                    return file.absolute(), conf

         if file.suffix == ".json":
             with open(file) as f:
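
The comment in this hunk mentions eventually moving to tomli or the standard library's tomllib. A minimal sketch of what that could look like, assuming a hypothetical helper `read_toml()` that is not part of hlink: it prefers `tomllib` on Python 3.11+ and falls back to `tomli`, which exposes the same `load()` API. Both parsers expect the file to be opened in binary mode, which is why the new branch above passes `"rb"` to `open()`.

```python
import tempfile
from pathlib import Path
from typing import Any

try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same load() API


def read_toml(path: Path) -> dict[str, Any]:
    """Hypothetical helper: parse a TOML file with a TOML v1.0 parser."""
    # The file must be opened in binary mode for tomllib/tomli.
    with open(path, "rb") as f:
        return tomllib.load(f)


def demo() -> dict[str, Any]:
    """Write a tiny illustrative config to a temp dir and parse it."""
    with tempfile.TemporaryDirectory() as tmp:
        conf_file = Path(tmp) / "example.toml"
        conf_file.write_text('id_column = "id"\n[training]\nseed = 42\n')
        return read_toml(conf_file)
```

The config keys in `demo()` are made up for illustration and are not taken from the hlink test fixtures.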
15 changes: 15 additions & 0 deletions hlink/tests/config_loader_test.py
@@ -50,3 +50,18 @@ def test_load_conf_file_unrecognized_extension(tmp_path: Path) -> None:
         match="The file .+ exists, but it doesn't have a '.toml' or '.json' extension",
     ):
         load_conf_file(str(conf_file))
+
+
+def test_load_conf_file_json_legacy_parser(conf_dir_path: str) -> None:
+    """
+    The use_legacy_toml_parser argument does not affect json parsing.
+    """
+    conf_file = Path(conf_dir_path) / "test.json"
+    _, conf = load_conf_file(str(conf_file), use_legacy_toml_parser=True)
+    assert conf["id_column"] == "id"
+
+
+def test_load_conf_file_toml_legacy_parser(conf_dir_path: str) -> None:
+    conf_file = Path(conf_dir_path) / "test1.toml"
+    _, conf = load_conf_file(str(conf_file), use_legacy_toml_parser=True)
+    assert conf["id_column"] == "id-toml"
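
Both new tests pass `use_legacy_toml_parser=True` by keyword. That follows from the bare `*` in the new `load_conf_file()` signature, which makes the flag keyword-only. The sketch below illustrates the effect with a hypothetical stand-in, `load_conf_file_stub()`, that copies only the shape of the signature and does no file I/O; it is not hlink's implementation.

```python
from typing import Any


def load_conf_file_stub(
    conf_name: str, *, use_legacy_toml_parser: bool = False
) -> tuple[str, dict[str, Any]]:
    """Hypothetical stand-in that only mirrors the new signature's shape."""
    # Report which parser would be chosen instead of reading a file.
    parser = "toml" if use_legacy_toml_parser else "tomli"
    return conf_name, {"parser": parser}


def positional_flag_rejected() -> bool:
    """Return True if passing the flag positionally raises TypeError."""
    try:
        load_conf_file_stub("test1.toml", True)  # type: ignore[misc]
    except TypeError:
        return True
    return False
```

Existing callers that pass only `conf_name` keep working and get the default parser, while opting in to the legacy parser has to be spelled out at the call site.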
1 change: 1 addition & 0 deletions pyproject.toml
@@ -25,6 +25,7 @@ dependencies = [
     "pyspark~=3.5.0",
     "scikit-learn>=1.1.0",
     "toml>=0.10.0",
+    "tomli>=2.0",
 ]

 [project.optional-dependencies]