
Version 4.0.0 #186

Draft: riley-harper wants to merge 142 commits into main
Conversation

riley-harper (Contributor) commented Mar 6, 2025

This is a tracking PR for the changes in version 4.0.0 and the issues those changes relate to.

Changes

ccdavis and others added 30 commits November 14, 2024 15:12
… and test all threshold matrix members against that set of params. Still has a failure.
…oesn't give good results, making no matches in the test data, so precision is NaN.
…split used to test all thresholds isn't a good one.
We can just pass the list of model_parameters from the config file to this
function.
This will make this piece of code easier to understand and test.
…rch setting

One of these tests is failing because we haven't implemented this logic in the
_get_model_parameters() function yet.
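For context, a minimal sketch of what "just pass the list of model_parameters from the config file" can look like. Only the name `_get_model_parameters()` comes from the commit messages; the config shape and return type here are illustrative assumptions, not hlink's actual code.

```python
# Hypothetical sketch: when no search strategy is configured, the explicit
# model_parameters list from the parsed config is returned unchanged.
def _get_model_parameters(training_config: dict) -> list[dict]:
    return training_config.get("model_parameters", [])

params = _get_model_parameters(
    {"model_parameters": [{"type": "random_forest", "maxDepth": 5}]}
)
print(params)  # [{'type': 'random_forest', 'maxDepth': 5}]
```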
riley-harper and others added 21 commits December 13, 2024 19:37
Instead of using this function to get the config and add attributes to it, we
now separately get the config with load_conf_file() and pass attributes to
Spark. I've translated some of the tests for load_conf() to tests for
load_conf_file().
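To illustrate the split: load_conf_file() now only parses the config, and Spark attributes go to Spark separately. This is a hedged sketch; the stub below is a toy stand-in for the real function, and the Spark settings are examples only.

```python
import tomllib  # stdlib on Python 3.11+; the project itself defaults to tomli

from pyspark.sql import SparkSession

def load_conf_file(path: str) -> dict:
    """Toy stand-in: parse the TOML config file and nothing else."""
    with open(path, "rb") as f:
        return tomllib.load(f)

config = load_conf_file("config.toml")  # example path
spark = (
    SparkSession.builder
    .config("spark.local.dir", "/tmp/spark")  # attributes passed to Spark directly
    .getOrCreate()
)
```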
Previously we always set the checkpoint directory to be the same as
spark.local.dir, which we call "tmp_dir". However, this doesn't make sense
because tmp_dir should be on a disk local to each executor, and the checkpoint
directory has to be on shared storage to work correctly.
Allow setting the checkpoint directory through SparkConnection
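The practical consequence is that the two locations are configured independently. Setting a checkpoint directory on the SparkContext is standard PySpark; the paths below are examples, not hlink defaults.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.local.dir", "/local/fast-disk/tmp")  # executor-local scratch space
    .getOrCreate()
)

# The checkpoint directory must be on storage every executor can reach,
# e.g. HDFS or another shared filesystem.
spark.sparkContext.setCheckpointDir("hdfs:///shared/checkpoints")
```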
This is an alpha release of 4.0.0. It's a pre-release, so pip shouldn't
download it unless you specifically request it. Until we release 4.0.0 for
real, the latest official release remains 3.8.0.
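Since pip skips pre-releases by default, installing the alpha requires opting in explicitly. The package name and version string below are assumptions for illustration:

```
pip install --pre hlink        # allow pre-releases when resolving
pip install "hlink==4.0.0a1"   # or pin an exact alpha version (string assumed)
```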
This module has been deprecated for more than a year and is ready for removal.
pyspark.ml.feature.Interaction provides the same interface, and users should
use that class instead.
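For anyone migrating, this is what the replacement looks like with pyspark.ml.feature.Interaction; the DataFrame and column names are made up for the example.

```python
from pyspark.ml.feature import Interaction
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["a", "b"])

# Interaction multiplies its input columns elementwise into a single
# output vector column.
interaction = Interaction(inputCols=["a", "b"], outputCol="a_x_b")
interaction.transform(df).show()
```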
This is an old, deprecated way of specifying blocking.
Now that blocking_steps isn't supported, it's simpler to inline this private
helper function.
This has been deprecated in favor of the current column_mappings format.
This documentation was unfortunately still using the old, deprecated form, so
I've updated it to use the new form instead.
Remove deprecated code for version 4
To support backwards compatibility, there is a "use_legacy_toml_parser"
argument. Setting this tells load_conf_file() to use the toml library.
Use tomli instead of the toml package by default
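The main user-visible API difference is that tomli reads files opened in binary mode, while the older toml package also accepted text mode or a plain path. A small sketch (the file name is an example):

```python
import tomli

with open("config.toml", "rb") as f:  # tomli requires binary mode
    config = tomli.load(f)
```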
In some rare cases with very large inputs, mcc() could return values outside of
the range [-1, 1] due to floating-point precision limitations. To fix this,
I've just added a clamp() function and called it to force the return value into
the acceptable range.
Fix a bug where model_metrics.mcc() < -1.0
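The shape of the fix, as a hedged sketch: the standard MCC computed from confusion-matrix counts, with the result clamped into [-1.0, 1.0]. Only mcc() and clamp() are named in the commits; the signatures here are assumptions, not hlink's actual code.

```python
import math

def clamp(value: float, low: float, high: float) -> float:
    """Force value into the closed interval [low, high]."""
    return max(low, min(high, value))

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, clamped so floating-point error
    on very large counts can't push the result outside [-1.0, 1.0]."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return clamp((tp * tn - fp * fn) / denominator, -1.0, 1.0)
```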
riley-harper marked this pull request as draft March 7, 2025 14:44
riley-harper and others added 8 commits March 7, 2025 20:01
So far, this has information on model parameter searches.
Because of the changes on main, I needed to regenerate the Sphinx docs.
Since this is now deprecated, replace most of the references to
training.param_grid with equivalent references to
training.model_parameter_search.
Update docs for training.param_grid
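As a hedged illustration of the config change: the exact keys and strategy name below are assumptions based on the commit messages, not copied from the hlink docs.

```toml
# Deprecated:
# [training]
# param_grid = true

# Replacement:
[training.model_parameter_search]
strategy = "grid"
```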
This is the beta pre-release for version 4. At this point, we expect all
feature work and breaking changes to be done. There may be bug fixes,
documentation improvements, and code cleanup still happening, but the general
behavior should be pretty close to stable if all goes well.