Version 4.0.0 #186
Draft
riley-harper wants to merge 142 commits into main from v4-dev
Conversation
… and test all threshold matrix members against that set of params. Still has a failure.
…oesn't give good results, making no matches in the test data, so precision is NaN.
…s given to the thresholding eval.
…split used to test all thresholds isn't a good one.
We can just pass the list of model_parameters from the config file to this function.
This will make this piece of code easier to understand and test.
…rch setting. One of these tests is failing because we haven't implemented this logic in the `_get_model_parameters()` function yet.
Instead of using this function to get the config and add attributes to it, we now separately get the config with load_conf_file() and pass attributes to Spark. I've translated some of the tests for load_conf() to tests for load_conf_file().
Previously we always set the checkpoint directory to be the same as spark.local.dir, which we call "tmp_dir". However, this doesn't make sense because tmp_dir should be on a disk local to each executor, and the checkpoint directory has to be on shared storage to work correctly.
Allow setting the checkpoint directory through SparkConnection
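For context, this is how checkpointing works in plain PySpark, independent of hlink's `SparkConnection` wrapper: the directory passed to `setCheckpointDir()` must be on storage visible to every executor, which is why reusing `spark.local.dir` was wrong. A minimal sketch (the path shown is a placeholder, not one hlink uses):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The checkpoint directory must live on shared storage (e.g. HDFS or a
# network filesystem) so that all executors can read and write it.
# spark.local.dir, by contrast, should be a disk local to each executor.
spark.sparkContext.setCheckpointDir("/shared/checkpoints")
```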
This is an alpha release of 4.0.0. It's a pre-release, so pip shouldn't download it unless you specifically request it. Until we go to 4.0.0 for real, the last official release will be 3.8.0.
This module has been deprecated for more than a year and is ready for removal. pyspark.ml.feature.Interaction provides the same interface, and users should use that class instead.
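A minimal sketch of the replacement using `pyspark.ml.feature.Interaction`; the column names here are made up for illustration, and `df` stands for an existing DataFrame with numeric columns:

```python
from pyspark.ml.feature import Interaction

# Interaction multiplies the input columns element-wise into a single
# output vector column, covering the same use case as the deprecated
# hlink interaction_transformer module.
interaction = Interaction(inputCols=["age", "income"], outputCol="age_x_income")
df_with_interaction = interaction.transform(df)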
This is an old, deprecated way of specifying blocking.
Now that blocking_steps isn't supported, it's simpler to inline this private helper function.
This has been deprecated in favor of the current column_mappings format.
This documentation was unfortunately using the old, deprecated form, so I've updated it to use the new form instead.
Remove deprecated code for version 4
To support backwards compatibility, there is a "use_legacy_toml_parser" argument. Setting this tells load_conf_file() to use the toml library.
Use tomli instead of the toml package by default
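For reference, the two parsers have slightly different interfaces: `tomli` only reads binary file objects, while the older `toml` package also accepts a path or text file. A minimal sketch of the new default path (the config filename is a placeholder):

```python
import tomli

# tomli requires the file to be opened in binary mode ("rb").
with open("my_config.toml", "rb") as f:
    config = tomli.load(f)
```

Passing `use_legacy_toml_parser=True` to `load_conf_file()` falls back to the `toml` library for configs that only parse under the old, more lenient parser.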
In some rare cases with very large inputs, mcc() could return values outside of the range [-1, 1] due to floating-point precision limitations. To fix this, I've just added a clamp() function and called it to force the return value into the acceptable range.
Fix a bug where model_metrics.mcc() < -1.0
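A minimal sketch of the fix; the argument names and the zero-denominator handling below are assumptions for illustration, not necessarily hlink's exact implementation:

```python
import math


def clamp(value: float, minimum: float, maximum: float) -> float:
    """Force value into the closed range [minimum, maximum]."""
    return max(minimum, min(value, maximum))


def mcc(true_pos: int, true_neg: int, false_pos: int, false_neg: int) -> float:
    """Matthews correlation coefficient, clamped so that floating-point
    rounding on very large confusion matrices can't push the result
    outside [-1, 1]."""
    denom = math.sqrt(
        (true_pos + false_pos)
        * (true_pos + false_neg)
        * (true_neg + false_pos)
        * (true_neg + false_neg)
    )
    if denom == 0.0:
        return 0.0  # assumption: treat the undefined case as 0
    return clamp((true_pos * true_neg - false_pos * false_neg) / denom, -1.0, 1.0)
```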
So far, this has information on model parameter searches.
Add docs for Model Exploration
Because of the changes on main, we needed to regenerate the Sphinx docs.
Since this is now deprecated, replace most of the references to training.param_grid with equivalent references to training.model_parameter_search.
Update docs for training.param_grid
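A hedged sketch of the two config forms in TOML; the exact keys and values accepted by `training.model_parameter_search` are defined in the hlink docs, so treat the strategy value below as illustrative:

```toml
# Old, deprecated form:
[training]
param_grid = true

# New form (the strategy value shown is an assumption for illustration):
[training.model_parameter_search]
strategy = "grid"
```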
This is the beta pre-release for version 4. At this point, we expect all feature work and breaking changes to be done. There may be bug fixes, documentation improvements, and code cleanup still happening, but the general behavior should be pretty close to stable if all goes well.
This is a tracking PR for changes in version 4.0.0, and the issues that those changes are related to.
Changes
- Add a `core.model_metrics` module with functions for computing metrics on model confusion matrices. Include more metrics and information on the raw confusion matrices in model exploration output. Closes "Add F-measure to the computed model metrics, and include the raw confusion matrix in the output" #179.
- Change `core.classifier` functions to not interact with `threshold` and `threshold_ratio`. The caller should ensure that the passed dictionary only contains parameters for the model to be trained (see the sketch after this list). Closes "Don't handle threshold and threshold_ratio in core.classifier.choose_classifier()" #172.
- Refactor `core.threshold` and simplify the parameters required for a few of the functions. Closes "Simplify the interface to linking/core/threshold.py" #174.
- Add a `checkpoint_dir` argument to `SparkConnection`. Closes "Don't set the Spark checkpoint directory to the tmp directory" #181.
- Remove deprecated code. Closes "Remove the deprecated hlink.linking.transformers.interaction_transformer module" #98 and closes "Remove deprecation warnings and associated code for previous config structures" #127.
- Use `tomli` as the default TOML parser. Closes "Use a different TOML package" #45.
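To illustrate the `core.classifier` change above: callers are now responsible for stripping the thresholding keys out of the model parameters before passing them along. A hypothetical sketch (the parameter names besides `threshold` and `threshold_ratio` are made up):

```python
model_parameters = {
    "maxDepth": 5,        # hypothetical model parameter
    "numTrees": 100,      # hypothetical model parameter
    "threshold": 0.8,
    "threshold_ratio": 1.3,
}

# core.classifier no longer looks at threshold/threshold_ratio, so the
# caller removes them before handing the dict to the model.
classifier_params = {
    key: value
    for key, value in model_parameters.items()
    if key not in ("threshold", "threshold_ratio")
}
```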