Use EvoTrees instead of XGBoost in documentation #57

Merged · 24 commits · Jan 17, 2023
Changes from 15 commits
2 changes: 1 addition & 1 deletion .github/workflows/CI.yml
@@ -14,7 +14,7 @@ jobs:
      fail-fast: false
      matrix:
        version:
-         - '1.3'
+         - '1.6'
          - '1'
          - 'nightly'
        os:
4 changes: 1 addition & 3 deletions .gitignore
@@ -1,7 +1,5 @@
*.jl.*.cov
*.jl.cov
*.jl.mem
-/Manifest.toml
-/test/Manifest.toml
-/test/rstar/Manifest.toml
+Manifest.toml
/docs/build/
2 changes: 1 addition & 1 deletion Project.toml
@@ -27,7 +27,7 @@ SpecialFunctions = "0.8, 0.9, 0.10, 1, 2"
StatsBase = "0.33"
StatsFuns = "1"
Tables = "1"
-julia = "1.3"
+julia = "1.6"

[extras]
Pkg = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"
8 changes: 5 additions & 3 deletions docs/Project.toml
@@ -1,14 +1,16 @@
[deps]
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
+EvoTrees = "f6006082-12f8-11e9-0c9c-0d5d367ab1e5"
MCMCDiagnosticTools = "be115224-59cd-429b-ad48-344e309966f0"
MLJBase = "a7f614a8-145f-11e9-1d2a-a57a1082229d"
-MLJXGBoostInterface = "54119dfa-1dab-4055-a167-80440f4f7a91"
+MLJIteration = "614be32b-d00c-4edb-bd02-1eb411ab5e55"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"

[compat]
Documenter = "0.27"
+EvoTrees = "0.14.6"
MCMCDiagnosticTools = "0.2"
MLJBase = "0.19, 0.20, 0.21"
-MLJXGBoostInterface = "0.1, 0.2, 0.3"
-julia = "1.3"
+MLJIteration = "0.5"
+julia = "1.6"
40 changes: 26 additions & 14 deletions src/rstar.jl
@@ -40,21 +40,20 @@ function rstar(
        throw(ArgumentError("training and test data subsets must not be empty"))

    xtable = _astable(x)
+   ycategorical = MLJModelInterface.categorical(ysplit)
+   xdata, ydata = MLJModelInterface.reformat(classifier, xtable, ycategorical)

    # train classifier on training data
-   ycategorical = MLJModelInterface.categorical(ysplit)
-   xtrain = MLJModelInterface.selectrows(xtable, train_ids)
-   fitresult, _ = MLJModelInterface.fit(
-       classifier, verbosity, xtrain, ycategorical[train_ids]
-   )
+   xtrain, ytrain = MLJModelInterface.selectrows(classifier, train_ids, xdata, ydata)
+   fitresult, _ = MLJModelInterface.fit(classifier, verbosity, xtrain, ytrain)

    # compute predictions on test data
-   xtest = MLJModelInterface.selectrows(xtable, test_ids)
+   xtest, = MLJModelInterface.selectrows(classifier, test_ids, xdata)
+   ytest = ycategorical[test_ids]
    predictions = _predict(classifier, fitresult, xtest)

    # compute statistic
-   ytest = ycategorical[test_ids]
-   result = _rstar(predictions, ytest)
+   result = _rstar(classifier, predictions, ytest)

    return result
end
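The refactor above moves rstar onto MLJ's data front end: MLJModelInterface.reformat converts the user-facing table and labels into the classifier's internal representation once, and MLJModelInterface.selectrows then takes row subsets in that representation, so nothing is re-converted for the train/test split. A minimal sketch of the pattern (assuming MLJBase, EvoTrees, and Tables are installed; data and variable names are made up for illustration):

```julia
using MLJBase, EvoTrees, Tables
using MLJModelInterface
const MMI = MLJModelInterface

classifier = EvoTreeClassifier()
X = Tables.table(randn(100, 3))             # 100 draws, 3 "parameters"
y = MMI.categorical(rand(["1", "2"], 100))  # chain labels

# Convert once into the model-specific representation...
data = MMI.reformat(classifier, X, y)
# ...then subset rows cheaply in that representation.
train = MMI.selectrows(classifier, 1:50, data...)
fitresult, _ = MMI.fit(classifier, 0, train...)
```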
@@ -109,7 +108,7 @@ is returned (algorithm 2).
# Examples

```jldoctest rstar; setup = :(using Random; Random.seed!(101))
-julia> using MLJBase, MLJXGBoostInterface, Statistics
+julia> using MLJBase, MLJIteration, EvoTrees, Statistics

julia> samples = fill(4.0, 100, 3, 2);
```
@@ -118,7 +117,16 @@ One can compute the distribution of the ``R^*`` statistic (algorithm 2) with the
probabilistic classifier.

```jldoctest rstar
-julia> distribution = rstar(XGBoostClassifier(), samples);
+julia> model = IteratedModel(;
+           model=EvoTreeClassifier(; eta=0.005),
+           iteration_parameter=:nrounds,
+           resampling=Holdout(),
+           measures=log_loss,
+           controls=[Step(5), Patience(2), NumberLimit(100)],
+           retrain=true,
+       );
Review thread:

Member: It's too bad that this setup is so much more verbose than just calling XGBoostClassifier().

Member Author: Well, we can just use EvoTreeClassifier(; nrounds=100, eta=x) (I don't remember the default eta in XGBoostClassifier) and would get the same setting, since XGBoostClassifier just uses nrounds = 100 by default without any tuning of this hyperparameter. Based on the comments above, I thought it would be good to highlight how it can be set/estimated in a better way. Maybe we should add a comment, though, and show EvoTreeClassifier(; nrounds=100) as well.

Member: Yeah, for a usage example in the docstring I slightly prefer the simpler approach. But I agree that it is good to document the more robust approach somewhere.

+julia> distribution = rstar(model, samples);

julia> isapprox(mean(distribution), 1; atol=0.1)
true
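As the review thread above notes, XGBoostClassifier defaults to nrounds = 100 with no tuning, so a plain EvoTreeClassifier gives a comparable, much shorter setup. A sketch of that simpler variant (illustrative; not part of the merged docstring):

```julia
julia> simple_model = EvoTreeClassifier(; nrounds=100);

julia> distribution = rstar(simple_model, samples);
```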
@@ -129,9 +137,9 @@ Deterministic classifiers can also be derived from probabilistic classifiers by
predicting the mode. In MLJ this corresponds to a pipeline of models.

```jldoctest rstar
-julia> xgboost_deterministic = Pipeline(XGBoostClassifier(); operation=predict_mode);
+julia> evotree_deterministic = Pipeline(model; operation=predict_mode);

-julia> value = rstar(xgboost_deterministic, samples);
+julia> value = rstar(evotree_deterministic, samples);

julia> isapprox(value, 1; atol=0.2)
true
@@ -161,7 +169,9 @@ function rstar(classif::MLJModelInterface.Supervised, x::AbstractArray{<:Any,3};
end

# R⋆ for deterministic predictions (algorithm 1)
-function _rstar(predictions::AbstractVector{T}, ytest::AbstractVector{T}) where {T}
+function _rstar(
+    ::MLJModelInterface.Deterministic, predictions::AbstractVector, ytest::AbstractVector
+)
Review thread:

Member: Since we only support Deterministic and Probabilistic, perhaps we should constrain the types for rstar to only take Union{MLJModelInterface.Probabilistic,MLJModelInterface.Deterministic}, and update the docstring accordingly.

Member Author: Sure, why not; it's probably more user-friendly to error out when calling rstar. I wonder, though, how useful the information would be in the docstring - do users actually know about Probabilistic/Deterministic, or even MLJModelInterface?

Member: Perhaps not. And I'm not certain how common the other subtypes of Supervised are.

Member Author: We could also use StatisticalTraits, e.g. StatisticalTraits.prediction_type: throw a descriptive error if rstar is called with a model for which it is not :probabilistic or :deterministic (see https://github.com/JuliaAI/MLJModelInterface.jl/blob/d9e9703947fc04b0a5e63680289e41d0ba0d65bd/src/model_traits.jl#L27-L28), and dispatch on it (using Val; it seems all these traits return Symbols, but since they are based on the types of the models, the compiler should be smart enough to handle it).

Member Author: And we would not add anything more to the docstring, since we would not restrict the type at all.

Member Author: prediction_type shows up in the model search and when you print info(model): https://alan-turing-institute.github.io/MLJ.jl/dev/model_search/ So it seems to be quite official?

Member:

> This point would actually be the strongest argument in favour of traits: we would also support models that are not subtypes of Probabilistic or Deterministic (for whatever reason) and that we do not know about, but whose predictions would still be of the desired form (probabilistic or deterministic).

I think this presumes that the results of fit and predict for such models would be of the same form as we need. But it's not clear to me from the docs that they are (e.g. https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/#Outlier-detection-models). We would be relying on an undocumented trait implementation, where the decisions about which traits apply are not defined anywhere. In particular, for supervised outlier detection, it's not clear to me whether these models support multiple labels as we have. To be convinced, I'd need to see a test case of one of these detectors (see https://github.com/OutlierDetectionJL/OutlierDetection.jl) used to compute rstar.

Member Author:

> So it seems to be quite official?

There's even an example that filters on prediction_type on the same page:

julia> filter(model) = model.is_supervised &&
                       model.input_scitype >: MLJ.Table(Continuous) &&
                       model.target_scitype >: AbstractVector{<:Multiclass{3}} &&
                       model.prediction_type == :deterministic

Member: It seems there are large portions of their traits interface that are undocumented. I wonder if, instead of using prediction_type, it is more useful to use input_scitype to check that the model accepts tables of continuous values, target_scitype to check that the model supports multiclass labels, and then predict_scitype to determine whether the predictions are labels, probabilities, or something else (error). These have the benefit that the scitype interface is documented (mostly; not predict_scitype: https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/#Trait-declarations) and connects directly to what we need (we assume the model accepts certain inputs and labels and makes certain types of predictions).

Member Author: I think that's a good idea, but unfortunately it seems that Pipelines do not support the traits properly (maybe they can't - even though, when working with instances, all the information should be available?):

julia> using MLJBase, MLJXGBoostInterface, MLJModelInterface

julia> const MMI = MLJModelInterface

julia> classifier = XGBoostClassifier();

julia> MMI.input_scitype(classifier)
Table{<:AbstractVector{<:Continuous}}

julia> MMI.target_scitype(classifier)
AbstractVector{<:Finite} (alias for AbstractArray{<:Finite, 1})

julia> MMI.predict_scitype(classifier)
AbstractVector{Density{<:Finite}} (alias for AbstractArray{ScientificTypesBase.Density{<:Finite}, 1})

julia> classifier = Pipeline(XGBoostClassifier(); operation=predict_mode);

julia> MMI.input_scitype(classifier)
Unknown

julia> MMI.target_scitype(classifier)
AbstractVector{<:Finite} (alias for AbstractArray{<:Finite, 1})

julia> MMI.predict_scitype(classifier)
Unknown

And supporting Unknown as well seems less satisfying, since that is the fallback for models that don't implement the traits... I wonder if this could/should be fixed in MLJBase for Pipeline, and hence in principle could work for models that implement the traits? In any case, we could at least dispatch on the supported prediction types in _rstar instead of restricting it to specific model types.

    length(predictions) == length(ytest) ||
        error("numbers of predictions and targets must be equal")
    mean_accuracy = Statistics.mean(p == y for (p, y) in zip(predictions, ytest))
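A rough sketch of the trait-based dispatch floated in the thread above (hypothetical: the merged code dispatches on the Deterministic/Probabilistic model types instead, and prediction_val is an illustrative name, not part of the package):

```julia
using MLJModelInterface
const MMI = MLJModelInterface

# Map a model to a dispatchable trait value; error for unsupported models.
# prediction_type is determined by the model's type, so the compiler can
# usually resolve the resulting Val at compile time.
function prediction_val(model)
    pt = MMI.prediction_type(model)
    pt === :probabilistic || pt === :deterministic ||
        throw(ArgumentError("unsupported prediction type: $pt"))
    return Val(pt)
end

# _rstar could then define methods like
#     _rstar(::Val{:deterministic}, predictions, ytest)  # algorithm 1
#     _rstar(::Val{:probabilistic}, predictions, ytest)  # algorithm 2
# instead of restricting the first argument to specific model types.
```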
@@ -170,7 +180,9 @@ function _rstar(predictions::AbstractVector{T}, ytest::AbstractVector{T}) where
end

# R⋆ for probabilistic predictions (algorithm 2)
-function _rstar(predictions::AbstractVector, ytest::AbstractVector)
+function _rstar(
+    ::MLJModelInterface.Probabilistic, predictions::AbstractVector, ytest::AbstractVector
+)
    length(predictions) == length(ytest) ||
        error("numbers of predictions and targets must be equal")

11 changes: 7 additions & 4 deletions test/Project.toml
@@ -1,14 +1,15 @@
[deps]
Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
DynamicHMC = "bbc10e6e-7c05-544b-b16e-64fede858acb"
+EvoTrees = "f6006082-12f8-11e9-0c9c-0d5d367ab1e5"
FFTW = "7a1cc6ca-52ef-59f5-83cd-3a7055c09341"
LogDensityProblems = "6fdf6af0-433a-55f7-b3ed-c6c6e0b8df7c"
LogExpFunctions = "2ab3a3ac-af41-5b50-aa03-7779005ae688"
MCMCDiagnosticTools = "be115224-59cd-429b-ad48-344e309966f0"
MLJBase = "a7f614a8-145f-11e9-1d2a-a57a1082229d"
+MLJIteration = "614be32b-d00c-4edb-bd02-1eb411ab5e55"
MLJLIBSVMInterface = "61c7150f-6c77-4bb1-949c-13197eac2a52"
MLJXGBoostInterface = "54119dfa-1dab-4055-a167-80440f4f7a91"
-Pkg = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
@@ -18,13 +19,15 @@ Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
[compat]
Distributions = "0.25"
DynamicHMC = "3"
+EvoTrees = "0.14.6"
FFTW = "1.1"
LogDensityProblems = "0.12, 1, 2"
LogExpFunctions = "0.3"
MCMCDiagnosticTools = "0.2"
MLJBase = "0.19, 0.20, 0.21"
-MLJLIBSVMInterface = "0.1, 0.2"
-MLJXGBoostInterface = "0.1, 0.2, 0.3"
+MLJIteration = "0.5"
+MLJLIBSVMInterface = "0.2"
+MLJXGBoostInterface = "0.3"
StatsBase = "0.33"
Tables = "1"
-julia = "1.3"
+julia = "1.6"
31 changes: 28 additions & 3 deletions test/rstar.jl
@@ -1,6 +1,7 @@
using MCMCDiagnosticTools

using Distributions
+using EvoTrees
using MLJBase
using MLJLIBSVMInterface
using MLJXGBoostInterface
@@ -9,13 +10,27 @@ using Tables
using Random
using Test

-const xgboost_deterministic = Pipeline(XGBoostClassifier(); operation=predict_mode)
+# XGBoost errors on 32bit systems: https://github.com/dmlc/XGBoost.jl/issues/92
+const XGBoostClassifiers = if Sys.WORD_SIZE == 64
+    (
+        XGBoostClassifier(),
+        Pipeline(XGBoostClassifier(); operation=predict_mode),
+    )
+else
+    ()
+end

@testset "rstar.jl" begin
-   classifiers = (XGBoostClassifier(), xgboost_deterministic, SVC())
N = 1_000

@testset "samples input type: $wrapper" for wrapper in [Vector, Array, Tables.table]
+    # In practice, you probably want to use EvoTreeClassifier with early stopping
+    classifiers = (
+        EvoTreeClassifier(; nrounds=100, eta=0.3),
+        Pipeline(EvoTreeClassifier(; nrounds=100, eta=0.3); operation=predict_mode),
+        SVC(),
+        XGBoostClassifiers...,
+    )
@testset "examples (classifier = $classifier)" for classifier in classifiers
sz = wrapper === Vector ? N : (N, 2)
# Compute R⋆ statistic for a mixed chain.
@@ -111,8 +126,18 @@ const xgboost_deterministic = Pipeline(XGBoostClassifier(); operation=predict_mode)
i += 1
end

+    # In practice, you probably want to use EvoTreeClassifier with early stopping
+    rng = MersenneTwister(42)
+    classifiers = (
+        EvoTreeClassifier(; rng=rng, nrounds=100, eta=0.3),
+        Pipeline(
+            EvoTreeClassifier(; rng=rng, nrounds=100, eta=0.3); operation=predict_mode
+        ),
+        SVC(),
+        XGBoostClassifiers...,
+    )
@testset "classifier = $classifier" for classifier in classifiers
-        rng = MersenneTwister(42)
+        Random.seed!(rng, 42)
dist1 = rstar(rng, classifier, samples_mat, chain_inds)
Random.seed!(rng, 42)
dist2 = rstar(rng, classifier, samples)
9 changes: 1 addition & 8 deletions test/runtests.jl
@@ -1,5 +1,3 @@
-using Pkg
-
using MCMCDiagnosticTools
using FFTW

@@ -40,11 +38,6 @@ Random.seed!(1)
include("rafterydiag.jl")
end
@testset "R⋆ diagnostic" begin
-        # XGBoost errors on 32bit systems: https://github.com/dmlc/XGBoost.jl/issues/92
-        if Sys.WORD_SIZE == 64
-            include("rstar.jl")
-        else
-            @info "R⋆ not tested: requires 64bit architecture"
-        end
+        include("rstar.jl")
end
end