Commit

Merge pull request #713 from mlr-org/final_global
Final global
RaphaelS1 authored Jul 4, 2023
2 parents 8aafe64 + ded8248 commit 09ec526
Showing 20 changed files with 175 additions and 209 deletions.
26 changes: 8 additions & 18 deletions R/links.R
@@ -12,25 +12,18 @@ update_db = function() {
#' Creates a markdown link to a function reference.
#'
#' @param topic Name of the topic to link against.
-#' @param text Text to use for the link. Defaults to the topic name.
#' @param index If `TRUE` calls `index`
#' @param aside Passed to `index`
#'
#' @return (`character(1)`) markdown link.
#' @export
-ref = function(topic, text = NULL, index = FALSE, aside = FALSE) {
+ref = function(topic, index = FALSE, aside = FALSE) {

strip_parenthesis = function(x) sub("\\(\\)$", "", x)

checkmate::assert_string(topic, pattern = "^[[:alnum:]._-]+(::[[:alnum:]._-]+)?(\\(\\))?$")
-checkmate::assert_string(text, min.chars = 1L, null.ok = TRUE)

topic = trimws(topic)
-text = if (is.null(text)) {
-  topic
-} else {
-  trimws(text)
-}

if (stringi::stri_detect_fixed(topic, "::")) {
parts = strsplit(topic, "::", fixed = TRUE)[[1L]]
@@ -64,10 +57,10 @@ ref = function(topic, text = NULL, index = FALSE, aside = FALSE) {
url = sprintf("https://www.rdocumentation.org/packages/%s/topics/%s", pkg, name)
}

-out = sprintf("[`%s`](%s)", text, url)
+out = sprintf("[`%s`](%s)", topic, url)

if (index || aside) {
-out = paste0(out, index(main = NULL, index = text, aside = aside, code = TRUE))
+out = paste0(out, index(main = NULL, index = topic, aside = aside, code = TRUE))
}

out
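Taken together, these hunks make `ref()` always use the topic itself as the link text. The resulting behavior can be sketched with a minimal, self-contained function — the name `ref_sketch`, the hard-coded package, and the omission of the topic database are simplifications for illustration, not the real implementation:

```r
# Minimal sketch of ref() after this change: with the `text` argument removed,
# the displayed link text is always the (trimmed) topic itself.
ref_sketch = function(topic, pkg = "mlr3") {  # `pkg` is hard-coded here; the
  topic = trimws(topic)                       # real ref() resolves it from a db
  name = sub("\\(\\)$", "", topic)            # strip a trailing "()" for the URL
  url = sprintf("https://www.rdocumentation.org/packages/%s/topics/%s", pkg, name)
  sprintf("[`%s`](%s)", topic, url)
}
ref_sketch("msr()")
# -> "[`msr()`](https://www.rdocumentation.org/packages/mlr3/topics/msr)"
```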
@@ -218,7 +211,9 @@ index = function(main = NULL, index = NULL, aside = FALSE, code = FALSE, lower =
out = sprintf("%s\\index{%s}", out, index)

if (aside) {
-if (is.null(asidetext)) asidetext = if (code) main else toproper(main)
+if (is.null(asidetext)) {
+  asidetext = if (code || !lower) main else toproper(main)
+}
out = sprintf("%s[%s]{.aside}", out, asidetext)
}
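The corrected aside logic can be isolated in a small sketch. Here `toproper_sketch` is a hypothetical stand-in for the book's `toproper()` helper, and `aside_text` extracts just the fixed conditional:

```r
# Sketch of the fixed aside-text rule: text rendered as code, or text with
# lowercasing disabled, is kept verbatim; otherwise it is title-cased.
toproper_sketch = function(x) {  # stand-in for the book's toproper() helper
  paste0(toupper(substring(x, 1, 1)), substring(x, 2))
}
aside_text = function(main, code = FALSE, lower = TRUE) {
  if (code || !lower) main else toproper_sketch(main)
}
aside_text("random forest")            # -> "Random forest"
aside_text("data.table", code = TRUE)  # -> "data.table" (kept verbatim)
```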

@@ -238,15 +233,10 @@ define = function(text) {
#' Creates markdown link and footnote with full link
#'
#' @param url URL to link to
-#' @param text Text to display in main text
#'
#' @export
-link = function(url, text = NULL) {
-  if (is.null(text)) {
-    sprintf("[%s](%s)", url, url)
-  } else {
-    sprintf("[%s](%s)^[[%s](%s)]", text, url, url, url)
-  }
+link = function(url) {
+  sprintf("[%s](%s)", url, url)
}
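The simplified `link()` is small enough to run standalone; this copy mirrors the new two-line body so the behavior can be checked directly:

```r
# Standalone copy of the simplified link(): the URL doubles as the link text,
# since the `text` argument (and its footnote variant) was removed.
link = function(url) {
  sprintf("[%s](%s)", url, url)
}
link("https://github.com/mlr-org/mlr3")
# -> "[https://github.com/mlr-org/mlr3](https://github.com/mlr-org/mlr3)"
```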

#' @name paradox
24 changes: 12 additions & 12 deletions book/chapters/appendices/solutions.qmd
@@ -98,7 +98,7 @@ rr$aggregate(msr("regr.mse"))
```


-2. Use `tsk("spam")` and five-fold CV to benchmark Random Forest (`lrn("classif.ranger")`), Logistic Regression (`lrn("classif.log_reg")`), and XGBoost (`lrn("classif.xgboost")`) with respect to AUC.
+2. Use `tsk("spam")` and five-fold CV to benchmark Random forest (`lrn("classif.ranger")`), Logistic Regression (`lrn("classif.log_reg")`), and XGBoost (`lrn("classif.xgboost")`) with respect to AUC.
Which learner appears to do best? How confident are you in your conclusion?
How would you improve upon this?

@@ -121,7 +121,7 @@ This is only a small example for a benchmark workflow, but without tuning (see @

3. A colleague claims to have achieved a 93.1% classification accuracy using `lrn("classif.rpart")` on `tsk("penguins_simple")`.
You want to reproduce their results and ask them about their resampling strategy.
-They said they used a custom 3-fold CV with folds assigned as `factor(task$row_ids %% 3)`.
+They said they used a custom three-fold CV with folds assigned as `factor(task$row_ids %% 3)`.
See if you can reproduce their results.
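Before running the full mlr3 solution, the described fold assignment can be previewed with plain R — row ids 1 through 9 are chosen purely for illustration:

```r
# Preview of the colleague's custom fold assignment: row ids are mapped to
# three folds by their remainder modulo 3 (ids 1..9 used for illustration).
row_ids = 1:9
folds = factor(row_ids %% 3)
split(row_ids, folds)
# Fold "0" gets ids 3, 6, 9; fold "1" gets 1, 4, 7; fold "2" gets 2, 5, 8.
```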

```{r solutions-008}
@@ -144,7 +144,7 @@ rr$aggregate(msr("classif.acc"))

1. Tune the `mtry`, `sample.fraction`, `num.trees` hyperparameters of a random forest model (`lrn("regr.ranger")`) on the `"mtcars"` task.
Use a simple random search with 50 evaluations and select a suitable batch size.
-Evaluate with a 3-fold CV and the root mean squared error.
+Evaluate with a three-fold CV and the root mean squared error.

```{r solutions-009}
set.seed(4)
@@ -168,7 +168,7 @@ tuner$optimize(instance)
```

2. Evaluate the performance of the model created in Question 1 with nested resampling.
-Use a holdout validation for the inner resampling and a 3-fold CV for the outer resampling.
+Use a holdout validation for the inner resampling and a three-fold CV for the outer resampling.
Print the unbiased performance estimate of the model.

```{r solutions-010}
@@ -320,7 +320,7 @@ optimizer_rf$optimize(instance)
rf_data = instance$archive$data
rf_data[, y_min := cummin(y)]
rf_data[, nr_eval := seq_len(.N)]
-rf_data[, surrogate := "Random Forest"]
+rf_data[, surrogate := "Random forest"]
```

We collect all data and visualize the anytime performance:
@@ -535,8 +535,8 @@ as.data.table(aggr)[, .(learner_id, classif.acc, time_train)]

## Solutions to @sec-pipelines

-1. Concatenate the named PipeOps using the `r ref("concat_graphs", "%>>%")` operator.
-To get a `r ref("Learner")` object, use `r ref("as_learner", "as_learner()")`
+1. Concatenate the named PipeOps using `%>>%`.
+To get a `r ref("Learner")` object, use `r ref("as_learner()")`
```{r pipelines-001}
library(mlr3pipelines)
library(mlr3learners)
@@ -545,7 +545,7 @@ graph = po("imputeoor") %>>% po("scale") %>>% lrn("classif.log_reg")
graph_learner = as_learner(graph)
```

-2. After training, the underlying `lrn("classif.log_reg")` can be accessed through the `$base_learner()` method.
+1. After training, the underlying `lrn("classif.log_reg")` can be accessed through the `$base_learner()` method.
Alternatively, the learner can be accessed explicitly using `po("learner")`.
```{r pipelines-002}
graph_learner$train(tsk("pima"))
@@ -766,7 +766,7 @@ This improves the average error of our model by a further 1600$.

## Solutions to @sec-special

-1. Run a benchmark experiment on the `"german_credit"` task with algorithms: `lrn("classif.featureless")`, `lrn("classif.log_reg")`, `lrn("classif.ranger")`. Tune the `lrn("classif.featureless")` model using `tunetreshold` and `learner_cv`. Use 2-fold CV and evaluate with `msr("classif.costs", costs = costs)` where you should make the parameter `costs` so that the cost of a true positive is -10, the cost of a true negative is -1, the cost of a false positive is 2, and the cost of a false negative is 3. Use `set.seed(11)` to make sure you get the same results as us. Are your results surprising?
+1. Run a benchmark experiment on the `"german_credit"` task with algorithms: `lrn("classif.featureless")`, `lrn("classif.log_reg")`, `lrn("classif.ranger")`. Tune the `lrn("classif.featureless")` model using `tunetreshold` and `learner_cv`. Use two-fold CV and evaluate with `msr("classif.costs", costs = costs)` where you should make the parameter `costs` so that the cost of a true positive is -10, the cost of a true negative is -1, the cost of a false positive is 2, and the cost of a false negative is 3. Use `set.seed(11)` to make sure you get the same results as us. Are your results surprising?

```{r solutions-017}
library(mlr3verse)
@@ -999,7 +999,7 @@ if (file.exists(path_automl_suite)) {
}
```

-2. Find all tasks with less than 4000 observations and convert them to mlr3 tasks.
+2. Find all tasks with less than 4000 observations and convert them to `mlr3` tasks.

We can find all tasks with less than 4000 observations by specifying this constraint in `r ref("mlr3oml::list_oml_tasks()")`.

@@ -1036,7 +1036,7 @@ We can create `mlr3` tasks from these OpenML IDs using `tsk("oml")`.
tasks = lapply(tbl$task_id, function(id) tsk("oml", task_id = id))
```

-3. Create an experimental design that compares `regr.ranger` to `regr.rpart`, use the robustify pipeline for both learners and a featureless fallback learner. Use 3-fold cross-validation instead of the OpenML resamplings.
+3. Create an experimental design that compares `regr.ranger` to `regr.rpart`, use the robustify pipeline for both learners and a featureless fallback learner. Use three-fold cross-validation instead of the OpenML resamplings.

Below, we define the robustified learners with a featureless fallback learner.

@@ -1331,7 +1331,7 @@ lrn_2 = po("learner_cv", lrn("classif.rpart")) %>>%
po("EOd")
```

-And run the benchmark again. Note, that we use 3-fold CV this time for comparison.
+And run the benchmark again. Note, that we use three-fold CV this time for comparison.

```{r}
learners = list(learner, lrn_1, lrn_2)
8 changes: 4 additions & 4 deletions book/chapters/chapter1/introduction_and_overview.qmd
@@ -9,7 +9,7 @@ Welcome to the **M**achine **L**earning in **R** universe.
In this book, we will guide you through the functionality offered by `mlr3` step by step.
If you want to contribute to our universe, ask any questions, read documentation, or just chat with the team, head to `r link("https://github.com/mlr-org/mlr3")` which has several useful links in the README.

-The `r mlr3` (@mlr3) package and the wider `mlr3` ecosystem provide a generic, `r index("object-oriented", "object-oriented programming")`, and extensible framework for `r index("regression")` (@sec-tasks), `r index("classification")` (@sec-classif), and other machine learning `r index("tasks")` (@sec-special) for the R language (@R).
+The `r mlr3` [@mlr3] package and the wider `mlr3` ecosystem provide a generic, `r index("object-oriented", "object-oriented programming")`, and extensible framework for `r index("regression")` (@sec-tasks), `r index("classification")` (@sec-classif), and other machine learning `r index("tasks")` (@sec-special) for the R language [@R].
On the most basic level, the unified interface provides functionality to train, test, and evaluate many machine learning algorithms.
You can also take this a step further with hyperparameter optimization, computational pipelines, model interpretation, and much more.
`mlr3` has similar overall aims to `caret` and `tidymodels` for R, `scikit-learn` for Python, and `MLJ` for Julia.
@@ -60,7 +60,7 @@ You can see an up-to-date list of all our extension packages at `r link("https:/
The `mlr3` ecosystem is the result of many years of methodological and applied research.
This book describes the resulting features and discusses best practices for ML, technical implementation details, and in-depth considerations for model optimization.
This book may be helpful for both practitioners who want to quickly apply machine learning (ML) algorithms and researchers who want to implement, benchmark, and compare their new methods in a structured environment.
-Whilst we hope this book is accessible to a wide range of readers and levels of ML expertise, we do assume that readers have taken at least an introductory ML course or have the equivalent expertise and some basic experience with R.
+While we hope this book is accessible to a wide range of readers and levels of ML expertise, we do assume that readers have taken at least an introductory ML course or have the equivalent expertise and some basic experience with R.
A background in computer science or statistics is beneficial for understanding the advanced functionality described in the later chapters of this book, but not required.
A comprehensive ML introduction for those new to the field can be found in @james_introduction_2014.
@Wickham2017R provides a comprehensive introduction to data science in R.
@@ -348,7 +348,7 @@ Plot types are documented in the respective manual page that can be accessed thr

{{< include ../../common/_optional.qmd >}}

-The `r ref_pkg("mlr")`\index{\texttt{mlr}} package (@mlr) was first released on CRAN in 2013, with the core design and architecture dating back further.
+The `r ref_pkg("mlr")`\index{\texttt{mlr}} package [@mlr] was first released on CRAN in 2013, with the core design and architecture dating back further.
Over time, the addition of many features led to a complex design that made it too difficult for us to extend further.
In hindsight, we saw that some design and architecture choices in `r ref_pkg("mlr")` made it difficult to support new features, in particular with respect to ML pipelines.
So in 2018, we set about working on a reimplementation, which resulted in the first release of `r mlr3` on CRAN in July 2019.
@@ -365,7 +365,7 @@ Embrace `r ref_pkg("data.table")` for its top-notch computational performance as
This considerably simplifies the API and allows easy selection and "split-apply-combine" (aggregation) operations.
We combine `data.table` and `R6` to place references to non-atomic and compound objects in tables and make heavy use of list columns.
* **Defensive programming and type safety**.
-All user input is checked with `r ref_pkg("checkmate")` (@checkmate).
+All user input is checked with `r ref_pkg("checkmate")` [@checkmate].
We use `data.table`, which has behavior that is more consistent than several base R methods (e.g., indexing `data.frame`s simplifies the result when the `drop` argument is omitted).
And we have extensive unit tests!
* **Light on dependencies**.
14 changes: 6 additions & 8 deletions book/chapters/chapter10/advanced_technical_aspects_of_mlr3.qmd
@@ -90,8 +90,8 @@ There are also options to control the chunk size for parallelization in `mlr3`,
::: {.callout-tip}
# Reproducibility

-Reproducibility is often a concern during parallelization because special Pseudorandom number generators (PRNGs) may be required (@future119).
-However, `r ref_pkg("future")` ensures that all workers will receive the same PRNG streams, independent of the number of workers (@future119).
+Reproducibility is often a concern during parallelization because special Pseudorandom number generators (PRNGs) may be required [@future119].
+However, `r ref_pkg("future")` ensures that all workers will receive the same PRNG streams, independent of the number of workers [@future119].
Therefore, `mlr3` experiments will be reproducible as long as you use `set.seed` at the start of your scripts (with the PRNG of your choice).
:::
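The seeding claim in this callout is easy to check with base R alone — any PRNG-dependent computation would do in place of `rnorm()`:

```r
# Seeding once at the top of a script makes PRNG-dependent results repeatable;
# mlr3 relies on the same mechanism (via future's per-worker PRNG streams).
set.seed(42)
first_run = rnorm(3)
set.seed(42)
second_run = rnorm(3)
identical(first_run, second_run)
# -> TRUE
```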

@@ -136,7 +136,7 @@ set_threads(lrn_ranger, n = 4)
```

If we did not specify an argument for the `n` parameter then the default is a heuristic to detect the correct number using `r ref("parallelly::availableCores()")`.
-This heuristic is not always ideal (interested readers might want to look up "Amdahl's Law") and utilizing all available cores is occasionally counterproductive and can slow down overall runtime (@avoiddetect), moreover using all cores is not ideal if:
+This heuristic is not always ideal (interested readers might want to look up "Amdahl's Law") and utilizing all available cores is occasionally counterproductive and can slow down overall runtime [@avoiddetect], moreover using all cores is not ideal if:

* You want to simultaneously use your system for other purposes.
* You are on a multi-user system and want to spare some resources for other users.
@@ -599,7 +599,7 @@ We strongly recommend the final option, which is statistically sound and can be

To make this procedure convenient during resampling and benchmarking, we support fitting a baseline (though in theory you could use any `Learner`) as a `r index('fallback learner')` by passing a `r ref("Learner")` to `r index('$fallback', parent = "Learner", aside = TRUE, code = TRUE)`.
In the next example, we add a classification baseline to our debug learner, so that when the debug learner errors, `mlr3` falls back to the predictions of the featureless learner internally.
-Note that whilst encapsulation is not enabled explicitly, it is automatically enabled and set to `"evaluate"` if a fallback learner is added.
+Note that while encapsulation is not enabled explicitly, it is automatically enabled and set to `"evaluate"` if a fallback learner is added.

```{r technical-022}
lrn_debug = lrn("classif.debug", error_train = 1)
@@ -851,7 +851,7 @@ The backend can also operate on a folder with multiple parquet files.
## Extending mlr3 and Defining a New `Measure` {#sec-extending}

After getting this far in the book you are well on your way to being an `mlr3` expert and may even want to add more classes to our universe.
-Whilst many classes could be extended, all have a similar design interface and so, we will only demonstrate how to create a custom `r ref("Measure")`.
+While many classes could be extended, all have a similar design interface and so, we will only demonstrate how to create a custom `r ref("Measure")`.
If you are interested in implementing new learners, `PipeOp`s, or tuners, then check out the vignettes in the respective packages: `r mlr3extralearners`, `r mlr3pipelines`, or `r mlr3tuning`.
If you are considering creating a package that adds an entirely new task type then feel free to contact us for some support via GitHub, email, or Mattermost.
This section assumes good knowledge of `R6`, see @sec-r6 for a brief introduction and references to further resources.
@@ -928,7 +928,7 @@ mlr3::mlr_measures$add("regr.thresh_acc", MeasureRegrThresholdAcc)
prediction$score(msr("regr.thresh_acc"))
```
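The full definition of the example measure is not shown in this excerpt; assuming it scores a regression prediction as correct when it falls within a fixed threshold of the truth, its core computation might be sketched like this (the function name and default threshold are illustrative, not the book's actual code):

```r
# Hypothetical core of a thresholded regression accuracy (assumed definition:
# a prediction counts as correct when within `threshold` of the true value).
thresh_acc = function(truth, response, threshold = 0.5) {
  mean(abs(truth - response) < threshold)
}
thresh_acc(truth = c(1, 2, 3), response = c(1.2, 2.9, 3.1))
# -> 0.667 (rounded): the second prediction misses by 0.9
```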

-Whilst we only covered how to create a simple regression measure, the process of adding other classes to our universe is in essence the same:
+While we only covered how to create a simple regression measure, the process of adding other classes to our universe is in essence the same:

1. Find the right class to inherit from
2. Add methods that:
@@ -949,8 +949,6 @@ If you are interested to learn more about parallelization in R, we recommend @Sc
To find out more about logging, have a read of the vignettes in `lgr`, which cover everything from logging to JSON files to retrieving logged objects for debugging.
For an overview of available DBMS in R, see the CRAN task view on databases at `r link("https://cran.r-project.org/view=Databases")`, and in particular the vignettes of the `dbplyr` package for DBMS readily available in `mlr3`.

-@tbl-technical-api summarizes the most important classes, functions, and methods seen in this chapter.
-
| Class | Constructor/Function | Fields/Methods |
| --- | --- | --- |
| - | `r ref("future::plan()")` | - |