newtonchat/data/SKLearn_subjects.json

[{"name": "Supervised learning", "children": [{"name": "Linear Models", "description": "Linear models are a way of describing a response variable in terms of a linear combination of predictor variables. The response should be a continuous variable and be at least approximately normally distributed. Such models find wide application, but cannot handle clearly discrete or skewed continuous responses [1].", "url": "https://scikit-learn.org/stable/modules/linear_model.html", "children": [{"name": "Ordinary Least Squares", "description": "OLS fits a linear model to the data. The model chooses coefficients to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation. The coefficient estimates for Ordinary Least Squares rely on the independence of the features. When features are correlated and the columns of the design matrix have an approximately linear dependence, the design matrix becomes close to singular and, as a result, the least-squares estimate becomes highly sensitive to random errors in the observed target, producing a large variance. This situation of multicollinearity can arise, for example, when data are collected without an experimental design.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares"}, {"name": "Non-Negative Least Squares", "description": "It is possible to constrain all the coefficients to be non-negative, which may be useful when they represent some physical or naturally non-negative quantities (e.g., frequency counts or prices of goods). 
LinearRegression accepts a boolean positive parameter: when set to True, Non-Negative Least Squares is applied.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#non-negative-least-squares"}, {"name": "Ridge regression and classification", "description": "Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefficients. Ridge chooses coefficients to minimize a penalized residual sum of squares. \n\nIt might seem questionable to use a (penalized) Least Squares loss to fit a classification model instead of the more traditional logistic or hinge losses. However, in practice, all those models can lead to similar cross-validation scores in terms of accuracy or precision/recall, while the penalized least squares loss used by the RidgeClassifier allows for a very different choice of the numerical solvers with distinct computational performance profiles.\n\nThe RidgeClassifier can be significantly faster than e.g. LogisticRegression with a high number of classes because it can compute the projection matrix only once.\n\nThe Ridge classifier converts binary targets to {-1, 1} and then treats the problem as a regression task, optimizing the same objective as above.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression-and-classification"}, {"name": "Ridge classification", "description": "Ridge addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefficients. The RidgeClassifier can be significantly faster than e.g. 
LogisticRegression with a high number of classes because it can compute the projection matrix only once.\n\nThe Ridge classifier converts binary targets to {-1, 1} and then treats the problem as a regression task, optimizing the same objective as above.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#ridge-classification"}, {"name": "Ridge with Cross Validation", "description": "RidgeCV implements ridge regression with built-in cross-validation of the alpha parameter. The object works in the same way as GridSearchCV except that it defaults to Leave-One-Out Cross-Validation.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#ridge-with-cross-validation"}, {"name": "Lasso", "description": "The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given solution is dependent. For this reason, Lasso and its variants are fundamental to the field of compressed sensing. 
Under certain conditions, it can recover the exact set of non-zero coefficients (see Compressive sensing: tomography reconstruction with L1 prior (Lasso)).", "url": "https://scikit-learn.org/stable/modules/linear_model.html#lasso"}, {"name": "LassoCV", "description": "LassoCV sets the Lasso alpha parameter by cross-validation. It is commonly preferred for high-dimensional datasets with many collinear features.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#lassocv"}, {"name": "LassoLarsCV", "description": "LassoLarsCV is based on the Least Angle Regression algorithm. It has the advantage of exploring more relevant values of the alpha parameter, and if the number of samples is very small compared to the number of features, it is often faster than LassoCV.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#lassolarscv"}, {"name": "LassoLarsIC", "description": "LassoLarsIC is a Lasso model fit with Lars using BIC or AIC for model selection. It is a computationally cheaper alternative for finding the optimal value of alpha, as the regularization path is computed only once instead of k+1 times when using k-fold cross-validation. The information criteria penalize the over-optimistic scores of the different Lasso models by their flexibility. However, such criteria also tend to break down when the problem is badly conditioned (e.g. more features than samples).", "url": "https://scikit-learn.org/stable/modules/linear_model.html#lassolarsic"}, {"name": "Multi-task Lasso", "description": "The MultiTaskLasso is a linear model that estimates sparse coefficients for multiple regression problems jointly: y is a 2D array, of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all the regression problems, also called tasks.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#multi-task-lasso"}, {"name": "Elastic-Net", "description": "Elastic-net is useful when there are multiple features that are correlated with one another. 
Lasso is likely to pick one of these at random, while elastic-net is likely to pick both. A practical advantage of trading off between Lasso and Ridge is that it allows Elastic-Net to inherit some of Ridge's stability under rotation.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#elastic-net"}, {"name": "Multi-task Elastic-Net", "description": "The MultiTaskElasticNet is an elastic-net model that estimates sparse coefficients for multiple regression problems jointly: Y is a 2D array of shape (n_samples, n_tasks). The constraint is that the selected features are the same for all the regression problems, also called tasks.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#multi-task-elastic-net"}, {"name": "Least Angle Regression", "description": "Least-angle regression (LARS) is a regression algorithm for high-dimensional data, developed by Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani. LARS is similar to forward stepwise regression. At each step, it finds the feature most correlated with the target. When there are multiple features having equal correlation, instead of continuing along the same feature, it proceeds in a direction equiangular between the features.\n\nThe advantages of LARS are:\n\nIt is numerically efficient in contexts where the number of features is significantly greater than the number of samples.\n\nIt is computationally just as fast as forward selection and has the same order of complexity as ordinary least squares.\n\nIt produces a full piecewise linear solution path, which is useful in cross-validation or similar attempts to tune the model.\n\nIf two features are almost equally correlated with the target, then their coefficients should increase at approximately the same rate. 
The algorithm thus behaves as intuition would expect, and also is more stable.\n\nIt is easily modified to produce solutions for other estimators, like the Lasso.\n\nThe disadvantages of the LARS method include:\n\nBecause LARS is based upon an iterative refitting of the residuals, it would appear to be especially sensitive to the effects of noise. This problem is discussed in detail by Weisberg in the discussion section of the Efron et al. (2004) Annals of Statistics article.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#least-angle-regression"}, {"name": "LARS Lasso", "description": "LassoLars is a lasso model implemented using the LARS algorithm, and unlike the implementation based on coordinate descent, this yields the exact solution, which is piecewise linear as a function of the norm of its coefficients.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#lars-lasso"}, {"name": "Orthogonal Matching Pursuit (OMP)", "description": "OrthogonalMatchingPursuit and orthogonal_mp implement the OMP algorithm for approximating the fit of a linear model with constraints imposed on the number of non-zero coefficients (i.e. the L0 pseudo-norm). Being a forward feature selection method like Least Angle Regression, orthogonal matching pursuit can approximate the optimum solution vector with a fixed number of non-zero elements. Alternatively, orthogonal matching pursuit can target a specific error instead of a specific number of non-zero coefficients. OMP is based on a greedy algorithm that includes at each step the atom most highly correlated with the current residual. 
It is similar to the simpler matching pursuit (MP) method, but better in that at each iteration, the residual is recomputed using an orthogonal projection on the space of the previously chosen dictionary elements.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#orthogonal-matching-pursuit-(omp)"}, {"name": "Bayesian Regression", "description": "Bayesian regression techniques can be used to include regularization parameters in the estimation procedure: the regularization parameter is not set in a hard sense but tuned to the data at hand.\n\nThe advantages of Bayesian Regression are: \n\nIt adapts to the data at hand. \n\nIt can be used to include regularization parameters in the estimation procedure. \n\nThe disadvantages of Bayesian regression include: \n\nInference of the model can be time consuming.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#bayesian-regression"}, {"name": "Bayesian Ridge Regression", "description": "Due to the Bayesian framework, the weights found are slightly different from the ones found by Ordinary Least Squares. However, Bayesian Ridge Regression is more robust to ill-posed problems.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#bayesian-ridge-regression"}, {"name": "Automatic Relevance Determination - ARD", "description": "ARDRegression is very similar to Bayesian Ridge Regression, but can lead to sparser coefficients. ARDRegression poses a different prior over the coefficients w, by dropping the assumption of the Gaussian being spherical.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#automatic-relevance-determination---ard"}, {"name": "Logistic regression", "description": "Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. 
In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression"}, {"name": "Generalized Linear Regression", "description": "Generalized Linear Models (GLM) extend linear models in two ways. First, the predicted values are linked to a linear combination of the input variables via an inverse link function. Secondly, the squared loss function is replaced by the unit deviance of a distribution in the exponential family (or more precisely, a reproductive exponential dispersion model (EDM)).\n\nYou can use three of these distributions - Poisson, Tweedie, and Gamma. The choice of the distribution depends on the problem at hand: \n\nIf the target values are counts (non-negative integer valued) or relative frequencies (non-negative), you might use a Poisson deviance with log-link. \n\nIf the target values are positive valued and skewed, you might try a Gamma deviance with log-link. \n\nIf the target values seem to be heavier tailed than a Gamma distribution, you might try an Inverse Gaussian deviance (or even higher variance powers of the Tweedie family). \n\nExamples of use cases include: \n\nAgriculture / weather modeling: number of rain events per year (Poisson), amount of rainfall per event (Gamma), total rainfall per year (Tweedie / Compound Poisson Gamma). \n\nRisk modeling / insurance policy pricing: number of claim events / policyholder per year (Poisson), cost per event (Gamma), total cost per policyholder per year (Tweedie / Compound Poisson Gamma). 
\n\nPredictive maintenance: number of production interruption events per year (Poisson), duration of interruption (Gamma), total interruption time per year (Tweedie / Compound Poisson Gamma).", "url": "https://scikit-learn.org/stable/modules/linear_model.html#generalized-linear-regression"}, {"name": "Stochastic Gradient Descent - SGD", "description": "Stochastic gradient descent is a simple yet very efficient approach to fit linear models. It is particularly useful when the number of samples (and the number of features) is very large. The partial_fit method allows online/out-of-core learning. The classes SGDClassifier and SGDRegressor provide functionality to fit linear models for classification and regression using different (convex) loss functions and different penalties. \n\nWith loss=\"log_loss\", SGDClassifier fits a logistic regression model, while with loss=\"hinge\" it fits a linear support vector machine (SVM).", "url": "https://scikit-learn.org/stable/modules/linear_model.html#stochastic-gradient-descent---sgd"}, {"name": "Perceptron", "description": "The Perceptron is another simple classification algorithm suitable for large scale learning. By default: \n\nIt does not require a learning rate. \n\nIt is not regularized (penalized). \n\nIt updates its model only on mistakes. \n\nThe last characteristic implies that the Perceptron is slightly faster to train than SGD with the hinge loss and that the resulting models are sparser.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#perceptron"}, {"name": "Passive Aggressive Algorithms", "description": "The passive-aggressive algorithms are a family of algorithms for large-scale learning. They are similar to the Perceptron in that they do not require a learning rate. 
However, contrary to the Perceptron, they include a regularization parameter C. \n\nFor classification, PassiveAggressiveClassifier can be used with loss='hinge' (PA-I) or loss='squared_hinge' (PA-II). \n\nFor regression, PassiveAggressiveRegressor can be used with loss='epsilon_insensitive' (PA-I) or loss='squared_epsilon_insensitive' (PA-II).", "url": "https://scikit-learn.org/stable/modules/linear_model.html#passive-aggressive-algorithms"}, {"name": "Robustness regression: outliers and modeling errors", "description": "Robust regression aims to fit a regression model in the presence of corrupt data: either outliers, or error in the model. Note that in general, robust fitting in a high-dimensional setting (large n_features) is very hard. The robust models here will probably not work in these settings.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#robustness-regression:-outliers-and-modeling-errors"}, {"name": "RANSAC (RANdom SAmple Consensus)", "description": "RANSAC fits a model from random subsets of inliers from the complete data set.\n\nRANSAC is a non-deterministic algorithm producing only a reasonable result with a certain probability, which is dependent on the number of iterations (see max_trials parameter). It is typically used for linear and non-linear regression problems and is especially popular in the field of photogrammetric computer vision.\n\nThe algorithm splits the complete input sample data into a set of inliers, which may be subject to noise, and outliers, which are e.g. caused by erroneous measurements or invalid hypotheses about the data. The resulting model is then estimated only from the determined inliers.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#ransac-(random-sample-consensus)"}, {"name": "Theil-Sen estimator: generalized-median-based estimator", "description": "The TheilSenRegressor estimator uses a generalization of the median in multiple dimensions. It is thus robust to multivariate outliers. 
Note however that the robustness of the estimator decreases quickly with the dimensionality of the problem. It loses its robustness properties and becomes no better than ordinary least squares in high dimensions. Since Theil-Sen is a median-based estimator, it is more robust against corrupted data (outliers). In a univariate setting, Theil-Sen has a breakdown point of about 29.3% in the case of a simple linear regression, which means that it can tolerate arbitrarily corrupted data of up to 29.3%.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#theil-sen-estimator:-generalized-median-based-estimator"}, {"name": "Huber Regression", "description": "The HuberRegressor is different from Ridge because it applies a linear loss to samples that are classified as outliers. A sample is classified as an inlier if the absolute error of that sample is less than a certain threshold. It differs from TheilSenRegressor and RANSACRegressor because it does not ignore the effect of the outliers but gives a lesser weight to them.\nIt is advised to set the parameter epsilon to 1.35 to achieve 95% statistical efficiency.\n\nThe HuberRegressor differs from using SGDRegressor with loss set to huber in the following ways.\n\nHuberRegressor is scaling invariant. Once epsilon is set, scaling X and y down or up by different values would produce the same robustness to outliers as before, 
whereas with SGDRegressor epsilon has to be set again when X and y are scaled.\n\nHuberRegressor should be more efficient to use on data with a small number of samples, while SGDRegressor needs a number of passes on the training data to produce the same robustness.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#huber-regression"}, {"name": "Quantile Regression", "description": "Quantile regression estimates the median or other quantiles of y conditional on X, while ordinary least squares (OLS) estimates the conditional mean. Quantile regression may be useful if one is interested in predicting an interval instead of point prediction. Sometimes, prediction intervals are calculated based on the assumption that prediction error is distributed normally with zero mean and constant variance. Quantile regression provides sensible prediction intervals even for errors with non-constant (but predictable) variance or non-normal distribution.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#quantile-regression"}, {"name": "Polynomial regression: extending linear models with basis functions", "description": "One common pattern within machine learning is to use linear models trained on nonlinear functions of the data. This approach maintains the generally fast performance of linear methods, while allowing them to fit a much wider range of data. For example, a simple linear regression can be extended by constructing polynomial features from the coefficients. 
\nBy considering linear fits within a higher-dimensional space built with these basis functions, the model has the flexibility to fit a much broader range of data.", "url": "https://scikit-learn.org/stable/modules/linear_model.html#polynomial-regression:-extending-linear-models-with-basis-functions"}]}, {"name": "Linear and Quadratic Discriminant Analysis", "description": "Linear Discriminant Analysis (LinearDiscriminantAnalysis) and Quadratic Discriminant Analysis (QuadraticDiscriminantAnalysis) are two classic classifiers, with, as their names suggest, a linear and a quadratic decision surface, respectively. These classifiers are attractive because they have closed-form solutions that can be easily computed, are inherently multiclass, have proven to work well in practice, and have no hyperparameters to tune. LinearDiscriminantAnalysis can be used to perform supervised dimensionality reduction, by projecting the input data to a linear subspace consisting of the directions which maximize the separation between classes (in a precise sense discussed in the mathematics section below). The dimension of the output is necessarily less than the number of classes, so this is in general a rather strong dimensionality reduction, and only makes sense in a multiclass setting.", "url": "https://scikit-learn.org/stable/modules/lda_qda.html"}, {"name": "Kernel ridge regression", "description": "Kernel ridge regression (KRR) combines ridge regression (linear least squares with l2-norm regularization) with the kernel trick. It thus learns a linear function in the space induced by the respective kernel and the data. For non-linear kernels, this corresponds to a non-linear function in the original space. 
The form of the model learned by KernelRidge is identical to support vector regression (SVR). However, different loss functions are used: KRR uses squared error loss while support vector regression uses epsilon-insensitive loss, both combined with l2 regularization. In contrast to SVR, fitting KernelRidge can be done in closed form and is typically faster for medium-sized datasets. On the other hand, the learned model is non-sparse and thus slower than SVR at prediction time.", "url": "https://scikit-learn.org/stable/modules/kernel_ridge.html"}, {"name": "Support Vector Machines", "description": "Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.\n\nThe advantages of support vector machines are:\n\nEffective in high dimensional spaces.\n\nStill effective in cases where number of dimensions is greater than the number of samples.\n\nUses a subset of training points in the decision function (called support vectors), so it is also memory efficient.\n\nVersatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.\n\nThe disadvantages of support vector machines include:\n\nIf the number of features is much greater than the number of samples, avoiding over-fitting when choosing Kernel functions and the regularization term is crucial.\n\nSVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).\n\nThe support vector machines in scikit-learn support both dense (numpy.ndarray and convertible to that by numpy.asarray) and sparse (any scipy.sparse) sample vectors as input. However, to use an SVM to make predictions for sparse data, it must have been fit on such data. 
For optimal performance, use C-ordered numpy.ndarray (dense) or scipy.sparse.csr_matrix (sparse) with dtype=float64.", "url": "https://scikit-learn.org/stable/modules/svm.html", "children": [{"name": "Classification", "description": "SVC and NuSVC are similar methods, but accept slightly different sets of parameters and have different mathematical formulations (see section Mathematical formulation). On the other hand, LinearSVC is another (faster) implementation of Support Vector Classification for the case of a linear kernel. Note that LinearSVC does not accept parameter kernel, as this is assumed to be linear. It also lacks some of the attributes of SVC and NuSVC, like support_.", "url": "https://scikit-learn.org/stable/modules/svm.html#classification"}, {"name": "Regression", "description": "The method of Support Vector Classification can be extended to solve regression problems. This method is called Support Vector Regression. \n\nThe model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by Support Vector Regression depends only on a subset of the training data, because the cost function ignores samples whose prediction is close to their target. \n\nThere are three different implementations of Support Vector Regression: SVR, NuSVR and LinearSVR. LinearSVR provides a faster implementation than SVR but only considers the linear kernel, while NuSVR implements a slightly different formulation than SVR and LinearSVR. 
See Implementation details for further details.", "url": "https://scikit-learn.org/stable/modules/svm.html#regression"}]}, {"name": "Stochastic Gradient Descent", "description": "Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to fitting linear classifiers and regressors under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. Even though SGD has been around in the machine learning community for a long time, it has received a considerable amount of attention just recently in the context of large-scale learning.\n\nSGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Given that the data is sparse, the classifiers in this module easily scale to problems with more than 10^5 training examples and more than 10^5 features.\n\nStrictly speaking, SGD is merely an optimization technique and does not correspond to a specific family of machine learning models. It is only a way to train a model. Often, an instance of SGDClassifier or SGDRegressor will have an equivalent estimator in the scikit-learn API, potentially using a different optimization technique. For example, using SGDClassifier(loss='log_loss') results in logistic regression, i.e. a model equivalent to LogisticRegression which is fitted via SGD instead of being fitted by one of the other solvers in LogisticRegression. 
Similarly, SGDRegressor(loss='squared_error', penalty='l2') and Ridge solve the same optimization problem, via different means.\n\nThe advantages of Stochastic Gradient Descent are:\n\nEfficiency.\n\nEase of implementation (lots of opportunities for code tuning).\n\nThe disadvantages of Stochastic Gradient Descent include:\n\nSGD requires a number of hyperparameters such as the regularization parameter and the number of iterations.\n\nSGD is sensitive to feature scaling.", "url": "https://scikit-learn.org/stable/modules/sgd.html", "children": [{"name": "Classification", "description": "The class SGDClassifier implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties for classification. An SGDClassifier trained with the hinge loss is equivalent to a linear SVM. \n\nSGDClassifier supports multi-class classification by combining multiple binary classifiers in a \"one versus all\" (OVA) scheme. For each of the K classes, a binary classifier is learned that discriminates between that and all other K-1 classes. At testing time, we compute the confidence score (i.e. the signed distances to the hyperplane) for each classifier and choose the class with the highest confidence. \n\nSGDClassifier supports both weighted classes and weighted instances via the fit parameters class_weight and sample_weight.\n\nSGDClassifier supports averaged SGD (ASGD). Averaging can be enabled by setting average=True. ASGD performs the same updates as the regular SGD, but instead of using the last value of the coefficients as the coef_ attribute (i.e. 
the values of the last update), coef_ is set instead to the average value of the coefficients across all updates.", "url": "https://scikit-learn.org/stable/modules/sgd.html#classification"}, {"name": "Regression", "description": "The class SGDRegressor implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties to fit linear regression models. SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000); for other problems we recommend Ridge, Lasso, or ElasticNet.", "url": "https://scikit-learn.org/stable/modules/sgd.html#regression"}, {"name": "Online One-Class SVM", "description": "The class sklearn.linear_model.SGDOneClassSVM implements an online linear version of the One-Class SVM using stochastic gradient descent. Combined with kernel approximation techniques, sklearn.linear_model.SGDOneClassSVM can be used to approximate the solution of a kernelized One-Class SVM, implemented in sklearn.svm.OneClassSVM, with a linear complexity in the number of samples. Note that the complexity of a kernelized One-Class SVM is at best quadratic in the number of samples.", "url": "https://scikit-learn.org/stable/modules/sgd.html#online-one-class-svm"}]}, {"name": "Nearest Neighbors", "description": "sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods. Unsupervised nearest neighbors is the foundation of many other learning methods, notably manifold learning and spectral clustering. Supervised neighbors-based learning comes in two flavors: classification for data with discrete labels, and regression for data with continuous labels.\n\nThe principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. 
The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply \"remember\" all of their training data (possibly transformed into a fast indexing structure such as a Ball Tree or KD Tree).\n\nDespite its simplicity, nearest neighbors has been successful in a large number of classification and regression problems, including handwritten digits and satellite image scenes. Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular.\n\nThe classes in sklearn.neighbors can handle either NumPy arrays or scipy.sparse matrices as input. For dense matrices, a large number of possible distance metrics are supported. For sparse matrices, arbitrary Minkowski metrics are supported for searches.\n\nThere are many learning routines which rely on nearest neighbors at their core. One example is kernel density estimation, discussed in the density estimation section.", "url": "https://scikit-learn.org/stable/modules/neighbors.html", "children": [{"name": "Unsupervised Nearest Neighbors", "description": "NearestNeighbors implements unsupervised nearest neighbors learning. It acts as a uniform interface to three different nearest neighbors algorithms: BallTree, KDTree, and a brute-force algorithm based on routines in sklearn.metrics.pairwise. The choice of neighbors search algorithm is controlled through the keyword 'algorithm', which must be one of ['auto', 'ball_tree', 'kd_tree', 'brute']. When the default value 'auto' is passed, the algorithm attempts to determine the best approach from the training data. 
For a discussion of the strengths and weaknesses of each option, see Nearest Neighbor Algorithms.", "url": "https://scikit-learn.org/stable/modules/neighbors.html#unsupervised-nearest-neighbors"}, {"name": "Nearest Neighbors Classification", "description": "Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point. The basic nearest neighbors classification uses uniform weights: that is, the value assigned to a query point is computed from a simple majority vote of the nearest neighbors. Under some circumstances, it is better to weight the neighbors such that nearer neighbors contribute more to the fit. This can be accomplished through the weights keyword. The default value, weights = 'uniform', assigns uniform weights to each neighbor. weights = 'distance' assigns weights proportional to the inverse of the distance from the query point. Alternatively, a user-defined function of the distance can be supplied to compute the weights.\n\nThe k-neighbors classification in KNeighborsClassifier is the most commonly used technique. The optimal choice of the value k is highly data-dependent: in general a larger k suppresses the effects of noise, but makes the classification boundaries less distinct.\n\nIn cases where the data is not uniformly sampled, radius-based neighbors classification in RadiusNeighborsClassifier can be a better choice. The user specifies a fixed radius r, such that points in sparser neighborhoods use fewer nearest neighbors for the classification. 
For high-dimensional parameter spaces, this method becomes less effective due to the so-called \u201ccurse of dimensionality\u201d.", "url": "https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification"}, {"name": "Nearest Neighbors Regression", "description": "Neighbors-based regression can be used in cases where the data labels are continuous rather than discrete variables. The label assigned to a query point is computed based on the mean of the labels of its nearest neighbors. \n\nscikit-learn implements two different neighbors regressors: KNeighborsRegressor implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user. RadiusNeighborsRegressor implements learning based on the neighbors within a fixed radius r of the query point, where r is a floating-point value specified by the user. The basic nearest neighbors regression uses uniform weights: that is, each point in the local neighborhood contributes uniformly to the prediction for a query point. Under some circumstances, it can be advantageous to weight points such that nearby points contribute more to the regression than faraway points.", "url": "https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-regression"}, {"name": "Nearest Centroid Classifier", "description": "The NearestCentroid classifier is a simple algorithm that represents each class by the centroid of its members. In effect, this makes it similar to the label updating phase of the KMeans algorithm. It also has no parameters to choose, making it a good baseline classifier. 
It does, however, suffer on non-convex classes, as well as when classes have drastically different variances, as equal variance in all dimensions is assumed.", "url": "https://scikit-learn.org/stable/modules/neighbors.html#nearest-centroid-classifier"}, {"name": "Nearest Neighbors Transformer", "description": "Many scikit-learn estimators rely on nearest neighbors: several classifiers and regressors such as KNeighborsClassifier and KNeighborsRegressor, but also some clustering methods such as DBSCAN and SpectralClustering, and some manifold embeddings such as TSNE and Isomap. All these estimators can compute the nearest neighbors internally, but most of them also accept a precomputed nearest neighbors sparse graph, as given by kneighbors_graph and radius_neighbors_graph. With mode='connectivity', these functions return a binary adjacency sparse graph as required, for instance, in SpectralClustering. With mode='distance', they return a distance sparse graph as required, for instance, in DBSCAN. To include these functions in a scikit-learn pipeline, one can also use the corresponding classes KNeighborsTransformer and RadiusNeighborsTransformer. The benefits of this sparse graph API are multiple.", "url": "https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-transformer"}, {"name": "Neighborhood Component Analysis", "description": "Combined with a nearest neighbors classifier (KNeighborsClassifier), NCA is attractive for classification because it can naturally handle multi-class problems without any increase in the model size, and does not introduce additional parameters that require fine-tuning by the user. Neighborhood Components Analysis (NCA, NeighborhoodComponentsAnalysis) is a distance metric learning algorithm which aims to improve the accuracy of nearest neighbors classification compared to the standard Euclidean distance. 
The algorithm directly maximizes a stochastic variant of the leave-one-out k-nearest neighbors (KNN) score on the training set. It can also learn a low-dimensional linear projection of data that can be used for data visualization and fast classification.\n\nNCA classification has been shown to work well in practice for data sets of varying size and difficulty. In contrast to related methods such as Linear Discriminant Analysis, NCA does not make any assumptions about the class distributions. The nearest neighbor classification can naturally produce highly irregular decision boundaries.", "url": "https://scikit-learn.org/stable/modules/neighbors.html#neighborhood-component-analysis"}]}, {"name": "Gaussian Processes", "description": "Gaussian Processes (GP) are a generic supervised learning method designed to solve regression and probabilistic classification problems.\n\nThe advantages of Gaussian processes are:\n\nThe prediction interpolates the observations (at least for regular kernels).\n\nThe prediction is probabilistic (Gaussian) so that one can compute empirical confidence intervals and decide based on those if one should refit (online fitting, adaptive fitting) the prediction in some region of interest.\n\nVersatile: different kernels can be specified. Common kernels are provided, but it is also possible to specify custom kernels.\n\nThe disadvantages of Gaussian processes include:\n\nThey are not sparse, i.e., they use the whole samples/features information to perform the prediction.\n\nThey lose efficiency in high dimensional spaces, namely when the number of features exceeds a few dozen.\n", "url": "https://scikit-learn.org/stable/modules/gaussian_process.html", "children": [{"name": "Gaussian Process Regression (GPR)", "description": "The GaussianProcessRegressor implements Gaussian processes (GP) for regression purposes. For this, the prior of the GP needs to be specified. 
The prior mean is assumed to be constant and zero (for normalize_y=False) or the training data's mean (for normalize_y=True). The prior's covariance is specified by passing a kernel object. The hyperparameters of the kernel are optimized during fitting of GaussianProcessRegressor by maximizing the log-marginal-likelihood (LML) based on the passed optimizer. As the LML may have multiple local optima, the optimizer can be started repeatedly by specifying n_restarts_optimizer. The first run is always conducted starting from the initial hyperparameter values of the kernel; subsequent runs are conducted from hyperparameter values that have been chosen randomly from the range of allowed values. If the initial hyperparameters should be kept fixed, None can be passed as optimizer. Both kernel ridge regression (KRR) and GPR learn a target function by employing internally the \u201ckernel trick\u201d. KRR learns a linear function in the space induced by the respective kernel which corresponds to a non-linear function in the original space. The linear function in the kernel space is chosen based on the mean-squared error loss with ridge regularization. GPR uses the kernel to define the covariance of a prior distribution over the target functions and uses the observed training data to define a likelihood function. Based on Bayes' theorem, a (Gaussian) posterior distribution over target functions is defined, whose mean is used for prediction. \n\nA major difference is that GPR can choose the kernel's hyperparameters based on gradient-ascent on the marginal likelihood function while KRR needs to perform a grid search on a cross-validated loss function (mean-squared error loss). 
A further difference is that GPR learns a generative, probabilistic model of the target function and can thus provide meaningful confidence intervals and posterior samples along with the predictions while KRR only provides predictions.", "url": "https://scikit-learn.org/stable/modules/gaussian_process.html#gaussian-process-regression-(gpr)"}, {"name": "Gaussian Process Classification (GPC)", "description": "The GaussianProcessClassifier implements Gaussian processes (GP) for classification purposes, more specifically for probabilistic classification, where test predictions take the form of class probabilities. In contrast to the regression setting, the posterior of the latent function is not Gaussian even for a GP prior since a Gaussian likelihood is inappropriate for discrete class labels. Rather, a non-Gaussian likelihood corresponding to the logistic link function (logit) is used. \n\nGaussianProcessClassifier supports multi-class classification by performing either one-versus-rest or one-versus-one based training and prediction. In one-versus-rest, one binary Gaussian process classifier is fitted for each class, which is trained to separate this class from the rest. \n\nIn the case of Gaussian process classification, \u201cone_vs_one\u201d might be computationally cheaper since it has to solve many problems involving only a subset of the whole training set rather than fewer problems on the whole dataset. Since Gaussian process classification scales cubically with the size of the dataset, this might be considerably faster.", "url": "https://scikit-learn.org/stable/modules/gaussian_process.html#gaussian-process-classification-(gpc)"}]}, {"name": "Cross decomposition", "description": "The cross decomposition module contains supervised estimators for dimensionality reduction and regression, belonging to the \u201cPartial Least Squares\u201d family. Cross decomposition algorithms find the fundamental relations between two matrices (X and Y). 
They are latent variable approaches to modeling the covariance structures in these two spaces. They will try to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space. In other words, PLS projects both X and Y into a lower-dimensional subspace such that the covariance between transformed(X) and transformed(Y) is maximal.\n\nPLS draws similarities with Principal Component Regression (PCR), where the samples are first projected into a lower-dimensional subspace, and the targets y are predicted using transformed(X). One issue with PCR is that the dimensionality reduction is unsupervised, and may lose some important variables: PCR would keep the features with the most variance, but it's possible that features with small variance are relevant for predicting the target. In a way, PLS allows for the same kind of dimensionality reduction, but by taking into account the targets y. An illustration of this fact is given in the example Principal Component Regression vs Partial Least Squares Regression.\n\nApart from CCA, the PLS estimators are particularly suited when the matrix of predictors has more variables than observations, and when there is multicollinearity among the features. By contrast, standard linear regression would fail in these cases unless it is regularized.", "url": "https://scikit-learn.org/stable/modules/cross_decomposition.html"}, {"name": "Naive Bayes", "description": "Naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. They require a small amount of training data to estimate the necessary parameters. (For theoretical reasons why naive Bayes works well, and on which types of data it does, see the references below.)\n\nNaive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. 
The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality.\n\nOn the flip side, although naive Bayes is known as a decent classifier, it is known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.", "url": "https://scikit-learn.org/stable/modules/naive_bayes.html", "children": [{"name": "Gaussian Naive Bayes", "description": "GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to follow a Gaussian distribution.", "url": "https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes"}, {"name": "Multinomial Naive Bayes", "description": "MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice).", "url": "https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes"}, {"name": "Complement Naive Bayes", "description": "ComplementNB implements the complement naive Bayes (CNB) algorithm. CNB is an adaptation of the standard multinomial naive Bayes (MNB) algorithm that is particularly suited for imbalanced data sets. Specifically, CNB uses statistics from the complement of each class to compute the model's weights. 
The inventors of CNB show empirically that the parameter estimates for CNB are more stable than those for MNB.", "url": "https://scikit-learn.org/stable/modules/naive_bayes.html#complement-naive-bayes"}, {"name": "Bernoulli Naive Bayes", "description": "BernoulliNB implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. Therefore, this class requires samples to be represented as binary-valued feature vectors; if handed any other kind of data, a BernoulliNB instance may binarize its input (depending on the binarize parameter).\n\nIn the case of text classification, word occurrence vectors (rather than word count vectors) may be used to train and use this classifier. BernoulliNB might perform better on some datasets, especially those with shorter documents. It is advisable to evaluate both models, if time permits.", "url": "https://scikit-learn.org/stable/modules/naive_bayes.html#bernoulli-naive-bayes"}, {"name": "Categorical Naive Bayes", "description": "CategoricalNB implements the categorical naive Bayes algorithm for categorically distributed data. It assumes that each feature, which is described by the index i, has its own categorical distribution.", "url": "https://scikit-learn.org/stable/modules/naive_bayes.html#categorical-naive-bayes"}]}, {"name": "Decision Trees", "description": "Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation. Some advantages of decision trees are:\n\nSimple to understand and to interpret. Trees can be visualized.\n\nRequires little data preparation. 
Other techniques often require data normalization, creation of dummy variables, and removal of blank values. Note however that this module does not support missing values.\n\nThe cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.\n\nAble to handle both numerical and categorical data. However, the scikit-learn implementation does not support categorical variables for now. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See algorithms for more information.\n\nAble to handle multi-output problems.\n\nUses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.\n\nPossible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.\n\nPerforms well even if its assumptions are somewhat violated by the true model from which the data were generated.\n\nThe disadvantages of decision trees include:\n\nDecision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.\n\nDecision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.\n\nPredictions of decision trees are neither smooth nor continuous, but piecewise constant approximations. Therefore, they are not good at extrapolation.\n\nThe problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. 
Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.\n\nThere are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.\n\nDecision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.\n", "url": "https://scikit-learn.org/stable/modules/tree.html", "children": [{"name": "Classification", "description": "DecisionTreeClassifier is a class capable of performing multi-class classification on a dataset. As with other classifiers, DecisionTreeClassifier takes as input two arrays: an array X, sparse or dense, of shape (n_samples, n_features) holding the training samples, and an array Y of integer values, shape (n_samples,), holding the class labels for the training samples. After being fitted, the model can then be used to predict the class of samples.\n\nIn case there are multiple classes with the same and highest probability, the classifier will predict the class with the lowest index amongst those classes.", "url": "https://scikit-learn.org/stable/modules/tree.html#classification"}, {"name": "Regression", "description": "Decision trees can also be applied to regression problems, using the DecisionTreeRegressor class.", "url": "https://scikit-learn.org/stable/modules/tree.html#regression"}, {"name": "ID3 (Iterative Dichotomiser 3)", "description": "ID3 (Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm creates a multiway tree, finding for each node (i.e. 
in a greedy manner) the categorical feature that will yield the largest information gain for categorical targets. Trees are grown to their maximum size and then a pruning step is usually applied to improve the ability of the tree to generalize to unseen data.", "url": "https://scikit-learn.org/stable/modules/tree.html#id3-(iterative-dichotomiser-3)"}, {"name": "C4.5", "description": "C4.5 is the successor to ID3 and removed the restriction that features must be categorical by dynamically defining a discrete attribute (based on numerical variables) that partitions the continuous attribute value into a discrete set of intervals. C4.5 converts the trained trees (i.e. the output of the ID3 algorithm) into sets of if-then rules. The accuracy of each rule is then evaluated to determine the order in which they should be applied. Pruning is done by removing a rule's precondition if the accuracy of the rule improves without it.", "url": "https://scikit-learn.org/stable/modules/tree.html#c4.5"}, {"name": "C5.0", "description": "C5.0 is Quinlan's latest version, released under a proprietary license. It uses less memory and builds smaller rulesets than C4.5 while being more accurate.", "url": "https://scikit-learn.org/stable/modules/tree.html#c5.0"}, {"name": "CART (Classification and Regression Trees)", "description": "CART is very similar to C4.5, but it differs in that it supports numerical target variables (regression) and does not compute rule sets. CART constructs binary trees using the feature and threshold that yield the largest information gain at each node.", "url": "https://scikit-learn.org/stable/modules/tree.html#cart-(classification-and-regression-trees)"}]}, {"name": "Ensemble methods", "description": "The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator. 
Two families of ensemble methods are usually distinguished: In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced. By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.", "url": "https://scikit-learn.org/stable/modules/ensemble.html", "children": [{"name": "Bagging meta-estimator", "description": "In ensemble algorithms, bagging methods form a class of algorithms which build several instances of a black-box estimator on random subsets of the original training set and then aggregate their individual predictions to form a final prediction. These methods are used as a way to reduce the variance of a base estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it. In many cases, bagging methods constitute a very simple way to improve with respect to a single model, without making it necessary to adapt the underlying base algorithm. As they provide a way to reduce overfitting, bagging methods work best with strong and complex models (e.g., fully developed decision trees), in contrast with boosting methods which usually work best with weak models (e.g., shallow decision trees).", "url": "https://scikit-learn.org/stable/modules/ensemble.html#bagging-meta-estimator"}, {"name": "Forests of randomized trees", "description": "The sklearn.ensemble module includes two averaging algorithms based on randomized decision trees: the RandomForest algorithm and the Extra-Trees method. Both algorithms are perturb-and-combine techniques [B1998] specifically designed for trees. 
This means a diverse set of classifiers is created by introducing randomness in the classifier construction. The prediction of the ensemble is given as the averaged prediction of the individual classifiers.", "url": "https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees"}, {"name": "AdaBoost", "description": "The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction. The data modifications at each so-called boosting iteration consist of applying weights to each of the training samples.\n\nAt a given step, those training examples that were incorrectly predicted by the boosted model induced at the previous step have their weights increased, whereas the weights are decreased for those that were predicted correctly. As iterations proceed, examples that are difficult to predict receive ever-increasing influence. 
Each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence.\n\nAdaBoost can be used both for classification and regression problems:\n\nFor multi-class classification, AdaBoostClassifier implements AdaBoost-SAMME and AdaBoost-SAMME.R.\n\nFor regression, AdaBoostRegressor implements AdaBoost.R2.", "url": "https://scikit-learn.org/stable/modules/ensemble.html#adaboost"}, {"name": "GradientBoostingRegressor", "description": "GradientBoostingRegressor supports a number of different loss functions for regression which can be specified via the argument loss; the default loss function for regression is squared error ('squared_error').", "url": "https://scikit-learn.org/stable/modules/ensemble.html#gradientboostingregressor"}, {"name": "GradientBoostingClassifier", "description": "GradientBoostingClassifier supports both binary and multi-class classification.", "url": "https://scikit-learn.org/stable/modules/ensemble.html#gradientboostingclassifier"}, {"name": "Histogram-Based Gradient Boosting", "description": "These histogram-based estimators can be orders of magnitude faster than GradientBoostingClassifier and GradientBoostingRegressor when the number of samples is larger than tens of thousands. \n\nThey also have built-in support for missing values, which avoids the need for an imputer. 
\n\nThese fast estimators first bin the input samples X into integer-valued bins (typically 256 bins) which tremendously reduces the number of splitting points to consider, and allows the algorithm to leverage integer-based data structures (histograms) instead of relying on sorted continuous values when building the trees.", "url": "https://scikit-learn.org/stable/modules/ensemble.html#histogram-based-gradient-boosting"}, {"name": "Voting Classifier", "description": "The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing models in order to balance out their individual weaknesses.", "url": "https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier"}, {"name": "Voting Regressor", "description": "The idea behind the VotingRegressor is to combine conceptually different machine learning regressors and return the average predicted values. Such a regressor can be useful for a set of equally well performing models in order to balance out their individual weaknesses.", "url": "https://scikit-learn.org/stable/modules/ensemble.html#voting-regressor"}, {"name": "Stacked generalization", "description": "Stacked generalization is a method for combining estimators to reduce their biases. More precisely, the predictions of each individual estimator are stacked together and used as input to a final estimator to compute the prediction. This final estimator is trained through cross-validation.\n\nThe StackingClassifier and StackingRegressor provide such strategies which can be applied to classification and regression problems.\n\nIn practice, a stacking predictor predicts as well as the best predictor of the base layer and even sometimes outperforms it by combining the different strengths of these predictors. 
However, training a stacking predictor is computationally expensive.", "url": "https://scikit-learn.org/stable/modules/ensemble.html#stacked-generalization"}]}, {"name": "Multiclass and multioutput algorithms", "description": "Multiclass classification is a classification task with more than two classes. Each sample can only be labeled as one class. For example, classification using features extracted from a set of images of fruit, where each image may either be of an orange, an apple, or a pear. Each image is one sample and is labeled as one of the 3 possible classes. Multiclass classification makes the assumption that each sample is assigned to one and only one label - one sample cannot, for example, be both a pear and an apple. While all scikit-learn classifiers are capable of multiclass classification, the meta-estimators offered by sklearn.multiclass permit changing the way they handle more than two classes because this may have an effect on classifier performance (either in terms of generalization error or required computational resources). Multilabel classification (closely related to multioutput classification) is a classification task labeling each sample with m labels from n_classes possible classes, where m can be 0 to n_classes inclusive. This can be thought of as predicting properties of a sample that are not mutually exclusive. Formally, a binary output is assigned to each class, for every sample. Positive classes are indicated with 1 and negative classes with 0 or -1. It is thus comparable to running n_classes binary classification tasks, for example with MultiOutputClassifier. This approach treats each label independently whereas multilabel classifiers may treat the multiple classes simultaneously, accounting for correlated behavior among them. For example, prediction of the topics relevant to a text document or video. 
The document or video may be about one of 'religion', 'politics', 'finance' or 'education', several of the topic classes or all of the topic classes.", "url": "https://scikit-learn.org/stable/modules/multiclass.html", "children": [{"name": "Multiclass classification", "description": "Multiclass classification is a classification task with more than two classes. Each sample can only be labeled as one class.\n\nFor example, classification using features extracted from a set of images of fruit, where each image may either be of an orange, an apple, or a pear. Each image is one sample and is labeled as one of the 3 possible classes. Multiclass classification makes the assumption that each sample is assigned to one and only one label - one sample cannot, for example, be both a pear and an apple.", "url": "https://scikit-learn.org/stable/modules/multiclass.html#multiclass-classification"}, {"name": "Multilabel classification", "description": "Multilabel classification (closely related to multioutput classification) is a classification task labeling each sample with m labels from n_classes possible classes, where m can be 0 to n_classes inclusive. This can be thought of as predicting properties of a sample that are not mutually exclusive. Formally, a binary output is assigned to each class, for every sample. Positive classes are indicated with 1 and negative classes with 0 or -1. It is thus comparable to running n_classes binary classification tasks, for example with MultiOutputClassifier. This approach treats each label independently whereas multilabel classifiers may treat the multiple classes simultaneously, accounting for correlated behavior among them. For example, prediction of the topics relevant to a text document or video. 
The document or video may be about one of 'religion', 'politics', 'finance' or 'education', several of the topic classes or all of the topic classes.", "url": "https://scikit-learn.org/stable/modules/multiclass.html#multilabel-classification"}, {"name": "Multiclass-multioutput classification", "description": "Multiclass-multioutput classification (also known as multitask classification) is a classification task which labels each sample with a set of non-binary properties. Both the number of properties and the number of classes per property is greater than 2. A single estimator thus handles several joint classification tasks. This is both a generalization of the multilabel classification task, which only considers binary attributes, as well as a generalization of the multiclass classification task, where only one property is considered. For example, classification of the properties \u201ctype of fruit\u201d and \u201ccolour\u201d for a set of images of fruit. The property \u201ctype of fruit\u201d has the possible classes: \u201capple\u201d, \u201cpear\u201d and \u201corange\u201d. The property \u201ccolour\u201d has the possible classes: \u201cgreen\u201d, \u201cred\u201d, \u201cyellow\u201d and \u201corange\u201d. Each sample is an image of a fruit, a label is output for both properties and each label is one of the possible classes of the corresponding property. \n\nNote that all classifiers handling multiclass-multioutput (also known as multitask classification) tasks, support the multilabel classification task as a special case. Multitask classification is similar to the multioutput classification task with different model formulations.", "url": "https://scikit-learn.org/stable/modules/multiclass.html#multiclass-multioutput-classification"}, {"name": "Multioutput regression", "description": "Multioutput regression predicts multiple numerical properties for each sample. 
Each property is a numerical variable and the number of properties to be predicted for each sample is greater than or equal to 2. Some estimators that support multioutput regression are faster than just running n_output estimators. For example, prediction of both wind speed and wind direction, in degrees, using data obtained at a certain location. Each sample would be data obtained at one location and both wind speed and direction would be output for each sample.", "url": "https://scikit-learn.org/stable/modules/multiclass.html#multioutput-regression"}]}, {"name": "Feature selection", "description": "A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features finding the one which minimizes the error rate. This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods [1].", "url": "https://scikit-learn.org/stable/modules/feature_selection.html", "children": [{"name": "Removing features with low variance", "description": "VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.", "url": "https://scikit-learn.org/stable/modules/feature_selection.html#removing-features-with-low-variance"}, {"name": "Univariate feature selection", "description": "Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. 
Scikit-learn exposes feature selection routines as objects that implement the transform method: \n\nSelectKBest removes all but the highest scoring features \n\nSelectPercentile removes all but a user-specified highest scoring percentage of features \n\nusing common univariate statistical tests for each feature: false positive rate SelectFpr, false discovery rate SelectFdr, or family wise error SelectFwe. \n\nGenericUnivariateSelect performs univariate feature selection with a configurable strategy. This makes it possible to select the best univariate selection strategy with a hyper-parameter search estimator.", "url": "https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection"}, {"name": "Recursive feature elimination", "description": "Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through any specific attribute (such as coef_, feature_importances_) or callable. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.", "url": "https://scikit-learn.org/stable/modules/feature_selection.html#recursive-feature-elimination"}, {"name": "Feature selection using SelectFromModel", "description": "SelectFromModel is a meta-transformer that can be used alongside any estimator that assigns importance to each feature through a specific attribute (such as coef_, feature_importances_) or via an importance_getter callable after fitting. The features are considered unimportant and removed if the corresponding importance of the feature values is below the provided threshold parameter. 
Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are \u201cmean\u201d, \u201cmedian\u201d and float multiples of these like \u201c0.1*mean\u201d. In combination with the threshold criteria, one can use the max_features parameter to set a limit on the number of features to select.", "url": "https://scikit-learn.org/stable/modules/feature_selection.html#feature-selection-using-selectfrommodel"}, {"name": "Sequential Feature Selection", "description": "Sequential Feature Selection [sfs] (SFS) is available in the SequentialFeatureSelector transformer. SFS can be either forward or backward: \n\nForward-SFS is a greedy procedure that iteratively finds the best new feature to add to the set of selected features. Concretely, we initially start with zero features and find the one feature that maximizes a cross-validated score when an estimator is trained on this single feature. Once that first feature is selected, we repeat the procedure by adding a new feature to the set of selected features. The procedure stops when the desired number of selected features is reached, as determined by the n_features_to_select parameter. \n\nBackward-SFS follows the same idea but works in the opposite direction: instead of starting with no features and greedily adding features, we start with all the features and greedily remove features from the set. The direction parameter controls whether forward or backward SFS is used. \n\nIn general, forward and backward selection do not yield equivalent results. 
Also, one may be much faster than the other depending on the requested number of selected features: if we have 10 features and ask for 7 selected features, forward selection would need to perform 7 iterations while backward selection would only need to perform 3.", "url": "https://scikit-learn.org/stable/modules/feature_selection.html#sequential-feature-selection"}]}, {"name": "Semi-supervised learning", "description": "Semi-supervised learning is a situation in which in your training data some of the samples are not labeled. The semi-supervised estimators in sklearn.semi_supervised are able to make use of this additional unlabeled data to better capture the shape of the underlying data distribution and generalize better to new samples. These algorithms can perform well when we have a very small amount of labeled points and a large amount of unlabeled points.", "url": "https://scikit-learn.org/stable/modules/semi_supervised.html"}, {"name": "Isotonic regression", "description": "IsotonicRegression produces a series of predictions for the training data which are the closest to the targets in terms of mean squared error. These predictions are interpolated for predicting to unseen data. The predictions of IsotonicRegression thus form a function that is piecewise linear.", "url": "https://scikit-learn.org/stable/modules/isotonic.html"}, {"name": "Probability calibration", "description": "When performing classification you often want not only to predict the class label, but also obtain a probability of the respective label. This probability gives you some kind of confidence on the prediction. Some models can give you poor estimates of the class probabilities and some even do not support probability prediction (e.g., some instances of SGDClassifier). The calibration module allows you to better calibrate the probabilities of a given model, or to add support for probability prediction. 
Well calibrated classifiers are probabilistic classifiers for which the output of the predict_proba method can be directly interpreted as a confidence level. For instance, a well calibrated (binary) classifier should classify the samples such that among the samples to which it gave a predict_proba value close to 0.8, approximately 80% actually belong to the positive class.", "url": "https://scikit-learn.org/stable/modules/calibration.html"}, {"name": "Neural network models", "description": "A neural network is a network or circuit of biological neurons, or, in a modern sense, an artificial neural network, composed of artificial neurons or nodes.[1] An artificial neural network (ANN), often just called a \u201cneural network\u201d (NN), is a mathematical model or computational model based on biological neural networks. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. In more practical terms neural networks are non-linear statistical data modeling tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data [2]. \n\nA fully connected multi-layer neural network is called a Multilayer Perceptron (MLP).[source2]\nThe advantages of Multi-layer Perceptron are:\n\nCapability to learn non-linear models.\n\nCapability to learn models in real-time (on-line learning) using partial_fit.\n\nThe disadvantages of Multi-layer Perceptron (MLP) include:\n\nMLP with hidden layers have a non-convex loss function where there exists more than one local minimum. 
Therefore different random weight initializations can lead to different validation accuracy.\n\nMLP requires tuning a number of hyperparameters such as the number of hidden neurons, layers, and iterations.\n\nMLP is sensitive to feature scaling.", "url": "https://scikit-learn.org/stable/modules/neural_networks_supervised.html", "children": [{"name": "Classification", "description": "Class MLPClassifier implements a multi-layer perceptron (MLP) algorithm that trains using Backpropagation.", "url": "https://scikit-learn.org/stable/modules/neural_networks_supervised.html#classification"}, {"name": "Regression", "description": "Class MLPRegressor implements a multi-layer perceptron (MLP) that trains using backpropagation with no activation function in the output layer, which can also be seen as using the identity function as activation function. Therefore, it uses the square error as the loss function, and the output is a set of continuous values.", "url": "https://scikit-learn.org/stable/modules/neural_networks_supervised.html#regression"}, {"name": "Regularization", "description": "Both MLPRegressor and MLPClassifier use parameter alpha for regularization (L2 regularization) term which helps in avoiding overfitting by penalizing weights with large magnitudes. Following plot displays varying decision function with value of alpha.", "url": "https://scikit-learn.org/stable/modules/neural_networks_supervised.html#regularization"}]}]}, {"name": "Unsupervised learning", "children": [{"name": "Gaussian mixture models", "description": "A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. 
One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.\n\nScikit-learn implements different classes to estimate Gaussian mixture models, that correspond to different estimation strategies, detailed below.", "url": "https://scikit-learn.org/stable/modules/mixture.html", "children": [{"name": "Gaussian Mixture", "description": "The GaussianMixture object implements the expectation-maximization (EM) algorithm for fitting mixture-of-Gaussian models. It can also draw confidence ellipsoids for multivariate models, and compute the Bayesian Information Criterion to assess the number of clusters in the data. A GaussianMixture.fit method is provided that learns a Gaussian Mixture Model from train data. Given test data, it can assign to each sample the Gaussian it most probably belongs to using the GaussianMixture.predict method.\n\nPros \nSpeed \nIt is the fastest algorithm for learning mixture models \n\nAgnostic \nAs this algorithm maximizes only the likelihood, it will not bias the means towards zero, or bias the cluster sizes to have specific structures that might or might not apply. \n\nCons \nSingularities \nWhen one has insufficiently many points per mixture, estimating the covariance matrices becomes difficult, and the algorithm is known to diverge and find solutions with infinite likelihood unless one regularizes the covariances artificially. \n\nNumber of components \nThis algorithm will always use all the components it has access to, needing held-out data or information theoretical criteria to decide how many components to use in the absence of external cues.", "url": "https://scikit-learn.org/stable/modules/mixture.html#gaussian-mixture"}, {"name": "Variational Bayesian Gaussian Mixture", "description": "The BayesianGaussianMixture object implements a variant of the Gaussian mixture model with variational inference algorithms. 
The API is similar to the one defined by GaussianMixture. Variational inference is an extension of expectation-maximization that maximizes a lower bound on model evidence (including priors) instead of data likelihood. The principle behind variational methods is the same as expectation-maximization (that is, both are iterative algorithms that alternate between finding the probabilities for each point to be generated by each mixture and fitting the mixture to these assigned points), but variational methods add regularization by integrating information from prior distributions. This avoids the singularities often found in expectation-maximization solutions but introduces some subtle biases to the model. Inference is often notably slower, but not usually so much as to render usage impractical.\n\nDue to its Bayesian nature, the variational algorithm needs more hyperparameters than expectation-maximization, the most important of these being the concentration parameter weight_concentration_prior. Specifying a low value for the concentration prior will make the model put most of the weight on a few components and set the remaining components' weights very close to zero. High values of the concentration prior will allow a larger number of components to be active in the mixture.\n\nPros\nAutomatic selection\nwhen weight_concentration_prior is small enough and n_components is larger than what is found necessary by the model, the Variational Bayesian mixture model has a natural tendency to set some mixture weight values close to zero. This makes it possible to let the model choose a suitable number of effective components automatically. Only an upper bound of this number needs to be provided. 
Note however that the \u201cideal\u201d number of active components is very application specific and is typically ill-defined in a data exploration setting.\n\nLess sensitivity to the number of parameters\nunlike finite models, which will almost always use all components as much as they can, and hence will produce wildly different solutions for different numbers of components, the variational inference with a Dirichlet process prior (weight_concentration_prior_type='dirichlet_process') won't change much with changes to the parameters, leading to more stability and less tuning.\n\nRegularization\ndue to the incorporation of prior information, variational solutions have fewer pathological special cases than expectation-maximization solutions.\n\nCons\nSpeed\nthe extra parametrization necessary for variational inference makes inference slower, although not by much.\n\nHyperparameters\nthis algorithm needs an extra hyperparameter that might need experimental tuning via cross-validation.\n\nBias\nthere are many implicit biases in the inference algorithms (and also in the Dirichlet process if used), and whenever there is a mismatch between these biases and the data it might be possible to fit better models using a finite mixture.", "url": "https://scikit-learn.org/stable/modules/mixture.html#variational-bayesian-gaussian-mixture"}]}, {"name": "Manifold learning", "description": "Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.\n\nHigh-dimensional datasets can be very difficult to visualize. While data in two or three dimensions can be plotted to show the inherent structure of the data, equivalent high-dimensional plots are much less intuitive. 
To aid visualization of the structure of a dataset, the dimension must be reduced in some way.\n\nThe simplest way to accomplish this dimensionality reduction is by taking a random projection of the data. Though this allows some degree of visualization of the data structure, the randomness of the choice leaves much to be desired. In a random projection, it is likely that the more interesting structure within the data will be lost.\n\nManifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications.", "url": "https://scikit-learn.org/stable/modules/manifold.html", "children": [{"name": "Isomap", "description": "One of the earliest approaches to manifold learning is the Isomap algorithm, short for Isometric Mapping. Isomap can be viewed as an extension of Multi-dimensional Scaling (MDS) or Kernel PCA. Isomap seeks a lower-dimensional embedding which maintains geodesic distances between all points. Isomap can be performed with the object Isomap.", "url": "https://scikit-learn.org/stable/modules/manifold.html#isomap"}, {"name": "Locally linear embedding", "description": "Locally linear embedding (LLE) seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods. It can be thought of as a series of local Principal Component Analyses which are globally compared to find the best non-linear embedding. 
\n\nLocally linear embedding can be performed with function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding.", "url": "https://scikit-learn.org/stable/modules/manifold.html#locally-linear-embedding"}, {"name": "Modified Locally Linear Embedding", "description": "One well-known issue with LLE is the regularization problem. When the number of neighbors is greater than the number of input dimensions, the matrix defining each local neighborhood is rank-deficient.\nOne method to address the regularization problem is to use multiple weight vectors in each neighborhood. This is the essence of modified locally linear embedding (MLLE). MLLE can be performed with function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the keyword method = 'modified'. It requires n_neighbors > n_components.", "url": "https://scikit-learn.org/stable/modules/manifold.html#modified-locally-linear-embedding"}, {"name": "Hessian Eigenmapping", "description": "Hessian Eigenmapping (also known as Hessian-based LLE: HLLE) is another method of solving the regularization problem of LLE. It revolves around a hessian-based quadratic form at each neighborhood which is used to recover the locally linear structure. Though other implementations note its poor scaling with data size, sklearn implements some algorithmic improvements which make its cost comparable to that of other LLE variants for small output dimension.", "url": "https://scikit-learn.org/stable/modules/manifold.html#hessian-eigenmapping"}, {"name": "Spectral Embedding", "description": "Spectral Embedding is an approach to calculating a non-linear embedding. Scikit-learn implements Laplacian Eigenmaps, which finds a low dimensional representation of the data using a spectral decomposition of the graph Laplacian. The graph generated can be considered as a discrete approximation of the low dimensional manifold in the high dimensional space. 
Minimization of a cost function based on the graph ensures that points close to each other on the manifold are mapped close to each other in the low dimensional space, preserving local distances. Spectral embedding can be performed with the function spectral_embedding or its object-oriented counterpart SpectralEmbedding.", "url": "https://scikit-learn.org/stable/modules/manifold.html#spectral-embedding"}, {"name": "Local Tangent Space Alignment", "description": "Though not technically a variant of LLE, Local tangent space alignment (LTSA) is algorithmically similar enough to LLE that it can be put in this category. Rather than focusing on preserving neighborhood distances as in LLE, LTSA seeks to characterize the local geometry at each neighborhood via its tangent space, and performs a global optimization to align these local tangent spaces to learn the embedding.", "url": "https://scikit-learn.org/stable/modules/manifold.html#local-tangent-space-alignment"}, {"name": "Multi-dimensional Scaling", "description": "Multidimensional scaling (MDS) seeks a low-dimensional representation of the data in which the distances respect well the distances in the original high-dimensional space. \n\nIn general, MDS is a technique used for analyzing similarity or dissimilarity data. It attempts to model similarity or dissimilarity data as distances in a geometric space. The data can be ratings of similarity between objects, interaction frequencies of molecules, or trade indices between countries.", "url": "https://scikit-learn.org/stable/modules/manifold.html#multi-dimensional-scaling"}, {"name": "t-distributed Stochastic Neighbor Embedding", "description": "t-SNE (TSNE) converts affinities of data points to probabilities. The affinities in the original space are represented by Gaussian joint probabilities and the affinities in the embedded space are represented by Student's t-distributions. 
This allows t-SNE to be particularly sensitive to local structure and has a few other advantages over existing techniques: \n\nRevealing the structure at many scales on a single map \n\nRevealing data that lie in multiple, different, manifolds or clusters \n\nReducing the tendency to crowd points together at the center \n\nWhile Isomap, LLE and variants are best suited to unfold a single continuous low dimensional manifold, t-SNE will focus on the local structure of the data and will tend to extract clustered local groups of samples as highlighted on the S-curve example. This ability to group samples based on the local structure might be beneficial to visually disentangle a dataset that comprises several manifolds at once as is the case in the digits dataset.", "url": "https://scikit-learn.org/stable/modules/manifold.html#t-distributed-stochastic-neighbor-embedding"}]}, {"name": "Clustering", "description": "Clustering of unlabeled data can be performed with the module sklearn.cluster.\n\nEach clustering algorithm comes in two variants: a class, that implements the fit method to learn the clusters on train data, and a function, that, given train data, returns an array of integer labels corresponding to the different clusters. For the class, the labels over the training data can be found in the labels_ attribute.", "url": "https://scikit-learn.org/stable/modules/clustering.html", "children": [{"name": "K-means", "description": "The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares (see below). This algorithm requires the number of clusters to be specified. 
It scales well to a large number of samples and has been used across a large range of application areas in many different fields.", "url": "https://scikit-learn.org/stable/modules/clustering.html#k-means"}, {"name": "Affinity Propagation", "description": "AffinityPropagation creates clusters by sending messages between pairs of samples until convergence. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen, and hence the final clustering is given.", "url": "https://scikit-learn.org/stable/modules/clustering.html#affinity-propagation"}, {"name": "Mean Shift", "description": "MeanShift clustering aims to discover blobs in a smooth density of samples. It is a centroid based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids.", "url": "https://scikit-learn.org/stable/modules/clustering.html#mean-shift"}, {"name": "Spectral clustering", "description": "SpectralClustering performs a low-dimension embedding of the affinity matrix between samples, followed by clustering, e.g., by KMeans, of the components of the eigenvectors in the low dimensional space. It is especially computationally efficient if the affinity matrix is sparse and the amg solver is used for the eigenvalue problem (note that the amg solver requires the pyamg module to be installed). \n\nThe present version of SpectralClustering requires the number of clusters to be specified in advance. 
It works well for a small number of clusters, but is not advised for many clusters.", "url": "https://scikit-learn.org/stable/modules/clustering.html#spectral-clustering"}, {"name": "Hierarchical clustering", "description": "Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample. See the Wikipedia page for more details. \n\nThe AgglomerativeClustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together. The linkage criterion determines the metric used for the merge strategy: \n\nWard minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach. \n\nMaximum or complete linkage minimizes the maximum distance between observations of pairs of clusters. \n\nAverage linkage minimizes the average of the distances between all observations of pairs of clusters. \n\nSingle linkage minimizes the distance between the closest observations of pairs of clusters. \n\nAgglomerativeClustering can also scale to a large number of samples when it is used jointly with a connectivity matrix, but is computationally expensive when no connectivity constraints are added between samples: it considers at each step all the possible merges.", "url": "https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering"}, {"name": "DBSCAN", "description": "The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. 
Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component of DBSCAN is the concept of core samples, which are samples that are in areas of high density.", "url": "https://scikit-learn.org/stable/modules/clustering.html#dbscan"}, {"name": "OPTICS", "description": "The OPTICS algorithm shares many similarities with the DBSCAN algorithm, and can be considered a generalization of DBSCAN that relaxes the eps requirement from a single value to a value range. The key difference between DBSCAN and OPTICS is that the OPTICS algorithm builds a reachability graph, which assigns each sample both a reachability_ distance, and a spot within the cluster ordering_ attribute; these two attributes are assigned when the model is fitted, and are used to determine cluster membership. If OPTICS is run with the default value of inf set for max_eps, then DBSCAN style cluster extraction can be performed repeatedly in linear time for any given eps value using the cluster_optics_dbscan method. Setting max_eps to a lower value will result in shorter run times, and can be thought of as the maximum neighborhood radius from each point to find other potential reachable points.", "url": "https://scikit-learn.org/stable/modules/clustering.html#optics"}, {"name": "BIRCH", "description": "The Birch algorithm builds a tree called the Clustering Feature Tree (CFT) for the given data. The data is essentially lossy compressed to a set of Clustering Feature nodes (CF Nodes). The CF Nodes have a number of subclusters called Clustering Feature subclusters (CF Subclusters) and these CF Subclusters located in the non-terminal CF Nodes can have CF Nodes as children. \n\nThe CF Subclusters hold the necessary information for clustering which prevents the need to hold the entire input data in memory. This information includes: \n\nNumber of samples in a subcluster. 
\n\nLinear Sum - An n-dimensional vector holding the sum of all samples \n\nSquared Sum - Sum of the squared L2 norm of all samples. \n\nCentroids - To avoid recalculation linear sum / n_samples. \n\nSquared norm of the centroids. \n\nThe BIRCH algorithm has two parameters, the threshold and the branching factor. The branching factor limits the number of subclusters in a node and the threshold limits the distance between the entering sample and the existing subclusters.", "url": "https://scikit-learn.org/stable/modules/clustering.html#birch"}]}, {"name": "Biclustering", "description": "Biclustering can be performed with the module sklearn.cluster.bicluster. Biclustering algorithms simultaneously cluster rows and columns of a data matrix. These clusters of rows and columns are known as biclusters. Each determines a submatrix of the original data matrix with some desired properties. For visualization purposes, given a bicluster, the rows and columns of the data matrix may be rearranged to make the bicluster contiguous.\n\nAlgorithms differ in how they define biclusters. Some of the common types include:\n\nconstant values, constant rows, or constant columns\n\nunusually high or low values\n\nsubmatrices with low variance\n\ncorrelated rows or columns\n\nAlgorithms also differ in how rows and columns may be assigned to biclusters, which leads to different bicluster structures. Block diagonal or checkerboard structures occur when rows and columns are divided into partitions.\n\nIf each row and each column belongs to exactly one bicluster, then rearranging the rows and columns of the data matrix reveals the biclusters on the diagonal. 
Here is an example of this structure where biclusters have higher average values than the other rows and columns:", "url": "https://scikit-learn.org/stable/modules/biclustering.html", "children": [{"name": "Spectral Co-Clustering", "description": "The SpectralCoclustering algorithm finds biclusters with values higher than those in the corresponding other rows and columns. Each row and each column belongs to exactly one bicluster, so rearranging the rows and columns to make partitions contiguous reveals these high values along the diagonal.", "url": "https://scikit-learn.org/stable/modules/biclustering.html#spectral-co-clustering"}, {"name": "Spectral Biclustering", "description": "The SpectralBiclustering algorithm assumes that the input data matrix has a hidden checkerboard structure. The rows and columns of a matrix with this structure may be partitioned so that the entries of any bicluster in the Cartesian product of row clusters and column clusters are approximately constant. For instance, if there are two row partitions and three column partitions, each row will belong to three biclusters, and each column will belong to two biclusters. \n\nThe algorithm partitions the rows and columns of a matrix so that a corresponding blockwise-constant checkerboard matrix provides a good approximation to the original matrix.", "url": "https://scikit-learn.org/stable/modules/biclustering.html#spectral-biclustering"}]}, {"name": "Decomposing signals in components", "description": "Decompositions: Transforms, Subbands, and Wavelets\n\nThe signal decomposition (and reconstruction) techniques developed in this book have three salient characteristics:\n\n(1)\n\nOrthonormality. As we shall see, the block transforms will be square unitary matrices, i.e., the rows of the transformation matrix will be orthogonal to each other; the subband filter banks will be paraunitary, a special kind of orthonormality, and the wavelets will be orthonormal.\n(2)\n\nPerfect reconstruction (PR). 
This means that, in the absence of encoding, quantization, and transmission errors, the reconstructed signal can be reassembled perfectly at the receiver.\n(3)\n\nCritical sampling. This implies that the signal is subsampled at a minimum possible rate consistent with the applicable Nyquist theorem. From a practical standpoint, this means that if the original signal has a data rate of fs samples or pixels per second, the sum of the transmission rates out of all the subbands is also fs.", "url": "https://scikit-learn.org/stable/modules/decomposition.html", "children": [{"name": "Principal component analysis (PCA)", "description": "PCA is used to decompose a multivariate dataset into a set of successive orthogonal components that explain a maximum amount of the variance. In scikit-learn, PCA is implemented as a transformer object that learns components in its fit method, and can be used on new data to project it on these components.", "url": "https://scikit-learn.org/stable/modules/decomposition.html#principal-component-analysis-(pca)"}, {"name": "Kernel Principal Component Analysis (kPCA)", "description": "KernelPCA is an extension of PCA which achieves non-linear dimensionality reduction through the use of kernels (see Pairwise metrics, Affinities and Kernels) [Scholkopf1997]. It has many applications including denoising, compression and structured prediction (kernel dependency estimation). KernelPCA supports both transform and inverse_transform.", "url": "https://scikit-learn.org/stable/modules/decomposition.html#kernel-principal-component-analysis-(kpca)"}, {"name": "Truncated singular value decomposition and latent semantic analysis", "description": "TruncatedSVD implements a variant of singular value decomposition (SVD) that only computes the k largest singular values, where k is a user-specified parameter. 
\n\nWhen truncated SVD is applied to term-document matrices (as returned by CountVectorizer or TfidfVectorizer), this transformation is known as latent semantic analysis (LSA), because it transforms such matrices to a \u201csemantic\u201d space of low dimensionality. In particular, LSA is known to combat the effects of synonymy and polysemy (both of which roughly mean there are multiple meanings per word), which cause term-document matrices to be overly sparse and exhibit poor similarity under measures such as cosine similarity.", "url": "https://scikit-learn.org/stable/modules/decomposition.html#truncated-singular-value-decomposition-and-latent-semantic-analysis"}, {"name": "Dictionary Learning", "description": "Dictionary learning (DictionaryLearning) is a matrix factorization problem that amounts to finding a (usually overcomplete) dictionary that will perform well at sparsely encoding the fitted data.\n\nRepresenting data as sparse combinations of atoms from an overcomplete dictionary is suggested to be the way the mammalian primary visual cortex works. Consequently, dictionary learning applied on image patches has been shown to give good results in image processing tasks such as image completion, inpainting and denoising, as well as for supervised recognition tasks.\n\nDictionary learning is an optimization problem solved by alternately updating the sparse code, as a solution to multiple Lasso problems, considering the dictionary fixed, and then updating the dictionary to best fit the sparse code.", "url": "https://scikit-learn.org/stable/modules/decomposition.html#dictionary-learning"}, {"name": "Independent component analysis (ICA)", "description": "Independent component analysis separates a multivariate signal into additive subcomponents that are maximally independent. It is implemented in scikit-learn using the Fast ICA algorithm. Typically, ICA is not used for reducing dimensionality but for separating superimposed signals. 
Since the ICA model does not include a noise term, for the model to be correct, whitening must be applied. This can be done internally using the whiten argument or manually using one of the PCA variants.", "url": "https://scikit-learn.org/stable/modules/decomposition.html#independent-component-analysis-(ica)"}, {"name": "Non-negative matrix factorization", "description": "NMF is an alternative approach to decomposition that assumes that the data and the components are non-negative. NMF can be plugged in instead of PCA or its variants, in the cases where the data matrix does not contain negative values. It finds a decomposition of samples X into two matrices W and H of non-negative elements, by optimizing the distance d between X and the matrix product WH.\n\nUnlike PCA, the representation of a vector is obtained in an additive fashion, by superimposing the components, without subtracting. Such additive models are efficient for representing images and text. \n\nIt has been observed in [Hoyer, 2004] that, when carefully constrained, NMF can produce a parts-based representation of the dataset, resulting in interpretable models. The following example displays 16 sparse components found by NMF from the images in the Olivetti faces dataset, in comparison with the PCA eigenfaces.", "url": "https://scikit-learn.org/stable/modules/decomposition.html#non-negative-matrix-factorization"}, {"name": "Latent Dirichlet Allocation", "description": "Latent Dirichlet Allocation is a generative probabilistic model for collections of discrete data such as text corpora. It is also a topic model that is used for discovering abstract topics from a collection of documents.", "url": "https://scikit-learn.org/stable/modules/decomposition.html#latent-dirichlet-allocation"}]}, {"name": "Covariance estimation", "description": "Many statistical problems require the estimation of a population\u2019s covariance matrix, which can be seen as an estimation of data set scatter plot shape. 
Most of the time, such an estimation has to be done on a sample whose properties (size, structure, homogeneity) have a large influence on the estimation\u2019s quality. The sklearn.covariance package provides tools for accurately estimating a population\u2019s covariance matrix under various settings.", "url": "https://scikit-learn.org/stable/modules/covariance.html"}, {"name": "Novelty and Outlier Detection", "description": "Outlier detection and novelty detection are both used for anomaly detection, where one is interested in detecting abnormal or unusual observations. Outlier detection is then also known as unsupervised anomaly detection and novelty detection as semi-supervised anomaly detection. In the context of outlier detection, the outliers/anomalies cannot form a dense cluster as available estimators assume that the outliers/anomalies are located in low density regions. On the contrary, in the context of novelty detection, novelties/anomalies can form a dense cluster as long as they are in a low density region of the training data, considered as normal in this context.", "url": "https://scikit-learn.org/stable/modules/outlier_detection.html"}, {"name": "Density Estimation", "description": "Density estimation walks the line between unsupervised learning, feature engineering, and data modeling. Some of the most popular and useful density estimation techniques are mixture models such as Gaussian Mixtures (GaussianMixture), and neighbor-based approaches such as the kernel density estimate (KernelDensity). 
Gaussian Mixtures are discussed more fully in the context of clustering, because the technique is also useful as an unsupervised clustering scheme.\n\nDensity estimation is a very simple concept, and most people are already familiar with one common density estimation technique: the histogram.", "url": "https://scikit-learn.org/stable/modules/density.html"}, {"name": "Neural network models (unsupervised)", "url": "https://scikit-learn.org/stable/modules/neural_networks_unsupervised.html", "children": [{"name": "Restricted Boltzmann machines", "description": "Restricted Boltzmann machines (RBM) are unsupervised nonlinear feature learners based on a probabilistic model. The features extracted by an RBM or a hierarchy of RBMs often give good results when fed into a linear classifier such as a linear SVM or a perceptron. \n\nThe model makes assumptions regarding the distribution of inputs. At the moment, scikit-learn only provides BernoulliRBM, which assumes the inputs are either binary values or values between 0 and 1, each encoding the probability that the specific feature would be turned on.", "url": "https://scikit-learn.org/stable/modules/neural_networks_unsupervised.html#restricted-boltzmann-machines"}]}]}, {"name": "Model selection and evaluation", "children": [{"name": "Cross-validation: evaluating estimator performance", "description": "Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. 
Note that the word \u201cexperiment\u201d is not intended to denote academic use only, because even in commercial settings machine learning usually starts out experimentally.", "url": "https://scikit-learn.org/stable/modules/cross_validation.html", "children": [{"name": "Computing cross-validated metrics", "description": "The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset.", "url": "https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics"}, {"name": "Cross validation iterators", "description": "The following sections list utilities to generate indices that can be used to generate dataset splits according to different cross validation strategies.", "url": "https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators"}]}, {"name": "Tuning the hyper-parameters of an estimator", "description": "Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. Typical examples include C, kernel and gamma for Support Vector Classifier, alpha for Lasso, etc. It is possible and recommended to search the hyper-parameter space for the best cross validation score. Any parameter provided when constructing an estimator may be optimized in this manner. 
Specifically, to find the names and current values for all parameters for a given estimator,", "url": "https://scikit-learn.org/stable/modules/grid_search.html", "children": [{"name": "Exhaustive Grid Search", "description": "The grid search provided by GridSearchCV exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter.\n\nThe GridSearchCV instance implements the usual estimator API: when \u201cfitting\u201d it on a dataset all the possible combinations of parameter values are evaluated and the best combination is retained.", "url": "https://scikit-learn.org/stable/modules/grid_search.html#exhaustive-grid-search"}, {"name": "Randomized Parameter Optimization", "description": "While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favorable properties. RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search: \n\nA budget can be chosen independent of the number of parameters and possible values. \n\nAdding parameters that do not influence the performance does not decrease efficiency. \n\nSpecifying how parameters should be sampled is done using a dictionary, very similar to specifying parameters for GridSearchCV. Additionally, a computation budget, being the number of sampled candidates or sampling iterations, is specified using the n_iter parameter. 
For each parameter, either a distribution over possible values or a list of discrete choices (which will be sampled uniformly) can be specified:", "url": "https://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-optimization"}]}, {"name": "Metrics and scoring: quantifying the quality of predictions", "description": "There are 3 different APIs for evaluating the quality of a model\u2019s predictions:\n\nEstimator score method: Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve. This is not discussed on this page, but in each estimator\u2019s documentation.\n\nScoring parameter: Model-evaluation tools using cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV) rely on an internal scoring strategy. This is discussed in the section The scoring parameter: defining model evaluation rules.\n\nMetric functions: The sklearn.metrics module implements functions assessing prediction error for specific purposes. These metrics are detailed in sections on Classification metrics, Multilabel ranking metrics, Regression metrics and Clustering metrics.", "url": "https://scikit-learn.org/stable/modules/model_evaluation.html"}, {"name": "Validation curves: plotting scores to evaluate models", "description": "Every estimator has its advantages and drawbacks. Its generalization error can be decomposed in terms of bias, variance and noise. The bias of an estimator is its average error for different training sets. The variance of an estimator indicates how sensitive it is to varying training sets. Noise is a property of the data.", "url": "https://scikit-learn.org/stable/modules/learning_curve.html", "children": [{"name": "Validation curve", "description": "To validate a model we need a scoring function (see Metrics and scoring: quantifying the quality of predictions), for example accuracy for classifiers. 
The proper way of choosing multiple hyperparameters of an estimator is of course grid search or similar methods (see Tuning the hyper-parameters of an estimator) that select the hyperparameter with the maximum score on a validation set or multiple validation sets. Note that if we optimize the hyperparameters based on a validation score the validation score is biased and not a good estimate of the generalization any longer. To get a proper estimate of the generalization we have to compute the score on another test set. \n\nHowever, it is sometimes helpful to plot the influence of a single hyperparameter on the training score and the validation score to find out whether the estimator is overfitting or underfitting for some hyperparameter values.", "url": "https://scikit-learn.org/stable/modules/learning_curve.html#validation-curve"}, {"name": "Learning curve", "description": "A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error. Consider the following example where we plot the learning curve of a naive Bayes classifier and an SVM. \n\nFor the naive Bayes, both the validation score and the training score converge to a value that is quite low with increasing size of the training set. Thus, we will probably not benefit much from more training data. \n\nIn contrast, for small amounts of data, the training score of the SVM is much greater than the validation score. 
Adding more training samples will most likely increase generalization.", "url": "https://scikit-learn.org/stable/modules/learning_curve.html#learning-curve"}]}]}, {"name": "Inspection", "children": [{"name": "Partial Dependence and Individual Conditional Expectation plots", "description": "Partial dependence plots (PDP) and individual conditional expectation (ICE) plots can be used to visualize and analyze interaction between the target response and a set of input features of interest. Both PDPs [H2009] and ICEs [G2015] assume that the input features of interest are independent from the complement features, and this assumption is often violated in practice. Thus, in the case of correlated features, we will create absurd data points to compute the PDP/ICE [M2019].", "url": "https://scikit-learn.org/stable/modules/partial_dependence.html", "children": [{"name": "Partial dependence plots", "description": "Partial dependence plots (PDP) show the dependence between the target response and a set of input features of interest, marginalizing over the values of all other input features (the \u2018complement\u2019 features). Intuitively, we can interpret the partial dependence as the expected target response as a function of the input features of interest. \n\nDue to the limits of human perception, the size of the set of input features of interest must be small (usually one or two); thus the input features of interest are usually chosen among the most important features.\n\nOne-way PDPs tell us about the interaction between the target response and an input feature of interest (e.g. linear, non-linear). The left plot in the above figure shows the effect of the average occupancy on the median house price; we can clearly see a linear relationship between them when the average occupancy is below 3 persons. Similarly, we could analyze the effect of the house age on the median house price (middle plot). 
Thus, these interpretations are marginal, considering a feature at a time.", "url": "https://scikit-learn.org/stable/modules/partial_dependence.html#partial-dependence-plots"}, {"name": "Individual conditional expectation (ICE) plot", "description": "Similar to a PDP, an individual conditional expectation (ICE) plot shows the dependence between the target function and an input feature of interest. However, unlike a PDP, which shows the average effect of the input feature, an ICE plot visualizes the dependence of the prediction on a feature for each sample separately with one line per sample. Due to the limits of human perception, only one input feature of interest is supported for ICE plots. \n\nThe figures below show four ICE plots for the California housing dataset, with a HistGradientBoostingRegressor. The second figure plots the corresponding PD line overlaid on ICE lines.\n\nWhile the PDPs are good at showing the average effect of the target features, they can obscure a heterogeneous relationship created by interactions. When interactions are present the ICE plot will provide many more insights. For example, we could observe a linear relationship between the median income and the house price in the PD line. However, the ICE lines show that there are some exceptions, where the house price remains constant in some ranges of the median income.", "url": "https://scikit-learn.org/stable/modules/partial_dependence.html#individual-conditional-expectation-(ice)-plot"}]}, {"name": "Permutation feature importance", "description": "Permutation feature importance is a model inspection technique that can be used for any fitted estimator when the data is tabular. This is especially useful for non-linear or opaque estimators. The permutation feature importance is defined to be the decrease in a model score when a single feature value is randomly shuffled. 
This procedure breaks the relationship between the feature and the target, thus the drop in the model score is indicative of how much the model depends on the feature. This technique benefits from being model agnostic and can be calculated many times with different permutations of the feature.", "url": "https://scikit-learn.org/stable/modules/permutation_importance.html"}]}, {"name": "Visualizations", "children": [{"name": "Visualizations", "description": "Scikit-learn defines a simple API for creating visualizations for machine learning. The key feature of this API is to allow for quick plotting and visual adjustments without recalculation. We provide Display classes that expose two methods for creating plots: from_estimator and from_predictions. The from_estimator method will take a fitted estimator and some data (X and y) and create a Display object.", "url": "https://scikit-learn.org/stable/visualizations.html"}]}, {"name": "Dataset transformations", "children": [{"name": "Pipelines and composite estimators", "description": "Transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator. The most common tool is a Pipeline. Pipeline is often used in combination with FeatureUnion which concatenates the output of transformers into a composite feature space. TransformedTargetRegressor deals with transforming the target (e.g. log-transform y). 
In contrast, Pipelines only transform the observed data (X).", "url": "https://scikit-learn.org/stable/modules/compose.html"}, {"name": "Feature extraction", "description": "The sklearn.feature_extraction module can be used to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and image.", "url": "https://scikit-learn.org/stable/modules/feature_extraction.html"}, {"name": "Preprocessing data", "description": "The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.\n\nIn general, learning algorithms benefit from standardization of the data set. If some outliers are present in the set, robust scalers or transformers are more appropriate. The behaviors of the different scalers, transformers, and normalizers on a dataset containing marginal outliers is highlighted in Compare the effect of different scalers on data with outliers.", "url": "https://scikit-learn.org/stable/modules/preprocessing.html"}, {"name": "Imputation of missing values", "description": "For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical, and that all have and hold meaning. A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing missing values. However, this comes at the price of losing data which may be valuable (even though incomplete). A better strategy is to impute the missing values, i.e., to infer them from the known part of the data. 
See the Glossary of Common Terms and API Elements entry on imputation.", "url": "https://scikit-learn.org/stable/modules/impute.html"}, {"name": "Unsupervised dimensionality reduction", "description": "If your number of features is high, it may be useful to reduce it with an unsupervised step prior to supervised steps. Many of the Unsupervised learning methods implement a transform method that can be used to reduce the dimensionality. Below we discuss two specific examples of this pattern that are heavily used.", "url": "https://scikit-learn.org/stable/modules/unsupervised_reduction.html"}, {"name": "Random Projection", "description": "The sklearn.random_projection module implements a simple and computationally efficient way to reduce the dimensionality of the data by trading a controlled amount of accuracy (as additional variance) for faster processing times and smaller model sizes. This module implements two types of unstructured random matrix: Gaussian random matrix and sparse random matrix.\n\nThe dimensions and distribution of random projections matrices are controlled so as to preserve the pairwise distances between any two samples of the dataset. Thus random projection is a suitable approximation technique for distance-based methods.", "url": "https://scikit-learn.org/stable/modules/random_projection.html"}, {"name": "Kernel Approximation", "description": "This submodule contains functions that approximate the feature mappings that correspond to certain kernels, as they are used for example in support vector machines (see Support Vector Machines). 
The following feature functions perform non-linear transformations of the input, which can serve as a basis for linear classification or other algorithms.\n\nThe advantage of using approximate explicit feature maps compared to the kernel trick, which makes use of feature maps implicitly, is that explicit mappings can be better suited for online learning and can significantly reduce the cost of learning with very large datasets. Standard kernelized SVMs do not scale well to large datasets, but using an approximate kernel map it is possible to use much more efficient linear SVMs. In particular, the combination of kernel map approximations with SGDClassifier can make non-linear learning on large datasets possible.\n\nSince there has not been much empirical work using approximate embeddings, it is advisable to compare results against exact kernel methods when possible.", "url": "https://scikit-learn.org/stable/modules/kernel_approximation.html"}, {"name": "Pairwise metrics, Affinities and Kernels", "description": "The sklearn.metrics.pairwise submodule implements utilities to evaluate pairwise distances or affinity of sets of samples.\n\nThis module contains both distance metrics and kernels. A brief summary is given on the two here.\n\nDistance metrics are functions d(a, b) such that d(a, b) < d(a, c) if objects a and b are considered \u201cmore similar\u201d than objects a and c. Two objects exactly alike would have a distance of zero. One of the most popular examples is Euclidean distance. To be a \u2018true\u2019 metric, it must obey the following four conditions:", "url": "https://scikit-learn.org/stable/modules/metrics.html"}, {"name": "Transforming the prediction target (y)", "description": "These are transformers that are not intended to be used on features, only on supervised learning targets. 
See also Transforming target in regression if you want to transform the prediction target for learning, but evaluate the model in the original (untransformed) space.", "url": "https://scikit-learn.org/stable/modules/preprocessing_targets.html"}]}, {"name": "Dataset loading utilities", "children": [{"name": "Dataset loading utilities", "description": "The sklearn.datasets package embeds some small toy datasets as introduced in the Getting Started section.\n\nThis package also features helpers to fetch larger datasets commonly used by the machine learning community to benchmark algorithms on data that comes from the \u2018real world\u2019.\n\nTo evaluate the impact of the scale of the dataset (n_samples and n_features) while controlling the statistical properties of the data (typically the correlation and informativeness of the features), it is also possible to generate synthetic data.\n\nGeneral dataset API. There are three main kinds of dataset interfaces that can be used to get datasets depending on the desired type of dataset.\n\nThe dataset loaders. They can be used to load small standard datasets, described in the Toy datasets section.\n\nThe dataset fetchers. They can be used to download and load larger datasets, described in the Real world datasets section.\nScikit-learn also embeds a couple of sample JPEG images published under Creative Commons license by their authors. Those images can be useful to test algorithms and pipelines on 2D data.", "url": "https://scikit-learn.org/stable/datasets.html"}]}]