In this document, we give a brief overview of the ML.NET high-level concepts. This document is mainly intended to describe the model training scenarios in ML.NET, since not all these concepts are relevant for the more simple scenario of prediction with existing model.
This document is going to cover the following ML.NET concepts:
- Data, represented as an
IDataView
interface.- In ML.NET, data is very similar to a SQL view: it's a lazily-evaluated, immutable, cursorable, heterogenous, schematized dataset.
- An excellent document about the data interface is IDataView Design Principles.
- Transformer, represented as
ITransformer
interface.- In one sentence, a transformer is a component that takes data, does some work on it, and return new 'transformed' data.
- For example, you can think of a machine learning model as a transformer that takes features and returns predictions.
- Another example, 'text tokenizer' would take a single text column and output a vector column with individual 'words' extracted out of the texts.
- Data reader, represented as an
IDataReader<T>
interface.- The data reader is ML.NET component to 'create' data: it takes an instance of
T
and returns data out of it. - For example, a TextLoader is an
IDataReader<FileSource>
: it takes the file source and produces data.
- The data reader is ML.NET component to 'create' data: it takes an instance of
- Estimator, represented as an
IEstimator<T>
interface.- This is an object that learns from data. The result of the learning is a transformer.
- You can think of a machine learning algorithm as an estimator that learns on data and produces a machine learning model (which is a transformer).
- Prediction function, represented as a
PredictionFunction<TSrc, TDst>
class.- The prediction function can be seen as a machine that applies a transformer to one 'row', such as at prediction time.
In ML.NET, data is very similar to a SQL view: it's a lazily-evaluated, cursorable, heterogenous, schematized dataset.
- It has Schema (an instance of an
ISchema
interface), that contains the information about the data view's columns.- Each column has a Name, a Type, and an arbitrary set of metadata associated with it.
- It is important to note that one of the types is the
vector<T, N>
type, which means that the column's values are vectors of items of type T, with the size of N. This is a recommended way to represent multi-dimensional data associated with every row, like pixels in an image, or tokens in a text. - The column's metadata contains information like 'slot names' of a vector column and suchlike. The metadata itself is actually represented as another one-row data, that is unique to each column.
- The data view is a source of cursors. Think SQL cursors: a cursor is an object that iterates through the data, one row at a time, and presents the available data.
- Naturally, data can have as many active cursors over it as needed: since data itself is immutable, cursors are truly independent.
- Note that cursors typically access only a subset of columns: for efficiency, we do not compute the values of columns that are not 'needed' by the cursor.
A transformer is a component that takes data, does some work on it, and return new 'transformed' data.
Here's the interface of ITransformer
:
public interface ITransformer
{
IDataView Transform(IDataView input);
ISchema GetOutputSchema(ISchema inputSchema);
}
As you can see, the transformer can Transform
an input data to produce the output data. The other method, GetOutputSchema
, is a mechanism of schema propagation: it allows you to see how the output data will look like for a given shape of the input data without actually performing the transformation.
Most transformers in ML.NET tend to operate on one input column at a time, and produce the output column. For example a new HashTransformer("foo", "bar")
would take the values from column "foo", hash them and put them into column "bar".
It is also common that the input and output column names are the same. In this case, the old column is 'replaced' with the new one. For example, a new HashTransformer("foo")
would take the values from column "foo", hash them and 'put them back' into "foo".
Any transformer will, of course, produce a new data view when Transform
is called: remember, data views are immutable.
Another important consideration is that, because data is lazily evaluated, transformers are lazy too. Essentially, after you call
var newData = transformer.Transform(oldData)
no actual computation will happen: only after you get a cursor from newData
and start consuming the value will newData
invoke the transformer
's transformation logic (and even that only if transformer
in question is actually needed to produce the requested columns).
A useful property of a transformer is that you can phrase a sequential application of transformers as yet another transformer:
var fullTransformer = transformer1.Append(transformer2).Append(transformer3);
We utilize this property a lot in ML.NET: typically, the trained ML.NET model is a 'chain of transformers', which is, for all intents and purposes, a transformer.
The data reader is ML.NET component to 'create' data: it takes an instance of T
and returns data out of it.
Here's the exact interface of IDataReader<T>
:
public interface IDataReader<in TSource>
{
IDataView Read(TSource input);
ISchema GetOutputSchema();
}
As you can see, the reader is capable of reading data (potentially multiple times, and from different 'inputs'), but the resulting data will always have the same schema, denoted by GetOutputSchema
.
An interesting property to note is that you can create a new data reader by 'attaching' a transformer to an existing data reader. This way you can have 'reader' with transformation behavior baked in:
var newReader = reader.Append(transformer1).Append(transformer2)
Another similarity to transformers is that, since data is lazily evaluated, readers are lazy: no (or minimal) actual 'reading' happens when you call dataReader.Read()
: only when a cursor is requested on the resulting data does the reader begin to work.
The estimator is an object that learns from data. The result of the learning is a transformer.
Here is the interface of IEstimator<T>
:
public interface IEstimator<out TTransformer>
where TTransformer : ITransformer
{
TTransformer Fit(IDataView input);
SchemaShape GetOutputSchema(SchemaShape inputSchema);
}
You can easily imagine how a sequence of estimators can be phrased as an estimator of its own. In ML.NET, we rely on this property to create 'learning pipelines' that chain together different estimators:
var env = new LocalEnvironment(); // Initialize the ML.NET environment.
var estimator = new ConcatEstimator(env, "Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
.Append(new ToKeyEstimator(env, "Label"))
.Append(new SdcaMultiClassTrainer(env, "Features", "Label")) // This is the actual 'machine learning algorithm'.
.Append(new ToValueEstimator(env, "PredictedLabel"));
var endToEndModel = estimator.Fit(data); // This now contains all the transformers that were used at training.
One important property of estimators is that estimators are eager, not lazy: every call to Fit
is causing 'learning' to happen, which is potentially a time-consuming operation.
The prediction function can be seen as a machine that applies a transformer to one 'row', such as at prediction time.
Once we obtain the model (which is a transformer that we either trained via Fit()
, or loaded from somewhere), we can use it to make 'predictions' using the normal calls to model.Transform(data)
. However, when we use this model in a real life scenario, we often don't have a whole 'batch' of examples to predict on. Instead, we have one example at a time, and we need to make timely predictions on them immediately.
Of course, we can reduce this to the batch prediction:
- Create a data view with exactly one row.
- Call
model.Transform(data)
to obtain the 'predicted data view'. - Get a cursor over the resulting data.
- Advance the cursor one step to get to the first (and only) row.
- Extract the predicted values out of it.
The above algorithm can be implemented using the schema comprehension, with two user-defined objects InputExample
and OutputPrediction
as follows:
var inputData = env.CreateDataView(new InputExample[] { example });
var outputData = model.Transform(inputData);
var output = outputData.AsDynamic.AsEnumerable<OutputPrediction>(env, reuseRowObject: false).Single();
But this would be cumbersome, and would incur performance costs.
Instead, we have a 'prediction function' object that performs the same work, but faster and more convenient, via an extension method MakePredictionFunction
:
var predictionFunc = model.MakePredictionFunction<InputExample, OutputPrediction>(env);
var output = predictionFunc.Predict(example);
The same predictionFunc
can (and should!) be used multiple times, thus amortizing the initial cost of MakePredictionFunction
call.
The prediction function is not re-entrant / thread-safe: if you want to conduct predictions simultaneously with multiple threads, you need to have a prediction function per thread.