Flow Serialization
One of the key features of OpenML is the ability to serialize flows. Workbench packages such as Weka, Scikit-learn and mlr contain many algorithms/classifiers that can be uploaded to OpenML. Uploading in this sense is better described as "registering": in order to download and re-use the algorithm/classifier, one needs the actual workbench package in combination with the meta-data on OpenML.
In this article, we refer to the classifier as the algorithm that lives in the workbench package, and to the flow as the registered version of this algorithm on the OpenML server.
We aim to create a perfect mapping between the actual instantiation of the classifier/algorithm in the workbench and the registered flow:
- Any instantiation of a given algorithm from a workbench package should be mapped to the same flow description on OpenML. For example, consider the Scikit-learn classifier Random Forest without any additional pipeline or preprocessing components. Every user that uses this classifier within the openml-python package should have their results linked to the same flow on OpenML.
- Flows on OpenML should contain all information to be reinstantiated on the computer of the user, given the correct version of the workbench and the connector package.
- Hyperparameter settings are irrelevant at the flow level. Any two (combinations of) algorithms that utilize the same entity in the workbench but have different hyperparameter settings are considered to be the same flow on OpenML.
- Ideally, none of the registered flows has any source or binary files attached (as all information should be available in a condensed format).
- A good unit test would consist of the following steps (a sketch follows this list):
- instantiate a classifier
- solve a small task with a complex decision boundary
- upload the classifier to OpenML (not necessarily the run result)
- download the flow from OpenML and re-instantiate the classifier
- solve the same small task as in step 2
- assert that the predictions from the classifier before uploading are exactly the same as the predictions from the re-instantiated classifier
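A minimal sketch of such a test, assuming the openml-python extension API (model_to_flow / flow_to_model); the exact entry points may differ between versions:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

import openml
from openml.extensions.sklearn import SklearnExtension


def test_flow_round_trip():
    extension = SklearnExtension()

    # 1. Instantiate a classifier.
    clf = RandomForestClassifier(random_state=1)

    # 2. Solve a small task with a complex decision boundary.
    X, y = make_moons(n_samples=200, noise=0.3, random_state=1)
    clf.fit(X, y)
    predictions_before = clf.predict(X)

    # 3. Upload (register) the classifier as a flow on OpenML.
    flow = extension.model_to_flow(clf)
    flow.publish()

    # 4. Download the flow and re-instantiate the classifier.
    downloaded = openml.flows.get_flow(flow.flow_id)
    clf_restored = extension.flow_to_model(downloaded)

    # 5. Solve the same small task as in step 2.
    clf_restored.fit(X, y)
    predictions_after = clf_restored.predict(X)

    # 6. The predictions before uploading must match exactly.
    np.testing.assert_array_equal(predictions_before, predictions_after)
```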
The database record of a flow consists of the following fields (a sketch as a data structure follows the list):
- flow id (assigned by the openml server)
- flow name and external version. This is the main information that determines equality between algorithms. The combination is a unique key; any algorithms that map to the same flow name and external version combination are considered equal.
- automatically uploaded meta-data (uploader, upload date, version; assigned by the OpenML server)
- custom name field, for a human-readable name (currently not used)
- free-format meta-data, such as dependencies, installation notes, description, etc.
- attached source file and binary file
- parameters
- subflows (recursive definition of flow)
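As an illustration, such a record could be sketched as the following data structure; the field names follow the list above and are not guaranteed to match the actual database schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FlowRecord:
    # Illustrative sketch only, not the actual OpenML schema.
    flow_id: Optional[int]              # assigned by the OpenML server
    name: str                           # with external_version: the unique key
    external_version: str
    uploader: Optional[str] = None      # automatically uploaded meta-data
    upload_date: Optional[str] = None
    version: Optional[int] = None       # assigned by the OpenML server
    custom_name: Optional[str] = None   # human-readable name (currently unused)
    meta_data: dict = field(default_factory=dict)   # dependencies, notes, ...
    source_file: Optional[bytes] = None
    binary_file: Optional[bytes] = None
    parameters: dict = field(default_factory=dict)
    subflows: list = field(default_factory=list)    # recursive flow definition
```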
There are currently several different ways to ensure the perfect mapping between client-side algorithms and OpenML flows:
- each (combination of) algorithms is represented by a canonical name. For example, a pipeline that consists of an imputation component and a bagging classifier that contains a tree as base classifier can be represented as Pipeline(Imputation, Bagging(DecisionTree)). The Weka and Scikit-learn packages use this representation schema.
- utilizing a hash of the code (see the sketch after this list). An algorithm gets assigned a (not necessarily unique) name, and the code is (MD5) hashed and used as the external version; this resolves name clashes on the flow name field. The name alone is not enough to re-instantiate the classifier, so a source or binary file is attached. RapidMiner and mlr use this representation schema.
- The MOA client uses a canonical name for flows which is a unique representation of a given algorithm (like Weka and Scikit-learn). However, it currently lacks the functionality to re-instantiate the flows.
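A minimal sketch of the hash-based schema from the second bullet (the function name is made up for illustration):

```python
import hashlib
import inspect

def external_version(classifier_class):
    # Hash the classifier's source code; two different implementations that
    # happen to share a name still get different external versions.
    source = inspect.getsource(classifier_class)
    return hashlib.md5(source.encode("utf-8")).hexdigest()
```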
The current representation has several flaws:
- No uniform standard across workbenches
- It is often hard and unnatural to break down a flow into a hierarchy of subflows
- The current representation does not allow for component identifiers within a flow. This means that if the flow contains the same component twice (e.g., an imputer for categorical features and an imputer for numerical features), these cannot be referred to unambiguously.
- Space and size restrictions on the name field (which are unacceptable for a field that ensures serialization)
- (Not really a flaw, but this design is quite biased towards the structure of Weka, whereas it is harder to adapt to Scikit-learn, mlr or RapidMiner.)
The following fields will be of key importance in the new version:
- serialization: Currently the name of the flow. This will no longer be used as the displayed name on the frontend, but rather as information required to re-instantiate the flow. Format specified below.
- external_version: Same as currently; specifies the version number of the workbench package.
- The custom name field will be used as a human-readable name on the webserver.
- A more clearly defined way to specify which package was used and which packages are required to run the flow.
The following fields will be removed:
- (openml) version: useless
- binary file / source file: barely used in practice (mlr/RapidMiner utilize this field, but the information that is stored here should move to the serialization field)
- subflows: although this is an interesting feature, we haven't utilized it in our research; it is hard to comply with this standard, and it makes it hard to build consistent packages.
The naming schema that Weka and Scikit-learn use to represent their algorithms in a single line (e.g., Pipeline(Preprocessing1, Preprocessing2, MetaClassifier(BaseClassifier))) is a successful example of using this serialization (although we can also use the graph format of DARPA, TODO: source for description). One such schema will be selected, and ideally all workbenches that can comply with it will do so. Keras, RapidMiner and other workbench packages that do not easily fit into such a schema can use their internal schema, although it will be hard to compare flows across packages. A sketch of deriving such a canonical name follows.
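A minimal sketch, assuming a simplified recursive traversal (the real connector packages are more careful about which attributes count as sub-estimators):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

def canonical_name(estimator):
    """Render an estimator recursively as Name(SubName, ...)."""
    name = type(estimator).__name__
    if isinstance(estimator, Pipeline):
        inner = ", ".join(canonical_name(step) for _, step in estimator.steps)
        return f"{name}({inner})"
    # Treat any attribute that itself looks like an estimator as a sub-flow.
    subs = [v for v in vars(estimator).values() if hasattr(v, "get_params")]
    if subs:
        return f"{name}({', '.join(canonical_name(s) for s in subs)})"
    return name

pipe = Pipeline([("imputation", SimpleImputer()),
                 ("bagging", BaggingClassifier(DecisionTreeClassifier()))])
# e.g. Pipeline(SimpleImputer, BaggingClassifier(DecisionTreeClassifier))
print(canonical_name(pipe))
```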
- The pipeline is described as a JSON string. It has a list of all 'inputs' (datasets), 'components' (algorithms), and 'steps' (structure).
- Components: Every component of the flow has a name and an ID (hash). The ID can be something internal, not an OpenML ID. This should allow the library to look up the right components.
- Steps: Basically a list of all the steps that need to be done to execute the pipeline, but at the same time it also shows the structure of the pipeline. Each step also has its own ID, since the same component could be used multiple times (in multiple steps).
- The pipeline can be a DAG, so components can have multiple inputs. Every library can decide how to name the inputs; below I'm just using input1, input2, ... For convenience, the steps are ordered by a topological sort (so a step always comes before every other step that depends on its output).
- Inputs are part of the serialization. This is needed when you have more than 1 input dataset.
Consider a simple pipeline: input 0 (data) -> component 0 (preprocessor) -> component 1 (learner)
Then the JSON serialization looks something like this:
{
  "inputs": [
    "Training Data"
  ],
  "components": [
    {"name": "sklearn.preprocessing.OneHotEncoder", "version": "0.1"},
    {"name": "sklearn.svm.SVC", "version": "0.1"}
  ],
  "steps": [
    {"0.input1": "inputs.0"},
    {"1.input1": "steps.0.produce"}
  ]
}
Here the array positions serve as the IDs: the first step wires input 0 into component 0, and the second step feeds the output of step 0 (generated by its 'produce' method) into component 1.
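As a sketch of how a client library might consume this format (the helper names are made up; this is not an existing OpenML API), the dotted component names can be imported directly and the steps resolved in a single pass:

```python
import importlib
import json

def resolve_component(dotted_name):
    """Import a component class from a dotted name such as 'sklearn.svm.SVC'."""
    module_name, class_name = dotted_name.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), class_name)

def load_pipeline(serialization):
    flow = json.loads(serialization)
    components = [resolve_component(c["name"]) for c in flow["components"]]
    wiring = []
    # Steps are topologically sorted, so every 'steps.<i>.produce' reference
    # points to an earlier step and can be resolved in one forward pass.
    for step in flow["steps"]:
        for target, source in step.items():
            component_index, input_name = target.split(".", 1)
            wiring.append((components[int(component_index)], input_name, source))
    return components, wiring
```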
It also supports very complex pipelines. Imagine a pipeline which includes a pretrained component (e.g. a pretrained kernel) and builds a stacking ensemble afterwards.
Then the structure would be described like this:
"steps": [
  {"4.input1": "inputs.0"},
  {"5.input1": "steps.0.produce"},
  {"1.input1": "inputs.0"},
  {"2.input1": "steps.2.produce", "2.input2": "steps.0.produce"},
  {"3.input1": "steps.1.produce", "3.input2": "steps.3.produce"}
]
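A minimal sketch (pure illustration) that checks the topological-sort convention on the steps above: a step may only reference the output of steps that appear earlier in the list.

```python
def check_topological_order(steps):
    for position, step in enumerate(steps):
        for source in step.values():
            if source.startswith("steps."):
                referenced = int(source.split(".")[1])
                assert referenced < position, (
                    f"step {position} references later step {referenced}")

check_topological_order([
    {"4.input1": "inputs.0"},
    {"5.input1": "steps.0.produce"},
    {"1.input1": "inputs.0"},
    {"2.input1": "steps.2.produce", "2.input2": "steps.0.produce"},
    {"3.input1": "steps.1.produce", "3.input2": "steps.3.produce"},
])
```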
Note that this representation separates algorithms from pipelines, which looks cleaner:
- Algorithms can optionally be linked to code (e.g. a github link), have versioning, dependencies,... They also have hyperparameters and their descriptions, default settings,...
- Pipelines are much simpler things that are fully described by a list of components and their structure.
Changes:
- For clients, it would be a change in the way the flows are serialized and deserialized.
- For the OpenML API, it would mean quite a few changes in how we store and list flows. Also, we'd have to think about whether we want to upload algorithms and pipelines separately, regardless of how we store them internally.
- For the frontend, it allows easier search (e.g. search for algorithms and flows separately), and it allows a much cleaner rendering of flows. Currently, this is super-messy, and I have to write different code to render flows from different libraries, which is a major headache. A common representation would be super useful for that. The same is true for anyone who wants to analyze OpenML pipelines.
- It seems that the scikit-learn interface can be adapted rather easily to comply with this standard. An important open question is whether the same holds for Weka, mlr, MOA and RapidMiner.
- Weka: how to handle parameters of subflows?
- connectability with Docker. This would allow a uniform interface to rerun the models on the server.
- a way to check whether uploaded models comply with server standards
@amueller says: In sklearn there is no concept of a hyperparameter, and separating the flow from the hyperparameter settings is not semantically meaningful. Also, sklearn doesn't allow arbitrary DAGs, and even though it allows creating complex graphs, these are implicit in the objects; there is currently no explicit representation of the graph.