Using doepipeline
First, decide which tools and steps the data processing pipeline should consist of. The pipeline may be simple, with only a single step, or more complex, with many steps.
Then decide what parameters (factors) of the pipeline you wish to optimize, and what characteristic(s) you are going to use as response. A response is what doepipeline uses to evaluate the outcome of each pipeline run, i.e. how well the pipeline performed using a specific set of factor settings.
The factors may be both numerical and categorical in nature, and the numerical factors may be either ordinal (integer) or continuous (float).
Of course, you will also want to install doepipeline.
Essentially, doepipeline works iteratively. In each iteration, an experimental design is used to construct a set of pipelines that are subsequently executed and their outcomes evaluated. The factor settings used in each of the pipelines are governed by said experimental design. A simple experimental worksheet for investigating two quantitative factors may look like this:
| Experiment no. | Factor A | Factor B |
|---|---|---|
| 1 | 0 | 1 |
| 2 | 0 | 0 |
| 3 | 1 | 1 |
| 4 | 1 | 0 |
For the simple example above, a total of four instances of the pipeline will be executed, with the settings for factors A and B set accordingly. Each execution of a pipeline is referred to as an experiment.
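As an illustration only (this is not doepipeline code), a full factorial worksheet like the one above can be enumerated in a few lines of Python:

```python
# Illustrative sketch: enumerate every combination of factor levels
# for a full factorial worksheet (row order may differ from the table).
from itertools import product

factor_levels = {"Factor A": [0, 1], "Factor B": [0, 1]}
worksheet = list(product(*factor_levels.values()))

for number, (a, b) in enumerate(worksheet, start=1):
    print(f"Experiment {number}: Factor A = {a}, Factor B = {b}")
```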
Each iteration of doepipeline will get its own directory within the `working_directory`, and each experiment in an iteration will get its own directory within the iteration's directory. The experiment directories hold all scripts and output data produced by doepipeline.
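For example, after two iterations with four experiments each, the resulting layout might look like this (directory names are illustrative; the exact naming may differ):

```
working_directory/
├── iteration_1/
│   ├── experiment_1/    # scripts and output for this experiment
│   ├── experiment_2/
│   ├── experiment_3/
│   └── experiment_4/
└── iteration_2/
    ├── experiment_1/
    └── ...
```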
Importantly, the first iteration of doepipeline is a screening phase where the applied experimental design is a so-called Generalized Subset Design (GSD) (Surowiec et al., Analytical Chemistry, 2017). The GSD can efficiently span a large search space and allows for the approximation of optimal factor settings, to be used as an initial anchor point in the subsequent optimization phase. Additionally, the GSD can handle qualitative (categorical/multi-level) factors.
So, the screening phase will (1) find an approximate optimum, and (2) fix any qualitative factors at a specific value, ahead of the subsequent optimization phase.
The resolution of the screening phase, and the number of experiments it requires, are controlled by the reduction factor (see below) and by the number of levels at which each factor is investigated. To understand how to set the number of levels for each individual factor, please see the wiki page for the configuration file.
The reduction factor dictates what fraction of the full design the GSD should represent. A full factorial design including K factors, each investigated at L levels, consists of L<sup>K</sup> experiments. The number of experiments in a GSD with reduction factor R can be approximated as L<sup>K</sup>/R.
As an illustration, imagine an experimental design including K=4 factors that are each investigated at L=5 levels. For a full factorial design, the number of experiments would amount to 5<sup>4</sup> = 625. A corresponding GSD with a reduction factor of 5 would then consist of 625/5 = 125 experiments. Note that L<sup>K</sup>/R is an approximation, since other settings of the reduction factor may not divide the full design evenly.
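The arithmetic above is simple enough to sketch directly (plain Python, just restating the formula):

```python
# Approximate GSD size: levels^factors / reduction (L^K / R).
def approx_gsd_size(levels: int, factors: int, reduction: int) -> float:
    return levels ** factors / reduction

print(approx_gsd_size(levels=5, factors=4, reduction=5))  # 625 / 5 = 125.0
```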
The reduction factor can be controlled by specifying `--screening-reduction INT` in the doepipeline command. By default, the reduction factor is set to the number of factors in the investigation.
In the optimization phase, the approximate optimum found in the screening phase is refined further in an iterative manner. A response surface design is created with the approximate optimum as the starting (center) point. The optimization phase continues until no further improvement in the response can be observed, or until the maximum number of iterations has been reached. In the end, the factor settings that produced the best response are reported back to the user.
In each optimization phase iteration, an OLS model is calculated from the response. If the model's predictive power (as measured by Q2) is above a given cutoff, an optimum is predicted from the model. The optimum is validated by executing a pipeline with the predicted optimal factor settings, and the result is treated just like that of any other pipeline executed in the current iteration.
By default the Q2 cutoff is set to 0.5, but it can be changed by specifying `--q2_limit [0-1]`.
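For reference, Q2 is conventionally computed from leave-one-out cross-validation as 1 − PRESS/TSS. The sketch below shows that general calculation; it is not taken from doepipeline's source:

```python
# General Q2 sketch: 1 - PRESS/TSS, with PRESS accumulated from
# leave-one-out cross-validation of an ordinary least squares model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

def q2_score(X: np.ndarray, y: np.ndarray) -> float:
    press = 0.0
    for train, test in LeaveOneOut().split(X):
        model = LinearRegression().fit(X[train], y[train])
        press += ((y[test] - model.predict(X[test])) ** 2).item()
    tss = float(((y - y.mean()) ** 2).sum())
    return 1.0 - press / tss
```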
Model selection can be performed in two ways: (1) brute-force and (2) greedy. The brute-force method simply tries all possible combinations of model parameters, evaluates the predictive power of each, and selects the model whose parameter composition produces the highest Q2. The greedy method instead starts by choosing the best model from only the baseline terms, and then adds higher-order terms until no further improvement in Q2 is observed. You can set the model selection method by specifying `--model_selection_method {brute,greedy}` in the doepipeline command. The default method is brute-force. Recommendation: if you are investigating more than three factors, set the model selection method to greedy, as this will significantly reduce the time spent on selecting the model (see the sketch below).
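The difference between the two strategies can be sketched as follows. The function names and the `score` callback are hypothetical, standing in for "fit an OLS model with these extra terms and return its Q2":

```python
# Hypothetical sketch contrasting the two selection strategies over a
# set of candidate higher-order terms (e.g. interactions, squares).
from itertools import combinations

def brute_force(candidates, score):
    # Score every subset of candidate terms and keep the best: 2^n fits.
    best, best_q2 = (), score(())
    for r in range(1, len(candidates) + 1):
        for subset in combinations(candidates, r):
            q2 = score(subset)
            if q2 > best_q2:
                best, best_q2 = subset, q2
    return best

def greedy(candidates, score):
    # Start from the baseline model, add one term at a time, and stop
    # when no single addition improves Q2: at most ~n^2 fits.
    chosen, best_q2 = [], score(())
    remaining = list(candidates)
    while remaining:
        q2, term = max((score(tuple(chosen) + (t,)), t) for t in remaining)
        if q2 <= best_q2:
            break
        chosen.append(term)
        remaining.remove(term)
        best_q2 = q2
    return tuple(chosen)
```

Brute force fits a model for every one of the 2^n term subsets, while the greedy pass fits at most n models per accepted term, which is why it scales far better as the number of factors (and hence candidate terms) grows.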
The selected model in each iteration is specified in the output of doepipeline.
If the best point in an iteration is close to the edge of the current design space, doepipeline will try to move the design space in that direction for the next iteration. The design space is moved 25% of the factor span (defined as the difference between the factor's min and max values) for each factor whose optimum was close to the edge. The new design space is never allowed to move outside of the globally defined design space (see the configuration file), and is nudged back so that it stays within the global design space.
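The behavior for a single factor can be sketched as follows (the clamping logic is an assumption for illustration; only the 25% step and the global bounds come from the text above):

```python
# Sketch: move a factor's design window 25% of its span toward an
# edge-lying optimum, then nudge it back inside the global limits.
# Assumes the window span does not exceed the global span.
def move_window(low, high, global_low, global_high, toward_high):
    span = high - low
    step = 0.25 * span if toward_high else -0.25 * span
    low, high = low + step, high + step
    if low < global_low:                      # nudge back from below
        low, high = global_low, global_low + span
    if high > global_high:                    # nudge back from above
        low, high = global_high - span, global_high
    return low, high
```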
The design space can be shrunk between optimization phase iterations. This can be useful to reduce the risk of doepipeline getting stuck moving back and forth between iterations, and to make later iterations more fine-grained. The shrinkage can be controlled by specifying `--shrinkage [0.9-1.0]` in the doepipeline command. For example, to shrink the design space by 5%, set the shrinkage to 0.95. Recommendation: we have found that setting the shrinkage to 0.9 (10% shrinkage) is sufficient.
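Geometrically, 10% shrinkage keeps 90% of each factor's span. A minimal sketch, assuming the window is shrunk symmetrically around its center:

```python
# Sketch: shrink a factor's window around its center; shrinkage=0.9
# keeps 90% of the span (i.e. 10% shrinkage).
def shrink_window(low, high, shrinkage):
    center = (low + high) / 2
    half_span = (high - low) / 2 * shrinkage
    return center - half_span, center + half_span

print(shrink_window(0.0, 10.0, 0.9))  # (0.5, 9.5): span 10 -> 9
```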
You can use more than one response to quantify the outcome of your pipeline. The response variables are then summarized into a single response variable that guides doepipeline. When using multiple responses you will need to set accepted limits for each response (minimum or maximum, and target). These limits are used to rescale each observed response value to a value between 0 and 1: a value outside the accepted limits is rescaled to 0, while a value better than the target is rescaled to 1. The rescaled response values are then combined into the overall desirability using the geometric mean.
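The rescaling-and-combination step can be sketched like this. The linear rescaling between limit and target is an assumption for illustration (see the configuration file wiki for how responses are actually specified); the geometric mean is as described above:

```python
# Sketch of overall desirability: rescale each response to [0, 1]
# between its accepted limit and its target, then combine with the
# geometric mean. Shown for maximised responses; a minimised response
# would flip the direction of the rescaling.
import math

def rescale(value, limit, target):
    if value <= limit:
        return 0.0   # outside the accepted limit
    if value >= target:
        return 1.0   # better than target
    return (value - limit) / (target - limit)

def overall_desirability(values, limits, targets):
    d = [rescale(v, lo, t) for v, lo, t in zip(values, limits, targets)]
    return math.prod(d) ** (1.0 / len(d))
```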
More on how to specify multiple responses can be found in the wiki for the configuration file.
If for some reason the execution of doepipeline is terminated prematurely, there is functionality to recover a doepipeline optimization. Simply specify `--recover` in the doepipeline command to activate it. Be sure to use the exact same YAML configuration file and the same doepipeline command line parameter settings in the recovery run as in the original run.