Collection of tools meant to enable faster progress and reproducibility of results in ML/DL projects and competitions, without assuming too much about the projects themselves and used tools.
Meant to support a process where we have:
- a dataset, possibly pre-processed to some degree, that can be deterministically loaded (multiple versions of data? loading code should take an argument specifying which version to load)
- an objective - we only consider supervised ML problems here, and we need some quality metric (be it accuracy or AUROC of classification, RMSE of regression, or something else). The metric can be changed, multiple metrics also supported.
A common, useful approach is to then to quickly develop a simple version of the full pipeline:
- loading the data
- pre-processing, data augmentation, etc
- the model logic (if not a ready-made solution - perhaps a custom neural architecture?)
- model training procedure
- scoring the model on the validation set
Then the real work begins - multiple iterations of:
- developing and testing new versions of parts of the pipeline - adding dimensions to the hyper-parameter space (e.g. different neural architecture, different kind of model altogether, different pre-trained network, different data pre-processing logic),
- optimizing hyper-parameters,
- inspecting how/when the model fails, and how it works - to guide further development,
- (in some cases) periodically pushing new versions to production.
This package is meant to provide common utilities supporting workflows like those, regardless of the frameworks/libraries in use or the structure of the problem to solve. It was born from common parts of multiple projects, often having a "weird" structure - e.g.:
- fitting multiple models with identical hyper-parameters to time-series from corresponding machines,
- periodic re-training of the model (also for time-series)
- testing the same approach on multiple subsets of the training set
The strength of the package lies in its modularity, minimal assumptions about what you need to do and how you need to do it. It is meant to get out of your way - your code should still work without it, ran the way you need - be it CLI, a notebook, etc.
It should be possible to:
- store and analyze results of all past experiments,
- reproduce past results (-> store results together with hyper-parameters, data, versions of code, ...),
- easily re-run past experiments with some hyper-parameters changed,
- develop or inspect parts of the pipeline in a notebook, easily run all of it from there (TODO: instructions),
- extend the code's functionality and the hyper-parameter space; new hyper-params should always be added with default values matching previous behaviour. Replacing pieces of logic with different ones, with different hyper-parameters should be handled as well.
- queue runs from a notebook (e.g. generate a grid of HPs in some dimensions) and have them executed by worker processes (possibly distributed among multiple machines). Then close the notebook, or add some more runs, never again waiting for computations to finish.
- install core parts of the project's package and use it in production code (if the code got ugly with too many options, write a new version of it, and compare results with the old version, all in the same environment)
- use automatic hyper-parameter tuning algorithms, informed by all past experiments during development.
Run
- single execution of aTask
with a specific config (set of parameters), concluded by calculation of a quality metric.
Example: training a classifier with the hyper-parameters provided in the config, calculating accuracy on the validation set.
Each run gets its entry in a selected MongoDB database, in theruns
collection. It includes all parameters, code versions, result metrics, captured output, and possibly custom diagnostic information.Task
- fully specified problem to solve, and then compare different solutions to.
Consists of a chosenScenario
class and parameter values for its constructor. Each task should be specified as a .json file and include keys:Scenario
(= dictionary of its settings, starting withclassName
) and optionallyseed
.Scenario
- template of aTask
without its parameters. When parameterized byTask
parameters and aRun
config it can be executed, and should return a quality metric. EachScenario
is a class, inheriting fromscenario_base.Scenario
. Its.single_run()
method typically contains dataset construction, model construction, model training and evaluation.Study
- a project in which we define one or moreScenarios
to study solutions for, with a shared codebase, database, and virtual environment. Corresponds to a Python package, which has a dependency onhyperspace_explorer
and includes ascenarios
module.
Install with pip, as a normal python package.
Other than all the modules being made available for import, it will also make hyperspace_worker.py
available in your system's PATH (or a specific virtual environment).
To use most of this package's functions a running instance of MongoDB will be needed.
TODO
As we add new parameters to our code (our classes), we must provide default values that ensure behaviour consistent with the earlier versions of the code.
To achieve that, default values (they do not have to be optimal/recommended
ones!) live inside the classes. They must not be set in function - instead,
they should be provided by the .get_default_values()
method of each
Configurable
. If using dataclasses, that method is provided automatically.
When a worker process starts to process a Run, it fills in all default values
- they end up stored in the database. It is redundant, but makes result analysis much easier - they do not have to be aware of the Study-specific codebase.
If expanding a Scenario to fit more Tasks, and adding parameters to it, similarly default values should be provided for all of them.
A tool for back-filling new default values to past runs (while keeping original config in a different field as backup?) would be useful - future development.
Run a command hyperspace_worker.py [path to tasks dir] [mongo db name]
.
Important: it has to be ran from a directory containing a scenarios.py
module, which defines
experiment scenarios allowed to be ran within the given project. The hyperspace_worker.py
file
should not be present in the folder.
Arguments:
path to tasks dir
- path to a directory containing.json
files, describing each allowedtask
- a parameterization of ascenario
mongo db name
- name of the mongoDB database to store results in- optional params: mongoDB URI (if not localhost, or if password is required), interval to query for new tasks
This project (ab)uses Sacred to collect and store information about each run.
One of the benefits: we can use many ready-made dashboards for Sacred, e.g. Omniboard - highly recommended, works out of the box, many impressive features.
TODO
Start workers on 1 or more nodes, set them up to use the same database (which also serves as a task queue). Workers are specific to one Study (project) - will only process tasks for the Study they were started for.
Example code, usually ran from a notebook, to submit one task. From here, it is easy to e.g. submit a grid of hyper-parameters for the workers to test.
from hyperspace_explorer.queue import RunQueue
from pathlib import Path
tasks_dir = Path.cwd().resolve().parent / 'tasks' # just an example - relative to the notebook
db_name = 'ulmfit_attention'
mongo_uri = 'localhost:27017'
q = RunQueue(mongo_uri, db_name, tasks_dir)
task_name = 'imdb_1k_sample_single'
conf = {
'aggregation': { # different additional parameters are available depending on `className`
'className': 'BranchingAttentionAggregation',
'agg_layers': [50, 10]
},
'classifier': { # this dict is passed to a specific function within a scenario, but polymorphism is not needed
'lin_ftrs': [],
'drop_mult': 0.5,
},
'training_schedule': { # even if we do not want to change any default parameters, className is required
'className': 'DefaultSchedule',
}
}
q.submit(task_name, conf)
The code above works with the project: https://github.com/tpietruszka/ulmfit_attention.
In this case workers should be ran from within the inner ulmfit_attention
directory.
An example notebook
meant for development of a new class fitting within an existing project can be found in the ulmfit_attention
project.
While somewhat involved in that particular project, it demonstrates how to use hyperspace_explorer
's infrastructure
to quickly create an experiment within a notebook and run it.
TODO