layout | title | subtitle | date | author | author-id | background |
---|---|---|---|---|---|---|
post | How to improve your Python project (or at least try to do it) | Python Tutorial for Cookiecutter and Sacred | 2019-08-21 | Giorgia Cantisani | giorgia | /img/blog_images/giorgia/sfondo.jpg |
What I mostly learned during this first year of my Ph.D. is that in Data Science, how we organize and structure our projects is fundamental for doing good and reproducible science. What I also learned is that time is usually very limited: to come up with something meaningful by the next deadline, you need to be quick and efficient.
Coding with Python can be done in many different ways, but the quick-and-dirty way may only be good for small-scale experiments or proofs of concept. For long-term projects, it may turn into chaotic folders you are scared to look into, even after the weekend.
In this post I will explain how to quickly install and use some tools which will help you organize your Data Science projects without needing a degree in Computer Science.
The ideal workflow you should always have in mind is:
- Organize it
- Keep track of your changes
- Make it reproducible and system independent
- Keep track of the experiments.
Cookiecutter is a great tool that answers the following questions:
- how should I structure my project?
- how did I structure my project one year ago? How will I tomorrow?
- how should I structure my project if I want to release code that can be pip-installed?
- how to import my functions easily in my future projects? and in notebooks?
It is a command-line utility that creates projects from project templates (e.g. creating a Python package project from a Python package project template). In practice, it deploys a folder structure that helps you organize yourself and control your data sources.
Features:
- Cross-platform: Windows, Mac, and Linux are officially supported
- Works with Python 2.7, 3.4, 3.5, 3.6
- 100% of templating is done with Jinja2. This includes file and directory names.
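As a side note, cookiecutter can also be driven from Python instead of the command line. Below is a minimal sketch (not needed for the rest of this tutorial), where `extra_context` pre-fills the answers you would otherwise type at the prompt:

```python
from cookiecutter.main import cookiecutter

# render the data-science template without interactive prompts
cookiecutter(
    'https://github.com/drivendata/cookiecutter-data-science',
    no_input=True,
    extra_context={'project_name': 'project_name'},
)
```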
But let's have a look at how it works, step by step.
- Install cookiecutter in your system/user Python profile (not in a virtual environment).

```bash
$ pip install --user cookiecutter
```
- Navigate the file system to your code folder (e.g. `path/to/repos_folder`). This is the parent folder of your code. NB: cookiecutter will create a new folder `project_name` with everything inside. Your actual code will be in a subfolder, i.e. `path/to/repos_folder/project_name/src/`. Then run `cookiecutter` with the link to the data-science template and answer the questions it will ask you:

```bash
$ cd Documents/xxx/xxx/Code/
$ cookiecutter https://github.com/drivendata/cookiecutter-data-science
# answer the questions, using project name: project_name
# it will then ask you about the author's name, the license,
# the python interpreter, and a brief description of the project.
# once it is finished, cd into the folder project_name
$ cd project_name
```
From now on, our current directory will be `path/to/project_name/` unless specified. Now your project organization will look like this:

```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
```

This project organization allows us to:
- keep the raw data in an immutable folder, so we no longer accidentally save over it
- avoid files like dataV1.csv or finalV1.csv, because everything has its own place
- write all our functions in `./src/` (with the structure we want) and install `src` as an editable package
- import the functions we want in notebooks or from other projects with a nice and clean (hence understandable) structure:
```python
from src.data.make_dataset import bellaciao
from src.models.predict_model import *
```
If you don't like this template, don't worry: there are many others, just google them (e.g. data-science, reproducible-science). You can also create your own template (not really immediate though, as it requires Jinja2).
Check out this research project, which successfully applied the cookiecutter philosophy: SEMIC: an efficient surface energy and mass balance model applied to the Greenland ice sheet
- Set up a virtual environment, named `venv`, specifying the Python version. This command will create a folder named `venv` containing, among other things, a local copy of all the packages you will pip-install from now on.

```bash
$ virtualenv venv -p python3
```

If you use conda:

```bash
$ conda create -n venv python=3
```
- Edit the `.gitignore` by adding the virtualenv's folder with your favorite text editor, or just run the following command:

```bash
$ echo venv >> .gitignore
```
- Set up git and link it to a new GitHub repository:
  - On github.com create an empty repository (that means no README and no license; if you do so, it will display useful commands).
  - Start git locally and sync it with the following commands:

```bash
$ git init
# check we are not 'saving' weird files
$ git status
# if everything looks fine, commit
$ git add .
$ git commit -m "first commit"
# if using GitHub
$ git remote add origin https://github.com/USER/project_name.git
$ git push -u origin master
# avoid writing login and password every time in the future
$ git config credential.helper store
```

If you are on Windows, you can use GitHub Desktop instead.
- Activate the virtualenv:

```bash
[user@localhost] project_name/ $ source venv/bin/activate
# check that it is activated: you should have (venv) at the beginning of your command line
(venv) [user@localhost] project_name/ $
```

If you use conda, you don't need to be in the project folder:

```bash
$ conda activate venv
(venv) $
```
- Install the basic dependencies of cookiecutter (if you want). Notice that doing so will also install the `src` package by default. Then install your everyday-coding-favorite-life packages: numpy, matplotlib, jupyter.

```bash
(venv) $ pip install -r requirements.txt
(venv) $ pip install numpy matplotlib jupyter
```

You can also install the `src` package as editable:

```bash
$ pip install -e .
```
- Freeze the requirements (`>` overwrites, `>>` appends):

```bash
(venv) $ pip freeze >> requirements.txt
```
- Install the package needed for a toy example:

```bash
(venv) $ pip install sklearn
```
- In `src/` create the `main.py` file and paste the following code:

```python
from numpy.random import permutation
from sklearn import svm, datasets

C = 1.0
gamma = 0.7

iris = datasets.load_iris()
perm = permutation(iris.target.size)
iris.data = iris.data[perm]
iris.target = iris.target[perm]

model = svm.SVC(C=C, kernel='rbf', gamma=gamma)
model.fit(iris.data[:90], iris.target[:90])
print(model.score(iris.data[90:], iris.target[90:]))
```
- Commit the changes:

```bash
$ git add .
$ git commit -m 'toy svm'
```
- Edit the `models/train_model.py` and `models/predict_model.py` files. In both files (actually Python modules) create a new function. In `./src/models/train_model.py`:

```python
from sklearn import svm


def train(data, target, C, gamma):
    clf = svm.SVC(C=C, kernel='rbf', gamma=gamma)
    clf.fit(data, target)
    return clf
```

In `./src/models/predict_model.py`:

```python
def predict(clf, data, target):
    return clf.score(data, target)
```
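A nice side effect of moving the logic into importable functions is that they can be tested in isolation. Here is a purely optional sketch (pytest is not installed by the template and the test file name is hypothetical), assuming `src` has been pip-installed as editable as shown earlier:

```python
# tests/test_models.py (hypothetical file)
from sklearn import datasets

from src.models.predict_model import predict
from src.models.train_model import train


def test_train_and_predict():
    iris = datasets.load_iris()
    clf = train(iris.data[:90], iris.target[:90], C=1.0, gamma=0.7)
    score = predict(clf, iris.data[90:], iris.target[90:])
    # the score is an accuracy, so it must lie between 0 and 1
    assert 0.0 <= score <= 1.0
```

You would then run it with `pytest` from the project root.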
- Update the main file with the following imports. In `src/main.py` add:

```python
from models.predict_model import predict
from models.train_model import train
```

Now the main code should look like:

```python
# std imports
from numpy.random import permutation
from sklearn import datasets

# my imports
from models.predict_model import predict
from models.train_model import train

C = 1.0
gamma = 0.7

iris = datasets.load_iris()
perm = permutation(iris.target.size)
iris.data = iris.data[perm]
iris.target = iris.target[perm]

model = train(iris.data[:90], iris.target[:90], C, gamma)
score = predict(model, iris.data[90:], iris.target[90:])
print(score)
```
- Run and debug:

```bash
(venv) $ python src/main.py
```
- Pip-install Sacred for tracking experiments:

```bash
(venv) $ pip install sacred pymongo
```
- Create a new configuration function for the parameters C and gamma, and add the decorators for Sacred:

```python
# ...keep the previous imports and add the following ones
from sacred import Experiment

ex = Experiment('iris_svm')  # name of the experiment


@ex.config
def cfg():
    C = 1.0
    gamma = 0.7


@ex.automain
def run(C, gamma):
    # ...
    # ... paste here the body of the main
    # ...
    return score
```
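One nice consequence is that the configuration can now be overridden without touching the code, e.g. from the command line with Sacred's `with` syntax (`python src/main.py with C=10.0 gamma=0.1`). The experiment can also be launched programmatically; here is a minimal sketch, assuming it is executed from inside `src/` so that `main.py` and its imports resolve (the helper file name is hypothetical):

```python
# src/run_experiment.py (hypothetical helper)
from main import ex  # importing does not start the run: @ex.automain only fires when main.py is executed directly

# config_updates overrides the values defined in cfg()
finished_run = ex.run(config_updates={'C': 10.0, 'gamma': 0.1})
print(finished_run.result)  # the value returned by run(), i.e. the score
```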
- Run it from the project's root directory:

```bash
(venv) $ python src/main.py
```
- Install mongodb on your system. In a new terminal:

```bash
$ sudo dnf install mongodb mongodb-server mongoose
# start the service
$ sudo service mongod start
# verify it is working: this will start the mongo-db shell
$ mongo
```
- Run and re-run the code as many times as you want with the database flag:

```bash
(venv) $ python src/main.py -m MY_IRIS_EXP
```

Notice how the ID value increases at each run. We now also have an "observer" attached to our Sacred experiment.
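If you prefer not to pass the flag every time, the same observer can be attached in the code; a small sketch, using the database name chosen above, to add right after creating the experiment in `src/main.py`:

```python
from sacred.observers import MongoObserver

# equivalent to the -m MY_IRIS_EXP command-line flag
ex.observers.append(MongoObserver.create(url='localhost:27017', db_name='MY_IRIS_EXP'))
```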
- In a mongo shell (just run `mongo` in the command line) check that the MY_IRIS_EXP database exists:

```bash
$ mongo
# then, in the mongo shell
> show dbs
# look for the MY_IRIS_EXP entry
```
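The same check can be done from Python with pymongo (installed together with sacred above); a quick, optional sketch, assuming Sacred's default collection name `runs`:

```python
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['MY_IRIS_EXP']

# Sacred stores one document per run in the 'runs' collection
for run in db.runs.find({}, {'_id': 1, 'config': 1, 'result': 1}):
    print(run['_id'], run['config'], run.get('result'))
```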
- Download and install Omniboard, the sacred+mongo frontend. N.B. npm is a Javascript package manager; you will probably need to install it (https://docs.npmjs.com/downloading-and-installing-node-js-and-npm).

```bash
# in a new terminal
$ sudo npm install -g omniboard
```
- In the same shell run the server listener:

```bash
$ omniboard -m localhost:27017:MY_IRIS_EXP
```
- Go to http://localhost:9000 to access the Omniboard frontend.
- Play with it!
- Add a metric. In the `main.py` file add:

```python
@ex.automain
def run(C, gamma):
    ...  # the code from before
    ex.log_scalar("val.score", score)
    return score
```
- And what about a typical loss function in a for loop? For instance, add the following lines. We need to pass the object `_run` to the `run()` function:

```python
import numpy as np


@ex.automain
def run(_run, C, gamma):
    ...  # the code from before
    my_loss = 0
    for i in range(20):
        # explicit step counter (0, 1, 2, 3, ...) incremented with each call
        _run.log_scalar("training.loss", my_loss, i)
        my_loss += 1.5 * i + np.random.random()
    return score
```
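Sacred can also store files produced by a run alongside the metrics. Below is a minimal, optional sketch, assuming joblib is available (it is installed as a scikit-learn dependency) and that the trained model from the code above is in the variable `model`:

```python
import joblib


@ex.automain
def run(_run, C, gamma):
    ...  # the code from before
    joblib.dump(model, 'models/svm.joblib')  # serialize the trained model into models/
    ex.add_artifact('models/svm.joblib')     # attach the file to this run in the database
    return score
```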
- Run some experiments.
- Play with them in Omniboard!