The primary change in V2 of the tool is simplification: all external processes are integrated so that a single command can execute all steps.
The workflow consists of the following steps (a command-line sketch follows the list):

- Verify the PIO EventServer is running. `pio status` will do this.
- Export data. We assume the data exists in the EventServer or in HDFS. Export the data to the location specified in the `config.json` file.
- Split the data using the `map_test split ...` directive. It is often desirable to use an existing data split, in which case the split step can be omitted.
- Train and deploy. Three PIO workflow steps are needed, and they must be executed inside the directory of the UR version being tested:
  - `pio build` builds the Universal Recommender code and registers the algorithm parameters.
  - `pio train` creates a model with the UR from the `engine.json` parameters. Some parameters in `engine.json` are passed to Spark during training and are system- and data-dependent, so make sure training completes correctly before moving on. These tools are usually used after the bootstrap dataset has been successfully trained, so the split is made on that dataset.
  - `pio deploy` starts a running PIO PredictionServer that responds to UR queries based on the training split of the dataset.
- Analyze the model using `map_test test ...`.
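For orientation, here is a minimal sketch of the end-to-end sequence. It assumes the UR engine directory is the current working directory, that `config.json` and `engine.json` are already set up, and that the Spark paths are configured as described under Requirements; the paths are placeholders, and the exact options for each command are documented further down in this README.

```bash
# Sketch of the full workflow; paths below are placeholders.
pio status                               # verify the EventServer is running
pio export --output /path/to/events      # export data (you may also need --appid <id> for your app)

./map_test.py split                      # split into train/test per config.json
                                         # (skip if reusing an existing split)
pio build                                # build the UR and register the algorithm parameters
pio train                                # train a model on the training split
pio deploy                               # serve the model (run in a separate shell or background it)

./map_test.py test --all                 # run the MAP@k analysis
```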
## Requirements
- PredictionIO v0.11+
- UR v0.7.3+
- Python 3+
- Spark, as recommended by your versions of PIO and UR
- Install Spark, PredictionIO v0.11.0 or greater, and the Universal Recommender v0.7.3 or greater. Make sure `pio status` completes with no errors and the integration test for the UR runs correctly.
- Install Python and check the version with `python3 --version`. If the version is less than 3.x, upgrade to the most recent stable version of python3 using your system's package management tools, such as `apt-get` for Ubuntu Linux or `brew` for macOS. This tool has been tested minimally with python3 and does require it. Leave an issue if you find a problem.
- Install the required Python libraries using the Python package manager. Note that if you have python3 you may already have pip3 (check with `which pip3`); if not, install pip for python3 first.

  ```bash
  sudo pip3 install numpy scipy pandas ml_metrics predictionio tqdm click openpyxl
  ```

- Set up the Spark and PySpark paths in `.bashrc` (Linux) or `.bash_profile` (macOS); a quick verification sketch follows this list:

  ```bash
  export SPARK_HOME=/path/to/spark
  export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/build/:$PYTHONPATH
  ```
The analysis script should be run from the UR (Universal Recommender) folder. It uses two configuration files:

- `engine.json` (configuration of the UR; this file is used to get the event list and the primary event)
- `config.json` (all other configuration, including the engine.json path if necessary)

`config.json` has the following structure:
```json
{
  "engine_config": "./engine.json",
  "splitting": {
    "version": "1",
    "source_file": "hdfs:...<PUT SOME PATH>...",
    "train_file": "hdfs:...<PUT SOME PATH>...train",
    "test_file": "hdfs:...<PUT SOME PATH>...test",
    "type": "date",
    "train_ratio": 0.8,
    "random_seed": 29750,
    "split_event": "<SOME NAME>"
  },
  "reporting": {
    "file": "./report.xlsx"
  },
  "testing": {
    "map_k": 10,
    "non_zero_users_file": "./non_zero_users.dat",
    "consider_non_zero_scores_only": true,
    "custom_combos": {
      "event_groups": [["ev2", "ev3"], ["ev6", "ev8", "ev9"]]
    }
  },
  "spark": {
    "master": "spark://<some-url>:7077"
  }
}
```
- `engine_config` - file to be used as engine.json (see the configuration of the UR)
- `splitting` - this section controls splitting the data into train and test sets
  - `version` - version string appended to the train and test file names (helpful if different tests with different split configurations are run)
  - `source_file` - file with the data to be split
  - `train_file` - file with training data to be produced (note that the version will be appended to the file name)
  - `test_file` - file with test data to be produced (note that the version will be appended to the file name)
  - `type` - split type, either `date` (eventTime is used to make the split) or `random`
  - `train_ratio` - float in (0..1), the share of training samples
  - `random_seed` - seed for the random split
  - `split_event` - when type = "date", this is the event used to find the split date: all events with this name are ordered by eventTime, and the time that divides them into the first train_ratio and the last (1 - train_ratio) portions is used to split the rest of the data (events with all names) into the training and test sets
- `reporting` - reporting settings
  - `file` - excel file to write the report to
  - `csv_dir` - directory name for csv reporting
  - `use_uuid` - append a unique UUID associated with the script run to the csv file names (useful for managing different results and reports)
- `testing` - this section configures the different tests and how to perform them
  - `map_k` - maximum k for MAP@k to be reported
  - `non_zero_users_file` - file in which to save the users with scores != 0 after the first run of the test set with the primary event; this set may be much smaller than the initial set of users, so saving and reusing it can save a lot of time
  - `consider_non_zero_scores_only` (default: true) - whether to take into account only users with non-zero scores (i.e. users for whom recommendations exist)
  - `custom_combos`
    - `event_groups` - groups of events to be additionally tested if necessary
- `spark` - Apache Spark configuration
  - `master` - for now only the master URL is configurable
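As an illustration of the alternative settings described above, the following writes a variant `config.json` for a random 80/20 split with CSV reporting (a sketch; the paths, the `./reports` directory, and the Spark master URL are placeholders, and it assumes `use_uuid` is a boolean flag):

```bash
# Sketch: a config.json variant using a random split and CSV reporting.
# Back up any existing config.json first; all paths below are placeholders.
cat > config.json <<'EOF'
{
  "engine_config": "./engine.json",
  "splitting": {
    "version": "1",
    "source_file": "hdfs:///path/to/events",
    "train_file": "hdfs:///path/to/events-train",
    "test_file": "hdfs:///path/to/events-test",
    "type": "random",
    "train_ratio": 0.8,
    "random_seed": 29750
  },
  "reporting": {
    "csv_dir": "./reports",
    "use_uuid": true
  },
  "testing": {
    "map_k": 10,
    "non_zero_users_file": "./non_zero_users.dat",
    "consider_non_zero_scores_only": true
  },
  "spark": {
    "master": "spark://<some-url>:7077"
  }
}
EOF
```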
Get data from the EventServer with:

```bash
pio export --output path/to/store/events
```
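For example (a sketch; the app id and the output path are placeholders, and depending on your PIO setup you may also need to tell `pio export` which app's events to export):

```bash
# Sketch: export the events of PIO app id 1 to HDFS; both values are placeholders.
pio export --appid 1 --output hdfs:///path/to/exported/events
```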
Use this command to split the data into "train" and "test" sets:

```bash
SPARK_HOME=/usr/local/spark PYTHONPATH=/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.9-src.zip ./map_test.py split
```
Additional options are available (an example follows this list):

- `--csv_report` - write the report to a csv file instead of an excel file
- `--intersections` - calculate train/test event intersection data (advanced)
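For example, to produce a CSV report and the train/test intersection data in one run (a sketch; the Spark and py4j paths must match your installation):

```bash
# Sketch: split with a CSV report and intersection data.
SPARK_HOME=/usr/local/spark \
PYTHONPATH=/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.9-src.zip \
./map_test.py split --csv_report --intersections
```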
The above command will create a test and training split in the location specified in config.json. Now you must import, set up engine.json, and train and deploy the "train" model (sketched below) so that the rest of the MAP@k tests can query the model.
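A sketch of that sequence, run from the UR engine directory (the app id and the input path are placeholders; use the train file produced by the split, which has the configured version appended to its name):

```bash
# Sketch: import the training split, then build, train, and deploy the UR model.
# The app id and the input path are placeholders.
pio import --appid 1 --input /path/to/train-split-file
pio build
pio train
pio deploy    # run in a separate shell or background it; it serves the queries
```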
To run the tests:

```bash
SPARK_HOME=/usr/local/spark PYTHONPATH=/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.9-src.zip ./map_test.py test --all
```
Additional options are available and may be used to run only a subset of the tests (an example follows this list):

- `--csv_report` - write the report to a csv file instead of an excel file
- `--dummy_test` - run the dummy test
- `--separate_test` - run a test for each separate event
- `--all_but_test` - run a test with all events plus tests with all but each single event
- `--primary_pairs_test` - run tests with all pairs of events that include the primary event
- `--custom_combos_test` - run the custom combo tests configured in config.json
- `--non_zero_users_from_file` - use the list of users from the file prepared on a previous script run to save time
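For example, to run only the separate-event and custom-combo tests while reusing a previously saved non-zero users list (a sketch; it assumes these flags can be combined, and the Spark and py4j paths must match your installation):

```bash
# Sketch: run a subset of the tests, reusing the saved non-zero users list.
SPARK_HOME=/usr/local/spark \
PYTHONPATH=/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.9-src.zip \
./map_test.py test --separate_test --custom_combos_test --non_zero_users_from_file
```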
## Todo

This is the old, not recommended approach of running an IPython notebook:

```bash
IPYTHON_OPTS="notebook" /usr/local/spark/bin/pyspark --master spark://spark-url:7077
```