The `examples` directory contains various examples of input, config, and queries for the built-in Engines. They are often in the form of an integration test but should not be relied upon as such unless you know what you are doing.
For instance the UR integration tests will fail if you are not using a specific version of Elasticsearch. This is purely an artifact of the way the test works and does not indicate a problem.
So treat "tests" as examples, not as something that must pass before use of the Engine.
The UR tests are run from the harness-cli container, or from wherever the CLI is installed. Each test targets a certain style of deployment. People usually start with the "localhost" version, which runs on a dev machine or wherever the CLI and Harness Server are installed. It assumes `localhost:9090` is the address of the Harness Server and that the Spark Master is local. The Spark settings are in the UR Engine's JSON config, which has a `sparkConf` section that sets `"master": "local"`.
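As a quick sanity check of these assumptions, something like the following can be run from the machine that hosts the CLI. This is only a sketch; it assumes Harness answers plain HTTP on port 9090.

```bash
# Verify the CLI can reach the Harness Server
harness-cli status

# Check that something is listening on the assumed Harness address
# (assumes plain HTTP on localhost:9090)
curl -i http://localhost:9090
```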
The flow of the test does the following (a command-line sketch follows the list):

- Delete any `test_ur` Engine Instance using `harness-cli delete test_ur`. If there is no Engine Instance by that name there will be a warning; ignore it.
- Add the `test_ur` with `harness-cli add ...`.
- Input an extremely small sample dataset using the Harness Python SDK and the Harness REST API.
- Trigger training of the `test_ur`.
- Sleep for some time to wait for the training, which executes a Spark Job. If the delay is too short, it should be lengthened. The Spark Job does the following:
    - read all data by connecting to Mongo
    - calculate the UR model from the sample data in Mongo
    - write the model to Elasticsearch, which the Engine's JSON points to. See the `spark.es.nodes` value for the Elasticsearch node address.
- Once the delay is over, run queries by using `curl` in a script file. A potential problem is that the delay may be too short for the Job to complete.
- Collect results into an `actual...` file.
- Compare `actual...` to the `examples/ur/expected...` file and fail if there are differences, which will be printed to the console.
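For orientation, the steps above roughly correspond to a command sequence like the one below. This is a hedged sketch, not the real script: the engine config path, the `train` subcommand, the sleep length, and the query/result file names are assumptions, and the actual test sends data through the Harness Python SDK rather than a single command.

```bash
#!/usr/bin/env bash
# Rough sketch of what the UR integration test automates (some paths and
# subcommands are assumptions; see the note above).

harness-cli delete test_ur                # warns if test_ur does not exist; ignore that
harness-cli add examples/ur/test_ur.json  # hypothetical path to the engine's JSON config

# The real script inputs the small sample dataset here via the Harness Python SDK
# and the Harness REST API.

harness-cli train test_ur                 # assumed subcommand that triggers the Spark Job
sleep 120                                 # wait for training; lengthen if the Job needs more time

./queries.sh > actual.txt                 # hypothetical: run the curl queries, collect results
diff actual.txt expected.txt              # hypothetical file names; non-zero exit means differences
```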
This will exercise all parts of the workflow for the UR. If there are any problems, examine the contents of the DB, ES, and the logs to discover the phase of the workflow where the problem occurred.
The simplest example test is in `harness-cli/examples/ur/simple-integration-test.sh`.
The script defaults to assuming all services are running on the localhost except Spark, which runs inside the Harness process and so does not need to be installed.
This works well when using a debugger to change code in Engines or Harness.
Run the script in the all-in-one OS level installation with:
./examples/ur/simple-integration-test.sh
When using a different deployment style, like Kubernetes, docker-compose containers, etc., use the following general rules.

By nature you will run harness-cli remotely, connected to the Harness server. You will have Harness mapped or "port forwarded" to `localhost:9090`. This fits the assumptions in `simple-integration-test.sh`. Run the test:
./examples/ur/simple-integration-test.sh k8s
From the Harness server's point of view, the Kubernetes addresses for services will have to be used in the UR Engine's config, particularly the `sparkConf` section. With the `k8s` flag the script will use the following engine config for Spark:
"sparkConf": {
"es.index.auto.create": "true",
"es.nodes": "elasticsearch-client",
"es.nodes.wan.only": "true",
"master": "spark://spark-api:7077",
"spark.driver.memory": "3g",
"spark.es.index.auto.create": "true",
"spark.es.nodes": "elasticsearch-client",
"spark.es.nodes.wan.only": "true",
"spark.executor.memory": "3g",
"spark.kryo.referenceTracking": "false",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryoserializer.buffer": "300m",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer"
},
This should work for an ActionML k8s installation. Change the Elasticsearch and Spark `master` addresses to use the k8s service addresses if they are different.
This deployment assumes you use the ActionML docker-compose.yml. Start with the instructions here. Log in to the `harness-cli` container to run the test. Using the ActionML docker-compose.yml and all the containers it specifies, use the following `sparkConf`:
"sparkConf": {
"es.index.auto.create": "true",
"es.nodes": "elasticsearch",
"es.nodes.wan.only": "true",
"master": "local",
"spark.driver.memory": "3g",
"spark.es.index.auto.create": "true",
"spark.es.nodes": "elasticsearch",
"spark.es.nodes.wan.only": "true",
"spark.executor.memory": "3g",
"spark.kryo.referenceTracking": "false",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryoserializer.buffer": "300m",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer"
},
The docker-compose.yml configures the CLI and Harness properly, so nothing changes in the test script except the above `sparkConf`.
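For example, assuming the compose service is named harness-cli as described above and has a shell available, logging in and running the test might look like this (the path inside the container may differ):

```bash
# Log in to the CLI container (service name taken from the ActionML docker-compose.yml)
docker-compose exec harness-cli bash

# Then, inside the container, run the test as in the localhost case
./examples/ur/simple-integration-test.sh
```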
- Harness Log ERROR: most ERRORs are reported in the Harness logs. Check here for clues first (a few example check commands follow this list). Other clues depend on the phase of the workflow where they occurred.
- Phase 1-2: errors here may be connection related. Check that the CLI finds Harness by running `harness-cli status`.
- Phase 3: if there is trouble sending data, run `harness-cli status`, since the test uses the Python SDK just as the CLI does. To verify data has been input, look in the DB, in the `test_ur` database's `events` collection, using the Mongo shell.
- Phase 4-5: Spark Job problems will usually show up in the Harness logs. To look deeper, examine the contents of Elasticsearch by using curl to see indexes. The most recent `test_ur_...` index should have documents for each item in the sample data. These will be named `iPad Pro`, `Galaxy`, etc. If no index has data, the Spark Job was unable to write the model to Elasticsearch. This may be an addressing issue, so check the `spark.es.nodes` value in the UR Engine's JSON to make sure it is correct.
- Phase 6: uses `curl` to send queries. These are recorded in an `actual...` file to be compared to `expected...`. If the items returned in `actual` are the same as in `expected` AND the ranking is identical, the results may be correct and only differ by score. If the scores are different, this may mean `expected` was created with a different version of Elasticsearch. This may or may not be a problem.
- Other Logs: failures may also leave clues in the logs of MongoDB, Elasticsearch, and Spark.
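The checks referenced in the phases above can be done with commands along these lines. This is a sketch under assumptions: MongoDB and Elasticsearch reachable on their default ports (27017 and 9200) on localhost, and the legacy `mongo` shell installed; swap in your deployment's service addresses as needed.

```bash
# Phase 1-3: confirm the CLI can reach Harness
harness-cli status

# Phase 3: confirm events landed in Mongo (assumes events for the engine are
# stored in a test_ur database's events collection)
mongo test_ur --eval 'db.events.count()'

# Phase 4-5: confirm the model was written to Elasticsearch
curl 'http://localhost:9200/_cat/indices?v'                   # look for the most recent test_ur_... index
curl 'http://localhost:9200/test_ur_*/_search?pretty&size=5'  # spot-check the item documents

# Phase 6: diff the query results (hypothetical file names)
diff actual.txt expected.txt
```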
Other possible errors may occur due to an improper `sparkConf`. Settings here are specific to the Spark Job, not to the services setup. These include (a quick connectivity check follows the list):

- `master`: make sure this is the host and port of the Spark master. "local" means use a Spark master inside the Harness process. If you are using an external Spark node, the address should be something like `spark://spark-api:7077`, or whatever address is shown in the Spark GUI.
- `spark.driver.memory`: this much memory must be available wherever Harness is running, since the Spark Driver is inside the Harness process. It must be large enough to create a hashmap of all IDs in the data, so all user and item IDs. The smallest possible `spark.driver.memory` is `3g`, and it goes up with the number of IDs.
- `spark.executor.memory`: this number times the number of executors must be big enough to read in all data. Spark uses in-memory calculations but spreads them and the data across Executors.
- `spark.es.nodes`: should be an address that a Spark Executor can use to connect to Elasticsearch.
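A couple of hedged connectivity checks for these settings, assuming the k8s service names from the config above and default ports (8080 for the Spark master web GUI, 9200 for Elasticsearch). Run the Elasticsearch check from a node where a Spark Executor runs, or from the Harness host as a first approximation.

```bash
# The Spark master web GUI normally shows the spark://host:port URL to use for "master"
curl -s http://spark-api:8080 | head -n 20

# The host in spark.es.nodes should answer on the Elasticsearch REST port
curl -s http://elasticsearch-client:9200
```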