scikit-learn custom prediction routines sample (only the training part) (GoogleCloudPlatform#63)

* New sample to show how to use custom prediction routines (training).

* A number of fixes and adjustments.

* Fixed README files.

* Added kokoro test files.

* Fixed the typo

* flake8 fixes

* flake8 fixes

* Changed pantheon to console.cloud everywhere.
happyhuman authored Sep 9, 2019
1 parent ef9c655 commit 3efc3b0
Showing 28 changed files with 1,004 additions and 28 deletions.
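For context, since this commit covers only the training part, below is a minimal sketch of the predictor interface that AI Platform custom prediction routines are built around: a class with a `predict()` method and a `from_path()` factory method. The `TaxiPredictor` name and the `model.joblib` file name are illustrative assumptions, not part of this commit.

import os

import joblib


class TaxiPredictor(object):
    """Minimal predictor sketch; the class name and model file name are assumptions."""

    def __init__(self, model):
        self._model = model

    def predict(self, instances, **kwargs):
        # Run the scikit-learn estimator/pipeline on a batch of instances
        # and return a JSON-serializable list of predictions.
        return self._model.predict(instances).tolist()

    @classmethod
    def from_path(cls, model_dir):
        # AI Platform calls this factory with the directory containing
        # the deployed model artifacts.
        model = joblib.load(os.path.join(model_dir, 'model.joblib'))
        return cls(model)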
49 changes: 49 additions & 0 deletions .kokoro/tests/training/sklearn_structured_custom_routines.sh
@@ -0,0 +1,49 @@
#!/bin/bash
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -eo pipefail


download_files() {
# Download files for testing.
GCS_FOLDER="gs://cloud-samples-data/ml-engine/chicago_taxi"

echo "Downloading files"
gsutil cp ${GCS_FOLDER}/training/small/taxi_trips_train.csv data/taxi_trips_train.csv
gsutil cp ${GCS_FOLDER}/training/small/taxi_trips_eval.csv data/taxi_trips_eval.csv
gsutil cp ${GCS_FOLDER}/prediction/taxi_trips_prediction_dict.ndjson data/taxi_trips_prediction_dict.ndjson

# Define ENV for `train-local.sh` script
export TAXI_TRAIN_SMALL=data/taxi_trips_train.csv
export TAXI_EVAL_SMALL=data/taxi_trips_eval.csv
export TAXI_PREDICTION_DICT_NDJSON=data/taxi_trips_prediction_dict.ndjson
}


run_tests() {
# Run base tests.
echo "Running code tests in `pwd`."
download_files
# Run local training and local prediction
source scripts/train-local.sh
}


main() {
cd ${KOKORO_ARTIFACTS_DIR}/github/ai-platform-samples/${CAIP_TEST_DIR}
run_tests
echo 'Test was successful'
}

main
50 changes: 50 additions & 0 deletions .kokoro/training/sklearn/structured/custom_routines/common.cfg
@@ -0,0 +1,50 @@
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Format: //devtools/kokoro/config/proto/build.proto

# Download trampoline resources.
gfile_resources: "/bigstore/cloud-devrel-kokoro-resources/trampoline"


# Download credentials from Cloud Storage.
gfile_resources: "/bigstore/cloud-devrel-kokoro-resources/ai-platform-samples"


# Use the trampoline script to run in docker.
build_file: "ai-platform-samples/.kokoro/trampoline.sh"


# Environment Variables.
env_vars: {
key: "TRAMPOLINE_IMAGE"
value: "gcr.io/cloud-devrel-kokoro-resources/python"
}

# Tell the trampoline which tests to run.
env_vars: {
key: "TRAMPOLINE_BUILD_FILE"
value: "github/ai-platform-samples/.kokoro/tests/run_tests.sh"
}

env_vars: {
key: "CAIP_TEST_DIR"
value: "training/sklearn/structured/custom_routines"
}

# Run specific tests
env_vars: {
key: "CAIP_TEST_SCRIPT"
value: "github/ai-platform-samples/.kokoro/tests/training/sklearn_structured_custom_routines.sh"
}
15 changes: 15 additions & 0 deletions .kokoro/training/sklearn/structured/custom_routines/continuous.cfg
@@ -0,0 +1,15 @@
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Format: //devtools/kokoro/config/proto/build.proto
15 changes: 15 additions & 0 deletions .kokoro/training/sklearn/structured/custom_routines/periodic.cfg
@@ -0,0 +1,15 @@
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Format: //devtools/kokoro/config/proto/build.proto
15 changes: 15 additions & 0 deletions .kokoro/training/sklearn/structured/custom_routines/presubmit.cfg
@@ -0,0 +1,15 @@
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Format: //devtools/kokoro/config/proto/build.proto
6 changes: 3 additions & 3 deletions setup/README.md
@@ -21,9 +21,9 @@ and follow the instructions.

5- Enable the APIs for the following services:

-* [Compute Engine](https://pantheon.corp.google.com/compute)
-* [Storage](https://pantheon.corp.google.com/storage)
-* [AI Platform](https://pantheon.corp.google.com/mlengine)
+* [Compute Engine](https://console.cloud.google.com/compute)
+* [Storage](https://console.cloud.google.com/storage)
+* [AI Platform](https://console.cloud.google.com/mlengine)

From your terminal, run:

Empty file added training/__init__.py
Empty file.
8 changes: 7 additions & 1 deletion training/sklearn/structured/base/README.md
@@ -30,7 +30,7 @@ executed on your local machine.
* [task.py](trainer/task.py) initializes and parses task arguments. This is the entry point to the trainer.
* [model.py](trainer/model.py) includes a function to create the scikit-learn estimator or pipeline
* [metadata.py](trainer/metadata.py) contains the definition for the target and feature names, among other configuration variables
-* [util.py](trainer/task.py) contains a number of helper functions used in task.py
+* [util.py](trainer/util.py) contains a number of helper functions used in task.py
* [scripts](./scripts) directory: command-line scripts to train the model locally or on AI Platform.
We recommend running the scripts in this directory in the following order, using
the `source` command, so that the environment variables exported at each step persist:
@@ -71,6 +71,12 @@ This will create a training job on AI Platform and display some instructions on
At the end of a successful training job, it will upload the trained model object to a GCS
bucket and set the `$MODEL_DIR` environment variable to the directory containing the model.

+### Monitoring
+Once the training starts and the models are generated, you may view the training job on
+the [AI Platform page](https://console.cloud.google.com/mlengine/jobs). If you click on the
+corresponding training job, you will be able to view the chosen hyperparameters, along with the
+metric scores for each model. All the generated model objects will be stored on GCS.

## Explaining Key Elements

In this section, we'll highlight the main elements of this sample.
4 changes: 3 additions & 1 deletion training/sklearn/structured/base/trainer/metadata.py
@@ -22,7 +22,9 @@
# Target name
TARGET_NAME = 'tip'

-# The features to be used for training
+# The features to be used for training.
+# If FEATURE_NAMES is None, then all the available columns will be
+# used as features, except for the target column.
FEATURE_NAMES = [
'trip_miles',
'trip_seconds',
12 changes: 9 additions & 3 deletions training/sklearn/structured/base/trainer/utils.py
@@ -35,8 +35,14 @@ def data_train_test_split(data_df):
pandas.DataFrame, pandas.Series)
"""

-    # Only use metadata.FEATURE_NAMES + metadata.TARGET_NAME
-    features = data_df[metadata.FEATURE_NAMES]
+    if metadata.FEATURE_NAMES is None:
+        # Use all the columns as features, except for the target column
+        feature_names = list(data_df.columns)
+        feature_names.remove(metadata.TARGET_NAME)
+        features = data_df[feature_names]
+    else:
+        # Only use metadata.FEATURE_NAMES
+        features = data_df[metadata.FEATURE_NAMES]
    target = data_df[metadata.TARGET_NAME]

    x_train, x_val, y_train, y_val = ms.train_test_split(features,
@@ -70,7 +76,7 @@ def read_df_from_bigquery(full_table_path, project_id=None, num_samples=None):


def read_df_from_gcs(file_pattern):
"""Read data from Google Cloud Storage, split into train and validation sets.
"""Read data from Google Cloud Storage, split into train and validation sets
Assume that the data on GCS is in csv format without header.
The column names will be provided through metadata
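As a self-contained illustration of the `FEATURE_NAMES is None` branch added above, the sketch below reproduces the selection logic on a toy DataFrame (the column values are made up for the example):

import pandas as pd

TARGET_NAME = 'tip'
FEATURE_NAMES = None  # None means: use every column except the target

data_df = pd.DataFrame({
    'trip_miles': [1.2, 3.4],
    'trip_seconds': [300, 900],
    'tip': [0.0, 2.5],
})

if FEATURE_NAMES is None:
    # Use all the columns as features, except for the target column.
    feature_names = list(data_df.columns)
    feature_names.remove(TARGET_NAME)
    features = data_df[feature_names]
else:
    features = data_df[FEATURE_NAMES]
target = data_df[TARGET_NAME]

print(list(features.columns))  # ['trip_miles', 'trip_seconds']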