diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md
index 2cd74c7a..f85ace50 100644
--- a/DEVELOPMENT.md
+++ b/DEVELOPMENT.md
@@ -1,5 +1,9 @@
-# propensity-modeling
-Marketing Intelligence Solution with Propensity Modeling
+# Marketing Analytics Jumpstart
+Marketing Analytics Jumpstart is an easy, extensible and automated implementation of an end-to-end solution that enables Marketing Technology teams to store, transform, enrich with 1PD and analyze marketing data, and programmatically send predictive events to Google Analytics 4 to support conversion optimization and remarketing campaigns.
+
+## Developer pre-requisites
+Use Visual Studio Code to develop the solution. Install the Gemini Code Assist, Docker, GitHub, HashiCorp Terraform and Jinja extensions.
+You should have Python 3, Poetry, Terraform, Git and Docker installed in your developer terminal environment.
 
 ## Preparing development environment
 
@@ -33,10 +37,71 @@ Automatically running tests will aslo execute code coverage.
 To execute tests on terminal
 ```bash
 poetry run pytest -c pyproject.toml
-
 ```
-flags:
- -n # Number of Tests to run in Parallel
- -m " # Execute only tests marked with the given marker (@pytest.mark.unit)
- --maxfail= # Number of failed tests before aboarding
- --cov # test incudes code coverage
+
+## Customizing the solution
+The solution is customizable using a set of configurations defined in the YAML config file located in the `config/` folder, the Terraform files located in the `infrastructure/terraform/` folder, the Python files located in the `python/` folder, the SQL files located in the `sql/` folder, and the template files located in the `templates/` folder.
+
+Here's a brief breakdown of the contents of each folder:
+* `config/`:
+* * `config.yaml.tftpl`: This file contains the main configuration parameters for the solution, including the project ID, dataset names, and pipeline schedules.
+* `infrastructure/terraform/`:
+* * `terraform.tfvars`: This file contains the Terraform variables that can be used to override the default configuration values and to choose which components of the solution to deploy.
+* `infrastructure/terraform/modules/`:
+* * `activation/main.tf`: This Terraform file defines the Cloud Function that triggers the activation application.
+* * `data-store/main.tf`: This Terraform file defines the parameters to deploy the Dataform code defined in the [repository](https://github.com/GoogleCloudPlatform/marketing-analytics-jumpstart-dataform).
+* * `dataform-workflow/dataform-workflow.tf`: This Terraform file defines the parameters to deploy the Cloud Workflow that triggers the Dataform code.
+* * `feature-store/bigquery-*.tf`: These Terraform files define the BigQuery datasets, tables, and stored procedures that are used to store and transform the features extracted from the marketing data store.
+* * `monitor/main.tf`: This Terraform file defines the Cloud Logging Sink destination in BigQuery used by the Looker Studio Dashboard.
+* * `pipelines/pipelines.tf`: This Terraform file defines the Vertex AI pipelines used for feature engineering, training, prediction, and explanation.
+* `python/`:
+* * `activation`: This Python module implements the Dataflow/Apache Beam pipeline that sends all predictions to Google Analytics 4 via the Measurement Protocol API.
+* * `base_component_image`: This Python module implements the base component image used by the Vertex AI pipeline components; all library dependencies are installed in this Docker image.
+* * `function/trigger_activation`: This Python module implements the Cloud Function that triggers the activation application.
+* * `ga4_setup`: This Python module uses the Google Analytics 4 Admin SDK to set up the custom dimensions on the Google Analytics 4 property.
+* * `lookerstudio`: This Python module automates the copy and deployment of the Looker Studio Dashboard.
+* * `pipelines`: This Python module implements all the custom Kubeflow pipeline components, built with the Google Cloud Pipeline Components library, that are used by the Vertex AI pipelines. It also contains the pipeline definitions for the feature engineering, training, prediction, and explanation pipelines of all use cases.
+* `sql/`:
+* * `procedures/`: This folder contains the JINJA template files with the `.sqlx` extension used to generate the stored procedures deployed in BigQuery.
+* * `queries/`: This folder contains the JINJA template files with the `.sqlx` extension used to generate the queries deployed in BigQuery.
+* `templates/`:
+* * `app_payload_template.jinja2`: This file defines the JINJA template used to generate the payload for the Measurement Protocol API used by the Activation Application.
+* * `activation_query`: This folder contains the JINJA template files with the `.sqlx` extension used to generate the SQL queries, one per use case, that the Activation Application uses to get all the predictions to be prepared and sent to Google Analytics 4.
+
+## Out-of-the-box configuration parameters provided by the solution
+
+### Overall configuration parameters
+The `config.yaml.tftpl` file is a YAML file that contains all the configuration parameters for the Marketing Analytics Jumpstart solution. A YAML file is a map or a list; maps associate key-value pairs, and the hierarchy is determined by indentation. This configuration file is organized into section block mappings.
+
+| Key | Description |
+| ---------- | ---------- |
+| google_cloud_project | This section contains general configuration parameters for the GCP project, such as the project ID and project number |
+| cloud_build | This section contains the configuration parameters for the Cloud Build pipeline |
+| container | This section contains the configuration parameters for the container images |
+| artifact_registry | This section contains the configuration parameters for the Artifact Registry repository |
+| dataflow | This section contains the configuration parameters for the Dataflow pipeline |
+| vertex_ai | This section contains the configuration parameters for the Vertex AI pipelines |
+| bigquery | This section contains the configuration parameters for the BigQuery artifacts |
+
+There are two section mappings which are particularly important: `vertex_ai` and `bigquery`.
+- `vertex_ai` section mapping: In the `vertex_ai` section, there are `pipelines` blocks for each Vertex AI pipeline implemented.
+The `pipelines` section organizes the configuration parameters for the Vertex AI pipelines into subsections defined as: `feature-creation-auto-audience-segmentation`, `feature-creation-audience-segmentation`, `feature-creation-purchase-propensity`, `feature-creation-customer-ltv`, `propensity.training`, `propensity.prediction`, `segmentation.training`, `segmentation.prediction`, `auto_segmentation.training`, `auto_segmentation.prediction`, `propensity_clv.training`, `clv.training`, `clv.prediction` and `reporting_preparation`.
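+
+As an illustration, a single use-case subsection follows the general shape sketched below. This is a minimal, hypothetical example; the exact keys and values for each pipeline are defined in `config.yaml.tftpl` and may differ from what is shown here.
+
+```yaml
+vertex_ai:
+  pipelines:
+    propensity:
+      training:
+        execution:
+          schedule:
+            cron: "TZ=America/New_York 0 4 * * *"  # illustrative schedule only
+          pipeline_parameters:
+            project_id: "my-gcp-project"           # hypothetical value
+            location: "us-central1"                # hypothetical value
+            target_column: "will_purchase"         # hypothetical value
+```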
+For the subsections described above, the `execution` section contains the `schedule` and `pipeline_parameters` block mappings. The `schedule` block defines the scheduling key-values of the pipeline. The `pipeline_parameters` block defines the key-values that are used to compile the pipeline.
+Pay close attention to the key-value pairs inside `pipeline_parameters` for each `vertex_ai.pipelines` entry, since most of the pipeline parameters are changed inside that section mapping.
+
+The `bigquery` section organizes the configuration parameters for the BigQuery datasets, tables, queries and procedures into subsections defined as: `dataset`, `table`, `query`, `procedure`.
+- `dataset`: Contains key-value pairs for all the configuration parameters of the datasets deployed in BigQuery, such as name, location and description.
+- `table`: Contains key-value pairs for all the configuration parameters of the tables deployed in BigQuery, such as the dataset it belongs to, the table name and the location.
+- `query`: Contains key-value pairs for all the configuration parameters of the queries deployed in BigQuery, such as interval days and split numbers.
+- `procedure`: Contains key-value pairs for all the configuration parameters of the procedures deployed in BigQuery, such as start and end dates.
+
+### Modules configuration parameters
+The `terraform.tfvars` file is a Terraform variables definition file created during the installation process that lets you define custom Terraform variables that override the defaults. Here are a few examples of changes you can make:
+Change the `project_id` used to store the Terraform remote backend state; the data staging `project_id`; the data processing `project_id`; the `website_url` of the customer digital store; the feature store and activation `project_id`; the source GA4 and GAds export projects and datasets; and a few more variables.
+The Terraform definition files for the `feature-store` and `pipelines` modules contain all the Terraform resources and data sources that read local files to deploy the SQL code to BigQuery. In `bigquery-procedures.tf`, you can configure which stored procedures are deployed, in which datasets, using which `local_file` code and in which project. In `bigquery-datasets.tf`, you can configure which datasets are deployed, their names, locations and whether the contents of the dataset will be deleted when you run a `terraform destroy` command. In `bigquery-tables.tf`, you can configure which tables are deployed, their names, their datasets and their schemas.
+
+### Feature Store configuration parameters
+The SQL files in the `sql/procedure/` and `sql/query/` folders are `.sqlx` JINJA template files containing SQL code that is hydrated from the configuration parameters defined in the `config.yaml` file, more specifically from the `sql.query` and `sql.procedure` sections.
+
+### Activation Application configuration parameters
+The files in the `templates/activation_query/` folder are `.sqlx` JINJA template files containing BigQuery SQL code that retrieves the model predictions produced in the prediction tables for each use case. You can configure the columns and the filter conditions to send user-level prediction events for only a subset of users.
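+
+The values that hydrate these `.sqlx` templates come from the configuration file described above. As a minimal, hypothetical sketch (the entry and key names below are illustrative only; the authoritative names live in `config.yaml.tftpl`), a query or procedure entry looks roughly like this:
+
+```yaml
+bigquery:
+  query:
+    purchase_propensity_training_preparation:  # hypothetical query entry
+      interval_days: 180                       # illustrative look-back window
+      split_numbers: 10                        # illustrative split configuration
+  procedure:
+    purchase_propensity_label:                 # hypothetical procedure entry
+      interval_start_date: 180                 # illustrative start of the date range
+      interval_end_date: 30                    # illustrative end of the date range
+```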
diff --git a/README.md b/README.md
index d377eecb..3f35a6bc 100644
--- a/README.md
+++ b/README.md
@@ -1,44 +1,139 @@
 # Marketing Analytics Jumpstart
+Marketing Analytics Jumpstart is a Terraform-automated, quick-to-deploy, customizable end-to-end marketing solution on Google Cloud Platform (GCP). This solution aims at helping customers better understand and better use their digital advertising budget.
-Marketing Analytics Jumpstart is a terraform based, quick-to-deploy end-to-end marketing solutions on Google Cloud. This solutions aims at helping customer better understand and better use their digital advertising budget.
-After installing the solutions users will get:
-* Scheduled ETL jobs for an extensible data model based on the Google Analytics 4 and Google Ads daily exports
-* End-to-end ML pipelines for Purchase Propensity, Customer Lifetime Value and Audience Segmentation
-* Dashboard for interpreting the data and model predictions
-* Activation pipeline that sends models prediction to Google Analytics 4 as custom dimensions
+Customers are looking to drive revenue and increase media efficiency by identifying, predicting and targeting valuable users through the use of machine learning. However, marketers first have to solve the challenge of having a number of disparate data sources that prevent them from having a holistic view of customers. Marketers also often don't have the expertise and/or resources in their marketing departments to train, run, and activate ML models on paid channels. Without a solution that enables innovation through predictive analytics, marketers are missing opportunities to advance their marketing program and accelerate key goals and objectives (e.g. acquiring new customers, improving customer retention, etc.).
-This solution handles scheduling, data engineering, data modeling, data normalization, feature engineering, model training, model evaluation, and programatically sending predictions back into Google Analytics 4.
-## Disclaimer
+## Benefits
+After installing the solution, users will get:
+* Scheduled ETL jobs for an extensible logical data model based on the Google Analytics 4 (GA4) and Google Ads (GAds) daily exports
+* Validated feature engineering SQL transformations from event-level data to user-level data for reporting and machine learning model training and prediction
+* End-to-end ML pipelines for Purchase Propensity, Customer Lifetime Value, Audience Segmentation and Value Based Bidding
+* Dashboard for interpreting the data and model predictions, and for monitoring the pipelines and jobs in a seamless manner
+* Activation application that sends model predictions to GA4 via the Measurement Protocol API
-This is not an officially supported Google product.
-This solution in a work in progress and currently in the preview stage.
-## High Level Architecture
+## Who can benefit from this solution?
+This solution is intended for Marketing Technologist teams using GA4 and GAds products. It facilitates efforts to store, transform and analyze marketing data, and programmatically create audience segments in Google Ads to support conversion optimization and remarketing campaigns.
+
+| Role | User Journeys | Skillset | Can Deploy? |
+|-------|-------------|----------|-------------|
+| Marketing Scientist | Using an isolated and secure sandbox infrastructure to perform and monitor explorations with sensitive data. Using automated machine learning to accelerate time-to-value on building use case solutions. Faster learning curve to quickly and easily access and analyze data from the marketing data store. Ability to collaborate with other teams by reusing similar components. | Vertex AI, Python, SQL, Data Science | No |
+| Marketing Analyst | Simplifying the operation of the marketing data store (data assertions), machine learning pipelines (model training, prediction, explanation) and the activation application. Monitoring Ads Campaigns Performance, Web Traffic and Predictive Insights Reports. Interpreting the insights provided to plan and activate Ads campaigns. Defining audience segments using predictive metrics. | BigQuery, Looker Studio, Google Analytics 4, Google Ads | Yes |
+| Digital Marketing Manager | Gaining insights into customer behavior to improve marketing campaigns. Identifying and targeting new customers. Measuring the effectiveness of marketing campaigns. | Looker Studio, Google Analytics 4, Google Ads | No |
+| IT/Data Engineer | Building and maintaining marketing data store transformation jobs. Developing and deploying custom marketing use cases reusing a consistent infrastructure. Integrating 1st party data and Google 3rd party data by extending the marketing data store. | Python, SQL, Google Cloud Platform, Data Engineering | Yes |
+
+
+## Use Cases
+This solution enables customers to plan and take action on their marketing campaigns by interpreting the insights provided by four common predictive use cases (purchase propensity, customer lifetime value, audience segmentation and aggregated value based bidding) and an operations dashboard that monitors Campaigns, Traffic, User Behavior and Model Performance, using the best of Google Cloud Data and AI products and practices.
+
+These insights serve as a basis to optimize paid media efforts and investments by:
+* Building audience segments by using all Google first party data to identify user interests and demographic characteristics relevant to the campaign
+* Improving campaign performance by identifying and targeting the user deciles most likely to take an action (e.g. purchase, sign-up, churn, abandon a cart, etc.)
+* Driving a more personalized experience for your highly valued customers and improving return on ad spend (ROAS) via customer lifetime value
+* Attributing bidding values to specific users according to their journeys through the conversion funnel, which the Ads platform uses to guide better campaign performance in specific markets
+
+
+## Repository Structure
+The solution's source code is written in Terraform, Python, SQL, YAML and JSON; and it is organized into five main folders:
+* `config/`: This folder contains the configuration file for the solution. This file defines the parameters and settings used by the various components of the solution.
+* `infrastructure/terraform/`: This folder contains the Terraform modules, variables and the installation guide to deploy the solution's infrastructure on GCP.
+  * `infrastructure/terraform/modules/`: This folder contains the Terraform modules and their corresponding Terraform resources. These modules correspond to the architectural components broken down in the next section.
+* `python/`: This folder contains most of the Python code. This code implements the activation application, which sends model predictions to Google Analytics 4; and the custom Vertex AI pipelines, their components and the base component Docker image used for the feature engineering, training, prediction, and explanation pipelines. It also implements the Cloud Function that triggers the activation application, and the Google Analytics Admin SDK code that creates the custom dimensions on the GA4 property.
+* `sql/`: This folder contains the SQL code and table schemas specified in JSON files. This code implements the stored procedures used to transform and enrich the marketing data, as well as the queries used to invoke the stored procedures and retrieve the data for analysis.
+* `templates/`: This folder contains the templates for generating the Google Analytics 4 Measurement Protocol API payloads used to send model predictions to Google Analytics 4.
+
+In addition, there is a `tasks.py` file which implements Python `invoke` tasks that hydrate values into the JINJA template files with the `.sqlx` extension located in the `sql/` folder; these templates define the DDL and DML statements for the BigQuery datasets, tables, procedures and queries.
+
+## High Level Architecture
 ![](https://i.imgur.com/5D3WPEb.png)
-## Pre-Requisites
+The provided architecture diagram depicts the high-level architecture of the Marketing Analytics Jumpstart solution. Let's break down the components:
+
+1. Data Sources:
+* Google Analytics 4 Export: This provides daily data exports from your Google Analytics 4 property to BigQuery.
+* Google Ads Export: This provides daily data exports from your Google Ads account to BigQuery.
+
+2. Marketing Data Store:
+* Dataform: This tool manages the data transformation and enrichment process. It uses SQL-like code to define data pipelines that transform the raw data from Google Analytics 4 and Google Ads into a unified and enriched format.
+
+3. Feature Store:
+* BigQuery: This serves as the central repository for storing the features extracted from the marketing data.
+* Vertex AI Pipelines: These pipelines automate the feature engineering process, generating features based on user behavior, traffic sources, devices, and other relevant factors.
+
+4. Machine Learning Pipelines:
+* Vertex AI Pipelines: These pipelines handle the training, prediction, and explanation of various machine learning models.
+* Tabular Workflow End-to-End AutoML: This approach automates the model training process for tasks like purchase propensity and customer lifetime value prediction.
+* Custom Training and Prediction Pipelines: These pipelines are used for the auto audience segmentation training and prediction, and for the aggregated value based bidding model explanation.
+
+5. Activation Application:
+* Dataflow: This tool processes the model predictions and sends them to Google Analytics 4 via the Measurement Protocol API.
+* User-level Predictions: These predictions are used to enhance your Google Analytics 4 data with insights about user behavior and purchase likelihood.
+
+6. Dashboards:
+* Looker Studio: This tool provides interactive dashboards for visualizing the performance of your Google Ads campaigns, user behavior in Google Analytics 4, and the results of the machine learning models.
+
+7. Monitoring:
+* Dataform Jobs: These jobs are monitored for errors to ensure the data transformation process runs smoothly.
+* Vertex AI Pipelines Runs: These runs are monitored to track the performance and success of the machine learning pipelines.
+
+This high-level architecture demonstrates how Marketing Analytics Jumpstart integrates various Google Cloud services to provide a comprehensive solution for analyzing and activating your marketing data.
+
+
+## Advantages
+1. Easy to deploy: Deploy the resources and use cases that you need.
+2. Cost Effective: Pay only for the cost of the infrastructure required to maintain the Data Store, Feature Store and ML Models.
+3. Keep control of your data: This solution runs entirely in your environment and doesn’t transfer data out of your ownership or organization.
+4. Foundation for 1st Party Data Strategy: The data store can serve as a basis for your team to customize or implement your own use cases and enable in-house expertise to thrive.
+5. Enable team collaboration: Use Terraform to maintain the dependency graph between resources and to manage the resource lifecycle.
+
+
+## Installation Pre-Requisites
 - [ ] [Create GCP project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#creating_a_project) and [Enable Billing](https://cloud.google.com/billing/docs/how-to/modify-project)
 - [ ] Set up [Google Analyics 4 Export](https://support.google.com/analytics/answer/9823238?hl=en#zippy=%2Cin-this-article) and [Google Ads Export](https://cloud.google.com/bigquery/docs/google-ads-transfer) to Bigquery
 - [ ] [Backfill](https://cloud.google.com/bigquery/docs/google-ads-transfer) BigQuery Data Transfer service for Google Ads
 - [ ] Have existing Google Analytics 4 property with [Measurement ID](https://support.google.com/analytics/answer/12270356?hl=en)
-## Permissions
+
+## Installation Permissions and Privileges
 - [ ] Google Analytics Property Editor or Owner
 - [ ] Google Ads Reader
 - [ ] Project Owner for GCP Project
 - [ ] Github or Gitlab account priviledges for repo creation and access token. [Details](https://cloud.google.com/dataform/docs/connect-repository)
-## Installation
+## Installation
 Please follow the step by step installation guide with [![Open in Cloud Shell](https://gstatic.com/cloudssh/images/open-btn.svg)](https://shell.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https://github.com/GoogleCloudPlatform/marketing-analytics-jumpstart.git&cloudshell_git_branch=main&cloudshell_workspace=&cloudshell_tutorial=infrastructure/cloudshell/tutorial.md)
 **Note:** If you are working from a forked repository, be sure to update the `cloudshell_git_repo` parameter to the URL of your forked repository for the button link above.
 The detailed installation instructions can be found at the [Installation Guide](./infrastructure/README.md).
-## Contributing
+## Contributing
 We welcome all feedback and contributions! Please read [CONTRIBUTING.md](./CONTRIBUTING.md) for more information on how to publish your contributions.
+
+
+## License
+This project is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
+
+
+## Resources
+This is a list of public websites you can use to learn more about the Google Analytics 4, Google Ads and Google Cloud products used to build this solution.
+
+| Websites | Description |
+|----------|-------------|
+| [support.google.com/google-ads/*](https://support.google.com/google-ads/) [support.google.com/analytics/*](https://support.google.com/analytics/) | Google Ads and Google Analytics Support |
+| [support.google.com/looker-studio/*](https://support.google.com/looker-studio/) | Looker Studio Support |
+| [developers.google.com/analytics/*](https://developers.google.com/analytics/) [developers.google.com/google-ads/*](https://developers.google.com/google-ads/) | Google Ads and Google Analytics Developers Guides |
+| [cloud.google.com/developers/*](https://cloud.google.com/developers/) [developers.google.com/looker-studio/*](https://developers.google.com/looker-studio/) | Google Cloud & Looker Studio Developers Guides |
+| [cloud.google.com/bigquery/docs/*](https://cloud.google.com/bigquery/docs/) [cloud.google.com/vertex-ai/docs/*](https://cloud.google.com/vertex-ai/docs/) [cloud.google.com/looker/docs/*](https://cloud.google.com/looker/docs/) [cloud.google.com/dataform/docs/*](https://cloud.google.com/dataform/docs/) | Google Cloud Product Documentation |
+| [cloud.google.com/python/docs/reference/aiplatform/latest/*](https://cloud.google.com/python/docs/reference/aiplatform/latest/) [cloud.google.com/python/docs/reference/automl/latest/*](https://cloud.google.com/python/docs/reference/automl/latest/) [cloud.google.com/python/docs/reference/bigquery/latest/*](https://cloud.google.com/python/docs/reference/bigquery/latest/) | Google Cloud API Reference Documentation |
+
+
+## Disclaimer
+This is not an officially supported Google product.
+This solution is a work in progress and is currently in the preview stage.
+
diff --git a/config.yaml b/config.yaml
deleted file mode 100644
index a65c6d23..00000000
--- a/config.yaml
+++ /dev/null
@@ -1,7877 +0,0 @@
-# Copyright 2023 Google LLC
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
- -pipelineSpec: - components: - comp-automl-tabular-cv-trainer: - executorLabel: exec-automl-tabular-cv-trainer - inputDefinitions: - artifacts: - materialized_cv_splits: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - metadata: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - transform_output: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - tuning_result_input: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - deadline_hours: - type: DOUBLE - encryption_spec_key_name: - type: STRING - location: - type: STRING - num_parallel_trials: - type: INT - num_selected_trials: - type: INT - project: - type: STRING - root_dir: - type: STRING - single_run_max_secs: - type: INT - worker_pool_specs_override_json: - type: STRING - outputDefinitions: - artifacts: - tuning_result_output: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - execution_metrics: - type: STRING - gcp_resources: - type: STRING - comp-automl-tabular-cv-trainer-2: - executorLabel: exec-automl-tabular-cv-trainer-2 - inputDefinitions: - artifacts: - materialized_cv_splits: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - metadata: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - transform_output: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - tuning_result_input: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - deadline_hours: - type: DOUBLE - encryption_spec_key_name: - type: STRING - location: - type: STRING - num_parallel_trials: - type: INT - num_selected_trials: - type: INT - project: - type: STRING - root_dir: - type: STRING - single_run_max_secs: - type: INT - worker_pool_specs_override_json: - type: STRING - outputDefinitions: - artifacts: - tuning_result_output: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - execution_metrics: - type: STRING - gcp_resources: - type: STRING - comp-automl-tabular-ensemble: - executorLabel: exec-automl-tabular-ensemble - inputDefinitions: - artifacts: - dataset_schema: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - instance_baseline: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - metadata: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - transform_output: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - tuning_result_input: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - warmup_data: - artifactType: - schemaTitle: system.Dataset - schemaVersion: 0.0.1 - parameters: - encryption_spec_key_name: - type: STRING - export_additional_model_without_custom_ops: - type: STRING - location: - type: STRING - project: - type: STRING - root_dir: - type: STRING - outputDefinitions: - artifacts: - explanation_metadata_artifact: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - model: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - model_architecture: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - model_without_custom_ops: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - unmanaged_container_model: - artifactType: - schemaTitle: google.UnmanagedContainerModel - schemaVersion: 0.0.1 - parameters: - explanation_metadata: - type: STRING - explanation_parameters: - type: STRING - gcp_resources: - type: STRING - 
comp-automl-tabular-ensemble-2: - executorLabel: exec-automl-tabular-ensemble-2 - inputDefinitions: - artifacts: - dataset_schema: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - instance_baseline: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - metadata: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - transform_output: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - tuning_result_input: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - warmup_data: - artifactType: - schemaTitle: system.Dataset - schemaVersion: 0.0.1 - parameters: - encryption_spec_key_name: - type: STRING - export_additional_model_without_custom_ops: - type: STRING - location: - type: STRING - project: - type: STRING - root_dir: - type: STRING - outputDefinitions: - artifacts: - explanation_metadata_artifact: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - model: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - model_architecture: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - model_without_custom_ops: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - unmanaged_container_model: - artifactType: - schemaTitle: google.UnmanagedContainerModel - schemaVersion: 0.0.1 - parameters: - explanation_metadata: - type: STRING - explanation_parameters: - type: STRING - gcp_resources: - type: STRING - comp-automl-tabular-ensemble-3: - executorLabel: exec-automl-tabular-ensemble-3 - inputDefinitions: - artifacts: - dataset_schema: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - instance_baseline: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - metadata: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - transform_output: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - tuning_result_input: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - warmup_data: - artifactType: - schemaTitle: system.Dataset - schemaVersion: 0.0.1 - parameters: - encryption_spec_key_name: - type: STRING - export_additional_model_without_custom_ops: - type: STRING - location: - type: STRING - project: - type: STRING - root_dir: - type: STRING - outputDefinitions: - artifacts: - explanation_metadata_artifact: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - model: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - model_architecture: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - model_without_custom_ops: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - unmanaged_container_model: - artifactType: - schemaTitle: google.UnmanagedContainerModel - schemaVersion: 0.0.1 - parameters: - explanation_metadata: - type: STRING - explanation_parameters: - type: STRING - gcp_resources: - type: STRING - comp-automl-tabular-finalizer: - executorLabel: exec-automl-tabular-finalizer - inputDefinitions: - parameters: - encryption_spec_key_name: - type: STRING - location: - type: STRING - project: - type: STRING - root_dir: - type: STRING - outputDefinitions: - parameters: - gcp_resources: - type: STRING - comp-automl-tabular-infra-validator: - executorLabel: exec-automl-tabular-infra-validator - inputDefinitions: - artifacts: - unmanaged_container_model: - artifactType: - schemaTitle: google.UnmanagedContainerModel - schemaVersion: 0.0.1 - 
comp-automl-tabular-infra-validator-2: - executorLabel: exec-automl-tabular-infra-validator-2 - inputDefinitions: - artifacts: - unmanaged_container_model: - artifactType: - schemaTitle: google.UnmanagedContainerModel - schemaVersion: 0.0.1 - comp-automl-tabular-infra-validator-3: - executorLabel: exec-automl-tabular-infra-validator-3 - inputDefinitions: - artifacts: - unmanaged_container_model: - artifactType: - schemaTitle: google.UnmanagedContainerModel - schemaVersion: 0.0.1 - comp-automl-tabular-stage-1-tuner: - executorLabel: exec-automl-tabular-stage-1-tuner - inputDefinitions: - artifacts: - materialized_eval_split: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - materialized_train_split: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - metadata: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - transform_output: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - deadline_hours: - type: DOUBLE - disable_early_stopping: - type: STRING - encryption_spec_key_name: - type: STRING - location: - type: STRING - num_parallel_trials: - type: INT - num_selected_trials: - type: INT - project: - type: STRING - reduce_search_space_mode: - type: STRING - root_dir: - type: STRING - run_distillation: - type: STRING - single_run_max_secs: - type: INT - study_spec_parameters_override: - type: STRING - tune_feature_selection_rate: - type: STRING - worker_pool_specs_override_json: - type: STRING - outputDefinitions: - artifacts: - tuning_result_output: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - execution_metrics: - type: STRING - gcp_resources: - type: STRING - comp-automl-tabular-stage-1-tuner-2: - executorLabel: exec-automl-tabular-stage-1-tuner-2 - inputDefinitions: - artifacts: - materialized_eval_split: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - materialized_train_split: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - metadata: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - transform_output: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - deadline_hours: - type: DOUBLE - disable_early_stopping: - type: STRING - encryption_spec_key_name: - type: STRING - location: - type: STRING - num_parallel_trials: - type: INT - num_selected_trials: - type: INT - project: - type: STRING - reduce_search_space_mode: - type: STRING - root_dir: - type: STRING - run_distillation: - type: STRING - single_run_max_secs: - type: INT - study_spec_parameters_override: - type: STRING - tune_feature_selection_rate: - type: STRING - worker_pool_specs_override_json: - type: STRING - outputDefinitions: - artifacts: - tuning_result_output: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - execution_metrics: - type: STRING - gcp_resources: - type: STRING - comp-automl-tabular-transform: - executorLabel: exec-automl-tabular-transform - inputDefinitions: - artifacts: - dataset_schema: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - eval_split: - artifactType: - schemaTitle: system.Dataset - schemaVersion: 0.0.1 - metadata: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - test_split: - artifactType: - schemaTitle: system.Dataset - schemaVersion: 0.0.1 - train_split: - artifactType: - schemaTitle: system.Dataset - schemaVersion: 0.0.1 - parameters: - dataflow_disk_size_gb: - type: INT - 
dataflow_machine_type: - type: STRING - dataflow_max_num_workers: - type: INT - dataflow_service_account: - type: STRING - dataflow_subnetwork: - type: STRING - dataflow_use_public_ips: - type: STRING - encryption_spec_key_name: - type: STRING - location: - type: STRING - project: - type: STRING - root_dir: - type: STRING - outputDefinitions: - artifacts: - materialized_eval_split: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - materialized_test_split: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - materialized_train_split: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - training_schema_uri: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - transform_output: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - gcp_resources: - type: STRING - comp-automl-tabular-transform-2: - executorLabel: exec-automl-tabular-transform-2 - inputDefinitions: - artifacts: - dataset_schema: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - eval_split: - artifactType: - schemaTitle: system.Dataset - schemaVersion: 0.0.1 - metadata: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - test_split: - artifactType: - schemaTitle: system.Dataset - schemaVersion: 0.0.1 - train_split: - artifactType: - schemaTitle: system.Dataset - schemaVersion: 0.0.1 - parameters: - dataflow_disk_size_gb: - type: INT - dataflow_machine_type: - type: STRING - dataflow_max_num_workers: - type: INT - dataflow_service_account: - type: STRING - dataflow_subnetwork: - type: STRING - dataflow_use_public_ips: - type: STRING - encryption_spec_key_name: - type: STRING - location: - type: STRING - project: - type: STRING - root_dir: - type: STRING - outputDefinitions: - artifacts: - materialized_eval_split: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - materialized_test_split: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - materialized_train_split: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - training_schema_uri: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - transform_output: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - gcp_resources: - type: STRING - comp-bool-identity: - executorLabel: exec-bool-identity - inputDefinitions: - parameters: - value: - type: STRING - outputDefinitions: - parameters: - Output: - type: STRING - comp-bool-identity-2: - executorLabel: exec-bool-identity-2 - inputDefinitions: - parameters: - value: - type: STRING - outputDefinitions: - parameters: - Output: - type: STRING - comp-bool-identity-3: - executorLabel: exec-bool-identity-3 - inputDefinitions: - parameters: - value: - type: STRING - outputDefinitions: - parameters: - Output: - type: STRING - comp-calculate-training-parameters: - executorLabel: exec-calculate-training-parameters - inputDefinitions: - parameters: - fast_testing: - type: STRING - is_skip_architecture_search: - type: STRING - run_distillation: - type: STRING - stage_1_num_parallel_trials: - type: INT - stage_2_num_parallel_trials: - type: INT - train_budget_milli_node_hours: - type: DOUBLE - outputDefinitions: - parameters: - distill_stage_1_deadline_hours: - type: DOUBLE - reduce_search_space_mode: - type: STRING - stage_1_deadline_hours: - type: DOUBLE - stage_1_num_selected_trials: - type: INT - stage_1_single_run_max_secs: - type: INT - stage_2_deadline_hours: - type: DOUBLE - 
stage_2_single_run_max_secs: - type: INT - comp-calculate-training-parameters-2: - executorLabel: exec-calculate-training-parameters-2 - inputDefinitions: - parameters: - fast_testing: - type: STRING - is_skip_architecture_search: - type: STRING - run_distillation: - type: STRING - stage_1_num_parallel_trials: - type: INT - stage_2_num_parallel_trials: - type: INT - train_budget_milli_node_hours: - type: DOUBLE - outputDefinitions: - parameters: - distill_stage_1_deadline_hours: - type: DOUBLE - reduce_search_space_mode: - type: STRING - stage_1_deadline_hours: - type: DOUBLE - stage_1_num_selected_trials: - type: INT - stage_1_single_run_max_secs: - type: INT - stage_2_deadline_hours: - type: DOUBLE - stage_2_single_run_max_secs: - type: INT - comp-condition-is-distill-7: - dag: - outputs: - artifacts: - feature-attribution-3-feature_attributions: - artifactSelectors: - - outputArtifactKey: feature-attribution-3-feature_attributions - producerSubtask: condition-is-evaluation-8 - model-evaluation-3-evaluation_metrics: - artifactSelectors: - - outputArtifactKey: model-evaluation-3-evaluation_metrics - producerSubtask: condition-is-evaluation-8 - tasks: - automl-tabular-ensemble-3: - cachingOptions: - enableCache: true - componentRef: - name: comp-automl-tabular-ensemble-3 - dependentTasks: - - automl-tabular-stage-1-tuner-2 - - automl-tabular-transform-2 - inputs: - artifacts: - dataset_schema: - componentInputArtifact: pipelineparam--tabular-stats-and-example-gen-dataset_schema - instance_baseline: - componentInputArtifact: pipelineparam--tabular-stats-and-example-gen-instance_baseline - metadata: - componentInputArtifact: pipelineparam--tabular-stats-and-example-gen-metadata - transform_output: - taskOutputArtifact: - outputArtifactKey: transform_output - producerTask: automl-tabular-transform-2 - tuning_result_input: - taskOutputArtifact: - outputArtifactKey: tuning_result_output - producerTask: automl-tabular-stage-1-tuner-2 - warmup_data: - componentInputArtifact: pipelineparam--tabular-stats-and-example-gen-eval_split - parameters: - encryption_spec_key_name: - componentInputParameter: pipelineparam--encryption_spec_key_name - export_additional_model_without_custom_ops: - componentInputParameter: pipelineparam--export_additional_model_without_custom_ops - location: - componentInputParameter: pipelineparam--location - project: - componentInputParameter: pipelineparam--project - root_dir: - componentInputParameter: pipelineparam--root_dir - taskInfo: - name: automl-tabular-ensemble-3 - automl-tabular-infra-validator-3: - cachingOptions: - enableCache: true - componentRef: - name: comp-automl-tabular-infra-validator-3 - dependentTasks: - - automl-tabular-ensemble-3 - inputs: - artifacts: - unmanaged_container_model: - taskOutputArtifact: - outputArtifactKey: unmanaged_container_model - producerTask: automl-tabular-ensemble-3 - taskInfo: - name: automl-tabular-infra-validator-3 - automl-tabular-stage-1-tuner-2: - cachingOptions: - enableCache: true - componentRef: - name: comp-automl-tabular-stage-1-tuner-2 - dependentTasks: - - automl-tabular-transform-2 - inputs: - artifacts: - materialized_eval_split: - taskOutputArtifact: - outputArtifactKey: materialized_eval_split - producerTask: automl-tabular-transform-2 - materialized_train_split: - taskOutputArtifact: - outputArtifactKey: materialized_train_split - producerTask: automl-tabular-transform-2 - metadata: - componentInputArtifact: pipelineparam--tabular-stats-and-example-gen-metadata - transform_output: - taskOutputArtifact: - 
outputArtifactKey: transform_output - producerTask: automl-tabular-transform-2 - parameters: - deadline_hours: - componentInputParameter: pipelineparam--calculate-training-parameters-2-distill_stage_1_deadline_hours - disable_early_stopping: - componentInputParameter: pipelineparam--disable_early_stopping - encryption_spec_key_name: - componentInputParameter: pipelineparam--encryption_spec_key_name - location: - componentInputParameter: pipelineparam--location - num_parallel_trials: - componentInputParameter: pipelineparam--stage_1_num_parallel_trials - num_selected_trials: - runtimeValue: - constantValue: - intValue: '1' - project: - componentInputParameter: pipelineparam--project - reduce_search_space_mode: - componentInputParameter: pipelineparam--calculate-training-parameters-2-reduce_search_space_mode - root_dir: - componentInputParameter: pipelineparam--root_dir - run_distillation: - runtimeValue: - constantValue: - intValue: '1' - single_run_max_secs: - componentInputParameter: pipelineparam--calculate-training-parameters-2-stage_1_single_run_max_secs - study_spec_parameters_override: - runtimeValue: - constantValue: - stringValue: '[]' - tune_feature_selection_rate: - runtimeValue: - constantValue: - stringValue: 'false' - worker_pool_specs_override_json: - componentInputParameter: pipelineparam--stage_1_tuner_worker_pool_specs_override - taskInfo: - name: automl-tabular-stage-1-tuner-2 - automl-tabular-transform-2: - cachingOptions: - enableCache: true - componentRef: - name: comp-automl-tabular-transform-2 - dependentTasks: - - write-bp-result-path - - write-bp-result-path-2 - inputs: - artifacts: - dataset_schema: - componentInputArtifact: pipelineparam--tabular-stats-and-example-gen-dataset_schema - eval_split: - taskOutputArtifact: - outputArtifactKey: result - producerTask: write-bp-result-path-2 - metadata: - componentInputArtifact: pipelineparam--tabular-stats-and-example-gen-metadata - test_split: - componentInputArtifact: pipelineparam--tabular-stats-and-example-gen-test_split - train_split: - taskOutputArtifact: - outputArtifactKey: result - producerTask: write-bp-result-path - parameters: - dataflow_disk_size_gb: - componentInputParameter: pipelineparam--transform_dataflow_disk_size_gb - dataflow_machine_type: - componentInputParameter: pipelineparam--transform_dataflow_machine_type - dataflow_max_num_workers: - componentInputParameter: pipelineparam--transform_dataflow_max_num_workers - dataflow_service_account: - componentInputParameter: pipelineparam--dataflow_service_account - dataflow_subnetwork: - runtimeValue: - constantValue: - stringValue: '' - dataflow_use_public_ips: - runtimeValue: - constantValue: - stringValue: 'true' - encryption_spec_key_name: - componentInputParameter: pipelineparam--encryption_spec_key_name - location: - componentInputParameter: pipelineparam--location - project: - componentInputParameter: pipelineparam--project - root_dir: - componentInputParameter: pipelineparam--root_dir - taskInfo: - name: automl-tabular-transform-2 - condition-is-evaluation-8: - componentRef: - name: comp-condition-is-evaluation-8 - dependentTasks: - - automl-tabular-ensemble-3 - - model-upload-4 - inputs: - artifacts: - pipelineparam--automl-tabular-ensemble-3-explanation_metadata_artifact: - taskOutputArtifact: - outputArtifactKey: explanation_metadata_artifact - producerTask: automl-tabular-ensemble-3 - pipelineparam--automl-tabular-ensemble-3-unmanaged_container_model: - taskOutputArtifact: - outputArtifactKey: unmanaged_container_model - producerTask: 
automl-tabular-ensemble-3 - pipelineparam--model-upload-4-model: - taskOutputArtifact: - outputArtifactKey: model - producerTask: model-upload-4 - parameters: - pipelineparam--automl-tabular-ensemble-3-explanation_parameters: - taskOutputParameter: - outputParameterKey: explanation_parameters - producerTask: automl-tabular-ensemble-3 - pipelineparam--bool-identity-2-Output: - componentInputParameter: pipelineparam--bool-identity-2-Output - pipelineparam--bool-identity-3-Output: - componentInputParameter: pipelineparam--bool-identity-3-Output - pipelineparam--dataflow_service_account: - componentInputParameter: pipelineparam--dataflow_service_account - pipelineparam--dataflow_subnetwork: - componentInputParameter: pipelineparam--dataflow_subnetwork - pipelineparam--dataflow_use_public_ips: - componentInputParameter: pipelineparam--dataflow_use_public_ips - pipelineparam--encryption_spec_key_name: - componentInputParameter: pipelineparam--encryption_spec_key_name - pipelineparam--evaluation_batch_predict_machine_type: - componentInputParameter: pipelineparam--evaluation_batch_predict_machine_type - pipelineparam--evaluation_batch_predict_max_replica_count: - componentInputParameter: pipelineparam--evaluation_batch_predict_max_replica_count - pipelineparam--evaluation_batch_predict_starting_replica_count: - componentInputParameter: pipelineparam--evaluation_batch_predict_starting_replica_count - pipelineparam--evaluation_dataflow_disk_size_gb: - componentInputParameter: pipelineparam--evaluation_dataflow_disk_size_gb - pipelineparam--evaluation_dataflow_machine_type: - componentInputParameter: pipelineparam--evaluation_dataflow_machine_type - pipelineparam--evaluation_dataflow_max_num_workers: - componentInputParameter: pipelineparam--evaluation_dataflow_max_num_workers - pipelineparam--location: - componentInputParameter: pipelineparam--location - pipelineparam--prediction_type: - componentInputParameter: pipelineparam--prediction_type - pipelineparam--project: - componentInputParameter: pipelineparam--project - pipelineparam--root_dir: - componentInputParameter: pipelineparam--root_dir - pipelineparam--string-not-empty-Output: - componentInputParameter: pipelineparam--string-not-empty-Output - pipelineparam--tabular-stats-and-example-gen-downsampled_test_split_json: - componentInputParameter: pipelineparam--tabular-stats-and-example-gen-downsampled_test_split_json - pipelineparam--tabular-stats-and-example-gen-test_split_json: - componentInputParameter: pipelineparam--tabular-stats-and-example-gen-test_split_json - pipelineparam--target_column: - componentInputParameter: pipelineparam--target_column - taskInfo: - name: condition-is-evaluation-8 - triggerPolicy: - condition: inputs.parameters['pipelineparam--bool-identity-2-Output'].string_value - == 'true' - model-batch-predict-3: - cachingOptions: - enableCache: true - componentRef: - name: comp-model-batch-predict-3 - dependentTasks: - - model-upload-3 - - read-input-uri - inputs: - artifacts: - model: - taskOutputArtifact: - outputArtifactKey: model - producerTask: model-upload-3 - parameters: - accelerator_count: - runtimeValue: - constantValue: - intValue: '0' - accelerator_type: - runtimeValue: - constantValue: - stringValue: '' - bigquery_destination_output_uri: - runtimeValue: - constantValue: - stringValue: '' - bigquery_source_input_uri: - runtimeValue: - constantValue: - stringValue: '' - encryption_spec_key_name: - componentInputParameter: pipelineparam--encryption_spec_key_name - explanation_metadata: - runtimeValue: - 
[Removed hunk (continues past this excerpt): the compiled, auto-generated Vertex AI AutoML Tabular pipeline specification (KFP YAML). The deleted block defines:
* the distillation batch-prediction tasks (`model-batch-predict-3`/`-4` over the train and eval splits, `model-upload-3`/`-4`, `read-input-uri`/`-2`, `set-model-can-skip-validation`, `write-bp-result-path`/`-2`);
* the evaluation condition DAGs `comp-condition-is-evaluation-3`, `-6` and `-8`, each wiring `model-batch-explanation`, `model-batch-predict`, `feature-attribution`, `model-evaluation` and `model-evaluation-import` tasks to the shared Dataflow and batch-predict parameters;
* the `comp-condition-no-distill-5` DAG (`model-upload-2` plus the nested `condition-is-evaluation-6` sub-DAG);
* the `comp-condition-stage-1-tuning-result-artifact-uri-empty-4` and `-not-empty-2` DAGs (`automl-tabular-stage-1-tuner`, `automl-tabular-cv-trainer`/`-2`, `automl-tabular-ensemble-2`, `automl-tabular-infra-validator-2`, `bool-identity-2`/`-3`, `calculate-training-parameters`/`-2`, and the `condition-is-distill-7` / `condition-no-distill-5` branches gated by `triggerPolicy` checks on the `bool-identity` outputs);
* the corresponding `pipelineparam--*` inputDefinitions and `system.Metrics` outputDefinitions for each DAG.]
pipelineparam--root_dir - single_run_max_secs: - taskOutputParameter: - outputParameterKey: stage_2_single_run_max_secs - producerTask: calculate-training-parameters - worker_pool_specs_override_json: - componentInputParameter: pipelineparam--cv_trainer_worker_pool_specs_override - taskInfo: - name: automl-tabular-cv-trainer - automl-tabular-ensemble: - cachingOptions: - enableCache: true - componentRef: - name: comp-automl-tabular-ensemble - dependentTasks: - - automl-tabular-cv-trainer - inputs: - artifacts: - dataset_schema: - componentInputArtifact: pipelineparam--tabular-stats-and-example-gen-dataset_schema - instance_baseline: - componentInputArtifact: pipelineparam--tabular-stats-and-example-gen-instance_baseline - metadata: - componentInputArtifact: pipelineparam--tabular-stats-and-example-gen-metadata - transform_output: - componentInputArtifact: pipelineparam--automl-tabular-transform-transform_output - tuning_result_input: - taskOutputArtifact: - outputArtifactKey: tuning_result_output - producerTask: automl-tabular-cv-trainer - warmup_data: - componentInputArtifact: pipelineparam--tabular-stats-and-example-gen-eval_split - parameters: - encryption_spec_key_name: - componentInputParameter: pipelineparam--encryption_spec_key_name - export_additional_model_without_custom_ops: - componentInputParameter: pipelineparam--export_additional_model_without_custom_ops - location: - componentInputParameter: pipelineparam--location - project: - componentInputParameter: pipelineparam--project - root_dir: - componentInputParameter: pipelineparam--root_dir - taskInfo: - name: automl-tabular-ensemble - automl-tabular-infra-validator: - cachingOptions: - enableCache: true - componentRef: - name: comp-automl-tabular-infra-validator - dependentTasks: - - automl-tabular-ensemble - inputs: - artifacts: - unmanaged_container_model: - taskOutputArtifact: - outputArtifactKey: unmanaged_container_model - producerTask: automl-tabular-ensemble - taskInfo: - name: automl-tabular-infra-validator - bool-identity: - cachingOptions: - enableCache: true - componentRef: - name: comp-bool-identity - inputs: - parameters: - value: - componentInputParameter: pipelineparam--run_evaluation - taskInfo: - name: bool-identity - calculate-training-parameters: - cachingOptions: - enableCache: true - componentRef: - name: comp-calculate-training-parameters - inputs: - parameters: - fast_testing: - componentInputParameter: pipelineparam--fast_testing - is_skip_architecture_search: - runtimeValue: - constantValue: - intValue: '1' - run_distillation: - componentInputParameter: pipelineparam--run_distillation - stage_1_num_parallel_trials: - componentInputParameter: pipelineparam--stage_1_num_parallel_trials - stage_2_num_parallel_trials: - componentInputParameter: pipelineparam--stage_2_num_parallel_trials - train_budget_milli_node_hours: - componentInputParameter: pipelineparam--train_budget_milli_node_hours - taskInfo: - name: calculate-training-parameters - condition-is-evaluation-3: - componentRef: - name: comp-condition-is-evaluation-3 - dependentTasks: - - automl-tabular-ensemble - - bool-identity - - model-upload - inputs: - artifacts: - pipelineparam--automl-tabular-ensemble-explanation_metadata_artifact: - taskOutputArtifact: - outputArtifactKey: explanation_metadata_artifact - producerTask: automl-tabular-ensemble - pipelineparam--automl-tabular-ensemble-unmanaged_container_model: - taskOutputArtifact: - outputArtifactKey: unmanaged_container_model - producerTask: automl-tabular-ensemble - 
pipelineparam--model-upload-model: - taskOutputArtifact: - outputArtifactKey: model - producerTask: model-upload - parameters: - pipelineparam--automl-tabular-ensemble-explanation_parameters: - taskOutputParameter: - outputParameterKey: explanation_parameters - producerTask: automl-tabular-ensemble - pipelineparam--bool-identity-Output: - taskOutputParameter: - outputParameterKey: Output - producerTask: bool-identity - pipelineparam--dataflow_service_account: - componentInputParameter: pipelineparam--dataflow_service_account - pipelineparam--dataflow_subnetwork: - componentInputParameter: pipelineparam--dataflow_subnetwork - pipelineparam--dataflow_use_public_ips: - componentInputParameter: pipelineparam--dataflow_use_public_ips - pipelineparam--encryption_spec_key_name: - componentInputParameter: pipelineparam--encryption_spec_key_name - pipelineparam--evaluation_batch_predict_machine_type: - componentInputParameter: pipelineparam--evaluation_batch_predict_machine_type - pipelineparam--evaluation_batch_predict_max_replica_count: - componentInputParameter: pipelineparam--evaluation_batch_predict_max_replica_count - pipelineparam--evaluation_batch_predict_starting_replica_count: - componentInputParameter: pipelineparam--evaluation_batch_predict_starting_replica_count - pipelineparam--evaluation_dataflow_disk_size_gb: - componentInputParameter: pipelineparam--evaluation_dataflow_disk_size_gb - pipelineparam--evaluation_dataflow_machine_type: - componentInputParameter: pipelineparam--evaluation_dataflow_machine_type - pipelineparam--evaluation_dataflow_max_num_workers: - componentInputParameter: pipelineparam--evaluation_dataflow_max_num_workers - pipelineparam--location: - componentInputParameter: pipelineparam--location - pipelineparam--prediction_type: - componentInputParameter: pipelineparam--prediction_type - pipelineparam--project: - componentInputParameter: pipelineparam--project - pipelineparam--root_dir: - componentInputParameter: pipelineparam--root_dir - pipelineparam--string-not-empty-Output: - componentInputParameter: pipelineparam--string-not-empty-Output - pipelineparam--tabular-stats-and-example-gen-downsampled_test_split_json: - componentInputParameter: pipelineparam--tabular-stats-and-example-gen-downsampled_test_split_json - pipelineparam--tabular-stats-and-example-gen-test_split_json: - componentInputParameter: pipelineparam--tabular-stats-and-example-gen-test_split_json - pipelineparam--target_column: - componentInputParameter: pipelineparam--target_column - taskInfo: - name: condition-is-evaluation-3 - triggerPolicy: - condition: inputs.parameters['pipelineparam--bool-identity-Output'].string_value - == 'true' - importer: - cachingOptions: - enableCache: true - componentRef: - name: comp-importer - inputs: - parameters: - uri: - componentInputParameter: pipelineparam--stage_1_tuning_result_artifact_uri - taskInfo: - name: importer - model-upload: - cachingOptions: - enableCache: true - componentRef: - name: comp-model-upload - dependentTasks: - - automl-tabular-ensemble - inputs: - artifacts: - unmanaged_container_model: - taskOutputArtifact: - outputArtifactKey: unmanaged_container_model - producerTask: automl-tabular-ensemble - parameters: - description: - runtimeValue: - constantValue: - stringValue: '' - display_name: - runtimeValue: - constantValue: - stringValue: automl-tabular-model-upload-{{$.pipeline_job_uuid}}-{{$.pipeline_task_uuid}} - encryption_spec_key_name: - componentInputParameter: pipelineparam--encryption_spec_key_name - explanation_metadata: - 
runtimeValue: - constantValue: - stringValue: '{}' - explanation_parameters: - runtimeValue: - constantValue: - stringValue: '{}' - labels: - runtimeValue: - constantValue: - stringValue: '{}' - location: - componentInputParameter: pipelineparam--location - project: - componentInputParameter: pipelineparam--project - taskInfo: - name: model-upload - inputDefinitions: - artifacts: - pipelineparam--automl-tabular-transform-transform_output: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - pipelineparam--merge-materialized-splits-splits: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - pipelineparam--tabular-stats-and-example-gen-dataset_schema: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - pipelineparam--tabular-stats-and-example-gen-eval_split: - artifactType: - schemaTitle: system.Dataset - schemaVersion: 0.0.1 - pipelineparam--tabular-stats-and-example-gen-instance_baseline: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - pipelineparam--tabular-stats-and-example-gen-metadata: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - pipelineparam--cv_trainer_worker_pool_specs_override: - type: STRING - pipelineparam--dataflow_service_account: - type: STRING - pipelineparam--dataflow_subnetwork: - type: STRING - pipelineparam--dataflow_use_public_ips: - type: STRING - pipelineparam--encryption_spec_key_name: - type: STRING - pipelineparam--evaluation_batch_predict_machine_type: - type: STRING - pipelineparam--evaluation_batch_predict_max_replica_count: - type: INT - pipelineparam--evaluation_batch_predict_starting_replica_count: - type: INT - pipelineparam--evaluation_dataflow_disk_size_gb: - type: INT - pipelineparam--evaluation_dataflow_machine_type: - type: STRING - pipelineparam--evaluation_dataflow_max_num_workers: - type: INT - pipelineparam--export_additional_model_without_custom_ops: - type: STRING - pipelineparam--fast_testing: - type: STRING - pipelineparam--location: - type: STRING - pipelineparam--prediction_type: - type: STRING - pipelineparam--project: - type: STRING - pipelineparam--root_dir: - type: STRING - pipelineparam--run_distillation: - type: STRING - pipelineparam--run_evaluation: - type: STRING - pipelineparam--stage_1_num_parallel_trials: - type: INT - pipelineparam--stage_1_tuning_result_artifact_uri: - type: STRING - pipelineparam--stage_2_num_parallel_trials: - type: INT - pipelineparam--stage_2_num_selected_trials: - type: INT - pipelineparam--string-not-empty-Output: - type: STRING - pipelineparam--tabular-stats-and-example-gen-downsampled_test_split_json: - type: STRING - pipelineparam--tabular-stats-and-example-gen-test_split_json: - type: STRING - pipelineparam--target_column: - type: STRING - pipelineparam--train_budget_milli_node_hours: - type: DOUBLE - outputDefinitions: - artifacts: - feature-attribution-feature_attributions: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - model-evaluation-evaluation_metrics: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - comp-exit-handler-1: - dag: - outputs: - artifacts: - feature-attribution-2-feature_attributions: - artifactSelectors: - - outputArtifactKey: feature-attribution-2-feature_attributions - producerSubtask: condition-stage-1-tuning-result-artifact-uri-empty-4 - feature-attribution-3-feature_attributions: - artifactSelectors: - - outputArtifactKey: feature-attribution-3-feature_attributions - producerSubtask: 
condition-stage-1-tuning-result-artifact-uri-empty-4 - feature-attribution-feature_attributions: - artifactSelectors: - - outputArtifactKey: feature-attribution-feature_attributions - producerSubtask: condition-stage-1-tuning-result-artifact-uri-not-empty-2 - model-evaluation-2-evaluation_metrics: - artifactSelectors: - - outputArtifactKey: model-evaluation-2-evaluation_metrics - producerSubtask: condition-stage-1-tuning-result-artifact-uri-empty-4 - model-evaluation-3-evaluation_metrics: - artifactSelectors: - - outputArtifactKey: model-evaluation-3-evaluation_metrics - producerSubtask: condition-stage-1-tuning-result-artifact-uri-empty-4 - model-evaluation-evaluation_metrics: - artifactSelectors: - - outputArtifactKey: model-evaluation-evaluation_metrics - producerSubtask: condition-stage-1-tuning-result-artifact-uri-not-empty-2 - tasks: - automl-tabular-transform: - cachingOptions: - enableCache: true - componentRef: - name: comp-automl-tabular-transform - dependentTasks: - - tabular-stats-and-example-gen - inputs: - artifacts: - dataset_schema: - taskOutputArtifact: - outputArtifactKey: dataset_schema - producerTask: tabular-stats-and-example-gen - eval_split: - taskOutputArtifact: - outputArtifactKey: eval_split - producerTask: tabular-stats-and-example-gen - metadata: - taskOutputArtifact: - outputArtifactKey: metadata - producerTask: tabular-stats-and-example-gen - test_split: - taskOutputArtifact: - outputArtifactKey: test_split - producerTask: tabular-stats-and-example-gen - train_split: - taskOutputArtifact: - outputArtifactKey: train_split - producerTask: tabular-stats-and-example-gen - parameters: - dataflow_disk_size_gb: - componentInputParameter: pipelineparam--transform_dataflow_disk_size_gb - dataflow_machine_type: - componentInputParameter: pipelineparam--transform_dataflow_machine_type - dataflow_max_num_workers: - componentInputParameter: pipelineparam--transform_dataflow_max_num_workers - dataflow_service_account: - componentInputParameter: pipelineparam--dataflow_service_account - dataflow_subnetwork: - componentInputParameter: pipelineparam--dataflow_subnetwork - dataflow_use_public_ips: - componentInputParameter: pipelineparam--dataflow_use_public_ips - encryption_spec_key_name: - componentInputParameter: pipelineparam--encryption_spec_key_name - location: - componentInputParameter: pipelineparam--location - project: - componentInputParameter: pipelineparam--project - root_dir: - componentInputParameter: pipelineparam--root_dir - taskInfo: - name: automl-tabular-transform - condition-stage-1-tuning-result-artifact-uri-empty-4: - componentRef: - name: comp-condition-stage-1-tuning-result-artifact-uri-empty-4 - dependentTasks: - - automl-tabular-transform - - merge-materialized-splits - - string-not-empty - - tabular-stats-and-example-gen - inputs: - artifacts: - pipelineparam--automl-tabular-transform-materialized_eval_split: - taskOutputArtifact: - outputArtifactKey: materialized_eval_split - producerTask: automl-tabular-transform - pipelineparam--automl-tabular-transform-materialized_train_split: - taskOutputArtifact: - outputArtifactKey: materialized_train_split - producerTask: automl-tabular-transform - pipelineparam--automl-tabular-transform-transform_output: - taskOutputArtifact: - outputArtifactKey: transform_output - producerTask: automl-tabular-transform - pipelineparam--merge-materialized-splits-splits: - taskOutputArtifact: - outputArtifactKey: splits - producerTask: merge-materialized-splits - pipelineparam--tabular-stats-and-example-gen-dataset_schema: - 
taskOutputArtifact: - outputArtifactKey: dataset_schema - producerTask: tabular-stats-and-example-gen - pipelineparam--tabular-stats-and-example-gen-eval_split: - taskOutputArtifact: - outputArtifactKey: eval_split - producerTask: tabular-stats-and-example-gen - pipelineparam--tabular-stats-and-example-gen-instance_baseline: - taskOutputArtifact: - outputArtifactKey: instance_baseline - producerTask: tabular-stats-and-example-gen - pipelineparam--tabular-stats-and-example-gen-metadata: - taskOutputArtifact: - outputArtifactKey: metadata - producerTask: tabular-stats-and-example-gen - pipelineparam--tabular-stats-and-example-gen-test_split: - taskOutputArtifact: - outputArtifactKey: test_split - producerTask: tabular-stats-and-example-gen - pipelineparam--tabular-stats-and-example-gen-train_split: - taskOutputArtifact: - outputArtifactKey: train_split - producerTask: tabular-stats-and-example-gen - parameters: - pipelineparam--cv_trainer_worker_pool_specs_override: - componentInputParameter: pipelineparam--cv_trainer_worker_pool_specs_override - pipelineparam--dataflow_service_account: - componentInputParameter: pipelineparam--dataflow_service_account - pipelineparam--dataflow_subnetwork: - componentInputParameter: pipelineparam--dataflow_subnetwork - pipelineparam--dataflow_use_public_ips: - componentInputParameter: pipelineparam--dataflow_use_public_ips - pipelineparam--disable_early_stopping: - componentInputParameter: pipelineparam--disable_early_stopping - pipelineparam--distill_batch_predict_machine_type: - componentInputParameter: pipelineparam--distill_batch_predict_machine_type - pipelineparam--distill_batch_predict_max_replica_count: - componentInputParameter: pipelineparam--distill_batch_predict_max_replica_count - pipelineparam--distill_batch_predict_starting_replica_count: - componentInputParameter: pipelineparam--distill_batch_predict_starting_replica_count - pipelineparam--encryption_spec_key_name: - componentInputParameter: pipelineparam--encryption_spec_key_name - pipelineparam--evaluation_batch_predict_machine_type: - componentInputParameter: pipelineparam--evaluation_batch_predict_machine_type - pipelineparam--evaluation_batch_predict_max_replica_count: - componentInputParameter: pipelineparam--evaluation_batch_predict_max_replica_count - pipelineparam--evaluation_batch_predict_starting_replica_count: - componentInputParameter: pipelineparam--evaluation_batch_predict_starting_replica_count - pipelineparam--evaluation_dataflow_disk_size_gb: - componentInputParameter: pipelineparam--evaluation_dataflow_disk_size_gb - pipelineparam--evaluation_dataflow_machine_type: - componentInputParameter: pipelineparam--evaluation_dataflow_machine_type - pipelineparam--evaluation_dataflow_max_num_workers: - componentInputParameter: pipelineparam--evaluation_dataflow_max_num_workers - pipelineparam--export_additional_model_without_custom_ops: - componentInputParameter: pipelineparam--export_additional_model_without_custom_ops - pipelineparam--fast_testing: - componentInputParameter: pipelineparam--fast_testing - pipelineparam--location: - componentInputParameter: pipelineparam--location - pipelineparam--prediction_type: - componentInputParameter: pipelineparam--prediction_type - pipelineparam--project: - componentInputParameter: pipelineparam--project - pipelineparam--root_dir: - componentInputParameter: pipelineparam--root_dir - pipelineparam--run_distillation: - componentInputParameter: pipelineparam--run_distillation - pipelineparam--run_evaluation: - componentInputParameter: 
pipelineparam--run_evaluation - pipelineparam--stage_1_num_parallel_trials: - componentInputParameter: pipelineparam--stage_1_num_parallel_trials - pipelineparam--stage_1_tuner_worker_pool_specs_override: - componentInputParameter: pipelineparam--stage_1_tuner_worker_pool_specs_override - pipelineparam--stage_2_num_parallel_trials: - componentInputParameter: pipelineparam--stage_2_num_parallel_trials - pipelineparam--stage_2_num_selected_trials: - componentInputParameter: pipelineparam--stage_2_num_selected_trials - pipelineparam--string-not-empty-Output: - taskOutputParameter: - outputParameterKey: Output - producerTask: string-not-empty - pipelineparam--study_spec_parameters_override: - componentInputParameter: pipelineparam--study_spec_parameters_override - pipelineparam--tabular-stats-and-example-gen-downsampled_test_split_json: - taskOutputParameter: - outputParameterKey: downsampled_test_split_json - producerTask: tabular-stats-and-example-gen - pipelineparam--tabular-stats-and-example-gen-test_split_json: - taskOutputParameter: - outputParameterKey: test_split_json - producerTask: tabular-stats-and-example-gen - pipelineparam--target_column: - componentInputParameter: pipelineparam--target_column - pipelineparam--train_budget_milli_node_hours: - componentInputParameter: pipelineparam--train_budget_milli_node_hours - pipelineparam--transform_dataflow_disk_size_gb: - componentInputParameter: pipelineparam--transform_dataflow_disk_size_gb - pipelineparam--transform_dataflow_machine_type: - componentInputParameter: pipelineparam--transform_dataflow_machine_type - pipelineparam--transform_dataflow_max_num_workers: - componentInputParameter: pipelineparam--transform_dataflow_max_num_workers - taskInfo: - name: condition-stage-1-tuning-result-artifact-uri-empty-4 - triggerPolicy: - condition: inputs.parameters['pipelineparam--string-not-empty-Output'].string_value - == 'false' - condition-stage-1-tuning-result-artifact-uri-not-empty-2: - componentRef: - name: comp-condition-stage-1-tuning-result-artifact-uri-not-empty-2 - dependentTasks: - - automl-tabular-transform - - merge-materialized-splits - - string-not-empty - - tabular-stats-and-example-gen - inputs: - artifacts: - pipelineparam--automl-tabular-transform-transform_output: - taskOutputArtifact: - outputArtifactKey: transform_output - producerTask: automl-tabular-transform - pipelineparam--merge-materialized-splits-splits: - taskOutputArtifact: - outputArtifactKey: splits - producerTask: merge-materialized-splits - pipelineparam--tabular-stats-and-example-gen-dataset_schema: - taskOutputArtifact: - outputArtifactKey: dataset_schema - producerTask: tabular-stats-and-example-gen - pipelineparam--tabular-stats-and-example-gen-eval_split: - taskOutputArtifact: - outputArtifactKey: eval_split - producerTask: tabular-stats-and-example-gen - pipelineparam--tabular-stats-and-example-gen-instance_baseline: - taskOutputArtifact: - outputArtifactKey: instance_baseline - producerTask: tabular-stats-and-example-gen - pipelineparam--tabular-stats-and-example-gen-metadata: - taskOutputArtifact: - outputArtifactKey: metadata - producerTask: tabular-stats-and-example-gen - parameters: - pipelineparam--cv_trainer_worker_pool_specs_override: - componentInputParameter: pipelineparam--cv_trainer_worker_pool_specs_override - pipelineparam--dataflow_service_account: - componentInputParameter: pipelineparam--dataflow_service_account - pipelineparam--dataflow_subnetwork: - componentInputParameter: pipelineparam--dataflow_subnetwork - 
pipelineparam--dataflow_use_public_ips: - componentInputParameter: pipelineparam--dataflow_use_public_ips - pipelineparam--encryption_spec_key_name: - componentInputParameter: pipelineparam--encryption_spec_key_name - pipelineparam--evaluation_batch_predict_machine_type: - componentInputParameter: pipelineparam--evaluation_batch_predict_machine_type - pipelineparam--evaluation_batch_predict_max_replica_count: - componentInputParameter: pipelineparam--evaluation_batch_predict_max_replica_count - pipelineparam--evaluation_batch_predict_starting_replica_count: - componentInputParameter: pipelineparam--evaluation_batch_predict_starting_replica_count - pipelineparam--evaluation_dataflow_disk_size_gb: - componentInputParameter: pipelineparam--evaluation_dataflow_disk_size_gb - pipelineparam--evaluation_dataflow_machine_type: - componentInputParameter: pipelineparam--evaluation_dataflow_machine_type - pipelineparam--evaluation_dataflow_max_num_workers: - componentInputParameter: pipelineparam--evaluation_dataflow_max_num_workers - pipelineparam--export_additional_model_without_custom_ops: - componentInputParameter: pipelineparam--export_additional_model_without_custom_ops - pipelineparam--fast_testing: - componentInputParameter: pipelineparam--fast_testing - pipelineparam--location: - componentInputParameter: pipelineparam--location - pipelineparam--prediction_type: - componentInputParameter: pipelineparam--prediction_type - pipelineparam--project: - componentInputParameter: pipelineparam--project - pipelineparam--root_dir: - componentInputParameter: pipelineparam--root_dir - pipelineparam--run_distillation: - componentInputParameter: pipelineparam--run_distillation - pipelineparam--run_evaluation: - componentInputParameter: pipelineparam--run_evaluation - pipelineparam--stage_1_num_parallel_trials: - componentInputParameter: pipelineparam--stage_1_num_parallel_trials - pipelineparam--stage_1_tuning_result_artifact_uri: - componentInputParameter: pipelineparam--stage_1_tuning_result_artifact_uri - pipelineparam--stage_2_num_parallel_trials: - componentInputParameter: pipelineparam--stage_2_num_parallel_trials - pipelineparam--stage_2_num_selected_trials: - componentInputParameter: pipelineparam--stage_2_num_selected_trials - pipelineparam--string-not-empty-Output: - taskOutputParameter: - outputParameterKey: Output - producerTask: string-not-empty - pipelineparam--tabular-stats-and-example-gen-downsampled_test_split_json: - taskOutputParameter: - outputParameterKey: downsampled_test_split_json - producerTask: tabular-stats-and-example-gen - pipelineparam--tabular-stats-and-example-gen-test_split_json: - taskOutputParameter: - outputParameterKey: test_split_json - producerTask: tabular-stats-and-example-gen - pipelineparam--target_column: - componentInputParameter: pipelineparam--target_column - pipelineparam--train_budget_milli_node_hours: - componentInputParameter: pipelineparam--train_budget_milli_node_hours - taskInfo: - name: condition-stage-1-tuning-result-artifact-uri-not-empty-2 - triggerPolicy: - condition: inputs.parameters['pipelineparam--string-not-empty-Output'].string_value - == 'true' - merge-materialized-splits: - cachingOptions: - enableCache: true - componentRef: - name: comp-merge-materialized-splits - dependentTasks: - - automl-tabular-transform - inputs: - artifacts: - split_0: - taskOutputArtifact: - outputArtifactKey: materialized_train_split - producerTask: automl-tabular-transform - split_1: - taskOutputArtifact: - outputArtifactKey: materialized_eval_split - producerTask: 
automl-tabular-transform - taskInfo: - name: merge-materialized-splits - string-not-empty: - cachingOptions: - enableCache: true - componentRef: - name: comp-string-not-empty - inputs: - parameters: - value: - componentInputParameter: pipelineparam--stage_1_tuning_result_artifact_uri - taskInfo: - name: string-not-empty - tabular-stats-and-example-gen: - cachingOptions: - enableCache: true - componentRef: - name: comp-tabular-stats-and-example-gen - inputs: - parameters: - additional_experiments: - runtimeValue: - constantValue: - stringValue: '' - additional_experiments_json: - componentInputParameter: pipelineparam--additional_experiments - data_source_bigquery_table_path: - componentInputParameter: pipelineparam--data_source_bigquery_table_path - data_source_csv_filenames: - componentInputParameter: pipelineparam--data_source_csv_filenames - dataflow_disk_size_gb: - componentInputParameter: pipelineparam--stats_and_example_gen_dataflow_disk_size_gb - dataflow_machine_type: - componentInputParameter: pipelineparam--stats_and_example_gen_dataflow_machine_type - dataflow_max_num_workers: - componentInputParameter: pipelineparam--stats_and_example_gen_dataflow_max_num_workers - dataflow_service_account: - componentInputParameter: pipelineparam--dataflow_service_account - dataflow_subnetwork: - componentInputParameter: pipelineparam--dataflow_subnetwork - dataflow_use_public_ips: - componentInputParameter: pipelineparam--dataflow_use_public_ips - enable_probabilistic_inference: - componentInputParameter: pipelineparam--enable_probabilistic_inference - encryption_spec_key_name: - componentInputParameter: pipelineparam--encryption_spec_key_name - location: - componentInputParameter: pipelineparam--location - optimization_objective: - componentInputParameter: pipelineparam--optimization_objective - optimization_objective_precision_value: - componentInputParameter: pipelineparam--optimization_objective_precision_value - optimization_objective_recall_value: - componentInputParameter: pipelineparam--optimization_objective_recall_value - predefined_split_key: - componentInputParameter: pipelineparam--predefined_split_key - prediction_type: - componentInputParameter: pipelineparam--prediction_type - project: - componentInputParameter: pipelineparam--project - quantiles: - componentInputParameter: pipelineparam--quantiles - request_type: - runtimeValue: - constantValue: - stringValue: COLUMN_STATS_ONLY - root_dir: - componentInputParameter: pipelineparam--root_dir - run_distillation: - componentInputParameter: pipelineparam--run_distillation - stratified_split_key: - componentInputParameter: pipelineparam--stratified_split_key - target_column_name: - componentInputParameter: pipelineparam--target_column - test_fraction: - componentInputParameter: pipelineparam--test_fraction - timestamp_split_key: - componentInputParameter: pipelineparam--timestamp_split_key - training_fraction: - componentInputParameter: pipelineparam--training_fraction - transformations: - runtimeValue: - constantValue: - stringValue: '[]' - transformations_path: - componentInputParameter: pipelineparam--transformations - validation_fraction: - componentInputParameter: pipelineparam--validation_fraction - weight_column_name: - componentInputParameter: pipelineparam--weight_column - taskInfo: - name: tabular-stats-and-example-gen - inputDefinitions: - parameters: - pipelineparam--additional_experiments: - type: STRING - pipelineparam--cv_trainer_worker_pool_specs_override: - type: STRING - 
pipelineparam--data_source_bigquery_table_path: - type: STRING - pipelineparam--data_source_csv_filenames: - type: STRING - pipelineparam--dataflow_service_account: - type: STRING - pipelineparam--dataflow_subnetwork: - type: STRING - pipelineparam--dataflow_use_public_ips: - type: STRING - pipelineparam--disable_early_stopping: - type: STRING - pipelineparam--distill_batch_predict_machine_type: - type: STRING - pipelineparam--distill_batch_predict_max_replica_count: - type: INT - pipelineparam--distill_batch_predict_starting_replica_count: - type: INT - pipelineparam--enable_probabilistic_inference: - type: STRING - pipelineparam--encryption_spec_key_name: - type: STRING - pipelineparam--evaluation_batch_predict_machine_type: - type: STRING - pipelineparam--evaluation_batch_predict_max_replica_count: - type: INT - pipelineparam--evaluation_batch_predict_starting_replica_count: - type: INT - pipelineparam--evaluation_dataflow_disk_size_gb: - type: INT - pipelineparam--evaluation_dataflow_machine_type: - type: STRING - pipelineparam--evaluation_dataflow_max_num_workers: - type: INT - pipelineparam--export_additional_model_without_custom_ops: - type: STRING - pipelineparam--fast_testing: - type: STRING - pipelineparam--location: - type: STRING - pipelineparam--optimization_objective: - type: STRING - pipelineparam--optimization_objective_precision_value: - type: DOUBLE - pipelineparam--optimization_objective_recall_value: - type: DOUBLE - pipelineparam--predefined_split_key: - type: STRING - pipelineparam--prediction_type: - type: STRING - pipelineparam--project: - type: STRING - pipelineparam--quantiles: - type: STRING - pipelineparam--root_dir: - type: STRING - pipelineparam--run_distillation: - type: STRING - pipelineparam--run_evaluation: - type: STRING - pipelineparam--stage_1_num_parallel_trials: - type: INT - pipelineparam--stage_1_tuner_worker_pool_specs_override: - type: STRING - pipelineparam--stage_1_tuning_result_artifact_uri: - type: STRING - pipelineparam--stage_2_num_parallel_trials: - type: INT - pipelineparam--stage_2_num_selected_trials: - type: INT - pipelineparam--stats_and_example_gen_dataflow_disk_size_gb: - type: INT - pipelineparam--stats_and_example_gen_dataflow_machine_type: - type: STRING - pipelineparam--stats_and_example_gen_dataflow_max_num_workers: - type: INT - pipelineparam--stratified_split_key: - type: STRING - pipelineparam--study_spec_parameters_override: - type: STRING - pipelineparam--target_column: - type: STRING - pipelineparam--test_fraction: - type: DOUBLE - pipelineparam--timestamp_split_key: - type: STRING - pipelineparam--train_budget_milli_node_hours: - type: DOUBLE - pipelineparam--training_fraction: - type: DOUBLE - pipelineparam--transform_dataflow_disk_size_gb: - type: INT - pipelineparam--transform_dataflow_machine_type: - type: STRING - pipelineparam--transform_dataflow_max_num_workers: - type: INT - pipelineparam--transformations: - type: STRING - pipelineparam--validation_fraction: - type: DOUBLE - pipelineparam--weight_column: - type: STRING - outputDefinitions: - artifacts: - feature-attribution-2-feature_attributions: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - feature-attribution-3-feature_attributions: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - feature-attribution-feature_attributions: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - model-evaluation-2-evaluation_metrics: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - 
model-evaluation-3-evaluation_metrics: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - model-evaluation-evaluation_metrics: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - comp-feature-attribution: - executorLabel: exec-feature-attribution - inputDefinitions: - artifacts: - predictions_gcs_source: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - dataflow_disk_size: - type: INT - dataflow_machine_type: - type: STRING - dataflow_max_workers_num: - type: INT - dataflow_service_account: - type: STRING - dataflow_subnetwork: - type: STRING - dataflow_use_public_ips: - type: STRING - dataflow_workers_num: - type: INT - encryption_spec_key_name: - type: STRING - location: - type: STRING - predictions_format: - type: STRING - project: - type: STRING - root_dir: - type: STRING - outputDefinitions: - artifacts: - feature_attributions: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - parameters: - gcp_resources: - type: STRING - comp-feature-attribution-2: - executorLabel: exec-feature-attribution-2 - inputDefinitions: - artifacts: - predictions_gcs_source: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - dataflow_disk_size: - type: INT - dataflow_machine_type: - type: STRING - dataflow_max_workers_num: - type: INT - dataflow_service_account: - type: STRING - dataflow_subnetwork: - type: STRING - dataflow_use_public_ips: - type: STRING - dataflow_workers_num: - type: INT - encryption_spec_key_name: - type: STRING - location: - type: STRING - predictions_format: - type: STRING - project: - type: STRING - root_dir: - type: STRING - outputDefinitions: - artifacts: - feature_attributions: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - parameters: - gcp_resources: - type: STRING - comp-feature-attribution-3: - executorLabel: exec-feature-attribution-3 - inputDefinitions: - artifacts: - predictions_gcs_source: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - dataflow_disk_size: - type: INT - dataflow_machine_type: - type: STRING - dataflow_max_workers_num: - type: INT - dataflow_service_account: - type: STRING - dataflow_subnetwork: - type: STRING - dataflow_use_public_ips: - type: STRING - dataflow_workers_num: - type: INT - encryption_spec_key_name: - type: STRING - location: - type: STRING - predictions_format: - type: STRING - project: - type: STRING - root_dir: - type: STRING - outputDefinitions: - artifacts: - feature_attributions: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - parameters: - gcp_resources: - type: STRING - comp-importer: - executorLabel: exec-importer - inputDefinitions: - parameters: - uri: - type: STRING - outputDefinitions: - artifacts: - artifact: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - comp-merge-materialized-splits: - executorLabel: exec-merge-materialized-splits - inputDefinitions: - artifacts: - split_0: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - split_1: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - outputDefinitions: - artifacts: - splits: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - comp-model-batch-explanation: - executorLabel: exec-model-batch-explanation - inputDefinitions: - artifacts: - explanation_metadata_artifact: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - unmanaged_container_model: - artifactType: - 
schemaTitle: google.UnmanagedContainerModel - schemaVersion: 0.0.1 - parameters: - accelerator_count: - type: INT - accelerator_type: - type: STRING - bigquery_destination_output_uri: - type: STRING - bigquery_source_input_uri: - type: STRING - encryption_spec_key_name: - type: STRING - explanation_metadata: - type: STRING - explanation_parameters: - type: STRING - gcs_destination_output_uri_prefix: - type: STRING - gcs_source_uris: - type: STRING - generate_explanation: - type: STRING - instances_format: - type: STRING - job_display_name: - type: STRING - labels: - type: STRING - location: - type: STRING - machine_type: - type: STRING - manual_batch_tuning_parameters_batch_size: - type: INT - max_replica_count: - type: INT - model_parameters: - type: STRING - predictions_format: - type: STRING - project: - type: STRING - starting_replica_count: - type: INT - outputDefinitions: - artifacts: - batchpredictionjob: - artifactType: - schemaTitle: google.VertexBatchPredictionJob - schemaVersion: 0.0.1 - bigquery_output_table: - artifactType: - schemaTitle: google.BQTable - schemaVersion: 0.0.1 - gcs_output_directory: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - gcp_resources: - type: STRING - comp-model-batch-explanation-2: - executorLabel: exec-model-batch-explanation-2 - inputDefinitions: - artifacts: - explanation_metadata_artifact: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - unmanaged_container_model: - artifactType: - schemaTitle: google.UnmanagedContainerModel - schemaVersion: 0.0.1 - parameters: - accelerator_count: - type: INT - accelerator_type: - type: STRING - bigquery_destination_output_uri: - type: STRING - bigquery_source_input_uri: - type: STRING - encryption_spec_key_name: - type: STRING - explanation_metadata: - type: STRING - explanation_parameters: - type: STRING - gcs_destination_output_uri_prefix: - type: STRING - gcs_source_uris: - type: STRING - generate_explanation: - type: STRING - instances_format: - type: STRING - job_display_name: - type: STRING - labels: - type: STRING - location: - type: STRING - machine_type: - type: STRING - manual_batch_tuning_parameters_batch_size: - type: INT - max_replica_count: - type: INT - model_parameters: - type: STRING - predictions_format: - type: STRING - project: - type: STRING - starting_replica_count: - type: INT - outputDefinitions: - artifacts: - batchpredictionjob: - artifactType: - schemaTitle: google.VertexBatchPredictionJob - schemaVersion: 0.0.1 - bigquery_output_table: - artifactType: - schemaTitle: google.BQTable - schemaVersion: 0.0.1 - gcs_output_directory: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - gcp_resources: - type: STRING - comp-model-batch-explanation-3: - executorLabel: exec-model-batch-explanation-3 - inputDefinitions: - artifacts: - explanation_metadata_artifact: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - unmanaged_container_model: - artifactType: - schemaTitle: google.UnmanagedContainerModel - schemaVersion: 0.0.1 - parameters: - accelerator_count: - type: INT - accelerator_type: - type: STRING - bigquery_destination_output_uri: - type: STRING - bigquery_source_input_uri: - type: STRING - encryption_spec_key_name: - type: STRING - explanation_metadata: - type: STRING - explanation_parameters: - type: STRING - gcs_destination_output_uri_prefix: - type: STRING - gcs_source_uris: - type: STRING - generate_explanation: - type: STRING - instances_format: - type: STRING - 
job_display_name: - type: STRING - labels: - type: STRING - location: - type: STRING - machine_type: - type: STRING - manual_batch_tuning_parameters_batch_size: - type: INT - max_replica_count: - type: INT - model_parameters: - type: STRING - predictions_format: - type: STRING - project: - type: STRING - starting_replica_count: - type: INT - outputDefinitions: - artifacts: - batchpredictionjob: - artifactType: - schemaTitle: google.VertexBatchPredictionJob - schemaVersion: 0.0.1 - bigquery_output_table: - artifactType: - schemaTitle: google.BQTable - schemaVersion: 0.0.1 - gcs_output_directory: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - gcp_resources: - type: STRING - comp-model-batch-predict: - executorLabel: exec-model-batch-predict - inputDefinitions: - artifacts: - unmanaged_container_model: - artifactType: - schemaTitle: google.UnmanagedContainerModel - schemaVersion: 0.0.1 - parameters: - accelerator_count: - type: INT - accelerator_type: - type: STRING - bigquery_destination_output_uri: - type: STRING - bigquery_source_input_uri: - type: STRING - encryption_spec_key_name: - type: STRING - explanation_metadata: - type: STRING - explanation_parameters: - type: STRING - gcs_destination_output_uri_prefix: - type: STRING - gcs_source_uris: - type: STRING - generate_explanation: - type: STRING - instances_format: - type: STRING - job_display_name: - type: STRING - labels: - type: STRING - location: - type: STRING - machine_type: - type: STRING - manual_batch_tuning_parameters_batch_size: - type: INT - max_replica_count: - type: INT - model_parameters: - type: STRING - predictions_format: - type: STRING - project: - type: STRING - starting_replica_count: - type: INT - outputDefinitions: - artifacts: - batchpredictionjob: - artifactType: - schemaTitle: google.VertexBatchPredictionJob - schemaVersion: 0.0.1 - bigquery_output_table: - artifactType: - schemaTitle: google.BQTable - schemaVersion: 0.0.1 - gcs_output_directory: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - gcp_resources: - type: STRING - comp-model-batch-predict-2: - executorLabel: exec-model-batch-predict-2 - inputDefinitions: - artifacts: - unmanaged_container_model: - artifactType: - schemaTitle: google.UnmanagedContainerModel - schemaVersion: 0.0.1 - parameters: - accelerator_count: - type: INT - accelerator_type: - type: STRING - bigquery_destination_output_uri: - type: STRING - bigquery_source_input_uri: - type: STRING - encryption_spec_key_name: - type: STRING - explanation_metadata: - type: STRING - explanation_parameters: - type: STRING - gcs_destination_output_uri_prefix: - type: STRING - gcs_source_uris: - type: STRING - generate_explanation: - type: STRING - instances_format: - type: STRING - job_display_name: - type: STRING - labels: - type: STRING - location: - type: STRING - machine_type: - type: STRING - manual_batch_tuning_parameters_batch_size: - type: INT - max_replica_count: - type: INT - model_parameters: - type: STRING - predictions_format: - type: STRING - project: - type: STRING - starting_replica_count: - type: INT - outputDefinitions: - artifacts: - batchpredictionjob: - artifactType: - schemaTitle: google.VertexBatchPredictionJob - schemaVersion: 0.0.1 - bigquery_output_table: - artifactType: - schemaTitle: google.BQTable - schemaVersion: 0.0.1 - gcs_output_directory: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - gcp_resources: - type: STRING - comp-model-batch-predict-3: - 
executorLabel: exec-model-batch-predict-3 - inputDefinitions: - artifacts: - model: - artifactType: - schemaTitle: google.VertexModel - schemaVersion: 0.0.1 - parameters: - accelerator_count: - type: INT - accelerator_type: - type: STRING - bigquery_destination_output_uri: - type: STRING - bigquery_source_input_uri: - type: STRING - encryption_spec_key_name: - type: STRING - explanation_metadata: - type: STRING - explanation_parameters: - type: STRING - gcs_destination_output_uri_prefix: - type: STRING - gcs_source_uris: - type: STRING - generate_explanation: - type: STRING - instances_format: - type: STRING - job_display_name: - type: STRING - labels: - type: STRING - location: - type: STRING - machine_type: - type: STRING - manual_batch_tuning_parameters_batch_size: - type: INT - max_replica_count: - type: INT - model_parameters: - type: STRING - predictions_format: - type: STRING - project: - type: STRING - starting_replica_count: - type: INT - outputDefinitions: - artifacts: - batchpredictionjob: - artifactType: - schemaTitle: google.VertexBatchPredictionJob - schemaVersion: 0.0.1 - bigquery_output_table: - artifactType: - schemaTitle: google.BQTable - schemaVersion: 0.0.1 - gcs_output_directory: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - gcp_resources: - type: STRING - comp-model-batch-predict-4: - executorLabel: exec-model-batch-predict-4 - inputDefinitions: - artifacts: - model: - artifactType: - schemaTitle: google.VertexModel - schemaVersion: 0.0.1 - parameters: - accelerator_count: - type: INT - accelerator_type: - type: STRING - bigquery_destination_output_uri: - type: STRING - bigquery_source_input_uri: - type: STRING - encryption_spec_key_name: - type: STRING - explanation_metadata: - type: STRING - explanation_parameters: - type: STRING - gcs_destination_output_uri_prefix: - type: STRING - gcs_source_uris: - type: STRING - generate_explanation: - type: STRING - instances_format: - type: STRING - job_display_name: - type: STRING - labels: - type: STRING - location: - type: STRING - machine_type: - type: STRING - manual_batch_tuning_parameters_batch_size: - type: INT - max_replica_count: - type: INT - model_parameters: - type: STRING - predictions_format: - type: STRING - project: - type: STRING - starting_replica_count: - type: INT - outputDefinitions: - artifacts: - batchpredictionjob: - artifactType: - schemaTitle: google.VertexBatchPredictionJob - schemaVersion: 0.0.1 - bigquery_output_table: - artifactType: - schemaTitle: google.BQTable - schemaVersion: 0.0.1 - gcs_output_directory: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - gcp_resources: - type: STRING - comp-model-batch-predict-5: - executorLabel: exec-model-batch-predict-5 - inputDefinitions: - artifacts: - unmanaged_container_model: - artifactType: - schemaTitle: google.UnmanagedContainerModel - schemaVersion: 0.0.1 - parameters: - accelerator_count: - type: INT - accelerator_type: - type: STRING - bigquery_destination_output_uri: - type: STRING - bigquery_source_input_uri: - type: STRING - encryption_spec_key_name: - type: STRING - explanation_metadata: - type: STRING - explanation_parameters: - type: STRING - gcs_destination_output_uri_prefix: - type: STRING - gcs_source_uris: - type: STRING - generate_explanation: - type: STRING - instances_format: - type: STRING - job_display_name: - type: STRING - labels: - type: STRING - location: - type: STRING - machine_type: - type: STRING - manual_batch_tuning_parameters_batch_size: - type: 
INT - max_replica_count: - type: INT - model_parameters: - type: STRING - predictions_format: - type: STRING - project: - type: STRING - starting_replica_count: - type: INT - outputDefinitions: - artifacts: - batchpredictionjob: - artifactType: - schemaTitle: google.VertexBatchPredictionJob - schemaVersion: 0.0.1 - bigquery_output_table: - artifactType: - schemaTitle: google.BQTable - schemaVersion: 0.0.1 - gcs_output_directory: - artifactType: - schemaTitle: system.Artifact - schemaVersion: 0.0.1 - parameters: - gcp_resources: - type: STRING - comp-model-evaluation: - executorLabel: exec-model-evaluation - inputDefinitions: - artifacts: - batch_prediction_job: - artifactType: - schemaTitle: google.VertexBatchPredictionJob - schemaVersion: 0.0.1 - parameters: - class_names: - type: STRING - classification_type: - type: STRING - dataflow_disk_size: - type: INT - dataflow_machine_type: - type: STRING - dataflow_max_workers_num: - type: INT - dataflow_service_account: - type: STRING - dataflow_subnetwork: - type: STRING - dataflow_use_public_ips: - type: STRING - dataflow_workers_num: - type: INT - encryption_spec_key_name: - type: STRING - example_weight_column: - type: STRING - generate_feature_attribution: - type: STRING - ground_truth_column: - type: STRING - ground_truth_format: - type: STRING - ground_truth_gcs_source: - type: STRING - key_columns: - type: STRING - location: - type: STRING - positive_classes: - type: STRING - prediction_id_column: - type: STRING - prediction_label_column: - type: STRING - prediction_score_column: - type: STRING - predictions_format: - type: STRING - problem_type: - type: STRING - project: - type: STRING - root_dir: - type: STRING - outputDefinitions: - artifacts: - evaluation_metrics: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - parameters: - gcp_resources: - type: STRING - comp-model-evaluation-2: - executorLabel: exec-model-evaluation-2 - inputDefinitions: - artifacts: - batch_prediction_job: - artifactType: - schemaTitle: google.VertexBatchPredictionJob - schemaVersion: 0.0.1 - parameters: - class_names: - type: STRING - classification_type: - type: STRING - dataflow_disk_size: - type: INT - dataflow_machine_type: - type: STRING - dataflow_max_workers_num: - type: INT - dataflow_service_account: - type: STRING - dataflow_subnetwork: - type: STRING - dataflow_use_public_ips: - type: STRING - dataflow_workers_num: - type: INT - encryption_spec_key_name: - type: STRING - example_weight_column: - type: STRING - generate_feature_attribution: - type: STRING - ground_truth_column: - type: STRING - ground_truth_format: - type: STRING - ground_truth_gcs_source: - type: STRING - key_columns: - type: STRING - location: - type: STRING - positive_classes: - type: STRING - prediction_id_column: - type: STRING - prediction_label_column: - type: STRING - prediction_score_column: - type: STRING - predictions_format: - type: STRING - problem_type: - type: STRING - project: - type: STRING - root_dir: - type: STRING - outputDefinitions: - artifacts: - evaluation_metrics: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - parameters: - gcp_resources: - type: STRING - comp-model-evaluation-3: - executorLabel: exec-model-evaluation-3 - inputDefinitions: - artifacts: - batch_prediction_job: - artifactType: - schemaTitle: google.VertexBatchPredictionJob - schemaVersion: 0.0.1 - parameters: - class_names: - type: STRING - classification_type: - type: STRING - dataflow_disk_size: - type: INT - dataflow_machine_type: - type: STRING 
[Removed, abridged here: the remainder of a compiled AutoML Tabular pipeline specification (KFP v2 YAML). The deleted block covers:
  - the tail of the model-evaluation component's input parameters (dataflow_*, encryption_spec_key_name, ground_truth_*, prediction_*_column, predictions_format, problem_type, project, root_dir) and its evaluation_metrics / gcp_resources outputs;
  - the component interface definitions (executorLabel, inputDefinitions, outputDefinitions) for comp-model-evaluation-import, comp-model-upload, comp-read-input-uri, comp-set-model-can-skip-validation, comp-string-not-empty, comp-tabular-stats-and-example-gen and comp-write-bp-result-path, including their numbered duplicates;
  - the deploymentSpec executors: the AutoML Tabular CustomJob launchers (exec-automl-tabular-cv-trainer, -ensemble, -finalizer, -infra-validator, -stage-1-tuner, -transform), the lightweight Python components whose source is inlined in the spec (exec-bool-identity, exec-calculate-training-parameters, exec-merge-materialized-splits), exec-importer, and the evaluation / batch-prediction containers (exec-feature-attribution, exec-model-batch-explanation, exec-model-batch-predict);
  - the container images pinned by those executors: gcr.io/ml-pipeline/google-cloud-pipeline-components:1.0.32, us-docker.pkg.dev/vertex-ai-restricted/automl-tabular/training:20230110_1125_RC00, us-docker.pkg.dev/vertex-ai/automl-tabular/prediction-server:20230110_1125_RC00, us-docker.pkg.dev/vertex-ai/automl-tabular/dataflow-worker:20230110_1125_RC00, gcr.io/ml-pipeline/model-evaluation:v0.7 and gcr.io/ml-pipeline/automl-tables-private:1.0.13.]
{{$.inputs.parameters[''accelerator_count'']}}}, "starting_replica_count": - {{$.inputs.parameters[''starting_replica_count'']}}, "max_replica_count": - {{$.inputs.parameters[''max_replica_count'']}}}, "manual_batch_tuning_parameters": - {"batch_size": {{$.inputs.parameters[''manual_batch_tuning_parameters_batch_size'']}}}, - "generate_explanation": {{$.inputs.parameters[''generate_explanation'']}}, - "explanation_spec": {"parameters": {{$.inputs.parameters[''explanation_parameters'']}}, - "metadata": {{$.inputs.parameters[''explanation_metadata'']}}}, "labels": - {{$.inputs.parameters[''labels'']}}, "encryption_spec": {"kms_key_name":"{{$.inputs.parameters[''encryption_spec_key_name'']}}"}}' - - --project - - '{{$.inputs.parameters[''project'']}}' - - --location - - '{{$.inputs.parameters[''location'']}}' - - --gcp_resources - - '{{$.outputs.parameters[''gcp_resources''].output_file}}' - - --executor_input - - '{{$}}' - command: - - python3 - - -u - - -m - - google_cloud_pipeline_components.container.v1.batch_prediction_job.launcher - image: gcr.io/ml-pipeline/google-cloud-pipeline-components:1.0.32 - exec-model-batch-predict-3: - container: - args: - - --type - - BatchPredictionJob - - --payload - - '{"display_name": "{{$.inputs.parameters[''job_display_name'']}}", "model": - "{{$.inputs.artifacts[''model''].metadata[''resourceName'']}}", "input_config": - {"instances_format": "{{$.inputs.parameters[''instances_format'']}}", - "gcs_source": {"uris":{{$.inputs.parameters[''gcs_source_uris'']}}}, "bigquery_source": - {"input_uri": "{{$.inputs.parameters[''bigquery_source_input_uri'']}}"}}, - "model_parameters": {{$.inputs.parameters[''model_parameters'']}}, "output_config": - {"predictions_format": "{{$.inputs.parameters[''predictions_format'']}}", - "gcs_destination": {"output_uri_prefix": "{{$.inputs.parameters[''gcs_destination_output_uri_prefix'']}}"}, - "bigquery_destination": {"output_uri": "{{$.inputs.parameters[''bigquery_destination_output_uri'']}}"}}, - "dedicated_resources": {"machine_spec": {"machine_type": "{{$.inputs.parameters[''machine_type'']}}", - "accelerator_type": "{{$.inputs.parameters[''accelerator_type'']}}", "accelerator_count": - {{$.inputs.parameters[''accelerator_count'']}}}, "starting_replica_count": - {{$.inputs.parameters[''starting_replica_count'']}}, "max_replica_count": - {{$.inputs.parameters[''max_replica_count'']}}}, "manual_batch_tuning_parameters": - {"batch_size": {{$.inputs.parameters[''manual_batch_tuning_parameters_batch_size'']}}}, - "generate_explanation": {{$.inputs.parameters[''generate_explanation'']}}, - "explanation_spec": {"parameters": {{$.inputs.parameters[''explanation_parameters'']}}, - "metadata": {{$.inputs.parameters[''explanation_metadata'']}}}, "labels": - {{$.inputs.parameters[''labels'']}}, "encryption_spec": {"kms_key_name":"{{$.inputs.parameters[''encryption_spec_key_name'']}}"}}' - - --project - - '{{$.inputs.parameters[''project'']}}' - - --location - - '{{$.inputs.parameters[''location'']}}' - - --gcp_resources - - '{{$.outputs.parameters[''gcp_resources''].output_file}}' - - --executor_input - - '{{$}}' - command: - - python3 - - -u - - -m - - google_cloud_pipeline_components.container.v1.batch_prediction_job.launcher - image: gcr.io/ml-pipeline/google-cloud-pipeline-components:1.0.32 - exec-model-batch-predict-4: - container: - args: - - --type - - BatchPredictionJob - - --payload - - '{"display_name": "{{$.inputs.parameters[''job_display_name'']}}", "model": - 
"{{$.inputs.artifacts[''model''].metadata[''resourceName'']}}", "input_config": - {"instances_format": "{{$.inputs.parameters[''instances_format'']}}", - "gcs_source": {"uris":{{$.inputs.parameters[''gcs_source_uris'']}}}, "bigquery_source": - {"input_uri": "{{$.inputs.parameters[''bigquery_source_input_uri'']}}"}}, - "model_parameters": {{$.inputs.parameters[''model_parameters'']}}, "output_config": - {"predictions_format": "{{$.inputs.parameters[''predictions_format'']}}", - "gcs_destination": {"output_uri_prefix": "{{$.inputs.parameters[''gcs_destination_output_uri_prefix'']}}"}, - "bigquery_destination": {"output_uri": "{{$.inputs.parameters[''bigquery_destination_output_uri'']}}"}}, - "dedicated_resources": {"machine_spec": {"machine_type": "{{$.inputs.parameters[''machine_type'']}}", - "accelerator_type": "{{$.inputs.parameters[''accelerator_type'']}}", "accelerator_count": - {{$.inputs.parameters[''accelerator_count'']}}}, "starting_replica_count": - {{$.inputs.parameters[''starting_replica_count'']}}, "max_replica_count": - {{$.inputs.parameters[''max_replica_count'']}}}, "manual_batch_tuning_parameters": - {"batch_size": {{$.inputs.parameters[''manual_batch_tuning_parameters_batch_size'']}}}, - "generate_explanation": {{$.inputs.parameters[''generate_explanation'']}}, - "explanation_spec": {"parameters": {{$.inputs.parameters[''explanation_parameters'']}}, - "metadata": {{$.inputs.parameters[''explanation_metadata'']}}}, "labels": - {{$.inputs.parameters[''labels'']}}, "encryption_spec": {"kms_key_name":"{{$.inputs.parameters[''encryption_spec_key_name'']}}"}}' - - --project - - '{{$.inputs.parameters[''project'']}}' - - --location - - '{{$.inputs.parameters[''location'']}}' - - --gcp_resources - - '{{$.outputs.parameters[''gcp_resources''].output_file}}' - - --executor_input - - '{{$}}' - command: - - python3 - - -u - - -m - - google_cloud_pipeline_components.container.v1.batch_prediction_job.launcher - image: gcr.io/ml-pipeline/google-cloud-pipeline-components:1.0.32 - exec-model-batch-predict-5: - container: - args: - - --type - - BatchPredictionJob - - --payload - - '{"display_name": "{{$.inputs.parameters[''job_display_name'']}}", "input_config": - {"instances_format": "{{$.inputs.parameters[''instances_format'']}}", - "gcs_source": {"uris":{{$.inputs.parameters[''gcs_source_uris'']}}}, "bigquery_source": - {"input_uri": "{{$.inputs.parameters[''bigquery_source_input_uri'']}}"}}, - "model_parameters": {{$.inputs.parameters[''model_parameters'']}}, "output_config": - {"predictions_format": "{{$.inputs.parameters[''predictions_format'']}}", - "gcs_destination": {"output_uri_prefix": "{{$.inputs.parameters[''gcs_destination_output_uri_prefix'']}}"}, - "bigquery_destination": {"output_uri": "{{$.inputs.parameters[''bigquery_destination_output_uri'']}}"}}, - "dedicated_resources": {"machine_spec": {"machine_type": "{{$.inputs.parameters[''machine_type'']}}", - "accelerator_type": "{{$.inputs.parameters[''accelerator_type'']}}", "accelerator_count": - {{$.inputs.parameters[''accelerator_count'']}}}, "starting_replica_count": - {{$.inputs.parameters[''starting_replica_count'']}}, "max_replica_count": - {{$.inputs.parameters[''max_replica_count'']}}}, "manual_batch_tuning_parameters": - {"batch_size": {{$.inputs.parameters[''manual_batch_tuning_parameters_batch_size'']}}}, - "generate_explanation": {{$.inputs.parameters[''generate_explanation'']}}, - "explanation_spec": {"parameters": {{$.inputs.parameters[''explanation_parameters'']}}, - "metadata": 
{{$.inputs.parameters[''explanation_metadata'']}}}, "labels": - {{$.inputs.parameters[''labels'']}}, "encryption_spec": {"kms_key_name":"{{$.inputs.parameters[''encryption_spec_key_name'']}}"}}' - - --project - - '{{$.inputs.parameters[''project'']}}' - - --location - - '{{$.inputs.parameters[''location'']}}' - - --gcp_resources - - '{{$.outputs.parameters[''gcp_resources''].output_file}}' - - --executor_input - - '{{$}}' - command: - - python3 - - -u - - -m - - google_cloud_pipeline_components.container.v1.batch_prediction_job.launcher - image: gcr.io/ml-pipeline/google-cloud-pipeline-components:1.0.32 - exec-model-evaluation: - container: - args: - - --setup_file - - /setup.py - - --json_mode - - 'true' - - --project_id - - '{{$.inputs.parameters[''project'']}}' - - --location - - '{{$.inputs.parameters[''location'']}}' - - --problem_type - - '{{$.inputs.parameters[''problem_type'']}}' - - --batch_prediction_format - - '{{$.inputs.parameters[''predictions_format'']}}' - - --batch_prediction_gcs_source - - '{{$.inputs.artifacts[''batch_prediction_job''].metadata[''gcsOutputDirectory'']}}' - - --ground_truth_format - - '{{$.inputs.parameters[''ground_truth_format'']}}' - - --ground_truth_gcs_source - - '{{$.inputs.parameters[''ground_truth_gcs_source'']}}' - - --key_prefix_in_prediction_dataset - - instance - - --key_columns - - '{{$.inputs.parameters[''key_columns'']}}' - - --root_dir - - '{{$.inputs.parameters[''root_dir'']}}/{{$.pipeline_job_uuid}}-{{$.pipeline_task_uuid}}' - - --classification_type - - '{{$.inputs.parameters[''classification_type'']}}' - - --class_names - - '{{$.inputs.parameters[''class_names'']}}' - - --ground_truth_column - - instance.{{$.inputs.parameters['ground_truth_column']}} - - --prediction_score_column - - '{{$.inputs.parameters[''prediction_score_column'']}}' - - --prediction_label_column - - '{{$.inputs.parameters[''prediction_label_column'']}}' - - --prediction_id_column - - '{{$.inputs.parameters[''prediction_id_column'']}}' - - --example_weight_column - - '{{$.inputs.parameters[''example_weight_column'']}}' - - --positive_classes - - '{{$.inputs.parameters[''positive_classes'']}}' - - --generate_feature_attribution - - '{{$.inputs.parameters[''generate_feature_attribution'']}}' - - --dataflow_job_prefix - - evaluation-{{$.pipeline_job_uuid}}-{{$.pipeline_task_uuid}} - - --dataflow_service_account - - '{{$.inputs.parameters[''dataflow_service_account'']}}' - - --dataflow_disk_size - - '{{$.inputs.parameters[''dataflow_disk_size'']}}' - - --dataflow_machine_type - - '{{$.inputs.parameters[''dataflow_machine_type'']}}' - - --dataflow_workers_num - - '{{$.inputs.parameters[''dataflow_workers_num'']}}' - - --dataflow_max_workers_num - - '{{$.inputs.parameters[''dataflow_max_workers_num'']}}' - - --dataflow_subnetwork - - '{{$.inputs.parameters[''dataflow_subnetwork'']}}' - - --dataflow_use_public_ips - - '{{$.inputs.parameters[''dataflow_use_public_ips'']}}' - - --kms_key_name - - '{{$.inputs.parameters[''encryption_spec_key_name'']}}' - - --output_metrics_gcs_path - - '{{$.outputs.artifacts[''evaluation_metrics''].uri}}' - - --gcp_resources - - '{{$.outputs.parameters[''gcp_resources''].output_file}}' - - --executor_input - - '{{$}}' - command: - - python - - /main.py - image: gcr.io/ml-pipeline/model-evaluation:v0.4 - exec-model-evaluation-2: - container: - args: - - --setup_file - - /setup.py - - --json_mode - - 'true' - - --project_id - - '{{$.inputs.parameters[''project'']}}' - - --location - - '{{$.inputs.parameters[''location'']}}' - - --problem_type 
- - '{{$.inputs.parameters[''problem_type'']}}' - - --batch_prediction_format - - '{{$.inputs.parameters[''predictions_format'']}}' - - --batch_prediction_gcs_source - - '{{$.inputs.artifacts[''batch_prediction_job''].metadata[''gcsOutputDirectory'']}}' - - --ground_truth_format - - '{{$.inputs.parameters[''ground_truth_format'']}}' - - --ground_truth_gcs_source - - '{{$.inputs.parameters[''ground_truth_gcs_source'']}}' - - --key_prefix_in_prediction_dataset - - instance - - --key_columns - - '{{$.inputs.parameters[''key_columns'']}}' - - --root_dir - - '{{$.inputs.parameters[''root_dir'']}}/{{$.pipeline_job_uuid}}-{{$.pipeline_task_uuid}}' - - --classification_type - - '{{$.inputs.parameters[''classification_type'']}}' - - --class_names - - '{{$.inputs.parameters[''class_names'']}}' - - --ground_truth_column - - instance.{{$.inputs.parameters['ground_truth_column']}} - - --prediction_score_column - - '{{$.inputs.parameters[''prediction_score_column'']}}' - - --prediction_label_column - - '{{$.inputs.parameters[''prediction_label_column'']}}' - - --prediction_id_column - - '{{$.inputs.parameters[''prediction_id_column'']}}' - - --example_weight_column - - '{{$.inputs.parameters[''example_weight_column'']}}' - - --positive_classes - - '{{$.inputs.parameters[''positive_classes'']}}' - - --generate_feature_attribution - - '{{$.inputs.parameters[''generate_feature_attribution'']}}' - - --dataflow_job_prefix - - evaluation-{{$.pipeline_job_uuid}}-{{$.pipeline_task_uuid}} - - --dataflow_service_account - - '{{$.inputs.parameters[''dataflow_service_account'']}}' - - --dataflow_disk_size - - '{{$.inputs.parameters[''dataflow_disk_size'']}}' - - --dataflow_machine_type - - '{{$.inputs.parameters[''dataflow_machine_type'']}}' - - --dataflow_workers_num - - '{{$.inputs.parameters[''dataflow_workers_num'']}}' - - --dataflow_max_workers_num - - '{{$.inputs.parameters[''dataflow_max_workers_num'']}}' - - --dataflow_subnetwork - - '{{$.inputs.parameters[''dataflow_subnetwork'']}}' - - --dataflow_use_public_ips - - '{{$.inputs.parameters[''dataflow_use_public_ips'']}}' - - --kms_key_name - - '{{$.inputs.parameters[''encryption_spec_key_name'']}}' - - --output_metrics_gcs_path - - '{{$.outputs.artifacts[''evaluation_metrics''].uri}}' - - --gcp_resources - - '{{$.outputs.parameters[''gcp_resources''].output_file}}' - - --executor_input - - '{{$}}' - command: - - python - - /main.py - image: gcr.io/ml-pipeline/model-evaluation:v0.4 - exec-model-evaluation-3: - container: - args: - - --setup_file - - /setup.py - - --json_mode - - 'true' - - --project_id - - '{{$.inputs.parameters[''project'']}}' - - --location - - '{{$.inputs.parameters[''location'']}}' - - --problem_type - - '{{$.inputs.parameters[''problem_type'']}}' - - --batch_prediction_format - - '{{$.inputs.parameters[''predictions_format'']}}' - - --batch_prediction_gcs_source - - '{{$.inputs.artifacts[''batch_prediction_job''].metadata[''gcsOutputDirectory'']}}' - - --ground_truth_format - - '{{$.inputs.parameters[''ground_truth_format'']}}' - - --ground_truth_gcs_source - - '{{$.inputs.parameters[''ground_truth_gcs_source'']}}' - - --key_prefix_in_prediction_dataset - - instance - - --key_columns - - '{{$.inputs.parameters[''key_columns'']}}' - - --root_dir - - '{{$.inputs.parameters[''root_dir'']}}/{{$.pipeline_job_uuid}}-{{$.pipeline_task_uuid}}' - - --classification_type - - '{{$.inputs.parameters[''classification_type'']}}' - - --class_names - - '{{$.inputs.parameters[''class_names'']}}' - - --ground_truth_column - - 
instance.{{$.inputs.parameters['ground_truth_column']}} - - --prediction_score_column - - '{{$.inputs.parameters[''prediction_score_column'']}}' - - --prediction_label_column - - '{{$.inputs.parameters[''prediction_label_column'']}}' - - --prediction_id_column - - '{{$.inputs.parameters[''prediction_id_column'']}}' - - --example_weight_column - - '{{$.inputs.parameters[''example_weight_column'']}}' - - --positive_classes - - '{{$.inputs.parameters[''positive_classes'']}}' - - --generate_feature_attribution - - '{{$.inputs.parameters[''generate_feature_attribution'']}}' - - --dataflow_job_prefix - - evaluation-{{$.pipeline_job_uuid}}-{{$.pipeline_task_uuid}} - - --dataflow_service_account - - '{{$.inputs.parameters[''dataflow_service_account'']}}' - - --dataflow_disk_size - - '{{$.inputs.parameters[''dataflow_disk_size'']}}' - - --dataflow_machine_type - - '{{$.inputs.parameters[''dataflow_machine_type'']}}' - - --dataflow_workers_num - - '{{$.inputs.parameters[''dataflow_workers_num'']}}' - - --dataflow_max_workers_num - - '{{$.inputs.parameters[''dataflow_max_workers_num'']}}' - - --dataflow_subnetwork - - '{{$.inputs.parameters[''dataflow_subnetwork'']}}' - - --dataflow_use_public_ips - - '{{$.inputs.parameters[''dataflow_use_public_ips'']}}' - - --kms_key_name - - '{{$.inputs.parameters[''encryption_spec_key_name'']}}' - - --output_metrics_gcs_path - - '{{$.outputs.artifacts[''evaluation_metrics''].uri}}' - - --gcp_resources - - '{{$.outputs.parameters[''gcp_resources''].output_file}}' - - --executor_input - - '{{$}}' - command: - - python - - /main.py - image: gcr.io/ml-pipeline/model-evaluation:v0.4 - exec-model-evaluation-import: - container: - args: - - --metrics - - '{{$.inputs.artifacts[''metrics''].uri}}' - - --metrics_explanation - - '{{$.inputs.artifacts[''metrics''].metadata[''explanation_gcs_path'']}}' - - --feature_attributions - - '{{$.inputs.artifacts[''feature_attributions''].uri}}' - - --problem_type - - '{{$.inputs.parameters[''problem_type'']}}' - - --display_name - - '{{$.inputs.parameters[''display_name'']}}' - - --dataset_path - - '{{$.inputs.parameters[''dataset_path'']}}' - - --dataset_paths - - '{{$.inputs.parameters[''dataset_paths'']}}' - - --dataset_type - - '{{$.inputs.parameters[''dataset_type'']}}' - - --pipeline_job_id - - '{{$.pipeline_job_uuid}}' - - --pipeline_job_resource_name - - '{{$.pipeline_job_resource_name}}' - - --model_name - - '{{$.inputs.artifacts[''model''].metadata[''resourceName'']}}' - - --gcp_resources - - '{{$.outputs.parameters[''gcp_resources''].output_file}}' - command: - - python3 - - -u - - -m - - google_cloud_pipeline_components.container.experimental.evaluation.import_model_evaluation - image: gcr.io/ml-pipeline/google-cloud-pipeline-components:1.0.32 - exec-model-evaluation-import-2: - container: - args: - - --metrics - - '{{$.inputs.artifacts[''metrics''].uri}}' - - --metrics_explanation - - '{{$.inputs.artifacts[''metrics''].metadata[''explanation_gcs_path'']}}' - - --feature_attributions - - '{{$.inputs.artifacts[''feature_attributions''].uri}}' - - --problem_type - - '{{$.inputs.parameters[''problem_type'']}}' - - --display_name - - '{{$.inputs.parameters[''display_name'']}}' - - --dataset_path - - '{{$.inputs.parameters[''dataset_path'']}}' - - --dataset_paths - - '{{$.inputs.parameters[''dataset_paths'']}}' - - --dataset_type - - '{{$.inputs.parameters[''dataset_type'']}}' - - --pipeline_job_id - - '{{$.pipeline_job_uuid}}' - - --pipeline_job_resource_name - - '{{$.pipeline_job_resource_name}}' - - --model_name - - 
'{{$.inputs.artifacts[''model''].metadata[''resourceName'']}}' - - --gcp_resources - - '{{$.outputs.parameters[''gcp_resources''].output_file}}' - command: - - python3 - - -u - - -m - - google_cloud_pipeline_components.container.experimental.evaluation.import_model_evaluation - image: gcr.io/ml-pipeline/google-cloud-pipeline-components:1.0.32 - exec-model-evaluation-import-3: - container: - args: - - --metrics - - '{{$.inputs.artifacts[''metrics''].uri}}' - - --metrics_explanation - - '{{$.inputs.artifacts[''metrics''].metadata[''explanation_gcs_path'']}}' - - --feature_attributions - - '{{$.inputs.artifacts[''feature_attributions''].uri}}' - - --problem_type - - '{{$.inputs.parameters[''problem_type'']}}' - - --display_name - - '{{$.inputs.parameters[''display_name'']}}' - - --dataset_path - - '{{$.inputs.parameters[''dataset_path'']}}' - - --dataset_paths - - '{{$.inputs.parameters[''dataset_paths'']}}' - - --dataset_type - - '{{$.inputs.parameters[''dataset_type'']}}' - - --pipeline_job_id - - '{{$.pipeline_job_uuid}}' - - --pipeline_job_resource_name - - '{{$.pipeline_job_resource_name}}' - - --model_name - - '{{$.inputs.artifacts[''model''].metadata[''resourceName'']}}' - - --gcp_resources - - '{{$.outputs.parameters[''gcp_resources''].output_file}}' - command: - - python3 - - -u - - -m - - google_cloud_pipeline_components.container.experimental.evaluation.import_model_evaluation - image: gcr.io/ml-pipeline/google-cloud-pipeline-components:1.0.32 - exec-model-upload: - container: - args: - - --type - - UploadModel - - --payload - - '{"display_name": "{{$.inputs.parameters[''display_name'']}}", "description": - "{{$.inputs.parameters[''description'']}}", "explanation_spec": {"parameters": - {{$.inputs.parameters[''explanation_parameters'']}}, "metadata": {{$.inputs.parameters[''explanation_metadata'']}}}, - "encryption_spec": {"kms_key_name":"{{$.inputs.parameters[''encryption_spec_key_name'']}}"}, - "labels": {{$.inputs.parameters[''labels'']}}}' - - --project - - '{{$.inputs.parameters[''project'']}}' - - --location - - '{{$.inputs.parameters[''location'']}}' - - --gcp_resources - - '{{$.outputs.parameters[''gcp_resources''].output_file}}' - - --executor_input - - '{{$}}' - command: - - python3 - - -u - - -m - - launcher - image: gcr.io/ml-pipeline/automl-tables-private:1.0.13 - exec-model-upload-2: - container: - args: - - --type - - UploadModel - - --payload - - '{"display_name": "{{$.inputs.parameters[''display_name'']}}", "description": - "{{$.inputs.parameters[''description'']}}", "explanation_spec": {"parameters": - {{$.inputs.parameters[''explanation_parameters'']}}, "metadata": {{$.inputs.parameters[''explanation_metadata'']}}}, - "explanation_metadata_artifact": "{{$.inputs.artifacts[''explanation_metadata_artifact''].uri}}", - "encryption_spec": {"kms_key_name":"{{$.inputs.parameters[''encryption_spec_key_name'']}}"}, - "labels": {{$.inputs.parameters[''labels'']}}}' - - --project - - '{{$.inputs.parameters[''project'']}}' - - --location - - '{{$.inputs.parameters[''location'']}}' - - --gcp_resources - - '{{$.outputs.parameters[''gcp_resources''].output_file}}' - - --executor_input - - '{{$}}' - command: - - python3 - - -u - - -m - - launcher - image: gcr.io/ml-pipeline/automl-tables-private:1.0.13 - exec-model-upload-3: - container: - args: - - --type - - UploadModel - - --payload - - '{"display_name": "{{$.inputs.parameters[''display_name'']}}", "description": - "{{$.inputs.parameters[''description'']}}", "explanation_spec": {"parameters": - 
{{$.inputs.parameters[''explanation_parameters'']}}, "metadata": {{$.inputs.parameters[''explanation_metadata'']}}}, - "explanation_metadata_artifact": "{{$.inputs.artifacts[''explanation_metadata_artifact''].uri}}", - "encryption_spec": {"kms_key_name":"{{$.inputs.parameters[''encryption_spec_key_name'']}}"}, - "labels": {{$.inputs.parameters[''labels'']}}}' - - --project - - '{{$.inputs.parameters[''project'']}}' - - --location - - '{{$.inputs.parameters[''location'']}}' - - --gcp_resources - - '{{$.outputs.parameters[''gcp_resources''].output_file}}' - - --executor_input - - '{{$}}' - command: - - python3 - - -u - - -m - - launcher - image: gcr.io/ml-pipeline/automl-tables-private:1.0.13 - exec-model-upload-4: - container: - args: - - --type - - UploadModel - - --payload - - '{"display_name": "{{$.inputs.parameters[''display_name'']}}", "description": - "{{$.inputs.parameters[''description'']}}", "explanation_spec": {"parameters": - {{$.inputs.parameters[''explanation_parameters'']}}, "metadata": {{$.inputs.parameters[''explanation_metadata'']}}}, - "explanation_metadata_artifact": "{{$.inputs.artifacts[''explanation_metadata_artifact''].uri}}", - "encryption_spec": {"kms_key_name":"{{$.inputs.parameters[''encryption_spec_key_name'']}}"}, - "labels": {{$.inputs.parameters[''labels'']}}}' - - --project - - '{{$.inputs.parameters[''project'']}}' - - --location - - '{{$.inputs.parameters[''location'']}}' - - --gcp_resources - - '{{$.outputs.parameters[''gcp_resources''].output_file}}' - - --executor_input - - '{{$}}' - command: - - python3 - - -u - - -m - - launcher - image: gcr.io/ml-pipeline/automl-tables-private:1.0.13 - exec-read-input-uri: - container: - args: - - --split-uri - - '{{$.inputs.artifacts[''split_uri''].path}}' - - '----output-paths' - - '{{$.outputs.parameters[''Output''].output_file}}' - command: - - sh - - -ec - - 'program_path=$(mktemp) - - printf "%s" "$0" > "$program_path" - - python3 -u "$program_path" "$@" - - ' - - "def _read_input_uri(split_uri):\n \"\"\"Construct Dataset based on the\ - \ batch prediction job.\n\n Args:\n split_uri: Tbe path to the file\ - \ that contains Dataset data.\n\n Returns:\n The list of string that\ - \ represents the batch prediction input files.\n \"\"\"\n # pylint:\ - \ disable=g-import-not-at-top,import-outside-toplevel,redefined-outer-name,reimported\n\ - \ import json\n # pylint: enable=g-import-not-at-top,import-outside-toplevel,redefined-outer-name,reimported\n\ - \ with open(split_uri, 'r') as f:\n data_source = json.loads(f.read())\n\ - \ return data_source['tf_record_data_source']['file_patterns']\n\n\ - def _serialize_json(obj) -> str:\n if isinstance(obj, str):\n \ - \ return obj\n import json\n\n def default_serializer(obj):\n\ - \ if hasattr(obj, 'to_struct'):\n return obj.to_struct()\n\ - \ else:\n raise TypeError(\n \"Object\ - \ of type '%s' is not JSON serializable and does not have .to_struct()\ - \ method.\"\n % obj.__class__.__name__)\n\n return json.dumps(obj,\ - \ default=default_serializer, sort_keys=True)\n\nimport argparse\n_parser\ - \ = argparse.ArgumentParser(prog='Read input uri', description='Construct\ - \ Dataset based on the batch prediction job.')\n_parser.add_argument(\"\ - --split-uri\", dest=\"split_uri\", type=str, required=True, default=argparse.SUPPRESS)\n\ - _parser.add_argument(\"----output-paths\", dest=\"_output_paths\", type=str,\ - \ nargs=1)\n_parsed_args = vars(_parser.parse_args())\n_output_files =\ - \ _parsed_args.pop(\"_output_paths\", [])\n\n_outputs = 
_read_input_uri(**_parsed_args)\n\ - \n_outputs = [_outputs]\n\n_output_serializers = [\n _serialize_json,\n\ - \n]\n\nimport os\nfor idx, output_file in enumerate(_output_files):\n\ - \ try:\n os.makedirs(os.path.dirname(output_file))\n except\ - \ OSError:\n pass\n with open(output_file, 'w') as f:\n \ - \ f.write(_output_serializers[idx](_outputs[idx]))\n" - image: us-docker.pkg.dev/vertex-ai/automl-tabular/dataflow-worker:20230110_1125_RC00 - exec-read-input-uri-2: - container: - args: - - --split-uri - - '{{$.inputs.artifacts[''split_uri''].path}}' - - '----output-paths' - - '{{$.outputs.parameters[''Output''].output_file}}' - command: - - sh - - -ec - - 'program_path=$(mktemp) - - printf "%s" "$0" > "$program_path" - - python3 -u "$program_path" "$@" - - ' - - "def _read_input_uri(split_uri):\n \"\"\"Construct Dataset based on the\ - \ batch prediction job.\n\n Args:\n split_uri: Tbe path to the file\ - \ that contains Dataset data.\n\n Returns:\n The list of string that\ - \ represents the batch prediction input files.\n \"\"\"\n # pylint:\ - \ disable=g-import-not-at-top,import-outside-toplevel,redefined-outer-name,reimported\n\ - \ import json\n # pylint: enable=g-import-not-at-top,import-outside-toplevel,redefined-outer-name,reimported\n\ - \ with open(split_uri, 'r') as f:\n data_source = json.loads(f.read())\n\ - \ return data_source['tf_record_data_source']['file_patterns']\n\n\ - def _serialize_json(obj) -> str:\n if isinstance(obj, str):\n \ - \ return obj\n import json\n\n def default_serializer(obj):\n\ - \ if hasattr(obj, 'to_struct'):\n return obj.to_struct()\n\ - \ else:\n raise TypeError(\n \"Object\ - \ of type '%s' is not JSON serializable and does not have .to_struct()\ - \ method.\"\n % obj.__class__.__name__)\n\n return json.dumps(obj,\ - \ default=default_serializer, sort_keys=True)\n\nimport argparse\n_parser\ - \ = argparse.ArgumentParser(prog='Read input uri', description='Construct\ - \ Dataset based on the batch prediction job.')\n_parser.add_argument(\"\ - --split-uri\", dest=\"split_uri\", type=str, required=True, default=argparse.SUPPRESS)\n\ - _parser.add_argument(\"----output-paths\", dest=\"_output_paths\", type=str,\ - \ nargs=1)\n_parsed_args = vars(_parser.parse_args())\n_output_files =\ - \ _parsed_args.pop(\"_output_paths\", [])\n\n_outputs = _read_input_uri(**_parsed_args)\n\ - \n_outputs = [_outputs]\n\n_output_serializers = [\n _serialize_json,\n\ - \n]\n\nimport os\nfor idx, output_file in enumerate(_output_files):\n\ - \ try:\n os.makedirs(os.path.dirname(output_file))\n except\ - \ OSError:\n pass\n with open(output_file, 'w') as f:\n \ - \ f.write(_output_serializers[idx](_outputs[idx]))\n" - image: us-docker.pkg.dev/vertex-ai/automl-tabular/dataflow-worker:20230110_1125_RC00 - exec-set-model-can-skip-validation: - container: - args: - - --executor_input - - '{{$}}' - - --function_to_execute - - _set_model_can_skip_validation - command: - - sh - - -ec - - 'program_path=$(mktemp -d) - - printf "%s" "$0" > "$program_path/ephemeral_component.py" - - python3 -m kfp.v2.components.executor_main --component_module_path "$program_path/ephemeral_component.py" "$@" - - ' - - "\nimport kfp\nfrom kfp.v2 import dsl\nfrom kfp.v2.dsl import *\nfrom\ - \ typing import *\n\ndef _set_model_can_skip_validation(model: Input[Artifact]):\n\ - \ \"\"\"Construct Dataset based on the batch prediction job.\n\n Args:\n\ - \ model: The model artifact.\n \"\"\"\n # pylint: disable=g-import-not-at-top,import-outside-toplevel,redefined-outer-name,reimported\n\ - \ import 
os\n import tensorflow as tf\n # pylint: enable=g-import-not-at-top,import-outside-toplevel,redefined-outer-name,reimported\n\ - \n # create an empty CAN_SKIP_VALIDATION file\n with tf.io.gfile.GFile(os.path.join(model.uri,\ - \ 'CAN_SKIP_VALIDATION'),\n 'w') as f:\n f.write('')\n\ - \n" - image: us-docker.pkg.dev/vertex-ai/automl-tabular/dataflow-worker:20230110_1125_RC00 - exec-string-not-empty: - container: - args: - - --value - - '{{$.inputs.parameters[''value'']}}' - - '----output-paths' - - '{{$.outputs.parameters[''Output''].output_file}}' - command: - - sh - - -ec - - 'program_path=$(mktemp) - - printf "%s" "$0" > "$program_path" - - python3 -u "$program_path" "$@" - - ' - - "def _string_not_empty(value):\n \"\"\"Check if the input string value\ - \ is not empty.\n\n Args:\n value: String value to be checked.\n\n\ - \ Returns:\n Boolean value. -> 'true' if empty, 'false' if not empty.\ - \ We need to use str\n instead of bool due to a limitation in KFP compiler.\n\ - \ \"\"\"\n return 'true' if value else 'false'\n\ndef _serialize_str(str_value:\ - \ str) -> str:\n if not isinstance(str_value, str):\n raise\ - \ TypeError('Value \"{}\" has type \"{}\" instead of str.'.format(\n \ - \ str(str_value), str(type(str_value))))\n return str_value\n\ - \nimport argparse\n_parser = argparse.ArgumentParser(prog='String not\ - \ empty', description='Check if the input string value is not empty.')\n\ - _parser.add_argument(\"--value\", dest=\"value\", type=str, required=True,\ - \ default=argparse.SUPPRESS)\n_parser.add_argument(\"----output-paths\"\ - , dest=\"_output_paths\", type=str, nargs=1)\n_parsed_args = vars(_parser.parse_args())\n\ - _output_files = _parsed_args.pop(\"_output_paths\", [])\n\n_outputs =\ - \ _string_not_empty(**_parsed_args)\n\n_outputs = [_outputs]\n\n_output_serializers\ - \ = [\n _serialize_str,\n\n]\n\nimport os\nfor idx, output_file in\ - \ enumerate(_output_files):\n try:\n os.makedirs(os.path.dirname(output_file))\n\ - \ except OSError:\n pass\n with open(output_file, 'w') as\ - \ f:\n f.write(_output_serializers[idx](_outputs[idx]))\n" - image: us-docker.pkg.dev/vertex-ai/automl-tabular/dataflow-worker:20230110_1125_RC00 - exec-tabular-stats-and-example-gen: - container: - args: - - --type - - CustomJob - - --project - - '{{$.inputs.parameters[''project'']}}' - - --location - - '{{$.inputs.parameters[''location'']}}' - - --gcp_resources - - '{{$.outputs.parameters[''gcp_resources''].output_file}}' - - --payload - - '{"display_name": "tabular-stats-and-example-gen-{{$.pipeline_job_uuid}}-{{$.pipeline_task_uuid}}", - "encryption_spec": {"kms_key_name":"{{$.inputs.parameters[''encryption_spec_key_name'']}}"}, - "job_spec": {"worker_pool_specs": [{"replica_count": 1, "machine_spec": - {"machine_type": "n1-standard-8"}, "container_spec": {"image_uri":"us-docker.pkg.dev/vertex-ai-restricted/automl-tabular/training:20230110_1125_RC00", - "args": ["stats_generator","--train_spec={\"prediction_type\": \"{{$.inputs.parameters[''prediction_type'']}}\", - \"target_column\": \"{{$.inputs.parameters[''target_column_name'']}}\", - \"optimization_objective\": \"{{$.inputs.parameters[''optimization_objective'']}}\", - \"weight_column_name\": \"{{$.inputs.parameters[''weight_column_name'']}}\", - \"transformations\": {{$.inputs.parameters[''transformations'']}}, \"quantiles\": - {{$.inputs.parameters[''quantiles'']}}, \"enable_probabilistic_inference\": - {{$.inputs.parameters[''enable_probabilistic_inference'']}}}", 
"--transformations_override_path={{$.inputs.parameters[''transformations_path'']}}", - "--split_spec=", "--data_source=", "--data_source_csv_filenames={{$.inputs.parameters[''data_source_csv_filenames'']}}", - "--data_source_bigquery_table_path={{$.inputs.parameters[''data_source_bigquery_table_path'']}}", - "--predefined_split_key={{$.inputs.parameters[''predefined_split_key'']}}", - "--timestamp_split_key={{$.inputs.parameters[''timestamp_split_key'']}}", - "--stratified_split_key={{$.inputs.parameters[''stratified_split_key'']}}", - "--training_fraction={{$.inputs.parameters[''training_fraction'']}}", - "--validation_fraction={{$.inputs.parameters[''validation_fraction'']}}", - "--test_fraction={{$.inputs.parameters[''test_fraction'']}}", "--target_column={{$.inputs.parameters[''target_column_name'']}}", - "--request_type={{$.inputs.parameters[''request_type'']}}", "--optimization_objective_recall_value={{$.inputs.parameters[''optimization_objective_recall_value'']}}", - "--optimization_objective_precision_value={{$.inputs.parameters[''optimization_objective_precision_value'']}}", - "--example_gen_gcs_output_prefix={{$.inputs.parameters[''root_dir'']}}/{{$.pipeline_job_uuid}}/{{$.pipeline_task_uuid}}/example_gen_output", - "--dataset_stats_dir={{$.inputs.parameters[''root_dir'']}}/{{$.pipeline_job_uuid}}/{{$.pipeline_task_uuid}}/stats/", - "--stats_result_path={{$.outputs.artifacts[''dataset_stats''].uri}}", - "--dataset_schema_path={{$.outputs.artifacts[''dataset_schema''].uri}}", - "--job_name=tabular-stats-and-example-gen-{{$.pipeline_job_uuid}}-{{$.pipeline_task_uuid}}", - "--dataflow_project={{$.inputs.parameters[''project'']}}", "--error_file_path={{$.inputs.parameters[''root_dir'']}}/{{$.pipeline_job_uuid}}/{{$.pipeline_task_uuid}}/error.pb", - "--dataflow_staging_dir={{$.inputs.parameters[''root_dir'']}}/{{$.pipeline_job_uuid}}/{{$.pipeline_task_uuid}}/dataflow_staging", - "--dataflow_tmp_dir={{$.inputs.parameters[''root_dir'']}}/{{$.pipeline_job_uuid}}/{{$.pipeline_task_uuid}}/dataflow_tmp", - "--dataflow_max_num_workers={{$.inputs.parameters[''dataflow_max_num_workers'']}}", - "--dataflow_worker_container_image=us-docker.pkg.dev/vertex-ai/automl-tabular/dataflow-worker:20230110_1125_RC00", - "--dataflow_machine_type={{$.inputs.parameters[''dataflow_machine_type'']}}", - "--dataflow_disk_size_gb={{$.inputs.parameters[''dataflow_disk_size_gb'']}}", - "--dataflow_kms_key={{$.inputs.parameters[''encryption_spec_key_name'']}}", - "--dataflow_subnetwork_fully_qualified={{$.inputs.parameters[''dataflow_subnetwork'']}}", - "--dataflow_use_public_ips={{$.inputs.parameters[''dataflow_use_public_ips'']}}", - "--dataflow_service_account={{$.inputs.parameters[''dataflow_service_account'']}}", - "--is_distill={{$.inputs.parameters[''run_distillation'']}}", "--additional_experiments={{$.inputs.parameters[''additional_experiments'']}}", - "--metadata_path={{$.outputs.artifacts[''metadata''].uri}}", "--train_split={{$.outputs.artifacts[''train_split''].uri}}", - "--eval_split={{$.outputs.artifacts[''eval_split''].uri}}", "--test_split={{$.outputs.artifacts[''test_split''].uri}}", - "--test_split_for_batch_prediction_component={{$.outputs.parameters[''test_split_json''].output_file}}", - "--downsampled_test_split_for_batch_prediction_component={{$.outputs.parameters[''downsampled_test_split_json''].output_file}}", - "--instance_baseline_path={{$.outputs.artifacts[''instance_baseline''].uri}}", - "--lro_job_info={{$.inputs.parameters[''root_dir'']}}/{{$.pipeline_job_uuid}}/lro", - 
"--gcp_resources_path={{$.outputs.parameters[''gcp_resources''].output_file}}", - "--parse_json=true", "--generate_additional_downsample_test_split=true", - "--executor_input={{$.json_escape[1]}}"]}}]}}' - command: - - python3 - - -u - - -m - - google_cloud_pipeline_components.container.v1.custom_job.launcher - image: gcr.io/ml-pipeline/google-cloud-pipeline-components:1.0.32 - exec-write-bp-result-path: - container: - args: - - --executor_input - - '{{$}}' - - --function_to_execute - - _write_bp_result_path - command: - - sh - - -ec - - 'program_path=$(mktemp -d) - - printf "%s" "$0" > "$program_path/ephemeral_component.py" - - python3 -m kfp.v2.components.executor_main --component_module_path "$program_path/ephemeral_component.py" "$@" - - ' - - "\nimport kfp\nfrom kfp.v2 import dsl\nfrom kfp.v2.dsl import *\nfrom\ - \ typing import *\n\ndef _write_bp_result_path(\n bp_job: Input[Artifact],\n\ - \ result: OutputPath('Dataset'),\n):\n \"\"\"Construct Dataset based\ - \ on the batch prediction job.\n\n Args:\n bp_job: The batch prediction\ - \ job artifact.\n result: Tbe path to the file that contains Dataset\ - \ data.\n \"\"\"\n # pylint: disable=g-import-not-at-top,import-outside-toplevel,redefined-outer-name,reimported\n\ - \ import json\n # pylint: enable=g-import-not-at-top,import-outside-toplevel,redefined-outer-name,reimported\n\ - \ directory = bp_job.metadata['gcsOutputDirectory']\n data_source =\ - \ {\n 'tf_record_data_source': {\n 'file_patterns': [f'{directory}/prediction.results-*',],\n\ - \ 'coder': 'PROTO_VALUE',\n },\n }\n with open(result,\ - \ 'w') as f:\n f.write(json.dumps(data_source))\n\n" - image: us-docker.pkg.dev/vertex-ai/automl-tabular/dataflow-worker:20230110_1125_RC00 - exec-write-bp-result-path-2: - container: - args: - - --executor_input - - '{{$}}' - - --function_to_execute - - _write_bp_result_path - command: - - sh - - -ec - - 'program_path=$(mktemp -d) - - printf "%s" "$0" > "$program_path/ephemeral_component.py" - - python3 -m kfp.v2.components.executor_main --component_module_path "$program_path/ephemeral_component.py" "$@" - - ' - - "\nimport kfp\nfrom kfp.v2 import dsl\nfrom kfp.v2.dsl import *\nfrom\ - \ typing import *\n\ndef _write_bp_result_path(\n bp_job: Input[Artifact],\n\ - \ result: OutputPath('Dataset'),\n):\n \"\"\"Construct Dataset based\ - \ on the batch prediction job.\n\n Args:\n bp_job: The batch prediction\ - \ job artifact.\n result: Tbe path to the file that contains Dataset\ - \ data.\n \"\"\"\n # pylint: disable=g-import-not-at-top,import-outside-toplevel,redefined-outer-name,reimported\n\ - \ import json\n # pylint: enable=g-import-not-at-top,import-outside-toplevel,redefined-outer-name,reimported\n\ - \ directory = bp_job.metadata['gcsOutputDirectory']\n data_source =\ - \ {\n 'tf_record_data_source': {\n 'file_patterns': [f'{directory}/prediction.results-*',],\n\ - \ 'coder': 'PROTO_VALUE',\n },\n }\n with open(result,\ - \ 'w') as f:\n f.write(json.dumps(data_source))\n\n" - image: us-docker.pkg.dev/vertex-ai/automl-tabular/dataflow-worker:20230110_1125_RC00 - pipelineInfo: - name: automl-tabular - root: - dag: - outputs: - artifacts: - feature-attribution-2-feature_attributions: - artifactSelectors: - - outputArtifactKey: feature-attribution-2-feature_attributions - producerSubtask: exit-handler-1 - feature-attribution-3-feature_attributions: - artifactSelectors: - - outputArtifactKey: feature-attribution-3-feature_attributions - producerSubtask: exit-handler-1 - feature-attribution-feature_attributions: - 
artifactSelectors: - - outputArtifactKey: feature-attribution-feature_attributions - producerSubtask: exit-handler-1 - model-evaluation-2-evaluation_metrics: - artifactSelectors: - - outputArtifactKey: model-evaluation-2-evaluation_metrics - producerSubtask: exit-handler-1 - model-evaluation-3-evaluation_metrics: - artifactSelectors: - - outputArtifactKey: model-evaluation-3-evaluation_metrics - producerSubtask: exit-handler-1 - model-evaluation-evaluation_metrics: - artifactSelectors: - - outputArtifactKey: model-evaluation-evaluation_metrics - producerSubtask: exit-handler-1 - tasks: - automl-tabular-finalizer: - componentRef: - name: comp-automl-tabular-finalizer - dependentTasks: - - exit-handler-1 - inputs: - parameters: - encryption_spec_key_name: - runtimeValue: - constantValue: - stringValue: '' - location: - componentInputParameter: location - project: - componentInputParameter: project - root_dir: - componentInputParameter: root_dir - taskInfo: - name: automl-tabular-finalizer - triggerPolicy: - strategy: ALL_UPSTREAM_TASKS_COMPLETED - exit-handler-1: - componentRef: - name: comp-exit-handler-1 - inputs: - parameters: - pipelineparam--additional_experiments: - componentInputParameter: additional_experiments - pipelineparam--cv_trainer_worker_pool_specs_override: - componentInputParameter: cv_trainer_worker_pool_specs_override - pipelineparam--data_source_bigquery_table_path: - componentInputParameter: data_source_bigquery_table_path - pipelineparam--data_source_csv_filenames: - componentInputParameter: data_source_csv_filenames - pipelineparam--dataflow_service_account: - componentInputParameter: dataflow_service_account - pipelineparam--dataflow_subnetwork: - componentInputParameter: dataflow_subnetwork - pipelineparam--dataflow_use_public_ips: - componentInputParameter: dataflow_use_public_ips - pipelineparam--disable_early_stopping: - componentInputParameter: disable_early_stopping - pipelineparam--distill_batch_predict_machine_type: - componentInputParameter: distill_batch_predict_machine_type - pipelineparam--distill_batch_predict_max_replica_count: - componentInputParameter: distill_batch_predict_max_replica_count - pipelineparam--distill_batch_predict_starting_replica_count: - componentInputParameter: distill_batch_predict_starting_replica_count - pipelineparam--enable_probabilistic_inference: - componentInputParameter: enable_probabilistic_inference - pipelineparam--encryption_spec_key_name: - componentInputParameter: encryption_spec_key_name - pipelineparam--evaluation_batch_predict_machine_type: - componentInputParameter: evaluation_batch_predict_machine_type - pipelineparam--evaluation_batch_predict_max_replica_count: - componentInputParameter: evaluation_batch_predict_max_replica_count - pipelineparam--evaluation_batch_predict_starting_replica_count: - componentInputParameter: evaluation_batch_predict_starting_replica_count - pipelineparam--evaluation_dataflow_disk_size_gb: - componentInputParameter: evaluation_dataflow_disk_size_gb - pipelineparam--evaluation_dataflow_machine_type: - componentInputParameter: evaluation_dataflow_machine_type - pipelineparam--evaluation_dataflow_max_num_workers: - componentInputParameter: evaluation_dataflow_max_num_workers - pipelineparam--export_additional_model_without_custom_ops: - componentInputParameter: export_additional_model_without_custom_ops - pipelineparam--fast_testing: - componentInputParameter: fast_testing - pipelineparam--location: - componentInputParameter: location - pipelineparam--optimization_objective: - 
componentInputParameter: optimization_objective - pipelineparam--optimization_objective_precision_value: - componentInputParameter: optimization_objective_precision_value - pipelineparam--optimization_objective_recall_value: - componentInputParameter: optimization_objective_recall_value - pipelineparam--predefined_split_key: - componentInputParameter: predefined_split_key - pipelineparam--prediction_type: - componentInputParameter: prediction_type - pipelineparam--project: - componentInputParameter: project - pipelineparam--quantiles: - componentInputParameter: quantiles - pipelineparam--root_dir: - componentInputParameter: root_dir - pipelineparam--run_distillation: - componentInputParameter: run_distillation - pipelineparam--run_evaluation: - componentInputParameter: run_evaluation - pipelineparam--stage_1_num_parallel_trials: - componentInputParameter: stage_1_num_parallel_trials - pipelineparam--stage_1_tuner_worker_pool_specs_override: - componentInputParameter: stage_1_tuner_worker_pool_specs_override - pipelineparam--stage_1_tuning_result_artifact_uri: - componentInputParameter: stage_1_tuning_result_artifact_uri - pipelineparam--stage_2_num_parallel_trials: - componentInputParameter: stage_2_num_parallel_trials - pipelineparam--stage_2_num_selected_trials: - componentInputParameter: stage_2_num_selected_trials - pipelineparam--stats_and_example_gen_dataflow_disk_size_gb: - componentInputParameter: stats_and_example_gen_dataflow_disk_size_gb - pipelineparam--stats_and_example_gen_dataflow_machine_type: - componentInputParameter: stats_and_example_gen_dataflow_machine_type - pipelineparam--stats_and_example_gen_dataflow_max_num_workers: - componentInputParameter: stats_and_example_gen_dataflow_max_num_workers - pipelineparam--stratified_split_key: - componentInputParameter: stratified_split_key - pipelineparam--study_spec_parameters_override: - componentInputParameter: study_spec_parameters_override - pipelineparam--target_column: - componentInputParameter: target_column - pipelineparam--test_fraction: - componentInputParameter: test_fraction - pipelineparam--timestamp_split_key: - componentInputParameter: timestamp_split_key - pipelineparam--train_budget_milli_node_hours: - componentInputParameter: train_budget_milli_node_hours - pipelineparam--training_fraction: - componentInputParameter: training_fraction - pipelineparam--transform_dataflow_disk_size_gb: - componentInputParameter: transform_dataflow_disk_size_gb - pipelineparam--transform_dataflow_machine_type: - componentInputParameter: transform_dataflow_machine_type - pipelineparam--transform_dataflow_max_num_workers: - componentInputParameter: transform_dataflow_max_num_workers - pipelineparam--transformations: - componentInputParameter: transformations - pipelineparam--validation_fraction: - componentInputParameter: validation_fraction - pipelineparam--weight_column: - componentInputParameter: weight_column - taskInfo: - name: exit-handler-1 - inputDefinitions: - parameters: - additional_experiments: - type: STRING - cv_trainer_worker_pool_specs_override: - type: STRING - data_source_bigquery_table_path: - type: STRING - data_source_csv_filenames: - type: STRING - dataflow_service_account: - type: STRING - dataflow_subnetwork: - type: STRING - dataflow_use_public_ips: - type: STRING - disable_early_stopping: - type: STRING - distill_batch_predict_machine_type: - type: STRING - distill_batch_predict_max_replica_count: - type: INT - distill_batch_predict_starting_replica_count: - type: INT - enable_probabilistic_inference: - 
type: STRING - encryption_spec_key_name: - type: STRING - evaluation_batch_predict_machine_type: - type: STRING - evaluation_batch_predict_max_replica_count: - type: INT - evaluation_batch_predict_starting_replica_count: - type: INT - evaluation_dataflow_disk_size_gb: - type: INT - evaluation_dataflow_machine_type: - type: STRING - evaluation_dataflow_max_num_workers: - type: INT - export_additional_model_without_custom_ops: - type: STRING - fast_testing: - type: STRING - location: - type: STRING - optimization_objective: - type: STRING - optimization_objective_precision_value: - type: DOUBLE - optimization_objective_recall_value: - type: DOUBLE - predefined_split_key: - type: STRING - prediction_type: - type: STRING - project: - type: STRING - quantiles: - type: STRING - root_dir: - type: STRING - run_distillation: - type: STRING - run_evaluation: - type: STRING - stage_1_num_parallel_trials: - type: INT - stage_1_tuner_worker_pool_specs_override: - type: STRING - stage_1_tuning_result_artifact_uri: - type: STRING - stage_2_num_parallel_trials: - type: INT - stage_2_num_selected_trials: - type: INT - stats_and_example_gen_dataflow_disk_size_gb: - type: INT - stats_and_example_gen_dataflow_machine_type: - type: STRING - stats_and_example_gen_dataflow_max_num_workers: - type: INT - stratified_split_key: - type: STRING - study_spec_parameters_override: - type: STRING - target_column: - type: STRING - test_fraction: - type: DOUBLE - timestamp_split_key: - type: STRING - train_budget_milli_node_hours: - type: DOUBLE - training_fraction: - type: DOUBLE - transform_dataflow_disk_size_gb: - type: INT - transform_dataflow_machine_type: - type: STRING - transform_dataflow_max_num_workers: - type: INT - transformations: - type: STRING - validation_fraction: - type: DOUBLE - weight_column: - type: STRING - outputDefinitions: - artifacts: - feature-attribution-2-feature_attributions: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - feature-attribution-3-feature_attributions: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - feature-attribution-feature_attributions: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - model-evaluation-2-evaluation_metrics: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - model-evaluation-3-evaluation_metrics: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - model-evaluation-evaluation_metrics: - artifactType: - schemaTitle: system.Metrics - schemaVersion: 0.0.1 - schemaVersion: 2.0.0 - sdkVersion: kfp-1.8.14 -runtimeConfig: - parameters: - data_source_bigquery_table_path: - stringValue: '' - data_source_csv_filenames: - stringValue: '' - dataflow_service_account: - stringValue: '' - dataflow_subnetwork: - stringValue: '' - dataflow_use_public_ips: - stringValue: 'True' - disable_early_stopping: - stringValue: 'False' - distill_batch_predict_machine_type: - stringValue: n1-standard-16 - distill_batch_predict_max_replica_count: - intValue: '25' - distill_batch_predict_starting_replica_count: - intValue: '25' - enable_probabilistic_inference: - stringValue: 'False' - encryption_spec_key_name: - stringValue: '' - evaluation_batch_predict_machine_type: - stringValue: n1-standard-16 - evaluation_batch_predict_max_replica_count: - intValue: '25' - evaluation_batch_predict_starting_replica_count: - intValue: '25' - evaluation_dataflow_disk_size_gb: - intValue: '50' - evaluation_dataflow_machine_type: - stringValue: n1-standard-4 - evaluation_dataflow_max_num_workers: - 
intValue: '25' - export_additional_model_without_custom_ops: - stringValue: 'False' - fast_testing: - stringValue: 'False' - optimization_objective_precision_value: - doubleValue: -1.0 - optimization_objective_recall_value: - doubleValue: -1.0 - predefined_split_key: - stringValue: '' - run_distillation: - stringValue: 'False' - run_evaluation: - stringValue: 'False' - stage_1_num_parallel_trials: - intValue: '35' - stage_1_tuning_result_artifact_uri: - stringValue: '' - stage_2_num_parallel_trials: - intValue: '35' - stage_2_num_selected_trials: - intValue: '5' - stats_and_example_gen_dataflow_disk_size_gb: - intValue: '40' - stats_and_example_gen_dataflow_machine_type: - stringValue: n1-standard-16 - stats_and_example_gen_dataflow_max_num_workers: - intValue: '25' - stratified_split_key: - stringValue: '' - test_fraction: - doubleValue: -1.0 - timestamp_split_key: - stringValue: '' - training_fraction: - doubleValue: -1.0 - transform_dataflow_disk_size_gb: - intValue: '40' - transform_dataflow_machine_type: - stringValue: n1-standard-16 - transform_dataflow_max_num_workers: - intValue: '25' - validation_fraction: - doubleValue: -1.0 - weight_column: - stringValue: '' diff --git a/config/config.yaml.tftpl b/config/config.yaml.tftpl index 7d3209c4..416da14a 100644 --- a/config/config.yaml.tftpl +++ b/config/config.yaml.tftpl @@ -144,6 +144,7 @@ vertex_ai: # - clv # - training # - prediction + # - reporting_preparation pipelines: project_id: "${project_id}" service_account_id: "vertex-pipelines-sa" @@ -280,8 +281,6 @@ vertex_ai: CALL `{user_segmentation_dimensions_procedure_name}`();" query_user_lookback_metrics: " CALL `{user_lookback_metrics_procedure_name}`();" - query_user_scoped_segmentation_metrics: " - CALL `{user_scoped_segmentation_metrics_procedure_name}`();" query_audience_segmentation_inference_preparation: " CALL `{audience_segmentation_inference_preparation_procedure_name}`();" query_audience_segmentation_training_preparation: " @@ -339,12 +338,6 @@ vertex_ai: # The query_user_rolling_window_metrics defines the procedure that will be used to invoke the creation of the user rolling window metrics feature table. query_user_rolling_window_metrics: " CALL `{user_rolling_window_metrics_procedure_name}`();" - # The query_user_scoped_metrics defines the procedure that will be used to invoke the creation of the user scoped metrics feature table. - query_user_scoped_metrics: " - CALL `{user_scoped_metrics_procedure_name}`();" - # The query_user_session_event_aggregated_metrics defines the procedure that will be used to invoke the creation of the user session and events aggregated metrics feature table. - query_user_session_event_aggregated_metrics: " - CALL `{user_session_event_aggregated_metrics_procedure_name}`();" # The query_purchase_propensity_inference_preparation define the procedure that will be used to invoke the creation of the purchase propensity inference preparation table. query_purchase_propensity_inference_preparation: " CALL `{purchase_propensity_inference_preparation_procedure_name}`();" @@ -400,9 +393,6 @@ vertex_ai: # The query_user_rolling_window_lifetime_metrics defines the procedure that will be used to invoke the creation of the user rolling window lifetime metrics feature table. query_user_rolling_window_lifetime_metrics: " CALL `{user_rolling_window_lifetime_metrics_procedure_name}`();" - # The query_user_scoped_lifetime_metrics defines the procedure that will be used to invoke the creation of the user scoped lifetime metrics feature table. 
- query_user_scoped_lifetime_metrics: " - CALL `{user_scoped_lifetime_metrics_procedure_name}`();" # The query_customer_lifetime_value_inference_preparation defines the procedure that will be used to invoke the creation of the customer lifetime value inference preparation table. query_customer_lifetime_value_inference_preparation: " CALL `{customer_lifetime_value_inference_preparation_procedure_name}`();" @@ -495,6 +485,10 @@ vertex_ai: train_budget_milli_node_hours: 1000 # 1 hour run_evaluation: true run_distillation: false + # The Value Based Bidding model name + model_display_name: "value-based-bidding-model" + # The Value Based Bidding model description + model_description: "Value Based Bidding AutoML Regression Model" # Use `prediction_type` to "regression" for training models that predict a numerical value. For classification models, use "classification" and you will # also get the probability likelihood for that class. prediction_type: "regression" @@ -558,7 +552,8 @@ vertex_ai: project: "${project_id}" location: "${cloud_region}" data_location: "${location}" - model_display_name: "value-based-bidding-training-pl-model" # must match the model name defined in the training pipeline. for now it is {NAME_OF_PIPELINE}-model + # The model name must match the model name defined in the training pipeline. + model_display_name: "value-based-bidding-model" model_metric_name: "meanAbsoluteError" #'rootMeanSquaredError', 'meanAbsoluteError', 'meanAbsolutePercentageError', 'rSquared', 'rootMeanSquaredLogError' # The `model_metric_threshold` parameter defines what is the maximum acceptable value for the `model_metric_name` so that the model can be selected. # If the actual models metrics values are higher than this limit, no models will be selected and the pipeline is going to fail. @@ -606,6 +601,10 @@ vertex_ai: apply_feature_selection_tuning: true run_evaluation: true run_distillation: false + # The Purchase Propensity model name + model_display_name: "purchase-propensity-model" + # The Purchase Propensity model description + model_description: "Purchase Propensity Classification AutoML Model" # Use `prediction_type` to "regression" for training models that predict a numerical value. For classification models, use "classification" and you will # also get the probability likelihood for that class. prediction_type: "classification" @@ -682,7 +681,8 @@ vertex_ai: project_id: "${project_id}" location: "${cloud_region}" job_name_prefix: "propensity-prediction-pl-" - model_display_name: "propensity-training-pl-model" # must match the model name defined in the training pipeline. for now it is {NAME_OF_PIPELINE}-model + # The Purchase Propensity model name to be used for prediction + model_display_name: "purchase-propensity-model" model_metric_name: "logLoss" # The `model_metric_threshold` parameter defines what is the maximum acceptable value for the `model_metric_name` so that the model can be selected. # If the actual models metrics values are higher than this limit, no models will be selected and the pipeline is going to fail. @@ -750,8 +750,10 @@ vertex_ai: km_min_rel_progress: 0.01 km_warm_start: "FALSE" model_dataset_id: "${project_id}.audience_segmentation" # to also include project.dataset - model_name_bq_prefix: "audience_segmentation_model" # must match the model name defined in the training pipeline. 
for now it is {NAME_OF_PIPELINE}-model - vertex_model_name: "audience_segmentation_model" + # The name of the Audience Segmentation Clustering model + model_name_bq_prefix: "audience-segmentation-model" + # The name of the Audience Segmentation Clustering model + vertex_model_name: "audience-segmentation-model" # This is the training dataset provided during the training routine. # The schema in this table or view must match the schema in the json files. # Take into consideration the `excluded_features` list below. They won't be used for training. @@ -780,7 +782,8 @@ vertex_ai: project_id: "${project_id}" location: "${location}" model_dataset_id: "${project_id}.audience_segmentation" # to also include project.dataset - model_name_bq_prefix: "audience_segmentation_model" # must match the model name defined in the training pipeline. for now it is {NAME_OF_PIPELINE}-model + # The name of the Audience Segmentation Clustering model + model_name_bq_prefix: "audience-segmentation-model" # must match the model name defined in the training pipeline. model_metric_name: "davies_bouldin_index" # one of davies_bouldin_index , mean_squared_distance # This is the model metric value that will tell us if the model has good quality or not. # Lower index values indicate a better clustering result. The index is improved (lowered) by increased separation between clusters and decreased variation within clusters. @@ -823,12 +826,16 @@ vertex_ai: location: "${cloud_region}" project_id: "${project_id}" dataset: "auto_audience_segmentation" - model_name: "interest-cluster-model" - training_table: "auto_audience_segmentation_training_15" + # The name of the Auto Audience Segmentation Clustering model or the Interest based Audience Segmentation model name + model_name: "auto-audience-segmentation-model" + training_table: "v_auto_audience_segmentation_training_15" p_wiggle: 10 min_num_clusters: 3 bucket_name: "${project_id}-custom-models" image_uri: "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-2:latest" + #exclude_features: + # - user_pseudo_id + # - user_id pipeline_parameters_substitutions: null prediction: name: "auto-segmentation-prediction-pl" @@ -846,7 +853,7 @@ vertex_ai: pipeline_parameters: project_id: "${project_id}" location: "${cloud_region}" - model_name: "interest-cluster-model" + model_name: "auto-audience-segmentation-model" bigquery_source: "${project_id}.auto_audience_segmentation.v_auto_audience_segmentation_inference_15" bigquery_destination_prefix: "${project_id}.auto_audience_segmentation" # These are parameters to trigger the Activation Application Dataflow. @@ -896,6 +903,10 @@ vertex_ai: apply_feature_selection_tuning: true run_evaluation: true run_distillation: false + # The Purchase Propensity model name + model_display_name: "propensity-clv-training-model" + # The Purchase Propensity model description + model_description: "Purchase Propensity Classification AutoML Model for Customer LTV" # Use `prediction_type` to "regression" for training models that predict a numerical value. For classification models, use "classification" and you will # also get the probability likelihood for that class. 
prediction_type: "classification" @@ -997,6 +1008,10 @@ vertex_ai: apply_feature_selection_tuning: true run_evaluation: true run_distillation: false + # The Customer Lifetime Value model name + model_display_name: "customer-ltv-training-model" + # The Customer Lifetime Value model description + model_description: "Customer Lifetime Value Regression AutoML Model" # Use `prediction_type` to "regression" for training models that predict a numerical value. For classification models, use "classification" and you will # also get the probability likelihood for that class. prediction_type: "regression" @@ -1050,7 +1065,8 @@ vertex_ai: #- feature_date - user_pseudo_id - user_id - pipeline_parameters_substitutions: null + pipeline_parameters_substitutions: null + # The prediction pipeline uses a purchase propensity classification model and a customer lifetime value regression model prediction: name: "clv-prediction-pl" job_id_prefix: "clv-prediction-pl-" @@ -1070,7 +1086,8 @@ vertex_ai: location: "${cloud_region}" purchase_job_name_prefix: "propensity-prediction-pl-" clv_job_name_prefix: "clv-prediction-pl-" - purchase_model_display_name: "propensity-clv-training-pl-model" # must match the model name defined in the training pipeline. for now it is {NAME_OF_PIPELINE}-model + # The name of the trained Purchase Propensity model to be used for Customer LTV prediction + purchase_model_display_name: "propensity-clv-training-model" # must match the model name defined in the training pipeline. for now it is {NAME_OF_PIPELINE}-model # The `purchase_model_metric_threshold` parameter defines what is the maximum acceptable value for the `purchase_model_metric_name` so that the model can be selected. # If the actual models metrics values are higher than this limit, no models will be selected and the pipeline is going to fail. purchase_model_metric_name: "logLoss" @@ -1081,7 +1098,8 @@ vertex_ai: # Probabilities lower than `threashold` sets the LTV prediction pipeline to use 0.0 as the customer LTV value gain. threashold: 0.01 positive_label: "1" - clv_model_display_name: "clv-training-pl-model" # must match the model name defined in the training pipeline. for now it is {NAME_OF_PIPELINE}-model + # The name of the trained Customer Lifetime Value model to be used for Customer LTV prediction + clv_model_display_name: "customer-ltv-training-model" # must match the model name defined in the training pipeline. for now it is {NAME_OF_PIPELINE}-model # The `purchase_model_metric_threshold` parameter defines what is the maximum acceptable value for the `purchase_model_metric_name` so that the model can be selected. # If the actual models metrics values are higher than this limit, no models will be selected and the pipeline is going to fail. clv_model_metric_name: "meanAbsoluteError" #'rootMeanSquaredError', 'meanAbsoluteError', 'meanAbsolutePercentageError', 'rSquared', 'rootMeanSquaredLogError' @@ -1135,7 +1153,7 @@ vertex_ai: schedule: # The `cron` is the cron schedule. Make sure you review the TZ=America/New_York timezone. # More information can be found at https://cloud.google.com/scheduler/docs/configuring/cron-job-schedules. - cron: "TZ=America/New_York 0 8 * * *" + cron: "TZ=America/New_York 0 8-23/2 * * *" # The `max_concurrent_run_count` defines the maximum number of concurrent pipeline runs. 
max_concurrent_run_count: 1 start_time: null diff --git a/docs/VAIS_prompt_instructions.md b/docs/VAIS_prompt_instructions.md new file mode 100644 index 00000000..54c52918 --- /dev/null +++ b/docs/VAIS_prompt_instructions.md @@ -0,0 +1,17 @@ +You're a marketing analyst who implemented a solution that solves for marketing analytics use cases, such as: purchase propensity, customer lifetime value, behavioural audience segmentation, interest based audience segmentation and aggregated value based bidding. Given the conversation between a user and a helpful assistant and some search results, your goal is to create a final answer for the assistant. The reserved tags and must NEVER be used in your answer. + + + +Marketing Analytics Jumpstart consists of an easy, extensible and automated implementation of an end-to-end solution that enables Marketing Technology teams to store, transform, enrich with 1PD and analyze marketing data, and programmatically send predictive events to Google Analytics 4 to support conversion optimization and remarketing campaigns. +The solution uses these Google Cloud Products: BigQuery, Dataform, Vertex AI, Dataflow, Cloud Functions and Looker Studio. +There are several documentation files you must read carefully to understand the solution and answer most of the questions. + + +- Read with attention all source code files, markdown documentation files and code documentation comments and understand the solution, its architecture and how it works. +- Identify all the configurations files, the terraform resource definitions files or the Python/SQL files. +-- For the Python and SQL files, find out which Google Cloud products are being used and which architecture component they belong to. +-- Or else, for the configuration files, use the important context information provided to understand which parameters are available to change and how they affect the resources deployed. +- Identify the architecture components that each text segment belongs to. +- Read with attention the source code segments and the code documentation provided in comments blocks across the text extracts. +- Answer the question using the knowledge you obtained. If you're not certain, use the information you may find on your own knowledge or search online. If still in doubt, try rephrasing the user question and if you have an answer for it. Finally, when uncertain answer the user you're not sure about how to answer the question. +- The answer should use all relevant information from the retrieval text segments, the provided context or the online search results, not introduce any additional information, and use exactly the same words as the search results when possible. The assistant's answer should be no more than 5 sentences. \ No newline at end of file diff --git a/docs/config.md b/docs/config.md deleted file mode 100644 index 180e1797..00000000 --- a/docs/config.md +++ /dev/null @@ -1,130 +0,0 @@ -Configuration for Google Cloud Project and Services - -## Overall Configuration - -google_cloud_project: - -project_id: Placeholder for the actual Google Cloud project ID. -project_name: Placeholder for the project name. -project_number: Placeholder for the project number. -region: Placeholder for the cloud region where resources will be deployed. -cloud_build: - -project_id: Placeholder for the project ID, used for Cloud Build configuration. -region: Placeholder for the region where Cloud Build will run. -github: -owner: Placeholder for the GitHub owner of the pipelines repository. 
-repo_name: Placeholder for the pipelines repository name. -trigger_branch: Specifies the branch that will trigger Cloud Build pipelines (set to "dev"). -build_file: Specifies the path to the Cloud Build configuration file (cloudbuild/pipelines.yaml). -_REPOSITORY_GCP_PROJECT, _REPOSITORY_NAME, _REPOSITORY_BRANCH, _GCR_HOSTNAME, _BUILD_REGION: Internal variables likely used by Cloud Build for repository and region information. -container: - -builder: Defines configurations for base container images used for building and formatting. -base: -from_image: Specifies the base image for building (python:3.7-alpine3.7). -base_image_name, base_image_prefix: Placeholders for custom image names. -zetasql: -from_image: Specifies the base image for ZetaSQL formatting (wbsouza/zetasql-formatter:latest). -base_image_name, base_image_prefix: Placeholders for custom image names. -container_registry_hostname: Placeholder for the container registry hostname. -container_registry_region: Placeholder for the container registry region. -artifact_registry: - -pipelines_repo: -name: Name of the Artifact Registry repository for pipelines artifacts. -region: Region of the repository. -project_id: Project ID associated with the repository. -pipelines_docker_repo: -name: Name of the Artifact Registry repository for Docker images. -region: Region of the repository. -project_id: Project ID associated with the repository. -dataflow: - -worker_service_account_id: ID of the service account used by Dataflow workers. -worker_service_account: Full email address of the service account. - -## Vertex AI Configuration - -1. Components: - -Vertex AI: The platform for building and managing machine learning pipelines and models. -Feature Store: A centralized repository for storing and managing features used in model training and prediction. -Pipelines: Automated workflows that orchestrate various tasks, including data preparation, feature engineering, model training, evaluation, and prediction. -Models: Machine learning models trained to make predictions or classifications. -2. Pipelines: - -Feature Creation Pipelines: -feature-creation-auto-audience-segmentation -feature-creation-audience-segmentation -feature-creation-purchase-propensity -feature-creation-customer-ltv -Purpose: Prepare data for model training and prediction by generating necessary features. -Training Pipelines: -propensity-training-pl -segmentation-training-pl -propensity_clv-training-pl -Purpose: Train machine learning models using prepared data. -Prediction Pipelines: -propensity-prediction-pl -segmentation-prediction-pl -auto-segmentation-prediction-pl -Purpose: Generate predictions using trained models on new data. -3. Key Parameters: - -Project ID: Identifier for the Google Cloud project. -Location: Region where pipelines and resources are located. -Schedules: Cron expressions defining pipeline execution times. -Data Sources: BigQuery tables containing training and prediction data. -Features: Columns used for model training and prediction. -Models: Trained machine learning models. -4. Workflows: - -Propensity: -Train a model to predict purchase likelihood for users. -Generate predictions for new users. -Segmentation: -Train a clustering model to group users based on similar characteristics. -Assign new users to their respective segments. -Auto Segmentation: -Generates user segments based on interest using a pre-trained model. -Propensity CLV: -Combines purchase propensity and customer lifetime value (LTV) predictions. -5. 
Vertex AI Components: - -Tabular Workflows: Pre-built components for training and deploying tabular models. -Custom Components: User-defined components for specific tasks. - - -## BigQuery Configuration - -Datasets: - -feature_store: Houses various feature tables, serving as the central repository for features used in machine learning models. -purchase_propensity, customer_lifetime_value, audience_segmentation, auto_audience_segmentation: Dedicated datasets for specific use cases, likely containing training data, model artifacts, and inference results. - -Tables: - -user_dimensions, user_lifetime_dimensions, user_lookback_metrics, user_rolling_window_lifetime_metrics, user_rolling_window_metrics, user_scoped_lifetime_metrics, user_scoped_metrics, user_scoped_segmentation_metrics, user_segmentation_dimensions, user_session_event_aggregated_metrics: Feature tables within the feature_store, each capturing distinct aspects of user behavior and attributes. -purchase_propensity_label, customer_lifetime_value_label: Tables containing labels for supervised learning tasks (purchase propensity and customer lifetime value prediction). -purchase_propensity_inference_preparation, customer_lifetime_value_inference_preparation, audience_segmentation_inference_preparation, auto_audience_segmentation_inference_preparation: Tables likely used for preparing data for model inference. - -Stored Procedures: - -audience_segmentation_training_preparation, customer_lifetime_value_training_preparation, purchase_propensity_training_preparation, user_dimensions, user_lifetime_dimensions, user_lookback_metrics, user_rolling_window_lifetime_metrics, user_rolling_window_metrics, user_scoped_lifetime_metrics, user_scoped_metrics, user_scoped_segmentation_metrics, user_segmentation_dimensions, customer_lifetime_value_label, purchase_propensity_label: Procedures responsible for populating feature tables, generating labels, and potentially preparing data for training and inference. - -Queries: - -invoke_purchase_propensity_training_preparation, invoke_audience_segmentation_training_preparation, invoke_customer_lifetime_value_training_preparation, invoke_backfill_...: Queries that call the stored procedures to execute various data preparation and feature engineering tasks. - -Key Points: - -The YAML file outlines configurations for various Google Cloud services, likely within a pipeline setup. -It extensively uses placeholders to be populated with actual values during deployment. -Cloud Build is set up to trigger on changes in the "dev" branch of a specified GitHub repository. -Base container images are defined for building and ZetaSQL formatting tasks. -Artifact Registry repositories are configured for storing pipelines artifacts and Docker images. -A specific service account is designated for Dataflow workers. -The configuration demonstrates a well-structured BigQuery setup for managing features and supporting different machine learning use cases. -The use of stored procedures and queries suggests a modular and reusable approach to data processing. -The separation of datasets for different use cases aligns with best practices for data organization. \ No newline at end of file diff --git a/docs/ml_specs.md b/docs/ml_specs.md index 8ad74f66..e3df793a 100644 --- a/docs/ml_specs.md +++ b/docs/ml_specs.md @@ -1,21 +1,46 @@ -# Machine Learning Specifications -This document details the features employed for training and prediction in the out-of-the-box use cases supported by this solution. 
+# Machine Learning (ML) Technical Design -## Pre-requisites -### GA4 events requirements -The out-of-box ML driven use cases requires the following events to existing in the GA4 export data. The events feed into features for the ML trainings and inference. +## Introduction +This document details the features and models used in the training, prediction and explanation pipelines for the ML driven use cases supported by this solution. + +### GA4 events tagging requirement +The out-of-box ML driven use cases requires the following events tagged in the Google Analytics 4 (GA4) property, exported to BigQuery using the GA4 BigQuery Export. The export has to be set to run daily. | Event | Event Type | Doc Ref. | | -------- | ------- | --------- | -| purchase | Ecommerce Measurement events | https://developers.google.com/analytics/devguides/collection/ga4/set-up-ecommerce | -| view_item | Ecommerce Measurement events | https://developers.google.com/analytics/devguides/collection/ga4/ecommerce | -| view_item_list | Ecommerce Measurement events | https://developers.google.com/analytics/devguides/collection/ga4/ecommerce | -| add_to_cart | Ecommerce Measurement events | https://developers.google.com/analytics/devguides/collection/ga4/ecommerce | -| begin_checkout | Ecommerce Measurement events | https://developers.google.com/analytics/devguides/collection/ga4/ecommerce | -| refund | Ecommerce Measurement events | https://developers.google.com/analytics/devguides/collection/ga4/ecommerce | -| first_visit | Automatically collected events | https://support.google.com/analytics/answer/9234069?hl=en | -| page_view | Automatically collected events | https://support.google.com/analytics/answer/9234069?hl=en | -| click | Automatically collected events | https://support.google.com/analytics/answer/9234069?hl=en | +| purchase | Ecommerce Measurement event | https://developers.google.com/analytics/devguides/collection/ga4/set-up-ecommerce | +| view_item | Ecommerce Measurement event | https://developers.google.com/analytics/devguides/collection/ga4/ecommerce | +| view_item_list | Ecommerce Measurement event | https://developers.google.com/analytics/devguides/collection/ga4/ecommerce | +| add_to_cart | Ecommerce Measurement event | https://developers.google.com/analytics/devguides/collection/ga4/ecommerce | +| begin_checkout | Ecommerce Measurement event | https://developers.google.com/analytics/devguides/collection/ga4/ecommerce | +| refund | Ecommerce Measurement event | https://developers.google.com/analytics/devguides/collection/ga4/ecommerce | +| first_visit | Automatically collected event | https://support.google.com/analytics/answer/9234069?hl=en | +| page_view | Automatically collected event | https://support.google.com/analytics/answer/9234069?hl=en | +| click | Automatically collected event | https://support.google.com/analytics/answer/9234069?hl=en | + +## GA4 User identifiers +The Marketing Analytics Jumpstart solution uses Google Analytics 4 user pseudo IDs as the primary identifier for users. It also includes the user IDs as well. Google Analytics 4 uses User-ID to associate identifiers with individual users, enabling you to connect their behavior across sessions, devices, and platforms. [User-ID - Analytics Help](https://support.google.com/analytics/answer/9355972?hl=en) You can use User-ID to create remarketing audiences and join Analytics data with first-party data, such as CRM data. 
[Best practices for User-ID - Analytics Help](https://support.google.com/analytics/answer/12675187?hl=en) The User-ID feature is designed for use with Google Analytics technologies and must comply with the Analytics SDK/User-ID Feature Policy. Remember that User IDs sent to Google Analytics must be shorter than 256 characters. [Measure activity across platforms with User-ID - Analytics Help](https://support.google.com/analytics/answer/9213390?hl=en) + +Each user must have a unique User ID, and the `user_id` parameter must be the same each time a user visits your website. [About data segments that use User ID to advertise - Ads Help](https://support.google.com/google-ads/answer/9199250?hl=en) Analytics can also use device ID as an identity space. [GA4 Reporting identity - Analytics Help](https://support.google.com/analytics/answer/10976610?hl=en) + +## Modelling Principles + +The machine learning pipelines were designed following these modelling principles: + +- **Aggregated Value Based Bidding Training (VBB Training)**: This is a Tabular Workflow End-to-End AutoML pipeline whose modelling principle is to overfit the training data. Because of the small amount of data, feel free to duplicate the train subset as many times as you need until you reach the minimum of 1000 examples required by AutoML. We use one full copy of the train subset as the evaluation and test subsets, so that the model only stops training once it overfits the training data (preventing early stopping from happening). +- **Aggregated Value Based Bidding Explanation (VBB Explanation)**: This is a custom pipeline in which we retrieve the latest trained model's Shapley values, generated during the Evaluation step of the AutoML training. These feature importance values are then written to a BigQuery table for reporting. The values stay relevant for a few weeks, which is why there is no need to train the model more often than once a week. +- **Segmentation Training (Demographic based segmentation training)**: This is a custom pipeline in which we train a BigQuery ML KMEANS model, with Vertex AI as the model registry. The modelling principle is to cluster users on demographic behaviour using metrics aggregated over a 15-day lookback window (by default; double-check the `interval_min_date` parameter value inside the `bigquery.query.invoke_audience_segmentation_training_preparation` block in the `config.yaml` file). +- **Segmentation Prediction (Demographic based segmentation prediction)**: This custom pipeline gets the latest trained audience segmentation model, calls its predict method and writes the prediction values to BigQuery. The predicted clusters are useful only for a few days (7 days is a good assumption), which is why it is important to retrain the audience segmentation model frequently. +- **Auto Segmentation Training (Interest based segmentation training)**: This is a custom training pipeline which trains a scikit-learn KMEANS model and applies the elbow method to find the appropriate number of clusters.
The modelling principle is to suggest clusters of users who have been navigating specific sections of your website that reach a cumulative traffic percentage of 35% (double-check the parameter value defined in the section `vertex_ai.pipelines.feature-creation-auto-audience-segmentation.execution.pipeline_parameters.perc_keep` in the `config.yaml` file), using the regular expression defined in the `vertex_ai.pipelines.feature-creation-auto-audience-segmentation.execution.pipeline_parameters.reg_expression` parameter as the criterion to count the `page_path` event parameter values. +- **Auto Segmentation Prediction (Interest based segmentation prediction)**: This is a custom pipeline which gets the latest trained auto audience segmentation model, runs the predict method and writes all prediction values to a BigQuery table. The predictions are useful only for a few days; you may want to retrain the model weekly or bi-weekly to capture important changes in users' traffic behaviour. +- **Propensity Training (Purchase Propensity Training)**: This is a Tabular Workflow AutoML End-to-End training pipeline whose main objective is to classify users into two classes (1 or 0) using the label `will_purchase`, calculated over the next 15 days (double-check the parameter value in `bigquery.query.invoke_purchase_propensity_label.interval_input_date`). The modelling principle is to train a classifier on user metrics aggregated over the past 30 days (double-check the pipeline parameter `vertex_ai.pipelines.propensity.training.pipeline_parameters.target_column`, and the `sql/procedure/purchase_propensity_label.sqlx` stored procedure file to understand how the `will_purchase` feature label is calculated) to predict 15 days ahead (as default, but configurable). The idea is to avoid overfitting or a perfect fit; what we actually want are accurate propensity probabilities, scores or likelihoods so that users can be ranked by them. Users with a higher rank are more likely to purchase, whereas users with a lower rank are less likely to purchase. + +*Note*: The `propensity_clv.training` pipeline is similar to, but different from, this one. What differs is the look-ahead window interval and how it is used. The `propensity_clv.training` pipeline is used to classify, not to rank, since we want to know which users are going to buy in order to predict their Lifetime Value gain. This model's predictions are used in the Customer Lifetime Value Prediction pipeline. + +- **Propensity Prediction (Purchase Propensity Prediction)**: This is a custom pipeline in which we predict the propensity of a user to purchase (as default, but configurable). The pipeline gets the latest best performing model, generates the predictions and saves them into a BigQuery table. The propensity ranking should not change dramatically from one day to the next; however, depending on the traffic volume, you will need to retrain the model weekly or bi-weekly to capture the behaviours of frequent and non-frequent purchasers.
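Several of the principles in this list ask you to double-check window and threshold parameters in the generated `config.yaml`. A minimal sketch of such a check, assuming PyYAML is installed, that the file has already been rendered from `config/config.yaml.tftpl`, and that the keys follow the dotted paths quoted in these items (the path below is a placeholder):

```python
import yaml

# Placeholder path to the rendered configuration (generated from
# config/config.yaml.tftpl during deployment); adjust to your environment.
CONFIG_PATH = "config/config.yaml"

with open(CONFIG_PATH) as f:
    cfg = yaml.safe_load(f)

# Look-ahead window (days) used to compute the will_purchase label.
label_block = cfg["bigquery"]["query"]["invoke_purchase_propensity_label"]
print("purchase propensity look-ahead:", label_block["interval_input_date"])

# Cumulative traffic share kept by the interest based segmentation feature pipeline.
auto_seg = cfg["vertex_ai"]["pipelines"]["feature-creation-auto-audience-segmentation"]
print("perc_keep:", auto_seg["execution"]["pipeline_parameters"]["perc_keep"])
```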
+- **Customer LTV Training (Customer Lifetime Value Training)**: This is a Tabular Workflow AutoML End-to-End training pipeline which trains a regression model to predict the lifetime value gain of each user over the next 30 days (double-check the parameter value in `bigquery.query.invoke_customer_lifetime_value_training_preparation.interval_max_date` in the `config.yaml` file), looking back at metrics aggregated over the past 180 days (double-check the parameter value in `bigquery.query.invoke_customer_lifetime_value_training_preparation.interval_min_date` in the `config.yaml` file). The modelling principle is to predict the LTV gain for every user; in case the user doesn't purchase, the LTV gain is set to 0.0 (double-check the query logic used in the `sql/procedure/customer_lifetime_value_label.sqlx` file). +- **Customer LTV Prediction (Customer Lifetime Value Prediction)**: This is a custom prediction pipeline that uses the CLV regression model and the propensity CLV model to predict the LTV gain for each user. The idea is to split the effort into two steps: first, we predict which users are going to purchase in the next 30 days (default, but configurable); next, we predict the LTV gain for those users only, and for the predicted non-purchasers we set the LTV gain to 0.0. These predictions stay relevant for several weeks; you will retrain the model once you have more conversion events that increase the users' LTV past one or two weeks. + ## Machine Learning Feature Reference @@ -23,39 +48,12 @@ The out-of-box ML driven use cases requires the following events to existing in Target field: | Target field | Source Field from GA4 Event | | -------- | -------- | -| will_purchase | A binary value (1 if any purchase event occurred, 0 otherwise) for each user in the predicting time window| +| will_purchase | A binary value (1 if any `purchase` event occurred, 0 otherwise) for each user in the predicting time window| Features: | Feature | Source Field from GA4 Event | | -------- | -------- | -| active_users_past_1_day | Aggregate metric derived from a sliding window over the previous X days, representing a binary value (1 if any events occurred, 0 otherwise) for each user. | -| active_users_past_15_30_day | 〃 | -| active_users_past_2_day | 〃 | -| active_users_past_3_day | 〃 | -| active_users_past_4_day | 〃 | -| active_users_past_5_day | 〃 | -| active_users_past_6_day | 〃 | -| active_users_past_7_day | 〃 | -| active_users_past_8_14_day | 〃 | -| add_to_carts_past_1_day | Aggregate metric derived from a sliding window over the previous X days, representing a binary value (1 if any `add_to_cart` events occurred, 0 otherwise) for each user. | -| add_to_carts_past_15_30_day | 〃 | -| add_to_carts_past_2_day | 〃 | -| add_to_carts_past_3_day | 〃 | -| add_to_carts_past_4_day | 〃 | -| add_to_carts_past_5_day | 〃 | -| add_to_carts_past_6_day | 〃 | -| add_to_carts_past_7_day | 〃 | -| add_to_carts_past_8_14_day | 〃 | -| checkouts_past_1_day | Aggregate metric derived from a sliding window over the previous X days, summerized to the number of `begin_checkout` events occurred for each user.
| -| checkouts_past_15_30_day | 〃 | -| checkouts_past_2_day | 〃 | -| checkouts_past_3_day | 〃 | -| checkouts_past_4_day | 〃 | -| checkouts_past_5_day | 〃 | -| checkouts_past_6_day | 〃 | -| checkouts_past_7_day | 〃 | -| checkouts_past_8_14_day | 〃 | | device_category | [GA4 event - device record](https://support.google.com/analytics/answer/7029846?hl=en#zippy=%2Cdevice) | | device_language | 〃 | | device_mobile_brand_name | 〃 | @@ -72,10 +70,38 @@ Features: | geo_metro | 〃 | | geo_region | 〃 | | geo_sub_continent | 〃 | -| has_signed_in_with_user_id | When the `user_id` field is set on events aggregated over each user | +| has_signed_in_with_user_id | Boolean representing the `user_id` field is set on events aggregated over each user | | last_traffic_source_medium | [GA4 event - traffic_source record](https://support.google.com/analytics/answer/7029846#zippy=%2Ctraffic-source) | | last_traffic_source_name | 〃 | | last_traffic_source_source | 〃 | +| user_ltv_revenue | SUM of `ecommerce.purchase_revenue_in_usd` over a period of X days for each user | +| active_users_past_1_day | Aggregate metric derived from a sliding window over the previous 1st day, representing a SUM of active sessions (number of sessions were long enough to be considered an active session, 0 otherwise) for each user. | +| active_users_past_15_30_day | Aggregate metric derived from a sliding window over the interval of the past 15 to 30 days, representing a SUM of active sessions (number of sessions were long enough to be considered an active session, 0 otherwise) for each user. | +| active_users_past_2_day | Aggregate metric derived from a sliding window over the previous 2nd day, representing a SUM of active sessions (number of sessions were long enough to be considered an active session, 0 otherwise) for each user. | +| active_users_past_3_day | Aggregate metric derived from a sliding window over the previous 3rd day, representing a SUM of active sessions number of sessions were long enough to be considered an active session, 0 otherwise) for each user. | +| active_users_past_4_day | Aggregate metric derived from a sliding window over the previous 4th day, representing a SUM of active sessions (number of sessions were long enough to be considered an active session, 0 otherwise) for each user. | +| active_users_past_5_day | Aggregate metric derived from a sliding window over the previous 5th day, representing a SUM of active sessions (number of sessions were long enough to be considered an active session, 0 otherwise) for each user. | +| active_users_past_6_day | Aggregate metric derived from a sliding window over the previous 6th day, representing a SUM of active sessions (number of sessions were long enough to be considered an active session, 0 otherwise) for each user. | +| active_users_past_7_day | Aggregate metric derived from a sliding window over the previous 7th day, representing a SUM of active sessions (number of sessions were long enough to be considered an active session, 0 otherwise) for each user. | +| active_users_past_8_14_day | Aggregate metric derived from a sliding window over the interval of the past 8 to 14 days, representing a SUM of active sessions (number of sessions were long enough to be considered an active session, 0 otherwise) for each user. | +| add_to_carts_past_1_day | Aggregate metric derived from a sliding window over the previous 1st day, representing a SUM of `add_to_cart` events (number of add_to_cart events, 0 otherwise) for each user. 
| +| add_to_carts_past_15_30_day | Aggregate metric derived from a sliding window over the interval of the past 15 to 30 days, representing a SUM of `add_to_cart` events (number of add_to_cart events, 0 otherwise) for each user. | +| add_to_carts_past_2_day | Aggregate metric derived from a sliding window over the previous 2nd day, representing a SUM of `add_to_cart` events (number of add_to_cart events, 0 otherwise) for each user. | +| add_to_carts_past_3_day | Aggregate metric derived from a sliding window over the previous 3rd day, representing a SUM of `add_to_cart` events (number of add_to_cart events, 0 otherwise) for each user. | +| add_to_carts_past_4_day | Aggregate metric derived from a sliding window over the previous 4th day, representing a SUM of `add_to_cart` events (number of add_to_cart events, 0 otherwise) for each user. | +| add_to_carts_past_5_day | Aggregate metric derived from a sliding window over the previous 5th day, representing a SUM of `add_to_cart` events (number of add_to_cart events, 0 otherwise) for each user. | +| add_to_carts_past_6_day | Aggregate metric derived from a sliding window over the previous 6th day, representing a SUM of `add_to_cart` events (number of add_to_cart events, 0 otherwise) for each user. | +| add_to_carts_past_7_day | Aggregate metric derived from a sliding window over the previous 7th day, representing a SUM of `add_to_cart` events (number of add_to_cart events, 0 otherwise) for each user. | +| add_to_carts_past_8_14_day | Aggregate metric derived from a sliding window over the interval of the past 8 to 14 days, representing a SUM of `add_to_cart` events (number of add_to_cart events, 0 otherwise) for each user. | +| checkouts_past_1_day | Aggregate metric derived from a sliding window over the previous 1st day, representing a SUM of `begin_checkout` events (number of begin_checkout events, 0 otherwise) for each user. | +| checkouts_past_15_30_day | Aggregate metric derived from a sliding window over the interval of the past 15 to 30 days, representing a SUM of `begin_checkout` events (number of begin_checkout events, 0 otherwise) for each user. | +| checkouts_past_2_day | Aggregate metric derived from a sliding window over the previous 2nd day, representing a SUM of `begin_checkout` events (number of begin_checkout events, 0 otherwise) for each user. | +| checkouts_past_3_day | Aggregate metric derived from a sliding window over the previous 3rd day, representing a SUM of `begin_checkout` events (number of begin_checkout events, 0 otherwise) for each user. | +| checkouts_past_4_day | Aggregate metric derived from a sliding window over the previous 4th day, representing a SUM of `begin_checkout` events (number of begin_checkout events, 0 otherwise) for each user. | +| checkouts_past_5_day | Aggregate metric derived from a sliding window over the previous 5th day, representing a SUM of `begin_checkout` events (number of begin_checkout events, 0 otherwise) for each user. | +| checkouts_past_6_day | Aggregate metric derived from a sliding window over the previous 6th day, representing a SUM of `begin_checkout` events (number of begin_checkout events, 0 otherwise) for each user. | +| checkouts_past_7_day | Aggregate metric derived from a sliding window over the previous 7th day, representing a SUM of `begin_checkout` events (number of begin_checkout events, 0 otherwise) for each user. 
| +| checkouts_past_8_14_day | Aggregate metric derived from a sliding window over the interval of the past 8 to 14 days, representing a SUM of `begin_checkout` events (number of begin_checkout events, 0 otherwise) for each user. | | purchases_past_1_day | Aggregate metric derived from a sliding window over the previous X days, summerized to the number of `purchase` events occurred for each user. | | purchases_past_15_30_day | 〃 | | purchases_past_2_day | 〃 | @@ -85,7 +111,6 @@ Features: | purchases_past_6_day | 〃 | | purchases_past_7_day | 〃 | | purchases_past_8_14_day | 〃 | -| user_ltv_revenue | Summarization of `ecommerce.purchase_revenue_in_usd` over a period of X days for each user | | view_items_past_1_day | Aggregate metric derived from a sliding window over the previous X days, summerized to the number of `view_item` events occurred for each user. | | view_items_past_15_30_day | 〃 | | view_items_past_2_day | 〃 | @@ -208,4 +233,8 @@ Features: | view_items_past_1_7_day | Aggregate metric derived from a sliding window over the previous X days, summerized to the number of `view_item` events occurred for each user. | | view_items_past_8_14_day | 〃 | | visits_past_1_7_day | Aggregate metric derived from a sliding window over the previous X days, summerized to the number of visits an user have had. | -| visits_past_8_14_day | 〃 | \ No newline at end of file +| visits_past_8_14_day | 〃 | + +### Auto Audience Segmentation Features +| Feature | Source Field from GA4 Event | +| -------- | -------- | diff --git a/env.sh.example b/env.sh.example deleted file mode 100644 index db5a3f5b..00000000 --- a/env.sh.example +++ /dev/null @@ -1,25 +0,0 @@ -#!/bin/bash -# Copyright 2022 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -export PIPELINE_FILES_GCS_PATH=gs:///pipelines -# pipeline template - update to any pipelines under the pipelines folder -# tensorflow or xgboost -export PIPELINE_TEMPLATE= -export VERTEX_CMEK_IDENTIFIER= # optional -export VERTEX_LOCATION=us-central1 -export VERTEX_NETWORK= # optional -export VERTEX_PIPELINE_ROOT=gs:///pipeline_root -export VERTEX_PROJECT_ID= -export VERTEX_SA_EMAIL=vertex-pipeline-runner@.iam.gserviceaccount.com \ No newline at end of file diff --git a/googleb3fdc576fe73874e.html b/googleb3fdc576fe73874e.html deleted file mode 100644 index 82759786..00000000 --- a/googleb3fdc576fe73874e.html +++ /dev/null @@ -1 +0,0 @@ -google-site-verification: googleb3fdc576fe73874e.html \ No newline at end of file diff --git a/infrastructure/terraform/modules/pipelines/pipelines.tf b/infrastructure/terraform/modules/pipelines/pipelines.tf index f9b09d82..11b9900a 100644 --- a/infrastructure/terraform/modules/pipelines/pipelines.tf +++ b/infrastructure/terraform/modules/pipelines/pipelines.tf @@ -267,7 +267,7 @@ resource "null_resource" "compile_feature_engineering_auto_audience_segmentation command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.feature-creation-auto-audience-segmentation.execution -o fe_auto_audience_segmentation.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f fe_auto_audience_segmentation.yaml -t ${self.triggers.tag} -t latest - ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.feature-creation-auto-audience-segmentation.execution + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.feature-creation-auto-audience-segmentation.execution -i fe_auto_audience_segmentation.yaml EOT working_dir = self.triggers.working_dir } @@ -281,7 +281,7 @@ resource "null_resource" "compile_feature_engineering_aggregated_value_based_bid pipelines_repo_id = google_artifact_registry_repository.pipelines-repo.id pipelines_repo_create_time = google_artifact_registry_repository.pipelines-repo.create_time source_content_hash = local.pipelines_content_hash - upstream_resource_dependency = null_resource.build_push_pipelines_components_image.id + upstream_resource_dependency = null_resource.compile_feature_engineering_auto_audience_segmentation_pipeline.id } # The provisioner block specifies the command that will be executed to compile and upload the pipeline. 
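Every pipeline in this Terraform file follows the same three-step provisioner pattern that these hunks touch: compile the pipeline definition, upload it to Artifact Registry, then create or update its schedule, which now also receives the compiled YAML via the new `-i` flag. A minimal standalone sketch of that sequence, assuming it is run from the repository's `python/` directory with the Poetry environment active and a rendered `config.yaml` (the paths and the chosen pipeline below are placeholders):

```python
# Hypothetical standalone equivalent of one provisioner's compile/upload/schedule steps.
import subprocess

CONFIG = "../config/config.yaml"                 # rendered from config.yaml.tftpl (placeholder path)
PARAM_PATH = "vertex_ai.pipelines.propensity.training"
OUTPUT_YAML = "propensity_training.yaml"
TAG = "latest"

steps = [
    ["python", "-m", "pipelines.compiler", "-c", CONFIG, "-p", PARAM_PATH, "-o", OUTPUT_YAML],
    ["python", "-m", "pipelines.uploader", "-c", CONFIG, "-f", OUTPUT_YAML, "-t", TAG],
    # The scheduler step now also receives the compiled pipeline YAML via -i.
    ["python", "-m", "pipelines.scheduler", "-c", CONFIG, "-p", PARAM_PATH, "-i", OUTPUT_YAML],
]

for cmd in steps:
    subprocess.run(cmd, check=True)  # fail fast, mirroring the Terraform provisioner behaviour
```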
@@ -291,7 +291,7 @@ resource "null_resource" "compile_feature_engineering_aggregated_value_based_bid command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.feature-creation-aggregated-value-based-bidding.execution -o fe_agg_vbb.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f fe_agg_vbb.yaml -t ${self.triggers.tag} -t latest - ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.feature-creation-aggregated-value-based-bidding.execution + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.feature-creation-aggregated-value-based-bidding.execution -i fe_agg_vbb.yaml EOT working_dir = self.triggers.working_dir } @@ -305,7 +305,7 @@ resource "null_resource" "compile_feature_engineering_audience_segmentation_pipe pipelines_repo_id = google_artifact_registry_repository.pipelines-repo.id pipelines_repo_create_time = google_artifact_registry_repository.pipelines-repo.create_time source_content_hash = local.pipelines_content_hash - upstream_resource_dependency = null_resource.compile_feature_engineering_auto_audience_segmentation_pipeline.id + upstream_resource_dependency = null_resource.compile_feature_engineering_aggregated_value_based_bidding_pipeline.id } # The provisioner block specifies the command that will be executed to compile and upload the pipeline. @@ -315,7 +315,7 @@ resource "null_resource" "compile_feature_engineering_audience_segmentation_pipe command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.feature-creation-audience-segmentation.execution -o fe_audience_segmentation.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f fe_audience_segmentation.yaml -t ${self.triggers.tag} -t latest - ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.feature-creation-audience-segmentation.execution + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.feature-creation-audience-segmentation.execution -i fe_audience_segmentation.yaml EOT working_dir = self.triggers.working_dir } @@ -339,7 +339,7 @@ resource "null_resource" "compile_feature_engineering_purchase_propensity_pipeli command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.feature-creation-purchase-propensity.execution -o fe_purchase_propensity.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f fe_purchase_propensity.yaml -t ${self.triggers.tag} -t latest - ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.feature-creation-purchase-propensity.execution + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.feature-creation-purchase-propensity.execution -i fe_purchase_propensity.yaml EOT working_dir = self.triggers.working_dir } @@ -363,7 +363,7 @@ resource "null_resource" 
"compile_feature_engineering_customer_lifetime_value_pi command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.feature-creation-customer-ltv.execution -o fe_customer_ltv.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f fe_customer_ltv.yaml -t ${self.triggers.tag} -t latest - ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.feature-creation-customer-ltv.execution + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.feature-creation-customer-ltv.execution -i fe_customer_ltv.yaml EOT working_dir = self.triggers.working_dir } @@ -388,7 +388,7 @@ resource "null_resource" "compile_propensity_training_pipelines" { command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.propensity.training -o propensity_training.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f propensity_training.yaml -t ${self.triggers.tag} -t latest - ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.propensity.training + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.propensity.training -i propensity_training.yaml EOT working_dir = self.triggers.working_dir } @@ -409,7 +409,7 @@ resource "null_resource" "compile_propensity_prediction_pipelines" { command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.propensity.prediction -o propensity_prediction.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f propensity_prediction.yaml -t ${self.triggers.tag} -t latest - ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.propensity.prediction + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.propensity.prediction -i propensity_prediction.yaml EOT working_dir = self.triggers.working_dir } @@ -430,7 +430,7 @@ resource "null_resource" "compile_propensity_clv_training_pipelines" { command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.propensity_clv.training -o propensity_clv_training.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f propensity_clv_training.yaml -t ${self.triggers.tag} -t latest - ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.propensity_clv.training + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.propensity_clv.training -i propensity_clv_training.yaml EOT working_dir = self.triggers.working_dir } @@ -451,7 +451,7 @@ resource "null_resource" "compile_clv_training_pipelines" { command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c 
${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.clv.training -o clv_training.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f clv_training.yaml -t ${self.triggers.tag} -t latest - ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.clv.training + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.clv.training -i clv_training.yaml EOT working_dir = self.triggers.working_dir } @@ -472,7 +472,7 @@ resource "null_resource" "compile_clv_prediction_pipelines" { command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.clv.prediction -o clv_prediction.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f clv_prediction.yaml -t ${self.triggers.tag} -t latest - ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.clv.prediction + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.clv.prediction -i clv_prediction.yaml EOT working_dir = self.triggers.working_dir } @@ -493,7 +493,7 @@ resource "null_resource" "compile_segmentation_training_pipelines" { command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.segmentation.training -o segmentation_training.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f segmentation_training.yaml -t ${self.triggers.tag} -t latest - ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.segmentation.training + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.segmentation.training -i segmentation_training.yaml EOT working_dir = self.triggers.working_dir } @@ -514,7 +514,7 @@ resource "null_resource" "compile_segmentation_prediction_pipelines" { command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.segmentation.prediction -o segmentation_prediction.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f segmentation_prediction.yaml -t ${self.triggers.tag} -t latest - ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.segmentation.prediction + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.segmentation.prediction -i segmentation_prediction.yaml EOT working_dir = self.triggers.working_dir } @@ -535,7 +535,7 @@ resource "null_resource" "compile_auto_segmentation_training_pipelines" { command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.auto_segmentation.training -o auto_segmentation_training.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f 
auto_segmentation_training.yaml -t ${self.triggers.tag} -t latest - ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.auto_segmentation.training + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.auto_segmentation.training -i auto_segmentation_training.yaml EOT working_dir = self.triggers.working_dir } @@ -556,7 +556,7 @@ resource "null_resource" "compile_auto_segmentation_prediction_pipelines" { command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.auto_segmentation.prediction -o auto_segmentation_prediction.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f auto_segmentation_prediction.yaml -t ${self.triggers.tag} -t latest - ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.auto_segmentation.prediction + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.auto_segmentation.prediction -i auto_segmentation_prediction.yaml EOT working_dir = self.triggers.working_dir } @@ -577,7 +577,7 @@ resource "null_resource" "compile_value_based_bidding_training_pipelines" { command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.value_based_bidding.training -o vbb_training.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f vbb_training.yaml -t ${self.triggers.tag} -t latest - ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.value_based_bidding.training + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.value_based_bidding.training -i vbb_training.yaml EOT working_dir = self.triggers.working_dir } @@ -598,7 +598,7 @@ resource "null_resource" "compile_value_based_bidding_explanation_pipelines" { command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.value_based_bidding.explanation -o vbb_explanation.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f vbb_explanation.yaml -t ${self.triggers.tag} -t latest - ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.value_based_bidding.explanation + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.value_based_bidding.explanation -i vbb_explanation.yaml EOT working_dir = self.triggers.working_dir } @@ -619,7 +619,7 @@ resource "null_resource" "compile_reporting_preparation_aggregate_predictions_pi command = <<-EOT ${var.poetry_run_alias} python -m pipelines.compiler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.reporting_preparation.execution -o reporting_preparation.yaml ${var.poetry_run_alias} python -m pipelines.uploader -c ${local.config_file_path_relative_python_run_dir} -f reporting_preparation.yaml -t ${self.triggers.tag} -t latest - 
${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.reporting_preparation.execution + ${var.poetry_run_alias} python -m pipelines.scheduler -c ${local.config_file_path_relative_python_run_dir} -p vertex_ai.pipelines.reporting_preparation.execution -i reporting_preparation.yaml EOT working_dir = self.triggers.working_dir } diff --git a/mypy.ini b/mypy.ini index e10a48ec..9cc5ca0f 100644 --- a/mypy.ini +++ b/mypy.ini @@ -1,3 +1,17 @@ +# Copyright 2021 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + [mypy] -python_version = 3.7 -namespace_packages = True \ No newline at end of file +python_version = 3.10 +namespace_packages = True diff --git a/poetry.toml b/poetry.toml index c65d3866..92e5bbb5 100644 --- a/poetry.toml +++ b/poetry.toml @@ -1,3 +1,17 @@ +# Copyright 2021 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + [virtualenvs] in-project = true create = true diff --git a/pyproject.toml b/pyproject.toml index d821e26c..36a15a83 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,3 +1,17 @@ +# Copyright 2021 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
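Each of the Terraform `local-exec` provisioners above runs the same three-step sequence: compile the pipeline definition to YAML, upload it tagged with the image tag and `latest`, then schedule it, now passing the compiled YAML to the scheduler through the added `-i` flag. A minimal sketch of driving that same sequence from Python; the config path, pipeline key, output file and tag below are placeholders:

```python
# Sketch: the compile -> upload -> schedule sequence used by the null_resource
# provisioners above, for a single pipeline. All values are placeholders.
import subprocess

CONFIG = "config/config.yaml"                    # assumed rendered config path
PIPELINE = "vertex_ai.pipelines.clv.training"    # pipeline key from config.yaml
COMPILED = "clv_training.yaml"                   # compiled pipeline definition
TAG = "dev"                                      # assumed image tag

for cmd in (
    ["python", "-m", "pipelines.compiler", "-c", CONFIG, "-p", PIPELINE, "-o", COMPILED],
    ["python", "-m", "pipelines.uploader", "-c", CONFIG, "-f", COMPILED, "-t", TAG, "-t", "latest"],
    ["python", "-m", "pipelines.scheduler", "-c", CONFIG, "-p", PIPELINE, "-i", COMPILED],
):
    subprocess.run(cmd, check=True)  # with Poetry, prefix these with `poetry run`
```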
+ [tool.poetry] name = "marketing-data-engine" version = "1.0.0" @@ -9,7 +23,8 @@ packages = [{include = "python"}] [tool.poetry.dependencies] python = ">=3.8,<3.11" -google-cloud-aiplatform = "1.22.0" +google-cloud-aiplatform = "1.52.0" +shapely = "<2.0.0" google-cloud = "^0.34.0" jinja2 = ">=3.0.1,<4.0.0" pip = "23.3" @@ -40,12 +55,14 @@ scikit-learn = "1.2.2" #seaborn = "0.12.2" ma-components = {path = "python/base_component_image/", develop = true} google-cloud-pubsub = "2.15.0" -google-analytics-admin = "0.17.0" -google-analytics-data = "^0.17.1" +#google-analytics-admin = "0.17.0" +google-analytics-admin = "0.22.7" +google-analytics-data = "^0.18.0" pyarrow = "15.0.2" [tool.poetry.group.component_vertex.dependencies] -google-cloud-aiplatform = "1.22.0" +google-cloud-aiplatform = "1.52.0" +shapely = "<2.0.0" toml = "0.10.2" [tool.poetry.scripts] @@ -64,7 +81,7 @@ pytest-xdist = "^3.0.2" pip = "23.3" invoke = "2.2.0" pre-commit = ">=2.14.1,<3.0.0" -black = "22.10.0" +black = "22.12.0" flake8 = "5.0.4" flake8-annotations = "2.9.1" diff --git a/python/activation/main.py b/python/activation/main.py index 082dc48e..494804b6 100644 --- a/python/activation/main.py +++ b/python/activation/main.py @@ -28,9 +28,41 @@ from google.cloud import storage from jinja2 import Environment, BaseLoader + class ActivationOptions(GoogleCloudOptions): + """ + The ActivationOptions class inherits from the GoogleCloudOptions class, which provides a framework for defining + command-line arguments for Google Cloud applications. + + Define the command-line arguments for the activation application. + The arguments are then used to configure the application and run the activation process. + """ + @classmethod def _add_argparse_args(cls, parser): + """ + Adds command-line arguments to the parser. + + Args: + parser: The argparse parser. + + The following arguments are defined: + source_table: The table specification for the source data in the format dataset.data_table. + ga4_measurement_id: The Measurement ID in Google Analytics 4. + ga4_api_secret: The client secret for sending data to Google Analytics 4. + log_db_dataset: The dataset where the log table will be created. + use_api_validation: A boolean flag indicating whether to use the Measurement Protocol API validation for debugging instead of sending the events. + activation_type: The activation use case, which can be one of the following values: + - audience-segmentation-15 + - cltv-180-180 + - cltv-180-90 + - cltv-180-30 + - purchase-propensity-30-15 + - purchase-propensity-15-15 + - purchase-propensity-15-7 + activation_type_configuration: The GCS path to the configuration file for all activation types. + """ + parser.add_argument( '--source_table', type=str, @@ -67,13 +99,13 @@ def _add_argparse_args(cls, parser): type=str, help=''' Specifies the activation use case, currently supported values are: - audience-segmentation-15 - cltv-180-180 - cltv-180-90 - cltv-180-30 - purchase-propensity-30-15 - purchase-propensity-15-15 - purchase-propensity-15-7 + audience-segmentation-15 + cltv-180-180 + cltv-180-90 + cltv-180-30 + purchase-propensity-30-15 + purchase-propensity-15-15 + purchase-propensity-15-7 ''', required=True ) @@ -84,35 +116,149 @@ def _add_argparse_args(cls, parser): required=True ) + + + def build_query(args, activation_type_configuration): + """ + Builds the query to be used to retrieve data from the source table. + + Args: + args: The command-line arguments. + activation_type_configuration: The activation type configuration. 
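The `ActivationOptions` docstring above lists the command-line arguments the activation Dataflow/Apache Beam pipeline accepts. A minimal sketch of parsing such arguments locally with Beam's `PipelineOptions` machinery; the class below is only a stand-in mirroring a subset of `ActivationOptions` from `python/activation/main.py`, and every value is a placeholder:

```python
# Sketch: parse activation-style pipeline arguments. ActivationOptionsSketch is
# a stand-in for ActivationOptions (python/activation/main.py); values are placeholders.
from apache_beam.options.pipeline_options import GoogleCloudOptions, PipelineOptions


class ActivationOptionsSketch(GoogleCloudOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument('--source_table', type=str, required=True)
        parser.add_argument('--activation_type', type=str, required=True)
        parser.add_argument('--activation_type_configuration', type=str, required=True)


argv = [
    "--project=my-gcp-project",
    "--source_table=activation.purchase_propensity_30_15",
    "--activation_type=purchase-propensity-30-15",
    "--activation_type_configuration=gs://my-bucket/activation_type_configuration.json",
]

options = PipelineOptions(argv).view_as(ActivationOptionsSketch)
print(options.activation_type, options.source_table, options.project)
```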
+ + Returns: + The query to be used to retrieve data from the source table. + """ return activation_type_configuration['source_query_template'].render( source_table=args.source_table ) + + + def gcs_read_file(project_id, gcs_path): + """ + Reads a file from Google Cloud Storage (GCS). + + Args: + project_id: The ID of the Google Cloud project that contains the GCS bucket. + gcs_path: The path to the file in GCS, in the format "gs://bucket_name/object_name". + + Returns: + The contents of the file as a string. + + Raises: + ValueError: If the GCS path is invalid. + IOError: If an error occurs while reading the file. + """ + # Validate the GCS path. + if not gcs_path.startswith("gs://"): + raise ValueError("Invalid GCS path: {}".format(gcs_path)) + + # Extract the bucket name and object name from the GCS path. matches = re.match("gs://(.*?)/(.*)", gcs_path) + if not matches: + raise ValueError("Invalid GCS path: {}".format(gcs_path)) bucket_name, blob_name = matches.groups() + # Create a storage client. storage_client = storage.Client(project=project_id) + # Get a reference to the bucket and blob. bucket = storage_client.bucket(bucket_name) blob = bucket.blob(blob_name) + # Open the blob for reading. with blob.open("r") as f: return f.read() + + + class CallMeasurementProtocolAPI(beam.DoFn): + """ + This class defines a DoFn that sends events to the Google Analytics 4 Measurement Protocol API. + + The DoFn takes the following arguments: + + - measurement_id: The Measurement ID of the Google Analytics 4 property. + - api_secret: The API secret for the Google Analytics 4 property. + - debug: A boolean flag indicating whether to use the Measurement Protocol API validation for debugging instead of sending the events. + + The DoFn yields the following output: + + - The event that was sent. + - The HTTP status code of the response. + - The content of the response. + """ + + def __init__(self, measurement_id, api_secret, debug=False): + """ + Initializes the DoFn. + + Args: + measurement_id: The Measurement ID of the Google Analytics 4 property. + api_secret: The API secret for the Google Analytics 4 property. + debug: A boolean flag indicating whether to use the Measurement Protocol API validation for debugging instead of sending the events. + """ if debug: debug_str = "debug/" else: debug_str = '' self.event_post_url = f"https://www.google-analytics.com/{debug_str}mp/collect?measurement_id={measurement_id}&api_secret={api_secret}" + def process(self, element): + """ + Sends the event to the Measurement Protocol API. + + Args: + element: The event to be sent. + + Yields: + The event that was sent. + The HTTP status code of the response. + The content of the response. + """ response = requests.post(self.event_post_url, data=json.dumps(element),headers={'content-type': 'application/json'}, timeout=20) yield element, response.status_code, response.content + + + class ToLogFormat(beam.DoFn): + """ + This class defines a DoFn that transforms the output of the Measurement Protocol API call into a format suitable for logging. + + The DoFn takes the following arguments: + + - element: A tuple containing the event that was sent and the HTTP status code of the response. + + The DoFn yields the following output: + + - A dictionary containing the following fields: + - id: A unique identifier for the log entry. + - activation_id: The ID of the activation event. + - payload: The JSON payload of the event that was sent. 
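`CallMeasurementProtocolAPI` above builds the Measurement Protocol collect URL (or its `debug/` validation variant) from the measurement ID and API secret and POSTs each event as JSON. A standalone sketch of that request, with placeholder credentials, client ID and event name:

```python
# Sketch: send one event to the GA4 Measurement Protocol endpoint, mirroring the
# URL and headers used by CallMeasurementProtocolAPI. All values are placeholders.
import json
import requests

measurement_id = "G-XXXXXXXXXX"   # placeholder GA4 Measurement ID
api_secret = "my-api-secret"      # placeholder Measurement Protocol secret
debug_str = "debug/"              # use "" to send events instead of validating them

url = (f"https://www.google-analytics.com/{debug_str}mp/collect"
       f"?measurement_id={measurement_id}&api_secret={api_secret}")

payload = {
    "client_id": "1234567890.1234567890",  # placeholder GA4 client id
    "events": [{"name": "placeholder_event", "params": {"value": 0.87}}],
}

response = requests.post(url, data=json.dumps(payload),
                         headers={"content-type": "application/json"}, timeout=20)
print(response.status_code, response.content)  # a real send returns HTTP 204
```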
+ - latest_state: The latest state of the event, which can be either "SEND_OK" or "SEND_FAIL". + - updated_at: The timestamp when the log entry was created. + """ + def process(self, element): + """ + Transforms the output of the Measurement Protocol API call into a format suitable for logging. + + Args: + element: A tuple containing the event that was sent and the HTTP status code of the response. + + Yields: + A dictionary containing the following fields: + - id: A unique identifier for the log entry. + - activation_id: The ID of the activation event. + - payload: The JSON payload of the event that was sent. + - latest_state: The latest state of the event, which can be either "SEND_OK" or "SEND_FAIL". + - updated_at: The timestamp when the log entry was created. + """ time_cast = datetime.datetime.now(tz=datetime.timezone.utc) if element[1] == requests.status_codes.codes.NO_CONTENT: @@ -141,23 +287,92 @@ def process(self, element): logging.error(traceback.format_exc()) yield result + + + class DecimalEncoder(json.JSONEncoder): + """ + This class defines a custom JSON encoder that handles Decimal objects correctly. + + The DecimalEncoder class inherits from the `json.JSONEncoder` class and overrides the `default` method to handle Decimal objects. + The `default` method is called for objects that are not of a basic type (string, number, boolean, None, list, tuple, dictionary). + The DecimalEncoder class checks if the object is a Decimal object and, if so, returns its value as a float. + Otherwise, it calls the parent class's `default` method to handle the object. + + The DecimalEncoder class is used to ensure that Decimal objects are encoded as floats when they are converted to JSON. + This is important because Decimal objects cannot be directly encoded as JSON strings. + """ + def default(self, obj): + """ + Handles the encoding of Decimal objects. + + Args: + obj: The object to be encoded. + + Returns: + The JSON representation of the object. + """ if isinstance(obj, Decimal): return float(obj) return json.JSONEncoder.default(self, obj) + + + class TransformToPayload(beam.DoFn): + """ + This class defines a DoFn that transforms the output of the inference pipeline into a format suitable for sending to the Google Analytics 4 Measurement Protocol API. + + The DoFn takes the following arguments: + + - template_str: The Jinja2 template string used to generate the Measurement Protocol payload. + - event_name: The name of the event to be sent to Google Analytics 4. + + The DoFn yields the following output: + + - A dictionary containing the Measurement Protocol payload. + + The DoFn performs the following steps: + + 1. Removes bad shaping strings in the `client_id` field. + 2. Renders the Jinja2 template string using the provided data and event name. + 3. Converts the rendered template string into a JSON object. + 4. Handles any JSON decoding errors. + + The DoFn is used to ensure that the Measurement Protocol payload is formatted correctly before being sent to Google Analytics 4. + """ def __init__(self, template_str, event_name): + """ + Initializes the DoFn. + + Args: + template_str: The Jinja2 template string used to generate the Measurement Protocol payload. + event_name: The name of the event to be sent to Google Analytics 4. + """ self.template_str = template_str self.date_format = "%Y-%m-%d" self.date_time_format = "%Y-%m-%d %H:%M:%S.%f %Z" self.event_name = event_name + def setup(self): + """ + Sets up the Jinja2 environment. 
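The `DecimalEncoder` documented above exists because `json.dumps` cannot serialize `decimal.Decimal` values (common in BigQuery NUMERIC columns) on its own. A quick illustration of the difference, using the same encoder idea:

```python
# Sketch: why DecimalEncoder is needed. json.dumps raises TypeError on Decimal
# values unless a custom encoder converts them to float first.
import json
from decimal import Decimal


class DecimalEncoder(json.JSONEncoder):
    """Same idea as the DecimalEncoder in python/activation/main.py."""
    def default(self, obj):
        if isinstance(obj, Decimal):
            return float(obj)
        return json.JSONEncoder.default(self, obj)


row = {"prediction": Decimal("0.8731")}

# json.dumps(row) would raise: TypeError: Object of type Decimal is not JSON serializable
print(json.dumps(row, cls=DecimalEncoder))  # -> {"prediction": 0.8731}
```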
+ """ self.payload_template = Environment(loader=BaseLoader).from_string(self.template_str) + def process(self, element): + """ + Transforms the output of the inference pipeline into a format suitable for sending to the Google Analytics 4 Measurement Protocol API. + + Args: + element: A dictionary containing the output of the inference pipeline. + + Yields: + A dictionary containing the Measurement Protocol payload. + """ # Removing bad shaping strings in client_id _client_id = element['client_id'].replace(r'> beam.ParDo(CallMeasurementProtocolAPI(activation_options.ga4_measurement_id, activation_options.ga4_api_secret, debug=activation_options.use_api_validation)) ) + # Filter the successful responses success_responses = ( measurement_api_responses | 'Get the successful responses' >> beam.Filter(lambda element: element[1] == requests.status_codes.codes.NO_CONTENT) ) + # Filter the failed responses failed_responses = ( measurement_api_responses | 'Get the failed responses' >> beam.Filter(lambda element: element[1] != requests.status_codes.codes.NO_CONTENT) ) + # Store the successful responses in the log tables _ = ( success_responses | 'Transform log format' >> beam.ParDo(ToLogFormat()) | 'Store to log BQ table' >> beam.io.WriteToBigQuery( @@ -294,6 +604,7 @@ def run(argv=None): create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED) ) + # Store the failed responses in the log tables _ = ( failed_responses | 'Transform failure log format' >> beam.ParDo(ToLogFormat()) | 'Store to failure log BQ table' >> beam.io.WriteToBigQuery( @@ -303,6 +614,9 @@ def run(argv=None): create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED) ) + + + if __name__ == '__main__': logging.getLogger().setLevel(logging.INFO) run() diff --git a/python/base_component_image/build-push.py b/python/base_component_image/build-push.py index 80b16680..b18db628 100644 --- a/python/base_component_image/build-push.py +++ b/python/base_component_image/build-push.py @@ -16,22 +16,71 @@ from argparse import ArgumentParser, ArgumentTypeError -def run(dockerfile_path: str, tag: str, nocache: bool =False, quiet: bool =True): +def run( + dockerfile_path: str, + tag: str, + nocache: bool =False, + quiet: bool =True): + """ + This function builds and pushes a Docker image to a specified repository. + + Args: + dockerfile_path (str): The path to the Dockerfile. + tag (str): The tag for the Docker image. + nocache (bool, optional): Whether to disable the Docker cache. Defaults to False. + quiet (bool, optional): Whether to suppress output from the Docker build process. Defaults to True. + + Raises: + FileNotFoundError: If the Dockerfile does not exist. + ArgumentTypeError: If the Dockerfile path is not a string or the tag is not a string. + """ + client = docker.from_env() client.images.build(path = dockerfile_path, tag=tag, nocache=nocache, quiet=quiet) client.images.push(repository=tag) -def check_extention(file_path: str, type: str = '.yaml'): +def check_extention( + file_path: str, + type: str = '.yaml'): + """ + This function checks if a file exists and has the specified extension. + + Args: + file_path (str): The path to the file. + type (str, optional): The file extension to check for. Defaults to '.yaml'. + + Returns: + str: The file path if it exists and has the specified extension. + + Raises: + FileNotFoundError: If the file does not exist. + ArgumentTypeError: If the file path is not a string or the type is not a string. 
+ """ + if not isinstance(file_path, str): + raise ArgumentTypeError("file_path must be a string") + + if not isinstance(type, str): + raise ArgumentTypeError("type must be a string") + if os.path.exists(file_path): if not file_path.lower().endswith(type): raise ArgumentTypeError(f"File provited must be {type}: {file_path}") else: raise FileNotFoundError(f"{file_path} does not exist") + return file_path if __name__ == "__main__": + """ + Script that builds and pushes a Docker image to a specified repository. It takes the following arguments: + + Args: + -c: Path to the configuration YAML file. + -p: Path to the Dockerfile. + -nc: Whether to disable the Docker cache (optional, defaults to False). + """ parser = ArgumentParser() @@ -55,19 +104,17 @@ def check_extention(file_path: str, type: str = '.yaml'): args = parser.parse_args() - repo_params={} components_params={} with open(args.config, encoding='utf-8') as fh: configs = yaml.full_load(fh) - components_params = configs['vertex_ai']['components'] repo_params = configs['artifact_registry']['pipelines_docker_repo'] tag = f"{repo_params['region']}-docker.pkg.dev/{repo_params['project_id']}/{repo_params['name']}/{components_params['base_image_name']}:{components_params['base_image_tag']}" - + # This script provides a convenient way to build and push Docker images for Vertex AI pipelines. if True: import os os.system(f"cd '{args.path}' && gcloud builds submit --project={repo_params['project_id']} --region={repo_params['region']} --tag {tag}") diff --git a/python/base_component_image/pyproject.toml b/python/base_component_image/pyproject.toml index f5837988..917672e7 100644 --- a/python/base_component_image/pyproject.toml +++ b/python/base_component_image/pyproject.toml @@ -20,7 +20,8 @@ urllib3 = "1.26.18" toml = "^0.10.2" docker = "^6.0.1" google-cloud-bigquery = "2.30.0" -google-cloud-aiplatform = "1.22.0" +google-cloud-aiplatform = "1.52.0" +shapely = "<2.0.0" google-cloud-pubsub = "2.15.0" #google-cloud-pipeline-components = "1.0.33" google-cloud-pipeline-components = "2.6.0" diff --git a/python/function/trigger_activation/main.py b/python/function/trigger_activation/main.py index bfb6f64a..50f49804 100644 --- a/python/function/trigger_activation/main.py +++ b/python/function/trigger_activation/main.py @@ -20,27 +20,55 @@ from datetime import datetime from google.cloud import dataflow_v1beta3 + @functions_framework.cloud_event def subscribe(cloud_event): + """ + This function is triggered by a Pub/Sub message. The message contains the activation type and the source table. + The function then launches a Dataflow Flex Template to process the data and send the activation events to GA4. + This function demonstrates how to use Cloud Functions to trigger a Dataflow Flex Template based on a Pub/Sub message. + This allows for automated processing of data and sending activation events to GA4. + + Args: + cloud_event: The CloudEvent message. + + Returns: + None. + """ + + # ACTIVATION_PROJECT: The Google Cloud project ID. project_id = os.environ.get('ACTIVATION_PROJECT') + # ACTIVATION_REGION: The Google Cloud region where the Dataflow Flex Template will be launched. region = os.environ.get('ACTIVATION_REGION') - + # TEMPLATE_FILE_GCS_LOCATION: The Google Cloud Storage location of the Dataflow Flex Template file. template_file_gcs_location = os.environ.get('TEMPLATE_FILE_GCS_LOCATION') + # GA4_MEASUREMENT_ID: The Google Analytics 4 measurement ID. 
ga4_measurement_id = os.environ.get('GA4_MEASUREMENT_ID') + # GA4_MEASUREMENT_SECRET: The Google Analytics 4 measurement secret. ga4_measurement_secret = os.environ.get('GA4_MEASUREMENT_SECRET') + # ACTIVATION_TYPE_CONFIGURATION: The path to a JSON file containing the configuration for the activation type. activation_type_configuration = os.environ.get('ACTIVATION_TYPE_CONFIGURATION') + # PIPELINE_TEMP_LOCATION: The Google Cloud Storage location for temporary files used by the Dataflow Flex Template. temp_location = os.environ.get('PIPELINE_TEMP_LOCATION') + # LOG_DATA_SET: The BigQuery dataset where the logs of the Dataflow Flex Template will be stored. log_db_dataset = os.environ.get('LOG_DATA_SET') + # PIPELINE_WORKER_EMAIL: The service account email used by the Dataflow Flex Template workers. service_account_email = os.environ.get('PIPELINE_WORKER_EMAIL') + # Decodes the base64 encoded data in the message and parses it as JSON. + # It then extracts the activation_type and source_table values from the JSON object. message_data = base64.b64decode(cloud_event.data["message"]["data"]).decode() message_obj = json.loads(message_data) activation_type = message_obj['activation_type'] source_table = message_obj['source_table'] + # Creates a FlexTemplateRuntimeEnvironment object with the service account email. environment_param = dataflow_v1beta3.FlexTemplateRuntimeEnvironment(service_account_email=service_account_email) + # It then creates a dictionary of parameters for the Dataflow Flex Template, including the project ID, activation type, + # activation type configuration, source table, temporary location, GA4 measurement ID, GA4 measurement secret, and log dataset. + # Finally, it creates a LaunchFlexTemplateParameter object with the job name, container spec GCS path, environment, and parameters. parameters = { 'project': project_id, 'activation_type': activation_type, @@ -51,21 +79,21 @@ def subscribe(cloud_event): 'ga4_api_secret': ga4_measurement_secret, 'log_db_dataset': log_db_dataset } - flex_template_param = dataflow_v1beta3.LaunchFlexTemplateParameter( job_name=f"activation-pipline-{activation_type.replace('_','-')}-{datetime.now().strftime('%Y%m%d-%H%M%S')}", container_spec_gcs_path=template_file_gcs_location, environment=environment_param, parameters=parameters ) + + # Creates a LaunchFlexTemplateRequest object with the project ID, region, and launch parameter. + # It then uses the FlexTemplatesServiceClient to launch the Dataflow Flex Template. 
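The trigger function above expects a Pub/Sub message whose base64-decoded data is a JSON object carrying `activation_type` and `source_table`. A sketch of publishing such a message; the project ID, topic name and table are placeholders:

```python
# Sketch: publish the kind of message the trigger_activation Cloud Function
# decodes. Project, topic and table values are placeholders.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "activation-trigger")  # assumed topic

message = {
    "activation_type": "purchase-propensity-30-15",
    "source_table": "activation.purchase_propensity_30_15",  # placeholder dataset.table
}

future = publisher.publish(topic_path, json.dumps(message).encode("utf-8"))
print("Published message id:", future.result())
```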
request = dataflow_v1beta3.LaunchFlexTemplateRequest( project_id=project_id, location=region, launch_parameter=flex_template_param ) - client = dataflow_v1beta3.FlexTemplatesServiceClient() - response = client.launch_flex_template(request=request) print(response) diff --git a/python/function/trigger_activation/requirements.txt b/python/function/trigger_activation/requirements.txt index 6b67bb67..1b6d3ebf 100644 --- a/python/function/trigger_activation/requirements.txt +++ b/python/function/trigger_activation/requirements.txt @@ -1,2 +1,2 @@ -functions-framework==3.3.0 -google-cloud-dataflow-client==0.8.2 \ No newline at end of file +functions-framework==3.7.0 +google-cloud-dataflow-client==0.8.10 \ No newline at end of file diff --git a/python/ga4_setup/setup.py b/python/ga4_setup/setup.py index 804aef9e..cfaa2d6f 100644 --- a/python/ga4_setup/setup.py +++ b/python/ga4_setup/setup.py @@ -19,14 +19,46 @@ from typing import List + + def get_data_stream(property_id: str, stream_id: str, transport: str = None): + """ + Retrieves a data stream from Google Analytics 4. + + Args: + property_id: The ID of the Google Analytics 4 property. + stream_id: The ID of the data stream. + transport: The transport to use for the request. Defaults to None. + + Returns: + A DataStream object. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while retrieving the data stream. + """ client = AnalyticsAdminServiceClient(transport=transport) return client.get_data_stream( name=f"properties/{property_id}/dataStreams/{stream_id}" ) + + def get_measurement_protocol_secret_value(configuration: map, secret_display_name: str, transport: str = None): + """ + Retrieves the secret value for a given measurement protocol secret display name. + + Args: + configuration: A dictionary containing the Google Analytics 4 property ID and data stream ID. + secret_display_name: The display name of the measurement protocol secret. + transport: The transport to use for the request. Defaults to None. + + Returns: + The secret value for the measurement protocol secret, or None if the secret is not found. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while retrieving the measurement protocol secret. + """ client = AnalyticsAdminServiceClient(transport=transport) results = client.list_measurement_protocol_secrets( parent=f"properties/{configuration['property_id']}/dataStreams/{configuration['stream_id']}" @@ -37,7 +69,22 @@ def get_measurement_protocol_secret_value(configuration: map, secret_display_nam return None + + def get_measurement_protocol_secret(configuration: map, secret_display_name: str): + """ + Retrieves the secret value for a given measurement protocol secret display name. + + Args: + configuration: A dictionary containing the Google Analytics 4 property ID and data stream ID. + secret_display_name: The display name of the measurement protocol secret. + + Returns: + The secret value for the measurement protocol secret, or None if the secret is not found. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while retrieving the measurement protocol secret. 
+ """ measurement_protocol_secret = get_measurement_protocol_secret_value( configuration, secret_display_name) if measurement_protocol_secret: @@ -46,11 +93,41 @@ def get_measurement_protocol_secret(configuration: map, secret_display_name: str return create_measurement_protocol_secret(configuration, secret_display_name) + + def get_measurement_id(configuration: map): + """ + Retrieves the measurement ID for a given Google Analytics 4 property and data stream. + + Args: + configuration: A dictionary containing the Google Analytics 4 property ID and data stream ID. + + Returns: + The measurement ID for the given property and data stream. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while retrieving the measurement ID. + """ return get_data_stream(configuration['property_id'], configuration['stream_id']).web_stream_data.measurement_id + + def create_measurement_protocol_secret(configuration: map, secret_display_name: str, transport: str = None): + """ + Creates a new measurement protocol secret for a given Google Analytics 4 property and data stream. + + Args: + configuration: A dictionary containing the Google Analytics 4 property ID and data stream ID. + secret_display_name: The display name of the measurement protocol secret. + transport: The transport to use for the request. Defaults to None. + + Returns: + The secret value for the measurement protocol secret. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while creating the measurement protocol secret. + """ from google.analytics.admin_v1alpha import MeasurementProtocolSecret client = AnalyticsAdminServiceClient(transport=transport) measurement_protocol_secret = client.create_measurement_protocol_secret( @@ -62,7 +139,15 @@ def create_measurement_protocol_secret(configuration: map, secret_display_name: return measurement_protocol_secret.secret_value + + def load_event_names(): + """ + Loads the event names from the activation type configuration template file. + + Returns: + A list of event names. + """ fo = open('templates/activation_type_configuration_template.tpl') activation_types_obj = json.load(fo) event_names = [] @@ -71,7 +156,18 @@ def load_event_names(): return event_names + + def create_custom_events(configuration: map): + """ + Creates custom events in Google Analytics 4 based on the event names defined in the activation type configuration template file. + + Args: + configuration: A dictionary containing the Google Analytics 4 property ID and data stream ID. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while creating the custom events. + """ event_names = load_event_names() existing_event_names = load_existing_ga4_custom_events(configuration) for event_name in event_names: @@ -80,7 +176,21 @@ def create_custom_events(configuration: map): create_custom_event(configuration, event_name) + + def load_existing_ga4_custom_events(configuration: map): + """ + Loads the existing custom events from Google Analytics 4 based on the provided configuration. + + Args: + configuration: A dictionary containing the Google Analytics 4 property ID and data stream ID. + + Returns: + A list of existing custom event names. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while retrieving the custom events. 
+ """ response = load_existing_ga4_custom_event_objs(configuration) existing_event_rules = [] for page in response.pages: @@ -88,14 +198,42 @@ def load_existing_ga4_custom_events(configuration: map): existing_event_rules.append(event_rule_obj.destination_event) return existing_event_rules + + + def load_existing_ga4_custom_event_objs(configuration: map): + """ + Loads the existing custom event objects from Google Analytics 4 based on the provided configuration. + + Args: + configuration: A dictionary containing the Google Analytics 4 property ID and data stream ID. + + Returns: + A ListEventCreateRulesResponse object containing the existing custom event objects. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while retrieving the custom events. + """ client = admin_v1alpha.AnalyticsAdminServiceClient() request = admin_v1alpha.ListEventCreateRulesRequest( parent=f"properties/{configuration['property_id']}/dataStreams/{configuration['stream_id']}", ) return client.list_event_create_rules(request=request) + + + def create_custom_event(configuration: map, event_name: str): + """ + Creates a custom event in Google Analytics 4 based on the provided event name. + + Args: + configuration: A dictionary containing the Google Analytics 4 property ID and data stream ID. + event_name: The name of the custom event to be created. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while creating the custom event. + """ client = admin_v1alpha.AnalyticsAdminServiceClient() event_create_rule = admin_v1alpha.EventCreateRule() condition = admin_v1alpha.MatchingCondition() @@ -115,7 +253,18 @@ def create_custom_event(configuration: map, event_name: str): response = client.create_event_create_rule(request=request) + + def create_custom_dimensions(configuration: map): + """ + Creates custom dimensions in Google Analytics 4 based on the provided configuration. + + Args: + configuration: A dictionary containing the Google Analytics 4 property ID and data stream ID. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while creating the custom dimensions. + """ existing_dimensions = load_existing_ga4_custom_dimensions(configuration) create_custom_dimensions_for('Audience Segmentation', ['a_s_prediction'], existing_dimensions, configuration) create_custom_dimensions_for('Purchase Propensity', ['p_p_prediction', 'p_p_decile'], existing_dimensions, configuration) @@ -123,7 +272,21 @@ def create_custom_dimensions(configuration: map): create_custom_dimensions_for('Behaviour Based Segmentation', ['a_a_s_prediction'], existing_dimensions, configuration) + + def create_custom_dimensions_for(use_case: str, fields: List[str], existing_dimensions: List[str], configuration: map): + """ + Creates custom dimensions in Google Analytics 4 based on the provided configuration for a specific use case. + + Args: + use_case: The use case for which the custom dimensions are being created. + fields: A list of field names to be used as custom dimensions. + existing_dimensions: A list of existing custom dimension names. + configuration: A dictionary containing the Google Analytics 4 property ID and data stream ID. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while creating the custom dimensions. 
+ """ for field in fields: display_name = f'MAJ {use_case} {field}' if not display_name in existing_dimensions: @@ -131,7 +294,20 @@ def create_custom_dimensions_for(use_case: str, fields: List[str], existing_dime create_custom_dimension(configuration, field, display_name) + + def create_custom_dimension(configuration: map, field_name: str, display_name: str): + """ + Creates a custom dimension in Google Analytics 4 based on the provided configuration. + + Args: + configuration: A dictionary containing the Google Analytics 4 property ID. + field_name: The name of the field to be used as the custom dimension. + display_name: The display name of the custom dimension. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while creating the custom dimension. + """ client = admin_v1alpha.AnalyticsAdminServiceClient() custom_dimension = admin_v1alpha.CustomDimension() @@ -146,14 +322,44 @@ def create_custom_dimension(configuration: map, field_name: str, display_name: s client.create_custom_dimension(request=request) + + + def load_existing_ga4_custom_dimension_objs(configuration: map): + """ + Loads the existing custom dimension objects from Google Analytics 4 based on the provided configuration. + + Args: + configuration: A dictionary containing the Google Analytics 4 property ID. + + Returns: + A ListCustomDimensionsResponse object containing the existing custom dimension objects. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while retrieving the custom dimensions. + """ client = admin_v1alpha.AnalyticsAdminServiceClient() request = admin_v1alpha.ListCustomDimensionsRequest( parent=f"properties/{configuration['property_id']}", ) return client.list_custom_dimensions(request=request) + + + def load_existing_ga4_custom_dimensions(configuration: map): + """ + Loads the existing custom dimension objects from Google Analytics 4 based on the provided configuration. + + Args: + configuration: A dictionary containing the Google Analytics 4 property ID. + + Returns: + A list of existing custom dimension names. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while retrieving the custom dimensions. + """ page_result = load_existing_ga4_custom_dimension_objs(configuration) existing_custom_dimensions = [] for page in page_result.pages: @@ -161,7 +367,21 @@ def load_existing_ga4_custom_dimensions(configuration: map): existing_custom_dimensions.append(custom_dimension.display_name) return existing_custom_dimensions + + + def update_custom_event_with_new_prefix(event_create_rule, old_prefix, new_prefix): + """ + Updates an existing custom event in Google Analytics 4 with a new prefix. + + Args: + event_create_rule: The custom event rule to be updated. + old_prefix: The old prefix to be replaced. + new_prefix: The new prefix to be used. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while updating the custom event. + """ client = admin_v1alpha.AnalyticsAdminServiceClient() event_create_rule.destination_event = event_create_rule.destination_event.replace(old_prefix, new_prefix) event_create_rule.event_conditions[0].value = event_create_rule.event_conditions[0].value.replace(old_prefix, new_prefix) @@ -171,14 +391,42 @@ def update_custom_event_with_new_prefix(event_create_rule, old_prefix, new_prefi ) client.update_event_create_rule(request=request) + + + def rename_existing_ga4_custom_events(configuration: map, old_prefix, new_prefix): + """ + Renames existing custom events in Google Analytics 4 by replacing the old prefix with the new prefix. 
+ + Args: + configuration: A dictionary containing the Google Analytics 4 property ID and data stream ID. + old_prefix: The old prefix to be replaced. + new_prefix: The new prefix to be used. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while renaming the custom events. + """ existing_event_rules = load_existing_ga4_custom_event_objs(configuration) for page in existing_event_rules.pages: for create_event_rule in page.event_create_rules: if create_event_rule.destination_event.startswith(old_prefix): update_custom_event_with_new_prefix(create_event_rule, old_prefix, new_prefix) + + + def update_custom_dimension_with_new_prefix(custom_dimension, old_prefix, new_prefix): + """ + Updates an existing custom dimension in Google Analytics 4 with a new prefix. + + Args: + custom_dimension: The custom dimension to be updated. + old_prefix: The old prefix to be replaced. + new_prefix: The new prefix to be used. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while updating the custom dimension. + """ client = admin_v1alpha.AnalyticsAdminServiceClient() custom_dimension.display_name = custom_dimension.display_name.replace(old_prefix, new_prefix) request = admin_v1alpha.UpdateCustomDimensionRequest( @@ -187,14 +435,43 @@ def update_custom_dimension_with_new_prefix(custom_dimension, old_prefix, new_pr ) client.update_custom_dimension(request=request) + + + def rename_existing_ga4_custom_dimensions(configuration: map, old_prefix, new_prefix): + """ + Renames existing custom dimensions in Google Analytics 4 by replacing the old prefix with the new prefix. + + Args: + configuration: A dictionary containing the Google Analytics 4 property ID. + old_prefix: The old prefix to be replaced. + new_prefix: The new prefix to be used. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while renaming the custom dimensions. + """ page_result = load_existing_ga4_custom_dimension_objs(configuration) for page in page_result.pages: for custom_dimension in page.custom_dimensions: if custom_dimension.display_name.startswith(old_prefix): update_custom_dimension_with_new_prefix(custom_dimension, old_prefix, new_prefix) + + + def entry(): + """ + This function is the entry point for the setup script. It takes three arguments: + + Args: + ga4_resource: The Google Analytics 4 resource to be configured. + ga4_property_id: The Google Analytics 4 property ID. + ga4_stream_id: The Google Analytics 4 data stream ID. + + Raises: + GoogleAnalyticsAdminError: If an error occurs while configuring the Google Analytics 4 resource. + """ + ''' Following Google API scopes are required to call Google Analytics Admin API: https://www.googleapis.com/auth/analytics diff --git a/python/notebooks/vertex_ai_auto_audience_segmentation_ga4_interest.ipynb b/python/notebooks/vertex_ai_auto_audience_segmentation_ga4_interest.ipynb deleted file mode 100644 index 807f3507..00000000 --- a/python/notebooks/vertex_ai_auto_audience_segmentation_ga4_interest.ipynb +++ /dev/null @@ -1,1382 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Auto Audience Segmentation (Interest based)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Install (and update) additional packages\n", - "\n", - "Install the following packages required to execute this notebook. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "\n", - "# The Vertex AI Workbench Notebook product has specific requirements\n", - "IS_WORKBENCH_NOTEBOOK = os.getenv(\"DL_ANACONDA_HOME\")\n", - "IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(\n", - " \"/opt/deeplearning/metadata/env_version\"\n", - ")\n", - "\n", - "# Vertex AI Notebook requires dependencies to be installed with '--user'\n", - "USER_FLAG = \"\"\n", - "if IS_WORKBENCH_NOTEBOOK:\n", - " USER_FLAG = \"--user\"\n", - "\n", - "! pip3 install --upgrade google-cloud-aiplatform {USER_FLAG} -q google-cloud-bigquery db-dtypes\n", - "! pip3 install -q --upgrade optuna==3.2.0 {USER_FLAG}\n", - "! pip3 install -q --upgrade scikit-learn==1.2.* {USER_FLAG}\n", - "! pip3 install -q --upgrade plotly==5.16.0 matplotlib==3.7.2 seaborn==0.12.2 {USER_FLAG}" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Restart the kernel\n", - "\n", - "After you install the additional packages, you need to restart the notebook kernel so it can find the packages." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Automatically restart kernel after installs\n", - "import os\n", - "\n", - "if not os.getenv(\"IS_TESTING\"):\n", - " # Automatically restart kernel after installs\n", - " import IPython\n", - "\n", - " app = IPython.Application.instance()\n", - " app.kernel.do_shutdown(True)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [] - }, - "source": [ - "## Before you begin" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Set your project ID\n", - "\n", - "**If you don't know your project ID**, you may be able to get your project ID using `gcloud`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "PROJECT_ID = \"[PROJECT_ID]\"\n", - "\n", - "# Get your Google Cloud project ID from gcloud\n", - "import os\n", - "\n", - "if not os.getenv(\"IS_TESTING\"):\n", - " shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null\n", - " PROJECT_ID = shell_output[0]\n", - " print(\"Project ID:\", PROJECT_ID)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Otherwise, set your project ID here." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "if PROJECT_ID == \"\" or PROJECT_ID is None:\n", - " PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n", - " \n", - "print (\"Your set Project ID is:\", PROJECT_ID)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "REGION = \"[your-region]\" # @param {type: \"string\"}\n", - "\n", - "if REGION == \"[your-region]\":\n", - " REGION = \"us-central1\"\n", - " \n", - "print ('REGION:', REGION)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Authenticate your Google Cloud account\n", - "\n", - "**If you are using Vertex AI Workbench Notebooks**, your environment is already\n", - "authenticated. Skip this step." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**If you are using Colab**, run the cell below and follow the instructions\n", - "when prompted to authenticate your account via oAuth.\n", - "\n", - "**Otherwise**, follow these steps:\n", - "\n", - "1. 
In the Cloud Console, go to the **Create service account key** page.\n", - "\n", - "2. Click **Create service account**.\n", - "\n", - "3. In the **Service account name** field, enter a name, and\n", - " click **Create**.\n", - "\n", - "4. In the **Grant this service account access to project** section, click the **Role** drop-down list. Type \"Vertex AI\"\n", - "into the filter box, and select\n", - " **Vertex AI Administrator**. Type \"Storage Object Admin\" into the filter box, and select **Storage Object Admin**.\n", - "\n", - "5. Click *Create*. A JSON file that contains your key downloads to your\n", - "local environment.\n", - "\n", - "6. Enter the path to your service account key as the\n", - "`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "gJxXVYlrtyEq", - "outputId": "d93a029f-a87d-4201-be3f-1f217706fb8a" - }, - "outputs": [], - "source": [ - "# If you are running this notebook in Colab, run this cell and follow the\n", - "# instructions to authenticate your GCP account.\n", - "\n", - "import os\n", - "import sys\n", - "\n", - "# If on Vertex AI Workbench, then don't execute this code\n", - "IS_COLAB = \"google.colab\" in sys.modules\n", - "if not os.path.exists(\"/opt/deeplearning/metadata/env_version\") and not os.getenv(\n", - " \"DL_ANACONDA_HOME\"\n", - "):\n", - " if \"google.colab\" in sys.modules:\n", - " from google.colab import auth as google_auth\n", - "\n", - " google_auth.authenticate_user()\n", - "\n", - " # If you are running this notebook locally, replace the string below with the\n", - " # path to your service account key and run this cell to authenticate your GCP\n", - " # account.\n", - " # elif not os.getenv(\"IS_TESTING\"):\n", - " # %env GOOGLE_APPLICATION_CREDENTIALS ''" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "SERVICE_ACCOUNT = \"[your-service-account]\" # @param {type:\"string\"}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "if (\n", - " SERVICE_ACCOUNT == \"\"\n", - " or SERVICE_ACCOUNT is None\n", - " or SERVICE_ACCOUNT == \"[your-service-account]\"\n", - "):\n", - " # Get your service account from gcloud\n", - " if not IS_COLAB:\n", - " shell_output = !gcloud config list account --format \"value(core.account)\"\n", - " SERVICE_ACCOUNT = shell_output[0].strip()\n", - "\n", - " else: # IS_COLAB:\n", - " shell_output = ! 
gcloud projects describe $PROJECT_ID\n", - " project_number = shell_output[-1].split(\":\")[1].strip().replace(\"'\", \"\")\n", - " SERVICE_ACCOUNT = f\"{project_number}-compute@developer.gserviceaccount.com\"\n", - "\n", - " print(\"Service Account:\", SERVICE_ACCOUNT)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Import libraries" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "id": "hSYrEiHgv8Ql" - }, - "outputs": [], - "source": [ - "import numpy as np\n", - "import pandas as pd\n", - "pd.options.plotting.backend = \"plotly\"\n", - "\n", - "import matplotlib.pyplot as plt\n", - "import matplotlib.cm as cm\n", - "import seaborn as sns\n", - "import plotly\n", - "\n", - "import sklearn\n", - "print('The scikit-learn version is {}.'.format(sklearn.__version__))\n", - "from google.cloud import bigquery\n", - "import jinja2\n", - "import re\n", - "\n", - "import optuna\n", - "optuna.logging.set_verbosity(optuna.logging.WARNING)\n", - "\n", - "from sklearn.pipeline import Pipeline\n", - "from sklearn.cluster import KMeans, MiniBatchKMeans\n", - "from sklearn.compose import ColumnTransformer\n", - "from sklearn.preprocessing import FunctionTransformer\n", - "from sklearn.feature_extraction.text import TfidfTransformer\n", - "from sklearn.metrics import silhouette_samples, silhouette_score" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Configuration" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "id": "M1Ju4cY-5CLd" - }, - "outputs": [], - "source": [ - "#@title Settings\n", - "SRC_PROJECT_ID = '[PROJECT_ID]' #@param {type:\"string\"}\n", - "SRC_DATASET_ID = '[SRC_DATASET_ID]' #@param {type:\"string\"}\n", - "DST_DATASET_ID = '[DST_DATASET_ID]' #@param {type:\"string\"}\n", - "\n", - "DATE_START = \"2023-01-01\" #@param {type:\"date\"}\n", - "DATE_END = \"2023-12-31\" #@param {type:\"date\"}\n", - "LOOKBACK_DAYS = 15 #@param {type:\"integer\"}" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "##### Creating BigQuery home dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from google.cloud import bigquery\n", - "client = bigquery.Client(project=PROJECT_ID)\n", - "dataset = bigquery.Dataset(f\"{PROJECT_ID}.{DST_DATASET_ID}\")\n", - "dataset.location = \"US\"\n", - "dataset = client.create_dataset(dataset, exists_ok=True, timeout=30)\n", - "print(\"Created dataset {}.{}\".format(client.project, dataset.dataset_id))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xYSspOtx4msG" - }, - "source": [ - "## Creating Auto Generated Dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "id": "k2NrZY9jLMoz" - }, - "outputs": [], - "source": [ - "#@markdown RE_PAGE PATH is the regex expression that tells the query what part of page path to extract. Example: ^https://your-website.com(/[a-z-0-9]*/?).*\n", - "website_url = \"[WEBSITE_URL]\" #@param {type:\"string\"}\n", - "RE_PAGE_PATH = f\"\"\"^https://{website_url}/([-a-zA-Z0-9@:%_+.~#?//=]*)$\"\"\" \n", - "\n", - "#@markdown PERC_KEEP is the percent of cumulative traffic you'd like to keep. 
(Give me all pages/folders which combine for up to X% of all traffic)\n", - "PERC_KEEP = 35 #@param {type:\"slider\", min:1, max:99, step:1}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ZK5cVBhgKqHW" - }, - "outputs": [], - "source": [ - "client = bigquery.Client(project=PROJECT_ID)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "soxuYQSsKrhq" - }, - "outputs": [], - "source": [ - "sql = f\"\"\"\n", - "SELECT\n", - " feature,\n", - " ROUND(100 * SUM(users) OVER (ORDER BY users DESC) / SUM(users) OVER (), 2) as cumulative_traffic_percent,\n", - "\n", - "FROM (\n", - " SELECT\n", - " REGEXP_EXTRACT(page_path, @RE_PAGE_PATH) as feature,\n", - " COUNT(DISTINCT user_id) as users\n", - "\n", - " FROM (\n", - " SELECT\n", - " user_pseudo_id as user_id,\n", - " page_location as page_path\n", - " FROM `{SRC_PROJECT_ID}.{SRC_DATASET_ID}.event`\n", - " WHERE\n", - " event_name = 'page_view'\n", - " AND DATE(event_timestamp) BETWEEN @DATE_START AND @DATE_END\n", - " )\n", - " GROUP BY 1\n", - ")\n", - "WHERE\n", - " feature IS NOT NULL\n", - "QUALIFY\n", - " cumulative_traffic_percent <= @PERC_KEEP\n", - "ORDER BY 2 ASC\n", - "\"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "lAUS5p-h4ksN" - }, - "outputs": [], - "source": [ - "df = client.query(query=sql,\n", - " job_config=bigquery.QueryJobConfig(\n", - " query_parameters=[\n", - " bigquery.ScalarQueryParameter(\"DATE_START\", \"DATE\", DATE_START),\n", - " bigquery.ScalarQueryParameter(\"DATE_END\", \"DATE\", DATE_END),\n", - " bigquery.ScalarQueryParameter(\"RE_PAGE_PATH\", \"STRING\", RE_PAGE_PATH),\n", - " bigquery.ScalarQueryParameter(\"PERC_KEEP\", \"FLOAT64\", PERC_KEEP)\n", - " ]\n", - " )\n", - ").to_dataframe()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "8lj9uiQGvjQt", - "outputId": "cfb570ee-6ed4-49fb-ce94-057012040c4e" - }, - "outputs": [], - "source": [ - "print (f'Number of page path categories kept: {len(df)}')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "df" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "gxg2bKdTRFmZ" - }, - "outputs": [], - "source": [ - "def _clean_column_values(f):\n", - " if f == '/' or f == '' or f is None: return 'homepage'\n", - " if f.startswith('/'): f = f[1:]\n", - " if f.endswith('/'): f = f[:-1]\n", - " return re.sub('[^0-9a-zA-Z]+', '_', f)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "4xyv99rmL7ks" - }, - "outputs": [], - "source": [ - "t = jinja2.Template(\"\"\"\n", - "CREATE OR REPLACE PROCEDURE {{ DST_DATASET_ID }}.create_auto_audience_segmentation_dataset(\n", - " DATE_START DATE, DATE_END DATE, LOOKBACK_DAYS INT64\n", - ")\n", - "BEGIN\n", - "\n", - " DECLARE RE_PAGE_PATH STRING DEFAULT \"{{ re_page_path|e }}\";\n", - " \n", - " CREATE OR REPLACE TABLE `{{ DST_DATASET_ID }}.auto_audience_segmentation_full_dataset`\n", - " AS\n", - " WITH \n", - " visitor_pool AS (\n", - " SELECT\n", - " user_pseudo_id,\n", - " MAX(event_timestamp) as feature_timestamp,\n", - " DATE(MAX(event_timestamp)) - LOOKBACK_DAYS as date_lookback\n", - " FROM `{{ PROJECT_ID }}.{{ DATASET_ID }}.event`\n", - " WHERE DATE(event_timestamp) BETWEEN DATE_START AND DATE_END\n", - " GROUP BY 1\n", - " )\n", - "\n", - " SELECT\n", - " user_id,\n", - " 
feature_timestamp,\n", - " {% for f in features %}COUNTIF( REGEXP_EXTRACT(page_path, RE_PAGE_PATH) = '{{ f }}' ) as {{ clean_column_values(f) }},\n", - " {% endfor %}\n", - " FROM (\n", - " SELECT\n", - " vp.feature_timestamp,\n", - " ga.user_pseudo_id as user_id,\n", - " page_location as page_path\n", - " FROM `{{ PROJECT_ID }}.{{ DATASET_ID }}.event` as ga\n", - " INNER JOIN visitor_pool as vp\n", - " ON vp.user_pseudo_id = ga.user_pseudo_id\n", - " AND DATE(ga.event_timestamp) >= vp.date_lookback\n", - " WHERE\n", - " event_name = 'page_view'\n", - " AND DATE(ga.event_timestamp) BETWEEN DATE_START - LOOKBACK_DAYS AND DATE_END\n", - " )\n", - " GROUP BY 1, 2;\n", - "\n", - "END\n", - "\"\"\")\n", - "t.globals.update({'clean_column_values': _clean_column_values})" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "vk94UWo9PExu" - }, - "outputs": [], - "source": [ - "sql = t.render(\n", - " PROJECT_ID=SRC_PROJECT_ID,\n", - " DATASET_ID=SRC_DATASET_ID,\n", - " DST_DATASET_ID=DST_DATASET_ID,\n", - " re_page_path=RE_PAGE_PATH,\n", - " features=df.feature.tolist()\n", - ")\n", - "client.query(query=sql).result()\n", - "client.query(\n", - " query=f\"CALL `{PROJECT_ID}.{DST_DATASET_ID}.create_auto_audience_segmentation_dataset`(@DATE_START, @DATE_END, @LOOKBACK_DAYS);\",\n", - " job_config=bigquery.QueryJobConfig(\n", - " query_parameters=[\n", - " bigquery.ScalarQueryParameter(\"DATE_START\", \"DATE\", DATE_START),\n", - " bigquery.ScalarQueryParameter(\"DATE_END\", \"DATE\", DATE_END),\n", - " bigquery.ScalarQueryParameter(\"LOOKBACK_DAYS\", \"INTEGER\", LOOKBACK_DAYS)\n", - " ]\n", - " )\n", - ").result()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "jpOKPdAsOp85" - }, - "outputs": [], - "source": [ - "df = client.query(query=f\"SELECT * FROM `{PROJECT_ID}.{DST_DATASET_ID}.auto_audience_segmentation_full_dataset`\").to_dataframe()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 488 - }, - "id": "l2e3-LivTJ0O", - "outputId": "1b28ef3e-41ee-4c5d-df46-2d7f370c30ed" - }, - "outputs": [], - "source": [ - "df" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "aApnI4FUGEZV" - }, - "source": [ - "## Cluster Model [for Interests on Site]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "B6LlKVeTTR3-" - }, - "outputs": [], - "source": [ - "X = df.copy()\n", - "features = list(X.columns[2:]) # need to skip first two columns -> user_id, feature_timestamp\n", - "min_num_clusters = 3\n", - "max_num_clusters = len(features)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Y6IPaEBxW_xq" - }, - "outputs": [], - "source": [ - "def create_model(params):\n", - " model = Pipeline([\n", - " ('transform', ColumnTransformer(\n", - " transformers=[\n", - " ('tfidf',\n", - " TfidfTransformer(norm='l2'),\n", - " list(range(2, len(features) + 2)) # need to skip first two columns -> user_id, feature_timestamp\n", - " )\n", - " ]\n", - " )),\n", - " ('model', KMeans(\n", - " init='k-means++', n_init='auto',\n", - " random_state=42,\n", - " **params)\n", - " )\n", - " ])\n", - "\n", - " return model\n", - "\n", - "def objective(trial):\n", - " params = {\n", - " \"n_clusters\": trial.suggest_int(\"n_clusters\", min_num_clusters, max_num_clusters),\n", - " \"max_iter\": trial.suggest_int(\"max_iter\", 10, 1000, step=10),\n", - " \"tol\": 
trial.suggest_float(\"tol\", 1e-6, 1e-2, step=1e-6),\n", - " }\n", - "\n", - " model = create_model(params)\n", - " model.fit(X)\n", - " labels = model.predict(X)\n", - "\n", - " return silhouette_score(\n", - " model.named_steps['transform'].transform(X),\n", - " labels, metric='euclidean',\n", - " sample_size=int(len(df) * 0.1) if int(len(df) * 0.1) < 10_000 else 10_000,\n", - " random_state=42\n", - " ), params['n_clusters']" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000, - "referenced_widgets": [ - "13b98bf56a1f440bb79777e7b1830176", - "777358fa145d4892b8e314cf25927a21", - "3070539f3e3747a888620f25b097961b", - "be875ed55aae4b8084041d0152c5c242", - "5d9a0a95c6654bc2afe47cb1db05e223", - "882f399734384ce7a69ddcdddfc3e2f5", - "c7454e91123b4c32ba4a86f651b1659f", - "384301bb5d7b4027b5ab555f51736904", - "9f4ca2fa626e49d29fe2b77d0b378f2b", - "d22fc51767a2425fb5d78b46d4d0fd7b", - "610fa6f46ffc4ce9bc96fa31bf479b4a" - ] - }, - "id": "D03pzU4BOLFl", - "outputId": "ea703b5a-328b-4580-d213-b5385619c218", - "tags": [] - }, - "outputs": [], - "source": [ - "study = optuna.create_study(\n", - " directions=[\"maximize\", \"minimize\"],\n", - " sampler=optuna.samplers.TPESampler(seed=42, n_startup_trials=25)\n", - ")\n", - "study.optimize(objective,\n", - " n_trials=125,\n", - " show_progress_bar=True,\n", - " n_jobs=-1\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Optimization Results" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 542 - }, - "id": "9WMk5r1CQ7dG", - "outputId": "e4602cd6-01c0-4959-ab7f-0c489cd7fef0" - }, - "outputs": [], - "source": [ - "fig = optuna.visualization.plot_pareto_front(study, target_names=['Silhuette', 'Num. Clusters'], include_dominated_trials=False)\n", - "fig.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Find a trial with the least number clusters while still retaining sufficient performance.\n", - "**P_WIGGLE** is max percentage a trial can be worse than the best trial to be considered based on the Silhuette Score." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "P_WIGGLE = 10 #@param {type:\"slider\", min:1, max:99, step:1}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "id": "Cj3ezbi9SzzN", - "outputId": "d8d575c0-2d1e-4494-bdaf-bf23c4231906" - }, - "outputs": [], - "source": [ - "best_trials = sorted([(t.number, t.values[0], t.values[1], t.params) for t in study.best_trials], key=lambda x: x[1], reverse=True)\n", - "best_score = best_trials[0][1]\n", - "best_trials = sorted([(t.number, t.values[0], t.values[1], t.params) for t in study.best_trials], key=lambda x: (x[2], x[1]))\n", - "trial_chosen = None\n", - "for t in best_trials:\n", - " if (1 - t[1]/best_score) <= P_WIGGLE/100:\n", - " print (f'TRIAL {t[0]}:')\n", - " print (f\" Num. 
clusters: {int(t[2])}\")\n", - " print (f\" Best score: {round(best_score, 4)} / Chosen trial Score: {round(t[1], 4)}\")\n", - " print (f\" % worse than best: {100 * round((1 - t[1]/best_score), 4)}%\")\n", - " print (f\" Params: {t[3]}\")\n", - "\n", - " trial_chosen = t\n", - " break\n", - "\n", - "model = create_model(trial_chosen[3])\n", - "model.fit(X)\n", - "labels = model.predict(X)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Visualization" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Silhouette Analysis" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 724 - }, - "id": "QAhZp2DbcOjw", - "outputId": "8733fefc-843c-44fe-88c1-170f3f4be910" - }, - "outputs": [], - "source": [ - "def silhouette_visualization(X, model):\n", - " np.random.seed(42)\n", - " # Create a subplot with 1 row and 2 columns\n", - " fig, ax1 = plt.subplots(1, 1)\n", - " fig.set_size_inches(18, 7)\n", - "\n", - " sample = np.random.choice(len(X), \n", - " size=int(len(X) * 0.1) if int(len(X) * 0.1) < 10_000 else 10_000)\n", - " model_cluster_centers = model.named_steps['model'].cluster_centers_\n", - " X_tr = model.named_steps['transform'].transform(X)\n", - "\n", - " ax1.set_xlim([-0.1, 1])\n", - " ax1.set_ylim([0, len(sample) + (len(model_cluster_centers) + 1) * 10])\n", - "\n", - " cluster_labels = model.predict(X)\n", - " print(\"Clustering done for n_clusters={}\".format(len(model_cluster_centers)))\n", - "\n", - " cluster_labels = cluster_labels[sample]\n", - "\n", - " # The silhouette_score gives the average value for all the samples.\n", - " # This gives a perspective into the density and separation of the formed\n", - " # clusters\n", - " silhouette_avg = silhouette_score(X_tr[ sample, :], cluster_labels)\n", - " print(\"For n_clusters =\", len(model_cluster_centers),\n", - " \"The average silhouette_score is :\", silhouette_avg)\n", - "\n", - " # Compute the silhouette scores for each sample\n", - "\n", - " sample_silhouette_values = silhouette_samples(X_tr[ sample, :], cluster_labels)\n", - "\n", - " y_lower = 10\n", - " for i in range(len(model_cluster_centers)):\n", - " # Aggregate the silhouette scores for samples belonging to\n", - " # cluster i, and sort them\n", - " ith_cluster_silhouette_values = \\\n", - " sample_silhouette_values[cluster_labels == i]\n", - "\n", - " ith_cluster_silhouette_values.sort()\n", - "\n", - " size_cluster_i = ith_cluster_silhouette_values.shape[0]\n", - " y_upper = y_lower + size_cluster_i\n", - "\n", - " color = cm.nipy_spectral(float(i) / len(model_cluster_centers))\n", - " ax1.fill_betweenx(np.arange(y_lower, y_upper),\n", - " 0, ith_cluster_silhouette_values,\n", - " facecolor=color, edgecolor=color, alpha=0.7)\n", - "\n", - " # Label the silhouette plots with their cluster numbers at the middle\n", - " ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))\n", - "\n", - " # Compute the new y_lower for next plot\n", - " y_lower = y_upper + 10 # 10 for the 0 samples\n", - "\n", - " ax1.set_title(\"The silhouette plot.\")\n", - " ax1.set_xlabel(\"The silhouette coefficient values\")\n", - " ax1.set_ylabel(\"Cluster label\")\n", - "\n", - " # The vertical line for average silhouette score of all the values\n", - " ax1.axvline(x=silhouette_avg, color=\"red\", linestyle=\"--\")\n", - "\n", - " ax1.set_yticks([]) # Clear the yaxis labels / ticks\n", - " ax1.set_xticks([-0.1, 0, 0.2, 0.4, 
0.6, 0.8, 1])\n", - "\n", - " plt.suptitle(f\"Silhouette analysis for KMeans clustering on sample data with n_clusters = {len(model_cluster_centers)}\", \n", - " fontsize=14, fontweight='bold')\n", - "\n", - " plt.show()\n", - "\n", - "silhouette_visualization(X, model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Heatmap" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "id": "5Sv7ZejNosU6", - "outputId": "7e2c860c-b099-44be-9573-e2900801ca4c" - }, - "outputs": [], - "source": [ - "model_cluster_centers = pd.DataFrame(model.named_steps['model'].cluster_centers_, columns=features)\n", - "mcc = model_cluster_centers.T / model_cluster_centers.T.sum(axis=1).values.reshape(1, len(features)).T\n", - "\n", - "sns.set(font_scale=0.75)\n", - "fig, ax = plt.subplots(figsize=(int(model_cluster_centers.shape[1]/2), model_cluster_centers.shape[0]/2))\n", - "_ = sns.heatmap(mcc, ax=ax, cbar=False, annot=np.round(model_cluster_centers, 2).T)\n", - "ax.set_title('Heatmap')\n", - "ax.set(xlabel='Cluster')\n", - "plt.yticks(rotation=0)\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Cluster Sizes" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 542 - }, - "id": "XmlAwnOA89Nx", - "outputId": "53181570-0e33-443f-9462-cac9436a5012" - }, - "outputs": [], - "source": [ - "c = np.array(np.bincount(labels), dtype=np.float64)\n", - "c /= c.sum() /100\n", - "df_c = pd.DataFrame(c, index=[f\"Cluster {n}\" for n in range(c.size)], columns=['Size %'])\n", - "df_c.plot.bar()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## [Optional] Deployment to Vertex AI Model Registry" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "_f5sc789EXsA" - }, - "outputs": [], - "source": [ - "import pickle\n", - "with open('model.pkl', 'wb') as f:\n", - " pickle.dump(model, f) " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "MODEL_NAME = f\"interest-cluster-model\"\n", - "GCS_BUCKET=f'{PROJECT_ID}-maj-models'\n", - "ARTIFACT_GCS_URI = f\"gs://{GCS_BUCKET}/{MODEL_NAME}\"\n", - "PREBUILT_CONTAINER_URI = \"us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-2:latest\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!gsutil mb -p $PROJECT_ID -l $REGION gs://$GCS_BUCKET" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!gsutil cp model.pkl $ARTIFACT_GCS_URI/model.pkl\n", - "!rm model.pkl" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!gcloud ai models upload --region=$REGION --display-name=$MODEL_NAME --container-image-uri=$PREBUILT_CONTAINER_URI --artifact-uri=$ARTIFACT_GCS_URI" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!gcloud ai models list --region=$REGION --filter=display_name=$MODEL_NAME" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## [Optional] Delete GCP Resources" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - 
"source": [ - "import google.cloud.aiplatform as aiplatform\n", - "aiplatform.init(project=PROJECT_ID, location=REGION)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "for m in aiplatform.Model.list(filter=f\"display_name={MODEL_NAME}\", order_by=f\"create_time desc\"):\n", - " m.delete()" - ] - } - ], - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.2" - }, - "widgets": { - "application/vnd.jupyter.widget-state+json": { - "13b98bf56a1f440bb79777e7b1830176": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HBoxModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_777358fa145d4892b8e314cf25927a21", - "IPY_MODEL_3070539f3e3747a888620f25b097961b", - "IPY_MODEL_be875ed55aae4b8084041d0152c5c242" - ], - "layout": "IPY_MODEL_5d9a0a95c6654bc2afe47cb1db05e223" - } - }, - "3070539f3e3747a888620f25b097961b": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "FloatProgressModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "success", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_384301bb5d7b4027b5ab555f51736904", - "max": 100, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_9f4ca2fa626e49d29fe2b77d0b378f2b", - "value": 100 - } - }, - "384301bb5d7b4027b5ab555f51736904": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "5d9a0a95c6654bc2afe47cb1db05e223": { - 
"model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "610fa6f46ffc4ce9bc96fa31bf479b4a": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "777358fa145d4892b8e314cf25927a21": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_882f399734384ce7a69ddcdddfc3e2f5", - "placeholder": "​", - "style": "IPY_MODEL_c7454e91123b4c32ba4a86f651b1659f", - "value": "100%" - } - }, - "882f399734384ce7a69ddcdddfc3e2f5": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - 
} - }, - "9f4ca2fa626e49d29fe2b77d0b378f2b": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "ProgressStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "be875ed55aae4b8084041d0152c5c242": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_d22fc51767a2425fb5d78b46d4d0fd7b", - "placeholder": "​", - "style": "IPY_MODEL_610fa6f46ffc4ce9bc96fa31bf479b4a", - "value": " 100/100 [03:50<00:00, 2.57s/it]" - } - }, - "c7454e91123b4c32ba4a86f651b1659f": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "d22fc51767a2425fb5d78b46d4d0fd7b": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - } - } - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/python/notebooks/vertex_ai_auto_audience_segmentation_ga4_interest.md b/python/notebooks/vertex_ai_auto_audience_segmentation_ga4_interest.md deleted file mode 100644 index 2aa2a476..00000000 --- a/python/notebooks/vertex_ai_auto_audience_segmentation_ga4_interest.md +++ /dev/null @@ -1,80 +0,0 @@ -# Automated Audience Segmentation Approach - -## Challenges -* We don’t know ahead of time how many segments (clusters) will be useful to keep going - * With k-means clustering you have to specify k ahead of training - * With hierarchical clustering, you don’t have to specify the number of clusters, but at 
some point, you have to draw a threshold line that will determine the number of segments -* Once you have the segments, you have to give them business names - * Hard to do programmatically - * Likely a need for human intervention -* Segments need to be explainable - * Need to have meaningful business value - * The customer needs to understand them - * Segmentation can be done on many things and many attributes - * Throwing all possible attributes into segmentation makes them very hard to explain nicely - * A small number of focus attributes helps explainability, but you are leaving a lot out - * Interest vs. engagement segmentation -* A retrain might generate a new set of segments - * Confusing for the customer - - -## Solutions -#### We don’t know ahead of time how many segments (clusters) will be helpful to keep going -* Running k-means within a hyperparameter framework - * Hyper-params - * `k` (number of clusters) - * Optional: - * Different strategies to normalize the data - * Adding and removing columns - * Different levels of outlier removal - * Optimization metrics - * [Silhouette Score](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html) ** *our preferred method* - * [Sum of Square Distance](https://www.google.com/books/edition/Application_of_Intelligent_Systems_in_Mu/j5YqEAAAQBAJ?hl=en&gbpv=1&dq=SSM+sum+of+square+distance+kmeans&pg=PA228&printsec=frontcover) (SSM) - * [Davies–Bouldin index](https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index) - * General rule: prefer smaller `k`, this helps with explainability -* Possible hybrid solution: - * Run hyper-optimization for `k`, and optimize silhouette score - * From all trials, note the score of the best trial - * Pick a trial that has a silhouette score within `x%` of the best silhouette score but has the lowest `k` of those within `x%` - -#### Once you have the segments, you have to give them business names -* Segments could be pushed for activation as in S1, S2, and S3,...but there should be some sort of dashboard that explains what each number means, what attributes are considered - * Business names, in this case, would be done after the fact within the dashboard after a human would apply business knowledge review the dashboard and add notes) - -#### Segments need to be explainable -* The easiest way to make them explainable is to reduce the number of attributes to something below 10 (there are exceptions, of course) -* You can only reduce the number of attributes if you know your clusters' aim, as you can always cluster by many different attributes, so picking a direction you want your clusters to go is critical. Some examples: - * Interest in the site - * Engagement on the site - * Geo attributes - * Sales funnel stage -* The typical starter segmentation is using page paths (and possibly events) to cluster by site engagement or site interest - * Site engagement would heavily rely on the frequency of visits to particular page paths, with added global attributes like the number of sessions, recency, and time between sessions,... - * Site interest would rely heavily on page paths, but instead of frequency, we would need to infer interest from each visitor. - * Example: There are three possible URLs: A, B, C - * Visitor 1: A: 1, B: 4, C: 8 - * Visitor 2: A: 0, B:1, C: 2 - * Engagement-wise, those two visitors are very different, but interest-wise, they should fall into the same segment. 
- * We can use TF-IDF to normalize the vectors and pretend that page URLs are words and each vector is a document. Then normalize all vectors to the length of 1. - * This removes all frequency information, and the above two examples should have very similar vectors - * Likely, one could create a query that extracts the page level 1 path and creates a dataset for clustering based on the interest and invariant of the customer. Similarly, for engagement, it should be even easier to grab engagement-level metrics (sessions, pageviews, time on site, recency,...) independent of the customer -* Ideally, a dashboard should be generated where cluster exploration is possible and where you can compare clusters head to head to see where the centroids differ - -#### A retrain might generate a new set of segments -* This could be solved in two ways - * A) Use the same random seed when retraining - * B) Use the previous centroids as your initial centroids in the retrain -* BQML doesn’t support setting random seeds, [so the best option is to just use previous centroids as the init centroids for the retrain](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-kmeans#kmeans_init_method) - * This ensures that even with some new data, the centroids will not end up being too different from what they were before, which means the business names can likely be kept as well - - - - - - - - - - - - diff --git a/python/pipelines/auto_segmentation_pipelines.py b/python/pipelines/auto_segmentation_pipelines.py index a64d86ea..508d863d 100644 --- a/python/pipelines/auto_segmentation_pipelines.py +++ b/python/pipelines/auto_segmentation_pipelines.py @@ -56,6 +56,21 @@ def training_pl( min_num_clusters: int, image_uri: str, ): + """ + This pipeline trains a scikit-learn clustering model and uploads it to GCS. + + Args: + project_id: The Google Cloud project ID. + dataset: The BigQuery dataset where the training data is stored. + location: The Google Cloud region where the pipeline will be run. + training_table: The BigQuery table containing the training data. + bucket_name: The GCS bucket where the trained model will be uploaded. + model_name: The name of the trained model. + p_wiggle: The p_wiggle parameter for the scikit-learn clustering model. + min_num_clusters: The minimum number of clusters for the scikit-learn clustering model. + image_uri: The image URI for the scikit-learn clustering model. + """ + # Train scikit-learn clustering model and upload to GCS train_interest_based_segmentation_model = train_scikit_cluster_model( location=location, @@ -99,6 +114,19 @@ def prediction_pl( pubsub_activation_topic: str, pubsub_activation_type: str ): + """ + This pipeline runs batch prediction using a Vertex AI model and sends a pubsub activation message. + + Args: + project_id: The Google Cloud project ID. + location: The Google Cloud region where the pipeline will be run. + model_name: The name of the Vertex AI model. + bigquery_source: The BigQuery table containing the prediction data. + bigquery_destination_prefix: The BigQuery table prefix where the prediction results will be stored. + pubsub_activation_topic: The Pub/Sub topic to send the activation message. + pubsub_activation_type: The type of activation message to send. 
+ """ + # Get the latest model named `model_name` model_op = get_latest_model( project=project_id, diff --git a/python/pipelines/compiler.py b/python/pipelines/compiler.py index e2f5cbbc..996455b5 100644 --- a/python/pipelines/compiler.py +++ b/python/pipelines/compiler.py @@ -22,6 +22,8 @@ ''' # config path : pipeline module and function name +# This dictionary maps pipeline names to their corresponding module and function names. +# This allows the script to dynamically import the correct pipeline function based on the provided pipeline name. pipelines_list = { 'vertex_ai.pipelines.feature-creation-auto-audience-segmentation.execution': "pipelines.feature_engineering_pipelines.auto_audience_segmentation_feature_engineering_pipeline", 'vertex_ai.pipelines.feature-creation-aggregated-value-based-bidding.execution': "pipelines.feature_engineering_pipelines.aggregated_value_based_bidding_feature_engineering_pipeline", @@ -43,6 +45,15 @@ } # key should match pipeline names as in the `config.yaml.tftpl` files for automatic compilation if __name__ == "__main__": + """ + This Python code defines a script for compiling Vertex AI pipelines. + This script provides a convenient way to compile Vertex AI pipelines from a configuration file. + It allows users to specify the pipeline name, parameters, and output filename, and it automatically handles the compilation process. + It takes three arguments: + -c: Path to the configuration YAML file (config.yaml) + -p: Pipeline key name as it is in config.yaml + -o: The compiled pipeline output filename + """ logging.basicConfig(level=logging.INFO) parser = ArgumentParser() @@ -64,11 +75,14 @@ required=True, help='the compiled pipeline output filename') + # Parses the provided command-line arguments. It retrieves the path to the configuration file, the pipeline name, and the output filename. args = parser.parse_args() pipeline_params={} + # Opens the configuration file and uses the yaml module to parse it. + # It extracts the pipeline parameters based on the provided pipeline name. with open(args.config, encoding='utf-8') as fh: pipeline_params = yaml.full_load(fh) for i in args.pipeline.split('.'): @@ -77,6 +91,19 @@ logging.info(pipeline_params) + # The script checks the pipeline type: + # If the pipeline type is tabular-workflows, it uses the compile_automl_tabular_pipeline function to compile the pipeline. + # Otherwise, it uses the compile_pipeline function to compile the pipeline. + # Both functions take the following arguments: + # template_path: Path to the compiled pipeline template file. + # pipeline_name: Name of the pipeline. + # pipeline_parameters: Parameters to pass to the pipeline. + # pipeline_parameters_substitutions: Substitutions to apply to the pipeline parameters. + # enable_caching: Whether to enable caching for the pipeline. + # type_check: Whether to perform type checking on the pipeline parameters. + # The compile_automl_tabular_pipeline function also takes the following arguments: + # parameters_path: Path to the pipeline parameters file. + # exclude_features: List of features to exclude from the pipeline. 
if pipeline_params['type'] == 'tabular-workflows': compile_automl_tabular_pipeline( template_path = args.output, diff --git a/python/pipelines/components/bigquery/component.py b/python/pipelines/components/bigquery/component.py index e4c98c36..8d4cddac 100644 --- a/python/pipelines/components/bigquery/component.py +++ b/python/pipelines/components/bigquery/component.py @@ -732,11 +732,12 @@ def bq_dynamic_query_exec_output( FROM ( SELECT REGEXP_EXTRACT(page_path, '{{re_page_path}}') as feature, - COUNT(DISTINCT user_id) as users + COUNT(DISTINCT user_pseudo_id) as users FROM ( SELECT - user_pseudo_id as user_id, + user_pseudo_id, + user_id, page_location as page_path FROM `{{mds_project_id}}.{{mds_dataset}}.event` WHERE @@ -868,13 +869,15 @@ def _clean_column_values(f): visitor_pool AS ( SELECT user_pseudo_id, + user_id, MAX(event_timestamp) as feature_timestamp, DATE(MAX(event_timestamp)) - LOOKBACK_DAYS as date_lookback FROM `{{mds_project_id}}.{{mds_dataset}}.event` WHERE DATE(event_timestamp) BETWEEN DATE_START AND DATE_END - GROUP BY 1 + GROUP BY 1, 2 ) SELECT + user_pseudo_id, user_id, feature_timestamp, {% for f in features %}COUNTIF( REGEXP_EXTRACT(page_path, RE_PAGE_PATH) = '{{ f }}' ) as {{ clean_column_values(f) }}, @@ -882,7 +885,8 @@ def _clean_column_values(f): FROM ( SELECT vp.feature_timestamp, - ga.user_pseudo_id as user_id, + ga.user_pseudo_id, + ga.user_id, page_location as page_path FROM `{{mds_project_id}}.{{mds_dataset}}.event` as ga INNER JOIN visitor_pool as vp @@ -892,7 +896,7 @@ def _clean_column_values(f): event_name = 'page_view' AND DATE(ga.event_timestamp) BETWEEN DATE_START AND DATE_END ) - GROUP BY 1, 2; + GROUP BY 1, 2, 3; END """) template.globals.update({'clean_column_values': _clean_column_values}) diff --git a/python/pipelines/components/pubsub/component.py b/python/pipelines/components/pubsub/component.py index 5dac4621..7a26251b 100644 --- a/python/pipelines/components/pubsub/component.py +++ b/python/pipelines/components/pubsub/component.py @@ -39,6 +39,18 @@ def send_pubsub_activation_msg( activation_type: str, predictions_table: Input[Dataset] ) -> None: + """ + This function sends a Pub/Sub message to trigger the activation application. + + Args: + project: The Google Cloud project ID. + topic_name: The name of the Pub/Sub topic to send the message to. + activation_type: The type of activation message to send. + predictions_table: The BigQuery table containing the predictions to be activated. + + Returns: + None + """ import json import logging diff --git a/python/pipelines/components/python/component.py b/python/pipelines/components/python/component.py index bb061c8d..7099065a 100644 --- a/python/pipelines/components/python/component.py +++ b/python/pipelines/components/python/component.py @@ -12,8 +12,8 @@ # See the License for the specific language governing permissions and # limitations under the License. -from typing import Optional, List -from kfp.dsl import component, Output, Artifact, Model, Input, Metrics, Dataset +from typing import Optional +from kfp.dsl import component, Output, Model import os import yaml @@ -28,7 +28,7 @@ vertex_components_params = configs['vertex_ai']['components'] repo_params = configs['artifact_registry']['pipelines_docker_repo'] - # target_image = f"{repo_params['region']}-docker.pkg.dev/{repo_params['project_id']}/{repo_params['name']}/{vertex_components_params['image_name']}:{vertex_components_params['tag']}" + # defines the base_image variable, which specifies the Docker image to be used for the component. 
This image is retrieved from the config.yaml file, which contains configuration parameters for the project. base_image = f"{repo_params['region']}-docker.pkg.dev/{repo_params['project_id']}/{repo_params['name']}/{vertex_components_params['base_image_name']}:{vertex_components_params['base_image_tag']}" @@ -43,8 +43,41 @@ def train_scikit_cluster_model( model_name: str, p_wiggle: int = 10, min_num_clusters: int = 3, + columns_to_skip: int = 3, timeout: Optional[float] = 1800 ) -> None: + """ + This component trains a scikit-learn cluster model using the KMeans algorithm. It provides a reusable and configurable way + to train a scikit-learn cluster model using KFP. + + The component's training logic is described in the following steps: + Constructs a BigQuery client object using the provided project ID. + Reads the training data from the specified BigQuery table. + Defines a function _create_model to create a scikit-learn pipeline with a KMeans clustering model. + Defines an objective function _objective for hyperparameter optimization using Optuna. + This function trains the model with different hyperparameter values and evaluates its performance using the silhouette score. + Creates an Optuna study and optimizes the objective function to find the best hyperparameters. + Trains the final model with the chosen hyperparameters. + Saves the trained model as a pickle file. + Creates a GCS bucket if it doesn't exist. + Uploads the pickled model to the GCS bucket. + + Args: + project_id: The Google Cloud project ID. + dataset: The BigQuery dataset name. + location: The Google Cloud region where the BigQuery dataset is located. + training_table: The BigQuery table name containing the training data. + cluster_model: The output model artifact. + bucket_name: The Google Cloud Storage bucket name to upload the trained model. + model_name: The name of the model to be saved in the bucket. + p_wiggle: The maximum percentage a trial's silhouette score may fall below the best trial's score while still being considered; among qualifying trials, the one with the fewest clusters is selected. + min_num_clusters: The minimum number of clusters to consider. + columns_to_skip: The number of columns to skip from the beginning of the dataset. + timeout: The maximum time in seconds to wait for the training job to complete.
+ + Returns: + None + """ import numpy as np import pandas as pd @@ -74,8 +107,9 @@ def train_scikit_cluster_model( query=f"""SELECT * FROM `{project_id}.{dataset}.{training_table}`""" ).to_dataframe() - # Skipping the first two columns: [user_id, feature_timestamp] - features = list(training_dataset_df.columns[2:]) + # Skipping the first three columns: [user_pseudo_id, user_id, feature_timestamp] + columns_to_skip + features = list(training_dataset_df.columns[columns_to_skip:]) min_num_clusters = 3 max_num_clusters = len(features) @@ -85,7 +119,7 @@ def _create_model(params): transformers=[ ('tfidf', TfidfTransformer(norm='l2'), - list(range(2, len(features) + 2)) # Skipping the first two columns: [user_id, feature_timestamp] + list(range(columns_to_skip, len(features) + columns_to_skip)) # Skipping the first three columns: [user_pseudo_id, user_id, feature_timestamp] ) ] )), @@ -186,13 +220,4 @@ def _upload_to_gcs(bucket_name, model_filename, destination_blob_name=""): _upload_to_gcs(bucket_name, model_filename, destination_blob_name) -@component(base_image=base_image) -def upload_scikit_model_to_gcs_bucket( - project_id: str, - location: str, - cluster_model: Input[Model], - bucket_name: Input[Artifact], -) -> None: - pass - diff --git a/python/pipelines/components/vertex/component.py b/python/pipelines/components/vertex/component.py index 5c4bb8fe..80f813b5 100644 --- a/python/pipelines/components/vertex/component.py +++ b/python/pipelines/components/vertex/component.py @@ -41,6 +41,7 @@ + @component( base_image=base_image, #target_image=target_image, @@ -207,6 +208,8 @@ def _isnan(value): elected_model.schema_title = 'google.VertexModel' #classification_metrics_logger.log_roc_curve(fpr_arr,tpr_arr, th_arr) + + @component( base_image=base_image, @@ -219,7 +222,20 @@ def get_latest_model( display_name: str, elected_model: Output[VertexModel] ) -> None: - """ + """Vertex pipelines component that elects the latest model based on the display name. + + Args: + project (str): + Project to retrieve models and model registry from + location (str): + Location to retrieve models and model registry from + display_name (str): + The display name of the model for which selection is going to be made + elected_model: Output[VertexModel]: + The output VertexModel object containing the latest model information. + + Raises: + Exception: If no models are found in the vertex model registry that match the display name. """ from google.cloud import aiplatform as aip @@ -283,6 +299,8 @@ def list(cls): elected_model.schema_title = 'google.VertexModel' + + @component(base_image=base_image) def batch_prediction( destination_table: Output[Dataset], @@ -298,6 +316,38 @@ def batch_prediction( generate_explanation: bool = False, dst_table_expiration_hours: int = 0 ): + """Vertex pipelines component that performs batch prediction using a Vertex AI model. + + Args: + destination_table (Output[Dataset]): + The output BigQuery table where the predictions will be stored. + bigquery_source (str): + The BigQuery table containing the data to be predicted. + bigquery_destination_prefix (str): + The BigQuery table prefix where the predictions will be stored. + job_name_prefix (str): + The prefix for the batch prediction job name. + model (Input[VertexModel]): + The Vertex AI model to be used for prediction. + machine_type (str, optional): + The machine type to use for the batch prediction job. Defaults to "n1-standard-2". 
+ max_replica_count (int, optional): + The maximum number of replicas to use for the batch prediction job. Defaults to 10. + batch_size (int, optional): + The batch size to use for the batch prediction job. Defaults to 64. + accelerator_count (int, optional): + The number of accelerators to use for the batch prediction job. Defaults to 0. + accelerator_type (str, optional): + The type of accelerators to use for the batch prediction job. Defaults to None. + generate_explanation (bool, optional): + Whether to generate explanations for the predictions. Defaults to False. + dst_table_expiration_hours (int, optional): + The number of hours after which the destination table will expire. Defaults to 0. + + Raises: + Exception: If the batch prediction job fails. + """ + from datetime import datetime, timedelta, timezone import logging from google.cloud import bigquery @@ -339,6 +389,8 @@ def batch_prediction( logging.info(batch_prediction_job.to_dict()) + + @component(base_image=base_image) # Note currently KFP SDK doesn't support outputting artifacts in `google` namespace. # Use the base type dsl.Artifact instead. @@ -348,6 +400,21 @@ def return_unmanaged_model( model_name: str, model: Output[Artifact] ) -> None: + """Vertex pipelines component that returns an unmanaged model artifact. + + Args: + image_uri (str): + The URI of the container image for the unmanaged model. + bucket_name (str): + The name of the Google Cloud Storage bucket where the unmanaged model is stored. + model_name (str): + The name of the unmanaged model file in the Google Cloud Storage bucket. + model (Output[Artifact]): + The output VertexModel artifact. + + Raises: + Exception: If the model artifact cannot be created. + """ from google_cloud_pipeline_components import v1 from google_cloud_pipeline_components.types import artifact_types from kfp import dsl @@ -369,6 +436,21 @@ def get_tabular_model_explanation( model: Input[VertexModel], model_explanation: Output[Dataset], ) -> None: + """Vertex pipelines component that retrieves tabular model explanations from the AutoML API. + + Args: + project (str): + Project to retrieve models and model registry from + location (str): + Location to retrieve models and model registry from + model (Input[VertexModel]): + The Vertex AI model for which explanations will be retrieved. + model_explanation (Output[Dataset]): + The output BigQuery dataset where the model explanations will be stored. + + Raises: + Exception: If the model explanations cannot be retrieved. + """ from google.cloud import aiplatform import logging import re diff --git a/python/pipelines/feature_engineering_pipelines.py b/python/pipelines/feature_engineering_pipelines.py index 2cd7416e..32305fc6 100644 --- a/python/pipelines/feature_engineering_pipelines.py +++ b/python/pipelines/feature_engineering_pipelines.py @@ -33,14 +33,37 @@ def auto_audience_segmentation_feature_engineering_pipeline( mds_dataset: str, stored_procedure_name: str, full_dataset_table: str, - #training_table: str, - #inference_table: str, reg_expression: str, query_auto_audience_segmentation_inference_preparation: str, query_auto_audience_segmentation_training_preparation: str, perc_keep: int = 35, timeout: Optional[float] = 3600.0 ): + """ + This pipeline defines the steps for feature engineering for the auto audience segmentation model. + + Args: + project_id: The Google Cloud project ID. + location: The Google Cloud region where the pipeline will be run. + dataset: The BigQuery dataset where the raw data is stored. 
+ date_start: The start date for the data to be processed. + date_end: The end date for the data to be processed. + feature_table: The BigQuery table where the feature data will be stored. + mds_project_id: The Google Cloud project ID where the Marketing Data Store (MDS) is located. + mds_dataset: The MDS dataset where the product data is stored. + stored_procedure_name: The name of the BigQuery stored procedure that will be used to prepare the full dataset. + full_dataset_table: The BigQuery table where the full dataset will be stored. + #training_table: The BigQuery table where the training data will be stored. + #inference_table: The BigQuery table where the inference data will be stored. + reg_expression: The regular expression that will be used to identify the pages to be included in the analysis. + query_auto_audience_segmentation_inference_preparation: The SQL query that will be used to prepare the inference data. + query_auto_audience_segmentation_training_preparation: The SQL query that will be used to prepare the training data. + perc_keep: The percentage of pages to be included in the analysis. + timeout: The timeout for the pipeline in seconds. + + Returns: + None + """ # Feature data preparation feature_table_preparation = bq_dynamic_query_exec_output( location=location, @@ -91,6 +114,20 @@ def aggregated_value_based_bidding_feature_engineering_pipeline( query_aggregated_value_based_bidding_explanation_preparation: str, timeout: Optional[float] = 3600.0 ): + """ + This pipeline defines the steps for feature engineering for the aggregated value based bidding model. + + Args: + project_id: The Google Cloud project ID. + location: The Google Cloud region where the pipeline will be run. + query_aggregated_value_based_bidding_training_preparation: The SQL query that will be used to prepare the training data. + query_aggregated_value_based_bidding_explanation_preparation: The SQL query that will be used to prepare the explanation data. + timeout: The timeout for the pipeline in seconds. + + Returns: + None + """ + # Training data preparation training_table_preparation = sp( project=project_id, @@ -111,12 +148,27 @@ def audience_segmentation_feature_engineering_pipeline( project_id: str, location: Optional[str], query_user_lookback_metrics: str, - query_user_scoped_segmentation_metrics: str, query_user_segmentation_dimensions: str, query_audience_segmentation_inference_preparation: str, query_audience_segmentation_training_preparation: str, timeout: Optional[float] = 3600.0 ): + """ + This pipeline defines the steps for feature engineering for the audience segmentation model. + + Args: + project_id: The Google Cloud project ID. + location: The Google Cloud region where the pipeline will be run. + query_user_lookback_metrics: The SQL query that will be used to calculate the user lookback metrics. + query_user_segmentation_dimensions: The SQL query that will be used to calculate the user segmentation dimensions. + query_audience_segmentation_inference_preparation: The SQL query that will be used to prepare the inference data. + query_audience_segmentation_training_preparation: The SQL query that will be used to prepare the training data. + timeout: The timeout for the pipeline in seconds. 
+ + Returns: + None + """ + # Features Preparation phase_1 = list() phase_1.append(sp( @@ -153,12 +205,27 @@ def purchase_propensity_feature_engineering_pipeline( query_purchase_propensity_label: str, query_user_dimensions: str, query_user_rolling_window_metrics: str, - query_user_scoped_metrics: str, - query_user_session_event_aggregated_metrics: str, query_purchase_propensity_inference_preparation: str, query_purchase_propensity_training_preparation: str, timeout: Optional[float] = 3600.0 ): + """ + This pipeline defines the steps for feature engineering for the purchase propensity model. + + Args: + project_id: The Google Cloud project ID. + location: The Google Cloud region where the pipeline will be run. + query_purchase_propensity_label: The SQL query that will be used to calculate the purchase propensity label. + query_user_dimensions: The SQL query that will be used to calculate the user dimensions. + query_user_rolling_window_metrics: The SQL query that will be used to calculate the user rolling window metrics. + query_purchase_propensity_inference_preparation: The SQL query that will be used to prepare the inference data. + query_purchase_propensity_training_preparation: The SQL query that will be used to prepare the training data. + timeout: The timeout for the pipeline in seconds. + + Returns: + None + """ + # Features Preparation phase_1 = list() phase_1.append( @@ -203,11 +270,27 @@ def customer_lifetime_value_feature_engineering_pipeline( query_customer_lifetime_value_label: str, query_user_lifetime_dimensions: str, query_user_rolling_window_lifetime_metrics: str, - query_user_scoped_lifetime_metrics: str, query_customer_lifetime_value_inference_preparation: str, query_customer_lifetime_value_training_preparation: str, timeout: Optional[float] = 3600.0 ): + """ + This pipeline defines the steps for feature engineering for the customer lifetime value model. + + Args: + project_id: The Google Cloud project ID. + location: The Google Cloud region where the pipeline will be run. + query_customer_lifetime_value_label: The SQL query that will be used to calculate the customer lifetime value label. + query_user_lifetime_dimensions: The SQL query that will be used to calculate the user lifetime dimensions. + query_user_rolling_window_lifetime_metrics: The SQL query that will be used to calculate the user rolling window lifetime metrics. + query_customer_lifetime_value_inference_preparation: The SQL query that will be used to prepare the inference data. + query_customer_lifetime_value_training_preparation: The SQL query that will be used to prepare the training data. + timeout: The timeout for the pipeline in seconds. + + Returns: + None + """ + # Features Preparation phase_1 = list() phase_1.append( @@ -252,6 +335,19 @@ def reporting_preparation_pl( query_aggregate_last_day_predictions: str, timeout: Optional[float] = 3600.0 ): + """ + This pipeline defines the steps for preparing the reporting data. + + Args: + project_id: The Google Cloud project ID. + location: The Google Cloud region where the pipeline will be run. + query_aggregate_last_day_predictions: The SQL query that will be used to aggregate the last day predictions. + timeout: The timeout for the pipeline in seconds. 
+ + Returns: + None + """ + # Reporting Preparation aggregate_predictions = sp( project=project_id, diff --git a/python/pipelines/pipeline_ops.py b/python/pipelines/pipeline_ops.py index 8058137e..d2bf2ceb 100644 --- a/python/pipelines/pipeline_ops.py +++ b/python/pipelines/pipeline_ops.py @@ -13,6 +13,10 @@ # limitations under the License. from datetime import datetime +from os import name +from tracemalloc import start + +import pip from kfp import compiler from google.cloud.aiplatform.pipeline_jobs import PipelineJob, _set_enable_caching_value from google.cloud.aiplatform import TabularDataset, Artifact @@ -33,6 +37,19 @@ def substitute_pipeline_params( pipeline_params: Dict[str, Any], pipeline_param_substitutions: Dict[str, Any] ) -> Dict[str, Any]: + """ + This function substitutes placeholders in the pipeline_params dictionary with values from the pipeline_param_substitutions dictionary. + + Args: + pipeline_params: A dictionary of pipeline parameters. + pipeline_param_substitutions: A dictionary of substitutions to apply to the pipeline parameters. + + Returns: + A dictionary of pipeline parameters with the substitutions applied. + + Raises: + Exception: If a placeholder is not found in the pipeline_param_substitutions dictionary. + """ # if pipeline parameters include placeholders such as {PROJECT_ID} etc, # the following will replace such placeholder with the values @@ -45,12 +62,39 @@ def substitute_pipeline_params( def get_bucket_name_and_path(uri): + """ + This function takes a Google Cloud Storage URI and returns the bucket name and path. + + Args: + uri: The Google Cloud Storage URI. + + Returns: + A tuple containing the bucket name and path. + + Raises: + ValueError: If the URI is not a valid Google Cloud Storage URI. + """ + if not uri.startswith("gs://"): + raise ValueError("URI must start with gs://") + no_prefix_uri = uri[len("gs://"):] splits = no_prefix_uri.split("/") + return splits[0], "/".join(splits[1:]) def write_to_gcs(uri: str, content: str): + """ + Writes the given content to a Google Cloud Storage (GCS) bucket. + + Args: + uri: The GCS URI of the file to write to. + content: The content to write to the file. + + Raises: + ValueError: If the URI is not a valid GCS URI. + """ + bucket_name, path = get_bucket_name_and_path(uri) storage_client = storage.Client() bucket = storage_client.get_bucket(bucket_name) @@ -59,6 +103,21 @@ def write_to_gcs(uri: str, content: str): def generate_auto_transformation(column_names: List[str]) -> List[Dict[str, Any]]: + """ + Generates a list of auto-transformation dictionaries for the given column names. + + Args: + column_names: A list of column names. + + Returns: + A list of auto-transformation dictionaries. + + Raises: + ValueError: If the column_names list is empty. + """ + if not column_names: + raise ValueError("column_names must not be empty") + transformations = [] for column_name in column_names: transformations.append({"auto": {"column_name": column_name}}) @@ -66,6 +125,20 @@ def generate_auto_transformation(column_names: List[str]) -> List[Dict[str, Any] def write_auto_transformations(uri: str, column_names: List[str]): + """ + Generates a list of auto-transformation dictionaries for the given column names and writes them to a Google Cloud Storage (GCS) bucket. + + Args: + uri: The GCS URI of the file to write to. + column_names: A list of column names. + + Raises: + ValueError: If the column_names list is empty. 
+ """ + + if not column_names: + raise ValueError("column_names must not be empty") + transformations = generate_auto_transformation(column_names) write_to_gcs(uri, json.dumps(transformations)) @@ -73,12 +146,45 @@ def write_auto_transformations(uri: str, column_names: List[str]): def read_custom_transformation_file(custom_transformation_file: str): - import json - with open(custom_transformation_file, "r") as f: - transformations = json.load(f) + """ + Reads a custom transformation file and returns the transformations as a list of dictionaries. + + Args: + custom_transformation_file: The path to the custom transformation file. + + Returns: + A list of dictionaries representing the custom transformations. + + Raises: + FileNotFoundError: If the custom transformation file does not exist. + JSONDecodeError: If the custom transformation file is not valid JSON. + """ + + transformations = None + try: + with open(custom_transformation_file, "r") as f: + transformations = json.load(f) + except FileNotFoundError: + raise FileNotFoundError(f"Custom transformation file not found: {custom_transformation_file}") + except json.JSONDecodeError: + raise json.JSONDecodeError(f"Invalid JSON in custom transformation file: {custom_transformation_file}") + return transformations + def write_custom_transformations(uri: str, custom_transformation_file: str): + """ + Writes custom transformation definitions to a Google Cloud Storage (GCS) bucket. + + Args: + uri: The GCS URI of the file to write to. + custom_transformation_file: The path to the custom transformation file. + + Raises: + FileNotFoundError: If the custom transformation file does not exist. + JSONDecodeError: If the custom transformation file is not valid JSON. + """ + transformations = read_custom_transformation_file(custom_transformation_file) write_to_gcs(uri, json.dumps(transformations)) @@ -95,11 +201,34 @@ def compile_pipeline( pipeline_parameters_substitutions: Optional[Dict[str, Any]] = None, enable_caching: bool = True, type_check: bool = True) -> str: + """ + Compiles a Vertex AI Pipeline. + + This function takes a pipeline function, a template path, a pipeline name, and optional pipeline parameters and substitutions, and compiles them into a Vertex AI Pipeline YAML file. + + Args: + pipeline_func: The pipeline function to compile. + template_path: The path to the pipeline template file. + pipeline_name: The name of the pipeline. + pipeline_parameters: The parameters to pass to the pipeline. + pipeline_parameters_substitutions: A dictionary of substitutions to apply to the pipeline parameters. + enable_caching: Whether to enable caching for the pipeline. + type_check: Whether to perform type checking on the pipeline parameters. + + Returns: + The path to the compiled pipeline YAML file. + + Raises: + Exception: If an error occurs while compiling the pipeline. + """ if pipeline_parameters_substitutions != None: pipeline_parameters = substitute_pipeline_params( pipeline_parameters, pipeline_parameters_substitutions) - print(pipeline_parameters) + logging.info("Pipeline parameters: {}".format(pipeline_parameters)) + + # The function uses the compiler.Compiler() class to compile the pipeline defined by the pipeline_func function. + # The compiled pipeline is saved to the template_path file. 
compiler.Compiler().compile( pipeline_func=pipeline_func, package_path=template_path, @@ -108,12 +237,15 @@ def compile_pipeline( type_check=type_check, ) + # The function opens the compiled pipeline template file and loads the configuration using the yaml.safe_load() function. with open(template_path, 'r') as file: configuration = yaml.safe_load(file) + # The function sets the enable_caching value of the configuration to the enable_caching parameter. _set_enable_caching_value(pipeline_spec=configuration, enable_caching=enable_caching) + # Saves the updated pipeline configuration back to the template_path file. with open(template_path, 'w') as yaml_file: yaml.dump(configuration, yaml_file) @@ -135,6 +267,33 @@ def run_pipeline_from_func( credentials: Optional[credentials.Credentials] = None, encryption_spec_key_name: Optional[str] = None, wait: bool = False) -> str: + """ + Runs a Vertex AI Pipeline from a function. + + This function takes a pipeline function, a pipeline root directory, a project ID, a location, a service account, pipeline parameters, and optional parameters for pipeline parameter substitutions, caching, experiment name, job ID, labels, credentials, encryption key name, and waiting for completion. It creates a PipelineJob object from the pipeline function, submits the pipeline to Vertex AI, and optionally waits for the pipeline to complete. + + Args: + pipeline_func: The pipeline function to run. + pipeline_root: The root directory of the pipeline. + project_id: The ID of the project that contains the pipeline. + location: The location of the pipeline. + service_account: The service account to use for the pipeline. + pipeline_parameters: The parameters to pass to the pipeline. + pipeline_parameters_substitutions: A dictionary of substitutions to apply to the pipeline parameters. + enable_caching: Whether to enable caching for the pipeline. + experiment_name: The name of the experiment to create for the pipeline. + job_id: The ID of the pipeline job. + labels: The labels to apply to the pipeline. + credentials: The credentials to use for the pipeline. + encryption_spec_key_name: The encryption key to use for the pipeline. + wait: Whether to wait for the pipeline to complete. + + Returns: + A PipelineJob object. + + Raises: + RuntimeError: If the pipeline execution fails. + """ if pipeline_parameters_substitutions != None: pipeline_parameters = substitute_pipeline_params( @@ -167,6 +326,22 @@ def _extract_schema_from_bigquery( table_name: str, table_schema: str, ) -> list: + """ + Extracts the schema from a BigQuery table or view. + + Args: + project: The ID of the project that contains the table or view. + location: The location of the table or view. + table_name: The name of the table or view. + table_schema: The path to the schema file. + + Returns: + A list of the column names in the table or view. + + Raises: + Exception: If the table or view does not exist. 
+ """ + from google.cloud import bigquery from google.api_core import exceptions try: @@ -186,8 +361,7 @@ def _extract_schema_from_bigquery( schema = [feature['name'] for feature in d] return schema -# Compile Tabular Workflow Training pipelines -# You don't need to define the pipeline elsewhere since the pre-compiled pipeline component is defined in the `automl_tabular_pl_v?.yaml` + def compile_automl_tabular_pipeline( template_path: str, parameters_path: str, @@ -196,76 +370,92 @@ def compile_automl_tabular_pipeline( pipeline_parameters_substitutions: Optional[Dict[str, Any]] = None, exclude_features = List[Any], enable_caching: bool = True) -> tuple: + """ + Compiles an AutoML Tabular Workflows pipeline. You don't need to define the pipeline elsewhere since the pre-compiled pipeline component is defined in the `automl_tabular_pl_v4.yaml` file. + + Args: + template_path: The path to the pipeline template file. + parameters_path: The path to the pipeline parameters file. + pipeline_name: The name of the pipeline. + pipeline_parameters: The parameters to pass to the pipeline. All these possible parameters can be set in the config.yaml.tftpl file, instead of in this file. + additional_experiments: dict + cv_trainer_worker_pool_specs_override: list + data_source_bigquery_table_path: str [Default: ''] + data_source_csv_filenames: str [Default: ''] + dataflow_service_account: str [Default: ''] + dataflow_subnetwork: str [Default: ''] + dataflow_use_public_ips: bool [Default: True] + disable_early_stopping: bool [Default: False] + distill_batch_predict_machine_type: str [Default: 'n1-standard-16'] + distill_batch_predict_max_replica_count: int [Default: 25.0] + distill_batch_predict_starting_replica_count: int [Default: 25.0] + enable_probabilistic_inference: bool [Default: False] + encryption_spec_key_name: str [Default: ''] + evaluation_batch_explain_machine_type: str [Default: 'n1-highmem-8'] + evaluation_batch_explain_max_replica_count: int [Default: 10.0] + evaluation_batch_explain_starting_replica_count: int [Default: 10.0] + evaluation_batch_predict_machine_type: str [Default: 'n1-highmem-8'] + evaluation_batch_predict_max_replica_count: int [Default: 20.0] + evaluation_batch_predict_starting_replica_count: int [Default: 20.0] + evaluation_dataflow_disk_size_gb: int [Default: 50.0] + evaluation_dataflow_machine_type: str [Default: 'n1-standard-4'] + evaluation_dataflow_max_num_workers: int [Default: 100.0] + evaluation_dataflow_starting_num_workers: int [Default: 10.0] + export_additional_model_without_custom_ops: bool [Default: False] + fast_testing: bool [Default: False] + location: str + model_description: str [Default: ''] + model_display_name: str [Default: ''] + optimization_objective: str + optimization_objective_precision_value: float [Default: -1.0] + optimization_objective_recall_value: float [Default: -1.0] + predefined_split_key: str [Default: ''] + prediction_type: str + project: str + quantiles: list + root_dir: str + run_distillation: bool [Default: False] + run_evaluation: bool [Default: False] + stage_1_num_parallel_trials: int [Default: 35.0] + stage_1_tuner_worker_pool_specs_override: list + stage_1_tuning_result_artifact_uri: str [Default: ''] + stage_2_num_parallel_trials: int [Default: 35.0] + stage_2_num_selected_trials: int [Default: 5.0] + stats_and_example_gen_dataflow_disk_size_gb: int [Default: 40.0] + stats_and_example_gen_dataflow_machine_type: str [Default: 'n1-standard-16'] + stats_and_example_gen_dataflow_max_num_workers: int [Default: 25.0] + 
stratified_split_key: str [Default: ''] + study_spec_parameters_override: list + target_column: str + test_fraction: float [Default: -1.0] + timestamp_split_key: str [Default: ''] + train_budget_milli_node_hours: float + training_fraction: float [Default: -1.0] + transform_dataflow_disk_size_gb: int [Default: 40.0] + transform_dataflow_machine_type: str [Default: 'n1-standard-16'] + transform_dataflow_max_num_workers: int [Default: 25.0] + transformations: str + validation_fraction: float [Default: -1.0] + vertex_dataset: system.Artifact + weight_column: str [Default: ''] + pipeline_parameters_substitutions: A dictionary of substitutions to apply to the pipeline parameters. + exclude_features: A list of features to exclude from the pipeline. + enable_caching: Whether to enable caching for the pipeline. + + Returns: + A tuple containing the path to the compiled pipeline template file and the pipeline parameters. + """ from google_cloud_pipeline_components.preview.automl.tabular import utils as automl_tabular_utils + # This checks if there are any substitutions defined in the pipeline_parameters_substitutions dictionary. If so, it applies these substitutions to the pipeline_parameters dictionary. This allows for using placeholders in the pipeline parameters, making the pipeline more flexible and reusable. if pipeline_parameters_substitutions != None: pipeline_parameters = substitute_pipeline_params( pipeline_parameters, pipeline_parameters_substitutions) - """ - additional_experiments: dict -# cv_trainer_worker_pool_specs_override: list -# data_source_bigquery_table_path: str [Default: ''] -# data_source_csv_filenames: str [Default: ''] -# dataflow_service_account: str [Default: ''] -# dataflow_subnetwork: str [Default: ''] -# dataflow_use_public_ips: bool [Default: True] -# disable_early_stopping: bool [Default: False] -# distill_batch_predict_machine_type: str [Default: 'n1-standard-16'] -# distill_batch_predict_max_replica_count: int [Default: 25.0] -# distill_batch_predict_starting_replica_count: int [Default: 25.0] -# enable_probabilistic_inference: bool [Default: False] -# encryption_spec_key_name: str [Default: ''] -# evaluation_batch_explain_machine_type: str [Default: 'n1-highmem-8'] -# evaluation_batch_explain_max_replica_count: int [Default: 10.0] -# evaluation_batch_explain_starting_replica_count: int [Default: 10.0] -# evaluation_batch_predict_machine_type: str [Default: 'n1-highmem-8'] -# evaluation_batch_predict_max_replica_count: int [Default: 20.0] -# evaluation_batch_predict_starting_replica_count: int [Default: 20.0] -# evaluation_dataflow_disk_size_gb: int [Default: 50.0] -# evaluation_dataflow_machine_type: str [Default: 'n1-standard-4'] -# evaluation_dataflow_max_num_workers: int [Default: 100.0] -# evaluation_dataflow_starting_num_workers: int [Default: 10.0] -# export_additional_model_without_custom_ops: bool [Default: False] -# fast_testing: bool [Default: False] -# location: str -# model_description: str [Default: ''] -# model_display_name: str [Default: ''] -# optimization_objective: str -# optimization_objective_precision_value: float [Default: -1.0] -# optimization_objective_recall_value: float [Default: -1.0] -# predefined_split_key: str [Default: ''] -# prediction_type: str -# project: str -# quantiles: list -# root_dir: str -# run_distillation: bool [Default: False] -# run_evaluation: bool [Default: False] -# stage_1_num_parallel_trials: int [Default: 35.0] -# stage_1_tuner_worker_pool_specs_override: list -# stage_1_tuning_result_artifact_uri: str [Default: 
''] -# stage_2_num_parallel_trials: int [Default: 35.0] -# stage_2_num_selected_trials: int [Default: 5.0] -# stats_and_example_gen_dataflow_disk_size_gb: int [Default: 40.0] -# stats_and_example_gen_dataflow_machine_type: str [Default: 'n1-standard-16'] -# stats_and_example_gen_dataflow_max_num_workers: int [Default: 25.0] -# stratified_split_key: str [Default: ''] -# study_spec_parameters_override: list -# target_column: str -# test_fraction: float [Default: -1.0] -# timestamp_split_key: str [Default: ''] -# train_budget_milli_node_hours: float -# training_fraction: float [Default: -1.0] -# transform_dataflow_disk_size_gb: int [Default: 40.0] -# transform_dataflow_machine_type: str [Default: 'n1-standard-16'] -# transform_dataflow_max_num_workers: int [Default: 25.0] -# transformations: str -# validation_fraction: float [Default: -1.0] -# vertex_dataset: system.Artifact -# weight_column: str [Default: ''] - """ - + # This section handles the feature transformations for the pipeline. It checks if there is a + # custom_transformations file specified. If so, it reads the transformations from that file. + # Otherwise, it extracts the schema from the BigQuery table and generates automatic transformations based on the schema. pipeline_parameters['transformations'] = pipeline_parameters['transformations'].format( timestamp=datetime.now().strftime("%Y%m%d%H%M%S")) @@ -299,6 +489,10 @@ def compile_automl_tabular_pipeline( logging.info(f'features:{schema}') + # This section compiles the AutoML Tabular Workflows pipeline. It uses the automl_tabular_utils module to + # generate the pipeline components and parameters. It then loads a pre-compiled pipeline template file + # (automl_tabular_pl_v4.yaml) and hydrates it with the generated parameters. Finally, it writes the + # compiled pipeline template and parameters to the specified files. if pipeline_parameters['predefined_split_key']: pipeline_parameters['training_fraction'] = None pipeline_parameters['validation_fraction'] = None @@ -321,14 +515,6 @@ def compile_automl_tabular_pipeline( _set_enable_caching_value(pipeline_spec=configuration, enable_caching=enable_caching) - # TODO: This params should be set in conf.yaml . However if i do so the validations in - # .get_automl_tabular_pipeline_and_parameters fail as this values are not - # accepted in the given package. (I use a custom pipeline yaml instead of the one in - # the package and that causes the issue.) - # ETA for a fix is 7th of Feb when a new aiplatform sdk will be released. - parameter_values['model_display_name'] = "{}-model".format(pipeline_name) - parameter_values['model_description'] = "{}-model".format(pipeline_name) - # hydrate pipeline.yaml with parameters as default values for k, v in parameter_values.items(): if k in configuration['root']['inputDefinitions']['parameters']: @@ -338,12 +524,9 @@ def compile_automl_tabular_pipeline( with open(template_path, 'w') as yaml_file: yaml.dump(configuration, yaml_file) - with open(parameters_path, 'w') as param_file: yaml.dump(parameter_values, param_file) - # shutil.copy(pathlib.Path(__file__).parent.resolve().joinpath('automl_tabular_p_v2.yaml'), template_path) - return template_path, parameter_values @@ -355,6 +538,24 @@ def upload_pipeline_artefact_registry( repo_name: str, tags: list = None, description: str = None) -> str: + """ + This function uploads a pipeline YAML file to the Artifact Registry. + + Args: + template_path: The path to the pipeline YAML file. + project_id: The ID of the project that contains the pipeline. 
+ region: The location of the pipeline. + repo_name: The name of the repository to upload the pipeline to. + tags: A list of tags to apply to the pipeline. + description: A description of the pipeline. + + Returns: + The name of the uploaded pipeline. + + Raises: + Exception: If an error occurs while uploading the pipeline. + """ + logging.info(f"Uploading pipeline to {region}-kfp.pkg.dev/{project_id}/{repo_name}") host = f"https://{region}-kfp.pkg.dev/{project_id}/{repo_name}" client = RegistryClient(host=host) @@ -372,6 +573,21 @@ def delete_pipeline_artefact_registry( region: str, repo_name: str, package_name: str) -> str: + """ + This function deletes a pipeline from the Artifact Registry. + + Args: + project_id: The ID of the project that contains the pipeline. + region: The location of the pipeline. + repo_name: The name of the repository that contains the pipeline. + package_name: The name of the pipeline to delete. + + Returns: + A string containing the response from the Artifact Registry. + + Raises: + Exception: If an error occurs while deleting the pipeline. + """ host = f"https://{region}-kfp.pkg.dev/{project_id}/{repo_name}" client = RegistryClient(host=host) @@ -382,78 +598,133 @@ def get_gcp_bearer_token() -> str: - # creds.valid is False, and creds.token is None - # Need to refresh credentials to populate those + """ + Retrieves a bearer token for Google Cloud Platform (GCP) authentication. + The default credentials are initially unpopulated (creds.valid is False and creds.token is None), + so they must be refreshed before the token can be read. + + Returns: + A string containing the bearer token. + + Raises: + Exception: If an error occurs while retrieving the bearer token. + """ + + # Get the default credentials for the current environment. creds, project = google.auth.default() + + # Refresh the credentials to ensure they are valid. creds.refresh(google.auth.transport.requests.Request()) - creds.refresh(google.auth.transport.requests.Request()) - return creds.token + + # Extract the bearer token from the refreshed credentials. + bearer_token = creds.token + + # Return the bearer token. + return bearer_token # Function to schedule the pipeline. def schedule_pipeline( project_id: str, region: str, + template_path: str, pipeline_name: str, - pipeline_template_uri: str, pipeline_sa: str, pipeline_root: str, cron: str, max_concurrent_run_count: str, - start_time: str = None, - end_time: str = None) -> dict: + start_time: str, + end_time: str, + pipeline_parameters: Dict[str, Any] = None, + pipeline_parameters_substitutions: Optional[Dict[str, Any]] = None, + ) -> dict: + """ + This function schedules a Vertex AI Pipeline to run on a regular basis. + + Args: + project_id: The ID of the project that contains the pipeline. + region: The location of the pipeline. + pipeline_name: The name of the pipeline to schedule. + template_path: The path to the compiled pipeline template file. + pipeline_sa: The service account to use for the pipeline. + pipeline_root: The root directory of the pipeline. + cron: The cron expression that defines the schedule. + max_concurrent_run_count: The maximum number of concurrent pipeline runs. + start_time: The start time of the schedule. + end_time: The end time of the schedule. + pipeline_parameters: The parameters to pass to the pipeline. + pipeline_parameters_substitutions: A dictionary of substitutions to apply to the pipeline parameters. + + Returns: + The PipelineJob object for which the schedule was created. + + Raises: + Exception: If an error occurs while scheduling the pipeline.
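+ + Example: + An illustrative call only; the project, bucket, service account and schedule values below are placeholders, not values taken from this repository: + + schedule_pipeline( + project_id='my-gcp-project', + region='us-central1', + template_path='gs://my-bucket/pipelines/purchase-propensity-prediction.yaml', + pipeline_name='purchase-propensity-prediction-pl', + pipeline_sa='vertex-pipelines-sa@my-gcp-project.iam.gserviceaccount.com', + pipeline_root='gs://my-bucket/pipeline-root', + cron='0 6 * * *', + max_concurrent_run_count=1, + start_time=None, + end_time=None)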
+ """ - url = f"https://{region}-aiplatform.googleapis.com/v1beta1/projects/{project_id}/locations/{region}/schedules" + from google.cloud import aiplatform + # Substitute pipeline parameters with necessary substitutions + if pipeline_parameters_substitutions != None: + pipeline_parameters = substitute_pipeline_params( + pipeline_parameters, pipeline_parameters_substitutions) + + # Deletes scheduled queries with matching description delete_schedules(project_id, region, pipeline_name) - body = dict( + # Create a PipelineJob object + pipeline_job = aiplatform.PipelineJob( + template_path=template_path, + pipeline_root=pipeline_root, + display_name=f"{pipeline_name}", + ) + + # Create the schedule with the pipeline job defined + pipeline_job_schedule = pipeline_job.create_schedule( display_name=f"{pipeline_name}", cron=cron, max_concurrent_run_count=max_concurrent_run_count, start_time=start_time, end_time=end_time, - create_pipeline_job_request=dict( - parent=f"projects/{project_id}/locations/{region}", - pipelineJob=dict( - displayName=f"{pipeline_name}", - template_uri=pipeline_template_uri, - service_account=pipeline_sa, - runtimeConfig=dict( - gcsOutputDirectory=pipeline_root, - parameterValues=dict() - ) - ) - ) + service_account=pipeline_sa, ) - headers = requests.structures.CaseInsensitiveDict() - headers["Content-Type"] = "application/json" - headers["Authorization"] = "Bearer {}".format(get_gcp_bearer_token()) + logging.info(f"Pipeline scheduled : {pipeline_name}") - resp = requests.post(url=url, json=body, headers=headers) - data = resp.json() # Check the JSON Response Content documentation below - - logging.info(f"scheduler for {pipeline_name} submitted") - return data + return pipeline_job def get_schedules( project_id: str, region: str, pipeline_name: str) -> list: + """ + This function retrieves all schedules associated with a given pipeline name in a specific project and region. + Args: + project_id: The ID of the project that contains the pipeline. + region: The location of the pipeline. + pipeline_name: The name of the pipeline to retrieve schedules for. + + Returns: + A list of the schedules associated with the pipeline. If no schedules are found, returns None. + + Raises: + Exception: If an error occurs while retrieving the schedules. + """ + + # Defines the filter query parameter for the URL request filter = "" if pipeline_name is not None: filter = f"filter=display_name={pipeline_name}" url = f"https://{region}-aiplatform.googleapis.com/v1beta1/projects/{project_id}/locations/{region}/schedules?{filter}" + # Defines the header for the URL request headers = requests.structures.CaseInsensitiveDict() headers["Content-Type"] = "application/json" headers["Authorization"] = "Bearer {}".format(get_gcp_bearer_token()) + # Make the request resp = requests.get(url=url, headers=headers) - data = resp.json() # Check the JSON Response Content documentation below + data = resp.json() # Check the JSON Response Content if "schedules" in data: return data['schedules'] else: @@ -464,22 +735,39 @@ def pause_schedule( project_id: str, region: str, pipeline_name: str) -> list: + """ + This function pauses all schedules associated with a given pipeline name in a specific project and region. + + Args: + project_id: The ID of the project that contains the pipeline. + region: The location of the pipeline. + pipeline_name: The name of the pipeline to pause schedules for. + Returns: + A list of the names of the paused schedules. If no schedules are found, returns None. 
+ + Raises: + Exception: If an error occurs while pausing the schedules. + """ + + # Get the list of schedules for the given pipeline name schedules = get_schedules(project_id, region, pipeline_name) if schedules is None: logging.info(f"No schedules found with display_name {pipeline_name}") return None + # Creating the request header headers = requests.structures.CaseInsensitiveDict() headers["Content-Type"] = "application/json" headers["Authorization"] = "Bearer {}".format(get_gcp_bearer_token()) + # Pause the schedules where the display_name matches paused_schedules = [] for s in schedules: url = f"https://{region}-aiplatform.googleapis.com/v1beta1/{s['name']}:pause" resp = requests.post(url=url, headers=headers) - data = resp.json() # Check the JSON Response Content documentation below + data = resp.json() # Check the JSON Response Content print(resp.status_code == 200) if resp.status_code != 200: raise Exception( @@ -494,22 +782,39 @@ def delete_schedules( project_id: str, region: str, pipeline_name: str) -> list: + """ + This function deletes all schedules associated with a given pipeline name in a specific project and region. + + Args: + project_id: The ID of the project that contains the pipeline. + region: The location of the pipeline. + pipeline_name: The name of the pipeline to delete schedules for. + + Returns: + A list of the names of the deleted schedules. If no schedules are found, returns None. + + Raises: + Exception: If an error occurs while deleting the schedules. + """ + # Get all schedules for the given pipeline name schedules = get_schedules(project_id, region, pipeline_name) if schedules is None: logging.info(f"No schedules found with display_name {pipeline_name}") return None + # Defines the header used in the API request headers = requests.structures.CaseInsensitiveDict() headers["Content-Type"] = "application/json" headers["Authorization"] = "Bearer {}".format(get_gcp_bearer_token()) + # Delete each schedule where the display_name matches deleted_schedules = [] for s in schedules: url = f"https://{region}-aiplatform.googleapis.com/v1beta1/{s['name']}" resp = requests.delete(url=url, headers=headers) - data = resp.json() # Check the JSON Response Content documentation below + data = resp.json() # Check the JSON Response Content logging.info(f"scheduled resourse {s['name']} deleted") deleted_schedules.append(s['name']) @@ -533,22 +838,44 @@ def run_pipeline( encryption_spec_key_name: Optional[str] = None, wait: bool = False, ) -> PipelineJob: + + """ + Runs a Vertex AI Pipeline. + This function provides a convenient way to run a Vertex AI Pipeline. It takes care of creating the PipelineJob object, + submitting the pipeline, and waiting for completion (if desired). It also allows for substituting placeholders in the + pipeline parameters, making the pipeline more flexible and reusable. + + Args: + pipeline_root: The root directory of the pipeline. + template_path: The path to the pipeline template file. + project_id: The ID of the project that contains the pipeline. + location: The location of the pipeline. + service_account: The service account to use for the pipeline. + pipeline_parameters: The parameters to pass to the pipeline. + pipeline_parameters_substitutions: A dictionary of substitutions to apply to the pipeline parameters. + enable_caching: Whether to enable caching for the pipeline. + experiment_name: The name of the experiment to create for the pipeline. + job_id: The ID of the pipeline job. + failure_policy: The failure policy for the pipeline. 
+ labels: The labels to apply to the pipeline. + credentials: The credentials to use for the pipeline. + encryption_spec_key_name: The encryption key to use for the pipeline. + wait: Whether to wait for the pipeline to complete. + + Returns: + A PipelineJob object. + """ + # Substitute placeholders in the pipeline_parameters dictionary with values from the pipeline_parameters_substitutions dictionary. + # This is useful for making the pipeline more flexible and reusable, as the same pipeline can be used with different parameter + # values by simply providing a different pipeline_parameters_substitutions dictionary. if pipeline_parameters_substitutions != None: pipeline_parameters = substitute_pipeline_params( pipeline_parameters, pipeline_parameters_substitutions) logging.info(f"Pipeline parameters : {pipeline_parameters}") - # Create Vertex Dataset - #vertex_datasets_uri = create_dataset( - # display_name=pipeline_parameters['vertex_dataset_display_name'], - # bigquery_source=pipeline_parameters['data_source_bigquery_table_path'], - # project_id=pipeline_parameters['project']) - # - #input_artifacts: Dict[str, str] = {} - #input_artifacts['vertex_datasets'] = vertex_datasets_uri - + # Creates a PipelineJob object with the provided arguments. pl = PipelineJob( display_name='na', # not needed and will be optional in next major release template_path=template_path, @@ -558,51 +885,20 @@ def run_pipeline( project=project_id, location=location, parameter_values=pipeline_parameters, - #input_artifacts=input_artifacts, encryption_spec_key_name=encryption_spec_key_name, credentials=credentials, failure_policy=failure_policy, labels=labels) + # Submits the pipeline to Vertex AI pl.submit(service_account=service_account, experiment=experiment_name) + + logging.info(f"Pipeline submitted") + + # Waits for the pipeline to complete. if (wait): pl.wait() if (pl.has_failed): raise RuntimeError("Pipeline execution failed") return pl - - -#def create_dataset( -# display_name: str, -# bigquery_source: str, -# project_id: str, -# location: str = "us-central1", -# credentials: Optional[credentials.Credentials] = None, -# sync: bool = True, -# create_request_timeout: Optional[float] = None, -# ) -> str: -# -# #bigquery_source in this format "bq://.purchase_propensity.v_purchase_propensity_training_30_15" -# #dataset = TabularDataset.create( -# # display_name=display_name, -# # bq_source=[bigquery_source], -# # project=project_id, -# # location=location, -# # credentials=credentials, -# # sync=sync, -# # create_request_timeout=create_request_timeout) -# #dataset.wait() -# -# artifact = Artifact.create( -# schema_title="system.Dataset", -# uri=bigquery_source, -# display_name=display_name, -# project=project_id, -# location=location, -# ) -# artifact.wait() -# -# # Should be: 7104764862735056896 -# # Cannot use full resource name of format: projects/294348452381/locations/us-central1/datasets/7104764862735056896 -# return artifact.resource_id \ No newline at end of file diff --git a/python/pipelines/scheduler.py b/python/pipelines/scheduler.py index a2d02de9..2067a5e1 100644 --- a/python/pipelines/scheduler.py +++ b/python/pipelines/scheduler.py @@ -20,7 +20,7 @@ from pipelines.pipeline_ops import pause_schedule, schedule_pipeline, delete_schedules - +# Ensures that the provided file path is a valid YAML file. 
def check_extention(file_path: str, type: str = '.yaml'): if os.path.exists(file_path): if not file_path.lower().endswith(type): @@ -51,6 +51,16 @@ def check_extention(file_path: str, type: str = '.yaml'): } # key should match pipeline names as in the config.yaml files for automatic compilation if __name__ == "__main__": + """ + This Python code defines a script for scheduling and deleting Vertex AI pipelines. It uses the pipelines_list dictionary + to map pipeline names to their corresponding module and function names. this script provides a convenient way to schedule + and delete Vertex AI pipelines schedules from the command line. + The script takes the following arguments: + -c: Path to the configuration YAML file. + -p: Pipeline key name as it is in the config.yaml file. + -i: The compiled pipeline input filename. + -d: (Optional) Flag to delete the scheduled pipeline. + """ logging.basicConfig(level=logging.INFO) parser = ArgumentParser() @@ -66,6 +76,10 @@ def check_extention(file_path: str, type: str = '.yaml'): choices=list(pipelines_list.keys()), help='Pipeline key name as it is in config.yaml') + parser.add_argument("-i", '--input-file', + dest="input", + required=True, + help='the compiled pipeline input filename') parser.add_argument("-d", '--delete', dest="delete", @@ -75,6 +89,10 @@ def check_extention(file_path: str, type: str = '.yaml'): args = parser.parse_args() + + # Reads the configuration YAML file and extracts the relevant parameters for the pipeline + # and the artifact registry. It then checks if the pipeline name is valid and retrieves + # the corresponding module and function name from the pipelines_list dictionary. repo_params = {} with open(args.config, encoding='utf-8') as fh: params = yaml.full_load(fh) @@ -93,28 +111,32 @@ def check_extention(file_path: str, type: str = '.yaml'): template_artifact_uri = f"https://{repo_params['region']}-kfp.pkg.dev/{repo_params['project_id']}/{repo_params['name']}/{my_pipeline_vars['name']}/latest" if args.delete: + # If the -d flag is set, the script calls the delete_schedules function to delete the + # scheduled pipeline. logging.info(f"Deleting scheduler for {args.pipeline}") delete_schedules(project_id=generic_pipeline_vars['project_id'], - region=generic_pipeline_vars['region'], - pipeline_name=my_pipeline_vars['name']) + region=generic_pipeline_vars['region'], + pipeline_name=my_pipeline_vars['name']) else: logging.info(f"Creating scheduler for {args.pipeline}") + # Creates a new schedule for the pipeline and returns the schedule object. + # If the schedule is successfully created, the script checks if the pipeline is supposed + # to be paused and calls the pause_schedule function to pause it. 
schedule = schedule_pipeline( - project_id=generic_pipeline_vars['project_id'], - region=generic_pipeline_vars['region'], - pipeline_name=my_pipeline_vars['name'], - pipeline_template_uri=template_artifact_uri, - pipeline_sa=generic_pipeline_vars['service_account'], - pipeline_root=generic_pipeline_vars['root_path'], - cron=my_pipeline_vars['schedule']['cron'], - max_concurrent_run_count=my_pipeline_vars['schedule']['max_concurrent_run_count'], - start_time=my_pipeline_vars['schedule']['start_time'], - end_time=my_pipeline_vars['schedule']['end_time'] + project_id=generic_pipeline_vars['project_id'], + region=generic_pipeline_vars['region'], + template_path = args.input, + pipeline_parameters=my_pipeline_vars['pipeline_parameters'], + pipeline_parameters_substitutions= my_pipeline_vars['pipeline_parameters_substitutions'], + pipeline_sa=generic_pipeline_vars['service_account'], + pipeline_name=my_pipeline_vars['name'], + pipeline_root=generic_pipeline_vars['root_path'], + cron=my_pipeline_vars['schedule']['cron'], + max_concurrent_run_count=my_pipeline_vars['schedule']['max_concurrent_run_count'], + start_time=my_pipeline_vars['schedule']['start_time'], + end_time=my_pipeline_vars['schedule']['end_time'] ) - if 'state' not in schedule or schedule['state'] != 'ACTIVE': - raise Exception(f"Scheduling pipeline failed {schedule}") - if my_pipeline_vars['schedule']['state'] == 'PAUSED': logging.info(f"Pausing scheduler for {args.pipeline}") pause_schedule( diff --git a/python/pipelines/segmentation_pipelines.py b/python/pipelines/segmentation_pipelines.py index c20ae8a1..21024638 100644 --- a/python/pipelines/segmentation_pipelines.py +++ b/python/pipelines/segmentation_pipelines.py @@ -18,20 +18,9 @@ from pipelines.components.bigquery.component import ( bq_select_best_kmeans_model, bq_clustering_predictions, - bq_flatten_kmeans_prediction_table, bq_evaluate, bq_stored_procedure_exec) + bq_flatten_kmeans_prediction_table, bq_evaluate) from pipelines.components.pubsub.component import send_pubsub_activation_msg -from google_cloud_pipeline_components.types import artifact_types -from google_cloud_pipeline_components.v1.bigquery import ( - BigqueryCreateModelJobOp, BigqueryEvaluateModelJobOp, - BigqueryExportModelJobOp, BigqueryPredictModelJobOp, - BigqueryQueryJobOp) - -from google_cloud_pipeline_components.v1.endpoint import (EndpointCreateOp, - ModelDeployOp) -from google_cloud_pipeline_components.v1.model import ModelUploadOp -from kfp.components.importer_node import importer - from pipelines.components.bigquery.component import ( bq_clustering_exec) @@ -62,6 +51,26 @@ def training_pl( ): + """ + This function defines the Vertex AI Pipeline for Audience Segmentation Training. + + Args: + project_id (str): The Google Cloud project ID. + location (str): The Google Cloud region where the pipeline will be deployed. + model_dataset_id (str): The BigQuery dataset ID where the model will be stored. + model_name_bq_prefix (str): The prefix for the BQML model name. + vertex_model_name (str): The name of the Vertex AI model. + training_data_bq_table (str): The BigQuery table containing the training data. + exclude_features (list): A list of features to exclude from the training data. + km_num_clusters (int): The number of clusters to use for training. + km_init_method (str): The initialization method to use for training. + km_distance_type (str): The distance type to use for training. + km_standardize_features (str): Whether to standardize the features before training. 
+ km_max_interations (int): The maximum number of iterations to train for. + km_early_stop (str): Whether to use early stopping during training. + km_min_rel_progress (float): The minimum relative progress required for early stopping. + km_warm_start (str): Whether to use warm start during training. + """ # Train BQML clustering model and uploads to Vertex AI Model Registry bq_model = bq_clustering_exec( @@ -108,6 +117,22 @@ def prediction_pl( pubsub_activation_topic: str, pubsub_activation_type: str ): + """ + This function defines the Vertex AI Pipeline for Audience Segmentation Prediction. + + Args: + project_id (str): The Google Cloud project ID. + location (Optional[str]): The Google Cloud region where the pipeline will be deployed. + model_dataset_id (str): The BigQuery dataset ID where the model is stored. + model_name_bq_prefix (str): The prefix for the BQML model name. + model_metric_name (str): The metric name to use for model selection. + model_metric_threshold (float): The metric threshold to use for model selection. + number_of_models_considered (int): The number of models to consider for selection. + bigquery_source (str): The BigQuery table containing the prediction data. + bigquery_destination_prefix (str): The prefix for the BigQuery table where the predictions will be stored. + pubsub_activation_topic (str): The Pub/Sub topic to send the activation message to. + pubsub_activation_type (str): The type of activation message to send. + """ # Get the best candidate model according to the parameters. purchase_propensity_label = bq_select_best_kmeans_model( diff --git a/python/pipelines/tabular_pipelines.py b/python/pipelines/tabular_pipelines.py index 7bdcffdb..7cfb33c8 100644 --- a/python/pipelines/tabular_pipelines.py +++ b/python/pipelines/tabular_pipelines.py @@ -14,7 +14,6 @@ from typing import Optional import kfp as kfp -import kfp.components as components import kfp.dsl as dsl from pipelines.components.vertex.component import elect_best_tabular_model, \ batch_prediction, \ @@ -23,7 +22,6 @@ from pipelines.components.bigquery.component import bq_flatten_tabular_binary_prediction_table, \ bq_flatten_tabular_regression_table, \ bq_union_predictions_tables, \ - bq_stored_procedure_exec, \ write_tabular_model_explanation_to_bigquery from pipelines.components.pubsub.component import send_pubsub_activation_msg @@ -39,8 +37,6 @@ def prediction_binary_classification_pl( model_metric_name: str, model_metric_threshold: float, number_of_models_considered: int, - - pubsub_activation_topic: str, pubsub_activation_type: str, bigquery_source: str, @@ -53,11 +49,36 @@ def prediction_binary_classification_pl( accelerator_count: int = 0, accelerator_type: str = None, generate_explanation: bool = False, - threashold: float = 0.5, positive_label: str = 'true', ): - + """ + This function defines a KFP pipeline for binary classification prediction pipeline using an AutoML Tabular Workflow Model. + + Args: + project_id: The Google Cloud project ID. + location: The Google Cloud region where the pipeline will be deployed. + model_display_name: The name of the Tabular Workflow Model to be used for prediction. + model_metric_name: The name of the metric used to select the best model. + model_metric_threshold: The threshold value for the metric used to select the best model. + number_of_models_considered: The number of models to consider when selecting the best model. + pubsub_activation_topic: The name of the Pub/Sub topic to send activation messages to. 
+ pubsub_activation_type: The type of activation message to send. + bigquery_source: The BigQuery table containing the data to be predicted. + bigquery_destination_prefix: The prefix for the BigQuery table where the predictions will be stored. + bq_unique_key: The name of the column in the BigQuery table that uniquely identifies each row. + job_name_prefix: The prefix for the Vertex AI Batch Prediction job name. + machine_type: The machine type to use for the Vertex AI Batch Prediction job. + max_replica_count: The maximum number of replicas to use for the Vertex AI Batch Prediction job. + batch_size: The batch size to use for the Vertex AI Batch Prediction job. + accelerator_count: The number of accelerators to use for the Vertex AI Batch Prediction job. + accelerator_type: The type of accelerators to use for the Vertex AI Batch Prediction job. + generate_explanation: Whether to generate explanations for the predictions. + threashold: The threshold value used to convert the predicted probabilities into binary labels. + positive_label: The label to assign to predictions with a probability greater than or equal to the threshold. + """ + + # Elect best model based on a metric and a threshold purchase_propensity_label = elect_best_tabular_model( project=project_id, location=location, @@ -67,6 +88,7 @@ def prediction_binary_classification_pl( number_of_models_considered=number_of_models_considered, ).set_display_name('elect_best_model') + # Submits a Vertex AI Batch prediction job predictions = batch_prediction( bigquery_source=bigquery_source, bigquery_destination_prefix=bigquery_destination_prefix, @@ -80,6 +102,7 @@ def prediction_binary_classification_pl( generate_explanation=generate_explanation ) + # Flattens prediction table in BigQuery flatten_predictions = bq_flatten_tabular_binary_prediction_table( project_id=project_id, location=location, @@ -90,6 +113,7 @@ def prediction_binary_classification_pl( positive_label=positive_label ) + # Sends pubsub message for activation send_pubsub_activation_msg( project=project_id, topic_name=pubsub_activation_topic, @@ -108,15 +132,11 @@ def prediction_regression_pl( model_metric_name: str, model_metric_threshold: float, number_of_models_considered: int, - - pubsub_activation_topic: str, pubsub_activation_type: str, - bigquery_source: str, bigquery_destination_prefix: str, bq_unique_key: str, - job_name_prefix: str, machine_type: str = "n1-standard-4", max_replica_count: int = 10, @@ -125,7 +145,31 @@ def prediction_regression_pl( accelerator_type: str = None, generate_explanation: bool = False ): - + """ + This function defines a KFP pipeline for regression prediction pipeline using an AutoML Tabular Workflow Model. + + Args: + project_id: The Google Cloud project ID. + location: The Google Cloud region where the pipeline will be deployed. + model_display_name: The name of the Tabular Workflow Model to be used for prediction. + model_metric_name: The name of the metric used to select the best model. + model_metric_threshold: The threshold value for the metric used to select the best model. + number_of_models_considered: The number of models to consider when selecting the best model. + pubsub_activation_topic: The name of the Pub/Sub topic to send activation messages to. + pubsub_activation_type: The type of activation message to send. + bigquery_source: The BigQuery table containing the data to be predicted. + bigquery_destination_prefix: The prefix for the BigQuery table where the predictions will be stored. 
+ bq_unique_key: The name of the column in the BigQuery table that uniquely identifies each row. + job_name_prefix: The prefix for the Vertex AI Batch Prediction job name. + machine_type: The machine type to use for the Vertex AI Batch Prediction job. + max_replica_count: The maximum number of replicas to use for the Vertex AI Batch Prediction job. + batch_size: The batch size to use for the Vertex AI Batch Prediction job. + accelerator_count: The number of accelerators to use for the Vertex AI Batch Prediction job. + accelerator_type: The type of accelerators to use for the Vertex AI Batch Prediction job. + generate_explanation: Whether to generate explanations for the predictions. + """ + + # Elect best model based on a metric and a threshold customer_lifetime_value_model = elect_best_tabular_model( project=project_id, location=location, @@ -135,6 +179,7 @@ def prediction_regression_pl( number_of_models_considered=number_of_models_considered, ).set_display_name('elect_best_clv_model') + # Submits a Vertex AI Batch prediction job predictions = batch_prediction( bigquery_source=bigquery_source, bigquery_destination_prefix=bigquery_destination_prefix, @@ -148,6 +193,7 @@ def prediction_regression_pl( generate_explanation=generate_explanation ) + # Flattens prediction table in BigQuery flatten_predictions = bq_flatten_tabular_regression_table( project_id=project_id, location=location, @@ -156,6 +202,7 @@ def prediction_regression_pl( bq_unique_key=bq_unique_key ) + # Sends pubsub message for activation send_pubsub_activation_msg( project=project_id, topic_name=pubsub_activation_topic, @@ -168,41 +215,68 @@ def prediction_regression_pl( def prediction_binary_classification_regression_pl( project_id: str, location: Optional[str], - purchase_bigquery_source: str, purchase_bigquery_destination_prefix: str, purchase_bq_unique_key: str, purchase_job_name_prefix: str, - clv_bigquery_source: str, clv_bigquery_destination_prefix: str, clv_bq_unique_key: str, clv_job_name_prefix: str, - purchase_model_display_name: str, purchase_model_metric_name: str, purchase_model_metric_threshold: float, number_of_purchase_models_considered: int, - clv_model_display_name: str, clv_model_metric_name: str, clv_model_metric_threshold: float, number_of_clv_models_considered: int, - pubsub_activation_topic: str, pubsub_activation_type: str, - machine_type: str = "n1-standard-4", max_replica_count: int = 10, batch_size: int = 64, accelerator_count: int = 0, accelerator_type: str = None, generate_explanation: bool = False, - threashold: float = 0.5, positive_label: str = 'true', ): - + """ + This function defines a KFP pipeline for a combined binary classification and regression prediction pipeline using AutoML Tabular Workflow Models. + + Args: + project_id: The Google Cloud project ID. + location: The Google Cloud region where the pipeline will be deployed. + purchase_bigquery_source: The BigQuery table containing the data to be predicted for purchase propensity. + purchase_bigquery_destination_prefix: The prefix for the BigQuery table where the purchase propensity predictions will be stored. + purchase_bq_unique_key: The name of the column in the BigQuery table that uniquely identifies each row for purchase propensity. + purchase_job_name_prefix: The prefix for the Vertex AI Batch Prediction job name for purchase propensity. + clv_bigquery_source: The BigQuery table containing the data to be predicted for customer lifetime value. 
+ clv_bigquery_destination_prefix: The prefix for the BigQuery table where the customer lifetime value predictions will be stored. + clv_bq_unique_key: The name of the column in the BigQuery table that uniquely identifies each row for customer lifetime value. + clv_job_name_prefix: The prefix for the Vertex AI Batch Prediction job name for customer lifetime value. + purchase_model_display_name: The name of the Tabular Workflow Model to be used for purchase propensity prediction. + purchase_model_metric_name: The name of the metric used to select the best model for purchase propensity. + purchase_model_metric_threshold: The threshold value for the metric used to select the best model for purchase propensity. + number_of_purchase_models_considered: The number of models to consider when selecting the best model for purchase propensity. + clv_model_display_name: The name of the Tabular Workflow Model to be used for customer lifetime value prediction. + clv_model_metric_name: The name of the metric used to select the best model for customer lifetime value. + clv_model_metric_threshold: The threshold value for the metric used to select the best model for customer lifetime value. + number_of_clv_models_considered: The number of models to consider when selecting the best model for customer lifetime value. + pubsub_activation_topic: The name of the Pub/Sub topic to send activation messages to. + pubsub_activation_type: The type of activation message to send. + machine_type: The machine type to use for the Vertex AI Batch Prediction job. + max_replica_count: The maximum number of replicas to use for the Vertex AI Batch Prediction job. + batch_size: The batch size to use for the Vertex AI Batch Prediction job. + accelerator_count: The number of accelerators to use for the Vertex AI Batch Prediction job. + accelerator_type: The type of accelerators to use for the Vertex AI Batch Prediction job. + generate_explanation: Whether to generate explanations for the predictions. + threashold: The threshold value used to convert the predicted probabilities into binary labels for purchase propensity. + positive_label: The label to assign to predictions with a probability greater than or equal to the threshold for purchase propensity. 
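+ + Example: + Pipeline functions such as this one are not called directly; they are compiled into a YAML template and then run or scheduled through the helpers in pipeline_ops. A minimal, illustrative compilation (the output file name is a placeholder): + + from kfp import compiler + compiler.Compiler().compile( + pipeline_func=prediction_binary_classification_regression_pl, + package_path='purchase_propensity_clv_prediction.yaml')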
+ """ + + # Elects the best purchase propensity model based on a metric and a threshold purchase_propensity_best_model = elect_best_tabular_model( project=project_id, location=location, @@ -212,6 +286,7 @@ def prediction_binary_classification_regression_pl( number_of_models_considered=number_of_purchase_models_considered, ).set_display_name('elect_best_purchase_propensity_model') + # Submits a Vertex AI Batch Prediction job for purchase propensity propensity_predictions = batch_prediction( bigquery_source=purchase_bigquery_source, bigquery_destination_prefix=purchase_bigquery_destination_prefix, @@ -225,6 +300,7 @@ def prediction_binary_classification_regression_pl( generate_explanation=generate_explanation ).set_display_name('propensity_predictions') + # Elects the best customer lifetime value regression model based on a metric and a threshold customer_lifetime_value_model = elect_best_tabular_model( project=project_id, location=location, @@ -234,6 +310,7 @@ def prediction_binary_classification_regression_pl( number_of_models_considered=number_of_clv_models_considered, ).set_display_name('elect_best_clv_model') + # Submits a Vertex AI Batch Prediction job for customer lifetime value clv_predictions = batch_prediction( bigquery_source=clv_bigquery_source, bigquery_destination_prefix=clv_bigquery_destination_prefix, @@ -247,6 +324,7 @@ def prediction_binary_classification_regression_pl( generate_explanation=generate_explanation ).set_display_name('clv_predictions') + # Flattens the prediction table for the customer lifetime value model clv_flatten_predictions = bq_flatten_tabular_regression_table( project_id=project_id, location=location, @@ -255,6 +333,7 @@ def prediction_binary_classification_regression_pl( bq_unique_key=clv_bq_unique_key ).set_display_name('clv_flatten_predictions') + # Union the two predicitons tables: the flatenned clv predictions and the purchase propensity predictions union_predictions = bq_union_predictions_tables( project_id=project_id, location=location, @@ -265,6 +344,7 @@ def prediction_binary_classification_regression_pl( threashold=threashold ).set_display_name('union_predictions') + # Sends pubsub message for activation send_pubsub_activation_msg( project=project_id, topic_name=pubsub_activation_topic, @@ -286,7 +366,22 @@ def explanation_tabular_workflow_regression_pl( number_of_models_considered: int, bigquery_destination_prefix: str, ): - #TODO: Implement the explanation pipeline for the value based bidding model + """ + This function defines a KFP pipeline for a Explanation pipeline that uses a Tabular Workflow Model. + This is a Explanation Pipeline Definition that will output the Feature Attribution + + Args: + project: The Google Cloud project ID. + location: The Google Cloud region where the pipeline will be deployed. + data_location: The location of the data to be used for explanation. + model_display_name: The name of the Tabular Workflow Model to be used for explanation. + model_metric_name: The name of the metric used to select the best model. + model_metric_threshold: The threshold value for the metric used to select the best model. + number_of_models_considered: The number of models to consider when selecting the best model. + bigquery_destination_prefix: The prefix for the BigQuery table where the explanation will be stored. 
+ """ + + # Elect best model based on a metric and a threshold value_based_bidding_model = elect_best_tabular_model( project=project, location=location, @@ -296,12 +391,14 @@ def explanation_tabular_workflow_regression_pl( number_of_models_considered=number_of_models_considered, ).set_display_name('elect_best_vbb_model') + # Get the model explanation value_based_bidding_model_explanation = get_tabular_model_explanation( project=project, location=location, model=value_based_bidding_model.outputs['elected_model'], ).set_display_name('get_vbb_model_explanation') + # Write the model explanation to BigQuery value_based_bidding_flatten_explanation = write_tabular_model_explanation_to_bigquery( project=project, location=location, diff --git a/python/pipelines/uploader.py b/python/pipelines/uploader.py index a2c26d88..3d165e6e 100644 --- a/python/pipelines/uploader.py +++ b/python/pipelines/uploader.py @@ -16,6 +16,8 @@ from pipelines.pipeline_ops import upload_pipeline_artefact_registry from argparse import ArgumentParser, ArgumentTypeError + +# Checks if a file exists and has the correct extension (.yaml by default). def check_extention(file_path: str, type: str = '.yaml'): if os.path.exists(file_path): if not file_path.lower().endswith(type): @@ -23,8 +25,16 @@ def check_extention(file_path: str, type: str = '.yaml'): else: raise FileNotFoundError(f"{file_path} does not exist") return file_path - + + if __name__ == "__main__": + """ + This Python script defines a command-line tool for uploading compiled Vertex AI pipelines to Artifact Registry. It takes the following arguments: + -c: Path to the configuration YAML file (e.g., dev.yaml or prod.yaml). This file contains information about the Artifact Registry repository where the pipeline will be uploaded. + -f: Path to the compiled pipeline YAML file. This file contains the pipeline definition. + -d: (Optional) Description of the pipeline artifact. + -t: (Optional) List of tags for the pipeline artifact. + """ logging.basicConfig(level=logging.INFO) parser = ArgumentParser() @@ -57,9 +67,13 @@ def check_extention(file_path: str, type: str = '.yaml'): args = parser.parse_args() repo_params={} + # Opens the configuration YAML file and extracts the parameters for the + # Artifact Registry repository. with open(args.config, encoding='utf-8') as fh: repo_params = yaml.full_load(fh)['artifact_registry']['pipelines_repo'] + # Calls the upload_pipeline_artefact_registry function from pipelines.pipeline_ops to + # upload the compiled pipeline to the specified Artifact Registry repository. upload_pipeline_artefact_registry( template_path=args.filename, project_id=repo_params['project_id'], diff --git a/renovate.json b/renovate.json deleted file mode 100644 index e69de29b..00000000 diff --git a/sql/procedure/aggregate_predictions_procedure.sqlx b/sql/procedure/aggregate_predictions_procedure.sqlx index cf874bf1..e34f7268 100644 --- a/sql/procedure/aggregate_predictions_procedure.sqlx +++ b/sql/procedure/aggregate_predictions_procedure.sqlx @@ -13,14 +13,14 @@ -- limitations under the License. -- Setting procedure to lookback from the day before `inference_date` -# This procedure aggregates predictions from multiple BigQuery tables into a single table. -# It can be breakdown in 6 steps: -# 1. Declare Variables: The code declares several variables that will be used throughout the procedure. -# 2. Define Helper Functions: The code defines several helper functions that will be used in the procedure -# 3. 
Set Variable Values: The code sets the values of the declared variables using the helper functions and other expressions. -# 4. Create Temporary Tables: The code creates several temporary tables that will be used to store intermediate results. -# 5. Execute Queries: The code executes several SQL queries to aggregate the predictions from the different BigQuery tables. -# 6. Create Final Table: The code creates a final BigQuery table that contains the aggregated predictions. +-- This procedure aggregates predictions from multiple BigQuery tables into a single table. +-- It can be breakdown in 6 steps: +-- 1. Declare Variables: The code declares several variables that will be used throughout the procedure. +-- 2. Define Helper Functions: The code defines several helper functions that will be used in the procedure +-- 3. Set Variable Values: The code sets the values of the declared variables using the helper functions and other expressions. +-- 4. Create Temporary Tables: The code creates several temporary tables that will be used to store intermediate results. +-- 5. Execute Queries: The code executes several SQL queries to aggregate the predictions from the different BigQuery tables. +-- 6. Create Final Table: The code creates a final BigQuery table that contains the aggregated predictions. DECLARE project_id, table_pattern, # A pattern used to identify the BigQuery tables that contain the predictions. @@ -67,7 +67,7 @@ DECLARE second_join_select_columns ARRAY; SET project_id = '{{project_id}}'; -# A procedure that retrieves the column names for a specified BigQuery table. +-- A procedure that retrieves the column names for a specified BigQuery table. CREATE OR REPLACE PROCEDURE {{dataset_id}}.get_columns_for_table(table_name STRING, data_set STRING, @@ -85,7 +85,7 @@ SELECT ARRAY_AGG(column_name) """, data_set, table_name_only) INTO table_columns; END ; -# A procedure that retrieves the name of the latest BigQuery table that matches a specified pattern. +-- A procedure that retrieves the name of the latest BigQuery table that matches a specified pattern. CREATE OR REPLACE PROCEDURE {{dataset_id}}.get_latest_table_by_pattern(dataset_name STRING, table_pattern STRING, @@ -113,7 +113,7 @@ SET temp_table ); END ; -# A function that returns the difference between two arrays. +-- A function that returns the difference between two arrays. CREATE TEMP FUNCTION array_diff(src_array ARRAY, rm_array ARRAY) @@ -128,7 +128,7 @@ CREATE TEMP FUNCTION element FROM UNNEST(rm_array) AS element ) )); -# A function that returns the common elements between two arrays. +-- A function that returns the common elements between two arrays. CREATE TEMP FUNCTION array_common(arr_one ARRAY, arr_two ARRAY) AS (( @@ -140,7 +140,7 @@ CREATE TEMP FUNCTION UNNEST(arr_one) AS element WHERE element IN UNNEST(arr_two) ) )); -# A function that creates a SQL expression for selecting common columns from two tables. +-- A function that creates a SQL expression for selecting common columns from two tables. CREATE TEMP FUNCTION create_common_columns_select(common_columns ARRAY, f_alias STRING, @@ -154,7 +154,7 @@ CREATE TEMP FUNCTION CONCAT('COALESCE(',f_alias, '.', element, ',', s_alias,'.', element,') AS ', element) FROM UNNEST(common_columns) AS element) ), ',') )); -# A function that creates a SQL expression for selecting columns from a single table. +-- A function that creates a SQL expression for selecting columns from a single table. 
CREATE TEMP FUNCTION create_columns_select(COLUMNS ARRAY, t_alias STRING) @@ -356,13 +356,14 @@ SET third_query_str = FORMAT(""" CREATE OR REPLACE TABLE `%s.{{dataset_id}}.{{table_id}}` AS SELECT +e.user_pseudo_id, %s, f.feature_timestamp AS auto_segment_processed_timestamp, f.prediction AS Auto_Segment_ID, %s FROM temp2 AS e full outer join `%s` AS f -ON e.user_pseudo_id=f.user_id; +ON e.user_pseudo_id=f.user_pseudo_id; """, project_id, second_join_selections, auto_audience_segmentation_selections, auto_audience_segmentation_table); EXECUTE IMMEDIATE third_query_str; \ No newline at end of file diff --git a/sql/procedure/aggregated_value_based_bidding_explanation_preparation.sqlx b/sql/procedure/aggregated_value_based_bidding_explanation_preparation.sqlx index c282c340..ebf8f2ad 100644 --- a/sql/procedure/aggregated_value_based_bidding_explanation_preparation.sqlx +++ b/sql/procedure/aggregated_value_based_bidding_explanation_preparation.sqlx @@ -19,9 +19,9 @@ DECLARE end_date DATE DEFAULT NULL; DECLARE max_date DATE; DECLARE min_date DATE; -# explain_start_date, explain_end_date: Variables to store the start and end dates for the explanation period. -# start_date, end_date: Variables to store the start and end dates specified by the user. -# max_date, min_date: Variables to store the maximum and minimum dates in the aggregated_vbb table. +-- explain_start_date, explain_end_date: Variables to store the start and end dates for the explanation period. +-- start_date, end_date: Variables to store the start and end dates specified by the user. +-- max_date, min_date: Variables to store the maximum and minimum dates in the aggregated_vbb table. SET max_date = (SELECT MAX(Dt) FROM `{{mds_project_id}}.{{mds_dataset}}.aggregated_vbb`); SET min_date = (SELECT MIN(Dt) FROM `{{mds_project_id}}.{{mds_dataset}}.aggregated_vbb`); SET explain_start_date = min_date; @@ -29,8 +29,8 @@ SET explain_end_date = max_date; SET start_date = PARSE_DATE("%Y-%m-%d", {{start_date}}); SET end_date = PARSE_DATE("%Y-%m-%d", {{end_date}}); -# Validate User-Specified Dates: The code checks if the user-specified start and end dates are valid and within the -# range of dates in the aggregated_vbb table. If either date is invalid or out of range, it is adjusted to the nearest valid date. +-- Validate User-Specified Dates: The code checks if the user-specified start and end dates are valid and within the +-- range of dates in the aggregated_vbb table. If either date is invalid or out of range, it is adjusted to the nearest valid date. 
IF start_date IS NULL OR start_date < min_date OR start_date > max_date OR start_date > end_date THEN SET explain_start_date = min_date; ELSE @@ -43,7 +43,7 @@ ELSE SET explain_end_date = end_date; END IF; -# Volume of conversions actions table to be used for reporting +-- Volume of conversions actions table to be used for reporting CREATE OR REPLACE TABLE `{{project_id}}.{{dataset}}.{{volume_table_name}}` AS SELECT DISTINCT @@ -62,7 +62,7 @@ FROM WHERE {{datetime_column}} BETWEEN explain_start_date AND explain_end_date ; -# Daily aggregated volume of conversions actions table to be used for reporting +-- Daily aggregated volume of conversions actions table to be used for reporting CREATE OR REPLACE TABLE `{{project_id}}.{{dataset}}.{{daily_volume_view_name}}` AS SELECT DISTINCT @@ -81,7 +81,7 @@ FROM WHERE {{datetime_column}} BETWEEN explain_start_date AND explain_end_date ; -# Weekly aggregated volume of conversions actions table to be used for reporting +-- Weekly aggregated volume of conversions actions table to be used for reporting CREATE OR REPLACE TABLE `{{project_id}}.{{dataset}}.{{weekly_volume_view_name}}` AS SELECT DISTINCT @@ -114,7 +114,7 @@ WHERE {{datetime_column}} BETWEEN explain_start_date AND explain_end_date ) ; -# Correlation between purchase and other conversion actions table to be used for reporting +-- Correlation between purchase and other conversion actions table to be used for reporting CREATE OR REPLACE TABLE `{{project_id}}.{{dataset}}.{{corr_table_name}}` AS SELECT DISTINCT diff --git a/sql/procedure/aggregated_value_based_bidding_training_preparation.sqlx b/sql/procedure/aggregated_value_based_bidding_training_preparation.sqlx index 89f2bcdf..48e892aa 100644 --- a/sql/procedure/aggregated_value_based_bidding_training_preparation.sqlx +++ b/sql/procedure/aggregated_value_based_bidding_training_preparation.sqlx @@ -19,9 +19,31 @@ # of the feature importance, there is no problem in applying this strategy. # Validation and test subsets are not replicated. +# This code snippet defines a BigQuery SQL view named {{view_name}} +# in the project {{project_id}} and dataset {{dataset}}. +# The view is used for training the Aggregated value-based bidding model. -- The view schema should match with the `transformations-value-based-bidding.json` file. -- Taking into consideration the excluded_features as listed in the `config.yaml` file. +-- +-- The view schema includes the following columns: +-- data_split: String indicating the data split (TRAIN, VALIDATE, TEST). +-- Dt: Date of the data. +-- First_Visits: SUM of first visits for each Date. +-- Visit_Product_Page: SUM of visits to product pages for each Date. +-- View_Product_Details: SUM of views of product details for each Date. +-- Add_Product_to_Cart: SUM of times products were added to the cart for each Date. +-- View_Cart: SUM of times the cart was viewed for each Date. +-- Begin_Checkout: SUM of times checkout was initiated for each Date. +-- Added_Shipping_Info: SUM of times shipping information was added for each Date. +-- Added_Payment_Info: SUM of times payment information was added for each Date. +-- Purchase_Product: SUM of purchases made for each Date. +-- +-- The view is defined using a series of UNION ALL statements that combine data from +-- the aggregated_vbb table in the {{mds_project_id}}.{{mds_dataset}} dataset. +-- The first three UNION ALL statements select data for the training split. Each statement selects the same data, effectively replicating it three times. 
This is done to increase the size of the training dataset and potentially improve model performance. +-- The fourth UNION ALL statement selects data for the validation split, filtering for data within the specified eval_start_date and eval_end_date range. +-- The fifth UNION ALL statement selects data for the test split, also filtering for data within the specified eval_start_date and eval_end_date range. CREATE OR REPLACE VIEW `{{project_id}}.{{dataset}}.{{view_name}}` (data_split, Dt, diff --git a/sql/procedure/audience_segmentation_inference_preparation.sqlx b/sql/procedure/audience_segmentation_inference_preparation.sqlx index 12845e80..11ff6e85 100644 --- a/sql/procedure/audience_segmentation_inference_preparation.sqlx +++ b/sql/procedure/audience_segmentation_inference_preparation.sqlx @@ -13,17 +13,25 @@ -- limitations under the License. -- Setting procedure to lookback from the day before `inference_date` +-- This procedure prepares data for the audience segmentation inference pipeline. +-- It extracts relevant features from the user segmentation dimensions and user lookback metrics metrics tables +-- and combines them into a single table for model inference. DECLARE lastest_processed_time_ud TIMESTAMP; DECLARE lastest_processed_time_uwm TIMESTAMP; DECLARE lastest_processed_time_um TIMESTAMP; +-- Parameters: +-- inference_date: The date for which to prepare the data. This date should be one day before the actual inference date to account for data processing delays. -- Setting procedure to lookback from the day before `inference_date` SET inference_date = DATE_SUB(inference_date, INTERVAL 1 DAY); +-- Get the latest processed timestamps: The latest processed timestamps for each of the three feature tables are retrieved. These timestamps are used to filter the data to ensure that only the most recent data is used for inference. SET lastest_processed_time_ud = (SELECT MAX(processed_timestamp) FROM `{{feature_store_project_id}}.{{feature_store_dataset}}.user_segmentation_dimensions` WHERE feature_date = inference_date LIMIT 1); SET lastest_processed_time_uwm = (SELECT MAX(processed_timestamp) FROM `{{feature_store_project_id}}.{{feature_store_dataset}}.user_lookback_metrics` WHERE feature_date = inference_date LIMIT 1); -SET lastest_processed_time_um = (SELECT MAX(processed_timestamp) FROM `{{feature_store_project_id}}.{{feature_store_dataset}}.user_scoped_segmentation_metrics` WHERE feature_date = inference_date LIMIT 1); + +-- Prepare user segmentation dimensions data: The user_segmentation_dimensions table is queried to extract relevant features for the inference date. +-- The query uses the user_segmentation_dimensions_window window function to aggregate features over the past 15 days. CREATE OR REPLACE TEMP TABLE inference_preparation_ud as ( SELECT DISTINCT UD.user_pseudo_id, @@ -63,6 +71,9 @@ CREATE OR REPLACE TEMP TABLE inference_preparation_ud as ( user_segmentation_dimensions_window AS (PARTITION BY UD.user_pseudo_id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ); + +-- Prepare user lookback metrics data: The user_lookback_metrics table is queried to extract relevant features for the inference date. +-- The query uses the user_lookback_metrics_window window function to aggregate features over the past 15 days. 
CREATE OR REPLACE TEMP TABLE inference_preparation_uwm AS ( SELECT DISTINCT UWM.user_pseudo_id, @@ -96,6 +107,8 @@ CREATE OR REPLACE TEMP TABLE inference_preparation_uwm AS ( ); +-- Combine the data: The data from the three feature tables is combined into a single table called inference_preparation. +-- This table contains all of the features that will be used for model inference. CREATE OR REPLACE TEMP TABLE inference_preparation AS ( SELECT DISTINCT UD.user_pseudo_id, @@ -144,8 +157,52 @@ CREATE OR REPLACE TEMP TABLE inference_preparation AS ( AND UWM.feature_date = UD.feature_date ); + +-- Delete all rows from the insert_table DELETE FROM `{{project_id}}.{{dataset}}.{{insert_table}}` WHERE TRUE; + +-- Insert the data into the target table: The data from the inference_preparation table is inserted into the target table specified by the insert_table parameter. +-- This table will be used by the model inference pipeline. +-- +-- The table schema includes the following columns: +-- feature_date: The date for which the features are extracted. +-- user_pseudo_id: The unique identifier for the user. +-- user_id: The user ID. +-- device_category: The category of the device used by the user. +-- device_mobile_brand_name: The brand name of the mobile device used by the user. +-- device_mobile_model_name: The model name of the mobile device used by the user. +-- device_os: The operating system of the device used by the user. +-- device_os_version: The version of the operating system used by the user. +-- device_language: The language used by the user. +-- device_web_browser: The web browser used by the user. +-- device_web_browser_version: The version of the web browser used by the user. +-- geo_sub_continent: The sub-continent of the user's location. +-- geo_country: The country of the user's location. +-- geo_region: The region of the user's location. +-- geo_city: The city of the user's location. +-- geo_metro: The metropolitan area of the user's location. +-- last_traffic_source_medium: The medium used to reach the user's last session. +-- last_traffic_source_name: The name of the traffic source used to reach the user's last session. +-- last_traffic_source_source: The source of the last traffic source used by the user. +-- first_traffic_source_medium: The medium of the first traffic source used by the user. +-- first_traffic_source_name: The name of the first traffic source used by the user. +-- first_traffic_source_source: The source of the first traffic source used by the user. +-- has_signed_in_with_user_id: Whether the user has signed in with a user ID. +-- active_users_past_1_7_day: The number of active users in the past 7 days for each user. +-- active_users_past_8_14_day: The number of active users in the past 8-14 days for each user. +-- purchases_past_1_7_day: The number of purchases in the past 7 days for each user. +-- purchases_past_8_14_day: The number of purchases in the past 8-14 days for each user. +-- visits_past_1_7_day: The number of visits in the past 7 days for each user. +-- visits_past_8_14_day: The number of visits in the past 8-14 days for each user. +-- view_items_past_1_7_day: The number of items viewed in the past 7 days for each user. +-- view_items_past_8_14_day: The number of items viewed in the past 8-14 days for each user. +-- add_to_carts_past_1_7_day: The number of items added to carts in the past 7 days for each user. +-- add_to_carts_past_8_14_day: The number of items added to carts in the past 8-14 days for each user. 
+-- checkouts_past_1_7_day: The number of checkouts in the past 7 days for each user.
+-- checkouts_past_8_14_day: The number of checkouts in the past 8-14 days for each user.
+-- ltv_revenue_past_1_7_day: The lifetime value revenue in the past 7 days for each user.
+-- ltv_revenue_past_7_15_day: The lifetime value revenue in the past 7-15 days for each user.
 INSERT INTO `{{project_id}}.{{dataset}}.{{insert_table}}`
 (feature_date,
 user_pseudo_id,
@@ -226,6 +283,9 @@ SELECT
 FROM
 inference_preparation;
+
+-- Create the final inference table: The audience_segmentation_inference_15 table is created by selecting the latest values for each feature from the insert_table.
+-- This table will be used by the model inference pipeline.
 CREATE OR REPLACE TABLE `{{project_id}}.{{dataset}}.audience_segmentation_inference_15` AS(
 SELECT DISTINCT
@@ -262,6 +322,10 @@ CREATE OR REPLACE TABLE
 FROM `{{project_id}}.{{dataset}}.{{insert_table}}`
 );
+
+-- Create the final inference view: The v_audience_segmentation_inference_15 view is created by selecting the latest values for each feature from the audience_segmentation_inference_15 table.
+-- This view will be used by the model inference pipeline.
 CREATE OR REPLACE VIEW `{{project_id}}.{{dataset}}.v_audience_segmentation_inference_15`
 (processed_timestamp,
 feature_date,
@@ -362,11 +426,13 @@ FROM (
 checkouts_past_8_14_day,
 ltv_revenue_past_1_7_day,
 ltv_revenue_past_7_15_day,
+ -- Row order for each user_pseudo_id is used to select the latest value for each feature.
 ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY feature_date DESC) AS user_row_order
 FROM `{{project_id}}.{{dataset}}.audience_segmentation_inference_15`
 )
 WHERE
+ -- Only the latest row ordered by feature_date descending
 user_row_order = 1;
 DROP TABLE inference_preparation;
\ No newline at end of file
diff --git a/sql/procedure/audience_segmentation_training_preparation.sqlx b/sql/procedure/audience_segmentation_training_preparation.sqlx
index 1421d899..598a4576 100644
--- a/sql/procedure/audience_segmentation_training_preparation.sqlx
+++ b/sql/procedure/audience_segmentation_training_preparation.sqlx
@@ -12,24 +12,25 @@
 -- See the License for the specific language governing permissions and
 -- limitations under the License.
+-- The procedure audience_segmentation_training_preparation prepares data for training the Audience Segmentation model.
 DECLARE custom_start_date DATE DEFAULT NULL;
 DECLARE custom_end_date DATE DEFAULT NULL;
 DECLARE max_date DATE;
 DECLARE min_date DATE;
-# custom_start_date: The start date of the data to be used for training.
-# custom_end_date: The end date of the data to be used for training.
+-- custom_start_date: The start date of the data to be used for training.
+-- custom_end_date: The end date of the data to be used for training.
 SET custom_start_date = PARSE_DATE("%Y-%m-%d", {{custom_start_date}});
 SET custom_end_date = PARSE_DATE("%Y-%m-%d", {{custom_end_date}});
-# max_date: The maximum date of the data that is available for training.
-# min_date: The minimum date of the data that is available for training.
+-- max_date: The maximum date of the data that is available for training.
+-- min_date: The minimum date of the data that is available for training.
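The `SET ... PARSE_DATE` statements above assume that the `{{custom_start_date}}` and `{{custom_end_date}}` template variables render to quoted `%Y-%m-%d` strings. A rendered form of those statements might look like the sketch below; the dates are illustrative placeholders, not values taken from the solution's configuration, and whether your configuration quotes the values is an assumption to verify.

```sql
-- Illustrative rendering of the templated SET statements; dates are placeholders.
DECLARE custom_start_date DATE DEFAULT NULL;
DECLARE custom_end_date DATE DEFAULT NULL;
SET custom_start_date = PARSE_DATE("%Y-%m-%d", "2024-01-01");
SET custom_end_date = PARSE_DATE("%Y-%m-%d", "2024-03-31");
```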
SET max_date = (SELECT DATE_SUB(MAX(event_date), INTERVAL 1 DAY) FROM `{{mds_project_id}}.{{mds_dataset}}.event`); SET min_date = (SELECT DATE_ADD(MIN(event_date), INTERVAL 15 DAY) FROM `{{mds_project_id}}.{{mds_dataset}}.event`); -# The procedure first checks if the custom_start_date and custom_end_date parameters are valid. -# If either parameter is not valid, the procedure sets the corresponding date to the maximum or -# minimum date of the available data. +-- The procedure first checks if the custom_start_date and custom_end_date parameters are valid. +-- If either parameter is not valid, the procedure sets the corresponding date to the maximum or +-- minimum date of the available data. IF (custom_start_date IS NOT NULL OR custom_start_date >= min_date OR custom_start_date <= max_date) AND custom_start_date < custom_end_date THEN SET min_date = custom_start_date; @@ -40,6 +41,12 @@ IF (custom_end_date IS NOT NULL OR custom_end_date <= max_date OR custom_end_dat SET max_date = custom_end_date; END IF; + +-- Prepare Training Data: +-- Create temporary tables training_preparation_ud and training_preparation_uwm to store +-- user segmentation dimensions and user lookback metrics data, respectively. +-- Filter the data based on the custom_start_date and custom_end_date parameters. +-- Use window functions to aggregate features over the past 15 days for each user. CREATE OR REPLACE TEMP TABLE training_preparation_ud as ( SELECT DISTINCT UD.user_pseudo_id, @@ -103,6 +110,8 @@ WINDOW ); +-- Create a temporary table training_preparation by joining the +-- training_preparation_ud and training_preparation_uwm tables. CREATE OR REPLACE TEMP TABLE training_preparation as ( SELECT DISTINCT UD.user_pseudo_id, @@ -152,6 +161,7 @@ ON ); +-- Create a temporary table DataForTargetTable that assigns a data split (TRAIN, VALIDATE, TEST) to each user based on their user_pseudo_id fingerprint. CREATE OR REPLACE TEMP TABLE DataForTargetTable AS( SELECT DISTINCT CASE @@ -190,10 +200,43 @@ CREATE OR REPLACE TEMP TABLE DataForTargetTable AS( ltv_revenue_past_7_15_day FROM training_preparation); +-- Create the final training table audience_segmentation_training_full_dataset by selecting all data from DataForTargetTable. +-- This table schema defines the following columns: +-- data_split: The data split (TRAIN, VALIDATE, TEST) to which the user belongs. +-- feature_date: The date for which the features are extracted. +-- user_pseudo_id: The unique identifier for the user. +-- user_id: The user ID. +-- device_category: The category of the device used by the user. +-- device_mobile_model_name: The model name of the mobile device used by the user. +-- device_os_version: The operating system version of the device used by the user. +-- geo_country: The country of the user's location. +-- geo_region: The region of the user's location. +-- geo_city: The city of the user's location. +-- last_traffic_source_medium: The medium used to reach the user's last session. +-- last_traffic_source_name: The name of the traffic source used to reach the user's last session. +-- last_traffic_source_source: The source of the last traffic source used by the user. +-- first_traffic_source_medium: The medium of the first traffic source used by the user. +-- first_traffic_source_name: The name of the first traffic source used by the user. +-- first_traffic_source_source: The source of the first traffic source used by the user. +-- active_users_past_1_7_day: The number of times the user has been active in the past 7 days for each user. 
+-- active_users_past_8_14_day: The number of times the user has been active in the past 8-14 days for each user.
+-- purchases_past_1_7_day: The number of purchases in the past 7 days for each user.
+-- purchases_past_8_14_day: The number of purchases in the past 8-14 days for each user.
+-- visits_past_1_7_day: The number of visits in the past 7 days for each user.
+-- visits_past_8_14_day: The number of visits in the past 8-14 days for each user.
+-- view_items_past_1_7_day: The number of items viewed in the past 7 days for each user.
+-- view_items_past_8_14_day: The number of items viewed in the past 8-14 days for each user.
+-- add_to_carts_past_1_7_day: The number of items added to carts in the past 7 days for each user.
+-- add_to_carts_past_8_14_day: The number of items added to carts in the past 8-14 days for each user.
+-- checkouts_past_1_7_day: The number of checkouts in the past 7 days for each user.
+-- checkouts_past_8_14_day: The number of checkouts in the past 8-14 days for each user.
+-- ltv_revenue_past_1_7_day: The lifetime value revenue gain in the past 7 days for each user.
+-- ltv_revenue_past_7_15_day: The lifetime value revenue gain in the past 7-15 days for each user.
 CREATE OR REPLACE TABLE `{{project_id}}.{{dataset}}.audience_segmentation_training_full_dataset` AS
 SELECT DISTINCT * FROM DataForTargetTable WHERE data_split IS NOT NULL;
+-- Create the audience_segmentation_training_15 table by adding a processed_timestamp column and selecting the feature columns from the audience_segmentation_training_full_dataset table.
 CREATE OR REPLACE TABLE `{{project_id}}.{{dataset}}.audience_segmentation_training_15` AS(
 SELECT DISTINCT
 CURRENT_TIMESTAMP() AS processed_timestamp,
@@ -230,6 +273,8 @@ CREATE OR REPLACE TABLE `{{project_id}}.{{dataset}}.audience_segmentation_traini
 FROM `{{project_id}}.{{dataset}}.audience_segmentation_training_full_dataset`
 );
+-- Create a view v_audience_segmentation_training_15 that selects the latest values for each feature from the audience_segmentation_training_15 table.
+-- This view is used by the Vertex AI pipeline to train the Audience Segmentation model.
 CREATE OR REPLACE VIEW `{{project_id}}.{{dataset}}.v_audience_segmentation_training_15`
 (processed_timestamp,
 data_split,
@@ -337,4 +382,5 @@ FROM (
 `{{project_id}}.{{dataset}}.audience_segmentation_training_15`)
 WHERE
 user_row_order = 1 AND
+ -- samples_per_split variable determines the number of samples to be included in each data split (TRAIN, VALIDATE, TEST).
rn <= {{samples_per_split}}; \ No newline at end of file diff --git a/sql/procedure/customer_lifetime_value_inference_preparation.sqlx b/sql/procedure/customer_lifetime_value_inference_preparation.sqlx index 2aa69599..d2bbb52f 100644 --- a/sql/procedure/customer_lifetime_value_inference_preparation.sqlx +++ b/sql/procedure/customer_lifetime_value_inference_preparation.sqlx @@ -513,6 +513,7 @@ CREATE OR REPLACE VIEW `{{project_id}}.{{dataset}}.v_customer_lifetime_value_inf (processed_timestamp, feature_date, user_pseudo_id, + user_id, device_category, device_mobile_brand_name, device_mobile_model_name, @@ -580,6 +581,7 @@ SELECT DISTINCT processed_timestamp, feature_date, user_pseudo_id, + user_id, device_category, device_mobile_brand_name, device_mobile_model_name, @@ -641,6 +643,7 @@ SELECT DISTINCT processed_timestamp, feature_date, user_pseudo_id, + user_id, device_category, device_mobile_brand_name, device_mobile_model_name, @@ -709,6 +712,7 @@ CREATE OR REPLACE VIEW `{{project_id}}.{{dataset}}.v_customer_lifetime_value_inf (processed_timestamp, feature_date, user_pseudo_id, + user_id, device_category, device_mobile_brand_name, device_mobile_model_name, @@ -776,6 +780,7 @@ SELECT DISTINCT processed_timestamp, feature_date, user_pseudo_id, + user_id, device_category, device_mobile_brand_name, device_mobile_model_name, @@ -837,6 +842,7 @@ SELECT DISTINCT processed_timestamp, feature_date, user_pseudo_id, + user_id, device_category, device_mobile_brand_name, device_mobile_model_name, @@ -904,6 +910,7 @@ WHERE (processed_timestamp, feature_date, user_pseudo_id, + user_id, device_category, device_mobile_brand_name, device_mobile_model_name, @@ -971,6 +978,7 @@ SELECT DISTINCT processed_timestamp, feature_date, user_pseudo_id, + user_id, device_category, device_mobile_brand_name, device_mobile_model_name, @@ -1032,6 +1040,7 @@ SELECT DISTINCT processed_timestamp, feature_date, user_pseudo_id, + user_id, device_category, device_mobile_brand_name, device_mobile_model_name,
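Aside from the `user_id` columns added in the hunks above, the earlier `v_audience_segmentation_inference_15` and `v_audience_segmentation_training_15` views both keep only the most recent feature row per user via `ROW_NUMBER()`. The sketch below illustrates that pattern on toy data; the CTE name and values are invented for the example and are not part of the solution's tables.

```sql
-- Toy illustration of the "latest row per user" pattern used by the
-- audience segmentation views: rank each user's rows by feature_date
-- descending and keep only the first row.
WITH toy_features AS (
  SELECT 'user_a' AS user_pseudo_id, DATE '2024-01-01' AS feature_date UNION ALL
  SELECT 'user_a', DATE '2024-01-02' UNION ALL
  SELECT 'user_b', DATE '2024-01-01'
)
SELECT user_pseudo_id, feature_date
FROM (
  SELECT
    user_pseudo_id,
    feature_date,
    ROW_NUMBER() OVER (PARTITION BY user_pseudo_id ORDER BY feature_date DESC) AS user_row_order
  FROM toy_features
)
WHERE user_row_order = 1;
```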