Commit

Added section on putting everything in script
JaumeAmoresDS committed Apr 3, 2024
1 parent 0430a2c commit 3c0f3c2
Showing 3 changed files with 366 additions and 2 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -6,4 +6,4 @@
**/*/*.ipynb_checkpoints
_site
.amlignore*

.ipynb_checkpoints/
29 changes: 29 additions & 0 deletions additional/data_science/connect_locally.ipynb
@@ -0,0 +1,29 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Based on https://github.com/Azure/AzureML-Containers/tree/master?tab=readme-ov-file#howtorun\n",
"- Using cheat sheet from https://dockerlabs.collabnix.com/docker/cheatsheet/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```bash\n",
"docker pull mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu22.04\n",
"docker run -it --entrypoint bash mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu22.04\n",
"```"
]
}
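,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To work with local files inside the container, we can additionally mount the current directory; a sketch based on the cheat sheet above (the `/workspace` mount point is an arbitrary choice):\n",
"\n",
"```bash\n",
"docker run -it -v \"$PWD\":/workspace -w /workspace --entrypoint bash mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu22.04\n",
"```"
]
}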
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
337 changes: 336 additions & 1 deletion posts/data_science/hello_world.ipynb
@@ -3659,8 +3659,343 @@
"id": "9f5f0693",
"metadata": {},
"source": [
"## Final refactoring"
"## Putting everything into a script\n",
"\n",
"Let's see how to put all the code needed for creating a pipeline into a script. "
]
},
{
"cell_type": "markdown",
"id": "e6a069f1",
"metadata": {},
"source": [
"### config json file"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5bcbc199",
"metadata": {},
"outputs": [],
"source": [
"%%writefile pipeline_input.json\n",
"{\n",
" \"preprocessing_training_input_file\": \"./data/dummy_input.csv\",\n",
" \"preprocessing_training_output_filename\":\"preprocessed_training_data.csv\",\n",
" \"x\": 10,\n",
" \n",
" \"preprocessing_test_input_file\": \"./data/dummy_test.csv\",\n",
" \"preprocessing_test_output_filename\": \"preprocessed_test_data.csv\",\n",
" \n",
" \"training_output_filename\": \"model.pk\",\n",
" \n",
" \"inference_output_filename\": \"inference_results.csv\",\n",
"\n",
" \"experiment_name\": \"e2e_three_components_in_script\"\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "f3ccd600",
"metadata": {},
"source": [
"### pipeline script"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5138a417",
"metadata": {},
"outputs": [],
"source": [
"%%writefile hello_world_pipeline.py\n",
"# -------------------------------------------------------------------------------------\n",
"# Imports\n",
"# -------------------------------------------------------------------------------------\n",
"# Standard imports\n",
"import os\n",
"import argparse\n",
"\n",
"# Third-party imports\n",
"import pandas as pd\n",
"from sklearn import Bunch\n",
"\n",
"# AML imports\n",
"from azure.ai.ml import (\n",
" command,\n",
" dsl,\n",
" Input,\n",
" Output,\n",
" MLClient\n",
")\n",
"from azure.identity import DefaultAzureCredential\n",
"\n",
"# -------------------------------------------------------------------------------------\n",
"# Connection\n",
"# -------------------------------------------------------------------------------------\n",
"# authenticate\n",
"credential = DefaultAzureCredential()\n",
"\n",
"# Get a handle to the workspace\n",
"ml_client = MLClient.from_config (\n",
" credential=credential\n",
")\n",
"\n",
"# -------------------------------------------------------------------------------------\n",
"# Interface for each component\n",
"# -------------------------------------------------------------------------------------\n",
"# Preprocessing\n",
"preprocessing_command = command(\n",
" inputs=dict(\n",
" input_file=Input (type=\"uri_file\"),\n",
" x=Input (type=\"number\"),\n",
" output_filename=Input (type=\"string\"),\n",
" ),\n",
" outputs=dict(\n",
" output_folder=Output (type=\"uri_folder\"),\n",
" ),\n",
" code=f\"./preprocessing/\", # location of source code: in this case, the root folder\n",
" command=\"python preprocessing.py \"\n",
" \"--input_file ${{inputs.input_file}} \"\n",
" \"-x ${{inputs.x}} \"\n",
" \"--output_folder ${{outputs.output_folder}} \"\n",
" \"--output_filename ${{inputs.output_filename}}\",\n",
" environment=\"AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest\",\n",
" display_name=\"Pre-processing\",\n",
")\n",
"preprocessing_component = ml_client.create_or_update(preprocessing_command.component)\n",
"\n",
"# Training\n",
"training_command = command(\n",
" inputs=dict(\n",
" input_folder=Input (type=\"uri_folder\"),\n",
" input_filename=Input (type=\"string\"),\n",
" output_filename=Input (type=\"string\"),\n",
" ),\n",
" outputs=dict(\n",
" output_folder=Output (type=\"uri_folder\"),\n",
" ),\n",
" code=f\"./training/\", # location of source code: in this case, the root folder\n",
" command=\"python training.py \"\n",
" \"--input_folder ${{inputs.input_folder}} \"\n",
" \"--input_filename ${{inputs.input_filename}} \"\n",
" \"--output_folder ${{outputs.output_folder}} \"\n",
" \"--output_filename ${{inputs.output_filename}}\",\n",
" environment=\"AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest\",\n",
" display_name=\"Training\",\n",
")\n",
"training_component = ml_client.create_or_update(training_command.component)\n",
"\n",
"# Inference\n",
"inference_command = command(\n",
" inputs=dict(\n",
" preprocessed_input_folder=Input (type=\"uri_folder\"),\n",
" preprocessed_input_filename=Input (type=\"string\"),\n",
" model_input_folder=Input (type=\"uri_folder\"),\n",
" model_input_filename=Input (type=\"string\"),\n",
" output_filename=Input (type=\"string\"),\n",
" ),\n",
" outputs=dict(\n",
" output_folder=Output (type=\"uri_folder\"),\n",
" ),\n",
" code=f\"./inference/\", # location of source code: in this case, the root folder\n",
" command=\"python inference.py \" \n",
" \"--preprocessed_input_folder ${{inputs.preprocessed_input_folder}} \"\n",
" \"--preprocessed_input_filename ${{inputs.preprocessed_input_filename}} \"\n",
" \"--model_input_folder ${{inputs.model_input_folder}} \"\n",
" \"--model_input_filename ${{inputs.model_input_filename}} \"\n",
" \"--output_folder ${{outputs.output_folder}} \"\n",
" \"--output_filename ${{inputs.output_filename}} \",\n",
"\n",
" environment=\"AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest\",\n",
" display_name=\"inference\",\n",
")\n",
"inference_component = ml_client.create_or_update(inference_command.component)\n",
"\n",
"# -------------------------------------------------------------------------------------\n",
"# Pipeline definition\n",
"# -------------------------------------------------------------------------------------\n",
"@dsl.pipeline(\n",
" compute=\"serverless\", # \"serverless\" value runs pipeline on serverless compute\n",
" description=\"E2E hello world pipeline with input\",\n",
")\n",
"def three_components_pipeline(\n",
" # Preprocessing component parameters, first component:\n",
" preprocessing_training_input_file: str,\n",
" preprocessing_training_output_filename: str,\n",
" x: int,\n",
" \n",
" # Preprocessing component parameters, second component:\n",
" preprocessing_test_input_file: str,\n",
" preprocessing_test_output_filename: str,\n",
" \n",
" # Training component parameters:\n",
" training_output_filename: str, \n",
" \n",
" # Inference component parameters:\n",
" inference_output_filename: str,\n",
"):\n",
" \"\"\"\n",
" Third pipeline: preprocessing, training and inference.\n",
" \n",
" Parameters\n",
" ----------\n",
" preprocessing_training_input_file: str\n",
" Path to file containing training data to be preprocessed.\n",
" preprocessing_training_output_filename: str\n",
" Name of file containing the preprocessed, training data.\n",
" x: int\n",
" Number to add to input data for preprocessing it.\n",
" preprocessing_test_input_file: str\n",
" Path to file containing test data to be preprocessed.\n",
" preprocessing_test_output_filename: str\n",
" Name of file containing the preprocessed, test data.\n",
" training_output_filename: str\n",
" Name of file containing the trained model.\n",
" inference_output_filename: str\n",
" Name of file containing the output data with inference results.\n",
" \"\"\"\n",
" # using data_prep_function like a python call with its own inputs\n",
" preprocessing_training_job = preprocessing_component(\n",
" input_file=preprocessing_training_input_file,\n",
" #output_folder: automatically determined\n",
" output_filename=preprocessing_training_output_filename,\n",
" x=x,\n",
" )\n",
" preprocessing_test_job = preprocessing_component(\n",
" input_file=preprocessing_test_input_file,\n",
" #output_folder: automatically determined\n",
" output_filename=preprocessing_test_output_filename,\n",
" x=x,\n",
" )\n",
" training_job = training_component(\n",
" input_folder=preprocessing_training_job.outputs.output_folder,\n",
" input_filename=preprocessing_training_output_filename,\n",
" #output_folder: automatically determined\n",
" output_filename=training_output_filename,\n",
" )\n",
" inference_job = inference_component(\n",
" preprocessed_input_folder=preprocessing_test_job.outputs.output_folder,\n",
" preprocessed_input_filename=preprocessing_test_output_filename,\n",
" model_input_folder=training_job.outputs.output_folder,\n",
" model_input_filename=training_output_filename,\n",
" #output_folder: automatically determined\n",
" output_filename=inference_output_filename,\n",
" )\n",
"\n",
"# -------------------------------------------------------------------------------------\n",
"# Pipeline running\n",
"# -------------------------------------------------------------------------------------\n",
"def run_pipeline (\n",
" config_path: str=\"./pipeline_input.json\",\n",
"):\n",
"\n",
" # Read config json file\n",
" with open (config_path,\"rt\") as config_file\n",
" config = json.load (config_file)\n",
"\n",
" # Convert config dictionary into a Bunch object.\n",
" # This allows to get access to fields as object attributes\n",
" # Which I find more convenient.\n",
" config = Bunch (**config)\n",
"\n",
" # Build pipeline \n",
" three_components_pipeline = three_components_pipeline(\n",
" # first preprocessing component\n",
" preprocessing_training_input_file=Input(type=\"uri_file\", path=config.preprocessing_training_input_file),\n",
" preprocessing_training_output_filename=config.preprocessing_training_output_filename,\n",
" x=config.x,\n",
" \n",
" # second preprocessing component\n",
" preprocessing_test_input_file=Input(type=\"uri_file\", path=config.preprocessing_test_input_file),\n",
" preprocessing_test_output_filename=config.preprocessing_test_output_filename,\n",
" \n",
" # Training component parameters:\n",
" training_output_filename=config.training_output_filename,\n",
" \n",
" # Inference component parameters:\n",
" inference_output_filename=config.inference_output_filename,\n",
" )\n",
"\n",
" three_components_pipeline_job = ml_client.jobs.create_or_update(\n",
" three_components_pipeline,\n",
" # Project's name\n",
" experiment_name=config.experiment_name,\n",
" )\n",
"\n",
" # ----------------------------------------------------\n",
" # Pipeline running\n",
" # ----------------------------------------------------\n",
" ml_client.jobs.stream(three_components_pipeline_job.name)\n",
"\n",
"# -------------------------------------------------------------------------------------\n",
"# Parsing\n",
"# -------------------------------------------------------------------------------------\n",
"def parse_args ():\n",
" \"\"\"Parses input arguments\"\"\"\n",
" \n",
" parser = argparse.ArgumentParser()\n",
" parser.add_argument (\n",
" \"--config-path\", \n",
" type=str, \n",
" default=\"pipeline_input.json\",\n",
" help=\"Path to config file specifying pipeline input parameters.\",\n",
" )\n",
" parser.add_argument (\n",
" \"--experiment-name\", \n",
" type=str, \n",
" default=\"hello-world-experiment\",\n",
" help=\"Name of experiment.\",\n",
" )\n",
"\n",
" args = parser.parse_args()\n",
" \n",
" return args\n",
"\n",
"\n",
"# -------------------------------------------------------------------------------------\n",
"# main\n",
"# -------------------------------------------------------------------------------------\n",
"def main ():\n",
" \"\"\"Parses arguments and runs pipeline\"\"\"\n",
" args = parse_args ()\n",
" run_pipeline (\n",
" args.config_path,\n",
" args.experiment_name,\n",
" )\n",
"\n",
"# -------------------------------------------------------------------------------------\n",
"# -------------------------------------------------------------------------------------\n",
"if __name__ == \"__main__\":\n",
" main ()"
]
},
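{
"cell_type": "markdown",
"id": "c8f21b6e",
"metadata": {},
"source": [
"With the script written, we can launch the pipeline from a terminal. A hypothetical invocation, assuming `pipeline_input.json` sits next to `hello_world_pipeline.py`:\n",
"\n",
"```bash\n",
"python hello_world_pipeline.py --config-path pipeline_input.json\n",
"\n",
"# Optionally, override the experiment name given in the config file:\n",
"python hello_world_pipeline.py --config-path pipeline_input.json \\\n",
"    --experiment-name my_custom_experiment\n",
"```"
]
},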
{
"cell_type": "markdown",
"id": "35ca4a4d",
"metadata": {},
"source": [
"\n",
"### Optional changes\n",
"\n",
"Let's introduce two optional changes: \n",
"\n",
"- Instead of defining the `preprocessing_command`, `training_command` and `inference_command` out of the pipeline function, we will define them inside. \n",
"- We will avoid creating the components beforehand, by not calling `ml_client.create_or_update`, as it was done for example here:\n",
"\n",
"```python\n",
"preprocessing_component = ml_client.create_or_update(preprocessing_command.component)\n",
"```"
]
},
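{
"cell_type": "markdown",
"id": "7d4e9a2b",
"metadata": {},
"source": [
"A minimal sketch of both changes, applied to the preprocessing component only (the `inline_preprocessing_pipeline` name is hypothetical; training and inference would follow the same pattern):\n",
"\n",
"```python\n",
"@dsl.pipeline(\n",
"    compute=\"serverless\",\n",
"    description=\"Commands defined inside the pipeline function\",\n",
")\n",
"def inline_preprocessing_pipeline(\n",
"    preprocessing_training_input_file: str,\n",
"    preprocessing_training_output_filename: str,\n",
"    x: int,\n",
"):\n",
"    # Define the command inside the pipeline function...\n",
"    preprocessing_command = command(\n",
"        inputs=dict(\n",
"            input_file=Input(type=\"uri_file\"),\n",
"            x=Input(type=\"number\"),\n",
"            output_filename=Input(type=\"string\"),\n",
"        ),\n",
"        outputs=dict(output_folder=Output(type=\"uri_folder\")),\n",
"        code=\"./preprocessing/\",\n",
"        command=\"python preprocessing.py \"\n",
"                \"--input_file ${{inputs.input_file}} \"\n",
"                \"-x ${{inputs.x}} \"\n",
"                \"--output_folder ${{outputs.output_folder}} \"\n",
"                \"--output_filename ${{inputs.output_filename}}\",\n",
"        environment=\"AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest\",\n",
"        display_name=\"Pre-processing\",\n",
"    )\n",
"    # ...and call it directly: an anonymous component is created\n",
"    # when the job is submitted, with no prior create_or_update call.\n",
"    preprocessing_training_job = preprocessing_command(\n",
"        input_file=preprocessing_training_input_file,\n",
"        output_filename=preprocessing_training_output_filename,\n",
"        x=x,\n",
"    )\n",
"```"
]
},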
{
"cell_type": "markdown",
"id": "3b093697",
"metadata": {},
"source": []
}
],
"metadata": {