An initial introduction to the Iguazio Data Science Platform and the platform tutorials
- Platform Overview
- Data Science Workflow
- The Tutorial Notebooks
- Getting-Started Tutorial
- End-to-End Use-Case Application and How-To Demos
- Installing and Updating the MLRun Python Package
- Data Ingestion and Preparation
- Additional Platform Resources
- Miscellaneous
The Iguazio Data Science Platform ("the platform") is a fully integrated and secure data science platform as a service (PaaS), which simplifies development, accelerates performance, facilitates collaboration, and addresses operational challenges. The platform incorporates the following components:
- A data science workbench that includes Jupyter Notebook, integrated analytics engines, and Python packages
- The MLRun open-source MLOps orchestration framework for ML model management with experiments tracking and pipeline automation
- Managed data and machine-learning (ML) services over a scalable Kubernetes cluster
- A real-time serverless functions framework for model serving (Nuclio)
- An extremely fast and secure data layer that supports SQL, NoSQL, time-series databases, files (simple objects), and streaming
- Integration with third-party data sources such as Amazon S3, HDFS, SQL databases, and streaming or messaging protocols
- Real-time dashboards based on Grafana
The platform provides a complete data science workflow in a single ready-to-use platform that includes all the required building blocks for creating data science applications from research to production:
- Collect, explore, and label data from various real-time or offline sources
- Run ML training and validation at scale over multiple CPUs and GPUs
- Deploy models and applications into production with serverless functions
- Log, monitor, and visualize all your data and services
The home directory of the platform's running user (/v3io/users/<running user>, accessible via the /User data mount) contains pre-deployed tutorial Jupyter notebooks with code samples and documentation to assist you in your development — including a demos directory with end-to-end use-case applications (see the next section) and a data-ingestion-and-preparation directory with documentation and examples for performing data ingestion and preparation tasks.
Note:
- To view and run the tutorials from the platform, you first need to create a Jupyter Notebook service.
- The welcome.ipynb notebook and main README.md file provide the same introduction in different formats.
Start out by running the getting-started tutorial to familiarize yourself with the platform and experience firsthand some of its main capabilities.
You can also view the tutorial on GitHub.
Iguazio provides full end-to-end use-case application and how-to demos that demonstrate how to use the platform, its MLRun service, and related tools to address data science requirements for different industries and implementations.
These demos are available in the MLRun demos repository.
Use the provided update-demos.sh script to get updated demos from this repository.
By default, the script retrieves the files from the latest release that matches the version of the installed mlrun package (see Installing and Updating the MLRun Python Package).
The files are copied to the /v3io/users/<username>/demos directory, where <username> is the name of the running user ($V3IO_USERNAME), unless you set the -u|--user flag to another username.
Note: Before running the script, close any open files in the demos directory.
# Get additional demos
!/User/update-demos.sh
For full usage instructions, run the script with the -h or --help flag:
!/User/update-demos.sh --help
| Demo | Open Locally | View on GitHub | Description |
|---|---|---|---|
| scikit-learn Demo: Full AutoML pipeline | Open locally | View on GitHub | Demonstrates how to build a full end-to-end automated-ML (AutoML) pipeline using scikit-learn and the UCI Iris data set. |
| Image-Classification Demo: Image classification with distributed training | Open locally | View on GitHub | Demonstrates an end-to-end image-classification solution using TensorFlow (versions 1 or 2), Keras, Horovod, and Nuclio. |
| Faces Demo: Real-time image recognition with deep learning | Open locally | View on GitHub | Demonstrates real-time capture, recognition, and classification of face images over a video stream, as well as location tracking of identities, using PyTorch, OpenCV, and Streamlit. |
| Churn Demo: Real-time customer-churn prediction | Open locally | View on GitHub | Demonstrates analysis of customer-churn data using the Kaggle Telco Customer Churn data set, model training and validation using XGBoost, and model serving using real-time Nuclio serverless functions. |
| Stock-Analysis Demo | Open locally | View on GitHub | Demonstrates how to tackle a common requirement of running a data-engineering pipeline as part of ML model serving by reading data from external data sources and generating insights using ML models. The demo reads stock data from an external source, analyzes the related market news, and visualizes the analyzed data in a Grafana dashboard. |
| NetOps Demo: Predictive network operations / telemetry | Open locally | View on GitHub | Demonstrates how to build an automated ML pipeline for predicting network outages based on network-device telemetry, also known as network operations (NetOps). The demo implements both model training and inference, including model monitoring and concept-drift detection. |
| Demo | Open Locally | View on GitHub | Description |
|---|---|---|---|
| How-To: Converting existing ML code to an MLRun project | Open locally | View on GitHub | Demonstrates how to convert existing ML code to an MLRun project. The demo implements an MLRun project for taxi ride-fare prediction based on a Kaggle notebook with an ML Python script that uses data from the New York City Taxi Fare Prediction competition. |
| How-To: Running a Spark job for reading a CSV file | Open locally | View on GitHub | Demonstrates how to run a Spark job that reads a CSV file and logs the data set to an MLRun database. |
| How-To: Running a Spark job for analyzing data | Open locally | View on GitHub | Demonstrates how to create and run a Spark job that generates a profile report from an Apache Spark DataFrame based on pandas profiling. |
| How-To: Running a Spark Job with Spark Operator | Open locally | View on GitHub | Demonstrates how to use Spark Operator to run a Spark job over Kubernetes with MLRun. |
The demo applications and many of the platform tutorials use MLRun — Iguazio's end-to-end open-source MLOps solution for managing and automating your entire analytics and machine-learning life cycle, from data ingestion through model development to full pipeline deployment in production.
MLRun is available in the platform via a default (pre-deployed) shared platform service (mlrun).
However, to use MLRun from Python code (such as in the demo and tutorial notebooks), you also need to install the MLRun Python package (mlrun).
The version of the installed package must match the version of the platform's MLRun service and must be updated whenever the service's version is updated.
The platform provides an align_mlrun.sh script for simplifying the MLRun package installation and synchronizing its version with the MLRun service.
The script is available in the running-user directory (your Jupyter home directory), which is accessible via the /User data mount.
Run the following command from each Jupyter Notebook service, both for the initial package installation (after creating a new Jupyter Notebook service) and whenever the MLRun service is updated:
!/User/align_mlrun.sh
The platform allows storing data in any format. The platform's multi-model data layer and related APIs provide enhanced support for working with NoSQL ("key-value"), time-series, and stream data. Different steps of the data science life cycle (pipeline) might require different tools and frameworks for working with data, especially when it comes to the different mechanisms required during the research and development phase versus the operational production phase. The platform features a wide array of methods for manipulating and managing data of different formats in each step of the data life cycle, using a variety of frameworks, tools, and APIs — such as the following (see the short example after this list):
- Spark SQL and DataFrames
- Spark Streaming
- Presto SQL queries
- pandas DataFrames
- Dask
- V3IO Frames Python library
- V3IO SDK
- Web APIs
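For example, the following is a minimal sketch of reading a platform NoSQL ("key-value") table into a pandas DataFrame by using the V3IO Frames Python library. The framesd:8081 endpoint is the common in-cluster default for the Frames service, and the examples/bank table path is a hypothetical placeholder; adjust both to match your environment.
# Read a NoSQL (key-value) table into a pandas DataFrame by using V3IO Frames
import os
import v3io_frames as v3f
# Connect to the platform's Frames service (framesd:8081 is the common in-cluster default)
client = v3f.Client("framesd:8081", container="users", token=os.getenv("V3IO_ACCESS_KEY"))
# Read a hypothetical "examples/bank" table from the "users" container
df = client.read(backend="kv", table="examples/bank")
df.head()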
The data ingestion and preparation tutorial README (data-ingestion-and-preparation/README.ipynb/.md) provides an overview of various methods for collecting, storing, and manipulating data in the platform, and references to sample tutorial notebooks that demonstrate how to use these methods.
▶ Open the README notebook / Markdown file
You can find more information and resources in the MLRun documentation:
▶ View the MLRun documentation
You might also find the following resources useful:
- Introduction video
- In-depth platform overview with a breakdown of the steps for developing a full data science workflow from development to production
- Platform Services
- Platform data layer, including references
- nuclio-jupyter SDK for creating and deploying Nuclio functions with Python and Jupyter Notebook
A virtual environment is a named, isolated, working copy of Python that maintains its own files, directories, and paths so that you can work with specific versions of libraries or Python itself without affecting other Python projects. Virtual environments make it easy to cleanly separate projects and avoid problems with different dependencies and version requirements across components. See the virtual-env tutorial notebook for step-by-step instructions for using conda to create your own Python virtual environments, which will appear as custom kernels in Jupyter Notebook.
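For example, the following is a minimal sketch of creating a new conda environment from a notebook cell; the environment name (my-env) and Python version are hypothetical placeholders. See the virtual-env tutorial for the full procedure, including how to expose the environment as a Jupyter kernel.
# Create a new conda virtual environment (the name and Python version are examples)
!conda create -n my-env python=3.7 -y
# Verify that the new environment was created
!conda env list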
You can use the provided igz-tutorials-get.sh script to get updated platform tutorials from the tutorials GitHub repository. By default, the script retrieves the files from the latest release that matches the current platform version. For details, see the update-tutorials.ipynb notebook.
The v3io directory that you see in the file browser of the Jupyter UI displays the contents of the v3io data mount for browsing the platform data containers.
For information about the platform's data containers and how to reference data in these containers, see Platform Data Containers.
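For example, you can browse the mounted data containers and your running-user home directory directly from a notebook; this sketch assumes the default v3io mount and the V3IO_USERNAME environment variable, which are preconfigured in platform Jupyter Notebook services.
# List the platform data containers that are exposed through the v3io data mount
!ls /v3io
# List the contents of the running user's home directory in the "users" container
!ls /v3io/users/$V3IO_USERNAME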
The Iguazio support team will be happy to assist with any questions.