
Data Ingestion and Preparation Overview

Learn about different methods for ingesting data into the Iguazio Data Science Platform, analyzing the data, and preparing it for the next step in your data pipeline.

Overview

The Iguazio Data Science Platform ("the platform") allows storing data in any format. The platform's multi-model data layer and related APIs provide enhanced support for working with NoSQL ("key-value"), time-series, and stream data. Various steps of the data science life cycle (pipeline) might require different tools and frameworks for working with data, especially when it comes to the different mechanisms required during the research and development phase versus the operational production phase. The platform features a wide set of methods for manipulating and managing data, of different formats, in each step of the data life cycle, using a variety of frameworks, tools, and APIs — such as Spark SQL and DataFrames, Spark Streaming, Presto SQL queries, pandas DataFrames, Dask, the V3IO Frames Python library, and web APIs.

This tutorial provides an overview of various methods for collecting, storing, and manipulating data in the platform, and refers to sample tutorial notebooks that demonstrate how to use these methods.
For an in-depth overview of the platform and how it can be used to implement a full data science workflow, see the platform-overview tutorial notebook. For information about the available end-to-end use-case applications and how-to demos, see the welcome notebook or the matching README.md file.


[Figure: data pipeline diagram]

Basic Flow

The basic-data-ingestion-and-preparation tutorial walks you through basic scenarios of ingesting data from external sources into the platform's data store and manipulating the data using different data formats. The tutorial includes an example of ingesting a CSV file from an AWS S3 bucket; converting it into a NoSQL table using Spark DataFrames; running SQL queries on the table; and converting the table into a Parquet file.

The Platform's Data Layer

The platform features an extremely fast and secure data layer (a.k.a. "data store") that supports storing data in different formats — SQL, NoSQL, time-series databases, files (simple objects), and streaming. The data is stored within data containers and can be accessed using a variety of APIs — including simple-object, NoSQL ("key-value"), and streaming APIs.

Platform Data Containers

Data is stored within data containers in the platform's distributed file system (DFS), which makes up the platform's data layer. All platform clusters have several predefined containers:

  • "users" — This container is designed to contain <username> directories that provide individual development environments for storing user-specific data. The platform's Jupyter Notebook, Zeppelin, and web-based shell "command-line services" automatically create such a directory for the running user of the service and set it as the home directory of the service environment. You can leverage the following environment variables, which are predefined in the platform's command-line services, to access this running-user directory from your code:

    • V3IO_USERNAME — set to the username of the running user of the Jupyter Notebook service.
    • V3IO_HOME — set to the running-user directory in the "users" container — users/<running user>.
    • V3IO_HOME_URL — set to the fully qualified v3io path to the running-user directory — v3io://users/<running user>.
  • "projects" — This container is designed to store shared project artifacts.
    When creating a new project, the default artifacts path is projects/<project name>/artifacts.

  • "bigdata" — This container has no special significance in the current release, and it will no longer be predefined in future releases. However, you'll still be able to use your existing "bigdata" container and all its data, or create a custom container by this name if it doesn't already exist.

The data containers and their contents are referenced differently depending on the programming interface. For example:

  • In local file-system (FS) commands you use the predefined v3io root data mount — /v3io/<container name>[/<data path>]. There's also a predefined local-FS User mount to the users/<running user> directory, and you can use the aforementioned environment variables when setting data paths. For example, /v3io/users/$V3IO_USERNAME, /v3io/$V3IO_HOME, and /User are all valid ways of referencing the users/<running user> directory from a local FS command.
  • In Hadoop FS or Spark DataFrame commands you use a fully qualified path of the format v3io://<container name>/<data path>. You can also use environment variables with these interfaces.

For detailed information and examples on how to set the data path for each interface, see API Data Paths and the examples in the platform's tutorial Jupyter notebooks.
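
For example, the following Python snippet is a minimal sketch of how these path conventions map to one another; it assumes it runs in the platform's Jupyter Notebook service, where the environment variables listed above are predefined:

import os

# Predefined running-user environment variables (see the list above)
username = os.environ["V3IO_USERNAME"]       # e.g., "iguazio"
home_path = os.environ["V3IO_HOME"]          # e.g., "users/iguazio"
home_url = os.environ["V3IO_HOME_URL"]       # e.g., "v3io://users/iguazio"

# Local file-system path via the predefined /v3io mount (equivalent to "/User/examples")
local_path = f"/v3io/{home_path}/examples"

# Fully qualified v3io path for Hadoop FS or Spark DataFrame commands
spark_path = f"{home_url}/examples"

print(local_path, spark_path)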

The Simple-Object Platform API

The platform's Simple-Object API enables performing simple data-object and container operations that resemble the Amazon Web Services (AWS) Simple Storage Service (S3) API. In addition to the S3-like capabilities, the Simple-Object Web API enables appending data to existing objects. For more information and API usage examples, see the v3io-objects tutorial.
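
As a rough illustration, the following sketch uses the Python requests library to upload and then download a simple object over the web API. The endpoint, access key, and X-v3io-session-key authentication header are placeholders and assumptions; see the v3io-objects tutorial for the exact endpoint and authentication details for your environment:

import requests

# Placeholder values - replace with your cluster's web-API endpoint and platform access key
WEB_API = "https://<web-api-endpoint>:8443"          # assumed HTTPS web-API base URL
ACCESS_KEY = "<platform-access-key>"

url = f"{WEB_API}/users/<running user>/examples/hello.txt"
headers = {"X-v3io-session-key": ACCESS_KEY}         # assumed access-key auth header

# Create (or overwrite) a simple object
requests.put(url, data=b"hello, platform", headers=headers)

# Read the object back
print(requests.get(url, headers=headers).text)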

The NoSQL (Key-Value) Platform API

The platform's NoSQL (a.k.a. key-value/KV) API provides access to the platform's NoSQL data store (database service), which enables storing and consuming data in a tabular format. For more information and API usage examples, see the v3io-kv tutorial.
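
The following is a minimal sketch that uses the v3io-py Python library (v3io.dataplane) to write and read a single table item. The method names and parameters follow the library's documented usage but might differ between library versions, so refer to the v3io-kv tutorial for the authoritative examples:

import v3io.dataplane

# Create a client; within the platform, the web-API endpoint and access key are
# typically resolved from predefined environment variables (an assumption here)
v3io_client = v3io.dataplane.Client()

# Write ("put") a single item to a NoSQL table in the "users" container
v3io_client.kv.put(container="users",
                   table_path="/examples/stocks_tab",
                   key="AAPL",
                   attributes={"symbol": "AAPL", "price": 135.2})

# Read the item back
response = v3io_client.kv.get(container="users",
                              table_path="/examples/stocks_tab",
                              key="AAPL")
print(response.output.item)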

The Streaming Platform API

The platform's Streaming API enables working with data in the platform as streams. For more information and API usage examples, see the v3io-streams tutorial. In addition, see the Working with Streams section in the current tutorial for general information about different methods for working with data streams in the platform.

Reading Data from External Databases

You can use different methods to read data from external databases into the platform's data store, such as Spark over JDBC or SQLAlchemy.

Using Spark over JDBC

Spark SQL includes a data source that can read data from other databases using Java database connectivity (JDBC). The results are returned as a Spark DataFrame that can easily be processed using Spark SQL, or joined with other data sources. The spark-jdbc tutorial includes several examples of using Spark JDBC to ingest data from various databases — such as MySQL, Oracle, and PostgreSQL.
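
For example, the following sketch reads a MySQL table over JDBC into a Spark DataFrame and persists it in the platform's data store. The connection URL, table name, and credentials are placeholders, and the matching JDBC driver must be available to your Spark service:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-ingest").getOrCreate()

# Placeholder connection details - replace with your database's URL and credentials
jdbc_df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://<db-host>:3306/<db-name>")
    .option("dbtable", "<table-name>")
    .option("user", "<db-user>")
    .option("password", "<db-password>")
    .load())

# Save the ingested data as a Parquet file in the running-user directory
jdbc_df.write.mode("overwrite").parquet("v3io://users/<running user>/examples/jdbc_data")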

Using SQLAlchemy

The read-external-db tutorial outlines how to ingest data using SQLAlchemy — a Python SQL toolkit and Object Relational Mapper, which gives application developers the full power and flexibility of SQL — and then use Python DataFrames to work on the ingested data set.
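
As a small sketch with a placeholder connection string, you can create an SQLAlchemy engine and load the query results into a pandas DataFrame:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string - replace with your database's dialect, host, and credentials
engine = create_engine("mysql+pymysql://<db-user>:<db-password>@<db-host>:3306/<db-name>")

# Run a query and load the results into a pandas DataFrame
df = pd.read_sql("SELECT * FROM <table-name> LIMIT 100", con=engine)
df.head()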

Working with Spark

The platform has a default pre-deployed Spark service that enables ingesting, analyzing, and manipulating data using different Spark APIs:

Using Spark SQL and DataFrames

Spark lets you write and query structured data inside Spark programs by using either SQL or a familiar DataFrame API. DataFrames and SQL provide a common way to access a variety of data sources. You can use the Spark SQL and DataFrames API to ingest data into the platform, for both batch and micro-batch processing, and analyze and manipulate large data sets, in a distributed manner.

The platform's custom NoSQL Spark DataFrame implements the Spark data-source API to support a custom data source that enables reading and writing data in the platform's NoSQL store using Spark DataFrames, including enhanced features such as data pruning and filtering (predicate pushdown): queries are passed to the platform's data store, which returns only the relevant data. This enables fast access from Spark to data stored in the platform.

The spark-sql-analytics tutorial demonstrates how to use Spark SQL and DataFrames to access objects, tables, and unstructured data that persists in the platform's data store.
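
For illustration, the following sketch writes a small Spark DataFrame to a platform NoSQL table by using the custom NoSQL data source and then reads it back with a filter that is pushed down to the data store. The io.iguaz.v3io.spark.sql.kv format name and the key option follow the platform's Spark documentation, but verify them, as well as the placeholder table path, against the spark-sql-analytics tutorial for your release:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nosql-df").getOrCreate()

df = spark.createDataFrame([("AAPL", 135.2), ("GOOG", 101.5)], ["symbol", "price"])

table_path = "v3io://users/<running user>/examples/stocks_tab"   # placeholder path

# Write the DataFrame to a NoSQL table; "key" identifies the table's primary-key column
(df.write
   .format("io.iguaz.v3io.spark.sql.kv")
   .mode("overwrite")
   .option("key", "symbol")
   .save(table_path))

# Read the table back; the filter is evaluated by the platform's data store
spark.read.format("io.iguaz.v3io.spark.sql.kv").load(table_path) \
    .filter("price > 120").show()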

For more information and examples of data ingestion with Spark DataFrames, see Getting Started with Data Ingestion Using Spark.
For more about running SQL queries with Spark, see Running Spark SQL Queries under "Running SQL Queries on Platform Data".

Working with Streams

The platform supports various methods for working with data streams, including the following:

Using Nuclio to Get Data from Common Streaming Engines

The platform has a default pre-deployed Nuclio service that uses Iguazio's Nuclio serverless framework, which provides a mechanism for analyzing and processing real-time events from various streaming engines. Nuclio currently supports the following streaming frameworks: Kafka, Kinesis, Azure Event Hubs, platform streams (a.k.a. V3IO streams), RabbitMQ, and MQTT.

Using Nuclio functions to retrieve and analyze streaming data in real time is a very common practice when building a real-time data pipeline. You can stream any type of data, such as telemetry (NetOps) metrics, financial transactions, web clicks, or sensor data, in any format, including images and videos. You can also implement your own logic within the Nuclio function to manipulate or enrich the consumed stream data and prepare it for the next step in the pipeline.

Nuclio serverless functions can sustain high workloads with very low latencies, which makes them very useful for building event-driven applications with strict latency requirements.
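
A Nuclio function is implemented as a handler that receives a context and an event. The following is a minimal sketch of a Python handler that could be attached to a stream trigger; the enrichment logic is just a placeholder:

import json

def handler(context, event):
    # The event body holds the raw record that triggered the function
    record = json.loads(event.body)

    # Placeholder enrichment logic - replace with your own processing
    record["processed"] = True
    context.logger.info(f"Processed record: {record}")

    return json.dumps(record)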

For more information about Nuclio, see the platform's Nuclio service overview.

Using the Platform's Streaming Engine

The platform features a custom streaming engine and a related stream format — a platform stream (a.k.a. V3IO stream). You can use the platform's streaming engine to write data into a queue in a real-time data pipeline, or as a standard streaming engine (similar to Kafka and Kinesis), so you don't need to use an external engine.

The platform's streaming engine is currently available via the platform's Streaming Web API.
In addition, the platform's Spark-Streaming Integration API enables using the Spark Streaming API to work with platform streams, as explained in the next section (Using Spark Streaming).

The v3io-streams tutorial demonstrates basic usage of the streaming API.
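
As a rough sketch, the streaming API can also be called from Python by using the v3io-py library. The method and parameter names below follow the library's documented usage but should be treated as assumptions; see the v3io-streams tutorial for the authoritative examples:

import v3io.dataplane

v3io_client = v3io.dataplane.Client()

# Create a platform stream with four shards in the "users" container
v3io_client.stream.create(container="users",
                          stream_path="/examples/mystream",
                          shard_count=4)

# Write ("put") records to the stream
v3io_client.stream.put_records(container="users",
                               stream_path="/examples/mystream",
                               records=[{"data": '{"metric": "cpu", "value": 0.93}'}])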

Using Spark Streaming

You can use the Spark Streaming API to ingest, consume, and analyze data using data streams. The platform features a custom Spark-Streaming Integration API to allow using the Spark Streaming API with platform streams.

Running SQL Queries on Platform Data

You can run SQL queries on NoSQL and Parquet data in the platform's data store, using any of the following methods:

Running Full ANSI Presto SQL Queries

The platform has a default pre-deployed Presto service that enables using the Presto open-source distributed SQL query engine to run interactive SQL queries and perform high-performance, low-latency interactive analytics on data that's stored in the platform. To run a Presto query from a Jupyter notebook, all you need to do is use the %sql magic command followed by your Presto query. Such queries are executed as distributed queries across the platform's application nodes. The basic-data-ingestion-and-preparation tutorial demonstrates how to run Presto queries using SQL magic.

Note that for running queries on Parquet tables, you need to work with Hive tables. The csv-to-hive tutorial includes a script that converts a CSV file into a Hive table.
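
For example, the following sketch queries a NoSQL table from a Jupyter notebook; the v3io.<container>."<table path>" naming convention follows the platform's Presto integration, and the table path is a placeholder to adapt to your own data:

%sql select * from v3io.users."<running user>/examples/stocks_tab" limit 10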

Running Spark SQL Queries

The spark-sql-analytics tutorial demonstrates how to run Spark SQL queries on data in the platform's data store.

For more information about the platform's Spark service, see Working with Spark in this tutorial.

Running SQL Queries from Nuclio Functions

In some cases, you might need to run SQL queries as part of an event-driven application. The nuclio-read-via-presto tutorial demonstrates how to run an SQL query from a serverless Nuclio function.

Running SQL Queries from MLRun Jobs

In some cases, you might need to run SQL queries as part of an MLRun job. The mlrun-read-via-presto tutorial demonstrates how to run an SQL query from an MLRun job using Presto.

Working with Parquet Files

Parquet is a columnar storage format that provides high-density, high-performance file organization.
The parquet-read-write tutorial demonstrates how to create and write data to a Parquet table in the platform and read data from the table.
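
For instance, a minimal pandas-based sketch, assuming the /User/examples directory exists and a Parquet engine such as pyarrow is installed, looks like this:

import pandas as pd

df = pd.DataFrame({"symbol": ["AAPL", "GOOG"], "price": [135.2, 101.5]})

# Write the DataFrame to a Parquet file in the running-user directory
df.to_parquet("/User/examples/stocks.parquet")

# Read the Parquet file back into a DataFrame
pd.read_parquet("/User/examples/stocks.parquet").head()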

After you ingest Parquet files into the platform, you might want to create related Hive tables and run SQL queries on these tables.
The parquet-to-hive tutorial demonstrates how you can do this using Spark DataFrames.

Accessing Platform NoSQL and TSDB Data Using the Frames Library

V3IO Frames ("Frames") is a multi-model open-source data-access library, developed by Iguazio, which provides a unified high-performance DataFrame API for working with data in the platform's data store. Frames currently supports the NoSQL (key-value) and time-series (TSDB) data models via its NoSQL (nosql|kv) and TSDB (tsdb) backends. The frames tutorial provides an introduction to Frames and demonstrates how to use it to work with NoSQL and TSDB data in the platform.

Getting Data from AWS S3 Using curl

A simple way to ingest data from the Amazon Simple Storage Service (S3) into the platform's data store is to run a curl command that sends an HTTP request to the relevant AWS S3 bucket, as demonstrated in the following code cell. For more information and examples, see the basic-data-ingestion-and-preparation tutorial.

%%sh
# Download a sample stocks CSV file into the running-user directory
CSV_PATH="/User/examples/stocks.csv"
mkdir -p "$(dirname ${CSV_PATH})"
curl -L "https://s3.wasabisys.com/iguazio/data/stocks/2018-03-26_BINS_XETR08.csv" > ${CSV_PATH}

Running Distributed Python Code with Dask

Dask is a flexible library for parallel computation in Python, which is useful for computations that don't fit into a DataFrame. Dask exposes low-level APIs that enable you to build custom systems for in-house applications. This helps parallelize Python processes and dramatically accelerates their performance. The dask-cluster tutorial demonstrates how to use Dask with platform data.
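
As a small sketch, assuming the stocks CSV file from the earlier curl example exists under /User/examples, you can process it in parallel with a Dask DataFrame:

import dask.dataframe as dd

# Lazily read the CSV file; the data is split into partitions that are processed in parallel
ddf = dd.read_csv("/User/examples/stocks.csv")

# Trigger the actual computation
print(ddf.describe().compute())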

Running DataFrames on GPUs using NVIDIA cuDF

The platform allows you to use NVIDIA's RAPIDS open-source libraries suite to execute end-to-end data science and analytics pipelines entirely on GPUs. cuDF is a RAPIDS GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data. This library features a pandas-like API that will be familiar to data engineers and data scientists, who can use it to easily accelerate their workflows without going into the details of CUDA programming. The gpu-cudf-vs-pd tutorial demonstrates how to use the cuDF library and compares performance benchmarks between pandas and cuDF.

Note: To use the cuDF library, you need to create a RAPIDS Conda environment. For more information, see the virtual-env tutorial.
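
For example, a minimal sketch that loads the same stocks CSV file into a GPU DataFrame (this requires a GPU node and the RAPIDS Conda environment mentioned in the note above) could look as follows:

import cudf

# Load a CSV file directly into GPU memory as a cuDF DataFrame
gdf = cudf.read_csv("/User/examples/stocks.csv")

# Use the familiar pandas-like API, executed on the GPU
print(gdf.describe())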

Visualizing Data with Grafana

The platform has a Grafana service with predefined dashboards that leverage the monitoring service to display monitoring data, such as performance statistics, for application services. You can also define custom Grafana dashboards for monitoring, visualizing, and understanding data stored in the platform, such as time-series metrics and NoSQL data. You can read and analyze data from the platform's data store and visualize it on Grafana dashboards in the desired formats, such as tables and graphs. This can be done by using the custom iguazio data source, or by using a Prometheus data source for running Prometheus queries on platform TSDB tables. You can also issue data alerts and create, explore, and share dashboards.

You can use Iguazio's grafwiz Python library to create and deploy Grafana dashboards programmatically, as demonstrated in the grafana-grafwiz tutorial.