From 8fa8af0f3a6a5284545cfbf42349b9ddd2a984d9 Mon Sep 17 00:00:00 2001 From: Matthew Powers Date: Tue, 22 Nov 2022 11:12:11 -0500 Subject: [PATCH] Update Delta Storage page --- src/pages/latest/delta-storage-oss.mdx | 112 +++++++++++++++---------- 1 file changed, 68 insertions(+), 44 deletions(-) diff --git a/src/pages/latest/delta-storage-oss.mdx b/src/pages/latest/delta-storage-oss.mdx index e067d93..47c255b 100644 --- a/src/pages/latest/delta-storage-oss.mdx +++ b/src/pages/latest/delta-storage-oss.mdx @@ -73,12 +73,19 @@ read and write from different storage systems. +**In this article** +* Amazon S3 +* Microsoft Azure storage +* HDFS +* Google Cloud Storage +* Oracle Cloud Infrastructure +* IBM Cloud Object Storage + ## Amazon S3 -Delta Lake supports reads and writes to S3 in two different modes: -Single-cluster and Multi-cluster. +Delta Lake supports reads and writes to S3 in two different modes: Single-cluster and Multi-cluster. | | Single-cluster | Multi-cluster | | ------------- | ------------------------------------------------------- | ------------------------------------------------ | @@ -97,16 +104,21 @@ driver in order for Delta Lake to provide transactional guarantees. This is because S3 currently does not provide mutual exclusion, that is, there is no way to ensure that only one writer is able to create a file. - + Concurrent writes to the same Delta table from multiple Spark drivers can lead to data loss. +**In this section**: +* Requirements (S3 single-cluster) +* Quickstart (S3 single-cluster) +* Configuration (S3 single-cluster) + #### Requirements (S3 single-cluster) - S3 credentials: [IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) - (recommended) or access keys + (recommended) or access keys. - Apache Spark associated with the corresponding Delta Lake version. - Hadoop's [AWS connector (hadoop-aws)](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/) @@ -116,23 +128,23 @@ to ensure that only one writer is able to create a file. This section explains how to quickly start reading and writing Delta tables on S3 using single-cluster mode. For a detailed explanation of the configuration, -see [\_](#setup-configuration-s3-multi-cluster). +see [Setup Configuration (S3 multi-cluster)](#setup-configuration-s3-multi-cluster). -#. Use the following command to launch a Spark shell with Delta Lake and S3 +1. Use the following command to launch a Spark shell with Delta Lake and S3 support (assuming you use Spark 3.2.1 which is pre-built for Hadoop 3.3.1): ```bash bin/spark-shell \ - --packages io.delta:delta-core_2.12:$VERSION$,org.apache.hadoop:hadoop-aws:3.3.1 \ + --packages io.delta:delta-core_2.12:2.1.1,org.apache.hadoop:hadoop-aws:3.3.1 \ --conf spark.hadoop.fs.s3a.access.key= \ --conf spark.hadoop.fs.s3a.secret.key= ``` -#. Try out some basic Delta table operations on S3 (in Scala): +2. Try out some basic Delta table operations on S3 (in Scala): @@ -147,20 +159,20 @@ spark.read.format("delta").load("s3a:///"). For other languages and more examples of Delta table operations, see the -[\_](quick-start.md) page. +[Quickstart](quick-start.md) page. #### Configuration (S3 single-cluster) Here are the steps to configure Delta Lake for S3. -#. Include `hadoop-aws` JAR in the classpath. +1. Include `hadoop-aws` JAR in the classpath. Delta Lake needs the `org.apache.hadoop.fs.s3a.S3AFileSystem` class from the `hadoop-aws` package, which implements Hadoop's `FileSystem` API for S3. 
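As an illustrative sketch (not part of this page or this patch): the same dependency pinning can be done from application code instead of `spark-shell` flags. The coordinates below mirror the quickstart above (Delta Lake 2.1.1 and `hadoop-aws` 3.3.1, matching Spark 3.2.1 pre-built for Hadoop 3.3.1); the app name and the environment-variable credential lookup are placeholders, not values from this page.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: resolve delta-core and hadoop-aws when the session starts, instead of
// passing --packages on the spark-shell command line. Versions mirror the
// quickstart above; keep hadoop-aws aligned with the Hadoop version Spark was built with.
val spark = SparkSession.builder()
  .appName("delta-on-s3") // placeholder app name
  .config("spark.jars.packages",
    "io.delta:delta-core_2.12:2.1.1,org.apache.hadoop:hadoop-aws:3.3.1")
  // Credentials are covered in the next step; environment variables are just one option.
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()
```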
Make sure the version of this package matches the Hadoop version with which Spark was built. -#. Set up S3 credentials. +2. Set up S3 credentials. We recommend using [IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) for @@ -192,16 +204,23 @@ implementation. This implementation uses [DynamoDB](https://aws.amazon.com/dynamodb/) to provide the mutual exclusion that S3 is lacking. - + This multi-cluster writing solution is only safe when all writers use this `LogStore` implementation as well as the same DynamoDB table and region. If some drivers use out-of-the-box Delta Lake while others use this experimental `LogStore`, then data loss can occur. +In this section: + +* Requirements (S3 multi-cluster) +* Quickstart (S3 multi-cluster) +* Setup Configuration (S3 multi-cluster) +* Production Configuration (S3 multi-cluster) + #### Requirements (S3 multi-cluster) -- All of the requirements listed in [\_](#requirements-s3-single-cluster) +- All of the requirements listed in [Requirements (S3 single-cluster)](#requirements-s3-single-cluster) section - In additon to S3 credentials, you also need DynamoDB operating permissions @@ -210,7 +229,7 @@ that S3 is lacking. This section explains how to quickly start reading and writing Delta tables on S3 using multi-cluster mode. -#. Use the following command to launch a Spark shell with Delta Lake and S3 +1. Use the following command to launch a Spark shell with Delta Lake and S3 support (assuming you use Spark 3.2.1 which is pre-built for Hadoop 3.3.1): @@ -226,7 +245,7 @@ bin/spark-shell \ -#. Try out some basic Delta table operations on S3 (in Scala): +2. Try out some basic Delta table operations on S3 (in Scala): @@ -242,7 +261,7 @@ spark.read.format("delta").load("s3a:///"). #### Setup Configuration (S3 multi-cluster) -#. Create the DynamoDB table. +1. Create the DynamoDB table. You have the choice of creating the DynamoDB table yourself (recommended) or having it created for you automatically. @@ -287,15 +306,14 @@ aws dynamodb create-table \ writes per second. You may change these default values using the table-creation-only configurations keys detailed in the table below. -#. Follow the configuration steps listed in -[\_](#configuration-s3-single-cluster) section. +2. Follow the configuration steps listed in +[Configuration (S3 single-cluster)](#configuration-s3-single-cluster) section. -#. Include the `delta-storage-s3-dynamodb` JAR in the classpath. +3. Include the `delta-storage-s3-dynamodb` JAR in the classpath. -#. Configure the `LogStore` implementation in your Spark session. +4. Configure the `LogStore` implementation in your Spark session. -First, configure this `LogStore` implementation for the scheme `s3`. You can -replicate this command for schemes `s3a` and `s3n` as well. +First, configure this `LogStore` implementation for the scheme `s3`. You can replicate this command for schemes `s3a` and `s3n` as well. @@ -329,7 +347,7 @@ By this point, this multi-cluster setup is fully operational. However, there is extra configuration you may do to improve performance and optimize storage when running in production. -#. Adjust your Read and Write Capacity Mode. +1. Adjust your Read and Write Capacity Mode. If you are using the default DynamoDB table created for you by this `LogStore` implementation, its default RCU and WCU might not be enough for your workloads. @@ -355,7 +373,9 @@ Run the following command on your given DynamoDB table to enable TTL: AttributeName=commitTime" ``` -#. 
Cleanup old AWS S3 temp files using S3 Lifecycle Expiration. +The default `expireTime` will be one day after the DynamoDB entry was marked as completed. + +3. Cleanup old AWS S3 temp files using S3 Lifecycle Expiration. In this `LogStore` implementation, a temp file is created containing a copy of the metadata to be committed into the Delta log. Once that commit to the Delta @@ -365,7 +385,7 @@ ever be used during recovery of a failed commit. Here are two simple options for deleting these temp files. -#. Delete manually using S3 CLI. +1. Delete manually using S3 CLI. This is the safest option. The following command will delete all but the latest temp file in your given `` and ``: @@ -376,7 +396,7 @@ aws s3 ls s3:////_delta_log/.tmp/ --recursive | awk 'N done ``` -#. Delete using an S3 Lifecycle Expiration Rule +2. Delete using an S3 Lifecycle Expiration Rule A more automated option is to use an [S3 Lifecycle Expiration rule](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html), with filter prefix pointing to the `/_delta_log/.tmp/` folder located in your table path, and an expiration value of 30 days. @@ -396,10 +416,11 @@ See [S3 Lifecycle Configuration](https://docs.aws.amazon.com/cli/latest/reference/s3api/put-bucket-lifecycle-configuration.html) for details. An example rule and command invocation is given below: +In a file referenced as `file://lifecycle.json`: + ```json -// file://lifecycle.json { "Rules": [ { @@ -442,11 +463,15 @@ versions ([Hadoop-15156](https://issues.apache.org/jira/browse/HADOOP-15156) and reason, you may need to build Spark with newer Hadoop versions and use them for deploying your application. See [Specifying the Hadoop Version and Enabling YARN](https://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version-and-enabling-yarn) -for building Spark with a specific Hadoop version and [\_](quick-start.md) for +for building Spark with a specific Hadoop version and [Quickstart](quick-start.md) for setting up Spark with Delta Lake. Here is a list of requirements specific to each type of Azure storage system: +* Azure Blob storage +* Azure Data Lake Storage Gen1 +* Azure Data Lake Storage Gen2 + ### Azure Blob storage #### Requirements (Azure Blob storage) @@ -473,10 +498,10 @@ along with Apache Spark 3.0 compiled and deployed with Hadoop 3.2. Here are the steps to configure Delta Lake on Azure Blob storage. -#. Include `hadoop-azure` JAR in the classpath. See the requirements above for +1. Include `hadoop-azure` JAR in the classpath. See the requirements above for version details. -#. Set up credentials. +2. Set up credentials. You can set up your credentials in the [Spark configuration property](https://spark.apache.org/docs/latest/configuration.html). @@ -600,12 +625,12 @@ along with Apache Spark 3.0 compiled and deployed with Hadoop 3.2. Here are the steps to configure Delta Lake on Azure Data Lake Storage Gen1. -#. Include the JAR of the Maven artifact `hadoop-azure-datalake` in the +1. Include the JAR of the Maven artifact `hadoop-azure-datalake` in the classpath. See the [requirements](#azure-blob-storage) for version details. In addition, you may also have to include JARs for Maven artifacts `hadoop-azure` and `wildfly-openssl`. -#. Set up Azure Data Lake Storage Gen2 credentials. +2. Set up Azure Data Lake Storage Gen2 credentials. @@ -623,7 +648,7 @@ where ``, ``, `` and `` are details of the service principal we set as requirements earlier. -#. Initialize the file system if needed +3. 
Initialize the file system if needed @@ -673,7 +698,7 @@ reading and writing from GCS. ### Configuration (GCS) -#. For Delta Lake 1.2.0 and below, you must explicitly configure the LogStore +1. For Delta Lake 1.2.0 and below, you must explicitly configure the LogStore implementation for the scheme `gs`. @@ -684,7 +709,7 @@ spark.delta.logStore.gs.impl=io.delta.storage.GCSLogStore -#. Include the JAR for `gcs-connector` in the classpath. See the +2. Include the JAR for `gcs-connector` in the classpath. See the [documentation](https://cloud.google.com/dataproc/docs/tutorials/gcs-connector-spark-tutorial) for details on how to configure Spark with GCS. @@ -706,8 +731,7 @@ spark.read.format("delta").load("gs:///").show This support is new and experimental. -You have to configure Delta Lake to use the correct `LogStore` for -concurrently reading and writing. +You have to configure Delta Lake to use the correct `LogStore` for concurrently reading and writing. ### Requirements (OCI) @@ -722,7 +746,7 @@ concurrently reading and writing. ### Configuration (OCI) -#. Configure LogStore implementation for the scheme `oci`. +1. Configure LogStore implementation for the scheme `oci`. @@ -732,12 +756,12 @@ spark.delta.logStore.oci.impl=io.delta.storage.OracleCloudLogStore -#. Include the JARs for `delta-contribs` and `hadoop-oci-connector` in the +2. Include the JARs for `delta-contribs` and `hadoop-oci-connector` in the classpath. See [Using the HDFS Connector with Spark](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/hdfsconnectorspark.htm) for details on how to configure Spark with OCI. -#. Set the OCI Object Store credentials as explained in the +3. Set the OCI Object Store credentials as explained in the [documentation](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/hdfsconnector.htm). ### Usage (OCI) @@ -775,7 +799,7 @@ concurrently reading and writing. ### Configuration (IBM) -#. Configure LogStore implementation for the scheme `cos`. +1. Configure LogStore implementation for the scheme `cos`. @@ -785,9 +809,9 @@ spark.delta.logStore.cos.impl=io.delta.storage.IBMCOSLogStore -#. Include the JARs for `delta-contribs` and `Stocator` in the classpath. +2. Include the JARs for `delta-contribs` and `Stocator` in the classpath. -#. Configure `Stocator` with atomic write support by setting the following +3. Configure `Stocator` with atomic write support by setting the following properties in the Hadoop configuration. @@ -800,7 +824,7 @@ fs.stocator.cos.scheme=cos fs.cos.atomic.write=true -#. Set up IBM COS credentials. The example below uses access keys with a service +4. Set up IBM COS credentials. The example below uses access keys with a service named `service` (in Scala):
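The credentials snippet itself is not reproduced in this excerpt, so the following is only a hedged sketch of what such a setup typically looks like, assuming Stocator's `fs.cos.<service>.*` Hadoop property convention and the service alias `service` mentioned above; the endpoint and key values are placeholders, and the exact property names should be confirmed against the Stocator documentation.

```scala
// Hedged sketch (assumes a spark-shell session where `spark` is predefined).
// Stocator reads COS credentials from Hadoop configuration properties keyed by
// the service alias -- here "service", matching the text above.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.cos.service.endpoint", "<cos-endpoint-url>")
hadoopConf.set("fs.cos.service.access.key", "<access-key>")
hadoopConf.set("fs.cos.service.secret.key", "<secret-key>")

// Afterwards, Delta tables can be addressed with the cos scheme, for example:
// spark.range(5).write.format("delta").save("cos://<bucket>.service/<path-to-delta-table>")
```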