HADOOP-???? - adding ARC Managed Identity support to hadoop-azure ABFS #7186

Open
wants to merge 1 commit into
base: branch-3.4.1

@sijjay sijjay commented Nov 25, 2024

Description of PR

Please note that I am a DevOps engineer and not a Java developer. Therefore, certain aspects (such as writing tests) might be incomplete. While I am not trying to delegate this responsibility entirely, I would greatly appreciate support in writing tests for the introduced changes.

Additionally, there is no JIRA ticket associated with this pull request as I have not been granted access, nor did I receive a response to the email I sent to the Hadoop community mailing list at [email protected]. I would also appreciate any assistance in creating a corresponding JIRA ticket.

In any case, I welcome and am grateful for any feedback. Thank you!

Changes Introduced in This Pull Request

  • Addition of ARC MSI Provider: Enables support for Managed Identity on Azure ARC servers (on-premises servers connected to the Azure ecosystem).
  • Dynamic Variable for API Version: Adds support for dynamically defining the api-version of the MSI endpoint for both MSI and ARC MSI.
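
For readers unfamiliar with the Arc flow, here is a minimal Python sketch of the two pieces these bullets touch: building the token URL with a configurable api-version, and parsing the challenge that the local Arc agent returns. It is not the Java code from this patch. It assumes the documented Azure Arc behaviour in which the first request (sent with the header `Metadata: true`) is answered with `401` and a `WWW-Authenticate: Basic realm=<path to key file>` header, and the request is then repeated with `Authorization: Basic <file contents>`. The function names, the default api-version `2020-06-01`, and the resource `https://storage.azure.com/` are assumptions chosen for illustration.

```python
from typing import Optional
from urllib.parse import urlencode

# Endpoint as used in the PR's example configuration.
ARC_ENDPOINT = "http://127.0.0.1:40342/metadata/identity/oauth2/token"
# Assumption: a plausible default; the PR makes the api-version configurable.
DEFAULT_API_VERSION = "2020-06-01"
# Assumption: the resource identifier commonly used for Azure Storage tokens.
STORAGE_RESOURCE = "https://storage.azure.com/"


def build_token_url(endpoint: str = ARC_ENDPOINT,
                    api_version: str = DEFAULT_API_VERSION,
                    resource: str = STORAGE_RESOURCE) -> str:
    """Return the token request URL with api-version and resource as query parameters."""
    query = urlencode({"api-version": api_version, "resource": resource})
    return f"{endpoint}?{query}"


def parse_challenge_key_path(www_authenticate: str) -> Optional[str]:
    """Extract the key file path from a 'Basic realm=<path>' challenge header.

    Returns None if the header is not a Basic challenge with a realm.
    """
    scheme, _, params = www_authenticate.partition(" ")
    if scheme.lower() != "basic":
        return None
    key, _, value = params.strip().partition("=")
    if key.strip().lower() != "realm" or not value:
        return None
    return value.strip().strip('"')
```

A caller would send a GET to `build_token_url()`, read the `WWW-Authenticate` header of the 401 response, pass it to `parse_challenge_key_path`, and retry with the contents of that file in the `Authorization` header.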

Details regarding the motivation for this pull request can be found here: StackOverflow discussion.

How was this patch tested?

A custom hadoop-azure JAR was built from this branch and loaded into PySpark on an on-premises server with the ARC libraries installed, in order to test the connection to ABFS. Example code:

from pyspark.sql import SparkSession

# Spark add-ons
path_to_hadoop_azure_jar = "/tmp/hadoop-azure-3.4.1-custom.jar"
path_to_hadoop_common_jar = "/opt/hadoop-common-3.4.1.jar"
path_to_azure_storage_jar = "/opt/azure-storage-8.6.6.jar"
path_to_azure_datalake_jar = "/opt/hadoop-azure-datalake-3.4.1.jar"

# ABFS variables
account_name = "***"
container_name = "***"
container_path = "***"
abfs_path = f"abfss://{container_name}@{account_name}.dfs.core.windows.net/{container_path}"

# Set up the Spark session
spark = SparkSession.builder.appName("AzureDataRead") \
    .config("spark.jars", f"{path_to_hadoop_common_jar},{path_to_hadoop_azure_jar},{path_to_azure_storage_jar},{path_to_azure_datalake_jar}") \
    .getOrCreate()
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")

spark.conf.set(f"fs.azure.account.auth.type.{account_name}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ArcMsiTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.msi.endpoint.{account_name}.dfs.core.windows.net", "http://127.0.0.1:40342/metadata/identity/oauth2/token")

# Logging
spark.sparkContext.setLogLevel("DEBUG")

# Create a simple DataFrame
data = [("Alice", 25), ("Bob", 30), ("Cathy", 28)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Write the DataFrame to ABFS as Parquet
try:
    df.write.parquet(abfs_path)
    print(f"Parquet file successfully written to {abfs_path}")
except Exception as e:
    print(f"Error writing Parquet file: {e}")

# Register the Parquet file as a table
table_name = "my_second_table"
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {table_name}
    USING PARQUET
    LOCATION '{abfs_path}'
""")
print(f"Table '{table_name}' created successfully!")

# Query the table
spark.sql(f"SELECT * FROM {table_name}").show()

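# List files in the parent directory of abfs_path
try:
    parent_dir = abfs_path.rsplit("/", 1)[0]
    print(f"Listing files in directory: {parent_dir}")
    # Resolve the file system from the path; FileSystem.get(conf) would return the default FS
    hadoop_path = spark._jvm.org.apache.hadoop.fs.Path(parent_dir)
    fs = hadoop_path.getFileSystem(spark._jsc.hadoopConfiguration())
    files = fs.listStatus(hadoop_path)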
    for file in files:
        print(file.getPath().getName())
except Exception as e:
    print(f"Error listing files: {e}")
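
The three per-account settings used in the script above can be collected in a small helper, sketched below. The helper name `arc_msi_conf` and its `prefix` parameter are hypothetical and only for illustration; the keys and values are copied from the example. The `prefix` option relies on Spark's convention of copying `spark.hadoop.*` properties into the Hadoop configuration, which is one way to supply the settings when the session is built rather than afterwards.

```python
from typing import Dict

# Values copied from the example script in this PR description.
ARC_PROVIDER = "org.apache.hadoop.fs.azurebfs.oauth2.ArcMsiTokenProvider"
ARC_ENDPOINT = "http://127.0.0.1:40342/metadata/identity/oauth2/token"


def arc_msi_conf(account_name: str,
                 endpoint: str = ARC_ENDPOINT,
                 prefix: str = "") -> Dict[str, str]:
    """Build the per-account ABFS OAuth settings for the ARC MSI provider.

    prefix is a hypothetical convenience: pass "spark.hadoop." to get keys
    suitable for SparkSession.builder.config(key, value).
    """
    host = f"{account_name}.dfs.core.windows.net"
    return {
        f"{prefix}fs.azure.account.auth.type.{host}": "OAuth",
        f"{prefix}fs.azure.account.oauth.provider.type.{host}": ARC_PROVIDER,
        f"{prefix}fs.azure.account.oauth2.msi.endpoint.{host}": endpoint,
    }
```

For example, `for k, v in arc_msi_conf(account_name).items(): spark.conf.set(k, v)` reproduces the three `spark.conf.set` calls above.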

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@sijjay sijjay changed the title [HADOOP-????] adding ARC Managed Identity support to hadoop-azure ABFS HADOOP-???? - adding ARC Managed Identity support to hadoop-azure ABFS Nov 25, 2024
@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:--------|
| +0 🆗 | reexec | 13m 4s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 1s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| | | | | _ branch-3.4.1 Compile Tests _ |
| +1 💚 | mvninstall | 49m 33s | | branch-3.4.1 passed |
| +1 💚 | compile | 0m 36s | | branch-3.4.1 passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 |
| +1 💚 | compile | 0m 34s | | branch-3.4.1 passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga |
| +1 💚 | checkstyle | 0m 31s | | branch-3.4.1 passed |
| +1 💚 | mvnsite | 0m 39s | | branch-3.4.1 passed |
| +1 💚 | javadoc | 0m 37s | | branch-3.4.1 passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 |
| +1 💚 | javadoc | 0m 34s | | branch-3.4.1 passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga |
| +1 💚 | spotbugs | 1m 10s | | branch-3.4.1 passed |
| +1 💚 | shadedclient | 37m 27s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 0m 30s | | the patch passed |
| +1 💚 | compile | 0m 33s | | the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 |
| +1 💚 | javac | 0m 33s | | the patch passed |
| +1 💚 | compile | 0m 28s | | the patch passed with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga |
| +1 💚 | javac | 0m 28s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 ⚠️ | checkstyle | 0m 19s | /results-checkstyle-hadoop-tools_hadoop-azure.txt | hadoop-tools/hadoop-azure: The patch generated 2 new + 2 unchanged - 0 fixed = 4 total (was 2) |
| +1 💚 | mvnsite | 0m 30s | | the patch passed |
| -1 ❌ | javadoc | 0m 25s | /results-javadoc-javadoc-hadoop-tools_hadoop-azure-jdkUbuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04.txt | hadoop-tools_hadoop-azure-jdkUbuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 generated 3 new + 15 unchanged - 0 fixed = 18 total (was 15) |
| -1 ❌ | javadoc | 0m 25s | /results-javadoc-javadoc-hadoop-tools_hadoop-azure-jdkPrivateBuild-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga.txt | hadoop-tools_hadoop-azure-jdkPrivateBuild-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga with JDK Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga generated 3 new + 15 unchanged - 0 fixed = 18 total (was 15) |
| -1 ❌ | spotbugs | 1m 14s | /new-spotbugs-hadoop-tools_hadoop-azure.html | hadoop-tools/hadoop-azure generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) |
| +1 💚 | shadedclient | 39m 7s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | unit | 2m 27s | | hadoop-azure in the patch passed. |
| +1 💚 | asflicense | 0m 37s | | The patch does not generate ASF License warnings. |
| | | 152m 48s | | |
| Reason | Tests |
|-------:|:------|
| SpotBugs | module:hadoop-tools/hadoop-azure |
| | Found reliance on default encoding in org.apache.hadoop.fs.azurebfs.oauth2.AzureADAuthenticator.getTokenSingleCall(String, String, Hashtable, String, boolean, boolean): new java.io.FileReader(String) At AzureADAuthenticator.java:[line 492] |
| | Immediate dereference of the result of readLine() in org.apache.hadoop.fs.azurebfs.oauth2.AzureADAuthenticator.getTokenSingleCall(String, String, Hashtable, String, boolean, boolean) At AzureADAuthenticator.java:[line 493] |
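
The two SpotBugs findings describe a common pair of file-reading pitfalls: opening a text file without naming a charset, and using the first line without checking that one was returned. The patched Java code is not shown in this conversation, so the following is only a Python analogue of the safer pattern, not a fix for `AzureADAuthenticator`. The helper name `read_first_line` and the choice of UTF-8 are assumptions made for illustration.

```python
def read_first_line(path: str, encoding: str = "utf-8") -> str:
    """Read the first line of a text file with an explicit encoding.

    Raises ValueError if the file is empty, instead of silently passing an
    empty value on to the caller.
    """
    # An explicit encoding avoids depending on the platform default.
    with open(path, "r", encoding=encoding) as handle:
        line = handle.readline()
    # readline() returns "" at end of file, the analogue of Java's null.
    if not line:
        raise ValueError(f"{path} is empty")
    return line.rstrip("\r\n")
```

In Java the corresponding changes would be to construct the reader with an explicit charset and to null-check the result of `readLine()` before using it.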
| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7186/1/artifact/out/Dockerfile |
| GITHUB PR | #7186 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 5cea14605df9 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | branch-3.4.1 / 3a5f541 |
| Default Java | Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_432-8u432-gaus1-0ubuntu220.04-ga |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7186/1/testReport/ |
| Max. process+thread count | 552 (vs. ulimit of 5500) |
| modules | C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7186/1/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.
