Add spark-3.4.3 and 3.5.1 support (#582)
* Add spark-3.4.3 support and migrate to scala 2.13

- updates of pom files
- fixes in tests
- migration to 2.13
- datasources-34

 On branch spark-34-35
 Changes to be committed:
	modified:   .github/workflows/spark.yaml
	modified:   .gitignore
	modified:   maven-projects/spark/datasources-32/pom.xml
	modified:   maven-projects/spark/datasources-33/pom.xml
	new file:   maven-projects/spark/datasources-34/.scalafmt.conf
	new file:   maven-projects/spark/datasources-34/pom.xml
	new file:   maven-projects/spark/datasources-34/src/main/java/org/apache/graphar/GeneralParams.java
	new file:   maven-projects/spark/datasources-34/src/main/scala/org/apache/graphar/datasources/GarDataSource.scala
	new file:   maven-projects/spark/datasources-34/src/main/scala/org/apache/spark/sql/graphar/GarCommitProtocol.scala
	new file:   maven-projects/spark/datasources-34/src/main/scala/org/apache/spark/sql/graphar/GarScan.scala
	new file:   maven-projects/spark/datasources-34/src/main/scala/org/apache/spark/sql/graphar/GarScanBuilder.scala
	new file:   maven-projects/spark/datasources-34/src/main/scala/org/apache/spark/sql/graphar/GarTable.scala
	new file:   maven-projects/spark/datasources-34/src/main/scala/org/apache/spark/sql/graphar/GarWriteBuilder.scala
	new file:   maven-projects/spark/datasources-34/src/main/scala/org/apache/spark/sql/graphar/csv/CSVWriteBuilder.scala
	new file:   maven-projects/spark/datasources-34/src/main/scala/org/apache/spark/sql/graphar/json/JSONWriteBuilder.scala
	new file:   maven-projects/spark/datasources-34/src/main/scala/org/apache/spark/sql/graphar/orc/OrcOutputWriter.scala
	new file:   maven-projects/spark/datasources-34/src/main/scala/org/apache/spark/sql/graphar/orc/OrcWriteBuilder.scala
	new file:   maven-projects/spark/datasources-34/src/main/scala/org/apache/spark/sql/graphar/parquet/ParquetWriteBuilder.scala
	modified:   maven-projects/spark/graphar/pom.xml
	modified:   maven-projects/spark/graphar/src/main/scala/org/apache/graphar/graph/GraphReader.scala
	modified:   maven-projects/spark/graphar/src/main/scala/org/apache/graphar/writer/EdgeWriter.scala
	modified:   maven-projects/spark/graphar/src/main/scala/org/apache/graphar/writer/VertexWriter.scala
	modified:   maven-projects/spark/graphar/src/test/scala/org/apache/graphar/TestReader.scala
	modified:   maven-projects/spark/pom.xml

* Apply spotless and fix licenserc

 On branch spark-34-35
 Changes to be committed:
	modified:   licenserc.toml
	modified:   maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/GarScan.scala
	modified:   maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/GarScanBuilder.scala
	modified:   maven-projects/spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar/GarTable.scala
	modified:   maven-projects/spark/datasources-34/src/main/scala/org/apache/spark/sql/graphar/GarScan.scala
	modified:   maven-projects/spark/datasources-34/src/main/scala/org/apache/spark/sql/graphar/GarScanBuilder.scala
	modified:   maven-projects/spark/datasources-34/src/main/scala/org/apache/spark/sql/graphar/GarTable.scala
	modified:   maven-projects/spark/graphar/src/main/scala/org/apache/graphar/graph/GraphReader.scala
	modified:   maven-projects/spark/graphar/src/main/scala/org/apache/graphar/writer/EdgeWriter.scala

* Fix a weak place in GAR

 On branch spark-34-35
 Changes to be committed:
	modified:   maven-projects/spark/graphar/src/main/scala/org/apache/graphar/graph/GraphWriter.scala

* Fix scala versions mismatch

 On branch spark-34-35
 Changes to be committed:
	modified:   .github/workflows/spark.yaml

* Revert migration to 2.13

+ slightly update poms structure

 On branch spark-34-35
 Changes to be committed:
	modified:   .github/workflows/spark.yaml
	modified:   maven-projects/spark/datasources-32/pom.xml
	modified:   maven-projects/spark/datasources-33/pom.xml
	modified:   maven-projects/spark/datasources-34/pom.xml
	modified:   maven-projects/spark/graphar/pom.xml
	modified:   maven-projects/spark/graphar/src/main/scala/org/apache/graphar/writer/EdgeWriter.scala
	modified:   maven-projects/spark/pom.xml

* Update PySpark build

- skip nebula example because nebula does not support spark 3.4.3

 On branch spark-34-35
 Changes to be committed:
	modified:   .github/workflows/spark.yaml
	modified:   pyspark/Makefile
	modified:   pyspark/poetry.lock
	modified:   pyspark/pyproject.toml

* Try to fix if in CI

 On branch spark-34-35
 Changes to be committed:
	modified:   .github/workflows/spark.yaml

* Try to fix CI

* Fix PySpark CI

* Fix hadoop download link for spark 3.2

* Datasources-35

- Jackson databind
- PySpark to 3.5

 On branch spark-34-35
 Changes to be committed:
	new file:   maven-projects/spark/datasources-35/.scalafmt.conf
	new file:   maven-projects/spark/datasources-35/pom.xml
	new file:   maven-projects/spark/datasources-35/src/main/java/org/apache/graphar/GeneralParams.java
	new file:   maven-projects/spark/datasources-35/src/main/scala/org/apache/graphar/datasources/GarDataSource.scala
	new file:   maven-projects/spark/datasources-35/src/main/scala/org/apache/spark/sql/graphar/GarCommitProtocol.scala
	new file:   maven-projects/spark/datasources-35/src/main/scala/org/apache/spark/sql/graphar/GarScan.scala
	new file:   maven-projects/spark/datasources-35/src/main/scala/org/apache/spark/sql/graphar/GarScanBuilder.scala
	new file:   maven-projects/spark/datasources-35/src/main/scala/org/apache/spark/sql/graphar/GarTable.scala
	new file:   maven-projects/spark/datasources-35/src/main/scala/org/apache/spark/sql/graphar/GarWriteBuilder.scala
	new file:   maven-projects/spark/datasources-35/src/main/scala/org/apache/spark/sql/graphar/csv/CSVWriteBuilder.scala
	new file:   maven-projects/spark/datasources-35/src/main/scala/org/apache/spark/sql/graphar/json/JSONWriteBuilder.scala
	new file:   maven-projects/spark/datasources-35/src/main/scala/org/apache/spark/sql/graphar/orc/OrcOutputWriter.scala
	new file:   maven-projects/spark/datasources-35/src/main/scala/org/apache/spark/sql/graphar/orc/OrcWriteBuilder.scala
	new file:   maven-projects/spark/datasources-35/src/main/scala/org/apache/spark/sql/graphar/parquet/ParquetWriteBuilder.scala
	modified:   maven-projects/spark/pom.xml
	modified:   pyspark/Makefile
	modified:   pyspark/pyproject.toml

* Fix licenserc && update python deps

 On branch spark-34-35
 Changes to be committed:
	modified:   licenserc.toml
	modified:   pyspark/poetry.lock

* Update Spark CI

 On branch spark-34-35
 Changes to be committed:
	modified:   .github/workflows/spark.yaml
SemyonSinchenko authored Aug 12, 2024
1 parent 083f0d5 commit d460f3d
Showing 47 changed files with 3,789 additions and 235 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/pyspark.yml
@@ -50,7 +50,7 @@ jobs:
 - name: Install Python
 uses: actions/setup-python@v4
 with:
-python-version: 3.9
+python-version: '3.10'

 - name: Install Poetry
 working-directory: pyspark
13 changes: 11 additions & 2 deletions .github/workflows/spark.yaml
@@ -47,11 +47,17 @@ jobs:
 matrix:
 include:
 - mvn-profile: "datasources-32"
-spark: "spark-3.2.2"
-spark-hadoop: "spark-3.2.2-bin-hadoop3.2"
+spark: "spark-3.2.4"
+spark-hadoop: "spark-3.2.4-bin-hadoop3.2"
 - mvn-profile: "datasources-33"
 spark: "spark-3.3.4"
 spark-hadoop: "spark-3.3.4-bin-hadoop3"
+- mvn-profile: "datasources-34"
+spark: "spark-3.4.3"
+spark-hadoop: "spark-3.4.3-bin-hadoop3"
+- mvn-profile: "datasources-35"
+spark: "spark-3.5.1"
+spark-hadoop: "spark-3.5.1-bin-hadoop3"

 steps:
 - uses: actions/checkout@v4
@@ -117,7 +123,10 @@ jobs:
 echo "match (a) -[r] -> () delete a, r;match (a) delete a;" | cypher-shell -u ${NEO4J_USR} -p ${NEO4J_PWD} -d neo4j --format plain
 scripts/run-graphar2neo4j.sh
+# Apache Spark version 3.4.3 is not supported by the current NebulaGraph Spark Connector.
 - name: Run Nebula2GraphAr example
+# https://github.com/orgs/community/discussions/37883#discussioncomment-4021318
+if: ${{ matrix.spark < 'spark-3.4.3' }}
 working-directory: maven-projects/spark
 run: |
 export JAVA_HOME=${JAVA_HOME_11_X64}
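
Review note on the `if:` guard above: it compares `matrix.spark` with `'spark-3.4.3'` as strings. Assuming the comparison is character by character (an assumption about the expression evaluator, not something this commit states), it gives the intended result for the four matrix entries in this workflow but would misorder a hypothetical two-digit minor version such as `spark-3.10.0`. A minimal Python sketch of the difference follows; `parse_spark_version` and `runs_nebula_example` are hypothetical helpers used only for illustration and are not part of the repository.

```python
import re


def parse_spark_version(tag: str) -> tuple[int, ...]:
    """Turn a tag such as 'spark-3.4.3' into (3, 4, 3). Hypothetical helper."""
    match = re.fullmatch(r"spark-(\d+(?:\.\d+)*)", tag)
    if match is None:
        raise ValueError(f"unexpected Spark tag: {tag!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def runs_nebula_example(tag: str) -> bool:
    """Mirror the intent of the CI guard: run only for Spark below 3.4.3."""
    return parse_spark_version(tag) < (3, 4, 3)


# The four Spark tags listed in the workflow matrix.
matrix = ["spark-3.2.4", "spark-3.3.4", "spark-3.4.3", "spark-3.5.1"]
for tag in matrix:
    # For these entries the plain string comparison agrees with the numeric one.
    assert (tag < "spark-3.4.3") == runs_nebula_example(tag)

# A two-digit minor version is where the two approaches diverge.
print("spark-3.10.0" < "spark-3.4.3")       # string comparison: True
print(runs_nebula_example("spark-3.10.0"))  # numeric comparison: False
```

If the matrix ever gains such a version, an explicit boolean flag per matrix entry would avoid relying on string ordering.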
5 changes: 5 additions & 0 deletions .gitignore
@@ -4,6 +4,11 @@
 .DS_store
 .cache
 .ccls-cache
+.dir-locals.el
+.classpath
+.project
+.settings
+.factorypath

 compile_commands.json

4 changes: 4 additions & 0 deletions licenserc.toml
@@ -48,6 +48,10 @@ excludes = [
 "spark/datasources-32/src/main/scala/org/apache/spark/sql/graphar",
 "spark/datasources-33/src/main/scala/org/apache/graphar/datasources",
 "spark/datasources-33/src/main/scala/org/apache/spark/sql/graphar",
+"spark/datasources-34/src/main/scala/org/apache/graphar/datasources",
+"spark/datasources-34/src/main/scala/org/apache/spark/sql/graphar",
+"spark/datasources-35/src/main/scala/org/apache/graphar/datasources",
+"spark/datasources-35/src/main/scala/org/apache/spark/sql/graphar",
 "java/src/main/java/org/apache/graphar/stdcxx/StdString.java",
 "java/src/main/java/org/apache/graphar/stdcxx/StdVector.java",
 "java/src/main/java/org/apache/graphar/stdcxx/StdSharedPtr.java",
6 changes: 3 additions & 3 deletions maven-projects/spark/datasources-32/pom.xml
@@ -77,7 +77,7 @@
 <configuration>
 <scalaVersion>${scala.version}</scalaVersion>
 <args>
-<arg>-target:jvm-1.8</arg>
+<arg>-target:jvm-${maven.compiler.target}</arg>
 </args>
 <jvmArgs>
 <jvmArg>-Xss4096K</jvmArg>
@@ -128,8 +128,8 @@
 <compilerPlugins>
 <compilerPlugin>
 <groupId>org.scalameta</groupId>
-<artifactId>semanticdb-scalac_2.12.10</artifactId>
-<version>4.3.24</version>
+<artifactId>semanticdb-scalac_${scala.version}</artifactId>
+<version>${semanticdb-scalac.version}</version>
 </compilerPlugin>
 </compilerPlugins>
 </configuration>
6 changes: 3 additions & 3 deletions maven-projects/spark/datasources-33/pom.xml
@@ -77,7 +77,7 @@
 <configuration>
 <scalaVersion>${scala.version}</scalaVersion>
 <args>
-<arg>-target:jvm-1.8</arg>
+<arg>-target:jvm-${maven.compiler.target}</arg>
 </args>
 <jvmArgs>
 <jvmArg>-Xss4096K</jvmArg>
@@ -128,8 +128,8 @@
 <compilerPlugins>
 <compilerPlugin>
 <groupId>org.scalameta</groupId>
-<artifactId>semanticdb-scalac_2.12.10</artifactId>
-<version>4.3.24</version>
+<artifactId>semanticdb-scalac_${scala.version}</artifactId>
+<version>${semanticdb-scalac.version}</version>
 </compilerPlugin>
 </compilerPlugins>
 </configuration>
@@ -212,7 +212,8 @@ case class GarScan(
 val parsedOptions = new JSONOptionsInRead(
 CaseInsensitiveMap(options.asScala.toMap),
 sparkSession.sessionState.conf.sessionLocalTimeZone,
-sparkSession.sessionState.conf.columnNameOfCorruptRecord)
+sparkSession.sessionState.conf.columnNameOfCorruptRecord
+)

 // Check a field requirement for corrupt records here to throw an exception in a driver side
 ExprUtils.verifyColumnNameOfCorruptRecord(
@@ -52,7 +52,7 @@ case class GarScanBuilder(
 this.filters = dataFilters
 formatName match {
 case "csv" => Array.empty[Filter]
-case "json" => Array.empty[Filter]
+case "json" => Array.empty[Filter]
 case "orc" => pushedOrcFilters
 case "parquet" => pushedParquetFilters
 case _ =>
@@ -84,9 +84,9 @@ case class GarScanBuilder(
 // Check if the file format supports nested schema pruning.
 override protected val supportsNestedSchemaPruning: Boolean =
 formatName match {
-case "csv" => false
+case "csv" => false
 case "json" => false
-case "orc" => sparkSession.sessionState.conf.nestedSchemaPruningEnabled
+case "orc" => sparkSession.sessionState.conf.nestedSchemaPruningEnabled
 case "parquet" =>
 sparkSession.sessionState.conf.nestedSchemaPruningEnabled
 case _ =>
@@ -86,20 +86,20 @@ case class GarTable(
 case "parquet" =>
 ParquetUtils.inferSchema(sparkSession, options.asScala.toMap, files)
 case "json" => {
-val parsedOptions = new JSONOptions(
-options.asScala.toMap,
-sparkSession.sessionState.conf.sessionLocalTimeZone
-)
-
-JsonDataSource(parsedOptions).inferSchema(
-sparkSession,
-files,
-parsedOptions
-)
+val parsedOptions = new JSONOptions(
+options.asScala.toMap,
+sparkSession.sessionState.conf.sessionLocalTimeZone
+)
+
+JsonDataSource(parsedOptions).inferSchema(
+sparkSession,
+files,
+parsedOptions
+)
 }
 case _ =>
 throw new IllegalArgumentException("Invalid format name: " + formatName)

 }

 /** Construct a new write builder according to the actual file format. */
1 change: 1 addition & 0 deletions maven-projects/spark/datasources-34/.scalafmt.conf
193 changes: 193 additions & 0 deletions maven-projects/spark/datasources-34/pom.xml
@@ -0,0 +1,193 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<parent>
<groupId>org.apache.graphar</groupId>
<artifactId>spark</artifactId>
<version>${graphar.version}</version>
<relativePath>../pom.xml</relativePath>
</parent>

<artifactId>graphar-datasources</artifactId>
<version>${graphar.version}</version>
<packaging>jar</packaging>

<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
</dependencies>

<build>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<args>
<arg>-target:jvm-${maven.compiler.target}</arg>
</args>
<jvmArgs>
<jvmArg>-Xss4096K</jvmArg>
</jvmArgs>
</configuration>
<executions>
<execution>
<id>scala-compile</id>
<goals>
<goal>compile</goal>
</goals>
<configuration>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</configuration>
</execution>
<execution>
<id>scala-test-compile</id>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>4.8.0</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<jvmArgs>
<jvmArg>-Xms64m</jvmArg>
<jvmArg>-Xmx1024m</jvmArg>
</jvmArgs>
<args>
<arg>-Ywarn-unused</arg>
</args>
<compilerPlugins>
<compilerPlugin>
<groupId>org.scalameta</groupId>
<artifactId>semanticdb-scalac_${scala.version}</artifactId>
<version>${semanticdb-scalac.version}</version>
</compilerPlugin>
</compilerPlugins>
</configuration>
</plugin>
<plugin>
<groupId>com.diffplug.spotless</groupId>
<artifactId>spotless-maven-plugin</artifactId>
<version>2.20.0</version>
<configuration>
<!-- define a language-specific format -->
<java>
<!-- no need to specify files, inferred automatically, but you can if you want -->
<!-- apply a specific flavor of google-java-format and reflow long strings -->
<googleJavaFormat>
<version>1.13.0</version>
<style>AOSP</style>
</googleJavaFormat>
</java>
<scala>
<scalafmt>
<file>${project.basedir}/.scalafmt.conf</file> <!-- optional -->
</scalafmt>
</scala>
</configuration>
</plugin>
<plugin>
<groupId>io.github.evis</groupId>
<artifactId>scalafix-maven-plugin_2.13</artifactId>
<version>0.1.8_0.11.0</version>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-source-plugin</artifactId>
<executions>
<execution>
<id>attach-sources</id>
<goals>
<goal>jar</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-javadoc-plugin</artifactId>
<executions>
<execution>
<id>attach-javadocs</id>
<goals>
<goal>jar</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-site-plugin</artifactId>
<version>3.7.1</version>
</plugin>
</plugins>
</build>
</project>