Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-51132][ML][BUILD] Upgrade JPMML to 1.7.1 #49854

Closed
wants to merge 5 commits into from

Conversation

wayneguow
Copy link
Contributor

@wayneguow wayneguow commented Feb 8, 2025

What changes were proposed in this pull request?

This PR aims to upgrade JPMML from 1.4.8 to 1.7.1. The main changes from 1.4.8 to 1.7.1 are as follows:

  1. Starting from version 1.5.0, PMML schema version has been updated from 4.3 to 4.4, the related commit is:
    jpmml/jpmml-model@7d8607a
  2. Starting from version 1.6.0, Java XML Binding has been upgraded to Jakarta XML Binding, the related commit is: jpmml/jpmml-model@d76de1c

After this PR, the exported PMML model schema version has been upgraded from 4.3(https://dmg.org/pmml/v4-3/GeneralStructure.html) to 4.4(https://dmg.org/pmml/v4-4-1/GeneralStructure.html).

Why are the changes needed?

  1. Upgrade the PMML standard to the latest 4.4 version ;
  2. Upgrade Java XML Binding to Jakarta XML Binding by upgrade JPMML.

Does this PR introduce any user-facing change?

Yes, the exported PMML model version has been upgraded from 4.3 to 4.4, details are as follows:

  1. version has been changed from 4.2 to 4.4;
  2. xmlns has been changed from http://www.dmg.org/PMML-4_3 to http://www.dmg.org/PMML-4_4;
  3. Application version has been changed to the current Spark version.

Before:

<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_3" xmlns:data="http://jpmml.org/jpmml-model/InlineTable">
    <Header description="k-means clustering">
        <Application name="Apache Spark MLlib" version="3.5.4"/>
        <Timestamp>2025-02-08T10:00:55</Timestamp>
    </Header>
...

After:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable">
    <Header description="k-means clustering">
        <Application name="Apache Spark MLlib" version="4.1.0-SNAPSHOT"/>
        <Timestamp>2025-02-08T11:34:40</Timestamp>
    </Header>
...

Despite this change, on the one hand, PMML version 4.4 was released as early as November 2019. On the other hand, the upgrade from 4.3 to 4.4 is backward compatible.(Reference: https://dmg.org/pmml/v4-4-1/Changes.html)

How was this patch tested?

  1. Passed GA;
  2. Manually checked the changes after the upgrade.
    KMeans:
import org.apache.spark.ml.clustering.KMeans
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
model.write.format("pmml").save("./kmeans")

LinearRegression:

import org.apache.spark.ml.regression.LinearRegression
val dataset = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")
val lr = new LinearRegression()
val model = lr.fit(dataset)
model.write.format("pmml").save("./lr")

Was this patch authored or co-authored using generative AI tooling?

No.

@wayneguow wayneguow changed the title [SPARK-51132][ML][BUILD] Upgrade JPMML to 1.7.1 [SPARK-51132][ML][BUILD] Upgrade JPMML to 1.7.1 Feb 8, 2025
@wayneguow
Copy link
Contributor Author

Manually exported PMML models(pmml_export_models.zip) between Spark 3.5.4 and the current PR:

  1. Kmeans
    Spark 3.5.4 vs Current master branch with this PR:
    image

  2. LinearRegression
    Spark 3.5.4 vs Current master branch with this PR:
    image

@wayneguow
Copy link
Contributor Author

@zhengruifeng For comment form #48611 (comment)

adding additional tests to make sure pmml-model 1.7.1 can successfully load previous models

After my confirmation, currently Spark ML can only export models in PMML format, but currently does not support reloading models from PMML format.

Although there is PMMLUtils tool class in Spark that can load models from PMML format (https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/util/PMMLUtils.scala#L36), but I think it is meaningless to use JPMML to verify the load of PMML, it is more about verifying the compatibility of JPMML. In fact, what we need to verify is that the PMML model exported by Spark can be reloaded on other downstream platforms or systems.

So, here I did some manual checks, which is what the previous comment mentioned.

@@ -515,7 +515,9 @@ javax.xml.bind:jaxb-api https://github.com/javaee/jaxb-v2
Eclipse Distribution License (EDL) 1.0
--------------------------------------
com.sun.istack:istack-commons-runtime
jakarta.activation:jakarta.activation-api
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image
image

<exclusion>
<groupId>jakarta.activation</groupId>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current jaxb-runtime version, we need jakarta.activation, otherwise there will be java.lang.NoClassDefFoundError: jakarta/activation/DataSource errors, both when exporting PMML models and when using jersey.

Copy link
Member

@pan3793 pan3793 Feb 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some exclusions are invalid now, please remove them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I found that the jakarta.xml.bind:jakarta.xml.bind-api exclusion in jersey-server can be removed, but the com.sun.activation:jakarta.activation exclusion in jersey-common can still be retained, because this dependence is currently useless.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean com.sun.xml.fastinfoset:FastInfoset and org.jvnet.staxex:stax-ex, they are marked as optional in org.glassfish.jaxb:jaxb-runtime:4.0.5

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out, I updated it

@@ -1165,7 +1165,7 @@ class LinearRegressionSuite extends MLTest with DefaultReadWriteTest with PMMLRe
assert(fields(0).getOpType() == OpType.CONTINUOUS)
val pmmlRegressionModel = pmml.getModels().get(0).asInstanceOf[PMMLRegressionModel]
val pmmlPredictors = pmmlRegressionModel.getRegressionTables.get(0).getNumericPredictors
val pmmlWeights = pmmlPredictors.asScala.map(_.getCoefficient()).toList
val pmmlWeights = pmmlPredictors.asScala.map(_.getCoefficient().doubleValue()).toList
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes come from: jpmml/jpmml-model@6d356fe

@@ -44,6 +44,6 @@ private[mllib] trait PMMLModelExport {
val header = new Header()
.setApplication(app)
.setTimestamp(timestamp)
new PMML("4.2", header, null)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, in JPMML 1.4.7 version, the PMML standard was updated to 4.3, so it should be consistent and changed to 4.3 at that time

@@ -44,7 +44,7 @@ private[mllib] class BinaryClassificationPMMLModelExport(
pmml.getHeader.setDescription(description)

if (model.weights.size > 0) {
val fields = new SArray[FieldName](model.weights.size)
val fields = new SArray[String](model.weights.size)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes come from: jpmml/jpmml-model@5969dc2

@wayneguow
Copy link
Contributor Author

Also cc @vruusmann , if you have time, would you like to take a look at the version upgrade of JPMML for Spark 4.0.0? Does it have some potential compatibility risks?

@wayneguow wayneguow marked this pull request as ready for review February 8, 2025 04:47
@@ -599,7 +599,7 @@
<dependency>
<groupId>org.glassfish.jaxb</groupId>
<artifactId>jaxb-runtime</artifactId>
<version>2.3.2</version>
<version>4.0.5</version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there version contract between this dep and other org.glassfish.* deps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is currently no mutual dependence between the org.glassfish.jersey series (3.0.16) and the org.glassfish.jaxb (4.0.5) series.

But they both depend on jakarta.xml.bind:jakarta.xml.bind-api and jakarta.activation:jakarta.activation-api, and the dependency versions of jaxb are newer (xml: 4.0.2 vs 3.0.1, activation: 2.1.3 vs 2.0.1).

@vruusmann
Copy link
Contributor

A quick memory dump on JPMML-Model evolution:

  • 1.5.X. Migrating from PMML schema 4.3 to 4.4. JDK 8 compatible.
  • 1.6.X. API upgrades. Specifically, replacing org.dmg.pmml.FieldName (pseudo-enum for representing field names) with plain java.lang.String. Also, migrating from javax.xml.bind.* API to jakarta.xml.bind.* API. JDK 8 compatible.
  • 1.7.X. Upgrading from JDK 8 to JDK 11. The codebase is effectively unchanged. Just picking up up-to-date dependencies.

The org.glassfish.jaxb:jaxb-* dependencies are chosen based on their JDK compaibility. For example, org.glassfish.jaxb:jaxb-runtime:3.X versions are targeting JDK 8, whereas org.glassfish.jaxb:jaxb-runtime:4.X are targeting JDK 11. Apache Spark 4.X is targeting JDK 17, no?

As for encoding PMML schema versions and XML namespaces, then there's a org.dmg.pmml.Version enum class for this purpose. You could define a class constant SPARK_PMML_VERSION somewhere, and then access its #version and #namespaceURI string attributes:

org.dmg.pmml.Version SPARK_PMML_VERSION = org.dmg.pmml.Version.PMML_4_4;

System.out.println(SPARK_PMML_VERSION .getVersion()); // Prints "4.4"
System.out.println(SPARK_PMML_VERSION.getNamespaceURI()); // Prints "http://www.dmg.org/PMML-4_4"

@wayneguow
Copy link
Contributor Author

Apache Spark 4.X is targeting JDK 17, no?

Yes, Spark 4.x is targeting JDK 17.

spark/pom.xml

Line 117 in 301b666

<java.version>17</java.version>

@vruusmann
Copy link
Contributor

vruusmann commented Feb 9, 2025

Yes, Spark 4.x is targeting JDK 17.

Perhaps it is then possible to be even more aggressive on some transitive dependency updates (than JPMML-Model 1.7.1 proposes).

My goal is to maintain source compatibility with legacy platforms. One of new goals for the 1.7.X development branche is to achieve cross-compilability with JavaScript/WASM, as implemented by the TeaVM platform.

TLDR: The JPMML-Model dependency upgrade is highly unlikely to break anything. It's targeting JDK 11, whereas you are already targeting JDK 17. My unit and integration tests indicate that JPMML-Model runs fine on all JDK 11+ versions (specifically, 11, 17 and 21 LTS versions).

@wayneguow
Copy link
Contributor Author

@vruusmann Thank you for sharing these views, much appreciated.

@zhengruifeng
Copy link
Contributor

The JPMML upgrade and related code changes LGTM.

also ping @LuciferYang and @dongjoon-hyun

@pan3793
Copy link
Member

pan3793 commented Feb 10, 2025

Should the change be mentioned in the migration guide? Otherwise LGTM

@LuciferYang
Copy link
Contributor

Just got back from vacation, I'll take a look tonight.

@github-actions github-actions bot added the DOCS label Feb 10, 2025
@wayneguow
Copy link
Contributor Author

Should the change be mentioned in the migration guide? Otherwise LGTM

Add a change description in ml-migration-guide.md.

Copy link
Contributor

@LuciferYang LuciferYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

**Changes of behavior**

* [SPARK-51132](https://issues.apache.org/jira/browse/SPARK-51132):
The PMML XML schema version of exported PMML format models by [PMML model export](mllib-pmml-model-export.html) has been upgraded from `PMML-4_3` to `PMML-4_4`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's just one question: What's the difference between PMML-4_3 and PMML-4_4? And are we sure this won't introduce any breaking changes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between PMML-4_3 and PMML-4_4?

Since you didn't change anything about your application code, then the only perceivable difference will be a new XML namespace declaration on the first line of the exported PMML documents.

Previously, the top-level PMML element was <PMML xmlns="http://www.dmg.org/PMML-4_3">, now it's <PMML xmlns="http://www.dmg.org/PMML-4_4">.

And are we sure this won't introduce any breaking changes?

The JPMML-Model library defaults to the latest PMML schema version in its output (ie. 4.4).

If you really want to, you can keep outputting PMML 4.3 schema version documents by "filtering" the output stream using the org.jpmml.model.PMMLOutputStream class:

JAXBSerializer jaxbSerializer = new JAXBSerializer();

OutputStream os = ...

try(OutputStream os = new PMMLOutputStream(os, Version.PMML_4_3)){
	jaxbSerializer.serializePretty(pmml, os);
}

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhengruifeng
Copy link
Contributor

cc @zhengruifeng

I have approved #49854 (comment), not sure why I am not in the Reviewers list :)

@dongjoon-hyun
Copy link
Member

Oh, ya. Sorry, I missed that, @zhengruifeng . :)

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for Apache Spark 4.1.0.

However, I'd not be against on backporting if one of PMC decide to backport this at 4.0.0.

Upgrading from MLlib 3.5 to 4.0

LuciferYang pushed a commit that referenced this pull request Feb 11, 2025
### What changes were proposed in this pull request?

This PR aims to upgrade `JPMML` from 1.4.8 to 1.7.1. The main changes from 1.4.8 to 1.7.1 are as follows:
1. Starting from version 1.5.0, `PMML` schema version has been updated from 4.3 to 4.4, the related commit is:
jpmml/jpmml-model@7d8607a
2. Starting from version 1.6.0, `Java XML Binding` has been upgraded to `Jakarta XML Binding`,  the related commit is: jpmml/jpmml-model@d76de1c

After this PR, the exported PMML model schema version has been upgraded from 4.3(https://dmg.org/pmml/v4-3/GeneralStructure.html) to 4.4(https://dmg.org/pmml/v4-4-1/GeneralStructure.html).

### Why are the changes needed?

1. Upgrade the PMML standard to the latest 4.4 version ;
2. Upgrade `Java XML Binding` to `Jakarta XML Binding` by upgrade `JPMML`.

### Does this PR introduce _any_ user-facing change?

Yes, the exported PMML model version has been upgraded from 4.3 to 4.4, details are as follows:
1. `version` has been changed from 4.2 to 4.4;
2. `xmlns` has been changed from `http://www.dmg.org/PMML-4_3` to `http://www.dmg.org/PMML-4_4`;
3. `Application version` has been changed to the  current Spark version.

Before:
```
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_3" xmlns:data="http://jpmml.org/jpmml-model/InlineTable">
    <Header description="k-means clustering">
        <Application name="Apache Spark MLlib" version="3.5.4"/>
        <Timestamp>2025-02-08T10:00:55</Timestamp>
    </Header>
...
```
After:
```
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4" xmlns:data="http://jpmml.org/jpmml-model/InlineTable">
    <Header description="k-means clustering">
        <Application name="Apache Spark MLlib" version="4.1.0-SNAPSHOT"/>
        <Timestamp>2025-02-08T11:34:40</Timestamp>
    </Header>
...
```

**Despite this change, on the one hand, PMML version 4.4 was released as early as November 2019. On the other hand, the upgrade from 4.3 to 4.4 is backward compatible.(Reference: https://dmg.org/pmml/v4-4-1/Changes.html)**

### How was this patch tested?

1. Passed GA;
2. Manually checked the changes after the upgrade.
KMeans:
```
import org.apache.spark.ml.clustering.KMeans
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
model.write.format("pmml").save("./kmeans")
```

LinearRegression:
```
import org.apache.spark.ml.regression.LinearRegression
val dataset = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")
val lr = new LinearRegression()
val model = lr.fit(dataset)
model.write.format("pmml").save("./lr")
```
### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49854 from wayneguow/pmml.

Authored-by: Wei Guo <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
(cherry picked from commit cea79dc)
Signed-off-by: yangjie01 <[email protected]>
@LuciferYang
Copy link
Contributor

Merged into master and branch-4.0.
Thanks @wayneguow @zhengruifeng @dongjoon-hyun @pan3793 @vruusmann

@wayneguow
Copy link
Contributor Author

Thank you all! Also, those two annoying Jersey warning logs are now gone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants