Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support gcs-connector 3.x in GcsUtil #33368

Merged
merged 7 commits into from
Jan 14, 2025

Conversation

clairemcginty
Copy link
Contributor

@clairemcginty clairemcginty commented Dec 12, 2024

Rationale:

I would like to use gcs-connector 3.x, which supports the new Parquet VectorIO feature. However, gcs-connector 3.x also drops Java 8 and targets Java 11, which blocks us from upgrading it directly in Beam, since Beam is still targeting 8 (see #31678).

Additionally, as a Beam user, I can't just upgrade gcs-connector on my end, due to breaking changes in how GoogleCloudStorageImpl is instantiated: in 2.x it has public constructors, but in 3.x it drops the public constructors and enforces a Builder pattern.

Therefore, when running on gcs-connector 3.x, my pipeline throws a NoSuchMethodError from org.apache.beam.sdk.extensions.gcp.util.GcsUtil when it tries to invoke the 2.x constructor: https://github.com/apache/beam/blob/v2.61.0/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/GcsUtil.java#L727

This PR adds a pipeline option for a GoogleCloudStorage Provider, so that users who want to use gcs-connector 3.x can be unblocked from doing so. It defaults to invoking the gcs-connector 2.x public constructor, but 3.x users can override it to use the Builder.


GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

Comment on lines 228 to 232
GoogleCloudStorage get(
GoogleCloudStorageOptions options,
Storage storage,
Credentials credentials,
HttpRequestInitializer httpRequestInitializer);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 4 params should cover both the 2.x constructor and the 3.x Builder

Copy link
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@clairemcginty
Copy link
Contributor Author

cc @Abacn - wdyt of this workaround for gcs-connector 3.x?

@Abacn
Copy link
Contributor

Abacn commented Dec 13, 2024

Hi, thanks for the investigation. Is the builder constructor also supported on 2.x ? If so we can just change to use it in all case and no need extra options exposed to user

@clairemcginty
Copy link
Contributor Author

Hi, thanks for the investigation. Is the builder constructor also supported on 2.x ? If so we can just change to use it in all case and no need extra options exposed to user

Unfortunately it isn't :/ There is no way to construct a GoogleCloudStorageImpl that works for both gcs-connector 2.x and 3.x (at least as far as I've been able to find).

Copy link
Contributor

@Abacn Abacn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

understand, thanks!

@clairemcginty
Copy link
Contributor Author

understand, thanks!

thanks @Abacn ! should I update CHANGES.md for the 2.62.0 release or the next one?

@Abacn
Copy link
Contributor

Abacn commented Dec 23, 2024

java precommit tineout for several rerun. It's passing on HEAD, could you please taking a look if it is related to this change?

@Abacn
Copy link
Contributor

Abacn commented Dec 27, 2024

after multiple rerun there is indeed a related test failure:

https://github.com/apache/beam/runs/34888346578

testGcpCoreApiSurface (org.apache.beam.sdk.extensions.gcp.GcpCoreApiSurfaceTest) failed

java.lang.AssertionError: 
Expected: API surface to include only:

...

 but: The following disallowed classes appeared on the API surface:
	class com.google.protobuf.AbstractMessage exposed via:
...
interface org.apache.beam.sdk.extensions.gcp.options.GcsOptions$GoogleCloudStorageProvider

looks like it is coded in test to prevent the exposure of unwanted classes. Refactor in a way that does not leak these may fix

@clairemcginty
Copy link
Contributor Author

clairemcginty commented Dec 27, 2024

looks like it is coded in test to prevent the exposure of unwanted classes. Refactor in a way that does not leak these may fix

thanks for looking into it @Abacn ! hmm, this seems challenging to refactor since the exposure is coming from interface com.google.cloud.hadoop.gcsio.GoogleCloudStorage itself... and there's no superclass/subclass that would be appropriate to substitute in here. I guess instead of adding a provider for a GoogleCloudStorage instance, we could add a provider for a BiFunction<StorageResourceId, CreateObjectOptions, WritableByteChannel> ? Although that would leak more implementation details to the user.

@clairemcginty
Copy link
Contributor Author

looks like it is coded in test to prevent the exposure of unwanted classes. Refactor in a way that does not leak these may fix

hmm, this seems challenging to refactor since the exposure is coming from interface com.google.cloud.hadoop.gcsio.GoogleCloudStorage itself... and there's no superclass/subclass that would be appropriate to substitute in here. I guess instead of adding a provider for a GoogleCloudStorage instance, we could add a provider for a BiFunction<StorageResourceId, CreateObjectOptions, WritableByteChannel> ? Although that would leak more implementation details to the user.

hey @Abacn ! just wanted to ping this thread before the PR goes stale :)

@Abacn
Copy link
Contributor

Abacn commented Jan 10, 2025

a last resort would be using reflection to handle both constructor or builder. Say, use a try to encapsulate the constictor call, when using 3.x dependency ut will throw an error, catch that Exception and construct the instance using Reflection. Is that possible to do?

@clairemcginty
Copy link
Contributor Author

a last resort would be using reflection to handle both constructor or builder. Say, use a try to encapsulate the constictor call, when using 3.x dependency ut will throw an error, catch that Exception and construct the instance using Reflection. Is that possible to do?

I think that makes sense... so we'd keep the official Beam dep on 2.x. but try to reflectively construct a 3.x instance first?

@clairemcginty clairemcginty changed the title Parameterize GoogleCloudStorage provider in GcsUtil to unblock gcs-co… Support gcs-connector 3.x in GcsUtil Jan 10, 2025
@clairemcginty
Copy link
Contributor Author

seems to work! tested it on 2.x and 3.x 👍

return new GoogleCloudStorageImpl(options, storage, credentials);
try {
// Attempt to construct gcs-connector 3.x-style GoogleCloudStorage, which is created
// exclusively via Builder method; this can be replaced once Java 8 is dropped and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could add a // TODO tag like

// TODO eliminate reflection once Beam drops Java 8 support and upgrade to use gcsio 3.x

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea- updated 👍

@clairemcginty
Copy link
Contributor Author

there is an integration failure but it doesn't look related:

org.apache.beam.sdk.io.gcp.bigquery.StorageApiSinkSchemaUpdateIT > testExactlyOnceWithAutoSchemaUpdate[1] FAILED
    org.apache.beam.sdk.Pipeline$PipelineExecutionException at StorageApiSinkSchemaUpdateIT.java:431
        Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException at GoogleJsonResponseException.java:146

@Abacn
Copy link
Contributor

Abacn commented Jan 13, 2025

Sorry for another late request

I've been a little bit worry about relying on the try {} catch {} to determine the code path, which happened to bring out performance regression in #33575

since using Reflection is already a hack, I do not see a good solution to this either. But could we try {return new GoogleCloudStorageImpl(...)} first? So in the default dependency versions (gcs 2.x), it won't throw Exception (even though it is handled) after this change

@clairemcginty
Copy link
Contributor Author

that makes sense @Abacn ! Updated so that we try to the 2.x constructor first and attempt the 3.x Builder only if that throws a NoSuchMethodError.

@clairemcginty
Copy link
Contributor Author

looks like there was a spotbugs error - looking into it now..

@Abacn
Copy link
Contributor

Abacn commented Jan 14, 2025

yeah, can be found in "SpotBugs Results": https://github.com/apache/beam/actions/runs/12768890245?pr=33368


Code | Warning
-- | --
REC | Exception is caught when Exception is not thrown in org.apache.beam.sdk.extensions.gcp.util.GcsUtil.createGoogleCloudStorage(GoogleCloudStorageOptions, Storage, Credentials)
  | Bug type REC_CATCH_EXCEPTION (click for details)In class org.apache.beam.sdk.extensions.gcp.util.GcsUtilIn method org.apache.beam.sdk.extensions.gcp.util.GcsUtil.createGoogleCloudStorage(GoogleCloudStorageOptions, Storage, Credentials)At GcsUtil.java:[line 780]

Code	Warning
REC	Exception is caught when Exception is not thrown in org.apache.beam.sdk.extensions.gcp.util.GcsUtil.createGoogleCloudStorage(GoogleCloudStorageOptions, Storage, Credentials)
[Bug type REC_CATCH_EXCEPTION (click for details)](https://github.com/apache/beam/pull/33368#REC_CATCH_EXCEPTION)
In class org.apache.beam.sdk.extensions.gcp.util.GcsUtil
In method org.apache.beam.sdk.extensions.gcp.util.GcsUtil.createGoogleCloudStorage(GoogleCloudStorageOptions, Storage, Credentials)
At GcsUtil.java:[line 780]

@Abacn
Copy link
Contributor

Abacn commented Jan 14, 2025

catch only the checked reflection Exceptions should fix the issue I see what the check means now, see below

@Abacn Abacn merged commit ff8b6a1 into apache:master Jan 14, 2025
18 of 19 checks passed
@clairemcginty clairemcginty deleted the gcs-connector-compat branch January 14, 2025 16:34
Naireen pushed a commit to Naireen/beam that referenced this pull request Jan 17, 2025
* Parameterize GoogleCloudStorage provider in GcsUtil to unblock gcs-connector 3.x

* Use Reflection to attempt 3.x Builder, and fall back to 2.x Constructor

* Attempt constructing 3.x-style GCS via reflection; fall back to 2.x constructor

* Update CHANGES.md

* Add TODO note on reflection block

* Try non-reflected construction first

* Fix SpotBugs error
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants