External backing storage demo #668

Open: 4 commits to merge into base branch master

bennsimon
Description of what I changed

  • Adds a demonstration of deploying a Spark Thrift Server with external backing storage (PostgreSQL).
    • Uses an intermediary image that provides all the JDBC drivers that Hive will use.

E2E test

TESTED:

Tested on my local machine by running the command below:

docker-compose -f compose-controller-spark-sql-external-storage.yaml up

Checklist: I completed these to help reviewers :)

  • I have read and will follow the review process.

  • I am familiar with Google Style Guides for the language I have coded in.

    No? Please take some time and review Java and Python style guides.

  • My IDE is configured to follow the Google code styles.

    No? Unsure? -> configure your IDE.

  • I have added tests to cover my changes. (If you refactored existing code that was well tested you do not have to add tests)

  • I ran mvn clean package right before creating this pull request and added all formatting changes to my commit.

  • All new and existing tests passed.

  • My pull request is based on the latest changes of the master branch.

    No? Unsure? -> execute command git pull --rebase upstream master

bashir2 (Collaborator) commented Apr 28, 2023

/gcbrun

bashir2 (Collaborator) left a comment

Thanks @bennsimon for this change; just added some suggestions/questions.

@@ -0,0 +1,2 @@
FROM busybox:1.36
COPY postgresql-42.6.0.jar /jdbcDrivers/
Collaborator:

We have a GitHub security policy not to check in binary artifacts, as they are not reviewable (details here). Can we find another way of copying the PostgreSQL driver, e.g., fetching it from an official Docker image during the image build process? (We are violating this in one particular case, which should be fixed too, but I don't want to add more to it.)

Author:

Yeah, I can fetch it from the official site during the build process.
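
A minimal sketch of what fetching the driver at build time could look like, keeping the same busybox:1.36 base and /jdbcDrivers/ target as the diff above. The download URL and the PG_JDBC_VERSION build argument are assumptions made for illustration and are not taken from this pull request; the `--chmod` flag on `ADD` assumes the image is built with BuildKit.

```dockerfile
# syntax=docker/dockerfile:1
FROM busybox:1.36

# Hypothetical build argument; 42.6.0 matches the jar named in the diff.
ARG PG_JDBC_VERSION=42.6.0

# ADD with a remote URL downloads the file while the image is built, so no
# binary has to be checked into the repository. The URL is an assumed
# download location and should be verified before use.
# Remote files are added with mode 600 by default, so read access is widened
# in case another container reads the jar as a different user.
ADD --chmod=644 \
    https://jdbc.postgresql.org/download/postgresql-${PG_JDBC_VERSION}.jar \
    /jdbcDrivers/
```

Because the destination ends with a slash, the file name is taken from the URL. Pinning the download with the `--checksum` option of `ADD` would make the fetched artifact easier to audit, if the builder in use supports it.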

@@ -0,0 +1,2 @@
FROM busybox:1.36
Collaborator:

If we are going to build a new image, why not start from bitnami/spark as the base and then copy the required driver?

Author:

To reduce the size of the image. When I use bitnami/spark as the base, the final image is more than 1 GB, but with a minimal base image it is ~15 MB.

I was thinking that since the only purpose of this image is to host all the drivers, why not use a minimal image? It removes the need to maintain a parallel Spark image.

If we use bitnami/spark as the base, we won't need to copy the drivers over, but then we will need to maintain a parallel Spark image.

@@ -0,0 +1,43 @@
<!--
Collaborator:

Can we just reuse/adapt docker/hive-site_example.xml instead of creating a new one?

Author:

Yes, sure will do.

@bennsimon bennsimon requested a review from bashir2 May 2, 2023 15:11
bashir2 (Collaborator) commented Jul 20, 2023

Thanks again @bennsimon; I think we discussed this in other forums, can you please remind me if this is still needed and if yes, update it to be merged? It would be great to add a README.md to the docker/drivers-build directory describing that this is for demo purposes and not integrated into our continuous tests.

bennsimon (Author) commented

Hey @bashir2, yes, I think it is still needed to demonstrate that one can set up the Spark Thrift Server with a database other than Derby.

I will add the README.md.

@bennsimon bennsimon force-pushed the external-storage-drivers-builds-demo branch from 4d366ae to 70feb27 Compare July 21, 2023 06:56
@bennsimon bennsimon force-pushed the external-storage-drivers-builds-demo branch from 70feb27 to bd6b026 Compare July 21, 2023 18:53