kedro-datasets: allow users to choose between Databricks and Spark sessions #700
Comments
It's also worth considering that with [...] The whole [...] I haven't tested this thoroughly, but I guess that with the current implementation, since Spark is initialized independently and without args for each Dataset, no Spark configuration can be used. If this development is going to be taken on, it would be nice to also consider this and design the Spark integration in a way that Spark config can also be used when using [...]
We need to consider not just the latest [...] For now, @MigQ2, can you provide some pointers and a snippet showing what a Spark connection would look like for [...]
I'm not sure I fully get what you mean, @noklam. In the older versions ([...] In the newer versions ([...] I have opened #861 to tackle the problem @michal-mmm initially mentioned. A way to pass Spark options to databricks-connect is still not implemented. Shall we track that in a separate issue?
@MigQ2 Thanks for opening the PR.
Can you explain how this resolves the issue? From my understanding, the PR tries to avoid initialising databricks-connect for older versions. This issue seems to suggest that sometimes people still want to use pure Spark, even if [...]
Yeah @noklam, you're right, it's probably worth taking a more general approach in one go. I can think of 2 possible implementations. Let me know what you think and I can try to implement one of them:
[...]
I would prefer 1 if it achieves the same thing without introducing another class. For a global default, I would imagine users doing this via a dataset factory, setting their preferred Spark default there instead of via the Spark dataset. Thanks for taking a stab at this quickly. Again, I will be spending more time this week on PR review, so if you are able to come up with a PR soon, I can review it quickly.
Hi @noklam, after some research I have come up with #862 (it's still WIP, let's discuss if we like the pattern and then I can finish implementation for all datasets). Some relevant comments:
Could some of the people who liked this issue explain their setup and whether this would solve their concerns? @michal-mmm @zerodarkzone @filipeo2-mck What do you think? Shall I implement this on all Datasets, or should we rather just update the documentation to be clearer, close #862 and leave the code as-is?
Description
Using `databricks-connect` is not always the best way to initialize Spark sessions. This is a problem for users who do not wish to use `databricks-connect` or who prefer `spark-connect`.
Context
`databricks-connect` can sometimes cause issues, such as those described in this community discussion. These issues can disrupt workflows and create unnecessary complications.
Being able to choose between `databricks-connect` and a regular Spark session would allow users to avoid these issues and use their preferred method for Spark session initialization.
The `_get_spark()` function is used in some datasets.
Possible Alternatives
Modify the `_get_spark()` function to allow users to specify their preference for `databricks-connect`, `spark-connect`, or a regular Spark session through configuration settings.
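To make the proposal concrete, here is a minimal sketch of what a configurable session helper could look like. It is not the current kedro-datasets implementation: the names `get_spark`, `resolve_mode`, `mode`, `remote_url` and `conf` are hypothetical, and the fallback to `databricks-connect` when it is installed is an assumption about the default behaviour. The only library calls used are `DatabricksSession.builder.getOrCreate()` from `databricks-connect` and `SparkSession.builder` with `.remote()`, `.config()` and `.getOrCreate()` from PySpark (`.remote()` needs PySpark 3.4 or later).

```python
from typing import Any, Mapping, Optional

# Hypothetical mode names, taken from the three options listed in this issue.
VALID_MODES = ("databricks-connect", "spark-connect", "spark")


def resolve_mode(preference: Optional[str], databricks_connect_installed: bool) -> str:
    """Pick a session backend; an explicit preference always wins."""
    if preference is None:
        # Assumed default: use databricks-connect only when it is installed.
        return "databricks-connect" if databricks_connect_installed else "spark"
    if preference not in VALID_MODES:
        raise ValueError(
            f"Unknown Spark session mode {preference!r}; expected one of {VALID_MODES}"
        )
    return preference


def _databricks_connect_installed() -> bool:
    """Return True if the databricks-connect package can be imported."""
    try:
        import databricks.connect  # noqa: F401
    except ImportError:
        return False
    return True


def get_spark(
    mode: Optional[str] = None,
    remote_url: Optional[str] = None,
    conf: Optional[Mapping[str, Any]] = None,
):
    """Hypothetical replacement for `_get_spark()` that honours a user preference."""
    # Only probe for databricks-connect when the user expressed no preference.
    installed = _databricks_connect_installed() if mode is None else False
    chosen = resolve_mode(mode, installed)

    if chosen == "spark-connect" and remote_url is None:
        raise ValueError("mode='spark-connect' requires remote_url, e.g. 'sc://host:15002'")

    if chosen == "databricks-connect":
        from databricks.connect import DatabricksSession

        # Spark options are deliberately not passed here; the thread notes
        # that this is still an open question for databricks-connect.
        return DatabricksSession.builder.getOrCreate()

    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    if chosen == "spark-connect":
        builder = builder.remote(remote_url)
    for key, value in (conf or {}).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With a helper like this, a dataset could forward a user-supplied setting, for example `get_spark(mode="spark", conf={"spark.sql.shuffle.partitions": "8"})`, so that a regular Spark session is used even when `databricks-connect` is installed. Imports are done lazily inside each branch so that only the chosen backend needs to be installed.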