Unnecessary RDD repartition if RDD is already indexed. [Improvement] #76

Open
merlintang opened this issue Dec 7, 2016 · 4 comments

Comments

@merlintang

For a distance join such as RDJSpark, the left RDD is always repartitioned with STRPartition.
However, if the left RDD is already indexed and partitioned, this repartition is redundant and costly. How about adding a function inside STRPartition that checks whether an index partitioner already exists? That would avoid the unnecessary shuffle cost.

@dongx-psu
Member

I agree with you on this ticket. Handling this case requires a physical planning strategy. I think it would be fine to add some heuristics in SpatialJoinExtractor that check whether the table on one side has already been indexed.

@merlintang
Author

Since I cannot fork the standalone branch, I cannot send a PR for this feature. Can you release the standalone version?

@dongx-psu
Member

How did your tests on the current standalone version go? Do you think it is ready to become the master branch?

BTW, I think you can just fork the repo, and that will fork all branches for you. You just need to switch to the branch you want. If you already have a forked repo, one solution is to set the main repo as a remote, pull from it, and then push the result to your forked repo.
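
For an existing fork, the remote-then-push route described above might look roughly like the sketch below. It assumes the current directory is a clone of your fork whose remote is named `origin`. The remote name `upstream` and the function name `sync_branch_from_main` are illustrative choices, not anything defined by the project.

```shell
# Sketch: copy one branch from the main repository into your fork.
# Assumes the current directory is a clone of the fork (remote "origin").
# "upstream" and the function name are illustrative, not project conventions.
sync_branch_from_main() {
    main_repo_url="$1"   # URL or path of the main repository
    branch="$2"          # branch to bring over, e.g. standalone

    # Register the main repo as a second remote unless it is already there.
    if ! git remote get-url upstream >/dev/null 2>&1; then
        git remote add upstream "$main_repo_url" || return 1
    fi

    # Fetch the main repo's branches, create a local branch from its copy
    # (this step fails if a local branch of that name already exists),
    # then publish that branch to the fork.
    git fetch upstream &&
        git checkout -b "$branch" "upstream/$branch" &&
        git push origin "$branch"
}
```

It would be called as `sync_branch_from_main <main-repo-url> standalone`, with the main repository's URL in place of the placeholder. After that the fork has its own `standalone` branch to base a PR on.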

@merlintang
Author

merlintang commented Dec 8, 2016 via email
