Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

polygon and point join [new feature] #75

Open
merlintang opened this issue Dec 7, 2016 · 10 comments
Open

polygon and point join [new feature] #75

merlintang opened this issue Dec 7, 2016 · 10 comments

Comments

@merlintang
Copy link

Magellan project support the join between polygons and points, and join relationship can be inside, intersect or other. Users are interest in this feature, since this polygon can represent one county, and points can be the ubers, the join results can show which county is more visited and etc.

The current Simba does not support this, does any plan for this function? or any proposal or issue a JERA to track this feature?

@merlintang merlintang changed the title polygon and point join polygon and point join [new feature] Dec 7, 2016
@dongx-psu
Copy link
Member

I think it is a crucial feature, and I try to take some time think about it.

The core problem in this context is actually how to partition a group of polygons. If there is no assumption on the data set, polygons can be very general and easily overflow the heap memory of an executor. Advanced load balancing strategy need to be designed.

Besides, current polygon utilities are relied on JTS 1.14, which probably too heavy to deal with something as simple as contains and intersect predicates.

@merlintang
Copy link
Author

consider the USA county example, polygons are not heavy skew distributions and without strange polygons cover whole place , the load balance would not be that bad.

we can rewrite the contains and intersect functions as magellan.

@dongx-psu
Copy link
Member

One thing I think we can do for this feature is implementing a predicate called intersects, and then do Cartesian product + filtering. It is not fast but at least it will work. What do you think?

@merlintang
Copy link
Author

yes, we can do it via this way, but the cartesian product is too bad in the spark as magellan.

one thing come to mind is following.
suppose we have two tables
(1) outer table is the polygon (2) inner table is the points.
when we do this join,
(a) partition the space based on the inner table
(b) use the partitioner of the step 1 to duplicate the outer table based on overlapping
(c) zip operation and nest loop.
(d) reduce step to filter the duplicate results.
this is very basic, could be better than the cartesian.

@ricosfeifei
Copy link
Member

ricosfeifei commented Dec 21, 2016 via email

@merlintang
Copy link
Author

merlintang commented Dec 21, 2016 via email

@ricosfeifei
Copy link
Member

ricosfeifei commented Dec 21, 2016 via email

@mithdrann
Copy link

Hi, did you start working on this feature?. I am working on a comparative study of spatial data management tools based on the spatial join operation between points and polygons (within, contains, ...) and I would like to add Simba to this work.

@dongx-psu
Copy link
Member

Sorry for the late response. We did not implement it yet in the system. We can implement a simple version use the same algorithms (which is most likely SJMR) adopted by other major systems. I believe the main difference on performance then will be on detailed system design choices.

@mithdrann
Copy link

Thank you for your answer. I agree with you, system design choices can make the difference. Even implementation choices can do it.
It would be nice to have this simple version but I only have 3 weeks to finish this work. I wonder whether you could implement it before this deadline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants