How does Morpheus store the graph on HDFS? #935
Comments
In Morpheus, all file-system-based data sources (CSV, Parquet, ORC) store graphs in a well-known directory structure.
For CSV, you can find an example here: https://github.com/opencypher/morpheus/tree/master/morpheus-examples/src/main/resources/fs-graphsource/csv/products
Thanks a lot! That example is great. It looks to me like relationships are stored in a table with src and destination columns, whereas Neo4j stores all connected edges as a doubly-linked list for efficient retrieval (avoiding the joins required for queries such as "given a node, get all its children" or, more generally, "get all connected vertices"). I wonder whether something like that is possible with HDFS. If relationships are stored in a table with src and destination columns, then a query like "given a node, get the entire subgraph up to, say, 10 levels deep" would require a lot of joins, wouldn't it? Sure, it is parallelizable, but distributed joins can be costly depending on how much data has to be moved. At the very least, this is not something we can use for OLTP, is it? It looks more suited to OLAP.
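To make the cost concern concrete, here is a minimal sketch of a depth-limited expansion over an edge table, where each level of depth corresponds to one more join against the relationship table. The table contents and the function name `expand_by_joins` are hypothetical and chosen only for illustration; this is plain Python, not how Morpheus or Spark actually executes the query.

```python
from collections import defaultdict

# Hypothetical relationship table: one row per edge, as (src, dst) pairs.
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (5, 6)]


def expand_by_joins(edges, start, max_depth):
    """Collect the nodes reachable from `start` within `max_depth` hops.

    Each loop iteration stands in for one join of the current frontier
    against the relationship table on frontier.id = edges.src.
    """
    # Index the edge table by src once (the build side of a hash join).
    by_src = defaultdict(list)
    for src, dst in edges:
        by_src[src].append(dst)

    visited = {start}
    frontier = {start}
    joins = 0
    for _ in range(max_depth):
        if not frontier:
            break
        joins += 1
        # Probe side: look up every frontier node in the edge index.
        next_frontier = {dst for node in frontier for dst in by_src.get(node, ())}
        frontier = next_frontier - visited
        visited |= frontier
    return visited, joins


reachable, joins = expand_by_joins(edges, start=1, max_depth=3)
```

In a distributed setting, each of those iterations may also move frontier or edge rows between machines, which is where the cost grows with depth.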
You're right. The storage layout is very different from Neo4j. With Morpheus, we had a relational abstraction in mind. As you correctly highlighted, this requires two join operations for a 1-hop traversal. Morpheus, in contrast to Neo4j, focuses on global queries (OLAP), data integration (Property Graph Data Sources), and handling multiple graphs (Cypher 10 features + graph catalog). There are some optimizations, such as indexed DataFrames and multi-way join algorithms, that would help us a lot; however, they are not part of Spark (yet).
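The two joins behind a 1-hop traversal can be sketched in plain Python as follows. The node and relationship tables, their column names, and the helpers `hash_join` and `one_hop` are assumptions made for illustration; they do not reflect Morpheus's actual schema or API.

```python
# Hypothetical node and relationship tables (not Morpheus's actual schema).
nodes = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Carol"},
]
rels = [
    {"src": 1, "dst": 2, "type": "KNOWS"},
    {"src": 2, "dst": 3, "type": "KNOWS"},
]


def hash_join(left, right, left_key, right_key):
    """Inner equi-join of two lists of dict rows: build on right, probe with left."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


def one_hop(nodes, rels):
    # Prefix columns so source and target node columns do not collide.
    a = [{"a_id": n["id"], "a_name": n["name"]} for n in nodes]
    b = [{"b_id": n["id"], "b_name": n["name"]} for n in nodes]
    # Join 1: source nodes with relationships on a.id = r.src
    a_r = hash_join(a, rels, "a_id", "src")
    # Join 2: intermediate result with target nodes on r.dst = b.id
    return hash_join(a_r, b, "dst", "b_id")


result = [(row["a_name"], row["type"], row["b_name"]) for row in one_hop(nodes, rels)]
```

Every additional hop in a pattern adds another pair of joins of this kind, which is why the relational layout suits global, scan-heavy queries better than point lookups.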
Joins on a distributed database look like a well-studied problem, so I am not really sure that any join algorithm can be efficient for deep traversals. Do you know of any distributed database that can do this efficiently? It seems counterintuitive to me that a join can be efficient once the data is distributed across nodes, unless everything is predetermined: the join keys, co-location of rows with the same join key, and so on. If we are stuck with the table representation, nested sets or nested intervals may look much better, but I don't know of any open-source or commercial database that has successfully implemented these models.
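The co-location point can be illustrated with a small sketch. It assumes edge rows are stored partitioned by `src` and node rows by `id` using the same partitioner; the partition count, the modulo partitioner, and the function `rows_to_shuffle` are all hypothetical stand-ins, not the behaviour of any particular system.

```python
NUM_PARTITIONS = 3  # hypothetical cluster size


def partition_of(key, num_partitions=NUM_PARTITIONS):
    # Stand-in for a hash partitioner; real systems hash the key first.
    return key % num_partitions


# Hypothetical relationship table as (src, dst) pairs.
edges = [(1, 4), (1, 2), (2, 3), (3, 6), (4, 5), (5, 8)]


def rows_to_shuffle(edges, join_column):
    """Count edge rows not already on the partition that owns the joined node.

    Assumes edges are stored partitioned by `src` and nodes by `id`,
    both using `partition_of`.
    """
    col = {"src": 0, "dst": 1}[join_column]
    return sum(1 for e in edges if partition_of(e[0]) != partition_of(e[col]))


local_join = rows_to_shuffle(edges, "src")   # join on edges.src = nodes.id
remote_join = rows_to_shuffle(edges, "dst")  # join on edges.dst = nodes.id
```

Under these assumptions the join on `src` needs no data movement, while the join on `dst` does, so a single physical partitioning can co-locate only one side of a traversal step.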
How does Morpheus store the graph on HDFS? Does it lay the graph out the same way Neo4j would on disk?
I am just trying to figure out how to store a graph on HDFS and retrieve it again.