How does Morpheus store the graph on HDFS? #935

Open
kant111 opened this issue Oct 1, 2019 · 4 comments

kant111 commented Oct 1, 2019

How does Morpheus store the graph on HDFS? Does it lay the data out the same way Neo4j does on disk?

I am just trying to figure out how to store and retrieve a graph from HDFS.

Contributor

s1ck commented Oct 1, 2019

In Morpheus, all file-system-based data sources (CSV, Parquet, ORC) store graphs in a well-known directory structure:

└── <graphName>
    ├── capsGraphMetaData.json (generated, contains version information)
    ├── propertyGraphSchema.json (generated, required to read from disk)
    ├── nodes
    │   ├── <Label_A>
    │   │   └── table.[csv|orc|parquet]
    │   └── <Label_B>
    │       └── table.[csv|orc|parquet]
    └── relationships
        └── <RelType_A>
            └── table.[csv|orc|parquet]

For CSV, you can find an example here: https://github.com/opencypher/morpheus/tree/master/morpheus-examples/src/main/resources/fs-graphsource/csv/products
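To make the layout concrete, here is a minimal sketch that computes the paths a graph would occupy under that directory structure. It is not part of Morpheus: the function name `graph_layout`, its parameters, and the sample root path, labels, and relationship types are all hypothetical, and only the directory and file names come from the tree above.

```python
from pathlib import PurePosixPath

SUPPORTED_FORMATS = ("csv", "orc", "parquet")


def graph_layout(root, graph_name, node_labels, rel_types, fmt="parquet"):
    """Return the paths implied by the directory structure described above.

    Hypothetical helper for illustration only; it touches no file system.
    """
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")

    base = PurePosixPath(root) / graph_name
    table = f"table.{fmt}"
    return {
        # Generated files that sit next to the data directories.
        "metadata": base / "capsGraphMetaData.json",
        "schema": base / "propertyGraphSchema.json",
        # One table per node label and one per relationship type.
        "nodes": {label: base / "nodes" / label / table for label in node_labels},
        "relationships": {
            rel: base / "relationships" / rel / table for rel in rel_types
        },
    }


# Placeholder names, not taken from any real graph.
layout = graph_layout("/graphs", "myGraph", ["Label_A", "Label_B"], ["RelType_A"], "csv")
for label, path in layout["nodes"].items():
    print(label, "->", path)
```

The point of the sketch is that nodes and relationships live in separate tables keyed by label or type, which is why reading a pattern back requires joining those tables.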

Author

kant111 commented Oct 1, 2019

Thanks a lot! That example is great.

It looks to me like relationships are stored in a table with source and destination columns, while Neo4j stores all connected edges as a doubly linked list for efficient retrieval (avoiding the joins required for queries like "given a node, get all its children" or, more generally, "get all connected vertices"). I wonder if something like that is possible with HDFS?

If relationships are stored in a table with source and destination columns, then a query like "given a node, get the entire subgraph up to, say, 10 levels deep" would require a lot of joins, wouldn't it? Sure, it is parallelizable, but distributed joins can be costly depending on how much data has to be moved. At the very least, this is not something we can use for OLTP, is it? It looks better suited to OLAP.
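The concern about deep traversals can be sketched in a few lines. The following is an illustration only, assuming an in-memory edge table of `(source, target)` pairs; the function name `expand` and the sample edges are made up. Each level of the traversal is one join between the current frontier and the full edge table, so a depth-10 query needs up to 10 such joins.

```python
def expand(edges, start, max_depth):
    """Collect all nodes reachable from `start` within `max_depth` hops.

    Each loop iteration stands in for one join:
    frontier JOIN edges ON frontier.id = edges.source
    """
    visited = {start}
    frontier = {start}
    joins = 0
    for _ in range(max_depth):
        # Scan the whole edge table and keep rows whose source is in the frontier.
        next_frontier = {dst for src, dst in edges if src in frontier} - visited
        joins += 1
        if not next_frontier:
            break
        visited |= next_frontier
        frontier = next_frontier
    return visited, joins


# A simple chain 1 -> 2 -> 3 -> 4 -> 5 as sample data.
edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
reachable, joins = expand(edges, start=1, max_depth=10)
print(sorted(reachable), joins)
```

In a distributed engine, each of these iterations may also move data between machines, which is where the cost mentioned above comes from.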

Contributor

s1ck commented Oct 2, 2019

You're right. The storage layout is very different from Neo4j's. With Morpheus, we had a relational abstraction in mind. As you correctly highlighted, this requires two join operations for a 1-hop traversal ((:A)-[:B]->(:C)), which, for deep traversals or complex patterns, might end up producing large intermediate results. Storing adjacency list structures in a schema-fixed tabular representation is a possibility and would potentially save us one join per hop. However, in Spark, you cannot just do pointer chasing through the DataFrames: each "hop" from one DataFrame (adjacency list / rels) to another (nodes / intermediate result) requires a join.
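As a rough illustration of the two joins behind a 1-hop pattern such as (:A)-[:B]->(:C), here is a sketch using plain Python lists of dicts in place of DataFrames. This is not Morpheus or Spark code: the `hash_join` helper, the column names (`a_id`, `source`, `target`, `c_id`), and the sample rows are all assumptions made for the example.

```python
def hash_join(left, right, left_key, right_key):
    """Inner join of two lists of dicts using a hash index on the right side."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in index.get(l_row[left_key], [])
    ]


# Three separate tables, mirroring one table per label / relationship type.
a_nodes = [{"a_id": 1}, {"a_id": 2}]
b_rels = [
    {"source": 1, "target": 10},
    {"source": 1, "target": 11},
    {"source": 2, "target": 99},  # target 99 is not a :C node
]
c_nodes = [{"c_id": 10}, {"c_id": 11}]

# Join 1: start nodes with relationships on the source column.
a_with_rels = hash_join(a_nodes, b_rels, "a_id", "source")
# Join 2: intermediate result with end nodes on the target column.
matches = hash_join(a_with_rels, c_nodes, "target", "c_id")

one_hop = [(row["a_id"], row["c_id"]) for row in matches]
print(one_hop)
```

Note that the intermediate result `a_with_rels` has three rows while the final result has two; for longer patterns these intermediate results are what can grow large.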

Morpheus - in contrast to Neo4j - has a focus on global queries (OLAP), data integration (Property Graph Data Sources) and handling multiple graphs (Cypher 10 features + graph catalog).

There are some optimizations, like indexed DataFrames and multi-way join algorithms, that would help us a lot; however, they are not part of Spark (yet).

Author

kant111 commented Oct 2, 2019

Joins on a distributed database look like a well-studied problem, so I am not really sure any join algorithm can be efficient for deep traversals. Do you know of any distributed database that can do this efficiently? It sounds counterintuitive to me that a join can be efficient once the data is distributed across nodes, unless everything is predetermined, such as the join keys, co-location of rows with the same join key, etc.
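The co-location idea can be sketched as follows, with lists standing in for the partitions of a cluster. This is a toy model under stated assumptions, not how any particular database implements it; the function names and sample tables are hypothetical. If both tables are partitioned by the same hash of the join key, matching rows end up in the same partition and each partition can be joined locally without moving data.

```python
def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition based on the hash of its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def local_join(left_part, right_part, left_key, right_key):
    """Inner join restricted to the rows of a single partition pair."""
    index = {}
    for row in right_part:
        index.setdefault(row[right_key], []).append(row)
    return [
        {**l_row, **r_row}
        for l_row in left_part
        for r_row in index.get(l_row[left_key], [])
    ]


def copartitioned_join(left, right, left_key, right_key, num_partitions=4):
    """Join two tables partition by partition, never across partitions."""
    left_parts = hash_partition(left, left_key, num_partitions)
    right_parts = hash_partition(right, right_key, num_partitions)
    result = []
    for left_part, right_part in zip(left_parts, right_parts):
        result.extend(local_join(left_part, right_part, left_key, right_key))
    return result


# Sample data: 8 nodes in a ring, each pointing to the next one.
nodes = [{"id": i} for i in range(8)]
rels = [{"source": i, "target": (i + 1) % 8} for i in range(8)]
joined = copartitioned_join(nodes, rels, "id", "source")
print(sorted((row["id"], row["target"]) for row in joined))
```

The catch for traversals is that the result of one hop is keyed by `target`, not `source`, so the next hop generally needs the intermediate result to be repartitioned on a different column.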

If we are stuck with the tabular representation, maybe the nested set or nested interval models would be a much better fit, but I don't know of any open-source or commercial database that has successfully implemented them.
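For readers unfamiliar with the nested set model mentioned here, a minimal sketch follows. It assumes the data is a tree (the model encodes hierarchies, not arbitrary graphs with cycles or multiple parents), and the function names and sample tree are made up for illustration. Each node gets a `(left, right)` interval from a depth-first walk, and all descendants of a node are found with a range check instead of one join per level.

```python
def nested_set_intervals(children, root):
    """Assign (left, right) numbers to every node of a tree via depth-first walk."""
    intervals = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter
        for child in children.get(node, []):
            visit(child)
        counter += 1
        intervals[node] = (left, counter)

    visit(root)
    return intervals


def descendants(intervals, node):
    """All nodes whose interval lies strictly inside the interval of `node`."""
    low, high = intervals[node]
    return sorted(
        name
        for name, (left, right) in intervals.items()
        if low < left and right < high
    )


# Sample tree: a has children b and c; b has child d.
tree = {"a": ["b", "c"], "b": ["d"]}
intervals = nested_set_intervals(tree, "a")
print(intervals)
print(descendants(intervals, "a"))
```

The trade-off is that inserting or moving a node can require renumbering many intervals, so the model favors read-heavy hierarchies.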
