Support Analytics Pipelines #252
This functionality seems good.

Primary use case:

Thoughts/Discussions:

---
Based on feedback, a redesign using existing pipeline concepts:

```yaml
- implementation: nodestream.analytics:CopyData
  arguments:
    source: persistent-graph
    nodes:
      - Person
    relationships:
      - KNOWS
- implementation: nodestream.analytics:ProjectGraph
  arguments:
    nodes:
      - Person
    relationships:
      - KNOWS
- implementation: nodestream.analytics:RunAlgorithm
  arguments:
    algorithm: weakly_connected_components
    parameters:
      writeProperty: "weakly_connected_components"
- implementation: nodestream.analytics:RunAlgorithm
  arguments:
    algorithm: degree_centrality
    parameters:
      writeProperty: "degree_centrality"
    node_types:
      - Person
    relationship_types:
      - KNOWS
- implementation: nodestream.analytics:ExportResultsToDatabase
  arguments:
    target: persistent-graph
    nodes:
      - type: Person
        properties:
          - weakly_connected_components
          - degree_centrality
```
Also adding a run-query command as an escape hatch / for incremental modifications:

```yaml
- implementation: nodestream.analytics:RunCypher
  arguments:
    query: "MATCH (n:Person) WHERE (n)-[:KNOWS]->(:Person {name: 'Zach'}) SET n:ZachFriend"
```

---
Yeah, I like this. I think it will be easier on the end user to use the same pipeline construct formats for the analytics pipelines.

---
I like this approach... it's incremental and reduces the cognitive requirements on end users for understanding concepts. 👍

---
Overall, I think this makes sense. I have a few other thoughts/comments below.

1/ One thing to highlight here is that some of the concepts here seem very specific to Neo4j versus the way that other DBs handle things. One specific area is the "persistent" versus "projected" graph constructs. This is the way that Neo approaches analytics; however, most other databases such as Neptune Analytics, Memgraph, TigerGraph, etc. allow for running these algorithms on the data itself instead of on a projection. I am just curious whether the proposal here needs any changes, or whether in these scenarios the pipeline would simply be much simpler, with fewer steps, for other DBs versus Neo. e.g. in Neptune Analytics, you can run WCC and Degree (similar to #252 (comment)) and save the data back to the graph using just this:
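A sketch of such a query (the `neptune.algo.*` procedure names and arguments below are assumptions based on Neptune Analytics' algorithm namespace, not verified syntax):

```cypher
// Sketch: compute WCC over the whole graph and write each node's
// component id back as a property; a similar neptune.algo.degree.mutate
// call would cover Degree. Procedure name and argument are assumed.
CALL neptune.algo.wcc.mutate({writeProperty: "wcc"})
```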
2/ One of the other differences in Neptune Analytics is that the implementation integrates the algorithms into openCypher (OC) syntax in a much richer way than is supported in Neo, which allows for chaining multiple algorithms together or combining OC's composability constructs with algorithms. Examples:
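For instance, invoking an algorithm inline over a matched set of nodes (a sketch; the procedure signature and YIELD columns are assumptions):

```cypher
// Sketch: run WCC only over the Person nodes matched by the query.
MATCH (n:Person)
CALL neptune.algo.wcc(n) YIELD node, component
RETURN node, component
```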
or
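chaining one algorithm's output into another within a single query (same caveat on assumed signatures):

```cypher
// Sketch: feed WCC results into a degree computation in one query.
MATCH (n:Person)
CALL neptune.algo.wcc(n) YIELD node, component
WITH node, component
CALL neptune.algo.degree(node) YIELD degree
RETURN node, component, degree
```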
For these sorts of use cases, would we just recommend using the `RunCypher` escape hatch?

3/ Any thoughts on allowing for additional constructs, like generating embeddings to add to the loaded data?

4/ What about allowing query results to be saved to a data lake / S3 bucket or something similar instead of back into a DB? I highlight this because one of the common use cases I am seeing is customers wanting to use an analytics graph to generate features for downstream model training. So the general pattern is: load data, run a few queries, save the results to S3 (or similar). This may be a bit more than what nodestream is intended for, but I thought it was at least worth discussing. (A sketch of what such a step could look like follows.)
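A hypothetical step in the style of `ExportResultsToDatabase` above (the `ExportResultsToS3` name and all of its arguments are invented here purely for illustration):

```yaml
# Hypothetical: the step name and every argument below are illustrative
# only, not part of the proposal or of nodestream.
- implementation: nodestream.analytics:ExportResultsToS3
  arguments:
    bucket: my-feature-bucket
    prefix: person-features/
    format: parquet
    nodes:
      - type: Person
        properties:
          - weakly_connected_components
          - degree_centrality
```

---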
@bechbd Thanks for the feedback. Here are some of my questions/thoughts/reactions to your comments.
1/ That's a good question. I think that this is probably actually a candidate for renaming. Internally, maybe something more like
2/ Yeah, that's a good question... I hadn't thought of that. I do think this model as it is today limits the possibility of doing any kind of chaining. I think for now, this definitely does fall into the kind of problem set that the `RunCypher` escape hatch is meant to cover.
3/ Absolutely! I think this is totally on the cards. The algorithms represented were more for demonstration purposes than meant as a comprehensive list. We're getting to the phase where the concept is validated and we can move into scoping an initial version.
4/ This is a good idea. It would definitely take some designing (maybe as a phase 2 for this), but it absolutely makes sense. I had a person ask me in person yesterday if the intention was to support this, to feed into a SageMaker model and the like, for those that are extracting features for an ML model.

---
I'm removing this from the 0.13 milestone. Likely this will be moved from the core of nodestream to a sister package.

---
Background
One of the primary use cases for graph databases is running analytics and ML workloads.
Requirements / Principles
If `nodestream` were to support analytics jobs, it would be ideal for it to follow the same core principles as the rest of nodestream.

Implementation Details
Implementation could essentially follow a similar design approach as `migrations` is taking: the core framework handles as much as is prudent and defers to the database connector (which can optionally support the feature) to perform the actual work of data analysis. Steps like `copy` and `export` mentioned below can be implemented using nodestream's existing copy and pipeline features to retrieve and map data.

Example Project File
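A sketch of what the project file could contain (the target names match the example below, but the exact keys here are assumptions about nodestream's project file format rather than verified syntax):

```yaml
# Sketch only: key names are assumptions, not verified nodestream syntax.
targets:
  persistent-graph:
    database: neo4j
    uri: bolt://persistent.example.com:7687
    username: neo4j
  anaylitics-graph:
    database: neo4j
    uri: bolt://analytics.example.com:7687
    username: neo4j
```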
Example Analysis File
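A condensed sketch along the lines of the redesign proposed in the comments above (same step names as there):

```yaml
- implementation: nodestream.analytics:CopyData
  arguments:
    source: persistent-graph
    nodes:
      - Person
    relationships:
      - KNOWS
- implementation: nodestream.analytics:RunAlgorithm
  arguments:
    algorithm: weakly_connected_components
    parameters:
      writeProperty: "weakly_connected_components"
- implementation: nodestream.analytics:ExportResultsToDatabase
  arguments:
    target: persistent-graph
    nodes:
      - type: Person
        properties:
          - weakly_connected_components
```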
This example pipeline facilitates the copying of data from `persistent-graph` to `anaylitics-graph`. From there it runs some topological analysis algorithms and persists the results back in `persistent-graph`.

It can be run with:

```
nodestream analytics run example --target anaylitics-graph
```