Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subscriptions #337

Open
gedw99 opened this issue Apr 1, 2023 · 11 comments
Open

Subscriptions #337

gedw99 opened this issue Apr 1, 2023 · 11 comments

Comments

@gedw99
Copy link

gedw99 commented Apr 1, 2023

Is there a way to subscribe to a query and only get changed / new / deleted results ?

@gruuya
Copy link
Contributor

gruuya commented May 22, 2023

Not at the moment or in the near-term future, but it could be something we'd want to add later on (sorry for a bit of a late response).

@gedw99
Copy link
Author

gedw99 commented May 22, 2023

On thanks .. will close then

@gedw99 gedw99 closed this as completed May 22, 2023
@gedw99
Copy link
Author

gedw99 commented Jun 11, 2023

@gruuya

i Wonder if the new rust delta lane storage changes this ??

https://github.com/delta-io/delta-rs

@gruuya
Copy link
Contributor

gruuya commented Jun 12, 2023

Not really; note that the delta-rs migration effectively only changed the storage layer behind DataFusion's TableProvider::scan/ExecutionPlan::execute, and doesn't offer some primitives/hooks for query changed / new / deleted results type of subscriptions.

If you have some particular scenario/problem in mind feel free to add it here, so that it can serve as a guideline if/when we come back to this issue (it may also be that we think of another solution).

@gedw99
Copy link
Author

gedw99 commented Jun 12, 2023

Well I am thinking about how a users does a mutation into data fusion. One way is use CDC off the original system like postresql or google sheets.

So the mutations goes into that system.

Then we have the reactive data scenarios. If you have a CDC then you can see the changes of the master data source like postresql or google sheets, so then you can get the the middle and correlate what data changes against what your uses are currently looking out that they did a query for a minute again , and so send them a change stream.

This only works though if you can do that correlation.

do you get the concept ?

i already do cdc on all this stuff .

also you guys are not using Flight SQL yet ? Cause this helps with doing the correlation .

@gruuya
Copy link
Contributor

gruuya commented Jun 13, 2023

so then you can get the the middle and correlate what data changes against what your uses are currently looking out that they did a query for a minute again , and so send them a change stream.

I'm probably missing something here, but if the underlying issue is simply about making sure the users see the latest data note that this happens automatically once a change has been made. This touches on

how a users does a mutation into data fusion

in Seafowl we support a couple of ways of changing data (all of them batch-oriented):

As soon as some data is written to a table, this results in a new table version. This annuls the cached responses, and all subsequent queries always target the latest version by default.

You can also target previous table versions using the time travel syntax (as well as look at e.g. the diff between successive table versions).

do you get the concept ?

More likely, maybe you're referring to enabling streaming replication from master data sources using CDC files in a particular format, as a new approach to changing/writing data in addition to the ones above?

also you guys are not using Flight SQL yet ?

No, not currently.

@gedw99
Copy link
Author

gedw99 commented Jun 13, 2023

Thanks @gruuya for the in-depth links. It’s me that’s missing something :)

I did not realise these aspects were done by seafowl for mutations. I am still getting familiar with data fusion aspects.

about Streaming changes via CDC. I currently do this already in other projects using nats message broker.
The use case is that when a user updates their google sheets ir DB, they can see it in data fusion system.

I really want to adopt DataFusion and so am brain storming it with you.

The more I think about it the more I see it as being non intrusive. The cdc system just needs a way to talk to data fusion system and transform the data in.

I can track what user is “ watching what query “ out of bounds , and so then just rerun the query on data fusion and send the deltas to the user. I wonder if there is a way to track the version of a query and the data as a hash ? So then because data fusion versions data I can iWork out the diff a d send that to the user. So then I have subscriptions out of bounds of the data fusion system.

If you want to have a chat / video about it I would really like to explore it ..

@gedw99
Copy link
Author

gedw99 commented Jun 13, 2023

Am reopening . Hoping to discuss this ..

@gedw99 gedw99 reopened this Jun 13, 2023
@gedw99
Copy link
Author

gedw99 commented Jun 13, 2023

(as well as look at e.g. the diff between successive table versions)

Is this available at the API / http level ? Cause then I can do subscriptions and only send what changed. I do all this by wrapping seafowl api with NATS message broker and so know what users are doing what. It also gives my global load balancing and auth / authz.

@gruuya
Copy link
Contributor

gruuya commented Jun 13, 2023

Aha no worries, I think we're (getting) on the same page.

I wonder if there is a way to track the version of a query and the data as a hash ?

You could look at the ETag of the response, as this is currently based on the which specific data files are used in a query. Note that this doesn't involve changes to the query plan, hence issue #57. (There's also the system.table_versions metadata table for browsing table versions if that's useful.)

If you want to have a chat / video about it I would really like to explore it ..

Oh sure, I'd love to see if we can converge here—how about discord for starters: https://discord.gg/eFEFRKm?

Is this available at the API / http level ?

Yup, it works as any other query (check out the prior-to-last section of this blog post for an example); we just have the awkward syntax since the AS OF clause isn't supported by sqlparser for now.

@mrchypark
Copy link
Contributor

I'm also interested in this topic.

I'm grateful to be trying Seafowl for log storage. As new data is constantly coming in, the table version keeps increasing, which causes the cache to be invalidated continuously.

As a workaround strategy, I'm creating tables by date or using similar approaches.

However, it would be great if we could achieve the same effect within a single table.

I'm not quite sure how to do this yet!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants