Scaling with message queues #16

Open
gedw99 opened this issue Mar 3, 2025 · 2 comments
Labels: clusternode (relating to cluster nodes), question (further information is requested)

Comments


gedw99 commented Mar 3, 2025

Hey,

Glad to see SemaDB is still alive and kicking!

This is very much a due diligence / discussion...

I am doing some research on how best to put a search system in place in a system that uses many storage mechanisms. Typically you have structured (SQL) data and unstructured (document) data.

Then you have graph DBs too, which try to do these and other things all in one.

I am looking at Sema and Zinc.
They both seem to be designed for document indexing and search.

First difference is the bleve deps...

https://github.com/zincsearch/zincsearch uses github.com/blugelabs/bluge

https://github.com/Semafind/semadb uses github.com/blevesearch/bleve/v2


Next one is scale out. You include a Cluster of sorts it seems.

A lot of teams end up with an event stream of mutations that goes into a message queue or similar, and then have "command agents" that pick up each mutation and transform it into each store to keep them "up to date".

I use NATS JetStream, for example...

So I was wondering if you could comment on how that relates to your cluster, because with a message queue you can "ensure" that many store instances get updates eventually.
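To make the pattern concrete, here is a minimal in-process sketch of the "event stream plus command agents" idea, using Go channels in place of a real queue. It is not NATS or SemaDB code: `Mutation`, `Store`, and `FanOut` are hypothetical names made up for illustration, and a real setup would swap the channels for durable subscriptions.

```go
package main

import (
	"fmt"
	"sync"
)

// Mutation is a hypothetical change event; the fields are illustrative only.
type Mutation struct {
	Seq   int
	Key   string
	Value string
}

// Store stands in for any backing store (SQL, document, search index).
type Store struct {
	mu      sync.Mutex
	data    map[string]string
	lastSeq int
}

func NewStore() *Store { return &Store{data: make(map[string]string)} }

// Apply is idempotent: a redelivered mutation whose sequence number
// has already been seen is ignored.
func (s *Store) Apply(m Mutation) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if m.Seq <= s.lastSeq {
		return
	}
	s.data[m.Key] = m.Value
	s.lastSeq = m.Seq
}

// Get returns the current value for key, or "" if it is absent.
func (s *Store) Get(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[key]
}

// FanOut delivers every mutation, in order, to one agent goroutine per
// store and returns once all agents have drained their queue.
func FanOut(mutations []Mutation, stores []*Store) {
	var wg sync.WaitGroup
	queues := make([]chan Mutation, len(stores))
	for i, st := range stores {
		queues[i] = make(chan Mutation, len(mutations))
		wg.Add(1)
		go func(q <-chan Mutation, st *Store) {
			defer wg.Done()
			for m := range q {
				st.Apply(m)
			}
		}(queues[i], st)
	}
	for _, m := range mutations {
		for _, q := range queues {
			q <- m
		}
	}
	for _, q := range queues {
		close(q)
	}
	wg.Wait()
}

func main() {
	stores := []*Store{NewStore(), NewStore()}
	stream := []Mutation{
		{Seq: 1, Key: "doc1", Value: "draft"},
		{Seq: 2, Key: "doc1", Value: "final"},
		{Seq: 2, Key: "doc1", Value: "final"}, // simulated redelivery
	}
	FanOut(stream, stores)
	for i, st := range stores {
		fmt.Println(i, st.Get("doc1"))
	}
}
```

The sequence check in `Apply` is what lets every store converge even if the queue delivers a message more than once.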


One other point is change streams. Does Sema tell me via HTTP, SSE, or some other channel that a record changed and the nature of the change? This is sometimes called CDC (change data capture).

This can be vital when you want other things to happen when things happen inside Sema.

Hope you don't mind me raising an issue like this, but as I said, it's important when picking stores to understand how they are designed, etc. I tend to go for Golang ones and not Rust ones since I code in Golang.

Thanks in advance.

nuric added the question and clusternode labels Mar 4, 2025
nuric (Member) commented Mar 4, 2025

Hello again and it's good to see you haven't given up on NATS and using message queues to scale out since #10. To comment on the discussion points:

SemaDB uses bleve only for text analysis, not for indexing, and doesn't depend on its search capabilities. The analysis involves splitting the text into terms, and then we implement TF-IDF ourselves. This is done to keep SemaDB as self-contained as possible, although bleve is a great project.
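For readers unfamiliar with the "split into terms, then TF-IDF" step, a rough sketch follows. It is not SemaDB's implementation and does not use bleve: the tokenizer is a plain letter/digit split, the weighting is the textbook tf = count/len(doc) and idf = ln(N/df), and `Tokenize` and `TFIDF` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
	"strings"
	"unicode"
)

// Tokenize lowercases the text and splits on anything that is not a
// letter or digit. A real analyzer would also handle stemming, stop
// words, and so on.
func Tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// TFIDF returns, for each document, a map from term to weight where
// tf = count / len(doc) and idf = ln(N / df).
func TFIDF(docs []string) []map[string]float64 {
	tokens := make([][]string, len(docs))
	df := make(map[string]int) // number of documents containing each term
	for i, d := range docs {
		tokens[i] = Tokenize(d)
		seen := make(map[string]bool)
		for _, t := range tokens[i] {
			if !seen[t] {
				seen[t] = true
				df[t]++
			}
		}
	}
	n := float64(len(docs))
	out := make([]map[string]float64, len(docs))
	for i, toks := range tokens {
		out[i] = make(map[string]float64)
		counts := make(map[string]int)
		for _, t := range toks {
			counts[t]++
		}
		for t, c := range counts {
			tf := float64(c) / float64(len(toks))
			out[i][t] = tf * math.Log(n/float64(df[t]))
		}
	}
	return out
}

func main() {
	weights := TFIDF([]string{"the cat sat", "the dog sat down"})
	// Terms shared by every document get weight 0; rarer terms score higher.
	fmt.Printf("the=%.3f cat=%.3f\n", weights[0]["the"], weights[0]["cat"])
}
```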

We discussed horizontal scaling using NATS in #10, and at this stage we're not thinking of incorporating a separate moving component like NATS JetStream. The main limitation of SemaDB's horizontal scaling is on the write path: the shard where the data is allocated must be online for the write operation to succeed.

At the moment SemaDB doesn't have a change stream mechanism because full replication isn't implemented. Each collection is assigned a primary server and all write requests are linearised through that server.
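As a toy illustration of "one primary per collection, writes linearised through it", the sketch below routes each write to the collection's single primary and fails if that primary is offline. This is only a model of the behaviour described above, not SemaDB's code: `Primary`, `Router`, and `ErrPrimaryOffline` are hypothetical names, and a mutex stands in for whatever ordering mechanism the real server uses.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrPrimaryOffline is returned when the owning server cannot accept writes.
var ErrPrimaryOffline = errors.New("primary for collection is offline")

// Primary is a hypothetical server that owns one collection.
type Primary struct {
	mu     sync.Mutex
	online bool
	log    []string // records in the order they were accepted
}

// SetOnline toggles whether the primary accepts writes.
func (p *Primary) SetOnline(v bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.online = v
}

// Write appends the record and returns its 1-based position in the log.
// The mutex gives concurrent writers a single total order.
func (p *Primary) Write(record string) (int, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if !p.online {
		return 0, ErrPrimaryOffline
	}
	p.log = append(p.log, record)
	return len(p.log), nil
}

// Router maps each collection name to its single primary.
type Router struct {
	primaries map[string]*Primary
}

// Write forwards the record to the collection's primary.
func (r *Router) Write(collection, record string) (int, error) {
	p, ok := r.primaries[collection]
	if !ok {
		return 0, fmt.Errorf("unknown collection %q", collection)
	}
	return p.Write(record)
}

func main() {
	p := &Primary{}
	p.SetOnline(true)
	r := &Router{primaries: map[string]*Primary{"docs": p}}

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			pos, err := r.Write("docs", fmt.Sprintf("record-%d", i))
			fmt.Println("position", pos, "err", err)
		}(i)
	}
	wg.Wait()

	p.SetOnline(false)
	_, err := r.Write("docs", "late")
	fmt.Println("offline write rejected:", errors.Is(err, ErrPrimaryOffline))
}
```

With no replica to fall back on, the write path is only as available as that one primary, which is the limitation mentioned above.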

I'll leave this issue open again in case others want to comment.

nuric changed the title from "comparision" to "Scaling with message queues" Mar 4, 2025
gedw99 (Author) commented Mar 5, 2025

Wonderfully concise answers. Thanks @nuric

I’m going to kick the tires on this.

I use NATS for a ton of projects, so I'll see how the land lies with Sema.

My use case is documents and videos / images that need to be everywhere and edited everywhere, so I have been using CRDTs for multi-version merging in general.

The schema of the document can change, so I have been using WASM: the main host sniffs the version and loads the WASM module that matches the doc version.

It’s for a myriad of uses, but think Google Workspace (or whatever they call it these days) or Apple cloud for the masses…

I figure it’s worth explaining the use case so we can compare notes, as they say.
