Project: data architecture improvements #192
As I sketch out new data structures, I notice that my approach is highly entangled with the important question of how to efficiently sync these structures (i.e. see #188). So I’ll have to discuss both at once. I also notice just how much complexity I’ve caused for myself by framing the data model as a hash graph, with stable content-addressed identifiers and immutable contents. Yes, such structures have many excellent theoretical properties, but boy do they make it more difficult to iterate, and that is absolutely the wrong trade to be making for this project right now. Some notes on data structure desiderata:
Modulo the complication of the many-to-one relationship between task states and task content, my instinct would be to simply combine all these elements. The tractability of that approach really depends on how we store, sync, and query tasks. I’ll now describe three approaches to syncing which seem plausible to me.

(A much simpler) event sourcing model

The log-driven state model in the current system is more typically called an “event sourcing” model. The idea is basically: you don't transmit the state of the system; you transmit actions, then compute an effective state locally. State is always just a cache, computed over the action stream. This doesn’t have to be terribly complex, but our current model includes several decisions which dramatically increase the complexity:
All these things create nice properties! But they also drastically increase the complexity of the system, both to maintain and to iterate. We could create a much simpler event sourcing model in which:
Attachments would still need to be fetched out-of-band on native clients, which is a nuisance for maintaining consistency. But this would be much, much easier to maintain than our current model. This doesn’t imply much change to our data stores: we’d still use LevelDB / IDB to store and query. We can still read/write to Firestore for now on the server, but this much dumber model will be easier to write a self-hosted backend for.

Sync states, not events

An even “dumber” approach would be to sync the computed state snapshots, rather than the events which produced them, with last-writer-wins or very simple conflict resolution logic. The simplest form of this would be to consider each task as a single document, with a modification time. To sync: keep track of the last time you synced with a server; send/request all documents modified after that time. These documents could even be serialized as simple files on disk. We could implement a trivial “backend” via rsync to GCS. People could self-host with Dropbox! Mobile and web clients would use LevelDB / IDB directly as their document store, rather than the file system, so I suppose we’d need a separate rsync-like backend for them to sync. Orbit clients would need to separately maintain an efficient index (e.g. of tasks by due time) in order to present their interfaces efficiently. This turns out to be more possible than I’d thought. I was somewhat surprised to find while investigating this model that even running through Node.js, I can …

Computing server indexes (e.g. for notifications) becomes more complicated in this regime: we’d need to attach cloud functions to GCS file changes and update a queryable database accordingly. Not so bad. A variant of this model, which would be somewhat more resilient to conflicts, would be to model each task as a folder comprising … This would be quite an invasive change to both server and client. I’d probably begin with this model by implementing the all-in-one task-as-document model, using LevelDB/IDB as store, since the mobile/web clients need that anyway. Then we could add a file system-based store and better conflict resiliency in time. One interesting advantage of this model is that clients don’t need to contain the complex logic necessary to reconstruct a task state from its actions. If you wanted to write something that interacted with Orbit data structures from, say, Python, it’d be much easier this way.

Lean on CouchDB

Both of these approaches involve implementing our own syncing protocol to varying degrees. I can’t help feeling like this is silly, and something we should really try to avoid. Isn’t there a general-purpose syncable data store we can use? This is basically the mission of Apache CouchDB. CouchDB is written in Erlang, so it’s something we’d run on the server; PouchDB is a Javascript implementation. The big draw of the system is that it implements its own replication, certainly much more robustly than anything we’d manage. The database model itself is much like LevelDB, so we already know the model is a decent fit for our needs. PouchDB would store the database in native LevelDB bindings in our native apps and IDB on the web. We could use CouchDB to replicate a data model based on either of the two approaches above (event-based or document-based). If we wanted to expose a file system interface for the latter, we could always use FuseFS or something.
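To give a sense of how little syncing code this option would leave us to write, here's a minimal sketch of PouchDB replication against a per-user CouchDB database. The database name, remote URL, and auth details here are made up for illustration; they aren't anything Orbit actually uses.

```typescript
import PouchDB from "pouchdb";

// Local store: LevelDB bindings under Node / native apps, IndexedDB on the web.
const local = new PouchDB("orbit-tasks");

// A per-user database on a CouchDB server (self-hosted or ours).
const remote = new PouchDB("https://couch.example.com/user-12345", {
  auth: { username: "user-12345", password: "app-token" },
});

// Continuous bidirectional replication; CouchDB/PouchDB handle retries,
// checkpoints, and conflict bookkeeping for us.
local
  .sync(remote, { live: true, retry: true })
  .on("error", (error) => console.error("replication error", error));
```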
Another interesting advantage of CouchDB is that it’s a general replication protocol (like rsync), so anyone could run their own CouchDB server, point Orbit at it, and voila! Self-hosted sync, with no extra effort. This is great! My primary concerns with adopting CouchDB:
It would also create some extra complexity on our backend. The semantics of its current authentication model would mean giving each user their own database. It’s not possible to make queries across databases, so we’d need to run a service which selectively replicates active user databases into an admin-read-only “merged” database to run cross-user queries (e.g. for notifications). We’d also need to build some glue to bridge our existing authentication system into its internal user management system. I don’t think either of these things would be too onerous, but they do add complexity and failure surface.

I’m going to let this simmer for the weekend. Each approach has significant advantages and disadvantages; there isn’t a clear winner in my mind at the moment.
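For comparison, here's a rough sketch of what the "sync states, not events" approach might boil down to: each task as one self-contained document with a modification time, synced last-writer-wins against a checkpoint. All of the interfaces here are hypothetical.

```typescript
// One task, stored as a single self-contained document.
interface TaskDocument {
  id: string;
  contents: unknown; // prompt spec, attachment references, etc.
  state: unknown; // computed scheduling state
  modifiedAtMillis: number;
}

interface DocumentStore {
  listModifiedSince(timestampMillis: number): Promise<TaskDocument[]>;
  // upsert is assumed to keep whichever copy has the later modifiedAtMillis
  // (i.e. last-writer-wins).
  upsert(docs: TaskDocument[]): Promise<void>;
}

// Push local changes, pull remote changes, remember when we last talked
// to this server.
async function syncDocuments(
  local: DocumentStore,
  remote: DocumentStore,
  lastSyncMillis: number,
): Promise<number> {
  const [localChanges, remoteChanges] = await Promise.all([
    local.listModifiedSince(lastSyncMillis),
    remote.listModifiedSince(lastSyncMillis),
  ]);
  await remote.upsert(localChanges);
  await local.upsert(remoteChanges);
  return Date.now(); // new checkpoint
}
```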
Reflecting on this over the weekend, one thing which makes me lean towards event-sourcing (or at least away from document-based methods) is that for research analysis purposes, I really do need a log of events which took place. It's not even enough to just separate out all the review actions, since I'll almost certainly also be interested in events like "user deleted this prompt" (was that because it wasn't interesting/meaningful? this is an important behavior to examine!). So if I go with the document-based model, I end up needing to implement diffing to extract the events. The combination is perhaps still less complex than the event-sourcing model, but it strikes me as brittle/lossy, and it does seem to be giving up much of the simplicity the document-based model enjoyed.
Very interesting. There isn't a clear winner in my mind either, but a few points came to mind as I was reading:
Thanks, those are yet more good reasons to lean towards a relatively minimalistic event sourcing model! I'm digging in on the next level of detail now…
Last night I pushed the work I’ve got so far on a new data model and approach to syncing. It’s in a new package called store, which is also (temporarily) holding a new version of core called core2. Overall, I’ve tried to use this opportunity to simplify things a great deal, while leaving us more room to expand in the near future. Key differences in the data model:
The broad architecture for these data types is the same, but made somewhat more general. I’ve implemented a standardized event sourcing data store. What that means is that the only real things in this system are events, which describe changes to entities. The only type of entity right now is a task, but future entities might include collections and tags. Task data structures aren’t modified directly; they’re computed from the event sequence through a “reducer” which defines how an event modifies an entity. For the sake of performance, these snapshots are computed when writing new events to the data store, saved, and indexed. But they should be viewed as a cache; events are the source of truth. Events and entity snapshots are stored in a database that’s meant to be an accessible file format.

My intent is to make the store library a public API. If you want to write a desktop script / app which interacts with Orbit (like our note syncing script), the best way to do it is to have that script simply read/write the local data store. Then, if it wants, it can explicitly sync the local store with the server. This way changes from local scripts can be reflected offline, and without a round trip to a server API. The server API becomes something to use in contexts where you wouldn’t have a full local store, like a highlighting mashup embedded into a web page.

Syncing is pretty straightforward in this model:
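Concretely, one round of sync might look something like this sketch. The interfaces are hypothetical, not the actual store API; each side just needs to answer "what haven't I seen yet?"

```typescript
// A hypothetical sketch of one client/server sync round in this model.
interface SyncedEvent {
  event: unknown; // e.g. a task event
  timestampMillis: number; // whose clock? see below
}

interface EventStore {
  eventsAfter(timestampMillis: number): Promise<SyncedEvent[]>;
  putEvents(events: SyncedEvent[]): Promise<void>;
}

async function syncOnce(
  local: EventStore,
  remote: EventStore,
  checkpoint: { sentUpToMillis: number; receivedUpToMillis: number },
): Promise<{ sentUpToMillis: number; receivedUpToMillis: number }> {
  // Push events the server hasn't seen yet...
  const outgoing = await local.eventsAfter(checkpoint.sentUpToMillis);
  await remote.putEvents(outgoing);

  // ...and pull events we haven't seen yet.
  const incoming = await remote.eventsAfter(checkpoint.receivedUpToMillis);
  await local.putEvents(incoming);

  // Advance the checkpoint to the newest event seen in each direction.
  const latest = (events: SyncedEvent[], fallback: number) =>
    events.reduce((max, e) => Math.max(max, e.timestampMillis), fallback);

  return {
    sentUpToMillis: latest(outgoing, checkpoint.sentUpToMillis),
    receivedUpToMillis: latest(incoming, checkpoint.receivedUpToMillis),
  };
}
```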
But “after” is a bit tricky here: you don’t want to make a syncing system whose correctness depends on accurate and monotonic client clocks. There are lots of ways to get well-behaved distributed clocks, but for now I’ve chosen an exceedingly simple approach, which should work fine for us, though it means we can’t easily sync peer-to-peer—only client-server. My approach is: client-side, a simple sequence number functions as a local logical clock; server-side, we rely on Firebase’s ServerTimestamp, which uses Google TrueTime to produce monotonic timestamps. In practice this means simple data types for our clocks and a simple comparison for “after.” We do still need to rely on local timestamps for the spaced repetition algorithm itself, which depends on clock time deltas.

My simple approach to the distributed clock problem means there could be events which violate causation (i.e. the logical timestamps and client timestamps are reversed), but I don’t think that really matters in practice, so long as the reducers are well defined. If necessary, we could solve this by using a hybrid logical clock, which can capture both approximate relative clock times and strict causative ordering.

Still to do:
Phew. That’s a lot! Help with any/all welcome. Cc @kirkbyo
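For completeness, here's a rough sketch of the event/reducer split described in the previous comment. The event and snapshot types are illustrative only, not the actual core2 definitions.

```typescript
// Events are the source of truth; entity snapshots are derived caches.
type TaskEvent =
  | { type: "taskIngested"; entityID: string; spec: unknown; timestampMillis: number }
  | { type: "taskRepetitionLogged"; entityID: string; outcome: "remembered" | "forgotten"; timestampMillis: number }
  | { type: "taskDeleted"; entityID: string; timestampMillis: number };

interface TaskSnapshot {
  id: string;
  spec: unknown;
  isDeleted: boolean;
  lastRepetitionTimestampMillis: number | null;
}

// The reducer: given the previous snapshot (or null) and an event, produce the
// next snapshot. Replaying all events from scratch and applying one new event
// to a cached snapshot must agree.
function reduceTask(snapshot: TaskSnapshot | null, event: TaskEvent): TaskSnapshot {
  switch (event.type) {
    case "taskIngested":
      return {
        id: event.entityID,
        spec: event.spec,
        isDeleted: false,
        lastRepetitionTimestampMillis: null,
      };
    case "taskRepetitionLogged":
      if (!snapshot) throw new Error("repetition for unknown task");
      return { ...snapshot, lastRepetitionTimestampMillis: event.timestampMillis };
    case "taskDeleted":
      if (!snapshot) throw new Error("deleting unknown task");
      return { ...snapshot, isDeleted: true };
  }
}

// Snapshots are computed when events are written, but can always be rebuilt
// by folding the event sequence:
const rebuild = (events: TaskEvent[]) =>
  events.reduce<TaskSnapshot | null>(reduceTask, null);
```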
Unfortunately I've discovered today (thanks, tests!) that Firestore's server timestamps aren't actually guaranteed to be monotonic. This is a bit tricky, since (at least in our current architecture) API requests are potentially distributed to many independent servers. We'll have to arrange monotonicity using transactions or using something like a hybrid logical clock salted with random bits (to accommodate multiple server nodes).
The salted HLC does seem to be fine for the purposes of our stable listing contracts. Linking here for posterity: https://github.com/andymatuschak/orbit/blob/master/packages/backend/src/backend/2/orderedID.ts |
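For readers unfamiliar with the idea, here's a rough sketch of a salted hybrid logical clock. This is an illustration of the concept only, not the actual orderedID implementation linked above, and it omits the HLC rule for folding in timestamps observed from other nodes.

```typescript
import { randomBytes } from "crypto";

// A hybrid logical clock value: physical time, a logical counter to order
// events within the same millisecond, and random "salt" so that independent
// server nodes don't collide when they produce identical (time, counter) pairs.
interface HybridLogicalTimestamp {
  wallClockMillis: number;
  counter: number;
  salt: string;
}

class SaltedHLC {
  private lastMillis = 0;
  private counter = 0;

  next(nowMillis: number = Date.now()): HybridLogicalTimestamp {
    if (nowMillis > this.lastMillis) {
      // Physical clock has advanced: reset the logical counter.
      this.lastMillis = nowMillis;
      this.counter = 0;
    } else {
      // Physical clock stalled or went backwards: advance logically instead.
      this.counter++;
    }
    return {
      wallClockMillis: this.lastMillis,
      counter: this.counter,
      salt: randomBytes(4).toString("hex"),
    };
  }
}

// Order by wall clock, then counter, then salt: a total order that's monotonic
// per node and (almost surely) collision-free across nodes.
function compareHLC(a: HybridLogicalTimestamp, b: HybridLogicalTimestamp): number {
  return (
    a.wallClockMillis - b.wallClockMillis ||
    a.counter - b.counter ||
    a.salt.localeCompare(b.salt)
  );
}
```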
Some updates: sync is now implemented (in a new package, …)
The new scheduler, syncing module, and backend implementation are now all used end-to-end in the local configuration. Incremental syncing seems to be working nicely. Still working through pushing the new types through all of …
Next up: I'll deploy the write-twice cloud functions, migrate my own user data, and live on the core2 app for a few days to make sure I don't see anything unexpected.
I've migrated my own user now and worked through a lot of small issues which arose through that work. Double-writes appear to be working as expected. A full cold sync of my database is pretty slow: 24s for a few tens of thousands of events. Of course, no one else has a large database, so this isn't an issue in practice. But this'll need to be optimized if the platform gets real use. The easiest thing would be to build a "cold start" API which lets you directly download / upload the on-disk representation. I'll watch for errors as I review in the next few days, then I'll run the migration across all users. I can't deploy the new …
Migrated all users today, fixing a number of small bugs in the process. I haven't yet deployed the core2-based …
Notification logic has been migrated, and I've deployed the core2 …
I migrated …
That's pretty much it: just some odds and ends now…
Alright, I'm going to close this. Everything else can be done at our leisure. Hooray! I'm quite happy with how this came out.
As I implemented the API for Orbit, I noticed that I made syncing very cost-inefficient by giving each task state its own hash graph. Since I'll need to run a full-database migration to change this (see #188), I figured I should simultaneously make any other changes I'd like to see in the data model—these migrations are a hassle.
A few key things I'd like to do:
Prepare the data model for future variation: non-prompt tasks and different scheduling strategies. In particular I'd like to eliminate the distinction between ActionLog and PromptActionLog, and I'd like to make PromptState become TaskState. Tasks should be able to specify an "interaction type" (memory, exercise, activity, etc) in addition to describing the contents; the interaction type will determine the feedback semantics and the default scheduling behavior. I'll at least partially think through how e.g. an incremental reading client and a timeful text might be represented. (A rough sketch of these types follows after this list.)
If possible, unify the three stored types (logs, prompts, and attachments) as "objects," so that storage / syncing modules can treat them more uniformly.
I'll sketch out how edits to prompts and organization operations (tags, folders) might work, just to make sure I'm not backing myself into a corner.
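To make the shape of these changes a bit more concrete, here's a speculative sketch. All of these names and fields are placeholders, not final interfaces.

```typescript
// A single log/event type, replacing the ActionLog / PromptActionLog split.
interface TaskEvent {
  taskID: string;
  timestampMillis: number;
  type: string; // e.g. "ingest", "repetition", "edit", "delete"
  payload: unknown;
}

// PromptState generalized to TaskState: scheduling state for any kind of task.
interface TaskState {
  taskID: string;
  dueTimestampMillis: number;
  intervalMillis: number;
  isDeleted: boolean;
}

// The interaction type determines feedback semantics and default scheduling.
type InteractionType = "memory" | "exercise" | "activity";

interface Task {
  id: string;
  interactionType: InteractionType;
  contents: unknown; // prompt content, reading assignment, etc.
  state: TaskState;
}
```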
I'll start by just writing out new interfaces and playing with those. Then I'll perform the full migration. Then I'll have to work on #188.