| uip | title | description | author | status | type | category | created |
|---|---|---|---|---|---|---|---|
| 0104 | Scry Store | External Storage for Scry Bindings | ~rovnys-ricfer | Withdrawn | Standards Track | Kernel | 2023-06-13 |
Store `[path noun]` scry bindings in an external database, to move those pieces of data out of the loom, increasing storage capacity and allowing scry data to be bound and stored without going through Arvo at all.
As Urbit addresses more data, keeping it organized in the runtime could improve performance and scalability. This document proposes one such organization scheme, in which scry bindings are stored separately from other data.
A scry binding is a pair of `[path noun]`, where `path` is a fully qualified scry path and `noun` is the value bound immutably to that path by an Urbit ship. The publishing ship is the first element of the path. Scry bindings might also come with a signature from the publishing ship, so that the data can be validated after redistribution by other ships or relays.
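As a concrete illustration, here is a minimal sketch of what one scry binding could look like as a record. The Python representation, field names, and path handling are assumptions for illustration only, not Vere's or Arvo's actual types.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ScryBinding:
    """Hypothetical in-memory shape of one scry binding."""
    path: str                           # fully qualified scry path, e.g. "/~sampel/..."
    value: bytes                        # serialized noun bound immutably to the path
    signature: Optional[bytes] = None   # optional signature from the publishing ship

    @property
    def publisher(self) -> str:
        # The publishing ship is the first element of the path.
        return self.path.lstrip("/").split("/", 1)[0]
```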
An external scry storage database stores these scry bindings in a persistent key-value store outside of the storage arena used for the rest of the Arvo state.
There are some benefits to this arrangement:
- The Urbit project can make use of off-the-shelf high-performance, scalable key-value stores to scale its data storage, instead of needing to spend a lot of resources on increasing persistent storage space for arbitrary nouns.
- Data could be ingested into Urbit using the scry storage system without it needing to enter Arvo (Arvo should be told about the binding, but using the hash of the value instead of the whole value, to keep space usage low inside Arvo). This allows values larger than Arvo's "loom" arena (currently 8GB) to be stored in Urbit.
- Data could be served from Urbit using the scry storage system without the request needing to be evaluated in Nock inside Arvo. This facilitates parallelism through horizontal scaling of scry servers, each backed primarily by the scry storage system rather than Arvo itself. This architecture much more closely resembles modern web server request-handling architectures, and it could benefit from more off-the-shelf tooling, such as load balancers.
There are also some downsides:
- Urbit's single-level store abstraction is weakened, since not all of the data that Arvo expects to be guaranteed to be able to address is stored in the Arvo noun. This could be handled by the Arvo kernel in such a way as to be unobservable by userspace, but it would still violate the single-level store abstraction for the kernel itself.
- Event replay becomes more complex.
- Interrupt safety must be carefully evaluated.
- Any large datum must be placed into the scry namespace to benefit from external storage. This could be inconvenient for intermediate products of large array computations, for example. This further weakens the single-level store abstraction, since the userspace developer would know to treat data persisted one way (in the Arvo noun) differently from data persisted another way (in the scry store).
Vere runs the scry store as a standalone Unix process. The scry store is a key-value store where the key is a scry path and the value is a (possibly signed) serialized noun. LMDB, Redis, RocksDB, and FoundationDB come to mind as potential off-the-shelf key-value stores to use for this.
Vere's Mars process, which runs Arvo, communicates with the scry store over IPC, probably a Unix socket or some kind of connection specific to that database (such as an HTTP API).
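To make the IPC shape concrete, here is a rough sketch of what a request/response round trip between the Mars process and the scry store could look like over a Unix socket. The opcodes, framing, and helper names are all hypothetical; an off-the-shelf database would dictate its own wire protocol.

```python
import socket
import struct

# Hypothetical frame layout: a 4-byte big-endian length prefix, then a 1-byte
# opcode, a 4-byte path length, the path, and the remaining bytes as payload.
def send_frame(sock: socket.socket, opcode: bytes, path: bytes, payload: bytes = b"") -> None:
    frame = opcode + struct.pack(">I", len(path)) + path + payload
    sock.sendall(struct.pack(">I", len(frame)) + frame)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("scry store closed the socket")
        buf += chunk
    return buf

def recv_frame(sock: socket.socket) -> bytes:
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)

def put(sock: socket.socket, path: bytes, value: bytes) -> None:
    send_frame(sock, b"P", path, value)   # 'P' = put (hypothetical opcode)
    assert recv_frame(sock) == b"ok"      # store acks once the write is durable

def get(sock: socket.socket, path: bytes) -> bytes:
    send_frame(sock, b"G", path)          # 'G' = get (hypothetical opcode)
    return recv_frame(sock)               # the (possibly signed) serialized noun
```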
A checkpoint in a system with an external scry store needs to have both an Arvo snapshot and a snapshot of the scry store. When replaying events from such a checkpoint, Vere needs to look at the effects Arvo emits and apply each requested operation to the scry store. This ensures that the states of both Arvo and the scry store when replaying an event are the same as the first time the event ran.
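A minimal sketch of that replay loop, assuming a hypothetical `arvo_poke` that runs one event through Arvo and returns its effects, and generic "write"/"delete" effect kinds standing in for whatever effect names the kernel would actually use:

```python
from typing import Callable, Iterable, List, Tuple

# (kind, path, value) -- a deliberately simplified effect shape for this sketch.
Effect = Tuple[str, str, bytes]

def replay(events: Iterable[bytes],
           arvo_poke: Callable[[bytes], List[Effect]],
           store) -> None:
    """Re-run events from a checkpoint, re-applying scry-store operations so
    the store converges to the same state it had the first time around."""
    for event in events:
        for kind, path, value in arvo_poke(event):
            if kind == "write":        # Arvo bound a new scry path: persist it
                store.put(path, value)
            elif kind == "delete":     # Arvo requested a deletion: remove it
                store.delete(path)
```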
When Arvo grows a new scry binding, it emits a `%grew` effect to Vere. When Vere hears this effect, it sends the binding to the scry store. Once the scry store has committed the binding to persistent storage, it sends a response to acknowledge the write. When Vere hears this ack, it injects a `%twin` event into Arvo, which replaces the value of the scry binding with its hash to free up space in the loom.
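The write path might look roughly like the following sketch, with an in-memory dict standing in for the external store and a callback standing in for injecting an Arvo event. Only the `%grew` and `%twin` names come from this proposal; the rest is hypothetical glue.

```python
import hashlib
from typing import Callable, Tuple

class ScryStoreStub:
    """Stand-in for the external key-value store."""
    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def put(self, path: str, value: bytes) -> bool:
        self._data[path] = value   # commit the binding to "persistent" storage
        return True                # ack the write once it is durable

def on_grew_effect(store: ScryStoreStub,
                   inject_event: Callable[[Tuple[str, str, str]], None],
                   path: str,
                   value: bytes) -> None:
    # Vere hears a %grew effect: forward the binding to the scry store.
    if store.put(path, value):
        # Once the store acks, inject a %twin event so Arvo can replace the
        # value with its hash and free the corresponding space in the loom.
        inject_event(("%twin", path, hashlib.sha256(value).hexdigest()))
```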
There are two main alternatives for how Arvo could read data out of the scry store: by using an effect or by scrying. Using an effect is more conservative; scrying is more convenient.
In the effect-based approach, Arvo emits an effect to Vere asking to read the value at a scry path. Vere (the Mars process) sends a read request over the Unix socket to the scry store. The scry store sends back the value over the Unix socket. When Vere hears this, it injects the scry binding as an Arvo event.
A downside of this approach is that without special handling, scry bindings, which might contain large amounts of data, will be written to the event log in their entirety whenever Arvo reads them. This could balloon the size of the event log.
A trick to deal with this would be to have a special event type that writes to disk only the scry path and the hash of the value, along with a tag telling Vere to retrieve the full value from the store during replay.
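A sketch of that trick, with JSON standing in for whatever serialization the event log would actually use; the field names and the `fetch-on-replay` tag are hypothetical:

```python
import hashlib
import json

def log_scry_read(event_log: list, path: str, value: bytes) -> None:
    # Log only the path and the value's hash, plus a tag telling replay to
    # re-fetch the full value from the scry store instead of the event log.
    event_log.append(json.dumps({
        "type": "scry-read",
        "path": path,
        "hash": hashlib.sha256(value).hexdigest(),
        "fetch-on-replay": True,
    }))

def replay_scry_read(entry: str, store) -> bytes:
    record = json.loads(entry)
    value = store.get(record["path"])    # retrieve the full value from the store
    # Check the hash so a corrupted or mutated binding is caught during replay.
    assert hashlib.sha256(value).hexdigest() == record["hash"]
    return value
```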
In the scrying approach, Vere runs Arvo with a jetted scry handler gate. When Arvo wants to read a scry path from the scry store, it runs a `.^` with that path. When Vere's Nock interpreter sees the virtual Nock 12 instruction corresponding to the `.^` rune, it holds onto the Nock subject and call stack and emits a request to the scry store over the Unix socket. When the value comes back over the socket, Vere copies it into the loom, deserializes it, and delivers the result as the return value of the Nock 12 operation, continuing the rest of the computation.
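A thin sketch of that synchronous read, with `request` standing in for the blocking socket round trip and `cue` standing in for noun deserialization; both are hypothetical stand-ins, and the interpreter's held subject and call stack are left implicit:

```python
from typing import Callable

def virtual_nock_12(path: str,
                    request: Callable[[str], bytes],
                    cue: Callable[[bytes], object]) -> object:
    jammed = request(path)   # blocks until the scry store answers over the socket
    return cue(jammed)       # copied into the loom; becomes the result of the .^

# Usage with stand-ins: a dict plays the store, and cue is the identity function.
fake_store = {"/~sampel/1/x/some/path": b"jammed-noun-bytes"}
value = virtual_nock_12("/~sampel/1/x/some/path", fake_store.__getitem__, lambda b: b)
```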
If Arvo wants the value at a scry path to be deleted, it will emit an effect asking the scry store to delete that binding. Arvo is responsible for guaranteeing referential transparency, so it will need to maintain either the path and a hash of the value, or a high water mark indicating that all paths at that revision or lower have been permanently deleted from its state.
When the scry store receives the effect, it deletes the key-value pair in question and sends an acknowledgment back over the Unix socket to Vere. If Vere does not receive this acknowledgment within some timeout, it enters an error state.
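One way to picture the Arvo-side bookkeeping: keep either per-path tombstones (the path plus a hash of the deleted value) or a high-water revision below which everything under a path is known to be permanently gone. The structure below is purely illustrative; Arvo would track this however the kernel actually represents its namespace state.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class DeletionLedger:
    """Hypothetical record of deleted bindings, kept for referential transparency."""
    tombstones: Dict[str, str] = field(default_factory=dict)   # path -> hash of deleted value
    high_water: Dict[str, int] = field(default_factory=dict)   # path prefix -> revision

    def record_tombstone(self, path: str, value_hash: str) -> None:
        self.tombstones[path] = value_hash

    def record_high_water(self, prefix: str, revision: int) -> None:
        # Everything at or below `revision` under this prefix is permanently deleted.
        self.high_water[prefix] = max(self.high_water.get(prefix, 0), revision)

    def is_deleted(self, prefix: str, revision: int, path: str) -> bool:
        return path in self.tombstones or revision <= self.high_water.get(prefix, 0)
```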
If any scry store operation fails, it will be retried, up to three total attempts, each bounded by a timeout. On the third failure, Vere should crash, since it should interpret the repeated failures as the scry store being permanently inoperable. Crashing is preferable to applications experiencing ever-worsening lag and to outbound message queues to the scry store eating up more and more memory.
One-off failures can all be retried, since any single scry store operation is idempotent and none of them cause catastrophic problems if temporarily delayed (see the retry sketch after this list):
- A retried synchronous read will delay the completion of an Arvo event.
- A retried asynchronous read will delay the injection of an Arvo event with the scry result.
- A retried write will delay the injection of an Arvo event to delete the binding from Arvo, so Arvo will hang on to extraneous data somewhat longer than it needs to.
- A retried deletion will mean the scry store hangs on to extraneous data somewhat longer than it needs to.
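A minimal sketch of that retry policy, where `op` is a hypothetical callable wrapping one scry store round trip that raises on timeout or connection failure:

```python
import sys
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(op: Callable[[], T], attempts: int = 3, backoff_s: float = 1.0) -> T:
    for attempt in range(1, attempts + 1):
        try:
            return op()   # scry store operations are idempotent, so retrying is safe
        except (TimeoutError, ConnectionError) as err:
            if attempt == attempts:
                # Treat repeated failures as the store being permanently inoperable:
                # crash rather than let lag and outbound queues grow without bound.
                print(f"scry store unreachable after {attempts} attempts: {err}",
                      file=sys.stderr)
                sys.exit(1)
            time.sleep(backoff_s)   # brief pause before the next attempt
    raise AssertionError("unreachable")
```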
If a scry store error occurs during event replay, the operation should be tried at least as many times as the first time the event ran. A permanent failure on replay should crash the system just like in live event processing.
If the system does crash due to a scry store failure, it is crucial for the system to be able to restart without data loss or transactionality failure.
If the scry store had not yet processed the message before the crash: on replay, Vere will look through the effects and retry sending the message to the scry store. If this succeeds, the state of the scry store will change to match what Vere had tried to get it to be before crashing, and everything is nominal.
If the scry store had already processed the message before the crash: on replay, Vere will look through the effects and resend the message to the scry store. Since scry store operations are idempotent, asking again does not change the scry store's state. The store sends back an acknowledgment (a duplicate ack from its perspective -- "always ack a dupe"), Arvo hears the acknowledgment, and the system progresses just as if the scry store had only been sent one request.
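A small sketch of why the duplicate is harmless, with a dict-backed store standing in for the real one; bindings are immutable, so a repeated put of the same path and value leaves the store unchanged and simply gets acked again:

```python
class IdempotentStore:
    """Stand-in store: puts are idempotent and always acknowledged."""
    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def put(self, path: str, value: bytes) -> str:
        if path in self._data:
            return "ack"           # duplicate: state unchanged, ack anyway
        self._data[path] = value   # first write: commit the binding
        return "ack"

store = IdempotentStore()
assert store.put("/~sampel/1/x", b"value") == "ack"
assert store.put("/~sampel/1/x", b"value") == "ack"   # replayed dupe, same state
```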
Building a system with an external scry store is a nontrivial amount of work, and while it could greatly accelerate the development of Urbit's file-sharing capabilities, it would also introduce complexity in the event log and replay, and it would violate Urbit's single-level store abstraction.
In practice, I suspect the biggest violation of the single-level store abstraction would be when userspace apps want to perform operations on large pieces of data, such as math operations on n-dimensional array data. In this case, userspace code that needs to store a lot of large intermediate products would likely need to move some of them into the scry namespace just to offload them from Arvo's limited memory space. If that happens, then userspace devs might as well be given a disk-write command, and Urbit would lose some of its magic.
Maybe there would be ways to prevent this kind of leakage while maintaining a key-value store for scry data. Long-term, making it easy for Urbit to leverage off-the-shelf distributed key-value stores in hosted environments (and maybe even in self-hosted environments -- how many years until NativePlanet ships a personal cloud?) is likely to deliver big scaling wins compared to storing large amounts of relatively unstructured noun data directly in the loom. It would be good to figure out a more principled way to make Urbit interact smoothly with existing storage systems.
Interestingly, if all userspace data were in the scry namespace, the single-level store abstraction might not be hit as hard by separating scry data out into its own store -- instead of an application knowing a piece of data will be stored on disk if put into the scry namespace but not knowing otherwise, the app would simply have a single persistence interface: bind a scry path. Intermediate products could be ephemeral (or not, depending on the runtime).
This difference might not be enough to take this proposal from unworthy to worthy of being built, but it's worth noting that there are ways in which the application model and the storage model might end up being at least softly related to each other.
Copyright and related rights waived via CC0.