fix: IOD long streams could remain undelivered #387

dav1do · 2024-06-18T00:02:20Z

Addresses AES-83, replaces part of #375, and incorporates a lot of the feedback from that PR. Depends on #390.

This changes the initial "deliver all undelivered events" process to run when creating a CeramicEventService rather than in the background. That means startup could take a while: the server is blocked from starting until the process finishes, and is prevented from starting if it fails.

There are new/better tests. Some were deleted or replaced because they were specific to the IOD implementation details; the higher-level API is now tested exclusively. The general premise is to send recon events to the task, which tracks them in memory if they need history. Delivered events will be stored if we have the stream in memory, and will trigger a review of the stream if they unblock something we have. If we don't have the stream, we drop them (but we need to learn about them just in case we needed them).

When reviewing streams, we order as many events as we can and then insert them. We use mutable state, but things should be cancel and retry safe, as we don't drop things until we've successfully committed to the database. The previous bugs were due to some state management mistakes about when to keep things around, when to trigger a review, and when to look things up (so pretty much everything 😞). This implementation should be simpler: it handles these situations correctly, does less work, and uses an enum rather than flags to make things harder to mess up. It is, however, pretty much a rewrite of ordering_task.rs.
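To make the "order as many events as we can" idea concrete, here is a minimal sketch, not the code in ordering_task.rs. `Cid`, `State`, `Stream` and `order_events` are hypothetical names chosen for illustration; the real types carry more data and more states. It shows one enum state per event and a queue that delivers an event only once its `prev` has been delivered.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical event identifier; the real code uses CIDs.
type Cid = u32;

/// One state per event instead of several boolean flags, so contradictory
/// flag combinations cannot be expressed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Undelivered,
    Delivered,
}

/// Hypothetical in-memory view of a single stream.
struct Stream {
    /// cid -> (prev cid, state); init events have no prev.
    events: HashMap<Cid, (Option<Cid>, State)>,
}

impl Stream {
    /// Marks every event whose history is complete as delivered and returns
    /// the newly delivered cids in a valid delivery order.
    fn order_events(&mut self) -> Vec<Cid> {
        // prev -> undelivered children that are still blocked on it.
        let mut blocked: HashMap<Cid, Vec<Cid>> = HashMap::new();
        let mut ready: VecDeque<Cid> = VecDeque::new();
        for (cid, (prev, state)) in &self.events {
            if *state == State::Delivered {
                continue;
            }
            match prev {
                // Init events need no history.
                None => ready.push_back(*cid),
                Some(p) => match self.events.get(p) {
                    Some((_, State::Delivered)) => ready.push_back(*cid),
                    // prev is unknown or not yet delivered: wait for it.
                    _ => blocked.entry(*p).or_default().push(*cid),
                },
            }
        }
        let mut delivered = Vec::new();
        while let Some(cid) = ready.pop_front() {
            if let Some((_, state)) = self.events.get_mut(&cid) {
                *state = State::Delivered;
            }
            delivered.push(cid);
            // Delivering this event may unblock the events that point at it.
            if let Some(unblocked) = blocked.remove(&cid) {
                ready.extend(unblocked);
            }
        }
        delivered
    }
}
```

In this sketch nothing is removed from the map while ordering; state is changed in place, which is one way to keep a cancelled or retried pass from losing events.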

I'm not happy with the structs needed to wrap different varieties of the event (unvalidated, store with order key or without, etc). Cleaning that up will make event validation easier. This should fix the test in ceramicnetwork/js-ceramic#3205.

linear · 2024-06-18T00:02:23Z

AES-83 IOD bug: failing js-ceramic test updating long stream

stbrody

it's still pretty tough to follow the control flow here, though I do think this is better than the previous version.

Left a bunch of comments. Admittedly, a lot of my comments are kind of nit-picky. It's just that with code this complicated any little bit to improve readability and reduce cognitive overhead of someone trying to read and understand this code is extra valuable I think.

- is now correct and handles long streams out of order successfully in all cases (afaict). Should be more performant as it only does work when it should be necessary.
- includes test of long streams

- use map of {cid: deliverable} to reduce the number of entries we have to review at the end
- use a VecDeque instead of a Vec to fix a bug where it would just process the same event and exit rather than find the in-memory history
- need to write tests to exercise this. it would have rejected an api write that had a prev in one batch while recon events would be sorted out. would be hard to encounter but merits tests, particularly because it's hard to encounter.

- process all undelivered events at server startup so we can fail the process not just the task
- adjust loop to while condition
- update some comments/function names
- fail loudly (i.e. panic) on anything expected to be unreachable (next up: refactor to make those states unrepresentable)

- modified the approach to keep everything in the maps until we're done and change things in place when it's identified as deliverable. this has the advantage of better cancel safety and more consistency about what is in the maps.
- also renamed some things from header->metadata

also changed struct sent to ordering task to avoid sending body we don't need and the ordering task correspondingly

the check should have been cid_map before too

we should never discover undelivered init events while processing the start up batch, but we can correct them so we aren't going to crash

Don't want the process to appear to be hanging without any indication of what is going on if there is a large backlog of undelivered events

use `send` instead of `try_send` to enforce backpressure, clean up comments/docs
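A small sketch of why `send` gives backpressure where `try_send` does not. This uses the standard library's bounded `sync_channel` as a stand-in; which channel type the ordering task actually uses is not shown in this PR page, so treat that as an assumption. The function names are hypothetical.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

/// Counts how many of `n` items are rejected by `try_send` on a bounded
/// channel whose consumer has not read anything yet.
fn dropped_with_try_send(n: usize, capacity: usize) -> usize {
    // `_rx` is kept alive so the channel stays connected.
    let (tx, _rx) = sync_channel::<usize>(capacity);
    (0..n)
        .filter(|i| matches!(tx.try_send(*i), Err(TrySendError::Full(_))))
        .count()
}

/// Sends `n` items with the blocking `send`. The producer waits whenever the
/// channel is full, so every item reaches the consumer.
fn received_with_send(n: usize, capacity: usize) -> usize {
    let (tx, rx) = sync_channel::<usize>(capacity);
    let consumer = thread::spawn(move || rx.iter().count());
    for i in 0..n {
        tx.send(i).expect("receiver is alive");
    }
    drop(tx); // close the channel so the consumer's iterator ends
    consumer.join().expect("consumer thread panicked")
}
```

With `try_send` the caller has to decide what to do with rejected items; with `send` the producer simply slows down to the consumer's pace.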

if we're not done, we keep one event in memory until next time to optimize for in-order streams. we also try to process all undelivered events for a stream we're tracking, as things fork and we don't know what happened while we were in the queue

service/src/event/ordering_task.rs

Co-authored-by: Spencer T Brody <[email protected]>

nathanielc

LGTM. It's merged, but I do have a few places where code readability could be improved. Not urgent, but thought I'd share.

nathanielc · 2024-06-26T18:03:25Z

service/src/event/order_events.rs

+use super::service::EventMetadata;
+
+pub(crate) struct OrderEvents {
+    pub(crate) deliverable: Vec<(EventInsertable, EventMetadata)>,


If these are meant to be read only can we add getters for them?
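One possible shape for that suggestion, as a sketch only. `Insertable` and `Metadata` are hypothetical stand-ins for `EventInsertable` and `EventMetadata`, and `into_parts` is an assumed helper for callers that need ownership.

```rust
/// Hypothetical stand-ins for the PR's `EventInsertable` and `EventMetadata`.
#[derive(Debug, Clone, PartialEq)]
struct Insertable(String);
#[derive(Debug, Clone, PartialEq)]
struct Metadata(u64);

struct OrderEvents {
    // Fields are private; callers go through the getters below.
    deliverable: Vec<(Insertable, Metadata)>,
    missing_history: Vec<(Insertable, Metadata)>,
}

impl OrderEvents {
    /// Read-only view of the events that can be delivered now.
    fn deliverable(&self) -> &[(Insertable, Metadata)] {
        &self.deliverable
    }

    /// Read-only view of the events still waiting on history.
    fn missing_history(&self) -> &[(Insertable, Metadata)] {
        &self.missing_history
    }

    /// Hands ownership of both lists to the caller without cloning.
    fn into_parts(self) -> (Vec<(Insertable, Metadata)>, Vec<(Insertable, Metadata)>) {
        (self.deliverable, self.missing_history)
    }
}
```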

nathanielc · 2024-06-26T18:43:26Z

service/src/event/order_events.rs

+        for event in stream_1.iter().chain(stream_2.iter()) {
+            let insertable = CeramicEventService::validate_discovered_event(
+                event.0.to_owned(),
+                event.1.as_slice(),
+            )
+            .await
+            .unwrap();
+            to_insert.push(insertable);
+        }


This loop exists in a few places in this test code. Can we abstract it into a function for clarity in reading the code?

nathanielc · 2024-06-26T18:46:15Z

service/src/event/order_events.rs

+        let mut after_1 = Vec::with_capacity(10);
+        let mut after_2 = Vec::with_capacity(10);
+        for (event, _) in ordered.deliverable {
+            assert!(event.deliverable());
+            if stream_1.iter().any(|e| e.0 == event.order_key) {
+                after_1.push(event.order_key.clone());
+            } else {
+                after_2.push(event.order_key.clone());
+            }
+        }


This logic to separate out the events into groups by their stream exists in a few places. Can we also make this into a function? That would help the test code read well as it doesn't get distracted by the particulars of how we split the data into groups.

Not necessary but this grouping concept exists in itertools see https://docs.rs/itertools/latest/itertools/trait.Itertools.html#method.group_by. Might make the code a little clearer.
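A sketch of what such a test helper could look like using only the standard library's `Iterator::partition` rather than itertools. `split_by_stream` is a hypothetical name, and the keys are generic here instead of the real order key type.

```rust
/// Splits `ordered` keys into those that belong to `stream_1` and the rest,
/// preserving the relative order within each group.
fn split_by_stream<K: PartialEq + Clone>(ordered: &[K], stream_1: &[K]) -> (Vec<K>, Vec<K>) {
    ordered
        .iter()
        .cloned()
        .partition(|key| stream_1.contains(key))
}
```

A test could then compare each returned group against the expected order for its stream, without repeating the loop.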

nathanielc · 2024-06-26T18:55:39Z

service/src/event/service.rs

-    }
+        let to_insert = to_insert_with_metadata
+            .iter()
+            .map(|(e, _)| e.clone())


Why do we need to clone all the events before inserting them?

removed by refactoring

nathanielc · 2024-06-26T18:57:08Z

service/src/event/service.rs

+        let missing_history = ordered
+            .missing_history
+            .iter()
+            .map(|(e, _)| e.order_key.clone())
+            .collect();

-    async fn parse_item<'a>(
-        item: &ReconItem<'a, EventId>,
-    ) -> Result<(EventInsertable, Option<DeliverableMetadata>)> {
-        let cid = item.key.cid().ok_or_else(|| {
-            Error::new_invalid_arg(anyhow::anyhow!("EventID is missing a CID: {}", item.key))
-        })?;
-        // we want to end a conversation if any of the events aren't ceramic events and not store them
-        // this includes making sure the key matched the body cid
-        let (insertable_body, maybe_prev) =
-            CeramicEventService::parse_event_carfile(cid, item.value).await?;
-        let insertable = EventInsertable::try_new(item.key.to_owned(), insertable_body)?;
-        Ok((insertable, maybe_prev))
-    }
+        let to_insert_with_metadata = if history_required {
+            ordered.deliverable
+        } else {
+            ordered
+                .deliverable
+                .into_iter()
+                .chain(ordered.missing_history)
+                .collect()
+        };


Instead of cloning the missing history, why not have two separate calls to CeramicOneEvent::insert_many? Maybe that is slower than cloning since it's two transactions, but thought I'd ask.

yeah, I wanted to avoid hitting the database twice, but I adjusted things with the change to an iterator, which avoids needing to allocate

nathanielc · 2024-06-26T19:05:17Z

service/src/event/service.rs

+        if history_required {
+            return Ok(InsertResult {
+                store_result: res,
+                missing_history,
+            });


nit, can you remove this early return and guard the discovered logic with !history_required?

I'm pretty sure I asked for the exact opposite in an earlier review 😅.

Can I ask why you don't like the early return? Personally I'm a big fan of early returns, I like removing the extra indentation layer and I think it helps reduce cognitive load by letting me see that this case is over and dealt with and I don't need to think about it any more, vs being in a big if block where there's some overhead to remembering what block I'm in and what the condition in play is right now.

I generally dislike early returns because they can be easy to miss, which makes it harder to determine the different code branches. In this case both returns produce the same value, so they were really the same code branch, just duplicated.

Not a strong preference here either way.

fascinating! I've never heard an argument against early returns before. I often push for early returns when I do code reviews because I've always felt they improve code readability. I always thought that opinion was non-controversial 😅

Good to know it's not as clear-cut as I thought. I still like them but don't feel strongly either way. Maybe I'll stop pushing as hard for others to use them going forward

I also don't like early returns unless they're right away. At least in Rust I prefer to rely on the fact that the block is an expression and returns something.

That said, I think I can extract the big if block so it's easy to understand (and hopefully satisfies everyone 😁).

nathanielc · 2024-06-26T19:08:34Z

service/src/event/service.rs

-            .push(DeliveredEvent::new(ev.body.cid, init_cid));
-        self.insert_now.push(ev);
-    }
+        let res = CeramicOneEvent::insert_many(&self.pool, &to_insert[..]).await?;


In reading the code it seems like if we change this method to accept an iterator instead of a slice, we do not have to do as much cloning of the data. For example, we could keep the to_insert vector as a tuple of the insertable and the metadata and use an iter to hide the metadata. This would avoid having to map the insertable back to its metadata below.

yeah, I thought that too but hesitated to add another refactor. I did it in the new branch and it removes some allocations (the clone above as well as collects)
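Roughly what that change buys, as a sketch. The `insert_many` below is a hypothetical in-memory stand-in, not the signature of `CeramicOneEvent::insert_many`, and `Insertable`, `Metadata` and `insert_all` are assumed names. The point is that the caller can keep `(event, metadata)` pairs together and pass a borrowing iterator that hides the metadata, with no intermediate `Vec` of cloned events.

```rust
/// Hypothetical stand-ins for the PR's event and metadata types.
#[derive(Debug)]
struct Insertable {
    order_key: String,
}
#[derive(Debug)]
struct Metadata {
    deliverable: bool,
}

/// Hypothetical store call that accepts any iterator of borrowed events
/// instead of a slice. Returns the number of events written.
fn insert_many<'a, I>(store: &mut Vec<String>, events: I) -> usize
where
    I: Iterator<Item = &'a Insertable>,
{
    let before = store.len();
    // Simulates the write by recording each order key.
    store.extend(events.map(|e| e.order_key.clone()));
    store.len() - before
}

/// The pairs stay intact for later use; only references to the events are
/// handed to the store.
fn insert_all(store: &mut Vec<String>, to_insert: &[(Insertable, Metadata)]) -> usize {
    insert_many(store, to_insert.iter().map(|(event, _)| event))
}
```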

nathanielc · 2024-06-26T19:25:29Z

service/src/event/service.rs

+
+        let metadata = EventMetadata::from(parsed_event);
+        let mut body = EventInsertableBody::try_from_carfile(cid, carfile).await?;
+        body.set_deliverable(matches!(metadata, EventMetadata::Init { .. }));


Why is this logic here instead of part of the event ordering bits? While it's true that any init event is deliverable, it seems like this logic should live with the other logic that knows that an init event doesn't have a prev.

I moved it to the Ordering task. Originally I thought it might be less likely to ever be missed if it just happened right away but I think it makes sense in the new location.

nathanielc · 2024-06-26T19:34:44Z

service/src/event/ordering_task.rs

+    /// Returns `false` if we have more work to do and should be retained for future processing
+    fn processing_completed(&mut self) -> bool {
+        // if we're done, we don't need to bother cleaning up since we get dropped
+        if !self


nit, instead of !...any(Undelivered) use ..all(!Undelivered).
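The two forms are equivalent by De Morgan's law, so the choice is purely about readability. A small sketch with a hypothetical `State` enum, not the PR's actual type:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Undelivered,
    Delivered,
}

/// "No event is undelivered", written with a leading negation.
fn done_with_any(states: &[State]) -> bool {
    !states.iter().any(|s| *s == State::Undelivered)
}

/// The same condition, written as "every event is not undelivered".
fn done_with_all(states: &[State]) -> bool {
    states.iter().all(|s| *s != State::Undelivered)
}
```

Both return `true` for an empty slice, so swapping one for the other does not change behavior at the edges.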

dav1do mentioned this pull request Jun 18, 2024

fix: api batch with one invalid event could fail all #388

Merged

dav1do temporarily deployed to github-tests-2024 June 18, 2024 00:11 — with GitHub Actions Inactive

dav1do force-pushed the fix/aes-83-iod-long-streams branch from cfea650 to 41778d4 Compare June 18, 2024 16:52

dav1do mentioned this pull request Jun 18, 2024

refactor: modify store API and rename some things #390

Merged

dav1do changed the base branch from main to refactor/store-api June 18, 2024 16:57

dav1do temporarily deployed to github-tests-2024 June 18, 2024 17:02 — with GitHub Actions Inactive

dav1do requested review from nathanielc and stbrody and removed request for nathanielc June 18, 2024 17:07

dav1do force-pushed the fix/aes-83-iod-long-streams branch from 41778d4 to 4fed1e9 Compare June 18, 2024 17:12

dav1do temporarily deployed to github-tests-2024 June 18, 2024 17:22 — with GitHub Actions Inactive

Base automatically changed from refactor/store-api to main June 18, 2024 19:17

dav1do force-pushed the fix/aes-83-iod-long-streams branch from 4fed1e9 to 1cd9f15 Compare June 18, 2024 22:01

dav1do temporarily deployed to github-tests-2024 June 18, 2024 22:11 — with GitHub Actions Inactive

stbrody requested changes Jun 19, 2024

dav1do force-pushed the fix/aes-83-iod-long-streams branch from 1cd9f15 to 75a65f2 Compare June 19, 2024 22:51

dav1do had a problem deploying to github-tests-2024 June 19, 2024 23:00 — with GitHub Actions Failure

dav1do had a problem deploying to github-tests-2024 June 20, 2024 06:25 — with GitHub Actions Failure

dav1do had a problem deploying to github-tests-2024 June 20, 2024 07:07 — with GitHub Actions Failure

dav1do force-pushed the fix/aes-83-iod-long-streams branch from 6ebe671 to 4a32725 Compare June 20, 2024 19:08

dav1do temporarily deployed to github-tests-2024 June 20, 2024 19:33 — with GitHub Actions Inactive

dav1do force-pushed the fix/aes-83-iod-long-streams branch from 9c6cdc2 to 90379ed Compare June 20, 2024 20:17

dav1do temporarily deployed to github-tests-2024 June 20, 2024 20:27 — with GitHub Actions Inactive

dav1do requested a review from stbrody June 20, 2024 20:32

dav1do commented Jun 21, 2024

service/src/event/ordering_task.rs (resolved)

dav1do commented Jun 21, 2024

service/src/event/service.rs (resolved)

dav1do temporarily deployed to github-tests-2024 June 24, 2024 16:01 — with GitHub Actions Inactive

dav1do temporarily deployed to github-tests-2024 June 24, 2024 16:57 — with GitHub Actions Inactive

dav1do temporarily deployed to github-tests-2024 June 24, 2024 20:51 — with GitHub Actions Inactive

dav1do added 13 commits June 24, 2024 16:40

fix: correct/better IOD

a19e019

- is now correct and handles long streams out of order successfully in all cases (afaict). Should be more performant as it only does work when it should be necessary.
- includes test of long streams

refactor: rename structs with Meta instead of Header

144cbcc

chore: add test logging

194942d

fix: clean up docs/comments

44fc70e

refactor: use event crate carfile parsing

81df3bc

also changed struct sent to ordering task to avoid sending body we don't need and the ordering task correspondingly

fix: optimization to do less work

e4f5a6e

the check should have been cid_map before too

fix: fix initial (shallow) ordering code and write tests

eb4d489

fix: only panic on internal states in release mode

e14b972

we should never discover undelivered init events while processing the start up batch, but we can correct them so we aren't going to crash

chore: improve logging for start up processing of undelivered events

81f098f

Don't want the process to appear to be hanging without any indication of what is going on if there is a large backlog of undelivered events

chore: address PR review feedback

4ed6721

use `send` instead of `try_send` to enforce backpressure, clean up comments/docs

dav1do force-pushed the fix/aes-83-iod-long-streams branch from 40f6e25 to 4ed6721 Compare June 24, 2024 22:41

dav1do temporarily deployed to github-tests-2024 June 24, 2024 22:50 — with GitHub Actions Inactive

dav1do temporarily deployed to github-tests-2024 June 25, 2024 22:23 — with GitHub Actions Inactive

stbrody reviewed Jun 26, 2024

service/src/event/ordering_task.rs (outdated, resolved)

Update ordering_task.rs

e933b96

Co-authored-by: Spencer T Brody <[email protected]>

dav1do temporarily deployed to github-tests-2024 June 26, 2024 17:12 — with GitHub Actions Inactive

dav1do added this pull request to the merge queue Jun 26, 2024

Merged via the queue into main with commit 95f5498 Jun 26, 2024
5 checks passed

dav1do deleted the fix/aes-83-iod-long-streams branch June 26, 2024 18:57

nathanielc reviewed Jun 26, 2024

dav1do mentioned this pull request Jun 26, 2024

refactor: use an Iterator for insert_many and other clean up #399

Merged

This was referenced Jul 1, 2024

chore: version v0.25.1 #408

Closed

chore: version v0.26.0 #414

Closed

chore: version v0.26.0 #416

Merged
