
PoH - Reserve Space #1227

Open
apfitzge opened this issue May 7, 2024 · 14 comments
apfitzge commented May 7, 2024

Problem

  • In transaction processing, one of the main bottlenecks for simple transactions (such as in bench-tps) is recording transactions into PoH.
  • This is slow because of the communication pattern between the execution and PoH threads:
    • Execution threads send messages to the PoH thread
    • The PoH thread occasionally checks the channel for messages from execution threads
    • The PoH thread records any messages from execution threads in order and sends responses back
  • The slowness mainly comes from the fact that the PoH thread does not constantly check the channel from execution threads. It only does so at specific intervals between hash batches. By default the PoH thread will do 64 hashes between each check of the channel (only hashing if the channel is empty).

Assuming a cluster-average hash rate of 10M hashes/sec, this means the interval between channel checks is around 6.4 µs (64 hashes ÷ 10M hashes/sec).

Proposed Solution

  • Create a mechanism for execution threads to "reserve" PoH space.
  • Reserving should not fail if there is still time left in the block, PoH-wise.
  • Before the block ends, the PoH thread will "flip a bit" to prevent any further reservations, and then record all pending messages before the last tick is inserted.
  • This will allow execution threads to move on more quickly, since they do not need to wait for the actual recording into PoH to occur, just the "reserve" operation.

Need to compare alternate implementations:

  1. Atomic using CAS for reserve mechanism
  2. Mutex for reserve mechanism
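
A rough sketch of what option 1 might look like; the `CLOSED_BIT` is the "flip a bit" from above, and `PohReserver` and its method names are hypothetical, not existing code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical CAS-based "reserve" gate. The high bit is flipped by the PoH
/// thread at the end of the block; the low bits count reserved records.
pub struct PohReserver {
    state: AtomicU64,
}

const CLOSED_BIT: u64 = 1u64 << 63;

impl PohReserver {
    pub fn new() -> Self {
        Self { state: AtomicU64::new(0) }
    }

    /// Execution threads call this; returns the reserved starting index,
    /// or None if the PoH thread has already closed the block.
    pub fn reserve(&self, num_records: u64) -> Option<u64> {
        let mut current = self.state.load(Ordering::Acquire);
        loop {
            if current & CLOSED_BIT != 0 {
                return None;
            }
            match self.state.compare_exchange_weak(
                current,
                current + num_records,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(current),
                Err(actual) => current = actual,
            }
        }
    }

    /// PoH thread "flips the bit" before inserting the last tick; returns how
    /// many records were reserved so it knows how many it must still drain.
    pub fn close(&self) -> u64 {
        self.state.fetch_or(CLOSED_BIT, Ordering::AcqRel) & !CLOSED_BIT
    }
}
```

With this shape, an execution thread only pays for a short CAS loop instead of a round trip to the PoH thread; the actual record message can then be sent without waiting for a reply.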
ryoqun commented May 8, 2024

Oh, I think this can be solved more elegantly just with crossbeam-rs/crossbeam#959 (cc @behzadnouri; he also came up with a similar idea: crossbeam-rs/crossbeam#861)

The idea is that we can guarantee block insertion just from the simple fact of succeeding in a (one-sided) send to the PoH thread, so there is no need to wait for replies from the PoH thread.

That said, that upstream PR is kind of stale. We could patch crossbeam-channel or just wrap the crossbeam channel with a mutex (ugh).

apfitzge commented May 8, 2024

The idea is that we can guarantee block insertion just from the simple fact of succeeding in a (one-sided) send to the PoH thread, so there is no need to wait for replies from the PoH thread.

Exactly, the whole point of this issue is to have this guarantee.
If we can achieve that in another way than suggested, I'm happy.

I'm not sure I understand the suggestion w/ crossbeam though. At least currently we use the same channel to send to the PoH thread throughout its lifetime; do you mean using the disconnect approach, and swapping out channels for every new slot?

ryoqun commented May 8, 2024

do you mean using the disconnect approach, and swapping out channels for every new slot?

Exactly, so that senders can be notified about the end of slots.
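
A rough sketch of that disconnect approach, assuming one channel per slot (`Record` and the function names are hypothetical; with vanilla crossbeam a successful send only proves the receiver was still alive at that moment, which is why the upstream PR is needed for a real guarantee):

```rust
use crossbeam_channel::{Receiver, SendError, Sender};

/// Hypothetical record sent from an execution thread to the PoH thread.
struct Record {
    mixin_hash: [u8; 32],
}

/// Execution-thread side: a plain one-sided send. If it fails, the receiver
/// for this slot's channel was dropped, i.e. the slot is over.
fn record_or_detect_slot_end(
    sender: &Sender<Record>,
    record: Record,
) -> Result<(), SendError<Record>> {
    sender.send(record)
}

/// PoH-thread side: at the end of the slot, drain what is already queued,
/// record it, then drop the receiver so further sends start failing.
/// NOTE: with vanilla crossbeam there is a window between the drain and the
/// drop where a send can still succeed but never be recorded; the upstream
/// crossbeam PRs are about closing exactly that gap.
fn end_slot(receiver: Receiver<Record>) {
    for record in receiver.try_iter() {
        let _ = record.mixin_hash; // record into PoH before the last tick...
    }
    drop(receiver); // disconnects; a fresh channel is created for the next slot
}
```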

apfitzge commented May 8, 2024

Yeah, that seems like an approach that'd work.
I think Mutex-wrapping the channels is not an option.
Is there any expected timeline on those upstream PRs? They seem to have been there for quite a long time.

ryoqun commented May 9, 2024

I think Mutex-wrapping the channels is not an option.

Well, I think we need some benchmarks before ruling that out. Now that Mutex is futex-based, it's not so different from having another crossbeam channel. I'm not saying this is a permanent solution, though. Also, come to think of it, an RwLock could be used to mitigate contention among the external (i.e. tx-sending) threads, because it's guaranteed that there's a single receiver.

Is there any expected timeline on those upstream PRs? They seem to have been there for quite a long time.

Sadly, none. It seems the maintainer is too busy... I'm replying as fast as I can.

At this moment, I think patching or forking makes sense.

So, ultimately, which solution to take (reserve, Mutex, or disconnect) could largely depend on the engineer's preference, I guess. Fortunately, this isn't a consensus change and the change is quite localized, so it's relatively easy to switch between solutions.

apfitzge commented May 9, 2024

Also, come to think of it, an RwLock could be used to mitigate contention among the external (i.e. tx-sending) threads, because it's guaranteed that there's a single receiver.

My main concern with a mutex was execution threads blocking other execution threads.
However, even with an RwLock the execution threads will be blocked by the recording thread.

I guess I'm also not quite sure how this approach would work. In crossbeam the Sender and Receiver are separate objects/types - a lock around the Receiver would not affect the Sender afaik - unless we do some sort of Channel wrapper that holds the lock around both.

ryoqun commented May 12, 2024

However, even with an RwLock the execution threads will be blocked by the recording thread.

The recording thread will only write-lock at the end of slots. But as said below, use of an RwLock is awkward.

I guess I'm also not quite sure how this approach would work. In crossbeam the Sender and Receiver are separate objects/types

Ohh, true.

a lock around the Receiver would not affect the Sender afaik

This is true as well.

unless we do some sort of Channel wrapper that holds the lock around both.

Yeah, that works, but it's even more odd. Instead, I think we need an is_reached_max_height: Arc<RwLock<bool>> alongside the senders/receivers, i.e. an indirect way to create a critical section by carrying this synchronization along with them, at the cost of losing type safety to some extent. This is like insert_shreds_lock: Mutex<()> in solana-ledger.
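
Something like this rough sketch (GuardedSender and Record are hypothetical names; the point is that the flag and the sender are only ever used together under the lock):

```rust
use crossbeam_channel::{Receiver, Sender};
use std::sync::{Arc, RwLock};

/// Hypothetical record sent from an execution thread to the PoH thread.
struct Record {
    mixin_hash: [u8; 32],
}

/// The flag travels alongside the sender; a send only counts if it happens
/// while holding the read lock and the flag is still false.
#[derive(Clone)]
struct GuardedSender {
    is_reached_max_height: Arc<RwLock<bool>>,
    sender: Sender<Record>,
}

impl GuardedSender {
    /// Execution threads: take the (shared, cheap) read lock and send only if
    /// the slot is still open. Returns false if max height was already reached.
    fn send(&self, record: Record) -> bool {
        let reached = self.is_reached_max_height.read().unwrap();
        if *reached {
            return false;
        }
        // Still inside the read guard, so the PoH thread cannot flip the flag
        // (it needs the write lock) until this send has completed.
        self.sender.send(record).is_ok()
    }
}

/// PoH thread at the end of the slot: take the write lock, flip the flag,
/// then drain. Anything that made it into the channel before the flip will be
/// recorded before the last tick.
fn close_slot(is_reached_max_height: &RwLock<bool>, receiver: &Receiver<Record>) {
    *is_reached_max_height.write().unwrap() = true;
    for record in receiver.try_iter() {
        let _ = record.mixin_hash; // record into PoH...
    }
}
```

Because the write lock can only be taken once all read guards are released, any send that completed under a read guard is already in the channel when the end-of-slot drain runs.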

ryoqun commented May 19, 2024

Instead, I think we need an is_reached_max_height: Arc<RwLock<bool>> alongside the senders/receivers, i.e. an indirect way to create a critical section by carrying this synchronization along with them

I created such a wrapper: https://gist.github.com/ryoqun/80adbd49c1658812b47fcf35bacb058f

ryoqun commented May 22, 2024

Further update: I created a revamped PR upstream: crossbeam-rs/crossbeam#1114

@apfitzge

Reviewing this, I think we still need/want some reserve method instead. We currently return the transaction index with the result of recording, which the worker threads use for something(?).

I'm imagining we could do something like having a Mutex<Option<usize>> or an AtomicI64.
Use the lock or a CAS to get the current index for the batch we're recording; if it's None or negative, then we consider the recording to have failed.
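
A rough sketch of the Mutex<Option<usize>> variant (hypothetical names; the AtomicI64 version would be the same shape with a CAS loop and a negative sentinel instead of None):

```rust
use std::sync::Mutex;

/// Some(index) while the slot is open (the next transaction index to hand
/// out), None once the slot is closed.
struct TransactionIndexReserver {
    next_index: Mutex<Option<usize>>,
}

impl TransactionIndexReserver {
    fn new() -> Self {
        Self { next_index: Mutex::new(Some(0)) }
    }

    /// Execution threads: reserve indexes for `batch_len` transactions.
    /// Returns the starting transaction index for the batch, or None
    /// (recording is considered failed) if the slot is already closed.
    fn reserve(&self, batch_len: usize) -> Option<usize> {
        let mut guard = self.next_index.lock().unwrap();
        let start = (*guard)?;
        *guard = Some(start + batch_len);
        Some(start)
    }

    /// PoH thread: close the slot so any further reserve() returns None;
    /// returns the final index count for the slot.
    fn close(&self) -> Option<usize> {
        self.next_index.lock().unwrap().take()
    }
}
```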

@alessandrod

The slowness mainly comes from the fact that the PoH thread does not constantly check the channel from execution threads. It only does so at specific intervals between hash batches. By default the PoH thread will do 64 hashes between each check of the channel (only hashing if the channel is empty).

I think the slowness comes from the fact that when the PoH thread sends results back, the banking threads are always sleeping, so it needs a syscall to wake them up. Ban syscalls.

(image attached)

@apfitzge

I think the slowness comes from the fact that when the PoH thread sends results back, the banking threads are always sleeping, so it needs a syscall to wake them up. Ban syscalls.

Yeah, those threads are always sleeping.
Communication should be "largely" one way: workers send stuff to PoH, but get back immediately, as part of the send process, the info they need to continue.
No messages from PoH, just something to flag the end of the slot so banks don't record past the end of the slot.

@alessandrod

This sounds like a fixed capacity ringbuffer with head and tail that can advance independently without contention

@apfitzge

This sounds like a fixed capacity ringbuffer with head and tail that can advance independently without contention

Yeah, ideally we could have some sort of ring buffer, large enough that we expect to never fail to insert.
We still need a way to communicate "slot is over" to workers, as well as the transaction_index stuff that's used for something.
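
Very rough sketch of the index bookkeeping for such a ring buffer (hypothetical names; a real implementation also needs per-slot "published" flags so the consumer never reads a claimed-but-unwritten entry, and here the slot-over check and the claim are not a single atomic step, unlike the CAS sketch at the top of the issue):

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Producers (execution threads) claim entries by advancing `tail`; the single
/// consumer (the PoH thread) advances `head` as it records.
struct PohRingIndexes {
    capacity: u64,
    head: AtomicU64,       // next entry the PoH thread will record
    tail: AtomicU64,       // next entry a producer will claim
    slot_over: AtomicBool, // flipped by the PoH thread at end of slot
}

impl PohRingIndexes {
    fn new(capacity: u64) -> Self {
        Self {
            capacity,
            head: AtomicU64::new(0),
            tail: AtomicU64::new(0),
            slot_over: AtomicBool::new(false),
        }
    }

    /// Producer side: claim one entry. Fails if the slot is over or the ring
    /// is full (which we size to "never" happen in practice).
    fn claim(&self) -> Option<u64> {
        if self.slot_over.load(Ordering::Acquire) {
            return None;
        }
        let claimed = self.tail.fetch_add(1, Ordering::AcqRel);
        if claimed - self.head.load(Ordering::Acquire) >= self.capacity {
            // Ring full; a real implementation would back off or undo here.
            return None;
        }
        Some(claimed % self.capacity)
    }

    /// Consumer side: mark one entry recorded.
    fn advance_head(&self) {
        self.head.fetch_add(1, Ordering::AcqRel);
    }

    /// Consumer side: end the slot; no further claims will succeed.
    fn end_slot(&self) {
        self.slot_over.store(true, Ordering::Release);
    }
}
```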
