
PoH - Reserve Space #1227

Open
apfitzge opened this issue May 7, 2024 · 14 comments
apfitzge commented May 7, 2024

Problem

  • In transaction processing, one of the main bottlenecks for simple transactions (such as in bench-tps) is recording transactions into PoH.
  • This is slow because of the communication pattern between the execution and PoH threads:
    • Execution threads send messages to the PoH thread
    • The PoH thread occasionally checks the channel for messages from execution threads
    • The PoH thread records any messages from execution threads in order and sends responses back
  • The slowness mainly comes from the fact that the PoH thread does not constantly check the channel from execution threads. It only does so at specific intervals between hash batches. By default the PoH thread will do 64 hashes between each check of the channel (only hashing if the channel is empty).

Assuming a cluster-average hash rate of 10M hashes/sec, this means the interval between channel checks is around 6.4 µs (64 hashes ÷ 10M hashes/sec).

Proposed Solution

  • Create a mechanism for execution threads to "reserve" PoH space.
  • Reserving should not fail if there is still time left in the block, PoH-wise.
  • Before the block ends, the PoH thread will "flip a bit" to prevent any further reservations, and then record all pending messages before the last tick is inserted.
  • This will allow execution threads to move on more quickly, since they do not need to wait for the actual recording into PoH to occur, just the "reserve" operation.

Need to compare alternate implementations:

  1. Atomic using CAS for reserve mechanism
  2. Mutex for reserve mechanism
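
A rough sketch of what option 1 might look like; the `CLOSED_BIT` is the "flip a bit" from above, and `PohReserver` and its method names are hypothetical, not existing code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical CAS-based "reserve" gate. The high bit is flipped by the PoH
/// thread at the end of the block; the low bits count reserved records.
pub struct PohReserver {
    state: AtomicU64,
}

const CLOSED_BIT: u64 = 1u64 << 63;

impl PohReserver {
    pub fn new() -> Self {
        Self { state: AtomicU64::new(0) }
    }

    /// Execution threads call this; returns the reserved starting index,
    /// or None if the PoH thread has already closed the block.
    pub fn reserve(&self, num_records: u64) -> Option<u64> {
        let mut current = self.state.load(Ordering::Acquire);
        loop {
            if current & CLOSED_BIT != 0 {
                return None;
            }
            match self.state.compare_exchange_weak(
                current,
                current + num_records,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(current),
                Err(actual) => current = actual,
            }
        }
    }

    /// PoH thread "flips the bit" before inserting the last tick; returns how
    /// many records were reserved so it knows how many it must still drain.
    pub fn close(&self) -> u64 {
        self.state.fetch_or(CLOSED_BIT, Ordering::AcqRel) & !CLOSED_BIT
    }
}
```

With this shape, an execution thread only pays for a short CAS loop instead of a round trip to the PoH thread; the actual record message can then be sent without waiting for a reply.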
ryoqun commented May 8, 2024

Oh, I think this can be solved more elegantly just with crossbeam-rs/crossbeam#959 (cc @behzadnouri; he also came up with a similar idea: crossbeam-rs/crossbeam#861)

The idea is that we can guarantee block insertion just from the simple fact of succeeding in a (one-sided) send to the PoH thread, so there is no need to wait for replies from the PoH thread.

That said, that upstream PR is kind of stale. We could patch crossbeam-channel or just wrap the crossbeam channel with a mutex (ugh).

apfitzge commented May 8, 2024

The idea is that we can guarantee block insertion just from the simple fact of succeeding in a (one-sided) send to the PoH thread, so there is no need to wait for replies from the PoH thread.

Exactly, the whole point of this issue is to have this guarantee.
If we can achieve that in another way than suggested, I'm happy.

I'm not sure I understand the suggestion w/ crossbeam though. At least currently we use the same channel to send to the PoH thread throughout its lifetime; do you mean using the disconnect approach, and swapping out channels for every new slot?

ryoqun commented May 8, 2024

do you mean using the disconnect approach, and swapping out channels for every new slot?

Exactly, so that senders can be notified about the end of slots.
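
A rough sketch of that disconnect approach, assuming one channel per slot (`Record` and the function names are hypothetical; with vanilla crossbeam a successful send only proves the receiver was still alive at that moment, which is why the upstream PR is needed for a real guarantee):

```rust
use crossbeam_channel::{Receiver, SendError, Sender};

/// Hypothetical record sent from an execution thread to the PoH thread.
struct Record {
    mixin_hash: [u8; 32],
}

/// Execution-thread side: a plain one-sided send. If it fails, the receiver
/// for this slot's channel was dropped, i.e. the slot is over.
fn record_or_detect_slot_end(
    sender: &Sender<Record>,
    record: Record,
) -> Result<(), SendError<Record>> {
    sender.send(record)
}

/// PoH-thread side: at the end of the slot, drain what is already queued,
/// record it, then drop the receiver so further sends start failing.
/// NOTE: with vanilla crossbeam there is a window between the drain and the
/// drop where a send can still succeed but never be recorded; the upstream
/// crossbeam PRs are about closing exactly that gap.
fn end_slot(receiver: Receiver<Record>) {
    for record in receiver.try_iter() {
        let _ = record.mixin_hash; // record into PoH before the last tick...
    }
    drop(receiver); // disconnects; a fresh channel is created for the next slot
}
```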

apfitzge commented May 8, 2024

Yeah, that seems like an approach that'd work.
I think Mutex-wrapping the channels is not an option.
Is there any expected timeline on those upstream PRs? They seem to have been there for quite a long time.

ryoqun commented May 9, 2024

I think Mutex-wrapping the channels is not an option.

Well, I think we need some benchmarks before ruling that out. Now that Mutex is futex-based, it's not so different from having another crossbeam channel. I'm not saying this is a permanent solution, though. Also, come to think of it, an RwLock could be used to mitigate contention among the external (i.e. tx-sending) threads, because it's guaranteed that there's a single receiver.

Is there any expected timeline on those upstream PRs? They seem to have been there for quite a long time.

Sadly, none. It seems the maintainer is too busy... I'm replying as fast as I can.

At this moment, I think patching or forking makes sense.

So, ultimately, which solution to take (reserve, Mutex, or disconnect) could largely depend on the engineer's preference, I guess. Fortunately, this isn't a consensus change and the change is quite localized, so it's relatively easy to switch between solutions.

apfitzge commented May 9, 2024

Also, come to think of it, an RwLock could be used to mitigate contention among the external (i.e. tx-sending) threads, because it's guaranteed that there's a single receiver.

My main concern with a mutex was execution threads blocking other execution threads.
However, even with an RwLock the execution threads will be blocked by the recording thread.

I guess I'm also not quite sure how this approach would work. In crossbeam the Sender and Receiver are separate objects/types - a lock around the Receiver would not affect the Sender afaik - unless we do some sort of Channel wrapper that holds the lock around both.

ryoqun commented May 12, 2024

However, even with an RwLock the execution threads will be blocked by the recording thread.

The recording thread will only write-lock at the end of slots. But as said below, use of an RwLock is awkward.

I guess I'm also not quite sure how this approach would work. In crossbeam the Sender and Receiver are separate objects/types

Ohh, true.

a lock around the Receiver would not affect the Sender afaik

This is true as well.

unless we do some sort of Channel wrapper that holds the lock around both.

Yeah, that works, but it's even more odd. Instead, I think we need an is_reached_max_height: Arc<RwLock<bool>> alongside the senders/receivers, i.e. an indirect way to create a critical section by carrying this synchronization along with them, at the cost of losing type safety to some extent. This is like insert_shreds_lock: Mutex<()> in solana-ledger.
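
Something like this rough sketch (GuardedSender and Record are hypothetical names; the point is that the flag and the sender are only ever used together under the lock):

```rust
use crossbeam_channel::{Receiver, Sender};
use std::sync::{Arc, RwLock};

/// Hypothetical record sent from an execution thread to the PoH thread.
struct Record {
    mixin_hash: [u8; 32],
}

/// The flag travels alongside the sender; a send only counts if it happens
/// while holding the read lock and the flag is still false.
#[derive(Clone)]
struct GuardedSender {
    is_reached_max_height: Arc<RwLock<bool>>,
    sender: Sender<Record>,
}

impl GuardedSender {
    /// Execution threads: take the (shared, cheap) read lock and send only if
    /// the slot is still open. Returns false if max height was already reached.
    fn send(&self, record: Record) -> bool {
        let reached = self.is_reached_max_height.read().unwrap();
        if *reached {
            return false;
        }
        // Still inside the read guard, so the PoH thread cannot flip the flag
        // (it needs the write lock) until this send has completed.
        self.sender.send(record).is_ok()
    }
}

/// PoH thread at the end of the slot: take the write lock, flip the flag,
/// then drain. Anything that made it into the channel before the flip will be
/// recorded before the last tick.
fn close_slot(is_reached_max_height: &RwLock<bool>, receiver: &Receiver<Record>) {
    *is_reached_max_height.write().unwrap() = true;
    for record in receiver.try_iter() {
        let _ = record.mixin_hash; // record into PoH...
    }
}
```

Because the write lock can only be taken once all read guards are released, any send that completed under a read guard is already in the channel when the end-of-slot drain runs.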

ryoqun commented May 19, 2024

Instead, I think we need an is_reached_max_height: Arc<RwLock<bool>> alongside the senders/receivers, i.e. an indirect way to create a critical section by carrying this synchronization along with them

I created such a wrapper: https://gist.github.com/ryoqun/80adbd49c1658812b47fcf35bacb058f

ryoqun commented May 22, 2024

Further update: I created a revamped PR upstream: crossbeam-rs/crossbeam#1114

@apfitzge

Reviewing this, I think we still need/want some reserve method instead. We currently return the transaction index with the result of recording, which the worker threads use for something(?).

I'm imagining we could do something like having a Mutex<Option<usize>> or an AtomicI64.
Use the lock or a CAS to get the current index for the batch we're recording; if it's None or negative, then we consider the recording to have failed.
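
A rough sketch of the Mutex<Option<usize>> variant (hypothetical names; the AtomicI64 version would be the same shape with a CAS loop and a negative sentinel instead of None):

```rust
use std::sync::Mutex;

/// Some(index) while the slot is open (the next transaction index to hand
/// out), None once the slot is closed.
struct TransactionIndexReserver {
    next_index: Mutex<Option<usize>>,
}

impl TransactionIndexReserver {
    fn new() -> Self {
        Self { next_index: Mutex::new(Some(0)) }
    }

    /// Execution threads: reserve indexes for `batch_len` transactions.
    /// Returns the starting transaction index for the batch, or None
    /// (recording is considered failed) if the slot is already closed.
    fn reserve(&self, batch_len: usize) -> Option<usize> {
        let mut guard = self.next_index.lock().unwrap();
        let start = (*guard)?;
        *guard = Some(start + batch_len);
        Some(start)
    }

    /// PoH thread: close the slot so any further reserve() returns None;
    /// returns the final index count for the slot.
    fn close(&self) -> Option<usize> {
        self.next_index.lock().unwrap().take()
    }
}
```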

@alessandrod

The slowness mainly comes from the fact that the PoH thread does not constantly check the channel from execution threads. It only does so at specific intervals between hash batches. By default the PoH thread will do 64 hashes between each check of the channel (only hashing if the channel is empty).

I think the slowness comes from the fact that when the PoH thread sends results back, the banking threads are always sleeping, so it needs a syscall to wake them up. Ban syscalls.

(image attached)

@apfitzge

I think the slowness comes from the fact that when the PoH thread sends results back, the banking threads are always sleeping, so it needs a syscall to wake them up. Ban syscalls.

Yeah, those threads are always sleeping.
Communication should be "largely" one way: workers send stuff to PoH, but get back immediately, as part of the send process, the info they need to continue.
No messages from PoH, just something to flag the end of the slot so banks don't record past the end of the slot.

@alessandrod

This sounds like a fixed capacity ringbuffer with head and tail that can advance independently without contention

@apfitzge

This sounds like a fixed capacity ringbuffer with head and tail that can advance independently without contention

Yeah, ideally we could have some sort of ring buffer, large enough that we expect to never fail to insert.
We still need a way to communicate "slot is over" to workers, as well as the transaction_index stuff that's used for something.
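
Very rough sketch of the index bookkeeping for such a ring buffer (hypothetical names; a real implementation also needs per-slot "published" flags so the consumer never reads a claimed-but-unwritten entry, and here the slot-over check and the claim are not a single atomic step, unlike the CAS sketch at the top of the issue):

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Producers (execution threads) claim entries by advancing `tail`; the single
/// consumer (the PoH thread) advances `head` as it records.
struct PohRingIndexes {
    capacity: u64,
    head: AtomicU64,       // next entry the PoH thread will record
    tail: AtomicU64,       // next entry a producer will claim
    slot_over: AtomicBool, // flipped by the PoH thread at end of slot
}

impl PohRingIndexes {
    fn new(capacity: u64) -> Self {
        Self {
            capacity,
            head: AtomicU64::new(0),
            tail: AtomicU64::new(0),
            slot_over: AtomicBool::new(false),
        }
    }

    /// Producer side: claim one entry. Fails if the slot is over or the ring
    /// is full (which we size to "never" happen in practice).
    fn claim(&self) -> Option<u64> {
        if self.slot_over.load(Ordering::Acquire) {
            return None;
        }
        let claimed = self.tail.fetch_add(1, Ordering::AcqRel);
        if claimed - self.head.load(Ordering::Acquire) >= self.capacity {
            // Ring full; a real implementation would back off or undo here.
            return None;
        }
        Some(claimed % self.capacity)
    }

    /// Consumer side: mark one entry recorded.
    fn advance_head(&self) {
        self.head.fetch_add(1, Ordering::AcqRel);
    }

    /// Consumer side: end the slot; no further claims will succeed.
    fn end_slot(&self) {
        self.slot_over.store(true, Ordering::Release);
    }
}
```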
