
feat: redesign controller-replica communication #1246


Kampadais

Which issue(s) this PR fixes: longhorn/longhorn#6600 ([EPIC] v1 data path performance enhancement)

What this PR does / why we need it:

This PR implements a new communication structure between the controller and the replicas. The main focus of this restructure is to remove the loop function and replace it with a more scalable approach.

In the current implementation, an operation issued by the frontend goes into a request channel and waits for the loop function to serve it. The same approach is followed for the replies coming back from the replicas. The main fault of the loop function is that only one thread is running, and it serves both requests and replies, because the Messages map can be read or written by only one thread. So even if a burst of requests or replies comes in, the system is capped at serving these operations one at a time. This solution is not scalable. A rough sketch of this pattern is shown below.
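For reference, this is roughly the serialized pattern being replaced. The names here (client, message, loop) are hypothetical and simplified for illustration, not the exact code in this repository:

```go
package dataconn

type message struct {
	Seq      uint32
	Data     []byte
	Err      error
	Complete chan struct{}
}

type client struct {
	requests  chan *message       // new operations from the frontend
	responses chan *message       // replies read off the wire
	messages  map[uint32]*message // in-flight table, single-threaded access
	send      chan *message       // wire-out queue
	nextSeq   uint32
}

// loop is the single goroutine that owns the Messages map. Because it
// is the only reader/writer, no locking is needed, but every request
// and every reply is funneled through it one at a time.
func (c *client) loop() {
	for {
		select {
		case req := <-c.requests:
			req.Seq = c.nextSeq
			c.nextSeq++
			c.messages[req.Seq] = req
			c.send <- req
		case resp := <-c.responses:
			msg := c.messages[resp.Seq]
			delete(c.messages, resp.Seq)
			msg.Data, msg.Err = resp.Data, resp.Err
			close(msg.Complete)
		}
	}
}
```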

The new approach scales the operation processing by removing the loop function and replacing it with immediate calls to handleRequest and handleResponse. To make this possible, concurrent reads/writes on the Messages map also have to be prevented. I replaced the map with a Messages array and a Seq channel, which together ensure that every index of the array is accessed by only one thread. What the new structures do:

  • Messages array: it stores data equivalent to the Messages map, but has a fixed length equal to the number of in-flight I/Os we want to cap at. Its index is the id/Seq of the message.

  • Seq channel: it stores all the available Seqs that a new operation can acquire. Go channels ensure that only one thread will get the next available Seq. When we init the Client, the Seq channel gets populated with the numbers from 0 to Size(Messages). When an operation is completed, the client puts the now-available index back into the Seq channel.

With this approach, when a reply comes from the replica, the client can identify the message by using its id as an index into the Messages array, and the Seq channel preserves the integrity of the array. A minimal sketch of these structures is shown below.
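A minimal sketch of the new structures and the request side, again with hypothetical names (client, message, newClient) rather than the actual code in this PR:

```go
package dataconn

type message struct {
	Seq      uint32
	Data     []byte
	Err      error
	Complete chan struct{}
}

type client struct {
	messages []*message    // fixed-size slot table, indexed by Seq
	seq      chan uint32   // pool of free slot indices
	send     chan *message // wire-out queue, unchanged from the current version
}

func newClient(depth int) *client {
	c := &client{
		messages: make([]*message, depth),
		seq:      make(chan uint32, depth),
		send:     make(chan *message, depth),
	}
	for i := 0; i < depth; i++ {
		c.seq <- uint32(i) // populate the pool with 0..depth-1
	}
	return c
}

// handleRequest claims a free Seq, blocking when all depth slots are
// in flight (this is what caps concurrent I/O). The claiming goroutine
// owns messages[Seq] exclusively until the Seq is put back.
func (c *client) handleRequest(msg *message) {
	msg.Seq = <-c.seq // the channel hands each index to exactly one goroutine
	c.messages[msg.Seq] = msg
	c.send <- msg
}
```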

Basic data-path flow :

  1. Frontend issues a new operation
  2. The client accepts it through one of WriteAt/UnmapAt/ReadAt
  3. The operation function is called; it formats the message and calls handleRequest()
  4. handleRequest gives the message a Seq from the Seq channel and forwards it to the send channel
  5. This part is unchanged and follows the flow of the current version
  6. When the read function reads a reply on the wire, it immediately calls handleResponse()
  7. handleResponse identifies the operation being replied to by using the Seq as an index into the Messages array; after it processes the reply, it notifies the message's Complete channel to complete the request
  8. After the request is completed, the operation function reinserts the used Seq into the Seq channel so it can be acquired by the next operation

This approach uses multiple concurrent threads, each serving a different request, and is scalable up to the frontend's and the network's capabilities. A sketch of the response path and the Seq recycling follows.
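Continuing the sketch above (same hypothetical names), this is the response side and the Seq recycling that close the data-path loop:

```go
// handleResponse is invoked directly by the read goroutine. The Seq
// carried in the reply is the array index, so the lookup is O(1) and
// touches a slot that no other goroutine can currently own.
func (c *client) handleResponse(resp *message) {
	msg := c.messages[resp.Seq]
	msg.Data, msg.Err = resp.Data, resp.Err
	close(msg.Complete) // wake the goroutine blocked in operation()
}

// operation is the common path behind WriteAt/ReadAt/UnmapAt: send the
// request, wait for completion, then recycle the Seq.
func (c *client) operation(msg *message) error {
	msg.Complete = make(chan struct{})
	c.handleRequest(msg)
	<-msg.Complete   // completed by handleResponse
	c.seq <- msg.Seq // the slot is free for the next operation
	return msg.Err
}
```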

Benchmarks

As mentioned before, this approach is scalable up to the frontend's capabilities. Since tgt is far from an optimal solution, using it as the frontend shows minimal to zero performance boost. In order to see the potential of this approach, I used the ublk frontend I have implemented here.

Specs:
CPU: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
RAM: 128GB
Disk: Samsung 860 Evo SSD 250GB
Network: 10Gbps

One replica, one controller, two machines in a local cluster.
For the ublk+optimizations configuration:
      Ublk frontend queues: 6
      Controller-Replica connections: 6

The fio command used for testing IOPS:
sudo fio --name=read_iops --filename=/dev/ublkb0 --size=5G --numjobs=12 --time_based --runtime=30s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=4K --iodepth=64 --rw=randwrite --group_reporting=1

The fio command used for testing bandwidth:

sudo fio --name=bandwidth --filename=/dev/ublkb0 --size=5G --numjobs=12 --time_based --runtime=30s --ramp_time=2s --ioengine=libaio --direct=1 --verify=0 --bs=1M --iodepth=64 --rw=read --group_reporting=1

Additional documentation or context

[image: benchmark results]

*"Replica R/W disabled" means that the replica code is altered to answer with dummy replies instead of doing the actual R/W on the Linux sparse files.

Note that with this approach, and with the replica R/W disabled, we manage to scale the system to the network's speed (the replica R/W is another bottleneck that needs investigation, given that the system up to this point can perform at these numbers).

Special notes for your reviewer:

I know I have removed some of the client's functionality regarding the Journal, but I found those parts to be part of the bottleneck. I am open to discussion about putting any necessary parts of it back.
