Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vhost-device-vsock: increase max queue size to 1024 #679

Merged
merged 2 commits into from
Jul 10, 2024

Conversation

dorjoy03
Copy link
Contributor

@dorjoy03 dorjoy03 commented Jul 5, 2024

Summary of the PR

Please summarize here why the changes in this PR are needed.

When running "vhost-device-vsock" against QEMU's "vhost-user-vsock-device" (i.e., mmio version, not the pci one) causes an error output on the vhost-device-vsock terminal:

[2024-06-30T10:57:54Z ERROR vhost_device_vsock] Fatal error: failed to handle request: invalid parameters 

And some errors on the QEMU terminal too like below:

qemu-system-x86_64: Failed to read msg header. Read -1 instead of 12.
Original request 1.
qemu-system-x86_64: Failed to write msg. Wrote -1 instead of 20.
qemu-system-x86_64: vhost VQ 1 ring restore failed: -22: Invalid argument (22)
qemu-system-x86_64: Failed to set msg fds.
qemu-system-x86_64: vhost VQ 0 ring restore failed: -22: Invalid argument (22)
qemu-system-x86_64: Error starting vhost: 71
qemu-system-x86_64: Failed to set msg fds.
qemu-system-x86_64: vhost_set_vring_call failed 22
qemu-system-x86_64: Failed to set msg fds.
qemu-system-x86_64: vhost_set_vring_call failed 22

I debugged this on the QEMU side and traced it back to trying to set the vring queue size to 1024 and I think the "InvalidParameter" error comes from the "set_vring_num" function in vhost-device-vsock. The commands to reproduce this are below:

shell1: ./target/release/vhost-device-vsock --vm guest-cid=4,uds-path=/tmp/vm4.vsock,socket=/tmp/vhost4.socket

shell2: ./qemu-system-x86_64 -M microvm,memory-backend=mem0 -kernel ./bzImage -nographic -append "console=ttyS0 nokaslr" -initrd ./initramfs.cpio.gz -m 4G --enable-kvm -cpu host -object memory-backend-memfd,id=mem0,size=4G -chardev socket,id=char0,reconnect=0,path=/tmp/vhost4.socket -device vhost-user-vsock-device,chardev=char0

I built the bzImage from linux kernel 6.9.3. From what I understand, in the linux driver side in "drivers/virtio/virtio_mmio.c" the "vm_setup_vq" function first reads the max queue size (VIRTIO_MMIO_QUEUE_NUM_MAX) and then sets it to be the queue size (VIRTIO_MMIO_QUEUE_NUM). In the QEMU side, the max queue size is in "include/hw/virtio/virtio.h" "VIRTQUEUE_MAX_SIZE" which is 1024. I think it makes sense to just increase the max queue size in vhost-device-vsock too. The errors go away if I do so.

Edit: Per review comments, a new queue-size option is introduced and the default value of queue size has been changed to 1024

Requirements

Before submitting your PR, please make sure you addressed the following
requirements:

  • All commits in this PR have Signed-Off-By trailers (with
    git commit -s), and the commit message has max 60 characters for the
    summary and max 75 characters for each description line.
  • All added/changed functionality has a corresponding unit/integration
    test.
  • All added/changed public-facing functionality has entries in the "Upcoming
    Release" section of CHANGELOG.md (if no such section exists, please create one).
  • Any newly added unsafe code is properly documented.

@stefano-garzarella
Copy link
Member

We already discussed this change here: #458

The VhostUserBackend::max_queue_size is used to allocate structures, so increasing it will consume more memory.
1024 seems very big, so we should not set it as default.
As we discussed, the best way should be to add a parameter to control it.

@stefano-garzarella
Copy link
Member

We already discussed this change here: #458

The VhostUserBackend::max_queue_size is used to allocate structures, so increasing it will consume more memory. 1024 seems very big, so we should not set it as default. As we discussed, the best way should be to add a parameter to control it.

So, I'd suggest a new queue-size parameter, something like this:

./target/release/vhost-device-vsock --vm guest-cid=4,uds-path=/tmp/vm4.vsock,socket=/tmp/vhost4.socket,queue-size=1024

@epilys
Copy link
Member

epilys commented Jul 8, 2024

@stefano-garzarella +1 on giving an argument instead. By the way have you measured how much extra memory it'd need? Maybe it's negligible.

@stefano-garzarella
Copy link
Member

@stefano-garzarella +1 on giving an argument instead. By the way have you measured how much extra memory it'd need? Maybe it's negligible.

@epilys this is a good point. I didn't measure it. Do you suggest a proper way? (for generic exec I used smem in the past, not sure if for rust we have some analyzing tool)

@epilys
Copy link
Member

epilys commented Jul 8, 2024

@stefano-garzarella I think smem should be enough but there's also bytehound, heaptrack

https://github.com/koute/bytehound

https://github.com/KDE/heaptrack

@stefano-garzarella
Copy link
Member

PID User     Command                           Swap      USS      PSS      RSS 

# before this PR
$ smem | grep vhost
86936 stefano  ./target/release/vhost-devi        0     1596     1644     3640 

# after this PR
$ smem | grep vhost
87359 stefano  ./target/release/vhost-devi        0     1704     1753     3768 

USS is 108 KB higher, so around 6.8% more.
Given that it's 1.7 MB in the end (vs 1.6 MB), maybe we can use 1024 by default, but still add a parameter for those use cases where we want to reduce it?

@dorjoy03
Copy link
Contributor Author

dorjoy03 commented Jul 8, 2024

PID User     Command                           Swap      USS      PSS      RSS 

# before this PR
$ smem | grep vhost
86936 stefano  ./target/release/vhost-devi        0     1596     1644     3640 

# after this PR
$ smem | grep vhost
87359 stefano  ./target/release/vhost-devi        0     1704     1753     3768 

USS is 108 KB higher, so around 6.8% more. Given that it's 1.7 MB in the end (vs 1.6 MB), maybe we can use 1024 by default, but still add a parameter for those use cases where we want to reduce it?

Yeah agree that a new parameter makes sense. I will try to add the new parameter and keep 1024 as default. Thanks!

@dorjoy03 dorjoy03 force-pushed the vhost-device-vsock-queue-size branch 2 times, most recently from c4acf31 to 6d7b3d1 Compare July 8, 2024 17:56
@dorjoy03
Copy link
Contributor Author

dorjoy03 commented Jul 8, 2024

@stefano-garzarella @epilys Thanks for reviewing.

I have now introduced a "queue-size" option and set the default queue size to 1024 in the latest changes. I don't have any experience in Rust but the changes are simple enough (I mostly followed tx_buffer_size). If you have any review comments, it would be great if you could give me code examples too in the comment if the change is not obvious. Thanks!

Edit: Unrelated to this discussion, it could probably make sense to make the vhost-device-vsock error messages more exact. I had a hard time finding out where the "invalid parameters" error was coming from. It would have been more helpful if the error message had more details that it was when trying to set a queue size 1024 but max queue size was 256.

Copy link
Member

@epilys epilys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's see what @stefano-garzarella thinks!

@epilys
Copy link
Member

epilys commented Jul 9, 2024

Edit: Unrelated to this discussion, it could probably make sense to make the vhost-device-vsock error messages more exact. I had a hard time finding out where the "invalid parameters" error was coming from. It would have been more helpful if the error message had more details that it was when trying to set a queue size 1024 but max queue size was 256.

A lot of the errors and log messages are indeed not as descriptive as they could be. If you want to change this particular case we'd love a PR! Otherwise we can open a ticket and someone will get to it eventually :)

@dorjoy03
Copy link
Contributor Author

dorjoy03 commented Jul 9, 2024

Edit: Unrelated to this discussion, it could probably make sense to make the vhost-device-vsock error messages more exact. I had a hard time finding out where the "invalid parameters" error was coming from. It would have been more helpful if the error message had more details that it was when trying to set a queue size 1024 but max queue size was 256.

A lot of the errors and log messages are indeed not as descriptive as they could be. If you want to change this particular case we'd love a PR! Otherwise we can open a ticket and someone will get to it eventually :)

I probably won't look into it now so I suggest opening a ticket. Thanks.

@stefano-garzarella
Copy link
Member

Edit: Unrelated to this discussion, it could probably make sense to make the vhost-device-vsock error messages more exact. I had a hard time finding out where the "invalid parameters" error was coming from. It would have been more helpful if the error message had more details that it was when trying to set a queue size 1024 but max queue size was 256.

A lot of the errors and log messages are indeed not as descriptive as they could be. If you want to change this particular case we'd love a PR! Otherwise we can open a ticket and someone will get to it eventually :)

I probably won't look into it now so I suggest opening a ticket. Thanks.

@dorjoy03 can you open it and describe the issue?

@stefano-garzarella
Copy link
Member

The PR LGTM, but I'd remove Fixes: #202 since other devices need to be updated too. Maybe we can just mention it.

Also, what about splitting the commit in two?
The first one you introduce the new parameter, leaving the default to 256. The second commit will increase it to 1024 explaining why. So in the future, if we need to return to 256, we can just revert the second patch.

@dorjoy03
Copy link
Contributor Author

Edit: Unrelated to this discussion, it could probably make sense to make the vhost-device-vsock error messages more exact. I had a hard time finding out where the "invalid parameters" error was coming from. It would have been more helpful if the error message had more details that it was when trying to set a queue size 1024 but max queue size was 256.

A lot of the errors and log messages are indeed not as descriptive as they could be. If you want to change this particular case we'd love a PR! Otherwise we can open a ticket and someone will get to it eventually :)

I probably won't look into it now so I suggest opening a ticket. Thanks.

@dorjoy03 can you open it and describe the issue?

No problem. Will do.

The PR LGTM, but I'd remove Fixes: #202 since other devices need to be updated too. Maybe we can just mention it.

Good point. I did not realize the ticket was about other daemons as well. Will remove.

Also, what about splitting the commit in two? The first one you introduce the new parameter, leaving the default to 256. The second commit will increase it to 1024 explaining why. So in the future, if we need to return to 256, we can just revert the second patch.

Yep, will split into 2.

The vring queue size was hardcoded to 256 which is problematic for
vsock devices that use a larger queue size. For example, when using
vhost-device-vsock with QEMU's vhost-user-vsock-device, the message
to set the vring queue size to 1024 causes an "InvalidParameter"
error from vhost-device-vsock. Making the queue size configurable
via an option makes it easy to use vhost-device-vsock where a larger
queue size is required.

Signed-off-by: Dorjoy Chowdhury <[email protected]>
Although increasing queue size from 256 to 1024 increases the memory
usage of vhost-device-vsock, 1024 is a reasonable default which should
make vhost-device-vsock work by default (i.e., without using the
queue-size option) with devices that use vring queue size up to 1024
such as QEMU's vhost-user-vsock-device.

If Users require greater or smaller queue size, they can use the
'queue-size' option to configure the queue size.

Signed-off-by: Dorjoy Chowdhury <[email protected]>
@dorjoy03 dorjoy03 force-pushed the vhost-device-vsock-queue-size branch from 0270e4f to 35a945b Compare July 10, 2024 14:01
@dorjoy03
Copy link
Contributor Author

@stefano-garzarella I have now split the changes into 2 commits and submitted an issue about making the error logs more descriptive #682

@stefano-garzarella
Copy link
Member

@epilys your "Approve" is still valid for GH, but I'd like to double-check before merge this. Is this version okay for you?

@epilys epilys merged commit f7724bf into rust-vmm:main Jul 10, 2024
2 checks passed
@epilys
Copy link
Member

epilys commented Jul 10, 2024

@stefano-garzarella was already on it, thanks!

@stefano-garzarella
Copy link
Member

@dorjoy03 thanks for this PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants