NVMeWithLinux
This tutorial will show you how the L4Re NVMe server can be used to make an NVMe namespace, or a GPT partition on an NVMe namespace, accessible in the form of a VIRTIO block device to a Linux VM. We will build on our knowledge from the previous tutorials, namely the Building L4Re and Running a Linux guest VM tutorials, and to a degree the Hardware pass-through to the VM tutorial. You are kindly reminded to get familiar with the concepts and procedures presented in them before moving on with this one.
As NVMe technology is bound to systems with PCIe, we will be using the AMD64 platform (also known as x86_64), where PCIe is most common. This means that certain steps from the above tutorials do not apply verbatim, but need to be modified for the AMD64/x86_64 use case. In most situations we will leave this as an exercise to the reader, but we will draw your attention to important differences in configuration.
In the following we will run a simple Linux VM in uvmm. The Linux VM will be provided with a VIRTIO block device by the L4Re NVMe server. The VIRTIO device will be backed by an NVMe namespace. The VM will be able to mount the VIRTIO block device and use it just as any other block device. The whole scenario will run in KVM for the sake of a unified user experience and reusability of this tutorial.
You first need to check out and build the sources for the following components as described in the Building L4Re and Running a Linux guest VM tutorials:
- Fiasco
- L4Re
- Linux
- Busybox
Without going into the details described in the previous tutorials, the builds are done like this:
[somedir] $ cd fiasco
[somedir/fiasco] $ make B=../build-fiasco-x86_64
[somedir/fiasco] $ cd ../build-fiasco-x86_64
[somedir/build-fiasco-x86_64] $ make config # save defaults for x86_64
[somedir/build-fiasco-x86_64] $ make -j 6
[somedir/build-fiasco-x86_64] $ cd ../l4
[somedir/l4] $ make B=../build-x86_64
[somedir/l4] $ cd ../build-x86_64
[somedir/build-x86_64] $ make config # save defaults for x86_64
[somedir/build-x86_64] $ make -j 6
[somedir/build-x86_64] $ cd ../linux
[somedir/linux] $ mkdir ../build-linux-x86_64
[somedir/linux] $ make O=../build-linux-x86_64 defconfig
[somedir/linux] $ make O=../build-linux-x86_64 -j 6
[somedir/linux] $ cd ../busybox
[somedir/busybox] $ mkdir ../build-busybox-x86_64
[somedir/busybox] $ make O=../build-busybox-x86_64 defconfig
[somedir/busybox] $ make O=../build-busybox-x86_64 menuconfig # select Build static binary (no shared libs)
[somedir/busybox] $ cd ../build-busybox-x86_64
[somedir/build-busybox-x86_64] $ make install -j 6
In this case we assume that both the host and the target architecture are x86_64.
In this step you will make changes to the installed Busybox file system that will later allow you to see mounted file systems in the running Linux VM and list the discovered VIRTIO block devices.
[somedir/build-busybox-x86_64] $ mkdir _install/proc
[somedir/build-busybox-x86_64] $ mkdir _install/etc
[somedir/build-busybox-x86_64] $ cat >_install/etc/fstab <<EOF
none /proc proc defaults 0 0
devtmpfs /dev devtmpfs mode=0755,nosuid 0 0
EOF
[somedir/build-busybox-x86_64] $ mkdir ../ramdisk
[somedir/build-busybox-x86_64] $ cd _install
[somedir/build-busybox-x86_64/_install] $ find . | cpio -H newc -o | gzip > ../../ramdisk/ramdisk-x86_64.cpio.gz
[somedir/build-busybox-x86_64/_install] $ cd ../..
Create an empty 10MB file to hold an NVMe namespace image for QEMU:
[somedir] $ mkdir images
[somedir] $ dd if=/dev/zero of=images/disk.img bs=1M count=10
You can tweak the count argument to change the size of the image in MB according to your needs.
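For example, for a 100 MB image you would run:
[somedir] $ dd if=/dev/zero of=images/disk.img bs=1M count=100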
Now it is time to create the configuration files for your scenario. First of all, create a directory that will contain all the configuration:
[somedir] $ mkdir conf
We will now proceed to create two configuration files in this directory:
- Vbus configuration file for passing the NVMe hardware to the NVMe server
- ned script to start the scenario
Create somedir/conf/nvme.vbus like this:
Io.add_vbusses
{
  nvme = Io.Vi.System_bus(function ()
    Property.num_msis = 2;
    PCI0 = Io.Vi.PCI_bus(function ()
      pci_hd = wrap(Io.system_bus():match("PCI/storage"));
    end);
  end);
};
When processed by io, this configuration will create a vbus capability called nvme and, for the sake of simplicity, make all PCI storage devices found on the host system accessible via this capability. Note that this assignment is a little too generous (it would also match non-NVMe storage devices), but it will do for our scenario; it is also possible to match the real hardware exactly, as sketched below.
In addition, this configuration specifies that the nvme vbus will support up to 2 MSI interrupts. One is for the NVMe admin queues and one is for the I/O queue pair of each active NVMe namespace. In general you want this limit to be large enough to accommodate the admin queues and all your NVMe namespaces with connected clients. If MSI interrupts are not configured or not available, the NVMe server will fall back to using legacy interrupts.
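For illustration, a more restrictive variant might look as follows. This is only a sketch: the PCI/CC_0108 match pattern (the PCI class code of NVM Express controllers) and the raised num_msis value are assumptions that you may need to adapt to your io version and hardware.
Io.add_vbusses
{
  nvme = Io.Vi.System_bus(function ()
    -- Room for the admin queues plus the I/O queue pairs of two namespaces.
    Property.num_msis = 3;
    PCI0 = Io.Vi.PCI_bus(function ()
      -- Match only NVMe-class controllers instead of all storage devices.
      pci_hd = wrap(Io.system_bus():match("PCI/CC_0108"));
    end);
  end);
};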
Create somedir/conf/uvmm-nvme.ned as follows:
package.path = "rom/?.lua";
local L4 = require "L4";
local vmm = require "vmm";
local l = L4.default_loader;
local nvmecl = l:new_channel();
local io_busses = {
  nvme = 1,
}
vmm.start_io(io_busses, "rom/nvme.vbus -v");
l:start(
  {
    caps =
    {
      vbus = io_busses.nvme,
      cl1 = nvmecl:svr()
    },
    log = { "nvme", "g" },
  },
  "rom/nvme-drv"
  .. " -v"
  .. " --client cl1"
  .. " --device MYNVMECTLSN:n1"
  .. " --ds-max 5"
);
vmm.start_vm({
  id = 1,
  mem = 128,
  mon = false,
  rd = "rom/ramdisk-x86_64.cpio.gz",
  fdt = "rom/virt-pc.dtb",
  bootargs = "console=ttyS0 earlyprintk=1 rdinit=/bin/sh",
  kernel = "rom/bzImage",
  log = L4.Env.log,
  ext_caps =
  {
    qdrv = nvmecl,
  },
});
The above ned script first starts io via the vmm convenience wrapper and passes it the vbus configuration file nvme.vbus. This instructs io to create a vbus called nvme according to the configuration file.
It then starts the NVMe server, which goes by the name nvme-drv. Its vbus capability is the client capability for the nvme vbus as created by io. cl1 is a server capability for the NVMe server's client that will eventually be handed over to uvmm for use by the Linux VM. The client is assigned a VIRTIO block device identified by the controller's serial number (in our case MYNVMECTLSN) and the number of the NVMe namespace, in this case namespace 1. If you run this tutorial on bare metal, make sure to change the serial number to match your hardware. Note that the device command line argument can instead identify a GPT partition by its GUID or label. In that case the client will be able to access only that particular partition.
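As a sketch of the partition variant just mentioned, only the --device argument passed to nvme-drv changes. The GUID below is a made-up placeholder; replace it with the actual GUID (or label) of your GPT partition:
  "rom/nvme-drv"
  .. " -v"
  .. " --client cl1"
  -- placeholder partition GUID, substitute the GUID or label of your partition
  .. " --device 12345678-ABCD-4EF0-9876-0123456789AB"
  .. " --ds-max 5"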
Finally, uvmm is started using the vmm convenience wrapper. In contrast to the previous tutorials (which were for the aarch64 architecture), the arguments are tailored for running on x86_64. This shows in the device tree used (virt-pc.dtb) and in the Linux image name (bzImage). Also note that the -i command line argument is not used here.
The virt-pc.dtb device tree contains a node that allows a VIRTIO block device to be assigned to the VM when a capability named qdrv is defined. We define it using the nvmecl capability, which is the client capability for the NVMe server client.
Add an entry to somedir/l4/conf/modules.list with the following content in order to inform the build system and the boot loader of all the files that form your scenario.
entry uvmm-nvme
kernel fiasco -serial_esc
roottask moe rom/uvmm-nvme.ned
module uvmm
module io
module nvme-drv
module nvme.vbus
module l4re
module ned
module virt-pc.dtb
module ramdisk-x86_64.cpio.gz
module[shell] echo $SRC_BASE_ABS/pkg/uvmm/configs/vmm.lua
module uvmm-nvme.ned
module[uncompress] bzImage
Mind the device tree and Linux boot image names specific to x86_64.
You'll now need to configure the build system.
Go to somedir/l4/conf and copy Makeconf.boot.example to Makeconf.boot (assuming it does not exist yet). Then edit Makeconf.boot and make sure that:
- somedir/build-fiasco-x86_64
- somedir/ramdisk
- somedir/conf
- somedir/build-linux-x86_64/arch/x86/boot/
are all included in the definition of MODULE_SEARCH_PATH. The individual paths need to be absolute and separated by colons. An example definition of MODULE_SEARCH_PATH might look as follows. Just make sure to change SOMEDIR according to your environment:
SOMEDIR = /absolute/path/to/your/somedir
MODULE_SEARCH_PATH = $(SOMEDIR)/build-fiasco-x86_64:$\
$(SOMEDIR)/ramdisk:$(SOMEDIR)/conf:$(SOMEDIR)/build-linux-x86_64/arch/x86/boot
At the same time, make sure that QEMU_OPTIONS for amd64 are defined in the following fashion:
QEMU_OPTIONS-amd64 = -cpu host -enable-kvm -m 512 -display none
QEMU_OPTIONS-amd64 += -drive if=none,file=$(SOMEDIR)/images/disk.img,format=raw,id=D1 \
-device nvme,drive=D1,serial=MYNVMECTLSN
If the above steps were performed correctly, you should be able to start the scenario like this:
[somedir] $ cd build-x86_64
[somedir/build-x86_64] $ make E=uvmm-nvme qemu
After a short while the Linux VM will have booted into its interactive shell.
Let us first explore the output produced by the NVMe server though. It will be interleaved with output from other components, but when collected it should look like this:
nvme | NVMe[]: NVMe driver says hello.
nvme | NVMe[]: Starting device discovery.
nvme | NVMe[]: ICU info: features=80000000 #Irqs=24, #MSIs=2
nvme | NVMe[]: MSI info: vector=0x80000000 addr=fee00000, data=4022
nvme | NVMe[]: Device: interrupt : 80000000 trigger: 1, polarity: 0
nvme | NVMe[]: All devices scanned.
nvme | Serial Number: MYNVMECTLSN
nvme | Model Number: QEMU NVMe Ctrl
nvme | Firmware Revision: 1.0
nvme | Controller ID: 0
nvme | SGL Support: yes
nvme | Number of Namespaces: 256
nvme | NVMe[]: MSI info: vector=0x80000001 addr=fee00000, data=4023
nvme | Making NSID 1 visible to clients
nvme | libblock[]: No GUID partition header found.
nvme | NVMe[]: Capability 'svr' not found. No dynamic clients accepted.
nvme | Registering dataspace from 0x0 with 131072 KiB, offset 0x0
nvme | PORT[0x23090]: DMA guest [0-7ffffff] local [1200000-91fffff] offset 0
nvme | register client: host IRQ: 42a
nvme | Resetting device
nvme | shm=[0-7ffffff] local=[1200000-91fffff] desc=[1f20000-1f21000] (0x3120000-0x3121000)
nvme | shm=[0-7ffffff] local=[1200000-91fffff] avail=[1f21000-1f21206] (0x3121000-0x3121206)
nvme | shm=[0-7ffffff] local=[1200000-91fffff] used=[1f21240-1f21a46] (0x3121240-0x3121a46)
nvme | VQ[0x230f0]: num=256 d:0x3120000 a:0x3121000 u:0x3121240
This shows that the NVMe server discovered the NVMe controller emulated by QEMU. Note that the serial number matches the one passed to QEMU on its command line. Also note that the server is using MSI interrupts. The controller reports 256 NVMe namespaces, but only namespace 1 is used; it is the one backed by the disk image passed to QEMU. The disk image we prepared has no GPT partitions, which is in line with what the NVMe server reports. Finally, we see that a client registered with the server and shared VIRTIO queue and buffer memory with it.
When we now look at the Linux VM, it will have detected the VIRTIO block device which corresponds to namespace 1 on the NVMe controller.
~ # dmesg | grep virtio_blk
[ 0.606814] virtio_blk virtio1: 1/0/0 default/read/poll queues
[ 0.776591] virtio_blk virtio1: [vda] 20480 512-byte logical blocks (10.5 MB/10.0 MiB)
Linux will designate this VIRTIO block device as /dev/vda:
~ # mount -a
~ # mount
rootfs on / type rootfs (rw,size=45176k,nr_inodes=11294)
none on /proc type proc (rw,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,relatime,size=45200k,nr_inodes=11300,mode=755)
~ # ls /dev/vda
/dev/vda
You can now create a file system on this device:
~ # mkfs.ext2 /dev/vda
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
2560 inodes, 10240 blocks
512 blocks (5%) reserved for the super user
First data block=1
Maximum filesystem blocks=262144
2 block groups
8192 blocks per group, 8192 fragments per group
1280 inodes per group
Superblock backups stored on blocks:
8193
And mount it:
~ # mkdir /mnt
~ # mount /dev/vda /mnt
~ # mount
rootfs on / type rootfs (rw,size=45176k,nr_inodes=11294)
none on /proc type proc (rw,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,relatime,size=45200k,nr_inodes=11300,mode=755)
/dev/vda on /mnt type ext2 (rw,relatime)
This tutorial has taught you how to configure an L4Re scenario that exports an NVMe namespace (or a GPT partition) as a VIRTIO block device to a Linux VM. The Linux VM sees this device as /dev/vda and can use it like any other block device.