The point of this commit is that the heap allocated for the guest may
not cover everything the host has allocated for the guest. In other words, we
can exclude a certain portion that remains accessible to the guest
and ensure that the stack stops just before this portion. Furthermore, the
stack pointer (SP) will start from guest_mem_size, and there may be a
portion between guest_mem_size and mem_alloc_size.
0x0                guest_mem_size  mem_alloc_size
 +---------------------------+------+
 | kernel + heap -> <- stack | ring |
 +---------------------------+------+
                          SP ^
With regard to this pull request, we would like to place the shared ring
between the guest and the host in this portion (so that it is accessible
to both and shared).
We consider this invariant to always hold true:
guest_mem_size <= mem_alloc_size
Our ring is a fixed structure containing two queues:
- a queue for the guest to submit operations ("entries");
- a queue for the host to confirm operations ("commits").

The main idea is that the two queues are shared between the guest and the host and are located in an area allocated to the guest (so that the guest can access them). The second idea is that these queues are shared by two processes running in parallel, hence the use of memory barriers; similar barriers can be found, for example, in our VirtIO support. The third idea concerns the 'kick': the host thread can wait for new entries, and it informs the guest that it needs to be woken up via the kick value (when it is set to 1). The final idea is cache-line alignment (64 bytes): the guest generates entries and consumes commits, while the host generates commits and consumes entries. We can therefore place what the guest modifies and what the host modifies on different cache lines to avoid false sharing between the two CPUs (that of the guest and that of the host thread).
This commit introduces two fields that allow us to specify the features the host has implemented and what the guest can handle. For the time being, only the `host_features` field is used in our pull request.
Here, we implement the functions used to notify the host. For KVM, an ioeventfd will be associated with the address RING_KICK_PIO_BASE. For FreeBSD/OpenBSD, this will be a genuine VM-exit.
We define a pipe (2 file descriptors), our thread that will perform the read()/write() operations, and our ring. We catch the signal from our VM and write to our pipe. We also include <machine/vmm.h> to fix an error caused by an incomplete vm_run.
We define a pipe (2 file descriptors), our thread that will perform the read()/write() operations, and our ring. We catch the signal from our VM and write to our pipe.
Here, we just define our ioeventfd file-descriptor and our ring. The kick mechanism differs from that used in OpenBSD/FreeBSD. Here, we use IOEVENTFD, which we initialise at the address RING_KICK_PIO_BASE. When we wish to perform an `out` operation at this address, KVM will automatically write to the file descriptor.
We statically allocate the buffers that will contain our Ethernet frames. When writing, we notify the host and do not wait for the result. As for reading, we notify the host and wait for the commit (we want to be synchronised with our host’s thread). These operations only occur if net_ring is not NULL (which is the case if the FEATURE_RING_IO feature is advertised by the host).
This function allows you to allocate a ring within the memory area allocated for the guest. This ring is used by all network interfaces.
This prepares our ring for the virtual machine before the vCPU is initialised. The function must be called beforehand, as it adjusts `guest_mem_size` (relative to `mem_alloc_size`), which determines where the SP register must be placed.
Here, we create a thread that will handle the read()/write() operations received (from the guest) via the shared queues. The first value used by our thread is ta->ready: it ensures that the dom0 waits until the thread has actually started before applying certain security measures such as pledge (on OpenBSD). Next, our thread initially iterates 4096 times to 'wait' for an action. If, despite this busy-wait, we still have no entries, we block on reading the file descriptor (the ioeventfd for KVM, or the pipe for FreeBSD/OpenBSD) and signal to the guest that we need to be 'kicked'. We then execute the write() and read() operations without blocking. If a write fails, we fail the operation (as we did before); for read(), we pass the error on to the guest. Finally, we register a hook to ensure our pthread terminates correctly.
Here, we are essentially setting up the `notify_fd`, which will allow us to kick the thread we are also creating. For KVM, this is where we associate the file descriptor with the address `HVT_RING_KICK_PIO_BASE`. For OpenBSD/FreeBSD, we properly allocate our pipe.
Here is an initial attempt to optimise our unikernels with regard to hvt. To fully understand the original problem: every time we want to write an Ethernet frame, our execution path is as follows:
1. `out`; this is a VM-exit
2. `read()`

What is very costly here is the VM-exit (`out`), which significantly degrades our performance if we run a 'fair' benchmark comparison with VirtIO.

Overview
The idea behind this PR is to offer something very similar to what VirtIO can provide with Virtqueues. It therefore involves implementing queues that are shared between the host and the guest so that they can exchange information.
In this case, the guest sends “entries”, which are actions the host must perform (read from and write to a network interface), and the host sends confirmation of these actions along with the result. In our case, we are only interested in the result of the read operation. Initially, Solo5 would only return SOLO5_R_OK or fail when writing:
(see solo5/tenders/hvt/hvt_module_net.c, lines 46 to 70 at a333cbb)
To enable the guest to transmit information to the host, we must be mindful of certain barriers (as was the case with VirtIO, #630) and there may be instances where we wish to “kick” our host’s thread.
Indeed, our host may be in a state of waiting for entries (the unikernel has not sent any entries). In this case, the thread goes to sleep whilst reading a file descriptor. As regards KVM, this file descriptor is derived from ioeventfd (it is a file descriptor that can be associated with a specific memory address). As regards OpenBSD and FreeBSD, this file descriptor is one derived from pipe() (thanks to @haesbaert for giving me the tip already present in Miou).
If our thread is waiting, it updates a shared variable called needs_kick. If the unikernel sees this value set to 1, it performs a VM-exit to wake up the host thread. On KVM, this VM-exit is inexpensive as KVM handles it directly. On OpenBSD/FreeBSD, there will be a proper VM-exit which involves writing to the other end of our pipe().
As for the unikernel, since we have very little available in this space, I chose to statically allocate what is necessary for our “ring”. In this case, when a unikernel wishes to write, we need to “hold” its buffer. This buffer will only become available after our host thread has written it to the interface. Once again, we’re not interested in write confirmation. So we’ll simply attempt to write until the queue is full. The buffers are allocated via the .bss segment.
The ring is a bit special, however. It must exist for the guest but also for the host. The idea is to reserve an area just before the stack. This is why there are now guest_mem_size and mem_alloc_size. The first refers to what the unikernel can use and where the stack must start. The second refers to everything allocated for the guest (which includes what is needed for our ring). Thus, a unikernel requesting 512MB will only have 510MB (as the ring takes up 2MB). The ring is only allocated if a network interface is present.
Since the buffers are statically allocated, there is a scenario where the user might wish to configure the MTU to more than 2048 bytes (the maximum size of a buffer that can be written). In this case, we 'fall back' to the hypercall scenario and therefore perform a VM-exit; increasing the MTU beyond this limit will therefore tend to degrade performance.
Benchmark
It should be noted first of all that the performance gain is specifically observable on Linux. I haven't run any benchmarks on FreeBSD/OpenBSD, but I think there should still be a gain. Here is a benchmark using IPerf3 without this pull request:
And here is the result with this PR:
As you can see, the performance gain is quite significant. It is based on our implementation of `mnet` and `mkernel`. The IPerf3 code is available here. I haven't carried out a comparison with `mirage-tcpip`, however (there may be issues relating to the scheduler, particularly between Miou/`mkernel` and `lwt`).

Tests
A test has been added to verify the integrity of the data transmitted over the network. In theory, the test could work perfectly well without this PR (it simply sends a ping with data), but I wanted to check that, under high load, I wasn’t misinterpreting the indices and buffers.
Improvements
The number of comments is inversely proportional to my confidence in the code. In this case, there are a lot of comments. I hope the code is sufficiently documented, but above all it needs to be tested. In short, it’s a major change, but I think it’s definitely worth it to close the performance gap between hvt and VirtIO or Xen.