hvt: optimize net interface #637

Open
dinosaure wants to merge 16 commits into main from pthread-clean-clean

Conversation

@dinosaure
Collaborator

Here is an initial attempt to optimise our unikernels with regard to hvt. To fully understand the original problem: every time we want to write an Ethernet frame, our execution path is as follows:

  1. we prepare a value
  2. we write using out; this is a VM-exit
  3. we are in the tender’s address space and execute read()
  4. we prepare the result
  5. we return to the VM

What is very costly here is the VM-exit (out), which significantly degrades our performance if we run a ‘fair’ benchmark comparison with VirtIO.

Overview

The idea behind this PR is to offer something very similar to what VirtIO can provide with Virtqueues. It therefore involves implementing queues that are shared between the host and the guest so that they can exchange information.
In this case, the guest sends “entries”, which are actions the host must perform (read from and write to a network interface), and the host sends confirmation of these actions along with the result. In our case, we are only interested in the result of the read operation. Initially, Solo5 would only return SOLO5_R_OK or fail when writing:

static void hypercall_net_write(struct hvt *hvt, hvt_gpa_t gpa)
{
    struct hvt_hc_net_write *wr =
        HVT_CHECKED_GPA_P(hvt, gpa, sizeof(struct hvt_hc_net_write));
    struct mft_entry *e =
        mft_get_by_index(host_mft, wr->handle, MFT_DEV_NET_BASIC);
    if (e == NULL) {
        wr->ret = SOLO5_R_EINVAL;
        return;
    }
    ssize_t ret;
    ret =
        write(e->b.hostfd, HVT_CHECKED_GPA_P(hvt, wr->data, wr->len), wr->len);
    if (ret == -1) {
        fprintf(stderr, "Fatal error when writing: %s\n", strerror(errno));
        exit(1);
    } else if ((size_t)ret != wr->len) {
        fprintf(stderr, "Fatal error: wrote only %zd out of %zu bytes\n", ret,
                wr->len);
        exit(1);
    }
    wr->ret = SOLO5_R_OK;
}

To enable the guest to transmit information to the host, we must be mindful of certain barriers (as was the case with VirtIO, #630) and there may be instances where we wish to “kick” our host’s thread.

Indeed, our host may be in a state of waiting for entries (the unikernel has not sent any entries). In this case, the thread goes to sleep whilst reading a file descriptor. As regards KVM, this file descriptor is derived from ioeventfd (it is a file descriptor that can be associated with a specific memory address). As regards OpenBSD and FreeBSD, this file descriptor is one derived from pipe() (thanks to @haesbaert for giving me the tip already present in Miou).

Before going to sleep, the host thread sets a shared variable, needs_kick, to 1. If the unikernel sees this value set to 1, it performs a VM-exit to wake up the host thread. On KVM, this VM-exit is inexpensive because ioeventfd lets KVM handle the write in-kernel, without returning to the tender. On OpenBSD/FreeBSD, it is a full VM-exit in which we write to the other end of our pipe().
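A minimal sketch of this guest-side decision, assuming illustrative names (ring_hdr, needs_kick, kick_host are not the actual Solo5 identifiers): the guest only pays for a VM-exit when the host has announced it is about to sleep.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct ring_hdr {
    _Atomic uint32_t needs_kick; /* set to 1 by the host before it sleeps */
};

static int kicks_sent; /* stands in for the real `out` instruction */

static void kick_host(void) { kicks_sent++; }

/* After publishing a new entry, check whether the host asked to be woken. */
static void maybe_kick(struct ring_hdr *hdr)
{
    /* Make the entry globally visible before reading needs_kick; the host
     * clears the flag once it is awake again. */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_exchange(&hdr->needs_kick, 0) == 1)
        kick_host(); /* on KVM: an `out` to the ioeventfd port */
}
```

A second write while the host is awake takes the fast path: the flag is already 0, so no VM-exit is performed.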

As for the unikernel, since we have very little available in this space, I chose to statically allocate what is necessary for our “ring”. In this case, when a unikernel wishes to write, we need to “hold” its buffer. This buffer will only become available after our host thread has written it to the interface. Once again, we’re not interested in write confirmation. So we’ll simply attempt to write until the queue is full. The buffers are allocated via the .bss segment.
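The hold-until-written discipline could be sketched like this (the buffer count and names are illustrative, not the PR's actual layout); the arrays are zero-initialised statics, so they land in .bss as described:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NET_BUF_SIZE  2048
#define NET_BUF_COUNT 64

static uint8_t net_bufs[NET_BUF_COUNT][NET_BUF_SIZE]; /* lives in .bss */
static uint8_t net_buf_busy[NET_BUF_COUNT];

/* Returns a free buffer index, or -1 if every slot is still held by the
 * host (i.e. the queue is full and the guest must retry). */
static int net_buf_acquire(void)
{
    for (int i = 0; i < NET_BUF_COUNT; i++) {
        if (!net_buf_busy[i]) {
            net_buf_busy[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Called once the host's commit says the frame has been written out. */
static void net_buf_release(int i) { net_buf_busy[i] = 0; }
```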

The ring is a bit special, however. It must exist for the guest but also for the host. The idea is to reserve an area just before the stack. This is why there are now guest_mem_size and mem_alloc_size. The first refers to what the unikernel can use and where the stack must start. The second refers to everything allocated for the guest (which includes what is needed for our ring). Thus, a unikernel requesting 512MB will only have 510MB (as the ring takes up 2MB). The ring is only allocated if a network interface is present.

Since the buffers are statically allocated, there is a scenario where the user might wish to configure the MTU to more than 2048 bytes (the maximum size of a buffer that can be written). In this case, we ‘fall back’ to the hypercall scenario and therefore perform a VM-exit. Furthermore, increasing the MTU will therefore tend to degrade performance.
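The fallback rule amounts to a simple size check; a sketch, with an illustrative helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_BUF_SIZE 2048 /* maximum frame size a ring buffer can hold */

/* Frames that fit in a ring buffer go through the shared queue; larger
 * frames (oversized MTU) fall back to the classic hypercall and its
 * VM-exit. */
static bool use_ring_path(size_t frame_len, bool ring_available)
{
    return ring_available && frame_len <= RING_BUF_SIZE;
}
```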

Benchmark

It should be noted first of all that the performance gain has only been measured on Linux. I haven’t run any benchmarks on FreeBSD/OpenBSD, but I expect there is still a gain there. Here is a benchmark using IPerf3 without this pull request:

Connecting to host 10.0.0.2, port 5201
[  6] local 10.0.0.1 port 36664 connected to 10.0.0.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  6]   0.00-1.00   sec  80.1 MBytes   671 Mbits/sec    0    178 KBytes       
[  6]   1.00-2.00   sec  86.0 MBytes   721 Mbits/sec    0    178 KBytes       
[  6]   2.00-3.00   sec  85.0 MBytes   713 Mbits/sec    0    178 KBytes       
[  6]   3.00-4.00   sec  82.2 MBytes   690 Mbits/sec    0    178 KBytes       
[  6]   4.00-5.00   sec  83.5 MBytes   700 Mbits/sec    0    178 KBytes       
[  6]   5.00-6.00   sec  84.9 MBytes   712 Mbits/sec    0    178 KBytes       
[  6]   6.00-7.00   sec  85.2 MBytes   715 Mbits/sec    0    178 KBytes       
[  6]   7.00-8.00   sec  80.5 MBytes   675 Mbits/sec    0    178 KBytes       
[  6]   8.00-9.00   sec  83.1 MBytes   697 Mbits/sec    0    178 KBytes       
[  6]   9.00-10.00  sec  85.6 MBytes   718 Mbits/sec    0    178 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  6]   0.00-10.00  sec   836 MBytes   701 Mbits/sec    0            sender
[  6]   0.00-10.00  sec  0.00 Bytes  0.00 bits/sec                  receiver

iperf Done.

And here is the result with this PR:

Connecting to host 10.0.0.2, port 5201
[  6] local 10.0.0.1 port 46470 connected to 10.0.0.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  6]   0.00-1.00   sec   254 MBytes  2.13 Gbits/sec    0    178 KBytes       
[  6]   1.00-2.00   sec   261 MBytes  2.19 Gbits/sec    0    178 KBytes       
[  6]   2.00-3.00   sec   262 MBytes  2.20 Gbits/sec    0    178 KBytes       
[  6]   3.00-4.00   sec   258 MBytes  2.17 Gbits/sec    0    178 KBytes       
[  6]   4.00-5.00   sec   260 MBytes  2.18 Gbits/sec    0    178 KBytes       
[  6]   5.00-6.00   sec   252 MBytes  2.12 Gbits/sec    0    178 KBytes       
[  6]   6.00-7.00   sec   256 MBytes  2.14 Gbits/sec    0    178 KBytes       
[  6]   7.00-8.00   sec   263 MBytes  2.20 Gbits/sec    0    178 KBytes       
[  6]   8.00-9.00   sec   257 MBytes  2.16 Gbits/sec    0    178 KBytes       
[  6]   9.00-10.00  sec   242 MBytes  2.03 Gbits/sec    0    178 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  6]   0.00-10.00  sec  2.50 GBytes  2.15 Gbits/sec    0            sender
[  6]   0.00-10.00  sec  0.00 Bytes  0.00 bits/sec                  receiver

iperf Done.

As you can see, the performance gain is quite significant: roughly 3×, from ~700 Mbits/sec to ~2.15 Gbits/sec. It is based on our implementation of mnet and mkernel. The IPerf3 code is available here. I haven’t carried out a comparison with mirage-tcpip, however (there may be issues relating to the scheduler, particularly between Miou/mkernel and lwt).

Tests

A test has been added to verify the integrity of the data transmitted over the network. In theory, the test could work perfectly well without this PR (it simply sends a ping with data), but I wanted to check that, under high load, I wasn’t misinterpreting the indices and buffers.

Improvements

The number of comments is inversely proportional to my confidence in the code. In this case, there are a lot of comments. I hope the code is sufficiently documented, but above all it needs to be tested. In short, it’s a major change, but I think it’s definitely worth it to close the performance gap between hvt and VirtIO or Xen.

Commit messages

The point of this commit is that the heap allocated for the guest may
not be all that the host has allocated for the guest. In other words, we
could exclude a certain portion that is always accessible to the guest
and ensure that the stack starts after this portion. Furthermore, the
stack (SP) will start from guest_mem_size and there may potentially be a
portion between guest_mem_size and mem_alloc_size.

0x0          guest_mem_size <|      |> mem_alloc_size
 |---------------------------+------+
 | kernel + heap -> <- stack | ring |
 +---------------------------+------+
                          SP ^

With regard to this pull request, we would like to place the shared ring
between the guest and the host on this portion (so that it is accessible
by both and shared).

We consider this invariant to always hold true:
guest_mem_size <= mem_alloc_size

Our ring is a fixed structure containing two queues:
- a queue for the guest to send operations to
- a queue for the host to confirm operations

The main idea is that the two queues are shared between the guest and
the host and are located in an area allocated to the guest (so that the
guest can access them).

The second idea is that these queues are shared by two processes running
in parallel (hence the use of memory barriers). These barriers can be
found, for example, in our VirtIO support.
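A minimal single-producer/single-consumer queue in this spirit might look as follows (the layout and names are illustrative, not the actual Solo5 structures). The release store publishes the slot before the new index, and the acquire load orders the index read before the slot read:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_LEN 16 /* power of two */

struct spsc_queue {
    uint64_t slots[QUEUE_LEN];
    _Atomic uint32_t head; /* written only by the producer */
    _Atomic uint32_t tail; /* written only by the consumer */
};

static bool spsc_push(struct spsc_queue *q, uint64_t v)
{
    uint32_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail == QUEUE_LEN)
        return false; /* full */
    q->slots[head % QUEUE_LEN] = v;
    /* release: the slot must be visible before the new head */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

static bool spsc_pop(struct spsc_queue *q, uint64_t *v)
{
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (head == tail)
        return false; /* empty */
    *v = q->slots[tail % QUEUE_LEN];
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}
```

One such queue would carry entries (guest to host) and another would carry commits (host to guest).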

The third idea concerns the 'kick'. The host thread can wait for new
entries. It will therefore inform the guest that it needs to be woken up
via the kick value (if it is set to 1).

The final idea is cache-line alignment (64 bytes). The guest generates
entries and consumes commits when the host generates commits and
consumes entries. We can physically locate what the guest modifies and
what the host modifies on different cache lines to avoid false sharing
between the two CPUs (that of the guest and that of the host thread).
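A sketch of what such a layout could look like in C11 (the field names are illustrative):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Guest-written and host-written indices on separate 64-byte lines, so a
 * store by one side never invalidates a line the other side is reading. */
struct ring_indices {
    alignas(64) uint32_t entry_head;  /* guest writes, host reads */
    alignas(64) uint32_t entry_tail;  /* host writes, guest reads */
    alignas(64) uint32_t commit_head; /* host writes, guest reads */
    alignas(64) uint32_t commit_tail; /* guest writes, host reads */
};
```

With alignas(64), each index sits on its own cache line; the guest's stores to entry_head never touch the line holding the host's entry_tail, which avoids the false sharing described above.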

This commit introduces two fields that allow us to specify the features
the host has implemented and what the guest can handle. For the time
being, only the `host_features` field is used in our pull request.

Here, we implement the functions used to notify the host. For KVM, an
ioeventfd will be associated with the address RING_KICK_PIO_BASE. For
FreeBSD/OpenBSD, this will be a genuine VM-exit.

We define a pipe (2 file descriptors), our thread that will perform the
read()/write() operations, and our ring. We catch the signal from our VM and
write to our pipe.

<machine/vmm.h> is also included to fix a compile error about an incomplete vm_run type.

We define a pipe (2 file descriptors), our thread that will perform the
read()/write() operations, and our ring. We catch the signal from our VM and
write to our pipe.

Here, we just define our ioeventfd file-descriptor and our ring. The
kick mechanism differs from that used in OpenBSD/FreeBSD. Here, we use
IOEVENTFD, which we initialise at the address RING_KICK_PIO_BASE. When
we wish to perform an `out` operation at this address, KVM will
automatically write to the file descriptor.

We statically allocate the buffers that will contain our Ethernet
frames. When writing, we notify the host and do not wait for the result.
As for reading, we notify the host and wait for the commit (we want to
be synchronised with our host’s thread). These operations only occur if
net_ring is not NULL (which is the case if the FEATURE_RING_IO feature
is advertised by the host).

This function allows us to allocate a ring within the memory area
allocated for the guest. This ring is used by all network interfaces.

We prepare our ring for our virtual machine before the vCPU initialises.
This function must be called beforehand, as it changes `guest_mem_size`
(relative to `mem_alloc_size`), where the SP register must be placed.
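The split could be sketched as follows (RING_SIZE and the helper name are illustrative; the PR reserves 2MB):

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE (2ull * 1024 * 1024) /* 2MB reserved for the ring */

struct mem_layout {
    uint64_t guest_mem_size; /* kernel + heap + stack; SP starts here */
    uint64_t mem_alloc_size; /* everything mapped for the guest */
};

/* Carve the ring out of the top of the guest allocation, so the stack
 * pointer starts at guest_mem_size rather than mem_alloc_size. */
static struct mem_layout split_guest_mem(uint64_t mem_alloc_size)
{
    struct mem_layout l;
    l.mem_alloc_size = mem_alloc_size;
    l.guest_mem_size = mem_alloc_size - RING_SIZE;
    /* invariant from the commit message */
    assert(l.guest_mem_size <= l.mem_alloc_size);
    return l;
}
```

For the 512MB example from the description, this yields 510MB of usable guest memory.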

Here, we create a thread that will handle the read()/write() operations
received (from the guest) via the shared queues.

The first value used by our thread is: ta->ready. This ensures that the
dom0 waits until the thread has actually started before applying certain
security measures such as pledge (for OpenBSD).

Next, our thread initially spins 4096 times 'waiting' for an entry. If,
despite this spinning, we still have no entries, we block on reading
the file descriptor (the ioeventfd for KVM or the pipe for
FreeBSD/OpenBSD) and signal to the guest that we need to be 'kicked'.

Finally, we execute the write() and read() operations without blocking.
If a write fails, we fail the operation (as we did before). For the
read(), we pass the error to the guest.

Finally, we register a hook to ensure our pthread terminates correctly.
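The spin-then-block strategy might be sketched like this (the names are illustrative; the real code reads the ioeventfd on KVM or the pipe on FreeBSD/OpenBSD):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

#define SPIN_BUDGET 4096

struct wait_state {
    _Atomic uint32_t pending;    /* entries the guest has queued */
    _Atomic uint32_t needs_kick; /* host sets this before sleeping */
    int notify_fd;               /* read end of the ioeventfd/pipe */
};

/* Returns once at least one entry is available. */
static void wait_for_entries(struct wait_state *w)
{
    /* Cheap path: spin for a bounded number of iterations. */
    for (int i = 0; i < SPIN_BUDGET; i++)
        if (atomic_load(&w->pending) > 0)
            return;
    /* Expensive path: advertise that we need a kick, then block. */
    atomic_store(&w->needs_kick, 1);
    while (atomic_load(&w->pending) == 0) {
        uint64_t token;
        if (read(w->notify_fd, &token, sizeof(token)) <= 0)
            break; /* fd closed or error: give up waiting */
    }
    atomic_store(&w->needs_kick, 0);
}
```

The re-check of pending after setting needs_kick matters: the guest may have queued an entry (and skipped the kick) between our last spin and the store.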

Here, we are essentially setting up the `notify_fd`, which will allow us
to kick the thread we are also creating. For KVM, this is where we
associate the file descriptor with the address `HVT_RING_KICK_PIO_BASE`.
For OpenBSD/FreeBSD, we properly allocate our pipe.