hvt: optimize net interface #637

Open
dinosaure wants to merge 16 commits into main from pthread-clean-clean

Conversation

@dinosaure
Collaborator

Here is an initial attempt to optimise our unikernels with regard to hvt. To fully understand the original problem: every time we want to write an Ethernet frame, our execution path is as follows:

  1. we prepare a value
  2. we write using out; this is a VM-exit
  3. we are in the tender’s address space and execute read()
  4. we prepare the result
  5. we return to the VM

What is very costly here is the VM-exit (out), which significantly degrades our performance if we run a ‘fair’ benchmark comparison with VirtIO.

Overview

The idea behind this PR is to offer something very similar to what VirtIO can provide with Virtqueues. It therefore involves implementing queues that are shared between the host and the guest so that they can exchange information.
In this case, the guest sends “entries”, which are actions the host must perform (read from and write to a network interface), and the host sends confirmation of these actions along with the result. In our case, we are only interested in the result of the read operation. Initially, Solo5 would only return SOLO5_R_OK or fail when writing:

static void hypercall_net_write(struct hvt *hvt, hvt_gpa_t gpa)
{
    struct hvt_hc_net_write *wr =
        HVT_CHECKED_GPA_P(hvt, gpa, sizeof(struct hvt_hc_net_write));
    struct mft_entry *e =
        mft_get_by_index(host_mft, wr->handle, MFT_DEV_NET_BASIC);
    if (e == NULL) {
        wr->ret = SOLO5_R_EINVAL;
        return;
    }
    ssize_t ret;
    ret =
        write(e->b.hostfd, HVT_CHECKED_GPA_P(hvt, wr->data, wr->len), wr->len);
    if (ret == -1) {
        fprintf(stderr, "Fatal error when writing: %s\n", strerror(errno));
        exit(1);
    } else if ((size_t)ret != wr->len) {
        fprintf(stderr, "Fatal error: wrote only %zd out of %zu bytes\n", ret,
                wr->len);
        exit(1);
    }
    wr->ret = SOLO5_R_OK;
}

To enable the guest to transmit information to the host, we must be mindful of certain barriers (as was the case with VirtIO, #630) and there may be instances where we wish to “kick” our host’s thread.

Indeed, our host may be in a state of waiting for entries (the unikernel has not sent any entries). In this case, the thread goes to sleep whilst reading a file descriptor. As regards KVM, this file descriptor is derived from ioeventfd (it is a file descriptor that can be associated with a specific memory address). As regards OpenBSD and FreeBSD, this file descriptor is one derived from pipe() (thanks to @haesbaert for giving me the tip already present in Miou).

Before going to sleep, the host thread sets a shared variable, needs_kick, to 1. If the unikernel sees this value set to 1, it performs a VM-exit to wake up the host thread. On KVM, this VM-exit is inexpensive because ioeventfd lets KVM handle the write in-kernel, without returning to the tender. On OpenBSD/FreeBSD, it is a full VM-exit in which we write to the other end of our pipe().
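A minimal sketch of this guest-side decision, assuming illustrative names (ring_hdr, needs_kick, kick_host are not the actual Solo5 identifiers): the guest only pays for a VM-exit when the host has announced it is about to sleep.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct ring_hdr {
    _Atomic uint32_t needs_kick; /* set to 1 by the host before it sleeps */
};

static int kicks_sent; /* stands in for the real `out` instruction */

static void kick_host(void) { kicks_sent++; }

/* After publishing a new entry, check whether the host asked to be woken. */
static void maybe_kick(struct ring_hdr *hdr)
{
    /* Make the entry globally visible before reading needs_kick; the host
     * clears the flag once it is awake again. */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_exchange(&hdr->needs_kick, 0) == 1)
        kick_host(); /* on KVM: an `out` to the ioeventfd port */
}
```

A second write while the host is awake takes the fast path: the flag is already 0, so no VM-exit is performed.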

As for the unikernel, since we have very little available in this space, I chose to statically allocate what is necessary for our “ring”. In this case, when a unikernel wishes to write, we need to “hold” its buffer. This buffer will only become available after our host thread has written it to the interface. Once again, we’re not interested in write confirmation. So we’ll simply attempt to write until the queue is full. The buffers are allocated via the .bss segment.
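The hold-until-written discipline could be sketched like this (the buffer count and names are illustrative, not the PR's actual layout); the arrays are zero-initialised statics, so they land in .bss as described:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NET_BUF_SIZE  2048
#define NET_BUF_COUNT 64

static uint8_t net_bufs[NET_BUF_COUNT][NET_BUF_SIZE]; /* lives in .bss */
static uint8_t net_buf_busy[NET_BUF_COUNT];

/* Returns a free buffer index, or -1 if every slot is still held by the
 * host (i.e. the queue is full and the guest must retry). */
static int net_buf_acquire(void)
{
    for (int i = 0; i < NET_BUF_COUNT; i++) {
        if (!net_buf_busy[i]) {
            net_buf_busy[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Called once the host's commit says the frame has been written out. */
static void net_buf_release(int i) { net_buf_busy[i] = 0; }
```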

The ring is a bit special, however. It must exist for the guest but also for the host. The idea is to reserve an area just before the stack. This is why there are now guest_mem_size and mem_alloc_size. The first refers to what the unikernel can use and where the stack must start. The second refers to everything allocated for the guest (which includes what is needed for our ring). Thus, a unikernel requesting 512MB will only have 510MB (as the ring takes up 2MB). The ring is only allocated if a network interface is present.

Since the buffers are statically allocated, there is a scenario where the user might wish to configure the MTU to more than 2048 bytes (the maximum size of a buffer that can be written). In this case, we ‘fall back’ to the hypercall scenario and therefore perform a VM-exit. Furthermore, increasing the MTU will therefore tend to degrade performance.
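The fallback rule amounts to a simple size check; a sketch, with an illustrative helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_BUF_SIZE 2048 /* maximum frame size a ring buffer can hold */

/* Frames that fit in a ring buffer go through the shared queue; larger
 * frames (oversized MTU) fall back to the classic hypercall and its
 * VM-exit. */
static bool use_ring_path(size_t frame_len, bool ring_available)
{
    return ring_available && frame_len <= RING_BUF_SIZE;
}
```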

Benchmark

It should be noted first of all that the performance gain has only been measured on Linux. I haven’t run any benchmarks on FreeBSD/OpenBSD, but I expect there is still a gain there. Here is a benchmark using IPerf3 without this pull request:

Connecting to host 10.0.0.2, port 5201
[  6] local 10.0.0.1 port 36664 connected to 10.0.0.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  6]   0.00-1.00   sec  80.1 MBytes   671 Mbits/sec    0    178 KBytes       
[  6]   1.00-2.00   sec  86.0 MBytes   721 Mbits/sec    0    178 KBytes       
[  6]   2.00-3.00   sec  85.0 MBytes   713 Mbits/sec    0    178 KBytes       
[  6]   3.00-4.00   sec  82.2 MBytes   690 Mbits/sec    0    178 KBytes       
[  6]   4.00-5.00   sec  83.5 MBytes   700 Mbits/sec    0    178 KBytes       
[  6]   5.00-6.00   sec  84.9 MBytes   712 Mbits/sec    0    178 KBytes       
[  6]   6.00-7.00   sec  85.2 MBytes   715 Mbits/sec    0    178 KBytes       
[  6]   7.00-8.00   sec  80.5 MBytes   675 Mbits/sec    0    178 KBytes       
[  6]   8.00-9.00   sec  83.1 MBytes   697 Mbits/sec    0    178 KBytes       
[  6]   9.00-10.00  sec  85.6 MBytes   718 Mbits/sec    0    178 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  6]   0.00-10.00  sec   836 MBytes   701 Mbits/sec    0            sender
[  6]   0.00-10.00  sec  0.00 Bytes  0.00 bits/sec                  receiver

iperf Done.

And here is the result with this PR:

Connecting to host 10.0.0.2, port 5201
[  6] local 10.0.0.1 port 46470 connected to 10.0.0.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  6]   0.00-1.00   sec   254 MBytes  2.13 Gbits/sec    0    178 KBytes       
[  6]   1.00-2.00   sec   261 MBytes  2.19 Gbits/sec    0    178 KBytes       
[  6]   2.00-3.00   sec   262 MBytes  2.20 Gbits/sec    0    178 KBytes       
[  6]   3.00-4.00   sec   258 MBytes  2.17 Gbits/sec    0    178 KBytes       
[  6]   4.00-5.00   sec   260 MBytes  2.18 Gbits/sec    0    178 KBytes       
[  6]   5.00-6.00   sec   252 MBytes  2.12 Gbits/sec    0    178 KBytes       
[  6]   6.00-7.00   sec   256 MBytes  2.14 Gbits/sec    0    178 KBytes       
[  6]   7.00-8.00   sec   263 MBytes  2.20 Gbits/sec    0    178 KBytes       
[  6]   8.00-9.00   sec   257 MBytes  2.16 Gbits/sec    0    178 KBytes       
[  6]   9.00-10.00  sec   242 MBytes  2.03 Gbits/sec    0    178 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  6]   0.00-10.00  sec  2.50 GBytes  2.15 Gbits/sec    0            sender
[  6]   0.00-10.00  sec  0.00 Bytes  0.00 bits/sec                  receiver

iperf Done.

As you can see, the performance gain is quite significant: roughly 3×, from ~700 Mbits/sec to ~2.15 Gbits/sec. It is based on our implementation of mnet and mkernel. The IPerf3 code is available here. I haven’t carried out a comparison with mirage-tcpip, however (there may be issues relating to the scheduler, particularly between Miou/mkernel and lwt).

Tests

A test has been added to verify the integrity of the data transmitted over the network. In theory, the test could work perfectly well without this PR (it simply sends a ping with data), but I wanted to check that, under high load, I wasn’t misinterpreting the indices and buffers.

Improvements

The number of comments is inversely proportional to my confidence in the code. In this case, there are a lot of comments. I hope the code is sufficiently documented, but above all it needs to be tested. In short, it’s a major change, but I think it’s definitely worth it to close the performance gap between hvt and VirtIO or Xen.

Commit messages

The point of this commit is that the heap allocated for the guest may
not be all that the host has allocated for the guest. In other words, we
could exclude a certain portion that is always accessible to the guest
and ensure that the stack starts after this portion. Furthermore, the
stack (SP) will start from guest_mem_size and there may potentially be a
portion between guest_mem_size and mem_alloc_size.

0x0          guest_mem_size <|      |> mem_alloc_size
 |---------------------------+------+
 | kernel + heap -> <- stack | ring |
 +---------------------------+------+
                          SP ^

With regard to this pull request, we would like to place the shared ring
between the guest and the host on this portion (so that it is accessible
by both and shared).

We consider this invariant to always hold true:
guest_mem_size <= mem_alloc_size

Our ring is a fixed structure containing two queues:
- a queue for the guest to send operations to
- a queue for the host to confirm operations

The main idea is that the two queues are shared between the guest and
the host and are located in an area allocated to the guest (so that the
guest can access them).

The second idea is that these queues are shared by two processes running
in parallel (hence the use of memory barriers). These barriers can be
found, for example, in our VirtIO support.
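A minimal single-producer/single-consumer queue in this spirit might look as follows (the layout and names are illustrative, not the actual Solo5 structures). The release store publishes the slot before the new index, and the acquire load orders the index read before the slot read:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_LEN 16 /* power of two */

struct spsc_queue {
    uint64_t slots[QUEUE_LEN];
    _Atomic uint32_t head; /* written only by the producer */
    _Atomic uint32_t tail; /* written only by the consumer */
};

static bool spsc_push(struct spsc_queue *q, uint64_t v)
{
    uint32_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail == QUEUE_LEN)
        return false; /* full */
    q->slots[head % QUEUE_LEN] = v;
    /* release: the slot must be visible before the new head */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

static bool spsc_pop(struct spsc_queue *q, uint64_t *v)
{
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (head == tail)
        return false; /* empty */
    *v = q->slots[tail % QUEUE_LEN];
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}
```

One such queue would carry entries (guest to host) and another would carry commits (host to guest).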

The third idea concerns the 'kick'. The host thread can wait for new
entries. It will therefore inform the guest that it needs to be woken up
via the kick value (if it is set to 1).

The final idea is cache-line alignment (64 bytes). The guest generates
entries and consumes commits when the host generates commits and
consumes entries. We can physically locate what the guest modifies and
what the host modifies on different cache lines to avoid false sharing
between the two CPUs (that of the guest and that of the host thread).
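A sketch of what such a layout could look like in C11 (the field names are illustrative):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Guest-written and host-written indices on separate 64-byte lines, so a
 * store by one side never invalidates a line the other side is reading. */
struct ring_indices {
    alignas(64) uint32_t entry_head;  /* guest writes, host reads */
    alignas(64) uint32_t entry_tail;  /* host writes, guest reads */
    alignas(64) uint32_t commit_head; /* host writes, guest reads */
    alignas(64) uint32_t commit_tail; /* guest writes, host reads */
};
```

With alignas(64), each index sits on its own cache line; the guest's stores to entry_head never touch the line holding the host's entry_tail, which avoids the false sharing described above.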

This commit introduces two fields that allow us to specify the features
the host has implemented and what the guest can handle. For the time
being, only the `host_features` field is used in our pull request.

Here, we implement the functions used to notify the host. For KVM, an
ioeventfd will be associated with the address RING_KICK_PIO_BASE. For
FreeBSD/OpenBSD, this will be a genuine VM-exit.

We define a pipe (2 file descriptors), our thread that will perform the
read()/write() operations, and our ring. We catch the signal from our VM and
write to our pipe.

<machine/vmm.h> is also included to fix a compile error about an incomplete vm_run type.

We define a pipe (2 file descriptors), our thread that will perform the
read()/write() operations, and our ring. We catch the signal from our VM and
write to our pipe.

Here, we just define our ioeventfd file-descriptor and our ring. The
kick mechanism differs from that used in OpenBSD/FreeBSD. Here, we use
IOEVENTFD, which we initialise at the address RING_KICK_PIO_BASE. When
we wish to perform an `out` operation at this address, KVM will
automatically write to the file descriptor.

We statically allocate the buffers that will contain our Ethernet
frames. When writing, we notify the host and do not wait for the result.
As for reading, we notify the host and wait for the commit (we want to
be synchronised with our host’s thread). These operations only occur if
net_ring is not NULL (which is the case if the FEATURE_RING_IO feature
is advertised by the host).

This function allows us to allocate a ring within the memory area
allocated for the guest. This ring is used by all network interfaces.

We prepare our ring for our virtual machine before the vCPU initialises.
This function must be called beforehand, as it changes `guest_mem_size`
(relative to `mem_alloc_size`), where the SP register must be placed.
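The split could be sketched as follows (RING_SIZE and the helper name are illustrative; the PR reserves 2MB):

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE (2ull * 1024 * 1024) /* 2MB reserved for the ring */

struct mem_layout {
    uint64_t guest_mem_size; /* kernel + heap + stack; SP starts here */
    uint64_t mem_alloc_size; /* everything mapped for the guest */
};

/* Carve the ring out of the top of the guest allocation, so the stack
 * pointer starts at guest_mem_size rather than mem_alloc_size. */
static struct mem_layout split_guest_mem(uint64_t mem_alloc_size)
{
    struct mem_layout l;
    l.mem_alloc_size = mem_alloc_size;
    l.guest_mem_size = mem_alloc_size - RING_SIZE;
    /* invariant from the commit message */
    assert(l.guest_mem_size <= l.mem_alloc_size);
    return l;
}
```

For the 512MB example from the description, this yields 510MB of usable guest memory.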

Here, we create a thread that will handle the read()/write() operations
received (from the guest) via the shared queues.

The first value used by our thread is: ta->ready. This ensures that the
dom0 waits until the thread has actually started before applying certain
security measures such as pledge (for OpenBSD).

Next, our thread initially spins 4096 times 'waiting' for an entry. If,
despite this spinning, we still have no entries, we block on reading
the file descriptor (the ioeventfd for KVM or the pipe for
FreeBSD/OpenBSD) and signal to the guest that we need to be 'kicked'.

Finally, we execute the write() and read() operations without blocking.
If a write fails, we fail the operation (as we did before). For the
read(), we pass the error to the guest.

Finally, we register a hook to ensure our pthread terminates correctly.
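The spin-then-block strategy might be sketched like this (the names are illustrative; the real code reads the ioeventfd on KVM or the pipe on FreeBSD/OpenBSD):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

#define SPIN_BUDGET 4096

struct wait_state {
    _Atomic uint32_t pending;    /* entries the guest has queued */
    _Atomic uint32_t needs_kick; /* host sets this before sleeping */
    int notify_fd;               /* read end of the ioeventfd/pipe */
};

/* Returns once at least one entry is available. */
static void wait_for_entries(struct wait_state *w)
{
    /* Cheap path: spin for a bounded number of iterations. */
    for (int i = 0; i < SPIN_BUDGET; i++)
        if (atomic_load(&w->pending) > 0)
            return;
    /* Expensive path: advertise that we need a kick, then block. */
    atomic_store(&w->needs_kick, 1);
    while (atomic_load(&w->pending) == 0) {
        uint64_t token;
        if (read(w->notify_fd, &token, sizeof(token)) <= 0)
            break; /* fd closed or error: give up waiting */
    }
    atomic_store(&w->needs_kick, 0);
}
```

The re-check of pending after setting needs_kick matters: the guest may have queued an entry (and skipped the kick) between our last spin and the store.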

Here, we are essentially setting up the `notify_fd`, which will allow us
to kick the thread we are also creating. For KVM, this is where we
associate the file descriptor with the address `HVT_RING_KICK_PIO_BASE`.
For OpenBSD/FreeBSD, we properly allocate our pipe.