Description
Background
Hi, I have been working on a vhost-user-gpu device with @dorindabassey and we ran into some complications. I think there are some things about how `VhostUserBackend` is designed, and about the API model in general, that are not great for the GPU device and for other devices too - personally I am also familiar with the virtio-net and virtio-console devices because I worked on them in libkrun. The goal would be to make writing new devices and maintaining existing ones easier by making the API more flexible.
The current API can be clunky to use in some cases:
Currently vhost-user-backend handles spawning the worker threads for the device and calls the `handle_event` callback on those threads. I don't think this model is good, because it leads to several difficulties when programming.
Here are the main issues:
1. Where to store the data of a worker thread
Consider a worker thread that wants some variables available only to that single thread; this is currently difficult to do. Basically you have a couple of options:
- Store the data inside the backend struct and wrap it in a `Mutex`/`RwLock` - for example, virtio-sound does this. I feel like this adds unnecessary complexity, and unless I am missing something, we could get rid of some of the extra locking (by using the `GuestMemoryAtomic` we already have).
- If the thread-specific data is `!Send` - such as the `Rutabaga`/virglrenderer context in vhost-user-gpu - your only option is to lazily initialize it (putting something `!Send` inside a `Mutex` doesn't help) and use `thread_local!` or `unsafe` to work around this (see the sketch below).
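For illustration, here is a minimal std-only sketch of that `thread_local!` workaround. `VirglContext` is a made-up stand-in for the `!Send` renderer state, not the actual vhost-user-gpu code:

```rust
use std::cell::RefCell;

// Made-up stand-in for a !Send renderer context (e.g. virglrenderer/Rutabaga
// state); the raw pointer is what makes it !Send.
struct VirglContext {
    _inner: *mut std::ffi::c_void,
}

impl VirglContext {
    fn new() -> Self {
        Self { _inner: std::ptr::null_mut() }
    }
}

thread_local! {
    // The context cannot live inside the backend struct, because the backend
    // has to be Send + Sync, so it is lazily created on whichever worker
    // thread handle_event happens to run on.
    static VIRGL: RefCell<Option<VirglContext>> = RefCell::new(None);
}

// Roughly what the body of handle_event ends up doing today.
fn handle_gpu_event() {
    VIRGL.with(|cell| {
        let mut slot = cell.borrow_mut();
        let ctx = slot.get_or_insert_with(VirglContext::new);
        // ... use `ctx` to process the queue ...
        let _ = ctx;
    });
}

fn main() {
    handle_gpu_event();
}
```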
2. Adding file descriptors for polling after starting the backend
- In vhost-user-gpu we needed to add a file descriptor, to be handled by `handle_event`, from inside `handle_event` itself, and this is very awkward to do. To do it, you first need to pass the epoll handler created by the daemon to the device backend. In `handle_event` you then have access to the epoll, but you have to be careful: you cannot use the `VhostUserBackendMut` trait and the convenient lock around the whole struct, because if `handle_event` is called while the lock is held, you deadlock yourself when `num_queues` is called here: `vhost/vhost-user-backend/src/event_loop.rs` line 117 (at 176f44a).
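For context, here is a std-only sketch of the shape this workaround takes today. The types are stubs (the real crate's epoll handler and registration API are more involved), so treat it only as an illustration of the data flow described above:

```rust
use std::os::unix::io::RawFd;
use std::sync::{Arc, OnceLock};

// Stand-in for the vring epoll handler the daemon creates per worker thread.
struct VringEpollHandlerStub;

impl VringEpollHandlerStub {
    fn register_listener(&self, _fd: RawFd, _data: u64) {
        // ... add the fd to the worker's epoll set ...
    }
}

struct GpuBackend {
    // Filled in after the daemon (and its epoll handlers) have been created.
    epoll_handler: OnceLock<Arc<VringEpollHandlerStub>>,
}

impl GpuBackend {
    fn set_epoll_handler(&self, handler: Arc<VringEpollHandlerStub>) {
        let _ = self.epoll_handler.set(handler);
    }

    // Called by the event loop. With VhostUserBackendMut the whole backend
    // would be behind a lock here, which is what makes re-entering the
    // event-loop machinery (anything that ends up calling num_queues) risky.
    fn handle_event(&self, new_fd: RawFd) {
        if let Some(handler) = self.epoll_handler.get() {
            handler.register_listener(new_fd, 42);
        }
    }
}

fn main() {
    let backend = GpuBackend { epoll_handler: OnceLock::new() };
    backend.set_epoll_handler(Arc::new(VringEpollHandlerStub));
    backend.handle_event(0);
}
```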
3. `VhostUserBackend` vs `VhostUserBackendMut`
In general, the split into the `VhostUserBackend` and `VhostUserBackendMut` traits doesn't seem like a good design choice to me. When you specify that you want 1 worker thread and use the `VhostUserBackendMut` trait, it mostly works fine (except for the `!Send` problem), but if you want 2 worker threads or something more complex, you have to switch over to `VhostUserBackend` and use manual locking. Another problem is that if you don't switch to the `VhostUserBackend` trait, you get 2 worker threads, but `handle_event` can never run on both at the same time because it is behind a lock, which is a non-obvious performance bug (sketched below).
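To make that performance pitfall concrete, here is a std-only sketch of what the `VhostUserBackendMut` path effectively boils down to (the `MyBackendMut` type and the loop are made up for illustration): even with two worker threads, everything serializes on one lock, so `handle_event` never runs in parallel.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Rough model of what VhostUserBackendMut amounts to: the crate wraps the
// backend in a single lock and every worker thread goes through it.
struct MyBackendMut {
    processed: u64,
}

impl MyBackendMut {
    fn handle_event(&mut self, _device_event: u16) {
        // Pretend this is the actual vring processing.
        self.processed += 1;
    }
}

fn main() {
    let backend = Arc::new(Mutex::new(MyBackendMut { processed: 0 }));

    // Two "worker threads": even though they poll different vrings, they
    // serialize on the same lock, so handle_event never runs in parallel.
    let handles: Vec<_> = (0..2u16)
        .map(|queue| {
            let backend = Arc::clone(&backend);
            thread::spawn(move || {
                for _ in 0..1000 {
                    backend.lock().unwrap().handle_event(queue);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}
```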
More flexible event handling API
My main idea is to let the device spawn its own worker threads and use epoll directly in them, instead of having the `handle_event` callback. This would basically change the event system from "push" to "poll". I think this would be a lot more flexible, because it would allow:
- any custom thread <-> vring mapping
- easily adding and removing file descriptors to watch
- detailed control over which thread has access to what, leading to more understandable and efficient code
One of the problems with the current implementation is that, since `handle_event` is both a method and a callback, the type system basically forces `self` to be `Send` + `Sync`, which in practice forces you to lock everything - and this could be avoided:
```rust
struct FooDevice { ... }

fn worker(mem: GuestMemoryAtomic<GuestMemoryMmap>, vring: VringRwLock, exit_event: EventFd) {
    // Epoll abstraction details are not that important for this proposal,
    // `Poller` is just a placeholder and could change.
    let mut poller = Poller::new();
    let vring_ready_event = vring.ready_event();
    poller.add(vring_ready_event);
    poller.add(exit_event);

    loop {
        let event = poller.poll();
        if event == exit_event {
            return;
        }
        if event == vring_ready_event {
            // Handle the vring event here.
            // You can even add/remove event file descriptors to/from the epoll.
        }
    }
}

impl VhostUserBackend for FooDevice {
    fn num_queues(&self) -> usize { ... }
    fn max_queue_size(&self) -> usize { ... }
    fn features(&self) -> u64 { ... }
    fn protocol_features(&self) -> VhostUserProtocolFeatures { ... }
    // Other methods stay as before, except `queues_per_thread(&self)`, which would be removed.

    // The entry point where we start the device. This would require the device
    // to start at least 1 worker thread, but we do that today anyway, so it
    // shouldn't be an issue.
    // (The exact mem and vring types should be more generic, this is just an example.)
    fn start(&mut self, vrings: Box<[VringRwLock]>, quit_event: EventFd) -> Result<()> {
        // Notice the thread doesn't have to have a reference to `self`.
        // If you need to share state between the worker threads, you can use
        // something like Arc<Mutex<YourSharedDataStruct>>, or even a channel.
        thread::spawn(move || worker(self.mem.clone(), vrings[0], quit_event));
        Ok(())
    }
}
```
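To expand on the comment about shared state in `start()`: here is a minimal sketch (all names made up) of two workers sharing only the state that actually needs sharing, via `Arc<Mutex<...>>` plus a channel, instead of locking the whole device.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Hypothetical state shared between two queue workers; only this piece
// needs a lock, not the whole device.
#[derive(Default)]
struct SharedScanoutState {
    width: u32,
    height: u32,
}

fn main() {
    let shared = Arc::new(Mutex::new(SharedScanoutState::default()));
    let (tx, rx) = mpsc::channel::<String>();

    // Worker for queue 0: updates the shared state and notifies queue 1.
    let shared_q0 = Arc::clone(&shared);
    let q0 = thread::spawn(move || {
        let mut s = shared_q0.lock().unwrap();
        s.width = 1024;
        s.height = 768;
        drop(s);
        tx.send("scanout-updated".to_string()).unwrap();
    });

    // Worker for queue 1: reacts to the notification.
    let shared_q1 = Arc::clone(&shared);
    let q1 = thread::spawn(move || {
        let msg = rx.recv().unwrap();
        let s = shared_q1.lock().unwrap();
        println!("{msg}: {}x{}", s.width, s.height);
    });

    q0.join().unwrap();
    q1.join().unwrap();
}
```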
Consider having devices also be usable as in-process devices by VMMs
Since I am proposing changing the API anyway, it could be beneficial to make slightly bigger breaking changes and also allow consuming the devices as in-process devices. We actually aren't that far from this, but it would of course require some more thought about the abstractions.
This could be useful for libkrun and other VMMs (anyone else interested?).
Questions
I can create a more detailed proposal of the APIs, but first I wanted to see:
- Is there interest in doing smaller breaking changes to the API like I proposed?
- Is there interest in doing bigger API changes to also allow consuming the devices as in-process devices?