io_uring reduces but doesn't remove the system call overhead.
Only with in-kernel polling mode (SQPOLL) does it come close to being removed, but kernel polling mode has its own cost. If the syscall overhead is nowhere close to being a bottleneck, i.e. you don't make syscalls "that" much, e.g. because your operations take longer to complete, then using kernel polling mode can degrade overall system performance. It can also increase power consumption, and with it heat generation.
Besides that, user-mode TCP stacks can be tailored more closely to your use case, which can improve performance.
So all in all I would say that it depends on your use case. For some it will make user-mode TCP useless, or at least not worth it; for others it won't.
I am not sure how a user-mode TCP stack or DPDK would get around the power consumption issues of kernel polling. In fact, the usages I am aware of pretty much involve polling in user mode, because any context switch or OS scheduling overhead is excessive. The only thing you can do is keep your task queue full so you are always doing something.
io_uring does let you remove a lot of the syscall overhead without polling. Many operations can be submitted with just one syscall, and ready completions can be consumed without a syscall at all.
Additionally, compared to using epoll/select/... for network IO, one can just submit a send/recv, instead of patterns like recv -> EAGAIN -> epoll_wait -> recv.
It does remove the syscall overhead, but the IO itself is still performed by the kernel, so the CPU still needs to switch regularly between user and kernel mode. With a full user-level network stack and correct interrupt steering, the kernel need not be involved at all and the CPU can stay in userspace the whole time.
Or you can run the kernel IO thread on another CPU, but that itself has overhead compared to performing the IO and handling the data all on the same thread.
I see, so the extra kernel IO thread, which can be spin-waiting, is one extra busy core, and the latency of getting data from one core to the other is the additional overhead.
Are ready completions strictly determined by continuous polling if there's no system call involved? If lots of applications end up using this method, will it increase power consumption, with many processes actively idling until a new entry shows up in the completion queue?
Thanks for explaining. I was confused by the "no system call on consumption" example in the blog post, but if it uses a system call after emptying the completion queue then that'll work just fine.