Making USB WiFi 55% Faster on Linux: Threaded NAPI in the mt76 Driver
If you've used a MediaTek USB WiFi adapter (mt7921u and friends) on Linux and wondered why bulk downloads felt slower than the same chip over PCIe — there was a structural reason. The USB RX path in the mt76 drivers never got GRO. I reworked it to run from a threaded NAPI, and bulk TCP throughput on an mt7921u went from ~380 to ~588 Mbit/s. The patch is on linux-wireless, and Lorenzo Bianconi, the mt76 maintainer, Acked v3.
The problem: no GRO on the USB path
The DMA-based mt76 drivers deliver received frames through NAPI, which means the network stack can run GRO (generic receive offload) — coalescing consecutive TCP segments into larger super-packets before they climb the stack. That coalescing is where a lot of bulk-TCP efficiency comes from.
The USB path didn't do that. It called mt76_rx_complete() with a NULL napi pointer, which routes delivery through netif_receive_skb_list() — a path with no GRO at all. Every single TCP segment traversed the full network stack individually. At hundreds of megabits, that per-packet overhead is the bottleneck, not the radio.
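To make the per-packet cost concrete, here is a toy user-space model in C. It is not kernel code and not taken from the patch: struct segment, both function names and the max_bytes limit are hypothetical. Real GRO also tracks several flows at once and checks far more than sequence numbers. The sketch only counts how many times the stack is entered with and without coalescing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a received TCP segment; this is not a kernel sk_buff. */
struct segment {
    uint32_t flow; /* hypothetical flow identifier */
    uint32_t seq;  /* sequence number of the first payload byte */
    uint32_t len;  /* payload length in bytes */
};

/* Without coalescing, every segment is one full trip up the stack. */
size_t deliveries_without_gro(const struct segment *segs, size_t n)
{
    assert(segs != NULL || n == 0);
    return n;
}

/*
 * With coalescing, a segment that directly continues the previous one
 * (same flow, contiguous sequence number) is merged into the open
 * super-packet as long as the merged payload stays within max_bytes.
 * Returns the number of super-packets handed to the stack.
 */
size_t deliveries_with_gro(const struct segment *segs, size_t n,
                           uint32_t max_bytes)
{
    size_t deliveries = 0;
    uint32_t cur_flow = 0, next_seq = 0, cur_len = 0;
    int open = 0; /* is there a super-packet being built? */

    assert(segs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        const struct segment *s = &segs[i];

        if (open && s->flow == cur_flow && s->seq == next_seq &&
            (uint64_t)cur_len + s->len <= max_bytes) {
            /* Contiguous: grow the open super-packet. */
            cur_len += s->len;
            next_seq += s->len;
            continue;
        }

        /* Not mergeable: start a new super-packet. */
        deliveries++;
        open = 1;
        cur_flow = s->flow;
        next_seq = s->seq + s->len;
        cur_len = s->len;
    }
    return deliveries;
}
```

In this model, eight back-to-back 1448-byte segments of one flow cost eight stack traversals without coalescing and a single one with it, given a large enough max_bytes.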
The fix: a threaded NAPI for the main RX queue
The rework moves the main USB RX queue onto a NAPI instance running in threaded mode:
- The URB completion handler now just schedules the NAPI.
- The NAPI poll drains completed URBs, builds the skbs, resubmits the URBs, and delivers frames through napi_gro_receive(), so GRO finally happens.
- Running the NAPI threaded puts RX processing in its own kernel thread, so the datapath runs in parallel with the rest of the driver instead of competing inside the USB completion context.
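The split between "completion only schedules" and "poll does the work" can be modeled in a few lines of user-space C. This is a sketch of the general NAPI contract, not the driver's code: toy_rxq, toy_urb_complete(), toy_poll() and RING_SIZE are hypothetical names, and the real poll function builds skbs and calls napi_gro_receive() where this model just bumps counters. The one behavior it tries to capture faithfully is the budget rule: a poll that uses its whole budget stays scheduled, and one that finishes under budget completes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 64 /* hypothetical number of in-flight URBs */

struct toy_rxq {
    size_t completed;     /* URBs completed but not yet processed */
    size_t delivered;     /* frames handed to the stack */
    size_t resubmitted;   /* URBs given back to the USB core */
    bool scheduled;       /* is a poll pending? */
    unsigned int wakeups; /* how often the poll had to be scheduled */
};

/* Completion handler: record the URB and schedule the poll, nothing more. */
void toy_urb_complete(struct toy_rxq *q)
{
    assert(q != NULL && q->completed < RING_SIZE);
    q->completed++;
    if (!q->scheduled) {
        q->scheduled = true;
        q->wakeups++;
    }
}

/*
 * Poll: process at most `budget` completed URBs. Each one is turned into
 * a delivered frame and resubmitted. Finishing under budget means the
 * queue is drained, so the poll marks itself complete; using the full
 * budget leaves it scheduled so it runs again.
 */
size_t toy_poll(struct toy_rxq *q, size_t budget)
{
    size_t done = 0;

    assert(q != NULL);
    while (done < budget && q->completed > 0) {
        q->completed--;
        q->delivered++;   /* stands in for building and delivering an skb */
        q->resubmitted++; /* stands in for resubmitting the URB */
        done++;
    }
    if (done < budget)
        q->scheduled = false;
    return done;
}
```

Ten completions followed by polls with a budget of 4 give work counts of 4, 4 and 2, with only one wakeup for the whole burst. That batching is what gives GRO consecutive segments to merge.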
Keeping the footprint small
Kernel review rewards minimalism, and this patch went through in three revisions partly because it adds no new state: it reuses mt76_dev's existing napi_dev and napi[] members on a dummy netdev, exactly as the DMA path does. The MCU (firmware control) queue stays on the existing RX worker; only the data queue moves. The threaded-NAPI approach itself was suggested by Lorenzo Bianconi during review of an earlier revision.
The numbers
Benchmarked on an mt7921u USB adapter, 2x2 at 80 MHz, HE-MCS 11, bulk TCP over multiple streams:
- Unmodified driver: ~380 Mbit/s
- NAPI, non-threaded: ~424 Mbit/s
- Threaded NAPI: ~588 Mbit/s
GRO alone helps (about +12% over the unmodified driver); GRO plus its own thread is where the +55% comes from.
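For anyone who wants to check the percentages, the arithmetic is a one-liner. percent_gain() is a hypothetical helper written for this post, not part of any benchmark tool.

```c
#include <assert.h>

/* Relative throughput gain of `measured_mbit` over `baseline_mbit`, in percent. */
double percent_gain(double baseline_mbit, double measured_mbit)
{
    assert(baseline_mbit > 0.0);
    return (measured_mbit - baseline_mbit) / baseline_mbit * 100.0;
}
```

Fed the figures above, percent_gain(380, 424) comes to roughly 11.6, percent_gain(380, 588) to roughly 54.7, and percent_gain(424, 588) to roughly 38.7. The last figure is the share attributable to threading on top of GRO.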
FAQ
- Why is my mt7921u slow on Linux?
- Historically, the mt76 USB RX path delivered every TCP segment individually with no GRO. This patch fixes that structurally; until it reaches your kernel, per-packet stack overhead caps bulk TCP well below what the radio can do.
- What is threaded NAPI?
- Normally NAPI polling runs in softirq context. Threaded NAPI moves it into a dedicated kernel thread, which the scheduler can place on its own CPU — useful when the completion context (here, USB) is itself busy.
- Which adapters does this affect?
- USB adapters driven by mt76 — mt7921u-class hardware being the common case. The DMA/PCIe mt76 drivers already had NAPI + GRO.