Trying to cover what I don’t know about Linux networking
Quite often I ask myself about the role of Linux kernel memory in its network components. How much of it is allocated, and when? If it’s allocated on the fly, will other memory allocations (like the page cache or write buffers) affect network performance?
This article is my first attempt to answer those questions for myself, starting with network ring buffers.
What are network ring buffers?
A network ring buffer is a memory region that the NIC can access directly (over DMA) to store incoming data.
There are actually two parts: a ring (a fixed-size circular queue) of descriptors, each pointing to a memory region that holds the data, and the “buffer” or memory part itself, which is mostly hidden from us and handled by the kernel in the background.
There are two types of ring buffer:
- RX ring buffer, for receiving data
- TX ring buffer, for sending data
Both TX and RX ring buffers are bound to a specific network interface and shared between all applications that send or receive traffic through that interface, so knowing their limits is important.
The ring / queue part
Again, the ring part is basically a fixed-size circular FIFO queue containing descriptors, which are simply pointers to memory addresses.
And wherever there is a fixed-size queue, there is a way to exceed its length.
When you send or receive more data than you can process, the queue fills up and packets get dropped. Luckily we can check for this with ethtool, like this:
ethtool -S enp8s0f0 | grep drop
dropped_smbus: 0
tx_dropped: 0
rx_queue_0_drops: 0
rx_queue_1_drops: 0
rx_queue_2_drops: 0
rx_queue_3_drops: 0
rx_queue_4_drops: 0
rx_queue_5_drops: 0
rx_queue_6_drops: 0
rx_queue_7_drops: 0
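If you want to watch these counters move in near real time, something like the snippet below works. The interface name is the one from above, and the exact counter names vary per driver, so treat the grep pattern as a starting point:
# refresh the drop counters every second and highlight what changed
watch -d -n 1 "ethtool -S enp8s0f0 | grep -i drop"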
When the dropped-packet counters increase, we can try increasing our ring buffer size.
There are two ways to do that:
- Increase the size of each queue: the length of the queue is the number of descriptors pointing to DMA-mapped memory regions, so the longer the queue, the more places there are to store data. Use ethtool -g to show the maximum possible and current values, and ethtool -G to actually adjust them. You can easily find detailed instructions and examples on the internet.
- Use a multi-queue NIC: modern NICs support multiple queues, meaning multiple separate DMA mappings working in parallel and therefore higher throughput. As above, use ethtool -l to view the current setting and ethtool -L to adjust it. (See the short example right after this list.)
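A minimal sketch of what that looks like in practice; the values 4096 and 8 are only illustrative, the real maximums depend on your NIC and driver:
ethtool -g enp8s0f0 # show current and maximum ring sizes
sudo ethtool -G enp8s0f0 rx 4096 tx 4096 # grow both rings to 4096 descriptors
ethtool -l enp8s0f0 # show how many queues (channels) the NIC supports
sudo ethtool -L enp8s0f0 combined 8 # spread work across 8 combined RX/TX queues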
Though, as with any queue in the world, increasing the queue size won’t always solve your issue. A long queue can give you high throughput but can also cause very high latency outliers. A short queue can be used to optimize latency, but it comes with the risk of dropping packets when there is too much to handle. As usual, use this at your own risk.
On the sending path, Linux implements features like queueing disciplines and Byte Queue Limits (BQL) to manage this queueing problem for us automatically. The receiving path, on the other hand, doesn’t have anything like that.
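If you’re curious, both mechanisms are visible from userspace. A hedged example, assuming a BQL-capable driver and TX queue 0:
tc qdisc show dev enp8s0f0 # the queueing discipline attached to the interface
cat /sys/class/net/enp8s0f0/queues/tx-0/byte_queue_limits/limit # current BQL limit in bytes for TX queue 0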
The buffer / memory part
The RX ring buffer, to me, is like an in-memory “kafka topic” holding all incoming packets that we need to consume as quickly as possible. And the consumer here is the ksoftirqd -> net_rx_action handler, which in turn calls a driver-specific poll function; in our case that is eventually igb_poll.
A very simplified version of the asynchronous receive process looks like this:
- An incoming packet arrives at the network interface card (NIC). The NIC saves it to the RX buffer in memory (our “kafka topic”) via DMA, with no CPU involved.
- Once the data is saved in RAM, the NIC raises a hardware interrupt to tell the CPU about the new arrival.
- The CPU acknowledges and releases that interrupt very quickly, and raises another signal (NET_RX_SOFTIRQ) for ksoftirqd.
- The NIC can now receive more packets, and the CPU can handle other interrupts, while ksoftirqd consumes data from the RX buffer in the background, without blocking the whole process. The main idea is that the hardware interrupt should be handled as quickly as possible; the software interrupt does the real work. You can watch this split with the commands shown right after this list.
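You can see this division of labor on a live box. Hardware interrupt counts per NIC queue live in /proc/interrupts, and the NET_RX software interrupt counts per CPU live in /proc/softirqs (the exact vector names depend on your driver):
grep enp8s0f0 /proc/interrupts # hardware interrupts raised by each NIC queue
grep NET_RX /proc/softirqs # NET_RX softirqs handled by each CPU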
Here’s a real-world kernel stack (I had to trace an upper-layer function to get the full call stack of these layer-2 functions):
bpftrace -e 'kprobe:netif_receive_skb_internal {printf("%s\n",kstack);}'
netif_receive_skb_internal+1
napi_gro_receive+186 ///End of pulling data out of rx ring buffer
igb_poll+1153
net_rx_action+329
__do_softirq+222
irq_exit+186
do_IRQ+127
ret_from_intr+0
cpuidle_enter_state+182
do_idle+552
cpu_startup_entry+111
start_secondary+420
secondary_startup_64+164
(If you want to know the sequence of steps Linux performs while receiving packets in more detail, please read this fantastic article.)
Now we’re dealing with the data itself. And its size (in bytes, not queue length anymore) matters.
What I don’t know for sure right now is: what is the maximum possible size of the DMA memory allocated for all the descriptors registered in the RX ring? And what happens if we cross that limit?
At the device initialization step, Linux calls a driver-specific function to create the resources; in our case it’s igb_setup_rx_resource, which allocates both the buffer (via vmalloc) and the ring’s descriptors via dma_alloc_coherent. The size of the buffer is
sizeof(struct igb_rx_buffer) * rx_ring->count;
The size of igb_rx_buffer is a bit bigger than struct page, which is around 64 bytes, so the DMA range for one NIC, if I understand all these things correctly, is up to 64 bytes * 4096 (the maximum RX queue size for a typical NIC) * 8 (the number of queues) = 2MB.
Which is still good enough, because each “buffer”, represented by an sk_buff struct, holds one Ethernet data frame. NICs usually follow the IEEE 802.3 standard, where each frame is at most 1522 bytes. Some NICs support jumbo frames of roughly 9000 bytes (which need to be enabled on both the sending and receiving ends), still well under our limit.
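The frame size an interface accepts is driven by its MTU. A hedged example of checking it and, if every hop on the path supports it, enabling jumbo frames:
ip link show enp8s0f0 | grep mtu # current MTU, usually 1500
sudo ip link set dev enp8s0f0 mtu 9000 # jumbo frames; the other end and any switches in between must support this too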
In case you use a custom NIC that sends an “illegal” data size, the packet may be dropped or, worse, the system may crash.
Thus, without any hacky custom NIC violating the standard, we shouldn’t worry about our RX buffer overflowing, because:
- The allocated memory should be enough.
- The number of incoming packets is controlled by the ring / queue size: if there’s too much to handle, packets get dropped.
And last but not least: the ring buffer is continuously drained, or consumed, by net_rx_action, in our case via igb_poll.
On a very busy network server, you may see ksoftirqd consuming quite a lot of CPU, and if you dig deeper with perf, for example, igb_poll can show up as hot code taking a fair share of the CPU. That’s because it’s actively helping you shovel all the data out of the RX ring buffer!
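A rough way to confirm this with perf, sampling all CPUs for ten seconds (the hot symbols you see will depend on your driver):
sudo perf record -a -g -- sleep 10 # sample all CPUs with call graphs for 10 seconds
sudo perf report --sort symbol # look for igb_poll / net_rx_action near the top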
There are three things that can stop net_rx_action/igb_poll from doing that:
- There is no more data to read.
- It reaches its budget limit, which is there to keep it from consuming too much CPU.
- It takes too long to finish.
Once again, following this fantastic article, we can both detect and improve those cases, by checking the info in /proc/net/softnet_stat to see whether our consuming process was interrupted while there was still data, like this:
cat /proc/net/softnet_stat | awk '{print $3}'
.......
0000000c
0000002b
00000004
00000075
00000018
00000003
0000006e
00000071
Field $3 is:
The third value, sd->time_squeeze, is (as we saw) the number of times the net_rx_action loop terminated because the budget was consumed or the time limit was reached, but more work could have been done. Increasing the budget as explained earlier can help reduce this.
If those values are > 0, we’ve hit some limit. In that case, we can let net_rx_action/igb_poll use more CPU by increasing the budget like this:
echo your_value > /proc/sys/net/core/netdev_budget
By default it’s 300. If you have plenty of free CPU, with irqbalance installed, feel free to triple it or more. Observe both %si in CPU utilization and the data in softnet_stat to see the effect.
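Equivalently, and to make the change survive reboots, you can go through sysctl; the value 600 and the file name below are just examples:
sudo sysctl -w net.core.netdev_budget=600 # takes effect immediately
echo 'net.core.netdev_budget = 600' | sudo tee /etc/sysctl.d/90-netdev.conf # persists across reboots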
One more thing we can do is speed up memory access, by using faster RAM or by checking whether we have any NUMA-related issue (a quick check is shown below).
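A hedged way to spot a NUMA mismatch is to compare the node the NIC is attached to with the nodes your busy CPUs belong to:
cat /sys/class/net/enp8s0f0/device/numa_node # NUMA node of the NIC (-1 means no NUMA information)
lscpu | grep -i numa # which CPUs belong to which node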
We have a simple but crucial point here: faster and bigger CPU and RAM do increase your network performance (in terms of both latency and throughput).
Is it too obvious to say that system resources are closely connected to each other?
Epilogue
The Linux ring buffer internals, the NIC initialization process, and the send / receive sequence of steps are way more complicated than what is described in this post.
I just want to explain the important things in the most understandable form, so everybody can apply them easily in daily tasks. Hope it helps.
More posts about Linux networking are coming. Stay tuned!
References
- A must read: https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data
- https://www.linuxjournal.com/content/queueing-linux-network-stack
- https://stackoverflow.com/questions/47450231/what-is-the-relationship-of-dma-ring-buffer-and-tx-rx-ring-for-a-network-card
- https://wiki.linuxfoundation.org/networking/kernel_flow
- https://people.cs.clemson.edu/~westall/853/notes/skbuff.pdf