Hitchhiker's guide to high-performance with netsniff-ng:
////////////////////////////////////////////////////////

This is a collection of short notes in random order concerning software
and hardware for optimizing throughput (partly copied or derived from
sources that are mentioned at the end of this file):

<=== Hardware ====>

.-=> Use a PCI-X or PCIe server NIC
`--------------------------------------------------------------------------
Just because it says Gigabit Ethernet on the box of your NIC does not
necessarily mean that you will also reach it. Especially with small
packet sizes, you won't reach wire-rate with a PCI adapter built for
desktop or consumer machines. Rather, you should buy a server adapter
that has faster interconnects such as PCIe. Also, base your choice of a
server adapter on whether it has good support in the kernel. Check the
Linux drivers directory for your targeted chipset and look at the netdev
list to see if the adapter's driver is updated frequently.

Also, check the location/slot of the NIC adapter on the system
motherboard: in our experience, placing the NIC adapter in different
PCIe slots resulted in significantly different measurement values. Since
we did not have schematics for the system motherboard, this was a trial
and error effort. Moreover, check the specifications of the NIC
hardware: is the system bus connector I/O capable of Gigabit Ethernet
frame rate throughput? Also check the network topology: is your Gigabit
switch capable of switching Ethernet frames at the maximum rate, or is a
direct connection of two end-nodes the better solution? Is Ethernet flow
control being used? "ethtool -a eth0" can be used to determine this.
For measurement purposes, you might want to turn flow control off to
increase throughput:

 * ethtool -A eth0 autoneg off
 * ethtool -A eth0 rx off
 * ethtool -A eth0 tx off

.-=> Use better (faster) hardware
`--------------------------------------------------------------------------
Before doing software-based fine-tuning, check if you can afford better
and especially faster hardware. For instance, get a fast CPU with lots
of cores, or a NUMA architecture with multi-core CPUs and a fast
interconnect. If you dump PCAP files to disc with netsniff-ng, then a
fast SSD is appropriate. If you plan to memory map PCAP files with
netsniff-ng, then choose an appropriate amount of RAM, and so on and so
forth.

<=== Software (Linux kernel specific) ====>

.-=> Use NAPI drivers
`--------------------------------------------------------------------------
The "New API" (NAPI) is a rework of the packet processing code in the
kernel to improve performance for high-speed networking. NAPI provides
two major features:

Interrupt mitigation: High-speed networking can create thousands of
interrupts per second, all of which tell the system something it already
knew: it has lots of packets to process. NAPI allows drivers to run with
(some) interrupts disabled during times of high traffic, with a
corresponding decrease in system load.

Packet throttling: When the system is overwhelmed and must drop packets,
it's better if those packets are disposed of before much effort goes
into processing them. NAPI-compliant drivers can often cause packets to
be dropped in the network adaptor itself, before the kernel sees them at
all.

Many recent NIC drivers automatically support NAPI, so you don't need to
do anything. Some drivers need you to explicitly specify NAPI in the
kernel config or on the command line when compiling the driver. If you
are unsure, check your driver documentation.
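As a quick sanity check for interrupt mitigation, you can estimate your
NIC's interrupt rate from /proc/interrupts. A minimal sketch, assuming
the device's interrupt lines are labelled "eth0" and summing only the
first per-CPU counter column (CPU0):

```shell
# Sum the first per-CPU counter column of every line in a
# /proc/interrupts-style file whose description matches a device label.
count_irqs() { awk -v dev="$2" '$0 ~ dev { sum += $2 } END { print sum + 0 }' "$1"; }

# Estimate the NIC's interrupt rate by sampling one second apart.
# (The label "eth0" is an assumption; only the CPU0 column is summed.)
if [ -r /proc/interrupts ]; then
    before=$(count_irqs /proc/interrupts eth0)
    sleep 1
    after=$(count_irqs /proc/interrupts eth0)
    echo "eth0 interrupts/s (CPU0 column): $(( after - before ))"
fi
```

With NAPI (or the interrupt throttling described further below) active,
this figure should stay bounded even under heavy traffic.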
.-=> Use a tickless kernel
`--------------------------------------------------------------------------
The tickless kernel feature allows for on-demand timer interrupts. This
means that during idle periods, fewer timer interrupts will fire, which
should lead to power savings, cooler running systems, and fewer useless
context switches. (Kernel option: CONFIG_NO_HZ=y)

.-=> Reduce timer interrupts
`--------------------------------------------------------------------------
You can select the rate at which timer interrupts in the kernel will
fire. When a timer interrupt fires on a CPU, the process running on that
CPU is interrupted while the timer interrupt is handled. Reducing the
rate at which the timer fires allows for fewer interruptions of your
running processes. This option is particularly useful for servers with
multiple CPUs where processes are not running interactively. (Kernel
options: CONFIG_HZ_100=y and CONFIG_HZ=100)

.-=> Use Intel's I/OAT DMA Engine
`--------------------------------------------------------------------------
This kernel option enables the Intel I/OAT DMA engine that is present in
recent Xeon CPUs. This option increases network throughput as the DMA
engine allows the kernel to offload network data copying from the CPU to
the DMA engine. This frees up the CPU to do more useful work.

Check to see if it's enabled:

 [foo@bar]% dmesg | grep ioat
 ioatdma 0000:00:08.0: setting latency timer to 64
 ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, [...]
 ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X

There's also a sysfs interface where you can get some statistics about
the DMA engine. Check the directories under /sys/class/dma/.
(Kernel options: CONFIG_DMADEVICES=y and CONFIG_INTEL_IOATDMA=y and
CONFIG_DMA_ENGINE=y and CONFIG_NET_DMA=y and CONFIG_ASYNC_TX_DMA=y)

.-=> Use Direct Cache Access (DCA)
`--------------------------------------------------------------------------
Intel's I/OAT also includes a feature called Direct Cache Access (DCA).
DCA allows a driver to warm a CPU cache. A few NICs support DCA, the
most popular (to my knowledge) being the Intel 10GbE driver (ixgbe).
Refer to your NIC driver documentation to see if your NIC supports DCA.
To enable DCA, a switch in the BIOS must be flipped. Some vendors supply
machines that support DCA, but don't expose a switch for it.

You can check if DCA is enabled:

 [foo@bar]% dmesg | grep dca
 dca service started, version 1.8

If DCA is possible on your system but disabled, you'll see:

 ioatdma 0000:00:08.0: DCA is disabled in BIOS

which means you'll need to enable it in the BIOS or manually. (Kernel
option: CONFIG_DCA=y)

.-=> Throttle NIC Interrupts
`--------------------------------------------------------------------------
Some drivers allow the user to specify the rate at which the NIC will
generate interrupts. The e1000e driver allows you to pass the command
line option InterruptThrottleRate when loading the module. For the
e1000e there are two dynamic interrupt throttle mechanisms, specified on
the command line as 1 (dynamic) and 3 (dynamic conservative). The
adaptive algorithm classifies traffic into different classes and adjusts
the interrupt rate appropriately. The difference between dynamic and
dynamic conservative is the rate for the 'Lowest Latency' traffic class:
dynamic (1) has a much more aggressive interrupt rate for this traffic
class. As always, check your driver documentation for more information.
For example, with modprobe:

 modprobe e1000e InterruptThrottleRate=1

.-=> Use Process and IRQ affinity
`--------------------------------------------------------------------------
Linux allows the user to specify which CPUs processes and interrupt
handlers are bound to.

Processes: You can use taskset to specify which CPUs a process can run
on.

Interrupt Handlers: The interrupt map can be found in /proc/interrupts,
and the affinity for each interrupt can be set in the file smp_affinity
in the directory for each interrupt under /proc/irq/.

This is useful because you can pin the interrupt handlers for your NICs
to specific CPUs so that when a shared resource is touched (a lock in
the network stack) and loaded into a CPU cache, the next time the
handler runs, it will be put on the same CPU, avoiding the costly cache
invalidations that can occur if the handler is put on a different CPU.
Reports of up to a 24% improvement can be had if processes and the IRQs
for the NICs the processes get data from are pinned to the same CPUs.
Doing this ensures that the data loaded into the CPU cache by the
interrupt handler can be used (without invalidation) by the process;
extremely high cache locality is achieved.

NOTE: If netsniff-ng or trafgen is bound to a specific CPU, it
automatically migrates the NIC's IRQ affinity to this CPU to achieve
high cache locality.

.-=> Tune the socket's memory allocation area
`--------------------------------------------------------------------------
By default, each socket has a backend memory of between 130KB and 160KB
on an x86/x86_64 machine with 4GB RAM. Hence, network packets can be
received on the NIC driver layer, but later dropped at the socket queue
due to memory restrictions. "sysctl -a | grep mem" will display your
current memory settings. To increase the maximum and default values of
the read and write memory areas, use:

 * sysctl -w net.core.rmem_max=8388608
   This sets the max OS receive buffer size for all types of
   connections.
 * sysctl -w net.core.wmem_max=8388608
   This sets the max OS send buffer size for all types of connections.
 * sysctl -w net.core.rmem_default=65536
   This sets the default OS receive buffer size for all types of
   connections.
 * sysctl -w net.core.wmem_default=65536
   This sets the default OS send buffer size for all types of
   connections.

.-=> Enable Linux' BPF Just-in-Time compiler
`--------------------------------------------------------------------------
If you're using filtering with netsniff-ng (or tcpdump, Wireshark, ...),
you should activate the Berkeley Packet Filter Just-in-Time compiler.
The Linux kernel has a built-in "virtual machine" that interprets BPF
opcodes for filtering packets; those small filter programs are applied
to each packet. (Read more about this in the bpfc document.) The
Just-in-Time compiler is able to 'compile' such a filter program to
assembler code that can be run directly on the CPU instead of on the
virtual machine. If netsniff-ng or trafgen detects that the BPF JIT is
present on the system, it automatically enables it. (Kernel options:
CONFIG_HAVE_BPF_JIT=y and CONFIG_BPF_JIT=y)

.-=> Increase the TX queue length
`--------------------------------------------------------------------------
There are settings available to regulate the size of the queue between
the kernel network subsystem and the driver for the network interface
card. Just as with any queue, it is recommended to size it such that
losses do not occur due to local buffer overflows. Therefore, careful
tuning is required to ensure that the sizes of the queues are optimal
for your network connection.

There are two queues to consider: txqueuelen, which is related to the
transmit queue size, and netdev_backlog, which determines the receive
queue size. Users can manually set the transmit queue size using the
ifconfig command on the required device:

 ifconfig eth0 txqueuelen 2000

The default of 100 is inadequate for long-distance or high-throughput
pipes.
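A rough way to size txqueuelen is to compute the bandwidth-delay product
of the path and divide it by the frame size. A sketch of the arithmetic
(the 1 Gbit/s rate and 120 ms RTT are just example figures):

```shell
rate_bps=1000000000   # link rate: 1 Gbit/s
rtt_ms=120            # round-trip time of the path in milliseconds
frame=1500            # MTU-sized Ethernet frame in bytes

# Bandwidth-delay product: the number of bytes in flight on such a path.
bdp_bytes=$(( rate_bps / 8 * rtt_ms / 1000 ))
frames=$(( bdp_bytes / frame ))
echo "BDP: $bdp_bytes bytes, i.e. about $frames full-sized frames"
# -> BDP: 15000000 bytes, i.e. about 10000 full-sized frames
```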
For example, on a network with an RTT of 120ms and at Gigabit rates, a
txqueuelen of at least 10000 is recommended.

.-=> Increase kernel receiver backlog queue
`--------------------------------------------------------------------------
For the receiver side, we have a similar queue for incoming packets.
This queue will build up in size when an interface receives packets
faster than the kernel can process them. If this queue is too small (the
default is 300), we will begin to lose packets at the receiver, rather
than on the network. One can set this value by:

 sysctl -w net.core.netdev_max_backlog=2000

.-=> Use a RAM-based filesystem if possible
`--------------------------------------------------------------------------
If you have a considerable amount of RAM, you can also think of using a
RAM-based file system such as ramfs for dumping pcap files with
netsniff-ng. This can be useful for small to medium-sized pcap files, or
for pcap probes that are generated with netsniff-ng.

<=== Software (netsniff-ng / trafgen specific) ====>

.-=> Bind netsniff-ng / trafgen to a CPU
`--------------------------------------------------------------------------
Both tools have a command-line option '--bind-cpu' that can be used like
'--bind-cpu 0' in order to pin the process to a specific CPU. This was
already mentioned earlier in this file. However, netsniff-ng and trafgen
are able to do this without an external tool. Next to this CPU pinning,
they also automatically migrate this CPU's NIC IRQ affinity. Hence, with
'--bind-cpu 0', netsniff-ng will not be migrated to a different CPU and
the NIC's IRQ affinity will also be moved to CPU 0 to increase cache
locality.

.-=> Use netsniff-ng in silent mode
`--------------------------------------------------------------------------
Don't print information to the console while you want to achieve high
speed, because this slows the application down considerably. Hence, use
netsniff-ng's '--silent' option when recording or replaying PCAP files!
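Putting the last few points together, a capture session might look like
the following sketch. It is not a definitive recipe: the tmpfs mount
point, its size, and the interface name are assumptions, and mounting
requires root privileges.

```shell
# Use a RAM-backed tmpfs so disc I/O does not throttle the capture
# (assumed mount point and size; adjust to your amount of RAM).
mkdir -p /mnt/pcap
mount -t tmpfs -o size=2g tmpfs /mnt/pcap

# Record silently, pinned to CPU 0; per the notes above, this also
# migrates the NIC's IRQ affinity to CPU 0.
netsniff-ng --in eth0 --out /mnt/pcap/dump.pcap --silent --bind-cpu 0
```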
.-=> Use netsniff-ng's scatter/gather or mmap for PCAP files
`--------------------------------------------------------------------------
The scatter/gather I/O mode, which is the default in netsniff-ng, can be
used to record large PCAP files, but is slower than memory-mapped I/O.
However, you don't have the RAM size as your limit for recording. Use
netsniff-ng's memory-mapped I/O option for achieving a higher speed when
recording a PCAP file, but with the trade-off that the maximum allowed
size is limited.

.-=> Use static packet configurations in trafgen
`--------------------------------------------------------------------------
Don't use counters or byte randomization in the trafgen configuration
file, since they slow down the packet generation process. Static packet
bytes are the fastest to go with.

.-=> Generate packets with different txhashes in trafgen
`--------------------------------------------------------------------------
For 10Gbit/s multiqueue NICs, it might be good to generate packets that
result in different txhashes, so that multiple queues are used in the
transmission path (and therefore quite likely also multiple CPUs).

Sources:
~~~~~~~~

 * http://www.linuxfoundation.org/collaborate/workgroups/networking/napi
 * http://datatag.web.cern.ch/datatag/howto/tcp.html
 * http://thread.gmane.org/gmane.linux.network/191115
 * http://bit.ly/3XbBrM
 * http://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php
 * http://bit.ly/pUFJxU