Hitchhiker's guide to high-performance with netsniff-ng:
////////////////////////////////////////////////////////

This is a collection of short notes in random order concerning software
and hardware for optimizing throughput (partly copied or derived from sources
that are mentioned at the end of this file):

<=== Hardware ====>

.-=> Use a PCI-X or PCIe server NIC
`--------------------------------------------------------------------------
Just because the box of your NIC says Gigabit Ethernet does not necessarily
mean that you will actually reach that rate. Especially with small packet
sizes, you won't reach wire-rate with a PCI adapter built for desktop or
consumer machines. Rather, you should buy a server adapter that has a faster
interconnect such as PCIe. Also, base your choice of a server adapter on
whether it has good support in the kernel. Check the Linux drivers directory
for your targeted chipset and look at the netdev list to see whether the
driver is updated frequently. Also, check the location/slot of the NIC
adapter on the system motherboard: in our experience, placing the NIC
adapter in different PCIe slots produced significantly different measurement
values. Since we did not have schematics for the system motherboard, this
was a trial-and-error effort. Moreover, check the specifications of the NIC
hardware: is the system bus connector I/O capable of Gigabit Ethernet
frame rate throughput? Also check the network topology: is your Gigabit
switch capable of switching Ethernet frames at the maximum rate, or is a
direct connection of two end-nodes the better solution? Is Ethernet
flow control being used? "ethtool -a eth0" can be used to determine this.
For measurement purposes, you might want to turn it off to increase throughput:
 * ethtool -A eth0 autoneg off
 * ethtool -A eth0 rx off
 * ethtool -A eth0 tx off

.-=> Use better (faster) hardware
`--------------------------------------------------------------------------
Before doing software-based fine-tuning, check if you can afford better and
especially faster hardware. For instance, get a fast CPU with lots of cores
or a NUMA architecture with multi-core CPUs and a fast interconnect. If you
dump PCAP files to disk with netsniff-ng, then a fast SSD is appropriate.
If you plan to memory map PCAP files with netsniff-ng, then choose an
appropriate amount of RAM, and so on.

<=== Software (Linux kernel specific) ====>

.-=> Use NAPI drivers
`--------------------------------------------------------------------------
The "New API" (NAPI) is a rework of the packet processing code in the
kernel to improve performance for high-speed networking. NAPI provides
two major features:

Interrupt mitigation: High-speed networking can create thousands of
interrupts per second, all of which tell the system something it already
knew: it has lots of packets to process. NAPI allows drivers to run with
(some) interrupts disabled during times of high traffic, with a
corresponding decrease in system load.

Packet throttling: When the system is overwhelmed and must drop packets,
it's better if those packets are disposed of before much effort goes into
processing them. NAPI-compliant drivers can often cause packets to be
dropped in the network adapter itself, before the kernel sees them at all.

Many recent NIC drivers automatically support NAPI, so you don't need to do
anything. Some drivers need you to explicitly specify NAPI in the kernel
config or on the command line when compiling the driver. If you are unsure,
check your driver documentation.
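
A quick way to see interrupt mitigation in effect is to take two snapshots
of your NIC's lines in /proc/interrupts under load; the difference in the
per-CPU counters is roughly the interrupt rate per second, which with NAPI
should stay far below the actual packet rate. This is only a sketch and
assumes the interface is called eth0; the exact IRQ line names (eth0,
eth0-rx-0, eth0-TxRx-0, ...) depend on your driver:

[foo@bar]% grep eth0 /proc/interrupts ; sleep 1 ; grep eth0 /proc/interrupts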

.-=> Use a tickless kernel
`--------------------------------------------------------------------------
The tickless kernel feature allows for on-demand timer interrupts. This
means that during idle periods, fewer timer interrupts will fire, which
should lead to power savings, cooler running systems, and fewer useless
context switches. (Kernel option: CONFIG_NO_HZ=y)

.-=> Reduce timer interrupts
`--------------------------------------------------------------------------
You can select the rate at which timer interrupts in the kernel will fire.
When a timer interrupt fires on a CPU, the process running on that CPU is
interrupted while the timer interrupt is handled. Reducing the rate at
which the timer fires allows for fewer interruptions of your running
processes. This option is particularly useful for servers with multiple
CPUs where processes are not running interactively. (Kernel options:
CONFIG_HZ_100=y and CONFIG_HZ=100)

.-=> Use Intel's I/OAT DMA Engine
`--------------------------------------------------------------------------
This kernel option enables the Intel I/OAT DMA engine that is present in
recent Xeon CPUs. This option increases network throughput as the DMA
engine allows the kernel to offload network data copying from the CPU to
the DMA engine. This frees up the CPU to do more useful work.

Check to see if it's enabled:

[foo@bar]% dmesg | grep ioat
ioatdma 0000:00:08.0: setting latency timer to 64
ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, [...]
ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X

There's also a sysfs interface where you can get some statistics about the
DMA engine. Check the directories under /sys/class/dma/. (Kernel options:
CONFIG_DMADEVICES=y and CONFIG_INTEL_IOATDMA=y and CONFIG_DMA_ENGINE=y and
CONFIG_NET_DMA=y and CONFIG_ASYNC_TX_DMA=y)

.-=> Use Direct Cache Access (DCA)
`--------------------------------------------------------------------------
Intel's I/OAT also includes a feature called Direct Cache Access (DCA).
DCA allows a driver to warm a CPU cache. A few NICs support DCA; the most
popular (to my knowledge) is the Intel 10GbE driver (ixgbe). Refer to your
NIC driver documentation to see if your NIC supports DCA. To enable DCA,
a switch in the BIOS must be flipped. Some vendors supply machines that
support DCA, but don't expose a switch for it.

You can check if DCA is enabled:

[foo@bar]% dmesg | grep dca
dca service started, version 1.8

If DCA is possible on your system but disabled you'll see:

ioatdma 0000:00:08.0: DCA is disabled in BIOS

This means you'll need to enable it in the BIOS or manually. (Kernel
option: CONFIG_DCA=y)

.-=> Throttle NIC Interrupts
`--------------------------------------------------------------------------
Some drivers allow the user to specify the rate at which the NIC will
generate interrupts. The e1000e driver allows you to pass a command line
option InterruptThrottleRate when loading the module. For the e1000e
there are two dynamic interrupt throttle mechanisms, specified on the
command line as 1 (dynamic) and 3 (dynamic conservative). The adaptive
algorithm classifies traffic into different classes and adjusts the
interrupt rate appropriately. The difference between dynamic and dynamic
conservative is the rate for the 'Lowest Latency' traffic class: dynamic (1)
uses a much more aggressive interrupt rate for this traffic class.

As always, check your driver documentation for more information.

With modprobe: modprobe e1000e InterruptThrottleRate=1
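
Many drivers also expose their interrupt throttling knobs through ethtool's
interrupt coalescing interface, which avoids reloading the module. Whether
and which of these parameters are supported depends on the driver, so treat
the following as a sketch:
 * modinfo -p e1000e | grep InterruptThrottleRate
   Lists the module parameter and its documented values.
 * ethtool -c eth0
   Shows the current interrupt coalescing settings.
 * ethtool -C eth0 rx-usecs 100
   Delays RX interrupts by up to 100us, i.e. lowers the interrupt rate
   (if the driver supports this parameter).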

.-=> Use Process and IRQ affinity
`--------------------------------------------------------------------------
Linux allows the user to specify which CPUs processes and interrupt
handlers are bound to.

Processes: You can use taskset to specify which CPUs a process can run on.
Interrupt Handlers: The interrupt map can be found in /proc/interrupts, and
the affinity for each interrupt can be set in the file smp_affinity in the
directory for each interrupt under /proc/irq/.

This is useful because you can pin the interrupt handlers for your NICs
to specific CPUs so that when a shared resource is touched (a lock in the
network stack) and loaded into a CPU cache, the next time the handler runs,
it will run on the same CPU, avoiding the costly cache invalidations that
can occur if the handler is put on a different CPU.

Moreover, improvements of up to 24% have been reported when processes and
the IRQs for the NICs the processes get data from are pinned to the same
CPUs. Doing this ensures that the data loaded into the CPU cache by the
interrupt handler can be used (without invalidation) by the process;
extremely high cache locality is achieved.

NOTE: If netsniff-ng or trafgen is bound to a specific CPU, it automatically
migrates the NIC's IRQ affinity to this CPU to achieve a high cache locality.

.-=> Tune Socket's memory allocation area
`--------------------------------------------------------------------------
By default, each socket has a backing memory of between 130KB and 160KB on
an x86/x86_64 machine with 4GB RAM. Hence, network packets can be received
on the NIC driver layer, but later dropped at the socket queue due to memory
restrictions. "sysctl -a | grep mem" will display your current memory
settings. To increase maximum and default values of read and write memory
areas, use:
 * sysctl -w net.core.rmem_max=8388608
   This sets the max OS receive buffer size for all types of connections.
 * sysctl -w net.core.wmem_max=8388608
   This sets the max OS send buffer size for all types of connections.
 * sysctl -w net.core.rmem_default=65536
   This sets the default OS receive buffer size for all types of connections.
 * sysctl -w net.core.wmem_default=65536
   This sets the default OS send buffer size for all types of connections.

.-=> Enable Linux' BPF Just-in-Time compiler
`--------------------------------------------------------------------------
If you're using filtering with netsniff-ng (or tcpdump, Wireshark, ...), you
should activate the Berkeley Packet Filter Just-in-Time compiler. The Linux
kernel has a built-in "virtual machine" that interprets BPF opcodes for
filtering packets. Hence, those small filter applications are applied to
each packet. (Read more about this in the Bpfc document.) The Just-in-Time
compiler is able to 'compile' such a filter application into assembly code
that runs directly on the CPU instead of in the virtual machine. If
netsniff-ng or trafgen detects that the BPF JIT is present on the system, it
automatically enables it. (Kernel options: CONFIG_HAVE_BPF_JIT=y and
CONFIG_BPF_JIT=y)
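
Even when the JIT is compiled in, it usually has to be switched on at runtime
(netsniff-ng and trafgen do this for you; for other tools you can flip the
switch by hand). A sketch, assuming your kernel exposes the usual sysctl knob:
 * sysctl -w net.core.bpf_jit_enable=1
   Turns the BPF JIT on; 0 disables it again, and 2 additionally dumps the
   generated opcodes to the kernel log for debugging.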

.-=> Increase the TX queue length
`--------------------------------------------------------------------------
There are settings available to regulate the size of the queue between the
kernel network subsystem and the driver for the network interface card. Just
as with any queue, it is recommended to size it such that losses do not
occur due to local buffer overflows. Therefore, careful tuning is required
to ensure that the sizes of the queues are optimal for your network
connection.

There are two queues to consider: txqueuelen, which is related to the
transmit queue size, and netdev_max_backlog, which determines the receive
queue size. Users can manually set the transmit queue size using the
ifconfig command on the required device:

ifconfig eth0 txqueuelen 2000

The default of 100 is inadequate for long-distance or high-throughput pipes.
For example, on a network with an RTT of 120ms and at Gigabit rates, a
txqueuelen of at least 10000 is recommended.

.-=> Increase kernel receiver backlog queue
`--------------------------------------------------------------------------
For the receiver side, we have a similar queue for incoming packets. This
queue will build up in size when an interface receives packets faster than
the kernel can process them. If this queue is too small (default is 300),
we will begin to lose packets at the receiver, rather than on the network.
One can set this value by:

sysctl -w net.core.netdev_max_backlog=2000

.-=> Use a RAM-based filesystem if possible
`--------------------------------------------------------------------------
If you have a considerable amount of RAM, you can also think of using a
RAM-based file system such as ramfs for dumping pcap files with netsniff-ng.
This can be useful for small to medium-sized pcap files or for pcap probes
that are generated with netsniff-ng.

<=== Software (netsniff-ng / trafgen specific) ====>

.-=> Bind netsniff-ng / trafgen to a CPU
`--------------------------------------------------------------------------
Both tools have a command-line option '--bind-cpu' that can be used like
'--bind-cpu 0' in order to pin the process to a specific CPU. This was
already mentioned earlier in this file. However, netsniff-ng and trafgen are
able to do this without an external tool. Besides this CPU pinning, they also
automatically migrate the NIC's IRQ affinity to this CPU. Hence, with
'--bind-cpu 0', netsniff-ng will not be migrated to a different CPU and the
NIC's IRQ affinity will also be moved to CPU 0 to increase cache locality.

.-=> Use netsniff-ng in silent mode
`--------------------------------------------------------------------------
Don't print information to the console when you want to achieve high speed,
because this slows down the application considerably. Hence, use
netsniff-ng's '--silent' option when recording or replaying PCAP files!

.-=> Use netsniff-ng's scatter/gather or mmap for PCAP files
`--------------------------------------------------------------------------
The scatter/gather I/O mode, which is the default in netsniff-ng, can be used
to record large PCAP files, but it is slower than memory-mapped I/O. On the
other hand, your RAM size is not the limit for recording. Use netsniff-ng's
memory-mapped I/O option to achieve a higher recording speed, but with the
trade-off that the maximum allowed PCAP size is limited.
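
Putting the last few points together, a capture run pinned to CPU 0 and
dumping into a RAM-based file system could look like the sketch below. The
mount point is arbitrary, and the '--mmap' flag name for memory-mapped pcap
I/O is an assumption taken from recent netsniff-ng versions; check
"netsniff-ng --help" for the exact option names of your build:

mkdir -p /mnt/ram && mount -t ramfs none /mnt/ram
netsniff-ng --in eth0 --out /mnt/ram/dump.pcap --silent --bind-cpu 0 --mmap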

.-=> Use static packet configurations in trafgen
`--------------------------------------------------------------------------
Don't use counters or byte randomization in the trafgen configuration file,
since they slow down the packet generation process. Static packet bytes are
the fastest option to go with.

.-=> Generate packets with different txhashes in trafgen
`--------------------------------------------------------------------------
For 10Gbit/s multiqueue NICs, it might be good to generate packets that
result in different txhashes, so that multiple queues are used in the
transmission path (and therefore most likely also multiple CPUs).

Sources:
~~~~~~~~

* http://www.linuxfoundation.org/collaborate/workgroups/networking/napi
* http://datatag.web.cern.ch/datatag/howto/tcp.html
* http://thread.gmane.org/gmane.linux.network/191115
* http://bit.ly/3XbBrM
* http://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php
* http://bit.ly/pUFJxU