Hitchhiker's guide to high-performance with netsniff-ng:
////////////////////////////////////////////////////////

This is a collection of short notes in random order concerning software
and hardware for optimizing throughput (partly copied or derived from sources
that are mentioned at the end of this file):

<=== Hardware ====>

.-=> Use a PCI-X or PCIe server NIC
`--------------------------------------------------------------------------
Just because it says Gigabit Ethernet on the box of your NIC does not
necessarily mean that you will also reach it. Especially at small packet
sizes, you won't reach wire-rate with a PCI adapter built for desktop or
consumer machines. Rather, you should buy a server adapter that has faster
interconnects such as PCIe. Also, when making your choice of a server
adapter, check whether it has good support in the kernel: look for your
targeted chipset in the Linux drivers directory and check on the netdev
list whether the driver is updated frequently. Also, check the location/slot
of the NIC adapter on the system motherboard: in our experience, placing the
NIC adapter in different PCIe slots resulted in significantly different
measurement values. Since we did not have schematics for the system
motherboard, this was a trial-and-error effort. Moreover, check the
specifications of the NIC hardware: is the system bus connector I/O capable
of Gigabit Ethernet frame-rate throughput? Also check the network topology:
is your Gigabit network switch capable of switching Ethernet frames at the
maximum rate, or is a direct connection of two end-nodes the better
solution? Is Ethernet flow control being used? "ethtool -a eth0" can be
used to determine this. For measurement purposes, you might want to turn it
off to increase throughput:
 * ethtool -A eth0 autoneg off
 * ethtool -A eth0 rx off
 * ethtool -A eth0 tx off
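
For a quick sanity check of the points above, the following commands can be
helpful (eth0 and the PCI address 01:00.0 are only placeholders; take the
real bus address from the ethtool -i output):
 * ethtool -i eth0
   Shows which driver is bound to the NIC, its version and the PCI bus
   address of the adapter.
 * ethtool -a eth0
   Shows the current flow control (pause) settings.
 * lspci -vv -s 01:00.0 | grep -i width
   Shows the PCIe link capability and the currently negotiated link speed
   and width of the adapter, i.e. whether it really sits in a fast slot.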

.-=> Use better (faster) hardware
`--------------------------------------------------------------------------
Before doing software-based fine-tuning, check if you can afford better and
especially faster hardware. For instance, get a fast CPU with lots of cores
or a NUMA architecture with multi-core CPUs and a fast interconnect. If you
dump PCAP files to disk with netsniff-ng, then a fast SSD is appropriate.
If you plan to memory map PCAP files with netsniff-ng, then choose an
appropriate amount of RAM, and so on.

<=== Software (Linux kernel specific) ====>

.-=> Use NAPI drivers
`--------------------------------------------------------------------------
The "New API" (NAPI) is a rework of the packet processing code in the
kernel to improve performance for high speed networking. NAPI provides
two major features:

Interrupt mitigation: High-speed networking can create thousands of
interrupts per second, all of which tell the system something it already
knew: it has lots of packets to process. NAPI allows drivers to run with
(some) interrupts disabled during times of high traffic, with a
corresponding decrease in system load.

Packet throttling: When the system is overwhelmed and must drop packets,
it's better if those packets are disposed of before much effort goes into
processing them. NAPI-compliant drivers can often cause packets to be
dropped in the network adaptor itself, before the kernel sees them at all.

Many recent NIC drivers automatically support NAPI, so you don't need to do
anything. Some drivers need you to explicitly specify NAPI in the kernel
config or on the command line when compiling the driver. If you are unsure,
check your driver documentation.

.-=> Use a tickless kernel
`--------------------------------------------------------------------------
The tickless kernel feature allows for on-demand timer interrupts. This
means that during idle periods, fewer timer interrupts will fire, which
should lead to power savings, cooler running systems, and fewer useless
context switches. (Kernel option: CONFIG_NO_HZ=y)

.-=> Reduce timer interrupts
`--------------------------------------------------------------------------
You can select the rate at which timer interrupts in the kernel will fire.
When a timer interrupt fires on a CPU, the process running on that CPU is
interrupted while the timer interrupt is handled. Reducing the rate at
which the timer fires allows for fewer interruptions of your running
processes. This option is particularly useful for servers with multiple
CPUs where processes are not running interactively. (Kernel options:
CONFIG_HZ_100=y and CONFIG_HZ=100)

.-=> Use Intel's I/OAT DMA Engine
`--------------------------------------------------------------------------
This kernel option enables the Intel I/OAT DMA engine that is present in
recent Xeon CPUs. This option increases network throughput as the DMA
engine allows the kernel to offload network data copying from the CPU to
the DMA engine. This frees up the CPU to do more useful work.

Check to see if it's enabled:

[foo@bar]% dmesg | grep ioat
ioatdma 0000:00:08.0: setting latency timer to 64
ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, [...]
ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X

There's also a sysfs interface where you can get some statistics about the
DMA engine. Check the directories under /sys/class/dma/. (Kernel options:
CONFIG_DMADEVICES=y and CONFIG_INTEL_IOATDMA=y and CONFIG_DMA_ENGINE=y and
CONFIG_NET_DMA=y and CONFIG_ASYNC_TX_DMA=y)

.-=> Use Direct Cache Access (DCA)
`--------------------------------------------------------------------------
Intel's I/OAT also includes a feature called Direct Cache Access (DCA).
DCA allows a driver to warm a CPU cache. A few NICs support DCA; the most
popular (to my knowledge) is the Intel 10GbE driver (ixgbe). Refer to your
NIC driver documentation to see if your NIC supports DCA. To enable DCA,
a switch in the BIOS must be flipped. Some vendors supply machines that
support DCA, but don't expose a switch for it.

You can check if DCA is enabled:

[foo@bar]% dmesg | grep dca
dca service started, version 1.8

If DCA is possible on your system but disabled, you'll see:

ioatdma 0000:00:08.0: DCA is disabled in BIOS

This means you'll need to enable it in the BIOS or manually. (Kernel
option: CONFIG_DCA=y)

.-=> Throttle NIC Interrupts
`--------------------------------------------------------------------------
Some drivers allow the user to specify the rate at which the NIC will
generate interrupts. The e1000e driver allows you to pass a command line
option InterruptThrottleRate when loading the module. For the e1000e there
are two dynamic interrupt throttle mechanisms, specified on the command
line as 1 (dynamic) and 3 (dynamic conservative). The adaptive algorithm
classifies traffic into different classes and adjusts the interrupt rate
appropriately. The difference between dynamic and dynamic conservative
is the rate for the 'Lowest Latency' traffic class; dynamic (1) has a much
more aggressive interrupt rate for this traffic class.

As always, check your driver documentation for more information.

Example: modprobe e1000e InterruptThrottleRate=1
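
To see whether interrupt throttling actually takes effect, you can watch
how fast the NIC's interrupt counters grow under load (the interrupt names
depend on your driver and interface; eth0 is just an example):

[foo@bar]% watch -n1 "grep eth0 /proc/interrupts"

If the counters keep growing at a roughly constant rate even when the
traffic increases, interrupt mitigation is doing its job.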

.-=> Use Process and IRQ affinity
`--------------------------------------------------------------------------
Linux allows the user to specify which CPUs processes and interrupt
handlers are bound to.

Processes: You can use taskset to specify which CPUs a process can run on.
Interrupt handlers: The interrupt map can be found in /proc/interrupts, and
the affinity for each interrupt can be set in the file smp_affinity in the
directory for each interrupt under /proc/irq/.

This is useful because you can pin the interrupt handlers for your NICs
to specific CPUs so that when a shared resource is touched (a lock in the
network stack) and loaded into a CPU cache, the next time the handler runs,
it will be put on the same CPU, avoiding costly cache invalidations that
can occur if the handler is put on a different CPU.

Moreover, improvements of up to 24% have been reported when processes and
the IRQs for the NICs the processes get data from are pinned to the same
CPUs. Doing this ensures that the data loaded into the CPU cache by the
interrupt handler can be used (without invalidation) by the process;
extremely high cache locality is achieved.

NOTE: If netsniff-ng or trafgen is bound to a specific CPU, it automatically
migrates the NIC's IRQ affinity to this CPU to achieve high cache locality.
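
A minimal manual sketch of such pinning (the IRQ number 24 and CPU 0 are
only examples; look up the real IRQ of your NIC in /proc/interrupts first):
 * grep eth0 /proc/interrupts
   Find the IRQ number(s) used by the NIC, say 24.
 * echo 1 > /proc/irq/24/smp_affinity
   The CPU bitmask 1 binds that IRQ to CPU 0 (requires root).
 * taskset -c 0 <command>
   Starts <command> pinned to CPU 0, so the IRQ handler and the process end
   up sharing the same caches. As noted above, netsniff-ng and trafgen can
   do both steps on their own via --bind-cpu.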

.-=> Tune the socket's memory allocation area
`--------------------------------------------------------------------------
By default, each socket has a backing memory of between 130KB and 160KB on
an x86/x86_64 machine with 4GB RAM. Hence, network packets can be received
on the NIC driver layer, but later dropped at the socket queue due to memory
restrictions. "sysctl -a | grep mem" will display your current memory
settings. To increase maximum and default values of read and write memory
areas, use:
 * sysctl -w net.core.rmem_max=8388608
   This sets the max OS receive buffer size for all types of connections.
 * sysctl -w net.core.wmem_max=8388608
   This sets the max OS send buffer size for all types of connections.
 * sysctl -w net.core.rmem_default=65536
   This sets the default OS receive buffer size for all types of connections.
 * sysctl -w net.core.wmem_default=65536
   This sets the default OS send buffer size for all types of connections.

.-=> Enable Linux' BPF Just-in-Time compiler
`--------------------------------------------------------------------------
If you're using filtering with netsniff-ng (or tcpdump, Wireshark, ...), you
should activate the Berkeley Packet Filter Just-in-Time compiler. The Linux
kernel has a built-in "virtual machine" that interprets BPF opcodes for
filtering packets. Hence, those small filter applications are applied to
each packet. (Read more about this in the Bpfc document.) The Just-in-Time
compiler is able to 'compile' such a filter application to assembler code
that can be run directly on the CPU instead of on the virtual machine. If
netsniff-ng or trafgen detects that the BPF JIT is present on the system, it
automatically enables it. (Kernel options: CONFIG_HAVE_BPF_JIT=y and
CONFIG_BPF_JIT=y)

.-=> Increase the TX queue length
`--------------------------------------------------------------------------
There are settings available to regulate the size of the queue between the
kernel network subsystem and the driver for the network interface card. Just
as with any queue, it is recommended to size it such that losses do not
occur due to local buffer overflows. Therefore careful tuning is required
to ensure that the sizes of the queues are optimal for your network
connection.

There are two queues to consider: the txqueuelen, which is related to the
transmit queue size, and the netdev_max_backlog, which determines the
receive queue size. Users can manually set the transmit queue size using
the ifconfig command on the required device:

ifconfig eth0 txqueuelen 2000

The default of 100 is inadequate for long-distance or high-throughput pipes.
For example, on a network with an RTT of 120ms and at Gigabit rates, a
txqueuelen of at least 10000 is recommended.

.-=> Increase the kernel receiver backlog queue
`--------------------------------------------------------------------------
For the receiver side, we have a similar queue for incoming packets. This
queue will build up in size when an interface receives packets faster than
the kernel can process them. If this queue is too small (the default is 300),
we will begin to lose packets at the receiver, rather than on the network.
One can set this value by:

sysctl -w net.core.netdev_max_backlog=2000

.-=> Use a RAM-based filesystem if possible
`--------------------------------------------------------------------------
If you have a considerable amount of RAM, you can also think of using a
RAM-based file system such as ramfs for dumping pcap files with netsniff-ng.
This can be useful for small to medium-sized pcap files or for pcap probes
that are generated with netsniff-ng.

<=== Software (netsniff-ng / trafgen specific) ====>

.-=> Bind netsniff-ng / trafgen to a CPU
`--------------------------------------------------------------------------
Both tools have a command-line option '--bind-cpu' that can be used like
'--bind-cpu 0' in order to pin the process to a specific CPU. This was
already mentioned earlier in this file. However, netsniff-ng and trafgen are
able to do this without an external tool. In addition to this CPU pinning,
they also automatically migrate the NIC's IRQ affinity to this CPU. Hence,
with '--bind-cpu 0', netsniff-ng will not be migrated to a different CPU,
and the NIC's IRQ affinity will also be moved to CPU 0 to increase cache
locality.

.-=> Use netsniff-ng in silent mode
`--------------------------------------------------------------------------
Don't print packet information to the console when you want to achieve high
speed, because this slows down the application considerably. Hence, use
netsniff-ng's '--silent' option when recording or replaying PCAP files!

.-=> Use netsniff-ng's scatter/gather or mmap for PCAP files
`--------------------------------------------------------------------------
The scatter/gather I/O mode, which is the default in netsniff-ng, can be used
to record large PCAP files, but is slower than memory-mapped I/O. However,
you don't have the RAM size as your limit for recording. Use netsniff-ng's
memory-mapped I/O option for achieving a higher speed for recording a PCAP,
but with the trade-off that the maximum allowed size is limited.
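
Putting the netsniff-ng specific options together, a capture command could
look roughly like the following sketch (check netsniff-ng's help for the
exact option names of your version; --in/--out for the input device and the
output file and --mmap for memory-mapped pcap I/O are assumed here):

netsniff-ng --in eth0 --out /tmp/dump.pcap --silent --bind-cpu 0 --mmap

This combines CPU/IRQ binding, silent mode and memory-mapped I/O as
described in the previous three sections.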

.-=> Use static packet configurations in trafgen
`--------------------------------------------------------------------------
Don't use counters or byte randomization in the trafgen configuration file,
since they slow down the packet generation process. Static packet bytes are
the fastest option to go with.

.-=> Generate packets with different txhashes in trafgen
`--------------------------------------------------------------------------
For 10Gbit/s multiqueue NICs, it might be good to generate packets that result
in different txhashes, so that multiple queues are used in the transmission
path (and therefore, most likely, multiple CPUs as well).

Sources:
~~~~~~~~

* http://www.linuxfoundation.org/collaborate/workgroups/networking/napi
* http://datatag.web.cern.ch/datatag/howto/tcp.html
* http://thread.gmane.org/gmane.linux.network/191115
* http://bit.ly/3XbBrM
* http://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php
* http://bit.ly/pUFJxU