Diffstat (limited to 'Documentation/Performance')
 Documentation/Performance | 278 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 278 insertions(+), 0 deletions(-)
diff --git a/Documentation/Performance b/Documentation/Performance
new file mode 100644
index 0000000..e51411a
--- /dev/null
+++ b/Documentation/Performance
@@ -0,0 +1,278 @@
+Hitchhiker's guide to high-performance with netsniff-ng:
+////////////////////////////////////////////////////////
+
+This is a collection of short notes in random order concerning software
+and hardware for optimizing throughput (partly copied or derived from sources
+that are mentioned at the end of this file):
+
+<=== Hardware ====>
+
+.-=> Use a PCI-X or PCIe server NIC
+`--------------------------------------------------------------------------
+Just because it says Gigabit Ethernet on the box of your NIC does not
+necessarily mean that you will also reach it. Especially with small packet
+sizes, you won't reach wire-rate with a PCI adapter built for desktop or
+consumer machines. Rather, you should buy a server adapter that has a faster
+interconnect such as PCIe. Also, when choosing a server adapter, check
+whether it is well supported in the kernel: look at the Linux drivers
+directory for your targeted chipset and check on the netdev list whether the
+driver is updated frequently. Also, check the location/slot of the NIC
+adapter on the system motherboard: in our experience, placing the NIC in
+different PCIe slots resulted in significantly different measurement values.
+Since we did not have schematics for the system motherboard, this was a
+trial and error effort (see the lspci example at the end of this section).
+Moreover, check the specifications of the NIC hardware: is the system bus
+connector I/O capable of Gigabit Ethernet frame rate throughput? Also check
+the network topology: is your Gigabit switch capable of switching Ethernet
+frames at the maximum rate, or is a direct connection of two end-nodes the
+better solution? Is Ethernet flow control being used? "ethtool -a eth0" can
+be used to determine this. For measurement purposes, you might want to turn
+it off to increase throughput:
+ * ethtool -A eth0 autoneg off
+ * ethtool -A eth0 rx off
+ * ethtool -A eth0 tx off
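+
+To verify which PCIe link width and speed the NIC actually negotiated in a
+given slot, lspci can be queried; the bus address 01:00.0 is only an example
+and has to be replaced with the address of your adapter:
+ * lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'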
+
+.-=> Use better (faster) hardware
+`--------------------------------------------------------------------------
+Before doing software-based fine-tuning, check if you can afford better and
+especially faster hardware. For instance, get a fast CPU with lots of cores
+or a NUMA architecture with multi-core CPUs and a fast interconnect. If you
+dump PCAP files to disc with netsniff-ng, then a fast SSD is appropriate.
+If you plan to memory map PCAP files with netsniff-ng, then choose an
+appropriate amount of RAM and so on and so forth.
+
+<=== Software (Linux kernel specific) ====>
+
+.-=> Use NAPI drivers
+`--------------------------------------------------------------------------
+The "New API" (NAPI) is a rework of the packet processing code in the
+kernel to improve performance for high speed networking. NAPI provides
+two major features:
+
+Interrupt mitigation: High-speed networking can create thousands of
+interrupts per second, all of which tell the system something it already
+knew: it has lots of packets to process. NAPI allows drivers to run with
+(some) interrupts disabled during times of high traffic, with a
+corresponding decrease in system load.
+
+Packet throttling: When the system is overwhelmed and must drop packets,
+it's better if those packets are disposed of before much effort goes into
+processing them. NAPI-compliant drivers can often cause packets to be
+dropped in the network adaptor itself, before the kernel sees them at all.
+
+Many recent NIC drivers automatically support NAPI, so you don't need to do
+anything. Some drivers need you to explicitly specify NAPI in the kernel
+config or on the command line when compiling the driver. If you are unsure,
+check your driver documentation.
+
+.-=> Use a tickless kernel
+`--------------------------------------------------------------------------
+The tickless kernel feature allows for on-demand timer interrupts. This
+means that during idle periods, fewer timer interrupts will fire, which
+should lead to power savings, cooler running systems, and fewer useless
+context switches. (Kernel option: CONFIG_NO_HZ=y)
+
+.-=> Reduce timer interrupts
+`--------------------------------------------------------------------------
+You can select the rate at which timer interrupts in the kernel will fire.
+When a timer interrupt fires on a CPU, the process running on that CPU is
+interrupted while the timer interrupt is handled. Reducing the rate at
+which the timer fires allows for fewer interruptions of your running
+processes. This option is particularly useful for servers with multiple
+CPUs where processes are not running interactively. (Kernel options:
+CONFIG_HZ_100=y and CONFIG_HZ=100)
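+
+Whether the running kernel was built with these options can usually be
+verified via the distribution's config copy (the path below assumes it is
+installed under /boot) or via /proc/config.gz if CONFIG_IKCONFIG_PROC is
+enabled:
+ * grep -E 'CONFIG_NO_HZ|CONFIG_HZ' /boot/config-$(uname -r)
+ * zcat /proc/config.gz | grep -E 'CONFIG_NO_HZ|CONFIG_HZ'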
+
+.-=> Use Intel's I/OAT DMA Engine
+`--------------------------------------------------------------------------
+This kernel option enables the Intel I/OAT DMA engine that is present in
+recent Xeon CPUs. This option increases network throughput as the DMA
+engine allows the kernel to offload network data copying from the CPU to
+the DMA engine. This frees up the CPU to do more useful work.
+
+Check to see if it's enabled:
+
+[foo@bar]% dmesg | grep ioat
+ioatdma 0000:00:08.0: setting latency timer to 64
+ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, [...]
+ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X
+
+There's also a sysfs interface where you can get some statistics about the
+DMA engine. Check the directories under /sys/class/dma/. (Kernel options:
+CONFIG_DMADEVICES=y and CONFIG_INTEL_IOATDMA=y and CONFIG_DMA_ENGINE=y and
+CONFIG_NET_DMA=y and CONFIG_ASYNC_TX_DMA=y)
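+
+The sysfs statistics can be inspected like this; dma0chan0 is only an
+example, the actual channel names depend on your system:
+ * ls /sys/class/dma/
+ * cat /sys/class/dma/dma0chan0/memcpy_count
+ * cat /sys/class/dma/dma0chan0/bytes_transferred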
+
+.-=> Use Direct Cache Access (DCA)
+`--------------------------------------------------------------------------
+Intel's I/OAT also includes a feature called Direct Cache Access (DCA).
+DCA allows a driver to warm a CPU cache. A few NICs support DCA, the most
+popular (to my knowledge) is the Intel 10GbE driver (ixgbe). Refer to your
+NIC driver documentation to see if your NIC supports DCA. To enable DCA,
+a switch in the BIOS must be flipped. Some vendors supply machines that
+support DCA, but don't expose a switch for DCA.
+
+You can check if DCA is enabled:
+
+[foo@bar]% dmesg | grep dca
+dca service started, version 1.8
+
+If DCA is possible on your system but disabled you'll see:
+
+ioatdma 0000:00:08.0: DCA is disabled in BIOS
+
+This means you'll need to enable it in the BIOS or manually. (Kernel
+option: CONFIG_DCA=y)
+
+.-=> Throttle NIC Interrupts
+`--------------------------------------------------------------------------
+Some drivers allow the user to specify the rate at which the NIC will
+generate interrupts. The e1000e driver allows you to pass the module
+parameter InterruptThrottleRate when loading the module. For the e1000e
+there are two dynamic interrupt throttle mechanisms, specified on the
+command line as 1 (dynamic) and 3 (dynamic conservative). The adaptive
+algorithm classifies traffic into different classes and adjusts the
+interrupt rate appropriately. The difference between dynamic and dynamic
+conservative is the rate for the 'Lowest Latency' traffic class: dynamic (1)
+has a much more aggressive interrupt rate for this traffic class.
+
+As always, check your driver documentation for more information.
+
+With modprobe: modprobe e1000e InterruptThrottleRate=1
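+
+For drivers without such a module parameter, interrupt coalescing can often
+be tuned generically via ethtool, provided the NIC/driver supports it; the
+rx-usecs value below is only an example:
+ * ethtool -c eth0
+ * ethtool -C eth0 rx-usecs 100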
+
+.-=> Use Process and IRQ affinity
+`--------------------------------------------------------------------------
+Linux allows the user to specify which CPUs processes and interrupt
+handlers are bound to.
+
+Processes: You can use taskset to specify which CPUs a process can run on.
+Interrupt Handlers: The interrupt map can be found in /proc/interrupts, and
+the affinity for each interrupt can be set in the file smp_affinity in the
+directory for each interrupt under /proc/irq/.
+
+This is useful because you can pin the interrupt handlers for your NICs
+to specific CPUs so that when a shared resource is touched (a lock in the
+network stack) and loaded to a CPU cache, the next time the handler runs,
+it will be put on the same CPU avoiding costly cache invalidations that
+can occur if the handler is put on a different CPU.
+
+Moreover, improvements of up to 24% have been reported when processes and
+the IRQs of the NICs those processes receive data from are pinned to the
+same CPUs. Doing this ensures that the data loaded into the CPU cache by the
+interrupt handler can be used (without invalidation) by the process;
+extremely high cache locality is achieved.
+
+NOTE: If netsniff-ng or trafgen is bound to a specific CPU, it automatically
+migrates the NIC's IRQ affinity to this CPU to achieve high cache locality.
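+
+For other processes, the manual way looks roughly like this; the IRQ number
+24 is only an example taken from a hypothetical /proc/interrupts, and the
+value written to smp_affinity is a CPU bitmask (0x1 means CPU 0):
+ * grep eth0 /proc/interrupts
+ * echo 1 > /proc/irq/24/smp_affinity
+ * taskset -c 0 yourapplication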
+
+.-=> Tune Socket's memory allocation area
+`--------------------------------------------------------------------------
+By default, each socket is backed by between roughly 130KB and 160KB of
+buffer memory on an x86/x86_64 machine with 4GB RAM. Hence, network packets
+can be received on the NIC driver layer, but later dropped at the socket
+queue due to memory restrictions. "sysctl -a | grep mem" will display your
+current memory settings. To increase the maximum and default values of the
+read and write memory areas, use:
+ * sysctl -w net.core.rmem_max=8388608
+ This sets the max OS receive buffer size for all types of connections.
+ * sysctl -w net.core.wmem_max=8388608
+ This sets the max OS send buffer size for all types of connections.
+ * sysctl -w net.core.rmem_default=65536
+ This sets the default OS receive buffer size for all types of connections.
+ * sysctl -w net.core.wmem_default=65536
+ This sets the default OS send buffer size for all types of connections.
+
+.-=> Enable Linux' BPF Just-in-Time compiler
+`--------------------------------------------------------------------------
+If you're using filtering with netsniff-ng (or tcpdump, Wireshark, ...), you
+should activate the Berkeley Packet Filter Just-in-Time compiler. The Linux
+kernel has a built-in "virtual machine" that interprets BPF opcodes for
+filtering packets. Hence, those small filter applications are applied to
+each packet. (Read more about this in the Bpfc document.) The Just-in-Time
+compiler is able to 'compile' such a filter application to assembler code
+that can directly be run on the CPU instead of on the virtual machine. If
+netsniff-ng or trafgen detects that the BPF JIT is present on the system, it
+automatically enables it. (Kernel option: CONFIG_HAVE_BPF_JIT=y and
+CONFIG_BPF_JIT=y)
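+
+The JIT can also be enabled manually at runtime via sysctl, for example when
+you only filter with tcpdump or Wireshark:
+ * sysctl -w net.core.bpf_jit_enable=1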
+
+.-=> Increase the TX queue length
+`--------------------------------------------------------------------------
+There are settings available to regulate the size of the queue between the
+kernel network subsystem and the driver for the network interface card. Just
+as with any queue, it is recommended to size it such that losses do not
+occur due to local buffer overflows. Therefore, careful tuning is required
+to ensure that the sizes of the queues are optimal for your network
+connection.
+
+There are two queues to consider: the txqueuelen, which is related to the
+transmit queue size, and the netdev_max_backlog, which determines the
+receive queue size. The transmit queue size can be set manually using the
+ifconfig command on the required device:
+
+ifconfig eth0 txqueuelen 2000
+
+The default (100 on older kernels, typically 1000 nowadays) is inadequate
+for long distance or high throughput pipes. For example, on a network with
+an RTT of 120ms at Gigabit rates, a txqueuelen of at least 10000 is
+recommended.
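+
+On systems where ifconfig is no longer installed, the same can be done with
+iproute2:
+ * ip link set dev eth0 txqueuelen 2000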
+
+.-=> Increase kernel receiver backlog queue
+`--------------------------------------------------------------------------
+For the receiver side, we have a similar queue for incoming packets. This
+queue will build up in size when an interface receives packets faster than
+the kernel can process them. If this queue is too small (the default is 300
+on older kernels, 1000 on current ones), we will begin to lose packets at
+the receiver, rather than on the network. One can set this value by:
+
+sysctl -w net.core.netdev_max_backlog=2000
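+
+Whether this backlog queue actually overflows can be observed in
+/proc/net/softnet_stat, where each row represents a CPU and the second
+column counts packets dropped because the backlog was full:
+ * cat /proc/net/softnet_stat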
+
+.-=> Use a RAM-based filesystem if possible
+`--------------------------------------------------------------------------
+If you have a considerable amount of RAM, you can also think of using a
+RAM-based file system such as ramfs for dumping pcap files with netsniff-ng.
+This can be useful for small to medium-sized pcap files or for pcap probes
+that are generated with netsniff-ng.
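+
+A minimal sketch, assuming the mount point /mnt/pcap exists; tmpfs is used
+here instead of ramfs because its size can be capped, which avoids
+accidentally consuming all of your RAM:
+ * mount -t tmpfs -o size=2g tmpfs /mnt/pcap
+Then simply let netsniff-ng write its pcap file to /mnt/pcap.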
+
+<=== Software (netsniff-ng / trafgen specific) ====>
+
+.-=> Bind netsniff-ng / trafgen to a CPU
+`--------------------------------------------------------------------------
+Both tools have a command-line option '--bind-cpu' that can be used like
+'--bind-cpu 0' in order to pin the process to a specific CPU. This was
+already mentioned earlier in this file. However, netsniff-ng and trafgen are
+able to do this without an external tool. Besides the CPU pinning, they also
+automatically migrate the NIC's IRQ affinity to that CPU. Hence, with
+'--bind-cpu 0', netsniff-ng will not be migrated to a different CPU and the
+NIC's IRQ affinity will also be moved to CPU 0 to increase cache locality.
+
+.-=> Use netsniff-ng in silent mode
+`--------------------------------------------------------------------------
+Don't print information to the console when you want to achieve high speed,
+because this slows down the application considerably. Hence, use
+netsniff-ng's '--silent' option when recording or replaying PCAP files!
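+
+Putting the last two sections together, a typical high-speed capture
+invocation might look like this (eth0 and CPU 0 are only examples):
+ * netsniff-ng --in eth0 --out dump.pcap --bind-cpu 0 --silent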
+
+.-=> Use netsniff-ng's scatter/gather or mmap for PCAP files
+`--------------------------------------------------------------------------
+The scatter/gather I/O mode, which is the default in netsniff-ng, can be used
+to record large PCAP files, but is slower than memory-mapped I/O. However,
+with scatter/gather you don't have the RAM size as your limit for recording.
+Use netsniff-ng's memory-mapped I/O option to achieve a higher recording
+speed, but with the trade-off that the maximum allowed file size is limited.
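+
+If your netsniff-ng version offers the '--mmap' and '--sg' switches for
+selecting the pcap I/O method (check 'netsniff-ng --help'), a memory-mapped
+capture could look like this:
+ * netsniff-ng --in eth0 --out dump.pcap --mmap --silent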
+
+.-=> Use static packet configurations in trafgen
+`--------------------------------------------------------------------------
+Don't use counters or byte randomization in the trafgen configuration file,
+since they slow down the packet generation process. Static packet bytes are
+the fastest to go with.
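+
+A rough sketch of what a fully static packet definition could look like in a
+trafgen configuration file; the MAC addresses are made up, and trafgen's
+shipped example configs should be consulted for the exact syntax. Functions
+such as dinc() or drnd() are exactly what this section advises to avoid:
+
+{ /* static Ethernet frame: no counters, no randomization */
+  0xff, 0xff, 0xff, 0xff, 0xff, 0xff,  /* destination MAC */
+  0x00, 0x02, 0xb3, 0x11, 0x22, 0x33,  /* source MAC */
+  0x08, 0x00,                          /* EtherType: IPv4 */
+  /* static IPv4/UDP header and payload bytes would follow here */
+}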
+
+.-=> Generate packets with different txhashes in trafgen
+`--------------------------------------------------------------------------
+For 10Gbit/s multiqueue NICs, it might be good to generate packets that result
+in different txhashes, so that multiple queues are used in the transmission
+path (and therefore most likely also multiple CPUs).
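+
+How many TX queues the NIC exposes, and which IRQs they map to, can be
+checked with:
+ * ethtool -l eth0
+ * grep eth0 /proc/interrupts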
+
+Sources:
+~~~~~~~~
+
+* http://www.linuxfoundation.org/collaborate/workgroups/networking/napi
+* http://datatag.web.cern.ch/datatag/howto/tcp.html
+* http://thread.gmane.org/gmane.linux.network/191115
+* http://bit.ly/3XbBrM
+* http://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php
+* http://bit.ly/pUFJxU