Diffstat (limited to 'Documentation/Performance')
-rw-r--r--  Documentation/Performance  278
1 file changed, 0 insertions(+), 278 deletions(-)
diff --git a/Documentation/Performance b/Documentation/Performance
deleted file mode 100644
index e51411a..0000000
--- a/Documentation/Performance
+++ /dev/null
@@ -1,278 +0,0 @@
-Hitchhiker's guide to high-performance with netsniff-ng:
-////////////////////////////////////////////////////////
-
-This is a collection of short notes in random order concerning software
-and hardware for optimizing throughput (partly copied or derived from sources
-that are mentioned at the end of this file):
-
-<=== Hardware ====>
-
-.-=> Use a PCI-X or PCIe server NIC
-`--------------------------------------------------------------------------
-Just because it says Gigabit Ethernet on the box of your NIC does not
-necessarily mean that you will also reach it. Especially with small packet
-sizes, you won't reach wire-rate with a PCI adapter built for desktop or
-consumer machines. Rather, you should buy a server adapter that has a faster
-interconnect such as PCIe. Also, base your choice of a server adapter on
-whether it has good support in the kernel: check the Linux drivers
-directory for your targeted chipset and look at the netdev list to see
-whether the driver is updated frequently. Also, check the location/slot of
-the NIC adapter on the system motherboard: in our experience, placing the
-NIC adapter in different PCIe slots resulted in significantly different
-measurement values.
-Since we did not have schematics for the system motherboard, this was a
-trial and error effort. Moreover, check the specifications of the NIC
-hardware: is the system bus connector I/O capable of Gigabit Ethernet
-frame rate throughput? Also check the network topology: is your network
-Gigabit switch capable of switching Ethernet frames at the maximum rate
-or is a direct connection of two end-nodes the better solution? Is Ethernet
-flow control being used? "ethtool -a eth0" can be used to determine this.
-For measurement purposes, you might want to turn it off to increase throughput:
- * ethtool -A eth0 autoneg off
- * ethtool -A eth0 rx off
- * ethtool -A eth0 tx off
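-
-Regarding the bus and slot questions above, you can also inspect the PCIe
-link that the NIC actually negotiated. A small sketch (run as root; the bus
-address 01:00.0 is only an example, find yours with "lspci | grep -i ethernet"):
- * lspci -vv -s 01:00.0 | grep -i 'lnkcap\|lnksta'
-   This prints the maximum (LnkCap) and the currently negotiated (LnkSta)
-   PCIe link speed and width of the adapter.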
-
-.-=> Use better (faster) hardware
-`--------------------------------------------------------------------------
-Before doing software-based fine-tuning, check if you can afford better and
-especially faster hardware. For instance, get a fast CPU with lots of cores
-or a NUMA architecture with multi-core CPUs and a fast interconnect. If you
-dump PCAP files to disk with netsniff-ng, then a fast SSD is appropriate.
-If you plan to memory map PCAP files with netsniff-ng, then choose an
-appropriate amount of RAM, and so on.
-
-<=== Software (Linux kernel specific) ====>
-
-.-=> Use NAPI drivers
-`--------------------------------------------------------------------------
-The "New API" (NAPI) is a rework of the packet processing code in the
-kernel to improve performance for high speed networking. NAPI provides
-two major features:
-
-Interrupt mitigation: High-speed networking can create thousands of
-interrupts per second, all of which tell the system something it already
-knew: it has lots of packets to process. NAPI allows drivers to run with
-(some) interrupts disabled during times of high traffic, with a
-corresponding decrease in system load.
-
-Packet throttling: When the system is overwhelmed and must drop packets,
-it's better if those packets are disposed of before much effort goes into
-processing them. NAPI-compliant drivers can often cause packets to be
-dropped in the network adaptor itself, before the kernel sees them at all.
-
-Many recent NIC drivers automatically support NAPI, so you don't need to do
-anything. Some drivers need you to explicitly specify NAPI in the kernel
-config or on the command line when compiling the driver. If you are unsure,
-check your driver documentation.
-
-.-=> Use a tickless kernel
-`--------------------------------------------------------------------------
-The tickless kernel feature allows for on-demand timer interrupts. This
-means that during idle periods, fewer timer interrupts will fire, which
-should lead to power savings, cooler running systems, and fewer useless
-context switches. (Kernel option: CONFIG_NO_HZ=y)
-
-.-=> Reduce timer interrupts
-`--------------------------------------------------------------------------
-You can select the rate at which timer interrupts in the kernel will fire.
-When a timer interrupt fires on a CPU, the process running on that CPU is
-interrupted while the timer interrupt is handled. Reducing the rate at
-which the timer fires allows for fewer interruptions of your running
-processes. This option is particularly useful for servers with multiple
-CPUs where processes are not running interactively. (Kernel options:
-CONFIG_HZ_100=y and CONFIG_HZ=100)
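-
-A quick way to verify such kernel configuration options (e.g. CONFIG_NO_HZ
-and CONFIG_HZ from the last two sections) on a running system is to grep the
-kernel configuration; on most distributions it is installed as
-/boot/config-$(uname -r), on others it is exported as /proc/config.gz:
- * grep 'CONFIG_NO_HZ\|CONFIG_HZ' /boot/config-$(uname -r)
- * zgrep 'CONFIG_NO_HZ\|CONFIG_HZ' /proc/config.gz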
-
-.-=> Use Intel's I/OAT DMA Engine
-`--------------------------------------------------------------------------
-This kernel option enables the Intel I/OAT DMA engine that is present in
-recent Xeon CPUs. This option increases network throughput as the DMA
-engine allows the kernel to offload network data copying from the CPU to
-the DMA engine. This frees up the CPU to do more useful work.
-
-Check to see if it's enabled:
-
-[foo@bar]% dmesg | grep ioat
-ioatdma 0000:00:08.0: setting latency timer to 64
-ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, [...]
-ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X
-
-There's also a sysfs interface where you can get some statistics about the
-DMA engine. Check the directories under /sys/class/dma/. (Kernel options:
-CONFIG_DMADEVICES=y and CONFIG_INTEL_IOATDMA=y and CONFIG_DMA_ENGINE=y and
-CONFIG_NET_DMA=y and CONFIG_ASYNC_TX_DMA=y)
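-
-For example, per-channel statistics can be read from that sysfs interface as
-follows (the channel name dma0chan0 is only an example; list /sys/class/dma/
-to see what your system provides):
- * ls /sys/class/dma/
- * cat /sys/class/dma/dma0chan0/memcpy_count
- * cat /sys/class/dma/dma0chan0/bytes_transferred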
-
-.-=> Use Direct Cache Access (DCA)
-`--------------------------------------------------------------------------
-Intel's I/OAT also includes a feature called Direct Cache Access (DCA).
-DCA allows a driver to warm a CPU cache. A few NICs support DCA; the most
-popular (to my knowledge) is the Intel 10GbE driver (ixgbe). Refer to your
-NIC driver documentation to see if your NIC supports DCA. To enable DCA,
-a switch in the BIOS must be flipped. Some vendors supply machines that
-support DCA, but don't expose a switch for DCA.
-
-You can check if DCA is enabled:
-
-[foo@bar]% dmesg | grep dca
-dca service started, version 1.8
-
-If DCA is possible on your system but disabled you'll see:
-
-ioatdma 0000:00:08.0: DCA is disabled in BIOS
-
-This means you'll need to enable it in the BIOS or manually. (Kernel
-option: CONFIG_DCA=y)
-
-.-=> Throttle NIC Interrupts
-`--------------------------------------------------------------------------
-Some drivers allow the user to specify the rate at which the NIC will
-generate interrupts. The e1000e driver allows you to pass the module
-parameter InterruptThrottleRate when loading the module. For the e1000e
-there are two dynamic interrupt throttle mechanisms, specified on the
-command line as 1 (dynamic) and 3 (dynamic conservative). The adaptive
-algorithm classifies traffic into different classes and adjusts the interrupt
-rate appropriately. The difference between dynamic and dynamic conservative
-is the rate for the 'Lowest Latency' traffic class; dynamic (1) has a much
-more aggressive interrupt rate for this traffic class.
-
-As always, check your driver documentation for more information.
-
-With modprobe: modprobe e1000e InterruptThrottleRate=1
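-
-To make such a module parameter persistent across reboots, it can also be put
-into a modprobe configuration file (the file name e1000e.conf is arbitrary):
-
-echo "options e1000e InterruptThrottleRate=1" > /etc/modprobe.d/e1000e.conf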
-
-.-=> Use Process and IRQ affinity
-`--------------------------------------------------------------------------
-Linux allows the user to specify which CPUs processes and interrupt
-handlers are bound to.
-
-Processes: You can use taskset to specify which CPUs a process can run on.
-Interrupt Handlers: The interrupt map can be found in /proc/interrupts, and
-the affinity for each interrupt can be set in the file smp_affinity in the
-directory for each interrupt under /proc/irq/.
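-
-A short sketch of both (IRQ number 56 and CPU 0 are only examples; look up
-the real IRQ line of your NIC in /proc/interrupts first):
- * grep eth0 /proc/interrupts
-   Shows which IRQ line(s) the NIC uses.
- * echo 1 > /proc/irq/56/smp_affinity
-   Binds IRQ 56 to CPU 0; the value is a CPU bitmask, so 1 means CPU 0,
-   2 means CPU 1, 4 means CPU 2, and so on.
- * taskset -c 0 <command>
-   Runs <command> on CPU 0 only.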
-
-This is useful because you can pin the interrupt handlers for your NICs
-to specific CPUs. When a shared resource (e.g. a lock in the network stack)
-is touched and loaded into a CPU cache, the handler will be put on the same
-CPU the next time it runs, avoiding the costly cache invalidations that
-can occur if the handler is put on a different CPU.
-
-Moreover, reports of up to a 24% improvement can be had if processes and
-the IRQs for the NICs the processes get data from are pinned to the same
-CPUs. Doing this ensures that the data loaded into the CPU cache by the
-interrupt handler can be used (without invalidation) by the process;
-extremely high cache locality is achieved.
-
-NOTE: If netsniff-ng or trafgen is bound to a specific CPU, it automatically
-migrates the NIC's IRQ affinity to this CPU to achieve high cache locality.
-
-.-=> Tune Socket's memory allocation area
-`--------------------------------------------------------------------------
-By default, each socket has between 130KB and 160KB of buffer memory on
-an x86/x86_64 machine with 4GB RAM. Hence, network packets can be received
-at the NIC driver layer, but later dropped at the socket queue due to memory
-restrictions. "sysctl -a | grep mem" will display your current memory
-settings. To increase maximum and default values of read and write memory
-areas, use:
- * sysctl -w net.core.rmem_max=8388608
- This sets the max OS receive buffer size for all types of connections.
- * sysctl -w net.core.wmem_max=8388608
- This sets the max OS send buffer size for all types of connections.
- * sysctl -w net.core.rmem_default=65536
- This sets the default OS receive buffer size for all types of connections.
- * sysctl -w net.core.wmem_default=65536
- This sets the default OS send buffer size for all types of connections.
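-
-To make these settings persistent across reboots, the same keys can be added
-to /etc/sysctl.conf (or a file below /etc/sysctl.d/) and reloaded with
-"sysctl -p", for example:
-
-net.core.rmem_max = 8388608
-net.core.wmem_max = 8388608
-net.core.rmem_default = 65536
-net.core.wmem_default = 65536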
-
-.-=> Enable Linux' BPF Just-in-Time compiler
-`--------------------------------------------------------------------------
-If you're using filtering with netsniff-ng (or tcpdump, Wireshark, ...), you
-should activate the Berkeley Packet Filter Just-in-Time compiler. The Linux
-kernel has a built-in "virtual machine" that interprets BPF opcodes for
-filtering packets. Hence, those small filter applications are applied to
-each packet. (Read more about this in the Bpfc document.) The Just-in-Time
-compiler is able to 'compile' such a filter application to assembler code
-that can be run directly on the CPU instead of in the virtual machine. If
-netsniff-ng or trafgen detects that the BPF JIT is present on the system, it
-automatically enables it. (Kernel options: CONFIG_HAVE_BPF_JIT=y and
-CONFIG_BPF_JIT=y)
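-
-The JIT can also be checked and enabled manually at runtime via the
-net.core.bpf_jit_enable sysctl:
- * sysctl net.core.bpf_jit_enable
-   Shows the current setting (0 = disabled, 1 = enabled).
- * sysctl -w net.core.bpf_jit_enable=1
-   Enables the BPF JIT compiler.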
-
-.-=> Increase the TX queue length
-`--------------------------------------------------------------------------
-There are settings available to regulate the size of the queue between the
-kernel network subsystem and the driver for the network interface card. Just
-as with any queue, it is recommended to size it such that losses do not
-occur due to local buffer overflows. Therefore, careful tuning is required
-to ensure that the sizes of the queues are optimal for your network
-connection.
-
-There are two queues to consider: the txqueuelen, which is related to the
-transmit queue size, and the netdev_max_backlog, which determines the receive
-queue size. Users can manually set the transmit queue size using the ifconfig
-command on the required device:
-
-ifconfig eth0 txqueuelen 2000
-
-The default of 100 is inadequate for long-distance or high-throughput pipes.
-For example, on a network with an RTT of 120ms and at Gigabit rates, a
-txqueuelen of at least 10000 is recommended.
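-
-With iproute2, the equivalent command should be:
-
-ip link set dev eth0 txqueuelen 2000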
-
-.-=> Increase kernel receiver backlog queue
-`--------------------------------------------------------------------------
-For the receiver side, we have a similar queue for incoming packets. This
-queue will build up in size when an interface receives packets faster than
-the kernel can process them. If this queue is too small (default is 300),
-we will begin to lose packets at the receiver, rather than on the network.
-One can set this value by:
-
-sysctl -w net.core.netdev_max_backlog=2000
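-
-The current value can be checked beforehand with:
-
-sysctl net.core.netdev_max_backlog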
-
-.-=> Use a RAM-based filesystem if possible
-`--------------------------------------------------------------------------
-If you have a considerable amount of RAM, you can also think of using a
-RAM-based file system such as ramfs for dumping pcap files with netsniff-ng.
-This can be useful for small to medium-sized pcap files or for pcap probes
-that are generated with netsniff-ng.
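-
-For example (the mount point /mnt/pcap is only an example; note that ramfs
-does not enforce a size limit, whereas tmpfs does, but tmpfs pages may be
-swapped out under memory pressure):
-
-mkdir -p /mnt/pcap
-mount -t ramfs ramfs /mnt/pcap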
-
-<=== Software (netsniff-ng / trafgen specific) ====>
-
-.-=> Bind netsniff-ng / trafgen to a CPU
-`--------------------------------------------------------------------------
-Both tools have a command-line option '--bind-cpu' that can be used like
-'--bind-cpu 0' in order to pin the process to a specific CPU. This was
-already mentioned earlier in this file. However, netsniff-ng and trafgen are
-able to do this without an external tool. In addition to this CPU pinning,
-they also automatically migrate the NIC's IRQ affinity to that CPU. Hence,
-with '--bind-cpu 0', netsniff-ng will not be migrated to a different CPU and
-the NIC's IRQ affinity will also be moved to CPU 0 to increase cache locality.
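-
-A typical invocation could look like the following ('--in' and '--out' are
-the capture input/output options; check netsniff-ng --help for the exact
-option names of your version):
-
-netsniff-ng --in eth0 --out dump.pcap --bind-cpu 0 --silent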
-
-.-=> Use netsniff-ng in silent mode
-`--------------------------------------------------------------------------
-Don't print information to the console while you want to achieve high speed,
-because this slows down the application considerably. Hence, use netsniff-ng's
-'--silent' option when recording or replaying PCAP files!
-
-.-=> Use netsniff-ng's scatter/gather or mmap for PCAP files
-`--------------------------------------------------------------------------
-The scatter/gather I/O mode, which is the default in netsniff-ng, can be used
-to record large PCAP files, but it is slower than memory-mapped I/O. However,
-with it you are not limited by the amount of RAM when recording. Use
-netsniff-ng's memory-mapped I/O option to achieve a higher recording speed,
-but with the trade-off that the maximum allowed PCAP size is limited.
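-
-The corresponding switches should be '--sg' for scatter/gather I/O and
-'--mmap' for memory-mapped I/O (again, check netsniff-ng --help for your
-version), e.g.:
-
-netsniff-ng --in eth0 --out dump.pcap --mmap --silent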
-
-.-=> Use static packet configurations in trafgen
-`--------------------------------------------------------------------------
-Don't use counters or byte randomization in the trafgen configuration file,
-since they slow down the packet generation process. Static packet bytes are
-the fastest option to go with.
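-
-As a rough sketch of what is meant (trafgen configuration syntax; check the
-example configuration files shipped with trafgen for the exact syntax of
-your version), prefer a fully static packet definition such as:
-
-{
-  0xff, 0xff, 0xff, 0xff, 0xff, 0xff,  # MAC destination (broadcast)
-  0x00, 0x02, 0xb3, 0x11, 0x22, 0x33,  # MAC source
-  0x08, 0x00,                          # EtherType IPv4
-  # ... rest of the static header and payload bytes ...
-}
-
-and avoid dynamic elements such as randomized or incrementing bytes inside
-the packet definition when raw generation speed is the goal.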
-
-.-=> Generate packets with different txhashes in trafgen
-`--------------------------------------------------------------------------
-For 10Gbit/s multiqueue NICs, it might be good to generate packets that result
-in different txhashes, so that multiple queues are used in the transmission
-path (and therefore very likely also multiple CPUs).
-
-Sources:
-~~~~~~~~
-
-* http://www.linuxfoundation.org/collaborate/workgroups/networking/napi
-* http://datatag.web.cern.ch/datatag/howto/tcp.html
-* http://thread.gmane.org/gmane.linux.network/191115
-* http://bit.ly/3XbBrM
-* http://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php
-* http://bit.ly/pUFJxU