Documentation/Performance


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278

Hitchhiker's guide to high-performance with netsniff-ng:
////////////////////////////////////////////////////////

This is a collection of short notes in random order concerning software
and hardware for optimizing throughput (partly copied or derived from sources
that are mentioned at the end of this file):

<=== Hardware ====>

.-=> Use a PCI-X or PCIe server NIC
`--------------------------------------------------------------------------
Only if it says Gigabit Ethernet on the box of your NIC, that does not
necessarily mean that you will also reach it. Especially on small packet
sizes, you won't reach wire-rate with a PCI adapter built for desktop or
consumer machines. Rather, you should buy a server adapter that has faster
interconnects such as PCIe. Also, make your choice of a server adapter,
whether it has a good support in the kernel. Check the Linux drivers
directory for your targeted chipset and look at the netdev list if the adapter
is updated frequently. Also, check the location/slot of the NIC adapter on
the system motherboard: Our experience resulted in significantly different
measurement values by locating the NIC adapter in different PCIe slots.
Since we did not have schematics for the system motherboard, this was a
trial and error effort. Moreover, check the specifications of the NIC
hardware: is the system bus connector I/O capable of Gigabit Ethernet
frame rate throughput? Also check the network topology: is your network
Gigabit switch capable of switching Ethernet frames at the maximum rate
or is a direct connection of two end-nodes the better solution? Is Ethernet
flow control being used? "ethtool -a eth0" can be used to determine this.
For measurement purposes, you might want to turn it off to increase throughput:
  * ethtool -A eth0 autoneg off
  * ethtool -A eth0 rx off
  * ethtool -A eth0 tx off

.-=> Use better (faster) hardware
`--------------------------------------------------------------------------
Before doing software-based fine-tuning, check if you can afford better and
especially faster hardware. For instance, get a fast CPU with lots of cores
or a NUMA architecture with multi-core CPUs and a fast interconnect. If you
dump PCAP files to disc with netsniff-ng, then a fast SSD is appropriate.
If you plan to memory map PCAP files with netsniff-ng, then choose an
appropriate amount of RAM and so on and so forth.

<=== Software (Linux kernel specific) ====>

.-=> Use NAPI drivers
`--------------------------------------------------------------------------
The "New API" (NAPI) is a rework of the packet processing code in the
kernel to improve performance for high speed networking. NAPI provides
two major features:

Interrupt mitigation: High-speed networking can create thousands of
interrupts per second, all of which tell the system something it already
knew: it has lots of packets to process. NAPI allows drivers to run with
(some) interrupts disabled during times of high traffic, with a
corresponding decrease in system load.

Packet throttling: When the system is overwhelmed and must drop packets,
it's better if those packets are disposed of before much effort goes into
processing them. NAPI-compliant drivers can often cause packets to be
dropped in the network adaptor itself, before the kernel sees them at all.

Many recent NIC drivers automatically support NAPI, so you don't need to do
anything. Some drivers need you to explicitly specify NAPI in the kernel
config or on the command line when compiling the driver. If you are unsure,
check your driver documentation.

.-=> Use a tickless kernel
`--------------------------------------------------------------------------
The tickless kernel feature allows for on-demand timer interrupts. This
means that during idle periods, fewer timer interrupts will fire, which
should lead to power savings, cooler running systems, and fewer useless
context switches. (Kernel option: CONFIG_NO_HZ=y)

.-=> Reduce timer interrupts
`--------------------------------------------------------------------------
You can select the rate at which timer interrupts in the kernel will fire.
When a timer interrupt fires on a CPU, the process running on that CPU is
interrupted while the timer interrupt is handled. Reducing the rate at
which the timer fires allows for fewer interruptions of your running
processes. This option is particularly useful for servers with multiple
CPUs where processes are not running interactively. (Kernel options:
CONFIG_HZ_100=y and CONFIG_HZ=100)

.-=> Use Intel's I/OAT DMA Engine
`--------------------------------------------------------------------------
This kernel option enables the Intel I/OAT DMA engine that is present in
recent Xeon CPUs. This option increases network throughput as the DMA
engine allows the kernel to offload network data copying from the CPU to
the DMA engine. This frees up the CPU to do more useful work.

Check to see if it's enabled:

[foo@bar]% dmesg | grep ioat
ioatdma 0000:00:08.0: setting latency timer to 64
ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, [...]
ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X

There's also a sysfs interface where you can get some statistics about the
DMA engine. Check the directories under /sys/class/dma/. (Kernel options:
CONFIG_DMADEVICES=y and CONFIG_INTEL_IOATDMA=y and CONFIG_DMA_ENGINE=y and
CONFIG_NET_DMA=y and CONFIG_ASYNC_TX_DMA=y)

.-=> Use Direct Cache Access (DCA)
`--------------------------------------------------------------------------
Intel's I/OAT also includes a feature called Direct Cache Access (DCA).
DCA allows a driver to warm a CPU cache. A few NICs support DCA, the most
popular (to my knowledge) is the Intel 10GbE driver (ixgbe). Refer to your
NIC driver documentation to see if your NIC supports DCA. To enable DCA,
a switch in the BIOS must be flipped. Some vendors supply machines that
support DCA, but don't expose a switch for DCA.

You can check if DCA is enabled:

[foo@bar]% dmesg | grep dca
dca service started, version 1.8

If DCA is possible on your system but disabled you'll see:

ioatdma 0000:00:08.0: DCA is disabled in BIOS

Which means you'll need to enable it in the BIOS or manually. (Kernel
option: CONFIG_DCA=y)

.-=> Throttle NIC Interrupts
`--------------------------------------------------------------------------
Some drivers allow the user to specify the rate at which the NIC will
generate interrupts. The e1000e driver allows you to pass a command line
option InterruptThrottleRate when loading the module with insmod. For
the e1000e there are two dynamic interrupt throttle mechanisms, specified
on the command line as 1 (dynamic) and 3 (dynamic conservative). The
adaptive algorithm traffic into different classes and adjusts the interrupt
rate appropriately. The difference between dynamic and dynamic conservative
is the rate for the 'Lowest Latency' traffic class, dynamic (1) has a much
more aggressive interrupt rate for this traffic class.

As always, check your driver documentation for more information.

With modprobe: insmod e1000e.o InterruptThrottleRate=1

.-=> Use Process and IRQ affinity
`--------------------------------------------------------------------------
Linux allows the user to specify which CPUs processes and interrupt
handlers are bound.

Processes: You can use taskset to specify which CPUs a process can run on
Interrupt Handlers: The interrupt map can be found in /proc/interrupts, and
the affinity for each interrupt can be set in the file smp_affinity in the
directory for each interrupt under /proc/irq/.

This is useful because you can pin the interrupt handlers for your NICs
to specific CPUs so that when a shared resource is touched (a lock in the
network stack) and loaded to a CPU cache, the next time the handler runs,
it will be put on the same CPU avoiding costly cache invalidations that
can occur if the handler is put on a different CPU.

However, reports of up to a 24% improvement can be had if processes and
the IRQs for the NICs the processes get data from are pinned to the same
CPUs. Doing this ensures that the data loaded into the CPU cache by the
interrupt handler can be used (without invalidation) by the process;
extremely high cache locality is achieved.

NOTE: If netsniff-ng or trafgen is bound to a specific, it automatically
migrates the NIC's IRQ affinity to this CPU to achieve a high cache locality.

.-=> Tune Socket's memory allocation area
`--------------------------------------------------------------------------
On default, each socket has a backend memory between 130KB and 160KB on
a x86/x86_64 machine with 4GB RAM. Hence, network packets can be received
on the NIC driver layer, but later dropped at the socket queue due to memory
restrictions. "sysctl -a | grep mem" will display your current memory
settings. To increase maximum and default values of read and write memory
areas, use:
   * sysctl -w net.core.rmem_max=8388608 
     This sets the max OS receive buffer size for all types of connections.
   * sysctl -w net.core.wmem_max=8388608
     This sets the max OS send buffer size for all types of connections.
   * sysctl -w net.core.rmem_default=65536
     This sets the default OS receive buffer size for all types of connections.
   * sysctl -w net.core.wmem_default=65536
     This sets the default OS send buffer size for all types of connections.

.-=> Enable Linux' BPF Just-in-Time compiler
`--------------------------------------------------------------------------
If you're using filtering with netsniff-ng (or tcpdump, Wireshark, ...), you
should activate the Berkeley Packet Filter Just-in-Time compiler. The Linux
kernel has a built-in "virtual machine" that interprets BPF opcodes for
filtering packets. Hence, those small filter applications are applied to
each packet. (Read more about this in the Bpfc document.) The Just-in-Time
compiler is able to 'compile' such an filter application to assembler code
that can directly be run on the CPU instead on the virtual machine. If
netsniff-ng or trafgen detects that the BPF JIT is present on the system, it
automatically enables it. (Kernel option: CONFIG_HAVE_BPF_JIT=y and
CONFIG_BPF_JIT=y)

.-=> Increase the TX queue length
`--------------------------------------------------------------------------
There are settings available to regulate the size of the queue between the
kernel network subsystems and the driver for network interface card. Just
as with any queue, it is recommended to size it such that losses do no
occur due to local buffer overflows. Therefore careful tuning is required
to ensure that the sizes of the queues are optimal for your network
connection.

There are two queues to consider, the txqueuelen; which is related to the
transmit queue size, and the netdev_backlog; which determines the recv
queue size. Users can manually set this queue size using the ifconfig
command on the required device:

ifconfig eth0 txqueuelen 2000

The default of 100 is inadequate for long distance, or high throughput pipes.
For example, on a network with a rtt of 120ms and at Gig rates, a
txqueuelen of at least 10000 is recommended.

.-=> Increase kernel receiver backlog queue
`--------------------------------------------------------------------------
For the receiver side, we have a similar queue for incoming packets. This
queue will build up in size when an interface receives packets faster than
the kernel can process them. If this queue is too small (default is 300),
we will begin to loose packets at the receiver, rather than on the network.
One can set this value by:

sysctl -w net.core.netdev_max_backlog=2000

.-=> Use a RAM-based filesystem if possible
`--------------------------------------------------------------------------
If you have a considerable amount of RAM, you can also think of using a
RAM-based file system such as ramfs for dumping pcap files with netsniff-ng.
This can be useful for small until middle-sized pcap sizes or for pcap probes
that are generated with netsniff-ng.

<=== Software (netsniff-ng / trafgen specific) ====>

.-=> Bind netsniff-ng / trafgen to a CPU
`--------------------------------------------------------------------------
Both tools have a command-line option '--bind-cpu' that can be used like
'--bind-cpu 0' in order to pin the process to a specific CPU. This was
already mentioned earlier in this file. However, netsniff-ng and trafgen are
able to do this without an external tool. Next to this CPU pinning, they also
automatically migrate this CPU's NIC IRQ affinity. Hence, as in '--bind-cpu 0'
netsniff-ng will not be migrated to a different CPU and the NIC's IRQ affinity
will also be moved to CPU 0 to increase cache locality.

.-=> Use netsniff-ng in silent mode
`--------------------------------------------------------------------------
Don't print information to the konsole while you want to achieve high-speed,
because this highly slows down the application. Hence, use netsniff-ng's
'--silent' option when recording or replaying PCAP files!

.-=> Use netsniff-ng's scatter/gather or mmap for PCAP files
`--------------------------------------------------------------------------
The scatter/gather I/O mode which is default in netsniff-ng can be used to
record large PCAP files and is slower than the memory mapped I/O. However,
you don't have the RAM size as your limit for recording. Use netsniff-ng's
memory mapped I/O option for achieving a higher speed for recording a PCAP,
but with the trade-off that the maximum allowed size is limited.

.-=> Use static packet configurations in trafgen
`--------------------------------------------------------------------------
Don't use counters or byte randomization in trafgen configuration file, since
it slows down the packet generation process. Static packet bytes are the fastest
to go with.

.-=> Generate packets with different txhashes in trafgen
`--------------------------------------------------------------------------
For 10Gbit/s multiqueue NICs, it might be good to generate packets that result
in different txhashes, thus multiple queues are used in the transmission path
(and therefore high likely also multiple CPUs).

Sources:
~~~~~~~~

* http://www.linuxfoundation.org/collaborate/workgroups/networking/napi
* http://datatag.web.cern.ch/datatag/howto/tcp.html
* http://thread.gmane.org/gmane.linux.network/191115
* http://bit.ly/3XbBrM
* http://wwwx.cs.unc.edu/~sparkst/howto/network_tuning.php
* http://bit.ly/pUFJxU