FreeBSD forwarding Performance

There are lots of guides about tuning FreeBSD TCP performance (where the FreeBSD host is an end-point of the TCP session), but that is not the same as tuning forwarding performance (where the FreeBSD host does not have to read the TCP information of the packets being forwarded) or firewalling performance.

Concepts

How to bench a router

Definition

A clear definition of the relationship between bandwidth and frame rate is mandatory:
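
For example, at a given bandwidth the packet rate depends directly on the frame size; the classic worst case is the smallest (64-byte) Ethernet frame, which occupies 84 bytes on the wire once the 7-byte preamble, 1-byte start-of-frame delimiter and 12-byte inter-frame gap are added. A quick back-of-the-envelope calculation in plain sh arithmetic:

# 64B frame + 20B of on-wire overhead = 84 bytes = 672 bits per frame
echo $(( 1000000000 / 672 ))     # ~1.488 Mpps at 1 Gbit/s line rate
echo $(( 10000000000 / 672 ))    # ~14.88 Mpps at 10 Gbit/s line rate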

Benchmarks

Cisco or Linux

FreeBSD

Here are some benchmarks of FreeBSD network forwarding performance (made by the BSDRP team):

Bench lab

The bench lab should make it possible to measure the pps. For obtaining accurate results, RFC 2544 (Benchmarking Methodology for Network Interconnect Devices) is a good reference. If switches are used, they need a proper configuration too; refer to the BSDRP performance lab for some examples.

Tuning

Literature

Here is a list of sources about optimizing and analyzing forwarding performance under FreeBSD.

How to bench or tune the network stack:

FreeBSD Experimental high-performance network stacks:

Multiple flows

Don't bench a router with only one flow (same source/destination addresses and same source/destination ports): you need to generate multiple flows. Multi-queue NICs use a feature like the Toeplitz hash algorithm to balance flows between all cores, so generating only one flow will use only one NIC queue (and therefore one core). A pkt-gen sketch for generating multiple flows is shown below.
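
A sketch with netmap pkt-gen (the tool also used later on this page); the interface name, destination MAC (the router's ingress NIC) and IP/port ranges below are examples to adapt, and the range syntax makes pkt-gen rotate over the source and destination addresses so that many different flows are generated:

pkt-gen -f tx -i igb2 -l 60 -w 4 \
    -s 198.18.10.1:2000-198.18.10.100 \
    -d 198.19.10.1:2000-198.19.10.100 \
    -D 00:1b:21:d5:66:15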

During the load, check that all queues are used, either with sysctl (see the sketch below) or with a Python script like this one, which displays the real-time usage of each queue.
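
A minimal sysctl check; per-queue counter names vary by driver, and the dev.igb.* names here are only an example for igb(4):

sysctl dev.igb.0 | grep -E 'queue[0-9]+\.(rx|tx)_packets'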

In the example below (output of the nic-queue-usage script), all flows are correctly shared between the 8 queues (about 340K packets-per-second each):

[root@router]~# nic-queue-usage cxl0
[Q0   346K/s] [Q1   343K/s] [Q2   339K/s] [Q3   338K/s] [Q4   338K/s] [Q5   338K/s] [Q6   343K/s] [Q7   346K/s] [QT  2734K/s  3269K/s ->     0K/s]
[Q0   347K/s] [Q1   344K/s] [Q2   339K/s] [Q3   339K/s] [Q4   338K/s] [Q5   338K/s] [Q6   343K/s] [Q7   346K/s] [QT  2735K/s  3277K/s ->     0K/s]
[Q0   344K/s] [Q1   341K/s] [Q2   338K/s] [Q3   338K/s] [Q4   337K/s] [Q5   337K/s] [Q6   342K/s] [Q7   345K/s] [QT  2727K/s  3262K/s ->     0K/s]
[Q0   355K/s] [Q1   352K/s] [Q2   348K/s] [Q3   349K/s] [Q4   348K/s] [Q5   347K/s] [Q6   352K/s] [Q7   355K/s] [QT  2809K/s  3381K/s ->     0K/s]
[Q0   351K/s] [Q1   348K/s] [Q2   344K/s] [Q3   343K/s] [Q4   342K/s] [Q5   344K/s] [Q6   349K/s] [Q7   352K/s] [QT  2776K/s  3288K/s ->     0K/s]
[Q0   344K/s] [Q1   341K/s] [Q2   338K/s] [Q3   339K/s] [Q4   338K/s] [Q5   338K/s] [Q6   343K/s] [Q7   346K/s] [QT  2731K/s  3261K/s ->     0K/s]

Choosing good Hardware

CPU

Avoid NUMA architectures and prefer a single-package CPU with the maximum number of cores (8 or 16). If you are using NUMA, check that the inbound/outbound NIC queues are correctly mapped to the same package, for example as sketched below.
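
A sketch of checking and forcing this mapping (the interface name, IRQ number and CPU list are examples): vmstat lists the IRQs assigned to the NIC queues, and cpuset(1) can bind an IRQ to the CPUs of the package local to the NIC.

vmstat -ai | grep cxl0     # list the IRQs used by the NIC queues
cpuset -l 0-7 -x 264       # bind IRQ 264 to CPUs 0-7 (first package)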

Network Interface Card

Chelsio, combining a good chipset with excellent drivers, is an excellent choice.

Intel seems to have problems managing a high PPS rate (= lots of IRQs).

Mellanox has a good reputation too, but its performance was not tested on the BSDRP bench lab.

BIOS: Disable HT

Disable Hyper-Threading (HT): by default, lots of multi-queue NIC drivers create one queue per core, but the “virtual” cores simulated by HT are not a good idea for forwarding:

Impact of HyperThreading on forwarding performance on FreeBSD
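
If HT cannot be disabled in the BIOS, FreeBSD can also be told not to use the logical cores with a loader tunable; a sketch (the machdep.hyperthreading_allowed tunable should be verified on your release):

echo 'machdep.hyperthreading_allowed="0"' >> /boot/loader.conf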

Choosing a good FreeBSD release

Before tuning, you need to use a good FreeBSD version. This means a FreeBSD -head revision not older than r309257 (Andrey V. Elsukov's improvement “Rework ip_tryforward() to use FIB4 KPI”), backported to FreeBSD 11-stable as r310771 (MFC to stable).

Since version 1.70, BSDRP uses a FreeBSD 11-stable (r312663) that includes this improvement.

2016 Forwarding performance evolution of FreeBSD -head on an 8-core Atom

For better (and linearly scaling) performance there is also the projects/routing branch, which still gives better results.

fastforwarding

By default, fastforwarding is disabled on FreeBSD (and it is incompatible with IPsec usage). This tunable is useless since FreeBSD 11.0, where the feature was renamed tryforward, enabled by default and fixed to work with IPsec.

On FreeBSD 10.3 or older, you should enable fastforwarding with:

echo "net.inet.ip.fastforwarding=1" >> /etc/sysctl.conf
sysctl net.inet.ip.fastforwarding=1

Entropy harvest impact

Lots of tuning guides suggest disabling the following entropy sources (a sketch of how to do so follows the list):

  • kern.random.sys.harvest.ethernet
  • kern.random.sys.harvest.point_to_point
  • kern.random.sys.harvest.interrupt
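
On FreeBSD 10.x this would be done as below (shown only for reference, since the benchmark that follows indicates it is not worth it on 10.2):

sysctl kern.random.sys.harvest.ethernet=0
sysctl kern.random.sys.harvest.point_to_point=0
sysctl kern.random.sys.harvest.interrupt=0
echo 'kern.random.sys.harvest.ethernet=0' >> /etc/sysctl.conf
echo 'kern.random.sys.harvest.point_to_point=0' >> /etc/sysctl.conf
echo 'kern.random.sys.harvest.interrupt=0' >> /etc/sysctl.conf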

But what is the real impact on FreeBSD 10.2 used as a router (values in pps)?

x harvest DISABLED
+ harvest ENABLED (default)
+--------------------------------------------------------------------------------+
|+                   x          x    x        x+        +   +   +               x|
|                    |_______________M_____A______________________|              |
|                   |_________________________A_________M______________|         |
+--------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1918159       2036665       1950208       1963257     44988.621
+   5       1878893       2005333       1988952     1967850.8     51378.188
No difference proven at 95.0% confidence

⇒ There is no difference on FreeBSD 10.2 and earlier, so we can keep the default value (enabled).

But since FreeBSD 11 (head) this is no longer true, and removing some entropy sources (interrupt and ethernet) can have a very big impact:

Impact of disabling some entropy source on FreeBSD forwarding performance

Polling mode

Polling can be used in 2 cases:

To enable polling mode (a per-interface sketch follows these steps):

  1. Edit /etc/rc.conf.misc and replace polling_enable="NO" with polling_enable="YES"
  2. Execute: service polling start
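
Outside of BSDRP's service wrapper, polling requires a kernel built with “options DEVICE_POLLING” (which is the case for BSDRP) and can also be toggled per interface; a sketch (em0 is an example interface with a polling-capable driver):

ifconfig em0 polling      # enable polling on em0
ifconfig em0 -polling     # disable it again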

NIC drivers compatibility matrix

BSDRP can use some special features on some NICs (polling and ALTQ), and only the devices listed below support these modes:

Name | Description | Polling | ALTQ
ae | Attansic/Atheros L2 FastEthernet controller driver | no | yes
age | Attansic/Atheros L1 Gigabit Ethernet driver | no | yes
alc | Atheros AR813x/AR815x Gigabit/Fast Ethernet driver | no | yes
ale | Atheros AR8121/AR8113/AR8114 Gigabit/Fast Ethernet driver | no | yes
bce | Broadcom NetXtreme II (BCM5706/5708/5709/5716) PCI/PCIe Gigabit Ethernet adapter driver | no | yes
bfe | Broadcom BCM4401 Ethernet Device Driver | no | yes
bge | Broadcom BCM570x/5714/5721/5722/5750/5751/5752/5789 PCI Gigabit Ethernet adapter driver | yes | yes
cas | Sun Cassini/Cassini+ and National Semiconductor DP83065 Saturn Gigabit Ethernet driver | no | yes
cxgbe | Chelsio T4 and T5 based 40Gb, 10Gb, and 1Gb Ethernet adapter driver | no | yes
dc | DEC/Intel 21143 and clone 10/100 Ethernet driver | yes | yes
de | DEC DC21x4x Ethernet device driver | no | yes
ed | NE-2000 and WD-80×3 Ethernet driver | no | yes
em | Intel(R) PRO/1000 Gigabit Ethernet adapter driver | yes | yes
et | Agere ET1310 10/100/Gigabit Ethernet driver | no | yes
ep | Ethernet driver for 3Com Etherlink III (3c5x9) interfaces | no | yes
fxp | Intel EtherExpress PRO/100 Ethernet device driver | yes | yes
gem | ERI/GEM/GMAC Ethernet device driver | no | yes
hme | Sun Microelectronics STP2002-STQ Ethernet interfaces device driver | no | yes
igb | Intel(R) PRO/1000 PCI Express Gigabit Ethernet adapter driver | yes | needs IGB_LEGACY_TX
ixgb(e) | Intel(R) 10Gb Ethernet driver | yes | needs IGB_LEGACY_TX
jme | JMicron Gigabit/Fast Ethernet driver | no | yes
le | AMD Am7900 LANCE and Am79C9xx ILACC/PCnet Ethernet interface driver | no | yes
msk | Marvell/SysKonnect Yukon II Gigabit Ethernet adapter driver | no | yes
mxge | Myricom Myri10GE 10 Gigabit Ethernet adapter driver | no | yes
my | Myson Technology Ethernet PCI driver | no | yes
nfe | NVIDIA nForce MCP Ethernet driver | yes | yes
nge | National Semiconductor PCI Gigabit Ethernet adapter driver | yes | no
nve | NVIDIA nForce MCP Networking Adapter device driver | no | yes
qlxgb | QLogic 10 Gigabit Ethernet & CNA Adapter Driver | no | yes
re | RealTek 8139C+/8169/816xS/811xS/8101E PCI/PCIe Ethernet adapter driver | yes | yes
rl | RealTek 8129/8139 Fast Ethernet device driver | yes | yes
sf | Adaptec AIC‐6915 “Starfire” PCI Fast Ethernet adapter driver | yes | yes
sge | Silicon Integrated Systems SiS190/191 Fast/Gigabit Ethernet driver | no | yes
sis | SiS 900, SiS 7016 and NS DP83815/DP83816 Fast Ethernet device driver | yes | yes
sk | SysKonnect SK-984x and SK-982x PCI Gigabit Ethernet adapter driver | yes | yes
ste | Sundance Technologies ST201 Fast Ethernet device driver | no | yes
stge | Sundance/Tamarack TC9021 Gigabit Ethernet adapter driver | yes | yes
ti | Alteon Networks Tigon I and Tigon II Gigabit Ethernet driver | no | yes
txp | 3Com 3XP Typhoon/Sidewinder (3CR990) Ethernet interface | no | yes
vge | VIA Networking Technologies VT6122 PCI Gigabit Ethernet adapter driver | yes | yes
vr | VIA Technologies Rhine I/II/III Ethernet device driver | yes | yes
xl | 3Com Etherlink XL and Fast Etherlink XL Ethernet device driver | yes | yes

Using other NICs will work too :-)

NIC drivers tuning

Network cards have become very complex and provide lots of tuning parameters that can have a huge performance impact.

Limiting default NIC Multi-queue number (valid for FreeBSD 11.0 or older)

First, the multi-queue feature of all modern NICs can be limited in the number of queues (and thus CPUs) to use. You need to test this impact on your own hardware, because using the default value (number of queues = number of CPUs) is not always a good idea: Bad default NIC queue number with 8 cores or more

This graph shows that, in this specific case, playing with the “max interrupt rate” parameter didn't help.

Still looking at this graph, we can see that for this setup the best configuration was limiting the driver to 4 queues. This is correct for a router… but for a firewall this parameter isn't optimum: Impact of ipfw and pf on throughput with an 8-core Intel Atom C2758 running FreeBSD 10-STABLE r262743
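
The number of queues is usually limited with a driver loader tunable; the names are driver specific and the values below are only an illustration of the 4-queue setup discussed above:

# /boot/loader.conf
hw.igb.num_queues="4"    # igb(4)
hw.ix.num_queues="4"     # ixgbe(4)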

Descriptors per queue and maximum number of received packets to process at a time

Regarding some other driver parameters, here is the potential impact of the maximum number of input packets processed at a time and of the descriptor ring size (a tuning sketch follows):
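
As an illustration, for igb(4) these are controlled by loader tunables; the names and values below are examples to benchmark on your own hardware, not recommendations, and other drivers use different names:

# /boot/loader.conf
hw.igb.rxd="2048"               # RX descriptors per queue
hw.igb.txd="2048"               # TX descriptors per queue
hw.igb.rx_process_limit="-1"    # packets processed per interrupt (-1 = no limit)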

Disabling LRO and TSO

All modern NICs support the LRO and TSO features, which need to be disabled on a router (see the example after this list):

  1. By collecting multiple packets at the NIC level before handing them up to the stack, they add latency, and because all packets need to be sent out again, the stack has to split them back into individual packets before handing them down to the NIC. The Intel drivers' readme includes this note: “The result of not disabling LRO when combined with ip forwarding or bridging can be low throughput or even a kernel panic.”
  2. They break the end-to-end principle.
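
Disabling them is done per interface with ifconfig (igb0 is an example interface name), and persistently by adding the same options to the interface line in /etc/rc.conf:

ifconfig igb0 -tso -lro
# persistently, in /etc/rc.conf (example address):
# ifconfig_igb0="inet 192.0.2.1/24 -tso -lro"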

There is no real impact of disabling these features on PPS performance:

x tso.lro.enabled
+ tso.lro.disabled
+--------------------------------------------------------------------------+
|   +  +     x+    *                          x+                    x     x|
|               |___________________________A_M_________________________|  |
||____________M___A________________|                                       |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1724046       1860817       1798145       1793343     61865.164
+   5       1702496       1798998       1725396     1734863.2     38178.905
No difference proven at 95.0% confidence

Summary

Default FreeBSD parameters are meant for a generic server (end host) and are not tuned for router usage, and tuning parameters that suit a router do not always suit a firewall.

Where is the bottleneck?

Tools:

Packet traffic

Display the information regarding packet traffic, refreshed every second.

Here is a first example:

[root@BSDRP3]~# netstat -i -h -w 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
      370k     0     0        38M       370k     0        38M     0
      369k     0     0        38M       368k     0        38M     0
      370k     0     0        38M       370k     0        38M     0
      373k     0     0        38M       376k     0        38M     0
      370k     0     0        38M       368k     0        38M     0
      368k     0     0        38M       368k     0        38M     0
      368k     0     0        38M       369k     0        38M     0

⇒ This system is forwarding 370Kpps (in and out) without any in/out errs (the packet generator used netblast with 64B packet size at 370Kpps).

Don't use “netstat -h” on a standard FreeBSD: this option has a bug.

Here is a second example:

[root@BSDRP3]~# netstat -ihw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
      399k  915k     0        25M       395k     0        24M     0
      398k  914k     0        24M       398k     0        24M     0
      399k  915k     0        25M       399k     0        25M     0
      398k  915k     0        24M       397k     0        24M     0
      399k  914k     0        25M       398k     0        24M     0
      398k  914k     0        24M       400k     0        25M     0
      398k  915k     0        24M       396k     0        24M     0
      400k  915k     0        25M       401k     0        25M     0
      397k  914k     0        24M       397k     0        24M     0
      398k  914k     0        24M       399k     0        25M     0
      400k  914k     0        25M       401k     0        25M     0
      398k  914k     0        24M       397k     0        24M     0

⇒ This system is forwarding about 400Kpps (in and out), but it's overloaded because it drops (errs) about 914Kpps (the generator used netmap pkt-gen with 64B packet size at a rate of 1.34Mpps).

Interrupt usage

Report on the number of interrupts taken by each device since system startup.

Here is a first example:

[root@BSDRP3]~# vmstat -i
interrupt                          total       rate
irq4: uart0                         6670          5
irq14: ata0                            5          0
irq16: bge0                           27          0
irq17: em0 bge1                  5209668       4510
cpu0:timer                       1299291       1124
irq256: ahci0                       1172          1
Total                            6516833       5642

⇒ Notice that em0 and bge1 are sharing the same IRQ: this is not good news.

Here is a second example:

[root@BSDRP3]# vmstat -i
interrupt                          total       rate
irq4: uart0                        17869          0
irq14: ata0                            5          0
irq16: bge0                            1          0
irq17: em0 bge1                        2          0
cpu0:timer                     214331752       1125
irq256: ahci0                       1725          0
Total                          214351354       1126

⇒ Almost zero rates and counters for the NIC IRQs mean that polling is enabled. Note: the interrupt management of modern NICs avoids the need for polling.

Memory Buffer

Show statistics recorded by the memory management routines. The network manages a private pool of memory buffers.

[root@BSDRP3]~# netstat -m
5220/810/6030 mbufs in use (current/cache/total)
5219/675/5894/512000 mbuf clusters in use (current/cache/total/max)
5219/669 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0/256000 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/128000 9k jumbo clusters in use (current/cache/total/max)
0/0/0/64000 16k jumbo clusters in use (current/cache/total/max)
11743K/1552K/13295K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

Or more verbose:

[root@BSDRP3]~# vmstat -z | head -1 ; vmstat -z | grep -i mbuf
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_packet:            256,      0,    5221,     667,414103198,   0,   0
mbuf:                   256,      0,       1,     141,     135,   0,   0
mbuf_cluster:          2048, 512000,    5888,       6,    5888,   0,   0
mbuf_jumbo_page:       4096, 256000,       0,       0,       0,   0,   0
mbuf_jumbo_9k:         9216, 128000,       0,       0,       0,   0,   0
mbuf_jumbo_16k:       16384,  64000,       0,       0,       0,   0,   0
mbuf_ext_refcnt:          4,      0,       0,       0,       0,   0,   0

⇒ No “FAIL” counters here.

CPU / NIC

top can give very useful information regarding the CPU/NIC affinity:

[root@BSDRP]/# top -nCHSIzs1
last pid:  1717;  load averages:  7.39,  2.01,  0.78  up 0+00:18:58    21:51:08
148 processes: 18 running, 85 sleeping, 45 waiting

Mem: 13M Active, 9476K Inact, 641M Wired, 128K Cache, 9560K Buf, 7237M Free
Swap:


  PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
   11 root       -92    -     0K   864K CPU2    2   0:01  98.39% intr{irq259: igb0:que}
   11 root       -92    -     0K   864K CPU5    5   0:38  97.07% intr{irq262: igb0:que}
   11 root       -92    -     0K   864K WAIT    7   0:38  96.68% intr{irq264: igb0:que}
   11 root       -92    -     0K   864K WAIT    3   0:39  96.58% intr{irq260: igb0:que}
   11 root       -92    -     0K   864K CPU6    6   0:38  96.48% intr{irq263: igb0:que}
   11 root       -92    -     0K   864K WAIT    4   0:38  96.00% intr{irq261: igb0:que}
   11 root       -92    -     0K   864K RUN     0   0:40  95.56% intr{irq257: igb0:que}
   11 root       -92    -     0K   864K WAIT    1   0:37  95.17% intr{irq258: igb0:que}
   11 root       -92    -     0K   864K WAIT    1   0:01   0.98% intr{irq276: igb2:que}
   11 root       -92    -     0K   864K RUN     3   0:00   0.88% intr{irq278: igb2:que}
   11 root       -92    -     0K   864K WAIT    0   0:01   0.78% intr{irq275: igb2:que}
   11 root       -92    -     0K   864K WAIT    4   0:00   0.78% intr{irq279: igb2:que}
   11 root       -92    -     0K   864K RUN     7   0:00   0.59% intr{irq282: igb2:que}
   11 root       -92    -     0K   864K RUN     6   0:00   0.59% intr{irq281: igb2:que}
   11 root       -92    -     0K   864K RUN     5   0:00   0.29% intr{irq280: igb2:que}

Drivers

Depending on the NIC driver used, there are some counters available:

[root@BSDRP3]~# sysctl dev.em.0.mac_stats. | grep -v ': 0'
dev.em.0.mac_stats.missed_packets: 221189883
dev.em.0.mac_stats.recv_no_buff: 94987654
dev.em.0.mac_stats.total_pkts_recvd: 351270928
dev.em.0.mac_stats.good_pkts_recvd: 130081045
dev.em.0.mac_stats.bcast_pkts_recvd: 1
dev.em.0.mac_stats.rx_frames_64: 2
dev.em.0.mac_stats.rx_frames_65_127: 130081043
dev.em.0.mac_stats.good_octets_recvd: 14308901524
dev.em.0.mac_stats.good_octets_txd: 892
dev.em.0.mac_stats.total_pkts_txd: 10
dev.em.0.mac_stats.good_pkts_txd: 10
dev.em.0.mac_stats.bcast_pkts_txd: 2
dev.em.0.mac_stats.mcast_pkts_txd: 5
dev.em.0.mac_stats.tx_frames_64: 2
dev.em.0.mac_stats.tx_frames_65_127: 8

⇒ Notice the high level of missed_packets and recv_no_buff: this points to a performance problem with the NIC or its driver (in this example, the packet generator sent packets at a rate of about 1.38Mpps).

pmcstat

While your router/firewall is under high load, load the hwpmc(4) module:

kldload hwpmc
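
To have the module loaded automatically at each boot instead:

echo 'hwpmc_load="YES"' >> /boot/loader.conf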

Time used by process

Now you can display the processes consuming the most time with:

pmcstat -TS instructions -w1

That will display this output:

PMC: [INSTR_RETIRED_ANY] Samples: 36456 (100.0%) , 29616 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
 56.6 pf.ko      pf_test              pf_check_in:29.0 pf_check_out:27.6
 13.5 pf.ko      pf_find_state        pf_test_state_udp
  7.7 pf.ko      pf_test_state_udp    pf_test
  7.5 pf.ko      pf_pull_hdr          pf_test
  4.0 pf.ko      pf_check_out
  2.5 pf.ko      pf_normalize_ip      pf_test
  2.3 pf.ko      pf_check_in
  1.5 libpmc.so. pmclog_read
  1.3 hwpmc.ko   pmclog_process_callc pmc_process_samples
  0.8 libc.so.7  bcopy

In this case, the bottleneck is pf(4).

CPU cycles spent

To display where the most CPU cycles are being spent, we first need a partition of about 200MB that includes the debug kernel:

system expand-data-slice
mount /data

Then, under high load, collect samples for about 20 seconds:

pmcstat -S CPU_CLK_UNHALTED_CORE -l 20 -O /data/pmc.out

Then analyze the output with:

fetch http://BSDRP-release-debug
tar xvf BSDRP-release-debug.tar.xz
pmcannotate /data/pmc.out /data/debug/boot/kernel/kernel.symbols