FreeBSD Forwarding Performance

There are lots of guides about tuning FreeBSD TCP performance (where the FreeBSD host is an end-point of the TCP session), but this is not the same as tuning forwarding performance (where the FreeBSD host does not have to read the TCP information of the packets being forwarded) or firewalling performance.

Concepts

How to bench a router

Benchmarking a router is not about measuring the maximum bandwidth crossing the router, but about measuring the network throughput (in packets per second):

Definition

A clear definition of the relation between bandwidth and frame rate is mandatory:
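
For example, on Ethernet every frame carries 20 extra bytes on the wire (7-byte preamble, 1-byte start-of-frame delimiter and a 12-byte inter-frame gap), so the theoretical maximum frame rate for a given link speed and frame size can be computed as in this small sketch (shell arithmetic, 64-byte frames on a 1 Gb/s link):

# max frame rate = link speed (bit/s) / ((frame size + 20) * 8)
echo $((1000000000 / ((64 + 20) * 8)))
# => 1488095, the well-known ~1.488 Mpps figure for Gigabit Ethernet

This is why small-packet rates, not bandwidth, are the hard case for a router.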

Benchmarks

Cisco or Linux

FreeBSD

Here are some benchmarks regarding the network forwarding performance of FreeBSD:

Bench lab

The bench lab should permit measuring the pps. For obtaining accurate results, RFC 2544 (Benchmarking Methodology for Network Interconnect Devices) is a good reference. If switches are used, they need proper configuration too; refer to the BSDRP performance lab for some examples.

Tuning FreeBSD

Literature

Here is a list of sources about optimizing forwarding performance under FreeBSD.

How to bench or tune the network stack:

FreeBSD Experimental high-performance network stacks:

Enable fastforwarding

By default, fastforwarding is disabled on FreeBSD (and it is incompatible with IPsec usage). The first step is to enable fastforwarding with:

echo "net.inet.ip.fastforwarding=1" >> /etc/sysctl.conf
sysctl net.inet.ip.fastforwarding=1

Here is an example of the difference without and with fastforwarding: Impact of ipfw and pf on a 4-core Xeon 2.13GHz with 10-Gigabit Intel X540-AT2

Entropy harvest impact

Lots of tuning guides recommend disabling the following entropy-harvesting sysctls (see the example after the list):

  • kern.random.sys.harvest.ethernet
  • kern.random.sys.harvest.point_to_point
  • kern.random.sys.harvest.interrupt
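
For reference, the disabling suggested by those guides looks like this (same pattern as for fastforwarding; sysctl names as listed above):

echo "kern.random.sys.harvest.ethernet=0" >> /etc/sysctl.conf
echo "kern.random.sys.harvest.point_to_point=0" >> /etc/sysctl.conf
echo "kern.random.sys.harvest.interrupt=0" >> /etc/sysctl.conf
sysctl kern.random.sys.harvest.ethernet=0
sysctl kern.random.sys.harvest.point_to_point=0
sysctl kern.random.sys.harvest.interrupt=0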

But what is the REAL impact on a router (values in pps)?

x harvest DISABLED
+ harvest ENABLED (default)
+--------------------------------------------------------------------------------+
|+                   x          x    x        x+        +   +   +               x|
|                    |_______________M_____A______________________|              |
|                   |_________________________A_________M______________|         |
+--------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1918159       2036665       1950208       1963257     44988.621
+   5       1878893       2005333       1988952     1967850.8     51378.188
No difference proven at 95.0% confidence

⇒ There is no proven difference, so we can keep the default value (enabled).

NIC drivers tuning

Network cards have become very complex and provide lots of tuning parameters that can have a huge performance impact.

Multi-queue

First, the multi-queue feature of all modern NICs can be limited in the number of queues (and therefore CPUs) to use. You need to test this impact on your own hardware, because it's not always a good idea to use the default value (number of queues = number of CPUs): igb(4) num_queues and max_interrupt_rate impact on throughput with an 8-core Intel Atom C2758 running FreeBSD 10-STABLE r262743

This graph shows that, in this specific case, playing with the “max interrupt rate” parameter didn't help.

Still regarding this graph, we can see that for this setup the best configuration was limiting the driver to 4 queues. This is correct for a router… but for a firewall this parameter isn't optimum: Impact of ipfw and pf on throughput with an 8-core Intel Atom C2758 running FreeBSD 10-STABLE r262743
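
As a sketch, limiting the number of queues used by igb(4) is done with loader tunables in /boot/loader.conf (the values below are only examples to test on your own hardware, and a reboot is required):

# /boot/loader.conf (example values, not recommendations)
hw.igb.num_queues=4
hw.igb.max_interrupt_rate=8000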

Descriptors per queue and maximum number of received packets to process at a time

Regarding some other driver parameters, here is the potential impact of the maximum number of input packets to process at a time and of the size of the descriptor rings:
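
For igb(4), these are also loader tunables in /boot/loader.conf (the values shown are only illustrative and should be measured on your own hardware):

# /boot/loader.conf (example values)
hw.igb.rxd=2048
hw.igb.txd=2048
hw.igb.rx_process_limit=500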

Disabling LRO and TSO

All modern NICs support the LRO and TSO features, which need to be disabled on a router, for two reasons (see the example after this list):

  1. By waiting to aggregate multiple packets at the NIC level before handing them up to the stack, they add latency; and because all packets need to be sent out again, the stack has to split them back into separate packets before handing them down to the NIC. The Intel drivers readme includes this note: “The result of not disabling LRO when combined with ip forwarding or bridging can be low throughput or even a kernel panic.”
  2. They break the end-to-end principle.
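
Disabling them can be done at run time with ifconfig (igb0 is just an example interface name); to make it persistent, add the same “-lro -tso” options to the corresponding ifconfig_<interface> line in /etc/rc.conf:

ifconfig igb0 -lro -tso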

There is no real impact on PPS from disabling these features:

x tso.lro.enabled
+ tso.lro.disabled
+--------------------------------------------------------------------------+
|   +  +     x+    *                          x+                    x     x|
|               |___________________________A_M_________________________|  |
||____________M___A________________|                                       |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1724046       1860817       1798145       1793343     61865.164
+   5       1702496       1798998       1725396     1734863.2     38178.905
No difference proven at 95.0% confidence

Summary

Default FreeBSD parameters are meant for a generic server (end host) and are not tuned for router usage; moreover, tuning parameters that suit a router don't always suit a firewall.

Where is the bottleneck?

Tools:

  • netstat: show network status
  • vmstat: report virtual memory statistics
  • top: display and update information about the top CPU processes
  • pmcstat: measure performance using hardware counters

Packet traffic

Display information regarding packet traffic, refreshed every second.

Here is a first example:

[root@BSDRP3]~# netstat -i -h -w 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
      370k     0     0        38M       370k     0        38M     0
      369k     0     0        38M       368k     0        38M     0
      370k     0     0        38M       370k     0        38M     0
      373k     0     0        38M       376k     0        38M     0
      370k     0     0        38M       368k     0        38M     0
      368k     0     0        38M       368k     0        38M     0
      368k     0     0        38M       369k     0        38M     0

⇒ This system is forwarding 370Kpps (in and out) without any in/out errs (the packet generator used was netblast with 64B packet size at 370Kpps).

Don't use “netstat -h” on a standard FreeBSD: this option has a bug.

Here is a second example:

[root@BSDRP3]~# netstat -ihw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
      399k  915k     0        25M       395k     0        24M     0
      398k  914k     0        24M       398k     0        24M     0
      399k  915k     0        25M       399k     0        25M     0
      398k  915k     0        24M       397k     0        24M     0
      399k  914k     0        25M       398k     0        24M     0
      398k  914k     0        24M       400k     0        25M     0
      398k  915k     0        24M       396k     0        24M     0
      400k  915k     0        25M       401k     0        25M     0
      397k  914k     0        24M       397k     0        24M     0
      398k  914k     0        24M       399k     0        25M     0
      400k  914k     0        25M       401k     0        25M     0
      398k  914k     0        24M       397k     0        24M     0

⇒ This system is forwarding about 400Kpps (in and out), but it's overloaded because it drops (errs) about 914Kpps (the generator used was netmap's pkt-gen with 64B packet size at a rate of 1.34Mpps).

Interrupt usage

Report on the number of interrupts taken by each device since system startup.

Here is a first example:

[root@BSDRP3]~# vmstat -i
interrupt                          total       rate
irq4: uart0                         6670          5
irq14: ata0                            5          0
irq16: bge0                           27          0
irq17: em0 bge1                  5209668       4510
cpu0:timer                       1299291       1124
irq256: ahci0                       1172          1
Total                            6516833       5642

⇒ Notice that em0 and bge1 are sharing the same IRQ. That's not good news.

Here is a second example:

[root@BSDRP3]# vmstat -i
interrupt                          total       rate
irq4: uart0                        17869          0
irq14: ata0                            5          0
irq16: bge0                            1          0
irq17: em0 bge1                        2          0
cpu0:timer                     214331752       1125
irq256: ahci0                       1725          0
Total                          214351354       1126

⇒ Almost-zero rates and counters for the NIC IRQs mean that polling is enabled; note that the interrupt management of modern NICs avoids the need for polling.
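
If polling was enabled for testing and you prefer to rely on the NIC's own interrupt management, it can be turned off per interface (the interface name here is just an example; polling requires the DEVICE_POLLING kernel option in the first place):

ifconfig em0 -polling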

Memory Buffer

Show statistics recorded by the memory management routines. The network manages a private pool of memory buffers.

[root@BSDRP3]~# netstat -m
5220/810/6030 mbufs in use (current/cache/total)
5219/675/5894/512000 mbuf clusters in use (current/cache/total/max)
5219/669 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0/256000 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/128000 9k jumbo clusters in use (current/cache/total/max)
0/0/0/64000 16k jumbo clusters in use (current/cache/total/max)
11743K/1552K/13295K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

Or, more verbosely:

[root@BSDRP3]~# vmstat -z | head -1 ; vmstat -z | grep -i mbuf
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_packet:            256,      0,    5221,     667,414103198,   0,   0
mbuf:                   256,      0,       1,     141,     135,   0,   0
mbuf_cluster:          2048, 512000,    5888,       6,    5888,   0,   0
mbuf_jumbo_page:       4096, 256000,       0,       0,       0,   0,   0
mbuf_jumbo_9k:         9216, 128000,       0,       0,       0,   0,   0
mbuf_jumbo_16k:       16384,  64000,       0,       0,       0,   0,   0
mbuf_ext_refcnt:          4,      0,       0,       0,       0,   0,   0

⇒ No denied requests (netstat -m) or FAIL counters (vmstat -z) here.
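
If the FAIL column were not zero, the mbuf cluster limit could be raised (a sketch only; the value is an example and must be sized to the available RAM; on recent FreeBSD it can be changed at run time, otherwise set it in /boot/loader.conf):

echo "kern.ipc.nmbclusters=1000000" >> /etc/sysctl.conf
sysctl kern.ipc.nmbclusters=1000000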

CPU / NIC

top can give very useful information regarding the CPU/NIC affinity:

[root@BSDRP]/# top -nCHSIzs1
last pid:  1717;  load averages:  7.39,  2.01,  0.78  up 0+00:18:58    21:51:08
148 processes: 18 running, 85 sleeping, 45 waiting

Mem: 13M Active, 9476K Inact, 641M Wired, 128K Cache, 9560K Buf, 7237M Free
Swap:


  PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
   11 root       -92    -     0K   864K CPU2    2   0:01  98.39% intr{irq259: igb0:que}
   11 root       -92    -     0K   864K CPU5    5   0:38  97.07% intr{irq262: igb0:que}
   11 root       -92    -     0K   864K WAIT    7   0:38  96.68% intr{irq264: igb0:que}
   11 root       -92    -     0K   864K WAIT    3   0:39  96.58% intr{irq260: igb0:que}
   11 root       -92    -     0K   864K CPU6    6   0:38  96.48% intr{irq263: igb0:que}
   11 root       -92    -     0K   864K WAIT    4   0:38  96.00% intr{irq261: igb0:que}
   11 root       -92    -     0K   864K RUN     0   0:40  95.56% intr{irq257: igb0:que}
   11 root       -92    -     0K   864K WAIT    1   0:37  95.17% intr{irq258: igb0:que}
   11 root       -92    -     0K   864K WAIT    1   0:01   0.98% intr{irq276: igb2:que}
   11 root       -92    -     0K   864K RUN     3   0:00   0.88% intr{irq278: igb2:que}
   11 root       -92    -     0K   864K WAIT    0   0:01   0.78% intr{irq275: igb2:que}
   11 root       -92    -     0K   864K WAIT    4   0:00   0.78% intr{irq279: igb2:que}
   11 root       -92    -     0K   864K RUN     7   0:00   0.59% intr{irq282: igb2:que}
   11 root       -92    -     0K   864K RUN     6   0:00   0.59% intr{irq281: igb2:que}
   11 root       -92    -     0K   864K RUN     5   0:00   0.29% intr{irq280: igb2:que}

Drivers

Depending on the NIC driver used, some counters are available:

[root@BSDRP3]~# sysctl dev.em.0.mac_stats. | grep -v ': 0'
dev.em.0.mac_stats.missed_packets: 221189883
dev.em.0.mac_stats.recv_no_buff: 94987654
dev.em.0.mac_stats.total_pkts_recvd: 351270928
dev.em.0.mac_stats.good_pkts_recvd: 130081045
dev.em.0.mac_stats.bcast_pkts_recvd: 1
dev.em.0.mac_stats.rx_frames_64: 2
dev.em.0.mac_stats.rx_frames_65_127: 130081043
dev.em.0.mac_stats.good_octets_recvd: 14308901524
dev.em.0.mac_stats.good_octets_txd: 892
dev.em.0.mac_stats.total_pkts_txd: 10
dev.em.0.mac_stats.good_pkts_txd: 10
dev.em.0.mac_stats.bcast_pkts_txd: 2
dev.em.0.mac_stats.mcast_pkts_txd: 5
dev.em.0.mac_stats.tx_frames_64: 2
dev.em.0.mac_stats.tx_frames_65_127: 8

⇒ Notice the high values of missed_packets and recv_no_buff: this indicates a performance problem with the NIC or its driver (in this example, the packet generator sends packets at a rate of about 1.38Mpps).

pmcstat

During high load on your router/firewall, load the hwpmc(4) module:

kldload hwpmc

Time used by process

Now you can display the processes consuming the most time with:

pmcstat -TS instructions -w1

That will display this output:

PMC: [INSTR_RETIRED_ANY] Samples: 36456 (100.0%) , 29616 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
 56.6 pf.ko      pf_test              pf_check_in:29.0 pf_check_out:27.6
 13.5 pf.ko      pf_find_state        pf_test_state_udp
  7.7 pf.ko      pf_test_state_udp    pf_test
  7.5 pf.ko      pf_pull_hdr          pf_test
  4.0 pf.ko      pf_check_out
  2.5 pf.ko      pf_normalize_ip      pf_test
  2.3 pf.ko      pf_check_in
  1.5 libpmc.so. pmclog_read
  1.3 hwpmc.ko   pmclog_process_callc pmc_process_samples
  0.8 libc.so.7  bcopy

In this case, the bottleneck is pf(4).

CPU cycles spent

To display where the most CPU cycles are being spent, we first need a partition with about 200MB of free space that includes the unzipped kernel:

system expand-data-slice
mount /data
gunzip -c /boot/kernel/kernel.gz > /data/kernel

Then, during high load, start collecting (for about 5 seconds):

pmcstat -S CPU_CLK_UNHALTED_CORE -O /data/pmc.out

Then analyze the output with:

pmcannotate /data/pmc.out /boot/kernel/kernel