
FreeBSD forwarding Performance

There are lots of guides about tuning FreeBSD TCP performance (where the FreeBSD host is an end-point of the TCP session), but that is not the same as tuning forwarding performance (where the FreeBSD host doesn't have to read the TCP information of the packets being forwarded) or firewalling performance.

Concepts

How to bench a router

Benchmarking a router is not about measuring the maximum bandwidth crossing the router, but about measuring the network throughput in packets-per-second (pps):

Definition

A clear definition of the relation between bandwidth and frame rate is mandatory:

Benchmarks

Cisco or Linux

FreeBSD

Here are some benchmarks regarding the network forwarding performance of FreeBSD:

Bench lab

The bench lab should allow the pps to be measured. For obtaining accurate results, RFC 2544 (Benchmarking Methodology for Network Interconnect Devices) is a good reference. If switches are used, they need proper configuration too; refer to the BSDRP performance lab for some examples.
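As a sketch of the generator side (the interface name ix0 and the addresses are assumptions; RFC 2544 reserves 198.18.0.0/15 for benchmarking), netmap's pkt-gen can transmit minimum-size frames at line rate:

```shell
# Sketch: transmit small frames at maximum rate with netmap's pkt-gen
# (ix0 and the 198.18.x.x/198.19.x.x test addresses are assumptions).
# -l 60 is the frame length before the 4-byte CRC, i.e. 64B on the wire.
pkt-gen -i ix0 -f tx -l 60 -s 198.18.0.1 -d 198.19.0.1
```

The receiver side would run pkt-gen with -f rx on the far interface and compare the received rate against the transmitted rate.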

Tuning FreeBSD

Literature

Here is a list of sources about optimizing forwarding performance under FreeBSD.

How to bench or tune the network stack:

FreeBSD Experimental high-performance network stacks:

Enable fastforwarding

By default, fastforwarding is disabled on FreeBSD (and it is incompatible with IPsec usage). The first step is to enable fastforwarding with:

echo "net.inet.ip.fastforwarding=1" >> /etc/sysctl.conf
sysctl net.inet.ip.fastforwarding=1

Here is an example of the difference with and without fastforwarding: Impact of ipfw and pf on a 4-core Xeon 2.13GHz with a 10-Gigabit Intel X540-AT2

Drivers tuning

Network cards have become very complex and provide lots of tuning parameters that can have a huge performance impact.

First, the multi-queue feature of all modern NICs can be limited in the number of queues (and therefore CPUs) to use. You need to test the impact on your own hardware, because the default value (number of queues = number of CPUs) is not always the best choice: igb(4) num_queues and max_interrupt_rate impact on throughput with an 8-core Intel Atom C2758 running FreeBSD 10-STABLE r262743
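For igb(4) these are loader tunables, so they have to be set in /boot/loader.conf and need a reboot. A sketch of limiting the driver to 4 queues (4 is the value that worked best on the Atom C2758 setup described here; the interrupt-rate value is only an example, test both on your own hardware):

```shell
# /boot/loader.conf — igb(4) tunables (reboot required)
hw.igb.num_queues=4             # 0 (default) = one queue per CPU core
hw.igb.max_interrupt_rate=8000  # example interrupt moderation value
```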

This graph shows that, in this specific case, playing with the “max interrupt rate” parameter didn't help.

Still regarding this graph, we can see that for this setup the best configuration was limiting the driver to 4 queues. This is correct for a router… but for a firewall this parameter isn't optimum: Impact of ipfw and pf on throughput with an 8-core Intel Atom C2758 running FreeBSD 10-STABLE r262743. Regarding some other driver parameters, here is the potential impact of the maximum number of input packets to process and the size of the descriptor rings:

The conclusion is that the default parameters of FreeBSD (“generic server”) aren't tuned for “router” usage, and the tuning parameters for a router don't always suit “firewall” usage.

Where is the bottleneck?

Tools:

  • netstat: show network status
  • vmstat: report virtual memory statistics
  • top: display and update information about the top cpu processes

Packet traffic

Display information regarding packet traffic, refreshed each second.

Here is a first example:

[root@BSDRP3]~# netstat -i -h -w 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
      370k     0     0        38M       370k     0        38M     0
      369k     0     0        38M       368k     0        38M     0
      370k     0     0        38M       370k     0        38M     0
      373k     0     0        38M       376k     0        38M     0
      370k     0     0        38M       368k     0        38M     0
      368k     0     0        38M       368k     0        38M     0
      368k     0     0        38M       369k     0        38M     0

⇒ This system is forwarding 370Kpps (in and out) without any in/out errs (the packet generator used netblast with 64B packet size at 370Kpps).

Here is a second example:

[root@BSDRP3]~# netstat -ihw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
      399k  915k     0        25M       395k     0        24M     0
      398k  914k     0        24M       398k     0        24M     0
      399k  915k     0        25M       399k     0        25M     0
      398k  915k     0        24M       397k     0        24M     0
      399k  914k     0        25M       398k     0        24M     0
      398k  914k     0        24M       400k     0        25M     0
      398k  915k     0        24M       396k     0        24M     0
      400k  915k     0        25M       401k     0        25M     0
      397k  914k     0        24M       397k     0        24M     0
      398k  914k     0        24M       399k     0        25M     0
      400k  914k     0        25M       401k     0        25M     0
      398k  914k     0        24M       397k     0        24M     0

⇒ This system is forwarding about 400Kpps (in and out), but it's overloaded because it drops (errs) about 914Kpps (the generator used netmap pkt-gen with 64B packet size at a rate of 1.34Mpps).

Interrupt usage

Report the number of interrupts taken by each device since system startup.

Here is a first example:

[root@BSDRP3]~# vmstat -i
interrupt                          total       rate
irq4: uart0                         6670          5
irq14: ata0                            5          0
irq16: bge0                           27          0
irq17: em0 bge1                  5209668       4510
cpu0:timer                       1299291       1124
irq256: ahci0                       1172          1
Total                            6516833       5642

⇒ Notice that em0 and bge1 share the same IRQ. That's not good news.

Here is a second example:

[root@BSDRP3]# vmstat -i
interrupt                          total       rate
irq4: uart0                        17869          0
irq14: ata0                            5          0
irq16: bge0                            1          0
irq17: em0 bge1                        2          0
cpu0:timer                     214331752       1125
irq256: ahci0                       1725          0
Total                          214351354       1126

⇒ Almost-zero rates and counters for the NIC IRQs mean polling is enabled. Note that the IRQ management of current NICs avoids the need for polling.

Memory Buffer

Show statistics recorded by the memory management routines. The network stack manages a private pool of memory buffers.

[root@BSDRP3]~# netstat -m
5220/810/6030 mbufs in use (current/cache/total)
5219/675/5894/512000 mbuf clusters in use (current/cache/total/max)
5219/669 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0/256000 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/128000 9k jumbo clusters in use (current/cache/total/max)
0/0/0/64000 16k jumbo clusters in use (current/cache/total/max)
11743K/1552K/13295K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

Or more verbose:

[root@BSDRP3]~# vmstat -z | head -1 ; vmstat -z | grep -i mbuf
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_packet:            256,      0,    5221,     667,414103198,   0,   0
mbuf:                   256,      0,       1,     141,     135,   0,   0
mbuf_cluster:          2048, 512000,    5888,       6,    5888,   0,   0
mbuf_jumbo_page:       4096, 256000,       0,       0,       0,   0,   0
mbuf_jumbo_9k:         9216, 128000,       0,       0,       0,   0,   0
mbuf_jumbo_16k:       16384,  64000,       0,       0,       0,   0,   0
mbuf_ext_refcnt:          4,      0,       0,       0,       0,   0,   0

⇒ No failures (FAIL column) or denied requests here.
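If the FAIL column or the “denied” counters were non-zero, the mbuf cluster pool would be too small for the load. A sketch of raising the limit (the value is only an example; size it to your RAM and expected traffic):

```shell
# Raise the mbuf cluster limit (example value), persistently and immediately
echo "kern.ipc.nmbclusters=1000000" >> /etc/sysctl.conf
sysctl kern.ipc.nmbclusters=1000000
```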

CPU / NIC

top can give very useful information regarding CPU/NIC affinity:

[root@BSDRP]/# top -nCHSIzs1
last pid:  1717;  load averages:  7.39,  2.01,  0.78  up 0+00:18:58    21:51:08
148 processes: 18 running, 85 sleeping, 45 waiting

Mem: 13M Active, 9476K Inact, 641M Wired, 128K Cache, 9560K Buf, 7237M Free
Swap:


  PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
   11 root       -92    -     0K   864K CPU2    2   0:01  98.39% intr{irq259: igb0:que}
   11 root       -92    -     0K   864K CPU5    5   0:38  97.07% intr{irq262: igb0:que}
   11 root       -92    -     0K   864K WAIT    7   0:38  96.68% intr{irq264: igb0:que}
   11 root       -92    -     0K   864K WAIT    3   0:39  96.58% intr{irq260: igb0:que}
   11 root       -92    -     0K   864K CPU6    6   0:38  96.48% intr{irq263: igb0:que}
   11 root       -92    -     0K   864K WAIT    4   0:38  96.00% intr{irq261: igb0:que}
   11 root       -92    -     0K   864K RUN     0   0:40  95.56% intr{irq257: igb0:que}
   11 root       -92    -     0K   864K WAIT    1   0:37  95.17% intr{irq258: igb0:que}
   11 root       -92    -     0K   864K WAIT    1   0:01   0.98% intr{irq276: igb2:que}
   11 root       -92    -     0K   864K RUN     3   0:00   0.88% intr{irq278: igb2:que}
   11 root       -92    -     0K   864K WAIT    0   0:01   0.78% intr{irq275: igb2:que}
   11 root       -92    -     0K   864K WAIT    4   0:00   0.78% intr{irq279: igb2:que}
   11 root       -92    -     0K   864K RUN     7   0:00   0.59% intr{irq282: igb2:que}
   11 root       -92    -     0K   864K RUN     6   0:00   0.59% intr{irq281: igb2:que}
   11 root       -92    -     0K   864K RUN     5   0:00   0.29% intr{irq280: igb2:que}

Drivers

Depending on the NIC driver used, some counters are available:

[root@BSDRP3]~# sysctl dev.em.0.mac_stats. | grep -v ': 0'
dev.em.0.mac_stats.missed_packets: 221189883
dev.em.0.mac_stats.recv_no_buff: 94987654
dev.em.0.mac_stats.total_pkts_recvd: 351270928
dev.em.0.mac_stats.good_pkts_recvd: 130081045
dev.em.0.mac_stats.bcast_pkts_recvd: 1
dev.em.0.mac_stats.rx_frames_64: 2
dev.em.0.mac_stats.rx_frames_65_127: 130081043
dev.em.0.mac_stats.good_octets_recvd: 14308901524
dev.em.0.mac_stats.good_octets_txd: 892
dev.em.0.mac_stats.total_pkts_txd: 10
dev.em.0.mac_stats.good_pkts_txd: 10
dev.em.0.mac_stats.bcast_pkts_txd: 2
dev.em.0.mac_stats.mcast_pkts_txd: 5
dev.em.0.mac_stats.tx_frames_64: 2
dev.em.0.mac_stats.tx_frames_65_127: 8

⇒ Notice the high values of missed_packets and recv_no_buff: this indicates a performance problem in the NIC or its driver (in this example, the packet generator sent packets at a rate of about 1.38Mpps).
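To see whether these counters are still increasing under the current load (rather than being leftovers from an old overload), the sysctl can simply be polled in a loop. A sketch (em0 is assumed; use your own driver's counter name):

```shell
# Print the missed-packets counter every second; a growing value
# means the NIC is dropping packets right now.
while true; do
    sysctl -n dev.em.0.mac_stats.missed_packets
    sleep 1
done
```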

documentation/technical_docs/performance.txt · Last modified: 2014/03/27 11:30 by olivier