Forwarding performance lab of an IBM System x3550 M3 with 10-Gigabit Intel 82599EB
Forwarding performance lab of a quad-core Xeon 2.13GHz with a dual-port Intel 82599EB 10-Gigabit
Bench lab
Hardware detail
This lab tests an IBM System x3550 M3 with a quad-core Intel Xeon L5630 at 2.13GHz (hyper-threading disabled), a dual-port Intel 82599EB 10-Gigabit NIC and optical SFP+ modules (SFP-10G-LR).
NIC details:
ix0@pci0:21:0:0: class=0x020000 card=0x00038086 chip=0x10fb8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82599EB 10-Gigabit SFI/SFP+ Network Connection'
    class      = network
    subclass   = ethernet
    bar   [10] = type Prefetchable Memory, range 64, base 0xfbe80000, size 524288, enabled
    bar   [18] = type I/O Port, range 32, base 0x2020, size 32, enabled
    bar   [20] = type Prefetchable Memory, range 64, base 0xfbf04000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 64 messages, enabled
                 Table in map 0x20[0x0], PBA in map 0x20[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR
                 link x8(x8) speed 5.0(5.0) ASPM disabled(L0s)
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 90e2baffff842038
    ecap 000e[150] = ARI 1
    ecap 0010[160] = SR-IOV 1 IOV disabled, Memory Space disabled, ARI disabled
                     0 VFs configured out of 64 supported
                     First VF RID Offset 0x0180, VF RID Stride 0x0002
                     VF Device ID 0x10ed
                     Page Sizes: 4096 (enabled), 8192, 65536, 262144, 1048576, 4194304
Lab set-up
The lab is detailed here: Setting up a forwarding performance benchmark lab.
Diagram
+------------------------------------------+ +-------+ +------------------------------+
| Device under test                        | |Juniper| | Packet generator & receiver  |
|                                          | |  QFX  | |                              |
| ix0: 198.18.0.1/24                       |=|   <   |=| vcxl0: 198.18.0.110/24       |
|      2001:2::1/64                        | |       | |        2001:2::110/64        |
|      (90:e2:ba:84:20:38)                 | |       | |        (00:07:43:2e:e4:72)   |
|                                          | |       | |                              |
| ix1: 198.19.0.1/24                       |=|   >   |=| vcxl1: 198.19.0.110/24       |
|      2001:2:0:8000::1/64                 | |       | |        2001:2:0:8000::110/64 |
|      (90:e2:ba:84:20:39)                 | +-------+ |        (00:07:43:2e:e4:7a)   |
|                                          |           |                              |
| static routes                            |           |                              |
| 198.18.0.0/16 => 198.18.0.110            |           |                              |
| 198.19.0.0/16 => 198.19.0.110            |           |                              |
| 2001:2::/49 => 2001:2::110               |           |                              |
| 2001:2:0:8000::/49 => 2001:2:0:8000::110 |           |                              |
|                                          |           |                              |
| static arp and ndp                       |           | /boot/loader.conf:           |
| 198.18.0.110 => 00:07:43:2e:e4:72        |           | hw.cxgbe.num_vis=2           |
| 2001:2::110                              |           |                              |
|                                          |           |                              |
| 198.19.0.110 => 00:07:43:2e:e4:7a        |           |                              |
| 2001:2:0:8000::110                       |           |                              |
+------------------------------------------+           +------------------------------+
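On the generator side, the extra Chelsio virtual interfaces (vcxl0/vcxl1) come from the hw.cxgbe.num_vis tunable shown in the diagram; a minimal way to set it on the generator (a reboot is required afterwards):
echo 'hw.cxgbe.num_vis=2' >> /boot/loader.conf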
The generator MUST generate lots of minimum-size packets spread across many IP flows (multiple source/destination IP addresses and/or UDP source/destination ports).
Here is an example for generating 2000 IPv4 flows (100 destination IP addresses * 20 source IP addresses) with a Chelsio NIC:
pkt-gen -i vcxl0 -f tx -n 1000000000 -l 60 -d 198.19.10.1:2000-198.19.10.100 -D 90:e2:ba:84:20:38 -s 198.18.10.1:2000-198.18.10.20 -w 4 -p 2
And the same with IPv6 flows (minimum frame size of 62 here):
pkt-gen -f tx -i vcxl0 -n 1000000000 -l 62 -6 -d "[2001:2:0:8010::1]-[2001:2:0:8010::64]" -D 90:e2:ba:84:20:38 -s "[2001:2:0:10::1]-[2001:2:0:10::14]" -S 00:07:43:2e:e4:72 -w 4 -p 2
The receiver will use this command:
pkt-gen -i vcxl1 -f rx -w 4
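Since the generator and receiver interfaces live on the same host here, a small helper script (hypothetical, simply reusing the exact pkt-gen commands above) can run both:
#!/bin/sh
# Hypothetical helper: start the receiver in the background, then the IPv4 generator.
pkt-gen -i vcxl1 -f rx -w 4 &
RX_PID=$!
pkt-gen -i vcxl0 -f tx -n 1000000000 -l 60 -d 198.19.10.1:2000-198.19.10.100 -D 90:e2:ba:84:20:38 -s 198.18.10.1:2000-198.18.10.20 -w 4 -p 2
# Once generation ends, stop the receiver to get its final report.
kill $RX_PID
wait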
Basic configuration
Disabling Ethernet flow-control
First, disable Ethernet flow-control on both servers:
echo "dev.ix.0.fc=0" >> /etc/sysctl.conf echo "dev.ix.1.fc=0" >> /etc/sysctl.conf
Enabling unsupported SFP
Because we are using a non-Intel SFP:
mount -uw /
echo 'hw.ix.unsupported_sfp="1"' >> /boot/loader.conf.local
mount -ur /
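After the next reboot, a quick sanity check (not part of the original lab) that the tunable was picked up:
kenv hw.ix.unsupported_sfp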
Disabling LRO and TSO
A router should not use LRO and TSO. BSDRP disables them by default using an RC script (disablelrotso_enable="YES" in /etc/rc.conf.misc).
But on a stock FreeBSD, disable them manually:
ifconfig ix0 -tso4 -tso6 -lro
ifconfig ix1 -tso4 -tso6 -lro
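To verify, check that TSO4, TSO6 and LRO no longer appear in the interface options line:
ifconfig ix0 | grep options
ifconfig ix1 | grep options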
IP configuration on DUT
# IPv4 router
gateway_enable="YES"
static_routes="generator receiver"
route_generator="-net 198.18.0.0/16 198.18.0.110"
route_receiver="-net 198.19.0.0/16 198.19.0.110"
ifconfig_ix0="inet 198.18.0.1/24 -tso4 -tso6 -lro"
ifconfig_ix1="inet 198.19.0.1/24 -tso4 -tso6 -lro"
static_arp_pairs="HPvcxl0 HPvcxl1"
static_arp_HPvcxl0="198.18.0.110 00:07:43:2e:e4:72"
static_arp_HPvcxl1="198.19.0.110 00:07:43:2e:e4:7a"
# IPv6 router
ipv6_gateway_enable="YES"
ipv6_activate_all_interfaces="YES"
ipv6_static_routes="generator receiver"
ipv6_route_generator="2001:2:: -prefixlen 49 2001:2::110"
ipv6_route_receiver="2001:2:0:8000:: -prefixlen 49 2001:2:0:8000::110"
ifconfig_ix0_ipv6="inet6 2001:2::1 prefixlen 64"
ifconfig_ix1_ipv6="inet6 2001:2:0:8000::1 prefixlen 64"
static_ndp_pairs="HPvcxl0 HPvcxl1"
static_ndp_HPvcxl0="2001:2::110 00:07:43:2e:e4:72"
static_ndp_HPvcxl1="2001:2:0:8000::110 00:07:43:2e:e4:7a"
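To apply this configuration without a full reboot, something like the following should work (the rc.d scripts are assumed to be present; a reboot remains the cleanest option):
service netif restart ix0
service netif restart ix1
service routing restart
service static_arp restart
service static_ndp restart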
Routing performance with default BSDRP values
Default fast-forwarding performance in front of a line-rate generator
Behaviour in front of a multi-flow traffic generator at line rate, 14.8Mpps (thanks Chelsio!): netstat on the DUT should report the input rate, but it is impossible to enter any command on the DUT during the load because all 4 cores are overloaded.
On the receiver, only 2.8Mpps are received (and therefore forwarded):
242.851700 main_thread [2277] 2870783 pps (2873654 pkts 1379353920 bps in 1001000 usec) 17.79 avg_batch 3584 min_space
243.853699 main_thread [2277] 2869423 pps (2875162 pkts 1380077760 bps in 1002000 usec) 17.77 avg_batch 1956 min_space
244.854700 main_thread [2277] 2870532 pps (2873403 pkts 1379233440 bps in 1001000 usec) 17.78 avg_batch 2022 min_space
245.855699 main_thread [2277] 2872424 pps (2875296 pkts 1380142080 bps in 1001000 usec) 17.79 avg_batch 1949 min_space
246.856699 main_thread [2277] 2871882 pps (2874754 pkts 1379881920 bps in 1001000 usec) 17.79 avg_batch 3584 min_space
247.857699 main_thread [2277] 2871047 pps (2873918 pkts 1379480640 bps in 1001000 usec) 17.78 avg_batch 3584 min_space
248.858700 main_thread [2277] 2870945 pps (2873816 pkts 1379431680 bps in 1001000 usec) 17.79 avg_batch 1792 min_space
249.859699 main_thread [2277] 2870647 pps (2873518 pkts 1379288640 bps in 1001000 usec) 17.78 avg_batch 1959 min_space
250.860699 main_thread [2277] 2870222 pps (2873092 pkts 1379084160 bps in 1001000 usec) 17.78 avg_batch 1956 min_space
251.861699 main_thread [2277] 2870311 pps (2873178 pkts 1379125440 bps in 1000999 usec) 17.78 avg_batch 1792 min_space
252.862699 main_thread [2277] 2870795 pps (2873669 pkts 1379361120 bps in 1001001 usec) 17.78 avg_batch 2024 min_space
The traffic is correctly load-balanced across all queues:
[root@DUT]~# sysctl dev.ix.0. | grep rx_packet
dev.ix.0.queue3.rx_packets: 143762837
dev.ix.0.queue2.rx_packets: 141867655
dev.ix.0.queue1.rx_packets: 140704642
dev.ix.0.queue0.rx_packets: 139301732
[root@R1]~# sysctl dev.ix.1. | grep tx_packet
dev.ix.1.queue3.tx_packets: 143762837
dev.ix.1.queue2.tx_packets: 141867655
dev.ix.1.queue1.tx_packets: 140704643
dev.ix.1.queue0.tx_packets: 139301734
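To double-check the totals, the per-queue counters can be summed with a small one-liner (a sketch using the same sysctl OIDs as above):
sysctl -n dev.ix.0.queue0.rx_packets dev.ix.0.queue1.rx_packets dev.ix.0.queue2.rx_packets dev.ix.0.queue3.rx_packets | awk '{ total += $1 } END { print "ix0 total rx_packets:", total }'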
Where does the system spend its time?
[root@DUT]~# kldload hwpmc
[root@DUT]~# pmcstat -TS instructions -w1
PMC: [INSTR_RETIRED_ANY] Samples: 99530 (100.0%) , 0 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
  6.5 kernel     ixgbe_rxeof          ixgbe_msix_que
  5.8 kernel     bzero                m_pkthdr_init:1.7 ip_tryforward:1.5 ip_findroute:1.4 fib4_lookup_nh_basic:1.3
  5.1 kernel     ixgbe_xmit           ixgbe_mq_start_locked
  3.7 kernel     ixgbe_mq_start       ether_output
  2.8 kernel     rn_match             fib4_lookup_nh_basic
  2.7 kernel     _rw_runlock_cookie   fib4_lookup_nh_basic:2.0 arpresolve:0.8
  2.7 kernel     ip_tryforward        ip_input
  2.6 kernel     ether_nh_input       netisr_dispatch_src
  2.6 kernel     bounce_bus_dmamap_lo bus_dmamap_load_mbuf_sg
  2.6 kernel     uma_zalloc_arg       ixgbe_rxeof
  2.6 kernel     _rm_rlock            in_localip
  2.4 kernel     netisr_dispatch_src  ether_demux:1.7 ether_input:0.7
  2.3 libc.so.7  bsearch              0x63ac
  2.2 kernel     ether_output         ip_tryforward
  2.2 kernel     ip_input             netisr_dispatch_src
  2.1 kernel     uma_zfree_arg        m_freem
  2.1 kernel     fib4_lookup_nh_basic ip_findroute
  1.9 kernel     __rw_rlock           arpresolve:1.3 fib4_lookup_nh_basic:0.6
  1.8 kernel     bus_dmamap_load_mbuf ixgbe_xmit
  1.7 kernel     bcopy                arpresolve
  1.6 kernel     m_adj                ether_demux
  1.5 kernel     memcpy               ether_output
  1.5 kernel     arpresolve           ether_output
  1.5 kernel     _mtx_trylock_flags_  ixgbe_mq_start
  1.4 kernel     mb_ctor_mbuf         uma_zalloc_arg
  1.4 kernel     ixgbe_txeof          ixgbe_msix_que
  1.3 pmcstat    0x63f0               bsearch
  1.1 pmcstat    0x63e3               bsearch
  1.0 kernel     ixgbe_refresh_mbufs  ixgbe_rxeof
  1.0 kernel     in_localip           ip_tryforward
  1.0 kernel     random_harvest_queue ether_nh_input
  0.9 kernel     bcmp                 ether_nh_input
  0.8 kernel     cpu_search_lowest    cpu_search_lowest
  0.8 kernel     mac_ifnet_check_tran ether_output
  0.8 kernel     critical_exit        uma_zalloc_arg
  0.7 kernel     critical_enter
  0.7 kernel     ixgbe_mq_start_locke ixgbe_mq_start
  0.6 kernel     mac_ifnet_create_mbu ether_nh_input
  0.6 kernel     ether_demux          ether_nh_input
  0.5 kernel     m_freem              ixgbe_txeof
  0.5 kernel     ipsec4_capability    ip_input
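To dig deeper than this top-level profile, pmcstat can also record samples to a file and post-process them into a callgraph (standard pmcstat usage, not specific to this lab; the file names are arbitrary):
# record ~30 seconds of instruction samples, then build a callgraph
pmcstat -S instructions -O /tmp/samples.out sleep 30
pmcstat -R /tmp/samples.out -G /tmp/callgraph.txt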
⇒ Most of the time is spent in ixgbe_rxeof.
Equilibrium throughput
The previous methodology, generating 14.8Mpps, amounts to testing the DUT under a denial-of-service load. Let's try another methodology known as equilibrium throughput.
From the packet generator, start an estimation of the equilibrium throughput with a 4Mpps link rate:
[root@pkt-gen]~# equilibrium -d 90:e2:ba:84:20:38 -p -l 4000 -t vcxl0 -r vcxl1
Benchmark tool using equilibrium throughput method
- Benchmark mode: Throughput (pps) for Router
- UDP load = 18B, IPv4 packet size=46B, Ethernet frame size=60B
- Link rate = 4000 Kpps
- Tolerance = 0.01
Iteration 1
  - Offering load = 2000 Kpps
  - Step = 1000 Kpps
  - Measured forwarding rate = 1999 Kpps
Iteration 2
  - Offering load = 3000 Kpps
  - Step = 1000 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2949 Kpps
Iteration 3
  - Offering load = 2500 Kpps
  - Step = 500 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 2499 Kpps
Iteration 4
  - Offering load = 2750 Kpps
  - Step = 250 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2750 Kpps
Iteration 5
  - Offering load = 2875 Kpps
  - Step = 125 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2875 Kpps
Iteration 6
  - Offering load = 2937 Kpps
  - Step = 62 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2933 Kpps
Iteration 7
  - Offering load = 2968 Kpps
  - Step = 31 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2948 Kpps
Estimated Equilibrium Ethernet throughput = 2948 Kpps (maximum value seen: 2949 Kpps)
⇒ Same result with the equilibrium method: about 2.9Mpps.
Firewall impact
One rule for each firewall and 2000 UDP "sessions"; more details in the GigaEthernet performance lab.
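As a reminder of the kind of configuration used there, here is a minimal sketch for the ipfw case, assuming a single stateless accept-all rule (the exact rule sets are described in the GigaEthernet performance lab):
sysrc firewall_enable=YES
sysrc firewall_script="/etc/ipfw.rules"
cat > /etc/ipfw.rules <<'EOF'
#!/bin/sh
# Single-rule ipfw configuration (sketch): flush and accept everything
/sbin/ipfw -f flush
/sbin/ipfw add 3000 allow ip from any to any
EOF
service ipfw start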
Routing performance with multiple static routes
FreeBSD has had some route-lookup contention problems, and this setup uses only one static route (198.19.0.0/16) toward the traffic receiver.
By splitting this single route into 4 or 8 routes, we should obtain better results.
This bench method uses 100 different destination IP addresses, from 198.19.10.1 to 198.19.10.100. Redo the bench first with 4 static routes:
- 198.19.10.0/27 (0 to 31)
- 198.19.10.32/27 (32 to 63)
- 198.19.10.64/27 (64 to 95)
- 198.19.10.96/27 (96 to 127)
then with 8 routes:
- 198.19.10.0/28
- 198.19.10.16/28
- 198.19.10.32/28
- 198.19.10.48/28
- 198.19.10.64/28
- 198.19.10.80/28
- 198.19.10.96/28
- 198.19.10.112/28
Configuration examples:
sysrc static_routes="generator receiver1 receiver2 receiver3 receiver4"
sysrc route_generator="-net 198.18.0.0/16 198.18.2.2"
sysrc -x route_receiver
sysrc route_receiver1="-net 198.19.10.0/27 198.19.2.2"
sysrc route_receiver2="-net 198.19.10.32/27 198.19.2.2"
sysrc route_receiver3="-net 198.19.10.64/27 198.19.2.2"
sysrc route_receiver4="-net 198.19.10.96/27 198.19.2.2"
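And, as a sketch, the equivalent for the 8-route case (same next hop 198.19.2.2 assumed as in the 4-route example above):
sysrc static_routes="generator receiver1 receiver2 receiver3 receiver4 receiver5 receiver6 receiver7 receiver8"
sysrc route_receiver1="-net 198.19.10.0/28 198.19.2.2"
sysrc route_receiver2="-net 198.19.10.16/28 198.19.2.2"
sysrc route_receiver3="-net 198.19.10.32/28 198.19.2.2"
sysrc route_receiver4="-net 198.19.10.48/28 198.19.2.2"
sysrc route_receiver5="-net 198.19.10.64/28 198.19.2.2"
sysrc route_receiver6="-net 198.19.10.80/28 198.19.2.2"
sysrc route_receiver7="-net 198.19.10.96/28 198.19.2.2"
sysrc route_receiver8="-net 198.19.10.112/28 198.19.2.2"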
Graphs
Ministat
x pps.one-route
+ pps.four-routes
* pps.eight-routes
[ministat ASCII distribution plot not reproduced: eight-route samples shift slightly to the right]
    N           Min           Max        Median           Avg        Stddev
x   5       1639912       1984116       1704942     1769454.2     154632.18
+   5       1737180       2194547       1785163     1927662.8     229585.57
No difference proven at 95.0% confidence
*   5       1782648       2273933       1785695     1978219.6     266278.12
No difference proven at 95.0% confidence
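These ministat results can be regenerated from the raw sample files (assuming five pps measurements per file, with files named as in the legend above):
ministat pps.one-route pps.four-routes pps.eight-routes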