Forwarding performance lab of an IBM System x3550 M3 with Intel 82580

Hardware detail

This lab tests an IBM System x3550 M3 with a quad-core Intel Xeon L5630 (2.13GHz, hyper-threading disabled) and a quad-port Intel 82580 NIC connected to the PCI-Express bus.
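
If needed, the CPU and NIC details can be checked from the DUT itself with standard FreeBSD tools; a minimal sketch (not part of the original lab notes):

# CPU model and number of cores seen by the kernel
sysctl hw.model hw.ncpu
# PCI details of the igb(4) ports (the quad-port 82580)
pciconf -lv | grep -A 3 '^igb'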

Lab set-up

The lab is detailed here: Setting up a forwarding performance benchmark lab.

BSDRP-amd64 v1.51 (FreeBSD 10.0-BETA2 with the autotune mbuf patch) is used on the DUT.

Diagram

+-------------------+      +----------------------------------------+      +-------------------+
| Packet generator  |      |            Device under Test           |      |  Packet receiver  |
|   igb2: 1.1.1.1   |      | igb2: 1.1.1.2            igb3: 2.2.2.2 |      |   igb3: 2.2.2.3   |
| 00:1b:21:d4:3f:2a |      | 00:1b:21:d3:8f:3e    00:1b:21:d3:8f:3f |      | 00:1b:21:c4:95:7b |
|                   |=====>|                                        |=====>|                   |
|  static routes    |      |              static routes             |      |   static route    |
| default=> 1.1.1.2 |      |         1.0.0.0/8 => 1.1.1.1           |      | default=> 2.2.2.2 |
|                   |      |         2.0.0.0/8 => 2.2.2.3           |      |                   |
+-------------------+      +----------------------------------------+      +-------------------+

The generator MUST generate lots of smallest-size-packet IP flows (multiple source/destination IP addresses and/or UDP source/destination ports).

Here is an example for generating about 2000 flows:

pkt-gen -i ix0 -f tx -n 1000000000 -l 60 -d 2.3.3.2:2000-2.3.3.128 -D 00:1b:21:d3:8f:3e -s 1.3.3.3:2000-1.3.3.10 -w 4
You need to use FreeBSD -head at svn revision 257758 or later, which includes the pkt-gen checksum patch, to use multiple src/dst IP addresses or ports with netmap's pkt-gen.

The receiver will use this command:

pkt-gen -i igb3 -f rx -w 10

Basic configuration

Disabling Ethernet flow-control

First, disable Ethernet flow-control:

echo "dev.igb.2.fc=0" >> /etc/sysctl.conf
echo "dev.igb.3.fc=0" >> /etc/sysctl.conf
sysctl dev.igb.2.fc=0
sysctl dev.igb.3.fc=0
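
The values can be read back to confirm that flow-control is now disabled on both ports (a quick check, not from the original notes):

sysctl dev.igb.2.fc dev.igb.3.fc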

IP configuration

Configure IP addresses, static routes and static ARP entries.

A router should not use LRO and TSO. BSDRP disables them by default using an RC script (disablelrotso_enable="YES" in /etc/rc.conf.misc).

sysrc static_routes="generator receiver"
sysrc route_generator="-net 1.0.0.0/8 1.1.1.1"
sysrc route_receiver="-net 2.0.0.0/8 2.2.2.3"
sysrc ifconfig_igb2="inet 1.1.1.2/24 -tso4 -tso6 -lro"
sysrc ifconfig_igb3="inet 2.2.2.2/24 -tso4 -tso6 -lro"
sysrc static_arp_pairs="receiver generator"
sysrc static_arp_generator="1.1.1.1 00:1b:21:d4:3f:2a"
sysrc static_arp_receiver="2.2.2.3 00:1b:21:c4:95:7b"
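
These settings take effect at the next boot of the DUT. After rebooting, the addresses, routes and static ARP entries can be verified with standard tools; a minimal sketch (not part of the original lab notes):

# addresses and disabled offloads on both ports
ifconfig igb2
ifconfig igb3
# static routes towards the generator and receiver networks
netstat -rn -f inet
# permanent ARP entries for the generator and the receiver
arp -an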

Default forwarding speed

With the default parameters, multi-flow traffic at 1.488Mpps (the maximum packet rate for Gigabit Ethernet) is correctly forwarded without any loss:

[root@BSDRP]~# netstat -iw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
   1511778     0     0   91524540    1508774     0   54260514     0
   1437061     0     0   87010506    1433981     0   51628556     0
   1492392     0     0   90363066    1489107     0   53551190     0
   1435098     0     0   86919666    1432911     0   51403100     0
   1486627     0     0   90015126    1483984     0   53323802     0
   1435217     0     0   86898126    1432679     0   51498188     0
   1486694     0     0   90017226    1483725     0   53324978     0
   1488248     0     0   90106326    1485331     0   53360930     0
   1437796     0     0   87084606    1435594     0   51504950     0

The traffic is correctly load-balanced across the NIC queues, each bound to a different CPU:

[root@BSDRP]# vmstat -i | grep igb
irq278: igb2:que 0            2759545998      95334
irq279: igb2:que 1            2587966938      89406
irq280: igb2:que 2            2589102074      89445
irq281: igb2:que 3            2598239184      89761
irq282: igb2:link                      2          0
irq283: igb3:que 0            3318777087     114654
irq284: igb3:que 1            3098055250     107028
irq285: igb3:que 2            3101570541     107150
irq286: igb3:que 3            3052431966     105452
irq287: igb3:link                      2          0
 
[root@BSDRP]/# top -nCHSIzs1
last pid:  8292;  load averages:  5.38,  1.70,  0.65  up 0+10:38:54    13:08:33
153 processes: 12 running, 97 sleeping, 44 waiting
 
Mem: 2212K Active, 24M Inact, 244M Wired, 18M Buf, 15G Free
Swap: 
 
 
  PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
   11 root     -92    -     0K   816K WAIT    0 218:26  85.25% intr{irq278: igb2:que}
   11 root     -92    -     0K   816K CPU1    1 296:18  84.77% intr{irq279: igb2:que}
   11 root     -92    -     0K   816K RUN     2 298:15  84.67% intr{irq280: igb2:que}
   11 root     -92    -     0K   816K CPU3    3 294:53  84.18% intr{irq281: igb2:que}
   11 root     -92    -     0K   816K RUN     3  67:46  16.46% intr{irq286: igb3:que}
   11 root     -92    -     0K   816K RUN     2  70:27  16.06% intr{irq285: igb3:que}
   11 root     -92    -     0K   816K RUN     1  68:36  15.97% intr{irq284: igb3:que}
   11 root     -92    -     0K   816K CPU0    0  59:39  15.28% intr{irq283: igb3:que}

igb(4) driver tuning with the 82580

Disabling multi-queue

To disable multi-queue (which means no IRQ load-sharing between CPUs), there are two methods:

The first method is to use pkt-gen to generate traffic with a single IP flow (same src/dst IP and same src/dst port), like this:

pkt-gen -i igb2 -f tx -n 80000000 -l 42 -d 2.3.3.2 -D 00:1b:21:d3:8f:3e -s 1.3.3.3 -w 10

⇒ With this method, igb(4) cannot load-balance the input traffic and will use only one queue.

The second method is to disable the multi-queue support of the igb(4) driver by forcing the use of a single queue:

mount -uw /
echo 'hw.igb.num_queues="1"' >> /boot/loader.conf.local
mount -ur /
reboot

Then check the dmesg output, or the number of IRQs assigned to the NIC, to confirm that multi-queue is disabled:

[root@BSDRP]~# grep 'igb[2-3]' /var/run/dmesg.boot 
igb2: <Intel(R) PRO/1000 Network Connection version - 2.4.0> mem 0x97a80000-0x97afffff,0x97c04000-0x97c07fff irq 39 at device 0.2 on pci26
igb2: Using MSIX interrupts with 2 vectors
igb2: Ethernet address: 00:1b:21:d3:8f:3e
001.000011 netmap_attach [2244] success for igb2
igb3: <Intel(R) PRO/1000 Network Connection version - 2.4.0> mem 0x97a00000-0x97a7ffff,0x97c00000-0x97c03fff irq 38 at device 0.3 on pci26
igb3: Using MSIX interrupts with 2 vectors
igb3: Ethernet address: 00:1b:21:d3:8f:3f
001.000012 netmap_attach [2244] success for igb3
 
[root@BSDRP]~# vmstat -i | grep igb
irq272: igb2:que 0                     8          0
irq273: igb2:link                      2          0
irq274: igb3:que 0              48517905      74757
irq275: igb3:link                      2          0

Using either of these methods, the result is the same: the forwarding speed decreases to about 700Kpps, which corresponds to the maximum input rate (100% CPU usage of the single CPU bound to the input NIC IRQ).

[root@BSDRP]~# netstat -iw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
    690541 797962     0   41432466     690541     0   29002910     0
    704171 797906     0   42250266     704174     0   29575322     0
    676522 797770     0   40591326     676519     0   28414022     0
    707373 797878     0   42442386     707376     0   29709848     0
    672983 797962     0   40378986     672981     0   28265426     0
    705339 797899     0   42320346     705336     0   29624378     0
    684930 798049     0   41095806     684934     0   28767200     0
[root@bsdrp2]~# top -nCHSIzs1
last pid:  2930;  load averages:  1.44,  0.94,  0.50  up 0+00:08:21    13:29:11
129 processes: 8 running, 84 sleeping, 37 waiting
 
Mem: 13M Active, 8856K Inact, 203M Wired, 9748K Buf, 15G Free
Swap: 
 
 
  PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME     CPU COMMAND
   11 root     -92    -     0K   624K CPU2    2   0:00 100.00% intr{irq272: igb2:que}
   11 root     -92    -     0K   624K CPU0    0   0:54  24.37% intr{irq274: igb3:que}
    0 root     -92    0     0K   368K CPU1    1   0:17   5.66% kernel{igb3 que}

hw.igb.rx_process_limit and hw.igb.txd/rxd

What is the impact of modifying the hw.igb.rx_process_limit and hw.igb.txd/rxd tunables on igb(4) performance?

We need to overload the NIC for this test, which means using it without multi-queue.
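
These are igb(4) loader tunables, so each tested combination requires editing /boot/loader.conf.local and rebooting. The original notes do not show the exact commands; here is a minimal sketch for one test point (illustrative values), following the BSDRP read-only-root convention used above:

mount -uw /
# append the tunables for this run (clean them up between runs)
cat >> /boot/loader.conf.local <<'EOF'
hw.igb.num_queues="1"
hw.igb.rxd="1024"
hw.igb.txd="1024"
hw.igb.rx_process_limit="100"
EOF
mount -ur /
reboot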

Graphical results

Here are the results of one-flow packet-per-second performance with different values:
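
The data sets compared below are plain files of packets-per-second samples, one file per txd/rxd and rx_process_limit combination. A possible way to collect them and feed them to ministat(1), assuming the netstat -iw output format shown earlier (the file names match the labels used in the graphs):

# collect 5 one-second samples of forwarded packets-per-second into a
# file named after the tested combination (skip the two header lines)
netstat -iw 1 -q 5 | awk 'NR>2 {print $1}' > xd1024.proc_lim100
# once every combination has its own file, compare them
ministat xd1024.proc_lim-1 xd1024.proc_lim100 xd1024.proc_lim500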

Ministat graphs

txd/rxd fixed at 1024, rx_process_limit variable
x xd1024.proc_lim-1
+ xd1024.proc_lim100
* xd1024.proc_lim500
+----------------------------------------------------------------------------------------------+
|*    +        +    ++   *      +  * *      *                    x            x          x    x|
|                                                                     |_______M__A__________|  |
|        |_________AM_______|                                                                  |
|           |_______________A______M_________|                                                 |
+----------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        550723        566775        558124      559487.2     6199.8641
+   5        517990        532364        526037        525182     5268.8787
Difference at 95.0% confidence
        -34305.2 +/- 8390.76
        -6.13154% +/- 1.49972%
        (Student's t, pooled s = 5753.23)
*   5        515348        539060        534270      530501.6       9292.18
Difference at 95.0% confidence
        -28985.6 +/- 11520
        -5.18074% +/- 2.05903%
        (Student's t, pooled s = 7898.83)
txd/rxd fixed at 2048, rx_process_limit variable
x xd2048.proc_lim-1
+ xd2048.proc_lim100
* xd2048.proc_lim500
+----------------------------------------------------------------------------------------------+
|+**    +*     +*  +      *                                                         x x   x   x|
|                                                                                    |___AM__| |
||______MA_______|                                                                             |
||_______M_A_________|                                                                         |
+----------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        563660        568029        566189      565768.2     1721.5844
+   5        527496        535493        530494      531057.4     3465.0161
Difference at 95.0% confidence
        -34710.8 +/- 3990.14
        -6.13516% +/- 0.70526%
        (Student's t, pooled s = 2735.89)
*   5        527807        538259        530987      531871.8     4288.4826
Difference at 95.0% confidence
        -33896.4 +/- 4765.66
        -5.99122% +/- 0.842335%
        (Student's t, pooled s = 3267.64)
txd/rxd fixed at 4096, rx_process_limit variable
x xd4096.proc_lim-1
+ xd4096.proc_lim100
* xd4096.proc_lim500
+----------------------------------------------------------------------------------------------+
|                *                                                                             |
|                *                                                                             |
|+         +  + +*   **  +                                           x         x         x  x  |
|                                                                         |_________A____M____||
|    |________A_______|                                                                        |
|               |M_A_|                                                                         |
+----------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        555002        565784        564089      562047.8     4646.4517
+   5        522545        533819        528932      528506.4      4088.621
Difference at 95.0% confidence
        -33541.4 +/- 6382.78
        -5.96771% +/- 1.13563%
        (Student's t, pooled s = 4376.43)
*   5        530026        532361        530327        530970     1102.3525
Difference at 95.0% confidence
        -31077.8 +/- 4924.78
        -5.52939% +/- 0.87622%
        (Student's t, pooled s = 3376.74)

Firewall impact

Multi-queue is re-enabled for this test, and the best values from the previous tests are used (one way to set them is sketched after the list):

  • hw.igb.rxd=2048
  • hw.igb.txd=2048
  • hw.igb.rx_process_limit=-1 (disabled)
  • hw.igb.num_queues=0 (automatic: based on the number of CPUs and the maximum supported MSI-X messages, 4 on this lab hardware)
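
A sketch of the corresponding /boot/loader.conf.local on BSDRP (this overwrites the file created earlier, so re-add anything else you still need, then reboot):

mount -uw /
cat > /boot/loader.conf.local <<'EOF'
hw.igb.rxd="2048"
hw.igb.txd="2048"
hw.igb.rx_process_limit="-1"
hw.igb.num_queues="0"
EOF
mount -ur /
reboot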

This test will generate 2000 different flows by using 2000 different UDP destination ports:

pkt-gen -i igb2 -f tx -l 42 -d 2.3.3.1:2000-2.3.3.1:4000 -D 00:1b:21:d3:8f:3e -s 1.3.3.1 -w 10

IPFW

Stateless

Now we will test the impact of enabling simple stateless IPFW rules:

cat > /etc/ipfw.rules <<'EOF'
#!/bin/sh
fwcmd="/sbin/ipfw"
# Flush out the list before we begin.
${fwcmd} -f flush
${fwcmd} add 3000 allow ip from any to any
EOF
 
sysrc firewall_enable="YES"
sysrc firewall_script="/etc/ipfw.rules"
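
To load the ruleset without rebooting and confirm the rule is in place (its packet counters can also be watched here), something like the following can be used (not from the original notes):

service ipfw start
# the single rule and its packet/byte counters
ipfw show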

Stateful

Now we will test the impact of enabling simple stateful IPFW rules:

cat > /etc/ipfw.rules <<'EOF'
#!/bin/sh
fwcmd="/sbin/ipfw"
# Flush out the list before we begin.
${fwcmd} -f flush
${fwcmd} add 3000 allow ip from any to any keep-state
EOF
 
service ipfw restart
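
While traffic is flowing, the number of dynamic (state) entries created by the keep-state rule can be watched, for example:

# list rules including dynamic (state) entries
ipfw -d show
# number of dynamic entries currently installed
sysctl net.inet.ip.fw.dyn_count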

PF

Stateless

Now we will test the impact of enabling simple stateless PF rules:

cat >/etc/pf.conf <<'EOF'
set skip on lo0
pass no state
EOF
 
sysrc pf_enable="YES"
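
To enable PF without rebooting and confirm that the ruleset is loaded, a quick check (not part of the original notes):

service pf start
# currently loaded ruleset
pfctl -sr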

Stateful

Now we will test the impact of enabling simple stateful PF rules:

cat >/etc/pf.conf <<'EOF'
set skip on lo0
pass
EOF
 
sysrc pf_enable="YES"
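
After reloading the ruleset, the growth of the PF state table under load can be observed, for example:

# reload the new ruleset
pfctl -f /etc/pf.conf
# global counters, including the current number of state entries
pfctl -si
# or count the state entries directly
pfctl -ss | wc -l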

Results

Graph

Scale information: 1.488Mpps is the maximum packet-per-second rate for Gigabit Ethernet.

Impact of ipfw and pf on a 4-core Xeon 2.13GHz with an Intel 82580 NIC

ministat

x pps.fastforwarding
+ pps.ipfw-stateless
* pps.ipfw-statefull
% pps.pf-stateless
# pps.pf-statefull
+------------------------------------------------------------------------------------------------+
|%  %%%  %  # #O   # * * #             *     *                                                  *|
|                                                                                               A|
|                                                                                               A|
|               |______M_____A___________|                                                       |
| |__A__|                                                                                        |
|           |__M_A____|                                                                          |
+------------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1488140       1488142       1488142     1488141.2     1.0955343
+   5       1488140       1488142       1488141       1488141    0.70710678
No difference proven at 95.0% confidence
*   5       1299942       1369446       1319923     1331666.2     29281.162
Difference at 95.0% confidence
        -156475 +/- 30196.9
        -10.5148% +/- 2.02917%
        (Student's t, pooled s = 20704.9)
%   5       1267869       1287061       1276235       1277154     6914.3127
Difference at 95.0% confidence
        -210987 +/- 7130.55
        -14.1779% +/- 0.479158%
        (Student's t, pooled s = 4889.16)
#   5       1294193       1324502       1300015     1304887.8     12265.529
Difference at 95.0% confidence
        -183253 +/- 12649.1
        -12.3142% +/- 0.849995%
        (Student's t, pooled s = 8673.04)