====== Forwarding performance lab of an IBM System x3550 M3 with Intel 82580 ======
{{description>Forwarding performance lab of a quad cores Xeon 2.13GHz and quad-port gigabit Intel 82580}}
===== Hardware detail =====
This lab will test an [[IBM System x3550 M3]] with **quad** cores (Intel Xeon L5630 2.13GHz, hyper-threading disabled) and a quad NIC 82580 connected to the PCI-Express Bus.
===== Lab set-up =====
The lab is detailed here: [[documentation:examples:Setting up a forwarding performance benchmark lab]].
BSDRP-amd64 v1.51 (FreeBSD 10.0-BETA2 with autotune mbuf patch) is used on the DUT.
==== Diagram ====
+------------------------------------------+ +--------+ +------------------------------+
| Device under test | | Cisco | | Packet generator & receiver |
| | |Catalyst| | |
| igb2: 198.18.0.203/24 |=| < |=| igb1: 198.18.0.201/24 |
| 2001:2::203/64 | | | | 2001:2::201/64 |
| (00:1b:21:c4:95:7a) | | | | (0c:c4:7a:da:3c:11) |
| | | | | |
| igb3: 198.19.0.203/24 |=| > |=| igb2: 198.19.0.201/24 |
| 2001:2:0:8000::203/64 | | | | 2001:2:0:8000::201/64 |
| (00:1b:21:c4:95:7b) | +--------+ | (0c:c4:7a:da:3c:12) |
| | | |
| static routes | | |
| 192.18.0.0/16 => 198.18.0.208 | | |
| 192.19.0.0/16 => 198.19.0.208 | | |
| 2001:2::/49 => 2001:2::208 | | |
| 2001:2:0:8000::/49 => 2001:2:0:8000::208 | | |
| | | |
| static arp and ndp | | |
| 198.18.0.208 => 0c:c4:7a:da:3c:11 | | |
| 2001:2::208 | | |
| | | |
| 198.19.0.208 => 0c:c4:7a:da:3c:12 | | |
| 2001:2:0:8000::208 | | |
+------------------------------------------+ +------------------------------+
The generator **MUST** generate lot's of smallest IP flows (multiple source/destination IP addresses and/or UDP src/dst port).
Here is an example for generating about 2000 flows:
pkt-gen -N -f tx -i igb1 -n 1000000000 -4 -d 198.19.10.1:2000-198.19.10.100 -D 00:1b:21:c4:95:7a -s 198.18.10.1:2000-198.18.10.20 -S 0c:c4:7a:da:3c:11 -w 4 -l 60 -U
Receiver will use this command:
pkt-gen -N -f rx -i igb2 -w 4
===== Basic configuration =====
==== Disabling Ethernet flow-control ===
First, disable Ethernet flow-control:
echo "dev.igb.2.fc=0" >> /etc/sysctl.conf
echo "dev.igb.3.fc=0" >> /etc/sysctl.conf
sysctl dev.igb.2.fc=0
sysctl dev.igb.3.fc=0
==== IP Configuration ====
Configure IP addresses, static routes and static ARP entries.
A router [[Documentation:Technical docs:Performance|should not use LRO and TSO]]. BSDRP disable by default using a RC script (disablelrotso_enable="YES" in /etc/rc.conf.misc).
sysrc static_routes="generator receiver"
sysrc route_generator="-net 1.0.0.0/8 1.1.1.1"
sysrc route_receiver="-net 2.0.0.0/8 2.2.2.3"
sysrc ifconfig_igb2="inet 1.1.1.2/24 -tso4 -tso6 -lro"
sysrc ifconfig_igb3="inet 2.2.2.2/24 -tso4 -tso6 -lro"
sysrc static_arp_pairs="receiver generator"
sysrc static_arp_generator="1.1.1.1 00:1b:21:d4:3f:2a"
sysrc static_arp_receiver="2.2.2.3 00:1b:21:c4:95:7b"
===== Default forwarding speed =====
With the default parameters, multi-flow traffic at 1.488Mpps (the maximum rate for GigaEthernet) are correctly forwarded without any loss:
[root@BSDRP]~# netstat -iw 1
input (Total) output
packets errs idrops bytes packets errs bytes colls
1511778 0 0 91524540 1508774 0 54260514 0
1437061 0 0 87010506 1433981 0 51628556 0
1492392 0 0 90363066 1489107 0 53551190 0
1435098 0 0 86919666 1432911 0 51403100 0
1486627 0 0 90015126 1483984 0 53323802 0
1435217 0 0 86898126 1432679 0 51498188 0
1486694 0 0 90017226 1483725 0 53324978 0
1488248 0 0 90106326 1485331 0 53360930 0
1437796 0 0 87084606 1435594 0 51504950 0
The traffic is correctly load-balanced between NIC-queue/CPU binding:
[root@BSDRP]# vmstat -i | grep igb
irq278: igb2:que 0 2759545998 95334
irq279: igb2:que 1 2587966938 89406
irq280: igb2:que 2 2589102074 89445
irq281: igb2:que 3 2598239184 89761
irq282: igb2:link 2 0
irq283: igb3:que 0 3318777087 114654
irq284: igb3:que 1 3098055250 107028
irq285: igb3:que 2 3101570541 107150
irq286: igb3:que 3 3052431966 105452
irq287: igb3:link 2 0
[root@BSDRP]/# top -nCHSIzs1
last pid: 8292; load averages: 5.38, 1.70, 0.65 up 0+10:38:54 13:08:33
153 processes: 12 running, 97 sleeping, 44 waiting
Mem: 2212K Active, 24M Inact, 244M Wired, 18M Buf, 15G Free
Swap:
PID USERNAME PRI NICE SIZE RES STATE C TIME CPU COMMAND
11 root -92 - 0K 816K WAIT 0 218:26 85.25% intr{irq278: igb2:que}
11 root -92 - 0K 816K CPU1 1 296:18 84.77% intr{irq279: igb2:que}
11 root -92 - 0K 816K RUN 2 298:15 84.67% intr{irq280: igb2:que}
11 root -92 - 0K 816K CPU3 3 294:53 84.18% intr{irq281: igb2:que}
11 root -92 - 0K 816K RUN 3 67:46 16.46% intr{irq286: igb3:que}
11 root -92 - 0K 816K RUN 2 70:27 16.06% intr{irq285: igb3:que}
11 root -92 - 0K 816K RUN 1 68:36 15.97% intr{irq284: igb3:que}
11 root -92 - 0K 816K CPU0 0 59:39 15.28% intr{irq283: igb3:que}
===== igb(4) drivers tunning with 82546GB =====
==== Disabling multi-queue ====
For disabling multi-queue (this mean without IRQ load-sharing between CPUs), there are 2 methods:
The first method is to use pkt-gen for generating a one IP flow (same src/dst IP and same src/dst port) traffic like this:
pkt-gen -i igb2 -f tx -n 80000000 -l 42 -d 2.3.3.2 -D 00:1b:21:d3:8f:3e -s 1.3.3.3 -w 10
=> With this method, igb(4) can't do load-balancing input traffic and will use only one queue.
The second method is to disabling the multi-queue support of igb(4) drivers by forcing the usage of one queue:
mount -uw /
echo 'sysctl hw.igb.num_queues="1"' >> /boot/loader.conf.local
mount -ur /
reboot
And check on the dmesg or with number of IRQ assigned to the NIC that no multi-queue was enabled:
[root@BSDRP]~# grep 'igb[2-3]' /var/run/dmesg.boot
igb2: mem 0x97a80000-0x97afffff,0x97c04000-0x97c07fff irq 39 at device 0.2 on pci26
igb2: Using MSIX interrupts with 2 vectors
igb2: Ethernet address: 00:1b:21:d3:8f:3e
001.000011 netmap_attach [2244] success for igb2
igb3: mem 0x97a00000-0x97a7ffff,0x97c00000-0x97c03fff irq 38 at device 0.3 on pci26
igb3: Using MSIX interrupts with 2 vectors
igb3: Ethernet address: 00:1b:21:d3:8f:3f
001.000012 netmap_attach [2244] success for igb3
[root@BSDRP]~# vmstat -i | grep igb
irq272: igb2:que 0 8 0
irq273: igb2:link 2 0
irq274: igb3:que 0 48517905 74757
irq275: igb3:link 2 0
Using any of theses method, the result will be the same: forwarding speed will decrease (to about 700Kpps) corresponding to the maximum input rate (100% CPU usage of the unique CPU bound to input NIC IRQ).
[root@BSDRP]~# netstat -iw 1
input (Total) output
packets errs idrops bytes packets errs bytes colls
690541 797962 0 41432466 690541 0 29002910 0
704171 797906 0 42250266 704174 0 29575322 0
676522 797770 0 40591326 676519 0 28414022 0
707373 797878 0 42442386 707376 0 29709848 0
672983 797962 0 40378986 672981 0 28265426 0
705339 797899 0 42320346 705336 0 29624378 0
684930 798049 0 41095806 684934 0 28767200 0
[root@bsdrp2]~# top -nCHSIzs1
last pid: 2930; load averages: 1.44, 0.94, 0.50 up 0+00:08:21 13:29:11
129 processes: 8 running, 84 sleeping, 37 waiting
Mem: 13M Active, 8856K Inact, 203M Wired, 9748K Buf, 15G Free
Swap:
PID USERNAME PRI NICE SIZE RES STATE C TIME CPU COMMAND
11 root -92 - 0K 624K CPU2 2 0:00 100.00% intr{irq272: igb2:que}
11 root -92 - 0K 624K CPU0 0 0:54 24.37% intr{irq274: igb3:que}
0 root -92 0 0K 368K CPU1 1 0:17 5.66% kernel{igb3 que}
==== hw.igb.rx_process_limit and hw.igb.txd/rxd ====
What are the impact of modifying hw.igb.rx_process_limit and hw.igb.txd/rxd sysctls on the igb(4) performance ?
We need to overload this NIC for this test, this meaning using this NIC without multi-queue.
=== graphical result ===
Here are the results of one-flow paquet per seconds peformance with differents values:
{{:documentation:examples:play_ing_with_hw.igb_values.png}}
=== Ministat graphs ===
== txd/rxd fixed at 1024, rx_process_limit variable ==
x xd1024.proc_lim-1
+ xd1024.proc_lim100
* xd1024.proc_lim500
+----------------------------------------------------------------------------------------------+
|* + + ++ * + * * * x x x x|
| |_______M__A__________| |
| |_________AM_______| |
| |_______________A______M_________| |
+----------------------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 550723 566775 558124 559487.2 6199.8641
+ 5 517990 532364 526037 525182 5268.8787
Difference at 95.0% confidence
-34305.2 +/- 8390.76
-6.13154% +/- 1.49972%
(Student's t, pooled s = 5753.23)
* 5 515348 539060 534270 530501.6 9292.18
Difference at 95.0% confidence
-28985.6 +/- 11520
-5.18074% +/- 2.05903%
(Student's t, pooled s = 7898.83)
== txd/rxd fixed at 2048, rx_process_limit variable ==
x xd2048.proc_lim-1
+ xd2048.proc_lim100
* xd2048.proc_lim500
+----------------------------------------------------------------------------------------------+
|+** +* +* + * x x x x|
| |___AM__| |
||______MA_______| |
||_______M_A_________| |
+----------------------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 563660 568029 566189 565768.2 1721.5844
+ 5 527496 535493 530494 531057.4 3465.0161
Difference at 95.0% confidence
-34710.8 +/- 3990.14
-6.13516% +/- 0.70526%
(Student's t, pooled s = 2735.89)
* 5 527807 538259 530987 531871.8 4288.4826
Difference at 95.0% confidence
-33896.4 +/- 4765.66
-5.99122% +/- 0.842335%
(Student's t, pooled s = 3267.64)
== txd/rxd fixed at 4096, rx_process_limit variable ==
x xd4096.proc_lim-1
+ xd4096.proc_lim100
* xd4096.proc_lim500
+----------------------------------------------------------------------------------------------+
| * |
| * |
|+ + + +* ** + x x x x |
| |_________A____M____||
| |________A_______| |
| |M_A_| |
+----------------------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 555002 565784 564089 562047.8 4646.4517
+ 5 522545 533819 528932 528506.4 4088.621
Difference at 95.0% confidence
-33541.4 +/- 6382.78
-5.96771% +/- 1.13563%
(Student's t, pooled s = 4376.43)
* 5 530026 532361 530327 530970 1102.3525
Difference at 95.0% confidence
-31077.8 +/- 4924.78
-5.52939% +/- 0.87622%
(Student's t, pooled s = 3376.74)
===== Firewall impact =====
Multi-queue is re-enabled for this test, and best value from the previous tests used:
* hw.igb.rxd=2048
* hw.igb.txd=2048
* hw.igb.rx_process_limit=-1 (disabled)
* hw.igb.num_queues=0 (automatically based on number of CPUs and max supported MSI-X messages = 4 on this lab hardware)
This test will generate 2000 different flows by using 2000 different UDP destination ports:
pkt-gen -i igb2 -f tx -l 42 -d 2.3.3.1:2000-2.3.3.1:4000 -D 00:1b:21:d3:8f:3e -s 1.3.3.1 -w 10
==== IPFW ====
=== Stateless ===
Now we will test the impact of enabling simple stateless IPFW rules:
cat > /etc/ipfw.rules <<'EOF'
#!/bin/sh
fwcmd="/sbin/ipfw"
# Flush out the list before we begin.
${fwcmd} -f flush
${fwcmd} add 3000 allow ip from any to any
'EOF'
sysrc firewall_enable="YES"
sysrc firewall_script="/etc/ipfw.rules"
=== Statefull ===
Now we will test the impact of enabling simple statefull IPFW rules:
cat > /etc/ipfw.rules <<'EOF'
#!/bin/sh
fwcmd="/sbin/ipfw"
# Flush out the list before we begin.
${fwcmd} -f flush
${fwcmd} add 3000 allow ip from any to any keep-state
'EOF'
service ipfw restart
==== PF ====
=== Stateless ===
Now we will test the impact of enabling simple stateless PF rules:
cat >/etc/pf.conf <<'EOF'
set skip on lo0
pass no state
'EOF'
sysrc pf_enable="YES"
=== Statefull ===
Now we will test the impact of enabling simple statefull PF rules:
cat >/etc/pf.conf <<'EOF'
set skip on lo0
pass
'EOF'
sysrc pf_enable="YES"
==== Results ====
=== Graph ===
scale information: 1.488Mpps is the maximum paquet-per-second rate for GigaEthernet.
{{documentation:examples:bench.impact.of.ipfw-pf.png|Impact of ipfw and pf on 4 cores Xeon 2.13GHz with Intel 82580 NIC}}
=== ministat ===
x pps.fastforwarding
+ pps.ipfw-stateless
* pps.ipfw-statefull
% pps.pf-stateless
# pps.pf-statefull
+------------------------------------------------------------------------------------------------+
|% %%% % # #O # * * # * * *|
| A|
| A|
| |______M_____A___________| |
| |__A__| |
| |__M_A____| |
+------------------------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 5 1488140 1488142 1488142 1488141.2 1.0955343
+ 5 1488140 1488142 1488141 1488141 0.70710678
No difference proven at 95.0% confidence
* 5 1299942 1369446 1319923 1331666.2 29281.162
Difference at 95.0% confidence
-156475 +/- 30196.9
-10.5148% +/- 2.02917%
(Student's t, pooled s = 20704.9)
% 5 1267869 1287061 1276235 1277154 6914.3127
Difference at 95.0% confidence
-210987 +/- 7130.55
-14.1779% +/- 0.479158%
(Student's t, pooled s = 4889.16)
# 5 1294193 1324502 1300015 1304887.8 12265.529
Difference at 95.0% confidence
-183253 +/- 12649.1
-12.3142% +/- 0.849995%
(Student's t, pooled s = 8673.04)