====== Forwarding performance lab of an IBM System x3550 M3 with 10-Gigabit Intel 82599EB ======
{{description>Forwarding performance lab of a quad-core Xeon 2.13GHz and dual-port Intel 82599EB 10-Gigabit}}

===== Bench lab =====

==== Hardware detail ====

This lab tests an [[IBM System x3550 M3]] with a **quad-core** Intel Xeon L5630 (2.13GHz, hyper-threading disabled), a dual-port Intel 82599EB 10-Gigabit NIC and optical SFP+ modules (SFP-10G-LR).

NIC details:
<code>
ix0@pci0:21:0:0:        class=0x020000 card=0x00038086 chip=0x10fb8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82599EB 10-Gigabit SFI/SFP+ Network Connection'
    class      = network
    subclass   = ethernet
    bar   [10] = type Prefetchable Memory, range 64, base 0xfbe80000, size 524288, enabled
    bar   [18] = type I/O Port, range 32, base 0x2020, size 32, enabled
    bar   [20] = type Prefetchable Memory, range 64, base 0xfbf04000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 64 messages, enabled
                 Table in map 0x20[0x0], PBA in map 0x20[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR link x8(x8)
                 speed 5.0(5.0) ASPM disabled(L0s)
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 90e2baffff842038
    ecap 000e[150] = ARI 1
    ecap 0010[160] = SR-IOV 1 IOV disabled, Memory Space disabled, ARI disabled
                     0 VFs configured out of 64 supported
                     First VF RID Offset 0x0180, VF RID Stride 0x0002
                     VF Device ID 0x10ed
                     Page Sizes: 4096 (enabled), 8192, 65536, 262144, 1048576, 4194304
</code>

==== Lab set-up ====

The lab is detailed here: [[documentation:examples:Setting up a forwarding performance benchmark lab]].

=== Diagram ===

<code>
+------------------------------------------+ +-------+ +------------------------------+
|        Device under test                 | |Juniper| | Packet generator & receiver  |
|                                          | |  QFX  | |                              |
|                ix0: 198.18.0.1/24        |=|   <   |=| vcxl0: 198.18.0.110/24       |
|                      2001:2::1/64        | |       | |        2001:2::110/64        |
|                      (90:e2:ba:84:20:38) | |       | |        (00:07:43:2e:e4:72)   |
|                                          | |       | |                              |
|                ix1: 198.19.0.1/24        |=|   >   |=| vcxl1: 198.19.0.110/24       |
|                      2001:2:0:8000::1/64 | |       | |        2001:2:0:8000::110/64 |
|                      (90:e2:ba:84:20:39) | +-------+ |        (00:07:43:2e:e4:7a)   |
|                                          |           |                              |
|            static routes                 |           |                              |
| 198.18.0.0/16      => 198.18.0.110       |           |                              |
| 198.19.0.0/16      => 198.19.0.110       |           |                              |
| 2001:2::/49        => 2001:2::110        |           |                              |
| 2001:2:0:8000::/49 => 2001:2:0:8000::110 |           |                              |
|                                          |           |                              |
|        static arp and ndp                |           | /boot/loader.conf:           |
| 198.18.0.110        => 00:07:43:2e:e4:72 |           |      hw.cxgbe.num_vis=2      |
| 2001:2::110                              |           |                              |
|                                          |           |                              |
| 198.19.0.110        => 00:07:43:2e:e4:7a |           |                              |
| 2001:2:0:8000::110                       |           |                              |
+------------------------------------------+           +------------------------------+
</code>

The generator **MUST** generate lots of small IP flows (multiple source/destination IP addresses and/or UDP source/destination ports).

Here is an example for generating 2000 IPv4 flows (100 destination IP addresses * 20 source IP addresses) with a Chelsio NIC:
<code>
pkt-gen -i vcxl0 -f tx -n 1000000000 -l 60 -d 198.19.10.1:2000-198.19.10.100 -D 90:e2:ba:84:20:38 -s 198.18.10.1:2000-198.18.10.20 -w 4 -p 2
</code>

And the same with IPv6 flows (minimum frame size of 62 here):
<code>
pkt-gen -f tx -i vcxl0 -n 1000000000 -l 62 -6 -d "[2001:2:0:8010::1]-[2001:2:0:8010::64]" -D 90:e2:ba:84:20:38 -s "[2001:2:0:10::1]-[2001:2:0:10::14]" -S 00:07:43:2e:e4:72 -w 4 -p 2
</code>

<note warning>
These examples use a pkt-gen enhanced with IPv6 support, software checksumming and optional unit normalization: [[https://raw.githubusercontent.com/ocochard/BSDRP/master/BSDRPcur/patches/freebsd.pkt-gen.ae-ipv6.patch|BSDRP's patch to netmap pkt-gen]].
</note>
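A minimal sketch of building the patched pkt-gen from a FreeBSD source tree (the patch level and paths are assumptions; check the patch header for your tree):
<code>
cd /usr/src
fetch https://raw.githubusercontent.com/ocochard/BSDRP/master/BSDRPcur/patches/freebsd.pkt-gen.ae-ipv6.patch
patch -p0 < freebsd.pkt-gen.ae-ipv6.patch   # -p0 assumed
cd tools/tools/netmap && make
</code>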
The receiver uses this command:
<code>
pkt-gen -i vcxl1 -f rx -w 4
</code>
==== Basic configuration ====

=== Disabling Ethernet flow-control ===

First, disable Ethernet flow-control on both servers:
<code>
echo "dev.ix.0.fc=0" >> /etc/sysctl.conf
echo "dev.ix.1.fc=0" >> /etc/sysctl.conf
</code>
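/etc/sysctl.conf is only read at boot; to apply the change immediately:
<code>
sysctl dev.ix.0.fc=0
sysctl dev.ix.1.fc=0
</code>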

=== Enabling unsupported SFP ===

Because we are using non-Intel SFP modules:
<code>
mount -uw /
echo 'hw.ix.unsupported_sfp="1"' >> /boot/loader.conf.local
mount -ur /
</code>
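This loader tunable only takes effect after a reboot; you can then confirm that the loader picked it up by reading the kernel environment:
<code>
kenv hw.ix.unsupported_sfp
</code>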

=== Disabling LRO and TSO ===

A router [[Documentation:Technical docs:Performance|should not use LRO and TSO]]. BSDRP disables them by default with an RC script (disablelrotso_enable="YES" in /etc/rc.conf.misc).

But on a standard FreeBSD:
<code>
ifconfig ix0 -tso4 -tso6 -lro
ifconfig ix1 -tso4 -tso6 -lro
</code>
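Check the interface options line to confirm that TSO4, TSO6 and LRO are no longer listed:
<code>
ifconfig ix0 | grep options
ifconfig ix1 | grep options
</code>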

=== IP configuration on DUT ===

<code>
# IPv4 router
gateway_enable="YES"
static_routes="generator receiver"
route_generator="-net 198.18.0.0/16 198.18.0.110"
route_receiver="-net 198.19.0.0/16 198.19.0.110"
ifconfig_ix0="inet 198.18.0.1/24 -tso4 -tso6 -lro"
ifconfig_ix1="inet 198.19.0.1/24 -tso4 -tso6 -lro"
static_arp_pairs="HPvcxl0 HPvcxl1"
static_arp_HPvcxl0="198.18.0.110 00:07:43:2e:e4:72"
static_arp_HPvcxl1="198.19.0.110 00:07:43:2e:e4:7a"

# IPv6 router
ipv6_gateway_enable="YES"
ipv6_activate_all_interfaces="YES"
ipv6_static_routes="generator receiver"
ipv6_route_generator="2001:2:: -prefixlen 49 2001:2::110"
ipv6_route_receiver="2001:2:0:8000:: -prefixlen 49 2001:2:0:8000::110"
ifconfig_ix0_ipv6="inet6 2001:2::1 prefixlen 64"
ifconfig_ix1_ipv6="inet6 2001:2:0:8000::1 prefixlen 64"
static_ndp_pairs="HPvcxl0 HPvcxl1"
static_ndp_HPvcxl0="2001:2::110 00:07:43:2e:e4:72"
static_ndp_HPvcxl1="2001:2:0:8000::110 00:07:43:2e:e4:7a"
</code>
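After a reboot (or a "service netif restart && service routing restart"), a quick sanity check of the routing table and static entries:
<code>
netstat -rn -f inet
netstat -rn -f inet6
arp -an
ndp -an
</code>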
===== Routing performance with default BSDRP value =====
==== Default fast-forwarding performance in front of a line-rate generator ====

In front of a multi-flow traffic generator running at line rate, 14.8Mpps (thanks Chelsio!), the DUT becomes unusable: all 4 cores are saturated and no command (not even netstat) can be entered during the load.

On the receiver, only about 2.8Mpps are received (hence forwarded):
<code>
242.851700 main_thread [2277] 2870783 pps (2873654 pkts 1379353920 bps in 1001000 usec) 17.79 avg_batch 3584 min_space
243.853699 main_thread [2277] 2869423 pps (2875162 pkts 1380077760 bps in 1002000 usec) 17.77 avg_batch 1956 min_space
244.854700 main_thread [2277] 2870532 pps (2873403 pkts 1379233440 bps in 1001000 usec) 17.78 avg_batch 2022 min_space
245.855699 main_thread [2277] 2872424 pps (2875296 pkts 1380142080 bps in 1001000 usec) 17.79 avg_batch 1949 min_space
246.856699 main_thread [2277] 2871882 pps (2874754 pkts 1379881920 bps in 1001000 usec) 17.79 avg_batch 3584 min_space
247.857699 main_thread [2277] 2871047 pps (2873918 pkts 1379480640 bps in 1001000 usec) 17.78 avg_batch 3584 min_space
248.858700 main_thread [2277] 2870945 pps (2873816 pkts 1379431680 bps in 1001000 usec) 17.79 avg_batch 1792 min_space
249.859699 main_thread [2277] 2870647 pps (2873518 pkts 1379288640 bps in 1001000 usec) 17.78 avg_batch 1959 min_space
250.860699 main_thread [2277] 2870222 pps (2873092 pkts 1379084160 bps in 1001000 usec) 17.78 avg_batch 1956 min_space
251.861699 main_thread [2277] 2870311 pps (2873178 pkts 1379125440 bps in 1000999 usec) 17.78 avg_batch 1792 min_space
252.862699 main_thread [2277] 2870795 pps (2873669 pkts 1379361120 bps in 1001001 usec) 17.78 avg_batch 2024 min_space
</code>

The traffic is correctly load-balanced across the four queues:

<code>
[root@DUT]~# sysctl dev.ix.0. | grep rx_packet
dev.ix.0.queue3.rx_packets: 143762837
dev.ix.0.queue2.rx_packets: 141867655
dev.ix.0.queue1.rx_packets: 140704642
dev.ix.0.queue0.rx_packets: 139301732
[root@R1]~# sysctl dev.ix.1. | grep tx_packet
dev.ix.1.queue3.tx_packets: 143762837
dev.ix.1.queue2.tx_packets: 141867655
dev.ix.1.queue1.tx_packets: 140704643
dev.ix.1.queue0.tx_packets: 139301734
</code>
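The ixgbe driver creates one queue per core by default. If you need to pin the queue count explicitly, the FreeBSD 10/11-era driver exposes a loader tunable (name assumed for that driver generation; verify on your version):
<code>
echo 'hw.ix.num_queues="4"' >> /boot/loader.conf.local
</code>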

Where does the system spend its time?

<code>
[root@DUT]~# kldload hwpmc
[root@DUT]~# pmcstat -TS instructions -w1

PMC: [INSTR_RETIRED_ANY] Samples: 99530 (100.0%) , 0 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
  6.5 kernel     ixgbe_rxeof          ixgbe_msix_que
  5.8 kernel     bzero                m_pkthdr_init:1.7 ip_tryforward:1.5 ip_findroute:1.4 fib4_lookup_nh_basic:1.3
  5.1 kernel     ixgbe_xmit           ixgbe_mq_start_locked
  3.7 kernel     ixgbe_mq_start       ether_output
  2.8 kernel     rn_match             fib4_lookup_nh_basic
  2.7 kernel     _rw_runlock_cookie   fib4_lookup_nh_basic:2.0 arpresolve:0.8
  2.7 kernel     ip_tryforward        ip_input
  2.6 kernel     ether_nh_input       netisr_dispatch_src
  2.6 kernel     bounce_bus_dmamap_lo bus_dmamap_load_mbuf_sg
  2.6 kernel     uma_zalloc_arg       ixgbe_rxeof
  2.6 kernel     _rm_rlock            in_localip
  2.4 kernel     netisr_dispatch_src  ether_demux:1.7 ether_input:0.7
  2.3 libc.so.7  bsearch              0x63ac
  2.2 kernel     ether_output         ip_tryforward
  2.2 kernel     ip_input             netisr_dispatch_src
  2.1 kernel     uma_zfree_arg        m_freem
  2.1 kernel     fib4_lookup_nh_basic ip_findroute
  1.9 kernel     __rw_rlock           arpresolve:1.3 fib4_lookup_nh_basic:0.6
  1.8 kernel     bus_dmamap_load_mbuf ixgbe_xmit
  1.7 kernel     bcopy                arpresolve
  1.6 kernel     m_adj                ether_demux
  1.5 kernel     memcpy               ether_output
  1.5 kernel     arpresolve           ether_output
  1.5 kernel     _mtx_trylock_flags_  ixgbe_mq_start
  1.4 kernel     mb_ctor_mbuf         uma_zalloc_arg
  1.4 kernel     ixgbe_txeof          ixgbe_msix_que
  1.3 pmcstat    0x63f0               bsearch
  1.1 pmcstat    0x63e3               bsearch
  1.0 kernel     ixgbe_refresh_mbufs  ixgbe_rxeof
  1.0 kernel     in_localip           ip_tryforward
  1.0 kernel     random_harvest_queue ether_nh_input
  0.9 kernel     bcmp                 ether_nh_input
  0.8 kernel     cpu_search_lowest    cpu_search_lowest
  0.8 kernel     mac_ifnet_check_tran ether_output
  0.8 kernel     critical_exit        uma_zalloc_arg
  0.7 kernel     critical_enter
  0.7 kernel     ixgbe_mq_start_locke ixgbe_mq_start
  0.6 kernel     mac_ifnet_create_mbu ether_nh_input
  0.6 kernel     ether_demux          ether_nh_input
  0.5 kernel     m_freem              ixgbe_txeof
  0.5 kernel     ipsec4_capability    ip_input
</code>

=> The largest share of time is spent in the driver receive path (ixgbe_rxeof).
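pmcstat can also record the samples to a file for offline callgraph analysis:
<code>
pmcstat -S instructions -O /tmp/samples.out   # stop with Ctrl-C after a few seconds under load
pmcstat -R /tmp/samples.out -G /tmp/callgraph.txt
</code>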
==== Equilibrium throughput ====

The previous methodology, generating 14.8Mpps, is like testing the DUT under a "Denial-of-Service". Try another methodology, known as [[documentation:examples:Setting up a VPN (IPSec, GRE, etc...) performance benchmark lab|equilibrium throughput]].

From the packet generator, start an estimation of the "equilibrium throughput", beginning at 4Mpps:
<code>
[root@pkt-gen]~# equilibrium -d 90:e2:ba:84:20:38 -p -l 4000 -t vcxl0 -r vcxl1
Benchmark tool using equilibrium throughput method
- Benchmark mode: Throughput (pps) for Router
- UDP load = 18B, IPv4 packet size=46B, Ethernet frame size=60B
- Link rate = 4000 Kpps
- Tolerance = 0.01
Iteration 1
  - Offering load = 2000 Kpps
  - Step = 1000 Kpps
  - Measured forwarding rate = 1999 Kpps
Iteration 2
  - Offering load = 3000 Kpps
  - Step = 1000 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2949 Kpps
Iteration 3
  - Offering load = 2500 Kpps
  - Step = 500 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 2499 Kpps
Iteration 4
  - Offering load = 2750 Kpps
  - Step = 250 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2750 Kpps
Iteration 5
  - Offering load = 2875 Kpps
  - Step = 125 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2875 Kpps
Iteration 6
  - Offering load = 2937 Kpps
  - Step = 62 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2933 Kpps
Iteration 7
  - Offering load = 2968 Kpps
  - Step = 31 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2948 Kpps
Estimated Equilibrium Ethernet throughput= 2948 Kpps (maximum value seen: 2949 Kpps)
</code>

=> The equilibrium method confirms the line-rate result: about 2.9Mpps.
==== Firewall impact ====

Here we add one rule to each firewall and generate 2000 UDP "sessions"; more information is in the [[documentation:examples:forwarding_performance_lab_of_an_ibm_system_x3550_m3_with_intel_82580#firewall_impact|GigaEthernet performance lab]].

[[https://github.com/ocochard/netbenchs/tree/master/Xeon_L5630-4Cores-Intel_82599EB/fastforwarding-pf-ipfw|Full configuration sets, scripts and results]].
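As an illustration only (the exact rule sets used are in the repository linked above), a single stateful rule for each firewall could look like this:
<code>
# ipfw: one stateful rule
ipfw add 100 allow udp from any to any keep-state

# pf (/etc/pf.conf): one stateful rule
pass proto udp keep state
</code>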

{{:documentation:examples:bench.forwarding.and.firewalling.rate.on.ibm.intel-82599eb.png|Impact of ipfw and pf on 4 cores Xeon 2.13GHz with 10-Gigabit Intel 82599EB}}
===== Routing performance with multiple static routes =====

FreeBSD had a route-lookup contention problem, and this setup uses only one static route (198.19.0.0/16) toward the traffic receiver.

By splitting this single route into 4 or 8 more-specific routes, we should obtain better results.

This bench uses 100 different destination IP addresses, from 198.19.10.1 to 198.19.10.100; the bench is then redone using 4 static routes:
  - 198.19.10.0/27 (0 to 31)
  - 198.19.10.32/27 (32 to 63)
  - 198.19.10.64/27 (64 to 95)
  - 198.19.10.96/27 (96 to 127)
then with 8 routes:
  - 198.19.10.0/28
  - 198.19.10.16/28
  - 198.19.10.32/28
  - 198.19.10.48/28
  - 198.19.10.64/28
  - 198.19.10.80/28
  - 198.19.10.96/28
  - 198.19.10.112/28

Configuration example for the 4-route case:
<code>
sysrc static_routes="generator receiver1 receiver2 receiver3 receiver4"
sysrc route_generator="-net 198.18.0.0/16 198.18.2.2"
sysrc -x route_receiver
sysrc route_receiver1="-net 198.19.10.0/27 198.19.2.2"
sysrc route_receiver2="-net 198.19.10.32/27 198.19.2.2"
sysrc route_receiver3="-net 198.19.10.64/27 198.19.2.2"
sysrc route_receiver4="-net 198.19.10.96/27 198.19.2.2"
</code>
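The 8-route variant follows the same pattern (same next hop assumed):
<code>
sysrc static_routes="generator receiver1 receiver2 receiver3 receiver4 receiver5 receiver6 receiver7 receiver8"
sysrc route_receiver1="-net 198.19.10.0/28 198.19.2.2"
sysrc route_receiver2="-net 198.19.10.16/28 198.19.2.2"
sysrc route_receiver3="-net 198.19.10.32/28 198.19.2.2"
sysrc route_receiver4="-net 198.19.10.48/28 198.19.2.2"
sysrc route_receiver5="-net 198.19.10.64/28 198.19.2.2"
sysrc route_receiver6="-net 198.19.10.80/28 198.19.2.2"
sysrc route_receiver7="-net 198.19.10.96/28 198.19.2.2"
sysrc route_receiver8="-net 198.19.10.112/28 198.19.2.2"
</code>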

==== Graphs ====

{{:documentation:examples:bench.static-routes-contention.test.fbsd10.2.png|Impact of number of static routes on forwarding on 4 cores Xeon 2.13GHz with 10-Gigabit Intel 82599EB}}

=> A small gain: about 4% by using 4 static routes in place of 1, and about 5% with 8 routes; note that ministat below cannot prove a difference at 95% confidence.
==== Ministat ====

<code>
x pps.one-route
+ pps.four-routes
* pps.eight-routes
+--------------------------------------------------------------------------+
|                   *                                                      |
|                   *                                                      |
|   x      x   + +  *         x           x                  +  +       **|
||_________M______A________________|                                      |
|         |_________M_______________A________________________|            |
|           |_______M____________________A_____________________________|  |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1639912       1984116       1704942     1769454.2     154632.18
+   5       1737180       2194547       1785163     1927662.8     229585.57
No difference proven at 95.0% confidence
*   5       1782648       2273933       1785695     1978219.6     266278.12
No difference proven at 95.0% confidence
</code>
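This is standard ministat(1) output; it can be regenerated from the raw result files (names as above) with:
<code>
ministat pps.one-route pps.four-routes pps.eight-routes
</code>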
  