====== Forwarding performance lab of an IBM System x3550 M3 with 10-Gigabit Intel 82599EB ======
{{description>Forwarding performance lab of a quad-core Xeon 2.13GHz and dual-port Intel 82599EB 10-Gigabit}}

===== Bench lab =====

==== Hardware detail ====

This lab tests an [[IBM System x3550 M3]] with a **quad-core** Intel Xeon L5630 (2.13GHz, hyper-threading disabled), a dual-port Intel 82599EB 10-Gigabit NIC and optical SFP+ modules (SFP-10G-LR).

NIC details:
<code>
ix0@pci0:21:0:0:        class=0x020000 card=0x00038086 chip=0x10fb8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82599EB 10-Gigabit SFI/SFP+ Network Connection'
    class      = network
    subclass   = ethernet
    bar   [10] = type Prefetchable Memory, range 64, base 0xfbe80000, size 524288, enabled
    bar   [18] = type I/O Port, range 32, base 0x2020, size 32, enabled
    bar   [20] = type Prefetchable Memory, range 64, base 0xfbf04000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 64 messages, enabled
                 Table in map 0x20[0x0], PBA in map 0x20[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR link x8(x8)
                 speed 5.0(5.0) ASPM disabled(L0s)
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 90e2baffff842038
    ecap 000e[150] = ARI 1
    ecap 0010[160] = SR-IOV 1 IOV disabled, Memory Space disabled, ARI disabled
                     0 VFs configured out of 64 supported
                     First VF RID Offset 0x0180, VF RID Stride 0x0002
                     VF Device ID 0x10ed
                     Page Sizes: 4096 (enabled), 8192, 65536, 262144, 1048576, 4194304
</code>

==== Lab set-up ====

The lab is detailed here: [[documentation:examples:Setting up a forwarding performance benchmark lab]].

=== Diagram ===

<code>
+------------------------------------------+ +-------+ +------------------------------+
|        Device under test                 | |Juniper| | Packet generator & receiver  |
|                                          | |  QFX  | |                              |
|                ix0: 198.18.0.1/24        |=|   <   |=| vcxl0: 198.18.0.110/24       |
|                      2001:2::1/64        | |       | |        2001:2::110/64        |
|                      (90:e2:ba:84:20:38) | |       | |        (00:07:43:2e:e4:72)   |
|                                          | |       | |                              |
|                ix1: 198.19.0.1/24        |=|   >   |=| vcxl1: 198.19.0.110/24       |
|                      2001:2:0:8000::1/64 | |       | |        2001:2:0:8000::110/64 |
|                      (90:e2:ba:84:20:39) | +-------+ |        (00:07:43:2e:e4:7a)   |
|                                          |           |                              |
|            static routes                 |           |                              |
| 198.18.0.0/16      => 198.18.0.110       |           |                              |
| 198.19.0.0/16      => 198.19.0.110       |           |                              |
| 2001:2::/49        => 2001:2::110        |           |                              |
| 2001:2:0:8000::/49 => 2001:2:0:8000::110 |           |                              |
|                                          |           |                              |
|        static arp and ndp                |           | /boot/loader.conf:           |
| 198.18.0.110        => 00:07:43:2e:e4:72 |           |      hw.cxgbe.num_vis=2      |
| 2001:2::110                              |           |                              |
|                                          |           |                              |
| 198.19.0.110        => 00:07:43:2e:e4:7a |           |                              |
| 2001:2:0:8000::110                       |           |                              |
+------------------------------------------+           +------------------------------+
</code>

The generator **MUST** generate lots of small IP flows (multiple source/destination IP addresses and/or UDP source/destination ports).

Here is an example for generating 2000 IPv4 flows (100 destination IP addresses * 20 source IP addresses) with a Chelsio NIC:
<code>
pkt-gen -i vcxl0 -f tx -n 1000000000 -l 60 -d 198.19.10.1:2000-198.19.10.100 -D 90:e2:ba:84:20:38 -s 198.18.10.1:2000-198.18.10.20 -w 4 -p 2
</code>

And the same with IPv6 flows (minimum frame size of 62 here):
<code>
pkt-gen -f tx -i vcxl0 -n 1000000000 -l 62 -6 -d "[2001:2:0:8010::1]-[2001:2:0:8010::64]" -D 90:e2:ba:84:20:38 -s "[2001:2:0:10::1]-[2001:2:0:10::14]" -S 00:07:43:2e:e4:72 -w 4 -p 2
</code>

<note warning>
These examples use a pkt-gen enhanced with IPv6 support, software checksumming and optional unit normalization: [[https://raw.githubusercontent.com/ocochard/BSDRP/master/BSDRPcur/patches/freebsd.pkt-gen.ae-ipv6.patch|BSDRP's patch to netmap pkt-gen]].
</note>
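A minimal sketch of building the patched pkt-gen from a FreeBSD source tree (the patch level and paths are assumptions; check the patch header for your tree):
<code>
cd /usr/src
fetch https://raw.githubusercontent.com/ocochard/BSDRP/master/BSDRPcur/patches/freebsd.pkt-gen.ae-ipv6.patch
patch -p0 < freebsd.pkt-gen.ae-ipv6.patch   # -p0 assumed
cd tools/tools/netmap && make
</code>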
The receiver uses this command:
<code>
pkt-gen -i vcxl1 -f rx -w 4
</code>
==== Basic configuration ====

=== Disabling Ethernet flow-control ===

First, disable Ethernet flow-control on both servers:
<code>
echo "dev.ix.0.fc=0" >> /etc/sysctl.conf
echo "dev.ix.1.fc=0" >> /etc/sysctl.conf
</code>
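/etc/sysctl.conf is only read at boot; to apply the change immediately:
<code>
sysctl dev.ix.0.fc=0
sysctl dev.ix.1.fc=0
</code>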

=== Enabling unsupported SFP ===

Because we are using non-Intel SFP modules:
<code>
mount -uw /
echo 'hw.ix.unsupported_sfp="1"' >> /boot/loader.conf.local
mount -ur /
</code>
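This loader tunable only takes effect after a reboot; you can then confirm that the loader picked it up by reading the kernel environment:
<code>
kenv hw.ix.unsupported_sfp
</code>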

=== Disabling LRO and TSO ===

A router [[Documentation:Technical docs:Performance|should not use LRO and TSO]]. BSDRP disables them by default with an RC script (disablelrotso_enable="YES" in /etc/rc.conf.misc).

But on a standard FreeBSD:
<code>
ifconfig ix0 -tso4 -tso6 -lro
ifconfig ix1 -tso4 -tso6 -lro
</code>
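Check the interface options line to confirm that TSO4, TSO6 and LRO are no longer listed:
<code>
ifconfig ix0 | grep options
ifconfig ix1 | grep options
</code>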

=== IP configuration on DUT ===

<code>
# IPv4 router
gateway_enable="YES"
static_routes="generator receiver"
route_generator="-net 198.18.0.0/16 198.18.0.110"
route_receiver="-net 198.19.0.0/16 198.19.0.110"
ifconfig_ix0="inet 198.18.0.1/24 -tso4 -tso6 -lro"
ifconfig_ix1="inet 198.19.0.1/24 -tso4 -tso6 -lro"
static_arp_pairs="HPvcxl0 HPvcxl1"
static_arp_HPvcxl0="198.18.0.110 00:07:43:2e:e4:72"
static_arp_HPvcxl1="198.19.0.110 00:07:43:2e:e4:7a"

# IPv6 router
ipv6_gateway_enable="YES"
ipv6_activate_all_interfaces="YES"
ipv6_static_routes="generator receiver"
ipv6_route_generator="2001:2:: -prefixlen 49 2001:2::110"
ipv6_route_receiver="2001:2:0:8000:: -prefixlen 49 2001:2:0:8000::110"
ifconfig_ix0_ipv6="inet6 2001:2::1 prefixlen 64"
ifconfig_ix1_ipv6="inet6 2001:2:0:8000::1 prefixlen 64"
static_ndp_pairs="HPvcxl0 HPvcxl1"
static_ndp_HPvcxl0="2001:2::110 00:07:43:2e:e4:72"
static_ndp_HPvcxl1="2001:2:0:8000::110 00:07:43:2e:e4:7a"
</code>
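After a reboot (or a "service netif restart && service routing restart"), a quick sanity check of the routing table and static entries:
<code>
netstat -rn -f inet
netstat -rn -f inet6
arp -an
ndp -an
</code>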
===== Routing performance with default BSDRP value =====
==== Default fast-forwarding performance in front of a line-rate generator ====

In front of a multi-flow traffic generator running at line rate, 14.8Mpps (thanks Chelsio!), the DUT becomes unusable: all 4 cores are saturated and no command (not even netstat) can be entered during the load.

On the receiver, only about 2.8Mpps are received (hence forwarded):
<code>
242.851700 main_thread [2277] 2870783 pps (2873654 pkts 1379353920 bps in 1001000 usec) 17.79 avg_batch 3584 min_space
243.853699 main_thread [2277] 2869423 pps (2875162 pkts 1380077760 bps in 1002000 usec) 17.77 avg_batch 1956 min_space
244.854700 main_thread [2277] 2870532 pps (2873403 pkts 1379233440 bps in 1001000 usec) 17.78 avg_batch 2022 min_space
245.855699 main_thread [2277] 2872424 pps (2875296 pkts 1380142080 bps in 1001000 usec) 17.79 avg_batch 1949 min_space
246.856699 main_thread [2277] 2871882 pps (2874754 pkts 1379881920 bps in 1001000 usec) 17.79 avg_batch 3584 min_space
247.857699 main_thread [2277] 2871047 pps (2873918 pkts 1379480640 bps in 1001000 usec) 17.78 avg_batch 3584 min_space
248.858700 main_thread [2277] 2870945 pps (2873816 pkts 1379431680 bps in 1001000 usec) 17.79 avg_batch 1792 min_space
249.859699 main_thread [2277] 2870647 pps (2873518 pkts 1379288640 bps in 1001000 usec) 17.78 avg_batch 1959 min_space
250.860699 main_thread [2277] 2870222 pps (2873092 pkts 1379084160 bps in 1001000 usec) 17.78 avg_batch 1956 min_space
251.861699 main_thread [2277] 2870311 pps (2873178 pkts 1379125440 bps in 1000999 usec) 17.78 avg_batch 1792 min_space
252.862699 main_thread [2277] 2870795 pps (2873669 pkts 1379361120 bps in 1001001 usec) 17.78 avg_batch 2024 min_space
</code>

The traffic is correctly load-balanced across the four queues:

<code>
[root@DUT]~# sysctl dev.ix.0. | grep rx_packet
dev.ix.0.queue3.rx_packets: 143762837
dev.ix.0.queue2.rx_packets: 141867655
dev.ix.0.queue1.rx_packets: 140704642
dev.ix.0.queue0.rx_packets: 139301732
[root@R1]~# sysctl dev.ix.1. | grep tx_packet
dev.ix.1.queue3.tx_packets: 143762837
dev.ix.1.queue2.tx_packets: 141867655
dev.ix.1.queue1.tx_packets: 140704643
dev.ix.1.queue0.tx_packets: 139301734
</code>
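The ixgbe driver creates one queue per core by default. If you need to pin the queue count explicitly, the FreeBSD 10/11-era driver exposes a loader tunable (name assumed for that driver generation; verify on your version):
<code>
echo 'hw.ix.num_queues="4"' >> /boot/loader.conf.local
</code>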

Where does the system spend its time?

<code>
[root@DUT]~# kldload hwpmc
[root@DUT]~# pmcstat -TS instructions -w1

PMC: [INSTR_RETIRED_ANY] Samples: 99530 (100.0%) , 0 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
  6.5 kernel     ixgbe_rxeof          ixgbe_msix_que
  5.8 kernel     bzero                m_pkthdr_init:1.7 ip_tryforward:1.5 ip_findroute:1.4 fib4_lookup_nh_basic:1.3
  5.1 kernel     ixgbe_xmit           ixgbe_mq_start_locked
  3.7 kernel     ixgbe_mq_start       ether_output
  2.8 kernel     rn_match             fib4_lookup_nh_basic
  2.7 kernel     _rw_runlock_cookie   fib4_lookup_nh_basic:2.0 arpresolve:0.8
  2.7 kernel     ip_tryforward        ip_input
  2.6 kernel     ether_nh_input       netisr_dispatch_src
  2.6 kernel     bounce_bus_dmamap_lo bus_dmamap_load_mbuf_sg
  2.6 kernel     uma_zalloc_arg       ixgbe_rxeof
  2.6 kernel     _rm_rlock            in_localip
  2.4 kernel     netisr_dispatch_src  ether_demux:1.7 ether_input:0.7
  2.3 libc.so.7  bsearch              0x63ac
  2.2 kernel     ether_output         ip_tryforward
  2.2 kernel     ip_input             netisr_dispatch_src
  2.1 kernel     uma_zfree_arg        m_freem
  2.1 kernel     fib4_lookup_nh_basic ip_findroute
  1.9 kernel     __rw_rlock           arpresolve:1.3 fib4_lookup_nh_basic:0.6
  1.8 kernel     bus_dmamap_load_mbuf ixgbe_xmit
  1.7 kernel     bcopy                arpresolve
  1.6 kernel     m_adj                ether_demux
  1.5 kernel     memcpy               ether_output
  1.5 kernel     arpresolve           ether_output
  1.5 kernel     _mtx_trylock_flags_  ixgbe_mq_start
  1.4 kernel     mb_ctor_mbuf         uma_zalloc_arg
  1.4 kernel     ixgbe_txeof          ixgbe_msix_que
  1.3 pmcstat    0x63f0               bsearch
  1.1 pmcstat    0x63e3               bsearch
  1.0 kernel     ixgbe_refresh_mbufs  ixgbe_rxeof
  1.0 kernel     in_localip           ip_tryforward
  1.0 kernel     random_harvest_queue ether_nh_input
  0.9 kernel     bcmp                 ether_nh_input
  0.8 kernel     cpu_search_lowest    cpu_search_lowest
  0.8 kernel     mac_ifnet_check_tran ether_output
  0.8 kernel     critical_exit        uma_zalloc_arg
  0.7 kernel     critical_enter
  0.7 kernel     ixgbe_mq_start_locke ixgbe_mq_start
  0.6 kernel     mac_ifnet_create_mbu ether_nh_input
  0.6 kernel     ether_demux          ether_nh_input
  0.5 kernel     m_freem              ixgbe_txeof
  0.5 kernel     ipsec4_capability    ip_input
</code>

=> The largest share of time is spent in the driver receive path (ixgbe_rxeof).
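pmcstat can also record the samples to a file for offline callgraph analysis:
<code>
pmcstat -S instructions -O /tmp/samples.out   # stop with Ctrl-C after a few seconds under load
pmcstat -R /tmp/samples.out -G /tmp/callgraph.txt
</code>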
==== Equilibrium throughput ====

The previous methodology, generating 14.8Mpps, is like testing the DUT under a "Denial-of-Service". Try another methodology, known as [[documentation:examples:Setting up a VPN (IPSec, GRE, etc...) performance benchmark lab|equilibrium throughput]].

From the packet generator, start an estimation of the "equilibrium throughput", beginning at 4Mpps:
<code>
[root@pkt-gen]~# equilibrium -d 90:e2:ba:84:20:38 -p -l 4000 -t vcxl0 -r vcxl1
Benchmark tool using equilibrium throughput method
- Benchmark mode: Throughput (pps) for Router
- UDP load = 18B, IPv4 packet size=46B, Ethernet frame size=60B
- Link rate = 4000 Kpps
- Tolerance = 0.01
Iteration 1
  - Offering load = 2000 Kpps
  - Step = 1000 Kpps
  - Measured forwarding rate = 1999 Kpps
Iteration 2
  - Offering load = 3000 Kpps
  - Step = 1000 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2949 Kpps
Iteration 3
  - Offering load = 2500 Kpps
  - Step = 500 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 2499 Kpps
Iteration 4
  - Offering load = 2750 Kpps
  - Step = 250 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2750 Kpps
Iteration 5
  - Offering load = 2875 Kpps
  - Step = 125 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2875 Kpps
Iteration 6
  - Offering load = 2937 Kpps
  - Step = 62 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2933 Kpps
Iteration 7
  - Offering load = 2968 Kpps
  - Step = 31 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2948 Kpps
Estimated Equilibrium Ethernet throughput= 2948 Kpps (maximum value seen: 2949 Kpps)
</code>

=> The equilibrium method confirms the line-rate result: about 2.9Mpps.
==== Firewall impact ====

Here we add one rule to each firewall and generate 2000 UDP "sessions"; more information is in the [[documentation:examples:forwarding_performance_lab_of_an_ibm_system_x3550_m3_with_intel_82580#firewall_impact|GigaEthernet performance lab]].

[[https://github.com/ocochard/netbenchs/tree/master/Xeon_L5630-4Cores-Intel_82599EB/fastforwarding-pf-ipfw|Full configuration sets, scripts and results]].
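As an illustration only (the exact rule sets used are in the repository linked above), a single stateful rule for each firewall could look like this:
<code>
# ipfw: one stateful rule
ipfw add 100 allow udp from any to any keep-state

# pf (/etc/pf.conf): one stateful rule
pass proto udp keep state
</code>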

{{:documentation:examples:bench.forwarding.and.firewalling.rate.on.ibm.intel-82599eb.png|Impact of ipfw and pf on 4 cores Xeon 2.13GHz with 10-Gigabit Intel 82599EB}}
===== Routing performance with multiple static routes =====

FreeBSD had a route-lookup contention problem, and this setup uses only one static route (198.19.0.0/16) toward the traffic receiver.

By splitting this single route into 4 or 8 more-specific routes, we should obtain better results.

This bench uses 100 different destination IP addresses, from 198.19.10.1 to 198.19.10.100; the bench is then redone using 4 static routes:
  - 198.19.10.0/27 (0 to 31)
  - 198.19.10.32/27 (32 to 63)
  - 198.19.10.64/27 (64 to 95)
  - 198.19.10.96/27 (96 to 127)
then with 8 routes:
  - 198.19.10.0/28
  - 198.19.10.16/28
  - 198.19.10.32/28
  - 198.19.10.48/28
  - 198.19.10.64/28
  - 198.19.10.80/28
  - 198.19.10.96/28
  - 198.19.10.112/28

Configuration example for the 4-route case:
<code>
sysrc static_routes="generator receiver1 receiver2 receiver3 receiver4"
sysrc route_generator="-net 198.18.0.0/16 198.18.2.2"
sysrc -x route_receiver
sysrc route_receiver1="-net 198.19.10.0/27 198.19.2.2"
sysrc route_receiver2="-net 198.19.10.32/27 198.19.2.2"
sysrc route_receiver3="-net 198.19.10.64/27 198.19.2.2"
sysrc route_receiver4="-net 198.19.10.96/27 198.19.2.2"
</code>
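The 8-route variant follows the same pattern (same next hop assumed):
<code>
sysrc static_routes="generator receiver1 receiver2 receiver3 receiver4 receiver5 receiver6 receiver7 receiver8"
sysrc route_receiver1="-net 198.19.10.0/28 198.19.2.2"
sysrc route_receiver2="-net 198.19.10.16/28 198.19.2.2"
sysrc route_receiver3="-net 198.19.10.32/28 198.19.2.2"
sysrc route_receiver4="-net 198.19.10.48/28 198.19.2.2"
sysrc route_receiver5="-net 198.19.10.64/28 198.19.2.2"
sysrc route_receiver6="-net 198.19.10.80/28 198.19.2.2"
sysrc route_receiver7="-net 198.19.10.96/28 198.19.2.2"
sysrc route_receiver8="-net 198.19.10.112/28 198.19.2.2"
</code>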

==== Graphs ====

{{:documentation:examples:bench.static-routes-contention.test.fbsd10.2.png|Impact of number of static routes on forwarding on 4 cores Xeon 2.13GHz with 10-Gigabit Intel 82599EB}}

=> A small gain: about 4% by using 4 static routes in place of 1, and about 5% with 8 routes; note that ministat below cannot prove a difference at 95% confidence.
==== Ministat ====

<code>
x pps.one-route
+ pps.four-routes
* pps.eight-routes
+--------------------------------------------------------------------------+
|                   *                                                      |
|                   *                                                      |
|   x      x   + +  *         x           x                  +  +       **|
||_________M______A________________|                                      |
|         |_________M_______________A________________________|            |
|           |_______M____________________A_____________________________|  |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1639912       1984116       1704942     1769454.2     154632.18
+   5       1737180       2194547       1785163     1927662.8     229585.57
No difference proven at 95.0% confidence
*   5       1782648       2273933       1785695     1978219.6     266278.12
No difference proven at 95.0% confidence
</code>
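This is standard ministat(1) output; it can be regenerated from the raw result files (names as above) with:
<code>
ministat pps.one-route pps.four-routes pps.eight-routes
</code>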
  