====== Forwarding performance lab of an IBM System x3550 M3 with 10-Gigabit Intel 82599EB ======
{{description>

===== Bench lab =====

==== Hardware detail ====

This lab will test an [[IBM System x3550 M3]] with **quad** cores (Intel Xeon L5630 2.13GHz, hyper-threading disabled), a dual-port Intel 82599EB 10-Gigabit NIC and optical SFP+ modules (SFP-10G-LR).

NIC details:
<code>
ix0@pci0:
    vendor
    device
    class = network
    subclass
    bar [10] = type Prefetchable Memory, range 64, base 0xfbe80000, size 524288, enabled
    bar [18] = type I/O Port, range 32, base 0x2020, size 32, enabled
    bar [20] = type Prefetchable Memory, range 64, base 0xfbf04000, size 16384, enabled
    cap 01[40] = powerspec 3 supports D0 D3 current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 64 messages, enabled
                 Table in map 0x20[0x0], PBA in map 0x20[0x2000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(512) FLR link x8(x8)
                 speed 5.0(5.0) ASPM disabled(L0s)
    ecap 0001[100] = AER 1 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 90e2baffff842038
    ecap 000e[150] = ARI 1
    ecap 0010[160] = SR-IOV 1 IOV disabled, Memory Space disabled, ARI disabled
                     0 VFs configured out of 64 supported
                     First VF RID Offset 0x0180, VF RID Stride 0x0002
                     VF Device ID 0x10ed
                     Page Sizes: 4096 (enabled), 8192, 65536, 262144, 1048576, 4194304
</code>

==== Lab set-up ====

The lab is detailed here: [[documentation:

=== Diagram ===

<code>
+------------------------------------------+ +-------+ +------------------------------+
| Device under test                        | |Juniper| | Packet generator & receiver  |
|                                          | | QFX   | |                              |
| ix0: 198.18.0.1/
| 2001:
| (90:
| | | | |
| ix1: 198.19.0.1/
| 2001:
| (90:
|
| static routes
| 192.18.0.0/
| 192.19.0.0/
| 2001:
| 2001:
|
| static arp and ndp | | /
| 198.18.0.110
| 2001:
|
| 198.19.0.110
| 2001:
+------------------------------------------+
</code>

The generator **MUST** generate lots of small IP flows (multiple source/

Here is an example for generating 2000 IPv4 flows (100 destination IP addresses * 20 source IP addresses) with a Chelsio NIC:
<code>
pkt-gen -i vcxl0 -f tx -n 1000000000 -l 60 -d 198.19.10.1:
</code>
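As a sanity check, the flow count follows directly from the address ranges given to ''-d'' and ''-s''. A small sketch (the source range below is an assumption, since it is truncated above; only the last octet is parsed, so both ends of a range must sit in the same /24):

```shell
# Count addresses in a pkt-gen-style range by comparing last octets.
# Assumes both ends of the range are inside the same /24.
range_count() {
    first=${1%-*}     # text before the dash
    last=${1#*-}      # text after the dash
    echo $(( ${last##*.} - ${first##*.} + 1 ))
}

dst=$(range_count "198.19.10.1-198.19.10.100")   # 100 destinations
src=$(range_count "198.18.10.1-198.18.10.20")    # assumed source range: 20 sources
flows=$((dst * src))
echo "$flows flows"                              # 100 * 20 = 2000 flows
```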

And the same with IPv6 flows (minimum frame size of 62 here):
<code>
pkt-gen -f tx -i vcxl0 -n 1000000000 -l 62 -6 -d "
</code>
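For reference, the 14.8Mpps figure quoted later is simply the 10GbE line rate at this frame size: each ''-l 60'' frame occupies 60B + 4B CRC + 8B preamble + 12B inter-frame gap = 84B on the wire. A quick check:

```shell
# Theoretical 10GbE packet rate for a given pkt-gen frame length (-l, without CRC)
line_rate_pps() {
    wire=$(( $1 + 4 + 8 + 12 ))           # frame + CRC + preamble + inter-frame gap
    echo $(( 10000000000 / (wire * 8) ))  # 10 Gbit/s divided by bits per frame
}

ipv4=$(line_rate_pps 60)   # the "14.8Mpps" line rate for 60B frames
ipv6=$(line_rate_pps 62)   # slightly lower rate for the 62B IPv6 frames
echo "$ipv4 $ipv6"
```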

<note warning>
This version of pkt-gen is improved with IPv6 support, software checksums and optional unit normalization. [[https://
</note>
The receiver will use this command:
<code>
pkt-gen -i vcxl1 -f rx -w 4
</code>
==== Basic configuration ====

=== Disabling Ethernet flow-control ===

First, disable Ethernet flow-control on both servers:
<code>
echo "
echo "
</code>
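The echo targets above are truncated; on FreeBSD the ix(4) driver also exposes flow control as a per-port sysctl. A hedged sketch of persisting it via ''/etc/sysctl.conf'' — the OID name and values are assumptions to verify against your driver version with ''sysctl -d'':

```
# /etc/sysctl.conf fragment (assumed OIDs: dev.ix.<port>.fc, 0 = flow control off)
dev.ix.0.fc=0
dev.ix.1.fc=0
```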

=== Enabling unsupported SFP ===

Because we are using a non-Intel SFP:
<code>
mount -uw /
echo '
mount -ur /
</code>

=== Disabling LRO and TSO ===

A router [[Documentation:

But on a standard FreeBSD:
<code>
ifconfig ix0 -tso4 -tso6 -lro
ifconfig ix1 -tso4 -tso6 -lro
</code>
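To keep these flags across reboots on a standard FreeBSD, they can be appended to the interface lines in ''/etc/rc.conf''. A sketch with placeholder prefixes (the lab's real prefix lengths are truncated on this page):

```shell
# rc.conf fragment -- the /24 prefix lengths here are illustrative assumptions
ifconfig_ix0="inet 198.18.0.1/24 -tso4 -tso6 -lro"
ifconfig_ix1="inet 198.19.0.1/24 -tso4 -tso6 -lro"
```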

=== IP configuration on DUT ===

<code>
# IPv4 router
gateway_enable="
static_routes="
route_generator="
route_receiver="
ifconfig_ix0="
ifconfig_ix1="
static_arp_pairs="
static_arp_HPvcxl0="
static_arp_HPvcxl1="

# IPv6 router
ipv6_gateway_enable="
ipv6_activate_all_interfaces="
ipv6_static_routes="
ipv6_route_generator="
ipv6_route_receiver="
ifconfig_ix0_ipv6="
ifconfig_ix1_ipv6="
static_ndp_pairs="
static_ndp_HPvcxl0="
static_ndp_HPvcxl1="
</code>
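The values above are truncated. For readers unfamiliar with FreeBSD's ''rc.d/static_arp'' mechanism, here is a filled-in sketch of the IPv4 static-route and static-ARP half — every prefix length, pair name and MAC address below is a placeholder, not this lab's real value:

```shell
# Hypothetical rc.conf fragment -- prefixes and MACs are illustrative only
static_routes="generator receiver"
route_generator="-net 192.18.0.0/16 198.18.0.110"   # towards the generator
route_receiver="-net 192.19.0.0/16 198.19.0.110"    # towards the receiver
static_arp_pairs="gen rec"
static_arp_gen="198.18.0.110 00:07:43:00:00:01"     # placeholder MAC
static_arp_rec="198.19.0.110 00:07:43:00:00:02"     # placeholder MAC
```

Each name listed in ''static_arp_pairs'' selects a ''static_arp_<name>'' variable holding an "IP MAC" pair; the same pattern applies to ''static_ndp_pairs'' for IPv6.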
===== Routing performance with default BSDRP values =====
==== Default fast-forwarding performance in front of a line-rate generator ====

Behaviour in front of a multi-flow traffic generator at line rate, 14.8Mpps (thanks Chelsio!); netstat on the DUT reports:

It is impossible to enter any command on the DUT during the load: all 4 cores are overloaded.

But only 2.8Mpps are received (and therefore forwarded) on the receiver:
<code>
242.851700 main_thread [2277] 2870783 pps (2873654 pkts 1379353920 bps in 1001000 usec) 17.79 avg_batch 3584 min_space
243.853699 main_thread [2277] 2869423 pps (2875162 pkts 1380077760 bps in 1002000 usec) 17.77 avg_batch 1956 min_space
244.854700 main_thread [2277] 2870532 pps (2873403 pkts 1379233440 bps in 1001000 usec) 17.78 avg_batch 2022 min_space
245.855699 main_thread [2277] 2872424 pps (2875296 pkts 1380142080 bps in 1001000 usec) 17.79 avg_batch 1949 min_space
246.856699 main_thread [2277] 2871882 pps (2874754 pkts 1379881920 bps in 1001000 usec) 17.79 avg_batch 3584 min_space
247.857699 main_thread [2277] 2871047 pps (2873918 pkts 1379480640 bps in 1001000 usec) 17.78 avg_batch 3584 min_space
248.858700 main_thread [2277] 2870945 pps (2873816 pkts 1379431680 bps in 1001000 usec) 17.79 avg_batch 1792 min_space
249.859699 main_thread [2277] 2870647 pps (2873518 pkts 1379288640 bps in 1001000 usec) 17.78 avg_batch 1959 min_space
250.860699 main_thread [2277] 2870222 pps (2873092 pkts 1379084160 bps in 1001000 usec) 17.78 avg_batch 1956 min_space
251.861699 main_thread [2277] 2870311 pps (2873178 pkts 1379125440 bps in 1000999 usec) 17.78 avg_batch 1792 min_space
252.862699 main_thread [2277] 2870795 pps (2873669 pkts 1379361120 bps in 1001001 usec) 17.78 avg_batch 2024 min_space
</code>

The traffic is correctly load-balanced across the queues:

<code>
[root@DUT]~# sysctl dev.ix.0 | grep rx_packet
dev.ix.0.queue3.rx_packets:
dev.ix.0.queue2.rx_packets:
dev.ix.0.queue1.rx_packets:
dev.ix.0.queue0.rx_packets:
[root@R1]~# sysctl dev.ix.1 | grep tx_packet
dev.ix.1.queue3.tx_packets:
dev.ix.1.queue2.tx_packets:
dev.ix.1.queue1.tx_packets:
dev.ix.1.queue0.tx_packets:
</code>
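The counter values above are truncated; what matters is that the four per-queue counters stay close to each other. A small sketch that computes the spread between the busiest and idlest queue from such output — the sample numbers are invented purely for illustration:

```shell
# Sample values are hypothetical; in practice pipe in
# `sysctl dev.ix.0 | grep rx_packets` instead.
sample='dev.ix.0.queue3.rx_packets: 250000
dev.ix.0.queue2.rx_packets: 251000
dev.ix.0.queue1.rx_packets: 249000
dev.ix.0.queue0.rx_packets: 250500'

# Track min and max counter values, then report (max-min)/min as a percentage.
spread=$(echo "$sample" | awk '
    { if ($2 > max) max = $2; if (min == 0 || $2 < min) min = $2 }
    END { printf "%.1f", (max - min) * 100 / min }')
echo "spread: ${spread}%"
```

A spread of a few percent means RSS is hashing the generated flows evenly across the receive queues.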

Where does the system spend its time?

<code>
[root@DUT]~#
[root@DUT]~#

PMC: [INSTR_RETIRED_ANY] Samples: 99530 (100.0%) , 0 unresolved

%SAMP IMAGE      FUNCTION
  6.5 kernel
  5.8 kernel
  5.1 kernel
  3.7 kernel
  2.8 kernel
  2.7 kernel
  2.7 kernel
  2.6 kernel
  2.6 kernel
  2.6 kernel
  2.6 kernel
  2.4 kernel
  2.3 libc.so.7
  2.2 kernel
  2.2 kernel
  2.1 kernel
  2.1 kernel
  1.9 kernel
  1.8 kernel
  1.7 kernel
  1.6 kernel
  1.5 kernel
  1.5 kernel
  1.5 kernel
  1.4 kernel
  1.4 kernel
  1.3 pmcstat
  1.1 pmcstat
  1.0 kernel
  1.0 kernel
  1.0 kernel
  0.9 kernel
  0.8 kernel
  0.8 kernel
  0.8 kernel
  0.7 kernel
  0.7 kernel
  0.6 kernel
  0.6 kernel
  0.5 kernel
  0.5 kernel
</code>

=> Most of the time is spent in ixgbe_rxeof.
==== Equilibrium throughput ====

Previous methodology,

From the pkt-generator,
<code>
[root@pkt-gen]~#
Benchmark tool using equilibrium throughput method
- Benchmark mode: Throughput (pps) for Router
- UDP load = 18B, IPv4 packet size=46B, Ethernet frame size=60B
- Link rate = 4000 Kpps
- Tolerance = 0.01
Iteration 1
  - Offering load = 2000 Kpps
  - Step = 1000 Kpps
  - Measured forwarding rate = 1999 Kpps
Iteration 2
  - Offering load = 3000 Kpps
  - Step = 1000 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2949 Kpps
Iteration 3
  - Offering load = 2500 Kpps
  - Step = 500 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 2499 Kpps
Iteration 4
  - Offering load = 2750 Kpps
  - Step = 250 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2750 Kpps
Iteration 5
  - Offering load = 2875 Kpps
  - Step = 125 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2875 Kpps
Iteration 6
  - Offering load = 2937 Kpps
  - Step = 62 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2933 Kpps
Iteration 7
  - Offering load = 2968 Kpps
  - Step = 31 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2948 Kpps
Estimated Equilibrium Ethernet throughput = 2948 Kpps (maximum value seen: 2949 Kpps)
</code>

=> Same result with the equilibrium method: 2.9Mpps.
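The search logic visible in the transcript can be mimicked with a toy model: the offered load moves up while the DUT keeps up (within the 1% tolerance) and down when it drops packets, with the step halving from the second iteration on. This is a sketch of the method, not the real bench script; the DUT is modelled as a hard 2948 Kpps cap:

```shell
# Toy equilibrium search. Assumption: the DUT forwards at most `capacity` Kpps.
capacity=2948
load=2000; step=1000; best=0; i=1
while [ "$step" -ge 31 ]; do
    fwd=$load
    [ "$fwd" -gt "$capacity" ] && fwd=$capacity   # DUT caps out at its limit
    [ "$fwd" -gt "$best" ] && best=$fwd
    if [ $((fwd * 100)) -ge $((load * 99)) ]; then
        load=$((load + step))        # kept up within 1% tolerance: offer more
    else
        load=$((load - step / 2))    # dropped packets: back off half a step
    fi
    [ "$i" -ge 2 ] && step=$((step / 2))          # halve step from iteration 2 on
    i=$((i + 1))
done
echo "estimated equilibrium: ${best} Kpps"
```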
==== Firewall impact ====

One rule for each firewall and 2000 UDP "

[[https://

{{:
===== Routing performance with multiple static routes =====

FreeBSD has had route-lookup contention problems: this setup uses only one static route (192.19.0.0/

By splitting this single route into 4 or 8 more-specific routes, we should obtain better results.

This bench method uses 100 different destination IP addresses, from 198.19.10.1 to 198.19.10.100, first with 4 routes:
  - 198.19.10.0/
  - 198.19.10.32/
  - 198.19.10.64/
  - 198.19.10.96/
then with 8 routes:
  - 198.19.10.0/
  - 198.19.10.16/
  - 198.19.10.32/
  - 198.19.10.48/
  - 198.19.10.64/
  - 198.19.10.80/
  - 198.19.10.96/
  - 198.19.10.112/
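The prefix lengths are truncated above, but the .0/.32/.64/.96 boundaries suggest /27 routes (32 addresses each) for the 4-way split and /28 routes for the 8-way split — that is an inference, not something stated on this page. Under that assumption, both lists can be generated like this:

```shell
# Split the assumed 198.19.10.0/25 covering range into N equal slices.
slice() {  # $1 = number of routes, $2 = assumed prefix length
    out=""; i=0; width=$((128 / $1))
    while [ "$i" -lt 128 ]; do
        out="${out}198.19.10.${i}/$2 "
        i=$((i + width))
    done
    echo "$out"
}

four=$(slice 4 27)    # 198.19.10.0/27 198.19.10.32/27 ... 198.19.10.96/27
eight=$(slice 8 28)   # 198.19.10.0/28 198.19.10.16/28 ... 198.19.10.112/28
echo "$eight"
```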

Configuration examples:
<code>
sysrc static_routes="
sysrc route_generator="
sysrc -x route_receiver
sysrc route_receiver1="
sysrc route_receiver2="
sysrc route_receiver3="
sysrc route_receiver4="
</code>

==== Graphs ====

{{:

=> A small 4% increase using 4 static routes in place of 1, and 5% using 8 routes.
==== Ministat ====

<code>
x pps.one-route
+ pps.four-routes
* pps.eight-routes
+--------------------------------------------------------------------------+
|
|
|
||_________M______A________________|
|
|
+--------------------------------------------------------------------------+
    N
x
+
No difference proven at 95.0% confidence
*
No difference proven at 95.0% confidence
</code>
documentation/examples/forwarding_performance_lab_of_an_ibm_system_x3550_m3_with_10-gigabit_intel_82599eb.txt · Last modified: 2017/07/12 11:45 by 127.0.0.1