+ | ====== Forwarding performance lab of a HP ProLiant DL360p Gen8 with 10-Gigabit Chelsio T540-CR ====== | ||
+ | {{description> | ||
+ | |||
+ | ===== Bench lab ===== | ||
+ | |||
+ | ==== Hardware detail ==== | ||
+ | |||
+ | This lab will test an [[HP ProLiant DL360p Gen8]] with **eight** cores (Intel Xeon E5-2650 | ||
+ | |||
+ | The lab is detailed here: [[documentation: | ||
+ | |||
+ | === Diagram === | ||
+ | |||
+ | < | ||
+ | +------------------------------------------+ +-------+ +------------------------------+ | ||
+ | | Device under test | |Juniper| | Packet generator & receiver | ||
+ | | | | QFX | | | | ||
+ | | cxl0: 198.18.0.10/ | ||
+ | | 2001: | ||
+ | | (00: | ||
+ | | | | | | | | ||
+ | | cxl1: 198.19.0.10/ | ||
+ | | | ||
+ | | (00: | ||
+ | | | | ||
+ | | static routes | ||
+ | | 192.18.0.0/ | ||
+ | | 192.19.0.0/ | ||
+ | | 2001: | ||
+ | | 2001: | ||
+ | | | | ||
+ | | static arp and ndp | | / | ||
+ | | 198.18.0.108 | ||
+ | | 2001: | ||
+ | | | | ||
+ | | 198.19.0.108 | ||
+ | | 2001: | ||
+ | +------------------------------------------+ | ||
+ | </ | ||
+ | |||
+ | The generator **MUST** generate lots of smallest-size IP flows (multiple source/ | ||
+ | |||
+ | Here is an example of generating 2000 flows (100 different source IPs * 20 different destination IPs) at line-rate using 2 threads: | ||
+ | < | ||
+ | pkt-gen -i vcxl0 -f tx -n 1000000000 -l 60 -d 198.19.10.1: | ||
+ | </ | ||
+ | |||
+ | And the same with IPv6 flows (minimum frame size of 62 here): | ||
+ | < | ||
+ | pkt-gen -f tx -i vcxl0 -n 1000000000 -l 62 -6 -d " | ||
+ | </ | ||
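For reference, the line rate the generator is chasing at this frame size follows from Ethernet on-wire overhead. A minimal sketch of the standard 10GbE math (this calculation is mine, not from the original page):

```python
# Theoretical max packet rate on 10GbE at minimum frame size.
# pkt-gen's -l 60 is the frame without FCS; on the wire each frame
# also carries 4 B FCS + 8 B preamble/SFD + 12 B inter-frame gap.
LINK_BPS = 10_000_000_000   # 10 Gbit/s
FRAME = 60                  # bytes, as passed with -l 60
OVERHEAD = 4 + 8 + 12       # FCS + preamble/SFD + IFG, in bytes

pps = LINK_BPS // ((FRAME + OVERHEAD) * 8)
print(pps)  # -> 14880952 packets/s, i.e. about 14.88 Mpps
```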
+ | |||
+ | |||
+ | <note warning> | ||
+ | Netmap disables hardware checksum offload on the NIC. If you can't re-enable hardware checksum in netmap mode (as with Intel NICs), you need a FreeBSD -head with svn revision 257758 minimum with the [[https:// | ||
+ | </ | ||
+ | The receiver will use this command: | ||
+ | < | ||
+ | pkt-gen -i vcxl1 -f rx -w 4 | ||
+ | </ | ||
+ | ==== Basic configuration ==== | ||
+ | |||
+ | === Disabling Ethernet flow-control === | ||
+ | |||
+ | First, disable Ethernet flow-control on both servers. The Chelsio T540s are configured like this: | ||
+ | < | ||
+ | echo " | ||
+ | echo " | ||
+ | service sysctl reload | ||
+ | </ | ||
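The echoed sysctl lines above are truncated in this revision. On cxgbe(4) the per-port pause settings are exposed as `dev.cxl.<port>.pause_settings`; a hedged sketch of the intent (port indices and file path are assumptions — verify with `sysctl -a | grep pause` on your system):

```shell
# Assumed cxgbe(4) per-port sysctls; adjust port indices to your setup
echo 'dev.cxl.0.pause_settings=0' >> /etc/sysctl.conf
echo 'dev.cxl.1.pause_settings=0' >> /etc/sysctl.conf
service sysctl reload
```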
+ | |||
+ | === Disabling LRO and TSO === | ||
+ | |||
+ | A router [[Documentation: | ||
+ | |||
+ | ==== IP Configuration on DUT ==== | ||
+ | |||
+ | / | ||
+ | < | ||
+ | # IPv4 router | ||
+ | gateway_enable=" | ||
+ | ifconfig_cxl0=" | ||
+ | ifconfig_cxl1=" | ||
+ | static_routes=" | ||
+ | route_generator=" | ||
+ | route_receiver=" | ||
+ | static_arp_pairs=" | ||
+ | static_arp_generator=" | ||
+ | static_arp_receiver=" | ||
+ | |||
+ | # IPv6 router | ||
+ | ipv6_gateway_enable=" | ||
+ | ipv6_activate_all_interfaces=" | ||
+ | ifconfig_cxl0_ipv6=" | ||
+ | ifconfig_cxl1_ipv6=" | ||
+ | ipv6_static_routes=" | ||
+ | ipv6_route_generator=" | ||
+ | ipv6_route_receiver=" | ||
+ | static_ndp_pairs=" | ||
+ | static_ndp_generator=" | ||
+ | static_ndp_receiver=" | ||
+ | </ | ||
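After rebooting with this rc.conf, the router state can be confirmed with standard FreeBSD tools (a diagnostic sketch, not part of the original page):

```shell
# Both sysctls should report 1 once the gateway knobs are applied
sysctl net.inet.ip.forwarding
sysctl net.inet6.ip6.forwarding
# Confirm the static routes and ARP/NDP entries took effect
netstat -rn -f inet
arp -an
ndp -an
```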
+ | |||
+ | ===== Routing performance with default value ===== | ||
+ | ==== Default forwarding performance in front of a line-rate generator ==== | ||
+ | |||
+ | Trying the " | ||
+ | |||
+ | < | ||
+ | [root@hp]~# netstat -iw 1 | ||
+ | input (Total) | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | </ | ||
+ | |||
+ | The traffic is correctly load-balanced between NIC-queue/ | ||
+ | |||
+ | < | ||
+ | [root@hp]~# vmstat -i | grep t5nex0 | ||
+ | irq291: t5nex0: | ||
+ | irq292: t5nex0: | ||
+ | irq293: t5nex0: | ||
+ | irq294: t5nex0: | ||
+ | irq295: t5nex0: | ||
+ | irq296: t5nex0: | ||
+ | irq297: t5nex0: | ||
+ | irq298: t5nex0: | ||
+ | irq299: t5nex0: | ||
+ | irq305: t5nex0: | ||
+ | irq306: t5nex0: | ||
+ | irq307: t5nex0: | ||
+ | irq308: t5nex0: | ||
+ | irq309: t5nex0: | ||
+ | irq310: t5nex0: | ||
+ | irq311: t5nex0: | ||
+ | irq312: t5nex0: | ||
+ | |||
+ | [root@hp]~# top -nCHSIzs1 | ||
+ | last pid: 2032; load averages: | ||
+ | 205 processes: 12 running, 106 sleeping, 87 waiting | ||
+ | |||
+ | Mem: 13M Active, 728K Inact, 504M Wired, 23M Buf, 62G Free | ||
+ | Swap: | ||
+ | |||
+ | |||
+ | PID USERNAME | ||
+ | 11 root | ||
+ | 11 root | ||
+ | 11 root | ||
+ | 11 root | ||
+ | 11 root | ||
+ | 11 root | ||
+ | 11 root | ||
+ | 11 root | ||
+ | 11 root | ||
+ | </ | ||
+ | |||
+ | Where does the system spend its time? | ||
+ | |||
+ | < | ||
+ | [root@hp]~# kldload hwpmc | ||
+ | [root@hp]~# pmcstat -TS CPU_CLK_UNHALTED_CORE -w 1 | ||
+ | PMC: [CPU_CLK_UNHALTED_CORE] Samples: 320832 (100.0%) , 0 unresolved | ||
+ | |||
+ | %SAMP IMAGE FUNCTION | ||
+ | 21.4 kernel | ||
+ | 15.8 kernel | ||
+ | 8.8 kernel | ||
+ | 6.3 kernel | ||
+ | 4.1 kernel | ||
+ | 3.6 kernel | ||
+ | 2.6 kernel | ||
+ | 2.0 kernel | ||
+ | 2.0 kernel | ||
+ | 2.0 libc.so.7 | ||
+ | 1.7 kernel | ||
+ | 1.6 kernel | ||
+ | 1.5 kernel | ||
+ | 1.4 kernel | ||
+ | 1.3 kernel | ||
+ | 1.2 kernel | ||
+ | 1.2 kernel | ||
+ | 1.1 kernel | ||
+ | 1.1 kernel | ||
+ | 1.0 kernel | ||
+ | 1.0 kernel | ||
+ | |||
+ | </ | ||
+ | |||
+ | There is some lock contention on fib4_lookup_nh_basic(). | ||
+ | |||
+ | ==== Equilibrium throughput ==== | ||
+ | |||
+ | Previous methodology, | ||
+ | |||
+ | === IPv4 === | ||
+ | |||
+ | From the pkt-generator, | ||
+ | < | ||
+ | [root@pkt-gen]~# | ||
+ | Benchmark tool using equilibrium throughput method | ||
+ | - Benchmark mode: Throughput (pps) for Router | ||
+ | - UDP load = 18B, IPv4 packet size=46B, Ethernet frame size=60B | ||
+ | - Link rate = 10000 Kpps | ||
+ | - Tolerance = 0.01 | ||
+ | Iteration 1 | ||
+ | - Offering load = 5000 Kpps | ||
+ | - Step = 2500 Kpps | ||
+ | - Measured forwarding rate = 5000 Kpps | ||
+ | Iteration 2 | ||
+ | - Offering load = 7500 Kpps | ||
+ | - Step = 2500 Kpps | ||
+ | - Trend = increasing | ||
+ | - Measured forwarding rate = 5440 Kpps | ||
+ | Iteration 3 | ||
+ | - Offering load = 6250 Kpps | ||
+ | - Step = 1250 Kpps | ||
+ | - Trend = decreasing | ||
+ | - Measured forwarding rate = 5437 Kpps | ||
+ | Iteration 4 | ||
+ | - Offering load = 5625 Kpps | ||
+ | - Step = 625 Kpps | ||
+ | - Trend = decreasing | ||
+ | - Measured forwarding rate = 5442 Kpps | ||
+ | Iteration 5 | ||
+ | - Offering load = 5313 Kpps | ||
+ | - Step = 312 Kpps | ||
+ | - Trend = decreasing | ||
+ | - Measured forwarding rate = 5313 Kpps | ||
+ | Iteration 6 | ||
+ | - Offering load = 5469 Kpps | ||
+ | - Step = 156 Kpps | ||
+ | - Trend = increasing | ||
+ | - Measured forwarding rate = 5434 Kpps | ||
+ | Iteration 7 | ||
+ | - Offering load = 5391 Kpps | ||
+ | - Step = 78 Kpps | ||
+ | - Trend = decreasing | ||
+ | - Measured forwarding rate = 5390 Kpps | ||
+ | Estimated Equilibrium Ethernet throughput= 5390 Kpps (maximum value seen: 5442 Kpps) | ||
+ | </ | ||
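The equilibrium method visible in this log is a simple feedback search: start at half the link rate, halve the step each iteration, and raise or lower the offered load depending on whether the DUT kept up. A hypothetical Python sketch of the idea (not the actual bench script; `measure` stands in for one pkt-gen run):

```python
def equilibrium(measure, link_rate=10000, tolerance=0.01, min_step=50):
    """Search for the equilibrium forwarding rate (all rates in Kpps).

    measure(offered) returns the forwarding rate observed while offering
    `offered` Kpps.  Start at half link rate, halve the step each round,
    raise the load when the DUT keeps up, lower it when it drops packets.
    """
    offered = link_rate // 2
    step = link_rate // 4
    best = 0
    while step >= min_step:
        rate = measure(offered)
        best = max(best, rate)
        if rate >= offered * (1 - tolerance):  # DUT kept up: push harder
            offered += step
        else:                                  # DUT dropped: back off
            offered -= step
        step //= 2
    return measure(offered), best

# Hypothetical DUT that saturates at 5400 Kpps:
final, best = equilibrium(lambda offered: min(offered, 5400))
print(final, best)  # -> 5391 5400
```

With a DUT that tops out near 5400 Kpps the search converges just below that value, matching the shape of the iteration log above.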
+ | |||
+ | => About the same performance as the "under DoS" bench (only running this same bench multiple times can give valid " | ||
+ | |||
+ | === IPv6 === | ||
+ | |||
+ | From the pkt-generator, | ||
+ | < | ||
+ | [root@pkt-gen]~# | ||
+ | Benchmark tool using equilibrium throughput method | ||
+ | - Benchmark mode: Throughput (pps) for Router | ||
+ | - UDP load = 0B, IPv6 packet size=48B, Ethernet frame size=62B | ||
+ | - Link rate = 10000 Kpps | ||
+ | - Tolerance = 0.01 | ||
+ | Iteration 1 | ||
+ | - Offering load = 5000 Kpps | ||
+ | - Step = 2500 Kpps | ||
+ | - Measured forwarding rate = 2681 Kpps | ||
+ | Iteration 2 | ||
+ | - Offering load = 2500 Kpps | ||
+ | - Step = 2500 Kpps | ||
+ | - Trend = decreasing | ||
+ | - Measured forwarding rate = 2499 Kpps | ||
+ | Iteration 3 | ||
+ | - Offering load = 3750 Kpps | ||
+ | - Step = 1250 Kpps | ||
+ | - Trend = increasing | ||
+ | - Measured forwarding rate = 2682 Kpps | ||
+ | Iteration 4 | ||
+ | - Offering load = 3125 Kpps | ||
+ | - Step = 625 Kpps | ||
+ | - Trend = decreasing | ||
+ | - Measured forwarding rate = 2681 Kpps | ||
+ | Iteration 5 | ||
+ | - Offering load = 2813 Kpps | ||
+ | - Step = 312 Kpps | ||
+ | - Trend = decreasing | ||
+ | - Measured forwarding rate = 2681 Kpps | ||
+ | Iteration 6 | ||
+ | - Offering load = 2657 Kpps | ||
+ | - Step = 156 Kpps | ||
+ | - Trend = decreasing | ||
+ | - Measured forwarding rate = 2657 Kpps | ||
+ | Iteration 7 | ||
+ | - Offering load = 2735 Kpps | ||
+ | - Step = 78 Kpps | ||
+ | - Trend = increasing | ||
+ | - Measured forwarding rate = 2680 Kpps | ||
+ | Iteration 8 | ||
+ | - Offering load = 2696 Kpps | ||
+ | - Step = 39 Kpps | ||
+ | - Trend = decreasing | ||
+ | - Measured forwarding rate = 2679 Kpps | ||
+ | Estimated Equilibrium Ethernet throughput= 2679 Kpps (maximum value seen: 2682 Kpps) | ||
+ | </ | ||
+ | |||
+ | From 5.4 Mpps in IPv4, throughput drops to 2.67 Mpps in IPv6 (no fastforward path for IPv6). | ||
+ | ==== Firewall impact ==== | ||
+ | |||
+ | One rule for each firewall and 2000 UDP " | ||
+ | |||
+ | {{documentation: | ||
+ | |||
+ | ===== Tuning ===== | ||
+ | |||
+ | === BIOS === | ||
+ | |||
+ | Disable Hyperthreading: | ||
+ | |||
+ | {{documentation: | ||
+ | |||
+ | === Chelsio drivers === | ||
+ | |||
+ | == Reducing NIC queues (FreeBSD 11.0 or older only)== | ||
+ | |||
+ | By default the queue counts are: | ||
+ | * TX: 16 or ncpu if ncpu<16 | ||
+ | * RX: 8 or ncpu if ncpu<8 | ||
+ | |||
+ | In our case they are equal to 8: | ||
+ | < | ||
+ | [root@hp]~# sysctl dev.cxl.3.nrxq | ||
+ | dev.cxl.3.nrxq: | ||
+ | [root@hp]~# sysctl dev.cxl.3.ntxq | ||
+ | dev.cxl.3.ntxq: | ||
+ | </ | ||
+ | |||
+ | Here is how to change the number of queues to 4: | ||
+ | < | ||
+ | mount -uw / | ||
+ | echo ' | ||
+ | echo ' | ||
+ | mount -ur / | ||
+ | reboot | ||
+ | </ | ||
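The echoed loader tunables above are truncated. With the stock cxgbe(4) driver, queue counts for 10G ports are loader tunables along these lines (names taken from cxgbe(4) — treat them as assumptions and check the man page for your FreeBSD version):

```shell
# /boot/loader.conf -- assumed cxgbe(4) tunables for 4 queues per 10G port
echo 'hw.cxgbe.nrxq10g="4"' >> /boot/loader.conf
echo 'hw.cxgbe.ntxq10g="4"' >> /boot/loader.conf
```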
+ | |||
+ | {{: | ||
+ | |||
+ | <note warning> | ||
+ | On an 8-core machine, we had to reduce the number of NIC queues to 4 on FreeBSD 11.0 and older. | ||
+ | </ | ||
+ | |||
+ | === Descriptor ring size === | ||
+ | |||
+ | The size, in number of entries, of the descriptor ring used for each RX and TX queue is 1024 by default. | ||
+ | |||
+ | < | ||
+ | [root@hp]~# sysctl dev.cxl.3.qsize_rxq | ||
+ | dev.cxl.3.qsize_rxq: | ||
+ | [root@hp]~# sysctl dev.cxl.2.qsize_rxq | ||
+ | dev.cxl.2.qsize_rxq: | ||
+ | </ | ||
+ | |||
+ | Let's change them to different values (1024, 2048 and 4096) and measure the impact: | ||
+ | |||
+ | < | ||
+ | mount -uw / | ||
+ | echo ' | ||
+ | echo ' | ||
+ | mount -ur / | ||
+ | reboot | ||
+ | </ | ||
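Likewise, the ring sizes are set through cxgbe(4) loader tunables; a hedged sketch for the 2048-entry case (tunable names assumed from cxgbe(4), verify against your driver version):

```shell
# /boot/loader.conf -- assumed cxgbe(4) tunables for 2048-entry rings
echo 'hw.cxgbe.qsize_rxq10g="2048"' >> /boot/loader.conf
echo 'hw.cxgbe.qsize_txq10g="2048"' >> /boot/loader.conf
```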
+ | |||
+ | {{: | ||
+ | |||
+ | Ministat: | ||
+ | |||
+ | < | ||
+ | x pps.qsize1024 | ||
+ | + pps.qsize2048 | ||
+ | * pps.qsize4096 | ||
+ | +--------------------------------------------------------------------------+ | ||
+ | |x x *+x | ||
+ | | | ||
+ | | |___________________A_____M____________|| | ||
+ | | |______________A_M____________| | ||
+ | +--------------------------------------------------------------------------+ | ||
+ | N | ||
+ | x | ||
+ | + | ||
+ | No difference proven at 95.0% confidence | ||
+ | * | ||
+ | No difference proven at 95.0% confidence | ||
+ | </ | ||
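ministat decides "no difference proven" with a Student's t test at the chosen confidence level. A minimal sketch of that comparison for two sets of 5 runs (the critical value is hard-coded for df = 8; the sample values below are hypothetical):

```python
import math
import statistics

# Two-tailed Student's t critical values at 95% confidence, by degrees
# of freedom; only df = 8 (two sets of 5 runs) is needed here.
T_CRIT_95 = {8: 2.306}

def differs_95(a, b):
    """Pooled two-sample t test, the comparison ministat performs."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    t = abs(statistics.mean(a) - statistics.mean(b)) \
        / math.sqrt(pooled * (1 / na + 1 / nb))
    return t > T_CRIT_95[na + nb - 2]

# Hypothetical pps samples: heavily overlapping sets, so the test
# cannot prove a difference at 95% confidence
print(differs_95([5400, 5410, 5395, 5405, 5400],
                 [5402, 5408, 5399, 5406, 5401]))  # -> False
```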
+ | |||
+ | Reading the graph, a qsize of 2048 seems to behave better, but ministat across the 5 bench runs says there is no proven difference. | ||
+ | |||
documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr.txt · Last modified: 2019/12/27 11:41 by olivier