The generator **MUST** generate lots of small IP flows (multiple source/destination IP addresses and/or UDP source/destination ports).
  
Here is an example generating about 5000 flows (multiple source IPs * multiple destination IPs) at line rate, using 2 threads:
<code>
pkt-gen -f tx -w 2 -i vcxl0 -n 1000000000 -l 60 -4 -p 2 -S 00:07:43:2f:fe:b2 -s 198.18.10.1:2001-198.18.10.71 -D 00:07:43:2e:e4:70 -d 198.19.10.1:2001-198.19.10.70
</code>
  
And the same with IPv6 flows (minimum frame size of 62 here):
<code>
pkt-gen -N -f tx -w 2 -i vcxl0 -n 1000000000 -l 62 -6 -p 2 -S 00:07:43:2f:fe:b2 -s "[2001:2:0:10::1]-[2001:2:0:10::47]" -D 00:07:43:2e:e4:70 -d "[2001:2:0:8010::1]-[2001:2:0:8010::46]"
</code>
  
<note warning>
Receiver will use this command:
<code>
pkt-gen -i vcxl1 -f rx -w 2
</code>
==== Configuration and tuning ====

=== Disabling Ethernet flow-control ===

First, disable Ethernet flow-control on both servers. The Chelsio T540 NICs are configured like this:
<code>
echo "dev.cxl.2.pause_settings=0" >> /etc/sysctl.conf
echo "dev.cxl.3.pause_settings=0" >> /etc/sysctl.conf
service sysctl reload
</code>
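
To confirm that flow-control is now disabled, read the values back:
<code>
sysctl dev.cxl.2.pause_settings dev.cxl.3.pause_settings
</code>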

=== Disabling LRO and TSO ===

A router [[Documentation:Technical docs:Performance|should not use LRO and TSO]]. BSDRP disables them by default using an RC script (disablelrotso_enable="YES" in /etc/rc.conf.misc).
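
On a stock FreeBSD, without this RC script, the same result can be obtained by hand with ifconfig:
<code>
ifconfig cxl0 -tso4 -tso6 -lro
ifconfig cxl1 -tso4 -tso6 -lro
</code>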

==== IP Configuration on DUT ====

/etc/rc.conf:
<code>
# IPv4 router
gateway_enable="YES"
ifconfig_cxl0="inet 198.18.0.10/24 -tso4 -tso6 -lro"
ifconfig_cxl1="inet 198.19.0.10/24 -tso4 -tso6 -lro"
static_routes="generator receiver"
route_generator="-net 198.18.0.0/16 198.18.0.108"
route_receiver="-net 198.19.0.0/16 198.19.0.108"
static_arp_pairs="generator receiver"
static_arp_generator="198.18.0.108 00:07:43:2e:e5:92"
static_arp_receiver="198.19.0.108 00:07:43:2e:e5:9a"

# IPv6 router
ipv6_gateway_enable="YES"
ipv6_activate_all_interfaces="YES"
ifconfig_cxl0_ipv6="inet6 2001:2::10 prefixlen 64"
ifconfig_cxl1_ipv6="inet6 2001:2:0:8000::10 prefixlen 64"
ipv6_static_routes="generator receiver"
ipv6_route_generator="2001:2:: -prefixlen 49 2001:2::108"
ipv6_route_receiver="2001:2:0:8000:: -prefixlen 49 2001:2:0:8000::108"
static_ndp_pairs="generator receiver"
static_ndp_generator="2001:2::108 00:07:43:2e:e5:92"
static_ndp_receiver="2001:2:0:8000::108 00:07:43:2e:e5:9a"
</code>
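
To apply this configuration without a full reboot, restarting the relevant RC services should be enough (a sketch; the service names assume a standard FreeBSD rc framework):
<code>
service netif restart
service routing restart
service static_arp restart
service static_ndp restart
</code>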

===== Routing performance with default value =====
==== Default forwarding performance in front of a line-rate generator ====

Trying the "worst" scenario, a router receiving multiple flows at almost line rate: only about 5Mpps are correctly forwarded (FreeBSD 12.0-CURRENT #1 r309510).

<code>
[root@hp]~# netstat -iw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
   5284413          338202480    5284335      338197652     0
   5286851          338358470    5286409      338330290     0
   5286751          338352070    5287901      338381618     0
   5292884          338743878    5291437      338696178     0
   5288965          338494470    5289786      338546546     0
   5295780          338929926    5295438      338908082     0
   5283945          338172486    5284276      338193778     0
   5279643          337896454    5279034      337858226     0
   5287808          338420422    5288619      338471986     0
   5365838          343413638    5364807      343347634     0
   5315300          340179206    5316103      340230706     0
   5286934          338363782    5286508      338336626     0
   5279085          337861446    5279649      337897650     0
   5286925          338363206    5286167      338314738     0
   5295751          338928070    5296703      338989106     0
   5284070          338180166    5283137      338121010     0
   5285692          338282246    5285963      338301554     0
   5285824          338295110    5285093      338246194     0
</code>

The traffic is correctly load-balanced across the NIC queues (one IRQ per queue, each bound to a CPU):

<code>
[root@hp]~# vmstat -i | grep t5nex0
irq291: t5nex0:evt                              0
irq292: t5nex0:0a0                 44709         21
irq293: t5nex0:0a1               1063763        500
irq294: t5nex0:0a2                867671        408
irq295: t5nex0:0a3               1221772        575
irq296: t5nex0:0a4               1180242        555
irq297: t5nex0:0a5               1265724        595
irq298: t5nex0:0a6               1196989        563
irq299: t5nex0:0a7               1219212        574
irq305: t5nex0:1a0                  7028          3
irq306: t5nex0:1a1                  5625          3
irq307: t5nex0:1a2                  5653          3
irq308: t5nex0:1a3                  5697          3
irq309: t5nex0:1a4                  5882          3
irq310: t5nex0:1a5                  5784          3
irq311: t5nex0:1a6                  5617          3
irq312: t5nex0:1a7                  5613          3

[root@hp]~# top -nCHSIzs1
last pid:  2032;  load averages:  4.91,  3.52,  1.88  up 0+00:35:50    07:57:41
205 processes: 12 running, 106 sleeping, 87 waiting

Mem: 13M Active, 728K Inact, 504M Wired, 23M Buf, 62G Free
Swap:


  PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME    CPU COMMAND
   11 root       -92    -     0K  1440K CPU2    2   4:44  95.17% intr{irq292: t5nex0:0}
   11 root       -92    -     0K  1440K CPU5    5   5:04  91.26% intr{irq293: t5nex0:0}
   11 root       -92    -     0K  1440K CPU6    6   4:47  86.57% intr{irq294: t5nex0:0}
   11 root       -92    -     0K  1440K WAIT    3   4:51  86.47% intr{irq299: t5nex0:0}
   11 root       -92    -     0K  1440K WAIT    4   4:50  84.28% intr{irq298: t5nex0:0}
   11 root       -92    -     0K  1440K WAIT    7   4:39  82.28% intr{irq295: t5nex0:0}
   11 root       -92    -     0K  1440K WAIT    1   4:31  78.37% intr{irq297: t5nex0:0}
   11 root       -92    -     0K  1440K WAIT    0   4:19  74.56% intr{irq296: t5nex0:0}
   11 root       -60    -     0K  1440K WAIT    4   0:27   0.10% intr{swi4: clock (0)}
</code>

Where does the system spend its time?

<code>
[root@hp]~# kldload hwpmc
[root@hp]~# pmcstat -TS CPU_CLK_UNHALTED_CORE -w 1
PMC: [CPU_CLK_UNHALTED_CORE] Samples: 320832 (100.0%) , 0 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
 21.4 kernel     __rw_rlock           fib4_lookup_nh_basic:12.5 arpresolve:8.9
 15.8 kernel     _rw_runlock_cookie   fib4_lookup_nh_basic:9.8 arpresolve:5.9
  8.8 kernel     eth_tx               drain_ring
  6.3 kernel     bzero                ip_tryforward:1.7 fib4_lookup_nh_basic:1.6 ip_findroute:1.6 m_pkthdr_init:1.4
  4.1 kernel     bcopy                get_scatter_segment:1.7 arpresolve:1.3 eth_tx:1.1
  3.6 kernel     rn_match             fib4_lookup_nh_basic
  2.6 kernel     bcmp                 ether_nh_input
  2.0 kernel     ether_output         ip_tryforward
  2.0 kernel     mp_ring_enqueue      cxgbe_transmit
  2.0 libc.so.7  bsearch              0x63ac
  1.7 kernel     get_scatter_segment  service_iq
  1.6 kernel     ip_tryforward        ip_input
  1.5 kernel     cxgbe_transmit       ether_output
  1.4 kernel     fib4_lookup_nh_basic ip_findroute
  1.3 kernel     arpresolve           ether_output
  1.2 kernel     memcpy               ether_output
  1.2 kernel     ether_nh_input       netisr_dispatch_src
  1.1 kernel     service_iq           t4_intr
  1.1 kernel     reclaim_tx_descs     eth_tx
  1.0 kernel     drain_ring           mp_ring_enqueue
  1.0 kernel     uma_zalloc_arg       get_scatter_segment

</code>

There is some lock contention around fib4_lookup_nh_basic: most of the time is spent taking and releasing the routing-table read lock (__rw_rlock / _rw_runlock_cookie).
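
To dig deeper, hwpmc can also record samples to a file and build a callgraph from them (a sketch using standard pmcstat options; the file names are arbitrary):
<code>
pmcstat -S CPU_CLK_UNHALTED_CORE -O /tmp/samples.out sleep 10
pmcstat -R /tmp/samples.out -G /tmp/callchains.txt
head -n 20 /tmp/callchains.txt
</code>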

==== Equilibrium throughput ====

The previous methodology, generating about 14Mpps, is like testing the DUT under a denial-of-service. Try another methodology, known as [[documentation:examples:Setting up a VPN (IPSec, GRE, etc...) performance benchmark lab|equilibrium throughput]].

=== IPv4 ===

From the pkt-generator, start an estimation of the "equilibrium throughput", starting at 10Mpps:
<code>
[root@pkt-gen]~# equilibrium -d 00:07:43:2e:e4:70 -p -l 10000 -t vcxl0 -r vcxl1
Benchmark tool using equilibrium throughput method
- Benchmark mode: Throughput (pps) for Router
- UDP load = 18B, IPv4 packet size=46B, Ethernet frame size=60B
- Link rate = 10000 Kpps
- Tolerance = 0.01
Iteration 1
  - Offering load = 5000 Kpps
  - Step = 2500 Kpps
  - Measured forwarding rate = 5000 Kpps
Iteration 2
  - Offering load = 7500 Kpps
  - Step = 2500 Kpps
  - Trend = increasing
  - Measured forwarding rate = 5440 Kpps
Iteration 3
  - Offering load = 6250 Kpps
  - Step = 1250 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 5437 Kpps
Iteration 4
  - Offering load = 5625 Kpps
  - Step = 625 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 5442 Kpps
Iteration 5
  - Offering load = 5313 Kpps
  - Step = 312 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 5313 Kpps
Iteration 6
  - Offering load = 5469 Kpps
  - Step = 156 Kpps
  - Trend = increasing
  - Measured forwarding rate = 5434 Kpps
Iteration 7
  - Offering load = 5391 Kpps
  - Step = 78 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 5390 Kpps
Estimated Equilibrium Ethernet throughput= 5390 Kpps (maximum value seen: 5442 Kpps)
</code>

=> About the same performance as the "under DoS" bench (only running this same bench multiple times would give valid statistical data).
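
For example, a small loop can run the bench several times and extract the estimated value for ministat (a sketch; the awk field assumes the "Estimated Equilibrium Ethernet throughput=" summary-line format shown above):
<code>
for i in 1 2 3 4 5; do
  equilibrium -d 00:07:43:2e:e4:70 -p -l 10000 -t vcxl0 -r vcxl1 | \
    awk '/^Estimated Equilibrium/ {print $5}' >> pps.ipv4
done
ministat pps.ipv4
</code>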

=== IPv6 ===

From the pkt-generator, start an estimation of the "equilibrium throughput" in IPv6 mode, starting at 10Mpps:
<code>
[root@pkt-gen]~# equilibrium -d 00:07:43:2e:e4:70 -p -l 10000 -t vcxl0 -r vcxl1 -6
Benchmark tool using equilibrium throughput method
- Benchmark mode: Throughput (pps) for Router
- UDP load = 0B, IPv6 packet size=48B, Ethernet frame size=62B
- Link rate = 10000 Kpps
- Tolerance = 0.01
Iteration 1
  - Offering load = 5000 Kpps
  - Step = 2500 Kpps
  - Measured forwarding rate = 2681 Kpps
Iteration 2
  - Offering load = 2500 Kpps
  - Step = 2500 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 2499 Kpps
Iteration 3
  - Offering load = 3750 Kpps
  - Step = 1250 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2682 Kpps
Iteration 4
  - Offering load = 3125 Kpps
  - Step = 625 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 2681 Kpps
Iteration 5
  - Offering load = 2813 Kpps
  - Step = 312 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 2681 Kpps
Iteration 6
  - Offering load = 2657 Kpps
  - Step = 156 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 2657 Kpps
Iteration 7
  - Offering load = 2735 Kpps
  - Step = 78 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2680 Kpps
Iteration 8
  - Offering load = 2696 Kpps
  - Step = 39 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 2679 Kpps
Estimated Equilibrium Ethernet throughput= 2679 Kpps (maximum value seen: 2682 Kpps)
</code>

From 5.4Mpps in IPv4, throughput drops to 2.67Mpps in IPv6 (there is no fastforward path for IPv6).
==== Firewall impact ====

One rule for each firewall and 2000 UDP "sessions"; more information in the [[documentation:examples:forwarding_performance_lab_of_an_ibm_system_x3550_m3_with_intel_82580#firewall_impact|GigaEthernet performance lab]].
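
As a pure illustration of the "one rule" idea (the exact rule sets used are in the linked lab page), single stateful rules look like this:
<code>
# ipfw: single stateful rule (illustrative)
ipfw add 100 allow udp from 198.18.0.0/16 to 198.19.0.0/16 keep-state
# pf: single stateful rule in /etc/pf.conf (illustrative)
pass quick proto udp from 198.18.0.0/16 to 198.19.0.0/16 keep state
</code>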

{{documentation:examples:bench.ipfw.pf.HP-Gen8.png|Impact of ipfw and pf on 8 cores Xeon E5-2650 with Chelsio T540-CR on FreeBSD 10.1}}

===== Tuning =====

=== BIOS ===

Disable Hyper-Threading: these extra logical cores are very bad for routing performance.
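
If changing the BIOS is not possible, FreeBSD can be prevented from scheduling on the logical cores with a loader tunable (assuming a FreeBSD version that still honors machdep.hyperthreading_allowed):
<code>
echo 'machdep.hyperthreading_allowed="0"' >> /boot/loader.conf.local
</code>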

{{documentation:technical_docs:HyperThearding-impact-forwarding.png|Impact of HyperThreading on forwarding performance on FreeBSD}}

=== Chelsio drivers ===

== Reducing NIC queues (FreeBSD 11.0 or older only) ==

By default the numbers of queues are:
  * TX: 16, or ncpu if ncpu < 16
  * RX: 8, or ncpu if ncpu < 8

In our case they are therefore both equal to 8:
<code>
[root@hp]~# sysctl dev.cxl.3.nrxq
dev.cxl.3.nrxq: 8
[root@hp]~# sysctl dev.cxl.3.ntxq
dev.cxl.3.ntxq: 8
</code>

Here is how to change the number of queues to 4:
<code>
mount -uw /
echo 'hw.cxgbe.ntxq10g="4"' >> /boot/loader.conf.local
echo 'hw.cxgbe.nrxq10g="4"' >> /boot/loader.conf.local
mount -ur /
reboot
</code>
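
After the reboot, check that the driver picked up the new values:
<code>
sysctl dev.cxl.3.nrxq dev.cxl.3.ntxq
</code>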

{{:documentation:examples:bench.impact.of.num_queues.hp-gen8-chelsio-t540.png}}

<note warning>
On an 8-core machine, we had to reduce the number of NIC queues to 4 on FreeBSD 11.0 and older.
</note>

=== Descriptor ring size ===

The size, in number of entries, of the descriptor ring used for an RX or TX queue is 1024 by default.

<code>
[root@hp]~# sysctl dev.cxl.3.qsize_rxq
dev.cxl.3.qsize_rxq: 1024
[root@hp]~# sysctl dev.cxl.2.qsize_rxq
dev.cxl.2.qsize_rxq: 1024
</code>

Let's change them to different values (1024, 2048 and 4096) and measure the impact:

<code>
mount -uw /
echo 'hw.cxgbe.qsize_txq="4096"' >> /boot/loader.conf.local
echo 'hw.cxgbe.qsize_rxq="4096"' >> /boot/loader.conf.local
mount -ur /
reboot
</code>

{{:documentation:examples:bench.impact.of.descriptor.ring.size.hp-gen8-chelsio-t540.png}}

Ministat:

<code>
x pps.qsize1024
+ pps.qsize2048
* pps.qsize4096
+--------------------------------------------------------------------------+
|x    x                      *+x       +*        x**    x    +        *+ + |
|   |________________________A_M_____________________|                     |
|                                  |___________________A_____M____________||
|                                |______________A_M____________|           |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       2148318       2492921       2333251     2321334.2     154688.45
+   5       2328049       2596042       2525599       2484254     120543.04
No difference proven at 95.0% confidence
*   5       2325210       2581890       2452394     2442485.8     94913.584
No difference proven at 95.0% confidence
</code>

Reading the graph, it seems there is slightly better behaviour with a qsize of 2048, but ministat, fed with the 5 bench runs, says there is no proven difference.
  
[[https://github.com/ocochard/netbenches/tree/master/Xeon_E5-2650-8Cores-Chelsio_T540-CR/forwarding-pf-ipfw/configs|DUT configurations repository]]
=== Results ===
  
{{https://raw.githubusercontent.com/ocochard/netbenches/master/Xeon_E5-2650_8Cores-Chelsio_T540-CR/forwarding-pf-ipfw/results/fbsd12-stable.r354440.BSDRP.1.96/graph.png}}