====== Forwarding performance lab of an HP ProLiant DL360p Gen8 with 10-Gigabit Chelsio T540-CR ======
{{description>Forwarding performance lab of an eight-core Xeon E5-2650 2.60GHz and quad-port 10-Gigabit Chelsio T540-CR}}

===== Bench lab =====

==== Hardware details ====

This lab will test an [[HP ProLiant DL360p Gen8]] with **eight** cores (Intel Xeon E5-2650 @ 2.60GHz), a quad-port Chelsio 10-Gigabit T540-CR and optical SFP+ transceivers (SFP-10G-LR).

The lab is detailed here: [[documentation:examples:Setting up a forwarding performance benchmark lab]].

=== Diagram ===

<code>
+------------------------------------------+ +-------+ +------------------------------+
|        Device under test                 | |Juniper| | Packet generator & receiver  |
|                                          | |  QFX  | |                              |
|                cxl0: 198.18.0.10/24      |=|   <   |=| vcxl0: 198.18.0.108/24       |
|                      2001:2::10/64       | |       | |        2001:2::108/64        |
|                      (00:07:43:2e:e4:70) | |       | |        (00:07:43:2e:e5:92)   |
|                                          | |       | |                              |
|                cxl1: 198.19.0.10/24      |=|   >   |=| vcxl1: 198.19.0.108/24       |
|                     2001:2:0:8000::10/64 | |       | |        2001:2:0:8000::108/64 |
|                      (00:07:43:2e:e4:78) | +-------+ |        (00:07:43:2e:e5:9a)   |
|                                          |           |                              |
|            static routes                 |           |                              |
| 198.18.0.0/16      => 198.18.0.108       |           |                              |
| 198.19.0.0/16      => 198.19.0.108       |           |                              |
| 2001:2::/49        => 2001:2::108        |           |                              |
| 2001:2:0:8000::/49 => 2001:2:0:8000::108 |           |                              |
|                                          |           |                              |
|        static arp and ndp                |           | /boot/loader.conf:           |
| 198.18.0.108        => 00:07:43:2e:e5:92 |           |   hw.cxgbe.num_vis=2         |
| 2001:2::108                              |           |                              |
|                                          |           |                              |
| 198.19.0.108        => 00:07:43:2e:e5:9a |           |                              |
| 2001:2:0:8000::108                       |           |                              |
+------------------------------------------+           +------------------------------+
</code>

The generator **MUST** generate lots of distinct smallest-size IP flows (multiple source/destination IP addresses and/or UDP source/destination ports).

Here is an example generating 2,000 flows (20 different source IPs * 100 different destination IPs) at line-rate using 2 threads:
<code>
pkt-gen -i vcxl0 -f tx -n 1000000000 -l 60 -d 198.19.10.1:2000-198.19.10.100 -D 00:07:43:2e:e4:70 -s 198.18.10.1:2000-198.18.10.20 -w 4 -p 2
</code>

And the same with IPv6 flows (the minimum frame size is 62 bytes here):
<code>
pkt-gen -f tx -i vcxl0 -n 1000000000 -l 62 -6 -d "[2001:2:0:8001::1]-[2001:2:0:8001::64]" -D 00:07:43:2e:e4:70 -s "[2001:2:0:1::1]-[2001:2:0:1::14]" -w 4 -p 2
</code>

<note warning>
Netmap disables hardware checksums on the NIC. If you can't re-enable hardware checksums in netmap mode (as with Intel NICs), you need a FreeBSD -head at SVN revision 257758 minimum with the [[https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187149|pkt-gen software-checksum patch]] in order to use multiple src/dst IPs or ports with netmap's pkt-gen. But this software-checksum patch will reduce performance from line-rate to about 10Mpps.
</note>
The receiver will use this command:
<code>
pkt-gen -i vcxl1 -f rx -w 4
</code>
==== Basic configuration ====

=== Disabling Ethernet flow-control ===

First, disable Ethernet flow-control on both servers. The Chelsio T540 is configured like this:
<code>
echo "dev.cxl.2.pause_settings=0" >> /etc/sysctl.conf
echo "dev.cxl.3.pause_settings=0" >> /etc/sysctl.conf
service sysctl reload
</code>
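
The result can then be checked by reading the same OIDs back (0 should mean both RX and TX pause are off):
<code>
sysctl dev.cxl.2.pause_settings dev.cxl.3.pause_settings
</code>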

=== Disabling LRO and TSO ===

A router [[Documentation:Technical docs:Performance|should not use LRO and TSO]]. BSDRP disables them by default using an RC script (disablelrotso_enable="YES" in /etc/rc.conf.misc).
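
On a stock FreeBSD the same result can be obtained by hand with ifconfig (a one-off equivalent of what the RC script and the rc.conf lines below do persistently):
<code>
ifconfig cxl0 -tso4 -tso6 -lro
ifconfig cxl1 -tso4 -tso6 -lro
</code>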

==== IP configuration on the DUT ====

/etc/rc.conf:
<code>
# IPv4 router
gateway_enable="YES"
ifconfig_cxl0="inet 198.18.0.10/24 -tso4 -tso6 -lro"
ifconfig_cxl1="inet 198.19.0.10/24 -tso4 -tso6 -lro"
static_routes="generator receiver"
route_generator="-net 198.18.0.0/16 198.18.0.108"
route_receiver="-net 198.19.0.0/16 198.19.0.108"
static_arp_pairs="generator receiver"
static_arp_generator="198.18.0.108 00:07:43:2e:e5:92"
static_arp_receiver="198.19.0.108 00:07:43:2e:e5:9a"

# IPv6 router
ipv6_gateway_enable="YES"
ipv6_activate_all_interfaces="YES"
ifconfig_cxl0_ipv6="inet6 2001:2::10 prefixlen 64"
ifconfig_cxl1_ipv6="inet6 2001:2:0:8000::10 prefixlen 64"
ipv6_static_routes="generator receiver"
ipv6_route_generator="2001:2:: -prefixlen 49 2001:2::108"
ipv6_route_receiver="2001:2:0:8000:: -prefixlen 49 2001:2:0:8000::108"
static_ndp_pairs="generator receiver"
static_ndp_generator="2001:2::108 00:07:43:2e:e5:92"
static_ndp_receiver="2001:2:0:8000::108 00:07:43:2e:e5:9a"
</code>
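
These settings are applied at boot; on a live system they can be applied with the standard rc.d scripts (an illustrative sequence, assuming stock FreeBSD rc.d), then the routes and static ARP/NDP entries verified:
<code>
service netif restart && service routing restart
service static_arp restart && service static_ndp restart
netstat -rn
arp -an && ndp -an
</code>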

===== Routing performance with default values =====
==== Default forwarding performance in front of a line-rate generator ====

Trying the worst-case scenario, with the router receiving multiple flows at near line-rate: only about 5.3Mpps are correctly forwarded (FreeBSD 12.0-CURRENT #1 r309510).

<code>
[root@hp]~# netstat -iw 1
            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
   5284413          338202480    5284335      338197652     0
   5286851          338358470    5286409      338330290     0
   5286751          338352070    5287901      338381618     0
   5292884          338743878    5291437      338696178     0
   5288965          338494470    5289786      338546546     0
   5295780          338929926    5295438      338908082     0
   5283945          338172486    5284276      338193778     0
   5279643          337896454    5279034      337858226     0
   5287808          338420422    5288619      338471986     0
   5365838          343413638    5364807      343347634     0
   5315300          340179206    5316103      340230706     0
   5286934          338363782    5286508      338336626     0
   5279085          337861446    5279649      337897650     0
   5286925          338363206    5286167      338314738     0
   5295751          338928070    5296703      338989106     0
   5284070          338180166    5283137      338121010     0
   5285692          338282246    5285963      338301554     0
   5285824          338295110    5285093      338246194     0
</code>

The traffic is correctly load-balanced across the NIC queues bound to the CPUs:

<code>
[root@hp]~# vmstat -i | grep t5nex0
irq291: t5nex0:evt                              0
irq292: t5nex0:0a0                 44709         21
irq293: t5nex0:0a1               1063763        500
irq294: t5nex0:0a2                867671        408
irq295: t5nex0:0a3               1221772        575
irq296: t5nex0:0a4               1180242        555
irq297: t5nex0:0a5               1265724        595
irq298: t5nex0:0a6               1196989        563
irq299: t5nex0:0a7               1219212        574
irq305: t5nex0:1a0                  7028          3
irq306: t5nex0:1a1                  5625          3
irq307: t5nex0:1a2                  5653          3
irq308: t5nex0:1a3                  5697          3
irq309: t5nex0:1a4                  5882          3
irq310: t5nex0:1a5                  5784          3
irq311: t5nex0:1a6                  5617          3
irq312: t5nex0:1a7                  5613          3

[root@hp]~# top -nCHSIzs1
last pid:  2032;  load averages:  4.91,  3.52,  1.88  up 0+00:35:50    07:57:41
205 processes: 12 running, 106 sleeping, 87 waiting

Mem: 13M Active, 728K Inact, 504M Wired, 23M Buf, 62G Free
Swap:

  PID USERNAME   PRI NICE   SIZE    RES STATE     TIME     CPU COMMAND
   11 root       -92    -     0K  1440K CPU2    2   4:44  95.17% intr{irq292: t5nex0:0}
   11 root       -92    -     0K  1440K CPU5    5   5:04  91.26% intr{irq293: t5nex0:0}
   11 root       -92    -     0K  1440K CPU6    6   4:47  86.57% intr{irq294: t5nex0:0}
   11 root       -92    -     0K  1440K WAIT    3   4:51  86.47% intr{irq299: t5nex0:0}
   11 root       -92    -     0K  1440K WAIT    4   4:50  84.28% intr{irq298: t5nex0:0}
   11 root       -92    -     0K  1440K WAIT    7   4:39  82.28% intr{irq295: t5nex0:0}
   11 root       -92    -     0K  1440K WAIT    1   4:31  78.37% intr{irq297: t5nex0:0}
   11 root       -92    -     0K  1440K WAIT    0   4:19  74.56% intr{irq296: t5nex0:0}
   11 root       -60    -     0K  1440K WAIT    4   0:27   0.10% intr{swi4: clock (0)}
</code>

Where does the system spend its time?

<code>
[root@hp]~# kldload hwpmc
[root@hp]~# pmcstat -TS CPU_CLK_UNHALTED_CORE -w 1
PMC: [CPU_CLK_UNHALTED_CORE] Samples: 320832 (100.0%) , 0 unresolved

%SAMP IMAGE      FUNCTION             CALLERS
 21.4 kernel     __rw_rlock           fib4_lookup_nh_basic:12.5 arpresolve:8.9
 15.8 kernel     _rw_runlock_cookie   fib4_lookup_nh_basic:9.8 arpresolve:5.9
  8.8 kernel     eth_tx               drain_ring
  6.3 kernel     bzero                ip_tryforward:1.7 fib4_lookup_nh_basic:1.6 ip_findroute:1.6 m_pkthdr_init:1.4
  4.1 kernel     bcopy                get_scatter_segment:1.7 arpresolve:1.3 eth_tx:1.1
  3.6 kernel     rn_match             fib4_lookup_nh_basic
  2.6 kernel     bcmp                 ether_nh_input
  2.0 kernel     ether_output         ip_tryforward
  2.0 kernel     mp_ring_enqueue      cxgbe_transmit
  2.0 libc.so.7  bsearch              0x63ac
  1.7 kernel     get_scatter_segment  service_iq
  1.6 kernel     ip_tryforward        ip_input
  1.5 kernel     cxgbe_transmit       ether_output
  1.4 kernel     fib4_lookup_nh_basic ip_findroute
  1.3 kernel     arpresolve           ether_output
  1.2 kernel     memcpy               ether_output
  1.2 kernel     ether_nh_input       netisr_dispatch_src
  1.1 kernel     service_iq           t4_intr
  1.1 kernel     reclaim_tx_descs     eth_tx
  1.0 kernel     drain_ring           mp_ring_enqueue
  1.0 kernel     uma_zalloc_arg       get_scatter_segment
</code>

There is some lock contention around fib4_lookup_nh_basic: the read-lock acquire/release (__rw_rlock and _rw_runlock_cookie) dominates the profile.
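
To see the full call chains behind those samples, hwpmc can also record to a log and render a system-wide callgraph (an illustrative invocation; see pmcstat(8) for the exact options):
<code>
pmcstat -S CPU_CLK_UNHALTED_CORE -l 10 -O /tmp/samples.pmc
pmcstat -R /tmp/samples.pmc -G /tmp/callgraph.txt
</code>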

==== Equilibrium throughput ====

The previous methodology, generating about 14Mpps, is like testing the DUT under a denial-of-service. Try another methodology, known as [[documentation:examples:Setting up a VPN (IPSec, GRE, etc...) performance benchmark lab|equilibrium throughput]].
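
The iterations below follow a simple dichotomy: offer a load, measure the forwarding rate, raise the offered load when the DUT kept up, lower it otherwise, and halve the step each time. A minimal sh sketch of that logic (a simplification, not the real tool; measure() is a hypothetical helper that runs pkt-gen at a given rate and reports the receiver's rate in Kpps):
<code>
#!/bin/sh
link=10000                # link rate in Kpps
tol=$((link / 100))       # tolerance 0.01, in Kpps
load=$((link / 2))        # start at half the link rate
step=$((link / 4))
while [ ${step} -ge ${tol} ]; do
        measured=$(measure ${load})       # hypothetical helper
        if [ ${measured} -ge ${load} ]; then
                load=$((load + step))     # DUT kept up: offer more
        else
                load=$((load - step))     # DUT dropped packets: back off
        fi
        step=$((step / 2))
done
echo "Estimated equilibrium throughput: ${load} Kpps"
</code>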

=== IPv4 ===

From the packet generator, start an estimation of the "equilibrium throughput", starting at 10Mpps:
<code>
[root@pkt-gen]~# equilibrium -d 00:07:43:2e:e4:70 -p -l 10000 -t vcxl0 -r vcxl1
Benchmark tool using equilibrium throughput method
- Benchmark mode: Throughput (pps) for Router
- UDP load = 18B, IPv4 packet size=46B, Ethernet frame size=60B
- Link rate = 10000 Kpps
- Tolerance = 0.01
Iteration 1
  - Offering load = 5000 Kpps
  - Step = 2500 Kpps
  - Measured forwarding rate = 5000 Kpps
Iteration 2
  - Offering load = 7500 Kpps
  - Step = 2500 Kpps
  - Trend = increasing
  - Measured forwarding rate = 5440 Kpps
Iteration 3
  - Offering load = 6250 Kpps
  - Step = 1250 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 5437 Kpps
Iteration 4
  - Offering load = 5625 Kpps
  - Step = 625 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 5442 Kpps
Iteration 5
  - Offering load = 5313 Kpps
  - Step = 312 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 5313 Kpps
Iteration 6
  - Offering load = 5469 Kpps
  - Step = 156 Kpps
  - Trend = increasing
  - Measured forwarding rate = 5434 Kpps
Iteration 7
  - Offering load = 5391 Kpps
  - Step = 78 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 5390 Kpps
Estimated Equilibrium Ethernet throughput= 5390 Kpps (maximum value seen: 5442 Kpps)
</code>

=> About the same performance as the "under DoS" bench (only running this same bench multiple times would give valid statistical data).

=== IPv6 ===

From the packet generator, start an estimation of the "equilibrium throughput" in IPv6 mode, starting at 10Mpps:
<code>
[root@pkt-gen]~# equilibrium -d 00:07:43:2e:e4:70 -p -l 10000 -t vcxl0 -r vcxl1 -6
Benchmark tool using equilibrium throughput method
- Benchmark mode: Throughput (pps) for Router
- UDP load = 0B, IPv6 packet size=48B, Ethernet frame size=62B
- Link rate = 10000 Kpps
- Tolerance = 0.01
Iteration 1
  - Offering load = 5000 Kpps
  - Step = 2500 Kpps
  - Measured forwarding rate = 2681 Kpps
Iteration 2
  - Offering load = 2500 Kpps
  - Step = 2500 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 2499 Kpps
Iteration 3
  - Offering load = 3750 Kpps
  - Step = 1250 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2682 Kpps
Iteration 4
  - Offering load = 3125 Kpps
  - Step = 625 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 2681 Kpps
Iteration 5
  - Offering load = 2813 Kpps
  - Step = 312 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 2681 Kpps
Iteration 6
  - Offering load = 2657 Kpps
  - Step = 156 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 2657 Kpps
Iteration 7
  - Offering load = 2735 Kpps
  - Step = 78 Kpps
  - Trend = increasing
  - Measured forwarding rate = 2680 Kpps
Iteration 8
  - Offering load = 2696 Kpps
  - Step = 39 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 2679 Kpps
Estimated Equilibrium Ethernet throughput= 2679 Kpps (maximum value seen: 2682 Kpps)
</code>

From 5.4Mpps in IPv4, throughput drops to 2.67Mpps in IPv6 (there is no fastforward path for IPv6).

==== Firewall impact ====

One rule for each firewall and 2,000 UDP "sessions"; more information in the [[documentation:examples:forwarding_performance_lab_of_an_ibm_system_x3550_m3_with_intel_82580#firewall_impact|GigaEthernet performance lab]].
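
As an illustration (these exact rules are hypothetical here; the linked lab has the ones actually used), a one-rule stateful configuration could look like:
<code>
# ipfw: one stateful rule
ipfw add 100 allow udp from any to any keep-state

# pf: one stateful rule in /etc/pf.conf
pass proto udp keep state
</code>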

{{documentation:examples:bench.ipfw.pf.HP-Gen8.png|Impact of ipfw and pf on 8 cores Xeon E5-2650 with Chelsio T540-CR on FreeBSD 10.1}}

===== Tuning =====

=== BIOS ===

Disable Hyper-Threading: the logical ("fake") cores are very bad for routing performance.

{{documentation:technical_docs:HyperThearding-impact-forwarding.png|Impact of HyperThreading on forwarding performance on FreeBSD}}
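
If the BIOS offers no such option, FreeBSD can be told at boot not to schedule on the logical cores (an alternative sketch using the stock machdep.hyperthreading_allowed tunable, following the same loader.conf.local pattern as below):
<code>
mount -uw /
echo 'machdep.hyperthreading_allowed="0"' >> /boot/loader.conf.local
mount -ur /
reboot
</code>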

=== Chelsio drivers ===

== Reducing NIC queues (FreeBSD 11.0 or older only) ==

By default the number of queues is:
  * TX: 16, or ncpu if ncpu < 16
  * RX: 8, or ncpu if ncpu < 8

In our case both are therefore equal to 8:
<code>
[root@hp]~# sysctl dev.cxl.3.nrxq
dev.cxl.3.nrxq: 8
[root@hp]~# sysctl dev.cxl.3.ntxq
dev.cxl.3.ntxq: 8
</code>

Here is how to change the number of queues to 4:
<code>
mount -uw /
echo 'hw.cxgbe.ntxq10g="4"' >> /boot/loader.conf.local
echo 'hw.cxgbe.nrxq10g="4"' >> /boot/loader.conf.local
mount -ur /
reboot
</code>
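
After the reboot, the change can be checked with the same sysctls as above; both should now report 4:
<code>
sysctl dev.cxl.3.nrxq dev.cxl.3.ntxq
</code>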

{{:documentation:examples:bench.impact.of.num_queues.hp-gen8-chelsio-t540.png}}

<note warning>
On an 8-core machine, we had to reduce the number of NIC queues to 4 on FreeBSD 11.0 and older.
</note>

=== Descriptor ring size ===

The size, in number of entries, of the descriptor rings used by the RX and TX queues is 1024 by default.

<code>
[root@hp]~# sysctl dev.cxl.3.qsize_rxq
dev.cxl.3.qsize_rxq: 1024
[root@hp]~# sysctl dev.cxl.2.qsize_rxq
dev.cxl.2.qsize_rxq: 1024
</code>

Let's try different values (1024, 2048 and 4096) and measure the impact:

<code>
mount -uw /
echo 'hw.cxgbe.qsize_txq="4096"' >> /boot/loader.conf.local
echo 'hw.cxgbe.qsize_rxq="4096"' >> /boot/loader.conf.local
mount -ur /
reboot
</code>

{{:documentation:examples:bench.impact.of.descriptor.ring.size.hp-gen8-chelsio-t540.png}}

Ministat:

<code>
x pps.qsize1024
+ pps.qsize2048
* pps.qsize4096
+--------------------------------------------------------------------------+
|x    x                      *+x       +*        x**    x    +        *+ + |
|   |________________________A_M_____________________|                     |
|                                  |___________________A_____M____________||
|                                |______________A_M____________|           |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       2148318       2492921       2333251     2321334.2     154688.45
+   5       2328049       2596042       2525599       2484254     120543.04
No difference proven at 95.0% confidence
*   5       2325210       2581890       2452394     2442485.8     94913.584
No difference proven at 95.0% confidence
</code>

Reading the graph, a qsize of 2048 seems to behave a little better, but ministat on these 5 bench runs says there is no proven difference.
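
For reference, a report like the one above is obtained by giving ministat one file of pps measurements per tested qsize:
<code>
ministat pps.qsize1024 pps.qsize2048 pps.qsize4096
</code>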