 +====== FreeBSD forwarding Performance ======
 +{{description>​Tips and information about FreeBSD forwarding performance}}
There are lots of guides about [[http://serverfault.com/questions/64356/freebsd-performance-tuning-sysctls-loader-conf-kernel|tuning FreeBSD TCP performance]] (where the FreeBSD host is an end-point of the TCP session), but tuning forwarding performance (where the FreeBSD host does not have to read the TCP information of the packets it forwards) or firewalling performance is a different exercise.
  
 +===== Concepts =====
 +
 +==== How to bench a router ====
 +
Benchmarking a router **is not** about measuring the maximum bandwidth crossing the router, but about measuring its network throughput (in packets per second):
 +  * [[http://​www.ietf.org/​rfc/​rfc1242.txt|RFC1242:​ Benchmarking Terminology for Network Interconnection Devices]]
 +  * [[http://​www.ietf.org/​rfc/​rfc2544.txt|RFC2544:​ Benchmarking Methodology for Network Interconnect Devices]]
 +  * [[http://​tools.ietf.org/​html/​rfc3222|RFC3222:​ Terminology for Forwarding Information Base (FIB) based Router Performance]]
 +  * [[http://​www.ixiacom.com/​pdfs/​test_plans/​agilent_journal_of_internet_test_methodologies.pdf|The Journal of Internet Test Methodologies]] (pdf)
 +
 +==== Definition ====
 +
A clear definition of the relationship between bandwidth and frame rate is mandatory:
 +  * [[http://​www.cisco.com/​web/​about/​security/​intelligence/​network_performance_metrics.html|Bandwidth,​ Packets Per Second, and Other Network Performance Metrics]]: The relationship of bandwidth and packet forwarding rate
  * [[http://wiki.networksecuritytoolkit.org/nstwiki/index.php/LAN_Ethernet_Maximum_Rates,_Generation,_Capturing_%26_Monitoring|LAN Ethernet Maximum Rates, Generation, Capturing & Monitoring]]: Gives another good explanation of Ethernet maximum rates
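
As a quick worked example of this relationship: a minimum-size Ethernet frame is 64 bytes, and each frame also costs 20 bytes on the wire (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap), i.e. 84 bytes or 672 bits per frame. Gigabit Ethernet line-rate with minimum-size frames is therefore about 10^9 / 672 ≈ 1.488 Mpps, and 10 Gigabit Ethernet line-rate is about 14.88 Mpps.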
 +
 +===== Benchmarks =====
 +
 +==== Cisco or Linux ====
 +
 +  * [[http://​www.cisco.com/​web/​partners/​downloads/​765/​tools/​quickreference/​routerperformance.pdf|Routing performance of Cisco routers]] (PDF)
 +  * [[http://​rhelblog.redhat.com/​2015/​09/​29/​pushing-the-limits-of-kernel-networking/​|Pushing the Limits of Kernel Networking]] (2015, September) : Linux RedHat 7.2: 1.3Mpps/​core and "​Beyond the ninth CPU we see some gain but all results are pretty much fixed at 12.4Mpps as this is the limit of the PCIe bandwidth for the device."​
 +  * [[http://​www.telematica.polito.it/​oldsite/​courmayeur06/​papers/​06-A.2.1.pdf|RFC2544 Performance Evaluation for a Linux Based Open Router]] (2006, June)
  * [[http://data.guug.de/slides/lk2008/10G_preso_lk2008.pdf|Towards 10Gb/s open-source routing]] (2008): Includes a hardware comparison between a "real" router and a PC.
 +==== FreeBSD ====
 +
Here are some benchmarks regarding the network forwarding performance of FreeBSD (made by the BSDRP team):
 +  * AsiaBSDCon 2018 - Tuning FreeBSD for routing and firewalling ([[https://​people.freebsd.org/​~olivier/​talks/​2018_AsiaBSDCon_Tuning_FreeBSD_for_routing_and_firewalling-Paper.pdf|paper]],​[[https://​people.freebsd.org/​~olivier/​talks/​2018_AsiaBSDCon_Tuning_FreeBSD_for_routing_and_firewalling-Slides.pdf|slides]] and [[https://​www.youtube.com/​watch?​v=SLlzep0IxVY|video]])
 +  * [[http://​blog.cochard.me/​2015/​09/​receipt-for-building-10mpps-freebsd.html|Recipe for building a 10Mpps FreeBSD based router]]
 +  * [[https://​github.com/​ocochard/​netbenches/​blob/​master/​Xeon_E5-2650-8Cores-Chelsio_T540-CR/​forwarding-pf-ipfw/​results/​fbsd11-routing.r287531/​README.md|Impact of enabling ipfw or pf on fastforwarding performance with 8 cores Xeon E5-2650]]:​ 9.5Mpps
 +  * [[https://​github.com/​ocochard/​netbenches/​blob/​master/​Xeon_L5630-4Cores-Intel_82599EB/​forwarding-pf-ipfw/​results/​fbsd11-routing.r287531/​README.md|Impact of enabling ipfw or pf on fastforwarding performance with 4 cores Xeon L5630]] : Gigabit line-rate (1.48Mpps) even with few ipfw or pf rules enabled
 +  * [[https://​github.com/​ocochard/​netbenches/​blob/​master/​Atom_C2558_4Cores-Intel_i350/​forwarding-pf-ipfw/​results/​fbsd11-routing.r287531/​README.md|Impact of enabling ipfw or pf on fastforwarding performance with 4 cores Atom C2558E (Netgate RCC-VE 4860)]] : Small device perfect for 1Gb/s of IMIX traffic
  * [[https://github.com/ocochard/netbenches/blob/master/AMD_G-T40E_2Cores_RTL8111E/forwarding-pf-ipfw/results/fbsd11-routing.r287531/README.md|Impact of enabling ipfw or pf on fastforwarding performance with 2 cores AMD G-T40E (PC Engines APU)]]: Another cheap router that reaches about 400Mb/s of IMIX
 +
 +===== Bench lab =====
 +
The [[bench lab]] should make it possible to measure the pps. To obtain accurate results, [[http://www.ietf.org/rfc/rfc2544.txt|RFC 2544 (Benchmarking Methodology for Network Interconnect Devices)]] is a good reference. If switches are used, they need a proper configuration too; refer to the [[documentation:examples:setting_up_a_forwarding_performance_benchmark_lab|BSDRP performance lab]] for some examples.
 +
 +===== Tuning =====
 +
 +==== Literature ====
 +
Here is a list of sources about optimizing and analyzing forwarding performance under FreeBSD.
 +
 +How to bench or tune the network stack:
  * [[http://wiki.freebsd.org/NetworkPerformanceTuning|FreeBSD Network Performance Tuning]]: What needs to be done to tune the networking stack
 +  * [[http://​www.slideshare.net/​brendangregg/​meetbsd2014-performance-analysis | Brendan Gregg'​s Performance analysis presentation]]:​ The "must read" HOW TO
 +  * [[http://​www.freebsd.org/​projects/​netperf/​index.html|FreeBSD Network Performance Project (netperf)]]
 +  * [[http://​www.watson.org/​~robert/​freebsd/​netperf/​20051027-eurobsdcon2005-netperf.pdf|Introduction to Multithreading and Multiprocessing in the FreeBSD SMPng Network Stack]], EuroBSDCon 2005 (PDF)
 +  * [[http://​www.freebsd.org/​cgi/​man.cgi?​query=tuning&​apropos=0&​sektion=0&​manpath=FreeBSD+8.2-RELEASE&​arch=default&​format=html|man tuning]] : performance tuning under FreeBSD
 +  * [[http://​wwwx.cs.unc.edu/​~krishnan/​classes/​spring_07/​os_impl/​report.pdf|Improving Memory and Interrupt Processing in FreeBSD Network Stack]] (PDF)
 +  * [[http://​conferences.sigcomm.org/​sigcomm/​2009/​workshops/​presto/​papers/​p37.pdf|Optimizing the BSD Routing System for Parallel Processing]] (PDF)
  * [[https://people.sunyit.edu/~sengupta/CSC521/systemperformance.ppt|Using netstat and vmstat for performance analysis]] (PowerPoint)
  * [[http://www.freebsd.org/cgi/man.cgi?query=polling&sektion=4|polling man page]] (Warning: enabling polling is not a good idea with the new generation of Ethernet controllers that include interrupt moderation)
 +  * [[http://​info.iet.unipi.it/​~luigi/​polling/​|Device Polling support for FreeBSD ]], the original presentation of polling implementation
 +  * [[http://​www.freebsd.org/​doc/​en_US.ISO8859-1/​books/​handbook/​configtuning-kernel-limits.html|Tuning Kernel Limits]] on the FreeBSD Handbook
 +
 +FreeBSD Experimental high-performance network stacks:
 +  * [[http://​info.iet.unipi.it/​~luigi/​netmap/​|Netmap - memory mapping of network devices]] //"​(...)a single core running at 1.33GHz can generate the 14.8Mpps that saturate a 10GigE interface."//​
 +  * [[http://​conferences.sigcomm.org/​hotnets/​2013/​papers/​hotnets-final43.pdf|Network Stack Specialization for Performance]] This paper presents Sandstorm, a clean-slate userspace network stack based on Netmap
 +
 +==== Multiple flows ====
 +
Don't try to bench a router with only one flow (same source/destination addresses and same source/destination ports): you need to generate multiple flows.
Multi-queue NICs use a mechanism like the [[https://en.wikipedia.org/wiki/Toeplitz_Hash_Algorithm|Toeplitz Hash Algorithm]] to balance flows across all cores, so generating only one flow will use only one NIC queue.
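
For example, with netmap's pkt-gen (one common packet generator), ranges of source and destination addresses can be given so that the traffic is spread over many flows. A minimal sketch, where the interface name and addresses are placeholders to adapt to your lab:
<code>
pkt-gen -f tx -i igb2 -l 60 -s 198.18.10.1:2000-198.18.10.100 -d 198.19.10.1:2000-198.19.10.100
</code>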
 +
During the load, check that all queues are used, with sysctl or with a [[https://github.com/ocochard/BSDRP/blob/master/BSDRP/Files/usr/local/bin/nic-queue-usage|Python script like this one]] that displays the real-time usage of each queue.
 +
In this example we can see that the flows are correctly shared between all 8 queues (about 340K packets per second on each):
 +<​code>​
 +
 +[root@router]~#​ nic-queue-usage cxl0
 +[Q0   ​346K/​s] [Q1   ​343K/​s] [Q2   ​339K/​s] [Q3   ​338K/​s] [Q4   ​338K/​s] [Q5   ​338K/​s] [Q6   ​343K/​s] [Q7   ​346K/​s] [QT  2734K/​s ​ 3269K/s ->     0K/s]
 +[Q0   ​347K/​s] [Q1   ​344K/​s] [Q2   ​339K/​s] [Q3   ​339K/​s] [Q4   ​338K/​s] [Q5   ​338K/​s] [Q6   ​343K/​s] [Q7   ​346K/​s] [QT  2735K/​s ​ 3277K/s ->     0K/s]
 +[Q0   ​344K/​s] [Q1   ​341K/​s] [Q2   ​338K/​s] [Q3   ​338K/​s] [Q4   ​337K/​s] [Q5   ​337K/​s] [Q6   ​342K/​s] [Q7   ​345K/​s] [QT  2727K/​s ​ 3262K/s ->     0K/s]
 +[Q0   ​355K/​s] [Q1   ​352K/​s] [Q2   ​348K/​s] [Q3   ​349K/​s] [Q4   ​348K/​s] [Q5   ​347K/​s] [Q6   ​352K/​s] [Q7   ​355K/​s] [QT  2809K/​s ​ 3381K/s ->     0K/s]
 +[Q0   ​351K/​s] [Q1   ​348K/​s] [Q2   ​344K/​s] [Q3   ​343K/​s] [Q4   ​342K/​s] [Q5   ​344K/​s] [Q6   ​349K/​s] [Q7   ​352K/​s] [QT  2776K/​s ​ 3288K/s ->     0K/s]
 +[Q0   ​344K/​s] [Q1   ​341K/​s] [Q2   ​338K/​s] [Q3   ​339K/​s] [Q4   ​338K/​s] [Q5   ​338K/​s] [Q6   ​343K/​s] [Q7   ​346K/​s] [QT  2731K/​s ​ 3261K/s ->     0K/s]
 +</​code>​
 +
 +<note warning>
Beware of setups that prevent multi-queue, like GRE, GIF or IPSec tunnels, or PPPoE (= same source/destination address on every packet). If PPPoE is mandatory on your Gigabit Internet link, small hardware like a 4-core AMD GX (PC Engines APU2) will not be able to reach Gigabit speed.
 +</​note>​
 +==== Choosing good Hardware ====
 +=== CPU ===
 +
Avoid NUMA architectures: prefer a CPU in a single package with the maximum number of cores (8 or 16).
If you are using NUMA, check that the inbound/outbound NIC queues are correctly mapped to the same package.
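
A minimal sketch of how to inspect and adjust this mapping, assuming the IRQ numbers of the NIC queues were taken from vmstat -i (the IRQ and CPU numbers below are only examples):
<code>
# Number of NUMA domains on this system
sysctl vm.ndomains
# Show which CPUs the interrupt of a NIC queue is currently allowed to run on
cpuset -g -x 285
# Pin this NIC queue interrupt to a CPU of the package local to the NIC
cpuset -l 2 -x 285
</code>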
 +
 +=== Network Interface Card ===
 +
Mellanox or Chelsio, combining good chipsets with excellent drivers, are an excellent choice.
 +
Intel seems to have problems managing lots of PPS (= lots of IRQs).
 +
Avoid the "embedded" NICs found in common Dell/HP servers, like the ones below, which are very bad regarding maximum packets-per-second performance:
 +  * 10G Emulex OneConnect (be3)
 +  * 10G Broadcom NetXtreme II BCM57810
 +
 +==== Choosing good FreeBSD release ====
 +
Before tuning, you need to use the right FreeBSD version.
This means a FreeBSD -head version newer than r309257 (Andrey V. Elsukov's improvement: Rework ip_tryforward() to use FIB4 KPI), or a FreeBSD 11-stable newer than r310771 (the MFC of this change to stable).
 +
Since version 1.70, BSDRP uses FreeBSD 11-stable (r312663), which includes this improvement.
 +
 +{{documentation:​technical_docs:​2016-performance-evolution.png|2016 Forwarding performance evolution of FreeBSD -head on a 8 core Atom}}
 +
For better (and linearly scaling) performance, there is also the [[https://svnweb.freebsd.org/base/projects/routing/|projects/routing]] branch, [[http://blog.cochard.me/2015/09/receipt-for-building-10mpps-freebsd.html|which still gives better performance]].
 +
 +==== Disabling Hyper Threading ====
 +
Disable Hyper-Threading (HT): by default, lots of multi-queue NIC drivers create one queue per core, but "logical" cores do not help at all for managing the interrupts generated by high-speed NICs.
 +
 +HT can be disabled with this command:
 +<​code>​
 +echo '​machdep.hyperthreading_allowed="​0"'​ >> /​boot/​loader.conf
 +</​code>​
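
After a reboot you can check that only the physical cores remain active (a quick check, assuming the usual kern.smp sysctls are available):
<code>
sysctl kern.smp.cpus kern.smp.cores
</code>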
 +
Here is an example on an Intel CPU with 8 cores x 2 hardware threads and a 10G Chelsio NIC:
 +
 +<​code>​
 +x HT-enabled-8rxq(default).packets-per-seconds
 ++ HT-enabled-16rxq.packets-per-seconds
 +* HT-disabled.packets-per-seconds
 ++--------------------------------------------------------------------------+
 +|                                                                        **|
 +|x      xx    x          +       + + +  +                               ***|
 +|   ​|____A_____| ​                                                          |
 +|                           ​|_____AM____| ​                                 |
 +|                                                                       |A||
 ++--------------------------------------------------------------------------+
 +    N           ​Min ​          ​Max ​       Median ​          ​Avg ​       Stddev
 +x   ​5 ​      ​4500078 ​      ​4735822 ​      ​4648451 ​    ​4648293.8 ​    ​94545.404
 ++   ​5 ​      ​4925106 ​      ​5198632 ​      ​5104512 ​    ​5088362.1 ​    ​102920.87
 +Difference at 95.0% confidence
 +        440068 +/- 144126
 +        9.46731% +/- 3.23827%
 +        (Student'​s t, pooled s = 98821.9)
 +*   ​5 ​      ​5765684 ​    ​5801231.5 ​      ​5783115 ​    ​5785004.7 ​    ​13724.265
 +Difference at 95.0% confidence
 +        1.13671e+06 +/- 98524.2
 +        24.4544% +/- 2.62824%
 +        (Student'​s t, pooled s = 67554.4)
 +</​code>​
 +
Disabling Hyper-Threading gives a benefit of about 24% here.
 +
 +==== fastforwarding ====
 +
 +=== FreeBSD 10.3 or older ===
 +
You should enable fastforwarding with:
 +<​code>​
 +echo "​net.inet.ip.fastforwarding=1"​ >> /​etc/​sysctl.conf
 +service sysctl restart
 +</​code>​
 +
 +=== FreeBSD 12.0 or newer ===
 +
You should enable tryforward by disabling ICMP redirects:
 +<​code>​
 +echo "​net.inet.ip.redirect=0" ​ >> /​etc/​sysctl.conf
 +echo "​net.inet6.ip6.redirect=0"​ >> /​etc/​sysctl.conf
 +service sysctl restart
 +</​code>​
 +
 +==== Entropy harvest impact ====
 +
Lots of tuning guides recommend disabling:
 +  * kern.random.sys.harvest.ethernet
 +  * kern.random.sys.harvest.interrupt
 +
By default, the binary mask 511 selects almost all of these sources as entropy sources:
 +<​code>​
 +kern.random.harvest.mask_symbolic:​ [UMA],​[FS_ATIME],​SWI,​INTERRUPT,​NET_NG,​NET_ETHER,​NET_TUN,​MOUSE,​KEYBOARD,​ATTACH,​CACHED
 +kern.random.harvest.mask_bin:​ 00111111111
 +kern.random.harvest.mask:​ 511
 +</​code>​
 +
By replacing this mask with 351, we exclude INTERRUPT and NET_ETHER:
 +<​code>​
 +kern.random.harvest.mask_symbolic:​ [UMA],​[FS_ATIME],​SWI,​[INTERRUPT],​NET_NG,​[NET_ETHER],​NET_TUN,​MOUSE,​KEYBOARD,​ATTACH,​CACHED
 +kern.random.harvest.mask_bin:​ 00101011111
 +kern.random.harvest.mask:​ 351
 +</​code>​
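
A minimal sketch of how to apply this mask persistently, using the same pattern as the other sysctl examples on this page:
<code>
echo "kern.random.harvest.mask=351" >> /etc/sysctl.conf
service sysctl restart
</code>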
 +
Here is the impact on the forwarding performance of FreeBSD 11.1:
 +
 +<​code>​
 +x PC-Engines-APU2-igb,​ 511 (default): inet4 packets-per-second
 ++ PC-Engines-APU2-igb,​ 351: inet4 packets-per-second
 ++--------------------------------------------------------------------------+
 +|xx  x       ​xx ​                                ​+ ​     +    +    +        +|
 +||___M_A_____| ​                                                            |
 +|                                                  |________A_________| ​   |
 ++--------------------------------------------------------------------------+
 +    N           ​Min ​          ​Max ​       Median ​          ​Avg ​       Stddev
 +x   ​5 ​       724811 ​       730197 ​       726304 ​     727281.6 ​    ​2522.9161
 ++   ​5 ​       744832 ​     755871.5 ​       749956 ​     750112.9 ​    ​4208.7383
 +Difference at 95.0% confidence
 +        22831.3 +/- 5060.46
 +        3.13927% +/- 0.701645%
 +        (Student'​s t, pooled s = 3469.77
 +</​code>​
 +
On a PC Engines APU2, there is a +3% performance benefit.
 +
 +<​code>​
 +x Netgate-igb,​ 511 (default): inet4 packets-per-second
 ++ Netgate-igb,​ 351: inet4 packets-per-second
 ++--------------------------------------------------------------------------+
 +|x x    x          x     ​x ​                                  ​++ ​     +  + +|
 +||______M__A__________| ​                                                   |
 +|                                                             ​|_____AM___| |
 ++--------------------------------------------------------------------------+
 +    N           ​Min ​          ​Max ​       Median ​          ​Avg ​       Stddev
 +x   ​5 ​     946426.5 ​       965962 ​       951906 ​     954721.7 ​    ​8435.4561
 ++   ​5 ​       994839 ​      ​1005327 ​      ​1000935 ​    ​1000098.2 ​    ​4620.4263
 +Difference at 95.0% confidence
 +        45376.5 +/- 9918.76
 +        4.75285% +/- 1.0771%
 +        (Student'​s t, pooled s = 6800.93)
 +</​code>​
 +
 +On a Netgate RCC-VE 4860 there is about 4.7% performance benefit.
 +
Using the FreeBSD "projects/routing" branch, this impact is a lot more significant:
 +
 +{{documentation:​technical_docs:​entropy_source_impact.png|Impact of disabling some entropy source on FreeBSD forwarding performance}}
 +
 +==== Polling mode ====
 +
Polling can be useful in 2 cases:
  * On **old hardware only** (where the Ethernet card doesn't support **intelligent interrupt management**), polling mode can improve performance by reducing CPU interrupts
  * [[http://lists.freebsd.org/pipermail/freebsd-net/2013-May/035626.html|For usage in a virtual machine]], but don't forget to [[https://lists.freebsd.org/pipermail/freebsd-net/2015-March/041657.html|overwrite the default HZ value in this case too]]
To enable polling mode:
  - Edit /etc/rc.conf.misc and replace //polling_enable="NO"// with //polling_enable="YES"//
  - Execute: service polling start
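
The same two steps as shell commands (a sketch using sysrc; /etc/rc.conf.misc and the polling rc script are BSDRP-specific):
<code>
sysrc -f /etc/rc.conf.misc polling_enable="YES"
service polling start
</code>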
 +
 +=== NIC drivers compatibility matrix ===
 +
BSDRP can use some special features on some NICs:
  * [[http://www.freebsd.org/cgi/man.cgi?query=polling|Ethernet device polling]] for high performance with Ethernet controllers that do not include an interrupt management feature, or [[http://info.iet.unipi.it/~luigi/papers/20130520-rizzo-vm.pdf|for usage in a VM]]
  * [[http://www.freebsd.org/cgi/man.cgi?query=altq|ALTQ]] for queuing, but try to use [[http://www.freebsd.org/cgi/man.cgi?query=dummynet|dummynet]] instead
 +
Only these devices support these modes:
 +
 +^ name      ^ Description ​   ^ Polling ​  ^ ALTQ       ^
 +| ae  | Attansic/​Atheros L2 FastEthernet controller driver | no | yes | 
 +| age | Attansic/​Atheros L1 Gigabit Ethernet driver | no | yes |
 +| alc | Atheros AR813x/​AR815x Gigabit/​Fast Ethernet driver | no | yes |
 +| ale | Atheros AR8121/​AR8113/​AR8114 Gigabit/​Fast Ethernet driver | no | yes |
 +| bce | Broadcom NetXtreme II (BCM5706/​5708/​5709/​5716) PCI/PCIe Gigabit Ethernet adapter driver | no | yes |
 +| bfe | Broadcom BCM4401 Ethernet Device Driver | no | yes |
 +| bge | Broadcom BCM570x/​5714/​5721/​5722/​5750/​5751/​5752/​5789 PCI Gigabit Ethernet adapter driver | yes | yes |
 +| cas | Sun Cassini/​Cassini+ and National Semiconductor DP83065 Saturn Gigabit Ethernet driver ​ | no | yes |
 +| cxgbe | Chelsio T4 and T5 based 40Gb, 10Gb, and 1Gb Ethernet adapter driver | no | yes |
 +| dc | DEC/Intel 21143 and clone 10/100 Ethernet driver | yes | yes |
 +| de | DEC DC21x4x Ethernet device driver | no | yes |
 +| ed | NE-2000 and WD-80x3 Ethernet driver | no | yes |
 +| em | Intel(R) PRO/1000 Gigabit Ethernet adapter driver | yes | yes |
 +| et | Agere ET1310 10/​100/​Gigabit Ethernet driver | no | yes |
 +| ep | Ethernet driver for 3Com Etherlink III (3c5x9) interfaces | no | yes |
 +| fxp | Intel EtherExpress PRO/100 Ethernet device driver | yes | yes |
 +| gem | ERI/​GEM/​GMAC Ethernet device driver | no | yes |
 +| hme | Sun Microelectronics STP2002-STQ Ethernet interfaces device driver | no | yes |
 +| igb | Intel(R) PRO/1000 PCI Express Gigabit Ethernet adapter driver | yes | needs IGB_LEGACY_TX |
 +| ixgb(e) | Intel(R) 10Gb Ethernet driver | yes | needs IGB_LEGACY_TX |
 +| jme | JMicron Gigabit/​Fast Ethernet driver | no | yes |
 +| le | AMD Am7900 LANCE and Am79C9xx ILACC/PCnet Ethernet interface driver | no | yes |
 +| msk | Marvell/​SysKonnect Yukon II Gigabit Ethernet adapter driver | no | yes |
 +| mxge | Myricom Myri10GE 10 Gigabit Ethernet adapter driver | no | yes |
 +| my | Myson Technology Ethernet PCI driver | no | yes |
 +| nfe | NVIDIA nForce MCP Ethernet driver | yes | yes |
 +| nge | National Semiconductor PCI Gigabit Ethernet adapter driver | yes | no |
 +| nve | NVIDIA nForce MCP Networking Adapter device driver | no | yes |
 +| qlxgb | QLogic 10 Gigabit Ethernet & CNA Adapter Driver | no | yes |
 +| re | RealTek 8139C+/​8169/​816xS/​811xS/​8101E PCI/PCIe Ethernet adapter driver | yes | yes |
 +| rl | RealTek 8129/8139 Fast Ethernet device driver | yes | yes |
 +| sf | Adaptec AIC‐6915 "​Starfire"​ PCI Fast Ethernet adapter driver | yes | yes |
 +| sge | Silicon Integrated Systems SiS190/191 Fast/​Gigabit Ethernet driver | no | yes |
 +| sis | SiS 900, SiS 7016 and NS DP83815/​DP83816 Fast Ethernet device driver | yes | yes |
 +| sk | SysKonnect SK-984x and SK-982x PCI Gigabit Ethernet adapter driver | yes | yes |
 +| ste | Sundance Technologies ST201 Fast Ethernet device driver | no | yes |
 +| stge | Sundance/​Tamarack TC9021 Gigabit Ethernet adapter driver | yes | yes |
 +| ti | Alteon Networks Tigon I and Tigon II Gigabit Ethernet driver | no | yes |
 +| txp | 3Com 3XP Typhoon/​Sidewinder (3CR990) Ethernet interface | no | yes |
 +| vge | VIA Networking Technologies VT6122 PCI Gigabit Ethernet adapter driver | yes | yes |
 +| vr | VIA Technologies Rhine I/II/III Ethernet device driver | yes | yes |
 +| xl | 3Com Etherlink XL and Fast Etherlink XL Ethernet device driver | yes | yes |
 +
Using other NICs will work too :-)
 +==== NIC drivers tuning ====
 +
 +=== RX & TX descriptor (queue) size on igb ===
 +
The receive (hw.igb.rxd) and transmit (hw.igb.txd) descriptor ring sizes of igb/em NICs can be increased, but it's not a good idea.
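
For reference, these are loader tunables; a sketch of how they would be set in /boot/loader.conf (though the results below suggest keeping the 1024 default):
<code>
hw.igb.rxd="2048"
hw.igb.txd="2048"
</code>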
 +
Here are some examples where increasing these buffers decreases performance:
 +
 +<​code>​
 +x PC-Engine-APU2-igb,​ 1024 (default):​ inet4 packets-per-second
 ++ PC-Engine-APU2-igb,​ 2048:​ inet4 packets-per-second
 +* PC-Engine-APU2-igb,​ 4096:​ inet4 packets-per-second
 ++--------------------------------------------------------------------------+
 +|*                                                                         |
 +|* ***                  + +  +++                                      xx xx|
 +|                                                                      MA| |
 +|                        |__AM_| ​                                          |
 +||_A_| ​                                                                    |
 ++--------------------------------------------------------------------------+
 +    N           ​Min ​          ​Max ​       Median ​          ​Avg ​       Stddev
 +x   ​5 ​       724024 ​       731531 ​       726058 ​     727317.6 ​    ​3006.2996
 ++   ​5 ​       626546 ​       640326 ​       637463 ​     634497.6 ​    ​5856.2198
 +Difference at 95.0% confidence
 +        -92820 +/- 6788.67
 +        -12.762% +/- 0.909828%
 +        (Student'​s t, pooled s = 4654.74)
 +*   ​5 ​       577830 ​       585426 ​       582886 ​     581913.4 ​    ​3413.6019
 +Difference at 95.0% confidence
 +        -145404 +/- 4690.94
 +        -19.9918% +/- 0.592106%
 +        (Student'​s t, pooled s = 3216.4)
 +</​code>​
 +
On a PC Engines APU2, increasing the RX and TX buffers degrades forwarding performance by about 20%.
 +
 +<​code>​
 +x Netgate-igb,​ 1024 (default):​ inet4 packets-per-second
 ++ Netgate-igb,​ 2048:​ inet4 packets-per-second
 +* Netaget-igb,​ 4096:​ inet4 packets-per-second
 ++--------------------------------------------------------------------------+
 +|*      *       ​* ​   *   ​*+ ​  ​++ ​     +    +               ​x ​  x x      x x|
 +|                                                           ​|____MA______| |
 +|                          |___M__A______| ​                                |
 +|   ​|_________A_M______| ​                                                  |
 ++--------------------------------------------------------------------------+
 +    N           ​Min ​          ​Max ​       Median ​          ​Avg ​       Stddev
 +x   ​5 ​       943843 ​       961431 ​       950960 ​     952699.4 ​    ​7285.7808
 ++   ​5 ​       907050 ​       926520 ​       912317 ​     915489.3 ​      ​7816.85
 +Difference at 95.0% confidence
 +        -37210.1 +/- 11020
 +        -3.90575% +/- 1.13593%
 +        (Student'​s t, pooled s = 7555.98)
 +*   ​5 ​       877923 ​       904869 ​       894519 ​     892616.2 ​    ​10954.317
 +Difference at 95.0% confidence
 +        -60083.2 +/- 13567.4
 +        -6.30663% +/- 1.39717%
 +        (Student'​s t, pooled s = 9302.68)
 +</​code>​
 +
On a Netgate RCC-VE 4860, performance decreases by about 6%.
 +
 +=== Maximum number of received packets to process at a time (Intel) ===
 +
 +By default Intel drivers (em|igb) limit the maximum number of received packets to process at a time (hw.igb.rx_process_limit=100).
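
A minimal sketch of how to remove this limit from /boot/loader.conf, assuming the driver interprets -1 as "no limit":
<code>
echo 'hw.igb.rx_process_limit="-1"' >> /boot/loader.conf
</code>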
 +
Disabling this limit can slightly improve the overall performance:
 +
 +<​code>​
 +x PC-Engines-APU2-igb,​ 100 (default): inet4 packets-per-second
 ++ PC-Engines-APU2-igb,​ disabled: inet4 packets-per-second
 ++--------------------------------------------------------------------------+
 +|x         ​x ​                  ​x ​  ​x ​    ​x ​              ​+ ​       ++ +    +|
 +|      |________________A______M_________| ​                                |
 +|                                                            |_____A_____| |
 ++--------------------------------------------------------------------------+
 +    N           ​Min ​          ​Max ​       Median ​          ​Avg ​       Stddev
 +x   ​5 ​       721322 ​       729817 ​       727669 ​     726175.8 ​    ​3592.7624
 ++   ​5 ​       733327 ​       736835 ​       735392 ​     735283.4 ​    ​1266.5036
 +Difference at 95.0% confidence
 +        9107.6 +/- 3928.6
 +        1.25419% +/- 0.547037%
 +        (Student'​s t, pooled s = 2693.69)
 +</​code>​
 +
 +A small 1% improvement on a PC Engines APU2.
 +
 +<​code>​
 +x Netgate-igb,​ 100 (default): inet4 packets-per-second
 ++ Netgate-igb,​ disabled: inet4 packets-per-second
 ++--------------------------------------------------------------------------+
 +|x    x         ​x ​                   x              x  +    + +        +  +|
 +||______________M_____A_____________________| ​                             |
 +|                                                       ​|_____M_A_______| ​ |
 ++--------------------------------------------------------------------------+
 +    N           ​Min ​          ​Max ​       Median ​          ​Avg ​       Stddev
 +x   ​5 ​       943049 ​       963053 ​       948889 ​     951350.1 ​    ​8428.1046
 ++   ​5 ​       964038 ​       971603 ​       966764 ​     967747.4 ​    ​3168.1615
 +Difference at 95.0% confidence
 +        16397.3 +/- 9285.49
 +        1.72358% +/- 0.990789%
 +        (Student'​s t, pooled s = 6366.72)
 +</​code>​
 +
Almost the same improvement, 1.7%, on a Netgate.
 +
 +=== Increasing maximum interrupts per second ===
 +
By default, the igb|em drivers limit the maximum number of interrupts per second to 8000.
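
This limit is also a loader tunable; a sketch of how it could be doubled in /boot/loader.conf:
<code>
echo 'hw.igb.max_interrupt_rate="16000"' >> /boot/loader.conf
</code>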
 +
Here is what results from increasing this number:
 +
 +<​code>​
 +x PC-Engine-APU2-igb,​ max_interrupt_rate=8000 (default): inet4 pps
 ++ PC-Engine-APU2-igb,​ max_interrupt_rate=16000:​ inet4 pps
 +* PC-Engine-APU2-igb,​ max_interrupt_rate=32000:​ inet4 pps
 ++--------------------------------------------------------------------------+
 +|x           ​x ​             +   ​x* ​    ​*+* ​             *  x+x           +*|
 +|     ​|_________________________MA__________________________| ​             |
 +|                         ​|_____________M_____A___________________| ​       |
 +|                                |_______M_______A_______________| ​        |
 ++--------------------------------------------------------------------------+
 +    N           ​Min ​          ​Max ​       Median ​          ​Avg ​       Stddev
 +x   ​5 ​       721191 ​       730448 ​       725919 ​     726128.6 ​    ​4137.1981
 ++   ​5 ​       725281 ​       732247 ​       727142 ​       728045 ​    ​3083.0635
 +No difference proven at 95.0% confidence
 +*   ​5 ​       726157 ​       732417 ​       727400 ​       728554 ​    ​2518.7778
 +No difference proven at 95.0% confidence
 +</​code>​
 +
 +No benefit on a PC Engines APU2.
 +
 +<​code>​
 +x Netgate-igb,​ max_interrupt_rate=8000 (default): inet4 pps
 ++ Netgate-igb,​ max_interrupt_rate=16000:​ inet4 pps
 +* Netgate-igb,​ max_interrupt_rate=32000:​ inet4 pps
 ++--------------------------------------------------------------------------+
 +|x              *   ​x ​   x*   ​+* ​  ​x ​      ​+ ​      ​* ​      ​+ ​  ​+ ​   *     +|
 +|       ​|________________MA__________________| ​                            |
 +|                                    |________________A____M___________| ​  |
 +|                |_____________M_______A____________________| ​             |
 ++--------------------------------------------------------------------------+
 +    N           ​Min ​          ​Max ​       Median ​          ​Avg ​       Stddev
 +x   ​5 ​     936969.5 ​       954851 ​       945631 ​     946049.3 ​     6595.479
 ++   ​5 ​       947461 ​       962965 ​       957754 ​       955782 ​     6143.043
 +Difference at 95.0% confidence
 +        9732.7 +/- 9295.06
 +        1.02877% +/- 0.987938%
 +        (Student'​s t, pooled s = 6373.28)
 +*   ​5 ​       942302 ​       960913 ​       947667 ​     950327.4 ​    ​7504.7836
 +No difference proven at 95.0% confidence
 +</​code>​
 +
A small 1% benefit when doubled from the default (8000 to 16000), but no further benefit beyond that.
 +
 +=== Disabling LRO and TSO ===
 +
All modern NICs support [[http://en.wikipedia.org/wiki/Large_receive_offload|LRO]] and [[http://en.wikipedia.org/wiki/Large_segment_offload|TSO]] features, which need to be disabled on a router:
  - By waiting to aggregate multiple packets at the NIC level before handing them up to the stack, they add latency; and because all packets need to be sent out again, the stack has to split them back into individual packets before handing them down to the NIC. The [[http://downloadmirror.intel.com/22919/eng/README.txt|Intel drivers readme]] includes this note: "The result of not disabling LRO when combined with ip forwarding or bridging can be low throughput or even a kernel panic."
  - They break the [[http://en.wikipedia.org/wiki/End-to-end_principle|End-to-end principle]]
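
These features can be disabled per interface with ifconfig; a minimal sketch (igb0 is only an example, and the same flags can be added to the interface line in /etc/rc.conf for persistence):
<code>
ifconfig igb0 -tso -lro
</code>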
 +
 +There is no real impact of disabling these features on PPS performance:​
 +<​code>​
 +x Xeon_E5-2650-8Cores-Chelsio_T540,​ TSO-LRO-enabled (default): inet4 packets-per-second
 ++ Xeon_E5-2650-8Cores-Chelsio_T540,​ TSO-LRO-disabled:​ inet4 packets-per-second
 ++--------------------------------------------------------------------------+
 +|x                    x     ​+x ​       +     ​x ​ +x                   ​+ ​    +|
 +|         ​|__________________A__________________| ​                         |
 +|                              |_______________M___A__________________| ​   |
 ++--------------------------------------------------------------------------+
 +    N           ​Min ​          ​Max ​       Median ​          ​Avg ​       Stddev
 +x   ​5 ​      ​5727360 ​    ​5806270.5 ​      ​5773573 ​    ​5773454.5 ​    ​31394.005
 ++   ​5 ​      ​5771681 ​      ​5848728 ​      ​5803277 ​    ​5810232.2 ​    ​32556.338
 +No difference proven at 95.0% confidence
 +</​code>​
==== Where is the bottleneck? ====
 +
 +Tools:
 +  * [[http://​www.slideshare.net/​brendangregg/​meetbsd2014-performance-analysis|MeetBSD 2014 - Brendan Gregg'​s performance analysis presentation]] : The ultimate guide for performance analysis on FreeBSD
 +  * [[http://​www.freebsd.org/​cgi/​man.cgi?​query=netstat|netstat]]:​ show network status
 +  * [[http://​www.freebsd.org/​cgi/​man.cgi?​query=vmstat|vmstat]]:​ report virtual memory statistics
 +  * [[http://​www.freebsd.org/​cgi/​man.cgi?​query=top|top]]:​ display and update information about the top cpu processes
  * [[https://www.freebsd.org/cgi/man.cgi?query=pmcstat|pmcstat]]: Measuring performance using hardware counters
 +
 +=== Packets load ===
 +
Display information regarding packet traffic, refreshed every second.
 +
 +Here is a first example:
 +
 +<​code>​
 +[root@hp]~# netstat -ihw1
 +            input        (Total) ​          ​output
 +   ​packets ​ errs idrops ​     bytes    packets ​ errs      bytes colls
 +       ​14M ​    ​0 ​ 7.9M       ​836M ​      ​5.8M ​    ​0 ​      ​353M ​    0
 +       ​14M ​    ​0 ​ 8.0M       ​841M ​      ​5.8M ​    ​0 ​      ​352M ​    0
 +       ​14M ​    ​0 ​ 8.1M       ​849M ​      ​5.8M ​    ​0 ​      ​353M ​    0
 +       ​14M ​    ​0 ​ 7.8M       ​833M ​      ​5.8M ​    ​0 ​      ​352M ​    0
 +       ​14M ​    ​0 ​ 7.9M       ​837M ​      ​5.8M ​    ​0 ​      ​352M ​    0
 +       ​14M ​    ​0 ​ 8.0M       ​843M ​      ​5.8M ​    ​0 ​      ​353M ​    0
 +       ​14M ​    ​0 ​ 7.9M       ​838M ​      ​5.8M ​    ​0 ​      ​352M ​    0
 +</​code>​
 +
=> This system is receiving 14Mpps (10G line-rate) but manages to forward only 5.8Mpps, so it has to drop about 8Mpps.
 +
 +=== Traffic distribution between each queues ===
 +
Check that the load is equally distributed across the input queues. BSDRP includes a sysctl parser script for that:
 +
 +<​code>​
 +[root@hp]~# nic-queue-usage cxl0
 +[Q0   ​935K/​s] [Q1   ​944K/​s] [Q2   ​925K/​s] [Q3   ​912K/​s] [Q4   ​914K/​s] [Q5   ​914K/​s] [Q6   ​930K/​s] [Q7   ​922K/​s] [QT  7400K/s 17749K/s ->     2K/s]
 +[Q0   ​886K/​s] [Q1   ​892K/​s] [Q2   ​872K/​s] [Q3   ​883K/​s] [Q4   ​885K/​s] [Q5   ​885K/​s] [Q6   ​906K/​s] [Q7   ​900K/​s] [QT  7112K/s 16950K/s ->     1K/s]
 +[Q0   ​874K/​s] [Q1   ​879K/​s] [Q2   ​855K/​s] [Q3   ​855K/​s] [Q4   ​852K/​s] [Q5   ​842K/​s] [Q6   ​866K/​s] [Q7   ​860K/​s] [QT  6887K/s 16748K/s ->     1K/s]
 +[Q0   ​892K/​s] [Q1   ​899K/​s] [Q2   ​870K/​s] [Q3   ​865K/​s] [Q4   ​873K/​s] [Q5   ​882K/​s] [Q6   ​899K/​s] [Q7   ​893K/​s] [QT  7076K/s 17023K/s ->     1K/s]
 +[Q0   ​880K/​s] [Q1   ​890K/​s] [Q2   ​878K/​s] [Q3   ​882K/​s] [Q4   ​883K/​s] [Q5   ​885K/​s] [Q6   ​904K/​s] [Q7   ​896K/​s] [QT  7102K/s 16864K/s ->     1K/s]
 +[Q0   ​894K/​s] [Q1   ​891K/​s] [Q2   ​868K/​s] [Q3   ​864K/​s] [Q4   ​860K/​s] [Q5   ​860K/​s] [Q6   ​882K/​s] [Q7   ​873K/​s] [QT  6996K/s 17154K/s ->     1K/s]
 +[Q0   ​876K/​s] [Q1   ​889K/​s] [Q2   ​873K/​s] [Q3   ​876K/​s] [Q4   ​881K/​s] [Q5   ​878K/​s] [Q6   ​898K/​s] [Q7   ​890K/​s] [QT  7064K/s 16586K/s ->     1K/s]
 +[Q0   ​957K/​s] [Q1   ​964K/​s] [Q2   ​941K/​s] [Q3   ​941K/​s] [Q4   ​943K/​s] [Q5   ​945K/​s] [Q6   ​967K/​s] [Q7   ​963K/​s] [QT  7624K/s 18220K/s ->     1K/s]
 +[Q0   ​864K/​s] [Q1   ​873K/​s] [Q2   ​853K/​s] [Q3   ​852K/​s] [Q4   ​854K/​s] [Q5   ​856K/​s] [Q6   ​874K/​s] [Q7   ​868K/​s] [QT  6897K/s 16413K/s ->     1K/s]
 +</​code>​
 +
 +=> All 8 receive queues are correctly used here.
 +
 +=== Interrupt usage ===
 +
Report the number of interrupts taken by each device since system startup.

Here is an example:
 +<​code>​
 +[root@hp]~# vmstat -i
 +interrupt ​                         total       rate
 +irq1: atkbd0 ​                          ​6 ​         0
 +irq4: uart0                            3          0
 +irq20: ehci1                        3263          3
 +irq21: ehci0                       ​12612 ​        10
 +cpu0:​timer ​                       749022 ​       583
 +cpu6:​timer ​                       736682 ​       573
 +cpu7:​timer ​                       742555 ​       578
 +cpu1:​timer ​                       739964 ​       576
 +cpu2:​timer ​                       739621 ​       575
 +cpu5:​timer ​                       738796 ​       575
 +cpu4:​timer ​                       738756 ​       575
 +cpu3:​timer ​                       737954 ​       574
 +irq273: igb1:que 0                  1291          1
 +irq274: igb1:que 1                  1245          1
 +irq275: igb1:que 2                  1245          1
 +irq276: igb1:que 3                  1246          1
 +irq277: igb1:que 4                  1245          1
 +irq278: igb1:que 5                  2435          2
 +irq279: igb1:que 6                  1245          1
 +irq280: igb1:que 7                  1245          1
 +irq281: igb1:​link ​                     2          0
 +irq283: t5nex0:​evt ​                    ​4 ​         0
 +irq284: t5nex0:​0a0 ​               608689 ​       474
 +irq285: t5nex0:​0a1 ​               281882 ​       219
 +irq286: t5nex0:​0a2 ​               875179 ​       681
 +irq287: t5nex0:​0a3 ​               813033 ​       632
 +irq288: t5nex0:​0a4 ​               968845 ​       754
 +irq289: t5nex0:​0a5 ​              ​1099491 ​       855
 +irq290: t5nex0:​0a6 ​                ​58210 ​        45
 +irq291: t5nex0:​0a7 ​               638755 ​       497
 +irq294: t5nex0:​1a0 ​               102669 ​        80
 +irq295: t5nex0:​1a1 ​               136891 ​       106
 +irq296: t5nex0:​1a2 ​                ​51888 ​        40
 +irq297: t5nex0:​1a3 ​                ​59324 ​        46
 +irq298: t5nex0:​1a4 ​                ​61052 ​        47
 +irq299: t5nex0:​1a5 ​                ​80827 ​        63
 +irq300: t5nex0:​1a6 ​                ​88800 ​        69
 +irq301: t5nex0:​1a7 ​               102177 ​        79
 +Total                           ​11978149 ​      9318
 +</​code>​
 +
=> There is no IRQ sharing here, and each queue correctly has its own IRQ (thanks to MSI-X).
 +
 +=== Memory Buffer ===
 +
Show statistics recorded by the memory management routines; the network stack manages a private pool of memory buffers.
 +
 +<​code>​
 +[root@hp]~# vmstat -z | head -1 ; vmstat -z | grep -i mbuf
 +ITEM                   ​SIZE ​ LIMIT     ​USED ​    ​FREE ​     REQ FAIL SLEEP
 +mbuf_packet: ​           256, 26137965, ​   8190,    1424,    9121,   ​0, ​  0
 +mbuf:                   256, 26137965, ​   3657,    4699,​4372990387, ​  ​0, ​  0
 +mbuf_cluster: ​         2048, 4084056, ​   9614,      12,    9614,   ​0, ​  0
 +mbuf_jumbo_page: ​      4096, 2042027, ​  ​16192, ​    ​131, ​  ​16207, ​  ​0, ​  0
 +mbuf_jumbo_9k: ​        9216, 605045, ​      ​0, ​      ​0, ​      ​0, ​  ​0, ​  0
 +mbuf_jumbo_16k: ​      ​16384,​ 340337, ​    ​128, ​      ​0, ​    ​128, ​  ​0, ​  0
 +</​code>​
 +
=> No failures here (the FAIL column is zero).
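
If the FAIL column were not zero, the corresponding pool limit could be raised; a sketch, assuming mbuf clusters are the exhausted zone (the value is only an example, to adjust to the available RAM):
<code>
echo 'kern.ipc.nmbclusters="4000000"' >> /boot/loader.conf
</code>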
 +
 +=== CPU / NIC ===
 +
 +top can give very useful information regarding the CPU/NIC affinity:
 +
 +<​code>​
 +[root@hp]~# top -CHIPS
 +last pid:  1180;  load averages: 10.05, ​ 8.86,  5.71                                                 up 0+00:​23:​58 ​ 12:05:19
 +187 processes: 15 running, 100 sleeping, 72 waiting
 +CPU 0:  0.0% user,  0.0% nice,  0.0% system, 96.9% interrupt, ​ 3.1% idle
 +CPU 1:  0.0% user,  0.0% nice,  0.0% system, 99.2% interrupt, ​ 0.8% idle
 +CPU 2:  0.0% user,  0.0% nice,  0.0% system, 99.6% interrupt, ​ 0.4% idle
 +CPU 3:  0.0% user,  0.0% nice,  0.0% system, 97.7% interrupt, ​ 2.3% idle
 +CPU 4:  0.0% user,  0.0% nice,  0.0% system, 98.1% interrupt, ​ 1.9% idle
 +CPU 5:  0.0% user,  0.0% nice,  0.0% system, 97.3% interrupt, ​ 2.7% idle
 +CPU 6:  0.0% user,  0.0% nice,  0.0% system, 97.7% interrupt, ​ 2.3% idle
 +CPU 7:  0.0% user,  0.0% nice,  0.0% system, 97.3% interrupt, ​ 2.7% idle
 +Mem: 16M Active, 16M Inact, 415M Wired, 7239K Buf, 62G Free
 +Swap:
 +
 +  PID USERNAME ​  PRI NICE   ​SIZE ​   RES STATE   ​C ​  ​TIME ​    CPU COMMAND
 +   12 root       ​-92 ​   -     ​0K ​ 1248K CPU1    1  13:14 100.00% intr{irq285:​ t5nex0:0a1}
 +   12 root       ​-92 ​   -     ​0K ​ 1248K CPU7    7  13:09  98.54% intr{irq284:​ t5nex0:0a0}
 +   12 root       ​-92 ​   -     ​0K ​ 1248K WAIT    3  13:10  98.21% intr{irq291:​ t5nex0:0a7}
 +   12 root       ​-92 ​   -     ​0K ​ 1248K WAIT    6  13:02  97.33% intr{irq287:​ t5nex0:0a3}
 +   12 root       ​-92 ​   -     ​0K ​ 1248K CPU0    0  13:02  97.30% intr{irq286:​ t5nex0:0a2}
 +   12 root       ​-92 ​   -     ​0K ​ 1248K CPU5    5  13:01  97.26% intr{irq288:​ t5nex0:0a4}
 +   12 root       ​-92 ​   -     ​0K ​ 1248K WAIT    4  12:59  97.19% intr{irq289:​ t5nex0:0a5}
 +   12 root       ​-92 ​   -     ​0K ​ 1248K CPU2    2  13:17  11.43% intr{irq290:​ t5nex0:0a6}
 +   11 root       155 ki31     ​0K ​  128K RUN     ​5 ​ 11:02   2.86% idle{idle: cpu5}
 +   11 root       155 ki31     ​0K ​  128K RUN     ​6 ​ 10:58   2.31% idle{idle: cpu6}
 +   11 root       155 ki31     ​0K ​  128K RUN     ​3 ​ 10:52   2.11% idle{idle: cpu3}
 +   11 root       155 ki31     ​0K ​  128K CPU4    4  10:58   2.05% idle{idle: cpu4}
 +   11 root       155 ki31     ​0K ​  128K RUN     ​7 ​ 10:54   1.87% idle{idle: cpu7}
 +   11 root       155 ki31     ​0K ​  128K RUN     ​0 ​ 10:54   1.72% idle{idle: cpu0}
 +   11 root       155 ki31     ​0K ​  128K RUN     ​1 ​ 10:52   1.20% idle{idle: cpu1}
 +   15 root       ​-16 ​   -     ​0K ​   16K -       ​7 ​  ​0:​02 ​  0.24% rand_harvestq
 +    0 root       ​-92 ​   -     ​0K ​  624K -       ​6 ​  ​0:​01 ​  0.10% kernel{t5nex0 tq1}
 +   11 root       155 ki31     ​0K ​  128K RUN     ​2 ​ 10:44   0.09% idle{idle: cpu2}
 + 1180 root        20    0 20012K ​ 3876K CPU4    4   ​0:​00 ​  0.06% top
 +   12 root       ​-92 ​   -     ​0K ​ 1248K WAIT    0   ​0:​00 ​  0.04% intr{irq301:​ t5nex0:1a7}
 +   12 root       ​-92 ​   -     ​0K ​ 1248K WAIT    6   ​0:​00 ​  0.03% intr{irq295:​ t5nex0:1a1}
 +   12 root       ​-92 ​   -     ​0K ​ 1248K WAIT    4   ​0:​00 ​  0.03% intr{irq300:​ t5nex0:1a6}
 +   12 root       ​-92 ​   -     ​0K ​ 1248K WAIT    7   ​0:​00 ​  0.03% intr{irq299:​ t5nex0:1a5}
 +   12 root       ​-92 ​   -     ​0K ​ 1248K WAIT    0   ​0:​00 ​  0.03% intr{irq294:​ t5nex0:1a0}
 +   12 root       ​-60 ​   -     ​0K ​ 1248K WAIT    1   ​0:​00 ​  0.02% intr{swi4: clock (0)}
 +   12 root       ​-92 ​   -     ​0K ​ 1248K WAIT    4   ​0:​00 ​  0.02% intr{irq296:​ t5nex0:1a2}
 +   12 root       ​-92 ​   -     ​0K ​ 1248K WAIT    0   ​0:​00 ​  0.02% intr{irq298:​ t5nex0:1a4}
 +   12 root       ​-92 ​   -     ​0K ​ 1248K WAIT    0   ​0:​00 ​  0.01% intr{irq297:​ t5nex0:1a3}
 + 1090 root        20    0 56296K ​ 6384K select ​ 3   ​0:​00 ​  0.00% sshd
 +   14 root       ​-68 ​   -     ​0K ​  240K -       ​6 ​  ​0:​00 ​  0.00% usb{usbus2}
 +   14 root       ​-68 ​   -     ​0K ​  240K -       ​6 ​  ​0:​00 ​  0.00% usb{usbus2}
 +   14 root       ​-68 ​   -     ​0K ​  240K -       ​6 ​  ​0:​00 ​  0.00% usb{usbus0}
 +   12 root       ​-88 ​   -     ​0K ​ 1248K WAIT    6   ​0:​00 ​  0.00% intr{irq20: ehci1}
 +   12 root       ​-92 ​   -     ​0K ​ 1248K WAIT    5   ​0:​00 ​  0.00% intr{irq278:​ igb1:que 5}
 +   14 root       ​-68 ​   -     ​0K ​  240K -       ​6 ​  ​0:​00 ​  0.00% usb{usbus0}
 +   12 root       ​-88 ​   -     ​0K ​ 1248K WAIT    6   ​0:​00 ​  0.00% intr{irq21: ehci0}
 +   14 root       ​-68 ​   -     ​0K ​  240K -       ​1 ​  ​0:​00 ​  0.00% usb{usbus1}
 +   18 root       ​-16 ​   -     ​0K ​   48K psleep ​ 5   ​0:​00 ​  0.00% pagedaemon{pagedaemon}
 +   24 root        16    -     ​0K ​   16K syncer ​ 0   ​0:​00 ​  0.00% syncer
 +   21 root       ​-16 ​   -     ​0K ​   16K -       ​5 ​  ​0:​00 ​  0.00% bufspacedaemon
 +   23 root       ​-16 ​   -     ​0K ​   16K vlruwt ​ 7   ​0:​00 ​  0.00% vnlru
 +</​code>​
 +
 +=== Drivers ===
 +
Depending on the NIC driver used, there are some counters available:
 +
 +<​code>​
 +[root@hp]~# sysctl dev.cxl.0.stats.
 +dev.cxl.0.stats.rx_ovflow2:​ 0
 +dev.cxl.0.stats.rx_ovflow1:​ 0
 +dev.cxl.0.stats.rx_ovflow0:​ 7301719197
 +dev.cxl.0.stats.rx_ppp7:​ 0
 +dev.cxl.0.stats.rx_ppp6:​ 0
 +dev.cxl.0.stats.rx_ppp5:​ 0
 +dev.cxl.0.stats.rx_ppp4:​ 0
 +dev.cxl.0.stats.rx_ppp3:​ 0
 +dev.cxl.0.stats.rx_ppp2:​ 0
 +dev.cxl.0.stats.rx_ppp1:​ 0
 +dev.cxl.0.stats.rx_ppp0:​ 0
 +dev.cxl.0.stats.rx_pause:​ 0
 +dev.cxl.0.stats.rx_frames_1519_max:​ 0
 +dev.cxl.0.stats.rx_frames_1024_1518:​ 0
 +dev.cxl.0.stats.rx_frames_512_1023:​ 0
 +dev.cxl.0.stats.rx_frames_256_511:​ 0
 +dev.cxl.0.stats.rx_frames_128_255:​ 0
 +dev.cxl.0.stats.rx_frames_65_127:​ 0
 +dev.cxl.0.stats.rx_frames_64:​ 12522860904
 +(...)
 +[root@hp]~# sysctl -d dev.cxl.0.stats.rx_ovflow0
 +dev.cxl.0.stats.rx_ovflow0:​ # drops due to buffer-group 0 overflows
 +</​code>​
 +
=> Notice the high number of "drops due to buffer-group 0 overflows".
This indicates a global performance problem on this system (in this example, the packet generator sends minimum-size packets at a rate of about 14Mpps).
 +
 +=== pmcstat ===
 +
During high load on your router/firewall, load the [[https://www.freebsd.org/cgi/man.cgi?query=hwpmc&sektion=4|hwpmc(4)]] module:
 +<​code>​
 +kldload hwpmc
 +</​code>​
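
To load the module automatically at boot, it can also be enabled in /boot/loader.conf:
<code>
echo 'hwpmc_load="YES"' >> /boot/loader.conf
</code>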
 +
 +== Time used by process ==
 +
Now you can display the processes consuming the most time with:
 +<​code>​
 +pmcstat -TS inst_retired.any_p -w1
 +</​code>​
 +
 +That will display this output:
 +<​code>​
 +PMC: [INSTR_RETIRED_ANY] Samples: 56877 (100.0%) , 0 unresolved
 +
 +%SAMP IMAGE      FUNCTION ​            ​CALLERS
 +  7.2 kernel ​    ​bzero ​               m_pkthdr_init:​2.4 ip_findroute:​1.8 ip_tryforward:​1.6 fib4_lookup_nh_basic:​1.3
 +  6.1 if_cxgbe.k eth_tx ​              ​drain_ring
 +  5.9 kernel ​    ​bcopy ​               eth_tx:2.3 arpresolve:​1.9 get_scatter_segment:​1.7
 +  4.5 kernel ​    ​atomic_cmpset_long ​  ​mp_ring_enqueue:​2.5 drain_ring:​2.0
 +  3.7 kernel ​    ​__rw_rlock ​          ​arpresolve:​2.7 fib4_lookup_nh_basic:​1.0
 +  3.4 kernel ​    ​rn_match ​            ​fib4_lookup_nh_basic
 +  3.3 kernel ​    ​ip_tryforward ​       ip_input
 +  3.2 kernel ​    ​_rw_runlock_cookie ​  ​fib4_lookup_nh_basic:​2.2 arpresolve:​1.0
 +  2.9 if_cxgbe.k reclaim_tx_descs ​    ​eth_tx
 +  2.9 if_cxgbe.k service_iq ​          ​t4_intr
 +  2.9 kernel ​    ​ether_output ​        ​ip_tryforward
 +  2.8 kernel ​    ​ether_nh_input ​      ​netisr_dispatch_src
 +  2.8 if_cxgbe.k cxgbe_transmit ​      ​ether_output
 +  2.8 kernel ​    ​netisr_dispatch_src ​ ether_demux:​1.7 ether_input:​1.1
 +  2.7 kernel ​    ​uma_zalloc_arg ​      ​get_scatter_segment
 +  2.5 kernel ​    ​_rm_rlock ​           in_localip
 +  2.4 if_cxgbe.k t4_eth_rx ​           service_iq
 +  2.3 if_cxgbe.k parse_pkt ​           cxgbe_transmit
 +  2.3 kernel ​    ​memcpy ​              ​ether_output
 +  2.2 kernel ​    ​uma_zfree_arg ​       m_freem
 +  2.1 kernel ​    ​spinlock_exit ​       ether_nh_input
 +  2.0 kernel ​    ​fib4_lookup_nh_basic ip_findroute
 +  2.0 kernel ​    ​ip_input ​            ​netisr_dispatch_src
 +  2.0 if_cxgbe.k get_scatter_segment ​ service_iq
 +  1.9 kernel ​    ​__mtx_lock_flags ​    ​eth_tx
 +  1.8 kernel ​    ​__mtx_unlock_flags ​  ​eth_tx
 +  1.6 kernel ​    ​random_harvest_queue ether_nh_input
 +  1.4 if_cxgbe.k mp_ring_enqueue ​     cxgbe_transmit
 +  1.1 kernel ​    ​bcmp ​                ​ether_nh_input
 +  1.1 kernel ​    ​critical_enter
 +  1.1 kernel ​    ​key_havesp ​          ​ipsec4_capability
 +  1.0 kernel ​    ​m_adj ​               ether_demux
 +  1.0 kernel ​    ​ipsec4_capability ​   ip_input
 +  1.0 kernel ​    ​in_localip ​          ​ip_tryforward
 +  0.9 kernel ​    ​pmap_kextract ​       parse_pkt
 +  0.7 kernel ​    ​ether_demux ​         ether_nh_input
 +  0.7 kernel ​    ​arpresolve ​          ​ether_output
 +  0.7 kernel ​    ​_rm_runlock ​         in_localip
 +  0.7 kernel ​    ​sglist_count ​        ​parse_pkt
 +  0.6 kernel ​    ​lock_delay ​          ​_mtx_lock_spin_cookie
 +  0.6 kernel ​    ​critical_exit
 +</​code>​
 +
In this case the bottleneck is just the network stack.
 +
 +== CPU cycles spent ==
 +
To display where the most CPU cycles are being spent, we first need a partition with about 200MB that includes the debug kernel:
 +
 +<​code>​
 +system expand-data-slice
 +mount /data
 +</​code>​
 +
Then, under high load, start collecting for about 20 seconds:
 +<​code>​
 +pmcstat -z 50 -S cpu_clk_unhalted.thread -l 20 -O /​data/​pmc.out
 +</​code>​
 +
Then analyze the output with:
 +<​code>​
 +fetch http://​BSDRP-release-debug
 +tar xzfv BSDRP-release-debug.tar.xz
 +pmcannotate /​data/​pmc.out /​data/​debug/​boot/​kernel/​kernel.symbols
 +</​code>​