====== FreeBSD forwarding Performance ======

There are lots of guides about [[http://

===== Concepts =====

==== How to bench a router ====

Benchmarking a router **is not** about measuring the maximum bandwidth crossing the router, but about measuring the network throughput (in packets-per-second):
  * [[http://
  * [[http://
  * [[http://
  * [[http://

==== Definition ====

A clear definition of the relation between bandwidth and frame rate is mandatory:
  * [[http://
  * [[http://

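The bandwidth/frame-rate relation can be checked with simple arithmetic: a minimum-size (64-byte) Ethernet frame also pays 20 bytes of on-wire overhead (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap), which gives the classic 14.88Mpps line rate on a 10Gb/s link:

```shell
# 10 Gbit/s divided by the on-wire size of a minimal Ethernet frame:
# (64 bytes + 20 bytes preamble/SFD/IFG) * 8 = 672 bits per frame
echo $(( 10000000000 / ((64 + 20) * 8) ))   # 14880952 packets-per-second
```

The same formula gives 1.488Mpps for Gigabit Ethernet.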
===== Benchmarks =====

==== Cisco or Linux ====

  * [[http://
  * [[http://
  * [[http://
  * [[http://

==== FreeBSD ====

Here are some benchmarks of FreeBSD network forwarding performance (made by the BSDRP team):
  * AsiaBSDCon 2018 - Tuning FreeBSD for routing and firewalling ([[https://
  * [[http://
  * [[https://
  * [[https://
  * [[https://
  * [[https://

===== Bench lab =====

The [[bench lab]] should permit measuring the pps. For obtaining accurate results, the [[http://

===== Tuning =====

==== Literature ====

Here is a list of sources about optimizing/

How to bench or tune the network stack:
  * [[http://
  * [[http://
  * [[http://
  * [[http://
  * [[http://
  * [[http://
  * [[http://
  * [[https://
  * [[http://
  * [[http://
  * [[http://

FreeBSD experimental high-performance network stacks:
  * [[http://
  * [[http://

==== Multiple flows ====

Don't try to bench a router with only one flow (same source/destination address and same source/destination port): you need to generate multiple flows.
Multi-queue NICs use features like [[https://

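As an illustrative sketch (the flags are netmap's pkt-gen; the interface name and address ranges are placeholders to adapt to your lab), sweeping source and destination addresses and ports generates many flows for the NIC's RSS hash to spread across queues:

```shell
# Transmit 60-byte frames while cycling through 20 source and 100 destination
# IPv4 addresses and ports, producing thousands of distinct flows.
# Requires a netmap-capable interface; igb2 is a placeholder name.
pkt-gen -f tx -i igb2 -l 60 \
  -s 198.18.10.1:2000-198.18.10.20 \
  -d 198.19.10.1:2000-198.19.10.100
```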
During your load test, check that each queue is used, with sysctl or with [[https://

In this example we can see that all flows are correctly shared between the 8 queues (about 340K packets-per-second each):
<code>
[root@router]~#
[Q0
(...)
</code>

<note warning>
Beware of configuration setups that prevent multi-queue,
</note>
==== Choosing good Hardware ====

=== CPU ===

Avoid NUMA architectures: prefer a CPU in a single package with the maximum number of cores (8 or 16).
If you are using NUMA, check that inbound/

=== Network Interface Card ===

Mellanox or Chelsio, mixing good chipsets with excellent drivers, are excellent choices.

Intel seems to have problems managing lots of PPS (= IRQs).

Avoid "
  * 10G Emulex OneConnect (be3)
  * 10G Broadcom NetXtreme II BCM57810

==== Choosing a good FreeBSD release ====

Before tuning, you need to use a good FreeBSD version.
This means a FreeBSD -head version newer than r309257 (Andrey V. Elsukov's improvement:

Since version 1.70, BSDRP uses FreeBSD 11-stable (r312663), which includes this improvement.

{{documentation:

For better (and linearly scaling) performance there is the [[https://

==== Disabling Hyper-Threading ====

Disable Hyper-Threading (HT): by default, lots of multi-queue NIC drivers create one queue per core.
But "

HT can be disabled with this command:
<code>
echo 'machdep.hyperthreading_allowed="0"' >> /boot/loader.conf
</code>

Here is an example on an 8-core (x2 hardware threads) Intel CPU and a 10G Chelsio NIC:

<code>
x HT-enabled-8rxq(default).packets-per-seconds
+ HT-enabled-16rxq.packets-per-seconds
* HT-disabled.packets-per-seconds
N
x
+
Difference at 95.0% confidence
        440068 +/- 144126
        9.46731% +/- 3.23827%
        (Student's t)
*
Difference at 95.0% confidence
        1.13671e+06 +/- 98524.2
        24.4544% +/- 2.62824%
        (Student's t)
</code>

Disabling Hyper-Threading brings a benefit of about 24%.

==== fastforwarding ====

=== FreeBSD 10.3 or older ===

You should enable fastforwarding with:
<code>
echo "net.inet.ip.fastforwarding=1" >> /etc/sysctl.conf
service sysctl restart
</code>

=== FreeBSD 12.0 or newer ===

You should enable tryforward by disabling ICMP redirects:
<code>
echo "net.inet.ip.redirect=0" >> /etc/sysctl.conf
echo "net.inet6.ip6.redirect=0" >> /etc/sysctl.conf
service sysctl restart
</code>

==== Entropy harvest impact ====

Lots of tuning guides advise disabling:
  * kern.random.sys.harvest.ethernet
  * kern.random.sys.harvest.interrupt

By default, the binary mask 511 selects almost all of these as entropy sources:
<code>
kern.random.harvest.mask_symbolic:
kern.random.harvest.mask_bin:
kern.random.harvest.mask: 511
</code>

By replacing this mask with 351, we exclude INTERRUPT and NET_ETHER:
<code>
kern.random.harvest.mask_symbolic:
kern.random.harvest.mask_bin:
kern.random.harvest.mask: 351
</code>
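The two masks differ by exactly the NET_ETHER and INTERRUPT bits (assuming the FreeBSD 11 harvest-source numbering, where NET_ETHER is bit 5 and INTERRUPT is bit 7):

```shell
# Clear bit 5 (NET_ETHER = 32) and bit 7 (INTERRUPT = 128) from the
# default mask 511 (all nine low-numbered entropy sources enabled)
echo $(( 511 - (1 << 5) - (1 << 7) ))   # 351
```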

And we can notice the impact on forwarding performance of FreeBSD 11.1:

<code>
x PC-Engines-APU2-igb,
+ PC-Engines-APU2-igb,
N
x
+
Difference at 95.0% confidence
        22831.3 +/- 5060.46
        3.13927% +/- 0.701645%
        (Student's t)
</code>

On a PC Engines APU2, there is a +3% performance benefit.

<code>
x Netgate-igb,
+ Netgate-igb,
N
x
+
Difference at 95.0% confidence
        45376.5 +/- 9918.76
        4.75285% +/- 1.0771%
        (Student's t)
</code>

On a Netgate RCC-VE 4860 there is about a 4.7% performance benefit.

Using the FreeBSD "

{{documentation:

==== Polling mode ====

Polling can be used in 2 cases:
  * On **old hardware only** (where the Ethernet card doesn'
  * When used [[http://

For enabling polling mode:
  - Edit /
  - Execute: service polling start

=== NIC drivers compatibility matrix ===

BSDRP can use some special features on some NICs:
  * [[http://
  * [[http://

And only these devices support these modes:

^ name ^ Description ^ polling ^ ALTQ ^
| ae | Attansic/ | | |
| age | Attansic/ | | |
| alc | Atheros AR813x/ | | |
| ale | Atheros AR8121/ | | |
| bce | Broadcom NetXtreme II (BCM5706/ | | |
| bfe | Broadcom BCM4401 Ethernet Device Driver | no | yes |
| bge | Broadcom BCM570x/ | | |
| cas | Sun Cassini/ | | |
| cxgbe | Chelsio T4 and T5 based 40Gb, 10Gb, and 1Gb Ethernet adapter driver | no | yes |
| dc | DEC/Intel 21143 and clone 10/100 Ethernet driver | yes | yes |
| de | DEC DC21x4x Ethernet device driver | no | yes |
| ed | NE-2000 and WD-80x3 Ethernet driver | no | yes |
| em | Intel(R) PRO/1000 Gigabit Ethernet adapter driver | yes | yes |
| et | Agere ET1310 10/ | | |
| ep | Ethernet driver for 3Com Etherlink III (3c5x9) interfaces | no | yes |
| fxp | Intel EtherExpress PRO/100 Ethernet device driver | yes | yes |
| gem | ERI/ | | |
| hme | Sun Microelectronics STP2002-STQ Ethernet interfaces device driver | no | yes |
| igb | Intel(R) PRO/1000 PCI Express Gigabit Ethernet adapter driver | yes | needs IGB_LEGACY_TX |
| ixgb(e) | Intel(R) 10Gb Ethernet driver | yes | needs IGB_LEGACY_TX |
| jme | JMicron Gigabit/ | | |
| le | AMD Am7900 LANCE and Am79C9xx ILACC/PCnet Ethernet interface driver | no | yes |
| msk | Marvell/ | | |
| mxge | Myricom Myri10GE 10 Gigabit Ethernet adapter driver | no | yes |
| my | Myson Technology Ethernet PCI driver | no | yes |
| nfe | NVIDIA nForce MCP Ethernet driver | yes | yes |
| nge | National Semiconductor PCI Gigabit Ethernet adapter driver | yes | no |
| nve | NVIDIA nForce MCP Networking Adapter device driver | no | yes |
| qlxgb | QLogic 10 Gigabit Ethernet & CNA Adapter Driver | no | yes |
| re | RealTek 8139C+/ | | |
| rl | RealTek 8129/8139 Fast Ethernet device driver | yes | yes |
| sf | Adaptec AIC-6915 "Starfire" | | |
| sge | Silicon Integrated Systems SiS190/191 Fast/ | | |
| sis | SiS 900, SiS 7016 and NS DP83815/ | | |
| sk | SysKonnect SK-984x and SK-982x PCI Gigabit Ethernet adapter driver | yes | yes |
| ste | Sundance Technologies ST201 Fast Ethernet device driver | no | yes |
| stge | Sundance/ | | |
| ti | Alteon Networks Tigon I and Tigon II Gigabit Ethernet driver | no | yes |
| txp | 3Com 3XP Typhoon/ | | |
| vge | VIA Networking Technologies VT6122 PCI Gigabit Ethernet adapter driver | yes | yes |
| vr | VIA Technologies Rhine I/II/III Ethernet device driver | yes | yes |
| xl | 3Com Etherlink XL and Fast Etherlink XL Ethernet device driver | yes | yes |

Using other NICs will work too :-)
==== NIC drivers tuning ====

=== RX & TX descriptor (queue) size on igb ===

The receive (hw.igb.rxd) and transmit (hw.igb.txd) internal buffer sizes of igb/em NICs can be increased, but it's not a good idea.

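For reference, these are boot-time loader tunables; a sketch of raising them (the default is 1024 descriptors; the measurements that follow show this is counter-productive):

```shell
# /boot/loader.conf -- igb descriptor ring sizes (defaults: 1024)
hw.igb.rxd="4096"
hw.igb.txd="4096"
```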
Here are some examples where performance decreases when the buffers are increased:

<code>
x PC-Engine-APU2-igb,
+ PC-Engine-APU2-igb,
* PC-Engine-APU2-igb,
N
x
+
Difference at 95.0% confidence
        -92820 +/- 6788.67
        -12.762% +/- 0.909828%
        (Student's t)
*
Difference at 95.0% confidence
        -145404 +/- 4690.94
        -19.9918% +/- 0.592106%
        (Student's t)
</code>

On a PC Engines APU2, increasing the rx & tx buffers degrades forwarding performance by about 20%.

<code>
x Netgate-igb,
+ Netgate-igb,
* Netgate-igb,
N
x
+
Difference at 95.0% confidence
        -37210.1 +/- 11020
        -3.90575% +/- 1.13593%
        (Student's t)
*
Difference at 95.0% confidence
        -60083.2 +/- 13567.4
        -6.30663% +/- 1.39717%
        (Student's t)
</code>

On a Netgate RCC-VE 4860, performance decreases by about 6%.

=== Maximum number of received packets to process at a time (Intel) ===

By default, Intel drivers (em|igb) limit the maximum number of received packets to process at a time (hw.igb.rx_process_limit=100).
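This limit is a loader tunable; a sketch of disabling it (-1 means no limit):

```shell
# /boot/loader.conf -- process all available packets per interrupt
hw.igb.rx_process_limit="-1"
```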

Disabling this limit can improve overall performance a little bit:

<code>
x PC-Engines-APU2-igb,
+ PC-Engines-APU2-igb,
N
x
+
Difference at 95.0% confidence
        9107.6 +/- 3928.6
        1.25419% +/- 0.547037%
        (Student's t)
</code>

A small 1% improvement on a PC Engines APU2.

<code>
x Netgate-igb,
+ Netgate-igb,
N
x
+
Difference at 95.0% confidence
        16397.3 +/- 9285.49
        1.72358% +/- 0.990789%
        (Student's t)
</code>

Almost the same improvement (about 1.7%) on the Netgate.

=== Increasing maximum interrupts per second ===

By default, igb|em drivers limit the maximum number of interrupts per second to 8000.
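This cap is also a loader tunable; a sketch of doubling it:

```shell
# /boot/loader.conf -- raise the per-queue interrupt rate cap (default 8000)
hw.igb.max_interrupt_rate="16000"
```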

The result of increasing this number:

<code>
x PC-Engine-APU2-igb,
+ PC-Engine-APU2-igb,
* PC-Engine-APU2-igb,
N
x
+
No difference proven at 95.0% confidence
*
No difference proven at 95.0% confidence
</code>

No benefit on a PC Engines APU2.

<code>
x Netgate-igb,
+ Netgate-igb,
* Netgate-igb,
N
x
+
Difference at 95.0% confidence
        9732.7 +/- 9295.06
        1.02877% +/- 0.987938%
        (Student's t)
*
No difference proven at 95.0% confidence
</code>

A small 1% benefit when doubled from the default (8000 to 16000), but no further benefit beyond that.

=== Disabling LRO and TSO ===

All modern NICs support [[http://
  - By waiting to accumulate multiple packets at the NIC level before handing them up to the stack: this adds latency, and because all packets need to be sent out again, the stack has to split them into separate packets again before handing them down to the NIC. [[http://
  - This breaks the [[http://

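A router should therefore disable both offloads; a sketch using a placeholder interface name (the same flags work at runtime: ifconfig igb0 -tso -lro):

```shell
# /etc/rc.conf -- persistently disable TSO and LRO on a forwarding interface
ifconfig_igb0="up -tso -lro"
```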
There is no real impact on PPS performance from disabling these features:
<code>
x Xeon_E5-2650-8Cores-Chelsio_T540,
+ Xeon_E5-2650-8Cores-Chelsio_T540,
N
x
+
No difference proven at 95.0% confidence
</code>
==== Where is the bottleneck? ====

Tools:
  * [[http://
  * [[http://
  * [[http://
  * [[http://
  * [[https://

=== Packets load ===

Display the information regarding packet traffic, refreshed each second.

Here is a first example:

<code>
[root@hp]~# netstat -ihw1
            input        (Total)           output
(...)
</code>

=> This system is receiving 14Mpps (10G line-rate) but manages to forward at only a 5.8Mpps rate, so it needs to drop about 8Mpps.

=== Traffic distribution between each queue ===

Check that the input queues are equally distributed. BSDRP includes a sysctl parser script for that:

<code>
[root@hp]~# nic-queue-usage cxl0
[Q0
(...)
</code>

=> All 8 receive queues are correctly used here.

=== Interrupt usage ===

Report the number of interrupts taken by each device since system startup.

Here is a first example:
<code>
[root@hp]~# vmstat -i
interrupt                          total       rate
irq1: atkbd0
irq4: uart0                            3          0
irq20: ehci1                        3263          3
irq21: ehci0
cpu0:
cpu6:
cpu7:
cpu1:
cpu2:
cpu5:
cpu4:
cpu3:
irq273: igb1:que 0                  1291          1
irq274: igb1:que 1                  1245          1
irq275: igb1:que 2                  1245          1
irq276: igb1:que 3                  1246          1
irq277: igb1:que 4                  1245          1
irq278: igb1:que 5                  2435          2
irq279: igb1:que 6                  1245          1
irq280: igb1:que 7                  1245          1
irq281: igb1:
irq283: t5nex0:
irq284: t5nex0:
irq285: t5nex0:
irq286: t5nex0:
irq287: t5nex0:
irq288: t5nex0:
irq289: t5nex0:
irq290: t5nex0:
irq291: t5nex0:
irq294: t5nex0:
irq295: t5nex0:
irq296: t5nex0:
irq297: t5nex0:
irq298: t5nex0:
irq299: t5nex0:
irq300: t5nex0:
irq301: t5nex0:
Total
</code>

=> There is no IRQ sharing here; each queue correctly has its own IRQ (thanks to MSI-X).

=== Memory Buffer ===

Show statistics recorded by the memory management routines. The network stack manages a private pool of memory buffers.

<code>
[root@hp]~# vmstat -z | head -1 ; vmstat -z | grep -i mbuf
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_packet:
mbuf:                   256, 26137965,
mbuf_cluster:
mbuf_jumbo_page:
mbuf_jumbo_9k:
mbuf_jumbo_16k:
</code>

=> No "FAIL" counters here.

=== CPU / NIC ===

top can give very useful information regarding CPU/NIC affinity:

<code>
[root@hp]~# top -CHIPS
last pid:  1180;  load averages: 10.05,
187 processes: 15 running, 100 sleeping, 72 waiting
CPU 0:  0.0% user, 0.0% nice, 0.0% system, 96.9% interrupt,
CPU 1:  0.0% user, 0.0% nice, 0.0% system, 99.2% interrupt,
CPU 2:  0.0% user, 0.0% nice, 0.0% system, 99.6% interrupt,
CPU 3:  0.0% user, 0.0% nice, 0.0% system, 97.7% interrupt,
CPU 4:  0.0% user, 0.0% nice, 0.0% system, 98.1% interrupt,
CPU 5:  0.0% user, 0.0% nice, 0.0% system, 97.3% interrupt,
CPU 6:  0.0% user, 0.0% nice, 0.0% system, 97.7% interrupt,
CPU 7:  0.0% user, 0.0% nice, 0.0% system, 97.3% interrupt,
Mem: 16M Active, 16M Inact, 415M Wired, 7239K Buf, 62G Free
Swap:

  PID USERNAME
(...)
</code>

=== Drivers ===

Depending on the NIC driver used, some counters are available:

<code>
[root@hp]~# sysctl dev.cxl.0.stats.
dev.cxl.0.stats.rx_ovflow2:
dev.cxl.0.stats.rx_ovflow1:
dev.cxl.0.stats.rx_ovflow0:
dev.cxl.0.stats.rx_ppp7:
dev.cxl.0.stats.rx_ppp6:
dev.cxl.0.stats.rx_ppp5:
dev.cxl.0.stats.rx_ppp4:
dev.cxl.0.stats.rx_ppp3:
dev.cxl.0.stats.rx_ppp2:
dev.cxl.0.stats.rx_ppp1:
dev.cxl.0.stats.rx_ppp0:
dev.cxl.0.stats.rx_pause:
dev.cxl.0.stats.rx_frames_1519_max:
dev.cxl.0.stats.rx_frames_1024_1518:
dev.cxl.0.stats.rx_frames_512_1023:
dev.cxl.0.stats.rx_frames_256_511:
dev.cxl.0.stats.rx_frames_128_255:
dev.cxl.0.stats.rx_frames_65_127:
dev.cxl.0.stats.rx_frames_64:
(...)
[root@hp]~# sysctl -d dev.cxl.0.stats.rx_ovflow0
dev.cxl.0.stats.rx_ovflow0:
</code>

=> Notice the high level of "drops due to buffer-group 0 overflows":
it's a problem for the global performance of the system (in this example, the packet generator sends smallest-size packets at a rate of about 14Mpps).

=== pmcstat ===

During high load of your router/firewall, first load the hwpmc module:
<code>
kldload hwpmc
</code>

== Time used by process ==

Now you can display the processes consuming the most time with:
<code>
pmcstat -TS inst_retired.any_p -w1
</code>

That will display this output:
<code>
PMC: [INSTR_RETIRED_ANY] Samples: 56877 (100.0%) , 0 unresolved

%SAMP IMAGE      FUNCTION
  7.2 kernel
  6.1 if_cxgbe.k eth_tx
  5.9 kernel
  4.5 kernel
  3.7 kernel
  3.4 kernel
  3.3 kernel
  3.2 kernel
  2.9 if_cxgbe.k reclaim_tx_descs
  2.9 if_cxgbe.k service_iq
  2.9 kernel
  2.8 kernel
  2.8 if_cxgbe.k cxgbe_transmit
  2.8 kernel
  2.7 kernel
  2.5 kernel
  2.4 if_cxgbe.k t4_eth_rx
  2.3 if_cxgbe.k parse_pkt
  2.3 kernel
  2.2 kernel
  2.1 kernel
  2.0 kernel
  2.0 kernel
  2.0 if_cxgbe.k get_scatter_segment
  1.9 kernel
  1.8 kernel
  1.6 kernel
  1.4 if_cxgbe.k mp_ring_enqueue
  1.1 kernel
  1.1 kernel
  1.1 kernel
  1.0 kernel
  1.0 kernel
  1.0 kernel
  0.9 kernel
  0.7 kernel
  0.7 kernel
  0.7 kernel
  0.7 kernel
  0.6 kernel
  0.6 kernel
</code>

In this case the bottleneck is simply the network stack itself.

== CPU cycles spent ==

For displaying where the most CPU cycles are being spent,
we first need a partition with about 200MB that includes the debug kernel:

<code>
system expand-data-slice
mount /data
</code>

Then, under high load, collect samples for about 20 seconds:
<code>
pmcstat -z 50 -S cpu_clk_unhalted.thread -l 20 -O /
pmcstat -R /
less /
</code>
documentation/technical_docs/performance.txt · Last modified: 2020/01/18 01:04 by olivier