documentation:technical_docs:performance
Differences
This shows you the differences between two versions of the page.
documentation:technical_docs:performance [2019/06/27 01:35] – [Where is the bottleneck ?] olivier | documentation:technical_docs:performance [2020/01/18 01:02] – [Disabling Hyper Threading (HT)] olivier | ||
---|---|---|---|
Line 27: | Line 27: | ||
* [[http:// | * [[http:// | ||
* [[http:// | * [[http:// | ||
+ | * [[https:// | ||
==== FreeBSD ==== | ==== FreeBSD ==== | ||
Line 90: | Line 91: | ||
Avoid NUMA architectures; prefer a single-package CPU with as many cores as possible (8 or 16). | Avoid NUMA architectures; prefer a single-package CPU with as many cores as possible (8 or 16). | ||
- | If you are using NUMA, check that inbound/ | + | If you are using NUMA, you need to check that inbound/ |
=== Network Interface Card === | === Network Interface Card === | ||
Line 104: | Line 105: | ||
==== Choosing good FreeBSD release ==== | ==== Choosing good FreeBSD release ==== | ||
- | Before tuning, you need to use a good FreeBSD version. | + | Before tuning, you need to use a good FreeBSD version... this means a recent |
- | This means a FreeBSD -head version older than r309257 (Andrey V. Elsukov 's improvement: | + | |
- | BSDRP since version 1.70 is using a FreeBSD | + | BSDRP is currently following |
+ | ==== Disabling Hyper Threading | ||
- | {{documentation: | + | By default |
- | + | But on some older CPUs (like the Xeon E5-2650 V1), those logical cores did not help at all with handling the interrupts generated by high-speed NICs. |
- | For better (and linear scale) performance there is the [[https:// | + | |
- | + | ||
- | ==== Disabling Hyper Threading ==== | + | |
- | + | ||
- | Disable Hyper Threading (HT): By default, lots of multi-queue NIC drivers create one queue per core. | + |
- | But "logical" | + | |
HT can be disabled with this command: | HT can be disabled with this command: | ||
Line 123: | Line 118: | ||
</ | </ | ||
- | Here is an example on a 8cores x hardware threads Intel CPU and 10G Chelsio NIC: | + | Here is an example on a Xeon E5 2650 (8c, |
< | < | ||
- | x HT-enabled-8rxq(default).packets-per-seconds | + | x HT-enabled-8rxq(default): inet packets-per-second forwarded |
- | + HT-enabled-16rxq.packets-per-seconds | + | + HT-enabled-16rxq: inet packets-per-second forwarded |
- | * HT-disabled.packets-per-seconds | + | * HT-disabled-8rxq: inet packets-per-seconds |
+--------------------------------------------------------------------------+ | +--------------------------------------------------------------------------+ | ||
| **| | | **| | ||
Line 150: | Line 145: | ||
</ | </ | ||
- | Disabling hyper threading gives a benefit of about 24%. | + | Disabling hyper threading gives a benefit of about 24% |
+ | |||
+ | But here is another example where there is a benefit to keeping it enabled (with the NIC configured to use all the threads), on a Xeon E5 2650L (10c, 20t):
+ | |||
+ | < | ||
+ | x HT on, 8q (default): inet4 packets-per-second forwarded | ||
+ | + HT off, 8q: inet4 packets-per-second forwarded | ||
+ | * HT on, 16q: inet4 packets-per-second forwarded | ||
+ | +--------------------------------------------------------------------------+ | ||
+ | |x x ++ | ||
+ | |x xx +++ * * *| | ||
+ | ||AM| |A_| |_MA_|| | ||
+ | +--------------------------------------------------------------------------+ | ||
+ | N | ||
+ | x | ||
+ | + | ||
+ | Difference at 95.0% confidence | ||
+ | 1.01311e+06 +/- 113098 | ||
+ | 23.2388% +/- 2.94299% | ||
+ | (Student' | ||
+ | * | ||
+ | Difference at 95.0% confidence | ||
+ | 4.41004e+06 +/- 173536 | ||
+ | 101.157% +/- 5.21388% | ||
+ | (Student' | ||
+ | </ | ||
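The two "Difference at 95.0% confidence" blocks above each give an absolute difference and a relative one. As a rough reading aid, the sketch below assumes that the relative figure is the absolute difference divided by the mean of the baseline run (the `x` data set, "HT on, 8q (default)"), and uses that to back out the baseline throughput. The function name `implied_baseline` and the dictionary layout are illustrative only; the numbers are copied from the output above.

```python
# Figures copied from the ministat output above.
# Baseline (x) is "HT on, 8q (default)"; each entry is
# (absolute difference in packets per second, relative difference in percent).
differences = {
    "HT off, 8q": (1.01311e06, 23.2388),
    "HT on, 16q": (4.41004e06, 101.157),
}


def implied_baseline(abs_diff, rel_diff_percent):
    """Baseline mean implied by an absolute and a relative difference.

    Assumes rel_diff_percent = 100 * abs_diff / baseline_mean.
    """
    return abs_diff / (rel_diff_percent / 100.0)


# Both comparisons share the same baseline, so both estimates should agree.
baselines = {
    name: implied_baseline(abs_diff, rel_pct)
    for name, (abs_diff, rel_pct) in differences.items()
}

for name, (abs_diff, _) in differences.items():
    base = baselines[name]
    print(f"{name}: baseline ~{base / 1e6:.2f} Mpps, "
          f"this run ~{(base + abs_diff) / 1e6:.2f} Mpps")
```

Under that assumption, both comparisons point to the same baseline of roughly 4.36 million packets per second, which is a quick consistency check on the two percentages. These are derived estimates, not values reported by the benchmark itself.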
==== fastforwarding ==== | ==== fastforwarding ==== | ||
Line 746: | Line 766: | ||
</ | </ | ||
- | In this case the bottleneck is just the network stack. | + | In this case the bottleneck is just the network stack (most of the time is spent in the function ip_findroute, called by ip_tryforward). |
== CPU cycles spent == | == CPU cycles spent == | ||
Line 761: | Line 781: | ||
< | < | ||
pmcstat -z 50 -S cpu_clk_unhalted.thread -l 20 -O / | pmcstat -z 50 -S cpu_clk_unhalted.thread -l 20 -O / | ||
- | pmcstat -R / | + | pmcstat -R / |
less / | less / | ||
</ | </ | ||
+ | |||
+ | === Lock contention source === | ||
+ | |||
+ | To identify the source of lock contention (for example, when lock_delay or __mtx_lock_sleep ranks high in the pmcstat output), you can use lockstat to find which lock is contended and why. |
+ | |||
+ | You can generate two outputs: |
+ | * contended locks broken down by type: < |
+ | * stacks associated with the lock contention to identify the source: < |
documentation/technical_docs/performance.txt · Last modified: 2020/01/18 01:04 by olivier