====== Forwarding performance lab of a PC Engines APU ======

{{description>Forwarding performance lab of a dual core AMD G series T40E APU (1 GHz) with 3 Realtek RTL8111E Gigabit}}

===== Hardware detail =====

This lab tests a [[http://www.pcengines.ch/apu.htm|PC Engines APU 1]] ([[PC Engines APU|dmesg]]):
  * Dual core [[http://www.amd.com/us/Documents/49282_G-Series_platform_brief.pdf|AMD G-T40E Processor]] (1 GHz)
  * 3 Realtek RTL8111E Gigabit Ethernet ports
  * 2 GB of RAM

[[documentation:examples:Forwarding performance lab of a PC Engines APU2|Forwarding performance of the APU version 2 is here.]]

===== Lab set-up =====

For more information about the full setup of this lab (switch configuration, etc.), see [[documentation:examples:Setting up a forwarding performance benchmark lab]].

==== Diagram ====

  +------------------------------------------+      +-----------------------+
  |             Device under Test            |      |       Packet gen      |
  |                                          |      |                       |
  | re1: 198.18.0.207/24                     |<=====| igb2: 198.18.0.203/24 |
  |      2001:2::207/64                      |      | 2001:2::203/64        |
  |      00:0d:b9:3c:dd:3d                   |      | 00:1b:21:c4:95:7a     |
  |                                          |      |                       |
  | re2: 198.19.0.207/24                     |=====>| igb3: 198.19.0.203/24 |
  |      2001:2:0:8000::207/64               |      | 2001:2:0:8000::203/64 |
  |      00:0d:b9:3c:dd:3e                   |      | 00:1b:21:c4:95:7b     |
  |                                          |      |                       |
  | static routes                            |      |                       |
  | 198.19.0.0/16      => 198.19.0.203       |      +-----------------------+
  | 198.18.0.0/16      => 198.18.0.203       |
  | 2001:2::/49        => 2001:2::203        |
  | 2001:2:0:8000::/49 => 2001:2:0:8000::203 |
  |                                          |
  | static arp and ndp                       |
  | 198.18.0.203 => 00:1b:21:c4:95:7a        |
  | 2001:2::203                              |
  |                                          |
  | 198.19.0.203 => 00:1b:21:c4:95:7b        |
  | 2001:2:0:8000::203                       |
  |                                          |
  +------------------------------------------+

The generator **MUST** generate lots of IP flows (multiple source/destination IP addresses and/or UDP source/destination ports) at the minimum packet size (to reach the maximum packet rate), with one of these commands.

Multiple source/destination IP addresses and ports (don't forget to specify the port range: otherwise pkt-gen uses port number 0, which is filtered by pf):

  pkt-gen -U -i igb3 -f tx -n 80000000 -l 60 -d 198.19.10.1:2000-198.19.10.20 -D 00:0d:b9:3c:dd:3e -s 198.18.10.1:2000-198.18.10.100 -w 4

The receiver will use this command:

  pkt-gen -i igb2 -f rx -w 4

===== Basic configuration =====

==== Disabling Ethernet flow-control ====

The re(4) driver does not seem to support flow-control, and the switch confirms this behavior:

  switch#sh int Gi1/0/16 flowcontrol
  Port       Send FlowControl  Receive FlowControl  RxPause TxPause
             admin    oper     admin    oper
  ---------  -------- -------- -------- --------    ------- -------
  Gi1/0/16   Unsupp.  Unsupp.  off      off         0       0
  switch#sh int Gi1/0/17 flowcontrol
  Port       Send FlowControl  Receive FlowControl  RxPause TxPause
             admin    oper     admin    oper
  ---------  -------- -------- -------- --------    ------- -------
  Gi1/0/17   Unsupp.  Unsupp.  off      off         0       0

==== Static routes and ARP entries ====

Configure the IP addresses, static routes and static ARP/NDP entries. A router [[Documentation:Technical docs:Performance|should not use LRO and TSO]]. BSDRP disables them by default with an RC script (disablelrotso_enable="YES" in /etc/rc.conf.misc), but the re(4) driver does not support these features anyway.
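On NICs whose drivers do expose these offloads, they can be checked and disabled by hand. A small sketch with standard ifconfig(8) options (the capability list printed depends on the driver; re(4) simply does not advertise TSO/LRO):

  # Show the capabilities supported by the driver; TSO4/TSO6/LRO will be
  # missing from the list on re(4):
  ifconfig -m re1
  # On drivers that do support these offloads, turn them off manually:
  ifconfig re1 -tso4 -tso6 -lro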
/etc/rc.conf:

  # IPv4 router
  gateway_enable="YES"
  ifconfig_re1="inet 198.18.0.207/24"
  ifconfig_re2="inet 198.19.0.207/24"
  static_routes="generator receiver"
  route_generator="-net 198.18.0.0/16 198.18.0.203"
  route_receiver="-net 198.19.0.0/16 198.19.0.203"
  static_arp_pairs="receiver generator"
  static_arp_generator="198.18.0.203 00:1b:21:c4:95:7a"
  static_arp_receiver="198.19.0.203 00:1b:21:c4:95:7b"
  # IPv6 router
  ipv6_gateway_enable="YES"
  ipv6_activate_all_interfaces="YES"
  ipv6_static_routes="generator receiver"
  ipv6_route_generator="2001:2:: -prefixlen 49 2001:2::203"
  ipv6_route_receiver="2001:2:0:8000:: -prefixlen 49 2001:2:0:8000::203"
  ifconfig_re1_ipv6="inet6 2001:2::207 prefixlen 64"
  ifconfig_re2_ipv6="inet6 2001:2:0:8000::207 prefixlen 64"
  static_ndp_pairs="receiver generator"
  static_ndp_generator="2001:2::203 00:1b:21:c4:95:7a"
  static_ndp_receiver="2001:2:0:8000::203 00:1b:21:c4:95:7b"

===== Default forwarding rate =====

We start with a first test: one packet generator sending at gigabit line-rate (1.488 Mpps). We found that:
  * the APU stays responsive during this test (thanks to the dual core);
  * about 154 Kpps are accepted by the re(4) Ethernet interface.

  [root@BSDRP]~# netstat -iw 1
              input        (Total)           output
     packets  errs idrops      bytes    packets  errs      bytes colls
      154273     0     0    9256386     154241     0    9256550     0
      154081     0     0    9244866     154081     0    9244982     0
      154113     0     0    9246786     154113     0    9246902     0
      154151     0     0    9249066     154177     0    9249182     0
      154139     0     0    9248346     154113     0    9248462     0
      154113     0     0    9246786     154113     0    9246902     0
      154145     0     0    9248706     154145     0    9248822     0
      154193     0     0    9251586     154209     0    9252662     0
      154135     0     0    9248106     154145     0    9247322     0
      154139     0     0    9248346     154113     0    9248402     0
      154151     0     0    9249066     154177     0    9249242     0
      154145     0     0    9248706     154145     0    9248822     0
      154147     0     0    9248826     154145     0    9248882     0
      154169     0     0    9250146     154177     0    9250262     0
      154145     0     0    9248706     154113     0    9248822     0

The forwarding rate is not very high, but Realtek NICs are not very fast, they do not support multiple queues, and this is only a 1 GHz CPU.

We also notice that the input error counters of re(4) are not updated: a re(4) driver bug? We can force the driver to dump its statistics with this command:

  [root@BSDRP]~# sysctl dev.re.1.stats=1
  dev.re.1.stats: -1 -> -1
  [root@BSDRP]~# dmesg
  (etc...)
  re1 statistics:
          Tx frames : 6
          Rx frames : 16394206
          Tx errors : 0
          Rx errors : 0
          Rx missed frames : 16421
          Rx frame alignment errs : 0
          Tx single collisions : 0
          Tx multiple collisions : 0
          Rx unicast frames : 16394204
          Rx broadcast frames : 2
          Rx multicast frames : 0
          Tx aborts : 0
          Tx underruns : 0

But even here the Rx missed frames counter is not accurate.

About the system load during this test:

  [root@BSDRP]/# top -nCHSIzs1
  last pid:  4067;  load averages:  0.49,  0.16,  0.04  up 0+01:04:10  18:32:04
  86 processes:  3 running, 67 sleeping, 16 waiting

  Mem: 6312K Active, 19M Inact, 75M Wired, 12M Buf, 1849M Free
  Swap:

    PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME   CPU COMMAND
     11 root     -92    -     0K   256K WAIT    0   0:25 68.26% intr{irq260: re1}
     11 root     -92    -     0K   256K WAIT    0   0:03  8.25% intr{irq261: re2}

===== Firewalls impact =====

This test generates 2000 different flows by using 2000 different UDP destination ports. The pf and ipfw configurations used are detailed in the previous [[documentation:examples:forwarding_performance_lab_of_an_ibm_system_x3550_m3_with_intel_82580#firewall_impact|Forwarding performance lab of an IBM System x3550 M3 with Intel 82580]].

==== Graph ====

Scale information for Gigabit Ethernet (see the arithmetic sketch after this list):
  * 1.488 Mpps is the maximum packet-per-second (pps) rate, reached with the smallest packets (46-byte payload, 64-byte frame);
  * 81 Kpps is the minimum pps rate, reached with the biggest packets (1500-byte payload).
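These two bounds come from the 20 bytes of per-frame overhead (8-byte preamble + 12-byte inter-frame gap) that Ethernet adds on the wire. A quick arithmetic sketch in shell, using the standard frame sizes:

  # Gigabit line rate is 10^9 bit/s; on the wire each frame also carries
  # an 8-byte preamble and a 12-byte inter-frame gap (20 bytes overhead).
  echo $((1000000000 / ((64 + 20) * 8)))    # smallest frame (64 B):  1488095 pps
  echo $((1000000000 / ((1518 + 20) * 8)))  # biggest frame (1518 B): 81274 pps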
{{bench.forwarding.and.firewalling.rate.on.pc.engines.apu.png|Forwarding and firewalling rate of a PC Engines APU running FreeBSD 10.3}}

==== Ministat ====

All benchmarks were run 5 times, with a reboot between each run.

  x forwarding
  + ipfw-stateful
  * pf-stateful
  +--------------------------------------------------------------------------+
  |*                                                                         |
  |*                               +                                        x|
  |*                               +                                        x|
  |*                               +                                        x|
  |*                              ++                                        x|
  |                                                                         A|
  |                               A|                                         |
  |A                                                                         |
  +--------------------------------------------------------------------------+
      N           Min           Max        Median           Avg        Stddev
  x   5        154144        154200        154167      154171.6     20.671236
  +   5        113357        114637        114173      114152.6     486.93151
  Difference at 95.0% confidence
          -40019 +/- 502.612
          -25.9574% +/- 0.326008%
          (Student's t, pooled s = 344.623)
  *   5         88037         88385         88108         88169     144.98793
  Difference at 95.0% confidence
          -66002.6 +/- 151.034
          -42.8111% +/- 0.0979651%
          (Student's t, pooled s = 103.559)

===== Netmap's pkt-gen performance =====

re(4) has [[http://info.iet.unipi.it/~luigi/netmap/|netmap]] support... what rate can be reached with netmap's packet generator/receiver?

As a receiver (while the sender is emitting at 1.48 Mpps):

  [root@APU]~# pkt-gen -i re1 -f rx -w 4 -c 2
  854.089137 main [1641] interface is re1
  854.089501 extract_ip_range [275] range is 10.0.0.1:0 to 10.0.0.1:0
  854.089523 extract_ip_range [275] range is 10.1.0.1:0 to 10.1.0.1:0
  854.111967 main [1824] mapped 334980KB at 0x801dff000
  Receiving from netmap:re1: 1 queues, 1 threads and 2 cpus.
  854.112634 main [1904] Wait 4 secs for phy reset
  858.123495 main [1906] Ready...
  858.123756 nm_open [457] overriding ifname re1 ringid 0x0 flags 0x1
  859.124355 receiver_body [1189] waiting for initial packets, poll returns 0 0
  (etc...)
  862.129332 main_thread [1438] 579292 pps (580438 pkts in 1001978 usec)
  863.131433 main_thread [1438] 579115 pps (580332 pkts in 1002101 usec)
  894.184371 main_thread [1438] 577549 pps (578725 pkts in 1002036 usec)
  895.185330 main_thread [1438] 577483 pps (578037 pkts in 1000959 usec)
  896.191334 main_thread [1438] 580069 pps (583552 pkts in 1006004 usec)
  897.193330 main_thread [1438] 578174 pps (579328 pkts in 1001996 usec)
  898.195328 main_thread [1438] 581974 pps (583137 pkts in 1001998 usec)
  899.196916 main_thread [1438] 579600 pps (580520 pkts in 1001588 usec)
  900.198344 main_thread [1438] 578366 pps (579191 pkts in 1001427 usec)
  901.200327 main_thread [1438] 579327 pps (580476 pkts in 1001984 usec)
  902.202328 main_thread [1438] 581601 pps (582765 pkts in 1002001 usec)
  903.204329 main_thread [1438] 577499 pps (578655 pkts in 1002001 usec)

Netmap improves the receiving packet rate, but only to about 580 Kpps: it is strange that the APU does not reach the maximum Ethernet frame rate (1.48 Mpps) even with netmap.

As a packet generator:

  [root@APU]~# pkt-gen -i re1 -f tx -w 4 -c 2 -n 80000000 -l 60 -d 2.1.3.1-2.1.3.20 -D 00:1b:21:d4:3f:2a -s 1.1.3.3-1.1.3.100 -c 2
  759.415059 main [1641] interface is re1
  759.415387 extract_ip_range [275] range is 1.1.3.3:0 to 1.1.3.100:0
  759.415409 extract_ip_range [275] range is 2.1.3.1:0 to 2.1.3.20:0
  759.922110 main [1824] mapped 334980KB at 0x801dff000
  Sending on netmap:re1: 1 queues, 1 threads and 2 cpus.
  1.1.3.3 -> 2.1.3.1 (00:00:00:00:00:00 -> 00:1b:21:d4:3f:2a)
  759.922737 main [1880] --- SPECIAL OPTIONS: copy
  759.922750 main [1902] Sending 512 packets every  0.000000000 s
  759.922763 main [1904] Wait 4 secs for phy reset
  763.923715 main [1906] Ready...
  763.924310 nm_open [457] overriding ifname re1 ringid 0x0 flags 0x1
  763.924929 sender_body [1016] start
  764.926557 main_thread [1438] 407993 pps (408672 pkts in 1001665 usec)
  765.928548 main_thread [1438] 408091 pps (408904 pkts in 1001991 usec)
  766.929550 main_thread [1438] 407939 pps (408348 pkts in 1001002 usec)
  767.931548 main_thread [1438] 407808 pps (408623 pkts in 1001998 usec)
  768.933359 main_thread [1438] 407880 pps (408619 pkts in 1001811 usec)
  769.934548 main_thread [1438] 408138 pps (408623 pkts in 1001189 usec)
  770.936548 main_thread [1438] 407825 pps (408641 pkts in 1002000 usec)
  (etc...)
  792.976553 main_thread [1438] 407872 pps (408690 pkts in 1002005 usec)
  793.978549 main_thread [1438] 408184 pps (408999 pkts in 1001996 usec)
  794.980547 main_thread [1438] 408201 pps (409017 pkts in 1001998 usec)
  795.982552 main_thread [1438] 407892 pps (408710 pkts in 1002005 usec)
  796.984546 main_thread [1438] 407984 pps (408798 pkts in 1001994 usec)
  797.986546 main_thread [1438] 408069 pps (408885 pkts in 1002000 usec)
  798.988442 main_thread [1438] 408080 pps (408854 pkts in 1001896 usec)
  799.989548 main_thread [1438] 407815 pps (408266 pkts in 1001106 usec)
  ^C800.990686 main_thread [1438] 183685 pps (183894 pkts in 1001137 usec)
  Sent 14897486 packets, 60 bytes each, in 36.52 seconds.
  Speed: 407.98 Kpps Bandwidth: 195.83 Mbps (raw 274.16 Mbps)

As a generator, the APU is still unable to reach the maximum Ethernet frame rate with netmap (about 408 Kpps only). A Realtek chipset limitation?
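For anyone reproducing these numbers, here is a minimal sketch of how the per-run forwarding rates can be sampled on the Device under Test and compared with ministat (the result file names are hypothetical; one file per configuration, one value per line):

  # Sample the forwarded packet rate over 10 one-second intervals and
  # append the average to the result file of the current configuration
  # (column 5 of "netstat -w 1" output is output packets per interval):
  netstat -w 1 -q 10 | awk 'NR > 2 { sum += $5; n++ } END { print int(sum / n) }' >> forwarding.txt
  # Once the 5 runs of each configuration are collected, compare them:
  ministat forwarding.txt ipfw-stateful.txt pf-stateful.txt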