There are lots of guides about tuning FreeBSD TCP performance (where the FreeBSD host is an end-point of the TCP session), but that is not the same as tuning forwarding performance (where the FreeBSD host doesn't have to read the TCP information of the packets being forwarded).
Benchmarking a router is not measuring the maximum bandwidth crossing the router:
A clear definition of some relationships is mandatory:
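One such relationship, between link bandwidth and frame rate, can be sketched as follows. This is only an illustration and assumes standard Ethernet framing, where each frame occupies an extra 8 bytes of preamble and 12 bytes of inter-frame gap on the wire. The function name `max_frame_rate` is hypothetical.

```python
def max_frame_rate(link_bps, frame_bytes):
    """Theoretical maximum Ethernet frame rate, in frames per second.

    frame_bytes is the frame size including the 4-byte FCS
    (64 for minimum-size frames).
    """
    # Assumption: standard Ethernet, 8 bytes preamble/SFD + 12 bytes inter-frame gap.
    overhead_bytes = 8 + 12
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_on_wire


for size in (64, 512, 1518):
    rate = max_frame_rate(1e9, size)
    print(f"{size:>5} B frames on 1 Gb/s: {rate:,.0f} pps")
```

With 64-byte frames a 1 Gb/s link carries at most about 1.488 Mpps, while with 1518-byte frames the same link is full at roughly 81 Kpps. This is why a router's capacity is stated in pps rather than in bits per second.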
Here are some benchmarks of FreeBSD network forwarding performance:
The bench lab should make it possible to measure the packet rate in packets per second (pps). To obtain accurate results, RFC 2544 (Benchmarking Methodology for Network Interconnect Devices) is a good reference.
A packet generator
Here is a list of sources about optimizing forwarding performance under FreeBSD.
How to bench or tune the network stack:
FreeBSD Experimental high-performance network stacks:
Tools:
Display packet traffic information, refreshed every second.
Here is a first example:
[root@BSDRP3]~# netstat -i -h -w 1
input (Total) output
packets errs idrops bytes packets errs bytes colls
370k 0 0 38M 370k 0 38M 0
369k 0 0 38M 368k 0 38M 0
370k 0 0 38M 370k 0 38M 0
373k 0 0 38M 376k 0 38M 0
370k 0 0 38M 368k 0 38M 0
368k 0 0 38M 368k 0 38M 0
368k 0 0 38M 369k 0 38M 0
⇒ This system is forwarding 370Kpps (in and out) without any input or output errors (the packet generator was netblast, sending 64B packets at 370Kpps).
Here is a second example:
[root@BSDRP3]~# netstat -ihw 1
input (Total) output
packets errs idrops bytes packets errs bytes colls
399k 915k 0 25M 395k 0 24M 0
398k 914k 0 24M 398k 0 24M 0
399k 915k 0 25M 399k 0 25M 0
398k 915k 0 24M 397k 0 24M 0
399k 914k 0 25M 398k 0 24M 0
398k 914k 0 24M 400k 0 25M 0
398k 915k 0 24M 396k 0 24M 0
400k 915k 0 25M 401k 0 25M 0
397k 914k 0 24M 397k 0 24M 0
398k 914k 0 24M 399k 0 25M 0
400k 914k 0 25M 401k 0 25M 0
398k 914k 0 24M 397k 0 24M 0
⇒ This system is forwarding about 400Kpps (in and out), but it's overloaded because it drops (errs) about 914Kpps (the generator used netmap pkt-gen with 64B packet size at a rate of 1.34Mpps).
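The reading of the second example can be sketched in a few lines: parse one line of `netstat -ihw 1` output and compute the share of input packets that were dropped. This is a minimal sketch under two assumptions that are not confirmed by the source: the suffixes printed by `-h` are treated as powers of 1000, and packets counted in `errs` are assumed not to be included in the `packets` column. The names `parse_sample` and `input_loss_ratio` are hypothetical.

```python
# Assumption: suffixes printed by `netstat -h` are decimal multipliers.
SUFFIXES = {"k": 1e3, "K": 1e3, "M": 1e6, "G": 1e9}
FIELDS = ("in_packets", "in_errs", "in_idrops", "in_bytes",
          "out_packets", "out_errs", "out_bytes", "colls")


def parse_human(value):
    """Convert a counter such as '399k' or '25M' to a float."""
    if value and value[-1] in SUFFIXES:
        return float(value[:-1]) * SUFFIXES[value[-1]]
    return float(value)


def parse_sample(line):
    """Return a dict for a data line, or None for header lines."""
    tokens = line.split()
    if len(tokens) != len(FIELDS):
        return None
    try:
        return dict(zip(FIELDS, (parse_human(t) for t in tokens)))
    except ValueError:
        return None  # column-title line, not numeric


def input_loss_ratio(sample):
    """Share of offered input packets counted as errors."""
    offered = sample["in_packets"] + sample["in_errs"]
    return sample["in_errs"] / offered if offered else 0.0


sample = parse_sample("399k 915k 0 25M 395k 0 24M 0")
print(f"input loss: {input_loss_ratio(sample):.1%}")
```

On the first data line of the second example this gives a loss of roughly 70%, consistent with a generator sending about 1.34Mpps to a system that forwards about 400Kpps.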
Report on the number of interrupts taken by each device since system startup.
Here is a first example:
[root@BSDRP3]~# vmstat -i
interrupt                          total       rate
irq4: uart0                         6670          5
irq14: ata0                            5          0
irq16: bge0                           27          0
irq17: em0 bge1                  5209668       4510
cpu0:timer                       1299291       1124
irq256: ahci0                       1172          1
Total                            6516833       5642
⇒ Notice that em0 and bge1 are sharing the same IRQ. That is not good news.
Here is a second example:
[root@BSDRP3]# vmstat -i
interrupt                          total       rate
irq4: uart0                        17869          0
irq14: ata0                            5          0
irq16: bge0                            1          0
irq17: em0 bge1                        2          0
cpu0:timer                     214331752       1125
irq256: ahci0                       1725          0
Total                          214351354       1126
⇒ Almost-zero rates and counters for the NIC IRQs mean that polling is enabled. Note that the interrupt management of current NICs avoids the need for polling.
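Spotting shared IRQs by eye gets tedious on machines with many devices. The following is a rough sketch that scans `vmstat -i` output for IRQ lines listing more than one device. The sample text is the first example above; the function name `shared_irqs` and the parsing rules are assumptions made for illustration, and the output format may differ between FreeBSD releases.

```python
import re

# One "irqN: <devices> <total> <rate>" line of `vmstat -i` output.
IRQ_LINE = re.compile(r"^irq(\d+):\s+(.*?)\s+(\d+)\s+(\d+)$")

SAMPLE = """\
interrupt total rate
irq4: uart0 6670 5
irq14: ata0 5 0
irq16: bge0 27 0
irq17: em0 bge1 5209668 4510
cpu0:timer 1299291 1124
irq256: ahci0 1172 1
Total 6516833 5642
"""


def shared_irqs(vmstat_output):
    """Map IRQ number -> device list for IRQs used by several devices."""
    shared = {}
    for line in vmstat_output.splitlines():
        match = IRQ_LINE.match(line.strip())
        if not match:
            continue
        # Assumption: device names start with a letter; purely numeric
        # tokens (such as a queue index) are not counted as devices.
        devices = [t for t in match.group(2).split() if t[0].isalpha()]
        if len(devices) > 1:
            shared[int(match.group(1))] = devices
    return shared


for irq, devices in shared_irqs(SAMPLE).items():
    print(f"irq{irq} is shared by: {', '.join(devices)}")
```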
Show statistics recorded by the memory management routines. The network manages a private pool of memory buffers.
[root@BSDRP3]~# netstat -m
5220/810/6030 mbufs in use (current/cache/total)
5219/675/5894/512000 mbuf clusters in use (current/cache/total/max)
5219/669 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0/256000 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/128000 9k jumbo clusters in use (current/cache/total/max)
0/0/0/64000 16k jumbo clusters in use (current/cache/total/max)
11743K/1552K/13295K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines
Or more verbose:
[root@BSDRP3]~# vmstat -z | head -1 ; vmstat -z | grep -i mbuf
ITEM                   SIZE   LIMIT     USED     FREE      REQ FAIL SLEEP
mbuf_packet:            256,      0,    5221,     667,414103198,   0,   0
mbuf:                   256,      0,       1,     141,     135,   0,   0
mbuf_cluster:          2048, 512000,    5888,       6,    5888,   0,   0
mbuf_jumbo_page:       4096, 256000,       0,       0,       0,   0,   0
mbuf_jumbo_9k:         9216, 128000,       0,       0,       0,   0,   0
mbuf_jumbo_16k:       16384,  64000,       0,       0,       0,   0,   0
mbuf_ext_refcnt:          4,      0,       0,       0,       0,   0,   0
⇒ No failed requests here: the FAIL column is 0 for every mbuf zone.
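The same check can be automated. The sketch below parses the mbuf lines of `vmstat -z` and lists the zones with a non-zero FAIL counter. It assumes the seven-column layout shown above (SIZE, LIMIT, USED, FREE, REQ, FAIL, SLEEP); other FreeBSD releases may print a different set of columns, in which case lines are simply skipped. The names `parse_zones` and `failed_zones` are hypothetical.

```python
# Assumption: seven comma-separated columns after the zone name.
COLUMNS = ("size", "limit", "used", "free", "req", "fail", "sleep")

SAMPLE = """\
ITEM SIZE LIMIT USED FREE REQ FAIL SLEEP
mbuf_packet: 256, 0, 5221, 667,414103198, 0, 0
mbuf: 256, 0, 1, 141, 135, 0, 0
mbuf_cluster: 2048, 512000, 5888, 6, 5888, 0, 0
mbuf_jumbo_page: 4096, 256000, 0, 0, 0, 0, 0
"""


def parse_zones(text):
    """Return {zone name: {column: value}} for well-formed lines."""
    zones = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # header line, no zone name
        parts = rest.split(",")
        if len(parts) != len(COLUMNS):
            continue  # unexpected layout, skip rather than guess
        try:
            values = [int(p) for p in parts]
        except ValueError:
            continue
        zones[name.strip()] = dict(zip(COLUMNS, values))
    return zones


def failed_zones(zones):
    """Names of zones that recorded at least one failed allocation."""
    return [name for name, z in zones.items() if z["fail"] > 0]


zones = parse_zones(SAMPLE)
print("zones with failures:", failed_zones(zones) or "none")
```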
top can give very useful information regarding the CPU/NIC affinity:
[root@BSDRP3]~# top -nCHSIzs1
last pid: 1392; load averages: 0.15, 0.48, 0.33 up 0+00:22:06 15:44:26
75 processes: 2 running, 57 sleeping, 16 waiting
Mem: 10M Active, 8752K Inact, 79M Wired, 272K Cache, 17M Buf, 878M Free
Swap:
PID USERNAME PRI NICE SIZE RES STATE TIME CPU COMMAND
0 root -92 0 0K 176K - 16:24 96.39% kernel{em0 taskq}
11 root -92 - 0K 256K WAIT 1:01 5.76% intr{irq17: em0 bge1}
⇒ Not a very interesting output: there is only one CPU here.
Here is another example on a 2-core computer:
[root@BSDRP2]~# top -nCHSIzs1 | awk '$5 ~ /(K|SIZE)/ { printf "%7s %2s %6s %10s %15s %s\n", $7, $8, $9, $10, $11, $12}'
STATE C TIME CPU COMMAND
CPU0 0 7:23 99.76% kernel{em0 rxq}
RUN 1 0:44 6.40% intr{irq260: bce1}
istorm 1 4:18 4.05% intr{irq256: em0:rx
RUN 1 0:04 0.68% intr{irq258: em0:link}
⇒ em0 is under an interrupt storm, and its rxq thread consumes almost 100% of one CPU (CPU 0 in the C column).
Depending on the NIC driver used, some counters are available:
[root@BSDRP3]~# sysctl dev.em.0.mac_stats. | grep -v ': 0'
dev.em.0.mac_stats.missed_packets: 221189883
dev.em.0.mac_stats.recv_no_buff: 94987654
dev.em.0.mac_stats.total_pkts_recvd: 351270928
dev.em.0.mac_stats.good_pkts_recvd: 130081045
dev.em.0.mac_stats.bcast_pkts_recvd: 1
dev.em.0.mac_stats.rx_frames_64: 2
dev.em.0.mac_stats.rx_frames_65_127: 130081043
dev.em.0.mac_stats.good_octets_recvd: 14308901524
dev.em.0.mac_stats.good_octets_txd: 892
dev.em.0.mac_stats.total_pkts_txd: 10
dev.em.0.mac_stats.good_pkts_txd: 10
dev.em.0.mac_stats.bcast_pkts_txd: 2
dev.em.0.mac_stats.mcast_pkts_txd: 5
dev.em.0.mac_stats.tx_frames_64: 2
dev.em.0.mac_stats.tx_frames_65_127: 8
⇒ Notice the high level of missed_packets and recv_no_buff. This indicates a performance problem with the NIC or its driver (in this example, the packet generator sends packets at a rate of about 1.38Mpps).
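To put a number on "high level", the counters can be turned into a ratio. In the sample above, good_pkts_recvd plus missed_packets equals total_pkts_recvd, so the sketch below uses missed_packets / total_pkts_recvd as the share of packets the NIC failed to receive. Whether that identity holds for other drivers or counters is an assumption, and the function names are hypothetical.

```python
SAMPLE = """\
dev.em.0.mac_stats.missed_packets: 221189883
dev.em.0.mac_stats.recv_no_buff: 94987654
dev.em.0.mac_stats.total_pkts_recvd: 351270928
dev.em.0.mac_stats.good_pkts_recvd: 130081045
"""


def parse_sysctl(text):
    """Return {counter short name: integer value} from sysctl output."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        value = value.strip()
        if sep and value.isdigit():
            # Keep only the last component, e.g. 'missed_packets'.
            stats[name.rsplit(".", 1)[-1]] = int(value)
    return stats


def missed_ratio(stats):
    """Share of received packets counted as missed by the NIC."""
    total = stats.get("total_pkts_recvd", 0)
    return stats.get("missed_packets", 0) / total if total else 0.0


stats = parse_sysctl(SAMPLE)
print(f"missed: {missed_ratio(stats):.1%} of packets seen by the NIC")
```

For the counters shown above the ratio is about 63%: well over half of the packets arriving at em0 never reach the network stack.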