Network Performance tuning on Low-End Hardware

Setting up the packet generator

Diagram

PC 1 and PC 2 are both old HP Compaq T5000 thin clients with BSDRP installed:

  • CPU: Transmeta™ Crusoe™ Processor TM5700 (798.13-MHz 586-class CPU)
  • RAM: 256M
  • Drive: 245MB <256MB ATA Flash Disk ADBA217H>
  • Ethernet: Rhine II PCI Fast Ethernet Controller (VT6103) … 100Mbit/s

PC 1 (10.0.1.1) will be used as the data sender, and PC 2 (10.0.1.2) as the data receiver.

Choosing the right tools

BSDRP includes 3 network benchmark tools:

  • Iperf is a well-known multi-OS bandwidth benchmark software
  • netblast/netsend/netreceive are included in FreeBSD (but not installed by default)
  • NetPIPE supports only TCP benchmarks: it will not be used here.

All tests will be done in UDP: The purpose is to have a simple packet generator.

Iperf

IPv4 tests

First we will start Iperf, IPv4 mode on PC 2 (receiver):

[root@PC2]~# iperf -s -u
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size: 41.1 KByte (default)
------------------------------------------------------------

Then, from PC 1, we will start sending data to PC 2:

  • -u : UDP test
  • -f m : Display value in Mbits/sec
  • -b 200M: Try to send at 200Mbit/s (over a FastEthernet link, it should be enough)
  • -t 60: Run for 60 seconds

We need to run it 3 times minimum…

[root@PC1]~# iperf -u -c 10.0.1.2 -f m -b 200M -t 60
------------------------------------------------------------
Client connecting to 10.0.1.2, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 0.01 MByte (default)
------------------------------------------------------------
[  3] local 10.0.1.1 port 62471 connected with 10.0.1.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-60.0 sec   505 MBytes  70.5 Mbits/sec
[  3] Sent 359938 datagrams
[  3] Server Report:
[  3]  0.0-60.0 sec   505 MBytes  70.5 Mbits/sec   0.197 ms   27/359937 (0.0075%)
[  3]  0.0-60.0 sec  1 datagrams received out-of-order

[root@PC1]~# iperf -u -c 10.0.1.2 -f m -b 200M -t 60
------------------------------------------------------------
Client connecting to 10.0.1.2, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 0.01 MByte (default)
------------------------------------------------------------
[  3] local 10.0.1.1 port 20006 connected with 10.0.1.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-60.0 sec   505 MBytes  70.6 Mbits/sec
[  3] Sent 360103 datagrams
[  3] Server Report:
[  3]  0.0-60.0 sec   503 MBytes  70.4 Mbits/sec   0.225 ms  967/360102 (0.27%)
[  3]  0.0-60.0 sec  1 datagrams received out-of-order

[root@PC1]~# iperf -u -c 10.0.1.2 -f m -b 200M -t 60
------------------------------------------------------------
Client connecting to 10.0.1.2, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 0.01 MByte (default)
------------------------------------------------------------
[  3] local 10.0.1.1 port 48945 connected with 10.0.1.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-60.0 sec   505 MBytes  70.6 Mbits/sec
[  3] Sent 360306 datagrams
[  3] Server Report:
[  3]  0.0-60.0 sec   505 MBytes  70.6 Mbits/sec   0.229 ms   24/360305 (0.0067%)
[  3]  0.0-60.0 sec  1 datagrams received out-of-order

We can see that Iperf can generate about 70 Mbit/s of IPv4 traffic using full-size datagrams (1470 bytes).

During these Iperf IPv4 tests, the system load was:

on PC1:

15 processes:  2 running, 12 slee 0.25,  0.51,  0.46    up 0+00:59:00  12:59:20
CPU:  3.5% user,  0.0% nice, 23.0% system, 68.5% interrupt,  5.1% idle
Mem:  4.3K Active, 5740K Inac17.96M Wired, 72.4 Cache, 18M B 5.4196M Free

On PC2:

15 processes:  4 running, 11 sleeping
CPU:  0.4% user,  0.0% nice,  7.7% system, 80.7% interrupt, 11.2% idle
Mem: 8552K Active, 5852K Inact, 16M Wired, 764K Cache, 19M Buf, 196M Free

IPv6 tests

First we will start Iperf, IPv6 mode on PC 2 (receiver):

[root@PC2]~# iperf -s -u -V
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size: 41.1 KByte (default)
------------------------------------------------------------

Then, from PC 1, we will start sending data to PC 2:

  • -V : IPv6
  • -u : UDP test
  • -f m : Display value in Mbits/sec
  • -b 200M: Try to send at 200Mbit/s (over a FastEthernet link, it should be enough)
  • -t 60: Run for 60 seconds

We need to run it 3 times minimum…

[root@PC1]~# iperf -V -u -c 2001:db8::2 -f m -b 200M -t 60
------------------------------------------------------------
Client connecting to 2001:db8::2, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 0.01 MByte (default)
------------------------------------------------------------
[  3] local 2001:db8::1 port 26388 connected with 2001:db8::2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-60.0 sec   372 MBytes  52.0 Mbits/sec
[  3] Sent 265193 datagrams
[  3] Server Report:
[  3]  0.0-60.0 sec   372 MBytes  52.0 Mbits/sec   0.340 ms    0/265192 (0%)
[  3]  0.0-60.0 sec  1 datagrams received out-of-order

[root@PC1]~# iperf -V -u -c 2001:db8::2 -f m -b 200M -t 60
------------------------------------------------------------
Client connecting to 2001:db8::2, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 0.01 MByte (default)
------------------------------------------------------------
[  3] local 2001:db8::1 port 33815 connected with 2001:db8::2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-60.0 sec   371 MBytes  51.9 Mbits/sec
[  3] Sent 264895 datagrams
[  3] Server Report:
[  3]  0.0-60.0 sec   371 MBytes  51.9 Mbits/sec   0.337 ms    0/264894 (0%)
[  3]  0.0-60.0 sec  1 datagrams received out-of-order

[root@PC1]~# iperf -V -u -c 2001:db8::2 -f m -b 500M -t 60
------------------------------------------------------------
Client connecting to 2001:db8::2, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 0.01 MByte (default)
------------------------------------------------------------
[  3] local 2001:db8::1 port 46141 connected with 2001:db8::2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-60.0 sec   371 MBytes  51.8 Mbits/sec
[  3] Sent 264493 datagrams
[  3] Server Report:
[  3]  0.0-60.0 sec   370 MBytes  51.7 Mbits/sec   0.332 ms  532/264492 (0.2%)
[  3]  0.0-60.0 sec  1 datagrams received out-of-order

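Iperf has a bug: it does not take the larger IPv6 header (40 bytes instead of 20 for IPv4) into account and still sends 1470-byte datagrams. Together with the 8-byte UDP header, such a datagram needs 1518 bytes and does not fit in the 1500-byte Ethernet MTU, so it is fragmented. A payload of 1430 bytes, as used for the IPv6 netblast tests later on this page, avoids the fragmentation.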

We can see that Iperf can generate about 52Mbit/s of fragmented IPv6 traffic using a buggy packet size (1470).
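
The fragmentation can be checked with a little arithmetic. The following is only a sketch: it assumes a 1500-byte Ethernet MTU, an IPv4 header without options (20 bytes), an IPv6 header without extension headers (40 bytes) and an 8-byte UDP header. The function names are illustrative and are not part of Iperf or BSDRP.

```python
# Assumed sizes, in bytes (no IP options, no IPv6 extension headers).
ETHERNET_MTU = 1500
IP_HEADER = {"ipv4": 20, "ipv6": 40}
UDP_HEADER = 8


def max_udp_payload(family, mtu=ETHERNET_MTU):
    """Largest UDP payload that fits in one packet without fragmentation."""
    return mtu - IP_HEADER[family] - UDP_HEADER


def is_fragmented(payload, family, mtu=ETHERNET_MTU):
    """True when a datagram with this payload has to be split."""
    return payload > max_udp_payload(family, mtu)


for family, payload in (("ipv4", 1470), ("ipv6", 1470), ("ipv6", 1430)):
    state = "fragmented" if is_fragmented(payload, family) else "fits"
    print(f"{family} payload {payload}: {state}")
```

Under these assumptions the 1470-byte payload fits over IPv4 but not over IPv6, while 1430 bytes fits in both cases.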

During these buggy Iperf IPv6 tests, the system load was:

on PC1:

14 processes:  2 running, 12 slee 0.45,  0.62,  0.47    up 0+00:55:09  12:55:29
CPU:  3.5% user,  0.0% nice, 39.3% system, 57.2% interrupt,  0.0% idle
Mem:  0.8K Active, 5744K Inac44.06M Wired, 55.3 Cache, 18M Buf, 196M Free

On PC2:

last pid:  1188;  load averages:  0.37,  0.41,  0.58    up 0+00:56:27  12:33:22
15 processes:  4 running, 11 sleeping
CPU:  3.1% user,  0.0% nice,  8.1% system, 69.0% interrupt, 19.8% idle
Mem: 8548K Active, 5856K Inact, 16M Wired, 764K Cache, 19M Buf, 196M Free

Netblast

Now we will test the FreeBSD Netblast using the same packet size as iperf (we are not measuring forwarding performance here, only bandwidth).

IPv4

First, we will start netreceive on PC 2 (receiver, UDP port 9090) and monitor the network usage with systat:

[root@pc2]~# netreceive 9090 &
[root@pc2]~# systat -ifstat
:scale mbit

Then, on PC 1 (sender), we send data in 3 runs:

[root@PC1]~# netblast 10.0.1.2 9090 1470 60

start:             1325166025.031250619
finish:            1325166085.073804351
send calls:        403121
send errors:       0
send success:      403121
approx send rate:  6718
approx error rate: 0
approx throughput: 81 Mib/s

[root@PC1]~# netblast 10.0.1.2 9090 1470 60

start:             1325166135.210622858
finish:            1325166195.252919574
send calls:        402469
send errors:       0
send success:      402469
approx send rate:  6707
approx error rate: 0
approx throughput: 81 Mib/s

[root@PC1]~# netblast 10.0.1.2 9090 1470 60

start:             1325166224.939911636
finish:            1325166284.982166727
send calls:        402361
send errors:       0
send success:      402361
approx send rate:  6706
approx error rate: 0
approx throughput: 81 Mib/s

PC2 measures the same throughput on the receiving side (see the Peak column):

                   /0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
     Load Average   ||||

      Interface           Traffic               Peak                Total
            vr0  in     38.237 Mb/s         81.254 Mb/s            1.341 GB
                 out     0.000 Mb/s          0.000 Mb/s            1.685 MB

And the packet loss on PC2 is quite low:

[root@PC2]~# netstat -ss
		 
udp:
        1610767 datagrams received
        2 broadcast/multicast datagrams undelivered
        18614 dropped due to full socket buffers
        1592151 delivered
ip:
        1610767 total packets received
        1610767 packets for this host
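
To put a number on "quite low", the drop ratio can be derived from the UDP counters above. This is a minimal sketch; the helper name is hypothetical, and the counters are cumulative, so the result covers all traffic counted so far and not a single run.

```python
def drop_percent(received, dropped):
    """Share of received datagrams dropped because the socket buffer was full."""
    return 100.0 * dropped / received


# Counters copied from the IPv4 `netstat -ss` output above.
received = 1610767
undelivered = 2
dropped = 18614
delivered = 1592151

# Sanity check: every received datagram is delivered, dropped or undelivered.
assert delivered + dropped + undelivered == received

ipv4_drop = drop_percent(received, dropped)
print(f"IPv4 UDP drop ratio: {ipv4_drop:.2f}%")  # about 1.16%
```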

During this IPv4 netblast test, the CPU usage of PC 1 was:

last pid:  1333;  load averages:  0.76,  0.78,  0.46    up 0+01:48:28  13:48:48
8 processes:   2 running, 6 sleeping
CPU:  1.6% user,  0.0% nice, 14.4% system, 84.0% interrupt,  0.0% idle
Mem: 5892K Active, 5552K Inact, 16M Wired, 760K Cache, 19M Buf, 199M Free

and on PC2:

last pid:  1287;  load averages:  0.79,  0.46,  0.25    up 0+01:50:15  13:27:10
8 processes:   2 running, 6 sleeping
CPU:  0.4% user,  0.0% nice,  6.9% system, 89.3% interrupt,  3.4% idle
Mem: 6008K Active, 5856K Inact, 16M Wired, 764K Cache, 19M Buf, 199M Free

We notice very high CPU interrupt usage on both sides: these interrupts are generated by the NIC.

There is a difference of about 10 Mbit/s between IPv4 iperf (70 Mbit/s) and IPv4 netblast (81 Mbit/s)!
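
Part of this gap may come from what each tool counts. The netblast "approx throughput" figures on this page are numerically consistent with counting the Ethernet header (14 bytes), the IP header and the UDP header on top of the payload, in units of 10^6 bit/s. This is an observation drawn from the printed numbers, not a statement about the netblast source code; the function below is a hypothetical reconstruction.

```python
# Assumed header sizes in bytes; the Ethernet CRC is deliberately left out
# because the printed figures only match without it.
ETHER_HEADER = 14
UDP_HEADER = 8
IP_HEADER = {"ipv4": 20, "ipv6": 40}


def approx_throughput_mbit(send_rate_pps, payload, family):
    """Throughput in 10^6 bit/s, headers included, truncated to an integer."""
    frame_bytes = payload + UDP_HEADER + IP_HEADER[family] + ETHER_HEADER
    return send_rate_pps * frame_bytes * 8 // 1_000_000


# First IPv4 run above: 6718 sends per second with a 1470-byte payload.
print(approx_throughput_mbit(6718, 1470, "ipv4"))  # 81
```

Counting only the 1470-byte payload, the same send rate corresponds to about 79 Mbit/s, which is the figure to compare with the payload bandwidth reported by Iperf.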

IPv6

We keep netreceive and systat running on PC 2, as in the IPv4 netblast test.

Then, on PC 1 (sender), we send data in 3 runs:

[root@PC1]/tmp# netblast 2001:db8::2 9090 1430 60

start:             1325175022.973543284
finish:            1325175083.016163506
send calls:        388622
send errors:       0
send success:      388622
approx send rate:  6477
approx error rate: 0
approx throughput: 77 Mib/s

[root@PC1]/tmp# netblast 2001:db8::2 9090 1430 60

start:             1325174850.578943412
finish:            1325174910.621620344
send calls:        389590
send errors:       0
send success:      389590
approx send rate:  6493
approx error rate: 0
approx throughput: 77 Mib/s

[root@PC1]/tmp# netblast 2001:db8::2 9090 1430 60

start:             1325174934.827957183
finish:            1325174994.869647677
send calls:        389118
send errors:       0
send success:      389118
approx send rate:  6485
approx error rate: 0
approx throughput: 77 Mib/s

PC2 measures the same throughput on the receiving side:

                    /0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
     Load Average   |||||||||||||

      Interface           Traffic               Peak                Total
            vr0  in     77.460 Mb/s         77.507 Mb/s            2.969 GB
                 out     0.000 Mb/s          0.000 Mb/s            1.707 MB

And the packet loss on PC2 is quite low:

[root@PC2]~# netstat -ss -f inet6
udp:
        779940 datagrams received
        2 broadcast/multicast datagrams undelivered
        32877 dropped due to full socket buffers
        747061 delivered
ip6:
        779953 total packets received
        779938 packets for this host
        14 packets sent from this host
        Input histogram:
                UDP: 779938
                ICMP6: 15
        Mbuf statistics:
                0 one mbuf
                779953 one ext mbuf
                0 two or more ext mbuf

During this IPv6 netblast test, the CPU usage of PC 1 was:

last pid:  2031;  load averages:  1.95,  0.97,  0.68    up 0+04:26:24  16:26:44
9 processes:   3 running, 6 sleeping
CPU:  2.3% user,  0.0% nice, 15.6% system, 82.1% interrupt,  0.0% idle
Mem: 6588K Active, 6156K Inact, 16M Wired, 304K Cache, 20M Buf, 198M Free

and on PC2:

last pid:  2011;  load averages:  0.58,  0.50,  0.53    up 0+04:27:10  16:04:05
10 processes:  2 running, 8 sleeping
CPU:  1.1% user,  0.0% nice,  8.0% system, 90.4% interrupt,  0.4% idle
Mem: 7092K Active, 6720K Inact, 16M Wired, 308K Cache, 20M Buf, 196M Free

We notice very high CPU interrupt usage on both sides: these interrupts are generated by the NIC.

There is a 4 Mbit/s difference between IPv4 netblast (81 Mbit/s) and IPv6 netblast (77 Mbit/s); even with the same packet size (1430 for both IPv4 and IPv6), a gap of 3 Mbit/s remains.

Avoid using Iperf for FreeBSD UDP throughput benchmarks!

Enabling Polling

During the netblast tests, the CPU spent most of its time handling interrupts. Here is how to reduce this load:

On both PC 1 and PC 2, enable NIC polling:

  1. Edit /etc/rc.conf.misc
  2. Replace the line polling_enable="NO" with polling_enable="YES"
  3. Start polling: /usr/local/etc/rc.d/polling start

Now, run the netblast benchmark again:

From PC 1, re-send data to PC 2:

[root@PC1]~# netblast 10.0.1.2 9090 1470 60

start:             1325178791.914166744
finish:            1325178851.956046648
send calls:        5166494
send errors:       4706483
send success:      460011
approx send rate:  7666
approx error rate: 0
approx throughput: 92 Mib/s

[root@PC1]~# netblast 2001:db8::2 9090 1430 60

start:             1325178962.979784581
finish:            1325179023.021798021
send calls:        2347583
send errors:       1880785
send success:      466798
approx send rate:  7779
approx error rate: 0
approx throughput: 92 Mib/s

With polling enabled, the maximum throughput increased to 92 Mbit/s!

And the new CPU load on PC 1 (sender):

last pid:  1108;  load averages:  0.59,  0.66,  0.42    up 0+00:12:17  17:18:42
7 processes:   2 running, 5 sleeping
CPU:  8.6% user,  0.0% nice, 77.4% system, 14.0% interrupt,  0.0% idle
Mem: 4924K Active, 3244K Inact, 15M Wired, 696K Cache, 16M Buf, 203M Free

At the same time, the incoming bandwidth measured on PC 2 (receiver):

                    /0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
     Load Average   ||||

      Interface           Traffic               Peak                Total
            vr0  in     92.638 Mb/s         92.642 Mb/s          826.272 MB
                 out     0.000 Mb/s          0.000 Mb/s            0.518 KB

And the CPU usage on PC 2 (receiver):

last pid:  1112;  load averages:  0.44,  0.53,  0.33    up 0+00:11:14  16:56:05
7 processes:   1 running, 6 sleeping
CPU:  0.8% user,  0.0% nice, 12.1% system, 19.1% interrupt, 68.1% idle
Mem: 5100K Active, 3748K Inact, 15M Wired, 696K Cache, 17M Buf, 202M Free

UDP stat on PC2:

udp:
        1441998 datagrams received
        1 broadcast/multicast datagram undelivered
        4585 dropped due to full socket buffers
        1437412 delivered
ip:
        1379658 total packets received
        1379656 packets for this host
        2 packets not forwardable

By enabling polling (on this old hardware), we increase the throughput and greatly reduce the CPU interrupt usage :-)

Generating packets

The main goal of a packet generator is not to generate bandwidth usage (by using big packets), but to generate a lot of packets per second (pps). To reach a high pps rate, we need to use a small packet size.

But what is the maximum number of frames per second on a 100 Mb/s Ethernet link?

The document Bandwidth, Packets Per Second, and Other Network Performance Metrics gives the answer in its table 1, “Maximum Frame Rate and Throughput Calculations For a 1-Gb/s Ethernet Link”: the minimum physical frame size is 84 bytes (including the inter-frame gap). We can now calculate the maximum frame rate for a FastEthernet link:

100,000,000 b/s / (84 B * 8 b/B) = 148,809 f/s ≈ 149 Kpps
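
The same calculation as a short sketch. It assumes the usual breakdown of the 84 bytes: 8 bytes of preamble and start-of-frame delimiter, a 64-byte minimum frame and a 12-byte inter-frame gap. The names are illustrative only.

```python
# Assumed on-wire overhead around each Ethernet frame, in bytes.
PREAMBLE_SFD = 8       # preamble (7) + start-of-frame delimiter (1)
MIN_FRAME = 64         # minimum Ethernet frame, CRC included
INTER_FRAME_GAP = 12


def max_frames_per_second(link_bps, frame_bytes=MIN_FRAME):
    """Theoretical maximum frame rate of an Ethernet link."""
    wire_bytes = PREAMBLE_SFD + frame_bytes + INTER_FRAME_GAP
    return link_bps / (wire_bytes * 8)


fast_ethernet = max_frames_per_second(100_000_000)
print(int(fast_ethernet))  # 148809
```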

Without polling

[root@PC1]~# netblast 10.0.1.2 9090 64 60

start:             1325177692.699528012
finish:            1325177752.741911611
send calls:        434693
send errors:       0
send success:      434693
approx send rate:  7244
approx error rate: 0
approx throughput: 6 Mib/s

[root@PC1]~# netblast 2001:db8::2 9090 64 60

start:             1325177922.574319641
finish:            1325177982.615896712
send calls:        412694
send errors:       0
send success:      412694
approx send rate:  6878
approx error rate: 0
approx throughput: 6 Mib/s

Without polling, this device is able to generate about 7 kpps IPv4 and 6.9 kpps IPv6.

With polling

We have a strange result with polling:

[root@PC1]~# netblast 10.0.1.2 9090 64 60

start:             1325179600.042469922
finish:            1325179660.084268251
send calls:        4190292
send errors:       0
send success:      4190292
approx send rate:  69838
approx error rate: 0
approx throughput: 59 Mib/s

[root@PC1]~# netblast 10.0.1.2 9090 64 60

start:             1325179714.014201525
finish:            1325179774.055861009
send calls:        4227953
send errors:       0
send success:      4227953
approx send rate:  70465
approx error rate: 0
approx throughput: 59 Mib/s

[root@PC1]~# netblast 2001:db8::2 9090 64 60

start:             1325180513.563255106
finish:            1325180573.604868216
send calls:        2977488
send errors:       0
send success:      2977488
approx send rate:  49624
approx error rate: 0
approx throughput: 50 Mib/s

[root@PC1]~# netblast 2001:db8::2 9090 64 60

start:             1325180599.589205331
finish:            1325180659.630854200
send calls:        3007401
send errors:       0
send success:      3007401
approx send rate:  50123
approx error rate: 0
approx throughput: 50 Mib/s

Throughput received on PC 2:

                    /0   /1   /2   /3   /4   /5   /6   /7   /8   /9   /10
     Load Average   |||||

      Interface           Traffic               Peak                Total
            vr0  in     44.072 Mb/s         44.072 Mb/s          749.572 MB
                 out     0.000 Mb/s          0.000 Mb/s            1.754 KB

With polling, this device is able to generate about 70 kpps IPv4 and 50 kpps IPv6: close to 10 times better than without polling for IPv4, and about 7 times better for IPv6!
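
These gains can be put next to the theoretical FastEthernet maximum. The sketch below uses the sender-side rates reported by netblast (first run of each test) and the 84-byte minimum on-wire frame size quoted earlier; the function names are hypothetical.

```python
# Theoretical FastEthernet maximum, from the 84-byte minimum on-wire frame.
LINE_RATE_PPS = 100_000_000 / (84 * 8)

# (pps without polling, pps with polling), taken from the netblast runs above.
RESULTS = {"ipv4": (7244, 69838), "ipv6": (6878, 49624)}


def speedup(before_pps, after_pps):
    """How many times faster the second rate is."""
    return after_pps / before_pps


def line_rate_share(pps):
    """Percentage of the theoretical maximum frame rate."""
    return 100.0 * pps / LINE_RATE_PPS


for family, (before, after) in RESULTS.items():
    print(f"{family}: x{speedup(before, after):.1f}, "
          f"{line_rate_share(after):.1f}% of line rate")
```

Even with polling, the sender stays below half of what the link could carry with minimum-size frames.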

For information, the UDP statistics on PC2:

[root@PC2]~# netstat -ss 
udp:
        19145968 datagrams received
        8 broadcast/multicast datagrams undelivered
        2951504 dropped due to full socket buffers
        16194456 delivered
ip:
        10732239 total packets received
        10732231 packets for this host
ip6:
        8413759 total packets received
        8413739 packets for this host
        26 packets sent from this host

Routing lab

Now that we have a packet generator, we can build the full lab using a PC Engines WRAP 1e203 as the router.

Diagram

TO DO

WRAP as packet generator

We will begin by testing the maximum throughput and pps that this device can generate. This is the same test as before, run between the WRAP and PC2.

Without polling

Max throughput:

[root@wrap]/# netblast 10.0.2.2 9090 1470 60

start:             946717690.521191814
finish:            946717750.558459518
send calls:        147019
send errors:       0
send success:      147019
approx send rate:  2450
approx error rate: 0
approx throughput: 29 Mib/s

Max PPS:

[root@wrap]/# netblast 10.0.2.2 9090 64 60

start:             946718035.917633110
finish:            946718095.954543332
send calls:        248378
send errors:       0
send success:      248378
approx send rate:  4139
approx error rate: 0
approx throughput: 3 Mib/s

⇒ Without polling, about 4Kpps only.

With polling

Max throughput:

[root@wrap]~# netblast 10.0.2.2 9090 1470 60

start:             947743395.648211295
finish:            947743455.685888369
send calls:        71946
send errors:       0
send success:      71946
approx send rate:  1199
approx error rate: 0
approx throughput: 14 Mib/s

Note: Lower throughput with polling enabled !

Max PPS:

[root@wrap]~# netblast 10.0.2.2 9090 64 60

start:             947743786.379087368
finish:            947743846.417400442
send calls:        117315
send errors:       0
send success:      117315
approx send rate:  1955
approx error rate: 0
approx throughput: 1 Mib/s

⇒ With polling enabled, it generates only about 2 Kpps.

Enabling polling for an end-point packet generator is not a good idea on the WRAP.

WRAP as a router

PC1 will generate its maximum pps to PC2 across WRAP.

Without polling

From PC1:

[root@PC1]~# netblast 10.0.2.2 9090 64 60

start:             1326210268.937558253
finish:            1326210328.979533141
send calls:        4053658
send errors:       0
send success:      4053658
approx send rate:  67560
approx error rate: 0
approx throughput: 57 Mib/s

The WRAP begins to fill its log with:

interrupt storm detected on "irq10:"; throttling interrupt source
interrupt storm detected on "irq10:"; throttling interrupt source
interrupt storm detected on "irq10:"; throttling interrupt source
interrupt storm detected on "irq10:"; throttling interrupt source
interrupt storm detected on "irq10:"; throttling interrupt source
interrupt storm detected on "irq10:"; throttling interrupt source
interrupt storm detected on "irq10:"; throttling interrupt source

Then it crashes:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0xc
fault code              = supervisor read, page not present
instruction pointer     = 0x20:0xc07533f7
stack pointer           = 0x28:0xc7eb3af0
frame pointer           = 0x28:0xc7eb3b1c
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 11 (irq10: sis0)
trap number             = 12
panic: page fault
cpuid = 0
Uptime: 8m5s
Cannot dump. Device not defined or unavailable

With polling

documentation/examples/network_performance_tuning_on_low-end_hardware.txt · Last modified: 2013/02/26 10:02 by olivier