Hopefully this post will save someone time, money and aggravation. If you’re considering buying a Supermicro X11SBA-F/LN4F, you need to be aware that at least some boards (and possibly all) with hardware revision 1.01 have serious and seemingly uncorrectable issues with the secondary PCI-E bridge that feeds the secondary LAN ports. DO NOT BUY an X11SBA rev 1.01 if you need the secondary NICs, and always make sure the hardware revision is at least 1.02.
The X11SBA series are very nice, low-cost, ridiculously low-power, feature-packed boards that let you build 3 firewalls/domain controllers/PBX appliances for under $1000 total (if you shop around), with growth capability that should last you a decade (assuming you’re using Linux/BSD and not Windows, of course). One of the key features, in addition to a multi-core CPU, is that the -F model comes with 2 GbE NICs, while the -LN4F features 4 GbE NICs.
Here’s the problem: NIC #2 on the -F and NICs #2, #3 and #4 on the -LN4F sit behind a different PCI-E bridge than the 1st NIC. That secondary bridge seems to have hardware problems on X11SBA hardware rev 1.01, causing transfer failures and requiring a LAN PHY reset.
The problem manifests itself in the following way. The LAN comes up, responds to pings and passes traffic happily. As the load increases, you suddenly lose all connectivity. On FreeBSD 10/11 the failure looks like this:
igb1: Watchdog timeout -- resetting
igb1: Queue(846295657) tdh = -1249464976, hw tdt = 589450809
igb1: TX(846295657) desc avail = 0,Next TX to Clean = 0
igb1: link state changed to DOWN
igb1: link state changed to UP
igb1: Watchdog timeout -- resetting
igb1: Queue(846295657) tdh = -1249464976, hw tdt = 589450809
igb1: TX(846295657) desc avail = 0,Next TX to Clean = 0
igb1: link state changed to DOWN
igb1: link state changed to UP
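If you want to check an existing log for this signature, a minimal Python sketch along these lines could count the watchdog resets per interface. The function name `count_watchdog_resets` and the regular expression are my own illustration, written only against the message format shown above; other driver versions or operating systems may word the message differently.

```python
import re
from collections import Counter

# Matches lines such as "igb1: Watchdog timeout -- resetting".
# Assumption: the message text is exactly as in the FreeBSD excerpt above.
WATCHDOG_RE = re.compile(r"\b(igb\d+): Watchdog timeout -- resetting")


def count_watchdog_resets(log_text):
    """Return a Counter mapping interface name to number of watchdog resets."""
    counts = Counter()
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feed it the text of your kernel log. A non-zero count on a secondary NIC under load, and none on the first NIC, would be consistent with the behaviour described here, though it does not prove the cause on its own.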
The issue is easily reproducible with iperf, as follows:
$ iperf -c <your reliable iperf server> -d -t600 -l1m -e -i1
[ 5] local <client IP> port 60094 connected with <server IP> port 5001
[ 4] local <server IP> port 5001 connected with <client IP> port 58122
[ ID] Interval Transfer Bandwidth Write/Err Rtry Cwnd
[ 5] 0.00-1.00 sec 112 MBytes 940 Mbits/sec 112/0 0 1377K
[ 4] 0.00-1.00 sec 31.6 MBytes 265 Mbits/sec 3662 3659:1:1:1:0:0:0:0
[ 5] 1.00-2.00 sec 104 MBytes 872 Mbits/sec 104/0 78 858K
[ 4] 1.00-2.00 sec 32.4 MBytes 271 Mbits/sec 4160 4157:0:2:0:0:0:0:1
...
[ 5] 49.00-50.00 sec 41.0 MBytes 344 Mbits/sec 41/0 0 718K
[ 5] 50.00-51.00 sec 39.7 MBytes 333 Mbits/sec 40/0 1 1K
[ 4] 50.00-51.00 sec 86.4 MBytes 724 Mbits/sec 3617 3611:3:2:0:0:0:0:1
[ 5] 51.00-52.00 sec 445 KBytes 3.65 Mbits/sec 1/1 1 1K
[ 4] 51.00-52.00 sec 0.00 Bytes 0.00 bits/sec 0 0:0:0:0:0:0:0:0
[ 5] 52.00-53.00 sec 0.00 Bytes 0.00 bits/sec 0/2 1 1K
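For long runs it can be easier to let a script spot the moment throughput collapses. The following is a rough Python sketch, assuming the interval lines look exactly like the iperf output above. The function name `find_stalls` and the regular expression are hypothetical helpers of mine, not part of iperf.

```python
import re

# Interval line as printed in the iperf output above, e.g.
# "[ 5] 51.00-52.00 sec 445 KBytes 3.65 Mbits/sec 1/1 1 1K"
# Assumption: other iperf versions or flags may format these lines differently.
INTERVAL_RE = re.compile(
    r"\[\s*(\d+)\]\s+([\d.]+)-([\d.]+)\s+sec\s+"
    r"[\d.]+\s+[KMG]?Bytes\s+([\d.]+)\s+([KMG]?)bits/sec"
)
SCALE = {"": 1.0, "K": 1e3, "M": 1e6, "G": 1e9}


def find_stalls(iperf_output):
    """Return (stream_id, interval_start_seconds) for each zero-throughput interval."""
    stalls = []
    for line in iperf_output.splitlines():
        m = INTERVAL_RE.search(line)
        if not m:
            continue  # header, connection banner, or "..." line
        stream_id, start, _end, rate, prefix = m.groups()
        if float(rate) * SCALE[prefix] == 0.0:
            stalls.append((int(stream_id), float(start)))
    return stalls
```

Run against the capture above, this would flag stream 4 at 51 seconds and stream 5 at 52 seconds, which is where the link stops passing traffic.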
What To Do to Fix X11SBA
What will NOT work to fix the issue:
- Disabling ACPI
- Disabling MSI-X
- Disabling MSI
- Increasing mbufs (or whatever the network stack memory buffers are called on your OS)
- Disabling PCI-E power management (ASPM)
- Tuning sysctl
- Updating BIOS
- Updating LAN EEPROM
- Messing with BIOS settings, power management etc.
If any of the manipulations above solves your problem, you’re not experiencing the hardware issue I’m describing, and the cause is likely software, settings or both.
If you do have the hardware problem, there is nothing you can do short of returning the board to the retailer or RMAing your X11SBA with Supermicro and asking them to provide a board of revision 1.02 or higher. If you request an RMA, make sure to describe the problem in exhaustive detail and/or give them a link to this article. It took me two RMAs to get my problem solved: the first time, the tech only tested for “ping OK” and declared the board functional after updating the LAN EEPROM.
Links:
- pfSense forum describing the watchdoggate in detail.
- Watchdoggate described from the Linux side of the playground.