[Bioclusters] Warning: Current Dell Powerconnect 3048 switches can fail in some conditions under high NFS traffic loads

Fri, 09 Aug 2002 13:33:36 -0400

I've been shamelessly talking up the Powerconnect switch line from Dell
in informal talks with various friends and contacts so I feel honor
bound to report that the same product line has turned around to bite me
pretty seriously on a current biocluster project.

Background: Dell has a line of aggressively priced switches that
pack in lots of very useful features compared to the competition
(small form factor, gigabit uplinks, nice performance results,
management via telnet/web/SNMP and serial console, etc.).

I had some very positive experiences with the Powerconnect 3024 line on
a previous project so when the 3048 was announced (48 10/100 ports plus
2x copper GigE and 2x Mini-GBIC GigE ports in a 1U form factor) I jumped
all over it.

The bad news is that the 3048s are actually crashing/freezing under
certain conditions involving high NFS loads. The good news is that Dell
is aware of this and the problem should be fixable via a firmware
update. The problem is isolated to the 3048 product line only.

Current issue status as of 8 August 2002
========================================

o We were the 2nd site to report this. The 1st was a large biocluster 
implementation at a New York area institution with a whole pile of 3048 
switches.

o Dell has been extremely responsive and helpful. I've been personally 
contacted by senior powerconnect engineers and the product manager. 
Internally Dell has escalated this to a P1 engineering issue and they 
have roped in the Dell cluster team.

o At this time Dell can replicate some of the behaviour we noticed in 
one of their internal labs. I don't think they have been able to crash a 
switch yet though.

o Dell is confident that this problem only exists in the 3048 product.

o We (bioteam.net) have swapped out our 3048 units for Powerconnect
3248s and after extensive torture testing have _not_ been able to crash
or otherwise disrupt the replacement switches. Things are looking very good.

I'm going to append an old description of our original trouble report 
for those who are interested in the specifics.

Regards,
Chris

> #######
> 
> 
> Original problem as reported to Dell
> 
> We have 2 identically configured Powerconnect 3048 switches rackmounted in a Linux cluster rack. Each switch aggregates 100TX traffic from 32 compute nodes and forwards traffic over a trunked pair of copper gigE connections to a large core switch. The core switch is an Extreme Networks Alpine 3808. The setup is almost exactly the same as what I put in at a previous cluster project, except that these switches are PC3048s and the switches at the other project are PC3024s.
> 
> We can reliably sling gigabytes of data intra-switch and across the two switches using ftp, rsync, and copying from NFS mounts between cluster elements.
> 
> The problem arises when we NFS mount a Network Appliance F840 filer which has a gigabit link (no jumbo frames) into our Alpine 3808.
> 
> We can reliably and repeatedly cause both Dell switches to totally freeze up (even on the serial console tty) whenever we push lots of NFS traffic to/from the NetApp.
> 
> We can freeze the Dell switches when they are uplinking in trunk mode or when there is just a single gigabit forwarding port. All of the Powerconnect switches have the latest firmware and are basically set to factory default settings with the exception of configuring IP addresses on them.
> 
> Here are the symptoms:
> 
> o uplink/forwarding port lights blink constantly even when there
>   is no network traffic
> 
> o Switch becomes unpingable
> 
> o Serial console connection freezes
> 
> o Serial console reports the following strange types of error
>   messages:
> 
>   Unhandled interrupts (isc): 00000002 (GT-48304)
>   Unhandled interrupts (isc): 00000002 (GT-48304)
>   Unhandled interrupts (isc): 00000040 (GT-48304)
>   Unhandled interrupts (imec): 40000000 (GT-48304
> 
> The only solution is to power cycle the switch. This will get
> connectivity back for a few minutes before the switch freezes again.
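
For anyone capturing similar serial console output, here is a minimal
Python sketch for tallying the "Unhandled interrupts" lines quoted above.
It assumes the 8-digit values are hexadecimal bitmasks and that the word
in parentheses (isc, imec) is some kind of source label; neither is
confirmed by Dell, and the helper names (summarize, set_bits) are made up
for illustration. It only counts lines and lists which bit positions are
set; it does not say what those bits mean.

```python
import re
from collections import Counter

# Matches lines shaped like the ones in the report, e.g.
#   Unhandled interrupts (isc): 00000002 (GT-48304)
# ASSUMPTION: the 8-digit field is a hexadecimal bitmask.
LINE_RE = re.compile(
    r"Unhandled interrupts \((?P<label>\w+)\):\s*(?P<mask>[0-9A-Fa-f]{8})"
)


def set_bits(mask):
    """Return the positions of the bits set in a 32-bit mask, lowest first."""
    return [i for i in range(32) if (mask >> i) & 1]


def summarize(log_text):
    """Count how often each (label, mask) pair appears in console text."""
    counts = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.search(line)  # search() tolerates a leading "> " quote prefix
        if m:
            counts[(m.group("label"), int(m.group("mask"), 16))] += 1
    return counts


# The four lines from the trouble report, copied as they appeared.
SAMPLE = """\
Unhandled interrupts (isc): 00000002 (GT-48304)
Unhandled interrupts (isc): 00000002 (GT-48304)
Unhandled interrupts (isc): 00000040 (GT-48304)
Unhandled interrupts (imec): 40000000 (GT-48304
"""

if __name__ == "__main__":
    for (label, mask), n in sorted(summarize(SAMPLE).items()):
        print(f"{label} {mask:08x} bits={set_bits(mask)} count={n}")
```

Feeding it a longer console capture should show whether the same few
values keep repeating before a freeze, which may be useful detail to
include in a trouble report.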