I've been shamelessly talking up the Powerconnect switch line from Dell in informal talks with various friends and contacts, so I feel honor bound to report that the same product line has turned around to bite me pretty seriously on a current biocluster project.

Background: Dell has a line of aggressively priced switches that pack in lots of very useful features (small form factor, gigabit uplinks, nice performance results, manageable via telnet/web/snmp & serial console, etc.) compared to the competition. I had some very positive experiences with the Powerconnect 3024 line on a previous project, so when the 3048 was announced (48 10/100 ports plus 2x copper GigE and 2x Mini-GBIC GigE ports in a 1U form factor) I jumped all over it.

The bad news is that the 3048s are actually crashing/freezing under certain conditions involving high NFS loads. The good news is that Dell is aware of this and the problem should be fixable via a firmware update. The problem is isolated to the 3048 product line only.

Current issue status as of 8 August 2002
========================================

o We were the 2nd site to report this. The 1st was a large biocluster implementation at a New York area institution with a whole pile of 3048 switches.

o Dell has been extremely responsive and helpful. I've been personally contacted by senior Powerconnect engineers and the product manager. Internally, Dell has escalated this to a P1 engineering issue and they have roped in the Dell cluster team.

o At this time Dell can replicate some of the behaviour we noticed in one of their internal labs. I don't think they have been able to crash a switch yet, though.

o Dell is confident that this problem only exists in the 3048 product.

o We (bioteam.net) have swapped out our 3048 units for Powerconnect 3248s and, after extensive torture testing, have _not_ been able to crash or otherwise disrupt the replacement switches. Things are looking very good.

I'm going to append an old description of our original trouble report for those who are interested in the specifics.

Regards,
Chris

> #######
>
> Original problem as reported to Dell
>
> We have 2 identically configured Powerconnect 3048 switches rackmounted in a Linux cluster rack. Each switch aggregates 100TX traffic from 32 compute nodes and forwards traffic over a trunked pair of copper gigE connections to a large core switch. The core switch is an ExtremeNetworks Alpine 3808. The setup is almost exactly the same as what I put in at a previous cluster project, with the exception that these switches are PC3048s and the switches at the other project are PC3024s.
>
> We can reliably sling gigabytes of data intra-switch and across the two switches using ftp, rsync, and copying from NFS mounts between cluster elements.
>
> The problem arises when we try to NFS mount a NetworkAppliance F840 filer which has a gigabit link (no jumbo frames) into our Alpine 3808.
>
> We can reliably and repeatedly cause both Dell switches to totally freeze up (even on the serial console tty) whenever we push lots of NFS traffic to/from the NetApp.
>
> We can freeze the Dell switches when they are uplinking in trunk mode or when there is just a single gigabit forwarding port. All of the Powerconnect switches have the latest firmware and are basically set to factory default settings, with the exception of configuring IP addresses on them.
>
> Here are the symptoms:
>
> o uplink/forwarding port lights blink constantly even when there is no network traffic
>
> o Switch becomes unpingable
>
> o Serial console connection freezes
>
> o Serial console reports the following strange types of error messages:
>
>     Unhandled interrupts (isc): 00000002 (GT-48304)
>     Unhandled interrupts (isc): 00000002 (GT-48304)
>     Unhandled interrupts (isc): 00000040 (GT-48304)
>     Unhandled interrupts (imec): 40000000 (GT-48304)
>
> The only solution is to power cycle the switch. This will get connectivity back for a few minutes before the switch freezes again.