[Bioclusters] gmond and high loads under Suse w/ 2.6 kernel?

Chris Dwan bioclusters@bioinformatics.org
Thu, 4 Nov 2004 08:17:44 -0500


I'm working with a cluster that shows unexplained high load averages 
on the portal (hovering between 1 and 2 with the system sitting idle).  
It's a 32-node, 64-CPU Opteron cluster running SUSE with the 2.6 
kernel.

When I turn off Ganglia's gmond daemon, the load drops to an ordinary 
rest state (around 0.1).  After some debugging to isolate the behavior, 
there's clearly a causal link between gmond on the portal and these 
high loads.

Gmond does not appear to be using much CPU time, doesn't linger in 
"top", and otherwise doesn't seem to be the real problem.  The 
cluster is relatively small (32 nodes).  If I turn off all of the 
cluster gmond processes, the load drops some, but not all the way to a 
rest state.
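
One way to narrow down a high load that does not show up as CPU time: 
on Linux the load average counts tasks in uninterruptible sleep (state 
D) as well as runnable tasks (state R), so a task blocked in the kernel 
raises the load without appearing busy in "top".  The following is a 
minimal Python sketch, not a tested diagnostic for this cluster.  The 
function names are hypothetical, and it only looks at each process's 
main thread, so threads blocked in D state would be missed.

```python
import glob


def parse_loadavg(text):
    """Return the 1-, 5-, and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_state(stat_text):
    """Return (command, state) from the text of a /proc/<pid>/stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return comm, state


def tasks_counted_in_load():
    """List (pid, command, state) for processes currently in state R or D."""
    found = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                comm, state = parse_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or its stat line was unreadable
        if state in ("R", "D"):
            found.append((path.split("/")[2], comm, state))
    return found


def report():
    """Print the current load averages and the processes in state R or D."""
    with open("/proc/loadavg") as f:
        print("load averages:", parse_loadavg(f.read()))
    for pid, comm, state in tasks_counted_in_load():
        print(pid, comm, state)
```

Calling report() a few times on the portal, with gmond running and 
again with it stopped, would show whether gmond or some other task is 
regularly sitting in state D.  A single sample is only a snapshot, so 
an empty list on one run does not rule this out.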

The system is sluggish when the load reports high, but not as sluggish 
as I might expect.

Has anyone seen this before?  It's more annoying than anything else.  
I'm tempted to blame "something in the kernel" and "multicast," but I 
would love to have a more robust explanation.

-Chris Dwan
  The BioTeam