[Bioclusters] gmond and high loads under Suse w/ 2.6 kernel?

Hai Pham bioclusters@bioinformatics.org
Thu, 4 Nov 2004 10:31:06 -0500


Hi Chris:

Without knowing your specific hardware & software configuration, I'm
just taking a stab in the dark here, but it sounds to me like there is
something causing the kernel to eat up CPU time, or at least
introducing some sort of delay.  Use 'top' to see what the system CPU
load is; when the machine is idle, it should be close to 0%.
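
For a quick one-shot check, something like this works (batch-mode
flags from procps top; vmstat shows the same numbers over time):

   $ top -b -n 1 | head -n 5    # the "Cpu(s): ... sy" figure is kernel time
   $ vmstat 1 5                 # "sy" column = kernel time, "id" = idle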

If the kernel is busy during idle periods, there are a number of
things that might cause this:
- High network traffic loads, especially when there is lots of
broadcast traffic on the network.  You can use 'tcpdump' to see what
sort of traffic is showing up.  Poorly configured routers, duplicate
IPs, misconfigured or unnecessary network services, and the like can
all generate broadcast traffic that swamps a network and slows down
every machine on it.
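
  For example, something along these lines will show just the
  broadcast/multicast chatter (eth0 is an assumption -- substitute
  whatever interface your cluster traffic rides on):

     # tcpdump -n -i eth0 'ether broadcast or ether multicast'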

- Bad or marginal hardware.  For example, a dying ethernet card or
switch port can generate lots of errors that slow down network traffic
and drive up kernel load, while still working just well enough that
you don't realize it's the problem.
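
  The per-interface error counters are a cheap first check here (exact
  paths may vary a bit by distribution):

     $ /sbin/ifconfig eth0    # look at the errors/dropped/overruns counts
     $ netstat -i             # RX-ERR / TX-ERR columns across all interfaces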

- Bad or poorly configured device drivers.  I once had a server with
really poor response times because I had compiled the wrong IDE
drivers into the kernel, so DMA was off and everything got really
sluggish.  (Note that this didn't show up as high system CPU usage
during idle, but it caused kernel CPU utilization to go way up during
sustained disk activity.)  Are you using the stock SUSE kernel or
compiling your own?  There might be something in your cluster hardware
that's interacting badly with a device driver.
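
  If DMA is the issue, hdparm will tell you (assuming an IDE disk at
  /dev/hda -- adjust the device name for your hardware):

     # hdparm -d /dev/hda     # you want to see "using_dma = 1 (on)"
     # hdparm -tT /dev/hda    # rough cached vs. buffered read timings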

- When the system load is high, is there also lots of disk activity?
That can indicate that the machines are running out of memory and
swapping.
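
  vmstat is the easiest way to catch this (nonzero "si"/"so" columns
  while the load is high mean the box is swapping):

     $ vmstat 1 5
     $ free -m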

I'm not using gmond in my own cluster, but unless gmond is a poorly
designed beast to start with, I would suspect that your problem is
symptomatic of something else -- particularly since your load average
does not drop all the way to zero at rest even with gmond turned off.
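
One more thing that may be worth checking: on 2.6 kernels, processes
blocked in uninterruptible sleep (state "D", typically waiting on disk
or NFS) count toward the load average even though they use no CPU,
which would fit a load of 1-2 on an otherwise idle, only mildly
sluggish box.  Something like:

   $ ps axo stat,pid,comm | awk '$1 ~ /^D/'

will list any such processes.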

Anyway, I hope that gives you a few ideas...

Regards,
hai




On Thu, 4 Nov 2004 08:17:44 -0500, Chris Dwan <cdwan@bioteam.net> wrote:
> 
> I'm working with a cluster which has unexplained high load values
> (hovering between 1 and 2 with the system sitting idle) on the portal.
> It's a 32 node, 64 cpu opteron cluster, running SUSE, with the 2.6
> kernel.
> 
> When I turn off GANGLIA's gmon daemon, the load drops down to ordinary
> rest states (0.1-ish).  After some debugging to isolate the behavior,
> there's clearly a causal link between gmond on the portal and these
> high loads.
> 
> Gmond does not appear to be taking very much cpu time, doesn't hang out
> in "top", and otherwise doesn't seem to be the real problem.  The
> cluster is relatively small (32 nodes).  If I turn off all of the
> cluster gmond processes, the load drops some, but not all the way to a
> rest state.
> 
> The system is sluggish when the load reports high, but not as sluggish
> as I might expect.
> 
> Has anyone seen this before?  It's more annoying than anything else.
> I'm tempted to blame "something in the kernel" and "multicast," but I
> would love to have a more robust explanation.
> 
> -Chris Dwan
>   The BioTeam
> 
> _______________________________________________
> Bioclusters maillist  -  Bioclusters@bioinformatics.org
> https://bioinformatics.org/mailman/listinfo/bioclusters
>