[Bioclusters] SGE array job produces too many files and concatenation is slow

Christopher Dwan cdwan at bioteam.net
Mon Jan 30 14:20:10 EST 2006


Hi Shane, Hi Joe,

>> We are running large numbers (10,000 to 100,000) of jobs that are
>> very short (1 second).

Ditto to what Joe said.  In this case scheduling overhead is
outstripping your actual compute time by an order of magnitude or
so.  You're on the right track with batching the jobs.  Nobody ever
creates 10,000 truly unique jobs.  They create 10,000 unique inputs.
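
As a rough sketch of that idea:  each array task chews through a
whole slice of inputs rather than a single one, keyed off the
SGE_TASK_ID variable that SGE exports to every task.  The file
"inputs.list", the "analyze" program, and the chunk size are all
placeholders for whatever you actually have.

    import os
    import subprocess

    CHUNK_SIZE = 100   # inputs handled by one array task; tune to taste

    # inputs.list: one input path per line, written before qsub -t is run
    with open("inputs.list") as fh:
        inputs = [line.strip() for line in fh if line.strip()]

    task_id = int(os.environ["SGE_TASK_ID"])   # 1-based; set by SGE
    start = (task_id - 1) * CHUNK_SIZE
    for path in inputs[start:start + CHUNK_SIZE]:
        # "analyze" stands in for your one-second-per-input program
        subprocess.run(["analyze", path], check=True)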

In one case, I found it useful to impose an artificial limit of
500 tasks for any batch of work.  I then scaled my chunk size
(granularity) to fit under that cap.  This turned out to be much
simpler than playing the never-ending "explore the limits of my
queuing system" game.
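
The sizing arithmetic is trivial.  A sketch, reusing the hypothetical
inputs.list from above (worker.sh stands for whatever wraps the task
script):

    import math

    MAX_TASKS = 500    # self-imposed ceiling on array size

    with open("inputs.list") as fh:
        n_inputs = sum(1 for line in fh if line.strip())

    chunk_size = math.ceil(n_inputs / MAX_TASKS)   # inputs per task
    n_tasks = math.ceil(n_inputs / chunk_size)
    print(f"qsub -t 1-{n_tasks} worker.sh   # {chunk_size} inputs per task")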


>> Admittedly, one second is too short for a job and will produce a  
>> lot of overhead no matter what, but there are times when it is  
>> difficult to change our code to produce longer jobs, and we'd like  
>> to provide some facility to do this with at least minimal overhead.

Sounds like you're looking for a way to schedule and dispatch several
jobs at once as a single unit.  You would run the jobs in the same
sandbox on the node and share the build-up and tear-down costs?

I've never heard of any queuing system with this functionality.
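
You can approximate it yourself, though:  a wrapper that pays the
build-up and tear-down once around a loop of work items.  A rough
sketch, where the staged reference file and the "analyze" command are
placeholders:

    import shutil
    import subprocess
    import sys
    import tempfile

    def run_batch(work_items):
        sandbox = tempfile.mkdtemp(prefix="batch_")   # one sandbox per batch
        try:
            # build-up paid once, e.g. staging a reference to local scratch
            shutil.copy("/nfs/ref/db.fasta", sandbox)
            for item in work_items:
                # each "job" is now one iteration in the shared sandbox
                subprocess.run(
                    ["analyze", "--db", f"{sandbox}/db.fasta", item],
                    check=True)
        finally:
            shutil.rmtree(sandbox)                    # tear-down paid once

    if __name__ == "__main__":
        run_batch(sys.argv[1:])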

>> Also, when our file systems have more than a few thousand files in  
>> one directory things slow down tremendously, and it becomes  
>> impossible to
>> even ls the directory.  It also can crash our file servers.  We  
>> are using NFS.

I've seen this too.  It's a real pain.  Again, the answer is to find
the limits of your system and batch things up so as to work around
them.  When dealing with EST projects, we would make directory trees
such that we could get below about 1,000 files per directory.  It
added complexity, but yielded a working system.  Most filesystems
just don't stand up to that sort of flat topology.  Besides, once you
conquer that, there is an endless pit of trivia behind it, starting
with "Argument list too long" errors from all your favorite shell
tools.
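
One cheap way to build such a tree is to hash each filename into a
short bucket directory.  A sketch, with the root and filename
invented:

    import hashlib
    import os

    def fanout_path(root, name, width=2):
        # bucket on the first hex digits of a hash of the filename;
        # two characters give 256 buckets, so even 100,000 files land
        # well under 1,000 per directory
        bucket = hashlib.md5(name.encode()).hexdigest()[:width]
        subdir = os.path.join(root, bucket)
        os.makedirs(subdir, exist_ok=True)
        return os.path.join(subdir, name)

    # yields results/<xx>/contig_00042.out instead of one flat directory
    print(fanout_path("results", "contig_00042.out"))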

Watch your inode counts.
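
That's "df -i" on the command line; from inside a script, Python's
os.statvfs reports the same numbers.  A sketch, with the mount point
made up:

    import os

    st = os.statvfs("/nfs/results")           # example mount point
    print(f"{st.f_ffree} of {st.f_files} inodes free "
          f"({100.0 * st.f_ffree / st.f_files:.1f}%)")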

One behavior that's bitten me more than a couple of times is the fact
that NFS servers can be overwhelmed by too many concurrent writes to
the same file.  So, one can either write a truly large number of
separate STDERR and STDOUT files, or else risk having jobs die on
write errors when they all append to a shared one.

In theory, this is a bug in NFS, since it should scale
"indefinitely."  In reality, even a moderate-sized compute farm can
quickly overwhelm even the most robust filesystem.  (braces for
marketing claims to the contrary)

>> However, when I went to 100,000 jobs the number of files grew  
>> faster than they could be concatenated, and the system is now slowly
>> going through that huge directory and trying to append the smaller  
>> files, even though the array job is long since finished.
>
> ... stuff like this happens.

...

> Have a single process handle appending.  Write the append meta  
> information into a queue (a database), and have the single process  
> walk the database. This way you are updating a database and not  
> dealing with file locking issues.

The only thing I would add to this suggestion is to take a look at
whatever you're going to use to consume these results.  It's possible
that you could feed directly into that instead of saving intermediate
files.  Ensembl does this (database writes directly from the compute
nodes as they finish their work) and it's pretty nifty.
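
A toy version of that pattern:  each task INSERTs its result row as
it finishes, and the downstream consumer just reads the table.  The
DSN, the table, and the values below are all invented, and any client
library that can do one INSERT per finished task works the same way:

    import os
    import psycopg2   # standing in for whatever client your database needs

    conn = psycopg2.connect("dbname=results host=dbserver")  # made-up DSN
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO hits (task_id, query, score) VALUES (%s, %s, %s)",
        (int(os.environ["SGE_TASK_ID"]), "contig_00042", 87.5))
    conn.commit()
    conn.close()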

-Chris Dwan


