Hi Shane, Hi Joe,

>> We are running large numbers (10,000 to 100,000) jobs that are
>> very short (1 second).

Ditto to what Joe said. In this case, scheduling overhead is
outstripping the actual work by an order of magnitude or so.

You're on the right track with batching the jobs. Nobody ever creates
10,000 truly unique jobs. They create 10,000 unique inputs. In one
case, I found it useful to impose an artificial limitation of 500 tasks
for any batch of work. I then scaled my chunk size (granularity) to fit
that. This turned out to be much simpler than playing the never-ending
"explore the limits of my queuing system" game.

>> Admittedly, one second is too short for a job and will produce a
>> lot of overhead no matter what, but there are times when it is
>> difficult to change our code to produce longer jobs, and we'd like
>> to provide some facility to do this with at least minimal overhead.

Sounds like you're looking for a way to schedule and dispatch several
jobs at once as a single unit. You would run the jobs in the same
sandbox on the node and share the build-up and tear-down costs? I've
never heard of any queuing system with this functionality.

>> Also, when our file systems have more than a few thousand files in
>> one directory things slow down tremendously, and it becomes
>> impossible to even ls the directory. It also can crash our file
>> servers. We are using NFS.

I've seen this too. It's a real pain. Again, the answer is to find the
limits of your system and batch things up so as to work around them.
When dealing with EST projects, we would make directory trees such that
we could get below about 1,000 files per directory. It added
complexity, but yielded a working system.

Most filesystems just don't stand up to that sort of flat topology.
Besides, once you conquer that, there is an endless pit of trivia
behind it, starting with "too many arguments on the command line" for
all your favorite shell tools. Watch your inode counts.
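
To make the directory-tree idea concrete, here is a minimal Python
sketch. It is not what we actually ran on the EST projects, just one
way to do it: hash each file name and use the first two hex digits as a
subdirectory, giving 256 buckets. With 100,000 files that averages
roughly 390 per directory. The function name sharded_path, the
"results" root, and the bucket width are all made up for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(root, filename, levels=1, width=2):
    """Map a flat file name to a nested path such as root/ab/filename.

    Hypothetical helper: the subdirectory names are taken from a hash of
    the file name, so files spread roughly evenly over 16**width buckets
    per level and the same name always maps to the same place.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    # Slice the hex digest into one chunk of `width` characters per level.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *parts, filename)


# Example: decide where the output of job 42 should live.
target = sharded_path("results", "job_42.out")
```

The jobs (or whatever collects their output) would create the parent
directory before writing. Every extra directory costs an inode too, so
more levels is not automatically better.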

One behavior that's bitten me more than a couple of times is the fact
that NFS servers can be overwhelmed by too many concurrent writes to
the same file. So, one can either write a truly large number of STDERR
and STDOUT files, or else risk having jobs die on file write errors.
In theory, this is a bug in NFS, since it should scale "indefinitely."
In reality, even a moderately sized compute farm can quickly overwhelm
even the most robust filesystem. (braces for marketing claims to the
contrary)

>> However, when I went to 100,000 jobs the number of files grew
>> faster than they could be concatenated, and the system is now slowly
>> going through that huge directory and trying to append the smaller
>> files, even though the array job is long since finished.
>
> ... stuff like this happens. ...
> Have a single process handle appending. Write the append meta
> information into a queue (a database), and have the single process
> walk the database. This way you are updating a database and not
> dealing with file locking issues.

The only thing I would add to this suggestion is to take a look at what
you're going to use to consume these results. It's possible that you
could feed directly into that instead of saving intermediate files.
Ensembl does this (database writes directly from the compute nodes as
they finish their work) and it's pretty nifty.

-Chris Dwan
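
P.S. A rough Python sketch of the single-appender idea, in case it
helps. Jobs only insert a row saying "my result file is ready", and one
process walks the table and does all the appending. The table layout
and the names open_queue, enqueue, and drain are invented for this
example. I use sqlite3 only because it ships with Python; SQLite's own
documentation warns that its file locking is unreliable over NFS, so on
a real farm the queue would more likely live in a networked database
server.

```python
import sqlite3
from pathlib import Path


def open_queue(db_path):
    """Open (and if needed create) the hypothetical queue table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " path TEXT NOT NULL,"
        " done INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def enqueue(conn, result_path):
    """Called by a job when it finishes: record that a result is ready."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO pending (path) VALUES (?)", (str(result_path),)
        )


def drain(conn, combined_path):
    """Run by the single appender: append each pending file in order."""
    rows = conn.execute(
        "SELECT id, path FROM pending WHERE done = 0 ORDER BY id"
    ).fetchall()
    with open(combined_path, "ab") as out:
        for row_id, path in rows:
            out.write(Path(path).read_bytes())
            out.flush()
            # Mark the row only after its bytes have been written out.
            with conn:
                conn.execute(
                    "UPDATE pending SET done = 1 WHERE id = ?", (row_id,)
                )
    return len(rows)
```

Only drain ever opens the combined file, so there is exactly one
writer and no file locking to fight over.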