Hi,

after installing our new cluster with Rocks 4.1 and BioBrew (plus a number of rolls; the hpc roll included), I am having a hard time getting mpiblast to run. The cluster consists of 7 machines (a head node plus 6 compute nodes), each equipped with two dual-core Opteron CPUs and 8 GB of RAM.

These are the steps I took:

* Extended sysctl.conf on the frontend and all nodes to provide more shared memory (note: 1099511627776 bytes is actually 1 TB, not the 1 GB my original comment claimed; also, kernel.shmall is counted in pages, not bytes):

  # Shared mem
  kernel.shmmax = 1099511627776
  kernel.shmall = 1099511627776

* Extended .bashrc to use mpich and to increase P4_GLOBMEMSIZE:

  export PATH=/opt/mpich/gnu/bin:$PATH
  export P4_GLOBMEMSIZE=157286400

* Put all nodes into /opt/mpich/gnu/share/machines.LINUX (hm... did I do this manually? I don't remember.)

I was following Glen's "mpiblast introduction" as published on the Rocks-Discuss mailing list on 2005-03-24 and executed the following command line:

  mpirun -np 30 /usr/local/bin/mpiblast -p blastn -d Hs.seq.uniq -i IL2RA -o blast_results

~/.ncbirc is configured like this:

======================================================================
[NCBI]
Data=/usr/share/ncbi/data/

[BLAST]
BLASTDB=/state/partition1/blastdb
BLASTMAT=/usr/share/ncbi/data/

[mpiBLAST]
Shared=/state/partition1/blastdb
Local=/tmp
======================================================================

(/state/partition1/blastdb is a symlink to the blastdb path on the frontend, and contains the database on the nodes. I tried this via NFS, too.)

Depending on the value of P4_GLOBMEMSIZE, I get different errors - but errors in all cases. The jobs are distributed among the nodes, though.

For "smaller" values of P4_GLOBMEMSIZE (i.e.
104857600 == 100 MB; most of the time 200 MB) I get this error:

======================================================================
p0_8400: (23.453125) xx_shmalloc: returning NULL; requested 22880510 bytes
p0_8400: (23.453125) p4_shmalloc returning NULL; request = 22880510 bytes
You can increase the amount of memory by setting the environment variable
P4_GLOBMEMSIZE (in bytes); the current size is 104857600
p0_8400: p4_error: alloc_p4_msg failed: 0
======================================================================

For 200 MB, I sometimes (?) get the same error, sometimes this one:

======================================================================
rm_21956: p4_error: semget failed for setnum: 19
======================================================================

For 300 MB, I get this:

======================================================================
p0_20214: p4_error: exceeding max num of P4_MAX_SYSV_SHMIDS: 256
======================================================================

I tried to test my mpich installation with the included sample programs (mainly cpi.c). I was able to get it running with -np <small number>, but the errors described above occurred when I increased the process count. Yes, I executed "cleanipcs; cluster-fork cleanipcs" in advance in all cases.

I frankly have not yet understood the correlation between the shmmax/shmall settings, P4_GLOBMEMSIZE and P4_MAX_SYSV_SHMIDS, or how to tune each of them for successful mpich parallelization.

Due to these mpich problems, I installed OpenMPI and compiled the mpiblast src.rpm against OpenMPI; the errors above did not occur, but the blast job seemed to get stuck somewhere, too (no error message, but the job seemed to run forever).

As I am quite new to clusters, MPI and mpiblast, I feel a little lost. Do you have any ideas what the problems may be, and how to fix them?
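If it helps, this is how I currently picture the limits interacting. The assumed 1 MiB per-segment size below is pure guesswork on my part; I have not verified how the p4 device actually carves up P4_GLOBMEMSIZE, so please correct me if this model is wrong:

```shell
#!/bin/sh
# Back-of-the-envelope check: if P4_GLOBMEMSIZE gets carved into multiple
# SysV shared memory segments, the segment count must stay below
# P4_MAX_SYSV_SHMIDS (256 by default, per the error above).
# NOTE: the 1 MiB per-segment size is an ASSUMPTION, not a verified
# mpich/p4 constant.

mb_to_bytes() { echo $(( $1 * 1024 * 1024 )); }

GLOBMEM=$(mb_to_bytes 300)   # the 300 MB setting that failed for me
SEG=$(mb_to_bytes 1)         # assumed per-segment size (1 MiB)
SEGMENTS=$(( (GLOBMEM + SEG - 1) / SEG ))

echo "P4_GLOBMEMSIZE=$GLOBMEM bytes -> $SEGMENTS segments"
# Under this assumption, 300 segments would exceed the default cap of 256.
```

If that picture is roughly right, the per-segment size would also be bounded by the kernel's shmmax, so I should probably double-check (e.g. with "cluster-fork sysctl kernel.shmmax") that my sysctl.conf changes actually took effect on every node.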
Thx and Regards,
Bastian

--
Bastian Friedrich                       bastian at bastian-friedrich.de
Address & phone available on my homepage: http://www.bastian-friedrich.de/
\~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\
\ To learn more about paranoids, follow them around!