Parallelization using PVM

Parallelization across multiple machines (as opposed to multithreading on a single multiprocessor machine) can be done with PVM, the Parallel Virtual Machine software from Oak Ridge National Labs.

PVM is freely available. You can obtain it from http://www.epm.ornl.gov/. You must install and configure PVM before compiling PVM support into HMMER. During compilation, HMMER needs to see the environment variables PVM_ROOT and PVM_ARCH, and the PVM header files and libraries must be found in the appropriate place under $PVM_ROOT.
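
As a quick sanity check before compiling, you can set these variables by hand. Here is a minimal sketch for a Bourne-type shell, assuming PVM is installed under /usr/local/pvm3 (adjust the path for your site); a csh version appears in the example at the end of this section:

	PVM_ROOT=/usr/local/pvm3;             export PVM_ROOT
	PVM_ARCH=`$PVM_ROOT/lib/pvmgetarch`;  export PVM_ARCH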

To enable PVM support in HMMER, add --with-pvm to the ./configure command line before you compile a source distribution. PVM is completely optional, and the software will work fine without it.
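
For instance, a typical build with PVM support enabled might look like this (a sketch; add whatever other ./configure options you normally use):

> ./configure --with-pvm
> make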

Whereas multithreading requires no special configuration once support is compiled into HMMER, configuring a PVM cluster and using HMMER on it is a little more involved.

Configuring a PVM cluster for HMMER

Here, I will assume you're already familiar with PVM.

Designate one machine as the ``master'', and the other machines as ``slaves''. You will start your HMMER process on the master, and the master will spawn jobs on the slaves using PVM.

Install PVM on the master and all the slaves. On the master, make sure the environment variables PVM_ROOT and PVM_ARCH are set properly (ideally, in a system-wide .cshrc file).

Add the master's name to your .rhosts or /etc/hosts.equiv file on the slaves, so the slaves accept rsh connections from the master.
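
For example, if the master were named master.mydomain.edu and you run HMMER there as user fred (hypothetical names), the .rhosts file on each slave would contain a line like:

	master.mydomain.edu fred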

Put copies of the HMMER executables in a directory on the master and all the slaves. For each PVM-capable program (hmmcalibrate, hmmpfam, and hmmsearch), there is a corresponding slave PVM program (hmmcalibrate-pvm, hmmpfam-pvm, and hmmsearch-pvm). The master machine needs copies of all the HMMER programs, including the slave PVM programs. The slaves only need copies of the three slave PVM programs. (You never need to start the slave programs yourself; PVM does that. You just need to make sure they're installed where PVM can see them.)
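
For example, assuming the slave programs are built on the master and you want them in /usr/local/bin on a slave named node01 (hypothetical host and path), you could copy them over with rcp (or scp):

> rcp hmmcalibrate-pvm hmmpfam-pvm hmmsearch-pvm node01:/usr/local/bin/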

The PVM implementation of hmmpfam needs a copy of any HMM database you will search to be installed on the master and on every slave, and all HMM databases must be indexed with hmmindex. The reason is that hmmpfam is I/O bound; the PVM implementation can't distribute an HMM database fast enough over a typical cluster's Ethernet. Instead, each PVM node accesses its own local copy of the HMM database, distributing the I/O load across the nodes. hmmcalibrate and hmmsearch, in contrast, are self-contained in this respect: only the master node needs to be able to access the HMM and/or sequence files.
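
For example, to index a copy of an HMM database named Pfam, run the following in the directory holding the database, on the master and on every slave:

> hmmindex Pfam

This produces a Pfam.gsi index file alongside the database, like the ones shown in the directory listing in the example below.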

Write a PVM hostfile for the cluster. Specify the location of the HMMER executables using the ep= directive. Specify the location of pvmd on the slaves using the dx= directive (alternatively, you can make sure PVM_ROOT and PVM_ARCH get set properly on the slaves). For the slaves, use the wd= directive to specify the location of the HMM databases for hmmpfam (alternatively, you can make sure HMMERDB gets set properly on the slaves). Use the sp= directive to tell HMMER how many processors each node has (and hence, how many independent PVM processes it should start); sp=1000 means 1 CPU, sp=2000 means 2 CPUs, sp=4000 means 4 CPUs, etc.
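
Putting these directives together, a minimal hostfile for a master plus two dual-processor slaves might look like the following sketch (the host names and paths are placeholders; a real, working hostfile is shown in the next subsection):

master.mydomain.edu ep=/usr/local/bin sp=1000
node01 dx=/usr/local/pvm3/lib/pvmd ep=/usr/local/bin wd=/data/hmms sp=2000
node02 dx=/usr/local/pvm3/lib/pvmd ep=/usr/local/bin wd=/data/hmms sp=2000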

Start the PVM by typing
> pvm hostfile
(where ``hostfile'' is the name of your hostfile) on the master. Make sure all the nodes started properly by typing
> conf
at the PVM console prompt. Type
> quit
to exit from the PVM console, which leaves the PVM running in the background. You should only need to start PVM once. (We have a PVM running continuously on our network right now, waiting for HMMER jobs.)

Once PVM is running, at any time you can run HMMER programs on the master and exploit your PVM just by adding the option --pvm; for instance,
> hmmpfam --pvm Pfam my.query
parallelizes a search of a query sequence in the file my.query against the Pfam database.

Once PVM is properly configured and your slave nodes have the required slave programs (and databases, in the case of hmmpfam), the only difference you will notice between the serial and the PVM version is a (potentially massive) increase in search speed. Aside from the addition of the --pvm option on the command line, all other options and input/output formats remain identical.

Example of a PVM cluster

The St. Louis Pfam server runs its searches using HMMER on a PVM cluster called Wulfpack. I'll use it as a specific example of configuring a PVM cluster. It's a little more intricate than you'd usually need for personal use, just because of the details of running PVM jobs in a standalone way from CGI scripts on a Web server.

The master node is the Web server, fisher. The slave nodes are eight rack-mounted dual processor Intel/Linux boxes called wulf01 through wulf08. Collectively, we refer to this cluster as Wulfpack; it is a Beowulf-class Linux computing cluster.

PVM 3.3.11 is installed in /usr/local/pvm3 on the master and the slaves.

On fisher, all HMMER executables are installed in /usr/local/bin. On the wulf slave nodes, the three PVM slave executables are installed in /usr/local/wulfpack.

Pfam and PfamFrag, two Pfam databases, are installed on the wulf slave nodes in /usr/local/wulfpack. They are converted to binary format using hmmconvert -b, then indexed using hmmindex. (Using binary format databases is a big performance win for hmmpfam searches, because hmmpfam is I/O bound and binary HMM databases are smaller.)
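
The commands to do this look something like the following, assuming the original ASCII-format database file is called Pfam.ascii (the file name is just an illustration):

> hmmconvert -b Pfam.ascii Pfam
> hmmindex Pfam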

An ls of /usr/local/wulfpack on any wulf node looks like:

[eddy@wulf01 /home]$ ls /usr/local/wulfpack/
Pfam             PfamFrag         hmmcalibrate-pvm   hmmsearch-pvm
Pfam.gsi         PfamFrag.gsi     hmmpfam-pvm

The PVM hostfile for the cluster looks like:

# Config file for Pfam Web server PVM
#
* ep=/usr/local/bin sp=1000
fisher.wustl.edu
* lo=pfam dx=/usr/local/pvm3/lib/pvmd ep=/usr/local/wulfpack sp=2000
wulf01
wulf02
wulf03
wulf04
wulf05
wulf06
wulf07
wulf08

Note one wrinkle specific to configuring Web servers: the web server is running HMMER as user ``nobody'' because it's calling HMMER from a CGI script. We can't configure a shell for ``nobody'' on the slaves, so we create a dummy user called ``pfam'' on each wulf node. The lo= directive in the PVM hostfile tells the master to connect to the slaves as user ``pfam''. On each slave, there is a user ``pfam'' with a .rhosts that looks like:

   fisher nobody
   fisher.wustl.edu nobody
which tells the wulf node to accept rsh connections from fisher's user ``nobody''.

Also note how we use the sp= directive to tell HMMER (via PVM) that the wulf nodes are dual processors. fisher is actually a dual processor too, but by setting sp=1000, HMMER will only start one PVM process on it (leaving the other CPU free to do all the things that keep Web servers happy).

The trickiest thing is making sure PVM_ROOT and PVM_ARCH get set properly. For my own private PVM use, my .cshrc contains the lines:

	setenv PVM_ROOT    /usr/local/pvm3
	setenv PVM_ARCH    `$PVM_ROOT/lib/pvmgetarch`
But for the web server PVM, it's a little trickier. We start the Web server PVM as user ``nobody'' on fisher using a local init script, /etc/rc.d/init.d/pvm_init. With its error checking deleted for clarity, this script basically looks like:

#!/bin/sh
wulfpack_conf=/home/www/pfam/pfam-3.1/wulfpack.conf
. /usr/local/pvm3/.pvmprofile
$PVM_ROOT/lib/pvmd $wulfpack_conf >/dev/null &

We call this at boot time by adding the line
	su nobody -c "sh /etc/rc.d/init.d/pvm_init"
to our rc.local file. .pvmprofile is a little PVM-supplied script that properly sets PVM_ROOT and PVM_ARCH, and wulfpack.conf is our PVM hostfile.

The relevant lines of the CGI Perl script that runs HMMER jobs from the Web server (again, heavily edited for clarity) are:

# Configure environment for PVM
$ENV{'HMMERDB'}    = "/usr/local/wulfpack:/home/www/pfam/data/";
$ENV{'PVM_EXPORT'} = "HMMERDB";
$output = `/usr/local/bin/hmmpfam --pvm Pfam /tmp/query`;

The trick here is that we export the HMMERDB environment variable via PVM, so the PVM processes on wulf nodes will know where to find their copy of Pfam.

PVM is relatively complex, but with luck, this and the PVM documentation give you enough information to get HMMER running on a cluster. It's well worth it. Wulfpack was simple to assemble; besides the eight rack-mounted machines, there's a terminal switch, a single console, a 10baseT Ethernet switch, and a UPS. Each machine runs stock Red Hat Linux (we don't need no steenking Extreme Linux hype). The whole thing cost us $20K, but it runs HMMER searches as fast as a 16-processor SGI Origin - and it's bigger and has more blinking lights than an Origin, so it's more impressive to look at.

