MPI - the Message Passing Interface
Multi-node (distributed memory) parallel jobs are typically written using MPI, a C/C++ and Fortran Application Programming Interface that implements parallelism by passing messages over the network between co-ordinated processes.
- Compiling MPI codes
- Submitting OpenMPI jobs
- Anatomy of an MPI job
- A Note on Memory Reservation
- Tips on parallel jobs
Compiling MPI codes
The HEC currently supports the OpenMPI implementation of MPI, available through one of several modules described below. The implementation offers compiler wrappers to assist with compiling MPI codes. The wrappers call an underlying compiler, and automatically take care of locating the correct version of MPI libraries and include files. The compiler wrappers are:
|Wrapper name|Supported Language|
|---|---|
|mpicc|C|
|mpicxx / mpiCC / mpic++|C++|
|mpif77|Fortran 77|
|mpif90|Fortran 90|

These wrappers can be called once the relevant module has been added to your environment, and are used in exactly the same way as a standard compiler.
Note: More complex MPI applications are typically built with a Makefile and/or a configure script. These applications can be built by specifying the compiler wrappers in place of the general compilers. It isn't necessary to specify the location or name of the MPI libraries or header files - the wrappers will handle these details automatically - so these fields can typically be left blank in a Makefile.
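For example, when building an autoconf-style package, the wrappers can simply be supplied in place of the usual compilers (a sketch; the exact variable names accepted depend on the package):

./configure CC=mpicc CXX=mpicxx F77=mpif77 FC=mpif90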
Each version of OpenMPI has been built in one of three flavours, one for each of the three supported compilers on the HEC: gcc, intel and pgi. For example, the version of OpenMPI 1.4.3 built for the Intel compilers has the module name openmpi/1.4.3-intel. When submitting MPI jobs, care should be taken to load the same module version and flavour as was used to build the application originally.
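For example, a simple C MPI program could be compiled with the gcc flavour of OpenMPI as follows (the file names hello_mpi.c and hello_mpi are hypothetical):

module add openmpi/1.6.5-gcc
mpicc -O2 -o hello_mpi hello_mpi.c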
Submitting OpenMPI Jobs
Here is a sample job script to run the IMB benchmark across 2 nodes, with 8 cores per node:

#$ -q parallel
#$ -l nodes=2

module add openmpi/1.6.5-gcc

mpirun IMB-MPI1
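The script can then be submitted with qsub like any other batch job; the file name imb-job.sh below is a hypothetical example:

qsub imb-job.sh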
Anatomy of an MPI job script
In the above job script example, many of the SGE job directives are explained on the basic batch job submission web page. Below is a description of the additional entries required to launch MPI jobs:
#$ -q parallel
This signals to the job scheduler that the job should be run in the parallel queue, which offers full support for parallel jobs.
#$ -l nodes=2
This specifies that the job requires two whole nodes. The nodes will be used exclusively for the parallel job - no other serial or parallel job may co-exist on the selected compute nodes.
A standard compute node offers 8 cores, and specifying a nodes value on its own will run the job with the default of 8 MPI processes per node. The above declaration therefore means "2 compute nodes, using 8 MPI processes per node".
On rare occasions, it may be required to run fewer than the default number of processes per node. The nodes request can be supplemented by a processes-per-node value. For example, to run the above IMB benchmark using 2 nodes with 1 process per node (effectively measuring point-to-point performance between two nodes), you can use the following instead:
#$ -l nodes=2,ppn=1

NOTE: As MPI jobs exclusively book whole compute nodes, specifying fewer processes per node than the number of available cores will result in the remaining cores going unused - they cannot be used by other users or other jobs. Only specify a ppn value when absolutely necessary.
The final line in the job script:

mpirun IMB-MPI1
is the call to the parallel application (in this case IMB-MPI1) wrapped in a call to the mpirun application which will handle the parallel startup of the user application. Note that mpirun does not need to be told the number of processes to run; OpenMPI automatically picks this value up from the job scheduler.
The name of the MPI application should typically be the last argument to mpirun. For MPI applications that require their own additional arguments, place these after the application name; arguments appearing before the application are interpreted by the mpirun command itself.
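For example (the application name my_app and its --steps argument are hypothetical):

mpirun ./my_app --steps 1000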
Note: Testing suggests OpenMPI supports basic input redirection on the assumption that standard input is read by rank zero of the application.
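For example, a sketch of feeding a file to the application's standard input (the file name input.dat is hypothetical), which would be read by rank zero:

mpirun ./my_app < input.dat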
A Note on Memory Reservation
As parallel jobs reserve whole nodes, memory reservation is automatically set to the compute node's full memory. By default, a 24G compute node is assumed. If larger memory nodes are required, you can add the following line to the job script:
#$ -l node_type=eth96G

Please note: the 96G (large memory) compute nodes are a scarce resource and in high demand. Parallel jobs requiring them may wait a very long time to schedule. It may be preferable to split your job over a larger number of standard-memory nodes.
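Putting this together, a parallel job requesting two large-memory nodes would begin with directives along the following lines (a sketch combining the options described above):

#$ -q parallel
#$ -l nodes=2
#$ -l node_type=eth96G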
Tips on parallel jobs
- Not all jobs scale well when parallelised - running on n cores will not result in your code running n times faster. Always test an application with different job sizes (including single-core) to find the 'sweet spot' which best uses the resources available.
- To maximise performance, parallel jobs should use the maximum number of processors per node possible (8 on the current cluster). The only general exception should be where the amount of memory needed per processor would exceed a compute node's memory.
- The system has been designed to support MPI jobs with moderate message passing on up to 64 processors (8 nodes). Applications with lighter message passing loads may scale beyond this.
- As parallel jobs require much more resource than regular single-core batch jobs, there is usually a much longer wait between job submission and job launch, particularly when the cluster is busy. Opt to submit jobs as serial rather than parallel unless the improvement in runtime is essential.