Resgrp:comp-photo-hpc
IC has a centrally managed HPC system. In addition, the group owns two nodes with a dedicated batch queue (pqmb), mainly used for short test calculations.
More details here: high performance computing
Join the mailing list. If you have problems, ask around within the group first; otherwise, contact Matt Harvey in HPC support directly (m.j.harvey@imperial.ac.uk).
Using The Cluster
Before running calculations on the cluster, look at the tutorial:
Using the cluster : tutorial and examples
Below is a summary / reminder.
Connecting
To connect to the PC cluster and forward display information for X-windows, use
ssh -Y ab1234@login.hpc.ic.ac.uk
Use your short IC college account username instead of ab1234.
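If you connect often, you could add a shell alias on your local machine so the full command does not need to be retyped each time (a minimal sketch; the alias name is illustrative, and ab1234 should again be replaced by your own username):
# Add to ~/.bashrc on your local machine; "hpc" is just an illustrative alias name
alias hpc='ssh -Y ab1234@login.hpc.ic.ac.uk'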
This connects you to one of three front-end login nodes. All cluster nodes share common file systems; the remaining nodes are compute nodes, on which calculations are run through a queuing system.
(Previously, to compile Gaussian code you needed to connect to login-0 explicitly, as this was the node the supported Gaussian compiler was licensed for; as of 10/2022 this is no longer the case.)
Once connected, running the command id should give output like the following:
[login-0 ~]$ id
uid=45751(mjbear) gid=11000(hpc-users) groups=1010(gaussian-users),11000(hpc-users),11100(gaussian-devel),11232(pgi-users)
To access the current development version of Gaussian, you will need to be in the gaussian-devel group (and sign the developer's license agreement).
To access the run-time libraries needed to run Gaussian, you also need to be in the pgi-users group.
Both should have been set up when your account was created.
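A quick way to check this from a login node is to list your group memberships and filter for the groups mentioned above (a minimal sketch):
# Print your group names one per line and keep only the Gaussian-related ones
id -nG | tr ' ' '\n' | grep -E '^(gaussian-devel|pgi-users)$'
If both group names are printed, your account is set up as expected.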
Queuing system
Once you have accessed a login node, you can start submitting jobs to the queue. You must not run calculations on the login nodes, as they are a shared resource.
Instead, you should ask the login node to find you a suitable compute node to run your job on.
IC uses the PBS queue system. You typically interact with the queue via one of three commands:
qsub job.sh
qsub submits a job file (job.sh in this case) to the queue. Job files are slightly modified bash scripts that instruct the compute node what to do.
qstat
qstat tells you about the status of your current jobs, i.e. whether they are queuing or running. Note that finished jobs do not appear in the listing by default.
qdel
Lastly, qdel lets you delete a queued or running job should you change your mind.
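A typical interaction with these commands might look like the following (the job ID and its format are illustrative):
qsub job.sh      # prints the ID of the new job, e.g. 1234567.pbs
qstat            # lists 1234567 in state Q (queued) or R (running)
qdel 1234567     # removes the job from the queue if it is no longer needed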
Job files
A typical job file will look like
#PBS -l ncpus=2
#PBS -l mem=1700mb
#PBS -l walltime=00:09:00
#PBS -j oe

module load gaussian/devel-modules
module load gdvh11

gdv < /home/mjbear/test_h11/test009.com > $WORK/test009.log
The lines starting with #PBS are commands to the queuing system. In this case we request 2 cores, 1700 MB of RAM and 9 minutes of runtime.
Note that if you request more than one node, ncpus refers to CPU cores per node, not total CPU cores.
The other commands (not prefixed with #PBS) will be run on the compute node once your job has finished queuing. In general you will want to initialise the code you need (done here using modules) and then run your job (gdv in this case).
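For example, using the select syntax shown further down, a request for two chunks of eight cores each asks for 16 cores in total rather than eight (a sketch; the numbers are illustrative):
# Two chunks (nodes) of 8 cores each, i.e. 16 cores in total
#PBS -l select=2:ncpus=8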
Nodes on pqmb
NOTE: At the time of writing (10/2022) the pqmb nodes are significantly slower than the general throughput nodes. Unless you need to run a job for longer than 72 hours, I strongly recommend using the general queue.
We have our own queue named "pqmb" on CX1 (see above). Jobs can be directed to pqmb using #PBS -q pqmb. As of April 2017, there are three groups of nodes, which can be selected via their microarchitecture variable through PBS:
Group | Nodes | Cores/Node | Memory/Node (GB) | Microarchitecture | Gaussian |
---|---|---|---|---|---|
104 | 2 | 12 | 50 | westmere | G03+G09 |
5 | 8 | 16 | 132 | sandybridge | G03+G09+G16 |
100 | 8 | 24 | 264 | broadwell | G03+G09+G16 |
This table shows the maximum resources available on each node. For example, 8 cores on a single node in the Broadwell group may be requested using nodes=1:broadwell:ppn=8. Replacing broadwell with one of the other microarchitecture variables above allows you to specify which type of node to run on. Note that Gaussian 16 will not run on the old Westmere nodes.
Example:
#PBS -l nodes=1:broadwell:ppn=8
#PBS -l mem=16000mb
#PBS -l walltime=2096:00:00
#PBS -q pqmb
This script requests one node in the Broadwell group with 8 cores and 16000 MB of RAM. Note that the current maximum walltime on the private queue is 2096 hours.
With the above notation, multiple nodes may be selected in a job; ppn defines the number of processors per node to be used.
One can also run jobs across different types of nodes, as follows:
#PBS -l nodes=2:broadwell:ppn=24+sandyb:ppn=16
Further, a specific node can be assigned by host name using the select syntax:
#PBS -l select=1:ncpus=4:host=cx1-100-4-3
The -l select argument can also be used, but does not seem to work well for running across several nodes. If used, the following format applies:
#PBS -l select=1:ncpus=8:broadwell=true
#PBS -l mem=16000mb
#PBS -l walltime=2096:00:00
#PBS -q pqmb
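Putting the pieces together, a complete pqmb submission script might look like the following sketch, which reuses the module and gdv commands from the earlier example (paths, module names and resource values are illustrative):
#!/bin/bash
#PBS -l select=1:ncpus=8:broadwell=true
#PBS -l mem=16000mb
#PBS -l walltime=100:00:00
#PBS -q pqmb
#PBS -j oe

# Initialise the required code via modules, then run the calculation
module load gaussian/devel-modules
module load gdvh11

gdv < /home/mjbear/test_h11/test009.com > $WORK/test009.log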