Resgrp:comp-photo-hpc

From ChemWiki

IC has a centrally-managed HPC computer system. The group also owns two nodes, which have a dedicated batch queue (pqmb), mainly intended for short test calculations.

More details here: high performance computing

Join the mailing list. If you have problems, ask around within the group first; otherwise, contact Matt Harvey in HPC support directly (m.j.harvey@imperial.ac.uk).

Using The Cluster

Before running calculations on the cluster, look at the tutorial:

Using the cluster : tutorial and examples

Below is a summary / reminder.

Connecting

To connect to the PC cluster and forward display information for X-windows, use

ssh -Y ab1234@login.hpc.ic.ac.uk

Use your short IC college account username in place of ab1234.

This connects to one of three front-end login nodes; the remaining nodes are compute nodes, which run calculations through a queuing system. All cluster nodes share common file systems.

(To compile Gaussian code, you previously had to connect to login-0 explicitly, as this was the node the supported Gaussian compiler was licensed for. As of 10/2022 this is no longer the case.)

Once connected, running the command id should give something like the following:

[login-0 ~]$ id
uid=45751(mjbear) gid=11000(hpc-users) groups=1010(gaussian-users),11000(hpc-users),11100(gaussian-devel),11232(pgi-users)

To access the current development version of Gaussian, you will need to be in the gaussian-devel group (and sign the developer's license agreement).

To access the run-time libraries needed to run Gaussian, you also need to be in the pgi-users group.

Both should have been set up when your account was created.

Queuing system

Once you have logged in to a login node, you can start submitting jobs to the queue. You must not run calculations on the login nodes themselves, as they are a shared resource.

Instead, you should ask the login node to find you a suitable compute node to run your job on.

IC uses the PBS queue system. You typically interact with the queue via one of three commands:

qsub job.sh

qsub submits a job file (job.sh in this case) to the queue. Job files are slightly modified bash scripts that instruct the compute node what to do.

qstat

qstat tells you the status of your current jobs - whether they are queuing or running. Note that finished jobs do not appear in the listing by default.

qdel

Lastly, qdel lets you delete a queued or running job, should you change your mind.
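
As a sketch, a typical session with these three commands might look something like the following (the job ID, queue name and exact qstat output columns are illustrative and depend on the PBS version installed):

[login-0 ~]$ qsub job.sh
1234567.cx1
[login-0 ~]$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
1234567.cx1       job.sh           mjbear            00:00:00 Q pqmb
[login-0 ~]$ qdel 1234567.cx1

The single-letter state column shows Q while the job is queuing and changes to R once it is running.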

Job files

A typical job file will look like this:

#PBS -l ncpus=2
#PBS -l mem=1700mb
#PBS -l walltime=00:09:00
#PBS -j oe

module load gaussian/devel-modules
module load gdvh11

gdv < /home/mjbear/test_h11/test009.com > $WORK/test009.log

The lines starting with #PBS are commands to the queuing system. In this case we request 2 cores, 1700 MB of RAM and nine minutes of runtime; -j oe joins the standard output and error streams into a single file.

Note that if you request more than one node, ncpus refers to CPU cores per node, not total CPU cores.

The other commands (not prefixed with #PBS) will be run on the compute node once your job has finished queuing. In general you will want to initialise the code you need (done here using modules) and then run your job (gdv in this case).
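
As a rough sketch (the resource requests and module names are copied from the example above, and the input file name is illustrative), a job script can use the standard PBS variable $PBS_O_WORKDIR, which points to the directory qsub was run from, to avoid hard-coding paths:

#PBS -l ncpus=2
#PBS -l mem=1700mb
#PBS -l walltime=00:09:00
#PBS -j oe

module load gaussian/devel-modules
module load gdvh11

# change to the directory the job was submitted from
cd $PBS_O_WORKDIR
gdv < test009.com > test009.log

This keeps the input and output files together in the directory you submitted the job from.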

Nodes on pqmb

NOTE: At the time of writing (10/2022), the pqmb nodes are significantly slower than the general throughput nodes. Unless you need to run a job for more than 72 hours, I strongly recommend using the general queue.

We have our own queue, named "pqmb", on CX1 (see above). Jobs can be directed to pqmb using #PBS -q pqmb. As of April 2017, there are three groups of nodes, which can be selected via their microarchitecture variable through PBS:

Group   Nodes   Cores/Node   Memory/Node (GB)   Microarchitecture   Gaussian
104     2       12           50                 westmere            G03+G09
5       8       16           132                sandybridge         G03+G09+G16
100     8       24           264                broadwell           G03+G09+G16

This table shows the maximum resources available on each node. A single node with 8 cores in the Broadwell group may be selected using nodes=1:broadwell:ppn=8; replacing broadwell with one of the other microarchitecture variables above lets you specify which type of node to run on. Note that Gaussian 16 will not run on the old Westmere nodes.

Example:

#PBS -l nodes=1:broadwell:ppn=8
#PBS -l mem=16000mb
#PBS -l walltime=2096:00:00
#PBS -q pqmb

This script will request one node in the Broadwell group with 8 CPU cores and 16000 MB of RAM. Note that the current maximum walltime on the private queue is 2096 hours.

With the above notation, multiple nodes may be selected in a job. ppn defines the processors per node to be used.

One can also run jobs across different types of nodes, as follows:

#PBS -l nodes=2:broadwell:ppn=24+sandyb:ppn=16

Further, a specific node can be assigned by giving its hostname:

#PBS -l select=1:ncpus=4:host=cx1-100-4-3


The -l select argument can also be used, although it does not seem to work well for jobs running across several nodes. If you do use it, the format is as follows:

#PBS -l select=1:ncpus=8:broadwell=true
#PBS -l mem=16000mb
#PBS -l walltime=2096:00:00
#PBS -q pqmb