Mod:Hunt Research Group/Getting started on the HPC


Introduction

The aim of this wiki is to get new users set up on the Imperial HPC. It will take you through:

  • An introduction to the Imperial HPC
  • Logging in to the HPC
  • Setting up your HPC environment (.bashrc)
  • Job submission
  • Managing your jobs

Before working through this wiki, make sure that you:

  • Have an HPC account
  • Are on the HPC Gaussian users list

What is the HPC?

HPC systems are usually composed of a cluster of nodes (computers). Just like your laptop/desktop, each node has CPUs (cores/processors), disk space and memory (RAM). The Imperial HPC has several clusters: CX1 (general), CX2 (high-end parallel jobs) and AX4 (big data). You will be using CX1 for your work.

Upon logging in you will find yourself on one of the login nodes. These nodes act as a gateway to the actual compute nodes (where your jobs will run) and are good for file transfers, small job testing and setting up software. Don't run your jobs on the login nodes (it slows them down for everyone!).

From the login node, you will submit your jobs to the compute nodes on CX1. Job submission is handled by a scheduler. Imperial uses PBS as its scheduler; others exist, but they all operate in a similar way with similar syntax. The scheduler's job is to run your job (non-interactively) on an appropriate compute node so that the available resources are used efficiently. When you submit a job you tell the scheduler which queue to send it to and the resources you need (number of processors, memory and walltime); the scheduler does the rest.

Logging on

To log in to the Imperial HPC from a Linux/Unix (including Mac) system, use a secure shell (ssh) client:

  1. Open a terminal window
  2. Type the line:

ssh -XY username@login.cx1.hpc.imperial.ac.uk
Replace username with your own username, e.g. th194.

  3. If it is your first time logging in you will be asked to accept the host key; type yes to do so.
  4. Enter your password when prompted

You are now on one of the HPC login nodes!

Notes:

  • The -XY flags in the ssh command enable X11 forwarding
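  • If you log in often, you can optionally add an entry to ~/.ssh/config on your local machine so that a short command such as ssh cx1 does the same thing. A minimal sketch (the alias cx1 is just an example; replace username with your own shortcode):

Host cx1
    HostName login.cx1.hpc.imperial.ac.uk
    User username
    ForwardX11 yes
    ForwardX11Trusted yes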

Setting up your .bashrc

As when setting up your local Mac environment, you can use a .bashrc to set variables for your HPC bash environment. Below are the steps to do so, together with an example .bashrc containing some useful aliases. As you progress you can edit your .bashrc (always remember to source it to activate any updates), and if you think of any particularly useful lines then let the group know!

  1. Using your favourite text editor, create the file .bashrc in your home directory
  2. Copy and paste the script below into the new file

  3. Initialise your .bashrc by executing the command:
    source ~/.bashrc

Initial .bashrc to copy:

#!/bin/sh

# Change the prompt
   export PS1="[\$USER@\h]\$PWD \$ "

# Bash history
   HISTSIZE=100

# Add ~/bin to the search path
   PATH=$PATH:~/bin

# Default editor
   EDITOR=vi
   export EDITOR

# alias definitions
   alias force="grep -i 'Maximum Force'"
   alias dist="grep -i 'Maximum Disp'"
   alias energy="grep -i 'SCF Done:'"

   alias q='qstat'
   alias qs="qstat -q pqph"
   alias qq="qstat -q"
   alias gv="module load gaussian gaussview; gaussview"

The script mainly contains alias definitions. The most important ones for the HPC are probably the gv alias, which makes it easy to load GaussView, and the qstat aliases. Once you are more familiar with bash and your HPC use, feel free to edit your .bashrc to suit you.

You should now see that your command-line prompt has changed. If it hasn't, then the above hasn't worked. If it has, you can now see that you are logged on to one of the login nodes on the HPC (`user@login-#-internal`, where # is the number of the login node).
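
As a quick check that the aliases are active, you can try a few of them out; for example (test.log is just a placeholder name for a Gaussian output file):

q                 # same as qstat: list your jobs
energy test.log   # grep the 'SCF Done:' lines from a Gaussian log file
gv                # load the gaussian and gaussview modules and start GaussView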

Job Submission

To introduce you to the HPC we are going to run a test Gaussian calculation. To do this we need the necessary input files for the calculation (a .com file and possibly a .chk file) and a way to submit our Gaussian job to CX1 to run. Remember, the PBS scheduler manages running the job on the compute nodes. To run our job successfully, PBS needs to know the resources and programs that our job requires. We use a runscript (or jobscript) to hold this information; the runscript is essentially a set of instructions on how to run the job.

Therefore, to run our job we need:

  • Input files (.com/.chk)
  • A runscript to tell PBS how to run our job

Input Files

  1. In your home directory, set up a folder for the test job and cd into it.
    If you already have a job you want to run on the HPC then we will use that file. It is likely to be located somewhere on your local machine, in which case:
  2. Open a new terminal window on your local machine and cd to the directory where your .com file is located
  3. We want to copy this file to your new directory on the HPC, which can be done with the command:
    scp test.com username@login.hpc.ic.ac.uk:/rds/general/user/username/home
    This is a secure copy and should be familiar from the Unix cp command. Make sure you edit the destination to be the directory for your test job, put your shortcode in place of username and change the file name if it is not test.com (a filled-in example is given after the .com excerpt below).
  4. Enter your password at the prompt
  5. If the copy was successful, your test .com file should now be located in your directory on the HPC.
  6. If the job requires a .chk file then repeat the process for that file
  7. A file created on your Mac will not run on the HPC as it stands; it needs some additional information.
  8. You need to add a %mem= line saying how much memory is required and a %nprocshared= line saying how many processors are required. The following is the first part of a test.com file set up for the HPC:
%chk=test.chk
%nprocshared=20
%mem=62000MB
# hf/3-21g geom=connectivity

Title Card Required

0 1
 C
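
As an illustration of the scp command in step 3, a filled-in version might look like the following (ab1234 and the test_job folder are placeholders for your own shortcode and test directory):

scp test.com ab1234@login.hpc.ic.ac.uk:/rds/general/user/ab1234/home/test_job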

Checking the .com file

Before submitting, the resources requested in the runscript (introduced below) must match those entered at the top of your .com file.

  1. Open your .com file
  2. %chk should give the name of the checkpoint file, which must have the same base name as the .com file
  3. %NProcShared is the number of processors requested and must match the number requested in the runscript (ncpus=20 in the rs20 script below). Edit it to 20.
  4. %mem should be slightly less than the memory requested in the runscript (64000MB below). Edit it to 62000MB.

NB: There is an easier way to access your files on the HPC, which is to mount the HPC filesystem locally; this is almost like creating a tunnel between the two machines so that they can see each other directly. See the bottom of the page for information on how to do this later.

Runscript

As mentioned before the runscript contains all of the instructions to successfully run our job. The runscript usually contains PBS directives, which tell PBS the resources our job needs, and then a list of commands executing the job.

Modules

To run our job we will be using Gaussian. First, check that you are registered as a user in the HPC Gaussian group; if you are not, the job will fail because you will not be able to execute Gaussian. If you are not on the list then email Tricia to get added.

Gaussian and other programs, such as GaussView, are available on the HPC as modules. To use a module you have to load it first; an example of a module load command appeared in the .bashrc file above (the gv alias).

Useful commands:

module avail: This lists the modules available on the HPC. The names of the modules are usually the program name and the version (e.g. gaussian/g09-d01)
module load: Used to load a required module. Only once loaded can the program be used (e.g. module load gaussian/g09-d01 loads Gaussian into your environment)

We will be loading gaussian for use in our runscript.
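
A typical interactive session might look like the following (the module names and versions installed on the HPC may differ, so check module avail first):

module avail gaussian        # list the Gaussian modules installed
module load gaussian/g09-d01 # load one of them
module list                  # confirm which modules are currently loaded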

Computational Resources

To tell PBS the resources our job needs we use special PBS directives. These are lines in the script which start with #PBS. Resource requests are introduced by the flag -l followed by the resource itself. These can be:

walltime=[hhh:mm:ss]: The amount of real time the job requires to run. (There is usually a limit on the walltime available and this changes for each queue.)
select=[integer]: The number of nodes our job needs to run on.
ncpus=[integer]: The number of processors on each node.
mem=[integer][GB|MB]: The amount of memory required.
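
For example, a job that needs up to 24 hours on a single node with 8 cores and 32 GB of memory might request the following (the values here are purely illustrative; the rs20 script below uses its own settings):

#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=8:mem=32gb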

Queues

We also need to tell PBS where to submit our job; this is the queue. The PBS directive to set the queue is:

-q [queue name]

There are several queues which you may have access to. A queue has a set amount of resources assigned to it and its own limits (e.g. on walltime). How many people are using a queue determines how busy it is and therefore how long your job may wait before it runs. Specifying resources efficiently will help your jobs start sooner. Queues include:

  • pqph (various, see below): the Hunt group queue, which runs on the servers listed below. Each user can have a maximum of 12 running jobs; to help balance usage, please have no more than 20 jobs running or queued in total.
  • pqchem (42 nodes): the Chemistry department queue.
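
To send a job to the group queue, the runscript simply includes the corresponding directive, and you can check how busy the queue is before submitting (the qs alias from the .bashrc does the same thing):

# in the runscript
#PBS -q pqph

# from the command line, check the state of the queue before submitting
qstat -q pqph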

Script

The script below is an example runscript.

  • The script starts with "#!/bin/sh"; without this the job will always go to the queue "short" instead of the queue asked for.
  • The next part of the script is the PBS directives discussed above, which set the resources and variables needed.
  • The module for Gaussian is then loaded.
  • The script then checks whether a .chk file exists and, if so, copies it over to the temporary working directory on the compute node.
  • The final section runs when the calculation has completed and copies output files (in this script, the .chk file) back to the directory the job was submitted from.

The script needs to be placed in the directory you are running the job from:

  1. Open a new file 'rs20' and copy the below into it:
#!/bin/sh

# submit jobs to the queue with this script using the following command:
# qsub -N jobname -v in=name rs20
# rs20 is this script
# jobname is a name you will see in the qstat output
# name is the input file name minus the .com extension; it is passed into this script as ${in}

# batch processing commands
#PBS -l walltime=72:00:00
#PBS -l select=1:ncpus=20:mem=64000MB
#PBS -j oe
#PBS -q pqph

# load modules
#
  module load gaussian/g16-a01

# check for a checkpoint file
#
# variable PBS_O_WORKDIR=directory from which the job was submitted.
   test -r $PBS_O_WORKDIR/${in}.chk
   if [ $? -eq 0 ]
   then
     echo "located $PBS_O_WORKDIR/${in}.chk"
     cp $PBS_O_WORKDIR/${in}.chk $TMPDIR/.
   else
     echo "no checkpoint file $PBS_O_WORKDIR/${in}.chk"
   fi   
#
# run gaussian (g16 to match the gaussian/g16-a01 module loaded above)
#
  g16 $PBS_O_WORKDIR/${in}.com
#
# job has ended copy back the checkpoint file
# check to see if there are other external files like .wfn or .mos and copy these as well
#
  cp $TMPDIR/${in}.chk $PBS_O_WORKDIR/.
# exit

We now have compatible input files and a runscript. We are ready to submit our job!

Job Submission

The instructions to submit a job are the same as those at the top of the runscript. We run the command: qsub -N jobname -v in=name rs20

  • qsub is the PBS command to submit the job.
  • jobname is the name you will see for your job in the qstat output
  • name is the input file name minus the .com extension; it is passed into the script as ${in}
  • rs20 is the name of the runscript

The runscript must be in the same directory as your job!
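
For example, for the test.com file prepared earlier, submitted from inside the test job directory, the command would look something like this (the job name test_run is just an example):

qsub -N test_run -v in=test rs20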

  1. Run the command with the appropriate substitutions

If successful, a job number (XXXXXXX.cx1) will be printed to the terminal; this is your jobID, which PBS assigns to every submitted job.

Monitoring your Job

Now that your job has been submitted you can monitor it using the command qstat. This gives you the status of your jobs in the queues. Useful commands are:

qstat to list your running and queued jobs
qstat -q to get a list of all queues
qstat -f to get a full printout of all your queued jobs' information

To delete a job from the queue you can use the command:

qdel [jobID] to remove a job from the queue

In your .bashrc there was an alias set for some of these options; typing q in the terminal should produce the same result as qstat. The status of your job in the queue will be either Q (waiting to run) or R (running), and qstat also shows the run time so far.
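
For example, to check your jobs and then remove one from the queue (1234567.cx1 is a placeholder jobID; use the one printed when you submitted):

q                  # or qstat: list your jobs and their status (Q/R)
qdel 1234567.cx1   # delete the job with this jobID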

Keep checking your job until it has run. If it was successful then the .log file should be copied back to your working directory; check this file to see whether the calculation finished correctly. You will also find a file with the extension .o[jobID]; this is the merged output and error file for your job. If there has been an error it will be detailed in this file, along with the resources requested and used by your job.
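
A quick way to check whether the calculation finished cleanly is to look at the end of the .log file; a successful Gaussian run finishes with a "Normal termination" line (test.log is a placeholder name for your output file):

tail -n 2 test.log   # the last lines should report Normal termination
energy test.log      # the alias from the .bashrc: list the SCF energy lines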