Mod:Hunt Research Group/hpc
pqph Resources
Current pqph resources:
- You can check the current queue resources and status here: pqph queue status (or from the command line as shown below).
- Currently, pqph consists mainly of 40 proc/124GB nodes, plus a couple of 48 proc/256GB nodes.
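If you are logged in to CX1, a quick command-line check with the standard PBS tools looks something like this (a minimal sketch, not a replacement for the status page above):
# summary of the pqph queue
qstat -q pqph
# list your own jobs
qstat -u $USER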
Gaussian jobs
Recommended job specifications
For running Gaussian jobs on pqph it is recommended to use just two job sizings: either a full 40 proc node, or half of a node, which leaves the other half free for a second job. These sizings apply to Gaussian jobs, which cannot be run across nodes; for codes that are parallelised across nodes you may want to use multiple nodes or alternative job sizings.
Small/medium jobs:
- Run jobs using half of a 40 processor node and half the memory allowance (64GB).
- PBS script input:
#PBS -l walltime=72:00:00
#PBS -lselect=1:ncpus=20:mem=64000MB
- Gaussian .com file input:
%nprocs=20
%mem=60000MB
Medium/large jobs:
- Run jobs using a full 40 processor node and the full memory allowance (128GB).
- PBS script input:
#PBS -l walltime=72:00:00
#PBS -lselect=1:ncpus=40:mem=128000MB
- Gaussian .com file input:
%nprocs=40
%mem=122000MB
If you need to use the larger (48 proc) nodes for more expensive calculations:
- Run jobs using a full 48 processor node and the full memory allowance (256GB).
- PBS script input:
#PBS -l walltime=72:00:00
#PBS -lselect=1:ncpus=48:mem=256000MB
- Gaussian .com file input:
%nprocs=48
%mem=256000MB
Runscripts
An example Gaussian runscript for a 20 processor job:
#!/bin/bash
# Submit jobs to the queue with this script using the following command:
#
# qsub -N jobname -v in=name rs20
#
# Where: rs20 is the name of this runscript
#        jobname is a name you will see in the qstat command
#        name is the Gaussian input file (e.g. file.com); inside this script
#        ${in%.com} gives the name with the .com extension stripped
# batch processing commands
#PBS -l walltime=72:00:00
#PBS -lselect=1:ncpus=20:mem=64000MB
#PBS -j oe
#PBS -q pqph
#PBS -m a
# Load relevant modules
module load gaussian/g09-d01
# Check for a checkpoint file to copy to the temp directory
# variable PBS_O_WORKDIR=directory from which the job was submitted.
if [[ -e $PBS_O_WORKDIR/${in%.com}.chk ]]
then
echo "$PBS_O_WORKDIR/${in%.com}.chk located"
cp $PBS_O_WORKDIR/${in%.com}.chk $TMPDIR/.
else
echo "no checkpoint file $PBS_O_WORKDIR/${in%.com}.chk"
fi
# Execute Gaussian
#
g09 $PBS_O_WORKDIR/${in}
# Once job is finished copy across the .chk file
cp $TMPDIR/${in%.com}.chk /$PBS_O_WORKDIR/.
# Check for the existence of other possible output files and copy if located
if [[ -e $TMPDIR/tesserae.off ]]
then
cp $TMPDIR/tesserae.off /$PBS_O_WORKDIR/${in%.com}_tesserae.off
fi
if [[ -e $TMPDIR/charge.off ]]
then
cp $TMPDIR/charge.off /$PBS_O_WORKDIR/${in%.com}_charge.off
fi
if [[ -e $TMPDIR/points.off ]]
then
cp $TMPDIR/points.off /$PBS_O_WORKDIR/${in%.com}_points.off
fi
# exit
Edit the PBS lines to create runscripts for other job specifications (see the sketch below). If you are not sure what the PBS commands are or what the runscript does, check out the introduction to the HPC page: Getting Started on the HPC
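For example, a hypothetical full-node version of the runscript (call it rs40) would only differ in its resource request, and would be submitted in the same way (jobname and file.com are placeholders):
# In rs40, replace the 20 proc select line with the full-node request:
#PBS -l walltime=72:00:00
#PBS -lselect=1:ncpus=40:mem=128000MB
# Submit exactly as before:
# qsub -N jobname -v in=file.com rs40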
Extra information/troubleshooting
- add tmpspace=400 to the PBS select line only for large disk jobs, to ensure you are placed on a node with enough local disk (see the sketch below).
- Note that this requires you to include maxdisk=400gb in your Gaussian input.
- NOTE: the queuing system does not check disk allocations. When requesting a large disk job, request all of the processors on the node even if you will not use them all. For large jobs the maximum disk space you can request is 800GB on the 12 processor nodes.
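As a rough sketch of how the disk request and the Gaussian directive fit together (the exact tmpspace syntax on the select line is an assumption here, so check the RCS documentation if it is rejected):
# PBS: full node plus 400GB of local disk (tmpspace syntax assumed)
#PBS -l walltime=72:00:00
#PBS -lselect=1:ncpus=40:mem=128000MB:tmpspace=400gb
and add maxdisk=400GB to the route line of the matching .com file.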
More details if you are having memory or disk issues
- normal jobs
- will need about 2*N^2 words (W) of memory, where N = number of basis functions; multiply MW by ~8.4 to convert to MB (1,048,576 B = 1 MB)
- so 300 basis functions will need 180000 W = 0.18 MW = about 1.5 MB, in addition to the base requirements described under "Memory needed to run" below
- require 2*O*N^2 W of disk to run, where O = number of occupied orbitals and N = number of basis functions
- MP2 jobs
- work best with %mem and maxdisk defined
- in-core MP2 requires about N^4/4 W of memory (divide by 1,000,000 for MW)
- so 400 basis functions would need 6400 MW = 53760 MB = 54 GB of memory per node, which is unlikely to be available!
- semi-direct MP2 requires about 2*O*N^2 W of memory and N^3 W of disk (see the sketch below)
- so N = 476 basis functions and O = 56 occupied orbitals will need
- 25.4 MW = 214 MB of memory
- and 108 MW = 906 MB of disk (in practice it will need much more, probably around 1800 MB of disk per processor!)
- so the total memory for an 8 proc MP2 frequency job will be
- 12*8*8.4 = 807 MB to run, plus 8*214 = 1712 MB for the calculation, plus some extra ~400 MB, giving about 2900 MB = 2.9 GB
- Gaussian does not like the GB directive, so give %mem in MB
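As a quick way of evaluating the semi-direct estimates above, here is a minimal shell sketch using the example values from the text (N = 476 basis functions, O = 56 occupied orbitals; 1 MW is taken as roughly 8.4 MB, as explained under "Memory needed to run" below):
#!/bin/sh
# Rough semi-direct MP2 estimate: memory ~ 2*O*N^2 words, disk ~ N^3 words
N=476   # number of basis functions
O=56    # number of occupied orbitals
awk -v N=$N -v O=$O 'BEGIN {
  mem_mw  = 2*O*N*N/1e6      # memory in megawords
  disk_mw = N*N*N/1e6        # disk in megawords
  printf "memory: %.1f MW (~%.0f MB)\n", mem_mw,  mem_mw*8.4
  printf "disk:   %.1f MW (~%.0f MB)\n", disk_mw, disk_mw*8.4
}'
This roughly reproduces the 25.4 MW / ~214 MB and 108 MW / ~906 MB figures quoted above; remember that the real disk usage will be considerably larger.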
Checkpoint and other files
- checkpoint files should have exactly the same name as the input file
- for jobs that may exceed the walltime, specify the full path of the checkpoint file, for example (see the sketch below)
- %chk=/work/phunt/tmp/filename.chk
- this means the checkpoint file will be written into your personal work directory; it may slow the job down
- this is also the reason /work is sometimes very slow on CX1, so only do this as an exception!
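Putting this together with the half-node sizing above, the header of a .com file with a redirected checkpoint would look like this (filename and the work directory path are placeholders, as in the example above):
%chk=/work/phunt/tmp/filename.chk
%nprocs=20
%mem=60000MB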
Memory needed to run
- Gaussian is greedy and will exceed the allocated memory
- each proc needs a Gaussian executable, which takes about 8MW (or 12MW for MP2 frequencies)
- MW is a megaword, the unit in which Gaussian allocates memory
- 1MW is about 8.4MB
- so each proc needs 1*8*8.4 = approximately 68MB just to run
- so 12 proc jobs require 12*68=816MB just to run
- so 16 proc jobs require 16*68=1088MB just to run
- so 20 proc jobs require 20*68=1360MB just to run
- so 24 proc jobs require 24*68=1632MB just to run
- so 40 proc jobs require 40*68=2720MB just to run
- so 48 proc jobs require 48*68=3264MB just to run
- so when allocating memory inside the Gaussian job you must reduce the memory by at least this amount
- thus it is best to reduce the memory by about 100MB * no. of processors inside the Gaussian script (see the sketch below)
- you also need some overhead within the PBS script
- note that the memory request is interpreted in binary units, so e.g. 251 GB (binary) = 251000 * 1,048,576 bytes, which is about 264GB (decimal)
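A minimal sketch of that rule of thumb, i.e. deriving the Gaussian %mem from the PBS request by knocking off about 100MB per processor (NCPUS and PBSMEM_MB are whatever you put on the select line):
#!/bin/sh
# Rule of thumb from above: %mem should sit at least ~100MB per proc
# below the memory requested from PBS.
NCPUS=20
PBSMEM_MB=64000
GMEM_MB=$((PBSMEM_MB - 100*NCPUS))
echo "%nprocs=$NCPUS"
echo "%mem=${GMEM_MB}MB"
For the half-node example this gives 62000MB; the recommendation earlier on this page uses the slightly more conservative 60000MB.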
Extra links
The Imperial Research Computing Service has an HPC wiki with useful information, including an introduction to shell scripting, modules and job management.
The RCS also runs several courses throughout the year, including introductions to Linux, the HPC and Python, as well as more advanced topics. Upcoming courses can be viewed from the RCS website.
Next steps
- Mount
- Alias shortcut for logging in
- Keypair page
- Once you are comfortable with the job submission process, the automatic job script which ... can be used
Other information (may be out of date)
- CPMD:
- DL-POLY:
- https://www.ch.ic.ac.uk/wiki/index.php/Image:Mpirun.sh
- Note: You'll not be able to see the output until the job finishes: the directory /tmp/pb.XXX isn't accessible to you because it is on the private disk of the node running the job.
- To get DLPOLY to terminate before the job hits the walltime limit and is killed, you need to run it through a program called pbsexec, for example (see the sketch below):
- pbsexec mpiexec DLPOLY.X
- This will kill DLPOLY 15 minutes before the walltime limit, giving your script time to transfer files back to $work.
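A minimal sketch of how this might sit in a job script (the resource request, module line and output file names are assumptions to adapt for your own build and system):
#!/bin/sh
#PBS -l walltime=72:00:00
#PBS -lselect=1:ncpus=40:mem=128000MB
#PBS -q pqph
# module load <your DL_POLY/MPI modules here>
# pbsexec stops DL_POLY ~15 minutes before the walltime limit,
# leaving time to copy results back from the node's private disk.
pbsexec mpiexec DLPOLY.X
# copy the standard DL_POLY output files back to the submission directory
cp OUTPUT REVCON HISTORY $PBS_O_WORKDIR/. 2>/dev/null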