Mod:Hunt Research Group/hpc
General Resources
pqph
- You can check the current queue resources and status here: pqph queue status
- Currently, pqph consists mainly of 40 proc/124GB nodes, plus a couple of 48 proc/256GB nodes.
Express queue
- We can also now submit jobs to the Express queue
- Use this whenever pqph is looking full, or if you have a job you think will take more than a day (the express walltime in the examples below is the 3-day limit)
To run express jobs, use the command line input:
qsub -q express -P exp-00034 -lselect=1:ncpus=48:mem=126gb -lwalltime=72:00:00
Or use this inside a PBS submit script:
# batch processing commands
#PBS -l walltime=72:00:00
#PBS -lselect=1:ncpus=48:mem=126000MB
#PBS -j oe
#PBS -q express -P exp-00034
- Don’t forget to request less memory in the Gaussian .com file, say 125GB, as shown below
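For example, the matching lines at the top of the Gaussian .com file could be (125GB, written in MB since Gaussian prefers MB, is just the figure from the note above, kept below the 126GB PBS request):
%nprocs=48
%mem=125000MB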
Gaussian jobs
Recommended job specifications
For Gaussian jobs on pqph it is recommended to use just two job sizings. These mean that either a full 40 proc node is used or just half of the node, leaving the other half free for a second job. They apply only to Gaussian jobs, which can't be run across nodes; for codes which are parallelised across nodes you may want to use multiple nodes or alternative job sizings.
Small/medium jobs:
- Run jobs using half of a 40 processor node and half the memory allowance (64GB).
- PBS script input:
#PBS -l walltime=72:00:00
#PBS -lselect=1:ncpus=20:mem=64000MB
- Gaussian .com file input:
%nprocs=20
%mem=60000MB
Medium/large jobs:
- Run jobs using a full 40 processor node and the full memory allowance (128GB).
- PBS script input:
#PBS -l walltime=72:00:00
#PBS -lselect=1:ncpus=40:mem=128000MB
- Gaussian .com file input:
%nprocs=40
%mem=122000MB
If you need to use the larger (48 proc) nodes for more expensive calculations:
- Run jobs using a full 48 processor node and the full memory allowance (256GB).
- PBS script input:
#PBS -l walltime=72:00:00
#PBS -lselect=1:ncpus=48:mem=256000MB
- Gaussian .com file input:
%nprocs=48
%mem=250000MB
Runscripts
Standard job script
An example Gaussian runscript for a 20 processor job:
#!/bin/sh
# Submit jobs to the queue with this script using the following command:
#
#   qsub -N jobname -v in=name rs20
#
# Where: rs20 is the name of this runscript
#        jobname is a name you will see in the qstat command
#        name is the actual file minus .com etc; it is passed into this script as ${in%.com}

# batch processing commands
#PBS -l walltime=72:00:00
#PBS -lselect=1:ncpus=20:mem=64000MB
#PBS -j oe
#PBS -q pqph
#PBS -m a

# Load relevant modules
module load gaussian/g16-a03

# Check for a checkpoint file to copy to the temp directory
# variable PBS_O_WORKDIR = directory from which the job was submitted
if [[ -e $PBS_O_WORKDIR/${in%.com}.chk ]]
then
    echo "$PBS_O_WORKDIR/${in%.com}.chk located"
    cp $PBS_O_WORKDIR/${in%.com}.chk $TMPDIR/.
else
    echo "no checkpoint file $PBS_O_WORKDIR/${in%.com}.chk"
fi

# Execute Gaussian
g16 $PBS_O_WORKDIR/${in}

# Once job is finished copy across the .chk file
cp $TMPDIR/${in%.com}.chk $PBS_O_WORKDIR/.

exit
- Edit the PBS lines to create runscripts for other job specifications.
- If you are not sure what the PBS commands are or what the runscript does, check out the introduction to the HPC page: Getting Started on the HPC
- If you need to copy back any other output files, you can either run the job in ${EPHEMERAL} instead of ${TMPDIR}, as all output files will remain in ephemeral storage for a period of time, or, if you know the extension of the additional desired output file, you can use a modified version of the code:
# Check for the existence of other possible output files and copy them back if located
if ls $TMPDIR/*.extension > /dev/null 2>&1
then
    cp $TMPDIR/*.extension $PBS_O_WORKDIR/.
fi
- Replace *.extension with the correct file extension that you want to copy back.
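For example, to copy back wavefunction files (the .wfn extension here is only an illustration; use whatever extension your job actually writes):
# Check for .wfn output files and copy them back if located
if ls $TMPDIR/*.wfn > /dev/null 2>&1
then
    cp $TMPDIR/*.wfn $PBS_O_WORKDIR/.
fi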
Array jobs
If you have a large number of small jobs which are only slightly different, e.g. optimising a large number of conformers of a molecule/system that vary only in the input structure, then you should use an array job.
An example array job runscript for a 20 processor job is:
#!/bin/sh
# batch processing commands
#PBS -l walltime=72:00:00
#PBS -lselect=1:ncpus=20:mem=64000MB
#PBS -J 1-X
#PBS -j oe
#PBS -q pqph
#PBS -m a
#PBS -N arrayJobName

in=$( sed -n ${PBS_ARRAY_INDEX}p ${PBS_O_WORKDIR}/inputFiles.txt )
echo ${in}

# Load relevant modules
module load gaussian/g16-a03

# Check for a checkpoint file to copy to the temp directory
# variable PBS_O_WORKDIR = directory from which the job was submitted
if [[ -e $PBS_O_WORKDIR/${in%.com}.chk ]]
then
    echo "$PBS_O_WORKDIR/${in%.com}.chk located"
    cp $PBS_O_WORKDIR/${in%.com}.chk $TMPDIR/.
fi

# Execute Gaussian
g16 $PBS_O_WORKDIR/${in}

# Once job is finished copy across the .chk file
cp $TMPDIR/${in%.com}.chk $PBS_O_WORKDIR/.

exit
To use the script:
- Set up all your input .com files in the same directory
- Edit the line in the runscript that sets the number of jobs in the array:
#PBS -J 1-X
Change X to the number of input files you have to run.
- The runscript works by running X separate jobs within the array. For each job, a PBS variable (PBS_ARRAY_INDEX) is set to the job's number within the array, e.g. for the first job to run, PBS_ARRAY_INDEX = 1.
- Change the job name using the -N flag in the script or by the command line option
- Save your changes to the runscript and exit
- Create a text file with the names of all the input .com files. An easy way to do this is by the command line:
ls *.com > inputFiles.txt
- You will notice that inputFiles.txt is read by the line in the runscript which sets the variable in. It uses the array job number (PBS_ARRAY_INDEX) as an index to select the corresponding line in the text file, so each job in the array picks up a different input file (see the short example after this list).
- Submit the array job using the command:
qsub rs_ja
- The job runs as a single job on the queue and gets a single job id number (e.g. 1096738); each of the separate jobs within the array is then given an index (e.g. 1096738[4] for job 4 of the array)
- The qstat information for the array job now tells you how many jobs are in the array, how many are queued and how many are finished.
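As a short illustration of how the input file selection works (the file names here are made up for the example): suppose inputFiles.txt contains the three lines conformer1.com, conformer2.com and conformer3.com. For the job with PBS_ARRAY_INDEX = 2 the sed line in the runscript is equivalent to
sed -n 2p inputFiles.txt
which prints the second line, so in is set to conformer2.com and that job optimises the second conformer.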
Extra information/troubleshooting
- Add tmpspace=400 only for large disk jobs, to ensure you are put on a node with enough disk (see the example after this list)
- Note that this requires you to include maxdisk=400GB in your Gaussian input.
- NOTE: the queuing system does not check disk allocations. When requesting large disk jobs, remember to request all of the processors on a node even if you are not using all of them. For large jobs the maximum disk space you can request is 800GB on the 12 processor nodes.
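For example, a large disk request could look like the following (the exact tmpspace syntax and units should be checked against the current RCS documentation; the 40 processor sizing and the route line are only placeholders):
#PBS -lselect=1:ncpus=40:mem=128000MB:tmpspace=400gb
with the matching keyword in the route section of the Gaussian input:
# MP2/cc-pVTZ freq maxdisk=400GB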
Memory needed to run
- Gaussian is greedy and will exceed the allocated memory
- Each proc needs a Gaussian executable, which takes about 8MW (or 12MW for MP2 frequencies)
- MW is a megaword, the unit in which Gaussian allocates memory
- 1MW is about 8.4MB
- so each proc needs 1*8*8.4 approximately 68MB just to run
- so 12 proc jobs require 12*68=816MB just to run
- so 16 proc jobs require 16*68=1088MB just to run
- so 20 proc jobs require 20*68=1360MB just to run
- so 24 proc jobs require 24*68=1632MB just to run
- so 40 proc jobs require 40*68=2720MB just to run
- so 48 proc jobs require 48*68=3264MB just to run
- so when allocating memory inside the Gaussian job you must reduce the memory by at least this amount
- thus it is best to reduce the memory by about 100MB * no. of processors inside the Gaussian input file (see the worked example below)
- you also need some overhead within the PBS script
- the memory can be given in binary units: for example 251 GB (binary) is really 251000 * 1,048,576 bytes ≈ 264GB (decimal)
- For larger jobs (>50 atoms or >500 basis functions) a good rule of thumb is a minimum of 4GB per processor
- so 20proc is 80GB minimum
- for mp2 frequency and ccsd you should leave enough memory to buffer the large disk files
- so only give the gaussian job 50-70% of the total memory
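As a worked example of the rules above, take the recommended 20 processor job sizing: the PBS request is mem=64000MB; the Gaussian overhead is roughly 100MB * 20 procs = 2000MB, so %mem should be at most about 62000MB, and the recommended %mem=60000MB leaves a further safety margin for PBS overhead. For an MP2 frequency or CCSD job on the same half node, giving Gaussian 50-70% of the 64000MB corresponds to %mem in the region of 32000-45000MB.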
More details if you seem to be having memory or disk issues
- normal jobs
- will need 2*N^2 W of memory; multiply the value in MW by 8.4 to convert to MB (1,048,576 B = 1MB)
- so 300 basis functions will need 180000W =0.18MW =1.5MB in addition to the above requirements
- require 2*O*N^2 W of disk to run, where O = number of occupied orbitals and N = number of basis functions
- MP2 jobs
- work best with %mem and maxdisk defined
- in-core requires N^4/4 W of memory, i.e. N^4/4 divided by 1,000,000 MW
- so 400 basis functions will need 6400MW=53760MB=54GB memory per node, which is unlikely!
- semi-direct requires 2*O*N^2 W of memory and N^3 W of disk (see the quick estimate after this list)
- so N=476 basis functions O=56 occupied orbitals will need
- 25.4MW=214MB of memory
- and 108MW = 906MB of disk (in practice it will need much more, probably around 1800MB of disk per processor!)
- so total memory for MP2 freq 8proc will be
- 12*8*8.4 ≈ 807MB to run, plus 8*214 = 1712MB for the calculation, plus some extra ~500MB, giving about 3019MB ≈ 3.0GB
- Gaussian does not like the GB directive, so give %mem in MB
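As a quick sanity check on the semi-direct example above, the same formulas can be evaluated with a short awk snippet (N and O are just the example values from above, and 8.4 MB per MW is the conversion used throughout this page; the results agree with the numbers above to within rounding):
# Estimate semi-direct MP2 memory (2*O*N^2 W) and disk (N^3 W), converted to MB at 8.4 MB/MW
N=476
O=56
awk -v N=$N -v O=$O 'BEGIN {
    mem_mw  = 2*O*N*N / 1e6;   # memory in megawords
    disk_mw = N*N*N / 1e6;     # disk in megawords
    printf "memory: %.1f MW = %.0f MB\n", mem_mw, mem_mw*8.4
    printf "disk:   %.1f MW = %.0f MB\n", disk_mw, disk_mw*8.4
}'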
Checkpoint and other files
- checkpoint files should have exactly the same name as the input file
- for jobs that may exceed the wall time specify the full path of the checkpoint file, for example
- %chk=/work/phunt/tmp/filename.chk
- this means the checkpoint file will be written into your personal work directory; it may slow the job down
- this is also the reason /work is sometimes very slow on CX1 so only do this as an exception!
Extra links
The Imperial Research Computing Service has an HPC wiki with useful information, including an intro to shell scripting, modules and job management:
The RCS also runs several courses throughout the year, including introductions to Linux, HPC, Python and more advanced topics. Upcoming courses can be viewed from:
Next steps
- Mount
- Alias shortcut for logging in
- Keypair page
- Once you are comfortable and understand the job submission process, the automatic job script which ... can be used
Other information (may be out of date)
- 3.1 CPMD:
- 3.2 DL-POLY:
- https://www.ch.ic.ac.uk/wiki/index.php/Image:Mpirun.sh
- Note: You'll not be able to see the output until the job finishes: the directory /tmp/pb.XXX isn't accessible to you because it is on the private disk of the node running the job.
- To get DLPOLY to terminate before the job hits the walltime limit and is killed, you need to run it through a program called pbsexec, for example:
- pbsexec mpiexec DLPOLY.X
- This will kill DLPOLY 15 minutes before the walltime limit, giving your script time to transfer files back to $work.