Imperial CX1: Instructions and basic concepts of parallel computing

This tutorial is divided into two sections. In the first section, introductions to and available resources for CX1 are listed and classified. Since the Research Computing Service (RCS) team has already developed excellent tutorials on their webpages, this part functions as a guide to the RCS webpages with the necessary supplementary comments. In the second section, basic concepts of parallel computing and explanations of important terms are introduced. The main focus of this section is to help beginners understand how high-performance computing (HPC) clusters work, grounded in daily practice.

This tutorial was initially written between Feb. and Mar. 2022 to be shared within the group for induction and training purposes [1][2]. Special thanks to Mr K. Tallat-Kelpsa, Ms A. Arber, Dr G. Mallia and Prof N. M. Harrison.

Introduction to CX1

CX1 is the old name of the first HPC system that served the whole college. New facilities (known as CX2) were gradually installed and integrated with the old system (as CX3, a rather short-lived domain), while CX1 remains the most popular name, generally referring to the college-owned clusters. To grant a student access to CX1, the group PI can, on behalf of that student, ask the RCS team to add the specified account to the HPC active user mailing list.

Connect to CX1

CX1 is typically accessed via SSH (secure shell). A command line (Linux & macOS), the Windows Subsystem for Linux (Windows 10/11) [3], or an SSH client (such as XShell [4]) can be used. A VPN is needed for off-campus users.

In a Linux command line, use the following command to connect to CX1:

~$ ssh -XY username@login.hpc.ic.ac.uk

P.S. The -XY option (X11 forwarding) can be omitted in most cases, if you do not need a GUI to run the program.

Alternatively, when the VPN service is unstable or even unavailable, it is possible to go through the SSH gateway of the college, which acts as a 'jump host': connect to the gateway with the command below, then type the previous command (ssh to login.hpc.ic.ac.uk) in the gateway's command line to reach CX1.

~$ ssh username@sshgw.ic.ac.uk

Use the scp command to upload / download files; its usage is similar to a combination of ssh and cp. For example, to upload a file:

~$ scp /local/path/file_name username@login.hpc.ic.ac.uk:/path/file_name
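
To download a file, swap the source and destination, and add the -r option to copy a whole directory. As a sketch (the paths below are placeholders):

~$ scp username@login.hpc.ic.ac.uk:/path/file_name /local/path/file_name
~$ scp -r /local/path/dir_name username@login.hpc.ic.ac.uk:/remote/path/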

Usage

The RCS wiki page on ReadTheDocs contains the information needed. A support page, an online clinic, and courses from the Graduate School are also available. To examine the status of CX1, use the RCS status page.

Environment Variables and Disk Space

Use env to list all the environment variables - be careful, the output is huge. Some useful environment variables:

  • ${USER} The user's college account, i.e., the login credential.
  • ${HOME} '/rds/general/user/${USER}/home', or '~', which has 1 TB of disk space for data backups.
  • ${EPHEMERAL} '/rds/general/user/${USER}/ephemeral' Temporary, unlimited disk space where files last for 30 days. Suitable for running calculations (see the sketch below).
  • ${PATH} Paths to executables can be appended to it for quick access. The Environment Modules package (see below) can do this automatically.
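
As a minimal sketch of how these variables are used in practice (the directory name 'my_job' and file names are just examples), a calculation can be staged in the ephemeral space and the results copied back to the home space afterwards:

~$ mkdir -p ${EPHEMERAL}/my_job && cd ${EPHEMERAL}/my_job
~$ cp ${HOME}/input_file .
~$ cp output_file ${HOME}/my_job_backup/    # after the calculation; assumes the backup directory exists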

Software Management

The Environment Modules [5] package is implemented on CX1 to manage computing software (see the second section of this tutorial for an introduction). Basic commands are listed below:

  • module avail List the available modules
  • module load mod_name Load a specific module, 'mod_name'
  • module rm mod_name Remove a specific module, 'mod_name'
  • module list List all the loaded modules in the current environment
  • module help mod_name Check the instructions of the module 'mod_name'
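
For example, to find and load a compiler suite (the module names below are illustrative; run module avail to check the exact names on CX1):

~$ module avail intel
~$ module load intel-suite
~$ module list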

Note: There is a CRYSTAL14 module in the list. For users in NMH's group, the latest CRYSTAL edition is available, so do not use that module.

Job Partition Guide

A hierarchy of jobs is designed for the optimal efficiency of CX1. The current job partition guide is available on the RCS wiki page.

Batch System

The PBS batch system [6] is used on CX1 (see the second section of this tutorial for what a batch system is). Basic PBS commands are listed below:

  • availability Check the availability of computational resources
  • qsub filename.qsub Submit the job 'filename'
  • qstat Check the state of submitted jobs
  • qdel jobID Kill the job with the ID number 'jobID'
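
For instance (the job ID shown is hypothetical):

~$ qsub my_job.qsub
~$ qstat -u ${USER}
~$ qdel 1234567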

To examine the queue status across the whole system, use the RCS status page.

A General Job Submission Script

A general job submission script for CX1 has been developed by the author. See the GitHub repository of CMSG for details. Parameterised software includes: CRYSTAL14/17/23, Quantum ESPRESSO 7, LAMMPS, GROMACS and GULP 6.

Basic Concepts of Parallel Computing

A brief introduction to parallel computing is given in this section, taking CX1, a medium-sized general-purpose cluster, as an example.

Divide a job: Nodes, Processors and Threads

Node
A group of CPUs, possibly with GPUs / coprocessors for acceleration. Memory and input files are shared by the processors in the same node, so a node can be considered an independent computer. Communication between nodes is achieved by an ultra-fast network, which is the bottleneck of modern clusters.
Processor
The unit that handles a 'process', also known as the 'central processing unit', or CPU. Processors in the same node communicate via shared memory.
Thread
A subdivision of a process. Multiple threads of the same process share the resources allocated to that process.

The figure below illustrates the hierarchy of node, processor and thread. Note: the word 'processor' is not a very accurate term; 'process' might be better (I am just too lazy to update that figure). Many modern CPUs support sub-CPU threading, which means the number of logical CPUs is larger than the number of physical CPUs, so it is possible to have multiple threads within one processor. However, it is also possible to use multiple processors for one process, or even for one thread.

[Figure: Job Partition - the hierarchy of nodes, processors and threads]
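
To check how many physical cores and logical CPUs (hardware threads) a machine provides, standard Linux tools can be used, for example:

~$ nproc                                   # number of logical CPUs available
~$ lscpu | grep -E 'Thread|Core|Socket'    # threads per core, cores per socket, sockets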

Multiple processes vs multiple threads

From the figure above, it is not difficult to see the difference between a 'process' and a 'thread': a process is the smallest unit of resource allocation; a thread is part of a process. The idea of a 'thread' was introduced to address the huge difference in speed between CPU and RAM. The CPU is typically several orders of magnitude faster than RAM, so the bottleneck of a process is usually loading the required data from RAM, rather than the computations in the CPU. By using multiple threads in the same process, various branches of the same program can be executed simultaneously. The shared data therefore does not need to be read from RAM multiple times, and the start-up cost for threads is much smaller than for processes.

However, multithreading is not always advantageous. A technical prerequisite is that the program must be developed for multithreaded use. Python (more precisely, CPython with its global interpreter lock), for example, offers only pseudo-multithreading, while Java supports real multithreading. Sometimes multithreading can lead to catastrophic results: since threads share the same resource allocation (CPU, RAM, I/O, etc.), when one thread fails, the whole process fails as well. By contrast, with multiple processes, the other processes are protected if one process fails.

In practice, users can run each process either in serial (i.e., number of threads = 1) or in parallel (i.e., number of threads > 1) on clusters. The former is recommended because resource management is more robust. Besides the problem mentioned above, multithreading might lead to issues such as memory leaks when running programs that are not developed for multithreading or that depend on unsuitable libraries (a well-known issue with libfabric on ARCHER2 was identified in early 2022).

More nodes vs more CPUs

When the allocated memory permits, in my experience, using more CPUs/processes per node is usually a better idea, considering that all nodes have independent memory spaces and inter-node communication goes over the network. It almost always takes longer to coordinate nodes than to coordinate processors within the same node.

The internal coordinator: What is MPI

The Message Passing Interface, or MPI, is a standard for communicating and transferring data between nodes and therefore between distributed memories. It is used via MPI libraries. The most popular implementations include:

  • MPICH [7] - an open-source library
  • Intel MPI [8] - a popular MPICH-based implementation, especially optimised for Intel CPUs
  • OpenMPI [9] - an open-source library
  • OpenMP [10] - Not MPI; parallelisation based on shared memory, so it works only within a single node; can be used for multithreading

In practice, hybrid parallelisation combining MPI and OpenMP to run multithreaded jobs on a cluster is allowed, though sometimes not recommended. The first MPI process (a process, not a node or a processor) is usually responsible for I/O, and the rest are used for parallel computing.

So far, the MPI standard only provides bindings for C/C++ and Fortran, which is a large part of why most parallel scientific software is written in these languages. To launch an executable in parallel, use mpiexec or mpirun.
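
As a minimal sketch (the executable name and the counts are placeholders; launcher options differ slightly between MPI implementations), a pure MPI run and a hybrid MPI + OpenMP run might look like:

~$ mpiexec -n 16 ./my_code.x < input_file          # 16 MPI processes
~$ export OMP_NUM_THREADS=4
~$ mpiexec -n 4 ./my_code.x < input_file           # 4 MPI processes x 4 OpenMP threads each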

Secure your storage: Temporary memory, work directory and home directory

Almost all modern clusters have separate disk spaces for different purposes, namely temporary memory, a work directory and a home directory. This originates, again, from the famous speed differences between CPU, RAM and disk. Two distinct kinds of disks are used to improve overall efficiency and to secure important data:

  • For temporary memory, large and fast (high read/write frequency) disks are used. It is allocated per job request and is not accessible from the login nodes. Everything is erased after the job terminates.
  • For the work directory, large and fast disks are also used. Data stored in the work directory is usually not backed up and, in the case of CX1, will be automatically cleaned after a fixed period of time.
  • For the home directory, mechanical disks with slower read/write speed but better robustness are used. Files in the home space are usually backed up.

On large clusters like ARCHER2 [11], the work directory and the home directory are completely separated, i.e., the home directory is only visible to login nodes, while the work directory is visible to both job and login nodes. Job submission from the home directory is prohibited. On more flexible clusters like Imperial CX1, submitting jobs from the home directory and access to the home directory by job nodes are allowed, but storing temporary files in the home directory during a calculation is still not recommended, because of the potential influence on other files and the reduced overall efficiency. (And it is nothing new for CX1 users to receive an RDS failure news email.)

Set up your environment: What does an application need?

Executable

The binary executables should, theoretically, all be stored in '/usr/bin'. This never happens in practice, unless you are a fanatical fundamentalist of the early Linux releases. To guide your system to the desired executable, you can either laboriously type its absolute path every time you need it, or add the path to the PATH environment variable:

~$ export PATH=${PATH}:path_to_bin

Running any executable in parallel requires MPI to coordinate all the processes/threads, so the path to the MPI executables is also required. Besides, many scientific codes require other specific environment variables, for example paths to linear algebra packages. Read their documentation for further information.

.lib/.a/.o files

When writing a script, you might need some extra packages to do more complex jobs. Those packages are developed by experts in computer science and can be called with a single line of code. The same thing happened when applications like CRYSTAL and ONETEP were developed.

However, scientific computing codes are usually distributed in the form of source code. Source code in FORTRAN/C/C++ needs to be compiled into a binary executable, and there are two options for linking libraries during compilation:

  1. Include the whole package in the executable as long as one of its functions is called, known as linking a 'static lib'.
  2. Only include a 'table of contents' when compiling, known as linking a 'dynamic lib'. The packages needed are stored separately in '.dll/.so' files, making it possible for multiple applications to share the same lib.

Details about compilation are beyond the scope of this post. The point is: when running a dynamically linked application, information must be given to help the code find the libs it needs. This can be specified by:

~$ export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:path_to_lib

For statically linked applications, usually you need not worry about it - but the volume of the compiled executable might make you wonder whether there is an alternative way.
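
To check which shared libraries a dynamically linked executable needs, and whether they can all be found in the current environment, the standard ldd tool can be used (the executable name is a placeholder):

~$ ldd ./my_code.x

Any entry reported as 'not found' indicates a library that is missing from LD_LIBRARY_PATH or the system library directories.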

Conflicts

If multiple applications with similar functions are installed on the system, such as the Intel compilers and GCC, or OpenMPI and MPICH - a common situation on shared computing resources - improper previous settings may lead to the wrong application, or the wrong version, being picked up. To avoid this, the path to the undesired application or lib should be removed from the environment variables.

Environment Modules

Environment Modules [12] is a popular piece of software that manages the necessary environment setup and the conflicts for each application. It can easily add or remove environment variables through commands (such as module load, module rm) and modulefiles written in the Tool Command Language (TCL) [13]. The default directories of modulefiles are given in the environment variable ${MODULEPATH}, but files in other directories can also be loaded via their absolute paths.

Both Imperial CX1 and ARCHER2 adopt this application and offer pre-compiled software through it.
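
To inspect what a module actually does to the environment before loading it, or to load modulefiles from a non-default directory, the following can be used (the names and paths are illustrative):

~$ module show mod_name                        # print the environment changes made by 'mod_name'
~$ module use /path/to/modulefiles             # prepend a directory to ${MODULEPATH}
~$ module load /path/to/modulefiles/mod_name   # load a modulefile by its absolute path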

The external coordinator: What is a batch system

Always bear in mind that computational resources are limited, so you need to request a reasonable amount of resources for your job. Besides, the cluster also needs to account for your budget, coordinate jobs submitted by various users, and make the best use of the available resources. When a job is running, you may also want to check its status. All of this is handled by the batch system.

In practice, a Linux shell script is needed. Parameters for the batch system are set in commented lines at the top of the file. After the user submits the script to the batch system, the system will:

  1. Examine the parameters
  2. Allocate and coordinate the requested resources
  3. Set up the environment, such as environment variables and package dependencies, and sync the same settings to all nodes
  4. Launch a parallel calculation - see the MPI part above
  5. Run any post-processing steps

Note that a 'walltime' is usually required for a batch job, i.e., the maximum time allowed for the job to run. The job will be killed, i.e., terminated, when the elapsed time exceeds the walltime, and the remaining part of the script will not be executed. The timeout command can be used to set a separate time limit for a specific command.
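
As a minimal sketch of a PBS job script for CX1 (the resource numbers, module name, executable and file names are placeholders; whether mpiexec needs an explicit process count depends on the MPI module, so check the RCS documentation):

#!/bin/bash
#PBS -l select=1:ncpus=32:mem=62gb
#PBS -l walltime=24:00:00

module load a_required_module          # illustrative module name
cd ${PBS_O_WORKDIR}                    # move to the directory from which the job was submitted

mpiexec ./my_code.x < input_file > output_file
timeout 1h ./post_process.sh           # optional post-processing with its own time limit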

Common batch systems include PBS and Slurm [14]. On the Imperial cluster CX1 and the MMM Hub Young (managed by UCL) [15], the PBS system is implemented; on ARCHER2 and Tianhe-2 LvLiang (天河二号-吕梁), Slurm is implemented. Tutorials on batch systems are not covered here, since they are heavily tailored to specific machines - modifications are usually made to enhance efficiency. Refer to the specific user documentation for more information.

How to run a job in parallel: Things to consider

Successfully setting up and submitting a batch job script means that you no longer need this tutorial. Before you can do that, some considerations might be important:

  • How large is my system? Is it efficient to use the resources I requested (note that the scaling is not linear... refer to this test on CRYSTAL17)?
  • To which queue should I submit my job? Is it too long/not applicable/not available?
  • Is it safe to use multi-threading?
  • Is it demanding in terms of memory, GPUs, etc.?
  • Roughly how long will it take?
  • What is my budget code? Do I have enough resources?
  • Which MPI release version is my code compatible with? Should I load a module or set variables?
  • Does my code need any other specific environment setup?
  • Do I have any post-processing script to run after the MPI part is finished? How long will it take?

References

<references>