Cluster hardware

Hippo is a 1000-core Ivy Bridge cluster consisting of 50 nodes with 20 cores each. Each node has 64 GB of RAM and 1 TB of local disk space. There is currently 895 TB of usable global storage. Compute nodes have 1 Gb/s network connections, and the full cluster can read from storage simultaneously at that bandwidth.

Note: Documentation here was updated on 3 June 2021 (describing new features introduced with the 14 May 2021 OS upgrade).

Getting an account

To get an account on Hippo, email hippo-admin@googlegroups.com and include your preferred username (something short with no special characters) and email address (please include your UKZN email address if you have one). If you are not a UKZN staff member, please copy your UKZN collaborator (or supervisor, if a student) on the email and include a sentence explaining why you need an account.

When you receive your account, you can access the cluster via ssh to hippo.ukzn.ac.za or 198.54.83.16. Please change your default password immediately using the “passwd” command. Obtaining a Hippo account is quick and easy, so please do not share accounts: account sharing is not permitted under Hippo policy, and users who share accounts will have them disabled.
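
For example, a first login from a terminal might look something like this:

  ssh <username>@hippo.ukzn.ac.za
  passwd        # change your default password straight away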

Note that if you want to access Hippo from on-campus at UKZN (e.g., when connected to the UKZN wifi network), then your UKZN network username needs to be added to a list of authorised users maintained by ICS (due to a change in security policy in July 2021). You will also need to install the GlobalProtect client and use it to connect to the UKZN VPN. You won’t be able to connect to Hippo from the campus network if you do not do both of these steps. If you only intend to access Hippo from off the campus network (e.g., if you are an external collaborator), then you will not need to do this.

fail2ban: For security reasons, Hippo locks out IP addresses that have multiple failed login attempts. If you suddenly find yourself unable to log in, this is likely what has happened. You can try logging in from a different machine or a different network; failing that, the ban is lifted automatically after 24 hours.

Directories and quotas

When your account is created, you will have one home directory (/home/<username>) and one data directory (/data/<username>). Home directories are regularly backed up and should only be used for storing small, important files such as source code. Large data products and code outputs should be stored in /data. For disk usage statistics, check the contents of /home/diskspace and /data/diskspace, or use “df -h” to see how much total space is left. Each user has a quota of 5 TB on /data; note that this goes a long way, as the filesystem is compressed. You can use “lfs quota /data -h” to see how much of your quota is being used.
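
For example, to check usage from the command line:

  df -h /data            # total space used and remaining on the /data filesystem
  lfs quota /data -h     # your own usage against the 5 TB /data quota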

Available software

To find the available software packages on Hippo, either look in /apps or run the “module avail” command (which lists the available modules) or “module spider” (which gives a short description of each module).

You can load modules into your path by running “module load <module name>” or “ml <module name>” at the command line. Note that module names are case-sensitive: “ml Python” will load the default version of Python (3.8.6), but “ml python” will return an error.
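
For example, a short module session might look like this (“module list” is a standard command of the module system, shown here for completeness):

  module avail       # list all installed modules
  module spider      # short description of each module
  ml Python          # load the default Python module (3.8.6)
  python --version   # confirm that Python is now on your path
  module list        # show the modules currently loaded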

If a piece of software appears to be missing or broken when you log in, try the above steps first before writing to your friendly Hippo admins. If you require additional software that is not currently installed and that will be widely used by others, contact the administrators for further assistance. Otherwise, please install it locally in your own directory (under “/data/<username>” is best for large packages such as Anaconda).
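
As a rough sketch of such a local install (using Miniconda as the example; the installer URL and filename are the upstream defaults at the time of writing, not Hippo-specific):

  cd /data/<username>
  wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  bash Miniconda3-latest-Linux-x86_64.sh -b -p /data/<username>/miniconda3
  source /data/<username>/miniconda3/bin/activate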

Some tips for module usage:

  1. For frequently used modules, you may wish to add the corresponding “module load” or “ml” commands to your /home/<username>/.profile file so that they are automatically executed every time you log in.
  2. If you run simultaneous batch jobs that require conflicting software environments, then you should place the “module load” commands within your batch scripts (see next section) instead of /home/<username>/.profile.  You may additionally wish to use the “module purge” command when appropriate (a short sketch of both approaches follows this list).
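
A sketch of both approaches, using Python as the example module:

  # In /home/<username>/.profile, for modules you always want loaded:
  ml Python

  # At the top of a batch script, to guarantee a clean, job-specific environment:
  module purge
  ml Python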

Note that Singularity is available on Hippo (see the available modules), but containers can only be run on the compute nodes, not the login node.
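
A minimal sketch of a batch job that runs a command inside a container (the module name, image path, and command are assumptions, not Hippo-specific values):

  #!/bin/sh
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=1
  #SBATCH --mem=4000
  #SBATCH --time=01:00:00

  ml Singularity        # check "module avail" for the exact module name
  singularity exec /data/<username>/my_image.sif python my_script.py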

Running jobs

Hippo uses slurm for job submissions and resource allocations.  To submit a job to the compute nodes, you first need to write a batch script that looks something like this:

#!/bin/sh
#SBATCH --nodes=<# of nodes>
#SBATCH --ntasks-per-node=<# tasks per node>
#SBATCH --mem=<memory in MB, max is approx. 63000>
#SBATCH --time=<time in HH:MM:SS format, max is 48:00:00>

# Load modules here. As an example, uncomment the following line if your job requires python.
# ml Python

# Commands that you actually want to run go here.
# As a test, you can try uncommenting the following:
# cd /home/<username>
# echo "howdy" > my_first_slurm_job.txt
# echo "Look for my_first_slurm_job.txt in your home directory."

Save the above text to a batch script file. To actually submit the job, use the command “sbatch <batch script name>”. To check the queue status, use the command “squeue”, and to delete a job, use the command “scancel <job ID>” (you can check the queue to find the job IDs).
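
For example, a typical submit-and-monitor session might look like this (the script name and job ID are illustrative):

  sbatch my_job.sh       # submit the job; slurm prints the assigned job ID
  squeue -u <username>   # show only your own jobs in the queue
  scancel 12345          # cancel the job with ID 12345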

If you see the error message “sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying”, this probably means the queue is completely full – try submitting your job later.

You can monitor your running jobs using the new Job Monitor web application. For the moment, you will first need to set up an ssh tunnel:

  ssh -L 9999:h-man1:80 -N username@hippo.ukzn.ac.za

and then point your web browser to:

http://localhost:9999/jobmon/

This tool gives a lot of useful information that you can use to help make more efficient use of the cluster (e.g., you can monitor the RAM usage of your job).

You can request GPUs by adding something like the following to your batch script:

  #SBATCH --gres=gpu
  #SBATCH --gres=gpu:t4
  #SBATCH --gres=gpu:t4:1
  #SBATCH --gres=gpu:1

There are 10 nodes that have Nvidia T4 GPUs installed (one per node).
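
For example, a job needing a single T4 GPU might include the following directives (resource values other than the GPU request are illustrative):

  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=4
  #SBATCH --mem=16000
  #SBATCH --time=12:00:00
  #SBATCH --gres=gpu:t4:1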

You can request local disk space (also known as JOBFS) with, e.g.,

  #SBATCH --tmp=20g

with the default value being 100 MB. Usage is limited to the --tmp request size. Most nodes have 800 GB of requestable local disk. Local disk is good for high-I/O tasks that may be slow when using the global Lustre file system. The environment variable $JOBFS gives the path to a directory on local disk. Note that /tmp counts as part of the JOBFS --tmp request.
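
A sketch of how a job might use its local disk allocation (the file and program names are illustrative):

  #SBATCH --tmp=20g

  # Copy input data to fast local disk, work there, then copy results back.
  cp /data/<username>/input.dat $JOBFS/
  cd $JOBFS
  ./my_io_heavy_program input.dat
  cp results.dat /data/<username>/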

If you wish to run a job that does not use the full complement of 20 cores on each node, please do not set your slurm job to use the maximum amount of memory. Many people have jobs that do not need much memory, and requesting the maximum of 64 GB while not using all of the cores on a node means that the remaining cores will sit idle until your job finishes. To ensure the most efficient use of Hippo, please check how much memory your job actually requires and request only that.
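
For example, a job that only needs 10 of the 20 cores on a node could request roughly half the node's memory (or less, if it needs less), leaving the remaining cores usable by other jobs:

  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=10
  #SBATCH --mem=30000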

The current wall clock limit on jobs is 48 hours, so do not try to request a longer run time in your script; otherwise, your job will simply hang in the queue.  For longer jobs, make sure your code checkpoints its results (most widely used software packages can be instructed to do this), and restart jobs from the checkpoint files.  The head node can be used for interactive debugging and short code runs, but do not run long jobs on the head node.

For code that uses /dev/shm, usage is accounted against the --mem request of the job, i.e., if you ask for --mem=1g and try to write 1.1g of files to /dev/shm, then the write will fail. Files written to /dev/shm are automatically cleaned up at the end of each job.
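
As an illustrative sketch, a job that stages a small file in /dev/shm needs a --mem request large enough to cover it (the filename and sizes are illustrative):

  #SBATCH --mem=2000

  cp /data/<username>/lookup_table.dat /dev/shm/   # counts against the 2000 MB --mem request
  ./my_program /dev/shm/lookup_table.dat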

To get an interactive session on a compute node, you can use the convenient “sinteractive” command.

Contact and support

Email hippo-admin@googlegroups.com for support, and cc this address on all replies so that we can keep consistent records. Please do not email individual admins (e.g., Robin) directly.

Publications and acknowledgements

Please include something like the following text in publications resulting from usage of Hippo:

“Computations were performed on Hippo at the University of KwaZulu-Natal.”