Cluster hardware

Hippo is a 1000-core Ivy Bridge cluster that consists of 50 nodes with 20 cores each.  Each node has 64 GB of RAM and 1 TB of local disk space.  There is currently 895 TB of usable global storage.  Compute nodes have 1 Gb/s network connections, and the full cluster can read from storage simultaneously at that bandwidth.

Getting an account

To get an account on Hippo, email hippo-admin@googlegroups.com with your preferred username and email address.  Once your account has been created, you can access the cluster via ssh to hippo.ukzn.ac.za or 146.230.128.25.  Please change your default password immediately using the “passwd” command.  Obtaining a Hippo account is quick and easy, so please do not share accounts: account sharing is against Hippo policy, and users who share accounts will have them disabled.
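
As an example, and assuming an OpenSSH client on your own machine, the following two commands connect to the cluster and change the password (replace <username> with the username you were given):

# Connect to the head node (the hostname and the IP address are equivalent).
ssh <username>@hippo.ukzn.ac.za

# Once logged in, change your default password.
passwd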

fail2ban: For security reasons, Hippo locks out IP addresses that have multiple failed login attempts. If you suddenly find yourself unable to log in, this is likely what has happened. Either try logging in from a different machine or a different network; failing that, the ban is lifted automatically after 24 hours.

Directories and quotas

When your account is created, you will have one home directory (/home/<username>) and one data directory (/data/<username>).  Home directories are regularly backed up and should only be used for storing small, important files such as source code.  Large data products or code outputs should be stored in /data.  For disk usage stats, check the contents of /home/diskspace and /data/diskspace, or use “df -h” to find out how much total space is left.  Each user has a quota of 5 TB on /data – note that this goes a long way, as the filesystem is compressed. You can use “lfs quota /data -h” to see how much of your quota is being used.
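
As a quick sketch, the two commands mentioned above can be run from any login shell:

# How much total space is left on the shared filesystems.
df -h

# How much of your own 5 TB quota on /data is in use.
lfs quota /data -h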

Available software

To find the available software packages on Hippo, either look in /apps or run the “module avail” command.  You can load modules into your path by running “module load <module name>” at the command line.  If a piece of software appears to be missing or broken when you log in, try the above steps first before writing to your friendly Hippo admins.  If you require additional software that isn’t currently installed and that will be widely used by others, contact the administrators for further assistance.  Otherwise, please install the software locally in your own directory.
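
For example, the following commands list what is installed, load a package (python is used here purely as a placeholder for whatever you need), and confirm what is currently loaded:

# List every software package available as a module.
module avail

# Add a package to your environment for this session.
module load python

# Show which modules are currently loaded.
module list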

Some tips for module usage:

  1. For frequently used modules, add the corresponding “module load” commands to your /home/<username>/.profile file so that they are executed automatically every time you log in.
  2. If you run simultaneous batch jobs that require conflicting software environments, put the “module load” commands in your batch scripts (see next section) instead of /home/<username>/.profile, and consider using the “module purge” command where appropriate (see the sketch after this list).
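
A minimal sketch of both approaches, with python again standing in for a module you actually use:

# In /home/<username>/.profile: load frequently used modules at every login.
module load python

# In a batch script: start from a clean environment, then load only
# what this particular job needs.
module purge
module load python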

Running jobs

Hippo uses slurm for job submissions and resource allocations.  To submit a job to the compute nodes, you first need to write a batch script that looks something like this:

#!/bin/sh
#SBATCH --nodes=<# of nodes>
#SBATCH --ntasks-per-node=<# tasks per node>
#SBATCH --mem=<memory in MB, max is 64000>
#SBATCH --time=<time in HH:MM:SS format, max is 24:00:00>

# Load modules here.  As an example, uncomment the following line
# if your job requires python.
# module load python

# Commands that you actually want to run go here.
# As a test, you can try uncommenting the following:
# cd /home/<username>
# echo "howdy" > my_first_slurm_job.txt
# echo "Look for my_first_slurm_job.txt in your home directory."

Save the above text to a batch script file.  To submit the job, use the command “sbatch <batch script name>”.  To check the queue status, use the command “squeue”, and to delete a job, use the command “scancel <job ID>” (job IDs are shown in the queue listing).
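
A typical submission session might look like the following, assuming the script above was saved as my_job.sh (any filename works) and that 12345 is the job ID slurm reports:

# Submit the batch script; slurm prints the job ID on success.
sbatch my_job.sh

# Check the queue; the ST column shows R for running and PD for pending.
squeue

# Cancel the job using the ID reported by sbatch or shown by squeue.
scancel 12345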

If you see the error message “sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying”, this probably means the queue is completely full – try submitting your job later.

If you wish to run a job that does not use the full complement of 20 cores on a node, please do not request the maximum amount of memory. Many jobs do not need much memory, and requesting the maximum of 64 GB while leaving some of a node's cores unused means those cores will sit idle until your job finishes. To keep Hippo usage efficient, please check how much memory your job actually requires and request only that.
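
For example, a hypothetical job that needs only 4 cores and roughly 8 GB of memory could request just that (the numbers are illustrative), leaving the rest of the node free for other users:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=8000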

The current wall clock limit on jobs is 48 hours, so do not request a longer run time in your script; otherwise, your job will simply hang in the queue.  For longer jobs, make sure your code checkpoints its results (most widely used software packages can be instructed to do this) and restart jobs from the checkpoint files.  The head node can be used for interactive debugging and short code runs, but please do not run long jobs there.
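
One possible way to chain restarts, sketched here with slurm job dependencies and an illustrative job ID (the checkpointing itself is handled by your own code or software package):

# Submit the first leg of a long run; suppose slurm reports job ID 12345.
sbatch my_job.sh

# Queue a second leg that starts only after the first finishes successfully
# and resumes from the checkpoint files the first leg wrote.
sbatch --dependency=afterok:12345 my_job.sh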

Contact and support

Email hippo-admin@googlegroups.com for support.  Please cc this address on all replies so that we can keep consistent records, and do not email individual admins (e.g., Robin) directly.

Publications and acknowledgements

Please include something like the following text in publications resulting from usage of Hippo:

“Computations were performed on Hippo at the University of KwaZulu-Natal.”