Shark CentOS Slurm - User guide
- Shark CentOS Slurm - User guide
- Contact
- HPC-Linux team
- Cluster overview:
- Cluster hardware - Network overview
- Hardware overview
- Rules
- How to get a Shark cluster account
- Shark cluster introduction course
- How to connect to the login node / hpc cluster
- Module environment
- Compiling programs
- Workload manager: Slurm
- Remote GPU-accelerated visualization on res-hpc-lo02
- More Slurm info
- Comparison between SGE and Slurm
- Working with sensitive data on the Shark cluster
- Data storage / access
- Applications
- Python
- Open OnDemand [OOD]
Contact
For accounts, storage, requests, anything related to the cluster, use the Topdesk Self-Service Portal:
Topdesk Self-Service Portal [Use Self-Service Portal]
HPC-Linux team
General email: ITenDI_Infra-Linux@lumc.nl
Name | Location | Email |
---|---|---|
John Berbee | D-01-133 | J.A.M.Berbee@lumc.nl |
Tom Brusche | D-01-128 | T.A.W.Brusche@lumc.nl |
Michel Villerius | D-01-133 | M.P.Villerius@lumc.nl |
Pieter van Vliet | D-01-133 | P.Y.B.van_Vliet@lumc.nl |
Cluster overview:
- OS (Linux): CentOS 8
- Workload manager: Slurm version 20.02
Cluster hardware - Network overview
Hardware overview
Hostname | IP address | CPU | Cores | Memory | GPUs | Purpose | Machine type | Chassis / Slot |
---|---|---|---|---|---|---|---|---|
res-hpc-lo01 | 145.88.76.243 | Intel E5-2660 | 32 | 128Gb | 0 | Login node | Dell PowerEdge M620 | |
res-hpc-lo02 | 145.88.76.217 | Intel Xe 6248 | 80 | 128Gb | 1 | Login node + Rem Vis* | Dell PowerEdge R740 | |
res-hpc-exe007 | 145.88.76.220 | Intel E5-2697 | 24 | 384Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-exe008 | 145.88.76.224 | Intel E5-2697 | 24 | 384Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-exe009 | 145.88.76.222 | Intel E5-2697 | 24 | 384Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-exe010 | 145.88.76.221 | Intel E5-2690 | 24 | 384Gb | 0 | Execution node | Dell PowerEdge M630 | |
res-hpc-exe011 | 145.88.76.223 | Intel E5-2697 | 24 | 384Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-exe012 | 145.88.76.233 | Intel E5-2690 | 24 | 384Gb | 0 | Execution node | Dell PowerEdge M630 | |
res-hpc-exe013 | 145.88.76.247 | Intel E5-2670 | 16 | 128Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-exe014 | 145.88.76.242 | Intel E5-2697 | 24 | 384Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-exe015 | 145.88.76.235 | Intel E5-2660 | 16 | 96Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-exe016 | 145.88.76.236 | Intel E5-2660 | 16 | 128Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-exe019 | 145.88.76.239 | Intel E5-2660 | 16 | 128Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-exe020 | 145.88.76.229 | Intel E5-2697 | 24 | 192Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-exe021 | 145.88.76.228 | Intel E5-2660 | 16 | 96Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-exe022 | 145.88.76.227 | Intel E5-2697 | 24 | 192Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-exe023 | 145.88.76.225 | Intel E5-2697 | 24 | 192Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-exe024 | 145.88.76.232 | Intel E5-2690 | 24 | 128Gb | 0 | Execution node | Dell PowerEdge M630 | |
res-hpc-exe025 | 145.88.76.213 | Intel E5-2670 | 16 | 96Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-exe027 | 145.88.76.209 | Intel E5-2690 | 28 | 128Gb | 0 | Execution node | Dell PowerEdge M630 | |
res-hpc-exe028 | 145.88.76.212 | Intel E5-2670 | 16 | 96Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-exe029 | 145.88.76.210 | Intel E5-2690 | 28 | 384Gb | 0 | Execution node | Dell PowerEdge M630 | |
res-hpc-exe030 | 145.88.76.215 | Intel E5-2697 | 24 | 384Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-exe031 | 145.88.76.214 | Intel E5-2697 | 24 | 512Gb | 0 | Execution node | Dell PowerEdge M620 | |
res-hpc-gpu01 | 145.88.76.237 | Intel 8160 | 48 | 512Gb | 3 | GPU node | Dell PowerEdge R740 | |
res-hpc-gpu02 | 145.88.76.234 | Intel 8160 | 48 | 512Gb | 3 | GPU node | Dell PowerEdge R740 | |
res-hpc-lkeb03 | 145.88.76.226 | Intel E5-2698 | 40 | 256Gb | 4 | GPU node | NVIDIA DGX Station | |
res-hpc-lkeb04 | 145.88.76.248 | Intel E5-1650 | 12 | 256Gb | 4 | GPU node | Asus X99-E-10G WS | |
res-hpc-lkeb05 | 145.88.76.244 | Intel Xe 6134 | 16 | 256Gb | 3 | GPU node | Dell Precision 7920 | |
res-hpc-lkeb06 | 145.88.76.238 | Intel Xe 6234 | 16 | 256Gb | 3 | GPU node | Dell Precision 7920 | |
res-hpc-mem01 | 145.88.76.230 | Intel E7-4890 | 60 | 3Tb | 0 | High mem | Dell PowerEdge R920 | |
res-hpc-mem02 | 145.88.76.218 | Intel E5-4657L | 48 | 1Tb | 0 | High mem | Dell PowerEdge M820 | |
res-hpc-mem03 | 145.88.76.216 | Intel E5-4657L | 48 | 1Tb | 0 | High mem | Dell PowerEdge M820 | |
res-hpc-ma01 | 145.88.76.246 | Intel E5-2697 | 2 | 4Gb | 0 | Controller node 1 | VM: VMware | |
res-hpc-ma02 | 145.88.76.249 | Intel E5-4650 | 1 | 4Gb | 0 | Controller node 2 | VM: VMware | |
res-hpc-db01 | 145.88.76.245 | Intel E5-2697 | 1 | 4Gb | 0 | Slurm DB node | VM: VMware | |
res-hpc-ood01 | 145.88.76.231 | Intel E5-4650 | 2 | 4Gb | 0 | OpenOnDemand portal | VM: VMware |
- Rem Vis = Remote Visualization
GPU overview
Hostname | GPU 0 (cores/mem) | GPU 1 (cores/mem) | GPU 2 (cores/mem) | GPU 3 (cores/mem) |
---|---|---|---|---|
res-hpc-lo02 | Tesla T4 (2560/16Gb) | |||
res-hpc-gpu01 | TITAN Xp (3840/12Gb) | TITAN Xp (3840/12Gb) | TITAN Xp (3840/12Gb) | |
res-hpc-gpu02 | TITAN Xp (3840/12Gb) | TITAN Xp (3840/12Gb) | TITAN Xp (3840/12Gb) | |
res-hpc-lkeb01 | Tesla K40c (2880/12Gb) | |||
res-hpc-lkeb02 | TITAN X (Pascal) (3584/12Gb) | |||
res-hpc-lkeb03 | Tesla V100-DGXS (5120/16Gb) | Tesla V100-DGXS (5120/16Gb) | Tesla V100-DGXS (5120/16Gb) | Tesla V100-DGXS (5120/16Gb) |
res-hpc-lkeb04 | GeForce GTX 1080 Ti (3584/12Gb) | GeForce GTX 1080 Ti (3584/12Gb) | GeForce GTX 1080 Ti (3584/12Gb) | GeForce GTX 1080 Ti (3584/12Gb) |
res-hpc-lkeb05 | Quadro RTX 6000 (4608/24Gb) | Quadro RTX 6000 (4608/24Gb) | Quadro RTX 6000 (4608/24Gb) |
Rules
- Always log in/connect to the login node res-hpc-lo01.researchlumc.nl or res-hpc-lo02.researchlumc.nl
- Always use the workload manager (Slurm) to run/submit jobs or to work interactively
- Never run a job outside the workload manager (Slurm), neither on the login node nor on the execution nodes
- Never run (heavy) calculations on the login node; do this on a compute node
How to get a Shark cluster account
- To get a Shark account you first need to have some basic Linux knowledge.
- A basic Linux introduction course is given once a month as the LUMC practical Linux course. Without basic Linux knowledge you cannot work on the Shark cluster.
- Create a Self-Service Desk call.
- A default cluster account will be created, and you will receive an email with your RESEARCHLUMC/username and password.
Shark cluster introduction course
- You can follow a Shark introduction course, in which you will receive basic information on how to start using an HPC cluster.
Schedule 2020
Date | Time | Room | Room size | Seats left |
---|---|---|---|---|
2020 | Canceled until further notice due to coronavirus (SARS-CoV-2) |
Presentations
How to connect to the login node / hpc cluster
From a Linux workstation
You are free to choose your Linux distribution. Some commonly used distributions:
- CentOS 8
- Debian 10.6.0
- Red Hat 8 (commercial license required)
- Arch Linux (rolling distribution)
From the command line (ssh)
If your login user name from your workstation is the same as the username on the HPC cluster, you can use:
- ssh res-hpc-lo01.researchlumc.nl
or
- ssh res-hpc-lo02.researchlumc.nl
Otherwise, specify your cluster username explicitly:
- ssh username@res-hpc-lo01.researchlumc.nl
or
- ssh username@res-hpc-lo02.researchlumc.nl
You can make your life easier by editing the file:
vi ~/.ssh/config
Host res-hpc-lo01
Hostname 145.88.76.243
User user-name
ServerAliveInterval 60
Host res-hpc-lo02
Hostname 145.88.76.217
User user-name
ServerAliveInterval 60
Adapt user-name to your own cluster username.
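With this configuration in place, a short host alias is enough; the username and keep-alive settings are taken from the config file:
- ssh res-hpc-lo01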
X11 forwarding
You can show graphical output when you enable X11 forwarding
- ssh -X res-hpc-lo01.researchlumc.nl
or
- ssh -X res-hpc-lo02.researchlumc.nl
or
- ssh -Y res-hpc-lo01.researchlumc.nl
or
- ssh -Y res-hpc-lo02.researchlumc.nl
Once you are logged in, you should be able to run a graphical program, for example:
- xterm
- xclock
- xeyes
A remote desktop
Install the X2Go client:
- CentOS/Fedora/Red Hat: yum install x2goclient
- Ubuntu/Debian: apt-get install x2goclient
- Arch Linux: pacman -S x2goclient
X2Go enables you to access a graphical desktop of a remote computer over the network. The protocol is tunneled through SSH, so it is encrypted.
Go to Session and create a new session:
For the Host:
- res-hpc-lo01.researchlumc.nl
or
- res-hpc-lo02.researchlumc.nl
For the Session type:
- XFCE
- ICEWM
- MATE [only for res-hpc-lo02.researchlumc.nl]
After you have created a new session, select it to connect and log in.
Working from home with the SSH proxy server
If you are working from home or from outside the LUMC (network), you can use the SSH proxy server to connect to the cluster.
- IP address: 145.88.35.10
- Hostname: res-ssh-alg01.researchlumc.nl
With the X2Go client, you have to enable the following options:
- Use Proxy server for SSH connection
- Type: SSH
- Same login as on X2Go Server
- Same password as on X2Go Server
- Host: 145.88.35.10
- Port: 22
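If you prefer a plain terminal instead of X2Go when working from home, you can jump through the same proxy server with OpenSSH's -J option (a minimal sketch, assuming OpenSSH 7.3 or newer and that user-name is your cluster account):
- ssh -J user-name@145.88.35.10 user-name@res-hpc-lo01.researchlumc.nl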
From a Windows workstation
From the command line (ssh)
A simple SSH terminal client is PuTTY. Download the client from the PuTTY homepage.
Direct client download: PuTTY
Once you have started the PuTTY program, you will see:
Fill in the Host Name (or IP address): res-hpc-lo01.researchlumc.nl or res-hpc-lo02.researchlumc.nl
At the Connection settings, fill in 60 at "Seconds between keepalives". If needed, enable "Enable TCP keepalives".
At the X11 settings, you can enable "Enable X11 forwarding". You want this if you need graphical output, but to get this working you need to install a separate X11 server for Windows. It is better and easier to install MobaXterm, which is explained below.
Now press Open to connect to the login node.
If you connect for the first time to the login node, you will see a host key warning. Press Yes to continue.
Log in with your user name and password.
Now you are logged in and you can start using the cluster.
If you click the PuTTY icon in the left corner of the terminal window, you get multiple options, such as:
- New Session
- Duplicate Session
- Change Settings
- Copy All to Clipboard
- Clear Scrollback
- Reset Terminal
You can give your session a name so that you can save it and reuse it later.
Give your session a name, for example "res-hpc-lo01", and press Save.
Later you can load your saved session: select it from the saved sessions list and press Load.
MobaXterm
With MobaXterm, you have a built-in SSH shell/terminal and an embedded X11 server. With this, you don't have to worry about setting up a separate X11 server to show graphical output on your Windows workstation.
Go to the MobaXterm website: MobaXterm
Here is the direct download link: Download
Choose the MobaXterm Home Edition v20.3 and install it on your Windows workstation. Once you have installed it, you can start it and you have to create an SSH session:
Choose SSH as the session type and press OK
Fill in the Remote host and Specify username
All the default settings should be ok, but here you can check your Advanced SSH settings
The Terminal settings
Network settings
Press OK to start the connection
In the left panel you can have multiple sessions saved
You can also add, for example, an "SFTP" session, so that you can easily transfer files to and from your workstation.
By right-clicking on one of your sessions you can, for example, edit its session settings
Here you can also configure the SSH jump host, if needed
A remote desktop
For the Windows version, we can use the same client as we used with Linux:
- X2Go
Install the X2Go client
The setup is already described with the Linux client.
Module environment
When working on the cluster, you will probably need to load the correct module, which sets up the environment for your library, compiler or program.
Below are some useful commands and examples:
- List all available modules
module av
----------------------------------------------------------------------------------------------- /share/modulefiles ------------------------------------------------------------------------------------------------
container/singularity/3.5.3/gcc.8.3.1 library/cudnn/9.2/cudnn neuroImaging/fsl/5.0.11
cryogenicEM/chimera/1.14/gcc-8.3.1 library/fltk/1.3.5/gcc-8.3.1 neuroImaging/fsl/5.0.9
cryogenicEM/ctffind/4.1.13/gcc-8.3.1 library/ftgl/2.1.3/gcc-8.3.1 neuroImaging/fsl/6.0.0
cryogenicEM/eman2/2.31 library/gdal/2.4.4/gcc-8.3.1 neuroImaging/fsl/6.0.1
cryogenicEM/gctf/1.06 library/gdal/3.0.4/gcc-8.3.1 neuroImaging/fsl/6.0.2
cryogenicEM/imod/4.9.12 library/htslib/1.10.2/gcc-8.3.1 neuroImaging/fsl/6.0.3
cryogenicEM/motioncor2/1.31 library/java/OpenJDK-11.0.2 neuroImaging/fsl/fix/1.06.12
cryogenicEM/relion/3.0.8/gcc-8.3.1 library/java/OpenJDK-12.0.2 neuroImaging/mrtrix/3.0.0/gcc-8.3.1
cryogenicEM/relion/3.1-beta/gcc-8.3.1 library/java/OpenJDK-13.0.2 pharmaceutical/PsN/4.9.0
cryogenicEM/resmap/1.1.4 library/java/OpenJDK-14.0.1 (D) pharmaceutical/nonmem/7.4.4/gcc-8.3.1
genomics/hmmer/openmpi-3.1.5/3.3/gcc-8.3.1 library/lapack/3.9.0/gcc-8.3.1 pharmaceutical/pirana/2.9.7
genomics/ngs/bcftools/1.10.2/gcc-8.3.1 library/mpi/mpich/3.3.2/gcc-8.3.1 pharmaceutical/pirana/2.9.8 (D)
genomics/ngs/bcl2fastq/2.20.0 library/mpi/openmpi/3.1.5/gcc-8.3.1 statistical/MATLAB/R2016b
genomics/ngs/bedtools2/2.29.1/gcc-8.3.1 library/mpi/openmpi/4.0.2/gcc-8.3.1 statistical/MATLAB/R2018b
genomics/ngs/bwa/0.7.17/gcc-8.3.1 library/mpi/openmpi/4.0.3/gcc-8.3.1 (L) statistical/MATLAB/R2019b
genomics/ngs/samtools/1.10/gcc-8.3.1 library/pmi/openpmix/2.2.3/gcc-8.3.1 statistical/MATLAB/v93/MCR2017b
genomics/ngs/shapeit4/4.1.3/gcc-8.3.1 library/pmi/openpmix/3.1.4/gcc-8.3.1 statistical/MATLAB/v97/MCR2019b
genomics/ngs/vcftools/0.1.16/gcc-8.3.1 library/sparsehash/2.0.3/gcc-8.3.1 statistical/R/3.4.4/gcc.8.3.1
graphics/gnuplot/5.2.8/gcc-8.3.1 library/wxwidgets/3.1.3/gcc-8.3.1 statistical/R/3.5.3/gcc.8.3.1
graphics/graphicsmagick/1.3.35/gcc-8.3.1 mathematical/octave/5.2.0/gcc-8.3.1 statistical/R/3.6.2/gcc.8.3.1
gwas/depict/1.rel194 mathematical/octave/libs/SuiteSparse/5.7.2/gcc-8.3.1 statistical/R/4.0.2/gcc.8.3.1
gwas/plink/1.07 mathematical/octave/libs/arpack/3.7.0/gcc-8.3.1 statistical/RStudio/1.2.5033/gcc-8.3.1
gwas/plink/1.90b6.17 mathematical/octave/libs/gl2ps/1.4.2/gcc-8.3.1 system/go/1.13.7
gwas/plink/1.90p mathematical/octave/libs/glpk/4.65/gcc-8.3.1 system/hwloc/1.11.13/gcc-8.3.1
gwas/plink/2.00a3LM (D) mathematical/octave/libs/qhull/8.0.0/gcc-8.3.1 system/hwloc/2.1.0/gcc-8.3.1
library/blas/0.3.10/gcc-8.3.1 mathematical/octave/libs/qrupdate/1.1.2/gcc-8.3.1 system/knem/1.1.3/gcc-8.3.1
library/boost/1.72.0/gcc-8.3.1 mathematical/octave/libs/sundials/5.3.0/gcc-8.3.1 system/python/2.7.17
library/cuda/10.0/gcc.8.3.1 medicalImaging/minc-stuffs/0.1.25/gcc-8.3.1 system/python/3.7.6
library/cuda/10.1/gcc.8.3.1 medicalImaging/minc-toolkit-v2/1.9.17/gcc-8.3.1 system/python/3.8.1 (D)
library/cuda/10.2/gcc.8.3.1 medicalImaging/minc2-simple/2.2.30/gcc-8.3.1 system/qt/5.14.2/gcc-8.3.1
library/cuda/7.5/gcc.8.3.1 medicalImaging/pydpiper/2.0.9 system/swi-prolog/8.2.0
library/cuda/8.0/gcc.8.3.1 medicalImaging/pydpiper/2.0.14 (D) tools/biomake/0.1.5
library/cuda/9.0/gcc.8.3.1 medicalImaging/pyminc/0.52 tools/cmake/3.11.4
library/cuda/9.1/gcc.8.3.1 neuroImaging/Elastix/5.0.0/gcc-7.4.0 tools/gitlab-runner/12.8.0
library/cuda/9.2/gcc.8.3.1 neuroImaging/FSLeyes/0.32.3 tools/jupyterlab/4.3.1
library/cudnn/10.0/cudnn neuroImaging/SimpleElastix/0.10.0/python3.6.8 tools/luarocks/3.3.1/gcc-8.3.1
library/cudnn/10.1/cudnn neuroImaging/freesurfer/stable-pub-v6.0.0.patched tools/miniconda/python2.7/4.7.12
library/cudnn/10.2/cudnn neuroImaging/freesurfer/7.1.0 (D) tools/miniconda/python3.7/4.7.12
library/cudnn/9.0/cudnn neuroImaging/fsl/5.0.10 tools/websockify/0.9.0
-------------------------------------------------------------------------------------- /usr/share/lmod/lmod/modulefiles/Core --------------------------------------------------------------------------------------
lmod settarg
Where:
D: Default Module
Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
- Load a module
module load library/mpi/openmpi/4.0.2/gcc-8.3.1
- Show loaded modules
module li
Currently Loaded Modules:
1) library/mpi/openmpi/4.0.2/gcc-8.3.1
- Delete one module
module del library/mpi/openmpi/4.0.2/gcc-8.3.1
- Purge all modules
module purge
Compiling programs
We are going to compile a very simple MPI (Message Passing Interface) program; MPI programs are quite common on a cluster.
- vi hello.c
#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
    int id, np, i;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int processor_name_len;

    /* Initialize MPI and get the rank, the number of processes and the host name */
    MPI_Init (&argc, &argv);
    MPI_Comm_size (MPI_COMM_WORLD, &np);
    MPI_Comm_rank (MPI_COMM_WORLD, &id);
    MPI_Get_processor_name (processor_name, &processor_name_len);

    /* Every process prints one greeting line */
    for (i = 1; i < 2; i++)
        printf ("Hello world from process %03d out of %03d, processor name %s\n", id, np, processor_name);

    MPI_Finalize ();
    return 0;
}
If you are going to compile this program without the correct loaded module(s), you would see something like this:
$ module li
No modules loaded
$ gcc hello.c -o hello
hello.c:2:10: fatal error: mpi.h: No such file or directory
#include <mpi.h>
^~~~~~~
compilation terminated.
So we need to load the correct module and use the correct compiler:
$ module add library/mpi/openmpi/4.0.2/gcc-8.3.1
- mpicc hello.c -o hello
Handy reference:
Language | C | C++ | Fortran77 | Fortran90 | Fortran95 |
---|---|---|---|---|---|
Command | mpicc | mpiCC | mpif77 | mpif90 | mpif95 |
- ./hello
Hello world from process 000 out of 001, processor name res-hpc-lo01.researchlumc.nl
Here you can see that we ran the program on only 1 CPU core (which is the same as running: mpirun -np 1 ./hello, where -np is the number of processes to launch).
To make use of the MPI capabilities of the program, we have to run it with "mpirun", which comes with the loaded module library/mpi/openmpi/4.0.2/gcc-8.3.1:
- mpirun ./hello
Hello world from process 003 out of 016, processor name res-hpc-lo01.researchlumc.nl
Hello world from process 006 out of 016, processor name res-hpc-lo01.researchlumc.nl
Hello world from process 013 out of 016, processor name res-hpc-lo01.researchlumc.nl
Hello world from process 015 out of 016, processor name res-hpc-lo01.researchlumc.nl
Hello world from process 000 out of 016, processor name res-hpc-lo01.researchlumc.nl
Hello world from process 005 out of 016, processor name res-hpc-lo01.researchlumc.nl
Hello world from process 010 out of 016, processor name res-hpc-lo01.researchlumc.nl
Hello world from process 011 out of 016, processor name res-hpc-lo01.researchlumc.nl
Hello world from process 012 out of 016, processor name res-hpc-lo01.researchlumc.nl
Hello world from process 002 out of 016, processor name res-hpc-lo01.researchlumc.nl
Hello world from process 004 out of 016, processor name res-hpc-lo01.researchlumc.nl
Hello world from process 007 out of 016, processor name res-hpc-lo01.researchlumc.nl
Hello world from process 001 out of 016, processor name res-hpc-lo01.researchlumc.nl
Hello world from process 008 out of 016, processor name res-hpc-lo01.researchlumc.nl
Hello world from process 009 out of 016, processor name res-hpc-lo01.researchlumc.nl
Hello world from process 014 out of 016, processor name res-hpc-lo01.researchlumc.nl
Now the program uses all 16 cores of the local machine (which is the same as: mpirun -np 16 ./hello).
Workload manager: Slurm
The workload manager Slurm is installed on the cluster. This is a resource manager and a scheduler all in one.
- resource manager: are there free resources (memory, CPUs, GPUs, etc.) on the nodes for the job?
- scheduler: when to run the job?
Slurm commands
User commands
Command | Info |
---|---|
salloc | Obtain a Slurm job allocation (a set of nodes), execute a command, and then release the allocation when the command is finished |
sbatch | Submit a batch script to Slurm |
scancel | Used to signal jobs or job steps that are under the control of Slurm |
scontrol | Used to view and modify Slurm configuration and state |
sinfo | View information about Slurm nodes and partitions |
squeue | View information about jobs located in the Slurm scheduling queue |
srun | Run parallel jobs |
Accounting info
Command | Info |
---|---|
sacct | Displays accounting data for all jobs and job steps in the Slurm job accounting log or Slurm database |
sstat | Display various status information of a running job/step |
sacctmgr | Used to view and modify Slurm account information |
sreport | Generate reports from the slurm accounting data |
Scheduling info
Command | Info |
---|---|
sprio | View the factors that comprise a job's scheduling priority |
sshare | Tool for listing the shares of associations to a cluster |
Partitions (called queues in SGE)
Available partitions
Partition name | Nodes | Default | DefMemPerCPU (MB) | DefaultTime | Remark |
---|---|---|---|---|---|
all | res-hpc-exe[007-031] | Yes | 2048 | 1:00:00 | Default partition |
gpu | res-hpc-gpu[01-02] | No | 2048 | 1:00:00 | Only for GPU/CUDA calculations |
highmem | res-hpc-mem[01-03] | No | 2048 | 1:00:00 | For memory intensive applications |
LKEBgpu | res-hpc-lkeb[01-05] | No | 2048 | - | Only for GPU/CUDA calculations |
short | res-hpc-gpu[01-02] | No | 2048 | - | max 60 cores, 1 hour walltime, for non-GPU calculations |
A job is always submitted to a partition (queue). By default, jobs are submitted to the "all" partition unless you specify a different partition.
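For example, to send a job to a different partition you can pass the -p/--partition option on the command line or put it in your batch script (a sketch; myjob.slurm is a placeholder for your own script):
- sbatch -p highmem myjob.slurm
or, inside the script:
#SBATCH --partition=highmem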
The following commands are useful:
- sinfo
- sinfo -a
- sinfo -l
- sinfo -N -l
[user@res-hpc-lo01 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 2 mix res-hpc-exe[014,019]
all* up infinite 1 alloc res-hpc-exe018
all* up infinite 4 idle res-hpc-exe[013,015-017]
gpu up infinite 2 mix res-hpc-gpu[01-02]
highmem up infinite 1 mix res-hpc-mem01
[user@res-hpc-lo01 ~]$ sinfo -a
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 2 mix res-hpc-exe[014,019]
all* up infinite 1 alloc res-hpc-exe018
all* up infinite 4 idle res-hpc-exe[013,015-017]
gpu up infinite 2 mix res-hpc-gpu[01-02]
highmem up infinite 1 mix res-hpc-mem01
LKEBgpu up infinite 1 comp* res-hpc-lkeb02
LKEBgpu up infinite 1 down* res-hpc-lkeb03
LKEBgpu up infinite 2 mix res-hpc-lkeb[04-05]
LKEBgpu up infinite 1 idle res-hpc-lkeb01
[user@res-hpc-lo01 ~]$ sinfo -l
Mon Mar 23 09:21:27 2020
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
all* up infinite 1-infinite no NO all 2 mixed res-hpc-exe[014,019]
all* up infinite 1-infinite no NO all 1 allocated res-hpc-exe018
all* up infinite 1-infinite no NO all 4 idle res-hpc-exe[013,015-017]
gpu up infinite 1-infinite no NO all 2 mixed res-hpc-gpu[01-02]
highmem up infinite 1-infinite no NO all 1 mixed res-hpc-mem01
[user@res-hpc-lo01 ~]$ sinfo -l -N -a
Mon Mar 23 09:34:14 2020
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
res-hpc-exe013 1 all* mixed 16 2:8:1 128800 0 1 (null) none
res-hpc-exe014 1 all* mixed 24 2:12:1 386800 0 1 (null) none
res-hpc-exe015 1 all* idle 16 2:8:1 96000 0 1 (null) none
res-hpc-exe016 1 all* idle 16 2:8:1 128800 0 1 (null) none
res-hpc-exe017 1 all* idle 16 2:8:1 64000 0 1 (null) none
res-hpc-exe018 1 all* allocated 16 2:8:1 64000 0 1 (null) none
res-hpc-exe019 1 all* mixed 16 2:8:1 128800 0 1 (null) none
res-hpc-gpu01 1 gpu mixed 48 2:24:1 515000 0 1 (null) none
res-hpc-gpu02 1 gpu mixed 48 2:24:1 515000 0 1 (null) none
res-hpc-lkeb01 1 LKEBgpu idle 8 1:4:2 63000 0 1 (null) none
res-hpc-lkeb02 1 LKEBgpu completing* 12 1:6:2 15000 0 1 (null) none
res-hpc-lkeb03 1 LKEBgpu down* 40 1:20:2 250000 0 1 (null) Not responding
res-hpc-lkeb04 1 LKEBgpu mixed 12 1:6:2 257000 0 1 (null) none
res-hpc-lkeb05 1 LKEBgpu mixed 16 2:8:1 256000 0 1 (null) none
res-hpc-mem01 1 highmem mixed 60 4:15:1 300000 0 1 (null) none
- idle: the node has no jobs running on it
- alloc(ated): the whole node is allocated by one or more jobs
- mix(ed): one or more jobs are running on the node, but there are still free cores on the node
Jobs info
With the following command, you can get information about your running jobs and jobs from other users:
- squeue
- squeue -a
- squeue -l
[user@res-hpc-lo01 mpi-benchmarks]$ squeue
JOBID PARTITION USER ST TIME NODES NODELIST(REASON)
258 all user R 0:03 2 res-hpc-exe[013-014]
[user@res-hpc-lo01 mpi-benchmarks]$ squeue -a
JOBID PARTITION USER ST TIME NODES NODELIST(REASON)
258 all user R 0:06 2 res-hpc-exe[013-014]
[user@res-hpc-lo01 mpi-benchmarks]$ squeue -l
Thu Jan 23 09:14:22 2020
JOBID PARTITION USER STATE TIME TIME_LIMIT NODES NODELIST(REASON)
258 all user RUNNING 0:12 30:00 2 res-hpc-exe[013-014]
Jobs typically pass through several states in the course of their execution.
The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED.
An explanation of some states follows:
State | State (full) | Explanation |
---|---|---|
CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
CD | COMPLETED | Job has terminated all processes on all nodes with an exit code of zero. |
CG | COMPLETING | Job is in the process of completing. Some processes on some nodes may still be active. |
F | FAILED | Job terminated with non-zero exit code or other failure condition. |
PD | PENDING | Job is awaiting resource allocation. |
R | RUNNING | Job currently has an allocation. |
S | SUSPENDED | Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. |
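You can filter the queue on these states with the -t/--states option of squeue, for example:
- squeue -t PENDING
- squeue -u username -t RUNNING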
scontrol
With the Slurm command scontrol you can get a more detailed overview of your running job, the node hardware and the partitions:
[user@res-hpc-lo01 ~]$ scontrol show job 260
JobId=260 JobName=IMB
UserId=user(225812) GroupId=Domain Users(513) MCS_label=N/A
Priority=35603 Nice=0 Account=dnst-ict QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:13 TimeLimit=00:30:00 TimeMin=N/A
SubmitTime=2020-01-23T10:27:45 EligibleTime=2020-01-23T10:27:45
AccrueTime=2020-01-23T10:27:45
StartTime=2020-01-23T10:27:45 EndTime=2020-01-23T10:57:45 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-01-23T10:27:45
Partition=all AllocNode:Sid=res-hpc-ma01:46428
ReqNodeList=(null) ExcNodeList=(null)
NodeList=res-hpc-exe[013-014]
BatchHost=res-hpc-exe013
NumNodes=2 NumCPUs=32 NumTasks=32 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=32,mem=64G,node=2,billing=32
Socks/Node=* NtasksPerN:B:S:C=16:0:*:* CoreSpec=*
MinCPUsNode=16 MinMemoryCPU=2G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/user/Software/imb/mpi-benchmarks/imb.slurm
WorkDir=/home/user/Software/imb/mpi-benchmarks
StdErr=/home/user/Software/imb/mpi-benchmarks/job.%J.err
StdIn=/dev/null
StdOut=/home/user/Software/imb/mpi-benchmarks/job.%J.out
Power=
MailUser=user@gmail.com MailType=BEGIN,END,FAIL
[user@res-hpc-lo01 ~]$ scontrol show node res-hpc-exe014
NodeName=res-hpc-exe014 Arch=x86_64 CoresPerSocket=12
CPUAlloc=16 CPUTot=24 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=res-hpc-exe014 NodeHostName=res-hpc-exe014 Version=20.02.0-0pre1
OS=Linux 4.18.0-80.11.2.el8_0.x86_64 #1 SMP Tue Sep 24 11:32:19 UTC 2019
RealMemory=386800 AllocMem=32768 FreeMem=380208 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=all
BootTime=2019-12-11T11:51:40 SlurmdStartTime=2020-01-14T15:36:20
CfgTRES=cpu=24,mem=386800M,billing=24
AllocTRES=cpu=16,mem=32G
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
[user@res-hpc-lo01 ~]$ scontrol show partition all
PartitionName=all
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=N/A
DefaultTime=01:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=res-hpc-exe[013-014]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=40 TotalNodes=2 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=2048 MaxMemPerNode=UNLIMITED
Interactive jobs
- salloc
You can open an interactive session with the salloc command:
[user@res-hpc-lo01 ~]$ salloc -N1
salloc: Granted job allocation 267
salloc: Waiting for resource configuration
salloc: Nodes res-hpc-exe013 are ready for job
[user@res-hpc-exe013 ~]$ squeue
JOBID PARTITION USER ST TIME NODES NODELIST(REASON)
267 all user R 0:04 1 res-hpc-exe013
[user@res-hpc-exe013 ~]$ exit
exit
salloc: Relinquishing job allocation 267
[user@res-hpc-lo01 ~]$
In the example above we did not run a command, so we ended up in a bash shell on the allocated node. With exit we leave the shell and release the allocation.
[user@res-hpc-lo01 ~]$ salloc -N1 mpirun ./hello1
salloc: Granted job allocation 268
salloc: Waiting for resource configuration
salloc: Nodes res-hpc-exe013 are ready for job
Hello world from process 000 out of 001, processor name res-hpc-exe013.researchlumc.nl
salloc: Relinquishing job allocation 268
salloc: Job allocation 268 has been revoked.
Here we allocated 1 node with one core and ran the OpenMPI-compiled "hello1" program.
Now the same with 2 nodes and 16 cores on each node:
[user@res-hpc-lo01 ~]$ salloc -N2 --ntasks-per-node=16 mpirun ./hello1
salloc: Granted job allocation 270
salloc: Waiting for resource configuration
salloc: Nodes res-hpc-exe[013-014] are ready for job
Hello world from process 003 out of 032, processor name res-hpc-exe013.researchlumc.nl
Hello world from process 021 out of 032, processor name res-hpc-exe014.researchlumc.nl
Hello world from process 004 out of 032, processor name res-hpc-exe013.researchlumc.nl
Hello world from process 005 out of 032, processor name res-hpc-exe013.researchlumc.nl
Hello world from process 027 out of 032, processor name res-hpc-exe014.researchlumc.nl
Hello world from process 000 out of 032, processor name res-hpc-exe013.researchlumc.nl
Hello world from process 029 out of 032, processor name res-hpc-exe014.researchlumc.nl
Hello world from process 006 out of 032, processor name res-hpc-exe013.researchlumc.nl
Hello world from process 031 out of 032, processor name res-hpc-exe014.researchlumc.nl
Hello world from process 007 out of 032, processor name res-hpc-exe013.researchlumc.nl
Hello world from process 016 out of 032, processor name res-hpc-exe014.researchlumc.nl
Hello world from process 010 out of 032, processor name res-hpc-exe013.researchlumc.nl
Hello world from process 019 out of 032, processor name res-hpc-exe014.researchlumc.nl
Hello world from process 011 out of 032, processor name res-hpc-exe013.researchlumc.nl
Hello world from process 030 out of 032, processor name res-hpc-exe014.researchlumc.nl
Hello world from process 012 out of 032, processor name res-hpc-exe013.researchlumc.nl
Hello world from process 017 out of 032, processor name res-hpc-exe014.researchlumc.nl
Hello world from process 013 out of 032, processor name res-hpc-exe013.researchlumc.nl
Hello world from process 018 out of 032, processor name res-hpc-exe014.researchlumc.nl
Hello world from process 014 out of 032, processor name res-hpc-exe013.researchlumc.nl
Hello world from process 020 out of 032, processor name res-hpc-exe014.researchlumc.nl
Hello world from process 015 out of 032, processor name res-hpc-exe013.researchlumc.nl
Hello world from process 022 out of 032, processor name res-hpc-exe014.researchlumc.nl
Hello world from process 001 out of 032, processor name res-hpc-exe013.researchlumc.nl
Hello world from process 023 out of 032, processor name res-hpc-exe014.researchlumc.nl
Hello world from process 024 out of 032, processor name res-hpc-exe014.researchlumc.nl
Hello world from process 002 out of 032, processor name res-hpc-exe013.researchlumc.nl
Hello world from process 025 out of 032, processor name res-hpc-exe014.researchlumc.nl
Hello world from process 008 out of 032, processor name res-hpc-exe013.researchlumc.nl
Hello world from process 026 out of 032, processor name res-hpc-exe014.researchlumc.nl
Hello world from process 028 out of 032, processor name res-hpc-exe014.researchlumc.nl
Hello world from process 009 out of 032, processor name res-hpc-exe013.researchlumc.nl
salloc: Relinquishing job allocation 270
- srun
With the srun command you can also open an interactive session or you can run a program through the scheduler.
Interactive:
[user@res-hpc-lo01 ~]$ srun --pty bash
[user@res-hpc-exe013 ~]$ exit
exit
Running a program:
[user@res-hpc-lo01 ~]$ cat hello.sh
#!/bin/bash
#
echo "Hello from $(hostname)"
echo "It is currently $(date)"
echo ""
echo "SLURM_JOB_NAME: $SLURM_JOB_NAME"
echo "SLURM_JOBID: " $SLURM_JOBID
[user@res-hpc-lo01 ~]$ chmod +x hello.sh
[user@res-hpc-lo01 ~]$ srun -N1 hello.sh
Hello from res-hpc-exe013.researchlumc.nl
It is currently Thu Jan 23 12:35:18 CET 2020
SLURM_JOB_NAME: hello.sh
SLURM_JOBID: 282
sbatch
The normal and correct way to submit a job is with a Slurm batch script. This is a normal bash script with special directives for Slurm.
In a batch script, "#SBATCH" lines are used to pass options to Slurm. The various meanings of lines starting with "#" are:
Line starts with | Treated as |
---|---|
# | Comment in shell and Slurm |
#SBATCH | Comment in shell, option in Slurm |
# SBATCH | Comment in shell and Slurm |
Options, sometimes called “directives”, can be set in the job script file using this line format for each option:
#SBATCH {option} {parameter}
Directive Description | Specified As #SBATCH |
---|---|
Name the job < jobname > | -J < jobname > |
Request at least < minnodes > nodes | -N < minnodes > |
Request < minnodes > to < maxnodes > nodes | -N < minnodes >-< maxnodes > |
Request at least < MB > amount of temporary disk space | --tmp < MB > |
Run the job for a time of < walltime > | -t < walltime > |
Run the job at < time > | --begin < time > |
Set the working directory to < directorypath > | -D < directorypath > |
Set error log name to < jobname.err > | -e < jobname.err > |
Set output log name to < jobname.log > | -o < jobname.log > |
Mail < user@address > | --mail-user < user@address > |
Mail on any event | --mail-type=ALL |
Mail on job end | --mail-type=END |
Run job in partition | -p < destination > |
Run job using GPU with ID < number > | --gres=gpu:< number > |
Specify the real memory required per node. Suffix [K|M|G|T]. It is usually better to use "--mem-per-cpu" | --mem=<size[units]> |
Minimum memory required per allocated CPU. Suffix [K|M|G|T] | --mem-per-cpu=<size[units]> |
Node-Core reservation:
Short option | Long option | Description |
---|---|---|
-N | --nodes= | Request this many nodes on the cluster. Use 1 core on each node by default |
-n | --ntasks= | Request this many tasks on the cluster. Defaults to 1 task per node |
(none) | --ntasks-per-node= | Request this number of tasks per node |
For example:
Options | Description |
---|---|
-N2 | use 2 nodes, 1 core on each node, so in total 2 cores |
-N2 --ntasks-per-node=16 | use 2 nodes, 16 cores on each node, so in total 32 cores |
-n32 | use 32 cores in total, let Slurm decide where to run (one or multiple nodes) |
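As an illustration, the second row of the table above corresponds to the following batch script header (a sketch; the job name and time limit are placeholders you should adapt):
#!/bin/bash
#SBATCH -J example
#SBATCH -N 2
#SBATCH --ntasks-per-node=16
#SBATCH --time=01:00:00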
Submitting jobs
As an example we build and submit the Intel MPI Benchmarks (IMB):
module purge
module add library/mpi/openmpi/4.0.2/gcc-8.3.1
git clone https://github.com/intel/mpi-benchmarks
cd mpi-benchmarks
make clean
cd src_c
make clean
make -f Makefile TARGET=MPI1
Note that the Makefile sets CC=mpicc.
Check that the binary is linked against the OpenMPI libraries from the loaded module:
ldd ./IMB-MPI1
linux-vdso.so.1 (0x00007fff6e9f6000)
libmpi.so.40 => /share/software/library/mpi/openmpi/4.0.2/gcc-8.3.1/lib/libmpi.so.40 (0x00007f7c6acb7000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f7c6aa97000)
libc.so.6 => /lib64/libc.so.6 (0x00007f7c6a6d4000)
libopen-rte.so.40 => /share/software/library/mpi/openmpi/4.0.2/gcc-8.3.1/lib/libopen-rte.so.40 (0x00007f7c6a41e000)
libopen-pal.so.40 => /share/software/library/mpi/openmpi/4.0.2/gcc-8.3.1/lib/libopen-pal.so.40 (0x00007f7c6a12e000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f7c69f2a000)
libudev.so.1 => /lib64/libudev.so.1 (0x00007f7c69d03000)
libpciaccess.so.0 => /lib64/libpciaccess.so.0 (0x00007f7c69af9000)
librt.so.1 => /lib64/librt.so.1 (0x00007f7c698f0000)
libm.so.6 => /lib64/libm.so.6 (0x00007f7c6956e000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007f7c6936a000)
libz.so.1 => /lib64/libz.so.1 (0x00007f7c69153000)
libevent-2.1.so.6 => /lib64/libevent-2.1.so.6 (0x00007f7c68efa000)
libevent_pthreads-2.1.so.6 => /lib64/libevent_pthreads-2.1.so.6 (0x00007f7c68cf7000)
/lib64/ld-linux-x86-64.so.2 (0x00007f7c6afda000)
libmount.so.1 => /lib64/libmount.so.1 (0x00007f7c68a9d000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f7c68885000)
libcrypto.so.1.1 => /lib64/libcrypto.so.1.1 (0x00007f7c683a7000)
libblkid.so.1 => /lib64/libblkid.so.1 (0x00007f7c68155000)
libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f7c67f4d000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f7c67d22000)
libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007f7c67a9e000)
Create a Slurm batch script for the benchmark, for example imb.slurm:
#!/bin/bash
#SBATCH -J IMB
#SBATCH -N 2
# SBATCH --ntasks-per-node=16
# SBATCH --ntasks-per-node=6
# SBATCH -n 32
# SBATCH --exclusive
#SBATCH --time=00:30:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=user@lumc.nl
# Clear the environment from any previously loaded modules
module purge > /dev/null 2>&1
# Load the module environment suitable for the job
# module load library/mpi/openmpi/3.1.5/gcc-8.3.1
module load library/mpi/openmpi/4.0.2/gcc-8.3.1
echo "Starting at `date`"
echo "Running on hosts: $SLURM_JOB_NODELIST"
echo "Running on $SLURM_JOB_NUM_NODES nodes."
echo "Running $SLURM_NTASKS tasks."
echo "Account: $SLURM_JOB_ACCOUNT"
echo "Job ID: $SLURM_JOB_ID"
echo "Job name: $SLURM_JOB_NAME"
echo "Node running script: $SLURMD_NODENAME"
echo "Submit host: $SLURM_SUBMIT_HOST"
echo "Current working directory is `pwd`"
mpirun ./IMB-MPI1
echo "Program finished with exit code $? at: `date`"
scancel
With the scancel command you can cancel your running job or your scheduled job:
- scancel jobid
where the jobid is your job identifier.
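For example (258 is the job ID from the squeue example above; with -u you cancel all jobs of a user):
- scancel 258
- scancel -u username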
scontrol update
While scontrol show is a powerful command for showing information about your job, with the scontrol update command you can change certain settings as long as your job is on hold or pending. First put your job on hold, update the settings and then release your job:
- scontrol hold jobid
- scontrol update job jobid NumNodes=2-2 NumTasks=2 Features=intel16
- scontrol release jobid
See the man page for the scontrol command.
X11 forwarding
You can enable X11 forwarding with the "--x11" parameter, for example:
- srun -n1 --pty --x11 xclock
Using GPUs
You can use a GPU with the --gres parameter, for example:
--partition=gpu
--gres=gpu:1
--ntasks=1
--cpus-per-task=1
Syntax:
- --gres=gpu:[gpu type]:[number of GPUs]
Normally you don't have to specify the type of GPU. But if there are different kinds of GPUs in a single machine, or you want to run on a certain type of GPU, you have to specify which GPU type you want, for example:
--partition=LKEBgpu
--gres=gpu:1080Ti:1
Hostname | Partition | Type GPU and number |
---|---|---|
res-hpc-lkeb01 | LKEBgpu | Gres=gpu:K40C:1 |
res-hpc-lkeb02 | LKEBgpu | Gres=gpu:TitanX:1 |
res-hpc-lkeb03 | LKEBgpu | Gres=gpu:V100:4 |
res-hpc-lkeb04 | LKEBgpu | Gres=gpu:1080Ti:4 |
res-hpc-lkeb05 | LKEBgpu | Gres=gpu:RTX6000:3 |
res-hpc-gpu[01-02] | gpu | Gres=gpu:TitanXp:3 |
Simple GPU example
cat test-gpu.slurm
#!/bin/bash
#
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=3:00
module purge
module add library/cuda/10.1/gcc.8.3.1
hostname
echo "Cuda devices: $CUDA_VISIBLE_DEVICES"
nvidia-smi
sleep 10
Output:
cat slurm-206044.out
res-hpc-gpu01.researchlumc.nl
Cuda devices: 0
Tue Apr 14 14:24:59 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:3B:00.0 Off | N/A |
| 17% 30C P0 60W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
You can see from the output that we have 1 GPU: Cuda devices: 0
The same, but now we make a reservation for 3 GPUs:
#!/bin/bash
#
#SBATCH --partition=gpu
#SBATCH --gres=gpu:3
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=3:00
module purge
module add library/cuda/10.1/gcc.8.3.1
hostname
echo "Cuda devices: $CUDA_VISIBLE_DEVICES"
nvidia-smi
sleep 10
cat slurm-206045.out
res-hpc-gpu01.researchlumc.nl
Cuda devices: 0,1,2
Tue Apr 14 14:26:22 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:3B:00.0 Off | N/A |
| 17% 31C P0 61W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:AF:00.0 Off | N/A |
| 18% 29C P0 60W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp Off | 00000000:D8:00.0 Off | N/A |
| 18% 30C P0 61W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
You can see from the output that we have 3 GPUs: Cuda devices: 0,1,2
Compiling and running GPU programs
First download and compile the samples from NVIDIA: CUDA samples
module purge
module add library/cuda/10.1/gcc.8.3.1
cd
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/UnifiedMemoryPerf/
make
Create a slurm batch script:
cat gpu-test.slurm
#!/bin/bash
#
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=3:00
module purge
module add library/cuda/10.2/gcc.8.3.1
hostname
echo "Cuda devices: $CUDA_VISIBLE_DEVICES"
$HOME/cuda-samples/Samples/UnifiedMemoryPerf/UnifiedMemoryPerf
- sbatch gpu-test.slurm
While running, ssh to the node (in this case res-hpc-gpu01) and run the command nvidia-smi. This will show that the "UnifiedMemoryPerf" program is running on a GPU.
[user@res-hpc-gpu01 GPU]$ nvidia-smi
Tue Apr 14 16:06:06 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:3B:00.0 Off | N/A |
| 17% 31C P0 61W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:AF:00.0 Off | N/A |
| 23% 34C P2 69W / 250W | 259MiB / 12196MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp Off | 00000000:D8:00.0 Off | N/A |
| 18% 31C P0 61W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 1 29726 C ...les/UnifiedMemoryPerf/UnifiedMemoryPerf 145MiB |
+-----------------------------------------------------------------------------+
Output:
cat slurm-206625.out
res-hpc-gpu01.researchlumc.nl
Cuda devices: 0
GPU Device 0: "Pascal" with compute capability 6.1
Running ........................................................
Overall Time For matrixMultiplyPerf
Printing Average of 100 measurements in (ms)
Size_KB UMhint UMhntAs UMeasy 0Copy MemCopy CpAsync CpHpglk CpPglAs
4 10.879 23.178 0.222 0.014 0.031 0.026 0.035 0.026
16 10.657 25.849 0.580 0.030 0.051 0.046 0.052 0.039
64 21.117 37.351 0.852 0.103 0.124 0.116 0.095 0.081
256 21.184 38.074 1.387 0.587 0.450 0.415 0.313 0.302
1024 24.174 33.124 3.032 3.650 1.741 1.649 1.211 1.199
4096 21.668 35.167 11.067 25.803 7.119 7.104 5.329 5.333
16384 51.674 62.263 49.300 191.051 34.179 34.632 28.582 28.054
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Slurm Environment Variables
Available environment variables include:
Variable | Meaning |
---|---|
SLURM_CPUS_ON_NODE | processors available to the job on this node |
SLURM_JOB_ID | job ID of executing job |
SLURM_LAUNCH_NODE_IPADDR | IP address of node where job launched |
SLURM_NNODES | total number of nodes |
SLURM_NODEID | relative node ID of current node |
SLURM_NODELIST | list of nodes allocated to job |
SLURM_NTASKS | total number of processes in current job |
SLURM_PROCID | MPI rank (or relative process ID) of the current process |
SLURM_SUBMIT_DIR | directory from which the job was launched |
SLURM_TASK_PID | process ID of the task started |
SLURM_TASKS_PER_NODE | number of tasks to be run on each node |
CUDA_VISIBLE_DEVICES | which GPUs are available for use |
Job arrays
For job arrays, see the Slurm documentation web page
Batch option:
- --array=0-31
You can cancel your job array with the command:
- scancel jobid_[0-31]
Environment variables
Environment variable | Comment |
---|---|
SLURM_ARRAY_JOB_ID | set to the first job ID of the array |
SLURM_ARRAY_TASK_ID | set to the job array index value |
SLURM_ARRAY_TASK_COUNT | set to the number of tasks in the job array |
SLURM_ARRAY_TASK_MAX | set to the highest job array index value |
SLURM_ARRAY_TASK_MIN | set to the lowest job array index value |
Limitations - Restrictions
For now we have set the maximum job array size you can submit at once to 121:
scontrol show config | grep MaxArraySize
MaxArraySize = 121
Without this limit, some users submit very large job arrays that occupy the whole cluster, so that other users cannot run their jobs.
Limit the amount of simultaneously running jobs
You can limit the number of simultaneously running tasks with the "%" separator in the array specification, for example:
- "--array=0-100%4" will limit the number of simultaneously running tasks from this job array to 4
We recommend using this option.
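A minimal job array sketch that runs one task per input file; the program name and input files are placeholders for illustration:
#!/bin/bash
#SBATCH -J array-example
#SBATCH --array=0-9%4
#SBATCH --time=00:10:00
#SBATCH --output=job.%A_%a.out

# %A is the array job ID, %a the array task index
echo "Task $SLURM_ARRAY_TASK_ID of array job $SLURM_ARRAY_JOB_ID"
./my_program input_${SLURM_ARRAY_TASK_ID}.txt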
Remote GPU-accelerated visualization on res-hpc-lo02
If you want to run a graphical program that will show 3D animation, movies or any other kind of simulation/visualization, we have the second login node res-hpc-lo02.researchlumc.nl for this. This server has a powerful Tesla T4 GPU card (16Gb memory).
Steps for setting up a remote GPU-accelerated visualization:
- connect your remote desktop (X2Go) to the second login node res-hpc-lo02.researchlumc.nl
- start your visualization program (with "vglrun" in front of it if needed)
Once you are in your remote desktop, open a terminal.
For GPU acceleration, you have to put the VirtualGL command "vglrun" in front of the actual program you want to run.
Examples:
- /opt/VirtualGL/bin/vglrun /opt/VirtualGL/bin/glxinfo
- /opt/VirtualGL/bin/vglrun /opt/VirtualGL/bin/glxspheres64
- /opt/VirtualGL/bin/vglrun glxgears
With the "glxinfo" program, you should check for the strings:
- direct rendering: Yes
- OpenGL renderer string: Tesla T4/PCIe/SSE2
vglrun glxinfo | egrep "rendering|OpenGL"
direct rendering: Yes
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: Tesla T4/PCIe/SSE2
If you see llvmpipe, then you are using software rendering instead of hardware acceleration.
glxinfo | egrep "rendering|OpenGL"
direct rendering: Yes
OpenGL vendor string: VMware, Inc.
OpenGL renderer string: llvmpipe (LLVM 9.0.0, 256 bits)
Programs that can run with "vglrun"
- fsleyes
How to run:
- module add neuroImaging/FSLeyes/0.32.3
- /opt/VirtualGL/bin/vglrun fsleyes
Check for: OpenGL renderer: Tesla T4/PCIe/SSE2
If you see llvmpipe then you are using software rendering:
With the nvidia-smi command, you can also check if your program is running on the GPU. Below you can see 2 programs running on the GPU: the Xorg server and the fsleyes program:
nvidia-smi
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 25565 G /usr/libexec/Xorg 25MiB |
| 0 N/A N/A 28059 G ...ng/fsl/FSLeyes/bin/python 2MiB |
+-----------------------------------------------------------------------------+
Create a vncserver setup and start a vncserver session
If you are setting up a vncserver session for the first time, it will ask you a few questions, and after that you have to adapt the configuration and startup files:
vncserver
You will require a password to access your desktops.
Password:
Verify:
Would you like to enter a view-only password (y/n)? n
Desktop 'TurboVNC: res-hpc-lo02.researchlumc.nl:1 (username)' started on display res-hpc-lo02.researchlumc.nl:1
Creating default startup script /home/username/.vnc/xstartup.turbovnc
Starting applications specified in /home/username/.vnc/xstartup.turbovnc
Log file is /home/username/.vnc/res-hpc-lo02.researchlumc.nl:1.log
You should choose a strong password; it should not be the same as your user login password.
Kill the vncserver connection:
- vncserver -kill :1
Now adapt the xstartup.turbovnc file:
- vi $HOME/.vnc/xstartup.turbovnc
#!/bin/sh
unset SESSION_MANAGER
unset DBUS_SESSION_BUS_ADDRESS
XDG_SESSION_TYPE=x11; export XDG_SESSION_TYPE
exec icewm-session
Adapt/create a turbovncserver.conf file for the vncserver with some useful settings:
- vi $HOME/.vnc/turbovncserver.conf
$geometry = "1280x1024";
$depth = 24;
Now start the vncserver:
vncserver
Desktop 'TurboVNC: res-hpc-lo02.researchlumc.nl:1 (username)' started on display res-hpc-lo02.researchlumc.nl:1
Starting applications specified in /home/username/.vnc/xstartup.turbovnc
Log file is /home/username/.vnc/res-hpc-lo02.researchlumc.nl:1.log
You can list your vncserver sessions with the following command:
- vncserver -list
vncserver -list
TurboVNC sessions:
X DISPLAY # PROCESS ID
:1 33915
You can/should kill your vncserver session when you are done running your application:
- vncserver -kill :1
vncserver -kill :1
Killing Xvnc process ID 47947
vncserver and port numbers
Every time someone starts a VNC session while another session is already running, the display number increases. The first connection will be on display :1, which is port 5900 + 1 (5900 is the standard VNC base port). For example:
Desktop 'TurboVNC: res-hpc-lo02.researchlumc.nl:3 (username)' started on display res-hpc-lo02.researchlumc.nl:3
In this case the display number is 3, so you connect to "res-hpc-lo02.researchlumc.nl:3".
You can always list your own open VNC sessions with the vncserver -list command.
Remember to kill your VNC session when you are done running your own application.
Remote visualization with a "reverse SSH tunnel" and a vncserver/client
With a reverse SSH tunnel you can make a quick connection to a remote desktop. We assume that you have already set up your vncserver correctly.
We are using the SSH proxy server for this:
- IP address: 145.88.35.10
- Hostname: res-ssh-alg01.researchlumc.nl
Setting up the reverse SSH tunnel
Follow these steps:
- [on res-hpc-lo02:] vncserver
Desktop 'TurboVNC: res-hpc-lo02.researchlumc.nl:1 (username)' started on display res-hpc-lo02.researchlumc.nl:1
Starting applications specified in /home/username/.vnc/xstartup.turbovnc
Log file is /home/username/.vnc/res-hpc-lo02.researchlumc.nl:1.log
- [on res-hpc-lo02:] ssh -R 8899:localhost:5901 -l username 145.88.35.10
You are now logged in on "username@res-ssh-alg01" [keep this window open]
At home (terminal 1):
- ssh -L 5901:localhost:8899 -l username 145.88.35.10 [keep this window open]
At home (terminal 2):
- first install a vnc client for your OS, for example:
- vncviewer localhost:1
This will open a desktop (icewm)
From here:
- terminal
- module add neuroImaging/FSLeyes/0.32.3
- /opt/VirtualGL/bin/vglrun fsleyes
Remember: the first VNC session runs on port 5901 (:1), the second on port 5902 (:2), and so on. The same holds for port 8899 on the proxy server: only one user can forward through this port. If the port is occupied you can't get a connection, so you have to try another port, for example 8900 or 8898.
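For example, if your vncserver started on display :2, the same tunnel would look like this (8898 is just an alternative free port on the proxy server, chosen for illustration):
- [on res-hpc-lo02:] ssh -R 8898:localhost:5902 -l username 145.88.35.10
- [at home, terminal 1:] ssh -L 5902:localhost:8898 -l username 145.88.35.10
- [at home, terminal 2:] vncviewer localhost:2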
Closing the reverse SSH tunnel
- vncserver -kill :1
- close both connections/terminals where you are logged in [to the SSH proxy server] with "exit"
More Slurm info
Sview
sview is a graphical frontend for Slurm which can be handy at times; it only gives you some minimal functionality.
Don't forget to enable X11 forwarding.
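For example (assuming you are on a workstation with a running X server):
- ssh -X res-hpc-lo01.researchlumc.nl
- sview &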
Comparison between SGE and Slurm
User commands
User command | SGE | Slurm |
---|---|---|
Interactive login | qlogin | srun --pty bash or srun -p [partition name] --time=4:0:0 --pty bash |
Job submission | qsub [script_file] | sbatch [script_file] |
Job deletion | qdel [job_id] | scancel [job_id] |
Job status by job | qstat -u “*” [-j job_id] | squeue [job_id] |
Job status by user | qstat [-u user_name] | squeue -u [user_name] |
Job hold | qhold [job_id] | scontrol hold [job_id] |
Job release | qrls [job_id] | scontrol release [job_id] |
Queue list | qconf -sql | squeue |
List nodes | qhost | sinfo -N or scontrol show nodes |
Cluster status | qhost -q | sinfo |
GUI | qmon | sview |
Environmental
Environmental | SGE | SLURM |
---|---|---|
Job ID | $JOB_ID | $SLURM_JOB_ID |
Submit directory | $SGE_O_WORKDIR | $SLURM_SUBMIT_DIR |
Submit host | $SGE_O_HOST | $SLURM_SUBMIT_HOST |
Node list | $PE_HOSTFILE | $SLURM_NODELIST |
Job Array Index | $SGE_TASK_ID | $SLURM_ARRAY_TASK_ID |
Number of CPUs | $NSLOTS | $SLURM_NPROCS |
More:
Slurm | Comment |
---|---|
$SLURM_CPUS_ON_NODE | processors available to the job on this node |
$SLURM_LAUNCH_NODE_IPADDR | IP address of node where job launched |
$SLURM_NNODES | total number of nodes |
$SLURM_NODEID | relative node ID of current node |
$SLURM_NTASKS | total number of processes in current job |
$SLURM_PROCID | MPI rank (or relative process ID) of the current process |
$SLURM_TASK_PID | process ID of task started |
$SLURM_TASKS_PER_NODE | number of task to be run on each node. |
$CUDA_VISIBLE_DEVICES | which GPUs are available for use |
Job Specification
Job Specification | SGE | SLURM |
---|---|---|
Script directive | #$ | #SBATCH |
queue | -q [queue] | -p [partition] |
count of nodes | N/A | -N [min[-max]] |
CPU count | -pe [PE] [count] | -n [count] |
Wall clock limit | -l h_rt=[seconds] | -t [min] OR -t [days-hh:mm:ss] |
Standard out file | -o [file_name] | -o [file_name] |
Standard error file | -e [file_name] | -e [file_name] |
Combine STDOUT & STDERR files | -j yes | (use -o without -e) |
Copy environment | -V | --export=[ALL | NONE | variables] |
Event notification | -m abe | --mail-type=[events] --mail-type=ALL (any event) --mail-type=END (job end) |
Send notification email | -M [address] | --mail-user=[address] |
Job name | -N [name] | --job-name=[name] [-J] |
Restart job | -r [yes|no] | --requeue OR --no-requeue (NOTE: the default is configurable) |
Set working directory | -wd [directory] | --workdir=[dir_name] [-D] |
Resource sharing | -l exclusive | --exclusive OR --shared |
Memory size | -l mem_free=[memory][K|M|G] | --mem=[mem][M|G|T] OR --mem-per-cpu=[mem][M|G|T] |
Charge to an account | -A [account] | --account=[account] |
Tasks per node | (Fixed allocation_rule in PE) | --ntasks-per-node=[count] --cpus-per-task=[count] |
Job dependency | -hold_jid [job_id | job_name] | --depend=[state:job_id] |
Job project | -P [name] | --wckey=[name] |
Job host preference | -q [queue]@[node] OR -q [queue]@@[hostgroup] | --nodelist=[nodes] AND/OR --exclude=[nodes] |
Quality of service | N/A | --qos=[name] |
Job arrays | -t [array_spec] | --array=[array_spec] |
Generic Resources | -l [resource]=[value] | --gres=[resource_spec] |
Licenses | -l [license]=[count] | --licenses=[license_spec] |
Begin Time | -a [YYMMDDhhmm] | --begin=YYYY-MM-DD[HH:MM[:SS]] |
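To illustrate the mapping above, here is the same (hypothetical) job header written once with SGE directives and once with Slurm directives; the queue/partition name, parallel environment name and e-mail address are placeholders:
SGE:
#!/bin/bash
#$ -N myjob
#$ -q short.q
#$ -pe smp 4
#$ -l h_rt=3600
#$ -l mem_free=4G
#$ -o myjob.out
#$ -e myjob.err
#$ -M username@lumc.nl
#$ -m abe
Slurm:
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --partition=short
#SBATCH --ntasks=4
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --output=myjob.out
#SBATCH --error=myjob.err
#SBATCH --mail-user=username@lumc.nl
#SBATCH --mail-type=ALL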
Working with sensitive data on the Shark cluster
Data should be handled in accordance with the General Data Protection Regulation (EU) 2016/679 (GDPR).
Data analyzed on the Shark cluster should at least be pseudonymised.
Data type definitions
- Personal data
Personal data, also known as personal information or personally identifiable information, is any information relating to an identifiable person.
- Pseudonymisation
A required process for stored data that transforms personal data in such a way that the resulting data cannot be attributed to a specific data subject without the use of additional information.
- Anonymization
Data anonymization has been defined as a "process by which personal data is irreversibly altered in such a way that a data subject can no longer be identified directly or indirectly".
Data information links (access only from within LUMC)
- Project: Implementation of the LUMC data management guidelines
- Template Data management plan - DM08-F01 (Version 3)
- Code for the protection of personal data (Version 4)
- LUMC data management guidelines for research [LUMC richtlijnen datamanagement voor onderzoek] - DM08 (Version 5)
Data storage / access
The Shark cluster has multiple types of data storage.
Storage solutions
- HPC Isilon storage.
This is fast storage for direct access to your data on the cluster, which can be purchased from the IT&DI department through Topdesk. Once purchased, this storage will be NFS v4 mounted on all the nodes of the cluster. The default mountpoint will be under /exports/.
Access to this mountpoint is handled by an Active Directory group. The default mount access rights are set by an Ansible playbook. To grant users access to this share, they need to be added to the Active Directory group attached to the share. To find out which group is attached to your data storage, use the following command:
ls -aldh /exports/<storage-share-name> | awk '{print $4}'
- Research LTS Isilon storage
This is slow storage for archiving data, which can be purchased from the IT&DI department through Topdesk. Once purchased, this storage will be NFS v4 mounted on all execution/gpu/mem nodes of the cluster with read-only access; on the login nodes you will have read and write access. The default mountpoint will be under /exports/archive/. Access to this mountpoint is handled by an Active Directory group. The default mount access rights are set by an Ansible playbook. To grant users access to this share, they need to be added to the Active Directory group attached to the share. To find out which group is attached to your data storage, use the following command:
ls -aldh /exports/archive/<storage-share-name> | awk '{print $4}'
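For example, to check which Active Directory group controls a share and whether you are a member of it (the share and group names below are only illustrations):
ls -aldh /exports/my-project | awk '{print $4}'   # prints the AD group, e.g. my-project-group
getent group my-project-group                     # list the members of that group
id -nG | grep my-project-group                    # check whether your own account is in it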
Special directories on the cluster
- /bam-export/ This directory is created for sharing your Binary Alignment/Map (BAM) files. It is a temporary share that works on a first-in, first-out principle. Data placed here must not contain any patient-related data and can be deleted at any time (make sure you have a copy somewhere else). This directory can be used in the UCSC Genome Browser to view your data tracks. The files are accessible through a web browser at https://barmsijs.lumc.nl/. They can only be accessed if you know the exact file name and the files in the /bam-export/ directory are world readable.
- /home The /home directory is an Isilon HPC export mounted on /home for all the nodes in the cluster. This export is limited to 10 GB per person. Your home directory is automatically created the first time you log into the cluster and is named after your username. By default your /home/ directory is world readable and world executable. This directory should not be used for data storage. A quick way to check your usage and permissions is shown below the table.
Mount point | Storage | Size | Usage | Owner | Group | Security rights | Mount option login nodes | Mount option compute nodes |
---|---|---|---|---|---|---|---|---|
/home | HPC Isilon | 10 GB | For small personal storage | AD username | AD Group | rwxr-xr-x | read/write | read/write |
/exports | HPC Isilon | - | Mount point for department/project storage | AD username | AD Group | rwxrws--- | read/write | read/write |
/exports/archive | LTS Isilon | - | Mount point for department/project Long Term Storage | AD username | AD Group | rwxrws--- | read/write | read-only |
/bam-export | HPC Isilon | 2TB | For displaying BAM files on https://barmsijs.lumc.nl | AD username | AD Group | rwxrwxrwx | read/write | read/write |
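As a quick check of your own home directory (the size limit and default permissions are described above), you could run something like:
ls -ld /home/$USER    # shows the default rwxr-xr-x permissions
du -sh /home/$USER    # shows how much of the 10 GB limit you are using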
Applications
R
R is a programming language for statistical computing and graphics.
You can load R with one of the modules:
- statistical/R/3.4.4/gcc.8.3.1
- statistical/R/3.5.3/gcc.8.3.1
- statistical/R/3.6.2/gcc.8.3.1
- statistical/R/4.0.0/gcc.8.3.1
- statistical/RStudio/1.2.5033/gcc-8.3.1
Running R interactively
You can run R interactively, for example as an exercise or a quick test, but the recommended way is to run R in batch mode.
[username@res-hpc-lo01 ~]$ salloc -N1 -n1
salloc: Pending job allocation 386499
salloc: job 386499 queued and waiting for resources
salloc: job 386499 has been allocated resources
salloc: Granted job allocation 386499
salloc: Waiting for resource configuration
salloc: Nodes res-hpc-exe017 are ready for job
[username@res-hpc-exe017 ~]$ module add statistical/R/4.0.0/gcc.8.3.1
[username@res-hpc-exe017 ~]$ R
R version 4.0.0 (2020-04-24) -- "Arbor Day"
...
Type 'q()' to quit R.
> q()
Save workspace image? [y/n/c]: n
[username@res-hpc-exe017 ~]$ exit
exit
salloc: Relinquishing job allocation 386499
salloc: Job allocation 386499 has been revoked.
Running an R script in batch mode
First example
HelloWorld.R
print ("Hello world!")
myscript.sh
#!/bin/bash
#SBATCH --job-name=HelloWorld # Job name
#SBATCH --output=slurm.out # Output file name
#SBATCH --error=slurm.err # Error file name
#SBATCH --partition=short # Partition
#SBATCH --time=00:05:00 # Time limit
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks-per-node=1 # MPI processes per node
module purge
module add statistical/R/4.0.0/gcc.8.3.1
Rscript --vanilla HelloWorld.R
- sbatch myscript.sh
Submitted batch job 386860
[username@res-hpc-lo01 R]$ cat slurm.out
[1] "Hello world!"
Second example
driver.R
x <- rnorm(50)
cat("My sample from N(0,1) is:\n")
print(x)
run.slurm
#!/bin/bash
#SBATCH --job-name=serialR # Job name
#SBATCH --output=slurm.out # Output file name
#SBATCH --error=slurm.err # Error file name
#SBATCH --partition=short # Partition
#SBATCH --time=00:05:00 # Time limit
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks-per-node=1 # MPI processes per node
module purge
module add statistical/R/4.0.0/gcc.8.3.1
Rscript driver.R
[username@res-hpc-lo01 R]$ sbatch run.slurm
Submitted batch job 386568
[username@res-hpc-lo01 R]$ ls -l
total 78
-rw-r--r-- 1 username Domain Users 59 Jun 5 11:42 driver.R
-rw-r--r-- 1 username Domain Users 483 Jun 5 11:42 run.slurm
-rw-r--r-- 1 username Domain Users 0 Jun 5 11:43 slurm.err
-rw-r--r-- 1 username Domain Users 671 Jun 5 11:43 slurm.out
[username@res-hpc-lo01 R]$ cat slurm.out
My sample from N(0,1) is:
[1] 0.32241013 -0.78250675 -0.28872991 0.12559634 -0.29176358 0.57962942
[7] -0.38277807 -0.21266343 0.86537064 1.06636737 0.96487417 0.31699518
[13] 0.38003556 0.78275327 -0.85745177 -1.47682958 -0.16192662 0.09207091
[19] -0.64508782 1.01504976 -0.07736039 -1.08819811 1.17762738 -0.22819258
[25] 0.79564029 1.36863520 -0.63137494 -0.58452239 -0.96832479 -1.56506037
[31] 1.68344229 1.03967058 -0.20854621 1.39479829 -0.95509839 0.80826154
[37] -0.89781029 0.99954821 -1.25047597 -1.11034908 -1.10759254 1.32150663
[43] -0.04589279 -0.62886137 0.63947415 0.18295622 0.63929410 0.16774740
[49] 0.92311091 -0.13370228
[username@res-hpc-lo01 R]$ scontrol show job 386568
JobId=386568 JobName=serialR
UserId=username(225812) GroupId=Domain Users(513) MCS_label=N/A
Priority=449759 Nice=0 Account=dnst-ict QOS=normal
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:02 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2020-06-05T11:43:02 EligibleTime=2020-06-05T11:43:02
AccrueTime=2020-06-05T11:43:02
StartTime=2020-06-05T11:43:02 EndTime=2020-06-05T11:43:04 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-06-05T11:43:02
Partition=short AllocNode:Sid=res-hpc-ma01:27472
ReqNodeList=(null) ExcNodeList=(null)
NodeList=res-hpc-gpu01
BatchHost=res-hpc-gpu01
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=2G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/username/R/run.slurm
WorkDir=/home/username/R
StdErr=/home/username/R/slurm.err
StdIn=/dev/null
StdOut=/home/username/R/slurm.out
Power=
MailUser=(null) MailType=NONE
Some useful links:
SLURM Job Submission with R, Python, Bash
How to use R on the Bioinformatics cluster
Running R parallel
Unfortunately, R is not very efficient when running on an HPC cluster: by default, every R instance runs on only one core. To make your R program run in parallel and more efficiently, we have, for now, installed the following libraries:
- Rmpi
- snow
- snowfall
- parallel
Loading one of these libraries does not by itself make your program run in parallel; for that, you have to adapt your R program. An Rmpi example is shown below, followed by a single-node sketch using the parallel package.
Example
hello.R
library(Rmpi)
id <- mpi.comm.rank(comm = 0)
np <- mpi.comm.size(comm = 0)
hostname <- mpi.get.processor.name()
msg <- sprintf("Hello world from process %03d of %03d, on host %s\n", id, np, hostname)
cat(msg)
mpi.barrier(comm = 0)
mpi.finalize()
run-rmpi.slurm
#!/bin/bash
#SBATCH --job-name=hello_parallel # Job name
#SBATCH --output=slurm-rmpi.out # Output file name
#SBATCH --error=slurm-rmpi.err # Error file name
#SBATCH --partition=short # Partition
#SBATCH --time=00:05:00 # Time limit
#SBATCH --nodes=2 # Number of nodes
#SBATCH --ntasks-per-node=4 # MPI processes per node
module purge
module add statistical/R/4.0.0/gcc.8.3.1
module add library/mpi/openmpi/4.0.3/gcc-8.3.1
mpirun Rscript hello.R
[username@res-hpc-lo01 R]$ cat slurm-rmpi.out
Hello world from process 000 of 008, on host res-hpc-gpu01
Hello world from process 001 of 008, on host res-hpc-gpu01
Hello world from process 002 of 008, on host res-hpc-gpu01
Hello world from process 003 of 008, on host res-hpc-gpu01
Hello world from process 004 of 008, on host res-hpc-gpu02
Hello world from process 005 of 008, on host res-hpc-gpu02
Hello world from process 006 of 008, on host res-hpc-gpu02
Hello world from process 007 of 008, on host res-hpc-gpu02
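Besides Rmpi, the parallel package listed above can be used for multi-core work on a single node. A minimal sketch (the job name, output files and core count are arbitrary; mclapply forks one worker per core on the allocated node):
#!/bin/bash
#SBATCH --job-name=parallel_test     # Job name
#SBATCH --output=slurm-parallel.out  # Output file name
#SBATCH --error=slurm-parallel.err   # Error file name
#SBATCH --partition=short            # Partition
#SBATCH --time=00:05:00              # Time limit
#SBATCH --nodes=1                    # Single node
#SBATCH --cpus-per-task=4            # Cores available to mclapply
module purge
module add statistical/R/4.0.0/gcc.8.3.1
# mclapply() runs the function on the requested number of cores of this node
Rscript -e 'library(parallel); cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK")); print(unlist(mclapply(1:8, function(i) i^2, mc.cores = cores)))'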
We recommend having a look at the following web pages:
High-Performance and Parallel Computing with R
Quick Intro to Parallel Computing in R
How-to go parallel in R - basics + tips
Rmpi
Parallel Computing: Introduction to MPI
RStudio
RStudio is an integrated development environment for R.
You can run RStudio on the login node if you want (with X11 forwarding enabled, or connected via X2Go or MobaXterm):
module purge
module add statistical/RStudio/1.3.959/gcc-8.3.1
rstudio
RStudio on a compute node
You can also start RStudio on a compute node:
[username@res-hpc-lo01 ~]$ srun --x11 --pty bash
[username@res-hpc-exe014 ~]$ module purge
[username@res-hpc-exe014 ~]$ module add statistical/RStudio/1.3.959/gcc-8.3.1
[username@res-hpc-exe014 ~]$ rstudio
RStudio on the OOD (Open OnDemand portal)
You can also start an RStudio server from the OOD portal (see the Open OnDemand section below).
FSLeyes
See: Programs that can run with "vglrun"
Python
As a researcher, student, scientist or health care worker, there is a big chance you will have to work with the programming language Python; it is more or less the de facto programming language in the research world.
By default, Python version 3.6.8 is installed as part of the operating system (CentOS 8.X).
python --version
Python 3.6.8
If you need another version of Python (older or newer), you can load it with the module command (module add).
Python versions
We have the following extra Python versions installed on the cluster:
- 2.7.17
- 3.7.6
- 3.8.1
You can load one of these with:
module add system/python/2.7.17
module add system/python/3.7.6
module add system/python/3.8.1
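After loading one of these modules you can verify which interpreter you get; depending on the module the command is python or python3, and the output should report the loaded version (shown here for the 3.8.1 module as an example):
module add system/python/3.8.1
python3 --version
Python 3.8.1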
Installing Python packages
Some Python packages are already installed on the cluster and can be loaded/used with the "module load" command. If you need an extra Python package, you can install it yourself with pip.
pip
For Python version 3 you should use the pip3 command. For Python version 2 you should use the pip2 command.
First, edit your pip config file:
$HOME/.config/pip/pip.conf
[list]
format=columns
Useful commands:
pip install packageName
pip uninstall packageName
pip search packageName
pip help
pip install --help
As a normal user (when you are not running in a virtual Python environment) you should always install Python packages with the command:
pip install --user
Do not use pip install --user inside a virtual environment, otherwise the virtual environment's pip will be confused.
Example
pip3 install nibabel --user
Collecting nibabel
Downloading https://files.pythonhosted.org/packages/8b/8c/cf676b9b3cf69164ba0703a9dcb86ed895ab172e09bece4480db4f03fcce/nibabel-3.1.1-py3-none-any.whl (3.3MB)
100% |████████████████████████████████| 3.3MB 200kB/s
Collecting packaging>=14.3 (from nibabel)
Downloading https://files.pythonhosted.org/packages/46/19/c5ab91b1b05cfe63cccd5cfc971db9214c6dd6ced54e33c30d5af1d2bc43/packaging-20.4-py2.py3-none-any.whl
Requirement already satisfied: numpy>=1.13 in /usr/local/lib64/python3.6/site-packages (from nibabel)
Requirement already satisfied: six in /usr/lib/python3.6/site-packages (from packaging>=14.3->nibabel)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/lib/python3.6/site-packages (from packaging>=14.3->nibabel)
Installing collected packages: packaging, nibabel
Successfully installed nibabel-3.1.1 packaging-20.4
pip3 show nibabel
Name: nibabel
Version: 3.1.1
Summary: Access a multitude of neuroimaging data formats
Home-page: https://nipy.org/nibabel
Author: nibabel developers
Author-email: neuroimaging@python.org
License: MIT License
Location: /home/username/.local/lib/python3.6/site-packages
Requires: numpy, packaging
pip3 list
Package Version
------------------------ ------------
...
nibabel 3.1.1
...
As you can see from the example above, the Python package(s) will be installed in your local user environment:
Location: /home/username/.local/lib/python3.6/site-packages
Python virtual environments
If you are working on different projects and each project needs different Python packages, it is better to work in a dedicated virtual environment.
Once you activate a virtual environment, you can use the pip command (without the --user option) and other commands as usual; everything is installed inside that environment.
You create a new virtual environment with one of the following commands:
- $ virtualenv /path/to/new/virtual/environmentname
- $ python3 -m venv /path/to/new/virtual/environmentname
You activate a new virtual environment with the command:
- $ source /path/to/new/virtual/environmentname/bin/activate
You deactivate a virtual environment with the command (it will not be destroyed):
- (envname) $ deactivate
Example:
virtualenv /exports/example/projects/Project-A
Using base prefix '/usr'
New python executable in /exports/example/projects/Project-A/bin/python3.6
Also creating executable in /exports/example/projects/Project-A/bin/python
Installing setuptools, pip, wheel...done.
[username@res-hpc-lo01 ~]$ source /exports/example/projects/Project-A/bin/activate
(Project-A) [username@res-hpc-lo01 ~]$
(Project-A) [username@res-hpc-lo01 python3.6]$ pip3 list
Package Version
---------- -------
pip 20.1.1
setuptools 49.1.0
wheel 0.34.2
(Project-A) [username@res-hpc-lo01 ~]$ deactivate
[username@res-hpc-lo01 ~]$
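Inside an activated virtual environment you can install packages with plain pip (no --user needed); they end up under the environment directory. A sketch, reusing the nibabel package from the pip example above (the exact site-packages path depends on the Python version the environment was created with):
(Project-A) [username@res-hpc-lo01 ~]$ pip3 install nibabel
(Project-A) [username@res-hpc-lo01 ~]$ pip3 show nibabel | grep Location
Location: /exports/example/projects/Project-A/lib/python3.6/site-packages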
To remove your Python virtual environment, simply delete the virtual environment directory:
- $ rm -Rf /path/to/virtual/environmentname
Conda, Anaconda, Miniconda and Bioconda
If you have to install, set up and work with a complex program/project, you should make use of the conda tool. Conda itself is a package management system, while Anaconda, Miniconda and Bioconda provide you with a virtual Python environment and a lot of optimized Python packages, especially for researchers and scientists. These packages can easily be installed within this environment.
- Anaconda - collection with the most packages (> 7,500 data science and machine learning packages)
- Miniconda - lightweight Anaconda version (you should start with this version)
- Bioconda - specializing in bio-informatics software
Conda
Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.
Anaconda
Anaconda is a package manager, an environment manager, a Python/R data science distribution, and a collection of over 7,500+ open-source packages.
Miniconda
Miniconda is a free minimal installer for conda. It is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip, zlib and a few others. Use the conda install command to install 720+ additional conda packages from the Anaconda repository.
Bioconda
Bioconda is a channel for the conda package manager specializing in bioinformatics software.
Overview useful commands
Description | Command |
---|---|
Verify Conda is installed, check version number | conda info |
Create a new environment named ENVNAME | conda create --name ENVNAME |
Activate a named Conda environment | conda activate ENVNAME |
Deactivate current environment | conda deactivate |
List all packages and versions in the active environment | conda list |
Delete an entire environment | conda remove --name ENVNAME --all |
Search for a package in currently configured channels | conda search PKGNAME |
Install a package | conda install PKGNAME |
Detailed information about package versions | conda search PKGNAME --info |
Remove a package from an environment | conda uninstall PKGNAME --name ENVNAME |
Add a channel to your Conda configuration | conda config --add channels CHANNELNAME |
Example
module purge
module add tools/miniconda/python3.7/4.7.12
conda info
active environment : None
shell level : 0
user config file : /home/username/.condarc
populated config files : /home/username/.condarc
conda version : 4.7.12
conda-build version : not installed
python version : 3.7.4.final.0
virtual packages :
base environment : /share/software/tools/miniconda/3.7/4.7.12 (read only)
channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /share/software/tools/miniconda/3.7/4.7.12/pkgs
/home/username/.conda/pkgs
envs directories : /home/username/.conda/envs
/share/software/tools/miniconda/3.7/4.7.12/envs
platform : linux-64
user-agent : conda/4.7.12 requests/2.22.0 CPython/3.7.4 Linux/4.18.0-147.8.1.el8_1.x86_64 centos/8.1.1911 glibc/2.28
UID:GID : 225812:513
netrc file : None
offline mode : False
conda create --name Project-B
Collecting package metadata (current_repodata.json): done
Solving environment: done
## Package Plan ##
environment location: /home/username/.conda/envs/Project-B
Proceed ([y]/n)? y
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
# $ conda activate Project-B
#
# To deactivate an active environment, use
#
# $ conda deactivate
conda init bash
no change /share/software/tools/miniconda/3.7/4.7.12/condabin/conda
no change /share/software/tools/miniconda/3.7/4.7.12/bin/conda
no change /share/software/tools/miniconda/3.7/4.7.12/bin/conda-env
no change /share/software/tools/miniconda/3.7/4.7.12/bin/activate
no change /share/software/tools/miniconda/3.7/4.7.12/bin/deactivate
no change /share/software/tools/miniconda/3.7/4.7.12/etc/profile.d/conda.sh
no change /share/software/tools/miniconda/3.7/4.7.12/etc/fish/conf.d/conda.fish
no change /share/software/tools/miniconda/3.7/4.7.12/shell/condabin/Conda.psm1
no change /share/software/tools/miniconda/3.7/4.7.12/shell/condabin/conda-hook.ps1
no change /share/software/tools/miniconda/3.7/4.7.12/lib/python3.7/site-packages/xontrib/conda.xsh
no change /share/software/tools/miniconda/3.7/4.7.12/etc/profile.d/conda.csh
modified /home/username/.bashrc
==> For changes to take effect, close and re-open your current shell. <==
[username@res-hpc-lo01 ~]$ conda activate Project-B
(Project-B) [username@res-hpc-lo01 ~]$
conda search beautifulsoup4
Loading channels: done
# Name Version Build Channel
beautifulsoup4 4.6.0 py27_1 pkgs/main
...
beautifulsoup4 4.9.1 py38_0 pkgs/main
conda install beautifulsoup4
Collecting package metadata (current_repodata.json): done
Solving environment: done
## Package Plan ##
environment location: /home/username/.conda/envs/Project-B
added / updated specs:
- beautifulsoup4
The following packages will be downloaded:
package | build
---------------------------|-----------------
beautifulsoup4-4.9.1 | py38_0 171 KB
ca-certificates-2020.6.24 | 0 125 KB
certifi-2020.6.20 | py38_0 156 KB
libedit-3.1.20191231 | h7b6447c_0 167 KB
libffi-3.3 | he6710b0_2 50 KB
ncurses-6.2 | he6710b0_1 817 KB
openssl-1.1.1g | h7b6447c_0 2.5 MB
pip-20.1.1 | py38_1 1.7 MB
python-3.8.3 | hcff3b4d_2 49.1 MB
readline-8.0 | h7b6447c_0 356 KB
setuptools-47.3.1 | py38_0 515 KB
soupsieve-2.0.1 | py_0 33 KB
sqlite-3.32.3 | h62c20be_0 1.1 MB
tk-8.6.10 | hbc83047_0 3.0 MB
xz-5.2.5 | h7b6447c_0 341 KB
------------------------------------------------------------
Total: 60.1 MB
The following NEW packages will be INSTALLED:
_libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
beautifulsoup4 pkgs/main/linux-64::beautifulsoup4-4.9.1-py38_0
ca-certificates pkgs/main/linux-64::ca-certificates-2020.6.24-0
certifi pkgs/main/linux-64::certifi-2020.6.20-py38_0
ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.33.1-h53a641e_7
libedit pkgs/main/linux-64::libedit-3.1.20191231-h7b6447c_0
libffi pkgs/main/linux-64::libffi-3.3-he6710b0_2
libgcc-ng pkgs/main/linux-64::libgcc-ng-9.1.0-hdf63c60_0
libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-9.1.0-hdf63c60_0
ncurses pkgs/main/linux-64::ncurses-6.2-he6710b0_1
openssl pkgs/main/linux-64::openssl-1.1.1g-h7b6447c_0
pip pkgs/main/linux-64::pip-20.1.1-py38_1
python pkgs/main/linux-64::python-3.8.3-hcff3b4d_2
readline pkgs/main/linux-64::readline-8.0-h7b6447c_0
setuptools pkgs/main/linux-64::setuptools-47.3.1-py38_0
soupsieve pkgs/main/noarch::soupsieve-2.0.1-py_0
sqlite pkgs/main/linux-64::sqlite-3.32.3-h62c20be_0
tk pkgs/main/linux-64::tk-8.6.10-hbc83047_0
wheel pkgs/main/linux-64::wheel-0.34.2-py38_0
xz pkgs/main/linux-64::xz-5.2.5-h7b6447c_0
zlib pkgs/main/linux-64::zlib-1.2.11-h7b6447c_3
Proceed ([y]/n)? y
Downloading and Extracting Packages
libedit-3.1.20191231 | 167 KB | #################################### | 100%
sqlite-3.32.3 | 1.1 MB | #################################### | 100%
readline-8.0 | 356 KB | #################################### | 100%
pip-20.1.1 | 1.7 MB | #################################### | 100%
python-3.8.3 | 49.1 MB | #################################### | 100%
certifi-2020.6.20 | 156 KB | #################################### | 100%
ncurses-6.2 | 817 KB | #################################### | 100%
ca-certificates-2020 | 125 KB | #################################### | 100%
setuptools-47.3.1 | 515 KB | #################################### | 100%
xz-5.2.5 | 341 KB | #################################### | 100%
openssl-1.1.1g | 2.5 MB | #################################### | 100%
libffi-3.3 | 50 KB | #################################### | 100%
soupsieve-2.0.1 | 33 KB | #################################### | 100%
beautifulsoup4-4.9.1 | 171 KB | #################################### | 100%
tk-8.6.10 | 3.0 MB | #################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
conda list
# packages in environment at /home/username/.conda/envs/Project-B:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
beautifulsoup4 4.9.1 py38_0
ca-certificates 2020.6.24 0
certifi 2020.6.20 py38_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h7b6447c_0
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libstdcxx-ng 9.1.0 hdf63c60_0
ncurses 6.2 he6710b0_1
openssl 1.1.1g h7b6447c_0
pip 20.1.1 py38_1
python 3.8.3 hcff3b4d_2
readline 8.0 h7b6447c_0
setuptools 47.3.1 py38_0
soupsieve 2.0.1 py_0
sqlite 3.32.3 h62c20be_0
tk 8.6.10 hbc83047_0
wheel 0.34.2 py38_0
xz 5.2.5 h7b6447c_0
zlib 1.2.11 h7b6447c_3
conda uninstall -y beautifulsoup4
Collecting package metadata (repodata.json): done
Solving environment: done
## Package Plan ##
environment location: /home/username/.conda/envs/Project-B
removed specs:
- beautifulsoup4
The following packages will be REMOVED:
beautifulsoup4-4.9.1-py38_0
soupsieve-2.0.1-py_0
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(Project-B) [username@res-hpc-lo01 ~]$ conda deactivate
[username@res-hpc-lo01 ~]$
conda remove --name Project-B --all -y
Remove all packages in environment /home/username/.conda/envs/Project-B:
## Package Plan ##
environment location: /home/username/.conda/envs/Project-B
The following packages will be REMOVED:
_libgcc_mutex-0.1-main
ca-certificates-2020.6.24-0
certifi-2020.6.20-py38_0
ld_impl_linux-64-2.33.1-h53a641e_7
libedit-3.1.20191231-h7b6447c_0
libffi-3.3-he6710b0_2
libgcc-ng-9.1.0-hdf63c60_0
libstdcxx-ng-9.1.0-hdf63c60_0
ncurses-6.2-he6710b0_1
openssl-1.1.1g-h7b6447c_0
pip-20.1.1-py38_1
python-3.8.3-hcff3b4d_2
readline-8.0-h7b6447c_0
setuptools-47.3.1-py38_0
sqlite-3.32.3-h62c20be_0
tk-8.6.10-hbc83047_0
wheel-0.34.2-py38_0
xz-5.2.5-h7b6447c_0
zlib-1.2.11-h7b6447c_3
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Open OnDemand [OOD]
Open OnDemand [OOD] provides an integrated, single access point for all of your HPC resources.
This will give you a web interface with the following options:
- Files
- Home directory/File Explorer
- Jobs
- Active/Completed Jobs
- Job Composer
- Clusters
- Cluster Shell Access
- Interactive Apps
- Shark cluster Desktop
- RStudio Server
- Jupyter Notebook (with GPU support)