Update home authored by van Vliet
@@ -387,7 +387,94 @@ This is a resource manager and a scheduler all in one.
- resource manager: are there resources (memory, CPUs, GPUs, etc.) free on the nodes for the job?
- scheduler: when should the job run?
### Partitions (info) (SGE queues)
A job must always be submitted to a partition (queue). By default, jobs are submitted to the "all" partition unless you specify a different partition.
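The following commands are useful:
- `sinfo`: show the partitions available to you and the state of their nodes
- `sinfo -a`: also show hidden partitions and partitions that are not available to your group
- `sinfo -l`: show more detailed (long) output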
```
[user@res-hpc-lo01 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 2 idle res-hpc-exe[013-014]
gpu up infinite 1 idle res-hpc-gpu01
[user@res-hpc-lo01 ~]$ sinfo -a
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 2 idle res-hpc-exe[013-014]
gpu up infinite 1 idle res-hpc-gpu01
LKEBgpu up infinite 5 down* res-hpc-lkeb[01-05]
[user@res-hpc-lo01 ~]$ sinfo -l
Thu Jan 23 09:05:13 2020
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
all* up infinite 1-infinite no NO all 2 idle res-hpc-exe[013-014]
gpu up infinite 1-infinite no NO all 1 idle res-hpc-gpu01
```
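In the `sinfo` output, the partition marked with a trailing `*` is the default partition. As a minimal sketch, assuming the default column layout shown above, the following helper reads `sinfo` output and prints the name of the default partition. The function name `default_partition` is made up for this example and is not part of Slurm.

```shell
#!/usr/bin/env bash
# Print the default partition (the one whose name ends in "*")
# from `sinfo` output read on stdin.
# Assumption: the partition name is the first column, as in the output above.
default_partition() {
    awk 'NR > 1 && $1 ~ /\*$/ {
             sub(/\*$/, "", $1)          # strip the trailing "*"
             if (!seen[$1]++) print $1   # print each name only once
         }'
}

# Usage on the cluster (skipped when sinfo is not installed):
if command -v sinfo >/dev/null 2>&1; then
    sinfo | default_partition
fi
```

With the output above, this would print `all`.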
When jobs are running, the output looks like this:
```
[user@res-hpc-lo01 mpi-benchmarks]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 1 mix res-hpc-exe014
all* up infinite 1 alloc res-hpc-exe013
gpu up infinite 1 idle res-hpc-gpu01
[user@res-hpc-lo01 mpi-benchmarks]$ sinfo -a
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 1 mix res-hpc-exe014
all* up infinite 1 alloc res-hpc-exe013
gpu up infinite 1 idle res-hpc-gpu01
LKEBgpu up infinite 5 down* res-hpc-lkeb[01-05]
[user@res-hpc-lo01 mpi-benchmarks]$ sinfo -l
Thu Jan 23 09:16:21 2020
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
all* up infinite 1-infinite no NO all 1 mixed res-hpc-exe014
all* up infinite 1-infinite no NO all 1 allocated res-hpc-exe013
gpu up infinite 1-infinite no NO all 1 idle res-hpc-gpu01
```
- idle: no jobs are running on this node
- alloc(ated): all cores of the node are allocated to one or more jobs
- mix(ed): one or more jobs are running on the node, but some cores are still free
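To get a quick overview of how many nodes are in each of these states, you can sum the NODES column per state. The helper below is only a sketch: the name `nodes_per_state` is made up for this example, and it assumes the default six-column `sinfo` layout (PARTITION AVAIL TIMELIMIT NODES STATE NODELIST) shown above. A node that belongs to more than one partition is counted once per partition.

```shell
#!/usr/bin/env bash
# Sum the NODES column (4th) per STATE (5th) from default `sinfo` output on stdin.
# Assumption: default six-column layout; the header line is skipped.
nodes_per_state() {
    awk 'NR > 1 { count[$5] += $4 }
         END    { for (s in count) print s, count[s] }' | sort
}

# Usage on the cluster (skipped when sinfo is not installed):
if command -v sinfo >/dev/null 2>&1; then
    sinfo | nodes_per_state
fi
```

For the `sinfo` output with running jobs shown above, this would print one line each for `alloc`, `idle` and `mix`, each with a count of 1.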
### Jobs info
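With the following commands, you can get information about your own jobs and the jobs of other users:
- `squeue`: show the jobs in the queue
- `squeue -a`: show the jobs in all partitions, including hidden partitions
- `squeue -l`: show more detailed (long) output, including the time limit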
```
[user@res-hpc-lo01 mpi-benchmarks]$ squeue
JOBID PARTITION USER ST TIME NODES NODELIST(REASON)
258 all user R 0:03 2 res-hpc-exe[013-014]
[user@res-hpc-lo01 mpi-benchmarks]$ squeue -a
JOBID PARTITION USER ST TIME NODES NODELIST(REASON)
258 all user R 0:06 2 res-hpc-exe[013-014]
[user@res-hpc-lo01 mpi-benchmarks]$ squeue -l
Thu Jan 23 09:14:22 2020
JOBID PARTITION USER STATE TIME TIME_LIMIT NODES NODELIST(REASON)
258 all user RUNNING 0:12 30:00 2 res-hpc-exe[013-014]
```
Jobs typically pass through several states in the course of their execution.
The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED.
An explanation of some of the states follows:
| Code | State | Explanation |
| --- | --- | --- |
| CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
| CD | COMPLETED | Job has terminated all processes on all nodes with an exit code of zero. |
| CG | COMPLETING | Job is in the process of completing. Some processes on some nodes may still be active. |
| F | FAILED | Job terminated with non-zero exit code or other failure condition. |
| PD | PENDING | Job is awaiting resource allocation. |
| R | RUNNING | Job currently has an allocation. |
| S | SUSPENDED | Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. |
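The table above can be turned into a small lookup for use in your own scripts. This is a minimal sketch: the function name `state_name` is made up for this example, and it only knows the codes listed in the table. It uses the `squeue` options `-h` (no header), `-u` (filter by user) and `-o` (output format, where `%i` is the job ID and `%t` the compact state code).

```shell
#!/usr/bin/env bash
# Map a compact squeue state code to its full name.
# Only the codes from the table above are covered; anything else is reported as unknown.
state_name() {
    case "$1" in
        CA) echo CANCELLED ;;
        CD) echo COMPLETED ;;
        CG) echo COMPLETING ;;
        F)  echo FAILED ;;
        PD) echo PENDING ;;
        R)  echo RUNNING ;;
        S)  echo SUSPENDED ;;
        *)  echo "UNKNOWN($1)" ;;
    esac
}

# Usage on the cluster (skipped when squeue is not installed):
# list your own jobs with the full state name.
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$USER" -o '%i %t' | while read -r jobid code; do
        echo "$jobid $(state_name "$code")"
    done
fi
```

If you only need the full state name in the listing, `squeue -o '%i %T'` prints it directly; the lookup is mainly useful when you already have the compact codes.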
### Submitting jobs
......
......