This is a resource manager and a scheduler all in one.

- resource manager: are there resources free (memory, CPUs, GPUs, etc.) on the nodes for the job?
- scheduler: when should the job run?

### Partitions (info) (SGE queues)

A job must always be submitted to a partition (queue). By default, jobs are submitted to the "all" partition unless you specify a different partition.

The following commands are useful:
- sinfo
- sinfo -a
- sinfo -l

```
[user@res-hpc-lo01 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 2 idle res-hpc-exe[013-014]
gpu up infinite 1 idle res-hpc-gpu01

[user@res-hpc-lo01 ~]$ sinfo -a
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 2 idle res-hpc-exe[013-014]
gpu up infinite 1 idle res-hpc-gpu01
LKEBgpu up infinite 5 down* res-hpc-lkeb[01-05]

[user@res-hpc-lo01 ~]$ sinfo -l
Thu Jan 23 09:05:13 2020
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
all* up infinite 1-infinite no NO all 2 idle res-hpc-exe[013-014]
gpu up infinite 1-infinite no NO all 1 idle res-hpc-gpu01
```
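
The NODELIST column compresses host names into a bracket expression such as `res-hpc-exe[013-014]`. The following is a minimal Python sketch of how such an expression can be expanded into individual host names. The function name `expand_nodelist` is hypothetical, and the sketch assumes a single bracket group with comma-separated numbers or ranges; on a real cluster, `scontrol show hostnames` is the more robust choice.

```python
import re


def expand_nodelist(expr):
    """Expand a simple node expression such as 'res-hpc-exe[013-014]'."""
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", expr)
    if match is None:
        # No bracket part: the expression is already a single host name.
        return [expr]
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            start, end = part.split("-")
            width = len(start)  # preserve zero padding, e.g. 013
            for number in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{number:0{width}d}")
        else:
            hosts.append(prefix + part)
    return hosts


print(expand_nodelist("res-hpc-exe[013-014]"))
print(expand_nodelist("res-hpc-gpu01"))
```

An expression without brackets is returned unchanged as a one-element list, so the function can be applied to every NODELIST value in the output above.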

When jobs are running, the output looks like this:

```
[user@res-hpc-lo01 mpi-benchmarks]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 1 mix res-hpc-exe014
all* up infinite 1 alloc res-hpc-exe013
gpu up infinite 1 idle res-hpc-gpu01

[user@res-hpc-lo01 mpi-benchmarks]$ sinfo -a
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 1 mix res-hpc-exe014
all* up infinite 1 alloc res-hpc-exe013
gpu up infinite 1 idle res-hpc-gpu01
LKEBgpu up infinite 5 down* res-hpc-lkeb[01-05]

[user@res-hpc-lo01 mpi-benchmarks]$ sinfo -l
Thu Jan 23 09:16:21 2020
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
all* up infinite 1-infinite no NO all 1 mixed res-hpc-exe014
all* up infinite 1-infinite no NO all 1 allocated res-hpc-exe013
gpu up infinite 1-infinite no NO all 1 idle res-hpc-gpu01
```

The node states in the STATE column mean:
- idle: no jobs are running on this node
- alloc(ated): the whole node is allocated to one or more jobs
- mix(ed): one or more jobs are running on the node, but some cores are still free

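To get a quick overview of how many nodes are in each state, the `sinfo` output can be tallied with a few lines of Python. This is only a sketch: `nodes_per_state` is a hypothetical helper, the sample text is copied from the output above, and the parsing assumes the default whitespace-separated columns with a `NODES` and a `STATE` header.

```python
from collections import Counter

# Sample text copied from the sinfo output shown earlier in this section.
SAMPLE = """\
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 1 mix res-hpc-exe014
all* up infinite 1 alloc res-hpc-exe013
gpu up infinite 1 idle res-hpc-gpu01
"""


def nodes_per_state(sinfo_text):
    """Sum the NODES column per STATE value."""
    lines = sinfo_text.strip().splitlines()
    header = lines[0].split()
    nodes_col = header.index("NODES")
    state_col = header.index("STATE")
    counts = Counter()
    for line in lines[1:]:
        fields = line.split()
        counts[fields[state_col]] += int(fields[nodes_col])
    return dict(counts)


print(nodes_per_state(SAMPLE))
```

Note that a node belonging to more than one partition appears on several lines and would be counted once per partition by this sketch.
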
### Jobs info

The following commands show information about your own jobs and the jobs of other users:
- squeue
- squeue -a
- squeue -l

```
[user@res-hpc-lo01 mpi-benchmarks]$ squeue
JOBID PARTITION USER ST TIME NODES NODELIST(REASON)
258 all user R 0:03 2 res-hpc-exe[013-014]

[user@res-hpc-lo01 mpi-benchmarks]$ squeue -a
JOBID PARTITION USER ST TIME NODES NODELIST(REASON)
258 all user R 0:06 2 res-hpc-exe[013-014]

[user@res-hpc-lo01 mpi-benchmarks]$ squeue -l
Thu Jan 23 09:14:22 2020
JOBID PARTITION USER STATE TIME TIME_LIMIT NODES NODELIST(REASON)
258 all user RUNNING 0:12 30:00 2 res-hpc-exe[013-014]
```

Jobs typically pass through several states in the course of their execution.
The typical states are PENDING, RUNNING, SUSPENDED, COMPLETING, and COMPLETED.
An explanation of some of the states follows:

| State | State (full) | Explanation |
| --- | --- | --- |
| CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
| CD | COMPLETED | Job has terminated all processes on all nodes with an exit code of zero. |
| CG | COMPLETING | Job is in the process of completing. Some processes on some nodes may still be active. |
| F | FAILED | Job terminated with non-zero exit code or other failure condition. |
| PD | PENDING | Job is awaiting resource allocation. |
| R | RUNNING | Job currently has an allocation. |
| S | SUSPENDED | Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. |

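The short codes in the `ST` column of `squeue` can be translated back to the full state names from the table. The sketch below is illustrative only: `describe_jobs` and `JOB_STATES` are hypothetical names, the mapping covers just the states listed in the table, and the parsing assumes the column layout shown in the sample output above, with no spaces inside any field.

```python
# Short state codes and full names, taken from the table above.
JOB_STATES = {
    "CA": "CANCELLED",
    "CD": "COMPLETED",
    "CG": "COMPLETING",
    "F": "FAILED",
    "PD": "PENDING",
    "R": "RUNNING",
    "S": "SUSPENDED",
}

# Sample text copied from the squeue output shown earlier in this section.
SAMPLE = """\
JOBID PARTITION USER ST TIME NODES NODELIST(REASON)
258 all user R 0:03 2 res-hpc-exe[013-014]
"""


def describe_jobs(squeue_text):
    """Return one dict per job, with the full state name added as STATE."""
    lines = squeue_text.strip().splitlines()
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        # Fall back to the raw code for states not listed in the table.
        row["STATE"] = JOB_STATES.get(row["ST"], row["ST"])
        jobs.append(row)
    return jobs


for job in describe_jobs(SAMPLE):
    print(job["JOBID"], job["STATE"], job["NODELIST(REASON)"])
```

Codes that are not in the table are passed through unchanged rather than raising an error, since Slurm defines more states than the ones listed here.
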
### Submitting jobs