You can log in to the Shark cluster head node with your user name and password.
|
|
To submit jobs, we use the `qsub` command. `qsub` requires a file (a script) which describes what needs to be run and in what way.
|
|
|
The example script that we want to execute:
|
|
|
|
|
|
|
|
|
Save this example as **my_first_job.sh**, or get the examples with [git](https://git.lumc.nl/shark/SHARK.git):
|
|
|
|
|
|
````
|
|
|
cd ~
|
|
|
|
|
|
git clone https://git.lumc.nl/shark/SHARK.git
|
|
|
````
|
|
|
Once your local repository is created you only have to keep it up to date. While in the repository directory:
|
|
|
|
|
|
|
|
|
`git pull`
|
|
|
|
|
|
my_first_job.sh:
|
|
|
|
|
|
|
|
|
|
|
|
````
|
|
|
#!/bin/bash
|
|
|
echo 'Starting job...'
|
|
|
sleep 10
|
|
|
echo '10 seconds, end of script.'
|
|
|
````
|
|
|
You can submit this script directly with `qsub`:
|
|
|
|
|
|
|
|
|
|
|
|
`qsub my_first_job.sh`
|
|
|
|
|
|
|
|
|
To use qsub options inside your script, use the following example; save it as **run_my_first_job.sh**:
|
|
|
````
|
|
|
#!/bin/bash
|
|
|
#$ -S /bin/bash
|
|
|
#$ -q all.q
|
|
|
#$ -N my_first_job
|
|
|
#$ -l h_vmem=1G
|
|
|
#$ -cwd
|
|
|
#$ -j Y
|
|
|
#$ -V
|
|
|
#$ -m be
|
|
|
#$ -M email@address.lumc
|
|
|
|
|
|
echo Start time : `date`
|
|
|
/home/user/my_first_job.sh
|
|
|
echo 'Starting job...'
|
|
|
sleep 10
|
|
|
echo '10 seconds, end of script.'
|
|
|
echo End time : `date`
|
|
|
````
|
|
|
|
|
|
Every line starting with `#$` is a parameter for the scheduler (Open Grid Scheduler, OGS).
|
|
|
|
|
|
The options explained:
|
|
|
|
|
|
|option|explanation|
|----------|-----------|
|-S|Defines the shell used to run the job|
|-q|The 'sub cluster' that your job will go to (use all.q unless the admin tells you otherwise)|
|-N|Your job name (cannot start with a number)|
|-l h_vmem=1G|Sets the maximum amount of memory that can be used by your job (per slot)|
|-cwd|Runs the job from, and writes its output to, the directory you submitted the job from|
|-j Y|The standard error and the standard output of the batch job are joined together|
|-V|All environment variables of the submitting process are exported to the context of the batch job|
|-m be|Email the user when the job starts (b = begin) and when it ends (e = end); can also be just -m e|
|-M|The email address the info is sent to|
|-pe BWA x|The number of slots reserved for a job, where x is the number of slots. Important when tools are told to use multiple threads, e.g. if you run an alignment tool with 8 threads, make sure you use this option to request 8 slots|
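
All of these options can also be given on the `qsub` command line instead of (or in addition to) the `#$` lines in the script; command-line options normally take precedence. A minimal sketch, where the job name, memory limit and email address are placeholder values:

````
# submit my_first_job.sh with the same options given on the command line
# (job name, memory limit and email address below are example values)
qsub -q all.q -N my_first_job -l h_vmem=1G -cwd -V -m be -M email@address.lumc my_first_job.sh
````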
|
|
|
|
|
|
We can submit our job as:
|
|
|
|
|
|
`qsub ./run_my_first_job.sh`
|
|
|
````
|
|
|
Your job 1517 ("my_first_job") has been submitted
|
|
|
````
|
|
|
|
|
|
Your job will get a number, which you need in order to track its progress, inspect errors, or cancel it. Once submitted you can check the status of your job with the `qstat` command:
|
|
|
|
|
|
|
|
|
`qstat`
|
|
|
````
|
|
|
job-ID prior name user state submit/start at queue slots ja-task-ID
|
|
|
|
|
|
------------------------------------------------------------------------------------------------------
|
|
|
   1517 0.00000 my_first_j username     qw    06/09/2010 13:24:15                                    1
|
|
|
````
|
|
|
|
|
|
|
|
|
`qstat -ext`
|
|
|
|
|
|
The `-ext` flag displays additional information for each job related to the job ticket policy scheme. Note that the first job below uses 24 slots; this can be achieved with the `-pe BWA 24` flag.
|
|
|
|
|
|
|
|
|
````
|
|
|
job-ID prior ntckts name user department state cpu mem io tckts ovrts otckt ftckt stckt share queue slots
|
|
|
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
|
|
|
|
|
|
|
|
|
107024 0.60500 0.50000 j_B2Fq_AD0 username LGTC r 0:11:31:12 1089.99432 703.01516 0 0 0 0 0 0.00 all.q@baskingshark.clus 24
|
|
|
107029 0.50500 0.50000 j_42V9B_62 username LGTC r 0:00:00:02 0.07282 0.16434 0 0 0 0 0 0.00 all.q@dogfishshark.cluster 1
|
|
|
107030 0.50500 0.50000 j_ACTTGA_A username LGTC r 0:00:03:04 2.43385 3.73068 0 0 0 0 0 0.00 all.q@whaleshark.cluster.l 1
|
|
|
````
|
|
|
Below **'state'** you can read **'qw'**, which means the job is waiting in the queue. Once the head node finds available resources this will change to **'r'**, which means running. Other states can be **'d'** for a job that is being deleted or **'E'** when the job is in an error state.
|
|
|
|
|
|
## Sun Grid Engine job states
|
|
|
|state|explanation|
|-----|-----------|
|qw|job is waiting in the queue|
|r|job is currently running|
|t|job is being transferred to the compute nodes|
|s or S|job is suspended, probably due to higher-priority jobs|
|h|job is on hold, probably due to sysadmin action|
|E|submission is in error state, use `qstat -j <job_id>` to find out why (see the example below)|
|d|job is being deleted|
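
For example, to find out why a job is in the error state (or to inspect the full details of any job), ask the scheduler for the job information with `qstat -j`; the job number below is simply the one used earlier on this page:

````
# show the full job details for job 1517, including the reason for an 'E' state
qstat -j 1517
````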
|
|
|
|
|
|
With the `qhost` command you can check the memory usage on the different nodes:
|
|
|
````
|
|
|
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
|
|
|
-------------------------------------------------------------------------------
|
|
|
global - - - - - - -
|
|
|
angelshark linux-x64 8 1.01 31.4G 30.1G 93.1G 29.9G
|
|
|
baskingshark linux-x64 24 8.32 251.9G 9.1G 93.1G 263.4M
|
|
|
blacktipshark linux-x64 8 0.03 31.4G 448.4M 93.1G 111.6M
|
|
|
blueshark linux-x64 16 13.88 125.9G 15.6G 93.1G 267.5M
|
|
|
camouflageshark linux-x64 16 5.19 62.9G 2.3G 93.1G 388.9M
|
|
|
caribbeanshark linux-x64 8 4.81 31.4G 9.9G 93.1G 32.5M
|
|
|
catshark linux-x64 24 8.54 125.9G 10.9G 93.1G 191.4M
|
|
|
chimerashark linux-x64 60 21.06 3023.9G 107.0G 0.0 0.0
|
|
|
cowshark linux-x64 16 5.69 125.9G 3.8G 93.1G 345.7M
|
|
|
dogfishshark linux-x64 12 12.02 94.4G 33.7G 93.1G 3.1G
|
|
|
elephantshark linux-x64 24 12.08 188.9G 3.9G 93.1G 5.0G
|
|
|
epauletteshark linux-x64 12 5.38 62.9G 2.7G 93.1G 246.4M
|
|
|
frilledshark linux-x64 12 5.35 62.9G 3.6G 93.1G 260.4M
|
|
|
goblinshark linux-x64 16 8.32 62.9G 1.9G 93.1G 276.2M
|
|
|
greatwhiteshark linux-x64 12 0.10 94.4G 1.1G 93.1G 44.2M
|
|
|
greenlandshark linux-x64 16 6.67 62.9G 11.6G 93.1G 1.9G
|
|
|
hammerheadshark linux-x64 12 5.96 94.4G 2.7G 93.1G 255.3M
|
|
|
iridescentshark linux-x64 24 9.73 125.9G 5.1G 93.1G 201.4M
|
|
|
kitefinshark linux-x64 12 5.70 62.9G 2.0G 93.1G 178.0M
|
|
|
lemonshark linux-x64 12 10.75 94.4G 12.0G 93.1G 3.5G
|
|
|
leopardshark linux-x64 24 2.44 188.9G 14.6G 93.1G 448.0M
|
|
|
makoshark linux-x64 12 11.21 125.9G 32.2G 93.1G 278.6M
|
|
|
megalodonshark linux-x64 16 7.93 62.9G 18.3G 93.1G 368.6M
|
|
|
megamouthshark linux-x64 12 4.73 94.4G 2.3G 93.1G 147.4M
|
|
|
nightshark linux-x64 12 11.79 62.9G 3.9G 93.1G 255.1M
|
|
|
pygmeshark linux-x64 12 11.85 62.9G 25.1G 93.1G 195.0M
|
|
|
reefshark linux-x64 16 1.00 125.9G 1.5G 93.1G 412.0M
|
|
|
rivershark linux-x64 8 0.06 23.5G 3.6G 24.0G 384.0K
|
|
|
sawshark linux-x64 16 9.16 62.9G 7.4G 93.1G 277.1M
|
|
|
sleepershark linux-x64 24 19.18 125.9G 2.6G 93.1G 308.4M
|
|
|
swellshark linux-x64 24 17.38 188.9G 3.2G 93.1G 316.1M
|
|
|
threshershark linux-x64 12 11.74 62.9G 14.4G 93.1G 1.1G
|
|
|
tigershark linux-x64 12 11.64 94.4G 6.9G 93.1G 234.7M
|
|
|
whaleshark linux-x64 12 5.72 94.4G 2.1G 93.1G 154.7M
|
|
|
whorltoothshark linux-x64 16 1.34 62.9G 1.7G 93.1G 830.3M
|
|
|
wobbegongshark linux-x64 12 5.20 251.9G 18.3G 93.1G 238.4M
|
|
|
zebrashark linux-x64 12 11.51 62.9G 7.3G 93.1G 123.0M
|
|
|
````
|
|
|
|
|
|
|
|
|
To delete your job you can use the `qdel` command.
|
|
|
`qdel 1517`
|
|
|
````
|
|
|
username has deleted job 1517
|
|
|
````
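
`qdel` also accepts a list of job IDs, and most Grid Engine versions can remove all jobs of a given user at once with the `-u` option; a hedged sketch (the job numbers and user name are placeholders):

````
# delete several jobs at once by their job-ID
qdel 1517 1518 1519

# delete all jobs belonging to your own account (if -u is available on your installation)
qdel -u username
````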
|
|
|
|
|
|
If your job creates files, they will be put in the working directory unless told otherwise. In this example the script only prints something to the screen. If you do not redirect this output (by adding `> output.txt` after the invocation of your script), it has to go somewhere.
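
As a minimal sketch, the call inside **run_my_first_job.sh** could redirect the output itself (the file names are just examples):

````
# capture standard output and standard error of the work script in files of your own choosing
/home/user/my_first_job.sh > output.txt 2> errors.txt
````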
|
|
|
|
|
|
|
|
|
For each job an error file and an output file will be generated if the option **#$ -j Y** is not given. You can find these files in the directory you ran your script from (if **#$ -cwd** was included in the submission script). Typically the filename will include your job name and number. The contents of these files are the standard output and standard error of your submitted script and everything that the submission script may have printed to the screen. In this case the files are called as follows:
|
|
|
````
|
|
|
my_first_job.o1517
|
|
|
my_first_job.e1517
|
|
|
````
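
You can inspect these files while the job is still running, for example by following the output file with `tail` (the job number is the one from this example):

````
# follow the job's standard output as it is being written
tail -f my_first_job.o1517
````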
|
|
|
|
|
|
Using `qsub` to submit jobs is encouraged because it makes optimal use of the cluster. Additionally, directly logging in to a node is also possible with the `qlogin` command.
|
|
|
````
|
|
|
qlogin
|
|
|
|
|
|
|
|
|
Your job 4952 ("QLOGIN") has been submitted
|
|
|
waiting for interactive job to be scheduled ...
|
|
|
Your interactive job 4952 has been successfully scheduled.
|
|
|
Establishing built in session to host blacktipshark.cluster.loc ...
|
|
|
username@blacktipshark:~$
|
|
|
````
|
|
|
|
|
|
This will open a connection to a qlogin node, reserving it until you exit. You can run jobs directly in the console just like in any other shell. The node will be scheduled for your session. To avoid the cluster being overloaded with idle qlogin sessions, your session will automatically be logged out after 12 hours. Directly logging in to a specific node is also possible with the command:
|
|
|
`qlogin -q qlogin.q@wobbegongshark`
|
|
|
````
|
|
|
Your job 20192 ("QLOGIN") has been submitted
|
|
|
waiting for interactive job to be scheduled ...
|
|
|
Your interactive job 20192 has been successfully scheduled.
|
|
|
Establishing built in session to host wobbegongshark.cluster.loc ...
|
|
|
username@wobbegongshark:~$
|
|
|
````
|
|
|
For qlogin it is also possible to reserve more than a single slot. Remember not to request more slots than are available on a single node, or your request will be denied. To reserve, for example, 6 slots, use the following command:
|
|
|
|
|
|
`qlogin -q qlogin.q@angelshark -pe BWA 6`
|
|
|
|
|
|
|
|
|
|
|
|
With the `qstat` command you can see that this job now has 6 slots reserved. Do not abuse the `-pe BWA <nr. of slots to reserve>` option by unnecessarily requesting more slots than you will use.
|
|
|
|
|
|
|
|
|
````
|
|
|
qstat
|
|
|
|
|
|
job-ID prior name user state submit/start at queue slots ja-task-ID
|
|
|
-----------------------------------------------------------------------------------------------------------------
|
|
|
22103 0.60500 QLOGIN username r 03/01/2011 13:25:20 qlogin.q@angelshark.cluster.loc 6
|
|
|
````
|
|
|
````
|
|
|
Do not abuse the -pe BWA "nr. of slots to reserve".
|
|
|
````
|
|
|
|
|
|
|
|
|
Shark has a hard memory limit of 3G per slot; if you run a parallel job with the option **-pe BWA 4** then that job may use **4*3G**. If your job exceeds the default **h_vmem=3G** limit per slot, it will be killed by the Open Grid Scheduler.
|
|
|
If your job needs more memory than the default limit, specify that as an option when you submit your job, or set the option inside your script:
|
|
|
|
|
|
|
|
|
`qsub -l h_vmem=12G test.sh`
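
Equivalently, the memory request can be placed inside the submission script itself; a minimal sketch reusing the options from the earlier example (the job name and the 12G value are placeholders):

````
#!/bin/bash
#$ -S /bin/bash
#$ -q all.q
#$ -N my_big_job
#$ -l h_vmem=12G
#$ -cwd

/home/user/my_first_job.sh
````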
|
|
|
|
|
|
|
|
|
### Checkpointing example
|
... | ... | |