# Using the checkpointing queue
----
One subordinate queue is available: subordinate.q.
This queue has 368 slots available (that is, all slots).

This queue offers only check-pointing, so it has no parallel environment.
It can be used to circumvent the 55 slots-per-user limit when many slots are free and you have either a lot of short jobs (for which restarting from the beginning after a suspend is acceptable) or specially prepared jobs that can restart from a saved state.
Jobs are suspended and rescheduled when another queue requests the slots or when a blade crashes.

Check-pointing makes sure that when your job gets suspended, it is stopped immediately and rescheduled.

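A job that "can restart from a saved state" could look roughly like the sketch below. It is only an illustration: the function name run_steps, the state-file layout and the example path are assumptions, not something provided on shark. Progress is written after every step, so a rescheduled run continues with the first step that has not been completed.

```shell
#!/bin/sh
# Sketch of a restartable job (hypothetical names, adapt to your own work).

# run_steps STATE_FILE TOTAL_STEPS
run_steps() {
    state_file=$1
    total_steps=$2

    # Resume after the last completed step if a state file already exists.
    if [ -f "$state_file" ]; then
        last_done=$(cat "$state_file")
    else
        last_done=0
    fi

    step=$((last_done + 1))
    while [ "$step" -le "$total_steps" ]; do
        echo "processing step $step"    # replace with the real work
        # Write to a temporary file and rename it, so that a suspend in the
        # middle of the write cannot leave a half-written state file.
        echo "$step" > "$state_file.tmp"
        mv "$state_file.tmp" "$state_file"
        step=$((step + 1))
    done
}

# Example call (the sub-directory and file name are assumptions):
#   run_steps "/home/checkpoint/$USER/myjob.state" 100
```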
The types of check-pointing available on shark are:

|| type || OGS -ckpt name ||
|| user defined interface || check_userdefined ||
|| transparent interface || check_transparent ||

Please read the following on check-pointing (includes examples):
http://gridscheduler.sourceforge.net/howto/checkpointing.html

The all.q queue has a subordinate list defined as:

subordinate_list NONE,[@24_core=subordinate.q],[@16_core=subordinate.q],[@12_core=subordinate.q],[@8_core=subordinate.q]

For example, all.q can use 16 cores per blade on the 16-slot blades. If subordinate.q jobs are running there and all.q needs the slots, the subordinate.q slots get suspended, which triggers check-pointing to reschedule the complete job.

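To see which host groups are affected, the subordinate list can be taken apart with a small helper. This is a sketch: list_subordinated_groups is a made-up name, and it assumes the whole subordinate_list entry is on a single line, as quoted above (qconf may wrap long values). On the cluster the input would come from `qconf -sq all.q`.

```shell
# list_subordinated_groups: read a queue configuration on stdin and print one
# "host group, subordinated queue" pair per line.
list_subordinated_groups() {
    grep '^subordinate_list' |
        grep -o '@[^=]*=[^]]*' |   # pick out each @group=queue entry
        tr '=' ' '
}

# Usage on the cluster:
#   qconf -sq all.q | list_subordinated_groups
```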
A checkpoint directory has been created for now: /home/checkpoint. Everyone can write to that directory at the moment, so please do not abuse it.

If you want to use this queue, mail Michel Villerius (M.P.Villerius@…) or Matthijs Moed (M.H.Moed@…) and we will add you to the queue.

Keep in mind that you need to give the -ckpt <name> option when submitting; otherwise your job stays in queue wait forever.

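One way to avoid forgetting the option is a small wrapper around qsub. The function below is a sketch and not something installed on shark: submit_ckpt is a made-up name, and it only accepts the two checkpoint names from the table above.

```shell
# submit_ckpt CKPT_NAME JOB_SCRIPT [ARGS...]
# Hypothetical wrapper: refuses to submit unless a known checkpoint
# interface is named, then passes everything on to qsub with -ckpt.
submit_ckpt() {
    ckpt_name=$1
    case "$ckpt_name" in
        check_userdefined|check_transparent)
            shift
            ;;
        *)
            echo "usage: submit_ckpt check_userdefined|check_transparent job.sh [args]" >&2
            return 1
            ;;
    esac
    if [ "$#" -eq 0 ]; then
        echo "submit_ckpt: no job script given" >&2
        return 1
    fi
    qsub -ckpt "$ckpt_name" "$@"
}

# Usage:
#   submit_ckpt check_transparent sleep1000.sh
```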
Simple example:

#!div class=important style="border: 2pt solid; text-align: left"
qsub -ckpt check_transparent sleep1000.sh

Once you see your job running, try to suspend it yourself with:

#!div class=important style="border: 2pt solid; text-align: left"
qmod -s <job id>

You will then see the state Rq (rerunning and queued), followed by Rr (rerunning). Your job will start from the beginning.

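To follow these state changes without reading the full listing, the state column can be filtered out of the qstat output. The helper below is a sketch: job_state is a made-up name, and it assumes the default qstat layout, where the job id is the first column and the state the fifth.

```shell
# job_state JOB_ID: read default qstat output on stdin and print the state
# column (for example r, Rq or Rr) of the given job.
job_state() {
    awk -v id="$1" '$1 == id { print $5 }'
}

# Usage on the cluster (job id 231 is a made-up example):
#   qstat -u "$USER" | job_state 231
```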
More elaborate example: see the [Examples page](Examples)