Original Shark Trac Wiki, converted to gitlab markdown, authored Nov 06, 2015 by Villerius
CheckpointingQueue.md
# Using the checkpointing queue
----
There is one subordinate queue available, called subordinate.q.
This queue has 368 slots available (that is, all slots).
This queue offers only check-pointing, so there is no parallel environment.
This can be used to circumvent the 55 slots-per-user limit when many slots are available and you have a lot of short jobs
(where restarting from the beginning after a suspension is okay) or specially prepared jobs that can restart from a saved state.
Jobs will get suspended and rescheduled when another queue requests the slots or when a blade crashes.
Check-pointing makes sure that when your job gets suspended, it is stopped immediately and rescheduled.
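
A job that can restart from a saved state could be structured roughly as in the sketch below. It is only an illustration: the function name `run_steps` and the state-file layout are assumptions, not something provided on shark. Each finished step is recorded in a state file, and a rescheduled run continues after the last recorded step instead of starting over.

```shell
#!/bin/sh
# Hypothetical restartable job logic (names are illustrative, not part of shark).
# Usage: run_steps STATE_FILE TOTAL_STEPS
# The body runs in a subshell, so its variables do not leak to the caller.
run_steps() (
    state_file=$1
    total=$2
    start=1

    # If an earlier run left a non-empty state file, resume after that step.
    if [ -s "$state_file" ]; then
        start=$(( $(cat "$state_file") + 1 ))
    fi

    i=$start
    while [ "$i" -le "$total" ]; do
        # ... the real work for step $i would go here ...
        echo "step $i"

        # Write to a temporary file first and rename it, so an interruption
        # never leaves a half-written state file behind.
        echo "$i" > "$state_file.tmp" && mv "$state_file.tmp" "$state_file"
        i=$((i + 1))
    done
)
```

In a job script this might be called as `run_steps "/home/checkpoint/$USER/myjob.state" 100`, where the file name under /home/checkpoint is a made-up example.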
The types of check-pointing available on shark are:

| Type | OGS `-ckpt` name |
| --- | --- |
| user defined interface | check_userdefined |
| transparent interface | check_transparent |

Please read the following on check-pointing (includes examples):
http://gridscheduler.sourceforge.net/howto/checkpointing.html
The all.q queue has a subordination list that is defined as:
subordinate_list NONE,[@24_core=subordinate.q],[@16_core=subordinate.q],[@12_core=subordinate.q],[@8_core=subordinate.q]
For example, all.q can use 16 cores per blade on the 16-slot blades. If subordinate.q jobs are running there and all.q needs the slots,
the subordinate.q slots get suspended, which triggers check-pointing to reschedule the complete job.
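
To make the list easier to read: the first entry (`NONE`) is the default, and each bracketed entry overrides it for one host group. The sketch below pulls out the host groups on which subordinate.q is suspended by all.q. The function name `subordinated_hostgroups` is made up for this example, and it assumes the list is passed in as a single comma-separated string in the format shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the host groups for which the given
# subordinate_list value names subordinate.q as the subordinated queue.
subordinated_hostgroups() {
    # Split the list on commas, then keep only entries of the form
    # [@hostgroup=subordinate.q] and print the @hostgroup part.
    printf '%s\n' "$1" |
        tr ',' '\n' |
        sed -n 's/^\[\(@[^=]*\)=subordinate\.q]$/\1/p'
}

# The value as quoted on this page.
list='NONE,[@24_core=subordinate.q],[@16_core=subordinate.q],[@12_core=subordinate.q],[@8_core=subordinate.q]'
subordinated_hostgroups "$list"
```

On the cluster itself, the current value can be looked up with `qconf -sq all.q`.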
A checkpoint directory has been created for now: /home/checkpoint.
Everyone can write to that directory at the moment; please do not abuse this folder.
If you want to use this queue, mail Michel Villerius (M.P.Villerius@…) or Matthijs Moed (M.H.Moed@…) and we will add you to the queue.
Keep in mind that you need to give the `-ckpt <name>` option when submitting; otherwise your job will stay in the queue-wait state forever.
Simple example:
`qsub -ckpt check_transparent sleep1000.sh`
If you see your job running, try to suspend it yourself with:
`qmod -s <job id>`
You will then see the state Rq (rerunning and queued) and then Rr (rerunning). Your job will start from the beginning.
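
To spot these states without scanning the whole listing, the `qstat` output can be filtered. This is a minimal sketch that assumes the default `qstat` column layout, in which the job ID is the first column and the state is the fifth. The function name `rerun_jobs` is invented for this example.

```shell
#!/bin/sh
# Hypothetical filter: read qstat output on stdin and print the job ID and
# state of every job that is in state Rq or Rr.
# Assumes the state is in the fifth whitespace-separated column.
rerun_jobs() {
    awk '$5 == "Rq" || $5 == "Rr" { print $1, $5 }'
}
```

It could be used as `qstat -u "$USER" | rerun_jobs`. Header lines are skipped automatically because their fifth field is neither `Rq` nor `Rr`.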
More elaborate example: see the [Examples page](Examples).