Using the checkpointing queue
There is one subordinate queue available, called subordinate.q. This queue has 368 slots available (that is, all slots).
This queue offers only check-pointing, so no parallel environment is available. It can be used to circumvent the 55 slots-per-user limit when many slots are free and you have a lot of short jobs (for which restarting from the beginning after a suspension is acceptable) or specially prepared jobs that can restart from a saved state. Jobs are suspended and rescheduled when another queue requests the slots or when a blade crashes.
Check-pointing makes sure that when your job gets suspended, it is stopped immediately and rescheduled. The types of check-pointing available on shark are:
||type|| OGS -ckpt name ||
||user defined interface || check_userdefined ||
||transparent interface || check_transparent ||
Please read the following on check-pointing (includes examples): http://gridscheduler.sourceforge.net/howto/checkpointing.html
The all.q has a subordination list that is defined as:
subordinate_list NONE,[@24_core=subordinate.q],[@16_core=subordinate.q],[@12_core=subordinate.q],[@8_core=subordinate.q]
For example, all.q can use 16 cores per blade on the 16-slot blades. If subordinate.q jobs are running there and the slots are needed, the subordinate.q slots get suspended, which triggers check-pointing to reschedule the complete job.
A checkpoint directory has been created for now: /home/checkpoint. Everyone can write to that directory at the moment, so please do not abuse it.
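For jobs that should resume from a saved state instead of starting over, the job script itself has to record its progress. The following is only a minimal sketch of that idea, not a recipe specific to shark: the function names `run_steps` and `do_step`, the per-user subdirectory under /home/checkpoint, and the progress file name are all hypothetical. `JOB_ID` is the variable Grid Engine sets for a running job.

```shell
#!/bin/sh
# Sketch of a restartable job: work is split into numbered steps, and the
# number of the last finished step is stored in a state file. After a
# reschedule the loop continues with the first unfinished step.
# run_steps and do_step are illustrative names, not scheduler features.

do_step() {
    # Placeholder for the real work of one step.
    echo "processing step $1"
}

run_steps() {
    state_file=$1    # file that holds the last completed step number
    total_steps=$2   # number of steps the job consists of

    mkdir -p "$(dirname "$state_file")" || return 1

    last_done=0
    if [ -f "$state_file" ]; then
        last_done=$(cat "$state_file")
    fi

    step=$((last_done + 1))
    while [ "$step" -le "$total_steps" ]; do
        do_step "$step" || return 1
        # Write to a temporary file first, then rename, so that a suspend
        # in the middle of the write does not leave a damaged state file.
        echo "$step" > "$state_file.tmp" && mv "$state_file.tmp" "$state_file"
        step=$((step + 1))
    done
}

# Only start the work when run by the scheduler (JOB_ID is set there).
# The directory layout below is an assumption for illustration.
if [ -n "${JOB_ID:-}" ]; then
    run_steps "/home/checkpoint/$USER/progress.$JOB_ID" 10
fi
```

Because the state file name contains the job id, a rescheduled job should find its own progress file again. Remove the file once the job has finished so the shared directory does not fill up.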
If you want to use this queue, mail Michel Villerius M.P.Villerius@… or Matthijs Moed M.H.Moed@… and we will add you to the queue.
Keep in mind that you need to pass the -ckpt option when submitting, otherwise your job will stay waiting in the queue forever.
Simple example:
#!div class=important style="border: 2pt solid; text-align: left"
qsub -ckpt check_transparent sleep1000.sh
If you see your job running, try to suspend it yourself with:
#!div class=important style="border: 2pt solid; text-align: left"
qmod -s <job id>
You will then see the state Rq (rerunning and queued) and then Rr (rerunning). Your job will start from the beginning.
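To confirm from inside the job that it was restarted, the job script can inspect the `RESTARTED` environment variable, which the Grid Engine qsub documentation describes as 1 for a restarted job and 0 otherwise. The script below is a small sketch for trying this out. The function name `describe_start` is made up for the example, and whether `RESTARTED` is set exactly this way on shark is an assumption that is worth verifying with a test job.

```shell
#!/bin/sh
# Embedded submit option: same effect as "qsub -ckpt check_transparent".
#$ -ckpt check_transparent

# describe_start is an illustrative helper, not part of Grid Engine.
# It reports whether this run is the first start or a restart, based on
# the RESTARTED variable (treated as 0 when it is not set).
describe_start() {
    if [ "${RESTARTED:-0}" = "1" ]; then
        echo "restarted"
    else
        echo "first run"
    fi
}

echo "Job ${JOB_ID:-unknown}: $(describe_start)"
```

Submit this script, suspend the job with qmod -s as shown above, and compare the line written to the job's output file before and after the reschedule.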
More elaborate example: see Examples page