There is one subordinate queue available called subordinate.q.
This queue has 368 slots available (that is, all slots).
This queue only offers check-pointing; no parallel environment is available.
It can be used to circumvent the 55 slots-per-user limit when many slots are free and you have a lot of short jobs
(where restarting from the beginning after a suspend is acceptable) or specially prepared jobs that can restart from a saved state.
Jobs get suspended and rescheduled when another queue requests the slots or when a blade crashes.
Check-pointing makes sure that when your jobs get suspended, they are stopped immediately and rescheduled.
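To illustrate what a job that "can restart from a saved state" might look like, here is a minimal sketch in Python. It is not a shark-specific recipe: the function names, the JSON state file, and the `stop_after` argument (which only simulates a suspend) are all assumptions made for the example.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"next_step": 0, "total": 0}


def save_state(path, state):
    """Write to a temp file first, then rename, so a suspend mid-write
    cannot leave a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run(path, n_steps, stop_after=None):
    """Sum 0..n_steps-1, saving the state after every step.

    stop_after is hypothetical: it simulates the job being suspended
    after that many steps in the current run.
    """
    state = load_state(path)
    done_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # "suspended": progress so far is on disk
        state["total"] += state["next_step"]
        state["next_step"] += 1
        save_state(path, state)
        done_this_run += 1
    return state["total"]
```

When the job is rescheduled and started again with the same state file, `run` picks up at the first step that was not yet saved instead of starting from zero.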
The types of check-pointing available on shark are:
|| type || OGS -ckpt name ||
|| user defined interface || check_userdefined ||
|| transparent interface || check_transparent ||
The all.q has a subordination list that is defined as:
all.q can use, for example, 16 cores per blade on the 16-slot blades. If subordinate.q jobs are running there and all.q needs the slots,
the subordinate.q slots are suspended, which triggers the check-pointing to reschedule the complete job.
A checkpoint directory has been created for now:
/home/checkpoint. Everyone can write to that directory at the moment; please do not abuse it.
If you want to use this queue, mail Michel Villerius (M.P.Villerius@…) or Matthijs Moed (M.H.Moed@…) and we will add you to the queue.
Keep in mind that you need to pass the -ckpt option when submitting; otherwise your job will stay in the queue-wait state forever.
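As a rough illustration of a submission that includes the -ckpt option, the sketch below assembles a qsub command line in Python. `-q` and `-ckpt` are standard Grid Engine qsub options and the two checkpoint names come from the table above. The helper name `build_qsub_command` and the script name `my_job.sh` are hypothetical, and whether `-q subordinate.q` must be given explicitly on shark is an assumption.

```python
import shlex

# Checkpoint environment names as listed for shark.
CKPT_NAMES = ("check_userdefined", "check_transparent")


def build_qsub_command(script, ckpt_name="check_userdefined",
                       queue="subordinate.q"):
    """Return the qsub argument list for a check-pointed job.

    Raises ValueError for a checkpoint name that is not in CKPT_NAMES,
    since a job submitted without a valid -ckpt would just sit waiting.
    """
    if ckpt_name not in CKPT_NAMES:
        raise ValueError("unknown checkpoint environment: " + ckpt_name)
    return ["qsub", "-q", queue, "-ckpt", ckpt_name, script]


# my_job.sh is a placeholder for your own job script.
cmd = build_qsub_command("my_job.sh")
print(shlex.join(cmd))
# prints: qsub -q subordinate.q -ckpt check_userdefined my_job.sh
```

The printed line can be pasted into a shell on the submit host, or the list can be handed to `subprocess.run` if you prefer to submit from Python.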