CheckpointingQueue

Last edited by mpvillerius Nov 06, 2015

Using the checkpointing queue


There is one subordinate queue available, called subordinate.q. This queue has 368 slots available (that is, all slots).

This queue offers only check-pointing and no parallel environment. It can be used to circumvent the 55 slots-per-user limit when many slots are free and you have a lot of short jobs (for which restarting from the beginning after a suspend is acceptable) or specially prepared jobs that can restart from a saved state. Jobs are suspended and rescheduled when another queue requests the slots or when a blade crashes.

Check-pointing ensures that when your job is suspended, it is stopped immediately and rescheduled. The following check-pointing types are available on Shark:

||type|| OGS -ckpt name ||
||user defined interface || check_userdefined ||
||transparent interface || check_transparent ||

Please read the following on check-pointing (includes examples): http://gridscheduler.sourceforge.net/howto/checkpointing.html

The all.q queue has a subordination list defined as:

subordinate_list NONE,[@24_core=subordinate.q],[@16_core=subordinate.q],[@12_core=subordinate.q],[@8_core=subordinate.q]

For example, all.q can use all 16 cores per blade on the 16-slot blades. If subordinate.q jobs are running there and all.q needs the slots, the subordinate.q slots are suspended, which triggers check-pointing to reschedule the complete job.
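To read the subordination list more easily, it can be split into one host group per line. The following is a minimal sketch; parse_subordinates is a hypothetical helper and not part of the Shark setup, and the input is the line quoted above.

```shell
#!/bin/bash
# parse_subordinates (hypothetical helper): reads a subordinate_list value on
# stdin and prints "<host group> <subordinate queue>" for each bracketed entry.
# Entries without brackets, such as the default NONE, are skipped.
parse_subordinates() {
    tr ',' '\n' | sed -n 's/^\[\(@[^=]*\)=\(.*\)\]$/\1 \2/p'
}

# The value from the all.q configuration quoted in this page.
echo 'NONE,[@24_core=subordinate.q],[@16_core=subordinate.q],[@12_core=subordinate.q],[@8_core=subordinate.q]' \
    | parse_subordinates
```

Each printed line names a host group on which all.q suspends subordinate.q.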

A checkpoint directory has been created for now: /home/checkpoint. Everyone can write to this directory at the moment, so please do not abuse it.

If you want to use this queue, mail Michel Villerius (M.P.Villerius@…) or Matthijs Moed (M.H.Moed@…) and we will add you to the queue.

Keep in mind that you must pass the -ckpt option when submitting; otherwise your job will stay waiting in the queue forever.

Simple example:

qsub -ckpt check_transparent sleep1000.sh

Once you see your job running, try suspending it yourself with:

qmod -s <job id>

You will then see the state Rq (rerunning and queued) followed by Rr (rerunning). Your job will start from the beginning.
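A job that should resume instead of starting from the beginning has to save its own progress. The following is a minimal sketch of such a specially prepared job, assuming the work can be divided into numbered steps. STATE_FILE, TOTAL_STEPS and run_steps are hypothetical names chosen for illustration; on Shark the state file could live in a personal subdirectory of the checkpoint directory.

```shell
#!/bin/bash
# Sketch of a restartable job: the last completed step is saved to a state
# file, so a rescheduled job skips the steps that are already done.
# Hypothetical submit line: qsub -ckpt check_userdefined restartable_job.sh
# STATE_FILE and TOTAL_STEPS are illustrative names, not part of the Shark setup.
STATE_FILE="${STATE_FILE:-${TMPDIR:-/tmp}/restartable_job.state}"
TOTAL_STEPS="${TOTAL_STEPS:-5}"

run_steps() {
    local last=0 step
    # Resume after the last step recorded by a previous run, if any.
    if [ -f "$STATE_FILE" ]; then
        last=$(cat "$STATE_FILE")
    fi
    step=$((last + 1))
    while [ "$step" -le "$TOTAL_STEPS" ]; do
        echo "running step $step"
        # The real work for this step goes here.
        # Write to a temporary file and rename it, so that a job stopped
        # mid-write does not leave a damaged state file behind.
        echo "$step" > "$STATE_FILE.tmp"
        mv "$STATE_FILE.tmp" "$STATE_FILE"
        step=$((step + 1))
    done
}

run_steps
```

When such a job is suspended and rescheduled, the new run reads the state file and continues with the first step that was not yet recorded. Remove the state file before submitting a fresh job.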

More elaborate example: see Examples page
