Villerius · 20a398d4
--- a/HowDoIStart.md
+++ b/HowDoIStart.md
+# How do I start working on Shark the right way?
+----
+Let's assume that we have Next Generation Sequencing data from an Escherichia coli genome and we want to align our fastq file to the reference genome with BWA. Our final output file format will be a BAM file. We have no idea how long the steps will take or what resources (memory, run time, disk space) we need. The following workflow could be a good starting point:
+
+Step 1: Use qlogin to open a session on an interactive node and test that your program works on the Shark cluster with a very small data set.
+
+your data set is      : e_coli_1000.fq
+
+your reference genome : Escherichia_coli_536_uid58531_NC_008253
+
+BWA will run with a single thread for now (option -t 1).
+
+Create a small data set from the first 400 fastq reads of our full data set (a fastq record consists of 4 lines, thus 400 * 4 = 1600 lines):
+
+
+    head -1600 e_coli_1000.fq > small-set-e_coli_1000.fq
+
+Check how many lines the new fastq file has:
+
+
+    cat small-set-e_coli_1000.fq | wc -l
+    1600
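
Because every fastq record spans 4 lines, the number of reads can be derived from the line count. The following is a minimal sketch, assuming the plain 4-line record layout used above (no wrapped sequence lines); the function name `count_fastq_reads` is hypothetical and only serves as an illustration.

```shell
#!/bin/bash
# Hypothetical helper: count the reads in a fastq file.
# Assumes exactly 4 lines per record (header, sequence, '+', qualities).
count_fastq_reads() {
    local fastq_file="$1"
    local lines
    lines=$(wc -l < "$fastq_file")
    # A line count that is not a multiple of 4 points to a truncated record.
    if (( lines % 4 != 0 )); then
        echo "warning: $fastq_file has $lines lines, not a multiple of 4" >&2
        return 1
    fi
    echo $(( lines / 4 ))
}

# Usage (file name taken from the example above):
#   count_fastq_reads small-set-e_coli_1000.fq
```

For the 1600-line subset created above, this should report 400 reads.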
+
+First, connect to the Shark cluster with secure shell (ssh):
+
+
+    ssh shark.lumcnet.prod.intern
+
+Use qlogin and run BWA on the small data set. h_vmem=5.0G means that the job can use at most 5 GB of memory; if it uses more, it will be killed. h_rt=:20: means that the job may run for at most 20 minutes; after 20 minutes it will be killed automatically. The h_rt format is HH:MM:SS.
+
+
+    qlogin -l  h_vmem=5.0G,h_rt=:20:
+
+
+    Your job 107057 ("QLOGIN") has been submitted
+    waiting for interactive job to be scheduled ...
+    Your interactive job 107057 has been successfully scheduled.
+    Establishing builtin session to host wobbegongshark.cluster.loc ...
+    mesg: /dev/pts/0: Operation not permitted                       <================= do not pay attention to this error
+    vill@wobbegongshark:~$ cd /data/DIV5/HumGen/vill/
+    vill@wobbegongshark:/data/DIV5/HumGen/vill$ bwa aln /usr/local/Genomes/E.coli/Escherichia_coli_536_uid58531_NC_008253 small-set-e_coli_1000.fq > small-set-output-ecoli.sai
+    [bwa_aln] 17bp reads: max_diff = 2
+    [bwa_aln] 38bp reads: max_diff = 3
+    [bwa_aln] 64bp reads: max_diff = 4
+    [bwa_aln] 93bp reads: max_diff = 5
+    [bwa_aln] 124bp reads: max_diff = 6
+    [bwa_aln] 157bp reads: max_diff = 7
+    [bwa_aln] 190bp reads: max_diff = 8
+    [bwa_aln] 225bp reads: max_diff = 9
+    [bwa_aln_core] calculate SA coordinate... 0.03 sec
+    [bwa_aln_core] write to the disk... 0.00 sec
+    [bwa_aln_core] 400 sequences have been processed.
+
+If the BWA alignment went fine, leave the interactive session:
+
+
+    exit
+
+Now that we know that BWA runs fine on Shark with our data set, we can write a script (run_BWA_example.sh) that we can submit with qsub.
+
+
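+    #!/bin/bash
+    # Assumption: the batch job does not start in the data directory, so change into it first.
+    cd /data/DIV5/HumGen/vill/
+    bwa aln /usr/local/Genomes/E.coli/Escherichia_coli_536_uid58531_NC_008253 small-set-e_coli_1000.fq > small-set-output-ecoli.sai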
+
+We can now submit our job:
+
+
+    qsub -l  h_vmem=5.0G,h_rt=:20: /data/DIV5/HumGen/vill/run_BWA_example.sh
+
+
+    Your job 107065 ("run_BWA_example.sh") has been submitted
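
The h_rt limit passed to qlogin and qsub above is written in HH:MM:SS form. As a small convenience, the sketch below converts a number of seconds into that form. The function name `seconds_to_h_rt` is hypothetical and not part of Grid Engine.

```shell
#!/bin/bash
# Hypothetical helper: format a run-time limit in seconds as HH:MM:SS,
# the form described above for h_rt.
seconds_to_h_rt() {
    local total="$1"
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) \
        $(( (total % 3600) / 60 )) \
        $(( total % 60 ))
}

# Usage: 20 minutes = 1200 seconds
#   seconds_to_h_rt 1200
```

The value printed for 1200 seconds, `00:20:00`, expresses the same 20-minute limit as the `:20:` shorthand used in the commands above.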
+
+With qstat we can check the status of our job:
+
+
+    qstat
+
+
+    job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
+    -----------------------------------------------------------------------------------------------------------------
+     107065 0.50500 run_BWA_ex vill         r     07/28/2011 15:53:35 LGTC.q@caribbeanshark.cluster.     1
+
+When our job has finished, we can view how many resources the script used with qacct:
+
+
+    qacct -j 107065
+
+
+    ==============================================================
+    qname        LGTC.q
+    hostname     caribbeanshark.cluster.loc
+    group        NexGenSeq
+    owner        vill
+    project      NONE
+    department   LGTC
+    jobname      run_BWA_example.sh
+    jobnumber    107065
+    taskid       undefined
+    account      sge
+    priority     0
+    qsub_time    Thu Jul 28 15:53:27 2011
+    start_time   Thu Jul 28 15:48:52 2011
+    end_time     Thu Jul 28 15:49:34 2011
+    granted_pe   NONE
+    slots        1
+    failed       0
+    exit_status  0
+    ru_wallclock 42
+    ru_utime     40.460
+    ru_stime     0.810
+    ru_maxrss    86644
+    ru_ixrss     0
+    ru_ismrss    0
+    ru_idrss     0
+    ru_isrss     0
+    ru_minflt    91035
+    ru_majflt    0
+    ru_nswap     0
+    ru_inblock   241304
+    ru_oublock   43304
+    ru_msgsnd    0
+    ru_msgrcv    0
+    ru_nsignals  0
+    ru_nvcsw     1530
+    ru_nivcsw    8081
+    cpu          41.270
+    mem          3.626
+    io           0.133
+    iow          0.000
+    maxvmem      105.809M
+    arid         undefined
+
+Look at the start time, the end time and the maximum memory use (maxvmem):
+
+
+    start_time   Thu Jul 28 15:48:52 2011
+    end_time     Thu Jul 28 15:49:34 2011
+    maxvmem      105.809M
+
+Now you know how long your program takes to run (42 seconds here, which matches ru_wallclock) and the maximum amount of memory it used (about 106 MB).
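
Instead of reading these values by eye, they can be pulled out of the qacct report. The following is a minimal sketch, assuming the report has the two-column layout shown above (field name, whitespace, value); the function name `qacct_field` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one field from a qacct report
# that is read on standard input.
# Assumes each report line looks like "<field name><whitespace><value>".
qacct_field() {
    local field="$1"
    awk -v key="$field" '
        $1 == key {
            sub(/^[ \t]*[^ \t]+[ \t]+/, "")   # drop the field name and padding
            sub(/[ \t]+$/, "")                # drop trailing whitespace
            print
            exit
        }'
}

# Usage (job number taken from the example above):
#   qacct -j 107065 | qacct_field ru_wallclock
#   qacct -j 107065 | qacct_field maxvmem
```

With the report shown above, the two usage lines would print the wall-clock time in seconds and the peak virtual memory.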