|
|
# How do I start working on Shark the right way?
|
|
|
----
|
|
|
Let's assume that we have Next Generation Sequencing data from an Escherichia coli genome and want to align our fastq file to the reference genome with BWA. Our final output will be a BAM file. We do not know how long the steps will take or what resources (memory, run time, disk space) they need. The following workflow is a good starting point:
|
|
|
|
|
|
Step 1: Use qlogin to get an interactive session on a compute node and test that your program works on the Shark cluster with a very small data set.
|
|
|
|
|
|
Your data set: e_coli_1000.fq
|
|
|
|
|
|
Your reference genome: Escherichia_coli_536_uid58531_NC_008253
|
|
|
|
|
|
Create a small data set from the first 400 reads of the full data set (each fastq record consists of 4 lines, thus 400 * 4 = 1600 lines):
|
|
|
|
|
|
BWA will use a single thread for now (-t 1, which is the default for bwa aln, so the option can be omitted).
|
|
|
|
|
|
|
|
|
#!div class=important style="border: 2pt solid; text-align: left"
|
|
|
head -1600 e_coli_1000.fq > small-set-e_coli_1000.fq
|
|
|
|
|
|
Check how many lines the new fastq file has:
|
|
|
|
|
|
|
|
|
#!div class=important style="border: 2pt solid; text-align: left"
|
|
|
cat small-set-e_coli_1000.fq | wc -l
|
|
|
1600
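A truncated fastq record is easy to miss when only the raw line count is printed, so it can help to verify that the count is a multiple of four. The following is a minimal sketch, not part of the original workflow; the helper name `count_fastq_reads` and the two-read demo file are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the number of reads in a fastq file,
# or fail if the line count is not a multiple of 4.
count_fastq_reads() {
    local file=$1 lines
    lines=$(wc -l < "$file")
    if (( lines % 4 != 0 )); then
        echo "error: $file has $lines lines, not a multiple of 4" >&2
        return 1
    fi
    echo $(( lines / 4 ))
}

# Demo on a throwaway two-read file (made-up data, for illustration only).
demo_fq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$demo_fq"
count_fastq_reads "$demo_fq"   # prints 2
```

For the 1600-line file created above, `count_fastq_reads small-set-e_coli_1000.fq` would print 400.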
|
|
|
|
|
|
First, connect to the Shark cluster with SSH:
|
|
|
|
|
|
|
|
|
#!div class=important style="border: 2pt solid; text-align: left"
|
|
|
ssh shark.lumcnet.prod.intern
|
|
|
|
|
|
Use qlogin and run BWA on the small data set. h_vmem=5.0G means the job may use at most 5 GB of memory; if it uses more, it is killed. h_rt=:20: means the job may run for at most 20 minutes, after which it is killed automatically. The h_rt format is HH:MM:SS.
|
|
|
|
|
|
|
|
|
#!div class=important style="border: 2pt solid; text-align: left"
|
|
|
qlogin -l h_vmem=5.0G,h_rt=:20:
|
|
|
|
|
|
|
|
|
Your job 107057 ("QLOGIN") has been submitted
|
|
|
waiting for interactive job to be scheduled ...
|
|
|
Your interactive job 107057 has been successfully scheduled.
|
|
|
Establishing builtin session to host wobbegongshark.cluster.loc ...
|
|
|
mesg: /dev/pts/0: Operation not permitted <================= do not pay attention to this error
|
|
|
vill@wobbegongshark:~$ cd /data/DIV5/HumGen/vill/
|
|
|
vill@wobbegongshark:/data/DIV5/HumGen/vill$ bwa aln /usr/local/Genomes/E.coli/Escherichia_coli_536_uid58531_NC_008253 small-set-e_coli_1000.fq > small-set-output-ecoli.sai
|
|
|
[bwa_aln] 17bp reads: max_diff = 2
|
|
|
[bwa_aln] 38bp reads: max_diff = 3
|
|
|
[bwa_aln] 64bp reads: max_diff = 4
|
|
|
[bwa_aln] 93bp reads: max_diff = 5
|
|
|
[bwa_aln] 124bp reads: max_diff = 6
|
|
|
[bwa_aln] 157bp reads: max_diff = 7
|
|
|
[bwa_aln] 190bp reads: max_diff = 8
|
|
|
[bwa_aln] 225bp reads: max_diff = 9
|
|
|
[bwa_aln_core] calculate SA coordinate... 0.03 sec
|
|
|
[bwa_aln_core] write to the disk... 0.00 sec
|
|
|
[bwa_aln_core] 400 sequences have been processed.
|
|
|
|
|
|
If the BWA alignment went fine, leave the interactive session:
|
|
|
|
|
|
|
|
|
#!div class=important style="border: 2pt solid; text-align: left"
|
|
|
exit
|
|
|
|
|
|
Now that we know that BWA runs fine on Shark with our data set, we can write a script (here run_BWA_example.sh) that we can submit with qsub.
|
|
|
|
|
|
|
|
|
#!sh
|
|
|
#!/bin/bash
|
|
|
# Align the small test set to the E. coli reference and write the suffix array coordinates (.sai)
bwa aln /usr/local/Genomes/E.coli/Escherichia_coli_536_uid58531_NC_008253 small-set-e_coli_1000.fq > small-set-output-ecoli.sai
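A slightly more defensive variant of this script is sketched below. It assumes a Grid Engine scheduler that reads `#$` directive lines from the script, and it reuses the paths from the interactive session. The embedded resource request, `set -euo pipefail`, and the explicit `cd` are additions that are not in the original script. The sketch only writes the file to a temporary directory and checks its syntax; it does not run BWA.

```shell
#!/bin/bash
# Write a hypothetical variant of the job script to a temporary directory.
workdir=$(mktemp -d)
script="$workdir/run_BWA_example.sh"

cat > "$script" <<'EOF'
#!/bin/bash
#$ -l h_vmem=5.0G,h_rt=:20:
# Stop at the first failing command so errors are not silently ignored.
set -euo pipefail

# Same directory as in the interactive test, so relative paths resolve.
cd /data/DIV5/HumGen/vill/

bwa aln /usr/local/Genomes/E.coli/Escherichia_coli_536_uid58531_NC_008253 \
    small-set-e_coli_1000.fq > small-set-output-ecoli.sai
EOF

# Syntax check only; nothing is executed here.
bash -n "$script" && echo "syntax OK: $script"
```

With the limits embedded this way, the resource options would presumably no longer need to be repeated on the qsub command line.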
|
|
|
|
|
|
We can now submit our job:
|
|
|
|
|
|
|
|
|
#!div class=important style="border: 2pt solid; text-align: left"
|
|
|
qsub -l h_vmem=5.0G,h_rt=:20: /data/DIV5/HumGen/vill/run_BWA_example.sh
|
|
|
|
|
|
|
|
|
Your job 107065 ("run_BWA_example.sh") has been submitted
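The h_rt limit used in the requests above follows the HH:MM:SS format. When a run time is known in seconds, a small helper can build that string. This is only a sketch; the function name `seconds_to_h_rt` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: convert a number of seconds into HH:MM:SS for h_rt.
seconds_to_h_rt() {
    local total=$1
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

seconds_to_h_rt 1200   # 20 minutes, prints 00:20:00
```

Here 00:20:00 is the fully written form of the 20-minute limit that appears as :20: in the commands above.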
|
|
|
|
|
|
With qstat we can check the status of our job:
|
|
|
|
|
|
|
|
|
#!div class=important style="border: 2pt solid; text-align: left"
|
|
|
qstat
|
|
|
|
|
|
|
|
|
job-ID prior name user state submit/start at queue slots ja-task-ID
|
|
|
-----------------------------------------------------------------------------------------------------------------
|
|
|
107065 0.50500 run_BWA_ex vill r 07/28/2011 15:53:35 LGTC.q@caribbeanshark.cluster. 1
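To check a single job from a script, the state column can be pulled out of the qstat listing. The sketch below works on a copy of the line shown above instead of calling qstat, so it runs anywhere. The function name `job_state` is hypothetical, and taking the state from the fifth field is an assumption based on this particular output.

```shell
#!/bin/bash
# Hypothetical helper: read qstat-style text on stdin and print the state
# (5th column) of the job whose ID is given as the first argument.
job_state() {
    awk -v id="$1" '$1 == id { print $5 }'
}

# Sample line copied from the qstat output above.
sample='107065 0.50500 run_BWA_ex vill r 07/28/2011 15:53:35 LGTC.q@caribbeanshark.cluster. 1'
printf '%s\n' "$sample" | job_state 107065   # prints r
```

On the cluster the equivalent call would be `qstat | job_state 107065`. A state of `r` means the job is running; `qw` means it is still waiting in the queue.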
|
|
|
|
|
|
When our job has finished we can view how many resources this script used with:
|
|
|
|
|
|
|
|
|
#!div class=important style="border: 2pt solid; text-align: left"
|
|
|
qacct -j 107065
|
|
|
|
|
|
|
|
|
==============================================================
|
|
|
qname LGTC.q
|
|
|
hostname caribbeanshark.cluster.loc
|
|
|
group NexGenSeq
|
|
|
owner vill
|
|
|
project NONE
|
|
|
department LGTC
|
|
|
jobname run_BWA_example.sh
|
|
|
jobnumber 107065
|
|
|
taskid undefined
|
|
|
account sge
|
|
|
priority 0
|
|
|
qsub_time Thu Jul 28 15:53:27 2011
|
|
|
start_time Thu Jul 28 15:48:52 2011
|
|
|
end_time Thu Jul 28 15:49:34 2011
|
|
|
granted_pe NONE
|
|
|
slots 1
|
|
|
failed 0
|
|
|
exit_status 0
|
|
|
ru_wallclock 42
|
|
|
ru_utime 40.460
|
|
|
ru_stime 0.810
|
|
|
ru_maxrss 86644
|
|
|
ru_ixrss 0
|
|
|
ru_ismrss 0
|
|
|
ru_idrss 0
|
|
|
ru_isrss 0
|
|
|
ru_minflt 91035
|
|
|
ru_majflt 0
|
|
|
ru_nswap 0
|
|
|
ru_inblock 241304
|
|
|
ru_oublock 43304
|
|
|
ru_msgsnd 0
|
|
|
ru_msgrcv 0
|
|
|
ru_nsignals 0
|
|
|
ru_nvcsw 1530
|
|
|
ru_nivcsw 8081
|
|
|
cpu 41.270
|
|
|
mem 3.626
|
|
|
io 0.133
|
|
|
iow 0.000
|
|
|
maxvmem 105.809M
|
|
|
arid undefined
|
|
|
|
|
|
Look at the start time, the end time, and the maximum memory (maxvmem):
|
|
|
|
|
|
|
|
|
start_time Thu Jul 28 15:48:52 2011
|
|
|
end_time Thu Jul 28 15:49:34 2011
|
|
|
maxvmem 105.809M
|
|
|
|
|
|
Now you know how long your program takes to run (42 seconds here, matching ru_wallclock) and the maximum amount of memory it uses (about 106 MB).
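These numbers can feed into the next resource request. The sketch below extracts ru_wallclock and maxvmem from qacct-style text and adds a safety margin. The factor of 2 and the function name `suggest_limits` are arbitrary assumptions for illustration, not a cluster policy, and only memory values reported in megabytes are handled.

```shell
#!/bin/bash
# Hypothetical helper: read qacct-style text on stdin and print the observed
# run time and peak memory, plus both values with a 2x margin (assumed).
suggest_limits() {
    awk '
        $1 == "ru_wallclock" { secs = $2 }
        $1 == "maxvmem"      { mem = $2 }   # e.g. "105.809M"
        END {
            printf "observed: %d s, %s\n", secs, mem
            # Strip the M suffix; other units are not handled in this sketch.
            sub(/M$/, "", mem)
            # Adding 1 before truncation rounds the doubled memory upwards.
            printf "suggested: run time %d s, memory %dM\n", secs * 2, mem * 2 + 1
        }'
}

# Values copied from the qacct output above.
printf 'ru_wallclock 42\nmaxvmem 105.809M\n' | suggest_limits
```

On the cluster the input would come from `qacct -j 107065` instead of the hard-coded sample. Keep in mind that the full data set will need more time and memory than the 400-read test, so measurements from the small set are only a lower bound.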