**How to correct long !PacBio reads using pacBioToCA**

[[BR]]

[[PageOutline(1-2,Page Contents,inline)]]

[[BR]]

# Celera assembler and the LUMC Shark compute cluster

----

## Introduction

This document describes the use of Celera Assembler ("wgs-assembler") and PacBioToCA (unreleased version as of August 1, 2013) on the LUMC Shark computing cluster using the SGE grid.

## Version

Unreleased version as of August 1, 2013. Downloaded from the CVS repository.

{{{
/usr/local/wgs-assembler/wgs-svn/Linux-amd64/bin/runCA -version
CA version CVS TIP ($Id: AS_GKP_main.C 4371 2013-08-01 17:19:47Z brianwalenz $).
CA version CVS TIP ($Id: AS_CGB_unitigger.C 4371 2013-08-01 17:19:47Z brianwalenz $).
CA version CVS TIP ($Id: BuildUnitigs.C 4371 2013-08-01 17:19:47Z brianwalenz $).
Using up to 8 OpenMP threads.
CA version CVS TIP ($Id: AS_CGW_main.C 4371 2013-08-01 17:19:47Z brianwalenz $).
CA version CVS TIP ($Id: terminator.C 4371 2013-08-01 17:19:47Z brianwalenz $).
}}}

''NOTE: this is not the default (“current”) version on Shark.[[BR]]
Use the full path to run this unreleased version of Celera Assembler.''

## Installation

Instructions regarding installation of the most recent unreleased version of Celera Assembler from a CVS repository can be found here:
[Check_out_and_Compile](http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Check_out_and_Compile) [[BR]]
Steps 6 and 7 of the guide were skipped. No additional (bug)fixes were performed.

[[BR]][[BR]]

# PacBioToCA

----

## Introduction

I will use a case study to provide a high-level walkthrough of PacBioToCA. [[BR]]
For this research project, two organisms were sequenced simultaneously, resulting in a combined dataset.[[BR]]

Since the !PacBio platform has a high random error rate (~15%, i.e. roughly 85% read accuracy), data correction is required before the data can be used for assembly.

Illumina !HiSeq 2000 data, including paired and unpaired reads, will be used to correct a !PacBio dataset consisting of 20 SMRT cells. [[BR]]
Basic characteristics of both datasets are presented in tables 1 and 2.

[[BR]]

|| ||**!PacBio filtered subreads 20 cells**||
||Total reads||1,000,321||
||Total bases||1,152,284,129 bp||
||Mean read length||1,152 bp||
||Median read length||996 bp||
||Size range (bp)||500 - 7,977||
||Standard deviation||593.37||
||Overall GC||33.42%||

**Table 1:** Uncorrected !PacBio data.

[[BR]][[BR]]

|| ||**!HiSeq forward unpaired**||**!HiSeq reverse unpaired**||**!HiSeq forward paired (subset)**||**!HiSeq reverse paired (subset)**||
||Total reads||6,054,045||801,344||14,162,580||14,154,532||
||Total bases||521,866,017 bp||69,878,229 bp||1,391,281,652 bp||1,366,430,609 bp||
||Mean read length||86 bp||87 bp||98 bp||97 bp||
||Median read length||100 bp||100 bp||100 bp||100 bp||
||Size range (bp)||36 - 100||36 - 100||36 - 100||36 - 100||
||Standard deviation||19.60||20.69||8.04||2.75||
||Overall GC||33.66%||30.55%||30.22%||30.05%||

**Table 2:** Illumina !HiSeq 2000 datasets. [[BR]]
*NOTE:* The statistics for the paired datasets are based on a random subset (10%) taken from the original dataset. [[BR]]
A subset was analysed instead of the complete dataset because of the large size of the full paired datasets.

[[BR]]

## Input - Uncorrected !PacBio data (FASTQ)

PacBioToCA requires !PacBio data to be supplied as FASTQ files. [[BR]]
Pacific Biosciences encodes quality values in FASTQ files using their own specific format. This format is not compatible with Celera Assembler. To circumvent this problem, !PacBio data was retrieved as FASTA files from the SMRT portal. [[BR]]
The FASTA files were then “converted” to FASTQ files by assigning standard quality values. [[BR]]
For this, the fastools package (available on Shark) was used:

{{{
fastools fa2fq <in.fasta> <out.fastq>
}}}
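
If fastools is not available, the same “conversion” can be sketched with standard awk: emit each FASTA record as a FASTQ record with a constant dummy quality string. This is an illustrative sketch, not the fastools implementation, and it assumes one sequence line per FASTA record:

{{{
# Hypothetical stand-in for "fastools fa2fq": assigns the fixed Sanger
# quality "I" (Q40) to every base. Assumes single-line sequences.
awk '/^>/ { sub(/^>/, "@"); print; next }
     { print; print "+"; q = $0; gsub(/./, "I", q); print q }' in.fasta > out.fastq
}}}
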
## Input - Illumina !HiSeq data (FRG)

The !HiSeq dataset will be used to correct the !PacBio reads. [[BR]]
Any dataset used to perform !PacBio data correction must be supplied as an FRG file. [[BR]]
Celera Assembler includes a utility, {{{fastqToCA}}}, to generate wrapper LIB messages for (Illumina) FASTQ files. [[BR]]
[FastqToCA](http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=FastqToCA)

{{{
fastqToCA -insertsize 400 50 -libraryname HiSeq -technology illumina -reads HiSeq_R1001_forward_unpaired.fastq -reads HiSeq_R2001_reverse_unpaired.fastq -mates HiSeq_R1001_forward_paired.fastq,HiSeq_R2001_reverse_paired.fastq
}}}

The resulting FRG file wraps the FASTQ files in a LIB message:

{{{
{VER
ver:2
}
{LIB
act:A
acc:HiSeq
ori:I
mea:400.000
std:50.000
src:
.
nft:18
fea:
forceBOGunitigger=1
isNotRandom=0
doNotTrustHomopolymerRuns=0
doTrim_initialNone=0
doTrim_initialMerBased=1
doTrim_initialFlowBased=0
doTrim_initialQualityBased=0
doRemoveDuplicateReads=1
doTrim_finalLargestCovered=1
doTrim_finalEvidenceBased=0
doRemoveSpurReads=1
doRemoveChimericReads=1
doConsensusCorrection=0
fastqQualityValues=sanger
fastqOrientation=innie
fastqReads=/data/LGTC/LGTCusers/jfrank/Clav/FASTQ_FRG/HiSeq_R1001_forward_unpaired.fastq
fastqReads=/data/LGTC/LGTCusers/jfrank/Clav/FASTQ_FRG/HiSeq_R2001_reverse_unpaired.fastq
fastqMates=/data/LGTC/LGTCusers/jfrank/Clav/FASTQ_FRG/HiSeq_R1001_forward_paired.fastq,/data/LGTC/LGTCusers/jfrank/Clav/FASTQ_FRG/HiSeq_R2001_reverse_paired.fastq
.
}
{VER
ver:1
}
}}}
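
{{{fastqToCA}}} writes the FRG messages shown above to standard output; to use them with pacBioToCA, they are typically redirected into a file (here the {{{clav_HiSeq.frg}}} file used later on this page):

{{{
fastqToCA -insertsize 400 50 -libraryname HiSeq -technology illumina \
  -reads HiSeq_R1001_forward_unpaired.fastq -reads HiSeq_R2001_reverse_unpaired.fastq \
  -mates HiSeq_R1001_forward_paired.fastq,HiSeq_R2001_reverse_paired.fastq > clav_HiSeq.frg
}}}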

[[BR]][[BR]]

# Specifying options - The spec file

----

The most convenient way to configure Celera Assembler/PacBioToCA is to use a spec file. [[BR]]
In this file, options are defined in a key=value manner. There are many options available. [[BR]]
For a complete list, please refer to [RunCA](http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=RunCA) or run:

{{{
runCA -options
}}}

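To look up only the grid-related options in that (long) list, the output can simply be piped through grep:

{{{
/usr/local/wgs-assembler/wgs-svn/Linux-amd64/bin/runCA -options | grep -i sge
}}}
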
A spec file must be well planned and designed. For example, specific options need to be defined if you want to make use of the SGE grid on Shark. [[BR]]
Experimenting with spec files on the grid has shown that only a few configurations work well, while most other setups will end up crashing your job. [[BR]]
The parameters most important for SGE and the Shark cluster will be discussed later.

Since the Shark cluster employs a standard memory limit of 4 GB per slot, you need to define the amount of memory for every process up front. [[BR]]
This can be tricky to predict, as there are many factors influencing the memory footprint, such as the size and complexity of the genome/data [[BR]]
and the automatic memory allocation of certain options. This requires you to know the program and your data very well. [[BR]]
To get an idea of the memory usage, the spec file used during this study is provided below.

{{{
cat pacBioToCA_svn_SGE.spec
sgeName = pacBioToCA
sge = -A assembly -l h_vmem=10G -l h_stack=256m
sgeScript = -pe BWA 8

useGrid = 1
scriptOnGrid = 1

mbtOnGrid = 1
frgCorrOnGrid = 1
ovlOnGrid = 1
ovlCorrOnGrid = 1
cnsOnGrid = 1

##############################################################################

mbtBatchSize = 1500000
mbtThreads = 1

frgCorrBatchSize = 200000
frgCorrThreads = 2
sgeFragmentCorrection = -pe BWA 2 -l h_vmem=5G

merSize = 14
frgMinLen = 64
ovlMinLen = 40

ovlHashBits = 23
ovlHashBlockLength = 450000000
ovlRefBlockSize = 15000000

ovlStoreMemory = 8192
ovlThreads = 2
sgeOverlap = -pe BWA 2 -l h_vmem=11G

merylMemory = 80000
}}}
### sgeName

The entered string is appended to the job name supplied to SGE. [[BR]]
This parameter is needed to prevent different correction/assembly jobs from clashing with each other.

### sge

The entered string is passed to the qsub command used to submit to the grid any job for which no memory/slot allocation is specified. [[BR]]
This is useful since some processes do not offer any options at all to set memory usage or the number of slots to use. [[BR]]
Set this parameter accordingly for jobs that you cannot control with any other specific parameters, such as the “mertrim” stage.

All SGE jobs are run with the -A assembly option, which annotates the SGE accounting information for these jobs with the string "assembly".

We have to set {{{h_vmem}}} since Shark imposes memory limits (default 4 GB per slot) on all jobs. [[BR]]
{{{h_vmem}}} sets a limit on virtual memory. We should also set a default value for {{{h_stack}}}, which sets a limit on stack space for binary execution. [[BR]]
Without a sufficient value for {{{h_stack}}} some programs will fail to start.

*Adding and/or removing parameters from this line will likely crash your correction.*
Try and experiment!

### sgeScript

The entered string is passed to the qsub command that initiates the run of the main script, {{{runCA}}}. [[BR]]
Every stage, unless explicitly submitted to the grid, is run within {{{runCA}}} (e.g. unitigger, scaffolder). [[BR]]
This means that processes like unitigger and scaffolder will have access to the resources defined by {{{sgeScript}}}. [[BR]]

*Adding and/or removing parameters from this line will likely crash your correction.*
Try and experiment!

### !OnGrid

By enabling {{{useGrid}}} and {{{scriptOnGrid}}}, the main script ({{{runCA}}}) will be submitted directly to the grid. [[BR]]
All stages of the pipeline will then run automatically in parallel on the computational grid. [[BR]]
Process-specific switches are available that allow users to decide whether or not to submit a specific process to the grid.

[[BR]]

# Assigning memory and slots to processes

Certain parameters (e.g. {{{sgeFragmentCorrection}}}, {{{sgeOverlap}}}, {{{sgeOverlapCorrection}}}, {{{sgeConsensus}}}) allow a user [[BR]]
to set the amount of memory and the number of slots for specific processes. Each job spawned from that process will get the same [[BR]]
amount of memory and slots assigned. For some processes, however, it is not possible to assign memory and slots manually.

*Most processes run at optimal efficiency using just 1 slot/core. Exceptions to this rule are {{{sgeFragmentCorrection}}} and {{{sgeOverlap}}}, which are designed to run optimally on 2 slots/cores.* [[BR]]
The Shark cluster has a parallel environment called {{{BWA}}}. The following line of code provides an example of how to set the number of slots/cores for {{{sgeOverlap}}}:

{{{
sgeOverlap = -pe BWA 2
}}}

As mentioned before, the Shark cluster employs a standard memory limit of 4 GB per slot. [[BR]]
This requires you to define the amount of memory for every process up front. To allocate memory for a process, use {{{h_vmem}}}:

{{{
sgeOverlap = -pe BWA 2 -l h_vmem=11G
}}}

In the above example, each {{{sgeOverlap}}} job gets 2 slots assigned and a total of 22G of memory (2 slots * 11G).
You can also specify a specific queue for a batch of jobs:

{{{
sgeOverlap = -pe BWA 2 -l h_vmem=11G -q all.q
}}}

[[BR]]

## How to estimate memory usage

It can be tricky to predict memory usage, as there are many factors influencing the memory footprint. [[BR]]
Setting certain parameters directly influences the amount of memory allocated. If this exceeds the limit of 4G/slot, [[BR]]
the job in question will get killed automatically without warning, causing the pipeline to crash (eventually). [[BR]]
To prevent this from happening, a good understanding of the pipeline is required. Configure your spec file well [[BR]]
and use this page to find out exactly what each parameter does and how it may influence memory usage. [[BR]]

''Please note: do not set parameters too close to the available memory. [[BR]]
Celera Assembler often allocates a bit more memory than expected. [[BR]]
Be safe and leave 1 to 2G of free memory per job.''

**Example:** configuring the overlapper stage:

{{{
frgMinLen = 64
ovlMinLen = 40

ovlHashBits = 23
ovlHashBlockLength = 450000000
ovlRefBlockSize = 15000000

ovlStoreMemory = 8192
ovlThreads = 2
sgeOverlap = -pe BWA 2 -l h_vmem=11G
}}}
### ovlHashBits

Size of the overlap hash table in bits. A 23-bit overlap hash table accommodates up to 176,160,768 unique k-mers. [[BR]]
Defining a table of 23 bits immediately allocates more than 1,700 MB of memory.

### ovlHashBlockLength

Amount of sequence (in bp) to load into the hash table. Each base loaded consumes 10 bytes of memory. [[BR]]
Loading 450,000,000 bases will consume 4.5G of memory in addition to that used by {{{ovlHashBits}}}.

### ovlRefBlockSize

{{{ovlRefBlockSize}}} directly controls the number of overlap jobs and the run time of each. [[BR]]
Smaller values result in more jobs that each need less time to finish. [[BR]]
''It is often best to configure your jobs in such a way that around 400 jobs are spawned per stage (e.g. 400 overlapper jobs, 400 trim jobs etc.)''.

I have found no reliable way to determine the best setting for this parameter: just experiment and see how many jobs are spawned. [[BR]]
Given the size of the dataset used in this example, if the work is divided over around 400 jobs, [[BR]]
the individual jobs will turn out rather large and occupy quite some memory.

Given this information, our overlapper jobs will need at least 1,700 MB for the table, we will load 4.5G of sequence [[BR]]
into memory, and we will try to create around 400 jobs. After running PacBioToCA with this configuration, overlapper jobs turned out [[BR]]
to consume about 20G of RAM. This is more than you might expect: there are even more parameters influencing memory usage. [[BR]]
For the full story, please continue reading here.

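As a back-of-the-envelope check, the figures above can be combined into a rough lower bound per overlapper job. This sketch uses only the numbers from this page (~1,700 MB for a 23-bit table, 10 bytes per loaded base); as the 20G observation shows, the real footprint is considerably higher:

{{{
#!/bin/bash
# Rough lower bound for one overlapper job, using only the numbers above.
ovlHashBlockLength=450000000                      # bases loaded into the hash table
table_mb=1700                                     # ~1,700 MB for ovlHashBits = 23
seq_mb=$(( ovlHashBlockLength * 10 / 1000000 ))   # 10 bytes per base -> 4,500 MB
echo "lower bound: $(( table_mb + seq_mb )) MB"   # ~6,200 MB; observed was ~20G
}}}
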
# Last advice

''Memory-intensive stages usually are the overlapper and consensus stages. [[BR]]
In some cases the layout stage ({{{runCorrection.sh}}}, run as a single process) requires a lot of memory. [[BR]]
You may be forced to restart your pipeline to run this process on a high-memory node (e.g. Baskingshark). [[BR]]
For more information see “Common issues”.''

Monitor your correction as you run it. See how much memory processes are using and whether you need to modify your memory settings. [[BR]]
Use {{{qstat -j <job id>}}} for this. To get statistics for jobs that have already finished, use {{{qacct -j <job id>}}}. [[BR]]
With {{{qacct}}} it is also possible to figure out whether a job was killed.

{{{
qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------
6770304 0.51429 ovl_ASM_pa jfrank r 09/13/2013 11:03:28 all.q@greatwhiteshark.cluster. 2 66
6770304 0.51429 ovl_ASM_pa jfrank r 09/13/2013 11:03:33 all.q@cowshark.cluster.loc 2 125

qstat -j 6770304
##############################################################
job_number: 6770304
exec_file: job_scripts/6770304
submission_time: Fri Sep 13 11:03:25 2013

<... part excluded ...>

job-array tasks: 1-628:1
usage 66: cpu=04:13:29, mem=141953.77315 GBs, io=9.48519, vmem=9.408G, maxvmem=9.500G
usage 125: cpu=02:49:44, mem=95074.72777 GBs, io=5.10018, vmem=9.408G, maxvmem=9.501G

qacct -j 6770304
##############################################################
qname all.q
hostname zebrashark.cluster.loc

<... part excluded ...>

jobname ovl_ASM_PacBioToCA_HiSeq_ASM_Lclav
jobnumber 6770304
taskid 64

<... part excluded ...>

qsub_time Fri Sep 13 11:03:25 2013
start_time Fri Sep 13 11:03:31 2013
end_time Fri Sep 13 12:22:05 2013
granted_pe BWA
slots 2
failed 0
exit_status 0

<... part excluded ...>

maxvmem 9.406G
arid undefined
}}}
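
To quickly scan a finished (array) job for killed tasks or peak memory, the relevant {{{qacct}}} fields shown above can be filtered with grep, e.g.:

{{{
qacct -j 6770304 | grep -E "taskid|failed|exit_status|maxvmem"
}}}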

[[BR]]

# Submit your job - Using qsub

----

{{{
cat PacBioToCA_svn_SGE_qsub.sh

#!/bin/bash
#$ -q all.q
#$ -N PacBioToCA
#$ -cwd
#$ -j y
#$ -V
#$ -pe BWA 6
#$ -l h_vmem=10G
#$ -m e
#$ -M E.M.Ployee@lumc.nl

echo Process started `date`

/usr/local/wgs-assembler/wgs-svn/Linux-amd64/bin/PacBioToCA -noclean -partitions 100 -l PacBioToCA_svn_SGE -s /data/LGTC/LGTCusers/jfrank/Clav/Correction/PacBioToCA_svn_SGE.spec -t 24 -fastq /data/LGTC/LGTCusers/jfrank/Clav/FASTQ_FRG/clav_filtered_subreads_20c_editQuality.fastq /data/LGTC/LGTCusers/jfrank/Clav/FASTQ_FRG/clav_HiSeq.frg
echo Process ended `date`
}}}

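The wrapper script is then submitted once; with {{{scriptOnGrid}}} enabled, runCA/pacBioToCA handles all further grid submissions itself:

{{{
qsub PacBioToCA_svn_SGE_qsub.sh
}}}
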
I write out the entire qsub command on a single line: I don’t use variables. [[BR]]
Believe it or not: *using variables in the qsub command may prevent the pipeline from initiating or cause problems down the road.* [[BR]]
(I have no explanation for this behavior.) Adding certain qsub parameters has also caused the pipeline to fail in the past: [[BR]]
try and experiment for yourself.

For correction ({{{PacBioToCA}}}) you can opt not to specify a queue. This way jobs will get scheduled on any node the group you're in has access to. [[BR]]
For assembly, there are some problems using Baskingshark, and therefore I advise using the {{{all.q}}} queue.

[[BR]]

# Common issues

----

[pacBioToCA known issues](http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=PacBioToCA#Known_Issues) [[BR]]
[Bugs (SourceForge)](http://sourceforge.net/p/wgs-assembler/bugs/)

**__Symptom__**: Batch jobs quickly exit before the process is finished.
This error is mostly seen during assembly (not correction). [[BR]]
The .out file contains the following message:

{{{
perl=/usr/bin/env perl: Command not found.
jobid=65: Command not found.
jobid: Undefined variable.
}}}

**__Solution__**: A specific node (e.g. “Baskingshark”) is not correctly configured, causing certain Perl processes to crash. [[BR]]
Michel Villerius (system admin) reinstalled Baskingshark, but this did not solve the problem, or solved it only temporarily. [[BR]]
Do not use Baskingshark for this process, and thus do not use the LGTC_HiSeq.q queue. Use all.q and restart the pipeline. [[BR]]
Also do not forget to specify all.q for your batch jobs as well:

{{{
sgeOverlap = -pe BWA 2 -l h_vmem=8G -q all.q
}}}

[[BR]]

**__Symptom__**: The pipeline fails or hangs during the {{{runCorrection.sh}}} step while generating asm.n.olaps files (layout stage). [[BR]]
In the general output file you will encounter this message:

{{{
----------------------------------------END Tue Jan 17 11:04:28 2012 (1 seconds)
Failed to execute temppacbio/runCorrection.sh
}}}

**__Solution__**: Stop the pipeline. Write a separate qsub file for runCorrection.sh (see below). [[BR]]
This process may use a lot of memory, in this case about 160G. [[BR]]
Run this job on a high-memory node, such as Baskingshark, using the LGTC_HiSeq.q queue. [[BR]]
Since we will use lots of memory, we might as well request all cores. Edit the {{{runCorrection.sh}}} file
and change the number of threads to 24. {{{cd}}} into the temp directory and qsub the command. [[BR]]
The pipeline should now pick up again and run fine.

{{{
cat pacBioToCA_svn_SGE_runCor_qsub.sh

#!/bin/bash
#$ -N pacBioToCA
#$ -q LGTC_HiSeq.q
#$ -cwd
#$ -j y
#$ -V
#$ -pe BWA 24
#$ -l h_vmem=10G
#$ -m e
#$ -M E.M.Ployee@lumc.nl

sh runCorrection.sh
}}}

''Additional information - Note: the following may not work for the SVN version of Celera Assembler.'' [[BR]]
[Error in runCorrection.sh Step](http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=PacBioToCA#Error_in_runCorrection.sh_Step)

[[BR]]
[[BR]]

**__Symptom__**: The pipeline fails during the {{{runPartition.sh}}} step. [[BR]]
For some reason certain partition jobs do not get executed and remain empty. [[BR]]
When the process “finishes” it detects the empty files and fails.

{{{
----------------------------------------END Fri Sep 13 17:08:32 2013 (1 seconds)
Failed to execute temppacbio/runPartition.sh
}}}

**__Solution__**: Remove the {{{runPartition.sh}}} file. Restart the pipeline using your original qsub command. [[BR]]
The pipeline will pick up the leftover partition jobs. You may have to repeat this process several times until all jobs have finished properly.
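
A quick way to check whether partition jobs were left empty (assuming the partition output lives under the {{{temppacbio}}} directory mentioned above; the exact layout may differ):

{{{
# List zero-length files left behind by unexecuted partition jobs.
find temppacbio -type f -size 0 -print
}}}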

[[BR]]
[[BR]]

**__Symptom__**: The pipeline (gatekeeper process) fails to load all reads. [[BR]]
The gatekeeper error log ({{{asm.gkpStore.err}}}) contains error messages similar to the following:

{{{
Processing SINGLE-ENDED SANGER QV encoding reads from:
GKP finished with 578612 alerts or errors:
540303 # ILL Error: not a sequence start line.
38309 # ILL Error: not a quality start line.
|
|
|
...
}}}

**__Solution__**: You have to reinstall Celera Assembler. [[BR]]
If you will be working with reads longer than 2 Kbp, you will have to modify the source code to allow long reads. [[BR]]
Modify the file {{{AS_global.H}}} and change {{{AS_READ_MAX_NORMAL_LEN_BITS}}} from 11 to 15. [[BR]]
This is mandatory for both the official release of CA7 and the unstable SVN version.