**How to correct long !PacBio reads using pacBioToCA**

[[BR]]

[[PageOutline(1-2,Page Contents,inline)]]

[[BR]]

# Celera assembler and the LUMC Shark compute cluster

----

## Introduction

This document describes the use of Celera Assembler ("wgs-assembler") and PacBioToCA (unreleased version as of August 1, 2013) on the LUMC Shark computing cluster using the SGE grid.

## Version

Unreleased version as of August 1, 2013. Downloaded from the CVS repository.

{{{
/usr/local/wgs-assembler/wgs-svn/Linux-amd64/bin/runCA -version
CA version CVS TIP ($Id: AS_GKP_main.C 4371 2013-08-01 17:19:47Z brianwalenz $).
CA version CVS TIP ($Id: AS_CGB_unitigger.C 4371 2013-08-01 17:19:47Z brianwalenz $).
CA version CVS TIP ($Id: BuildUnitigs.C 4371 2013-08-01 17:19:47Z brianwalenz $).
Using up to 8 OpenMP threads.
CA version CVS TIP ($Id: AS_CGW_main.C 4371 2013-08-01 17:19:47Z brianwalenz $).
CA version CVS TIP ($Id: terminator.C 4371 2013-08-01 17:19:47Z brianwalenz $).
}}}

''NOTE: this is not the default (“current”) version on Shark.[[BR]]
Use the full path to run this unreleased version of Celera Assembler.''

## Installation

Instructions regarding installation of the most recent unreleased version of Celera Assembler from a CVS repository can be found here:
[Check_out_and_Compile](http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Check_out_and_Compile) [[BR]]
Steps 6 and 7 of the guide were skipped. No additional (bug)fixes were performed.

[[BR]][[BR]]

# PacBioToCA

----

## Introduction

I will use a case study to provide a high-level walkthrough of PacBioToCA. [[BR]]
For this research project, two organisms were sequenced simultaneously, resulting in a combined dataset.[[BR]]

Since the !PacBio platform has a high random error rate (~15%, i.e. roughly 85% read accuracy), data correction is required before the data can be used for assembly.

Illumina !HiSeq 2000 data, including paired and unpaired reads, will be used to correct a !PacBio dataset consisting of 20 SMRT cells. [[BR]]
Basic characteristics of both datasets are presented in tables 1 and 2.

[[BR]]

|| ||**!PacBio filtered subreads 20 cells**||
||Total reads||1,000,321||
||Total bases||1,152,284,129 bp||
||Mean read length||1,152 bp||
||Median read length||996 bp||
||Size range (bp)||500 - 7,977||
||Standard deviation||593.37||
||Overall GC||33.42%||

**Table 1:** Uncorrected !PacBio data.

[[BR]][[BR]]

|| ||**!HiSeq forward unpaired**||**!HiSeq reverse unpaired**||**!HiSeq forward paired (subset)**||**!HiSeq reverse paired (subset)**||
||Total reads||6,054,045||801,344||14,162,580||14,154,532||
||Total bases||521,866,017 bp||69,878,229 bp||1,391,281,652 bp||1,366,430,609 bp||
||Mean read length||86 bp||87 bp||98 bp||97 bp||
||Median read length||100 bp||100 bp||100 bp||100 bp||
||Size range (bp)||36 - 100||36 - 100||36 - 100||36 - 100||
||Standard deviation||19.60||20.69||8.04||2.75||
||Overall GC||33.66%||30.55%||30.22%||30.05%||

**Table 2:** Illumina !HiSeq 2000 datasets. [[BR]]
*NOTE:* The statistics for the paired datasets are based on a random subset (10%) taken from the original dataset. [[BR]]
A subset was analysed instead of the complete dataset because of the large size of the full paired datasets.

[[BR]]

## Input - Uncorrected !PacBio data (FASTQ)

PacBioToCA requires !PacBio data to be supplied as FASTQ files. [[BR]]
Pacific Biosciences encodes quality values in FASTQ files using their own specific format. This format is not compatible with Celera Assembler. To circumvent this problem, !PacBio data was retrieved as FASTA files from the SMRT portal. [[BR]]
The FASTA files were then “converted” to FASTQ files by assigning standard quality values. [[BR]]
For this, the fastools package (available on Shark) was used:

{{{
fastools fa2fq <in.fasta> <out.fastq>
}}}
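
If fastools is not available, the same “conversion” can be sketched with standard awk: emit each FASTA record as a FASTQ record with a constant dummy quality string. This is an illustrative sketch, not the fastools implementation, and it assumes one sequence line per FASTA record:

{{{
# Hypothetical stand-in for "fastools fa2fq": assigns the fixed Sanger
# quality "I" (Q40) to every base. Assumes single-line sequences.
awk '/^>/ { sub(/^>/, "@"); print; next }
     { print; print "+"; q = $0; gsub(/./, "I", q); print q }' in.fasta > out.fastq
}}}
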
## Input - Illumina !HiSeq data (FRG)

The !HiSeq dataset will be used to correct the !PacBio reads. [[BR]]
Any dataset used to perform !PacBio data correction must be supplied as an FRG file. [[BR]]
Celera Assembler includes a utility, {{{fastqToCA}}}, to generate wrapper LIB messages for (Illumina) FASTQ files. [[BR]]
[FastqToCA](http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=FastqToCA)

{{{
fastqToCA -insertsize 400 50 -libraryname HiSeq -technology illumina -reads HiSeq_R1001_forward_unpaired.fastq -reads HiSeq_R2001_reverse_unpaired.fastq -mates HiSeq_R1001_forward_paired.fastq,HiSeq_R2001_reverse_paired.fastq
}}}

The resulting FRG file wraps the FASTQ files in a LIB message:

{{{
{VER
ver:2
}
{LIB
act:A
acc:HiSeq
ori:I
mea:400.000
std:50.000
src:
.
nft:18
fea:
forceBOGunitigger=1
isNotRandom=0
doNotTrustHomopolymerRuns=0
doTrim_initialNone=0
doTrim_initialMerBased=1
doTrim_initialFlowBased=0
doTrim_initialQualityBased=0
doRemoveDuplicateReads=1
doTrim_finalLargestCovered=1
doTrim_finalEvidenceBased=0
doRemoveSpurReads=1
doRemoveChimericReads=1
doConsensusCorrection=0
fastqQualityValues=sanger
fastqOrientation=innie
fastqReads=/data/LGTC/LGTCusers/jfrank/Clav/FASTQ_FRG/HiSeq_R1001_forward_unpaired.fastq
fastqReads=/data/LGTC/LGTCusers/jfrank/Clav/FASTQ_FRG/HiSeq_R2001_reverse_unpaired.fastq
fastqMates=/data/LGTC/LGTCusers/jfrank/Clav/FASTQ_FRG/HiSeq_R1001_forward_paired.fastq,/data/LGTC/LGTCusers/jfrank/Clav/FASTQ_FRG/HiSeq_R2001_reverse_paired.fastq
.
}
{VER
ver:1
}
}}}
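
{{{fastqToCA}}} writes the FRG messages shown above to standard output; to use them with pacBioToCA, they are typically redirected into a file (here the {{{clav_HiSeq.frg}}} file used later on this page):

{{{
fastqToCA -insertsize 400 50 -libraryname HiSeq -technology illumina \
  -reads HiSeq_R1001_forward_unpaired.fastq -reads HiSeq_R2001_reverse_unpaired.fastq \
  -mates HiSeq_R1001_forward_paired.fastq,HiSeq_R2001_reverse_paired.fastq > clav_HiSeq.frg
}}}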

[[BR]][[BR]]

# Specifying options - The spec file

----

The most convenient way to configure Celera Assembler/PacBioToCA is to use a spec file. [[BR]]
In this file, options are defined in a key=value manner. There are many options available. [[BR]]
For a complete list, please refer to [RunCA](http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=RunCA) or run:

{{{
runCA -options
}}}

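To look up only the grid-related options in that (long) list, the output can simply be piped through grep:

{{{
/usr/local/wgs-assembler/wgs-svn/Linux-amd64/bin/runCA -options | grep -i sge
}}}
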
A spec file must be well planned and designed. For example, specific options need to be defined if you want to make use of the SGE grid on Shark. [[BR]]
Experimenting with spec files on the grid has shown that only a few configurations work well, while most other setups will end up crashing your job. [[BR]]
The parameters most important for SGE and the Shark cluster will be discussed later.

Since the Shark cluster employs a standard memory limit of 4 GB per slot, you need to define the amount of memory for every process up front. [[BR]]
This can be tricky to predict, as there are many factors influencing the memory footprint, such as the size and complexity of the genome/data [[BR]]
and the automatic memory allocation of certain options. This requires you to know the program and your data very well. [[BR]]
To get an idea of the memory usage, the spec file used during this study is provided below.

{{{
cat pacBioToCA_svn_SGE.spec
sgeName = pacBioToCA
sge = -A assembly -l h_vmem=10G -l h_stack=256m
sgeScript = -pe BWA 8

useGrid = 1
scriptOnGrid = 1

mbtOnGrid = 1
frgCorrOnGrid = 1
ovlOnGrid = 1
ovlCorrOnGrid = 1
cnsOnGrid = 1

##############################################################################

mbtBatchSize = 1500000
mbtThreads = 1

frgCorrBatchSize = 200000
frgCorrThreads = 2
sgeFragmentCorrection = -pe BWA 2 -l h_vmem=5G

merSize = 14
frgMinLen = 64
ovlMinLen = 40

ovlHashBits = 23
ovlHashBlockLength = 450000000
ovlRefBlockSize = 15000000

ovlStoreMemory = 8192
ovlThreads = 2
sgeOverlap = -pe BWA 2 -l h_vmem=11G

merylMemory = 80000
}}}
### sgeName

The entered string is appended to the job name supplied to SGE. [[BR]]
This parameter is needed to prevent different correction/assembly jobs from clashing with each other.

### sge

The entered string is passed to the qsub command used to submit to the grid any job for which no memory/slot allocation is specified. [[BR]]
This is useful since some processes do not offer any options at all to set memory usage or the number of slots to use. [[BR]]
Set this parameter accordingly for jobs that you cannot control with any other specific parameters, such as the “mertrim” stage.

All SGE jobs are run with the -A assembly option, which annotates the SGE accounting information for these jobs with the string "assembly".

We have to set {{{h_vmem}}} since Shark imposes memory limits (default 4 GB per slot) on all jobs. [[BR]]
{{{h_vmem}}} sets a limit on virtual memory. We should also set a default value for {{{h_stack}}}, which sets a limit on stack space for binary execution. [[BR]]
Without a sufficient value for {{{h_stack}}} some programs will fail to start.

*Adding and/or removing parameters from this line will likely crash your correction.*
Try and experiment!

### sgeScript

The entered string is passed to the qsub command that initiates the run of the main script, {{{runCA}}}. [[BR]]
Every stage, unless explicitly submitted to the grid, is run within {{{runCA}}} (e.g. unitigger, scaffolder). [[BR]]
This means that processes like unitigger and scaffolder will have access to the resources defined by {{{sgeScript}}}. [[BR]]

*Adding and/or removing parameters from this line will likely crash your correction.*
Try and experiment!

### !OnGrid

By enabling {{{useGrid}}} and {{{scriptOnGrid}}}, the main script ({{{runCA}}}) will be submitted directly to the grid. [[BR]]
All stages of the pipeline will then run automatically in parallel on the computational grid. [[BR]]
Process-specific switches are available that allow users to decide whether or not to submit a specific process to the grid.

[[BR]]

# Assigning memory and slots to processes

Certain parameters (e.g. {{{sgeFragmentCorrection}}}, {{{sgeOverlap}}}, {{{sgeOverlapCorrection}}}, {{{sgeConsensus}}}) allow a user [[BR]]
to set the amount of memory and the number of slots for specific processes. Each job spawned from that process will get the same [[BR]]
amount of memory and slots assigned. For some processes, however, it is not possible to assign memory and slots manually.

*Most processes run at optimal efficiency using just 1 slot/core. Exceptions to this rule are {{{sgeFragmentCorrection}}} and {{{sgeOverlap}}}, which are designed to run optimally on 2 slots/cores.* [[BR]]
The Shark cluster has a parallel environment called {{{BWA}}}. The following line of code provides an example of how to set the number of slots/cores for {{{sgeOverlap}}}:

{{{
sgeOverlap = -pe BWA 2
}}}

As mentioned before, the Shark cluster employs a standard memory limit of 4 GB per slot. [[BR]]
This requires you to define the amount of memory for every process up front. To allocate memory for a process, use {{{h_vmem}}}:

{{{
sgeOverlap = -pe BWA 2 -l h_vmem=11G
}}}

In the above example, each {{{sgeOverlap}}} job gets 2 slots assigned and a total of 22G of memory (2 slots * 11G).
You can also specify a specific queue for a batch of jobs:

{{{
sgeOverlap = -pe BWA 2 -l h_vmem=11G -q all.q
}}}

[[BR]]

## How to estimate memory usage

It can be tricky to predict memory usage, as there are many factors influencing the memory footprint. [[BR]]
Setting certain parameters directly influences the amount of memory allocated. If this exceeds the limit of 4G/slot, [[BR]]
the job in question will get killed automatically without warning, causing the pipeline to crash (eventually). [[BR]]
To prevent this from happening, a good understanding of the pipeline is required. Configure your spec file well [[BR]]
and use this page to find out exactly what each parameter does and how it may influence memory usage. [[BR]]

''Please note: do not set parameters too close to the available memory. [[BR]]
Celera Assembler often allocates a bit more memory than expected. [[BR]]
Be safe and leave 1 to 2G of free memory per job.''

**Example:** configuring the overlapper stage:

{{{
frgMinLen = 64
ovlMinLen = 40

ovlHashBits = 23
ovlHashBlockLength = 450000000
ovlRefBlockSize = 15000000

ovlStoreMemory = 8192
ovlThreads = 2
sgeOverlap = -pe BWA 2 -l h_vmem=11G
}}}
### ovlHashBits

Size of the overlap hash table in bits. A 23-bit overlap hash table accommodates up to 176,160,768 unique k-mers. [[BR]]
Defining a table of 23 bits immediately allocates more than 1,700 MB of memory.

### ovlHashBlockLength

Amount of sequence (in bp) to load into the hash table. Each base loaded consumes 10 bytes of memory. [[BR]]
Loading 450,000,000 bases will consume 4.5G of memory in addition to that used by {{{ovlHashBits}}}.

### ovlRefBlockSize

{{{ovlRefBlockSize}}} directly controls the number of overlap jobs and the run time of each. [[BR]]
Smaller values result in more jobs that each need less time to finish. [[BR]]
''It is often best to configure your jobs in such a way that around 400 jobs are spawned per stage (e.g. 400 overlapper jobs, 400 trim jobs etc.)''.

I have found no reliable way to determine the best setting for this parameter: just experiment and see how many jobs are spawned. [[BR]]
Given the size of the dataset used in this example, if the work is divided over around 400 jobs, [[BR]]
the individual jobs will turn out rather large and occupy quite some memory.

Given this information, our overlapper jobs will need at least 1,700 MB for the table, we will load 4.5G of sequence [[BR]]
into memory, and we will try to create around 400 jobs. After running PacBioToCA with this configuration, overlapper jobs turned out [[BR]]
to consume about 20G of RAM. This is more than you might expect: there are even more parameters influencing memory usage. [[BR]]
For the full story, please continue reading here.

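As a back-of-the-envelope check, the figures above can be combined into a rough lower bound per overlapper job. This sketch uses only the numbers from this page (~1,700 MB for a 23-bit table, 10 bytes per loaded base); as the 20G observation shows, the real footprint is considerably higher:

{{{
#!/bin/bash
# Rough lower bound for one overlapper job, using only the numbers above.
ovlHashBlockLength=450000000                      # bases loaded into the hash table
table_mb=1700                                     # ~1,700 MB for ovlHashBits = 23
seq_mb=$(( ovlHashBlockLength * 10 / 1000000 ))   # 10 bytes per base -> 4,500 MB
echo "lower bound: $(( table_mb + seq_mb )) MB"   # ~6,200 MB; observed was ~20G
}}}
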
# Last advice

''Memory-intensive stages usually are the overlapper and consensus stages. [[BR]]
In some cases the layout stage ({{{runCorrection.sh}}}, run as a single process) requires a lot of memory. [[BR]]
You may be forced to restart your pipeline to run this process on a high-memory node (e.g. Baskingshark). [[BR]]
For more information see “Common issues”.''

Monitor your correction as you run it. See how much memory processes are using and whether you need to modify your memory settings. [[BR]]
Use {{{qstat -j <job id>}}} for this. To get statistics for jobs that have already finished, use {{{qacct -j <job id>}}}. [[BR]]
With {{{qacct}}} it is also possible to figure out whether a job was killed.

{{{
qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------
6770304 0.51429 ovl_ASM_pa jfrank r 09/13/2013 11:03:28 all.q@greatwhiteshark.cluster. 2 66
6770304 0.51429 ovl_ASM_pa jfrank r 09/13/2013 11:03:33 all.q@cowshark.cluster.loc 2 125

qstat -j 6770304
##############################################################
job_number: 6770304
exec_file: job_scripts/6770304
submission_time: Fri Sep 13 11:03:25 2013

<... part excluded ...>

job-array tasks: 1-628:1
usage 66: cpu=04:13:29, mem=141953.77315 GBs, io=9.48519, vmem=9.408G, maxvmem=9.500G
usage 125: cpu=02:49:44, mem=95074.72777 GBs, io=5.10018, vmem=9.408G, maxvmem=9.501G

qacct -j 6770304
##############################################################
qname all.q
hostname zebrashark.cluster.loc

<... part excluded ...>

jobname ovl_ASM_PacBioToCA_HiSeq_ASM_Lclav
jobnumber 6770304
taskid 64

<... part excluded ...>

qsub_time Fri Sep 13 11:03:25 2013
start_time Fri Sep 13 11:03:31 2013
end_time Fri Sep 13 12:22:05 2013
granted_pe BWA
slots 2
failed 0
exit_status 0

<... part excluded ...>

maxvmem 9.406G
arid undefined
}}}
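
To quickly scan a finished (array) job for killed tasks or peak memory, the relevant {{{qacct}}} fields shown above can be filtered with grep, e.g.:

{{{
qacct -j 6770304 | grep -E "taskid|failed|exit_status|maxvmem"
}}}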

[[BR]]

# Submit your job - Using qsub

----

{{{
cat PacBioToCA_svn_SGE_qsub.sh

#!/bin/bash
#$ -q all.q
#$ -N PacBioToCA
#$ -cwd
#$ -j y
#$ -V
#$ -pe BWA 6
#$ -l h_vmem=10G
#$ -m e
#$ -M E.M.Ployee@lumc.nl

echo Process started `date`

/usr/local/wgs-assembler/wgs-svn/Linux-amd64/bin/PacBioToCA -noclean -partitions 100 -l PacBioToCA_svn_SGE -s /data/LGTC/LGTCusers/jfrank/Clav/Correction/PacBioToCA_svn_SGE.spec -t 24 -fastq /data/LGTC/LGTCusers/jfrank/Clav/FASTQ_FRG/clav_filtered_subreads_20c_editQuality.fastq /data/LGTC/LGTCusers/jfrank/Clav/FASTQ_FRG/clav_HiSeq.frg
echo Process ended `date`
}}}

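The wrapper script is then submitted once; with {{{scriptOnGrid}}} enabled, runCA/pacBioToCA handles all further grid submissions itself:

{{{
qsub PacBioToCA_svn_SGE_qsub.sh
}}}
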
I write out the entire qsub command on a single line: I don’t use variables. [[BR]]
Believe it or not: *using variables in the qsub command may prevent the pipeline from initiating or cause problems down the road.* [[BR]]
(I have no explanation for this behavior.) Adding certain qsub parameters has also caused the pipeline to fail in the past: [[BR]]
try and experiment for yourself.

For correction ({{{PacBioToCA}}}) you can opt not to specify a queue. This way jobs will get scheduled on any node the group you're in has access to. [[BR]]
For assembly, there are some problems using Baskingshark, and therefore I advise using the {{{all.q}}} queue.

[[BR]]

# Common issues

----

[pacBioToCA known issues](http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=PacBioToCA#Known_Issues) [[BR]]
[Bugs (SourceForge)](http://sourceforge.net/p/wgs-assembler/bugs/)

**__Symptom__**: Batch jobs quickly exit before the process is finished.
This error is mostly seen during assembly (not correction). [[BR]]
The .out file contains the following message:

{{{
perl=/usr/bin/env perl: Command not found.
jobid=65: Command not found.
jobid: Undefined variable.
}}}

**__Solution__**: A specific node (e.g. “Baskingshark”) is not correctly configured, causing certain Perl processes to crash. [[BR]]
Michel Villerius (system admin) reinstalled Baskingshark, but this did not solve the problem, or solved it only temporarily. [[BR]]
Do not use Baskingshark for this process, and thus do not use the LGTC_HiSeq.q queue. Use all.q and restart the pipeline. [[BR]]
Also do not forget to specify all.q for your batch jobs as well:

{{{
sgeOverlap = -pe BWA 2 -l h_vmem=8G -q all.q
}}}

[[BR]]

**__Symptom__**: The pipeline fails or hangs during the {{{runCorrection.sh}}} step while generating asm.n.olaps files (layout stage). [[BR]]
In the general output file you will encounter this message:

{{{
----------------------------------------END Tue Jan 17 11:04:28 2012 (1 seconds)
Failed to execute temppacbio/runCorrection.sh
}}}

**__Solution__**: Stop the pipeline. Write a separate qsub file for runCorrection.sh (see below). [[BR]]
This process may use a lot of memory, in this case about 160G. [[BR]]
Run this job on a high-memory node, such as Baskingshark, using the LGTC_HiSeq.q queue. [[BR]]
Since we will use lots of memory, we might as well request all cores. Edit the {{{runCorrection.sh}}} file
and change the number of threads to 24. {{{cd}}} into the temp directory and qsub the command. [[BR]]
The pipeline should now pick up again and run fine.

{{{
cat pacBioToCA_svn_SGE_runCor_qsub.sh

#!/bin/bash
#$ -N pacBioToCA
#$ -q LGTC_HiSeq.q
#$ -cwd
#$ -j y
#$ -V
#$ -pe BWA 24
#$ -l h_vmem=10G
#$ -m e
#$ -M E.M.Ployee@lumc.nl

sh runCorrection.sh
}}}

''Additional information - Note: the following may not work for the SVN version of Celera Assembler.'' [[BR]]
[Error in runCorrection.sh Step](http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=PacBioToCA#Error_in_runCorrection.sh_Step)

[[BR]]
[[BR]]

**__Symptom__**: The pipeline fails during the {{{runPartition.sh}}} step. [[BR]]
For some reason certain partition jobs do not get executed and remain empty. [[BR]]
When the process “finishes” it detects the empty files and fails.

{{{
----------------------------------------END Fri Sep 13 17:08:32 2013 (1 seconds)
Failed to execute temppacbio/runPartition.sh
}}}

**__Solution__**: Remove the {{{runPartition.sh}}} file. Restart the pipeline using your original qsub command. [[BR]]
The pipeline will pick up the leftover partition jobs. You may have to repeat this process several times until all jobs have finished properly.
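
A quick way to check whether partition jobs were left empty (assuming the partition output lives under the {{{temppacbio}}} directory mentioned above; the exact layout may differ):

{{{
# List zero-length files left behind by unexecuted partition jobs.
find temppacbio -type f -size 0 -print
}}}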

[[BR]]
[[BR]]

**__Symptom__**: The pipeline (gatekeeper process) fails to load all reads. [[BR]]
The gatekeeper error log ({{{asm.gkpStore.err}}}) contains error messages similar to the following:

{{{
Processing SINGLE-ENDED SANGER QV encoding reads from:
GKP finished with 578612 alerts or errors:
540303 # ILL Error: not a sequence start line.
38309 # ILL Error: not a quality start line.
|
|
|
...
}}}

**__Solution__**: You have to reinstall Celera Assembler. [[BR]]
If you will be working with reads longer than 2 Kbp, you will have to modify the source code to allow long reads. [[BR]]
Modify the file {{{AS_global.H}}} and change {{{AS_READ_MAX_NORMAL_LEN_BITS}}} from 11 to 15. [[BR]]
This is mandatory for both the official release of CA7 and the unstable SVN version.