... | ... | @@ -24,4 +24,34 @@ Leiden University Medical Center <br> |
|
|
|
|
|
### **Summary** <br>
|
|
|
The multilayered control of gene expression requires tight coordination of regulatory mechanisms at the transcriptional and post-transcriptional levels. Here, we studied the interdependence of transcription, splicing and polyadenylation events on single mRNA molecules by full-length mRNA sequencing. In MCF-7 breast cancer cells and three human tissues, we found an unforeseen number of genes that demonstrate mutually inclusive or exclusive alternative transcription and mRNA processing events, which can span the entire length of mRNA molecules. Furthermore, alternative poly(A) sites that are coupled with alternative splicing events are depleted for known poly(A) signals and enriched for MBNL binding motifs, supporting a dual role of MBNL proteins in regulating splicing and polyadenylation. We predict thousands of open-reading frames from the sequence of full-length mRNAs, allowing for a more sensitive proteogenomics analysis of MCF-7 mass-spectrometry data. Our findings demonstrate that our understanding of transcriptome complexity is far from complete and provide a framework to reveal largely unresolved mechanisms that coordinate transcription and mRNA processing.<br>
|
|
|
<br>
|
|
|
|
|
|
---
|
|
|
### **Materials and Preprocessing** <br>
|
|
|
Full-length cDNA was generated from polyA RNA using standard cDNA synthesis kits (Clontech® SMARTer™ and Invitrogen® Superscript® kits). To capture longer, rarer transcripts in sufficient abundance, parts of the double-stranded cDNA were size selected into three fractions, which were subsequently amplified and converted into SMRTbell™ templates. Details on the sample preparation can be found on [`Sample Net`](http://www.smrtcommunity.com/Share/Protocol?id=a1q70000000HqSvAAK&strRecordTypeName=Protocol). SMRTbell libraries were sequenced using the [`P4-C2 sequencing chemistry`](http://blog.pacificbiosciences.com/2013/08/new-dna-polymerase-p4-delivers-higher.html) with 2-hour movies. <br>
|
|
|
|
|
|
After sequencing, the completeness of each sequence was determined computationally using polyA-tail signals and library adapters. To obtain a non-redundant set of full-length, high-quality transcript sequences without bias from other sequencing platforms, a de novo isoform-level clustering algorithm was applied to the PacBio data only. Briefly, the algorithm iteratively clusters reads to generate consensus sequences that represent the original transcripts. It takes the presence of the polyA-tail signal into account to differentiate isoforms with alternative stop sites. The final consensus sequences were called using [`Quiver`](https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowToQuiver.rst) and filtered to create the final polished, full-length, non-redundant dataset. In 2015, the MCF-7 dataset was updated by sequencing 28 additional SMRT Cells on the PacBio RSII platform. <br>
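The completeness check described above can be illustrated with a minimal sketch. This is not the PacBio implementation: the primer sequences, the minimum tail length and the function name `is_full_length` are placeholders chosen for illustration, and exact string matching stands in for the error-tolerant alignment a real pipeline would need.

```python
import re

# Placeholder values, assumptions for illustration only.
FIVE_PRIME = "ACGTTGCA"   # hypothetical 5' primer sequence
THREE_PRIME = "TGCAACGT"  # hypothetical 3' primer sequence
MIN_POLYA = 8             # assumed minimum polyA-tail length


def is_full_length(read: str) -> bool:
    """Return True if the read starts with the 5' primer and ends with
    a polyA run that is immediately followed by the 3' primer."""
    if not read.startswith(FIVE_PRIME) or not read.endswith(THREE_PRIME):
        return False
    # Strip both primers and look for a polyA run at the end of the insert.
    insert = read[len(FIVE_PRIME):len(read) - len(THREE_PRIME)]
    tail = re.search(r"A+$", insert)
    return tail is not None and len(tail.group()) >= MIN_POLYA


if __name__ == "__main__":
    complete = FIVE_PRIME + "GATTACAGGC" + "A" * 12 + THREE_PRIME
    truncated = "GATTACAGGC" + "A" * 12 + THREE_PRIME  # 5' primer missing
    print(is_full_length(complete), is_full_length(truncated))
```

Requiring both primers and the tail means that a read counted as full-length covers the transcript from its 5' end to the polyadenylation site.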
|
|
|
|
|
|
Some statistics from the sequencing and results are listed below. Total number of SMRT Cells: **147** <br>
|
|
|
|
|
|
* 12 SMRT Cells: no size selection
|
|
|
* 37 SMRT Cells: 1-2 kb
|
|
|
* 37 SMRT Cells: 2-3 kb
|
|
|
* 33 SMRT Cells: > 3 kb
|
|
|
<br>
|
|
|
|
|
|
2015 addition: <br>
|
|
|
* 28 SMRT Cells
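As a quick consistency check on the counts listed above, the per-fraction numbers can be summed. The dictionary keys are labels chosen here for illustration. The sums suggest that the stated total of 147 includes the 2015 addition.

```python
# SMRT Cell counts as listed above; the keys are illustrative labels.
smrt_cells = {
    "no size selection": 12,
    "1-2 kb": 37,
    "2-3 kb": 37,
    "> 3 kb": 33,
    "2015 addition": 28,
}

# Cells from the original release (everything except the 2015 addition).
original = sum(n for label, n in smrt_cells.items() if label != "2015 addition")
# Cells across both sequencing rounds.
total = sum(smrt_cells.values())

print(original, total)  # 119 147
```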
|
|
|
|
|
|
Total number of post-filtered bases: **14,062,161,755** <br>
|
|
|
|
|
|
For a full description of the sample preparation protocol and the preprocessing of sequencing data, please refer to the official release note on the [`MCF-7 dataset`](http://www.pacb.com/blog/data-release-human-mcf-7-transcriptome/). <br>
|
|
|
<br>
|
|
|
|
|
|
---
|
|
|
### **Data Characteristics** <br>
|
|
|
After sequencing, transcript structures were defined by applying the isoform-level clustering algorithm ([ICE](https://github.com/PacificBiosciences/cDNA_primer/wiki)) to full-length reads, which capture the entire mRNA molecule (containing the 5' primer, polyA-tail and 3' primer sequences). To find transcript clusters, ICE performs pairwise alignment and iteratively reassigns full-length reads to clusters based on likelihood. This process is followed by consensus calling and further polishing of the sequences to reduce redundancy and increase the overall accuracy of the identified transcript variants. Our analysis pipeline could precisely determine the positions of polyadenylation sites (from the presence of the polyA tail) and intron-exon boundaries, as evident from the presence of the canonical **GU** motif in 93% of donor splice sites and the canonical **AG** motif in 95% of acceptor splice sites. In addition to 7,364 single-exon transcripts, the MCF-7 transcriptome consists of 11,350 multi-exon genes, of which 69% produced multiple transcript structures (**Figure 1**; [GFF](); [FASTA]()). Furthermore, 49% of identified transcripts in MCF-7 cells are potentially novel in comparison with the [GENCODE](http://www.gencodegenes.org/) annotation (version 19). The comparison was carried out using `cuffcompare` from the Cufflinks suite. <br>
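The splice-site check mentioned above can be sketched as follows. This is an illustrative toy, not the pipeline used for the dataset: the function name `canonical_fractions`, the coordinate convention (0-based, half-open intron intervals) and the example sequence are assumptions, and only forward-strand introns are handled. Note that the **GU** donor motif in RNA reads as `GT` in genomic or cDNA sequence.

```python
from typing import Iterable, Tuple


def canonical_fractions(genome: str,
                        introns: Iterable[Tuple[int, int]]) -> Tuple[float, float]:
    """Return the fractions of introns with a canonical GT donor and a
    canonical AG acceptor. Introns are 0-based, half-open (start, end)
    intervals on the forward strand of `genome`."""
    introns = list(introns)
    if not introns:
        raise ValueError("at least one intron is required")
    # The donor motif is the first two intron bases, the acceptor the last two.
    donors = sum(genome[start:start + 2] == "GT" for start, _ in introns)
    acceptors = sum(genome[end - 2:end] == "AG" for _, end in introns)
    return donors / len(introns), acceptors / len(introns)


if __name__ == "__main__":
    # Toy sequence: exon, canonical intron, exon, GC-AG intron, exon.
    genome = "CCC" + "GTAAAAG" + "CCC" + "GCTTTAG" + "CC"
    introns = [(3, 10), (13, 20)]
    print(canonical_fractions(genome, introns))  # (0.5, 1.0)
```

A real analysis would also reverse-complement the motifs for minus-strand transcripts and take intron coordinates from the GFF annotation.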
|
|
|
<br>
|
|
|
|
|
|
---
|
|
|
### **Prerequisites** <br> |
|
|
\ No newline at end of file |