... | ... | @@ -22,12 +22,12 @@ Leiden University Medical Center <br> |
<br>
<br>

## **Summary** <br>
The multilayered control of gene expression requires tight coordination of regulatory mechanisms at the transcriptional and post-transcriptional level. Here, we studied the interdependence of transcription, splicing and polyadenylation events on single mRNA molecules by full-length mRNA sequencing. In MCF-7 breast cancer cells and three human tissues, we found an unforeseen number of genes that demonstrate mutually inclusive or exclusive alternative transcription and mRNA processing events, which can span the entire length of mRNA molecules. Furthermore, alternative poly(A) sites that are coupled with alternative splicing events are depleted for known poly(A) signals and enriched for MBNL binding motifs, supporting a dual role of MBNL proteins in regulating splicing and polyadenylation. We predict thousands of open-reading frames from the sequence of full-length mRNAs, allowing for a more sensitive proteogenomics analysis of MCF-7 mass-spectrometry data. Our findings demonstrate that our understanding of transcriptome complexity is far from complete and provide a framework to reveal largely unresolved mechanisms that coordinate transcription and mRNA processing.<br>
<br>

---
## **Materials and Sequencing** <br>
Full-length cDNA was generated from polyA RNA using standard cDNA synthesis kits (Clontech® SMARTer™ and Invitrogen® Superscript® kits). To capture longer, rarer transcripts in sufficient abundance, parts of the double-stranded cDNA were size-selected into three fractions, which were subsequently amplified and converted into SMRTbell™ templates. Details on the sample preparation can be found on [`Sample Net`](http://www.smrtcommunity.com/Share/Protocol?id=a1q70000000HqSvAAK&strRecordTypeName=Protocol). SMRTbell libraries were sequenced using the [`P4-C2 sequencing chemistry`](http://blog.pacificbiosciences.com/2013/08/new-dna-polymerase-p4-delivers-higher.html) with 2-hour movies. <br>

After sequencing, the completeness of each sequence was determined computationally using polyA-tail signals and library adapters. To obtain a non-redundant set of full-length, high-quality transcript sequences without bias from other sequencing platforms, a de novo isoform-level clustering algorithm was applied to the PacBio data alone. Briefly, the algorithm iteratively clusters reads to generate consensus sequences that represent the original transcripts. The algorithm takes into account the existence of the polyA-tail signal to differentiate isoforms with alternative stop sites. The final consensus sequences were called using [`Quiver`](https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowToQuiver.rst) and filtered to create the final polished, full-length, non-redundant dataset. In 2015, the MCF-7 dataset was updated by sequencing 28 additional SMRT Cells on the PacBio RSII platform. <br>
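
The completeness check described above can be illustrated with a minimal sketch. It is not the pipeline's actual implementation: the function names, the primer sequences, and the `min_poly_a` threshold are hypothetical. It also uses exact string matching, whereas real long reads contain sequencing errors and would need an alignment-based search.

```python
import re

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_full_length(read, five_primer, three_primer, min_poly_a=15):
    """Return True if the read has a 5' primer at its start and a poly(A)
    run immediately followed by the 3' primer at its end.

    Both orientations are tested, because a read may come from either strand.
    Exact matching is an assumption made to keep the sketch short.
    """
    tail = re.compile("A{%d,}%s$" % (min_poly_a, re.escape(three_primer)))
    for seq in (read, reverse_complement(read)):
        if seq.startswith(five_primer) and tail.search(seq):
            return True
    return False


# Hypothetical primer sequences, for illustration only.
FIVE_PRIMER = "AAGCAGTGG"
THREE_PRIMER = "GTACTCTG"

complete = FIVE_PRIMER + "GATTACA" * 5 + "A" * 20 + THREE_PRIMER
truncated = FIVE_PRIMER + "GATTACA" * 5 + THREE_PRIMER  # no poly(A) tail

print(is_full_length(complete, FIVE_PRIMER, THREE_PRIMER))
print(is_full_length(truncated, FIVE_PRIMER, THREE_PRIMER))
```

Requiring all three signals together is what lets a read be treated as spanning the entire mRNA molecule; a read missing any of them would be set aside as not full-length.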
... | ... | @@ -49,7 +49,7 @@ For full description of the sample preparation protocol and pre-processing of se |
<br>

---
## **Data Preprocessing and Characteristics** <br>
After sequencing, transcript structures were defined by applying the isoform-level clustering algorithm ([ICE](https://github.com/PacificBiosciences/cDNA_primer/wiki)) to full-length reads, which capture the entire mRNA molecule (containing 5', polyA-tail and 3' primer sequences). To find transcript clusters, ICE performs pairwise alignment and iterative reassignment of full-length reads to clusters based on likelihood. This process is followed by consensus calling and further polishing of the sequences to reduce redundancy and increase the overall accuracy of the sequences for the identified transcript variants. Our analysis pipeline could precisely determine the position of polyadenylation sites (from the presence of a polyA tail) and of intron-exon boundaries, as evident from the presence of the canonical **GU** motif in 93% of donor splice sites and the canonical **AG** motif in 95% of acceptor splice sites. <br>

Transcripts were clustered into genes using the following criteria. For each genomic region containing transcripts, we create a graph without edges in which each transcript is represented as a node. If two transcripts are on the same strand and share at least two exons, an edge is added to link them. For each connected component of the resulting network, a unique gene id is generated and assigned to its transcripts using the same scheme as the original data. Transcript ids are derived from the gene ids, with an index running from 1 to the total number of transcripts assigned to each gene. A lookup table is generated that maps the old ids to the new ones. <br>
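
The gene-clustering rule above amounts to finding connected components in a graph of transcripts. The sketch below is one way to express it, not the script used for this dataset. It assumes that "sharing an exon" means identical start and end coordinates, and the function name, input layout, and the `G` id prefix are hypothetical stand-ins for the original naming scheme.

```python
from itertools import combinations


def cluster_transcripts(transcripts, min_shared_exons=2, prefix="G"):
    """Group transcripts into genes.

    transcripts: dict mapping an old transcript id to a tuple
                 (chromosome, strand, list of (start, end) exons).
    Returns a lookup table: old id -> (gene id, new transcript id).
    """
    # Union-find: every transcript starts as its own component (node, no edges).
    parent = {tid: tid for tid in transcripts}

    def find(tid):
        while parent[tid] != tid:
            parent[tid] = parent[parent[tid]]  # path halving
            tid = parent[tid]
        return tid

    # Add an edge (merge components) for same-strand pairs sharing enough exons.
    for a, b in combinations(sorted(transcripts), 2):
        chrom_a, strand_a, exons_a = transcripts[a]
        chrom_b, strand_b, exons_b = transcripts[b]
        if (chrom_a, strand_a) != (chrom_b, strand_b):
            continue
        if len(set(exons_a) & set(exons_b)) >= min_shared_exons:
            parent[find(a)] = find(b)

    # Collect the members of each connected component.
    components = {}
    for tid in sorted(transcripts):
        components.setdefault(find(tid), []).append(tid)

    # One gene id per component; transcript ids are indexed from 1 within a gene.
    lookup = {}
    for gene_index, members in enumerate(components.values(), start=1):
        gene_id = f"{prefix}{gene_index}"
        for tx_index, old_id in enumerate(members, start=1):
            lookup[old_id] = (gene_id, f"{gene_id}.{tx_index}")
    return lookup


example = {
    "t1": ("chr1", "+", [(100, 200), (300, 400), (500, 600)]),
    "t2": ("chr1", "+", [(300, 400), (500, 600), (700, 800)]),
    "t3": ("chr1", "+", [(500, 600), (900, 1000)]),  # shares only one exon
    "t4": ("chr1", "-", [(100, 200), (300, 400)]),   # opposite strand
}
for old_id, (gene_id, new_id) in sorted(cluster_transcripts(example).items()):
    print(old_id, gene_id, new_id)
```

Because membership is defined by connected components, two transcripts can end up in the same gene without sharing exons directly, as long as a chain of intermediate transcripts links them. The pairwise comparison is quadratic in the number of transcripts, which is why restricting it to one genomic region at a time, as the text describes, keeps it tractable.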
... | ... | @@ -70,7 +70,7 @@ In addition to 7,364 single-exon transcripts, the MCF-7 transcriptome consists o |
<br>

---
## **Prerequisites** <br>

This is the script used for this task: <br>
```python
|
... | ... | |