... | ... | @@ -22,13 +22,13 @@ Leiden University Medical Center <br> |
|
|
|
|
|
## **Summary** <br>
|
|
|
The multilayered control of gene expression requires tight coordination of regulatory mechanisms at the transcriptional and post-transcriptional level. Here, we studied the interdependence of transcription, splicing and polyadenylation events on single mRNA molecules by full-length mRNA sequencing. In MCF-7 breast cancer cells and three human tissues, we found an unforeseen number of genes that demonstrate mutually inclusive or exclusive alternative transcription and mRNA processing events, which can span the entire length of mRNA molecules. Furthermore, alternative poly(A) sites that are coupled with alternative splicing events are depleted for known poly(A) signals and enriched for MBNL binding motifs, supporting a dual role of MBNL proteins in regulating splicing and polyadenylation. We predict thousands of open-reading frames from the sequence of full-length mRNAs, allowing for a more sensitive proteogenomics analysis of MCF-7 mass-spectrometry data. Our findings demonstrate that our understanding of transcriptome complexity is far from complete and provide a framework to reveal largely unresolved mechanisms that coordinate transcription and mRNA processing.<br>
|
|
|
<br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Materials and Sequencing** <br>
|
|
|
Full-length cDNA was generated from polyA RNA using standard cDNA synthesis kits (Clontech® SMARTer™ and Invitrogen® Superscript® kits). To capture longer, rarer transcripts in sufficient abundance, parts of the double-stranded cDNA were size selected into three fractions, which were subsequently amplified and converted into SMRTbell™ templates. Details on the sample preparation can be found on [`Sample Net`](http://www.smrtcommunity.com/Share/Protocol?id=a1q70000000HqSvAAK&strRecordTypeName=Protocol). SMRTbell libraries were sequenced using the [`P4-C2 sequencing chemistry`](http://blog.pacificbiosciences.com/2013/08/new-dna-polymerase-p4-delivers-higher.html) with 2-hour movies. <br>
|
|
|
Full-length cDNA was generated from polyA RNA using standard cDNA synthesis kits (Clontech® SMARTer™ and Invitrogen® Superscript® kits). To capture longer, rarer transcripts in sufficient abundance, parts of the double-stranded cDNA were size selected into three fractions, which were subsequently amplified and converted into SMRTbell™ templates. Details on the sample preparation can be found on [Sample Net](http://www.smrtcommunity.com/Share/Protocol?id=a1q70000000HqSvAAK&strRecordTypeName=Protocol). SMRTbell libraries were sequenced using the [P4-C2 sequencing chemistry](http://blog.pacificbiosciences.com/2013/08/new-dna-polymerase-p4-delivers-higher.html) with 2-hour movies. <br>
|
|
|
|
|
|
After sequencing, the completeness of the sequences was computationally determined using polyA-tail signals and library adapters. To obtain a non-redundant set of full-length, high-quality transcript sequences without bias from other sequencing platforms, a de novo isoform-level clustering algorithm was applied to PacBio data only. Briefly, the algorithm iteratively clusters reads to generate consensus sequences that represent the original transcripts. The algorithm takes into account the existence of the polyA-tail signal to differentiate isoforms with alternative stop sites. The final consensus sequences were called using [`Quiver`](https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowToQuiver.rst) and filtered to create the final polished, full-length, non-redundant dataset. In 2015, the MCF-7 dataset was updated by sequencing 28 additional SMRT Cells on the PacBio RSII platform. <br>
|
|
|
After sequencing, the completeness of the sequences was computationally determined using polyA-tail signals and library adapters. To obtain a non-redundant set of full-length, high-quality transcript sequences without bias from other sequencing platforms, a de novo isoform-level clustering algorithm was applied to PacBio data only. Briefly, the algorithm iteratively clusters reads to generate consensus sequences that represent the original transcripts. The algorithm takes into account the existence of the polyA-tail signal to differentiate isoforms with alternative stop sites. The final consensus sequences were called using [Quiver](https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowToQuiver.rst) and filtered to create the final polished, full-length, non-redundant dataset. In 2015, the MCF-7 dataset was updated by sequencing 28 additional SMRT Cells on the PacBio RSII platform. <br>
|
|
|
|
|
|
Some statistics from the sequencing and results are listed below. Total number of SMRT Cells: **147** <br>
|
|
|
|
... | ... | @@ -43,10 +43,10 @@ Some statistics from the sequencing and results are listed below. Total number o |
|
|
|
|
|
Total number of post-filtered bases: **14,062,161,755** <br>
|
|
|
|
|
|
For a full description of the sample preparation protocol and pre-processing of sequencing data, please refer to the official release note on the [`MCF-7 dataset`](http://www.pacb.com/blog/data-release-human-mcf-7-transcriptome/). <br>
|
|
|
For a full description of the sample preparation protocol and pre-processing of sequencing data, please refer to the official release note on the [MCF-7 dataset](http://www.pacb.com/blog/data-release-human-mcf-7-transcriptome/). <br>
|
|
|
|
|
|
Please note that full-length mRNA sequencing data from three human tissues (**brain**, **heart** and **liver**) have also been used for comparative analysis of our findings. For a full description of the sample preparation protocol and pre-processing of sequencing data, please refer to the official release note on the [`whole transcriptome of three human tissues`](http://www.pacb.com/blog/data-release-whole-human-transcriptome/). <br>
|
|
|
<br>
|
|
|
Please note that full-length mRNA sequencing data from three human tissues (**brain**, **heart** and **liver**) have also been used for comparative analysis of our findings. For a full description of the sample preparation protocol and pre-processing of sequencing data, please refer to the official release note on the [whole transcriptome of three human tissues](http://www.pacb.com/blog/data-release-whole-human-transcriptome/). <br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Data Preprocessing and Characteristics** <br>
|
... | ... | @@ -66,14 +66,14 @@ In addition to 7,364 single-exon transcripts, the MCF-7 transcriptome consists o |
|
|
|
|
|
> **Figure 1: Overview of identified transcripts in MCF-7 transcriptome.** Histograms show the distribution of the number of identified transcripts per gene and transcript lengths. Density plot depicts the number of supporting reads based on transcript length. The number of supporting reads does not correlate with the length of full-length transcripts.
|
|
|
|
|
|
<br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Prerequisites** <br>
|
|
|
It is essential to produce a GFF file containing the annotation of identified transcripts and the number of reads that support each transcript variant. If the number of supporting reads per transcript is not available, it can be obtained by aligning the single-molecule long PacBio reads to a FASTA file containing the sequences of unique transcripts using BLASR. <br>
|
|
|
|
|
|
Prior to running BLASR, the FASTA file containing the reference transcript sequences (i.e., reference.fa) needs to be checked for potentially redundant sequences. This process renames transcripts with unique indexes. A map to the original transcript IDs is provided in the reference.fa.nonredundant.id_map.txt file. After the alignment is complete, the number of supporting reads per transcript can be extracted using the `samtools idxstats` function to produce the IsoSeq_MCF7.reads_of_insert.coverage file. <br>
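
As a rough illustration of the last step, the sketch below reads the tab-separated text that `samtools idxstats` prints (reference name, sequence length, mapped reads, unmapped reads) and keeps the mapped-read count per transcript. The function name `parse_idxstats` and the example transcript names are hypothetical; this is not the script used in the study.

```python
import io


def parse_idxstats(handle):
    """Return {transcript_name: mapped_read_count} from `samtools idxstats` text.

    Each input line has four tab-separated columns:
    reference name, sequence length, mapped reads, unmapped reads.
    """
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":
            # Summary row for reads that are not assigned to any reference.
            continue
        counts[name] = int(mapped)
    return counts


# Hypothetical example input, formatted like idxstats output.
example = io.StringIO("t1\t1500\t12\t0\nt2\t900\t3\t0\n*\t0\t0\t5\n")
coverage = parse_idxstats(example)
```

In practice the handle would be an open file holding the saved `samtools idxstats` output rather than an in-memory string.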
|
|
|
<br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Defining Unique Features** <br>
|
... | ... | @@ -83,7 +83,7 @@ In this study, by processing the GFF file that contains the annotation of all id |
|
|
|
|
|
> **Figure 2: Schematic overview of the approach to characterize the interdependencies between mRNA transcription and processing events. A)** Identified full-length reads (reads with RNA inserts between 5’ and 3’ primers) are clustered into unique transcript structures using the ICE algorithm and further polished using the partial reads (reads where one of the primer sequences is missing). The number of unique transcript structures per locus and the distribution of transcript lengths are assessed. **B)** Based on the available transcripts per locus, the available sequence and the unique set of features and splice-sites are identified. The available sequence is the union of all exonic sequences that are observed at each locus. Features are defined as the unique set of transcription start sites (TSS), exons, and polyadenylation sites (PAS). The unique set of splice sites consists of unique donor and acceptor splice-sites as well as all alternative TSSs and PASs. **C)** The survey of coupling events is done by performing all possible pairwise tests between unique features in genes. The sum of the coverage of all transcripts that support the inclusion or exclusion of each pair is used in a contingency table to perform a Fisher’s exact test for statistical significance. The odds ratio (OR) is used to differentiate between mutually inclusive (positive log-transformed OR) and exclusive (negative log-transformed OR) coupling. **D)** Sets of interdependent coupling events are identified based on networks of coupling between features in each gene. Nodes represent features and links depict the mutual inclusivity (black edges) or mutual exclusivity (red edges) of each feature pair. Unique network components can thereby be filtered based on the type of interaction: mutually inclusive or mutually exclusive coupling events, represented as network components. 
**E)** For all alternative exons that show significant linkage, a motif search is performed to assess the enrichment of specific RNA-binding protein motifs. For all alternative exons, the 35bp intronic sequences upstream of the acceptor site are defined as R1 domain (depicted in orange), the 32bp exonic sequences downstream of the acceptor site and upstream of the donor site are defined as R2 domain (depicted in dark grey), and 40bp intronic sequences downstream of the donor site are defined as R3 domain (depicted in purple). The 35bp sequence upstream of each PAS (depicted in red) is searched for the presence of canonical and non-canonical poly(A) signals.
|
|
|
|
|
|
<br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Statistical Analysis** <br>
|
... | ... | @@ -91,7 +91,7 @@ After defining unique features (TSSs, exons and PASs) and identifying the number |
|
|
|
|
|
> **`script:`** The python script for calculating the coupling between feature-pairs and post-processing can be found [**here**](https://git.lumc.nl/s.y.anvar/mRNA-Coupling/ipython_notebook/master/scripts/coupling_statistical_analysis.ipynb). <br>
|
|
|
|
|
|
<br>
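
The pairwise test can be sketched in plain Python as follows. This is a minimal illustration, not the linked notebook: the 2x2 table layout (`a` = summed coverage of transcripts containing both features, `b` and `c` = only one of the two, `d` = neither), the function names, and the 0.5 pseudocount used when a cell is zero are all assumptions made for the example.

```python
from math import comb, log2


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def log_odds_ratio(a, b, c, d, pseudocount=0.5):
    """log2 odds ratio; the pseudocount for zero cells is an assumption."""
    if 0 in (a, b, c, d):
        a, b, c, d = (v + pseudocount for v in (a, b, c, d))
    return log2((a * d) / (b * c))


# Hypothetical coverage sums for one feature pair.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
log_or = log_odds_ratio(8, 2, 1, 5)
coupling = "mutually inclusive" if log_or > 0 else "mutually exclusive"
```

A positive log-transformed odds ratio corresponds to mutually inclusive coupling and a negative one to mutually exclusive coupling. Because all feature pairs in a gene are tested, a multiple-testing correction would normally follow; that step is not shown here.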
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Polyadenylation Sites Sequence Motif Analysis** <br>
|
... | ... | @@ -103,13 +103,13 @@ For PASs that could not be attributed to known polyA signals, we ran DREME (vers |
|
|
dreme -o output -png -eps -v 1 -t 18000 -p input.targets.fasta -n input.background.fasta -k 6 -norc -e 0.05 -m 10
|
|
|
```
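
Before running DREME, the PASs are split into those that can and cannot be attributed to a known polyA signal. The sketch below shows one way to scan the 35bp upstream sequence of a PAS for such signals. The two hexamers listed (`AATAAA` and `ATTAAA`) are only an illustrative subset chosen for this example, and `find_polya_signal` is a hypothetical helper, not part of the published scripts.

```python
# Assumption: an illustrative subset of poly(A) signal hexamers.
KNOWN_SIGNALS = ("AATAAA", "ATTAAA")


def find_polya_signal(upstream_seq, signals=KNOWN_SIGNALS):
    """Return (signal, position) of the signal closest to the PAS, or None.

    `upstream_seq` is the sequence immediately upstream of the PAS, written
    5' to 3', so the rightmost match is the one nearest to the cleavage site.
    """
    seq = upstream_seq.upper().replace("U", "T")
    best = None
    for signal in signals:
        pos = seq.rfind(signal)
        if pos != -1 and (best is None or pos > best[1]):
            best = (signal, pos)
    return best


# Hypothetical 35bp upstream sequences.
with_signal = find_polya_signal("GGGG" + "AATAAA" + "C" * 25)
without_signal = find_polya_signal("C" * 35)
```

Sequences for which the function returns `None` would be the candidates for the target FASTA file passed to DREME.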
|
|
|
|
|
|
<br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Tandem 3' UTR Analysis** <br>
|
|
|
This analysis was performed to identify loci that contain tandem 3' UTRs (loci with multiple PASs located in the same last exon). Custom scripts were used to identify loci that contain at least two PASs sharing the same start coordinate of the last exon. The number of loci with tandem 3' UTRs was calculated separately for loci in which PAS usage was significantly coupled to alternative exons and for loci that did not show any significant interdependencies between alternative exons and PAS usage. <br>
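
The grouping rule described above can be sketched as follows. The input layout, a list of `(locus, last_exon_start, pas)` tuples with one tuple per transcript, and the function name are assumptions made for the example; the custom scripts used in the study are not reproduced here.

```python
from collections import defaultdict


def loci_with_tandem_utrs(transcripts):
    """Return the set of loci with at least two distinct PASs in one last exon.

    `transcripts` is an iterable of (locus, last_exon_start, pas) tuples.
    Two PASs are treated as tandem when their transcripts share the same
    start coordinate of the last exon.
    """
    pas_by_last_exon = defaultdict(set)
    for locus, last_exon_start, pas in transcripts:
        pas_by_last_exon[(locus, last_exon_start)].add(pas)
    return {
        locus
        for (locus, _start), sites in pas_by_last_exon.items()
        if len(sites) >= 2
    }


# Hypothetical example: geneA has two PASs in the same last exon,
# geneB has two PASs but in different last exons.
example = [
    ("geneA", 5000, 5400),
    ("geneA", 5000, 5900),
    ("geneB", 1200, 1500),
    ("geneB", 1800, 2100),
]
tandem_loci = loci_with_tandem_utrs(example)
```

The resulting set can then be intersected with the loci that show significant coupling between alternative exons and PAS usage to obtain the two counts.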
|
|
|
|
|
|
<br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Sequence Motif Analysis Relative to Acceptor and Donor Sites** <br>
|
... | ... | @@ -117,7 +117,7 @@ For each detected gene, we report the first and last nucleotide of each exon as |
|
|
|
|
|
> **`script:`** The python script for extracting dinucleotide sequences of the splice-sites at R1 and R3 domains (i.e., canonical GT and AG motifs) can be found [**here**](https://git.lumc.nl/s.y.anvar/mRNA-Coupling/ipython_notebook/master/scripts/Rdomain_splice_junction_motif.ipynb). <br>
|
|
|
|
|
|
<br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **RNA Binding Motif Analysis** <br>
|
... | ... | @@ -125,19 +125,19 @@ We used MEME suite to identify enriched sequence motifs present in exons signifi |
|
|
|
|
|
We locally ran DREME (version 4.11.4) for each region separately and performed a motif search analysis using a negative background (R1, R2 and R3 domains of exons that were not significantly coupled). We ran DREME without any limitation for the motifs' length (similar to poly(A) site motif analysis). In each case, a maximum of 10 motifs with E-value less than 0.05 was reported. The remaining parameters were kept as default. We then compared each motif found by DREME against the human RNA-binding motifs database CISBP-RNA using TOMTOM Motif Comparison tool. We ran the analysis by setting the Pearson correlation coefficient as comparison function and considered only matches with a minimum false discovery rate (q-value) less than 0.05. <br>
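
For orientation, the R1, R2 and R3 domains of an alternative exon can be derived from its coordinates as in the sketch below. It assumes a plus-strand exon given in 0-based, half-open coordinates and clips the R2 windows for exons shorter than 32bp; these conventions and the function name `r_domains` are assumptions of the example, and minus-strand exons would need the mirrored calculation.

```python
def r_domains(exon_start, exon_end, r1_len=35, r2_len=32, r3_len=40):
    """Return R1, R2 and R3 intervals for a plus-strand exon [exon_start, exon_end).

    R1: intronic sequence upstream of the acceptor site.
    R2: exonic sequence downstream of the acceptor and upstream of the donor.
    R3: intronic sequence downstream of the donor site.
    """
    exon_len = exon_end - exon_start
    window = min(r2_len, exon_len)  # clip for exons shorter than r2_len
    return {
        "R1": (max(0, exon_start - r1_len), exon_start),
        "R2": [(exon_start, exon_start + window), (exon_end - window, exon_end)],
        "R3": (exon_end, exon_end + r3_len),
    }


# Hypothetical exon spanning positions 100 to 200.
domains = r_domains(100, 200)
```

The sequences of these intervals, extracted separately for significantly coupled exons and for the background set, are what would be written to the target and background FASTA files.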
|
|
|
|
|
|
<br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Reported Bugs and Fixes** <br>
|
|
|
So far, we have not received any bug reports! In this section, we will report any future changes to the procedure or the accompanying scripts. Feel free to send in your suggestions and comments for improvement or additional features. <br>
|
|
|
|
|
|
<br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Citation** <br>
|
|
|
SY Anvar, G Allard, E Tseng, et al. (2017) **Full-length mRNA sequencing uncovers a widespread coupling between transcription and mRNA processing.** - *submitted* <br>
|
|
|
|
|
|
<br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Authors Affiliation** <br>
|
... | ... | @@ -199,4 +199,5 @@ Menlo Park, CA 94025, USA <br> |
|
|
**Peter AC 't Hoen** <br>
|
|
|
Leiden University Medical Center <br>
|
|
|
Department of Human Genetics <br>
|
|
|
Leiden, 2300 RC, The Netherlands <br> |
|
|
\ No newline at end of file |
|
|
Leiden, 2300 RC, The Netherlands <br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br> |
|
|
\ No newline at end of file |