... | ... | @@ -27,7 +27,7 @@ Leiden University Medical Center <br> |
|
|
|
|
|
## **Summary** <br>
|
|
|
The multilayered control of gene expression requires tight coordination of regulatory mechanisms at the transcriptional and post-transcriptional level. Here, we studied the interdependence of transcription initiation, splicing and polyadenylation events on single mRNA molecules by full-length mRNA sequencing. In MCF-7 breast cancer cells and three human tissues, we found an unforeseen number of genes that demonstrate mutually inclusive or exclusive alternative transcription initiation and mRNA processing events, which can span the entire length of mRNA molecules. Furthermore, alternative poly(A) sites that are coupled with alternative splicing events are depleted for known poly(A) signals and enriched for MBNL binding motifs, supporting a dual role of MBNL proteins in regulating splicing and polyadenylation. We predict thousands of open-reading frames from the sequence of full-length mRNAs, allowing for a more sensitive proteogenomics analysis of MCF-7 mass-spectrometry data. Our findings demonstrate that our understanding of transcriptome complexity is far from complete and provides a framework to reveal largely unresolved mechanisms that coordinate transcription initiation and mRNA processing.<br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Materials and Sequencing** <br>
|
... | ... | @@ -52,7 +52,7 @@ Total number of post-filtered reads: **9,198,553** <br> |
|
|
For full description of the sample preparation protocol and pre-processing of sequencing data please refer to the official release note on [MCF-7 dataset](http://www.pacb.com/blog/data-release-human-mcf-7-transcriptome/). <br>
|
|
|
|
|
|
Please note that full-length mRNA sequencing data from three human tissues (**brain**, **heart** and **liver**) have also been used for comparative analysis of our findings. For full description of the sample preparation protocol and pre-processing of sequencing data please refer to the official release note on [whole transcriptome of three human tissues](http://www.pacb.com/blog/data-release-whole-human-transcriptome/). <br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Data Preprocessing and Characteristics** <br>
|
... | ... | @@ -76,30 +76,30 @@ In addition to 7,364 single-exon transcripts, the MCF-7 transcriptome consists o |
|
|
|
|
|
> **Figure 1: Overview of identified transcripts in MCF-7 transcriptome.** Histograms show the distribution of the number of identified transcripts per gene and transcript lengths. Density plot depicts the number of supporting reads based on transcript length. The number of supporting reads does not correlate with the length of full-length transcripts.
|
|
|
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Prerequisites** <br>
|
|
|
It is essential to produce the GFF file containing the annotation of identified transcripts and the number of reads that support each transcript variant. If the number of supporting reads per transcript is not available, this information can be easily produced by simply aligning single-molecule long PacBio reads to FASTA file containing the sequences of unique transcripts using BLASR. <br>
|
|
|
|
|
|
Prior to running BLASR, FASTA file containing the reference transcript sequenecs (i.e., reference.fa) needs to be checked for potential redundant sequences. This process will lead to renaming transcripts by unique indexes. A map to the original transcript ids are provided in the reference.fa.nonredundant.id_map.txt file. After the alignment is complete, the number of supporting reads per transcript can be easily extracted using the `samtools idxstats` function to produce the IsoSeq_MCF7.reads_of_insert.coverage file. <br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Comparison to standard RNA-Seq datasets** <br>
|
|
|
We used five publicly available RNA-Seq datasets ([SRR1035698](http://sra.dnanexus.com/experiments/SRX381535/runs), [SRR1107833](http://sra.dnanexus.com/experiments/SRX426377/runs), [SRR1107834](http://sra.dnanexus.com/experiments/SRX426377/runs), [SRR1107835](http://sra.dnanexus.com/experiments/SRX426377/runs) and [SRR1313067](https://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR1313067); generated on Illumina HiSeq2000 or Illumina HiSeq2500 platforms) to evaluate the reliability of gene expression quantification based on full-length mRNA sequencing data used in this study. As accurate transcription reconstruction is not feasible for short-read RNA-Seq data, the comparison is made at the gene level using GENCODE annotation (version 19). Median gene coverage (fragment counts adjusted for gene length) was used as a measure for gene expression quantifications using the [GENTRAP](http://biopet-docs.readthedocs.io/en/latest/pipelines/gentrap/) pipeline.<br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Comparison to Encode CAGE and RNA-PET datasets** <br>
|
|
|
We used the publicly available MCF-7 cells CAGE (GSM849364) and RNA-PET (GSM1006905) datasets from Encode project to evaluate the reliability of 5’ and 3’-ends of identified transcripts in PacBio data. We used Bedtools to identify base-pair distance between CAGE tags and TSSs found in PacBio data. For RNA-PET, both 5’ and 3’-ends were simultaneously compared to TSSs and PASs identified in PacBio to report the distances for the best matching full-length transcripts. All data processing was done using Bedtools closest function. Since Bedtools may report multiple hits, we have selected the one with the least number of differing nucleotide positions as the best hit.<br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Comparison to the GENCODE annotation** <br>
|
|
|
We used GENCODE annotated transcripts (version 19) as reference to compare with the identified transcripts in the human MCF-7 transcriptome data. The comparison for transcript annotations was carried out using cuffcompare from the Cufflinks suite.
|
|
|
To identify interdependent alternative exons that are already represented in the Gencode annotation, for each exon-exon coupling, we performed the following: 1) all transcripts that encompass both interdependent exons; 2) assessed whether the same splice-site junctions were annotated; 3) counted the number of transcripts that contain both exons, only one of the two exons, or none; 4) relative counts were calculated based on the total number of transcripts that encompassed the two exons. For each dataset, the relative counts of co-occurrence and independent observations were compared for mutually inclusive and mutually exclusive exons separately.<br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Defining Unique Features** <br>
|
... | ... | @@ -109,7 +109,7 @@ In this study, by processing the GFF file that contains the annotation of all id |
|
|
|
|
|
> **Figure 2: Schematic overview of the approach to characterize the interdependencies between mRNA transcription and processing events. A)** Identified full-length reads (reads with RNA inserts between 5’ and 3’ primers) are clustered into unique transcript structures using the ICE algorithm and further polished using the partial reads (reads where one of the primer sequences is missing). The number of unique transcript structures per locus and the distribution of transcript lengths are assessed. **B)** Based on the available transcripts per locus, the available sequence and unique set of features and splice-sites are identified. The available sequence is the union of all exonic sequences that are observed at each locus. Features are defined as unique set of transcription start sites (TSS), exons, and polyadenylation sites (PAS). The unique set of splice sites consists of unique donor and acceptor splice-sites as well as all alternative TSSs and PASs. **C)** The survey of coupling events is done by performing all possible pairwise tests between unique features in genes. The sum of the coverage of all transcripts that support the inclusion or exclusion of each pair is used in a contingency table to perform a Fisher’s exact test for statistical significance. The odds ratio (OR) is used to differentiate between mutually inclusive (positive log-transformed OR) and exclusive (negative log-transformed OR) coupling. **D)** Set of interdependent coupling events were identified based on networks of coupling between features in each gene. Nodes represent features and links depict the mutual inclusivity (black edges) or mutual exclusivity (red edges) of each feature pair. Unique network components can thereby be filtered based on the type of interaction: mutual inclusive or mutual exclusive coupling events, represented as network components. **E)** For all alternative exons that show significant linkage, a motif search is performed to assess the enrichment of specific RNA-binding protein motifs. For all alternative exons, the 35bp intronic sequences upstream of the acceptor site are defined as R1 domain (depicted in orange), the 32bp exonic sequences downstream of the acceptor site and upstream of the donor site are defined as R2 domain (depicted in dark grey), and 40bp intronic sequences downstream of the donor site are defined as R3 domain (depicted in purple). The 35bp sequence upstream of each PAS (depicted in red) is searched for the presence of canonical and non-canonical poly(A) signals.
|
|
|
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Statistical Analysis** <br>
|
... | ... | @@ -158,7 +158,7 @@ After defining unique features (TSSs, exons and PASs) and identifying the number |
|
|
> - flag for Gencode match (i.e., =, c, j)
|
|
|
> - bonferroni adjusted p-values
|
|
|
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Polyadenylation Sites Sequence Motif Analysis** <br>
|
... | ... | @@ -170,13 +170,13 @@ For PASs that could not be attributed to known polyA signals, we ran DREME (vers |
|
|
dreme -o output -png -eps -v 1 -t 18000 -p input.targets.fasta -n input.background.fasta -k 6 -norc -e 0.05 -m 10
|
|
|
```
|
|
|
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Tandem 3' UTR Analysis** <br>
|
|
|
This analysis was performed to identify loci that contain tandem 3' UTRs (loci with multiple PASs located in the same last exon). Custom scripts were used to identify loci that contain at least two PASs that share the same coordinates of the start of the last exon. The number of loci with tandem 3' UTRs was calculated for those in which PAS was significantly coupled to alternative exons and for those that did not show any significant interdepenedncies between alternative exons and the PAS usage. <br>
|
|
|
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Sequence Motif Analysis Relative to Acceptor and Donor Sites** <br>
|
... | ... | @@ -194,7 +194,7 @@ For each detected gene, we report the first and last nucleotide of each exon as |
|
|
> - R1 splicing motif (acceptor)
|
|
|
> - R3 splicing motif (donor)
|
|
|
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **RNA Binding Motif Analysis** <br>
|
... | ... | @@ -202,13 +202,13 @@ We used MEME suite to identify enriched sequence motifs present in exons signifi |
|
|
|
|
|
We locally ran DREME (version 4.11.4) for each region separately and performed a motif search analysis using a negative background (R1, R2 and R3 domains of exons that were not significantly coupled). We ran DREME without any limitation for the motifs' length (similar to poly(A) site motif analysis). In each case, a maximum of 10 motifs with E-value less than 0.05 was reported. The remaining parameters were kept as default. We then compared each motif found by DREME against the human RNA-binding motifs database CISBP-RNA using TOMTOM Motif Comparison tool. We ran the analysis by setting the Pearson correlation coefficient as comparison function and considered only matches with a minimum false discovery rate (q-value) less than 0.05. <br>
|
|
|
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Open Reading Frame Prediction and Proteomics Data Analysis** <br>
|
|
|
ORF prediction was done on the PacBio MCF-7 sequences using [ANGEL](http://www.github.com/Magdoll/ANGEL). Prediction was done on both the PacBio consensus reads and a genome-corrected version of the transcript, and whichever produced the longer ORF was chosen to represent the transcript CDS. The predicted MCF-7 ORF sequences were concatenated with Gencode version 19 and protein sequences representing common mass spectrometry (MS) contaminants, creating a customized FASTA file (i.e., proteomics search database). The [Morpheus](http://cwenger.github.io/Morpheus/) software (version 131) was employed for MS searching of the custom database against the MCF-7 Thermo Raw files obtained from [Geiger et al.](http://www.mcponline.org/content/11/3/M111.014050.long) study. Unknown precursor charge state range was set to +2 to +4. Absolute and relative MS/MS intensity thresholds were disabled. Maximum number of MS/MS peaks were set to 400. Assign charge state was set to true. De-isotoping was disabled. The protease specificity was set to trypsin with no proline rule enabled. Up to 1 missed cleavage was allowed and N-terminal methionine truncations was variable. Fixed modifications used were carbamidomethylation of cysteines. Variable modification used was oxidation of methionines. Precursor mass tolerance used was 2.1 Daltons (monoisotopic) and product mass tolerance was 0.025 Daltons (monoisotopic). Modified forms of the same peptide were collapsed and treated as one peptide identification for calculation of false discovery rate (FDR). An FDR of 1% was used to filter for final peptide identifications. All identified peptides were categorized as: single-transcript if the peptide matches to only one gene with one transcript; sub-gene if the peptide matches to a subset of transcripts of only one gene; single-gene if the peptide matches to all transcripts of only one gene; and multi-gene if the peptide matches to multiple transcripts from multiple genes.<br>
|
|
|
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **RNA-FISH and Sanger Validations** <br>
|
... | ... | @@ -259,19 +259,19 @@ Stellaris RNA FISH was performed as previously described for methanol/acetic aci |
|
|
DAPI-stained nuclei, fluorescein (FAM), Quasar 570 (Q570), CalFluor 610 (CF610) and Quasar 670 (Q670) dyes were imaged through a 60X 1.4NA oil-immersion lens on a Nikon TI widefield microscope using the appropriate filters: 49000-ET-DAPI, 49011-ET-FITC, SP102v1, SP103v2, 49022-ET-Cy5.5 respectively. The exposure and sequence of channels to acquire were determined based on the brightness and photostability of the dye with which each probe set was labeled. The sequence of exposure was Q670, followed by FAM, and then either or both Q570, CF610. Each Z-slice was exposed for 1 s, except for Q670 which required 2 s exposures. For each field of view, a range spanning the vertical dimension of the cell (typically 10 um) is defined and for each channel, a series of images were acquired through this span at 0.3 µm increments by using Nikon Elements' Advanced Research software.
|
|
|
Each Z-series was collapsed and rendered as a single, max-intensity projected image. Translational registration to align images shifted relative to another was accomplished by ImageJ macros after identification of a region containing overlapping signals in each channel. Peak positions of these signals were determined relative to each other to inform the shift of each channel. Next, spots and their centroid positions were identified in each channel using the ImageJ Find Maxima utility. These positions were then compared against one another and co-localized spots were grouped if within 3 pixels (330 nm). Based on these groupings, spots were categorized into separate transcript variants and displayed on the image for review. Finally, cell borders were defined and spots associated with distinct cells for per-cell and per-transcript variant copy number determination. RNA FISH features were counted in at least ten cells. <br>
|
|
|
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Reported Bugs and Fixes** <br>
|
|
|
So far, we have not received any bug reports! In this section, we will report any future changes to the procedure or the accompanied scripts. Feel free to send in your suggestions and comments for improvement or additional features. <br>
|
|
|
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Citation** <br>
|
|
|
SY Anvar, G Allard, E Tseng, et al. (2017) **Full-length mRNA sequencing uncovers a widespread coupling between transcription and mRNA proecssing.** - *submitted* <br>
|
|
|
SY Anvar, G Allard, E Tseng, et al. (2017) **Full-length mRNA sequencing uncovers a widespread coupling between transcription initiation and mRNA proecssing.** - *submitted* <br>
|
|
|
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Authors Affiliation** <br>
|
... | ... | @@ -339,4 +339,4 @@ Menlo Park, CA 94025, USA <br> |
|
|
Leiden University Medical Center <br>
|
|
|
Department of Human Genetics <br>
|
|
|
Leiden, 2300 RC, The Netherlands <br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br> |
|
|
\ No newline at end of file |
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-initiation-and-mrna-processing)</p><br> |
|
|
\ No newline at end of file |