Changes

Anvar · 1e2d6045
--- a/home.md
+++ b/home.md
-# **Full-length mRNA Sequencing Uncovers a Widespread Coupling between Transcription and mRNA Processing**
+# **Full-length mRNA Sequencing Uncovers a Widespread Coupling between Transcription Initiation and mRNA Processing**
 By: Yahya Anvar, PhD [✍](mailto:s.y.anvar@lumc.nl) <br>
 Department of Human Genetics <br>
 Leiden University Medical Center <br>
@@ -8,6 +8,8 @@ Leiden University Medical Center <br>
 1. [Data Preprocessing and Characteristics](#data-preprocessing-and-characteristics-)
 1. [Prerequisites](#prerequisites-)
 1. [Comparison to standard RNA-Seq datasets](#comparison-to-standard-rna-seq-datasets-)
+1. [Comparison to Encode CAGE and RNA-PET datasets](#comparison-to-encode-cage-and-rna-pet-datasets)
+1. [Comparison to the GENCODE annotation](#comparison-to-the-gencode-annotation)
 1. [Defining Unique Features](#defining-unique-features-)
 1. [Statistical Analysis](#statistical-analysis-)
 1. [Polyadenylation Sites Sequence Motif Analysis](#polyadenylation-sites-sequence-motif-analysis-)
@@ -24,7 +26,7 @@ Leiden University Medical Center <br>
 <br>

 ## **Summary** <br>
-The multilayered control of gene expression requires tight coordination of regulatory mechanisms at the transcriptional and post-transcriptional level. Here, we studied the interdependence of transcription, splicing and polyadenylation events on single mRNA molecules by full-length mRNA sequencing. In MCF-7 breast cancer cells and three human tissues, we found an unforeseen number of genes that demonstrate mutually inclusive or exclusive alternative transcription and mRNA processing events, which can span the entire length of mRNA molecules. Furthermore, alternative poly(A) sites that are coupled with alternative splicing events are depleted for known poly(A) signals and enriched for MBNL binding motifs, supporting a dual role of MBNL proteins in regulating splicing and polyadenylation. We predict thousands of open-reading frames from the sequence of full-length mRNAs, allowing for a more sensitive proteogenomics analysis of MCF-7 mass-spectrometry data. Our findings demonstrate that our understanding of transcriptome complexity is far from complete and provides a framework to reveal largely unresolved mechanisms that coordinate transcription and mRNA processing.<br>
+The multilayered control of gene expression requires tight coordination of regulatory mechanisms at the transcriptional and post-transcriptional level. Here, we studied the interdependence of transcription initiation, splicing and polyadenylation events on single mRNA molecules by full-length mRNA sequencing. In MCF-7 breast cancer cells and three human tissues, we found an unforeseen number of genes that demonstrate mutually inclusive or exclusive alternative transcription initiation and mRNA processing events, which can span the entire length of mRNA molecules. Furthermore, alternative poly(A) sites that are coupled with alternative splicing events are depleted for known poly(A) signals and enriched for MBNL binding motifs, supporting a dual role of MBNL proteins in regulating splicing and polyadenylation. We predict thousands of open-reading frames from the sequence of full-length mRNAs, allowing for a more sensitive proteogenomics analysis of MCF-7 mass-spectrometry data. Our findings demonstrate that our understanding of transcriptome complexity is far from complete and provides a framework to reveal largely unresolved mechanisms that coordinate transcription initiation and mRNA processing.<br>
 <p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>

 ---
@@ -54,7 +56,7 @@ Please note that full-length mRNA sequencing data from three human tissues (**br

 ---
 ## **Data Preprocessing and Characteristics** <br>
-Post sequencing, transcript structures were defined by applying the isoform-level clustering algorithm ([ICE](https://github.com/PacificBiosciences/cDNA_primer/wiki)) on full-length reads, capturing the entire mRNA molecule (containing 5', polyA-tail and 3' primer sequences). To find transcript clusters, ICE performs a pairwise alignment and reiterative assignment of full-length reads to clusters based on likelihood. This process is followed by consensus calling and further polishing of the sequence to reduce redundancy and increase the overall accuracy of sequences for identified transcript variants. Our analysis pipeline could precisely determine the position of polyadenylation site (presence of polyA-tail) and intro-exon boundaries, as evident from the presence of the canonical **GU** motif in 93% of donor splice sites and the canonical **AG** motif in 95% of acceptor splice sites. <br>
+Post sequencing, transcript structures were defined by applying the isoform-level clustering algorithm ([ICE](https://github.com/PacificBiosciences/cDNA_primer/wiki)) on full-length reads, capturing the entire mRNA molecule (containing 5', polyA-tail and 3' primer sequences). To find transcript clusters, ICE performs a pairwise alignment and reiterative assignment of full-length reads to clusters based on likelihood. This process is followed by consensus calling and further polishing. Based on the Quiver algorithm’s predicted consensus accuracy, transcript sequences that had a predicted accuracy of over 99% (excluding QVs from the first 100bp and last 30bp due to occasionally insufficient coverage for accurate estimation of accuracies) were considered high-quality (HQ) transcripts and used for further analysis. The HQ transcript sequences were mapped back to the human genome (hg19) and filtered for over 99% alignment coverage and over 85% alignment identity. Of 280,051 HQ transcripts in MCF-7, 9 didn’t map to hg19, 13,543 were filtered due to low coverage and 14 were filtered due to low identity. Redundant transcripts were then collapsed to create a final dataset used in this study. Our analysis pipeline could precisely determine the position of polyadenylation site (presence of polyA-tail) and intro-exon boundaries, as evident from the presence of the canonical **GU** motif in 93% of donor splice sites and the canonical **AG** motif in 95% of acceptor splice sites. <br>

 Clustering of the transcripts into genes was achieved using the following criteria: For each genomic region containing transcripts, we create an empty graph in which each transcript is represented as a node. If two transcripts are on the same strand and share at least 2 exons, an edge is added to link the two transcripts. For each component of the resulting network, a unique gene id is generated and assigned to the transcripts using the same scheme as the original data. Transcript ids are derived from the gene ids, with an index 1 to the total number of transcripts assigned to each gene. A lookup table is generated for the mapping of the new and old ids. <br>

@@ -88,6 +90,17 @@ Prior to running BLASR, FASTA file containing the reference transcript sequenecs
 We used five publicly available RNA-Seq datasets ([SRR1035698](http://sra.dnanexus.com/experiments/SRX381535/runs), [SRR1107833](http://sra.dnanexus.com/experiments/SRX426377/runs), [SRR1107834](http://sra.dnanexus.com/experiments/SRX426377/runs), [SRR1107835](http://sra.dnanexus.com/experiments/SRX426377/runs) and [SRR1313067](https://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR1313067); generated on Illumina HiSeq2000 or Illumina HiSeq2500 platforms) to evaluate the reliability of gene expression quantification based on full-length mRNA sequencing data used in this study. As accurate transcription reconstruction is not feasible for short-read RNA-Seq data, the comparison is made at the gene level using GENCODE annotation (version 19). Median gene coverage (fragment counts adjusted for gene length) was used as a measure for gene expression quantifications using the [GENTRAP](http://biopet-docs.readthedocs.io/en/latest/pipelines/gentrap/) pipeline.<br>
 <p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>

+---
+## **Comparison to Encode CAGE and RNA-PET datasets** <br>
+We used the publicly available MCF-7 cells CAGE (GSM849364) and RNA-PET (GSM1006905) datasets from Encode project to evaluate the reliability of 5’ and 3’-ends of identified transcripts in PacBio data. We used Bedtools to identify base-pair distance between CAGE tags and TSSs found in PacBio data. For RNA-PET, both 5’ and 3’-ends were simultaneously compared to TSSs and PASs identified in PacBio to report the distances for the best matching full-length transcripts. All data processing was done using Bedtools closest function. Since Bedtools may report multiple hits, we have selected the one with the least number of differing nucleotide positions as the best hit.<br>
+<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
+
+---
+## **Comparison to the GENCODE annotation** <br>
+We used GENCODE annotated transcripts (version 19) as reference to compare with the identified transcripts in the human MCF-7 transcriptome data. The comparison for transcript annotations was carried out using cuffcompare from the Cufflinks suite.
+To identify interdependent alternative exons that are already represented in the Gencode annotation, for each exon-exon coupling, we performed the following: 1) all transcripts that encompass both interdependent exons; 2) assessed whether the same splice-site junctions were annotated; 3) counted the number of transcripts that contain both exons, only one of the two exons, or none; 4) relative counts were calculated based on the total number of transcripts that encompassed the two exons. For each dataset, the relative counts of co-occurrence and independent observations were compared for mutually inclusive and mutually exclusive exons separately.<br>
+<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
+
 ---
 ## **Defining Unique Features** <br>
 In this study, by processing the GFF file that contains the annotation of all identified transcripts and exon-intron boundaries (defined by the genomic position and strand on the HG19 reference sequence), a list of all transcription and mRNA processing events is produced (**Figure 2**). Transcription start sites (TSSs) are defined as the first genomic position of each transcript structure. Polyadenylation sites (PASs) are defined as the last genomic position of each transcript. The most upstream and downstream position of exons were used to define donor and acceptor splice sites, respectively. However, for the first exon only the donor site is described as the first position is defined as transcription start site. Likewise, the last exon does not contain a donor splice site as the position is defined as polyadenylation site. If multiple transcripts share the same feature, then only one copy is kept in the unique set of features at each locus. Furthermore, the union of all unique exons is defined as the available sequence at each locus. <br>
@@ -100,7 +113,7 @@ In this study, by processing the GFF file that contains the annotation of all id

 ---
 ## **Statistical Analysis** <br>
-After defining unique features (TSSs, exons and PASs) and identifying the number of supporting reads for transcripts at each locus (Full-length read support, Full-length and Partial read support, or alignment-based read support), all possible pairwise comparisons between features were made. To do this, the sum of all reads that support the presence of the two selected features in all observed transcripts is reported in a two-by-two contingency table (**Figure 2**). The table describes the number of times two features are observed in the same transcript or exclusively, as well as the sum of reads that are mapped to transcripts that do not support the presence of either features (**Figure 2**). A significant coupling between two features is assessed using the Fisher's exact test. Features that are by definition interdependent (i.e., PAS and the terminal 3' exon, or multiple TSSs of the same gene) were omitted from the analysis to avoid introducing any bias. Also, for assessing the interdependency of two features, only transcripts that expand the region that the two features are located are considered and the other transcripts can't be used as evidence for considering dependencies. The mutual inclusivity or exclusivity of coupled features are determined using their log-transformed odd ratio. All p-values are adjusted using Bonferroni multiple testing correction, meaning that the threshold for significance of coupling is set to 1.4e-08 based based on the total number of tests performed. <br>
+After defining unique features (TSSs, exons and PASs) and identifying the number of supporting reads for transcripts at each locus (Full-length read support, Full-length and Partial read support, or alignment-based read support), all possible pairwise comparisons between features were made. To do this, a two-by-two contingency table is constructed based on the following criteria: 1) two features should not partially or fully overlap; 2) dependency between alternative TSSs and features that are located in their upstream region are omitted as their exclusivity is given. The same rule is applied to downstream of alternative PASs; 3) only transcripts that fully encompass the region that is represented by two features are used to populated the two-by-two contingency table. This is to ensure that all counts are based on direct observation of their mutual inclusion/exclusion in identified transcripts. 4) single-exon transcripts are excluded from the analysis; 5) only multi-transcript loci are examined for possible interdependencies between alternative transcript initiation, alternative splicing and alternative polyadenylation. The table describes the number of times two features are observed in the same transcript or exclusively, as well as the sum of reads that are mapped to transcripts that do not support the presence of either features (**Figure 2**). A significant coupling between two features is assessed using the Fisher's exact test. Features that are by definition interdependent (i.e., PAS and the terminal 3' exon, or multiple TSSs of the same gene) were omitted from the analysis to avoid introducing any bias. Also, for assessing the interdependency of two features, only transcripts that expand the region that the two features are located are considered and the other transcripts can't be used as evidence for considering dependencies. The mutual inclusivity or exclusivity of coupled features are determined using their log-transformed odd ratio. All p-values are adjusted using Bonferroni multiple testing correction, meaning that the threshold for significance of coupling is set to 1.4e-08 based based on the total number of tests performed. <br>

 > **`script:`** The python script for calculating the coupling between feature-pairs and post-processing can be found [**here**](https://git.lumc.nl/s.y.anvar/mRNA-Coupling/ipython_notebook/master/scripts/coupling_statistical_analysis.ipynb). <br> <br>
 > **input** <br> [:arrow_down:](https://barmsijs.lumc.nl/RNAcoupling/) GFF file, curated for terminal positions as well as filtered for single-exon genes as no statistical test can be performed. <br> [:arrow_down:](https://barmsijs.lumc.nl/RNAcoupling/compare.mcf7plus2016.gencode.MCF7_2015.ANGEL_FINAL_nominus_noduplicate.counts.new_id.curated.noCDS.no_single_exon.gff.tmap) TMAP file, generated by cuffcompare to provide gencode matching gene annotation. <br> <br>