Changes

Anvar · 581106d9
--- a/home.md
+++ b/home.md
@@ -7,8 +7,8 @@ Leiden University Medical Center <br>
 1. [Materials and Sequencing](#materials-and-sequencing-)
 1. [Data Preprocessing and Characteristics](#data-preprocessing-and-characteristics-)
 1. [Prerequisites](#prerequisites-)
-1. [Defining Unique Features]()
-1. [Statistical Analysis]()
+1. [Defining Unique Features](#defining-unique-features-)
+1. [Statistical Analysis](#statistical-analysis-)
 1. [Pathway Analysis]()
 1. [Annotation of Alternative Exons]()
 1. [Polyadenylation Sites Sequence Motif Analysis]()
@@ -64,16 +64,34 @@ In addition to 7,364 single-exon transcripts, the MCF-7 transcriptome consists o
 <br>
 ![SY_Anvar_-_Figure_S1_-_Histogram_of_Length_and_Counts_per_Gene](/uploads/621961f894a9d5fa43819c0421600040/SY_Anvar_-_Figure_S1_-_Histogram_of_Length_and_Counts_per_Gene.png)

-> **Figure 1:** Overview of identified transcripts in MCF-7 transcriptome. Histograms show the distribution of the number of identified transcripts per gene and transcript lengths. Density plot depicts the number of supporting reads based on transcript length. The number of supporting reads does not correlate with the length of full-length transcripts.
-
+> **Figure 1: Overview of identified transcripts in the MCF-7 transcriptome.** Histograms show the distribution of the number of identified transcripts per gene and of transcript lengths. The density plot depicts the number of supporting reads as a function of transcript length. The number of supporting reads does not correlate with the length of full-length transcripts.

 <br>

 ---
 ## **Prerequisites** <br>
+It is essential to produce the GFF file containing the annotation of identified transcripts and the number of reads that support each transcript variant. If the number of supporting reads per transcript is not available, it can be obtained by aligning the single-molecule long PacBio reads to a FASTA file containing the sequences of the unique transcripts using BLASR. <br>
+
+Prior to running BLASR, the FASTA file containing the reference transcript sequences (i.e., reference.fa) needs to be checked for potentially redundant sequences. This process renames the transcripts with unique indexes. A map to the original transcript ids is provided in the reference.fa.nonredundant.id_map.txt file. After the alignment is complete, the number of supporting reads per transcript can be extracted using the `samtools idxstats` command to produce the IsoSeq_MCF7.reads_of_insert.coverage file. <br>
+<br>
+
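As a minimal sketch of the last step above, the tab-delimited output of `samtools idxstats` (reference name, sequence length, mapped read-segments, unmapped read-segments) can be turned into a per-transcript read count as follows. The function name `read_idxstats` and the example transcript names are hypothetical, and whether the mapped count equals the number of distinct supporting reads depends on the alignment settings, which are not specified here.

```python
import csv
import io


def read_idxstats(handle):
    """Return {reference_name: mapped_count} from `samtools idxstats` output."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip malformed lines and the trailing "*" row (unplaced reads).
        if len(row) < 4 or row[0] == "*":
            continue
        name, _length, mapped, _unmapped = row[:4]
        counts[name] = int(mapped)
    return counts


# Hypothetical idxstats output for two transcripts (names are placeholders).
example = io.StringIO("t1\t1500\t42\t0\nt2\t900\t7\t0\n*\t0\t0\t3\n")
coverage = read_idxstats(example)
print(coverage)
```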
+---
+## **Defining Unique Features** <br>
+In this study, by processing the GFF file that contains the annotation of all identified transcripts and exon-intron boundaries (defined by the genomic position and strand on the HG19 reference sequence), a list of all transcription and mRNA processing events is produced (**Figure 2**). Transcription start sites (TSSs) are defined as the first genomic position of each transcript structure. Polyadenylation sites (PASs) are defined as the last genomic position of each transcript. The most upstream and downstream positions of exons were used to define acceptor and donor splice sites, respectively. However, for the first exon only the donor site is described, as its first position is defined as the transcription start site. Likewise, the last exon does not contain a donor splice site, as its last position is defined as the polyadenylation site. If multiple transcripts share the same feature, then only one copy is kept in the unique set of features at each locus. Furthermore, the union of all unique exons is defined as the available sequence at each locus. <br>
+<br>
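The feature definitions above can be illustrated with a short sketch. It is not the script used in the study: the transcript representation (a strand plus a list of `(start, end)` exon intervals), the function names, and the coordinates are all hypothetical. It assumes the usual convention that the acceptor is the 5' boundary of an exon and the donor its 3' boundary, and that "first" and "last" positions are read in the direction of transcription.

```python
def transcript_features(strand, exons):
    """Return the set of (kind, position) features of one transcript."""
    ordered = sorted(exons, reverse=(strand == "-"))  # 5'->3' exon order

    def five_prime(exon):
        return exon[0] if strand == "+" else exon[1]

    def three_prime(exon):
        return exon[1] if strand == "+" else exon[0]

    features = {("TSS", five_prime(ordered[0])),
                ("PAS", three_prime(ordered[-1]))}
    for i, exon in enumerate(ordered):
        features.add(("exon", exon))
        if i > 0:  # the first exon starts at the TSS, so it has no acceptor
            features.add(("acceptor", five_prime(exon)))
        if i < len(ordered) - 1:  # the last exon ends at the PAS: no donor
            features.add(("donor", three_prime(exon)))
    return features


def merge_intervals(intervals):
    """Union of exon intervals, i.e. the available sequence at a locus."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


# Hypothetical locus with two transcripts on the plus strand.
locus = [
    ("+", [(100, 200), (300, 400), (500, 600)]),
    ("+", [(100, 200), (500, 650)]),
]
# Shared features collapse to a single copy in the set union.
unique = set().union(*(transcript_features(s, e) for s, e in locus))
available = merge_intervals([exon for _, exons in locus for exon in exons])
print(len(unique), available)
```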
+![SY_Anvar_-_Figure_1](/uploads/21a0a621ea6148be52d22236345f86d4/SY_Anvar_-_Figure_1.png)

-This is the script used for this task: <br>
-```python
-for i in range(0,10):
-    print 'hellow'
-```
\ No newline at end of file
+> **Figure 2: Schematic overview of the approach to characterize the interdependencies between mRNA transcription and processing events. A)** Identified full-length reads (reads with RNA inserts between 5’ and 3’ primers) are clustered into unique transcript structures using the ICE algorithm and further polished using the partial reads (reads where one of the primer sequences is missing). The number of unique transcript structures per locus and the distribution of transcript lengths are assessed. **B)** Based on the available transcripts per locus, the available sequence and the unique sets of features and splice sites are identified. The available sequence is the union of all exonic sequences that are observed at each locus. Features are defined as the unique set of transcription start sites (TSS), exons, and polyadenylation sites (PAS). The unique set of splice sites consists of unique donor and acceptor splice sites as well as all alternative TSSs and PASs. **C)** The survey of coupling events is done by performing all possible pairwise tests between unique features in genes. The sum of the coverage of all transcripts that support the inclusion or exclusion of each pair is used in a contingency table to perform a Fisher’s exact test for statistical significance. The odds ratio (OR) is used to differentiate between mutually inclusive (positive log-transformed OR) and mutually exclusive (negative log-transformed OR) coupling. **D)** Sets of interdependent coupling events are identified based on networks of coupling between features in each gene. Nodes represent features and links depict the mutual inclusivity (black edges) or mutual exclusivity (red edges) of each feature pair. Unique network components can thereby be filtered based on the type of interaction: mutually inclusive or mutually exclusive coupling events.
**E)** For all alternative exons that show significant linkage, a motif search is performed to assess the enrichment of specific RNA-binding protein motifs. For all alternative exons, the 35bp intronic sequences upstream of the acceptor site are defined as R1 domain (depicted in orange), the 32bp exonic sequences downstream of the acceptor site and upstream of the donor site are defined as R2 domain (depicted in dark grey), and 40bp intronic sequences downstream of the donor site are defined as R3 domain (depicted in purple). The 35bp sequence upstream of each PAS (depicted in red) is searched for the presence of canonical and non-canonical poly(A) signals.
+
+<br>
+
+---
+## **Statistical Analysis** <br>
+After defining unique features (TSSs, exons and PASs) and identifying the number of supporting reads for transcripts at each locus (full-length read support, full-length and partial read support, or alignment-based read support), all possible pairwise comparisons between features were made. To do this, the sum of all reads that support the presence of the two selected features in all observed transcripts is reported in a two-by-two contingency table (**Figure 2**). The table describes the number of times the two features are observed together in the same transcript or each without the other, as well as the sum of reads that are mapped to transcripts that support the presence of neither feature (**Figure 2**). The significance of coupling between two features is assessed using Fisher's exact test. Features that are by definition interdependent (i.e., a PAS and the terminal 3' exon, or multiple TSSs of the same gene) were omitted from the analysis to avoid introducing bias. Also, when assessing the interdependency of two features, only transcripts that span the region in which both features are located are considered; other transcripts cannot be used as evidence of dependency. The mutual inclusivity or exclusivity of coupled features is determined using their log-transformed odds ratio. All p-values are adjusted using the Bonferroni multiple testing correction, meaning that the threshold for significance of coupling is set to 1.4e-08 based on the total number of tests performed. <br>
+
+> **`script:`** The Python script for calculating the coupling between feature pairs and for post-processing can be found [**here**](). <br>
+
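The pairwise test described in this section can be sketched with the standard library alone. This is an illustration under stated assumptions, not the script linked above: the transcript representation (a set of feature labels plus a read count), the feature labels, and the read counts are hypothetical; the transcripts are assumed to be pre-filtered to those spanning both features; the 0.5 continuity correction in the odds ratio and the default `alpha=0.05` are assumptions not stated in the text.

```python
from math import exp, lgamma, log


def contingency(transcripts, f1, f2):
    """Sum read counts into (both, only f1, only f2, neither)."""
    table = [0, 0, 0, 0]
    for features, reads in transcripts:
        index = (0 if f1 in features else 2) + (0 if f2 in features else 1)
        table[index] += reads
    return tuple(table)


def _log_comb(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def log_p(x):  # hypergeometric log-probability of x in the top-left cell
        return (_log_comb(row1, x) + _log_comb(row2, col1 - x)
                - _log_comb(n, col1))

    observed = log_p(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    # Sum the probabilities of all tables at most as likely as the observed one.
    p = sum(exp(log_p(x)) for x in support if log_p(x) <= observed + 1e-9)
    return min(1.0, p)


def log_odds_ratio(a, b, c, d):
    """Log OR with a 0.5 continuity correction (an assumption of this sketch)."""
    return log((a + 0.5) * (d + 0.5) / ((b + 0.5) * (c + 0.5)))


def is_significant(p_value, n_tests, alpha=0.05):
    """Bonferroni: compare p against alpha / n_tests (alpha is assumed)."""
    return p_value < alpha / n_tests


# Hypothetical transcripts at one locus: (features present, supporting reads).
transcripts = [({"E2", "E5"}, 40), ({"E2"}, 2), ({"E5"}, 3), (set(), 55)]
table = contingency(transcripts, "E2", "E5")
p_value = fisher_exact_two_sided(*table)
log_or = log_odds_ratio(*table)
print(table, p_value, log_or)
```

A positive `log_or` indicates mutually inclusive coupling and a negative value mutually exclusive coupling. Working with log-probabilities via `lgamma` keeps the test numerically stable for large read counts.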
+<br>
+
+---
+## **Polyadenylation Sites Sequence Motif Analysis** <br>
\ No newline at end of file