Changes

Anvar · 074c30f4
--- a/home.md
+++ b/home.md
@@ -9,9 +9,7 @@ Leiden University Medical Center <br>
 1. [Prerequisites](#prerequisites-)
 1. [Defining Unique Features](#defining-unique-features-)
 1. [Statistical Analysis](#statistical-analysis-)
-1. [Pathway Analysis]()
+1. [Polyadenylation Sites Sequence Motif Analysis](#polyadenylation-sites-sequence-motif-analysis-)
-1. [Annotation of Alternative Exons]()
-1. [Polyadenylation Sites Sequence Motif Analysis]()
 1. [Tandem 3' UTR Analysis]()
 1. [Sequence Motif Analysis Relative to Acceptor and Donor Sites]()
 1. [RNA Binding Motif Analysis]()
@@ -46,6 +44,8 @@ Some statistics from the sequencing and results are listed below. Total number o
 Total number of post-filtered bases: **14,062,161,755** <br>
 For full description of the sample preparation protocol and pre-processing of sequencing data please refer to the official release note on [`MCF-7 dataset`](http://www.pacb.com/blog/data-release-human-mcf-7-transcriptome/). <br>
+Please note that full-length mRNA sequencing data from three human tissues (**brain**, **heart** and **liver**) have also been used for comparative analysis of our findings. For a full description of the sample preparation protocol and pre-processing of these data please refer to the official release note on [`whole transcriptome of three human tissues`](http://www.pacb.com/blog/data-release-whole-human-transcriptome/). <br>
 <br>
 ---
@@ -89,9 +89,22 @@ In this study, by processing the GFF file that contains the annotation of all id
 ## **Statistical Analysis** <br>
 After defining unique features (TSSs, exons and PASs) and identifying the number of supporting reads for transcripts at each locus (Full-length read support, Full-length and Partial read support, or alignment-based read support), all possible pairwise comparisons between features were made. To do this, the sum of all reads that support the presence of the two selected features in all observed transcripts is reported in a two-by-two contingency table (**Figure 2**). The table describes the number of times the two features are observed in the same transcript or exclusively, as well as the sum of reads that are mapped to transcripts that do not support the presence of either feature (**Figure 2**). Significant coupling between two features is assessed using Fisher's exact test. Features that are by definition interdependent (i.e., PAS and the terminal 3' exon, or multiple TSSs of the same gene) were omitted from the analysis to avoid introducing bias. Also, for assessing the interdependency of two features, only transcripts that span the region in which the two features are located are considered; other transcripts cannot be used as evidence of dependency. The mutual inclusivity or exclusivity of coupled features is determined using their log-transformed odds ratio. All p-values are adjusted using Bonferroni multiple testing correction, meaning that the threshold for significance of coupling is set to 1.4e-08 based on the total number of tests performed. <br>
-> **`script:`** The python script for calculating the coupling between feature-pairs and post-processing can be found [**here**](). <br>
+> **`script:`** The python script for calculating the coupling between feature-pairs and post-processing can be found [**here**](https://git.lumc.nl/s.y.anvar/mRNA-Coupling/ipython_notebook/master/scripts/coupling_statistical_analysis.ipynb). <br>
+<br>
+---
+## **Polyadenylation Sites Sequence Motif Analysis** <br>
+For each detected locus, we reported the last nucleotide as the PAS. Each genomic location was converted into BED format. Strand-specific genomic sequences located up to 35 bp upstream of each unique PAS were extracted, in FASTA format, using the UCSC Table Browser (GRCh37/hg19). FASTA files were parsed using a custom bash script to count the number of sequences containing specific 6-mer motifs: one of the two **canonical polyA signals** AATAAA and ATTAAA, or one of the 11 **non-canonical polyA signals** (AAGAAA, AATACA, AATAGA, AATATA, AATGAA, ACTAAA, AGTAAA, CATAAA, GATAAA, TATAAA and TTTAAA). Subsequently, the same 6-mer motifs were counted for each unique PAS significantly coupled to alternative TSSs or alternative exons and for each unique PAS that did not show a significant coupling. <br>
+For PASs that could not be attributed to known polyA signals, we ran DREME (version 4.11.4) to identify enriched motifs (see below). A randomly shuffled set of sequences was generated from the original sequences of the examined PASs and was used as a background set. In addition, the sequences of known recognition motifs for MBNL proteins were counted for each set using a custom script. Subsequently, the enrichment of each motif was assessed by Fisher's exact test. <br>
+```
+dreme -o output -png -eps -v 1 -t 18000 -p input.targets.fasta -n input.background.fasta -k 6 -norc -e 0.05 -m 10
+```
 <br>
 ---
-## **Polyadenylation Sites Sequence Motif Analysis** <br>
+## **Tandem 3' UTR Analysis** <br>
\ No newline at end of file
+This analysis was performed to identify loci that contain tandem 3' UTRs (loci with multiple PASs located in the same last exon). Custom scripts were used to identify loci that contain at least two PASs that share the same start coordinate of the last exon. The number of loci with tandem 3' UTRs was calculated for those in which the PAS was significantly coupled to alternative exons and for those that did not show any significant interdependencies between alternative exons and PAS usage. <br>
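
The tandem 3' UTR criterion described in the added section (at least two PASs sharing the same last-exon start coordinate) can be illustrated with a minimal Python sketch. This is not the custom script used in the study, which is not part of this change. The function name `find_tandem_utr_loci`, the `(locus_id, last_exon_start, pas_position)` record layout, and the example coordinates are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def find_tandem_utr_loci(records):
    """Group PASs by (locus, last-exon start) and keep groups with >= 2 distinct PASs.

    `records` is an iterable of (locus_id, last_exon_start, pas_position) tuples.
    This layout is an assumption for illustration, not the study's file format.
    """
    pas_by_last_exon = defaultdict(set)
    for locus_id, last_exon_start, pas_position in records:
        # A set collapses transcripts that report the same PAS more than once.
        pas_by_last_exon[(locus_id, last_exon_start)].add(pas_position)
    # A tandem 3' UTR needs at least two different PASs in the same last exon.
    return {
        key: sorted(sites)
        for key, sites in pas_by_last_exon.items()
        if len(sites) >= 2
    }


# Hypothetical example records (coordinates are made up).
example = [
    ("locusA", 1000, 1500),
    ("locusA", 1000, 1800),
    ("locusA", 1000, 1800),  # same PAS seen in another transcript
    ("locusB", 5000, 5400),
    ("locusB", 5200, 5600),  # different last exon, so not tandem
]
tandem = find_tandem_utr_loci(example)
```

Here only `locusA` qualifies, because its two distinct PASs share one last-exon start, whereas the two PASs of `locusB` belong to different last exons. Strand handling and tolerance for nearby PAS positions are deliberately left out of this sketch.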