Changes

Anvar · 7bcf33f6
--- a/home.md
+++ b/home.md
@@ -60,11 +60,13 @@ Clustering of the transcripts into genes was achieved using the following criter

 > **`script:`** The python script for curating gene models and transcript ids is provided [**here**](https://git.lumc.nl/wgallard/IsoseqProcessing/ipython_notebook/master/scripts/mcf7plus_processing.ipynb#5\)-Transcripts-were-clustered-into-'genes'). <br> <br>
 > **input** <br> [:arrow_down:]() GFF file, generated after running ANGEL to predict open-reading frames. <br> <br>
-> **output** <br> [:arrow_up:]() GTF file processed for regrouping of transcripts into gene clusters. <br> [:arrow_up:]() map of old and new gene and transcript ids. <br>
+> **output** <br> [:arrow_up:]() GFF file processed for regrouping of transcripts into gene clusters. <br> [:arrow_up:]() map of old and new gene and transcript ids. <br>

 For each gene, the positions of TSSs and PASs are subject to noise. To select the most likely alternative TSSs and PASs in a given gene, the terminal positions of transcripts at each gene locus have been curated based on the following criteria: for each gene, transcript starts and ends that fall within the specified distance threshold (10bp in this case) are grouped, and the terminal position with the highest number of supporting reads is used to curate the transcript start and end. <br>

-> **`script:`** The python script for curating the terminal positions of transcripts at each gene locus is provided [**here**](https://git.lumc.nl/wgallard/IsoseqProcessing/ipython_notebook/master/scripts/mcf7plus_processing.ipynb#6\)-Terminal-feature-coordinates-were-curated--using-a-+/--10bp-window). <br>
+> **`script:`** The python script for curating the terminal positions of transcripts at each gene locus is provided [**here**](https://git.lumc.nl/wgallard/IsoseqProcessing/ipython_notebook/master/scripts/mcf7plus_processing.ipynb#6\)-Terminal-feature-coordinates-were-curated--using-a-+/--10bp-window). <br> <br>
+> **input** <br> [:arrow_down:]() GFF file, pre-processed for regrouping transcripts into gene clusters. <br> <br>
+> **output** <br> [:arrow_up:]() GFF file curated for terminal positions to reduce stochasticity and overrepresentation of TSSs and PASs per gene. Please note that a threshold (default is 10bp either side of each terminal position) needs to be set for the window for polishing. <br>
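
The curation rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the linked notebook: the function names (`curate_termini`, `_assign`), the input layout (a dict of position to read count for one gene and one terminus type), the grouping strategy (chaining sorted positions whose neighbouring gap is at most the window) and the tie-breaking rule are all assumptions made for this example.

```python
def _assign(group, support, curated):
    # Representative = position with the most supporting reads.
    # Ties go to the smallest coordinate (an arbitrary choice for this sketch).
    best = max(group, key=lambda p: (support[p], -p))
    for p in group:
        curated[p] = best


def curate_termini(support, window=10):
    """Map each observed terminal position to its curated position.

    support: dict {position: number of supporting reads} for one gene and
             one terminus type (transcript start or transcript end).
    window:  maximum gap in bp between neighbouring positions of one group
             (hypothetical parameter name; 10bp mirrors the text above).
    """
    curated = {}
    group = []
    for pos in sorted(support):
        # Close the current group when the gap to the previous position
        # exceeds the window, then start a new one.
        if group and pos - group[-1] > window:
            _assign(group, support, curated)
            group = []
        group.append(pos)
    if group:
        _assign(group, support, curated)
    return curated


# Made-up example: two clusters of transcript start positions.
example = {100: 3, 104: 12, 109: 1, 250: 5, 258: 5}
print(curate_termini(example))
```

Chaining neighbouring positions can merge termini that are individually more than 10bp away from the chosen representative; a stricter variant would measure the distance to the best-supported position instead.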

 In addition to 7,364 single-exon transcripts, the MCF-7 transcriptome consists of 11,350 multi-exon genes of which 69% produced multiple transcript structures (**Figure 1**; [GFF](); [FASTA]()). Furthermore, 49% of identified transcripts in MCF-7 cells are potentially novel in comparison with the [GENCODE](http://www.gencodegenes.org/) annotation (version 19). The comparison was carried out using `cuffcompare` from the Cufflinks suite. <br>
 <br>
@@ -100,7 +102,48 @@ In this study, by processing the GFF file that contains the annotation of all id
 ## **Statistical Analysis** <br>
 After defining unique features (TSSs, exons and PASs) and identifying the number of supporting reads for transcripts at each locus (Full-length read support, Full-length and Partial read support, or alignment-based read support), all possible pairwise comparisons between features were made. To do this, the sum of all reads that support the presence of the two selected features in all observed transcripts is reported in a two-by-two contingency table (**Figure 2**). The table describes the number of times two features are observed in the same transcript or exclusively, as well as the sum of reads that are mapped to transcripts that do not support the presence of either feature (**Figure 2**). Significant coupling between two features is assessed using Fisher's exact test. Features that are by definition interdependent (i.e., PAS and the terminal 3' exon, or multiple TSSs of the same gene) were omitted from the analysis to avoid introducing any bias. Also, for assessing the interdependency of two features, only transcripts that span the region in which the two features are located are considered; other transcripts cannot be used as evidence of dependency. The mutual inclusivity or exclusivity of coupled features is determined using their log-transformed odds ratio. All p-values are adjusted using Bonferroni multiple testing correction, meaning that the threshold for significance of coupling is set to 1.4e-08 based on the total number of tests performed. <br>

-> **`script:`** The python script for calculating the coupling between feature-pairs and post-processing can be found [**here**](https://git.lumc.nl/s.y.anvar/mRNA-Coupling/ipython_notebook/master/scripts/coupling_statistical_analysis.ipynb). <br>
+> **`script:`** The python script for calculating the coupling between feature-pairs and post-processing can be found [**here**](https://git.lumc.nl/s.y.anvar/mRNA-Coupling/ipython_notebook/master/scripts/coupling_statistical_analysis.ipynb). <br> <br>
+> **input** <br> [:arrow_down:]() GFF file, curated for terminal positions and with single-exon genes filtered out, as no statistical test can be performed on them. <br> [:arrow_down:]() TMAP file, generated by cuffcompare to provide the matching GENCODE gene annotation. <br> <br>
+> **output** <br> [:arrow_up:]() Reference stats file consisting of: 
+> - gene id
+> - transcript id
+> - genomic position
+> - strand
+> - length
+> - coverage
+> - full-length read support
+> - partial read support
+> - alignment-based read support
+>
+> [:arrow_up:]() Features stat file consisting of:
+> - gene id
+> - number of transcripts
+> - number of TSSs
+> - number of exons
+> - number of PASs
+>
+> [:arrow_up:]() Coupling stat file that contains all the statistics for all possible feature-pairs along with various annotations: 
+> - gene id
+> - chromosome
+> - strand
+> - start position
+> - end position
+> - length of exonic sequence
+> - feature A
+> - feature B
+> - flag for separate or overlapping features (the latter is not considered since the exclusivity is given)
+> - flag for other interdependencies by definition (i.e., proximal PAS coupled with distal exon)
+> - exonic position of feature A
+> - exonic position of feature B
+> - relative position of feature A
+> - relative position of feature B
+> - p-value, odds ratio and contingency table for stats performed using:
+>      - full-length read support
+>      - coverage
+>      - alignment-based counts
+> - Gencode gene
+> - flag for Gencode match (i.e., =, c, j)
+> - Bonferroni-adjusted p-values. <br>
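
As an illustration of the per-pair statistics listed above, the sketch below computes a two-sided Fisher's exact test, a log-transformed odds ratio and a Bonferroni-adjusted p-value for one two-by-two contingency table, using only the Python standard library. It is not taken from the linked notebook, which may well rely on a statistics library instead. The function names, the cell order (`both`, `a_only`, `b_only`, `neither`), the base-2 logarithm and the 0.5 pseudocount that guards against empty cells are assumptions made for this example.

```python
from math import comb, log2


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more likely than the observed one
    # (small relative tolerance to absorb floating-point noise).
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(p, 1.0)


def coupling(both, a_only, b_only, neither, n_tests):
    """Return (p-value, Bonferroni-adjusted p-value, log2 odds ratio).

    both/a_only/b_only/neither: summed read counts for transcripts carrying
    both features, only A, only B, or neither (assumed cell layout).
    n_tests: total number of feature pairs tested.
    """
    p = fisher_exact_two_sided(both, a_only, b_only, neither)
    # Assumption: 0.5 is added to every cell so the ratio is always defined.
    log_or = log2(((both + 0.5) * (neither + 0.5))
                  / ((a_only + 0.5) * (b_only + 0.5)))
    p_adj = min(1.0, p * n_tests)  # Bonferroni correction
    return p, p_adj, log_or


# Made-up counts: the features mostly co-occur, so the log odds ratio is positive.
print(coupling(both=8, a_only=2, b_only=1, neither=5, n_tests=100))
```

In this sketch a positive log odds ratio indicates mutual inclusivity and a negative one mutual exclusivity. Multiplying each p-value by the number of tests is equivalent to comparing the raw p-value against a significance level divided by the number of tests.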

 <p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>