Changes

Anvar · 1cbba5bd
--- a/home.md
+++ b/home.md
@@ -4,8 +4,8 @@ Department of Human Genetics <br>
 Leiden University Medical Center <br>
 <br>

-1. [Materials and Preprocessing](#materials-and-preprocessing-)
-1. [Data Characteristics](#data-characteristics-)
+1. [Materials and Sequencing](#materials-and-sequencing-)
+1. [Data Preprocessing and Characteristics](#data-preprocessing-and-characteristics-)
 1. [Prerequisites](#prerequisites-)
 1. [Defining Unique Features]()
 1. [Statistical Analysis]()
@@ -27,7 +27,7 @@ The multilayered control of gene expression requires tight coordination of regul
 <br>

 ---
-### **Materials and Preprocessing** <br>
+### **Materials and Sequencing** <br>
 Full-length cDNA was generated from polyA RNA using standard cDNA synthesis kits (Clontech® SMARTer™ and Invitrogen® Superscript® kits). To capture longer, rarer transcripts in sufficient abundance, parts of the double-stranded cDNA were size selected into three fractions, which were subsequently amplified and converted into SMRTbell™ templates. Details on the sample preparation can be found on [`Sample Net`](http://www.smrtcommunity.com/Share/Protocol?id=a1q70000000HqSvAAK&strRecordTypeName=Protocol). SMRTbell libraries were sequenced using the [`P4-C2 sequencing chemistry`](http://blog.pacificbiosciences.com/2013/08/new-dna-polymerase-p4-delivers-higher.html) with 2-hour movies. <br>

 After sequencing, the completeness of the sequences was computationally determined using polyA-tail signals and library adapters. To obtain a non-redundant set of full-length, high-quality transcript sequences without bias from other sequencing platforms, a de novo isoform-level clustering algorithm was applied to PacBio data only. Briefly, the algorithm iteratively clusters reads to generate consensus sequences that represent the original transcripts. The algorithm takes into account the existence of the polyA-tail signal to differentiate isoforms with alternative stop sites. The final consensus sequences were called using [`Quiver`](https://github.com/PacificBiosciences/GenomicConsensus/blob/master/doc/HowToQuiver.rst) and filtered to create the final polished, full-length, non-redundant dataset. In 2015, the MCF-7 dataset was updated by sequencing 28 additional SMRT Cells on the PacBio RS II platform. <br>
@@ -49,11 +49,21 @@ For full description of the sample preparation protocol and pre-processing of se
 <br>

 ---
-### **Data Characteristics** <br>
-Post sequencing, transcript structures were defined by applying the isoform-level clustering algorithm ([ICE](https://github.com/PacificBiosciences/cDNA_primer/wiki)) on full-length reads, capturing the entire mRNA molecule (containing 5', polyA-tail and 3' primer sequences). To find transcript clusters, ICE performs a pairwise alignment and reiterative assignment of full-length reads to clusters based on likelihood. This process is followed by consensus calling and further polishing of the sequence to reduce redundancy and increase the overall accuracy of sequences for identified transcript variants. Our analysis pipeline could precisely determine the position of the polyadenylation site (presence of polyA-tail) and intron-exon boundaries, as is evident from the presence of the canonical **GU** motif in 93% of donor splice sites and the canonical **AG** motif in 95% of acceptor splice sites. In addition to 7,364 single-exon transcripts, the MCF-7 transcriptome consists of 11,350 multi-exon genes of which 69% produced multiple transcript structures (**Figure 1**; [GFF](); [FASTA]()). Furthermore, 49% of identified transcripts in MCF-7 cells are potentially novel in comparison with the [GENCODE](http://www.gencodegenes.org/) annotation (version 19). The comparison was carried out using `cuffcompare` from the Cufflinks suite. <br>
+### **Data Preprocessing and Characteristics** <br>
+After sequencing, transcript structures were defined by applying the isoform-level clustering algorithm ([ICE](https://github.com/PacificBiosciences/cDNA_primer/wiki)) on full-length reads, capturing the entire mRNA molecule (containing 5', polyA-tail and 3' primer sequences). To find transcript clusters, ICE performs a pairwise alignment and reiterative assignment of full-length reads to clusters based on likelihood. This process is followed by consensus calling and further polishing of the sequence to reduce redundancy and increase the overall accuracy of sequences for identified transcript variants. Our analysis pipeline could precisely determine the position of the polyadenylation site (presence of polyA-tail) and intron-exon boundaries, as is evident from the presence of the canonical **GU** motif in 93% of donor splice sites and the canonical **AG** motif in 95% of acceptor splice sites. <br>
+
+Transcripts were clustered into genes as follows. For each genomic region containing transcripts, a graph with no edges is created in which each transcript is represented as a node. If two transcripts are on the same strand and share at least 2 exons, an edge is added to link them. For each connected component of the resulting graph, a unique gene id is generated and assigned to its transcripts using the same scheme as the original data. Transcript ids are derived from the gene ids, with an index running from 1 to the total number of transcripts assigned to each gene. A lookup table is generated to map the old ids to the new ones. <br>
+
+> **`script:`** The Python script for curating gene models and transcript ids is provided [**here**](https://git.lumc.nl/wgallard/IsoseqProcessing/ipython_notebook/master/scripts/mcf7plus_processing.ipynb#5\)-Transcripts-were-clustered-into-'genes'). <br>
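
The linked notebook remains the authoritative implementation. As an illustration only, the graph-based clustering described above might be sketched as follows. This sketch makes several assumptions that do not come from the source: two exons count as "shared" only when their `(start, end)` coordinates match exactly, connected components are found with a union-find structure, and the `GENE<n>` / `GENE<n>.<i>` identifiers are hypothetical placeholders rather than the id scheme of the original data.

```python
from collections import defaultdict
from itertools import combinations


def cluster_transcripts(transcripts, min_shared_exons=2):
    """Group transcripts into genes via connected components.

    `transcripts` maps a transcript id to (strand, exons), where exons is an
    iterable of (start, end) tuples. Exons are treated as shared only when
    their coordinates match exactly (an assumption of this sketch).
    Returns a lookup table: old transcript id -> (gene id, new transcript id).
    """
    # Union-find forest: every transcript starts as its own component.
    parent = {tid: tid for tid in transcripts}

    def find(tid):
        # Walk to the root, shortening the path along the way.
        while parent[tid] != tid:
            parent[tid] = parent[parent[tid]]
            tid = parent[tid]
        return tid

    # Link two transcripts when they are on the same strand and share
    # at least `min_shared_exons` exons.
    for a, b in combinations(transcripts, 2):
        strand_a, exons_a = transcripts[a]
        strand_b, exons_b = transcripts[b]
        shared = len(set(exons_a) & set(exons_b))
        if strand_a == strand_b and shared >= min_shared_exons:
            parent[find(a)] = find(b)

    # Collect the members of each connected component.
    components = defaultdict(list)
    for tid in transcripts:
        components[find(tid)].append(tid)

    # Assign placeholder gene ids and 1-based transcript indices.
    lookup = {}
    for gene_number, members in enumerate(components.values(), start=1):
        gene_id = f"GENE{gene_number}"  # hypothetical id scheme
        for index, old_id in enumerate(sorted(members), start=1):
            lookup[old_id] = (gene_id, f"{gene_id}.{index}")
    return lookup
```

The pairwise comparison is quadratic in the number of transcripts, which is usually acceptable when the function is called once per genomic region, as the description suggests.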
+
+For each gene, the positions of transcription start sites (TSSs) and polyadenylation sites (PASs) are subject to noise. To select the most likely alternative TSSs and PASs in a given gene, the terminal positions of transcripts at each gene locus were curated as follows: start and end termini that lie within the specified distance threshold (10 bp in this case) are grouped, and the terminus position with the highest number of supporting reads in each group is used as the curated transcript start or end. <br>
+
+> **`script:`** The Python script for curating the terminal positions of transcripts at each gene locus is provided [**here**](https://git.lumc.nl/wgallard/IsoseqProcessing/ipython_notebook/master/scripts/mcf7plus_processing.ipynb#6\)-Terminal-feature-coordinates-were-curated--using-a-+/--10bp-window). <br>
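
The linked notebook is the reference for how the curation was actually done. Purely as a sketch, the grouping of nearby termini could look like the code below. It assumes that positions are chained into one group whenever consecutive sorted coordinates lie within the threshold of each other, and that ties in read support go to the lowest coordinate. Both choices, and the function name `curate_termini`, are assumptions of this example, not details taken from the source.

```python
def curate_termini(positions, threshold=10):
    """Map each terminus position to the best-supported position nearby.

    `positions` maps a genomic coordinate (for one gene and one terminus
    type, e.g. all transcript starts) to its number of supporting reads.
    Returns a dict: original coordinate -> curated coordinate.
    """
    # Chain sorted coordinates into groups: a new group starts whenever the
    # gap to the previous coordinate exceeds the threshold (assumption).
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] <= threshold:
            groups[-1].append(pos)
        else:
            groups.append([pos])

    # Within each group, keep the coordinate with the most supporting reads.
    # max() returns the first maximum, so ties go to the lowest coordinate.
    curated = {}
    for group in groups:
        best = max(group, key=lambda p: positions[p])
        for pos in group:
            curated[pos] = best
    return curated
```

For example, calling `curate_termini({1000: 3, 1004: 12, 1009: 1})` maps all three coordinates to 1004, the position with the most supporting reads. Note that chaining can merge positions that are more than 10 bp apart when intermediate positions bridge them; a fixed window around the best-supported position would behave differently.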
+
+In addition to 7,364 single-exon transcripts, the MCF-7 transcriptome consists of 11,350 multi-exon genes of which 69% produced multiple transcript structures (**Figure 1**; [GFF](); [FASTA]()). Furthermore, 49% of identified transcripts in MCF-7 cells are potentially novel in comparison with the [GENCODE](http://www.gencodegenes.org/) annotation (version 19). The comparison was carried out using `cuffcompare` from the Cufflinks suite. <br>
 <br>
 ![SY_Anvar_-_Figure_S1_-_Histogram_of_Length_and_Counts_per_Gene](/uploads/621961f894a9d5fa43819c0421600040/SY_Anvar_-_Figure_S1_-_Histogram_of_Length_and_Counts_per_Gene.png)
-<br>
+
 > **Figure 1:** Overview of identified transcripts in the MCF-7 transcriptome. Histograms show the distribution of the number of identified transcripts per gene and transcript lengths. The density plot depicts the number of supporting reads based on transcript length. The number of supporting reads does not correlate with the length of full-length transcripts.