... | ... | @@ -59,14 +59,14 @@ Post sequencing, transcript structures were defined by applying the isoform-leve |
|
|
Transcripts were clustered into genes using the following criteria: for each genomic region containing transcripts, we create an empty graph in which each transcript is represented as a node. If two transcripts are on the same strand and share at least 2 exons, an edge is added to link them. For each connected component of the resulting network, a unique gene id is generated and assigned to its transcripts using the same scheme as the original data. Transcript ids are derived from the gene ids, with an index running from 1 to the total number of transcripts assigned to each gene. A lookup table is generated that maps the old ids to the new ids. <br>
|
|
|
|
|
|
> **`script:`** The python script for curating gene models and transcript ids is provided [**here**](https://git.lumc.nl/wgallard/IsoseqProcessing/ipython_notebook/master/scripts/mcf7plus_processing.ipynb#5\)-Transcripts-were-clustered-into-'genes'). <br> <br>
|
|
|
> **input** <br> [:arrow_down:]() GFF file, generated after running ANGEL to predict open-reading frames. <br> <br>
|
|
|
> **output** <br> [:arrow_up:]() GFF file processed for regrouping of transcripts into gene clusters. <br> [:arrow_up:]() map of old and new gene and transcript ids. <br>
|
|
|
> **input** <br> [:arrow_down:](https://barmsijs.lumc.nl/RNAcoupling/MCF7_2015.ANGEL_FINAL_nominus_noduplicate.gff) GFF file, generated after running ANGEL to predict open-reading frames. <br> <br>
|
|
|
> **output** <br> [:arrow_up:](https://barmsijs.lumc.nl/RNAcoupling/MCF7_2015.ANGEL_FINAL_nominus_noduplicate.new_id.gff) GFF file processed for regrouping of transcripts into gene clusters. <br> [:arrow_up:](https://barmsijs.lumc.nl/RNAcoupling/MCF7_2015.ANGEL_FINAL_nominus_noduplicate.remap) map of old and new gene and transcript ids. <br>
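
The clustering criteria described above can be sketched in a few lines of Python. This is a minimal illustration, not the linked notebook's implementation: it assumes that "sharing" an exon means identical start and end coordinates, it finds connected components with a small union-find instead of a graph library, and the function name `cluster_transcripts` and the `GENE.<n>` id scheme are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cluster_transcripts(transcripts, min_shared_exons=2):
    """Group transcripts into genes via connected components.

    transcripts: dict mapping transcript id -> (strand, set of (start, end) exons)
    Returns a lookup table: old transcript id -> (new gene id, new transcript id).
    Assumption: two exons are "shared" only if their coordinates are identical.
    """
    parent = {t: t for t in transcripts}

    def find(x):
        # Follow parent links to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Add an "edge" (union) for every same-strand pair sharing enough exons.
    for a, b in combinations(transcripts, 2):
        strand_a, exons_a = transcripts[a]
        strand_b, exons_b = transcripts[b]
        if strand_a == strand_b and len(exons_a & exons_b) >= min_shared_exons:
            parent[find(a)] = find(b)

    components = defaultdict(list)
    for t in transcripts:
        components[find(t)].append(t)

    # Hypothetical id scheme: GENE.<n> for genes, GENE.<n>.<i> for transcripts.
    remap = {}
    for gene_index, members in enumerate(components.values(), start=1):
        gene_id = f"GENE.{gene_index}"
        for tx_index, old_id in enumerate(sorted(members), start=1):
            remap[old_id] = (gene_id, f"{gene_id}.{tx_index}")
    return remap


example = {
    "t1": ("+", {(100, 200), (300, 400), (500, 600)}),
    "t2": ("+", {(100, 200), (300, 400)}),
    "t3": ("-", {(100, 200), (300, 400)}),  # same exons, opposite strand
    "t4": ("+", {(300, 400), (700, 800)}),  # shares only one exon with t1
}
remap = cluster_transcripts(example)
```

In this toy example `t1` and `t2` end up in the same gene, while `t3` (opposite strand) and `t4` (only one shared exon) each form their own gene.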
|
|
|
|
|
|
For each gene, the positions of TSSs and PASs are subject to noise. To select the most likely alternative TSSs and PASs in a given gene, the terminal positions of transcripts at each gene locus were curated based on the following criteria: for each gene, terminal start and end positions that lie within the specified distance threshold (10bp in this case) are grouped, and the terminal position with the highest number of supporting reads is used as the curated transcript start or end. <br>
|
|
|
|
|
|
> **`script:`** The python script for curating the terminal positions of transcripts at each gene locus is provided [**here**](https://git.lumc.nl/wgallard/IsoseqProcessing/ipython_notebook/master/scripts/mcf7plus_processing.ipynb#6\)-Terminal-feature-coordinates-were-curated--using-a-+/--10bp-window). <br> <br>
|
|
|
> **input** <br> [:arrow_down:]() GFF file, pre-processed for regrouping transcripts into gene clusters. <br> <br>
|
|
|
> **output** <br> [:arrow_up:]() GFF file curated for terminal positions to reduce stochasticity and overrepresentation of TSSs and PASs per gene. Please note that a threshold (default is 10bp either side of each terminal position) needs to be set for the window for polishing. <br>
|
|
|
> **input** <br> [:arrow_down:](https://barmsijs.lumc.nl/RNAcoupling/MCF7_2015.ANGEL_FINAL_nominus_noduplicate.new_id.gff) GFF file, pre-processed for regrouping transcripts into gene clusters. <br> <br>
|
|
|
> **output** <br> [:arrow_up:](https://barmsijs.lumc.nl/RNAcoupling/MCF7_2015.ANGEL_FINAL_nominus_noduplicate.curated.gff) GFF file curated for terminal positions to reduce stochasticity and overrepresentation of TSSs and PASs per gene. Please note that a threshold (default is 10bp either side of each terminal position) needs to be set for the window for polishing. <br>
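
The curation of terminal positions can be illustrated with the following sketch. It is an approximation under stated assumptions rather than the notebook's code: positions are sorted and chained into a group whenever consecutive positions are at most `window` bp apart (single-linkage grouping, which is one possible reading of "within the specified distance threshold"), and ties in read support are resolved in favour of the smallest coordinate. The function name `curate_termini` is hypothetical.

```python
def curate_termini(position_support, window=10):
    """Map each observed terminal position to its curated position.

    position_support: dict mapping genomic position -> number of supporting reads,
                      for one type of terminus (start or end) of a single gene.
    Assumption: consecutive sorted positions no more than `window` bp apart
    belong to the same group.
    """
    curated = {}
    group = []

    def flush(members):
        # The best-supported position represents the whole group.
        best = max(members, key=lambda p: position_support[p])
        for p in members:
            curated[p] = best

    for pos in sorted(position_support):
        if group and pos - group[-1] > window:
            flush(group)
            group = []
        group.append(pos)
    if group:
        flush(group)
    return curated


# Three nearby start sites collapse onto the best-supported one (1004);
# the distant site at 1050 is left unchanged.
starts = {1000: 3, 1004: 12, 1009: 1, 1050: 5}
curated_starts = curate_termini(starts, window=10)
```

The same function would be applied separately to transcript starts and transcript ends of each gene.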
|
|
|
|
|
|
In addition to 7,364 single-exon transcripts, the MCF-7 transcriptome consists of 11,350 multi-exon genes of which 69% produced multiple transcript structures (**Figure 1**; [GFF](); [FASTA]()). Furthermore, 49% of identified transcripts in MCF-7 cells are potentially novel in comparison with the [GENCODE](http://www.gencodegenes.org/) annotation (version 19). The comparison was carried out using `cuffcompare` from the Cufflinks suite. <br>
|
|
|
<br>
|
... | ... | @@ -103,8 +103,8 @@ In this study, by processing the GFF file that contains the annotation of all id |
|
|
After defining unique features (TSSs, exons and PASs) and identifying the number of supporting reads for transcripts at each locus (Full-length read support, Full-length and Partial read support, or alignment-based read support), all possible pairwise comparisons between features were made. For each pair, the sum of all reads that support the presence of the two selected features across all observed transcripts is reported in a two-by-two contingency table (**Figure 2**). The table describes the number of times the two features are observed together in the same transcript or exclusively of each other, as well as the sum of reads mapped to transcripts that support the presence of neither feature (**Figure 2**). Significant coupling between two features is assessed using Fisher's exact test. Features that are by definition interdependent (i.e., PAS and the terminal 3' exon, or multiple TSSs of the same gene) were omitted from the analysis to avoid introducing any bias. Also, when assessing the interdependency of two features, only transcripts that span the region in which both features are located are considered; other transcripts cannot be used as evidence of dependency. The mutual inclusivity or exclusivity of coupled features is determined using their log-transformed odds ratio. All p-values are adjusted using Bonferroni multiple testing correction, meaning that the threshold for significance of coupling is set to 1.4e-08 based on the total number of tests performed. <br>
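
As a rough illustration of the test described above, the sketch below builds the two-by-two table from read counts, computes a two-sided Fisher's exact p-value using only the standard library, and derives a log odds ratio. Several details are assumptions not stated in the text: the input is assumed to contain only transcripts spanning both features, the logarithm is taken to base 2, a pseudocount of 0.5 is added to avoid division by zero, and all function names are hypothetical. The linked notebook may differ on each of these points.

```python
from math import comb, log2


def contingency_table(transcripts, feature_a, feature_b):
    """Sum read support into (both, only_a, only_b, neither).

    transcripts: iterable of (set of feature ids, supporting read count),
                 assumed to be pre-filtered to transcripts spanning both features.
    """
    both = only_a = only_b = neither = 0
    for features, reads in transcripts:
        has_a, has_b = feature_a in features, feature_b in features
        if has_a and has_b:
            both += reads
        elif has_a:
            only_a += reads
        elif has_b:
            only_b += reads
        else:
            neither += reads
    return both, only_a, only_b, neither


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as probable as the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def log_odds_ratio(a, b, c, d, pseudocount=0.5):
    """log2 odds ratio; positive suggests mutual inclusivity, negative exclusivity."""
    return log2(((a + pseudocount) * (d + pseudocount))
                / ((b + pseudocount) * (c + pseudocount)))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


reads = [({"TSS1", "E2"}, 30), ({"TSS1"}, 2), ({"E2"}, 1), (set(), 25)]
table = contingency_table(reads, "TSS1", "E2")
p_value = fisher_exact_two_sided(*table)
```

A feature pair would then be called coupled when `p_value` falls below `bonferroni_threshold(alpha, n_tests)`, with the sign of `log_odds_ratio(*table)` indicating inclusive or exclusive usage.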
|
|
|
|
|
|
> **`script:`** The python script for calculating the coupling between feature-pairs and post-processing can be found [**here**](https://git.lumc.nl/s.y.anvar/mRNA-Coupling/ipython_notebook/master/scripts/coupling_statistical_analysis.ipynb). <br> <br>
|
|
|
> **input** <br> [:arrow_down:]() GFF file, curated for terminal positions and filtered to exclude single-exon genes, for which no statistical test can be performed. <br> [:arrow_down:]() TMAP file, generated by cuffcompare to provide gencode matching gene annotation. <br> <br>
|
|
|
> **output** <br> [:arrow_up:]() Reference stats file consisting of:
|
|
|
> **input** <br> [:arrow_down:](https://barmsijs.lumc.nl/RNAcoupling/) GFF file, curated for terminal positions and filtered to exclude single-exon genes, for which no statistical test can be performed. <br> [:arrow_down:](https://barmsijs.lumc.nl/RNAcoupling/compare.mcf7plus2016.gencode.MCF7_2015.ANGEL_FINAL_nominus_noduplicate.counts.new_id.curated.noCDS.no_single_exon.gff.tmap) TMAP file, generated by cuffcompare to provide gencode matching gene annotation. <br> <br>
|
|
|
> **output** <br> [:arrow_up:](https://barmsijs.lumc.nl/RNAcoupling/mcf7plus.2016.reference.q30.stats) Reference stats file consisting of:
|
|
|
> - gene id
|
|
|
> - transcript id
|
|
|
> - genomic position
|
... | ... | @@ -115,14 +115,14 @@ After defining unique features (TSSs, exons and PASs) and identifying the number |
|
|
> - partial read support
|
|
|
> - alignment-based read support
|
|
|
>
|
|
|
> [:arrow_up:]() Features stat file consisting of:
|
|
|
> [:arrow_up:](https://barmsijs.lumc.nl/RNAcoupling/mcf7plus.2016.features.q30) Features stat file consisting of:
|
|
|
> - gene id
|
|
|
> - number of transcripts
|
|
|
> - number of TSSs
|
|
|
> - number of exons
|
|
|
> - number of PAS
|
|
|
>
|
|
|
> [:arrow_up:]() Coupling stat file that contains all the statistics for all possible feature-pairs along with various annotations:
|
|
|
> [:arrow_up:](https://barmsijs.lumc.nl/RNAcoupling/mcf7plus2016.overlap.adjusted.raw.q30.stats_annotation.plus.txt) Coupling stat file that contains all the statistics for all possible feature-pairs along with various annotations:
|
|
|
> - gene id
|
|
|
> - chromosome
|
|
|
> - strand
|
... | ... | @@ -170,8 +170,8 @@ This analysis was performed to identify loci that contain tandem 3' UTRs (loci w |
|
|
For each detected gene, we report the first and last nucleotide of each exon as the acceptor and donor splice sites, respectively. Each unique genomic position was converted into BED format, and strand-specific 2-nucleotide sequences were extracted using the UCSC Table Browser (GRCh37/hg19) for both acceptor and donor splice sites.
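
The conversion of exon boundaries to BED records can be sketched as follows. This is only an illustration under assumptions: exon coordinates are taken to be 1-based and inclusive (as in GFF), BED intervals are 0-based and half-open, and "first" and "last" nucleotide are interpreted in transcript orientation, so that on the minus strand the acceptor sits at the genomic end of the exon. The function name and the `_acceptor`/`_donor` naming are hypothetical.

```python
def exon_to_splice_site_bed(chrom, start, end, strand, name):
    """Return BED6 tuples for the acceptor and donor positions of one exon.

    start/end: 1-based inclusive exon coordinates (GFF convention, assumed).
    Assumption: on the minus strand the exon's first transcribed nucleotide
    is its genomic end, so acceptor and donor swap genomic sides.
    """
    if strand == "+":
        acceptor, donor = start, end
    else:
        acceptor, donor = end, start
    # A single 1-based position p becomes the BED interval [p - 1, p).
    return [
        (chrom, acceptor - 1, acceptor, f"{name}_acceptor", 0, strand),
        (chrom, donor - 1, donor, f"{name}_donor", 0, strand),
    ]


plus_sites = exon_to_splice_site_bed("chr1", 101, 200, "+", "exon1")
minus_sites = exon_to_splice_site_bed("chr1", 101, 200, "-", "exon2")
bed_lines = ["\t".join(map(str, record)) for record in plus_sites + minus_sites]
```

The resulting tab-separated lines could be deduplicated and supplied to a sequence-extraction tool. How the 2-nucleotide window is placed relative to each position is not specified above and is left out of the sketch.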
|
|
|
|
|
|
> **`script:`** The python script for extracting dinucleotide sequences of the splice-sites at R1 and R3 domains (i.e., canonical GT and AG motifs) can be found [**here**](https://git.lumc.nl/s.y.anvar/mRNA-Coupling/ipython_notebook/master/scripts/Rdomain_splice_junction_motif.ipynb). <br> <br>
|
|
|
> **input** <br> [:arrow_down:]() FASTA files, intronic sequences of R1 and R3 domains. Sequence headers contain additional information regarding the coupling of the exon with other features, etc. <br> <br>
|
|
|
> **output** <br> [:arrow_up:]() Text file containing information on:
|
|
|
> **input** <br> [:arrow_down:](https://barmsijs.lumc.nl/RNAcoupling/mcf7plus.Rdomains.fa.tar.gz) FASTA files, intronic sequences of R1 and R3 domains. Sequence headers contain additional information regarding the coupling of the exon with other features, etc. <br> <br>
|
|
|
> **output** <br> [:arrow_up:](https://barmsijs.lumc.nl/RNAcoupling/mcf7plus.splicing_motifs.txt) Text file containing information on:
|
|
|
> - gene id
|
|
|
> - genomic position
|
|
|
> - exon id/name
|
... | ... | |