... | ... | @@ -7,6 +7,7 @@ Leiden University Medical Center <br> |
|
|
1. [Materials and Sequencing](#materials-and-sequencing-)
|
|
|
1. [Data Preprocessing and Characteristics](#data-preprocessing-and-characteristics-)
|
|
|
1. [Prerequisites](#prerequisites-)
|
|
|
1. [Comparison to standard RNA-Seq datasets](#comparison-to-standard-rna-seq-datasets-)
|
|
|
1. [Defining Unique Features](#defining-unique-features-)
|
|
|
1. [Statistical Analysis](#statistical-analysis-)
|
|
|
1. [Polyadenylation Sites Sequence Motif Analysis](#polyadenylation-sites-sequence-motif-analysis-)
|
... | ... | @@ -75,6 +76,11 @@ It is essential to produce the GFF file containing the annotation of identified |
|
|
Prior to running BLASR, the FASTA file containing the reference transcript sequences (i.e., reference.fa) needs to be checked for redundant sequences. This process renames the transcripts with unique indexes. A map to the original transcript ids is provided in the reference.fa.nonredundant.id_map.txt file. After the alignment is complete, the number of supporting reads per transcript can be extracted using the `samtools idxstats` command to produce the IsoSeq_MCF7.reads_of_insert.coverage file. <br>
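
As a minimal sketch of the last step, the tab-separated output of `samtools idxstats` (reference name, sequence length, mapped reads, unmapped reads) can be read into a per-transcript count table. The function name `read_idxstats` and the example values are hypothetical and not part of this repository. Translating the unique indexes back to the original transcript ids requires the id map file, whose layout is not described here, so that step is left out.

```python
import io


def read_idxstats(handle):
    """Return {reference_name: mapped_read_count} from `samtools idxstats` text."""
    counts = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        name, _length, mapped, _unmapped = fields[:4]
        if name == "*":
            continue  # summary line for reads without a reference
        counts[name] = int(mapped)
    return counts


# Made-up example: two renamed transcripts ("0" and "1") plus the "*" line.
example = io.StringIO("0\t1500\t12\t0\n1\t2300\t3\t0\n*\t0\t0\t7\n")
coverage = read_idxstats(example)
```

In practice the handle would be an open file holding the saved `samtools idxstats` output rather than an in-memory string.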
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Comparison to standard RNA-Seq datasets** <br>
|
|
|
We used five publicly available RNA-Seq datasets ([SRR1035698](http://sra.dnanexus.com/experiments/SRX381535/runs), [SRR1107833](http://sra.dnanexus.com/experiments/SRX426377/runs), [SRR1107834](http://sra.dnanexus.com/experiments/SRX426377/runs), [SRR1107835](http://sra.dnanexus.com/experiments/SRX426377/runs) and [SRR1313067](https://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR1313067); generated on Illumina HiSeq2000 or Illumina HiSeq2500 platforms) to evaluate the reliability of gene expression quantification based on the full-length mRNA sequencing data used in this study. As accurate transcript reconstruction is not feasible for short-read RNA-Seq data, the comparison is made at the gene level using the GENCODE annotation (version 19). Median gene coverage (fragment counts adjusted for gene length), computed with the [GENTRAP](http://biopet-docs.readthedocs.io/en/latest/pipelines/gentrap/) pipeline, was used as the measure of gene expression.<br>
|
|
|
<p align="right"> [`TOP`](#full-length-mrna-sequencing-uncovers-a-widespread-coupling-between-transcription-and-mrna-processing)</p><br>
|
|
|
|
|
|
---
|
|
|
## **Defining Unique Features** <br>
|
|
|
In this study, a list of all transcription and mRNA processing events is produced (**Figure 2**) by processing the GFF file that contains the annotation of all identified transcripts and exon-intron boundaries (defined by genomic position and strand on the HG19 reference sequence). Transcription start sites (TSSs) are defined as the first genomic position of each transcript structure. Polyadenylation sites (PASs) are defined as the last genomic position of each transcript. The most upstream and most downstream positions of each exon are used to define acceptor and donor splice sites, respectively. However, for the first exon only the donor site is described, as its first position is defined as the transcription start site. Likewise, the last exon does not contain a donor splice site, as its last position is defined as the polyadenylation site. If multiple transcripts share the same feature, only one copy is kept in the unique set of features at each locus. Furthermore, the union of all unique exons is defined as the available sequence at each locus. <br>
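
The definitions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in the study: transcripts are assumed to be already parsed from the GFF file into `(strand, exons)` pairs with 1-based inclusive coordinates, the upstream end of every exon except the first is treated as an acceptor site and the downstream end of every exon except the last as a donor site (consistent with the first- and last-exon rules), and the names `unique_features` and `exon_union` are hypothetical.

```python
def unique_features(transcripts):
    """Collect the unique features of one locus.

    transcripts: iterable of (strand, exons); exons is a list of
    (start, end) pairs, 1-based inclusive, with start <= end.
    """
    feats = {"TSS": set(), "PAS": set(), "donor": set(),
             "acceptor": set(), "exon": set()}
    for strand, exons in transcripts:
        # Order exons 5'->3' in transcript orientation.
        ordered = sorted(exons, reverse=(strand == "-"))
        last = len(ordered) - 1
        for i, (start, end) in enumerate(ordered):
            # Upstream/downstream ends depend on the strand.
            up, down = (start, end) if strand == "+" else (end, start)
            feats["exon"].add((start, end))
            feats["TSS" if i == 0 else "acceptor"].add(up)
            feats["PAS" if i == last else "donor"].add(down)
    return feats


def exon_union(exons):
    """Merge overlapping or abutting exons into the available sequence."""
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


# Made-up locus on the plus strand with two transcripts.
locus = [
    ("+", [(100, 200), (300, 400), (500, 600)]),
    ("+", [(100, 200), (500, 650)]),
]
features = unique_features(locus)
available = exon_union(features["exon"])
```

Using sets keeps a single copy of any feature shared by several transcripts. In this made-up locus the two transcripts share one TSS and one donor site at position 200 but end at two different PASs.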
|
... | ... | |