... | ... | @@ -59,7 +59,8 @@ Post sequencing, transcript structures were defined by applying the isoform-leve |
|
|
Transcripts were clustered into genes using the following criteria. For each genomic region containing transcripts, an empty graph is created in which each transcript is represented as a node. If two transcripts are on the same strand and share at least 2 exons, an edge is added to link them. For each connected component of the resulting graph, a unique gene id is generated, using the same scheme as the original data, and assigned to the transcripts in that component. Transcript ids are derived from the gene ids, with an index running from 1 to the total number of transcripts assigned to each gene. A lookup table mapping the old ids to the new ids is also generated. <br>
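
The clustering criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's actual code. It assumes that "sharing" an exon means identical start and end coordinates, that each transcript is given as a strand plus a set of `(start, end)` exons, and that new ids follow a hypothetical `GENE_<n>` / `GENE_<n>.<i>` pattern (the real scheme follows the original data). Connected components are found with a small union-find instead of an explicit graph library.

```python
from itertools import combinations


def cluster_transcripts(transcripts, min_shared_exons=2):
    """Group transcripts of one genomic region into genes.

    transcripts: dict mapping old transcript id -> (strand, set of (start, end) exons).
    Returns a lookup table: old transcript id -> (new gene id, new transcript id).
    The id pattern used here is hypothetical.
    """
    # Each transcript starts as its own component (a node with no edges).
    parent = {tid: tid for tid in transcripts}

    def find(x):
        # Follow parent links to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Add an "edge" (merge components) when two transcripts are on the same
    # strand and share at least `min_shared_exons` identical exons (assumption).
    for a, b in combinations(transcripts, 2):
        strand_a, exons_a = transcripts[a]
        strand_b, exons_b = transcripts[b]
        if strand_a == strand_b and len(exons_a & exons_b) >= min_shared_exons:
            parent[find(a)] = find(b)

    # Collect the members of each connected component.
    components = {}
    for tid in transcripts:
        components.setdefault(find(tid), []).append(tid)

    # One gene id per component; transcript index runs from 1 to the gene's size.
    id_map = {}
    for gene_index, members in enumerate(components.values(), start=1):
        gene_id = f"GENE_{gene_index}"
        for tx_index, old_id in enumerate(sorted(members), start=1):
            id_map[old_id] = (gene_id, f"{gene_id}.{tx_index}")
    return id_map
```

The pairwise comparison is quadratic in the number of transcripts per region, which is usually acceptable because each region is processed separately.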
|
|
|
|
|
|
> **`script:`** The Python script for curating gene models and transcript ids is provided [**here**](https://git.lumc.nl/wgallard/IsoseqProcessing/ipython_notebook/master/scripts/mcf7plus_processing.ipynb#5\)-Transcripts-were-clustered-into-'genes'). <br> <br>
|
|
|
> The script expects as input the GFF file ([✇]()) generated by running ANGEL to predict open reading frames. The output consists of two files: [✇]() a polished GTF file annotating transcript structures; and [✇]() a map of old and new gene or transcript ids. <br>
|
|
|
> **input** <br> [✇]() GFF file, generated after running ANGEL to predict open reading frames. <br> <br>
|
|
|
> **output** <br> [✇]() GTF file processed for regrouping of transcripts into gene clusters. <br> [✇]() map of old and new gene and transcript ids. <br>
|
|
|
|
|
|
For each gene, the positions of TSSs and PASs are subject to noise. To select the most likely alternative TSSs and PASs in a given gene, the terminal positions of transcripts at each gene locus were curated based on the following criteria: for each gene, transcript start and end positions that lie within the specified distance threshold (10 bp in this case) are grouped, and within each group the position with the highest number of supporting reads is used as the curated transcript start or end. <br>
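
A minimal Python sketch of this curation step is given below. It is only an illustration under stated assumptions: the function name `curate_termini` is hypothetical, positions are grouped by chaining sorted neighbours whose gap is at most the threshold (the text does not specify the exact grouping rule), and ties in read support go to the smallest position. It would be applied separately to the start positions and the end positions of one gene.

```python
def curate_termini(position_support, max_distance=10):
    """Map each observed terminus position to its curated position.

    position_support: dict mapping terminus position -> number of supporting reads.
    max_distance: grouping threshold in bp (10 bp in the text).
    Returns a dict: original position -> curated position.
    """
    # Chain sorted positions into groups: a position joins the current group
    # if it lies within max_distance of the previous position (assumption).
    groups = []
    for pos in sorted(position_support):
        if groups and pos - groups[-1][-1] <= max_distance:
            groups[-1].append(pos)
        else:
            groups.append([pos])

    # Within each group, the position with the most supporting reads wins.
    # max() keeps the first maximum, so ties resolve to the smallest position.
    curated = {}
    for group in groups:
        best = max(group, key=lambda p: position_support[p])
        for pos in group:
            curated[pos] = best
    return curated
```

Note that chaining can produce groups that span more than the threshold overall; a stricter rule (for example, a fixed window around the best-supported position) would be an equally plausible reading of the criteria.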
|
|
|
|
... | ... | |