Commit 6ea54bd9 authored by Sander Bollen

alignments and motifs

parent 639fdbbf
1 merge request: !5 Biopython 2018
%% Cell type:markdown id: tags:
# Biopython
![biopython logo](http://biopython.org/assets/images/biopython_logo_white.png)
## A quick overview
### [Guy Allard](mailto://w.g.allard@lumc.nl)
### [Sander Bollen](mailto://a.h.b.bollen@lumc.nl)
%% Cell type:markdown id: tags:
# What is Biopython?
## 'Python Tools for Computational Molecular Biology'
- Fully open-source
- Actively developed
- Large community
%% Cell type:markdown id: tags:
# What can it do?
Modules, classes and functions for manipulating biological data
- File parsers and writers.
    - Sequence files: fasta, fastq, genbank, abi, sff, etc.
    - Alignment files: clustal, emboss, phylip, nexus, etc.
    - Sequence search outputs: BLAST, HMMER, BLAT, etc.
    - Phylogenetic trees: newick, nexus, phyloxml, etc.
    - Sequence motifs: AlignAce, TRANSFAC, etc.
    - Others: PDB files, etc.
- Access to remote resources (e.g., Entrez, NCBI BLAST).
- Application wrappers.
- A simple graphing tool.
- Simple algorithms (e.g., pairwise alignment, cluster analysis).
- References such as codon tables and IUPAC sequences.
%% Cell type:markdown id: tags:
# Where can I find more information?
- [Biopython Homepage](http://biopython.org/)
- [Biopython development repository](http://github.com/biopython/biopython)
- [Biopython mailing list](http://lists.open-bio.org/pipermail/biopython/)
- [Biopython 'cookbook'](http://biopython.org/DIST/docs/tutorial/Tutorial.html) (essential reading!)
%% Cell type:markdown id: tags:
# Manipulating sequence data
## Seq and SeqRecord objects
`Seq` and `SeqRecord` objects are the basis of all sequence manipulation in Biopython.
* `Seq` is a raw sequence with an alphabet (e.g. DNA or RNA).
* `SeqRecord` is a sequence with metadata (e.g. names, ids, etc). This contains a `Seq` object.
%% Cell type:code id: tags:
``` python
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
# create a sequence, and store it in a variable
my_sequence = Seq("ATGGCCCTGTGGATGCGCCTCCTGCCCCTG", generic_dna)
print(my_sequence)
```
%% Output
ATGGCCCTGTGGATGCGCCTCCTGCCCCTG
%% Cell type:markdown id: tags:
There are lots of built-in methods that can be used to manipulate the sequence.
The sequence acts like a string in many ways:
%% Cell type:code id: tags:
``` python
# get the length of a sequence
print("length: {0}".format(len(my_sequence)))
```
%% Output
length: 30
%% Cell type:code id: tags:
``` python
# slice and dice
print(my_sequence[:10])
```
%% Output
ATGGCCCTGT
%% Cell type:code id: tags:
``` python
# change the case
print(my_sequence.lower())
```
%% Output
atggccctgtggatgcgcctcctgcccctg
%% Cell type:code id: tags:
``` python
# concatenate the first and last 10 nucleotides
print(my_sequence[:10] + my_sequence[-10:])
```
%% Output
ATGGCCCTGTCCTGCCCCTG
%% Cell type:markdown id: tags:
It also has more sequence-specific methods:
%% Cell type:code id: tags:
``` python
# complement
print(my_sequence.complement())
```
%% Output
TACCGGGACACCTACGCGGAGGACGGGGAC
%% Cell type:code id: tags:
``` python
# reverse complement
print(my_sequence.reverse_complement())
```
%% Output
CAGGGGCAGGAGGCGCATCCACAGGGCCAT
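%% Cell type:markdown id: tags:
To make clear what `complement` and `reverse_complement` compute, here is a minimal pure-Python sketch that does not use Biopython. The function names and the `COMPLEMENT` table are illustrative only, and the sketch assumes an upper-case sequence containing just A, C, G and T (no ambiguity codes).
%% Cell type:code id: tags:
```python
# Illustrative sketch only: assumes upper-case A/C/G/T input.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Return the complementary strand, read in the same direction."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Return the complementary strand, reversed so it reads 5' to 3'."""
    return complement(dna)[::-1]


sequence = "ATGGCCCTGTGGATGCGCCTCCTGCCCCTG"
print(complement(sequence))
print(reverse_complement(sequence))
```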
%% Cell type:code id: tags:
``` python
# transcribe from DNA to RNA
rna = my_sequence.transcribe()
print(rna)
```
%% Output
AUGGCCCUGUGGAUGCGCCUCCUGCCCCUG
%% Cell type:code id: tags:
``` python
# Translate from nucleotide to protein
protein = my_sequence.translate()
print(protein)
```
%% Output
MALWMRLLPL
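%% Cell type:markdown id: tags:
Transcription and translation can be sketched in plain Python as well. The example below assumes the standard genetic code, an upper-case coding-strand sequence, and no ambiguity codes; the names `CODON_TABLE`, `transcribe` and `translate` are illustrative and are not Biopython's implementation. Stop codons are shown as `*`.
%% Cell type:code id: tags:
```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): amino_acid
               for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(dna):
    """Treat the input as the coding strand and replace T with U."""
    return dna.replace("T", "U")


def translate(dna):
    """Translate complete codons from the first base; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


sequence = "ATGGCCCTGTGGATGCGCCTCCTGCCCCTG"
print(transcribe(sequence))
print(translate(sequence))
```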
%% Cell type:markdown id: tags:
# Manipulating Sequence Data
## Bio.SeqIO
Input and output of sequence files.
- `SeqIO.read`
    - Read a file containing a single sequence
- `SeqIO.parse`
    - Iterate over all sequences in a sequence file
- `SeqIO.write`
    - Write sequences to a file
%% Cell type:code id: tags:
``` python
from Bio import SeqIO
# read the first sequence
# SeqIO.parse yields SeqRecord objects, so we stop after the first one
for record in SeqIO.parse("../data/records.fa", "fasta"):
    dna = record
    break
print(dna)
```
%% Output
ID: 1
Name: 1
Description: 1
Number of features: 0
Seq('TGGAACATGTCCCGCTAGCTTCTTCTTGCTAGCAGATTTTTTCAGTTGATCGTC...TCT', SingleLetterAlphabet())
%% Cell type:markdown id: tags:
Each record is an object with several fields, including:
- `record.id`
    - the sequence id
- `record.name`
    - the sequence name, usually the same as the id
- `record.description`
    - the sequence description

The actual sequence is a separate object contained within the record, which can be accessed using `record.seq`.
The sequence has an 'alphabet' associated with it, which defines which letters are allowed.
Different alphabets are used for DNA, RNA, protein, etc.
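%% Cell type:markdown id: tags:
To illustrate how a FASTA header maps onto the id and description fields, and how a parser can yield records one at a time, here is a simplified pure-Python sketch. The function `iter_fasta` and its helper are hypothetical and much less robust than `SeqIO.parse`; they only show the idea of lazily iterating over records.
%% Cell type:code id: tags:
```python
import io


def _make_record(header, chunks):
    """Build an (id, description, sequence) tuple from a header and sequence lines."""
    words = header.split()
    seq_id = words[0] if words else ""
    return seq_id, header, "".join(chunks)


def iter_fasta(handle):
    """Yield (id, description, sequence) tuples from a FASTA-formatted handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # a new header closes the previous record
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


example = io.StringIO(">seq1 first example\nACGT\nTTGA\n>seq2\nGGCC\n")
records = list(iter_fasta(example))
for seq_id, description, sequence in records:
    print(seq_id, description, len(sequence))
```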
%% Cell type:code id: tags:
``` python
print(dna.seq)
```
%% Output
TGGAACATGTCCCGCTAGCTTCTTCTTGCTAGCAGATTTTTTCAGTTGATCGTCACATGCGGTAGACTACCCAAGGTGTGACTACTCGCATGCCTGATCT
%% Cell type:code id: tags:
``` python
# we can then do our sequence manipulations on the `.seq` attribute of the record
print(dna.seq.reverse_complement())
```
%% Output
AGATCAGGCATGCGAGTAGTCACACCTTGGGTAGTCTACCGCATGTGACGATCAACTGAAAAAATCTGCTAGCAAGAAGAAGCTAGCGGGACATGTTCCA
%% Cell type:markdown id: tags:
Sequence records can easily be written to a file.
Specifying the file type allows conversion between different formats.
For example, to convert from a fastq file to fasta format:
%% Cell type:code id: tags:
``` python
records = SeqIO.parse("../data/easy.fastq", "fastq")
SeqIO.write(records, "tmp.fasta", "fasta")
```
%% Output
1
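%% Cell type:markdown id: tags:
`SeqIO.write` returns the number of records it wrote, which is the `1` shown above. As a rough illustration of what such a conversion involves, the sketch below converts FASTQ text to FASTA using only the standard library. It assumes the simple four-lines-per-record FASTQ layout and does not wrap long sequence lines; the function name `fastq_to_fasta` is hypothetical.
%% Cell type:code id: tags:
```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Copy records from a four-line-per-record FASTQ handle to FASTA.

    Returns the number of records written.
    """
    count = 0
    while True:
        header = fastq_handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        if not header.startswith("@"):
            raise ValueError("expected a FASTQ header starting with '@'")
        sequence = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, not needed for FASTA
        fastq_handle.readline()  # quality string, dropped in FASTA
        fasta_handle.write(">{0}\n{1}\n".format(header[1:], sequence))
        count += 1
    return count


source = io.StringIO("@read1\nACGT\n+\nIIII\n@read2 extra\nGGCA\n+\n!!!!\n")
target = io.StringIO()
written = fastq_to_fasta(source, target)
print(written)
print(target.getvalue())
```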
%% Cell type:markdown id: tags:
## Sequence alignment
It is possible to align sequences with Biopython using various methods.
Some of these depend on external tools (e.g. `clustalw`), but simple pairwise alignment is supported out of the box.
%% Cell type:code id: tags:
``` python
from Bio.pairwise2 import align, format_alignment
# load fasta with insulin for several species as a handle
ins_handle = SeqIO.parse("../data/ins.fa", "fasta")
```
%% Cell type:code id: tags:
``` python
# make a list of records
ins_records = []
for item in ins_handle:
    ins_records.append(item)
```
%% Cell type:code id: tags:
``` python
# extract a little of the sequence for human and chimp
human_ins_bit = ins_records[0][-45:]
chimp_ins_bit = ins_records[1][-45:]
```
%% Cell type:code id: tags:
``` python
# get local (Smith-Waterman style) alignments: 1 point per match,
# no mismatch or gap penalties
alignments = align.localxx(human_ins_bit.seq, chimp_ins_bit.seq)
# print the best alignment
best = alignments[0]
print(format_alignment(*best))
```
%% Output
CTCCTGC-A---C--C-G-AGAG--AGATGGAATAAAGCCCTTGAACCAGCAAAA
   |||  |   |  | | | |   ||||||||||||||||||||||||||
---CTG-GAGAACTACTGCA-A-CTAGATGGAATAAAGCCCTTGAACCAGC----
Score=35
%% Cell type:code id: tags:
``` python
# get alignments with specified scores and penalties:
# 2 for match, -4 for mismatch,
# -2 for gap open, -0.5 for gap extend
gap_alignments = align.localms(human_ins_bit.seq,
                               chimp_ins_bit.seq,
                               2, -4, -2, -0.5)
print(format_alignment(*gap_alignments[0]))
```
%% Output
CTCCTGCACCGAGA------G-----AGATGGAATAAAGCCCTTGAACCAGCAAAA
   |||    ||||      |     ||||||||||||||||||||||||||
---CTG----GAGAACTACTGCAACTAGATGGAATAAAGCCCTTGAACCAGC----
Score=56
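%% Cell type:markdown id: tags:
The scores above come from a dynamic-programming recurrence. The following sketch computes only the best local alignment score (not the alignment itself) in plain Python. It is a simplification: it uses a single linear gap penalty rather than the separate gap-open and gap-extend penalties of `localms`, and the function name `local_alignment_score` is illustrative. With its defaults (1 per match, no penalties) it uses the same scoring scheme as `localxx`.
%% Cell type:code id: tags:
```python
def local_alignment_score(seq_a, seq_b, match=1, mismatch=0, gap=0):
    """Return the best Smith-Waterman local alignment score (linear gap penalty)."""
    best = 0
    previous = [0] * (len(seq_b) + 1)
    for char_a in seq_a:
        current = [0]
        for j, char_b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            # a local alignment may restart anywhere, hence the floor at 0
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


print(local_alignment_score("ACGT", "AGT"))
print(local_alignment_score("TTACGTTT", "GGACGGG", match=2, mismatch=-1, gap=-1))
```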
%% Cell type:markdown id: tags:
# Motifs
We can also get a consensus sequence given multiple sequences, and visualize the result.
%% Cell type:code id: tags:
``` python
motif_handle = SeqIO.parse("../data/motif.fa", "fasta")
motif_records = []
for item in motif_handle:
    motif_records.append(item.seq[:9].upper())
```
%% Cell type:code id: tags:
``` python
from Bio import motifs
# create a motif object
motif = motifs.create(motif_records)
```
%% Cell type:code id: tags:
``` python
print(motif.counts)
```
%% Output
        0      1      2      3      4      5      6      7      8
A:  30.00   0.00  17.00  27.00   0.00  25.00  26.00   0.00  19.00
C:  52.00  50.00   0.00  57.00  50.00   0.00  58.00  50.00   0.00
G:   0.00  50.00  61.00   0.00  50.00  58.00   0.00  50.00  53.00
T:  18.00   0.00  22.00  16.00   0.00  17.00  16.00   0.00  28.00
%% Cell type:code id: tags:
``` python
print(motif.consensus)
```
%% Output
CGGCGGCGG
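%% Cell type:markdown id: tags:
The counts matrix and the consensus are simple to compute by hand. The sketch below builds a per-position count table from equal-length sequences and picks the most frequent letter in each column. The function names are hypothetical, and the tie-breaking rule (alphabetical) is an assumption of this sketch, not necessarily what Biopython does.
%% Cell type:code id: tags:
```python
from collections import Counter


def count_matrix(instances, alphabet="ACGT"):
    """Count how often each letter occurs at each position."""
    length = len(instances[0])
    if any(len(instance) != length for instance in instances):
        raise ValueError("all instances must have the same length")
    columns = [Counter(column) for column in zip(*instances)]
    return {letter: [column[letter] for column in columns] for letter in alphabet}


def consensus(counts):
    """Pick the most frequent letter per position (ties broken alphabetically)."""
    letters = sorted(counts)
    length = len(counts[letters[0]])
    return "".join(max(letters, key=lambda letter: counts[letter][i])
                   for i in range(length))


instances = ["CCA", "CGA", "CCT"]
counts = count_matrix(instances)
for letter in "ACGT":
    print(letter, counts[letter])
print(consensus(counts))
```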
%% Cell type:code id: tags:
``` python
motif.weblogo("tmp.svg", format="SVG")
```
%% Cell type:markdown id: tags:
![ll](tmp.svg)
%% Cell type:markdown id: tags:
# Remote files
NCBI allows remote querying of its Entrez databases, and Biopython lets us use these services from within Python.
We can use the `Entrez.efetch` utility to retrieve various records from one of NCBI's databases.
A full list of these services and their documentation can be found on the [Entrez utilities help page](https://www.ncbi.nlm.nih.gov/books/NBK25500/).
%% Cell type:code id: tags:
``` python
from Bio import Entrez
```
%% Cell type:markdown id: tags:
IMPORTANT:
To monitor potentially excessive use of its services, NCBI asks you to specify your email address with each request.
With Biopython, you can set it once for your session like this:
%% Cell type:code id: tags:
``` python
Entrez.email = 'python@lumc.nl'
```
%% Cell type:markdown id: tags:
Now we can query the database.
The `Entrez.efetch` function returns a file-like handle that, instead of pointing to a local file, points to a remote resource.
%% Cell type:code id: tags:
``` python
efetch_handle = Entrez.efetch(db="nucleotide", id="NM_005804",
                              rettype="gb", retmode="text")
```
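%% Cell type:markdown id: tags:
Behind the scenes, a call like this becomes an HTTP request to NCBI's E-utilities. The sketch below only builds such a URL with the standard library and does not contact NCBI. The helper `build_efetch_url` is hypothetical, and the exact set of parameters Biopython sends may differ (this is an assumption for illustration); several ids are joined with commas.
%% Cell type:code id: tags:
```python
from urllib.parse import urlencode

# Public E-utilities endpoint for EFetch.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(db, ids, email, rettype="gb", retmode="text"):
    """Build (but do not send) an EFetch request URL for one or more ids."""
    if isinstance(ids, str):
        ids = [ids]
    query = urlencode({
        "db": db,
        "id": ",".join(ids),
        "rettype": rettype,
        "retmode": retmode,
        "email": email,
    })
    return "{0}?{1}".format(EFETCH_URL, query)


url = build_efetch_url("nucleotide", ["NM_005804", "NM_000967"], "python@lumc.nl")
print(url)
```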
%% Cell type:markdown id: tags:
We can use the handle as if it were a normal file handle opened with ```open("filename", "r")```, and read from it using ```SeqIO.read()```.
%% Cell type:code id: tags:
``` python
ncbi_record = SeqIO.read(efetch_handle, 'genbank')
print(ncbi_record)
```
%% Output
ID: NM_005804.3
Name: NM_005804
Description: Homo sapiens DExD-box helicase 39A (DDX39A), transcript variant 1, mRNA
Number of features: 25
/molecule_type=mRNA
/topology=linear
/data_file_division=PRI
/date=20-OCT-2018
/accessions=['NM_005804']
/sequence_version=3
/keywords=['RefSeq']
/source=Homo sapiens (human)
/organism=Homo sapiens
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/references=[Reference(title='The RNA helicase DDX39B and its paralog DDX39A regulate androgen receptor splice variant AR-V7 generation', ...), Reference(title='Identification of DDX39A as a Potential Biomarker for Unfavorable Neuroblastoma Using a Proteomic Approach', ...), Reference(title='Up-regulation of DDX39 in human malignant pleural mesothelioma cell lines compared to normal pleural mesothelial cells', ...), Reference(title='The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts', ...), Reference(title='Clinical proteomics identified ATP-dependent RNA helicase DDX39 as a novel biomarker to predict poor prognosis of patients with gastrointestinal stromal tumor', ...), Reference(title='The closely related RNA helicases, UAP56 and URH49, preferentially form distinct mRNA export machineries and coordinately regulate mitotic progression', ...), Reference(title='Hcc-1 is a novel component of the nuclear matrix with growth inhibitory function', ...), Reference(title='Growth-regulated expression and G0-specific turnover of the mRNA that encodes URH49, a mammalian DExH/D box protein that is highly related to the mRNA export protein UAP56', ...), Reference(title='Analysis of a high-throughput yeast two-hybrid system and its use to predict the function of intracellular proteins encoded within the human MHC class III region', ...), Reference(title='The BAT1 gene in the MHC encodes an evolutionarily conserved putative nuclear RNA helicase of the DEAD family', ...)]
/comment=REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from DA432925.1, BC001009.2 and
BM792110.1.
This sequence is a reference standard in the RefSeqGene project.
On Oct 14, 2010 this sequence version replaced NM_005804.2.
Summary: This gene encodes a member of the DEAD box protein family.
These proteins are characterized by the conserved motif
Asp-Glu-Ala-Asp (DEAD) and are putative RNA helicases. They are
implicated in a number of cellular processes involving alteration
of RNA secondary structure, such as translation initiation, nuclear
and mitochondrial splicing, and ribosome and spliceosome assembly.
Based on their distribution patterns, some members of the DEAD box
protein family are believed to be involved in embryogenesis,
spermatogenesis, and cellular growth and division. This gene is
thought to play a role in the prognosis of patients with
gastrointestinal stromal tumors. A pseudogene of this gene is
present on chromosome 13. Alternate splicing results in multiple
transcript variants. Additional alternatively spliced transcript
variants of this gene have been described, but their full-length
nature is not known. [provided by RefSeq, Sep 2013].
Transcript Variant: This variant (1) represents the longer
transcript.
Publication Note: This RefSeq record includes a subset of the
publications that are available for this gene. Please see the Gene
record to access additional publications.
SRR1163655.274234.1 [ECO:0000332]
SAMEA1965299, SAMEA1966682
[ECO:0000350]
COMPLETENESS: complete on the 3' end.
/structured_comment=OrderedDict([('Evidence-Data', OrderedDict([('Transcript exon combination', 'SRR1163655.176131.1,'), ('RNAseq introns', 'mixed/partial sample support')]))])
Seq('AGCAGCAGCCCGACGCAAGAGGCAGGAAGCGCAGCAACTCGTGTCTGAGCGCCC...AAA', IUPACAmbiguousDNA())
%% Cell type:markdown id: tags:
It is also possible to query for multiple records at once:
%% Cell type:code id: tags:
``` python
efetch_handle = Entrez.efetch(db="nucleotide", id=["NM_005804","NM_000967"],
                              rettype="gb", retmode="text")
```
%% Cell type:markdown id: tags:
These can then be iterated over using ```SeqIO.parse```:
%% Cell type:code id: tags:
``` python
for record in SeqIO.parse(efetch_handle, 'genbank'):
    print(record.id, record.description)
```
%% Output
NM_005804.3 Homo sapiens DExD-box helicase 39A (DDX39A), transcript variant 1, mRNA
NM_000967.3 Homo sapiens ribosomal protein L3 (RPL3), transcript variant 1, mRNA
%% Cell type:markdown id: tags:
# Remote Tools
It is possible to use Biopython with remote tools.
For example, we can submit a BLAST search to the NCBI service. ([Documentation here](https://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html))
We will use the `qblast` function in the `Bio.Blast.NCBIWWW` module to perform a BLAST search using the record we retrieved earlier.
NOTE: It can take some time for the search results to become available.
%% Cell type:code id: tags:
``` python
from Bio.Blast.NCBIWWW import qblast
blast_handle = qblast('blastn', 'nt', ncbi_record.seq)
```
%% Cell type:code id: tags:
``` python
print(ncbi_record.seq)
```
%% Output
AGCAGCAGCCCGACGCAAGAGGCAGGAAGCGCAGCAACTCGTGTCTGAGCGCCCGGCGGAAAACCGAAGTTGGAAGTGTCTCTTAGCAGCGCGCGGAGAAGAACGGGGAGCCAGCATCATGGCAGAACAGGATGTGGAAAACGATCTTTTGGATTACGATGAAGAGGAAGAGCCCCAGGCTCCTCAAGAGAGCACACCAGCTCCCCCTAAGAAAGACATCAAGGGATCCTACGTTTCCATCCACAGCTCTGGCTTCCGGGACTTTCTGCTGAAGCCGGAGCTCCTGCGGGCCATCGTGGACTGTGGCTTTGAGCATCCTTCTGAGGTCCAGCATGAGTGCATTCCCCAGGCCATCCTGGGCATGGACGTCCTGTGCCAGGCCAAGTCCGGGATGGGCAAGACAGCGGTCTTCGTGCTGGCCACCCTACAGCAGATTGAGCCTGTCAACGGACAGGTGACGGTCCTGGTCATGTGCCACACGAGGGAGCTGGCCTTCCAGATCAGCAAGGAATATGAGCGCTTTTCCAAGTACATGCCCAGCGTCAAGGTGTCTGTGTTCTTCGGTGGTCTCTCCATCAAGAAGGATGAAGAAGTGTTGAAGAAGAACTGTCCCCATGTCGTGGTGGGGACCCCGGGCCGCATCCTGGCGCTCGTGCGGAATAGGAGCTTCAGCCTAAAGAATGTGAAGCACTTTGTGCTGGACGAGTGTGACAAGATGCTGGAGCAGCTGGACATGCGGCGGGATGTGCAGGAGATCTTCCGCCTGACACCACACGAGAAGCAGTGCATGATGTTCAGCGCCACCCTGAGCAAGGACATCCGGCCTGTGTGCAGGAAGTTCATGCAGGATCCCATGGAGGTGTTTGTGGACGACGAGACCAAGCTCACGCTGCACGGCCTGCAGCAGTACTACGTCAAACTCAAAGACAGTGAGAAGAACCGCAAGCTCTTTGATCTCTTGGATGTGCTGGAGTTTAACCAGGTGATAATCTTCGTCAAGTCAGTGCAGCGCTGCATGGCCCTGGCCCAGCTCCTCGTGGAGCAGAACTTCCCGGCCATCGCCATCCACCGGGGCATGGCCCAGGAGGAGCGCCTGTCACGCTATCAGCAGTTCAAGGATTTCCAGCGGCGGATCCTGGTGGCCACCAATCTGTTTGGCCGGGGGATGGACATCGAGCGAGTCAACATCGTCTTTAACTACGACATGCCTGAGGACTCGGACACCTACCTGCACCGGGTGGCCCGGGCGGGTCGCTTTGGCACCAAAGGCCTAGCCATCACTTTTGTGTCTGACGAGAATGATGCCAAAATCCTCAATGACGTCCAGGACCGGTTTGAAGTTAATGTGGCAGAACTTCCAGAGGAAATCGACATCTCCACATACATCGAGCAGAGCCGGTAACCACCACGTGCCAGAGCCGCCCACCCGGAGCCGCCCGCATGCAGCTTCACCTCCCCTTTCCAGGCGCCACTGTTGAGAAGCTAGAGATTGTATGAGAATAAACTTGTTATTATGGAAGCCTGGCTCCCACCCCATCTAAAAAAAAAAAAAAAAAAA
%% Cell type:markdown id: tags:
We can then read from the file handle using the ```Bio.SearchIO``` module.
%% Cell type:code id: tags:
``` python
from Bio import SearchIO
qresult = SearchIO.read(blast_handle, 'blast-xml')
print(qresult)
```
%% Output
Program: blastn (2.8.1+)
Query: No (1558)
definition line
Target: nt
Hits: ---- ----- ----------------------------------------------------------
# # HSP ID + description
---- ----- ----------------------------------------------------------
0 1 gi|308522777|ref|NM_005804.3| Homo sapiens DExD-box he...
1 1 gi|1367219251|ref|XM_016935303.2| PREDICTED: Pan trogl...
2 1 gi|675689963|ref|XM_003807080.2| PREDICTED: Pan panisc...
3 1 gi|1099186172|ref|XM_004060164.2| PREDICTED: Gorilla g...
4 1 gi|1351474314|ref|XM_002828787.4| PREDICTED: Pongo abe...
5 1 gi|1905997|gb|U90426.1|HSU90426 Human nuclear RNA heli...
6 1 gi|33875869|gb|BC001009.2| Homo sapiens DEAD (Asp-Glu-...
7 1 gi|10439504|dbj|AK026614.1| Homo sapiens cDNA: FLJ2296...
8 1 gi|795239725|ref|XM_011953207.1| PREDICTED: Colobus an...
9 1 gi|1411128774|ref|XM_025367317.1| PREDICTED: Theropith...
10 1 gi|1059109912|ref|XM_017851234.1| PREDICTED: Rhinopith...
11 1 gi|724815869|ref|XM_010361572.1| PREDICTED: Rhinopithe...
12 1 gi|1220191829|ref|XM_009193704.2| PREDICTED: Papio anu...
13 1 gi|982311930|ref|XM_005588244.2| PREDICTED: Macaca fas...
14 1 gi|1297694799|ref|XM_023229571.1| PREDICTED: Piliocolo...
15 1 gi|967496221|ref|XM_015123082.1| PREDICTED: Macaca mul...
16 1 gi|768000518|ref|XM_011527620.1| PREDICTED: Homo sapie...
17 1 gi|635036575|ref|XM_007995524.1| PREDICTED: Chlorocebu...
18 1 gi|795271240|ref|XM_011768647.1| PREDICTED: Macaca nem...
19 1 gi|795144436|ref|XM_011981779.1| PREDICTED: Mandrillus...
20 1 gi|194377853|dbj|AK301847.1| Homo sapiens cDNA FLJ5548...
21 1 gi|795433285|ref|XM_012094155.1| PREDICTED: Cercocebus...
22 1 gi|1367219254|ref|XM_016935304.2| PREDICTED: Pan trogl...
23 1 gi|795433280|ref|XM_012094154.1| PREDICTED: Cercocebus...
24 1 gi|1297694797|ref|XM_023229570.1| PREDICTED: Piliocolo...
25 1 gi|795239720|ref|XM_011953206.1| PREDICTED: Colobus an...
26 1 gi|1059109914|ref|XM_017851235.1| PREDICTED: Rhinopith...
27 1 gi|1220191830|ref|XM_021930788.1| PREDICTED: Papio anu...
28 1 gi|685606530|ref|XM_009193706.1| PREDICTED: Papio anub...
29 1 gi|1220191832|ref|XM_017952263.2| PREDICTED: Papio anu...
~~~
47 1 gi|1044402864|ref|XM_017497619.1| PREDICTED: Cebus cap...
48 1 gi|1044402866|ref|XM_017497620.1| PREDICTED: Cebus cap...
49 1 gi|1044402868|ref|XM_017497621.1| PREDICTED: Cebus cap...
%% Cell type:code id: tags:
``` python
print(qresult[0])
```
%% Output
Query: No
definition line
Hit: gi|308522777|ref|NM_005804.3| (1558)
Homo sapiens DExD-box helicase 39A (DDX39A), transcript variant 1, mRNA
HSPs: ---- -------- --------- ------ --------------- ---------------------
# E-value Bit score Span Query range Hit range
---- -------- --------- ------ --------------- ---------------------
0 0 2810.93 1558 [0:1558] [0:1558]
%% Cell type:code id: tags:
``` python
print(qresult[1])
```
%% Output
Query: No
definition line
Hit: gi|1367219251|ref|XM_016935303.2| (1530)
PREDICTED: Pan troglodytes DExD-box helicase 39A (DDX39A), transcript...
HSPs: ---- -------- --------- ------ --------------- ---------------------
# E-value Bit score Span Query range Hit range
---- -------- --------- ------ --------------- ---------------------
0 0 2695.52 1519 [0:1519] [11:1530]
%% Cell type:markdown id: tags:
# That was just an overview
This lesson was just a small taste of what can be done with Biopython.
I strongly recommend looking at the [Biopython 'cookbook'](http://biopython.org/DIST/docs/tutorial/Tutorial.html) to get an idea of the wide range of things that you can do with it.
%% Cell type:markdown id: tags:
The lesson was based on previous material by [Guy Allard](mailto://w.g.allard@lumc.nl), [Wibowo Arindrarto](mailto://w.arindrarto@lumc.nl) and Martijn Vermaat.
License: [Creative Commons Attribution 3.0 License (CC-by)](http://creativecommons.org/licenses/by/3.0)
......
>NM_000207.2 Homo sapiens insulin (INS), transcript variant 1, mRNA
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGATCACTGTCCTTCTGCCATGGCCCTGTG
GATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAAC
CAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACA
CACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGC
AGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACC
AGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGACGCAGCCCGCAGGCAGCCCCACACCCG
CCGCCTCCTGCACCGAGAGAGATGGAATAAAGCCCTTGAACCAGCAAAA
>NM_001008996.2 Pan troglodytes insulin (INS), mRNA
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGATCACTGTCCTTCTGCCATGGCCCTGTG
GATGCGCCTCCTGCCCCTGCTGGTGCTGCTGGCCCTCTGGGGACCTGACCCAGCCTCGGCCTTTGTGAAC
CAACACCTGTGCGGCTCCCACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACA
CACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGC
AGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGTATCGTGGAACAATGCTGTACC
AGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGATGGAATAAAGCCCTTGAACCAGC
>NM_001185083.2 Mus musculus insulin II (Ins2), transcript variant 1, mRNA
GGGGACCCAGTAACCACCAGCCCTAAGTGATCCGCTACAATCAAAAACCATCAGCAAGCAGGAAGGTTAT
TGTTTCAACATGGCCCTGTGGATGCGCTTCCTGCCCCTGCTGGCCCTGCTCTTCCTCTGGGAGTCCCACC
CCACCCAGGCTTTTGTCAAGCAGCACCTTTGTGGTTCCCACCTGGTGGAGGCTCTCTACCTGGTGTGTGG
GGAGCGTGGCTTCTTCTACACACCCATGTCCCGCCGTGAAGTGGAGGACCCACAAGTGGCACAACTGGAG
CTGGGTGGAGGCCCGGGAGCAGGTGACCTTCAGACCTTGGCACTGGAGGTGGCCCAGCAGAAGCGTGGCA
TTGTAGATCAGTGCTGCACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGACCCACCA
CTACCCAGCCTACCCCTCTGCAATGAATAAAACCTTTGAATGAGCACAAAAAA
>NM_019130.2 Rattus norvegicus insulin 2 (Ins2), mRNA
AGCCCTAAGTGACCAGCTACAGTCGGAAACCATCAGCAAGCAGGTCATTGTTCCAACATGGCCCTGTGGA
TCCGCTTCCTGCCCCTGCTGGCCCTGCTCATCCTCTGGGAGCCCCGCCCTGCCCAGGCTTTTGTCAAACA
GCACCTTTGTGGTTCTCACTTGGTGGAAGCTCTCTACCTGGTGTGTGGGGAGCGTGGATTCTTCTACACA
CCCATGTCCCGCCGCGAAGTGGAGGACCCACAAGTGGCACAACTGGAGCTGGGTGGAGGCCCGGGGGCAG
GTGACCTTCAGACCTTGGCACTGGAGGTGGCCCGGCAGAAGCGCGGCATCGTGGATCAGTGCTGCACCAG
CATCTGCTCTCTCTACCAACTGGAGAACTACTGCAACTAGGCCCACCACTACCCTGTCCACCCCTCTGCA
ATGAATAAAACCTTTGAAAGAGCACTACAAAAAAAAAAAAAAAA
>NM_205222.3 Gallus gallus insulin (INS-IGF2), mRNA
ATATAAATATGGGAAAGAGAATGGGGAAATTTCTACCAGTCTTCATCTCTGAGAGCAAACTTCTCTGCAT
CTCTTTCTCTCTTCTCTGGGCCTCCCCCAGCTCATCATGGCTCTCTGGATCCGATCACTGCCTCTTCTGG
CTCTCCTTGTCTTTTCTGGCCCTGGAACCAGCTATGCAGCTGCCAACCAGCACCTCTGTGGCTCCCACTT
GGTGGAGGCTCTCTACCTGGTGTGTGGAGAGCGTGGCTTCTTCTACTCCCCCAAAGCCCGACGGGATGTC
GAGCAGCCCCTAGTGAGCAGTCCCTTGCGTGGCGAGGCAGGAGTGCTGCCTTTCCAGCAGGAGGAATACG
AGAAAGTCAAGCGAGGGATTGTTGAGCAATGCTGCCATAACACGTGTTCCCTCTACCAACTGGAGAACTA
CTGCAACTAGCCAAGAAGCCGGAAGCGGGCACAGACATACACTTACTCTATCGCACCTTCAAAGCATTTG
AATAAACCTTGTTGGTCTACTGGAAGACTTGTGCC
>NM_001100236.1 Xenopus tropicalis insulin (ins), mRNA
CCTCCTTTTGATCTTTCCAGCACTTGTCCAGCTCCCACTATCCTCTATCATGGCTCTTTGGATGCAGTGT
CTGCCCCTGGTACTTGTGCTCCTTTTCTCTACACCCAACACCGAAGCTCTAGCTAACCAACACCTGTGTG
GGTCTCACCTGGTAGAAGCCCTGTATCTAGTATGTGGGGATCGAGGCTTCTTCTACTACCCCAAGATCAA
ACGGGACATCGAACAAGCAATGGTCAATGGACCCCAGGACAACGAGTTGGATGGAATGCAGCTCCAGCCT
CAGGAGTACCAGAAAATGAAGAGGGGAATTGTGGAGCAATGCTGCCACAGCACATGTTCTCTCTTCCAGC
TGGAGAGCTACTGCAACTAGGGGACCAGGCAAATGCTCTCTTACCAAGGCACCTTCAAGGCAAATCCATT
ATGCCAAAACAACAGGACAACGAGCATTGTCTAACGGCACCAAGAACTTCTAACAATGTATATTTATTCC
ATATAAATTAGACATCGGTATCCCAACTAATCTGTTCTTAGTAGAAGGAGTTATATAGAGTAATTCTATG
TGACAGGACAAGAAATATCTGTTATTTTTGCATTTTAATTTGCTCAGAAACCACCACTTTAATGCTACTT
TAACATGGCTGTCATCAGCAAAGTACTGTGCAAGTCGGAAAGACCTTGTTTTAGGAGAGACCGGGCAGGT
TACATTGATAAAGTTCAAAAAAGAAAGTATCTGGAAGAAAAAGAGCCACCCAAAATGTTATTCCGATCTT
GCTTTTAAGTGCCTTGACCTACTGTATTTACTGTCTCTCTGTCTCACTGCAAATAAATGTAAGCTGAAGA
GCTAAAAAAAAAAAAAAAAAAA
>NM_131056.1 Danio rerio preproinsulin (ins), mRNA
CCATATCCACCATTCCTCGCCTCTGCTTCGAGAACAGTGTGACCATGGCAGTGTGGCTTCAGGCTGGTGC
TCTGTTGGTCCTGTTGGTCGTGTCCAGTGTAAGCACTAACCCAGGCACACCGCAGCACCTGTGTGGATCT
CATCTGGTCGATGCCCTTTATCTGGTCTGTGGCCCAACAGGCTTCTTCTACAACCCCAAGAGAGACGTTG
AGCCCCTTCTGGGTTTCCTTCCTCCTAAATCTGCCCAGGAAACTGAGGTGGCTGACTTTGCATTTAAAGA
TCATGCCGAGCTGATAAGGAAGAGAGGCATTGTAGAGCAGTGCTGCCACAAACCCTGCAGCATCTTTGAG
CTGCAGAACTACTGTAACTGAAGAGATTTGCCCACCGCCAATGCCAGAAACACCTGTTTGCACACAGGCC
TTAATGCTCTCCGTTTGTTTTTACAGAAAAAATAAAACTATCAAATGA
\ No newline at end of file
>chr6:57316079-57316087
ccaccacca
>chr7:73761746-73761754
ccgccacca
>chr20:59571443-59571454
aggaggtggagg
>chrY:15809432-15809440
CCACCTCCT
>chr17:40068950-40068958
CCTCCACCT
>chr14:65103070-65103078
CCGCCACCT
>chr19:54239423-54239431
cctccgcct
>chr6:150434480-150434488
CCACCACCG
>chr7:33812110-33812121
tggaggtggagg
>chr5:153717375-153717383
ccaccacca
>chr6:7914979-7914987
ccgcctcct
>chr22:27012040-27012051
aggaggaggagg
>chr3:69390851-69390859
aggaggtgg
>chr9:117615118-117615126
aggaggcgg
>chr15:84899756-84899767
aggaggaggagg
>chr3:129325550-129325561
CGGAggcggcgg
>chr7:137343473-137343481
CCTCCTCCT
>chr9:35997152-35997163
cctccgcctcca
>chr6:35548397-35548405
cctccgcct
>chr9:111229018-111229029
TGGTGGTGGAGG
>chr12:66746168-66746176
aggaggtgg
>chr8:35193429-35193437
ccaccacca
>chr2:100306477-100306485
aggtggagg
>chr4:88917359-88917367
cctccacct
>chr3:72299621-72299629
tggaggtgg
>chr7:6140042-6140053
cctccgcctcct
>chr1:30429604-30429612
CCACCACCT
>chr9:84454516-84454524
ccaccacca
>chr4:24681618-24681626
tggtggcgg
>chr6:39016541-39016549
CCGCCACCA
>chr19:47957834-47957845
aggaggtggagg
>chr2:121438697-121438705
CCACCACCG
>chr4:6606145-6606153
AGGCGGTGG
>chr10:54496597-54496605
CCACCACCA
>chr1:121283711-121283719
tggtggcgg
>chr15:83623612-83623620
cctccgcct
>chr10:96161580-96161588
ccgccacca
>chrX:153674870-153674878
aggaggCGG
>chr7:116353954-116353962
CCACCTCCA
>chr3:44738060-44738068
aggtggagg
>chr14:88795532-88795540
CCACCTCCT
>chr1:24149591-24149602
aggaggaggagg
>chr1:26463722-26463730
aggtggagg
>chrX:70882170-70882178
CCTCCACCT
>chr2:138133373-138133381
TGGAGGAGG
>chr5:179540966-179540974
tggaggagg
>chr12:108703822-108703833
tggaggaggtgg
>chr15:59788732-59788740
aggcggagg
>chr12:48567647-48567655
aggcggagg
>chr11:118709199-118709210
cctccacctcct
>chr16:83742554-83742565
aggaggcggagg
>chr21:21636262-21636270
aggcggagg
>chr16:66928929-66928937
AGGCGGAGG
>chr14:56018822-56018830
ccgccacca
>chr16:78283154-78283162
ccgcctcca
>chr17:27620974-27620985
tggtggtggtgg
>chr20:40577705-40577713
ccacctcct
>chrX:88694903-88694911
TGGAGGTGG
>chr1:12630736-12630744
AGGTGGAGG
>chr12:47635287-47635295
ccgccacca
>chr14:104575359-104575370
TGGAGGTGGTGG
>chr14:82091013-82091021
CCTCCTCCT
>chr7:138971990-138971998
CCTCCTCCT
>chr11:65620965-65620973
CCACCACCA
>chr10:14803680-14803688
cctccacct
>chr4:55794226-55794234
cctccacct
>chr1:158182140-158182151
tggtggtggtgg
>chr1:24622336-24622344
AGGTGGAGG
>chr11:125240043-125240051
CCACCTCCA
>chr5:87790918-87790926
ccaccacca
>chr6:17539578-17539586
CCTCCTCCA
>chr1:112985394-112985402
cctccgcct
>chr3:50185889-50185897
tggaggagg
>chr1:9406001-9406009
TGGTGGTGG
>chr9:107660691-107660699
tggtggtgg
>chr10:88410633-88410641
aggaggagg
>chr17:7450571-7450579
cctccacct
>chr18:54316735-54316743
aggaggtgg
>chr7:94732764-94732772
ccgcctcct
>chrX:110238366-110238374
CCTCCTCCT
>chr5:150716893-150716901
CGGAGGAGG
>chr9:6097303-6097311
ccacctcca
>chrUn_gl000219:120741-120752
AGGTGGAGGTGG
>chr10:30397169-30397177
aggtggagg
>chr5:3099125-3099133
AGGAGGTGG
>chr10:106022363-106022371
aggaggcgg
>chr3:107519161-107519169
TGGAGGAGG
>chr5:145007968-145007976
cctccacct
>chr9:33624940-33624948
CCACCTCCA
>chr17:41898360-41898368
CCGCCTCCT
>chr16:15915078-15915086
CCTCCTCCA
>chr13:20332903-20332911
aggtggagg
>chr10:34392633-34392644
cctccacctcct
>chrX:545065-545086
aggaggaggctggaggaggagg
>chr1:23931841-23931849
aggcggagg
>chr2:106719902-106719910
TGGAGGAGG
>chr2:92261744-92261752
CCGCCGCCG
>chr8:10983660-10983668
cctccgcct
>chr16:68583444-68583452
tggtggcgg
>chr8:145706955-145706963
aggcggagg