......@@ -13,7 +13,7 @@
"![biopython logo](http://biopython.org/assets/images/biopython_logo_white.png)\n",
"\n",
"## A quick overview\n",
"### [Guy Allard](mailto://w.g.allard@lumc.nl)"
"### [Sander Bollen](mailto://a.h.b.bollen@lumc.nl)"
]
},
{
......@@ -83,23 +83,19 @@
}
},
"source": [
"# Manipulating Sequence Data\n",
"# Manipulating sequence data\n",
"\n",
"## Bio.SeqIO\n",
"## Seq and SeqRecord objects\n",
"\n",
"Input and output of sequence files.\n",
"`Seq` and `SeqRecord` objects are the basis of all sequence manipulation in Biopython. \n",
"\n",
"- SeqIO.read\n",
" - Read a file containing a single sequence\n",
"- SeqIO.parse \n",
" - Iterate over all sequences in a sequence file\n",
"- SeqIO.write\n",
" - write sequences to a file"
"* `Seq` is a raw sequence with an alphabet (e.g. DNA or RNA).\n",
"* `SeqRecord` is a sequence with metadata (e.g. names, ids, etc). This contains a `Seq` object. \n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "subslide"
......@@ -110,23 +106,19 @@
"name": "stdout",
"output_type": "stream",
"text": [
"ID: 1\n",
"Name: 1\n",
"Description: 1\n",
"Number of features: 0\n",
"Seq('TGGAACATGTCCCGCTAGCTTCTTCTTGCTAGCAGATTTTTTCAGTTGATCGTC...TCT', SingleLetterAlphabet())\n"
"ATGGCCCTGTGGATGCGCCTCCTGCCCCTG\n"
]
}
],
"source": [
"from Bio import SeqIO\n",
"from Bio.Seq import Seq\n",
"from Bio.Alphabet import generic_dna\n",
"\n",
"# read the first sequence\n",
"for record in SeqIO.parse(\"../data/records.fa\", \"fasta\"):\n",
" dna = record\n",
" break\n",
"# create a sequence, and store it in a variable\n",
"\n",
"print dna"
"my_sequence = Seq(\"ATGGCCCTGTGGATGCGCCTCCTGCCCCTG\", generic_dna)\n",
"print(my_sequence)\n",
"\n"
]
},
{
......@@ -137,25 +129,14 @@
}
},
"source": [
"Each record is an object with several fields, including:\n",
"\n",
"- record.id\n",
" - the sequence id\n",
"- record.name\n",
" - sequence name, usually the same as the id\n",
"- record.description\n",
" - sequence description\n",
"\n",
"The actual sequence is a separate object contained within the record which can be accessed using record.seq\n",
"\n",
"The sequence has an 'alphabet' associated with it which defines which letters are allowed.\n",
"There are lots of built in methods that can be used to manipulate the sequence\n",
"\n",
"Different alphabets are used for DNA, RNA, protein etc."
"The sequence acts like a string in many ways\n"
]
},
{
"cell_type": "code",
"execution_count": 39,
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "subslide"
......@@ -166,30 +147,40 @@
"name": "stdout",
"output_type": "stream",
"text": [
"TGGAACATGTCCCGCTAGCTTCTTCTTGCTAGCAGATTTTTTCAGTTGATCGTCACATGCGGTAGACTACCCAAGGTGTGACTACTCGCATGCCTGATCT\n"
"length: 30\n"
]
}
],
"source": [
"print dna.seq"
"# get the length of a sequence\n",
"print(\"length: {0}\".format(len(my_sequence)))"
]
},
{
"cell_type": "markdown",
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ATGGCCCTGT\n"
]
}
],
"source": [
"There are lots of built in methods that can be used to manipulate the sequence\n",
"\n",
"The sequence acts like a string in many ways"
"# slice and dice\n",
"print(my_sequence[:10])"
]
},
{
"cell_type": "code",
"execution_count": 40,
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "subslide"
......@@ -200,18 +191,18 @@
"name": "stdout",
"output_type": "stream",
"text": [
"length: 100\n"
"atggccctgtggatgcgcctcctgcccctg\n"
]
}
],
"source": [
"# get the length of the sequence\n",
"print \"length: \", len(dna.seq)"
"# change the case\n",
"print(my_sequence.lower())"
]
},
{
"cell_type": "code",
"execution_count": 41,
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "subslide"
......@@ -222,18 +213,29 @@
"name": "stdout",
"output_type": "stream",
"text": [
"TGGAACATGT\n"
"ATGGCCCTGTCCTGCCCCTG\n"
]
}
],
"source": [
"# slice and dice\n",
"print dna.seq[:10]"
"# concatenate the first and last 10 nucleotides\n",
"print(my_sequence[:10] + my_sequence[-10:])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"It also has more sequence-specific methods"
]
},
{
"cell_type": "code",
"execution_count": 42,
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "subslide"
......@@ -244,18 +246,18 @@
"name": "stdout",
"output_type": "stream",
"text": [
"tggaacatgtcccgctagcttcttcttgctagcagattttttcagttgatcgtcacatgcggtagactacccaaggtgtgactactcgcatgcctgatct\n"
"TACCGGGACACCTACGCGGAGGACGGGGAC\n"
]
}
],
"source": [
"# change the case\n",
"print dna.seq.lower()"
"# complement\n",
"print(my_sequence.complement())"
]
},
{
"cell_type": "code",
"execution_count": 43,
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "subslide"
......@@ -266,51 +268,82 @@
"name": "stdout",
"output_type": "stream",
"text": [
"TGGAACATGTTGCCTGATCT\n"
"CAGGGGCAGGAGGCGCATCCACAGGGCCAT\n"
]
}
],
"source": [
"# concatenate the first and last 10 nucleotides\n",
"print dna.seq[:10] + dna.seq[-10:]"
"# reverse complement\n",
"print(my_sequence.reverse_complement())"
]
},
{
"cell_type": "markdown",
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AUGGCCCUGUGGAUGCGCCUCCUGCCCCUG\n"
]
}
],
"source": [
"But also has more sequence-specific methods"
"# transcribe from DNA to RNA\n",
"rna = my_sequence.transcribe()\n",
"print(rna)"
]
},
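For readers who want to see what these methods compute, here is a plain-Python sketch. It is not Biopython's implementation: the helper names `complement`, `reverse_complement` and `transcribe` are hypothetical stand-ins, and they assume upper-case, unambiguous DNA (A, C, G, T) only.

```python
# Plain-Python sketch of the operations shown above (assumption: upper-case
# A/C/G/T input only; Biopython's Seq methods also handle ambiguity codes).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Swap each base for its Watson-Crick partner."""
    return dna.translate(_COMPLEMENT)


def reverse_complement(dna):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(dna)[::-1]


def transcribe(dna):
    """Coding-strand DNA to RNA: replace T with U."""
    return dna.replace("T", "U")


my_sequence = "ATGGCCCTGTGGATGCGCCTCCTGCCCCTG"
print(complement(my_sequence))
print(reverse_complement(my_sequence))
print(transcribe(my_sequence))
```

The three printed lines should agree with the outputs of the `complement()`, `reverse_complement()` and `transcribe()` cells in the notebook.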
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ACCTTGTACAGGGCGATCGAAGAAGAACGATCGTCTAAAAAAGTCAACTAGCAGTGTACGCCATCTGATGGGTTCCACACTGATGAGCGTACGGACTAGA\n"
"MALWMRLLPL\n"
]
}
],
"source": [
"# complement\n",
"print dna.seq.complement()"
"# Translate from nucleotide to protein\n",
"protein = my_sequence.translate()\n",
"print(protein)"
]
},
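Translation can be sketched the same way. The block below builds the standard genetic code (NCBI translation table 1) from its usual compact string form and reads the sequence codon by codon. The helper `translate` is a hypothetical stand-in for `Seq.translate()`; it assumes upper-case, unambiguous DNA and silently ignores a trailing incomplete codon.

```python
# Sketch of translation with the standard genetic code (NCBI table 1).
# Assumption: upper-case A/C/G/T input; `translate` is an illustrative
# helper, not Biopython's Seq.translate().
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every codon to its amino acid ("*" marks a stop codon).
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate DNA codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    codons = (dna[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


print(translate("ATGGCCCTGTGGATGCGCCTCCTGCCCCTG"))
```

For the 30-nucleotide example sequence this should give the same ten-residue peptide as the notebook cell above.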
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Manipulating Sequence Data\n",
"\n",
"## Bio.SeqIO\n",
"\n",
"Input and output of sequence files.\n",
"\n",
"- `SeqIO.read`\n",
" - Read a file containing a single sequence\n",
"- `SeqIO.parse`\n",
" - Iterate over all sequences in a sequence file\n",
"- `SeqIO.write`\n",
" - write sequences to a file"
]
},
{
"cell_type": "code",
"execution_count": 45,
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "subslide"
......@@ -321,18 +354,53 @@
"name": "stdout",
"output_type": "stream",
"text": [
"AGATCAGGCATGCGAGTAGTCACACCTTGGGTAGTCTACCGCATGTGACGATCAACTGAAAAAATCTGCTAGCAAGAAGAAGCTAGCGGGACATGTTCCA\n"
"ID: 1\n",
"Name: 1\n",
"Description: 1\n",
"Number of features: 0\n",
"Seq('TGGAACATGTCCCGCTAGCTTCTTCTTGCTAGCAGATTTTTTCAGTTGATCGTC...TCT', SingleLetterAlphabet())\n"
]
}
],
"source": [
"# reverse complement\n",
"print dna.seq.reverse_complement()"
"from Bio import SeqIO\n",
"\n",
"# read the first sequence\n",
"# returns SeqRecord objects\n",
"for record in SeqIO.parse(\"../data/records.fa\", \"fasta\"):\n",
" dna = record\n",
" break\n",
"\n",
"print(dna)"
]
},
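`SeqIO.parse` returns an iterator, so the `for ... break` idiom above could also be written with `next()`. To make the iteration concrete without depending on the data file, here is a minimal plain-Python sketch of a FASTA reader. The function `parse_fasta` is hypothetical and far simpler than `SeqIO.parse`: it yields plain `(header, sequence)` tuples rather than `SeqRecord` objects, and the two-record input is made up.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up two-record example; with a real file, pass open(path) instead.
example = io.StringIO(">1\nTGGAACATGT\nCCCGCTAGCT\n>2\nACGT\n")
first = next(parse_fasta(example))  # stop after the first record
print(first)
```

Because the reader is a generator, only the first record is read before `next()` returns, which is the same lazy behaviour the notebook relies on when it breaks out of the loop.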
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Each record is an object with several fields, including:\n",
"\n",
"- `record.id`\n",
" - the sequence id\n",
"- `record.name`\n",
" - sequence name, usually the same as the id\n",
"- `record.description`\n",
" - sequence description\n",
"\n",
"The actual sequence is a separate object contained within the record, which can be accessed using `record.seq`.\n",
"\n",
"The sequence has an 'alphabet' associated with it which defines which letters are allowed.\n",
"\n",
"Different alphabets are used for DNA, RNA, protein etc."
]
},
{
"cell_type": "code",
"execution_count": 53,
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "subslide"
......@@ -343,19 +411,17 @@
"name": "stdout",
"output_type": "stream",
"text": [
"UGGAACAUGUCCCGCUAGCUUCUUCUUGCUAGCAGAUUUUUUCAGUUGAUCGUCACAUGCGGUAGACUACCCAAGGUGUGACUACUCGCAUGCCUGAUCU\n"
"TGGAACATGTCCCGCTAGCTTCTTCTTGCTAGCAGATTTTTTCAGTTGATCGTCACATGCGGTAGACTACCCAAGGTGTGACTACTCGCATGCCTGATCT\n"
]
}
],
"source": [
"# transcribe from DNA to RNA\n",
"rna = dna.seq.transcribe()\n",
"print rna"
"print(dna.seq)"
]
},
{
"cell_type": "code",
"execution_count": 63,
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "subslide"
......@@ -366,14 +432,14 @@
"name": "stdout",
"output_type": "stream",
"text": [
"WNMSR*LLLASRFFQLIVTCGRLPKV*LLACLI\n"
"AGATCAGGCATGCGAGTAGTCACACCTTGGGTAGTCTACCGCATGTGACGATCAACTGAAAAAATCTGCTAGCAAGAAGAAGCTAGCGGGACATGTTCCA\n"
]
}
],
"source": [
"# Translate from nucleotide to protein\n",
"protein = dna.seq.translate()\n",
"print protein"
"# we can then do our sequence manipulations on the `.seq` attribute of the record\n",
"\n",
"print(dna.seq.reverse_complement())"
]
},
{
......@@ -393,7 +459,7 @@
},
{
"cell_type": "code",
"execution_count": 66,
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "fragment"
......@@ -406,7 +472,7 @@
"1"
]
},
"execution_count": 66,
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
......@@ -416,6 +482,247 @@
"SeqIO.write(records, \"tmp.fasta\", \"fasta\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Sequence alignment\n",
"\n",
"It is possible to align sequences using Biopython with various methods.\n",
"Some of these depend on external tools (e.g. `clustalw`), but simple pairwise alignment is supported out of the box.\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"from Bio.pairwise2 import align, format_alignment\n",
"\n",
"# load fasta with insulin for several species as a handle\n",
"ins_handle = SeqIO.parse(\"../data/ins.fa\", \"fasta\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# make a list of records \n",
"ins_records = []\n",
"for item in ins_handle:\n",
" ins_records.append(item)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# extract a little of the sequence for human and chimp\n",
"\n",
"human_ins_bit = ins_records[0][-45:]\n",
"chimp_ins_bit = ins_records[1][-45:]"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CTCCTGC-A---C--C-G-AGAG--AGATGGAATAAAGCCCTTGAACCAGCAAAA\n",
" ||| | | | | | | ||||||||||||||||||||||||||\n",
"---CTG-GAGAACTACTGCA-A-CTAGATGGAATAAAGCCCTTGAACCAGC----\n",
" Score=35\n",
"\n"
]
}
],
"source": [
"# get local (Smith-Waterman) alignments: 1 point per match, no mismatch or gap penalties\n",
"alignments = align.localxx(human_ins_bit.seq, chimp_ins_bit.seq)\n",
"\n",
"\n",
"# print the best alignment \n",
"best = alignments[0]\n",
"print(format_alignment(*best))"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CTCCTGCACCGAGA------G-----AGATGGAATAAAGCCCTTGAACCAGCAAAA\n",
" ||| |||| | ||||||||||||||||||||||||||\n",
"---CTG----GAGAACTACTGCAACTAGATGGAATAAAGCCCTTGAACCAGC----\n",
" Score=56\n",
"\n"
]
}
],
"source": [
"# get alignments with specified scores and penalties:\n",
"# 2 for match, -4 for mismatch,\n",
"# -2 for gap open, -0.5 for gap extend\n",
"gap_alignments = align.localms(human_ins_bit.seq, \n",
" chimp_ins_bit.seq, \n",
" 2, -4, -2, -0.5)\n",
"\n",
"print(format_alignment(*gap_alignments[0]))"
]
},
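To show how the match, mismatch and gap parameters interact, here is a small Smith-Waterman scoring sketch in plain Python. It is a simplification and not how `pairwise2` is implemented: it uses a single linear gap penalty instead of the separate gap-open and gap-extend values passed to `localms`, and it returns only the best score, not the alignment. The function name `local_alignment_score` and the short test strings are made up for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-4, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                0,                       # start a new local alignment here
                previous[j - 1] + pair,  # align a[i-1] with b[j-1]
                previous[j] + gap,       # gap in b
                current[j - 1] + gap,    # gap in a
            )
            current.append(score)
            best = max(best, score)
        previous = current
    return best


print(local_alignment_score("ACGTTT", "GGACGT"))
```

Because of the linear gap model, scores from this sketch are not directly comparable with the `Score=` values printed by `format_alignment` above.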
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Motifs\n",
"\n",
"We can also get a consensus sequence given multiple sequences, and visualize the result. \n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"motif_handle = SeqIO.parse(\"../data/motif.fa\", \"fasta\")\n",
"motif_records = []\n",
"for item in motif_handle:\n",
" motif_records.append(item.seq[:9].upper())"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"from Bio import motifs\n",
"\n",
"# create a motif object\n",
"motif = motifs.create(motif_records)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0 1 2 3 4 5 6 7 8\n",
"A: 30.00 0.00 17.00 27.00 0.00 25.00 26.00 0.00 19.00\n",
"C: 52.00 50.00 0.00 57.00 50.00 0.00 58.00 50.00 0.00\n",
"G: 0.00 50.00 61.00 0.00 50.00 58.00 0.00 50.00 53.00\n",
"T: 18.00 0.00 22.00 16.00 0.00 17.00 16.00 0.00 28.00\n",
"\n"
]
}
],
"source": [
"print(motif.counts)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CGGCGGCGG\n"
]
}
],
"source": [
"print(motif.consensus)"
]
},
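The count matrix and consensus can be reproduced by hand on a small input. The sketch below uses five nine-letter entries taken from the `motif.fa` data in this diff, so its numbers are much smaller than the notebook output, which uses the whole file. The helpers `count_matrix` and `consensus` are hypothetical; in particular, breaking ties in favour of the first letter of the alphabet is an assumption of this sketch, not a statement about `Bio.motifs`.

```python
from collections import Counter


def count_matrix(instances, alphabet="ACGT"):
    """Per-position letter counts for equal-length sequences."""
    length = len(instances[0])
    if any(len(seq) != length for seq in instances):
        raise ValueError("all instances must have the same length")
    columns = [Counter(column) for column in zip(*instances)]
    return {letter: [col[letter] for col in columns] for letter in alphabet}


def consensus(counts):
    """Most frequent letter per position (ties: first letter in the dict)."""
    length = len(next(iter(counts.values())))
    return "".join(
        max(counts, key=lambda letter: counts[letter][i]) for i in range(length)
    )


# Five entries from motif.fa, upper-cased as in the notebook.
instances = [s.upper() for s in
             ["ccaccacca", "ccgccacca", "CCACCTCCT", "ccaccacca", "CCACCACCG"]]
counts = count_matrix(instances)
print(counts["A"])
print(consensus(counts))
```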
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"motif.weblogo(\"tmp.svg\", format=\"SVG\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"![Sequence logo of the motif](tmp.svg)"
]
},
{
"cell_type": "markdown",
"metadata": {
......@@ -435,9 +742,8 @@
},
{
"cell_type": "code",
"execution_count": 69,
"execution_count": 24,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "fragment"
}
......@@ -464,9 +770,8 @@
},
{
"cell_type": "code",
"execution_count": 70,
"execution_count": 25,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "fragment"
}
......@@ -491,9 +796,8 @@
},
{
"cell_type": "code",
"execution_count": 77,
"execution_count": 26,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "subslide"
}
......@@ -517,7 +821,7 @@
},
{
"cell_type": "code",
"execution_count": 78,
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "subslide"
......@@ -532,11 +836,22 @@
"Name: NM_005804\n",
"Description: Homo sapiens DExD-box helicase 39A (DDX39A), transcript variant 1, mRNA\n",
"Number of features: 25\n",
"/molecule_type=mRNA\n",
"/topology=linear\n",
"/data_file_division=PRI\n",
"/date=20-OCT-2018\n",
"/accessions=['NM_005804']\n",
"/sequence_version=3\n",
"/keywords=['RefSeq']\n",
"/source=Homo sapiens (human)\n",
"/organism=Homo sapiens\n",
"/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']\n",
"/references=[Reference(title='The RNA helicase DDX39B and its paralog DDX39A regulate androgen receptor splice variant AR-V7 generation', ...), Reference(title='Identification of DDX39A as a Potential Biomarker for Unfavorable Neuroblastoma Using a Proteomic Approach', ...), Reference(title='Up-regulation of DDX39 in human malignant pleural mesothelioma cell lines compared to normal pleural mesothelial cells', ...), Reference(title='The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts', ...), Reference(title='Clinical proteomics identified ATP-dependent RNA helicase DDX39 as a novel biomarker to predict poor prognosis of patients with gastrointestinal stromal tumor', ...), Reference(title='The closely related RNA helicases, UAP56 and URH49, preferentially form distinct mRNA export machineries and coordinately regulate mitotic progression', ...), Reference(title='Hcc-1 is a novel component of the nuclear matrix with growth inhibitory function', ...), Reference(title='Growth-regulated expression and G0-specific turnover of the mRNA that encodes URH49, a mammalian DExH/D box protein that is highly related to the mRNA export protein UAP56', ...), Reference(title='Analysis of a high-throughput yeast two-hybrid system and its use to predict the function of intracellular proteins encoded within the human MHC class III region', ...), Reference(title='The BAT1 gene in the MHC encodes an evolutionarily conserved putative nuclear RNA helicase of the DEAD family', ...)]\n",
"/comment=REVIEWED REFSEQ: This record has been curated by NCBI staff. The\n",
"reference sequence was derived from DA432925.1, BC001009.2 and\n",
"BM792110.1.\n",
"This sequence is a reference standard in the RefSeqGene project.\n",
"On Oct 14, 2010 this sequence version replaced gi:21040370.\n",
"On Oct 14, 2010 this sequence version replaced NM_005804.2.\n",
"Summary: This gene encodes a member of the DEAD box protein family.\n",
"These proteins are characterized by the conserved motif\n",
"Asp-Glu-Ala-Asp (DEAD) and are putative RNA helicases. They are\n",
......@@ -561,18 +876,7 @@
" SAMEA1965299, SAMEA1966682\n",
" [ECO:0000350]\n",
"COMPLETENESS: complete on the 3' end.\n",
"/source=Homo sapiens (human)\n",
"/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']\n",
"/structured_comment=OrderedDict([('Evidence-Data', OrderedDict([('Transcript exon combination', 'SRR1163655.176131.1,'), ('RNAseq introns', 'mixed/partial sample support')]))])\n",
"/keywords=['RefSeq']\n",
"/references=[Reference(title='The RNA helicase DDX39B and its paralog DDX39A regulate androgen receptor splice variant AR-V7 generation', ...), Reference(title='Identification of DDX39A as a Potential Biomarker for Unfavorable Neuroblastoma Using a Proteomic Approach', ...), Reference(title='Up-regulation of DDX39 in human malignant pleural mesothelioma cell lines compared to normal pleural mesothelial cells', ...), Reference(title='The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts', ...), Reference(title='Clinical proteomics identified ATP-dependent RNA helicase DDX39 as a novel biomarker to predict poor prognosis of patients with gastrointestinal stromal tumor', ...), Reference(title='The closely related RNA helicases, UAP56 and URH49, preferentially form distinct mRNA export machineries and coordinately regulate mitotic progression', ...), Reference(title='Hcc-1 is a novel component of the nuclear matrix with growth inhibitory function', ...), Reference(title='Growth-regulated expression and G0-specific turnover of the mRNA that encodes URH49, a mammalian DExH/D box protein that is highly related to the mRNA export protein UAP56', ...), Reference(title='Analysis of a high-throughput yeast two-hybrid system and its use to predict the function of intracellular proteins encoded within the human MHC class III region', ...), Reference(title='The BAT1 gene in the MHC encodes an evolutionarily conserved putative nuclear RNA helicase of the DEAD family', ...)]\n",
"/accessions=['NM_005804']\n",
"/molecule_type=mRNA\n",
"/data_file_division=PRI\n",
"/date=11-JUN-2017\n",
"/organism=Homo sapiens\n",
"/sequence_version=3\n",
"/topology=linear\n",
"Seq('AGCAGCAGCCCGACGCAAGAGGCAGGAAGCGCAGCAACTCGTGTCTGAGCGCCC...AAA', IUPACAmbiguousDNA())\n"
]
}
......@@ -580,7 +884,7 @@
"source": [
"ncbi_record = SeqIO.read(efetch_handle, 'genbank')\n",
"\n",
"print ncbi_record"
"print(ncbi_record)"
]
},
{
......@@ -596,9 +900,8 @@
},
{
"cell_type": "code",
"execution_count": 79,
"execution_count": 28,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "subslide"
}
......@@ -622,7 +925,7 @@
},
{
"cell_type": "code",
"execution_count": 81,
"execution_count": 29,
"metadata": {
"slideshow": {
"slide_type": "subslide"
......@@ -640,7 +943,7 @@
],
"source": [
"for record in SeqIO.parse(efetch_handle, 'genbank'):\n",
" print record.id, record.description"
" print(record.id, record.description)"
]
},
{
......@@ -664,17 +967,26 @@
},
{
"cell_type": "code",
"execution_count": 89,
"execution_count": 30,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AGCAGCAGCCCGACGCAAGAGGCAGGAAGCGCAGCAACTCGTGTCTGAGCGCCCGGCGGAAAACCGAAGTTGGAAGTGTCTCTTAGCAGCGCGCGGAGAAGAACGGGGAGCCAGCATCATGGCAGAACAGGATGTGGAAAACGATCTTTTGGATTACGATGAAGAGGAAGAGCCCCAGGCTCCTCAAGAGAGCACACCAGCTCCCCCTAAGAAAGACATCAAGGGATCCTACGTTTCCATCCACAGCTCTGGCTTCCGGGACTTTCTGCTGAAGCCGGAGCTCCTGCGGGCCATCGTGGACTGTGGCTTTGAGCATCCTTCTGAGGTCCAGCATGAGTGCATTCCCCAGGCCATCCTGGGCATGGACGTCCTGTGCCAGGCCAAGTCCGGGATGGGCAAGACAGCGGTCTTCGTGCTGGCCACCCTACAGCAGATTGAGCCTGTCAACGGACAGGTGACGGTCCTGGTCATGTGCCACACGAGGGAGCTGGCCTTCCAGATCAGCAAGGAATATGAGCGCTTTTCCAAGTACATGCCCAGCGTCAAGGTGTCTGTGTTCTTCGGTGGTCTCTCCATCAAGAAGGATGAAGAAGTGTTGAAGAAGAACTGTCCCCATGTCGTGGTGGGGACCCCGGGCCGCATCCTGGCGCTCGTGCGGAATAGGAGCTTCAGCCTAAAGAATGTGAAGCACTTTGTGCTGGACGAGTGTGACAAGATGCTGGAGCAGCTGGACATGCGGCGGGATGTGCAGGAGATCTTCCGCCTGACACCACACGAGAAGCAGTGCATGATGTTCAGCGCCACCCTGAGCAAGGACATCCGGCCTGTGTGCAGGAAGTTCATGCAGGATCCCATGGAGGTGTTTGTGGACGACGAGACCAAGCTCACGCTGCACGGCCTGCAGCAGTACTACGTCAAACTCAAAGACAGTGAGAAGAACCGCAAGCTCTTTGATCTCTTGGATGTGCTGGAGTTTAACCAGGTGATAATCTTCGTCAAGTCAGTGCAGCGCTGCATGGCCCTGGCCCAGCTCCTCGTGGAGCAGAACTTCCCGGCCATCGCCATCCACCGGGGCATGGCCCAGGAGGAGCGCCTGTCACGCTATCAGCAGTTCAAGGATTTCCAGCGGCGGATCCTGGTGGCCACCAATCTGTTTGGCCGGGGGATGGACATCGAGCGAGTCAACATCGTCTTTAACTACGACATGCCTGAGGACTCGGACACCTACCTGCACCGGGTGGCCCGGGCGGGTCGCTTTGGCACCAAAGGCCTAGCCATCACTTTTGTGTCTGACGAGAATGATGCCAAAATCCTCAATGACGTCCAGGACCGGTTTGAAGTTAATGTGGCAGAACTTCCAGAGGAAATCGACATCTCCACATACATCGAGCAGAGCCGGTAACCACCACGTGCCAGAGCCGCCCACCCGGAGCCGCCCGCATGCAGCTTCACCTCCCCTTTCCAGGCGCCACTGTTGAGAAGCTAGAGATTGTATGAGAATAAACTTGTTATTATGGAAGCCTGGCTCCCACCCCATCTAAAAAAAAAAAAAAAAAAA\n"
]
}
],
"source": [
"from Bio.Blast.NCBIWWW import qblast\n",
"blast_handle = qblast('blastn', 'refseq_mrna', ncbi_record.seq)"
"blast_handle = qblast('blastn', 'nt', ncbi_record.seq)\n",
"\n",
"print(ncbi_record.seq)"
]
},
{
......@@ -690,7 +1002,7 @@
},
{
"cell_type": "code",
"execution_count": 90,
"execution_count": 31,
"metadata": {
"slideshow": {
"slide_type": "subslide"
......@@ -701,59 +1013,59 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Program: blastn (2.7.0+)\n",
"Program: blastn (2.8.1+)\n",
" Query: No (1558)\n",
" definition line\n",
" Target: refseq_mrna\n",
" Target: nt\n",
" Hits: ---- ----- ----------------------------------------------------------\n",
" # # HSP ID + description\n",
" ---- ----- ----------------------------------------------------------\n",
" 0 1 gi|308522777|ref|NM_005804.3| Homo sapiens DExD-box he...\n",
" 1 1 gi|1034056594|ref|XM_016946961.1| PREDICTED: Pan trogl...\n",
" 2 1 gi|1034130164|ref|XM_016935303.1| PREDICTED: Pan trogl...\n",
" 3 1 gi|675689963|ref|XM_003807080.2| PREDICTED: Pan panisc...\n",
" 4 1 gi|1099186172|ref|XM_004060164.2| PREDICTED: Gorilla g...\n",
" 5 1 gi|686757516|ref|XM_002828787.3| PREDICTED: Pongo abel...\n",
" 6 1 gi|795239725|ref|XM_011953207.1| PREDICTED: Colobus an...\n",
" 7 1 gi|1059109912|ref|XM_017851234.1| PREDICTED: Rhinopith...\n",
" 8 1 gi|724815869|ref|XM_010361572.1| PREDICTED: Rhinopithe...\n",
" 9 1 gi|1220191829|ref|XM_009193704.2| PREDICTED: Papio anu...\n",
" 10 1 gi|982311930|ref|XM_005588244.2| PREDICTED: Macaca fas...\n",
" 11 1 gi|967496221|ref|XM_015123082.1| PREDICTED: Macaca mul...\n",
" 12 1 gi|768000518|ref|XM_011527620.1| PREDICTED: Homo sapie...\n",
" 13 1 gi|635036575|ref|XM_007995524.1| PREDICTED: Chlorocebu...\n",
" 14 1 gi|795271240|ref|XM_011768647.1| PREDICTED: Macaca nem...\n",
" 15 1 gi|795144436|ref|XM_011981779.1| PREDICTED: Mandrillus...\n",
" 16 1 gi|1034130166|ref|XM_016935304.1| PREDICTED: Pan trogl...\n",
" 17 1 gi|1034056596|ref|XM_016946962.1| PREDICTED: Pan trogl...\n",
" 18 1 gi|795433285|ref|XM_012094155.1| PREDICTED: Cercocebus...\n",
" 19 1 gi|795433280|ref|XM_012094154.1| PREDICTED: Cercocebus...\n",
" 20 1 gi|795239720|ref|XM_011953206.1| PREDICTED: Colobus an...\n",
" 21 1 gi|1059109914|ref|XM_017851235.1| PREDICTED: Rhinopith...\n",
" 22 1 gi|1220191830|ref|XM_021930788.1| PREDICTED: Papio anu...\n",
" 23 1 gi|685606530|ref|XM_009193706.1| PREDICTED: Papio anub...\n",
" 24 1 gi|1220191832|ref|XM_017952263.2| PREDICTED: Papio anu...\n",
" 25 1 gi|982311931|ref|XM_005588245.2| PREDICTED: Macaca fas...\n",
" 26 1 gi|967496225|ref|XM_015123084.1| PREDICTED: Macaca mul...\n",
" 27 1 gi|967496223|ref|XM_015123083.1| PREDICTED: Macaca mul...\n",
" 28 1 gi|967496227|ref|XM_015123085.1| PREDICTED: Macaca mul...\n",
" 29 1 gi|795271249|ref|XM_011768705.1| PREDICTED: Macaca nem...\n",
" 1 1 gi|1367219251|ref|XM_016935303.2| PREDICTED: Pan trogl...\n",
" 2 1 gi|675689963|ref|XM_003807080.2| PREDICTED: Pan panisc...\n",
" 3 1 gi|1099186172|ref|XM_004060164.2| PREDICTED: Gorilla g...\n",
" 4 1 gi|1351474314|ref|XM_002828787.4| PREDICTED: Pongo abe...\n",
" 5 1 gi|1905997|gb|U90426.1|HSU90426 Human nuclear RNA heli...\n",
" 6 1 gi|33875869|gb|BC001009.2| Homo sapiens DEAD (Asp-Glu-...\n",
" 7 1 gi|10439504|dbj|AK026614.1| Homo sapiens cDNA: FLJ2296...\n",
" 8 1 gi|795239725|ref|XM_011953207.1| PREDICTED: Colobus an...\n",
" 9 1 gi|1411128774|ref|XM_025367317.1| PREDICTED: Theropith...\n",
" 10 1 gi|1059109912|ref|XM_017851234.1| PREDICTED: Rhinopith...\n",
" 11 1 gi|724815869|ref|XM_010361572.1| PREDICTED: Rhinopithe...\n",
" 12 1 gi|1220191829|ref|XM_009193704.2| PREDICTED: Papio anu...\n",
" 13 1 gi|982311930|ref|XM_005588244.2| PREDICTED: Macaca fas...\n",
" 14 1 gi|1297694799|ref|XM_023229571.1| PREDICTED: Piliocolo...\n",
" 15 1 gi|967496221|ref|XM_015123082.1| PREDICTED: Macaca mul...\n",
" 16 1 gi|768000518|ref|XM_011527620.1| PREDICTED: Homo sapie...\n",
" 17 1 gi|635036575|ref|XM_007995524.1| PREDICTED: Chlorocebu...\n",
" 18 1 gi|795271240|ref|XM_011768647.1| PREDICTED: Macaca nem...\n",
" 19 1 gi|795144436|ref|XM_011981779.1| PREDICTED: Mandrillus...\n",
" 20 1 gi|194377853|dbj|AK301847.1| Homo sapiens cDNA FLJ5548...\n",
" 21 1 gi|795433285|ref|XM_012094155.1| PREDICTED: Cercocebus...\n",
" 22 1 gi|1367219254|ref|XM_016935304.2| PREDICTED: Pan trogl...\n",
" 23 1 gi|795433280|ref|XM_012094154.1| PREDICTED: Cercocebus...\n",
" 24 1 gi|1297694797|ref|XM_023229570.1| PREDICTED: Piliocolo...\n",
" 25 1 gi|795239720|ref|XM_011953206.1| PREDICTED: Colobus an...\n",
" 26 1 gi|1059109914|ref|XM_017851235.1| PREDICTED: Rhinopith...\n",
" 27 1 gi|1220191830|ref|XM_021930788.1| PREDICTED: Papio anu...\n",
" 28 1 gi|685606530|ref|XM_009193706.1| PREDICTED: Papio anub...\n",
" 29 1 gi|1220191832|ref|XM_017952263.2| PREDICTED: Papio anu...\n",
" ~~~\n",
" 47 1 gi|826285426|ref|XM_012641398.1| PREDICTED: Propithecu...\n",
" 48 1 gi|947308602|ref|XM_006161013.2| PREDICTED: Tupaia chi...\n",
" 49 1 gi|1220191833|ref|XM_021930789.1| PREDICTED: Papio anu...\n"
" 47 1 gi|1044402864|ref|XM_017497619.1| PREDICTED: Cebus cap...\n",
" 48 1 gi|1044402866|ref|XM_017497620.1| PREDICTED: Cebus cap...\n",
" 49 1 gi|1044402868|ref|XM_017497621.1| PREDICTED: Cebus cap...\n"
]
}
],
"source": [
"from Bio import SearchIO\n",
"qresult = SearchIO.read(blast_handle, 'blast-xml')\n",
"print qresult"
"print(qresult)"
]
},
{
"cell_type": "code",
"execution_count": 92,
"execution_count": 32,
"metadata": {
"slideshow": {
"slide_type": "subslide"
......@@ -776,12 +1088,12 @@
}
],
"source": [
"print qresult[0]"
"print(qresult[0])"
]
},
{
"cell_type": "code",
"execution_count": 94,
"execution_count": 33,
"metadata": {
"slideshow": {
"slide_type": "subslide"
......@@ -794,8 +1106,8 @@
"text": [
"Query: No\n",
" definition line\n",
" Hit: gi|1034056594|ref|XM_016946961.1| (1530)\n",
" PREDICTED: Pan troglodytes ATP-dependent RNA helicase DDX39A (LOC1079...\n",
" Hit: gi|1367219251|ref|XM_016935303.2| (1530)\n",
" PREDICTED: Pan troglodytes DExD-box helicase 39A (DDX39A), transcript...\n",
" HSPs: ---- -------- --------- ------ --------------- ---------------------\n",
" # E-value Bit score Span Query range Hit range\n",
" ---- -------- --------- ------ --------------- ---------------------\n",
......@@ -804,7 +1116,7 @@
}
],
"source": [
"print qresult[1]"
"print(qresult[1])"
]
},
{
......@@ -830,30 +1142,37 @@
}
},
"source": [
"The lesson was based on previous material by [Wibowo Arindrarto](mailto://w.arindrarto@lumc.nl) and Martijn Vermaat.\n",
"The lesson was based on previous material by [Guy Allard](mailto://w.g.allard@lumc.nl), [Wibowo Arindrarto](mailto://w.arindrarto@lumc.nl) and Martijn Vermaat.\n",
"\n",
"License: [Creative Commons Attribution 3.0 License (CC-by)](http://creativecommons.org/licenses/by/3.0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 2",
"display_name": "Python 3",
"language": "python",
"name": "python2"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
......
>NM_000207.2 Homo sapiens insulin (INS), transcript variant 1, mRNA
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGATCACTGTCCTTCTGCCATGGCCCTGTG
GATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAAC
CAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACA
CACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGC
AGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACC
AGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGACGCAGCCCGCAGGCAGCCCCACACCCG
CCGCCTCCTGCACCGAGAGAGATGGAATAAAGCCCTTGAACCAGCAAAA
>NM_001008996.2 Pan troglodytes insulin (INS), mRNA
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGATCACTGTCCTTCTGCCATGGCCCTGTG
GATGCGCCTCCTGCCCCTGCTGGTGCTGCTGGCCCTCTGGGGACCTGACCCAGCCTCGGCCTTTGTGAAC
CAACACCTGTGCGGCTCCCACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACA
CACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGC
AGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGTATCGTGGAACAATGCTGTACC
AGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGATGGAATAAAGCCCTTGAACCAGC
>NM_001185083.2 Mus musculus insulin II (Ins2), transcript variant 1, mRNA
GGGGACCCAGTAACCACCAGCCCTAAGTGATCCGCTACAATCAAAAACCATCAGCAAGCAGGAAGGTTAT
TGTTTCAACATGGCCCTGTGGATGCGCTTCCTGCCCCTGCTGGCCCTGCTCTTCCTCTGGGAGTCCCACC
CCACCCAGGCTTTTGTCAAGCAGCACCTTTGTGGTTCCCACCTGGTGGAGGCTCTCTACCTGGTGTGTGG
GGAGCGTGGCTTCTTCTACACACCCATGTCCCGCCGTGAAGTGGAGGACCCACAAGTGGCACAACTGGAG
CTGGGTGGAGGCCCGGGAGCAGGTGACCTTCAGACCTTGGCACTGGAGGTGGCCCAGCAGAAGCGTGGCA
TTGTAGATCAGTGCTGCACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGACCCACCA
CTACCCAGCCTACCCCTCTGCAATGAATAAAACCTTTGAATGAGCACAAAAAA
>NM_019130.2 Rattus norvegicus insulin 2 (Ins2), mRNA
AGCCCTAAGTGACCAGCTACAGTCGGAAACCATCAGCAAGCAGGTCATTGTTCCAACATGGCCCTGTGGA
TCCGCTTCCTGCCCCTGCTGGCCCTGCTCATCCTCTGGGAGCCCCGCCCTGCCCAGGCTTTTGTCAAACA
GCACCTTTGTGGTTCTCACTTGGTGGAAGCTCTCTACCTGGTGTGTGGGGAGCGTGGATTCTTCTACACA
CCCATGTCCCGCCGCGAAGTGGAGGACCCACAAGTGGCACAACTGGAGCTGGGTGGAGGCCCGGGGGCAG
GTGACCTTCAGACCTTGGCACTGGAGGTGGCCCGGCAGAAGCGCGGCATCGTGGATCAGTGCTGCACCAG
CATCTGCTCTCTCTACCAACTGGAGAACTACTGCAACTAGGCCCACCACTACCCTGTCCACCCCTCTGCA
ATGAATAAAACCTTTGAAAGAGCACTACAAAAAAAAAAAAAAAA
>NM_205222.3 Gallus gallus insulin (INS-IGF2), mRNA
ATATAAATATGGGAAAGAGAATGGGGAAATTTCTACCAGTCTTCATCTCTGAGAGCAAACTTCTCTGCAT
CTCTTTCTCTCTTCTCTGGGCCTCCCCCAGCTCATCATGGCTCTCTGGATCCGATCACTGCCTCTTCTGG
CTCTCCTTGTCTTTTCTGGCCCTGGAACCAGCTATGCAGCTGCCAACCAGCACCTCTGTGGCTCCCACTT
GGTGGAGGCTCTCTACCTGGTGTGTGGAGAGCGTGGCTTCTTCTACTCCCCCAAAGCCCGACGGGATGTC
GAGCAGCCCCTAGTGAGCAGTCCCTTGCGTGGCGAGGCAGGAGTGCTGCCTTTCCAGCAGGAGGAATACG
AGAAAGTCAAGCGAGGGATTGTTGAGCAATGCTGCCATAACACGTGTTCCCTCTACCAACTGGAGAACTA
CTGCAACTAGCCAAGAAGCCGGAAGCGGGCACAGACATACACTTACTCTATCGCACCTTCAAAGCATTTG
AATAAACCTTGTTGGTCTACTGGAAGACTTGTGCC
>NM_001100236.1 Xenopus tropicalis insulin (ins), mRNA
CCTCCTTTTGATCTTTCCAGCACTTGTCCAGCTCCCACTATCCTCTATCATGGCTCTTTGGATGCAGTGT
CTGCCCCTGGTACTTGTGCTCCTTTTCTCTACACCCAACACCGAAGCTCTAGCTAACCAACACCTGTGTG
GGTCTCACCTGGTAGAAGCCCTGTATCTAGTATGTGGGGATCGAGGCTTCTTCTACTACCCCAAGATCAA
ACGGGACATCGAACAAGCAATGGTCAATGGACCCCAGGACAACGAGTTGGATGGAATGCAGCTCCAGCCT
CAGGAGTACCAGAAAATGAAGAGGGGAATTGTGGAGCAATGCTGCCACAGCACATGTTCTCTCTTCCAGC
TGGAGAGCTACTGCAACTAGGGGACCAGGCAAATGCTCTCTTACCAAGGCACCTTCAAGGCAAATCCATT
ATGCCAAAACAACAGGACAACGAGCATTGTCTAACGGCACCAAGAACTTCTAACAATGTATATTTATTCC
ATATAAATTAGACATCGGTATCCCAACTAATCTGTTCTTAGTAGAAGGAGTTATATAGAGTAATTCTATG
TGACAGGACAAGAAATATCTGTTATTTTTGCATTTTAATTTGCTCAGAAACCACCACTTTAATGCTACTT
TAACATGGCTGTCATCAGCAAAGTACTGTGCAAGTCGGAAAGACCTTGTTTTAGGAGAGACCGGGCAGGT
TACATTGATAAAGTTCAAAAAAGAAAGTATCTGGAAGAAAAAGAGCCACCCAAAATGTTATTCCGATCTT
GCTTTTAAGTGCCTTGACCTACTGTATTTACTGTCTCTCTGTCTCACTGCAAATAAATGTAAGCTGAAGA
GCTAAAAAAAAAAAAAAAAAAA
>NM_131056.1 Danio rerio preproinsulin (ins), mRNA
CCATATCCACCATTCCTCGCCTCTGCTTCGAGAACAGTGTGACCATGGCAGTGTGGCTTCAGGCTGGTGC
TCTGTTGGTCCTGTTGGTCGTGTCCAGTGTAAGCACTAACCCAGGCACACCGCAGCACCTGTGTGGATCT
CATCTGGTCGATGCCCTTTATCTGGTCTGTGGCCCAACAGGCTTCTTCTACAACCCCAAGAGAGACGTTG
AGCCCCTTCTGGGTTTCCTTCCTCCTAAATCTGCCCAGGAAACTGAGGTGGCTGACTTTGCATTTAAAGA
TCATGCCGAGCTGATAAGGAAGAGAGGCATTGTAGAGCAGTGCTGCCACAAACCCTGCAGCATCTTTGAG
CTGCAGAACTACTGTAACTGAAGAGATTTGCCCACCGCCAATGCCAGAAACACCTGTTTGCACACAGGCC
TTAATGCTCTCCGTTTGTTTTTACAGAAAAAATAAAACTATCAAATGA
\ No newline at end of file
>chr6:57316079-57316087
ccaccacca
>chr7:73761746-73761754
ccgccacca
>chr20:59571443-59571454
aggaggtggagg
>chrY:15809432-15809440
CCACCTCCT
>chr17:40068950-40068958
CCTCCACCT
>chr14:65103070-65103078
CCGCCACCT
>chr19:54239423-54239431
cctccgcct
>chr6:150434480-150434488
CCACCACCG
>chr7:33812110-33812121
tggaggtggagg
>chr5:153717375-153717383
ccaccacca
>chr6:7914979-7914987
ccgcctcct
>chr22:27012040-27012051
aggaggaggagg
>chr3:69390851-69390859
aggaggtgg
>chr9:117615118-117615126
aggaggcgg
>chr15:84899756-84899767
aggaggaggagg
>chr3:129325550-129325561
CGGAggcggcgg
>chr7:137343473-137343481
CCTCCTCCT
>chr9:35997152-35997163
cctccgcctcca
>chr6:35548397-35548405
cctccgcct
>chr9:111229018-111229029
TGGTGGTGGAGG
>chr12:66746168-66746176
aggaggtgg
>chr8:35193429-35193437
ccaccacca
>chr2:100306477-100306485
aggtggagg
>chr4:88917359-88917367
cctccacct
>chr3:72299621-72299629
tggaggtgg
>chr7:6140042-6140053
cctccgcctcct
>chr1:30429604-30429612
CCACCACCT
>chr9:84454516-84454524
ccaccacca
>chr4:24681618-24681626
tggtggcgg
>chr6:39016541-39016549
CCGCCACCA
>chr19:47957834-47957845
aggaggtggagg
>chr2:121438697-121438705
CCACCACCG
>chr4:6606145-6606153
AGGCGGTGG
>chr10:54496597-54496605
CCACCACCA
>chr1:121283711-121283719
tggtggcgg
>chr15:83623612-83623620
cctccgcct
>chr10:96161580-96161588
ccgccacca
>chrX:153674870-153674878
aggaggCGG
>chr7:116353954-116353962
CCACCTCCA
>chr3:44738060-44738068
aggtggagg
>chr14:88795532-88795540
CCACCTCCT
>chr1:24149591-24149602
aggaggaggagg
>chr1:26463722-26463730
aggtggagg
>chrX:70882170-70882178
CCTCCACCT
>chr2:138133373-138133381
TGGAGGAGG
>chr5:179540966-179540974
tggaggagg
>chr12:108703822-108703833
tggaggaggtgg
>chr15:59788732-59788740
aggcggagg
>chr12:48567647-48567655
aggcggagg
>chr11:118709199-118709210
cctccacctcct
>chr16:83742554-83742565
aggaggcggagg
>chr21:21636262-21636270
aggcggagg
>chr16:66928929-66928937
AGGCGGAGG
>chr14:56018822-56018830
ccgccacca
>chr16:78283154-78283162
ccgcctcca
>chr17:27620974-27620985
tggtggtggtgg
>chr20:40577705-40577713
ccacctcct
>chrX:88694903-88694911
TGGAGGTGG
>chr1:12630736-12630744
AGGTGGAGG
>chr12:47635287-47635295
ccgccacca
>chr14:104575359-104575370
TGGAGGTGGTGG
>chr14:82091013-82091021
CCTCCTCCT
>chr7:138971990-138971998
CCTCCTCCT
>chr11:65620965-65620973
CCACCACCA
>chr10:14803680-14803688
cctccacct
>chr4:55794226-55794234
cctccacct
>chr1:158182140-158182151
tggtggtggtgg
>chr1:24622336-24622344
AGGTGGAGG
>chr11:125240043-125240051
CCACCTCCA
>chr5:87790918-87790926
ccaccacca
>chr6:17539578-17539586
CCTCCTCCA
>chr1:112985394-112985402
cctccgcct
>chr3:50185889-50185897
tggaggagg
>chr1:9406001-9406009
TGGTGGTGG
>chr9:107660691-107660699
tggtggtgg
>chr10:88410633-88410641
aggaggagg
>chr17:7450571-7450579
cctccacct
>chr18:54316735-54316743
aggaggtgg
>chr7:94732764-94732772
ccgcctcct
>chrX:110238366-110238374
CCTCCTCCT
>chr5:150716893-150716901
CGGAGGAGG
>chr9:6097303-6097311
ccacctcca
>chrUn_gl000219:120741-120752
AGGTGGAGGTGG
>chr10:30397169-30397177
aggtggagg
>chr5:3099125-3099133
AGGAGGTGG
>chr10:106022363-106022371
aggaggcgg
>chr3:107519161-107519169
TGGAGGAGG
>chr5:145007968-145007976
cctccacct
>chr9:33624940-33624948
CCACCTCCA
>chr17:41898360-41898368
CCGCCTCCT
>chr16:15915078-15915086
CCTCCTCCA
>chr13:20332903-20332911
aggtggagg
>chr10:34392633-34392644
cctccacctcct
>chrX:545065-545086
aggaggaggctggaggaggagg
>chr1:23931841-23931849
aggcggagg
>chr2:106719902-106719910
TGGAGGAGG
>chr2:92261744-92261752
CCGCCGCCG
>chr8:10983660-10983668
cctccgcct
>chr16:68583444-68583452
tggtggcgg
>chr8:145706955-145706963
aggcggagg