Commit af03bb66 authored by bow

Initial addition of

parent 0e5b7b08
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<span style=\"font-size: 200%\">More Python Goodness</span>\n",
"===\n",
"\n",
"[Wibowo Arindrarto](mailto:w.arindrarto@lumc.nl), [Martijn Vermaat](mailto:m.vermaat.hg@lumc.nl)\n",
"\n",
"[Department of Human Genetics, Leiden University Medical Center](http://humgen.nl)\n",
"\n",
"[Sequencing Analysis Support Core, Leiden University Medical Center](http://sasc.lumc.nl)\n",
"\n",
"Based on: [Python Scientific Lecture Notes](http://scipy-lectures.github.io/)\n",
"\n",
"License: [Creative Commons Attribution 3.0 License (CC-by)](http://creativecommons.org/licenses/by/3.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Agenda\n",
"\n",
"1. Working with scripts (+additional Python bits)\n",
"2. Working with modules\n",
"3. Brief tour of the standard library\n",
"4. Reading and writing files\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Working with scripts\n",
"\n",
"* Interpreters are great for prototyping, but not really suitable if you want to share or release code\n",
"* To do so, we write our Python commands in scripts (and later, modules)\n",
"* A script: a simple text file containing Python instructions to execute\n",
"* Two common ways to execute a script:\n",
"    1. As an argument of the `python` interpreter command\n",
"    2. As a standalone executable (with the appropriate shebang line & file mode)\n",
"* IPython gives you a third option:\n",
"    3. As an argument of the `%run` magic"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Writing your script\n",
"\n",
"1. Let's start with a simple GC calculator. Open your text editor, and write the following Python statements (mind your indentation):\n",
"\n",
"        def calc_gc_percent(seq):\n",
"            at_count, gc_count = 0, 0\n",
"            for char in seq:\n",
"                if char in ('A', 'T'):\n",
"                    at_count += 1\n",
"                elif char in ('G', 'C'):\n",
"                    gc_count += 1\n",
"\n",
"            return gc_count * 100.0 / (gc_count + at_count)\n",
"\n",
"        print(\"The sequence 'CAGG' has a %GC of {:.2f}\".format(calc_gc_percent(\"CAGG\")))\n",
"\n",
"2. Save the file (we'll use `seq_toolbox.py` here, but you can use any other name) and go to your shell."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Running the script\n",
"\n",
"Let's try the first method: using your script as an argument.\n",
"\n",
"    $ python seq_toolbox.py\n",
"\n",
"Is the output as you expect?\n",
"\n",
"For the second method, we need to do two more things:\n",
"\n",
"1. Open the script in your editor and add the following line to the very top: `#!/usr/bin/env python`\n",
"\n",
"2. Save the file, go back to the shell, allow the file to be executed, and run it directly:\n",
"\n",
"        $ chmod +x seq_toolbox.py\n",
"        $ ./seq_toolbox.py\n",
"\n",
"Is the output the same as with the previous method?\n",
"\n",
"Finally, try out the third method. Open an IPython interpreter session and do `%run seq_toolbox.py`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Improving the script: using the standard library\n",
"\n",
"* Our script is nice and dandy, but we don't want to edit the source file every time we calculate a sequence's GC percentage.\n",
"* Your first standard library module: `sys`\n",
"* Standard library: collection of Python modules (or functions, for now) that comes packaged with a default installation.\n",
"* Not part of the language per se, more like a 'batteries included' thing.\n",
"* For now, we'll use the simple `sys` module to make our script more flexible."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Your first standard library module: `sys`\n",
"\n",
"* Standard library modules (and other modules, as we'll see later) can be used via the `import` statement, e.g. `import sys`\n",
"* Like other objects so far, we can peek into the documentation of these modules using `help`, e.g. `help(sys)`\n",
"\n",
"## sys.argv\n",
"\n",
"* The `sys` module provides a way to capture runtime arguments with its `argv` object.\n",
"* `sys.argv`: a list of the command-line arguments given to the current Python session. For a script, the first item (`sys.argv[0]`) is the script name and the actual arguments follow.\n",
"* Not really useful for an interpreter session, but very handy for scripts."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import sys"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sys.argv"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 11,
"text": [
"['-c',\n",
" '-f',\n",
" '/home/bow/.config/ipython/profile_default/security/kernel-f99c934a-43cb-42f9-b7af-55c1f7911c7d.json',\n",
" '--pylab',\n",
" 'inline',\n",
" \"--IPKernelApp.parent_appname='ipython-notebook'\",\n",
" '--parent=1']"
]
}
],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sys.argv[:3]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 12,
"text": [
"['-c',\n",
" '-f',\n",
" '/home/bow/.config/ipython/profile_default/security/kernel-f99c934a-43cb-42f9-b7af-55c1f7911c7d.json']"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Improving our script: sys.argv\n",
"\n",
"* To use `sys.argv` in our script, open a text editor and edit the script by adding an import statement, capturing the `sys.argv` value, and editing our last `print` line.\n",
"\n",
"        #!/usr/bin/env python\n",
"\n",
"        import sys\n",
"\n",
"        def calc_gc_percent(seq):\n",
"            at_count, gc_count = 0, 0\n",
"            for char in seq:\n",
"                if char in ('A', 'T'):\n",
"                    at_count += 1\n",
"                elif char in ('G', 'C'):\n",
"                    gc_count += 1\n",
"\n",
"            return gc_count * 100.0 / (gc_count + at_count)\n",
"\n",
"        input_seq = sys.argv[1]\n",
"        print(\"The sequence '{}' has a %GC of {:.2f}\".format(input_seq, calc_gc_percent(input_seq)))\n",
"\n",
"* To test it, you can run the following command in your shell:\n",
"\n",
"        $ python seq_toolbox.py CAGG\n",
"\n",
"* Try it with `./seq_toolbox.py` instead. What happens?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Improving our script: string functions\n",
"\n",
"* Try running the script with `'cagg'` as the input sequence. What happens?\n",
"\n",
"* One way to squash this potential bug is by using Python's string function `upper`.\n",
"* Python strings are objects with useful built-in functions\n",
"* Complete documentation is available in the interpreter, via `help(str)`\n",
"* Other objects like `int`s, `list`s, and `dict`s also have built-in functions.\n",
"* Let's check out some commonly used string functions.\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"my_str = 'Hello again, ipython!'"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"my_str.upper()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": [
"'HELLO AGAIN, IPYTHON!'"
]
}
],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"my_str.lower()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
"'hello again, ipython!'"
]
}
],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"my_str.title()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"'Hello Again, Ipython!'"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"my_str.startswith('H')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": [
"True"
]
}
],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"my_str.startswith('h')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": [
"False"
]
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"my_str.split(',')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
"['Hello again', ' ipython!']"
]
}
],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"my_str.replace('ipython', 'lumc')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": [
"'Hello again, lumc!'"
]
}
],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"my_str.count('n')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": [
"2"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Improving our script: using upper()\n",
"\n",
"* Let's use `upper()` to fortify our function. It should now look something like this:\n",
"\n",
"        def calc_gc_percent(seq):\n",
"            at_count, gc_count = 0, 0\n",
"            for char in seq.upper():\n",
"                if char in ('A', 'T'):\n",
"                    at_count += 1\n",
"                elif char in ('G', 'C'):\n",
"                    gc_count += 1\n",
"\n",
"            return gc_count * 100.0 / (gc_count + at_count)\n",
"\n",
"* And run it (whichever way you prefer). Do you get the expected output?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Improving our script: comments & docstrings\n",
"\n",
"* Golden rule: write code for humans (this includes you in 6 months)\n",
"* Python provides two ways to accomplish this: comments and docstrings.\n",
"\n",
"## Comments\n",
"\n",
"* Any text following a `#` on a line, which the interpreter ignores\n",
"* Freeform text ~ anything that helps in understanding the code\n",
"\n",
"## Docstrings\n",
"\n",
"* Python's way of attaching proper documentation to its objects\n",
"* Officially: a string literal that occurs as the first statement in a module, function, class, or method definition\n",
"* Usually done using triple-quoted strings, to handle newlines more easily"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Improving our script: comments & docstrings\n",
"\n",
"* Open your script again in a text editor, and add the following comments & docstrings:\n",
"\n",
"        #!/usr/bin/env python\n",
"\n",
"        import sys\n",
"\n",
"        def calc_gc_percent(seq):\n",
"            \"\"\"Calculates the GC percentage of the given sequence.\n",
"\n",
"            Arguments:\n",
"                - seq - the input sequence (string).\n",
"\n",
"            Returns:\n",
"                - GC percentage (float).\n",
"\n",
"            The returned value is always <= 100.0\n",
"\n",
"            \"\"\"\n",
"            at_count, gc_count = 0, 0\n",
"            # change input to all caps to allow for non-capital input sequence\n",
"            for char in seq.upper():\n",
"                if char in ('A', 'T'):\n",
"                    at_count += 1\n",
"                elif char in ('G', 'C'):\n",
"                    gc_count += 1\n",
"\n",
"            return gc_count * 100.0 / (gc_count + at_count)\n",
"\n",
"        input_seq = sys.argv[1]\n",
"        print(\"The sequence '{}' has a %GC of {:.2f}\".format(input_seq, calc_gc_percent(input_seq)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Detour: PEP8 & other PEPs\n",
"\n",
"* Since comments and docstrings are basically free-form text, whether they are useful or not depends heavily on the developer.\n",
"* To mitigate this, the Python community has come up with practical conventions.\n",
"* They are documented in a document called PEP8.\n",
"* PEP8: Python Enhancement Proposal no. 8 ~ http://www.python.org/dev/peps/pep-0008/\n",
"* PEP257 is for docstrings specifically.\n",
"* Following them is not mandatory, but *very* encouraged.\n",
"* PEPs are how Python grows. There are hundreds of them now, and all have to be approved by our BDFL."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Improving our script: handling errors & exceptions\n",
"\n",
"* Try running the script with 'ACTG123' as the argument. What happens? Is this acceptable behavior?\n",
"* Sometimes we want to put safeguards to handle invalid inputs.\n",
"* In this case we only accept `ACTG`, all other characters are invalid.\n",
"* Python provides a way to break out of the normal execution flow, by raising what's called an exception (an `Exception` object).\n",
"* We can raise exceptions ourselves as well, by using the `raise` statement.\n",
"* Syntax: `raise {exception_type} ( {message} )`"
]
},
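{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Detour: trying out `raise`\n",
"\n",
"The `raise` syntax can be tried out in isolation before we touch our script. The snippet below is only a minimal sketch: the helper name `check_base` is made up for this illustration and is not part of `seq_toolbox.py`.\n",
"\n",
"```python\n",
"def check_base(char):\n",
"    # Hypothetical helper, for illustration only: return the upper-cased base,\n",
"    # or raise a ValueError when the character is not one of A, C, T, G.\n",
"    base = char.upper()\n",
"    if base not in ('A', 'C', 'T', 'G'):\n",
"        raise ValueError(\"Unexpected character found: {}.\".format(char))\n",
"    return base\n",
"\n",
"\n",
"print(check_base('g'))\n",
"# Calling check_base('7') instead would stop the program with a ValueError.\n",
"```\n",
"\n",
"A `ValueError` seems a reasonable choice here, since the argument has the right type (a string) but an unacceptable value."
]
},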
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Improving our script: handling invalid inputs\n",
"\n",
"* Open your script, and edit the `if` clause to add our exception:\n",
"\n",
"        def calc_gc_percent(seq):\n",
"            \"\"\"Calculates the GC percentage of the given sequence.\n",
"\n",
"            Arguments:\n",
"                - seq - the input sequence (string).\n",
"\n",
"            Returns:\n",
"                - GC percentage (float).\n",
"\n",
"            The returned value is always <= 100.0\n",
"\n",
"            \"\"\"\n",
"            at_count, gc_count = 0, 0\n",
"            # change input to all caps to allow for non-capital input sequence\n",
"            for char in seq.upper():\n",
"                if char in ('A', 'T'):\n",
"                    at_count += 1\n",
"                elif char in ('G', 'C'):\n",
"                    gc_count += 1\n",
"                else:\n",
"                    raise ValueError(\"Unexpected character found: {}. Only ACTGs are allowed.\".format(char))\n",
"\n",
"            return gc_count * 100.0 / (gc_count + at_count)\n",
"\n",
"* Try running the script again with `ACTG123` as the argument. What happens now?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Improving our script: handling corner cases\n",
"\n",
"* Try running the script with '' (two quote signs) as the argument. What happens? Why? Is this a valid input?\n",
"* We don't always want to let exceptions stop the program flow; sometimes we want to provide an alternative flow.\n",
"* The `try..except` block allows you to do this.\n",
"* The syntax is:\n",
"\n",
"        try:\n",
"            {statements that may raise exceptions}\n",
"        except {exception type}:\n",
"            {what to do when the exception type is raised}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Improving our script: handling corner cases\n",
"\n",
"* Let's change our script by adding a `try..except` block:\n",
"\n",
"        def calc_gc_percent(seq):\n",
"            \"\"\"Calculates the GC percentage of the given sequence.\n",
"\n",
"            Arguments:\n",
"                - seq - the input sequence (string).\n",
"\n",
"            Returns:\n",
"                - GC percentage (float).\n",
"\n",
"            The returned value is always <= 100.0\n",
"\n",
"            \"\"\"\n",
"            at_count, gc_count = 0, 0\n",
"            # change input to all caps to allow for non-capital input sequence\n",
"            for char in seq.upper():\n",
"                if char in ('A', 'T'):\n",
"                    at_count += 1\n",
"                elif char in ('G', 'C'):\n",
"                    gc_count += 1\n",
"                else:\n",
"                    raise ValueError(\"Unexpected character found: {}. Only ACTGs are allowed.\".format(char))\n",
"            # corner case handling: empty input sequence\n",
"            try:\n",
"                return gc_count * 100.0 / (gc_count + at_count)\n",
"            except ZeroDivisionError:\n",
"                return 0.0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Detour: Exception handling best practices\n",
"\n",
"## Aim for a minimal `try` block\n",
"\n",
"We want to be able to pinpoint the statements that may raise the exceptions so we can tailor our handling.\n",
"\n",
"Example of code that violates this principle:\n",
"\n",
"    try:\n",
"        my_function()\n",
"        my_other_function()\n",
"    except ValueError:\n",
"        my_fallback_function()\n",
"\n",
"A better way would be:\n",
"\n",
"    try:\n",
"        my_function()\n",
"    except ValueError:\n",
"        my_fallback_function()\n",
"    my_other_function()\n",
"\n",
"\n",
"## Be specific when handling exceptions\n",
"\n",
"The following code is syntactically valid, but *never* use it in your real scripts / programs:\n",
"\n",
"    try:\n",
"        my_function()\n",
"    except:\n",
"        my_fallback_function()\n",
"\n",
"*Always* use the full exception name when handling exceptions, to make for much cleaner code:\n",
"\n",
"    try:\n",
"        my_function()\n",
"    except ValueError:\n",
"        my_fallback_function()\n",
"    except TypeError:\n",
"        my_other_fallback_function()\n",
"    except IndexError:\n",
"        my_final_function()\n",
"\n",
"## Look Before You Leap (LBYL) vs Easier to Ask for Forgiveness than Permission (EAFP)\n",
"\n",
"Our `try..except` block follows the EAFP style. We could have written it in the LBYL style instead, like so:\n",
"\n",
"    if gc_count + at_count == 0:\n",
"        return 0.0\n",
"    return gc_count * 100.0 / (gc_count + at_count)\n",
"\n",
"Both approaches are correct and have their own pluses and minuses in general. However, in this case, I would argue that EAFP is better since it makes the code more readable."
]
},
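{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Detour: LBYL and EAFP side by side\n",
"\n",
"To compare the two styles directly, here is a minimal sketch that isolates only the final division of our function. The function names `gc_percent_lbyl` and `gc_percent_eafp` are hypothetical and exist only for this comparison; both take the already-counted bases as arguments.\n",
"\n",
"```python\n",
"def gc_percent_lbyl(gc_count, at_count):\n",
"    # LBYL: check for the problematic case before dividing.\n",
"    if gc_count + at_count == 0:\n",
"        return 0.0\n",
"    return gc_count * 100.0 / (gc_count + at_count)\n",
"\n",
"\n",
"def gc_percent_eafp(gc_count, at_count):\n",
"    # EAFP: just divide, and handle the failure only if it happens.\n",
"    try:\n",
"        return gc_count * 100.0 / (gc_count + at_count)\n",
"    except ZeroDivisionError:\n",
"        return 0.0\n",
"\n",
"\n",
"# Both styles should agree, for a normal and for an empty sequence.\n",
"for gc_count, at_count in [(3, 1), (0, 0)]:\n",
"    print(gc_percent_lbyl(gc_count, at_count) == gc_percent_eafp(gc_count, at_count))\n",
"```\n",
"\n",
"The results are the same either way; the difference is only in where the special case is spelled out."
]
},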
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Improving the script: handling corner cases\n",
"\n",
"* Now try running your script without any arguments at all. What happens?\n",
"* Armed with what you now know, how would you handle this situation?"
]
},
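{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Improving the script: one possible answer\n",
"\n",
"Try it yourself first. If you get stuck, the sketch below shows one possible approach, assuming we want a short usage message instead of a traceback. The helper name `get_input_seq` and the wording of the message are made up for this illustration.\n",
"\n",
"```python\n",
"import sys\n",
"\n",
"\n",
"def get_input_seq(argv):\n",
"    # Hypothetical helper: return the first command-line argument, or None\n",
"    # when the script was started without any arguments.\n",
"    try:\n",
"        return argv[1]\n",
"    except IndexError:\n",
"        return None\n",
"\n",
"\n",
"input_seq = get_input_seq(sys.argv)\n",
"if input_seq is None:\n",
"    print(\"Usage: seq_toolbox.py SEQUENCE\")\n",
"else:\n",
"    print(\"Input sequence: {}\".format(input_seq))\n",
"```\n",
"\n",
"Note that the `try` block contains only the single statement that may raise the `IndexError`, in line with the best practices from the detour."
]
},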
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Working with modules\n",
"\n",
"* Sometimes it is useful to group functions and other objects in different files.\n",
"* Sometimes you need to use that fancy function you wrote 2 years ago.\n",
"* This is where modules in Python come in handy.\n",
"* More officially, modules allow you to share code in the form of libraries.\n",
"* You've seen one example: the `sys` module in the standard library\n",
"* There are many other modules in the standard library, as we'll see soon.\n",
"* Let's start writing our own modules first."
]
}
],
"metadata": {}
}
]
}
\ No newline at end of file