From af03bb66a37e9a2ba9fb25c0b44545031968d36f Mon Sep 17 00:00:00 2001 From: bow <bow@bow.web.id> Date: Sun, 4 Aug 2013 23:30:45 +0200 Subject: [PATCH] Initial addition of --- python-more.ipynb | 694 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 694 insertions(+) create mode 100644 python-more.ipynb diff --git a/python-more.ipynb b/python-more.ipynb new file mode 100644 index 0000000..34dc14e --- /dev/null +++ b/python-more.ipynb @@ -0,0 +1,694 @@ +{ + "metadata": { + "name": "" + }, + "nbformat": 3, + "nbformat_minor": 0, + "worksheets": [ + { + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "<span style=\"font-size: 200%\">More Python Goodness</span>\n", + "===\n", + "\n", + "[Wibowo Arindrarto](mailto:w.arindrarto@lumc.nl), [Martijn Vermaat](mailto:m.vermaat.hg@lumc.nl)\n", + "\n", + "[Department of Human Genetics, Leiden University Medical Center](http://humgen.nl)\n", + "\n", + "[Sequencing Analysis Support Core, Leiden University Medical Center](http://sasc.lumc.nl)\n", + "\n", + "Based on: [Python Scientific Lecture Notes](http://scipy-lectures.github.io/)\n", + "\n", + "License: [Creative Commons Attribution 3.0 License (CC-by)](http://creativecommons.org/licenses/by/3.0)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Agenda\n", + "\n", + "1. Working with scripts (+additional Python bits)\n", + "2. Working with modules\n", + "3. Brief tour of the standard library\n", + "4. Reading and writing files\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Working with scripts\n", + "\n", + "* Interpreters are great for prototyping, but not really suitable if you want to share or release code\n", + "* To do so, we write our Python commands in scripts (and later, modules)\n", + "* A script: a simple text file containing Python instructions to execute\n", + "* Two common ways to execute a script:\n", + " 1. 
As an argument of the python interpreter command\n", + " 2. As a standalone executable (with the appropriate shebang line & file mode)\n", + "* IPython gives you a third option:\n", + " 3. As an argument of the %run magic" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Writing your script\n", + "\n", + "1. Let's start with a simple GC calculator. Open your text editor, and write the following Python statements (mind your indentation):\n", + "\n", + " def calc_gc_percent(seq):\n", + " at_count, gc_count = 0, 0\n", + " for char in seq:\n", + " if char in ('A', 'T'):\n", + " at_count += 1\n", + " elif char in ('G', 'C'):\n", + " gc_count += 1\n", + " \n", + " return gc_count * 100.0 / (gc_count + at_count) \n", + "\n", + " print \"The sequence 'CAGG' has a %GC of {:.2f}\".format(calc_gc_percent(\"CAGG\"))\n", + "\n", + "2. Save the file (we'll use `seq_toolbox.py` here, but you can use any other name) and go to your shell." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Running the script\n", + "\n", + "Let's try the first method: using your script as an argument.\n", + "\n", + " $ python seq_toolbox.py\n", + "\n", + "Is the output as you expect?\n", + "\n", + "For the second method, we need to do two more things:\n", + "\n", + "1. Open the script in your editor and add the following line to the very top: `#!/usr/bin/env python`\n", + "\n", + "2. Save the file, go back to the shell, and allow the file to be executed\n", + " \n", + " $ chmod +x seq_toolbox.py\n", + "\n", + "Now run the script directly with `./seq_toolbox.py`. Is the output the same as with the previous method?\n", + "\n", + "Finally, try out the third method. Open an IPython interpreter session and do `%run seq_toolbox.py`." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Improving the script: using the standard library\n", + "\n", + "* Our script is nice and dandy, but we don't want to edit the source file every time we calculate a sequence's GC.\n", + "* Your first standard library module: `sys`\n", + "* Standard library: collection of Python modules (or functions, for now) that comes packaged with a default installation.\n", + "* Not part of the language per se, more like a 'batteries included' thing.\n", + "* For now, we'll use the simple `sys` module to make our script more flexible." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Your first standard library module: `sys`\n", + "\n", + "* Standard library modules (and other modules, as we'll see later) can be used via the `import` statement, e.g. `import sys`\n", + "* Like other objects so far, we can peek into the documentation of these modules using `help`, e.g. `help(sys)`\n", + "\n", + "## sys.argv\n", + "\n", + "* The `sys` module provides a way to capture runtime arguments with its `argv` object.\n", + "* `sys.argv`: a list of arguments for the current Python session.\n", + "* Not really useful for an interpreter session, but very handy for scripts." 
+ ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "import sys" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 10 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "sys.argv" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 11, + "text": [ + "['-c',\n", + " '-f',\n", + " '/home/bow/.config/ipython/profile_default/security/kernel-f99c934a-43cb-42f9-b7af-55c1f7911c7d.json',\n", + " '--pylab',\n", + " 'inline',\n", + " \"--IPKernelApp.parent_appname='ipython-notebook'\",\n", + " '--parent=1']" + ] + } + ], + "prompt_number": 11 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "sys.argv[:3]" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 12, + "text": [ + "['-c',\n", + " '-f',\n", + " '/home/bow/.config/ipython/profile_default/security/kernel-f99c934a-43cb-42f9-b7af-55c1f7911c7d.json']" + ] + } + ], + "prompt_number": 12 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Improving our script: sys.argv\n", + "\n", + "* To use `sys.argv` in our script, open a text editor and edit the script by adding an import statement, capturing the `sys.argv` value, and editing our last `print` line.\n", + "\n", + " #!/usr/bin/env python\n", + "\n", + " import sys\n", + "\n", + " def calc_gc_percent(seq):\n", + " at_count, gc_count = 0, 0\n", + " for char in seq:\n", + " if char in ('A', 'T'):\n", + " at_count += 1\n", + " elif char in ('G', 'C'):\n", + " gc_count += 1\n", + " \n", + " return gc_count * 100.0 / (gc_count + at_count) \n", + "\n", + " input_seq = sys.argv[1]\n", + " print \"The sequence '{}' has a %GC of {:.2f}\".format(input_seq, calc_gc_percent(input_seq))\n", + "\n", + "* To test it, you can run the following command in your shell:\n", + "\n", + " $ python seq_toolbox.py CAGG\n", + "\n", + 
"* Try it with `./seq_toolbox.py` instead. What happens?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Improving our script: string functions\n", + "\n", + "* Try running the script with `cagg` as the input sequence. What happens?\n", + "\n", + "* One way to squash this potential bug is by using Python's string function `upper`.\n", + "* Python strings are objects with useful built-in functions\n", + "* Complete documentation is available in the interpreter, via `help(str)`\n", + "* Other objects like `int`s, `list`s, and `dict`s also have built-in functions.\n", + "* Let's check out some commonly used string functions.\n" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "my_str = 'Hello again, ipython!'" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 1 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "my_str.upper()" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 2, + "text": [ + "'HELLO AGAIN, IPYTHON!'" + ] + } + ], + "prompt_number": 2 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "my_str.lower()" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 3, + "text": [ + "'hello again, ipython!'" + ] + } + ], + "prompt_number": 3 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "my_str.title()" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 4, + "text": [ + "'Hello Again, Ipython!'" + ] + } + ], + "prompt_number": 4 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "my_str.startswith('H')" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 5, + "text": [ + "True" + ] + } + ], + 
"prompt_number": 5 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "my_str.startswith('h')" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 6, + "text": [ + "False" + ] + } + ], + "prompt_number": 6 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "my_str.split(',')" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 7, + "text": [ + "['Hello again', ' ipython!']" + ] + } + ], + "prompt_number": 7 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "my_str.replace('ipython', 'lumc')" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 8, + "text": [ + "'Hello again, lumc!'" + ] + } + ], + "prompt_number": 8 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "my_str.count('n')" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 9, + "text": [ + "2" + ] + } + ], + "prompt_number": 9 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Improving our script: using upper()\n", + "\n", + "* Let's use `upper()` to fortify our function. It should now look something like this:\n", + "\n", + " def calc_gc_percent(seq):\n", + " at_count, gc_count = 0, 0\n", + " for char in seq.upper():\n", + " if char in ('A', 'T'):\n", + " at_count += 1\n", + " elif char in ('G', 'C'):\n", + " gc_count += 1\n", + " \n", + " return gc_count * 100.0 / (gc_count + at_count) \n", + "\n", + "* And run it (whichever way you prefer). Do you get the expected output?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Improving our script: comments & docstrings\n", + "\n", + "* Golden rule: write code for humans (this includes you in 6 months)\n", + "* Python provides two ways to accomplish this: comments and docstrings.\n", + "\n", + "## Comments\n", + "\n", + "* Any text following a '#' on a line, which is ignored by the interpreter\n", + "* Freeform text ~ anything that helps in understanding the code\n", + "\n", + "## Docstrings\n", + "\n", + "* Python's way of attaching proper documentation to its objects\n", + "* Official: The first string literal that occurs in a module, function, class or method definition\n", + "* Usually done using triple-quoted strings, to handle newlines more easily" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Improving our script: comments & docstrings\n", + "\n", + "* Open your script again in a text editor, and add the following comments & docstrings: \n", + "\n", + " #!/usr/bin/env python\n", + " \n", + " import sys\n", + "\n", + " def calc_gc_percent(seq):\n", + " \"\"\"Calculates the GC percentage of the given sequence.\n", + " \n", + " Arguments:\n", + " - seq - the input sequence (string).\n", + " \n", + " Returns:\n", + " - GC percentage (float).\n", + " \n", + " The returned value is always <= 100.0\n", + " \n", + " \"\"\"\n", + " at_count, gc_count = 0, 0\n", + " # change input to all caps to allow for non-capital input sequence\n", + " for char in seq.upper():\n", + " if char in ('A', 'T'):\n", + " at_count += 1\n", + " elif char in ('G', 'C'):\n", + " gc_count += 1\n", + " \n", + " return gc_count * 100.0 / (gc_count + at_count) \n", + "\n", + " input_seq = sys.argv[1]\n", + " print \"The sequence '{}' has a %GC of {:.2f}\".format(input_seq, calc_gc_percent(input_seq))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Detour: PEP8 & other PEPs\n", + "\n", + "* Since comments and docstrings are basically 
free-form text, whether they are useful or not depends heavily on the developer.\n", + "* To mitigate this, the Python community has come up with practical conventions.\n", + "* They are documented in a document called PEP8.\n", + "* PEP8: Python Enhancement Proposal no. 8 ~ http://www.python.org/dev/peps/pep-0008/\n", + "* PEP257 is for docstrings specifically.\n", + "* Following them is not mandatory, but *very* much encouraged.\n", + "* PEPs are how Python grows. There are hundreds of them now, and each has to be approved by our BDFL." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Improving our script: handling errors & exceptions\n", + "\n", + "* Try running the script with 'ACTG123' as the argument. What happens? Is this acceptable behavior?\n", + "* Sometimes we want to put safeguards to handle invalid inputs.\n", + "* In this case we only accept `ACTG`; all other characters are invalid.\n", + "* Python provides a way to break out of the normal execution flow, by raising what's called an `Exception`\n", + "* We can raise exceptions as well, by using the `raise` statement.\n", + "* Syntax: `raise {exception_type} ( {message} )`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Improving our script: handling invalid inputs\n", + "\n", + "* Open your script, and edit the `if` clause to add our exception:\n", + "\n", + " def calc_gc_percent(seq):\n", + " \"\"\"Calculates the GC percentage of the given sequence.\n", + " \n", + " Arguments:\n", + " - seq - the input sequence (string).\n", + " \n", + " Returns:\n", + " - GC percentage (float).\n", + " \n", + " The returned value is always <= 100.0\n", + " \n", + " \"\"\"\n", + " at_count, gc_count = 0, 0\n", + " # change input to all caps to allow for non-capital input sequence\n", + " for char in seq.upper():\n", + " if char in ('A', 'T'):\n", + " at_count += 1\n", + " elif char in ('G', 'C'):\n", + " gc_count += 1\n", + " else:\n", + " raise ValueError(\"Unexpected character 
found: {}. Only ACTGs are allowed.\".format(char))\n", + " \n", + " return gc_count * 100.0 / (gc_count + at_count)\n", + "\n", + "* Try running the script again with `ACTG123` as the argument. What happens now?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Improving our script: handling corner cases\n", + "\n", + "* Try running the script with '' (two quote signs) as the argument. What happens? Why? Is this a valid input?\n", + "* We don't always want to let exceptions stop program flow; sometimes we want to provide an alternative flow.\n", + "* The `try..except` block allows you to do this.\n", + "* The syntax is:\n", + "\n", + " try:\n", + " {statements that may raise exceptions}\n", + " except {exception type}:\n", + " {what to do when the exception type is raised}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Improving our script: handling corner cases\n", + "\n", + "* Let's change our script by adding a `try..except` block:\n", + "\n", + " def calc_gc_percent(seq):\n", + " \"\"\"Calculates the GC percentage of the given sequence.\n", + " \n", + " Arguments:\n", + " - seq - the input sequence (string).\n", + " \n", + " Returns:\n", + " - GC percentage (float).\n", + " \n", + " The returned value is always <= 100.0\n", + " \n", + " \"\"\"\n", + " at_count, gc_count = 0, 0\n", + " # change input to all caps to allow for non-capital input sequence\n", + " for char in seq.upper():\n", + " if char in ('A', 'T'):\n", + " at_count += 1\n", + " elif char in ('G', 'C'):\n", + " gc_count += 1\n", + " else:\n", + " raise ValueError(\"Unexpected character found: {}. 
Only ACTGs are allowed.\".format(char))\n", + " # corner case handling: empty input sequence\n", + " try:\n", + " return gc_count * 100.0 / (gc_count + at_count)\n", + " except ZeroDivisionError:\n", + " return 0.0" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Detour: Exception handling best practices\n", + "\n", + "## Aim for a minimal `try` block\n", + "\n", + "We want to be able to pinpoint the statements that may raise the exceptions so we can tailor our handling.\n", + "\n", + "Example of code that violates this principle:\n", + "\n", + " try:\n", + " my_function()\n", + " my_other_function()\n", + " except ValueError:\n", + " my_fallback_function()\n", + "\n", + "A better way would be:\n", + "\n", + " try:\n", + " my_function()\n", + " except ValueError:\n", + " my_fallback_function()\n", + " my_other_function()\n", + "\n", + "\n", + "## Be specific when handling exceptions\n", + "\n", + "The following code is syntactically valid, but *never* use it in your real scripts / programs:\n", + "\n", + " try:\n", + " my_function()\n", + " except:\n", + " my_fallback_function()\n", + "\n", + "*Always* use the full exception name when handling exceptions, to make for much cleaner code:\n", + "\n", + " try:\n", + " my_function()\n", + " except ValueError:\n", + " my_fallback_function()\n", + " except TypeError:\n", + " my_other_fallback_function()\n", + " except IndexError:\n", + " my_final_function()\n", + "\n", + "## Look Before You Leap (LBYL) vs Easier to Ask for Forgiveness than Permission (EAFP)\n", + " \n", + "We could have written our last exception block like so:\n", + "\n", + " if gc_count + at_count == 0:\n", + " return 0.0\n", + " return gc_count * 100.0 / (gc_count + at_count)\n", + "\n", + "Both approaches are correct and have their own pluses and minuses in general. However, in this case I would argue that EAFP is better since it makes the code more readable." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Improving the script: handling corner cases\n", + "\n", + "* Now try running your script without any arguments at all. What happens?\n", + "* Armed with what you now know, how would you handle this situation?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Working with modules\n", + "\n", + "* Sometimes it is useful to group functions and other objects in different files.\n", + "* Sometimes you need to use that fancy function you wrote 2 years ago.\n", + "* This is where modules in Python come in handy.\n", + "* More officially, modules allow you to share code in the form of libraries.\n", + "* You've seen one example: the `sys` module in the standard library\n", + "* There are many other modules in the standard library, as we'll see soon.\n", + "* Let's start writing our own modules first." + ] + } + ], + "metadata": {} + } + ] +} \ No newline at end of file -- GitLab