Initial addition of

Commit af03bb66 authored 11 years ago by bow
--- a/python-more.ipynb
+++ b/python-more.ipynb
+{
+ "metadata": {
+  "name": ""
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+  {
+   "cells": [
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<span style=\"font-size: 200%\">More Python Goodness</span>\n",
+      "===\n",
+      "\n",
+      "[Wibowo Arindrarto](mailto:w.arindrarto@lumc.nl), [Martijn Vermaat](mailto:m.vermaat.hg@lumc.nl)\n",
+      "\n",
+      "[Department of Human Genetics, Leiden University Medical Center](http://humgen.nl)\n",
+      "\n",
+      "[Sequencing Analysis Support Core, Leiden University Medical Center](http://sasc.lumc.nl)\n",
+      "\n",
+      "Based on: [Python Scientific Lecture Notes](http://scipy-lectures.github.io/)\n",
+      "\n",
+      "License: [Creative Commons Attribution 3.0 License (CC-by)](http://creativecommons.org/licenses/by/3.0)"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Agenda\n",
+      "\n",
+      "1. Working with scripts (+additional Python bits)\n",
+      "2. Working with modules\n",
+      "3. Brief tour of the standard library\n",
+      "4. Reading and writing files\n"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Working with scripts\n",
+      "\n",
+      "* Interpreters are great for prototyping, but not really suitable if you want to share or release code\n",
+      "* To do so, we write our Python commands in scripts (and later, modules)\n",
+      "* A script: a simple text file containing Python instructions to execute\n",
+      "* Two common ways to execute a script:\n",
+      "  1. As an argument of the python interpreter command\n",
+      "  2. As a standalone executable (with the appropriate shebang line & file mode)\n",
+      "* IPython gives you a third option:\n",
+      "  3. As an argument of the %run magic"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Writing your script\n",
+      "\n",
+      "1. Let's start with a simple GC calculator. Open your text editor, and write the following Python statements (remember your indentations):\n",
+      "\n",
+      "        def calc_gc_percent(seq):\n",
+      "            at_count, gc_count = 0, 0\n",
+      "            for char in seq:\n",
+      "                if char in ('A', 'T'):\n",
+      "                    at_count += 1\n",
+      "                elif char in ('G', 'C'):\n",
+      "                    gc_count += 1\n",
+      "                \n",
+      "            return gc_count * 100.0 / (gc_count + at_count)\n",
+      "\n",
+      "        print(\"The sequence 'CAGG' has a %GC of {:.2f}\".format(calc_gc_percent(\"CAGG\")))\n",
+      "\n",
+      "2. Save the file (we'll use `seq_toolbox.py` here, but you can use any other name) and go to your shell."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Running the script\n",
+      "\n",
+      "Let's try the first method: using your script as an argument.\n",
+      "\n",
+      "    $ python seq_toolbox.py\n",
+      "\n",
+      "Is the output as you expect?\n",
+      "\n",
+      "For the second method, we need to do two more things:\n",
+      "\n",
+      "1. Open the script in your editor and add the following line to the very top: `#!/usr/bin/env python`\n",
+      "\n",
+      "2. Save the file, go back to the shell, and allow the file to be executed:\n",
+      "\n",
+      "        $ chmod +x seq_toolbox.py\n",
+      "\n",
+      "You can now execute the script directly:\n",
+      "\n",
+      "    $ ./seq_toolbox.py\n",
+      "\n",
+      "Is the output the same as with the previous method?\n",
+      "\n",
+      "Finally, try out the third method. Open an IPython interpreter session and do `%run seq_toolbox.py`."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Improving the script: using the standard library\n",
+      "\n",
+      "* Our script is nice and dandy, but we don't want to edit the source file every time we calculate a sequence's GC percentage.\n",
+      "* Your first standard library module: `sys`\n",
+      "* Standard library: collection of Python modules (or functions, for now) that comes packaged with a default installation.\n",
+      "* Not part of the language per se, more like a 'batteries included' thing.\n",
+      "* For now, we'll use the simple `sys` module to make our script more flexible."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Your first standard library module: `sys`\n",
+      "\n",
+      "* Standard library modules (and other modules, as we'll see later) are loaded with the `import` statement, e.g. `import sys`\n",
+      "* Like other objects so far, we can peek into the documentation of these modules using `help`, e.g. `help(sys)`\n",
+      "\n",
+      "## sys.argv\n",
+      "\n",
+      "* The `sys` module provides a way to capture runtime arguments with its `argv` object.\n",
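+      "* `sys.argv`: a list of the command-line arguments passed to the current Python process. In a script, `sys.argv[0]` is the script name and the actual arguments start at `sys.argv[1]`.\n",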
+      "* Not really useful for an interpreter session, but very handy for scripts."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import sys"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 10
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "sys.argv"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 11,
+       "text": [
+        "['-c',\n",
+        " '-f',\n",
+        " '/home/bow/.config/ipython/profile_default/security/kernel-f99c934a-43cb-42f9-b7af-55c1f7911c7d.json',\n",
+        " '--pylab',\n",
+        " 'inline',\n",
+        " \"--IPKernelApp.parent_appname='ipython-notebook'\",\n",
+        " '--parent=1']"
+       ]
+      }
+     ],
+     "prompt_number": 11
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "sys.argv[:3]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 12,
+       "text": [
+        "['-c',\n",
+        " '-f',\n",
+        " '/home/bow/.config/ipython/profile_default/security/kernel-f99c934a-43cb-42f9-b7af-55c1f7911c7d.json']"
+       ]
+      }
+     ],
+     "prompt_number": 12
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Improving our script: sys.argv\n",
+      "\n",
+      "* To use `sys.argv` in our script, open a text editor and edit the script by adding an import statement, capturing the `sys.argv` value, and editing our last `print` line.\n",
+      "\n",
+      "        #!/usr/bin/env python\n",
+      "\n",
+      "        import sys\n",
+      "\n",
+      "        def calc_gc_percent(seq):\n",
+      "            at_count, gc_count = 0, 0\n",
+      "            for char in seq:\n",
+      "                if char in ('A', 'T'):\n",
+      "                    at_count += 1\n",
+      "                elif char in ('G', 'C'):\n",
+      "                    gc_count += 1\n",
+      "                \n",
+      "            return gc_count * 100.0 / (gc_count + at_count)\n",
+      "\n",
+      "        input_seq = sys.argv[1]\n",
+      "        print(\"The sequence '{}' has a %GC of {:.2f}\".format(input_seq, calc_gc_percent(input_seq)))\n",
+      "\n",
+      "* To test it, you can run the following command in your shell:\n",
+      "\n",
+      "        $ python seq_toolbox.py CAGG\n",
+      "\n",
+      "* Try it with `./seq_toolbox.py CAGG` instead. What happens?"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Improving our script: string functions\n",
+      "\n",
+      "* Try running the script with `cagg` as the input sequence. What happens?\n",
+      "\n",
+      "* One way to squash this potential bug is by using Python's string function `upper`.\n",
+      "* Python strings are objects with useful built-in functions\n",
+      "* Complete documentation is available in the interpreter, via `help(str)`\n",
+      "* Other objects like `int`s, `list`s, and `dict`s also have built-in functions.\n",
+      "* Let's check out some commonly used string functions.\n"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "my_str = 'Hello again, ipython!'"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 1
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "my_str.upper()"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 2,
+       "text": [
+        "'HELLO AGAIN, IPYTHON!'"
+       ]
+      }
+     ],
+     "prompt_number": 2
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "my_str.lower()"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 3,
+       "text": [
+        "'hello again, ipython!'"
+       ]
+      }
+     ],
+     "prompt_number": 3
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "my_str.title()"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 4,
+       "text": [
+        "'Hello Again, Ipython!'"
+       ]
+      }
+     ],
+     "prompt_number": 4
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "my_str.startswith('H')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 5,
+       "text": [
+        "True"
+       ]
+      }
+     ],
+     "prompt_number": 5
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "my_str.startswith('h')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 6,
+       "text": [
+        "False"
+       ]
+      }
+     ],
+     "prompt_number": 6
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "my_str.split(',')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 7,
+       "text": [
+        "['Hello again', ' ipython!']"
+       ]
+      }
+     ],
+     "prompt_number": 7
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "my_str.replace('ipython', 'lumc')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 8,
+       "text": [
+        "'Hello again, lumc!'"
+       ]
+      }
+     ],
+     "prompt_number": 8
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "my_str.count('n')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 9,
+       "text": [
+        "2"
+       ]
+      }
+     ],
+     "prompt_number": 9
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Improving our script: using upper()\n",
+      "\n",
+      "* Let's use `upper()` to fortify our function. It should now look something like this:\n",
+      "\n",
+      "        def calc_gc_percent(seq):\n",
+      "            at_count, gc_count = 0, 0\n",
+      "            for char in seq.upper():\n",
+      "                if char in ('A', 'T'):\n",
+      "                    at_count += 1\n",
+      "                elif char in ('G', 'C'):\n",
+      "                    gc_count += 1\n",
+      "                \n",
+      "            return gc_count * 100.0 / (gc_count + at_count)\n",
+      "\n",
+      "* And run it (whichever way you prefer). Do you get the expected output?"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Improving our script: comments & docstrings\n",
+      "\n",
+      "* Golden rule: write code for humans (this includes you in 6 months)\n",
+      "* Python provides two ways to accomplish this: comments and docstrings.\n",
+      "\n",
+      "## Comments\n",
+      "\n",
+      "* Any text following a `#` on a line is ignored by the interpreter\n",
+      "* Freeform text ~ anything that helps in understanding the code\n",
+      "\n",
+      "## Docstrings\n",
+      "\n",
+      "* Python's way of attaching proper documentation to its objects\n",
+      "* Official: The first string literal that occurs in a module, function, class or method definition\n",
+      "* Usually done using triple-quoted strings, to handle newlines more easily"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Improving our script: comments & docstrings\n",
+      "\n",
+      "* Open your script again in a text editor, and add the following comments & docstrings: \n",
+      "\n",
+      "        #!/usr/bin/env python\n",
+      "        \n",
+      "        import sys\n",
+      "\n",
+      "        def calc_gc_percent(seq):\n",
+      "            \"\"\"Calculates the GC percentage of the given sequence.\n",
+      "            \n",
+      "            Arguments:\n",
+      "                - seq - the input sequence (string).\n",
+      "            \n",
+      "            Returns:\n",
+      "                - GC percentage (float).\n",
+      "            \n",
+      "            The returned value is always <= 100.0\n",
+      "            \n",
+      "            \"\"\"\n",
+      "            at_count, gc_count = 0, 0\n",
+      "            # change input to all caps to allow for non-capital input sequence\n",
+      "            for char in seq.upper():\n",
+      "                if char in ('A', 'T'):\n",
+      "                    at_count += 1\n",
+      "                elif char in ('G', 'C'):\n",
+      "                    gc_count += 1\n",
+      "                \n",
+      "            return gc_count * 100.0 / (gc_count + at_count)\n",
+      "\n",
+      "        input_seq = sys.argv[1]\n",
+      "        print(\"The sequence '{}' has a %GC of {:.2f}\".format(input_seq, calc_gc_percent(input_seq)))"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Detour: PEP8 & other PEPs\n",
+      "\n",
+      "* Since comments and docstrings are basically free-form text, whether they are useful depends heavily on the developer.\n",
+      "* To mitigate this, the Python community has come up with practical conventions.\n",
+      "* They are documented in a document called PEP8.\n",
+      "* PEP8: Python Enhancement Proposal no. 8 ~ http://www.python.org/dev/peps/pep-0008/\n",
+      "* PEP257 is for docstrings specifically.\n",
+      "* Following them is not mandatory, but *very much* encouraged.\n",
+      "* PEPs are how Python grows. There are hundreds of them now, and all have to be approved by our BDFL."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Improving our script: handling errors & exceptions\n",
+      "\n",
+      "* Try running the script with `ACTG123` as the argument. What happens? Is this acceptable behavior?\n",
+      "* Sometimes we want to put safeguards in place to handle invalid inputs.\n",
+      "* In this case we only accept `ACTG`; all other characters are invalid.\n",
+      "* Python provides a way to break out of the normal execution flow, by raising what is called an exception.\n",
+      "* We can raise exceptions ourselves, by using the `raise` statement.\n",
+      "* Syntax: `raise {exception_type} ( {message} )`"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Improving our script: handling invalid inputs\n",
+      "\n",
+      "* Open your script, and edit the `if` clause to add our exception:\n",
+      "\n",
+      "        def calc_gc_percent(seq):\n",
+      "            \"\"\"Calculates the GC percentage of the given sequence.\n",
+      "           \n",
+      "            Arguments:\n",
+      "                - seq - the input sequence (string).\n",
+      "            \n",
+      "            Returns:\n",
+      "                - GC percentage (float).\n",
+      "            \n",
+      "            The returned value is always <= 100.0\n",
+      "            \n",
+      "            \"\"\"\n",
+      "            at_count, gc_count = 0, 0\n",
+      "            # change input to all caps to allow for non-capital input sequence\n",
+      "            for char in seq.upper():\n",
+      "                if char in ('A', 'T'):\n",
+      "                    at_count += 1\n",
+      "                elif char in ('G', 'C'):\n",
+      "                    gc_count += 1\n",
+      "                else:\n",
+      "                    raise ValueError(\"Unexpected character found: {}. Only ACTGs are allowed.\".format(char))\n",
+      "\n",
+      "            return gc_count * 100.0 / (gc_count + at_count)\n",
+      "\n",
+      "* Try running the script again with `ACTG123` as the argument. What happens now?"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Improving our script: handling corner cases\n",
+      "\n",
+      "* Try running the script with '' (two quote signs) as the argument. What happens? Why? Is this a valid input?\n",
+      "* We don't always want to let exceptions stop the program; sometimes we want to provide an alternative flow.\n",
+      "* The `try..except` block allows you to do this.\n",
+      "* The syntax is:\n",
+      "\n",
+      "        try:\n",
+      "            {statements that may raise exceptions}\n",
+      "        except {exception type}:\n",
+      "            {what to do when the exception type is raised}"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Improving our script: handling corner cases\n",
+      "\n",
+      "* Let's change our script by adding a `try..except` block:\n",
+      "\n",
+      "        def calc_gc_percent(seq):\n",
+      "            \"\"\"Calculates the GC percentage of the given sequence.\n",
+      "           \n",
+      "            Arguments:\n",
+      "                - seq - the input sequence (string).\n",
+      "            \n",
+      "            Returns:\n",
+      "                - GC percentage (float).\n",
+      "            \n",
+      "            The returned value is always <= 100.0\n",
+      "            \n",
+      "            \"\"\"\n",
+      "            at_count, gc_count = 0, 0\n",
+      "            # change input to all caps to allow for non-capital input sequence\n",
+      "            for char in seq.upper():\n",
+      "                if char in ('A', 'T'):\n",
+      "                    at_count += 1\n",
+      "                elif char in ('G', 'C'):\n",
+      "                    gc_count += 1\n",
+      "                else:\n",
+      "                    raise ValueError(\"Unexpected character found: {}. Only ACTGs are allowed.\".format(char))\n",
+      "            # corner case handling: empty input sequence\n",
+      "            try:\n",
+      "                return gc_count * 100.0 / (gc_count + at_count)\n",
+      "            except ZeroDivisionError:\n",
+      "                return 0.0"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Detour: Exception handling best practices\n",
+      "\n",
+      "## Aim for a minimal `try` block\n",
+      "\n",
+      "We want to be able to pinpoint the statements that may raise the exceptions so we can tailor our handling.\n",
+      "\n",
+      "Example of code that violates this principle:\n",
+      "\n",
+      "    try:\n",
+      "        my_function()\n",
+      "        my_other_function()\n",
+      "    except ValueError:\n",
+      "        my_fallback_function()\n",
+      "\n",
+      "A better way would be:\n",
+      "\n",
+      "    try:\n",
+      "        my_function()\n",
+      "    except ValueError:\n",
+      "        my_fallback_function()\n",
+      "    my_other_function()\n",
+      "\n",
+      "\n",
+      "## Be specific when handling exceptions\n",
+      "\n",
+      "The following code is syntactically valid, but *never* use it in your real scripts / programs:\n",
+      "\n",
+      "    try:\n",
+      "        my_function()\n",
+      "    except:\n",
+      "        my_fallback_function()\n",
+      "\n",
+      "*Always* name the specific exception type when handling exceptions, which makes for much cleaner code:\n",
+      "\n",
+      "    try:\n",
+      "        my_function()\n",
+      "    except ValueError:\n",
+      "        my_fallback_function()\n",
+      "    except TypeError:\n",
+      "        my_other_fallback_function()\n",
+      "    except IndexError:\n",
+      "        my_final_function()\n",
+      "\n",
+      "## Look Before You Leap (LBYL) vs Easier to Ask for Forgiveness than Permission (EAFP)\n",
+      " \n",
+      "We could have written our last exception block like so:\n",
+      "\n",
+      "    if gc_count + at_count == 0:\n",
+      "        return 0.0\n",
+      "    return gc_count * 100.0 / (gc_count + at_count)\n",
+      "\n",
+      "Both approaches are correct and each has its pluses and minuses in general. In this case, however, I would argue that EAFP is better since it makes the code more readable."
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Improving the script: handling corner cases\n",
+      "\n",
+      "* Now try running your script without any arguments at all. What happens?\n",
+      "* Armed with what you now know, how would you handle this situation?"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Working with modules\n",
+      "\n",
+      "* Sometimes it is useful to group functions and other objects in different files.\n",
+      "* Sometimes you need to use that fancy function you wrote 2 years ago.\n",
+      "* This is where modules in Python come in handy.\n",
+      "* More formally, modules allow you to share code in the form of libraries.\n",
+      "* You've seen one example: the `sys` module in the standard library.\n",
+      "* There are many other modules in the standard library, as we'll see soon.\n",
+      "* Let's start writing our own modules first."
+     ]
+    }
+   ],
+   "metadata": {}
+  }
+ ]
+}
\ No newline at end of file