diff --git a/lightmotif-py/docs/guide/example.ipynb b/lightmotif-py/docs/guide/example.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..26cdea020ee426020d9098136636824ada6bd709 --- /dev/null +++ b/lightmotif-py/docs/guide/example.ipynb @@ -0,0 +1,298 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Example\n", + "\n", + "This Jupyter notebook shows how to use the library with common examples." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import lightmotif\n", + "lightmotif.__version__" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from urllib.request import urlopen" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Creating a motif\n", + "\n", + "A `Motif` can be created from several sequences of the same length using the\n", + "`lightmotif.create` function. This first builds a `CountMatrix` from each \n", + "sequence position, and then creates a `WeightMatrix` and a `ScoringMatrix`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "motif = lightmotif.create([\"AATTGTGGTTA\", \"ATCTGTGGTTA\", \"TTCTGCGGTTA\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Loading a motif\n", + "\n", + "The `lightmotif.load` function can be used to load the motifs found in a given\n", + "file. Because it supports any file-like object, we can immediately download a\n", + "motif from the [JASPAR](https://jaspar.elixir.no/) database and parse it on \n", + "the fly:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "url = \"https://jaspar.elixir.no/api/v1/matrix/MA0002.1.jaspar\"\n", + "with urlopen(url) as response:\n", + " motif = next(lightmotif.load(response, format=\"jaspar16\"))\n", + " print(f\"Loaded motif {motif.name} of length {len(motif.counts)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Adding pseudo-counts\n", + "\n", + "By default, the loaded scoring matrix is built with zero pseudo-counts and \n", + "a uniform background, which may not be ideal. Using the `CountMatrix.normalize`\n", + "and `WeightMatrix.log_odds` methods, we can build a new `ScoringMatrix` with\n", + "pseudo-counts of 0.1:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pssm = motif.counts.normalize(0.1).log_odds()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Preparing a sequence\n", + "\n", + "Since the motif we loaded is a human transcription factor binding site, \n", + "it makes sense to use a human sequence. As an example, we can load a \n", + "contig from the human chromosome 22 ([NT_167212.2](https://www.ncbi.nlm.nih.gov/nuccore/NT_167212.2))." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "url = \"https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&report=fasta&id=568801992\"\n", + "with urlopen(url) as response:\n", + " response.readline()\n", + " sequence = ''.join(line.strip().decode() for line in response)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To score a sequence with `lightmotif`, if must be first encoded and stored with\n", + "a particular memory layout. This is taken care of by the `lightmotif.stripe`\n", + "function. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "striped = lightmotif.stripe(sequence)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Calculate scores\n", + "\n", + "Once the sequence has been prepared, it can be used with the different functions\n", + "and methods of `lightmotif` to compute scores for each position. The most most\n", + "basic functionality is to compute the PSSM scores for every position of the \n", + "sequence. This can be done with the `ScoringMatrix.calculate` method:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "scores = pssm.calculate(striped)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The scores are computed in an efficient column-major matrix which can be used\n", + "to further extract high scoring positions:\n", + "\n", + "- The `argmax` method returns the smallest index with the highest score\n", + "- The `max` method returns the highest score\n", + "- The `threshold` method returns a list of positions with a score above the given score" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(f\"Highest score: {scores.max():.3f}\")\n", + "print(f\"Position with highest score: {scores.argmax()}\")\n", + "print(f\"Position with score above 14: {scores.threshold(13.0)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Otherwise, the resulting array can be accessed by index, and flattened into\n", + "a list (or an `array`):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Score at position 90517:\", scores[156007])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using p-value thresholds\n", + "\n", + "LightMotif features a re-implementation of the TFP-PVALUE algorithm which \n", + "can convert between a bitscore and a p-value for a given scoring matrix. Use\n", + "the `ScoringMatrix.score` method to compute the score threshold for a *p-value*:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(f\"Score threshold for p=1e-5: {pssm.score(1e-5):.3f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `ScoringMatrix.pvalue` method can compute the *p-value* for a score, allowing\n", + "to compute them for scores obtained by the scoring pipeline:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for index in scores.threshold(13.0):\n", + " print(f\"Hit at position {index:6}: score={scores[index]:.3f} p={pssm.pvalue(scores[index]):.3g}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Scanning algorithm\n", + "\n", + "For cases where a long sequence is being processed, and only a handful of \n", + "significative hits is expected, using a scanner will be much more efficient.\n", + "A `Scanner` can be created with the `lightmotif.scan` function, and yields\n", + "`Hit` objects for every position above the threshold parameter:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "scanner = lightmotif.scan(pssm, striped, threshold=13.0)\n", + "for hit in scanner:\n", + " print(f\"Hit at position {hit.position:6}: score={hit.score:.3f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Although it gives equivalent results to the `calculate` example above, the \n", + "`scan` implementation uses less memory and is generally faster for higher\n", + "threshold values." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Reverse-complement\n", + "\n", + "All the examples above are showing how to calculate the hits for the direct \n", + "strand. To process the reverse-strand, one could reverse-complement the sequence;\n", + "however, it is much more efficient to reverse-complement the `ScoringMatrix`, \n", + "as it is usually much smaller in memory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "psmm_rc = pssm.reverse_complement()\n", + "scanner_rc = lightmotif.scan(psmm_rc, striped, threshold=13.0)" + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/lightmotif-py/docs/guide/index.rst b/lightmotif-py/docs/guide/index.rst index f76d0e546ccb95235e664c235c2f8169ba5b6af7..f29fa22dd54f4c439adcf222c221bdd0714f539c 100644 --- a/lightmotif-py/docs/guide/index.rst +++ b/lightmotif-py/docs/guide/index.rst @@ -8,6 +8,7 @@ This section contains guides and documents about LightMotif usage. :caption: Getting Started Installation <install> + Example <example> .. toctree:: :maxdepth: 1