Add Jupyter notebook with library usage in `lightmotif-py` docs [ci skip]

6e1dd442 · Martin Larralde · 5068087c · 6e1dd442 · 6e1dd442
Commit 6e1dd442 authored 6 months ago by Martin Larralde
--- a/lightmotif-py/docs/guide/example.ipynb
+++ b/lightmotif-py/docs/guide/example.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Example\n",
+    "\n",
+    "This Jupyter notebook shows how to use the library with common examples."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import lightmotif\n",
+    "lightmotif.__version__"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from urllib.request import urlopen"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Creating a motif\n",
+    "\n",
+    "A `Motif` can be created from several sequences of the same length using the\n",
+    "`lightmotif.create` function. This first builds a `CountMatrix` from each \n",
+    "sequence position, and then creates a `WeightMatrix` and a `ScoringMatrix`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "motif = lightmotif.create([\"AATTGTGGTTA\", \"ATCTGTGGTTA\", \"TTCTGCGGTTA\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Loading a motif\n",
+    "\n",
+    "The `lightmotif.load` function can be used to load the motifs found in a given\n",
+    "file. Because it supports any file-like object, we can immediately download a\n",
+    "motif from the [JASPAR](https://jaspar.elixir.no/) database and parse it on \n",
+    "the fly:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "url = \"https://jaspar.elixir.no/api/v1/matrix/MA0002.1.jaspar\"\n",
+    "with urlopen(url) as response:\n",
+    "    motif = next(lightmotif.load(response, format=\"jaspar16\"))\n",
+    "    print(f\"Loaded motif {motif.name} of length {len(motif.counts)}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Adding pseudo-counts\n",
+    "\n",
+    "By default, the loaded scoring matrix is built with zero pseudo-counts and \n",
+    "a uniform background, which may not be ideal. Using the `CountMatrix.normalize`\n",
+    "and `WeightMatrix.log_odds` methods, we can build a new `ScoringMatrix` with\n",
+    "pseudo-counts of 0.1:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pssm = motif.counts.normalize(0.1).log_odds()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Preparing a sequence\n",
+    "\n",
+    "Since the motif we loaded is a human transcription factor binding site, \n",
+    "it makes sense to use a human sequence. As an example, we can load a \n",
+    "contig from the human chromosome 22 ([NT_167212.2](https://www.ncbi.nlm.nih.gov/nuccore/NT_167212.2))."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "url = \"https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&report=fasta&id=568801992\"\n",
+    "with urlopen(url) as response:\n",
+    "    response.readline()\n",
+    "    sequence = ''.join(line.strip().decode() for line in response)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To score a sequence with `lightmotif`, it must first be encoded and stored with\n",
+    "a particular memory layout. This is taken care of by the `lightmotif.stripe`\n",
+    "function. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "striped = lightmotif.stripe(sequence)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Calculate scores\n",
+    "\n",
+    "Once the sequence has been prepared, it can be used with the different functions\n",
+    "and methods of `lightmotif` to compute scores for each position. The most\n",
+    "basic functionality is to compute the PSSM scores for every position of the \n",
+    "sequence. This can be done with the `ScoringMatrix.calculate` method:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "scores = pssm.calculate(striped)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The scores are computed in an efficient column-major matrix which can be used\n",
+    "to further extract high scoring positions:\n",
+    "\n",
+    "- The `argmax` method returns the smallest index with the highest score\n",
+    "- The `max` method returns the highest score\n",
+    "- The `threshold` method returns a list of positions with a score above the given score"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(f\"Highest score: {scores.max():.3f}\")\n",
+    "print(f\"Position with highest score: {scores.argmax()}\")\n",
+    "print(f\"Positions with score above 13: {scores.threshold(13.0)}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Otherwise, the resulting array can be accessed by index, and flattened into\n",
+    "a list (or an `array`):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"Score at position 156007:\", scores[156007])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Using p-value thresholds\n",
+    "\n",
+    "LightMotif features a re-implementation of the TFM-PVALUE algorithm which \n",
+    "can convert between a bitscore and a p-value for a given scoring matrix. Use\n",
+    "the `ScoringMatrix.score` method to compute the score threshold for a *p-value*:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(f\"Score threshold for p=1e-5: {pssm.score(1e-5):.3f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `ScoringMatrix.pvalue` method computes the *p-value* for a given score, which\n",
+    "makes it possible to get p-values for the scores obtained by the scoring pipeline:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for index in scores.threshold(13.0):\n",
+    "    print(f\"Hit at position {index:6}: score={scores[index]:.3f} p={pssm.pvalue(scores[index]):.3g}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Scanning algorithm\n",
+    "\n",
+    "For cases where a long sequence is being processed, and only a handful of \n",
+    "significant hits are expected, using a scanner will be much more efficient.\n",
+    "A `Scanner` can be created with the `lightmotif.scan` function, and yields\n",
+    "`Hit` objects for every position above the threshold parameter:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "scanner = lightmotif.scan(pssm, striped, threshold=13.0)\n",
+    "for hit in scanner:\n",
+    "    print(f\"Hit at position {hit.position:6}: score={hit.score:.3f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Although it gives equivalent results to the `calculate` example above, the \n",
+    "`scan` implementation uses less memory and is generally faster for higher\n",
+    "threshold values."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Reverse-complement\n",
+    "\n",
+    "All the examples above show how to calculate the hits for the direct \n",
+    "strand. To process the reverse strand, one could reverse-complement the sequence;\n",
+    "however, it is much more efficient to reverse-complement the `ScoringMatrix`, \n",
+    "as it is usually much smaller in memory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pssm_rc = pssm.reverse_complement()\n",
+    "scanner_rc = lightmotif.scan(pssm_rc, striped, threshold=13.0)"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
+%% Cell type:markdown id: tags:
+
+# Example
+
+This Jupyter notebook shows how to use the library with common examples.
+
+%% Cell type:code id: tags:
+
+``` 
+import lightmotif
+lightmotif.__version__
+```
+
+%% Cell type:code id: tags:
+
+``` 
+from urllib.request import urlopen
+```
+
+%% Cell type:markdown id: tags:
+
+## Creating a motif
+
+A `Motif` can be created from several sequences of the same length using the
+`lightmotif.create` function. This first builds a `CountMatrix` from each
+sequence position, and then creates a `WeightMatrix` and a `ScoringMatrix`.
+
+%% Cell type:code id: tags:
+
+``` 
+motif = lightmotif.create(["AATTGTGGTTA", "ATCTGTGGTTA", "TTCTGCGGTTA"])
+```
+
+%% Cell type:markdown id: tags:
+
+## Loading a motif
+
+The `lightmotif.load` function can be used to load the motifs found in a given
+file. Because it supports any file-like object, we can immediately download a
+motif from the [JASPAR](https://jaspar.elixir.no/) database and parse it on
+the fly:
+
+%% Cell type:code id: tags:
+
+``` 
+url = "https://jaspar.elixir.no/api/v1/matrix/MA0002.1.jaspar"
+with urlopen(url) as response:
+    motif = next(lightmotif.load(response, format="jaspar16"))
+    print(f"Loaded motif {motif.name} of length {len(motif.counts)}")
+```
+
+%% Cell type:markdown id: tags:
+
+## Adding pseudo-counts
+
+By default, the loaded scoring matrix is built with zero pseudo-counts and
+a uniform background, which may not be ideal. Using the `CountMatrix.normalize`
+and `WeightMatrix.log_odds` methods, we can build a new `ScoringMatrix` with
+pseudo-counts of 0.1:
+
+%% Cell type:code id: tags:
+
+``` 
+pssm = motif.counts.normalize(0.1).log_odds()
+```
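To make the count, weight and score steps concrete, here is a minimal pure-Python sketch of the same pipeline. It is an illustration only: the helper names (`count_matrix`, `normalize`, `log_odds`), the base-2 logarithm and the uniform background are assumptions made for this example, and `lightmotif` may compute its matrices differently.

```python
import math

ALPHABET = "ACGT"

def count_matrix(sequences):
    """Count each symbol at each position of equal-length sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    return [
        {base: sum(s[i] == base for s in sequences) for base in ALPHABET}
        for i in range(length)
    ]

def normalize(counts, pseudocount=0.0):
    """Turn counts into per-position frequencies, adding a pseudo-count to every cell."""
    weights = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(ALPHABET)
        weights.append({b: (column[b] + pseudocount) / total for b in ALPHABET})
    return weights

def log_odds(weights, background=None):
    """Log2 ratio of each frequency over the background frequency (uniform by default)."""
    background = background or {b: 1 / len(ALPHABET) for b in ALPHABET}
    return [
        {
            b: math.log2(col[b] / background[b]) if col[b] > 0 else -math.inf
            for b in ALPHABET
        }
        for col in weights
    ]

sequences = ["AATTGTGGTTA", "ATCTGTGGTTA", "TTCTGCGGTTA"]
counts = count_matrix(sequences)
toy_pssm = log_odds(normalize(counts, pseudocount=0.1))
```

Without pseudo-counts, any symbol never observed at a position gets a score of negative infinity, so a single mismatch rules out a site entirely. A small pseudo-count keeps every score finite.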
+
+%% Cell type:markdown id: tags:
+
+## Preparing a sequence
+
+Since the motif we loaded is a human transcription factor binding site,
+it makes sense to use a human sequence. As an example, we can load a
+contig from the human chromosome 22 ([NT_167212.2](https://www.ncbi.nlm.nih.gov/nuccore/NT_167212.2)).
+
+%% Cell type:code id: tags:
+
+``` 
+url = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&report=fasta&id=568801992"
+with urlopen(url) as response:
+    response.readline()
+    sequence = ''.join(line.strip().decode() for line in response)
+```
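The download cell above skips the FASTA header with `readline()` and joins the remaining lines. The same logic can be tried offline on an in-memory buffer. In this sketch, `read_single_fasta` and the toy record are hypothetical names made up for illustration, and the function assumes the response holds exactly one record.

```python
import io

def read_single_fasta(handle):
    """Read one FASTA record from a binary file-like object.

    Returns a (header, sequence) tuple. Assumes a single record per handle.
    """
    header = handle.readline().decode().strip()
    if not header.startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    # Remaining lines are sequence data; strip newlines and concatenate.
    sequence = "".join(line.strip().decode() for line in handle)
    return header[1:], sequence

# A BytesIO buffer stands in for the HTTP response object.
fake_response = io.BytesIO(b">toy_contig example\nACGTAC\nGGTTAA\n\n")
name, seq = read_single_fasta(fake_response)
```

Checking the header explicitly makes a failed download (for instance an HTML error page) fail early instead of silently producing a garbage sequence.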
+
+%% Cell type:markdown id: tags:
+
+To score a sequence with `lightmotif`, it must first be encoded and stored with
+a particular memory layout. This is taken care of by the `lightmotif.stripe`
+function.
+
+%% Cell type:code id: tags:
+
+``` 
+striped = lightmotif.stripe(sequence)
+```
+
+%% Cell type:markdown id: tags:
+
+## Calculate scores
+
+Once the sequence has been prepared, it can be used with the different functions
+and methods of `lightmotif` to compute scores for each position. The most
+basic functionality is to compute the PSSM scores for every position of the
+sequence. This can be done with the `ScoringMatrix.calculate` method:
+
+%% Cell type:code id: tags:
+
+``` 
+scores = pssm.calculate(striped)
+```
+
+%% Cell type:markdown id: tags:
+
+The scores are computed in an efficient column-major matrix which can be used
+to further extract high scoring positions:
+
+- The `argmax` method returns the smallest index with the highest score
+- The `max` method returns the highest score
+- The `threshold` method returns a list of positions with a score above the given score
+
+%% Cell type:code id: tags:
+
+``` 
+print(f"Highest score: {scores.max():.3f}")
+print(f"Position with highest score: {scores.argmax()}")
+print(f"Positions with score above 13: {scores.threshold(13.0)}")
+```
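For intuition about what these three methods return, the sketch below scores every window of a short sequence with a tiny matrix in plain Python. The matrix values, the `score_positions` helper and the inclusive threshold are assumptions chosen for the example. This is not how `lightmotif` computes scores internally.

```python
def score_positions(pssm, sequence):
    """Score every window of len(pssm) symbols in the sequence."""
    width = len(pssm)
    return [
        sum(pssm[j][sequence[i + j]] for j in range(width))
        for i in range(len(sequence) - width + 1)
    ]

# Hypothetical 3-column matrix that favours the word "GTA".
gta_pssm = [
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 2.0},
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
]
toy_scores = score_positions(gta_pssm, "ACGTAGGTAC")

best = max(toy_scores)                 # analogous to `max`
best_index = toy_scores.index(best)    # analogous to `argmax`: smallest index on ties
# Analogous to `threshold`; treating the cutoff as inclusive is an assumption here.
above = [i for i, s in enumerate(toy_scores) if s >= 3.0]
```

The word "GTA" occurs at positions 2 and 6, so both reach the top score, and the first one is reported as the best position.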
+
+%% Cell type:markdown id: tags:
+
+Otherwise, the resulting array can be accessed by index, and flattened into
+a list (or an `array`):
+
+%% Cell type:code id: tags:
+
+``` 
+print("Score at position 156007:", scores[156007])
+```
+
+%% Cell type:markdown id: tags:
+
+## Using p-value thresholds
+
+LightMotif features a re-implementation of the TFM-PVALUE algorithm which
+can convert between a bitscore and a p-value for a given scoring matrix. Use
+the `ScoringMatrix.score` method to compute the score threshold for a *p-value*:
+
+%% Cell type:code id: tags:
+
+``` 
+print(f"Score threshold for p=1e-5: {pssm.score(1e-5):.3f}")
+```
+
+%% Cell type:markdown id: tags:
+
+The `ScoringMatrix.pvalue` method computes the *p-value* for a given score, which
+makes it possible to get p-values for the scores obtained by the scoring pipeline:
+
+%% Cell type:code id: tags:
+
+``` 
+for index in scores.threshold(13.0):
+    print(f"Hit at position {index:6}: score={scores[index]:.3f} p={pssm.pvalue(scores[index]):.3g}")
+```
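The p-value of a score is the probability that a random word drawn from the background reaches at least that score. For a very short matrix this can be checked by brute force, as in the sketch below. This enumeration is not the algorithm used by `lightmotif`: it is exponential in the motif length and only meant to illustrate the definition. The `exact_pvalue` helper, the toy matrix and the uniform background are assumptions.

```python
from itertools import product

def exact_pvalue(pssm, score, background=None):
    """Probability that a random word scores at least `score` (brute force)."""
    alphabet = "ACGT"
    background = background or {b: 0.25 for b in alphabet}
    total = 0.0
    for word in product(alphabet, repeat=len(pssm)):
        word_score = sum(column[b] for column, b in zip(pssm, word))
        if word_score >= score:
            # Probability of drawing this exact word from the background.
            probability = 1.0
            for b in word:
                probability *= background[b]
            total += probability
    return total

# Hypothetical 3-column matrix that favours the word "GTA".
gta_pssm = [
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 2.0},
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
]
p_perfect = exact_pvalue(gta_pssm, 6.0)    # only "GTA" reaches 6
p_one_miss = exact_pvalue(gta_pssm, 3.0)   # "GTA" plus the 9 words with one mismatch
```

With 64 possible words of length 3, the perfect match has a p-value of 1/64 and allowing one mismatch raises it to 10/64.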
+
+%% Cell type:markdown id: tags:
+
+## Scanning algorithm
+
+For cases where a long sequence is being processed, and only a handful of
+significant hits are expected, using a scanner will be much more efficient.
+A `Scanner` can be created with the `lightmotif.scan` function, and yields
+`Hit` objects for every position above the threshold parameter:
+
+%% Cell type:code id: tags:
+
+``` 
+scanner = lightmotif.scan(pssm, striped, threshold=13.0)
+for hit in scanner:
+    print(f"Hit at position {hit.position:6}: score={hit.score:.3f}")
+```
+
+%% Cell type:markdown id: tags:
+
+Although it gives equivalent results to the `calculate` example above, the
+`scan` implementation uses less memory and is generally faster for higher
+threshold values.
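The memory argument can be illustrated with a plain Python generator that yields hits one at a time instead of storing a score for every position. This is only a conceptual sketch: `ToyHit` and `toy_scan` are hypothetical names, the inclusive threshold is an assumption, and the actual `Scanner` is implemented differently.

```python
from typing import Iterator, NamedTuple

class ToyHit(NamedTuple):
    position: int
    score: float

def toy_scan(pssm, sequence, threshold) -> Iterator[ToyHit]:
    """Lazily yield positions scoring at or above the threshold."""
    width = len(pssm)
    for i in range(len(sequence) - width + 1):
        score = sum(pssm[j][sequence[i + j]] for j in range(width))
        if score >= threshold:
            yield ToyHit(i, score)

# Hypothetical 3-column matrix that favours the word "GTA".
gta_pssm = [
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 2.0},
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
]
hits = list(toy_scan(gta_pssm, "ACGTAGGTAC", threshold=3.0))
```

Only the hits are ever kept, so memory use grows with the number of hits rather than with the sequence length.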
+
+%% Cell type:markdown id: tags:
+
+## Reverse-complement
+
+All the examples above show how to calculate the hits for the direct
+strand. To process the reverse strand, one could reverse-complement the sequence;
+however, it is much more efficient to reverse-complement the `ScoringMatrix`,
+as it is usually much smaller in memory.
+
+%% Cell type:code id: tags:
+
+``` 
+pssm_rc = pssm.reverse_complement()
+scanner_rc = lightmotif.scan(pssm_rc, striped, threshold=13.0)
+```
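The equivalence between the two approaches can be checked on a toy example in plain Python: scoring the direct strand with a reverse-complemented matrix gives the same scores as scoring the reverse-complemented sequence with the original matrix, in reverse order. The helper names and the matrix below are assumptions made for this sketch and are not part of the `lightmotif` API.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement_sequence(sequence):
    """Reverse the sequence and complement every base."""
    return "".join(COMPLEMENT[b] for b in reversed(sequence))

def reverse_complement_matrix(pssm):
    """Reverse the column order and swap each base's score with its complement's."""
    return [{b: column[COMPLEMENT[b]] for b in column} for column in reversed(pssm)]

def score_positions(pssm, sequence):
    """Score every window of len(pssm) symbols in the sequence."""
    width = len(pssm)
    return [
        sum(pssm[j][sequence[i + j]] for j in range(width))
        for i in range(len(sequence) - width + 1)
    ]

# Hypothetical 3-column matrix that favours the word "GTA".
gta_pssm = [
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 2.0},
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
]
sequence = "ACGTAGGTAC"

# Option 1: transform the (small) matrix, keep the sequence as is.
via_matrix = score_positions(reverse_complement_matrix(gta_pssm), sequence)
# Option 2: transform the (large) sequence, keep the matrix as is.
via_sequence = score_positions(gta_pssm, reverse_complement_sequence(sequence))
```

Here `via_matrix` equals `via_sequence` reversed, and positions in `via_matrix` are already expressed in direct-strand coordinates, which is one more reason to transform the matrix rather than the sequence.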
--- a/lightmotif-py/docs/guide/index.rst
+++ b/lightmotif-py/docs/guide/index.rst
@@ -8,6 +8,7 @@ This section contains guides and documents about LightMotif usage.
   :caption: Getting Started

    Installation <install>
+    Example <example>

 .. toctree::
   :maxdepth: 1