Commit 6e1dd442, authored 6 months ago by Martin Larralde
Add Jupyter notebook with library usage in `lightmotif-py` docs [ci skip]
parent 5068087c
Pipeline #17396 skipped
Showing 2 changed files with 299 additions and 0 deletions
lightmotif-py/docs/guide/example.ipynb: 298 additions, 0 deletions
lightmotif-py/docs/guide/index.rst: 1 addition, 0 deletions
lightmotif-py/docs/guide/example.ipynb (0 → 100644, +298 −0)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Example\n",
"\n",
"This Jupyter notebook shows how to use the library with common examples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import lightmotif\n",
"lightmotif.__version__"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from urllib.request import urlopen"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a motif\n",
"\n",
"A `Motif` can be created from several sequences of the same length using the\n",
"`lightmotif.create` function. This first builds a `CountMatrix` from each \n",
"sequence position, and then creates a `WeightMatrix` and a `ScoringMatrix`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"motif = lightmotif.create([\"AATTGTGGTTA\", \"ATCTGTGGTTA\", \"TTCTGCGGTTA\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading a motif\n",
"\n",
"The `lightmotif.load` function can be used to load the motifs found in a given\n",
"file. Because it supports any file-like object, we can immediately download a\n",
"motif from the [JASPAR](https://jaspar.elixir.no/) database and parse it on \n",
"the fly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = \"https://jaspar.elixir.no/api/v1/matrix/MA0002.1.jaspar\"\n",
"with urlopen(url) as response:\n",
"    motif = next(lightmotif.load(response, format=\"jaspar16\"))\n",
"    print(f\"Loaded motif {motif.name} of length {len(motif.counts)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Adding pseudo-counts\n",
"\n",
"By default, the loaded scoring matrix is built with zero pseudo-counts and \n",
"a uniform background, which may not be ideal. Using the `CountMatrix.normalize`\n",
"and `WeightMatrix.log_odds` methods, we can build a new `ScoringMatrix` with\n",
"pseudo-counts of 0.1:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pssm = motif.counts.normalize(0.1).log_odds()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparing a sequence\n",
"\n",
"Since the motif we loaded is a human transcription factor binding site, \n",
"it makes sense to use a human sequence. As an example, we can load a \n",
"contig from the human chromosome 22 ([NT_167212.2](https://www.ncbi.nlm.nih.gov/nuccore/NT_167212.2))."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = \"https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&report=fasta&id=568801992\"\n",
"with urlopen(url) as response:\n",
"    response.readline()\n",
"    sequence = ''.join(line.strip().decode() for line in response)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To score a sequence with `lightmotif`, it must first be encoded and stored with\n",
"a particular memory layout. This is taken care of by the `lightmotif.stripe`\n",
"function. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"striped = lightmotif.stripe(sequence)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Calculate scores\n",
"\n",
"Once the sequence has been prepared, it can be used with the different functions\n",
"and methods of `lightmotif` to compute scores for each position. The most\n",
"basic functionality is to compute the PSSM scores for every position of the \n",
"sequence. This can be done with the `ScoringMatrix.calculate` method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"scores = pssm.calculate(striped)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The scores are computed in an efficient column-major matrix which can be used\n",
"to further extract high-scoring positions:\n",
"\n",
"- The `argmax` method returns the smallest index with the highest score\n",
"- The `max` method returns the highest score\n",
"- The `threshold` method returns a list of positions with a score above the given threshold"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f\"Highest score: {scores.max():.3f}\")\n",
"print(f\"Position with highest score: {scores.argmax()}\")\n",
"print(f\"Positions with score above 13: {scores.threshold(13.0)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Otherwise, the resulting array can be accessed by index, and flattened into\n",
"a list (or an `array`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Score at position 156007:\", scores[156007])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using p-value thresholds\n",
"\n",
"LightMotif features a re-implementation of the TFM-PVALUE algorithm, which \n",
"can convert between a bitscore and a p-value for a given scoring matrix. Use\n",
"the `ScoringMatrix.score` method to compute the score threshold for a *p-value*:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f\"Score threshold for p=1e-5: {pssm.score(1e-5):.3f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `ScoringMatrix.pvalue` method computes the *p-value* for a score, so\n",
"*p-values* can be reported for the scores obtained by the scoring pipeline:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for index in scores.threshold(13.0):\n",
"    print(f\"Hit at position {index:6}: score={scores[index]:.3f} p={pssm.pvalue(scores[index]):.3g}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scanning algorithm\n",
"\n",
"When a long sequence is being processed and only a handful of \n",
"significant hits are expected, using a scanner is much more efficient.\n",
"A `Scanner` can be created with the `lightmotif.scan` function, and yields\n",
"`Hit` objects for every position above the threshold parameter:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"scanner = lightmotif.scan(pssm, striped, threshold=13.0)\n",
"for hit in scanner:\n",
"    print(f\"Hit at position {hit.position:6}: score={hit.score:.3f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although it gives equivalent results to the `calculate` example above, the \n",
"`scan` implementation uses less memory and is generally faster for higher\n",
"threshold values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reverse-complement\n",
"\n",
"All the examples above show how to calculate the hits for the direct \n",
"strand. To process the reverse strand, one could reverse-complement the sequence;\n",
"however, it is much more efficient to reverse-complement the `ScoringMatrix`, \n",
"as it is usually much smaller in memory."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pssm_rc = pssm.reverse_complement()\n",
"scanner_rc = lightmotif.scan(pssm_rc, striped, threshold=13.0)"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
%% Cell type:markdown id: tags:
# Example
This Jupyter notebook shows how to use the library with common examples.
%% Cell type:code id: tags:
```
import lightmotif
lightmotif.__version__
```
%% Cell type:code id: tags:
```
from urllib.request import urlopen
```
%% Cell type:markdown id: tags:
## Creating a motif
A `Motif` can be created from several sequences of the same length using the
`lightmotif.create` function. This first builds a `CountMatrix` from each
sequence position, and then creates a `WeightMatrix` and a `ScoringMatrix`.
%% Cell type:code id: tags:
```
motif = lightmotif.create(["AATTGTGGTTA", "ATCTGTGGTTA", "TTCTGCGGTTA"])
```
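To make the first step concrete, here is a minimal pure-Python sketch of how a count matrix can be derived from aligned sequences. It does not use LightMotif: the `count_matrix` helper and the list-of-dicts layout are assumptions made for illustration, not the library's internal representation.

```python
from collections import Counter

ALPHABET = "ACGT"

def count_matrix(sequences):
    """Return one {symbol: count} dict per motif position (hypothetical helper)."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("all sequences must have the same length")
    matrix = []
    # zip(*sequences) walks the alignment column by column.
    for column in zip(*sequences):
        seen = Counter(column)
        matrix.append({symbol: seen.get(symbol, 0) for symbol in ALPHABET})
    return matrix

counts = count_matrix(["AATTGTGGTTA", "ATCTGTGGTTA", "TTCTGCGGTTA"])
print(counts[0])  # counts observed at the first motif position
```

Each column of the alignment becomes one row of counts, which is the information later turned into weights and scores.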
%% Cell type:markdown id: tags:
## Loading a motif
The `lightmotif.load` function can be used to load the motifs found in a given
file. Because it supports any file-like object, we can immediately download a
motif from the [JASPAR](https://jaspar.elixir.no/) database and parse it on
the fly:
%% Cell type:code id: tags:
```
url = "https://jaspar.elixir.no/api/v1/matrix/MA0002.1.jaspar"
with urlopen(url) as response:
    motif = next(lightmotif.load(response, format="jaspar16"))
    print(f"Loaded motif {motif.name} of length {len(motif.counts)}")
```
%% Cell type:markdown id: tags:
## Adding pseudo-counts
By default, the loaded scoring matrix is built with zero pseudo-counts and
a uniform background, which may not be ideal. Using the `CountMatrix.normalize`
and `WeightMatrix.log_odds` methods, we can build a new `ScoringMatrix` with
pseudo-counts of 0.1:
%% Cell type:code id: tags:
```
pssm = motif.counts.normalize(0.1).log_odds()
```
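The effect of pseudo-counts can be illustrated with a small pure-Python sketch. It assumes the pseudo-count is added to every cell before dividing by the column total, and that scores are base-2 log-odds against a uniform background; both are assumptions for illustration and may differ from what LightMotif does internally. The `normalize` and `log_odds` functions below are hypothetical stand-ins, not the library methods.

```python
import math

def normalize(counts, pseudocount=0.0):
    """Turn per-position counts into frequencies (assumed pseudo-count scheme)."""
    weights = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(column)
        weights.append({s: (c + pseudocount) / total for s, c in column.items()})
    return weights

def log_odds(weights):
    """Base-2 log-odds against a uniform background (assumption)."""
    scores = []
    for column in weights:
        background = 1.0 / len(column)
        scores.append({
            s: math.log2(w / background) if w > 0 else float("-inf")
            for s, w in column.items()
        })
    return scores

counts = [{"A": 2, "C": 0, "G": 0, "T": 1}]
without_pseudo = log_odds(normalize(counts))[0]["C"]
with_pseudo = log_odds(normalize(counts, 0.1))[0]["C"]
print(without_pseudo, with_pseudo)
```

Without pseudo-counts, a symbol never observed at a position gets a score of negative infinity, so a single mismatch rules out a site entirely. A small pseudo-count keeps that score finite.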
%% Cell type:markdown id: tags:
## Preparing a sequence
Since the motif we loaded is a human transcription factor binding site,
it makes sense to use a human sequence. As an example, we can load a
contig from the human chromosome 22 ([NT_167212.2](https://www.ncbi.nlm.nih.gov/nuccore/NT_167212.2)).
%% Cell type:code id: tags:
```
url = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?tool=portal&save=file&report=fasta&id=568801992"
with urlopen(url) as response:
    response.readline()
    sequence = ''.join(line.strip().decode() for line in response)
```
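The download cell skips the FASTA header with `readline()` and joins the remaining lines. The same pattern can be tried offline on an in-memory stream; the sketch below uses `io.BytesIO` as a stand-in for the HTTP response, and the `read_single_fasta` helper and the demo record are made up for illustration.

```python
import io

def read_single_fasta(handle):
    """Read one FASTA record from a binary file-like object (hypothetical helper)."""
    header = handle.readline().decode().strip()
    if not header.startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    # Remaining lines hold the sequence, wrapped over several lines.
    sequence = "".join(line.strip().decode() for line in handle)
    return header[1:], sequence

fake_response = io.BytesIO(b">demo contig\nACGT\nTTGA\n")
name, sequence = read_single_fasta(fake_response)
print(name, sequence)
```

This only handles a file containing a single record, which matches the single-contig download used above.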
%% Cell type:markdown id: tags:
To score a sequence with `lightmotif`, it must first be encoded and stored with
a particular memory layout. This is taken care of by the `lightmotif.stripe`
function.
%% Cell type:code id: tags:
```
striped = lightmotif.stripe(sequence)
```
%% Cell type:markdown id: tags:
## Calculate scores
Once the sequence has been prepared, it can be used with the different functions
and methods of `lightmotif` to compute scores for each position. The most
basic functionality is to compute the PSSM scores for every position of the
sequence. This can be done with the `ScoringMatrix.calculate` method:
%% Cell type:code id: tags:
```
scores = pssm.calculate(striped)
```
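Conceptually, computing the PSSM scores for every position is a sliding-window sum. The following pure-Python sketch shows that computation on a toy matrix; it is a naive reference under the assumption that the matrix is a list of per-position dictionaries, and it says nothing about the striped, vectorised implementation LightMotif actually uses.

```python
def calculate(pssm, sequence):
    """Score every window of len(pssm) symbols in the sequence (naive sketch)."""
    width = len(pssm)
    return [
        sum(pssm[j][sequence[i + j]] for j in range(width))
        for i in range(len(sequence) - width + 1)
    ]

# Toy two-column matrix that rewards the dinucleotide "AC".
pssm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
]
scores = calculate(pssm, "ACGAC")
print(scores)
```

A sequence of length n scored with a motif of width w yields n - w + 1 scores, one per possible start position.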
%% Cell type:markdown id: tags:
The scores are computed in an efficient column-major matrix which can be used
to further extract high-scoring positions:
- The `argmax` method returns the smallest index with the highest score
- The `max` method returns the highest score
- The `threshold` method returns a list of positions with a score above the given threshold
%% Cell type:code id: tags:
```
print(f"Highest score: {scores.max():.3f}")
print(f"Position with highest score: {scores.argmax()}")
print(f"Positions with score above 13: {scores.threshold(13.0)}")
```
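For readers unfamiliar with these three reductions, here is what they correspond to on a plain Python list. This is only an analogy: the variable names are illustrative, and whether the library treats a score exactly equal to the threshold as a hit is not stated above, so the example avoids that case.

```python
scores = [2.0, -2.0, 0.5, 2.0]

best = max(scores)                  # highest score
first_best = scores.index(best)     # smallest index holding the highest score
above = [i for i, s in enumerate(scores) if s > 1.0]  # positions above 1.0

print(best, first_best, above)
```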
%% Cell type:markdown id: tags:
Otherwise, the resulting array can be accessed by index, and flattened into
a list (or an `array`):
%% Cell type:code id: tags:
```
print("Score at position 156007:", scores[156007])
```
%% Cell type:markdown id: tags:
## Using p-value thresholds
LightMotif features a re-implementation of the TFM-PVALUE algorithm, which
can convert between a bitscore and a p-value for a given scoring matrix. Use
the `ScoringMatrix.score` method to compute the score threshold for a *p-value*:
%% Cell type:code id: tags:
```
print(f"Score threshold for p=1e-5: {pssm.score(1e-5):.3f}")
```
%% Cell type:markdown id: tags:
The `ScoringMatrix.pvalue` method computes the *p-value* for a score, so
*p-values* can be reported for the scores obtained by the scoring pipeline:
%% Cell type:code id: tags:
```
for index in scores.threshold(13.0):
    print(f"Hit at position {index:6}: score={scores[index]:.3f} p={pssm.pvalue(scores[index]):.3g}")
```
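To clarify what a p-value means here: it is the probability that a random site, drawn from the background distribution, scores at least as high as the given score. For a very short motif this can be computed by brute force, as in the sketch below. This is not the algorithm the library uses; it assumes a uniform background, enumerates all 4^k sites, and the `exact_pvalue` name is hypothetical.

```python
from itertools import product

def exact_pvalue(pssm, score):
    """Fraction of all possible sites scoring >= score (uniform background assumed)."""
    alphabet = sorted(pssm[0])
    hits = 0
    # Exhaustive enumeration: only feasible for very short motifs.
    for site in product(alphabet, repeat=len(pssm)):
        if sum(column[s] for column, s in zip(pssm, site)) >= score:
            hits += 1
    return hits / len(alphabet) ** len(pssm)

pssm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
]
print(exact_pvalue(pssm, 2.0))  # only "AC" reaches the maximum score
```

Exhaustive enumeration grows exponentially with motif length, which is why dedicated algorithms are needed for realistic matrices.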
%% Cell type:markdown id: tags:
## Scanning algorithm
When a long sequence is being processed and only a handful of
significant hits are expected, using a scanner is much more efficient.
A `Scanner` can be created with the `lightmotif.scan` function, and yields
`Hit` objects for every position above the threshold parameter:
%% Cell type:code id: tags:
```
scanner = lightmotif.scan(pssm, striped, threshold=13.0)
for hit in scanner:
    print(f"Hit at position {hit.position:6}: score={hit.score:.3f}")
```
%% Cell type:markdown id: tags:
Although it gives equivalent results to the `calculate` example above, the
`scan` implementation uses less memory and is generally faster for higher
threshold values.
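The memory argument can be sketched with a plain Python generator that yields hits one at a time instead of materialising every score. This only illustrates the lazy, hit-by-hit interface; it is an assumption-laden toy that does not reproduce how the library's scanner gains its speed, and the `scan` function below is a hypothetical stand-in for `lightmotif.scan`.

```python
def scan(pssm, sequence, threshold):
    """Lazily yield (position, score) pairs for windows scoring >= threshold."""
    width = len(pssm)
    for i in range(len(sequence) - width + 1):
        score = sum(pssm[j][sequence[i + j]] for j in range(width))
        if score >= threshold:
            yield i, score  # nothing is stored for positions below the threshold

pssm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
]
hits = list(scan(pssm, "ACGAC", threshold=1.0))
print(hits)
```

Because only hits are kept, memory use depends on the number of hits rather than on the sequence length.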
%% Cell type:markdown id: tags:
## Reverse-complement
All the examples above show how to calculate the hits for the direct
strand. To process the reverse strand, one could reverse-complement the sequence;
however, it is much more efficient to reverse-complement the `ScoringMatrix`,
as it is usually much smaller in memory.
%% Cell type:code id: tags:
```
pssm_rc = pssm.reverse_complement()
scanner_rc = lightmotif.scan(pssm_rc, striped, threshold=13.0)
```
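The equivalence between the two approaches can be checked on a toy example in pure Python. The sketch below uses hypothetical helpers and a list-of-dicts matrix (not LightMotif objects) to show that scoring the direct strand with the reverse-complemented matrix gives the same scores as scoring the reverse-complemented sequence with the original matrix, with the positions mirrored.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement_sequence(sequence):
    return "".join(COMPLEMENT[s] for s in reversed(sequence))

def reverse_complement_matrix(pssm):
    # Reverse the column order and swap each symbol with its complement.
    return [{COMPLEMENT[s]: v for s, v in column.items()} for column in reversed(pssm)]

def calculate(pssm, sequence):
    width = len(pssm)
    return [
        sum(pssm[j][sequence[i + j]] for j in range(width))
        for i in range(len(sequence) - width + 1)
    ]

# Integer scores keep the comparison exact regardless of summation order.
pssm = [
    {"A": 3, "C": -1, "G": -2, "T": 0},
    {"A": -1, "C": 2, "G": 0, "T": -3},
    {"A": 0, "C": -2, "G": 4, "T": -1},
]
sequence = "ACGTTGCACG"

via_matrix = calculate(reverse_complement_matrix(pssm), sequence)
via_sequence = calculate(pssm, reverse_complement_sequence(sequence))
print(via_matrix == via_sequence[::-1])
```

Reverse-complementing the matrix touches only a few dozen values, whereas reverse-complementing the sequence requires copying the whole contig.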
lightmotif-py/docs/guide/index.rst (+1 −0)
...
...
@@ -8,6 +8,7 @@ This section contains guides and documents about LightMotif usage.
:caption: Getting Started
Installation <install>
Example <example>
.. toctree::
:maxdepth: 1
...
...