*A lightweight [platform-accelerated](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) library for [biological motif](https://en.wikipedia.org/wiki/Sequence_motif) scanning using [position weight matrices](https://en.wikipedia.org/wiki/Position_weight_matrix)*.

## 🗺️ Overview

[Motif](https://en.wikipedia.org/wiki/Sequence_motif) scanning with
[position weight matrices](https://en.wikipedia.org/wiki/Position_weight_matrix)
(also known as position-specific scoring matrices) is a robust method for
identifying motifs of fixed length inside a
[biological sequence](https://en.wikipedia.org/wiki/Sequence_(biology)). It can be
used to identify [transcription factor](https://en.wikipedia.org/wiki/Transcription_factor)
[binding sites in DNA](https://en.wikipedia.org/wiki/DNA_binding_site),
or [protease](https://en.wikipedia.org/wiki/Protease) [cleavage](https://en.wikipedia.org/wiki/Proteolysis) sites in [polypeptides](https://en.wikipedia.org/wiki/Peptide).
Position weight matrices are often viewed as [sequence logos](https://en.wikipedia.org/wiki/Sequence_logo).

The `lightmotif` library provides a Rust crate to run very efficient
searches for a motif encoded in a position weight matrix. The position
scanning combines several techniques to allow high-throughput processing
of sequences:

- Compile-time definition of alphabets and matrix dimensions.
- Sequence symbol encoding for fast table look-ups, as implemented in
  HMMER[\[1\]](#ref1) or MEME[\[2\]](#ref2).
- Striped sequence matrices to process several positions in parallel,
  inspired by Michael Farrar[\[3\]](#ref3).
- Vectorized matrix row look-up using `permute` instructions of [AVX2](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions).
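
The symbol encoding, table look-up and striped layout listed above can be sketched in plain Rust. This is only an illustration under stated assumptions, not the `lightmotif` implementation: the symbol order (A, C, T, G, N) and the function names `encode_sequence`, `stripe` and `score_all` are hypothetical, and the scoring loop is scalar rather than vectorized.

```rust
/// Map a nucleotide to a small integer index.
/// The order A, C, T, G, N is an assumption made for this sketch.
fn encode(symbol: u8) -> Option<usize> {
    match symbol {
        b'A' => Some(0),
        b'C' => Some(1),
        b'T' => Some(2),
        b'G' => Some(3),
        b'N' => Some(4),
        _ => None,
    }
}

/// Encode a whole sequence, returning `None` on any unknown symbol.
fn encode_sequence(seq: &str) -> Option<Vec<usize>> {
    seq.bytes().map(encode).collect()
}

/// Rearrange an encoded sequence into rows of `lanes` columns, so that
/// lane `l` of row `r` holds position `l * rows + r` of the sequence.
/// Positions past the end of the sequence are filled with `pad`.
fn stripe(encoded: &[usize], lanes: usize, pad: usize) -> Vec<Vec<usize>> {
    let rows = (encoded.len() + lanes - 1) / lanes;
    (0..rows)
        .map(|r| {
            (0..lanes)
                .map(|l| encoded.get(l * rows + r).copied().unwrap_or(pad))
                .collect()
        })
        .collect()
}

/// Score every window of the sequence against a weight matrix stored as
/// one row of per-symbol scores per motif position. Each symbol index is
/// used directly as a column index, so scoring is a plain table look-up.
fn score_all(encoded: &[usize], weights: &[[f32; 5]]) -> Vec<f32> {
    if weights.is_empty() || encoded.len() < weights.len() {
        return Vec::new();
    }
    encoded
        .windows(weights.len())
        .map(|w| w.iter().zip(weights).map(|(&s, row)| row[s]).sum::<f32>())
        .collect()
}
```

In the striped layout, each row gathers positions that are far apart in the sequence, so a SIMD implementation can score one row with a single vector operation per motif position instead of one scalar look-up per sequence position.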

Other crates from the ecosystem provide additional features if needed:

- [`lightmotif-tfmpvalue`](https://crates.io/crates/lightmotif-tfmpvalue) is an exact reimplementation of the TFMPvalue[\[4\]](#ref4) algorithm for converting between a score and a P-value for a given scoring matrix.
- [`lightmotif-transfac`](https://crates.io/crates/lightmotif-transfac) is a parser for position-specific scoring matrices in the [TRANSFAC](https://en.wikipedia.org/wiki/TRANSFAC) format.

*This is the Rust version; a [Python package](https://pypi.org/project/lightmotif) is available as well.*

## 💡 Example

```rust
use lightmotif::*;
use typenum::U32;

// Create a count matrix from an iterable of motif sequences
let counts = CountMatrix::<Dna>::from_sequences(&[

// Create a PSSM with 0.1 pseudocounts and uniform background frequencies.
let pssm = counts.to_freq(0.1).to_scoring(None);

// Encode the target sequence into a striped matrix
let encoded = EncodedSequence::encode(seq).unwrap();
let mut striped = encoded.to_striped::<U32>();

// Use a pipeline to compute scores for every position of the matrix.
let pli = Pipeline::generic();
let scores = pli.score(&striped, &pssm);

// Scores can be extracted into a Vec<f32>, or indexed directly.
let v = scores.to_vec();
assert_eq!(scores[0], -23.07094);
assert_eq!(v[0], -23.07094);

// The highest scoring position can be searched with a pipeline as well.
let best = pli.argmax(&scores).unwrap();
assert_eq!(best, 18);
```

This example uses the *generic* pipeline, which is not platform-accelerated.
To use the much faster AVX2 code, create an AVX2 pipeline with
`Pipeline::avx2` instead: this returns a `Result` which is `Ok` if AVX2
is supported on the local platform.
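
A fallible constructor of this kind typically relies on a runtime CPU feature check. The sketch below shows one way to perform such a check and fall back to a generic backend using only the standard library. It does not use the `lightmotif` API: the `Backend` enum and the functions `try_avx2` and `select_backend` are hypothetical names introduced for illustration.

```rust
/// Hypothetical backend label, used only in this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Backend {
    Avx2,
    Generic,
}

/// On x86 and x86-64, ask the CPU at runtime whether AVX2 is available.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn avx2_available() -> bool {
    std::is_x86_feature_detected!("avx2")
}

/// On every other architecture, AVX2 is never available.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn avx2_available() -> bool {
    false
}

/// Return the AVX2 backend only if the local CPU supports it.
fn try_avx2() -> Result<Backend, &'static str> {
    if avx2_available() {
        Ok(Backend::Avx2)
    } else {
        Err("AVX2 is not supported on this CPU")
    }
}

/// Prefer AVX2 and fall back to the generic backend otherwise.
fn select_backend() -> Backend {
    try_avx2().unwrap_or(Backend::Generic)
}
```

Checking at runtime lets the same binary run on CPUs with and without AVX2, at the cost of handling the `Err` case explicitly.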

## ⏱️ Benchmarks

Both benchmarks use the [MX000001](https://www.prodoric.de/matrix/MX000001.html)
motif from [PRODORIC](https://www.prodoric.de/)[\[5\]](#ref5), and the
[complete genome](https://www.ncbi.nlm.nih.gov/nuccore/U00096) of an
*Escherichia coli K12* strain.
*Benchmarks were run on an [i7-10710U CPU](https://ark.intel.com/content/www/us/en/ark/products/196448/intel-core-i7-10710u-processor-12m-cache-up-to-4-70-ghz.html) running at 1.10 GHz, compiled with `--target-cpu=native`*.

- Score every position of the genome with the motif weight matrix:
  ```text
  running 3 tests
  test bench_avx2    ... bench:   4,510,794 ns/iter (+/-     9,570) = 1029 MB/s
  test bench_sse2    ... bench:  26,773,537 ns/iter (+/-    57,891) =  173 MB/s
  test bench_generic ... bench: 317,731,004 ns/iter (+/- 2,567,370) =   14 MB/s
  ```

- Find the highest-scoring position for a motif in a 10kb sequence
  (compared to the PSSM algorithm implemented in
  ```text
  test bench_avx2    ... bench:      12,797 ns/iter (+/-   380) = 781 MB/s
  test bench_sse2    ... bench:      62,597 ns/iter (+/-    43) = 159 MB/s
  test bench_generic ... bench:     671,900 ns/iter (+/- 1,150) =  14 MB/s
  test bench_bio     ... bench:   1,193,911 ns/iter (+/- 2,519) =   8 MB/s
  ```
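
The MB/s column follows from the ns/iter column and the number of bytes processed per iteration. The sketch below shows the arithmetic, assuming one byte per nucleotide, a 10 kb sequence of exactly 10,000 bytes, and truncating integer division. The function names are hypothetical.

```rust
/// Throughput in MB/s (10^6 bytes per second), computed from the bytes
/// processed per iteration and the nanoseconds taken per iteration.
/// One byte per nanosecond equals 1000 MB/s, hence the factor of 1000.
/// Integer division truncates the result (an assumption of this sketch).
fn throughput_mb_s(bytes: u64, ns_per_iter: u64) -> u64 {
    bytes * 1000 / ns_per_iter
}

/// Ratio between a baseline timing and a faster timing.
fn speedup(baseline_ns: u64, fast_ns: u64) -> f64 {
    baseline_ns as f64 / fast_ns as f64
}
```

Under these assumptions, `throughput_mb_s(10_000, 12_797)` gives 781, matching the AVX2 line of the second benchmark, and `speedup(671_900, 12_797)` puts the AVX2 code at roughly 52 times the speed of the generic code on that benchmark.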

## 💭 Feedback

### ⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the [GitHub issue
tracker](https://github.com/althonos/lightmotif/issues) if you need to report
or ask something. If you are filing a bug report, please include as much
information as you can about the issue, and try to recreate the same bug
in a simple, easily reproducible situation.

<!-- ### 🏗️ Contributing

Contributions are more than welcome! See [`CONTRIBUTING.md`](https://github.com/althonos/lightmotif/blob/master/CONTRIBUTING.md) for more details. -->

## 📋 Changelog

This project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html)
and provides a [changelog](https://github.com/althonos/lightmotif/blob/master/CHANGELOG.md)
in the [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) format.

## ⚖️ License

This library is provided under the open-source
[MIT license](https://choosealicense.com/licenses/mit/).

*This project was developed by [Martin Larralde](https://github.com/althonos/)
during his PhD project at the [European Molecular Biology Laboratory](https://www.embl.de/)
in the [Zeller team](https://github.com/zellerlab).*

