🎼 🧬 lightmotif
A lightweight platform-accelerated library for biological motif scanning using position weight matrices.
🗺️ Overview
Motif scanning with position weight matrices (also known as position-specific scoring matrices) is a robust method for identifying motifs of fixed length inside a biological sequence. These matrices can be used to identify transcription factor binding sites in DNA, or protease cleavage sites in polypeptides.
The lightmotif library provides a Rust crate to run very efficient searches for a motif encoded in a position weight matrix. The position scanning combines several techniques to allow high-throughput processing of sequences:
- Compile-time definition of alphabets and matrix dimensions.
- Sequence symbol encoding for fast and easy table look-ups, as implemented in HMMER[1] or MEME[2].
- Striped sequence matrices to process several positions in parallel, inspired by Farrar[3].
- Vectorized matrix row look-up using permute instructions of AVX2.
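The encoding and striping ideas in the list above can be illustrated with a minimal scalar sketch. It does not use the lightmotif API: the function names (`encode`, `score`, `stripe`), the symbol order and the padding choice are assumptions made for illustration only, and the actual memory layout used by the library may differ.

```rust
/// Encode nucleotides as small integers so that each PSSM row can be
/// used directly as a lookup table. The order A, C, G, T, N is an
/// arbitrary choice for this sketch.
fn encode(seq: &str) -> Option<Vec<u8>> {
    seq.bytes()
        .map(|b| match b {
            b'A' => Some(0u8),
            b'C' => Some(1),
            b'G' => Some(2),
            b'T' => Some(3),
            b'N' => Some(4),
            _ => None, // reject symbols outside the alphabet
        })
        .collect()
}

/// Score every window of the encoded sequence: one table look-up per
/// motif position, with no branching on the symbol value.
fn score(seq: &[u8], pssm: &[[f32; 5]]) -> Vec<f32> {
    if pssm.is_empty() || seq.len() < pssm.len() {
        return Vec::new();
    }
    seq.windows(pssm.len())
        .map(|w| {
            w.iter()
                .zip(pssm)
                .map(|(&s, row)| row[s as usize])
                .sum::<f32>()
        })
        .collect()
}

/// Rearrange a sequence into a striped matrix with `cols` columns.
/// Column j holds a contiguous chunk of the sequence, so row i and
/// row i + 1 hold consecutive positions in every column; a SIMD
/// register loaded from one row then covers `cols` windows at once.
/// Missing cells are padded with N (4).
fn stripe(seq: &[u8], cols: usize) -> Vec<Vec<u8>> {
    if cols == 0 {
        return Vec::new();
    }
    let rows = (seq.len() + cols - 1) / cols;
    (0..rows)
        .map(|i| {
            (0..cols)
                .map(|j| seq.get(j * rows + i).copied().unwrap_or(4))
                .collect()
        })
        .collect()
}
```

With this layout, scoring motif position k for all windows starting in row i only needs row i + k, which is what makes the striped form convenient for vectorized look-ups.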
💡 Example
use lightmotif::*;
// Create a count matrix from an iterable of motif sequences
let counts = CountMatrix::<Dna, {Dna::K}>::from_sequences(&[
EncodedSequence::encode("GTTGACCTTATCAAC").unwrap(),
EncodedSequence::encode("GTTGATCCAGTCAAC").unwrap(),
]).unwrap();
// Create a PSSM with 0.1 pseudocounts and uniform background frequencies.
let pssm = counts.to_freq(0.1).to_scoring(None);
// Encode the target sequence into a striped matrix
let seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG";
let encoded = EncodedSequence::<Dna>::encode(seq).unwrap();
let mut striped = encoded.to_striped::<32>();
striped.configure(&pssm);
// Use a pipeline to compute scores for every position of the matrix
let scores = Pipeline::<Dna, f32>::score(&striped, &pssm);
// Scores can be extracted into a Vec<f32>, or indexed directly.
let v = scores.to_vec();
assert_eq!(scores[0], -23.07094);
assert_eq!(v[0], -23.07094);
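The conversion from counts to a scoring matrix in the example above follows the usual log-odds construction. The sketch below shows that idea in plain Rust, without lightmotif. The function name `counts_to_log_odds`, the four-symbol alphabet, the base-2 logarithm and the way the pseudocount is added are all assumptions; the library may use different conventions.

```rust
/// Convert per-position symbol counts into log-odds scores.
/// `pseudo` is added to every count before normalization so that
/// unseen symbols do not produce a score of negative infinity.
fn counts_to_log_odds(
    counts: &[[u32; 4]],
    pseudo: f32,
    background: [f32; 4],
) -> Vec<[f32; 4]> {
    counts
        .iter()
        .map(|row| {
            // Total of the pseudocount-adjusted counts for this position.
            let total: f32 = row.iter().map(|&c| c as f32 + pseudo).sum();
            let mut scores = [0.0f32; 4];
            for (k, &c) in row.iter().enumerate() {
                let freq = (c as f32 + pseudo) / total;
                // Log-odds of the motif frequency against the background.
                scores[k] = (freq / background[k]).log2();
            }
            scores
        })
        .collect()
}
```

A position where every symbol is equally frequent scores close to zero against a uniform background, while a strongly conserved position gives a positive score for the conserved symbol and negative scores for the others.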
To use the AVX2 implementation, simply create a Pipeline<_, __m256> instead of the Pipeline<_, f32>. This is only supported when the library is compiled with the avx2 target feature, but it can be easily configured with Rust's #[cfg] attribute.
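One way to set up this compile-time switch is sketched below. The `backend` module, its `Vector` alias, `NAME` and `lanes` are hypothetical names introduced for illustration; only `__m256`, `f32` and the `#[cfg]` attribute come from the text above.

```rust
// Selected when the crate is compiled for x86-64 with AVX2 enabled,
// e.g. through the target-cpu or target-feature compiler options.
#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
mod backend {
    pub type Vector = std::arch::x86_64::__m256;
    pub const NAME: &str = "avx2";
}

// Fallback used on every other target or feature set.
#[cfg(not(all(target_arch = "x86_64", target_feature = "avx2")))]
mod backend {
    pub type Vector = f32;
    pub const NAME: &str = "generic";
}

/// Number of f32 values held by the selected vector type.
fn lanes() -> usize {
    std::mem::size_of::<backend::Vector>() / std::mem::size_of::<f32>()
}
```

Under these assumptions, code that builds the pipeline could name `backend::Vector` as its second type parameter and compile on any platform, picking up the AVX2 path only when the feature is available.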
⏱️ Benchmarks
Benchmarks were run on an i7-10710U CPU running at 1.10 GHz, compiled with --target-cpu=native.
Both benchmarks use the MX000001 motif from PRODORIC, and the complete genome of an Escherichia coli K12 strain.
- Score every position of the genome with the motif weight matrix:
  running 3 tests
  test bench_avx2    ... bench:  13,053,752 ns/iter (+/- 45,411) = 355 MB/s
  test bench_ssse3   ... bench:  37,203,277 ns/iter (+/- 2,416,572) = 124 MB/s
  test bench_generic ... bench: 314,682,807 ns/iter (+/- 1,072,174) = 14 MB/s
- Find the highest-scoring position for a motif in a 10 kb sequence (compared to the PSSM algorithm implemented in bio::pattern_matching::pssm):
  test bench_avx2    ... bench:      46,390 ns/iter (+/- 115) = 215 MB/s
  test bench_ssse3   ... bench:      97,691 ns/iter (+/- 2,720) = 102 MB/s
  test bench_generic ... bench:     740,305 ns/iter (+/- 2,527) = 13 MB/s
  test bench_bio     ... bench:   1,575,504 ns/iter (+/- 2,799) = 6 MB/s
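The MB/s figures relate to the ns/iter timings through simple arithmetic, sketched below. This assumes that 1 MB is 10^6 bytes, that the result is truncated to an integer, and that the 10 kb sequence is 10,000 one-byte symbols; the helper name `throughput_mb_s` is hypothetical.

```rust
/// Throughput in MB/s for `bytes` processed in `ns_per_iter` nanoseconds.
/// Bytes per nanosecond equals GB/s, so multiplying by 1000 gives MB/s.
/// `ns_per_iter` must be non-zero.
fn throughput_mb_s(bytes: u64, ns_per_iter: u64) -> u64 {
    bytes * 1000 / ns_per_iter
}
```

Under those assumptions, `throughput_mb_s(10_000, 46_390)` evaluates to 215 and `throughput_mb_s(10_000, 1_575_504)` to 6, matching the AVX2 and bio figures reported for the 10 kb benchmark.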
💭 Feedback
⚠️ Issue Tracker
Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.
⚖️ License
This library is provided under the open-source MIT license.
This project was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.
📚 References
- [1] Eddy, Sean R. ‘Accelerated Profile HMM Searches’. PLOS Computational Biology 7, no. 10 (20 October 2011): e1002195. doi:10.1371/journal.pcbi.1002195.
- [2] Grant, Charles E., Timothy L. Bailey, and William Stafford Noble. ‘FIMO: Scanning for Occurrences of a given Motif’. Bioinformatics 27, no. 7 (1 April 2011): 1017–18. doi:10.1093/bioinformatics/btr064.
- [3] Farrar, Michael. ‘Striped Smith–Waterman Speeds Database Searches Six Times over Other SIMD Implementations’. Bioinformatics 23, no. 2 (15 January 2007): 156–61. doi:10.1093/bioinformatics/btl582.