Commit abba1c04 authored by Hoogenboom, Jerry

FDSTools v0.0.5: new tools, changed defaults

* General changes in v0.0.5:
  * The TSSV tool now depends on version 0.4.0 of TSSV.
  * Added new Pipeline tool that runs one of three default analysis pipelines
    automatically, given a configuration file with tool options and
    input/output file names (an illustrative configuration sketch follows
    this list). The three available pipeline options are 'reference-sample',
    analysing a single reference sample with TSSV and Stuttermark;
    'reference-database', analysing a collection of reference samples with
    BGEstimate and Stuttermodel; and 'case-sample', analysing a single case
    sample with TSSV, BGPredict, BGMerge, BGCorrect, and Samplestats.
  * Added new Library tool that creates an empty FDSTools library file. Users
    may optionally specify the intended use of the library (STR markers,
    non-STR markers, or both). Only the sections that apply to the given types
    of markers will be included in the output. The [aliases] section is not
    included by default, but an option is available to add it.
  * Added new tool BGAnalyse, which analyses the amount of noise remaining in
    reference samples after correction. This tool is a more sensitive
    successor to the 'Blame' tool.
  * Added new visualisation BGAnalysevis for visualising data obtained from
    BGAnalyse. This visualisation allows for identifying unclean or otherwise
    suboptimal samples by comparing the lowest, highest, and/or total remaining
    noise after correction for each marker in each sample.
  * The Blame tool was removed in favour of BGAnalyse.
* Libconvert v1.1.0:
  * When converting to FDSTools format, Libconvert automatically creates an
    empty FDSTools library file with the same contents as the output of the
    new Library tool when run without arguments.
  * The -a/--aliases option was modified such that it has the same effect as
    the -a/--aliases option of the new Library tool. This means that without
    this option specified, the [aliases] section will not be present in the
    output anymore.
  * The ability of the Libconvert tool to produce an empty FDSTools library
    file if no input file was given has been removed from the documentation
    (but not from the tool itself).
* TSSV v1.0.2:
  * Added new option -n/--indel-score which can be used to increase the
    penalty given to insertions and deletions in the flanking sequences w.r.t.
    the penalty given to mismatches.
  * NOTE: Requires TSSV v0.4.0 or newer to be installed.
* Vis v1.0.2:
  * Changed default value of -n/--min-abs from 15 to 5.
  * Added -I/--input2 option, which allows for specifying a file with raw data
    points for Stuttermodelvis and Profilevis.
  * Added support for creating BGAnalysevis visualisations.
* Profilevis v2.0.0:
  * Replaced the simple Options overlay with responsive design options panels
    in HTML visualisations.
  * Alleles and sequences are now sorted by CE allele length when applicable.
  * Added option to plot BGHomRaw data on top of the profiles.
  * Added marker selection menu for easier filtering.
* BGRawvis v2.0.0:
  * Replaced the simple Options overlay with responsive design options panels
    in HTML visualisations.
  * Sequences are now sorted by CE allele length when applicable.
  * Changed default minimum number of reads from 15 to 5.
  * Added marker selection menu for easier filtering.
* Stuttermodelvis v2.0.0:
  * Replaced the simple Options overlay with responsive design options panels
    in HTML visualisations.
  * Fixed glitch that caused the graphs to be rendered twice when loading
    a file by drag-and-drop in HTML visualisations.
  * Fixed glitch that made it possible to replace the data that was embedded
    in an HTML visualisation through drag-and-drop.
  * Added repeat unit selection menu for easier filtering.
* Allelevis v2.0.0:
  * Replaced the simple Options overlay with responsive design options panels
    in HTML visualisations.
  * Reduced Vega graph spec complexity by using the new Rank transform to
    position the subgraphs.
  * Fixed glitch that caused unnecessary padding around the graph.
* Samplestats v1.1.0:
  * Changed default allele calling option thresholds:
    * Changed default value of -m/--min-pct-of-max from 5.0 to 2.0.
    * Changed default value of -p/--min-pct-of-sum from 3.0 to 1.5.
  * Mentioned allele calling in the tool descriptions.
* Samplevis v2.1.0:
  * Changed default minimum number of reads for graph filtering from 15 to 5.
  * Changed default table filtering options:
    * Percentage of highest allele per marker changed from 5% to 2%.
    * Percentage of the marker's total reads changed from 3% to 1.5%.
    * Minimum number of reads in both orientations changed from 0 to 1.
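As an illustration of the new Pipeline tool, the configuration file uses the
INI format seen elsewhere in FDSTools (cf. the INI_COMMENT helper added to
lib.py below). The sketch that follows is hypothetical: the section and
option names are illustrative placeholders, not taken from the tool's
documentation; run 'fdstools pipeline --help' for the actual options.

    ; Hypothetical sketch of a 'case-sample' pipeline configuration.
    [pipeline]
    analysis = case-sample         ; one of the three pipelines named above
    in-library = mylibrary.txt     ; assumed library file name
    in-sample-raw = sample1.fastq  ; assumed raw sample input file name
    prefix = sample1               ; assumed prefix for output file names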
parent 68caf656
@@ -29,13 +29,29 @@ Alternatively, FDSTools can be installed by running:
FDSTools Changelog
------------------
v0.0.5
- The Blame tool was removed in favour of BGAnalyse
- Includes BGAnalyse v1.0.0
- Includes Libconvert v1.1.0
- Includes Library v1.0.0
- Includes Pipeline v1.0.0
- Includes Samplestats v1.1.0
- Includes TSSV v1.0.2
- Includes Vis v1.0.2
- Includes Allelevis v2.0.0
- Includes BGAnalysevis v1.0.0
- Includes BGRawvis v2.0.0
- Includes Profilevis v2.0.0
- Includes Samplevis v2.1.0
- Includes Stuttermodelvis v2.0.0
v0.0.4
- FDSTools will now print profiling information to stdout when the
-d/--debug option was specified.
-d/--debug option was specified
- Fixed bug where specifying '-' as the output filename would be taken
literally, while it should have been interpreted as 'write to standard
out' (Affected tools: BGCorrect, Samplestats, Seqconvert, Stuttermark).
- Added more detailed license information to FDSTools.
out' (Affected tools: BGCorrect, Samplestats, Seqconvert, Stuttermark)
- Added more detailed license information to FDSTools
- Updated bundled JavaScript library Vega to v2.6.0
- Updated bundled JavaScript library D3 to v3.5.17
- Includes BGCorrect v1.0.1
@@ -94,6 +110,12 @@ v1.0.0
- Initial version
BGAnalyse
~~~~~~~~~
v1.0.0
- Initial version
BGCorrect
~~~~~~~~~
v1.0.1
@@ -158,12 +180,6 @@ v1.0.0
- Initial version
Blame
~~~~~
v1.0.0
- Initial version
FindNewAlleles
~~~~~~~~~~~~~~
v1.0.0
@@ -172,6 +188,18 @@ v1.0.0
Libconvert
~~~~~~~~~~
v1.1.0
- When converting to FDSTools format, Libconvert automatically creates an
empty FDSTools library file with the same contents as the output of the
new Library tool when run without arguments.
- The -a/--aliases option was modified such that it has the same effect as
the -a/--aliases option of the new Library tool. This means that without
this option specified, the [aliases] section will not be present in the
output anymore.
- The ability of the Libconvert tool to produce an empty FDSTools library
file if no input file was given has been removed from the documentation
(but not from the tool itself).
v1.0.1
- Specifying '-' as the first positional argument to libconvert will now
correctly interpret this as "read from stdin" instead of throwing a "file
@@ -181,8 +209,26 @@ v1.0.0
- Initial version
Library
~~~~~~~
v1.0.0
- Initial version
Pipeline
~~~~~~~~
v1.0.0
- Initial version
Samplestats
~~~~~~~~~~~
v1.1.0
- Changed default allele calling option thresholds:
- Changed default value of -m/--min-pct-of-max from 5.0 to 2.0
- Changed default value of -p/--min-pct-of-sum from 3.0 to 1.5
- Mentioned allele calling in the tool descriptions
v1.0.1
- Samplestats will now round to 4 or 5 significant digits if a value is
above 1000 or 10000, respectively. Previously, this was only done for the
@@ -262,6 +308,12 @@ v1.0.0
TSSV
~~~~
v1.0.2
- Added new option -n/--indel-score which can be used to increase the
penalty given to insertions and deletions in the flanking sequences
w.r.t. the penalty given to mismatches.
- NOTE: Requires TSSV v0.4.0 or newer to be installed.
v1.0.1
- Renamed the '--is_fastq' option to '--is-fastq', which was the only
option with an underscore instead of a hyphen in FDSTools
@@ -274,6 +326,12 @@ v1.0.0
Vis
~~~
v1.0.2
- Changed default value of -n/--min-abs from 15 to 5
- Added -I/--input2 option, which allows for specifying a file with raw
data points for Stuttermodelvis and Profilevis
- Added support for creating BGAnalysevis visualisations
v1.0.1
- Added -j/--jitter option for Stuttermodelvis (default: 0.25)
- Fixed bug where Vis would not allow the -n/--min-abs and the
@@ -285,6 +343,13 @@ v1.0.0
Allelevis
~~~~~~~~~
v2.0.0
- Replaced the simple Options overlay with responsive design options panels
in HTML visualisations
- Reduced Vega graph spec complexity by using the new Rank transform to
position the subgraphs
- Fixed glitch that caused unnecessary padding around the graph
v1.0.0beta2
- Fixed potential crash/corruption that could occur with very unfortunate
combinations of sample names and marker names
@@ -297,8 +362,21 @@ v1.0.0beta1
- Initial version
BGAnalysevis
~~~~~~~~~~~~
v1.0.0
- Initial version
BGRawvis
~~~~~~~~
v2.0.0
- Replaced the simple Options overlay with responsive design options panels
in HTML visualisations
- Sequences are now sorted by CE allele length when applicable
- Changed default minimum number of reads from 15 to 5
- Added marker selection menu for easier filtering
v1.0.1
- Fixed a JavaScript crash that would occur in HTML visualisations if the
Marker name filter resulted in an invalid regular expression (e.g., when
@@ -314,6 +392,13 @@ v1.0.0
Profilevis
~~~~~~~~~~
v2.0.0
- Replaced the simple Options overlay with responsive design options panels
in HTML visualisations
- Alleles and sequences are now sorted by CE allele length when applicable
- Added option to plot BGHomRaw data on top of the profiles
- Added marker selection menu for easier filtering
v1.0.1
- Fixed a JavaScript crash that would occur in HTML visualisations if the
Marker name filter resulted in an invalid regular expression (e.g., when
@@ -329,6 +414,13 @@ v1.0.0
Samplevis
~~~~~~~~~
v2.1.0
- Changed default minimum number of reads for graph filtering from 15 to 5
- Changed default table filtering options:
- Percentage of highest allele per marker changed from 5% to 2%
- Percentage of the marker's total reads changed from 3% to 1.5%
- Minimum number of reads in both orientations changed from 0 to 1
v2.0.1
- Fixed a JavaScript crash that would occur in HTML visualisations if the
Marker name filter resulted in an invalid regular expression (e.g., when
@@ -352,6 +444,15 @@ v2.0.0
Stuttermodelvis
~~~~~~~~~~~~~~~
v2.0.0
- Replaced the simple Options overlay with responsive design options panels
in HTML visualisations
- Fixed glitch that caused the graphs to be re-rendered twice when loading
a file by drag-and-drop in HTML visualisations
- Fixed glitch that made it possible to replace the data that was embedded
in an HTML visualisation through drag-and-drop
- Added repeat unit selection menu for easier filtering
v1.0.0beta2
- HTML visualisations now support drawing raw data points on top of the fit
functions. The points can be drawn with an adjustable jitter to reduce
......
@@ -24,7 +24,7 @@ including tools for characterisation and filtering of PCR stutter artefacts and
other systemic noise, and for automatic detection of the alleles in a sample.
"""
__version_info__ = ('0', '0', '4')
__version_info__ = ('0', '0', '5')
__version__ = '.'.join(__version_info__)
usage = __doc__.split("\n\n\n")
......
@@ -20,7 +20,7 @@
# along with FDSTools. If not, see <http://www.gnu.org/licenses/>.
#
import re, sys, argparse, random, itertools
import re, sys, argparse, random, itertools, textwrap
#import numpy as np # Imported only when calling nnls()
from ConfigParser import RawConfigParser, MissingSectionHeaderError
@@ -95,6 +95,10 @@ COMPL = {"A": "T", "T": "A", "U": "A", "G": "C", "C": "G", "R": "Y", "Y": "R",
# Special values that may appear in the place of a sequence.
SEQ_SPECIAL_VALUES = ("No data", "Other sequences")
# TextWrapper object for formatting help texts in generated INI files.
INI_COMMENT = textwrap.TextWrapper(width=79, initial_indent="; ",
                                   subsequent_indent="; ",
                                   break_on_hyphens=False)
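# For example (assumed usage), INI_COMMENT.fill("some long help text")
# returns the text wrapped at 79 columns, each line prefixed with "; ".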
def get_genome_pos(location, x, invert=False):
"""Get the genome position of the x-th base in a sequence."""
......
#!/usr/bin/env python
#
# Copyright (C) 2016 Jerry Hoogenboom
#
# This file is part of FDSTools, data analysis tools for Next
# Generation Sequencing of forensic DNA markers.
#
# FDSTools is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation, either version 3 of the License, or (at
# your option) any later version.
#
# FDSTools is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with FDSTools. If not, see <http://www.gnu.org/licenses/>.
#
"""
Analyse the amount of noise in reference samples.
Use this tool after correcting the reference samples with BGCorrect to
analyse the amount of remaining noise after correction. This way,
potentially contaminated or otherwise 'dirty' reference samples can be
detected. The highest amount of remaining noise can be interpreted as a
lower bound to the reliable detection of a minor contributor's alleles
in mixed DNA samples.
In the default mode ('full'), the lowest, highest, and total number of
backgroud/noise reads as well as the respective percentages w.r.t. the
number of allelic reads of each marker in each sample is printed. This
data can be visualised using 'fdstools vis bganalyse'.
In the alternative 'percentiles' mode, the highest and the total number
of background reads as a percentage of the number of allelic reads for
each marker is given at selected percentiles of the samples. I.e., it
gives the highest and total remaining noise considering only the
cleanest x% of samples, for different values of x.
"""
import math

from ..lib import pos_int_arg, add_input_output_args, get_input_output_files,\
                  add_allele_detection_args, parse_allelelist,\
                  get_sample_data, add_sequence_format_args

__version__ = "1.0.0"


# Default values for parameters are specified below.

# Default percentiles.
# This value can be overridden by the -p command line option.
_DEF_PERCENTILES = "100,99,95,90"
def get_total_recovery(values):
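    # Returns the recovery percentage for one sequence: the reported
    # total_recovery value if present, otherwise total_add as a
    # percentage of total_corrected (0 if neither can be computed).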
if "total_recovery" in values:
return float(values["total_recovery"])
elif ("total_add" in values and "total_corrected" in values and
float(values["total_corrected"])):
return 100.*float(values["total_add"])/float(values["total_corrected"])
return 0
#get_total_recovery
def process_sample(sample_data, sample_alleles, tag, library):
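    # Assumed data layout, inferred from the calls below: sample_data
    # maps (marker, sequence) to [forward_reads, reverse_reads, extra
    # column values]; sample_alleles maps each marker to the set of
    # true alleles of this sample for that marker.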
    # Get read counts of the true alleles.
    allele_reads = {}
    for marker in sample_alleles:
        allele_reads[marker] = 0
        for allele in sample_alleles[marker]:
            if (marker, allele) not in sample_data:
                raise ValueError(
                    "Missing allele %s of marker %s in sample %s!" %
                    (allele, marker, tag))
            elif 0 in sample_data[marker, allele]:
                raise ValueError(
                    "Allele %s of marker %s has 0 reads!" % (allele, marker))
            allele_reads[marker] += sum(sample_data[marker, allele][0:2])

    # Find the highest, the lowest, and the total noise for each marker,
    # and the highest total recovery.
    noise = {}
    marker_reads = {}
    total_reads = 0
    for marker, sequence in sample_data:
        reads = sum(sample_data[marker, sequence][0:2])
        total_reads += reads
        if marker not in sample_alleles or not sample_alleles[marker]:
            # Sample does not participate in this marker.
            continue
        if marker not in noise:
            noise[marker] = [0, 0, 0, 0]  # Max, min, sum, max(recovery)
            marker_reads[marker] = reads
        else:
            marker_reads[marker] += reads
        if sequence in sample_alleles[marker]:
            # Note: Some noise has recovery in the order of 1E17.
            recovery = get_total_recovery(sample_data[marker, sequence][2])
            if recovery > noise[marker][3]:
                noise[marker][3] = recovery
        else:
            # Update lowest/highest noise as needed.
            if reads > noise[marker][0]:
                noise[marker][0] = reads
            if reads < noise[marker][1]:
                noise[marker][1] = reads
            noise[marker][2] += reads

    return {
        marker: {
            "genotype": sorted(sample_alleles[marker]),
            "allele_reads": allele_reads[marker],
            "noise": noise[marker],
            "marker_reads": marker_reads[marker],
            "total_reads": total_reads
        }
        for marker in noise}
#process_sample
def write_full_table(outfile, data):
    outfile.write("\t".join([
        "sample", "marker", "genotype", "allelic_reads",
        "highest_remaining_bg_reads", "lowest_remaining_bg_reads",
        "total_remaining_bg_reads", "highest_as_pct_of_allelic",
        "lowest_as_pct_of_allelic", "total_as_pct_of_allelic",
        "highest_recovery", "total_reads_marker", "total_reads_sample"]) + "\n")
    for tag in data:
        for marker in data[tag]:
            d = data[tag][marker]
            outfile.write("\t".join([
                tag, marker, ",".join(d["genotype"])] + [
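                # Print values with 3 significant digits, increased to
                # 4 above 1000 and 5 above 10000; near-zero becomes "0".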
"%.5g" % x if abs(x) >= 10000 else
"%.4g" % x if abs(x) >= 1000 else
"%.3g" % x if abs(x) > 0.0000000001 else "0" for x in (
d["allele_reads"],
d["noise"][0],
d["noise"][1],
d["noise"][2],
100.*d["noise"][0]/d["allele_reads"],
100.*d["noise"][1]/d["allele_reads"],
100.*d["noise"][2]/d["allele_reads"],
d["noise"][3],
d["marker_reads"],
d["total_reads"])]) + "\n")
#write_full_table
def write_percentiles_table(outfile, data, percentiles):
    per_marker = {}
    for tag in data:
        for marker in data[tag]:
            d = data[tag][marker]
            if marker not in per_marker:
                per_marker[marker] = ([], [], [], [])
            per_marker[marker][0].append(100.*d["noise"][0]/d["allele_reads"])
            per_marker[marker][1].append(100.*d["noise"][1]/d["allele_reads"])
            per_marker[marker][2].append(100.*d["noise"][2]/d["allele_reads"])
            per_marker[marker][3].append(d["noise"][3])

    outfile.write("\t".join([
        "marker", "percentile", "highest_as_pct_of_allelic",
        "lowest_as_pct_of_allelic", "total_as_pct_of_allelic",
        "highest_recovery"]) + "\n")
    for marker in per_marker:
        per_marker[marker] = map(sorted, per_marker[marker])
        n = len(per_marker[marker][0])
        for percentile in percentiles:
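            # Nearest-rank index into the ascending-sorted lists: the
            # value at or below which `percentile` percent of samples
            # fall (the lowest-noise column is read from the mirrored
            # end of its sorted list).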
            i = int(math.ceil(percentile * 0.01 * n)) - 1
            outfile.write("\t".join([
                marker, "%i" % percentile] + [
                "%.5g" % x if abs(x) >= 10000 else
                "%.4g" % x if abs(x) >= 1000 else
                "%.3g" % x if abs(x) > 0.0000000001 else "0" for x in (
                    per_marker[marker][0][i],
                    per_marker[marker][1][n-i-1],
                    per_marker[marker][2][i],
                    per_marker[marker][3][i])]) + "\n")
#write_percentiles_table
def analyse_background(samples_in, outfile, allelefile, annotation_column,
                       seqformat, library, mode, percentiles):
    # Parse allele list.
    allelelist = {} if allelefile is None \
        else parse_allelelist(allelefile, seqformat, library)

    # TODO: Should we consider forward and reverse reads separately?
    data = {}
    get_sample_data(
        samples_in,
        lambda tag, sample_data: data.__setitem__(tag, process_sample(
            sample_data, allelelist[tag], tag, library)),
        allelelist, annotation_column, seqformat, library,
        after_correction=True, drop_special_seq=True,
        extra_columns={
            "total_recovery": True,
            "total_add": True,
            "total_corrected": True
        })

    if mode == "full":
        write_full_table(outfile, data)
    else:
        write_percentiles_table(outfile, data, percentiles)
#analyse_background
def add_arguments(parser):
    parser.add_argument('-m', '--mode', metavar="MODE", default="full",
        choices=("full", "percentiles"),
        help="controls what kind of information is printed; 'full' (the "
             "default) prints the lowest, highest, and total number of "
             "background reads as well as the respective percentages w.r.t. "
             "the number of allelic reads of each marker in each "
             "sample; 'percentiles' prints the highest and the total number "
             "of background reads as a percentage of the number of allelic "
             "reads for each marker at given percentiles")
    parser.add_argument('-p', '--percentiles', metavar="PCT",
        default=_DEF_PERCENTILES,
        help="comma-separated list of percentiles to report when -m/--mode "
             "is set to 'percentiles' (default: %(default)s)")
    add_input_output_args(parser)
    add_allele_detection_args(parser)
    add_sequence_format_args(parser, "raw")
#add_arguments
def run(args):
    files = get_input_output_files(args)
    if not files:
        raise ValueError("please specify an input file, or pipe in the output "
                         "of another program")
    analyse_background(files[0], files[1], args.allelelist,
                       args.annotation_column, args.sequence_format,
                       args.library, args.mode,
                       map(float, args.percentiles.split(",")))
#run
#!/usr/bin/env python
#
# Copyright (C) 2016 Jerry Hoogenboom
#
# This file is part of FDSTools, data analysis tools for Next
# Generation Sequencing of forensic DNA markers.
#
# FDSTools is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation, either version 3 of the License, or (at
# your option) any later version.
#
# FDSTools is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with FDSTools. If not, see <http://www.gnu.org/licenses/>.
#
"""
Find dirty reference samples or recurring contaminating alleles.
Match background noise profiles (as obtained from bgestimate) to the
database of reference samples from which they were obtained to detect
samples with unexpectedly high additional alleles. The 'amount' column
in the output lists the number of noise/contaminant reads per true
allelic read.
(The tool is called 'blame' because with default settings it might
produce a DNA profile of an untidy laboratory worker.)
"""
import argparse
#import numpy as np  # Only imported when actually running this tool.

from ..lib import pos_int_arg, add_input_output_args, get_input_output_files,\
                  add_allele_detection_args, nnls, ensure_sequence_format,\
                  parse_allelelist, load_profiles, add_sequence_format_args,\
                  get_sample_data

__version__ = "1.0.0"


# Default values for parameters are specified below.

# Default number of results per marker.
# This value can be overridden by the -n command line option.
_DEF_NUM = 5
def add_sample_data(data, sample_data, sample_tag, alleles):
    # Add this sample to the data.
    use_markers = set()
    for marker in alleles:
        genotype = [data[marker]["seqs"].index(x)
                    if x in data[marker]["seqs"] and
                    data[marker]["seqs"].index(x) < data[marker]["n"]
                    else -1
                    for x in alleles[marker]]
        if -1 in genotype:
            # Don't have a profile for all of this sample's alleles.
            continue
        data[marker]["samples_forward"].append([0] * data[marker]["m"])
        data[marker]["samples_reverse"].append([0] * data[marker]["m"])
        data[marker]["genotypes"].append(genotype)
        data[marker]["sample_tags"].append(sample_tag)
        use_markers.add(marker)

    for marker, allele in sample_data:
        if marker not in use_markers:
            continue
        try:
            i = data[marker]["seqs"].index(allele)
        except ValueError:
            continue
        data[marker]["samples_forward"][-1][i] = sample_data[marker, allele][0]
        data[marker]["samples_reverse"][-1][i] = sample_data[marker, allele][1]
#add_sample_data
def blame(samples_in, outfile, allelefile, annotation_column, mode,
          profilefile, num, seqformat, library, marker):
    allelelist = {} if allelefile is None \
        else parse_allelelist(allelefile, "raw", library)

    data = load_profiles(profilefile, library)
    if marker:
        data = {marker: data[marker]} if marker in data else {}
    for marker in data:
        data[marker]["samples_forward"] = []
        data[marker]["samples_reverse"] = []
        data[marker]["genotypes"] = []
        data[marker]["sample_tags"] = []

    # Read sample data.
    get_sample_data(
        samples_in,
        lambda tag, sample_data: add_sample_data(
            data, sample_data, tag,
            {m: allelelist[tag][m] for m in data if m in allelelist[tag]}),
        allelelist, annotation_column, "raw", library, drop_special_seq=True)

    outfile.write("\t".join(["marker",
                             "allele" if mode == "common" else "sample",
                             "amount"]) + "\n")
    for marker in data:
        if not data[marker]["sample_tags"]:
            continue

        # Estimate which alleles are present in the samples.
        P1 = np.matrix(data[marker]["forward"])
        P2 = np.matrix(data[marker]["reverse"])
        C1 = np.matrix(data[marker]["samples_forward"])
        C2 = np.matrix(data[marker]["samples_reverse"])
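        # Non-negative least squares: express each sample's forward and
        # reverse read counts as a mixture of the known noise profiles;
        # row i of A holds the estimated contribution of each profile
        # sequence to sample i.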
        A = nnls(np.hstack([P1, P2]).T, np.hstack([C1, C2]).T).T

        # Zero out the true alleles in each sample and scale the others.
        for i in range(len(data[marker]["genotypes"])):
            indices = data[marker]["genotypes"][i]
            scale = A[i, indices].sum()
            A[i, indices] = 0
            A[i, :] /= scale

        if mode == "common":
            # The columns with the highest means correspond to the most
            # common contaminants.
            A = A.mean(0).flat
            for i in np.argsort(A)[:-num-1:-1]:  # Print the top N.
                if A[i] == 0:
                    break
                outfile.write("\t".join([marker, ensure_sequence_format(
                    data[marker]["seqs"][i], seqformat, library=library,
                    marker=marker), str(A[i])]) + "\n")
        else:
            # The rows with the highest maxima/sums correspond to the
            # samples with the highest amounts of contaminant/s.
            A = A.max(1).flat if mode == "highest" else A.sum(1).flat
            for i in np.argsort(A)[:-num-1:-1]:  # Print the top N.
                if A[i] == 0:
                    break
                outfile.write("\t".join(
                    [marker, data[marker]["sample_tags"][i], str(A[i])])+"\n")
#blame
def add_arguments(parser):
    parser.add_argument('profiles', metavar="PROFILES",
        type=argparse.FileType('r'),
        help="file containing background noise profiles to match")
    parser.add_argument("-m", "--mode", metavar="MODE", default="common",
        choices=("common", "highest", "dirtiest"),
        help="controls what kind of information is printed; 'common' (the "
             "default) prints the top N contaminant alleles per marker, "
             "'highest' prints the top N samples with the highest single "
             "contaminant per marker, and 'dirtiest' prints the top N samples "
             "with the highest total amount of contaminants per marker")
    add_input_output_args(parser)
    add_allele_detection_args(parser)
    filtergroup = parser.add_argument_group("filtering options")
    filtergroup.add_argument('-n', '--num', metavar="N", type=pos_int_arg,
        default=_DEF_NUM,
        help="print the top N results per marker (default: %(default)s)")
    filtergroup.add_argument('-M', '--marker', metavar="MARKER",
        help="work only on MARKER")
    add_sequence_format_args(parser, "raw")
#add_arguments
def run(args):
    # Import numpy now.
    global np
    import numpy as np

    files = get_input_output_files(args)
    if not files:
        raise ValueError("please specify an input file, or pipe in the output "
                         "of another program")
    blame(files[0], files[1], args.allelelist, args.annotation_column,
          args.mode, args.profiles, args.num, args.sequence_format,
          args.library, args.marker)
#run
This diff is collapsed.
@@ -22,7 +22,11 @@
"""
Compute various statistics for each sequence in the given sample data
file.
file and perform threshold-based allele calling.
Updates the 'flags' column (or adds it, if it was not present in the
input data) to include 'allele' for all sequences that meet various
allele calling thresholds.
Adds the following columns to the input data. Some columns may be
omitted from the output if the input does not contain the required
@@ -56,7 +60,7 @@ import sys
from ..lib import add_sequence_format_args, add_input_output_args, \
                  get_input_output_files, get_column_ids
__version__ = "1.0.1"
__version__ = "1.1.0"
# Default values for parameters are specified below.
@@ -72,12 +76,12 @@ _DEF_MIN_PER_STRAND = 1
# Default minimum percentage of reads w.r.t. the highest allele of the
# marker to mark as allele.
# This value can be overridden by the -m command line option.
_DEF_MIN_PCT_OF_MAX = 5.
_DEF_MIN_PCT_OF_MAX = 2.
# Default minimum percentage of reads w.r.t. the marker's total number
# of reads to mark as allele.
# This value can be overridden by the -p command line option.
_DEF_MIN_PCT_OF_SUM = 3.
_DEF_MIN_PCT_OF_SUM = 1.5
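# For example (assuming, as the tool description suggests, that all
# thresholds must be met): with the new defaults a sequence needs at
# least 2% of the reads of the marker's highest sequence (-m) and at
# least 1.5% of the marker's total reads (-p) to be flagged 'allele'.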
# Default minimum percentage of correction to mark as allele.
# This value can be overridden by the -c command line option.
......
@@ -40,7 +40,7 @@ from ..lib import pos_int_arg, add_input_output_args, get_input_output_files,\
                  add_sequence_format_args, reverse_complement, PAT_SEQ_RAW,\
                  get_column_ids, ensure_sequence_format
__version__ = "1.0.1"
__version__ = "1.0.2"
# Default values for parameters are specified below.
@@ -50,6 +50,11 @@ __version__ = "1.0.1"
# This value can be overridden by the -m command line option.
_DEF_MISMATCHES = 0.08
# Default penalty multiplier for insertions and deletions in the
# flanking sequences.
# This value can be overridden by the -n command line option.
_DEF_INDEL_SCORE = 1
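# Per the multiplier semantics described above, the default of 1
# penalises an indel in the flanks the same as a mismatch; higher
# values make indels proportionally more costly.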
# Default minimum number of reads to consider.
# This value can be overridden by the -a command line option.
_DEF_MINIMUM = 1
@@ -75,7 +80,7 @@ def seq_pass_filt(sequence, reads, threshold, explen=None):
def run_tssv_lite(infile, outfile, reportfile, is_fastq, library, seqformat,
                  threshold, minimum, aggregate_filtered,
                  missing_marker_action, dirname):
                  missing_marker_action, dirname, indel_score):
file_format = "fastq" if is_fastq else "fasta"
tssv_library = convert_library(library, threshold)
@@ -86,7 +91,7 @@ def run_tssv_lite(infile, outfile, reportfile, is_fastq, library, seqformat,
    outfiles = None
    total_reads, unrecognised, counters, sequences = process_file(