Commit 08cf6ddd authored by Jerry Hoogenboom

Various bug fixes and refinements throughout FDSTools

* Global changes in v0.0.4:
  * FDSTools will now print profiling information to stdout when the -d/--debug
    option is specified.
  * Fixed bug where specifying '-' as the output filename was taken literally
    instead of being interpreted as 'write to standard output' (affected
    tools: BGCorrect, Samplestats, Seqconvert, Stuttermark).
  * Added more detailed license information to FDSTools.
* BGEstimate v1.1.0:
  * Added a new option -g/--min-genotypes (default: 3). Only alleles that occur
    in at least this number of unique heterozygous genotypes will be
    considered. This avoids 'contamination' of the noise profile of one allele
    with the noise of another. If homozygous samples are available for an
    allele, this filter is not applied to that allele. Setting this option to 1
    effectively disables it. This option has the same cascading effect as the
    -s/--min-samples option: if one allele does not meet the threshold, the
    samples carrying that allele are excluded, which may cause some of the
    other alleles in those samples to fall below the threshold as well (a
    simplified sketch of this cascading filter follows the changelog).
* Stuttermodel v1.1.0:
  * Stuttermodel will now only output a fit for one strand if it could also
    obtain a fit for the other strand (for the same marker, unit, and stutter
    depth). This behaviour can be disabled with the new -O/--orphans option.
  * Fixed bug that caused Stuttermodel to output only the raw data points for
    -1 and +1 stutter when normal output was suppressed.
* BGCorrect v1.0.1:
  * Added new column 'weight' to the output. The value in this column expresses
    the number of times that the noise profile of that allele was fitted into
    the sample (see the NNLS sketch below the changelog).
* Samplestats v1.0.1:
  * Samplestats will now round to 4 or 5 significant digits if a value is
    above 1000 or 10000, respectively (see the rounding sketch below the
    changelog). Previously, this was only done for the combined 'Other
    sequences' values.
  * The 'Other sequences' lines will now also include values for
    total_recovery, forward_recovery, and reverse_recovery.
  * The total_recovery, forward_recovery, and reverse_recovery columns are no
    longer placed to the left of all the other columns generated by
    Samplestats.
  * Fixed the help text of Samplestats, which erroneously listed X_recovery_pct
    instead of X_recovery.
  * Added support for the new 'weight' column produced by BGCorrect when the
    -a/--filter-action option is set to 'combine'.
* BGPredict v1.0.1:
  * Greatly reduced memory usage.
  * BGPredict will now output nonzero values below the threshold set by
    -n/--min-pct if the predicted noise ratio of the same stutter on the other
    strand is above the threshold. Previously, values below the threshold were
    clipped to zero, which could introduce unnecessarily high strand bias in
    the predicted profile (a small sketch of this threshold rule follows the
    changelog).
* BGMerge v1.0.1:
  * Reduced memory usage.
* TSSV v1.0.1:
  * Renamed the '--is_fastq' option to '--is-fastq'. It was the only option
    with an underscore instead of a hyphen in FDSTools.
  * Fixed crash that would occur if -F/--sequence-format was set to anything
    other than 'raw'.
* Libconvert v1.0.1:
  * Specifying '-' as the first positional argument to Libconvert is now
    correctly interpreted as 'read from stdin' instead of throwing a 'file
    not found' error (or reading from a file named '-' if it exists).
* Seqconvert v1.0.1:
  * Internal naming of the first positional argument was changed from 'format'
    to 'sequence-format'. This was done for consistency with the
    -F/--sequence-format option in other tools, giving it the same name in
    Pipeline configuration files.
* Vis v1.0.1:
  * Added -j/--jitter option for Stuttermodelvis (default: 0.25).
  * Fixed bug where Vis would not allow the -n/--min-abs and -s/--min-per-strand
    options to be set to 0.
* Stuttermodelvis v1.0.0beta2:
  * HTML visualisations now support drawing raw data points on top of the fit
    functions. The points can be drawn with an adjustable jitter to reduce
    overlap.
  * Fixed a JavaScript crash that would occur in HTML visualisations if the
    Repeat unit or Marker name filter resulted in an invalid regular expression
    (e.g., when the entered value ends with a backslash).
  * Reduced Vega graph spec complexity by using the new Rank transform to
    position the subgraphs.
  * HTML visualisations made with the -O/--online option of the Vis tool will
    now contain https URLs instead of http.
* Samplevis v1.0.1:
  * Fixed a JavaScript crash that would occur in HTML visualisations if the
    Marker name filter resulted in an invalid regular expression (e.g., when
    the entered value ends with a backslash).
  * Reduced Vega graph spec complexity by using the new Rank transform to
    position the subgraphs.
  * Fixed a glitch where clicking the 'Truncate sequences to' label would
    select the marker spacing input.
  * The 'Notes' table cells that contain 'BGPredict' now get a light orange
    background to warn the user that the background profile of that sequence
    was computed. The same colour is used if a sequence was explicitly 'not
    corrected', 'not in ref db', or 'corrected as background only'.
  * The message bar at the bottom of Samplevis HTML visualisations will now
    grow no larger than 3 lines. A scroll bar will appear as needed.
  * HTML visualisations made with the -O/--online option of the Vis tool will
    now contain https URLs instead of http.
* BGRawVis v1.0.1:
  * Fixed a JavaScript crash that would occur in HTML visualisations if the
    Marker name filter resulted in an invalid regular expression (e.g., when
    the entered value ends with a backslash).
  * Reduced Vega graph spec complexity by using the new Rank transform to
    position the subgraphs.
  * HTML visualisations made with the -O/--online option of the Vis tool will
    now contain https URLs instead of http.
* Profilevis v1.0.1:
  * Fixed a JavaScript crash that would occur in HTML visualisations if the
    Marker name filter resulted in an invalid regular expression (e.g., when
    the entered value ends with a backslash).
  * Reduced Vega graph spec complexity by using the new Rank transform to
    position the subgraphs.
  * HTML visualisations made with the -O/--online option of the Vis tool will
    now contain https URLs instead of http.
* Allelevis v1.0.0beta2:
  * Fixed potential crash/corruption that could occur with very unfortunate
    combinations of sample names and marker names.
  * HTML visualisations made with the -O/--online option of the Vis tool will
    now contain https URLs instead of http.
  * Added two more colours to the legend, such that a maximum of 22 markers is
    now supported without re-using colours.
* Updated bundled D3 to v3.5.17.
* Updated bundled Vega to v2.6.0.
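
Editor's sketch for the BGEstimate -g/--min-genotypes item above: a minimal,
simplified Python illustration of the cascading filter. The function name
cascade_min_genotypes and the genotype representation (one tuple of allele
names per sample) are invented for illustration; the real implementation is
filter_data() in bgestimate.py below, which also applies -s/--min-samples.

    def cascade_min_genotypes(genotypes, min_genotypes):
        # genotypes: one tuple of allele names per sample, e.g. ("12", "13.3").
        # Returns the genotypes that survive the cascading filter.
        genotypes = list(genotypes)
        while True:
            hetero = {}     # allele -> set of unique heterozygous genotypes
            exempt = set()  # alleles seen in at least one homozygous sample
            for geno in genotypes:
                if len(set(geno)) == 1:
                    exempt.add(geno[0])
                else:
                    for allele in geno:
                        hetero.setdefault(allele, set()).add(tuple(sorted(geno)))
            weak = set(allele for allele, genos in hetero.items()
                       if allele not in exempt and len(genos) < min_genotypes)
            if not weak:
                return genotypes
            # Drop every sample that carries an under-supported allele; this
            # may push other alleles below the threshold, hence the loop.
            genotypes = [geno for geno in genotypes if not weak & set(geno)]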
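
Editor's sketch for the BGCorrect 'weight' item above: the changelog describes
the weight as the number of times an allele's noise profile fits into the
sample, which corresponds to the coefficient of a non-negative least-squares
fit of the known profiles to the observed reads. The numbers below are
invented, and scipy.optimize.nnls is used only to keep the example
self-contained (FDSTools ships its own nnls helper in lib.py).

    import numpy as np
    from scipy.optimize import nnls

    # Columns are sequences S1..S3; each allele's profile is scaled so that
    # the allele's own sequence equals 100, as in FDSTools noise profiles.
    profiles = np.array([
        [100.0,  12.0, 3.0],   # allele with sequence S1 and its stutters
        [  0.0, 100.0, 8.0],   # allele with sequence S2 and its stutters
    ])
    observed = np.array([950.0, 215.0, 45.0])  # reads per sequence in sample

    weights, residual = nnls(profiles.T, observed)
    # Each weight is how many times that allele's profile (scaled to 100)
    # fits into the sample, i.e. roughly the allele's own reads / 100.
    print(weights)   # approximately [9.5, 1.0]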
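
Editor's sketch for the Samplestats rounding item above: one way to round to a
fixed number of significant digits. Both helpers are hypothetical names; the
real Samplestats may treat small values and percentage columns differently.

    from math import floor, log10

    def round_significant(value, digits):
        # Round value to the given number of significant digits.
        if value == 0:
            return value
        return round(value, digits - int(floor(log10(abs(value)))) - 1)

    def round_like_samplestats(value):
        # 5 significant digits above 10000, 4 above 1000; smaller values are
        # returned unchanged in this sketch.
        if abs(value) > 10000:
            return round_significant(value, 5)
        if abs(value) > 1000:
            return round_significant(value, 4)
        return value

    # round_like_samplestats(12345.678) -> 12346.0
    # round_like_samplestats(2345.678)  -> 2346.0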
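
Editor's sketch for the BGPredict -n/--min-pct item above: a below-threshold
value on one strand is kept when the same stutter product passes the threshold
on the other strand. The function name apply_min_pct is invented for
illustration.

    def apply_min_pct(forward_pct, reverse_pct, min_pct):
        # Keep both strand values if at least one strand reaches the
        # threshold; clipping each strand independently (the old behaviour)
        # could introduce artificial strand bias in the predicted profile.
        if forward_pct >= min_pct or reverse_pct >= min_pct:
            return forward_pct, reverse_pct
        return 0.0, 0.0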
parent 3a495653
This file contains the licenses of third-party projects included with FDSTools.
*******************************************************************************
Vega: A Visualization Grammar
Copyright (c) 2013, Trifacta Inc.
Copyright (c) 2015, University of Washington Interactive Data Lab
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*******************************************************************************
D3: Data-Driven Documents
Copyright 2010-2016 Mike Bostock
All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of the author nor the names of contributors may be used to
endorse or promote products derived from this software without specific prior
written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# Copyright (C) 2016 Jerry Hoogenboom
#
# This file is part of FDSTools, data analysis tools for Next
# Generation Sequencing of forensic DNA markers.
#
# FDSTools is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation, either version 3 of the License, or (at
# your option) any later version.
#
# FDSTools is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with FDSTools. If not, see <http://www.gnu.org/licenses/>.
#
"""
Data analysis tools for Next Generation Sequencing of forensic DNA markers,
including tools for characterisation and filtering of PCR stutter artefacts and
other systemic noise, and for automatic detection of the alleles in a sample.
"""
__version_info__ = ('0', '0', '3')
__version_info__ = ('0', '0', '4')
__version__ = '.'.join(__version_info__)
usage = __doc__.split("\n\n\n")
......
#!/usr/bin/env python
#
# Copyright (C) 2016 Jerry Hoogenboom
#
# This file is part of FDSTools, data analysis tools for Next
# Generation Sequencing of forensic DNA markers.
#
# FDSTools is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation, either version 3 of the License, or (at
# your option) any later version.
#
# FDSTools is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with FDSTools. If not, see <http://www.gnu.org/licenses/>.
#
import argparse, pkgutil, os, re, textwrap
#import cProfile # Imported only if the -d/--debug option is specified
import tools
from . import usage, version
......@@ -95,7 +116,12 @@ def main():
__tools__[args.tool].error(
"The following arguments are not known. Please check spelling "
"and argument order: '%s'." % "', '".join(unknowns))
args.func(args)
if args.debug:
import cProfile
cProfile.runctx(
"args.func(args)", globals(), locals(), sort="tottime")
else:
args.func(args)
except Exception as error:
if args.debug:
raise
......
#!/usr/bin/env python
#
# Copyright (C) 2016 Jerry Hoogenboom
#
# This file is part of FDSTools, data analysis tools for Next
# Generation Sequencing of forensic DNA markers.
#
# FDSTools is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation, either version 3 of the License, or (at
# your option) any later version.
#
# FDSTools is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with FDSTools. If not, see <http://www.gnu.org/licenses/>.
#
import re, sys, argparse, random, itertools
#import numpy as np # Imported only when calling nnls()
......@@ -46,6 +66,8 @@ PAT_STR_DEF_BLOCK = re.compile("([ACGT]+)\s+(\d+)\s+(\d+)")
# Pattern to split a comma-, semicolon-, or space-separated list.
PAT_SPLIT = re.compile("\s*[,; \t]\s*")
PAT_SPLIT_QUOTED = re.compile(
r""""((?:\\"|[^"])*)"|'((?:\\'|[^'])*)'|(\S+)""")
# Pattern that matches a chromosome name/number.
PAT_CHROMOSOME = re.compile(
......@@ -689,6 +711,7 @@ def parse_library_ini(handle):
def load_profiles(profilefile, library=None):
# TODO: To be replaced with load_profiles_new (less memory).
column_names = profilefile.readline().rstrip("\r\n").split("\t")
(colid_marker, colid_allele, colid_sequence, colid_fmean, colid_rmean,
colid_tool) = get_column_ids(column_names, "marker", "allele", "sequence",
......@@ -752,6 +775,41 @@ def load_profiles(profilefile, library=None):
#load_profiles
def load_profiles_new(profilefile, library=None):
# TODO, rename this to load_profiles to complete transition.
column_names = profilefile.readline().rstrip("\r\n").split("\t")
(colid_marker, colid_allele, colid_sequence, colid_fmean, colid_rmean,
colid_tool) = get_column_ids(column_names, "marker", "allele", "sequence",
"fmean", "rmean", "tool")
profiles = {}
for line in profilefile:
line = line.rstrip("\r\n").split("\t")
if line == [""]:
continue
marker = line[colid_marker]
if marker not in profiles:
profiles[marker] = {}
allele = ensure_sequence_format(line[colid_allele], "raw",
library=library, marker=marker)
sequence = ensure_sequence_format(line[colid_sequence], "raw",
library=library, marker=marker)
if allele not in profiles[marker]:
profiles[marker][allele] = {}
elif sequence in profiles[marker][allele]:
raise ValueError(
"Invalid background noise profiles file: encountered "
"multiple values for marker '%s' allele '%s' sequence '%s'" %
(marker, allele, sequence))
profiles[marker][allele][sequence] = {
"forward": float(line[colid_fmean]),
"reverse": float(line[colid_rmean]),
"tool": line[colid_tool]}
return profiles
#load_profiles_new
def pattern_longest_match(pattern, subject):
"""Return the longest match of the pattern in the subject string."""
# FIXME, this function tries only one match at each position in the
......@@ -1261,21 +1319,32 @@ def get_repeat_pattern(seq):
def read_sample_data_file(infile, data, annotation_column=None, seqformat=None,
library=None, default_marker=None,
drop_special_seq=False, after_correction=False):
drop_special_seq=False, after_correction=False,
extra_columns=None):
"""Add data from infile to data dict as [marker, sequence]=reads."""
# Get column numbers.
column_names = infile.readline().rstrip("\r\n").split("\t")
colid_sequence = get_column_ids(column_names, "sequence")
colid_forward = None
colid_reverse = None
numtype = int
if after_correction:
colid_forward, colid_reverse = get_column_ids(column_names,
"forward_corrected", "reverse_corrected",
optional=(after_correction != "require"))
if colid_forward is None:
colid_forward = get_column_ids(column_names, "forward")
else:
numtype = float
if colid_reverse is None:
colid_reverse = get_column_ids(column_names, "reverse")
else:
numtype = float
if extra_columns is not None:
extra_colids = {c: i for c, i in
((c, get_column_ids(column_names, c, optional=extra_columns[c]))
for c in extra_columns)
if i is not None}
# Get marker name column if it exists.
colid_marker = get_column_ids(column_names, "marker", optional=True)
......@@ -1300,8 +1369,11 @@ def read_sample_data_file(infile, data, annotation_column=None, seqformat=None,
if (annotation_column is not None and
line[colid_annotation].startswith("ALLELE")):
found_alleles.append((marker, sequence))
data[marker, sequence] = map(int,
data[marker, sequence] = map(numtype,
(line[colid_forward], line[colid_reverse]))
if extra_columns is not None:
data[marker, sequence].append(
{c: line[extra_colids[c]] for c in extra_colids})
return found_alleles
#read_sample_data_file
......@@ -1330,7 +1402,7 @@ def get_sample_data(tags_to_files, callback, allelelist=None,
annotation_column=None, seqformat=None, library=None,
marker=None, homozygotes=False, limit_reads=sys.maxint,
drop_samples=0, drop_special_seq=False,
after_correction=False):
after_correction=False, extra_columns=None):
if drop_samples:
sample_tags = tags_to_files.keys()
for tag in random.sample(xrange(len(sample_tags)),
......@@ -1344,7 +1416,7 @@ def get_sample_data(tags_to_files, callback, allelelist=None,
infile = sys.stdin if infile == "-" else open(infile, "r")
alleles.update(read_sample_data_file(
infile, data, annotation_column, seqformat, library, marker,
drop_special_seq, after_correction))
drop_special_seq, after_correction, extra_columns))
if infile != sys.stdin:
infile.close()
if limit_reads < sys.maxint:
......@@ -1620,7 +1692,8 @@ def get_input_output_files(args, single=False, batch_support=False):
for infile in infiles]
if len(outfiles) == 1:
outfile = outfiles[0]
outfile = sys.stdout if outfiles[0] == "-" else outfiles[0]
if outfile == sys.stdout and len(set(tags)) == 1:
# Write output of single sample to stdout.
return ((tag, infiles, outfile) for tag in set(tags))
......@@ -1647,6 +1720,16 @@ def get_input_output_files(args, single=False, batch_support=False):
#get_input_output_files
def split_quoted_string(text):
return reduce(
lambda x, y: x + ["".join([
y[0].replace("\\\"", "\""),
y[1].replace("\\'", "'"),
y[2]])],
PAT_SPLIT_QUOTED.findall(text), [])
#split_quoted_string
def print_db(text, debug):
"""Print text if debug is True."""
if debug:
......
#!/usr/bin/env python
#
# Copyright (C) 2016 Jerry Hoogenboom
#
# This file is part of FDSTools, data analysis tools for Next
# Generation Sequencing of forensic DNA markers.
#
# FDSTools is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation, either version 3 of the License, or (at
# your option) any later version.
#
# FDSTools is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with FDSTools. If not, see <http://www.gnu.org/licenses/>.
#
"""
Find true alleles in reference samples and detect possible
contaminations.
......
#!/usr/bin/env python
#
# Copyright (C) 2016 Jerry Hoogenboom
#
# This file is part of FDSTools, data analysis tools for Next
# Generation Sequencing of forensic DNA markers.
#
# FDSTools is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation, either version 3 of the License, or (at
# your option) any later version.
#
# FDSTools is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with FDSTools. If not, see <http://www.gnu.org/licenses/>.
#
"""
Match background noise profiles (obtained from e.g., bgestimate) to
samples.
......@@ -28,7 +49,7 @@ from ..lib import load_profiles, ensure_sequence_format, get_column_ids, \
nnls, add_sequence_format_args, SEQ_SPECIAL_VALUES, \
add_input_output_args, get_input_output_files
__version__ = "1.0.0"
__version__ = "1.0.1"
def get_sample_data(infile, convert_to_raw=False, library=None):
......@@ -47,6 +68,7 @@ def get_sample_data(infile, convert_to_raw=False, library=None):
column_names.append("reverse_corrected")
column_names.append("total_corrected")
column_names.append("correction_flags")
column_names.append("weight")
colid_marker, colid_sequence, colid_forward, colid_reverse =get_column_ids(
column_names, "marker", "sequence", "forward", "reverse")
data = {}
......@@ -64,10 +86,11 @@ def get_sample_data(infile, convert_to_raw=False, library=None):
cols.append(0)
cols.append(0)
cols.append(0)
cols.append(int(cols[colid_forward]))
cols.append(int(cols[colid_reverse]))
cols.append(int(cols[colid_forward]) + int(cols[colid_reverse]))
cols.append(cols[colid_forward])
cols.append(cols[colid_reverse])
cols.append(cols[colid_forward] + cols[colid_reverse])
cols.append("not_corrected")
cols.append((cols[colid_forward] + cols[colid_reverse]) / 100.)
if marker not in data:
data[marker] = []
data[marker].append(cols)
......@@ -81,11 +104,11 @@ def match_profile(column_names, data, profile, convert_to_raw, library,
colid_forward_noise, colid_reverse_noise, colid_total_noise,
colid_forward_add, colid_reverse_add, colid_total_add,
colid_forward_corrected, colid_reverse_corrected,
colid_total_corrected, colid_correction_flags) = get_column_ids(
column_names, "marker", "sequence", "forward", "reverse", "total",
"forward_noise", "reverse_noise", "total_noise", "forward_add",
"reverse_add", "total_add", "forward_corrected", "reverse_corrected",
"total_corrected", "correction_flags")
colid_total_corrected, colid_correction_flags, colid_weight) = \
get_column_ids(column_names, "marker", "sequence", "forward",
"reverse", "total", "forward_noise", "reverse_noise", "total_noise",
"forward_add", "reverse_add", "total_add", "forward_corrected",
"reverse_corrected", "total_corrected", "correction_flags", "weight")
# Enter profiles into P.
P1 = np.matrix(profile["forward"])
......@@ -127,10 +150,11 @@ def match_profile(column_names, data, profile, convert_to_raw, library,
reverse_add = np.multiply(A, P2.sum(1).T)
# Round values to 3 decimal positions.
forward_noise.round(3, forward_noise);
reverse_noise.round(3, reverse_noise);
forward_add.round(3, forward_add);
reverse_add.round(3, reverse_add);
A.round(3, A)
forward_noise.round(3, forward_noise)
reverse_noise.round(3, reverse_noise)
forward_add.round(3, forward_add)
reverse_add.round(3, reverse_add)
j = 0
for line in data:
......@@ -163,8 +187,10 @@ def match_profile(column_names, data, profile, convert_to_raw, library,
line[colid_correction_flags] = "corrected_bgpredict"
else:
line[colid_correction_flags] = "corrected"
line[colid_weight] = A[0, i]
else:
line[colid_correction_flags] = "corrected_as_background_only"
line[colid_weight] = line[colid_total_corrected] / 100.
# Add sequences that are in the profile but not in the sample.
for i in range(profile["m"]):
......@@ -201,11 +227,13 @@ def match_profile(column_names, data, profile, convert_to_raw, library,
line[colid_correction_flags] = "corrected_bgpredict"
else:
line[colid_correction_flags] = "corrected"
line[colid_weight] = A[0, i]
else:
line[colid_forward_add] = 0
line[colid_reverse_add] = 0
line[colid_total_add] = 0
line[colid_correction_flags] = "corrected_as_background_only"
line[colid_weight] = line[colid_total_corrected] / 100.
data.append(line)
#match_profile
......
#!/usr/bin/env python
#
# Copyright (C) 2016 Jerry Hoogenboom
#
# This file is part of FDSTools, data analysis tools for Next
# Generation Sequencing of forensic DNA markers.
#
# FDSTools is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation, either version 3 of the License, or (at
# your option) any later version.
#
# FDSTools is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with FDSTools. If not, see <http://www.gnu.org/licenses/>.
#
"""
Estimate allele-centric background noise profiles (means) from reference
samples.
......@@ -17,7 +38,7 @@ from ..lib import pos_int_arg, add_input_output_args, get_input_output_files,\
parse_allelelist, get_sample_data, \
add_random_subsampling_args
__version__ = "1.0.0"
__version__ = "1.1.0"
# Default values for parameters are specified below.
......@@ -41,6 +62,10 @@ _DEF_MIN_SAMPLES = 2
# This value can be overridden by the -S command line option.
_DEF_MIN_SAMPLE_PCT = 80.
# Default minimum number of heterozygous genotypes for each true allele.
# This value can be overridden by the -g command line option.
_DEF_MIN_GENOTYPES = 3
def solve_profile_mixture(forward, reverse, genotypes, n, variance=False,
reportfile=None):
......@@ -246,11 +271,11 @@ def solve_profile_mixture_single(samples, genotypes, n, variance=False,
return P.tolist(), V.tolist()
else:
return P.tolist()
#solve_profile_mixture
#solve_profile_mixture_single
def ensure_min_samples(allelelist, min_samples):
if min_samples <= 1:
def filter_data(allelelist, min_samples, min_genotypes):
if min_samples <= 1 and min_genotypes <= 1:
return
marker_names = set()
......@@ -258,33 +283,51 @@ def ensure_min_samples(allelelist, min_samples):
marker_names.update(allelelist[tag].keys())
for marker in marker_names:
# Get a sample count of each true allele of this marker.
# Get a sample count of each true allele of this marker
# and a sample count of each unique genotype of each allele.
true_alleles = {}
for tag in allelelist:
if marker not in allelelist[tag]:
continue
for true_allele in allelelist[tag][marker]:
try:
true_alleles[true_allele] += 1
true_alleles[true_allele][0] += 1
except KeyError:
true_alleles[true_allele] = 1
true_alleles[true_allele] = [1, {}]
if len(allelelist[tag][marker]) == 1: # Got homozygote!
true_alleles[true_allele][1] = None
elif true_alleles[true_allele][1] is not None:
genotype = "\t".join(sorted(allelelist[tag][marker]))
try:
true_alleles[true_allele][1][genotype] += 1
except KeyError:
true_alleles[true_allele][1][genotype] = 1
# Drop any alleles that occur in less than min_samples samples
# or in less than min_genotypes unique heterozygous genotypes
# (by dropping the sample for this marker completely).
repeat = True
while repeat:
repeat = False
for true_allele in true_alleles:
if 0 < true_alleles[true_allele] < min_samples:
if 0 < true_alleles[true_allele][0] < min_samples or (
true_alleles[true_allele][1] is not None and
0 < len(true_alleles[true_allele][1]) < min_genotypes):
repeat = True
for tag in allelelist:
if marker not in allelelist[tag]:
continue
if true_allele in allelelist[tag][marker]:
genotype = "\t".join(
sorted(allelelist[tag][marker]))
for allele in allelelist[tag][marker]:
true_alleles[allele] -= 1
true_alleles[allele][0] -= 1
if true_alleles[allele][1] is not None:
true_alleles[allele][1][genotype] -= 1
if not true_alleles[allele][1][genotype]:
del true_alleles[allele][1][genotype]
del allelelist[tag][marker]
#ensure_min_samples
#filter_data
def add_sample_data(data, sample_data, sample_alleles, min_pct, min_abs, tag):
......@@ -398,8 +441,8 @@ def preprocess_data(data, min_sample_pct):
def generate_profiles(samples_in, outfile, reportfile, allelefile,
annotation_column, min_pct, min_abs, min_samples,
min_sample_pct, seqformat, library, marker, homozygotes,
limit_reads, drop_samples):
min_sample_pct, min_genotypes, seqformat, library,
marker, homozygotes, limit_reads, drop_samples):
if reportfile:
t0 = time.time()
......@@ -414,9 +457,9 @@ def generate_profiles(samples_in, outfile, reportfile, allelefile,
allelelist, annotation_column, seqformat, library, marker,
homozygotes, limit_reads, drop_samples, True)
# Ensure minimum number of samples per allele.
# Ensure minimum number of samples and genotypes per allele.
allelelist = {tag: allelelist[tag] for tag in sample_data}
ensure_min_samples(allelelist, min_samples)
filter_data(allelelist, min_samples, min_genotypes)
# Combine data from all samples. This takes most time.
data = {}
......@@ -501,6 +544,11 @@ def add_arguments(parser):
help="require this minimum number of samples for each background "
"product, as a percentage of the number of samples with a "
"particular true allele (default: %(default)s)")
filtergroup.add_argument('-g', '--min-genotypes', metavar="N",
type=pos_int_arg, default=_DEF_MIN_GENOTYPES,
help="require this minimum number of unique heterozygous genotypes "
"for each allele for which no homozygous samples are available "
"(default: %(default)s)")
filtergroup.add_argument('-M', '--marker', metavar="MARKER",
help="work only on MARKER")
filtergroup.add_argument('-H', '--homozygotes', action="store_true",
......@@ -522,6 +570,7 @@ def run(args):
generate_profiles(files[0], files[1], args.report, args.allelelist,
args.annotation_column, args.min_pct, args.min_abs,
args.min_samples, args.min_sample_pct,
args.sequence_format, args.library, args.marker,
args.homozygotes, args.limit_reads, args.drop_samples)
args.min_genotypes, args.sequence_format, args.library,
args.marker, args.homozygotes, args.limit_reads,
args.drop_samples)
#run
#!/usr/bin/env python
#
# Copyright (C) 2016 Jerry Hoogenboom
#
# This file is part of FDSTools, data analysis tools for Next
# Generation Sequencing of forensic DNA markers.
#
# FDSTools is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation, either version 3 of the License, or (at
# your option) any later version.
#
# FDSTools is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with FDSTools. If not, see <http://www.gnu.org/licenses/>.
#
"""
Compute noise ratios for all noise detected in homozygous reference
samples.
......
#!/usr/bin/env python
#
# Copyright (C) 2016 Jerry Hoogenboom
#
# This file is part of FDSTools, data analysis tools for Next
# Generation Sequencing of forensic DNA markers.
#
# FDSTools is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation, either version 3 of the License, or (at
# your option) any later version.
#
# FDSTools is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with FDSTools. If not, see <http://www.gnu.org/licenses/>.
#
"""
Compute allele-centric statistics for background noise in homozygous
reference samples (min, max, mean, sample variance).
......
#!/usr/bin/env python
#
# Copyright (C) 2016 Jerry Hoogenboom
#
# This file is part of FDSTools, data analysis tools for Next
# Generation Sequencing of forensic DNA markers.
#
# FDSTools is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation, either version 3 of the License, or (at
# your option) any later version.
#
# FDSTools is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with FDSTools. If not, see <http://www.gnu.org/licenses/>.
#
"""
Merge multiple files containing background noise profiles.
......@@ -19,42 +40,39 @@ Example: fdstools bgpredict ... | fdstools bgmerge old.txt > out.txt
import argparse
import sys
from ..lib import load_profiles, ensure_sequence_format,\
from ..lib import load_profiles_new, ensure_sequence_format,\
add_sequence_format_args
__version__ = "1.0.0"
__version__ = "1.0.1"
def merge_profiles(infiles, outfile, seqformat, library):
amounts = {}
outfile.write("\t".join(
["marker", "allele", "sequence", "fmean", "rmean", "tool"]) + "\n")
seen = {}
for infile in infiles:
if infile == "-":
profiles = load_profiles(sys.stdin, library)
profiles = load_profiles_new(sys.stdin, library)
else:
with open(infile, "r") as handle:
profiles = load_profiles(handle, library)
profiles = load_profiles_new(handle, library)
for marker in profiles:
if marker not in amounts:
amounts[marker] = {}