Commit ebf700a7 authored by Hoogenboom, Jerry

Big update: Bumped version to v0.0.3

Updated Stuttermark to v1.5. WARNING: This version of Stuttermark is
INCOMPATIBLE with output from previous versions of FDSTools and TSSV.

Introducing TSSV-Lite
* New tool tssv acts as a wrapper around TSSV-Lite (tssvl). Its primary
  purpose is to allow running TSSV-Lite without having to convert the
  FDSTools library to TSSV format, and to offer allelename output. Like
  all other tools in FDSTools, it also works with TSSV library files but
  its allele name generation capabilities are limited in that case.

Changed:
* TSSV-Lite and the new TSSV tool in FDSTools have two columns renamed
  w.r.t. the original TSSV program: 'name' has been changed to 'marker',
  and 'allele' has been changed to 'sequence'. All tools in FDSTools
  have been updated to use the new column names. This change affects
  Allelefinder, BGCorrect, BGEstimate, BGHomRaw, BGHomStats, BGPredict,
  Blame, Samplestats, Samplevis, Stuttermark, Stuttermodel, and
  Seqconvert. Note that this change will BREAK COMPATIBILITY of these
  tools with old data files.
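
The rename amounts to a one-time migration of existing data files: only the
header row changes. A minimal standalone sketch of such a migration (this
helper is not part of FDSTools; the function name is hypothetical):

```python
import csv

# Mapping from the old TSSV-style column names to the new FDSTools names.
RENAMES = {"name": "marker", "allele": "sequence"}

def migrate_header(in_path, out_path):
    """Rewrite a tab-separated data file, renaming old header columns.

    Data rows are copied through unmodified; only header cells that
    appear in RENAMES are replaced.
    """
    with open(in_path, newline="") as fin, \
         open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
        header = next(reader)
        writer.writerow([RENAMES.get(col, col) for col in header])
        for row in reader:
            writer.writerow(row)
```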

Fixed:
* In Samplevis HTML visualisations, the "percentage recovery" table
  filtering option used the absolute number of recovered reads instead
  of the percentage.
* Added PctRecovery to the tables in Samplevis HTML visualisations.
* BGPredict will now print a clear error message if the -n/--min-pct
  option is set to zero or a negative number, to avoid division by zero.
* Samplestats would crash if the input file contained the flags column.
* FDSTools would crash when trying to convert sequences to allele names
  using a TSSV library.

Improved:
* Libconvert will no longer include duplicate sequences in the STR
  definition when converting to TSSV format and the reference sequence
  of one of the markers is the same as one of its aliases, or when
  aliases of one marker share one or more prefix or suffix sequences.
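
  The deduplication described above boils down to keeping only the first
  occurrence of each prefix or suffix sequence while preserving order. A
  standalone sketch of that idea (the actual Libconvert change uses an
  equivalent index-based check instead of a set):

```python
def first_occurrences(seqs):
    """Return the sequences with duplicates removed, keeping only the
    first occurrence of each and preserving the original order."""
    seen = set()
    result = []
    for seq in seqs:
        if seq not in seen:
            seen.add(seq)
            result.append(seq)
    return result

# first_occurrences(["AGAT", "TCTA", "AGAT"]) yields ["AGAT", "TCTA"]
```
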
* Updated add_input_output_args() such that the output file is a
  positional argument (instead of -o) for tools that have a single input
  file and no support for batches.
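
  As a rough illustration of the described behaviour (the signature and
  argument names below are simplified assumptions, not the actual
  FDSTools helper):

```python
import argparse
import sys

def add_input_output_args(parser, single_in=True, batch_support=False):
    """Sketch: make OUT a positional argument for tools with a single
    input file and no batch support; otherwise keep the -o option."""
    parser.add_argument("infile", metavar="IN", nargs="?",
                        default=sys.stdin, type=argparse.FileType("r"),
                        help="the input file (default: read from stdin)")
    if single_in and not batch_support:
        parser.add_argument("outfile", metavar="OUT", nargs="?",
                            default=sys.stdout,
                            type=argparse.FileType("w"),
                            help="the output file (default: stdout)")
    else:
        parser.add_argument("-o", "--output", metavar="OUT",
                            default=sys.stdout,
                            type=argparse.FileType("w"),
                            help="the output file (default: stdout)")
```
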
* Updated add_sequence_format_args() such that the library file can be
  made a required argument.
* Refined the FDSTools package description, since FDSTools does more
  than just noise filtering.
* FDSTools will now do a marginally better job at producing allele names
  for sequences that do not exactly match the provided STR pattern. When
  seeking the longest matching portion of the sequence, it will now also
  test the reversed sequence with a reversed pattern, which sometimes
  yields a longer match. The result is still not optimal, but some
  refactoring has been done to move away from regular expressions.
* BGCorrect will now also fill in correction_flags for newly added
  sequences.
* Adjusted the help text of Samplestats to include the fact that the -c
  and -y options have an OR relation instead of an AND relation.
* BGCorrect, BGEstimate, BGHomRaw, BGHomStats, BGPredict, and
  Stuttermodel will now ignore special values that may appear in the
  place of a sequence (currently: 'Other sequences' and 'No data').
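
A minimal sketch of how such special values can be skipped, using the two
values named above (the actual tools consult the SEQ_SPECIAL_VALUES
constant visible in the diffs below; the helper name here is hypothetical):

```python
# Placeholder values that may appear in the sequence column instead of
# an actual sequence (per the changelog entry above).
SEQ_SPECIAL_VALUES = ("Other sequences", "No data")

def iter_real_sequences(rows, colid_sequence):
    """Yield only rows whose sequence column holds an actual sequence,
    skipping the special placeholder values."""
    for row in rows:
        if row[colid_sequence] in SEQ_SPECIAL_VALUES:
            continue
        yield row
```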

Removed:
* The -m/--marker-column and -a/--allele-column arguments of BGPredict
  had no effect and have been removed.

Visualisations:
* Updated bundled D3 to v3.5.12.
* In HTML visualisations, if the page is scrolled to the right edge when
  an option is changed that causes the graphs to become wider, the page
  now remains scrolled to the right.
* Samplevis HTML visualisations:
  * Added 'Clear manually added/removed' link to the table filtering.
  * Reduced flicker of the mouse cursor in Internet Explorer.
  * Added 'Common axis range' checkbox (only available when 'Split
    markers' is off).
  * Added 'Save table' link to save the table of selected alleles to a
    tab-separated file.
  * Added 'PctRecovery' column to the tables of selected alleles.
  * An alert box is now shown when a data file is loaded that contains
    markers that have 'No data'.
  * Added 'Percentage of total reads' to the graph filtering options.
  * Added a note to the table filtering options to explain that the
    minimum percentage correction and recovery have an OR relation.
parent 4f9286e4
......@@ -29,6 +29,9 @@ Alternatively, FDSTools can be installed by running:
FDSTools Changelog
------------------
v0.0.3
- Includes Stuttermark v1.5
v0.0.2
- Added global -d/--debug switch
- Includes Stuttermark v1.4
......@@ -77,6 +80,11 @@ Output
Changelog
~~~~~~~~~
v1.5
- Changed column names 'name' and 'allele' to 'marker' and 'sequence',
respectively. WARNING: Stuttermark is now INCOMPATIBLE with output
from TSSV but made compatible with TSSV-Lite instead.
v1.4
- Stuttermark now accepts raw sequences and allele names as input, which
are automatically rewritten as TSSV-style sequences using a specified
......
"""
Tools for characterisation and filtering of PCR stutter artefacts and other
systemic noise in Next Generation Sequencing data of forensic DNA markers.
Data analysis tools for Next Generation Sequencing of forensic DNA markers,
including tools for characterisation and filtering of PCR stutter artefacts and
other systemic noise, and for automatic detection of the alleles in a sample.
"""
__version_info__ = ('0', '0', '2')
__version_info__ = ('0', '0', '3')
__version__ = '.'.join(__version_info__)
usage = __doc__.split("\n\n\n")
......
......@@ -24,8 +24,8 @@ profile was based on direct observations.
import argparse, sys
#import numpy as np # Only imported when actually running this tool.
from ..lib import load_profiles, ensure_sequence_format, nnls, \
get_column_ids, add_sequence_format_args, \
from ..lib import load_profiles, ensure_sequence_format, get_column_ids, \
nnls, add_sequence_format_args, SEQ_SPECIAL_VALUES, \
add_input_output_args, get_input_output_files
__version__ = "0.1dev"
......@@ -47,15 +47,15 @@ def get_sample_data(infile, convert_to_raw=False, library=None):
column_names.append("reverse_corrected")
column_names.append("total_corrected")
column_names.append("correction_flags")
colid_name, colid_allele, colid_forward, colid_reverse = get_column_ids(
column_names, "name", "allele", "forward", "reverse")
colid_marker, colid_sequence, colid_forward, colid_reverse = get_column_ids(
column_names, "marker", "sequence", "forward", "reverse")
data = {}
for line in infile:
cols = line.rstrip("\r\n").split("\t")
marker = cols[colid_name]
if convert_to_raw:
cols[colid_allele] = ensure_sequence_format(
cols[colid_allele], "raw", library=library, marker=marker)
marker = cols[colid_marker]
if convert_to_raw and cols[colid_sequence] not in SEQ_SPECIAL_VALUES:
cols[colid_sequence] = ensure_sequence_format(
cols[colid_sequence], "raw", library=library, marker=marker)
cols[colid_forward] = int(cols[colid_forward])
cols[colid_reverse] = int(cols[colid_reverse])
cols.append(0)
......@@ -77,12 +77,12 @@ def get_sample_data(infile, convert_to_raw=False, library=None):
def match_profile(column_names, data, profile, convert_to_raw, library,
marker):
(colid_name, colid_allele, colid_forward, colid_reverse, colid_total,
(colid_marker, colid_sequence, colid_forward, colid_reverse, colid_total,
colid_forward_noise, colid_reverse_noise, colid_total_noise,
colid_forward_add, colid_reverse_add, colid_total_add,
colid_forward_corrected, colid_reverse_corrected,
colid_total_corrected, colid_correction_flags) = get_column_ids(
column_names, "name", "allele", "forward", "reverse", "total",
column_names, "marker", "sequence", "forward", "reverse", "total",
"forward_noise", "reverse_noise", "total_noise", "forward_add",
"reverse_add", "total_add", "forward_corrected", "reverse_corrected",
"total_corrected", "correction_flags")
......@@ -96,14 +96,16 @@ def match_profile(column_names, data, profile, convert_to_raw, library,
C1 = np.matrix(np.zeros([1, profile["m"]]))
C2 = np.matrix(np.zeros([1, profile["m"]]))
for line in data:
if line[colid_sequence] in SEQ_SPECIAL_VALUES:
continue
if convert_to_raw:
allele = ensure_sequence_format(line[colid_allele], "raw",
sequence = ensure_sequence_format(line[colid_sequence], "raw",
library=library, marker=marker)
else:
allele = line[colid_allele]
seqs.append(allele)
sequence = line[colid_sequence]
seqs.append(sequence)
try:
i = profile["seqs"].index(allele)
i = profile["seqs"].index(sequence)
except ValueError:
# Note: Not adding any new sequences to the profile, since
# they will just be zeroes and have no effect on the result.
......@@ -111,6 +113,10 @@ def match_profile(column_names, data, profile, convert_to_raw, library,
C1[0, i] = line[colid_forward]
C2[0, i] = line[colid_reverse]
# Stop if this sample has no explicit data for this marker.
if not seqs:
return
# Compute corrected read counts.
A = nnls(np.hstack([P1, P2]).T, np.hstack([C1, C2]).T).T
np.fill_diagonal(P1, 0)
......@@ -128,6 +134,8 @@ def match_profile(column_names, data, profile, convert_to_raw, library,
j = 0
for line in data:
if line[colid_sequence] in SEQ_SPECIAL_VALUES:
continue
j += 1
try:
i = profile["seqs"].index(seqs[j-1])
......@@ -167,8 +175,8 @@ def match_profile(column_names, data, profile, convert_to_raw, library,
amount += forward_add[0, i] + reverse_add[0, i]
if amount > 0:
line = [""] * len(column_names)
line[colid_name] = marker
line[colid_allele] = profile["seqs"][i]
line[colid_marker] = marker
line[colid_sequence] = profile["seqs"][i]
line[colid_forward] = 0
line[colid_reverse] = 0
line[colid_total] = 0
......@@ -185,10 +193,19 @@ def match_profile(column_names, data, profile, convert_to_raw, library,
line[colid_forward_corrected] += line[colid_forward_add]
line[colid_reverse_corrected] += line[colid_reverse_add]
line[colid_total_corrected] += line[colid_total_add]
if "bgestimate" in profile["tool"][i]:
line[colid_correction_flags] = "corrected_bgestimate"
elif "bghomstats" in profile["tool"][i]:
line[colid_correction_flags] = "corrected_bghomstats"
elif "bgpredict" in profile["tool"][i]:
line[colid_correction_flags] = "corrected_bgpredict"
else:
line[colid_correction_flags] = "corrected"
else:
line[colid_forward_add] = 0
line[colid_reverse_add] = 0
line[colid_total_add] = 0
line[colid_correction_flags] = "corrected_as_background_only"
data.append(line)
#match_profile
......@@ -196,7 +213,7 @@ def match_profile(column_names, data, profile, convert_to_raw, library,
def match_profiles(infile, outfile, profiles, library, seqformat):
column_names, data = get_sample_data(
infile, convert_to_raw=seqformat=="raw", library=library)
colid_allele = get_column_ids(column_names, "allele")
colid_sequence = get_column_ids(column_names, "sequence")
outfile.write("\t".join(column_names) + "\n")
for marker in data:
......@@ -204,9 +221,11 @@ def match_profiles(infile, outfile, profiles, library, seqformat):
match_profile(column_names, data[marker], profiles[marker],
seqformat!="raw", library, marker)
for line in data[marker]:
if seqformat is not None and seqformat != "raw":
line[colid_allele] = ensure_sequence_format(line[colid_allele],
seqformat, library=library, marker=marker)
if (seqformat is not None and seqformat != "raw" and
line[colid_sequence] not in SEQ_SPECIAL_VALUES):
line[colid_sequence] = ensure_sequence_format(
line[colid_sequence], seqformat, library=library,
marker=marker)
outfile.write("\t".join(map(str, line)) + "\n")
#match_profiles
......@@ -225,8 +244,8 @@ def add_arguments(parser):
def run(args):
# Import numpy now.
import numpy as np
global np
import numpy as np
gen = get_input_output_files(args, True, True)
if not gen:
......
......@@ -340,6 +340,9 @@ def add_sample_data(data, sample_data, sample_alleles, min_pct, min_abs, tag):
if marker not in sample_alleles or not sample_alleles[marker]:
# Sample does not participate in this marker (no alleles).
continue
if allele is False:
# This was a special sequence value, skip it.
continue
p = data[marker]["profiles"]
try:
......@@ -412,7 +415,7 @@ def generate_profiles(samples_in, outfile, reportfile, allelefile,
get_sample_data(samples_in,
lambda tag, data: sample_data.update({tag: data}),
allelelist, annotation_column, seqformat, library, marker,
homozygotes, limit_reads, drop_samples)
homozygotes, limit_reads, drop_samples, True)
# Ensure minimum number of samples per allele.
allelelist = {tag: allelelist[tag] for tag in sample_data}
......@@ -512,8 +515,8 @@ def add_arguments(parser):
def run(args):
# Import numpy now.
import numpy as np
global np
import numpy as np
files = get_input_output_files(args)
if not files:
......
......@@ -49,7 +49,7 @@ def add_sample_data(data, sample_data, sample_alleles, min_pct, min_abs, tag):
# Enter the read counts into data and check the thresholds.
for marker, sequence in sample_data:
if marker not in sample_alleles:
if marker not in sample_alleles or sequence is False:
# Sample does not participate in this marker.
continue
allele = sample_alleles[marker]
......@@ -120,7 +120,8 @@ def compute_ratios(samples_in, outfile, allelefile, annotation_column, min_pct,
data, sample_data,
{m: allelelist[tag][m].pop() for m in allelelist[tag]},
min_pct, min_abs, tag),
allelelist, annotation_column, seqformat, library, marker, True)
allelelist, annotation_column, seqformat, library, marker, True,
allow_special=True)
# Ensure minimum number of samples per allele and filter
# insignificant background products.
......
......@@ -53,7 +53,7 @@ def add_sample_data(data, sample_data, sample_alleles, min_pct, min_abs):
# Enter the read counts into data and check the thresholds.
for marker, sequence in sample_data:
if marker not in sample_alleles:
if marker not in sample_alleles or sequence is False:
# Sample does not participate in this marker.
continue
allele = sample_alleles[marker]
......@@ -114,7 +114,7 @@ def compute_stats(samples_in, outfile, allelefile, annotation_column, min_pct,
{m: allelelist[tag][m].pop() for m in allelelist[tag]},
min_pct, min_abs),
allelelist, annotation_column, seqformat, library, marker, True,
limit_reads, drop_samples)
limit_reads, drop_samples, True)
# Ensure minimum number of samples per allele and filter
# insignificant background products.
......
......@@ -36,21 +36,9 @@ __version__ = "0.1dev"
# Default values for parameters are specified below.
# Default name of the column that contains the marker name.
# This value can be overridden by the -m command line option.
_DEF_COLNAME_MARKER = "name"
# Default name of the column that contains the allele.
# This value can be overridden by the -a command line option.
_DEF_COLNAME_ALLELE = "allele"
# Default name of the column to write the output to.
# This value can be overridden by the -o command line option.
_DEF_COLNAME_ALLELE_OUT = "allele"
# Default minimum amount of background to consider, as a percentage of
# the highest allele.
# This value can be overridden by the -m command line option.
# This value can be overridden by the -n command line option.
_DEF_THRESHOLD_PCT = 0.5
# Default minimum R2 score.
......@@ -206,28 +194,34 @@ def get_relative_frequencies(stutters, combinations):
#get_relative_frequencies
def predict_profiles(stuttermodel, seqsfile, outfile, marker_column,
allele_column, default_marker, use_all_data, min_pct,
min_r2, seqformat, library):
def predict_profiles(stuttermodel, seqsfile, outfile, default_marker,
use_all_data, min_pct, min_r2, seqformat, library):
if min_pct <= 0:
raise ValueError("The -n/--min-pct option cannot be negative or zero!")
# Parse stutter model file.
model = parse_stuttermodel(stuttermodel, min_r2, use_all_data)
# Read list of sequences.
seqlist = {}
column_names = seqsfile.readline().rstrip("\r\n").split("\t")
colid_allele = get_column_ids(column_names, "allele")
colid_name = get_column_ids(column_names, "name", optional=True)
colid_sequence = get_column_ids(column_names, "sequence")
colid_marker = get_column_ids(column_names, "marker", optional=True)
for line in seqsfile:
line = line.rstrip("\r\n").split("\t")
marker = line[colid_name] if colid_name is not None else default_marker
marker = line[colid_marker] if colid_marker is not None \
else default_marker
if marker not in model:
if use_all_data and "All data" in model:
# No marker-specific model available, use "All data".
model[marker] = model["All data"]
else:
continue
sequence = ensure_sequence_format(line[colid_allele], "raw",
library=library, marker=marker)
sequence = ensure_sequence_format(line[colid_sequence], "raw",
library=library, marker=marker,
allow_special=True)
if sequence is False:
continue
try:
seqlist[marker].append(sequence)
except KeyError:
......@@ -324,14 +318,6 @@ def add_arguments(parser):
parser.add_argument('outfile', metavar="OUT", nargs="?", default=sys.stdout,
type=argparse.FileType("w"),
help="the file to write the output to (default: write to stdout)")
parser.add_argument('-m', '--marker-column', metavar="COLNAME",
default=_DEF_COLNAME_MARKER,
help="name of the column that contains the marker name "
"(default: '%(default)s')")
parser.add_argument('-a', '--allele-column', metavar="COLNAME",
default=_DEF_COLNAME_ALLELE,
help="name of the column that contains the allele "
"(default: '%(default)s')")
parser.add_argument('-M', '--marker', metavar="MARKER",
help="assume the specified marker for all sequences")
parser.add_argument('-A', '--use-all-data', action="store_true",
......@@ -352,11 +338,10 @@ def add_arguments(parser):
def run(args):
# Import numpy now.
import numpy as np
global np
import numpy as np
predict_profiles(args.stuttermodel, args.seqs, args.outfile,
args.marker_column, args.allele_column, args.marker,
predict_profiles(args.stuttermodel, args.seqs, args.outfile, args.marker,
args.use_all_data, args.min_pct, args.min_r2,
args.sequence_format, args.library)
#run
......@@ -148,8 +148,8 @@ def add_arguments(parser):
def run(args):
# Import numpy now.
import numpy as np
global np
import numpy as np
files = get_input_output_files(args)
if not files:
......
......@@ -149,8 +149,9 @@ def convert_library(infile, outfile, aliases=False):
if len(suffixes):
middle = middle[:-len(suffixes)]
if unmatched:
middle = [(x[0], "0", x[2]) for x in middle] + \
[(x, "0", "1") for x in unmatched]
middle = [(x, "0", "1") for x in unmatched] + \
[(x[0], "0", x[2]) for x in middle]
elif marker in library["nostr_reference"]:
middle = [(library["nostr_reference"][marker],
"0" if marker in marker_aliases else "1", "1")]
......@@ -162,18 +163,23 @@ def convert_library(infile, outfile, aliases=False):
prefixes += library["prefix"][alias]
if alias in library["suffix"]:
suffixes += library["suffix"][alias]
if marker not in library["regex"]:
if marker not in library["regex"] and (
marker not in library["nostr_reference"] or
library["nostr_reference"][marker] !=
library["aliases"][alias]["sequence"]):
middle.append((
library["aliases"][alias]["sequence"],
"0", "1"))
# Final regex is prefixes + middle + suffixes.
pattern = []
for prefix in prefixes:
pattern.append((prefix, "0", "1"))
for i in range(len(prefixes)):
if i == prefixes.index(prefixes[i]):
pattern.append((prefixes[i], "0", "1"))
pattern += middle
for suffix in suffixes:
pattern.append((suffix, "0", "1"))
for i in range(len(suffixes)):
if i == suffixes.index(suffixes[i]):
pattern.append((suffixes[i], "0", "1"))
outfile.write("%s\t%s\t%s\t%s\n" % (
marker, flanks[0], flanks[1],
......@@ -181,6 +187,8 @@ def convert_library(infile, outfile, aliases=False):
else:
# TSSV -> FDSTools
outfile.write("; Lines beginning with a semicolon (;) are ignored by "
"FDSTools.\n\n")
ini = RawConfigParser(allow_no_value=True)
ini.optionxform = str
......@@ -286,7 +294,7 @@ def convert_library(infile, outfile, aliases=False):
"; specified here.")
ini.add_section("block_length")
ini.set("block_length",
"; Specify the core repeat unit length of each marker. The "
"; Specify the repeat unit length of each STR marker. The "
"default length is 4.")
ini.add_section("max_expected_copies")
ini.set("max_expected_copies",
......
......@@ -155,7 +155,7 @@ def compute_stats(infile, outfile, min_reads,
min_pct_of_sum_filt, min_correction_filt, min_recovery_filt):
# Check presence of required columns.
column_names = infile.readline().rstrip("\r\n").split("\t")
get_column_ids(column_names, "name", "allele", "forward", "reverse",
get_column_ids(column_names, "marker", "sequence", "forward", "reverse",
"total")
if "flags" not in column_names:
column_names.append("flags")
......@@ -218,7 +218,7 @@ def compute_stats(infile, outfile, min_reads,
data = {}
for line in infile:
row = line.rstrip("\r\n").split("\t")
marker = row[ci["name"]]
marker = row[ci["marker"]]
row[ci["forward"]] = int(row[ci["forward"]])
row[ci["reverse"]] = int(row[ci["reverse"]])
row[ci["total"]] = int(row[ci["total"]])
......@@ -230,7 +230,7 @@ def compute_stats(infile, outfile, min_reads,
if len(row) == ci["flags"]:
row.append([])
else:
row[ci["flags"]] = map(str.strip, ci["flags"].split(","))
row[ci["flags"]] = map(str.strip, row[ci["flags"]].split(","))
if marker not in data:
data[marker] = []
data[marker].append(row)
......@@ -420,22 +420,22 @@ def compute_stats(infile, outfile, min_reads,
row[ci["reverse"]] if "reverse_corrected" not in ci
else row[ci["reverse_corrected"]])
# Check if this allele should be filtered out.
# Check if this sequence should be filtered out.
if filter_action != "off" and (
total_added < min_reads_filt or
pct_of_max < min_pct_of_max_filt or
pct_of_sum < min_pct_of_sum_filt or
(correction < min_correction_filt and # TODO: docs!
(correction < min_correction_filt and
recovery < min_recovery_filt) or
min_strand < min_per_strand_filt):
filtered[marker].append(row)
# Check if this sequence is an allele.
elif (row[ci["allele"]] != "Other sequences" and
elif (row[ci["sequence"]] != "Other sequences" and
total_added >= min_reads and
pct_of_max >= min_pct_of_max and
pct_of_sum >= min_pct_of_sum and
(correction >= min_correction or # TODO: docs!
(correction >= min_correction or
recovery >= min_recovery) and
min_strand >= min_per_strand):
row[ci["flags"]].append("allele")
......@@ -457,8 +457,8 @@ def compute_stats(infile, outfile, min_reads,
if filter_action == "combine":
have_combined = False
combined = [""] * len(column_names)
combined[ci["name"]] = marker
combined[ci["allele"]] = "Other sequences"
combined[ci["marker"]] = marker
combined[ci["sequence"]] = "Other sequences"
for i in (ci[column] for column in COLUMN_ORDER if column in ci):
# Set known numeric columns to 0.
combined[i] = 0
......@@ -618,7 +618,8 @@ def compute_stats(infile, outfile, min_reads,
def add_arguments(parser):
add_input_output_args(parser, True, True, False)
intergroup = parser.add_argument_group("interpretation options",
"sequences that match all of these settings are marked as 'allele'")
"sequences that match the -c or -y option (or both) and all of the "
"other settings are marked as 'allele'")
intergroup.add_argument('-n', '--min-reads', metavar="N", type=float,
default=_DEF_MIN_READS,
help="the minimum number of reads (default: %(default)s)")
......@@ -643,7 +644,9 @@ def add_arguments(parser):
help="the minimum number of reads that was recovered thanks to "
"noise correction (by e.g., bgcorrect), as a percentage of the "
"original number of reads (default: %(default)s)")
filtergroup = parser.add_argument_group("filtering options")
filtergroup = parser.add_argument_group("filtering options",
"sequences that match the -C or -Y option (or both) and all of the "
"other settings are retained, all others are filtered")
filtergroup.add_argument('-a', '--filter-action', metavar="ACTION",
choices=("off", "combine", "delete"), default="off",
help="filtering mode: 'off', disable filtering; 'combine', replace "
......
......@@ -48,47 +48,47 @@ __version__ = "0.1dev"
# Default name of the column that contains the marker name.
# This value can be overridden by the -m command line option.
_DEF_COLNAME_MARKER = "name"
_DEF_COLNAME_MARKER = "marker"
# Default name of the column that contains the allele.
# Default name of the column that contains the sequence.
# This value can be overridden by the -a command line option.
_DEF_COLNAME_ALLELE = "allele"
_DEF_COLNAME_SEQUENCE = "sequence"
def convert_sequences(infile, outfile, to_format, library=None,
fixed_marker=None, colname_marker=_DEF_COLNAME_MARKER,
colname_allele=_DEF_COLNAME_ALLELE,
colname_allele_out=None,
colname_sequence=_DEF_COLNAME_SEQUENCE,
colname_sequence_out=None,
library2=None, revcomp_markers=[]):
if colname_allele_out is None:
colname_allele_out = colname_allele
if colname_sequence_out is None:
colname_sequence_out = colname_sequence
column_names = infile.readline().rstrip("\r\n").split("\t")
colid_allele = get_column_ids(column_names, colname_allele)
colid_sequence = get_column_ids(column_names, colname_sequence)
if library is None:
fixed_marker = "" # Don't need marker names without library.
if fixed_marker is None:
colid_marker = get_column_ids(column_names, colname_marker)
try:
colid_allele_out = get_column_ids(column_names, colname_allele_out)
colid_sequence_out = get_column_ids(column_names, colname_sequence_out)
except:
column_names.append(colname_allele_out)
colid_allele_out = -1
column_names.append(colname_sequence_out)
colid_sequence_out = -1
outfile.write("\t".join(column_names) + "\n")
for line in infile:
line = line.rstrip("\r\n").split("\t")
if line == [""]:
continue
if colid_allele_out == -1:
if colid_sequence_out == -1:
line.append("")
marker = line[colid_marker] if fixed_marker is None else fixed_marker
seq = line[colid_allele]
seq = line[colid_sequence]
if library2 != library:
seq = ensure_sequence_format(
seq, "raw", marker=marker, library=library, allow_special=True)
if seq == False:
seq = line[colid_allele]
seq = line[colid_sequence]
elif marker in revcomp_markers:
seq = reverse_complement(seq)
# TODO: The current implementation assumes identical
......@@ -98,8 +98,8 @@ def convert_sequences(infile, outfile, to_format, library=None,
seq = ensure_sequence_format(seq, to_format, marker=marker,
library=library2, allow_special=True)
if seq == False:
seq = line[colid_allele]
line[colid_allele_out] = seq
seq = line[colid_sequence]
line[colid_sequence_out] = seq
outfile.write("\t".join(line) + "\n")
#convert_sequences
......@@ -114,8 +114,8 @@ def add_arguments(parser):
help="name of the column that contains the marker name "
"(default: '%(default)s')")
parser.add_argument('-a', '--allele-column', metavar="COLNAME",
default=_DEF_COLNAME_ALLELE,
help="name of the column that contains the allele "
default=_DEF_COLNAME_SEQUENCE,
help="name of the column that contains the sequence "
"(default: '%(default)s')")
parser.add_argument('-c', '--output-column', metavar="COLNAME",
help="name of the column to write the output to "
......
#!/usr/bin/env python
"""
Mark potential stutter products by assuming a fixed maximum percentage
of stutter product vs the parent allele.
of stutter product vs the parent sequence.
Stuttermark adds a new column (named 'annotation' by default) to the
output. The new column contains 'STUTTER' for possible stutter
......@@ -10,7 +10,7 @@ annotated as 'UNKNOWN'. A sequence is considered a possible stutter
product if its total read count is less than or equal to the maximum
number of expected stutter reads. The maximum number of stutter reads
is computed by assuming a fixed percentage of stutter product compared