Commit a3e610e8 authored by Hoogenboom, Jerry's avatar Hoogenboom, Jerry
Browse files

Filtering and aggregation in Samplestats

Fixed:
* When converting STR allele names to sequences, FDSTools would reject
  any prefix variants with a false message stating that the variant does
  not match the reference sequence.
* The Samplestats tool would not allow the -b/--min-per-strand option to
  be set to zero.

Improved:
* Moved the flags generated by BGCorrect to a new column named
  correction_flags. Some of the values have been renamed for clarity,
  and this column now always contains a value.
  * The Samplestats tool will no longer add the not_corrected flag to
    each sequence, as it does not add the correction_flags column.
* The Samplestats tool now supports filtering sequences. For filtering,
  the same set of options is available as those used for marking
  alleles. The filtering options use upper case letters and have '-filt'
  appended to their long name. The new -a/--filter-action option defines
  what should be done with filtered sequencies. 'off', the default,
  disables filtering; 'combine' replaces filtered sequences with a new
  line containing aggregated data; 'delete' removes filtered sequences
  without leaving a trace.
  * The seqconvert tool is aware of the special 'Other sequences' value
    produced by Samplestats with -a/--filter-action set to 'combine'.
	Other tools will give an informative error message when the input
	contains this special value.
* The Samplestats tool now accepts non-integer and negative numbers for
  -n/--min-reads and -b/--min-per-strand because after correction read
  counts are not necessarily nonnegative integers anymore.
* The forward_correction and reverse_correction columns of Samplestats
  will now contain 0 if the sequence had exactly 0 reads both before and
  after correction (previously, this was -100).
* Renamed the _mp columns of Samplestats to _mp_sum ("per-marker
  percentage of the sum") and introduced _mp_max columns ("per-marker
  percentage of the maximum").
* Samplestats and Samplevis HTML visualisations will now mark a sequence
  as 'allele' if the minimum amount of correction OR the minimum number
  of recovered reads is reached (as opposed to AND). This allows alleles
  on stutter positions to be detected.

Changed:
* The -r/--min-recovery option of Samplestats has been renamed to
  -y/--min-recovery, analogous to the new -Y/--min-recovery-filt.

Visualisations:
* Updated Vega to version 2.4.1.
* Replaced the regular expression-based filters in all visualisations
  with a much simpler syntax. The new syntax uses space-separated search
  terms, defaulting to a 'contains'-type search method. If any search
  term is preceded by an equals sign, that term must be matched exactly.
  (The search terms themselves are actually still matched as regexes!)
* Added 'show negative alleles' option (default on) to Samplevis. When
  enabled, the graph filtering options work on abs(value) instead of the
  value itself.
* When sorting alleles in Samplevis, the allele name is now used as the
  final tiebreaker instead of the primary sorting column.
* HTML visualisations no longer re-render the entire graph when changing
  the width. The same holds true for the height setting of Allelevis.
* The tables in Samplevis HTML visualisations will now contain the
  information from BGCorrect's correction_flags column in the Notes
  column.
parent 7820cad0
......@@ -300,7 +300,7 @@ def mutate_sequence(seq, variants, location=None):
if new == "-" or new == "del":
new = ""
if pos < 0:
pos += len(seq) + 1
pos += len(seq)
pos = get_genome_pos(location, pos, True)
if pos < 0 or (pos == 0 and not ins) or pos >= len(seq):
raise ValueError(
......@@ -686,6 +686,9 @@ def detect_sequence_format(seq):
"""Return format of seq. One of 'raw', 'tssv', or 'allelename'."""
if not seq:
raise ValueError("Empty sequence")
if seq == "Other sequences":
# Special case.
return False
if PAT_SEQ_RAW.match(seq):
return 'raw'
if PAT_SEQ_TSSV.match(seq):
......@@ -994,8 +997,13 @@ def convert_sequence_raw_allelename(seq, library, marker):
def ensure_sequence_format(seq, to_format, from_format=None, library=None,
marker=None):
marker=None, allow_special=False):
"""Convert seq to 'raw', 'tssv', or 'allelename' format."""
if seq == "Other sequences":
# Special case.
if allow_special:
return False
raise ValueError("Unable to handle special value '%s'" % seq)
known_formats = ("raw", "tssv", "allelename")
if to_format not in known_formats:
raise ValueError("Unknown format '%s', choose from %s" %
......
......@@ -9,14 +9,17 @@ columns) and the number of noise reads caused by the prescense of this
sequence (_add columns), as well as the resulting number of reads after
correction (_corrected columns: original minus _noise plus _add).
The flags column contains a comma-separated list of flags with the
following meanings: 'not_corrected', no background noise profile was
available for this marker; 'not_in_ref_db', the sequence was not present
in the noise profiles given; 'not_profiled', the sequence was present in
the noise profiles given, but only as noise and not as genuine allele;
'profile_predicted', the sequence was present in the noise profiles as a
genuine allele, but its noise profile consists entirely of predictions
as opposed to direct observations.
The correction_flags column contains one of the following values:
'not_corrected', no background noise profile was available for this
marker; 'not_in_ref_db', the sequence was not present in the noise
profiles given; 'corrected_as_background_only', the sequence was present
in the noise profiles given, but only as noise and not as genuine
allele; 'corrected_bgpredict', the sequence was present in the noise
profiles as a genuine allele, but its noise profile consists entirely of
predictions as opposed to direct observations;
'corrected_bgestimate'/'corrected_bghomstats', the sequence was present
in the noise profiles as a genuine allele and at least part of its noise
profile was based on direct observations.
"""
import argparse, sys
#import numpy as np # Only imported when actually running this tool.
......@@ -43,7 +46,7 @@ def get_sample_data(infile, convert_to_raw=False, library=None):
column_names.append("forward_corrected")
column_names.append("reverse_corrected")
column_names.append("total_corrected")
column_names.append("flags")
column_names.append("correction_flags")
colid_name, colid_allele, colid_forward, colid_reverse = get_column_ids(
column_names, "name", "allele", "forward", "reverse")
data = {}
......@@ -78,11 +81,11 @@ def match_profile(column_names, data, profile, convert_to_raw, library,
colid_forward_noise, colid_reverse_noise, colid_total_noise,
colid_forward_add, colid_reverse_add, colid_total_add,
colid_forward_corrected, colid_reverse_corrected,
colid_total_corrected, colid_flags) = get_column_ids(
colid_total_corrected, colid_correction_flags) = get_column_ids(
column_names, "name", "allele", "forward", "reverse", "total",
"forward_noise", "reverse_noise", "total_noise", "forward_add",
"reverse_add", "total_add", "forward_corrected", "reverse_corrected",
"total_corrected", "flags")
"total_corrected", "correction_flags")
# Enter profiles into P.
P1 = np.matrix(profile["forward"])
......@@ -129,7 +132,7 @@ def match_profile(column_names, data, profile, convert_to_raw, library,
try:
i = profile["seqs"].index(seqs[j-1])
except ValueError:
line[colid_flags] = "not_in_ref_db"
line[colid_correction_flags] = "not_in_ref_db"
continue
line[colid_forward_noise] = forward_noise[0, i]
line[colid_reverse_noise] = reverse_noise[0, i]
......@@ -144,13 +147,16 @@ def match_profile(column_names, data, profile, convert_to_raw, library,
line[colid_forward_corrected] += line[colid_forward_add]
line[colid_reverse_corrected] += line[colid_reverse_add]
line[colid_total_corrected] += line[colid_total_add]
if ("bgestimate" not in profile["tool"][i] and
"bgcorrect" not in profile["tool"][i]):
line[colid_flags] = "profile_predicted"
if "bgestimate" in profile["tool"][i]:
line[colid_correction_flags] = "corrected_bgestimate"
if "bghomstats" in profile["tool"][i]:
line[colid_correction_flags] = "corrected_bghomstats"
elif "bgpredict" in profile["tool"][i]:
line[colid_correction_flags] = "corrected_bgpredict"
else:
line[colid_flags] = ""
line[colid_correction_flags] = "corrected"
else:
line[colid_flags] = "not_profiled"
line[colid_correction_flags] = "corrected_as_background_only"
# Add sequences that are in the profile but not in the sample.
for i in range(profile["m"]):
......
This diff is collapsed.
......@@ -86,15 +86,20 @@ def convert_sequences(infile, outfile, to_format, library=None,
seq = line[colid_allele]
if library2 != library:
seq = ensure_sequence_format(
seq, "raw", marker=marker, library=library)
if marker in revcomp_markers:
seq, "raw", marker=marker, library=library, allow_special=True)
if seq == False:
seq = line[colid_allele]
elif marker in revcomp_markers:
seq = reverse_complement(seq)
# TODO: The current implementation assumes identical
# flanking sequences. Introduce means to shift flanking
# sequence in/out of prefix and/or suffix.
line[colid_allele_out] = ensure_sequence_format(
seq, to_format, marker=marker, library=library2)
seq = ensure_sequence_format(seq, to_format, marker=marker,
library=library2, allow_special=True)
if seq == False:
seq = line[colid_allele]
line[colid_allele_out] = seq
outfile.write("\t".join(line) + "\n")
#convert_sequences
......
......@@ -74,11 +74,11 @@ _DEF_HEIGHT = 400
# Default marker name matching regular expression.
# This value can be overridden by the -M command line option.
_DEF_MARKER_REGEX = ".*"
_DEF_MARKER_NAME = ""
# Default repeat unit matching regular expression.
# This value can be overridden by the -U command line option.
_DEF_UNIT_REGEX = ".*"
_DEF_UNIT = ""
# Default data file that Vega will read when -V/--vega is specified
# without providing data to embed in the file.
......@@ -285,16 +285,18 @@ def add_arguments(parser):
default=_DEF_THRESHOLD_BIAS,
help="[sample] mark sequences that have less than this percentage of "
"reads on one strand (default: %(default)s)")
visgroup.add_argument('-M', '--marker', metavar="REGEX",
default=_DEF_MARKER_REGEX,
visgroup.add_argument('-M', '--marker', metavar="MARKER",
default=_DEF_MARKER_NAME,
help="[sample, profile, bgraw, stuttermodel] only show graphs for the "
"markers that match the given regular expression; the default "
"value '%(default)s' matches any marker name")
visgroup.add_argument('-U', '--repeat-unit', metavar="REGEX",
default=_DEF_UNIT_REGEX,
help="[stuttermodel] only show graphs for the repeat units that match "
"the given regular expression; the default value '%(default)s' "
"matches any repeat unit sequence")
"markers that contain the given value in their name; separate "
"multiple values with spaces; prepend any value with '=' for an "
"exact match (default: show all markers)")
visgroup.add_argument('-U', '--repeat-unit', metavar="UNIT",
default=_DEF_UNIT,
help="[stuttermodel] only show graphs for the repeat units that "
"contain the given value; separate multiple values with spaces; "
"prepend any value with '=' for an exact match (default: show "
"all repeat units)")
visgroup.add_argument('-L', '--log-scale', action="store_true",
help="[sample, profile, bgraw] use logarithmic scale (for sample: "
"square root scale) instead of linear scale")
......
......@@ -144,6 +144,37 @@
});
}
function setSignalValue(signalname, value){
if(!graph_spec)
return;
if(signalname == "width")
graph_spec.width = value;
if(signalname == "height")
graph_spec.height = value;
for(i in graph_spec.signals){
if(graph_spec.signals[i].name == signalname){
graph_spec.signals[i].init = value;
break;
}
}
if(graph)
graph.signal(signalname, value).padding("auto").update();
return;
}
function getSignalValue(signalname){
if(!graph_spec)
return false;
if(signalname == "width")
return graph_spec.width;
if(signalname == "height")
return graph_spec.height;
for(i in graph_spec.signals)
if(graph_spec.signals[i].name == signalname)
return graph_spec.signals[i].init;
return false;
}
function setRenderer(value){
if(graph)
graph.renderer(value).update();
......@@ -160,7 +191,7 @@
else
fileName = "graph";
document.title = fileName + " - Allele Visualisation - FDSTools";
graph_spec["data"][0]["values"] = reader.result;
graph_spec.data[0].values = reader.result;
parse();
};
reader.readAsText(fileList[0]);
......@@ -209,17 +240,13 @@
var value = parseFloat(this.value);
if(isNaN(value))
return;
graph_spec["width"] = value;
if(graph)
graph.width(value).padding("auto").update();
setSignalValue("width", value);
}, false);
document.getElementById("graphheight").addEventListener('change', function(){
var value = parseFloat(this.value);
if(isNaN(value))
return;
graph_spec["height"] = value;
if(graph)
graph.height(value).padding("auto").update();
setSignalValue("height", value);
}, false);
//Toggle options visibility.
......@@ -232,8 +259,8 @@
}, false);
//Sync graph_spec and display.
document.getElementById("graphwidth").value = graph_spec["width"];
document.getElementById("graphheight").value = graph_spec["height"];
document.getElementById("graphwidth").value = getSignalValue("width");
document.getElementById("graphheight").value = getSignalValue("height");
if(has_data){
document.getElementById("options").style.display = "none";
parse();
......
......@@ -12,7 +12,12 @@
},
{
"name": "filter_marker",
"init": ".*"
"init": ""
},
{
"name": "marker_regex",
"init": {"expr": "regexp('(?:' + replace(replace(replace(filter_marker, /^ *(.*?) *$/, '$1'), /(^| )=(.*?)(?= |$)/g, '$1^$2$$'), / +/g, ')|(?:') + ')')"},
"expr": "regexp('(?:' + replace(replace(replace(filter_marker, /^ *(.*?) *$/, '$1'), /(^| )=(.*?)(?= |$)/g, '$1^$2$$'), / +/g, ')|(?:') + ')')"
},
{
"name": "barwidth",
......@@ -50,7 +55,7 @@
},
{
"type": "filter",
"test": "datum.total >= 1 && datum.total >= amplitude_threshold && datum.maxnoise >= amplitude_pct_threshold && test('^' + filter_marker + '$', datum.marker)"
"test": "datum.total >= 1 && datum.total >= amplitude_threshold && datum.maxnoise >= amplitude_pct_threshold && test(marker_regex, datum.marker)"
},
{
"type": "formula",
......@@ -236,12 +241,7 @@
"scale": "x",
"grid": true,
"layer": "back",
"title": "Noise ratio (%)",
"properties": {
"title": {
"dy": {"value": -5}
}
}
"title": "Noise ratio (%)"
},
{
"type": "y",
......
......@@ -21,6 +21,11 @@
background-color: rgba(255, 255, 255, 0.8);
z-index: 10;
}
div.options span.help {
cursor: help;
font-weight: bold;
border-bottom: 1px dashed black;
}
table.optiongroup {
padding: 10px 0px 0px 0px;
margin: 0px;
......@@ -83,7 +88,10 @@
</tr>
<tr>
<td>Marker name</td>
<td><input type="text" id="markerFilter" size="20" title="Supports regular expression syntax: e.g., use '.*' to match anything."></td>
<td>
<input type="text" id="markerFilter" size="20">
<span class="help" title="Type xyz to show all markers with 'xyz' in their names. Type =xyz to show only the marker 'xyz'. Separate multiple values with spaces.">?</span>
</td>
</tr>
</table>
<table class="optiongroup">
......@@ -142,11 +150,11 @@
function setScale(value){
if(!graph_spec)
return;
for(i in graph_spec["marks"])
if(graph_spec["marks"][i]["scales"])
for(j in graph_spec["marks"][i]["scales"])
if(graph_spec["marks"][i]["scales"][j]["name"] == "x")
graph_spec["marks"][i]["scales"][j]["type"] = value;
for(i in graph_spec.marks)
if(graph_spec.marks[i].scales)
for(j in graph_spec.marks[i].scales)
if(graph_spec.marks[i].scales[j].name == "x")
graph_spec.marks[i].scales[j].type = value;
if(graph)
parse();
......@@ -155,20 +163,22 @@
function getScale(){
if(!graph_spec)
return "linear";
for(i in graph_spec["marks"])
if(graph_spec["marks"][i]["scales"])
for(j in graph_spec["marks"][i]["scales"])
if(graph_spec["marks"][i]["scales"][j]["name"] == "x")
return graph_spec["marks"][i]["scales"][j]["type"];
for(i in graph_spec.marks)
if(graph_spec.marks[i].scales)
for(j in graph_spec.marks[i].scales)
if(graph_spec.marks[i].scales[j].name == "x")
return graph_spec.marks[i].scales[j].type;
return "linear";
}
function setSignalValue(signalname, value){
if(!graph_spec)
return;
for(i in graph_spec["signals"]){
if(graph_spec["signals"][i]["name"] == signalname){
graph_spec["signals"][i]["init"] = value;
if(signalname == "width")
graph_spec.width = value;
for(i in graph_spec.signals){
if(graph_spec.signals[i].name == signalname){
graph_spec.signals[i].init = value;
break;
}
}
......@@ -180,9 +190,11 @@
function getSignalValue(signalname){
if(!graph_spec)
return false;
for(i in graph_spec["signals"])
if(graph_spec["signals"][i]["name"] == signalname)
return graph_spec["signals"][i]["init"];
if(signalname == "width")
return graph_spec.width;
for(i in graph_spec.signals)
if(graph_spec.signals[i].name == signalname)
return graph_spec.signals[i].init;
return false;
}
......@@ -202,7 +214,7 @@
else
fileName = "graph";
document.title = fileName + " - Background Noise Visualisation - FDSTools";
graph_spec["data"][0]["values"] = reader.result;
graph_spec.data[0].values = reader.result;
parse();
};
reader.readAsText(fileList[0]);
......@@ -257,9 +269,7 @@
var value = parseFloat(this.value);
if(isNaN(value))
return;
graph_spec["width"] = value;
if(graph)
graph.width(value).padding("auto").update();
setSignalValue("width", value);
}, false);
document.getElementById("barwidth").addEventListener('change', function(){
var value = parseFloat(this.value);
......@@ -308,7 +318,7 @@
document.getElementById("scaleLog").checked = true;
}
document.getElementById("markerFilter").value = getSignalValue("filter_marker");
document.getElementById("graphwidth").value = graph_spec["width"];
document.getElementById("graphwidth").value = getSignalValue("width");
document.getElementById("barwidth").value = getSignalValue("barwidth");
document.getElementById("subgraphoffset").value = getSignalValue("subgraphoffset");
document.getElementById("minN").value = getSignalValue("amplitude_threshold");
......
......@@ -21,6 +21,11 @@
background-color: rgba(255, 255, 255, 0.8);
z-index: 10;
}
div.options span.help {
cursor: help;
font-weight: bold;
border-bottom: 1px dashed black;
}
table.optiongroup {
padding: 10px 0px 0px 0px;
margin: 0px;
......@@ -79,7 +84,10 @@
</tr>
<tr>
<td>Marker name</td>
<td><input type="text" id="markerFilter" size="20" title="Supports regular expression syntax: e.g., use '.*' to match anything."></td>
<td>
<input type="text" id="markerFilter" size="20">
<span class="help" title="Type xyz to show all markers with 'xyz' in their names. Type =xyz to show only the marker 'xyz'. Separate multiple values with spaces.">?</span>
</td>
</tr>
</table>
<table class="optiongroup">
......@@ -138,11 +146,11 @@
function setScale(value){
if(!graph_spec)
return;
for(i in graph_spec["marks"])
if(graph_spec["marks"][i]["scales"])
for(j in graph_spec["marks"][i]["scales"])
if(graph_spec["marks"][i]["scales"][j]["name"] == "x")
graph_spec["marks"][i]["scales"][j]["type"] = value;
for(i in graph_spec.marks)
if(graph_spec.marks[i].scales)
for(j in graph_spec.marks[i].scales)
if(graph_spec.marks[i].scales[j].name == "x")
graph_spec.marks[i].scales[j].type = value;
setSignalValue("low", value == "log"? 0.001 : 0);
if(graph)
......@@ -152,20 +160,22 @@
function getScale(){
if(!graph_spec)
return "linear";
for(i in graph_spec["marks"])
if(graph_spec["marks"][i]["scales"])
for(j in graph_spec["marks"][i]["scales"])
if(graph_spec["marks"][i]["scales"][j]["name"] == "x")
return graph_spec["marks"][i]["scales"][j]["type"];
for(i in graph_spec.marks)
if(graph_spec.marks[i].scales)
for(j in graph_spec.marks[i].scales)
if(graph_spec.marks[i].scales[j].name == "x")
return graph_spec.marks[i].scales[j].type;
return "linear";
}
function setSignalValue(signalname, value){
if(!graph_spec)
return;
for(i in graph_spec["signals"]){
if(graph_spec["signals"][i]["name"] == signalname){
graph_spec["signals"][i]["init"] = value;
if(signalname == "width")
graph_spec.width = value;
for(i in graph_spec.signals){
if(graph_spec.signals[i].name == signalname){
graph_spec.signals[i].init = value;
break;
}
}
......@@ -177,9 +187,11 @@
function getSignalValue(signalname){
if(!graph_spec)
return false;
for(i in graph_spec["signals"])
if(graph_spec["signals"][i]["name"] == signalname)
return graph_spec["signals"][i]["init"];
if(signalname == "width")
return graph_spec.width;
for(i in graph_spec.signals)
if(graph_spec.signals[i].name == signalname)
return graph_spec.signals[i].init;
return false;
}
......@@ -199,7 +211,7 @@
else
fileName = "graph";
document.title = fileName + " - Background Noise Profile Visualisation - FDSTools";
graph_spec["data"][0]["values"] = reader.result;
graph_spec.data[0].values = reader.result;
parse();
};
reader.readAsText(fileList[0]);
......@@ -254,9 +266,7 @@
var value = parseFloat(this.value);
if(isNaN(value))
return;
graph_spec["width"] = value;
if(graph)
graph.width(value).padding("auto").update();
setSignalValue("width", value);
}, false);
document.getElementById("barwidth").addEventListener('change', function(){
var value = parseFloat(this.value);
......@@ -299,7 +309,7 @@
document.getElementById("scaleLog").checked = true;
}
document.getElementById("markerFilter").value = getSignalValue("filter_marker");
document.getElementById("graphwidth").value = graph_spec["width"];
document.getElementById("graphwidth").value = getSignalValue("width");
document.getElementById("barwidth").value = getSignalValue("barwidth");
document.getElementById("subgraphoffset").value = getSignalValue("subgraphoffset");
document.getElementById("minP").value = getSignalValue("filter_threshold");
......
......@@ -8,7 +8,12 @@
},
{
"name": "filter_marker",
"init": ".*"
"init": ""
},
{
"name": "marker_regex",
"init": {"expr": "regexp('(?:' + replace(replace(replace(filter_marker, /^ *(.*?) *$/, '$1'), /(^| )=(.*?)(?= |$)/g, '$1^$2$$'), / +/g, ')|(?:') + ')')"},
"expr": "regexp('(?:' + replace(replace(replace(filter_marker, /^ *(.*?) *$/, '$1'), /(^| )=(.*?)(?= |$)/g, '$1^$2$$'), / +/g, ')|(?:') + ')')"
},
{
"name": "barwidth",
......@@ -37,7 +42,7 @@
"transform": [
{
"type": "filter",
"test": "datum.allele != datum.sequence && max(datum.fmean, datum.rmean) >= filter_threshold && test('^' + filter_marker + '$', datum.marker)"
"test": "datum.allele != datum.sequence && max(datum.fmean, datum.rmean) >= filter_threshold && test(marker_regex, datum.marker)"
},
{
"type": "formula",
......@@ -231,12 +236,7 @@
"scale": "x",
"grid": true,
"layer": "back",
"title": "Noise ratio (%)",
"properties": {
"title": {
"dy": {"value": -5}
}
}
"title": "Noise ratio (%)"
},
{
"type": "y",
......
......@@ -22,6 +22,11 @@
background-color: rgba(255, 255, 255, 0.8);
z-index: 10;
}
div.options span.help {
cursor: help;
font-weight: bold;
border-bottom: 1px dashed black;
}
table.optiongroup {
padding: 10px 0px 0px 0px;
margin: 0px;
......@@ -115,6 +120,9 @@
flex-grow: 1;
padding: 5pt;
}
div.options span.help {
display: none;
}
table.optiongroup {
padding: 0px;
margin: 0px;
......@@ -237,12 +245,19 @@
</tr>
<tr>
<td>Marker name</td>
<td><input type="text" id="markerFilter" size="20" title="Supports regular expression syntax: e.g., use '.*' to match anything."></td>
<td>
<input type="text" id="markerFilter" size="20">
<span class="help" title="Type xyz to show all markers with 'xyz' in their names. Type =xyz to show only the marker 'xyz'. Separate multiple values with spaces.">?</span>
</td>