notes.txt · a3e610e85dacc82889cd1657e8d7cb2ac853a104 · Hoogenboom / fdstools

Filtering and aggregation in Samplestats · a3e610e8
Hoogenboom, Jerry authored Dec 09, 2015
Fixed:
* When converting STR allele names to sequences, FDSTools would wrongly
  reject any prefix variants, with a misleading message stating that the
  variant does not match the reference sequence.
* The Samplestats tool would not allow the -b/--min-per-strand option to
  be set to zero.

Improved:
* Moved the flags generated by BGCorrect to a new column named
  correction_flags. Some of the values have been renamed for clarity,
  and this column now always contains a value.
  * The Samplestats tool will no longer add the not_corrected flag to
    each sequence, as it does not add the correction_flags column.
* The Samplestats tool now supports filtering sequences. For filtering,
  the same set of options is available as those used for marking
  alleles. The filtering options use upper case letters and have '-filt'
  appended to their long name. The new -a/--filter-action option defines
  what should be done with filtered sequences. 'off', the default,
  disables filtering; 'combine' replaces filtered sequences with a new
  line containing aggregated data; 'delete' removes filtered sequences
  without leaving a trace.
  * The seqconvert tool is aware of the special 'Other sequences' value
    produced by Samplestats with -a/--filter-action set to 'combine'.
    Other tools will give an informative error message when the input
    contains this special value.
* The Samplestats tool now accepts non-integer and negative numbers for
  -n/--min-reads and -b/--min-per-strand, because read counts are no
  longer necessarily non-negative integers after correction.
* The forward_correction and reverse_correction columns of Samplestats
  will now contain 0 if the sequence had exactly 0 reads both before and
  after correction (previously, this was -100).
* Renamed the _mp columns of Samplestats to _mp_sum ("per-marker
  percentage of the sum") and introduced _mp_max columns ("per-marker
  percentage of the maximum").
* Samplestats and Samplevis HTML visualisations will now mark a sequence
  as 'allele' if the minimum amount of correction OR the minimum number
  of recovered reads is reached (as opposed to AND). This allows alleles
  on stutter positions to be detected.
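
The three -a/--filter-action values described above ('off', 'combine',
'delete') can be illustrated with a minimal sketch. This is not the
Samplestats implementation: the function name apply_filter_action, the
row layout (dicts with 'marker', 'sequence', 'forward', 'reverse') and
the keep predicate are hypothetical, and grouping the aggregated line
per marker by summing read counts is an assumption. Only the option
values and the 'Other sequences' label come from these notes.

```python
OTHER = "Other sequences"  # special value named in the notes


def apply_filter_action(rows, keep, action="off"):
    """Return rows after applying a filter action.

    rows:   list of dicts with 'marker', 'sequence', 'forward', 'reverse'
            (hypothetical layout for this sketch).
    keep:   predicate returning True for rows that pass the filter.
    action: 'off', 'combine' or 'delete'.
    """
    if action == "off":
        # Filtering disabled: every row is returned unchanged.
        return list(rows)
    if action not in ("combine", "delete"):
        raise ValueError("unknown filter action: %r" % (action,))

    kept = [row for row in rows if keep(row)]
    if action == "delete":
        # Filtered rows vanish without leaving a trace.
        return kept

    # 'combine': replace the filtered rows with one aggregated row.
    # Assumption: one aggregate per marker, read counts summed.
    combined = {}
    for row in rows:
        if keep(row):
            continue
        agg = combined.setdefault(row["marker"], {
            "marker": row["marker"],
            "sequence": OTHER,
            "forward": 0,
            "reverse": 0,
        })
        agg["forward"] += row["forward"]
        agg["reverse"] += row["reverse"]
    return kept + list(combined.values())
```

For example, with keep defined as "at least 10 reads in total",
'delete' returns only the rows that pass, while 'combine' returns those
rows plus one 'Other sequences' row per marker carrying the summed
counts of the rows that were filtered out.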

Changed:
* The short name of the --min-recovery option of Samplestats has been
  changed from -r to -y, analogous to the new -Y/--min-recovery-filt.

Visualisations:
* Updated Vega to version 2.4.1.
* Replaced the regular expression-based filters in all visualisations
  with a much simpler syntax. The new syntax uses space-separated search
  terms, defaulting to a 'contains'-type search method. If any search
  term is preceded by an equals sign, that term must be matched exactly.
  (The search terms themselves are actually still matched as regexes!)
* Added 'show negative alleles' option (default on) to Samplevis. When
  enabled, the graph filtering options work on abs(value) instead of the
  value itself.
* When sorting alleles in Samplevis, the allele name is now used as the
  final tiebreaker instead of the primary sorting column.
* HTML visualisations no longer re-render the entire graph when changing
  the width. The same holds true for the height setting of Allelevis.
* The tables in Samplevis HTML visualisations will now contain the
  information from BGCorrect's correction_flags column in the Notes
  column.
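
The space-separated filter syntax described in this section can be
sketched in a few lines of Python. This is an illustration only, not the
code used in the visualisations: the function name build_filter is
hypothetical, and treating multiple terms as alternatives (a name is
shown if ANY term matches) and matching case-sensitively are
assumptions. The notes only state that terms are space-separated,
default to a 'contains' search, are matched exactly when preceded by an
equals sign, and are still interpreted as regular expressions.

```python
import re


def build_filter(query):
    """Turn a space-separated filter string into a predicate on names."""
    terms = query.split()
    if not terms:
        # An empty filter lets everything through.
        return lambda name: True

    patterns = []
    for term in terms:
        if term.startswith("=") and len(term) > 1:
            # '=term': anchor the regex so the whole name must match.
            patterns.append("^(?:%s)$" % term[1:])
        else:
            # Plain term: 'contains'-type search, still a regex.
            patterns.append("(?:%s)" % term)

    # Assumption: a name passes if any one of the terms matches.
    regex = re.compile("|".join(patterns))
    return lambda name: regex.search(name) is not None
```

With the filter string "=D3S1358 TH0", this sketch accepts the names
"D3S1358" (exact match) and "TH01" (contains "TH0") and rejects
"D3S1358x". Because terms remain regexes, characters such as '.' keep
their special meaning.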