To-do:
* TRY: BGCorrect edits to get rid of overcorrection of singletons:
  * Round final output to nearest integer; halves away from zero.
  * Clip deviations of less than 1 read to zero while iterating.
  * Both.
* Samplevis:
  * Add option to truncate long allele names.
  * Sort STR alleles by length by default.
  * Option to adjust the sorting.
  * Option to choose complete table download.
  * When we have them, add default values to table filtering (for reference).
  * Some of the media query breakpoints overlap, fix this.
* Perhaps it is desirable to be able to request a list of 'Other sequences'.
* Additions needed for publication:
  * [If not too difficult to implement] BGEstimate should start with homozygous
    samples and add heterozygous samples later to optimise correction.
  * Summary statistics for BGEstimate based on top N genotypes per marker,
    which is the highest percentage remaining background in reference database
    (maybe also additional value for confidence interval).
  * Visualisation to display highest remaining background (positive and
    negative) in known samples after BGCorrect analysis.
* Add options to Libconvert to generate a template for STR or non-STR markers.
* Add plotting of raw data points to StuttermodelVis.
* Add a print stylesheet for the other visualisations (only Samplevis has one).
* Add visualisation with all markers in one graph ("samplesummaryvis"?).
* Allow loading multiple files into HTML visualisations and provide prev/next
  buttons to browse them.
* Samplevis HTML visualisations in IE11:
  * Printing striped table rows does not seem to work, though this might be an
    NFI-specific issue.
  * Tables are not perfectly aligned with the graphs (graphs render slightly
    differently). Firefox is just 1px off (using em units for positioning now).
  * When printing, IE11 respects the pagebreak hints. Chrome and FF are bugged!
* [Known bug]: pattern_longest_match does not give the longest match if a
  shorter match is possible and found earlier at the same position.
* Add tool that takes a configuration file and runs a pipeline of other tools.
* Add "allow_N" flag to [no_repeat] markers. If the flag is specified, the
  reference sequence may contain Ns. People might need this for the rCRS mtDNA
  reference sequence.
* Adjust BGEstimate so that it computes forward and reverse in one go. To do
  this, double the number of columns in P and C and put the forward profile in
  the left half and the reverse profile in the right half. The benefit of this
  is that this ensures the same A is used for both strands, that is, the
  estimated allele balance is the same.
* Adjust BGEstimate so that it takes strand bias in the allele itself into
  account as well.
* Perhaps there should be a version of BGEstimate that makes a profile for each
  genotype instead of each allele. This allows for the detection of hybrids.
* Add tool to summarise various statistics about the entire analysis pipeline:
  (TODO: Write this list)

Open Vega issues:
* Lookup transform only takes simple field names for the onKey parameter.
  https://github.com/vega/vega/issues/526
* Sorting is broken.
  https://github.com/vega/vega/issues/509
* Feature request: Id-based refs for Force transform's source and target.
  https://github.com/vega/vega/issues/471
* Tick labels of log scale axes don't respect number formatting.
  https://github.com/vega/vega/issues/470
* Legend and axis corruption on signal changes.
  https://github.com/vega/vega/issues/446
* Feature request for lines (or arbitrary shapes) for legend items.
  https://github.com/vega/vega/issues/408
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Inclusion of tools developed during the project into FDSTools:
PRIO    OLD TOOL NAME       DESCRIPTION
------- ------------------- ---------------------------------------------------
LOW     distance-plot       Check stability of profiles by subsampling
LOW     qq-plot             Draw Q-Q plot vs normal/lognormal distributions
LOW     stuttercheck        Sort-of stuttermodel using profiles as input
LOW     substitutioncheck   Sort-of substitutionmodel using profiles as input
DONE    gen-allele-names    Convert TSSV-style sequences to allelenames
DONE    stuttermark         Mark stutter products in sample
DONE    analyze-background  BGEstimate: generate background profiles
DONE    profilemark         BGCorrect: find and correct for noise in samples
DONE    gen-bg-profiles     Compute statistics on noise in homozygous samples
DONE    blame               Find dirty samples in the reference database
DONE    allelenames-update  Convert allele names from one library to another
DONE    polyfit-repeat-len  Stuttermodel: predict stutter from sequence
DONE    common-background   Compute noise ratios in homozygous samples
DROP    alleles-convert     Convert allele names using lookup table
DROP    block-dedup         Remove duplicate blocks in TSSV-style sequences
DROP    graphgen            Create bar graph from a sample's data
DROP    find-true-alleles   Check whether the true alleles are detected
DROP    allele-graph        Create a graph of allele co-occurrence
DROP    ambiguity           Find potentially ambiguous allele combinations
DROP    strandbias          Find strand bias in the data
DROP    annotate-alleles    Annotate true alleles in sample based on allelelist

Visualisations:
LOW     blame               Common alleles
LOW     qqplot              Q-Q plot of normal/lognormal distribution
LOW     stability           Profile distance vs amount of subsampling
DONE    samplevis           Sample data
DONE    profiles            Background profiles
DONE    bg                  Dotplots of noise ratios in homozygous samples
DONE    trends              Fit repeat length vs stutter amount
DONE    allelegraph         Homozygosity/heterozygosity
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Argument group order*:
input file options
  bgcorrect,samplestats,seqconvert,stuttermark
output file options
  allelefinder,bgcorrect,bgestimate,bghomraw,bghomstats,bgmerge,blame,
  samplestats,seqconvert,stuttermark,stuttermodel,tssv
sample tag parsing options
  allelefinder,bgcorrect,bgestimate,bghomraw,bghomstats,blame,samplestats,
  seqconvert,stuttermark,stuttermodel
allele detection options
  bgestimate,bghomraw,bghomstats,blame,stuttermodel
interpretation options
  samplestats
filtering options
  allelefinder,bgcorrect,bgestimate,bghomraw,bghomstats,bgpredict,blame,
  samplestats,stuttermark,stuttermodel
sequence format options
  allelefinder,bgcorrect,bgestimate,bghomraw,bghomstats,bgmerge,bgpredict,
  blame,stuttermark,stuttermodel,tssv
random subsampling options
  bgestimate,bghomstats,stuttermodel
visualisation options
  vis

*tssv has sequence format options before output file options

Input/output of tools:
              INPUT                        OUTPUT
allelefinder  list of sample files         *single output + report
bgcorrect     bg file + single sample (b)  single sample (batches supported)
bgestimate    list of sample files         single output + report
bghomraw      list of sample files         single output
bghomstats    list of sample files         single output
bgmerge^      list of bg files             single output
bgpredict^    model + seqfile              single output
blame         bg file + list of samples    single output
libconvert^   single input library         single output
samplestats   single sample file (batch)   single sample (batches supported)
seqconvert    single sample file (batch)   single sample (batches supported)
stuttermark   single sample file (batch)   single sample (batches supported)
stuttermodel  list of sample files         single output
tssv          single sample FQ/FA          single output + report
vis^          single sample (optional)     single output

*TODO: add option to change single output to multi-out (batch_process=True)
^does not use add_args functions (bgmodel/predict use sequence_format_args)

Input/output conventions:
* Write single
  output to sys.stdout by default, allow changing by -o
* Write report (if applicable) to sys.stderr by default, allow changing by -R
* For multi-in to single-out: list input files as positionals
* For single-in: provide [IN], [OUT] positionals (defaulting IN to sys.stdin)
* Single-in batch support via -i/-o; mutex with positionals!
  (if len(-o) == len(-i), map infile->outfile;
   if len(-o) == 1, map sample->outfile by rewriting tag)
* Multi-in to multi-out: allow multiple values in -o option (currently unused)

Reserved option letters:
-h  Help (used globally)
-v  Version (used globally)
-d  Debug (used globally)
-i  Batch input files
-o  Output files
-R  Report file
-F  Target sequence format
-l  Library
-e  Sample tag extraction pattern
-f  Sample tag format
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Action for special values in sequence columns:
TOOL          'No data'                'Other sequences'
allelefinder  Not suitable for marker  Treated as single noise sequence
bgcorrect     Transparent              Transparent
bgestimate    Ignored                  Ignored
bghomraw      Ignored                  Ignored
bghomstats    Ignored                  Ignored
bgmerge       Transparent              Transparent
bgpredict     Ignored                  Ignored
blame         Ignored                  Ignored
samplestats   Transparent              Transparent; may remove or add more
seqconvert    Transparent              Transparent
stuttermark   Marked as 'UNKNOWN'      Marked as 'UNKNOWN'
stuttermodel  Ignored                  Ignored
tssv          May create them          May create them
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sequence format conversions:
From  To
raw   tssv  OK - Transparent to non-STR markers (have no regex_middle).
raw   name  OK - Uses raw->tssv as a first step.
tssv  raw   OK - Trivial case.
tssv  name  (Implemented as tssv->raw->name)
name  raw   (Implemented as name->tssv->raw)
name  tssv  OK
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Find TODOs and FIXMEs in the code:
grep "TODO\|FIXME" *.py */*.py */*/*.py

Number of lines, excluding empty lines and comments:
grep -v "^\s*\(#.*\)\?$" *.py */*.py */*/*.py | wc -l
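
Sketch of the -i/-o batch mapping rule from "Input/output conventions":
The rule "if len(-o) == len(-i), map infile->outfile; if len(-o) == 1, map
sample->outfile by rewriting tag" might look roughly like the function below.
This is only an illustration, not the code used by the tools. The function
name, the tag_of callable (standing in for whatever -e/-f produce) and the
"{tag}" placeholder in the output pattern are all assumptions made for this
sketch.

```python
def map_batch_outputs(infiles, outfiles, tag_of):
    """Pair each -i input file with an output file name.

    infiles:  values given to -i
    outfiles: values given to -o
    tag_of:   hypothetical callable that returns the sample tag for an
              input file name (stand-in for the -e/-f options)
    """
    if len(outfiles) == len(infiles):
        # Same number of -o and -i values: map infile -> outfile by position.
        return list(zip(infiles, outfiles))
    if len(outfiles) == 1:
        # Single -o value: rewrite it once per sample using the sample tag.
        # The "{tag}" placeholder is an assumption made for this sketch.
        pattern = outfiles[0]
        return [(infile, pattern.format(tag=tag_of(infile)))
                for infile in infiles]
    raise ValueError("-o needs one value or as many values as -i")


def tag_from_basename(path):
    """Hypothetical tag extractor: file name without directory or extension."""
    name = path.replace("\\", "/").rsplit("/", 1)[-1]
    return name.rsplit(".", 1)[0]


pairs = map_batch_outputs(["in/a.csv", "in/b.csv"], ["{tag}-out.csv"],
                          tag_from_basename)
```

Raising an error for any other combination of lengths keeps the two supported
cases explicit; whether the real tools should reject or tolerate other
combinations is left open here.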