To-do:
* Group tools by function in the command line help and put Pipeline on top.
* Samplevis:
  * Detect whether correction was performed; hide related columns if not.
  * A PctOfAlleles column in the tables would be useful for mixtures.
  * Option to choose complete table download (all columns, not all rows).
  * Option to freely adjust the sorting (currently CE length toggle only).
  * Some of the media query breakpoints overlap, fix this.
  * Make Samplevis more responsive when rendering/updating graphs. To do this,
    the work should be broken up into chunks and each chunk should set off the
    next chunk through "window.setTimeout(nextChunkFunction);". The page will
    be repainted between each chunk. One major issue with this is that user
    input events may get scheduled between the chunks.
  * Allow table filtering options to be specified for each marker separately.
* Pipeline:
  * Add raw sequence output to ref-sample and case-sample analyses.
* Samplestats:
  * Verify that Samplestats never treats "Other sequences" as the highest.
  * Add capability to run Samplestats again on its own output.
  * Add percentage-of-called-alleles columns.
* BGAnalyse:
  * Add columns containing the sequence of the highest/lowest noise and the
    sequence with the highest percentage recovery in every sample and marker.
* Add option for ignoring strands, operating on the total read counts instead.
* Add options for exporting data in CODIS format (and possibly others?).
* Add grouping, show/hide options, and target coverage for BGAnalyseVis to the
  Vis tool.
* Add r2 filter and stutter amount for Stuttermodelvis to the Vis tool.
* Add per-marker allele calling settings to Samplestats.
* Exceptions to general mtDNA nomenclature: http://empop.online/methods
* Reduce noise profile memory usage:
  * Use sparse matrices in BGEstimate and BGCorrect.  May save over 90% of
    memory for the profile matrix after BGMerge of BGEstimate and BGPredict.
* Add 'BGDiff' tool to compare noise profiles.
* Add section to the library file where genomic positions of known pathogenic
  variants are specified. The TSSV tool should always output the reference base
  at these positions to comply with ethical regulations.
* Add options to Samplevis, Samplestats (and possibly other relevant tools) to
  filter alleles by sequence length.  The TSSV tool already supports this.
* Add visualisation with all markers in one graph ("samplesummaryvis"?).
* Add tool to analyse within-marker and between-marker coverage variation.
* Allow loading multiple files into HTML visualisations and provide prev/next
  buttons to browse them.
* Samplevis HTML visualisations in IE11:
  * Printing striped table rows does not seem to work, though this might be an
    NFI-specific issue.
  * Tables are not perfectly aligned with the graphs (graphs render slightly
    differently). Firefox is just 1px off (using em units for positioning now).
* [Known bug]: pattern_longest_match does not give the longest match if a
  shorter match is possible and found earlier at the same position.
* Adjust BGEstimate so that it computes forward and reverse in one go.  To do
  this, double the number of columns in P and C and put the forward profile in
  the left half and the reverse profile in the right half.  This ensures that
  the same A is used for both strands, that is, the estimated allele balance
  is the same.
* Idea to make Stuttermodel for heterozygotes: compute a fit to the (weighted)
  profiles from BGEstimate.
* Adjust BGEstimate so that it takes strand bias in the allele itself into
  account as well.
* Perhaps there should be a version of BGEstimate that makes a profile for each
  genotype instead of each allele.  This allows for the detection of hybrids.
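
The known bug in pattern_longest_match listed above can be illustrated with a
small sketch. This is not the FDSTools implementation; both function names
below are hypothetical. It assumes the bug comes from accepting the first match
the regex engine reports at each position, which for an alternation such as
"AG|AGAT" is the shorter alternative. The second function shows one possible
(slow, quadratic) workaround that tries every end position explicitly.

```python
import re


def first_match_at_each_position(pattern, subject):
    """Keep the longest of the matches the engine reports first per start.

    Hypothetical stand-in for the buggy behaviour: only one match is
    tried at each start position.
    """
    best = None
    for start in range(len(subject)):
        m = pattern.match(subject, start)
        if m and (best is None or len(m.group(0)) > len(best)):
            best = m.group(0)
    return best


def exhaustive_longest_match(pattern, subject):
    """Try every (start, end) pair, longest first, using fullmatch."""
    best = ""
    n = len(subject)
    for start in range(n):
        # Only ends that would beat the current best are worth trying.
        for end in range(n, start + len(best), -1):
            if pattern.fullmatch(subject, start, end):
                best = subject[start:end]
                break
    return best or None


pattern = re.compile("AG|AGAT")
print(first_match_at_each_position(pattern, "TTAGATCC"))
print(exhaustive_longest_match(pattern, "TTAGATCC"))
```

The exhaustive variant makes many more regex calls, so it is only meant to
make the difference visible, not as a drop-in fix.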

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Inclusion of tools developed during the project into FDSTools:
PRIO    OLD TOOL NAME       DESCRIPTION
------- ------------------- ---------------------------------------------------
LOW     distance-plot       Check stability of profiles by subsampling
LOW     qq-plot             Draw Q-Q plot vs normal/lognormal distributions
LOW     stuttercheck        Sort-of stuttermodel using profiles as input
LOW     substitutioncheck   Sort-of substitutionmodel using profiles as input
DONE    gen-allele-names    Convert TSSV-style sequences to allelenames
DONE    stuttermark         Mark stutter products in sample
DONE    analyze-background  BGEstimate: generate background profiles
DONE    profilemark         BGCorrect: find and correct for noise in samples
DONE    gen-bg-profiles     Compute statistics on noise in homozygous samples
DONE    allelenames-update  Convert allele names from one library to another
DONE    polyfit-repeat-len  Stuttermodel: predict stutter from sequence
DONE    common-background   Compute noise ratios in homozygous samples
DROP    blame               Find dirty samples in the reference database
DROP    alleles-convert     Convert allele names using lookup table
DROP    block-dedup         Remove duplicate blocks in TSSV-style sequences
DROP    graphgen            Create bar graph from a sample's data
DROP    find-true-alleles   Check whether the true alleles are detected
DROP    allele-graph        Create a graph of allele co-occurrence
DROP    ambiguity           Find potentially ambiguous allele combinations
DROP    strandbias          Find strand bias in the data
DROP    annotate-alleles    Annotate true alleles in sample based on allelelist

Visualisations:
LOW     qqplot              Q-Q plot of normal/lognormal distribution
LOW     stability           Profile distance vs amount of subsampling
DONE    samplevis           Sample data
DONE    profiles            Background profiles
DONE    bg                  Dotplots of noise ratios in homozygous samples
DONE    trends              Fit repeat length vs stutter amount
DONE    allelegraph         Homozygosity/heterozygosity
DROP    blame               Common alleles

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Argument group order*:
input file options          bgcorrect,findnewalleles,samplestats,seqconvert,stuttermark
output file options         allelefinder,bganalyse,bgcorrect,bgestimate,bghomraw,bghomstats,bgmerge,findnewalleles,samplestats,seqconvert,stuttermark,stuttermodel,tssv
sample tag parsing options  allelefinder,bganalyse,bgcorrect,bgestimate,bghomraw,bghomstats,findnewalleles,pipeline,samplestats,seqconvert,stuttermark,stuttermodel
allele detection options    bganalyse,bgestimate,bghomraw,bghomstats,stuttermodel
interpretation options      samplestats
filtering options           allelefinder,bgcorrect,bgestimate,bghomraw,bghomstats,bgpredict,findnewalleles,samplestats,stuttermark,stuttermodel,tssv
sequence format options     allelefinder,bganalyse,bgcorrect,bgestimate,bghomraw,bghomstats,bgmerge,bgpredict,findnewalleles,stuttermark,stuttermodel,tssv
random subsampling options  bgestimate,bghomstats,stuttermodel
visualisation options       vis
*tssv has sequence format options before output file options

Input/output of tools:      INPUT           OUTPUT
allelefinder    list of sample files        *single output + report
bganalyse       list of sample files        single output
bgcorrect       bg file + single sample (b) single sample (batches supported)
bgestimate      list of sample files        single output + report
bghomraw        list of sample files        single output
bghomstats      list of sample files        single output
bgmerge^        list of bg files            single output
bgpredict^      model + seqfile             single output
findnewalleles  seqfile + single sample (b) single sample (batches supported)
libconvert^     single input library        single output
library^        (none)                      single output
pipeline^       single ini file             single ini file
samplestats     single sample file (batch)  single sample (batches supported)
seqconvert      single sample file (batch)  single sample (batches supported)
stuttermark     single sample file (batch)  single sample (batches supported)
stuttermodel    list of sample files        single output
tssv            single sample FQ/FA         single output + report
vis^            single sample (optional)    single output
*TODO: add option to change single output to multi-out (batch_process=True)
^does not use add_args functions (bgmodel/predict use sequence_format_args)

Input/output conventions:
* Write single output to sys.stdout by default, allow changing by -o
* Write report (if applicable) to sys.stderr by default, allow changing by -R
* For multi-in to single-out: list input files as positionals
* For single-in: provide [IN], [OUT] positionals (defaulting IN to sys.stdin)
* Single-in batch support via -i/-o; mutex with positionals!
  (if len(-o) == len(-i), map infile->outfile; if len(-o) == 1,
   map sample->outfile by rewriting tag)
* Multi-in to multi-out: allow multiple values in -o option (currently unused)
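
The batch mapping rule for -i/-o described above could look roughly like the
sketch below. The function names, the default tag rule (file name without
directory and extension) and the "\1" placeholder in the output name are all
assumptions made for illustration; they are not taken from the FDSTools code.

```python
import os


def default_tag(path):
    """Hypothetical tag rule: file name without directory and extension."""
    return os.path.splitext(os.path.basename(path))[0]


def map_batch_files(infiles, outfiles, tag_of=default_tag, placeholder=r"\1"):
    """Pair every input file with an output file name.

    * len(outfiles) == len(infiles): pair them up in order.
    * len(outfiles) == 1: treat the single value as a template and
      rewrite the sample tag into it for every input file.
    """
    if len(outfiles) == len(infiles):
        return list(zip(infiles, outfiles))
    if len(outfiles) == 1:
        template = outfiles[0]
        return [(infile, template.replace(placeholder, tag_of(infile)))
                for infile in infiles]
    raise ValueError("need one output file, or one per input file")


print(map_batch_files(["data/s1.csv", "data/s2.csv"], [r"out/\1.txt"]))
```

Checking the equal-length case first means a single input with a single output
is mapped directly, without any tag rewriting.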

Reserved option letters:
-h  Help (used globally)
-v  Version (used globally)
-d  Debug (used globally)
-i  Batch input files
-o  Output files
-R  Report file
-F  Target sequence format
-l  Library
-e  Sample tag extraction pattern
-f  Sample tag format
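
As a rough illustration of how a single-in tool might wire up the reserved
letters -i, -o and -R together with the [IN], [OUT] positionals, here is a
minimal argparse sketch. The program name, the long option names and the use
of "-" to stand for the standard streams are assumptions, not FDSTools
behaviour.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="exampletool")
    # Single-in mode: optional positionals; "-" stands for stdin/stdout here.
    parser.add_argument("infile", nargs="?", default="-", metavar="IN")
    parser.add_argument("outfile", nargs="?", default="-", metavar="OUT")
    # Batch mode: one or more files per option.
    parser.add_argument("-i", "--input", nargs="+", default=[], metavar="IN")
    parser.add_argument("-o", "--output", nargs="+", default=[], metavar="OUT")
    # Report destination; "-" stands for stderr here.
    parser.add_argument("-R", "--report", default="-", metavar="FILE")
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    positionals_used = args.infile != "-" or args.outfile != "-"
    if positionals_used and (args.input or args.output):
        # Batch options and positionals are mutually exclusive.
        parser.error("positional IN/OUT cannot be combined with -i/-o")
    return args


print(parse_args(["-i", "a.csv", "b.csv", "-o", "x.csv", "y.csv"]))
```

The mutual exclusion is checked by hand after parsing because argparse's own
mutually exclusive groups do not cover this mix of positionals and options
well.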

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Action for special values in sequence columns:
TOOL            'No data'                   'Other sequences'
allelefinder    Not suitable for marker     Ignored (Not suitable for marker)
bganalyse       Ignored                     Ignored
bgcorrect       Transparent                 Transparent
bgestimate      Ignored                     Ignored
bghomraw        Ignored                     Ignored
bghomstats      Ignored                     Ignored
bgmerge         Transparent                 Transparent
bgpredict       Ignored                     Ignored
findnewalleles  Marked as 'new'             Marked as 'new'
samplestats     Transparent                 Transparent; may remove or add more
seqconvert      Transparent                 Transparent
stuttermark     Marked as 'UNKNOWN'         Marked as 'UNKNOWN'
stuttermodel    Ignored                     Ignored
tssv            May create them             May create them
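
To make the difference between 'Ignored' and 'Transparent' in the table above
concrete, here is a small sketch. It assumes rows are dicts with a "sequence"
key and reads 'Transparent' as "passed through unchanged" and 'Ignored' as
"left out of the analysis"; the function name and row layout are hypothetical.

```python
SPECIAL_VALUES = {"No data", "Other sequences"}


def process_rows(rows, transform, mode="transparent"):
    """Apply transform to ordinary rows; handle special rows by mode.

    mode="transparent": special rows are yielded unchanged.
    mode="ignored":     special rows are dropped.
    """
    for row in rows:
        if row["sequence"] in SPECIAL_VALUES:
            if mode == "transparent":
                yield row
            continue
        yield transform(row)


rows = [
    {"sequence": "AGATAGAT", "reads": 10},
    {"sequence": "Other sequences", "reads": 3},
]
double = lambda row: dict(row, reads=row["reads"] * 2)
print(list(process_rows(rows, double, mode="ignored")))
```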

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sequence format conversions:
From    To
raw     tssv    OK - Transparent to non-STR markers (have no regex_middle).
raw     name    OK - Uses raw->tssv as a first step.
tssv    raw     OK - Trivial case.
tssv    name    (Implemented as tssv->raw->name)
name    raw     (Implemented as name->tssv->raw)
name    tssv    OK
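
The routing in the table above (two conversions go through an intermediate
format) can be sketched as a small lookup. The trivial tssv->raw case is
included as well, assuming TSSV-style sequences look like "AGAT(3)CT(1)", that
is, a repeat unit followed by its count in parentheses. All names below are
hypothetical and the other converters are left out.

```python
import re

# Conversions implemented directly, and those routed via an intermediate.
DIRECT = {("raw", "tssv"), ("raw", "name"), ("tssv", "raw"), ("name", "tssv")}
VIA = {("tssv", "name"): "raw", ("name", "raw"): "tssv"}


def conversion_steps(src, dst):
    """Return the list of direct conversions needed to get from src to dst."""
    if src == dst:
        return []
    if (src, dst) in DIRECT:
        return [(src, dst)]
    mid = VIA[(src, dst)]
    return [(src, mid), (mid, dst)]


def tssv_to_raw(seq):
    """Expand every 'UNIT(count)' block into count copies of UNIT."""
    return re.sub(r"([ACGTN]+)\((\d+)\)",
                  lambda m: m.group(1) * int(m.group(2)), seq)


print(conversion_steps("name", "raw"))
print(tssv_to_raw("AGAT(3)CT(1)"))
```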

TOOL            analysis    default output  output option
bganalyse       (as output) raw             yes
bgcorrect       raw         (as input)      yes
bgestimate      (as output) raw             no
bghomraw        (as output) raw             yes
bghomstats      (as output) (as input)      yes
bgmerge         raw         raw             no*
bgpredict       raw         raw             no*
findnewalleles  raw         (as input)      no
samplestats     (as input)  (as input)      no
stuttermark     tssv        tssv            no
stuttermodel    raw         not applicable  not applicable
tssv            raw         raw             yes
*easily changed, but is it appropriate to do so?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tool and visualisation version numbering, e.g., v1.2.3:
1   major version, changes only with large, disruptive, fundamental changes
2   minor version, changes when the default output is altered, or when it is
    otherwise likely that user pipelines will break when updating
3   patch version, changes with any other changes (bug fixes, new optional
    features, etc.)
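
The numbering rule above can be written down as a tiny helper. The change
labels and the function name are made up for this sketch, and resetting the
lower components to zero on a major or minor bump is an assumption (a common
convention), not something stated in these notes.

```python
def bump_version(version, change):
    """Return the next version string for a given kind of change.

    change: "fundamental"  -> major (large, disruptive changes)
            "output"       -> minor (default output altered, or user
                              pipelines otherwise likely to break)
            anything else  -> patch (bug fixes, new optional features)
    """
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    if change == "fundamental":
        return "v%d.0.0" % (major + 1)
    if change == "output":
        return "v%d.%d.0" % (major, minor + 1)
    return "v%d.%d.%d" % (major, minor, patch + 1)


print(bump_version("v1.2.3", "output"))
```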

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Find TODOs and FIXMEs in the code:
grep "TODO\|FIXME" *.py */*.py */*/*.py

Number of lines of Python, excluding empty lines and comments:
grep -v "^\s*\(#.*\)\?$" *.py */*.py */*/*.py | wc -l

Number of lines of JSON, excluding empty lines and those with only a brace:
grep -v "^\s*[{}]\?\[\?\]\?,\?$"  */*/*/*.json | wc -l