To-do:
* Samplevis:
  * Option to choose complete table download (all columns, not all rows).
  * Option to freely adjust the sorting (currently CE length toggle only).
  * When we have them, add default values to table filtering (for reference).
  * Some of the media query breakpoints overlap, fix this.
  * Perhaps it is desirable to be able to request a list of 'Other sequences'.
  * Make Samplevis more responsive when rendering/updating graphs. To do this,
    the work should be broken up into chunks and each chunk should set off the
    next chunk through "window.setTimeout(nextChunkFunction);". The page will
    be repainted between each chunk. One major issue with this is that user
    input events may get scheduled between the chunks.
  * Add 'Save page' button that also saves the alleles clicked by the user.
  * Add options to set the Table filtering options in the Vis tool.
* Additions needed for publication:
  * Check whether there is a difference between filtering short artefacts in
    TSSV vs having BGEstimate/BGCorrect filter them.
    * vWA GATGGAT
    * D2S441 GGCGCCCCAGCATTCTAACAAGGAATGTGGGTGCTGGGAGCCAGGAACCTGGACAAAAACCAAAACGCATATCC
    * D2S441 AGCACCCCAGCATTCTAACAAGCAATGTAGGCATTGGGAGCCAGGAGCCTGGACAAAATCCAATACGCATATCC
  * [If not too difficult to implement] BGEstimate should start with homozygous
    samples and add heterozygous samples later to optimise correction.
  * Summary statistics for BGEstimate based on the top N genotypes per marker:
    the highest percentage of remaining background in the reference database
    (maybe also an additional value for a confidence interval).
  * Visualisation to display highest remaining background (positive and
    negative) in known samples after BGCorrect analysis.
* Add options to Libconvert to generate a template for STR or non-STR markers.
* Add options to Samplevis, Samplestats (and possibly other relevant tools) to
  filter alleles by sequence length. The TSSV tool already supports this.
* Add plotting of raw data points to StuttermodelVis.
* Add a print stylesheet for the other visualisations (only Samplevis has one).
* Add visualisation with all markers in one graph ("samplesummaryvis"?).
* Allow loading multiple files into HTML visualisations and provide prev/next
  buttons to browse them.
* Samplevis HTML visualisations in IE11:
  * Printing striped table rows does not seem to work, though this might be an
    NFI-specific issue.
  * Tables are not perfectly aligned with the graphs (graphs render slightly
    differently). Firefox is just 1px off (using em units for positioning now).
  * When printing, IE11 respects the pagebreak hints. Chrome and FF are bugged!
* [Known bug]: pattern_longest_match does not give the longest match if a
  shorter match is possible and found earlier at the same position.
* Add tool that takes a configuration file and runs a pipeline of other tools.
* Add "allow_N" flag to [no_repeat] markers.  If the flag is specified, the
  reference sequence may contain Ns.  People might need this for the rCRS mtDNA
  reference sequence.
* Adjust BGEstimate so that it computes forward and reverse in one go.  To do
  this, double the number of columns in P and C and put the forward profile in
  the left half and the reverse profile in the right half.  The benefit of this
  is that this ensures the same A is used for both strands, that is, the
  estimated allele balance is the same.
* Adjust BGEstimate so that it takes strand bias in the allele itself into
  account as well.
* Perhaps there should be a version of BGEstimate that makes a profile for each
  genotype instead of each allele.  This allows for the detection of hybrids.
* Add tool to summarise various statistics about the entire analysis pipeline:
  (TODO: Write this list)
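
The known bug in pattern_longest_match listed above is consistent with how
backtracking regex engines handle alternation: the first alternative that lets
the overall match succeed wins, even if a later alternative would have matched
more. The sketch below only illustrates that behaviour with Python's re module.
It is not the FDSTools implementation; the helper names and the example repeat
units are made up for illustration, and the alternatives are treated as literal
strings.

```python
import re


def first_match_at(alternatives, text, pos=0):
    """Match a repeated alternation the way a backtracking engine does."""
    pattern = re.compile("(?:%s)+" % "|".join(re.escape(a) for a in alternatives))
    match = pattern.match(text, pos)
    return match.group(0) if match else ""


def longest_match_at(alternatives, text, pos=0):
    """Explore every way to chain the alternatives and keep the longest."""
    best = 0
    stack = [pos]
    seen = set()
    while stack:
        here = stack.pop()
        if here in seen:
            continue
        seen.add(here)
        best = max(best, here - pos)
        for unit in alternatives:
            # Skip empty units so the search always makes progress.
            if unit and text.startswith(unit, here):
                stack.append(here + len(unit))
    return text[pos:pos + best]


if __name__ == "__main__":
    units = ["AG", "AGAT"]
    print(first_match_at(units, "AGATAGAT"))
    print(longest_match_at(units, "AGATAGAT"))
```

The exhaustive search visits each text position at most once, so it stays
cheap for short sequences; whether it fits the real pattern syntax is an open
question.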

Open Vega issues:
* Bug in aggregate transform w.r.t. signals.
  https://github.com/vega/vega/issues/530
* Lookup transform only takes simple field names for the onKey parameter.
  https://github.com/vega/vega/issues/526
* Sorting needs the Rank transform, but that is not released yet.
  https://github.com/vega/vega/issues/509
* Feature request: Id-based refs for Force transform's source and target.
  https://github.com/vega/vega/issues/471
* Tick labels of log scale axes don't respect number formatting.
  https://github.com/vega/vega/issues/470
* Legend and axis corruption on signal changes.
  https://github.com/vega/vega/issues/446
* Feature request for lines (or arbitrary shapes) for legend items.
  https://github.com/vega/vega/issues/408


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Inclusion of tools developed during the project into FDSTools:
PRIO    OLD TOOL NAME       DESCRIPTION
------- ------------------- ---------------------------------------------------
LOW     distance-plot       Check stability of profiles by subsampling
LOW     qq-plot             Draw Q-Q plot vs normal/lognormal distributions
LOW     stuttercheck        Sort-of stuttermodel using profiles as input
LOW     substitutioncheck   Sort-of substitutionmodel using profiles as input
DONE    gen-allele-names    Convert TSSV-style sequences to allelenames
DONE    stuttermark         Mark stutter products in sample
DONE    analyze-background  BGEstimate: generate background profiles
DONE    profilemark         BGCorrect: find and correct for noise in samples
DONE    gen-bg-profiles     Compute statistics on noise in homozygous samples
DONE    blame               Find dirty samples in the reference database
DONE    allelenames-update  Convert allele names from one library to another
DONE    polyfit-repeat-len  Stuttermodel: predict stutter from sequence
DONE    common-background   Compute noise ratios in homozygous samples
DROP    alleles-convert     Convert allele names using lookup table
DROP    block-dedup         Remove duplicate blocks in TSSV-style sequences
DROP    graphgen            Create bar graph from a sample's data
DROP    find-true-alleles   Check whether the true alleles are detected
DROP    allele-graph        Create a graph of allele co-occurrence
DROP    ambiguity           Find potentially ambiguous allele combinations
DROP    strandbias          Find strand bias in the data
DROP    annotate-alleles    Annotate true alleles in sample based on allelelist

Visualisations:
LOW     blame               Common alleles
LOW     qqplot              Q-Q plot of normal/lognormal distribution
LOW     stability           Profile distance vs amount of subsampling
DONE    samplevis           Sample data
DONE    profiles            Background profiles
DONE    bg                  Dotplots of noise ratios in homozygous samples
DONE    trends              Fit repeat length vs stutter amount
DONE    allelegraph         Homozygosity/heterozygosity

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Argument group order*:
input file options          bgcorrect,samplestats,seqconvert,stuttermark
output file options         allelefinder,bgcorrect,bgestimate,bghomraw,bghomstats,bgmerge,blame,samplestats,seqconvert,stuttermark,stuttermodel,tssv
sample tag parsing options  allelefinder,bgcorrect,bgestimate,bghomraw,bghomstats,blame,samplestats,seqconvert,stuttermark,stuttermodel
allele detection options    bgestimate,bghomraw,bghomstats,blame,stuttermodel
interpretation options      samplestats
filtering options           allelefinder,bgcorrect,bgestimate,bghomraw,bghomstats,bgpredict,blame,samplestats,stuttermark,stuttermodel,tssv
sequence format options     allelefinder,bgcorrect,bgestimate,bghomraw,bghomstats,bgmerge,bgpredict,blame,stuttermark,stuttermodel,tssv
random subsampling options  bgestimate,bghomstats,stuttermodel
visualisation options       vis
*tssv has sequence format options before output file options

Input/output of tools:      INPUT           OUTPUT
allelefinder    list of sample files        *single output + report
bgcorrect       bg file + single sample (b) single sample (batches supported)
bgestimate      list of sample files        single output + report
bghomraw        list of sample files        single output
bghomstats      list of sample files        single output
bgmerge^        list of bg files            single output
bgpredict^      model + seqfile             single output
blame           bg file + list of samples   single output
libconvert^     single input library        single output
samplestats     single sample file (batch)  single sample (batches supported)
seqconvert      single sample file (batch)  single sample (batches supported)
stuttermark     single sample file (batch)  single sample (batches supported)
stuttermodel    list of sample files        single output
tssv            single sample FQ/FA         single output + report
vis^            single sample (optional)    single output
*TODO: add option to change single output to multi-out (batch_process=True)
^does not use add_args functions (bgmodel/predict use sequence_format_args)

Input/output conventions:
* Write single output to sys.stdout by default, allow changing by -o
* Write report (if applicable) to sys.stderr by default, allow changing by -R
* For multi-in to single-out: list input files as positionals
* For single-in: provide [IN], [OUT] positionals (defaulting IN to sys.stdin)
* Single-in batch support via -i/-o; mutex with positionals!
  (if len(-o) == len(-i), map infile->outfile; if len(-o) == 1,
   map sample->outfile by rewriting tag)
* Multi-in to multi-out: allow multiple values in -o option (currently unused)
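
The conventions above for single-in tools can be sketched with argparse. This
is a minimal illustration, not the actual FDSTools argument handling: the
function names build_parser and plan_io, the program name "exampletool" and the
use of "-" to stand for sys.stdin/sys.stdout/sys.stderr are assumptions. The
tag rewriting used when a single -o value serves several inputs is not shown.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="exampletool")
    # Optional positionals for the plain single-in, single-out case.
    parser.add_argument("infile", nargs="?", metavar="IN")
    parser.add_argument("outfile", nargs="?", metavar="OUT")
    # Batch options; these must not be mixed with the positionals.
    parser.add_argument("-i", dest="infiles", nargs="+", metavar="IN")
    parser.add_argument("-o", dest="outfiles", nargs="+", metavar="OUT")
    # Report destination; "-" is taken to mean sys.stderr here.
    parser.add_argument("-R", dest="report", metavar="FILE", default="-")
    return parser


def plan_io(args):
    """Return a list of (input, output) pairs; "-" means stdin/stdout."""
    batch = args.infiles is not None or args.outfiles is not None
    if not batch:
        return [(args.infile or "-", args.outfile or "-")]
    if args.infile is not None or args.outfile is not None:
        raise ValueError("-i/-o cannot be combined with positional IN/OUT")
    infiles = args.infiles or ["-"]
    outfiles = args.outfiles or ["-"]
    if len(outfiles) == len(infiles):
        # One output per input: map infile -> outfile by position.
        return list(zip(infiles, outfiles))
    if len(outfiles) == 1:
        # One output for all inputs; samples would be told apart by
        # rewriting the sample tag, which this sketch leaves out.
        return [(infile, outfiles[0]) for infile in infiles]
    raise ValueError("number of -o values must be 1 or equal to number of -i values")


if __name__ == "__main__":
    parser = build_parser()
    print(plan_io(parser.parse_args(["-i", "a.txt", "b.txt", "-o", "x.txt", "y.txt"])))
```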

Reserved option letters:
-h  Help (used globally)
-v  Version (used globally)
-d  Debug (used globally)
-i  Batch input files
-o  Output files
-R  Report file
-F  Target sequence format
-l  Library
-e  Sample tag extraction pattern
-f  Sample tag format

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Action for special values in sequence columns:
TOOL            'No data'                   'Other sequences'
allelefinder    Not suitable for marker     Treated as single noise sequence
bgcorrect       Transparent                 Transparent
bgestimate      Ignored                     Ignored
bghomraw        Ignored                     Ignored
bghomstats      Ignored                     Ignored
bgmerge         Transparent                 Transparent
bgpredict       Ignored                     Ignored
blame           Ignored                     Ignored
samplestats     Transparent                 Transparent; may remove or add more
seqconvert      Transparent                 Transparent
stuttermark     Marked as 'UNKNOWN'         Marked as 'UNKNOWN'
stuttermodel    Ignored                     Ignored
tssv            May create them             May create them

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sequence format conversions:
From    To
raw     tssv    OK - Transparent to non-STR markers (have no regex_middle).
raw     name    OK - Uses raw->tssv as a first step.
tssv    raw     OK - Trivial case.
tssv    name    (Implemented as tssv->raw->name)
name    raw     (Implemented as name->tssv->raw)
name    tssv    OK
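
The table above amounts to a small routing scheme: four direct conversions,
with the remaining two composed from them. A possible way to express that is
sketched below. The names ROUTES and convert are hypothetical, and the actual
conversion functions are passed in by the caller rather than implemented here.

```python
# Each route lists the formats visited, in order, for a (source, target) pair.
ROUTES = {
    ("raw", "tssv"): ["raw", "tssv"],
    ("raw", "name"): ["raw", "name"],
    ("tssv", "raw"): ["tssv", "raw"],
    ("tssv", "name"): ["tssv", "raw", "name"],
    ("name", "raw"): ["name", "tssv", "raw"],
    ("name", "tssv"): ["name", "tssv"],
}


def convert(sequence, source, target, steps):
    """Convert sequence by chaining the direct conversions in steps.

    steps maps (from_format, to_format) to a callable that takes and
    returns a sequence string; only the four direct pairs are needed.
    """
    if source == target:
        return sequence
    route = ROUTES[(source, target)]
    for step in zip(route, route[1:]):
        sequence = steps[step](sequence)
    return sequence
```

With this layout, a change to one direct conversion automatically carries over
to the composed ones.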

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tool and visualisation version numbering, e.g., v1.2.3:
1   major version, changes only with large, disruptive, fundamental changes
2   minor version, changes when the default output is altered, or when it is
    otherwise likely that user pipelines will break when updating
3   patch version, changes with any other changes (bug fixes, new optional
    features, etc.)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Find TODOs and FIXMEs in the code:
grep "TODO\|FIXME" *.py */*.py */*/*.py

Number of lines of Python, excluding empty lines and comments:
grep -v "^\s*\(#.*\)\?$" *.py */*.py */*/*.py | wc -l
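
Where grep is not available, roughly the same count can be obtained in Python.
This is a sketch with hypothetical function names. Unlike the grep command,
which only covers three directory levels, rglob walks the whole tree under
root. Like the grep command, it counts docstring lines as code.

```python
import re
from pathlib import Path

# Same idea as the grep pattern: whitespace only, or whitespace plus a comment.
BLANK_OR_COMMENT = re.compile(r"^\s*(#.*)?$")


def count_code_lines(lines):
    """Count lines that are neither empty nor comment-only."""
    return sum(
        1 for line in lines
        if not BLANK_OR_COMMENT.match(line.rstrip("\r\n"))
    )


def count_python_lines(root="."):
    """Sum count_code_lines over every .py file below root."""
    total = 0
    for path in sorted(Path(root).rglob("*.py")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            total += count_code_lines(handle)
    return total


if __name__ == "__main__":
    print(count_python_lines("."))
```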

Number of lines of JSON, excluding empty lines and those with only a brace:
grep -v "^\s*[{}]\?\[\?\]\?,\?$"  */*/*/*.json | wc -l