To-do:
* Known bug: regex_longest_match does not give the longest match if a shorter
  match is possible and found earlier at the same position.
* Allow loading multiple files into HTML visualisations and provide prev/next
  buttons to browse them.
* Add tool to summarise various statistics about the entire analysis pipeline:
  (TODO: Write this list)
* Add "allow_N" flag to [no_repeat] markers.  If the flag is specified, the
  reference sequence may contain Ns.  People might need this for the rCRS mtDNA
  reference sequence.
* Allow simple marker name filtering in visualisations.
* Samplevis:
  * Add columns to the table of selected alleles that show which tool did the
    filtering (if any) and what data was used to filter with.
  * Add 'percentage of marker reads' column in the table of selected alleles.
  * Add 'show negative alleles' checkbox in Filtering options (for diagnostic
    purposes - but default to TRUE).
  * Respect sorting rules when adding sequences due to lowered filters.
  * Make sure marker and allele names sort the same in Vega and in the tables.
  * Option to save the marker tables to a TSV file.
  * When we have them, add default values to table filtering (for reference).
* Adjust BGEstimate so that it computes forward and reverse in one go.  To do
  this, double the number of columns in P and C and put the forward profile in
  the left half and the reverse profile in the right half.  The benefit is
  that the same A is then used for both strands, that is, the estimated allele
  balance is the same.
* Adjust BGEstimate so that it takes strand bias in the allele itself into
  account as well.
* Perhaps there should be a version of BGEstimate that makes a profile for each
  genotype instead of each allele.  This allows for the detection of hybrids.
* Add plotting of raw data points to StuttermodelVis.
* Add visualisation with all markers in one graph ("samplesummaryvis"?).
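
Illustration for the regex_longest_match bug listed first above.  This is a
minimal sketch and not the actual regex_longest_match code: it only shows one
way this kind of behaviour can arise in Python's re module, where an
alternation stops at the first alternative that matches.  The helper name
longest_match and the example sequences are made up for the illustration.

```python
import re


def longest_match(alternatives, text, pos=0):
    """Try every alternative at pos and return the longest match (or None).

    Hypothetical helper for illustration; not part of FDSTools.
    """
    best = None
    for pattern in alternatives:
        match = re.compile(pattern).match(text, pos)
        if match is not None and (best is None or match.end() > best.end()):
            best = match
    return best


# Ordered alternation: "AG" is listed first and matches, so "AGAT" is never
# tried at this position even though it would give a longer match.
first = re.match(r"AG|AGAT", "AGATAGAT")

# Trying each alternative separately and keeping the longest avoids that.
longest = longest_match(["AG", "AGAT"], "AGATAGAT")

print(first.group(), longest.group())
```

Whether the real bug has this cause has not been checked; the sketch only
shows the general pitfall and one possible way around it.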

Open Vega issues:
* Force layout parameter name: 'drag' or 'active'?
  https://github.com/vega/vega/issues/460
* Legend and axis corruption on signal changes.
  https://github.com/vega/vega/issues/446
* Feature request for lines (or arbitrary shapes) for legend items.
  https://github.com/vega/vega/issues/408


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Inclusion of tools developed during the project into FDSTools:
PRIO    OLD TOOL NAME       DESCRIPTION
------- ------------------- ---------------------------------------------------
LOW     distance-plot       Check stability of profiles by subsampling
LOW     qq-plot             Draw Q-Q plot vs normal/lognormal distributions
LOW     stuttercheck        Sort-of stuttermodel using profiles as input
LOW     substitutioncheck   Sort-of substitutionmodel using profiles as input
DONE    gen-allele-names    Convert TSSV-style sequences to allelenames
DONE    stuttermark         Mark stutter products in sample
DONE    analyze-background  BGEstimate: generate background profiles
DONE    profilemark         BGCorrect: find and correct for noise in samples
DONE    gen-bg-profiles     Compute statistics on noise in homozygous samples
DONE    blame               Find dirty samples in the reference database
DONE    allelenames-update  Convert allele names from one library to another
DONE    polyfit-repeat-len  Stuttermodel: predict stutter from sequence
DONE    common-background   Compute noise ratios in homozygous samples
DROP    alleles-convert     Convert allele names using lookup table
DROP    block-dedup         Remove duplicate blocks in TSSV-style sequences
DROP    graphgen            Create bar graph from a sample's data
DROP    find-true-alleles   Check whether the true alleles are detected
DROP    allele-graph        Create a graph of allele co-occurrence
DROP    ambiguity           Find potentially ambiguous allele combinations
DROP    strandbias          Find strand bias in the data
DROP    annotate-alleles    Annotate true alleles in sample based on allelelist

Visualisations:
LOW     blame               Common alleles
LOW     qqplot              Q-Q plot of normal/lognormal distribution
LOW     stability           Profile distance vs amount of subsampling
DONE    samplevis           Sample data
DONE    profiles            Background profiles
DONE    bg                  Dotplots of noise ratios in homozygous samples
DONE    trends              Fit repeat length vs stutter amount
DONE    allelegraph         Homozygosity/heterozygosity

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Argument group order:
input file options          bgcorrect,samplestats,seqconvert,stuttermark
output file options         allelefinder,bgcorrect,bgestimate,bghomstats,blame,samplestats,seqconvert,stuttermark,stuttermodel
sample tag parsing options  allelefinder,bgcorrect,bgestimate,bghomstats,blame,samplestats,seqconvert,stuttermark,stuttermodel
allele detection options    bgestimate,bghomstats,blame,stuttermodel
interpretation options      samplestats
filtering options           allelefinder,bgcorrect,bgestimate,bghomstats,bgpredict,blame,stuttermark,stuttermodel
sequence format options     allelefinder,bgcorrect,bgestimate,bghomstats,bgpredict,blame,samplestats,stuttermark,stuttermodel
random subsampling options  bgestimate,bghomstats,stuttermodel
visualisation options       vis

Input/output of tools:
TOOL            INPUT                       OUTPUT
allelefinder    list of sample files        *single output + report
bgcorrect       bg file + single sample (b) single sample (batches supported)
bgestimate      list of sample files        single output + report
bghomraw        list of sample files        single output
bghomstats      list of sample files        single output
bgmerge^        list of bg files            single output
bgpredict^      model + seqfile             single output
blame           bg file + list of samples   single output
libconvert^     single input library        single output
samplestats     single sample file (batch)  single sample (batches supported)
seqconvert      single sample file (batch)  single sample (batches supported)
stuttermark     single sample file (batch)  single sample (batches supported)
stuttermodel    list of sample files        single output
vis^            single sample (optional)    single output
*TODO: add option to change single output to multi-out (batch_process=True)
^does not use add_args functions (bgmodel/predict use sequence_format_args)

Input/output conventions:
* Write single output to sys.stdout by default, allow changing by -o
* Write report (if applicable) to sys.stderr by default, allow changing by -R
* For multi-in to single-out: list input files as positionals
* For single-in: provide [IN], [OUT] positionals (defaulting IN to sys.stdin)
* Single-in batch support via -i/-o; mutex with positionals!
  (if len(-o) == len(-i), map infile->outfile; if len(-o) == 1,
   map sample->outfile by rewriting tag)
* Multi-in to multi-out: allow multiple values in -o option (currently unused)

Reserved option letters:
-h  Help (used globally)
-v  Version (used globally)
-d  Debug (used globally)
-i  Batch input files
-o  Output files
-R  Report file
-F  Target sequence format
-l  Library
-e  Sample tag extraction pattern
-f  Sample tag format

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sequence format conversions:
From    To
raw     tssv    OK - Transparent to non-STR markers (have no regex_middle).
raw     name    OK - Uses raw->tssv as a first step.
tssv    raw     OK - Trivial case.
tssv    name    (Implemented as tssv->raw->name)
name    raw     (Implemented as name->tssv->raw)
name    tssv    OK
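
The table above can be read as a small graph: four conversions have their own
implementation and the other two are chains of those.  A minimal sketch of
looking up such a chain follows.  The names DIRECT and conversion_path are
hypothetical, and no real conversion is performed; only the route is found.

```python
from collections import deque

# Conversions marked "OK" in the table, as source -> list of targets.
DIRECT = {
    "raw": ["tssv", "name"],
    "tssv": ["raw"],
    "name": ["tssv"],
}


def conversion_path(source, target):
    """Return the shortest chain of formats leading from source to target."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in DIRECT.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    raise ValueError("no conversion from %s to %s" % (source, target))


print(conversion_path("tssv", "name"))
print(conversion_path("name", "raw"))
```

The breadth-first search finds the same two-step routes that the table lists
for tssv->name and name->raw.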

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Find TODOs and FIXMEs in the code:
grep "TODO\|FIXME" *.py */*.py */*/*.py


Number of lines, excluding empty lines and comments:
grep -v "^\s*\(#.*\)\?$" *.py */*.py */*/*.py | wc -l
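
For reference, a rough Python equivalent of the line count above, as a sketch
only.  The pattern is the same one written in Python syntax; the names SKIP
and count_code_lines are made up for the illustration.

```python
import re

# Matches lines that are empty, whitespace-only or comment-only.
SKIP = re.compile(r"^\s*(#.*)?$")


def count_code_lines(lines):
    """Count the lines that are neither blank nor comment-only."""
    return sum(1 for line in lines if not SKIP.match(line.rstrip("\n")))


sample = [
    "import os\n",
    "\n",
    "  # a comment-only line\n",
    "x = 1  # a trailing comment still counts as code\n",
    "   \n",
]
print(count_code_lines(sample))
```

As with the grep version, docstrings are counted as code.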