Commit be745e64 authored by jhoogenboom

Introducing bgestimate

I could write about all of its features here, but instead I will point
out some future plans, to highlight the parts that may not be optimal
in their current implementation.

There are a number of things I plan to change in the future:
* The output format is currently JSON; perhaps a carefully designed
  tabular format is a better choice. The benefit of switching to a
  tabular format is that the data can be loaded into e.g. Excel as
  well.
* The profiles are currently produced separately for forward and
  reverse reads. I would prefer to integrate these into a single
  computation that estimates allele balance in the heterozygotes
  using both strands as well.
* I would like to add information about strand bias of the alleles
  as well. The most straightforward way to do this is to set only
  the forward reads of the true allele to 100 and treat the reverse
  reads the same as all background products. You will then obtain
  the number of reverse reads observed for every 100 forward reads
  of the true allele.
* I think it would be appropriate to make sure that the values in
  the allele balance matrices of each sample ('Ax' in the source
  code) add up to 1. For homozygotes, this is currently a scalar 1,
  but for heterozygotes the sum of the elements tends to be more
  than 1. This means that a heterozygous sample has a stronger
  influence on the profiles than a homozygous sample (see the
  sketch after this list).
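
As a rough illustration of that last point, here is a minimal sketch of
rescaling a per-sample allele balance matrix so that its values add up
to 1. The matrix contents are made up and the layout of the real 'Ax'
matrices may differ; this code is not part of the commit.

    import numpy as np

    # Hypothetical per-sample allele balance matrix: one row per true
    # allele, one column per observed sequence. The real 'Ax' layout
    # may differ; this only illustrates the proposed normalisation.
    Ax = np.array([[100.0,  3.5, 1.2],
                   [  2.0, 95.0, 0.8]])

    # Rescale so that all values add up to 1, giving heterozygous and
    # homozygous samples the same total weight in profile estimation.
    Ax /= Ax.sum()
    assert abs(Ax.sum() - 1.0) < 1e-9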
parent f9543ed9
......@@ -15,5 +15,5 @@ def version(name, toolname=None, toolversion=None):
return verformat % (name, __version__)
if toolversion is None:
return toolverformat % (toolname, verformat % (name, __version__))
return toolverformat % (verformat % (toolname,toolversion),
return toolverformat % (verformat % (toolname, toolversion),
verformat % (name, __version__))
#!/usr/bin/env python
import re, sys, argparse
#import numpy as np # Imported only when calling nnls()
from ConfigParser import RawConfigParser, MissingSectionHeaderError
from StringIO import StringIO
......@@ -582,6 +583,51 @@ def ensure_sequence_format(seq, to_format, from_format=None, library=None,
#ensure_sequence_format
def nnls(A, C, B=None, max_iter=200, min_change=0.0001, debug=False):
    """
    Solve for B in A * B = C in the least squares sense, s.t. B >= 0.

    Hint: call nnls(B.T, C.T).T to solve for A.

    Algorithm has converged if the sum of squared error has decreased
    by less than a factor of min_change in one iteration. If debug is
    True, print the sum of squared error to sys.stdout after each
    iteration.

    This code was partially adopted from nimfa.methods.factorization.bd.
    """
    import numpy as np
    if B is None:
        B = np.matrix(np.zeros([A.shape[1], C.shape[1]]))
    E = A.T * A
    F = A.T * C
    prev_score = cur_score = sys.float_info.max
    for i in range(max_iter):
        for n in range(B.shape[0]):
            nn = list(range(n)) + list(range(n + 1, B.shape[0]))
            tmp = (F[n, :] - E[n, nn] * B[nn, :]) / E[n, n]
            tmp[np.isnan(tmp)] = 0
            tmp[tmp < 0] = 0
            B[n, :] = tmp
        prev_score = cur_score
        cur_score = np.square(C - A * B).sum()
        score_change = (prev_score - cur_score) / prev_score
        if debug:
            if i:
                print("%4i %15.6f %15.6f %6.2f" % (i, cur_score,
                    prev_score - cur_score, 100 * score_change))
            else:
                print("%4i %15.6f" % (i, cur_score))
        if not cur_score or score_change < min_change:
            # We have converged.
            break
    return B
#nnls
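
For reference, a minimal usage sketch of this helper; the matrices
below are made up for illustration and do not come from the tool's
data:

import numpy as np

# Recover a known non-negative B from A and C = A * B.
A = np.matrix([[1.0, 0.0],
               [0.0, 2.0],
               [1.0, 1.0]])
B_true = np.matrix([[0.5], [1.5]])
C = A * B_true

B = nnls(A, C)
print(B)  # Expected to be close to B_true, with all entries >= 0.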
def get_column_ids(column_names, *names):
"""Find all names in column_names and return their indices."""
result = []
......