[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/vtools/README.html)

vtools
======

A small toolset for working with VCF files. Uses cyvcf2 and Cython under
the hood for speed.

## Installation

### PyPI
vtools is available on PyPI. Because the name `vtools` is already taken by
another package, _this_ vtools is published as `v-tools`:

```bash
pip install v-tools
```

After installation, the command-line tools are still called `vtools-<tool>`,
and the package is still imported under its usual name:

```python
import vtools
```

### Conda

```bash
conda install -c bioconda vtools
```


Tools
-----

### vtools-filter

Filters VCF files based on a few criteria. Outputs both a filtered VCF
file and a VCF file containing all the filtered-out variants.

#### Filter criteria

| name | meaning | optional |
| ---- | ------- | -------- |
| NON_CANONICAL | Non-canonical chromosome | Yes |
| INDEX_UNCALLED | Index uncalled or homozygous reference | Yes |
| TOO_HIGH_GONL_AF | Too high GoNL allele frequency | Yes |
| TOO_HIGH_GNOMAD_AF | Too high gnomAD allele frequency | Yes |
| LOW_GQ | Too low GQ on index sample | Yes |
| DELETED_ALLELE | The only ALT allele is a deleted allele | No |

#### Configuration 

Filters are configured through a small JSON file.
See [cfg/example-filter.json](cfg/example-filter.json) for an example.


#### Usage

```bash
Usage: vtools-filter [OPTIONS]

Options:
  -i, --input PATH                Path to input VCF file  [required]
  -o, --output PATH               Path to output (filtered) VCF file
                                  [required]
  -t, --trash PATH                Path to trash VCF file  [required]
  -p, --params-file PATH          Path to filter params json  [required]
  --index-sample TEXT             Name of index sample  [required]
  --immediate-return / --no-immediate-return
                                  Immediately write filters to file upon
                                  hitting one filter criterium. Default = True
  --help                          Show this message and exit.

```

### vtools-stats

Collects some general statistics about a VCF file and writes them as JSON
to stdout.

#### Usage

```bash
Usage: vtools-stats [OPTIONS]

Options:
  -i, --input FILE  Input VCF file  [required]
  --help            Show this message and exit.
```
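
Because the statistics are written to stdout as JSON, they are easy to pick up from Python. The following is a minimal sketch, not part of vtools itself: the helper name `vtools_stats` and its `runner` parameter are hypothetical, and nothing is assumed about which keys the JSON contains.

```python
import json
import subprocess


def vtools_stats(vcf_path, runner=subprocess.run):
    """Run `vtools-stats -i <vcf_path>` and return the parsed JSON.

    `runner` is a hypothetical hook (defaulting to subprocess.run) so the
    helper can be exercised without vtools being installed.
    """
    result = runner(
        ["vtools-stats", "-i", str(vcf_path)],
        check=True,           # raise if the tool exits with a non-zero status
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode stdout to str
    )
    return json.loads(result.stdout)
```

Calling `vtools_stats("sample.vcf.gz")` (a placeholder file name) would then return whatever JSON structure the tool emits.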

### vtools-gcoverage

Collects coverage metrics over a gVCF file for every exon or every transcript
in a refFlat file. This assumes the input VCF file is at least similar to
GATK's gVCF files. gVCF files are expected to contain only one sample; if
your input file contains multiple samples, only the first one is used.

Output is a simple TSV file with the following columns:

| column | meaning |
| ------ | ------- |
| exon | exon number |
| gene | gene name / symbol / id |
| mean_dp | mean DP value over the exon |
| mean_gq | mean GQ value over the exon* |
| median_dp | median DP value over the exon |
| median_gq | median GQ value over the exon |
| perc_at_least_{10, 20, 30, 50, 100}_dp | Percentage of the exon with a DP value of at least the given threshold |
| perc_at_least_{10, 29, 30, 50, 90}_gq | Percentage of the exon with a GQ value of at least the given threshold |
| transcript | transcript name / symbol / id |

*: the mean GQ value is computed by first converting every GQ value to a
P-value (error probability), then taking the mean of these P-values, and
finally converting that mean back to a phred score.
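
The averaging described in this footnote can be sketched as follows. This is an illustration of the standard phred conversion (`P = 10^(-Q/10)`), not the code vtools uses internally; the function names are made up for this example.

```python
import math


def phred_to_p(q):
    """Convert a phred-scaled quality to an error probability."""
    return 10 ** (-q / 10)


def p_to_phred(p):
    """Convert an error probability back to a phred-scaled quality."""
    return -10 * math.log10(p)


def mean_gq(gq_values):
    """Average GQ values in probability space, then convert back to phred."""
    if not gq_values:
        raise ValueError("need at least one GQ value")
    probabilities = [phred_to_p(q) for q in gq_values]
    return p_to_phred(sum(probabilities) / len(probabilities))


# Low-quality positions dominate: the result for [10, 30] is much closer
# to 10 than to the arithmetic mean of 20.
print(round(mean_gq([10, 30]), 2))
```

Averaging in probability space means a few low-GQ positions pull the mean down much more strongly than a plain arithmetic mean of the phred scores would.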

#### Usage

```bash
Usage: vtools-gcoverage [OPTIONS]

Options:
  -I, --input-gvcf PATH          Path to input VCF file  [required]
  -R, --refflat-file PATH        Path to refFlat file  [required]
  --per-exon / --per-transcript  Collect metrics per exon or per transcript
  --help                         Show this message and exit.
```
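
The TSV output can be post-processed with the standard library. The sketch below assumes the output has been saved to a file whose first line is a header row using the column names from the table above; that layout, the helper name `low_coverage_rows`, and the example values are assumptions made for illustration.

```python
import csv
import io


def low_coverage_rows(handle, min_mean_dp=20.0):
    """Return the rows whose mean_dp falls below min_mean_dp.

    Assumes a tab-separated file with a header row that contains at
    least a `mean_dp` column.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["mean_dp"]) < min_mean_dp]


# Made-up example data standing in for real vtools-gcoverage output.
example = io.StringIO(
    "exon\tgene\tmean_dp\tmean_gq\ttranscript\n"
    "1\tGENE_A\t35.2\t60.1\tTX_1\n"
    "2\tGENE_A\t12.5\t25.0\tTX_1\n"
)
for row in low_coverage_rows(example):
    print(row["gene"], row["exon"], row["mean_dp"])
```

With a real file, pass an open file handle instead of the `io.StringIO` object.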

### vtools-evaluate

Evaluates a VCF file against a baseline VCF file containing true positives.
Only variants that are present in both VCF files are considered. This makes
the tool useful when the two VCF files have been produced by wildly different
technologies, e.g. when comparing a WES VCF file with a SNP array.

Output is a simple JSON file listing counts of concordant and discordant
alleles and some other metrics. It is also possible to output the discordant
VCF records.

Multisample VCF files are allowed; the samples to be evaluated have to be set
through the `--call-samples` and `--positive-samples` arguments.

Variants from the `--call-vcf` are filtered to have a Genotype Quality (GQ) of
at least 30 by default. This can be overridden with `--min-qual`, e.g.
`--min-qual 0`. The optional flag `--min-depth` sets the minimum read depth.

#### Usage

```bash
Usage: vtools-evaluate [OPTIONS]

Options:
  -c, --call-vcf PATH           Path to VCF with calls to be evaluated
                                [required]
  -p, --positive-vcf PATH       Path to VCF with known calls  [required]
  -cs, --call-samples TEXT      Sample(s) in call-vcf to consider. May be
                                called multiple times  [required]
  -ps, --positive-samples TEXT  Sample(s) in positive-vcf to consider. May be
                                called multiple times  [required]
  -s, --stats PATH              Path to output stats json file
  -dc, --discordant PATH        Path to output gzipped discordant vcf file
  -mq, --min-qual FLOAT         Minimum quality of variants to consider
  -md, --min-depth INTEGER      Minimum depth of variants to consider
  --help                        Show this message and exit.
```
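
Since `-cs` and `-ps` may be given multiple times, a command line for several samples has to repeat those flags. The sketch below builds such a command from Python using only the options listed above; the helper name `evaluate_command` and all file and sample names are placeholders.

```python
import shlex


def evaluate_command(call_vcf, positive_vcf, call_samples, positive_samples,
                     stats=None, min_qual=None, min_depth=None):
    """Build an argument list for vtools-evaluate (hypothetical helper)."""
    cmd = ["vtools-evaluate", "-c", call_vcf, "-p", positive_vcf]
    for sample in call_samples:
        cmd += ["-cs", sample]  # repeat the flag once per call sample
    for sample in positive_samples:
        cmd += ["-ps", sample]  # repeat the flag once per positive sample
    if stats is not None:
        cmd += ["-s", stats]
    if min_qual is not None:
        cmd += ["-mq", str(min_qual)]
    if min_depth is not None:
        cmd += ["-md", str(min_depth)]
    return cmd


cmd = evaluate_command(
    "calls.vcf.gz", "array.vcf.gz",
    call_samples=["sample1", "sample2"],
    positive_samples=["sample1", "sample2"],
    stats="stats.json",
    min_qual=0,
)
print(shlex.join(cmd))  # shell-ready form of the command (Python 3.8+)
```

The returned list can be passed directly to `subprocess.run` once vtools is installed.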

## License

MIT