vtools
======

A small toolset for working with VCF files. Uses cyvcf2 and cython under
the hood for speed.


Tools
-----

### vtools-filter

Filter VCF files based on a few criteria. Outputs both a filtered VCF
file and a VCF file containing all the filtered-out variants.

#### Filter criteria

| name | meaning | optional |
| ---- | ------- | -------- |
| NON_CANONICAL | Non-canonical chromosome | Yes |
| INDEX_UNCALLED | Index uncalled or homozygous reference | Yes |
| TOO_HIGH_GONL_AF | Too high GonL allele frequency | Yes |
| TOO_HIGH_GNOMAD_AF | Too high GnomAD allele frequency | Yes |
| LOW_GQ | Too low GQ on index sample | Yes |
| DELETED_ALLELE | The only ALT allele is a deleted allele | No |

#### Configuration 

Filters are configured through a small JSON file.
See [cfg/example-filter.json](cfg/example-filter.json) for an example.


#### Usage

```bash
Usage: vtools-filter [OPTIONS]

Options:
  -i, --input PATH                Path to input VCF file  [required]
  -o, --output PATH               Path to output (filtered) VCF file
                                  [required]
  -t, --trash PATH                Path to trash VCF file  [required]
  -p, --params-file PATH          Path to filter params json  [required]
  --index-sample TEXT             Name of index sample  [required]
  --immediate-return / --no-immediate-return
                                  Immediately write filters to file upon
                                  hitting one filter criterium. Default = True
  --help                          Show this message and exit.

```
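
The effect of `--immediate-return` can be illustrated with a minimal Python sketch. This is not the vtools implementation: the record layout, the predicates, and the thresholds below are hypothetical, and only the filter names come from the table above. It assumes that, with immediate return enabled, checking stops at the first criterion that is hit, while without it all criteria are checked and every hit is recorded.

```python
from typing import Callable, Dict, List

# Hypothetical record type: a plain dict of per-variant values.
Record = Dict[str, float]

# Hypothetical criteria. The names match the filter table, but the
# predicates and thresholds are invented for illustration only.
CRITERIA: Dict[str, Callable[[Record], bool]] = {
    "TOO_HIGH_GONL_AF": lambda r: r["gonl_af"] > 0.05,
    "TOO_HIGH_GNOMAD_AF": lambda r: r["gnomad_af"] > 0.05,
    "LOW_GQ": lambda r: r["gq"] < 20,
}


def failed_filters(record: Record, immediate_return: bool = True) -> List[str]:
    """Return the names of the criteria that the record hits."""
    failed = []
    for name, is_hit in CRITERIA.items():
        if is_hit(record):
            failed.append(name)
            if immediate_return:
                break  # stop at the first criterion that is hit
    return failed


record = {"gonl_af": 0.2, "gnomad_af": 0.3, "gq": 10}
print(failed_filters(record, immediate_return=True))   # first hit only
print(failed_filters(record, immediate_return=False))  # every hit
```

Under these assumptions, immediate return is faster but labels each trashed variant with a single filter, whereas disabling it records every criterion the variant hits.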

### vtools-stats

Collects some general statistics about a VCF file and writes them as JSON
to stdout.

#### Usage

```bash
Usage: vtools-stats [OPTIONS]

Options:
  -i, --input FILE  Input VCF file  [required]
  --help            Show this message and exit.
```
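
Because the statistics are written to stdout, they can be captured from another script. The following is a minimal Python sketch of one way to do that. The helper names are hypothetical, only the `-i` flag from the usage text is used, and the sketch assumes the output is a single JSON object; the keys it contains are not documented here.

```python
import json
import subprocess
from typing import Any, Dict, List


def stats_command(vcf_path: str) -> List[str]:
    """Build the argument list for vtools-stats (only -i is used)."""
    return ["vtools-stats", "-i", vcf_path]


def parse_stats(stdout_text: str) -> Dict[str, Any]:
    """Parse the JSON document that vtools-stats writes to stdout."""
    return json.loads(stdout_text)


def collect_stats(vcf_path: str) -> Dict[str, Any]:
    """Run vtools-stats on a VCF file and return the parsed statistics."""
    result = subprocess.run(
        stats_command(vcf_path),
        stdout=subprocess.PIPE,
        check=True,               # raise if vtools-stats exits non-zero
        universal_newlines=True,  # decode stdout as text (Python 3.6 compatible)
    )
    return parse_stats(result.stdout)
```

Calling `collect_stats("sample.vcf")` (a hypothetical file name) requires vtools to be installed and on the `PATH`.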

### vtools-gcoverage

Collect coverage metrics over a gVCF file for every exon or every transcript
in a refFlat file. This assumes the input VCF file is at least similar to
GATK's gVCF files. gVCF files are expected to contain only one sample; if
your input file contains multiple samples, only the first one is used.

Output is a simple TSV file with the following columns:

| column | meaning |
| ------ | ------- |
| exon | exon number |
| gene | gene name / symbol / id |
| mean_dp | mean DP value over the exon |
| mean_gq | mean GQ value over the exon* |
| median_dp | median DP value over the exon |
| median_gq | median GQ value over the exon |
| perc_at_least_{10, 20, 30, 50, 100}_dp | Percentage of the exon with a DP value of at least the given value |
| perc_at_least_{10, 20, 30, 50, 90}_gq | Percentage of the exon with a GQ value of at least the given value |
| transcript | transcript name / symbol / id |

*: the mean GQ value is computed by first converting every GQ value to
its P-value, then taking the mean of these P-values, and lastly
converting that mean back to a Phred score.
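
The Phred-based mean described above, and the `perc_at_least_*` columns, can be sketched in Python as follows. This is an illustration rather than the vtools code: the function names are hypothetical, it assumes the standard Phred relation `Q = -10 * log10(P)`, and it reads "at least" as an inclusive threshold (`>=`).

```python
import math
from typing import Sequence


def phred_to_p(gq: float) -> float:
    """Convert a Phred-scaled quality to a probability."""
    return 10 ** (-gq / 10)


def p_to_phred(p: float) -> float:
    """Convert a probability back to a Phred-scaled quality."""
    return -10 * math.log10(p)


def mean_gq(gq_values: Sequence[float]) -> float:
    """Mean GQ computed in probability space, then converted back."""
    probabilities = [phred_to_p(gq) for gq in gq_values]
    return p_to_phred(sum(probabilities) / len(probabilities))


def perc_at_least(values: Sequence[float], threshold: float) -> float:
    """Percentage of values that are greater than or equal to threshold."""
    return 100 * sum(v >= threshold for v in values) / len(values)


# GQ 10 and GQ 30 average to roughly 12.97, not 20: the low-quality
# position dominates because the mean is taken over probabilities.
print(round(mean_gq([10, 30]), 2))
print(perc_at_least([5, 10, 20, 40], 20))
```

Averaging in probability space means a few low-GQ positions pull the mean down much more than an arithmetic mean of the Phred scores would.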

#### Usage

```bash
Usage: vtools-gcoverage [OPTIONS]

Options:
  -I, --input-gvcf PATH          Path to input VCF file  [required]
  -R, --refflat-file PATH        Path to refFlat file  [required]
  --per-exon / --per-transcript  Collect metrics per exon or per transcript
  --help                         Show this message and exit.
```

### vtools-evaluate

Evaluate a VCF file against a baseline VCF file containing true positives.
Only variants that are present in both VCF files are considered. This makes
the tool useful when the two VCF files have been produced by very different
technologies, for example when comparing a WES VCF file with a SNP array.

Output is a simple JSON file listing counts of concordant and discordant
alleles. 
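
The idea of counting concordant and discordant alleles over shared variants can be sketched in Python as follows. This is a simplified illustration and not the vtools algorithm: the data layout (a dict from `(chromosome, position)` to a tuple of alleles), the function name, and the output keys are all hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple

Site = Tuple[str, int]       # (chromosome, position)
Genotype = Tuple[str, ...]   # called alleles, e.g. ("A", "G")


def evaluate(calls: Dict[Site, Genotype],
             positives: Dict[Site, Genotype]) -> Dict[str, int]:
    """Count concordant and discordant alleles at sites present in both sets."""
    counts = {"alleles_concordant": 0, "alleles_discordant": 0}
    # Only sites present in both call sets are considered.
    for site in calls.keys() & positives.keys():
        call_alleles = Counter(calls[site])
        true_alleles = Counter(positives[site])
        # Multiset intersection: alleles that appear in both genotypes.
        shared = sum((call_alleles & true_alleles).values())
        total = max(sum(call_alleles.values()), sum(true_alleles.values()))
        counts["alleles_concordant"] += shared
        counts["alleles_discordant"] += total - shared
    return counts


calls = {("1", 100): ("A", "G"), ("1", 200): ("C", "C"), ("2", 50): ("T", "T")}
positives = {("1", 100): ("A", "G"), ("1", 200): ("C", "T"), ("3", 10): ("G", "G")}
print(evaluate(calls, positives))
# {'alleles_concordant': 3, 'alleles_discordant': 1}
```

In this example the sites on chromosomes 2 and 3 are ignored because each occurs in only one of the two inputs.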

Multisample VCF files are allowed; the samples to be evaluated have to be set 
through a CLI argument.


#### Usage

```bash
Usage: vtools-evaluate [OPTIONS]

Options:
  -c, --call-vcf PATH           Path to VCF with calls to be evaluated
                                [required]
  -p, --positive-vcf PATH       Path to VCF with known calls  [required]
  -cs, --call-samples TEXT      Sample(s) in call-vcf to consider. May be
                                called multiple times  [required]
  -ps, --positive-samples TEXT  Sample(s) in positive-vcf to consider. May be
                                called multiple times  [required]
  --help                        Show this message and exit.
```

## Installation

* Python 3.6 or newer
* numpy and cython must be installed before installing vtools
    * this will get fixed in the very near future

Once both requirements are met, install vtools with:

```bash
python setup.py install
```

## License

MIT