# How to create configs

### The sample config

The sample config should be in [__JSON__](http://www.json.org/) or [__YAML__](http://yaml.org/) format. For YAML, the file should be named `*.yml` or `*.yaml`.

- The first field should have the key __"samples"__
- The second field should contain the __"libraries"__
- The third field contains __"R1"__ or __"R2"__ (FASTQ files), or __"bam"__ (a BAM file)
- The FASTQ input files can be provided either compressed (as in the `.fastq.gz` examples below) or uncompressed
- `output_dir` is a required setting that should be set either in a `config.json` or on the command line via `-cv output_dir=<path/to/outputdir>`.

#### Example sample config

###### YAML:

``` yaml
output_dir: /home/user/myoutputdir
samples:
  Sample_ID1:
    libraries:
      MySeries_1:
        R1: /path/to/R1.fastq.gz
        R2: /path/to/R2.fastq.gz
```

###### JSON:

``` json
{
  "output_dir": "/home/user/myoutputdir",
  "samples": {
    "Sample_ID1": {
      "libraries": {
        "MySeries_1": {
          "R1": "Your_R1.fastq.gz",
          "R2": "Your_R2.fastq.gz"
        }
      }
    }
  }
}
```
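
The layout rules listed at the top of this section can be checked mechanically. The following is a minimal sketch in Python and is not part of Biopet: the helper `check_sample_config` and its messages are made up for illustration, and it only checks the key layout described here (a `samples` key, a `libraries` key per sample, and at least one of `R1`, `R2` or `bam` per library).

```python
import json

# Keys a library may use to point at its input files, per the list above.
INPUT_KEYS = {"R1", "R2", "bam"}


def check_sample_config(config):
    """Return a list of problems found in a parsed sample config (hypothetical helper)."""
    samples = config.get("samples")
    if not isinstance(samples, dict) or not samples:
        return ['top-level key "samples" is missing or empty']
    problems = []
    for sample_id, sample in samples.items():
        libraries = sample.get("libraries") if isinstance(sample, dict) else None
        if not isinstance(libraries, dict) or not libraries:
            problems.append(f'sample {sample_id}: key "libraries" is missing or empty')
            continue
        for lib_id, library in libraries.items():
            # A library must be a mapping with at least one known input key.
            if not isinstance(library, dict) or not INPUT_KEYS & set(library):
                problems.append(f"library {sample_id}/{lib_id}: needs R1, R2 or bam")
    return problems


example = json.loads("""
{
  "output_dir": "/home/user/myoutputdir",
  "samples": {
    "Sample_ID1": {
      "libraries": {
        "MySeries_1": {"R1": "Your_R1.fastq.gz", "R2": "Your_R2.fastq.gz"}
      }
    }
  }
}
""")

# An empty list means no layout problems were found.
print(check_sample_config(example))
```

The check is deliberately shallow: it does not verify that the files exist or that `output_dir` is set, since that setting may also be given on the command line.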

To use BAM files as input, use a sample config like this:
  
``` yaml
samples:
  Sample_ID_1:
    tags:
      gender: male
      father: sampleNameFather
      mother: sampleNameMother
    libraries:
      Lib_ID_1:
        tags:
          key: value
        bam: MyFirst.bam
      Lib_ID_2:
        bam: MySecond.bam
```

Note that the tool [SamplesTsvToConfig](../tools/SamplesTsvToConfig.md) can generate the sample config for you, which avoids the risk of writing a wrongly formatted file by hand.

#### Tags

In the `tags` key inside a sample or library, users can supply tags that belong to that sample or library. These tags will be automatically parsed into the summary of a pipeline.

### The settings config
The settings config lets a user alter almost all settings available in the tools used by a given pipeline.
This config file should be written in either JSON or YAML format. It can contain settings such as:

 * references,
 * cut-offs,
 * program modes and memory limits (program specific),
 * whether chunking should be used,
 * program executables (if for some reason the user does not want to use the system's default tools),
 * global variables containing settings for all tools used in the pipeline, or tool-specific options one layer deeper in the JSON file. E.g. in the example below the settings for Picard tools are altered only for Picard and not globally.

``` json
"picard": { "validationstringency": "LENIENT" } 
```

Global setting examples are:
~~~
"java_gc_timelimit": 98,
"numberchunks": 25,
"chunking": true
~~~
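
One way to picture the difference between global and tool-specific settings is a lookup that checks the tool's own block first and falls back to the top level. The sketch below illustrates that idea in Python under this assumption. `lookup` is a hypothetical helper, not Biopet's actual resolution code.

```python
# Settings combining the global examples and the Picard block shown above.
settings = {
    "java_gc_timelimit": 98,
    "numberchunks": 25,
    "chunking": True,
    "picard": {"validationstringency": "LENIENT"},
}


def lookup(settings, tool, key, default=None):
    """Hypothetical helper: prefer the tool-specific value, then the global one."""
    tool_block = settings.get(tool, {})
    if isinstance(tool_block, dict) and key in tool_block:
        return tool_block[key]
    return settings.get(key, default)


print(lookup(settings, "picard", "validationstringency"))    # tool-specific value
print(lookup(settings, "picard", "chunking"))                # falls back to the global value
print(lookup(settings, "bedtools", "validationstringency"))  # not set for this tool
```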


----

#### References
Pipelines and tools that use references should now use the reference module.
This gives more fine-grained control over references and enables a user to curate the references in a structured way.
E.g. pipelines and tools that use a FASTA reference should now set the value `"reference_fasta"`.
Additionally, we can set `"reference_name"` for the name to be used (e.g. `"hg19"`). If unset, Biopet will default to `unknown`.
It is also possible to set the `"species"` flag. Again, we will default to `unknown` if unset.
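
The defaults described above can be summarised in a few lines of Python. This is only a sketch of the documented behaviour, not Biopet code: `describe_reference` is a hypothetical name, and raising an error when `"reference_fasta"` is missing is an assumption made for the example.

```python
def describe_reference(settings):
    """Hypothetical helper mirroring the defaults described above."""
    if "reference_fasta" not in settings:
        # Assumption: a missing FASTA reference is treated as an error.
        raise KeyError('"reference_fasta" is not set')
    return {
        "reference_fasta": settings["reference_fasta"],
        # Both values fall back to "unknown" when they are not set.
        "reference_name": settings.get("reference_name", "unknown"),
        "species": settings.get("species", "unknown"),
    }


print(describe_reference({"reference_fasta": "/references/hg19_nohap/ucsc.hg19_nohap.fasta"}))
```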

#### Example settings config
``` json
{
        "reference_fasta": "/references/hg19_nohap/ucsc.hg19_nohap.fasta",
        "reference_name": "hg19_nohap",
        "species": "homo_sapiens",
        "dbsnp": "/data/LGTC/projects/vandoorn-melanoma/data/references/hg19_nohap/dbsnp_137.hg19_nohap.vcf",
        "joint_variantcalling": false,
        "haplotypecaller": { "scattercount": 100 },
        "multisample": { "haplotypecaller": { "scattercount": 1000 } },
        "picard": { "validationstringency": "LENIENT" },
        "library_variantcalling_temp": true,
        "target_bed_temp": "/data/LGTC/projects/vandoorn-melanoma/analysis/target.bed",
        "min_dp": 5,
        "bedtools": {"exe":"/share/isilon/system/local/BEDtools/bedtools-2.17.0/bin/bedtools"},
        "bam_to_fastq": true,
        "baserecalibrator": { "memory_limit": 8, "vmem":"16G" },
        "samtofastq": {"memory_limit": 8, "vmem": "16G"},
        "java_gc_timelimit": 98,
        "numberchunks": 25,
        "chunking": true
}
```

# More advanced use of config files.
### 4 levels of configuring settings
In Biopet, the value of a ConfigNamespace (e.g., "reference_fasta") for a tool or a pipeline can be defined at 4 different levels.

 * Level-4: As a fixed value hardcoded in the Biopet source code
 * Level-3: As a user-specified value in the user config file
 * Level-2: As a system-specified value in the global config files. On the LUMC's SHARK cluster, these global config files are located at /usr/local/sasc/config.
 * Level-1: As a default value provided in the Biopet source code.

During execution, the Biopet framework resolves the value for each ConfigNamespace in order from level-4 to level-1. Hence, a value defined at a higher level overrides a value defined at a lower level for the same ConfigNamespace.
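
The resolution order can be sketched as a simple loop from the highest level to the lowest. The Python below illustrates the precedence rule only. The level labels (`fixed`, `user`, `system`, `default`), the helper `resolve` and the example paths are all made up for this sketch and do not come from the Biopet code base.

```python
# Levels listed from highest to lowest priority (level-4 down to level-1).
LEVELS = ["fixed", "user", "system", "default"]


def resolve(key, values_by_level):
    """Return (value, level) from the highest-priority level that defines key."""
    for level in LEVELS:
        level_values = values_by_level.get(level, {})
        if key in level_values:
            return level_values[key], level
    raise KeyError(f"no value defined for {key!r} at any level")


# Hypothetical values; none of these paths are real.
values_by_level = {
    "user": {"reference_fasta": "/home/user/my_reference.fasta"},
    "system": {"reference_fasta": "/shared/reference.fasta", "chunking": True},
    "default": {"chunking": False},
}

print(resolve("reference_fasta", values_by_level))  # the user value wins over the system value
print(resolve("chunking", values_by_level))         # the system value wins over the default
```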

### JSON validation

To check whether the created JSON file is correct there are several possibilities: the simplest way is using [this website](http://jsonformatter.curiousconcept.com/). It is also possible to use Python, Scala or any other programming language to validate JSON files, but this requires some more knowledge.
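
As a minimal sketch of the Python route, the snippet below uses only the standard-library `json` module. The helper names are our own. In addition to syntax errors it rejects repeated keys inside one object, which `json.loads` would otherwise accept silently by keeping the last value.

```python
import json


def reject_duplicate_keys(pairs):
    """object_pairs_hook that fails when a JSON object repeats a key."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


def validate_json_text(text):
    """Return None if text is valid JSON without duplicate keys, else an error message."""
    try:
        json.loads(text, object_pairs_hook=reject_duplicate_keys)
    except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
        return str(err)
    return None


print(validate_json_text('{"chunking": true}'))   # valid, so None is returned
print(validate_json_text('{"chunking": true,}'))  # trailing comma, so an error message is returned
print(validate_json_text('{"a": 1, "a": 2}'))     # repeated key, so an error message is returned
```

To validate a file, read it first and pass its contents to `validate_json_text`. For a quick syntax-only check from the command line, `python -m json.tool config.json` also works.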

# Creating config files with Biopet

With the pipelines Gentrap, MultiSampleMapping and Shiva it is possible to use Biopet itself to create the config files. Biopet should be called with the keyword *template*; the user is then prompted to enter the values for the parameters needed by the pipeline. Biopet generates a config file that can be used as input when running the pipeline. The purpose is to make creating config files easier. This is especially useful when Biopet has been pre-configured with a list of reference genomes: the user then only needs to specify which reference genome to use, and the location of the reference genome files is derived from Biopet's global configuration.

<br/>
<b> Example </b>

For viewing the pipelines for which this functionality is supported:

``` bash
biopet template
```

For getting help about using it for a specific pipeline:

``` bash
biopet template Gentrap -h
```

For running the tool:

``` bash
biopet template Gentrap -o gentrap_config.yml -s gentrap_run.sh
```
<br/>
<b> Description of the parameters </b>

| Flag  (short)| Flag (long) | Type | Function |
| ------------ | ----------- | ---- | -------- |
| -o | --outputConfig | Path (**required**) | Name of the config file that gets generated.|
| -s | --outputScript | Path (optional) | Biopet can also output a script that can be used directly for running the pipeline; the pipeline call in that script is generated with the config file as input. This parameter sets the name of the script file.|
| -t | --template | Path (optional) | A template file with 2 placeholders *%s* is required for generating the script. The first placeholder will be replaced with the name of the pipeline, the second with the paths to the sample and settings config files. When Biopet has been pre-configured to use the default template file, then setting this parameter is optional. |
|    | --expert |  | This flag enables the user to configure a more extensive list of parameters for the pipeline. |
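
To illustrate how a template with two `%s` placeholders is filled in, here is a small Python sketch. It only demonstrates the substitution order described for `--template`. The template text is hypothetical (it just echoes the values rather than calling a pipeline), and Biopet's own substitution code may differ.

```python
# Hypothetical template text; a real template would contain the actual pipeline call.
template = "#!/bin/bash\necho 'pipeline: %s'\necho 'config arguments: %s'\n"

pipeline_name = "Gentrap"
config_arguments = "gentrap_config.yml"  # stands in for the config file path(s)

# The first %s receives the pipeline name, the second the config file path(s).
script = template % (pipeline_name, config_arguments)
print(script)
```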