@@ -30,7 +30,7 @@ cd CNAprioritization
- Set date_data to the date of interest to specify which version of the dataset to use, e.g. "2016_07_15"
- See http://gdac.broadinstitute.org/runs/info/analyses__runs_list.html for older versions.
6. In config.yaml, the settings for running GISTIC2.0 and for benchmarking can be modified if necessary (see Settings for more information).
7. Run the pipeline shell script, which will create the required conda environment and run snakemake (recommended when snakemake is not installed yet):
```
@@ -81,28 +81,58 @@ This pipeline repository contains the following scripts and files:
<b>config.yaml</b>: all customizable settings for running the pipeline
<b>rules/</b>
- PreprocessInput.smk: Preprocess the input files.
- RecurrentRegions.smk: Detect recurrently amplified regions using GISTIC2 and RUBIC.
- GenePrioritization.smk: Prioritize the list of genes harbored by the recurrent regions and perform GO enrichment analysis using topGO.
- SampleSizes.smk: Examine the differences in power when using smaller sample sizes for GISTIC2 and RUBIC.
- UseControl.smk: Examine the differences between using tumor samples only and using tumor samples and matched control samples.
- PreprocessInput.smk: preprocess the input files.
- GISTIC2.smk: detect recurrently amplified regions using GISTIC2.
- RUBIC.smk: detect recurrently amplified regions using RUBIC.
- Circos.smk: make Circos visualizations on the raw segmentation file and the results from RUBIC and GISTIC2.0.
- ComparisonRegions.smk: compare the regions detected by RUBIC and GISTIC2.0.
- GenePrioritization.smk: prioritize the list of genes harbored by the recurrent regions based on a GO enrichment analysis using topGO.
- SampleSizes.smk: examine the differences when using smaller sample sizes for GISTIC2 and RUBIC.
<b>scripts/</b>
- Reports.py: functions for making reports on the results.
- Rubic.py: functions for preparing files for running RUBIC.
- SampleSizes.py: functions for making segmentation files as input for SampleSizes.smk.
- <i>Rubic.py:</i> functions for preparing files for running RUBIC.
- install_gistic2.sh: script for installing GISTIC2 to a directory of choice.
- ParseResults.py: functions for parsing the result files from GISTIC2.0 and RUBIC and files with a list of genes.
- ensemblQueries.py: functions for retrieving ensembl information using pyensembl.
- ReportTools.py: functions for making a report and several tables and plots on the results from GISTIC2.0 and RUBIC.
- ReportSegments: functions for making a report on the raw segmentation file.
- Circos.py: functions for preparing input for Circos.
- circos/: configuration files for the Circos plots.
- SampleSizes.py: functions for making segmentation files as input for SampleSizes.smk.
- ReportSizes.py: functions for making a report on the results from the different sample sizes.
- PrecisionRecall.py: make a plot with the precision and recall based on different sample sizes.
<b>wrappers/</b>
<b>wrappers/</b>: wrappers for reusing tools.
- GISTIC2/: wrapper directory for running GISTIC2 for prioritization of copy number regions.
- RUBIC/: wrapper directory for running RUBIC for prioritization of copy number regions.
- topGO/: wrapper directory for performing a GO enrichment analysis using topGO.
<b>test/</b>
- Test files for testing the pipeline.
<b>envs/</b>: environments with required packages.
- bedtools.yaml: environment for running bedtools intersect.
- circos.yaml: environment for making Circos plots.
- firehose.yaml: environment for retrieving data using get_firehose.
- pipeline.yaml: python environment for running the pipeline.
<b>input_files/</b>: files with information needed for running the pipeline.
- biomart_human_genes_38.tsv: gene information for GRCh38 retrieved from Biomart.
- biomart_human_genes_hg19.tsv: gene information for hg19 retrieved from Biomart.
- Census_genes.txt: list of census genes for validation.
- ID_to_GO.txt: list of all ensembl_IDs and corresponding GO terms, needed for running topGO.
- intogen-CM-drivers-data.tsv: information on candidate driver genes from IntOGen.
<b>run_pipeline_cluster.sh</b>: script to submit to the Shark cluster for running the pipeline.
## Output structure
Overview of files created by the pipeline.
Overview of folders and content created by the pipeline:
- Benchmarks: benchmark reports
- Circos: Circos plots
- GO: GO enrichment analysis results from topGO
- Input: retrieved input files
- Reports: reports on results
- RUBIC: results from RUBIC
- GISTIC: results from GISTIC
- Samplesize: results from the sample size comparison
- stddata__2016_07_15: files retrieved from get_firehose.
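
To check a finished run against the folder overview above, something like the following could be used. This is only a sketch: `missing_output_dirs` is a hypothetical helper, not part of the pipeline, and it assumes that all folders are created directly under `workdir` and that the `stddata__` folder name follows the `date_data` setting.

```python
import os
import tempfile

# Folder names copied from the "Output structure" list above.
EXPECTED_DIRS = ["Benchmarks", "Circos", "GO", "Input", "Reports",
                 "RUBIC", "GISTIC", "Samplesize"]


def missing_output_dirs(workdir, date_data="2016_07_15"):
    """Return the expected output folders that do not exist under workdir.

    Hypothetical helper: assumes the firehose folder is named
    "stddata__<date_data>" and sits next to the other output folders.
    """
    expected = EXPECTED_DIRS + ["stddata__" + date_data]
    return [name for name in expected
            if not os.path.isdir(os.path.join(workdir, name))]


# Small demonstration on a temporary directory with only three folders present.
with tempfile.TemporaryDirectory() as demo_workdir:
    for name in ("Input", "GISTIC", "RUBIC"):
        os.mkdir(os.path.join(demo_workdir, name))
    print(missing_output_dirs(demo_workdir))
```

In practice the function would be called with the `workdir` and `date_data` values from config.yaml.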
## Author
Beatrice F. Tan (beatrice.ftan@gmail.com)
#Directories to be specified
#workdir: /home/bftan/CNA_results #directory to write output
#gisticdir: /home/bftan/Tools/GISTIC2_test #directory to install GISTIC2
#workdir: /home/bftan/CNA_results #directory to write output
#gisticdir: /home/bftan/Tools/GISTIC2 #directory to install GISTIC2
workdir: /home/beatrice/CNA_analysis
gisticdir: /home/beatrice/CNA_analysis/run_gistic2
#Input details to download from firehose
cancer_type: SKCM
date_data: "2016_07_15"
#Or provide input file
inputfile: "" #tumor segmentation data
normal: "" #normal segmentation data
#Data for running and benchmarking tools.
reference: hg19
prev_found_genes: input_files/intogen-CM-drivers-data.tsv
census_genes: input_files/Census_genes.txt
biomart_genes: input_files/biomart_human_genes.tsv
#Settings GISTIC2.0
gistic_precision: "99"
settings_gistic: "-ta 0.1
-td 0.1
-qvt 0.25
-brlen 0.7
-cap 1.5
-rx 1
-genegistic 1
-conf 0.99"
#Settings for sample size differences
sizes: [20, 30, 40, 50, 60, 70, 80, 90]
#sizes: [30, 40]
repeats: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
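
The settings above can be illustrated with a short Python sketch. It assumes two things that the config itself does not state: that the multi-line `settings_gistic` string is read as a single space-separated line (which is how YAML folds line breaks in a double-quoted scalar), and that the sample size analysis runs every combination of `sizes` and `repeats`. `parse_flags` is a hypothetical helper, not part of the pipeline.

```python
import itertools
import shlex

# Values copied from the config above. The flag string is written on one
# line, as a YAML parser would return it after folding the line breaks.
settings_gistic = ("-ta 0.1 -td 0.1 -qvt 0.25 -brlen 0.7 "
                   "-cap 1.5 -rx 1 -genegistic 1 -conf 0.99")
sizes = [20, 30, 40, 50, 60, 70, 80, 90]
repeats = list(range(1, 21))


def parse_flags(settings):
    """Turn '-flag value -flag value ...' into a dict (hypothetical helper)."""
    tokens = shlex.split(settings)
    if len(tokens) % 2 != 0:
        raise ValueError("expected flag/value pairs")
    return {flag.lstrip("-"): value
            for flag, value in zip(tokens[::2], tokens[1::2])}


gistic_flags = parse_flags(settings_gistic)

# Assumption: one run per (sample size, repeat) pair.
runs = list(itertools.product(sizes, repeats))

print(gistic_flags)
print(len(runs), "size/repeat combinations")
```

Under that assumption, 8 sizes and 20 repeats give 160 runs per tool, which is worth keeping in mind before extending either list.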
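#!/usr/bin/env Rscript
# Usage: Rscript <this script> <regions.bed>
args <- commandArgs(trailingOnly = TRUE)
# Install karyoploteR only when missing (BiocManager replaces the deprecated biocLite)
if (!requireNamespace("karyoploteR", quietly = TRUE)) {
  if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager", repos = "https://cloud.r-project.org")
  BiocManager::install("karyoploteR")
}
library(karyoploteR)
bed <- read.table(args[1], sep = "\t", header = TRUE)
#Read output plot files?
# Assumption: the first three columns hold chromosome (as "chr1"), start and end
chrom <- as.character(bed[[1]])
start <- bed[[2]]
end <- bed[[3]]
kp <- plotKaryotype(chromosomes = unique(chrom))
#kpDataBackground(kp)
kpAxis(kp)
kpRect(kp, chr = chrom, x0 = start, x1 = end, y0 = 0.2, y1 = 0.4)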