Commit 9d2b7e12 authored by Sam Nooij's avatar Sam Nooij
Browse files

One more data cleanup

parents
# Analysis contents
Summary of the analyses included in the manuscript 'Effect of faecal microbiota transplantation on _pks_+ _E. coli_ in metagenomes of patients with multiple recurrent _Clostridioides difficile_ infections' (Nooij _et al._, _in preparation_).
The analysis was conducted as a multistep process, each of which is briefly highlighted here. Links to the corresponding scripts (Rmarkdown) and report (html) files are provided for more thorough descriptions of each particular analysis step and its results.
## Brief summary
We retrospectively screened faecal metagenomes of a cohort of 49 multiple recurrent _Clostridioides difficile_ infected (rCDI) patients and their respective healthy faecal donors of the Netherlands Donor Faeces bank for the procarcinogenic _pks_+ _E. coli_ variant.
First, we downsampled raw reads to 10M pairs and analysed them with [Jovian](https://www.github.com/DennisSchmitz/Jovian). Then, to detect and quantify the _pks_ island, we mapped trimmed and human-filtered reads to a reference sequence (GenBank accession ID [AM229678](https://www.ncbi.nlm.nih.gov/nuccore/am229678)) using [Jovian screener](https://git.lumc.nl/snooij/jovian-screener).
Additionally, we used taxonomic output and scaffolds generated by Jovian to estimate _E. coli_ abundance and determine the species of origin of scaffolds containing the _pks_ island.
Besides, trimmed and filtered reads were also mapped to putative _E. coli_ single-copy marker genes to infer the number of genome copies per metagenome.
The mapped read counts were used in the further analyses described below.
_Note: the report files linked below work best when downloaded first._
## Downstream analyses contents
1. The first step in the analysis is to import read counts and normalise the numbers to reads per kilobase per million to facilitate comparison.
(This normalisation takes into account target gene length as well as the total number of reads in the metagenome.)
[Import and RPKM calculation script](1_Import_and_RPKM_calculation.Rmd)
[Import and RPKM calculation report](1_Import_and_RPKM_calculation.html)
This includes:
- Import data from different results and metadata files
- Calculate RPKM values of mapped reads per gene
- Export dataframe for consecutive analyses
2. The second step is to assess whether the _pks_ island occurs as a single genomic unit, or if the genes are likely to be transferred separately.
By ensuring that the _pks_ island is one unit, we can assume each hit to one of the 19 genes indicates presence of the island.
[Assess _pks_ script](2_Assess_pks_unity_by_clb_correlations.Rmd)
[Assess _pks_ report](2_Assess_pks_unity_by_clb_correlations.html)
This includes:
- Correlate levels of RPKM genes to establish if _pks_ occurs as one genomic unit
- Create figures (S3-S5)
3. The third step is then to quantify the level of _pks_ in each metagenome and visualise these values per subject and per group (donors, pre-FMT patients and post-FMT patients).
[Quantify _pks_ script](3_Quantify_pks.Rmd)
[Quantify _pks_ report](3_Quantify_pks.html)
This includes:
- Quantify _pks_ levels in donors and patients, pre- and post-FMT
- Create figures of _pks_ levels (= figures 1, 3, S1)
- Statistical tests on _pks_ levels
- Export table with matching donor and patients _pks_ levels
4. The fourth step is to evaluate temporal dynamics of _pks_ in a healthy donor by visualising the levels of _pks_ in samples taken over a period of about six months. The relative abundances of _E. coli_ and Enterobacteriaceae are also visualised to indicate the putative species of origin.
[Evaluate temporal _pks_ script](4_Evaluate_temporal_pks_in_D5.Rmd)
[Evaluate temporal _pks_ report](4_Evaluate_temporal_pks_in_D5.html)
This includes:
- Figures for persistent colonisation by _pks_+ bacterium (_E. coli_?) (= figure 2)
5. The fifth step is performs several statistical tests to compare _pks_ carriership between groups and determine whether the donor and pre-FMT _pks_-status influence _pks_ post-FMT.
[Group statistics script](5_Calculate_group_statistics.Rmd)
[Group statistics report](5_Calculate_group_statistics.html)
This includes:
- Statistical tests (Chi-squared and Fisher's exact test) to assess influence of FMT on _pks_-carriership
6. The sixth step evaluates three putative _E. coli_ single-copy marker genes (_dxs_, [AF035440](https://www.ncbi.nlm.nih.gov/nuccore/AF035440); _rodA_, [M22857](https://www.ncbi.nlm.nih.gov/nuccore/M22857); _uidA_, [S69414](https://www.ncbi.nlm.nih.gov/nuccore/S69414)) by correlating their abundance to the relative abundances of _E. coli_ and Enterobacteriaceae.
Single-copy marker genes may be used to infer the number of genome equivalents present in the metagenome and thereby calculate the number of _pks_ island copies per genome (see next step).
[Evaluate marker genes script](6_Evaluate_marker_genes.Rmd)
[Evaluate marker genes report](6_Evaluate_marker_genes.html)
This includes:
- Correlate marker gene abundances to _E. coli_ and Enterobacteriaceae
- Create figures (S6)
7. The seventh step calculates the number of _pks_ island copies per _E. coli_ genome to estimate the prevalence among _E. coli_, test if a genome might hold several copies and/or provide a clue as to whether another species may carry the _pks_ island.
[Calculate _pks_ per _E. coli_ ratio script](7_Calculate_pks_per_Ecoli_ratio.Rmd)
[Calculate _pks_ per _E. coli_ ratio report](7_Calculate_pks_per_Ecoli_ratio.html)
This includes:
- Calculate the number of _pks_ island copies per _E. coli_ genome using single-copy marker genes as equivalent of the number of genomes
- Create figure (S2)
8. The eighth step performs a number of additional checks on the results to help report all the right numbers and assess reliability.
[Additional data checks script](8_Additional_data_checks.Rmd)
[Additional data checks report](8_Additional_data_checks.html)
This includes:
- Number of samples in which _pks_ is detected, compared to number with _clbB_
- How many patients keep/lose/acquire _pks_, including summary statistics of these subsets (mean and median _pks_)
- Total number of reads analysed per sample/metagenome: anything out of the ordinary?
- Evaluate _clb_ gene and _uidA_ mapped read counts for metagenomes with high (~1) _pks_/_E. coli_ ratios
---
title: "Import mapped read data and calculate RPKM"
author: "Sam Nooij"
date: "7 January 2021"
output:
html_document:
code_folding: show
---
```{r setup, include=FALSE}
library(knitr)
library(rmarkdown)
library(here)
library(codetools)
library(tidyverse)
library(DT)
knitr::opts_chunk$set(echo = TRUE,
warning = FALSE,
message = FALSE,
cache = TRUE)
create_dt <- function(x){
DT::datatable(x,
extensions = 'Buttons',
options = list(dom = 'Blfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
lengthMenu = list(c(10,25,50,-1),
c(10,25,50,"All"))),
filter = 'top',
rownames = FALSE)
}
#Create nice looking downloadable tables,
# function adapted from Martin Chan, 2019:
# https://martinctc.github.io/blog/vignette-downloadable-tables-in-rmarkdown-with-the-dt-package/
# with column filter thanks to Yihui Xie, 2015:
# https://blog.rstudio.com/2015/06/24/dt-an-r-interface-to-the-datatables-library/
```
# Import mapped read data and calculate normalised (RPKM) values
The first step in the analysis of _pks_ island sequences from metagenomes is to import mapped read counts and combine the numbers with relevant metadata such as alias IDs, which donor matches which patient, etc. When the data have been properly imported and combined, the raw read counts are normalised using both target gene lengths and total read numbers per sample to calculate the reads per kilobase per million (RPKM) values. Then, the resulting dataframe is exported to be used in the different analysis steps.
```{r import_data}
#Start by loading the counts of mapped reads.
# These have been processed in batches, so one file is imported
# for each batch.
donor_read_counts <- read.delim(here("..", "data", "processed", "donor", "Deduplicated_read_counts.tsv"))
extra_donor_read_counts <- read.delim(here("..", "data", "processed", "extra_donors", "Deduplicated_read_counts.tsv"))
pre_read_counts <- read.delim(here("..", "data", "processed", "pre-FMT", "Deduplicated_read_counts.tsv"))
post_read_counts <- read.delim(here("..", "data", "processed", "post-FMT", "Deduplicated_read_counts.tsv"))
## Note: `here` points to the directory above the current,
## not the directory in which this script is saved! (Which is `doc`.)
#Combine the batches into one dataframe for easier processing.
all_read_counts <- rbind(donor_read_counts,
extra_donor_read_counts,
pre_read_counts,
post_read_counts)
#Load metadata for all subjects (donors and patients).
subject_metadata <- read.csv(here("..", "..", "..", "data", "FMT_metadata.csv"))
subject_metadata$Sample_long <- gsub(pattern = "\\.", replacement = "-", x = subject_metadata$Sample_long)
#Filter out the example entries:
subject_metadata <- subject_metadata %>%
filter(!str_detect(Comments, "^EXAMPLE"))
#Keep a list of ordered IDs for to sort the dataframe for more logical visualisation
ordered_alias_IDs <- unique(subject_metadata$alias_ID)
#Also load the conversion table for long, old IDs into new sample IDs
alias_sample_IDs <- read.delim(here("..", "data", "Sample_triplets-new_IDs.tsv"))
#Load the total number of reads used for mapping.
# (I use not the raw library sizes, but the read counts after trimmed and filtering for normalisation).
donor_total_reads <- read.delim(here("..", "data", "processed", "donor", "Filtered_read_counts.csv"))
extra_donors_total_reads <- read.delim(here("..", "data", "processed", "extra_donors", "Filtered_read_counts-summed_up.csv"))
pre_total_reads <- read.delim(here("..", "data", "processed", "pre-FMT", "Filtered_read_counts.csv"))
post_total_reads <- read.delim(here("..", "data", "processed", "post-FMT", "Filtered_read_counts.csv"))
#Filter out only the rows of interest (there are also counts for paired and unpaired reads separately).
donor_total_reads <- donor_total_reads %>% filter(set == "total")
extra_donors_total_reads <- extra_donors_total_reads %>% filter(set == "total")
pre_total_reads <- pre_total_reads %>% filter(set == "total")
post_total_reads <- post_total_reads %>% filter(set == "total")
#Combine into one dataframe
unfiltered_total_reads <- rbind(donor_total_reads, extra_donors_total_reads, pre_total_reads, post_total_reads)
#Remove the "_pR1" suffix that is attached to each sample name:
unfiltered_total_reads$Sample <- unfiltered_total_reads$Sample %>% gsub(pattern = "_pR1", replacement = "")
#And match the total reads to the alias sample IDs to facilitate matching sample and read data
total_reads <- merge(x = unfiltered_total_reads,
y = alias_sample_IDs,
by.x = "Sample",
by.y = "Original_ID")
#Now to add the metadata to the read count data
all_read_counts <- merge(x = all_read_counts,
y = subject_metadata,
by.x = "Sample_name",
by.y = "Sample_long",
all = TRUE)
all_read_counts <- merge(x = all_read_counts,
y = alias_sample_IDs,
by.x = "Sample_name",
by.y = "Original_ID",
all = TRUE)
#Remove '.1' suffix from accession IDs to facilitate merging with gene metadata
all_read_counts$Reference <- gsub(pattern = "\\.1$", replacement = "", x = all_read_counts$Reference)
clb_metadata <- read.delim(here("..", "data", "references", "Colibactin-AM229678_info.tsv"))
colnames(clb_metadata)[colnames(clb_metadata) == "Accession"] <- "Accession_ID"
#The gene names in clb_metadata have spaces in the end: remove those before further processing:
clb_metadata$Gene <- trimws(clb_metadata$Gene, which = "right")
markers_metadata <- read.delim(here("..", "data", "references", "Adiri-markers-info.tsv"))
#Add an 'ID_long' column to match with the read data
markers_metadata$ID_long <- markers_metadata$Accession_ID
#Add these two together to make 'gene metadata'
gene_metadata <- merge(x = clb_metadata,
y = markers_metadata,
all = TRUE)
#Also add the markers metadata to the read count data
# (note that only selected markers are in this set, so rows with different genes will be excluded)
annotated_read_counts <- merge(x = all_read_counts,
y = gene_metadata,
by.x = "Reference",
by.y = "ID_long")
#And finally merge the total number of reads per metagenome into the dataframe
annotated_read_counts <- merge(x = annotated_read_counts,
y = total_reads,
by = "New_ID")
#Now start selecting data of interest: first filter out genes
genes_to_filter_out <- c("mdh", "ppk", "gcl", "metA", "zwf", "adk")
# (Based on depth of coverage figures, these MLST genes show
# uneven coverage and are therefore less suitable as specific
# marker genes.)
selected_read_data <- annotated_read_counts[! annotated_read_counts$Gene %in% genes_to_filter_out, ]
#Remove samples that are not from the triplets and have 'New_ID' == 'NA'
selected_read_data <- selected_read_data %>% filter(!is.na(New_ID))
#Calculate the reads per kilobase per million
selected_read_data$RPKM <- with(selected_read_data,
mapped_reads / Length * 1000 / reads * 1e6)
#Now select the columns of interest for further analyses:
selected_read_data <- selected_read_data %>%
select(alias_ID, New_ID, Type, Status, Gene, Length, mapped_reads, RPKM, reads)
#Add back metadata for samples with no mapped reads in 'selected_read_data':
additional_data <- all_read_counts %>%
filter(!New_ID %in% selected_read_data$Sample_ID & New_ID != 'NA')
additional_data <- merge(x = additional_data,
y = total_reads,
by = "New_ID") %>%
select(alias_ID, New_ID, Type, Status, reads) %>%
unique()
selected_read_data <- right_join(x = selected_read_data,
y = additional_data)
#And rename a few columns for easier interpretation
colnames(selected_read_data)[colnames(selected_read_data) %in%
c("alias_ID", "New_ID", "mapped_reads", "reads")] <-
c("Subject_ID", "Sample_ID", "Mapped_reads", "Total_reads")
#Make 'Subject_ID' into a factor to facilitate sorting...
selected_read_data$Subject_ID <- factor(selected_read_data$Subject_ID %>%
str_extract("^[DP][0-9]+"),
levels = ordered_alias_IDs)
#... and sort the dataframe
selected_read_data <- selected_read_data[with(selected_read_data,
order(Subject_ID, Sample_ID)),]
#Save the dataframe as RDS file to easily open it in other R scripts
saveRDS(selected_read_data, file = here("data", "RPKM_per_gene_per_sample.rds"))
#And finally show the table
create_dt(selected_read_data)
```
The table above shows the RPKM values for each gene for each sample (metagenome) and are used in subsequent analysis steps.
_Note: in some metagenomes, none of the genes of interest were detected. These metagenomes are listed in the table with empty (NA) for 'Gene', 'Length', 'Mapped_reads' and 'RPKM'._
---
# Software used
```{r session_info}
sessionInfo()
```
\ No newline at end of file
This diff is collapsed.
---
title: "Assess the _pks_ island as genomic unit"
author: "Sam Nooij"
date: "8 January 2021"
output:
html_document:
toc: yes
toc_depth: 3
code_folding: show
---
```{r setup, include=FALSE}
library(knitr)
library(rmarkdown)
library(here)
library(codetools)
library(tidyverse)
library(DT)
library(Hmisc)
library(corrplot)
library(ggpubr)
library(cowplot)
library(viridisLite)
knitr::opts_chunk$set(fig.height = 12, fig.width = 13, fig.path = here("figures", ""),
fig.ext = c("png", "pdf"), dev = c("png", "pdf"), dpi = c(72, 300),
echo = TRUE,
warning = FALSE,
message = FALSE,
cache = TRUE)
create_dt <- function(x){
DT::datatable(x,
extensions = 'Buttons',
options = list(dom = 'Blfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
lengthMenu = list(c(10,25,50,-1),
c(10,25,50,"All"))),
filter = 'top',
rownames = FALSE)
}
#Create nice looking downloadable tables,
# function adapted from Martin Chan, 2019:
# https://martinctc.github.io/blog/vignette-downloadable-tables-in-rmarkdown-with-the-dt-package/
# with column filter thanks to Yihui Xie, 2015:
# https://blog.rstudio.com/2015/06/24/dt-an-r-interface-to-the-datatables-library/
```
# Assess the _pks_ island as genomic unit by correlating sequencing depths of genes
In this second step, the _pks_ island is assessed by correlating the numbers of reads mapped to each of its coding genes (_clb_ genes). Using the normalised (RPKM) values for each gene, high correlations would indicate co-occurrence in a sample.
Low correlations would be an indication of genes existing separately within the metagenome, hinting at separate transfer of genes.
Note that next to the 19 coding (_clb_) genes, the flanking regions (intP4 = 5' flanking region; IS1400_transposase_A, IS1400_transposase_B and putative_transposase_(clb) = 3' flanking region) are also included, together with putative _E. coli_ single-copy marker genes _dxs_, _rodA_, and _uidA_.
It is expected that the correlations between _clb_ genes are very high (always occur together), while correlations between _clb_ genes and single-copy marker genes are lower (not all _E. coli_ genomes have the _pks_ island).
```{r correlate_rpkm_values_of_mapped_genes}
rpkm_per_gene_per_sample <- readRDS(here("data", "RPKM_per_gene_per_sample.rds"))
rpkm_per_gene_df <- pivot_wider(data = rpkm_per_gene_per_sample,
id_cols = "Sample_ID",
names_from = "Gene",
values_from = "RPKM")
rpkm_per_gene_matrix <- as.matrix(rpkm_per_gene_df %>% select(-Sample_ID))
rownames(rpkm_per_gene_matrix) <- rpkm_per_gene_df$Sample_ID
gene_rpkm_correlation_matrix <- rcorr(rpkm_per_gene_matrix, type = "spearman")
corrplot(
corr = gene_rpkm_correlation_matrix$r,
p.mat = gene_rpkm_correlation_matrix$P,
sig.level = 0.05,
insig = "blank",
method = "color",
type = "lower",
tl.col = "black",
order = "hclust",
col = viridis(20),
addCoef.col = "black",
number.cex = .9,
tl.cex = 1.2,
diag = FALSE
)
```
**Figure 1.** Spearman correlations between _pks_ island and putative _E. coli_ single-copy marker genes. Background colour corresponds to Spearman correlation (R-value; see legend in bottom); cells with no background colour indicate non-significant Spearman correlations (P>=.05). Spearman correlations are significant and often high between _clb_ genes (R between .54 and .99) indicating strong correlations between most _clb_ genes. Only _clbS_ often has lower correlations. _Clb_ genes are less or not correlated with marker genes and flanking regions, indicating that the one often occurs without the other.
```{r correlate_rpkm_values_of_mapped_genes-pvalues}
corrplot(corr = gene_rpkm_correlation_matrix$r,
p.mat = gene_rpkm_correlation_matrix$P,
sig.level = -1,
insig = "p-value",
pch = 6,
pch.cex = .9,
pch.col = "white",
method = "color",
type = "lower",
cl.align.text = "c",
tl.col = "black",
order = "hclust",
col = viridis(20),
tl.cex = 1.2,
diag = FALSE
)
```
**Figure 2.** P-values of Spearman correlations between _pks_ island genes and putative _E. coli_ marker genes. Background colours correspond to Spearman correlation values (R-value). Numbers in cells indicate P-values, rounded to two decimals. Spearman correlations between _clb_ genes are highly significant (P<.01), while correlations between _clb_ genes and other genes are mostly less strong and less significant (P<=.99).
## Alternative visualisation, focusing on _clb_ genes only
```{r plot_clbB_to_clb_genes-normalised, fig.height=12, fig.width=16}
#Extract the names of all clb genes
clb_genes <- sort(unique(
rpkm_per_gene_per_sample$Gene[rpkm_per_gene_per_sample$Gene %>% grepl(pattern = "^clb")]
))
clb_plot_list <- list()
plot_index <- 1
for (clb_gene in clb_genes) {
clb_plot <- ggscatter(rpkm_per_gene_df,
x = "clbB", y = clb_gene, color = "Sample_ID",
palette = viridis(nrow(rpkm_per_gene_df))) +
stat_cor(method = "spearman",
mapping = aes(size = 16)) +
theme(legend.position = "none",
axis.text = element_text(size = 16),
text = element_text(size = 16),
axis.title = element_text(face = "italic"))
clb_plot_list[[plot_index]] <- clb_plot
plot_index <- plot_index + 1
}
plot_grid(plotlist = clb_plot_list,
labels = "AUTO")
```
**Figure 3.** RPKM values of each _clb_ gene plotted against _clbB_, with Spearman correlations. Correlations between most _clb_ genes and _clbB_ are very high (R>.9) and highly significant (P<.0001). Only _clbS_ has a markedly lower correlation (R=.57, P=.00055).
---
# Conclusion
From both the visualisations it is clear that the correlations between most _clb_ genes are very strong, indicating that they always co-occur, which in turn is a sign of the _pks_ island always being a complete genomic element with all 19 coding genes.
This result is also what was observed before in the assembly results: _clb_ genes in patients with high pks loads assemble together in scaffolds - in four patients we even assembled the complete island, including flanking regions, into a single scaffold. (This seemed mostly depth dependent: the more reads were generated for the pks island, the better the island could be assembled.)
Therefore, we consider the _pks_ island as one genomic unit and base detection and quantification on each and every _clb_ gene. That is, we count the _pks_ island as detected if any of the _clb_ genes has one or more reads mapped to it, and we quantify the _pks_ island by taking the average RPKM value among all 19 _clb_ genes.
---
The dataframe (table) used to generate the figures above is provided below for your convenience.
(Note that 'NA' values are shown as empty cells.)
```{r share_dataframe}
create_dt(rpkm_per_gene_df)
```
---
# Software used
```{r session_info}
sessionInfo()
```
\ No newline at end of file
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
---
title: "Evaluate temporal dynamics of _pks_ in a healthy faeces donor"
author: "Sam Nooij"
date: "14 January 2021"
output:
html_document:
toc: yes
toc_depth: 3
code_folding: hide
---
```{r setup, include=FALSE}
library(knitr)
library(rmarkdown)
library(here)
library(codetools)
library(tidyverse)
library(DT)
library(ggpubr)
library(cowplot)
library(viridisLite)
knitr::opts_chunk$set(fig.height = 8, fig.width = 12, fig.path = here("figures", ""),
fig.ext = c("png", "pdf"), dev = c("png", "pdf"), dpi = c(72, 300),
echo = TRUE,
warning = FALSE,
message = FALSE,
cache = TRUE)
create_dt <- function(x){
DT::datatable(x,
extensions = 'Buttons',
options = list(dom = 'Blfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
lengthMenu = list(c(10,25,50,-1),
c(10,25,50,"All"))),
filter = 'top',
rownames = FALSE)
}
#Create nice looking downloadable tables,
# function adapted from Martin Chan, 2019:
# https://martinctc.github.io/blog/vignette-downloadable-tables-in-rmarkdown-with-the-dt-package/
# with column filter thanks to Yihui Xie, 2015:
# https://blog.rstudio.com/2015/06/24/dt-an-r-interface-to-the-datatables-library/
```
# Evaluate changes in _pks_ over time in a healthy donor that donated multiple _pks_+ samples
In [step three](3_Quantify_pks.html) of the analysis it is shown that donor D5 provided multiple samples with _pks_.
Besides the prevalence of _pks_, it is interesting to look into longitudinal data to investigate how long healthy people may be colonised with _pks_+ _E. coli_.
From this information it may be possible to estimate the colonisation duration, which in turn may be used to inform screening protocols.
And the colonisation duration may also be linked to (future) colorectal carcinogenesis studies to try and assess whether healthy people are exposed to _pks_ long enough to be at risk of developing colorectal cancer.
In this fourth analysis step, all _pks_ results from all available metagenomes for donor D5 are imported, including from samples not used for FMT - so data that is not included in any other step.
These data are then compared to the abundance of _E. coli_ in each metagenome.
As _E. coli_ abundances may be extremely low, abundances of Enterobacteriaceae are also included.
The rationale being twofold: 1) in metagenomes with very low abundances of _E. coli_, only genomic regions may be retrieved that are not specific to _E. coli_ and are therefore classified to Enterobacteriaceae as Lowest Common Ancestor. And 2) besides _E. coli_, it has been shown that other species of the Enterobacteriaceae family may also carry the _pks_ island ([Putze _et al_., 2009](https://doi.org/10.1128/IAI.00522-09). The latter option is less likely, because the non-_E. coli_ species reported with the _pks_ island are all pathogenic species, for which the donors faeces have been screened.
The abundances of _E. coli_ and Enterobacteriaceae shown here are based on results from [Jovian](https://www.github.com/DennisSchmitz/Jovian).
## Import _pks_ data for all D5 metagenomes
```{r import_all_D5_pks_data}
## I need a number of variables for this analysis:
# 1. the mapped read counts
donor_read_counts <- read.delim(here("..", "data", "processed", "donor", "Deduplicated_read_counts.tsv"))
extra_donor_read_counts <- read.delim(here("..", "data", "processed", "extra_donors", "Deduplicated_read_counts.tsv"))
donor_mapped_reads <- rbind(donor_read_counts,
extra_donor_read_counts)
# 2. the total number of mapped reads per metagenome
donor_total_reads <- read.delim(here("..", "data", "processed", "donor", "Filtered_read_counts.csv"))
extra_donor_total_reads <- read.delim(here("..", "data", "processed", "extra_donors", "Filtered_read_counts-summed_up.csv"))
donor_total_reads <- donor_total_reads %>% filter(set == "total")
extra_donor_total_reads <- extra_donor_total_reads %>% filter(set == "total")
donor_total <- rbind(donor_total_reads,
extra_donor_total_reads)
#Remove the '_pR1' suffix from sample names to facilitate merging
donor_total$Sample <- donor_total$Sample %>% gsub(pattern = "_pR1", replacement = "")
# 3. the sampling date of each D5 metagenome
subject_metadata <- read.csv(here("..", "..", "..", "data", "FMT_metadata.csv"))
subject_metadata$Sample_long <- gsub(pattern = "\\.", replacement = "-", x = subject_metadata$Sample_long)
d5_metadata <- subject_metadata %>%
filter(alias_ID == "D5")
# (D5 has 15 metagenomes available, as the dataframe 'd5_metadata' has 15 rows)
# 4. the gene names to distinguish colibactin-coding (clb) genes
clb_metadata <- read.delim(here("..", "data", "references", "Colibactin-AM229678_info.tsv"))
colnames(clb_metadata)[colnames(clb_metadata) == "Accession"] <- "Accession_ID"
#The gene names in clb_metadata have spaces in the end: remove those before further processing:
clb_metadata$Gene <- trimws(clb_metadata$Gene, which = "right")
## Now merge these all together and extract the rows and columns of interest
mapped_reads_and_total_reads <- merge(x = donor_total,
y = donor_mapped_reads,
by.x = "Sample",
by.y = "Sample_name")
reads_of_d5 <- merge(x = mapped_reads_and_total_reads,
y = d5_metadata,
by.x = "Sample",
by.y = "Sample_long")
#length(unique(reads_of_d5$reads))
#There are 15 metagenomes of D5
#length(unique(reads_of_d5$Sampling_date))
#from 13 different dates
d5_clb_reads <- merge(x = reads_of_d5,
y = clb_metadata,
by.x = "Reference",
by.y = "ID_long") %>%
filter(str_detect(Gene, "^clb"))
#length(unique(d5_clb_reads$reads))
#13/15 metagenomes have clb genes
#length(unique(d5_clb_reads$Sampling_date))
#from 12 different dates (so there is one sampling date with multiple clb-containing metagenomes)
#Now to calculate the average pks, we need RPKM values per gene and zeroes for genes
# that were not detected.
d5_clb_reads$RPKM <- with(d5_clb_reads,
mapped_reads / Length * 1000 / reads * 1e6)
clb_genes <- sort(unique(clb_metadata$Gene))[1:19]
sample_IDs <- sort(unique(d5_clb_reads$Sample))
all_clb_genes_per_sample <- merge(x = as.data.frame(sample_IDs),
y = as.data.frame(clb_genes))
all_clb_in_d5_clb_reads <- merge(x = d5_clb_reads,
y = all_clb_genes_per_sample,
by.x = c("Sample", "Gene"),
by.y = c("sample_IDs", "clb_genes"),
all = TRUE) %>%
mutate_at(23, ~replace(., is.na(.), 0))
average_pks_per_sample <- aggregate(
all_clb_in_d5_clb_reads$RPKM,
by=list(Sample=all_clb_in_d5_clb_reads$Sample),
FUN=mean)
colnames(average_pks_per_sample) <- c("Sample_ID", "average_pks_rpkm")
#Add back the relevant metadata
pks_per_sample <- right_join(x = average_pks_per_sample,
y = d5_metadata,
by = c("Sample_ID" = "Sample_long")) %>%
mutate_at(c(2), ~replace(., is.na(.), 0)) #add zero values for pks-negative samples
#Remove unnecessary data from the dataframe
pks_per_sample <- pks_per_sample[with(pks_per_sample,
order(Sampling_date, Sample_ID)),]
pks_per_sample$Sampling_date <- as.Date(pks_per_sample$Sampling_date)
# Convert dates to time as days since first sampling
pks_per_sample$Time <- with(pks_per_sample,
Sampling_date - min(Sampling_date))
pks_per_sample <- pks_per_sample %>%