Estimating predictive accuracy with cross-cohort train/test splits is probably much more trustworthy than out-of-bag (OOB) or leave-one-out cross-validation (LOOCV) estimates on the joint dataset, which probably overestimate how well the models generalize to new cohorts (i.e. to the 2023 cohort).
As shown in the previous challenges, using baseline values to predict absolute values (tasks 1.1, 2.1, 3.1) already works very well. It will not be easy to beat this baseline.
The metadata model for task 1.2 works quite well (judging by the Spearman correlation), whereas tasks 2.2 and 3.2 seem much more difficult.
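The contrast between a metadata model and the baseline in a cross-cohort split can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated, `cross_cohort_srho` is a hypothetical helper (it is not one of the functions in `src/model.R`), and a linear model on a single covariate stands in for whatever model the report actually fits.

```r
set.seed(1)

# Simulated stand-in data (NOT the challenge data): two cohorts with a
# cohort-specific shift in the target, one metadata covariate (age) and
# a baseline measurement.
n <- 60
df <- data.frame(
  cohort   = rep(c("2021", "2022"), each = n / 2),
  age      = runif(n, 20, 50),
  baseline = rnorm(n)
)
df$target <- 0.8 * df$baseline + 0.02 * df$age +
  ifelse(df$cohort == "2022", 1, 0) + rnorm(n, sd = 0.5)

# Hypothetical helper: train on one cohort, test on the other, and compare
# the model's Spearman correlation with the one obtained by ranking the
# test subjects on their baseline value alone.
cross_cohort_srho <- function(df, train_cohort, test_cohort) {
  train <- df[df$cohort == train_cohort, ]
  test  <- df[df$cohort == test_cohort, ]
  fit   <- lm(target ~ age, data = train)  # stand-in for the metadata model
  data.frame(
    trainset      = train_cohort,
    testset       = test_cohort,
    srho          = cor(test$target, predict(fit, newdata = test),
                        method = "spearman"),
    srho_baseline = cor(test$target, test$baseline, method = "spearman"),
    train_n       = nrow(train),
    test_n        = nrow(test)
  )
}

results <- rbind(
  cross_cohort_srho(df, "2021", "2022"),
  cross_cohort_srho(df, "2022", "2021")
)
print(results)
```

Because the test cohort is never seen during training, a cohort-specific shift cannot leak into the estimate, which is the reason such splits are expected to be less optimistic than OOB or LOOCV on the pooled data.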
4 Questions
In the cross-cohort setting, predictions for 2022 -> 2021 (train -> test) are sometimes much better than for 2021 -> 2022, or the other way around. How can this be?
5 Results
Code
task_meta <- list(
  task_11 = list(
    name = "task_11",
    header = "## Task 1.1",
    description = "Rank the individuals by IgG antibody levels against pertussis toxin (PT) that we detect in plasma 14 days post booster vaccinations."
  ),
  task_12 = list(
    name = "task_12",
    header = "## Task 1.2",
    description = "Rank the individuals by fold change of IgG antibody levels against pertussis toxin (PT) that we detect in plasma 14 days post booster vaccinations compared to titer values at day 0."
  ),
  task_21 = list(
    name = "task_21",
    header = "## Task 2.1",
    description = "Rank the individuals by predicted frequency of Monocytes on day 1 post boost after vaccination."
  ),
  task_22 = list(
    name = "task_22",
    header = "## Task 2.2",
    description = "Rank the individuals by fold change of predicted frequency of Monocytes on day 1 post booster vaccination compared to cell frequency values at day 0."
  ),
  task_31 = list(
    name = "task_31",
    header = "## Task 3.1",
    description = "Rank the individuals by predicted gene expression of CCL3 on day 3 post-booster vaccination."
  ),
  task_32 = list(
    name = "task_32",
    header = "## Task 3.2",
    description = "Rank the individuals by fold change of predicted gene expression of CCL3 on day 3 post booster vaccination compared to gene expression values at day 0."
  ),
  task_41 = list(
    name = "task_41",
    header = "## Task 4.1",
    description = "Rank the individuals based on their Th1/Th2 (IFN-γ/IL-5) polarization ratio on Day 30 post-booster vaccination."
  )
)
5.1 Task 1.1
Rank the individuals by IgG antibody levels against pertussis toxin (PT) that we detect in plasma 14 days post booster vaccinations.
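| mode | mse | r2 | srho |
|---|---|---|---|
| oob | 21,930,271 | 0.81 | 0.68 |
| loocv | 20,912,097 | 0.82 | 0.74 |

| trainset | testset | srho_mean | srho_sd | srho_baseline | train_n | test_n |
|---|---|---|---|---|---|---|
| 2021 | 2022 | -0.31 | 0.01 | 0.44 | 32 | 20 |
| 2022 | 2021 | -0.34 | 0.04 | 0.60 | 20 | 32 |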
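For readers unfamiliar with the column names, the three metrics in the first table can be computed as in the minimal sketch below. `perf_metrics` is a hypothetical helper, not the report's own implementation, and the R² definition used here (1 - SSE/SST) is an assumption; it is at least consistent with the negative `r2` values reported for some tasks.

```r
# Hypothetical helper (not from src/model.R): computes mse, r2 and srho
# from observed and predicted values.
perf_metrics <- function(observed, predicted) {
  mse <- mean((observed - predicted)^2)
  # R^2 as 1 - SSE/SST (assumed definition); it becomes negative when the
  # predictions are worse than always predicting the mean.
  r2 <- 1 - sum((observed - predicted)^2) /
    sum((observed - mean(observed))^2)
  # Spearman rank correlation, the relevant metric for a ranking task.
  srho <- cor(observed, predicted, method = "spearman")
  data.frame(mse = mse, r2 = r2, srho = srho)
}

# Toy values for illustration only.
observed  <- c(1, 2, 3, 4, 5)
predicted <- c(1.5, 1.0, 3.5, 4.5, 4.0)
res <- perf_metrics(observed, predicted)
print(round(res, 3))  # mse = 0.55, r2 = 0.725, srho = 0.8
```

Since the tasks ask for a ranking, `srho` is the column to focus on; `mse` depends on the scale of the target, which explains the very large values for the untransformed antibody levels.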
5.2 Task 1.2
Rank the individuals by fold change of IgG antibody levels against pertussis toxin (PT) that we detect in plasma 14 days post booster vaccinations compared to titer values at day 0.
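| mode | mse | r2 | srho |
|---|---|---|---|
| oob | 0.38 | 0.61 | 0.81 |
| loocv | 0.39 | 0.6 | 0.81 |

| trainset | testset | srho_mean | srho_sd | srho_baseline | train_n | test_n |
|---|---|---|---|---|---|---|
| 2021 | 2022 | 0.36 | 0.03 | -0.89 | 32 | 20 |
| 2022 | 2021 | 0.04 | 0.02 | -0.71 | 20 | 32 |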
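The negative `srho_baseline` values for the fold-change task have a simple arithmetic explanation, sketched below on simulated values. This is an interpretation, not something the report states: it assumes that `srho_baseline` is the rank correlation between the day-0 value and the target, and that the fold change is computed on a log scale.

```r
set.seed(1)

# Simulated stand-in values (NOT the challenge data): day-0 and day-14
# titers on a log2 scale, where the post-boost level only partly tracks
# the baseline level.
n <- 50
log2_day0  <- rnorm(n, mean = 3, sd = 1)
log2_day14 <- 6 + 0.5 * log2_day0 + rnorm(n, sd = 0.5)

# Absolute target (as in task 1.1) and fold-change target (as in task 1.2).
target_abs <- log2_day14
target_fc  <- log2_day14 - log2_day0  # log2 fold change relative to day 0

# Rank correlation of each target with the baseline value.
srho_abs <- cor(target_abs, log2_day0, method = "spearman")
srho_fc  <- cor(target_fc,  log2_day0, method = "spearman")
print(c(srho_abs = srho_abs, srho_fc = srho_fc))
```

Subtracting the baseline from the post-boost value puts the baseline into the target with a negative sign, so subjects with a high day-0 value tend to have a small fold change. Under these assumptions, ranking by the negated baseline would be the natural baseline predictor for the fold-change tasks.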
5.3 Task 2.1
Rank the individuals by predicted frequency of Monocytes on day 1 post boost after vaccination.
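| mode | mse | r2 | srho |
|---|---|---|---|
| oob | 64.45 | 0.34 | 0.59 |
| loocv | 63.73 | 0.34 | 0.59 |

| trainset | testset | srho_mean | srho_sd | srho_baseline | train_n | test_n |
|---|---|---|---|---|---|---|
| 2020 | 2021 | 0.53 | 0.04 | 0.88 | 12 | 33 |
| 2020 | 2022 | 0.13 | 0.06 | 0.55 | 12 | 21 |
| 2021 | 2020 | 0.72 | 0.04 | 0.81 | 33 | 12 |
| 2021 | 2022 | 0.49 | 0.02 | 0.55 | 33 | 21 |
| 2022 | 2020 | 0.20 | 0.04 | 0.81 | 21 | 12 |
| 2022 | 2021 | 0.60 | 0.01 | 0.88 | 21 | 33 |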
5.4 Task 2.2
Rank the individuals by fold change of predicted frequency of Monocytes on day 1 post booster vaccination compared to cell frequency values at day 0.
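| mode | mse | r2 | srho |
|---|---|---|---|
| oob | 0.11 | -0.04 | -0.04 |
| loocv | 0.11 | -0.03 | 0 |

| trainset | testset | srho_mean | srho_sd | srho_baseline | train_n | test_n |
|---|---|---|---|---|---|---|
| 2020 | 2021 | -0.11 | 0.03 | -0.22 | 12 | 33 |
| 2020 | 2022 | -0.31 | 0.02 | -0.21 | 12 | 21 |
| 2021 | 2020 | 0.52 | 0.03 | -0.43 | 33 | 12 |
| 2021 | 2022 | -0.31 | 0.04 | -0.21 | 33 | 21 |
| 2022 | 2020 | -0.44 | 0.04 | -0.43 | 21 | 12 |
| 2022 | 2021 | 0.00 | 0.03 | -0.22 | 21 | 33 |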
5.5 Task 3.1
Rank the individuals by predicted gene expression of CCL3 on day 3 post-booster vaccination.
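| mode | mse | r2 | srho |
|---|---|---|---|
| oob | 1 | 0.18 | 0.46 |
| loocv | 1 | 0.18 | 0.46 |

| trainset | testset | srho_mean | srho_sd | srho_baseline | train_n | test_n |
|---|---|---|---|---|---|---|
| 2020 | 2021 | 0.35 | 0.02 | 0.57 | 26 | 36 |
| 2020 | 2022 | -0.06 | 0.04 | 0.53 | 26 | 21 |
| 2021 | 2020 | 0.01 | 0.03 | 0.27 | 36 | 26 |
| 2021 | 2022 | 0.38 | 0.05 | 0.53 | 36 | 21 |
| 2022 | 2020 | -0.01 | 0.02 | 0.27 | 21 | 26 |
| 2022 | 2021 | 0.40 | 0.02 | 0.57 | 21 | 36 |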
5.6 Task 3.2
Rank the individuals by fold change of predicted gene expression of CCL3 on day 3 post booster vaccination compared to gene expression values at day 0.
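| mode | mse | r2 | srho |
|---|---|---|---|
| oob | 1.14 | 0.01 | 0.1 |
| loocv | 1.15 | 0 | 0.08 |

| trainset | testset | srho_mean | srho_sd | srho_baseline | train_n | test_n |
|---|---|---|---|---|---|---|
| 2020 | 2021 | 0.21 | 0.01 | -0.43 | 26 | 36 |
| 2020 | 2022 | -0.26 | 0.03 | -0.19 | 26 | 21 |
| 2021 | 2020 | -0.16 | 0.02 | -0.33 | 36 | 26 |
| 2021 | 2022 | -0.08 | 0.05 | -0.19 | 36 | 21 |
| 2022 | 2020 | -0.35 | 0.02 | -0.33 | 21 | 26 |
| 2022 | 2021 | 0.16 | 0.05 | -0.43 | 21 | 36 |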
5.7 Task 4.1
Rank the individuals based on their Th1/Th2 (IFN-γ/IL-5) polarization ratio on Day 30 post-booster vaccination.
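| mode | mse | r2 | srho |
|---|---|---|---|
| oob | 4.52 | 0.03 | 0.21 |
| loocv | 4.53 | 0.03 | 0.18 |

| trainset | testset | srho_mean | srho_sd | srho_baseline | train_n | test_n |
|---|---|---|---|---|---|---|
| 2021 | 2022 | 0.47 | 0.05 | 0.74 | 27 | 16 |
| 2022 | 2021 | 0.53 | 0.02 | 0.55 | 16 | 27 |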
Source Code
---
title: "Metadata Models"
author: "Philipp Sven Lars Schäfer"
date: "`r format(Sys.time(), '%d %B, %Y')`"
editor: source
engine: knitr
---

# Packages

```{r}
suppressPackageStartupMessages({
  library(tidyverse)
  library(flextable)
  library(ggdark)
  library(magick)
  source(file.path("..", "src", "read_data.R"))
  source(file.path("..", "src", "generate_targets.R"))
  source(file.path("..", "src", "model.R"))
})

knitr::opts_knit$set(output.dir = "./")
```

# Data

```{r}
input_dir <- file.path("..", "data")
```

```{r}
meta_data <- read_harmonized_meta_data(input_dir)
gene_meta <- read_gene_meta(input_dir)
experimental_data <- read_raw_experimental_data(input_dir)

# Keep only the experimental data and subjects that pass the filters
experimental_data <- filter_experimental_data(meta_data, experimental_data, gene_meta)
meta_data <- filter_meta_data(meta_data, experimental_data)

celltype_meta <- read_celltype_meta(input_dir)
protein_meta <- read_protein_meta(input_dir)
```

```{r}
# experimental_data_settings is not defined in this document; it is
# presumably provided by one of the scripts sourced above.
target_list <- generate_all_targets(
  meta_data = meta_data,
  experimental_data = experimental_data,
  experimental_data_settings = experimental_data_settings,
  gene_meta = gene_meta,
  protein_meta = protein_meta
)
str(target_list)
```

# Conclusions

- Estimating predictive accuracy with cross-cohort train/test splits is probably much more trustworthy than out-of-bag (OOB) or leave-one-out cross-validation (LOOCV) estimates on the joint dataset, which probably overestimate how well the models generalize to new cohorts (i.e. to the 2023 cohort).
- As shown in the previous challenges, using baseline values to predict absolute values (tasks 1.1, 2.1, 3.1) already works very well. It will not be easy to beat this baseline.
- The metadata model for task 1.2 works quite well (judging by the Spearman correlation), whereas tasks 2.2 and 3.2 seem much more difficult.

# Questions

- In the cross-cohort setting, predictions for 2022 -> 2021 (train -> test) are sometimes much better than for 2021 -> 2022, or the other way around. How can this be?

# Results

```{r}
task_meta <- list(
  task_11 = list(
    name = "task_11",
    header = "## Task 1.1",
    description = "Rank the individuals by IgG antibody levels against pertussis toxin (PT) that we detect in plasma 14 days post booster vaccinations."
  ),
  task_12 = list(
    name = "task_12",
    header = "## Task 1.2",
    description = "Rank the individuals by fold change of IgG antibody levels against pertussis toxin (PT) that we detect in plasma 14 days post booster vaccinations compared to titer values at day 0."
  ),
  task_21 = list(
    name = "task_21",
    header = "## Task 2.1",
    description = "Rank the individuals by predicted frequency of Monocytes on day 1 post boost after vaccination."
  ),
  task_22 = list(
    name = "task_22",
    header = "## Task 2.2",
    description = "Rank the individuals by fold change of predicted frequency of Monocytes on day 1 post booster vaccination compared to cell frequency values at day 0."
  ),
  task_31 = list(
    name = "task_31",
    header = "## Task 3.1",
    description = "Rank the individuals by predicted gene expression of CCL3 on day 3 post-booster vaccination."
  ),
  task_32 = list(
    name = "task_32",
    header = "## Task 3.2",
    description = "Rank the individuals by fold change of predicted gene expression of CCL3 on day 3 post booster vaccination compared to gene expression values at day 0."
  ),
  task_41 = list(
    name = "task_41",
    header = "## Task 4.1",
    description = "Rank the individuals based on their Th1/Th2 (IFN-γ/IL-5) polarization ratio on Day 30 post-booster vaccination."
  )
)
```

```{r results="asis"}
RENDER <- TRUE

# Render a data frame as a dark-themed flextable (or return it unchanged
# when RENDER is FALSE, e.g. for interactive use).
make_flextable <- function(x) {
  if (!RENDER) {
    return(x)
  }
  x %>%
    flextable() %>%
    bg(bg = "#333333", part = "all") %>%
    color(color = "white", part = "all") %>%
    set_table_properties(align = "left") %>%
    flextable_to_rmd()  # the stray, undefined `ft` argument was dropped
}

for (task in task_meta) {
  # task <- task_meta[[1]]
  cat(task$header)
  cat("\n\n")
  cat(task$description)
  cat("\n\n")

  # Join the task-specific target with the metadata covariates
  meta_data_covariates <- get_metadata_covariates(meta_data)
  model_df <- target_list[[task$name]] %>%
    dplyr::left_join(meta_data_covariates, by = "subject_id")

  set.seed(42)

  # Performance estimates on the joint dataset
  get_oob_perf(model_df = model_df) %>%
    dplyr::mutate(mse = round(mse, 2), r2 = round(r2, 2), srho = round(srho, 2)) %>%
    make_flextable(.)

  get_loocv_perf(model_df = model_df) %>%
    dplyr::mutate(mse = round(mse, 2), r2 = round(r2, 2), srho = round(srho, 2)) %>%
    make_flextable(.)

  # get_cross_cohort_perf_combinations(model_df=model_df, meta_data=meta_data) %>%
  #   dplyr::mutate(mse = round(mse, 2), r2 = round(r2, 2)) %>%
  #   make_flextable(.)

  # get_cross_cohort_perf_single(model_df=model_df, meta_data=meta_data) %>%
  #   dplyr::mutate(mse = round(mse, 2), r2 = round(r2, 2),
  #                 srho = round(srho, 2), srho_baseline = round(srho_baseline, 2),
  #                 mse_tmean = round(mse_tmean, 2)) %>%
  #   make_flextable(.)

  # Cross-cohort performance: train on one cohort, test on another
  get_cross_cohort_perf_single_repeated(model_df = model_df, meta_data = meta_data) %>%
    dplyr::mutate(srho_mean = round(srho_mean, 2),
                  srho_sd = round(srho_sd, 2),
                  srho_baseline = round(srho_baseline, 2)) %>%
    make_flextable(.)

  cat("\n\n")
}
```