Metadata Models

Author

Philipp Sven Lars Schäfer

Published

November 28, 2024

1 Packages

Code

suppressPackageStartupMessages({
  library(tidyverse)
  library(flextable)
  library(ggdark)
  library(magick)
  source(file.path("..", "src", "read_data.R"))
  source(file.path("..", "src", "generate_targets.R"))
  source(file.path("..", "src", "model.R"))
})
knitr::opts_knit$set(output.dir = "./")

2 Data

Code

input_dir = file.path("..", "data")

Code

meta_data <- read_harmonized_meta_data(input_dir)
gene_meta <- read_gene_meta(input_dir)

experimental_data <- read_raw_experimental_data(input_dir)
experimental_data <- filter_experimental_data(meta_data, experimental_data, gene_meta)

pbmc_cell_frequency | Removed 550 specimens because missing in meta data

plasma_ab_titer | Removed 6931 specimens because missing in meta data

plasma_cytokine_concentration_by_olink | Removed 495 specimens because missing in meta data

pbmc_cell_frequency | Removed 56 features because not in feature subset

plasma_ab_titer | Removed 48 features because not in feature subset

plasma_cytokine_concentration_by_olink | Removed 234 features because not in feature subset

t_cell_polarization | Removed 3 features because not in feature subset

plasma_cytokine_concentration_by_olink | Removed 300 features because qc warning

plasma_ab_titer | Removed 10540 measurements because wrong unit used

plasma_cytokine_concentration_by_olink | Removed 2400 measurements because wrong unit used

plasma_cytokine_concentration_by_legendplex | Removed 8 because specimen is outlier

plasma_cytokine_concentration_by_olink | Removed specimen 750, 760, 824, 833, 894, 903 because fraction of measurements below LOQ > 50%

plasma_ab_titer | Removed specimen 674, 675, 676 because fraction of measurements below LOD > 50%

Code

meta_data <- filter_meta_data(meta_data, experimental_data)

celltype_meta <- read_celltype_meta(input_dir)
gene_meta <- read_gene_meta(input_dir)
protein_meta <- read_protein_meta(input_dir)

Code

target_list <- generate_all_targets(
  meta_data=meta_data,
  experimental_data=experimental_data,
  experimental_data_settings=experimental_data_settings,
  gene_meta=gene_meta,
  protein_meta=protein_meta)
str(target_list)

List of 7
 $ task_11: tibble [52 x 3] (S3: tbl_df/tbl/data.frame)
  ..$ subject_id: num [1:52] 61 62 63 64 65 66 67 68 69 70 ...
  ..$ baseline  : num [1:52] 112.8 125 278.7 353.2 83.8 ...
  ..$ target    : num [1:52] 304 1548 1917 558 1838 ...
 $ task_12: tibble [52 x 3] (S3: tbl_df/tbl/data.frame)
  ..$ subject_id: num [1:52] 61 62 63 64 65 66 67 68 69 70 ...
  ..$ baseline  : num [1:52] 112.8 125 278.7 353.2 83.8 ...
  ..$ target    : num [1:52] 0.992 2.517 1.928 0.458 3.088 ...
 $ task_21: tibble [66 x 3] (S3: tbl_df/tbl/data.frame)
  ..$ subject_id: num [1:66] 4 6 15 20 21 26 29 31 33 36 ...
  ..$ baseline  : num [1:66] 5.49 22.79 15.64 21.4 15.23 ...
  ..$ target    : num [1:66] 7.21 41.38 10.59 26.61 34.81 ...
 $ task_22: tibble [66 x 3] (S3: tbl_df/tbl/data.frame)
  ..$ subject_id: num [1:66] 4 6 15 20 21 26 29 31 33 36 ...
  ..$ baseline  : num [1:66] 5.49 22.79 15.64 21.4 15.23 ...
  ..$ target    : num [1:66] 0.273 0.596 -0.39 0.218 0.827 ...
 $ task_31: tibble [83 x 3] (S3: tbl_df/tbl/data.frame)
  ..$ subject_id: num [1:83] 1 3 4 5 6 9 10 13 15 20 ...
  ..$ baseline  : num [1:83] 5.68 5.4 5.04 5.66 5.44 ...
  ..$ target    : num [1:83] 5.91 5.46 4.89 5.23 5.38 ...
 $ task_32: tibble [83 x 3] (S3: tbl_df/tbl/data.frame)
  ..$ subject_id: num [1:83] 1 3 4 5 6 9 10 13 15 20 ...
  ..$ baseline  : num [1:83] 5.68 5.4 5.04 5.66 5.44 ...
  ..$ target    : num [1:83] 0.234 0.0612 -0.1466 -0.4284 -0.0628 ...
 $ task_41: tibble [43 x 3] (S3: tbl_df/tbl/data.frame)
  ..$ subject_id: num [1:43] 61 62 66 67 68 69 70 71 72 73 ...
  ..$ baseline  : num [1:43] -2.933 -1.985 2.286 -0.555 -0.144 ...
  ..$ target    : num [1:43] 2.625 -1.114 2.817 -2.363 -0.466 ...

3 Conclusions

Estimating the predictive accuracy using cross-cohort train/test splits is probably much more trustworthy. Using OOB or LOOCV on the joint dataset probably overestimates the generalizability of the models for new cohorts (i.e. for the 2023 cohort).
As shown in the previous challenges, using baseline values for the prediction of absolute numbers (tasks 1.1, 2.1, 3.1) works already very good. It will not be easy to beat this baseline.
The metadata model for task 1.2 works quite well (when looking at the spearman correlation), whereas task 2.2 and 3.2 seem much more difficult.

4 Questions

In the cross-cohort setting, the predictions from 2022 -> 2021 are sometimes much better than 2021 -> 2022 (or the other way around). How can this be?

5 Results

Code

task_meta <- list(
  task_11 = list(
    name = "task_11",
    header = "## Task 1.1",
    description = "Rank the individuals by IgG antibody levels against pertussis toxin (PT) that we detect in plasma 14 days post booster vaccinations."
  ),
  task_12 = list(
    name = "task_12",
    header = "## Task 1.2",
    description = "Rank the individuals by fold change of IgG antibody levels against pertussis toxin (PT) that we detect in plasma 14 days post booster vaccinations compared to titer values at day 0."
  ),
  task_21 = list(
    name = "task_21",
    header = "## Task 2.1",
    description = "Rank the individuals by predicted frequency of Monocytes on day 1 post boost after vaccination."
  ),
  task_22 = list(
    name = "task_22",
    header = "## Task 2.2",
    description = "Rank the individuals by fold change of predicted frequency of Monocytes on day 1 post booster vaccination compared to cell frequency values at day 0."
  ),
  task_31 = list(
    name = "task_31",
    header = "## Task 3.1",
    description = "Rank the individuals by predicted gene expression of CCL3 on day 3 post-booster vaccination."
  ),
  task_32 = list(
    name = "task_32",
    header = "## Task 3.2",
    description = "Rank the individuals by fold change of predicted gene expression of CCL3 on day 3 post booster vaccination compared to gene expression values at day 0."
  ),
  task_41 = list(
    name = "task_41",
    header = "## Task 4.1",
    description = "Rank the individuals based on their Th1/Th2 (IFN-γ/IL-5) polarization ratio on Day 30 post-booster vaccination."
  )
)

Code

RENDER <- TRUE

make_flextable <- function(x) {
  if (RENDER) {
    x %>%
      flextable() %>% 
      bg(., bg = "#333333", part = "all") %>%
      color(., color = "white", part = "all") %>%
      set_table_properties(., align = "left") %>%
      flextable_to_rmd(ft) %>%
      return()
  } else {
    return(x)
  }
}

for (task in task_meta) {
  #task <- task_meta[[1]]
  
  cat(task$header)
  cat("\n\n")
  cat(task$description)
  cat("\n\n")
  
  meta_data_covariates <- get_metadata_covariates(meta_data)

  model_df <- target_list[[task$name]] %>%
    dplyr::left_join(meta_data_covariates, by="subject_id")

  set.seed(42)
  
  get_oob_perf(model_df=model_df) %>% 
    dplyr::mutate(mse = round(mse, 2), r2 = round(r2, 2), srho = round(srho, 2)) %>%
    make_flextable(.)
  
  get_loocv_perf(model_df=model_df) %>% 
    dplyr::mutate(mse = round(mse, 2), r2 = round(r2, 2), srho = round(srho, 2)) %>%
    make_flextable(.)
    
  # get_cross_cohort_perf_combinations(model_df=model_df, meta_data=meta_data) %>%
  #   dplyr::mutate(mse = round(mse, 2), r2 = round(r2, 2)) %>%
  # make_flextable(.)
    
  # get_cross_cohort_perf_single(model_df=model_df, meta_data=meta_data) %>%
  #   dplyr::mutate(mse = round(mse, 2), r2 = round(r2, 2),
  #                 srho = round(srho, 2), srho_baseline = round(srho_baseline, 2),
  #                 mse_tmean = round(mse_tmean, 2)) %>%
  #   make_flextable(.)
  
  get_cross_cohort_perf_single_repeated(model_df=model_df, meta_data=meta_data) %>%
    dplyr::mutate(srho_mean = round(srho_mean, 2), srho_sd = round(srho_sd, 2),
                  srho_baseline = round(srho_baseline, 2)) %>%
    make_flextable(.)
  
  cat("\n\n")
}

5.1 Task 1.1

Rank the individuals by IgG antibody levels against pertussis toxin (PT) that we detect in plasma 14 days post booster vaccinations.

mode	mse	r2	srho
oob	21,930,271	0.81	0.68

mode	mse	r2	srho
loocv	20,912,097	0.82	0.74

trainset	testset	srho_mean	srho_sd	srho_baseline	train_n	test_n
2021	2022	-0.31	0.01	0.44	32	20
2022	2021	-0.34	0.04	0.60	20	32

5.2 Task 1.2

Rank the individuals by fold change of IgG antibody levels against pertussis toxin (PT) that we detect in plasma 14 days post booster vaccinations compared to titer values at day 0.

mode	mse	r2	srho
oob	0.38	0.61	0.81

mode	mse	r2	srho
loocv	0.39	0.6	0.81

trainset	testset	srho_mean	srho_sd	srho_baseline	train_n	test_n
2021	2022	0.36	0.03	-0.89	32	20
2022	2021	0.04	0.02	-0.71	20	32

5.3 Task 2.1

Rank the individuals by predicted frequency of Monocytes on day 1 post boost after vaccination.

mode	mse	r2	srho
oob	64.45	0.34	0.59

mode	mse	r2	srho
loocv	63.73	0.34	0.59

trainset	testset	srho_mean	srho_sd	srho_baseline	train_n	test_n
2020	2021	0.53	0.04	0.88	12	33
2020	2022	0.13	0.06	0.55	12	21
2021	2020	0.72	0.04	0.81	33	12
2021	2022	0.49	0.02	0.55	33	21
2022	2020	0.20	0.04	0.81	21	12
2022	2021	0.60	0.01	0.88	21	33

5.4 Task 2.2

Rank the individuals by fold change of predicted frequency of Monocytes on day 1 post booster vaccination compared to cell frequency values at day 0.

mode	mse	r2	srho
oob	0.11	-0.04	-0.04

mode	mse	r2	srho
loocv	0.11	-0.03	0

trainset	testset	srho_mean	srho_sd	srho_baseline	train_n	test_n
2020	2021	-0.11	0.03	-0.22	12	33
2020	2022	-0.31	0.02	-0.21	12	21
2021	2020	0.52	0.03	-0.43	33	12
2021	2022	-0.31	0.04	-0.21	33	21
2022	2020	-0.44	0.04	-0.43	21	12
2022	2021	0.00	0.03	-0.22	21	33

5.5 Task 3.1

Rank the individuals by predicted gene expression of CCL3 on day 3 post-booster vaccination.

mode	mse	r2	srho
oob	1	0.18	0.46

mode	mse	r2	srho
loocv	1	0.18	0.46

trainset	testset	srho_mean	srho_sd	srho_baseline	train_n	test_n
2020	2021	0.35	0.02	0.57	26	36
2020	2022	-0.06	0.04	0.53	26	21
2021	2020	0.01	0.03	0.27	36	26
2021	2022	0.38	0.05	0.53	36	21
2022	2020	-0.01	0.02	0.27	21	26
2022	2021	0.40	0.02	0.57	21	36

5.6 Task 3.2

Rank the individuals by fold change of predicted gene expression of CCL3 on day 3 post booster vaccination compared to gene expression values at day 0.

mode	mse	r2	srho
oob	1.14	0.01	0.1

mode	mse	r2	srho
loocv	1.15	0	0.08

trainset	testset	srho_mean	srho_sd	srho_baseline	train_n	test_n
2020	2021	0.21	0.01	-0.43	26	36
2020	2022	-0.26	0.03	-0.19	26	21
2021	2020	-0.16	0.02	-0.33	36	26
2021	2022	-0.08	0.05	-0.19	36	21
2022	2020	-0.35	0.02	-0.33	21	26
2022	2021	0.16	0.05	-0.43	21	36

5.7 Task 4.1

Rank the individuals based on their Th1/Th2 (IFN-<U+03B3>/IL-5) polarization ratio on Day 30 post-booster vaccination.

mode	mse	r2	srho
oob	4.52	0.03	0.21

mode	mse	r2	srho
loocv	4.53	0.03	0.18

trainset	testset	srho_mean	srho_sd	srho_baseline	train_n	test_n
2021	2022	0.47	0.05	0.74	27	16
2022	2021	0.53	0.02	0.55	16	27