Author

Philipp Sven Lars Schäfer

Published

November 28, 2024

1 Packages

Code
suppressPackageStartupMessages({
  library(tidyverse)
  library(flextable)
  library(ggdark)
  library(magick)
  source(file.path("..", "src", "read_data.R"))
  source(file.path("..", "src", "generate_targets.R"))
  source(file.path("..", "src", "model.R"))
})
knitr::opts_knit$set(output.dir = "./")

2 Data

Code
input_dir = file.path("..", "data")
Code
meta_data <- read_harmonized_meta_data(input_dir)
gene_meta <- read_gene_meta(input_dir)

experimental_data <- read_raw_experimental_data(input_dir)
experimental_data <- filter_experimental_data(meta_data, experimental_data, gene_meta)
pbmc_cell_frequency | Removed 550 specimens because missing in meta data
plasma_ab_titer | Removed 6931 specimens because missing in meta data
plasma_cytokine_concentration_by_olink | Removed 495 specimens because missing in meta data
pbmc_cell_frequency | Removed 56 features because not in feature subset
plasma_ab_titer | Removed 48 features because not in feature subset
plasma_cytokine_concentration_by_olink | Removed 234 features because not in feature subset
t_cell_polarization | Removed 3 features because not in feature subset
plasma_cytokine_concentration_by_olink | Removed 300 features because qc warning
plasma_ab_titer | Removed 10540 measurements because wrong unit used
plasma_cytokine_concentration_by_olink | Removed 2400 measurements because wrong unit used
plasma_cytokine_concentration_by_legendplex | Removed 8 because specimen is outlier
plasma_cytokine_concentration_by_olink | Removed specimen 750, 760, 824, 833, 894, 903 because fraction of measurements below LOQ > 50%
plasma_ab_titer | Removed specimen 674, 675, 676 because fraction of measurements below LOD > 50%
Code
meta_data <- filter_meta_data(meta_data, experimental_data)

celltype_meta <- read_celltype_meta(input_dir)
gene_meta <- read_gene_meta(input_dir)
protein_meta <- read_protein_meta(input_dir)
Code
target_list <- generate_all_targets(
  meta_data=meta_data,
  experimental_data=experimental_data,
  experimental_data_settings=experimental_data_settings,
  gene_meta=gene_meta,
  protein_meta=protein_meta)
str(target_list)
List of 7
 $ task_11: tibble [52 x 3] (S3: tbl_df/tbl/data.frame)
  ..$ subject_id: num [1:52] 61 62 63 64 65 66 67 68 69 70 ...
  ..$ baseline  : num [1:52] 112.8 125 278.7 353.2 83.8 ...
  ..$ target    : num [1:52] 304 1548 1917 558 1838 ...
 $ task_12: tibble [52 x 3] (S3: tbl_df/tbl/data.frame)
  ..$ subject_id: num [1:52] 61 62 63 64 65 66 67 68 69 70 ...
  ..$ baseline  : num [1:52] 112.8 125 278.7 353.2 83.8 ...
  ..$ target    : num [1:52] 0.992 2.517 1.928 0.458 3.088 ...
 $ task_21: tibble [66 x 3] (S3: tbl_df/tbl/data.frame)
  ..$ subject_id: num [1:66] 4 6 15 20 21 26 29 31 33 36 ...
  ..$ baseline  : num [1:66] 5.49 22.79 15.64 21.4 15.23 ...
  ..$ target    : num [1:66] 7.21 41.38 10.59 26.61 34.81 ...
 $ task_22: tibble [66 x 3] (S3: tbl_df/tbl/data.frame)
  ..$ subject_id: num [1:66] 4 6 15 20 21 26 29 31 33 36 ...
  ..$ baseline  : num [1:66] 5.49 22.79 15.64 21.4 15.23 ...
  ..$ target    : num [1:66] 0.273 0.596 -0.39 0.218 0.827 ...
 $ task_31: tibble [83 x 3] (S3: tbl_df/tbl/data.frame)
  ..$ subject_id: num [1:83] 1 3 4 5 6 9 10 13 15 20 ...
  ..$ baseline  : num [1:83] 5.68 5.4 5.04 5.66 5.44 ...
  ..$ target    : num [1:83] 5.91 5.46 4.89 5.23 5.38 ...
 $ task_32: tibble [83 x 3] (S3: tbl_df/tbl/data.frame)
  ..$ subject_id: num [1:83] 1 3 4 5 6 9 10 13 15 20 ...
  ..$ baseline  : num [1:83] 5.68 5.4 5.04 5.66 5.44 ...
  ..$ target    : num [1:83] 0.234 0.0612 -0.1466 -0.4284 -0.0628 ...
 $ task_41: tibble [43 x 3] (S3: tbl_df/tbl/data.frame)
  ..$ subject_id: num [1:43] 61 62 66 67 68 69 70 71 72 73 ...
  ..$ baseline  : num [1:43] -2.933 -1.985 2.286 -0.555 -0.144 ...
  ..$ target    : num [1:43] 2.625 -1.114 2.817 -2.363 -0.466 ...

3 Conclusions

  • Estimating the predictive accuracy using cross-cohort train/test splits is probably much more trustworthy. Using OOB or LOOCV on the joint dataset probably overestimates the generalizability of the models for new cohorts (i.e. for the 2023 cohort).

  • As shown in the previous challenges, using baseline values for the prediction of absolute numbers (tasks 1.1, 2.1, 3.1) works already very good. It will not be easy to beat this baseline.

  • The metadata model for task 1.2 works quite well (when looking at the spearman correlation), whereas task 2.2 and 3.2 seem much more difficult.

4 Questions

  • In the cross-cohort setting, the predictions from 2022 -> 2021 are sometimes much better than 2021 -> 2022 (or the other way around). How can this be?

5 Results

Code
task_meta <- list(
  task_11 = list(
    name = "task_11",
    header = "## Task 1.1",
    description = "Rank the individuals by IgG antibody levels against pertussis toxin (PT) that we detect in plasma 14 days post booster vaccinations."
  ),
  task_12 = list(
    name = "task_12",
    header = "## Task 1.2",
    description = "Rank the individuals by fold change of IgG antibody levels against pertussis toxin (PT) that we detect in plasma 14 days post booster vaccinations compared to titer values at day 0."
  ),
  task_21 = list(
    name = "task_21",
    header = "## Task 2.1",
    description = "Rank the individuals by predicted frequency of Monocytes on day 1 post boost after vaccination."
  ),
  task_22 = list(
    name = "task_22",
    header = "## Task 2.2",
    description = "Rank the individuals by fold change of predicted frequency of Monocytes on day 1 post booster vaccination compared to cell frequency values at day 0."
  ),
  task_31 = list(
    name = "task_31",
    header = "## Task 3.1",
    description = "Rank the individuals by predicted gene expression of CCL3 on day 3 post-booster vaccination."
  ),
  task_32 = list(
    name = "task_32",
    header = "## Task 3.2",
    description = "Rank the individuals by fold change of predicted gene expression of CCL3 on day 3 post booster vaccination compared to gene expression values at day 0."
  ),
  task_41 = list(
    name = "task_41",
    header = "## Task 4.1",
    description = "Rank the individuals based on their Th1/Th2 (IFN-γ/IL-5) polarization ratio on Day 30 post-booster vaccination."
  )
)
Code
RENDER <- TRUE

make_flextable <- function(x) {
  if (RENDER) {
    x %>%
      flextable() %>% 
      bg(., bg = "#333333", part = "all") %>%
      color(., color = "white", part = "all") %>%
      set_table_properties(., align = "left") %>%
      flextable_to_rmd(ft) %>%
      return()
  } else {
    return(x)
  }
}

for (task in task_meta) {
  #task <- task_meta[[1]]
  
  cat(task$header)
  cat("\n\n")
  cat(task$description)
  cat("\n\n")
  
  meta_data_covariates <- get_metadata_covariates(meta_data)

  model_df <- target_list[[task$name]] %>%
    dplyr::left_join(meta_data_covariates, by="subject_id")

  set.seed(42)
  
  get_oob_perf(model_df=model_df) %>% 
    dplyr::mutate(mse = round(mse, 2), r2 = round(r2, 2), srho = round(srho, 2)) %>%
    make_flextable(.)
  
  get_loocv_perf(model_df=model_df) %>% 
    dplyr::mutate(mse = round(mse, 2), r2 = round(r2, 2), srho = round(srho, 2)) %>%
    make_flextable(.)
    
  # get_cross_cohort_perf_combinations(model_df=model_df, meta_data=meta_data) %>%
  #   dplyr::mutate(mse = round(mse, 2), r2 = round(r2, 2)) %>%
  # make_flextable(.)
    
  # get_cross_cohort_perf_single(model_df=model_df, meta_data=meta_data) %>%
  #   dplyr::mutate(mse = round(mse, 2), r2 = round(r2, 2),
  #                 srho = round(srho, 2), srho_baseline = round(srho_baseline, 2),
  #                 mse_tmean = round(mse_tmean, 2)) %>%
  #   make_flextable(.)
  
  get_cross_cohort_perf_single_repeated(model_df=model_df, meta_data=meta_data) %>%
    dplyr::mutate(srho_mean = round(srho_mean, 2), srho_sd = round(srho_sd, 2),
                  srho_baseline = round(srho_baseline, 2)) %>%
    make_flextable(.)
  
  cat("\n\n")
}

5.1 Task 1.1

Rank the individuals by IgG antibody levels against pertussis toxin (PT) that we detect in plasma 14 days post booster vaccinations.

mode

mse

r2

srho

oob

21,930,271

0.81

0.68

mode

mse

r2

srho

loocv

20,912,097

0.82

0.74

trainset

testset

srho_mean

srho_sd

srho_baseline

train_n

test_n

2021

2022

-0.31

0.01

0.44

32

20

2022

2021

-0.34

0.04

0.60

20

32

5.2 Task 1.2

Rank the individuals by fold change of IgG antibody levels against pertussis toxin (PT) that we detect in plasma 14 days post booster vaccinations compared to titer values at day 0.

mode

mse

r2

srho

oob

0.38

0.61

0.81

mode

mse

r2

srho

loocv

0.39

0.6

0.81

trainset

testset

srho_mean

srho_sd

srho_baseline

train_n

test_n

2021

2022

0.36

0.03

-0.89

32

20

2022

2021

0.04

0.02

-0.71

20

32

5.3 Task 2.1

Rank the individuals by predicted frequency of Monocytes on day 1 post boost after vaccination.

mode

mse

r2

srho

oob

64.45

0.34

0.59

mode

mse

r2

srho

loocv

63.73

0.34

0.59

trainset

testset

srho_mean

srho_sd

srho_baseline

train_n

test_n

2020

2021

0.53

0.04

0.88

12

33

2020

2022

0.13

0.06

0.55

12

21

2021

2020

0.72

0.04

0.81

33

12

2021

2022

0.49

0.02

0.55

33

21

2022

2020

0.20

0.04

0.81

21

12

2022

2021

0.60

0.01

0.88

21

33

5.4 Task 2.2

Rank the individuals by fold change of predicted frequency of Monocytes on day 1 post booster vaccination compared to cell frequency values at day 0.

mode

mse

r2

srho

oob

0.11

-0.04

-0.04

mode

mse

r2

srho

loocv

0.11

-0.03

0

trainset

testset

srho_mean

srho_sd

srho_baseline

train_n

test_n

2020

2021

-0.11

0.03

-0.22

12

33

2020

2022

-0.31

0.02

-0.21

12

21

2021

2020

0.52

0.03

-0.43

33

12

2021

2022

-0.31

0.04

-0.21

33

21

2022

2020

-0.44

0.04

-0.43

21

12

2022

2021

0.00

0.03

-0.22

21

33

5.5 Task 3.1

Rank the individuals by predicted gene expression of CCL3 on day 3 post-booster vaccination.

mode

mse

r2

srho

oob

1

0.18

0.46

mode

mse

r2

srho

loocv

1

0.18

0.46

trainset

testset

srho_mean

srho_sd

srho_baseline

train_n

test_n

2020

2021

0.35

0.02

0.57

26

36

2020

2022

-0.06

0.04

0.53

26

21

2021

2020

0.01

0.03

0.27

36

26

2021

2022

0.38

0.05

0.53

36

21

2022

2020

-0.01

0.02

0.27

21

26

2022

2021

0.40

0.02

0.57

21

36

5.6 Task 3.2

Rank the individuals by fold change of predicted gene expression of CCL3 on day 3 post booster vaccination compared to gene expression values at day 0.

mode

mse

r2

srho

oob

1.14

0.01

0.1

mode

mse

r2

srho

loocv

1.15

0

0.08

trainset

testset

srho_mean

srho_sd

srho_baseline

train_n

test_n

2020

2021

0.21

0.01

-0.43

26

36

2020

2022

-0.26

0.03

-0.19

26

21

2021

2020

-0.16

0.02

-0.33

36

26

2021

2022

-0.08

0.05

-0.19

36

21

2022

2020

-0.35

0.02

-0.33

21

26

2022

2021

0.16

0.05

-0.43

21

36

5.7 Task 4.1

Rank the individuals based on their Th1/Th2 (IFN-<U+03B3>/IL-5) polarization ratio on Day 30 post-booster vaccination.

mode

mse

r2

srho

oob

4.52

0.03

0.21

mode

mse

r2

srho

loocv

4.53

0.03

0.18

trainset

testset

srho_mean

srho_sd

srho_baseline

train_n

test_n

2021

2022

0.47

0.05

0.74

27

16

2022

2021

0.53

0.02

0.55

16

27