pbmc_cell_frequency | Removed 550 specimens because missing in meta data
plasma_ab_titer | Removed 6931 specimens because missing in meta data
plasma_cytokine_concentration_by_olink | Removed 495 specimens because missing in meta data
pbmc_cell_frequency | Removed 56 features because not in feature subset
plasma_ab_titer | Removed 48 features because not in feature subset
plasma_cytokine_concentration_by_olink | Removed 234 features because not in feature subset
t_cell_polarization | Removed 3 features because not in feature subset
plasma_cytokine_concentration_by_olink | Removed 300 features because qc warning
plasma_ab_titer | Removed 10540 measurements because wrong unit used
plasma_cytokine_concentration_by_olink | Removed 2400 measurements because wrong unit used
plasma_cytokine_concentration_by_legendplex | Removed 8 because specimen is outlier
plasma_cytokine_concentration_by_olink | Removed specimen 750, 760, 824, 833, 894, 903 because fraction of measurements below LOQ > 50%
plasma_ab_titer | Removed specimen 674, 675, 676 because fraction of measurements below LOD > 50%
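The LOQ/LOD filters in the log above drop whole specimens once more than half of their measurements fall below the limit of quantification (or detection). A minimal sketch of that rule, assuming a hypothetical long-format measurements table with a logical below_loq column (the actual preprocessing code is folded away):

Code

# Minimal sketch of the LOQ-based specimen filter implied by the logs above.
# `measurements` (columns: specimen_id, below_loq) is an assumed input,
# not an object from this notebook.
library(dplyr)

MAX_BELOW_LOQ_FRAC <- 0.5

specimens_to_drop <- measurements %>%
  group_by(specimen_id) %>%
  summarise(frac_below_loq = mean(below_loq)) %>%
  filter(frac_below_loq > MAX_BELOW_LOQ_FRAC) %>%
  pull(specimen_id)

measurements_filtered <- measurements %>%
  filter(!specimen_id %in% specimens_to_drop)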
converting counts to integer mode
Found 4 batches
Using null model in ComBat-seq.
Adjusting for 0 covariate(s) or covariate level(s)
Estimating dispersions
Fitting the GLM model
Shrinkage off - using GLM estimates for parameters
Adjusting the data
Found 3 batches
Adjusting for 0 covariate(s) or covariate level(s)
Standardizing Data across genes
Fitting L/S model and finding priors
Finding parametric adjustments
Adjusting the Data
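The messages above come from batch correction with the sva package: ComBat-seq on the RNA-seq counts (4 batches, null model, shrinkage off) followed by a parametric ComBat run on normalized data (3 batches). A minimal sketch of such calls, assuming `counts` is a gene-by-sample integer matrix, `expr` a normalized gene-by-sample matrix, and `batch` a vector of cohort labels (these inputs are assumptions, not the notebook's objects):

Code

# Minimal sketch of sva calls consistent with the logs above.
library(sva)

# ComBat-seq on raw counts: no covariates (null model), shrinkage off
counts_adj <- ComBat_seq(counts, batch = batch, group = NULL, shrink = FALSE)

# parametric ComBat on normalized expression values
expr_adj <- ComBat(dat = expr, batch = batch, par.prior = TRUE)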
3 Tasks
Code
task_meta <- list(
  task_11 = list(
    name = "task_11",
    header = "## Task 1.1",
    description = "Rank the individuals by IgG antibody levels against pertussis toxin (PT) that we detect in plasma 14 days post booster vaccinations."
  ),
  task_12 = list(
    name = "task_12",
    header = "## Task 1.2",
    description = "Rank the individuals by fold change of IgG antibody levels against pertussis toxin (PT) that we detect in plasma 14 days post booster vaccinations compared to titer values at day 0."
  ),
  task_21 = list(
    name = "task_21",
    header = "## Task 2.1",
    description = "Rank the individuals by predicted frequency of Monocytes on day 1 post boost after vaccination."
  ),
  task_22 = list(
    name = "task_22",
    header = "## Task 2.2",
    description = "Rank the individuals by fold change of predicted frequency of Monocytes on day 1 post booster vaccination compared to cell frequency values at day 0."
  ),
  task_31 = list(
    name = "task_31",
    header = "## Task 3.1",
    description = "Rank the individuals by predicted gene expression of CCL3 on day 3 post-booster vaccination."
  ),
  task_32 = list(
    name = "task_32",
    header = "## Task 3.2",
    description = "Rank the individuals by fold change of predicted gene expression of CCL3 on day 3 post booster vaccination compared to gene expression values at day 0."
  ),
  task_41 = list(
    name = "task_41",
    header = "## Task 4.1",
    description = "Rank the individuals based on their Th1/Th2 (IFN-g/IL-5) polarization ratio on Day 30 post-booster vaccination."
  )
)
4 Rationale
How to choose the right model? (A sketch of scoring these criteria follows the list.)
- Top ranking performance
- Low variance between the test sets
- If several models reach top performance with low variance, choose the more regularized model
- If several models reach top performance with low variance, prefer a model that does not rely on an assay that is often missing in the test data (-> Olink!)
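As a minimal sketch of the first two criteria: summarise each model x feature-set combination by the mean and standard deviation of its test-set correlations. The `scores_df` input here is a hypothetical per-test-set score table (e.g. Spearman correlations derived from `result_list` below); the actual scoring chunk is not shown.

Code

# Hypothetical ranking of models by the criteria above, assuming a
# `scores_df` with one Spearman correlation per (model, feature_set, testset).
library(dplyr)

model_summary <- scores_df %>%
  group_by(model, feature_set) %>%
  summarise(
    mean_spearman = mean(spearman), # top ranking performance
    sd_spearman = sd(spearman),     # variance between the test sets
    .groups = "drop"
  ) %>%
  arrange(desc(mean_spearman), sd_spearman)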
5 Conclusions
TODO
6 Prepare Features
Code
# get the 2000 most variable genes based on the mean dispersion (variance/mean) across cohorts
N_HVG <- 2000
hvg <- gene_meta %>%
  dplyr::mutate("prefix_versioned_ensembl_gene_id_clean" = paste0("pbmc_gene_expression_", versioned_ensembl_gene_id_clean)) %>%
  dplyr::slice_max(mean_rank, n = N_HVG, with_ties = TRUE)

meta_data_cov <- get_metadata_covariates(meta_data) %>%
  dplyr::select(-c(ethnicity_Not_Hispanic_or_Latino, ethnicity_Hispanic_or_Latino))

experimental_predictors_day_0 <- generate_wide_experimental_data(
  experimental_data = experimental_data,
  impute = "knn",
  max_NA_frac_sample = 0.5,
  max_NA_frac_feature = 0.25,
  verbose = TRUE
) %>%
  purrr::imap(function(df, modality) {
    colnames(df) <- paste0(modality, "_", colnames(df)) # since feature names can be repeated across assays
    df <- as.data.frame(df) %>%
      tibble::rownames_to_column("specimen_id") %>%
      dplyr::mutate(specimen_id = as.numeric(specimen_id)) %>%
      dplyr::inner_join(
        (specimen_per_day$day_0 %>% dplyr::select(subject_id, specimen_id, dataset)),
        by = "specimen_id"
      ) %>%
      dplyr::select(-c(specimen_id, dataset)) %>%
      dplyr::select(subject_id, everything())
    return(df)
  })
plasma_cytokine_concentration_by_olink | NA Fraction: 0.00288417166589755 | Imputed with knn imputation
t_cell_activation | NA Fraction: 0.0576923076923077 | Removed samples: 681
t_cell_activation | NA Fraction: 0.0553691275167785 | Imputed with knn imputation
t_cell_polarization | NA Fraction: 0.0134099616858238 | Imputed with knn imputation
Olink is missing quite often because I removed all measurements for the 2020 cohort.
Code
# which assays are most/least often missing?
meta_data_plus_assays_day_0 %>%
  dplyr::select(dplyr::all_of(names(experimental_predictors_day_0))) %>%
  dplyr::mutate(row_sum = apply(select(., matches(".+")), 1, sum)) %>%
  dplyr::filter(row_sum > 0) %>%
  tidyr::pivot_longer(cols = !row_sum) %>%
  dplyr::group_by(name) %>%
  dplyr::summarise(mean = mean(value)) %>%
  dplyr::arrange(desc(mean))
Code
MAX_N_ASSAYS_MISSING <- 2
K_FOR_KNN_IMPUTE <- 7

feature_list <- list(
  "metadata" = meta_data_cov,
  "metadata+baseline" = meta_data_cov # will take care of that logic in loop
)

# check for each feature setting (aka modality setting)
# whether too many assays are missing for any subject
# (by default we remove a subject if more than 2 of the required feature sets are missing)
# usually we exclude here subjects from the 2020 dataset
for (modality_set in modality_powerset) {
  # modality_set <- modality_powerset[[100]]
  modality_set_name <- paste0(paste0(modality_set, collapse = "_pca+"), "_pca+metadata+baseline")

  # inner join here, other subjects have no assays for day 0...
  modality_set_check_assays <- meta_data_cov %>%
    dplyr::inner_join(
      dplyr::select(meta_data_plus_assays_day_0, dplyr::all_of(c("subject_id", modality_set))),
      by = "subject_id"
    )
  modality_set_check_assays[, modality_set][is.na(modality_set_check_assays[, modality_set])] <- FALSE
  modality_set_check_assays[, "n_assays_present"] <- rowSums(modality_set_check_assays[, modality_set])
  modality_set_check_assays[, "n_assays_missing"] <- length(modality_set) - modality_set_check_assays[, "n_assays_present"]
  #modality_set_check_assays[, "n_assays_max"] <- length(modality_set)
  #modality_set_check_assays <- modality_set_check_assays %>% dplyr::mutate(n_assays_frac = n_assays_present / n_assays_max)

  subjects_to_exclude <- modality_set_check_assays %>%
    dplyr::filter(n_assays_missing >= !!MAX_N_ASSAYS_MISSING) %>%
    dplyr::pull(subject_id)

  modality_set_meta_df <- meta_data_cov %>%
    dplyr::filter(!(subject_id %in% subjects_to_exclude))

  modality_set_df <- experimental_predictors_day_0_pca[modality_set] %>%
    purrr::reduce(full_join, by = "subject_id") %>%
    dplyr::inner_join(modality_set_meta_df, by = "subject_id")

  modality_set_mtx <- modality_set_df %>%
    tibble::column_to_rownames("subject_id") %>%
    as.matrix() %>%
    t() %>%
    impute::impute.knn(data = ., k = K_FOR_KNN_IMPUTE, maxp = 10000, colmax = 1) %>%
    .$data %>%
    t()

  feature_list[[modality_set_name]] <- as.data.frame(modality_set_mtx) %>%
    tibble::rownames_to_column("subject_id") %>%
    dplyr::mutate(subject_id = as.numeric(subject_id))
}

# adding power set of single-omic predictors
for (modality_set in modality_powerset) {
  # modality_set <- modality_powerset[[100]]
  modality_set_name <- paste0(paste0(modality_set, collapse = "+"), "+metadata+baseline")

  # inner join here, other subjects have no assays for day 0...
  modality_set_check_assays <- meta_data_cov %>%
    dplyr::inner_join(
      dplyr::select(meta_data_plus_assays_day_0, dplyr::all_of(c("subject_id", modality_set))),
      by = "subject_id"
    )
  modality_set_check_assays[, modality_set][is.na(modality_set_check_assays[, modality_set])] <- FALSE
  modality_set_check_assays[, "n_assays_present"] <- rowSums(modality_set_check_assays[, modality_set])
  modality_set_check_assays[, "n_assays_missing"] <- length(modality_set) - modality_set_check_assays[, "n_assays_present"]
  #modality_set_check_assays[, "n_assays_max"] <- length(modality_set)
  #modality_set_check_assays <- modality_set_check_assays %>% dplyr::mutate(n_assays_frac = n_assays_present / n_assays_max)

  subjects_to_exclude <- modality_set_check_assays %>%
    dplyr::filter(n_assays_missing >= !!MAX_N_ASSAYS_MISSING) %>%
    dplyr::pull(subject_id)

  modality_set_meta_df <- meta_data_cov %>%
    dplyr::filter(!(subject_id %in% subjects_to_exclude))

  modality_set_df <- experimental_predictors_day_0[modality_set] %>%
    purrr::reduce(full_join, by = "subject_id") %>%
    dplyr::inner_join(modality_set_meta_df, by = "subject_id")

  modality_set_mtx <- modality_set_df %>%
    tibble::column_to_rownames("subject_id") %>%
    as.matrix() %>%
    t() %>%
    impute::impute.knn(data = ., k = K_FOR_KNN_IMPUTE, maxp = 10000, colmax = 1) %>%
    .$data %>%
    t()

  feature_list[[modality_set_name]] <- as.data.frame(modality_set_mtx) %>%
    tibble::rownames_to_column("subject_id") %>%
    dplyr::mutate(subject_id = as.numeric(subject_id))
}

stopifnot(all(purrr::map_dbl(feature_list, ~ mean(is.na(.x))) == 0))
#purrr::map(feature_list, ~ dim(.x))
7 Model Evaluation / Assessment
Code
set.seed(42)
tictoc::tic()
result_list <- list()

for (task in task_meta) {
  # task <- task_meta[[1]]
  task_df <- target_list[[task$name]]
  target_df <- task_df %>% dplyr::select(subject_id, target)
  baseline_df <- task_df %>% dplyr::select(subject_id, baseline)

  baseline_model_df <- task_df %>%
    dplyr::left_join(
      dplyr::distinct(dplyr::select(meta_data, c(subject_id, dataset))),
      by = "subject_id"
    )
  all_datasets <- unique(baseline_model_df$dataset)

  for (test_dataset in all_datasets) {
    result_list <- rlist::list.append(result_list, tibble::tibble(
      task = task$name,
      trainset = NA,
      testset = test_dataset,
      test_subset = task_df$subject_id[baseline_model_df$dataset == test_dataset],
      target = task_df$target[baseline_model_df$dataset == test_dataset],
      prediction = task_df$baseline[baseline_model_df$dataset == test_dataset],
      testset_mean = NA,
      model = "baseline",
      feature_set = "baseline"
    ))
  }

  # for (model_name in c("lasso", "elnet", "rf", "rf+boruta")) {
  for (model_name in c("lasso", "elnet", "rf")) {
    # model_name <- "lasso"
    for (feature_name in names(feature_list)) {
      # feature_name <- names(feature_list)[2]
      feature_df <- feature_list[[feature_name]]
      if (str_detect(feature_name, "baseline")) {
        feature_df <- feature_df %>% dplyr::left_join(baseline_df, by = "subject_id")
      }
      model_df <- target_df %>%
        dplyr::left_join(feature_df, by = "subject_id") %>%
        dplyr::left_join(
          dplyr::distinct(dplyr::select(meta_data, c(subject_id, dataset))),
          by = "subject_id"
        )
      #stopifnot(!any(is.na(model_df))) # guard rail
      model_df <- model_df %>% tidyr::drop_na()
      stopifnot(nrow(model_df) == nrow(dplyr::distinct(model_df)))

      all_datasets <- unique(model_df$dataset)
      cross_dataset_out <- purrr::map(all_datasets, function(test_dataset) {
        # test_dataset <- all_datasets[2]
        train_dataset <- all_datasets[!all_datasets %in% test_dataset]
        train_df <- model_df %>%
          dplyr::filter(dataset %in% train_dataset) %>%
          dplyr::select(-c(dataset, subject_id))
        train_mtx <- as.matrix(dplyr::select(train_df, -target))
        train_target <- dplyr::pull(train_df, target)
        test_df_with_subject <- model_df %>% dplyr::filter(dataset %in% test_dataset)
        test_df <- test_df_with_subject %>% dplyr::select(-c(dataset, subject_id))
        test_mtx <- as.matrix(dplyr::select(test_df, -target))
        test_target <- dplyr::pull(test_df, target)

        if (model_name == "rf") {
          model <- ranger::ranger(formula = target ~ ., data = train_df, num.trees = 500)
          predictions <- predict(model, test_df)$predictions
        } else if (model_name == "rf+boruta") {
          if (ncol(train_df) > 2) {
            boruta_out <- Boruta::Boruta(
              formula = target ~ .,
              data = train_df,
              maxRuns = 1000,
              num.trees = 500
            )
            boruta_feats <- names(boruta_out$finalDecision)[boruta_out$finalDecision == "Confirmed"]
          } else {
            boruta_feats <- colnames(train_df)[colnames(train_df) != "target"]
          }
          if (length(boruta_feats) == 0) {
            predictions <- rep(mean(train_df$target), nrow(test_df))
          } else {
            train_df_subset <- train_df %>% dplyr::select(dplyr::all_of(c(boruta_feats, "target")))
            test_df_subset <- test_df %>% dplyr::select(dplyr::all_of(c(boruta_feats, "target")))
            model <- ranger::ranger(formula = target ~ ., data = train_df_subset, num.trees = 500)
            predictions <- predict(model, test_df_subset)$predictions
          }
        } else if (model_name == "lasso") {
          # standardize the features, even though glmnet is doing this automatically!
          # feature_means <- matrixStats::colMeans2(train_mtx)
          # feature_sds <- matrixStats::colSds(train_mtx)
          # train_mtx <- t((t(train_mtx) - feature_means) / feature_sds)
          # train_target <- (train_target - mean(train_target)) / sd(train_target)
          # test_mtx <- t((t(test_mtx) - feature_means) / feature_sds)

          # inner hyperparameter optim loop using LOOCV to find best lambda
          model <- glmnet::cv.glmnet(
            x = train_mtx,
            y = train_target,
            nfolds = nrow(train_mtx),
            grouped = FALSE,
            alpha = 1, # LASSO
            type.measure = "mae", # not sure whether this is better
            standardize = TRUE, # by default each feature is standardized internally
            family = "gaussian"
          )
          # investigate
          # beta_mtx <- as.data.frame(as.matrix(model$glmnet.fit$beta))
          # colnames(beta_mtx) <- paste0("lambda_", round(model$lambda, 3))
          # plot(model)
          # lambda_value <- model$lambda.1se # model$lambda.min

          # never use zero features (with lambda=model$lambda[1]), since constant predictions
          # will have NA spearman...
          lambda_value <- ifelse(model$lambda.min == model$lambda[1], model$lambda[2], model$lambda.min)
          predictions <- predict(model, newx = test_mtx, s = lambda_value)[, 1]
        } else if (model_name == "elnet") {
          # standardize the features, even though glmnet is doing this automatically!
          # feature_means <- matrixStats::colMeans2(train_mtx)
          # feature_sds <- matrixStats::colSds(train_mtx)
          # train_mtx <- t((t(train_mtx) - feature_means) / feature_sds)
          # train_target <- (train_target - mean(train_target)) / sd(train_target)
          # test_mtx <- t((t(test_mtx) - feature_means) / feature_sds)

          # inner hyperparameter optim loop using LOOCV to find best lambda
          model <- glmnet::cv.glmnet(
            x = train_mtx,
            y = train_target,
            nfolds = nrow(train_mtx),
            grouped = FALSE,
            alpha = 0.5, # elastic net
            type.measure = "mae", # not sure whether this is better
            standardize = TRUE, # by default each feature is standardized internally
            family = "gaussian"
          )
          lambda_value <- ifelse(model$lambda.min == model$lambda[1], model$lambda[2], model$lambda.min)
          predictions <- predict(model, newx = test_mtx, s = lambda_value)[, 1]
        } else {
          stop(paste0(model_name, " not implemented"))
        }

        tibble::tibble(
          task = task$name,
          trainset = paste0(train_dataset, collapse = "__"),
          testset = test_dataset,
          test_subset = test_df_with_subject$subject_id,
          target = test_df$target,
          prediction = predictions,
          testset_mean = mean(test_df$target),
          model = model_name,
          feature_set = feature_name
        )
      }) %>% dplyr::bind_rows()

      result_list <- rlist::list.append(result_list, cross_dataset_out)
    }
  }
}
Warning: from glmnet C++ code (error code -99); Convergence for 99th lambda
value not reached after maxit=100000 iterations; solutions for larger lambdas
returned
Warning: from glmnet C++ code (error code -93); Convergence for 93th lambda
value not reached after maxit=100000 iterations; solutions for larger lambdas
returned
Warning: from glmnet C++ code (error code -98); Convergence for 98th lambda
value not reached after maxit=100000 iterations; solutions for larger lambdas
returned
Warning: from glmnet C++ code (error code -94); Convergence for 94th lambda
value not reached after maxit=100000 iterations; solutions for larger lambdas
returned
Warning: from glmnet C++ code (error code -91); Convergence for 91th lambda
value not reached after maxit=100000 iterations; solutions for larger lambdas
returned
Warning: from glmnet C++ code (error code -97); Convergence for 97th lambda
value not reached after maxit=100000 iterations; solutions for larger lambdas
returned
Warning: from glmnet C++ code (error code -91); Convergence for 91th lambda
value not reached after maxit=100000 iterations; solutions for larger lambdas
returned
Warning: from glmnet C++ code (error code -95); Convergence for 95th lambda
value not reached after maxit=100000 iterations; solutions for larger lambdas
returned
Why is that? This happens when the Boruta algorithm returns zero important features, in which case the test-set predictions are set to a constant. And since a constant vector has zero standard deviation, we cannot compute the Spearman correlation (which divides by the standard deviation).
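This is easy to reproduce:

Code

# A constant prediction vector has zero standard deviation, so the Spearman
# correlation is undefined; R returns NA (with a warning).
cor(rep(1, 4), c(0.2, 0.5, 0.1, 0.9), method = "spearman")
#> [1] NA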
prediction_specs <- list(
  "task_11" = list(
    "model" = "baseline",
    "feature_set" = "baseline",
    "table_col" = "1.1) IgG-PT-D14-titer-Rank"
  ),
  "task_12" = list(
    "model" = "lasso",
    "feature_set" = "pbmc_cell_frequency+t_cell_polarization+metadata+baseline",
    "table_col" = "1.2) IgG-PT-D14-FC-Rank"
  ),
  "task_21" = list(
    "model" = "rf",
    "feature_set" = "pbmc_cell_frequency_pca+plasma_cytokine_concentration_by_legendplex_pca+t_cell_activation_pca+metadata+baseline",
    "table_col" = "2.1) Monocytes-D1-Rank"
  ),
  "task_22" = list(
    "model" = "elnet",
    "feature_set" = "plasma_cytokine_concentration_by_legendplex+t_cell_activation+t_cell_polarization+metadata+baseline",
    "table_col" = "2.2) Monocytes-D1-FC-Rank"
  ),
  "task_31" = list(
    "model" = "lasso",
    "feature_set" = "plasma_cytokine_concentration_by_legendplex+t_cell_polarization+metadata+baseline",
    "table_col" = "3.1) CCL3-D3-Rank"
  ),
  "task_32" = list(
    "model" = "rf",
    "feature_set" = "plasma_ab_titer_pca+metadata+baseline",
    "table_col" = "3.2) CCL3-D3-FC-Rank"
  ),
  "task_41" = list(
    "model" = "lasso",
    "feature_set" = "pbmc_cell_frequency+plasma_cytokine_concentration_by_legendplex+t_cell_activation+metadata+baseline",
    "table_col" = "4.1) IFNG/IL5-Polarization-D30-Rank"
  )
)

# checks
## 0) all task column names are in submission table
purrr::map_lgl(prediction_specs, ~ .x[["table_col"]] %in% colnames(submission_df)) %>%
  all() %>%
  stopifnot()

## a) all tasks are specified
stopifnot(dplyr::setequal(names(task_meta), names(prediction_specs)))

## b) all features are present
purrr::map_lgl(prediction_specs, ~ length(str_split(.x[["feature_set"]], "\\+")[[1]]) >= 1) %>%
  all() %>%
  stopifnot()
purrr::map_lgl(prediction_specs, ~ .x[["feature_set"]] %in% c(names(feature_list), "baseline")) %>%
  all() %>%
  stopifnot()

## c) which features are never selected?
selected_features <- purrr::map(prediction_specs, ~ str_split(.x[["feature_set"]], "\\+")[[1]]) %>%
  unlist() %>%
  unique()
all_features <- purrr::map(names(feature_list), ~ str_split(.x, "\\+")[[1]]) %>%
  unlist() %>%
  unique()
all_features[!all_features %in% selected_features]

rm(selected_features, all_features)

## d) which features are most commonly used
purrr::map(prediction_specs, ~ str_split(.x[["feature_set"]], "\\+")[[1]]) %>%
  unlist() %>%
  stringr::str_remove_all(., "_pca$") %>%
  table() %>%
  sort(decreasing = TRUE)
inner_join: added 20 columns (PC1.x, PC2.x, PC3.x, PC4.x, PC5.x, ...)
> rows only in x (15)
> rows only in feature_df (57)
> matched rows 51
> ====
> rows total 51
Baseline values are missing for 6 samples! Imputing with Median
inner_join: added 28 columns (plasma_cytokine_concentration_by_legendplex_P01375, plasma_cytokine_concentration_by_legendplex_P01563, plasma_cytokine_concentration_by_legendplex_P01579, plasma_cytokine_concentration_by_legendplex_P02778, plasma_cytokine_concentration_by_legendplex_P05231, ...)
> rows only in x (16)
> rows only in feature_df (57)
> matched rows 50
> ====
> rows total 50
Baseline values are missing for 6 samples! Imputing with Median
inner_join: added 24 columns (plasma_cytokine_concentration_by_legendplex_P01375, plasma_cytokine_concentration_by_legendplex_P01563, plasma_cytokine_concentration_by_legendplex_P01579, plasma_cytokine_concentration_by_legendplex_P02778, plasma_cytokine_concentration_by_legendplex_P05231, ...)
> rows only in x (29)
> rows only in feature_df (54)
> matched rows 54
> ====
> rows total 54
Baseline values are missing for 1 samples! Imputing with Median
inner_join: added 20 columns (PC1, PC2, PC3, PC4, PC5, ...)
> rows only in x (30)
> rows only in feature_df (54)
> matched rows 53
> ====
> rows total 53
Baseline values are missing for 1 samples! Imputing with Median
inner_join: added 32 columns (pbmc_cell_frequency_Basophils, pbmc_cell_frequency_Bcells, pbmc_cell_frequency_CD4Tcells, pbmc_cell_frequency_CD8Tcells, pbmc_cell_frequency_Classical_Monocytes, ...)
> rows only in x ( 1)
> rows only in feature_df (66)
> matched rows 42
> ====
> rows total 42
Baseline values are missing for 3 samples! Imputing with Median
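The "Imputing with Median" messages suggest a simple fallback for subjects without a day-0 measurement. A sketch of what such a helper could look like (this is an assumption; the actual chunk is folded away):

Code

# Minimal sketch of the median fallback suggested by the log messages above;
# an assumed implementation, not the notebook's actual code.
impute_baseline_median <- function(df) {
  n_missing <- sum(is.na(df$baseline))
  if (n_missing > 0) {
    message("Baseline values are missing for ", n_missing, " samples! Imputing with Median")
    df$baseline[is.na(df$baseline)] <- median(df$baseline, na.rm = TRUE)
  }
  df
}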
left_join: added one column (predictions)
> rows only in x 0
> rows only in task_preds_df ( 0)
> matched rows 54
> ====
> rows total 54
left_join: added one column (predictions)
> rows only in x 0
> rows only in task_preds_df ( 0)
> matched rows 54
> ====
> rows total 54
left_join: added one column (predictions)
> rows only in x 0
> rows only in task_preds_df ( 0)
> matched rows 54
> ====
> rows total 54
left_join: added one column (predictions)
> rows only in x 0
> rows only in task_preds_df ( 0)
> matched rows 54
> ====
> rows total 54
left_join: added one column (predictions)
> rows only in x 0
> rows only in task_preds_df ( 0)
> matched rows 54
> ====
> rows total 54
left_join: added one column (predictions)
> rows only in x 0
> rows only in task_preds_df ( 0)
> matched rows 54
> ====
> rows total 54
left_join: added one column (predictions)
> rows only in x 0
> rows only in task_preds_df ( 0)
> matched rows 54
> ====
> rows total 54
Code
submission_df_with_predictions
Lastly, we need to create ranks from those predictions. From the challenge instructions:
We expect contestants to generate computational models, and upon making predictions, these values are ranked from highest to lowest (i.e. highest = 1, lowest = N) before making a final submission.
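A minimal sketch of that ranking step, assuming each task's prediction column in submission_df_with_predictions is named according to prediction_specs$table_col:

Code

# Convert predictions to ranks per task: highest prediction = rank 1,
# lowest = N. Assumption: the prediction columns already carry the final
# table_col names from prediction_specs.
rank_high_to_low <- function(x) rank(-x, ties.method = "min")

task_cols <- unname(purrr::map_chr(prediction_specs, "table_col"))

submission_df_ranked <- submission_df_with_predictions %>%
  dplyr::mutate(dplyr::across(dplyr::all_of(task_cols), rank_high_to_low))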