Notes on the Modeling Strategy

Author

Philipp Sven Lars Schäfer

Published

November 28, 2024

1 Thoughts on the Challenge

  • Several aspects make this challenge particularly demanding:
    • Large-P-Small-N Problem
    • Seven different prediction tasks
    • Six different data modalities
    • Strong cohort (group) effects
  • Given the complexity of the challenge, there are many degrees of freedom in how to approach it. To avoid getting lost, I tried to focus on the following aspects:
      1. Solid QC and preprocessing of the experimental data.
      2. A solid model evaluation framework, because it is extremely easy to overfit on the training data in the Large-P-Small-N setting.

2 Data Processing

2.1 Clean Feature Names

  • I replaced all characters in “[+-.: /]” with other strings so that I could use R’s formula syntax on the feature names (a minimal sketch follows below)
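
A minimal sketch of this renaming, assuming the features come in as a character vector; the exact replacement strings used in read_data.R are an assumption (a single underscore here):

    # Replace characters that break R's formula syntax (+ - . : space /).
    # The actual replacement strings in read_data.R may differ.
    clean_feature_names <- function(feature_names) {
      gsub("[-+.: /]", "_", feature_names)
    }

    clean_feature_names(c("IFN-gamma", "CD4+ T cells", "IL-6: plasma"))
    #> [1] "IFN_gamma"    "CD4__T_cells" "IL_6__plasma"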

2.2 Data Filtering

  • Code: filter_experimental_data in read_data.R

  • Only keep specimens for which metadata are available

  • Manually subset features for certain assays:

    • PBMC frequencies: Selected a subset of cell types based on the gating info (but did not benchmark this selection): Monocytes, Classical_Monocytes, Intermediate_Monocytes, Non-Classical_Monocytes, CD8Tcells, CD4Tcells, Bcells, NK, pDC, Basophils
    • Olink: Only use proteins/cytokines that have an NA fraction below 50%.
  • Olink QC Warning:

    • Removed all measurements with a QC warning in the Olink assay (about 300 measurements)
  • Different Units Across Dataset Years

    • Plasma Antibody Titers:
      • See forum post here
      • In the 2020 dataset we have IU/mL, whereas for 2021-2023 we have MFI (fluorescence intensity). IU/mL is obtained using a serum standard, so there is no trivial conversion from MFI to IU/mL.
      • I decided not to consider the 2020 Ab titer data
    • Olink:
      • In the 2020 dataset we have Normalized Protein eXpression (NPX), whereas for 2021-2023 we have pg/mL
      • I removed all measurements whose units differ from the 2021-2023 data.
      • This filter removes by far the most measurements.
  • Lower Limit of Detection in Plasma Antibody Titers:

    • See forum post here
    • Removed specimens with more than 50% of measurements below the LOD (see the sketch at the end of this section)
  • Lower Limit of Quantification in Olink assay

    • See forum post here
    • Removed specimens with more than 50% of measurements below the LLOQ
  • Outlier Removal

    • For the legendplex assay I removed these specimens (based on PCA plots): 661, 657, 658, 689, 903, 638, 659
  • PBMC gene expression

    • For simplicity, I decided to only keep genes that have a unique mapping from Ensembl ID to gene symbol
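
As referenced above, a minimal sketch of the below-LOD specimen filter; the long-format layout and the column names (specimen_id, value, lower_limit_of_detection) are assumptions, and the actual implementation is filter_experimental_data in read_data.R:

    library(dplyr)

    # Keep only specimens where at most 50% of measurements fall below the LOD.
    specimens_to_keep <- ab_titer |>
      group_by(specimen_id) |>
      summarise(frac_below_lod = mean(value < lower_limit_of_detection)) |>
      filter(frac_below_lod <= 0.5) |>
      pull(specimen_id)

    ab_titer_filtered <- ab_titer |>
      filter(specimen_id %in% specimens_to_keep)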

2.3 Data Normalization

  • Code: normalize_experimental_data in normalize_integrate.R

  • pbmc_cell_frequency: Median Baseline Normalization, i.e. divide each feature by the median of that feature across the day 0 specimens (see the sketch after this list)

  • pbmc_gene_expression: VST (variance-stabilizing transformation)

  • plasma_ab_titer: Median Baseline Normalization

  • plasma_cytokine_concentration_by_legendplex: Median Baseline Normalization

  • plasma_cytokine_concentration_by_olink: No Normalization

  • t_cell_activation: No Normalization

  • t_cell_polarization: No Normalization
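
A minimal sketch of the median baseline normalization referenced above, assuming a long-format data frame with hypothetical columns feature, value, and planned_day_relative_to_boost; the actual code is normalize_experimental_data in normalize_integrate.R:

    library(dplyr)

    # Per-feature median over all day 0 (baseline) specimens.
    baseline_medians <- dat |>
      filter(planned_day_relative_to_boost == 0) |>
      group_by(feature) |>
      summarise(baseline_median = median(value, na.rm = TRUE))

    # Divide every measurement by its feature's baseline median.
    dat_normalized <- dat |>
      left_join(baseline_medians, by = "feature") |>
      mutate(value = value / baseline_median)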

2.4 Data Integration

  • Code: integrate_experimental_data in normalize_integrate.R

  • pbmc_cell_frequency: No Integration

  • pbmc_gene_expression: ComBat-seq (see the sketch after this list)

  • plasma_ab_titer: No Integration

  • plasma_cytokine_concentration_by_legendplex: ComBat

  • plasma_cytokine_concentration_by_olink: No Integration

  • t_cell_activation: No Integration

  • t_cell_polarization: No Integration
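
A minimal sketch of the two batch-correction calls referenced above, assuming a gene-by-specimen count matrix counts, a feature-by-specimen matrix legendplex_mat, and a per-specimen batch vector dataset_year; all object names are hypothetical, and the actual code is integrate_experimental_data in normalize_integrate.R:

    library(sva)

    # ComBat-seq works on raw RNA-seq counts (pbmc_gene_expression).
    counts_corrected <- ComBat_seq(counts = counts, batch = dataset_year)

    # ComBat works on continuous data (legendplex concentrations).
    legendplex_corrected <- ComBat(dat = legendplex_mat, batch = dataset_year)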

2.5 Construction of Target Tracks

  • Code: generate_all_targets in generate_targets.R

  • To decide which subjects/specimens to consider, I defined the following acceptable differences between the actual and planned day of visit with respect to booster administration (see the sketch after this list):

    • day_0 = list(min_diff = -7, max_diff = 0)
    • day_1 = list(min_diff = 1, max_diff = 2)
    • day_3 = list(min_diff = 3, max_diff = 6)
    • day_14 = list(min_diff = 12, max_diff = 16)
    • day_30 = list(min_diff = 28, max_diff = 40)
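
A minimal sketch of how these windows can be applied, assuming each specimen records its planned and actual day relative to boost (column names hypothetical; the actual code is generate_all_targets in generate_targets.R):

    day_windows <- list(
      day_0  = list(min_diff = -7, max_diff = 0),
      day_1  = list(min_diff = 1,  max_diff = 2),
      day_3  = list(min_diff = 3,  max_diff = 6),
      day_14 = list(min_diff = 12, max_diff = 16),
      day_30 = list(min_diff = 28, max_diff = 40)
    )

    # TRUE if the actual visit day falls inside the window for the planned day.
    is_acceptable_visit <- function(planned_day, actual_day) {
      w <- day_windows[[paste0("day_", planned_day)]]
      !is.null(w) && actual_day >= w$min_diff && actual_day <= w$max_diff
    }

    is_acceptable_visit(planned_day = 14, actual_day = 13)  # TRUE
    is_acceptable_visit(planned_day = 3, actual_day = 8)    # FALSE
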
  • In tasks 1.2, 2.2, 3.2, instead of predicting fold changes, I predicted log fold changes, as the distributions of the logFC looked less skewed.

  • In task 3.1, instead of predicting the raw expression of CCL3, I predicted the log expression (see the sketch below).
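
A sketch of these target transformations; the column names, the log base, and the pseudocount are assumptions:

    # Tasks 1.2 / 2.2 / 3.2: log fold change instead of raw fold change.
    targets$log_fc <- log2(targets$post_value / targets$baseline_value)

    # Task 3.1: log expression of CCL3 instead of raw expression.
    targets$ccl3_log <- log2(targets$ccl3_expression + 1)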

3 Model Selection

  • Code: model_selection.qmd

  • Given that we deal here with a Large-P-Small-N problem, having a good model selection framework is probably even more important than the actual model.

  • To evaluate and select models for each task, I used nested cross-validation (CV):

    • Folds in the outer loop: each cohort is a fold (i.e. group k-fold) -> the outer loop is used to estimate cross-cohort model performance
    • Folds in the inner loop: each subject is a fold (i.e. LOOCV) -> the inner loop is used to select the best set of hyperparameters for each model. With this set of hyperparameters, I then estimate model performance on the held-out fold of the outer loop (see the sketch below)
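
A minimal sketch of this nested CV for one model family (LASSO via glmnet), assuming a data frame df with columns cohort, subject, and target y, plus a vector feature_cols of feature column names; all names are hypothetical, and the Spearman correlation is just an example scoring function:

    library(glmnet)

    nested_cv_scores <- sapply(unique(df$cohort), function(test_cohort) {
      train <- df[df$cohort != test_cohort, ]
      test  <- df[df$cohort == test_cohort, ]

      # Inner loop: leave-one-subject-out CV picks the penalty lambda.
      foldid <- as.integer(factor(train$subject))
      fit <- cv.glmnet(
        x = as.matrix(train[, feature_cols]),
        y = train$y,
        foldid = foldid,
        alpha = 1  # 1 = LASSO; 0 < alpha < 1 gives Elastic Net
      )

      # Outer loop: score on the held-out cohort with the selected lambda.
      pred <- predict(fit, newx = as.matrix(test[, feature_cols]), s = "lambda.min")
      cor(pred[, 1], test$y, method = "spearman")
    })
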
  • I used this model evaluation framework to test many combinations of models and features:

    • Models: LASSO, Elastic Net, Random Forest
    • Features: Power set of all assays
  • To choose the final model per task, I considered not only the performance but all of the following points:

    • Top ranking performance
    • Low variance between the test sets
    • If several models have top performance and low variance, choose the more regularized model
    • If several models have top performance and low variance, choose a model that does not use any assay that is often missing in the test data (specifically Olink!)

4 Other Thoughts

  • Pretty much none of my models considers the 2020 cohort, because the performance of models that use all the data is quite underwhelming…