Notes on the Modeling Strategy
1 Thoughts on the Challenge
- There are several aspects that make this challenge particularly difficult:
- Large-P-Small-N Problem
- Seven different prediction tasks
- Six different data modalities
- Strong cohort (group) effects
- Given the complexity of the challenge, there are many degrees of freedom in how to approach it. To avoid getting lost, I tried to focus on the following aspects:
- Solid QC and data preprocessing of the experimental data.
- Solid model evaluation framework, because it is extremely easy to overfit on the training data in the Large-P-Small-N setting.
2 Data Processing
2.1 Clean Feature Names
- I replaced all characters matching “[+-.: /]” with other strings so that the feature names can be used with the formula syntax in R.
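A minimal sketch of this cleaning step (the original does not specify the replacement strings; mapping everything to underscores is an assumption here):

```r
# Sketch: make feature names safe for R's formula interface by replacing
# the characters "[+-.: /]"; replacing them all with "_" is an assumption.
clean_feature_names <- function(x) {
  gsub("[-+.: /]", "_", x)
}

clean_feature_names(c("IFN-gamma/IL-4", "CD4+ T cells"))
#> [1] "IFN_gamma_IL_4" "CD4__T_cells"
```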
2.2 Data Filtering
Code: `filter_experimental_data` in `read_data.R`
Only keep specimens for which metadata are available.
Manually subset features for certain assays:
- PBMC frequencies: Selected a subset of cell types based on the gating info (but did not benchmark this selection): Monocytes, Classical_Monocytes, Intermediate_Monocytes, Non-Classical_Monocytes, CD8Tcells, CD4Tcells, Bcells, NK, pDC, Basophils
- Olink: Only use proteins/cytokines that have an NA fraction below 50%.
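A minimal sketch of the Olink NA filter (`olink_wide`, a specimens × proteins matrix, is an assumed object name):

```r
# Keep only Olink proteins with an NA fraction below 50%
# (`olink_wide` is a hypothetical specimens x proteins matrix).
na_fraction <- colMeans(is.na(olink_wide))
olink_wide  <- olink_wide[, na_fraction < 0.5, drop = FALSE]
```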
Olink QC Warning:
- Removed all measurements with a QC warning in the Olink assay (about 300 measurements)
Different Units in Assays
- Plasma Antibody Titers:
  - See forum post here
  - In the 2020 dataset the titers are reported in IU/ML, whereas for 2021-2023 we have MFI (fluorescence intensity). IU/ML is obtained using a serum standard, so there is no trivial conversion from MFI to IU/ML.
  - I decided not to consider the 2020 Ab titer data.
- Olink:
  - In the 2020 dataset the values are reported as Normalized Protein eXpression (NPX), whereas for 2021-2023 we have PG/ML.
  - I removed all measurements with differing units.
  - Here, I remove by far the most measurements.
Lower Limit of Detection in Plasma Antibody Titers:
- See forum post here
- Removed specimens with more than 50% of measurements below the LOD
Lower Limit of Quantification in Olink assay
- See forum post here
- Removed specimens with more than 50% of measurements below the LLOQ
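Both of these filters follow the same pattern; a sketch under assumed object names (`assay_wide`, a specimens × features matrix, and `limit`, one detection/quantification limit per feature):

```r
# Drop specimens where more than 50% of measurements fall below the
# detection/quantification limit (`assay_wide` and `limit` are assumptions).
below_limit_fraction <- rowMeans(sweep(assay_wide, 2, limit, `<`), na.rm = TRUE)
assay_wide <- assay_wide[below_limit_fraction <= 0.5, , drop = FALSE]
```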
Outlier Removal
- For the legendplex assay I removed these specimens (based on PCA plots): 661, 657, 658, 689, 903, 638, 659
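The outlier screening was a visual check of PCA plots; a minimal sketch (`legendplex_mat`, a specimens × analytes matrix, is an assumed object name):

```r
# PCA of the legendplex matrix; specimens far away from the bulk in the
# first two PCs were flagged as outliers (visual inspection, no formal rule).
pca <- prcomp(legendplex_mat, center = TRUE, scale. = TRUE)
plot(pca$x[, 1], pca$x[, 2], type = "n", xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = rownames(legendplex_mat), cex = 0.7)
```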
PBMC gene expression
- For simplicity, I decided to only keep genes that have a unique mapping from Ensembl ID to gene symbol
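One reading of "unique mapping" is a one-to-one relation between ID and symbol; a sketch under that assumption, with `gene_map` (columns `ensembl_id` and `symbol`) as a hypothetical mapping table:

```r
# Keep genes whose Ensembl ID <-> gene symbol mapping is unambiguous
# (`gene_map` with columns `ensembl_id` and `symbol` is an assumed table).
ambiguous_id     <- gene_map$ensembl_id %in% gene_map$ensembl_id[duplicated(gene_map$ensembl_id)]
ambiguous_symbol <- gene_map$symbol %in% gene_map$symbol[duplicated(gene_map$symbol)]
gene_map <- gene_map[!ambiguous_id & !ambiguous_symbol, ]
```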
2.3 Data Normalization
Code: `normalize_experimental_data` in `normalize_integrate.R`
pbmc_cell_frequency: Median Baseline Normalization (i.e. divide each feature by the median of that feature in the measurements from day-0 specimens; see the sketch after this list)
pbmc_gene_expression: VST (variance-stabilizing transformation)
plasma_ab_titer: Median Baseline Normalization
plasma_cytokine_concentration_by_legendplex: Median Baseline Normalization
plasma_cytokine_concentration_by_olink: No Normalization
t_cell_activation: No Normalization
t_cell_polarization: No Normalization
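A minimal sketch of the Median Baseline Normalization used above (`x`, a specimens × features matrix, and `is_day0`, a logical vector marking the day-0 specimens, are assumed names):

```r
# Median baseline normalization: divide each feature by the median of the
# day-0 (baseline) measurements of that feature.
median_baseline_normalize <- function(x, is_day0) {
  day0_median <- apply(x[is_day0, , drop = FALSE], 2, median, na.rm = TRUE)
  sweep(x, 2, day0_median, `/`)
}
```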
2.4 Data Integration
Code: `integrate_experimental_data` in `normalize_integrate.R`
pbmc_cell_frequency: No Integration
pbmc_gene_expression: ComBat-seq
plasma_ab_titer: No Integration
plasma_cytokine_concentration_by_legendplex: ComBat
plasma_cytokine_concentration_by_olink: No Integration
t_cell_activation: No Integration
t_cell_polarization: No Integration
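Both batch-correction methods come from the Bioconductor sva package; a sketch with assumed object names (`gex_counts` as a genes × specimens count matrix, `legendplex_mat` as an analytes × specimens matrix, `cohort` as the batch label):

```r
library(sva)

# ComBat-seq works on raw RNA-seq counts (genes in rows, specimens in columns),
# using the cohort as the batch variable.
gex_corrected <- ComBat_seq(counts = gex_counts, batch = cohort)

# ComBat works on continuous data (analytes in rows, specimens in columns).
legendplex_corrected <- ComBat(dat = legendplex_mat, batch = cohort)
```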
2.5 Construction of Target Tracks
Code: `generate_all_targets` in `generate_targets.R`
Which subjects / specimens did I consider:
I defined the following acceptable differences between the actual and planned day of visit with respect to booster administration (see the sketch after this list):
- day_0 = list("min_diff" = -7, "max_diff" = 0)
- day_1 = list("min_diff" = 1, "max_diff" = 2)
- day_3 = list("min_diff" = 3, "max_diff" = 6)
- day_14 = list("min_diff" = 12, "max_diff" = 16)
- day_30 = list("min_diff" = 28, "max_diff" = 40)
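A sketch of how these windows could be applied to assign specimens to the planned timepoints (`specimen_meta` and its column `day_relative_to_boost` are assumed names):

```r
day_windows <- list(
  day_0  = list(min_diff = -7, max_diff = 0),
  day_1  = list(min_diff = 1,  max_diff = 2),
  day_3  = list(min_diff = 3,  max_diff = 6),
  day_14 = list(min_diff = 12, max_diff = 16),
  day_30 = list(min_diff = 28, max_diff = 40)
)

# Assign each specimen to a planned timepoint, or NA if it falls in no window
# (`specimen_meta$day_relative_to_boost` is an assumed column name).
assign_timepoint <- function(day_diff) {
  hit <- names(Filter(function(w) day_diff >= w$min_diff && day_diff <= w$max_diff,
                      day_windows))
  if (length(hit) == 0) NA_character_ else hit[1]
}
specimen_meta$timepoint <- vapply(specimen_meta$day_relative_to_boost,
                                  assign_timepoint, character(1))
```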
In tasks 1.2, 2.2, 3.2, instead of predicting fold changes, I predicted log fold changes, as the distributions of the logFC looked less skewed.
In task 3.1, instead of predicting the raw expression of CCL3, I predicted the log expression.
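A minimal sketch of these target constructions (vector names are placeholders; base-2 log and the pseudocount are assumptions):

```r
# Log fold change relative to the day-0 baseline (instead of the raw FC);
# `value_day_target` and `value_day0` are placeholder vectors aligned by subject.
log_fc <- log2(value_day_target / value_day0)

# Task 3.1: log expression of CCL3 instead of the raw expression
# (`ccl3_expr` is a placeholder; the pseudocount of 1 is an assumption).
log_ccl3 <- log2(ccl3_expr + 1)
```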
3 Model Selection
Code: `model_selection.qmd`
Given that we deal here with a Large-P-Small-N problem, having a good model selection framework is probably even more important than the actual model.
To evaluate and select models for each task, I used nested cross validation (CV):
- Folds in the outer loop: Each cohort is a fold (i.e. group k-fold) -> the outer loop is used to estimate cross-cohort model performance
- Folds in the inner loop: Each subject is a fold (i.e. LOOCV) -> the inner loop is used to select the best set of hyperparameters for each model. Using this set of hyperparameters, I then estimate model performance on the held-out fold from the outer loop
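A schematic skeleton of this nested CV, where `X`, `y`, `cohort`, `subject`, `hyper_grid`, `fit_model`, and `score` are all placeholders (a higher score is assumed to be better):

```r
# Outer folds = cohorts (group k-fold), inner folds = leave-one-subject-out CV
# for hyperparameter selection.
outer_scores <- sapply(unique(cohort), function(test_cohort) {
  train_idx <- which(cohort != test_cohort)
  test_idx  <- which(cohort == test_cohort)

  # Inner loop: leave-one-subject-out predictions for each hyperparameter set,
  # scored on the pooled out-of-fold predictions
  inner_scores <- sapply(hyper_grid, function(params) {
    pred <- rep(NA_real_, length(train_idx))
    for (s in unique(subject[train_idx])) {
      held_out <- subject[train_idx] == s
      fit <- fit_model(X[train_idx[!held_out], , drop = FALSE],
                       y[train_idx[!held_out]], params)
      pred[held_out] <- predict(fit, X[train_idx[held_out], , drop = FALSE])
    }
    score(pred, y[train_idx])
  })
  best_params <- hyper_grid[[which.max(inner_scores)]]

  # Refit on the full outer training set, evaluate on the held-out cohort
  fit <- fit_model(X[train_idx, , drop = FALSE], y[train_idx], best_params)
  score(predict(fit, X[test_idx, , drop = FALSE]), y[test_idx])
})
```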
I used this model evaluation framework to test many combinations of models and features:
- Models: LASSO, Elastic Net, Random Forest
- Features: Power set of all assays
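The power set of the seven assays can be enumerated like this (assay names taken from the normalization section above):

```r
assays <- c("pbmc_cell_frequency", "pbmc_gene_expression", "plasma_ab_titer",
            "plasma_cytokine_concentration_by_legendplex",
            "plasma_cytokine_concentration_by_olink",
            "t_cell_activation", "t_cell_polarization")

# All non-empty assay combinations: 2^7 - 1 = 127
assay_combinations <- unlist(
  lapply(seq_along(assays), function(k) combn(assays, k, simplify = FALSE)),
  recursive = FALSE
)
length(assay_combinations)  # 127
```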
To choose the final model per task, I considered not only the performance but all of the following points:
- Top ranking performance
- Low variance between the test sets
- If several models have top performance and low variance, choose the more regularized model
- If several models have top performance and low variance, choose the model that does not use any assay that is often missing in the test data (specifically Olink!)
4 Other Thoughts
- In pretty much none of the models do I consider the 2020 cohort, because the performance of models that use all the data is quite underwhelming…