Several taxonomy and phylotype levels were used as input to the model. We used two approaches to select the optimal features for training. First, we identified features significantly correlated with the collect_week variable. The two subpopulations (term and preterm patients) were treated independently, yielding term-associated and preterm-associated feature subgroups. A Spearman correlation test was performed for each phylotype and taxonomy variable (see 01_longitudinal_feature_selection.r), using relative-abundance variables in both cases. Two criteria were applied to select a feature:
- the feature must show an absolute Spearman correlation greater than 0.5 with the collect_week variable, and
- this correlation must hold in at least 20 percent of the patients within each subpopulation.
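The original selection is implemented in R (01_longitudinal_feature_selection.r); the two criteria above can be sketched in Python as follows. The data layout (one row per biospecimen, one column per relative-abundance feature) and the minimum of three time points per patient are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import spearmanr

def select_longitudinal_features(abundance, collect_week, patient_ids,
                                 rho_cutoff=0.5, patient_frac=0.20):
    """Keep features whose |Spearman rho| with collect_week exceeds
    rho_cutoff in at least patient_frac of the patients.

    abundance    : (n_samples, n_features) relative-abundance matrix
    collect_week : (n_samples,) gestational week of each sample
    patient_ids  : (n_samples,) participant identifier per sample
    """
    patients = np.unique(patient_ids)
    n_feat = abundance.shape[1]
    hits = np.zeros(n_feat, dtype=int)
    for p in patients:
        mask = patient_ids == p
        if mask.sum() < 3:            # need a few time points per patient
            continue
        for j in range(n_feat):
            x = abundance[mask, j]
            if np.all(x == x[0]):     # constant feature: rho undefined
                continue
            rho, _ = spearmanr(x, collect_week[mask])
            if abs(rho) > rho_cutoff:
                hits[j] += 1
    return hits >= patient_frac * len(patients)
```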
Second, a differential expression analysis (DEA) was carried out to select features differentially expressed between term and preterm patients, using the DESeq2 package in each cohort (see 01.2_DESeq2_analysis.R). We then performed a meta-analysis with the DExMA package on the p-values obtained from each cohort (see 01.1_Meta_Analysis.R), and the features most significant across all cohorts were selected for deeper study. This approach was also executed for each taxonomy and phylotype level.
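DExMA offers several p-value combination methods; as an illustration only, the per-cohort DESeq2 p-values could be combined with Fisher's method, one of the classic meta-analysis techniques. This is a minimal stand-in, not the exact DExMA procedure:

```python
import numpy as np
from scipy.stats import combine_pvalues

def meta_analyse(pvalue_matrix):
    """pvalue_matrix: (n_features, n_cohorts) array of per-cohort
    DESeq2 p-values. Returns one combined p-value per feature
    using Fisher's method."""
    return np.array([
        combine_pvalues(row, method="fisher")[1]
        for row in pvalue_matrix
    ])
```

Features with the smallest combined p-values across cohorts would then be ranked for further study.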
Both approaches were carried out for Task 1 and Task 2; the only difference was how the datasets were relabeled. In Task 1 (preterm birth prediction) the delivery_week variable was thresholded at 32, and in Task 2 (early preterm birth prediction) it was thresholded at 28. Finally, we included 210 unique variables as input to our prediction models for Task 1, and 218 unique variables for Task 2. As covariates we selected the Valencia score, the phylo entropy score, the collect week and the NIH Racial Category; the latter was converted into dummy variables.
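As a sketch of the task-specific relabeling and the covariate encoding. The column names (`delivery_week`, `NIH_racial_category`) are hypothetical, and whether the cutoff is inclusive is an assumption:

```python
import pandas as pd

def build_labels_and_covariates(df, week_cutoff):
    """Binary label from delivery_week at the task-specific cutoff
    (32 for Task 1, 28 for Task 2, as described above), plus
    one-hot (dummy) encoding of the NIH Racial Category covariate."""
    out = df.copy()
    # inclusive cutoff is an assumption of this sketch
    out["preterm"] = (out["delivery_week"] <= week_cutoff).astype(int)
    race = pd.get_dummies(out["NIH_racial_category"], prefix="race")
    return pd.concat([out.drop(columns=["NIH_racial_category"]), race],
                     axis=1)
```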
Before training our models, and to avoid introducing noise, we removed cohorts B, I and J from the training data; all three contained a large number of zero values.
Three machine-learning models were trained: random forest, elastic net and XGBoost, using the mlr3 package. To evaluate model performance we carried out a nested cross-validation, with an inner holdout split to select the best hyperparameters and an outer 10-fold CV with 5 repetitions to assess the model in general terms. Random forest achieved the highest performance in all folds. After the training phase, the final RF hyperparameters were: mtry = 7, nodesize = 2 and ntree = 400.
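The original training used mlr3 in R; an analogous nested cross-validation can be sketched with scikit-learn. Fold counts, grid values and the synthetic data are reduced for brevity (the write-up used a 10-fold outer loop with 5 repetitions and ntree = 400):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, RepeatedKFold,
                                     ShuffleSplit, cross_val_score)

# Synthetic stand-in for the microbiome feature matrix
X, y = make_classification(n_samples=120, n_features=15, random_state=0)

# Inner loop: a single holdout split to pick hyperparameters
inner = ShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
param_grid = {
    "max_features": [3, 7],      # ranger/mlr3 'mtry'
    "min_samples_leaf": [2],     # 'nodesize'
    "n_estimators": [50],        # 'ntree' (400 in the write-up)
}
tuner = GridSearchCV(RandomForestClassifier(random_state=0),
                     param_grid, cv=inner, scoring="roc_auc")

# Outer loop: repeated k-fold CV to estimate generalization
# (the write-up used n_splits=10, n_repeats=5)
outer = RepeatedKFold(n_splits=3, n_repeats=2, random_state=0)
scores = cross_val_score(tuner, X, y, cv=outer, scoring="roc_auc")
```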
When we investigated the model further, we examined the importance it assigned to each variable, computed across all folds. To further reduce dimensionality, we selected the variables whose cumulative importance exceeded a predefined threshold of 250 on the mean decrease in impurity (Gini) score. The final variables used to build our last model were: for Phylotypes 1 (pt__00042, pt__00002, pt__00019, pt__00021, pt__00001); for Phylotypes 2 (pt__00009, pt__00006, pt__00005, pt__00002, pt__00001); for Phylotypes 3 (pt__00005, pt__00003, pt__00001, pt__00007, pt__00032); for the family level (Prevotellaceae, Lactobacillaceae, Bifidobacteriaceae, Ruminococcaceae, Veillonellaceae, Lachnospiraceae, Bacteroidaceae); for the genus level (Prevotella, Lactobacillus, Bacteroides and Porphyromonas); and for the species level (Prevotella bivia, Fenollaria massiliensis timonensis, Lactobacillus iners). In addition, the Valencia score, the phylo entropy score and the collect week variable were included.
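Interpreting "cumulative importance" as the mean-decrease-in-impurity score summed over the outer folds (an assumption of this sketch; the scale of the score depends on the RF implementation), the selection step might look like:

```python
import numpy as np

def select_by_cumulative_importance(importance_per_fold, names,
                                    threshold=250.0):
    """importance_per_fold: (n_folds, n_features) mean-decrease-in-
    impurity (Gini) scores from each outer fold. A feature is kept
    when its importance summed over folds exceeds `threshold`
    (250 in the write-up)."""
    total = np.asarray(importance_per_fold).sum(axis=0)
    return [n for n, t in zip(names, total) if t > threshold]
```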
Because the training data were unbalanced, we examined an alternative decision threshold to reduce the false negative and false positive rates. We defined a cost matrix, penalizing true negatives with a weight of 3 and false negative predictions with a weight of 2. The theoretical threshold calculated from this cost matrix was 0.33.
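Under the standard decision-theoretic reading of a binary cost matrix, the threshold is t = (c_FP - c_TN) / ((c_FP - c_TN) + (c_FN - c_TP)). As a sketch of this general mechanism (the write-up reports t = 0.33 for its own cost matrix; the example costs below are illustrative, not the write-up's):

```python
import numpy as np

def threshold_from_costs(cost_fp, cost_fn, cost_tp=0.0, cost_tn=0.0):
    """Decision-theoretic threshold for a binary cost matrix:
    predict positive when p >= t, where t minimizes expected cost."""
    return (cost_fp - cost_tn) / ((cost_fp - cost_tn) + (cost_fn - cost_tp))

def predict_with_threshold(probs, t):
    """Apply a custom threshold instead of the default 0.5."""
    return (np.asarray(probs) >= t).astype(int)
```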
All the models were biospecimen-specific, so to obtain one prediction per participant we had to select a single biospecimen-level prediction. To pick the most reliable probability for each participant we were guided by two criteria. First, we subtracted 0.5 from each probability and took the absolute value, giving a score for how far the model was from 0.5, i.e. how confident the model was in each prediction. Second, we were guided by the time at which the sample was taken, reasoning that samples close to delivery carry more information about whether the patient will have a preterm delivery. For this we computed a score from the collect_week variable: it was stratified into deciles and then scaled between 0.5 and 1. The two scores were multiplied and the biospecimens sorted in descending order for each participant; the prediction with the highest score was selected. The aim of this approach was to give more weight to predictions in which the model was very confident and which were close to the delivery week (see main_v2_1.r).
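The per-participant selection described above can be sketched as follows (the column names are assumptions; the original is implemented in R in main_v2_1.r):

```python
import pandas as pd

def select_participant_prediction(preds):
    """preds: per-biospecimen predictions with columns participant_id,
    prob (model probability) and collect_week. Returns one row per
    participant: confidence = |prob - 0.5|; collect_week is binned
    into deciles and scaled to [0.5, 1]; the product of the two
    scores ranks the biospecimens and the best one is kept."""
    df = preds.copy()
    df["confidence"] = (df["prob"] - 0.5).abs()
    deciles = pd.qcut(df["collect_week"], q=10, labels=False,
                      duplicates="drop")
    df["week_score"] = 0.5 + 0.5 * deciles / deciles.max()
    df["rank_score"] = df["confidence"] * df["week_score"]
    best = df.groupby("participant_id")["rank_score"].idxmax()
    return df.loc[best]
```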
We have developed a two-step model to predict preterm events. We used biological information to remove non-relevant features, based on their correlation with the collection week and on their differential expression between conditions. We observed large differences across parts of the challenge data, owing to the great heterogeneity of the data available to us. Our strategy of scoring predictions by collection week allows us to weight the robustness of each prediction and thereby avoid false positives.