Predicting Risk of Stillbirth and Preterm Pregnancies With Machine Learning
Predicting Risk of Stillbirth and Preterm Pregnancies With Machine Learning
RESEARCH
  Abstract
  Modelling the risk of abnormal pregnancy-related outcomes such as stillbirth and preterm birth have been proposed
  in the past. Commonly they utilize maternal demographic and medical history information as predictors, and they
  are based on conventional statistical modelling techniques. In this study, we utilize state-of-the-art machine learn-
  ing methods in the task of predicting early stillbirth, late stillbirth and preterm birth pregnancies. The aim of this
  experimentation is to discover novel risk models that could be utilized in a clinical setting. A CDC data set of almost
  sixteen million observations was used conduct feature selection, parameter optimization and verification of proposed
  models. An additional NYC data set was used for external validation. Algorithms such as logistic regression, artificial
  neural network and gradient boosting decision tree were used to construct individual classifiers. Ensemble learning
  strategies of these classifiers were also experimented with. The best performing machine learning models achieved
  0.76 AUC for early stillbirth, 0.63 for late stillbirth and 0.64 for preterm birth while using a external NYC test data. The
  repeatable performance of our models demonstrates robustness that is required in this context. Our proposed novel
  models provide a solid foundation for risk prediction and could be further improved with the addition of biochemical
  and/or biophysical markers.
  Keywords: Risk prediction, Stillbirth, Preterm, Machine learning
                                         © The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use,
                                         sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the origi-
                                         nal author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images
                                         or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in
                                         a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not
                                         permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright
                                         holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Koivu and Sairanen Health Inf Sci Syst (2020) 8:14                                                              Page 2 of 12
by the Centers for Disease Control and Prevention (CDC)         clinically significant prediction algorithms for stillbirth
[19]. A similar rate (7.2/1000) was recently reported in a      and other pregnancy complications with complex etiol-
smaller state level cohort study [29]. The rate of stillbirth   ogy. In addition to this, feasible risk prediction with only
is still slowly decreasing as the level of education and        demographic variables has value when methods that
access to pregnancy care have increased and maternal            require biomarker results cannot be utilized due to the
smoking has decreased [34]. On the other hand, average          availability of testing.
age and BMI of pregnant mothers are increasing which
partially counter the positive trends [34].                     Methods and materials
  Various risk models to assess risk for stillbirth have        CDC data set
been proposed; prediction performances range from               Infant birth and death data sets containing pregnancies
0.64 to 0.67 area under the curve (AUC) from receiver           during the years 2013 to 2016 in the United States were
operating characteristic curve (ROC) with maternal              provided by CDC, National Center of Health Statistics
demographics [29, 33], and 0.82 AUC when biophysical            via their National Vital Statistics System [28]. The data
variables were added [11]. While these models are based         sets for different years are de-identified and are publicly
on highly similar maternal demographic and medical              available for statistical analysis, and this research com-
history, they have also relied on classical linear statisti-    plies with the data user agreement provided by the CDC.
cal models, i.e. where the decision boundary is linear. In      State participation and available variables vary from year
this study, we show how machine learning (ML) meth-             to year due to the data formatting and national reporting
ods could further improve the individual risk predic-           policy changes. The basis of these data sets are the yearly
tion beyond the traditional models. Typical statistical         reported birth and death certificates. These data sets
analysis of clinical risk prediction is compared with a         were combined to form the data set for our experimenta-
ML analysis pipeline, where statistically weak variables        tion. From a total of 15,976,537 pregnancies, 15,883,784
are not excluded, and more complex modelling tech-              were live births and 92,753 were infant deaths. This
niques are used. These techniques can learn structure           amounts to an overall prevalence of 0.58% for the preg-
from data without being explicitly programmed to do so          nancies ending in infant death. Whereas 1,532,538 of live
[16]. This also means that a significant amount of data         births were PTB deliveries, a prevalence of 9.6%.
is a crucial factor in creating robust ML models. A data          The data set contained variables that were either not
set with almost sixteen million observations provided           feasible for prediction modelling, for example time of
by the CDC was used in our study, containing pregnan-           birth, or functional variables such as flags for identifying
cies concluded in the United States. This data was used         incomplete national reporting. Initial variable selection
for predictive modelling, analyzing variables for feature       based on both literature [11, 25, 29, 33] and pragmatic
inclusion and experimenting with a set of machine learn-        reasoning was done to reduce the number of meaning-
ing algorithms to produce viable risk models. Also, the         ful demographic, risk factor and infection predictor vari-
models were evaluated with a separate data set consisting       ables to 26. The complete variable listing is documented
of pregnancy data from three years from the New York            in Table 1.
City (NYC) Department of Health and Mental Hygiene.
  In addition to stillbirth outcome, we also included           NYC data set
preterm birth (PTB, pregnancies with delivery before            Limited use birth data sets containing pregnancies during
37 weeks of gestation), another major pregnancy com-            the years 2014 to 2016 in the City of New York were pro-
plication, as an outcome to be evaluated with the inves-        vided by the New York City Department of Health and
tigated methods. The incidence of PTB in the United             Mental Hygiene. The data sets for different years are de-
States is estimated to be 1 in 10 births, making it consid-     identified and IRB approval for the data was acquired for
erably more prevalent when compared to stillbirth [25].         research purposes. Like the CDC data sets, the basis for
These infants have an elevated risk for multiple neona-         this data was collected birth and death certificates. From
tal complications [25]. While risk factors for PTB have         a total of 364,124 pregnancies, 363,560 were live births
been identified, causality is hard to prove because PTB         and 564 were reported as not living when the record was
can occur to women without elevated risk results [25].          created. This amounts to an overall prevalence of 0.15%
Because of this, proposed risk models for PTB have been         for the pregnancies ending in infant death. As for the live
modest to poor, resulting in AUC values of 0.51 to 0.67         births, 31,600 were PTB deliveries, a prevalence of 8.7%.
with evidence of overfitting [20].                                The purpose of the NYC data is to further evaluate the
  Maximizing the prediction power derived from exist-           predictive models created using the CDC data. External
ing maternal characteristics data creates a solid base          validation is crucial for evaluating clinical potential of
for future implementation of biomarkers and closer to           the model, and usually results into reduced performance
Koivu and Sairanen Health Inf Sci Syst (2020) 8:14                                                                     Page 3 of 12
[20]. By applying the same preprocessing and variable              Variable encodings were altered to accommodate
selection steps to both data sets, we were able to achieve       modelling with machine learning algorithms. New vari-
comparable data variables for both CDC and NYC.                  ables were also created. Parity status of the mother was
                                                                 deducted from the number of prior births variable. A
Preprocessing                                                    new class variable was annotated to specify the outcome
The first step of preprocessing was applying inclusion           of the pregnancies more accurately. Fetal death cases
criteria to the observations. The first one was complete-        were divided into late and early stillbirth based on their
ness, i.e. no missing variable values. This was a feasible       gestational age, using the WHO’s definition of 28 weeks
approach due to the substantial size of the data set, so         as the cutoff point. In addition to this, early stillbirth
that value imputation was not needed. Included moth-             cases of less than than 21 weeks were excluded because
ers were either 18 or older and cases of maternal mor-           they are clinically defined as miscarriage cases. The 21
bidity were excluded. Cases with reported gestational age        weeks cutoff was averaged from the WHO’s definition
in alive babies less than 21 weeks were also excluded. It        and ACOG’s definition. Live births were divided into
is safe to assume that these cases are errors in the data        uncomplicated term pregnancies, referred to as normal,
records, because the earliest alive PTB baby in the world        and PTB birth pregnancies. This was done using the
at the time of writing this publication is 21 weeks and 6        gestational week cutoff point of 37. The final number of
days [3]. Multiple birth pregnancies were also excluded.         CDC data observations for the study was 11,901,611 nor-
In addition to this, pregnancies that ended in fetal death       mal pregnancies, 946,301 PTB cases, 7924 early stillbirth
due to external causes were excluded. This was deter-            cases and 8310 late stillbirth cases. The final number of
mined by the U, V, W, X and Y identifier values listed in        NYC data observations was 266,419 normal pregnancies,
the data according to the ICD-10 [31]. Postnatal death           19,203 PTB cases, 139 early stillbirth cases and 110 late
cases were also excluded.                                        stillbirth cases.
Koivu and Sairanen Health Inf Sci Syst (2020) 8:14                                                                           Page 4 of 12
  Continuous predictor variables were standardized                           frequently only includes variables in the modelling phase
by removing the mean (zero-mean normalization) and                           that have statistically significant odds ratios in the uni-
scaling to unit variance (unit-variance normalization).                      variate analysis [29, 33]. While this is a proper way of
Nominal predictor variables were one-hot encoded in                          performing analysis with logistic regression, ML models
the modelling phase to accommodate artificial neu-                           could find beneficial feature dependencies from data to
ral network models. For conducting the whole analysis,                       support decision making that are not detected in the uni-
CDC data was partitioned into four sets; feature selection                   variate analysis [26]. Therefore, only correlation analysis
data, training data, validation data and test data. Feature                  was used to determine the final predictor set for machine
selection data was used exclusively for feature variable                     learning models.
analysis, training data for model training, validation data
for regularization and early stopping while model train-                     Risk prediction modelling
ing, and test data for final model evaluation along with                     Logistic regression (LR), gradient boosting decision tree
the NYC data set. To sustain the class distribution of the                   (GBDT) and two artificial neural network (ANN) mod-
outcome variable, class-stratified random splits of 10%,                     els were used in this study. LR will serve as a baseline for
70%, 10% and 10% were used, respectively. This ensured                       the more complex algorithms due to its simplicity and
insulated data sets for feature selection, model training                    robustness. L2-regularizied logistic regression with lim-
and evaluation phases. The data partitions are described                     ited-memory Broyden–Fletcher–Goldfarb–Shanno
in Table 2, while descriptive statistics of the whole CDC                       (BFGS) parameter optimization was used [22]. Toler-
data set are listed in Table 3.                                              ance for stopping criteria was set to 1.0e−4. Regulari-
                                                                             zation strength C was set to 1.0. The optimal maximum
Feature variable analysis
                                                                             number of iterations was found to be 100.
For the task of predictor variable selection, correlation                       GBDT is a widely implemented machine learning algo-
analysis and univariate analysis were used to determine                      rithm and has demonstrated exceptional performance
the final set of variables. In correlation analysis, all pos-                with bioinformatics tasks [23]. The lightgbm (LGBM) ver-
sible predictor variable pairs were examined for linear                      sion of GBDT algorithm was chosen for our study [12].
dependency to each other with Pearson correlation coef-                      It contains features which provide a substantial increase
ficient. Because highly correlated predictor variables                       in execution speed without losing significant amount
have the same effect on the dependent variable [8], one                      of accuracy. This was appropriate for our research with
of the variables with correlation less than − 0.5 or more                    training data that contained over nine million observa-
than 0.5 was excluded. This is based on the definition                       tions. For modelling, after iterative experimentation the
of moderate correlation [21]. This reduces redundancy                        number of leaves was set to 48, minimal number of data
of the data and produces more robust models. For the                         observations in one leaf to 500, maximum depth of the
task of univariate analysis, logistic regression was used                    tree model was not restricted, shrinkage rate was set to
to assess the impact of individual predictor variables to                    0.001, feature and bagging fractions were set to 1 and
the classification outcome. For all binary classification                    boosting algorithm was chosen to be Gradient Boosting
tasks of case classes, odds ratios, their 2.5% and 97.5%                     Decision Tree. Maximum iterations was set to 2000, and
confidence intervals and p-values were calculated. The                       early stopping after 500 iterations was used and the used
method for calculating p-values was two-tailed Z-score.                      metric for performance was AUC. Different outcomes
In clinical risk model development, the analysis pipeline                    have clinically significant false positive rates based on
                                                                             incidence. True positive rates in those false positive rates
                                                                             could also be used as a metric for performance, however
Table 2 Data observations                                                    initial experimentation showed that there were no signifi-
Data              Normal       Early stillbirth Late stillbirth PTB          cant changes in using them over AUC.
                                                                                For ANN, the first model was a Leaky ReLU-based
Feature selec-     1,178,146 782                  809             93,813     deep two-layer feed-forward neural network that we have
  tion                 (92.5%)   (0.06%)                (0.06%)   (7.36%)
                                                                             previously shown to perform well in the risk prediction
Training           8,331,492 5578                5806             662,026
                       (92.5%)   (0.06%)             (0.06%)       (7.35%)   task of Down’s syndrome [15]. The second was a deep
Validation         1,196,173 796                  845             95,033     feed-forward self-normalizing neural network based on
                       (92.5%)   (0.06%)                (0.07%)   (7.35%)    the scaled exponential linear units (SELU) activation
Test               1,195,800 768                  850             95,429     function, which has been demonstrated to achieve supe-
                       (92.5%)   (0.06%)                (0.07%)   (7.38%)    rior performance to other feed-forward neural networks
NYC                  266,419 139                  110             19,203     [14]. The original publication suggests that deeper archi-
                       (93.2%)   (0.05%)              (0.04%)     (6.72%)
                                                                             tectures yield better performance, so four hidden layers
Koivu and Sairanen Health Inf Sci Syst (2020) 8:14                                                                   Page 5 of 12
Age
 Mean (SD)                                            28.5 (5.65)          28.0 (5.99)        28.3 (6.01)             28.7 (6.02)
 Range                                                18.0–50.0            18.0–50.0          18.0–48.0         18.0–50.0
Race
 White                                                9,170,035 (77.0%)    4701 (59.3%)       5620 (67.6%)      669,375 (70.7%)
 Black                                                1,771,832 (14.9%)    2770 (35.0%)       2231 (26.8%)      205,091 (21.7%)
 American Indian or Alaskan Native                    129,546 (1.1%)       116 (1.5%)         102 (1.2%)          11,962 (1.3%)
 Asian or Pacific Islander                            830,198 (7.0%)       337 (4.3%)         357 (4.3%)          59,873 (6.3%)
Marital status
 Married                                              4,592,768 (38.6%)    4359 (55.0%)       3932 (47.3%)      439,202 (46.4%)
 Not married                                          7,308,843 (61.4%)    3565 (45.0%)       4378 (52.7%)      507,099 (53.6%)
Education
 8th grade or less                                    412,938 (3.5%)       281 (3.5%)         347 (4.2%)          33,244 (3.5%)
 9th through 12th grade with no diploma               1169,038 (9.8%)      1027 (13.0%)       1031 (12.4%)      118,041 (12.5%)
 High school graduate or GED completed                2,985,280 (25.1%)    2667 (33.7%)       2665 (32.1%)      267,006 (28.2%)
 Some college credit, but not a degree                2,574,183 (21.6%)    1865 (23.5%)       1828 (22.0%)      218,090 (23.0%)
 Associate degree                                     995,733 (8.4%)       607 (7.7%)         650 (7.8%)          78,195 (8.3%)
 Bachelor’s degree                                    2,393,391 (20.1%)    984 (12.4%)        1277 (15.4%)      148,341 (15.7%)
 Master’s degree                                      1,065,979 (9.0%)     392 (4.9%)         420 (5.1%)          64,909 (6.9%)
 Doctorate or Professional Degree                     305,069 (2.6%)       101 (1.3%)           92 (1.1%)         18,475 (2.0%)
Number of previous terminations
 Mean (SD)                                            0.40 (0.85)          0.69 (1.20)        0.65 (1.17)             0.52 (1.03)
 Range                                                0.00–30.0            0.00–13.0          0.00–17.0         0.00–27.0
WIC
 No                                                   6,950,753 (58.4%)    5108 (64.5%)       5410 (65.1%)      518,777 (54.8%)
 Yes                                                  4,950,858 (41.6%)    2816 (35.5%)       2900 (34.9%)      427,524 (45.2%)
Smoking before pregnancy
 Nonsmoker                                            10,685,669 (89.8%)   6707 (84.6%)       7179 (86.4%)      817,116 (86.3%)
 1–5                                                  309,680 (2.6%)       329 (4.2%)         300 (3.6%)          31,856 (3.4%)
 6–10                                                 415,142 (3.5%)       438 (5.5%)         402 (4.8%)          43,269 (4.6%)
 11–20                                                419,861 (3.5%)       397 (5.0%)         356 (4.3%)          45,524 (4.8%)
 21–40                                                61,875 (0.5%)        45 (0.6%)            63 (0.8%)          7383 (0.8%)
 41 or more                                           9384 (0.1%)          8 (0.1%)             10 (0.1%)          1153 (0.1%)
BMI
 Mean (SD)                                            26.6 (6.52)          28.6 (7.70)        28.3 (7.51)             27.3 (7.16)
 Range                                                10.5–168             13.7–68.7          10.0–67.4         10.0–125
Height (in.)
 Mean (SD)                                            64.2 (2.84)          64.0 (2.83)        64.0 (2.84)             63.9 (2.87)
 Range                                                30.0–78.0            48.0–78.0          46.0–78.0         34.0–78.0
Weight (pounds)
 Mean (SD)                                            156 (40.5)           167 (47.8)         165 (46.4)              159 (44.3)
 Range                                                75.0–375             75.0–375           75.0–375          75.0–375
Parity
 Nulliparous                                          6,786,170 (57.0%)    5649 (71.3%)       5373 (64.7%)      576,415 (60.9%)
 Parous                                               5,115,441 (43.0%)    2275 (28.7%)       2937 (35.3%)      369,886 (39.1%)
Pre-pregnancy diabetes
 No                                                   11,823,600 (99.3%)   7787 (98.3%)       8135 (97.9%)      922,519 (97.5%)
 Yes                                                  78,011 (0.7%)        137 (1.7%)         175 (2.1%)          23,782 (2.5%)
Koivu and Sairanen Health Inf Sci Syst (2020) 8:14                                                                          Page 6 of 12
Table 3 (continued)
                                                      Normal                  Early stillbirth     Late stillbirth     PTB
Gestational diabetes
 No                                                   11,255,544 (94.6%)      7752 (97.8%)         8019 (96.5%)         868,686 (91.8%)
 Yes                                                  646,067 (5.4%)          172 (2.2%)            291 (3.5%)           77,615 (8.2%)
Pre-pregnancy hypertension
 No                                                   11,737,430 (98.6%)      7662 (96.7%)         8084 (97.3%)         906,296 (95.8%)
 Yes                                                  164,181 (1.4%)          262 (3.3%)            226 (2.7%)           40,005 (4.2%)
Gestational hypertension
 No                                                   11,362 046 (95.5%)      7632 (96.3%)         8010 (96.4%)         817,889 (86.4%)
 Yes                                                  539,565 (4.5%)          292 (3.7%)            300 (3.6%)          128,412 (13.6%)
Hypertension eclampsia
 No                                                   11,883,299 (99.8%)      7895 (99.6%)         8289 (99.7%)         936,176 (98.9%)
 Yes                                                  18,312 (0.2%)           29 (0.4%)              21 (0.3%)           10,125 (1.1%)
Previous preterm birth
 No                                                   11,629,604 (97.7%)      7163 (90.4%)         7733 (93.1%)         854,379 (90.3%)
 Yes                                                  272,007 (2.3%)          761 (9.6%)            577 (6.9%)           91,922 (9.7%)
Infertility treatment
 No                                                   11,784 432 (99.0%)      7774 (98.1%)         8178 (98.4%)         932,874 (98.6%)
 Yes                                                  117,179 (1.0%)          151 (1.9%)            133 (1.6%)           13,427 (1.4%)
Infertility drugs
 No                                                   11,848,427 (99.6%)      7867 (99.3%)         8249 (99.3%)         939,687 (99.3%)
 Yes                                                  53,184 (0.4%)           57 (0.7%)              61 (0.7%)            6614 (0.7%)
ART
 No                                                   11,847,091 (99.5%)      7844 (99.0%)         8250 (99.3%)         940,647 (99.4%)
 Yes                                                  54,520 (0.5%)           80 (1.0%)              60 (0.7%)            5654 (0.6%)
Previous cesarean sections
 No                                                   10,107,199 (84.9%)      6972 (88.0%)         7204 (86.7%)         777,914 (82.2%)
 Yes                                                  1,794,412 (15.1%)       952 (12.0%)          1106 (13.3%)         168 387 (17.8%)
Gonorrhea
 No                                                   11,873,368 (99.8%)      7893 (99.6%)         8271 (99.5%)         942,694 (99.6%)
 Yes                                                  28,243 (0.2%)           31 (0.4%)              39 (0.5%)            3607 (0.4%)
Syphilis
 No                                                   11,892,843 (99.9%)      7915 (99.9%)         8300 (99.9%)         945,206 (99.9%)
 Yes                                                  8768 (0.1%)             9 (0.1%)               10 (0.1%)            1095 (0.1%)
Chlamydia
 No                                                   11,695,524 (98.3%)      7724 (97.5%)         8142 (98.0%)         925,808 (97.8%)
 Yes                                                  206,087 (1.7%)          200 (2.5%)            168 (2.0%)           20 493 (2.2%)
Hepatitis B
 No                                                   11,874,921 (99.8%)      7907 (99.8%)         8298 (99.9%)         944,139 (99.8%)
 Yes                                                  26,690 (0.2%)           17 (0.2%)              12 (0.1%)            2162 (0.2%)
Hepatitis C
 No                                                   11,863,301 (99.7%)      7875 (99.4%)         8273 (99.6%)         939,814 (99.3%)
 Yes                                                  38,310 (0.3%)           49 (0.6%)              37 (0.4%)            6487 (0.7%)
were selected for the SELU network instead of two that                     was used [14]. Adam gradient descent optimization with
was used in our previously published ANN. The number                       0.001 learning rate was used for updating weights [13].
of hidden nodes per layer was set to the number of input                   Sigmoid activation function was utilized as the final node
variables; all of them contained the SELU activation                       for binary classification. 10 epochs with a batch size of
function. Alpha node dropout amount in these nodes                         256 was tested to be optimal.
was set to 15% [14]. LeCun normal weight initialization
Koivu and Sairanen Health Inf Sci Syst (2020) 8:14                                                                             Page 7 of 12
                                                                           f10
                                                                           f11
                                                                           f12
                                                                           f13
                                                                           f14
                                                                           f15
                                                                           f16
                                                                           f17
                                                                           f18
                                                                           f19
                                                                           f20
                                                                           f21
                                                                           f22
                                                                           f23
                                                                           f24
                                                                           f25
                                                                           f26
and PTB, binary classifiers of normal pregnancy vs. case
                                                                           f1
                                                                           f2
                                                                           f3
                                                                           f4
                                                                           f5
                                                                           f6
                                                                           f7
                                                                           f8
                                                                           f9
                                                                      f1                                                               1
were constructed. Because of the class unbalance, class               f2
                                                                      f3
weights w were calculated from the training data set with             f4
                                                                                                                                      0.8
                                                                      f5
                                                                      f6                                                              0.6
      w = s/(c ∗ f (y))                                    (1)        f7
                                                                      f8                                                              0.4
where s is number of samples, c is the number of different            f9
                                                                    f10
classes and f(y) is the frequency of classes in data labels y.      f11                                                               0.2
                                                                    f12
   The performance of classifiers was tested with the par-          f13
                                                                                                                                       0
                                                                    f14
titioned test set and the NYC data set. Folded cross vali-          f15
dation was determined to not be necessary with the data             f16                                                               −0.2
                                                                    f17
of this size. The metric for performance was AUC. In                f18
                                                                                                                                      −0.4
                                                                    f19
addition to this, the true positive rate (TPR) at clinically        f20
                                                                    f21
significant false positive rates (FPR) were estimated from          f22
                                                                                                                                      −0.6
were experimented with, to see if models with different            Fig. 1 Pearson correlation matrix of feature variables
priors and assumptions would complement each other to
form better predictions. In average ensemble, prediction
probabilities were averaged over several models, creat-
ing a new ensembled prediction. In WA, if y is a set of
probabilities created by different models and α is a set of
weights, the weighted sum y is calculated with
                 p
      ỹ(x; a) =     j=1 αj yj (x)                         (2)
reproductive technology (ART) use were correlated to                    notable odds ratios were risk factors of pre-pregnancy
infertility treatment (0.68 and 0.67). This is also to be               diabetes, gestational diabetes, pre-pregnancy hyperten-
expected, because they are alternative forms of infertil-               sion, hypertension eclampsia, previous preterm birth,
ity treatment. Figure 2 shows that observations marked                  infertility treatment and assisted reproductive technol-
for infertility drugs and ART are always a member of                    ogy and marital status.
the set of infertility treatment. The presence of 10,660                  The results for predicting late stillbirth showed that 14
observations (0.08% of all study data) where treatment                  variables had a statistically significant odds ratio. Notable
is marked but drugs or ART use are not suggests either                  ones were risk factors of pre-pregnancy diabetes, gesta-
the use of other undocumented infertility treatment pro-                tional diabetes, pre-pregnancy hypertension, gestational
cedures such as myomectomy surgeries [18], incomplete                   hypertension, previous preterm birth, infertility treat-
documentation in some data collection areas, poor data                  ment and assisted reproductive technology and marital
quality or a combination of the three. Because the data is              status. For PTB prediction, only infection of Hepatitis B
de-identified, we can only speculate the underlying effect,             was found not statistically significant. The same variables
so the three infertility-related variables were included in             have notable odds ratio when compared to predicting
the set of features. No other significant correlations were             late stillbirth and early stillbirth, but in addition variables
found, i.e. less than − 0.5 or more than 0.5.                           such as infections seem to have a bigger impact in PTB.
                                                                        The final variable sets for logistic multivariate modelling
Univariate analysis                                                     are highlighted in Table 4.
For predicting early stillbirth, logistic univariate analysis
revealed that from the 26 selected predictor variables,
18 had a statistically significant odds ratio, i.e. p < 0.05.
Table 4 shows that from the 17 variables, the ones with
Age                                      0.98 (0.97, 1.00)       0.01    1.00 (0.98, 1.01)             0.43   1.01 (1.00, 1.01)        < 0.01
Race                                     1.18 (1.09, 1.26)   < 0.01      1.00 (0.91, 1.08)             0.95   1.08 (1.07, 1.09)        < 0.01
Marital status                           0.50 (0.44, 0.58)   < 0.01      0.71 (0.61, 0.81)         < 0.01     0.72 (0.71, 0.73)        < 0.01
Education                                0.84 (0.80, 0.87)   < 0.01      0.88 (0.84, 0.92)         < 0.01     0.91 (0.91, 0.92)        < 0.01
Number of previous terminations          1.27 (1.21, 1.32)   < 0.01      1.24 (1.19, 1.30)         < 0.01     1.15 (1.14, 1.15)        < 0.01
WIC                                      0.80 (0.69, 0.93)   < 0.01      0.81 (0.70, 0.93)         < 0.01     1.17 (1.15, 1.19)        < 0.01
Smoking before pregnancy                 1.23 (1.14, 1.32)   < 0.01      1.06 (0.96, 1.15)             0.22   1.14 (1.13, 1.15)        < 0.01
BMI                                      1.04 (1.04, 1.05)   < 0.01      1.04 (1.03, 1.05)         < 0.01     1.02 (1.01, 1.02)        < 0.01
Height                                   0.98 (0.95, 1.00)       0.07    0.95 (0.93, 0.97)         < 0.01     0.96 (0.96, 0.97)        < 0.01
Parity                                   0.83 (0.76, 0.90)   < 0.01      0.91 (0.84, 0.98)             0.02   0.96 (0.95, 0.97)        < 0.01
Pre-pregnancy diabetes                   2.95 (1.69, 4.73)   < 0.01      3.43 (2.07, 5.31)         < 0.01     3.87 (3.69, 4.05)        < 0.01
Gestational diabetes                     0.34 (0.20, 0.55)   < 0.01      0.51 (0.33, 0.75)         < 0.01     1.56 (1.52, 1.60)        < 0.01
Pre-pregnancy hypertension               2.66 (1.78, 3.80)   < 0.01      1.91 (1.20, 2.86)         < 0.01     3.15 (3.04, 3.26)        < 0.01
Gestational hypertension                 0.87 (0.60, 1.23)       0.46    0.68 (0.44, 0.98)             0.05   3.37 (3.30, 3.44)        < 0.01
Hypertension eclampsia                   4.16 (1.49, 8.99)   < 0.01      2.41 (0.60, 6.26)             0.13   7.14 (6.61, 7.71)        < 0.01
Previous preterm births                  5.13 (4.06, 6.39)   < 0.01      3.35 (2.54, 4.33)         < 0.01     4.56 (4.44, 4.67)        < 0.01
Infertility treatment                    2.39 (1.44, 3.69)   < 0.01      2.18 (1.29, 3.40)         < 0.01     1.47 (1.38, 1.55)        < 0.01
Infertility drugs                        1.73 (0.68, 3.52)       0.18    1.95 (0.84, 3.79)             0.08   1.65 (1.53, 1.79)        < 0.01
ART                                     2.56 (1.23, 4.64)       0.01    2.48 (1.19, 4.49)             0.01   1.23 (1.13, 1.35)        < 0.01
Previous cesarean sections               0.73 (0.59, 0.91)       0.01    0.87 (0.71, 1.06)             0.18   1.21 (1.19, 1.23)        < 0.01
Gonorrhea                                1.09 (0.18, 3.37)       0.90    2.11 (0.65, 4.92)             0.14   1.64 (1.46, 1.82)        < 0.01
Syphilis                                 1.70 (0.10, 7.49)       0.60    < 0.01 (< 0.01, < 0.01)       0.94   1.67 (1.37, 2.01)        < 0.01
Chlamydia                                1.09 (0.63, 1.75)       0.73    1.42 (0.88, 2.14)             0.12   1.25 (1.19, 1.31)        < 0.01
Hepatitis B                              1.71 (0.43, 4.46)       0.35    < 0.01 (< 0.01, < 0.01)       0.93   1.04 (0.91, 1.19)           0.56
Hepatitis C                              1.99 (0.71, 4.29)       0.13    1.15 (0.29, 2.99)             0.81   2.05 (1.88, 2.23)        < 0.01
Koivu and Sairanen Health Inf Sci Syst (2020) 8:14                                                                          Page 9 of 12
Logistic regression         0.73 (0.71, 0.74)         38           0.58 (0.55, 0.60)     15           0.64 (0.64, 0.64)        27
Deep NN                     0.73 (0.72, 0.75)         37           0.57 (0.54, 0.60)     16           0.66 (0.66, 0.66)        30
SELU network                0.75 (0.73, 0.76)         40           0.59 (0.56, 0.62)     17           0.67 (0.66, 0.67)        31
LGBM                        0.75 (0.74, 0.77)         39           0.60 (0.58, 0.63)     17           0.67 (0.67, 0.67)        31
Averaged ensemble           0.75 (0.74, 0.77)         39           0.60 (0.57, 0.62)     18           0.67 (0.66, 0.67)        31
WA ensemble                 0.75 (0.74, 0.77)         40           0.60 (0.58, 0.63)     19           0.67 (0.67, 0.67)        31
Logistic regression         0.74 (0.69, 0.78)         37           0.61 (0.56, 0.66)     18           0.62 (0.61, 0.62)        22
Deep NN                     0.74 (0.70, 0.77)         37           0.54 (0.49, 0.59)     15           0.63 (0.63, 0.64)        24
SELU network                0.76 (0.73, 0.79)         38           0.59 (0.54, 0.65)     15           0.64 (0.63, 0.64)        24
LGBM                        0.76 (0.70, 0.79)         37           0.61 (0.55, 0.67)     22           0.64 (0.63, 0.64)        24
Averaged ensemble           0.75 (0.72, 0.79)         38           0.63 (0.57, 0.68)     21           0.63 (0.63, 0.64)        22
WA ensemble                 0.76 (0.71, 0.79)         38           0.62 (0.56, 0.67)     26           0.63 (0.63, 0.64)        23
Koivu and Sairanen Health Inf Sci Syst (2020) 8:14                                                                              Page 10 of 12
                                                                                 Conclusions
                                                                                 Beside various earlier studies [7, 15, 27, 30], this inves-
                                                                                 tigation further establishes the role of machine learning
                                                                                 models as tools that can generate risk prediction models
                                                                                 that show improved clinical prediction power over a mul-
                                                                                 tivariate logistic regression model, which was used as a
                                                                                 control and represents the current standard method. Fur-
                                                                                 thermore, the ensemble models across three algorithms
                                                                                 further improved the performance with late stillbirth.
                                                                                   We were able to improve performance, TPR at 10% FPR
                                                                                 and AUC on average by 3% and 0.02, respectively, with
                                                                                 the CDC test set over all case conditions. The improve-
                                                                                 ment was repeatable with the external NYC data set,
                                                                                 where TPR was improved by 4% on average and AUC
                                                                                 by 0.02 with ML methods. For early stillbirth, the best
                                                                                 performing model was SELU network, while averaged
                                                                                 ensemble was best for late stillbirth, and for PTB LGBM
                                                                                 and SELU network performed the best. Compared to
                                                                                 other published models, for late stillbirth our model dis-
                                                                                 played similar performance. For PTB, we were able to
                                                                                 create similar performance but with no empirical evi-
                                                                                 dence of overfitting, due to the external validation. There
                                                                                 are currently no prior models published for early still-
                                                                                 birth prediction, making our SELU network novel.
                                                                                   Multiclass implementations of the algorithms used in
                                                                                 this study, i.e. one model predicting all four classes, were
                                                                                 experimented upon but were excluded from further anal-
                                                                                 ysis due to inferior performance with late stillbirth and
                                                                                 early stillbirth classification. It was suspected that the
                                                                                 unbalanced class distribution in the study data caused
                                                                                 the multiclass models to favor the two proportionally
                                                                                 biggest classes, normal and PTB pregnancies, even when
                                                                                 class weights were used. In practice, the model’s binary
                                                                                 results would be interpreted in a hierarchical manner;
                                                                                 from most hazardous to life, early stillbirth, to the least,
                                                                                 PTB.
                                                                                   Variables increasing the risk for late stillbirth included
                                                                                 increased age and BMI, previous pregnancies with
                                                                                 adverse effect, various comorbidities and having an ART
                                                                                 pregnancy. Compared to stillbirth the biggest difference
                                                                                 to PTB birth was seen on infectious diseases that are
                                                                                 known to be involved in about 25–30% of the PTB preg-
                                                                                 nancy cases [6]. On the other hand, education level had
                                                                                 great positive effect on lowering the risk for these adverse
                                                                                 events of pregnancy.
                                                                                   Our SELU network experiments showed no con-
                                                                                 crete improvement in adding hidden layers beyond four,
                                                                                 despite what was stated in the original SELU publication
                                                                                 [14]. Iterating more or less epochs with smaller batch
  Fig. 3 Weight grid searches of early stillbirth (a), late stillbirth (b) and
  preterm (c) for WA ensemble. Color is determined by the calculated             sizes also did not significantly improve prediction per-
  AUC of the ensemble                                                            formance. Substantial class imbalance in our data sets is
                                                                                 suspected to cause this, more experimentation is needed
                                                                                 in the future to fully understand the effect. As for LGBM,
Koivu and Sairanen Health Inf Sci Syst (2020) 8:14                                                                                                  Page 11 of 12
training the model was frequently stopped early after 500                          References
                                                                                   1. Acog practice bulletin no. 102: Management of stillbirth. Obstet Gynecol.
iterations, indicating that going beyond this would start                              2009;113(3):748–61 .
to decrease validation data performance. Using AUC as                              2. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS,
the stopping criteria for early stopping could also poten-                             Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G,
                                                                                       Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mane D,
tially inhibit us from generating a model that produces                                Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sut-
the most optimal TPR at 10% FPR, because AUC as a                                      skever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viegas F, Vinyals O,
metric takes into consideration the whole FPR range.                                   Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X. Tensorflow: large-scale
                                                                                       machine learning on heterogeneous distributed systems. 2016. arXiv
   We want to highlight that due to the large amount of                                :1603.04467.
data there were enough samples for independent selec-                              3. BBC News. Most-premature baby allowed home. 2007. http://news.bbc.
tion, training, validation and test sets. The original data,                           co.uk/2/hi/americas/6384621.stm
                                                                                   4. Blencowe H, Cousens S, Jassir FB, Say L, Chou D, Mathers C, Hogan D,
i.e. medical records, are not available to us, so it is not                            Shiekh S, Qureshi ZU, You D, Lawn JE. National, regional, and worldwide
feasible to estimate the quality and integrity of the data                             estimates of stillbirth rates in 2015, with trends from 2000: a systematic
used in this study. However, because the data contains                                 analysis. Lancet Glob Health. 2016;4(2):e98–108.
                                                                                   5. Chollet FK. 2015. https://github.com/fchollet/keras.
observations from multiple years, regions and hospitals,                           6. Goldenberg RL, Culhane JF, Iams JD, Romero R. Epidemiology and causes
the number of random artifacts such as incorrect data                                  of preterm birth. Lancet. 2008;371(9606):75–84.
entry should be reduced to insignificant levels.                                   7. Gürgen F, Zengin Z, Varol F. Intrauterine growth restriction (IUGR)
                                                                                       risk decision based on support vector machines. Expert Syst Appl.
   The machine learning models used in this study pro-                                 2012;39(3):2872–6.
vide a solid basis for adding biochemical and/or biophysi-                         8. Hall MA. Correlation-based feature selection for discrete and numeric
cal markers to further improve sensitivity and specificity                             class machine learning. In: Proceedings of the seventeenth international
                                                                                       conference on machine learning, ICML ’00. San Francisco: Morgan Kauf-
of these risk models. Further research would also include                              mann Publishers Inc.; 2000. p. 359–366.
inspecting the reproducibility of the results beyond the                           9. Hui-Lung H. Fetal death rate in the united states, 1979–1990: trend and
population of the United States with different ML mod-                                 racial disparity. Obstet Gynecol. 1997;89(1):33–9.
                                                                                   10. Jones E, Oliphant T, Peterson P. Scipy: open source scientific tools for
els. The best machine learning models (SELU network,                                   python; 2001.
LGBM and averaged ensemble) were able to produce                                   11. Kayode GA, Grobbee DE, Amoakoh-Coleman M, Adeleke IT, Ansah E, de
repeatable performance over two data sets. Using these                                 Groot JAH, Klipstein-Grobusch K. Predicting stillbirth in a low resource
                                                                                       setting. BMC Pregnancy Childbirth. 2016;16(1):1–8.
machine learning models, especially for early stillbirth,                          12. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu TY. Lightgbm:
could provide earlier identification of at-risk pregnancies                            a highly efficient gradient boosting decision tree. In: Guyon I, Luxburg
with high accuracy and provide tools for better utiliza-                               UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, editors.
                                                                                       Advances in neural information processing systems, vol. 30. Red Hook:
tion of healthcare resources targeted to those needing it                              Curran Associates, Inc.; 2017. p. 3146–54.
most.                                                                              13. Kingma DP, Ba J. Adam: a method for stochastic optimization. 2014. arXiv
                                                                                       :1412.6980.
Acknowledgements                                                                   14. Klambauer G, Unterthiner T, Mayr A, Hochreiter S. Self-normalizing neural
Open access funding provided by University of Turku (UTU) including Turku              networks. In: Advances in neural information processing systems; 2017. p.
University Central Hospital.                                                           971–980.
                                                                                   15. Koivu A, Korpimäki T, Kivelä P, Pahikkala T, Sairanen M. Evaluation of
Funding                                                                                machine learning algorithms for improved risk assessment for down’s
There was no funding for this research.                                                syndrome. Comput Biol Med. 2018;98:1–7.
                                                                                   16. Koza JR, Bennett FH, Andre D, Keane MA. Automated design of both the
Compliance with ethical standards                                                      topology and sizing of analog electrical circuits using genetic program-
                                                                                       ming. In: Artificial intelligence in design ’96. Dordrecht: Springer; 1996. p.
Conflict of interest                                                                   151–70.
The authors declare that they have no conflict of interest.                        17. Kuhn M. Building predictive models inRUsing thecaretPackage. J Stat
                                                                                       Softw. 2008;28(5):1–26.
Ethical approval                                                                   18. Li T, Mortimer R, Cooke I. Myomectomy: a retrospective study to examine
All procedures performed in studies involving human participants were in               reproductive performance before and after surgery. Hum Reprod.
accordance with the ethical standards of the institutional and/or national             1999;14(7):1735–40.
research committee (include name of committee + reference number) and              19. MacDorman MF, Gregory EC. Fetal and perinatal mortality: United States,
with the 1964 Helsinki declaration and its later amendments or comparable              2013. Hyattsville: National Center for Health Statistics; 2015.
ethical standards. The data provided by the National Center of Health Statistics   20. Meertens LJ, van Montfort P, Scheepers HC, van Kuijk SM, Aardenburg
is publicly available for research. For the analysis of data provided by the           R, Langenveld J, van Dooren IM, Zwaan IM, Spaanderman ME, Smits LJ.
New York City Department of Health and Mental Hygiene, IRB approval was                Prediction models for the risk of spontaneous preterm birth based on
granted.                                                                               maternal characteristics: a systematic review and independent external
                                                                                       validation. Acta Obstet Gynecol Scand. 2018;97(8):907–20.
Author details                                                                     21. Mukaka MM. Statistics corner: a guide to appropriate use of correlation
1
  Department of Future Technologies, University of Turku, 20500 Turku, Finland.        coefficient in medical research. Malawi Med J. 2012;24(3):69–71.
2
  PerkinElmer Wallac Oy, 20750 Turku, Finland.                                     22. Nocedal J. Updating quasi-Newton matrices with limited storage. Math.
                                                                                       Comput. 1980;35(151):773.
Received: 20 December 2019 Accepted: 11 March 2020                                 23. Olson RS, Cava WL, Mustahsan Z, Varik A, Moore JH. Data-driven advice
Published online: 25 March 2020                                                        for applying machine learning to bioinformatics problems. In: Biocom-
                                                                                       puting 2018. Singapore: World Scientific; 2017.
Koivu and Sairanen Health Inf Sci Syst (2020) 8:14                                                                                                    Page 12 of 12
24. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O,            30. Vovsha I, Rajan A, Salleb A, Raja A, Radeva A, Diab H, Tomar A, Wapner R.
    Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A,              Predicting preterm birth is not elusive: Machine learning paves the way
    Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: Machine               to individual wellness. In: AAAI spring symposium—technical report;
    learning in python. J Mach Learn Res. 2011;12:2825–30.                              2014. p. 82–89.
25. Purisch SE, Gyamfi-Bannerman C. Epidemiology of preterm birth. Semin            31. World Health Organization (WHO). ICD-10 revision. Geneva: World Health
    Perinatol. 2017;41(7):387–91.                                                       Organization; 2014.
26. Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in       32. Wickham H, Hester J, Francois R. readr: read rectangular text data. 2017.
    bioinformatics. Bioinformatics. 2007;23(19):2507–17.                                https://CRAN.R-project.org/package=readr, r package version 1.1.1.
27. Tran T, Luo W, Phung D, Morris J, Rickard K, Venkatesh S. Preterm birth         33. Yerlikaya G, Akolekar R, McPherson K, Syngelaki A, Nicolaides KH. Predic-
    prediction: stable selection of interpretable rules from high dimensional           tion of stillbirth from maternal demographic and pregnancy characteris-
    data. In: Doshi-Velez F, Fackler J, Kale D, Wallace B, Wiens J, editors. Pro-       tics. Ultrasound Obstet Gynecol. 2016;48(5):607–12.
    ceedings of the 1st machine learning for healthcare conference, PMLR,           34. Zhang X, Kramer M. Temporal trends in stillbirth in the united states,
    proceedings of machine learning research, vol 56. Los Angeles: Children’s           1992-2004: a population-based cohort study. BJOG Int J Obstet Gynaecol.
    Hospital LA; 2016. p. 164–77.                                                       2014;121(10):1229–36.
28. The Centers for Disease Control and Prevention. National vital statistics
    system. 2014. https://www.cdc.gov/nchs/nvss/.                                  Publisher’s Note Springer Nature remains neutral with regard to
29. Trudell AS, Tuuli MG, Colditz GA, Macones GA, Odibo AO. A stillbirth            jurisdictional claims in published maps and institutional affiliations.
    calculator: development and internal validation of a clinical prediction
    model to quantify stillbirth risk. PLoS ONE. 2017;12(3):e0173461.