Article
1 Department of Statistics, Faculty of Physical Sciences, University of Ilorin, Ilorin 1515, Nigeria;
aliu.tayo30@gmail.com
2 Department of Mathematics, Faculty of Science, Taibah University, Al-Madinah Al-Munawara 42353,
Saudi Arabia; jlohibi@taibahu.edu.sa (J.A.); aahharbi@taibahu.edu.sa (A.A.A.);
nmharbi@taibahu.edu.sa (N.M.A.)
* Correspondence: olaniran.or@unilorin.edu.ng
† These authors contributed equally to this work.
Abstract: This paper proposes a novel two-stage ensemble framework combining Long
Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) with randomized feature
selection to enhance diabetes prediction accuracy and calibration. The method first trains
multiple LSTM/BiLSTM base models on dynamically sampled feature subsets to promote
diversity, followed by a meta-learner that integrates predictions into a final robust output.
A systematic simulation study reveals that the feature selection proportion critically
impacts generalization: mid-range values (0.5–0.8 for LSTM; 0.6–0.8 for BiLSTM) optimize
performance, while values close to 1 induce overfitting. Furthermore, real-life data evalua-
tion on three benchmark datasets—Pima Indian Diabetes, Diabetic Retinopathy Debrecen,
and Early Stage Diabetes Risk Prediction—revealed that the framework achieves state-
of-the-art results, surpassing conventional (random forest, support vector machine) and
recent hybrid frameworks with an accuracy of up to 100%, AUC of 99.1–100%, and superior
calibration (Brier score: 0.006–0.023). Notably, the BiLSTM variant consistently outperforms
unidirectional LSTM in the proposed framework, particularly in sensitivity (98.4% vs. 97.0%
on retinopathy data), highlighting its strength in capturing temporal dependencies.
1. Introduction
Deep learning (DL) models, particularly Long Short-Term Memory (LSTM) networks, have become prominent tools for sequential prediction tasks. Unlike traditional recurrent neural networks (RNNs), which suffer from vanishing
gradients during backpropagation, LSTM architectures mitigate this issue through gated
memory cells [1,4–8]. Bi-LSTM extends this capability by processing data in both forward
and backward directions, enhancing sensitivity to temporal patterns [4]. Despite these
advantages, diabetes research has predominantly relied on traditional ML models, which
focus on static datasets and yield limited generalizability [9–13].
A paradigm shift is emerging in medical AI, with hybrid DL approaches demonstrating
superior performance. For example, Sun et al. [2] employed Bi-LSTM to predict blood
glucose levels, outperforming autoregressive (ARIMA) and support vector regression
(SVR) models in reducing root mean square error (RMSE). Similarly, Kusuma et al. [14]
combined convolutional and LSTM networks (CNN-LSTM) to achieve 99.5% accuracy in
heart failure prediction using ECG signals, highlighting the potential of hybrid architectures
for clinical applications. Cheng et al. [4,15] further advanced this trend with a knowledge-
extended CNN (KE-CNN) for diabetes prediction, achieving 95.8% accuracy through entity
recognition and feature selection.
Extreme Learning Machines (ELMs) have also gained traction, offering rapid training
speeds and low mean squared error (MSE). Pangaribuan et al. [16] demonstrated ELM’s
efficacy in diabetes diagnosis (MSE: 0.4036), while Elsayed et al. [17,18] achieved 98.1%
accuracy in early-stage risk prediction. However, such studies often rely on small, non-
representative datasets, limiting generalizability. Hybrid models like CNN-LSTM-SVM [19,20]
and weather-predictive LSTM [21] further illustrate the versatility of sequential learning but
face challenges in scalability and gradient management [3,22].
Apart from single-stage methods that solely utilize classification procedures, several
hybrid methods have emerged that combine feature selection with classification techniques.
For instance, ref. [23] integrated particle swarm optimization (PSO) for feature optimization
with the Fuzzy Clustering Model (FCM). Similarly, ref. [24] employed principal component
analysis (PCA) in conjunction with K-means clustering [25] for feature selection, subse-
quently using these selected features to inform logistic regression predictions. In a related
approach, ref. [26] harnessed the strengths of variational autoencoders (VAEs) for sample
data augmentation and sparse autoencoders (SAEs) for feature augmentation, feeding the
results into a convolutional neural network (CNN) for prediction. Additionally, ref. [27]
selected key features (KFs) before integrating them into an ensemble framework for predic-
tions. Ref. [7] adopted the Boruta feature selection algorithm, combining it with ensemble
learning to predict diabetes.
While effective, these methods, with the exception of the Boruta approach by [7],
primarily belong to the broad category of filter methods, which aim to identify key features
prior to classification. These filter methods often rely on a greedy learning strategy that
focuses on the most relevant features. While this approach can be effective in smooth feature
spaces devoid of interactions between relevant and non-relevant features, it may falter
in more complex scenarios, leading to difficulties in accurately identifying key features
and, consequently, increasing false positives and diminishing prediction accuracy. In
contrast, the Boruta feature selection algorithm, overlaid on a random forest procedure,
exemplifies a wrapper technique. This method considers a more comprehensive feature
space by randomly sampling features, thereby creating a sampling distribution that fosters
diversity among base learners. This framework inspired the random feature technique
utilized in the first stage of our proposed hybrid LSTM and BiLSTM models. To address
the limitations inherent in existing feature selection methods, we combined the predictions
from base models trained on random features using a stacking approach, enhancing overall
prediction accuracy.
Despite these advancements, critical gaps remain. Many existing approaches, includ-
ing Random Weighted LSTM (RWL) [28], have been validated on limited datasets such as
the Pima Indian cohort, which constrains their clinical applicability. Additionally, issues
such as vanishing gradients and computational inefficiencies impede real-time deployment
in clinical settings. To overcome these challenges, we propose the Random Feature LSTM
and BiLSTM (RFLSTM and RFBiLSTM) frameworks. These frameworks integrate dynamic
feature selection and model stacking, optimizing feature diversity while leveraging tem-
poral processing to enhance computational efficiency and generalizability. As a result,
RFLSTM and RFBiLSTM present a robust solution for diabetes prediction and broader
healthcare applications.
In the first stage, each base model $M_i$ is trained on a randomly sampled subset of features:
$$X_{\text{train}}^{(i)} \subseteq X_{\text{train}}, \quad \text{where } \dim\!\big(X_{\text{train}}^{(i)}\big) = n_{\text{features}}, \tag{1}$$
with the LSTM recurrence and sigmoid output
$$h_t = \mathrm{LSTM}(X_t, h_{t-1}; \theta), \tag{2}$$
$$\hat{y}_t = \sigma(W \cdot h_t + b), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}. \tag{3}$$
Each trained base model then produces predictions on its own feature subset:
$$\hat{y}_{\text{train}}^{(i)} = M_i\!\big(X_{\text{train}}^{(i)}\big), \qquad \hat{y}_{\text{test}}^{(i)} = M_i\!\big(X_{\text{test}}^{(i)}\big).$$
The base-model outputs are collected into the meta-feature matrix
$$X_{\text{meta}} = \big[X_{\text{train}},\ \hat{Y}_{\text{train}}\big], \tag{4}$$
where Ŷtrain contains the predictions from the Stage 1 models. The final prediction is obtained by training another LSTM, the stacking model Mstack, on Xmeta:
$$\hat{y}_{\text{final}} = M_{\text{stack}}(X_{\text{meta}}).$$
Theorem 1 (lower misclassification error of Random Feature LSTM). Let D be a data dis-
tribution over feature vectors X ∈ R p and labels y ∈ {0, 1}. Let f LSTM denote a standard LSTM
classifier trained on all p features and f RF-LSTM denote the Random Feature LSTM ensemble with
nmodels base LSTMs trained on subsets of nfeatures = ⌊η p⌋ features (η ≥ 0.5), followed by a stacking
LSTM. Use the following assumptions:
1. Diversity: The base LSTMs’ prediction errors are not perfectly correlated due to random
feature selection.
2. Optimality: The stacking LSTM can approximate the optimal combination of base predictions
and original features.
Then, the misclassification error rate of f RF-LSTM is bounded above by that of f LSTM :
ϵRF-LSTM ≤ ϵLSTM
and then the bias–variance decomposition under the squared loss can be obtained using
$$\epsilon(f) = \underbrace{\mathbb{E}_X\!\left[\big(\mathbb{E}[f(X)] - \mathbb{E}[y \mid X]\big)^2\right]}_{\text{Bias}^2(f)} + \underbrace{\mathbb{E}_X\!\left[\operatorname{Var}(f(X))\right]}_{\operatorname{Var}(f)} + \underbrace{\mathbb{E}_X\!\left[\operatorname{Var}(y \mid X)\right]}_{\sigma^2}. \tag{5}$$
We now proceed with the main results, where we assume that the Random Feature
ensemble will reduce the variance of prediction. That is, for base models M1 , . . . , Mnmodels
with predictions Ŷ (i) ,
$$\operatorname{Var}\!\left(\frac{1}{n_{\text{models}}}\sum_{i=1}^{n_{\text{models}}}\hat{Y}^{(i)}\right) = \frac{1}{n_{\text{models}}^{2}}\left(\sum_{i=1}^{n_{\text{models}}}\operatorname{Var}\!\big(\hat{Y}^{(i)}\big) + 2\sum_{i<j}\operatorname{Cov}\!\big(\hat{Y}^{(i)}, \hat{Y}^{(j)}\big)\right).$$
Under the diversity assumption, the covariance terms are negligible, so the variance of the averaged base predictions satisfies
$$\operatorname{Var}_{\text{base}} \approx \frac{1}{n_{\text{models}}}\operatorname{Var}(M_i).$$
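To see why the diversity assumption matters, consider the standard special case (an illustrative assumption, not part of the original derivation) in which every base prediction has a common variance $\sigma_b^2$ and a common pairwise correlation $\rho$. The expression above then simplifies to
$$\operatorname{Var}\!\left(\frac{1}{n_{\text{models}}}\sum_{i=1}^{n_{\text{models}}}\hat{Y}^{(i)}\right) = \rho\,\sigma_b^2 + \frac{(1-\rho)\,\sigma_b^2}{n_{\text{models}}},$$
so with highly diverse, nearly uncorrelated base models ($\rho \approx 0$) the ensemble variance approaches $\sigma_b^2 / n_{\text{models}}$, which is exactly the approximation used here; with perfectly correlated models ($\rho = 1$) no variance reduction is obtained.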
Now we proceed with the second stage of the algorithm, where we propose a stacked prediction. The stacking model receives the enhanced input
$$X_{\text{meta}} = \big[X_{\text{train}}, \hat{Y}^{(1)}, \ldots, \hat{Y}^{(n_{\text{models}})}\big],$$
and its squared bias can be written as
$$\text{Bias}^2_{\text{RF-LSTM}} = \text{Bias}^2_{\text{LSTM}} + \delta_{\text{stack}},$$
where δstack ≤ 0 due to the stacking LSTM's capacity to reduce residual bias through the meta-features.
Finally, combining the results, we have
$$\epsilon_{\text{RF-LSTM}} = \underbrace{\text{Bias}^2_{\text{LSTM}} + \delta_{\text{stack}}}_{\text{Bias}^2_{\text{RF-LSTM}}} + \underbrace{\frac{\operatorname{Var}_{\text{LSTM}}}{n_{\text{models}}}}_{\operatorname{Var}_{\text{RF-LSTM}}} + \sigma^2.$$
Remark 1.
1. Strict inequality ϵRF-LSTM < ϵLSTM holds with non-zero diversity: RF-LSTM outperforms a
single LSTM as long as the individual LSTM models exhibit sufficient diversity, which enables
the ensemble to combine complementary information and reduce variance.
2. This requires regularization for Mstack to prevent overfitting on Xmeta : The meta-classifier in
the RF-LSTM framework must be regularized to avoid overfitting to the outputs of the LSTM
models, ensuring that the ensemble generalizes well to unseen data.
Corollary 1 (higher accuracy of Random Feature LSTM). Let the accuracy A( f ) of a classifier
f be defined as
A ( f ) = 1 − ϵ ( f ),
where ϵ( f ) is the misclassification error. Under the same conditions as Theorem 1, the accuracy of
the Random Feature LSTM (ARF-LSTM ) satisfies the following inequality:
ARF-LSTM ≥ ALSTM .
Proof. From Theorem 1, the misclassification error rate of f RF-LSTM is bounded above by
that of f LSTM , i.e.,
ϵRF-LSTM ≤ ϵLSTM .
Subtracting both sides from 1 and reversing the inequality gives
1 − ϵRF-LSTM ≥ 1 − ϵLSTM .
Simplifying,
ARF-LSTM ≥ ALSTM .
This establishes that the Random Feature LSTM achieves accuracy at least as high as
the standard LSTM, with the equality holding when the diversity or stacking optimization
conditions are not met.
Remark 2.
1. The strict inequality ARF-LSTM > ALSTM holds when the individual LSTM models exhibit
sufficient diversity and the stacking LSTM effectively combines their predictions to reduce
variance and bias.
2. Regularization of the stacking model (Mstack ) ensures that overfitting to Xmeta does not degrade
the ensemble’s generalization, preserving the accuracy advantage of RF-LSTM.
The RF-BiLSTM variant mirrors this construction, with BiLSTM base models $M_i^{\text{Bi}}$ trained on random feature subsets and a BiLSTM stacking model:
$$\hat{y}_{\text{train}}^{(i)} = M_i^{\text{Bi}}\!\big(X_{\text{train}}^{(i)}\big), \qquad \hat{y}_{\text{test}}^{(i)} = M_i^{\text{Bi}}\!\big(X_{\text{test}}^{(i)}\big),$$
$$\hat{y}_{\text{final}}^{\text{Bi}} = M_{\text{stack}}^{\text{Bi}}\!\big(X_{\text{meta}}^{\text{Bi}}\big).$$
The results for the Random Feature LSTM ensemble (RF-LSTM) extend naturally to
the Random Feature BiLSTM ensemble (RF-BiLSTM). We formally present the theorem,
corollary, and their proofs below.
Theorem 2 (Misclassification error bound for RF-BiLSTM). Let FBiLSTM denote the hypothesis
class of a single BiLSTM model trained on a dataset D , and let FRF-BiLSTM denote the hypothesis
class of an RF-BiLSTM ensemble constructed by averaging predictions over M independently
trained BiLSTM models, each using a random subset of input features. Under the assumption of
independent errors across individual (diversity) BiLSTM models, the expected classification error
ϵRF-BiLSTM of the ensemble satisfies
ϵRF-BiLSTM ≤ ϵBiLSTM .
When random feature ensembles are applied, the variance reduction achieved by averaging predictions across M models further decreases the classification error, mirroring the bias–variance argument used for RF-LSTM in Theorem 1.
Corollary 2 (classification accuracy for RF-BiLSTM). Let ALSTM , ABiLSTM , ARF-LSTM , and
ARF-BiLSTM represent the classification accuracies of their respective models. Then, the following
inequalities hold:
ARF-BiLSTM ≥ ABiLSTM ≥ ARF-LSTM ≥ ALSTM .
where, in each case, A = 1 − ϵ.
Remark 3. The results in Theorem 2 and Corollary 2 build upon Theorem 1 and Corollary 1 by
highlighting the hierarchical relationship between LSTM and BiLSTM architectures in the context of
random feature ensembles. Specifically, the bidirectional nature of BiLSTM amplifies the advantages
of ensemble modelling, including reduced variance and increased robustness to overfitting. The
additional backward pass in BiLSTM not only enhances temporal context representation but also
enables a tighter upper bound on classification error, as shown in Theorem 2. This progression
emphasizes the generalizability of the random feature ensemble framework and its applicability to
both unidirectional and bidirectional recurrent neural network architectures.
The algorithms presented here resemble the random forest (RF) procedure in how they handle model and variable uncertainty. In the first stage, an analogue of RF's bootstrapping of samples is implemented by creating n_models base models from random feature combinations, which simultaneously incorporates variable uncertainty. In the second stage, instead of aggregating the predicted outcomes as in RF, we implement a stacking approach, akin to boosting the base LSTM/BiLSTM algorithm using the predictions obtained in the training stage. It is important to note that the number of models n_models and the number of features n_features must be less than n and p, respectively, to avoid singularity issues during training in Stage 1 and prediction in Stage 2.
The proposed framework presented in Figure 1 follows a two-stage architecture for
diabetes prediction, incorporating both Random Feature LSTM (RFLSTM) and Random Fea-
ture BiLSTM (RFBiLSTM). In Stage 1, the input dataset with p features undergoes random
feature selection, where each base model is trained on a subset of features, nfeatures = η × p
(e.g., η = 0.6). The base models include RFLSTM, which employs a unidirectional LSTM,
and RFBiLSTM, which utilizes a bidirectional LSTM (BiLSTM). The predictions from all
base models, Ŷbase , are stored. In Stage 2, the stacking model integrates the base model
predictions with the original feature set, forming the meta-feature matrix Xmeta . A final
LSTM (for RFLSTM) or BiLSTM (for RFBiLSTM) is then trained on Xmeta to generate the
final aggregated prediction. Key parameters include η, the proportion of selected features,
and nmodels , the number of base models. The flowchart visually distinguishes between
RFLSTM and RFBiLSTM while highlighting their shared two-stage learning approach.
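To make the two-stage data flow concrete, the following minimal R sketch reproduces the structure of Figure 1 under simplifying assumptions: synthetic toy data are used, and logistic regression (glm) stands in for the LSTM/BiLSTM base and stacking models so that the script runs without Keras; in the proposed framework every base model M_i and the meta-learner M_stack are the recurrent networks described above. All object names (train_x, train_y, eta, n_models) are illustrative.

```r
## Minimal sketch of the two-stage pipeline in Figure 1; glm is a stand-in for
## the LSTM/BiLSTM base and stacking models used in the paper.
set.seed(123)
n <- 300; p <- 8
train_x <- matrix(rnorm(n * p), nrow = n,
                  dimnames = list(NULL, paste0("X", 1:p)))
train_y <- rbinom(n, 1, plogis(0.8 * train_x[, 2] - 0.5 * train_x[, 7]))

eta        <- 0.6                       # proportion of randomly selected features
n_models   <- 10                        # number of Stage 1 base models
n_features <- floor(eta * p)

## Stage 1: train each base model on a random feature subset; store predictions.
base_preds <- matrix(NA_real_, nrow = n, ncol = n_models,
                     dimnames = list(NULL, paste0("yhat", 1:n_models)))
for (i in seq_len(n_models)) {
  feats <- sample(p, n_features)        # random feature selection
  dat_i <- data.frame(y = train_y, train_x[, feats, drop = FALSE])
  m_i   <- glm(y ~ ., data = dat_i, family = binomial)  # stand-in for LSTM/BiLSTM
  base_preds[, i] <- predict(m_i, type = "response")
}

## Stage 2: stack base predictions with the original features (X_meta) and
## train the meta-learner on the enhanced input.
x_meta     <- cbind(train_x, base_preds)
meta_dat   <- data.frame(y = train_y, x_meta)
m_stack    <- glm(y ~ ., data = meta_dat, family = binomial)  # stand-in for stacking model
final_pred <- predict(m_stack, type = "response")
```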
The generalization error of a model $f$ can again be decomposed as
$$\epsilon(f) = \text{Bias}^2(f) + \operatorname{Var}(f) + \sigma^2,$$
where
• Bias2 ( f ) is the squared bias, representing the error due to the model’s inability to
capture the true underlying relationship.
• Var( f ) is the variance, representing the model’s sensitivity to the training data.
• σ2 is the irreducible error due to noise in the data.
The bias of the model is primarily affected by the number of features used for training.
When η is small, the model is trained on a limited subset of features, which may not capture
the full complexity of the data. This leads to underfitting and high bias. Mathematically,
$$\text{Bias}^2(\eta) \propto (1 - \eta)^{\alpha},$$
where α > 0 is a dataset-dependent constant. For η ≥ 0.5, the bias decreases as η increases
because more features are available to capture the underlying data distribution.
The variance of the model is influenced by the diversity of the base models in the
ensemble. When η is small, the feature subsets for each base model are highly diverse,
leading to low correlation between the models’ errors and reduced ensemble variance.
Mathematically,
$$\operatorname{Var}(\eta) \propto \frac{\eta^{\beta}}{n_{\text{models}}},$$
where β > 0 captures how quickly the correlation between base-model errors grows as the feature subsets overlap: as η increases, the subsets become more similar, diversity is lost, and the ensemble variance rises. Combining both components with the irreducible noise gives the total error
$$\epsilon(\eta) = C_1 (1 - \eta)^{\alpha} + C_2 \frac{\eta^{\beta}}{n_{\text{models}}} + \sigma^2,$$
where C1, C2 > 0 are constants that depend on the dataset and model architecture.
The optimal value of η minimizes ϵ(η). Taking the derivative of ϵ(η) with respect to η and setting it to zero,
$$\frac{d\epsilon}{d\eta} = -\alpha C_1 (1 - \eta)^{\alpha - 1} + \frac{\beta C_2\, \eta^{\beta - 1}}{n_{\text{models}}} = 0.$$
Solving this equation yields the optimal η ∗ , which balances bias and variance. Empiri-
cally, η ∗ is often found to be in the range 0.5 ≤ η ≤ 0.8.
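The following short R sketch illustrates this trade-off numerically using the error decomposition above; the constants C1, C2, α, β, n_models, and σ² are hypothetical values chosen only to visualize a mid-range optimum and are not estimated from any dataset in this paper.

```r
## Illustrative sketch of epsilon(eta) = C1*(1 - eta)^alpha
##   + C2*eta^beta / n_models + sigma^2 with hypothetical constants.
C1 <- 0.5; C2 <- 1.5; alpha <- 2; beta <- 2   # assumed constants
n_models <- 5; sigma2 <- 0.02                 # assumed ensemble size and noise

eps_eta <- function(eta) C1 * (1 - eta)^alpha + C2 * eta^beta / n_models + sigma2
opt <- optimize(eps_eta, interval = c(0.01, 0.99))
opt$minimum                                   # ~0.62 here, inside the 0.5-0.8 band
curve(eps_eta, from = 0.01, to = 0.99,
      xlab = expression(eta), ylab = "expected error")
```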
For η < 0.5, the feature subsets are too small to capture the full complexity of the data,
leading to high bias. While the ensemble variance is low due to high diversity, the overall
error is dominated by bias, resulting in poor accuracy. This occurs specifically as follows:
• Bias increases sharply as η → 0.
• Variance decreases but is insufficient to compensate for the high bias.
• Accuracy degrades significantly due to underfitting.
Similarly, the cross-entropy loss $\mathcal{L}(\eta)$ for the ensemble can be decomposed into two components:
$$\mathcal{L}(\eta) = \underbrace{\mathbb{E}\big[-\log p(y \mid X_{\text{meta}})\big]}_{\text{Stacking loss}} + \lambda \underbrace{\sum_{i=1}^{n_{\text{models}}} \mathcal{L}(M_i)}_{\text{Base model losses}},$$
where λ is a regularization parameter. The base model loss decreases as η increases because
more features improve the individual models’ ability to fit the training data. For η < 0.5, the
base model loss is high due to underfitting. The stacking loss is minimized at intermediate
values of η where the meta-features Xmeta = [ X, Ŷ (1) , . . . , Ŷ (nmodels ) ] provide the most useful
information. For η < 0.5, the stacking loss increases because the base models’ predictions
are less reliable. The behavior of the various η values is summarized in Table 1.
This ensures that the feature overlap between base models is bounded:
$$\text{FeatureOverlap} = \frac{\mathbb{E}\big[\,|X^{(i)} \cap X^{(j)}|\,\big]}{p} \leq \eta^{2}.$$
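As a quick numerical check of this bound (assuming, purely for illustration, that the feature subsets are drawn independently and uniformly without replacement), take $p = 8$ and $\eta = 0.6$ as in the simulation study below, so that $n_{\text{features}} = \lfloor 0.6 \times 8 \rfloor = 4$. Then
$$\mathbb{E}\big[\,|X^{(i)} \cap X^{(j)}|\,\big] = \frac{n_{\text{features}}^2}{p} = \frac{16}{8} = 2, \qquad \text{FeatureOverlap} = \frac{2}{8} = 0.25 \leq 0.36 = \eta^2,$$
with the slack arising from the floor in $n_{\text{features}} = \lfloor \eta p \rfloor$.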
3. Simulation Study
The synthetic dataset was designed to replicate the Pima Indian Diabetes Dataset
(PIDD) for binary classification tasks, incorporating a simulation model grounded in clinical
plausibility and prior studies [29,30]. Specifically, the synthetic dataset contains n = 500
observations with eight predictors and one binary outcome variable. Let X = ( X1 , . . . , X8 )
denote the predictor matrix and Y ∈ {0, 1} the diabetes diagnosis outcome.
$$\operatorname{logit}\big(P(Y = 1 \mid \mathbf{X})\big) = \beta_0 + \sum_{j=1}^{8} \beta_j X_j + \epsilon, \tag{10}$$
where
β 0 = −8
β = [0.04, 0.06, 0.02, 0, 0, 0, 0.03, 0]
ϵ ∼ N (0, 1)
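The following R sketch shows one way to generate data from the model in Equation (10). Because the marginal distributions of the eight predictors are not specified in this excerpt, PIDD-like normal marginals are assumed purely for illustration, and the intercept is re-calibrated numerically to the 25% prevalence target discussed in the next paragraph (the paper reports β0 = −8 under its own calibration).

```r
## Minimal sketch of the synthetic data-generating model in Equation (10).
set.seed(2025)                                   # assumed seed
n <- 500; p <- 8
mu  <- c(3.8, 120, 69, 20, 80, 32, 0.5, 33)      # assumed PIDD-like means
sdv <- c(3.4,  32, 19, 16, 115, 7, 0.3, 12)      # assumed PIDD-like SDs
X <- sapply(seq_len(p), function(j) rnorm(n, mu[j], sdv[j]))
beta <- c(0.04, 0.06, 0.02, 0, 0, 0, 0.03, 0)    # coefficients from the paper
eps  <- rnorm(n, 0, 1)                           # epsilon ~ N(0, 1)

## Re-calibrate beta_0 so that P(Y = 1) is approximately 0.25 under the
## assumed predictor marginals.
calib <- function(b0) mean(plogis(b0 + X %*% beta + eps)) - 0.25
beta0 <- uniroot(calib, interval = c(-30, 10))$root
Y <- rbinom(n, 1, plogis(beta0 + X %*% beta + eps))
mean(Y)                                          # close to the 25% prevalence target
```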
The intercept β 0 was calibrated to achieve 25% prevalence: P(Y = 1) = 0.25. The focus
of the simulation study is to investigate the diversity and generalizability of the proposed
RFLSTM and RFBiLSTM models for 0.5 ≤ η ≤ 1. The base LSTM and BiLSTM models
were implemented in R version 4.3.3 using the Keras package. The architecture consists of a
sequential model where the primary layer is a Long Short-Term Memory (LSTM) network
with 50 units. The LSTM layer processes input sequences of predefined shape and captures
temporal dependencies. A fully connected dense layer with a sigmoid activation function
follows, outputting a probability score for binary classification. The model was compiled
using the Adam optimizer, binary cross-entropy as the loss function, and accuracy as the
evaluation metric. Training was conducted over 100 epochs to ensure sufficient learning
while preventing overfitting.
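A minimal R keras sketch of this base architecture is given below. Treating each record as a length-one sequence over its feature vector, the batch size of 32, and the toy input data are assumptions not stated in this excerpt; the commented lines show how the RFBiLSTM variant wraps the recurrent layer in bidirectional().

```r
## Minimal sketch of the base model: one LSTM layer with 50 units, a sigmoid
## output, Adam optimizer, binary cross-entropy loss, and 100 epochs.
library(keras)

## Toy tabular inputs standing in for the simulated data above.
set.seed(1)
n <- 500; n_features <- 8
x_train <- matrix(rnorm(n * n_features), nrow = n)
y_train <- rbinom(n, 1, 0.25)
x_train_3d <- array(x_train, dim = c(n, 1, n_features))  # (samples, timesteps, features)

base_lstm <- keras_model_sequential() %>%
  layer_lstm(units = 50, input_shape = c(1, n_features)) %>%
  layer_dense(units = 1, activation = "sigmoid")

## RFBiLSTM variant: wrap the recurrent layer in bidirectional(), e.g.
##   keras_model_sequential() %>%
##     bidirectional(layer_lstm(units = 50), input_shape = c(1, n_features)) %>%
##     layer_dense(units = 1, activation = "sigmoid")

base_lstm %>% compile(
  optimizer = "adam",
  loss      = "binary_crossentropy",
  metrics   = "accuracy"
)

history <- base_lstm %>% fit(
  x_train_3d, y_train,
  epochs = 100, batch_size = 32,                 # batch size assumed; not reported
  validation_split = 0.2, verbose = 0
)
```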
Figure 2. Training and validation loss trajectories of RFLSTM across varying η (proportion of predictors).
Table 2. Training and validation performance of the Random Feature LSTM (RFLSTM) across varying feature selection proportions η on the simulated dataset.
η Train Accuracy Val Accuracy Diff (%) Train Loss Val Loss Diff Remark
0.5 97.1% 96.7% 0.5 0.0923 0.1495 −0.0571 Balanced
0.6 97.1% 96.7% 0.5 0.0665 0.1199 −0.0535 Balanced
0.7 97.1% 96.7% 0.5 0.0554 0.1184 −0.0630 Balanced
0.8 97.1% 96.7% 0.5 0.0664 0.1216 −0.0552 Balanced
0.9 95.7% 76.7% 19.0 0.0660 0.2535 −0.1874 Overfit
1.0 98.6% 73.3% 25.2 0.0665 0.2972 −0.2307 Overfit
The results in Table 3 illustrate the impact of the feature selection proportion η on the
training and validation performance of the Random Feature BiLSTM (RFBiLSTM). For η = 0.6
and η = 0.8, the model achieves a balanced performance, with high training accuracy (97.1%
and 97.1%, respectively) and relatively small differences between training and validation
metrics (e.g., 3.8% accuracy difference and −0.0875 loss difference for η = 0.6). This indicates
effective generalization and a well-calibrated bias–variance trade-off. However, for η = 0.5,
η = 0.7, η = 0.9, and η = 1.0, the model exhibits signs of overfitting, as evidenced by the
large discrepancies between training and validation metrics (e.g., 100.0% vs. 90.0% accuracy
and a loss difference of −0.1841 for η = 0.7). These findings are corroborated by the training
and validation loss trajectories in Figure 3, which show a divergence in losses for η = 0.5, 0.7,
0.9, and 1.0, indicating that the model is memorizing the training data rather than learning
generalizable patterns. Thus, η values of 0.6 and 0.8 are optimal for RFBiLSTM, while other
values lead to overfitting and degraded validation performance.
Table 3. Training and validation performance of the Random Feature BiLSTM (RFBiLSTM) across varying feature selection proportions η on the simulated dataset.
η Train Accuracy Val Accuracy Diff (%) Train Loss Val Loss Diff Remark
0.5 97.1% 86.7% 10.5 0.0719 0.3357 −0.2638 Overfit
0.6 97.1% 93.3% 3.8 0.0508 0.1383 −0.0875 Balanced
0.7 100.0% 90.0% 10.0 0.0196 0.2037 −0.1841 Overfit
0.8 97.1% 96.7% 0.5 0.0597 0.0996 −0.0398 Balanced
0.9 98.6% 86.7% 11.9 0.0458 0.2905 −0.2447 Overfit
1.0 98.6% 83.3% 15.2 0.0485 0.3641 −0.3156 Overfit
Figure 3. Training and validation loss trajectories of RFBiLSTM across varying η (proportion of predictors).
method was particularly effective in ensuring that imputations maintained realistic clinical
patterns rather than introducing artificial biases [8].
4.3.1. Accuracy
Accuracy measures the overall correctness of predictions and is defined as the ratio of
correctly predicted instances to the total number of instances:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{11}$$
where TP is the number of true positives, TN is the number of true negatives, FP is the
number of false positives, and FN is the number of false negatives [34].
$$\text{Brier Score} = \frac{1}{N}\sum_{i=1}^{N}(p_i - y_i)^2, \tag{14}$$
where pi is the predicted probability, yi is the actual outcome (0 or 1), and N is the total
number of predictions. Lower Brier scores indicate better-calibrated predictions [1].
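The following R sketch computes these metrics, together with a rank-based AUC, on illustrative toy labels and predicted probabilities; the 0.5 classification threshold and the object names y and p_hat are assumptions made only for this example.

```r
## Minimal sketch of the evaluation metrics in Equations (11) and (14),
## computed on toy data.
set.seed(42)
y     <- rbinom(200, 1, 0.3)                       # toy observed outcomes
p_hat <- plogis(rnorm(200, mean = 2 * y - 1))      # toy predicted probabilities

y_hat <- as.integer(p_hat >= 0.5)                  # assumed 0.5 threshold
TP <- sum(y_hat == 1 & y == 1); TN <- sum(y_hat == 0 & y == 0)
FP <- sum(y_hat == 1 & y == 0); FN <- sum(y_hat == 0 & y == 1)

accuracy    <- (TP + TN) / (TP + TN + FP + FN)     # Equation (11)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
brier       <- mean((p_hat - y)^2)                 # Equation (14)

## AUC via the Mann-Whitney (rank) formulation.
n1 <- sum(y == 1); n0 <- sum(y == 0)
auc <- (sum(rank(p_hat)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, auc = auc, brier = brier)
```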
5. Results
The results in Tables 4–6 demonstrate that the proposed Random Feature LSTM and
BiLSTM methods significantly outperform traditional state-of-the-art methods across three
diabetes-related datasets: Pima Indian Diabetes, Diabetic Retinopathy Debrecen, and Early
Stage Diabetes Risk. For the Pima Indian dataset, Random Feature BiLSTM achieved
the highest accuracy (99.3%) with perfect AUC (100%) and the lowest Brier score (0.006),
closely followed by Random Feature LSTM, which also excelled, with an accuracy of
97.6% and AUC of 100%. In the Diabetic Retinopathy dataset, both proposed methods
delivered accuracy above 97%, far exceeding that of traditional LSTM and BiLSTM models,
whose accuracy was only 62.3% and 64.0%, respectively. Similarly, in the Early Stage
Diabetes Risk dataset, both Random Feature LSTM and BiLSTM attained perfect scores
across all performance metrics (100% accuracy, sensitivity, specificity, and AUC, with a
Brier score of 0.000), outperforming other models, including random forest and SVM. These
findings underscore the enhanced predictive capabilities and robustness of the Random
Feature LSTM and BiLSTM approaches, particularly in achieving high accuracy, sensitivity,
specificity, and low Brier scores, making them highly effective for diabetes prediction tasks
across diverse datasets.
Table 6. Average of 10-fold cross-validation performance comparison of the proposed methods (Ran-
dom Feature LSTM and BiLSTM) with state-of-the-art methods for Early Stage Diabetes Risk Dataset.
Figures 4–6 illustrate the Receiver Operating Characteristic (ROC) curves for various
models across the Pima Indian Diabetes, Diabetic Retinopathy Debrecen, and Early Stage
Diabetes Risk datasets. These plots, generated from one of the ten-fold cross-validation
runs, highlight the superior performance of the proposed Random Feature LSTM and
BiLSTM methods. For the Pima Indian Diabetes Dataset, the ROC curves for these methods
reached the top-left corner, reflecting near-perfect AUC values (99.3%) and corroborating their
near-perfect classification capabilities, as detailed in Table 4. Similarly, in the Diabetic
Retinopathy Debrecen Dataset, the proposed methods maintained high AUC values (above
99%), showcasing their reliability in distinguishing between diabetic and non-diabetic cases
compared to the traditional LSTM and BiLSTM models, which exhibited significantly lower
AUC values. The Early Stage Diabetes Risk dataset, further emphasized the superiority of
the proposed ensemble methods, with both achieving perfect ROC curves that align with
their flawless performance metrics across all evaluation categories in Table 6. These ROC
curves visually reinforce the findings, demonstrating the proposed methods’ robustness,
precision, and clinical applicability.
The computational time results in Table 7 show significant differences in computa-
tional efficiency across methods. The proposed Random Feature BiLSTM (2.94 s) is much
faster than the LSTM (5.68 s), proposed Random Feature LSTM (12.87 s), and BiLSTM
(11.25 s), demonstrating the benefits of random feature augmentation for BiLSTM in terms
of computational speed. However, ELM (0.02 s) and Naive Bayes (0.01 s) are the quickest,
suitable for time-sensitive tasks. Traditional ML models like random forest (0.12 s), SVM
(0.27 s), and logistic regression (0.58 s) offer a balance between speed and complexity,
outperforming deep learning models in efficiency. Neural networks (0.26 s) are faster
than LSTM-based models but slower than simpler algorithms. Overall, considering both high predictive ability (accuracy, sensitivity, specificity, AUC, and Brier score) and moderate computational cost, the proposed Random Feature BiLSTM offers the best trade-off.
[Figure 4 panels: ROC curves (Sensitivity vs. 1 − Specificity) for Neural Network (AUC = 0.799), Random Forest (AUC = 0.885), SVM (AUC = 0.883), Logistic Regression (AUC = 0.875), and Naive Bayes (AUC = 0.848).]
Figure 4. Receiver Operating Characteristic (ROC) curves of the various methods for the Pima
Indian dataset. (Note the AUC value shown on the plot corresponds to the AUC from one of the
ten iterations).
[Figure 5 panels: ROC curves (Sensitivity vs. 1 − Specificity) for Neural Network (AUC = 0.827), Random Forest (AUC = 0.942), SVM (AUC = 0.842), Logistic Regression (AUC = 0.882), and Naive Bayes (AUC = 0.745).]
Figure 5. Receiver Operating Characteristic (ROC) curves of the various methods for the Diabetic
Retinopathy Debrecen Dataset. (Note the AUC value shown on the plot corresponds to the AUC
from one of the ten iterations).
[Figure 6 panels: ROC curves (Sensitivity vs. 1 − Specificity) for Neural Network (AUC = 0.943), Random Forest (AUC = 0.994), SVM (AUC = 0.997), Logistic Regression (AUC = 0.997), and Naive Bayes (AUC = 0.964).]
Figure 6. Receiver Operating Characteristic (ROC) curves of the various methods for the Early Stage
Diabetes Risk dataset. (Note the AUC value shown on the plot corresponds to the AUC from one of
the ten iterations).
6. Discussion of Results
The proposed Random Feature LSTM (RFLSTM) and Random Feature BiLSTM (RF-
BiLSTM) frameworks demonstrate significant advancements in diabetes prediction accu-
racy and robustness compared to conventional machine learning models and standard
LSTM/BiLSTM architectures. Across three benchmark datasets, Pima Indian Diabetes
(PIDD), Diabetic Retinopathy Debrecen (DRDD), and Early Stage Diabetes Risk Prediction
(ESDRPD), the models achieve state-of-the-art performance, with RFBiLSTM consistently
outperforming RFLSTM and existing methods. On the PIDD, RFBiLSTM attains 99.3%
accuracy and 99.0% sensitivity, surpassing advanced ensemble methods like Boruta + EL
(98.1% accuracy) and hybrid architectures such as Conv-LSTM (97.2% accuracy). Similarly,
on the ESDRPD, both RFLSTM and RFBiLSTM achieve flawless accuracy and sensitivity
(100%), outperforming gradient-boosted models like LGBM (96.2% accuracy) and ensemble
techniques such as Boruta + EL (98.6% accuracy). For the DRDD, RFBiLSTM achieves
97.5% accuracy and 98.4% sensitivity, exceeding deep learning hybrids like DNN + PCA +
GWO (97.3% accuracy) and traditional models such as SVM (79.0% accuracy). These results
underscore the efficacy of combining random feature selection with bidirectional temporal
processing, particularly in clinical contexts where sensitivity and specificity are critical for
early intervention.
The performance superiority stems from two synergistic innovations. First, the ran-
dom feature selection mechanism mitigates overfitting by promoting model diversity, as
evidenced by the η-dependency analysis (Tables 2 and 3). For RFLSTM, mid-range η values
(0.5–0.8) yield balanced generalization, with training and validation accuracies stabilizing
at 97.1% and 96.7%, respectively, and minimal loss discrepancies (e.g., −0.0571 for η = 0.5).
Conversely, higher η values (≥0.9) induce overfitting, as seen in the sharp decline in vali-
dation accuracy (76.7%) and widening loss gaps (−0.1874). Similarly, RFBiLSTM achieves
optimal performance at η = 0.6 and η = 0.8 (validation accuracy: 93.3–96.7%) but falters
at η = 0.9 (validation accuracy: 86.7%), emphasizing the necessity of retaining sufficient
feature diversity to balance bias and variance. Second, the bidirectional architecture in
RFBiLSTM enhances sensitivity by capturing temporal dependencies in both forward and
backward directions, as demonstrated by its 5.1% sensitivity gain over RFLSTM on the
DRDD (98.4% vs. 97.0%). This aligns with findings from [2], where bidirectional archi-
tectures improved glucose trend prediction, and [14], where hybrid models excelled in
capturing sequential patterns in medical data.
The models’ clinical reliability is further validated by their exceptional calibration
metrics, including near-perfect AUC scores (100% for ESDRPD) and low Brier scores
(0.006–0.023), which indicate precise probabilistic predictions. These metrics suggest
that the models are not only accurate but also trustworthy in real-world settings where
predictive confidence impacts clinical decisions. However, the perfect scores on ESDRPD
warrant careful scrutiny, as they may reflect dataset-specific biases, such as homogeneous
patient demographics or limited variability in symptom presentation. Despite this caveat,
the results align with broader trends in medical AI research, such as [19], where hybrid
CNNs outperformed traditional models in diabetes prediction, and [17], which highlighted
the importance of sequential learning for early risk detection.
The findings hold critical implications for both clinical practice and machine learning
research. Clinically, the high sensitivity (98.4–100%) and specificity (97.0–100%) of RFBiL-
STM position it as a valuable tool for early diabetes screening, where false negatives can
delay critical interventions. The models’ ability to maintain robust performance across di-
verse datasets ranging from physiological measurements (PIDD) to retinal imaging (DRDD)
and symptom-based assessments (ESDRPD) suggests broad applicability in multi-modal
healthcare environments. From a technical perspective, the η dependency analysis un-
derscores the importance of optimizing the proportions of feature selection, with 50–80%
feature retention emerging as a “sweet spot” to balance information richness and model
generalizability. This insight challenges the conventional preference for high η values
in feature selection, instead advocating a middle ground that prioritizes diversity over
completeness. Architecturally, the bidirectional design of RFBiLSTM proves indispensable
for sensitivity-driven tasks, as it captures temporal dependencies more comprehensively
than unidirectional models, a finding consistent with recent advances in medical time series
analysis. Future work should focus on validating these models on larger, multi-institutional
datasets to ensure generalizability across diverse populations and addressing potential
biases in datasets with limited heterogeneity. Furthermore, exploring the integration of
attention mechanisms or explainability frameworks could further enhance clinical adoption
by providing interpretable insights into model decisions.
In addition, the proposed hybrid framework enhances accuracy, efficiency, and in-
terpretability in several key ways. Accuracy is significantly improved through a dual
mechanism: (1) randomized feature selection reduces overfitting by training diverse base
models on unique feature subsets, fostering robustness through ensemble aggregation, and
(2) bidirectional processing in BiLSTM captures temporal dependencies in both forward
and backward directions, enabling enhanced pattern recognition (e.g., 98.4% vs. 97.0%
sensitivity for retinopathy data). Systematic simulations revealed that the proportions
of midrange feature selection (η = 0.5–0.8) optimize generalization, outperforming con-
ventional models (e.g., 100% precision in early-stage risk data) and recent hybrids such
as Boruta + EL. Efficiency is maintained despite the ensemble design: parallelizable base
model training and meta-learner integration minimize computational overhead, while dy-
namic feature sparsity reduces per-model complexity. Interpretability is advanced through
two pathways: (1) the meta-learner's transparent weighting mechanism clarifies how base predictions contribute to the final outputs, and (2) the superior calibration (Brier score: 0.006–0.023) ensures probabilistic reliability, which is critical for clinical trust. Together, these innovations balance
performance, scalability, and actionable insights, addressing limitations of monolithic deep
learning architectures while advancing practical utility in healthcare analytics.
7. Conclusions
The integration of hybrid deep learning architectures, such as those combining ran-
dom feature selection with unidirectional/bidirectional temporal modelling, represents a
paradigm shift in medical predictive analytics. By harmonizing the strengths of ensemble
learning and sequential data processing, these frameworks address critical limitations of
conventional models, which often struggle with unstructured medical data. The success of
such architectures in the prediction of diabetes underscores their potential to advance preci-
sion medicine, offering tools that are not only accurate but also reliable in their probabilistic
calibration, a prerequisite for clinical trust.
This study reinforces the importance of balancing feature diversity and information
retention in medical AI design. Although traditional methods prioritize either interpretabil-
ity or complexity, hybrid architectures like those proposed here demonstrate that these
goals need not be mutually exclusive. Instead, they can co-exist to enhance model gener-
alizability and robustness, particularly in early-stage disease detection, where nuanced
patterns demand sophisticated analytical frameworks.
The broader implications extend beyond diabetes. The principles underlying these
models’ adaptive feature selection, temporal dependency capture, and probabilistic cali-
bration are transferable to other chronic diseases, from cardiovascular disorders to neu-
rodegenerative conditions, where early diagnosis and risk stratification are equally vital.
However, the path to clinical adoption requires addressing challenges such as dataset
heterogeneity, algorithmic transparency, and computational scalability. Future efforts must
prioritize collaborative frameworks that bridge machine learning innovation with clinical
expertise, ensuring that these tools evolve in tandem with real-world healthcare needs.
Ultimately, this work contributes to a growing movement in medical AI, one that seeks
not only to predict but to empower, transforming raw data into actionable insights that
improve patient outcomes and redefine preventive care.
Author Contributions: Conceptualization, O.R.O., A.O.S., J.A., A.A.A. and N.M.A.; methodology,
O.R.O. and A.O.S.; software, O.R.O.; validation, O.R.O., A.O.S., J.A., A.A.A. and N.M.A.; formal
analysis, O.R.O.; investigation, O.R.O., A.O.S., J.A., A.A.A. and N.M.A.; resources, J.A. and A.A.A.;
data curation, O.R.O.; writing—original draft preparation, O.R.O. and A.O.S.; writing—review and
editing, O.R.O., A.O.S., J.A., A.A.A. and N.M.A.; visualization, O.R.O.; supervision, O.R.O.; project
administration, O.R.O. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement: The authors confirm that the data supporting the findings of this study
are available within the article.
References
1. Rahman, M.; Islam, D.; Mukti, R.J.; Saha, I. A deep learning approach based on convolutional LSTM for detecting diabetes.
Comput. Biol. Chem. 2020, 88, 107329. [CrossRef] [PubMed]
2. Sun, Q.; Jankovic, M.V.; Bally, L.; Mougiakakou, S.G. Predicting blood glucose with an lstm and bi-lstm based deep neural
network. In Proceedings of the IEEE 2018 14th Symposium on Neural Networks and Applications (NEUREL), Belgrade, Serbia,
20–21 November 2018; pp. 1–5.
3. Ishida, K.; Ercan, A.; Nagasato, T.; Kiyama, M.; Amagasaki, M. Use of one-dimensional CNN for input data size reduction
in LSTM for improved computational efficiency and accuracy in hourly rainfall-runoff modeling. J. Environ. Manag. 2024,
359, 120931. [CrossRef] [PubMed]
4. Cheng, H.; Zhu, J.; Li, P.; Xu, H. Combining knowledge extension with convolution neural network for diabetes prediction. Eng.
Appl. Artif. Intell. 2023, 125, 106658. [CrossRef]
5. Madan, P.; Singh, V.; Chaudhari, V.; Albagory, Y.; Dumka, A.; Singh, R.; Gehlot, A.; Rashid, M.; Alshamrani, S.S.; AlGhamdi, A.S.
An optimization-based diabetes prediction model using CNN and Bi-directional LSTM in real-time environment. Appl. Sci. 2022,
12, 3989. [CrossRef]
6. Araveeporn, A. Comparing the linear and quadratic discriminant analysis of diabetes disease classification based on data
multicollinearity. Int. J. Math. Math. Sci. 2022, 2022, 7829795. [CrossRef]
7. Zhou, H.; Xin, Y.; Li, S. A diabetes prediction model based on Boruta feature selection and ensemble learning. BMC Bioinform.
2023, 24, 224. [CrossRef]
8. Jaiswal, S.; Gupta, P. Diabetes Prediction Using Bi-directional Long Short-Term Memory. SN Comput. Sci. 2023, 4, 373. [CrossRef]
9. Maniruzzaman, M.; Kumar, N.; Abedin, M.M.; Islam, M.S.; Suri, H.S.; El-Baz, A.S.; Suri, J.S. Comparative approaches for
classification of diabetes mellitus data: Machine learning paradigm. Comput. Methods Programs Biomed. 2017, 152, 23–34.
[CrossRef]
10. Zou, Q.; Qu, K.; Luo, Y.; Yin, D.; Ju, Y.; Tang, H. Predicting diabetes mellitus with machine learning techniques. Front. Genet.
2018, 9, 515. [CrossRef]
11. Yahyaoui, A.; Jamil, A.; Rasheed, J.; Yesiltepe, M. A decision support system for diabetes prediction using machine learning and
deep learning techniques. In Proceedings of the IEEE 2019 1st International Informatics and Software Engineering Conference
(UBMYK), Ankara, Turkey, 6–7 November 2019; pp. 1–4.
12. Yuvaraj, N.; SriPreethaa, K. Diabetes prediction in healthcare systems using machine learning algorithms on Hadoop cluster.
Clust. Comput. 2019, 22, 1–9. [CrossRef]
13. Khanam, J.J.; Foo, S.Y. A comparison of machine learning algorithms for diabetes prediction. Ict Express 2021, 7, 432–439.
[CrossRef]
14. Kusuma, S.; Jothi, K. ECG signals-based automated diagnosis of congestive heart failure using Deep CNN and LSTM architecture.
Biocybern. Biomed. Eng. 2022, 42, 247–257. [CrossRef]
15. Reddy, S.N.B.; Reddy, K.N.; Rao, S.T.; Kumar, K. Diabetes Prediction using Extreme Learning Machine: Application of Health
Systems. In Proceedings of the IEEE 2023 5th International Conference on Smart Systems and Inventive Technology (ICSSIT),
Tirunelveli, India, 23–25 January 2023; pp. 993–998.
16. Pangaribuan, J.J.; Suharjito. Diagnosis of diabetes mellitus using extreme learning machine. In Proceedings of the IEEE 2014
International Conference on Information Technology Systems and Innovation (ICITSI), Bandung, Indonesia, 24–27 November
2014; pp. 33–38.
17. Elsayed, N.; ElSayed, Z.; Ozer, M. Early stage diabetes prediction via extreme learning machine. In Proceedings of the IEEE
SoutheastCon 2022, Mobile, AL, USA, 26 March–3 April 2022; pp. 374–379.
18. Georga, E.I.; Protopappas, V.C.; Polyzos, D.; Fotiadis, D.I. Online prediction of glucose concentration in type 1 diabetes using
extreme learning machines. In Proceedings of the IEEE 2015 37th Annual International Conference of the IEEE Engineering in
Medicine and Biology Society (EMBC), Milan, Italy, 25–29 August 2015; pp. 3262–3265.
19. Swapna, G.; Vinayakumar, R.; Soman, K. Diabetes detection using deep learning algorithms. ICT Express 2018, 4, 243–246.
20. Hossain, M.M.; Ali, M.S.; Ahmed, M.M.; Rakib, M.R.H.; Kona, M.A.; Afrin, S.; Islam, M.K.; Ahsan, M.M.; Raj, S.M.R.H.; Rahman,
M.H. Cardiovascular disease identification using a hybrid CNN-LSTM model with explainable AI. Inform. Med. Unlocked 2023,
42, 101370. [CrossRef]
21. Karthika, S.; Priyanka, T.; Indirapriyadharshini, J.; Sadesh, S.; Rajeshkumar, G.; Rajesh Kanna, P. Prediction of Weather Forecasting
with Long Short-Term Memory using Deep Learning. In Proceedings of the IEEE 2023 4th International Conference on Smart
Electronics and Communication (ICOSEC), Trichy, India, 20–22 September 2023; pp. 1161–1168.
22. Khan, A.; Fouda, M.M.; Do, D.T.; Almaleh, A.; Rahman, A.U. Short-term traffic prediction using deep learning long short-term
memory: Taxonomy, applications, challenges, and future trends. IEEE Access 2023, 11, 94371–94391. [CrossRef]
23. Raja, J.B.; Pandian, S.C. PSO-FCM based data mining model to predict diabetic disease. Comput. Methods Programs Biomed. 2020,
196, 105659. [CrossRef]
24. Zhu, C.; Idemudia, C.U.; Feng, W. Improved logistic regression model for diabetes prediction by integrating PCA and K-means
techniques. Inform. Med. Unlocked 2019, 17, 100179. [CrossRef]
25. Wu, H.; Yang, S.; Huang, Z.; He, J.; Wang, X. Type 2 diabetes mellitus prediction model based on data mining. Inform. Med.
Unlocked 2018, 10, 100–107. [CrossRef]
26. García-Ordás, M.T.; Benavides, C.; Benítez-Andrades, J.A.; Alaiz-Moretón, H.; García-Rodríguez, I. Diabetes detection using
deep learning techniques with oversampling and feature augmentation. Comput. Methods Programs Biomed. 2021, 202, 105968.
[CrossRef]
27. Qi, H.; Song, X.; Liu, S.; Zhang, Y.; Wong, K.K. KFPredict: An ensemble learning prediction framework for diabetes based on
fusion of key features. Comput. Methods Programs Biomed. 2023, 231, 107378. [CrossRef]
28. Al Rafi, A.S.; Rahman, T.; Al Abir, A.R.; Rajib, T.A.; Islam, M.; Mukta, M.S.H. A new classification technique: Random weighted
lstm (rwl). In Proceedings of the 2020 IEEE Region 10 Symposium (TENSYMP), Dhaka, Bangladesh, 5–7 June 2020; pp. 262–265.
29. Tagmatova, Z.; Abdusalomov, A.; Nasimov, R.; Nasimova, N.; Dogru, A.H.; Cho, Y.I. New approach for generating synthetic
medical data to predict type 2 diabetes. Bioengineering 2023, 10, 1031. [CrossRef] [PubMed]
30. Noguer, J.; Contreras, I.; Mujahid, O.; Beneyto, A.; Vehi, J. Generation of individualized synthetic data for augmentation of the
type 1 diabetes data sets using deep learning models. Sensors 2022, 22, 4944. [CrossRef] [PubMed]
31. Butt, U.M.; Letchmunan, S.; Ali, M.; Hassan, F.H.; Baqir, A.; Sherazi, H.H.R. Machine learning based diabetes classification and
prediction for healthcare applications. J. Healthc. Eng. 2021, 2021, 9930985. [CrossRef]
32. Antal, B.; Hajdu, A. An ensemble-based system for automatic screening of diabetic retinopathy. Knowl.-Based Syst. 2014, 60, 20–27.
[CrossRef]
33. Nguyen, Q.H.; Muthuraman, R.; Singh, L.; Sen, G.; Tran, A.C.; Nguyen, B.P.; Chua, M. Diabetic retinopathy detection using deep
learning. In Proceedings of the 4th International Conference on Machine Learning and Soft Computing, Haiphong City, Vietnam,
17–19 January 2020; pp. 103–107.
34. Olaniran, O.R.; Abdullah, M.A.A. Bayesian weighted random forest for classification of high-dimensional genomics data. Kuwait
J. Sci. 2023, 50, 477–484. [CrossRef]
35. Olaniran, O.R.; Alzahrani, A.R.R. On the Oracle Properties of Bayesian Random Forest for Sparse High-Dimensional Gaussian
Regression. Mathematics 2023, 11, 4957. [CrossRef]
36. Olaniran, O.R.; Alzahrani, A.R.R.; Alzahrani, M.R. Eigenvalue Distributions in Random Confusion Matrices: Applications to
Machine Learning Evaluation. Mathematics 2024, 12, 1425. [CrossRef]
37. Kumari, S.; Kumar, D.; Mittal, M. An ensemble approach for classification and prediction of diabetes mellitus using soft voting
classifier. Int. J. Cogn. Comput. Eng. 2021, 2, 40–46. [CrossRef]
38. Rajendra, P.; Latifi, S. Prediction of diabetes using logistic regression and ensemble techniques. Comput. Methods Programs Biomed.
Update 2021, 1, 100032. [CrossRef]
39. Wu, Y.; Zhang, Q.; Hu, Y.; Sun-Woo, K.; Zhang, X.; Zhu, H.; Jie, L.; Li, S. Novel binary logistic regression model based on feature
transformation of XGBoost for type 2 Diabetes Mellitus prediction in healthcare systems. Future Gener. Comput. Syst. 2022,
129, 1–12. [CrossRef]
40. Roobini, M.; Lakshmi, M. Autonomous prediction of Type 2 Diabetes with high impact of glucose level. Comput. Electr. Eng.
2022, 101, 108082. [CrossRef]
41. Nipa, N.; Riyad, M.H.; Satu, S.; Walliullah; Howlader, K.C.; Moni, M.A. Clinically adaptable machine learning model to identify
early appreciable features of diabetes. Intell. Med. 2024, 4, 22–32. [CrossRef]
42. Devi, R.M.; Keerthika, P.; Devi, K.; Suresh, P.; Sangeetha, M.; Sagana, C.; Devendran, K. Detection of diabetic retinopathy using
optimized back-propagation neural network (Op-BPN) algorithm. In Proceedings of the IEEE 2021 5th International Conference
on Computing Methodologies and Communication (ICCMC), Erode, India, 8–10 April 2021; pp. 1695–1699.
43. Gadekallu, T.R.; Khare, N.; Bhattacharya, S.; Singh, S.; Maddikunta, P.K.R.; Srivastava, G. Deep neural networks to predict
diabetic retinopathy. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 5407–5420. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.