05 MEDIN52024744 Online
Medinformatics
RESEARCH ARTICLE 2025, Vol. 2(2) 107–119
DOI: 10.47852/bonviewMEDIN52024744
Abstract: Liver disease is any condition that damages the liver's structure or impairs its function, leading to various health complications, and the number of cases is rising rapidly. In this study, we applied various machine learning (ML) techniques to a dataset of key liver disease-related blood sample biomarkers to enhance the accuracy of liver disease prediction. Specifically, we
integrated the artificial neural network (ANN) model with five ML models: Stacked Generalization (Stacking), Bootstrap Aggregating
(Bagging), Adaptive Boosting (AdaBoost), Gradient-Boosted Decision Tree (GBDT), and Support Vector Machine (SVM)—resulting in
five distinct hybrid models: Stacking with ANN (SANN), Bagging with ANN (BANN), AdaBoost with ANN (ABANN), GBDT with ANN
(GANN), and SVM with ANN (SVMANN). We tested all these hybrid models with feature selection techniques, including linear
discriminant analysis (LDA), principal component analysis (PCA), recursive feature elimination (RFE), and also without feature
selection. Through extensive testing, we found that these five hybrid models performed best when combined with LDA rather than PCA,
RFE, or no feature selection. This discovery led us to create a max voting ensemble (MVE) of these LDA-optimized hybrid models.
Remarkably, our prediction accuracy increased from 79.15% to 98.38% using the MVE. Furthermore, we employ explainable artificial
intelligence techniques such as Local Interpretable Model-agnostic Explanations, Shapley Additive Explanations, and Individual
Conditional Expectations to analyze and enhance trust in the predictions. We also implemented 10-fold cross-validation to ensure the
robustness and reliability of our results. This research underscores the significance of advancements in neural network systems and
highlights the potential for hybrid models to improve predictive accuracy in liver disease diagnosis. Our findings pave the way for a new
generation of computational technologies endowed with intelligence, ultimately contributing to better health outcomes and a deeper
understanding of liver disease dynamics.
Keywords: liver disease, machine learning, artificial neural network, explainable artificial intelligence
© The Author(s) 2025. Published by BON VIEW PUBLISHING PTE. LTD. This is an open access article under the CC BY License (https://creativecommons.org/
licenses/by/4.0/).
Medinformatics Vol. 2 Iss. 2 2025
for liver disease prediction. For instance, Choubey et al. [3] adopted Decision Tree (DT) algorithms and achieved an accuracy of 75.10%, while Shetty and Satyanarayana [4] enhanced Support Vector Machine (SVM) with Random Sampling for a 71% accuracy rate. Alyabis et al. [5] turned to Neural Network Analysis and obtained a 79.6% success rate, and Singh and Agarwal [6] experimented with an Extreme Learning Machine (ELM), resulting in 77.77% accuracy. Further contributions include Azam et al. [7], who integrated K-Nearest Neighbor (KNN) with Feature Selection Techniques (KNNWFST) for a 74% accuracy, and Choudhary et al. [8], who applied Logistic Regression (LR) with a 70.54% accuracy rate. Additional studies by Khan et al. [9] and Thirunavukkarasu et al. [10] utilized Random Forest (RF) and LR to achieve accuracies of 72.17% and 73.97%, respectively. Muthuselvan et al. [11] used Random Tree, and Yasmin et al. [12] studied KNN, yielding 74.2% and 76.03% accuracy, demonstrating the diverse range of ML methodologies being explored for liver disease prediction.
In our study, we critically analyzed the limitations and scopes of these previous studies, seeking to bring novelty to our research methodology. In the discussion section, we provide a comprehensive comparison of these studies with our findings to highlight the advancements and contributions of our approach.
In recent years, the emergence of advanced computational techniques, such as artificial neural networks (ANNs) and explainable artificial intelligence (XAI), has provided promising avenues for enhancing the predictive capabilities of liver disease diagnosis models. This research investigates the potential of ANN-based models integrated with XAI techniques for predicting liver disease from optimal features extracted from patient data. Unlike traditional statistical methods, ANNs offer the advantage of learning complex patterns and relationships from large datasets, enabling more accurate and robust predictions. Moreover, incorporating XAI methods allows for interpreting and understanding the ANN model's decision-making process, addressing the critical need for transparency and explainability in medical AI systems [13]. The primary objective of this study is to develop ANN-based models trained on a comprehensive dataset of clinical variables associated with liver disease, utilizing feature selection techniques to identify the most informative features for prediction. By leveraging XAI methods, such as Local Interpretable Model-agnostic Explanation (LIME) and Shapley Additive Explanation (SHAP), we aim to elucidate the underlying factors driving the model's predictions, enhancing its interpretability and trustworthiness.
This research is motivated by the potential of advanced computational techniques to revolutionize medical diagnostics and decision-making processes. By harnessing the power of ANNs and XAI, we aim to develop more accurate, transparent, and clinically relevant predictive models for liver disease. Specifically, this study integrates ANNs with robust ML models to enhance predictive accuracy, employs advanced XAI tools to ensure transparency in decision-making, and optimizes feature selection to target the most informative clinical variables for liver disease prediction. These advancements can potentially inform clinical practice and improve patient outcomes through early detection and personalized treatment strategies.
2. Materials and Methods
The main goal of this study is to accurately predict liver disease by employing various ANN-based hybrid models and subsequently assembling them for improved performance. The research workflow is outlined in Figure 1. Sections 2.1 to 2.9 provide a brief working structure of the study.
2.1. Dataset
We obtained a dataset from the UCI ML Repository [14] containing 583 samples of individuals, both affected and unaffected by liver disease. The dataset comprises 10 features, excluding the target variable indicating the presence or absence of liver disease. Of the 583 instances in the dataset, 416 samples are affected by the disease, while the remaining 167 are free. These 10 features contain vital information related to various blood parameters and liver conditions, including Age, Gender, Total Bilirubin (TB), Direct Bilirubin (DB), Alkaline Phosphatase (ALPH), Alanine Aminotransferase (ALAT), Aspartate Aminotransferase (ASAT), Total Proteins (TP), Albumin (AL), and Albumin and Globulin Ratio (AGR). The dataset comprehensively represents individuals' liver health features, incorporating key biochemical markers and demographic information. This diverse set of features will serve as the foundation for constructing and evaluating predictive models for liver disease diagnosis. Additionally, Table 1 provides a detailed description of all 10 features and their corresponding value types, facilitating a better understanding of the dataset's composition and characteristics.
2.2. Analysis and visualization
Data analysis and visualization are crucial in understanding datasets, especially when applying different ML models [15]. These techniques provide valuable insights into the distribution, patterns, outliers, and relationships within the data, essential for making informed decisions during model development, feature selection, and evaluation. In our liver dataset analysis, we utilize various visualization techniques, including histograms [16], violin plots [17], and correlation heatmaps [18].
2.3. Preprocessing
We employed various preprocessing techniques to address missing values and transform textual values into numerical representations [19]. In our dataset, we encountered missing values in the "AGR" feature, totaling four instances. Additionally, we converted gender values, where females were represented as 1 and males as 0. Missing values in the dataset were addressed using data imputation techniques. Specifically, we utilized the mean imputation method to fill in the missing values of the AGR feature. The mean imputation formula is given as follows:
\[ \mathrm{Mean}(X) = \frac{\sum x}{n} \tag{1} \]
Here, $\mathrm{Mean}(X)$ is the mean value that is used to fill in missing values in the dataset, $\sum x$ denotes the sum of all non-missing values in the feature, and $n$ indicates the total number of non-missing values in the feature.
We employed one-hot encoding to convert textual values into numerical representations [20]. This technique transforms categorical variables into binary vectors, effectively representing each category as a separate feature. In our case, we encoded gender information, where females were mapped to 1 and males to 0.
2.4. Ideal feature finding
Ideal feature finding is the process of selecting the most relevant and informative features from a dataset to improve the performance of ANN-based ML models. This step is essential in building efficient and
accurate predictive models as it helps reduce dimensionality, minimize overfitting, and enhance model interpretability.
In this study, we applied feature selection techniques such as linear discriminant analysis (LDA), principal component analysis (PCA), and recursive feature elimination (RFE) [21] to identify optimal features for ANN-based ML models, aiming to enhance predictions of liver disease outcomes. We identified the most compelling feature selection approach among the tested methods
and integrated it into our models to improve predictive performance. Furthermore, we employed a max voting ensemble (MVE) technique to combine multiple ANN models utilizing the best feature subset, significantly boosting accuracy and robustness.
LDA is a dimensionality reduction technique that finds linear combinations of features to best separate different classes or categories in the data. It is commonly used for classification tasks to maximize the separation between classes while minimizing the variance within each class [22].
PCA is another dimensionality reduction technique that transforms the original features into a lower-dimensional space while preserving as much variance as possible. PCA identifies the principal components that capture the most significant variation in the data, allowing for dimensionality reduction and simplification of the dataset [23].
RFE is a feature selection method that recursively removes features based on their importance from the dataset. It trains the model on the remaining features and evaluates their performance, continuing this process until the optimal subset of features is identified. RFE helps select the most informative features while discarding redundant or irrelevant ones, thereby improving model efficiency and interpretability [24].
2.5. ML model construction
Our study thoroughly examined the preprocessed dataset by integrating various ML models with ANN to enhance prediction accuracy. Our approach involved leveraging ANN as the base model and implementing five distinct algorithms. These algorithms are designed to improve predictive performance by incorporating unique methodologies and characteristics. This comprehensive analysis aims to identify the most effective model for accurately predicting liver disease outcomes. Our methodology underscores the importance of neural network systems in maximizing prediction accuracy across different ML models [25].
2.5.1. Stacking with ANN (SANN)
Model stacking, or stacked generalization, is an ML technique that combines multiple models to enhance predictive performance. It trains several base models and uses their predictions as input features for a meta-model, which learns to refine and integrate these predictions. The meta-model addresses errors and biases of individual models, yielding a more robust and accurate prediction [26]. Mathematically, the stacking process with an ANN is as follows:
\[ X_{\mathrm{meta}} = \left[\mathrm{ANN}_1(X)\ \ \mathrm{ANN}_2(X)\ \ \ldots\ \ \mathrm{ANN}_n(X)\right],\ y \tag{2} \]
Here, $X$ represents the input features, $y$ represents the target variable, $\mathrm{ANN}_i(X)$ represents the prediction made by ANN model $i$, and $\mathrm{RF}(X_{\mathrm{meta}})$ represents the prediction made by the RF as a meta-model. Then, the RF meta-model is trained on $X_{\mathrm{meta}}$:
\[ \mathrm{RF}(X_{\mathrm{meta}}) = \left(\left[\mathrm{ANN}_1(X),\ \mathrm{ANN}_2(X),\ \ldots,\ \mathrm{ANN}_n(X)\right]\right) \tag{3} \]
The performance metrics are then calculated based on the predictions of the RF meta-model.
2.5.2. Bagging with ANN
Bagging is an ensemble method that enhances ML stability and accuracy by bootstrap sampling to create multiple training subsets. Each subset trains a base model, and their predictions are aggregated for the final output [27]. By introducing model diversity, bagging reduces overfitting and improves generalization. This study employs Bagging with ANNs as base models to mitigate prediction variance. While ANNs excel at capturing complex data patterns, they are sensitive to training subsets. Bagging reduces this sensitivity, enhancing prediction stability.
The Bagging Classifier aggregates predictions from multiple ANN base models through averaging. For $N$ base models, the prediction is the average of all base model predictions. Mathematically, this is represented as:
\[ \mathrm{BC}(X) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{ANN}_i(X) \tag{4} \]
Here, $\mathrm{ANN}_i(X)$ represents the prediction made by the $i$-th ANN model on the input features $X$, and $\mathrm{BC}(X)$ represents the prediction made by the Bagging Classifier.
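As a minimal sketch of the averaging in Eq. (4), the following Python snippet draws bootstrap replicates of a toy training set and averages the outputs of N base models. The base learner here is a hypothetical stand-in for an ANN (it simply predicts the mean label of its bootstrap sample), and all function names and data values are assumptions made for illustration rather than the implementation used in this study.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap replicate)."""
    return [rng.choice(rows) for _ in rows]

def fit_mean_model(rows):
    """Hypothetical stand-in for an ANN: predicts the mean label of its sample."""
    mean_label = sum(label for _, label in rows) / len(rows)
    return lambda x: mean_label

def bagging_predict(models, x):
    """Eq. (4): BC(X) = (1/N) * sum of ANN_i(X) over the N base models."""
    return sum(m(x) for m in models) / len(models)

rng = random.Random(0)
# Toy (feature, label) pairs; 1 = disease, 0 = no disease (made-up values).
data = [(0.2, 0), (0.4, 0), (1.5, 1), (2.0, 1), (2.5, 1)]
models = [fit_mean_model(bootstrap_sample(data, rng)) for _ in range(10)]
score = bagging_predict(models, 1.0)
label = 1 if score >= 0.5 else 0  # threshold the averaged score
```

Because each base model sees a different bootstrap replicate, the averaged score varies less than any single model's output, which is the variance reduction described above.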
2.5.3. ABANN
Adaptive Boosting (AdaBoost) enhances classification performance by combining weak learners, typically shallow DTs, through iterative training that assigns higher weights to misclassified samples [28]. In AdaBoost with ANNs (ABANN), ANNs replace traditional weak learners. Multiple ANNs are trained sequentially, with each focusing more on previous misclassifications. The final ABANN prediction is a weighted sum of individual ANN predictions, with weights based on their accuracy during training. Mathematically, this process is expressed as:
\[ \mathrm{ABANN}(X) = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i \, \mathrm{ANN}_i(X)\right) \tag{5} \]
Here, $\mathrm{ANN}_i(X)$ represents the prediction made by the $i$-th ANN model on the input features $X$, and $\mathrm{ABANN}(X)$ represents the prediction made by the AdaBoost model on the input features $X$. The AdaBoost model combines predictions from the $N$ ANN base models through a weighted sum, where $\alpha_i$ represents the weight assigned to the $i$-th base model. The sign function ensures the final prediction is binary, typically {−1, 1} in classification tasks. The weights $\alpha_i$ are determined during the training process, favoring models with better performance. This iterative approach of combining multiple ANNs with AdaBoost enhances the model's overall predictive accuracy and robustness.
This equation encapsulates the idea of integrating the predictions from both the SVM and ANN models to form the output of the SVMANN hybrid model.
2.6. ML model evaluation
We evaluate our hybrid ANN-based ML models for liver disease prediction using an 80:20 train-test split, ensuring robust training and a realistic performance assessment. Predictions are analyzed through a confusion matrix (CM), which categorizes predictions into True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN), forming the basis for calculating key performance metrics [31].
Accuracy is the proportion of correctly classified instances and is calculated as:
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \tag{8} \]
Precision measures the accuracy of positive predictions, which is crucial for minimizing FPs in medical contexts. It is calculated as:
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \times 100 \tag{9} \]
Recall reflects the model's ability to identify actual positive cases, ensuring minimal missed diagnoses. It is computed as:
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \times 100 \tag{10} \]
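The accuracy and precision formulas above, together with the standard recall definition TP / (TP + FN), can be sketched directly from confusion-matrix counts. The counts below are hypothetical (chosen to sum to 117, roughly the 20% test share of 583 samples) and are not results from this study; the function name is likewise an assumption.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, and recall as percentages from CM counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100
    # Guard against empty denominators when a class is never predicted/present.
    precision = tp / (tp + fp) * 100 if (tp + fp) else 0.0
    recall = tp / (tp + fn) * 100 if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts for illustration only.
acc, prec, rec = classification_metrics(tp=80, tn=25, fp=7, fn=5)
print(f"accuracy={acc:.2f}%  precision={prec:.2f}%  recall={rec:.2f}%")
```

In a screening context, recall is usually the metric to watch, since each false negative corresponds to a missed diagnosis.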
MVE [35] to combine predictions from five hybrid ANN models, leveraging their diverse strengths to improve accuracy and robustness. This approach enhances prediction using ANN's learning capabilities and LDA's discriminative power.
\[ \hat{y} = \mathrm{MajorityVote}\left(y_{\mathrm{SANN+LDA}},\ y_{\mathrm{BANN+LDA}},\ y_{\mathrm{ABANN+LDA}},\ y_{\mathrm{GANN+LDA}},\ y_{\mathrm{SVMANN+LDA}}\right) \tag{12} \]
2.8. Performance analysis with XAI
We analyze our best model's predictions using XAI techniques, such as SHAP, LIME, and Individual Conditional Expectation (ICE) plots, to gain transparency into its decision-making process. SHAP attributes prediction contributions to individual features, while LIME provides local explanations for specific predictions. ICE plots reveal feature effects across instances. These techniques enhance the interpretability of our ANN-based Max Voting model, improving its transparency for clinical applications [36–38].
2.9. External validation
To further evaluate the performance of our MVE model, we test it in various ways. We gather real-time patient information from multiple internet sources [39, 40], collecting three patient data sets representing diverse demographic and health conditions. These datasets are then tested against the pre-trained model, developed using a well-established dataset, allowing us to assess how well the model generalizes to unseen real-time data. Additionally, we apply the model to a multiclass classification dataset instead of the original binary classification task to examine its performance with more complex classification problems. This approach helps us evaluate the model's adaptability and scalability across a broader range of potential outcomes. The results from both the real-time patient data and the multiclass dataset provide valuable insights into the model's capabilities and highlight areas for future improvement.
3. Results and Discussion
After preprocessing the liver dataset, we evaluated ANN-based models and applied feature reduction techniques, finding LDA to be the most effective. We then implemented an MVE model with LDA, addressed outliers through scalarization, and used 10-fold cross-validation for results. Finally, XAI techniques were applied to enhance the interpretability and trustworthiness of the predictions.
We gain valuable insights into the dataset's structure and feature relationships through comprehensive data analysis using visualizations such as histograms, violin plots, and correlation heatmaps. Histograms reveal that Age, TPs, and AL follow near-normal distributions, while features like TB, ALPH, and ASAT exhibit right-skewed distributions with notable outliers, indicating the presence of extreme values that could impact model performance. Violin plots further confirm that bilirubin and enzyme levels are highly skewed. In contrast, protein levels and Age maintain more symmetric distributions, providing a clearer view of data spread and potential anomalies. Additionally, the correlation heatmap highlights strong positive relationships, such as between TB and DB and between Alanine Aminotransferase and ASAT, suggesting collinearity among liver function markers. Moderate negative correlations, like the inverse relationship between AL and Age, also emerge, offering insights into potential dependencies. These analyses are crucial in understanding data characteristics, guiding feature selection, and optimizing model performance.
Table 2 consolidates the performance metrics for six models across different feature optimization scenarios (no optimization, LDA, PCA, and RFE), providing a comprehensive comparison of accuracy, precision, recall, and F1 score. Without feature reduction, SVMANN leads with an accuracy of 78.03%, while SANN trails at 76.32%, setting the baseline for model effectiveness. With LDA applied, overall performance improves, with SANN achieving the highest accuracy of 79.15% and SVMANN recording the lowest at 75.72%, underscoring the nuanced impact of LDA on these models. When PCA is used, SVMANN emerges as the top performer with a 77.44% accuracy, contrasting with ABANN's lower accuracy of 74.70%, while corresponding precision, recall, and F1 scores further delineate these differences. Finally, under RFE, SVMANN again attains the highest accuracy at 78.03%, whereas GANN shows the lowest at 75.81%. This table highlights how various feature optimization techniques distinctly influence model performance, offering detailed insights into their relative strengths and weaknesses across multiple evaluation metrics.
Table 3 summarizes the LDA model's feature importance rankings and coefficients within the Max Voting framework across ten cross-validation folds and the Final Optimal Feature
Set (FOFS), derived as the union of top features across all folds.
Table 3 (fragment; ten fold-wise entries per rank, in extraction order)
Rank 10: TB (0.0328), Gender (0.0033), TB (0.0251), TB (0.0305), TB (0.0257), AGR (0.0141), Gender (0.1017), AGR (0.0053), TB (0.0281), Gender (0.1746)
Rank 9: AGR (0.0525), TB (0.0233), AGR (0.1255), AGR (0.0447), AGR (0.0348), TB (0.0263), AGR (0.1386), TB (0.0118), AGR (0.0801), ASAT (0.1835)
Rank 8: Gender (0.0952), AGR (0.0369), ASAT (0.1548), Gender (0.1533), ASAT (0.1423), ASAT (0.1475), ALAT (0.1607), ASAT (0.1741), ASAT (0.1313), AGR (0.2219)
Remaining labels, column position not recoverable: AGR, TB, Age, AL, Rank 7, DB, TP, AL, ALPH, ASAT
Table 4. Metrics across 10 folds for Max Voting model
Fold | Accuracy | Precision | Recall | F1 score
1 | 98.40 | 98.28 | 98.24 | 98.25
2 | 98.35 | 98.25 | 98.30 | 98.28
3 | 98.42 | 98.35 | 98.27 | 98.38
4 | 98.36 | 98.27 | 98.28 | 98.30
5 | 98.39 | 98.29 | 98.31 | 98.30
7 | 98.38 | 98.24 | 98.27 | 98.28
8 | 98.36 | 98.31 | 98.28 | 98.30
9 | 98.40 | 98.25 | 98.28 | 98.32
10 | 98.37 | 98.30 | 98.26 | 98.36
Mean | 98.38 | 98.28 | 98.28 | 98.31
Standard Deviation | 0.0221 | 0.0329 | 0.0199 | 0.0382
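The fold-wise summary statistics reported for Table 4 can be recomputed with a few lines of Python. The sketch below uses only the nine accuracy values that appear in the table as listed here (the row for fold 6 is not present), so its output is expected to be close to, but not identical with, the reported mean and standard deviation. Whether the reported figure is a sample or a population standard deviation is not stated; the sample form is assumed.

```python
from statistics import mean, stdev

# Accuracy (%) for the folds listed in Table 4 (fold 6 is not listed).
fold_accuracy = [98.40, 98.35, 98.42, 98.36, 98.39, 98.38, 98.36, 98.40, 98.37]

avg = mean(fold_accuracy)
sd = stdev(fold_accuracy)  # sample standard deviation (n - 1 denominator), an assumption
print(f"mean accuracy = {avg:.2f}%, standard deviation = {sd:.4f}")
```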
Figure 3(A) presents the CM for the Max Voting model. It shows
only one misclassification between disease and non-disease cases,
indicating strong predictive performance. Meanwhile, Figure 3(B)
displays the ROC curve, where the model achieves an AUC of 1.00,
Table 3 (fragment, rank label not recoverable): AL (0.6101), AL (0.6748), AL (0.6895), AL (0.6904), AL (0.6444), AL (0.6692), DB (1.0238), AL (0.5628), AL (0.6611), DB (0.9085)
Figure 2. Comparison of the best-performing models with and without feature optimization and Max Voting (bars shown for SVMANN, SANN + LDA, SVMANN + PCA, SVMANN + RFE, and Max Voting; axis range 0 to 120)
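The max voting step of Eq. (12) can be sketched in a few lines of Python. The prediction lists below are made-up placeholders for the outputs of the five LDA-optimized hybrids, and the helper name `majority_vote` is an assumption for illustration, not a function from the study's code.

```python
from collections import Counter

def majority_vote(*model_predictions):
    """Return, for each sample, the class label predicted by most models."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]

# Placeholder predictions for four samples (1 = disease, 0 = no disease).
y_sann   = [1, 0, 1, 1]
y_bann   = [1, 0, 0, 1]
y_abann  = [0, 0, 1, 1]
y_gann   = [1, 1, 1, 0]
y_svmann = [1, 0, 0, 1]

y_hat = majority_vote(y_sann, y_bann, y_abann, y_gann, y_svmann)
print(y_hat)
```

With five voters and two classes a tie cannot occur, so no tie-breaking rule is needed in the binary case.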
Figure 3. Performance evaluation of the MVE model: (A) CM for classification accuracy and (B) ROC curve for model
discrimination
TPs, along with Age and AL, increase it. Figure 4(C) presents the LIME explanation for class 0 (no liver disease), where Alanine Aminotransferase and DB contribute negatively, while TPs have a small positive impact. Figure 4(D) shows the LIME explanation for class 1 (liver disease), where Alanine Aminotransferase and ALPH contribute negatively. Finally, Figure 4(E) illustrates the SHAP FORCE plot, offering a more granular view of the force and direction of each feature's influence on the final prediction, emphasizing their relative contribution in a visual format. These visualizations provide a comprehensive understanding of the feature impacts driving the Max Voting model's decisions.
Figure 5(A) presents the ICE plots for each feature, showing how prediction values change with varying feature values. Features like Age and TB exhibit a more substantial influence on predictions, while Gender and TPs have minimal impact. Figure 5(B) shows the SHAP dependence analysis, revealing that Age and TB contribute positively to predictions. At the same time, ALPH and AL have varying impacts, suggesting their effects are more context-dependent. These analyses provide a deeper understanding of how individual features drive the model's decision-making process.
Table 5 compares feature prioritization across the XAI outputs (SHAP, LIME, FORCE, and ICE) and clinical experts. DB and TB are consistently high-priority features. This alignment between the model's decisions and expert judgment underscores the model's interpretability and potential for real-world clinical applications.
The performance of Deep Learning models, including Long-Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Convolutional Neural Network-Long-Short-Term Memory (CNN-LSTM) Ensemble models, was evaluated on a 583-instance dataset. Among these models, LSTM achieved the highest accuracy at 68.38%, closely followed by GRU at 68.12%. However, neither LSTM nor GRU outperformed the Max Voting model. Regarding precision, recall, and F1 score, LSTM achieved 52.47%, 52.45%, and 47.56%, respectively, while GRU scored higher on all three at 59.01%, 55.47%, and 53.88%. The CNN-LSTM ensemble model, on the other hand, had the lowest performance across all metrics, with an accuracy of 50.1%, precision of 51.15%, recall of 50.78%, and F1 score of 45.62%. Despite the strengths of these deep learning models, they do not surpass the MVE in predictive accuracy.
We further evaluated the performance of our Max Voting model using real-time patient data and a different dataset to assess its accuracy across various contexts. The validation with real-time sample data involved testing the model on new patient samples, including data from Mr. Akash (23, male) from Dr. Lal's Pathology Lab, Mrs. Sushila (53, female) from House of Diagnostics, and Mr. Wasif (30, male) from Chughtai Lab. Key health indicators such as TB, DB, ALPH, ALAT, ASAT, TP, AL,
Figure 4. XAI feature impact analysis in the Max Voting model: (A) SHAP summary plot, (B) SHAP waterfall plot, (C) LIME
explanation for no liver disease prediction, (D) LIME explanation for liver disease prediction, and (E) FORCE plot
Figure 5. Feature impact analysis for liver health assessment in the Max Voting model: (A) ICE plots for each feature and (B) SHAP
dependence analysis of predictive features
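An ICE curve fixes one patient record and sweeps a single feature over a grid of values while recording the model output at each step. The sketch below illustrates the idea with a hypothetical logistic scoring function standing in for the trained ensemble; the coefficients, feature values, and function names are invented for illustration and do not come from this study.

```python
import math

def toy_model(record):
    """Hypothetical risk score in (0, 1); NOT the trained Max Voting model."""
    z = 0.03 * record["Age"] + 0.4 * record["TB"] - 0.5 * record["AL"]
    return 1 / (1 + math.exp(-z))

def ice_curve(model, record, feature, grid):
    """Vary one feature of a single record over grid, keeping the others fixed."""
    curve = []
    for value in grid:
        modified = dict(record, **{feature: value})  # copy; original stays intact
        curve.append((value, model(modified)))
    return curve

patient = {"Age": 45, "TB": 1.2, "AL": 3.5}  # made-up record
curve = ice_curve(toy_model, patient, "TB", [0.5, 1.0, 2.0, 4.0])
for tb, score in curve:
    print(f"TB={tb:.1f} -> score={score:.3f}")
```

Plotting one such curve per patient on the same axes gives the ICE view; averaging the curves pointwise would give the corresponding partial dependence curve.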
Table 5. Comparison of feature importance rankings in liver disease prediction between XAI methods and clinical expert judgment
Columns SHAP through ICE are the XAI decision; columns Expert 1 and Expert 2 are the experts' decision.
Priority | SHAP | LIME (Disease) | LIME (No Disease) | SHAP WATERFALL | SHAP FORCE | ICE | Expert 1 | Expert 2
First | DB | ALAT | ALAT | ALPH | TB | TB | Both TB and DB are the most important | TB/DB
Second | ALPH | ALPH | DB | Age | ALPH | ALPH | | ASAT, ALPH
Third | Age | Age | AL | TP | DB | ASAT | | AL, AGR
Fourth | ALAT | TP | ALPH | ASAT | ASAT | DB | | Age, Gender
Fifth | TP | DB | TP | ALAT | TP | TP | | –
Sixth | ASAT | TB | AGR | DB | ALAT | AGR | | –
and AGR were used to evaluate the model's accuracy. The results of this validation were compared with the 583-sample dataset, showcasing the model's ability to accurately assess and predict patient health metrics.
The MVE model was tested on an external liver disease dataset of 30,691 patients [41] for validation. The model demonstrates strong performance with an average accuracy of 88.35%, highlighting its ability to generalize effectively to external samples and confirming its robustness in predicting liver disease outcomes. The model's precision, recall, and F1 score also reflect solid performance, with mean values of 92.99%, 79.48%, and 83.26%, respectively. The standard deviation for accuracy, precision, recall, and F1 score is 1.94, 1.30, 2.77, and 1.43, respectively, indicating relatively stable performance across the folds. The 95% confidence intervals for these metrics are ±1.20 for accuracy, ±1.30 for precision, ±2.77 for recall, and ±1.43 for F1 score, further validating the model's effectiveness in liver disease prediction.
We collect a Maternal Health Risk (MHR) dataset from Kaggle [42], which contains 1,014 samples and seven features divided into three classes: low, mid, and high risk. The results of the Max Voting model across 10 folds are evaluated using performance metrics such as accuracy, precision, recall, and F1 score. The model demonstrates strong performance, achieving an average accuracy of 93.07%, precision of 92.71%, recall of 93.06%, and an F1 score of 92.85%. For each fold, the standard deviation and 95% confidence intervals are calculated, showing minor variability, with confidence intervals of ±0.74 for accuracy, ±0.80 for precision, ±1.72 for recall, and ±0.89 for F1 score. The CM reveals that the model accurately classifies most samples across the three classes, correctly predicting 60 out of 67 Low-risk cases, 67 out of 71 Mid-risk cases, and 61 out of 64 High-risk cases. Some misclassifications occur, such as 5 Low-risk cases misclassified as Mid and 2 as High, along with a few misclassifications in the Mid and High-risk categories, but overall, the model effectively distinguishes between different risk levels, reflecting its robust predictive capabilities on the MHR dataset.
Finally, Table 6 compares our results with the existing literature and shows the superior accuracy of 98.38% achieved by the proposed approach, significantly outperforming the 70.54% to 79.6% range reported in previous studies. Unlike earlier research, this study incorporates feature optimization, cross-validation, and XAI techniques, addressing existing gaps. The ensemble model, built on ANN-based hybrid approaches, enhances predictive accuracy and interpretability, distinguishing it from prior work.
The lower accuracies and other performance metrics presented in Table 4 can be attributed mainly to the limited size of the dataset, which consists of only 583 samples and 11 features. This small dataset restricts the ability of individual models to generalize effectively, especially when it comes to capturing complex patterns. As a result, the models exhibit lower precision, recall, and F1 scores. When trained on such limited datasets with few features, models are more prone to overfitting or underfitting, as they lack sufficient information to identify intricate relationships. This ultimately leads to reduced accuracy and other performance metrics [43].
However, the MVE method with LDA achieves higher accuracy despite the limitations of individual models. By combining predictions from multiple models through Max Voting, this approach mitigates the weaknesses of each model, enhancing overall performance. LDA's role in reducing dimensionality allows each model to focus on the most relevant features, improving their performance within the ensemble. The ensemble capitalizes on the strengths of each model. At the same time, LDA's feature optimization provides a more transparent, more robust representation of the data, leading to improved accuracy and other metrics in the combined outcome.
Even though our dataset contains only 11 features, we used feature optimization techniques because they enhance model
Figure 6. Real-time framework for liver disease prediction using semi-auto biochemistry analysis, ANN-based MVE models, and
XAI-driven interpretation
accuracy by refining the dataset to its most informative aspects. These techniques, like LDA, PCA, and RFE, help the model focus on the features that most significantly contribute to identifying patterns and improving predictive reliability. By reducing noise and minimizing irrelevant data, feature optimization allows for more effective learning and generalization, increasing stability and reducing computational complexity. This approach is also valuable in small datasets, where maximizing the signal-to-noise ratio is critical for robust performance [44].
LDA proved the most compelling feature selection method because it maximizes class separation, making it ideal for classification tasks where distinguishing between classes is crucial. Unlike PCA, which reduces dimensionality based on variance without considering class labels, or RFE, which does not directly optimize for class discrimination, LDA enhances class separability. Additionally, LDA handles class imbalances better by considering the ratio of between-class to within-class variance, ensuring that selected features are most relevant for distinguishing between classes, even in imbalanced datasets [45].
Figure 6 depicts our liver disease prediction framework in the real-time scenario. Patient data is collected via questionnaires and blood samples, which are then analyzed in a semi-auto biochemistry analyzer to measure liver function indicators. This data is tested against an existing, optimized dataset of 583 samples using LDA, chosen for its effective feature selection in ANN-based models. Afterward, predictions are generated using the ANN-based MVE for improved accuracy. Finally, XAI enables users, including non-experts, to understand the predictions and confidently take further medical actions in consultation with experts.
4. Conclusion
In conclusion, this study demonstrates a robust approach to enhancing liver disease prediction by integrating ANN with five distinct ML models (Stacking, Bagging, AdaBoost, Gradient-Boosted Decision Tree, and SVM) to create five hybrid models optimized through LDA. Combined into an MVE, these LDA-optimized hybrids achieve a significant accuracy increase from 79.15% to 98.38%. XAI techniques, such as LIME, SHAP, and ICE, further support the transparency of the model's decision-making process. We validate the ensemble model's effectiveness by comparing its predictions with doctors' decisions and testing it on samples from external sources and a multiclass MHR dataset, confirming its adaptability beyond the initial dataset. A real-time demonstration of our model underscores its practical utility, though the study notes limitations, particularly in applying the
model to clinical settings due to data constraints. Future work will address these limitations by implementing Differential Privacy and Clinical Servers to protect patient data, with plans to extend the model to support multi-disease prediction. Additionally, we aim to construct a web server that would enhance the accessibility and value of this tool for the broader community and end users.
Acknowledgment
Our heartfelt gratitude goes to two distinguished physicians who generously shared their expertise, greatly enriching this study. Special thanks are extended to Dr. Shahriar Shafiq, a Higher Specialty Registrar in Diabetes and Endocrinology at the Royal College of Physicians of Edinburgh, England, referenced as Expert 1 in Table 5. Dr. Shafiq’s insightful guidance was instrumental in navigating the complexities of liver disease research. Appreciation is also due to Dr. Talha Sami Anik, Assistant Surgeon with the Government of the People’s Republic of Bangladesh, listed as Expert 2 in Table 5. Dr. Anik’s extensive experience, including his work at Birdem General Hospital and Dhaka Medical College & Hospital, brought valuable perspectives to this research. Despite their demanding schedules, both doctors demonstrated exceptional commitment and professionalism, providing critical support that significantly contributed to the advancement of this study.
Ethical Statement
This study does not contain any studies with human or animal subjects performed by any of the authors.
Conflicts of Interest
The authors declare that they have no conflicts of interest to this work.
Data Availability Statement
The data that support this work are available upon reasonable request to the corresponding author.
Author Contribution Statement
Safiul Haque Chowdhury: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization, Project administration. Mohammad Mamun: Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization, Project administration. Tanvir Ahmed Shaikat: Visualization, Project administration. Mohammed Ibrahim Hussain: Writing – review & editing, Supervision. Sadiq Iqbal: Writing – review & editing, Visualization, Supervision. Muhammad Minoar Hossain: Writing – review & editing, Supervision.
References
[1] Williams, R. (2006). Global challenges in liver disease. Hepatology, 44(3), 521–526. https://doi.org/10.1002/hep.21347
[2] American Liver Foundation. (2023). How many people have liver disease? Retrieved from: https://liverfoundation.org/about-your-liver/facts-about-liver-disease/how-many-people-have-liver-disease/
[3] Choubey, D. K., Dubey, P., Tewari, B. P., Ojha, M., & Kumar, J. (2023). Prediction of liver disease using soft computing and data science approaches. In M. Kumar, S. S. Gill, J. K. Samriya, & S. Uhlig (Eds.), 6G enabled fog computing in IoT: Applications and opportunities (pp. 183–213). Springer. https://doi.org/10.1007/978-3-031-30101-8_8
[4] Shetty, P. J., & Satyanarayana. (2023). Prediction performance of classification models for imbalanced liver disease data. International Journal of Statistics and Applied Mathematics, 8(5), 58–62.
[5] Alyabis, M. A. S., Howaimil, B. M. I., Alyabes, A. M. S., Alrabiah, A. A. H., Alrabiah, A. S. H., Aljumayi, I. M., ..., & Binshaheen, H. S. (2022). Prediction of liver diseases using neural network analysis. International Journal of Pharmaceutical and Bio Medical Science, 2(8), 314–320. https://doi.org/10.47191/ijpbms/v2-i8-08
[6] Singh, G., & Agarwal, C. (2023). Prediction and analysis of liver disease using extreme learning machine. In Sentiment Analysis and Deep Learning: Proceedings of ICSADL 2022, 679–690. https://doi.org/10.1007/978-981-19-5443-6_52
[7] Azam, S., Rahman, A., Iqbal, S. M. H. S., & Ahmed, T. (2020). Prediction of liver diseases by using few machine learning based approaches. Australian Journal of Engineering and Innovative Technology, 2(5), 85–90. https://doi.org/10.34104/ajeit.020.085090
[8] Choudhary, R., Gopalakrishnan, T., Ruby, D., Gayathri, A., Murthy, V. S., & Shekhar, R. (2021). An efficient model for predicting liver disease using machine learning. In R. Satpathy, T. Choudhury, S. Satpathy, S. N. Mohanty, & X. Zhang (Eds.), Data analytics in bioinformatics: A machine learning perspective (pp. 443–457). Wiley. https://doi.org/10.1002/9781119785620.ch18
[9] Khan, B., Naseem, R., Ali, M., Arshad, M., & Jan, N. (2019). Machine learning approaches for liver disease diagnosing. International Journal of Data Science and Advanced Analytics, 1(1), 27–31. https://doi.org/10.69511/ijdsaa.v1i1.71
[10] Thirunavukkarasu, K., Singh, A. S., Irfan, & Chowdhury, A. (2018). Prediction of liver disease using classification algorithms. In 4th International Conference on Computing Communication and Automation, 1–3. https://doi.org/10.1109/CCAA.2018.8777655
[11] Muthuselvan, S., Rajapraksh, S., Somasundaram, K., & Karthik, K. (2018). Classification of liver patient dataset using machine learning algorithms. International Journal of Engineering & Technology, 7(3.34), 323–326. https://doi.org/10.14419/ijet.v7i3.34.19217
[12] Yasmin, R., Amin, R., & Reza, S. (2023). Design of novel feature union for prediction of liver disease patients: A machine learning approach. In The Fourth Industrial Revolution and Beyond: Select Proceedings of IC4IR+, 515–526. https://doi.org/10.1007/978-981-19-8032-9_36
[13] Kufel, J., Bargieł-Łączek, K., Kocot, S., Koźlik, M., Bartnikowska, W., Janik, M., ..., & Gruszczyńska, K. (2023). What is machine learning, artificial neural networks and deep learning?—Examples of practical applications in medicine. Diagnostics, 13(15), 2582. https://doi.org/10.3390/diagnostics13152582
[14] Ramana, B. V., Babu, M. S. P., & Venkateswarlu, N. B. (2012). A critical comparative study of liver patients from USA and INDIA: An exploratory analysis. International Journal of Computer Science Issues, 9(3), 506–516.
[15] Shinde, B. G., & Shivthare, S. (2024). Impact of data visualization in data analysis to improve the efficiency of machine learning models. Journal of Advanced Zoology, 45, 107–112.
[16] Roy, S., Bhalla, K., & Patel, R. (2024). Mathematical analysis of histogram equalization techniques for medical image
enhancement: A tutorial from the perspective of data loss. Multimedia Tools and Applications, 83(5), 14363–14392. https://doi.org/10.1007/s11042-023-15799-8
[17] Hu, K. (2020). Become competent within one day in generating boxplots and violin plots for a novice without prior R experience. Methods and Protocols, 3(4), 64. https://doi.org/10.3390/mps3040064
[18] Gu, Z. (2022). Complex heatmap visualization. iMeta, 1(3), e43. https://doi.org/10.1002/imt2.43
[19] Ismail, A. R., Abidin, N. Z., & Maen, M. K. (2022). Systematic review on missing data imputation techniques with machine learning algorithms for healthcare. Journal of Robotics and Control, 3(2), 143–152. https://doi.org/10.18196/jrc.v3i2.13133
[20] Yu, L., Zhou, R., Chen, R., & Lai, K. K. (2022). Missing data preprocessing in credit classification: One-hot encoding or imputation? Emerging Markets Finance and Trade, 58(2), 472–482. https://doi.org/10.1080/1540496X.2020.1825935
[21] Pisner, D. A., & Schnyer, D. M. (2020). Support vector machine. In A. Mechelli, & S. Vieira (Eds.), Machine learning: Methods and applications to brain disorders (pp. 101–121). Academic Press. https://doi.org/10.1016/B978-0-12-815739-8.00006-7
[22] Zhao, S., Zhang, B., Yang, J., Zhou, J., & Xu, Y. (2024). Linear discriminant analysis. Nature Reviews Methods Primers, 4(1), 70. https://doi.org/10.1038/s43586-024-00346-y
[23] Greenacre, M., Groenen, P. J. F., Hastie, T., D’Enza, A. I., Markos, A., & Tuzhilina, E. (2023). Publisher correction: Principal component analysis. Nature Reviews Methods Primers, 3(1), 22. https://doi.org/10.1038/s43586-023-00209-y
[24] Chen, X. W., & Jeong, J. C. (2007). Enhanced recursive feature elimination. In Sixth International Conference on Machine Learning and Applications, 429–435. https://doi.org/10.1109/ICMLA.2007.35
[25] Abiodun, O. I., Jantan, A., Omolara, A. E., Dada, K. V., Mohamed, N. A., & Arshad, H. (2018). State-of-the-art in artificial neural network applications: A survey. Heliyon, 4(11), e00938. https://doi.org/10.1016/j.heliyon.2018.e00938
[26] Pavlyshenko, B. (2018). Using stacking approaches for machine learning models. In IEEE Second International Conference on Data Stream Mining & Processing, 255–258. https://doi.org/10.1109/DSMP.2018.8478522
[27] González, S., García, S., del Ser, J., Rokach, L., & Herrera, F. (2020). A practical tutorial on bagging and boosting based ensembles for machine learning: Algorithms, software tools, performance study, practical perspectives and opportunities. Information Fusion, 64, 205–237. https://doi.org/10.1016/j.inffus.2020.07.007
[28] Cao, Y., Miao, Q. G., Liu, J. C., & Gao, L. (2013). Advance and prospects of AdaBoost algorithm. Acta Automatica Sinica, 39(6), 745–758. https://doi.org/10.1016/S1874-1029(13)60052-X
[29] Bentéjac, C., Csörgő, A., & Martínez-Muñoz, G. (2021). A comparative analysis of gradient boosting algorithms. Artificial Intelligence Review, 54(3), 1937–1967. https://doi.org/10.1007/s10462-020-09896-5
[30] Vishwanathan, S. V. M., & Murty, M. N. (2002). SSVM: A simple SVM algorithm. In Proceedings of the 2002 International Joint Conference on Neural Networks, 3, 2393–2398. https://doi.org/10.1109/IJCNN.2002.1007516
[31] Fahmy, M. M. (2022). Confusion matrix in binary classification problems: A step-by-step tutorial. Journal of Engineering Research, 6(5), T1–T12.
[32] Lee, D. K., In, J., & Lee, S. (2015). Standard deviation and standard error of the mean. Korean Journal of Anesthesiology, 68(3), 220–223. https://doi.org/10.4097/kjae.2015.68.3.220
[33] Hazra, A. (2017). Using the confidence interval confidently. Journal of Thoracic Disease, 9(10), 4125–4130. https://doi.org/10.21037/jtd.2017.09.14
[34] Hoo, Z. H., Candlish, J., & Teare, D. (2017). What is an ROC curve? Emergency Medicine Journal, 34(6), 357–359. https://doi.org/10.1136/emermed-2017-206735
[35] Tian, T., & Zhu, J. (2015). Max-margin majority voting for learning from crowds. In Advances in Neural Information Processing Systems 28: 29th Annual Conference on Neural Information Processing Systems, 1–9.
[36] Barredo Arrieta, A., Díaz-Rodríguez, N., del Ser, J., Bennetot, A., Tabik, S., Barbado, A., ..., & Herrera, F. (2020). Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58, 82–115. https://doi.org/10.1016/j.inffus.2019.12.012
[37] van den Broeck, G., Lykov, A., Schleich, M., & Suciu, D. (2022). On the tractability of SHAP explanations. Journal of Artificial Intelligence Research, 74, 851–886. https://doi.org/10.1613/jair.1.13283
[38] Kawakura, S., Hirafuji, M., Ninomiya, S., & Shibasaki, R. (2022). Analyses of diverse agricultural worker data with explainable artificial intelligence: XAI based on SHAP, LIME, and LightGBM. European Journal of Agriculture and Food Sciences, 4(6), 11–19. https://doi.org/10.24018/ejfood.2022.4.6.348
[39] Vikas1055. (2019). Lab serial No. patient name age/sex referred by testname. Retrieved from: https://www.coursehero.com/file/42005265/labreportnewpdf/
[40] Asking for Self. (n.d.). Talk to liver on liver function test. Retrieved from: https://www.marham.pk/forum/liver-specialist/liver-function-test
[41] Velu, S. R., Ravi, V., & Tabianan, K. (2022). Data mining in predicting liver patients using classification model. Health and Technology, 12(6), 1211–1235. https://doi.org/10.1007/s12553-022-00713-3
[42] Ahmed, M., Kashem, M. A., Rahman, M., & Khatun, S. (2020). Review and analysis of risk factor of maternal health in remote area using the Internet of Things (IoT). In InECCE2019: Proceedings of the 5th International Conference on Electrical, Control & Computer Engineering, 357–365. https://doi.org/10.1007/978-981-15-2317-5_30
[43] Li, Z., Liu, F., Yang, W., Peng, S., & Zhou, J. (2022). A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Transactions on Neural Networks and Learning Systems, 33(12), 6999–7019. https://doi.org/10.1109/TNNLS.2021.3084827
[44] Ali, M. Z., Abdullah, A., Zaki, A. M., Rizk, F. H., Eid, M. M., & El-Kenway, E. M. (2024). Advances and challenges in feature selection methods: A comprehensive review. Journal of Artificial Intelligence and Metaheuristics, 7(1), 67–77. https://doi.org/10.54216/JAIM.070105
[45] Kim, A. K. H., & Chung, H. (2021). The effect of rebalancing on LDA in imbalanced classification. Stat, 10(1), e384. https://doi.org/10.1002/sta4.384
How to Cite: Chowdhury, S. H., Mamun, M., Shaikat, M. T. A., Hussain, M. I., Iqbal, M. S., & Hossain, M. M. (2025). An Ensemble Approach for Artificial Neural Network-Based Liver Disease Identification from Optimal Features Through Hybrid Modeling Integrated with Advanced Explainable AI. Medinformatics, 2(2), 107–119. https://doi.org/10.47852/bonviewMEDIN52024744