Final Report
Among the patient demographic parameters were the patients' age, race, region, etc. Attributes of the physician who prepared the prescription or observed the patient might be an important predictor, so they were included. The primary disease for which patients were treated in this case is Nontuberculous Mycobacterial (NTM) disease. Various tests, such as DEXA scans, are performed for NTM, producing metrics like the T-Score. Clinical factors such as the outcomes of these tests during the Rx and the change in these results over the last 1 to 2 years were also accounted for, along with the Risk Segment of the patient and whether multiple risks were present. Other treatment factors, such as comorbidity with other diseases alongside NTM, Injectable Experience, and the concomitancy of various drugs administered to the patient for NTM, were also accounted for. All these parameters will be used to build Machine Learning models that correctly classify patients based on their "Persistency Flag". Effort will also be made to determine the most influential parameters (or classes of parameters) behind patients' decision to continue the medication.
Beyond the healthcare perspective of this problem, it has an even more important business perspective. As discussed earlier, one of the challenges for all pharmaceutical companies is to understand the persistency of drugs as per the physician's prescription. The general trend of persistency of pharmaceutical products among a group of patients is downward, as depicted by the "Persistency Curve" in Figure 1. Pharmaceutical companies aim to determine the factors that most affect the decline so that they can address these issues properly and slow down the process (i.e., flatten the slope of the curve). Pharmaceutical companies, healthcare organizations, and hospitals in the USA lose billions of dollars per year because patients are not persistent with their prescribed medicines and/or treatments [1]. Based on the data collected by any pharmaceutical company or an Integrated Delivery Network (IDN), the most important factors behind the lower level of persistence among patients can be determined. The company or organization can then address those issues properly to mitigate the decline and, over time, resurvey to check for improvements.
From the Persistency Curve in Figure 1, it can be understood that the general trend of the Persistency Curve is downward, i.e., patients generally do not adhere to the medicine or the set of medicines prescribed by their doctors. They either stop taking the drug for various reasons or switch to another. Since the overall trend is always downward, the important thing to focus on is how to slow down the rate of decline. The key to slowing down this rate is to determine the most influential factor(s) behind it so that they can be addressed in time and important business decisions can be taken, which could save millions, sometimes even billions, for the company or the government. Making people adhere to a certain prescription or set of guidelines over a long period is not a trivial task, and IDNs across the USA were formed primarily to address this.
Figure 2: Visualization of the Descriptive Analysis on the Demographics of the Subjects (panels (a) to (f))
The dataset was dominated by non-Hispanic, Caucasian patients in large proportions. Even though the Region and Age classes were more evenly distributed, it is clear that most of the patients belonged to the higher age classes (only a few were younger than 55) and that the greatest number of patients came from the Midwest region. One of the most clinically important findings of this analysis was the high proportion of subjects belonging to an IDN: around 75% of the subjects belonged to a certain IDN, implying the success of forming clusters of health service providers for a better patient experience.
Figure 3: Visualization of the Persistency Flag with Respect to Other Parameters (panels (a) to (f))
In Figure 3, the target variable "Persistency Flag" has been plotted with respect to some of the independent variables or predictors. It can be seen that most people in the dataset belonged to the older age classes, and the non-persistency level (or ratio) is higher among the older patients. As discussed earlier, this study is imbalanced towards female subjects, and among females the non-persistency level is higher than among males; however, no concrete conclusion can be drawn due to the data imbalance. Also, low-risk patients were found to be less persistent than the high-risk ones, as shown in Figures 3(d) and 3(f).
Physician Specialty Type and Specialist Flag for the Observing Physician
In this section, the focus is on the practitioners involved in the cases, based on their specialty type and specialist flag (whether they are experts in their respective departments).
Figure 4: Specialty and Specialist-Flag of the Observing Medical Staff or Physician for the Patient
It can be observed that a large number of the physicians who handled the NTM cases were general practitioners, around 45% of the total. Among the other groups, not all were specialists: as the specialist flags indicate, only the physicians belonging to "Endocrinology", "Obstetrics and Gynecology", "Rheumatology", and "Urology" were specialists in their respective fields. This might suggest that NTM is more critical for these categories of patients, so specialists had to be involved.
Clinical Factors
The clinical factors can be divided into a few major classes, such as Comorbidity of Diseases, Concomitancy of Drugs, and Risk Factors among the subjects. Percentage-based column charts are plotted for each sub-category and shown in Figures 5-7, respectively. The concomitancy of some drugs is around 35%, while for other drugs it can be as low as 10% of subjects. In this case, the cholesterol-mitigating drug was the one most commonly used by the subjects alongside the main drug. On the other hand, the comorbidity parameter also varies considerably, meaning that some diseases are more commonly comorbid with NTM than others. The comorbidity of Lipoprotein Disorder is around 51%, the highest, meaning that this disease commonly occurs together with the main disease.
Figure 5: Concomitancy of Various Drugs among Subjects
Various risk factors were observed for the subjects under study. Among them, the risk of Vitamin D insufficiency was the most acute, while the risk of smoking tobacco and the risk of chronic malnutrition were other influential factors. Nevertheless, the strength of the relations between the various clinical factors can be understood from the Machine Learning and Dominance Analysis discussed in the next sections.
Figure 7: Risk Factors among Subjects
Recommendations
From the Exploratory Data Analysis (EDA) done on the dataset, the following recommendations are given to the ABC company's technical team:
- The demographic factors provided in the dataset are not strongly related to the "Persistency Level" of the patients.
- The NTM Specialist type or Specialist Flag did not show any correlation with the target variable.
- Clinical factors such as "Concomitancy of Drugs", "Comorbidity of Various Diseases", and "Risk Factors" do show some correlation with the target variable, the "Persistency Level" of the patients, which needs to be investigated further through a quantitative analysis such as Machine Learning.
Figure 15. Data Pre-processing and Machine Learning (ML) Pipeline (in Python)
A data pre-processing and ML pipeline was developed using Python Jupyter Notebooks on the Google Colab platform. The overall structure of the pipeline is shown in Figure 15. The CSV file containing the dataset was uploaded to Google Drive, which is connected to Colab. Using Python's pandas library, a DataFrame was created from the CSV dataset and used for further analysis. First, unnecessary data such as the "Patient ID" column was removed from the dataset, and NULL values were filled in (not dropped) with 'N/A' (Not Answered). For the descriptive analysis, categorical data could be used directly in a pandas DataFrame, while regression techniques only work on numerical data, so dummy variables were created from the categorical data for the regression analysis. All versions of the processed datasets were saved in Google Drive. The dummy DataFrame was converted into a NumPy array (which can only take numerical values) to perform ML on it. The dataset of 3424 subjects was split into train and test sets at a ratio of 80:20 using Python's scikit-learn library: 2739 subjects were randomly assigned to the train set and 685 to the test set. While the test set is normally smaller than the train set, this standard ratio can be altered to check the performance of the ML model (e.g., 70:30 or 85:15). Since scikit-learn creates the train-test split randomly, the samples inside the train and test datasets differ from run to run. This can affect the performance, since the randomly chosen test data might over-represent a certain type of patient or have characteristics quite different from the training data; in both cases, the outcome will be biased and may change on each run. To reduce this bias, 5-fold Cross-Validation (CV) was performed while training and testing, i.e., the entire process of splitting the dataset and performing ML was run 5 times, and the final result is the average of the results from these 5 runs.
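As a minimal sketch of these pre-processing and splitting steps (the file name and the column names "Patient_ID" and "Persistency_Flag" are illustrative assumptions; the actual dataset may use different names):

```python
# Minimal sketch of the pre-processing and splitting steps described above.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("persistency_dataset.csv")   # hypothetical file name
df = df.drop(columns=["Patient_ID"])          # remove the unnecessary ID column
df = df.fillna("N/A")                         # fill NULL values instead of dropping them

y = (df["Persistency_Flag"] == "Persistent").astype(int)   # binary target
X = pd.get_dummies(df.drop(columns=["Persistency_Flag"]))  # dummy variables [3]

# 80:20 split (2739 train / 685 test subjects for the 3424-subject dataset)
X_train, X_test, y_train, y_test = train_test_split(
    X.to_numpy(), y.to_numpy(), test_size=0.20, random_state=42)
```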
Eleven techniques were used as classifiers under supervised Machine Learning. Among them are commonly used techniques like Logistic Regression, K-Nearest Neighbor (KNN), Support Vector Machines (SVM), Stochastic Gradient Descent (SGD), and Decision Trees; ensemble techniques such as Gradient "Tree" Boosting, Random Forest, Extra Trees, AdaBoost, and XgBoost; and an Artificial Neural Network (ANN) technique, the Multi-Layer Perceptron (MLP). Moreover, a Deep Neural Network sequential model was developed using Keras. The algorithms are briefly discussed below:
Logistic Regression: Logistic Regression is a regression method like Linear Regression but can accept categorical data in the dependent variable, which Linear Regression handles poorly. Logistic Regression can be used for classification tasks due to its non-linear, logarithmic (log-odds) link function (Equation 1).
$$Y = \ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n \quad \text{(Equation 1)}$$
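For instance, Equation 1 can be checked numerically with scikit-learn; a minimal sketch on synthetic data (all names and data are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf_lr = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

p = clf_lr.predict_proba(X_demo[:1])[0, 1]                   # P(Y = 1) for one sample
log_odds = clf_lr.intercept_[0] + X_demo[0] @ clf_lr.coef_[0]  # beta_0 + sum(beta_i * X_i)
assert np.isclose(np.log(p / (1 - p)), log_odds)             # ln(P / (1 - P)) matches Eq. 1
```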
K-Nearest Neighbor (KNN): KNN classifies new data points or cases based on the classes of their neighbors. The number of neighboring points to be tested can be changed for optimum performance on a dataset. Condensed Nearest Neighbor (CNN, the Hart algorithm) can be used to find the border ratio (Equation 2) so that KNN can be performed efficiently with multiple neighbors.
$$a(x) = \frac{\lVert x' - y \rVert}{\lVert x - y \rVert} \quad \text{(Equation 2)}$$
Here, $\lVert x - y \rVert$ is the distance from x to the closest example y having a different class than x, and $\lVert x' - y \rVert$ is the distance from y to its closest example x' with the same label as x.
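A small NumPy sketch of the border ratio in Equation 2 (illustrative only, not the full Hart algorithm):

```python
import numpy as np

def border_ratio(x, x_label, X, labels):
    """Border ratio a(x) = ||x' - y|| / ||x - y|| from Equation 2."""
    other = X[labels != x_label]   # examples with a different class than x
    same = X[labels == x_label]    # examples with the same label as x
    y = other[np.argmin(np.linalg.norm(other - x, axis=1))]      # closest other-class point to x
    x_prime = same[np.argmin(np.linalg.norm(same - y, axis=1))]  # closest same-class point to y
    return np.linalg.norm(x_prime - y) / np.linalg.norm(x - y)
```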
Support Vector Machines (SVM): The primary aim of SVM is to create hyperplanes that can act as decision boundaries during the classification of multidimensional data, so that any new data point falls into the correct category. SVM chooses the extreme points/vectors, known as the Support Vectors, which help in creating the hyperplane.
Stochastic Gradient Descent (SGD): SGD is an iterative method for optimizing an objective function by performing Gradient Descent on a randomly selected portion of the dataset. The per-sample cost function minimized by SGD during the training of a Machine Learning model is shown in Equation 3.
$$\mathrm{cost}\left(\theta, \left(x^{(i)}, y^{(i)}\right)\right) = \frac{1}{2}\left(h_\theta\left(x^{(i)}\right) - y^{(i)}\right)^2 \quad \text{(Equation 3)}$$
Decision Trees: Decision Tree Learning in ML uses a Decision Tree for classification or regression. A Decision Tree is a flowchart-like model based on events and outcomes used to reach a certain goal. A decision tree comprises nodes, branches, and leaves, where each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf represents a class label.
Ensemble Techniques: In Machine Learning, ensemble techniques such as Gradient "Tree" Boosting, Random Forest, Extra Trees, AdaBoost, and XgBoost combine many instances of a weaker base model, typically Decision Trees, to produce a more effective predictive model, with each algorithm building or reweighting the base models in its own way.
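As a hedged sketch of how the classifiers discussed so far might be compared under 5-fold CV (using the X and y from the pre-processing sketch above; all hyperparameters are illustrative defaults, and xgboost is a third-party package):

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, AdaBoostClassifier)
from xgboost import XGBClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "SGD": SGDClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Extra Trees": ExtraTreesClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "XgBoost": XGBClassifier(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")  # 5-fold CV
    print(f"{name}: mean accuracy = {scores.mean():.2f}")
```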
Multi-Layer Perceptron (MLP): MLP is a class of (feedforward) Artificial Neural Network (ANN) which uses artificial neurons as nodes in the network to perform Machine Learning tasks. An MLP comprises at least three layers: an input, a hidden, and an output layer. MLP uses non-linear functions such as ReLU, sigmoid, etc. as activation functions for each node, which distinguishes it from a single-layer Perceptron, which uses a linear activation function instead. MLP is trained with backpropagation, a supervised technique. The learning step of MLP can be represented by Equation 4.
$$-\frac{\partial \varepsilon(n)}{\partial v_j(n)} = \phi'\left(v_j(n)\right) \sum_k -\frac{\partial \varepsilon(n)}{\partial v_k(n)}\, w_{kj}(n) \quad \text{(Equation 4)}$$
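The Keras sequential DNN mentioned earlier might, as a rough sketch, look like the following (the layer sizes, epochs, and batch size are illustrative assumptions, not the report's exact architecture; X_train and y_train come from the pre-processing sketch above):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative feed-forward network for the binary Persistency Flag
model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(X_train.shape[1],)),
    layers.Dense(32, activation="relu"),    # hidden layers with non-linear activations
    layers.Dense(1, activation="sigmoid"),  # output: probability of persistence
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)
```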
Evaluation of the Machine Learning (ML) Techniques: The ML models were evaluated by measuring the errors, creating the Confusion Matrix, and extracting from it parameters such as Accuracy, Precision, Recall (or Sensitivity), and f1-Score. Their formulae are shown in Equations 5 to 8. Here, TP = True Positive (equivalent to a hit), TN = True Negative (equivalent to a correct rejection), FP = False Positive (false alarm, type I error, or overestimation), and FN = False Negative (equivalent to a miss, type II error, or underestimation) are the four main entries of the Confusion Matrix.
$$\mathbf{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad \text{(Equation 5)}$$

$$\mathbf{Precision} = \frac{TP}{TP + FP} \quad \text{(Equation 6)}$$

$$\mathbf{Recall} = \frac{TP}{TP + FN} \quad \text{(Equation 7)}$$

$$\mathbf{f1\text{-}Score} = \frac{2\,TP}{2\,TP + FP + FN} \quad \text{(Equation 8)}$$

Figure 16. Confusion Matrix in General Format
The Receiver Operating Characteristic (ROC) curves were plotted in the same window for each class of the dependent variable. The ROC curve is created by plotting the True Positive Rate (TPR, also known as Sensitivity or Recall) against the False Positive Rate (FPR, also known as the probability of a false alarm) at various threshold settings to discover the optimal model. The formulae for TPR and FPR are shown in Equations 9 and 10, respectively. The Area Under the ROC Curve, commonly known as AUC, was calculated for each ROC curve; it can be found by integrating the ROC curve over the whole interval. As a scale-invariant measure, AUC measures how well the predictions are ranked among classes, regardless of their respective magnitudes, and it measures the quality of the model's predictions irrespective of the classification threshold.
$$\mathbf{True\ Positive\ Rate\ (TPR)} = \frac{TP}{TP + FN} \equiv 1 - \mathrm{FNR} \quad \text{(Equation 9)}$$

$$\mathbf{False\ Positive\ Rate\ (FPR)} = \frac{FP}{FP + TN} \equiv 1 - \mathrm{TNR} \quad \text{(Equation 10)}$$
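As a sketch, the ROC curve and its AUC might be computed with scikit-learn as follows (fitting a Gradient Boosting model as an example; X_train, X_test, y_train, and y_test come from the split sketched earlier):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

clf = GradientBoostingClassifier().fit(X_train, y_train)
y_score = clf.predict_proba(X_test)[:, 1]   # probability of the positive class

fpr, tpr, _ = roc_curve(y_test, y_score)    # Equations 9 and 10 over all thresholds
roc_auc = auc(fpr, tpr)                     # integrate the ROC curve

plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")    # chance line
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.legend()
plt.show()
```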
There are various error metrics, such as Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Median Absolute Deviation (MAD). MAE was chosen since it is one of the most common error measures in the ML research community. The formula for MAE is shown in Equation 11, where the predicted values are compared to the true/actual labels from the test data.
$$\mathbf{MAE} = \frac{\sum_{i=1}^{n} \lvert y_i - \lambda(x_i) \rvert}{n} \quad \text{(Equation 11)}$$
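A minimal sketch of how Equations 5 to 8 and 11 might be computed from the Confusion Matrix with scikit-learn (reusing the clf fitted in the ROC sketch above):

```python
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, mean_absolute_error)

y_pred = clf.predict(X_test)   # hard class predictions

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()  # entries of Figure 16
print("Accuracy :", accuracy_score(y_test, y_pred))        # Equation 5
print("Precision:", precision_score(y_test, y_pred))       # Equation 6
print("Recall   :", recall_score(y_test, y_pred))          # Equation 7
print("f1-Score :", f1_score(y_test, y_pred))              # Equation 8
print("MAE      :", mean_absolute_error(y_test, y_pred))   # Equation 11
```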
Table 2: Results of the ML algorithms on the original dataset, the Autoencoder feature map, and the Dominance Analysis subset

Original dataset:

| ML Algorithm | MAE | Accuracy | Precision | Recall | f1-Score | AUC |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.19 | 0.81 | 0.81 | 0.79 | 0.81 | 0.88 |
| K-Nearest Neighbour (KNN) | 0.22 | 0.78 | 0.78 | 0.73 | 0.76 | 0.84 |
| Support Vector Machine (SVM) | 0.21 | 0.79 | 0.78 | 0.76 | 0.78 | 0.86 |
| Stochastic Gradient Descent (SGD) | 0.24 | 0.76 | 0.76 | 0.75 | 0.76 | 0.81 |
| Decision Tree | 0.27 | 0.73 | 0.73 | 0.71 | 0.73 | 0.71 |
| Gradient Boosting | 0.19 | 0.81 | 0.81 | 0.78 | 0.81 | 0.88 |
| Random Forest | 0.19 | 0.81 | 0.80 | 0.78 | 0.80 | 0.88 |
| Extra Trees | 0.21 | 0.79 | 0.79 | 0.77 | 0.79 | 0.87 |
| AdaBoost | 0.19 | 0.81 | 0.81 | 0.79 | 0.81 | 0.87 |
| XgBoost | 0.21 | 0.79 | 0.80 | 0.75 | 0.78 | 0.87 |
| Multi-Layer Perceptron (MLP) | 0.25 | 0.75 | 0.75 | 0.74 | 0.75 | 0.82 |
| ANN developed with Keras | - | 0.80 | 0.78 | 0.79 | 0.79 | - |

Autoencoder feature map:

| ML Algorithm | MAE | Accuracy | Precision | Recall | f1-Score | AUC |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.20 | 0.80 | 0.79 | 0.78 | 0.79 | 0.87 |
| K-Nearest Neighbour (KNN) | 0.21 | 0.79 | 0.80 | 0.78 | 0.79 | 0.83 |
| Support Vector Machine (SVM) | 0.20 | 0.80 | 0.80 | 0.79 | 0.80 | 0.82 |
| Stochastic Gradient Descent (SGD) | 0.21 | 0.79 | 0.80 | 0.79 | 0.79 | 0.86 |
| Decision Tree | 0.25 | 0.75 | 0.75 | 0.73 | 0.75 | 0.73 |
| Gradient Boosting | 0.20 | 0.80 | 0.80 | 0.79 | 0.80 | 0.86 |
| Random Forest | 0.22 | 0.78 | 0.78 | 0.76 | 0.78 | 0.84 |
| Extra Trees | 0.23 | 0.77 | 0.77 | 0.75 | 0.77 | 0.84 |
| AdaBoost | 0.21 | 0.79 | 0.79 | 0.78 | 0.79 | 0.86 |
| XgBoost | 0.20 | 0.80 | 0.80 | 0.79 | 0.80 | 0.86 |
| Multi-Layer Perceptron (MLP) | 0.21 | 0.79 | 0.79 | 0.77 | 0.79 | 0.86 |

Dominance Analysis subset:

| ML Algorithm | MAE | Accuracy | Precision | Recall | f1-Score | AUC |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.27 | 0.73 | 0.73 | 0.68 | 0.72 | 0.76 |
| K-Nearest Neighbour (KNN) | 0.31 | 0.69 | 0.68 | 0.63 | 0.67 | 0.70 |
| Support Vector Machine (SVM) | 0.28 | 0.72 | 0.72 | 0.69 | 0.72 | 0.71 |
| Stochastic Gradient Descent (SGD) | 0.28 | 0.72 | 0.71 | 0.67 | 0.71 | 0.74 |
| Decision Tree | 0.35 | 0.65 | 0.63 | 0.59 | 0.63 | 0.62 |
| Gradient Boosting | 0.28 | 0.72 | 0.71 | 0.68 | 0.71 | 0.76 |
| Random Forest | 0.33 | 0.67 | 0.66 | 0.63 | 0.66 | 0.69 |
| Extra Trees | 0.34 | 0.66 | 0.64 | 0.60 | 0.64 | 0.67 |
| AdaBoost | 0.28 | 0.72 | 0.72 | 0.67 | 0.71 | 0.75 |
| XgBoost | 0.28 | 0.72 | 0.71 | 0.66 | 0.70 | 0.76 |
| Multi-Layer Perceptron (MLP) | 0.32 | 0.68 | 0.68 | 0.65 | 0.68 | 0.70 |
It can be noticed that AdaBoost performed best on the whole dataset, while Gradient Boosting performed best on the Autoencoder feature map. The subset dataset created through Dominance Analysis performed much worse for all models, so it can be discarded. The remarkable thing about the autoencoder is that only the 2 best features (after trial and error over feature counts of 2, 4, 8, 16, 32, 64, and 128) were selected for creating the feature map of dimension 3424 × 2 = 6848 values, which is about 59 times smaller than the whole dataset of about 404032 values, yet it performed almost the same as the whole dataset. So, in some cases, autoencoders can represent a large dataset in a very compact form while maintaining performance. This improves reproducibility, eases deployment on mobile devices with limited computational ability and storage, and increases the portability of the models. We therefore select the Autoencoder-based approach as the final proposed model for this project.
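A minimal Keras sketch of such an autoencoder with a 2-feature bottleneck (the activations, loss, epochs, and batch size are illustrative assumptions; the dummy variables are assumed to lie in [0, 1]):

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = X_train.shape[1]
inputs = keras.Input(shape=(n_features,))
encoded = layers.Dense(2, activation="relu")(inputs)             # 2-feature bottleneck
decoded = layers.Dense(n_features, activation="sigmoid")(encoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)           # extracts the compact feature map
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, epochs=50, batch_size=32)  # learn to reconstruct the inputs

X_train_small = encoder.predict(X_train)  # shape (n_samples, 2) feature map
X_test_small = encoder.predict(X_test)    # fed to the classifiers evaluated above
```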
The ROC curves and the Confusion Matrix for the best-performing model of the autoencoder approach, i.e., Gradient Boosting, are shown in Figures 17 and 18, respectively.
Figure 17. ROC Curves for all ML Models used to Evaluate AutoEncoder Extracted Features
Conclusion
In conclusion, the autoencoder-based approach proved to be the best pipeline for this project; it can be further improved by tuning the model, e.g., by changing the U-Net model parameters. The best model was able to detect Patient Persistency with an accuracy of 81%. More data can be collected, especially focusing on removing the data imbalance discussed above, to improve the performance.
GitHub Repo Link
https://github.com/m-odeh/Persistency-of-a-drug
References
[1] "Integrated Delivery Networks and Their Growing Influence on Regional Healthcare in the US", Decision Resources Group. [Online]. Available: https://decisionresourcesgroup.com/solutions/integrated-delivery-networks-and-their-growing-influence-on-regional-healthcare-in-the-us/.
[2] "How to Project Patient Persistency", ResearchGate. [Online]. Available: https://www.researchgate.net/publication/5055576_How_to_Project_Patient_Persistency.
[3] "pandas.get_dummies — pandas 1.2.4 documentation", Pandas.pydata.org, 2021. [Online]. Available: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html.
[4] "Deep inside: Autoencoders", Medium, 2021. [Online]. Available: https://towardsdatascience.com/deep-inside-autoencoders-7e41f319999f.
[5] "Understanding Semantic Segmentation with UNET", Medium, 2021. [Online]. Available: https://towardsdatascience.com/understanding-semantic-segmentation-with-unet-6be4f42d4b47.
[6] "nibtehaz/PPG2ABP", GitHub, 2021. [Online]. Available: https://github.com/nibtehaz/PPG2ABP. [Accessed: 15-May-2021].
[7] R. Azen and D. V. Budescu, "The Dominance Analysis Approach for Comparing Predictors in Multiple Regression", Psychological Methods, vol. 8, no. 2, pp. 129-148, 2003. https://doi.org/10.1037/1082-989X.8.2.129.
[8] "dominance-analysis/dominance-analysis", GitHub, 2021. [Online]. Available: https://github.com/dominance-analysis/dominance-analysis. [Accessed: 15-May-2021].