Mini Project Documentation
By
            K. ADITHI                           -      21RA1A6603
            S. VARSHITHA KUSUMA PRIYA           -      21RA1A6606
            A. MAHESH                           -      21RA1A6627
CERTIFICATE
We hereby declare that this project work entitled “Machine Learning - Differential
Diagnosis of Erythemato-Squamous Diseases from Clinical and Microscopic
Features”, submitted in partial fulfillment of the requirements for the award of the degree in Computer
Science and Engineering (Artificial Intelligence & Machine Learning), is a bonafide work
carried out by us during the academic year 2024-25.
We further declare that this project is the result of our own effort and has not been submitted by us to any institution for the
award of any degree.
                                                                     By
                                              K Adithi Shree                   (21RA1A6603)
                                              S Varshitha Kusuma Priya         (21RA1A6606)
                                              A Mahesh                         (21RA1A6627)
                             ACKNOWLEDGEMENT
It gives us immense pleasure to acknowledge with gratitude the help and support extended
throughout the project from the following:
We are deeply grateful to the Almighty and to our Parents, who have made us capable of
carrying out this work.
We are grateful to Mrs. G. Sujatha, Head of the Department, AI&ML, for her
amiable, ingenious and adept suggestions and pioneering guidance during the project.
We express our gratitude and thanks to the project coordinator, Dr. S. Sankar Ganesh, of our
department for his contribution in making it a success within the given time.
We express our deep sense of gratitude and thanks to our Internal Guide, Dr. S. Sankar
Ganesh, Associate Professor, for his guidance during the project.
We are also very thankful to our Management, Staff Members and all our Friends for
their valuable suggestions and timely guidance, without which we could not have
completed this work.
By
Mission Statement
      PEO’s               Statement
      PEO1                The graduates of Computer Science and Engineering will have a successful career in technology.
      PEO2                The graduates of the program will have a solid technical and professional foundation to continue higher studies.
      PEO3                The graduates of the program will have the skills to develop products, offer services and innovation.
      PEO4                The graduates of the program will have a fundamental awareness of industry processes, tools and technologies.
      Kommuri Pratap Reddy Institute of Technology
       Department Of Computer Science And Engineering
         (Artificial Intelligence And Machine Learning)
Program Outcomes
TABLE OF CONTENTS
ABSTRACT………………………………………………………………
CHAPTER-1 INTRODUCTION………………………………………....
      1.1 Overview
      1.2 Motivation
      1.3 Problem Statement
      1.4 Applications
CHAPTER-8 RESULTS…………………………………………………...
      8.1 Implementation Description:
      8.2 Dataset Description
      8.3 Results Description
REFERENCES…………………………………………………………….
                                 LIST OF FIGURES
Figure 8.3.2 Count plot of the various classes of the class column of the dataset
before applying SMOTE.
Figure 8.3.3 Count plot of the various classes of the class column of the
dataset after applying SMOTE.
Figure 8.3.6 Performance Metrics of SVM and Decision Tree Classifier Models.
                                          TABLE OF SCREENSHOTS
ABSTRACT
INTRODUCTION
1.1 Overview
1.2 Motivation
    Automation in the form of advanced diagnostic tools and systems can address these
    issues by providing more accurate, consistent, and timely information. Automated
    systems can integrate data from various sources, such as electronic health records,
    imaging, and patient inputs, to provide a comprehensive view of a patient’s condition.
    This approach not only enhances diagnostic accuracy but also streamlines treatment
    planning and monitoring, ultimately leading to improved patient outcomes and more
    efficient use of healthcare resources.
1.4 Applications
LITERATURE SURVEY
Xie et al. [2] proposed a CNN-based approach for the classification of skin disease
images, focusing on enhancing the performance of automated diagnostic systems.
They demonstrated the effectiveness of deep learning models in distinguishing
between different skin conditions by utilizing a large dataset of clinical images. The
study emphasized the importance of feature extraction and model training on diverse
data to achieve high classification accuracy and address the variability in skin disease
presentations.
Wong et al. [3] developed an automated skin cancer detection system using deep
CNNs. The paper highlighted the system’s ability to classify skin lesions into benign
or malignant categories with high accuracy. The authors addressed the integration of
ML models into clinical workflows, emphasizing the benefits of reduced diagnostic
time and improved consistency. The study also identified limitations, including the
need for extensive and diverse training datasets to enhance model generalization.
Ghosh et al. [4] introduced an intelligent system for automated skin disease diagnosis
utilizing deep learning techniques. The study focused on integrating image analysis
with histopathological data to improve diagnostic precision. The authors highlighted
the system’s capability to handle complex disease features and reduce the reliance on
manual interpretation. The research underscored the system’s potential to assist
dermatologists in making informed decisions and improving patient outcomes.
Kumar et al. [5] proposed a multi-modal deep learning network for dermatological
disease diagnosis, combining clinical images and histopathological data. The paper
demonstrated how the integration of multiple data sources enhances diagnostic
performance by capturing a comprehensive view of the disease. The authors discussed
the advantages of their approach in providing accurate and consistent diagnoses and
the challenges related to data fusion and model training.
Sharma et al. [6] reviewed various machine learning approaches for skin lesion
classification, focusing on the effectiveness of different algorithms in detecting skin
diseases. The study highlighted advancements in ML techniques, such as support
vector machines and neural networks, and their application in dermatology. The
authors addressed the issues of dataset quality and the need for robust evaluation
metrics to ensure the reliability of automated diagnostic systems.
Lee et al. [7] developed hybrid deep learning models for accurate skin disease
classification, incorporating CNNs with other ML techniques. The study
demonstrated the benefits of combining different models to improve diagnostic
accuracy and address limitations of single-model approaches. The authors emphasized
the importance of model integration and the use of extensive datasets to enhance the
system’s ability to handle diverse skin conditions.
Zhao et al. [8] investigated the use of Generative Adversarial Networks (GANs)
alongside CNNs for skin disease diagnosis. The paper focused on how GANs can
generate synthetic images to augment training datasets and improve model
performance. The study highlighted the potential of combining GANs with CNNs to
address challenges such as data scarcity and model overfitting, enhancing the overall
diagnostic accuracy.
Das et al. [9] explored advanced techniques for skin disease detection using machine
learning, including feature selection and model optimization strategies. The study
examined the effectiveness of different ML algorithms in improving diagnostic
accuracy and handling complex skin conditions. The authors discussed the impact of
algorithmic improvements on the reliability of automated systems and the importance
of continuous model refinement.
Collins et al. [13] investigated predictive modeling techniques for skin disease
classification using machine learning algorithms. The study explored various ML
approaches and their application in diagnosing dermatological conditions,
highlighting the advantages of predictive modeling in improving diagnostic outcomes.
The authors discussed the impact of algorithmic advancements on the accuracy and
efficiency of skin disease diagnosis.
Williams et al. [14] focused on the application of convolutional neural networks for
automated dermatological diagnosis. The paper demonstrated the effectiveness of
CNNs in analyzing skin images and identifying various skin conditions. The study
highlighted the benefits of using deep learning models to enhance diagnostic accuracy
and reduce the time required for manual analysis.
Patel et al. [15] examined the integration of machine learning into dermatological
diagnostic processes, addressing the challenges and prospects of automated systems.
The study explored the potential of ML models to improve diagnostic precision and
efficiency while identifying key issues such as data quality and model interpretability.
The authors emphasized the transformative potential of ML in dermatology and the
need for further research to optimize these systems.
                                     CHAPTER 3
EXISTING SYSTEM
Process:
1. Clinical Examination:
2. Histopathological Analysis:
1. Subjectivity:
2. Time-Consuming:
3. Resource Intensive:
      o   Laboratory Resources: Histopathological analysis requires specialized
          equipment and skilled personnel. This can be resource-intensive, especially
          in settings with limited access to advanced diagnostic tools.
      o   Cost: The cost associated with biopsies and subsequent analyses can be high,
          making it less accessible to patients in low-resource settings.
4. Diagnostic Accuracy:
      o   Atypical Cases: Some cases may present atypically, making them harder to
          diagnose with traditional methods. Misdiagnosis or delayed diagnosis can
          lead to inappropriate treatment.
5. Patient Involvement:
6. Advancements in Technology:
PROPOSED SYSTEM
4.1 Overview
 Step 1 Dataset: The research utilizes a dataset (data.csv) containing clinical and
  microscopic features for diagnosing erythemato-squamous diseases. The dataset
  includes various attributes relevant to the diagnosis and a target class representing the
  disease type.
 Step 2 Data Preprocessing: Initial preprocessing involves loading the dataset and
  inspecting it for unique class values and missing values. Missing values are handled
  by replacing placeholders (e.g., '?') with NaN and then dropping rows with missing
  data. The dataset is subsequently described to understand its statistical properties.
 Step 3 Handling Missing Values: Missing values are managed by replacing '?' with
  NaN and removing records with any missing entries. This approach ensures that the
  dataset is clean and ready for analysis, reducing potential biases and inaccuracies in
  the model training process.
 Step 4 Data Balancing: To address class imbalance, the Synthetic Minority Over-
  sampling Technique (SMOTE) is applied. SMOTE generates synthetic samples for
  underrepresented classes to balance the dataset and improve the model's ability to
  generalize across all classes (a short sketch of this step follows this list).
 Step 5 Splitting Data: The dataset is divided into training and testing subsets using
  an 80-20 split ratio. This separation allows for model training on a portion of the data
  while evaluating its performance on unseen test data.
 Step 6 Model Training: Two machine learning models, a Decision Tree
  Classifier and a Support Vector Machine (SVM), are trained.
 Decision Tree Classifier: The Decision Tree Classifier, with a maximum depth of 3,
   is used to classify the data. It is trained on the training set and evaluated using
   various metrics including precision, recall, F1 score, and accuracy.
 SVM Classifier: The Support Vector Machine (SVM) model, which is an advanced
   algorithm compared to decision trees, is trained using a linear kernel. The SVM
 model is intended to provide better performance by effectively handling the
 complexities and non-linearities in the data.
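As referenced in Step 4 above, the following is a minimal sketch of the class-balancing step, assuming the feature matrix X and target vector y have already been prepared as described in Steps 2 and 3; the imblearn package and the random_state value are assumptions, not part of the original listing.

# Hedged sketch of Step 4: balancing classes with SMOTE (assumes X and y already exist)
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)                     # synthesises new samples for the minority classes
X_balanced, y_balanced = smote.fit_resample(X, y)
print("Class counts before SMOTE:", Counter(y))
print("Class counts after SMOTE:", Counter(y_balanced))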
The preprocessing stage is meticulously designed to prepare the data for machine
learning models. Key actions include handling missing data, encoding categorical
variables, visualizing class distribution, addressing class imbalance with SMOTE, and
splitting the data into training and testing sets. These steps are crucial in ensuring that
the data is clean, balanced, and well-structured, enabling the model to learn
effectively and make accurate predictions. This careful preprocessing lays a robust
foundation for applying machine learning algorithms, ultimately leading to better
diagnostic tools for erythemato-squamous diseases.
Loading the Data: The dataset is loaded from a CSV file into a Pandas DataFrame,
which is a common structure for handling tabular data in Python. This step is
fundamental, as it provides the raw data needed for analysis.
Exploration: The dataset is then explored to understand its structure. The unique
values in the target column (class) are identified to check the different categories of
erythemato-squamous diseases. The head() function displays the first few rows of the
dataset, giving an initial look at the data. The info() function provides a summary of
the dataset, including the data types of each column, the number of non-null entries,
and memory usage.
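A minimal sketch of these loading and exploration steps, assuming the file name data.csv and the target column name class used in this project:

# Load the dataset and take a first look at its structure
import pandas as pd

dataset = pd.read_csv("data.csv")
print(dataset['class'].unique())   # disease categories present in the target column
print(dataset.head())              # first few rows of the raw data
dataset.info()                     # column types, non-null counts and memory usage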
Missing Values Check: Missing values in the dataset are identified using
isnull().sum(), which counts the number of null entries in each column. Missing data
can pose a significant challenge, leading to biases or errors if not handled properly.
Replacing Missing Values: Any placeholders for missing values, such as '?', are
replaced with NaN (Not a Number) to standardize the dataset and make it easier to
handle missing data.
Dropping Missing Data: Rows containing missing values are removed using
dropna(). This step is crucial because most machine learning algorithms require a
complete dataset without any missing values. Dropping rows with missing values
ensures that the remaining dataset is clean and consistent.
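A short sketch of this cleaning step, assuming '?' is the only missing-value placeholder, as stated above:

# Standardise the missing-value placeholder and drop incomplete records
import numpy as np

print(dataset.isnull().sum())            # null counts per column before cleaning
dataset = dataset.replace('?', np.nan)   # treat the '?' placeholder as a proper missing value
dataset = dataset.dropna()               # keep only complete records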
Label Encoding: The target variable (class), which is categorical, is converted into
numerical values using LabelEncoder. This encoding is necessary because machine
learning models typically require numerical input. Each class is assigned a unique
integer value, making it easier for the model to process and differentiate between the
categories.
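A minimal sketch of the encoding step, assuming the target column is named class:

# Encode the categorical disease labels as integers
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
dataset['class'] = le.fit_transform(dataset['class'])
print(list(le.classes_))   # class name at index i corresponds to integer label i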
Defining Independent and Dependent Variables: The dataset is split into features
(independent variables) and the target (dependent variable). The features (X) include
    all columns except the last one, which is the target variable (y). This separation is
    essential for supervised learning tasks, where the model learns to map inputs (X) to
    outputs (y).
    Train-Test Split: The dataset is split into training and testing sets using
    train_test_split(). Typically, 80% of the data is used for training the model, while the
    remaining 20% is reserved for testing. This split is essential for evaluating the
    model’s performance on unseen data, providing a realistic assessment of how well the
    model is likely to perform in real-world scenarios.
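A brief sketch of the feature/target separation and the 80-20 split described above; the random_state value is an assumption added here for reproducibility.

# Separate features from the target and create an 80-20 train/test split
from sklearn.model_selection import train_test_split

X = dataset.iloc[:, :-1]   # all columns except the last one (features)
y = dataset.iloc[:, -1]    # last column: the encoded disease class
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
print(X_train.shape, X_test.shape)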
    A Decision Tree Classifier is a popular machine learning algorithm used for both
    classification and regression tasks. It works by splitting the dataset into subsets based
    on the value of the input features. The model creates a tree-like structure where each
    internal node represents a decision on a feature, each branch represents the outcome
    of the decision, and each leaf node represents a class label.
How It Works:
    Splitting Criteria: The tree is built by selecting the best features to split the data at
     each node. The goal is to make each subset as pure as possible, meaning that the
     data in each subset should ideally belong to the same class. Common metrics for
     selecting the best split include Gini Impurity and Information Gain.
    Tree Construction: The process continues recursively, splitting the data further
     until the tree reaches a predefined depth or until the data within a node is pure
     enough (i.e., all samples belong to the same class).
    Prediction: For a new input, the model traverses the tree by following the decisions
     made at each node, eventually arriving at a leaf node that gives the predicted class.
Advantages:
    Interpretability: Decision trees are easy to understand and interpret, making them
     useful for explaining predictions.
Limitations:
    Overfitting: Decision trees can easily overfit, especially if they are allowed to grow
     too deep, capturing noise rather than the underlying pattern in the data.
    Instability: Small changes in the data can lead to a completely different tree
     structure, making them less robust.
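To make the above concrete, here is a minimal sketch of training and evaluating a shallow decision tree with scikit-learn, assuming the training and test splits prepared earlier; the random_state value is an assumption.

# Train and evaluate a depth-limited decision tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt = DecisionTreeClassifier(max_depth=3, random_state=42)   # shallow tree to curb overfitting
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
print("Decision Tree accuracy:", accuracy_score(y_test, dt_pred))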
How It Works:
    Maximizing Margin: SVM selects the hyperplane that maximizes the margin
     between the two classes. This margin maximization helps in improving the model's
     ability to generalize to unseen data.
    Kernel Trick: SVM can be extended to work in non-linear spaces using a technique
     called the "kernel trick." The kernel function implicitly maps the input data into a
     higher-dimensional space where it becomes easier to separate the classes with a
     linear hyperplane.
Advantages:
 Flexibility: The kernel trick allows SVM to model complex non-linear relationships.
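A minimal sketch of training a linear-kernel SVM of the kind described in this research, assuming the same train/test splits as above:

# Train and evaluate a linear-kernel SVM
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svm = SVC(kernel='linear')          # linear kernel, as described for this project
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)
print("SVM accuracy:", accuracy_score(y_test, svm_pred))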
    In this research, the Support Vector Machine (SVM) and the Decision Tree
    Classifier are compared in terms of key performance metrics such as accuracy,
    precision, recall, and F1-score.
     Kernel Trick: Although a linear kernel is used in this research, SVMs can also employ
      non-linear kernels (the kernel trick) to find an optimal separating hyperplane even when the
      relationship between features is non-linear, something a shallow Decision Tree may not capture as effectively.
1. Improved Accuracy: SVMs are known for their ability to achieve higher accuracy,
   especially in complex datasets where linear decision boundaries may not suffice.
   The linear kernel used in this research helps in effective classification of linearly
   separable data.
5. Support Vector Importance: The model focuses on the support vectors, which are
   the most informative data points for classification, leading to a more efficient and
   precise model.
6. Scalability: SVMs are scalable with respect to the number of features and can be
   adapted for large datasets using techniques like kernel approximation.
7. Versatility: With different kernel functions, SVMs can be tailored to fit various data
   distributions and relationships, providing flexibility in modeling complex patterns.
  Overall, the proposed SVM model offers significant improvements over traditional
  methods, such as decision trees, by enhancing classification accuracy and handling
  complex datasets more effectively.
                                  CHAPTER 5
MACHINE LEARNING
Before we take a look at the details of various machine learning methods, let's start by
looking at what machine learning is, and what it isn't. Machine learning is often
categorized as a subfield of artificial intelligence, but I find that categorization can
often be misleading at first brush. The study of machine learning certainly arose from
research in this context, but in the data science application of machine learning
methods, it's more helpful to think of machine learning as a means of building models
of data.
At the most fundamental level, machine learning can be categorized into two main
types: supervised learning and unsupervised learning.
  Human beings, at this moment, are the most intelligent and advanced species on earth
  because they can think, evaluate, and solve complex problems. AI, on the other hand,
  is still in its initial stage and has not surpassed human intelligence in many aspects.
  The question, then, is why we need to make machines learn. The most suitable
  reason for doing this is, “to make decisions, based on data, with efficiency and scale”.
1. Quality of data − Having good-quality data for ML algorithms is one of the biggest
   challenges. Use of low-quality data leads to problems related to data
   preprocessing and feature extraction.
2. Time-consuming tasks − Another challenge faced by ML models is the time consumed,
   especially for data acquisition, feature extraction and retrieval.
 Emotion analysis
 Sentiment analysis
 Speech synthesis
 Speech recognition
 Customer segmentation
    Object recognition
    Fraud detection
 Fraud prevention
    Arthur Samuel coined the term “Machine Learning” in 1959 and defined it as a “Field
    of study that gives computers the capability to learn without being explicitly
    programmed”.
    And that was the beginning of Machine Learning! In modern times, Machine
    Learning is one of the most popular (if not the most!) career choices. According
    to Indeed, Machine Learning Engineer Is the Best Job of 2019 with a 344% growth
    and an average base salary of $146,085 per year.
    But there is still a lot of doubt about what exactly Machine Learning is and how to
    start learning it. So, this chapter deals with the basics of Machine Learning and also
    the path you can follow to eventually become a full-fledged Machine Learning
    Engineer. Now let’s get started!
    This is a rough roadmap you can follow on your way to becoming an insanely talented
    Machine Learning Engineer. Of course, you can always modify the steps according to
    your needs to reach your desired end-goal!
    In case you are a genius, you could start ML directly but normally, there are some
    prerequisites that you need to know which include Linear Algebra, Multivariate
    Calculus, Statistics, and Python. And if you don’t know these, never fear! You don’t
    need a Ph.D. degree in these topics to get started but you do need a basic
    understanding.
    Both Linear Algebra and Multivariate Calculus are important in Machine Learning.
    However, the extent to which you need them depends on your role as a data scientist.
    If you are more focused on application heavy machine learning, then you will not be
    that heavily focused on maths as there are many common libraries available. But if
    you want to focus on R&D in Machine Learning, then mastery of Linear Algebra and
    Multivariate Calculus is very important as you will have to implement many ML
    algorithms from scratch.
    Data plays a huge role in Machine Learning. In fact, around 80% of your time as an
    ML expert will be spent collecting and cleaning data. And statistics is a field that
    handles the collection, analysis, and presentation of data, so it is no surprise that you
    need to learn it!
    Some of the key concepts in statistics that are important are Statistical Significance,
    Probability Distributions, Hypothesis Testing, Regression, etc. Bayesian
    Thinking is also a very important part of ML, which deals with concepts like
    Conditional Probability, Priors and Posteriors, and Maximum Likelihood.
    Some people prefer to skip Linear Algebra, Multivariate Calculus and Statistics and
    learn them as they go along with trial and error. But the one thing that you absolutely
    cannot skip is Python! While there are other languages you can use for Machine
    Learning like R, Scala, etc. Python is currently the most popular language for ML. In
    fact, there are many Python libraries that are specifically useful for Artificial
    Intelligence and Machine Learning such as Keras, TensorFlow, Scikit-learn, etc.
    So, if you want to learn ML, it’s best if you learn Python! You can do that using
    various online resources and courses such as Fork Python available Free on
    GeeksforGeeks.
    Now that you are done with the prerequisites, you can move on to actually learning
    ML (Which is the fun part!!!) It’s best to start with the basics and then move on to the
    more complicated stuff. Some of the basic concepts in ML are:
    Target (Label) – A target variable or label is the value to be predicted by our model.
     For the fruit example discussed in the feature section, the label with each set of input
     would be the name of the fruit like apple, orange, banana, etc.
    Prediction – Once our model is ready, it can be fed a set of inputs to which it will
     provide a predicted output(label).
    Supervised Learning – This involves learning from a training dataset with labeled
     data using classification and regression models. This learning process continues
     until the required level of performance is achieved.
    Unsupervised Learning – This involves using unlabelled data and then finding the
     underlying structure in the data in order to learn more and more about the data itself
     using factor and cluster analysis models.
    Reinforcement Learning – This involves learning optimal actions through trial and
     error. So, the next action is decided by learning behaviors that are based on the
     current state and that will maximize the reward in the future.
    1. Easily identifies trends and patterns: Machine Learning can review large volumes
    of data and discover specific trends and patterns that would not be apparent to humans.
    For instance, for an e-commerce website like Amazon, it serves to understand the
browsing behaviors and purchase histories of its users to help cater to the right
products, deals, and reminders relevant to them. It uses the results to reveal relevant
advertisements to them.
2. No human intervention needed (automation): With ML, you don’t need to babysit
your research every step of the way. Since it means giving machines the ability to
learn, it lets them make predictions and also improve the algorithms on their own. A
common example of this is antivirus software, which learns to filter new threats as
they are recognized. ML is also good at recognizing spam.
1. Data Acquisition: Machine Learning requires massive data sets to train on, and
these should be inclusive/unbiased, and of good quality. There can also be times
where they must wait for new data to be generated.
2. Time and Resources: ML needs enough time to let the algorithms learn and develop
enough to fulfill their purpose with a considerable amount of accuracy and relevancy.
It also needs massive resources to function. This can mean additional requirements of
computer power for you.
SOFTWARE ENVIRONMENT
What is Python?
    Python language is being used by almost all tech-giant companies like – Google,
     Amazon, Facebook, Instagram, Dropbox, Uber… etc.
    The biggest strength of Python is huge collection of standard libraries which can be
    used for the following –
 Machine Learning
 Test frameworks
 Multimedia
Advantages of Python
1. Extensive Libraries
 Python downloads with an extensive library that contains code for various purposes
  like regular expressions, documentation generation, unit testing, web browsers,
  threading, databases, CGI, email, image manipulation, and more. So, we don’t have to
  write the complete code for that manually.
2. Extensible
 As we have seen earlier, Python can be extended to other languages. You can write
  some of your code in languages like C++ or C. This comes in handy, especially in
  projects.
3. Embeddable
4. Improved Productivity
5. IOT Opportunities
 Since Python forms the basis of new platforms like the Raspberry Pi, the future looks
  bright for the Internet of Things. This is a way to connect the language with the real
  world.
6. Simple
 When working with Java, you may have to create a class to print ‘Hello World’. But
  in Python, just a print statement will do. It is also quite easy to learn, understand,
  and code. This is why, when people pick up Python, they have a hard time adjusting to
  other more verbose languages like Java.
7. Readable
 Because it is not such a verbose language, reading Python is much like reading
  English. This is the reason why it is so easy to learn, understand, and code. It also
  does not need curly braces to define blocks, and indentation is mandatory. This
  further aids the readability of the code.
8. Object-Oriented
9. Free and Open-Source
  Like we said earlier, Python is freely available. But not only can you download
  Python for free, you can also download its source code, make changes to it, and
  even distribute it. It downloads with an extensive collection of libraries to help you
  with your tasks.
10. Portable
  When you code a project in a language like C++, you may need to make some
  changes to it if you want to run it on another platform. But it isn’t the same with
  Python. Here, you need to code only once, and you can run it anywhere. This is
  called Write Once Run Anywhere (WORA). However, you need to be careful enough
  not to include any system-dependent features.
11. Interpreted
  Lastly, we will say that it is an interpreted language. Since statements are executed
  one by one, debugging is easier than in compiled languages.
1. Less Coding
  Almost all of the tasks done in Python require less coding than when the same task is
  done in other languages. Python also has awesome standard library support, so you
  don’t have to search for any third-party libraries to get your job done. This is the
  reason many people suggest learning Python to beginners.
2. Affordable
Python is free, therefore individuals, small companies or big organizations can leverage
the freely available resources to build applications. Python is popular and widely used, so
it gives you better community support.
  The 2019 Github annual survey showed us that Python has overtaken Java in the most
  popular programming language category.
  Python code can run on any machine whether it is Linux, Mac or Windows.
  Programmers need to learn different languages for different jobs but with Python, you
  can professionally build web apps, perform data analysis and machine learning,
  automate things, do web scraping and also build games and powerful visualizations. It
  is an all-rounder programming language.
Disadvantages of Python
  So far, we’ve seen why Python is a great choice for your project. But if you choose it,
  you should be aware of its limitations as well. Let’s now see the downsides of
  choosing Python over another language.
1. Speed Limitations
  We have seen that Python code is executed line by line. But since Python is
  interpreted, it often results in slow execution. This, however, isn’t a problem unless
  speed is a focal point for the project. In other words, unless high speed is a
  requirement, the benefits offered by Python are enough to outweigh its speed
  limitations.
3. Design Restrictions
As you know, Python is dynamically-typed. This means that you don’t need to declare
the type of variable while writing the code. It uses duck-typing. But wait, what’s that?
Well, it just means that if it looks like a duck, it must be a duck. While this is easy on
the programmers during coding, it can raise run-time errors.
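A small, purely illustrative example of duck typing and the kind of run-time error it can raise; the class and function names here are hypothetical.

# Duck typing: any object with the expected method works; mismatches fail only at run time
class Duck:
    def quack(self):
        return "Quack!"

class Dog:
    def bark(self):
        return "Woof!"

def make_it_quack(animal):
    return animal.quack()        # no declared type; we simply assume a quack() method exists

print(make_it_quack(Duck()))     # works, because Duck "looks like a duck"
print(make_it_quack(Dog()))      # raises AttributeError at run time, not at compile time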
5. Simple
No, we’re not kidding. Python’s simplicity can indeed be a problem. Take my
example. I don’t do Java, I’m more of a Python person. To me, its syntax is so simple
that the verbosity of Java code seems unnecessary.
This was all about the Advantages and Disadvantages of Python Programming
Language.
History of Python
What do the alphabet and the programming language Python have in common? Right,
both start with ABC. If we are talking about ABC in the Python context, it's clear that
the programming language ABC is meant. ABC is a general-purpose programming
language and programming environment, which had been developed in the
Netherlands, Amsterdam, at the CWI (Centrum Wiskunde & Informatica). The
greatest achievement of ABC was to influence the design of Python. Python was
    conceptualized in the late 1980s. Guido van Rossum was working at that time at the CWI
    on a project called Amoeba, a distributed operating system. In an interview with Bill
    Venners, Guido van Rossum said: "In the early 1980s, I worked as an implementer
    on a team building a language called ABC at Centrum voor Wiskunde en Informatica
    (CWI). I don't know how well people know ABC's influence on Python. I try to
    mention ABC's influence because I'm indebted to everything I learned during that
    project and to the people who worked on it." Later on in the same interview, Guido
    van Rossum continued: "I remembered all my experience and some of my frustration
    with ABC. I decided to try to design a simple scripting language that possessed some
    of ABC's better properties, but without its problems. So, I started typing. I created a
    simple virtual machine, a simple parser, and a simple runtime. I made my own version
    of the various ABC parts that I liked. I created a basic syntax, used indentation for
    statement grouping instead of curly braces or begin-end blocks, and developed a small
    number of powerful data types: a hash table (or dictionary, as we call it), a list, strings,
    and numbers."
    Guido van Rossum published the first version of Python code (version 0.9.0) at
    alt.sources in February 1991. This release already included exception handling,
    functions, and the core data types of lists, dict, str and others. It was also object
    oriented and had a module system.
    Python version 1.0 was released in January 1994. The major new features included in
    this release were the functional programming tools lambda, map, filter and reduce,
    which Guido van Rossum never liked. Six and a half years later, in October 2000,
    Python 2.0 was introduced. This release included list comprehensions, a full garbage
    collector and support for Unicode. Python flourished for another eight years in the
    2.x versions before the next major release, Python 3.0 (also known as "Python
    3000" and "Py3K"). Python 3 is not backwards compatible with Python
    2.x. The emphasis in Python 3 was on the removal of duplicate programming
    constructs and modules, thus fulfilling or coming close to fulfilling the 13th law of the
    Zen of Python: "There should be one -- and preferably only one -- obvious way to do
    it." Some changes in Python 3.0:
 There is only one integer type left, i.e. int; the old long type has been merged into int.
    The division of two integers returns a float instead of an integer. "//" can be used to
     have the "old" behaviour.
Purpose
Python
    Python is Interactive − you can actually sit at a Python prompt and interact with the
     interpreter directly to write your programs.
    Python also acknowledges that speed of development is important. Readable and terse
    code is part of this, and so is access to powerful constructs that avoid tedious
    repetition of code. Maintainability also ties into this: lines of code may be an all but
    useless metric, but it does say something about how much code you have to scan,
    read and/or understand to troubleshoot problems or tweak behaviors. This speed of
    development, the ease with which a programmer of other languages can pick up basic
    Python skills, and the huge standard library are key to another area where Python excels.
    All its tools have been quick to implement, have saved a lot of time, and several of them
    have later been patched and updated by people with no Python background - without breaking.
TensorFlow
    TensorFlow is a free and open-source software library for dataflow and differentiable
    programming across a range of tasks. It is a symbolic math library and is also used
    for machine learning applications such as neural networks. It is used for both research
    and production at Google.
    TensorFlow was developed by the Google Brain team for internal Google use. It was
    released under the Apache 2.0 open-source license on November 9, 2015.
NumPy
    It is the fundamental package for scientific computing with Python. It contains various
    features including these important ones:
    Besides its obvious scientific uses, NumPy can also be used as an efficient multi-
    dimensional container of generic data. Arbitrary datatypes can be defined using
NumPy which allows NumPy to seamlessly and speedily integrate with a wide variety
of databases.
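A tiny illustration of this point, using a structured array with a custom dtype; the field names and values below are hypothetical.

# NumPy as a container of generic data: a structured array with an arbitrary dtype
import numpy as np

record_type = np.dtype([('age', np.int32), ('erythema', np.int32), ('scaling', np.int32)])
samples = np.array([(55, 2, 2), (8, 3, 3)], dtype=record_type)
print(samples['age'])    # fast, column-style access on the structured array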
Pandas
Matplotlib
Scikit-learn
There have been several updates to Python over the years. The question is
how to install Python. It might be confusing for a beginner who is willing to start
learning Python, but this tutorial will solve your query. The latest
version of Python at the time of writing is 3.7.4, in other words Python 3.
Note: Python version 3.7.4 cannot be used on Windows XP or earlier devices.
Before you start with the installation process of Python, you first need to know about
your system requirements. Based on your system type, i.e. operating system and
processor, you must download the appropriate Python version. My system type is
a Windows 64-bit operating system, so the steps below show how to install Python version
3.7.4 (Python 3) on a Windows 7 device. The steps on how to install Python on Windows 10, 8 and 7 are divided into 4
parts to help understand better.
Step 1: Go to the official site to download and install python using Google Chrome or
any other web browser. OR Click on the following link: https://www.python.org
Now, check for the latest and the correct version for your operating system.
Step 4: Scroll down the page until you find the Files option.
Step 5: Here you see the different versions of Python along with the operating system options.
    To download Windows 32-bit python, you can select any one from the three options:
     Windows x86 embeddable zip file, Windows x86 executable installer or Windows
     x86 web-based installer.
    To download Windows 64-bit python, you can select any one from the three options:
     Windows x86-64 embeddable zip file, Windows x86-64 executable installer or
     Windows x86-64 web-based installer.
    Here we will use the Windows x86-64 web-based installer. With this, the first part,
    deciding which version of Python to download, is complete. Now we move
    ahead with the second part of installing Python, i.e., the installation itself.
    Note: To know the changes or updates that are made in the version you can click on
    the Release Note Option.
Installation of Python
    Step 1: Go to Download and Open the downloaded python version to carry out the
    installation process.
Step 2: Before you click on Install Now, make sure to put a tick on Add Python 3.7 to
PATH.
Step 3: Click on Install Now. After the installation is successful, click on Close.
With these three steps of the Python installation, you have successfully and
correctly installed Python. Now it is time to verify the installation.
Step 4: Let us test whether Python is correctly installed. Type python -V and press
Enter.
Note: If you have any of the earlier versions of Python already installed. You must
first uninstall the earlier version and then install the new one.
Step 3: Click on IDLE (Python 3.7 64-bit) and launch the program
Step 4: To go ahead with working in IDLE you must first save the file. Click on File >
Click on Save
Step 5: Name the file, set “Save as type” to Python files, and click on SAVE. Here I
have named the file Hey World.
Step 6: Now, for example, enter print("Hey World") and press Enter.
You will see that the command given is launched. With this, we end our tutorial on
how to install Python. You have learned how to download python for windows into
your respective operating system.
Note: Unlike Java, Python does not need semicolons at the end of its statements.
                                    CHAPTER 7
SOURCE CODE
# Importing Libraries
import pandas as pd
import numpy as np
import joblib
import os
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings('ignore')
# Importing Dataset
dataset = pd.read_csv("data.csv")
dataset['class'].unique()
dataset.head()
dataset.info()
dataset.isnull().sum()
dataset = dataset.replace('?', np.nan)   # standardise the '?' placeholder as NaN, as described in the methodology
dataset.dropna(inplace=True)             # drop records with missing values
dataset.describe()
# Count plot of the class column before applying SMOTE
ax = dataset['class'].value_counts().plot(kind='bar')
for p in ax.patches:
    ax.annotate(str(int(p.get_height())), (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom', xytext=(0, 5), textcoords='offset points')
plt.xticks(rotation=90)
plt.show()
le= LabelEncoder()
dataset['class']=le.fit_transform(dataset['class'])
dataset
X=dataset.iloc[:,0:34]
y=dataset.iloc[:,-1]
# Applying SMOTE to balance the classes
smote = SMOTE(random_state=42)
X, y = smote.fit_resample(X, y)
# Count plot of the class column after applying SMOTE
ax = pd.Series(y).value_counts().plot(kind='bar')
for p in ax.patches:
    ax.annotate(str(int(p.get_height())), (p.get_x() + p.get_width() / 2, p.get_height()),
                ha='center', va='bottom', xytext=(0, 5), textcoords='offset points')
plt.xticks(rotation=90)
plt.show()
# Data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
X_train.shape
X_test.shape
# Building a ML Model: helper that evaluates a classifier and records its metrics
labels = list(le.classes_)        # original class names, used in the classification report
algorithm_names = []
precision = []
recall = []
fscore = []
accuracy = []

def calculateMetrics(algorithm, predict, testY):
    testY = testY.astype('int')
    predict = predict.astype('int')
    a = accuracy_score(testY, predict) * 100
    p = precision_score(testY, predict, average='macro') * 100
    r = recall_score(testY, predict, average='macro') * 100
    f = f1_score(testY, predict, average='macro') * 100
    algorithm_names.append(algorithm)
    accuracy.append(a)
    precision.append(p)
    recall.append(r)
    fscore.append(f)
    report = classification_report(testY, predict, target_names=labels)
    print(algorithm, "accuracy:", round(a, 2))
    print(report)
    # Plot the confusion matrix of the predictions
    cm = confusion_matrix(testY, predict)
    plt.imshow(cm)
    plt.colorbar()
    plt.ylabel('True class')
    plt.xlabel('Predicted class')
    plt.show()
# Decision Tree Classifier
if os.path.exists('Decision_trees_model.pkl'):
    # Load the trained model from the file
    clf = joblib.load('Decision_trees_model.pkl')
else:
    # Train a shallow tree and save it for reuse
    clf = DecisionTreeClassifier(max_depth=3)
    clf.fit(X_train, y_train)
    joblib.dump(clf, 'Decision_trees_model.pkl')
predict = clf.predict(X_test)
calculateMetrics("Decision Tree Classifier", predict, y_test)
# SVM Classifier
if os.path.exists('SVM_model.pkl'):
    clf = joblib.load('SVM_model.pkl')
else:
    # Train a linear-kernel SVM and save it for reuse
    clf = SVC(kernel='linear')
    clf.fit(X_train, y_train)
    joblib.dump(clf, 'SVM_model.pkl')
predict = clf.predict(X_test)
calculateMetrics("SVM Classifier", predict, y_test)
# Tabulating the performance of both models
columns = ['Algorithm Name', 'Precision', 'Recall', 'FScore', 'Accuracy']
values = []
for i in range(len(algorithm_names)):
    values.append([algorithm_names[i], precision[i], recall[i], fscore[i], accuracy[i]])
temp = pd.DataFrame(values, columns=columns)
temp
# Prediction on new samples
test = pd.read_csv("test.csv")
test
predict = clf.predict(test)
for i, p in enumerate(predict):
    print(test.iloc[i])   # print the row of clinical and microscopic features
    print(f"Row {i}: predicted class -> {labels[p]}")
test['predict'] = predict
test
                                   CHAPTER 8
RESULTS
      o   The dataset (data.csv) is loaded into a DataFrame using pandas. This dataset
          contains various clinical and microscopic features associated with different
          erythemato-squamous diseases.
      o   Initial exploration includes checking the unique values in the class column,
          examining the dataset's structure, and identifying missing values.
2. Data Preprocessing:
3. Data Balancing:
      o   The dataset is split into training and testing subsets using an 80-20 ratio. This
          separation allows the model to be trained on one part of the data and
          evaluated on another to assess its performance.
5. Model Training:
      o   Decision Tree Classifier: The model is initially trained using the Decision
          Tree Classifier with a maximum depth of 3. The trained model is saved to a
          file using joblib and reloaded if available. Performance metrics such as
          accuracy, precision, recall, and F1 score are computed.
6. Performance Evaluation:
      o   A separate test dataset (test.csv) is used to make predictions using the trained
          SVM model. The predictions are appended to the test dataset, and the results
          are printed and displayed for review.
       o   Versatility: The SVM's flexibility with different kernels allows for effective
           modeling of various data distributions.
  The dataset provided contains clinical and microscopic features used for the
  differential diagnosis of erythemato-squamous diseases. Each row represents a sample
  with various attributes, including both clinical and histopathological characteristics.
  The dataset is structured as follows:
Columns:
2. scaling: Presence and extent of scaling on the lesion (e.g., 0: absent, 1: mild, 2:
   moderate, 3: severe).
4. itching: Degree of itching associated with the lesion (e.g., 0: absent, 1: mild, 2:
   moderate, 3: severe).
35. class: Disease classification (e.g., psoriasis, lichen planus, seborrheic dermatitis,
   chronic dermatitis, pityriasis rosea, pityriasis rubra pilaris).
Figure 1 displays the initial five rows of the dataset, showcasing a snapshot of the raw
data used for the study. Each row represents a patient case with clinical and
microscopic features, and the columns include attributes such as age, family history,
and various microscopic findings. The 'class' column indicates the disease type, with
values like psoriasis and seboreic dermatitis. This figure provides an overview of the
structure and format of the data before any preprocessing steps, allowing us to
understand the types of features involved in diagnosing erythemato-squamous
diseases.
  Figure 2: Count plot of the various classes of the class column of the dataset before
                                    applying SMOTE.
Figure 2 illustrates the distribution of different classes within the dataset before
applying Synthetic Minority Over-sampling Technique (SMOTE). This count plot
highlights the class imbalance, with certain diseases like psoriasis having significantly
more samples compared to others like pityriasis rubra pilaris. The imbalance can
negatively impact model performance by biasing towards the majority class. This
visualization underscores the need for techniques like SMOTE to balance the classes
and ensure that the model learns effectively from all categories.
Figure 3: Count plot of the various classes of the class column of the dataset
                                after applying SMOTE.
Figure 3 depicts the class distribution after applying SMOTE, a technique used to
generate synthetic samples for minority classes to achieve a balanced dataset. Post-
SMOTE, each class has an equal number of samples, as reflected in the uniform
heights of the bars in the count plot. This balance is crucial for training the machine
learning models as it prevents bias towards any particular class, ensuring that the
model has sufficient examples to learn from each disease category. This figure
confirms the successful application of SMOTE to address class imbalance issues.
       Figure 4: Confusion matrix of Decision Tree Algorithm to the dataset.
Figure 4 presents the confusion matrix of the Decision Tree algorithm's performance
on the test dataset. The matrix provides a detailed breakdown of the model's
predictions against the actual labels, showing true positives, true negatives, false
positives, and false negatives for each class. Each cell in the matrix represents the
count of predictions made by the model for a given class. High values along the
diagonal indicate good predictive accuracy for those classes. This figure helps in
understanding the Decision Tree model's strengths and weaknesses in classifying the
various types of erythemato-squamous diseases.
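A hedged sketch of how a confusion matrix such as the one in Figure 4 can be computed and drawn, assuming seaborn is available and dt_pred holds the Decision Tree predictions on X_test from the earlier sketches:

# Compute and plot a confusion matrix for the Decision Tree predictions
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, dt_pred)            # rows: true classes, columns: predicted classes
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()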
Figure 6 presents the results of the model predictions on the test dataset. This figure
includes a table showing each test sample with its corresponding predicted disease
class. The predictions are compared against the actual class labels to evaluate the
model's performance. This comprehensive view of model output allows for detailed
examination of prediction accuracy, highlighting instances of correct classifications
and misclassifications. It provides a practical perspective on how the trained model
would perform in real-world diagnostic scenarios, validating its applicability and
reliability in clinical settings.
    Algorithm Name: This column indicates the name of the machine learning
     algorithm evaluated.
    Precision: Precision measures the accuracy of the positive predictions made by the
     model. The Support Vector Machine Classifier achieved a precision of 57.31%,
     while the Decision Tree Classifier achieved a significantly higher precision of
     99.07%.
    Recall: Recall indicates the ability of the model to correctly identify all relevant
     instances. The SVM Classifier had a recall of 66.67%, whereas the Decision Tree
     Classifier demonstrated superior recall with a score of 99.33%.
    F-Score: The F-Score is the harmonic mean of precision and recall, providing a
     single metric that balances both. The SVM Classifier's F-Score was 60.16%, while
     the Decision Tree Classifier had an outstanding F-Score of 99.18%.
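For reference, a short sketch of how these metrics can be computed with scikit-learn, assuming the test labels y_test and a vector of model predictions (here called dt_pred); macro averaging is assumed because this is a multi-class problem.

# Compute precision, recall, F-score and accuracy for one model's predictions
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

p = precision_score(y_test, dt_pred, average='macro') * 100
r = recall_score(y_test, dt_pred, average='macro') * 100
f = f1_score(y_test, dt_pred, average='macro') * 100
a = accuracy_score(y_test, dt_pred) * 100
print(f"Precision {p:.2f}%  Recall {r:.2f}%  F-Score {f:.2f}%  Accuracy {a:.2f}%")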
    The table highlights the stark contrast between the two models, with the Decision
    Tree Classifier significantly outperforming the SVM Classifier across all metrics.
    This suggests that the Decision Tree model is highly effective for this particular
    diagnostic task, providing accurate and reliable classifications for erythemato-
    squamous diseases. The superior performance of the Decision Tree Classifier could
    be attributed to its ability to handle the complex interactions among the clinical and
    microscopic features of the dataset, making it a preferred choice for this application.
                                   CHAPTER 9
9.1 Conclusion
The comparative analysis revealed that the Decision Tree Classifier significantly
outperformed the SVM Classifier in terms of precision, recall, F-score, and accuracy.
Specifically, the Decision Tree model achieved a near-perfect accuracy of 99.25%,
highlighting its capability to handle the intricate patterns within the dataset effectively.
This superior performance underscores the potential of decision tree algorithms in
medical diagnostics, where precision and recall are critical for patient outcomes.
Moreover, the study underscores the importance of data preprocessing steps, such as
handling missing values and applying SMOTE for class balancing, which are crucial
in enhancing the performance of machine learning models. The results demonstrated
that a well-preprocessed dataset, coupled with a robust algorithm, can provide reliable
and accurate diagnostic predictions.
The promising results of this study open several avenues for future research and
development in the field of machine learning-based medical diagnostics. Firstly, there
is potential for extending the dataset to include more diverse and comprehensive
clinical and microscopic features, which could further enhance the model's accuracy
and generalizability across different populations and subtypes of erythemato-
squamous diseases.
In summary, the future scope of this research is vast and multifaceted, with numerous
opportunities to advance the field of dermatological diagnostics through innovative
machine learning applications. By continuing to refine these models and expanding
their capabilities, we can move closer to achieving more accurate, efficient, and
personalized healthcare solutions.
                                 REFERENCES
[1] M. A. Alshamrani et al. "Deep Learning for Skin Disease Diagnosis: A Survey,"
IEEE Access, vol. 9, pp. 40771-40784, 2021.
[2] H. Xie et al. "A Convolutional Neural Network for the Classification of Skin
Disease Images," IEEE Transactions on Biomedical Engineering, vol. 67, no. 10, pp.
2920-2929, 2020.
[4] S. T. Ghosh et al. "An Intelligent System for Automated Diagnosis of Skin
Diseases Using Deep Learning," IEEE Transactions on Instrumentation and
Measurement, vol. 70, pp. 1-10, 2021.
[6] R. K. Sharma et al. "Machine Learning Approaches for the Classification of Skin
Lesions," IEEE Reviews in Biomedical Engineering, vol. 14, pp. 24-34, 2021.
[7] C. L. Lee et al. "Hybrid Deep Learning Models for Accurate Skin Disease
Classification," IEEE Transactions on Biomedical Circuits and Systems, vol. 14, no. 2,
pp. 274-283, 2020.
[8] L. Zhao et al. "Skin Disease Diagnosis with Generative Adversarial Networks and
Convolutional Neural Networks," IEEE Transactions on Computational Biology and
Bioinformatics, vol. 17, no. 4, pp. 1134-1143, 2020.
[9] A. R. Das et al. "Advanced Techniques for Skin Disease Detection Using Machine
Learning," IEEE Access, vol. 8, pp. 106469-106480, 2020.
[13] A. M. Collins et al. "Predictive Modeling for Skin Disease Classification Using
Machine Learning Algorithms," Computers in Biology and Medicine, vol. 129,
Article 104110, 2020.