FIND-S Algorithm Demonstration
This document explains and demonstrates the FIND-S algorithm, which is used to find the
most specific hypothesis that fits all the positive examples in a given dataset.
Step 1: Sample CSV Format
The following is a sample of training data stored in a CSV file named 'training_data.csv':
Sky           AirTemp       Humidity      Wind            Water       Forecast     EnjoySport
Sunny         Warm          Normal        Strong          Warm        Same         Yes
Sunny         Warm          High          Strong          Warm        Same         Yes
Rainy         Cold          High          Strong          Warm        Change       No
Sunny         Warm          High          Strong          Cool        Change       Yes
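When saved as 'training_data.csv', the same data is plain comma-separated text:
Sky,AirTemp,Humidity,Wind,Water,Forecast,EnjoySport
Sunny,Warm,Normal,Strong,Warm,Same,Yes
Sunny,Warm,High,Strong,Warm,Same,Yes
Rainy,Cold,High,Strong,Warm,Change,No
Sunny,Warm,High,Strong,Cool,Change,Yes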
Step 2: Python Code Implementation
import pandas as pd
def find_s_algorithm(df):
    # Keep only the positive examples (label "Yes" in the last column)
    positive_examples = df[df.iloc[:, -1] == "Yes"]
    # Initialize the hypothesis with the attribute values of the first positive example
    hypothesis = positive_examples.iloc[0, :-1].tolist()
    # Generalize: replace any attribute that disagrees with a positive example by '?'
    for _, row in positive_examples.iterrows():
        for i in range(len(hypothesis)):
            if hypothesis[i] != row.iloc[i]:
                hypothesis[i] = "?"
    return hypothesis
# Load data from CSV file
file_path = "training_data.csv" # Update with your actual path
data = pd.read_csv(file_path)
# Apply FIND-S
final_hypothesis = find_s_algorithm(data)
print("Final hypothesis:", final_hypothesis)
Step 3: Output Example
Final hypothesis: ['Sunny', 'Warm', '?', 'Strong', '?', '?']
This indicates the most specific hypothesis that fits all positive examples.
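In words: the learned concept predicts EnjoySport = Yes whenever Sky is Sunny, AirTemp is Warm, and Wind is Strong, while Humidity, Water, and Forecast are left unconstrained ('?').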
Candidate Elimination Algorithm - Detailed Explanation
Overview:
The Candidate-Elimination algorithm is a supervised learning algorithm used in Concept
Learning. It aims to find all hypotheses consistent with a given training dataset. It maintains a
Version Space, defined by the most specific hypothesis (S) and the most general hypothesis
(G).
Sample Training Data:
The following dataset is used for demonstration:
Sky           AirTemp        Humidity      Wind          Water       Forecast      EnjoySport
Sunny         Warm           Normal        Strong        Warm        Same          Yes
Sunny         Warm           High          Strong        Warm        Same          Yes
Rainy         Cold           High          Strong        Warm        Change        No
Sunny         Warm           High          Strong        Cool        Change        Yes
Algorithm Steps:
1. Initialize S to the first positive example.
2. Initialize G to the most general hypothesis: all fields are '?'
3. For each training example:
   - If the example is positive, generalize S minimally so that it covers the example,
     and remove from G any hypothesis that no longer covers the example.
   - If the example is negative, specialize the hypotheses in G minimally so that they
     exclude the example, keeping only specializations that remain consistent with
     (i.e. more general than) S.
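These boundary updates can be written as a short Python sketch (assuming the same pandas DataFrame layout as the FIND-S example above, with the 'Yes'/'No' label in the last column). It is a simplified version that omits some of the full algorithm's generality checks, but on the dataset above it reproduces the S and G boundaries listed under Final Output:
import pandas as pd

def candidate_elimination(df):
    X = df.iloc[:, :-1].values
    y = df.iloc[:, -1].values
    n = X.shape[1]
    # S: most specific boundary, initialized to the first positive example
    first_pos = next(i for i, label in enumerate(y) if label == "Yes")
    S = list(X[first_pos])
    # G: most general boundary, initialized to the all-'?' hypothesis
    G = [["?"] * n]
    for x, label in zip(X, y):
        if label == "Yes":
            # Generalize S minimally so that it covers x
            for i in range(n):
                if S[i] != x[i]:
                    S[i] = "?"
            # Drop members of G that no longer cover this positive example
            G = [g for g in G if all(g[i] in ("?", x[i]) for i in range(n))]
        else:
            # Specialize each member of G just enough to exclude x,
            # using only attribute values taken from S
            new_G = []
            for g in G:
                for i in range(n):
                    if g[i] == "?" and S[i] != "?" and S[i] != x[i]:
                        specialized = list(g)
                        specialized[i] = S[i]
                        new_G.append(specialized)
            G = new_G
    return S, G

data = pd.read_csv("training_data.csv")
S, G = candidate_elimination(data)
print("Specific Hypothesis (S):", S)
print("General Hypotheses (G):", G)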
Final Output:
After processing all the examples, we obtain the following:
Specific Hypothesis (S): ['Sunny', 'Warm', '?', 'Strong', '?', '?']
General Hypotheses (G):
[['Sunny', '?', '?', '?', '?', '?'], ['?', 'Warm', '?', '?', '?', '?']]
Explanation:
- The Specific Hypothesis (S) is the most specific hypothesis that covers all positive
examples.
- The General Hypotheses (G) are the most general boundaries that are consistent with all
positive and negative examples.
- Together, S and G represent the version space of all consistent hypotheses.
ID3 Decision Tree Algorithm - Implementation & Demonstration
Overview:
The ID3 (Iterative Dichotomiser 3) algorithm is a decision tree learning algorithm used for
classification.
It builds the tree by choosing the attribute that yields the highest information gain at each
step.
It is typically used with categorical data.
Dataset:
The following dataset is used to demonstrate the ID3 algorithm (Play Tennis dataset):
Outlook      Temperature   Humidity   Wind     PlayTennis
Sunny        Hot           High       Weak     No
Sunny        Hot           High       Strong   No
Overcast     Hot           High       Weak     Yes
Rain         Mild          High       Weak     Yes
Rain         Cool          Normal     Weak     Yes
Rain         Cool          Normal     Strong   No
Overcast     Cool          Normal     Strong   Yes
Sunny        Mild          High       Weak     No
Sunny        Cool          Normal     Weak     Yes
Rain         Mild          Normal     Weak     Yes
Sunny        Mild          Normal     Strong   Yes
Overcast     Mild          High       Strong   Yes
Overcast     Hot           Normal     Weak     Yes
Rain         Mild          High       Strong   No
Python Code:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
# Load dataset
data = pd.read_csv("play_tennis.csv")
# Encode categorical values, keeping one fitted encoder per column for later reuse
encoders = {}
for column in data.columns:
    encoders[column] = LabelEncoder()
    data[column] = encoders[column].fit_transform(data[column])
# Train model
X = data.drop("PlayTennis", axis=1)
y = data["PlayTennis"]
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)
# Visualize tree
plot_tree(clf, feature_names=X.columns, class_names=["No", "Yes"], filled=True)
plt.show()
# Predict a new sample, encoding it with the same encoders fitted above
sample = pd.DataFrame([["Sunny", "Cool", "High", "Strong"]],
                      columns=["Outlook", "Temperature", "Humidity", "Wind"])
sample_encoded = sample.apply(lambda col: encoders[col.name].transform(col))
print("Prediction:", clf.predict(sample_encoded))
Explanation:
The above code approximates the ID3 algorithm using sklearn's DecisionTreeClassifier with entropy as the splitting criterion; scikit-learn's tree is CART-based (binary splits), but selecting splits by entropy mirrors ID3's information-gain criterion.
It trains a model on the Play Tennis dataset, visualizes the decision tree, and predicts the
output for a new sample.
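Since scikit-learn hides the attribute-selection computation, the following minimal sketch computes entropy and information gain directly from the Play Tennis table above (reusing the same play_tennis.csv file; the helper names entropy and information_gain are illustrative). Outlook should come out with the highest gain, which is why it sits at the root of the learned tree.
import pandas as pd
import numpy as np

def entropy(labels):
    # H(S) = -sum over classes of p * log2(p)
    probs = labels.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def information_gain(df, attribute, target="PlayTennis"):
    # Gain(S, A) = H(S) - sum over values v of (|S_v| / |S|) * H(S_v)
    weighted = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(attribute)
    )
    return entropy(df[target]) - weighted

data = pd.read_csv("play_tennis.csv")
for attr in ["Outlook", "Temperature", "Humidity", "Wind"]:
    print(attr, round(information_gain(data, attr), 3))
# Expected: Outlook has the largest gain (about 0.247) on this 14-row dataset.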
Artificial Neural Network using Backpropagation - Implementation Guide
Overview:
An Artificial Neural Network (ANN) is a computational model inspired by the human brain.
The backpropagation algorithm is used to train the ANN by minimizing the error between the
predicted and actual output through gradient descent.
Dataset:
We use a simple dataset such as the XOR function to demonstrate the working of ANN with
backpropagation.
Input1                         Input2                       Output
0                              0                            0
0                              1                            1
1                              0                            1
1                              1                            0
Python Code (using Keras):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
# XOR dataset
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([[0],[1],[1],[0]])
# Build ANN model
model = Sequential()
model.add(Dense(4, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train model
model.fit(X, y, epochs=1000, verbose=0)
# Test predictions
predictions = model.predict(X)
print("Predictions:\n", predictions)
Explanation:
The model consists of an input layer, one hidden layer with 4 neurons, and an output layer.
The activation function 'relu' is used in the hidden layer and 'sigmoid' in the output layer.
Backpropagation is used internally by Keras to adjust weights and minimize error using the
'adam' optimizer.
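To make the weight updates visible rather than hidden inside Keras, here is a minimal from-scratch sketch of backpropagation on the same XOR data, using one sigmoid hidden layer and plain batch gradient descent (the hidden-layer size, learning rate, epoch count, and random seed are illustrative choices and differ from the Keras model above):
import numpy as np

np.random.seed(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights and biases for a 2-4-1 network
W1 = np.random.randn(2, 4); b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1); b2 = np.zeros((1, 1))
lr = 0.5

for epoch in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: error signals for the output and hidden layers
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(np.round(out, 3))  # should end up close to [0, 1, 1, 0]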
Naïve Bayes Classifier - Implementation & Accuracy Evaluation
Overview:
The Naïve Bayes classifier is a probabilistic classifier based on Bayes’ Theorem with strong
independence assumptions between features.
It is simple, fast, and effective for many classification problems.
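Concretely, for a sample with feature values x1, ..., xn, the classifier predicts the class c that maximizes P(c) * P(x1 | c) * ... * P(xn | c), where the prior P(c) and each conditional probability P(xi | c) are estimated from counts in the training data.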
Dataset:
We assume a CSV file containing labeled data for training and testing. For demonstration,
let’s consider a Play Tennis dataset.
Outlook      Temperature   Humidity   Wind     PlayTennis
Sunny        Hot           High       Weak     No
Sunny        Hot           High       Strong   No
Overcast     Hot           High       Weak     Yes
Rain         Mild          High       Weak     Yes
Rain         Cool          Normal     Weak     Yes
Rain         Cool          Normal     Strong   No
Overcast     Cool          Normal     Strong   Yes
Sunny        Mild          High       Weak     No
Sunny        Cool          Normal     Weak     Yes
Rain         Mild          Normal     Weak     Yes
Python Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
# Load dataset
data = pd.read_csv("play_tennis.csv")
# Encode categorical data
label_encoders = {}
for col in data.columns:
   le = LabelEncoder()
   data[col] = le.fit_transform(data[col])
   label_encoders[col] = le
# Split into train and test sets
X = data.drop("PlayTennis", axis=1)
y = data["PlayTennis"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train Naive Bayes model
# min_categories tells CategoricalNB how many categories each feature has,
# so a category that appears only in the test split does not cause an error
model = CategoricalNB(min_categories=[len(label_encoders[c].classes_) for c in X.columns])
model.fit(X_train, y_train)
# Predict and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
Explanation:
The code uses sklearn's `CategoricalNB` to train the Naïve Bayes classifier on categorical
data.
The dataset is encoded, split into training and test sets, and the model is trained and evaluated
for accuracy.
Document Classification using Naïve Bayes in Java
Overview:
This document outlines the implementation of a Naïve Bayes Classifier in Java using the
Weka API to classify a set of documents.
The model computes Accuracy, Precision, and Recall to evaluate its performance.
Assumptions:
- The dataset is in ARFF format (Weka-compatible) and contains labeled document data.
- The Java program uses Weka’s NaiveBayes class.
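For reference, a Weka-compatible ARFF file for this task might look like the sketch below; the attribute names and rows are purely illustrative (a real documents.arff would typically hold word-frequency features, for example produced by Weka's StringToWordVector filter), with a nominal class attribute in the last position:
@relation documents
@attribute word_offer numeric
@attribute word_meeting numeric
@attribute class {0, 1}

@data
3, 0, 1
0, 2, 0
1, 1, 1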
Java Code:
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.Evaluation;
import java.util.Random;
public class DocumentClassifier {
   public static void main(String[] args) throws Exception {
        // Load dataset
        DataSource source = new DataSource("documents.arff");
        Instances data = source.getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        // Train Naive Bayes model
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);
        // Evaluate with 10-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));
        // Output metrics
        System.out.println("Accuracy: " + eval.pctCorrect());
        System.out.println("Precision: " + eval.precision(1));
        System.out.println("Recall: " + eval.recall(1));
    }
}
Explanation:
This Java program loads a document dataset, builds a Naïve Bayes classifier using Weka,
and evaluates its performance with 10-fold cross-validation. It prints Accuracy, Precision, and
Recall for class '1'.
Bayesian Network for Heart Disease Diagnosis
Overview:
This program demonstrates how to construct and use a Bayesian Network to diagnose heart
disease using a medical dataset.
The model is implemented in Python using the `pgmpy` library.
Dataset:
We use a simplified version of the UCI Heart Disease dataset with attributes such as Age,
Sex, ChestPainType, Cholesterol, and HeartDisease.
Python Code (using pgmpy):
import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination
# Sample medical dataset
data = pd.DataFrame([
    [63, 1, 'typical', 233, 1],
    [37, 1, 'non-anginal', 250, 1],
    [41, 0, 'atypical', 204, 0],
    [56, 1, 'asymptomatic', 236, 1],
    [57, 0, 'typical', 354, 0],
], columns=["Age", "Sex", "ChestPainType", "Cholesterol", "HeartDisease"])
# Convert categorical values to integer codes
# (alphabetical order: asymptomatic=0, atypical=1, non-anginal=2, typical=3)
data["ChestPainType"] = data["ChestPainType"].astype('category').cat.codes
# Define Bayesian Network structure
model = BayesianNetwork([
     ("Age", "HeartDisease"),
     ("Sex", "HeartDisease"),
     ("ChestPainType", "HeartDisease"),
     ("Cholesterol", "HeartDisease")
])
# Learn CPDs using Maximum Likelihood Estimator
model.fit(data, estimator=MaximumLikelihoodEstimator)
# Perform inference: query P(HeartDisease) for a patient with the given evidence
# (ChestPainType=3 corresponds to 'typical' under the encoding above)
inference = VariableElimination(model)
result = inference.query(variables=["HeartDisease"],
                         evidence={"Age": 56, "Sex": 1, "ChestPainType": 3, "Cholesterol": 236})
print(result)
Explanation:
This code builds a Bayesian Network with connections from predictors to the target variable
(HeartDisease).
The `MaximumLikelihoodEstimator` learns the CPDs from the data.
The `VariableElimination` module performs inference to diagnose if a patient with given
attributes has heart disease.
Clustering with EM and K-Means Algorithms - Implementation & Comparison
Overview:
This program applies the Expectation-Maximization (EM) algorithm and the K-Means
algorithm to cluster a dataset stored in a CSV file.
It uses Python's scikit-learn library for implementation and compares the results using
silhouette score for clustering quality.
Python Code:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
# Load dataset (assumes all columns in data.csv are numeric features)
data = pd.read_csv("data.csv")
X = StandardScaler().fit_transform(data)
# K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
kmeans_score = silhouette_score(X, kmeans_labels)
print("K-Means Silhouette Score:", kmeans_score)
# EM Clustering using Gaussian Mixture Model
gmm = GaussianMixture(n_components=3, random_state=42)
gmm_labels = gmm.fit_predict(X)
gmm_score = silhouette_score(X, gmm_labels)
print("EM Silhouette Score:", gmm_score)
Explanation:
Both K-Means and EM algorithms aim to group similar data points together:
- K-Means partitions the dataset into clusters by minimizing the sum of squared distances.
- EM uses a probabilistic model (Gaussian Mixture) to estimate the likelihood of data points
belonging to clusters.
The silhouette score measures how well data points fit into their assigned clusters. Higher
scores indicate better-defined clusters.
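For reference, the silhouette value of a single point i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points of its own cluster and b(i) is the mean distance from i to the points of the nearest other cluster; the reported score is the average of s(i) over all points and lies between -1 and 1.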
Comparison and Conclusion:
- If the silhouette score for EM is higher than K-Means, EM provides better clustering,
especially when data clusters are elliptical.
- If K-Means performs similarly or better, it suggests that the clusters are spherical and well-
separated.
- Depending on the data shape, EM may handle overlapping clusters better than K-Means.
K-Nearest Neighbour (KNN) Algorithm - Iris Dataset Classification
Overview:
The K-Nearest Neighbour (KNN) algorithm is a simple, non-parametric method used for
classification and regression.
This implementation uses Python's scikit-learn library to classify the Iris dataset.
It prints both correct and incorrect predictions.
Python Code (using sklearn):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# Predict and evaluate
predictions = knn.predict(X_test)
for i in range(len(y_test)):
    actual = target_names[y_test[i]]
    predicted = target_names[predictions[i]]
    status = "Correct" if y_test[i] == predictions[i] else "Incorrect"
    print(f"Sample {i+1}: Actual = {actual}, Predicted = {predicted} --> {status}")
Explanation:
- The KNN algorithm classifies test samples based on the majority vote of the k-nearest
training examples.
- The Iris dataset contains three classes of 50 instances each, where each class refers to a type
of iris plant.
- The script prints the prediction results for each test sample and indicates whether each
prediction was correct or incorrect.
Locally Weighted Regression (LWR) - Implementation & Visualization
Overview:
Locally Weighted Regression (LWR) is a non-parametric algorithm used to fit a curve through
data points.
It computes weights for each training point depending on its distance to the query point and
fits a regression using weighted least squares.
Python Code Summary:
- `kernel`: Defines the weight for each training point based on Gaussian kernel.
- `predict`: Computes the regression prediction using weighted least squares.
- Data: 100 points generated from `sin(x)` with added noise.
- Bandwidth (`tau`) determines how 'local' the fit is.
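Since the code itself is not reproduced above, the following is a minimal sketch consistent with that summary (the exact noise level, bandwidth value, and x-range are illustrative choices):
import numpy as np
import matplotlib.pyplot as plt

def kernel(x_query, X, tau):
    # Gaussian weights: training points near the query get weights close to 1
    return np.exp(-((X - x_query) ** 2) / (2 * tau ** 2))

def predict(x_query, X, y, tau):
    # Weighted least squares fit of a local line around the query point
    w = kernel(x_query, X, tau)
    A = np.c_[np.ones_like(X), X]   # design matrix [1, x]
    W = np.diag(w)
    theta = np.linalg.pinv(A.T @ W @ A) @ (A.T @ W @ y)
    return theta[0] + theta[1] * x_query

# 100 points generated from sin(x) with added Gaussian noise
np.random.seed(0)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + np.random.normal(scale=0.2, size=X.shape)

tau = 0.5   # bandwidth: smaller values make the fit more local
y_pred = np.array([predict(x, X, y, tau) for x in X])

plt.scatter(X, y, color="red", s=10, label="Noisy data")
plt.plot(X, y_pred, color="blue", label="LWR fit")
plt.legend()
plt.show()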
Graphical Output:
The plot produced by this code shows the noisy data (red) and the smooth curve fitted using LWR (blue).