K-Nearest Neighbors (K-NN) is a supervised machine learning algorithm used for classification and regression, characterized by its non-parametric, lazy learning, and instance-based nature. The algorithm involves selecting a value for K, computing distances to training data, and making predictions based on the nearest neighbors, with various distance metrics like Euclidean and Manhattan. While K-NN is simple and effective for small datasets, it can be computationally expensive and sensitive to noise, requiring optimizations like feature scaling and dimensionality reduction for better performance.
K-Nearest Neighbours
Dr. Eva Patel
Associate Professor
Advanced Computer Science and Engineering Department
Introduction
* K-Nearest Neighbors (K-NN) is a supervised machine learning algorithm that can
be used for both classification and regression tasks.
* Key Characteristics:
  * Non-parametric: It does not make assumptions about the data distribution.
  * Lazy Learning: No explicit training phase; there is no model-learning step, and all
    computation is deferred until prediction time.
  * Instance-Based: Stores all training examples and makes decisions based on
    similarity.
  * Distance-Based: Classifies or predicts based on the distance to the nearest
    neighbors.
The KNN Algorithm
* The K-NN algorithm follows these steps (a from-scratch sketch follows the list):
1. Choose the value of K (number of nearest neighbors).
2. Compute the distance between the new data point and all training data
points.
3. Sort the distances and select the K nearest neighbors.
4. For classification: Assign the most common class among the K neighbors.
5. For regression: Compute the average (or weighted average) of the values of
the K neighbors.
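The numbered steps above map almost line-for-line onto code. The following is a minimal,
illustrative from-scratch sketch for the classification case, assuming NumPy arrays,
Euclidean distance, and an unweighted majority vote; the function name knn_predict and the
toy data are hypothetical.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: compute Euclidean distances from the new point to all training points
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: sort the distances and take the indices of the K nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 4: classification by majority vote among the K neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Example usage on a tiny two-class toy dataset
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 2.5]), k=3))  # predicts class 0

For regression (step 5), the majority vote would be replaced by y_train[nearest].mean() or
a distance-weighted average.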
Distance Metrics in K-NN
* The choice of distance metric affects the performance of K-NN.
A. Euclidean Distance
* Most commonly used:
  d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}
* Suitable for continuous variables.
* Sensitive to different scales of data.
B. Manhattan Distance
  d(A, B) = \sum_{i=1}^{n} |A_i - B_i|
* Works well when data varies along axes (e.g., city-block distance).
Distance Metrics in K-NN
C. Minkowski Distance (Generalized Form)
  d(A, B) = \left( \sum_{i=1}^{n} |A_i - B_i|^p \right)^{1/p}
* When p = 2, it becomes Euclidean distance.
* When p = 1, it becomes Manhattan distance.
D. Cosine Similarity
  \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}
* Used for text or high-dimensional data.
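As a quick check of the four formulas above, the sketch below computes each metric with
NumPy for two small example vectors; the variable names and values are illustrative only.

import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 4.0, 6.0])

euclidean  = np.sqrt(np.sum((A - B) ** 2))          # sqrt(sum (A_i - B_i)^2)
manhattan  = np.sum(np.abs(A - B))                  # sum |A_i - B_i|
p = 3
minkowski  = np.sum(np.abs(A - B) ** p) ** (1 / p)  # (sum |A_i - B_i|^p)^(1/p)
cosine_sim = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))  # A.B / (||A|| ||B||)

print(euclidean, manhattan, minkowski, cosine_sim)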
Choosing the Right K Value
+ Small K (e.g., K = 1, K = 3): High variance, more sensitive to noise.
* Large K (e.g., K = 10, K = 20): High bias, smooth decision boundary.
* Common practice: Use odd values of K to avoid ties in classification.
* Rule of thumb: K ≈ √N (where N is the number of training examples); in practice, K is
  often tuned by cross-validation, as sketched below.
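Beyond these heuristics, K is usually tuned empirically. The sketch below shows one way to
do this with 5-fold cross-validation on the Iris data using scikit-learn; the variable
names are illustrative, and feature scaling is omitted here for brevity.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
k_values = range(1, 21, 2)   # odd values of K to avoid ties
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in k_values]
best_score, best_k = max(zip(scores, k_values))
print("Best K:", best_k, "with cross-validated accuracy:", best_score)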
Applications
* Handwritten digit recognition (e.g., the MNIST dataset)
* Recommendation systems (e.g., movie or product recommendations)
* Anomaly detection in cybersecurity
* Medical diagnosis (e.g., cancer detection)
* Credit scoring and fraud detection
Advantages and Disadvantages of K-NN
Advantages:
* Simple and intuitive
* No training phase (fast to set up)
* Works well with small datasets
* Can handle multi-class classification
* Can be used for both classification and regression
Disadvantages:
* Computationally expensive at prediction time (slow for large datasets)
* Sensitive to irrelevant features and noise
* Performance depends on the choice of distance metric
* Struggles with high-dimensional data (curse of dimensionality)
Optimizing K-NN
* Feature Scaling: Normalize data using Min-Max Scaling or Standardization
  (Z-score normalization).
* Dimensionality Reduction: Use PCA (Principal Component Analysis) to reduce the
  number of features.
* Weighted K-NN: Give closer neighbors more weight in decision-making (see the
  sketch after this list).
* Efficient Search Algorithms:
  * KD-Tree (for low-dimensional data)
  * Ball Tree (for higher dimensions)
  * Approximate Nearest Neighbors (ANN) methods
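Several of these optimizations are exposed as options of scikit-learn's
KNeighborsClassifier. The sketch below combines feature scaling, distance-weighted voting,
and KD-tree search on the Iris data; the particular pipeline is illustrative rather than a
prescribed setup.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

model = make_pipeline(
    StandardScaler(),                          # feature scaling (Z-score normalization)
    KNeighborsClassifier(n_neighbors=5,
                         weights='distance',   # weighted K-NN: closer neighbors count more
                         algorithm='kd_tree')  # KD-tree index for neighbor search
)
model.fit(X, y)
print(model.predict(X[:3]))   # predictions for the first three training samples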
K-NN in Python (using Scikit-Learn)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Feature scaling (important for distance-based algorithms)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train K-NN model
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Evaluate accuracy
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, y_pred))