1. The Pinhole Perspective Imaging Model
Imagine a completely dark box with a tiny hole in one of its sides. If you place an object in front
of this pinhole, an inverted image of the object will form on the opposite side of the box. This
simple setup is the essence of the pinhole perspective imaging model.
Here's a breakdown of its key aspects:
   ● Pinhole as the Center of Projection: The tiny hole acts as the center from which all light
       rays originating from the object pass through. This single point of projection is a
       fundamental characteristic of the model.
   ● Straight Line Projection: Light rays are assumed to travel in straight lines from the
       object, through the pinhole, and onto the image plane. This linear projection simplifies the
       geometry significantly.
   ● Inverted Image Formation: As the light rays cross at the pinhole, the image formed on
       the image plane is inverted both horizontally and vertically with respect to the object.
   ● Perspective Effect: Objects farther away from the pinhole appear smaller in the image,
       while closer objects appear larger. This is the perspective effect that our eyes and most
       cameras naturally exhibit.
   ● No Lenses: The ideal pinhole model doesn't involve any lenses. The pinhole itself
       restricts the light rays, creating a focused image (in theory, with an infinitely small
       pinhole).
   ● Image Plane: The surface where the image is formed is called the image plane. In the
       simplest model, this plane is assumed to be flat and perpendicular to the optical axis (the
       line passing through the pinhole and the center of the image plane).
In essence, the pinhole camera model provides a simplified yet powerful geometric
framework for understanding how a 3D scene is projected onto a 2D image plane. It forms
the basis for many concepts in computer vision and graphics, even though real cameras use
lenses to gather more light and focus the image more effectively.
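Under this model, a 3D point (X, Y, Z) in camera coordinates projects to image coordinates (x, y) = (f X / Z, f Y / Z), where f is the distance from the pinhole to the image plane (using the common convention of a virtual, non-inverted image plane in front of the pinhole). A minimal NumPy sketch of this projection, with an illustrative focal length f:

```python
import numpy as np

def pinhole_project(points_3d, f=1.0):
    """Project 3D camera-frame points onto a virtual image plane at distance f.

    points_3d: (N, 3) array of (X, Y, Z) with Z > 0 (in front of the pinhole).
    Returns an (N, 2) array of image coordinates (x, y) = (f*X/Z, f*Y/Z).
    """
    points_3d = np.asarray(points_3d, dtype=float)
    X, Y, Z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([f * X / Z, f * Y / Z], axis=1)

# A point twice as far away projects to half the size (the perspective effect).
print(pinhole_project([[1.0, 1.0, 2.0], [1.0, 1.0, 4.0]], f=1.0))
# [[0.5  0.5 ]
#  [0.25 0.25]]
```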
2. Transforming Between RGB and CIE XYZ
The transformation between RGB color spaces and the CIE XYZ color space is indeed a linear
transformation. This means that each component of one color space is a linear combination of
the components of the other color space.
Let's represent the RGB color vector as \begin{bmatrix} R \\ G \\ B \end{bmatrix} and the CIE
XYZ color vector as \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}.
RGB to CIE XYZ:
The transformation from RGB to CIE XYZ can be expressed as:
\qquad \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \mathbf{M}_{RGB \to XYZ} \begin{bmatrix} R \\
G \\ B \end{bmatrix} = \begin{bmatrix} M_{11} & M_{12} & M_{13} \\ M_{21} & M_{22} & M_{23}
\\ M_{31} & M_{32} & M_{33} \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}
Where M_{ij} are the elements of the 3 \times 3 transformation matrix \mathbf{M}_{RGB \to
XYZ}. These elements are derived from the color matching functions of the specific RGB color
space (e.g., sRGB, Adobe RGB) and the CIE color matching functions. Each row of the matrix
essentially represents how the R, G, and B primaries contribute to the X, Y, and Z tristimulus
values, respectively. For instance:
\qquad X = M_{11}R + M_{12}G + M_{13}B \qquad Y = M_{21}R + M_{22}G + M_{23}B \qquad
Z = M_{31}R + M_{32}G + M_{33}B
CIE XYZ to RGB:
Similarly, the transformation from CIE XYZ back to RGB is also a linear transformation,
represented by the inverse of the \mathbf{M}_{RGB \to XYZ} matrix:
\qquad \begin{bmatrix} R \\ G \\ B \end{bmatrix} = \mathbf{M}_{XYZ \to RGB} \begin{bmatrix} X \\
Y \\ Z \end{bmatrix} = \mathbf{M}_{RGB \to XYZ}^{-1} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} =
\begin{bmatrix} M'_{11} & M'_{12} & M'_{13} \\ M'_{21} & M'_{22} & M'_{23} \\ M'_{31} & M'_{32}
& M'_{33} \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}
Where M'_{ij} are the elements of the inverse transformation matrix \mathbf{M}_{XYZ \to RGB}.
These elements determine how the X, Y, and Z tristimulus values combine to produce the R, G,
and B color components. For example:
\qquad R = M'_{11}X + M'_{12}Y + M'_{13}Z \qquad G = M'_{21}X + M'_{22}Y + M'_{23}Z
\qquad B = M'_{31}X + M'_{32}Y + M'_{33}Z
The specific numerical values of the matrix elements depend on the chosen RGB color space's
primaries and white point.
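As an illustrative sketch (not part of the derivation above): for linear sRGB with a D65 white point, the widely published matrix values can be used directly. This assumes linear RGB components in [0, 1], i.e., the sRGB gamma encoding has already been removed.

```python
import numpy as np

# Standard linear-sRGB (D65) -> CIE XYZ matrix; other RGB spaces use different values.
M_RGB_TO_XYZ = np.array([
    [0.4124564, 0.3575761, 0.1804375],
    [0.2126729, 0.7151522, 0.0721750],
    [0.0193339, 0.1191920, 0.9503041],
])
M_XYZ_TO_RGB = np.linalg.inv(M_RGB_TO_XYZ)

def rgb_to_xyz(rgb):
    """rgb: (..., 3) linear RGB values in [0, 1] (gamma already removed)."""
    return np.asarray(rgb) @ M_RGB_TO_XYZ.T

def xyz_to_rgb(xyz):
    return np.asarray(xyz) @ M_XYZ_TO_RGB.T

white = rgb_to_xyz([1.0, 1.0, 1.0])   # approximately the D65 white point (0.9505, 1.0000, 1.0889)
print(white, xyz_to_rgb(white))       # round-trips back to [1, 1, 1]
```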
3. Separable Convolution
Let's consider an image I(x, y) of size N \times N and a discrete, separable 2D filter kernel K(x,
y) of size (2k + 1) \times (2k + 1). Separability means that the 2D kernel can be expressed as
the outer product of two 1D kernels: a horizontal (row) kernel K_h(x) of size 1 \times (2k + 1) and a vertical (column) kernel K_v(y) of size (2k + 1) \times 1.
\qquad K(x, y) = K_v(y) K_h(x)
Direct 2D Convolution:
To compute the convolution of the image with the 2D kernel at each pixel (i, j), we perform the
following summation:
\qquad (I * K)(i, j) = \sum_{x=-k}^{k} \sum_{y=-k}^{k} I(i - x, j - y) K(x, y)
For each output pixel, this involves (2k + 1) \times (2k + 1) multiplications and approximately the
same number of additions. Since there are N \times N output pixels, the total number of
operations is roughly N^2 (2k + 1)^2 multiplications and N^2 (2k + 1)^2 additions.
Convolution with Two 1D Kernels:
Using the separability property, we can perform the convolution in two steps:
   1. Convolve the image with the 1D horizontal kernel K_h(x) along each row: \qquad I'(i, j) = \sum_{x=-k}^{k} I(i - x, j) K_h(x) For each of the N \times N pixels, this requires (2k + 1) multiplications and 2k additions, so this step costs approximately N^2 (2k + 1) multiplications and N^2 (2k) additions.
   2. Convolve the intermediate result I'(i, j) with the 1D vertical kernel K_v(y) along each column: \qquad (I * K)(i, j) = \sum_{y=-k}^{k} I'(i, j - y) K_v(y) Again, for each of the N \times N pixels, this requires (2k + 1) multiplications and 2k additions, so this step also costs approximately N^2 (2k + 1) multiplications and N^2 (2k) additions.
Total Operations for Separable Convolution:
The total number of operations for the separable convolution is approximately 2 \times N^2 (2k +
1) multiplications and 2 \times N^2 (2k) additions.
Estimate of Operations Saved:
The number of multiplications saved is approximately:
\qquad N^2 (2k + 1)^2 - 2 N^2 (2k + 1) = N^2 (2k + 1) [(2k + 1) - 2] = N^2 (2k + 1) (2k - 1) = N^2
(4k^2 - 1)
The number of additions saved is approximately:
\qquad N^2 (2k + 1)^2 - 2 N^2 (2k) = N^2 [(4k^2 + 4k + 1) - 4k] = N^2 (4k^2 + 1)
For larger kernel sizes (k > 1), the savings in the number of operations by using separable
convolution can be significant, as the complexity reduces from O(N^2 k^2) to O(N^2 k).
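A minimal NumPy/SciPy sketch illustrating the equivalence: the same output is obtained from one 2D convolution or from two 1D passes (the 5 \times 5 binomial kernel below is an illustrative choice).

```python
import numpy as np
from scipy.ndimage import convolve, convolve1d

rng = np.random.default_rng(0)
image = rng.random((256, 256))

# Example separable kernel: 5x5 binomial, built as the outer product of two 1D kernels.
k1d = np.array([1.0, 4.0, 6.0, 4.0, 1.0])
k1d /= k1d.sum()
k2d = np.outer(k1d, k1d)                                   # K(x, y) = K_v(y) K_h(x)

direct = convolve(image, k2d, mode='reflect')              # direct 2D convolution
rows = convolve1d(image, k1d, axis=1, mode='reflect')      # 1D pass along each row
separable = convolve1d(rows, k1d, axis=0, mode='reflect')  # 1D pass along each column

print(np.allclose(direct, separable))  # True: same result, far fewer operations
```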
4. Convolution with Delta Functions
Let's consider a continuous function f(t) (the results extend analogously to 2D functions). The
Dirac delta function \delta(t) is defined by the following properties:
    1. \delta(t) = 0 for t \neq 0
    2. \int_{-\infty}^{\infty} \delta(t) dt = 1
    3. \int_{-\infty}^{\infty} f(\tau) \delta(t - \tau) d\tau = f(t) (sifting property)
Convolution with a Delta Function:
The convolution of f(t) with \delta(t) is given by:
\qquad (f * \delta)(t) = \int_{-\infty}^{\infty} f(\tau) \delta(t - \tau) d\tau
Using the sifting property of the delta function, where the delta function "sifts out" the value of
f(\tau) at \tau = t, we get:
\qquad (f * \delta)(t) = f(t)
Thus, convolving a function with a delta function reproduces the original function.
Convolution with a Shifted Delta Function:
Now, let's consider a shifted delta function \delta(t - a), where a is a constant shift. The
convolution of f(t) with \delta(t - a) is:
\qquad (f * \delta(t - a))(t) = \int_{-\infty}^{\infty} f(\tau) \delta((t - a) - \tau) d\tau =
\int_{-\infty}^{\infty} f(\tau) \delta(\tau - (t - a)) d\tau
Again, using the sifting property, but this time the delta function is centered at \tau = t - a, we
get:
\qquad (f * \delta(t - a))(t) = f(t - a)
This shows that convolving a function f(t) with a shifted delta function \delta(t - a) results in a
shifted version of the original function, f(t) shifted by a. If a > 0, the function is shifted to the right
(delayed), and if a < 0, the function is shifted to the left (advanced).
These properties of convolution with the delta function are fundamental in signal processing and
image processing, particularly when dealing with impulse responses and spatial
transformations.
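A discrete analogue of these properties can be demonstrated with NumPy: convolving a sequence with a unit impulse reproduces it, and convolving with a delayed impulse shifts it (a sketch using full-length convolution, which pads with zeros).

```python
import numpy as np

f = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

delta = np.array([1.0])                     # discrete unit impulse
shifted_delta = np.array([0.0, 0.0, 1.0])   # impulse delayed by a = 2 samples

print(np.convolve(f, delta))          # [1. 3. 2. 5. 4.]          -> reproduces f
print(np.convolve(f, shifted_delta))  # [0. 0. 1. 3. 2. 5. 4.]    -> f shifted right by 2
```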
Implementation Algorithm of Background Subtraction
Background subtraction is a technique used to identify moving objects in a video stream by
differentiating them from a static background. Here's a common implementation algorithm:
   1. Initialization (Background Model Creation):
          ○ Collect Initial Frames: Acquire a sequence of initial video frames that ideally
              contain only the static background without any foreground objects.
          ○ Build the Background Model: There are several ways to build the background
              model:
                 ■ Simple Averaging: Calculate the pixel-wise average of the initial frames.
                     This creates a single background image.
                 ■ Median Filtering: Compute the pixel-wise median of the initial frames. This is
                     more robust to transient noise or small moving objects.
                 ■ Gaussian Mixture Model (GMM): Model each pixel's history as a mixture of
                     Gaussian distributions. Background pixels will typically form one or more
                     dominant Gaussians. This is more adaptive to gradual changes in the
                     background (e.g., lighting).
   2. Foreground Detection (Frame Processing):
          ○ Acquire a New Frame: Read the current frame from the video stream.
          ○ Compare with Background Model: For each pixel in the current frame, compare
              its color or intensity value with the corresponding pixel in the background model.
          ○ Determine Foreground Pixels: A pixel is classified as foreground if the difference
              between its current value and the background model exceeds a predefined
              threshold. The threshold needs to be chosen carefully to balance sensitivity to
              motion and robustness to noise or minor background variations.
                 ■ For simple averaging/median: Calculate the absolute difference
                     |I_{current}(x, y) - B(x, y)| > T, where I_{current} is the current frame, B is the
                     background model, and T is the threshold.
                 ■ For GMM: A pixel is considered foreground if its current value does not fit
                     any of the background Gaussian distributions with a certain confidence level.
          ○ Create a Binary Mask: Generate a binary image (the foreground mask) where
              foreground pixels are white (or 1) and background pixels are black (or 0).
   3. Post-processing (Refinement):
          ○ Noise Reduction: The initial foreground mask often contains noise (isolated white
              pixels or small groups). Apply morphological operations like erosion (to remove
              small white regions) followed by dilation (to restore the shape of the remaining
              foreground objects) to clean the mask.
          ○ Blob Analysis: Group connected foreground pixels into distinct regions or "blobs."
              This helps in identifying individual moving objects.
          ○ Object Tracking (Optional): If the goal is to track the detected objects over time,
              assign unique IDs to the blobs and follow their movement across frames.
   4. Update Background Model (Optional):
          ○ Adaptive Background: To handle gradual changes in the background (e.g.,
              shadows, lighting changes, waving trees), the background model can be updated
              over time. This is typically done by slowly incorporating the current frame's
              information into the background model, but only for pixels classified as background.
            The learning rate for this update needs to be carefully chosen.
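A minimal NumPy/SciPy sketch of the simplest variant described above (median background model, absolute differencing against an illustrative threshold, morphological opening, and a conservative adaptive update); the function names and parameter values are illustrative, not a fixed API.

```python
import numpy as np
from scipy.ndimage import binary_opening

def build_background(frames):
    """Median background model from a stack of grayscale frames, shape (T, H, W)."""
    return np.median(np.asarray(frames, dtype=float), axis=0)

def foreground_mask(frame, background, threshold=25.0):
    """Binary foreground mask via absolute differencing and thresholding."""
    diff = np.abs(frame.astype(float) - background)
    mask = diff > threshold
    # Morphological opening (erosion then dilation) removes isolated noise pixels.
    return binary_opening(mask, structure=np.ones((3, 3)))

def update_background(background, frame, mask, alpha=0.05):
    """Slowly adapt the model, but only where the pixel was classified as background."""
    bg = ~mask
    background[bg] = (1 - alpha) * background[bg] + alpha * frame[bg]
    return background
```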
What is Computer Vision?
Computer Vision is an interdisciplinary field of artificial intelligence (AI) that enables computers
to "see" and interpret the visual world. It involves developing algorithms and techniques that
allow computers to acquire, process, analyze, and understand images and videos, much like
human vision does. The goal is to extract meaningful information from visual data and use it for
various tasks.
Research and Application Areas of Computer Vision
Computer vision is a rapidly evolving field with numerous research and application areas,
including:
Research Areas:
   ● Image Classification: Categorizing images into predefined classes (e.g., cat vs. dog, car
       vs. pedestrian).
   ● Object Detection: Identifying and localizing specific objects within an image or video
       (e.g., detecting faces, cars, and traffic signs in a scene).
   ● Image Segmentation: Dividing an image into meaningful regions or segments (e.g.,
       separating different objects or parts of an object).
   ● Object Tracking: Following the movement of objects over time in a video sequence.
   ● Pose Estimation: Determining the 3D pose (position and orientation) of objects or
       humans from images or videos.
   ● Scene Understanding: Developing a comprehensive understanding of the content,
       context, and relationships between objects in a visual scene.
   ● Image Generation: Creating new images from textual descriptions or other input.
   ● Video Analysis: Understanding and interpreting events, activities, and behaviors in video
       data.
   ● 3D Reconstruction: Creating 3D models of objects or scenes from multiple images or
       videos.
   ● Visual Recognition: A broad area encompassing various tasks like image classification,
       object detection, and instance segmentation.
   ● Explainable AI (XAI) for Vision: Understanding why a computer vision model makes a
       particular decision.
   ● Adversarial Attacks and Robustness: Studying vulnerabilities of vision models and
       developing robust models.
   ● Self-Supervised Learning for Vision: Training vision models without explicit human
       annotations.
Application Areas:
   ● Autonomous Vehicles: Object detection, lane keeping, traffic sign recognition,
       pedestrian detection.
   ● Robotics: Navigation, object manipulation, inspection, human-robot interaction.
   ● Surveillance and Security: Intruder detection, anomaly detection, crowd analysis, facial
       recognition.
   ● Medical Imaging: Disease diagnosis, image-guided surgery, medical image analysis.
   ● Manufacturing and Quality Control: Defect detection, part inspection, assembly
       verification.
  ● Retail: Inventory management, customer behavior analysis, product recommendation.
  ● Agriculture: Crop monitoring, disease detection, yield prediction.
  ● Augmented and Virtual Reality: Scene understanding, object tracking, virtual object
     placement.
  ● Human-Computer Interaction: Gesture recognition, eye tracking, emotion recognition.
  ● Entertainment: Special effects in movies, gaming, content creation.
  ● Search and Retrieval: Image and video search engines, content-based image retrieval.
  ● Accessibility: Assisting visually impaired individuals with scene description and object
     recognition.
Define Image Processing, Pattern Recognition, and Photogrammetry
  ● Image Processing: Focuses on manipulating and transforming digital images to enhance
     their quality, extract specific features, or prepare them for further analysis. The input and
     output of image processing are typically images. Common tasks include noise reduction,
     image enhancement (contrast adjustment, sharpening), geometric transformations
     (scaling, rotation), and image restoration.
  ● Pattern Recognition: A broader field that aims to classify or categorize data (including
     images, audio, text, etc.) into predefined classes or patterns. It involves developing
     algorithms that can learn from data and make predictions or decisions based on the
     identified patterns. Computer vision tasks like object detection and image classification
     are considered subfields of pattern recognition. The process typically involves feature
     extraction, followed by a classification or clustering algorithm.
  ● Photogrammetry: The science and technology of obtaining reliable information about
     physical objects and the environment through the process of recording, measuring, and
     interpreting photographic images and patterns of electromagnetic radiant energy and
     other phenomena. It is primarily concerned with creating accurate 2D and 3D
     measurements from images. Applications include surveying, mapping, 3D modeling of
     terrain and buildings, and industrial metrology. While it uses images as input, its primary
     goal is geometric reconstruction and measurement, rather than high-level understanding
     of the image content like in computer vision.
Explain Snell’s Law for Refraction
Snell's Law describes the relationship between the angles of incidence and refraction when light
(or other waves) passes through the interface between two different homogeneous media with
different refractive indices.
Let:
   ● n_1 be the refractive index of the first medium.
   ● n_2 be the refractive index of the second medium.
   ● \theta_1 be the angle of incidence (the angle between the incident ray and the normal to
       the interface).
   ● \theta_2 be the angle of refraction (the angle between the refracted ray and the normal to
       the interface).
Snell's Law states:
\qquad n_1 \sin(\theta_1) = n_2 \sin(\theta_2)
In simpler terms:
The ratio of the sine of the angle of incidence to the sine of the angle of refraction is equal to the
inverse ratio of the refractive indices of the two media.
Explanation:
   ● When light travels from a medium with a lower refractive index (n_1) to a medium with a
       higher refractive index (n_2), it bends towards the normal (\theta_2 < \theta_1). This is
       because light travels slower in a denser medium (higher refractive index).
   ● Conversely, when light travels from a medium with a higher refractive index (n_1) to a
       medium with a lower refractive index (n_2), it bends away from the normal (\theta_2 >
       \theta_1).
   ● If the refractive indices of the two media are the same (n_1 = n_2), then \sin(\theta_1) =
       \sin(\theta_2), which implies \theta_1 = \theta_2. In this case, there is no bending of light
       at the interface.
Snell's Law is a fundamental principle in optics and explains phenomena like the bending of light
as it passes from air to water (or glass) and the functioning of lenses and prisms.
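A small sketch that applies the formula directly (angles in degrees; the helper name and test values are illustrative):

```python
import numpy as np

def refraction_angle(n1, n2, theta1_deg):
    """Angle of refraction (degrees) from Snell's law: n1 sin(theta1) = n2 sin(theta2).

    Returns None when the arcsin argument exceeds 1, i.e. total internal reflection,
    which can occur when going from a denser to a less dense medium.
    """
    s = n1 * np.sin(np.radians(theta1_deg)) / n2
    if abs(s) > 1.0:
        return None
    return np.degrees(np.arcsin(s))

print(refraction_angle(1.0, 1.33, 30.0))   # air -> water: ~22.1 deg (bends toward the normal)
print(refraction_angle(1.33, 1.0, 30.0))   # water -> air: ~41.7 deg (bends away from the normal)
```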
What is a Negative Image? How Can We Generate a Negative Image?
A negative image is an image in which the tonal values are inverted. Bright areas in the original
image appear dark in the negative image, and dark areas appear bright. Essentially, it's like
looking at the photographic negative of a print.
Generating a Negative Image:
For a digital grayscale image with pixel intensity values ranging from 0 (black) to L-1 (white),
where L is the number of possible intensity levels (e.g., for an 8-bit image, L = 2^8 = 256), the
negative image can be generated by subtracting each pixel's intensity value from the maximum
intensity value:
\qquad I_{negative}(x, y) = (L - 1) - I_{original}(x, y)
Where:
   ● I_{negative}(x, y) is the intensity of the pixel at coordinates (x, y) in the negative image.
   ● I_{original}(x, y) is the intensity of the pixel at coordinates (x, y) in the original image.
   ● L is the total number of intensity levels.
For a color image with RGB components (each ranging from 0 to 255 for an 8-bit image),
the negative image is generated by inverting each color channel independently:
\qquad R_{negative}(x, y) = 255 - R_{original}(x, y) \qquad G_{negative}(x, y) = 255 -
G_{original}(x, y) \qquad B_{negative}(x, y) = 255 - B_{original}(x, y)
The result is an image where the colors are also inverted (e.g., red becomes cyan, green
becomes magenta, and blue becomes yellow).
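A one-line NumPy sketch for 8-bit images (grayscale or RGB; the inversion is applied per channel):

```python
import numpy as np

def negative(image):
    """Negative of an 8-bit image (uint8, values in 0..255)."""
    return 255 - image

gray = np.array([[0, 64], [128, 255]], dtype=np.uint8)
print(negative(gray))
# [[255 191]
#  [127   0]]
```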
Implementation Algorithm of Mean Shift Segmentation
Mean shift segmentation is a non-parametric clustering algorithm used to segment an image
into regions of similar color or intensity. It works by iteratively shifting each data point (pixel)
towards the average of the data points within its neighborhood until convergence.
Here's a common implementation algorithm:
   1. Initialization:
          ○ Define Search Window: Choose a kernel (e.g., Gaussian, flat) and a bandwidth
              (radius) for the search window. This window determines the neighborhood of pixels
              considered for each mean shift iteration. The bandwidth is a crucial parameter that
              affects the size and granularity of the resulting segments.
          ○ Initialize Cluster Centers: Each pixel in the image is initially considered as a
             potential cluster center.
  2. Iteration (Mean Shift Process):
         ○ For each pixel p_i in the image:
                ■ Define the Search Window: Center the search window around the current
                     pixel p_i.
                ■ Calculate the Mean Shift Vector: Find all pixels p_j within the search
                     window of p_i. Calculate the weighted average (mean) of these neighboring
                     pixels, where the weights are determined by the kernel function (e.g.,
                     Gaussian weight decreases with distance). The mean shift vector
                     \mathbf{m}(p_i) is the difference between this weighted mean and the current
                     pixel p_i: \qquad \mathbf{m}(p_i) = \frac{\sum_{p_j \in W(p_i)} w(p_j - p_i)
                     p_j}{\sum_{p_j \in W(p_i)} w(p_j - p_i)} - p_i where W(p_i) is the set of pixels
                     within the search window of p_i, and w(\cdot) is the kernel function.
                ■ Update Pixel Position: Shift the current pixel p_i by the mean shift vector:
                     \qquad p_i^{new} = p_i + \mathbf{m}(p_i)
                ■ Repeat: Continue this iterative shifting process until the mean shift vector
                     becomes smaller than a predefined threshold, indicating convergence. The
                     final converged position is considered the mode (peak) of the local density of
                     pixels.
  3. Clustering (Mode Assignment):
         ○ Assign Pixels to Modes: After the mean shift process converges for all pixels,
             group the pixels based on the modes they converged to. Pixels that converge to the
             same mode are considered to belong to the same segment.
         ○ Merge Close Modes (Optional): If two modes are very close to each other in the
             feature space (e.g., color space and spatial space), they can be merged into a
             single segment to reduce over-segmentation. A distance threshold is used for this
             merging.
  4. Output:
         ○ Segmentation Map: Create a segmented image where each segment is
             represented by a unique color or label. This is done by assigning the same color
             (e.g., the color of the mode) to all pixels that belong to the same segment.
Key Considerations:
  ● Bandwidth Selection: The bandwidth of the kernel is the most critical parameter. A small
      bandwidth leads to fine-grained segmentation (many small segments), while a large
      bandwidth results in coarser segmentation (fewer large segments).
  ● Kernel Choice: Common kernels include the flat (uniform) kernel and the Gaussian
      kernel. The Gaussian kernel gives more weight to closer pixels.
  ● Feature Space: Mean shift can be applied in different feature spaces. For color
      segmentation, the feature space is typically the RGB or Lab color space. Spatial
      information (pixel coordinates) can also be included in the feature vector to encourage
      spatially connected segments.
  ● Computational Cost: Mean shift can be computationally intensive, especially for large
      images, as each pixel undergoes an iterative process.
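As an illustrative sketch of the procedure described above, scikit-learn's MeanShift can be run on a joint color + spatial feature space; the bandwidth and spatial weighting below are placeholder values that must be tuned, and for large images the pixels are usually subsampled first to keep the cost manageable.

```python
import numpy as np
from sklearn.cluster import MeanShift

def mean_shift_segment(image, spatial_weight=0.5, bandwidth=20.0):
    """Segment an RGB image with mean shift in a joint color + spatial feature space.

    A sketch only: bandwidth and spatial_weight are illustrative and data-dependent.
    """
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Feature vector per pixel: (R, G, B, weighted row, weighted column).
    features = np.column_stack([
        image.reshape(-1, 3).astype(float),
        spatial_weight * ys.ravel(),
        spatial_weight * xs.ravel(),
    ])
    labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(features).labels_
    return labels.reshape(h, w)   # segmentation map: one mode label per pixel
```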
What is a Histogram? Explain Histogram Equalization.
A histogram of a digital image is a graphical representation of the distribution of pixel intensity
values. For a grayscale image with intensity levels ranging from 0 to L-1, the histogram plots the
frequency (number of pixels) of each intensity level. The horizontal axis represents the intensity
values, and the vertical axis represents the number of pixels at that intensity.
Histogram Equalization is a technique used to enhance the contrast of an image by
redistributing the pixel intensity values to approximate a uniform distribution. The goal is to
stretch out the intensity range, making better use of all possible intensity levels and thereby
increasing the overall contrast of the image.
Algorithm for Histogram Equalization:
   1. Calculate the Histogram: Compute the histogram h(r_k) of the input image, where r_k
       represents the k-th intensity level (from 0 to L-1) and h(r_k) is the number of pixels with
       that intensity.
   2. Calculate the Normalized Histogram (Probability Density Function): Normalize the
       histogram by dividing each frequency by the total number of pixels N in the image: \qquad
       p(r_k) = \frac{h(r_k)}{N}, for k = 0, 1, ..., L-1 Here, p(r_k) represents the probability of
       occurrence of the intensity level r_k.
   3. Calculate the Cumulative Distribution Function (CDF): Compute the cumulative sum
       of the normalized histogram: \qquad cdf(r_k) = \sum_{i=0}^{k} p(r_i) = \sum_{i=0}^{k}
       \frac{h(r_i)}{N} The CDF represents the probability that a pixel's intensity level is less than
       or equal to r_k.
   4. Map the Intensity Values: Use the CDF to create a transformation function that maps the
       original intensity levels to new intensity levels s_k. For an output image with L intensity
       levels, the mapping function is: \qquad s_k = \text{round}((L - 1) \times cdf(r_k)) Here, (L -
       1) scales the CDF to the full range of output intensity levels, and the rounding operation
       ensures that the output intensities are integers.
   5. Create the Equalized Image: Apply the mapping function to each pixel in the original
       image. If a pixel in the original image has intensity r_k, its corresponding pixel in the
       equalized image will have intensity s_k.
The resulting image will have a histogram that is approximately uniform, leading to increased
contrast and better visibility of details.
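A compact NumPy sketch of the algorithm above for an integer grayscale image (assuming L = 256 levels by default):

```python
import numpy as np

def equalize_histogram(image, L=256):
    """Histogram equalization for an integer grayscale image with L intensity levels."""
    hist = np.bincount(image.ravel(), minlength=L)          # h(r_k)
    cdf = np.cumsum(hist) / image.size                      # cdf(r_k)
    mapping = np.round((L - 1) * cdf).astype(image.dtype)   # s_k = round((L-1) * cdf(r_k))
    return mapping[image]                                   # apply the lookup table per pixel
```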
Sketching the Histogram and Equalized Histogram
Let's analyze the given 3-bit image (8 intensity levels, 0-7) of size 64 \times 64 = 4096 pixels
with the intensity distribution:
r_k         0           1        2          3           4           5          6         7
n_k         790         1023     850        656         329         245        122       81
1. Sketch the Histogram:
The histogram will have 8 bins, corresponding to the 8 intensity levels (0 to 7). The height of
each bar will represent the number of pixels (n_k) at that intensity level.
   ● Intensity 0: Height = 790
   ● Intensity 1: Height = 1023 (highest peak)
   ● Intensity 2: Height = 850
   ● Intensity 3: Height = 656
   ● Intensity 4: Height = 329
   ● Intensity 5: Height = 245
   ● Intensity 6: Height = 122
   ● Intensity 7: Height = 81 (lowest count)
The histogram will show a distribution where most pixels have intensities around 1 and 2, with
fewer pixels at the extreme ends (0 and 7).
2. Calculate the Normalized Histogram (Probability Density Function):
Total number of pixels N = 64 \times 64 = 4096.
r_k                               n_k                              p(r_k) = n_k / N
0                                 790                              790 / 4096 ≈ 0.193
1                                 1023                             1023 / 4096 ≈ 0.250
2                                 850                              850 / 4096 ≈ 0.208
3                                 656                              656 / 4096 ≈ 0.160
4                                 329                              329 / 4096 ≈ 0.080
5                                 245                              245 / 4096 ≈ 0.060
6                                 122                              122 / 4096 ≈ 0.030
7                                 81                               81 / 4096 ≈ 0.020
3. Calculate the Cumulative Distribution Function (CDF):
r_k                               p(r_k)                           cdf(r_k)
0                                 0.193                            0.193
1                                 0.250                            0.193 + 0.250 = 0.443
2                                 0.208                            0.443 + 0.208 = 0.651
3                                 0.160                            0.651 + 0.160 = 0.811
4                                 0.080                            0.811 + 0.080 = 0.891
5                                 0.060                            0.891 + 0.060 = 0.951
6                                 0.030                            0.951 + 0.030 = 0.981
7                                 0.020                            0.981 + 0.020 = 1.001 ≈ 1.00
4. Map the Intensity Values:
Using the formula s_k = \text{round}((L - 1) \times cdf(r_k)), where L = 8 and L - 1 = 7:
r_k        cdf(r_k)        s_k = \text{round}(7 \times cdf(r_k))
0          0.193           round(7 * 0.193) = round(1.351) = 1
1          0.443           round(7 * 0.443) = round(3.101) = 3
2          0.651           round(7 * 0.651) = round(4.557) = 5
3          0.811           round(7 * 0.811) = round(5.677) = 6
4          0.891           round(7 * 0.891) = round(6.237) = 6
5          0.951           round(7 * 0.951) = round(6.657) = 7
6          0.981           round(7 * 0.981) = round(6.867) = 7
7          1.000           round(7 * 1.000) = round(7.000) = 7
5. Sketch the Equalized Histogram:
Now, we need to find the number of pixels at each new intensity level s_k. This is done by
summing the number of pixels from the original histogram that map to the same new intensity
level.
   ● s_k = 1: Corresponds to r_k = 0, so n_{s=1} = n_{r=0} = 790
   ● s_k = 3: Corresponds to r_k = 1, so n_{s=3} = n_{r=1} = 1023
   ● s_k = 5: Corresponds to r_k = 2, so n_{s=5} = n_{r=2} = 850
   ● s_k = 6: Corresponds to r_k = 3 and r_k = 4, so n_{s=6} = n_{r=3} + n_{r=4} = 656 + 329
       = 985
   ● s_k = 7: Corresponds to r_k = 5, r_k = 6, and r_k = 7, so n_{s=7} = n_{r=5} + n_{r=6} +
       n_{r=7} = 245 + 122 + 81 = 448
   ● s_k = 0, 2, and 4: no original intensity level maps to these values, so they contain 0 pixels.
The equalized histogram will have the following approximate distribution:
   ● Intensity 0: Height = 0
   ● Intensity 1: Height = 790
   ● Intensity 2: Height = 0
   ● Intensity 3: Height = 1023
   ● Intensity 4: Height = 0
   ● Intensity 5: Height = 850
   ● Intensity 6: Height = 985
   ● Intensity 7: Height = 448
The equalized histogram shows a more spread-out distribution of pixel intensities compared to
the original histogram, indicating increased contrast in the equalized image.
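The mapping and the equalized counts above can be checked numerically with a few lines of NumPy:

```python
import numpy as np

n_k = np.array([790, 1023, 850, 656, 329, 245, 122, 81])   # given distribution, N = 4096
cdf = np.cumsum(n_k) / n_k.sum()                            # cdf(r_k)
s_k = np.round(7 * cdf).astype(int)
print(s_k)                                                  # [1 3 5 6 6 7 7 7]

# Sum the original counts that map to each new intensity level.
equalized = np.bincount(s_k, weights=n_k, minlength=8).astype(int)
print(equalized)                                            # [0 790 0 1023 0 850 985 448]
```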
Explain Lowpass Gaussian Filter Kernels
A lowpass Gaussian filter kernel is a type of linear filter used in image processing to blur an
image and reduce noise. It works by convolving the image with a Gaussian function.
Gaussian Function in 1D:
The 1D Gaussian function is defined as:
\qquad G(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}}
where:
   ● x is the distance from the center of the kernel.
   ● \sigma (sigma) is the standard deviation of the Gaussian distribution. It controls the extent
      of the blurring. A larger \sigma results in more blurring.
Gaussian Function in 2D:
For a 2D image, the Gaussian function is often defined as a separable function of x and y:
\qquad G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}} =
\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2}} \times \frac{1}{\sqrt{2\pi\sigma^2}}
e^{-\frac{y^2}{2\sigma^2}}
Gaussian Filter Kernel:
A discrete Gaussian filter kernel is a finite-sized matrix derived from sampling the 2D Gaussian
function. The size of the kernel is typically (2k + 1) \times (2k + 1), where k is an integer that
determines the radius of the kernel. The values in the kernel represent the weights applied to
neighboring pixels during convolution.
Properties of Gaussian Filters:
   ● Lowpass: They attenuate high-frequency components in the image, which correspond to
       sharp edges and noise, while preserving low-frequency components (smooth regions).
       This results in a blurring effect.
   ● Spatially Local: The weights in the kernel decrease with distance from the center,
       meaning that closer pixels have a greater influence on the filtered pixel value than farther
       pixels.
   ● Smoothness: The Gaussian function is smooth and continuous, which helps to produce
       smooth blurring without sharp transitions or ringing artifacts that can occur with other
       types of lowpass filters (e.g., box filter).
   ● Separability: The 2D Gaussian function is separable into the product of two 1D Gaussian
       functions. This property allows for efficient implementation of the 2D convolution by
       performing two 1D convolutions (one horizontal and one vertical), as discussed earlier.
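A small NumPy sketch that samples a 1D Gaussian (truncated at roughly 3\sigma, an illustrative choice) and builds the 2D kernel via the outer product, exploiting separability:

```python
import numpy as np

def gaussian_kernel_1d(sigma, k=None):
    """Sampled 1D Gaussian, normalized to sum to 1. Radius k defaults to ~3*sigma."""
    if k is None:
        k = int(np.ceil(3 * sigma))
    x = np.arange(-k, k + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

g = gaussian_kernel_1d(sigma=1.0)
kernel_2d = np.outer(g, g)   # separability: 2D kernel = outer product of two 1D kernels
print(kernel_2d.shape)       # (7, 7) for sigma = 1 with k = 3
```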
Which filtering techniques do we have to use to remove salt and pepper noise?
Salt and pepper noise is characterized by random occurrences of black (pepper) and white
(salt) pixels in an image. Since these are sharp, isolated noise points, linear filters like the
Gaussian filter are not very effective at removing them without also blurring the image
significantly.
The most effective filtering techniques for removing salt and pepper noise are non-linear filters,
specifically order-statistic filters. The most common and effective one is the median filter.
Median Filter:
The median filter works by replacing the value of each pixel with the median value of its
neighboring pixels within a defined window (kernel). The median is the middle value in the
sorted set of neighboring pixel values.
   ● How it works for salt and pepper noise: If the central pixel is a noisy black or white
       pixel, the neighboring pixels are more likely to have the original, uncorrupted intensity
       values. The median operation effectively replaces the extreme noisy value with a more
       representative value from its neighborhood.
Other less common but potentially useful non-linear filters for salt and pepper noise
include:
   ● Min Filter: Replaces the central pixel with the minimum value in its neighborhood.
       Effective for removing salt noise (white pixels).
   ● Max Filter: Replaces the central pixel with the maximum value in its neighborhood.
       Effective for removing pepper noise (black pixels).
   ● Midpoint Filter: Replaces the central pixel with the average of the minimum and
       maximum values in its neighborhood.
The choice of filter and the size of the filter kernel depend on the density of the salt and pepper
noise. Higher noise densities might require larger kernel sizes.
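A minimal SciPy sketch: corrupt a flat test image with roughly 5% salt and pepper noise and remove it with a 3 \times 3 median filter (the noise density and window size are illustrative):

```python
import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(0)
image = np.full((64, 64), 128, dtype=np.uint8)

# Corrupt about 5% of the pixels with pepper (0) and salt (255) noise.
noise = rng.random(image.shape)
noisy = image.copy()
noisy[noise < 0.025] = 0
noisy[noise > 0.975] = 255

denoised = median_filter(noisy, size=3)   # 3x3 median window
print(np.abs(denoised.astype(int) - image.astype(int)).mean())  # close to 0
```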
What should an object recognition system do? Explain the current
strategies for object recognition.
An object recognition system should take an image or video as input and perform the
following key tasks:
   1. Identify the presence of specific objects: Determine if any of the pre-defined object
       categories are present in the input.
   2. Localize the objects: If an object is detected, determine its spatial location within the
       image, typically by drawing a bounding box around it.
   3. Classify the detected objects: Assign a specific category label to each detected object
       (e.g., "car," "person," "cat").
   4. (Optional) Provide additional information: Depending on the application, the system
       might also need to provide more detailed information, such as:
          ○ Instance segmentation: Identifying the precise pixel boundaries of each object
              instance.
          ○ Pose estimation: Determining the 3D orientation and pose of the object.
          ○ Attributes: Recognizing specific properties of the object (e.g., color, size, make of
              a car).
Current Strategies for Object Recognition:
Modern object recognition systems heavily rely on deep learning, particularly Convolutional
Neural Networks (CNNs). Here are the main current strategies:
1. End-to-End Deep Learning Approaches:
   ● Region-Based CNNs (R-CNN family):
          ○ R-CNN (Region-based Convolutional Neural Network): Proposes a set of
              candidate object regions using a selective search algorithm, extracts features from
              each region using a CNN, and then classifies these regions using Support Vector
              Machines (SVMs) and refines the bounding boxes using linear regressors.
          ○ Fast R-CNN: Improves upon R-CNN by extracting features from the entire image
              once and then using a Region of Interest (RoI) pooling layer to extract fixed-size
              feature vectors for each proposed region, making it significantly faster.
          ○ Faster R-CNN: Further enhances speed by replacing the selective search with a
              Region Proposal Network (RPN) that is also a CNN, allowing the network to learn to
              propose regions directly.
          ○ Mask R-CNN: Extends Faster R-CNN by adding a branch for predicting
              segmentation masks for each detected object, enabling instance segmentation.
   ● Single-Shot Detectors (SSDs):
          ○ SSD (Single Shot MultiBox Detector): Predicts bounding boxes and class
              probabilities directly from feature maps at multiple scales in a single forward pass of
              the network, making it very fast.
          ○ YOLO (You Only Look Once): Divides the image into a grid and predicts bounding
              boxes and class probabilities for each grid cell. It also operates in a single forward
              pass and is known for its speed. Various versions (YOLOv2, YOLOv3, YOLOv4,
              YOLOv5, YOLOR, YOLOv7, YOLOv8) have been developed with significant
              improvements in accuracy and efficiency.
          ○ RetinaNet: Addresses the class imbalance problem in single-shot detectors using a
              focal loss function, achieving state-of-the-art accuracy while maintaining reasonable
              speed.
   ● Transformers for Object Detection:
          ○ More recently, transformer-based architectures, like DETR (DEtection
              TRansformer), have shown promising results. DETR uses a transformer
              encoder-decoder architecture along with a set of learnable object queries to directly
              predict a fixed number of object bounding boxes and their classes in parallel,
              eliminating the need for explicit region proposal or anchor box generation.
2. Key Components and Techniques Used in These Strategies:
   ● Convolutional Neural Networks (CNNs): Serve as the backbone for feature extraction,
      learning hierarchical representations of visual data. Architectures like VGG, ResNet,
      Inception, EfficientNet, and others are commonly used.
   ● Feature Pyramids: Processing features at multiple scales to handle objects of different
      sizes. Techniques like Feature Pyramid Networks (FPN) are widely used.
   ● Anchor Boxes (or Prior Boxes): A set of pre-defined bounding boxes with different sizes
      and aspect ratios used in some detectors (like Faster R-CNN and SSD) to facilitate the
      prediction of object locations.
   ● Non-Maximum Suppression (NMS): A post-processing step used to eliminate redundant
      overlapping bounding box predictions for the same object, keeping only the most
      confident one.
   ● Data Augmentation: Techniques like random cropping, flipping, scaling, and color
      jittering are used to increase the diversity of the training data and improve the robustness
      of the models.
   ● Transfer Learning: Pre-training CNNs on large-scale image datasets (like ImageNet) and
      then fine-tuning them on the specific object recognition task with a smaller dataset.
   ● Loss Functions: Carefully designed loss functions that penalize incorrect classifications
      and inaccurate bounding box predictions are crucial for training effective models.
The field of object recognition is continuously advancing, with ongoing research focusing on
improving accuracy, speed, robustness to variations (e.g., lighting, viewpoint, occlusion), and
reducing the need for large amounts of labeled data.
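As an illustration of one of the key components listed above, here is a minimal NumPy sketch of greedy non-maximum suppression (the box format [x1, y1, x2, y2] and the IoU threshold are illustrative):

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes whose IoU with it is too high.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns the indices of the boxes that are kept.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]      # process boxes from most to least confident
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the current best box with all remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]   # discard heavily overlapping boxes
    return keep
```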