Docs 1
A bank has 2500 branches spread across 50 states and offers 20 different services. On any given day it is observed that only 15% of the account holders utilize the bank's services. On average, each branch has 50,000 accounts. The bank currently holds data for 1200 days (monthly aggregates are required). (a) Estimate the size of the warehouse in terms of records. (b) Assuming sparsity for 1-way, 2-way, and 3-way aggregates as 50%, 75%, and 100% respectively, estimate the size of the warehouse when you create all possible aggregates. (c) Did the phenomenon of sparsity failure occur? Explain.
Let's break down each part of the problem and estimate the warehouse size based on the provided
information:
• 15% of the account holders utilize the bank's services each day.
• Total number of accounts = 2500 branches * 50,000 accounts per branch = 125,000,000
accounts.
Since only 15% of the account holders use the bank's services on any given day, the number of accounts that utilize services daily is:
• Active accounts per day = 125,000,000 * 0.15 = 18,750,000 accounts.
Now, for each active account, there will be 20 records for the different services offered:
• Records per day = 20 services * 18,750,000 active accounts = 375,000,000 records per day.
Since the bank holds data for 1200 days, the total number of records over the 1200 days is:
• Total records = 375,000,000 * 1200 = 450,000,000,000 records.
So, the size of the warehouse in terms of base records is 450 billion records.
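The part (a) arithmetic can be checked with a few lines of Python. This is a minimal sketch using only the figures stated above (2500 branches, 50,000 accounts per branch, 20 services, 15% daily usage, 1200 days):

import math

# Part (a): base fact-table size, using the assumptions stated above.
branches = 2500
accounts_per_branch = 50_000
services = 20
daily_usage = 0.15          # 15% of account holders use the bank each day
days = 1200

total_accounts = branches * accounts_per_branch          # 125,000,000
active_per_day = int(total_accounts * daily_usage)       # 18,750,000
records_per_day = active_per_day * services              # 375,000,000
total_records = records_per_day * days                   # 450,000,000,000

print(f"{total_records:,} base records")                 # 450,000,000,000 (450 billion)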
Part (b): Estimate the size of the warehouse when you create all possible aggregates
When creating aggregates, we are considering combinations of dimensions like state, branch, time,
service type, and account usage. Given the sparsity levels for 1-way, 2-way, and 3-way aggregates, we
can calculate the size of the warehouse when including these aggregates.
Number of dimensions:
• There are 4 main dimensions: service type (20 types), state (50 states), branch (2500
branches), and day (1200 days).
o Service type: 20
o State: 50
o Branch: 2500
o Day: 1200
So, there are 4 dimensions, and the total number of 1-way aggregate cells is the sum of their cardinalities: 20 + 50 + 2500 + 1200 = 3770.
Next, we apply the sparsity assumptions for 1-way, 2-way, and 3-way aggregates (a short sketch of the full computation follows this list):
• 1-way aggregates (50% sparsity): 3770 * 0.50 = 1885 one-way aggregate cells with data.
• 2-way aggregates (75% sparsity): the sum of the pairwise products of the dimension cardinalities is 3,260,000 cells, of which 3,260,000 * 0.75 = 2,445,000 hold data.
• 3-way aggregates (100% sparsity): all three-way aggregate cells are expected to have data, i.e., 213,700,000 cells.
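A minimal Python sketch of the aggregate-size computation under the answer's assumptions (dimension cardinalities of 20 services, 50 states, 2500 branches, and 1200 days, with the stated sparsity factors; the pairwise and three-way totals are derived here rather than quoted from the text):

from itertools import combinations
from math import prod

dims = {"service": 20, "state": 50, "branch": 2500, "day": 1200}
sparsity = {1: 0.50, 2: 0.75, 3: 1.00}   # fraction of aggregate cells that hold data

for k in (1, 2, 3):
    cells = sum(prod(c) for c in combinations(dims.values(), k))
    populated = int(cells * sparsity[k])
    print(f"{k}-way: {cells:,} possible cells, {populated:,} with data")

# 1-way:        3,770 possible cells,       1,885 with data
# 2-way:    3,260,000 possible cells,   2,445,000 with data
# 3-way:  213,700,000 possible cells, 213,700,000 with data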
Part (c): Did the phenomenon of sparsity failure occur?
In this case:
• The 3-way aggregates are fully populated because of the 100% sparsity assumption for 3-way aggregates.
• The 1-way and 2-way aggregates are only partially populated, but they still contribute a significant number of records, especially the 2-way aggregates.
Thus, sparsity failure does not occur in this scenario: even with all possible aggregates created, the aggregate records remain far smaller than the 450 billion base records, so the warehouse does not blow up in size. However, there is still some inefficiency, because the sparsely populated 1-way and 2-way aggregates leave empty cells in the aggregate tables.
Therefore, sparsity failure did not occur, but there is still some storage inefficiency due to sparse records in the 1-way and 2-way aggregates.
A university wants to analyze the attendance of students in various courses. The attendance fact table has the schema: Attendance Fact Table (Student Key, Faculty Key, Course Key, Campus Key, Room Key, Day_Hour Key). (a) What is the granularity of the above fact table? (b) Assuming that all the dimension keys are surrogate keys, estimate the size of the above fact table (in GB), given that the fact table contains data for 10 semesters. In each semester, 200 courses are offered at each campus (assume there are 4 campuses) and each course has 40 lecture hours. Each course has an average strength of 100 students. It is also given that the average attendance in courses is 60%. (c) Each faculty belongs to a department and each student has a major discipline (like CS). Create appropriate aggregates so that they facilitate attendance analysis with departments and disciplines. Suggest suitable values for sparsity as we aggregate attendance over departments and disciplines. (d) When you create aggregates in part (c), would sparsity failure happen? Justify your answer. (e) Design the Time dimension for the given Attendance fact table and give the aggregated star schema for semester-wise attendance of courses.
Granularity refers to the level of detail stored in the fact table. For the given fact table schema:
• Schema: (Student Key, Faculty Key, Course Key, Campus Key, Room Key, Day Hour Key)
• Each record corresponds to the attendance of a specific student for a specific course taught
by a specific faculty in a specific room at a specific hour of the day on a specific date.
Granularity: The fact table stores attendance at the hourly level for individual students per course.
Part (b): Estimate the size of the fact table
Key information:
1. Number of semesters = 10.
2. Courses offered per campus per semester = 200; lecture hours per course = 40; average strength = 100 students; average attendance = 60%.
3. Number of campuses = 4.
Total courses per semester = 200 * 4 = 800 courses.
Lecture hours per semester = 800 * 40 = 32,000 hours.
Attendance records per lecture hour = 100 students * 60% = 60 records, so records per semester = 32,000 * 60 = 1,920,000, and records for 10 semesters = 19,200,000.
Assume each dimension key is a 4-byte surrogate key (integer). The record schema has 6 keys, so each record takes 6 * 4 = 24 bytes.
Total size = 19,200,000 records * 24 bytes = 460,800,000 bytes.
Convert to gigabytes: 460,800,000 / (1024^3) ≈ 0.43 GB.
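A minimal sketch of the same estimate in Python, using only the figures given in the question (10 semesters, 200 courses per campus, 4 campuses, 40 lecture hours per course, 100 students per course, 60% attendance, and 6 four-byte surrogate keys per record):

semesters = 10
courses_per_campus = 200
campuses = 4
hours_per_course = 40
students_per_course = 100
attendance_rate = 0.60
bytes_per_record = 6 * 4                 # 6 surrogate keys x 4 bytes each

lecture_hours = courses_per_campus * campuses * hours_per_course   # 32,000 per semester
records_per_semester = lecture_hours * students_per_course * attendance_rate
total_records = records_per_semester * semesters                   # 19,200,000
size_gb = total_records * bytes_per_record / (1024 ** 3)

print(f"{int(total_records):,} records, about {size_gb:.2f} GB")   # ~0.43 GB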
Part (c): Aggregates for departments and disciplines
To analyze attendance by departments (for faculty) and disciplines (for students), we can create the following aggregates:
1. Department-level attendance: roll up the Faculty Key to the department level (e.g., attendance counts by Department, Course, and Semester).
2. Discipline-level attendance: roll up the Student Key to the discipline level (e.g., attendance counts by Discipline, Course, and Semester).
3. Combined Department-Discipline attendance: attendance counts by Department and Discipline together.
Suggested sparsity values:
• Department-level aggregates: Sparsity = 50%. Not all departments will have faculty teaching every course.
• Discipline-level aggregates: Sparsity = 70%. Not all disciplines will have students attending every course.
• Combined Department-Discipline aggregates: Sparsity = 90%. Only a few department-discipline combinations occur for any given course.
Part (d): Would sparsity failure happen?
Sparsity failure occurs when most of the records in an aggregate table are empty or null, leading to wasted storage.
In this case:
1. Department-level aggregates (50% sparsity) are only moderately sparse and are unlikely to cause failure.
2. Discipline-level aggregates (70% sparsity) are moderately sparse but still unlikely to cause failure.
3. Combined Department-Discipline aggregates (90% sparsity) are very sparse, which could lead to sparsity failure.
Justification:
• Sparsity failure is most likely for the combined Department-Discipline aggregates due to
high sparsity. This aggregate would result in a large number of null records and inefficient
storage. To mitigate sparsity failure, consider materializing only necessary subsets of these
aggregates (e.g., for commonly occurring department-discipline combinations).
Part (e): Time dimension and aggregated star schema
Time dimension attributes:
1. Time Key (surrogate key, unique identifier for each time unit).
2. Date.
3. Day_Hour.
4. Month.
5. Quarter.
6. Year.
7. Semester.
Aggregated star schema for semester-wise attendance:
1. Fact Table: Semester_Attendance_Fact (Course Key, Campus Key, Faculty Key, Semester Key, attendance_count).
2. Dimensions: Course, Campus, Faculty (with its Department attribute), and a Time dimension rolled up to the Semester level.
This design facilitates semester-wise analysis of attendance while retaining the ability to drill down by attributes such as course, campus, faculty, and department.
Which class of ODS (Operational Data Store) can be used in a real-time data warehouse?
Operational Data Stores (ODS) serve as an intermediate layer for integrating and transforming data
from multiple operational systems before loading it into a data warehouse. Depending on the
requirements, different classes of ODS can be utilized for a real-time data warehouse:
The class of ODS that can be used in a real-time data warehouse is the "Active ODS" (also called a "Real-Time ODS"). Its key characteristics are:
1. Real-Time Data Integration:
o Processes and integrates data from operational systems in near real-time or real time, enabling continuous updates to the ODS.
2. Low Latency:
o Designed to minimize the delay between data creation in operational systems and its availability in the ODS.
3. Direct Queryability:
o Can be queried directly for up-to-date data, making it suitable for operational decision-making and feeding real-time data warehouses.
4. Dynamic Updates:
o Acts as a staging area for real-time ETL (Extract, Transform, Load) processes to transform and prepare data for analytical use in a real-time data warehouse.
Benefits of an Active ODS:
• Timely Insights: Provides up-to-date data for analytical processing in real time or near real time.
• Reduced Load on Source Systems: Aggregates data from multiple operational systems, reducing the direct impact on them.
Example use cases:
1. Financial Services:
o Real-time monitoring of trades and stock prices for analytics and decision-making.
2. E-Commerce:
o Real-time tracking of orders, website traffic, and inventory levels.
3. Healthcare:
o Monitoring patient data and clinical metrics for immediate analysis and action.
4. IoT and Manufacturing:
o Integration of real-time data from IoT devices into a data warehouse for analytics.
Explain how an Active ODS is used in a real-time data warehouse.
Using an Active ODS (Operational Data Store) in a real-time data warehouse involves a well-defined
process for integrating, transforming, and managing operational data in real time. Below is an
explanation of how it is used:
1. Data Ingestion:
o Real-time feeds (e.g., change data capture or streams) push operational changes into the ODS as they occur.
o For less time-sensitive data, batch jobs might feed the ODS periodically.
2. Data Transformation:
• Data Cleansing:
o Corrects errors in raw data (e.g., fixing null values or inconsistent formats).
• Data Enrichment:
o Adds additional context to raw data by combining it with other datasets (e.g., appending demographic details to a transaction record).
• Standardization:
o Ensures that data from multiple sources is in a consistent format for downstream use.
• Schema Mapping:
o Maps data to a unified schema in the ODS that aligns with the data warehouse schema.
3. Real-Time ETL/ELT:
• The ODS acts as an intermediate layer for real-time ETL/ELT processes, preparing data for the data warehouse:
o Load: Feeds the transformed data into the data warehouse in near real-time.
4. Direct Querying of the ODS:
• For operational reporting and monitoring, analysts and applications can query the ODS directly:
o Dashboards: Real-time dashboards (e.g., sales performance, system health) can pull data directly from the ODS.
o Operational Alerts: Applications use the ODS to trigger alerts based on real-time conditions (e.g., fraud detection, inventory depletion).
5. Incremental Updates:
o Updates are incremental, only including changes since the last update.
6. Feedback Loops
• Insights from the real-time data warehouse can be sent back to the operational systems to
improve decision-making in real time (e.g., dynamic pricing updates, fraud prevention
mechanisms).
Example Scenario:
An e-commerce platform wants to monitor website traffic, sales, and inventory levels in real time.
1. Data Sources: website clickstream, order transactions, and inventory systems.
2. Active ODS:
o Ingests these feeds continuously and enriches the data by combining sales with inventory levels to flag low-stock items.
3. Real-Time Data Warehouse: receives the prepared data for analytical queries and trend analysis.
4. Visualization Tools: real-time dashboards present traffic, sales, and stock alerts, turning the data into actionable insights.
Summary
An Active ODS bridges the gap between operational systems and the real-time data warehouse by
ingesting, transforming, and preparing real-time data. It supports operational reporting and serves as
a staging area for analytical queries, enabling businesses to derive actionable insights in real time.
Suppose you have implemented a DW on top of an RDBMS that does not support partitioning. Write
proper SQL statements to create 4 yearly partitions from 2010 to 2013 for a sales fact table. Create
all database objects that you would need to support partitioning. Also, create a metadata table in
which you store all relevant information about the partitions such that queries can be redirected to
appropriate partitions.
To implement partitioning in an RDBMS that does not natively support it, we can manually simulate
partitioning by creating separate tables for each partition (e.g., one table per year) and redirecting
queries to the appropriate table using metadata. Below are the steps and SQL statements to achieve
this:
1. Create One Table per Yearly Partition
We create separate tables for each year (2010 to 2013) to simulate partitions. These tables have the same schema as the original sales fact table.

CREATE TABLE sales_2010 (
    product_id INT,
    customer_id INT,
    sale_date DATE,
    amount DECIMAL(10, 2)
);

CREATE TABLE sales_2011 (
    product_id INT,
    customer_id INT,
    sale_date DATE,
    amount DECIMAL(10, 2)
);

CREATE TABLE sales_2012 (
    product_id INT,
    customer_id INT,
    sale_date DATE,
    amount DECIMAL(10, 2)
);

CREATE TABLE sales_2013 (
    product_id INT,
    customer_id INT,
    sale_date DATE,
    amount DECIMAL(10, 2)
);
2. Create a Metadata Table for Partition Management
This table stores information about the partitions, such as the table name and the range of dates it covers. Queries refer to this table to determine which partition to query.

CREATE TABLE sales_partitions (
    partition_table_name VARCHAR(50),
    start_date DATE,
    end_date DATE
);

INSERT INTO sales_partitions (partition_table_name, start_date, end_date)
VALUES
    ('sales_2010', '2010-01-01', '2010-12-31'),
    ('sales_2011', '2011-01-01', '2011-12-31'),
    ('sales_2012', '2012-01-01', '2012-12-31'),
    ('sales_2013', '2013-01-01', '2013-12-31');
3. Redirect Queries Using the Metadata Table
You can write a stored procedure or query logic to determine the correct partition dynamically by referring to the metadata table.
Example: Query Sales Based on sale_date

-- Step 1: Find the partition that covers the given date (:sale_date is the date being looked up)
SELECT partition_table_name
FROM sales_partitions
WHERE :sale_date BETWEEN start_date AND end_date;
-- Result: 'sales_2012' (for a sale_date in 2012)

-- Step 2: Query the identified partition
SELECT *
FROM sales_2012
WHERE sale_date = :sale_date;
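The same redirect logic can also live in application code. A minimal Python sketch, assuming an SQLite connection, ISO-formatted dates, and the sales_partitions metadata table defined above (the helper name query_sales_for_date is illustrative):

import sqlite3

def query_sales_for_date(conn: sqlite3.Connection, sale_date: str):
    """Route a point query to the correct yearly partition via sales_partitions."""
    cur = conn.cursor()
    # 1. Look up which partition table covers the date.
    cur.execute(
        "SELECT partition_table_name FROM sales_partitions "
        "WHERE ? BETWEEN start_date AND end_date",
        (sale_date,),
    )
    row = cur.fetchone()
    if row is None:
        return []                       # no partition covers this date
    partition = row[0]
    # 2. Query that partition; the table name comes from our own metadata table,
    #    so interpolating it into the SQL text is safe here.
    cur.execute(f"SELECT * FROM {partition} WHERE sale_date = ?", (sale_date,))
    return cur.fetchall()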
4. (Alternative) Use a View to Abstract the Partitions
Alternatively, you can use a view to abstract the partitioning from end users. A view can union all the partition tables, making it appear as a single table.

CREATE VIEW sales AS
SELECT * FROM sales_2010
UNION ALL
SELECT * FROM sales_2011
UNION ALL
SELECT * FROM sales_2012
UNION ALL
SELECT * FROM sales_2013;

Users can query the view as if querying the entire sales table:

SELECT *
FROM sales
WHERE sale_date = :sale_date;
Summary
• Partition Tables: One physical table per year (sales_2010 ... sales_2013) simulates range partitioning.
• Metadata Table: sales_partitions records the date range covered by each partition table.
• Redirect Logic: Queries are redirected to the right table based on sale_date using the metadata table (or hidden behind a view).
This approach manually simulates partitioning and enables efficient querying while maintaining logical separation of data for performance.
Explain in detail, the kind of features and functionalities RDBMS vendors are introducing to make
their product more suitable for data warehousing & OLAP. List the features and briefly describe them
and their usefulness.
RDBMS vendors continuously introduce features and functionalities to make their products more
suitable for data warehousing and Online Analytical Processing (OLAP). These features are designed
to enhance performance, scalability, and usability for analytical workloads, which differ significantly
from traditional transactional systems (OLTP). Below is a detailed explanation of these features:
1. Partitioning
Description:
Partitioning involves dividing large tables into smaller, more manageable pieces called partitions
based on a key, such as date or region. Types include range partitioning, hash partitioning, list
partitioning, or composite partitioning.
Usefulness:
• Improves query performance by enabling the database to scan only relevant partitions.
2. Materialized Views
Description:
Materialized views are precomputed, stored query results that can be refreshed periodically or
incrementally based on the underlying data.
Usefulness:
• Boosts query performance by allowing the system to use precomputed results instead of
recalculating them.
3. Advanced Query Optimization
Description:
RDBMS vendors include advanced query optimizers that analyze queries and determine the most
efficient execution plan. Features include:
• Cost-based optimization: Evaluates multiple query plans to select the one with the lowest
cost.
• Join optimization: Implements algorithms like hash joins, nested loops, or merge joins.
Usefulness:
• Reduces query response times for complex analytical queries involving large datasets.
4. Indexing Enhancements
Description:
Specialized indexing techniques for analytical queries, such as:
• Bitmap Indexes: Efficient for columns with low cardinality (e.g., gender, region).
• Columnar Indexes: Optimized for reading and aggregating data across columns.
Usefulness:
• Speeds up filtering and aggregation on large analytical tables, especially for low-cardinality columns.
5. Columnar Storage
Description:
Data is stored column-wise instead of row-wise, enabling efficient access for analytical queries that
typically scan specific columns (e.g., SUM, AVG).
Usefulness:
• Improves I/O and aggregation efficiency for analytical queries that read only a few columns.
6. Parallel Processing
Description:
Enables multiple processors or nodes to execute queries concurrently, for example through parallel scans, parallel joins and aggregations, or massively parallel processing (MPP) architectures.
Usefulness:
• Reduces elapsed time for large scans, joins, and aggregations by spreading the work across CPUs or nodes.
7. In-Memory Processing
Description:
RDBMS vendors provide in-memory capabilities to store and process frequently accessed data
directly in RAM.
Usefulness:
• Significantly reduces query latency for frequently accessed (hot) data.
8. Data Compression
Description:
Compresses data on disk and in memory using techniques like dictionary encoding or delta encoding.
Usefulness:
• Improves query performance by minimizing the amount of data read from storage.
9. Integration with Big Data Platforms
Description:
Modern RDBMS systems provide integration with big data platforms like Hadoop and Spark for
hybrid processing.
Usefulness:
• Enables data warehouses to handle unstructured and semi-structured data.
10. OLAP Cube Support
Description:
Many RDBMS systems natively support multidimensional data models, allowing users to define and
query OLAP cubes directly.
Usefulness:
• Simplifies multidimensional analysis such as slicing, dicing, and aggregation.
11. In-Database Advanced Analytics
Description:
RDBMS vendors integrate statistical and machine learning libraries for advanced analytics within the
database.
Usefulness:
• Enables predictive analytics and advanced data modeling without exporting data to external
tools.
12. Data Federation / Virtualization
Description:
Supports querying across multiple data sources (e.g., databases, flat files, cloud storage) as if they
were a single system.
Usefulness:
• Lets users query data where it resides, avoiding costly data movement into the warehouse.
13. Security and Governance
Description:
Features like role-based access control (RBAC), encryption, and auditing ensure data is managed
securely.
Usefulness:
• Protects sensitive data and helps ensure regulatory compliance.
14. Automated Maintenance
Description:
Automation tools for managing indexes, rebuilding partitions, and refreshing materialized views.
Usefulness:
• Keeps the data warehouse optimized for performance with minimal manual intervention.
15. Real-Time / Streaming Data Support
Description:
RDBMS systems integrate streaming capabilities to process real-time data for operational analytics.
Usefulness:
• Enables operational analytics on data as it arrives, shortening the time from event to insight.
Summary Table
Feature                   Usefulness
Columnar Storage          Improves I/O and aggregation efficiency for analytical queries.
Data Compression          Saves storage space and reduces data retrieval time.
OLAP Cube Support         Simplifies multidimensional analysis for slicing, dicing, and aggregation.
Advanced Analytics        Facilitates predictive and statistical analytics within the database.
Security and Governance   Protects sensitive data and ensures compliance.
Automated Maintenance     Reduces manual workload for optimizing data warehouses.
These features collectively make RDBMS products more efficient, scalable, and user-friendly for data
warehousing and OLAP, ensuring they meet the growing demands of modern analytics environments.
Why is it recommended that users should not be aware of the partitions and aggregates that are present in the DW? In the absence of information about partitions and aggregates, how do user queries leverage them?
It is recommended that users should not be directly aware of the partitions and aggregates in a data
warehouse (DW) because of the following reasons:
1. Complexity Reduction
• Reason: Exposing partitions and aggregates adds complexity to querying, as users would
need to explicitly account for these structures in their SQL queries.
• Advantage: By abstracting these details, users can focus on querying the data they need
without worrying about the underlying storage or performance optimizations.
2. Separation of Logical and Physical Design
• Reason: Partitions and aggregates are physical implementations aimed at optimizing storage
and query performance, while the user's view of the data should remain logical.
• Advantage: Ensures that any changes to the physical structure (e.g., re-partitioning or
creating new aggregates) do not require query modifications, maintaining query portability.
3. Query Optimization
• Reason: If users are aware of partitions and aggregates, they might inadvertently reference
or misuse them, leading to incorrect results or reduced performance.
• Advantage: Keeping these details hidden allows administrators to manage partitions and
aggregates without user intervention.
Modern data warehouses and RDBMSs provide tools and mechanisms that enable users to benefit
from partitions and aggregates transparently:
1. Query Optimizer
• The query optimizer automatically rewrites user queries to use the most suitable partition or precomputed aggregate when one can answer the query.
• Example:
o A query like SELECT SUM(sales) FROM sales WHERE year = 2021 can directly read
from a partitioned table (e.g., sales_2021) or a materialized view with precomputed
yearly sales.
2. Partition Pruning
• The optimizer identifies which partitions are relevant for a query based on filtering
conditions and restricts access to only those partitions.
• Example:
o For the same query filtered on year = 2021, only the partition holding 2021 data is scanned and all other partitions are skipped.
3. Aggregate Awareness
• The optimizer (or BI tool) recognizes when a precomputed aggregate can satisfy a query and substitutes it automatically.
• Example:
o A query for monthly sales totals (GROUP BY month) might directly use a materialized
view with monthly aggregates.
4. Views
• Logical views abstract partitions and aggregates, presenting a unified schema to users.
• Example:
o A view like CREATE VIEW sales AS SELECT * FROM sales_2010 UNION ALL SELECT *
FROM sales_2011 ... hides partition details.
5. Metadata Tables
• Metadata about partitions and aggregates guides the query optimizer in selecting the
appropriate data sources.
• Example:
o A metadata table might map date ranges to partitions, enabling the system to route
queries to the correct table.
Conclusion
Hiding partitions and aggregates from users provides:
• Simplicity in querying.
• Transparent performance optimization through the query optimizer.
• Independence of user queries from changes to the physical design.
This approach ensures users get accurate and fast query results without needing in-depth knowledge of the underlying implementation.
Supervised learning and unsupervised learning are two primary types of machine learning
techniques, each with distinct characteristics and applications. Here's a detailed comparison:
1. Definition:
• Supervised Learning: In supervised learning, the model is trained on labeled data, where each input has a corresponding output (label). The model learns the mapping between inputs and outputs, and this mapping is used to make predictions on new, unseen data.
• Unsupervised Learning: The model is trained on unlabeled data and must discover patterns, groupings, or structure in the data on its own.
2. Data:
• Supervised Learning: Requires labeled data, meaning each training example has a known
output or label.
• Unsupervised Learning: Works with unlabeled data, and the model tries to identify
relationships or structures within the data without prior knowledge of the outcomes.
3. Goal:
• Supervised Learning: The goal is to predict an output from input data. It is often used for
classification (predicting a category) and regression (predicting a continuous value).
• Unsupervised Learning: The goal is to find hidden patterns or structures in the data, such as
grouping similar data points or reducing data dimensionality.
4. Output:
• Supervised Learning: The output is known and specified as a label or a value (e.g., spam or
not-spam in email classification).
• Unsupervised Learning: The output is not predefined; the model groups or organizes data
into clusters or structures.
5. Examples:
• Supervised Learning:
o Classification: Predicting whether an email is spam or not-spam.
o Regression: Predicting the price of a house based on its features like size, location, etc.
• Unsupervised Learning:
o Clustering: Grouping customers into segments based on purchasing behavior (e.g., K-Means).
o Dimensionality Reduction: Reducing the number of features in the data (e.g., PCA - Principal Component Analysis).
6. Algorithms:
• Supervised Learning:
▪ Linear and Logistic Regression
▪ Decision Trees
▪ Support Vector Machines
▪ Neural Networks
• Unsupervised Learning:
▪ K-Means Clustering
▪ Hierarchical Clustering
▪ Principal Component Analysis (PCA)
7. Performance Evaluation:
• Supervised Learning: The model's performance can be directly evaluated using metrics such as accuracy, precision, recall, F1 score (for classification), or mean squared error (for regression).
• Unsupervised Learning: Harder to evaluate directly; measures such as the silhouette score or Davies-Bouldin index are used to judge the quality of the discovered structure.
8. Training Time:
• Supervised Learning: Training can be time-consuming, particularly for large labeled datasets and complex models that must fit the input-output mapping.
• Unsupervised Learning: The training time may vary but is generally less focused on prediction, so the process can sometimes be quicker.
9. Application:
• Supervised Learning:
o Spam filtering, image recognition, credit scoring, price prediction.
• Unsupervised Learning:
o Customer segmentation
o Anomaly detection
10. Advantages and Disadvantages:
• Supervised Learning:
o Advantages: More direct and interpretable, works well when labeled data is available.
o Disadvantages: Requires a large amount of labeled data, which can be costly or time-consuming to obtain.
• Unsupervised Learning:
o Advantages: Works with readily available unlabeled data and is useful for exploring unknown structure.
o Disadvantages: Harder to evaluate and interpret, and might lead to less precise models.
Summary:
• Supervised Learning: Works with labeled data to predict specific outcomes (classification or
regression). It is data-driven and allows for direct evaluation of performance.
• Unsupervised Learning: Works with unlabeled data to find hidden patterns and structures
(clustering, dimensionality reduction). Evaluation is more complex due to the lack of labeled
outcomes.
Performance measures in machine learning (ML) evaluate how well a model predicts or classifies
data. The choice of performance metric depends on the type of ML problem (classification,
regression, clustering, etc.) and the specific goals of the application. Here are the key performance
measures and their applications:
1. Classification Metrics
Used when the output is categorical (e.g., spam detection, image classification).
a. Accuracy
• Definition: The proportion of correct predictions among all predictions: Accuracy = (TP + TN) / (TP + TN + FP + FN).
• Limitations: Misleading for imbalanced datasets (e.g., predicting a rare disease where most predictions are "healthy").
b. Precision
• Definition: The ratio of correctly predicted positive observations to the total predicted positives: Precision = TP / (TP + FP), where TP = true positives and FP = false positives.
• Best for: Applications where minimizing false positives is critical (e.g., spam detection).
c. Recall (Sensitivity)
• Definition: The ratio of correctly predicted positive observations to all actual positives: Recall = TP / (TP + FN), where FN = false negatives.
• Best for: Applications where missing positives is costly (e.g., disease diagnosis).
d. F1 Score
• Definition: The harmonic mean of precision and recall: F1 = 2 * (Precision * Recall) / (Precision + Recall).
• Best for: Imbalanced datasets, balancing false positives and false negatives.
e. ROC Curve and AUC
• Definition: The ROC curve plots the true positive rate against the false positive rate for various thresholds. AUC (Area Under the Curve) quantifies the area under the ROC curve; values closer to 1 indicate better discrimination.
f. Log Loss
• Definition: Measures the error of predicted probabilities compared to true class labels: Log Loss = -(1/N) * Σ [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ].
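These classification metrics can be computed directly from confusion-matrix counts. A minimal Python sketch (the counts used here are illustrative, not taken from the text above):

# Illustrative confusion-matrix counts.
tp, fp, fn, tn = 80, 10, 20, 90

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")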
2. Regression Metrics
Used when the output is continuous (e.g., house price prediction, stock price prediction).
a. Mean Absolute Error (MAE)
• Definition: The average of absolute differences between predicted and actual values: MAE = (1/N) * Σ |y_i - ŷ_i|.
b. Mean Squared Error (MSE)
• Definition: The average of squared differences between predicted and actual values: MSE = (1/N) * Σ (y_i - ŷ_i)².
c. Root Mean Squared Error (RMSE)
• Definition: The square root of MSE; it expresses the error in the same unit as the target variable: RMSE = √MSE.
d. R-squared (R²)
• Definition: Indicates the proportion of variance in the target variable explained by the model: R² = 1 - SS_residual / SS_total.
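A short NumPy sketch of the regression metrics above, computed on illustrative actual and predicted values:

import numpy as np

# Illustrative actual vs. predicted values (not from the text above).
y_true = np.array([300_000, 250_000, 400_000, 320_000])
y_pred = np.array([310_000, 240_000, 390_000, 335_000])

mae  = np.mean(np.abs(y_true - y_pred))
mse  = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2   = 1 - ss_res / ss_tot

print(f"MAE={mae:.0f} MSE={mse:.0f} RMSE={rmse:.0f} R^2={r2:.3f}")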
3. Clustering Metrics
Used when data is grouped into clusters without labels (e.g., customer segmentation).
a. Silhouette Score
• Definition: Measures how similar a point is to its own cluster compared to other clusters: Silhouette = (b - a) / max(a, b), where a = average intra-cluster distance and b = average nearest-cluster distance. Values near +1 indicate well-separated clusters.
b. Dunn Index
• Definition: The ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter; higher values indicate better-separated, more compact clusters.
c. Davies-Bouldin Index
• Definition: The average similarity between each cluster and its most similar cluster; lower values indicate better clustering.
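Clustering metrics such as the silhouette score and Davies-Bouldin index are available in scikit-learn. A minimal sketch on illustrative 2-D points with hand-assigned cluster labels (assuming scikit-learn is installed):

import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Two well-separated illustrative groups of 2-D points with their cluster labels.
X = np.array([[1, 2], [1, 4], [2, 3], [8, 8], [9, 9], [8, 10]])
labels = np.array([0, 0, 0, 1, 1, 1])

print("silhouette:", silhouette_score(X, labels))          # near +1 => well separated
print("davies-bouldin:", davies_bouldin_score(X, labels))  # lower is better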
4. General Metrics
a. Confusion Matrix
• Definition: Summarizes model performance with counts of true positives, false positives,
true negatives, and false negatives.
c. Cost Function
• Definition: Quantifies the difference between predicted and actual values during training.
d. Model Interpretability
• Definition: How easily humans can understand the reasons behind a model's predictions.
Summary Table
Problem Type     Performance Measures
Classification   Accuracy, Precision, Recall, F1 Score, ROC-AUC, Log Loss
Regression       MAE, MSE, RMSE, R²
Clustering       Silhouette Score, Dunn Index, Davies-Bouldin Index
Choosing the right performance metric is crucial for aligning the model's goals with real-world
application needs.
Let's illustrate how an Artificial Neural Network (ANN) works by predicting house prices based on
features such as the size of the house, number of bedrooms, location, etc.
1. Problem Setup
Goal: Predict the price of a house (ŷ) from four input features such as size, number of bedrooms, location, and so on (X1, X2, X3, X4).
2. Network Architecture
• Input Layer: Accepts the four features (X1, X2, X3, X4).
• Hidden Layer: Contains neurons that process inputs to learn intermediate representations.
• Output Layer: A single neuron that outputs the predicted price ŷ.
3. Training Process
Step 1: Forward Propagation
This is the process of passing input data through the network to compute the output.
1. Input Layer:
o Receives the feature values (X1, X2, X3, X4).
2. Hidden Layer 1:
o Each neuron computes a weighted sum of inputs, adds a bias, and applies an
activation function.
3. Hidden Layer 2:
o Takes outputs from Layer 1 as inputs, repeats the weighted sum and activation steps.
4. Output Layer:
o Combines the outputs from the last hidden layer, computes the final weighted sum, and applies a linear activation function to predict ŷ (the house price).
Step 2: Compute the Loss
The model calculates the error (difference) between the predicted output ŷ and the actual output y using a loss function such as Mean Squared Error (MSE): Loss = (1/N) * Σ (y_i - ŷ_i)².
Step 3: Backpropagation
1. Compute the gradient of the loss with respect to weights and biases using the chain rule.
2. Update weights and biases using a learning rate (η), moving in the opposite direction of the gradient: w_new = w_old - η * ∂Loss/∂w.
Step 4: Iterate
Repeat steps 1–3 for multiple iterations (epochs) until the loss is minimized, ensuring the ANN learns
to predict house prices accurately.
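A minimal NumPy sketch of one training step for a tiny network of the kind described above (one hidden layer with ReLU, a linear output, squared-error loss; the layer sizes, input values, and learning rate are illustrative, not taken from the example below):

import numpy as np

rng = np.random.default_rng(0)

# One illustrative training example: 4 features and the actual price.
x = np.array([2000.0, 3.0, 1.0, 10.0])     # e.g., size, bedrooms, location code, age
y = 340_000.0                              # actual house price

# Tiny network: 4 inputs -> 5 hidden units (ReLU) -> 1 linear output.
W1, b1 = rng.normal(0, 0.01, (5, 4)), np.zeros(5)
W2, b2 = rng.normal(0, 0.01, (1, 5)), np.zeros(1)
lr = 1e-9                                  # tiny learning rate since inputs are not normalized

# Forward propagation.
z1 = W1 @ x + b1
a1 = np.maximum(z1, 0.0)                   # ReLU activation
y_hat = (W2 @ a1 + b2)[0]                  # linear output = predicted price
loss = (y_hat - y) ** 2                    # squared error for this example
print(f"prediction={y_hat:.0f}, squared error={loss:.3e}")

# Backpropagation (chain rule) and one gradient-descent update.
d_yhat = 2 * (y_hat - y)
dW2 = d_yhat * a1[None, :]
db2 = np.array([d_yhat])
da1 = d_yhat * W2[0]
dz1 = da1 * (z1 > 0)
dW1 = np.outer(dz1, x)
db1 = dz1

W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2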
4. Example Walkthrough
Input: a house with, e.g., 3 bedrooms (X2 = 3) and given values for the other features.
Forward Propagation: the inputs are passed through the hidden layers to produce a predicted price ŷ.
Output: the predicted house price ŷ.
Loss: Compare ŷ with the actual house price y (e.g., $340,000) to compute the error.
Backpropagation: the gradients of this error are used to adjust the weights and biases.
5. Application
Over multiple iterations, the ANN refines its weights and biases to minimize the error, ultimately
learning how to predict house prices effectively based on the input features.
K-Means Clustering is an unsupervised machine learning algorithm used for partitioning data into K
clusters based on their similarities. It minimizes the variance within each cluster while maximizing
the variance between clusters.
Steps of the K-Means algorithm:
1. Choose K:
o Decide how many clusters the data should be divided into, based on domain knowledge or by using methods like the Elbow Method.
2. Initialize Centroids:
o Randomly select K data points from the dataset as the initial cluster centroids.
3. Assign Points to Clusters:
o For each data point, compute its distance to every centroid.
▪ Assign the data point to the cluster with the nearest centroid.
4. Update Centroids:
o Recalculate the centroid of each cluster as the mean of all the data points assigned to that cluster: Centroid_k = (1/N_k) * Σ X_i, where N_k is the number of data points in cluster k and X_i are the data points in that cluster.
5. Repeat:
o Repeat steps 3 and 4 until the centroids stop changing (or a maximum number of iterations is reached).
6. Output:
o The final K clusters and their centroids.
Input Data:
Customer ID   Income ($)   Spending Score
1             20,000       30
2             35,000       50
3             70,000       80
Step-by-Step Application:
1. Choose K: Suppose K = 3, indicating we want to group customers into 3 clusters.
2. Initialize Centroids: Randomly pick 3 customers' (income, spending score) values as the initial centroids.
3. Assign Points to Clusters: For each customer, calculate the distance to each centroid and assign the customer to the nearest cluster.
4. Update Centroids: Compute the mean of the points in each cluster to find the new centroids: C_k = mean of all points in cluster k.
5. Iterate: Reassign points and update centroids until the clusters stabilize.
Result:
• Customers are grouped into 3 clusters based on their income and spending patterns.
Visual Representation:
• The algorithm gradually adjusts the centroids and reassigns points until distinct clusters are formed.
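A minimal scikit-learn sketch of the same idea (the customer rows beyond the three shown above are illustrative additions so that three clusters are meaningful; in practice the features would usually be standardized first):

import numpy as np
from sklearn.cluster import KMeans

# (income, spending score) pairs: the three customers above plus illustrative extras.
X = np.array([
    [20_000, 30], [35_000, 50], [70_000, 80],
    [22_000, 35], [33_000, 45], [75_000, 85],
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster labels:", kmeans.labels_)
print("centroids:\n", kmeans.cluster_centers_)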
Advantages of K-Means:
• Simple to understand and implement.
• Computationally efficient and scales well to large datasets.
Limitations:
1. Choosing KKK:
o The number of clusters KKK must be specified beforehand, which may not always be
obvious.
o The Elbow Method is often used to find an optimal KKK.
2. Sensitive to Initialization:
o Different initial centroids can lead to different final clusters; better seeding (e.g., k-means++) mitigates this.
3. Assumes Compact, Spherical Clusters:
o Performs poorly when clusters have irregular shapes or very different sizes and densities.
4. Sensitive to Outliers:
o Outliers can pull centroids away from the true cluster centers, since the centroid is a mean.
Applications of K-Means:
• Customer segmentation, image compression, document clustering, and anomaly detection.
By minimizing intra-cluster variance, K-Means provides a simple yet effective way to group similar
data points.
Discuss the framework to ensure fairness for a Machine Learning model.
Ensuring fairness in machine learning (ML) models involves a systematic framework to identify,
assess, and mitigate bias. The steps below outline a structured approach to evaluate and promote
fairness in ML systems:
1. Define Fairness Goals
• Stakeholder Input: Collaborate with stakeholders to define what fairness means for the application (e.g., equal treatment, equal outcomes, or avoiding disparate impact).
• Context-Specific Fairness: Consider the societal, legal, and ethical implications of fairness for the model's domain (e.g., healthcare, hiring, lending).
2. Analyze the Data for Bias
• Data Collection: Ensure diverse and representative data that captures all relevant groups
fairly.
• Data Analysis: Identify potential biases or imbalances in the dataset (e.g.,
underrepresentation or overrepresentation of certain groups).
• Sensitive Attributes: Recognize attributes like race, gender, or age that might lead to unfair
outcomes.
3. Select Fairness Metrics
• Candidate Metrics: Choose measurable definitions of fairness, for example:
o Demographic Parity: Equal outcomes across groups (e.g., the same acceptance rate for different genders); see the sketch below for how this can be checked.
• Trade-offs: Recognize trade-offs between metrics (e.g., optimizing for one fairness metric may reduce another).
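As a concrete example of checking one such metric, the sketch below computes the demographic-parity difference (the gap in positive-prediction rates between two groups) on illustrative data:

import numpy as np

# Illustrative model predictions (1 = approved) and a sensitive attribute per applicant.
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
group  = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

rate_a = y_pred[group == "A"].mean()   # positive-prediction rate for group A
rate_b = y_pred[group == "B"].mean()   # positive-prediction rate for group B

print(f"group A rate={rate_a:.2f}, group B rate={rate_b:.2f}, "
      f"demographic parity difference={abs(rate_a - rate_b):.2f}")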
4. Mitigate Bias
• In-Processing Techniques:
o Incorporate fairness constraints or penalties into the model's training objective.
• Post-Processing Techniques:
o Adjust predictions to meet fairness criteria (e.g., equalizing outcomes across groups).
5. Validate Fairness
• Model Evaluation: Evaluate the chosen fairness metrics on held-out data for each group.
• Robustness Checks: Verify that fairness holds across data subsets and over time.
6. Monitor and Iterate
• Feedback Loop: Incorporate feedback from affected users or groups to improve fairness iteratively.
7. Communicate Findings
• Ethical Oversight: Engage with ethics committees or independent auditors when necessary.
Challenges to Address
• Fairness definitions and metrics can conflict with one another (and with accuracy), so the trade-offs made must be documented and revisited.
This framework provides a comprehensive roadmap to address fairness concerns in ML, emphasizing
iterative improvement, stakeholder involvement, and transparency.
Counterfactual explanations (CFEs) are increasingly popular in explaining machine learning (ML)
models because they focus on "what-if" scenarios to show how changes in inputs could alter model
outcomes. Here's a discussion of their advantages and disadvantages:
Advantages of Counterfactual Explanations
1. Human Interpretability
• CFEs offer intuitive insights by showing how specific input changes lead to desired outcomes.
• Example: For a loan application rejection, a CFE might suggest, "If your income were $5,000
higher, the loan would have been approved."
2. Actionability
• CFEs provide users with actionable feedback on what they could change to achieve a
different outcome, making them especially useful in domains like credit scoring or hiring.
3. Model-Agnostic
• CFEs do not depend on the internal structure of the model. They work for any ML model,
whether a simple linear regression or a complex neural network.
4. Fairness Assessment
• CFEs can help identify biases by examining how sensitive a model’s decisions are to specific
attributes (e.g., gender or race).
5. Plausibility Constraints
• Constraints can be added to ensure explanations remain realistic (e.g., ensuring "years of education" doesn't decrease in the counterfactual for a working adult).
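To make the idea concrete, the sketch below finds a simple counterfactual for an illustrative linear (logistic-style) loan model: the smallest income increase that flips a rejection into an approval. For a linear model this has a closed form; real CFE methods solve a constrained optimization. The weights and applicant values are made up, chosen so the result matches the "$5,000 higher income" example above:

import numpy as np

# Illustrative linear loan model: score = w . x + b, approve if score >= 0.
w = np.array([0.00004, 0.8])        # weights for (income, credit_history_score)
b = -3.0
x = np.array([40_000.0, 1.5])       # applicant: income $40,000, credit score 1.5

score = w @ x + b                   # -0.2 => rejected
# Counterfactual on income only: need w_income * delta >= -score.
delta_income = -score / w[0]        # $5,000 under these made-up weights
print(f"score={score:.2f}; approve if income rises by about ${delta_income:,.0f}")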
Disadvantages of Counterfactual Explanations
1. Feasibility Issues
• Suggested changes may be unrealistic or impossible for the individual to act on, which limits their practical value.
2. Lack of Causality
• CFEs operate on correlations and do not guarantee that changing the suggested features will
actually result in the predicted outcome due to causal complexity.
3. Ethical and Legal Risks
• If not carefully designed, CFEs might inadvertently suggest changes to sensitive attributes (e.g., "change your gender"), leading to ethical concerns and legal violations.
4. Computational Complexity
• Generating CFEs can be computationally expensive, particularly for complex models with
large feature spaces, as they require optimization or iterative sampling.
5. Ambiguity
• Multiple counterfactuals can exist for the same instance. Determining which one to present
might introduce biases or confusion.
6. Model Vulnerability and Limited Actionability
• Detailed counterfactuals can reveal information about the model's decision boundary, making it easier to game or reverse-engineer.
• CFEs might also focus on features the individual cannot change (e.g., age or past job history), making explanations less useful.
Use Cases and Balancing the Trade-offs
• Best for: Domains where users benefit from actionable feedback, such as financial services,
education, or healthcare.
• Challenging in: High-stakes or regulated environments, where the explanation must be both
realistic and legally compliant.
CFEs are a powerful tool when carefully implemented, but they require balancing interpretability,
realism, fairness, and practicality to be effective and ethical.