Introduction to Data mining
What is data mining? Examples.
Data mining is the process of searching and analyzing a large batch of raw data in order to
identify patterns and extract useful information.
Data mining is used to explore large data volumes to find patterns and insights that can be used
for specific purposes. These purposes might include improving sales and marketing, optimizing
manufacturing, detecting fraud, and enhancing security.
Why data mining?
Data mining is important because it helps organizations and individuals extract useful insights
from large amounts of data.
Enhances Business Strategies
Companies use it for customer segmentation and personalized marketing.
Helps in inventory management, pricing strategies, and customer retention.
Improves Decision-Making
Businesses can make data-driven decisions instead of relying on intuition.
Governments and healthcare industries use it to improve public services.
Medical and Scientific Discoveries
Used in disease prediction, drug discovery, and patient care improvements.
Helps in genome research and bioinformatics.
Fraud Detection and Risk Management
Banks and financial institutions use data mining to detect unusual transactions.
Helps in identifying cybersecurity threats.
Discover Patterns and Trends
Helps identify hidden patterns and correlations in data.
Helps predict customer behavior, fraudulent activity, and market trends.
Automation and AI Development
Supports machine learning models by providing meaningful data patterns.
Used in recommendation systems like Netflix and Amazon.
Competitive Advantage
Helps businesses stay ahead of competitors by understanding market trends.
Assists in predicting future sales and customer needs.
What is (not) Data Mining?
What is not Data Mining?
-Look up phone number in phone directory
-Query a Web search engine for information about “Amazon”
What is Data Mining?
-Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in
Boston area)
-Group together similar documents returned by search engine according to their context (e.g.
Amazon rainforest, Amazon.com)
-SPL ->DSA-I
Origins of Data Mining:
-Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
-Traditional Techniques may be unsuitable due to
-Enormity (extreme scale) of data
-High dimensionality (large number of features or attributes) of data
-Heterogeneous, distributed nature of data
Related technologies:
Machine learning: Machine learning (ML) is a type of artificial intelligence (AI) that
allows computers to learn and improve from data without being explicitly programmed. It
uses algorithms to analyze data, identify patterns, and make decisions.
OLAP: Online analytical processing (OLAP) is a technology that analyzes large amounts
of data quickly. It's used in business intelligence, decision support, and reporting.
Statistics: Statistics is a branch of applied mathematics that involves the collection,
description, analysis, and inference of conclusions from quantitative data.
DBMS: A database management system (DBMS) is a software system for creating
and managing databases. A DBMS enables end users to create, protect, read, update,
and delete data in a database. It also manages security, data integrity, and concurrency for
databases.
Data Mining Tasks
Prediction Methods- Use some variables to predict unknown or future values of other variables
Description Methods- Find human-interpretable patterns that describe the data
Classification [Predictive]
-Classification involves assigning data into predefined categories based on specific attributes.
For example, using algorithms trained on labeled data, emails can be classified as 'spam' or
'not spam'.
-By contrast, clustering groups data by similarity without predefined labels (see Clustering [Descriptive]).
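The spam example can be sketched in a few lines of Python. This is a minimal illustration only, assuming a tiny hand-made training set and a simple Naive Bayes word-count model with Laplace smoothing; the data, the function names `train` and `classify`, and the choice of Naive Bayes are assumptions made for this sketch, not something the notes prescribe.

```python
import math
from collections import Counter

# Hypothetical labeled training data (made up for illustration).
TRAIN = [
    ("win money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("lunch on monday", "not spam"),
]

def train(examples):
    """Count words per label and messages per label."""
    word_counts = {}          # label -> Counter of word frequencies
    label_counts = Counter()  # label -> number of training messages
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # prior probability of the label
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            # Laplace smoothing: unseen words get a small non-zero probability.
            score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
print(classify("claim free money", model))
print(classify("monday meeting", model))
```

The key point is that the categories ('spam', 'not spam') are fixed in advance and learned from labeled examples, which is what separates classification from clustering.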
Regression [Predictive]- Predict a value of a given continuous valued variable based on the
values of other variables
Examples:
– Predicting sales amounts of new product based on advertising expenditure.
– Predicting wind velocities as a function of temperature, humidity, air pressure,
etc.
– Time series prediction of stock market indices
Linear -> e.g. age and height
Nonlinear -> e.g. population growth over time
Logistic -> e.g. presence of diabetes (yes/no) based on health factors
Deviation Detection [Predictive]
-Detects abnormal occurrences that deviate from normal behavior (e.g. credit card fraud, network
intrusion detection)
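One simple way to flag such abnormal values is a z-score test. The sketch below assumes made-up transaction amounts and an arbitrary threshold of 2 standard deviations; the name `find_outliers` and both of those choices are illustrative assumptions, and real fraud or intrusion detection systems use far richer models.

```python
import statistics

# Hypothetical card transaction amounts (made up for illustration).
amounts = [25, 30, 27, 22, 31, 28, 26, 950]

def find_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]

print(find_outliers(amounts))
```

With a very small sample, a single extreme value inflates the standard deviation itself, so the threshold has to be chosen with the sample size in mind.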
Clustering [Descriptive]
Given a set of data points, each having a set of attributes, and a similarity measure among
them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
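A minimal sketch of this idea is k-means clustering, shown here in plain Python. The notes do not name a specific algorithm, so k-means, squared Euclidean distance as the similarity measure, the naive "first k points" initialisation, and the sample points are all assumptions made for illustration.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, iterations=10):
    centroids = list(points[:k])  # naive initialisation: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            mean_point(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D data with two visually obvious groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```

No labels are supplied anywhere; the groups emerge only from the distances between points.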
Association Rule Discovery [Descriptive]
Produce rules that predict the occurrence of an item based on the occurrences of other items.
- Customers who buy bread are likely to also buy milk.
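The strength of a rule such as bread -> milk is usually measured with support and confidence. The sketch below assumes a made-up list of market baskets; the helper names `support` and `confidence` are chosen for this example.

```python
# Hypothetical market baskets (made up for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print("support(bread, milk):", support({"bread", "milk"}, transactions))
print("confidence(bread -> milk):",
      round(confidence({"bread"}, {"milk"}, transactions), 2))
```

Here bread and milk appear together in 3 of 5 baskets, and 3 of the 4 baskets with bread also contain milk.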
Sequential Pattern Discovery [Descriptive]
Find rules that predict strong sequential dependencies among different events. Event
occurrences in the patterns are governed by timing constraints.
– People who buy DVD players tend to buy DVDs in the period immediately following the
purchase
– Candy sales peak before Halloween (timing constraints)
– Athletic Apparel Store:
(Shoes) (Racket, Racketball) --> (Sports_Jacket)
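The timing constraint can be made concrete with a small sketch: for customers who bought one item, count how many bought a second item within a fixed number of days afterwards. The purchase histories, the 7-day window, and the function names are assumptions made for illustration only.

```python
# Hypothetical purchase histories: customer -> list of (day, item).
histories = {
    "c1": [(1, "DVD player"), (3, "DVD")],
    "c2": [(2, "DVD player"), (40, "DVD")],
    "c3": [(5, "DVD player"), (6, "DVD")],
    "c4": [(1, "DVD")],
}

def followed_within(history, first, then, max_gap):
    """True if `then` was bought after `first`, at most max_gap days later."""
    for day_a, item_a in history:
        if item_a != first:
            continue
        for day_b, item_b in history:
            if item_b == then and 0 < day_b - day_a <= max_gap:
                return True
    return False

def sequence_confidence(histories, first, then, max_gap):
    """Fraction of customers who bought `first` and then bought `then`
    within the allowed time window."""
    bought_first = [h for h in histories.values()
                    if any(item == first for _, item in h)]
    if not bought_first:
        return 0.0
    hits = sum(followed_within(h, first, then, max_gap) for h in bought_first)
    return hits / len(bought_first)

conf = sequence_confidence(histories, "DVD player", "DVD", max_gap=7)
print(round(conf, 2))
```

Customer c2 does buy a DVD eventually, but 38 days later, so the purchase falls outside the window and does not support the pattern.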
What is Machine Learning?
Machine learning (ML) is a type of artificial intelligence (AI) that allows machines to learn and improve
from experience.
Types of relationship
-Nonlinear relationship
-Linear relationship
Math for the coefficients and the least squares equation (experience vs. salary)
Straight-line regression using the method of least squares. Table 1 shows a set of paired data where x is
the number of years of work experience of a college graduate and y is the corresponding salary of the graduate.
Estimate the equation of the least squares line. Also, calculate the salary y for 10 years of experience.
Solution:
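Mean value of x, \(\bar{x} = 9.1\), and mean value of y, \(\bar{y} = 55.4\).
We know that the slope of the least squares line is
\[ w_1 = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2} \approx 3.5 \]
Also we know that the intercept is
\[ w_0 = \bar{y} - w_1\bar{x} = 55.4 - (3.5)(9.1) = 23.55 \approx 23.6 \]
Now, \( y = w_0 + w_1 x = 23.6 + 3.5x \). For x = 10, y = 23.6 + 3.5(10) = 58.6, in thousands of dollars, i.e. $58,600.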
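The same calculation can be sketched in Python. Table 1 is not reproduced in these notes, so the data below is an assumption: ten (experience, salary) pairs chosen to be consistent with the stated means of 9.1 and 55.4. The function name `least_squares` is also just illustrative. The slope is the sum of (x - mean x)(y - mean y) divided by the sum of (x - mean x) squared, and the intercept is mean y minus slope times mean x.

```python
# ASSUMED data: years of experience and salary in $1000s. These values are
# consistent with the means given in the notes (9.1 and 55.4), but the
# original Table 1 is not shown here.
x = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
y = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

def least_squares(x, y):
    """Return (w0, w1) for the line y = w0 + w1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    numerator = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    denominator = sum((xi - x_mean) ** 2 for xi in x)
    w1 = numerator / denominator   # slope
    w0 = y_mean - w1 * x_mean      # intercept
    return w0, w1

w0, w1 = least_squares(x, y)
prediction = w0 + w1 * 10
print(f"w0 = {w0:.2f}, w1 = {w1:.2f}")
print(f"Predicted salary for 10 years: about ${prediction * 1000:,.0f}")
```

Under this assumed data the unrounded coefficients differ slightly from 23.6 and 3.5, because the hand calculation rounds the slope to 3.5 before computing the intercept; the prediction for 10 years still comes out at roughly 58.6 thousand dollars.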
ML vs Data mining
Data mining is used on an existing dataset (like a data warehouse) to find patterns. Machine learning, on
the other hand, is trained on a 'training' data set, which teaches the computer how to make sense of
data, and then to make predictions about new data sets.
Data Mining | Machine Learning
Extracting useful information from large amount of data | Introduce algorithm from data as well as from past experience
Used to understand the data flow | Teaches the computer to learn and understand from the data flow
Huge databases with unstructured data | Existing data as well as algorithms
Data mining abstract from the data warehouse | Machine learning reads machine
Clustering, association rule mining, outlier detection | Regression, classification, clustering, deep learning
Strong domain knowledge is often required | Domain knowledge is helpful, but not always necessary
Stages of the Data mining Process:
Four main stages:
1. Data Pre-processing: Data preprocessing is the process of preparing data for machine learning and other data
analysis. It involves:
-Data cleaning: Fixing errors, removing duplicates, and handling missing values
-Data transformation: Scaling, normalizing, and encoding categorical variables
-Data reduction: Selecting relevant features and reducing dimensionality
-Data integration: Combining data from multiple sources
-Data formatting: Ensuring consistent data types and structures
-Data validation: Checking for errors, ensuring data consistency, and verifying that transformations were applied correctly
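A few of these preprocessing steps (cleaning and transformation) can be sketched in plain Python. The records, field names, and helper functions below are hypothetical and chosen only for illustration; filling missing values with the mean and min-max scaling are just one possible choice for each step.

```python
# Hypothetical raw records (made up for illustration).
records = [
    {"age": 25, "city": "Boston"},
    {"age": None, "city": "Austin"},   # missing value
    {"age": 35, "city": "Boston"},
    {"age": 25, "city": "Boston"},     # exact duplicate of the first row
]

def remove_duplicates(rows):
    """Data cleaning: keep only the first occurrence of each row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))
    return unique

def fill_missing(rows, field):
    """Data cleaning: replace missing values with the mean of known values."""
    known = [r[field] for r in rows if r[field] is not None]
    mean = sum(known) / len(known)
    return [{**r, field: mean if r[field] is None else r[field]} for r in rows]

def min_max_scale(rows, field):
    """Data transformation: rescale a numeric field to the range [0, 1]."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero if all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in rows]

def one_hot(rows, field):
    """Data transformation: encode a categorical field as 0/1 columns."""
    categories = sorted({r[field] for r in rows})
    encoded_rows = []
    for r in rows:
        encoded = {k: v for k, v in r.items() if k != field}
        for c in categories:
            encoded[f"{field}={c}"] = 1 if r[field] == c else 0
        encoded_rows.append(encoded)
    return encoded_rows

clean = remove_duplicates(records)
clean = fill_missing(clean, "age")
clean = min_max_scale(clean, "age")
clean = one_hot(clean, "city")
for row in clean:
    print(row)
```

The order matters here: duplicates are removed before the mean is computed so that the repeated row does not bias the value used to fill the gap.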
2. Exploratory Data Analysis: Exploratory Data Analysis (EDA) is an analysis approach that identifies
general patterns in the data. These patterns include outliers and features of the data that might be unexpected. EDA
is an important first step in any data analysis.
3. Data Selection: Data selection is the process of choosing the right data type, source, and collection
instruments for a project. It's a crucial step before data collection.
Purpose:
-To ensure that the data is relevant, accurate, and aligned with the project's goals
-To answer research questions
-To train or evaluate a machine learning model
4. Knowledge Discovery: Knowledge discovery is the process of extracting useful knowledge from data
Data Mining Techniques:
Classification
Clustering
Regression
Association rules
Sequential patterns
Artificial Neural Network
Outlier detection
Prediction
Genetic Algorithm
Prediction is a data mining technique that involves using models to predict future outcomes. This
is called predictive data mining.
Genetic algorithms (GAs) are a data mining technique that can be used to classify data and solve
optimization problems. GAs are based on natural evolution and can be used to find optimal
solutions.
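As a rough illustration of the evolutionary idea, the sketch below runs a genetic algorithm on a toy optimization problem: maximizing the number of 1s in a bit string. The problem, the parameter values, tournament selection, single-point crossover, and the name `evolve` are all assumptions chosen for this example; they are one common way to set up a GA, not the only one.

```python
import random

def fitness(bits):
    """Toy objective: the number of 1s in the bit string."""
    return sum(bits)

def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)  # seeded so that runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_gen = [population[0][:]]  # elitism: keep the best individual
        while len(next_gen) < pop_size:
            # Selection: the fitter of three random individuals becomes a parent.
            parent1 = max(rng.sample(population, 3), key=fitness)
            parent2 = max(rng.sample(population, 3), key=fitness)
            # Crossover: splice the parents at a random cut point.
            cut = rng.randrange(1, n_bits)
            child = parent1[:cut] + parent2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = evolve()
print("best fitness per generation:", history)
```

Because the best individual is always copied unchanged into the next generation, the best fitness can never decrease from one generation to the next.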