School of Computing Science and Engineering
Course Code : Course Name: Data mining and web Algo
Unit – 1
Data Mining
Faculty Name: Mr. Soumalya Ghosh Program Name: B.Tech CSE
What is Data Mining?
• Data mining is the process of extracting knowledge or insights from large
amounts of data using various statistical and computational techniques.
• The primary goal of data mining is to discover hidden patterns and
relationships in the data that can be used to make informed decisions or
predictions.
• This involves exploring the data using various techniques such as
– Clustering
– Classification
– Regression analysis
– Association rule mining
– Anomaly detection
Data Mining: Applications
• Data mining has a wide range of applications across various industries,
including marketing, finance, healthcare, and telecommunications.
• For example,
– in marketing, data mining can be used to identify customer segments and
target marketing campaigns;
– in healthcare, it can be used to identify risk factors for diseases and
develop personalized treatment plans.
Evolution of Database Technology
• The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining emerged as the automated
analysis of massive data sets
Why is it called Data Mining?
• Simply stated, data mining refers to extracting or “mining” knowledge from
large amounts of data.
• The term is actually a misnomer.
– Remember that the mining of gold from rocks or sand is referred to as gold
mining rather than rock or sand mining.
– Thus, data mining should have been more appropriately named “knowledge
mining from data,” which is unfortunately somewhat long.
– “Knowledge mining,” a shorter term, may not reflect the emphasis on mining
from large amounts of data.
• Thus, such a misnomer that carries both “data” and “mining” became a
popular choice.
• Many other terms carry a similar or slightly different meaning to data
mining, such as
– knowledge mining from data,
– knowledge extraction,
– data/pattern analysis,
– data archaeology
– data dredging
• Many people treat data mining as a synonym for another popularly used
term, Knowledge Discovery from Data (KDD).
• Others view data mining as simply an essential step in the process of
knowledge discovery.
Data mining as a step in the process of knowledge discovery
• 1. Data cleaning (to remove noise and inconsistent data)
• 2. Data integration (where multiple data sources may be combined)
• 3. Data selection (where data relevant to the analysis task are retrieved from the database)
• 4. Data transformation (where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations, for instance)
• 5. Data mining (an essential process where intelligent methods are applied in order to
extract data patterns)
• 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on some interestingness measures)
• 7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present the mined knowledge to the user)
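The seven steps above can be sketched as a small pipeline. This is only an illustrative sketch, not a real mining system: the records, the function names, and the 0.7 support threshold are all assumptions made up for the example, and the "mining" step is reduced to counting how many customers bought each item.

```python
from collections import Counter

# Hypothetical raw records from two sources (made-up illustrative data).
store_a = [
    {"customer": "c1", "item": "bread", "qty": 2},
    {"customer": "c2", "item": "milk", "qty": None},  # incomplete record
    {"customer": "c1", "item": "milk", "qty": 1},
]
store_b = [
    {"customer": "c3", "item": "bread", "qty": 1},
    {"customer": "c3", "item": "milk", "qty": 3},
    {"customer": "c4", "item": "bread", "qty": 5},
    {"customer": "c4", "item": "pen", "qty": 1},
]

def clean(records):
    """Step 1: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine several data sources into one list."""
    return [r for source in sources for r in source]

def select(records, items):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """Step 4: aggregate records into one item set (basket) per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Step 5: count the number of baskets that contain each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, n_baskets, min_support):
    """Step 6: keep only items whose support reaches the threshold."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}

def present(patterns):
    """Step 7: format the surviving patterns for the user."""
    return [f"{item}: support {s:.2f}" for item, s in sorted(patterns.items())]

def run_kdd(min_support=0.7):
    data = integrate(clean(store_a), clean(store_b))
    data = select(data, {"bread", "milk"})
    baskets = transform(data)
    patterns = evaluate(mine(baskets), len(baskets), min_support)
    return present(patterns)

print(run_kdd())
```

With this toy data, three customers remain after cleaning and selection; bread appears in all three baskets and milk in two, so only bread passes the assumed 0.7 threshold.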
Knowledge Discovery (KDD) Process
• Data mining is the core of the knowledge discovery process.
[Figure: KDD process flow: Databases → Data Cleaning and Data Integration →
Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Difference between KDD and Data Mining
• Although the terms KDD and Data Mining are often used interchangeably,
they refer to two related yet slightly different concepts.
• KDD is the overall process of extracting knowledge from data, while Data Mining
is a step inside the KDD process, which deals with identifying patterns in data.
• In other words, Data Mining is the application of specific algorithms, chosen
according to the overall goal of the KDD process.
• KDD is an iterative process where evaluation measures can be enhanced, mining
can be refined, and new data can be integrated and transformed to get different
and more appropriate results.
Data Mining and Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions
increases from bottom to top. Layers from top to bottom, with the typical role:]
• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
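Architecture: Typical Data Mining System
[Figure: layered architecture, from top to bottom:]
• Graphical User Interface
• Pattern Evaluation
• Data Mining Engine
• Database or Data Warehouse Server
• Data cleaning, integration, and selection
• Data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories
• A Knowledge Base sits alongside the Pattern Evaluation and Data Mining Engine layers.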
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on the surrounding disciplines:]
• Database Technology
• Statistics
• Machine Learning
• Visualization
• Pattern Recognition
• Algorithm
• Other Disciplines
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structured data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia databases
– Text databases
– The World-Wide Web
Data Mining Functionalities
• Multidimensional concept description: Characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Frequent patterns, association, correlation vs. causality
– Diaper → Beer [support 0.5%, confidence 75%] (Correlation or causality?)
• Classification and prediction
– Construct models (functions) that describe and distinguish classes or concepts for
future prediction
• E.g., classify countries based on (climate), or classify cars based on (gas mileage)
– Predict some unknown or missing numerical values
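The bracketed pair attached to an association rule such as Diaper → Beer is conventionally read as [support, confidence]. The sketch below shows how the two measures are computed. The transactions are made up for illustration and the function names are hypothetical; the toy data is chosen so that the confidence comes out at 75%, but its support is far higher than the 0.5% quoted on the slide.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    if n_antecedent == 0:
        return 0.0  # rule never applies in this data
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_antecedent

# Made-up market baskets (illustrative only).
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]

print(support(transactions, {"diaper", "beer"}))       # 3 of 5 baskets
print(confidence(transactions, {"diaper"}, {"beer"}))  # 3 of 4 diaper baskets
```

A high confidence only says that the two items tend to occur together; it does not by itself establish that buying one causes the purchase of the other, which is the question the slide raises.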
• Cluster analysis
– Class label is unknown: Group data to form new classes, e.g., cluster houses to find
distribution patterns
– Maximize intra-class similarity and minimize inter-class similarity
• Outlier analysis
– Outlier: Data object that does not comply with the general behavior of the data
– Noise or exception? Useful in fraud detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: e.g., regression analysis
– Sequential pattern mining: e.g., digital camera → large SD memory
– Periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
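As a small illustration of outlier analysis, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score test). This is just one simple technique chosen for the example, not the method prescribed by the slides; the function name, the threshold of 2.0, and the sample amounts are assumptions.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Made-up transaction amounts; 95 does not follow the general behavior.
amounts = [10, 12, 11, 13, 12, 11, 95]
print(z_score_outliers(amounts))
```

Whether a flagged value is noise to discard or an exception worth investigating (for example, possible fraud) is a judgment that depends on the application.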
Data Mining - Issues
• Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing knowledge: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy
Data Mining Applications