Data Warehousing and Data
Mining
                 Unit 3: Introduction to Data Mining
                Introduction and Need for Data Mining
                Introduction to Data Mining
                Data mining is the process of discovering patterns, correlations, and useful
                information from large datasets using techniques such as machine learning,
                statistical analysis, and database systems. It is an essential component of
                knowledge discovery in databases (KDD), where raw data is transformed
                into meaningful insights that help in decision-making.
                With the rapid growth of data in various domains such as healthcare,
                finance, business, and social media, traditional data analysis techniques are
                no longer sufficient to extract valuable information. Data mining automates
                the process of analyzing large datasets, identifying hidden patterns, and
                predicting future trends, enabling organizations to make data-driven
                decisions.
                Definition of Data Mining
                Data mining can be defined as:
                "The process of extracting useful, valid, and previously unknown patterns
                or knowledge from large amounts of data stored in databases, data
                warehouses, or other information repositories."
                Characteristics of Data Mining:
                 1. Automatic Processing: Data mining tools use machine learning
                    algorithms to process data automatically.
                 2. Pattern Discovery: It finds hidden patterns and relationships in large
                    datasets.
                 3. Prediction and Classification: It helps in predicting future outcomes
                     based on historical data.
                 4. Large-Scale Data Handling: It efficiently processes massive amounts of
                    structured and unstructured data.
                 5. Decision Support: The insights gained from data mining help in making
                     strategic business decisions.
                Need for Data Mining
                The increasing volume and complexity of data generated in various fields
                have made data mining an essential tool for extracting useful knowledge.
                Below are the key reasons why data mining is needed:
                1. Handling Large Volumes of Data
                     With the rise of big data, IoT, and digital transformation, organizations
                     generate vast amounts of data every second.
                     Traditional data analysis methods cannot efficiently process such large
                     datasets.
                     Data mining provides automated techniques to extract useful patterns
                     from massive data repositories.
                Example:
                E-commerce companies like Amazon analyze millions of customer
                transactions daily to identify purchasing patterns and recommend products.
                2. Extracting Hidden Patterns and Relationships
                     Raw data often contains hidden correlations that are not immediately
                     visible through traditional analysis.
                     Data mining helps discover non-obvious relationships between
                     different attributes.
                Example:
                In the healthcare industry, data mining is used to identify risk factors for
                diseases by analyzing patient history, genetic data, and environmental
                conditions.
                3. Enhancing Decision-Making and Business Intelligence
                     Organizations use data mining to make data-driven decisions rather
                     than relying on intuition.
                     It provides valuable insights that help businesses improve strategies,
                     optimize operations, and enhance customer experience.
                Example:
                Banks use data mining to analyze customer transactions and detect
                fraudulent activities in real time.
                4. Improving Customer Relationship Management (CRM)
                     Companies use data mining to understand customer preferences,
                     buying behavior, and feedback.
                     It enables businesses to personalize services, improve customer
                     satisfaction, and enhance loyalty programs.
                Example:
                Netflix uses data mining to analyze user viewing patterns and recommend
                personalized content to its subscribers.
                5. Fraud Detection and Security
                     Financial institutions use data mining techniques to detect credit card
                     fraud, money laundering, and cyber threats.
                     Anomalies in transaction data can indicate suspicious activities,
                     allowing quick intervention.
                Example:
                Banks use machine learning models to flag unusual transactions,
                preventing unauthorized access to accounts.
                6. Predictive Analysis for Future Trends
                     Data mining helps organizations forecast trends and market behavior
                     based on historical data.
                     It enables companies to anticipate customer needs, manage inventory,
                     and plan for future demand.
                Example:
                Retailers use data mining to predict seasonal sales trends and adjust stock
                levels accordingly.
                7. Cost Reduction and Efficiency Optimization
                     Businesses use data mining to identify inefficiencies, optimize supply
                     chains, and reduce operational costs.
                     It helps in resource allocation, workforce management, and minimizing
                     wastage.
                Example:
                Manufacturing companies use data mining to analyze machine
                performance and predict equipment failures, reducing maintenance costs.
                8. Competitive Advantage in the Market
                     Companies that effectively utilize data mining gain a competitive edge
                     over others.
                     It enables organizations to stay ahead of market trends and make better
                     strategic decisions.
                Example:
                Social media platforms like Facebook and Instagram use data mining to
                analyze user behavior and optimize ad placements, generating more
                revenue.
                9. Personalized Marketing and Recommendation Systems
                     Data mining helps in targeted advertising by analyzing customer
                     preferences and online behavior.
                     Businesses can segment customers based on demographics, interests,
                     and purchase history.
                Example:
                Google Ads uses data mining to display personalized advertisements
                based on users' search history and browsing behavior.
                10. Enhancing Scientific Research and Healthcare
                     Scientists use data mining to analyze large datasets in genetics, climate
                     change, and medical research.
                     Healthcare providers use it for disease diagnosis, treatment
                     recommendations, and drug discovery.
                Example:
                Pharmaceutical companies use data mining to analyze clinical trial results,
                speeding up drug development.
                Knowledge Discovery in Databases (KDD) Process
                Introduction to KDD
                Knowledge Discovery in Databases (KDD) is a systematic process of
                extracting useful, valid, and previously unknown patterns or knowledge from
                large datasets. It involves multiple stages, starting from raw data collection
                to meaningful insight generation, enabling informed decision-making in
                various fields such as business, healthcare, finance, and scientific research.
                KDD is not just about data mining; it is a broader process that includes data
                preprocessing, transformation, and interpretation of results.
                Steps in the KDD Process
                The KDD process consists of the following major steps:
                 1. Data Selection
                 2. Data Preprocessing (Cleaning and Integration)
                 3. Data Transformation
                 4. Data Mining
                 5. Pattern Evaluation and Knowledge Representation
                Each of these steps plays a crucial role in ensuring that the final extracted
                knowledge is accurate, relevant, and useful.
                1. Data Selection
                This is the first step, where relevant data is chosen from various sources
                such as databases, data warehouses, web data, or sensor logs.
                Objectives:
                     Identify the most relevant attributes (features) required for analysis.
                     Remove unnecessary or redundant data.
                     Extract data from different sources such as transactional databases,
                     logs, spreadsheets, or cloud storage.
                Example:
                In a retail business, sales records from the last five years might be selected
                from a database for customer purchasing behavior analysis.
                2. Data Preprocessing (Cleaning and Integration)
                Raw data is often incomplete, noisy, or inconsistent. This step ensures data
                quality by handling missing values, removing errors, and integrating multiple
                datasets.
                Key Tasks:
                     Data Cleaning: Handling missing values, removing duplicate records,
                     and correcting errors.
                     Data Integration: Combining data from multiple sources to form a single,
                     consistent dataset.
                Example:
                     Filling missing age values in a customer database using the average of
                     available values.
                     Merging customer transaction data from multiple branches into a central
                     database.
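                The two example tasks above can be sketched in a few lines of Python. This
                is a minimal illustration, not a production cleaning pipeline: the record
                layout, the cust_id and age field names, and the helper names integrate
                and fill_missing_age are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical customer records from two branches; None marks a missing age.
branch_a = [
    {"cust_id": 1, "age": 34},
    {"cust_id": 2, "age": None},
    {"cust_id": 3, "age": 46},
]
branch_b = [
    {"cust_id": 3, "age": 46},  # duplicate of a record already in branch_a
    {"cust_id": 4, "age": 28},
]

def integrate(*sources):
    """Merge records from several sources, keeping one record per cust_id."""
    merged = {}
    for source in sources:
        for record in source:
            # setdefault keeps the first record seen for each customer
            merged.setdefault(record["cust_id"], dict(record))
    return list(merged.values())

def fill_missing_age(records):
    """Replace missing ages with the mean of the ages that are present."""
    known = [r["age"] for r in records if r["age"] is not None]
    average = mean(known)
    return [{**r, "age": average if r["age"] is None else r["age"]} for r in records]

customers = fill_missing_age(integrate(branch_a, branch_b))
```

                Mean imputation is only one option; the median or a model-based estimate
                may be preferable when the attribute is skewed.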
                3. Data Transformation
                Once the data is cleaned and integrated, it is transformed into a suitable
                format for analysis. This step involves normalization, aggregation, and
                feature selection.
                Techniques Used:
                     Normalization: Scaling numerical values to a common range (e.g.,
                     between 0 and 1).
                     Aggregation: Summarizing data at different levels (e.g., daily sales →
                     monthly sales).
                     Feature Selection: Choosing only the most relevant attributes for
                     analysis.
                Example:
                     Converting salary figures recorded in different currencies into a
                     single unit (e.g., converting rupees into dollars) so that they are
                     comparable.
                     Aggregating daily product sales data into monthly sales reports.
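                As a rough sketch of two of the techniques above, the following Python
                code applies min-max normalization and rolls daily sales up to monthly
                totals. The sample figures, the ISO date-string format, and the function
                names min_max_normalize and monthly_totals are assumptions made for the
                example.

```python
from collections import defaultdict

def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: there is no spread to scale, so map them to 0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

def monthly_totals(daily_sales):
    """Aggregate (ISO date string, amount) pairs into totals per 'YYYY-MM'."""
    totals = defaultdict(float)
    for day, amount in daily_sales:
        totals[day[:7]] += amount  # the first 7 characters are 'YYYY-MM'
    return dict(totals)

salaries = [30000, 45000, 60000]
normalized = min_max_normalize(salaries)

daily_sales = [("2024-01-30", 120.0), ("2024-01-31", 80.0), ("2024-02-01", 50.0)]
monthly = monthly_totals(daily_sales)
```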
                4. Data Mining
                This is the core step of the KDD process, where data mining algorithms are
                applied to extract patterns, trends, and insights.
                Common Data Mining Techniques:
                     Classification: Assigning labels to data (e.g., spam vs. non-spam
                     emails).
                     Clustering: Grouping similar data points (e.g., customer segmentation).
                     Association Rule Mining: Finding relationships between variables (e.g.,
                     "Customers who buy bread often buy butter").
                     Anomaly Detection: Identifying unusual patterns (e.g., fraud detection
                     in credit card transactions).
                Example:
                     A bank uses classification to predict whether a loan applicant is likely to
                     default.
                     A supermarket uses association rules to discover that customers
                     buying milk also tend to buy cereal.
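                To make one of the techniques above concrete, here is a minimal sketch of
                anomaly detection with a z-score rule. It is one simple approach among
                many and not a description of how any bank detects fraud; the transaction
                amounts, the threshold of 2.0, and the function name flag_anomalies are
                assumptions made for the example.

```python
from statistics import mean, pstdev

def flag_anomalies(amounts, threshold=2.0):
    """Return the amounts whose z-score magnitude exceeds the threshold."""
    mu = mean(amounts)
    sigma = pstdev(amounts)  # population standard deviation
    if sigma == 0:
        return []  # no variation, so nothing stands out
    return [a for a in amounts if abs(a - mu) / sigma > threshold]

# Hypothetical card transactions: six ordinary amounts and one very large one.
transactions = [20, 22, 19, 21, 23, 20, 500]
suspicious = flag_anomalies(transactions)
```

                Because an extreme value inflates both the mean and the standard
                deviation, a z-score rule can miss anomalies in small samples; measures
                based on the median are a common, more robust alternative.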
                5. Pattern Evaluation and Knowledge Representation
                In this final step, the discovered patterns are evaluated for usefulness and
                interpreted into meaningful knowledge. Only significant and valid patterns
                are retained for decision-making.
                Key Aspects:
                     Filtering out patterns that are statistically insignificant or irrelevant.
                     Visualizing results using graphs, charts, or dashboards.
                     Converting patterns into business strategies or actionable insights.
                Example:
                     A healthcare provider analyzes mined data to identify key factors
                     leading to heart disease and takes preventive actions.
                     An e-commerce website personalizes product recommendations based
                     on customer behavior analysis.
                Data Mining Architecture
                Basic Working:
                 1. The process starts when the user submits data mining requests, which
                    are sent to the data mining engine for pattern evaluation.
                 2. The engine tries to answer the request using the data already present
                    in the database.
                 3. The extracted metadata is then sent to the data mining engine for
                    analysis; the engine sometimes interacts with the pattern evaluation
                    modules to determine the result.
                 4. The result is then presented to the user at the front end in an easily
                    understandable form through a suitable interface.
                A detailed description of parts of data mining architecture is shown:
                 1. Data Sources: Databases, the World Wide Web (WWW), and data
                    warehouses are all data sources. The data in these sources may be
                    plain text, spreadsheets, or other media such as photos or videos.
                    The WWW is one of the biggest sources of data.
                 2. Database Server: The database server contains the actual data ready to
                    be processed. It performs the task of handling data retrieval as per the
                     request of the user.
                 3. Data Mining Engine: It is one of the core components of the data mining
                     architecture that performs all kinds of data mining techniques like
                     association, classification, characterization, clustering, prediction, etc.
                 4. Pattern Evaluation Modules: They are responsible for finding interesting
                    patterns in the data, and sometimes they also interact with the database
                    servers to produce the result of a user request.
                 5. Graphical User Interface: Because the user cannot be expected to
                    understand the full complexity of the data mining process, a graphical
                    user interface helps the user communicate effectively with the data
                    mining system.
                 6. Knowledge Base: The knowledge base supports the data mining engine
                    by guiding the search for result patterns. The data mining engine may
                    also take inputs from the knowledge base, which may contain data from
                    user experiences. Its objective is to make the results more accurate
                    and reliable.
                Types of Data Mining architecture:
                 1. No Coupling: In a no-coupling architecture, the data mining system
                    retrieves data from particular data sources and does not use a database
                    to retrieve it, even though a database would be an efficient and
                    accurate way to do so. This architecture is considered poor and is used
                    only for very simple data mining processes.
                 2. Loose Coupling: In a loose-coupling architecture, the data mining
                    system retrieves data from a database and stores data back in that
                    database. This approach suits memory-based data mining.
                 3. Semi-Tight Coupling: This architecture uses various advantageous
                    features of data warehouse systems, including sorting, indexing, and
                    aggregation. Intermediate results can be stored in the database for
                    better performance.
                 4. Tight Coupling: In this architecture, the data warehouse is treated as
                    one of the most important components, and its features are used to
                    perform data mining tasks. This architecture provides scalability,
                    performance, and integrated information.
                Advantages of Data Mining:
                     Assists in preventing future adversities by accurately predicting future
                     trends.
                     Contributes to the making of important decisions.
                     Compresses data into valuable information.
                     Provides new trends and unexpected patterns.
                     Helps to analyze huge data sets.
                     Helps companies find, attract, and retain customers.
                     Helps companies improve their relationships with customers.
                     Helps companies optimize production according to how well a product
                     is liked, which saves costs.
                Disadvantages of Data Mining:
                     The work is intensive and requires high-performance teams and staff
                     training.
                     Large investments are required, because data collection can consume
                     many resources and incur a high cost.
                     Lack of security could also put the data at huge risk, as the data may
                     contain private customer details.
                     Inaccurate data may lead to the wrong output.
                     Huge databases are quite difficult to manage.
                Data Mining Functionalities
                Data mining functionalities specify the kinds of patterns to be discovered
                in data mining tasks. In general, data mining tasks fall into two types:
                descriptive and predictive. Descriptive mining tasks characterize the
                general properties of the data in the database, whereas predictive mining
                tasks perform inference on the current data to make predictions.
                There are various data mining functionalities which are as follows −
                     Data characterization − It is a summarization of the general
                     characteristics of an object class of data. The data corresponding to the
                     user-specified class is generally collected by a database query. The
                     output of data characterization can be presented in multiple forms.
                     Data discrimination − It is a comparison of the general characteristics
                     of target class data objects with the general characteristics of objects
                     from one or a set of contrasting classes. The target and contrasting
                     classes can be specified by the user, and the corresponding data
                     objects are fetched through database queries.
                     Association Analysis − It analyses sets of items that frequently occur
                     together in a transactional dataset. Two parameters are used for
                     determining association rules −
                           Support, which identifies how frequently the item set occurs in
                           the database.
                           Confidence, which is the conditional probability that an item
                           occurs in a transaction when another item occurs.
                     Classification − Classification is the procedure of discovering a model
                     that describes and distinguishes data classes or concepts, so that the
                     model can be used to predict the class of objects whose class label is
                     unknown. The derived model is based on the analysis of a set of
                     training data (i.e., data objects whose class label is known).
                     Prediction − It estimates unavailable data values or upcoming trends.
                     A value can be predicted for an object based on the object's attribute
                     values and the attribute values of the classes. The prediction may
                     concern missing numerical values or increase/decrease trends in
                     time-related information.
                     Clustering − It is similar to classification, but the classes are not
                     predefined; they are derived from the data attributes. It is a form of
                     unsupervised learning. Objects are clustered or grouped on the
                     principle of maximizing the intraclass similarity and minimizing the
                     interclass similarity.
                     Outlier analysis − Outliers are data elements that cannot be grouped in
                     a given class or cluster. They are data objects whose behaviour differs
                     from the general behaviour of other data objects. Analysing this type
                     of data can be essential for mining knowledge.
                     Evolution analysis − It describes the trends for objects whose
                     behaviour changes over time.
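                The clustering functionality above can be illustrated with a minimal
                one-dimensional k-means sketch. k-means is just one clustering algorithm,
                chosen here for brevity; the spending figures, the starting centroids, and
                the function name k_means_1d are assumptions made for the example.

```python
def k_means_1d(points, centroids, iterations=10):
    """Lloyd's algorithm in one dimension, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            sum(c) / len(c) if c else centroids[i]  # keep an empty cluster's centroid
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments have stabilized
        centroids = updated
    return centroids, clusters

# Hypothetical monthly spend per customer; no class labels are supplied.
spend = [12, 15, 14, 80, 85, 90]
centroids, clusters = k_means_1d(spend, centroids=[10, 100])
```

                Points within each resulting group lie close together while the two groups
                lie far apart, which is the grouping principle that clustering aims for.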
                Data Mining Task Primitives
                A data mining task can be specified in the form of a data mining query,
                which is input to the data mining system. A data mining query is defined in
                terms of data mining task primitives. These primitives allow the user to
                interactively communicate with the data mining system during discovery to
                direct the mining process or examine the findings from different angles or
                depths. The data mining primitives specify the following,
                 1. Set of task-relevant data to be mined.
                 2. Kind of knowledge to be mined.
                 3. Background knowledge to be used in the discovery process.
                 4. Interestingness measures and thresholds for pattern evaluation.
                 5. Representation for visualizing the discovered patterns.
                A data mining query language can be designed to incorporate these
                primitives, allowing users to interact with data mining systems flexibly.
                Having a data mining query language provides a foundation on which user-
                friendly graphical interfaces can be built.
                Designing a comprehensive data mining language is challenging because
                data mining covers a wide spectrum of tasks, from data characterization to
                evolution analysis. Each task has different requirements. The design of an
                effective data mining query language requires a deep understanding of the
                power, limitation, and underlying mechanisms of the various kinds of data
                mining tasks. A well-designed language also helps a data mining system
                communicate with other information systems and integrate with the overall
                information processing environment.
                List of Data Mining Task Primitives
                A data mining query is defined in terms of the following primitives:
                1. The set of task-relevant data to be mined
                This specifies the portions of the database or the set of data in which the
                user is interested. This includes the database attributes or data warehouse
                dimensions of interest (the relevant attributes or dimensions).
                In a relational database, the set of task-relevant data can be collected via a
                relational query involving operations like selection, projection, join, and
                aggregation.
                The data collection process results in a new data relation called the initial
                data relation. The initial data relation can be ordered or grouped according
                to the conditions specified in the query. This data retrieval can be thought of
                as a subtask of the data mining task.
                This initial relation may or may not correspond to a physical relation in the
                database. Since virtual relations are called views in the field of databases,
                the set of task-relevant data for data mining is called a minable view.
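                A minable view can be sketched with Python's built-in sqlite3 module. The
                schema, the sample rows, the income threshold, and the view name
                minable_view are all hypothetical; the point is only that a relational
                query combining join, selection, projection, and aggregation yields a
                virtual relation ready for mining.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, age INTEGER, income REAL);
    CREATE TABLE purchase (cust_id INTEGER, item_type TEXT, price REAL);
    INSERT INTO customer VALUES (1, 34, 52000), (2, 45, 38000), (3, 29, 61000);
    INSERT INTO purchase VALUES
        (1, 'laptop', 900), (1, 'phone', 400),
        (2, 'tv', 700),
        (3, 'camera', 300);

    -- The view is a virtual relation: join + selection + projection + aggregation.
    CREATE VIEW minable_view AS
    SELECT c.cust_id, c.age, c.income, SUM(p.price) AS total_spent
    FROM customer c JOIN purchase p ON p.cust_id = c.cust_id
    WHERE c.income >= 40000
    GROUP BY c.cust_id, c.age, c.income;
""")

# Each row is one task-relevant tuple: (cust_id, age, income, total_spent).
rows = conn.execute("SELECT * FROM minable_view ORDER BY cust_id").fetchall()
```

                The view stores no data of its own, which matches the observation that the
                initial relation need not correspond to a physical relation.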
                2. The kind of knowledge to be mined
                This specifies the data mining functions to be performed, such as
                characterization, discrimination, association or correlation analysis,
                classification, prediction, clustering, outlier analysis, or evolution analysis.
                3. The background knowledge to be used in the discovery process
                This knowledge about the domain to be mined is useful for guiding the
                knowledge discovery process and evaluating the patterns found. Concept
                hierarchies are a popular form of background knowledge, which allows data
                to be mined at multiple levels of abstraction.
                Concept hierarchy defines a sequence of mappings from low-level concepts
                to higher-level, more general concepts.
                     Rolling Up - Generalization of data: Allows data to be viewed at more
                     meaningful and explicit levels of abstraction, which makes it easier
                     to understand. Generalization also compresses the data, so mining it
                     requires fewer input/output operations.
                     Drilling Down - Specialization of data: Concept values are replaced by
                     lower-level concepts. Based on different user viewpoints, there may be
                     more than one concept hierarchy for a given attribute or dimension.
                An example of a concept hierarchy for the attribute (or dimension) age is
                shown below. User beliefs regarding relationships in the data are another
                form of background knowledge.
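                As a minimal sketch of rolling up along a concept hierarchy, the following
                Python code generalizes exact ages to age groups and counts the tuples in
                each group. The group names and boundaries are illustrative assumptions,
                not taken from the source, as are the function names age_group and roll_up.

```python
from collections import Counter

def age_group(age):
    """Map an exact age to a higher-level concept (illustrative boundaries)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle_aged"
    return "senior"

def roll_up(ages):
    """Generalize exact ages to age groups and count the tuples in each group."""
    return Counter(age_group(a) for a in ages)

# Hypothetical low-level values of the age attribute.
ages = [22, 27, 35, 41, 58, 63, 70]
summary = roll_up(ages)
```

                Seven distinct low-level values collapse into three higher-level concepts,
                which shows how generalization compresses the data. Drilling down would
                reverse the mapping by returning to the exact ages.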
                4. The interestingness measures and thresholds for pattern evaluation
                Different kinds of knowledge may have different interestingness measures.
                They may be used to guide the mining process or, after discovery, to
                evaluate the discovered patterns. For example, interestingness measures
                for association rules include support and confidence. Rules whose support
                and confidence values are below user-specified thresholds are considered
                uninteresting.
                     Simplicity: A factor contributing to the interestingness of a pattern is the
                     pattern's overall simplicity for human comprehension. For example, the
                     more complex the structure of a rule is, the more difficult it is to
                     interpret, and hence, the less interesting it is likely to be. Objective
                     measures of pattern simplicity can be viewed as functions of the pattern
                     structure, defined in terms of the pattern size in bits or the number of
                     attributes or operators appearing in the pattern.
                     Certainty (Confidence): Each discovered pattern should have a
                     measure of certainty associated with it that assesses the validity or
                     "trustworthiness" of the pattern. A certainty measure for association
                     rules of the form "A => B", where A and B are sets of items, is
                     confidence. Given a set of task-relevant data tuples, the confidence of
                     "A => B" is defined as
                     $$\text{Confidence}(A \Rightarrow B) = \frac{\#\,\text{tuples containing both } A \text{ and } B}{\#\,\text{tuples containing } A}$$
                     Utility (Support): The potential usefulness of a pattern is a factor
                     defining its interestingness. It can be estimated by a utility function,
                     such as support. The support of an association pattern refers to the
                     percentage of task-relevant data tuples (or transactions) for which the
                     pattern is true:
                     $$\text{Support}(A \Rightarrow B) = \frac{\#\,\text{tuples containing both } A \text{ and } B}{\text{total } \#\,\text{of tuples}}$$
                     Novelty: Novel patterns are those that contribute new information or
                     increased performance to the given pattern set; a data exception is one
                     example. Another strategy for detecting novelty is to remove redundant
                     patterns.
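                The support and confidence measures above can be worked through on a small
                example. This is a sketch under stated assumptions: the four baskets, the
                threshold values, and the function names support and confidence are
                invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of A and B together divided by the support of A."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk", "cereal"},
]

rule_support = support(baskets, {"bread", "butter"})          # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"butter"})  # 2 of 3 bread baskets

# Hypothetical thresholds; a rule below either one is treated as uninteresting.
MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.6
interesting = rule_support >= MIN_SUPPORT and rule_confidence >= MIN_CONFIDENCE
```

                Here the rule "bread => butter" has support 0.5 and confidence of about
                0.67, so it passes both thresholds. Note that confidence is undefined when
                no transaction contains the antecedent; this sketch does not guard
                against that case.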
                5. The expected representation for visualizing the discovered patterns
                This refers to the form in which discovered patterns are to be displayed,
                which may include rules, tables, cross tabs, charts, graphs, decision trees,
                cubes, or other visual representations.
                Users must be able to specify the forms of presentation to be used for
                displaying the discovered patterns. Some representation forms may be
                better suited than others for particular kinds of knowledge.
                For example, generalized relations and their corresponding cross tabs or
                pie/bar charts are good for presenting characteristic descriptions, whereas
                decision trees are common for classification.
                Example of Data Mining Task Primitives
                Suppose, as a marketing manager of AllElectronics, you would like to
                classify customers based on their buying patterns. You are especially
                interested in those customers whose salary is no less than $40,000 and
                who have bought more than $1,000 worth of items, each of which is priced
                at no less than $100.
                In particular, you are interested in the customer's age, income, the types of
                items purchased, the purchase location, and where the items were made.
                 You would like to view the resulting classification in the form of rules. This
                 data mining query is expressed in DMQL as follows, where each line of the
                 query has been enumerated to aid in our discussion.
                 1. use database AllElectronics_db
                 2. use hierarchy location_hierarchy for T.branch, age_hierarchy for C.age
                 3. mine classification as promising_customers
                 4. in relevance to C.age, C.income, I.type, I.place_made, T.branch
                 5. from customer C, item I, transaction T
                 6. where I.item_ID = T.item_ID and C.cust_ID = T.cust_ID and C.income ≥
                    40,000 and I.price ≥ 100
                 7. group by T.cust_ID
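                 DMQL is not directly executable in common database systems, but the
                 data-selection part of the query (lines 5 to 7) resembles ordinary SQL. The
                 sketch below approximates only that part using Python's built-in sqlite3
                 module. The table layouts and sample rows are hypothetical, the
                 transaction table is named transactions here because TRANSACTION is
                 an SQLite keyword, and the classification step itself is not shown.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_ID INTEGER, age INTEGER, income REAL);
    CREATE TABLE item (item_ID INTEGER, type TEXT, place_made TEXT, price REAL);
    CREATE TABLE transactions (cust_ID INTEGER, item_ID INTEGER, branch TEXT);

    INSERT INTO customer VALUES (1, 35, 52000), (2, 28, 38000), (3, 47, 61000);
    INSERT INTO item VALUES (10, 'laptop', 'USA', 900),
                            (11, 'camera', 'Japan', 450),
                            (12, 'cable', 'China', 15);
    INSERT INTO transactions VALUES (1, 10, 'Vancouver'), (1, 11, 'Vancouver'),
                                    (2, 10, 'Toronto'),
                                    (3, 12, 'Toronto'), (3, 11, 'Toronto');
""")

# Same join and filter conditions as DMQL lines 5 to 7: join the three tables,
# keep customers with income >= 40,000 and items priced >= 100,
# then group the qualifying purchases by customer.
rows = conn.execute("""
    SELECT T.cust_ID, C.age, C.income,
           COUNT(*) AS n_items, SUM(I.price) AS total_spent
    FROM customer C, item I, transactions T
    WHERE I.item_ID = T.item_ID
      AND C.cust_ID = T.cust_ID
      AND C.income >= 40000
      AND I.price >= 100
    GROUP BY T.cust_ID
    ORDER BY T.cust_ID
""").fetchall()

for row in rows:
    print(row)
```

                 With these sample rows, customer 2 is dropped by the income condition and
                 the $15 item is dropped by the price condition before grouping. The
                 resulting rows are the task-relevant data that a classifier would then
                 work on.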
                What is the integration of a data mining
                system with a database system?
                 A data mining system is integrated with a database or data warehouse
                 system so that it can perform its tasks effectively. A data mining
                 system operates in an environment that requires it to communicate with
                 other data systems, such as a database system. The possible schemes for
                 integrating these systems are as follows −
                 No coupling − No coupling means that the data mining system does not use
                 any function of a database or data warehouse system. It retrieves data
                 from a specific source (such as a file system), processes the data using
                 some data mining algorithms, and then saves the mining results in a
                 separate file.
                 Such a system, though simple, suffers from several limitations. A
                 database system offers a great deal of flexibility and efficiency in storing,
                 organizing, accessing, and processing data. Without a database/data
                 warehouse system, a data mining system may spend a large amount of
                 time finding, collecting, cleaning, and transforming data.
                 Loose coupling − In this scheme, the data mining system uses some services
                 of a database or data warehouse system. The data is fetched from a data
                 repository managed by these systems. Data mining methods are used to
                 process the data, and the results are then saved either in a file or in a
                 designated area in a database or data warehouse. Loose coupling is better
                 than no coupling because it can fetch any portion of the data stored in
                 databases by using query processing or other system facilities.
                 Semitight coupling − In this scheme, efficient implementations of a few
                 essential data mining primitives are provided in the database/data
                 warehouse system. These primitives can include sorting, indexing,
                 aggregation, histogram analysis, multi-way join, and pre-computation of
                 some important statistical measures, such as sum, count, max, min, and
                 standard deviation.
                 Tight coupling − Tight coupling means that the data mining system is
                 smoothly integrated into the database/data warehouse system. The data
                 mining subsystem is treated as one functional component of the
                 information system.
                 Data mining queries and functions are optimized based on mining query
                 analysis, data structures, indexing schemes, and query processing
                 methods of the database/data warehouse system. This scheme is highly
                 desirable because it supports the efficient implementation of data mining
                 functions, high system performance, and an integrated data processing
                 environment.
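                 The difference between loose and semitight coupling can be sketched in a
                 few lines. The example below is illustrative only and assumes a
                 hypothetical sales table in an in-memory SQLite database, accessed through
                 Python's built-in sqlite3 module. In the loose style the rows are fetched
                 by a query and counted in application code. In the semitight style the
                 aggregation primitive (count) is executed inside the database.

```python
import sqlite3
from collections import Counter

# Hypothetical data: one row per (transaction, item) pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (txn_id INTEGER, item TEXT);
    INSERT INTO sales VALUES (1, 'computer'), (1, 'software'),
                             (2, 'computer'),
                             (3, 'printer'), (3, 'computer');
    CREATE TABLE mining_results (item TEXT, txn_count INTEGER);
""")

# Loose coupling: the database only stores and returns the data.
# Counting happens in the mining program, and the result is written
# back to a designated table.
fetched = conn.execute("SELECT item FROM sales").fetchall()
loose_counts = Counter(item for (item,) in fetched)
conn.executemany(
    "INSERT INTO mining_results VALUES (?, ?)",
    sorted(loose_counts.items()),
)

# Semitight coupling: the aggregation primitive runs inside the database,
# so only the summarized counts are transferred to the mining program.
semitight_counts = dict(
    conn.execute("SELECT item, COUNT(*) FROM sales GROUP BY item").fetchall()
)

stored = conn.execute(
    "SELECT item, txn_count FROM mining_results ORDER BY item"
).fetchall()
print(dict(loose_counts), semitight_counts, stored)
```

                 Both approaches produce the same counts. The practical difference is where
                 the work is done and how much data has to leave the database, which matters
                 more as the tables grow.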