GANADIPATHY TULSI’S JAIN ENGINEERING COLLEGE
BA5021 DATA MINING FOR BUSINESS INTELLIGENCE
                                  UNIT I
                             INTRODUCTION
Syllabus:
Data mining, Text mining, Web mining, Spatial mining, Process mining, BI
process- Private and Public intelligence, Strategic assessment of
implementing BI
 1.1. DATA MINING
 Why Data Mining?
    The Explosive Growth of Data: from terabytes to petabytes
           Data collection and data availability
                Automated data collection tools, database systems,
                 Web, computerized society
           Major sources of abundant data
                Business: Web, e-commerce, transactions, stocks, …
                Science: Remote sensing, bioinformatics, scientific
                 simulation, …
                Society and everyone: news, digital cameras, YouTube
    We are drowning in data, but starving for knowledge!
     "Necessity is the mother of invention": data mining, the automated
      analysis of massive data sets
 What is Data Mining?
    Data mining (knowledge discovery from data)
           Extraction of interesting (non-trivial, implicit, previously
            unknown and potentially useful) patterns or knowledge from
            huge amount of data
        Data mining: a misnomer?
   Alternative names
         Knowledge discovery (mining) in databases (KDD),
          knowledge extraction, data/pattern analysis, data
          archeology, data dredging, information harvesting, business
          intelligence, etc.
   Watch out: Is everything "data mining"?
        Simple search and query processing
        (Deductive) expert systems
Knowledge Discovery in DB (KDD) Process
   This is a view from typical database systems and data
    warehousing communities
   Data mining plays an essential role in the knowledge discovery
    process
Knowledge discovery as a process involves the following steps:
     1. Data cleaning
           To remove noise and inconsistent data
     2. Data integration
           where multiple data sources may be combined
     3. Data selection
           where data relevant to the analysis task are retrieved from
           the database
     4. Data transformation
           where data are transformed or consolidated into forms
           appropriate for mining by performing summary or
           aggregation operations, for instance
     5. Data mining
           an essential process where intelligent methods are applied in
           order to extract data patterns
     6. Pattern evaluation
           To identify the truly interesting patterns representing
           knowledge based on some interestingness measures
      7. Knowledge presentation
            where visualization and knowledge representation
            techniques are used to present the mined knowledge to the
            user
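The seven steps above can be sketched as a chain of plain Python functions. The records, attribute names, and the toy "pattern" mined at the end are illustrative only, not from any particular dataset:

```python
# A minimal sketch of the KDD pipeline stages as plain Python functions.
raw = [
    {"age": 25, "income": 30000},
    {"age": None, "income": 45000},   # noisy record: missing value
    {"age": 40, "income": 80000},
    {"age": 52, "income": 95000},
]

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if None not in r.values()]

def select(records, columns):
    """Data selection: keep only attributes relevant to the task."""
    return [{c: r[c] for c in columns} for r in records]

def transform(records):
    """Data transformation: consolidate ages into groups for mining."""
    return [{"age_group": "young" if r["age"] < 40 else "older",
             "income": r["income"]} for r in records]

def mine(records):
    """Data mining: a toy 'pattern' - average income per age group."""
    totals, counts = {}, {}
    for r in records:
        g = r["age_group"]
        totals[g] = totals.get(g, 0) + r["income"]
        counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}

patterns = mine(transform(select(clean(raw), ["age", "income"])))
print(patterns)   # average income per age group
```

Pattern evaluation and knowledge presentation would then judge and visualize `patterns`; they are omitted here for brevity.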
KDD Process: A Typical View from ML and Statistics
Data Mining in Business Intelligence
Architecture of typical DM Systems
Based on KDD’s view, the architecture of a typical data mining system
may have the following major components
   Database, data warehouse or other information repository:
         This is one or a set of databases, data warehouses,
          spreadsheets, or other kinds of information repositories.
         Data cleaning and data integration techniques may be
          performed on the data.
   Database or data warehouse server:
         The database or data warehouse server is responsible for
          fetching the relevant data, based on the user’s data mining
          request.
   Knowledge base:
         This is the domain knowledge that is used to guide the
          search or evaluate the interestingness of resulting patterns.
         Such knowledge can include concept hierarchies, used to
          organize attributes or attribute values into different levels of
          abstraction.
         Other examples of domain knowledge are additional
          interestingness constraints or thresholds, and metadata
   Data mining engine:
         This is essential to the data mining system and ideally
          consists of a set of functional modules for tasks such as
          characterization, association and correlation analysis,
          classification, prediction, cluster analysis, outlier analysis,
          and evolution analysis.
   Pattern evaluation module :
         This component typically employs interestingness measures
          and interacts with the data mining modules so as to focus the
          search toward interesting patterns.
         It may use interestingness thresholds to filter out discovered
          patterns.
   User interface:
         This module communicates between users and the data
          mining system, allowing the user to interact with the system
          by specifying a data mining query or task, providing
          information to help focus the search, and performing
          exploratory data mining based on the intermediate data
          mining results
Data Mining: On What Kinds of Data?
There are a number of data stores on which data mining can be performed:
   Relational database
   Data warehouse
   Transactional database
 Advanced database and information repository
      Spatial and temporal data
      Time-series data
      Stream data
      Multimedia database
      Text databases & WWW
 Relational database
     A relational database is a collection of tables, each of which
       is assigned a unique name.
     Each table consists of a set of attributes (columns or fields)
       and usually stores a large set of tuples (records or rows).
     Each tuple in a relational table represents an object identified
       by a unique key and described by a set of attribute values
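A relational table can be sketched with Python's built-in sqlite3 module; the table name, attributes, and rows below are illustrative:

```python
import sqlite3

# A relational table: a set of attributes (columns) storing tuples (rows),
# each identified by a unique key.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE customer (
                   cust_id INTEGER PRIMARY KEY,  -- unique key
                   name    TEXT,
                   city    TEXT)""")
con.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                [(1, "Alice", "Chicago"), (2, "Bob", "Toronto")])
# Each fetched tuple represents one object described by attribute values
rows = con.execute("SELECT cust_id, name, city FROM customer").fetchall()
print(rows)   # -> [(1, 'Alice', 'Chicago'), (2, 'Bob', 'Toronto')]
```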
 Data warehouse
     A data warehouse is a repository of information collected
       from multiple sources, stored under a unified schema, and
       that usually resides at a single site.
     Data warehouses are constructed via a process of data
       cleaning, data integration, data transformation, data loading,
       and periodic data refreshing.
                        Figure: Data Warehouse
 To facilitate decision making, the data in a data warehouse are
  organized around major subjects, such as customer, item,
  supplier, and activity.
 A data warehouse is usually modeled by a multidimensional
  database structure, where each dimension corresponds to an
  attribute or a set of attributes in the schema.
 Each cell stores the value of some aggregate measure, such as
  count or sales amount.
 The actual physical structure of a data warehouse may be a
  relational data store or a multidimensional data cube.
 A data cube provides a multidimensional view of data and
  allows the precomputation and fast accessing of summarized
  data.
 A data cube for summarized sales data of All Electronics is
  presented in Figure.
 The cube has three dimensions:
       address (with city values Chicago, New York, Toronto,
        Vancouver),
       time (with quarter values Q1, Q2, Q3, Q4), and
       item (with item type values home entertainment,
        computer, phone, security).
 The aggregate value stored in each cell of the cube is sales
  amount (in thousands).
 A data warehouse collects information about subjects that
  span an entire organization, and thus its scope is enterprise-
  wide.
 A data mart, on the other hand, is a department subset of a
  data warehouse. It focuses on selected subjects, and thus its
  scope is department-wide
 By providing multidimensional data views and the
  precomputation of summarized data, data warehouse systems are
  well suited for on-line analytical processing, or OLAP.
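The precomputation of summarized data over a cube can be sketched in Python; the base cells below use the same three dimensions (city, quarter, item) as the example above, but the sales figures (in thousands) are illustrative:

```python
from collections import defaultdict

# Base cells of a toy data cube, keyed by (city, quarter, item),
# each holding a sales amount (in thousands). Figures are illustrative.
base = {
    ("Chicago",  "Q1", "computer"): 862,
    ("Chicago",  "Q1", "phone"):    882,
    ("New York", "Q1", "computer"): 968,
    ("New York", "Q2", "computer"): 1024,
}

def roll_up(cells, dims):
    """Precompute an aggregate (summarized) view over chosen dimensions."""
    idx = {"city": 0, "quarter": 1, "item": 2}
    out = defaultdict(int)
    for key, amount in cells.items():
        out[tuple(key[idx[d]] for d in dims)] += amount
    return dict(out)

print(roll_up(base, ["city"]))            # totals per city
print(roll_up(base, ["quarter", "item"])) # totals per (quarter, item)
```

An OLAP server would keep many such precomputed views so that summarized data can be accessed without rescanning the base cells.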
   Transactional database
   In general, a transactional database consists of a file where each
    record represents a transaction.
   A transaction typically includes a unique transaction identity
    number (trans ID) and a list of the items making up the transaction
    (such as items purchased in a store).
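A transactional database can be sketched as records holding a trans_ID and an item list; the IDs and items below are illustrative:

```python
# Each record of a transactional database: a unique trans_ID plus the
# list of items making up the transaction (illustrative values).
transactions = [
    {"trans_ID": "T100", "items": ["bread", "milk", "eggs"]},
    {"trans_ID": "T200", "items": ["beer", "diapers"]},
]

# Look up the items of a given transaction by its identity number
by_id = {t["trans_ID"]: t["items"] for t in transactions}
print(by_id["T200"])   # ['beer', 'diapers']
```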
Advanced database and information repository:
   Object-Relational Databases
      Constructed based on an object-relational data model.
      This model extends the relational model by providing a rich data
       type for handling complex objects and object orientation.
      Object-relational data model inherits the essential concepts of
       object-oriented databases, where, in general terms, each entity
       is considered as an object.
   Temporal Databases, Sequence Databases, and Time-Series
    Databases
      Temporal Databases
         A temporal database typically stores relational data that
          include time-related attributes.
      These attributes may involve several timestamps, each
       having different semantics
   Sequence Databases
      A sequence database stores sequences of ordered events,
       with or without a concrete notion of time.
      Examples include customer shopping sequences, Web click
       streams, and biological sequences
   Time-Series Databases
     A time-series database stores sequences of values or events
      obtained over repeated measurements of time (e.g., hourly,
      daily, weekly).
     Examples include data collected from the stock exchange,
      inventory control, and the observation of natural phenomena
      (like temperature and wind).
 Spatial Databases
   Spatial databases contain spatial-related information.
    Examples include geographic (map) databases, very large-
     scale integration (VLSI) or computer-aided design databases,
     and medical and satellite image databases.
 Text Databases and Multimedia Databases
   Text databases are databases that contain word descriptions
    for objects.
   These word descriptions are usually not simple keywords but
    rather long sentences or paragraphs, such as product
    specifications, error or bug reports, warning messages,
    summary reports, notes, or other documents.
   Multimedia databases store image, audio, and video data. They
    are used in applications such as picture content-based retrieval,
    voice-mail systems, video-on-demand systems, the World Wide
    Web, and speech-based user interfaces that recognize spoken
    commands
   Data Streams
         Data flow in and out of an observation platform (or window)
          dynamically.
         Such data streams have the following unique features: huge
          or possibly infinite volume, dynamically changing, flowing in
          and out in a fixed order, allowing only one or a small number
          of scans, and demanding fast (often real-time) response
          time.
         Examples : scientific and engineering data, time-series data,
          and data produced in other dynamic environments, etc.
DATA MINING CONCEPTS AND APPLICATIONS - Data Mining
Definitions, Characteristics, and Benefits:
   Data mining is a term used to describe discovering or "mining"
    knowledge from large amounts of data.
   Technically speaking, data mining is a process that uses
    statistical, mathematical, and artificial intelligence techniques
    to extract and identify useful information and subsequent
    knowledge (or patterns) from large sets of data.
   These patterns can be in the form of business rules, affinities,
    correlations, trends, or prediction models
   Most literature defines data mining as "the nontrivial process of
    identifying valid, novel, potentially useful, and ultimately
    understandable patterns in data stored in structured databases,"
    where the data are organized in records structured by
    categorical, ordinal, and continuous variables.
The meanings of the key terms are as follows:
   Nontrivial means that some experimentation-type search or
    inference is involved;
   Valid means that the discovered patterns should hold true on new
    data with sufficient degree of certainty.
 Novel means that the patterns are not previously known
 Potentially useful means that the discovered patterns should lead
  to some benefit to the user or task.
 Ultimately understandable means that the pattern should make
  business sense that leads to the user saying "mmm!"
 Data mining is not a new discipline, but rather a new definition for
  the use of many disciplines.
 Data mining is tightly positioned at the intersection of many
  disciplines, including statistics, artificial intelligence, machine
  learning, management science, information systems, and
  databases.
Major characteristics and objectives of data mining:
   Data are often buried deep within very large databases
   The data are cleansed and consolidated into a data warehouse.
   Data may be presented in a variety of formats.
   The data mining environment is usually a client/server
    architecture or a Web-based information systems architecture.
   Sophisticated new tools, including advanced visualization tools,
    help to extract the information buried in corporate files or
    archival public records.
   Data miners are exploring the usefulness of soft data.
   The miner is often an end user, empowered by data drills and
    other power query tools to ask ad hoc questions and obtain
    answers quickly.
   Data mining tools are readily combined with spreadsheets and
    other software development tools.
   It is sometimes necessary to use parallel processing for data
    mining.
A Simple Taxonomy of Data
   Data refers to a collection of facts usually obtained as the result
    of experiences, observations, or experiments.
   Data may consist of numbers, letters, words, images, voice
    recordings, and so on as measurements of a set of variables.
   Data are often viewed as the lowest level of abstraction from
    which information and then knowledge is derived.
   At the highest level of abstraction, one can classify data as
    structured and unstructured
   Structured data is what data mining algorithms use, and can be
    classified as categorical or numeric.
   The categorical data can be subdivided into nominal or ordinal
    data, whereas numeric data can be subdivided into interval or
    ratio data.
 Categorical data
   represent the labels of multiple classes used to divide a variable
    into specific groups.
   Examples of categorical variables include sex, age group, and
    educational level.
 Nominal data
   contain measurements of simple codes assigned to objects as
    labels, which are not measurements.
   For example, the variable marital status can be generally
    categorized as (1) single, (2) married, and (3) divorced.
   Nominal data can be represented with binomial values having
    two possible values (e.g., yes/ no, true/ false, good/ bad), or
    multinomial values having three or more possible values.
 Ordinal data
   contain codes assigned to objects or events as labels that also
    represent the rank order among them.
   For example, the variable credit score can be generally
    categorized as (1) low, (2) medium, or (3) high.
      Similar ordered relationships can be seen in variables such as
       age group (i.e., child, young, middle-aged, elderly)
   Numeric data
      represent the numeric values of specific variables.
      Examples of numerically valued variables include age, number
       of children, total household income
   Interval data
      are variables that can be measured on interval scales.
      A common example of interval scale measurement is
       temperature on the Celsius scale.
   Ratio data
      include measurement variables commonly found in the physical
       sciences and engineering. Mass, length, time, plane angle,
       energy, and electric charge.
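The distinctions above determine which operations are valid on each data type; a small Python sketch (with illustrative variables and codes):

```python
# Nominal: codes are labels only; their order is meaningless
marital = {"single": 1, "married": 2, "divorced": 3}

# Ordinal: codes carry rank order, so comparisons make sense
credit = {"low": 1, "medium": 2, "high": 3}
assert credit["high"] > credit["low"]   # valid: rank order

# Interval: differences are meaningful, ratios are not (no true zero)
celsius_a, celsius_b = 10.0, 20.0
diff = celsius_b - celsius_a            # meaningful: a 10-degree difference
# but 20 C is NOT "twice as hot" as 10 C - ratios need a true zero point

# Ratio: a true zero exists, so ratios are meaningful
mass_a, mass_b = 2.0, 4.0
print(mass_b / mass_a)                  # "twice as heavy" is meaningful
```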
How Data Mining Works?
   Using existing and relevant data, data mining builds models to
    identify patterns among the attributes presented in the data set.
   Models are the mathematical representations that identify the
    patterns among the attributes of the objects described in the data
    set.
   Some of these patterns are explanatory (explaining the
    interrelationships and affinities among the attributes), whereas
    others are predictive (foretelling future values of certain
    attributes).
In general, data mining seeks to identify four major types of
patterns:
1. Associations
           find the commonly co-occurring groupings of things, such as
beer and diapers going together in market-basket analysis.
2. Predictions
          tell the nature of future occurrences of certain events based
on what has happened in the past, such as predicting the winner of the
Game or forecasting the absolute temperature of a particular day.
3. Clusters
            identify natural groupings of things based on their known
characteristics, such as assigning customers in different segments
based on their demographics and past purchase behaviors.
4. Sequential relationships
            discover time-ordered events, such as predicting that an
existing banking customer who already has a checking account will open
a savings account followed by an investment account within a year.
   Data mining tasks can be classified into three main categories:
         prediction,
         association, and
         clustering.
   Based on the way in which the patterns are extracted from the
    historical data, the learning algorithms of data mining methods
    can be classified as either
         supervised      or
         unsupervised.
               Supervised learning algorithms - the training data
                includes both the descriptive attributes (i.e.,
                independent variables or decision variables) as well as
                the class attribute (i.e. , output variable or result
                variable).
               Unsupervised learning algorithm - the training data
                includes only the descriptive attributes.
PREDICTION
   Prediction is commonly referred to as the act of telling about the
    future.
   It differs from simple guessing by taking into account the
    experiences, opinions, and other relevant information in
    conducting the task of foretelling.
   A term that is commonly associated with prediction is forecasting.
   Prediction is largely experience and opinion based, whereas
    forecasting is data and model based.
   That is, in order of increasing reliability, one might list the relevant
    terms as guessing, predicting, and forecasting, respectively.
CLASSIFICATION (supervised induction)
   The objective of classification is to analyze the historical data
    stored in a database and automatically generate a model that can
    predict future behavior.
   This induced model consists of generalizations over the records of
    a training dataset, which help distinguish predefined classes.
   The hope is that the model can then be used to predict the classes
    of other unclassified records and, more important, to accurately
    predict actual future events.
   Common classification tools include neural networks, decision
    trees, logistic regression, and discriminant analysis.
   Emerging tools include rough sets, support vector machines, and
    genetic algorithms.
            Neural networks
               Involve the development of mathematical structures
                (somewhat resembling the biological neural networks
                in human brain) that have the capability to learn from
                past experiences presented in the form of well-
                structured datasets
            Decision trees
               Classify data into a finite number of classes based on
                the values of the input variables.
               Decision trees are essentially a hierarchy of if-then
                statements
               Faster than neural networks.
               They are most appropriate for categorical and interval
                data.
               Therefore, incorporating continuous variables into a
                decision tree framework requires discretization -
                converting continuous valued numerical variables to
                ranges and categories
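A decision tree's hierarchy of if-then statements, with a continuous variable discretized into ranges first, can be sketched as follows; the split points, attributes, and class labels are illustrative, not an induced model:

```python
def discretize_income(income):
    """Discretization: convert a continuous value into categorical ranges."""
    if income < 30000:
        return "low"
    elif income < 70000:
        return "medium"
    return "high"

def classify(record):
    """A tiny 'tree': if-then rules over the input attribute values."""
    income_band = discretize_income(record["income"])
    if income_band == "high":
        return "approve"
    elif income_band == "medium":
        return "approve" if record["age"] >= 25 else "reject"
    else:
        return "reject"

print(classify({"income": 85000, "age": 40}))   # approve
print(classify({"income": 50000, "age": 22}))   # reject
```

In practice such rules are induced automatically from a training dataset rather than written by hand.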
CLUSTERING
   Clustering partitions a collection of things (e.g., objects and
    events presented in a structured dataset) into segments (or
    natural groupings) whose members share similar characteristics.
   Unlike in classification, in clustering, the class labels are unknown.
   As the selected algorithms go through the dataset, identifying the
    commonalities of things based on their characteristics, the clusters
    are established.
   Because the clusters are determined using a heuristic-type
    algorithm, and because different algorithms may end up with
    different sets of clusters for the same dataset, it may be
    necessary for an expert to interpret, and potentially modify, the
    suggested clusters before the results of clustering techniques
    are put to actual use.
   After reasonable clusters have been identified, they can be used to
    classify and interpret new data.
   The goal of clustering is to create groups so that the members
    within each group have maximum similarity and the members
    across groups have minimum similarity.
   The most commonly used clustering techniques include k-means
    (from statistics) and self-organizing maps (from machine
    learning), which is a unique neural network architecture developed
    by Kohonen.
ASSOCIATIONS
   Associations, or association rule learning in data mining, is a
    popular and well-researched technique for discovering interesting
    relationships among variables in large databases.
   In the context of the retail industry, association rule mining is often
    called market-basket analysis.
   Two commonly used derivatives of association rule mining are link
    analysis and sequence mining.
        Link analysis - the linkage among many objects of interest
         is discovered automatically, such as the link between Web
         pages and referential relationships among groups of
         academic publication authors.
        Sequence mining - relationships are examined in terms of
         their order of occurrence to identify associations over time
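Market-basket analysis rests on two measures, itemset support and rule confidence, which can be sketched over a toy transactional database (the baskets below are illustrative):

```python
# A toy transactional database: each transaction is a set of items.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent): joint support over antecedent support."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"beer", "diapers"}))        # 0.5 (2 of 4 baskets)
print(confidence({"beer"}, {"diapers"}))   # 2/3 of beer baskets have diapers
```

Association-rule algorithms such as Apriori search for all rules whose support and confidence exceed user-given thresholds.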
HYPOTHESIS- OR DISCOVERY-DRIVEN DATA MINING
   Data mining can be hypothesis driven or discovery driven.
   Hypothesis-driven data mining begins with a proposition by the
    user, who then seeks to validate the truthfulness of the proposition.
   For example, a marketing manager may begin with the following
    proposition: "Are DVD player sales related to sales of television
    sets?"
   Discovery-driven data mining finds patterns, associations, and
    other relationships hidden within datasets. It can uncover facts that
    an organization had not previously known or even contemplated
DATA MINING APPLICATIONS
    • Customer relationship management.
    • Banking.
    • Retailing and logistics.
    • Manufacturing and production.
    • Brokerage and securities trading.
    • Insurance.
    • Computer hardware and software.
    • Government and defense.
    • Travel industry (airlines, hotels/resorts, rental car companies).
     • Health care.
     • Medicine.
     • Entertainment industry.
     • Homeland security and law enforcement.
     • Sports.
Data mining has become a popular tool in addressing many complex
business issues.
1. Customer relationship management.
   Customer relationship management (CRM) is the new and
    emerging extension of traditional marketing.
   The goal of CRM is to create one-on-one relationships with
    customers by developing an intimate understanding of their needs
    and wants.
   As businesses build relationships with their customers over time
    through a variety of transactions (e.g., product inquiries, sales,
    service requests, warranty calls), they accumulate detailed
    transaction data.
   When combined with demographic and socioeconomic attributes,
    this information-rich data can be used to
           (1) identify most likely responders / buyers of new
           products/services (i.e., customer profiling);
           (2) understand the root causes of customer attrition in order
           to improve customer retention (i.e., churn analysis);
           (3) discover time-variant associations between products and
           services to maximize sales and customer value;
           (4) identify the most profitable customers and their
           preferential needs to strengthen relationships and to
           maximize sales.
2. Banking.
   Data mining can help banks with the following:
     (1) automating the loan application process by accurately
     predicting the most probable defaulters;
     (2) detecting     fraudulent   credit   card   and     online-banking
     transactions;
     (3) identifying ways to maximize customer value by selling them
     products and services that they are most likely to buy;
     (4) optimizing the cash return by accurately forecasting the cash
     flow on banking entities (e.g., ATM machines, banking branches).
3. Retailing and logistics.
   In the retailing industry, data mining can be used to
     (1) predict accurate sales volumes at specific retail locations in
     order to determine correct inventory levels;
     (2) identify sales relationships between different products (with
     market-basket analysis) to improve the store layout and optimize
     sales promotions;
     (3) forecast consumption levels of different product types (based
     on seasonal and environmental conditions) to optimize logistics
     and hence maximize sales;
     (4) discover interesting patterns in the movement of products
     (especially for the products that have a limited shelf life because
     they are prone to expiration, perishability, and contamination) in a
     supply chain by analyzing sensory and RFID data.
4. Manufacturing and production.
   Manufacturers can use data mining to
     (1) predict machinery failures before they occur through the use of
     sensory data (enabling what is called condition-based
     maintenance);
     (2) identify anomalies and commonalities in production systems to
     optimize manufacturing capacity; and
     (3) discover novel patterns to identify and improve product quality
5. Brokerage and securities trading.
   Brokers and traders use data mining to
     (1) predict when and how much certain bond prices will change;
     (2) forecast the range and direction of stock fluctuations;
     (3) assess the effect of particular issues and events on overall
     market movements; and
     (4) identify and prevent fraudulent activities in securities trading
6. Insurance.
   The insurance industry uses data mining techniques to
     (1) forecast claim amounts for property and medical coverage
     costs for better business planning;
     (2) determine optimal rate plans based on the analysis of claims
     and customer data;
     (3) predict which customers are more likely to buy new policies
     with special features; and
     (4) identify and prevent incorrect claim payments and fraudulent
     activities.
7. Computer hardware and software.
   Data mining can be used to
     (1) predict disk drive failures well before they actually occur;
     (2) identify and filter unwanted Web content and e-mail messages;
      (3) detect and prevent computer network security breaches; and
     (4) identify potentially unsecure software products.
8. Government and defense.
   Data mining also has a number of military applications. It can be
    used to
     (1) forecast the cost of moving military personnel and equipment;
     (2) predict an adversary’s moves to develop more successful
     strategies for military engagements;
     (3) predict resource consumption for better planning and
     budgeting; and
     (4) identify classes of unique experiences, strategies, and lessons
     learned from military operations for better knowledge sharing
9. Travel industry (airlines, hotels / resorts, rental car companies).
   Data mining has a variety of uses in the travel industry. It is
    successfully used to
     (1) predict sales of different services (seat types in airplanes, room
     types in hotels/resorts, car types in rental car companies) in order
     to optimally price services to maximize revenues as a function of
     time-varying transactions (commonly referred to as yield
     management);
     (2) forecast demand at different locations to better allocate limited
     organizational resources;
     (3) identify the most profitable customers and provide them with
     personalized services to maintain their repeat business; and
     (4) retain valuable employees by identifying and acting on the root
     causes for attrition
10. Health care.
   Data mining has a number of health care applications. It can be
    used to
     (1) identify people without health insurance and the factors
     underlying this undesired phenomenon;
     (2) identify novel cost-benefit relationships between different
     treatments to develop more effective strategies;
     (3) forecast the level and the time of demand at different service
     locations to optimally allocate organizational resources; and
     (4) understand the underlying reasons for customer and employee
     attrition.
11. Medicine.
   Data mining can be used to
     (1) identify novel patterns to improve survivability of patients with
     cancer;
     (2) predict success rates of organ transplantation patients to
     develop better donor-organ matching policies;
     (3) identify the functions of different genes in the human
     chromosome (known as genomics);
     (4) discover the relationships between symptoms and illnesses to
     help medical professionals make informed and correct decisions in
     a timely manner.
12. Entertainment industry.
   Data mining is successfully used by the entertainment industry to
     (1) analyze viewer data to decide what programs to show during
     prime time and how to maximize returns by knowing where to
     insert advertisements;
     (2) predict the financial success of movies before they are
     produced to make investment decisions and to optimize the
     returns;
     (3) forecast the demand at different locations and different times to
     better schedule entertainment events and to optimally allocate
     resources; and
     (4) develop optimal pricing policies to maximize revenues.
13. Homeland security and law enforcement.
   Data mining has a number of homeland security and law
    enforcement applications. Data mining is often used to
     (1) identify patterns of terrorist behaviors
     (2) discover crime patterns (e.g., locations, timings, criminal
     behaviors, and other related attributes) to help solve criminal
     cases in a timely manner;
     (3) predict and eliminate potential biological and chemical attacks
     to a nation’s critical infrastructure by analyzing special-purpose
     sensory data; and
     (4) identify and stop malicious attacks on critical information
     infrastructures (often called information warfare).
14. Sports.
   Data mining was used to improve the performance of National
     Basketball Association (NBA) teams in the United States.
     The NBA developed Advanced Scout, a PC-based data mining
     application that coaching staff use to discover interesting patterns
     in basketball game data.
     The pattern interpretation is facilitated by allowing the user to
     relate patterns to videotape.
     See Bhandari et al. (1997) for details.
DATA MINING PROCESS
   In order to systematically carry out data mining projects, a general
    process is usually followed.
   Based on best practices, data mining researchers and practitioners
    have proposed several processes (workflows or simple step-by-
    step approaches)
 One such standardized process, and the most popular one, is the
   Cross-Industry Standard Process for Data Mining (CRISP-DM).
 CRISP-DM is a sequence of six steps.
 It starts with a good understanding of the business and ends with
   the deployment of a solution that satisfies the specific business
   need.
 Even though these steps are sequential in nature, there is usually
  a great deal of backtracking.
 Because data mining is driven by experience and experimentation,
   the whole process can be very iterative, depending on the problem
   situation and the knowledge/experience of the analyst.
Step 1: Business Understanding
   A thorough understanding of the managerial need for new
    knowledge and an explicit specification of the business objective
   Specific questions such as
             "What are the common characteristics of the customers we
have lost to our competitors recently?" or
            "What are typical profiles of our customers, and how much
value does each of them provide to us?"
      need to be addressed.
   Then a project plan for finding such knowledge is developed that
    specifies the people responsible for collecting the data, analyzing
    the data, and reporting the findings.
   At this early stage, a budget to support the study should also be
    established
Step 2: Data Understanding
   Different business tasks require different sets of data.
   The main activity of the data mining process is to identify the
    relevant data from many available databases.
   Some key points must be considered in the data identification
    and selection phase.
   First and foremost, the analyst should be clear and concise about
    the description of the data mining task so that the most relevant
    data can be identified.
   For example, a retail data mining project may seek to identify
    spending behaviors of female shoppers, who purchase seasonal
    clothes, based on their demographics, credit card transactions,
    and socioeconomic attributes.
   Furthermore, the analyst should build an intimate understanding of
    the data sources
   Example:
         where the relevant data are stored and in what form;
         what the process of collecting the data is—automated versus
          manual;
         who the collectors of it are;
         and how often it is updated etc
   In order to better understand the data, the analyst often uses a
    variety of statistical and graphical techniques
   Such as simple statistical summaries of each variable (e.g., for
    numeric variables, the average, minimum/maximum, median,
    standard deviation are among the calculated measures, etc)
   Data sources for data selection can vary.
   Normally, data sources for business applications include
    demographic data (such as income, education, number of
    households, and age),
     sociographic data (such as hobby, club membership, and
     entertainment),
     transactional data (sales record, credit card spending, and issued
     checks), and so on.
   Data can be categorized as quantitative and qualitative.
Step 3: Data Preparation
   The purpose of data preparation (more commonly called data
    preprocessing) is to take the data identified in the previous step
    and prepare them for analysis by data mining methods.
   Data are generally
     Incomplete (lacking attribute values, lacking certain attributes of
     interest, or containing only aggregate data),
     Noisy (containing errors or outliers), and
  Inconsistent (containing discrepancies in codes or names).
 Four main steps are needed to convert the raw, real-world data
   into minable datasets:
      Data Collection / Selection
      Data Cleaning
      Data Transformation
      Data Reduction
       1. Data Collection / Selection
           The relevant data are collected from the identified
            sources
           The necessary records and variables are selected, and
           The records coming from multiple data sources are
            integrated
     2. Data Cleaning
           The data are cleaned (this step is also known as
            data scrubbing).
            Problematic values in the dataset are identified and
             dealt with.
            Missing values need to be imputed (filled in with a
             most probable value) or ignored.
            Noisy values (i.e., the outliers) need to be identified
             and smoothed out.
            Inconsistencies (unusual values within a
             variable) in the data should be handled using
             domain knowledge and/or expert opinion.
     3. Data Transformation
           Data are transformed for better processing.
           For instance, in many cases, the data are
            normalized in order to mitigate the potential bias of
            one variable (having large numeric values, such as
            for household income)
           Another transformation that takes            place    is
            discretization and/or aggregation.
           In some cases, the numeric variables are converted
            to categorical values (e.g., low, medium, and high);
           In other cases, a nominal variable’s unique value
            range is reduced to a smaller set using concept
            hierarchies.
        4. Data Reduction
              Even though data miners like to have large
               datasets, too much data can also be a problem.
              One can visualize the data commonly used in data
               mining projects as a flat file consisting of two
               dimensions: variables (columns) and cases (rows).
                 In some cases (e.g., image processing and genome
                  projects with complex microarray data), the number
                  of variables can be rather large, and the analyst
                  must reduce the number down to a manageable
                  size.
                  The variables are treated as different dimensions
                   that describe the phenomenon from different
                   perspectives; in data mining, this process is
                   commonly called dimensionality reduction.
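The cleaning and transformation steps above can be sketched in plain Python. This is a minimal, illustrative sketch only: the income figures and the low/medium/high cut-offs are hypothetical, and real projects would typically rely on libraries such as pandas.

```python
def impute_missing(values):
    """Data cleaning: fill missing entries (None) with the mean of the
    observed values (a simple form of imputation)."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values to [0, 1] to mitigate the
    bias of variables with large numeric ranges."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def discretize(value):
    """Data transformation: convert a normalized numeric value into a
    categorical label (cut-offs here are arbitrary)."""
    return "low" if value < 0.33 else "medium" if value < 0.66 else "high"

income = [30000, None, 90000, 60000]      # hypothetical raw column
clean = impute_missing(income)            # [30000, 60000.0, 90000, 60000]
scaled = min_max_normalize(clean)         # [0.0, 0.5, 1.0, 0.5]
labels = [discretize(v) for v in scaled]  # ['low', 'medium', 'high', 'medium']
```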
Step 4: Model Building
   In this step, various modeling techniques are selected and applied
    to an already prepared dataset in order to address the specific
    business need.
   Depending on the business need, the data mining task can be of a
    prediction (either classification or regression), an association, or a
    clustering type.
   Each of these data mining tasks can use a variety of data mining
    methods and algorithms.
   Some of these data mining methods and some of the most popular
    algorithms - decision trees for classification, k-means for
    clustering, and the Apriori algorithm for association rule mining.
Step 5: Testing and Evaluation
   The developed models are assessed and evaluated for their
    accuracy and generality.
   This step assesses the degree to which the selected model (or
    models) meets the business objectives and, if so, to what extent
   Another option is to test the developed model(s) in a real-world
    scenario if time and budget constraints permit.
   Even though the outcomes of the developed models are expected
    to relate to the original business objectives, the testing and
    evaluation step is a critical and challenging task.
   No value is added by the data mining task until the business value
    obtained from discovered knowledge patterns is identified and
    recognized.
   Determining the business value from discovered knowledge
    patterns is somewhat similar to playing with puzzles.
   The success of this identification operation depends on the
    interaction among data analysts, business analysts, and
    decision makers
Step 6: Deployment
   Development and assessment of the models is not the end of the
    data mining project.
   The knowledge gained from such exploration will need to be
    organized and presented in a way that the end user can
    understand and benefit from it.
   Depending on the requirements, the deployment phase can be as
    simple as generating a report or as complex as implementing a
    repeatable data mining process across the enterprise.
   In many cases, it is the customer, not the data analyst, who carries
    out the deployment steps.
   The deployment step may also include maintenance activities for
    the deployed models
  Other Data Mining Standardized Processes and Methodologies
Ranking of Data Mining Processes and Methodologies
DATA MINING METHODS
   A variety of methods are available for performing data mining
    studies, including classification, regression, clustering, and
    association.
   Most data mining software tools employ more than one technique
    (or algorithm) for each of these methods.
1. Classification
   Classification is perhaps the most frequently used data mining
    method
   A popular member of the machine-learning family of techniques,
   Classification learns patterns from past data (a set of information—
    traits, variables, features—on characteristics of the previously
    labeled items, objects, or events) in order to place new instances
    (with unknown labels) into their respective groups or classes.
   For example, one could use classification to predict whether the
    weather on a particular day will be "sunny," "rainy," or "cloudy."
   Popular classification tasks include credit approval (i.e., good or
    bad credit risk),
   store location (e.g., good, moderate, bad), target marketing (e.g.,
    likely customer, no hope),
   fraud detection (i.e., yes, no), and telecommunication (e.g., likely
    to turn to another phone company, yes/no).
   If what is being predicted is a class label (e.g., "sunny," "rainy," or
    "cloudy"), the prediction problem is called a classification,
   whereas if it is a numeric value (e.g., a temperature such as 68°F),
    the prediction problem is called a regression.
   The most common two-step methodology of classification-type
    prediction involves
         model development/training and
         model testing/deployment.
   In the model development phase, a collection of input data,
    including the actual class labels, is used.
   After the model has been trained, it is tested against the holdout
    sample for accuracy assessment and then used to predict classes
    of new data instances (where the class label is unknown)
Several factors are considered in assessing the model, including the
following:
  1. Predictive accuracy.
      The model’s ability to correctly predict the class label of new or
       previously unseen data.
      Prediction accuracy is the most commonly used assessment
       factor
      To compute this measure, actual class labels of a test dataset
       are matched against the class labels predicted by the model.
      The accuracy can then be computed as the accuracy rate,
       which is the percentage of test dataset samples correctly
       classified by the model
  2. Speed.
      The computational costs involved in generating and using the
       model, where faster is deemed to be better.
  3. Robustness.
      The model’s ability to make reasonably accurate predictions,
       given noisy data or data with missing and erroneous values.
  4. Scalability.
      The ability to construct a prediction model efficiently given a
       rather large amount of data.
  5. Interpretability.
      The level of understanding and insight provided by the model
        (e.g., how and/or what the model concludes on certain
        predictions).
Estimating the True Accuracy of Classification Models
   In classification problems, the primary source for accuracy
    estimation is the confusion matrix (also called a classification
    matrix or a contingency table).
   The numbers along the diagonal (from upper left to lower right)
    represent correct decisions, and the numbers outside this
    diagonal represent the errors.
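The diagonal/off-diagonal reading of a two-class confusion matrix, and the accuracy rate computed from it, can be sketched as follows (the actual and predicted labels below are hypothetical test-set results):

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Return (TP, FN, FP, TN) for a two-class problem; TP and TN are
    the diagonal (correct) cells, FN and FP the off-diagonal errors."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

actual    = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
tp, fn, fp, tn = confusion_counts(actual, predicted)   # (2, 1, 1, 2)
accuracy = (tp + tn) / (tp + fn + fp + tn)             # 4 of 6 correct
```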
SIMPLE SPLIT
   The simple split partitions the data into two mutually exclusive
    subsets called a training set and a test set (or holdout set).
   Two-thirds of the data are typically used as the training set; the
    remaining one-third as the test set.
   Training set - used by the inducer (model builder), and the built
    classifier is then tested on the test set
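The simple split described above can be sketched in a few lines of Python (the record indices are hypothetical; the random seed is fixed only to make the example repeatable):

```python
import random

def simple_split(records, train_fraction=2/3, seed=7):
    """Shuffle the records, then partition them into two mutually
    exclusive subsets: a training set and a test (holdout) set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded for repeatability
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = simple_split(range(12))   # 8 training records, 4 test records
```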
K-FOLD CROSS-VALIDATION (rotation estimation)
   The complete dataset is randomly split into k mutually exclusive
    subsets of approximately equal size.
   The classification model is trained and tested k times.
   Each time, it is trained on all but one fold and then tested on the
    remaining single fold.
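The k mutually exclusive folds and the k train/test rounds can be sketched as follows (a minimal sketch over record indices; real splits would first shuffle the records):

```python
def k_fold_rounds(n, k):
    """Split record indices 0..n-1 into k mutually exclusive folds and
    yield (train, test) index lists for each of the k rounds: each round
    trains on all folds but one and tests on the remaining fold."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test

rounds = list(k_fold_rounds(n=6, k=3))   # 3 train/test rounds
```

Across the k rounds, every record appears in the test fold exactly once.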
ADDITIONAL CLASSIFICATION ASSESSMENT METHODOLOGIES
  1. Leave-one-out.
      Similar to k-fold cross-validation, where k equals the number
       of records, so every data point is used for testing exactly once.
      This is a time-consuming methodology, but for small datasets,
       sometimes it is a viable option.
  2. Bootstrapping.
      With bootstrapping, a fixed number of instances from the
       original data are sampled (with replacement) for training and the
       rest of the dataset is used for testing.
      This process is repeated as many times as desired.
  3. Jackknifing.
      Similar to the leave-one-out methodology;
      The accuracy is calculated by leaving one sample out at each
       iteration of the estimation process.
  4. Area under the ROC curve.
      The area under the ROC curve is a graphical assessment
       technique
      The true positive rate is plotted on the Y-axis and the false
       positive rate is plotted on the X-axis.
      The area under the ROC curve determines the accuracy
       measure of a classifier: a value of 1 indicates a perfect
       classifier, whereas 0.5 indicates no better than random chance.
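The two endpoints of that scale can be illustrated by computing the area under the ROC curve directly from its rank interpretation (the labels and scores below are hypothetical classifier outputs):

```python
def auc(labels, scores):
    """Area under the ROC curve, computed directly as the probability
    that a randomly chosen positive instance is scored higher than a
    randomly chosen negative one (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

perfect = auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])  # 1.0: perfect classifier
chance  = auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])  # 0.5: random chance
```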
CLASSIFICATION TECHNIQUES
         Decision tree analysis.
         Statistical analysis.
         Neural networks.
         Case-based reasoning
         Bayesian classifiers
         Genetic algorithms
         Rough sets
2. Cluster Analysis
   Data mining method for classifying items, events, or concepts into
    common groupings called clusters.
   The method is commonly used in biology, medicine, genetics,
    social network analysis, anthropology, archaeology, astronomy,
    character recognition, and even in management information
    system development.
   Cluster analysis is an exploratory data analysis tool for solving
    classification problems.
   The objective is to sort cases (e.g., people, things, events) into
    groups, or clusters, so that the degree of association is strong
    among members of the same cluster and weak among members
    of different clusters.
   Each cluster describes the class to which its members belong.
Cluster analysis results may be used to:
   Identify a classification scheme (e.g., types of customers)
   Suggest statistical models to describe populations
   Indicate rules for assigning new cases to classes for identification,
    targeting, and diagnostic purposes
   Provide measures of definition, size, and change in what were
    previously broad concepts
   Find typical cases to label and represent classes
   Decrease the size and complexity of the problem space for other
    data mining methods
   Identify outliers in a specific domain (e.g., rare-event detection)
DETERMINING THE OPTIMAL NUMBER OF CLUSTERS
   Clustering algorithms usually require one to specify the number of
    clusters to find.
   If this number is not known from prior knowledge, it should be
    chosen in some way
The following are among the most commonly referenced ones:
   Look at the percentage of variance explained as a function of the
    number of clusters; that is, choose a number of clusters so that
    adding another cluster would not give much better modeling of the
    data.
   Set the number of clusters to (n/2)^(1/2), where n is the number of
    data points.
   Use the Akaike Information Criterion, which is a measure of the
    goodness of fit
   Use Bayesian Information Criterion, which is a model-selection
    criterion to determine the number of clusters
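The (n/2)^(1/2) rule of thumb above works out to, for example, k = 10 for 200 data points:

```python
import math

def rule_of_thumb_k(n):
    """Heuristic from the list above: set the number of clusters k
    to (n/2)^(1/2), rounded to the nearest integer."""
    return round(math.sqrt(n / 2))

k = rule_of_thumb_k(200)   # 200 data points -> k = 10
```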
ANALYSIS METHODS
         Statistical methods
         Neural networks
         Fuzzy logic
Each of these methods generally works with one of two general method
classes:
         Divisive.
                All items start in one cluster and are broken apart
          Agglomerative.
                    All items start in individual clusters, and the
                     clusters are joined together.
K-MEANS CLUSTERING ALGORITHM
   The k-means clustering algorithm (where k stands for the
    predetermined number of clusters) is arguably the most referenced
    clustering algorithm.
   It has its roots in traditional statistical analysis. As the name
    implies, the algorithm assigns each data point (customer, event,
    object, etc.) to the cluster whose center (also called centroid) is the
    nearest.
   The center is calculated as the average of all the points in the
    cluster; that is, its coordinates are the arithmetic mean for each
    dimension separately over all the points in the cluster.
   Initialization step: Choose the number of clusters (i.e., the value
    of k).
          Step 1: Randomly generate k points as initial cluster
           centers.
         Step 2: Assign each point to the nearest cluster center.
         Step 3: Recompute the new cluster centers.
   Repetition step: Repeat steps 2 and 3 until some convergence
    criterion is met (usually that the assignment of points to clusters
    becomes stable).
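The steps above can be sketched as a minimal one-dimensional k-means in Python. For repeatability this sketch uses fixed initial centers rather than random ones, a fixed iteration count in place of a convergence test, and hypothetical data points:

```python
def k_means_1d(points, centers, iterations=10):
    """One-dimensional k-means following the steps above."""
    for _ in range(iterations):
        # Step 2: assign each point to the nearest cluster center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each center as the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = k_means_1d(points, centers=[0.0, 5.0])   # k = 2
# centers -> [2.0, 11.0]; clusters -> [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```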
3. Association Rule Mining
   Association rule mining is a popular data mining method
   Association rule mining aims to find interesting relationships
    (affinities) between variables (items) in large databases.
   It is commonly called a market-basket analysis.
   The main idea in market basket analysis is to identify strong
    relationships among different products (or services) that are
    usually purchased together
   The outcome of the analysis is invaluable information that can be
    used to better understand customer-purchase behavior in order to
    maximize the profit from business transactions.
   A business can take advantage of such knowledge by
     (1) putting the items next to each other to make it more convenient
     for the customers to pick them
     (2) promoting the items as a package (do not put one on sale if
     the other(s) is on sale); and
      (3) placing them apart from each other so that the customer has to
      walk the aisles to search for them and, by doing so, potentially
      sees and buys other items
   Applications of market-basket analysis include cross- marketing,
    cross-selling, store design, catalog design,e-commerce site
    design, optimization of online advertising, product pricing, and
    sales/promotion configuration.
   "Are all association rules interesting and useful?"
   In order to answer such a question, association rule mining uses
    two common metrics: support and confidence
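The two metrics can be computed directly from a set of transactions. A minimal sketch over hypothetical market baskets:

```python
transactions = [                      # hypothetical market baskets
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"butter"},
]

def support(itemset):
    """Fraction of all transactions containing every item in the set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent: support of both together divided by
    support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

s = support({"milk", "bread"})        # 3 of 5 baskets -> 0.6
c = confidence({"milk"}, {"bread"})   # every milk basket has bread -> 1.0
```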
   Several algorithms are available for generating association rules.
    Some well-known algorithms include
            Apriori,
            Eclat, and
            FP-Growth.
 These algorithms only do half the job, which is to identify the
  frequent itemsets in the database.
 A frequent itemset is an arbitrary number of items that frequently
  go together in a transaction
 Once the frequent itemsets are identified, they need to be
  converted into rules with antecedent and consequent parts.
 APRIORI ALGORITHM
   The Apriori algorithm is the most commonly used algorithm to
    discover association rules.
   Given a set of itemsets (e.g., sets of retail transactions, each
    listing individual items purchased)
   The algorithm attempts to find subsets that are common to at
    least a minimum number of the itemsets (i.e., complies with a
    minimum support).
   Apriori uses a bottom-up approach, where frequent subsets are
    extended one item at a time (a method known as candidate
    generation, whereby the size of frequent subsets increases
    from one-item subsets to two-item subsets, then three-item
    subsets, etc.),
   Groups of candidates at each level are tested against the data
    for minimum support. The algorithm terminates when no further
    successful extensions are found.
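The bottom-up candidate generation described above can be sketched as follows. This minimal version omits Apriori's subset-pruning optimization, so it is less efficient than the real algorithm but finds the same frequent itemsets; the baskets are hypothetical:

```python
def apriori_frequent_itemsets(transactions, min_support):
    """Frequent-itemset discovery: level-k frequent sets are joined into
    level-(k+1) candidates, each tested against the minimum-support
    threshold, until no extension survives."""
    n = len(transactions)
    def freq(itemset):
        return sum(itemset <= t for t in transactions) / n
    items = sorted({i for t in transactions for i in t})
    # Level 1: frequent single items.
    current = [frozenset([i]) for i in items
               if freq(frozenset([i])) >= min_support]
    frequent = list(current)
    # Extend frequent subsets one item at a time.
    while current:
        candidates = {a | b for a in current for b in current
                      if len(a | b) == len(a) + 1}
        current = [c for c in candidates if freq(c) >= min_support]
        frequent.extend(current)
    return frequent

baskets = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"},
           {"A", "B", "C"}]
result = apriori_frequent_itemsets(baskets, min_support=0.6)
# singletons A, B, C and pairs AB, AC, BC qualify; ABC (2/5) does not
```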
TEXT MINING
Text mining (text data mining or knowledge discovery in textual
databases)
   It is the semiautomated process of extracting patterns from large
    amounts of unstructured data sources.
   Text mining is the same as data mining;
   But with text mining, the input to the process is a collection of
    unstructured (or less structured) data files such as Word
    documents, PDF files, text excerpts, XML files, and so on.
   Text mining has two main steps
          1. Imposing structure on the text-based data sources
          2. Extracting relevant information and knowledge from this
          structured text-based data using data mining techniques and
          tools
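Step 1 above, imposing structure on free text, is often done by building a term-document (bag-of-words) matrix of word counts. This minimal sketch uses two hypothetical customer comments and a naive whitespace tokenizer; real systems also stem words and remove stop words such as "the" and "and":

```python
from collections import Counter

docs = [
    "the delivery was late and the product was damaged",
    "great product and fast delivery",
]

# Step 1: impose structure -- tokenize each document and count terms.
counts = [Counter(d.lower().split()) for d in docs]
vocabulary = sorted(set().union(*counts))
matrix = [[c[w] for w in vocabulary] for c in counts]
# Each row of `matrix` is now a structured, numeric representation of a
# document, ready for the data mining techniques of step 2.
```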
TEXT MINING CONCEPTS AND DEFINITIONS
Benefits of text mining
         Law (court orders)
         Academic research (research articles)
         Finance (quarterly reports)
         Medicine (discharge summaries)
         Biology (molecular interactions)
         Technology (patent files), and
         Marketing (customer comments)
Example
      Free-form text-based interactions with customers in the form of
       complaints
      Electronic communications and e-mail.
      Used to classify and filter junk e-mail
      Used to automatically prioritize e-mail based on importance
       level as well as to generate automatic responses
Application areas of text mining:
   Information extraction -        Identification of key phrases and
    relationships
   Topic tracking - Based on a user profile and documents that a
    user views, text mining can predict other documents
   Summarization - To save time on the part of the reader.
   Categorization - Identifying the main themes of a document and
    then placing them into a predefined set of categories based on
    those themes.
   Clustering - Grouping similar documents without having a
    predefined set of categories.
   Concept linking - Connects related documents by identifying their
    shared concepts
   Question answering - Finding the best answer to a given
    question through knowledge-driven pattern matching.
NATURAL LANGUAGE PROCESSING
   NLP is an important component of text mining
   NLP is a subfield of artificial intelligence and computational
    linguistics.
   NLP studies the problem of "understanding" natural human
    language
   NLP converts human language (such as textual documents)
    into more formal representations (in the form of numeric and
    symbolic data) that are easier for computer programs to
    manipulate.
   The goal of NLP is to move beyond syntax-driven text
    manipulation (which is often called "word counting") to a true
    understanding and processing of natural language
   Natural human language is vague, and a true understanding of its
    meaning requires extensive knowledge of a topic
Challenges associated with the implementation of NLP
   Part-of-speech tagging - It is difficult to mark up terms in a text as
    corresponding to a particular part of speech (such as nouns, verbs,
    adjectives, and adverbs)
   Text segmentation - Some written languages, such as Chinese,
    Japanese, and Thai, do not have single-word boundaries. In these
    instances, the text-parsing task requires the identification of word
    boundaries, which is often a difficult task.
   Word sense disambiguation - Many words have more than one
    meaning. Selecting the meaning that makes the most sense can
    only be accomplished by taking into account the context within
    which the word is used.
   Syntactic ambiguity - The grammar for natural languages is
    ambiguous; that is, multiple possible sentence structures often
    need to be considered.
   Imperfect or irregular input - Foreign or regional accents and
    vocal impediments in speech and typographical or grammatical
    errors in texts make the processing of the language an even more
    difficult task.
   Speech acts - A sentence can often be considered an action by
    the speaker.
                The sentence structure alone may not contain enough
                information to define this action.
                For example, "Can you pass the class?" requests a
                simple yes/no answer, whereas
                "Can you pass the salt?" is a request for a physical
                action to be performed.
   WordNet is a laboriously hand-coded database of English words,
    their definitions, sets of synonyms, and various semantic relations
    between synonym sets.
   It is a major resource for NLP applications, but it has proven to be
    very expensive to build and maintain manually
   An important area of CRM, where NLP is making a significant
    impact, is sentiment analysis.
   Sentiment analysis is a technique used to detect favorable and
    unfavorable opinions toward specific products and services using
    large numbers of textual data sources (e.g., customer feedback in
    the form of Web postings).
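The lexicon-based flavor of sentiment analysis can be sketched in a few lines. The opinion word lists below are invented stand-ins; production systems use large curated lexicons or trained classifiers and also handle negation, intensifiers, and sarcasm.

```python
# Hypothetical opinion lexicons (real systems use far larger lists).
POSITIVE = {"great", "love", "excellent", "fast"}
NEGATIVE = {"poor", "slow", "broken", "disappointed"}

def sentiment_score(posting):
    """Score a customer posting: +1 per positive word, -1 per negative."""
    tokens = posting.lower().split()
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)

def classify(posting):
    score = sentiment_score(posting)
    return "favorable" if score > 0 else "unfavorable" if score < 0 else "neutral"
```

For example, `classify("excellent product fast delivery")` would come out favorable, while a posting full of NEGATIVE lexicon words would come out unfavorable.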
   NLP has successfully been applied to a variety of tasks via
    computer programs to automatically process natural human
    language.
Following are among the most popular of these tasks:
     1. Information retrieval.
            The science of searching for relevant documents, finding
specific information within them, and generating metadata as to their
contents.
     2. Information extraction.
            A type of information retrieval whose goal is to automatically
extract structured information, such as categorized and contextually and
semantically well-defined data from a certain domain, using unstructured
machine readable documents.
     3. Named-entity recognition.
           Also known as entity identification and entity extraction, this
subtask of information extraction seeks to locate and classify atomic
elements in text into predefined categories.
     4. Question answering.
            The task of automatically answering a question posed in
natural language;
           To find the answer to a question, the computer program may
use either a prestructured database or a collection of natural language
documents (a text corpus such as the World Wide Web).
     5. Automatic summarization.
            The creation of a shortened version of a textual document by
a computer program that contains the most important points of the
original document.
     6. Natural language generation.
           Systems convert information from computer databases into
readable human language.
     7. Natural language understanding.
           Systems convert samples of human language into more
formal representations that are easier for computer programs to
manipulate.
     8. Machine translation.
            The automatic translation of text from one human language
to another.
     9. Foreign language reading.
           A computer program that assists a nonnative language
speaker to read a foreign language with correct pronunciation and
accents on different parts of the words.
     10.   Foreign language writing.
             A computer program that assists a nonnative language user
in writing in a foreign language.
     11.   Speech recognition.
           Converts spoken words to machine-readable input.
     12.   Text-to-speech.
           Also called speech synthesis, a computer program
automatically converts normal language text into human speech.
     13.   Text proofing.
            A computer program reads a proof copy of a text in order to
detect and correct any errors.
     14.   Optical character recognition.
             The automatic translation of images of handwritten,
typewritten, or printed text (usually captured by a scanner) into
machine-editable textual documents.
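Task 5 (automatic summarization) can be illustrated with a naive extractive approach: score each sentence by the corpus frequency of its words and keep the top scorers. This is only a sketch with an invented sample text; real summarizers are considerably more sophisticated.

```python
def summarize(text, n_sentences=1):
    """Naive extractive summarizer: rank sentences by the total corpus
    frequency of their words, keep the top ones in original order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = {}
    for s in sentences:
        for w in s.lower().split():
            freq[w] = freq.get(w, 0) + 1
    ranked = sorted(sentences,
                    key=lambda s: -sum(freq.get(w, 0) for w in s.lower().split()))
    chosen = set(ranked[:n_sentences])
    return ". ".join(s for s in sentences if s in chosen) + "."

sample = "Data mining finds patterns. Data mining uses data. Cats sleep."
print(summarize(sample))
```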
TEXT MINING APPLICATIONS
1. Marketing Applications
   Text mining can be used to increase cross-selling and up-selling
    by analyzing the unstructured data generated by call centers
   Blogs, user reviews of products at independent Web sites, and
    discussion board postings are a gold mine of customer sentiments.
   Text mining is used to predict customer perceptions and
    subsequent purchasing behavior.
2. Security Applications
   ECHELON surveillance system – It is assumed to be capable of
    identifying the content of telephone calls, faxes, e-mails, and other
    types of data, and of intercepting information sent via satellites and
    public switched telephone networks.
   EUROPOL developed an integrated system capable of accessing,
    storing, and analyzing vast amounts of structured and unstructured
    data sources in order to track transnational organized crime.
   The U.S. Federal Bureau of Investigation (FBI) and the Central
    Intelligence Agency (CIA) jointly developed a supercomputer data
    and text mining system. The system is expected to create a
    gigantic data warehouse along with a variety of data and text
    mining modules to meet the knowledge-discovery needs of federal,
    state, and local law enforcement agencies.
   Another application area of text mining is deception detection.
3. Biomedical Applications
   Experimental techniques such as DNA microarray analysis, serial
    analysis of gene expression (SAGE), and mass spectrometry
    proteomics, among others, are generating large amounts of data
    related to genes and proteins.
   Knowing the location of a protein within a cell can help to
    determine its potential as a drug target
4. Academic Applications
   Text Mining provides semantic cues to machines to answer
    specific queries.
TEXT MINING PROCESS
            Context diagram for the text mining process
As the context diagram indicates:
   The input into the text-based knowledge discovery process is the
    unstructured as well as structured data collected, stored, and
    made available to the process.
   The output of the process is the context-specific knowledge that
    can be used for decision making.
   The controls, also called the constraints (inward connection to the
    top edge of the box), of the process include
                 software and hardware limitations,
                 privacy issues, and the
                difficulties related to processing of the text
   The mechanisms of the process include proper techniques,
    software tools, and domain expertise.
   The text mining process can be broken down into three
    consecutive tasks,
          Task 1 : Establish the Corpus
          Task 2: Create the Term–Document Matrix
          Task 3: Extract Knowledge
   each of which has specific inputs to generate certain outputs.
               The three-step text mining process
Task 1: Establish the corpus
         Collect all relevant unstructured data
          (e.g., textual documents, XML files, emails, Web pages,
          short notes, voice recordings…)
         Digitize, standardize the collection
          (e.g., all in ASCII text files)
         Place the collection in a common place
          (e.g., in a flat file, or in a directory as separate files)
Task 2: Create the Term–by–Document Matrix
   The digitized and organized documents (the corpus) are used to
    create the term–document matrix (TDM).
   In the TDM, rows represent the documents and columns represent
    the terms.
   The relationships between the terms and documents are
    characterized by indices (i.e., a relational measure that can be as
    simple as the number of occurrences of the term in respective
    documents).
   Terms:         investment  project     software     development  SAP  ...
   Documents      risk        management  engineering
   Document 1         1                                     1
   Document 2                     1
   Document 3                                  3                     1
   Document 4                     1
   Document 5                                  2            1
   Document 6         1                                     1
   ...
          Should all terms be included?
               Stop words, include words
               Synonyms, homonyms
               Stemming
          What is the best representation of the indices (values in
           cells)?
                   Row counts; binary frequencies; log frequencies;
                   Inverse document frequency
         TDM is a sparse matrix. How can we reduce the
          dimensionality of the TDM?
                Manual – a domain expert goes through it
                Eliminate terms with very few occurrences in very
                  few documents (?)
                Transform the matrix using singular value
                  decomposition (SVD)
                 SVD is similar to principal component analysis
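Task 2 and the weighting choices above can be sketched in a few lines of Python. The mini-corpus and stop-word list below are hypothetical (echoing the terms in the matrix above); cells hold raw counts, and tfidf() re-weights them by inverse document frequency. For SVD-based dimensionality reduction one would then feed the resulting matrix to, e.g., numpy.linalg.svd.

```python
import math

# Hypothetical mini-corpus.
corpus = {
    "doc1": "investment risk in software development",
    "doc2": "project management and software engineering",
    "doc3": "software engineering for risk management",
}
stop_words = {"in", "and", "for"}        # toy stop-word list

def build_tdm(corpus):
    """Term-document matrix as nested dicts of raw term counts."""
    tdm = {}
    for doc, text in corpus.items():
        counts = {}
        for term in text.lower().split():
            if term not in stop_words:
                counts[term] = counts.get(term, 0) + 1
        tdm[doc] = counts
    return tdm

def tfidf(tdm):
    """Re-weight raw counts by inverse document frequency, so a term
    that appears in every document drops to zero weight."""
    n_docs = len(tdm)
    df = {}
    for counts in tdm.values():
        for term in counts:
            df[term] = df.get(term, 0) + 1
    return {doc: {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}
            for doc, counts in tdm.items()}
```

With this corpus, "software" occurs in all three documents, so its TF-IDF weight is zero, while a rare term such as "investment" keeps a positive weight.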
Task 3: Extract patterns/knowledge
   Classification (text categorization)
   Clustering (natural groupings of text)
        Improve search recall
        Improve search precision
        Scatter/gather
        Query-specific clustering
   Association
   Trend Analysis (…)
TEXT MINING TOOLS
   Following are some of the popular text mining tools, which we
    classify as
         Commercial software tools
         Free software tools
  1. Commercial Software Tools
   The following are some of the most popular software tools used for
    text mining.
   Note that many companies offer demonstration versions of their
    products on their Web sites.
     1. ClearForest offers text analysis and visualization tools
     (clearforest.com).
     2. IBM Intelligent Miner Data Mining Suite, now fully integrated
     into IBM’s InfoSphere Warehouse software, includes data and
     text mining tools (ibm.com).
     3. Megaputer Text Analyst offers semantic analysis of free-form
     text, summarization, clustering, navigation, and natural language
     retrieval with search dynamic refocusing (megaputer.com).
     4. SAS Text Miner provides a rich suite of text processing and
     analysis tools (sas.com).
     5. SPSS Text Mining for Clementine extracts key concepts,
     sentiments, and relationships from call-center notes, blogs, e-
     mails, and other unstructured data and converts them to a
     structured format for predictive modeling (spss.com).
      6. The Statistica Text Mining engine provides easy-to-use text
      mining functionality with exceptional visualization capabilities
      (statsoft.com).
      7. VantagePoint provides a variety of interactive graphical views
      and analysis tools with powerful capabilities to discover knowledge
      from text databases (vpvp.com).
     8. The WordStat analysis module from Provalis Research analyzes
     textual information such as responses to open-ended questions
     and interviews (provalisresearch.com).
2. Free Software Tools
Free software tools, some of which are open source, are available from
a number of nonprofit organizations:
     1. GATE is a leading open source toolkit for text mining. It has a
     free open source framework (or SDK) and graphical development
     environment (gate.ac.uk).
     2. RapidMiner has a community edition of its software that includes
     text mining modules (rapid-i.com).
     3. LingPipe is a suite of Java libraries for the linguistic analysis of
     human language (alias-i.com/lingpipe).
     4. S-EM (Spy-EM) is a text classification system that learns from
     positive and unlabeled examples (cs.uic.edu/~liub/S-EM/S-EM-
     download.html).
     5. Vivisimo/Clusty is a Web search and text-clustering engine
     (clusty.com).
WEB MINING OVERVIEW
   Web mining (or Web data mining) is the process of discovering
    intrinsic relationships (i.e., interesting and useful information) from
    Web data, which are expressed in the form of textual, linkage, or
    usage information.
   The Web is perhaps the world’s largest data and text repository.
   The amount of information on the Web is growing rapidly every
    day.
   A lot of interesting information can be found online:
         whose homepage is linked to which other pages
         how many people have links to a specific Web page
         how a particular site is organized
The Web also poses great challenges for effective and efficient
knowledge discovery:
  1. The Web is too big for effective data mining
  2. The Web is too complex.
             Web pages lack a unified structure. They contain far more
authoring-style and content variation than traditional text documents.
  3. The Web is too dynamic.
      Not only does the Web grow rapidly, but its content is constantly
being updated. Blogs, news stories, stock market results, weather
reports
  4. The Web is not specific to a domain.
     Web users have very different backgrounds, interests, and usage
purposes.
  5. The Web has everything.
      Only a small portion of the information on the Web is truly relevant
or useful to someone
     Three main areas of Web mining:
         Web content mining
         Web structure mining
         Web usage mining.
1. WEB CONTENT MINING
   Web content mining refers to the extraction of useful information
    from Web pages.
   The documents may be extracted in some machine-readable
    format so that automated techniques can generate some
    information about the Web pages.
   Web crawlers are used to read through the content of a Web site
    automatically.
 The information gathered may include document characteristics
  similar to what is used in text mining, but it may include additional
  concepts such as the document hierarchy
 Web content mining can also be used to enhance the results
  produced by search engines
 In addition to text, Web pages also contain hyperlinks pointing one
  page to another.
 Hyperlinks contain a significant amount of hidden human
  annotation
 When a Web page developer includes a link pointing to another
  Web page, this can be regarded as the developer’s endorsement
  of the other page.
 Therefore, the vast amount of Web linkage information provides a
  rich collection of information about the relevance, quality, and
  structure of the Web’s contents.
 A search on the Web to obtain information on a specific topic
  usually returns a few relevant, high-quality Web pages and a larger
  number of unusable Web pages.
 Use of an index based on authoritative pages will improve the
  search results and the ranking of relevant pages.
 The idea of authority stems from earlier information retrieval work
  using citations among journal articles to evaluate the impact of
  research papers
 There are significant differences between the citations in research
  articles and hyperlinks on Web pages:
      not every hyperlink represents an endorsement
      one authority will rarely have its Web page point to rival
       authorities in the same domain
      authoritative pages are seldom particularly descriptive
   The structure of Web hyperlinks has led to another important
    category of Web pages called a hub.
   A hub is one or more Web pages that provide a collection of
    links to authoritative pages.
   Hub pages provide links to a collection of prominent sites on a
    specific topic of interest.
   A hub could be a list of recommended links on an individual’s
    homepage, recommended reference sites on a course Web page
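The hyperlinks that authority and hub analysis operates on must first be extracted from page content. A minimal sketch using Python's standard html.parser module; the inline HTML string (with made-up URLs) stands in for content a crawler would actually fetch over HTTP.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from a page - the raw
    material for authority/hub analysis."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._in_anchor = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_anchor = False

# Hypothetical page content; a real crawler would fetch this via HTTP.
page = ('<p>See <a href="http://example.org/a">Page A</a> and '
        '<a href="http://example.org/b">Page B</a>.</p>')
parser = LinkExtractor()
parser.feed(page)
```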
Hyperlink-induced topic search (HITS)
   HITS is a link analysis algorithm that rates Web pages using the
    hyperlink information contained within them.
   The HITS algorithm collects a base document set for a specific
    query. It then recursively calculates the hub and authority values
    for each document.
   To gather the base document set, a root set that matches the
    query is fetched from a search engine.
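A compact sketch of the HITS iteration just described, run on a tiny hypothetical link graph (page names are invented). Each round recomputes authority scores from incoming hub scores, hub scores from outgoing authority scores, and then normalizes both.

```python
def hits(graph, iterations=50):
    """HITS over a link graph {page: [pages it links to]}.
    Returns (authority, hub) score dicts, each normalized to sum to 1."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of p: sum of hub scores of pages that link to p.
        auth = {p: sum(hub[q] for q in pages if p in graph.get(q, []))
                for p in pages}
        # Hub of p: sum of authority scores of the pages p links to.
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        a_norm, h_norm = sum(auth.values()), sum(hub.values())
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub

# Hypothetical base set: h1 and h2 act as hubs, a1 and a2 as authorities.
web = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "a1": [], "a2": []}
auth, hub = hits(web)
```

On this graph the heavily linked-to pages (a1, a2) end up with the authority mass, while the pages that link out to them (h1, h2) end up with the hub mass.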
2.WEB STRUCTURE MINING
   It is the process of extracting useful information from the links
    embedded in Web documents.
   It is used to identify authoritative pages and hubs.
   Just as links going to a Web page may indicate a site’s popularity
    (or authority), links within the Web page (or the complete Web site)
    may indicate the depth of coverage of a specific topic.
   Analysis of links is very important in understanding the
    interrelationships among large numbers of Web pages, leading to
    a better understanding of a specific Web community
3.WEB USAGE MINING
   Web usage mining is the extraction of useful information from data
    generated through Web page visits and transactions.
   Three types of data are generated through Web page visits:
          1. Automatically generated data stored in server access logs,
          referrer logs, agent logs, and client-side cookies
          2. User profiles
          3. Metadata, such as page attributes, content attributes, and
          usage data
   Analysis of the information collected by Web servers can help us
    better understand user behavior. Analysis of this data is often
    called clickstream analysis.
   By using the data and text mining techniques, a company might be
    able to determine interesting patterns.
   Clickstream analysis:
         Useful in determining where to place online advertisements.
         Clickstream analysis might also be useful for knowing when
          visitors access a site.
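A toy clickstream analysis over hypothetical server-log records (visitor, hour, page): page-view counts suggest where to place ads, hour counts show when visitors arrive, and directly-following page pairs give a simple path analysis.

```python
from collections import Counter

# Hypothetical server-log records: (visitor id, hour of visit, page).
log = [
    ("v1", 9,  "/home"), ("v1", 9,  "/products"), ("v1", 9,  "/checkout"),
    ("v2", 14, "/home"), ("v2", 14, "/products"),
    ("v3", 14, "/home"),
]

page_views = Counter(page for _, _, page in log)   # where to place ads
busy_hours = Counter(hour for _, hour, _ in log)   # when visitors arrive

# Simple path analysis: which page directly follows which, per visitor.
transitions = Counter()
last_page = {}
for visitor, _, page in log:
    if visitor in last_page:
        transitions[(last_page[visitor], page)] += 1
    last_page[visitor] = page
```

Real clickstream data would of course come from server access logs, referrer logs, and cookies rather than a hard-coded list.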
Process of extracting knowledge from clickstream data
Applications of Web mining:
     1. Determine the lifetime value of clients.
     2. Design cross-marketing strategies across products.
     3. Evaluate promotional campaigns.
     4. Target electronic ads and coupons at user groups based on
     user access patterns.
     5. Predict user behavior based on previously learned rules and
     users’ profiles.
     6. Present dynamic information to users based on their interests
     and profiles
Web usage mining software
SPATIAL DATA MINING
Spatial data mining is the process of discovering interesting, useful,
non-trivial patterns from large spatial datasets.
PROCESS MINING
   "The idea of process mining is to discover, monitor and improve
    real processes (i.e., not assumed processes) by extracting
    knowledge from event logs readily available in today’s
    (information) systems.
   Process mining includes (automated) process discovery (i.e.,
    extracting process models from an event log), conformance
    checking (i.e., monitoring deviations by comparing model and log),
    social network/organizational mining, automated construction of
    simulation models, model extension, model repair, case prediction,
    and history-based recommendations."
Events and event logs:
   It is assumed that an event refers to a process activity or a task,
    which is a well-defined step in the process and is related to a
    particular case, i.e. process instance.
   Another assumption is that these events are ordered.
   The case or process instance is a specific occurrence or execution
    of a business process, while activity is an operation, part of a case,
    that is being executed.
   An event log stores information about cases and activities, but also
    information about event performers, event timestamps (moment
    when the event is triggered) or data elements recorded with the
    event
   Process mining activities such as extracting and filtering data from
    information systems are not trivial.
   Data may be distributed over a variety of sources, event data may
    be incomplete, an event log may contain outliers, logs may contain
    events at different level of granularity, etc.
   Process Mining Manifesto gives following guidelines referring to
    the event data:
           events should be trustworthy,
          event logs should be complete,
          any recorded event should have well-defined semantics, and
          the event data should be safe
   Process mining types
           Three process mining types:
                 discovery,
                 conformance and
                 enhancement.
  1. Process discovery:
      A process discovery technique produces a process model from
       an event log, without using any a-priori information about the
       process; it is the most prominent process mining technique.
  2. Conformance
      Conformance compares an existing process model with an
       event log of the same process
      It is used to check if reality, as recorded in the log, conforms to
       the model and vice versa.
      Conformance checking can be used to:
             check the quality of documented processes (assess
              whether they describe reality accurately);
            to identify deviating cases and understand what they have
             in common; for auditing purposes;
            to judge the quality of a discovered process model
  3. Enhancement :
      Enhancement extends or improves an existing process model
       using information about the actual process recorded in event
       log, with the aim of changing or extending the a-priori model.
       For instance, by using time stamps in the event log one can
        extend the model to show bottlenecks, service levels,
        throughput times and frequencies
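Discovery and conformance checking can be sketched on a toy event log (case names and activities are invented). The "model" here is just the directly-follows graph, a deliberately simplified stand-in for the richer models (e.g., Petri nets) that real discovery algorithms such as those in ProM produce.

```python
from collections import defaultdict

# Hypothetical event log: each case is an ordered list of activities.
event_log = {
    "case1": ["register", "check", "decide", "notify"],
    "case2": ["register", "check", "check", "decide", "notify"],
    "case3": ["register", "decide", "notify"],
}

def discover_dfg(log):
    """Process discovery sketch: count directly-follows pairs (a, b),
    meaning activity b was observed immediately after activity a."""
    dfg = defaultdict(int)
    for trace in log.values():
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dict(dfg)

def conforms(trace, dfg):
    """Naive conformance check: every step of the trace must be a
    directly-follows relation present in the discovered model."""
    return all((a, b) in dfg for a, b in zip(trace, trace[1:]))

model = discover_dfg(event_log)
```

A trace that replays only observed directly-follows relations conforms; a trace with an unseen step (say, jumping straight from "check" to "notify") deviates and would be flagged.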
Process mining software tools and techniques:
   Many contemporary process mining software tools were developed
    and are continuously improved, such as: Celonis (Celonis GmbH),
    Disco (Fluxicon), EDS (StereoLOGIC Ltd), Fujitsu (Fujitsu Ltd),
    Icaro (Icaro Tech), Icris (Icris), LANA (Lana Labs), Minit (Gradient
    ECM), myInvenio (Cognitive Technology), ProcessGold (Processgold
    International B.V.), ProM (Open Source, hosted at TU/e), ProM Lite
    (Open Source, hosted at TU/e), QPR (QPR), RapidProM (Open
    Source, hosted at TU/e), Rialto (Exeura), SNP (SNP Schneider-
    Neureither & Partner AG), and ARIS PPM (Software AG).
   Currently, the most prominent open-source tool is ProM (Process
    Mining Framework), as it offers a variety of plug-ins that enable
    application of various algorithms and the latest developments in
    process mining research.
   Three main categories of process mining algorithms:
         Deterministic algorithms,
         Heuristic algorithms and
         Genetic algorithms.
   Deterministic algorithms always generate repeatable models, as
    all of the data has to be known and the process mining output is
    constant for the given input of variables.
  1.2. BUSINESS INTELLIGENCE PROCESS
What is Business Intelligence?
      BI (Business Intelligence) is a set of processes, architectures, and
technologies that convert raw data into meaningful information that
drives profitable business actions.
      It is a suite of software and services to transform data into
      actionable intelligence and knowledge.
      BI has a direct impact on an organization's strategic, tactical, and
      operational business decisions.
      BI supports fact-based decision making using historical data
      rather than assumptions and gut feeling.
      BI tools perform data analysis and create reports, summaries,
      dashboards, maps, graphs, and charts to provide users with
      detailed intelligence about the nature of the business.
Why is BI important?
   Measurement: creating KPI (Key Performance Indicators) based
    on historic data
       Identify and set benchmarks for varied processes.
      With BI systems organizations can identify market trends and
      spot business problems that need to be addressed.
      BI helps with data visualization, which enhances data quality and
      thereby the quality of decision making.
      BI systems can be used not just by large enterprises but also by
      SMEs (Small and Medium Enterprises).
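The KPI/benchmark idea above can be shown as a tiny sketch: compute month-over-month growth from historic data and flag it against a target. The monthly revenue figures and the 5% growth benchmark are invented for illustration.

```python
# Hypothetical historic monthly revenue (in thousands).
history = {"Jan": 120.0, "Feb": 132.0, "Mar": 125.4}

def growth_kpi(history):
    """Month-over-month revenue growth (%), a simple KPI from historic data."""
    months = list(history)
    return {m2: round((history[m2] - history[m1]) / history[m1] * 100, 1)
            for m1, m2 in zip(months, months[1:])}

kpi = growth_kpi(history)
benchmark = 5.0   # assumed target: 5% growth per month
flags = {m: ("on track" if g >= benchmark else "needs attention")
         for m, g in kpi.items()}
```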
How Business Intelligence systems are implemented?
Here are the steps:
Step 1:
      Raw data from corporate databases is extracted. The data could
be spread across multiple heterogeneous systems.
Step 2:
      The data is cleaned, transformed, and loaded into the data
warehouse. Tables can be linked, and data cubes are formed.
Step 3:
      Using the BI system, the user can ask queries, request ad-hoc
reports, or conduct any other analysis.
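The three steps can be sketched end-to-end with Python's built-in sqlite3 module standing in for the warehouse; the two source systems and their records are hypothetical.

```python
import sqlite3

# Step 1: raw data extracted from two heterogeneous source systems.
source_a = [("P1", "Laptop", "1200"), ("P2", "Mouse", "25")]   # product master
source_b = [("P1", 3), ("P2", 10), ("P1", 1)]                  # units sold

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (product TEXT, name TEXT, revenue REAL)")

# Step 2: clean/transform (cast prices, join the sources) and load.
prices = {pid: (name, float(price)) for pid, name, price in source_a}
for pid, units in source_b:
    name, price = prices[pid]
    warehouse.execute("INSERT INTO sales VALUES (?, ?, ?)",
                      (pid, name, price * units))

# Step 3: the user runs an ad-hoc query against the warehouse.
rows = warehouse.execute(
    "SELECT name, SUM(revenue) FROM sales GROUP BY name ORDER BY 2 DESC"
).fetchall()
```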
Examples of Business Intelligence Systems Used in Practice
   In an Online Transaction Processing (OLTP) system, information
    that could be fed into the product database could be:
        Add a product line
        Change a product price
   Correspondingly, in a Business Intelligence system, a query that
    would be executed for the product subject area could be: did the
    addition of the new product line or the change in product price
    increase revenues?
   In the advertising database of an OLTP system, queries that
    could be executed include:
        Change the advertisement options
        Increase the radio budget
Four types of BI users
Following are the four key players who use a Business Intelligence
system:
1. The Professional Data Analyst:
     The data analyst is a statistician who always needs to drill deep
down into data. BI system helps them to get fresh insights to develop
unique business strategies.
2. The IT users:
      The IT user also plays a dominant role in maintaining the BI
infrastructure.
3. The head of the company:
     CEO or CXO can increase the profit of their business by improving
operational efficiency in their business.
4. The Business Users:
         Business intelligence users can be found across the
          organization. There are mainly two types of business users:
         The casual business intelligence user
         The power user
Advantages of Business Intelligence
Here are some of the advantages of using Business Intelligence System:
1. Boost productivity
      With a BI program, it is possible for businesses to create reports
with a single click, which saves lots of time and resources. It also
allows employees to be more productive on their tasks.
2. To improve visibility
      BI also helps to improve the visibility of business processes and
makes it possible to identify any areas that need attention.
3. Fix Accountability:
     BI system assigns accountability in the organization as there must
be someone who should own accountability and ownership for the
organization's performance against its set goals.
4. It gives a bird's eye view:
     BI system also helps organizations as decision makers get an
overall bird's eye view through typical BI features like dashboards and
scorecards.
5. It streamlines business processes:
     BI takes out all complexity associated with business processes. It
also automates analytics by offering predictive analysis, computer
modeling, benchmarking and other methodologies.
6. It allows for easy analytics:
      BI software has been democratized, allowing even nontechnical,
non-analyst users to collect and process data quickly. This puts the
power of analytics into the hands of many people.
BI System Disadvantages
1. Cost:
Business intelligence can prove costly for small as well as for medium-
sized enterprises. The use of such a system may be expensive for
routine business transactions.
2. Complexity:
Another drawback of BI is the complexity of implementing the data
warehouse. It can be so complex that it can make business techniques
rigid to deal with.
3. Limited use
Like many new technologies, BI was first developed with the buying
capacity of large firms in mind. Therefore, a BI system is still not
affordable for many small and medium-sized companies.
4. Time Consuming Implementation
It takes almost one and a half years for a data warehousing system to
be completely implemented. Therefore, it is a time-consuming process.
Trends in Business Intelligence
Artificial Intelligence:
      Gartner's report indicates that AI and machine learning now take
on complex tasks once done by human intelligence. This capability is
being leveraged to come up with real-time data analysis and dashboard
reporting.
Collaborative BI:
      BI software combined with collaboration tools, including social
media, and other latest technologies enhance the working and sharing
by teams for collaborative decision making.
Embedded BI:
        Embedded BI allows the integration of BI software or some of its
features into another business application for enhancing and extending
its reporting functionality.
Cloud Analytics:
      BI applications will soon be offered in the cloud, and more
businesses will be shifting to this technology. As per predictions, within
a couple of years the spending on cloud-based analytics will grow 4.5
times faster.
                                                              Prepared by,
                                                       D.DURAI KUMAR,
                                                Head Of the Department,
                                 Department Of Information Technology,
                                                                    GTEC.