
UNIT-2

BI – DATA MINING & WAREHOUSING

Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization
Data Mining in Business Analytics: Definition, Techniques, and Benefits

The term “data mining” may conjure images of hackers getting access to your
information or spying on you. But the truth is that data mining plays a very
important and positive role for many individuals and organizations.
What is Data Mining?

• Data mining allows businesses to identify key sets of information and compare them to past data. This process, in turn, can lead to better decision-making and strategy. It can also help with other objectives, such as boosting sales and marketing more effectively.

• Data mining relies on human intervention, whereas machine learning is meant to be initiated by a person and then learn on its own. However, machine learning processes are often used within data mining for automation purposes.

• Data mining, data analysis, artificial intelligence, machine learning, and similar methods are typically combined in business intelligence processes so that organizations can gain more insight into their customers and potential sales outcomes.
CRISP-DM PROCESS IN DATA MINING
Overview of the Data Mining Process (CRISP-DM Phases)
• Business Understanding
The first step to successful data mining is understanding the overall objectives of the business and
how to convert those objectives into a data mining problem and plan. Without an understanding of
a business’s ultimate goal, you may not be able to design a good data mining algorithm. For
example, a supermarket might want to use data mining to learn more about its customers. The
business understanding comes when the supermarket discovers which products customers are
buying the most.
• Data Understanding
After you know what a business is looking for, it’s time to collect data. There are many complex
ways that organizations can obtain, organize, store, and manage data. Data mining involves
becoming familiar with the data, identifying issues, gaining insights, and observing subsets of
information. For example, a supermarket may use a rewards program where customers can input
their phone number at purchase, giving the supermarket access to their shopping data.
• Data Preparation
Data preparation means getting the information ready for production, and it tends to be the most intensive part
of data mining. It typically includes converting computer-language data into a user-friendly and
quantifiable format. Transforming and cleaning the data for modeling is key during data
preparation.
• Modeling
In the modeling phase, mathematical models are used to search for patterns in the data.
Businesses may use one of several techniques for the same set of data. Even though
modeling involves a fair amount of trial and error, it’s still a crucial phase in data mining.
• Evaluation
When the model is complete, it needs to be carefully evaluated and reviewed to ensure that
it meets business objectives. At the end of this phase, a final decision about the data mining
results is made. In the supermarket example, the results will provide a list of relevant
customer purchases that the business can then use for its operational planning and goals.
• Deployment
Deployment can be as simple or as complex as a business deems necessary, depending on
the amount and nature of the data. For instance, it could entail generating a single report or
creating a repeatable data mining process to occur regularly.
After the data mining process has been completed, a business can finalize its decisions and
implement changes accordingly.
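As a rough illustration of the Data Preparation, Modeling, and Evaluation phases, the minimal Python sketch below uses pandas and scikit-learn on a hypothetical supermarket dataset; the file name and column names are illustrative placeholders, not part of the source material.

```python
# Minimal CRISP-DM style sketch: prepare data, model, evaluate (illustrative).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Data preparation: load and clean the (hypothetical) transaction data
df = pd.read_csv("supermarket_transactions.csv")      # assumed source file
df = df.dropna(subset=["basket_value", "visits_per_month", "uses_rewards_card"])

X = df[["basket_value", "visits_per_month"]]          # predictor attributes
y = df["uses_rewards_card"]                           # target the business cares about

# Modeling: hold out data so evaluation reflects unseen customers
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Evaluation: compare predictions against the business objective
print("Hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```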
How Does Data Mining Inform Business Analytics?

• Organizations that engage in data mining can gain a competitive advantage with
a better understanding of consumer behavior, efficient oversight of business
operations, improved customer acquisition, and new growth opportunities.

• Some businesses may look for better ways to optimize their supply chain while
others may opt to improve marketing outreach techniques. Whatever an
organization’s goals might be, the process of data mining can help them make
more effective decisions through comprehensive analysis.
Benefits of Data Mining

While they can differ across industries, the benefits of data mining commonly include the
following:

• Cost effective. Organizations that invest in efficient methods of data mining can save
money in the long run.

• Reliable. Many—if not all—types of data mining are designed to produce dependable,
actionable results.

• Quantifiable. Information pulled from data mining can be easily measured and compared
against other sets of data.

• Strategy promoting. Data mining is instrumental in fostering new, improved strategies for businesses to test and prove.
Data Mining Techniques in Business Analytics

Five effective techniques that many businesses practice.


• Classification
This complex data mining technique takes attributes of data and moves them into discernable categories. For example, supermarket data
miners may use classification to group the types of groceries customers buy, including produce, meat, baked goods, and more. These
classifications help the business learn more about what customers prioritize as they shop.
• Clustering
This technique is akin to classification as it involves grouping data sets together based on their similarities. Cluster groups are more
generalized than classification groups, making clustering a more flexible option for data mining. In the supermarket example, simple cluster
groups might be food and nonfood items.
• Association Rules
Association in data mining is about tracking patterns based on linked variables. In the supermarket example, this may mean that many
customers who buy one specific item may also buy a second, related item. The business then learns to group certain products together. You
may encounter association when you see a “people also bought this” section while shopping online.
• Regression Analysis
Regression analysis is used in planning and modeling to estimate the value or likelihood of a specific variable. For instance, the supermarket may project
price points based on product availability, consumer demand, and industry competition. Regression analysis helps identify the relationship
between variables in a data set.
• Anomaly and Outlier Detection
For many data mining cases, merely seeing an overarching pattern might not be enough. You may need to be able to identify and
understand outliers in your data as well. For example, if most supermarket shoppers are generally female, but one week in February skews
mostly male, the business can use anomaly detection to investigate that outlier to understand what’s behind it.
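To make the last technique concrete, here is a minimal anomaly-detection sketch using scikit-learn's IsolationForest on invented weekly shopper figures; the numbers and column names are purely illustrative.

```python
# Minimal anomaly/outlier detection sketch with IsolationForest (invented data).
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical weekly share of male shoppers; one week in February skews high
weeks = pd.DataFrame({
    "week":           range(1, 11),
    "male_share_pct": [38, 40, 37, 39, 41, 72, 38, 40, 39, 37],
})

detector = IsolationForest(contamination=0.1, random_state=0)
weeks["outlier"] = detector.fit_predict(weeks[["male_share_pct"]])  # -1 marks outliers

print(weeks[weeks["outlier"] == -1])   # the anomalous week to investigate
```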
Data Mining Architecture
The significant components of data mining systems are a data source, data mining engine, data warehouse server, the
pattern evaluation module, graphical user interface, and knowledge base.
• Data Source:
The actual source of data is the Database, data warehouse, World Wide Web (WWW), text files, and other documents. You need a
huge amount of historical data for data mining to be successful. Organizations typically store data in databases or data
warehouses. Data warehouses may comprise one or more databases, text files, spreadsheets, or other repositories of data.
Sometimes, even plain text files or spreadsheets may contain information. Another primary source of data is the World Wide Web
or the internet.

• Different processes:
Before passing the data to the database or data warehouse server, the data must be cleaned, integrated, and selected. As the
information comes from various sources and in different formats, it can't be used directly for the data mining procedure because
the data may not be complete and accurate. So, the data first needs to be cleaned and unified. More information than needed
will be collected from various data sources, and only the data of interest will have to be selected and passed to the server. These
procedures are not as easy as we think. Several methods may be performed on the data as part of selection, integration, and
cleaning.

• Database or Data Warehouse Server:


The database or data warehouse server consists of the original data that is ready to be processed. Hence, the server is responsible for
retrieving the relevant data for data mining as per the user's request.

• Data Mining Engine:


The data mining engine is a major component of any data mining system. It contains several modules for operating data mining
tasks, including association, characterization, classification, clustering, prediction, time-series analysis, etc.
In other words, the data mining engine is the core of the data mining architecture. It comprises the instruments and software used to
obtain insights and knowledge from data collected from various data sources and stored within the data warehouse.
• Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for measuring how interesting a discovered pattern is, typically by using a
threshold value. It collaborates with the data mining engine to focus the search on interesting patterns.
This component commonly employs interestingness measures that cooperate with the data mining modules to focus the search
towards interesting patterns. It might use an interestingness threshold to filter out discovered patterns. Alternatively,
the pattern evaluation module might be integrated with the mining module, depending on the implementation of
the data mining techniques used. For efficient data mining, it is highly recommended to push the evaluation of
pattern interestingness as deep as possible into the mining procedure so that the search is confined to only interesting patterns (see the small sketch after this component list).

• Graphical User Interface:


The graphical user interface (GUI) module communicates between the data mining system and the user. This
module helps the user to easily and efficiently use the system without knowing the complexity of the process. This
module cooperates with the data mining system when the user specifies a query or a task and displays the results.

• Knowledge Base:
The knowledge base is helpful in the entire process of data mining. It might be helpful to guide the search or
evaluate the interestingness of the resulting patterns. The knowledge base may even contain user views and data from user
experiences that might be helpful in the data mining process. The data mining engine may receive inputs from the
knowledge base to make the result more accurate and reliable. The pattern assessment module regularly interacts
with the knowledge base to get inputs, and also update it.
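As a toy illustration of how a pattern evaluation module might apply an interestingness threshold, the rules and threshold values below are invented; discovered patterns are simply filtered before being shown to the user.

```python
# Toy pattern-evaluation sketch: keep only rules above interestingness thresholds.
discovered_rules = [
    {"rule": "bread -> butter", "support": 0.21, "confidence": 0.78},
    {"rule": "milk -> sugar",   "support": 0.05, "confidence": 0.40},
    {"rule": "chips -> soda",   "support": 0.15, "confidence": 0.65},
]

MIN_SUPPORT, MIN_CONFIDENCE = 0.10, 0.60   # thresholds chosen by the analyst

interesting = [r for r in discovered_rules
               if r["support"] >= MIN_SUPPORT and r["confidence"] >= MIN_CONFIDENCE]
for r in interesting:
    print(r["rule"], r["support"], r["confidence"])
```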
Difference between Data Lake and Data Warehouse
Types of data mining techniques
• Classification
The most common technique is classification. To do this, identify a target variable and then divide
that variable into categories at an appropriate level of detail. For example, the variable ‘occupation
level’ might be split into ‘entry-level’, ‘associate’, and ‘senior’. With other fields such as age and
education level, you can train your data model to predict what occupation level a person is most
likely to have. If you add an entry for a recent 22-year-old graduate, the data model could
automatically classify that person into an ‘entry-level’ position (see the sketch after this list). Insurance and financial institutions
such as PEMCO Insurance have used classification to train their algorithms to flag fraud and to monitor
claims.

• Clustering
Clustering is another common technique, grouping records, observations, or cases by similarity.
There won’t be a target variable like in classification. Instead, clustering just means separating the
data set into subgroups. This method can include grouping records of users by geographic area or
age group. Typically, clustering the data into subgroups is preparation for analysis. The subgroups
become inputs for a different technique.
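A minimal sketch of the occupation-level example above, using a scikit-learn decision tree on a tiny invented training set; the ages, education figures, and labels are purely illustrative.

```python
# Classification sketch: predict occupation level from age and education years.
from sklearn.tree import DecisionTreeClassifier

# Invented training data: [age, years_of_education]
X_train = [[22, 16], [24, 16], [30, 18], [35, 18], [45, 20], [50, 16]]
y_train = ["entry-level", "entry-level", "associate", "associate", "senior", "senior"]

clf = DecisionTreeClassifier().fit(X_train, y_train)

# A recent 22-year-old graduate is classified automatically
print(clf.predict([[22, 16]]))   # expected: ['entry-level']
```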
Functionalities of Data Mining
There is a lot of confusion between data mining and data analysis. Data mining functions
are used to define the trends or correlations contained in data mining activities. While data
analysis is used to test statistical models that fit the dataset, for example, analysis of a
marketing campaign, data mining uses Machine Learning and mathematical and statistical
models to discover patterns hidden in the data. Broadly, data mining activities can be
divided into two categories:

• Descriptive Data Mining: It summarizes what is happening within the data without any prior hypothesis. The common features of the data set are highlighted, for example counts, averages, etc.

• Predictive Data Mining: It predicts values for unlabeled or unknown attributes. With previously available or historical data, data mining can be used to make predictions about critical business metrics, for example predicting the volume of business next quarter based on performance in the previous quarters over several years, or judging from the findings of a patient's medical examinations whether the patient is suffering from a particular disease.
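The contrast can be sketched in a few lines of Python; the quarterly figures below are invented. Descriptive mining summarizes what is already in the data, while predictive mining fits a model to historical values and extrapolates.

```python
# Descriptive vs. predictive mining on invented quarterly revenue figures.
import numpy as np
from sklearn.linear_model import LinearRegression

revenue = np.array([120, 135, 150, 160, 172, 185, 200, 210])  # past 8 quarters

# Descriptive: highlight common features of the data (count, average, etc.)
print("quarters:", revenue.size, "average revenue:", revenue.mean())

# Predictive: fit a trend on historical data and project the next quarter
quarters = np.arange(len(revenue)).reshape(-1, 1)
model = LinearRegression().fit(quarters, revenue)
print("forecast for next quarter:", model.predict([[len(revenue)]])[0])
```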
1. Class/Concept Descriptions
• A class or concept implies there is a data set or set of features that define the class or a concept. A
class can be a category of items on a shop floor, and a concept could be the abstract idea on which
data may be categorized like products to be put on clearance sale and non-sale products. There are
two concepts here, one that helps with grouping and the other that helps in differentiating.
• Data Characterization: This refers to the summary of general characteristics or features of the
class, resulting in specific rules that define a target class. A data analysis technique called Attribute-
oriented Induction is employed on the data set for achieving characterization.
• Data Discrimination: Discrimination is used to separate distinct data sets based on the disparity in
attribute values. It compares features of a class with features of one or more contrasting classes. The output can be presented as, e.g.,
bar charts, curves, and pie charts.
2. Mining Frequent Patterns
• One of the functions of data mining is finding data patterns. Frequent patterns are things that are
discovered to be most common in data. Various types of frequency can be found in the dataset.
• Frequent item set: This term refers to a group of items that are commonly found together, such as
milk and sugar.
• Frequent substructure: It refers to the various types of data structures that can be combined with
an item set or subsequences, such as trees and graphs.
• Frequent Subsequence: A regular pattern series, such as buying a phone followed by a cover.
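A tiny pure-Python sketch of frequent-pattern counting on invented transactions; it also computes the support and confidence figures used in the association analysis described next.

```python
# Frequent item-pair counting on invented market-basket transactions.
from itertools import combinations
from collections import Counter

transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"bread", "butter"},
    {"milk", "sugar", "butter"},
    {"bread", "milk"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
# Support of {milk, sugar} and confidence of the rule milk -> sugar
support_milk_sugar = pair_counts[("milk", "sugar")] / n
support_milk = sum("milk" in b for b in transactions) / n
print("support({milk, sugar}) =", support_milk_sugar)                    # 3/5 = 0.6
print("confidence(milk -> sugar) =", support_milk_sugar / support_milk)  # 0.6/0.8 = 0.75
```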
3. Association Analysis
• It analyses the set of items that generally occur together in a transactional dataset. It is also known as Market
Basket Analysis for its wide use in retail sales. Two parameters are used for determining the association rules:
• Support, which identifies how frequently the item set appears in the database.
• Confidence is the conditional probability that an item occurs when another item occurs in a transaction.

4. Classification
• Classification is a data mining technique that categorizes items in a collection based on some predefined
properties. It uses methods like if-then, decision trees or neural networks to predict a class or essentially classify a
collection of items. A training set containing items whose properties are known is used to train the system to
predict the category of items from an unknown collection of items.

5. Prediction
• Prediction is used to estimate unavailable data values or upcoming trends. An object can be anticipated based on the
attribute values of the object and attribute values of the classes. It can be a prediction of missing numerical values
or increase or decrease trends in time-related information. There are primarily two types of predictions in data
mining: numeric and class predictions.
• Numeric predictions are made by creating a linear regression model that is based on historical data. Prediction of
numeric values helps businesses ramp up for a future event that might impact the business positively or negatively.
• Class predictions are used to fill in missing class information for products using a training data set where the class
for products is known.
6. Cluster Analysis
• In image processing, pattern recognition and bioinformatics, clustering is a popular data mining
functionality. It is similar to classification, but the classes are not predefined. Data attributes represent the
classes. Similar data are grouped together, with the difference being that a class label is not known.
Clustering algorithms group data based on similar features and dissimilarities.
7. Outlier Analysis
• Outlier analysis is important to understand the quality of data. If there are too many outliers, you cannot
trust the data or draw patterns. An outlier analysis determines if there is something out of turn in the data
and whether it indicates a situation that a business needs to consider and take measures to mitigate. Data
that cannot be grouped into any class by the algorithms is pulled up for outlier analysis.
8. Evolution and Deviation Analysis
• Evolution Analysis pertains to the study of data sets that change over time. Evolution analysis models are
designed to capture evolutionary trends in data helping to characterize, classify, cluster or discriminate
time-related data.
9. Correlation Analysis
• Correlation is a mathematical technique for determining whether and how strongly two attributes are
related to one another. It determines how well two numerically measured
continuous variables are linked. Researchers can use this type of analysis to see if there are any possible
correlations between variables in their study.
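A minimal sketch of correlation analysis on two invented numeric attributes using pandas; the column names and figures are illustrative only.

```python
# Correlation sketch: how strongly are two numeric attributes related?
import pandas as pd

data = pd.DataFrame({
    "ad_spend": [10, 15, 20, 25, 30, 35],        # invented figures
    "sales":    [110, 128, 150, 160, 185, 205],
})

# Pearson correlation coefficient: close to +1 means a strong positive link
print(data["ad_spend"].corr(data["sales"]))
```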
Classification of Data Mining Systems
Classification Based on the mined Databases
• A data mining system can be classified based on the types of databases that have been mined. A
database system can be further segmented based on distinct principles, such as data models,
types of data, etc., which further assist in classifying a data mining system.
• For example, if we want to classify a database based on the data model, we need to select either
relational, transactional, object-relational or data warehouse mining systems.

Classification Based on the type of Knowledge Mined


• A data mining system categorized based on the kind of knowledge mined may have the following
functionalities:
1.Characterization
2.Discrimination
3.Association and Correlation Analysis
4.Classification
5.Prediction
6.Outlier Analysis
Classification Based on the Techniques Utilized
• A data mining system can also be classified based on the type of techniques that are being incorporated. These
techniques can be assessed based on the degree of user interaction involved or the methods of analysis
employed.

Classification Based on the Applications Adapted


• Data mining systems classified based on the applications adapted are as follows:
1. Finance
2. Telecommunications
3. DNA
4. Stock Markets
5. E-mail

Examples of Classification Task


• Following is some of the main examples of classification tasks:
• Classification helps in determining tumor cells as benign or malignant.
• Classification of credit card transactions as fraudulent or legitimate.
• Classification of secondary structures of protein as alpha-helix, beta-sheet, or random coil.
• Classification of news stories into distinct categories such as finance, weather, entertainment, sports, etc.
Integration schemes of Database and Data warehouse systems

No Coupling
• In the no-coupling scheme, the data mining system does not use any database or data warehouse system functions.
Loose Coupling
• In loose coupling, data mining utilizes some of the database or data warehouse system functionalities. It mainly
fetches the data from the data repository managed by these systems and then performs data mining. The results
are kept either in the file or any designated place in the database or data warehouse.
Semi-Tight Coupling
• In semi-tight coupling, data mining is linked to either the DB or DW system and provides an efficient
implementation of data mining primitives within the database.
Tight Coupling
• A data mining system can be effortlessly combined with a database or data warehouse system in tight coupling.
KDD Process in Data Mining (Representation of Input Data)

KDD Process
• KDD (Knowledge Discovery in Databases) is a process that involves the
extraction of useful, previously unknown, and potentially valuable
information from large datasets. The KDD process is an iterative process and
it requires multiple iterations to extract accurate knowledge from the data.
Steps included in KDD process
Data Cleaning
• Data cleaning is defined as removal of noisy and irrelevant data from collection.
1.Cleaning in case of Missing values.
2.Cleaning noisy data, where noise is a random or variance error.
3.Cleaning with Data discrepancy detection and Data transformation tools.
Data Integration
• Data integration is defined as combining heterogeneous data from multiple sources into a common
source (data warehouse). Data integration is performed using data migration tools, data
synchronization tools, and the ETL (Extract, Transform, Load) process.
Data Selection
• Data selection is defined as the process where data relevant to the analysis is
decided and retrieved from the data collection. For this we can use Neural
network, Decision Trees, Naive bayes, Clustering, and Regression methods.
Data Transformation
• Data Transformation is defined as the process of transforming data into appropriate form
required by mining procedure. Data Transformation is a two step process:

1.Data Mapping: Assigning elements from source base to destination to capture


transformations.
2.Code generation: Creation of the actual transformation program.
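A minimal pandas sketch of the preprocessing steps above (cleaning, integration, selection, and transformation); the file names and column names are illustrative placeholders.

```python
# KDD preprocessing sketch: clean, integrate, select, and transform (illustrative).
import pandas as pd

# Data cleaning: drop noisy/irrelevant rows and handle missing values
sales = pd.read_csv("sales.csv")                 # assumed source file
sales = sales.dropna(subset=["customer_id"]).drop_duplicates()

# Data integration: combine heterogeneous sources into one common frame
customers = pd.read_csv("crm_export.csv")        # assumed second source
merged = sales.merge(customers, on="customer_id", how="inner")

# Data selection: keep only the attributes relevant to the analysis
selected = merged[["customer_id", "region", "amount"]]

# Data transformation: map values into the form the mining step expects
transformed = selected.assign(
    amount_scaled=(selected["amount"] - selected["amount"].mean()) / selected["amount"].std())
print(transformed.head())
```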

Data Mining
• Data mining is defined as the application of techniques to extract potentially useful patterns.
It transforms task-relevant data into patterns and decides the purpose of the model using
classification or characterization.
Pattern Evaluation
• Pattern evaluation is defined as identifying strictly increasing patterns representing
knowledge based on given measures. It finds an interestingness score for each pattern and
uses summarization and visualization to make the data understandable to the user.
Knowledge Representation
• Knowledge representation is defined as presenting the mined knowledge to the user, typically with visualization and reporting tools.
Methodology analysis of data
The 7 Most Useful Data Analysis Methods and Techniques

a. Regression analysis
• Regression analysis is used to estimate the relationship between a set of variables. The aim of regression analysis is to estimate
how one or more variables might impact the dependent variable, in order to identify trends and patterns. This is especially useful
for making predictions and forecasting future trends.
• Let’s imagine you work for an ecommerce company and you want to examine the relationship between: (a) how much money is
spent on social media marketing, and (b) sales revenue. In this case, sales revenue is your dependent variable—it’s the factor
you’re most interested in predicting and boosting. Social media spend is your independent variable; you want to determine
whether or not it has an impact on sales and, ultimately, whether it’s worth increasing, decreasing, or keeping the same. Using
regression analysis, you’d be able to see if there’s a relationship between the two variables. A positive correlation would imply
that the more you spend on social media marketing, the more sales revenue you make. No correlation at all might suggest that
social media marketing has no bearing on your sales. Understanding the relationship between these two variables would help
you to make informed decisions about the social media budget going forward. However: It’s important to note that, on their own,
regressions can only be used to determine whether or not there is a relationship between a set of variables—they don’t tell you
anything about cause and effect. So, while a positive correlation between social media spend and sales revenue may suggest that
one impacts the other, it’s impossible to draw definitive conclusions based on this analysis alone.
• There are many different types of regression analysis, and the model you use depends on the type of data you have for the
dependent variable. For example, your dependent variable might be continuous (i.e. something that can be measured on a
continuous scale, such as sales revenue in USD), in which case you’d use a different type of regression analysis than if your
dependent variable was categorical in nature (i.e. comprising values that can be categorised into a number of distinct groups
based on a certain characteristic, such as customer location by continent).
• Regression analysis in action:
Investigating the relationship between clothing brand Benetton’s advertising expenditure and sales
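A minimal sketch of the ecommerce example above, fitting a simple linear regression of sales revenue on social media spend with scikit-learn; all figures are invented.

```python
# Regression sketch: does social media spend relate to sales revenue?
import numpy as np
from sklearn.linear_model import LinearRegression

social_spend = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # invented, in $k
sales_revenue = np.array([20.0, 24.0, 31.0, 33.0, 40.0])       # invented, in $k

model = LinearRegression().fit(social_spend, sales_revenue)
print("slope (revenue change per extra $1k of spend):", model.coef_[0])
print("predicted revenue at $6k spend:", model.predict([[6.0]])[0])
# Note: a positive slope shows association, not cause and effect.
```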
b. Monte Carlo simulation

• When making decisions or taking certain actions, there are a range of different possible outcomes. If you take the bus, you might get
stuck in traffic. If you walk, you might get caught in the rain or bump into your chatty neighbor, potentially delaying your journey. In
everyday life, we tend to briefly weigh up the pros and cons before deciding which action to take; however, when the stakes are high,
it’s essential to calculate, as thoroughly and accurately as possible, all the potential risks and rewards.

• Monte Carlo simulation, otherwise known as the Monte Carlo method, is a computerized technique used to generate models of
possible outcomes and their probability distributions. It essentially considers a range of possible outcomes and then calculates how
likely it is that each particular outcome will be realized. The Monte Carlo method is used by data analysts to conduct advanced risk
analysis, allowing them to better forecast what might happen in the future and make decisions accordingly.

So how does Monte Carlo simulation work, and what can it tell us?
• To run a Monte Carlo simulation, you’ll start with a mathematical model of your data—such as a spreadsheet. Within your
spreadsheet, you’ll have one or several outputs that you’re interested in; profit, for example, or number of sales. You’ll also have a
number of inputs; these are variables that may impact your output variable. If you’re looking at profit, relevant inputs might include
the number of sales, total marketing spend, and employee salaries. If you knew the exact, definitive values of all your input variables,
you’d quite easily be able to calculate what profit you’d be left with at the end. However, when these values are uncertain, a Monte
Carlo simulation enables you to calculate all the possible options and their probabilities. What will your profit be if you make 100,000
sales and hire five new employees on a salary of $50,000 each? What is the likelihood of this outcome? What will your profit be if you
only make 12,000 sales and hire five new employees? And so on. It does this by replacing all uncertain values with functions which
generate random samples from distributions determined by you, and then running a series of calculations and recalculations to produce
models of all the possible outcomes and their probability distributions. The Monte Carlo method is one of the most popular techniques
for calculating the effect of unpredictable variables on a specific output variable, making it ideal for risk analysis.
• Monte Carlo simulation in action: A case study using Monte Carlo simulation for risk analysis
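A minimal NumPy sketch of a Monte Carlo profit simulation; the distributions and figures are invented purely to show the mechanics of replacing uncertain inputs with random samples.

```python
# Monte Carlo sketch: simulate profit under uncertain sales and margins.
import numpy as np

rng = np.random.default_rng(seed=0)
n_runs = 100_000

# Uncertain inputs, replaced by random samples from assumed distributions
units_sold  = rng.normal(loc=50_000, scale=8_000, size=n_runs)
unit_margin = rng.uniform(low=1.5, high=2.5, size=n_runs)
fixed_costs = 60_000  # assumed known

profit = units_sold * unit_margin - fixed_costs

print("expected profit:", profit.mean())
print("probability of a loss:", (profit < 0).mean())
```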
c. Factor analysis
• Factor analysis is a technique used to reduce a large number of variables to a smaller number of factors. It works on the basis that
multiple separate, observable variables correlate with each other because they are all associated with an underlying construct. This is
useful not only because it condenses large datasets into smaller, more manageable samples, but also because it helps to uncover hidden
patterns. This allows you to explore concepts that cannot be easily measured or observed—such as wealth, happiness, fitness, or, for a
more business-relevant example, customer loyalty and satisfaction.

• Let’s imagine you want to get to know your customers better, so you send out a rather long survey comprising one hundred questions.
Some of the questions relate to how they feel about your company and product; for example, “Would you recommend us to a friend?”
and “How would you rate the overall customer experience?” Other questions ask things like “What is your yearly household income?”
and “How much are you willing to spend on skincare each month?”

• Once your survey has been sent out and completed by lots of customers, you end up with a large dataset that essentially tells you one
hundred different things about each customer (assuming each customer gives one hundred responses). Instead of looking at each of these
responses (or variables) individually, you can use factor analysis to group them into factors that belong together—in other words, to
relate them to a single underlying construct. In this example, factor analysis works by finding survey items that are strongly correlated.
This is known as covariance. So, if there’s a strong positive correlation between household income and how much they’re willing to
spend on skincare each month (i.e. as one increases, so does the other), these items may be grouped together. Together with other
variables (survey responses), you may find that they can be reduced to a single factor such as “consumer purchasing power”. Likewise, if
a customer experience rating of 10/10 correlates strongly with “yes” responses regarding how likely they are to recommend your product
to a friend, these items may be reduced to a single factor such as “customer satisfaction”.

• In the end, you have a smaller number of factors rather than hundreds of individual variables. These factors are then taken forward for
further analysis, allowing you to learn more about your customers (or any other area you’re interested in exploring).

• Factor analysis in action: Using factor analysis to explore customer behavior patterns in Tehran
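A minimal sketch of factor analysis with scikit-learn's FactorAnalysis, reducing a handful of simulated survey items to two underlying factors; the constructs, items, and data are invented.

```python
# Factor-analysis sketch: reduce correlated survey items to a few factors.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(seed=1)
purchasing_power = rng.normal(size=200)          # hidden construct 1
satisfaction     = rng.normal(size=200)          # hidden construct 2

# Observed survey items that (noisily) reflect the hidden constructs
survey = np.column_stack([
    purchasing_power + rng.normal(scale=0.3, size=200),   # household income
    purchasing_power + rng.normal(scale=0.3, size=200),   # skincare budget
    satisfaction     + rng.normal(scale=0.3, size=200),   # experience rating
    satisfaction     + rng.normal(scale=0.3, size=200),   # would recommend
])

fa = FactorAnalysis(n_components=2).fit(survey)
print(fa.components_.round(2))   # loadings: which items load on which factor
```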
• d. Cohort analysis

• “Cohort analysis is a subset of behavioral analytics that takes the data from a given dataset and rather than looking at all users as one unit,
it breaks them into related groups for analysis. These related groups, or cohorts, usually share common characteristics or experiences
within a defined time-span.”

• So what does this mean and why is it useful? Let’s break down the above definition further. A cohort is a group of people who share a
common characteristic (or action) during a given time period. Students who enrolled at university in 2020 may be referred to as the 2020
cohort. Customers who purchased something from your online store via the app in the month of December may also be considered a
cohort.

• With cohort analysis, you’re dividing your customers or users into groups and looking at how these groups behave over time. So, rather
than looking at a single, isolated snapshot of all your customers at a given moment in time (with each customer at a different point in
their journey), you’re examining your customers’ behavior in the context of the customer lifecycle. As a result, you can start to identify
patterns of behavior at various points in the customer journey—say, from their first ever visit to your website, through to email newsletter
sign-up, to their first purchase, and so on. As such, cohort analysis is dynamic, allowing you to uncover valuable insights about the
customer lifecycle.

• This is useful because it allows companies to tailor their service to specific customer segments (or cohorts). Let’s imagine you run a 50%
discount campaign in order to attract potential new customers to your website. Once you’ve attracted a group of new customers (a
cohort), you’ll want to track whether they actually buy anything and, if they do, whether or not (and how frequently) they make a repeat
purchase. With these insights, you’ll start to gain a much better understanding of when this particular cohort might benefit from another
discount offer or retargeting ads on social media, for example. Ultimately, cohort analysis allows companies to optimize their service
offerings (and marketing) to provide a more targeted, personalized experience.
• Cohort analysis in action: How Ticketmaster used cohort analysis to boost revenue
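A minimal pandas sketch of cohort analysis: group customers by the month of their first purchase and count how many from each cohort are active in later months. The data frame and columns are invented.

```python
# Cohort sketch: activity by month of first purchase (invented data).
import pandas as pd

orders = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "c", "c", "d"],
    "order_month": pd.to_datetime(
        ["2023-01", "2023-02", "2023-01", "2023-03", "2023-02", "2023-03", "2023-02"]),
})

# Each customer's cohort = the month of their first order
orders["cohort"] = orders.groupby("customer")["order_month"].transform("min")

# How many distinct customers from each cohort are active in each month?
retention = (orders.groupby(["cohort", "order_month"])["customer"]
                   .nunique()
                   .unstack(fill_value=0))
print(retention)
```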
e. Cluster analysis
• Cluster analysis is an exploratory technique that seeks to identify structures within a dataset. The
goal of cluster analysis is to sort different data points into groups (or clusters) that are internally
homogeneous and externally heterogeneous. This means that data points within a cluster are similar
to each other, and dissimilar to data points in another cluster. Clustering is used to gain insight into
how data is distributed in a given dataset, or as a preprocessing step for other algorithms.
• There are many real-world applications of cluster analysis. In marketing, cluster analysis is
commonly used to group a large customer base into distinct segments, allowing for a more targeted
approach to advertising and communication. Insurance firms might use cluster analysis to
investigate why certain locations are associated with a high number of insurance claims. Another
common application is in geology, where experts will use cluster analysis to evaluate which cities
are at greatest risk of earthquakes (and thus try to mitigate the risk with protective measures).
• It’s important to note that, while cluster analysis may reveal structures within your data, it won’t
explain why those structures exist. With that in mind, cluster analysis is a useful starting point for
understanding your data and informing further analysis. Clustering algorithms are also used in machine learning.
• Cluster analysis in action:
Using cluster analysis for customer segmentation—a telecoms case study example
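A minimal KMeans sketch for customer segmentation with scikit-learn; the spend and visit-frequency figures are invented.

```python
# Cluster-analysis sketch: segment customers by spend and visit frequency.
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [20, 1], [25, 2], [22, 1],      # low spend, infrequent visitors
    [80, 8], [90, 9], [85, 7],      # high spend, frequent visitors
])  # columns: monthly spend, visits per month (invented)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print("cluster labels:", kmeans.labels_)
print("cluster centers:", kmeans.cluster_centers_)
```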
f. Time series analysis

• Time series analysis is a statistical technique used to identify trends and cycles over time. Time series data is a
sequence of data points which measure the same variable at different points in time (for example, weekly sales
figures or monthly email sign-ups). By looking at time-related trends, analysts are able to forecast how the
variable of interest may fluctuate in the future.
• When conducting time series analysis, the main patterns you’ll be looking out for in your data are:
• Trends: Stable, linear increases or decreases over an extended time period.
• Seasonality: Predictable fluctuations in the data due to seasonal factors over a short period of time. For
example, you might see a peak in swimwear sales in summer around the same time every year.
• Cyclic patterns: Unpredictable cycles where the data fluctuates. Cyclical trends are not due to seasonality, but
rather, may occur as a result of economic or industry-related conditions.

• As you can imagine, the ability to make informed predictions about the future has immense value for business.
Time series analysis and forecasting is used across a variety of industries, most commonly for stock market
analysis, economic forecasting, and sales forecasting. There are different types of time series models
depending on the data you’re using and the outcomes you want to predict. These models are typically
classified into three broad types: the autoregressive (AR) models, the integrated (I) models, and the moving
average (MA) models.
• Time series analysis in action: Developing a time series model to predict jute yarn demand in Bangladesh
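A minimal sketch of time series exploration with pandas, smoothing invented monthly sales with a 12-month rolling mean to expose the underlying trend beneath the seasonal swings.

```python
# Time-series sketch: expose trend and seasonality in invented monthly sales.
import numpy as np
import pandas as pd

months = pd.date_range("2021-01", periods=36, freq="MS")
trend = np.linspace(100, 160, 36)                                    # gradual growth
season = 20 * np.sin(2 * np.pi * np.asarray(months.month) / 12)      # e.g. summer peak
sales = pd.Series(trend + season, index=months)

# A 12-month rolling mean smooths out seasonality and shows the trend
print(sales.rolling(window=12).mean().dropna().head())
```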
g. Sentiment analysis
• With sentiment analysis, the goal is to interpret and classify the emotions conveyed within textual data. From a business
perspective, this allows you to ascertain how your customers feel about various aspects of your brand, product, or service.

There are several different types of sentiment analysis models, each with a slightly different focus. The three main types include:
Fine-grained sentiment analysis
• If you want to focus on opinion polarity (i.e. positive, neutral, or negative) in depth, fine-grained sentiment analysis will allow you
to do so.
• For example, if you wanted to interpret star ratings given by customers, you might use fine-grained sentiment analysis to
categorize the various ratings along a scale ranging from very positive to very negative.
Emotion detection
• This model often uses complex machine learning algorithms to pick out various emotions from your textual data.
• You might use an emotion detection model to identify words associated with happiness, anger, frustration, and excitement, giving
you insight into how your customers feel when writing about you or your product on, say, a product review site.
Aspect-based sentiment analysis
• This type of analysis allows you to identify what specific aspects the emotions or opinions relate to, such as a certain product
feature or a new ad campaign.
• If a customer writes that they “find the new Instagram advert so annoying”, your model should detect not only a negative
sentiment, but also the object towards which it’s directed.
• In a nutshell, sentiment analysis uses various Natural Language Processing (NLP) algorithms and systems which are trained to
associate certain inputs (for example, certain words) with certain outputs.
• For example, the input “annoying” would be recognized and tagged as “negative”. Sentiment analysis is crucial to understanding
how your customers feel about you and your products, for identifying areas for improvement, and even for averting PR disasters
in real-time!
• Sentiment analysis in action: 5 Real-world sentiment analysis case studies
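A minimal sentiment-analysis sketch using NLTK's VADER lexicon, assuming the nltk package is installed; the review texts are invented.

```python
# Sentiment sketch with NLTK's VADER lexicon (assumes nltk is installed).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-off lexicon download
sia = SentimentIntensityAnalyzer()

reviews = [
    "I find the new Instagram advert so annoying",
    "The customer experience was excellent, would recommend!",
]
for text in reviews:
    scores = sia.polarity_scores(text)       # neg/neu/pos plus a compound score
    print(scores["compound"], "->", text)
```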
What is a Data Mart?

A data mart is a simple form of data warehouse focused on a single subject or line of business. With a data mart, teams can access data and gain insights faster, because they don't have to spend time searching within a more complex data warehouse or manually aggregating data from different sources.
Why create a data mart?
• A data mart provides easier access to data required by a specific team or
line of business within your organization. For example, if your marketing
team is looking for data to help improve campaign performance during the
holiday season, sifting through and combining data scattered across
multiple systems could prove costly in terms of time, accuracy, and
ultimately, money.

• Teams forced to locate data from various sources most often rely on
spreadsheets to share this data and collaborate. This usually results in
human errors, confusion, complex reconciliations, and multiple sources of
truth—the so-called “spreadsheet nightmare.” Data marts have become
popular as a centralized place where the necessary data is collected and
organized before reports, dashboards, and visualizations are created.
The difference between data marts, data lakes, and data
warehouses
Data marts, data lakes, and data warehouses serve different purposes and needs.
• A data warehouse is a data management system designed to support business intelligence and analytics for
an entire organization. Data warehouses often contain large amounts of data, including historical data. The
data within a data warehouse usually is derived from a wide range of sources, such as application log files
and transactional applications. A data warehouse stores structured data, whose purpose is usually well-defined.

• A data lake allows organizations to store large amounts of structured and unstructured data (for example,
from social media or clickstream data), and to immediately make it available for real-time analytics, data
science, and machine learning use cases. With a data lake, data is ingested in its original form, without
alteration.

• The key difference between a data lake and a data warehouse is that data lakes store vast amounts of raw
data, without a predefined structure. Organizations do not need to know in advance how the data will be
used.

• A data mart is a simple form of a data warehouse that is focused on a single subject or line of business, such
as sales, finance, or marketing. Given their focus, data marts draw data from fewer sources than data
warehouses. Data mart sources can include internal operational systems, a central data warehouse, and
external data.
Benefits of a data mart

• A single source of truth. The centralized nature of a data mart helps ensure that everyone in a department or organization
makes decisions based on the same data. This is a major benefit, because the data and the predictions based on that data can
be trusted, and stakeholders can focus on making decisions and taking action, as opposed to arguing about the data itself.

• Quicker access to data. Specific business teams and users can rapidly access the subset of data they need from the
enterprise data warehouse and combine it with data from various other sources. Once the connections to their desired data
sources are established, they can get live data from a data mart whenever needed without having to go to IT to obtain
periodic extracts. Business and IT teams both gain improved productivity as a result.

• Faster insights leading to faster decision making. While a data warehouse enables enterprise-level decision-making, a data
mart allows data analytics at the department level. Analysts can focus on specific challenges and opportunities in areas such
as finance and HR and move more rapidly from data to insights, which enables them to make better and faster decisions.

• Simpler and faster implementation. Setting up an enterprise data warehouse to cater to the needs of your entire
organization can require significant time and effort. A data mart, in contrast, is focused on serving the needs of specific
business teams, requiring access to fewer data sets. It therefore is much simpler and faster to implement.

• Creating agile and scalable data management. Data marts provide an agile data management system that works in tandem
with business needs, including being able to use information gathered in past projects to help with current tasks. Teams can
update and change their data mart based on new and evolving analytics projects.

• Transient analysis. Some data analytics projects are short-lived—for example, completing a specific analysis of online sales
for a two-week promotion prior to a team meeting. Teams can rapidly set up a data mart to accomplish such a project.
Types of Data Mart

1. Dependent Data Mart: A dependent data mart is created by extracting data from the central repository, the
data warehouse. First the data warehouse is created by extracting data (through an ETL tool) from external
sources, and then the data mart is created from the data warehouse. A dependent data mart is created in the top-down
approach of data warehouse architecture. This model of data mart is used by big organizations.
2. Independent Data Mart: An independent data mart is created directly from external sources
instead of from a data warehouse. First the data mart is created by extracting data from external sources, and
then the data warehouse is created from the data present in the data marts. An independent data mart is designed
in the bottom-up approach of data warehouse architecture. This model of data mart is used by small
organizations and is comparatively cost effective.
3. Hybrid Data Mart: This type of data mart is created by extracting data from an operational source,
from a data warehouse, or both. One path reflects accessing data directly from external sources, and the other
path reflects the dependent data mart model.
• Need Of Data Mart:
1. Data Mart focuses only on functioning of particular department of an organization.
2. It is maintained by single authority of an organization.
3. Since it stores data related to a specific part of an organization, data retrieval from it is very quick.
4. Designing and maintaining a data mart is found to be quite easy as compared to a data warehouse.
5. It reduces the response time of the user as it stores a small volume of data.
6. It is small in size, which makes accessing data from it very fast.
7. This storage unit is used by most organizations for the smooth running of their departments.
• Advantages of Data Mart:
1. Implementation of data mart needs less time as compared to implementation of data warehouse as data
mart is designed for a particular department of an organization.
2. Organizations are provided with choices to choose model of data mart depending upon cost and their
business.
3. Data can be easily accessed from data mart.
4. It contains data for frequently accessed queries, enabling analysis of business trends.
• Disadvantages of Data Mart:
1. Since it stores data related only to a specific function, it does not hold the huge volume of data related to
each and every department of an organization the way a data warehouse does.
2. Creating too many data marts becomes cumbersome sometimes.
Features of data marts:
• Subset of Data: Data marts are designed to store a subset of data from a larger data warehouse or
data lake. This allows for faster query performance since the data in the data mart is focused on a
specific business unit or department.
• Optimized for Query Performance: Data marts are optimized for query performance, which means
that they are designed to support fast queries and analysis of the data stored in the data mart.
• Customizable: Data marts are customizable, which means that they can be designed to meet the
specific needs of a business unit or department.
• Self-Contained: Data marts are self-contained, which means that they have their own set of tables,
indexes, and data models. This allows for easier management and maintenance of the data mart.
• Security: Data marts can be secured, which means that access to the data in the data mart can be
controlled and restricted to specific users or groups.
• Scalability: Data marts can be scaled horizontally or vertically to accommodate larger volumes of
data or to support more users.
• Integration with Business Intelligence Tools: Data marts can be integrated with business
intelligence tools, such as Tableau, Power BI, or QlikView, which allows users to analyze and
visualize the data stored in the data mart.
• ETL Process: Data marts are typically populated using an Extract, Transform, Load (ETL) process,
which means that data is extracted from the larger data warehouse or data lake, transformed to meet
the requirements of the data mart, and loaded into the data mart.
What is OLAP (Online Analytical Processing)?

• OLAP stands for On-Line Analytical Processing. OLAP is a category of software
technology that enables analysts, managers, and executives to gain insight into information
through fast, consistent, interactive access to a wide variety of possible views of data that has been
transformed from raw information to reflect the real dimensionality of the enterprise as understood
by the clients.

• OLAP implements the multidimensional analysis of business information and supports the
capability for complex calculations, trend analysis, and sophisticated data modeling. It is rapidly
becoming the essential foundation for intelligent solutions including Business Performance
Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis, Simulation
Models, Knowledge Discovery, and Data Warehouse Reporting. OLAP enables end clients to
perform ad hoc analysis of data in multiple dimensions, providing the insight and understanding
they require for better decision making.
OLAP applications
Finance and accounting:
• Budgeting
• Activity-based costing
• Financial performance analysis
• Financial modeling

Sales and Marketing


• Sales analysis and forecasting
• Market research analysis
• Promotion analysis
• Customer analysis
• Market and customer segmentation

Production
• Production planning
• Defect analysis

OLAP cubes have two main purposes. The first is to provide business users with a data model more intuitive to them
than a tabular model. This model is called a Dimensional Model.
The second purpose is to enable fast query response that is usually difficult to achieve using tabular models.
How OLAP Works?

OLAP has a very simple concept. It pre-calculates most of the queries that are typically very hard to execute over tabular databases, namely aggregation, joining, and grouping. These queries are calculated during a process that is usually called 'building' or 'processing' the OLAP cube. This process happens overnight, and by the time end users get to work, the data will have been updated.
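The pre-aggregation idea can be sketched with a pandas pivot table acting as a tiny, in-memory "cube" over invented sales rows; real OLAP engines persist such aggregates so user queries hit them directly.

```python
# OLAP-style pre-aggregation sketch: a tiny sales "cube" with pandas.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["Milk", "Bread", "Milk", "Bread", "Milk"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2"],
    "amount":  [100, 80, 120, 90, 130],
})

# "Build" the cube once: aggregate amount for every region/product/quarter cell
cube = sales.pivot_table(values="amount", index="region",
                         columns=["product", "quarter"],
                         aggfunc="sum", fill_value=0)
print(cube)

# Roll-up: total by region only (collapsing the product and quarter dimensions)
print(cube.sum(axis=1))
```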
OLAP Guidelines (Dr.E.F.Codd Rule)

Dr E.F. Codd, the "father" of the relational model, has formulated a list
of 12 guidelines and requirements as the basis for selecting OLAP
systems:
1) Multidimensional Conceptual View: This is the central feature of an OLAP system. By requiring a multidimensional view, it is possible to
carry out methods like slice and dice.
2) Transparency: Make the technology, underlying information repository, computing operations, and the dissimilar nature of source data totally
transparent to users. Such transparency helps to improve the efficiency and productivity of the users.
3) Accessibility: It provides access only to the data that is actually required to perform the particular analysis, presenting a single, coherent, and
consistent view to the clients. The OLAP system must map its own logical schema to the heterogeneous physical data stores and perform any
necessary transformations. The OLAP operations should be sitting between data sources (e.g., data warehouses) and an OLAP front-end.
4) Consistent Reporting Performance: To make sure that the users do not feel any significant degradation in documenting performance as the
number of dimensions or the size of the database increases. That is, the performance of OLAP should not suffer as the number of dimensions is
increased. Users must observe consistent run time, response time, or machine utilization every time a given query is run.
5) Client/Server Architecture: Make the server component of OLAP tools sufficiently intelligent that the various clients to be attached with a
minimum of effort and integration programming. The server should be capable of mapping and consolidating data between dissimilar databases.
6) Generic Dimensionality: An OLAP method should treat each dimension as equivalent in both its structure and operational capabilities.
Additional operational capabilities may be granted to selected dimensions, but such additional capabilities should be grantable to any dimension.
7) Dynamic Sparse Matrix Handling: The system should adapt its physical schema to the specific analytical model being created and loaded, optimizing
sparse matrix handling. When encountering a sparse matrix, the system must be able to dynamically infer the distribution of the information
and adjust the storage and access to obtain and maintain a consistent level of performance.
8) Multiuser Support: OLAP tools must provide concurrent data access, data integrity, and access security.
9) Unrestricted Cross-dimensional Operations: The system should provide the ability to recognize dimensional hierarchies and to perform
roll-up and drill-down operations within a dimension or across dimensions.
10) Intuitive Data Manipulation: Fundamental data manipulation along the consolidation path, such as reorientation (pivoting), drill-down and
roll-up, and other manipulations, should be accomplished naturally and precisely via point-and-click and drag-and-drop actions on the cells of the
analytical model. It avoids the use of a menu or multiple trips to a user interface.
11) Flexible Reporting: It gives business clients the ability to organize columns, rows, and cells in a manner that facilitates simple
manipulation, analysis, and synthesis of data.
12) Unlimited Dimensions and Aggregation Levels: The number of data dimensions should be unlimited. Each of these common dimensions
must allow a practically unlimited number of customer-defined aggregation levels within any given consolidation path.
Difference between OLAP and OLTP
Category | OLAP (Online Analytical Processing) | OLTP (Online Transaction Processing)
Definition | It is well-known as an online database query management system. | It is well-known as an online database modifying system.
Data source | Consists of historical data from various databases. | Consists of only operational current data.
Method used | It makes use of a data warehouse. | It makes use of a standard database management system (DBMS).
Application | It is subject-oriented. Used for data mining, analytics, decision making, etc. | It is application-oriented. Used for business tasks.
Normalization | In an OLAP database, tables are not normalized. | In an OLTP database, tables are normalized (3NF).
Usage of data | The data is used in planning, problem-solving, and decision-making. | The data is used to perform day-to-day fundamental operations.
Task | It provides a multi-dimensional view of different business tasks. | It reveals a snapshot of present business tasks.
Purpose | It serves the purpose of extracting information for analysis and decision-making. | It serves the purpose of inserting, updating, and deleting information from the database.
Volume of data | A large amount of data is stored, typically in TB or PB. | The size of the data is relatively small, as historical data is archived, typically in MB or GB.
Queries | Relatively slow, as the amount of data involved is large; queries may take hours. | Very fast, as the queries operate on a small fraction (around 5%) of the data.
Update | The OLAP database is not often updated. As a result, data integrity is unaffected. | The data integrity constraint must be maintained in an OLTP database.
Backup and recovery | It only needs backups from time to time as compared to OLTP. | The backup and recovery process is maintained rigorously.
Processing time | The processing of complex queries can take a lengthy time. | It is comparatively fast in processing because of simple and straightforward queries.
Types of users | This data is generally managed by CEOs, MDs, and GMs. | This data is managed by clerks and managers.
Operations | Only read and rarely write operations. | Both read and write operations.
Updates | With lengthy, scheduled batch operations, data is refreshed on a regular basis. | The user initiates data updates, which are brief and quick.
Nature of audience | The process is market-oriented. | The process is customer-oriented.
Database design | Design with a focus on the subject. | Design with a focus on the application.
Productivity | Improves the efficiency of business analysts. | Enhances the user's productivity.
Difference between Star Schema and Snowflake Schema

• Star Schema: A star schema is a type of multidimensional model used for data warehouses. In a star
schema, fact tables and dimension tables are contained. This schema uses fewer foreign-key joins and
forms a star shape, with a fact table at the center and dimension tables around it.
• Snowflake Schema: A snowflake schema is also a type of multidimensional model used for data
warehouses. In a snowflake schema, fact tables, dimension tables, and sub-dimension tables are
contained. This schema forms a snowflake shape with fact tables, dimension tables, and sub-dimension
tables.
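A minimal sketch of a star schema expressed as pandas data frames (one fact table plus two dimension tables, all invented), joined through foreign keys the way a star-schema query would be.

```python
# Star-schema sketch: one fact table joined to two dimension tables.
import pandas as pd

dim_product = pd.DataFrame({"product_id": [1, 2], "product_name": ["Milk", "Bread"]})
dim_store   = pd.DataFrame({"store_id": [10, 20], "city": ["Pune", "Delhi"]})

fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "store_id":   [10, 10, 20],
    "units_sold": [30, 12, 25],
})

# Star-schema query: join the fact table to its dimensions via foreign keys
report = (fact_sales.merge(dim_product, on="product_id")
                    .merge(dim_store, on="store_id"))
print(report.groupby(["city", "product_name"])["units_sold"].sum())
```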
Difference between Star Schema and Snowflake Schema
Star Schema | Snowflake Schema
In a star schema, fact tables and dimension tables are contained. | In a snowflake schema, fact tables, dimension tables, and sub-dimension tables are contained.
Star schema is a top-down model. | Snowflake schema is a bottom-up model.
Star schema uses more space. | Snowflake schema uses less space.
It takes less time for the execution of queries. | It takes more time than the star schema for the execution of queries.
In a star schema, normalization is not used. | Both normalization and denormalization are used.
Its design is very simple. | Its design is complex.
The query complexity of the star schema is low. | The query complexity of the snowflake schema is higher than that of the star schema.
It is very simple to understand. | It is more difficult to understand.
It has a smaller number of foreign keys. | It has a larger number of foreign keys.
It has high data redundancy. | It has low data redundancy.


ETL Process in Data Warehouse

ETL stands for Extract, Transform, Load and it is a process used in data warehousing to extract
data from various sources, transform it into a format suitable for loading into a data warehouse,
and then load it into the warehouse. The process of ETL can be broken down into the following
three stages:

1. Extract: The first stage in the ETL process is to extract data from various sources such as
transactional systems, spreadsheets, and flat files. This step involves reading data from the
source systems and storing it in a staging area.
2. Transform: In this stage, the extracted data is transformed into a format that is suitable for
loading into the data warehouse. This may involve cleaning and validating the data, converting
data types, combining data from multiple sources, and creating new data fields.
3. Load: After the data is transformed, it is loaded into the data warehouse. This step involves
creating the physical data structures and loading the data into the warehouse.

The ETL process is an iterative process that is repeated as new data is added to the warehouse. The
process is important because it ensures that the data in the data warehouse is accurate, complete,
and up-to-date. It also helps to ensure that the data is in the format required for data mining and
reporting.
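A minimal ETL sketch in Python using pandas and the standard-library sqlite3 module as a stand-in warehouse; the source file, table, and column names are illustrative placeholders.

```python
# Minimal ETL sketch: extract from a CSV, transform, load into SQLite.
import sqlite3
import pandas as pd

# Extract: read raw data from a (hypothetical) source system export
raw = pd.read_csv("orders_export.csv")

# Transform: clean, convert types, and derive new fields
raw = raw.dropna(subset=["order_id"])
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw["revenue"] = raw["quantity"] * raw["unit_price"]

# Load: write the transformed data into the warehouse table
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("fact_orders", conn, if_exists="append", index=False)
```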
ETL Tools: Most commonly used ETL tools are Hevo, Sybase, Oracle Warehouse builder, CloverETL, and
MarkLogic.
Data Warehouses: Most commonly used Data Warehouses are Snowflake, Redshift, BigQuery, and
Firebolt.
Roles and Responsibilities of an ETL Developer

• Works efficiently with different database management systems
• Has a good command of data warehousing concepts to load data effectively
• Designs, develops, and automates data loading processes
• Develops data flow models for ETL applications
• Tests the storage system after transferring the data
• Prepares documentation for future reference
• Ensures that the data architecture is scalable
• Works with clients to deliver scalable and maintainable data solutions
• Gathers requirements and processes them for business needs
• Has exposure to Business Intelligence platforms like Power BI or Tableau
• Collaborates with other IT teams to modify and enhance existing BI assets
• Troubleshoots and resolves issues like database performance, database capacity, replication, and other data-related issues
