Unit-2: BI
Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization
Data Mining in Business Analytics: Definition, Techniques, and Benefits
The term “data mining” may conjure images of hackers getting access to your
information or spying on you. But the truth is that data mining plays a very
important and positive role for many individuals and organizations.
What is Data Mining?
• Organizations that engage in data mining can gain a competitive advantage through a better understanding of consumer behavior, efficient oversight of business operations, improved customer acquisition, and new growth opportunities.
• Some businesses may look for better ways to optimize their supply chain while
others may opt to improve marketing outreach techniques. Whatever an
organization’s goals might be, the process of data mining can help them make
more effective decisions through comprehensive analysis.
Benefits of Data Mining
While they can differ across industries, the benefits of data mining commonly include the
following:
• Cost effective. Organizations that invest in efficient methods of data mining can save
money in the long run.
• Reliable. Many—if not all—types of data mining are designed to produce dependable,
actionable results.
• Quantifiable. Information pulled from data mining can be easily measured and compared
against other sets of data.
• Different processes:
Before being passed to the database or data warehouse server, the data must be cleaned, integrated, and selected. Because the information comes from various sources and in different formats, it cannot be used directly for the data mining procedure: the data may be incomplete or inaccurate. So the data first needs to be cleaned and unified. More information than needed will be collected from the various data sources, and only the data of interest has to be selected and passed to the server. These procedures are not as simple as they sound; several methods may be applied to the data as part of selection, integration, and cleaning.
• Knowledge Base:
The knowledge base is helpful throughout the data mining process. It can be used to guide the search or to evaluate the interestingness of the resulting patterns. The knowledge base may even contain user views and data from user experience that can be helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to make the results more accurate and reliable, and the pattern evaluation module regularly interacts with the knowledge base both to get inputs and to update it.
Types of data mining techniques
• Classification
The most common technique is classification. To use it, identify a target variable and then divide that variable into categories at an appropriate level of detail. For example, the variable ‘occupation level’ might be split into ‘entry-level’, ‘associate’, and ‘senior’. Using other fields such as age and education level, you can train your data model to predict which occupation level a person is most likely to have. If you add an entry for a recent 22-year-old graduate, the data model could automatically classify that person into an ‘entry-level’ position. Insurance and financial institutions such as PEMCO Insurance have used classification to train their algorithms to flag fraud and to monitor claims.
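A minimal sketch of the occupation-level example above using a decision-tree classifier; the feature values and labels are invented for illustration, and scikit-learn is just one possible tool:

```python
# Minimal sketch of classification with a decision tree (hypothetical data).
from sklearn.tree import DecisionTreeClassifier

# Features: [age, years_of_education]; labels: occupation level.
X_train = [[22, 16], [25, 16], [32, 18], [40, 18], [45, 20], [50, 16]]
y_train = ["entry-level", "entry-level", "associate", "associate", "senior", "senior"]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# A recent 22-year-old graduate with 16 years of education:
print(model.predict([[22, 16]]))  # expected to land in "entry-level"
```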
• Clustering
Clustering is another common technique, grouping records, observations, or cases by similarity.
There won’t be a target variable like in classification. Instead, clustering just means separating the
data set into subgroups. This method can include grouping records of users by geographic area or
age group. Typically, clustering the data into subgroups is preparation for analysis. The subgroups
become inputs for a different technique.
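A minimal sketch of clustering user records into subgroups with k-means; the ages and spend values are invented:

```python
# Minimal sketch of clustering with k-means (hypothetical customer records).
import numpy as np
from sklearn.cluster import KMeans

# Features: [age, yearly_spend_usd]
X = np.array([[22, 300], [25, 350], [41, 1200], [44, 1100], [63, 500], [60, 450]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)                    # subgroup assignment for each record
print(kmeans.cluster_centers_)   # the center of each subgroup
```

The resulting subgroups can then be fed into a further technique, as noted above.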
Functionalities of Data Mining
There is a lot of confusion between data mining and data analysis. Data mining functions are used to define the trends or correlations contained in data mining activities. While data analysis is used to test statistical models that fit the dataset (for example, analysis of a marketing campaign), data mining uses machine learning and mathematical and statistical models to discover patterns hidden in the data. Data mining functionalities can broadly be divided into two categories, descriptive and predictive, and include the following:
4. Classification
• Classification is a data mining technique that categorizes items in a collection based on some predefined properties. It uses methods like if-then rules, decision trees, or neural networks to predict a class, essentially classifying a collection of items. A training set containing items whose properties are known is used to train the system to predict the category of items from an unknown collection.
5. Prediction
• Prediction is used to estimate unavailable data values or future trends, such as spending trends. An object can be anticipated based on its attribute values and the attribute values of the classes. It can be a prediction of missing numerical values or of increasing or decreasing trends in time-related information. There are primarily two types of predictions in data mining: numeric and class predictions.
• Numeric predictions are made by creating a linear regression model that is based on historical data. Prediction of
numeric values helps businesses ramp up for a future event that might impact the business positively or negatively.
• Class predictions are used to fill in missing class information for products using a training data set where the class
for products is known.
6. Cluster Analysis
• In image processing, pattern recognition and bioinformatics, clustering is a popular data mining
functionality. It is similar to classification, but the classes are not predefined. Data attributes represent the
classes. Similar data are grouped together, with the difference being that a class label is not known.
Clustering algorithms group data based on similar features and dissimilarities.
7. Outlier Analysis
• Outlier analysis is important for understanding the quality of data. If there are too many outliers, you cannot trust the data or draw patterns from it. An outlier analysis determines whether there is something out of the ordinary in the data and whether it indicates a situation that a business needs to consider and take measures to mitigate. Data that cannot be grouped into any class by the algorithms is flagged by outlier analysis.
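A minimal sketch of a simple outlier check using z-scores; the measurements are invented, and real systems may use more sophisticated methods:

```python
# Minimal sketch of outlier analysis using a simple z-score rule (hypothetical values).
import numpy as np

values = np.array([10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 25.7, 10.3])

z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 2]   # flag points more than 2 std devs from the mean
print(outliers)                           # [25.7]
```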
8. Evolution and Deviation Analysis
• Evolution Analysis pertains to the study of data sets that change over time. Evolution analysis models are
designed to capture evolutionary trends in data helping to characterize, classify, cluster or discriminate
time-related data.
9. Correlation Analysis
• Correlation is a mathematical technique for determining whether and how strongly two attributes are related to one another. It can also be applied to richer data structures, such as trees and graphs, in combination with item sets or subsequences. It determines how closely two numerically measured continuous variables are linked. Researchers can use this type of analysis to see whether there are any possible correlations between the variables in their study.
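A minimal sketch of correlation analysis between two continuous variables using the Pearson coefficient; the figures are invented:

```python
# Minimal sketch of correlation analysis (hypothetical ad-spend and sales figures).
import numpy as np

ad_spend = np.array([100, 200, 300, 400, 500])
sales    = np.array([1100, 1900, 3200, 3900, 5100])

r = np.corrcoef(ad_spend, sales)[0, 1]    # Pearson correlation coefficient
print(round(r, 3))                        # close to +1 => strong positive linear relationship
```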
Classification of Data Mining Systems
Classification Based on the Databases Mined
• A data mining system can be classified based on the types of databases that have been mined. A
database system can be further segmented based on distinct principles, such as data models,
types of data, etc., which further assist in classifying a data mining system.
• For example, if we want to classify a database based on the data model, we need to select either
relational, transactional, object-relational or data warehouse mining systems.
No Coupling
• In the no-coupling scheme, the data mining system does not use any functions of a database or data warehouse system.
Loose Coupling
• In loose coupling, data mining utilizes some of the database or data warehouse system functionalities. It mainly
fetches the data from the data repository managed by these systems and then performs data mining. The results
are kept either in the file or any designated place in the database or data warehouse.
Semi-Tight Coupling
• In semi-tight coupling, data mining is linked to either the DB or DW system and provides an efficient
implementation of data mining primitives within the database.
Tight Coupling
• A data mining system can be effortlessly combined with a database or data warehouse system in tight coupling.
KDD Process in Data Mining (Representation of Input Data)
KDD Process
• KDD (Knowledge Discovery in Databases) is a process that involves the
extraction of useful, previously unknown, and potentially valuable
information from large datasets. The KDD process is an iterative process and
it requires multiple iterations to extract accurate knowledge from the data.
Steps included in KDD process
Data Cleaning
• Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
1. Cleaning in case of missing values.
2. Cleaning noisy data, where noise is a random or variance error.
3. Cleaning with data discrepancy detection and data transformation tools.
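A minimal sketch of the cleaning step with pandas, handling missing values and an obvious noise error; the column names and values are invented:

```python
# Minimal sketch of data cleaning: handling missing values and noise (hypothetical data).
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age": [34, None, 29, 41, 290],      # one missing value and one obvious entry error
    "city": ["Pune", "Delhi", None, "Mumbai", "Pune"],
})

df["age"] = df["age"].mask(df["age"] > 120)        # treat implausible ages as missing
df["age"] = df["age"].fillna(df["age"].median())   # fill missing numeric values with the median
df["city"] = df["city"].fillna("Unknown")          # fill missing categorical values
print(df)
```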
Data Integration
• Data integration is defined as combining heterogeneous data from multiple sources into a common source (a data warehouse). Data integration uses data migration tools, data synchronization tools, and the ETL (Extract-Transform-Load) process.
Data Selection
• Data selection is defined as the process where data relevant to the analysis is decided upon and retrieved from the data collection. Neural networks, decision trees, naive Bayes, clustering, and regression methods can be used for this.
Data Transformation
• Data transformation is defined as the process of transforming data into the form required by the mining procedure. Data transformation is a two-step process: data mapping (assigning elements from the source to the destination to capture transformations) and code generation (creation of the actual transformation program).
Data Mining
• Data mining is defined as applying techniques to extract potentially useful patterns. It transforms task-relevant data into patterns and decides the purpose of the model, using classification or characterization.
Pattern Evaluation
• Pattern evaluation is defined as identifying interesting patterns representing knowledge based on given measures. It finds an interestingness score for each pattern and uses summarization and visualization to make the data understandable by the user.
Knowledge Representation
• Knowledge representation is defined as the technique that uses visualization and reporting tools to present the discovered knowledge to the user.
Methodologies for Data Analysis
The 7 Most Useful Data Analysis Methods and Techniques
a. Regression analysis
• Regression analysis is used to estimate the relationship between a set of variables. The aim of regression analysis is to estimate
how one or more variables might impact the dependent variable, in order to identify trends and patterns. This is especially useful
for making predictions and forecasting future trends.
• Let’s imagine you work for an ecommerce company and you want to examine the relationship between: (a) how much money is
spent on social media marketing, and (b) sales revenue. In this case, sales revenue is your dependent variable—it’s the factor
you’re most interested in predicting and boosting. Social media spend is your independent variable; you want to determine
whether or not it has an impact on sales and, ultimately, whether it’s worth increasing, decreasing, or keeping the same. Using
regression analysis, you’d be able to see if there’s a relationship between the two variables. A positive correlation would imply
that the more you spend on social media marketing, the more sales revenue you make. No correlation at all might suggest that
social media marketing has no bearing on your sales. Understanding the relationship between these two variables would help
you to make informed decisions about the social media budget going forward. However: It’s important to note that, on their own,
regressions can only be used to determine whether or not there is a relationship between a set of variables—they don’t tell you
anything about cause and effect. So, while a positive correlation between social media spend and sales revenue may suggest that
one impacts the other, it’s impossible to draw definitive conclusions based on this analysis alone.
• There are many different types of regression analysis, and the model you use depends on the type of data you have for the
dependent variable. For example, your dependent variable might be continuous (i.e. something that can be measured on a
continuous scale, such as sales revenue in USD), in which case you’d use a different type of regression analysis than if your
dependent variable was categorical in nature (i.e. comprising values that can be categorised into a number of distinct groups
based on a certain characteristic, such as customer location by continent).
• Regression analysis in action:
Investigating the relationship between clothing brand Benetton’s advertising expenditure and sales
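A minimal sketch of the social-media-spend example above using simple linear regression; the spend and revenue figures are invented, and scikit-learn's LinearRegression is just one possible tool:

```python
# Minimal sketch of simple linear regression: social media spend vs. sales revenue (hypothetical figures).
import numpy as np
from sklearn.linear_model import LinearRegression

spend   = np.array([[1000], [2000], [3000], [4000], [5000]])   # independent variable
revenue = np.array([12000, 21000, 33000, 41000, 52000])        # dependent variable

model = LinearRegression().fit(spend, revenue)
print(model.coef_[0], model.intercept_)   # slope and intercept of the fitted line
print(model.predict([[6000]]))            # forecast revenue for a planned spend
```

A positive slope would be consistent with the positive correlation described above, though, as noted, it does not by itself prove cause and effect.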
b. Monte Carlo simulation
• When making decisions or taking certain actions, there are a range of different possible outcomes. If you take the bus, you might get
stuck in traffic. If you walk, you might get caught in the rain or bump into your chatty neighbor, potentially delaying your journey. In
everyday life, we tend to briefly weigh up the pros and cons before deciding which action to take; however, when the stakes are high,
it’s essential to calculate, as thoroughly and accurately as possible, all the potential risks and rewards.
• Monte Carlo simulation, otherwise known as the Monte Carlo method, is a computerized technique used to generate models of
possible outcomes and their probability distributions. It essentially considers a range of possible outcomes and then calculates how
likely it is that each particular outcome will be realized. The Monte Carlo method is used by data analysts to conduct advanced risk
analysis, allowing them to better forecast what might happen in the future and make decisions accordingly.
So how does Monte Carlo simulation work, and what can it tell us?
• To run a Monte Carlo simulation, you’ll start with a mathematical model of your data—such as a spreadsheet. Within your
spreadsheet, you’ll have one or several outputs that you’re interested in; profit, for example, or number of sales. You’ll also have a
number of inputs; these are variables that may impact your output variable. If you’re looking at profit, relevant inputs might include
the number of sales, total marketing spend, and employee salaries. If you knew the exact, definitive values of all your input variables,
you’d quite easily be able to calculate what profit you’d be left with at the end. However, when these values are uncertain, a Monte
Carlo simulation enables you to calculate all the possible options and their probabilities. What will your profit be if you make 100,000
sales and hire five new employees on a salary of $50,000 each? What is the likelihood of this outcome? What will your profit be if you
only make 12,000 sales and hire five new employees? And so on. It does this by replacing all uncertain values with functions which
generate random samples from distributions determined by you, and then running a series of calculations and recalculations to produce
models of all the possible outcomes and their probability distributions. The Monte Carlo method is one of the most popular techniques
for calculating the effect of unpredictable variables on a specific output variable, making it ideal for risk analysis.
• Monte Carlo simulation in action: A case study using Monte Carlo simulation for risk analysis
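A minimal sketch of the profit example described above; the distributions, sample sizes, and cost figures are all assumptions for illustration:

```python
# Minimal sketch of a Monte Carlo profit simulation (all distributions are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(seed=42)
n_trials = 100_000

sales       = rng.normal(loc=50_000, scale=10_000, size=n_trials)   # uncertain number of sales
price       = 12.0                                                  # fixed unit price
unit_cost   = rng.uniform(6.0, 8.0, size=n_trials)                  # uncertain unit cost
fixed_costs = 150_000                                               # salaries, marketing, etc.

profit = sales * (price - unit_cost) - fixed_costs

print(profit.mean())        # expected profit across all simulated scenarios
print((profit < 0).mean())  # estimated probability of making a loss
```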
c. Factor analysis
• Factor analysis is a technique used to reduce a large number of variables to a smaller number of factors. It works on the basis that
multiple separate, observable variables correlate with each other because they are all associated with an underlying construct. This is
useful not only because it condenses large datasets into smaller, more manageable samples, but also because it helps to uncover hidden
patterns. This allows you to explore concepts that cannot be easily measured or observed—such as wealth, happiness, fitness, or, for a
more business-relevant example, customer loyalty and satisfaction.
• Let’s imagine you want to get to know your customers better, so you send out a rather long survey comprising one hundred questions.
Some of the questions relate to how they feel about your company and product; for example, “Would you recommend us to a friend?”
and “How would you rate the overall customer experience?” Other questions ask things like “What is your yearly household income?”
and “How much are you willing to spend on skincare each month?”
• Once your survey has been sent out and completed by lots of customers, you end up with a large dataset that essentially tells you one
hundred different things about each customer (assuming each customer gives one hundred responses). Instead of looking at each of these
responses (or variables) individually, you can use factor analysis to group them into factors that belong together—in other words, to
relate them to a single underlying construct. In this example, factor analysis works by finding survey items that are strongly correlated.
This is known as covariance. So, if there’s a strong positive correlation between household income and how much they’re willing to
spend on skincare each month (i.e. as one increases, so does the other), these items may be grouped together. Together with other
variables (survey responses), you may find that they can be reduced to a single factor such as “consumer purchasing power”. Likewise, if
a customer experience rating of 10/10 correlates strongly with “yes” responses regarding how likely they are to recommend your product
to a friend, these items may be reduced to a single factor such as “customer satisfaction”.
• In the end, you have a smaller number of factors rather than hundreds of individual variables. These factors are then taken forward for
further analysis, allowing you to learn more about your customers (or any other area you’re interested in exploring).
• Factor analysis in action: Using factor analysis to explore customer behavior patterns in Tehran
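A minimal sketch of running factor analysis on survey responses with scikit-learn; here random numbers stand in for real survey data, so the extracted factors are only illustrative:

```python
# Minimal sketch of factor analysis reducing survey items to latent factors (random stand-in data).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
responses = rng.integers(1, 11, size=(200, 8)).astype(float)  # 200 customers x 8 survey items (1-10 scale)

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(responses)   # each customer summarised by 2 latent factors
print(fa.components_.shape)            # (2, 8): loadings of each item on each factor
print(scores.shape)                    # (200, 2)
```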
d. Cohort analysis
• “Cohort analysis is a subset of behavioral analytics that takes the data from a given dataset and rather than looking at all users as one unit,
it breaks them into related groups for analysis. These related groups, or cohorts, usually share common characteristics or experiences
within a defined time-span.”
• So what does this mean and why is it useful? Let’s break down the above definition further. A cohort is a group of people who share a
common characteristic (or action) during a given time period. Students who enrolled at university in 2020 may be referred to as the 2020
cohort. Customers who purchased something from your online store via the app in the month of December may also be considered a
cohort.
• With cohort analysis, you’re dividing your customers or users into groups and looking at how these groups behave over time. So, rather
than looking at a single, isolated snapshot of all your customers at a given moment in time (with each customer at a different point in
their journey), you’re examining your customers’ behavior in the context of the customer lifecycle. As a result, you can start to identify
patterns of behavior at various points in the customer journey—say, from their first ever visit to your website, through to email newsletter
sign-up, to their first purchase, and so on. As such, cohort analysis is dynamic, allowing you to uncover valuable insights about the
customer lifecycle.
• This is useful because it allows companies to tailor their service to specific customer segments (or cohorts). Let’s imagine you run a 50%
discount campaign in order to attract potential new customers to your website. Once you’ve attracted a group of new customers (a
cohort), you’ll want to track whether they actually buy anything and, if they do, whether or not (and how frequently) they make a repeat
purchase. With these insights, you’ll start to gain a much better understanding of when this particular cohort might benefit from another
discount offer or retargeting ads on social media, for example. Ultimately, cohort analysis allows companies to optimize their service
offerings (and marketing) to provide a more targeted, personalized experience.
• Cohort analysis in action: How Ticketmaster used cohort analysis to boost revenue
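A minimal sketch of cohort analysis with pandas, grouping customers by the month of their first order and counting how many remain active in later months; the order records are invented:

```python
# Minimal sketch of cohort analysis: cohorts by first-order month, activity per month (hypothetical data).
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3, 4],
    "order_date": pd.to_datetime([
        "2023-01-05", "2023-02-10", "2023-01-20", "2023-03-02",
        "2023-02-14", "2023-02-28", "2023-03-07",
    ]),
})

# Each customer's cohort = month of their first order.
orders["cohort"] = orders.groupby("customer_id")["order_date"].transform("min").dt.to_period("M")
orders["order_month"] = orders["order_date"].dt.to_period("M")

# Number of active customers per cohort in each month.
cohort_counts = (orders.groupby(["cohort", "order_month"])["customer_id"]
                       .nunique()
                       .unstack(fill_value=0))
print(cohort_counts)
```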
e. Cluster analysis
• Cluster analysis is an exploratory technique that seeks to identify structures within a dataset. The
goal of cluster analysis is to sort different data points into groups (or clusters) that are internally
homogeneous and externally heterogeneous. This means that data points within a cluster are similar
to each other, and dissimilar to data points in another cluster. Clustering is used to gain insight into
how data is distributed in a given dataset, or as a preprocessing step for other algorithms.
• There are many real-world applications of cluster analysis. In marketing, cluster analysis is
commonly used to group a large customer base into distinct segments, allowing for a more targeted
approach to advertising and communication. Insurance firms might use cluster analysis to
investigate why certain locations are associated with a high number of insurance claims. Another
common application is in geology, where experts will use cluster analysis to evaluate which cities
are at greatest risk of earthquakes (and thus try to mitigate the risk with protective measures).
• It’s important to note that, while cluster analysis may reveal structures within your data, it won’t
explain why those structures exist. With that in mind, cluster analysis is a useful starting point for
understanding your data and informing further analysis. Clustering algorithms are also widely used in machine learning.
• Cluster analysis in action:
Using cluster analysis for customer segmentation—a telecoms case study example
f. Time series analysis
• Time series analysis is a statistical technique used to identify trends and cycles over time. Time series data is a
sequence of data points which measure the same variable at different points in time (for example, weekly sales
figures or monthly email sign-ups). By looking at time-related trends, analysts are able to forecast how the
variable of interest may fluctuate in the future.
• When conducting time series analysis, the main patterns you’ll be looking out for in your data are:
• Trends: Stable, linear increases or decreases over an extended time period.
• Seasonality: Predictable fluctuations in the data due to seasonal factors over a short period of time. For
example, you might see a peak in swimwear sales in summer around the same time every year.
• Cyclic patterns: Unpredictable cycles where the data fluctuates. Cyclical trends are not due to seasonality, but
rather, may occur as a result of economic or industry-related conditions.
• As you can imagine, the ability to make informed predictions about the future has immense value for business.
Time series analysis and forecasting is used across a variety of industries, most commonly for stock market
analysis, economic forecasting, and sales forecasting. There are different types of time series models
depending on the data you’re using and the outcomes you want to predict. These models are typically
classified into three broad types: the autoregressive (AR) models, the integrated (I) models, and the moving
average (MA) models.
• Time series analysis in action: Developing a time series model to predict jute yarn demand in Bangladesh
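A minimal sketch of basic time series exploration with pandas, smoothing monthly sales with a rolling average to expose the trend; the figures are invented:

```python
# Minimal sketch of time series analysis: trend via a rolling average (hypothetical monthly sales).
import pandas as pd

sales = pd.Series(
    [100, 120, 130, 90, 110, 150, 160, 140, 170, 200, 210, 190],
    index=pd.period_range("2023-01", periods=12, freq="M"),
)

trend = sales.rolling(window=3, center=True).mean()   # smooth out short-term fluctuations
print(trend)

# Month-over-month change highlights seasonal or cyclic movements.
print(sales.diff())
```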
g. Sentiment analysis
• With sentiment analysis, the goal is to interpret and classify the emotions conveyed within textual data. From a business
perspective, this allows you to ascertain how your customers feel about various aspects of your brand, product, or service.
There are several different types of sentiment analysis models, each with a slightly different focus. The three main types include:
Fine-grained sentiment analysis
• If you want to focus on opinion polarity (i.e. positive, neutral, or negative) in depth, fine-grained sentiment analysis will allow you
to do so.
• For example, if you wanted to interpret star ratings given by customers, you might use fine-grained sentiment analysis to
categorize the various ratings along a scale ranging from very positive to very negative.
Emotion detection
• This model often uses complex machine learning algorithms to pick out various emotions from your textual data.
• You might use an emotion detection model to identify words associated with happiness, anger, frustration, and excitement, giving
you insight into how your customers feel when writing about you or your product on, say, a product review site.
Aspect-based sentiment analysis
• This type of analysis allows you to identify what specific aspects the emotions or opinions relate to, such as a certain product
feature or a new ad campaign.
• If a customer writes that they “find the new Instagram advert so annoying”, your model should detect not only a negative
sentiment, but also the object towards which it’s directed.
• In a nutshell, sentiment analysis uses various Natural Language Processing (NLP) algorithms and systems which are trained to
associate certain inputs (for example, certain words) with certain outputs.
• For example, the input “annoying” would be recognized and tagged as “negative”. Sentiment analysis is crucial to understanding
how your customers feel about you and your products, for identifying areas for improvement, and even for averting PR disasters
in real-time!
• Sentiment analysis in action: 5 Real-world sentiment analysis case studies
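A minimal sketch of the simplest form of sentiment analysis, a lexicon-based approach that counts positive and negative words; the tiny word lists are illustrative stand-ins, and real systems use trained NLP models as described above:

```python
# Minimal sketch of lexicon-based sentiment analysis (illustrative word lists, not a real lexicon).
POSITIVE = {"great", "love", "excellent", "happy", "good"}
NEGATIVE = {"annoying", "bad", "terrible", "hate", "poor"}

def sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I find the new Instagram advert so annoying"))  # negative
print(sentiment("The customer experience was excellent"))        # positive
```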
What is a Data Mart?
• Teams forced to locate data from various sources most often rely on
spreadsheets to share this data and collaborate. This usually results in
human errors, confusion, complex reconciliations, and multiple sources of
truth—the so-called “spreadsheet nightmare.” Data marts have become
popular as a centralized place where the necessary data is collected and
organized before reports, dashboards, and visualizations are created.
The difference between data marts, data lakes, and data
warehouses
Data marts, data lakes, and data warehouses serve different purposes and needs.
• A data warehouse is a data management system designed to support business intelligence and analytics for
an entire organization. Data warehouses often contain large amounts of data, including historical data. The
data within a data warehouse usually is derived from a wide range of sources, such as application log files
and transactional applications. A data warehouse stores structured data, whose purpose is usually well-
defined.
• A data lake allows organizations to store large amounts of structured and unstructured data (for example,
from social media or clickstream data), and to immediately make it available for real-time analytics, data
science, and machine learning use cases. With a data lake, data is ingested in its original form, without
alteration.
• The key difference between a data lake and a data warehouse is that data lakes store vast amounts of raw
data, without a predefined structure. Organizations do not need to know in advance how the data will be
used.
• A data mart is a simple form of a data warehouse that is focused on a single subject or line of business, such
as sales, finance, or marketing. Given their focus, data marts draw data from fewer sources than data
warehouses. Data mart sources can include internal operational systems, a central data warehouse, and
external data.
Benefits of a data mart
• A single source of truth. The centralized nature of a data mart helps ensure that everyone in a department or organization
makes decisions based on the same data. This is a major benefit, because the data and the predictions based on that data can
be trusted, and stakeholders can focus on making decisions and taking action, as opposed to arguing about the data itself.
• Quicker access to data. Specific business teams and users can rapidly access the subset of data they need from the
enterprise data warehouse and combine it with data from various other sources. Once the connections to their desired data
sources are established, they can get live data from a data mart whenever needed without having to go to IT to obtain
periodic extracts. Business and IT teams both gain improved productivity as a result.
• Faster insights leading to faster decision making. While a data warehouse enables enterprise-level decision-making, a data
mart allows data analytics at the department level. Analysts can focus on specific challenges and opportunities in areas such
as finance and HR and move more rapidly from data to insights, which enables them to make better and faster decisions.
• Simpler and faster implementation. Setting up an enterprise data warehouse to cater to the needs of your entire
organization can require significant time and effort. A data mart, in contrast, is focused on serving the needs of specific
business teams, requiring access to fewer data sets. It therefore is much simpler and faster to implement.
• Creating agile and scalable data management. Data marts provide an agile data management system that works in tandem
with business needs, including being able to use information gathered in past projects to help with current tasks. Teams can
update and change their data mart based on new and evolving analytics projects.
• Transient analysis. Some data analytics projects are short-lived (for example, completing a specific analysis of online sales for a two-week promotion prior to a team meeting). Teams can rapidly set up a data mart to accomplish such a project.
Types of Data Mart
1. Dependent Data Mart: A dependent data mart is created by extracting data from the central repository, the data warehouse. First the data warehouse is created by extracting data (through an ETL tool) from external sources, and then the data mart is created from the data warehouse. A dependent data mart follows the top-down approach of data warehouse architecture. This model of data mart is used by big organizations.
2. Independent Data Mart: An independent data mart is created directly from external sources instead of from a data warehouse. First the data mart is created by extracting data from external sources, and then the data warehouse is created from the data present in the data marts. An independent data mart follows the bottom-up approach of data warehouse architecture. This model of data mart is used by small organizations and is comparatively cost effective.
3. Hybrid Data Mart: This type of data mart is created by extracting data both from operational sources and from the data warehouse; one path reflects accessing data directly from external sources, and the other reflects the dependent data mart model.
• Need of a Data Mart:
1. A data mart focuses only on the functioning of a particular department of an organization.
2. It is maintained by a single authority within an organization.
3. Since it stores data related to a specific part of an organization, data retrieval from it is very quick.
4. Designing and maintaining a data mart is quite easy compared to a data warehouse.
5. It reduces the response time of the user as it stores a small volume of data.
6. It is small in size, due to which accessing data from it is very fast.
7. This storage unit is used by most organizations for the smooth running of their departments.
• Advantages of a Data Mart:
1. Implementing a data mart needs less time than implementing a data warehouse, as a data mart is designed for a particular department of an organization.
2. Organizations can choose the model of data mart that suits their cost constraints and business needs.
3. Data can be easily accessed from a data mart.
4. It contains frequently accessed queries, enabling analysis of business trends.
• Disadvantages of a Data Mart:
1. Since it stores data related only to a specific function, it does not hold the huge volume of data covering each and every department of an organization the way a data warehouse does.
2. Creating too many data marts can become cumbersome.
Features of data marts:
• Subset of Data: Data marts are designed to store a subset of data from a larger data warehouse or
data lake. This allows for faster query performance since the data in the data mart is focused on a
specific business unit or department.
• Optimized for Query Performance: Data marts are optimized for query performance, which means
that they are designed to support fast queries and analysis of the data stored in the data mart.
• Customizable: Data marts are customizable, which means that they can be designed to meet the
specific needs of a business unit or department.
• Self-Contained: Data marts are self-contained, which means that they have their own set of tables,
indexes, and data models. This allows for easier management and maintenance of the data mart.
• Security: Data marts can be secured, which means that access to the data in the data mart can be
controlled and restricted to specific users or groups.
• Scalability: Data marts can be scaled horizontally or vertically to accommodate larger volumes of
data or to support more users.
• Integration with Business Intelligence Tools: Data marts can be integrated with business
intelligence tools, such as Tableau, Power BI, or QlikView, which allows users to analyze and
visualize the data stored in the data mart.
• ETL Process: Data marts are typically populated using an Extract, Transform, Load (ETL) process,
which means that data is extracted from the larger data warehouse or data lake, transformed to meet
the requirements of the data mart, and loaded into the data mart.
What is OLAP (Online Analytical Processing)?
• OLAP implements the multidimensional analysis of business information and supports the capability for complex calculations, trend analysis, and sophisticated data modeling. It is rapidly becoming the essential foundation for intelligent solutions including business performance management, planning, budgeting, forecasting, financial reporting, analysis, simulation models, knowledge discovery, and data warehouse reporting. OLAP enables end users to perform ad hoc analysis of records in multiple dimensions, providing the insight and understanding they require for better decision making.
OLAP applications
Finance and accounting:
• Budgeting
• Activity-based costing
• Financial performance analysis
• Financial modeling
Production
• Production planning
• Defect analysis
OLAP cubes have two main purposes. The first is to provide business users with a data model more intuitive to them
than a tabular model. This model is called a Dimensional Model.
The second purpose is to enable fast query response that is usually difficult to achieve using tabular models.
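As a rough, simplified analogy (not an actual OLAP server), a pandas pivot table can illustrate the kind of aggregated, multidimensional view a cube provides; the sales records below are invented:

```python
# Minimal sketch of an OLAP-style multidimensional view using a pandas pivot table (hypothetical data).
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North", "South"],
    "product": ["A", "B", "A", "B", "A", "A"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "revenue": [100, 150, 120, 180, 130, 160],
})

# "Cube" view: revenue by region x product, rolled up over quarters.
cube = pd.pivot_table(sales, values="revenue", index="region",
                      columns="product", aggfunc="sum", fill_value=0)
print(cube)

# A "slice": fix one dimension (quarter = Q1) and look at the rest.
print(sales[sales["quarter"] == "Q1"])
```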
How Does OLAP Work?
Dr. E.F. Codd, the "father" of the relational model, formulated a list of 12 guidelines and requirements as the basis for selecting OLAP systems:
1) Multidimensional Conceptual View: This is the central feature of an OLAP system. By requiring a multidimensional view, it is possible to carry out operations like slice and dice.
2) Transparency: Make the technology, underlying information repository, computing operations, and the dissimilar nature of source data totally
transparent to users. Such transparency helps to improve the efficiency and productivity of the users.
3) Accessibility: It provides access only to the data that is actually required to perform the particular analysis, presenting a single, coherent, and consistent view to the clients. The OLAP system must map its own logical schema to the heterogeneous physical data stores and perform any necessary transformations. The OLAP engine should sit between data sources (e.g., data warehouses) and an OLAP front-end.
4) Consistent Reporting Performance: To ensure that users do not experience any significant degradation in reporting performance as the number of dimensions or the size of the database increases. That is, the performance of OLAP should not suffer as the number of dimensions is increased. Users must observe consistent run time, response time, or machine utilization every time a given query is run.
5) Client/Server Architecture: Make the server component of OLAP tools sufficiently intelligent that various clients can be attached with a minimum of effort and integration programming. The server should be capable of mapping and consolidating data between dissimilar databases.
6) Generic Dimensionality: An OLAP method should treat each dimension as equivalent in both its structure and operational capabilities. Additional operational capabilities may be granted to selected dimensions, but such additional functions should be grantable to any dimension.
7) Dynamic Sparse Matrix Handling: To adapt the physical schema to the specific analytical model being created and loaded so that sparse matrix handling is optimized. When encountering a sparse matrix, the system should be able to dynamically infer the distribution of the information and adjust storage and access to obtain and maintain a consistent level of performance.
8) Multiuser Support: OLAP tools must provide concurrent data access, data integrity, and access security.
9) Unrestricted Cross-dimensional Operations: The system should recognize dimensional hierarchies and automatically perform roll-up and drill-down operations within a dimension or across dimensions.
10) Intuitive Data Manipulation: Data manipulation fundamental to the consolidation path, such as reorientation (pivoting), drill-down, roll-up, and other manipulations, should be accomplished naturally and precisely via point-and-click and drag-and-drop actions on the cells of the analytical model, avoiding the use of a menu or multiple trips to a user interface.
11) Flexible Reporting: It gives business clients the ability to organize columns, rows, and cells in a manner that facilitates simple manipulation, analysis, and synthesis of data.
12) Unlimited Dimensions and Aggregation Levels: The number of data dimensions should be unlimited. Each of these common dimensions
must allow a practically unlimited number of customer-defined aggregation levels within any given consolidation path.
Difference between OLAP and OLTP
• Definition: OLAP is an online database query management system; OLTP is an online database modifying system.
• Data source: OLAP consists of historical data consolidated from various databases; OLTP consists of only current operational data.
• Method used: OLAP makes use of a data warehouse; OLTP makes use of a standard database management system (DBMS).
• Application: OLAP is subject-oriented and is used for data mining, analytics, decision making, etc.; OLTP is application-oriented and is used for day-to-day business tasks.
• Normalization: In an OLAP database, tables are not normalized; in an OLTP database, tables are normalized (3NF).
• Usage of data: OLAP data is used in planning, problem-solving, and decision-making; OLTP data is used to perform day-to-day fundamental operations.
• Task: OLAP provides a multi-dimensional view of different business tasks; OLTP reveals a snapshot of present business tasks.
• Purpose: OLAP serves to extract information for analysis and decision-making; OLTP serves to insert, update, and delete information in the database.
• Volume of data: OLAP typically stores a large amount of data (TB, PB); OLTP data is relatively small (MB, GB) as historical data is archived.
• Queries: OLAP queries are relatively slow because of the large volume of data involved and may take hours; OLTP queries are very fast, as they typically operate on only about 5% of the data.
• Updates: The OLAP database is not updated often, so data integrity is unaffected; an OLTP database must rigorously maintain data integrity constraints.
• Backup and recovery: OLAP only needs backups from time to time; the OLTP backup and recovery process is maintained rigorously.
• Processing time: The processing of complex OLAP queries can take a long time; OLTP processing is comparatively fast because its queries are simple and straightforward.
• Types of users: OLAP data is generally used by executives such as the CEO, MD, and GM; OLTP data is managed by clerks and managers.
• Operations: OLAP involves mostly read and only rarely write operations; OLTP involves both read and write operations.
• Data refresh: In OLAP, data is refreshed on a regular basis through lengthy, scheduled batch operations; in OLTP, updates are initiated by users and are brief and quick.
• Nature of audience: OLAP is market-oriented, serving analysts and knowledge workers; OLTP is customer-oriented, serving transaction processing.
• Database design: OLAP uses a design focused on the subject; OLTP uses a design focused on the application.
• Productivity: OLAP improves the efficiency of business analysts; OLTP enhances the end user's productivity.
Difference between Star Schema and Snowflake Schema
• Star Schema: The star schema is a type of multidimensional model used for data warehouses. In a star schema, fact tables and dimension tables are contained, and fewer foreign-key joins are used. This schema forms a star with the fact table at the center and the dimension tables around it.
• Snowflake Schema: The snowflake schema is also a type of multidimensional model used for data warehouses. In a snowflake schema, fact tables, dimension tables, as well as sub-dimension tables are contained. This schema forms a snowflake with fact tables, dimension tables, and sub-dimension tables.
• A star schema contains fact tables and dimension tables, while a snowflake schema contains fact tables, dimension tables, as well as sub-dimension tables.
• A star schema takes less time for the execution of queries, while a snowflake schema takes more time.
• A star schema does not use normalization, while a snowflake schema uses both normalization and denormalization.
• The query complexity of a star schema is low, while that of a snowflake schema is higher.
• A star schema has fewer foreign keys, while a snowflake schema has more foreign keys.
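A small illustrative sketch of the star schema idea above: a fact table joined directly to its dimension tables using pandas (all table and column names are invented):

```python
# Minimal sketch of a star schema: one fact table joined to two dimension tables (hypothetical data).
import pandas as pd

dim_product = pd.DataFrame({"product_id": [1, 2], "product_name": ["Shoes", "Bag"]})
dim_store   = pd.DataFrame({"store_id": [10, 20], "city": ["Pune", "Delhi"]})

fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "store_id":   [10, 10, 20],
    "units_sold": [5, 3, 7],
})

# Few foreign-key joins: the fact table links to each dimension table directly.
report = (fact_sales.merge(dim_product, on="product_id")
                    .merge(dim_store, on="store_id"))
print(report.groupby(["city", "product_name"])["units_sold"].sum())
```

In a snowflake schema, a dimension such as dim_product would itself be normalized into further sub-dimension tables (for example, a separate product-category table), adding extra joins.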
ETL stands for Extract, Transform, Load and it is a process used in data warehousing to extract
data from various sources, transform it into a format suitable for loading into a data warehouse,
and then load it into the warehouse. The process of ETL can be broken down into the following
three stages:
1. Extract: The first stage in the ETL process is to extract data from various sources such as
transactional systems, spreadsheets, and flat files. This step involves reading data from the
source systems and storing it in a staging area.
2. Transform: In this stage, the extracted data is transformed into a format that is suitable for
loading into the data warehouse. This may involve cleaning and validating the data, converting
data types, combining data from multiple sources, and creating new data fields.
3. Load: After the data is transformed, it is loaded into the data warehouse. This step involves
creating the physical data structures and loading the data into the warehouse.
The ETL process is an iterative process that is repeated as new data is added to the warehouse. The
process is important because it ensures that the data in the data warehouse is accurate, complete,
and up-to-date. It also helps to ensure that the data is in the format required for data mining and
reporting.
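A minimal sketch of the three ETL stages with pandas and SQLite; the file name, column names, and cleaning rules are assumptions for illustration:

```python
# Minimal sketch of an ETL job: extract from a CSV, transform, and load into a SQLite "warehouse" table.
import sqlite3
import pandas as pd

# Extract: read raw data from a source file into a staging DataFrame.
raw = pd.read_csv("sales_export.csv")            # assumed columns: order_id, amount, order_date

# Transform: clean and reshape into the warehouse format.
raw = raw.dropna(subset=["order_id"])            # drop incomplete rows
raw["amount"] = raw["amount"].astype(float)      # enforce data types
raw["order_date"] = pd.to_datetime(raw["order_date"])

# Load: write the transformed data into the warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("fact_sales", conn, if_exists="append", index=False)
```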
ETL Tools: Most commonly used ETL tools are Hevo, Sybase, Oracle Warehouse builder, CloverETL, and
MarkLogic.
Data Warehouses: Most commonly used Data Warehouses are Snowflake, Redshift, BigQuery, and
Firebolt.
Roles and Responsibilities of an ETL Developer