OLAP operations?
OLAP operations are techniques used in Online Analytical Processing (OLAP) to explore, analyze, and
manipulate multi-dimensional data for business intelligence, reporting, and decision-making. OLAP deals with
complex queries that allow users to slice, dice, and navigate through data to derive insights.
Here are the main OLAP operations:
1. Roll-up (Drill-up):
o What it does: Aggregates data, moving from a lower level of detail to a higher one.
o Example: Summarizing daily sales into monthly or yearly totals.
2. Drill-down:
o What it does: Breaks data down into finer granularity, providing more detailed views.
o Example: Drilling from yearly sales to monthly or daily sales data.
3. Slice:
o What it does: Extracts a single layer from the data cube by fixing a specific value for one
dimension.
o Example: Viewing sales data only for a particular year or region.
4. Dice:
o What it does: Extracts a subcube by selecting a range of values from multiple dimensions.
o Example: Viewing sales data for a specific product, region, and time range.
5. Pivot (Rotation):
o What it does: Reorients the data cube to present data from different perspectives by rotating
dimensions.
o Example: Changing the view from sales per region to sales per product by switching rows and
columns.
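The operations above can be sketched with a small pandas example. This is only a rough illustration on a made-up sales table (the columns region, product, year, month, and sales are invented for the example), not the output of any particular OLAP server.

```python
# A minimal pandas sketch of OLAP-style operations on a made-up sales table.
# Column names (region, product, year, month, sales) are illustrative only.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["Pen",  "Book", "Pen",  "Book", "Pen",  "Pen"],
    "year":    [2023,   2023,   2023,   2024,   2024,   2024],
    "month":   [1,      2,      1,      3,      1,      2],
    "sales":   [100,    150,    120,    200,    130,    110],
})

# Roll-up: aggregate monthly detail up to yearly totals per region.
rollup = sales.groupby(["region", "year"])["sales"].sum()

# Drill-down: move back to finer granularity (region, year, month).
drilldown = sales.groupby(["region", "year", "month"])["sales"].sum()

# Slice: fix a single value on one dimension (year == 2023).
slice_2023 = sales[sales["year"] == 2023]

# Dice: select values or ranges on several dimensions at once.
dice = sales[(sales["region"] == "East") &
             (sales["product"] == "Pen") &
             (sales["year"].isin([2023, 2024]))]

# Pivot: rotate the view, e.g. regions as rows and products as columns.
pivot = sales.pivot_table(index="region", columns="product",
                          values="sales", aggfunc="sum")

print(rollup, drilldown, slice_2023, dice, pivot, sep="\n\n")
```

Real OLAP servers apply these operations to pre-built data cubes, but the groupby/pivot logic above captures the same idea on flat data.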
--------------------------------------------------------------------------------------------------------------------------------------------------
SCHEMAS?
The basic concepts of dimensional modeling are: facts, dimensions and measures. A fact is a
collection of related data items, consisting of measures and context data. It typically represents business
items or business transactions. A dimension is a collection of data that describe one business dimension.
Dimensions determine the contextual background for the facts; they are the parameters over which we
want to perform OLAP. A measure is a numeric attribute of a fact, representing the performance or
behavior of the business relative to the dimensions. In the relational context, there are three basic
schemas used in dimensional modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
Star Schema
Each dimension in a star schema is represented with only one dimension table.
This dimension table contains the set of attributes.
For example, consider the sales data of a company with respect to four dimensions, namely
time, item, branch, and location.
There is a fact table at the center. It contains the keys to each of the four dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
Snowflake Schema
Some dimension tables in the Snowflake schema are normalized.
The normalization splits up the data into additional tables.
Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item
dimension table of the star schema is normalized and split into two dimension tables, namely the item and
supplier tables.
Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
The supplier key is linked to the supplier dimension table. The supplier dimension table contains the
attributes supplier_key and supplier_type.
Fact Constellation Schema
A fact constellation has multiple fact tables. It is also known as a galaxy schema.
For example, consider two fact tables, namely sales and shipping.
The sales fact table is the same as that in the star schema.
The shipping fact table has five dimensions, namely item_key, time_key, shipper_key,
from_location, and to_location.
The shipping fact table also contains two measures, namely dollars cost and units shipped.
It is also possible to share dimension tables between fact tables. For example, the time, item, and location
dimension tables are shared between the sales and shipping fact tables.
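To make the star layout concrete, the following sketch uses Python's built-in sqlite3 module to create the fact and dimension tables from the example above and run a typical star-join query. The exact column lists are assumptions made for illustration, not a fixed standard.

```python
# A minimal star-schema sketch using Python's built-in sqlite3 module.
# Table and column names follow the example above but are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, country TEXT);

-- Fact table at the center: keys to each dimension plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")

# A typical star-join query: total dollars sold per year and city.
query = """
SELECT t.year, l.city, SUM(f.dollars_sold) AS total_sales
FROM sales_fact f
JOIN time_dim t     ON f.time_key = t.time_key
JOIN location_dim l ON f.location_key = l.location_key
GROUP BY t.year, l.city;
"""
print(cur.execute(query).fetchall())  # empty until fact rows are loaded
conn.close()
```

In the snowflake variant, the item dimension above would itself be normalized, e.g. split into item_dim(item_key, item_name, type, brand, supplier_key) and supplier_dim(supplier_key, supplier_type).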
--------------------------------------------------------------------------------------------------------------------------------------------------
OLAP & OLTP DIFFERENCES?
Aspect | OLAP (Online Analytical Processing) | OLTP (Online Transaction Processing)
Purpose | Analytical querying and reporting | Transactional operations and data entry
Data Operations | Complex queries for data analysis | Simple, read/write queries for daily operations
Query Complexity | Complex, involving aggregation, multi-table joins | Simple and often predefined
Data Volume | Handles large volumes of historical data | Handles smaller, real-time operational data
Data Type | Historical, aggregated, and summarized data | Current, transactional, and detailed data
Response Time | Generally slower due to complex queries | Fast response time, optimized for quick data operations
Normalization | Typically uses denormalized data (e.g., star schema) | Highly normalized data (to reduce redundancy)
Concurrency | Fewer users with fewer concurrent queries | High concurrency with many users performing transactions simultaneously
Usage | Business intelligence, reporting, data mining | Day-to-day business operations like order processing, banking transactions
Backup & Recovery | Not as frequent as OLTP, focus on historical data integrity | Frequent, as data is more critical for daily operations
Examples | Data warehouses, decision support systems | ERP systems, online booking systems
--------------------------------------------------------------------------------------------------------------------------------------------------
DATA WAREHOUSE ARCHITECTURE?
Tier-1:
The bottom tier is a warehouse database server that is almost always a relational database system. Back-
end tools and utilities are used to feed data into the bottom tier from operational databases or other
external sources (such as customer profile information provided by external consultants). These tools
and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from
different sources into a unified format), as well as load and refresh functions to update the data
warehouse. The data are extracted using application program interfaces known as gateways. A gateway
is supported by the underlying DBMS and allows client programs to generate SQL code to be executed
at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and
Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the data warehouse and its
contents.
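As a hedged illustration of the gateway idea (a client program generating SQL that the warehouse server executes), the sketch below uses the pyodbc package; the DSN, credentials, and table/column names are hypothetical and would need a real, configured data source to run.

```python
# Hypothetical sketch: a client program sends SQL through an ODBC gateway.
# The DSN, credentials, and table/column names are made up for illustration.
import pyodbc

conn = pyodbc.connect("DSN=SalesWarehouse;UID=analyst;PWD=secret")  # hypothetical DSN
cursor = conn.cursor()

# The client generates SQL; the warehouse server executes it and returns rows.
cursor.execute("SELECT year, SUM(dollars_sold) FROM sales_fact GROUP BY year")
for year, total in cursor.fetchall():
    print(year, total)

conn.close()
```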
Tier-2:
The middle tier is an OLAP server that is typically implemented using either a relational OLAP
(ROLAP) model or a multidimensional OLAP (MOLAP) model.
A ROLAP model is an extended relational DBMS that maps operations on multidimensional data to standard
relational operations.
A MOLAP model is a special-purpose server that directly implements multidimensional data and operations.
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or
data mining tools (e.g., trend analysis, prediction, and so on).
Data warehouse Architecture and its seven components
Data warehouse architecture typically consists of several layers and components designed to manage, store,
and analyze large amounts of data from different sources. The seven key components of a data warehouse
architecture are:
1. Data Sources:
o These are external systems or databases (e.g., OLTP systems, flat files, external data feeds)
from which the data warehouse extracts data. Data sources can include operational databases,
CRM, ERP systems, spreadsheets, and third-party data sources.
2. Data Extraction, Transformation, and Loading (ETL) Tools:
o Extraction: Extracts data from various sources.
o Transformation: Cleanses, formats, and transforms data into a suitable structure for the
warehouse.
o Loading: Loads the transformed data into the data warehouse.
o ETL processes ensure that data is accurate, consistent, and structured for analysis (a minimal
ETL sketch follows this list).
3. Data Staging Area:
o A temporary storage area where data is held after extraction and before transformation and
loading. It provides a space to clean and consolidate data before it moves into the data
warehouse. This stage ensures that the data is processed properly without affecting the
performance of the data warehouse.
4. Data Storage (Data Warehouse Repository):
o This is the central repository where the cleansed, integrated, and consolidated data is stored. It
can be structured in different ways:
Normalized or denormalized databases.
Star schema or snowflake schema.
Data in the repository is typically organized in a multi-dimensional model (data cubes) to
facilitate analysis and reporting.
5. Metadata:
o Metadata is data about data, describing the structure, data types, relationships, and lineage of
data in the warehouse. It is crucial for understanding, managing, and using the data warehouse
effectively. Metadata is used by ETL tools, query tools, and reporting tools to process data.
6. OLAP (Online Analytical Processing) Engine:
o OLAP tools enable users to perform complex queries, reporting, and data analysis using multi-
dimensional data models. The OLAP engine allows for operations such as slicing, dicing, roll-up,
and drill-down to analyze data across various dimensions.
7. Data Access and Reporting Tools:
o These tools allow end-users to access, visualize, and analyze data stored in the data warehouse.
Common tools include:
Business intelligence (BI) tools for dashboards and reporting.
Query tools for ad-hoc querying and analysis.
Data mining tools for discovering patterns and insights.
These tools help translate the raw data into actionable insights for decision-making.
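A minimal ETL sketch for component 2 is shown below. It is only an illustration under assumed names: the sample order rows, the columns order_date, region, and amount, and the warehouse table monthly_sales_fact are all invented for the example.

```python
# A minimal, hypothetical ETL sketch with pandas and an in-memory SQLite warehouse.
# The sample rows, column names, and table name are made up for illustration.
import pandas as pd
import sqlite3

# Extract: in practice this would come from operational databases or files
# (e.g. pd.read_csv or a source query); a small in-memory sample stands in here.
raw = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-01-20", None, "2024-02-11"],
    "region":     ["East", "East", "West", "West"],
    "amount":     [120.0, None, 75.0, 200.0],
})

# Transform: clean missing values, standardize formats, aggregate to monthly level.
raw = raw.dropna(subset=["order_date"])      # drop rows without a date
raw["amount"] = raw["amount"].fillna(0.0)    # handle missing measures
raw["order_date"] = pd.to_datetime(raw["order_date"])
monthly = (raw
           .assign(month=lambda d: d["order_date"].dt.to_period("M").astype(str))
           .groupby(["month", "region"], as_index=False)["amount"].sum())

# Load: write the consolidated rows into the warehouse repository.
warehouse = sqlite3.connect(":memory:")      # stand-in for the warehouse database
monthly.to_sql("monthly_sales_fact", warehouse, index=False)
print(pd.read_sql("SELECT * FROM monthly_sales_fact", warehouse))
warehouse.close()
```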
--------------------------------------------------------------------------------------------------------------------------------------------------
Knowledge Discovery in Databases (KDD)?
Knowledge Discovery in Databases refers to the overall process of discovering useful, non-trivial information
or patterns from large datasets. It involves several steps, from data preparation to the extraction of patterns,
and ultimately transforming raw data into meaningful knowledge. KDD is closely related to data mining, but
while data mining focuses on applying algorithms to extract patterns, KDD encompasses the entire process of
discovery.
Main Steps of KDD:
1. Data Selection:
o Relevant data is selected from a larger dataset for analysis. This step involves identifying which
data sources or attributes will be used in the knowledge discovery process.
2. Data Preprocessing (Cleaning and Integration):
o This step involves cleaning the data to handle missing values, noise, and inconsistencies. Data
from different sources might also be integrated into a coherent dataset. Preprocessing ensures
the data is ready for analysis.
3. Data Transformation:
o The selected and cleaned data is transformed into a suitable format for mining. This could
involve normalization, aggregation, or constructing new attributes (feature engineering) to
highlight important features in the data.
4. Data Mining:
o At this stage, algorithms are applied to the transformed data to discover patterns. Various
techniques, such as classification, clustering, association rule mining, and anomaly detection,
can be employed based on the specific goal.
5. Pattern Evaluation:
o The discovered patterns are evaluated for their significance and usefulness. Not all patterns
found are interesting or meaningful, so this step involves selecting the patterns that have the
most value.
6. Knowledge Representation:
o The final step involves representing the discovered knowledge in a comprehensible way, often
through visualization, reports, or integration into decision-support systems. This step ensures
the extracted patterns can be applied to real-world problems.
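The steps above can be traced end to end in a short Python sketch. This is only a loose illustration, not a definitive KDD implementation: the toy customer data, the choice of k-means as the mining step, and the use of scikit-learn are assumptions made for the example.

```python
# A loose sketch of the KDD steps using pandas and scikit-learn.
# The toy data and the choice of k-means are assumptions for illustration.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# 1. Data selection: keep only the attributes relevant to the analysis.
data = pd.DataFrame({
    "age":    [23, 45, 31, 52, 28, 40, 60, 35],
    "income": [28_000, 72_000, 45_000, 90_000, 33_000, 65_000, 85_000, 50_000],
})

# 2. Preprocessing: handle missing values and inconsistencies (none in this toy set).
data = data.dropna()

# 3. Transformation: scale the attributes to a common range.
scaled = MinMaxScaler().fit_transform(data)

# 4. Data mining: apply a clustering algorithm to discover groups.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# 5. Pattern evaluation: score how well separated the discovered clusters are.
score = silhouette_score(scaled, model.labels_)

# 6. Knowledge representation: report the result in a readable form.
print(data.assign(cluster=model.labels_))
print(f"silhouette score: {score:.2f}")
```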
--------------------------------------------------------------------------------------------------------------------------------------------------
Data mining functionalities?
Data mining functionalities refer to the different types of tasks or operations that data mining techniques can
perform to extract patterns or insights from large datasets. The main data mining functionalities include:
1. Classification:
o Assigns data to predefined categories or classes based on attributes. This is often used in
predictive modeling where the goal is to predict the class label of new data based on past data.
o Example: Predicting whether an email is spam or not based on its content.
2. Clustering:
o Groups data into clusters of similar objects where the objects in one cluster are more similar to
each other than to those in other clusters. Unlike classification, clustering does not rely on
predefined labels.
o Example: Segmenting customers based on purchasing behavior.
3. Association Rule Mining:
o Identifies relationships or correlations between different items in a dataset. The most common
example is market basket analysis, which finds associations between products frequently
bought together.
o Example: If a customer buys bread, they are likely to also buy butter.
4. Regression:
o A predictive modeling technique used to predict a numeric value based on input data.
Regression analyzes the relationship between dependent and independent variables.
o Example: Predicting house prices based on factors like location, size, and number of rooms.
5. Anomaly Detection:
o Identifies data points that deviate significantly from the norm. These outliers can indicate fraud,
errors, or rare events.
o Example: Detecting fraudulent transactions in financial datasets.
6. Sequential Pattern Mining:
o Finds patterns or trends in data over time. It is particularly useful for discovering time-based
associations.
o Example: Identifying customer purchasing patterns that follow specific sequences, like buying a
laptop followed by a mouse and then a keyboard.
7. Summarization:
o Provides a concise description of the dataset, often through generating summaries or
aggregated views of the data.
o Example: Summarizing the key statistics of a large dataset (e.g., average age of customers).
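Two of these functionalities, classification and regression, can be illustrated with short scikit-learn snippets. The tiny datasets and feature meanings below are invented purely for the example.

```python
# Illustrative sketches of classification and regression with scikit-learn.
# The tiny datasets and feature meanings are invented for the example.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict a class label ("spam" vs "not spam") from attributes,
# here [number of links, number of ALL-CAPS words] per email.
X_mail = [[0, 1], [8, 12], [1, 0], [10, 9], [2, 1], [7, 15]]
y_mail = ["not spam", "spam", "not spam", "spam", "not spam", "spam"]
clf = DecisionTreeClassifier(random_state=0).fit(X_mail, y_mail)
print(clf.predict([[9, 11]]))      # expected: ['spam']

# Regression: predict a numeric value (house price) from [size in square meters, rooms].
X_house = [[50, 2], [80, 3], [120, 4], [150, 5]]
y_price = [100_000, 160_000, 240_000, 300_000]
reg = LinearRegression().fit(X_house, y_price)
print(reg.predict([[100, 4]]))     # a price estimate for a 100 m2, 4-room house
```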
--------------------------------------------------------------------------------------------------------------------------------------------------
Data preprocessing techniques?
Data preprocessing is a crucial step in data mining that involves cleaning, transforming, and preparing raw
data for analysis. It ensures that the data is accurate, consistent, and ready for mining. Key data preprocessing
techniques include:
1. Data Cleaning:
o Purpose: Handles missing, inconsistent, and noisy data.
o Techniques:
Filling in missing values (e.g., using mean, median, or most frequent value).
Removing or correcting errors in the data.
Smoothing noisy data using techniques like binning or clustering.
2. Data Integration:
o Purpose: Combines data from multiple sources into a unified dataset.
o Techniques:
Merging datasets from different databases, data warehouses, or files.
Resolving data inconsistencies and redundancies (e.g., schema matching).
3. Data Transformation:
o Purpose: Converts data into a suitable format for mining.
o Techniques:
Normalization: Scaling data to a specific range (e.g., 0 to 1) to improve algorithm
performance.
Aggregation: Summarizing data (e.g., monthly sales totals instead of daily).
Feature Selection/Creation: Reducing the number of features or creating new relevant
features for better analysis.
4. Data Reduction:
o Purpose: Reduces the volume of data while retaining important information.
o Techniques:
Dimensionality reduction: Using techniques like Principal Component Analysis (PCA) to
reduce the number of variables.
Data compression: Using coding schemes to reduce data size.
Sampling: Selecting a subset of data for analysis when working with large datasets.
5. Data Discretization:
o Purpose: Converts continuous data into discrete intervals or categories.
o Techniques:
Binning: Dividing data into bins or ranges.
Histogram analysis: Grouping data based on frequency distributions.
6. Data Encoding:
o Purpose: Converts categorical data into a numerical format.
o Techniques:
Label encoding: Assigning numerical values to categories.
One-hot encoding: Creating binary variables for each category.
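Several of these techniques can be combined in a short pandas/scikit-learn sketch: mean imputation (cleaning), min-max normalization (transformation), PCA (reduction), binning (discretization), and one-hot encoding. The column names and values are invented for the example.

```python
# A small sketch of common preprocessing steps with pandas and scikit-learn.
# Column names and values are invented for the example.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "age":    [25, None, 47, 35, 62],
    "income": [30_000, 52_000, None, 61_000, 80_000],
    "city":   ["Delhi", "Pune", "Delhi", "Mumbai", "Pune"],
})

# Data cleaning: fill missing numeric values with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].mean())

# Data transformation: min-max normalization of the numeric columns to [0, 1].
df[["age_norm", "income_norm"]] = MinMaxScaler().fit_transform(df[["age", "income"]])

# Data reduction: project the two normalized columns onto one principal component.
df["pc1"] = PCA(n_components=1).fit_transform(df[["age_norm", "income_norm"]]).ravel()

# Data discretization: binning age into labelled intervals.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                         labels=["young", "middle", "senior"])

# Data encoding: one-hot encoding of the categorical city column.
df = pd.get_dummies(df, columns=["city"])

print(df)
```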
--------------------------------------------------------------------------------------------------------------------------------------------------