DWM Exp1

The document outlines the concepts of data warehousing and mining, emphasizing their significance in decision-making and strategic analysis. It details the characteristics of data warehouses, such as being subject-oriented, integrated, time-variant, and non-volatile, and discusses the steps involved in data mining, including data cleaning, integration, and pattern discovery. Additionally, it highlights the applications of data warehouses in various fields, including business intelligence, operational analytics, and healthcare.


Experiment No: 1

• Aim: Study of data warehousing and data mining and their applications.


• Theory: Data Warehouse.
• Definition:
➢ A Data Warehouse is a subject-oriented, integrated, non-volatile, and time-variant
collection of data in support of management's decisions. A data warehouse can be
summarized as follows:
- Provides an integrated view of enterprise data.
- Makes current and historical data readily accessible for decision-making.
- Makes decision-support transactions possible without hindering operational systems.
- Ensures consistent and reliable organizational information.
- Presents a flexible and interactive source of strategic information.

• Need:
➢ Data warehouses are essential because they:
- Handle large volumes of data (TBs) efficiently.
- Support complex analytical queries not feasible with transactional databases.
- Integrate and centralize data for comprehensive business analysis.
- Enable organizations to derive strategic insights and make informed decisions.
• Features of Data-Warehouse:
1. Subject Oriented:

In data warehousing, data is organized and stored based on business subjects or areas (e.g.,
sales, customers, products) rather than being tied to specific operational applications. This
approach allows for a comprehensive view of each subject across the organization, integrating
data from various sources to provide a unified perspective. For example, data about orders,
customers, inventory, and transactions are structured to support comprehensive analysis and
decision-making across all related functions, rather than being siloed within individual
application datasets. This subject-oriented approach in data warehousing facilitates integrated
analysis and strategic insights across the organization.
2. Integrated Data:

For proper decision-making, you must gather all the relevant data from the various
applications. The data in the data warehouse comes from several operational systems. Source
data are in different databases, files, and data segments. These are disparate applications, so
the operational platforms and operating systems could differ. The file layouts, character code
representations, and field naming conventions could be different.

3. Time-Variant Data:

The data collected in a data warehouse is identified with a particular time period. The data in
a data warehouse provides information from the historical point of view. For an operational
system, the stored data contains the current values. In an accounts receivable system, the
balance is the current outstanding balance in the customer’s account. In an order entry
system, the status of an order is the current status of the order. In a consumer loans
application, the balance amount owed by the customer is the current amount. Of course, we
store some past transactions in operational systems, but, essentially, operational systems
reflect current information because these systems support day-to-day current operations.
4. Non-Volatile Data:
Non-volatile means that previous data is not erased when new data is added. A data
warehouse is kept separate from the operational database, so frequent changes in the
operational database are not reflected in the data warehouse. Data extracted from the various
operational systems and pertinent data obtained from outside sources are transformed,
integrated, and stored in the data warehouse. The data in the data warehouse is not intended
to run the day-to-day business. When you want to process the next order received from a
customer, you do not look into the data warehouse to find the current stock status. The
operational order entry application is meant for that purpose. In the data warehouse, you
keep the extracted stock status data as snapshots over time. You do not update the data
warehouse every time you process a single order.
5. Data Granularity:
In an operational system, data is usually kept at the lowest level of detail. In a point-of-sale
system for a grocery store, the units of sale are captured and stored at the level of units of a
product per transaction at the check-out counter. In an order entry system, the quantity
ordered is captured and stored at the level of units of a product per order received from the
customer. Whenever you need summary data, you add up the individual transactions. If you
are looking for units of a product ordered this month, you read all the orders entered for the
entire month for that product and add up. You do not usually keep summary data in an
operational system.

• Difference Between Data Warehouse and Data Mart:


• Top Down & Bottom-Up Approach:

• Practical Approach:
The steps in this practical approach are as follows:
- Plan and define requirements at the overall corporate level.
- Create a surrounding architecture for a complete warehouse.
- Conform and standardize the data content.
- Implement the data warehouse as a series of supermarts, one at a time.
• Architecture of Data Warehouse:

- Source Systems: These are the operational systems where raw data originates, such as
transactional databases, CRM systems, ERP systems, flat files, etc.
- ETL (Extract, Transform, Load): ETL tools and processes are used to extract data from
source systems, transform it into a structured format suitable for analysis (e.g., cleaning,
filtering, aggregating), and load it into the data warehouse.
- Data Warehouse: The central repository where integrated, structured, and cleaned data
from various sources is stored. It typically uses a schema optimized for querying and
analysis, such as star schema or snowflake schema.
- Data Marts: These are subsets of data warehouses that are optimized for specific
departments or functions within an organization. Data marts provide easier and faster
access to data relevant to specific user groups.
- Metadata Repository: Metadata is data about data, which describes the characteristics
of the data warehouse content, structure, and usage. A metadata repository stores this
information, helping users understand and manage the data.
- OLAP (Online Analytical Processing) Engine: OLAP tools and engines enable
multidimensional analysis of data stored in the data warehouse. They support complex
queries and facilitate interactive analysis through features like drill-down, slice-and-dice,
and pivot operations.
- Reporting and Analysis Tools: These tools provide interfaces for querying the data
warehouse, creating reports, and visualizing data. Examples include BI (Business
Intelligence) tools, dashboards, and ad-hoc query tools.
- Data Quality Tools: Tools and processes to ensure data quality by detecting and correcting
errors, maintaining consistency, and ensuring data integrity throughout the data
warehouse.
- Security and Access Control: Measures to secure data within the data warehouse,
including authentication, authorization, and encryption to protect against unauthorized
access and ensure compliance with data privacy regulations.
- Backup and Recovery: Strategies and processes for backing up data in the data warehouse
and recovering it in case of data loss or system failure.
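The ETL and star-schema components above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the CSV layout, table names (`dim_customer`, `fact_sales`), and all data values are invented for the example, and SQLite stands in for a real warehouse platform.

```python
# A minimal ETL sketch: extract rows from a (hypothetical) CSV export of an
# operational source system, transform them, and load them into a tiny
# star-schema-like SQLite "warehouse" with one dimension and one fact table.
import csv
import io
import sqlite3

# "Extract": operational source data, simulated here as an in-memory CSV file.
source = io.StringIO(
    "order_id,customer,amount\n"
    "1,asha,120.50\n"
    "2,ravi,80.00\n"
    "3,Asha,45.25\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.execute("CREATE TABLE fact_sales (order_id INTEGER, customer_id INTEGER, amount REAL)")

for row in csv.DictReader(source):
    # "Transform": standardize naming conventions and clean the values.
    name = row["customer"].strip().title()
    amount = round(float(row["amount"]), 2)
    # "Load": insert the dimension row once, then the fact row referencing it.
    conn.execute("INSERT OR IGNORE INTO dim_customer (name) VALUES (?)", (name,))
    cust_id = conn.execute(
        "SELECT customer_id FROM dim_customer WHERE name = ?", (name,)
    ).fetchone()[0]
    conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 (int(row["order_id"]), cust_id, amount))

# An analytical query joining the fact table with its dimension.
for name, total in conn.execute(
    "SELECT d.name, SUM(f.amount) FROM fact_sales f "
    "JOIN dim_customer d USING (customer_id) GROUP BY d.name ORDER BY d.name"
):
    print(name, total)  # Asha 165.75 / Ravi 80.0
```

Note how the transform step resolves the inconsistent spellings "asha" and "Asha" into one dimension row, a small instance of the integration problem described under "Integrated Data".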

• Meta Data:

Metadata is essential data that describes other data, such as its content, structure, and
context. It improves data organization, discoverability, and usability across different
applications and domains.
Examples include file details, image properties, music attributes, and web page information.
Metadata supports effective data governance, interoperability, and enhances data
preservation and visualization efforts. Its structured format enables better search engine
optimization, content management, and data integration.
• Types of Meta Data:
- Operational Metadata: Data for the data warehouse comes from several
operational systems of the enterprise. These source systems contain different data
structures. The data elements selected for the data warehouse have various field lengths
and data types. In selecting data from the source systems for the data warehouse, you
split records, combine parts of records from different source files, and deal with multiple
coding schemes and field lengths. When you deliver information to the end-users, you
must be able to tie that back to the original source data sets. Operational metadata
contain all of this information about the operational data sources.
- Extraction and Transformation Metadata: Extraction and transformation metadata
contain data about the extraction of data from the source systems, namely, the extraction
frequencies, extraction methods, and business rules for the data extraction. Also, this
category of metadata contains information about all the data transformations that take
place in the data staging area.
- End-User Metadata: The end-user metadata is the navigational map of the data
warehouse. It enables the end-users to find information from the data warehouse. The
end-user metadata allows the end-users to use their own business terminology and look
for information in those ways in which they normally think of the business.
• Applications:

- Business Intelligence (BI) and Reporting: Data warehouses centralize data from multiple
sources, enabling businesses to perform complex queries and analysis. This facilitates the
generation of insightful reports and dashboards for decision-making.
- Strategic Decision Making: By providing a unified view of data across an organization,
data warehouses support strategic decision-making processes. Executives and managers
can access timely and accurate information to formulate strategies and business plans.
- Operational Analytics: Data warehouses enable real-time or near-real-time analysis of
operational data. This helps businesses monitor performance metrics, identify trends, and
optimize operations for improved efficiency.
- Customer Analytics: Analysing customer data stored in a data warehouse allows
businesses to understand customer behaviour, preferences, and trends. This supports
targeted marketing campaigns, customer segmentation, and personalized customer
experiences.
- Financial Analysis: Finance departments use data warehouses to consolidate financial
data from various systems (e.g., ERP systems), analyse financial performance, manage
budgets, and forecast future financial trends.
- Supply Chain Management: Data warehouses help optimize supply chain operations by
integrating and analysing data related to inventory levels, supplier performance, logistics,
and demand forecasts. This improves inventory management and reduces costs.
- Regulatory Compliance and Risk Management: Data warehouses assist in regulatory
compliance by providing auditable records and ensuring data integrity. They also support
risk management initiatives by analysing historical data to identify potential risks and
trends.
- Healthcare Analytics: In healthcare, data warehouses integrate patient records, medical
histories, treatment outcomes, and operational data. This enables healthcare providers
to improve patient care, manage resources effectively, and conduct medical research.
• Theory: Data Mining.
• Definition:
➢ Data mining is the process of discovering interesting patterns and knowledge from large
amounts of data. The data sources can include databases, data warehouses, the Web, other
information repositories, or data that are streamed into the system dynamically.
• Steps required in Data Mining:
- Data cleaning (to remove noise and inconsistent data).
- Data integration (where multiple data sources may be combined).
- Data selection (where data relevant to the analysis task are retrieved from the database).
- Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations).
- Data mining (an essential process where intelligent methods are applied to extract data
patterns).
- Knowledge presentation (where visualization and knowledge representation techniques
are used to present mined knowledge to users).
• Types of Data that can be mined:
- Database Data: A database system, also called a database management system (DBMS),
consists of a collection of interrelated data, known as a database, and a set of software
programs to manage and access the data. The software programs provide mechanisms
for defining database structures and data storage; for specifying and managing
concurrent, shared, or distributed data access; and for ensuring consistency and security
of the information stored despite system crashes or attempts at unauthorized access.
- Data Warehouses data: A data warehouse is usually modelled by a multidimensional data
structure, called a data cube, in which each dimension corresponds to an attribute or a
set of attributes in the schema, and each cell stores the value of some aggregate measure
such as count or sum(sales_amount). A data cube provides a multidimensional view of data
and allows the precomputation and fast access of summarized data.
- Transactional Data: In general, each record in a transactional database captures a
transaction, such as a customer’s purchase, a flight booking, or a user’s clicks on a web
page. A transaction typically includes a unique transaction identity number (trans ID) and
a list of the items making up the transaction, such as the items purchased in the
transaction. A transactional database may have additional tables, which contain other
information related to the transactions, such as item description, information about the
salesperson or the branch, and so on.
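The data cube idea mentioned above can be sketched directly: precompute an aggregate for every combination of dimension values, where a rolled-up dimension is marked with `*`. The dimensions (region, quarter), the measure (sales), and all values are invented for the example.

```python
# A minimal data-cube sketch: facts with two dimensions (region, quarter)
# and one measure (sales). Every cube "cell" holds sum(sales), with '*'
# meaning the dimension is rolled up (aggregated over all its values).
from collections import defaultdict
from itertools import product

facts = [
    ("North", "Q1", 100), ("North", "Q2", 150),
    ("South", "Q1", 80),  ("South", "Q2", 120),
]

cube = defaultdict(int)
for region, quarter, sales in facts:
    # Each fact contributes to 4 cells: (r,q), (r,*), (*,q), (*,*).
    for r, q in product((region, "*"), (quarter, "*")):
        cube[(r, q)] += sales

print(cube[("North", "*")])  # 250: North rolled up over all quarters
print(cube[("*", "Q1")])     # 180: Q1 rolled up over all regions
print(cube[("*", "*")])      # 450: grand total
```

This precomputation is what makes summarized data fast to access: an aggregate query becomes a single lookup instead of a scan over the facts.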
• Kinds of pattern that can be mined:
- Characterization and Discrimination: Data characterization is a summarization of the
general characteristics or features of a target class of data. The data corresponding to the
user-specified class are typically collected by a query. Data discrimination is a comparison
of the general features of the target class data objects against the general features of
objects from one or multiple contrasting classes. The target and contrasting classes can
be specified by a user, and the corresponding data objects can be retrieved through
database queries.
- Mining Frequent Patterns, Associations, and Correlations: Frequent patterns, as the
name suggests, are patterns that occur frequently in data. There are many kinds of
frequent patterns, including frequent itemsets, frequent subsequences (also known as
sequential patterns), and frequent substructures. A frequent itemset typically refers to a
set of items that often appear together in a transactional data set.
- Classification and Regression for Predictive Analysis: Classification is the process of
finding a model (or function) that describes and distinguishes data classes or concepts.
The model is derived based on the analysis of a set of training data (i.e., data objects for
which the class labels are known). The model is used to predict the class label of objects
for which the class label is unknown.
- Cluster Analysis: Unlike classification and regression, which analyse class-labelled
(training) data sets, clustering analyses data objects without consulting class labels. In
many cases, class labelled data may simply not exist at the beginning.
- Outlier Analysis: A data set may contain objects that do not comply with the general
behaviour or model of the data. These data objects are outliers. Many data mining
methods discard outliers as noise or exceptions.
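The frequent-itemset mining described above reduces, at its core, to counting which item combinations appear together often enough. The sketch below counts frequent pairs over invented transactions; it shows only the support-counting step that algorithms such as Apriori build on, not a full Apriori implementation.

```python
# Toy frequent-itemset counting: find item pairs that occur together in at
# least `min_support` transactions. Transactions are invented market baskets.
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]
min_support = 2  # minimum number of transactions a pair must appear in

pair_counts = Counter()
for t in transactions:
    # Sort so each pair has a canonical order regardless of set iteration.
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)  # each of the three pairs co-occurs in 2 transactions
```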
• Technologies Used:

1. Machine Learning:
- Supervised Learning: Involves learning a function that maps an input to an output based
on example input-output pairs (e.g., decision trees, neural networks, support vector
machines).
- Unsupervised Learning: Involves finding hidden patterns or intrinsic structures in input
data without labelled responses (e.g., clustering with K-means, association rule learning
with Apriori algorithm).
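Clustering with K-means, mentioned under unsupervised learning, can be sketched in one dimension: assign each point to its nearest centroid, recompute centroids as cluster means, and repeat. The data points and initial centroids below are invented.

```python
# A minimal 1-D K-means sketch (unsupervised learning): no class labels are
# used; the two groups emerge from the data itself.
def kmeans_1d(points, centroids, iterations=10):
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if its cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [2.0, 11.0])
print(centroids)  # [2.0, 11.0]: the two natural groups in the data
```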
2. Statistics:
- Descriptive Statistics: Summarizes and describes the main features of a data set, including
measures such as mean, median, mode, and standard deviation.
- Inferential Statistics: Makes inferences and predictions about a population based on a
sample of data, using techniques like regression analysis, hypothesis testing, and ANOVA
(Analysis of Variance).
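The descriptive measures listed above are available directly in Python's standard library; the sample values here are invented.

```python
# Descriptive statistics over an invented sample, using the stdlib
# statistics module: mean, median, mode, and sample standard deviation.
import statistics

sample = [12, 15, 15, 18, 20, 22, 25]

print(statistics.mean(sample))    # 18.142857... (127/7)
print(statistics.median(sample))  # 18: the middle value of the sorted sample
print(statistics.mode(sample))    # 15: the most frequent value
print(statistics.stdev(sample))   # sample standard deviation
```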
3. Pattern Recognition:
- Techniques used to identify patterns and regularities in data. Often applied in image and
speech recognition to classify or categorize data based on learned patterns.
4. Neural Networks:
- Computational models inspired by the human brain. They consist of interconnected nodes
(neurons) and are used for complex pattern recognition tasks, such as image and speech
recognition.
5. Artificial Intelligence:
- Expert Systems: Rule-based systems that emulate human decision-making.
- Genetic Algorithms: Optimization algorithms inspired by natural selection.
• Issues of data mining:
- Mining Methodology: Researchers have been vigorously developing new data mining
methodologies. This involves the investigation of new kinds of knowledge, mining in
multidimensional space, integrating methods from other disciplines, and the
consideration of semantic ties among data objects. In addition, mining methodologies
should consider issues such as data uncertainty, noise, and incompleteness.
- User Interaction: The user plays an important role in the data mining process. Interesting
areas of research include how to interact with a data mining system, how to incorporate
a user’s background knowledge in mining, and how to visualize and comprehend data
mining results.
- Efficiency and Scalability: Data mining algorithms must be efficient and scalable in order
to effectively extract information from huge amounts of data in many data repositories or
in dynamic data streams. In other words, the running time of a data mining algorithm
must be predictable, short, and acceptable by applications. Efficiency, scalability,
performance, optimization, and the ability to execute in real time are key criteria that
drive the development of many new data mining algorithms.
- Diversity of Database Types: Diverse applications generate a wide spectrum of new data
types, from structured data such as relational and data warehouse data to semi-
structured and unstructured data; from stable data repositories to dynamic data streams;
from simple data objects to temporal data, biological sequences, sensor data, spatial data,
hypertext data, multimedia data, software program code, Web data, and social network
data.
- Data Mining and Society: With data mining penetrating our everyday lives, it is important
to study the impact of data mining on society. How can we use data mining technology to
benefit society? How can we guard against its misuse? The improper disclosure or use of
data and the potential violation of individual privacy and data protection rights are areas
of concern that need to be addressed.
• Applications:
1. Market Analysis and Management
- Customer Profiling: Identifying characteristics of customers who are likely to purchase a
particular product.
- Market Segmentation: Dividing a market into distinct subsets of customers with common
needs or characteristics.
2. Risk Management and Fraud Detection
- Credit Scoring: Assessing the creditworthiness of potential customers.
- Fraud Detection: Identifying fraudulent activities in areas like credit card transactions,
insurance claims, and telecommunications.
3. Healthcare and Medical Diagnosis
- Disease Diagnosis: Identifying patterns that lead to the diagnosis of diseases.
- Treatment Effectiveness: Evaluating the effectiveness of treatments based on patient
data.
4. Manufacturing and Production
- Quality Control: Identifying defects and ensuring product quality.
- Process Optimization: Improving manufacturing processes through data analysis.
5. Web Mining
- Web Usage Mining: Analysing web server logs to find user navigation patterns.
- Web Structure Mining: Understanding the structure of websites and their hyperlink
hierarchy.
- Web Content Mining: Extracting useful information from the content of web pages.
6. Scientific Research
- Bioinformatics: Analysing biological data, such as gene sequences.
- Astronomy: Discovering patterns and relationships in astronomical data.

• Conclusion:
The concepts of data warehousing and data mining were studied and understood successfully.
