BIG DATA II NOTES
TOPIC 1
Information can be distributed, stored and processed using modern digital technologies such as
the internet, disk drives and modern computers.
Data is a representation of something that captures some features and ignores others. A
smartphone photograph is just numbers stored in the phone; these numbers can be stored,
shared or displayed on any screen. It is a basic skill to be able to identify what is data and what
is not.
The organization of data has a major impact on how easily you can use the data to answer
questions. These questions are usually named “insights” or “motivated needs” from different
sectors.
Insight = The capacity to gain an accurate and deep understanding of someone or something.
A data store is a collection of data; for example, the service that maintains my cloud storage also
has a much larger data store of all the documents saved by all of its users.
A database is an organized data store; an example is a spreadsheet containing movie titles,
actors, etc. You can even organize photos and call that a database.
Database Management System (DBMS) = A DBMS enables you to create a database, add and
update data, and easily retrieve data based on its organization.
Identifying objects or letters represented in a digital image is a kind of classification, a type of
machine learning in increasing use today.
A classifier is a type of computer program that can take records with potentially many data
points, and can infer one or more simple categorical values that are suitable for the record. For
example, given a picture of an object, name the object. The program performs the
computational task of resolving all the pixel values to the simple label. Modern classifiers can
“learn” how to classify pictures by first being presented with a large number of pictures that
are already properly labelled.
One narrow form of computer vision is already in successful use: optical character recognition
(OCR). If you have an image and you know you are looking for letters or numeric digits, it is a
simple matter to scan the image and signal when and where they appear. The software finds
the images of digits for the check amount and interprets each digit as one of the ten printed
numerals, 0 through 9.
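As a small illustration of the classification idea above (not part of the original notes), here is a minimal Python sketch that trains a digit classifier on scikit-learn's bundled 8x8 digit images; the dataset and the choice of logistic regression are assumptions for the example, real OCR systems typically use more elaborate models.

```python
# Minimal digit-classification sketch, assuming scikit-learn is installed.
# The classifier "learns" from properly labelled pictures, then infers labels for new ones.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()  # 8x8 grey-level images of the digits 0-9, already labelled
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=2000)   # a simple stand-in for a real OCR model
clf.fit(X_train, y_train)                 # learn from the labelled training pictures

print("accuracy on unseen images:", clf.score(X_test, y_test))
print("predicted digit for the first test image:", clf.predict(X_test[:1])[0])
```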
TOPIC 2
A lot of useful data is generated in enormous amounts nowadays, and you have to store it
somewhere. The first solution is to get a node with a big capacity (a centralized node). The
second approach is to store your data on a collection of nodes (distributed nodes); this
technique is called "scale out" or horizontal scaling. The third approach is to get a larger node,
or part of a node, incrementally on demand (cloud storage); this technique is called "scale up"
or vertical scaling.
The risk of scaling out is that, on average, a node goes out of service every three years; for a
cluster of 1000 nodes that means roughly one failure every single day. To quantify the failure
rates in our system we have:
MTBF (Mean Time Between Failures) = Σ(start of downtime − start of uptime) / number of failures
MDT (Mean Time To Repair) = Σ(start of uptime − start of downtime) / number of failures
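A small Python sketch of these two formulas, using made-up timestamps (in hours) purely for illustration:

```python
# Sketch: MTBF and MDT from the formulas above, with hypothetical timestamps in hours.
# up_starts[i]   = hour at which uptime period i started
# down_starts[i] = hour at which that uptime ended (the node failed)
up_starts   = [0, 110, 230]      # node came (back) up at these hours
down_starts = [100, 220, 345]    # node failed at these hours

n_failures = len(down_starts)

# MTBF = sum(start of downtime - start of uptime) / number of failures
mtbf = sum(d - u for u, d in zip(up_starts, down_starts)) / n_failures

# MDT = sum(start of uptime - start of downtime) / number of failures,
# pairing each failure with the repair (next uptime) that follows it
repairs = up_starts[1:]          # uptimes that follow a failure
mdt = sum(u - d for d, u in zip(down_starts, repairs)) / len(repairs)

print(f"MTBF = {mtbf:.1f} h, MDT = {mdt:.1f} h")
```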
TOPIC 3
Decision making hierarchy:
        - Lower-level management deals with short-term problems related to individual
        transactions. Periodic summaries of operational databases and exception reports assist
        operational management.
        - Middle management relies on summarized data that are integrated across
        operational databases. Middle management may want to integrate data across
        different departments, manufacturing plants, and retail stores.
        - Top management relies on the result of middle management analysis and external
        data sources. Top management needs to integrate data so that customers, products,
        suppliers, and other important entities can be tracked across the entire organization.
These needs ran into limitations that were a combination of inadequate database technology
and deployment limitations:
    ˗   Missing features for summary data
    ˗   Performance limitations
    ˗   Lack of integration
Data Warehouse
A data warehouse is an essential part of the infrastructure for business intelligence: a
centralized repository for decision making. Characteristics:
    ˗   Subject-oriented: Organized around business entities rather than business processes.
    ˗   Integrated: Many transformations to unify source data from independent data sources.
    ˗   Time-variant: Historical data, snapshots of business processes captured at different points in
        time.
    ˗   Non-volatile: new data are appended periodically; existing data is not changed; warehouse data
        may be archived after its usefulness declines
Transaction processing vs Business intelligence processing:
    ˗   Transaction processing: uses primary data from transactions and is involved in daily
        operations and short-term decisions.
    ˗   Business intelligence processing: uses transformed secondary data and is involved in
        long-term decisions.
Operational database vs Data Warehouse
Schema patterns are more complex for operational databases:
    ˗   An order contains a header (order) and detail lines.
    ˗   A star schema, with dimension tables and a fact table, is designed for business intelligence
        reporting; sales are flattened to just the order details.
Data integration: data is moved from the operational databases to the data warehouse after
completion of transactions, along with a substantial amount of transformation (standardization,
integration, consistency checking, completeness checking, etc.).
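To make those transformation steps concrete, here is a toy pandas sketch (not from the notes); all column names, codes and rules are hypothetical:

```python
# Toy ETL sketch: standardization, completeness and consistency checks before loading.
import pandas as pd

# Hypothetical extract from one operational source.
orders = pd.DataFrame({
    "customer": ["ACME corp ", "acme Corp", None],
    "country":  ["us", "USA", "ES"],
    "amount":   [100.0, -5.0, 250.0],
})

# Standardization: unify formats coming from independent operational sources.
orders["customer"] = orders["customer"].str.strip().str.upper()
orders["country"]  = orders["country"].str.upper().replace({"US": "USA"})

# Completeness checking: rows missing a key attribute are set aside for review.
rejected = orders[orders["customer"].isna()]
clean = orders.dropna(subset=["customer"])

# Consistency checking: only non-negative amounts are loaded into the warehouse.
clean = clean[clean["amount"] >= 0]
print(clean)       # rows ready to load
print(rejected)    # rows routed back for correction
```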
Challenges in data warehouse projects:
    ˗   Substantial coordination across organizational units
    ˗   Uncertain data quality in data sources
    ˗   Difficult to scale data warehouse
Architecture choices:
Top down architecture: The three-tier architecture is sometimes augmented with a staging
area to support the data transformation process. The staging area provides temporary storage
of transformed data before loading into the data warehouse. The staging area is particularly
useful for data warehouses with a large number of operational databases and external data
sources requiring complex data transformations. This approach is also known as the enterprise
data warehouse (EDW) architecture.
Bottom up architecture: Also known as independent data mart architecture. No centralized
data warehouse. Data marts are small data warehouses oriented towards individual user
departments. Easier to cost justify than a larger data warehouse. Data marts may eventually
evolve into a data warehouse. Data marts may cooperate to provide data to other data marts:
data mart bus approach. It is a controversial architecture because many people claim that the
long term benefits are lost with a bottom-up approach so data marts must be re-implemented
as centralized data warehouse.
Federated architecture: For highly decentralized or independent organizations, the federated
data warehouse architecture provides another compromise approach. The federated data
warehouse approach supports two levels of data warehouses.
Each organization independently maintains one or more data warehouses using any of the
architectures. To provide inter-organizational sharing, each organization contributes to the
federated data warehouse. Typically, another layer of data integration and a query portal
support data sharing in the federated data warehouse. Depending on the environment,
participation can be voluntary or compulsory.
Maturity model
Stages:
The maturity model consists of six stages. The stages provide a framework to view an
organization's progress, not an absolute metric, as organizations may demonstrate aspects of
multiple stages at the same time. As organizations move from lower to more advanced stages,
increased business value can occur. However, organizations may have difficulty justifying
significant new data warehouse investments, as benefits are sometimes difficult to quantify.
Insights:
An important insight of the maturity model is the difficulty of moving between certain stages.
For small but growing organizations, moving from the infant to the child stages can be difficult
because a significant investment in data warehouse technology is necessary. For large
organizations, the struggle is to move from the teenager to the adult stage. To make the
transition, upper management must perceive the data warehouse as a vital enterprise
resource, not just a tool provided by the information technology department.
TOPIC 4
Data Cube Basics:
General: Business analysts think about data in a multidimensional arrangement, like an
influence diagram: a narrow range of factors, with a focus on one or more quantitative variables.
Terminology:
- Dimension: label of a row or column (can have more than 3 dimensions)
- Member: value of a dimension
- Measure: quantitative data stored in cells; a cell can hold more than one measure.
Hierarchies: members can have sub-members.
Sparsity: Many cells are typically empty when dimensions are related. A major problem when
storing data cubes is compressing this unused space.
Measure aggregation properties
Additive:
    ˗   Summarized by addition across all dimensions
    ˗   Common measures such as sales, costs and profit
Semi-additive:
    ˗   Summarized by addition in some but not all dimensions such as time.
    ˗   Periodic measurements such as account balances and inventory levels.
Non-additive:
    ˗   Cannot be summarized by addition through any dimension
    ˗   Historical facts such as unit price for a sale
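A small pandas illustration of the difference between additive and semi-additive measures (not from the notes; the numbers and column names are made up):

```python
# Sketch: additive vs semi-additive measures.
import pandas as pd

df = pd.DataFrame({
    "month":     ["Jan", "Jan", "Feb", "Feb"],
    "store":     ["A", "B", "A", "B"],
    "sales":     [100, 150, 120, 130],   # additive measure
    "inventory": [40, 60, 35, 55],       # semi-additive: a month-end snapshot
})

# Additive: sales can be summed across every dimension (store AND time).
print(df["sales"].sum())                        # 500

# Semi-additive: inventory can be summed across stores within one month...
print(df.groupby("month")["inventory"].sum())   # Jan 100, Feb 90
# ...but summing it across months is meaningless; use e.g. the average instead.
print(df.groupby("store")["inventory"].mean())  # average level per store
```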
Slice operator: Replace dimension with a specific value
Slice summarize variation: Replace a dimension with a summary of its values across all
members
Dice operator: Replace a dimension with a subset of values, often follows a slice operation.
Navigation operators: they are for hierarchical dimensions, three types:
            ˗    Drill-down: add detail to a dimension
            ˗    Roll-up: Remove detail from a dimension
            ˗    Distribute or recalculate measure values
Pivot operator: The pivot operator supports rearrangement of the dimensions in a data cube.
The pivot operator allows a data cube to be presented in the most appealing visual order.
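The operators above can be mimicked on a small table in pandas (a rough stand-in for a real OLAP engine; the data and column names are invented for illustration):

```python
# Sketch of slice, dice, roll-up and pivot on a tiny "cube" held in pandas.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2003, 2003, 2003, 2004, 2004, 2004],
    "product": ["Classic Cars", "Trains", "Classic Cars", "Classic Cars", "Trains", "Trains"],
    "country": ["USA", "USA", "Spain", "USA", "USA", "Spain"],
    "amount":  [100, 40, 70, 120, 50, 30],
})

# Slice: replace the product dimension with one specific member.
classic = sales[sales["product"] == "Classic Cars"]

# Dice: replace a dimension with a subset of its members.
diced = sales[sales["country"].isin(["USA"])]

# Roll-up: remove detail (drop the country level by summing over it).
rolled_up = sales.groupby(["year", "product"])["amount"].sum()
# Drill-down would go the other way, adding the country level back in.

# Pivot: rearrange which dimensions appear on rows and columns.
pivoted = sales.pivot_table(values="amount", index="product", columns="year", aggfunc="sum")

print(classic, diced, rolled_up, pivoted, sep="\n\n")
```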
TOPIC 5
MDX was first introduced as part of the OLE DB for OLAP specification in 1997 from Microsoft.
It was invented by a group of SQL Server engineers. The specification was quickly followed by
commercial release of Microsoft Analysis Services. The latest version of the OLE DB for OLAP
specification was issued by Microsoft in 1999. In Analysis Services 2005, Microsoft added some
extensions, most notably subselects. This new variation is sometimes referred to as MDX 2005.
mdXML was specified as the query language in 2001 as part of the XML for Analysis (XMLA)
specification; the MDX statement is enclosed in the <Statement> tag.
MDX became the basis of many Microsoft products and features such as SQL Server Analysis
Services and Excel pivot tables. MDX was also adopted by a wide range of vendors, both
commercial and open source.
MDX terminology:
Tuple: Combination of members. Identifies a cell.
Measures: Numeric values in cells; same definition as already given.
Dimensions: Main business concepts. Measures can be aggregated but dimensions cannot be
aggregated.
Axis: Dimension selected in a query.
Slicer: Combination of dimension members (the WHERE restriction).
Fundamental difference: while an SQL SELECT generates a two-dimensional table, an MDX
SELECT generates an m-dimensional cube.
SELECT: axes of the result cube; each axis can list a subset of a dimension's member values.
WHERE: slicer for the result cube cells; its dimensions must be different from the axis dimensions.
FROM: the data source (cube).
Example: the time dimension is on the rows, with 2003 and 2004 as the year members; Sales and
Quantity are on the columns. The numbers inside the cells are the measures. Each cell shows
sales and quantity for Classic Cars, because the WHERE clause restricts the cube calculation to
Classic Cars. Dimensions in the WHERE clause must be different from those in the SELECT
clause; the WHERE condition is known as a slicer condition. If no measure appears in the
SELECT list, the default measure is shown in the cells.
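The query described above could look roughly like the MDX shown in the comment of the sketch below; the cube and member names are invented, and the pandas code merely reproduces the same computation on a flat table, it is not MDX itself:

```python
# Hypothetical MDX for the result described in the notes (all names are made up):
#   SELECT { [Measures].[Sales], [Measures].[Quantity] } ON COLUMNS,
#          { [Time].[2003], [Time].[2004] }              ON ROWS
#   FROM   [SalesCube]
#   WHERE  ( [Product].[Classic Cars] )                  -- the slicer
#
# Equivalent computation over a flat table in pandas:
import pandas as pd

facts = pd.DataFrame({
    "year":     [2003, 2003, 2004, 2004],
    "product":  ["Classic Cars", "Trains", "Classic Cars", "Trains"],
    "sales":    [1500, 400, 1800, 500],
    "quantity": [30, 10, 35, 12],
})

sliced = facts[facts["product"] == "Classic Cars"]            # WHERE (slicer)
result = sliced.groupby("year")[["sales", "quantity"]].sum()  # rows = years, columns = measures
print(result)
```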
TOPIC 6
Motivation for table design
    ˗   Lack of scalability and integration of data cube storage engines.
    ˗   Dominance of relational model and products
    ˗   Large amount of research and development on relational database features for data
        warehouses
    ˗   Predominant usage of relational databases for large data warehouses.
Dimension table: Stores the values of one dimension. Often not normalized.
Fact table: Stores measure values. Often contains multiple foreign keys. Types of fact table:
Transaction: the most common type; usually uses additive measures.
Snapshot: a periodic or accumulating view of asset levels; uses semi-additive measures.
Factless: records event occurrences; it has no measures, just foreign keys.
Grain: the most detailed value stored in a fact table. The detail of the dimensions determines the
grain, such as the individual customer. It determines the size of the data warehouse: the product
of the dimension cardinalities times the sparsity. Grain too small: a large data warehouse and
increased computation time. Grain too large: queries on more detailed dimension values cannot
be answered.
Table design Patterns
Star Schema: One fact table in the centre, multiple dimension tables around. Represents one
data cube. DW may contain many star schemas.
Constellation schema: Multiple fact tables. Fact tables share dimension tables. Relationship
diagram looks like a constellation.
Snowflake schema: Multiple levels of dimension tables. Use when dimension tables are small:
little performance gain by denormalizing. Relationship diagram looks like a snowflake.
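As a minimal sketch of a star schema (not from the notes), here is a tiny example using Python's standard-library sqlite3 module; the table and column names are invented for illustration:

```python
# Tiny star schema: one fact table with foreign keys to two dimension tables.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, line TEXT);
CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    time_id    INTEGER REFERENCES dim_time(time_id),
    sales      REAL,          -- additive measure
    quantity   INTEGER
);
INSERT INTO dim_product VALUES (1, '1952 Alpine', 'Classic Cars'), (2, 'Express', 'Trains');
INSERT INTO dim_time    VALUES (1, 2003, 1), (2, 2004, 1);
INSERT INTO fact_sales  VALUES (1, 1, 1500, 30), (2, 1, 400, 10), (1, 2, 1800, 35);
""")

# Roll-up: join the fact table to its dimensions and sum the measure by year and product line.
for row in con.execute("""
    SELECT t.year, p.line, SUM(f.sales)
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    JOIN dim_time    t USING (time_id)
    GROUP BY t.year, p.line
"""):
    print(row)
```

A constellation schema would simply add further fact tables sharing dim_product and dim_time; a snowflake schema would normalize dim_product into additional level tables.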
Time representation for fact tables:
Alternatives: a timestamp column, or a time dimension table for organization-specific calendar
features. Variations: time-of-day columns; an accumulating fact table for representing multiple events.
TOPIC 7
HADOOP BASICS:
Hadoop is an Apache open-source software framework for storage and large-scale processing
of data sets on clusters of commodity hardware. It is licensed under the Apache license and it is
open source.
It provides a shared and integrated foundation onto which we can bring additional tools and
build up the framework. Scalability is at the core of a Hadoop system: we can distribute and
scale out very easily in a very cost-effective manner. All of the modules in Hadoop are designed
with the fundamental assumption that hardware fails; every machine fails at some point in time.
These failures are so common that we have to account for them ahead of time. A new thing
about Hadoop is that we can keep all the data we have and analyse it in new and interesting
ways, reading the data and creating the schema as we read it (schema-on-read). We can bring
more data into simple algorithms, which has shown that, with more granularity, you can often
achieve better results than by taking a small amount of data and running some really complex
analytics on it.
HADOOP FRAMEWORK:
There are four basic components to the framework:
Hadoop common: Contains libraries and utilities needed by other Hadoop modules
Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity
machines, providing very high aggregate bandwidth across the entire cluster.
Hadoop YARN: a resource management platform responsible for managing compute resources
in the cluster and using them to schedule users' applications.
Hadoop MapReduce: a programming model for processing data at scale across a lot of different
processes.
Apart from these four elements there is a newer one, related to cloud environments, called
Ozone (a distributed object store).
The Hadoop framework itself is mostly written in the Java programming language, with some
native code in C and command-line utilities written as shell scripts.
The two major pieces of the Hadoop framework are the Hadoop Distributed File System and
MapReduce; both are open source and both were inspired by technologies developed at Google.
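To make the MapReduce programming model concrete, here is a pure-Python sketch (not from the notes) that simulates the map, shuffle and reduce phases of a word count; a real job would run distributed across the cluster, this only illustrates the model:

```python
# Pure-Python simulation of the MapReduce programming model (word count).
from collections import defaultdict

def mapper(line):
    # Map phase: emit (key, value) pairs for each input record.
    for word in line.lower().split():
        yield word, 1

def reducer(key, values):
    # Reduce phase: combine all values that share the same key.
    return key, sum(values)

documents = ["Hadoop stores data on commodity hardware",
             "Hadoop processes data with MapReduce"]

# Shuffle phase: group every emitted value by its key before reducing.
groups = defaultdict(list)
for line in documents:
    for key, value in mapper(line):
        groups[key].append(value)

counts = dict(reducer(k, v) for k, v in groups.items())
print(counts["hadoop"], counts["data"])   # 2 2
```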
HADOOP DISTRIBUTED FILE SYSTEM:
It is a distributed, scalable and portable file system written in Java to support the Hadoop
framework. Each Hadoop instance typically has a single NameNode and a cluster of DataNodes
that form the HDFS cluster.
HDFS also includes a so-called secondary NameNode, a name that misleads some people into
thinking that when the primary NameNode goes offline, the secondary will take over. In fact,
the secondary NameNode regularly connects to the primary NameNode and builds snapshots
of the primary NameNode's directory information, which the system then saves to local and
remote directories. Above almost every Hadoop-based system sits some version of a MapReduce
engine. MapReduce has different ways to submit jobs and track the jobs we have submitted. The
typical MapReduce engine consists of a JobTracker, to which client applications can submit
MapReduce jobs, and this JobTracker typically pushes work out to the available TaskTrackers
in the cluster.
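Coming back to HDFS itself, the standard way to interact with it is the hdfs dfs command-line utility; as a small illustration (not from the notes), it can also be driven from Python, assuming a configured Hadoop client is on the PATH and using hypothetical paths:

```python
# Sketch: driving the standard `hdfs dfs` command-line utility from Python.
# Assumes a configured Hadoop client on the PATH; the paths below are hypothetical.
import subprocess

def hdfs(*args):
    # Run an HDFS shell command and return its standard output as text.
    return subprocess.run(["hdfs", "dfs", *args],
                          capture_output=True, text=True, check=True).stdout

hdfs("-mkdir", "-p", "/user/demo")                    # create a directory in HDFS
hdfs("-put", "-f", "local_file.txt", "/user/demo/")   # copy a local file into it
print(hdfs("-ls", "/user/demo"))                      # list the directory contents
```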
Apache Hadoop YARN is another sub-project of Hadoop and part of the Apache Software
Foundation; it was introduced in Hadoop 2.0. It basically separates the resource management
and the processing components. YARN enhances the power of a Hadoop compute cluster
without being limited to the MapReduce style of framework. Its scalability is great: the
processing power of data centres continues to grow quickly, and because the YARN
ResourceManager focuses exclusively on scheduling, it can manage those very large clusters
quite quickly and easily. YARN is completely compatible with MapReduce: existing MapReduce
applications and end users can run on top of YARN without disrupting any of their existing
processes. It also has improved cluster utilization: there are no named map and reduce slots, so
it helps us utilize the cluster in better ways. It supports workflows other than just MapReduce,
as said earlier; we are not stuck with mapping and reducing. Now we can bring in additional
programming models, such as graph processing or iterative modelling, and it is possible to
process the data in place. This is especially useful when we talk about machine learning
applications.
YARN allows multiple access engines, either open source or proprietary, to use Hadoop as a
common standard for batch or interactive processing, and even real-time engines that can
simultaneously work on a lot of different data, so you can put streaming kinds of applications
on top of YARN inside a Hadoop architecture and seamlessly work and communicate between
these environments.
HADOOP ZOO:
We have the original Google stack, as some people refer to it. What they did was start with the
Google File System: they thought it would be a great idea to distribute a large amount of pretty
cheap storage, put a lot of data on there, and come up with a framework that would allow them
to process all that data. So they had their original MapReduce, and they were storing and
processing large amounts of data.
Then they said, well, that was really great, but we would really like to be able to access that
data in a SQL-like language. So they built a SQL gateway to adjust the data in the MapReduce
cluster and be able to query some of that data as well.
Then they realized they needed a high-level specific language to access MapReduce in the
cluster and submit some of those jobs. So Sawzall came along.
Then Evenflow came along and allowed them to chain together complex workflows and
coordinate events and services across this kind of framework, or the specific cluster they had at
the time.
Then Dremel came along. Dremel was a columnar storage and metadata manager that allowed
them to manage the data. Then, of course, you needed something to coordinate all of this. So
Chubby came along as a coordination system that would manage all of the products in this one
unit, or one ecosystem, that could process all these large amounts of structured data seamlessly.
So you can see that there is a pattern that emerges across all the stacks that different
organizations use. And now we come down to Cloudera's distribution for Hadoop, and we can
see which pieces of the system Cloudera has:
We have Sqoop and Flume for ingestion.
We use HBase for the common store.
We have Oozie as the coordination and workflow engine.
We use Pig and Hive as high-level languages for querying some of the data.
And we use ZooKeeper as a coordination service at the bottom of the stack.
HADOOP ECOSYSTEM:
Let's talk about the different tools that sit on top of the Hadoop framework. With technology
evolution, it is now possible to manage immense volumes of data that we previously could only
handle with supercomputers. Prices of systems have dropped, and as a result new techniques
for distributed computing have come along. Companies like Yahoo, Google and Facebook came
to the realization that they needed to do something to monetize the massive amounts of data
they were collecting. So this is where all of these applications have evolved over time, as we
have seen through numerous such organizations.
So let's talk about Apache Sqoop. Sqoop stands for SQL to Hadoop. It is a straightforward
command-line tool with several capabilities. It lets us import individual tables or entire
databases into our HDFS system, and it generates Java classes that allow us to interact with the
data we imported. It also provides the ability to import SQL databases straight into the Hive
data warehouse, which sits within the HDFS system.
Next up is HBase. HBase is a key component of the Hadoop stack, as its design caters to
applications that require really fast random access to significant data sets. As mentioned
earlier, HBase is based on Google's BigTable, and it can handle massive data tables combining
billions of rows and millions of columns.
Pig is a scripting language; it is really a high-level platform for creating MapReduce programs
using Hadoop. Its language is called Pig Latin, and it excels at describing data analysis problems
as data flows. Through the user-defined function (UDF) facilities in Pig, you can actually run
code written in many different languages, like JRuby, Jython, and Java, and you can execute Pig
scripts in other languages. So the result is that you can use Pig as a component to build much
larger, much more complex applications that can tackle some real business problems.
Pig can ingest data from files, streams, or any other source using UDFs, user-defined functions
that we can write ourselves. When it has all the data it can select, iterate and perform many
kinds of transformations over that data. Again, the UDF feature allows passing the data to more
complex algorithms for more complex transformations, and it can take all of these
transformations and store the results back on the distributed file system.
The Apache Hive data warehouse software facilitates querying and managing large datasets
residing in our distributed file storage. It provides a mechanism to project structure on top of
all of this data and allows us to use SQL-like queries to access the data we have stored in this
data warehouse. This query language is called HiveQL. At the same time, this language also
allows traditional MapReduce programmers to plug in their custom mappers and reducers
when it is inconvenient, too complex or really inefficient to express the desired logic in HiveQL.
Oozie is a workflow scheduler system that manages all of our Apache Hadoop jobs. Oozie
workflow jobs are what we call DAGs, Directed Acyclic Graphs, of actions. Oozie coordinator
jobs are recurrent Oozie workflow jobs that are triggered by frequency or data availability. It is
integrated with the rest of the Hadoop stack, supporting several different Hadoop jobs right out
of the box: you can bring in Java MapReduce, you can bring in streaming MapReduce, and you
can run Pig, Hive, Sqoop and many other specific jobs on the system itself. It is a very scalable,
reliable and quite extensible system.
Well, we have a large zoo of crazy wild animals and we have got to keep them somehow
organized. That is kind of what ZooKeeper does: Apache ZooKeeper provides operational
services for the Hadoop cluster. It provides a distributed configuration service and a
synchronization service, so it can synchronize all these jobs, and a naming registry for the
entire distributed system. Distributed applications use ZooKeeper to store and mediate updates
to important configuration information on the cluster itself.
Flume is a distributed, reliable and available service for efficiently collecting, aggregating
and moving large amounts of data. It has a simple and very flexible architecture based on
streaming data flows. It is quite robust and fault tolerant, and it is really tunable to enhance the
reliability mechanisms: failover, recovery, and all the other mechanisms that keep the cluster
safe and reliable. It uses a simple, extensible data model that allows us to apply all kinds of
online analytic applications.
An additional component we haven't mentioned yet is Spark. Although Hadoop captures the
most attention for distributed data analytics, there are now a number of alternatives that
provide some interesting advantages over the traditional Hadoop platform, and Spark is one of
them. Spark is a scalable data analytics platform that incorporates primitives for in-memory
computing and therefore offers performance advantages over traditional Hadoop's cluster
storage approach. It is implemented in, and supports, the Scala language, and provides a
unique environment for data processing. Spark is really great for more complex kinds of
analytics, and it is great at supporting machine learning libraries.