BIG DATA II NOTES
TOPIC 1
Information can be distributed, stored and processed using modern digital technologies such as
the internet, disk drives and modern computers.
Data is a representation of something that captures some features and ignores others. A
smartphone photograph is just numbers stored in the phone; these numbers can be stored,
shared or displayed on any screen. It is a basic skill to be able to identify what is data and what
is not.
The organization of data has a major impact on how easily you can use the data to answer
questions. These questions are usually named “insights” or “motivated needs” from different
sectors.
Insight = The capacity to gain an accurate and deep understanding of someone or something.
A data store is a collection of data; for example, the service that maintains my cloud storage also
has a much larger data store of all the documents saved by all of its users.
A database is an organized data store; an example is a spreadsheet containing movie titles,
actors, etc. You can even organize photos and call that a database.
Database Management System (DBMS) = A DBMS enables you to create a database, add and
update data, and easily retrieve data based on its organization.
Identifying objects or letters represented in a digital image is a kind of classification, a type of
machine learning in increasing use today.
A classifier is a type of computer program that can take records with potentially many data
points, and can infer one or more simple categorical values that are suitable for the record. For
example, given a picture of an object, name the object. The program performs the
computational task of resolving all the pixel values to the simple label. Modern classifiers can
“learn” how to classify pictures by first being presented with a large number of pictures that
are already properly labelled.
One narrow form of computer vision is already in successful use: optical character recognition
(OCR). If you have an image and you know you are looking for letters or numeric digits, it is a
simple matter to scan the image and signal when and where they appear. The software finds
the images of digits for the check amount and interprets each digit as one of the ten printed
numerals, 0 through 9.
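As a small illustration of the classification idea above (not part of the original notes), here is a minimal Python sketch that trains a digit classifier on scikit-learn's bundled 8x8 digit images; the dataset and the choice of logistic regression are assumptions for the example, real OCR systems typically use more elaborate models.

```python
# Minimal digit-classification sketch, assuming scikit-learn is installed.
# The classifier "learns" from properly labelled pictures, then infers labels for new ones.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()  # 8x8 grey-level images of the digits 0-9, already labelled
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=2000)   # a simple stand-in for a real OCR model
clf.fit(X_train, y_train)                 # learn from the labelled training pictures

print("accuracy on unseen images:", clf.score(X_test, y_test))
print("predicted digit for the first test image:", clf.predict(X_test[:1])[0])
```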
TOPIC 2
A lot of useful data is generated in enormous amounts nowadays, and you have to store it
somewhere. The first solution is to get a node with a big capacity (a centralized node). The
second approach is to store your data on a collection of nodes (distributed nodes); this
technique is called "scale out" or horizontal scaling. The third approach is to get a larger node,
or part of a node, incrementally on demand (cloud storage); this technique is called "scale up"
or vertical scaling.
The risk of scaling out is that, on average, a node goes out of service every three years; for a
cluster of 1000 nodes that means roughly one failure every single day. To quantify the failure
rates in our system we have:
MTBF (Mean Time Between Failures) = Σ(start of downtime − start of uptime) / number of failures
MDT (Mean Time To Repair) = Σ(start of uptime − start of downtime) / number of failures
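A small Python sketch of these two formulas, using made-up timestamps (in hours) purely for illustration:

```python
# Sketch: MTBF and MDT from the formulas above, with hypothetical timestamps in hours.
# up_starts[i]   = hour at which uptime period i started
# down_starts[i] = hour at which that uptime ended (the node failed)
up_starts   = [0, 110, 230]      # node came (back) up at these hours
down_starts = [100, 220, 345]    # node failed at these hours

n_failures = len(down_starts)

# MTBF = sum(start of downtime - start of uptime) / number of failures
mtbf = sum(d - u for u, d in zip(up_starts, down_starts)) / n_failures

# MDT = sum(start of uptime - start of downtime) / number of failures,
# pairing each failure with the repair (next uptime) that follows it
repairs = up_starts[1:]          # uptimes that follow a failure
mdt = sum(u - d for d, u in zip(down_starts, repairs)) / len(repairs)

print(f"MTBF = {mtbf:.1f} h, MDT = {mdt:.1f} h")
```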
TOPIC 3
Decision making hierarchy:
        - Lower-level management deals with short-term problems related to individual
        transactions. Periodic summaries of operational databases and exception reports assist
        operational management.
        - Middle management relies on summarized data that are integrated across
        operational databases. Middle management may want to integrate data across
        different departments, manufacturing plants, and retail stores.
        - Top management relies on the result of middle management analysis and external
        data sources. Top management needs to integrate data so that customers, products,
        suppliers, and other important entities can be tracked across the entire organization.
These needs ran into limitations that were a combination of inadequate database technology
and deployment limitations:
    ˗   Missing features for summary data
    ˗   Performance limitations
    ˗   Lack of integration
Data Warehouse
A data warehouse is an essential part of the infrastructure for business intelligence: a
centralized repository for decision making. Characteristics:
    ˗   Subject-oriented: Organized around business entities rather than business processes.
    ˗   Integrated: Many transformations to unify source data from independent data sources.
    ˗   Time-variant: Historical data, snapshots of business processes captured at different points in
        time.
    ˗   Non-volatile: new data are appended periodically; existing data is not changed; warehouse data
        may be archived after its usefulness declines
Transaction processing vs Business intelligence processing:
    ˗   Transaction processing: uses primary data from transactions and is involved in daily
        operations and short-term decisions.
    ˗   Business intelligence processing: uses transformed secondary data and is involved in
        long-term decisions.
Operational database vs Data Warehouse
Schema patterns are more complex for operational databases:
    ˗   An order contains a header (order) and detail lines.
    ˗   A star schema, with dimension tables and a fact table, is designed for business intelligence
        reporting; sales are flattened to just the order details.
Data integration: data is moved from the operational databases to the data warehouse after
completion of transactions, along with a substantial amount of transformation (standardization,
integration, consistency checking, completeness checking, etc.).
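To make those transformation steps concrete, here is a toy pandas sketch (not from the notes); all column names, codes and rules are hypothetical:

```python
# Toy ETL sketch: standardization, completeness and consistency checks before loading.
import pandas as pd

# Hypothetical extract from one operational source.
orders = pd.DataFrame({
    "customer": ["ACME corp ", "acme Corp", None],
    "country":  ["us", "USA", "ES"],
    "amount":   [100.0, -5.0, 250.0],
})

# Standardization: unify formats coming from independent operational sources.
orders["customer"] = orders["customer"].str.strip().str.upper()
orders["country"]  = orders["country"].str.upper().replace({"US": "USA"})

# Completeness checking: rows missing a key attribute are set aside for review.
rejected = orders[orders["customer"].isna()]
clean = orders.dropna(subset=["customer"])

# Consistency checking: only non-negative amounts are loaded into the warehouse.
clean = clean[clean["amount"] >= 0]
print(clean)       # rows ready to load
print(rejected)    # rows routed back for correction
```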
Challenges in data warehouse projects:
    ˗   Substantial coordination across organizational units
    ˗   Uncertain data quality in data sources
    ˗   Difficult to scale data warehouse
Architecture choices:
Top down architecture: The three-tier architecture is sometimes augmented with a staging
area to support the data transformation process. The staging area provides temporary storage
of transformed data before loading into the data warehouse. The staging area is particularly
useful for data warehouses with a large number of operational databases and external data
sources requiring complex data transformations. This approach is also known as the enterprise
data warehouse (EDW) architecture.
Bottom up architecture: Also known as independent data mart architecture. No centralized
data warehouse. Data marts are small data warehouses oriented towards individual user
departments. Easier to cost justify than a larger data warehouse. Data marts may eventually
evolve into a data warehouse. Data marts may cooperate to provide data to other data marts:
data mart bus approach. It is a controversial architecture because many people claim that the
long term benefits are lost with a bottom-up approach so data marts must be re-implemented
as centralized data warehouse.
Federated architecture: For highly decentralized or independent organizations, the federated
data warehouse architecture provides another compromise approach. The federated data
warehouse approach supports two levels of data warehouses.
Each organization independently maintains one or more data warehouses using any of the
architectures. To provide inter-organizational sharing, each organization contributes to the
federated data warehouse. Typically, another layer of data integration and a query portal
support data sharing in the federated data warehouse. Depending on the environment,
participation can be voluntary or compulsory.
Maturity model
Stages:
The maturity model consists of six stages. The stages provide a framework to view an
organization's progress, not an absolute metric, as organizations may demonstrate aspects of
multiple stages at the same time. As organizations move from lower to more advanced stages,
increased business value can occur. However, organizations may have difficulty justifying
significant new data warehouse investments, as benefits are sometimes difficult to quantify.
Insights:
An important insight of the maturity model is the difficulty of moving between certain stages.
For small but growing organizations, moving from the infant to the child stages can be difficult
because a significant investment in data warehouse technology is necessary. For large
organizations, the struggle is to move from the teenager to the adult stage. To make the
transition, upper management must perceive the data warehouse as a vital enterprise
resource, not just a tool provided by the information technology department.
TOPIC 4
Data Cube Basics:
General: Business analysts think about data in a multidimensional arrangement, like an
influence diagram: a narrow range of factors, with a focus on one or more quantitative variables.
Terminology:
- Dimension: label of a row or column (can have more than 3 dimensions)
- Member: value of a dimension
- Measure: quantitative data stored in cells; a cell can hold more than one measure.
Hierarchies: members can have sub-members.
Sparsity: Many cells are typically empty when dimensions are related. A major problem when
storing data cubes is compressing this unused space.
Measure aggregation properties
Additive:
    ˗   Summarized by addition across all dimensions
    ˗   Common measures such as sales, costs and profit
Semi-additive:
    ˗   Summarized by addition in some but not all dimensions such as time.
    ˗   Periodic measurements such as account balances and inventory levels.
Non-additive:
    ˗   Cannot be summarized by addition through any dimension
    ˗   Historical facts such as unit price for a sale
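A small pandas illustration of the difference between additive and semi-additive measures (not from the notes; the numbers and column names are made up):

```python
# Sketch: additive vs semi-additive measures.
import pandas as pd

df = pd.DataFrame({
    "month":     ["Jan", "Jan", "Feb", "Feb"],
    "store":     ["A", "B", "A", "B"],
    "sales":     [100, 150, 120, 130],   # additive measure
    "inventory": [40, 60, 35, 55],       # semi-additive: a month-end snapshot
})

# Additive: sales can be summed across every dimension (store AND time).
print(df["sales"].sum())                        # 500

# Semi-additive: inventory can be summed across stores within one month...
print(df.groupby("month")["inventory"].sum())   # Jan 100, Feb 90
# ...but summing it across months is meaningless; use e.g. the average instead.
print(df.groupby("store")["inventory"].mean())  # average level per store
```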
Slice operator: Replace dimension with a specific value
Slice summarize variation: Replace a dimension with a summary of its values across all
members
Dice operator: Replace a dimension with a subset of values, often follows a slice operation.
Navigation operators: they are for hierarchical dimensions, three types:
            ˗    Drill-down: add detail to a dimension
            ˗    Roll-up: Remove detail from a dimension
            ˗    Distribute or recalculate measure values
Pivot operator: The pivot operator supports rearrangement of the dimensions in a data cube.
The pivot operator allows a data cube to be presented in the most appealing visual order.
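The operators above can be mimicked on a small table in pandas (a rough stand-in for a real OLAP engine; the data and column names are invented for illustration):

```python
# Sketch of slice, dice, roll-up and pivot on a tiny "cube" held in pandas.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2003, 2003, 2003, 2004, 2004, 2004],
    "product": ["Classic Cars", "Trains", "Classic Cars", "Classic Cars", "Trains", "Trains"],
    "country": ["USA", "USA", "Spain", "USA", "USA", "Spain"],
    "amount":  [100, 40, 70, 120, 50, 30],
})

# Slice: replace the product dimension with one specific member.
classic = sales[sales["product"] == "Classic Cars"]

# Dice: replace a dimension with a subset of its members.
diced = sales[sales["country"].isin(["USA"])]

# Roll-up: remove detail (drop the country level by summing over it).
rolled_up = sales.groupby(["year", "product"])["amount"].sum()
# Drill-down would go the other way, adding the country level back in.

# Pivot: rearrange which dimensions appear on rows and columns.
pivoted = sales.pivot_table(values="amount", index="product", columns="year", aggfunc="sum")

print(classic, diced, rolled_up, pivoted, sep="\n\n")
```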
TOPIC 5
MDX was first introduced as part of the OLE DB for OLAP specification in 1997 from Microsoft.
It was invented by a group of SQL Server engineers. The specification was quickly followed by
commercial release of Microsoft Analysis Services. The latest version of the OLE DB for OLAP
specification was issued by Microsoft in 1999. In Analysis Services 2005, Microsoft added some
extensions, most notably subselects. This new variation is sometimes referred to as MDX 2005.
mdXML was specified as the query language in 2001 as part of the XML for Analysis (XMLA)
specification; the MDX statement is enclosed in the <Statement> tag.
MDX became the basis of many Microsoft products and features such as SQL Server Analysis
Services and Excel pivot tables. MDX was also adopted by a wide range of vendors, both
commercial and open source.
MDX terminology:
Tuple: Combination of members. Identifies a cell.
Measures: Numeric values in cells; same definition as already given.
Dimensions: Main business concepts. Measures can be aggregated but dimensions cannot be
aggregated.
Axis: Dimension selected in a query.
Slicer: Combination of dimension members (the WHERE restriction).
Fundamental difference: while an SQL SELECT generates a two-dimensional table, an MDX
SELECT generates an m-dimensional cube.
SELECT: axes of the result cube; each axis can list a subset of a dimension's member values.
WHERE: slicer for the result cube cells; its dimensions must be different from the axis dimensions.
FROM: the data source (cube).
Example: the time dimension is on the rows, with 2003 and 2004 as the year members; Sales and
Quantity are on the columns. The numbers inside the cells are the measures. Each cell shows
sales and quantity for Classic Cars, because the WHERE clause restricts the cube calculation to
Classic Cars. Dimensions in the WHERE clause must be different from those in the SELECT
clause; the WHERE condition is known as a slicer condition. If no measure appears in the
SELECT list, the default measure is shown in the cells.
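The query described above could look roughly like the MDX shown in the comment of the sketch below; the cube and member names are invented, and the pandas code merely reproduces the same computation on a flat table, it is not MDX itself:

```python
# Hypothetical MDX for the result described in the notes (all names are made up):
#   SELECT { [Measures].[Sales], [Measures].[Quantity] } ON COLUMNS,
#          { [Time].[2003], [Time].[2004] }              ON ROWS
#   FROM   [SalesCube]
#   WHERE  ( [Product].[Classic Cars] )                  -- the slicer
#
# Equivalent computation over a flat table in pandas:
import pandas as pd

facts = pd.DataFrame({
    "year":     [2003, 2003, 2004, 2004],
    "product":  ["Classic Cars", "Trains", "Classic Cars", "Trains"],
    "sales":    [1500, 400, 1800, 500],
    "quantity": [30, 10, 35, 12],
})

sliced = facts[facts["product"] == "Classic Cars"]            # WHERE (slicer)
result = sliced.groupby("year")[["sales", "quantity"]].sum()  # rows = years, columns = measures
print(result)
```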
TOPIC 6
Motivation for table design
    ˗   Lack of scalability and integration of data cube storage engines.
    ˗   Dominance of relational model and products
    ˗   Large amount of research and development on relational database features for data
        warehouses
    ˗   Predominant usage of relational databases for large data warehouses.
Dimension table: Stores the values of one dimension. Often not normalized.
Fact table: Stores measure values. Often contains multiple foreign keys. Types of fact table:
Transaction: the most common type; usually uses additive measures.
Snapshot: a periodic or accumulating view of asset levels; uses semi-additive measures.
Factless: records event occurrences; it has no measures, just foreign keys.
Grain: the most detailed value stored in a fact table. The detail of the dimensions determines the
grain, such as the individual customer. It determines the size of the data warehouse: the product
of the dimension cardinalities times the sparsity. Grain too small: a large data warehouse and
increased computation time. Grain too large: queries on more detailed dimension values cannot
be answered.
Table design Patterns
Star Schema: One fact table in the centre, multiple dimension tables around. Represents one
data cube. DW may contain many star schemas.
Constellation schema: Multiple fact tables. Fact tables share dimension tables. Relationship
diagram looks like a constellation.
Snowflake schema: Multiple levels of dimension tables. Use when dimension tables are small:
little performance gain by denormalizing. Relationship diagram looks like a snowflake.
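As a minimal sketch of a star schema (not from the notes), here is a tiny example using Python's standard-library sqlite3 module; the table and column names are invented for illustration:

```python
# Tiny star schema: one fact table with foreign keys to two dimension tables.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, line TEXT);
CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    time_id    INTEGER REFERENCES dim_time(time_id),
    sales      REAL,          -- additive measure
    quantity   INTEGER
);
INSERT INTO dim_product VALUES (1, '1952 Alpine', 'Classic Cars'), (2, 'Express', 'Trains');
INSERT INTO dim_time    VALUES (1, 2003, 1), (2, 2004, 1);
INSERT INTO fact_sales  VALUES (1, 1, 1500, 30), (2, 1, 400, 10), (1, 2, 1800, 35);
""")

# Roll-up: join the fact table to its dimensions and sum the measure by year and product line.
for row in con.execute("""
    SELECT t.year, p.line, SUM(f.sales)
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    JOIN dim_time    t USING (time_id)
    GROUP BY t.year, p.line
"""):
    print(row)
```

A constellation schema would simply add further fact tables sharing dim_product and dim_time; a snowflake schema would normalize dim_product into additional level tables.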
Time representation for fact tables:
Alternatives: a timestamp column, or a time dimension table for organization-specific calendar
features. Variations: time-of-day columns; an accumulating fact table for representing multiple events.
TOPIC 7
HADOOP BASICS:
Hadoop is an Apache open-source software framework for storage and large-scale processing
of data sets on clusters of commodity hardware. It is licensed under the Apache license and it is
open source.
It provides a shared and integrated foundation onto which we can bring additional tools and
build up the framework. Scalability is at the core of a Hadoop system: we can distribute and
scale out very easily in a very cost-effective manner. All of the modules in Hadoop are designed
with the fundamental assumption that hardware fails; every machine fails at some point in time.
These failures are so common that we have to account for them ahead of time. A new thing
about Hadoop is that we can keep all the data we have and analyse it in new and interesting
ways, reading the data and creating the schema as we read it (schema-on-read). We can bring
more data into simple algorithms, which has shown that, with more granularity, you can often
achieve better results than by taking a small amount of data and running some really complex
analytics on it.
HADOOP FRAMEWORK:
There are four basic components to the framework:
Hadoop common: Contains libraries and utilities needed by other Hadoop modules
Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity
machines, providing very high aggregate bandwidth across the entire cluster.
Hadoop YARN: a resource management platform responsible for managing compute resources
in the cluster and using them to schedule users' applications.
Hadoop MapReduce: a programming model for processing data at scale across a lot of different
processes.
Apart from these four elements there is a newer one, related to cloud environments, called
Ozone (a distributed object store).
The Hadoop framework itself is mostly written in the Java programming language, with some
native code in C and command-line utilities written as shell scripts.
The two major pieces of the Hadoop framework are the Hadoop Distributed File System and
MapReduce; both are open source and both were inspired by technologies developed at Google.
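To make the MapReduce programming model concrete, here is a pure-Python sketch (not from the notes) that simulates the map, shuffle and reduce phases of a word count; a real job would run distributed across the cluster, this only illustrates the model:

```python
# Pure-Python simulation of the MapReduce programming model (word count).
from collections import defaultdict

def mapper(line):
    # Map phase: emit (key, value) pairs for each input record.
    for word in line.lower().split():
        yield word, 1

def reducer(key, values):
    # Reduce phase: combine all values that share the same key.
    return key, sum(values)

documents = ["Hadoop stores data on commodity hardware",
             "Hadoop processes data with MapReduce"]

# Shuffle phase: group every emitted value by its key before reducing.
groups = defaultdict(list)
for line in documents:
    for key, value in mapper(line):
        groups[key].append(value)

counts = dict(reducer(k, v) for k, v in groups.items())
print(counts["hadoop"], counts["data"])   # 2 2
```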
HADOOP DISTRIBUTED FILE SYSTEM:
It is a distributed, scalable and portable file system written in Java to support the Hadoop
framework. Each Hadoop instance typically has a single NameNode and a cluster of DataNodes
that form the HDFS cluster.
HDFS also includes a so-called secondary NameNode, a name that misleads some people into
thinking that when the primary NameNode goes offline, the secondary will take over. In fact,
the secondary NameNode regularly connects to the primary NameNode and builds snapshots
of the primary NameNode's directory information, which the system then saves to local and
remote directories. Above almost every Hadoop-based system sits some version of a MapReduce
engine. MapReduce has different ways to submit jobs and track the jobs we have submitted. The
typical MapReduce engine consists of a JobTracker, to which client applications can submit
MapReduce jobs, and this JobTracker typically pushes work out to the available TaskTrackers
in the cluster.
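Coming back to HDFS itself, the standard way to interact with it is the hdfs dfs command-line utility; as a small illustration (not from the notes), it can also be driven from Python, assuming a configured Hadoop client is on the PATH and using hypothetical paths:

```python
# Sketch: driving the standard `hdfs dfs` command-line utility from Python.
# Assumes a configured Hadoop client on the PATH; the paths below are hypothetical.
import subprocess

def hdfs(*args):
    # Run an HDFS shell command and return its standard output as text.
    return subprocess.run(["hdfs", "dfs", *args],
                          capture_output=True, text=True, check=True).stdout

hdfs("-mkdir", "-p", "/user/demo")                    # create a directory in HDFS
hdfs("-put", "-f", "local_file.txt", "/user/demo/")   # copy a local file into it
print(hdfs("-ls", "/user/demo"))                      # list the directory contents
```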
Apache Hadoop YARN is another sub-project of Hadoop and part of the Apache Software
Foundation; it was introduced in Hadoop 2.0. It basically separates the resource management
and the processing components. YARN enhances the power of a Hadoop compute cluster
without being limited to the MapReduce style of framework. Its scalability is great: the
processing power of data centres continues to grow quickly, and because the YARN
ResourceManager focuses exclusively on scheduling, it can manage those very large clusters
quite quickly and easily. YARN is completely compatible with MapReduce: existing MapReduce
applications and end users can run on top of YARN without disrupting any of their existing
processes. It also has improved cluster utilization: there are no named map and reduce slots, so
it helps us utilize the cluster in better ways. It supports workflows other than just MapReduce,
as said earlier; we are not stuck with mapping and reducing. Now we can bring in additional
programming models, such as graph processing or iterative modelling, and it is possible to
process the data in place. This is especially useful when we talk about machine learning
applications.
YARN allows multiple access engines, either open source or proprietary, to use Hadoop as a
common standard for batch or interactive processing, and even real-time engines that can
simultaneously work on a lot of different data, so you can put streaming kinds of applications
on top of YARN inside a Hadoop architecture and seamlessly work and communicate between
these environments.
HADOOP ZOO:
We have the original Google stack, as some people refer to it. What they did was start with the
Google File System: they thought it would be a great idea to distribute a large amount of pretty
cheap storage, put a lot of data on there, and come up with a framework that would allow them
to process all that data. So they had their original MapReduce, and they were storing and
processing large amounts of data.
Then they said, well, that was really great, but we would really like to be able to access that
data in a SQL-like language. So they built a SQL gateway to adjust the data in the MapReduce
cluster and be able to query some of that data as well.
Then they realized they needed a high-level specific language to access MapReduce in the
cluster and submit some of those jobs. So Sawzall came along.
Then Evenflow came along and allowed them to chain together complex workflows and
coordinate events and services across this kind of framework, or the specific cluster they had at
the time.
Then Dremel came along. Dremel was a columnar storage and metadata manager that allowed
them to manage the data. Then, of course, you needed something to coordinate all of this. So
Chubby came along as a coordination system that would manage all of the products in this one
unit, or one ecosystem, that could process all these large amounts of structured data seamlessly.
So you can see that there is a pattern that emerges across all the stacks that different
organizations use. And now we come down to Cloudera's distribution for Hadoop, and we can
see which pieces of the system Cloudera has:
We have Sqoop and Flume for ingestion.
We use HBase for the common store.
We have Oozie as the coordination and workflow engine.
We use Pig and Hive as high-level languages for querying some of the data.
And we use ZooKeeper as a coordination service at the bottom of the stack.
HADOOP ECOSYSTEM:
Let's talk about the different tools that sit on top of the Hadoop framework. With technology
evolution, it is now possible to manage immense volumes of data that we previously could only
handle with supercomputers. Prices of systems have dropped, and as a result new techniques
for distributed computing have come along. Companies like Yahoo, Google and Facebook came
to the realization that they needed to do something to monetize the massive amounts of data
they were collecting. So this is where all of these applications have evolved over time, as we
have seen through numerous such organizations.
So let's talk about Apache Sqoop. Sqoop stands for SQL to Hadoop. It is a straightforward
command-line tool with several capabilities. It lets us import individual tables or entire
databases into our HDFS system, and it generates Java classes that allow us to interact with the
data we imported. It also provides the ability to import SQL databases straight into the Hive
data warehouse, which sits within the HDFS system.
Next up is HBase. HBase is a key component of the Hadoop stack, as its design caters to
applications that require really fast random access to significant data sets. As mentioned
earlier, HBase is based on Google's BigTable, and it can handle massive data tables combining
billions of rows and millions of columns.
Pig is a scripting language; it is really a high-level platform for creating MapReduce programs
using Hadoop. Its language is called Pig Latin, and it excels at describing data analysis problems
as data flows. Through the user-defined function (UDF) facilities in Pig, you can actually run
code written in many different languages, like JRuby, Jython, and Java, and you can execute Pig
scripts in other languages. So the result is that you can use Pig as a component to build much
larger, much more complex applications that can tackle some real business problems.
Pig can ingest data from files, streams, or any other source using UDFs, user-defined functions
that we can write ourselves. When it has all the data it can select, iterate and perform many
kinds of transformations over that data. Again, the UDF feature allows passing the data to more
complex algorithms for more complex transformations, and it can take all of these
transformations and store the results back on the distributed file system.
The Apache Hive data warehouse software facilitates querying and managing large datasets
residing in our distributed file storage. It provides a mechanism to project structure on top of
all of this data and allows us to use SQL-like queries to access the data we have stored in this
data warehouse. This query language is called HiveQL. At the same time, this language also
allows traditional MapReduce programmers to plug in their custom mappers and reducers
when it is inconvenient, too complex or really inefficient to express the desired logic in HiveQL.
Oozie is a workflow scheduler system that manages all of our Apache Hadoop jobs. Oozie
workflow jobs are what we call DAGs, Directed Acyclic Graphs, of actions. Oozie coordinator
jobs are recurrent Oozie workflow jobs that are triggered by frequency or data availability. It is
integrated with the rest of the Hadoop stack, supporting several different Hadoop jobs right out
of the box: you can bring in Java MapReduce, you can bring in streaming MapReduce, and you
can run Pig, Hive, Sqoop and many other specific jobs on the system itself. It is a very scalable,
reliable and quite extensible system.
Well, we have a large zoo of crazy wild animals and we have got to keep them somehow
organized. That is kind of what ZooKeeper does: Apache ZooKeeper provides operational
services for the Hadoop cluster. It provides a distributed configuration service and a
synchronization service, so it can synchronize all these jobs, and a naming registry for the
entire distributed system. Distributed applications use ZooKeeper to store and mediate updates
to important configuration information on the cluster itself.
Flume is a distributed, reliable and available service for efficiently collecting, aggregating
and moving large amounts of data. It has a simple and very flexible architecture based on
streaming data flows. It is quite robust and fault tolerant, and it is really tunable to enhance the
reliability mechanisms: failover, recovery, and all the other mechanisms that keep the cluster
safe and reliable. It uses a simple, extensible data model that allows us to apply all kinds of
online analytic applications.
An additional component we haven't mentioned yet is Spark. Although Hadoop captures the
most attention for distributed data analytics, there are now a number of alternatives that
provide some interesting advantages over the traditional Hadoop platform, and Spark is one of
them. Spark is a scalable data analytics platform that incorporates primitives for in-memory
computing and therefore offers performance advantages over traditional Hadoop's cluster
storage approach. It is implemented in, and supports, the Scala language, and provides a
unique environment for data processing. Spark is really great for more complex kinds of
analytics, and it is great at supporting machine learning libraries.