Chapter 10: Big Data
Database System Concepts, 7th Ed.
     ©Silberschatz, Korth and Sudarshan
See www.db-book.com for conditions on re-use
                                               Motivation
            ▪ Very large volumes of data being collected
                   • Driven by growth of web, social media, and more recently
                       internet-of-things
                   • Web logs were an early source of data
                          ▪ Analytics on web logs has great value for advertisements, web
                            site structuring, what posts to show to a user, etc.
            ▪ Big Data: differentiated from data handled by earlier generation
                 databases
                   • Volume: much larger amounts of data stored
                   • Velocity: much higher rates of insertions
                   • Variety: many types of data, beyond relational data
                                         Querying Big Data
            ▪ Transaction processing systems that need very high scalability
                   • Many applications willing to sacrifice ACID properties and other
                       database features, if they can get very high scalability
            ▪ Query processing systems that
                   • Need very high scalability, and
                    • Need to support non-relational data
                                 Big Data Storage Systems
            ▪ Distributed file systems
             ▪ Sharding across multiple databases
            ▪ Key-value storage systems
            ▪ Parallel and distributed databases
                                 Distributed File Systems
            ▪ A distributed file system stores data across a large collection of machines,
               but provides single file-system view
            ▪ Highly scalable distributed file system for large data-intensive
               applications.
                   • E.g., 10K nodes, 100 million files, 10 PB
            ▪ Provides redundant storage of massive amounts of data on cheap and
               unreliable computers
                   • Files are replicated to handle hardware failure
                    • Detects failures and recovers from them
            ▪ Examples:
                   • Google File System (GFS)
                    • Hadoop Distributed File System (HDFS)
                             Hadoop File System Architecture
            ▪ Single Namespace for entire
                 cluster
            ▪ Files are broken up into
                 blocks
                  •    Typically 64 MB block
                       size
                  •    Each block replicated on
                       multiple DataNodes
            ▪ Client
                  •    Finds location of blocks
                       from NameNode
                  •    Accesses data directly
                       from DataNode
                      Hadoop Distributed File System (HDFS)
            ▪ NameNode
                   • Maps a filename to list of Block IDs
                   • Maps each Block ID to DataNodes containing a replica of the block
            ▪ DataNode: Maps a Block ID to a physical location on disk
            ▪ Data Coherency
                  •    Write-once-read-many access model
                  •    Client can only append to existing files
             ▪ Distributed file systems are good for millions of large files
                   •    But have very high overheads and poor performance with billions of
                        smaller tuples
                                                 Sharding
            ▪ Sharding: partition data across multiple databases
             ▪ Partitioning is usually done on some partitioning attributes (also known
                  as partitioning keys or shard keys), e.g., user ID
                   • E.g., records with key values from 1 to 100,000 on database 1,
                       records with key values from 100,001 to 200,000 on database 2, etc.
             ▪ Application must track which records are on which database and route
                  queries/updates to that database (see the sketch below)
            ▪ Positives: scales well, easy to implement
            ▪ Drawbacks:
                   • Not transparent: application has to deal with routing of queries,
                       queries that span multiple databases
                   • When a database is overloaded, moving part of its load out is not
                       easy
                    • Chance of failure increases with more databases
                          ▪ Need to keep replicas to ensure availability, which is more work
                            for the application
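             A minimal sketch of application-side range routing (the class name and
             JDBC URLs are hypothetical):

                import java.util.TreeMap;

                // Hypothetical application-side router: picks a database by shard-key range.
                public class ShardRouter {
                    // Lower bound of each key range -> JDBC URL of the shard holding it.
                    private final TreeMap<Integer, String> shards = new TreeMap<>();

                    public ShardRouter() {
                        shards.put(1,      "jdbc:postgresql://db1/app"); // keys 1..100,000
                        shards.put(100001, "jdbc:postgresql://db2/app"); // keys 100,001..200,000
                    }

                    // floorEntry finds the range whose lower bound is <= key.
                    public String urlFor(int key) {
                        return shards.floorEntry(key).getValue();
                    }
                }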
                          Parallel Databases and Data Stores
            ▪ Supporting scalable data access
                   • Approach 1: memcache or other caching mechanisms at application
                       servers, to reduce database access
                         ▪ Limited in scalability
                   • Approach 2: Partition (“shard”) data across multiple separate
                       database servers
                   • Approach 3: Use existing parallel databases
                          ▪ Historically, parallel databases that could scale to large numbers
                            of machines were designed for decision support, not OLTP
                   • Approach 4: Massively Parallel Key-Value Data Store
                         ▪ Partitioning, high availability etc. completely transparent to
                           application
            ▪ Sharding systems and key-value stores don’t support many relational
                 features, such as joins, integrity constraints, etc., across partitions.
                                 Key Value Storage Systems
            ▪ Key-value storage systems store large numbers (billions or even more) of
                 small (KB-MB) sized records
             ▪ Records are partitioned across multiple machines, and queries are routed
                  by the system to the appropriate machine
            ▪ Records are also replicated across multiple machines, to ensure
                 availability even if a machine fails
                   • Key-value stores ensure that updates are applied to all replicas, to
                       ensure that their values are consistent
                                     Key Value Storage Systems
            ▪ Key-value stores may store
                   • uninterpreted bytes, with an associated key
                         ▪ E.g., Amazon S3, Amazon Dynamo
                    • Wide tables (can have arbitrarily many attribute names) with an
                        associated key
                          ▪ Google BigTable, Apache Cassandra, Apache HBase, Amazon
                            DynamoDB
                          ▪ Allows some operations (e.g., filtering) to execute on the
                            storage node
                   • JSON
                         ▪ MongoDB, CouchDB (document model)
            ▪ Document stores store semi-structured data, typically JSON
            ▪ Some key-value stores support multiple versions of data, with
                 timestamps/version numbers
                                      Data Representation
            ▪ An example of a JSON object is:
                  {
                      "ID": "22222",
                      "name": {
                           "firstname: "Albert",
                           "lastname: "Einstein"
                      },
                      "deptname": "Physics",
                      "children": [
                           { "firstname": "Hans", "lastname": "Einstein" },
                           { "firstname": "Eduard", "lastname": "Einstein" }
                      ]
                  }
                                 Key Value Storage Systems
             ▪ Key-value stores support the following operations (see the sketch at the
                  end of this slide):
                    • put(key, value): stores the value with the associated key
                    • get(key): retrieves the stored value associated with the specified key
                    • delete(key): removes the key and its associated value
            ▪ Some systems also support range queries on key values
            ▪ Document stores also support queries on non-key attributes
                   • See book for MongoDB queries
            ▪ Key value stores are not full database systems
                   • Have no/limited support for transactional updates
                   • Applications must manage query processing on their own
             ▪ Not supporting the above features makes it easier to build scalable data
                  storage systems
                   • Also called NoSQL systems
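             A minimal in-memory stand-in for the put/get/delete interface above
             (hypothetical class; a real key-value store partitions and replicates
             records across machines):

                import java.util.concurrent.ConcurrentHashMap;

                // Hypothetical single-node stand-in for the put/get/delete interface.
                public class KVStore<K, V> {
                    private final ConcurrentHashMap<K, V> data = new ConcurrentHashMap<>();

                    public void put(K key, V value) { data.put(key, value); }  // store
                    public V get(K key)             { return data.get(key); }  // retrieve
                    public void delete(K key)       { data.remove(key); }      // remove
                }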
                           Parallel and Distributed Databases
             ▪ Parallel databases run on multiple machines (clusters)
                   • Developed in 1980s, well before Big Data
            ▪ Parallel databases were designed for smaller scale (10s to 100s of
                 machines)
                   • Did not provide easy scalability
            ▪ Replication used to ensure data availability despite machine failure
                   • But typically restart query in event of failure
                         ▪ Restarts may be frequent at very large scale
                         ▪ Map-reduce systems (coming up next) can continue query
                           execution, working around failures
                                    Replication and Consistency
            ▪ Availability (system can run even if parts have failed) is essential for
                 parallel/distributed databases
                   • Via replication, so even if a node has failed, another copy is available
             ▪ Consistency is important for replicated data
                    • All live replicas have the same value, and each read sees the latest version
                    • Often implemented using majority protocols (see the sketch below)
                          ▪ E.g., with 3 replicas, reads/writes must each access 2 replicas
                                  • Details in Chapter 23
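             A minimal check of the quorum condition behind such protocols (a hypothetical
             illustration, not a full protocol):

                // With n replicas, read quorum r and write quorum w must satisfy
                // r + w > n, so every read overlaps every write and sees the latest version.
                public class Quorum {
                    static boolean validQuorums(int n, int r, int w) {
                        return r + w > n;
                    }
                    public static void main(String[] args) {
                        System.out.println(validQuorums(3, 2, 2)); // true: the example above
                    }
                }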
             ▪ Network partitions can occur (the network breaks into two or more parts,
                  each with active systems that can't talk to the other parts)
            ▪ In presence of partitions, cannot guarantee both availability and
                 consistency
                   • Brewer’s CAP “Theorem”
                                 Replication and Consistency
            ▪ Very large systems will partition at some point
                   • Choose one of consistency or availability
             ▪ Traditional databases choose consistency
            ▪ Most Web applications choose availability
                   • Except for specific parts such as order processing
            ▪ More details later, in Chapter 23
                                 The MapReduce Paradigm
            ▪ Platform for reliable, scalable parallel computing
            ▪ Abstracts issues of distributed and parallel environment from programmer
                  • Programmer provides core logic (via map() and reduce() functions)
                  • System takes care of parallelization of computation, coordination, etc.
             ▪ Paradigm dates back many decades
                  • But very large scale implementations running on clusters with 10³ to
                     10⁴ machines are more recent
                  • Google MapReduce, Hadoop, …
            ▪ Data storage/access typically done using distributed file systems or
               key-value stores
                           MapReduce: Word Count Example
            ▪ Consider the problem of counting the number of occurrences of each word
               in a large collection of documents
            ▪ How would you do it in parallel?
            ▪ Solution:
                • Divide documents among workers
                 • Each worker parses its documents to find all words; the map function
                        outputs (word, count) pairs
                    • Partition the (word, count) pairs across workers based on the word
                    • For each word at a worker, the reduce function locally adds up the counts
            ▪ Given input: “One a penny, two a penny, hot cross buns.”
                • Records output by the map() function would be
                          ▪ (“One”, 1), (“a”, 1), (“penny”, 1),(“two”, 1), (“a”, 1), (“penny”, 1),
                            (“hot”, 1), (“cross”, 1), (“buns”, 1).
                   • Records output by reduce function would be
                          ▪ (“One”, 1), (“a”, 2), (“penny”, 2), (“two”, 1), (“hot”, 1), (“cross”, 1),
                            (“buns”, 1)
                                 Pseudo-code of Word Count
              map(String record):
                  for each word in record
                      emit(word, 1)

               // The first attribute of emit above is called the reduce key.
               // In effect, a group-by is performed on the reduce key to create a
               // list of values (all 1's in the above code); this requires a shuffle
               // step across machines.
               // The reduce function is then called on the list of values in each group.

              reduce(String key, List value_list):
                  String word = key
                  int count = 0
                  for each value in value_list:
                      count = count + value
                  output(word, count)
                             MapReduce Programming Model
             ▪ Inspired by the map and reduce operations commonly used in functional
                  programming languages such as Lisp
             ▪ Input: a set of key/value pairs
             ▪ User supplies two functions:
                    • map(k, v) → list(k1, v1)
                    • reduce(k1, list(v1)) → v2
             ▪ (k1, v1) is an intermediate key/value pair
             ▪ Output is the set of (k1, v2) pairs
            ▪ For our example, assume that system
                   • Breaks up files into lines, and
                   • Calls map function with value of each line
                         ▪ Key is the line number
                    MapReduce Example 2: Log Processing
            ▪ Given log file in following format:
                ...
                2013/02/21 10:31:22.00EST /slide-dir/11.ppt
                2013/02/21 10:43:12.00EST /slide-dir/12.ppt
                2013/02/22 18:26:45.00EST /slide-dir/13.ppt
                2013/02/22 20:53:29.00EST /slide-dir/12.ppt
                ...
            ▪ Goal: find how many times each of the files in the slide-dir directory was
               accessed between 2013/01/01 and 2013/01/31.
             ▪ Options:
                    • A sequential program would be too slow on massive datasets
                    • Loading into a database is expensive; operating directly on the log
                        files is cheaper
                    • A custom-built parallel program for this task is possible, but very laborious
                    • The map-reduce paradigm
                   MapReduce: File Access Count Example
            map(String key, String record) {
              String attribute[3];
              … break up record into tokens (based on the space character), and store
                the tokens in the array attribute …
              String date = attribute[0];
              String time = attribute[1];
              String filename = attribute[2];
              if (date between 2013/01/01 and 2013/01/31
                      and filename starts with "/slide-dir/")
                  emit(filename, 1);
            }

            reduce(String key, List recordlist) {
              String filename = key;
              int count = 0;
              for each record in recordlist
                 count = count + 1;
              output(filename, count);
            }
                          Schematic Flow of Keys and Values
            ▪ Flow of keys and values in a map reduce task
                  [Figure: map inputs (mk1, mv1) … (mkn, mvn) are transformed by map()
                  into (reduce-key, value) pairs such as (rk1, rv1), (rk7, rv2), …;
                  the shuffle groups these by reduce key into reduce inputs such as
                  (rk1, [rv1, rv7, …]), (rk2, [rv8, rvi, …]), …]
                      Parallel Processing of MapReduce Job
                  [Figure: the user program is copied to a master and to worker nodes;
                  the master assigns map and reduce tasks. Map workers 1…n read input
                  file partitions and write intermediate files locally; reduce workers
                  1…m remotely read and sort the intermediate files and write output
                  files 1…m]
                                       Hadoop MapReduce
            ▪ Google pioneered map-reduce implementations that could run on
                 thousands of machines (nodes), and transparently handle failures of
                 machines
             ▪ Hadoop is a widely used open-source implementation of MapReduce,
                  written in Java
                    • Map and reduce functions can be written in several different
                        languages; we use Java
             ▪ Input and output for map-reduce systems such as Hadoop must be done
                  in parallel
                    • Google used the GFS distributed file system
                    • Hadoop uses the Hadoop Distributed File System (HDFS)
                    • Input files can be in several formats
                          ▪ Text/CSV
                          ▪ Compressed representations such as Avro, ORC, and Parquet
                    • Hadoop also supports key-value stores such as HBase, Cassandra,
                        MongoDB, etc.
                                         Types in Hadoop
             ▪ The generic Mapper and Reducer interfaces both take four type arguments
                  that specify the types of the
                    • input key, input value, output key, and output value
             ▪ The Map class in the next slide implements the Mapper interface
                    • The map input key is of type LongWritable, i.e., a long integer
                    • The map input value, which is (all or part of) a document, is of type Text
                    • The map output key is of type Text, since the key is a word
                    • The map output value is of type IntWritable, an integer value
                         Hadoop Code in Java: Map Function
            public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>
            {
                private final static IntWritable one = new IntWritable(1);
                private Text word = new Text();

                // Called once per input line; emits the pair (word, 1) for each token.
                public void map(LongWritable key, Text value, Context context)
                      throws IOException, InterruptedException
                {
                     String line = value.toString();
                     StringTokenizer tokenizer = new StringTokenizer(line);
                     while (tokenizer.hasMoreTokens()) {
                            word.set(tokenizer.nextToken());
                            context.write(word, one);
                     }
                }
            }
                     Hadoop Code in Java: Reduce Function
            public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
                // Called once per word, with the list of all counts emitted for that
                // word; writes the pair (word, total count).
                public void reduce(Text key, Iterable<IntWritable> values,
                      Context context) throws IOException, InterruptedException
                {
                     int sum = 0;
                     for (IntWritable val : values) {
                           sum += val.get();
                     }
                     context.write(key, new IntWritable(sum));
                }
            }
                                        Hadoop Job Parameters
            ▪ The classes that contain the map and reduce functions for the job
                   • Set by methods setMapperClass() and setReducerClass()
            ▪ The types of the job’s output key and values
                   • Set by methods setOutputKeyClass() and setOutputValueClass()
            ▪ The input format of the job
                   • Set by method job.setInputFormatClass()
                          ▪ The default input format in Hadoop is TextInputFormat
                                  • The map key is the byte offset of the line in the file
                                  • The map value is the contents of one line of the file
            ▪ The directories where the input files are stored, and where the output files
                 must be created
                    • Set by addInputPath() and setOutputPath()
            ▪ And many more parameters
                      Hadoop Code in Java: Overall Program
           public class WordCount {
             public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Job job = new Job(conf, "wordcount");
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                job.setMapperClass(Map.class);
                job.setReducerClass(Reduce.class);
                job.setInputFormatClass(TextInputFormat.class);
                job.setOutputFormatClass(TextOutputFormat.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                job.waitForCompletion(true);
             }
           }
                                    Local Pre-Aggregation
            ▪ Combiners: perform partial aggregation to minimize network traffic
                 • E.g., within machine
                 • And/or at rack level
             ▪ In Hadoop, the reduce function is commonly reused as the combiner
                  (a sketch follows)
                  • But an alternative combiner implementation must be specified if the
                       input and output types of the reducer differ
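             A minimal sketch, reusing the WordCount Reduce class shown earlier (valid
             as a combiner because its input and output types are both (Text, IntWritable)):

                // Perform partial aggregation on each map node before the shuffle:
                job.setCombinerClass(Reduce.class);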
                                 Map Reduce vs. Databases
             ▪ MapReduce is widely used for parallel processing
                    • At Google, Yahoo!, and hundreds of other companies
                    • Example uses: computing PageRank, building keyword indices,
                        analyzing web click logs, …
                    • Allows procedural code in map and reduce functions
                    • Allows data of any type
            ▪ Many real-world uses of MapReduce cannot be expressed in SQL
            ▪ But many computations are much easier to express in SQL
                   • Map Reduce is cumbersome for writing simple queries
                          Map Reduce vs. Databases (Cont.)
            ▪ Relational operations (select, project, join, aggregation, etc.) can be
                 expressed using Map Reduce
            ▪ SQL queries can be translated into Map Reduce infrastructure for
                 execution
                    • Apache Hive (SQL), Apache Pig (Pig Latin), Microsoft SCOPE
            ▪ Current generation execution engines support not only Map Reduce, but
                 also other algebraic operations such as joins, aggregation, etc. natively.
                BEYOND MAPREDUCE: ALGEBRAIC OPERATIONS
                                      Algebraic Operations
             ▪ Current generation execution engines
                    • Natively support algebraic operations such as joins, aggregation, etc.
                   • Allow users to create their own algebraic operators
                   • Support trees of algebraic operators that can be executed on multiple
                       nodes in parallel
             ▪ E.g., Apache Tez, Spark
                    • Tez provides a low-level API; Hive on Tez compiles SQL to Tez
                    • Spark provides a more user-friendly API
                                 Algebraic Operations in Spark
            ▪ Resilient Distributed Dataset (RDD) abstraction
                   • Collection of records that can be stored across multiple machines
            ▪ RDDs can be created by applying algebraic operations on other RDDs
            ▪ RDDs can be lazily computed when needed
            ▪ Spark programs can be written in Java/Scala/R
                   • Our examples are in Java
             ▪ Spark makes use of Java 8 lambda expressions; the code
                      s -> Arrays.asList(s.split(" ")).iterator()
                    defines an unnamed function that takes an argument s and executes the
                    expression Arrays.asList(s.split(" ")).iterator() on that argument
             ▪ Lambda functions are particularly convenient as arguments to map,
                  reduce, and other functions
                                 Word Count in Spark
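             A minimal sketch of word count using Spark's Java RDD API (the class
             name and file paths are illustrative; input and output paths are passed
             as program arguments):

                import java.util.Arrays;
                import org.apache.spark.SparkConf;
                import org.apache.spark.api.java.*;
                import scala.Tuple2;

                public class WordCountSpark {
                    public static void main(String[] args) {
                        SparkConf conf = new SparkConf().setAppName("WordCount");
                        JavaSparkContext sc = new JavaSparkContext(conf);
                        // Each transformation creates a new RDD; nothing executes until
                        // saveAsTextFile() forces evaluation of the operator tree.
                        JavaRDD<String> lines = sc.textFile(args[0]);
                        JavaRDD<String> words =
                            lines.flatMap(s -> Arrays.asList(s.split(" ")).iterator());
                        JavaPairRDD<String, Integer> pairs =
                            words.mapToPair(w -> new Tuple2<>(w, 1));
                        JavaPairRDD<String, Integer> counts =
                            pairs.reduceByKey((a, b) -> a + b);
                        counts.saveAsTextFile(args[1]);
                        sc.close();
                    }
                }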
                                 Algebraic Operations in Spark
            ▪ Algebraic operations in Spark are typically executed in parallel on multiple
                 machines
                   • With data partitioned across the machines
             ▪ Algebraic operations are executed lazily, not immediately
                    • Our preceding program creates an operator tree
                    • The tree is executed only when specific functions, such as
                        saveAsTextFile() or collect(), are invoked
                    • Query optimization can be performed on the tree before it is executed
                                 Spark DataFrames and DataSet
            ▪ RDDs in Spark can be typed in programs, but not dynamically
            ▪ The DataSet type allows types to be specified dynamically
            ▪ Row is a row type, with attribute names
                   • In code below, attribute names/types of instructor and department
                       are inferred from files read
             ▪ Operations filter, join, groupBy, agg, etc., are defined on DataSet, and can
                  execute in parallel
             ▪ Dataset<Row> instructor = spark.read().parquet("...");
                  Dataset<Row> department = spark.read().parquet("...");
                  instructor.filter(instructor.col("salary").gt(100000))
                      .join(department, instructor.col("dept_name")
                          .equalTo(department.col("dept_name")))
                      .groupBy(department.col("building"))
                      .agg(count(instructor.col("ID")));
                                 STREAMING DATA
                             Streaming Data and Applications
            ▪ Streaming data refers to data that arrives in a continuous fashion
                   • Contrast to data-at-rest
            ▪ Applications include:
                   • Stock market: stream of trades
                   • e-commerce site: purchases, searches
                   • Sensors: sensor readings
                         ▪ Internet of things
                   • Network monitoring data
                   • Social media: tweets and posts can be viewed as a stream
            ▪ Queries on streams can be very useful
                   • Monitoring, alerts, automated triggering of actions
                                    Querying Streaming Data
           Approaches to querying streams:
                ▪ Windowing: Break up stream into windows, and queries are run on
                    windows
                       • Stream query languages support window operations
                       • Windows may be based on time or tuples
                       • Must figure out when all tuples in a window have been seen
                             ▪ Easy if stream totally ordered by timestamp
                              ▪ Punctuations specify that all future tuples have a timestamp
                                greater than some value
                ▪ Continuous Queries: Queries written e.g. in SQL, output partial results
                    based on stream seen so far; query results updated continuously
                       • Have some applications, but can lead to flood of updates
                                 Querying Streaming Data (Cont.)
           Approaches to querying streams (Cont.):
                ▪ Algebraic operators on streams:
                       • Each operator consumes tuples from a stream and outputs tuples
                        • Operators can be written, e.g., in an imperative language
                       • Operator may maintain state
                ▪ Pattern matching:
                       • Queries specify patterns, system detects occurrences of patterns
                           and triggers actions
                       • Complex Event Processing (CEP) systems
                       • E.g., Microsoft StreamInsight, Flink CEP, Oracle Event Processing
                             Stream Processing Architectures
            ▪ Many stream processing systems are purely in-memory, and do not
                 persist data
            ▪ Lambda architecture: split stream into two, one output goes to stream
                 processing system and the other to a database for storage
                   • Easy to implement and widely used
                    • But often leads to duplication of querying effort, once on the
                        streaming system and once in the database
                                 Stream Extensions to SQL
            ▪ SQL Window functions described in Section 5.5.2
            ▪ Streaming systems often support more window types
                    • Tumbling window
                          ▪ E.g., hourly windows; windows don't overlap
                   • Hopping window
                         ▪ E.g., hourly window computed every 20 minutes
                   • Sliding window
                         ▪ Window of specified size (based on timestamp interval or
                           number of tuples) around each incoming tuple
                   • Session window
                         ▪ Groups tuples based on user sessions
                                  Window Syntax in SQL
            ▪ Windowing syntax varies widely by system
            ▪ E.g., in Azure Stream Analytics SQL:
                     select itemid, System.Timestamp as window_end, sum(amount)
                     from order timestamp by datetime
                     group by itemid, tumblingwindow(hour, 1)
            ▪ Aggregates are applied on windows
            ▪ Result of windowing operation on a stream is a relation
            ▪ Many systems support stream-relation joins
            ▪ Stream-stream joins often require join conditions to specify bound on
                 timestamp gap between matching tuples
                   • E.g., tuples must be at most 30 minutes apart in timestamp
                             Algebraic Operations on Streams
            ▪ Tuples in streams need to be routed to operators
             ▪ Routing of streams using DAG and publish-subscribe representations
                    • Used in Apache Storm and Apache Kafka, respectively
                                 Publish Subscribe Systems
            ▪ Publish-subscribe (pub-sub) systems provide convenient abstraction for
                 processing streams
                   • Tuples in a stream are published to a topic
                   • Consumers subscribe to topic
            ▪ Parallel pub-sub systems allow tuples in a topic to be partitioned across
                 multiple machines
             ▪ Apache Kafka is a popular parallel pub-sub system widely used to
                  manage streaming data (see the sketch below)
            ▪ More details in book
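             A minimal sketch of publishing one tuple to a Kafka topic with the Kafka
             Java client (the broker address and the topic name "orders" are hypothetical):

                import java.util.Properties;
                import org.apache.kafka.clients.producer.KafkaProducer;
                import org.apache.kafka.clients.producer.ProducerRecord;

                public class PublishExample {
                    public static void main(String[] args) {
                        Properties props = new Properties();
                        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
                        props.put("key.serializer",
                            "org.apache.kafka.common.serialization.StringSerializer");
                        props.put("value.serializer",
                            "org.apache.kafka.common.serialization.StringSerializer");
                        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                            // The key determines which partition of the topic the tuple goes to.
                            producer.send(new ProducerRecord<>("orders", "item-17", "amount=3"));
                        }
                    }
                }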
                                 GRAPH DATABASES
                                       Graph Data Model
            ▪ Graphs are a very general data model
            ▪ ER model of an enterprise can be viewed as a graph
                   • Every entity is a node
                   • Every binary relationship is an edge
                   • Ternary and higher degree relationships can be modelled as binary
                       relationships
                                 Graph Data Model (Cont.)
            ▪ Graphs can be modelled as relations
                   • node(ID, label, node_data)
                   • edge(fromID, toID, label, edge_data)
             ▪ The above representation is too simplistic
             ▪ Graph databases like Neo4j can provide a graph view of a relational
                  schema
                   • Relations can be identified as representing either nodes or edges
             ▪ Query languages for graph databases
                    • make it easy to express queries requiring edge traversal
                    • allow efficient algorithms to be used for evaluation
                                  Graph Data Model (Cont.)
            ▪ Suppose
                   • Relations instructor and student are nodes, and
                    • Relation advisor represents edges between instructors and students
             ▪ Query in Neo4j:
                   match (i:instructor)<-[:advisor]-(s:student)
                   where i.dept_name = 'Comp. Sci.'
                   return i.ID as ID, i.name as name, collect(s.name) as advisees
            ▪ match clause matches nodes and edges in graphs
            ▪ Recursive traversal of edges is also possible
                    • Suppose prereq(course_id, prereq_id) is modeled as an edge
                    • Transitive closure can be done as follows:
                        match (c1:course)-[:prereq *1..]->(c2:course)
                        return c1.course_id, c2.course_id
                                   Parallel Graph Processing
            ▪ Very large graphs (billions of nodes, trillions of edges)
                   • Web graph: web pages are nodes, hyper links are edges
                   • Social network graph: people are nodes, friend/follow links are edges
            ▪ Two popular approaches for parallel processing on such graphs
                   • Map-reduce and algebraic frameworks
                   • Bulk synchronous processing (BSP) framework
             ▪ Multiple iterations are required for most computations on graphs
                   • Map-reduce/algebraic frameworks often have high overheads per
                       iteration
                   • BSP frameworks have much lower per-iteration overheads
            ▪ Google’s Pregel system popularized the BSP framework
            ▪ Apache Giraph is an open-source version of Pregel
            ▪ Apache Spark’s GraphX component provides a Pregel-like API
                                 Bulk Synchronous Processing
            ▪ Each vertex (node) of a graph has data (state) associated with it
                   • Vertices are partitioned across multiple machines, and state of node
                       kept in-memory
            ▪ Analogous to map() and reduce() functions, programmers provide
                 methods to be executed for each node
                   • Node method can send messages to or receive messages from
                       neighboring nodes
            ▪ Computation consists of multiple iterations, or supersteps
             ▪ In each superstep
                    • Nodes process received messages
                    • Update their state, and
                    • Send further messages, or vote to halt
             ▪ Computation ends when all nodes vote to halt and there are no pending
                  messages (a schematic sketch follows)
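             A schematic, single-process simulation of the superstep loop (hypothetical
             code; here each vertex propagates the maximum value through the graph):

                import java.util.*;

                // Schematic BSP simulation: each vertex consumes its incoming messages,
                // updates its state, and either sends messages to neighbors or votes to
                // halt by sending none.
                public class BspSketch {
                    public static void main(String[] args) {
                        int[][] neighbors = { {1}, {0, 2}, {1} };   // adjacency lists
                        int[] state = { 3, 6, 2 };                  // per-vertex state
                        Map<Integer, List<Integer>> inbox = new HashMap<>();
                        // Superstep 0: every vertex sends its value to its neighbors.
                        for (int v = 0; v < state.length; v++)
                            for (int n : neighbors[v])
                                inbox.computeIfAbsent(n, k -> new ArrayList<>()).add(state[v]);
                        while (!inbox.isEmpty()) {                  // barrier between supersteps
                            Map<Integer, List<Integer>> next = new HashMap<>();
                            for (Map.Entry<Integer, List<Integer>> e : inbox.entrySet()) {
                                int v = e.getKey();
                                int max = Collections.max(e.getValue());
                                if (max > state[v]) {               // state changed: stay active
                                    state[v] = max;
                                    for (int n : neighbors[v])
                                        next.computeIfAbsent(n, k -> new ArrayList<>()).add(max);
                                }                                   // else: vote to halt
                            }
                            inbox = next;
                        }
                        System.out.println(Arrays.toString(state)); // prints [6, 6, 6]
                    }
                }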
                                 End of Chapter 10