BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI
WORK INTEGRATED LEARNING PROGRAMMES
COURSE HANDOUT
Part A: Content Design
Course Title Systems for Data Analytics
Course No(s) DSE* ZG517
Credit Units 5
Course Author Prof. Shan Balasubramaniam
Version No 1
Date 26 / April / 2019
Course Description
Fundamentals of data engineering; basics of systems and techniques for data processing, covering
relevant database, cloud computing, and distributed computing concepts.
Course Objectives
CO1 Introduce students to a systems perspective of data analytics: to leverage systems effectively,
understand, measure, and improve performance while performing data analytics tasks
CO2 Enable students to develop a working knowledge of how to use parallel and distributed systems for
data analytics
CO3 Enable students to apply best practices in storing and retrieving data for analytics
CO4 Enable students to leverage commodity infrastructure (such as scale-out clusters, distributed data-
stores, and the cloud) for data analytics.
Text Book(s)
T1 Kai Hwang, Geoffrey C. Fox, and Jack J. Dongarra. Distributed and Cloud Computing: From
Parallel Processing to the Internet of Things. Morgan Kaufmann.
Reference Book(s) & other resources
R1 Nikolas Roman Herbst, Samuel Kounev, and Ralf Reussner. Elasticity in Cloud Computing:
What It Is, and What It Is Not. 10th International Conference on Autonomic Computing
(ICAC '13). USENIX Association.
R2 Mohammed Alhamad, Tharam Dillon, and Elizabeth Chang. Conceptual SLA Framework
for Cloud Computing. 4th IEEE International Conference on Digital Ecosystems and
Technologies, April 2010, Dubai, UAE.
R3 Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System.
SOSP '03, October 19–22, 2003, Bolton Landing, New York, USA.
R4 Apache CouchDB. Technical Overview.
http://docs.couchdb.org/en/stable/intro/overview.html
R5 Apache CouchDB. Eventual Consistency.
http://docs.couchdb.org/en/stable/intro/consistency.html
R6 Seth Gilbert and Nancy A. Lynch. Perspectives on the CAP Theorem. IEEE Computer,
Vol. 45, Issue 2, Feb. 2012.
R7 Werner Vogels. Eventually Consistent. Communications of the ACM, Vol. 52, No. 1,
January 2009.
R8 Eric Brewer. CAP Twelve Years Later: How the "Rules" Have Changed. IEEE Computer,
Vol. 45, Issue 2, Feb. 2012.
R9 M. Burrows. The Chubby Lock Service for Loosely-Coupled Distributed Systems. In:
OSDI '06: Seventh Symposium on Operating System Design and Implementation,
USENIX, Seattle, WA, 2006, pp. 335–350.
R10 Matei Zaharia et al. Apache Spark: A Unified Engine for Big Data Processing.
Communications of the ACM, Vol. 59, No. 11, November 2016.
R11 Yaser Mansouri, Adel Nadjaran Toosi, and Rajkumar Buyya. Data Storage Management
in Cloud Environments: Taxonomy, Survey, and Future Directions. ACM Computing
Surveys, Vol. 50, No. 6, Article 91, December 2017.
R12 Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction to Parallel
Computing, Second Edition (2003), Addison-Wesley (at least Chapters 1, 2, 3 & 5).
R13 George Coulouris, Jean Dollimore, Tim Kindberg, and Gordon Blair. Distributed Systems:
Concepts and Design, Fifth Edition, Pearson (Chapters 1 & 2).
Modular Content Structure
# Topics
1 Introduction to Data Engineering
1.1 Systems Attributes for Data Analytics - Single System
Storage for Data: Structured Data (Relational Databases), Semi-structured Data (Object
Stores), Unstructured Data (file systems)
Processing: In-memory vs. (from) secondary storage vs. (over the) network
Storage Models and Cost: Memory Hierarchy, Access costs, I/O Costs (i.e. number of disk
blocks accessed);
Locality of Reference: Principle, examples
Impact of Latency: Algorithms and data structures that leverage locality, data organization
on disk for better locality
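Illustrative sketch (not prescribed material): the following minimal Python snippet contrasts row-wise and column-wise traversal of the same matrix to make the locality-of-reference and access-cost ideas in Module 1.1 concrete. The matrix size and timing approach are arbitrary choices for illustration; the row-wise loop touches contiguous memory and typically runs noticeably faster.

```python
# Illustrative sketch only: contrasts traversal orders to show locality of reference.
# NumPy stores arrays in row-major (C) order by default, so scanning along rows
# touches contiguous memory, while scanning down columns jumps across the array.
import time
import numpy as np

a = np.random.rand(4000, 4000)   # ~128 MB of doubles

def sum_row_major(m):
    total = 0.0
    for i in range(m.shape[0]):      # contiguous access: good locality
        total += m[i, :].sum()
    return total

def sum_column_major(m):
    total = 0.0
    for j in range(m.shape[1]):      # strided access: poor locality
        total += m[:, j].sum()
    return total

for fn in (sum_row_major, sum_column_major):
    start = time.perf_counter()
    fn(a)
    print(fn.__name__, round(time.perf_counter() - start, 3), "s")
```

The same effect at a larger scale is what motivates the "data organization on disk for better locality" item above: items accessed together should be stored together.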
1.2 Systems Attributes for Data Analytics - Parallel and Distributed Systems
Motivation for Parallel Processing (Size of data and complexity of processing)
Storing data in parallel and distributed systems: Shared Memory vs. Message Passing
Strategies for data access: Partition, Replication, and Messaging
Memory Hierarchy in Parallel Systems: Shared memory access and memory contention;
shared data access and mutual exclusion
Memory Hierarchy in Distributed Systems: In-node vs. over the network latencies, Locality,
Communication Cost
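Illustrative sketch (not prescribed material): the partitioning and message-passing strategies in Module 1.2 can be previewed with Python's standard multiprocessing module. Here the data is split into contiguous partitions, each worker process computes a partial result over its own partition with no shared memory, and the partial results return to the parent as messages (hidden behind Pool.map).

```python
# Sketch: data partitioning with per-partition processing in separate processes.
# Partial results travel back to the parent as inter-process messages.
from multiprocessing import Pool

def partial_sum(partition):
    # Each worker sees only its own partition: no shared memory, no contention.
    return sum(x * x for x in partition)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_parts = 4
    # Partition the data into roughly equal, contiguous chunks.
    chunk = (len(data) + n_parts - 1) // n_parts
    partitions = [data[i:i + chunk] for i in range(0, len(data), chunk)]

    with Pool(processes=n_parts) as pool:
        partials = pool.map(partial_sum, partitions)   # message passing under the hood

    print("sum of squares:", sum(partials))            # final combine on the parent
```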
2 Systems Architecture for Data Analytics
2.1 Introduction to Systems Architecture
Parallel Architectures and Programming Models: Flynn’s Taxonomy (SIMD, MISD, MIMD)
and Parallel Programming (SPMD, MPSD, MPMD)
Parallel Processing Models: {Data, Task, and Request}-Parallelism;
Mapping: Data Parallel - SPMD, Task Parallel - MPMD, Request Parallel - Services/Cloud
Client-Server vs. Peer-to-Peer models of distributed computing.
Parallel vs. Distributed Systems: Shared Memory vs. Distributed Memory (i.e. message
passing)
Motivation for distributed systems (large size, easy scalability, cost-benefit)
Cluster Computing: Components and Architecture.
2.2 Performance Attributes of Systems
Scalability - Speedup and Amdahl's Law;
How to apply Amdahl's Law? (Relation to the Gustafson-Barsis Law?)
(A worked example follows at the end of this module.)
Impact of Memory Hierarchy on Performance:
● Shared Memory and Memory Contention
● Communication Cost
● Locality
Reliability (for distributed systems): MTTF and MTTR, Serial vs. Parallel Connections,
Single Point-of-Failure
Building Reliable Systems: Redundancy and Resilience; Failure Models in Distributed
systems: Transient vs. Permanent Failures,
Failure Recovery: Fail-over, Active Fail-over, etc.
Process Migration
Availability: Calculating Availability; Service Agreements and SLAs
Elasticity: Resilient Performance and Scenarios; Calculating Elasticity; Achieving elasticity
(via resource provisioning and virtualization)
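Worked example (illustrative numbers only, not taken from the text): the speedup and availability attributes in Module 2.2 reduce to short formulas, applied below using Amdahl's Law, S(N) = 1 / ((1 - p) + p/N), and Availability = MTTF / (MTTF + MTTR).

```python
# Worked sketch with made-up numbers: Amdahl's Law and availability.

def amdahl_speedup(parallel_fraction, n_processors):
    # S(N) = 1 / ((1 - p) + p / N), where p is the parallelisable fraction.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_processors)

def availability(mttf_hours, mttr_hours):
    # Availability = MTTF / (MTTF + MTTR)
    return mttf_hours / (mttf_hours + mttr_hours)

# If 90% of a job is parallelisable, 16 processors give only ~6.4x speedup,
# and the speedup is bounded by 1 / (1 - 0.9) = 10x however many processors are added.
print("speedup on 16 CPUs:", round(amdahl_speedup(0.9, 16), 2))

# A node that fails on average every 1000 hours and takes 2 hours to repair:
print("availability:", round(availability(1000, 2), 4))   # ~0.998, i.e. ~99.8%
```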
3 Data Storage and Organization for Analytics
File systems vs. Database systems vs. Object Stores
Distributed File Systems - Basic architecture, Case Studies (GFS/HDFS)
Unstructured Databases - Basic architecture, Case Study and Examples (Google
BigTable, CouchDB / MongoDB)
Consistency Models - Weak and Strong Consistency, Eventual Consistency, CAP
Theorem - Result and Implications;
Synchronization: Chubby Locking as a case study.
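Illustrative sketch (a deliberately simplified toy, not how CouchDB, BigTable, or Chubby actually work): two in-memory "replicas" below accept reads independently, a write lands on one replica first, and the second replica converges only after a sync step, so a read issued in between can be stale. This is the behaviour that the weak/eventual consistency models above describe.

```python
# Toy sketch of eventual consistency with two in-memory "replicas".
# Writes go to one replica; a later sync step reconciles the other.

replica_a = {}   # accepts the write first
replica_b = {}   # lags until synced

def write(key, value):
    replica_a[key] = value          # acknowledged once one replica has it

def read_from(replica, key):
    return replica.get(key, None)   # may be stale on the lagging replica

def anti_entropy_sync():
    replica_b.update(replica_a)     # reconcile differences (last-writer-wins here)

write("user:42", {"name": "Asha"})
print(read_from(replica_a, "user:42"))   # fresh value
print(read_from(replica_b, "user:42"))   # None: stale read before the sync
anti_entropy_sync()
print(read_from(replica_b, "user:42"))   # now converged: eventual consistency
```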
4. Distributed Data Processing for Analytics
4.1 (Re-)Designing Algorithms for Distributed Systems
Design Strategy: Divide-and-conquer for Parallel / Distributed Systems - Basic scenarios
and Implications
Parallel Programming Pattern: Data-parallel programs, and map as a construct
Parallel Programming Pattern: Tree-parallelism, reduce as a construct
Map-reduce model: Examples (of map, reduce, map-reduce combinations, Iterative map-
reduce)
Batch processing vs. Online Processing; Streaming - Systems-level understanding (input-
output, memory model, constraints)
Master-Slave Processing: Implications for speedup and communication cost
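Illustrative sketch (a toy, single-machine driver rather than a real framework): the map and reduce constructs above can be seen in a minimal word-count example, where map emits (word, 1) pairs, a shuffle step groups them by key, and reduce sums each group.

```python
# Toy map-reduce word count: map emits (word, 1) pairs, shuffle groups by key,
# reduce sums the counts for each key. Real frameworks distribute these phases.
from collections import defaultdict
from functools import reduce

documents = ["big data systems", "data analytics systems", "big big data"]

def map_phase(doc):
    return [(word, 1) for word in doc.split()]

def shuffle(mapped_pairs):
    groups = defaultdict(list)
    for pairs in mapped_pairs:
        for word, count in pairs:
            groups[word].append(count)
    return groups

def reduce_phase(counts):
    return reduce(lambda x, y: x + y, counts)

mapped = [map_phase(d) for d in documents]            # map over input partitions
grouped = shuffle(mapped)                             # group intermediate keys
result = {word: reduce_phase(c) for word, c in grouped.items()}
print(result)   # {'big': 3, 'data': 3, 'systems': 2, 'analytics': 1}
```

Iterative map-reduce repeats this map, shuffle, and reduce cycle, feeding one round's output into the next round's input.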
4.2 Distributed Data Analytics
● Partitioning vs. Replication and Communication vs. Locality for Data Mining
algorithms like k-means, DBSCAN, Nearest Neighbor
● Using data structures (such as kd-trees) for partitioning
● Matrices and Locality - Row-major vs. Column major vs. Blocking in distributed
context
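Illustrative sketch (1-D data and made-up numbers, not a production implementation): one data-parallel k-means iteration over partitioned points, in which each partition computes local (sum, count) aggregates per centroid and only these small aggregates are combined, never the raw points. This is the partitioning-versus-communication trade-off listed above.

```python
# Sketch: one data-parallel k-means iteration over partitioned points (1-D for brevity).
# Each partition computes local (sum, count) per centroid; only these small
# aggregates are "communicated" to form new centroids, never the raw data.

def local_stats(partition, centroids):
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for x in partition:
        nearest = min(range(len(centroids)), key=lambda c: abs(x - centroids[c]))
        sums[nearest] += x
        counts[nearest] += 1
    return sums, counts

def kmeans_step(partitions, centroids):
    stats = [local_stats(p, centroids) for p in partitions]   # run in parallel in practice
    new_centroids = []
    for c in range(len(centroids)):
        total = sum(s[c] for s, _ in stats)
        count = sum(n[c] for _, n in stats)
        new_centroids.append(total / count if count else centroids[c])
    return new_centroids

partitions = [[1.0, 1.2, 0.8], [5.1, 4.9], [5.0, 1.1]]   # data already partitioned
print(kmeans_step(partitions, centroids=[0.0, 6.0]))     # moves towards [~1.03, ~5.0]
```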
Learning Outcomes:
No.  Learning Outcomes
L01 Ability to identify the right storage model to use given a dataset
L02 Ability to apply the appropriate parallel programming model to a given dataset
L03 Ability to identify and tune some common quality attributes of a distributed system
L04 Ability to choose the relevant consistency model for data stores based on application
L05 Ability to apply data mining algorithms like k-means clustering on appropriate dataset
L06 Ability to design and develop an n-tier data mining system in a cloud environment
Part B: Contact Session Plan
Academic Term 2019 Second Semester
Course Title Systems for Data Analytics
Course No DSE* ZG517
Lead Instructor Y. R. Sudhakar
Course Contents
Contact Session # (2 hours per session) | Topic # (from content structure in Part A) | List of Topic Titles (from content structure in Part A) | Reading / Reference

Contact Session 1 (Topic 1.1)
- Systems Attributes for Data Analytics - Single System [Class Slides]
- Storage for Data: Structured Data (Relational Databases), Semi-structured Data (Object Stores), Unstructured Data (file systems) [Class Slides]
- Processing: In-memory vs. (from) secondary storage vs. (over the) network [T1 Sec. 1.2.3]

Contact Session 2 (Topic 1.1)
- Storage Models and Cost: Memory Hierarchy, Access costs, I/O Costs (i.e. number of disk blocks accessed) [Class Slides]
- Locality of Reference: Principle, examples [Class Slides]
- Impact of Latency: Algorithms and data structures that leverage locality, data organization on disk for better locality [Class Slides]

Contact Sessions 3-4 (Topic 1.2)
- Systems Attributes for Data Analytics - Parallel and Distributed Systems [R12; Class Slides]
- Motivation for Parallel Processing (Size of data and complexity of processing) [R12; Class Slides]
- Storing data in parallel and distributed systems: Shared Memory vs. Message Passing [T1 Sec. 1.4.3; R12; Class Slides]
- Strategies for data access: Partition, Replication, and Messaging [R12; Class Slides]
- Memory Hierarchy in Parallel Systems: Shared memory access and memory contention; shared data access and mutual exclusion [R12; Class Slides]
- Memory Hierarchy in Distributed Systems: In-node vs. over-the-network latencies, Locality, Communication Cost [R12; Class Slides]

Contact Session 5 (Topic 2.1)
- Introduction to Systems Architecture
- Parallel Architectures and Programming Models: Flynn's Taxonomy (SIMD, MISD, MIMD) and Parallel Programming (SPMD, MPSD, MPMD) [T1 Sec. 1.4.3; R12; Class Slides]
- Parallel Processing Models: {Data, Task, and Request}-Parallelism; Mapping: Data Parallel - SPMD, Task Parallel - MPMD, Request Parallel - Services/Cloud [T1 Sec. 1.4.3; R12; R13; Class Slides]
- Client-Server vs. Peer-to-Peer models of distributed computing

Contact Session 6 (Topic 2.1)
- Parallel vs. Distributed Systems: Shared Memory vs. Distributed Memory (i.e. message passing) [T1 Sec. 1.4.3 and 2.1]
- Motivation for distributed systems (large size, easy scalability, cost-benefit) [R12; Class Slides]
- Cluster Computing: Components and Architecture [T1 Sec. 2.2.1 to 2.2.4, Sec. 2.3]

Contact Sessions 7-8 (Topic 2.2)
- Scalability - Speedup and Amdahl's Law; How to apply Amdahl's Law? (Relation to the Gustafson-Barsis Law) [T1 Sec. 1.5.1]
- Impact of Memory Hierarchy on Performance: Shared Memory and Memory Contention; Communication Cost; Locality [Additional Reading]
- Reliability (for distributed systems): MTTF and MTTR, Serial vs. Parallel Connections, Single Point-of-Failure [T1 Sec. 1.5.2 and 2.3.3]
- Building Reliable Systems: Redundancy and Resilience; Failure Models in Distributed Systems: Transient vs. Permanent Failures [T1 Sec. 1.5.2 and 2.3.3]
- Failure Recovery: Fail-over, Active Fail-over, etc.; Overview of Process Migration [T1 Sec. 1.5.2 and 2.3.3]

Contact Session 9 (Topic 2.2)
- Availability: Calculating Availability [T1 Sec. 1.5.2]
- Review of Topics for Mid-Semester Exam (~40 minutes)

Contact Sessions 10-12 (Topic 3.1)
- File systems vs. Database systems vs. Object Stores
- Distributed File Systems - Basic architecture, Case Studies (GFS/HDFS) [T1 Sec. 6.3.2; R3]
- Unstructured Databases - Basic architecture, Case Study and Examples (Google BigTable, CouchDB / MongoDB) [T1 Sec. 6.3.3]
- Overview of Consistency Models - Weak and Strong Consistency, Eventual Consistency, CAP Theorem - Result and Implications [R6, R7 & R8]
- [Additional content] Synchronization: Chubby Locking as a case study (supplementary video to be added; not to be covered in class) [R9]

Contact Session 13 (Topic 4.1)
- (Re-)Designing Algorithms for Distributed Systems
- Design Strategy: Divide-and-conquer for Parallel / Distributed Systems - Basic scenarios and Implications [Notes]
- Parallel Programming Pattern: Data-parallel programs, and map as a construct [T1 Sec. 6.2.1]
- Parallel Programming Pattern: Tree-parallelism, reduce as a construct [T1 Sec. 6.2.2]

Contact Sessions 14-15 (Topic 4.1)
- Map-reduce model: Examples (of map, reduce, map-reduce combinations, Iterative map-reduce) [T1 Sec. 6.2.2]
- Batch processing vs. Online Processing; Streaming - Systems-level understanding (input-output, memory model, constraints) [R10]
- Master-Slave Processing: Implications for speedup and communication cost [Notes]

Contact Session 16 (Topic 4.2)
- Parallelization of Data Mining algorithms like k-means, DBSCAN, Nearest Neighbor, and identifying locality issues [AR - Notes]
- Matrices and Locality - Row-major vs. Column-major vs. Blocking in distributed context
# The above contact hours and topics can be adapted for non-specific and specific WILP programs
depending on the requirements and class interests.
Select Topics for experiential learning [Tutorials]
Topic No. | Select Topics in Syllabus for Experiential Learning | Resources (need Weka or equivalent software)
1 Introduction to Cloud Computing (with AWS as an example) [Resources: Amazon student license]
2 Setting up a simple 3-tier application on the Cloud [Resources: Amazon student license]
3 Programming exercises on map-reduce [Resources: Cloud Infra. Lab in Hyd.]
4 Synchronization exercise on CouchDB [Resources: Cloud Infra. Lab or Amazon student license]
5 Pen-and-paper exercise on Locality, Memory Contention, and Communication Requirement
6 Pen-and-paper exercise on calculations of speedup, MTTF, and MTTR
Evaluation Scheme
Legend: EC = Evaluation Component
No    Name                 Type         Duration   Weight (%)   Day, Date, Session, Time
EC-1  Assignment-I         Take Home    --         12           To be announced
      Quiz-II              Take Home    --         5            To be announced
      Assignment-II        Take Home    --         13           To be announced
EC-2  Mid-Semester Test    Closed Book  90 Min     30           To be announced
EC-3  Comprehensive Exam   Open Book    120 Min    40           To be announced
Note - Evaluation components can be tailored depending on the proposed model.
Important Information
Syllabus for Mid-Semester Test (Closed Book): Topics in Weeks 1-7
Syllabus for Comprehensive Exam (Open Book): All topics given in plan of study
Evaluation Guidelines:
1. EC-1 consists of two Assignments and a Quiz. Announcements regarding the same will be made in a
timely manner.
2. For Closed Book tests: No books or reference material of any kind will be permitted.
Laptops/Mobiles of any kind are not allowed. Exchange of any material is not allowed.
3. For Open Book exams: Use of prescribed and reference textbooks, in original (not photocopies), is
permitted. Class notes/slides as reference material in filed or bound form are permitted; however,
loose sheets of paper will not be allowed. Use of calculators is permitted in all exams.
Laptops/mobiles of any kind are not allowed. Exchange of any material is not allowed.
4. If a student is unable to appear for the Regular Test/Exam due to genuine exigencies, the student
should follow the procedure to apply for the Make-Up Test/Exam. The genuineness of the reason for
absence in the Regular Exam shall be assessed prior to giving permission to appear for the Make-up
Exam. Make-Up Test/Exam will be conducted only at selected exam centres on the dates to be
announced later.
It shall be the responsibility of the individual student to be regular in maintaining the self-study schedule as
given in the course handout, attend the lectures, and take all the prescribed evaluation components such as
Assignment/Quiz, Mid-Semester Test and Comprehensive Exam according to the evaluation scheme
provided in the handout.