BIG DATA
CS-585
Unit-1: Lecture 3
Contents
      •   Structure of Big Data
      •   Big Data Processes
      •   Big Data Framework
      •   Big Data Platform and Applications Framework
      •   Examples of Big Data Platform in Practice
      •   Big Data Platform Manifesto
      •   Big Data Technologies
      •   Big Data Tools
      •   Big Data Analytics and Techniques
      •   Big Data Use Case
    Big Data Processes / Life Cycle
● Problem: The sale of chewing gum is going down
● Acquisition
     –   Sales by customer, region and time
     –   Surveys of users
     –   Social networks
● Extraction
     –   Data loading from receipts
     –   Automatic reading of questionnaires
     –   Data extraction from Twitter
● Integration
     –   Based on user types
● Analysis
     –   Chewing gum bought by people older than 25
     –   Chewing gum preferred by people younger than a given age
● Interpretation
     –   Moms believe: chewing gum = bad teeth
     –   Boys and girls believe that chewing gum is for babies
● Decisions
     –   We make chewing gum without sugar
     –   We ask dentists to advertise our chewing gum as a breath refresher
     –   We make commercials targeted at boys and girls.
[Figure: the life cycle runs as a loop: Acquisition → Extraction → Integration → Analysis → Interpretation → Decision. A minimal pipeline sketch of these stages follows below.]
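To make the stages above concrete, here is a minimal, self-contained Python sketch of the same life cycle on invented toy data. Every record, field name and threshold in it is hypothetical and exists only to show how each stage feeds the next; a real pipeline would pull from point-of-sale systems, survey tools and social media APIs instead.

# Minimal, illustrative sketch of the six life-cycle stages on toy data (Python).
# Every record, field name and threshold here is hypothetical, not from the lecture.

# Acquisition: gather raw records from several sources.
receipts = [
    {"customer_age": 31, "region": "north", "units": 2},
    {"customer_age": 17, "region": "south", "units": 0},
    {"customer_age": 12, "region": "south", "units": 1},
]
survey_rows = ["age=40;opinion=bad for teeth", "age=13;opinion=for babies"]

# Extraction: parse each source into a common record shape.
def parse_survey(row):
    fields = dict(part.split("=") for part in row.split(";"))
    return {"customer_age": int(fields["age"]), "opinion": fields["opinion"]}

surveys = [parse_survey(r) for r in survey_rows]

# Integration: merge the sources, keyed on a shared attribute (age group).
def age_group(age):
    return "under_25" if age < 25 else "25_plus"

integrated = [{"group": age_group(r["customer_age"]), **r} for r in receipts + surveys]

# Analysis: aggregate purchases per age group.
units_by_group = {}
for rec in integrated:
    units_by_group[rec["group"]] = units_by_group.get(rec["group"], 0) + rec.get("units", 0)

# Interpretation: attach meaning to the numbers (done by the analyst).
print("units sold by age group:", units_by_group)

# Decision: act on the interpretation, e.g. launch a sugar-free product.
if units_by_group.get("under_25", 0) < units_by_group.get("25_plus", 0):
    print("decision: target younger buyers with a sugar-free, fresh-breath campaign")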
Big Data Framework
Big Data Platform Manifesto
Big Data Technologies
               •   Hadoop/HDFS (2007)
                      •   A framework based on the principles given by MapReduce and Bigtable.
                          It follows the principle of distributed computing, where the data is
                          distributed, managed and stored on different systems known as nodes
                          (HDFS: Hadoop Distributed File System). It was first used by Yahoo!
                          to support the storage of structured, unstructured and
                          semi-structured data.
                      •   Designed to parallelize data processing across computing nodes to
                          speed up computation and hide latency. (Created by Doug Cutting.)
               •   MapReduce (2007)
                      •   Designed to process large amounts of data in batch mode. It follows
                          the distributed computing model, where each task is mapped to many
                          systems for processing in a way that handles recovery from failure
                          and load balancing. The model was developed by Google. (A word-count
                          sketch of the map and reduce phases is shown after this list.)
                      •   The reduce operation aggregates the results. MapReduce was mainly
                          designed to work with HDFS, but it now supports other data stores as
                          well, such as Cassandra and HBase.
               •   Bigtable
                      •   Data storage was solved with the help of Bigtable, a distributed
                          storage system for managing vast quantities of highly scalable
                          structured data.
                      •   It is like a multidimensional sorted map; the captured data is stored
                          on different nodes across the system, unlike traditional databases
                          where data is organized in rows and columns.
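To illustrate the map and reduce phases described above, the following is a minimal, single-process Python sketch of the classic word count. The shuffle step is simulated in memory with a dictionary, so this mirrors only the programming model, not a real Hadoop deployment; the toy input lines and function names are made up for the example.

from collections import defaultdict

# Map phase: each mapper turns one input line into (word, 1) pairs.
def mapper(line):
    for word in line.lower().split():
        yield word, 1

# Shuffle phase (simulated here in memory): group all values by key across mapper outputs.
def shuffle(mapped_pairs):
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: each reducer aggregates the values for one key.
def reducer(word, counts):
    return word, sum(counts)

# Toy "HDFS blocks": in a real cluster these lines would live on different nodes.
lines = ["big data needs big tools", "map reduce processes big data"]

mapped = [pair for line in lines for pair in mapper(line)]
results = dict(reducer(word, counts) for word, counts in shuffle(mapped).items())
print(results)  # {'big': 3, 'data': 2, 'needs': 1, ...}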
                         Big Data Technologies
•Cassandra - 2008 - A key-value pair NoSQL database, with column family data representation and asynchronous
masterless replication.
•HBase - 2008 - A key-value pair NoSQL database, with column family data representation, with master-slave
replication. It uses HDFS as underlying storage.
•Zookeeper - 2008 - A distributed coordination service for distributed applications. It is based on a Paxos algorithm
variant called Zab.
•Pig - 2009 - Pig is a scripting interface over MapReduce for developers who prefer a scripting interface to native
Java MapReduce programming.
•Hive - 2009 - Hive is a SQL interface over MapReduce for developers and analysts who prefer a SQL interface to
native Java MapReduce programming.
•Mahout - 2009 - A library of machine learning algorithms, implemented on top of MapReduce, for finding
meaningful patterns in HDFS datasets.
•Sqoop - 2010 - A tool to import data from an RDBMS/data warehouse into HDFS/HBase and export it back.
•YARN - 2011 - A system to schedule applications and services on an HDFS cluster and manage the cluster resources
like memory and CPU.
•Flume - 2011 - A tool to collect, aggregate, reliably move and ingest large amounts of data into HDFS.
•Storm - 2011 - A system to process high-velocity streaming data with 'at least once' message semantics.
•Spark - 2012 - An in-memory data processing engine that can run a DAG of operations. It provides libraries for
machine learning, a SQL interface and near real-time stream processing.
•Kafka - 2012 - A distributed messaging system with partitioned topics for very high scalability.
•SolrCloud - 2012 - A distributed search engine with a REST-like interface for full-text search. It uses the Lucene
library for data indexing.
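To contrast Spark's in-memory DAG style with classic batch MapReduce, here is a short PySpark sketch of the same word count. It assumes a local PySpark installation (for example via pip install pyspark); the input is a small in-memory RDD rather than an HDFS file, purely for illustration.

from pyspark.sql import SparkSession

# Start a local Spark session; on a cluster this would run under YARN or another resource manager.
spark = SparkSession.builder.master("local[*]").appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

# Small in-memory RDD instead of an HDFS file, to keep the sketch self-contained.
lines = sc.parallelize(["big data needs big tools", "spark keeps intermediate data in memory"])

# The chained transformations form a DAG; nothing executes until an action (collect) is called.
counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum counts per word

print(dict(counts.collect()))  # e.g. {'big': 2, 'data': 2, ...}
spark.stop()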
Big Data Tools
           •   Databases
                 •   MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase,
                     Hypertable, Voldemort, Riak, ZooKeeper
           •   MapReduce
                 •   Hadoop, Hive, Pig, Cascading, Cascalog, mrjob,
                     Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban,
                     Oozie, Greenplum
           •   Storage
                 •   S3, HDFS
           •   Servers
                 •   EC2, Google App Engine, Elastic Beanstalk, Heroku
           •   Processing
                 •   R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene,
                     Elasticsearch, Datameer, BigSheets, TinkerPop
Big Data Tools and Technologies (Graphical View)
Apache Hadoop Ecosystem (updated view)
How Uber solved its data problem
Big Data Success Stories
  Healthcare      • Epidemic early warning
                  • ICU and remote monitoring
                  • USD 150,000 reduction in the cost of unnecessary neonatal surgery
  Transportation  • Fleet risk advisors helping truck operators by building stronger and
                    faster risk prediction models: 80% reduction in serious accidents, 20%
                    reduction in minor accidents, 30% improvement in driver retention rates.
  Life Insurance  • In Japan, claim processing has been made faster: 22% fewer mistakenly
                    unpaid claims, 90% accuracy in coding medical terms, 20% reduction in
                    the assessment workforce.
  IT Company      • IBM's Big Data business grew over 150% in 2014. IBM joined Apple and
                    Twitter in strategic partnerships.
                        References
•   IBM ICE course Material
•   http://blog.newtechways.com/2017/10/apache-hadoop-
    ecosystem.html
•   https://www.edureka.co/blog/what-is-big-data/
•   https://bigdataanalyticsnews.com/
•   https://data-flair.training/blogs/hadoop-tutorial/
•   https://intellipaat.com/blog/tutorial/hadoop-tutorial/
•   https://nptel.ac.in/courses/106/104/106104189/
Thank You
Wish you a prosperous career with Big Data