SNSCT/IQAC/CLT/1.1 (Ver 2)
SNS COLLEGE OF TECHNOLOGY
(An Autonomous Institution)
Approved by AICTE, New Delhi, Affiliated to Anna University, Chennai
Accredited by NAAC-UGC with ‘A++’ Grade (Cycle III),
Accredited by NBA (B.E - CSE, EEE, ECE, Mech, B.Tech. IT)
COIMBATORE-641 035, TAMIL NADU
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Faculty Name : Ms. Lavanya M, AP/CSE Academic Year : 2025-2026 (Odd)
Year, Branch : IV CSE Semester : VII
Course : 19ITE305 - Big Data & Analytics
UNIT – II
Unit II: Introduction to Technology Landscape - Big Data Analytics
These notes cover all topics in Unit II: Introduction to Technology Landscape of the course
“Big Data Analytics” (19ITE305): Hadoop Ecosystem, NoSQL Databases, In-Memory
Databases, Analytical Tools, Stream Processing, Machine Learning, Cloud Computing, and
Data Visualization.
1. Hadoop Ecosystem
Overview
Hadoop is an open-source framework developed by Apache for distributed storage and
processing of massive datasets using commodity hardware. It is designed to handle big
data challenges like scalability, fault tolerance, and cost-effectiveness.
Core Components
• Hadoop Distributed File System (HDFS):
– Function: Stores large datasets across multiple nodes in a distributed
manner.
– Architecture: Consists of a NameNode (manages metadata and file system
namespace) and DataNodes (store actual data blocks).
– Features:
• Data Replication: Stores multiple copies of data (default: 3 replicas)
to ensure fault tolerance.
• Block Storage: Splits files into fixed-size blocks (default: 128 MB or
256 MB) for efficient storage and retrieval.
• Scalability: Scales horizontally by adding more nodes to the cluster.
– Use Case: Storing petabytes of data for applications like log analysis or data
warehousing.
• MapReduce:
– Function: A programming model for parallel processing of large datasets
across a Hadoop cluster.
– Process:
1. Map Phase: Breaks down input data into key-value pairs and
processes them in parallel.
2. Reduce Phase: Aggregates the output from the Map phase to
produce final results.
– Features:
• Fault Tolerance: Automatically retries failed tasks.
• Scalability: Processes data across thousands of nodes.
• Example: Word count in a large text dataset, where Map counts
words per node and Reduce aggregates the counts (a small Python simulation follows this component list).
• YARN (Yet Another Resource Negotiator):
– Function: Manages and schedules resources (CPU, memory) for applications
running on the Hadoop cluster.
– Components:
• ResourceManager: Central authority that allocates resources and
schedules tasks.
• NodeManager: Runs on each node, managing local resources and
executing tasks.
– Features:
• Dynamic Resource Allocation: Allocates resources based on
application needs.
• Scalability: Supports thousands of nodes and applications
concurrently.
– Use Case: Running multiple applications (e.g., Hive, Pig) on the same
Hadoop cluster.
• Hadoop Common: Utilities and libraries supporting other Hadoop modules.
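The word-count example above can be sketched in plain Python to show how the Map and Reduce phases divide the work. This is a minimal simulation of the flow, not a Hadoop job: the shuffle step that the framework performs between the two phases is written out explicitly, and the input lines are illustrative.
```python
# A pure-Python simulation of the MapReduce word-count flow:
# map emits (word, 1) pairs, a shuffle groups pairs by key,
# and reduce sums the counts for each word.
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the Hadoop framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Sum all counts emitted for a single word."""
    return word, sum(counts)

if __name__ == "__main__":
    lines = ["big data needs big clusters", "hadoop processes big data"]
    mapped = [pair for line in lines for pair in map_phase(line)]
    results = [reduce_phase(w, c) for w, c in shuffle(mapped).items()]
    print(sorted(results))  # [('big', 3), ('clusters', 1), ('data', 2), ...]
```
In a real cluster, logic like this runs in parallel on many DataNodes, with YARN scheduling the map and reduce tasks.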
Key Features
• Scalability: Handles petabytes of data by adding more nodes.
• Fault Tolerance: Ensures reliability through data replication and task retry
mechanisms.
• Cost-Effectiveness: Uses commodity hardware, reducing infrastructure costs.
• Flexibility: Supports various data types (structured, unstructured) and processing
models.
Hadoop Ecosystem Components
• Hive: SQL-like interface for querying data stored in HDFS.
• Pig: High-level scripting language (Pig Latin) for data processing.
• HBase: Distributed, scalable NoSQL database for random, real-time read/write
access.
• Oozie: Workflow scheduler for managing Hadoop jobs.
• Sqoop: Tool for transferring data between Hadoop and relational databases.
• Flume: Service for collecting and moving large amounts of log data into HDFS.
Use Cases
• Log processing for web analytics.
• Data warehousing for business intelligence.
• Large-scale data analysis for recommendation systems.
2. NoSQL Databases
Definition
NoSQL (Not Only SQL) databases are non-relational databases designed to handle large
volumes of unstructured, semi-structured, or structured data. They prioritize scalability
and flexibility over traditional relational database management systems (RDBMS).
Comparison with SQL
• SQL (RDBMS):
– Fixed schema, structured data.
– Uses SQL for querying.
– Vertical scaling (adding more CPU/memory to a single server).
– Examples: MySQL, PostgreSQL, Oracle.
• NoSQL:
– Schema-less or flexible schema.
– Supports various data models (key-value, document, column-family, graph).
– Horizontal scaling (adding more servers).
– Examples: MongoDB, Cassandra, Redis, Neo4j.
Types of NoSQL Databases
• Key-Value Stores:
– Description: Simplest NoSQL model, storing data as key-value pairs.
– Examples: Redis, DynamoDB.
– Use Case: Caching, session management.
– Features: High performance, low latency, simple querying.
• Document Stores:
– Description: Stores data as JSON, BSON, or XML documents.
– Examples: MongoDB, CouchDB.
– Use Case: Content management, real-time analytics (a pymongo sketch follows this list).
– Features: Flexible schema, hierarchical data storage.
• Column-Family Stores:
– Description: Organizes data by columns grouped into column families rather than by fixed rows, optimized for write-heavy and large-scale analytical workloads.
– Examples: Cassandra, HBase.
– Use Case: Time-series data, large-scale analytics.
– Features: High write throughput, scalable for large datasets.
• Graph Databases:
– Description: Stores data as nodes and edges for relationship-focused
queries.
– Examples: Neo4j, ArangoDB.
– Use Case: Social networks, fraud detection.
– Features: Efficient traversal of complex relationships.
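A minimal document-store sketch with pymongo illustrates the flexible schema mentioned for MongoDB above; it assumes a MongoDB server on localhost:27017, and the database, collection, and field names are illustrative.
```python
# Minimal document-store sketch using pymongo (assumes a MongoDB server
# on localhost:27017; database, collection, and field names are illustrative).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
articles = client["demo_cms"]["articles"]   # schema-less collection

# Documents in the same collection can have different fields (flexible schema).
articles.insert_one({"title": "Big Data 101", "tags": ["hadoop", "nosql"]})
articles.insert_one({"title": "Graph Databases", "author": "A. Kumar", "views": 1250})

# Query by any field without a predefined schema.
for doc in articles.find({"tags": "hadoop"}):
    print(doc["title"])
```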
Advantages
• Scalability: Horizontal scaling across distributed systems.
• Flexibility: Handles diverse data types without predefined schemas.
• High Performance: Optimized for specific workloads (e.g., high write throughput in
Cassandra).
• Distributed Architecture: Supports big data applications with high availability.
Challenges
• Lack of Standardization: No universal query language like SQL.
• Complex Querying: Limited support for complex joins compared to RDBMS.
• Consistency Trade-offs: Many NoSQL databases prioritize availability over
consistency (CAP theorem).
3. In-Memory Databases
Definition
In-memory databases store data in the main memory (RAM) rather than on disk, enabling
faster data access and processing compared to traditional disk-based databases.
Examples
• Redis: Key-value store used for caching and real-time analytics.
• SAP HANA: Enterprise-grade in-memory database for analytics and applications.
• Memcached: Distributed memory caching system.
Key Features
• Speed: Extremely low latency for read/write operations due to in-memory storage.
• Volatility: Data is lost unless persisted to disk (some systems offer persistence
options).
• Scalability: Supports distributed architectures for large-scale deployments.
• Use Cases:
– Real-time analytics (e.g., fraud detection).
– Caching for web applications.
– Session management in e-commerce platforms.
Advantages
• High Throughput: Ideal for big data applications requiring rapid data access.
• Real-Time Processing: Supports low-latency analytics and transactions.
• Simplified Architecture: Reduces I/O bottlenecks associated with disk-based
storage.
Challenges
• Limited Storage Capacity: RAM is more expensive and has lower capacity than disk
storage.
• Data Durability: Requires mechanisms (e.g., snapshots, replication) to prevent data
loss.
• Cost: Higher infrastructure costs due to reliance on RAM.
Use in Big Data
• Used in conjunction with Hadoop or NoSQL databases to cache frequently accessed
data.
• Enables real-time analytics for streaming data or time-sensitive applications.
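The caching use described above can be sketched with the redis-py client as a cache-aside pattern; it assumes a Redis server on localhost:6379, and slow_query is a hypothetical stand-in for a disk-based data source.
```python
# Cache-aside sketch with redis-py (assumes a Redis server on localhost:6379;
# slow_query is a hypothetical stand-in for an expensive disk/database lookup).
import time
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def slow_query(user_id):
    """Placeholder for an expensive disk-based lookup."""
    time.sleep(1)
    return f"profile-for-{user_id}"

def get_profile(user_id):
    key = f"profile:{user_id}"
    value = cache.get(key)            # fast in-memory lookup first
    if value is None:
        value = slow_query(user_id)   # fall back to the slow store
        cache.setex(key, 300, value)  # keep in RAM for 5 minutes (volatile data)
    return value

print(get_profile(42))  # slow on the first call, served from RAM afterwards
```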
4. Analytical Tools
Overview
Analytical tools process, analyze, and visualize big data to derive actionable insights. These
tools integrate with Hadoop, NoSQL databases, and cloud platforms to handle large-scale
data analytics.
Popular Analytical Tools
• Apache Spark:
– Description: In-memory data processing engine for batch and streaming
data.
– Features:
• Faster than MapReduce due to in-memory computation.
• Supports SQL (Spark SQL), machine learning (MLlib), and graph
processing (GraphX).
– Use Case: Real-time analytics, ETL (Extract, Transform, Load) pipelines (a PySpark sketch follows this list).
• Tableau:
– Description: Data visualization tool for creating interactive dashboards.
– Features:
• Drag-and-drop interface for non-technical users.
• Integrates with Hadoop, cloud platforms, and databases.
– Use Case: Business intelligence, sales reporting.
• Power BI:
– Description: Microsoft’s analytics platform for interactive visualizations.
– Features:
• Seamless integration with Azure and SQL Server.
• AI-powered insights and natural language querying.
– Use Case: Enterprise reporting, data exploration.
• R and Python:
– Description: Programming languages with libraries for data analysis and
visualization.
– Libraries:
• R: ggplot2, dplyr for data manipulation and visualization.
• Python: Pandas, NumPy, Matplotlib, Seaborn for analytics and
plotting.
– Use Case: Statistical analysis, predictive modeling.
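A minimal PySpark sketch shows the DataFrame API and Spark SQL interfaces mentioned above producing the same aggregation; it assumes a local Spark installation, and the data and column names are illustrative.
```python
# Minimal PySpark sketch: one aggregation written with the DataFrame API
# and again with Spark SQL (assumes a local Spark installation; data is illustrative).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unit2-demo").getOrCreate()

sales = spark.createDataFrame(
    [("north", 120.0), ("south", 80.0), ("north", 95.0)],
    ["region", "amount"],
)

# DataFrame API: distributed, in-memory aggregation.
sales.groupBy("region").agg(F.sum("amount").alias("total")).show()

# Spark SQL: identical result through a SQL interface.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

spark.stop()
```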
Key Features
• Scalability: Handles large datasets with distributed computing support.
• Integration: Connects with Hadoop, NoSQL databases, and cloud platforms.
• User-Friendly: Visual interfaces for non-technical users.
• Automation: Supports automated insights through AI/ML integration.
Applications
• Predictive Analytics: Forecasting sales or customer behavior.
• Customer Segmentation: Grouping users based on behavior or demographics.
• Trend Analysis: Identifying patterns in time-series data.
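As a small illustration of the trend-analysis item above, a pandas rolling mean can smooth a noisy daily time series; the data is synthetic and the variable names are illustrative.
```python
# Trend-analysis sketch with pandas: a 7-day rolling mean smooths a noisy
# daily series so the underlying upward trend is visible (synthetic data).
import numpy as np
import pandas as pd

days = pd.date_range("2025-01-01", periods=90, freq="D")
sales = pd.Series(100 + np.arange(90) * 0.5 + np.random.randn(90) * 5, index=days)

trend = sales.rolling(window=7).mean()  # 7-day moving average
print(trend.tail())                     # smoothed values reveal the trend
```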
5. Stream Processing
Definition
Stream processing involves real-time analysis of continuous data streams, enabling
immediate insights and actions.
Technologies
• Apache Kafka:
– Description: Distributed streaming platform for handling high-throughput
data streams.
– Features:
• Publishes and subscribes to data streams.
• Fault-tolerant and scalable with partitioned logs.
– Use Case: Real-time event streaming for IoT or social media (a producer/consumer sketch follows this list).
• Apache Flink:
– Description: Stream processing framework for low-latency, high-throughput
analytics.
– Features:
• Exactly-once processing semantics.
• Supports both batch and stream processing.
– Use Case: Real-time fraud detection, log analytics.
• Apache Storm:
– Description: Real-time computation system for unbounded data streams.
– Features:
• Processes data in real-time with low latency.
• Integrates with Hadoop and Kafka.
– Use Case: Real-time monitoring, clickstream analysis.
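A minimal publish/subscribe sketch with the kafka-python client illustrates the Kafka producer and consumer roles referred to above; it assumes a Kafka broker on localhost:9092, and the topic name and payload are illustrative.
```python
# Publish/subscribe sketch with the kafka-python client (assumes a Kafka
# broker on localhost:9092; topic name and payload are illustrative).
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish sensor readings as JSON onto a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("iot-readings", {"sensor": "temp-01", "value": 72.4})
producer.flush()

# Consumer: read the stream and react to each event as it arrives
# (this loop blocks, waiting for new messages).
consumer = KafkaConsumer(
    "iot-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # e.g. {'sensor': 'temp-01', 'value': 72.4}
```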
Key Features
• Low Latency: Processes data as it arrives, minimizing delays.
• Scalability: Handles high-velocity data across distributed systems.
• Fault Tolerance: Ensures reliability through replication and recovery mechanisms.
• Event-Driven: Processes data based on events (e.g., sensor data, user actions).
Use Cases
• Fraud Detection: Real-time analysis of transactions to identify anomalies.
• IoT Data Processing: Monitoring sensor data for predictive maintenance.
• Social Media Analytics: Tracking trends and sentiments in real-time.
6. Machine Learning
Overview
Machine learning (ML) involves algorithms and models that learn from data to make
predictions or decisions without explicit programming.
Big Data Integration
• Frameworks:
– TensorFlow: Open-source library for building and training ML models.
– PyTorch: Flexible framework for deep learning and research.
– Scikit-learn: Python library for traditional ML algorithms (e.g., regression,
clustering).
– Spark MLlib: Distributed ML library for large-scale machine learning.
• Process:
– Data Preparation: Cleaning and transforming big data for model training.
– Model Training: Using distributed computing (e.g., Spark) for large datasets.
– Deployment: Integrating models into production systems for real-time
predictions.
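A compact scikit-learn sketch walks through the same three steps (data preparation, model training, and a prediction step standing in for deployment); synthetic data replaces a real big-data source, and the feature and parameter choices are illustrative.
```python
# End-to-end ML sketch with scikit-learn: prepare data, train a model,
# then score unseen records (synthetic data stands in for a big dataset).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data preparation: synthetic features/labels, split into train and test sets.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Model training: scaling + classifier combined in one pipeline.
model = Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(X_train, y_train)

# Deployment stand-in: score new, unseen records.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```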
Key Applications
• Predictive Analytics: Forecasting trends (e.g., sales, stock prices).
• Natural Language Processing: Sentiment analysis, chatbots, text summarization.
• Recommendation Systems: Personalized suggestions (e.g., Netflix, Amazon).
• Anomaly Detection: Identifying fraud or network intrusions.
Challenges
• Data Quality: Requires clean, structured data for effective training.
• Scalability: Training models on massive datasets requires distributed computing.
• Interpretability: Complex models (e.g., deep learning) may lack transparency.
Big Data Use Cases
• Training recommendation models on large user datasets.
• Real-time fraud detection using streaming data.
• Predictive maintenance in IoT using sensor data.
7. Cloud Computing
Definition
Cloud computing delivers computing services (storage, processing, analytics) over the
internet, providing scalable and flexible infrastructure for big data applications.
Big Data Relevance
• Platforms:
– AWS: Offers services like S3 (data storage), EMR (managed Hadoop),
Redshift (data warehousing).
– Google Cloud: Provides BigQuery (serverless data warehouse), Dataflow
(stream processing).
– Microsoft Azure: Includes Azure Data Lake, Synapse Analytics, and
HDInsight.
• Services:
– Data Lakes: Centralized repositories for raw, unstructured data (e.g., AWS
S3).
– Managed Hadoop Clusters: Simplifies Hadoop deployment (e.g., AWS EMR).
– Serverless Computing: Executes code without managing servers (e.g., AWS
Lambda).
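A small boto3 sketch shows how raw data might be landed in and listed from an S3 data lake; it assumes AWS credentials are already configured, and the bucket name, local file, and object keys are hypothetical.
```python
# Data-lake sketch with boto3: land a raw file in S3 and list what is stored
# (assumes configured AWS credentials; bucket name and keys are hypothetical).
import boto3

s3 = boto3.client("s3")
bucket = "my-company-data-lake"  # hypothetical bucket

# Upload raw, unstructured data into the lake.
s3.upload_file("clickstream-2025-01-01.json", bucket,
               "raw/clickstream/2025/01/01/part-0.json")

# List objects under a prefix, e.g. before launching an EMR or Spark job.
response = s3.list_objects_v2(Bucket=bucket, Prefix="raw/clickstream/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```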
Advantages
• Scalability: Elastic resources to handle varying workloads.
• Cost Efficiency: Pay-as-you-go pricing reduces upfront costs.
• Accessibility: Global access to data and tools via the internet.
• Integration: Seamless integration with analytical tools and ML frameworks.
Challenges
• Data Security: Protecting sensitive data in the cloud.
• Vendor Lock-In: Dependency on specific cloud providers.
• Compliance: Adhering to regulations like GDPR or HIPAA.
Use Cases
• Hosting Hadoop clusters for large-scale data processing.
• Storing and analyzing IoT data in real-time.
• Running ML models on cloud-based infrastructure.
8. Data Visualization
Definition
Data visualization is the graphical representation of data to identify patterns, trends, and
insights, making complex data more accessible and understandable.
Tools
• Tableau:
– Description: Creates interactive dashboards and visualizations.
– Features:
• Drag-and-drop interface for non-technical users.
• Integrates with Hadoop, NoSQL, and cloud platforms.
– Use Case: Business intelligence, sales forecasting.
• Power BI:
– Description: Microsoft’s platform for interactive data visualization.
– Features:
• AI-powered insights and natural language querying.
• Seamless integration with Azure and SQL Server.
– Use Case: Enterprise reporting, real-time monitoring.
• D3.js:
– Description: JavaScript library for custom, web-based visualizations.
– Features:
• Highly customizable for complex visualizations.
• Supports dynamic, interactive charts.
– Use Case: Custom dashboards, data-driven journalism.
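Alongside the BI tools above, visualizations can also be scripted with Matplotlib (listed earlier under analytical tools); a minimal sketch with illustrative data follows.
```python
# Scripted-visualization sketch with Matplotlib: a revenue trend line and a
# category bar chart on one figure (all values are illustrative).
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [12.4, 13.1, 12.8, 14.6, 15.2, 16.0]              # illustrative KPIs
by_region = {"North": 28, "South": 19, "East": 24, "West": 17}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(months, revenue, marker="o")
ax1.set_title("Monthly revenue (trend)")
ax1.set_ylabel("Revenue (lakh)")

ax2.bar(list(by_region.keys()), list(by_region.values()))
ax2.set_title("Revenue share by region")

fig.tight_layout()
fig.savefig("dashboard.png")  # export for a report or dashboard
```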
Key Features
• Interactivity: Allows users to explore data through filters and drill-downs.
• Scalability: Handles large datasets with cloud and distributed computing support.
• Customization: Tailors visuals to specific business needs.
• Real-Time Visualization: Displays streaming data for immediate insights.
Applications
• Business Reporting: Visualizing sales, revenue, and KPIs.
• Real-Time Monitoring: Tracking system performance or user activity.
• Decision-Making Support: Providing insights for strategic planning.