Hadoop Ecosystem Frameworks:
Applications on Big Data
1. Pig
Introduction to Pig
• High-level platform for processing large data sets.
• Originally developed at Yahoo, now an Apache project; runs on Hadoop.
Execution Modes
• Local Mode: Runs on a single machine.
• MapReduce Mode: Distributes processing across a Hadoop cluster.
Comparison with Databases
• Schemas are optional and applied when data is loaded; a relational database requires a schema defined in advance.
• Suitable for semi-structured data.
Grunt
• Interactive shell for Pig.
• Supports command execution and script testing.
Pig Latin
• Data flow language.
• Includes constructs like LOAD, FILTER, FOREACH, JOIN.
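The data-flow style can be illustrated in plain Python (this is not Pig Latin itself). A minimal sketch: the sample records and the helper names load, filter_rows, and foreach are hypothetical stand-ins for the LOAD, FILTER, and FOREACH constructs, and each step consumes the relation produced by the previous one.

```python
import csv
import io

# Hypothetical sample input standing in for a file in HDFS.
RAW = "alice,34\nbob,17\ncarol,52\n"

def load(text, fields):
    """Rough analogue of LOAD ... AS (fields): parse rows into dicts."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(fields, row)) for row in reader]

def filter_rows(rows, predicate):
    """Rough analogue of FILTER ... BY predicate."""
    return [r for r in rows if predicate(r)]

def foreach(rows, generate):
    """Rough analogue of FOREACH ... GENERATE expression."""
    return [generate(r) for r in rows]

# Each statement names an intermediate relation, as a Pig Latin script does.
people = load(RAW, ["name", "age"])
adults = filter_rows(people, lambda r: int(r["age"]) >= 18)
names = foreach(adults, lambda r: r["name"].upper())
print(names)
```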
User Defined Functions (UDFs)
• Extend Pig’s capabilities using Java, Python, or other languages.
Data Processing Operators
• Filtering (FILTER), grouping (GROUP), joining (JOIN), sorting (ORDER BY), etc.
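What the grouping, joining, and sorting operators compute can be sketched in plain Python. The two sample relations (orders and users) are hypothetical, and the comments name the Pig operator each step loosely corresponds to; a real Pig job would run these steps in parallel across the cluster.

```python
from collections import defaultdict

# Hypothetical sample relations.
orders = [("o1", "alice", 30), ("o2", "bob", 15), ("o3", "alice", 25)]
users = [("alice", "US"), ("bob", "DE")]

# GROUP: collect the rows that share a key.
grouped = defaultdict(list)
for order_id, customer, amount in orders:
    grouped[customer].append(amount)

# FOREACH over the groups with SUM: one aggregate per key.
totals = {customer: sum(amounts) for customer, amounts in grouped.items()}

# JOIN: inner join of the totals with users on the customer name.
country_of = dict(users)
joined = [(c, country_of[c], t) for c, t in totals.items() if c in country_of]

# ORDER BY total, descending.
ranked = sorted(joined, key=lambda row: row[2], reverse=True)
print(ranked)
```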
2. Hive
Apache Hive Architecture
• Built on Hadoop to provide SQL-like access to data.
• Compiles HiveQL queries into MapReduce jobs (newer versions can also run on Tez or Spark).
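How a grouped count such as SELECT dept, COUNT(*) ... GROUP BY dept might map onto MapReduce phases can be sketched as follows. This is a conceptual model only: the table contents and function names are hypothetical, and the plan Hive actually generates is more involved.

```python
from collections import defaultdict

rows = [{"dept": "eng"}, {"dept": "ops"}, {"dept": "eng"}]  # hypothetical table

def map_phase(row):
    # Emit (group key, 1) for each input row.
    yield row["dept"], 1

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return buckets

def reduce_phase(key, values):
    # COUNT(*) per group is the sum of the emitted ones.
    return key, sum(values)

pairs = [kv for row in rows for kv in map_phase(row)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```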
Hive Components
• Hive Shell
• Hive Services (Driver, Compiler, Execution Engine)
• Metastore: Stores schema and metadata.
Comparison with Traditional Databases
• Schema-on-read vs schema-on-write.
• Optimized for batch processing, not OLTP.
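The difference between the two schema models can be sketched in Python. The SCHEMA definition and both function names are hypothetical: schema-on-write rejects a bad row at load time, while schema-on-read keeps the raw text and surfaces unparseable fields as missing values when the data is queried.

```python
SCHEMA = [("id", int), ("price", float)]  # hypothetical table definition

def write_with_schema(store, raw_line):
    """Schema-on-write: validate and convert before storing; bad rows are rejected."""
    values = raw_line.split(",")
    if len(values) != len(SCHEMA):
        raise ValueError("column count mismatch")
    store.append(tuple(cast(v) for (_, cast), v in zip(SCHEMA, values)))

def read_with_schema(raw_lines):
    """Schema-on-read: raw text is stored as-is; the schema is applied per query."""
    for line in raw_lines:
        values = line.split(",")
        row = {}
        for i, (name, cast) in enumerate(SCHEMA):
            try:
                row[name] = cast(values[i])
            except (IndexError, ValueError):
                row[name] = None  # unparseable or missing field
        yield row

raw = ["1,9.5", "2,oops"]
print(list(read_with_schema(raw)))
```

Loading is cheap under schema-on-read because nothing is parsed up front, which suits batch workloads over large files; the cost moves to query time.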
HiveQL
• SQL-like language to write queries.
Tables and Querying
• Supports managed and external tables.
• SQL-style SELECT queries for data retrieval.
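The practical difference between managed and external tables shows up when a table is dropped. A toy model, assuming a hypothetical TinyWarehouse class in which one dict plays the metastore and another plays the file system:

```python
class TinyWarehouse:
    """Toy model: `metadata` plays the metastore, `files` plays HDFS."""

    def __init__(self):
        self.metadata = {}
        self.files = {}

    def create_table(self, name, location, external=False):
        self.metadata[name] = {"location": location, "external": external}
        self.files.setdefault(location, [])

    def drop_table(self, name):
        info = self.metadata.pop(name)
        if not info["external"]:
            # Managed table: the data is removed with the table definition.
            self.files.pop(info["location"], None)
        # External table: only the metadata entry is removed.

w = TinyWarehouse()
w.create_table("sales", "/warehouse/sales")               # managed
w.create_table("logs", "/data/raw/logs", external=True)   # external
w.drop_table("sales")
w.drop_table("logs")
print(sorted(w.files))
```

External tables are therefore the safer choice when the same files are shared with other tools.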
User Defined Functions (UDFs)
• Customize operations like filters and transformations.
Advanced Query Features
• Sorting, aggregation, joins, subqueries.
• MapReduce integration: custom map and reduce scripts can be plugged into queries (TRANSFORM clause).
3. HBase
HBase Concepts
• NoSQL database modeled after Google’s Bigtable.
• Column-family-oriented storage: columns are grouped into families, and each family is stored together.
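The data model can be pictured as a sorted, multi-level map: row key, then column, then timestamp, then value. The TinyTable class below is a hypothetical in-memory sketch of that shape, not the HBase client API.

```python
class TinyTable:
    """Toy model of a table: row key -> column -> timestamp -> value."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value, ts):
        # `column` is written "family:qualifier", e.g. "info:email".
        self.rows.setdefault(row_key, {}).setdefault(column, {})[ts] = value

    def get(self, row_key, column):
        # Return the value with the newest timestamp.
        versions = self.rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start, stop):
        # Rows are visited in sorted row-key order; `stop` is exclusive.
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key

t = TinyTable()
t.put("user#2", "info:email", "b@example.com", ts=1)
t.put("user#1", "info:email", "a@example.com", ts=1)
t.put("user#1", "info:email", "a2@example.com", ts=2)  # newer version of the same cell
print(t.get("user#1", "info:email"), list(t.scan("user#1", "user#3")))
```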
Clients and Examples
• Java API, REST, Thrift clients.
HBase vs RDBMS
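• HBase: flexible columns (only column families are fixed up front), horizontal scalability, real-time read/write, atomicity per row.
• RDBMS: fixed structured schema, multi-row ACID transactions.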
Advanced Usage
• Time-stamped cell versioning, compression, automatic sharding of tables into regions.
Schema Design
• Design based on access patterns, not normalization.
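As one illustration of access-pattern-driven design, suppose the main query is "the latest events for a given user". A common approach is a composite row key with a reversed timestamp, so that a prefix scan returns the newest rows first. The key layout, the MAX_TS bound, and the function name below are assumptions made for this sketch.

```python
MAX_TS = 10**13 - 1  # hypothetical upper bound on millisecond timestamps

def event_row_key(user_id, ts_millis):
    # Prefix by user so one user's events are contiguous; reverse the
    # timestamp so the newest event sorts first within that prefix.
    reversed_ts = MAX_TS - ts_millis
    return f"{user_id}#{reversed_ts:013d}"

# Row keys sort lexicographically, so sorting models scan order.
keys = sorted(event_row_key("u42", ts) for ts in (1000, 3000, 2000))
print(keys)
```

The zero-padding keeps lexicographic order aligned with numeric order, which matters because row keys are compared as bytes.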
Advanced Indexing
• Custom secondary indexes via Phoenix or Coprocessors.
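The idea behind a secondary index is a second table that maps an indexed value back to the row key. The sketch below maintains such an index by hand in plain Python; the table and function names are hypothetical, and it does not show how Phoenix or a coprocessor would implement the bookkeeping.

```python
data = {}         # row key -> {"email": ...}; stands in for the main table
email_index = {}  # email -> row key; stands in for a hypothetical index table

def put_user(row_key, email):
    old = data.get(row_key)
    if old is not None:
        # Drop the stale index entry so the index stays consistent.
        email_index.pop(old["email"], None)
    data[row_key] = {"email": email}
    email_index[email] = row_key

def find_by_email(email):
    # One index lookup plus one row lookup, instead of a full table scan.
    row_key = email_index.get(email)
    return None if row_key is None else (row_key, data[row_key])

put_user("u1", "a@example.com")
put_user("u1", "b@example.com")  # update: the old index entry must go
print(find_by_email("b@example.com"))
```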
ZooKeeper
• Coordination service for distributed systems.
• HBase uses it to track which servers are alive and to coordinate the cluster (for example, master election).
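One mechanism behind this kind of liveness tracking is the ephemeral node: an entry that exists only while the session that created it is alive. The TinyCoordinator class below is a toy stand-in, not the ZooKeeper API, and the /hbase/rs/ path prefix and server names are assumptions for illustration.

```python
class TinyCoordinator:
    """Toy stand-in for ephemeral nodes (not the real ZooKeeper API)."""

    def __init__(self):
        self.ephemeral = {}  # path -> owning session id

    def register(self, path, session_id):
        self.ephemeral[path] = session_id

    def expire_session(self, session_id):
        # When a session ends, every ephemeral node it owned disappears.
        self.ephemeral = {p: s for p, s in self.ephemeral.items() if s != session_id}

    def live_servers(self, prefix="/hbase/rs/"):
        return sorted(p[len(prefix):] for p in self.ephemeral if p.startswith(prefix))

zk = TinyCoordinator()
zk.register("/hbase/rs/server-a", session_id=1)
zk.register("/hbase/rs/server-b", session_id=2)
zk.expire_session(2)  # server-b crashed or lost its connection
print(zk.live_servers())
```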
4. IBM Big Data Tools
IBM Big Data Strategy
• End-to-end platform for big data storage, analysis, and visualization.
InfoSphere
• Platform for information integration and governance.
BigInsights
• Enterprise Hadoop solution with additional tooling.
BigSheets
• Spreadsheet-like interface for analyzing big data.
Big SQL
• SQL engine to query Hadoop data using ANSI-compliant SQL.