
Unit 1

Big Data refers to high-volume, high-velocity, and high-variety information assets that require innovative processing methods for enhanced decision-making. It encompasses structured, semi-structured, and unstructured data, with significant implications for industries like healthcare, finance, and marketing. Key technologies supporting Big Data include Hadoop, Spark, and NoSQL databases, which enable efficient storage, processing, and analysis of massive datasets.
INTRODUCTION
* Big Data may well be the Next Big Thing in the IT world.
* Big Data burst upon the scene in the first decade of the 21st century.
* The first organizations to embrace it were online and startup firms. Firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning.
* Like many new information technologies, big data can bring about dramatic cost reductions, substantial improvements in the time required to perform a computing task, or new product and service offerings.

WHAT IS BIG DATA
* Big Data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
* Big Data consists of extensive datasets, characterized primarily by volume, variety, velocity, and/or variability, that require a scalable architecture for efficient storage, manipulation, and analysis.
* Facebook alone generates more than 500 terabytes of data daily, while many other organizations, like Jet Air and the stock exchange market, generate terabytes of data every hour.

Types of Data:
1. Structured Data: organized in a highly mechanized and manageable way. Ex: tables, transactions, legacy data.
2. Unstructured Data: raw and unorganized; it varies in content and can change from entry to entry. Ex: videos, images, audio, text data, graph data, social media.
3. Semi-Structured Data: roughly 50% structured and 50% unstructured. Ex: XML databases.

* Walmart handles more than 1 million customer transactions every hour.
* Decoding the human genome originally took 10 years; now it can be achieved in one week.
WHY BIG DATA
* Key enablers of the appearance and growth of Big Data are:
1. Increase of storage capacities
2. Increase of processing power
3. Availability of data
* Every day we create 2.5 quintillion bytes of data; 90% of the data in the world today has been created in the last two years alone.
* Big Data describes a massive volume of both structured and unstructured data that is so large that it is difficult to process using traditional database and software techniques.
* Big Data is the massive amount of diverse, unstructured data produced by high-performance applications.
* Data growth is huge, and all that data is valuable.
* Data won't fit on a single system; that is why distributed data is used. Distributed data = faster computation.
* More knowledge leads to better customer engagement, fraud prevention and new products.
* Big Data matters for aggregation, statistics, indexing, searching, querying and discovering knowledge.

Measuring the data in a Big Data system:
1 Bit = 0 or 1
1 Nibble = 4 bits
1 Byte = 8 bits
1 Word = 4 or 8 bytes
1 Kilobyte (KB) = 1024 bytes
1 Megabyte (MB) = 1024 KB
1 Gigabyte (GB) = 1024 MB
1 Terabyte (TB) = 1024 GB
1 Petabyte (PB) = 1024 TB
1 Exabyte (EB) = 1024 PB
1 Zettabyte (ZB) = 1024 EB
1 Yottabyte (YB) = 1024 ZB

CHARACTERISTICS OF BIG DATA
* Volume
* Velocity
* Variety
* Veracity
* Value
* Variability

Volume
* The name "Big Data" itself is related to a size which is enormous.
* Volume is a huge amount of data.
* The size of data is key to its value. If the data volume is huge, it qualifies as "Big Data." Essentially, whether data is considered Big Data depends on its volume.
Example: In the year 2016, the estimated global mobile traffic was 6.2 exabytes (6.2 billion GB) per month. By the year 2020, the world was expected to hold almost 40,000 exabytes of data.

Velocity
* Velocity refers to the high speed of accumulation of data.
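The storage-unit ladder above translates directly into a small helper. The following is a quick illustrative Python snippet, not part of the original notes:

```python
# Convert a raw byte count into a human-readable unit,
# walking the 1024-based ladder from the table above.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    for unit in UNITS:
        if num_bytes < 1024 or unit == UNITS[-1]:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024

# Facebook's ~500 TB per day, expressed from raw bytes:
print(human_readable(500 * 1024**4))  # -> 500.0 TB
```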
* In Big Data, velocity means data flows in from sources like machines, networks, social media and mobile phones.
* The continuous and rapid flow of data shows its potential by how quickly it is generated and processed to meet demands.
Example: More than 3.5 billion searches per day are made on Google. Facebook users are increasing by approximately 22% year over year.

Variety
* Variety refers to the nature of data: structured, semi-structured and unstructured.
* Variety refers to data coming from diverse sources, both internal and external, in structured, semi-structured, and unstructured formats.
1. Structured data: organized data with a defined format and length, like rows and columns in a database.
2. Semi-structured data: partially organized data that doesn't fully follow formal structures, like log files.
3. Unstructured data: unorganized data that doesn't fit into traditional formats, such as text, images, and videos.

Veracity
* Veracity refers to inconsistencies and uncertainty in data: available data can sometimes get messy, and quality and accuracy are difficult to control.
* Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.
Example: Data in bulk can create confusion, whereas too little data can convey half or incomplete information.

Value
* Alongside the other V's, Value is crucial. Data holds no benefit unless it is transformed into meaningful, actionable insights for the company.
* Data in itself is of no use or importance; it needs to be converted into something valuable to extract information. Hence, you can state that Value is the most important of all the 6 V's.

Variability
* Variability refers to the inconsistencies and fluctuations in data. It highlights how data can differ significantly over time or in different contexts, such as changes in patterns, trends, or quality.
* Managing variability is important to ensure reliable analysis and actionable insights.
Example: you eat the same ice cream daily, but the taste keeps changing.

CONVERGENCE OF KEY TRENDS
Big Data works with other technologies to create smarter solutions. How is Big Data evolving with other trends?

1. Internet of Things (IoT):
* IoT devices like smartwatches, smart home systems, industrial sensors, and autonomous vehicles generate huge amounts of data.
Example: A smartwatch tracks heart rate, sleep patterns, and physical activity, sending data to health apps for analysis.
* In smart cities, IoT sensors collect traffic data to optimize signals and reduce congestion.

2. Cloud Computing:
* Provides virtually unlimited storage for massive datasets.
* Big Data requires scalable and cost-effective storage solutions, which cloud platforms like AWS, Google Cloud, and Microsoft Azure provide.
* Cloud computing allows companies to process and analyze data remotely without investing in expensive physical servers.
Example: Netflix uses cloud-based Big Data analytics to analyze viewer preferences and recommend personalized content.

3. Artificial Intelligence (AI):
* Uses Big Data for training machine learning models.
* AI and Machine Learning (ML) rely on Big Data for model training, predictions, and automation.
* The more data available, the better AI models perform in recognizing patterns and making decisions.
Example: Chatbots like ChatGPT analyze huge datasets to understand and generate human-like responses.
* In healthcare, AI analyzes medical Big Data to detect diseases early, recommend treatments, and assist doctors.

4. Data-Driven Business Models:
* Businesses rely on data for strategy and decision-making.
* Companies use customer behavior data, sales trends, and operational insights to make informed business decisions.
Example: E-commerce platforms like Amazon analyze shopping behavior to recommend products and optimize pricing strategies.
* Financial firms use Big Data analytics to detect fraud, assess risks, and predict stock market movements.
* Data-driven models improve efficiency, customer satisfaction, and overall business growth.

STRUCTURED DATA VS. UNSTRUCTURED DATA

1. Structured Data
Definition: Structured data refers to information that is highly organized and formatted in a predefined schema, making it easy to search, store, and analyze. It is typically represented as rows and columns in relational databases.
Storage: SQL databases, spreadsheets, data warehouses.
Examples:
* Customer details in an Excel sheet (Name, Phone, Email, Address).
* Banking transactions (Account Number, Date, Transaction Amount, Type).
* Airline booking systems (Passenger Name, Flight Number, Departure & Arrival Time).
Advantages:
* Easy to search, retrieve, and manage.
* Efficient for quick analysis and reporting.

2. Semi-Structured Data
Definition: Semi-structured data is information that doesn't follow a strict tabular format but includes organizational elements like tags or metadata, blending aspects of both structured and unstructured data for easier management.
Storage: NoSQL databases (MongoDB, Cassandra), JSON, XML files.
Examples:
* A JSON file storing e-commerce product details (Product Name, Price, Category, Ratings).
* Emails (structured parts like subject/sender plus unstructured content in the message body).
* Sensor logs from IoT devices (Temperature, Humidity, Timestamp).
Advantages:
* More flexible than structured data.
* Can store data in different formats without strict schema restrictions.
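A semi-structured record like the e-commerce product example above can be handled with Python's standard json module. The field names and values below are made up for illustration; the point is that tagged fields can be nested or missing without a fixed schema:

```python
import json

# A hypothetical semi-structured product record: tagged fields,
# but no rigid schema -- "ratings" is optional and nested.
raw = '''
{
  "product_name": "Wireless Mouse",
  "price": 19.99,
  "category": "Electronics",
  "ratings": {"average": 4.5, "count": 1280}
}
'''

product = json.loads(raw)
print(product["product_name"], product["price"])

# Fields can be absent without breaking the parse:
avg = product.get("ratings", {}).get("average", "n/a")
print("average rating:", avg)
```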
3. Unstructured Data
Definition: Unstructured data is information without a predefined format or schema, making it difficult to organize and analyze with traditional systems. It requires advanced techniques like AI and ML for meaningful insights.
Storage: Data lakes, cloud storage, or specialized NoSQL databases.
Examples:
* Social media posts (Facebook, Twitter, Instagram): text, images, videos, and comments.
* YouTube videos: raw video files with no inherent structure.
* Medical reports: handwritten doctor notes, MRI scans, and medical images.
* WhatsApp messages: a mix of text, emojis, images, and voice notes.
Challenges:
* Harder to search and analyze than structured data.
* Requires AI, Natural Language Processing (NLP), and Machine Learning (ML) for proper insights.

INDUSTRY EXAMPLES OF BIG DATA
* Healthcare: disease prediction, drug discovery, hospital management.
* Retail: customer behavior analysis, demand forecasting, dynamic pricing.
* Finance: fraud detection, risk management, algorithmic trading.
* Social Media: sentiment analysis, targeted ads, influencer marketing.

BIG DATA IN WEB ANALYTICS & MARKETING
Big Data enhances website performance and marketing by analyzing customer behavior and personalizing strategies. Tools like Google Analytics and predictive analytics optimize campaigns and target the right audience.
1. Customer Behavior Tracking
* Big Data helps businesses understand how users navigate a website, what they click on, and how long they stay.
Example: E-commerce sites analyze browsing history to recommend relevant products.
2. Digital Marketing Optimization
* Companies analyze ad performance, email campaigns, and social media engagement to optimize marketing strategies.
Example: Google Ads uses AI to optimize ad placements based on user interactions.
3. Personalized Advertisements
* Websites use customer data (search history, location, preferences) to show highly relevant ads.
Example: If a user searches for laptops, they will start seeing laptop ads across different websites.
4. Customer Segmentation
* Businesses group customers based on age, location, interests, and purchasing behavior.
Example: Netflix segments users based on watch history to suggest personalized content.
5. Predictive Analytics
* AI-driven models forecast future purchasing trends based on past user behavior.
Example: Amazon predicts when a customer is likely to reorder an item and suggests it in advance.

FRAUD DETECTION & RISK MANAGEMENT WITH BIG DATA
1. Fraud Detection in Banking & Finance
* Big Data analyzes millions of transactions in real time to detect fraudulent activities.
* AI-powered anomaly detection identifies unusual spending patterns, duplicate transactions, and unauthorized access.
Example:
* Banks monitor credit card transactions and flag suspicious purchases (e.g., a sudden high-value purchase from a foreign location).
* Financial institutions use AI to detect money laundering by tracking unusual fund transfers.
2. Fraud Detection in Stock Trading
* AI and Big Data analyze market trends and investor behavior to detect fraud, identifying insider trading and suspicious trading patterns.
Example:
* Trading platforms detect unusual stock purchases before major announcements (possible insider trading).
3. Risk Management in Business
* Predicting financial crises:
1. Big Data examines economic trends, stock market data, and global events to forecast financial downturns.
2. Example: AI-based risk models warned about the 2008 financial crisis by analyzing housing market trends.
* Reducing supply chain risks:
1. Big Data optimizes supply chains by predicting demand, tracking shipments, and identifying delays.
2. Example: Walmart uses Big Data to adjust inventory levels based on customer demand and weather predictions.
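The anomaly-detection idea described above can be sketched very simply: flag a transaction that sits far outside a customer's historical spending. The sample data and z-score threshold below are invented for illustration; production fraud systems use far richer features and models:

```python
from statistics import mean, stdev

def flag_suspicious(amounts, new_amount, z_threshold=3.0):
    """Flag a transaction whose amount lies more than z_threshold
    standard deviations above the customer's historical mean."""
    mu, sigma = mean(amounts), stdev(amounts)
    z = (new_amount - mu) / sigma
    return z > z_threshold

history = [42.0, 55.0, 38.0, 61.0, 47.0, 52.0]  # typical card spend
print(flag_suspicious(history, 49.0))   # ordinary purchase -> False
print(flag_suspicious(history, 900.0))  # sudden high value -> True
```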
CREDIT RISK MANAGEMENT & ALGORITHMIC TRADING WITH BIG DATA
Big Data enhances banking by optimizing credit risk and trading with AI-driven analysis, improving decision-making and profitability.
1. Credit Risk Management in Banking
* Big Data plays a crucial role in assessing a customer's creditworthiness before approving a loan. Banks and financial institutions analyze massive datasets to predict potential loan defaults and reduce financial risks.
How banks use Big Data for loans:
i. Credit history analysis
ii. Spending behavior assessment
iii. Loan default prediction
Example: A bank may reject a loan application if the applicant has:
* High outstanding debt
* Irregular income
* A poor credit score
2. Big Data in Algorithmic Trading
* Algorithmic trading (also called algo trading) uses AI and Big Data to analyze stock market trends and execute trades automatically.
How Big Data enhances algorithmic trading:
i. Predicting stock prices
ii. High-speed trading
iii. Risk management
Example:
* Analyze financial news and market sentiment before making a trade.
* Buy/sell stocks automatically based on real-time data.

BIG DATA IN HEALTHCARE & MEDICINE
Big Data is transforming the healthcare industry by improving disease prediction, drug discovery, and personalized medicine.
Disease prediction:
* AI models analyze medical images and records to detect diseases early, like identifying cancer patterns for timely diagnosis and treatment.
Drug discovery:
* Pharmaceutical companies use Big Data to analyze patient, clinical, and genetic data, speeding up drug discovery and developing more effective treatments.
Personalized medicine:
* Big Data helps doctors create personalized treatment plans, improving outcomes and reducing medication side effects.

BIG DATA TECHNOLOGIES
* Popular Big Data tools include Hadoop, Cloudera, Datameer, Splunk, Mahout, Hive, HBase, LucidWorks, R, MapR, Ubuntu and other Linux flavors.
* Big Data requires powerful tools to store, process, and analyze massive volumes of data efficiently.
* Hadoop: An open-source framework that enables distributed data storage and processing across multiple servers. It is highly scalable and handles large datasets efficiently using its Hadoop Distributed File System (HDFS) and MapReduce processing model.
* Spark: A fast and flexible Big Data processing framework that performs real-time data processing. Unlike Hadoop, which relies on disk-based storage, Spark uses in-memory computing, making it significantly faster for iterative processing and machine learning applications.
* Kafka: A real-time data streaming platform that allows businesses to process continuous flows of data from multiple sources. It is widely used for real-time analytics, event processing, and log monitoring in banking, social media, and e-commerce.
* NoSQL databases: Unlike traditional relational databases, NoSQL databases (such as MongoDB and Cassandra) are optimized for handling semi-structured and unstructured data, making them ideal for social media posts, IoT data, and multimedia content.

INTRODUCTION TO HADOOP
Hadoop is an open-source software framework whose major components are a distributed file system called the Hadoop Distributed File System (HDFS) and a high-performance parallel data processing engine called Hadoop MapReduce. This topic covers:
1. History of Hadoop
2. Apache Hadoop
3. Analyzing the data with UNIX tools
4. Analyzing the data with Hadoop
5. Hadoop Streaming
6. Hadoop Ecosystem

1. History of Hadoop
* Hadoop was proposed by Doug Cutting in 2004.
* Hadoop's origin is in Google's white papers (GFS, MapReduce and BigTable).
* From these, GFS became HDFS, MapReduce became Hadoop MapReduce, and BigTable became HBase.
* The project name is Apache(TM) Hadoop(R).
* Hadoop is the core technique used in Big Data analytics.
* Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation: open-source software for reliable, scalable, distributed computing.
(Fig: History of Hadoop. Google publishes the MapReduce technique; in 2004 Doug Cutting begins the Hadoop journey.)
* A top-level Apache project, initiated and led by Yahoo!.
* The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
* Hadoop is a platform for processing large-scale data sets in a distributed fashion.
* Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes of data.
* An important characteristic of Hadoop is that it works on a distributed model:
i. Hadoop Common: common utilities
ii. Hadoop Distributed File System (HDFS)
iii. Hadoop YARN: Yet Another Resource Negotiator
iv. Hadoop MapReduce: programming model

Apache Hadoop
1. Hadoop Distributed File System (HDFS)
2. MapReduce
* These are changing rapidly; they are active areas of use and growth today.
* "Silicon Valley investors have poured $2 billion into companies based on the data-collection software known as Hadoop." (Wall Street Journal, June 15, 2015)
* IBM planned to invest a few hundred million dollars a year in Spark, not including investments by Facebook, Google, Yahoo!, Baidu, and others.

Apache Hadoop: Purpose
"Framework that allows distributed processing of large data sets across clusters of computers... using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures."

Apache Hadoop: key components
* Hadoop Common: common utilities.
* (Storage) Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access.
1. Many other data storage approaches are also in use,
2. e.g., Apache Cassandra, Apache HBase, Apache Accumulo (NSA-contributed).
* (Scheduling) Hadoop YARN: a framework for job scheduling and cluster resource management.
* (Processing) Hadoop MapReduce (MR2): a YARN-based system for parallel processing of large data sets.
1. Other execution engines are increasingly in use, e.g., Spark.
(Fig: Hadoop Ecosystem. Components such as Sqoop, Hive, Pig, HBase, HUE and ZooKeeper layered over HDFS.)

HDFS (Hadoop Distributed File System): Data Storage
* HDFS is the storage layer of Hadoop, designed to handle large files by distributing them across multiple machines (nodes) in a cluster.
* Instead of storing data on a single system, HDFS splits large files into smaller blocks (typically 128 MB or 256 MB each) and stores them across multiple servers.
* To prevent data loss, HDFS replicates each block multiple times across different nodes. This ensures fault tolerance: even if one machine fails, the data is available on another.
Example: If a company collects terabytes of customer transaction data daily, HDFS can store it efficiently across multiple servers, allowing fast access and high reliability.

MapReduce: Data Processing
* MapReduce is the processing engine of Hadoop that allows large-scale parallel processing of data.
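The block-splitting and replication scheme described for HDFS above can be mimicked in a toy Python sketch. This is purely illustrative: the 128 MB block size is the common HDFS default, but the node names and round-robin placement are invented here; real HDFS placement is rack-aware and far more sophisticated.

```python
# Toy model: split a file into fixed-size blocks and assign each
# block to `replication` distinct nodes, round-robin style.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default

def place_blocks(file_size, nodes, replication=3):
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Each block lands on `replication` consecutive nodes.
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
plan = place_blocks(400 * 1024 * 1024, nodes)  # a 400 MB file -> 4 blocks
for block, replicas in plan.items():
    print(f"block {block} -> {replicas}")
```

Losing any single node still leaves two copies of every block, which is the fault-tolerance property the notes describe.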
* It divides a large computational task into smaller sub-tasks and processes them simultaneously across multiple nodes.
* The process works in two main steps:
1. Map phase: breaks down large datasets into smaller parts and processes them in parallel.
2. Reduce phase: combines the processed data and generates meaningful results.
Example: A social media company wants to analyze trending hashtags from billions of tweets. Using MapReduce:
1. Map phase: each machine counts hashtags from a subset of tweets.
2. Reduce phase: aggregates counts from all machines to determine the most popular hashtags.

YARN (Yet Another Resource Negotiator): Resource Management
* YARN is the resource management layer of Hadoop that efficiently allocates system resources (CPU, memory, storage) among different applications running on a Hadoop cluster.
* It enables multiple processing frameworks (such as Spark, MapReduce, and Flink) to work on the same data at the same time.
* YARN enhances performance and scalability by dynamically distributing resources as per workload requirements.

OPEN SOURCE TECHNOLOGIES FOR BIG DATA
* Big Data requires powerful open-source technologies to handle, process, and analyze vast amounts of data efficiently. These technologies are designed for high-speed, scalable, and real-time data processing, making them essential for businesses and industries dealing with massive datasets.
* Three key open-source technologies play a vital role in Big Data analytics: Apache Spark, Apache Flink, and Elasticsearch.

Apache Spark: Real-Time Big Data Processing
* Apache Spark is a fast, distributed, in-memory data processing engine designed for large-scale analytics. Unlike traditional batch processing frameworks, Spark supports real-time analytics, making it ideal for handling time-sensitive data.
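The two-phase hashtag example described earlier can be simulated in plain Python. This is a toy sketch of the MapReduce model with invented function names, not actual Hadoop code; in real Hadoop the framework shuffles keys between the phases and runs them on separate machines:

```python
from collections import Counter
from itertools import chain

def map_phase(tweets):
    """Map: emit (hashtag, 1) pairs from one partition of tweets."""
    return [(word, 1) for tweet in tweets
            for word in tweet.split() if word.startswith("#")]

def reduce_phase(mapped_partitions):
    """Reduce: aggregate the counts emitted by all mappers."""
    totals = Counter()
    for tag, count in chain.from_iterable(mapped_partitions):
        totals[tag] += count
    return totals

# Two "machines", each mapping its own partition of the data.
partition1 = ["loving #bigdata", "#bigdata #hadoop rocks"]
partition2 = ["#hadoop at scale", "more #bigdata"]
mapped = [map_phase(partition1), map_phase(partition2)]
print(reduce_phase(mapped).most_common(1))  # -> [('#bigdata', 3)]
```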
Key features of Apache Spark:
* In-memory computing
* Supports multiple programming languages
* Real-time stream processing
* Machine learning & AI support
* Graph processing & SQL queries
Examples: banking institutions; e-commerce companies.

Apache Flink: Large-Scale Real-Time Data Streaming
* Apache Flink is a powerful stream-processing engine that processes real-time data streams with high accuracy and low latency. Unlike batch processing systems, Flink operates in continuous mode, making it ideal for scenarios requiring instant decision-making.
Key features of Apache Flink:
* True real-time streaming engine
* Exactly-once processing
* Fault-tolerant & scalable
* Supports batch and stream processing
Examples: stock market monitoring; cybersecurity threat detection.

CLOUD AND BIG DATA & MOBILE BUSINESS INTELLIGENCE
Cloud and Big Data:
* Big Data generates massive volumes of information, which require scalable, flexible, and efficient storage and processing solutions. Cloud computing plays a crucial role in handling Big Data by offering infrastructure that can store, analyze, and manage these large datasets efficiently.
Key features & benefits: scalability; security & reliability; speed & performance; data integration.
Popular cloud platforms for Big Data:
* Amazon Web Services (AWS): provides tools like Amazon S3 (storage) and Amazon Redshift (data warehouse) for Big Data processing.
* Google Cloud Platform (GCP): uses BigQuery for real-time analytics and AI-powered insights.
* Microsoft Azure: offers services like Azure Data Lake for managing massive datasets.

Mobile Business Intelligence (BI)
* With the rise of smartphones and mobile applications, Mobile BI enables businesses to access, analyze, and make decisions on data in real time, from anywhere. It helps executives, sales teams, and managers stay updated on key metrics without being tied to a desktop.
Key features:
1. Real-time access
2. Data visualization
3. Alerts & notifications
4. Collaboration & sharing
How Mobile BI is used in businesses:
* Retail: tracks sales, inventory levels, and customer behavior in real time.
* Finance: monitors stock prices, credit risk, and fraud detection instantly.
* Healthcare: doctors can access patient records and hospital performance metrics remotely.

CROWD-SOURCING ANALYTICS & INTER AND TRANS-FIREWALL ANALYTICS
Crowd-sourcing analytics uses public data, like Google Maps' traffic updates, to improve decision-making. Inter- and trans-firewall analytics enhance cybersecurity by detecting threats across multiple firewalls.

Crowd-Sourcing Analytics
* Crowd-sourcing analytics collects real-time data from public users.
* It generates valuable insights for businesses, researchers, and organizations.
* Helps improve decision-making and optimize services.
Examples: Google Maps, Waze.

Inter and Trans-Firewall Analytics
* Inter- and trans-firewall analytics focus on analyzing data across multiple firewalls.
* Helps detect potential cyber threats and prevent unauthorized access.
* Enhances network security by monitoring data traffic between firewalls.
* Provides advanced cybersecurity measures for businesses and governments.
* Improves threat detection and response capabilities in complex network environments.
Key points:
* Helps businesses and government agencies monitor suspicious activities across different networks.
* Uses AI and machine learning to detect threats by analyzing firewall logs.
* Prevents data breaches, hacking attempts, and malware infections.
Examples:
* Banking sector: banks use firewall analytics to detect fraud attempts and block suspicious transactions.
* Government security systems: national agencies use trans-firewall analytics to monitor cyber threats across different regions and prevent cyberattacks.
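The firewall-log analysis idea above can be illustrated with a minimal sketch: count DENY entries per source IP and flag addresses that exceed a threshold. The log format, IP addresses, and threshold here are all invented for illustration; real firewall analytics use far richer features and ML models:

```python
from collections import Counter

# Hypothetical firewall log lines: "timestamp action src_ip"
log_lines = [
    "10:00:01 DENY 203.0.113.9",
    "10:00:02 ALLOW 198.51.100.4",
    "10:00:02 DENY 203.0.113.9",
    "10:00:03 DENY 203.0.113.9",
    "10:00:05 DENY 192.0.2.77",
]

def flag_ips(lines, threshold=3):
    """Flag source IPs with `threshold` or more DENY entries."""
    denies = Counter(line.split()[2] for line in lines
                     if line.split()[1] == "DENY")
    return {ip for ip, n in denies.items() if n >= threshold}

print(flag_ips(log_lines))  # -> {'203.0.113.9'}
```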
CONCLUSION
Big Data is revolutionizing industries by enabling smarter decision-making and innovation. Technologies like AI, IoT, and Cloud Computing are enhancing its capabilities. The future holds quantum computing for faster processing and stronger data privacy measures to address ethical concerns. As data continues to grow, leveraging it responsibly will be key to success.

~ Thank you ~
