Big Data Unit 2:
Mining Data Streams
 Data Stream: A data stream is a continuous, ordered (implicitly by arrival time or explicitly
 by timestamp) sequence of items. It is not feasible to control the order in which items
 arrive, nor is it feasible to store the stream locally in its entirety. Streams carry enormous
 volumes of data, and items arrive at a high rate.
 Types of Data Streams :
     Data stream: A data stream is a (possibly unbounded) sequence of tuples. Each tuple
     comprises a set of attributes, similar to a row in a database table.
     Transactional data stream: a log of interactions between entities
          • Credit card – purchases by consumers from merchants
          • Telecommunications – phone calls by callers to the dialed parties
          • Web – accesses by clients of information at servers
     Measurement data streams
          • Sensor Networks – physical/natural phenomena, road traffic
          • IP Network – traffic at router interfaces
          • Earth climate – temperature, humidity level at weather stations
 Examples of Stream Sources-
      •   Sensor Data: Sensor data is used, for example, in navigation systems. Imagine a
          temperature sensor floating about in the ocean, sending back to the base station a
          reading of the surface temperature each hour. The data generated by this sensor is a
          stream of real numbers. With many such sensors we have 3.5 terabytes arriving every
          day, and we certainly need to think about what can be kept for continued processing
          and what can only be archived.
      •   Image Data: Satellites frequently send down to Earth streams containing many
          terabytes of images per day. Surveillance cameras generate images with lower
          resolution than satellites, but there can be very many of them, each producing a
          stream of images at intervals of one second.
      •   Internet and Web Traffic: A switching node in the middle of the Internet receives
          streams of IP packets from many inputs and routes them to its outputs. Websites
          receive streams of heterogeneous types. For example, Google receives a hundred
          million search queries per day.
  Characteristics of Data Streams :
     • Large volumes of continuous data, possibly infinite.
     • Continuously changing, requiring fast, real-time responses.
     • The data stream model captures many of today's data processing needs well.
     • Random access is expensive, so single-scan (one-pass) algorithms are used.
     • Only a summary of the data seen so far is stored.
     • Most stream data are at a fairly low level of abstraction or multidimensional in
         nature, and need multilevel, multidimensional processing.
  Applications of Data Streams :
     • Fraud detection
     • Real-time trading
     • Customer activity analysis
     • Monitoring and reporting on internal IT systems
  Advantages of Data Streams :
     • This data is helpful in improving sales
     • Helps in recognizing errors
     • Helps in minimizing costs
     • Provides the details needed to react swiftly to risk
  Disadvantages of Data Streams :
     • Lack of security of data in the cloud
     • Dependency on the cloud provider
     • Off-premises storage of data introduces the possibility of outages or disconnection
Stream Data Model and Architecture: A streaming data architecture is a dedicated network
of software components capable of processing large amounts of stream data from many
sources. Unlike conventional data architecture solutions, which focus on batch reading and
writing, a streaming data architecture takes data as it is generated in its raw form, stores it,
and may incorporate different components for real-time data processing and manipulation.
An effective streaming architecture must be designed for the characteristics of data
streams, which tend to deliver large amounts of structured and semi-structured data that
requires filtering and pre-processing to be useful.
        Due to its complexity, stream processing cannot be solved with one ETL (Extract,
transform, and load) tool or database. That’s why organizations need to adopt solutions
consisting of multiple building blocks that can be combined with data pipelines within the
organization’s data architecture.
        Although stream processing was initially considered a niche technology, it is hard to
find a modern business that does not have an eCommerce site, an online advertising strategy,
an app, or products enabled by IoT.
        Each of these digital assets generates real-time event data streams, thus fueling the
need to implement a streaming data architecture capable of handling powerful, complex, and
real-time analytics.
Stream Computing: The word stream in stream computing refers to pulling in streams of
data, processing the data, and streaming it back out as a single flow. Stream computing
uses software algorithms that analyze the data in real time as it streams in, to increase
speed and accuracy when dealing with data handling and analysis.
     •     Stream computing is a computing paradigm that reads data from collections of
           software or hardware sensors in stream form and computes continuous data
           streams.
       • Stream computing uses software programs that compute continuous data
           streams.
        • Stream computing uses software algorithms that analyze the data in real time.
       • Stream computing is one effective way to support Big Data by providing extremely
           low-latency velocities with massively parallel processing architectures.
       • It is becoming the fastest and most efficient way to obtain useful knowledge from
           Big Data.
       Examples: In June 2007, IBM announced its stream computing system, called System
       S. This system runs on 800 microprocessors and the System S software enables
       software applications to split up tasks and then reassemble the data into an answer.
        ATI Technologies also announced a stream computing technology that enables its
        graphics processors (GPUs) to work in conjunction with high-performance,
        low-latency CPUs to solve complex computational problems. ATI’s stream computing
        technology is derived from a class of applications that run on the GPU instead of a
        CPU.
Sampling Data in a Stream: Stream sampling is the process of collecting a representative
sample of the elements of a data stream. The sample is usually much smaller than the entire
stream, but can be designed to retain many important characteristics of the stream, and can
be used to estimate many important aggregates on the stream. Unlike sampling from a
stored data set, stream sampling must be performed online, when the data arrives. Any
element that is not stored within the sample is lost forever, and cannot be retrieved. This
section discusses methods of sampling from a data stream and applications of these
methods; a minimal example of one such method follows.
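One standard way to meet the online requirement described above is reservoir sampling. The technique is not named in the text, so the sketch below is an illustrative assumption: it keeps a fixed-size sample that stays uniform over everything seen so far, replacing old elements with the right probability as new ones arrive.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k elements from the stream seen so far."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k elements fill the reservoir directly.
            reservoir.append(item)
        else:
            # Keep the n-th element with probability k/n, overwriting a random slot,
            # which preserves a uniform sample over all n elements seen so far.
            j = random.randint(1, n)
            if j <= k:
                reservoir[j - 1] = item
    return reservoir

# Example: a sample of 5 readings from a simulated stream of 10,000 readings.
print(reservoir_sample(range(10_000), 5))
```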
Filtering Streams: Stream filtering is one of the most useful and practical approaches to
efficient stream evaluation, whether it is done implicitly by the system to guarantee the
stability of the stream processing under overload conditions, or explicitly by the evaluating
procedure. In this section we will review some of the filtering techniques commonly used in
data stream processing.
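One structure very commonly used for this kind of per-element filtering in stream systems is the Bloom filter, which cheaply passes only elements that might belong to a pre-loaded set (it never rejects a stored key, but may rarely pass others). The text does not name it explicitly, so the sketch below, including its sizes, hash construction, and example keys, is an illustrative assumption.

```python
import hashlib

class BloomFilter:
    """Approximate set membership: never misses a stored key, may rarely pass others."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions from salted hashes of the item.
        for salt in range(self.num_hashes):
            digest = hashlib.md5(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

# Example: pass only stream elements whose key was pre-loaded into the filter.
allowed = BloomFilter()
for key in ["alice@example.com", "bob@example.com"]:   # hypothetical allow-list
    allowed.add(key)

stream = ["alice@example.com", "mallory@example.com", "bob@example.com"]
print([x for x in stream if allowed.might_contain(x)])
```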
       Filtering Techniques in Data Mining consist of three disciplines: Machine Learning
techniques, Statistical Models, and Deep Learning algorithms. Using these methods, Data
Mining professionals work out how to process and draw conclusions from huge amounts of
data.
      • Tracking Patterns: Tracking patterns is one of the most basic Filtering Techniques
          in Data Mining. It helps recognize aberrations in data or an ebb and flow of a
           variable. Pattern tracking can help determine whether a product is ordered more by
           a particular demographic. A brand can use this to stock more of that product for this
           demographic or to create similar products. For example, you can identify sales data
          trends and capitalize on those insights.
      • Classification: Classification Filtering Techniques in Data Mining are used to
          categorize or classify related data after identifying the main characteristics of data
          types. You can classify data by various criteria such as type of data sources mined,
          database involved, the kind of knowledge discovered, and more.
      • Clustering: Clustering Filtering Techniques in Data Mining identify similar data and
          divide information into groups of connected objects (clusters) based on their
          characteristics. It models data by its clusters and is seen as a historical point of
          view in Data Modeling. Clustering helps in scientific data exploration, text mining,
          spatial database applications, information retrieval, CRM, medical diagnostics, and
          much more. You can recognize the differences and similarities in the data with this
          method. Clustering is similar to classification but involves grouping chunks of data
          based on their similarities.
      • Visualization: Data Visualizations are another element of Data Mining; these
          Filtering Techniques in Data Mining provide information about data based on
          sensory perceptions. Today’s Data Visualizations are dynamic and helpful in
          Streaming Data in real-time, characterized by various colours to reveal different
          trends and patterns. Dashboards are powerful tools to uncover data mining
          insights. Organizations can base dashboards on multiple metrics and use
          visualizations to highlight patterns in data instead of numerical models.
      • Association: Association is a Filtering Technique in Data Mining related to tracking
          patterns and statistics. It signifies that certain data events are associated with
          other data-driven events. It is similar to the co-occurrence notion in Machine
          Learning, where the presence of one event indicates the likelihood of another.
          The notion of association thus indicates a relationship between two data events.
      • Regression: The Regression Filtering Techniques in Data Mining are used
          as a form of planning and modelling, identifying the likely value of a certain
          variable when other variables are known. Its primary focus is to uncover the
          relationship between variables in a given dataset. For example, you could use it to
          project prices based on consumer demand, availability, and competition.
      • Prediction: The Prediction Filtering Techniques in Data Mining are about finding
          patterns in historical and current data to extend them into future predictions,
          providing insights into what might happen next. For example, reviewing
         consumers’ past purchases and credit histories to predict whether they’ll be a
         credit risk in the future.
      • Neural Networks: Primarily used for deep learning algorithms, the neural network
         filtering techniques in the Data Mining process mimic the human brain’s
         interconnectivity. They have various layers of nodes where each node is made up
          of weights, inputs, a bias, and an output. Neural networks can
          be a powerful tool in Data Mining but should be used with caution, as these
          models are incredibly complex.
      • Decision Tree: The decision tree filtering technique in Data Mining is a predictive
         model that uses Regression or Classification methods to classify potential
         outcomes. It uses a tree-like structure/model to represent the possible outcomes.
         These Filtering Techniques in Data Mining enable companies to understand how
         their data inputs affect the output.
      • K-Nearest Neighbor (KNN): KNN is a nonparametric filtering technique that
          classifies data points based on their proximity to other available data. It
          assumes that similar data points can be found near each other. It calculates
          the distance between data points and assigns a category based on the most
          frequent type or the average among the nearest neighbours (a sketch appears
          after this list).
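As a concrete illustration of the distance-and-majority-vote idea in the K-Nearest Neighbor item above, here is a minimal Python sketch; the feature values, labels, and query point are made up for illustration.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every labelled point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Example: two features per point, two classes.
points = [(1.0, 1.2), (0.9, 1.0), (5.0, 4.8), (5.2, 5.1)]
labels = ["low", "low", "high", "high"]
print(knn_predict(points, labels, query=(4.9, 5.0), k=3))  # -> "high"
```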
Counting Distinct Elements in a Stream
    Naive approach for finding the count of distinct numbers in every window of size K
       • For every index i from 0 to N – K, traverse the array from i to i + K – 1 using
           another loop. This is the window.
       • While traversing the window, check for each index whether the element already
           appears earlier in the window.
       • If the element is not present in the prefix of the window, i.e. no duplicate
           element is present from i to index-1, then increase the count, else ignore it.
       • Print the count for each window (a minimal sketch follows this list).
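A minimal Python sketch of the naive approach described above; the example array and window size are made up for illustration.

```python
def distinct_counts_per_window(arr, k):
    """Naive approach: for every window of size k, count elements that do not
    appear earlier in the same window (i.e. the number of distinct elements)."""
    counts = []
    for i in range(len(arr) - k + 1):
        count = 0
        for j in range(i, i + k):
            # arr[j] is counted only if it does not occur earlier in the window.
            if arr[j] not in arr[i:j]:
                count += 1
        counts.append(count)
    return counts

# Example: windows of size 4 over a small stream prefix.
print(distinct_counts_per_window([1, 2, 1, 3, 4, 2, 3], 4))  # [3, 4, 4, 3]
```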
Estimating Moments
      • Estimating moments is a generalization of the problem of counting distinct
         elements in a stream. The problem, called computing "moments," involves the
         distribution of frequencies of different elements in the stream.
      • Suppose a stream consists of elements chosen from a universal set. Assume the
         universal set is ordered so we can speak of the ith element for any i.
      • Let m_i be the number of occurrences of the ith element for any i. Then the
          kth-order moment of the stream is the sum over all i of (m_i)^k.
      Example :-
      • The 0th moment is the sum of 1 for each m_i that is greater than 0, i.e., the 0th
          moment is a count of the number of distinct elements in the stream.
      • The 1st moment is the sum of the m_i's, which must be the length of the stream.
         Thus, first moments are especially easy to compute i.e., just count the length of
         the stream seen so far.
      • The second moment is the sum of the squares of the m_i's. It is sometimes called
         the surprise number, since it measures how uneven the distribution of elements in
         the stream is.
     • To see the distinction, suppose we have a stream of length 100, in which eleven
         different elements appear. The most even distribution of these eleven elements
         would have one appearing 10 times and the other ten appearing 9 times each.
      • In this case, the surprise number is 10^2 + 10 × 9^2 = 910. At the other extreme,
          one of the eleven elements could appear 90 times and the other ten appear 1 time
          each. Then, the surprise number would be 90^2 + 10 × 1^2 = 8110.
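The definitions above can be checked on a small stream. The sketch below computes exact moments from stored counts, which illustrates the definition (a true streaming algorithm would estimate moments without storing all counts); the example reproduces the 100-element, eleven-distinct-element stream from the text.

```python
from collections import Counter

def kth_moment(stream, k):
    """Exact k-th moment: sum over all distinct elements of (count)^k."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

stream = [0] * 90 + list(range(1, 11))   # one element 90 times, ten elements once each
print(kth_moment(stream, 0))  # 11   -> number of distinct elements
print(kth_moment(stream, 1))  # 100  -> length of the stream
print(kth_moment(stream, 2))  # 8110 -> the surprise number from the text
```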
Decaying Window:
     • This algorithm allows you to identify the most popular elements (trending, in other
         words) in an incoming data stream.
     • The decaying window algorithm not only tracks the most recurring elements in an
         incoming data stream, but also discounts any random spikes or spam requests that
         might have boosted an element’s frequency.
      Algorithm: In a decaying window, you assign a score or weight to every element of the
      incoming data stream. Further, you calculate an aggregate score for each distinct
      element by adding up the weights assigned to its occurrences, with more recent
      occurrences counting for more than older ones. The element with the highest total
      score is listed as trending or the most popular.
            o Assign each element a weight/score.
            o Calculate the aggregate score for each distinct element by adding the weights
                assigned to its occurrences (a minimal sketch follows the advantages below).
         Advantages of the Decaying Window Algorithm:
               • Sudden spikes or spam data are discounted.
               • Newer elements are given more weight by this mechanism, producing more
                  accurate trending output.
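A minimal Python sketch of the scoring idea described above, using an exponential decay so that recent occurrences weigh more than old ones. The decay factor and the example stream are illustrative assumptions, and a production implementation would avoid rescanning every score on each arrival.

```python
from collections import defaultdict

def decaying_window_scores(stream, decay=0.01):
    """Exponentially decaying scores: on each arrival, all scores are damped
    by (1 - decay) and the arriving element's score is increased by 1."""
    scores = defaultdict(float)
    for item in stream:
        for key in scores:
            scores[key] *= (1.0 - decay)
        scores[item] += 1.0
    return scores

# Example: "b" occurs in an early burst, "a" keeps occurring recently,
# so "a" ends up with the higher ("trending") score despite equal raw counts.
stream = ["b"] * 30 + ["a", "b"] * 5 + ["a"] * 30
scores = decaying_window_scores(stream, decay=0.05)
print(max(scores, key=scores.get), dict(scores))
```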
Real time Analytics: Real-time analytics lets businesses gain awareness of data and act on
it immediately or soon after the data enters their system. Real-time analytics answers
queries within seconds and handles large amounts of data arriving at high velocity with low
response times. For example, real-time big data analytics uses data in financial databases
to inform trading decisions. Analytics can be on-demand or continuous. On-demand
analytics delivers results when the user requests them. Continuous analytics updates users
as events happen and can be programmed to respond automatically to certain events. For
example, real-time web analytics might alert an administrator if page load performance
goes outside the preset boundary.
        Advantages:
             • Create your own interactive analytics tools.
             • Transparent dashboards allow users to share information.
             • Monitor behaviour in a way that is customized.
             • Perform immediate adjustments if necessary.
             • Make use of machine learning.
Real time Analytics Platform (RTAP)
      • A real-time analytics platform enables organizations to make the most out of real-
         time data by helping them to extract the valuable information and trends from it.
      • Such platforms help in measuring data from the business point of view in real
         time, further making the best use of data.
      • An ideal real-time analytics platform would help in analyzing the data, correlating
         it and predicting the outcomes on a real-time basis.
      • The real-time analytics platform helps organizations in tracking things in real time,
         thus helping them in the decision-making process.
      • The platforms connect the data sources for better analytics and visualization.
      • Real-time analytics is the analysis of data as soon as that data becomes available.
          In other words, users get insights or can draw conclusions as soon as the data
          enters their system.
      Applications:
            • Real time credit scoring, helping financial institutions to decide immediately
               whether to extend credit.
            • Customer relationship management (CRM), maximizing satisfaction and
               business results during each interaction with the customer.
            • Fraud detection at points of sale.
            • Targeting individual customers in retail outlets with promotions and
               incentives, while the customers are in the store and next to the
               merchandise.
Case Studies:
      • Big data in Netflix: Netflix implements data analytics models to discover customer
         behavior and buying patterns. Then, using this information it recommends movies
         and TV shows to their customers. That is, it analyzes the customer’s choice and
         preferences and suggests shows and movies accordingly. According to Netflix,
         around 75% of viewer activity is based on personalized recommendations. Netflix
         generally collects data, which is enough to create a detailed profile of its
         subscribers or customers. This profile helps them to know their customers better
         and in the growth of the business.
      • Big data at Google: Google uses Big data to optimize and refine its core search and
         ad-serving algorithms. And Google continually develops new products and services
         that have Big data algorithms. Google generally uses Big data from its Web index
           to initially match the queries with potentially useful results. It uses machine-
           learning algorithms to assess the reliability of data and then ranks the sites
           accordingly. Google optimized its search engine to collect the data from us as we
           browse the Web and show suggestions according to our preferences and interests.
      • Big data at LinkedIn: LinkedIn is mainly for professional networking. It generally
           uses Big data to develop product offerings such as people you may know, who
           have viewed your profile, jobs you may be interested in, and more. LinkedIn uses
           complex algorithms, analyzes the profiles, and suggests opportunities according to
           qualification and interests. As the network grows moment by moment, LinkedIn’s
           rich trove of information also grows more detailed and comprehensive.
      • Big Data at Uber: Uber is the first choice for people around the world when they
           think of moving people and making deliveries. It uses the personal data of the user
           to closely monitor which features of the service are most used, to analyze usage
           patterns and to determine where the services should be more focused. Uber
           focuses on the supply and demand of the services, due to which the prices of the
           services provided change. Therefore one of Uber’s biggest uses of data is surge
           pricing. For instance, if you are running late for an appointment and you book a
           cab in a crowded place, then you must be ready to pay twice the amount. For
           example, on New Year’s Eve, the price for driving one mile can go from 200 to
           1000. In the short term, surge pricing affects the rate of demand, while long-term
           use could be the key to retaining or losing customers. Machine learning algorithms
           are used to determine where the demand is strong.
Real Time Sentiment Analysis: Real-time Sentiment Analysis is a machine learning
technique that automatically recognizes and extracts the sentiment in a text whenever it
occurs. It is most commonly used to analyze brand and product mentions in live social
comments and posts. An important thing to note is that real-time sentiment analysis can
only be done with social media platforms that share live feeds, as Twitter does.
       The real-time sentiment analysis process uses several ML tasks such as natural
language processing, text analysis, semantic clustering, etc to identify opinions expressed
about brand experiences in live feeds and extract business intelligence from them.
       Need for real-time sentiment analysis
               • Live social feeds from video platforms like Instagram or Facebook
                • Real-time sentiment analysis of text feeds from platforms such as Twitter is
                     helpful in detecting threats such as cyberbullying.
               • Live monitoring of Influencer live streams.
               • Live video streams of interviews, news broadcasts, seminars, panel
                    discussions, speaker events, and lectures.
               • Live audio streams such as in virtual meetings on Zoom or Skype, or at
                    product support call centers for customer feedback analysis.
               • Live monitoring of product review platforms for brand mentions.
        Way to do it (a minimal sketch follows these steps)
            • Step 1 - Data collection
             •   Step 2 - Data processing
             •   Step 3 - Data analysis
             •   Step 4 - Data visualization
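A minimal Python sketch of steps 1 to 4 on a toy scale, using a tiny hand-made word lexicon rather than a trained ML model; the word lists and the sample feed are illustrative assumptions.

```python
# Tiny lexicon-based scorer; the word lists and sample feed are illustrative
# assumptions, not a production sentiment model.
POSITIVE = {"love", "great", "excellent", "fast", "good"}
NEGATIVE = {"hate", "slow", "bad", "terrible", "broken"}

def sentiment(text):
    """Return a score in [-1, 1]: +1 if only positive words, -1 if only negative."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

def process_feed(live_feed):
    """Steps 2-4 in miniature: score each post as it arrives and report it."""
    for post in live_feed:
        score = sentiment(post)
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        print(f"{label:8s} {score:+.2f}  {post}")

# Step 1 in miniature: a hypothetical batch of collected live comments.
process_feed([
    "Love the new update, great battery life",
    "App is slow and the login is broken",
    "Shipping was fast but support was terrible",
])
```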
Stock Market Prediction: The stock market is full of uncertainty and is affected by many
factors. Hence, stock market prediction is one of the important tasks in finance and
business. There are two types of analysis possible for prediction: technical and
fundamental. Here, both technical and fundamental analysis are considered. Technical
analysis is done using historical data of stock prices by applying machine learning, and
fundamental analysis is done using social media data by applying sentiment analysis. Social
media data has a higher impact today than ever, and it can aid in predicting the trend of
the stock market. The method involves collecting news and social media data and
extracting the sentiments expressed by individuals. Then the correlation between the
sentiments and the stock values is analyzed. The learned model can then be used to make
future predictions about stock values. It can be shown that this method is able to predict
sentiment and stock performance, and that stock values and recent news and social data
are closely correlated.
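As a toy illustration of the step that analyzes the correlation between sentiment and stock values, the sketch below uses made-up daily sentiment scores and returns (not real market data) and the Pearson correlation from Python's standard library (available in Python 3.10+).

```python
import statistics

# Hypothetical daily data: average social-media sentiment score and the
# stock's daily return (%) for the same days; all values are placeholders.
sentiment_scores = [0.30, -0.10, 0.55, 0.20, -0.40, 0.10, 0.60]
daily_returns    = [0.80, -0.50, 1.40, 0.30, -1.20, 0.10, 1.60]

# Pearson correlation between daily sentiment and daily returns.
corr = statistics.correlation(sentiment_scores, daily_returns)
print(f"sentiment/return correlation: {corr:.2f}")
```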