UNIT 1
Introduction:
Introduction to big data
Data that is very large in size is called Big Data. Normally we work with data of size MB (Word documents, Excel sheets) or at most GB (movies, code), but data on the scale of petabytes, i.e. 10^15 bytes, is called Big Data. It is often stated that almost 90% of today's data has been generated in the past 3 years.
Sources of Big Data
This data comes from many sources, such as:
Social networking sites: Facebook, Google, and LinkedIn all generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge amounts of logs from which users' buying trends can be traced.
Weather stations: All the weather stations and satellites produce very large volumes of data, which are stored and processed to forecast the weather.
Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly, and for this they store the data of their millions of users.
Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
3V's of Big Data
Velocity: The data is increasing at a very fast rate. It is estimated that the volume of data will double every 2 years.
Variety: Nowadays data is not stored only in rows and columns. Data is structured as well as unstructured. Log files and CCTV footage are unstructured data; data that can be saved in tables, such as the transaction data of a bank, is structured data.
Volume: The amount of data we deal with is very large, on the order of petabytes.
Use case
An e-commerce site XYZ (with 100 million users) wants to offer a gift voucher of $100 to its top 10 customers who have spent the most in the previous year. Moreover, it wants to find the buying trend of these customers so that the company can suggest more items relevant to them.
Issues
A huge amount of unstructured data needs to be stored, processed, and analyzed.
Solution
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and store data in a distributed fashion. It works on the "write once, read many times" principle.
Processing: The MapReduce paradigm is applied to the data distributed over the network to find the required output (a small sketch is given below).
Analysis: Pig and Hive can be used to analyze the data.
Cost: Hadoop is open source, so cost is no longer an issue.
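To make the MapReduce step concrete, the following is a minimal single-machine Python sketch that simulates the map and reduce phases for the gift-voucher use case. The input format (one "customer_id,amount" line per transaction) and the file names in the usage note are assumptions made purely for illustration; in a real Hadoop job the mapper and reducer would run as separate tasks distributed across the cluster.

import sys
from collections import defaultdict

def mapper(lines):
    # Map phase: emit (customer_id, amount) pairs from raw transaction lines.
    for line in lines:
        customer_id, amount = line.strip().split(",")
        yield customer_id, float(amount)

def reducer(pairs):
    # Reduce phase: sum the amounts per customer and keep the top 10 spenders.
    totals = defaultdict(float)
    for customer_id, amount in pairs:
        totals[customer_id] += amount
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]

if __name__ == "__main__":
    # Local simulation: read transaction lines from stdin and print the top 10 customers.
    for customer_id, total in reducer(mapper(sys.stdin)):
        print(f"{customer_id}\t{total:.2f}")

Run locally as, for example, python top_customers.py < transactions.csv (both names hypothetical) to print the 10 highest-spending customer IDs with their totals.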
Introduction to Big Data Platform
The constant stream of information from various sources is becoming more intense [4], especially with advances in technology. This is where big data platforms come in: they store and analyze the ever-increasing mass of information.
A big data platform is an integrated computing solution that combines numerous software
systems, tools, and hardware for big data management. It is a one-stop architecture that solves
all the data needs of a business regardless of the volume and size of the data at hand. Due to
their efficiency in data management, enterprises are increasingly adopting big data platforms
to gather tons of data and convert them into structured, actionable business insights[5].
Currently, the marketplace is flooded with numerous open-source and commercially available
big data platforms. They boast different features and capabilities for use in a big data
environment.
Challenges of Conventional Systems
Uncertainty of the data management landscape: Because big data is continuously expanding, new companies and technologies are being developed every day. A big challenge for companies is to find out which technology works best for them without introducing new risks and problems.
The big data talent gap: While big data is a growing field, there are very few experts available in it. This is because big data is a complex field, and people who understand its complexity and intricate nature are few and far between. This talent gap is another major challenge for the industry.
Getting data into the big data platform: Data is increasing every single day. This means that companies have to tackle an almost limitless amount of data on a regular basis. The scale and variety of data available today can overwhelm any data practitioner, which is why it is important to make data accessibility simple and convenient for brand managers and owners.
Need for synchronisation across data sources: As data sets become more diverse, there is a need to incorporate them into an analytical platform. If this is ignored, it can create gaps and lead to wrong insights and messages.
Getting important insights through the use of big data analytics: It is important that companies gain proper insights from big data analytics, and that the correct department has access to this information. A major challenge in big data analytics is bridging this gap in an effective fashion.
Intelligent data analysis
•   Intelligent data analysis is the process of finding and identifying the meaning of data.
•   The main advantage of visual representations is to discover patterns, make sense of data, and communicate it.
Data
Data is nothing but things that are known or assumed: facts from which conclusions can be drawn.
Data Analysis
•   Breaking up of data into parts, i.e., the examination of these parts to know about their nature, proportion, function, interrelationship, etc.
•   A process in which the analyst moves laterally and recursively between three modes: describing data (profiling, correlation, summarizing), assembling data (scrubbing, translating, synthesizing, filtering), and creating data (deriving, formulating, simulating); a small sketch of these three modes follows below.
•   It is a way of making sense of data: the process of finding and identifying the meaning of data.
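A minimal sketch of these three modes in Python with pandas is shown below; the tiny in-memory dataset and its column names (customer_id, amount, country) are invented purely for illustration.

import pandas as pd

# An invented toy dataset, purely for illustration.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [120.0, 80.0, 35.5, 500.0, 20.0, 45.0],
    "country": ["IN", "IN", "US", "UK", "UK", "UK"],
})

# Describing data: profiling and summarizing.
print(orders.describe())                 # basic statistics for the numeric columns
print(orders["country"].value_counts())  # frequency profile of a categorical column

# Assembling data: scrubbing and filtering.
clean = orders[orders["amount"] > 0].drop_duplicates()

# Creating data: deriving a new column and a per-customer summary.
clean = clean.assign(amount_with_tax=clean["amount"] * 1.18)  # invented tax rate
per_customer = clean.groupby("customer_id")["amount"].sum().rename("total_spent")
print(per_customer.sort_values(ascending=False))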
Data Visualization
•   It is a process of revealing already existing data and/or its features (origin, metadata, allocation), which includes everything from tables to charts and multidimensional animation (Min Yao, 2014).
•   It helps to form a mental image of something that is not present to the sight.
•   Visual data analysis is another form of data analysis, in which some or all forms of data visualization may be used to give feedback to the analyst. Visual cues such as charts, interactive browsing, and workflow process cues help the analyst move through the modes of data analysis.
•   The main advantage of visual representations is to discover, make sense of, and communicate data. Data visualization is a central part of, and an essential means to carry out, data analysis; once the meanings have been identified and understood, it is easy to communicate them to others. A minimal plotting sketch follows below.
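As a minimal sketch of how a simple chart supports this kind of visual comparison, the Python/matplotlib example below plots invented monthly sales figures; all numbers and labels are placeholders.

import matplotlib.pyplot as plt

# Invented monthly sales figures, purely for illustration.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 90, 160, 175, 150]

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(months, sales, color="steelblue")   # a bar chart makes month-to-month comparison easy
ax.set_xlabel("Month")
ax.set_ylabel("Sales (thousand $)")
ax.set_title("Monthly sales: a simple visual comparison")
plt.tight_layout()
plt.show()                                 # or fig.savefig("sales.png")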
Importance of IDA:
Intelligent Data Analysis (IDA) is one of the major topics in artificial intelligence and information science. Intelligent data analysis discloses hidden facts that were not known previously and provides potentially important information or facts from large quantities of data (White, 2008). It also helps in decision making. Based mainly on machine learning, artificial intelligence, pattern recognition, and database and visualization technology, IDA helps to obtain useful information, necessary data, and interesting models from the large amounts of data available online in order to make the right choices.
Intelligent data analysis also helps with problems that are already solved as a matter of routine: if data has been collected for past cases together with the results that were finally achieved, such data can be used to revise and optimize the presently used strategy for arriving at a conclusion.
In other cases, where a question arises for the first time and only a little knowledge about it is available, data from related situations can help to solve the new problem, or unknown relationships can be discovered from the data to gain knowledge in an unfamiliar area.
Steps Involved In IDA:
IDA, in general, includes three stages: (1) preparation of data; (2) data mining; (3) data validation and explanation (Keim & Ward, 2007). The preparation of data involves selecting the required data from the relevant data source and incorporating it into a data set that can be used for data mining.
The main goal of intelligent data analysis is to obtain knowledge. Data analysis combines extracting data from a data set, analyzing it, classifying and organizing it, reasoning over it, and so on. It is challenging to choose methods that are suitable for the complexity of the process.
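A minimal sketch of these three stages in Python with pandas is given below; the file name sensor_readings.csv and its columns (temperature, humidity) are hypothetical placeholders.

import pandas as pd

# Stage 1: data preparation -- select the required fields and clean them.
raw = pd.read_csv("sensor_readings.csv")    # hypothetical file with temperature and humidity columns
prepared = raw.dropna(subset=["temperature", "humidity"])

# Stage 2: data mining -- look for a simple pattern, here a linear relationship.
correlation = prepared["temperature"].corr(prepared["humidity"])

# Stage 3: validation and explanation -- check the finding on a hold-out sample
# and express it in a form a domain expert can judge.
holdout = prepared.sample(frac=0.2, random_state=0)
holdout_corr = holdout["temperature"].corr(holdout["humidity"])
print(f"Correlation on full data: {correlation:.2f}, on hold-out sample: {holdout_corr:.2f}")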
Regarding the term visualization, we have moved away from visualization to use the term charting. The term analysis is used for the method of incorporating, influencing, filtering, and scrubbing the data, which certainly includes, but is not limited to, interacting with the data through charts.
The Goal of Data Analysis:
Data analysis need not necessarily involve arithmetic or statistics. While it is true that analysis often involves one or both, and that many analytical pursuits cannot be handled without them, much of the data analysis that people perform in the course of their work involves mathematics no more complicated than the calculation of the mean of a set of values. The essential activity of analysis is comparison (of values, patterns, etc.), which can often be done simply by using our eyes.
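As a small illustration that useful analysis often needs nothing more complicated than a mean and a comparison, consider the following Python sketch; the weekly sales figures for the two branches are invented.

# Compare average weekly sales for two invented store branches.
branch_a = [1200, 1350, 1100, 1500, 1250]
branch_b = [900, 1400, 1600, 1700, 1800]

mean_a = sum(branch_a) / len(branch_a)
mean_b = sum(branch_b) / len(branch_b)

print(f"Branch A average: {mean_a:.0f}")   # 1280
print(f"Branch B average: {mean_b:.0f}")   # 1480
print("Higher average:", "A" if mean_a > mean_b else "B")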
The aim of analysis is not merely to find appealing information in the data; that is only one part of the process (Berthold & Hand, 2003). The aim is to make sense of data (i.e., to understand what it means) and then to make decisions based on the understanding that is achieved. Information in and of itself is not useful; even understanding information in and of itself is not useful. The aim of data analysis is to make better decisions.
The process of data analysis starts with the collection of data that can add to the solution of a given problem, and with the organization of that data in some regular form. It involves identifying and applying a statistical or deterministic schema or model of the data that can be manipulated for explanatory or predictive purposes. It then involves an interactive or automated solution that explores the structured data in order to extract information (a solution to the business problem) from the data.
    The Goal of Visualization
The basic idea of visual data mining is to present the data in some visual form, allowing the user to gain insight into the data, draw conclusions, and interact directly with the data. Visual data analysis techniques have proven to be of high value in exploratory data analysis. Visual data mining is especially helpful when little is known about the data and the exploration goals are vague.
The main advantages of visual data exploration over automatic data analysis methods are:
    •   Visual data exploration can easily deal with highly non-homogeneous and noisy data.
    •   Visual data exploration is intuitive and requires no knowledge of complex mathematical or statistical algorithms or parameters.
    •   Visualization can present a qualitative outline of the data, allowing data phenomena to be isolated for further quantitative analysis. Accordingly, visual data exploration usually allows quicker data investigation and often provides interesting results, especially in cases where automatic algorithms fail.
    •   Visual data exploration techniques provide a much higher degree of confidence in the findings of the exploration.
    Conclusion
The examination of large data sets is an important but complicated problem. Information visualization techniques can be helpful in solving this problem. Visual data exploration is useful for many purposes; for example, fraud detection systems and data mining can make use of data visualization technology for improved data analysis.
    Nature of Data
    Big Data Characteristics
Big Data refers to a large amount of data that cannot be processed by traditional data storage or processing units. It is used by many multinational companies to process data and run the business of many organizations. The data flow would exceed 150 exabytes per day before replication.
There are five V's of Big Data that explain its characteristics.
    5 V's of Big Data
o   Volume
o   Veracity
o   Variety
o   Value
o   Velocity
    Volume
The name Big Data itself is related to enormous size. Big Data refers to the vast volumes of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.
Facebook alone generates approximately a billion messages, 4.5 billion "Like" button records, and more than 350 million new posts each day. Big data technologies can handle such large amounts of data.
   Variety
Big Data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
The data is categorized as below:
a. Structured data: Structured data follows a schema, along with all the required columns. It is in tabular form and is stored in relational database management systems.
b. Semi-structured data: In semi-structured data, the schema is not rigidly defined, e.g., JSON, XML, CSV, TSV, and email. Such data carries some organizational markers (tags or key-value pairs) but does not fit neatly into relational tables.
c. Unstructured data: All unstructured files, such as log files, audio files, and image files, are included in unstructured data. Some organizations have much of this data available, but they do not know how to derive value from it since the data is raw.
d. Quasi-structured data: This format contains textual data with inconsistent formats that can be structured with some effort, time, and tools.
Example: web server logs, i.e., a log file created and maintained by a server that contains a list of activities. A small parsing sketch is given below.
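The short Python sketch below illustrates the difference in practice: a JSON record (semi-structured) parses directly because it carries its own field names, while an Apache-style web server log line (quasi-structured) needs a regular expression to impose structure. Both example records are invented.

import json
import re

# Semi-structured data: a JSON record carries its own tags (field names).
record = '{"user": "alice", "action": "purchase", "amount": 49.99}'
parsed = json.loads(record)
print(parsed["user"], parsed["amount"])

# Quasi-structured data: a web server log line has a recognisable but inconsistent
# textual format that needs effort (here a regular expression) to structure.
log_line = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
pattern = r'(\S+) \S+ \S+ \[(.*?)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+)'
match = re.match(pattern, log_line)
if match:
    ip, timestamp, method, path, status, size = match.groups()
    print(ip, method, path, status)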
   Veracity
Veracity means how reliable the data is. There are many ways to filter or translate the data; veracity is about being able to handle and manage data efficiently, which is also essential in business development.
For example, Facebook posts with hashtags.
Value
Value is an essential characteristic of big data. It is not just about the data that we process or store; what matters is storing, processing, and analyzing data that is valuable and reliable.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which data is created in real time. It covers the rate of incoming data sets, the rate of change, and activity bursts. A primary aspect of Big Data is to provide the demanded data rapidly.
Big data velocity deals with the speed at which data flows in from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
Analytic Processes and Tools
There are hundreds of data analytics tools in the market today, but the selection of the right tool depends upon your business need, goals, and the variety of your data to move the business in the right direction. Now, let's check out the top 10 analytics tools in big data.
1. APACHE Hadoop
It’s a Java-based open-source platform that is being used to store and process big data. It is
built on a cluster system that allows the system to process data efficiently and let the data run
parallel. It can process both structured and unstructured data from one server to multiple
    computers. Hadoop also offers cross-platform support for its users. Today, it is the best big
    data analytic tool and is popularly used by many tech giants such as Amazon, Microsoft,
    IBM, etc.
    Features of Apache Hadoop:
•   Free to use and offers an efficient storage solution for businesses.
•   Offers quick access via HDFS (Hadoop Distributed File System).
• Highly flexible and can be easily integrated with MySQL and JSON.
•   Highly scalable as it can distribute a large amount of data in small segments.
• It works on small commodity hardware like JBOD (just a bunch of disks).
    2. Cassandra
APACHE Cassandra is an open-source NoSQL distributed database used to fetch large amounts of data. It's one of the most popular tools for data analytics and has been praised by many tech companies for its high scalability and availability without compromising speed and performance. It is capable of delivering thousands of operations every second and can handle petabytes of data with almost zero downtime. It was created by Facebook back in 2008 and was later released publicly. A minimal Python connection sketch follows the feature list below.
    Features of APACHE Cassandra:
• Data Storage Flexibility: It supports all forms of data, i.e., structured, unstructured, and semi-structured, and allows users to make changes as per their needs.
• Data Distribution System: It is easy to distribute data by replicating it across multiple data centers.
• Fast Processing: Cassandra has been designed to run on efficient commodity hardware and also offers fast storage and data processing.
• Fault tolerance: If any node fails, it is replaced without any delay.
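The sketch below shows, roughly, how an application might talk to Cassandra from Python using the DataStax cassandra-driver package; the contact point, keyspace, and table are hypothetical and assume a Cassandra node running locally.

from cassandra.cluster import Cluster

# Connect to a locally running Cassandra node (contact point is an assumption).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# Hypothetical keyspace and table, purely for illustration.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")

# Insert a row and read it back.
session.execute("INSERT INTO demo.users (id, name) VALUES (%s, %s)", (1, "alice"))
for row in session.execute("SELECT id, name FROM demo.users"):
    print(row.id, row.name)

cluster.shutdown()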
    3. Qubole
    It’s an open-source big data tool that helps in fetching data in a value of chain using ad-hoc
    analysis in machine learning. Qubole is a data lake platform that offers end-to-end service
    with reduced time and effort which are required in moving data pipelines. It is capable of
    configuring multi-cloud services such as AWS, Azure, and Google Cloud. Besides, it also
    helps in lowering the cost of cloud computing by 50%.
    Features of Qubole:
• Supports the ETL process: It allows companies to migrate data from multiple sources into one place.
• Real-time Insight: It monitors users' systems and allows them to view real-time insights.
• Predictive Analysis: Qubole offers predictive analysis so that companies can take actions accordingly for targeting more acquisitions.
• Advanced Security System: To protect users' data in the cloud, Qubole uses an advanced security system and helps protect against future breaches. It also allows cloud data to be encrypted against any potential threat.
    4. Xplenty
It is a data analytics tool for building data pipelines with minimal code. It offers a wide range of solutions for sales, marketing, and support. With the help of its interactive graphical interface, it provides solutions for ETL, ELT, etc. The best part of using Xplenty is its low investment in hardware and software, and it offers support via email, chat, telephone, and virtual meetings. Xplenty is a platform to process data for analytics over the cloud and brings all the data together.
    Features of Xplenty:
• REST API: A user can do almost anything by using the REST API.
• Flexibility: Data can be sent and pulled to databases, warehouses, and Salesforce.
• Data Security: It offers SSL/TLS encryption, and the platform is capable of verifying algorithms and certificates regularly.
• Deployment: It offers integration apps for both cloud and on-premises use and supports deployment of integration apps over the cloud.
    5. Spark
APACHE Spark is another framework used to process data and perform numerous tasks on a large scale. It is also used to process data across multiple computers with the help of distributed computing tools. It is widely used among data analysts as it offers easy-to-use APIs that provide easy data-pulling methods, and it is capable of handling multiple petabytes of data as well. Spark famously set a record by processing 100 terabytes of data in just 23 minutes, breaking Hadoop's previous world record of 71 minutes. This is the reason why big tech giants are moving towards Spark now, and it is highly suitable for ML and AI today. A minimal PySpark sketch follows the feature list below.
    Features of APACHE Spark:
• Ease of use: It allows users to work in their preferred language (Java, Python, etc.).
• Real-time Processing: Spark can handle real-time streaming via Spark Streaming.
• Flexible: It can run on Mesos, Kubernetes, or the cloud.
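The following PySpark sketch illustrates this kind of easy-to-use API by aggregating spend per customer; the input file orders.csv and its columns (customer_id, amount) are assumptions made for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; in production this would point at a cluster.
spark = SparkSession.builder.appName("spend-per-customer").getOrCreate()

# Hypothetical input file with customer_id and amount columns.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

top_customers = (
    orders.groupBy("customer_id")
          .agg(F.sum("amount").alias("total_spent"))
          .orderBy(F.desc("total_spent"))
          .limit(10)
)
top_customers.show()

spark.stop()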
6. MongoDB
Coming into the limelight in 2010, MongoDB is a free, open-source, document-oriented (NoSQL) database used to store high volumes of data. It uses collections and documents for storage, and its documents consist of key-value pairs, which are the basic unit of MongoDB. It is popular among developers due to its support for multiple programming languages such as Python, JavaScript, and Ruby. A minimal Python sketch follows the feature list below.
Features of MongoDB:
• Written in C++: It's a schema-less DB and can hold a variety of documents.
• Simplifies the stack: With MongoDB, a user can easily store files without any disturbance in the stack.
• Master-slave replication: It can write/read data from the master, which can be called back for backup.
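A minimal sketch of storing and querying documents from Python with the pymongo driver is shown below; the connection string, database, and collection names are hypothetical and assume a MongoDB instance running locally.

from pymongo import MongoClient

# Connect to a local MongoDB instance (connection string is an assumption).
client = MongoClient("mongodb://localhost:27017/")
db = client["shop"]        # hypothetical database
orders = db["orders"]      # hypothetical collection

# Documents are schema-less key-value structures.
orders.insert_one({"customer_id": 1, "item": "laptop", "amount": 950.0})
orders.insert_one({"customer_id": 2, "item": "phone", "amount": 420.0})

# Query the documents back; find() returns a cursor over matching documents.
for doc in orders.find({"amount": {"$gt": 500}}):
    print(doc["customer_id"], doc["item"], doc["amount"])

client.close()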
    7. Apache Storm
Apache Storm is a robust, user-friendly tool used for data analytics, especially in small companies. The best part about Storm is that it has no programming language barrier and can support any of them. It was designed to handle pools of large data with fault-tolerant and horizontally scalable methods. When it comes to real-time data processing, Storm leads the chart because of its distributed real-time big data processing system, due to which many tech giants use APACHE Storm in their systems. Some of the most notable names are Twitter, Zendesk, NaviSite, etc.
    Features of Storm:
• Data Processing: Storm processes the data even if a node gets disconnected.
• Highly Scalable: It maintains its performance even as the load increases.
• Fast: The speed of APACHE Storm is impeccable; it can process up to 1 million messages of 100 bytes each on a single node.
    8. SAS
Today it is one of the best tools for statistical modeling used by data analysts. Using SAS, a data scientist can mine, manage, extract, or update data in different variants from different sources. SAS (Statistical Analysis System) allows a user to access data in any format (SAS tables or Excel worksheets). Besides that, it also offers a cloud platform for business analytics called SAS Viya, and to get a strong grip on AI and ML, they have introduced new tools and products.
    Features of SAS:
• Flexible Programming Language: It offers easy-to-learn syntax and also has vast libraries, which make it suitable for non-programmers.
• Vast Data Format Support: It provides support for many programming languages, including SQL, and carries the ability to read data from any format.
• Encryption: It provides end-to-end security with a feature called SAS/SECURE.
9. Datapine
Datapine is an analytical tool for BI and was founded back in 2012 in Berlin, Germany. In a short period of time, it has gained much popularity in a number of countries, and it is mainly used for data extraction (for small and medium companies fetching data for close monitoring). With the help of its enhanced UI design, anyone can visit and check the data as per their requirements. It is offered in 4 different price brackets, starting from $249 per month, with dashboards organized by function, industry, and platform.
    Features of Datapine:
• Automation: To cut down on manual work, datapine offers a wide array of AI assistant and BI tools.
• Predictive Tool: datapine provides forecasting/predictive analytics; using historical and current data, it derives the future outcome.
• Add-ons: It also offers intuitive widgets, visual analytics and discovery, ad hoc reporting, etc.
10. RapidMiner
It's a fully automated visual workflow design tool used for data analytics. It's a no-code platform, and users aren't required to write code to segregate data. Today, it is heavily used in many industries such as ed-tech, training, research, etc. Although it is an open-source platform, it has a limitation of 10,000 data rows and a single logical processor. With the help of RapidMiner, one can easily deploy ML models to the web or mobile (only when the user interface is ready to collect real-time figures).
Features of RapidMiner:
• Accessibility: It allows users to access 40+ types of files (SAS, ARFF, etc.) via URL.
• Storage: Users can access cloud storage facilities such as AWS and Dropbox.
• Data Validation: RapidMiner enables the visual display of multiple results in history for better evaluation.
Analysis vs Reporting
Analytics: Analytics is the method of examining and analyzing summarized data to make business decisions.
Reporting: Reporting is an action that includes all the needed information and data, put together in an organized way.

Analytics: Questioning the data, understanding it, investigating it, and presenting it to the end users are all part of analytics.
Reporting: Identifying business events, gathering the required information, organizing, summarizing, and presenting existing data are all part of reporting.

Analytics: The purpose of analytics is to draw conclusions based on data.
Reporting: The purpose of reporting is to organize the data into meaningful information.

Analytics: Analytics is used by data analysts, scientists, and business people to make effective decisions.
Reporting: Reporting is provided to the appropriate business leaders so they can perform effectively and efficiently within a firm.