Bda - Unit 5

UNIT – V
Big Data Visualization:
Introduction to Data visualization, Challenges to Big data visualization, Types of data visualization,
Visualizing Big Data, Tools used in data visualization, Proprietary Data Visualization tools, Open-
source data visualization tools, Data visualization with Tableau.

Data visualization
Data visualization is the graphical representation of data or information. Visual
elements such as charts, graphs, and maps provide viewers with an easy and accessible
way of understanding the represented information.
Data visualization tools and technologies are essential for analyzing massive amounts of
information and making data-driven decisions.

Importance of Data Visualization:


● Easily graspable information – Data is increasing day by day, and it is not practical for
anyone to comb through such quantities of data to understand it. Data visualization
comes in handy here.
● Establish relationships – Charts and graphs not only show the data but also
reveal correlations between different data types and pieces of information.
● Easy sharing – Data visualizations are also easy to share with others. You could present any
important fact about a market trend using a chart, and your team would be more
receptive to it.
● Interactive visualization – With technological innovations making waves in every
market segment, big or small, you can leverage interactive
visualization to dig deeper and segment different portions of charts and graphs to
obtain a more detailed analysis of the information being presented.
● Intuitive, personalized, updatable – Data visualizations are interactive: you can click on one
and get a closer view of a particular information segment. They can also be tailored
to the target audience and easily updated when the information changes.

Example of Data visualization:


Profit and loss – Businesses often resort to pie charts or bar graphs to show their
annual profit or loss margins.
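The idea can be sketched with a few lines of Python; the quarterly profit figures below are invented purely for illustration:

```python
# A text-based sketch of the profit bar chart idea, using made-up
# quarterly figures (illustrative data, not from any real company).
profits = {"Q1": 12, "Q2": 7, "Q3": 15, "Q4": 4}  # profit in millions

def ascii_bar_chart(data, scale=1):
    """Render one bar per category; bar length is proportional to the value."""
    lines = []
    for label, value in data.items():
        lines.append(f"{label} | {'#' * (value // scale)} {value}")
    return "\n".join(lines)

print(ascii_bar_chart(profits))
```

Even this crude chart makes the best and worst quarters visible at a glance, which is the whole point of visualizing the numbers instead of tabulating them.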

Challenges of Big Data Visualization

Scalability and dynamics are two major challenges in visual analytics.


Visualization-based methods take the challenges presented by the "four Vs" of big data and
turn them into the following opportunities:
● Volume: The methods are developed to work with an immense number of datasets and
enable users to derive meaning from large volumes of data.
● Variety: The methods are developed to combine as many data sources as needed.
● Velocity: With these methods, businesses can replace batch processing with real-time stream
processing.
● Value: The methods not only enable users to create attractive infographics and heat maps
but also create business value by deriving insights from big data.

Big data often has unstructured formats. Due to bandwidth limitations and power requirements,
visualization should move closer to the data to extract meaningful information efficiently.
Effective data visualization is a key part of the discovery process in the era of big data. For the
challenges of high complexity and high dimensionality in big data, there are different
dimensionality reduction methods.
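As a concrete illustration of the dimensionality-reduction idea, the following stdlib-only sketch performs PCA in its simplest case: it projects 2-D points onto their principal axis, turning two correlated columns into one while keeping most of the variance. The data points are made up for illustration:

```python
import math

# Project 2-D points onto their direction of maximum variance
# (the principal axis), reducing each point to a single coordinate.
def project_to_1d(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Covariance terms of the centered data.
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Principal-axis angle for the 2-D case.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    return [(x - mx) * ux + (y - my) * uy for x, y in points]

# Points lying exactly on the line y = 2x collapse to one coordinate each.
coords = project_to_1d([(0, 0), (1, 2), (2, 4), (3, 6)])
print(coords)
```

For real high-dimensional big data one would use a library implementation (e.g. an SVD-based PCA), but the principle is the same: replace many correlated dimensions with a few that preserve the structure worth visualizing.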
There are also the following problems for big data visualization:
● Visual noise: Most objects in a dataset are too similar to one another, so users cannot
distinguish them as separate objects on the screen.
● Information loss: Reducing the visible data set is possible, but it leads to information loss.
● Large image perception: Data visualization methods are limited not only by the aspect ratio
and resolution of the device but also by the limits of human perception.
● High rate of image change: Users observe the data but cannot react to the number or
intensity of data changes on the display.
● High performance requirements: This is hardly noticeable in static visualization, which has
lower speed requirements, but it becomes a real constraint in dynamic visualization.
In Big Data applications, data visualization is difficult to conduct because of the large size and
high dimensionality of the data. Most current Big Data visualization tools perform poorly in
scalability, functionality, and response time. Uncertainty, which can arise at any stage of a
visual analytics process, poses a further challenge to effective uncertainty-aware
visualization.

The following are potential solutions to some of these visualization and big data challenges:
● Meeting the need for speed: One possible solution is hardware: increased memory and
powerful parallel processing. Another method is keeping data in-memory while
using a grid computing approach, in which many machines are used.

● Understanding the data: One solution is to have the proper domain expertise in place.

● Addressing data quality: It is necessary to ensure the data is clean through the process of
data governance or information management.

● Displaying meaningful results: One way is to cluster data into a higher-level view where
smaller groups of data are visible and the data can be effectively visualized.

● Dealing with outliers: Possible solutions are to remove the outliers from the data or create a
separate chart for the outliers.
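One hedged sketch of the "remove the outliers" option, using the common 1.5 × IQR rule of thumb (a convention, not the only definition of an outlier; the data values are invented):

```python
import statistics

# Flag values outside the 1.5 * IQR fences so they can be dropped
# or plotted on a separate chart.
def split_outliers(values):
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inliers = [v for v in values if lo <= v <= hi]
    outliers = [v for v in values if v < lo or v > hi]
    return inliers, outliers

data = [10, 12, 11, 13, 12, 98]          # 98 is the obvious outlier
inliers, outliers = split_outliers(data)
print(inliers, outliers)
```

The inliers can then feed the main chart while the outliers go to their own chart, exactly as the second option above suggests.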

Types of Data Visualization:

Data can be visualized in many ways, such as in the form of 1D, 2D, or 3D structures. The table
below briefly describes the different types of data visualization:

Name – Description – Tools

● 1D/Linear – For example, a list of items organized in a predefined manner. Generally, no
tool is used for 1D visualization.

● 2D/Planar – For example, choropleth, cartogram, dot distribution map, and proportional
symbol map. Tools: GeoCommons, Google Fusion Tables, Google Maps API, Polymaps,
Many Eyes, Google Charts, and Tableau Public.

● 3D/Volumetric – For example, 3D computer models, surface rendering, volume rendering,
and computer simulations.

● Temporal – For example, timelines and time series. Tools: TimeFlow, Timeline JS, Excel,
Timeplot, TimeSearcher, Google Charts, Tableau Public, and Google Fusion Tables.

● Multidimensional – For example, pie chart, histogram, tag cloud, bubble cloud, bar chart,
scatter plot, and heat map. Tools: Many Eyes, Google Charts, Tableau Public, and Google
Fusion Tables.

● Tree/Hierarchical – For example, dendrogram, radial tree, hyperbolic tree, and wedge
stack graph. Tools: D3, Google Charts, and Network Workbench/Sci2.

● Network – For example, matrix, node-link diagram, hive plot, and tube map. Tools: Pajek,
Gephi, NodeXL, VOSviewer, UCINET, GUESS, Network Workbench/Sci2, sigma.js,
D3/Protovis, Many Eyes, and Google Fusion Tables.

As shown in the table, the simplest type of data visualization is the 1D representation, and the
most complex is the network representation. The following is a brief description of
each of these data visualizations:
▪ 1D (Linear) Data visualization – In linear data visualization, data is presented in the
form of lists. Strictly speaking, this is a data organization technique rather than
visualization, which is why no tool is required to visualize data in a linear manner.

▪ 2D (Planar) Data visualization – This technique presents data in the form of images,
diagrams, or charts on a plane surface.

▪ 3D (Volumetric) Data visualization – In this method, data presentation involves exactly
three dimensions to show simulations, surface and volume rendering, etc. Generally, it is
used in scientific studies. Today, many organizations use 3D computer modeling and volume
rendering in advertisements to give users a better feel for their products.

▪ Temporal Data visualization – Sometimes, visualizations are time dependent. Temporal
data visualization is used to show how analyses depend on time.

▪ Multidimensional Data visualization – In this type of data visualization, numerous
dimensions are used to present data.

▪ Tree/Hierarchical Data visualization – Sometimes, data relationships need to be shown in
the form of hierarchies. To represent such relationships, we use tree or hierarchical
data visualizations.

▪ Network Data visualization – It is used to represent data relations that are too complex to
be represented in the form of hierarchies.

Visualizing Big Data:

Data visualization is a great way to reduce the turnaround time consumed in interpreting big data.
Traditional analytical techniques are not enough to capture or interpret the information that big data
holds. Traditional tools were developed using relational models that work best with static
interaction. Big data is highly dynamic, and therefore most traditional tools cannot
generate quality results from it. The response time of traditional tools is also quite high, making
them unfit for quality interaction.

Deriving Business solution:


The most common notation used for big data is the 3 Vs – volume, velocity, and variety. But the
most exciting aspect is the way value is filtered from the haystack of data.

Nowadays, IT companies that use Big Data face the following challenges:

Most data is in unstructured form

Data is not analyzed in real time

The amount of data generated is huge

There is a lack of efficient tools and techniques

Considering these factors, IT companies are focusing more on the research and development of
robust algorithms, software, and tools to analyze the data scattered across the internet.

Turning Data into Information:

Visualization of big data often produces cluttered images, which are refined with the help of
clutter-reduction techniques. Uniform sampling and dimension reduction are two commonly
used clutter-reduction techniques.
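Uniform sampling can be sketched directly with the standard library; the point count and the sampling budget below are arbitrary illustrative numbers:

```python
import random

# Clutter reduction by uniform sampling: plot a fixed-size random
# subset of the points instead of all of them.
random.seed(42)                       # reproducible demo
points = [(random.random(), random.random()) for _ in range(5000)]

budget = 200                          # how many marks the chart can afford
sample = random.sample(points, budget)
print(len(sample))
```

Because the sample is uniform, the overall shape and density of the scatter are preserved while the number of marks on screen drops by a factor of 25.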

The visual data reduction process involves automated data analysis to measure density, outliers,
and their differences. These measures are then used as quality metrics to evaluate the
data-reduction activity.

Visual quality metrics can be categorized as:

Size metrics

Visual effectiveness metrics

Feature preservation metrics

A visual analytics tool should be:

Simple enough that even non-technical users can operate it

Interactive, to connect with different sources of data

Capable of creating appropriate visuals for interpretation

Able to interpret big data and share information

Apart from representing data, a visualization tool must be able to establish links between different
data values, restore missing data, and polish data for further analysis.

Tools Used in Data Visualization:

Some useful visualization tools are listed as follows:

EXCEL – A widely used spreadsheet tool that can also be used for data analytics. It helps you
track and visualize data to derive better insights. This tool provides various ways to share data
and analytical conclusions within and across organizations.
LastForward – It is open-source software provided by Last.fm for analyzing and visualizing
social music networks.

Digg.com – Digg.com provides some of the best web-based visualization tools.

Pics – This tool is used to track the activity of images on the website.

D3 – D3 allows you to bind arbitrary data to the Document Object Model (DOM) and then apply
data-driven transformations to the document. For example, you can use D3 to generate an
HTML table from an array of numbers, or use the same data to create an interactive SVG bar
chart with smooth transitions and interactions.

Rootzmap Mapping the internet – It is a tool to generate a series of maps on the basis of the
datasets provided by the National Aeronautics and Space Administration (NASA).
Open-source data visualization tools

Due to economic and infrastructural limitations, not every organization
can purchase all the applications required for analyzing data. Therefore, to fulfill their
requirement for advanced tools and technologies, organizations often turn to open-source
libraries. These libraries can be defined as pools of freely available applications and analytical
tools. Some examples of open-source tools available for data visualization are VTK,
Cave5D, ELKI, Tulip, Gephi, IBM OpenDX, Tableau Public, and Vis5D.

● Open-source tools are easy to use, consistent, and reusable.

● They deliver good performance and are compliant with web as well as
mobile web security.

Analytical Techniques used in Big Data Visualization:

Analytical techniques are used to analyze complex relationships among variables. The following
are some commonly used analytical techniques for big data solutions:

Regression analysis – A statistical tool used for prediction. Regression analysis is used to
predict continuous dependent variables from independent variables.

Types of regression analysis are as follows:

Ordinary least squares regression – used when the dependent variable is continuous and there
is some relationship between the dependent and independent variables.

Logistic regression – used when the dependent variable has only two potential outcomes.

Hierarchical linear modeling – used when data is in nested form.

Duration models – used to measure the length of a process.
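As a concrete illustration of ordinary least squares with a single independent variable, here is a stdlib-only sketch using the closed-form slope and intercept formulas; the advertising spend and sales figures are invented:

```python
# Fit y = a + b*x by the closed-form OLS formulas:
#   b = cov(x, y) / var(x),  a = mean(y) - b * mean(x)
def ols_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b                      # intercept, slope

spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]             # exactly sales = 1 + 2 * spend
a, b = ols_fit(spend, sales)
print(a, b)                          # → 1.0 2.0
```

Predicting a continuous dependent variable is then just `a + b * x` for a new value of the independent variable; real analyses would use a statistics library, but the fitted line is the same.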


Grouping methods – The technique of categorizing observations into significant or
purposeful blocks is called grouping. The recognition of features that create a distinction
between groups is called discriminant analysis.
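A toy sketch of the discriminant idea, assuming a nearest-centroid rule (one simple form of discriminant analysis, not the only one); the groups and observations are made up:

```python
import math

# Assign a new observation to the group whose centroid (mean point)
# it is closest to.
def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def classify(point, groups):
    """groups: dict mapping group name -> list of (x, y) observations."""
    return min(groups, key=lambda g: math.dist(point, centroid(groups[g])))

groups = {
    "small": [(1, 1), (2, 1), (1, 2)],
    "large": [(8, 9), (9, 8), (9, 9)],
}
print(classify((2, 2), groups))      # closest to the "small" centroid
```

The "features that create a distinction between groups" here are simply the two coordinates; with more features the same rule works in higher dimensions.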

Multiple equation models – used to analyze causal pathways from independent
variables to dependent variables. Types of multiple equation models are as follows:

o Path analysis

o Structural equation modeling

Data Visualization with Tableau

Introduction to Tableau:

Tableau is data visualization software that allows developers to build interactive dashboards
that are easily updated with new data and can be shared with a wider audience. There are various
types of Tableau products available in the market. Some of the commonly known products include
Tableau Desktop, Tableau Server, Tableau Online, Tableau Reader, and Tableau Public.
The important features of Tableau software include the following:
Single-click data analytics in visual form
In-depth statistical analysis
Management of metadata
Built-in, top-class data analytic practices
Built-in data engine
Big data analytics
Quick and accurate data discovery
Business dashboard creation
Various types of data visualization
Social media analytics, including Facebook and Twitter
Easy and quick integration of R
Business intelligence through mobile
Analysis of time series data
Analysis of data from surveys

Tableau software can be used in various industries and data environments.


Tableau has primarily been used within the following data environments:
All data sources
Amazon Redshift
Excel charts and graphs
Google Analytics
Google BigQuery
Hadoop
HP Vertica
SAP
Splunk

Tableau Desktop Workspace


Tableau's desktop environment is simple and easy to learn for almost anyone.
Tableau is an extremely efficient tool that can answer your questions about data analytics
quickly.
To use Tableau Desktop, you first need to download and install the tool on your computer.
Steps to download and install the Tableau software:
1. Open the link http://www.tableau.com/products/desktop in a web browser.
2. Download the trial version of Tableau Desktop by clicking the TRY IT FREE button.
3. Go to the directory on your computer where you stored the Tableau Desktop setup and
double-click the executable file. The Tableau Desktop installer, showing the Tableau version
number on top, will appear.
4. Select the "I have read and accept the terms of this License Agreement" check box to
enable the Install button.
5. Click the Install button to start the installation of Tableau Desktop.
6. Click the "Start trial now" option. This will ask for registration.
7. Click the Register button to open the registration form and provide all the required details.

Click the Continue button. The opening page of the Tableau tool appears.
Toolbar Icons:
Tableau is a GUI-oriented drag-and-drop tool. The following are the icons present on the
Tableau toolbar:
o Undo/Redo – Scrolls backward or forward through your steps. You can
retrieve any step by clicking the Undo/Redo button.

o File save – Saves your work. You need to click this button frequently,
as Tableau does not have an automated save function.

o Connect to a new data source – Connects you to a data source.

o New dashboard or worksheet – Adds a new page to your worksheet.

o Duplicate sheet – Creates an exact copy of the worksheet or
dashboard page that you are working on.

o Clear sheet – Allows you to clear the data of a sheet.

o Auto/manual update – Generates the visual. It is particularly helpful
for large datasets, where dragging and dropping items consumes time.

o Group – Allows you to group data by selecting more than one
header in a table or more than one value in a legend.

o Pivot worksheet – Allows you to create a pivot table on a new
worksheet.

o Ascending/descending sort – Sorts selected items in ascending or
descending order.

o Label marks – Turns mark labels on or off.

o Presentation mode – Hides/unhides design shelves. It is particularly
useful during presentations where you want to use Tableau as a presentation slide deck.

o Reset cards – Provides a menu to turn screen elements, such as
captions or summaries, on or off.

o Fit menu – Allows different views of the Tableau screen. You can fit the
screen either horizontally or vertically.

o Fit axis – Fixes the axis of the view. You can zoom charts in and out with this
button.

o Highlight control – Compares the selected combinations of dimensions.

Main menu

The main menu of Tableau contains the following options:

File – Contains general functions, such as Open, Save, and Save As. Other functions are Print to PDF
and the Repository Location function, used to review and change the default location of saved files.

Data – Helps to analyze tabular data on the Tableau website. The Edit Relationships option is
used to blend data when the field names in two data sources are not identical.

Worksheet – Provides options such as the Export option, Excel crosstab, and Duplicate as Crosstab.

Dashboard – Provides the Actions menu, the most important option on the Dashboard
menu, because all the actions related to Tableau worksheets and dashboards are defined within it.
The Actions menu is present under the Worksheet menu as well.

Story – Provides the New Story option, which is used for explaining the relationships among facts,
providing context to certain events, showing the dependency between decisions and outcomes, etc.

Analysis – Provides the Aggregate Measures and Stack Marks options. To create new measures or
dimensions, use Create Calculated Field or Edit Calculated Field.

Map – Provides options to change the color scheme and replace the default maps.

Format – Contains options such as cell size and workbook theme.

Server – Provides options to publish work on Tableau Server.

Window – Provides the Bookmark menu, which is used to create .tbm files that can be shared with
different users.
Help – Provides options to access Tableau's online manual, training videos, and sample workbooks.

Tableau Server
With Tableau Server, users can interact with dashboards on the server without installing anything on
their machines. Tableau Online is Tableau Server hosted by Tableau on a cloud platform.
Tableau Server also provides robust security for dashboards. The Tableau Server web-edit feature
allows authorized users to download and edit dashboards. Tableau Server allows users to
publish and share their data sources as live connections or extracts. It is highly secure for
visualizing data and leverages fast databases through live connections.

Tableau workbook and data source files

Depending on their utility and the amount of information they contain, Tableau saves and shares
files as:

● Tableau workbook – The default save type when you save your
work on the desktop. The extension of such files is .twb. Files with the extension
.twbx can be shared with people who do not have a Tableau Desktop license or who cannot
access the data source.

● Tableau data source – If you frequently connect to a specific data
source, or if you have manipulated the metadata of a data source, saving the file as a
Tableau data source is of great use. The extension of such a file is .tds, and it includes the
server address, password, and metadata.

● Tableau bookmark – If you want to share any specific file with others,
use a Tableau bookmark.

● Tableau data extract – Compresses your extracted data and improves
performance by incorporating more formulas and functions. The extension of a Tableau
data extract file is .tde.

Tableau charts

Tableau can create different types of univariate, bivariate, and multivariate charts.
The following are some of the common chart types that Tableau can create:

● Tables – Tables are an excellent choice for presenting data, as they
preserve all the information, which in turn minimizes the chances of
misinterpretation.

● Scatter plots – Scatter plots are used to describe the
relationship between two variables.

● Trend lines – Trend lines are used to analyze the relationship between
variables as well as to predict future outcomes.

● Bullet graph – A bullet graph is much like a bar graph and is generally
used in qualitative analysis.

● Box plot – A box plot represents the distribution of data and is used to
compare multiple sets of data. It can effectively show:

o Minimum and maximum values

o Median

o 25th and 75th percentiles (quartiles)
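The five box-plot numbers listed above can be computed with Python's standard library; note that quartile conventions vary, and `statistics.quantiles` defaults to the "exclusive" method:

```python
import statistics

# The five summary numbers a box plot draws: min, Q1, median, Q3, max.
data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
q1, median, q3 = statistics.quantiles(data, n=4)
print(min(data), q1, median, q3, max(data))
```

Everything between Q1 and Q3 falls inside the box, the median is the line across it, and the whiskers extend toward the minimum and maximum.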

● Treemap – A treemap is one of the most compact techniques to
visualize part-to-whole relationships as well as hierarchical models.

● Bubble charts – Bubble charts help categorize and
compare different values and factors in the data with the help of
bubbles.

● Word cloud – Similar to bubble charts, the words in a word cloud are
sized according to the frequency at which they appear in the content.
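The word-cloud sizing rule can be sketched as follows; the text and the font-size range are illustrative choices, and a linear frequency-to-size mapping is just one simple option:

```python
from collections import Counter

# Size each word in proportion to how often it appears in the text.
text = "big data big charts data big"
counts = Counter(text.split())

min_pt, max_pt = 10, 40               # smallest and largest font sizes
top = max(counts.values())
sizes = {w: min_pt + (max_pt - min_pt) * c // top for w, c in counts.items()}
print(sizes)
```

A rendering library would then draw each word at its computed size; the frequency counting and scaling shown here is the part that turns raw text into the visual encoding.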
17. Additional Topics

Overview of the Spark Architecture

In the Spark architecture, a Driver Program runs on the master node. The master node is
responsible for the entire flow of data and transformations across the multiple worker nodes.
Usually, when we write Spark code, the machine to which we deploy acts as the master node.
Within the Driver Program, the very first thing we need to do is initiate a SparkContext. The
SparkContext can be considered a session through which you can use all the features available in
Spark. For example, you can think of the SparkContext as being like a database connection within
your application: just as you interact with the database through that connection, you interact with
the other functionalities of Spark through the SparkContext.

There is also a Cluster Manager that is used to control multiple worker nodes. The
SparkContext generated in the previous step works in conjunction with the Cluster
Manager to manage and control jobs across all the worker nodes. Whenever a job has to be
executed, the Cluster Manager splits the entire job into multiple individual tasks, and
these tasks are distributed over the worker nodes. This is taken care of by the Driver Program and
the SparkContext. As soon as an RDD is created, it is distributed by the Cluster Manager across the
multiple worker nodes and cached there.

The worker nodes act as the slave nodes that execute the tasks distributed to them by the Cluster
Manager; in practice, their number can range from two to many, depending on the workload of the
system. The worker nodes return the execution results of their tasks to the
SparkContext. A key point to mention here is that you can increase the number of worker nodes so
that the jobs are divided among more nodes and the tasks can be performed in parallel. This
increases the speed of data processing to a large extent.
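This is not Spark itself, but the split-and-distribute idea described above can be sketched in pure Python: a "driver" splits one job into tasks, stand-in "workers" execute them in parallel, and the partial results come back to be combined:

```python
from concurrent.futures import ThreadPoolExecutor

# The per-task computation each "worker" runs on its partition.
def worker_task(partition):
    return sum(x * x for x in partition)

data = list(range(100))
partitions = [data[i::4] for i in range(4)]  # driver splits the job into 4 tasks

with ThreadPoolExecutor(max_workers=4) as pool:   # stand-in worker nodes
    partials = list(pool.map(worker_task, partitions))

total = sum(partials)                 # combine partial results, as the driver does
print(total)
```

The answer is identical to a serial sum of squares; the gain, as with Spark, comes from the tasks running concurrently on separate workers.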
18. Known Gaps : Nil

19. Discussion topics

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for
expressing data analysis programs, coupled with infrastructure for evaluating these programs. The
salient property of Pig programs is that their structure is amenable to substantial parallelization,
which in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of
Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the
Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin,
which has the following key properties:

● Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly
parallel" data analysis tasks. Complex tasks composed of multiple interrelated data
transformations are explicitly encoded as data flow sequences, making them easy to write,
understand, and maintain.

● Optimization opportunities. The way in which tasks are encoded permits the system to
optimize their execution automatically, allowing the user to focus on semantics rather than
efficiency.

● Extensibility. Users can create their own functions to do special-purpose processing.


20. UNIVERSITY QUESTION PAPERS OF PREVIOUS YEAR
21. References, Journals, websites, and E-links

Websites
1. https://azure.microsoft.com/en-in/resources/cloud-computing-dictionary/what-is-big-data-analytics
2. https://www.tibco.com/reference-center/what-is-big-data-analytics
3. https://www.databricks.com/glossary/big-data-analytics
REFERENCES
Textbook(s)
1. Big Data, Black Book, DT Editorial Services, ISBN: 9789351197577, 2016 edition.

Reference Book(s)

1. Big Data and Analytics, Seema Acharya and Subhashini Chellappan, Wiley Publications.
2. Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches, Foundations
and Trends in Databases, Graham Cormode, Minos Garofalakis, Peter J. Haas, and
Chris Jermaine, 2012.
3. R for Business Analytics, A. Ohri, Springer, ISBN: 978-1-4614-4343-8, 2016.
4. Hadoop in Practice, Alex Holmes, Dreamtech Press, ISBN: 9781617292224, 2015.
5. Mining of Massive Datasets, Jure Leskovec (Stanford University), Anand Rajaraman
(Milliway Labs), and Jeffrey D. Ullman (Stanford University).
Journals

1. The Journal of Big Data publishes open-access original research on data science and
data analytics.

2. Frontiers in Big Data is an innovative journal that focuses on the power of big data – its role
in machine learning, AI, and data mining, and its practical applications from cybersecurity to
climate science and public health.

3. Big Data Research aims to promote and communicate advances in big data
research by providing a fast and high-quality forum for researchers, practitioners, and
policy makers.
4. Big Data and Cognitive Computing is an international, scientific, peer-reviewed, open-access
journal on big data and cognitive computing, published quarterly online by
MDPI.

5. Journal on Big Data is launched in a new era in which the engineering features of big data
are setting off waves of exploration in algorithms, raising challenges on big data, and
integrating with industrial development.

6. Big Data Mining and Analytics (published by Tsinghua University Press) discovers hidden
patterns, correlations, insights, and knowledge through mining and analyzing large amounts
of data obtained from various applications.
