Business Intelligence & Big Data Analytics - CSE3124Y
BIG DATA ESSENTIALS
LECTURE 1
Learning Outcomes
Explain Big Data concepts.
Describe the characteristics of Big Data.
Identify the challenges and opportunities in
implementing Big Data.
Discuss the application domains for Big Data.
Determine how Big Data analytics are being used
in case studies (application domains).
Definition
Big Data is a term often used to describe data sets whose
size is beyond the capability of commonly used software
tools to capture, manage, and process.
Sagiroglu and Sinanc (2013) define Big Data as a term
“for massive data sets having large, more varied and
complex structure with the difficulties of storing,
analyzing and visualizing for further processes or results”.
Big Data
Big Data can be generated from many different sources,
including:
Social networks
Banking and financial services
E-commerce services
Web-centric services
Internet search indexes
Scientific and document searches
Medical records
Web logs
Evolution of Big Data
The explosion of the Internet, social media, and technologies such as mobile
devices, sensors and applications has led to the creation of massive data sets.
According to McAfee et al. (2012), as of 2012, about 2.5 exabytes of
data were created each day, and that number was doubling every 40 months or so.
As of 2014, Google processed hundreds of petabytes (PB) of data and Facebook
generated over 10 PB of log data per month (Chen et al., 2014).
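As a rough illustration of the doubling figure quoted above, the sketch below projects the daily data volume forward. It assumes that the 2.5 exabytes per day baseline and the 40-month doubling period stay constant after 2012, which real growth need not do. The function name `projected_daily_volume_eb` is hypothetical and chosen only for this example.

```python
def projected_daily_volume_eb(months_since_2012, base_eb=2.5, doubling_months=40):
    """Daily data volume in exabytes, assuming steady exponential growth.

    Assumption: the 2.5 EB/day baseline and the 40-month doubling period
    (McAfee et al., 2012) remain constant over the projection.
    """
    return base_eb * 2 ** (months_since_2012 / doubling_months)


for months in (0, 40, 80, 120):
    volume = projected_daily_volume_eb(months)
    print(f"{months:>3} months after 2012: {volume:.1f} EB/day")
```

Under these assumptions the daily volume doubles at each 40-month step, so small changes in the doubling period lead to large differences over a decade.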
Characteristics of Big Data
5 V’s of Big Data (Anuradha, 2015)
Characteristics of Big Data (1)
Volume:
Volume refers to the large amount of data, with sizes ranging from terabytes to zettabytes.
Analyzing and manipulating such a large amount of data requires substantial resources and
represents a major challenge.
Velocity:
Velocity refers to the speed at which data is created. It can be measured as data volume per
unit of time.
Variety:
Variety refers to the different types of data that are being stored and analysed: structured,
semi-structured and unstructured.
Semi-structured data consist of a combination of structured and unstructured data.
The types of data can include text, audio, video, images, sensor data,
emails, log files and social media posts, amongst others.
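The velocity definition above (data volume per unit of time) can be made concrete with a small calculation. This is a minimal sketch: the helper name `velocity_bytes_per_second`, the decimal petabyte (10^15 bytes) and the 30-day month are assumptions made for illustration. The 10 PB per month input is the Facebook log figure cited from Chen et al. (2014).

```python
PETABYTE = 10 ** 15  # decimal petabyte in bytes (assumption; a binary PiB is 2**50)


def velocity_bytes_per_second(volume_bytes, period_seconds):
    """Velocity expressed as data volume per unit of time."""
    return volume_bytes / period_seconds


seconds_per_month = 30 * 24 * 60 * 60  # assumes a 30-day month
rate = velocity_bytes_per_second(10 * PETABYTE, seconds_per_month)
print(f"about {rate / 10**9:.2f} GB/s")
```

Expressing velocity per second rather than per month makes it easier to compare against the throughput of storage and network components.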
Characteristics of Big Data (2)
Veracity:
Veracity refers to the trustworthiness of the data.
It includes other data quality attributes such as authenticity,
reputation, availability, consistency and accountability of data.
Value:
Raw data is of no value.
Big data has to be transformed into smart data to add value to a
business or even generate revenue.
Activity 1
Differentiate between structured and unstructured data, by
providing examples of both types.
How is Big Data different from traditional data?
Structured vs Unstructured
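As a small illustration of the distinction, the sketch below holds a hypothetical customer complaint in three forms: a structured row with a fixed schema, a semi-structured JSON document, and unstructured free text. All field names and values are invented for the example.

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema.
structured = io.StringIO("customer_id,date,amount\n1001,2024-01-15,49.90\n")
row = next(csv.DictReader(structured))

# Semi-structured: self-describing keys, but fields can vary per record.
semi_structured = json.loads(
    '{"customer_id": 1001, "tags": ["refund", "late"], "note": "parcel arrived late"}'
)

# Unstructured: free text with no predefined fields.
unstructured = "Hi, my parcel arrived late and I would like a refund for order 1001."

print(row["amount"])             # direct lookup by column name
print(semi_structured["tags"])   # lookup by key, nested values allowed
print("refund" in unstructured)  # only text search works without further processing
```

The structured and semi-structured records can be queried by field name, whereas the free text needs additional processing before its content can be analysed.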
Challenges
Data Representation
The evolution of Big Data has led to the creation of large amounts of heterogeneous data with variations in type,
structure, semantics, organization, granularity and accessibility.
Thus, representing data in a way that is meaningful and efficient is a major challenge.
Data Analysis
Analyzing Big Data is a challenging task due to the incompleteness and inconsistencies of semi-structured and
unstructured data.
Data volume is scaling faster than compute resources.
The analysis of large data sets is very time consuming.
It is of utmost importance to address these challenges to realize the full potential of Big Data analysis.
Furthermore, to gain benefits and insights from Big Data analysis, Big Data has to be pre-processed,
cleaned and transformed properly.
Data Acquisition
Data acquisition consists of data collection, data transmission and data pre-processing.
Big data collected from various sources such as log files and sensors often contains a large amount of redundant data.
It is therefore a major challenge to remove this high redundancy.
Appropriate compression algorithms have to be applied.
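The redundancy problem described above can be illustrated with a small sketch. It removes exact duplicate records from a hypothetical batch of sensor log lines and then compresses the raw data with `zlib` from the Python standard library. The sample data and the choice of exact-match deduplication are assumptions for illustration; real pipelines may need near-duplicate detection, timestamps that make repeated readings meaningful, and other compression codecs.

```python
import zlib

# Hypothetical sensor log with many repeated readings.
log_lines = [
    "sensor=12 temp=21.5",
    "sensor=12 temp=21.5",
    "sensor=13 temp=19.0",
    "sensor=12 temp=21.5",
    "sensor=13 temp=19.0",
] * 200

# Redundancy removal: keep the first occurrence of each line, preserving order.
unique_lines = list(dict.fromkeys(log_lines))

# Lossless compression of the raw, highly repetitive data.
raw_bytes = "\n".join(log_lines).encode("utf-8")
compressed_raw = zlib.compress(raw_bytes)

print(len(log_lines), "lines reduced to", len(unique_lines), "unique lines")
print(len(raw_bytes), "bytes raw,", len(compressed_raw), "bytes compressed")

# Decompression restores the original bytes exactly.
restored = zlib.decompress(compressed_raw)
print("lossless:", restored == raw_bytes)
```

Deduplication discards information about how often a reading occurred, while compression keeps everything and only reduces the storage footprint, so the two techniques suit different requirements.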
Challenges
Data Storage
Traditional RDBMS are found to be ill-suited for storing and processing big data.
NoSQL databases, also referred to as non-traditional databases, are becoming increasingly
popular for Big Data storage. Some examples of Big Data databases include Dynamo,
Voldemort, BigTable, Cassandra, MongoDB, SimpleDB and CouchDB.
The challenge is to offer an information storage service with reliable storage space as well as a powerful
access interface for querying and analysing large amounts of data (Chen et al., 2014).
Data Management
Managing Big Data is the most difficult problem.
A number of issues still have to be resolved, such as “access, metadata, utilization, updating,
governance, and reference (in publications)”.
New approaches are needed to qualify and validate data, as it is impractical to perform
validation on every data item in large datasets.
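One possible approach, sketched here under stated assumptions, is to validate a random sample rather than every item and to use the sample's failure rate as an estimate for the whole dataset. The validity rule (`is_valid`), the synthetic records and the sample size are hypothetical; the source does not prescribe a particular method.

```python
import random


def is_valid(record):
    """Hypothetical rule: age must be an integer between 0 and 120."""
    age = record.get("age")
    return isinstance(age, int) and 0 <= age <= 120


def estimate_invalid_rate(records, sample_size, seed=0):
    """Estimate the share of invalid records from a simple random sample."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = rng.sample(records, min(sample_size, len(records)))
    invalid = sum(not is_valid(record) for record in sample)
    return invalid / len(sample)


# Synthetic dataset: every 20th record has an impossible age (5% invalid).
records = [{"age": -1 if i % 20 == 0 else 30} for i in range(100_000)]
estimate = estimate_invalid_rate(records, 2_000)
print(f"estimated invalid rate: {estimate:.3f}")
```

Checking 2,000 records instead of 100,000 trades some accuracy for a large reduction in work; a larger sample narrows the estimate at a higher cost.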
Activity 2
1. Describe some other challenges that have cropped up with the
evolution of Big Data.
2. There are three types of Big Data databases namely key-value
databases, column-oriented databases, and document-oriented
databases. Differentiate between these three types of databases.
3. Categorize the existing Big Data databases into these three
groups.
Application Domains
Healthcare
Enhanced 360° View of the Customer
Security/Intelligence Extension
Transportation services
... and many others
Case Study 1: Healthcare
Healthcare
Medical information is doubling every 5 years, much of which is unstructured.
81% of physicians report spending 5 hours or less per month reading medical journals.
Big Data is currently being used in healthcare for the prediction and surveillance of
diseases.
Analysing disease patterns can help prevent the spread of a disease.
Analysing large data sets of patients’ information can help identify patients who
are likely to suffer from a particular disease such as diabetes.
Healthcare analytics have the potential to reduce costs of treatment, predict outbreaks
of epidemics, avoid preventable diseases and improve the quality of life in general.
How big data analytics can help:
Epidemic early warning
Intensive Care Unit and remote monitoring
Activity 3
https://www.datapine.com/blog/big-data-examples-in-healthcare/
Read the materials on the website and summarise how big data
analytics are being used in healthcare.
First, define big data.
Second, define big data analytics.
Third, discuss how big data analytics are being used in the
healthcare sector.
Case Study 2: Transportation services
Problem:
Traffic congestion has been increasing worldwide as a result of
increased urbanization and population growth, reducing the
efficiency of transportation infrastructure and increasing travel
time and fuel consumption.
How big data analytics can help:
Real-time analysis of weather and traffic congestion data streams
to identify traffic patterns, reducing transportation costs.
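A minimal sketch of the real-time idea: readings arrive one at a time, and a sliding window over the most recent values flags likely congestion. The speed readings, the window size and the 30 km/h threshold are hypothetical values chosen for illustration, and weather data is left out for brevity.

```python
from collections import deque


def congestion_alerts(speed_stream, window_size=3, threshold_kmh=30.0):
    """Yield (index, window average) whenever the average speed of the
    last `window_size` readings falls below `threshold_kmh`."""
    window = deque(maxlen=window_size)  # old readings drop out automatically
    for index, speed in enumerate(speed_stream):
        window.append(speed)
        if len(window) == window_size:
            average = sum(window) / window_size
            if average < threshold_kmh:
                yield index, average


# Hypothetical vehicle speeds (km/h) reported by one road sensor.
readings = [62, 58, 55, 40, 28, 22, 18, 35, 50, 60]
alerts = list(congestion_alerts(readings))
for index, average in alerts:
    print(f"reading {index}: window average {average:.1f} km/h, possible congestion")
```

Because the function is a generator that keeps only the last few readings in memory, it can process an unbounded stream without storing the full history.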
Activity 4
1. Explain what you understand by an ITS (Intelligent Transport
System).
2. Describe the components of an ITS.
3. List the advantages of an ITS.
4. What challenges would governments/authorities have
to face/overcome for the deployment of an ITS?
Case Study 3: Financial services
Problem:
Managing several petabytes of data, growing at 40-100% per year,
under increasing pressure to prevent fraud and comply with regulations.
How big data analytics can help:
Fraud detection
Risk management
360° View of the Customer
Activity 5
Suggest other application domains where Big Data could be
applied.
Big Data Tools (1)
Big Data tools based on batch processing include
Apache Hadoop, Dryad, Apache Mahout, Jaspersoft BI Suite, Pentaho Business Analytics, Skytree Server,
Tableau, Karmasphere Studio and Analyst, and Talend Open Studio.
The use of the different tools is shown in the following figure.
Big Data Tools (2)
Big Data Tools (3)
There are additional platforms and tools that provide
open-source Hadoop distributions, namely AWS, Cloudera,
Hortonworks, and MapR Technologies (Raghupathi and
Raghupathi, 2014).
Proprietary options such as IBM’s BigInsights are also
available.
References
Sagiroglu, S. and Sinanc, D., 2013, May. Big data: A review. In Collaboration
Technologies and Systems (CTS), 2013 International Conference on (pp. 42-47).
IEEE.
Chen, C.P. and Zhang, C.Y., 2014. Data-intensive applications, challenges,
techniques and technologies: A survey on Big Data. Information Sciences, 275,
pp.314-347.
Chen, M., Mao, S. and Liu, Y., 2014. Big data: A survey. Mobile Networks and
Applications, 19(2), pp.171-209.
McAfee, A., Brynjolfsson, E. and Davenport, T.H., 2012. Big data: the
management revolution. Harvard Business Review, 90(10), pp.60-68.
Raghupathi, W. and Raghupathi, V., 2014. Big data analytics in healthcare:
promise and potential. Health Information Science and Systems, 2(1), p.3.