
UNIT I Introduction to BIG DATA

What is big data – why big data – convergence of key trends – unstructured data –
industry examples of big data – web analytics – big data and marketing – fraud and big
data – risk and big data – credit risk management – big data and algorithmic trading –
big data and healthcare – big data in medicine – advertising and big data – big data
technologies – introduction to Hadoop – open source technologies – cloud and big data –
mobile business intelligence – Crowd sourcing analytics – inter and trans firewall
analytics

Big Data

Big data is data that exceeds the processing capacity of conventional database
systems. The data is too big, moves too fast, or does not fit the structures of traditional
database architectures. In other words, Big data is an all-encompassing term for any
collection of data sets so large and complex that it becomes difficult to process using
on-hand data management tools or traditional data processing applications. To gain
value from this data, you must choose an alternative way to process it. Big Data is the
next generation of data warehousing and business analytics and is poised to deliver top
line revenues cost efficiently for enterprises. Big data is a popular term used to describe
the exponential growth and availability of data, both structured and unstructured.
Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the
world today has been created in the last two years alone. This data comes from
everywhere: sensors used to gather climate information, posts to social media sites,
digital pictures and videos, purchase transaction records, and cell phone GPS signals to
name a few. This data is big data.

Definition

Big data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process the data within a tolerable
elapsed time.

Big data is high-volume, high-velocity and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision-making.
Big data is often boiled down to a few varieties including social data, machine data, and
transactional data. Social media data is providing remarkable insights to companies on
consumer behavior and sentiment that can be integrated with CRM data for analysis,
with 230 million tweets posted on Twitter per day, 2.7 billion Likes and comments
added to Facebook every day, and 60 hours of video uploaded to YouTube every minute
(this is what we mean by velocity of data). Machine data consists of information
generated from industrial equipment, real-time data from sensors that track parts and
monitor machinery (often also called the Internet of Things), and even web logs that
track user behavior online. At arcplan client CERN, the largest particle physics research
center in the world, the Large Hadron Collider (LHC) generates 40 terabytes of data
every second during experiments. Regarding transactional data, large retailers and
even B2B companies can generate multitudes of data on a regular basis considering
that their transactions consist of one or many items, product IDs, prices, payment
information, manufacturer and distributor data, and much more. Major retailers like
Amazon.com, which posted $10B in sales in Q3 2011, and restaurants like US pizza
chain Domino's, which serves over 1 million customers per day, are generating
petabytes of transactional big data. The thing to note is that big data can resemble
traditional structured data or unstructured, high frequency information.

Big Data Analytics


Big (and small) Data analytics is the process of examining data—typically of a variety of
sources, types, volumes and / or complexities—to uncover hidden patterns, unknown
correlations, and other useful information. The intent is to find business insights that
were not previously possible or were missed, so that better decisions can be made.
Big Data analytics uses a wide variety of advanced analytics to provide
1. Deeper insights. Rather than looking at segments, classifications, regions,
groups, or other summary levels, you'll have insights into all the individuals, all
the products, all the parts, all the events, all the transactions, etc.
2. Broader insights. The world is complex. Operating a business in a global,
connected economy is very complex given constantly evolving and changing
conditions. As humans, we simplify conditions so we can process events and
understand what is happening. But our best-laid plans often go astray because of
that estimating and approximating. Big Data analytics takes into account all the
data, including new data sources, to understand the complex, evolving, and
interrelated conditions to produce more accurate insights.
3. Frictionless actions. Increased reliability and accuracy allow the deeper and
broader insights to be automated into systematic actions.

[Figure: Advanced Big Data analytics]

[Figure: Big data analytic applications]


3 Dimensions / Characteristics of Big Data

3Vs (volume, variety and velocity) are three defining properties or dimensions of big
data. Volume refers to the amount of data, variety refers to the number of types of data,
and velocity refers to the speed of data processing.

Volume:

The size of available data has been growing at an increasing rate. Experts predict that
the volume of data in the world will grow to 25 Zettabytes in 2020. That same
phenomenon affects every business: their data is growing at the same exponential rate
too.

This applies to companies and to individuals. A text file is a few kilobytes, a sound file
is a few megabytes, while a full-length movie is a few gigabytes. More sources of data
are added on a continuous basis. For companies, in the old days, all data was generated
internally by employees. Currently, the data is generated by employees, partners and
customers. For a group of companies, the data is also generated by machines. For
example, hundreds of millions of smartphones send a variety of information to the
network infrastructure. This data did not exist five years ago.

More sources of data with a larger size of data combine to increase the volume of data
that has to be analyzed. This is a major issue for those looking to put that data to use
instead of letting it just disappear.

Petabyte data sets are common these days and exabyte is not far away.

Velocity:
The velocity at which data is created and integrated is continually accelerating. We
have moved from batch processing to a real-time business.

Initially, companies analyzed data using a batch process. One takes a chunk of data,
submits a job to the server and waits for delivery of the result. That scheme works when
the incoming data rate is slower than the batch-processing rate and when the result is
useful despite the delay. With the new sources of data such as social and mobile
applications, the batch process breaks down. The data is now streaming into the server
in real time, in a continuous fashion and the result is only useful if the delay is very
short.

Data comes at you at a record or a byte level, not always in bulk. And the demands of
the business have increased as well – from an answer next week to an answer in a
minute. In addition, the world is becoming more instrumented and interconnected. The
volume of data streaming off those instruments is exponentially larger than it was even
2 years ago.
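The batch-versus-streaming contrast above can be sketched in a few lines of Python. This is an illustrative toy, not any particular streaming framework: a batch job needs all the data before it can answer, while a streaming computation keeps a usable answer after every record.

```python
# Minimal sketch of the batch vs. streaming contrast, using a running
# average as the "analysis". All names here are illustrative only.

def batch_average(readings):
    """Batch: collect all the data first, then compute the answer once."""
    return sum(readings) / len(readings)

class StreamingAverage:
    """Streaming: update the answer as each record arrives."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count   # an answer after every record

readings = [10.0, 20.0, 30.0, 40.0]
print(batch_average(readings))           # one answer, after all data is in

stream = StreamingAverage()
for r in readings:
    latest = stream.update(r)            # usable result at every step
print(latest)
```

Both reach the same final answer; the difference is that the streaming version has a result available the moment each record arrives, which is what a short-delay requirement demands.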

Variety:
Variety presents an equally difficult challenge. The growth in data sources has fuelled
the growth in data types. In fact, 80% of the world’s data is unstructured. Yet most
traditional methods apply analytics only to structured information.
From Excel tables and databases, data has lost its fixed structure and taken on
hundreds of formats: pure text, photo, audio, video, web pages, GPS data, sensor data,
relational databases, documents, SMS, PDF, Flash, etc. One no longer has control over
the input data format. Structure can no longer be imposed, as it was in the past, in
order to keep control over the analysis. As new applications are introduced, new data
formats come to life.
The variety of data sources continues to increase. It includes
■ Internet data (e.g., click stream, social media, social networking links)
■ Primary research (e.g., surveys, experiments, observations)
■ Secondary research (e.g., competitive and marketplace data, industry reports,
consumer data, business data)
■ Location data (e.g., mobile device data, geospatial data)
■ Image data (e.g., video, satellite image, surveillance)
■ Supply chain data (e.g., EDI, vendor catalogs and pricing, quality information)
■ Device data (e.g., sensors, PLCs, RF devices, LIMs, telemetry)
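The practical consequence of this variety can be sketched in Python. The three records below are invented for illustration; the point is that each format demands a different extraction technique before the same question can be answered.

```python
import csv
import io
import json
import re

# Illustrative sketch of the "variety" problem: the same kind of record
# ("an order") arriving in three different shapes. All values are made up.

csv_row = "1001,phone,299.99"
json_doc = '{"order_id": 1002, "item": "tablet", "price": 499.99}'
free_text = "Customer emailed: please ship order 1003 (laptop) ASAP"

# Structured: fixed columns, where position carries the meaning.
order_a = next(csv.reader(io.StringIO(csv_row)))

# Semi-structured: self-describing tags, but no schema is enforced.
order_b = json.loads(json_doc)

# Unstructured: the meaning has to be extracted, e.g. by pattern matching.
order_c_id = int(re.search(r"order (\d+)", free_text).group(1))

print(order_a[0], order_b["order_id"], order_c_id)   # 1001 1002 1003
```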

Why Big data?

1. Understanding and Targeting Customers


This is one of the biggest and most publicized areas of big data use today. Here, big
data is used to better understand customers and their behaviors and preferences.
Companies are keen to expand their traditional data sets with social media data,
browser logs as well as text analytics and sensor data to get a more complete picture of
their customers. The big objective, in many cases, is to create predictive models. You
might remember the example of U.S. retailer Target, who is now able to very accurately
predict when one of their customers will expect a baby. Using big data, Telecom
companies can now better predict customer churn; Wal-Mart can predict what products
will sell, and car insurance companies understand how well their customers actually
drive. Even government election campaigns can be optimized using big data analytics.

2. Understanding and Optimizing Business Processes


Big data is also increasingly used to optimize business processes. Retailers are able to
optimize their stock based on predictions generated from social media data, web search
trends and weather forecasts. One particular business process that is seeing a lot of big
data analytics is supply chain or delivery route optimization. Here, geographic
positioning and radio frequency identification sensors are used to track goods or
delivery vehicles and optimize routes by integrating live traffic data, etc. HR business
processes are also being improved using big data analytics. This includes the
optimization of talent acquisition – Moneyball style, as well as the measurement of
company culture and staff engagement using big data tools

3. Personal Quantification and Performance Optimization

Big data is not just for companies and governments but also for all of us individually. We
can now benefit from the data generated from wearable devices such as smart watches
or smart bracelets. Take the Up band from Jawbone as an example: the armband
collects data on our calorie consumption, activity levels, and our sleep patterns. While it
gives individuals rich insights, the real value is in analyzing the collective data. In
Jawbone’s case, the company now collects 60 years’ worth of sleep data every night.
Analyzing such volumes of data will bring entirely new insights that it can feed back to
individual users. The other area where we benefit from big data analytics is finding
love online. Most online dating sites apply big data tools and algorithms to find us
the most appropriate matches.

4. Improving Healthcare and Public Health

The computing power of big data analytics enables us to decode entire DNA strings in
minutes and will allow us to find new cures and better understand and predict disease
patterns. Just think of what happens when all the individual data from smart watches
and wearable devices can be used to apply it to millions of people and their various
diseases. The clinical trials of the future won’t be limited by small sample sizes but
could potentially include everyone! Big data techniques are already being used to
monitor babies in a specialist premature and sick baby unit. By recording and analyzing
every heart beat and breathing pattern of every baby, the unit was able to develop
algorithms that can now predict infections 24 hours before any physical symptoms
appear. That way, the team can intervene early and save fragile babies in an
environment where every hour counts. What’s more, big data analytics allow us to
monitor and predict the developments of epidemics and disease outbreaks. Integrating
data from medical records with social media analytics enables us to monitor flu
outbreaks in real-time, simply by listening to what people are saying, i.e. “Feeling
rubbish today - in bed with a cold”.
5. Improving Sports Performance

Most elite sports have now embraced big data analytics. We have the IBM SlamTracker
tool for tennis tournaments; we use video analytics that track the performance of every
player in a football or baseball game, and sensor technology in sports equipment such
as basketballs or golf clubs allows us to get feedback (via smartphones and cloud
servers) on our game and how to improve it. Many elite sports teams also track athletes
outside of the sporting environment – using smart technology to track nutrition and
sleep, as well as social media conversations to monitor emotional wellbeing.

6. Improving Science and Research

Science and research is currently being transformed by the new possibilities big data
brings. Take, for example, CERN, the Swiss nuclear physics lab with its Large Hadron
Collider, the world’s largest and most powerful particle accelerator. Experiments to
unlock the secrets of our universe – how it started and works - generate huge amounts
of data. The CERN data center has 65,000 processors to analyze its 30 petabytes of
data. However, it uses the computing powers of thousands of computers distributed
across 150 data centers worldwide to analyze the data. Such computing powers can be
leveraged to transform so many other areas of science and research.

7. Optimizing Machine and Device Performance

Big data analytics help machines and devices become smarter and more autonomous.
For example, big data tools are used to operate Google’s self-driving car, a Toyota
Prius fitted with cameras, GPS, powerful computers and sensors that allow it to drive
safely on the road without human intervention. Big data tools are also used
to optimize energy grids using data from smart meters. We can even use big data tools
to optimize the performance of computers and data warehouses.

8. Improving Security and Law Enforcement.

Big data is applied heavily in improving security and enabling law enforcement. I am
sure you are aware of the revelations that the National Security Agency (NSA) in the
U.S. uses big data analytics to foil terrorist plots (and maybe spy on us). Others use big
data techniques to detect and prevent cyber attacks. Police forces use big data tools to
catch criminals and even predict criminal activity, and credit card companies use big
data to detect fraudulent transactions.

9. Improving and Optimizing Cities and Countries


Big data is used to improve many aspects of our cities and countries. For example, it
allows cities to optimize traffic flows based on real time traffic information as well as
social media and weather data. A number of cities are currently piloting big data
analytics with the aim of turning themselves into Smart Cities, where the transport
infrastructure and utility processes are all joined up, where a bus would wait for a
delayed train, and where traffic signals predict traffic volumes and operate to minimize
jams.

10. Financial Trading

My final category of big data application comes from financial trading. High-Frequency
Trading (HFT) is an area where big data finds a lot of use today. Here, big data
algorithms are used to make trading decisions. Today, the majority of equity trading
now takes place via data algorithms that increasingly take into account signals from
social media networks and news websites to make buy and sell decisions in split
seconds.

Unstructured data

Unstructured data is information that either does not have a predefined data model
and/or does not fit well into a relational database. Unstructured information is typically
text heavy, but may contain data such as dates, numbers, and facts as well. The term
semi-structured data is used to describe structured data that does not fit into a formal
structure of data models. However, semi-structured data does contain tags that
separate semantic elements, which includes the capability to enforce hierarchies within
the data. The amount of data (all data, everywhere) is doubling every two years. Most
new data is unstructured. Specifically, unstructured data represents almost 80 percent
of new data, while structured data represents only 20 percent. Unstructured data tends
to grow exponentially, unlike structured data, which tends to grow in a more linear
fashion. Unstructured data is vastly underutilized.

Mining Unstructured Data

Unstructured Data and Big Data

As mentioned above, unstructured data is the opposite of structured data. Structured
data generally resides in a relational database, and as a result, it is sometimes called
"relational data." This type of data can be easily mapped into pre-designed fields. For
example, a database designer may set up fields for phone numbers, zip codes and
credit card numbers that accept a certain number of digits. Structured data has been or
can be placed in fields like these. By contrast, unstructured data is not relational and
doesn't fit into these sorts of pre-defined data models.
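The "pre-designed fields" idea can be made concrete with a small sketch using SQLite (chosen here purely for illustration; the table and column names are invented): the schema constrains each field, and data that does not fit the pre-defined shape is rejected up front.

```python
import sqlite3

# Sketch of "structured" data as described above: pre-designed fields with
# constraints on their shape. Table and column names are illustrative.

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE customers (
        id       INTEGER PRIMARY KEY,
        zip_code TEXT CHECK (length(zip_code) = 5),
        phone    TEXT CHECK (length(phone) = 10)
    )
""")
con.execute("INSERT INTO customers VALUES (1, '90210', '5551234567')")

# A value that does not fit the pre-defined shape is rejected immediately --
# exactly the discipline that unstructured data lacks.
rejected = False
try:
    con.execute("INSERT INTO customers VALUES (2, 'not-a-zip', '555')")
except sqlite3.IntegrityError:
    rejected = True
print(rejected)   # True
```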

In addition to structured and unstructured data, there's also a third category: semi-
structured data. Semi-structured data is information that doesn't reside in a relational
database but that does have some organizational properties that make it easier to
analyze. Examples of semi-structured data might include XML documents and NoSQL
databases.
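A small Python sketch shows why such data is called semi-structured: the XML below (an invented example) has no fixed table schema, yet its tags still let a program navigate the hierarchy reliably.

```python
import xml.etree.ElementTree as ET

# Sketch of semi-structured data: no rigid table schema, but tags separate
# semantic elements and allow hierarchy. The document below is illustrative.

doc = """
<customer id="42">
  <name>Asha</name>
  <orders>
    <order total="19.99"/>
    <order total="5.00"/>
  </orders>
</customer>
"""
root = ET.fromstring(doc)

# Fields are found by tag name and nesting, not by column position.
total = sum(float(o.get("total")) for o in root.iter("order"))
print(root.get("id"), round(total, 2))   # 42 24.99
```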

The term "big data" is closely associated with unstructured data. Big data refers to
extremely large datasets that are difficult to analyze with traditional tools. Big data can
include both structured and unstructured data, but IDC estimates that 90 percent of big
data is unstructured data. Many of the tools designed to analyze big data can handle
unstructured data.

Implementing Unstructured Data Management

Organizations use a variety of software tools to help them organize and manage
unstructured data. These can include the following:

■ Big data tools: Software like Hadoop can process stores of both unstructured
and structured data that are extremely large, very complex and changing
rapidly.
■ Business intelligence software: Also known as BI, this is a broad category of
analytics, data mining, dashboards and reporting tools that help companies make
sense of their structured and unstructured data for the purpose of making better
business decisions.
■ Data integration tools: These tools combine data from disparate sources so
that they can be viewed or analyzed from a single application. They sometimes
include the capability to unify structured and unstructured data.
■ Document management systems: Also called "enterprise content
management systems," a DMS can track, store and share unstructured data that
is saved in the form of document files.
■ Information management solutions: This type of software tracks structured
and unstructured enterprise data throughout its lifecycle.
■ Search and indexing tools: These tools retrieve information from unstructured
data files such as documents, Web pages and photos.
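As a rough illustration of the processing model behind tools like Hadoop, here is the classic word-count example written as plain Python map, shuffle and reduce phases. This is only a sketch of the MapReduce idea; a real Hadoop job would distribute these same phases across a cluster.

```python
from itertools import groupby
from operator import itemgetter

# Plain-Python sketch of the MapReduce model that Hadoop popularized: map
# emits (key, value) pairs, a shuffle groups them by key, and reduce
# aggregates each group. No cluster here -- only the shape of the computation.

def map_phase(line):
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(key, values):
    return key, sum(values)

lines = ["big data is big", "data moves fast"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped))
print(counts)   # word -> count, e.g. "big" appears twice
```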

Big Data Challenges:

1. Sharing and Accessing Data:


■ Perhaps the most frequent challenge in big data efforts is the inaccessibility of
data sets from external sources.
■ Sharing data can cause substantial challenges.
■ It includes the need for inter- and intra-institutional legal documents.
■ Accessing data from public repositories leads to multiple difficulties.
■ Data must be available in an accurate, complete and timely manner, because
decisions based on a company's information systems can only be as good and as
timely as the data behind them.

2. Privacy and Security:


■ This is another important challenge with Big Data, with sensitive conceptual,
technical and legal dimensions.
■ Most organizations are unable to maintain regular checks due to the large
amounts of data being generated. However, security checks and observation
should be performed in real time, where they are most beneficial.
■ Some information about a person, when combined with external large data
sets, may reveal facts that the person considers private and would not want
others to know.
■ Some organizations collect information about people in order to add value to
their business, by generating insights into their lives that those people are
unaware of.

3. Analytical Challenges:
■ Big data poses some huge analytical challenges, which raise questions such as:
how should a problem be dealt with if the data volume gets too large?
■ Or how to find out the important data points?
■ Or how to use data to the best advantage?
■ The large amounts of data on which this analysis is done can be structured
(organized data), semi-structured (semi-organized data) or unstructured
(unorganized data). There are two approaches through which decision making
can be done:
■ Either incorporate massive data volumes in the analysis,
■ Or determine upfront which Big Data is relevant.

4. Technical challenges:
■ Quality of data:
■ Collecting and storing large amounts of data comes at a cost. Big companies,
business leaders and IT leaders always want large data storage.
■ For better results and conclusions, Big Data focuses on storing quality data
rather than irrelevant data.
■ This raises further questions: how can it be ensured that the data is relevant,
how much data is enough for decision making, and whether the stored data is
accurate or not.
■ Fault tolerance:
■ Fault tolerance is another technical challenge; fault-tolerant computing is
extremely hard, involving intricate algorithms.
■ New technologies like cloud computing and big data are designed so that
whenever a failure occurs, the damage stays within an acceptable threshold,
i.e. the whole task does not have to begin from scratch.
■ Scalability:
■ Big data projects can grow and evolve rapidly. The scalability issue of Big Data
has led towards cloud computing.
■ This leads to challenges like how to run and execute various jobs so that the
goal of each workload can be achieved cost-effectively.
■ It also requires handling system failures in an efficient manner. This again
raises the question of what kinds of storage devices should be used.
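The fault-tolerance goal described above, resuming from a checkpoint rather than restarting from scratch, can be sketched as follows. This is a toy illustration, not a real framework's recovery protocol; the file name and record format are invented.

```python
import json
import os
import tempfile

# Sketch of checkpoint-based fault tolerance: record progress after each
# item so a restarted job resumes from the last checkpoint instead of
# redoing all the work from scratch.

CHECKPOINT = os.path.join(tempfile.gettempdir(), "job_checkpoint.json")
if os.path.exists(CHECKPOINT):          # start clean for this demo
    os.remove(CHECKPOINT)

def process(record):
    return record * 2                   # stand-in for the real work

def run_job(records):
    start = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            start = json.load(f)["next_index"]   # resume point
    results = []
    for i in range(start, len(records)):
        results.append(process(records[i]))
        with open(CHECKPOINT, "w") as f:         # progress saved each step
            json.dump({"next_index": i + 1}, f)
    return start, results

first_start, _ = run_job([1, 2, 3, 4])        # fresh run: starts at index 0
second_start, redone = run_job([1, 2, 3, 4])  # simulated restart: nothing redone
print(first_start, second_start, len(redone))
os.remove(CHECKPOINT)                         # tidy up
```

In a real system the "damage threshold" is the work done since the last checkpoint; here that is at most one record.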
