BDA PST
Q. Define Big data and its characteristics?
Big data refers to extremely large and diverse collections of structured, semi-structured, and
unstructured data that continue to grow exponentially over time.
These datasets are so huge and complex in volume, velocity, and variety that traditional
data management systems cannot store, process, or analyze them.
Big data is used in machine learning, predictive modeling, and other advanced analytics
to solve business problems and make informed decisions.
The characteristics of big data are commonly explained by the five V's.
5 V's of Big Data
o Volume
o Velocity
o Variety
o Veracity (accuracy, reliability, completeness)
o Value (relevance, timeliness)
Volume
The name Big Data itself is related to enormous size. Big Data refers to the vast volumes of data
generated daily from many sources, such as business processes, machines, social media
platforms, networks, human interactions, and many more.
For example, Facebook generates approximately a billion messages each day, records around
4.5 billion clicks of the "Like" button, and receives more than 350 million new posts. Big data
technologies are designed to handle such large amounts of data.
Variety
Big Data can be structured, unstructured, or semi-structured, and is collected from many
different sources. In the past, data was collected only from databases and spreadsheets, but
these days data arrives in an array of forms: PDFs, emails, audio files, social media posts,
photos, videos, etc.
Veracity
Veracity means how reliable the data is. Because big data comes from many sources, it must
be filtered and validated in various ways before it can be trusted; being able to handle and
manage data quality efficiently is essential for business development.
For example, Facebook posts with hashtags can be noisy or misleading and need verification.
Value
Value is an essential characteristic of big data. What matters is not simply the data that we
process or store, but the valuable and reliable insights extracted from the data we store,
process, and analyze.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the
speed at which data is created, often in real time. It covers the rate at which incoming data
sets arrive, the rate of change, and bursts of activity. A primary requirement of big data
systems is to handle this demanding flow of data rapidly.
Big data velocity deals with the speed at which data flows from sources such as application
logs, business processes, networks, social media sites, sensors, and mobile devices.
Variability
Refers to the inconsistency of the data, which can change over time.
This could involve variations in data formats, quality, or even in how data is collected
and interpreted.
Complexity
Refers to the complexity involved in managing, processing, and analyzing big data.
The interconnectedness and large-scale nature of data sources often require
sophisticated infrastructure and tools.
Q. Define big data analytics and explain the challenges and advantages of big data.
Big Data Analytics is all about crunching massive amounts of information to uncover
hidden trends, patterns, and relationships. It's like sifting through a giant mountain of
data to find the gold nuggets of insight.
Here's a breakdown of what it involves:
o Collecting Data: Data comes from various sources such as social
media, web traffic, sensors, and customer reviews.
o Cleaning the Data: Imagine having to assess a pile of rocks that includes some
gold pieces in it. You would have to clear away the dirt and the debris first. When
data is cleaned, mistakes are fixed, duplicates are removed, and
the data is formatted properly (a minimal sketch of this step follows this list).
o Analyzing the Data: It is here that the wizardry takes place. Data analysts
employ powerful tools and techniques to discover patterns and trends. It is the
same thing as looking for a specific pattern in all those rocks that you sorted
through.
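To make the cleaning step concrete, here is a minimal sketch using pandas; the file name
"reviews.csv" and its column names are invented for illustration:

```python
# A minimal data-cleaning sketch (hypothetical "reviews.csv" with
# made-up columns "product" and "rating").
import pandas as pd

df = pd.read_csv("reviews.csv")

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Fix formatting: strip whitespace and normalise case in a text column.
df["product"] = df["product"].str.strip().str.lower()

# Fix mistakes: coerce the rating column to numbers, turning bad
# entries into NaN, then drop rows that could not be repaired.
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
df = df.dropna(subset=["rating"])

print(df.describe())
```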
For example, big data analytics is integral to the modern health care industry. As you can
imagine, healthcare systems must manage thousands of patient records, insurance plans,
prescriptions, and vaccine records.
Challenges of Big data analytics
While Big Data Analytics offers incredible benefits, it also comes with its own set of challenges:
Data Overload: Consider Twitter, where approximately 6,000 tweets are posted
every second. The challenge is sifting through this avalanche of data to find
valuable insights.
Data Quality: If the input data is inaccurate or incomplete, the insights generated
by Big Data Analytics can be flawed. For example, incorrect sensor readings could
lead to wrong conclusions in weather forecasting.
Privacy Concerns: With the vast amount of personal data used, like in Facebook's
ad targeting, there's a fine line between providing personalized experiences and
infringing on privacy.
Security Risks: With cyber threats increasing, safeguarding sensitive data becomes
crucial. For instance, banks use Big Data Analytics to detect fraudulent activities,
but they must also protect this information from breaches.
Costs: Implementing and maintaining Big Data Analytics systems can be
expensive. Airlines like Delta use analytics to optimize flight schedules, but they
need to ensure that the benefits outweigh the costs.
Benefits of Big Data Analytics
Big Data Analytics offers a host of real-world advantages; let's understand them with examples:
1. Informed Decisions: Imagine a store like Walmart. Big Data Analytics helps them
make smart choices about what products to stock. This not only reduces waste but
also keeps customers happy and profits high.
2. Enhanced Customer Experiences: Think about Amazon. Big Data Analytics is
what makes those product suggestions so accurate. It's like having a personal
shopper who knows your taste and helps you find what you want.
3. Fraud Detection: Credit card companies, like MasterCard, use Big Data Analytics
to catch and stop fraudulent transactions. It's like having a guardian that watches
over your money and keeps it safe.
4. Optimized Logistics: FedEx, for example, uses Big Data Analytics to deliver your
packages faster and with less impact on the environment. It's like taking the fastest
route to your destination while also being kind to the planet.
Q. Evolution of Big Data?
Looking back over the last few decades, we can see that Big Data technology has grown
tremendously. There are many milestones in the evolution of Big Data, described below:
1. Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large volumes
of structured data.
2. Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source
framework that provides distributed storage and large-scale data processing.
3. NoSQL Databases:
In 2009, NoSQL databases were introduced, which provide a flexible way to store and
retrieve unstructured data.
4. Cloud Computing:
Cloud Computing technology helps companies to store their important data in data
centers that are remote, and it saves their infrastructure cost and maintenance costs.
5. Machine Learning:
Machine Learning algorithms are those algorithms that work on large data, and analysis is
done on a huge amount of data to get meaningful insights from it. This has led to the
development of artificial intelligence (AI) applications.
6. Data Streaming:
Data Streaming technology has emerged as a solution to process large volumes of data in
real time.
7. Edge Computing:
Edge Computing is a distributed computing paradigm that allows data processing to
be done at the edge of the network, closer to the source of the data.
Q. Explain any one domain specific example of big data?
One domain-specific example of big data is healthcare analytics. In the healthcare industry,
large volumes of data are generated from various sources such as patient records, medical
devices, diagnostic equipment, wearable health trackers, and even social media.
Example: Predictive Healthcare Analytics
Hospitals and healthcare providers can use big data to predict patient outcomes, improve
treatment plans, and prevent diseases. For instance, by analyzing historical data from millions
of patients, healthcare professionals can develop predictive models to identify individuals at
high risk for certain diseases, like diabetes or heart disease, even before symptoms appear.
How It Works:
1. Data Sources: The data used includes electronic health records (EHR), lab test results,
medical imaging, and sensor data from wearable devices.
2. Processing: Advanced algorithms and machine learning models process these vast amounts
of data to find patterns or correlations that would be impossible for humans to identify
manually (a toy sketch of such a model follows this list).
3. Outcomes: With these insights, doctors can make more informed decisions, leading to
better care, more accurate diagnoses, and cost reductions by preventing hospital
readmissions or unnecessary treatments.
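As a toy illustration of step 2, here is a minimal sketch of a risk-prediction model; the patient
features, labels, and thresholds are entirely synthetic and invented for illustration, not a real
clinical model:

```python
# A toy predictive-risk sketch on synthetic "patient" data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
# Hypothetical features: age, BMI, systolic blood pressure.
X = np.column_stack([
    rng.integers(20, 80, n),   # age in years
    rng.normal(27, 5, n),      # body-mass index
    rng.normal(120, 15, n),    # systolic blood pressure
])
# Synthetic label: "high risk" if older with elevated blood pressure.
y = ((X[:, 0] > 55) & (X[:, 2] > 130)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Predicted probability of being high risk for each held-out patient.
risk = model.predict_proba(X_test)[:, 1]
print("Test accuracy:", model.score(X_test, y_test))
```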
Benefits:
Personalized Medicine: Tailoring treatments based on individual patient data.
Early Detection: Identifying at-risk patients earlier than traditional methods.
Operational Efficiency: Optimizing hospital workflows by predicting peak times, resource
needs, and patient flow.
This use of big data in healthcare improves both patient outcomes and operational efficiency,
making it a critical example of how big data is transforming industries.
Q. Explain analytic flow of big data?
Analytics Flow for Big Data
The analytics flow for big data refers to the process of collecting, storing, processing, and
analyzing large and complex data sets to gain insights and make better decisions. It typically
includes the following steps:
1. Data collection: Data is collected from various sources such as social media, IoT devices, and
sensors. The data can be structured, semi-structured, or unstructured and may need to be cleaned
and transformed before it can be analyzed.
2. Data storage: The data is stored in a centralized repository such as a data lake, Hadoop
Distributed File System (HDFS), or NoSQL database.
3. Data processing: The data is processed using technologies such as Hadoop MapReduce,
stream processing, and machine learning to extract insights and prepare it for analysis (a tiny
word-count illustration of the MapReduce idea follows this list).
4. Data analysis: The data is analyzed using tools such as SQL, data visualization, and machine
learning algorithms to gain insights and make better decisions.
5. Data governance: Data governance policies and procedures are put in place to ensure data is
accurate, complete, consistent and compliant with regulations.
6. Data security: Security measures such as data encryption, access controls, and incident
response are implemented to protect sensitive information and prevent unauthorized access.
7. Data visualization: The data is transformed into interactive and easy-to-understand
visualizations using tools such as Tableau, QlikView, and Power BI.
8. Decision-making: Insights from the data are used to make better decisions and take action.
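To illustrate the MapReduce idea from step 3, here is a tiny pure-Python simulation of a word
count; real Hadoop jobs distribute the same map, shuffle, and reduce phases across a cluster:

```python
# A tiny single-process illustration of MapReduce (word count).
from collections import defaultdict

docs = ["big data is big", "data flows fast"]

# Map phase: emit (word, 1) pairs from each document.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'flows': 1, 'fast': 1}
```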
Q. Classification of Big Data Analytics?
Types of Data Analytics
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics
Predictive Analytics
Predictive analytics turns data into valuable, actionable information. Predictive
analytics uses data to determine the probable outcome of an event or the likelihood of
a situation occurring.
Predictive analytics draws on a variety of statistical techniques from modeling, machine
learning, data mining, and game theory that analyze current and historical facts to
make predictions about future events. Techniques used for predictive
analytics include (see the sketch after this list):
Linear Regression
Time Series Analysis and Forecasting
Data Mining
Basic Cornerstones of Predictive Analytics
Predictive modeling
Decision Analysis and optimization
Transaction profiling
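As a minimal sketch of the linear-regression technique above, here is a fit over invented
monthly sales figures used to forecast the next period:

```python
# A minimal predictive-analytics sketch: linear regression over
# made-up historical sales, forecasting the next month.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)          # month index 1..12
sales = np.array([110, 115, 123, 130, 136, 142,
                  151, 158, 165, 170, 179, 186])  # invented units sold

model = LinearRegression().fit(months, sales)

# Forecast month 13 from the fitted trend.
print("Forecast for month 13:", model.predict([[13]])[0])
```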
Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to
approach future events.
It looks at past performance and understands it by mining historical
data to find the causes of success or failure in the past.
Almost all management reporting, such as sales, marketing, operations, and finance,
uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to
classify customers or prospects into groups.
Unlike a predictive model that focuses on predicting the behavior of a single
customer, descriptive analytics identifies many different relationships between
customers and products.
Common examples of descriptive analytics are company reports that provide historic
reviews, such as (a short sketch follows this list):
Data Queries
Reports
Descriptive Statistics
Data dashboard
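Here is a short sketch of descriptive statistics and a report-style query using pandas; the
sales table is invented for illustration:

```python
# A short descriptive-analytics sketch over a made-up sales table.
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "revenue": [120000, 95000, 134000, 99000],
})

# Descriptive statistics: mean, spread, and extremes of past revenue.
print(sales["revenue"].describe())

# A typical management-report query: total revenue by region.
print(sales.groupby("region")["revenue"].sum())
```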
Prescriptive Analytics
Prescriptive Analytics automatically synthesizes big data, mathematical sciences,
business rules, and machine learning to make a prediction and then suggests
decision options to take advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting
actions that benefit from the predictions and showing the decision maker the
implications of each decision option.
Prescriptive Analytics not only anticipates what will happen and when it will happen but
also why it will happen.
Further, Prescriptive Analytics can suggest decision options on how to take
advantage of a future opportunity or mitigate a future risk, and illustrate the
implications of each decision option.
For example, Prescriptive Analytics can benefit healthcare strategic planning by
using analytics to leverage operational and usage data combined with data on
external factors such as economic data, population demographics, etc.
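Since prescriptive analytics leans on optimization, here is a toy linear-programming sketch
that picks a production mix; the profits and capacity constraints are invented for illustration:

```python
# A toy prescriptive-analytics sketch: choosing a production mix by
# linear programming (all numbers are made up).
from scipy.optimize import linprog

# Maximise profit 40*x1 + 30*x2, i.e. minimise its negative.
c = [-40, -30]
# Capacity constraints: machine hours x1 + 2*x2 <= 40,
# and labour hours 3*x1 + x2 <= 60.
A = [[1, 2], [3, 1]]
b = [40, 60]

res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
print("Optimal mix:", res.x, "profit:", -res.fun)
```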
Diagnostic Analytics
In this analysis, we generally use historical data rather than other data to answer a
question or solve a problem. We try to find dependencies and
patterns in the historical data of the particular problem.
For example, companies go for this analysis because it gives great insight into a
problem, and it also keeps detailed information at their disposal; otherwise, data
would have to be collected individually for every problem, which would be very time-
consuming. Common techniques used for Diagnostic Analytics are (see the sketch after this list):
Data discovery
Data mining
Correlations
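As a small example of the correlation technique above, here is a sketch that asks which
factor moves with sales over an invented marketing dataset:

```python
# A small diagnostic-analytics sketch: correlations over made-up data.
import pandas as pd

data = pd.DataFrame({
    "ad_spend":  [10, 12, 8, 6, 11, 5],
    "discounts": [2, 3, 1, 1, 3, 0],
    "sales":     [200, 230, 160, 130, 220, 110],
})

# The correlation matrix shows which factor co-varies with sales,
# a first hint at why sales rose or fell.
print(data.corr()["sales"])
```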