
472 Eb

Uploaded by

saurav.sarkar
Copyright
© © All Rights Reserved

Chapter-2

Arranging and Collecting Data


Data Collection
Data collection is the process of gathering data, using standard validated techniques, so that it can
be measured and analyzed to produce reliable insights. Researchers and scientists base their work on
the data they collect, so data collection is usually the first and an essential step of a study. The
approach to data collection differs from field to field.

For example, if we want to survey the temperature of many cities worldwide on the same day, the first
important step is to collect temperature data from those cities. Let us assume we have recorded the
temperature in six cities at the same time. The temperature data collected is as follows.

Now, if we represent this data in a bar chart, it will look like below.
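
As a rough illustration of this step, the sketch below stores one temperature reading per city and
prints a simple text bar chart. The city names and readings are hypothetical placeholders, not the
values recorded in the survey, and the helper name text_bar_chart is chosen only for this example.

```python
# Hypothetical readings in degrees Celsius; cities and values are illustrative only.
temperatures = {
    "City A": 31,
    "City B": 18,
    "City C": 25,
    "City D": 9,
    "City E": 36,
    "City F": 22,
}


def text_bar_chart(data, symbol="#"):
    """Return one line per item: a padded label, a bar of `value` symbols, the value.

    Assumes every value is a non-negative whole number.
    """
    width = max(len(label) for label in data)
    lines = []
    for label, value in data.items():
        lines.append(f"{label.ljust(width)} | {symbol * value} {value}")
    return lines


chart = text_bar_chart(temperatures)
print("\n".join(chart))
```

Each bar is as long as the reading it represents, so the hottest and coldest cities can be spotted
at a glance, which is the purpose of a bar chart.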

Variables
A variable is an attribute of an object of study that can take different values in different cases.
In the previous example of the survey of city temperatures on the same day, the variables are
"Temperature" and "City" because both attributes vary from case to case.
Types of Variables
Variables can be of two types -

1) Numerical variable
These variables take numbers as values. For example, age, weight, height.
2) Categorical variable
These variables take words or labels as values. For example, name, nationality, sport, etc.
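
The distinction between the two types can be sketched in a few lines of Python. This is only an
illustration: the function name variable_type and the sample records are assumptions made for this
example, and real data often needs more careful checks (for instance, numbers stored as text).

```python
def variable_type(values):
    """Label a column 'numerical' if every value is a number, else 'categorical'."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "numerical" if is_number else "categorical"


# Hypothetical records for three students.
records = {
    "age": [14, 15, 13],
    "height_cm": [152.5, 160.0, 149.0],
    "nationality": ["Indian", "Japanese", "Brazilian"],
    "sport": ["cricket", "judo", "football"],
}

types = {name: variable_type(values) for name, values in records.items()}
print(types)
```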

Types of data
Data can be broadly divided into two categories -
1) Quantitative Data
Quantitative data are numbers or values that can be measured. For example, the number of
times a product has been searched for on the internet or the number of items sold per month.
Since these data can be quantified, they are comparatively easy to analyze.
2) Qualitative Data
Qualitative data, on the other hand, are subjective. For example, a traveler's review of a hotel
or the feedback a consumer gives about customer service after a telephone conversation. These
data help us understand experiences in depth.
Sources of data
Data sources can be classified into primary and secondary sources.
1) Primary
These are sources created specifically to collect data for analysis, for example, surveys,
interviews, questionnaires, feedback forms, etc.
Some methods of collecting primary data are:
a) Physical interviews b) Online surveys c) Feedback forms

2) Secondary
At times data is already recorded for some other purpose and is then re-used for analysis. These
are secondary data sources. They include internal transactional databases, sensor data, etc.
Some methods of collecting secondary data are:
a) Social media data tracking b) Web traffic tracking c) Satellite data tracking
Big Data
Data sets whose volume exceeds the processing capacity of traditional databases are called Big
Data. Big Data techniques are widely used in different sectors, for example retail, science, sports,
social media and health care.
Big Data systems
Millions of users are active on online platforms and create an enormous amount of content every minute.
Processing this vast amount of data requires specialized skills and systems. Systems capable of
extracting statistical insights from such huge amounts of data are called Big Data systems.
Some of the key characteristics that define Big Data are:
1) Volume - This refers to the size of the data. Usually, data sets measured in terabytes or
petabytes are called Big Data.
2) Variety - Big Data sets are generally collected from a wide range of sources, including
transactional databases, sensor data, etc. They can include images, audio, video, etc.
So, variety of data is an essential characteristic of Big Data.
3) Velocity - This refers to the rate at which data is generated. Big Data is generally created at
high speed, so large volumes build up very quickly. For example, social media platforms generate a
massive amount of data every minute.
Big Data techniques are widely used in different sectors. Let us see some of them:
a) Retail - Popular retail chains are spread across the world. They handle millions of customers every
minute. They store and analyze their customer data and transactions in Big Data systems.
b) Science - The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate
observations and simulations on the Discover supercomputing cluster.
c) Sports - In Formula One races, cars fitted with hundreds of sensors produce terabytes of data. These
sensors gather data points ranging from tire pressure to fuel burn efficiency. Based on this data,
analysts and engineers decide whether modifications should be made to get the best outcome in the race.
Moreover, using simulations based on data collected over the season, race teams try to predict
their finishing time before the race.

d) Social media - Most popular social media platforms store and analyze petabytes of data every day.
They use Big Data techniques for storage and analysis.
e) Healthcare - During the COVID-19 pandemic, different governments used Big Data to track the
locations of infected people and reduce the spread. It was also used for case identification and
medical treatment.
How we can interpret the data
Data is typically stored as numbers (numeric) or labels (categories). Depending on the type of data,
there are five simple kinds of questions we can ask of it.

1) Binary classification or two-class classification algorithms - used when a question has two
possible answers. For example:
Q1: Will a customer buy this product?
A: Yes/No
Q2: Can India win this cricket match?
A: Yes/No

Similarly, if a question has more than two possible answers, then we use a multiclass
classification algorithm.
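
A very small sketch of a two-class classifier follows. It is not a standard library routine: it
simply learns a cut-off halfway between the average of the "No" examples and the average of the
"Yes" examples. The feature (minutes a customer spent viewing the product) and all values are
hypothetical, and the sketch assumes buyers tend to have the larger values.

```python
def fit_threshold(values, labels):
    """Learn a cut-off halfway between the mean of the 'no' and 'yes' examples."""
    yes = [v for v, bought in zip(values, labels) if bought]
    no = [v for v, bought in zip(values, labels) if not bought]
    return (sum(yes) / len(yes) + sum(no) / len(no)) / 2


def predict(value, threshold):
    """Answer the two-class question: will this customer buy the product?"""
    return "Yes" if value > threshold else "No"


# Hypothetical past customers: minutes spent viewing the product, and whether they bought it.
minutes_on_page = [1, 2, 3, 8, 9, 10]
bought = [False, False, False, True, True, True]

threshold = fit_threshold(minutes_on_page, bought)
print(threshold, predict(7, threshold), predict(2, threshold))
```

A multiclass algorithm works on the same idea but chooses among more than two possible answers.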

2) Anomaly detection algorithms - used to find unexpected records in a set of mostly consistent data.
For example:
1. If an unexpected transaction that does not match your regular transactions is made from your
bank account, it could be a case of fraud. Banks track these records and alert
the customer that an unexpected transaction has happened, protecting the customer's
money.
2. Your father is getting his blood pressure checked. Is the reading regular?
3. You are checking your car's tyre pressure. Is the reading regular?
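
One simple way to flag such records, shown here only as a sketch, is to measure how far each value
lies from the median of the data. The transaction amounts, the function name find_anomalies and the
cut-off factor k are assumptions made for this example; banks use far more elaborate methods.

```python
from statistics import median


def find_anomalies(values, k=5.0):
    """Return values lying more than k median-absolute-deviations from the median.

    Note: if most values are identical the deviation is 0, and every value
    that differs from the median is flagged.
    """
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    return [v for v in values if abs(v - centre) > k * mad]


# Hypothetical account activity: five regular payments and one unusual one.
transactions = [500, 450, 520, 480, 510, 9000]
anomalies = find_anomalies(transactions)
print(anomalies)
```

The median is used instead of the average because a single very large transaction barely moves
it, so the regular payments still define what "normal" looks like.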

3) Regression algorithms - used to predict numerical values based on the data.
For example:
Q1: How many goals will your favorite team score in this football match?
A: 3
Q2: What will be the temperature of your city next Friday?
A: 32°C
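
The temperature question can be sketched with the simplest regression method, a straight line fitted
by least squares. The daily readings below are hypothetical, and a straight line is only an
assumption about how the temperature changes; real forecasts use many more variables.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical readings: day number and temperature in degrees Celsius.
days = [1, 2, 3, 4, 5]
temps = [29.0, 30.5, 30.0, 31.5, 32.0]

slope, intercept = fit_line(days, temps)
predicted_day_8 = slope * 8 + intercept  # a numerical answer, not a Yes/No label
print(round(predicted_day_8, 1))
```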
4) Clustering algorithms - used when data can be separated into distinct groups.
For example, consider a class of 60 students. We have recorded their heights and arranged them
in a table.
As the table shows, the students can be categorized into groups based on their height.

Similarly, if we plot these heights in a chart, the same groups appear.
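
Grouping by height can be sketched with k-means, one common clustering method, applied here to a
single variable. The nine heights are hypothetical stand-ins for the class of 60 students, and the
choice of three groups is an assumption made for this example.

```python
def kmeans_1d(values, k, iterations=20):
    """Group numbers into k clusters by repeatedly assigning each value to its nearest centre."""
    ordered = sorted(values)
    # Start with k centres spread evenly through the sorted values.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # Move each centre to the mean of its group; keep it unchanged if the group is empty.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups


# Hypothetical heights in centimetres.
heights = [140, 142, 141, 155, 156, 154, 170, 171, 169]
centres, groups = kmeans_1d(heights, k=3)
print(centres)
print(groups)
```

Unlike classification, no group labels are given in advance: the groups emerge from the data itself.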

5) Reinforcement learning algorithms - used for questions about what action a machine or robot
should take next. The machine tries actions and learns by trial and error which ones work
best.
Consider the following questions:
Q1: I am a self-driving car. I am at a traffic signal with a red light. What should I do now?
A: Brake
Q2: I am a microwave oven. I have already heated the food for the set time. What should I do
now?
A: Stop
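
The trial-and-error idea can be sketched as follows. This is a deliberately simplified, one-step
version of reinforcement learning: the car sees a signal, tries an action, receives a reward, and
adjusts its estimate of how good that action is. The reward rule (+1 for the safe action, -1
otherwise) and the parameter values are assumptions made for this example.

```python
import random

ACTIONS = ["brake", "go"]
SIGNALS = ("red", "green")


def reward(signal, action):
    """+1 for the safe action at this signal, -1 otherwise (assumed rule)."""
    correct = "brake" if signal == "red" else "go"
    return 1 if action == correct else -1


def learn(episodes=200, epsilon=0.2, alpha=0.5, seed=0):
    """Learn by trial and error which action earns the most reward at each signal."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in ACTIONS} for s in SIGNALS}  # estimated value of each action
    for episode in range(episodes):
        signal = SIGNALS[episode % 2]  # alternate between red and green
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore: try a random action
        else:
            action = max(ACTIONS, key=lambda a: q[signal][a])  # exploit the best so far
        # Nudge the estimate towards the reward that was actually received.
        q[signal][action] += alpha * (reward(signal, action) - q[signal][action])
    return {s: max(ACTIONS, key=lambda a: q[s][a]) for s in SIGNALS}


policy = learn()
print(policy)
```

Nobody tells the program the right answer directly; it discovers from the rewards that braking
pays off at a red light.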
Univariate data:
This type of data has only one variable. It does not involve multiple parameters or
relationships. For example, the heights of students are univariate data.
Multivariate data:

This type of data involves a relationship between multiple variables. For example, sales of
umbrellas increase during the rainy season, so umbrella sales depend on rainfall.
Here there are two variables, "rainfall" and "sales". This type of data is more complex than
univariate data because it involves comparisons and relationships between multiple parameters.
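
The difference can be sketched in code. With one variable we can only summarize it, for example by
its average; with two variables we can also measure how strongly they move together, here using the
Pearson correlation coefficient. All heights, rainfall figures and sales figures below are
hypothetical.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson_correlation(xs, ys):
    """Return a value between -1 and 1; values near 1 mean the variables rise together."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Univariate: a single variable, summarized on its own.
heights_cm = [150, 152, 149, 160, 155]
average_height = mean(heights_cm)

# Two related variables: monthly rainfall and umbrella sales.
rainfall_mm = [10, 50, 80, 120, 200]
umbrella_sales = [5, 22, 40, 61, 98]
r = pearson_correlation(rainfall_mm, umbrella_sales)

print(average_height, round(r, 3))
```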
