CCST9047: The Age
of Big Data
Lecture 3
Things to remember
‣ We only have 4 tutorials in the weeks of:
‣ 2.17 - 2.21 (next week), 2.24 - 2.28, 3.24 - 3.28, 3.31 - 4.4
Weekday   Start time   End time   Venue
THU       10:30        11:20      HW311
THU       11:30        12:20      HW311
THU       13:30        14:20      MB226
THU       14:30        15:20      MB226
THU       15:30        16:20      MB226
THU       16:30        17:20      MB226
THU       17:30        18:20      MB226
FRI       10:30        11:20      MB154
FRI       11:30        12:20      MB127
FRI       13:30        14:20      MB127
If you haven’t registered, please do it asap!
Things to remember
‣ The detailed description of the final project will be released by this evening.
Recap of Previous Lecture
‣ Daily applications
‣ Web data
‣ Social Network data
‣ Image and Video
‣ Text data
‣ Sensor data
‣ Medical data
Figure: Big Data, characterized by Volume, Velocity, and Variety
Recap of Previous Lecture
‣ Key points:
‣ How data is generated and collected in daily applications.
‣ How these data can enable better services.
‣ How data is formalized and organized with various data types.
‣ Properties of different data.
‣ Simple calculations of the data size.
Then, after collecting data, how do we organize them and do some simple analysis?
Things That Will Be Covered
‣ Data transformations
‣ Data visualizations
‣ Simple data statistics
Data Transformation
‣ Data are the results of deliberate human intervention
‣ Data vary across domains
‣ Different domains have different forms of data
‣ Data vary within domains
‣ Data within the same domain have different values.
The raw data is complex, and we need some amount of preparation!
Data Preparation
‣ Deliverables from big data
‣ Building prediction tools
‣ Building decision-making models
‣ Generating figures and reports
A lot needs to be done before preparing the deliverables.
Data Preparation
‣ After collecting the data, we should
‣ Understand what the variables are
‣ Manage column types
‣ Handle missing values
Data is typically organized as a table, with rows and columns.
Python:
import pandas as pd
# Load the state-level covid data as a table (rows and columns)
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'
data = pd.read_csv(url)
Data Table and Type
‣ What do tables mean?
‣ Collection of different variables
‣ Collection of multiple observations
Figure labels: columns = different variables, rows = observations
Data Table and Type
‣ What do columns mean?
‣ Different features
‣ Different aspects of the data point
Figure label: different features
Data Table and Type
‣ How were the data collected?
‣ Province_State and Country_Region are collected by humans
‣ Last_update time is collected by the clock
‣ Lat and Long_ are collected by GPS
‣ Confirmed and Deaths are collected by humans
Column           Data type   Description
Province_State   String      State name
Country_Region   String      Country name
Last_update      Datetime    Time of update
Lat              Float       Value of latitude
Long_            Float       Value of longitude
Confirmed        Int         Number of confirmed cases
Deaths           Int         Number of deaths
Features and Observations
‣ Features represent the values collected in different manners
‣ Collected by different sensors
‣ Collected at different locations
‣ Collected for different objects
‣ Collected at different times
‣ …
‣ Observations represent the data point containing the corresponding features
‣ Can be different at different times
‣ Can be different for different individuals
‣ …
https://en.wikipedia.org/wiki/Moore%27s_law
One Single Observation
‣ Covid data
• Observation: each observation is a collection of feature values.
• Different observations refer to different states and the corresponding information.
Managing Data Types
‣ Data has different “types”
‣ Numerical, categorical, dates, integers, strings
‣ We need to manage some of them to make their information clearer
‣ Numerical methods cannot directly handle string-type data
Column           Data type   Description
Province_State   String      State name
Country_Region   String      Country name
Last_update      String      Time of update
Lat              Float       Value of latitude
Long_            Float       Value of longitude
Confirmed        Int         Number of confirmed cases
Death            Int         Number of deaths
Managing Data Types: Dates
‣ The update time is stored as a string
‣ To make better use of it, we need to convert the string to a datetime
String → Datetime
“2023-01-23 04:31:38” → datetime(2023, 1, 23, 4, 31, 38)
A single datetime string can be used to derive a number of new features, e.g., year, month, day, …
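The string-to-datetime conversion can be sketched with pandas as follows. This is only a minimal illustration: the two-row table is hypothetical (the second timestamp is made up), and the new column names year, month, and day are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical two-row table; the second timestamp is made up for illustration.
df = pd.DataFrame({"Last_update": ["2023-01-23 04:31:38", "2023-02-10 18:05:00"]})

# Convert the string column to a datetime column.
df["Last_update"] = pd.to_datetime(df["Last_update"])

# Derive new features from the single datetime column.
df["year"] = df["Last_update"].dt.year
df["month"] = df["Last_update"].dt.month
df["day"] = df["Last_update"].dt.day
print(df)
```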
Managing Data Types: Categorical
‣ Categorical types are common in many datasets
‣ A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values/groups
Province_State is a categorical variable as it can only be one of the states or small islands in the US.
Managing Data Types: Categorical
‣ Managing categorical variables
‣ The number of groups may be large; merge some values into one group.
{Alabama, Alaska, American Samoa, Arizona, Arkansas, California} → {Arizona, California, others}.
6 groups reduce to 3 groups.
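Merging groups can be sketched in pandas as below. The choice of which states to keep and the label "others" follow the example on this slide; the variable names are assumptions for illustration.

```python
import pandas as pd

states = pd.Series(["Alabama", "Alaska", "American Samoa",
                    "Arizona", "Arkansas", "California"])
keep = ["Arizona", "California"]

# Keep the selected states; replace every other value with "others".
merged = states.where(states.isin(keep), "others")
print(merged.unique())
```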
Managing Data Types: Categorical
‣ Managing categorical variables
‣ A single categorical variable might encode multiple pieces of information; we need to decompose it to better organize the information.
Location                    Province_State     Country
California, US              California         US
Arizona, US                 Arizona            US
Guangdong, China            Guangdong          China
British Columbia, Canada    British Columbia   Canada
Decompose a categorical variable by using more features.
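One possible way to do this decomposition in pandas is to split the string on the comma. This is a sketch that assumes every Location value has exactly the form "province, country".

```python
import pandas as pd

df = pd.DataFrame({"Location": ["California, US", "Arizona, US",
                                "Guangdong, China", "British Columbia, Canada"]})

# Split each string on ", " and store the two parts as new columns.
df[["Province_State", "Country"]] = df["Location"].str.split(", ", expand=True)
print(df)
```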
Managing Data Types: Categorical
‣ Managing categorical variables
‣ Converting categorical variables to numerical values
Using integers: the FIPS code maps each state to a number
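When no ready-made code is available, integer labels can be generated automatically. The sketch below uses pandas category codes; these are arbitrary integers assigned in sorted order of the distinct values, not FIPS codes, and the sample values are hypothetical.

```python
import pandas as pd

states = pd.Series(["California", "Arizona", "California", "Texas"])

# Category codes follow the sorted order of the distinct values:
# Arizona -> 0, California -> 1, Texas -> 2.
codes = states.astype("category").cat.codes
print(codes.tolist())  # [1, 0, 1, 2]
```

Note that integer codes introduce an ordering between groups that usually has no meaning.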
Managing Data Types: Categorical
‣ Managing categorical variables
‣ Converting categorical variables to numerical values
One-hot encoding
Correct?      One-hot vector
TRUE          [1, 0, 0]
FALSE         [0, 1, 0]
Don’t know    [0, 0, 1]
• Each value is represented as a one-hot vector.
• Dimension/number of entries of the vector is the number of categorical groups.
• One entry is 1 and the remaining entries are 0.
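A minimal sketch of one-hot encoding with NumPy, using the three groups from this slide. The list of observed answers is hypothetical.

```python
import numpy as np

groups = ["TRUE", "FALSE", "Don't know"]            # the categorical groups
answers = ["TRUE", "TRUE", "FALSE", "Don't know"]   # hypothetical observations

# Row i of the identity matrix is the one-hot vector of group i.
eye = np.eye(len(groups), dtype=int)
one_hot = np.array([eye[groups.index(a)] for a in answers])
print(one_hot)
```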
Handling Missing Values
‣ Missing values occur frequently
‣ Caused by human error, sensor failures, data collection constraints
‣ A missing value is typically marked as NA (or NaN).
‣ If we can choose not to use it, we can consider removing the entire rows with missing values
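Removing rows with missing values can be sketched as follows. The table and all of its numbers are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table; the numbers are made up for illustration.
df = pd.DataFrame({"state": ["Alabama", "American Samoa", "Arizona"],
                   "cases": [100.0, 5.0, 200.0],
                   "deaths": [10.0, np.nan, 20.0]})

print(df.isna().sum())   # number of missing values per column
clean = df.dropna()      # drop every row that contains a NaN
print(clean)
```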
Handling Missing Values
‣ Missing values occur frequently
‣ If we need to use them for building models, we may need imputation
‣ Using the average from the observed cases
‣ Predict the missing value using prior knowledge (e.g., too large, too small)
‣ Predict the missing value using simple rules (mean, mode, etc.)
Example: the missing (NaN) number of deaths for American Samoa is very likely smaller than 100.
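Imputation with the average of the observed cases can be sketched as below. The values are hypothetical; the mean is just one of the simple rules listed above.

```python
import numpy as np
import pandas as pd

deaths = pd.Series([10.0, np.nan, 20.0, 30.0])  # hypothetical values

# Series.mean() skips NaN, so this is the mean of the observed entries.
filled = deaths.fillna(deaths.mean())
print(filled.tolist())  # [10.0, 20.0, 20.0, 30.0]
```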
Handling Missing Values
‣ Missing values occur frequently
‣ If we need to use them for building models, we may need imputation
‣ There are many more powerful methods for handling missing values by using more data points.
‣ Many modern machine learning models allow masking the missing values and implicitly complete them during model training.
Data Visualization
‣ Visualization can help you get a qualitative understanding of the data
‣ Visualization can help you know what happens, good or bad
‣ See if the visualization matches your understanding
‣ Visualization can also help you understand the data from different perspectives
‣ Correlations between different features
‣ Trend of the data
Visualization Encodings
‣ Visualization represents data using graphical marks
‣ Different attributes of the marks encode data variables
‣ Marks allow us to make comparisons
‣ We can further reason about the data based on the visualization
Figure panels: x,y positions; different size; different color; different shape
Reasoning: the value will keep increasing in the future
Visualization Encodings
‣ Visualization represents data using graphical marks
‣ Some encodings are better for some variable types
‣ Size is better for continuous, not good for categorical
‣ For categorical, it would be better to use position
‣ Some encodings are easier to perceive
‣ Color is better than shape
‣ But do not use too many colors
Different Visualization
‣ Given a dataset, we have different visualization methods
‣ Using a curve to see the trend
Python:
import numpy as np
import matplotlib.pyplot as plt
# data is the table loaded earlier with pd.read_csv
data_ca = data[data['state']=='California']
# Differences of the cumulative counts give the daily cases
daily_cases = np.diff(data_ca['cases'].to_numpy())
plt.plot(daily_cases)
Figure: Daily covid cases in California
What can we conclude and reason about?
• There are 4 waves.
• The second wave is stronger than the first one.
• The daily cases will likely decrease in the coming days.
Each point can be viewed as a data pair (x, y)
Different Visualization
‣ Given a dataset, we have different visualization methods
‣ Using curves to make comparisons
Figure: Daily cumulative covid cases in California and Texas
What can we conclude and reason about?
• California has more cases than Texas.
• The difference is becoming larger.
• The cumulative cases are still increasing at a steady rate.
Different Visualization
‣ Given a dataset, we have different visualization methods
‣ Using a pie chart to highlight the ratio of different components
Figure: Cumulative cases pie chart
What can we conclude and reason about?
• California accounts for 12% of all cases.
• There are 4 states that each account for over 5% of all cases.
• Can help us understand which component takes the largest ratio.
Different Visualization
‣ Given a dataset, we have different visualization methods
‣ Using a bar chart to highlight the comparison
Figure: Daily cases comparison at different dates
What can we conclude and reason about?
• 2020-08-12 has significantly more cases than 2020-03-15.
• 2021-02-28 has fewer cases than 2020-08-12.
• Cases in Aug. 2020 will be more than those in May 2020.
Different Visualization
‣ Given a large number of observations, we can
‣ Use histogram to visualize the distribution density
‣ Count within bin ranges
Figure: Histogram of the daily cases
What can we see and reason about?
• Most of the daily cases are less than 20000.
• There are some days with cases greater than 30000.
• Daily cases are mostly concentrated around 2000-4000.
Different Visualization
‣ Given a large number of observations, we can
‣ Use histogram to visualize the distribution density
‣ Count within bin ranges
‣ Histogram of the daily cases
Normalizing to a distribution: make the total area equal to 1.
• Only the y-axis changes
• The shape is the same!
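The effect of the normalization can be checked numerically. The sketch below uses randomly generated daily case counts as a stand-in for the real data (an assumption for illustration) and compares the count histogram with the density histogram from NumPy.

```python
import numpy as np

# Hypothetical daily case counts drawn at random, for illustration only.
rng = np.random.default_rng(0)
daily_cases = rng.gamma(shape=2.0, scale=3000.0, size=500)

counts, edges = np.histogram(daily_cases, bins=20)             # counts within bin ranges
density, _ = np.histogram(daily_cases, bins=20, density=True)  # normalized version

widths = np.diff(edges)
area = np.sum(density * widths)   # total area under the normalized histogram

# Ratio between the two histograms over the non-empty bins.
ratio = density[counts > 0] / counts[counts > 0]
print(area, ratio.min(), ratio.max())
```

The ratio is the same constant in every bin, which is why only the y-axis changes while the shape stays the same.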
Different Visualization
‣ Given a large number of observations, we can
‣ Use scatter plot to visualize the distribution
‣ Also infer the relationships between different features
Figure: Distribution of cases and deaths
What can we see and reason about?
• There are some outliers.
• Larger daily cases typically imply larger daily deaths.
Different Visualization
‣ Visualizations are now easy to generate using Python
Some details will be discussed in the tutorial!
Simple Data Stats
‣ Simple data statistics help summarize the property of data
‣ Statistics provide a quantitative understanding
‣ Statistics for a single variable
‣ mean, median
‣ variance, standard deviation
‣ Statistics for two variables
‣ Covariance
‣ Correlation
Statistics for Single Variables: Location Measures
‣ Given a large number of observations, we aim to understand
‣ What is their typical value
‣ Where is their center
Daily Covid deaths in Aug. 2021 in California
[ 121 -5 93 54 39 13 45 10 51 -357 52 63 34 31 23 43
89 92 112 62 45 33 68 140 115 122 71 70 47 70 124]
Statistics for Single Variables: Location Measures
‣ Location measures: mean and median
‣ Mean denotes the average of samples
‣ Median denotes the sample at the center of the distribution, i.e., #{samples < median} = #{samples > median}
Given a set of values $x_1, x_2, \dots, x_n$, the mean is calculated using the following formula:
$$\bar{x} = \mathrm{mean}(x) = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Given a set of values $x_1, x_2, \dots, x_n$, the median is calculated as follows:
• First sort the values from smallest to largest: $x_{(1)}, x_{(2)}, \dots, x_{(n)}$
• Then pick the middle value(s):
• If n is odd, pick $x_{((n+1)/2)}$
• If n is even, pick $[x_{(n/2)} + x_{(n/2+1)}]/2$.
Statistics for Single Variables: Location Measures
‣ Location measures: mean and median
‣ Mean denotes the average of samples
‣ Median denotes the sample at the center of the distribution, i.e., #{samples < median} = #{samples > median}
Daily Covid deaths in Aug. 2021 in California
[ 121 -5 93 54 39 13 45 10 51 -357 52 63 34 31 23 43 89
92 112 62 45 33 68 140 115 122 71 70 47 70 124]
Sorted sequence:
[-357 -5 10 13 23 31 33 34 39 43 45 45 47 51 52 54
62 63 68 70 70 71 89 92 93 112 115 121 122 124 140]
Mean = 50.6, median = 54
Statistics for Single Variables: Location Measures
‣ Location measures: mean and median
‣ Changing the samples changes the mean and median
‣ Median is more stable after removing a small number of outliers
Daily Covid deaths in Aug. 2021 in California
[ 121 -5 93 54 39 13 45 10 51 -357 52 63 34 31 23 43
89 92 112 62 45 33 68 140 115 122 71 70 47 70 124]
Removing unreasonable samples (-5 and -357), sorted sequence:
[10 13 23 31 33 34 39 43 45 45 47 51 52 54 62 63
68 70 70 71 89 92 93 112 115 121 122 124 140]
Mean = 66.6, median = 62
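These numbers can be reproduced with NumPy. The sketch below assumes that "unreasonable" means a negative daily death count; the variable names are chosen for illustration.

```python
import numpy as np

# Daily Covid deaths in Aug. 2021 in California (from the slide).
deaths = np.array([121, -5, 93, 54, 39, 13, 45, 10, 51, -357, 52, 63, 34, 31, 23, 43,
                   89, 92, 112, 62, 45, 33, 68, 140, 115, 122, 71, 70, 47, 70, 124])
print(round(float(deaths.mean()), 1), float(np.median(deaths)))    # 50.6 54.0

# Remove the unreasonable (negative) samples.
cleaned = deaths[deaths >= 0]
print(round(float(cleaned.mean()), 1), float(np.median(cleaned)))  # 66.6 62.0
```

The mean moves by 16 while the median moves by 8, which illustrates that the median is the more stable of the two.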
Mathematical Description of Mean and Median
‣ Find the center of a number of samples
‣ Definition of centers
‣ How to find them
Mathematically, the center is defined as the point such that the average distance to this point is minimized.
$$\bar{x} = \arg\min_{x} \frac{1}{n}\sum_{i=1}^{n} \mathrm{dist}(x, x_i)$$
Mathematical Description of Mean and Median
‣ Find the center of a number of samples
‣ Definition of centers
‣ Distance metric can be absolute deviation: $|x_i - x_j|$
‣ Distance metric can be squared deviation: $(x_i - x_j)^2$
Mean minimizes the average distance using squared deviation:
$$\bar{x} = \arg\min_{x} \frac{1}{n}\sum_{i=1}^{n} (x - x_i)^2$$
Median minimizes the average distance using absolute deviation:
$$x_{\mathrm{median}} = \arg\min_{x} \frac{1}{n}\sum_{i=1}^{n} |x - x_i|$$
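Both statements can be checked numerically with a brute-force search over candidate centers. This is only a sketch: the samples are hypothetical, and the grid step of 0.01 is an arbitrary choice.

```python
import numpy as np

x = np.array([10, 13, 23, 31, 33, 34, 140])          # hypothetical samples with one large value
candidates = np.linspace(x.min(), x.max(), 13001)    # candidate centers, grid step 0.01

# Average distance from each candidate center to all samples.
sq_loss = ((candidates[:, None] - x) ** 2).mean(axis=1)    # squared deviation
abs_loss = np.abs(candidates[:, None] - x).mean(axis=1)    # absolute deviation

best_sq = candidates[sq_loss.argmin()]
best_abs = candidates[abs_loss.argmin()]
print(best_sq, x.mean())        # minimizer of squared deviation vs. the mean
print(best_abs, np.median(x))   # minimizer of absolute deviation vs. the median
```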
Visualization of the mean and median
‣ For different distributions, the median and mean behave differently
As the data distribution becomes more balanced, the mean and median will be closer.
Spread of the data
‣ How to characterize the spread of the data
The mean is the center of the data; the spread characterizes the average deviation of the data points from that center.
Statistics for Single Variables: Spread Measures
‣ Given a large number of observations, the spread also shows
‣ The uncertainty among them
‣ The variation of the values around the mean
Daily Covid deaths in Aug. 2021 in California
[ 121 -5 93 54 39 13 45 10 51 -357 52 63 34 31 23 43
89 92 112 62 45 33 68 140 115 122 71 70 47 70 124]
Variance and Standard Deviation
‣ Spread measures: variance and standard deviation
‣ The average (squared) distance to the mean
‣ Recall
$$\bar{x} = \arg\min_{x} \frac{1}{n}\sum_{i=1}^{n} (x - x_i)^2$$
‣ Variance is then defined as
$$\mathrm{Var} = \frac{1}{n}\sum_{i=1}^{n} (\bar{x} - x_i)^2$$
‣ Standard deviation is the square root of the variance
$$\mathrm{std} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\bar{x} - x_i)^2}$$
Std has the same unit as the sample.
When $\bar{x}$ is estimated, we typically use the following formulas:
$$\mathrm{Var} = \frac{1}{n-1}\sum_{i=1}^{n} (\bar{x} - x_i)^2, \qquad \mathrm{std} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (\bar{x} - x_i)^2}$$
Variance and Standard Deviation
‣ Calculations
All samples:
[ 121 -5 93 54 39 13 45 10 51 -357 52 63 34 31 23 43
89 92 112 62 45 33 68 140 115 122 71 70 47 70 124]
Variance = 6852, std = 83
After removing the unreasonable samples (-5 and -357):
[ 121 93 54 39 13 45 10 51 52 63 34 31 23 43
89 92 112 62 45 33 68 140 115 122 71 70 47 70 124]
Variance = 1232, std = 35
Variance and standard deviation are even more sensitive to outliers.
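The calculations can be reproduced with NumPy. In this sketch, np.var and np.std use the 1/n formula by default, and passing ddof=1 switches to the 1/(n-1) formula; as before, negative counts are assumed to be the unreasonable samples.

```python
import numpy as np

deaths = np.array([121, -5, 93, 54, 39, 13, 45, 10, 51, -357, 52, 63, 34, 31, 23, 43,
                   89, 92, 112, 62, 45, 33, 68, 140, 115, 122, 71, 70, 47, 70, 124])

# 1/n version (NumPy default, ddof=0)
print(round(float(np.var(deaths))), round(float(np.std(deaths))))    # 6852 83

# 1/(n-1) version, used when the mean is estimated from the same samples
print(np.var(deaths, ddof=1), np.std(deaths, ddof=1))

# After removing the unreasonable (negative) samples
cleaned = deaths[deaths >= 0]
print(round(float(np.var(cleaned))), round(float(np.std(cleaned))))  # 1232 35
```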
Statistics for Two Variables
‣ Given a large number of observations of two features, we aim to
understand
‣ What’s the correlation between them
Daily Covid deaths in Sept. 2021 in California
[ 144 91 93 15 73 153 176 164 87 50 113 110 225 178 123 86 72 52
86 158 130 178 30 16 161 68 148 145 109 18]
Daily Covid cases in Sept. 2021 in California
[12022 9727 10551 7054 11141 9061 11053 9861 9154 7392 16056 8746 8760
12292 10053 8263 7068 8696 6669 7557 9726 7495 1555 1179 19950 5305 7143
9241 8033 1363]
Statistics for Two Variables
• Deaths and cases are positively
correlated.
• How do we measure the correlation?
Covariance
‣ Covariance: mean of the product of deviations
$$\mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
‣ Consider extreme cases:
‣ What happens if xi = yi?
‣ What happens if xi = − yi ?
‣ What are the units of the covariance?
Correlation
‣ Correlation
‣ Covariance depends on the scaling of the variables
‣ But correlation does not
$$\mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{\mathrm{std}(x)\,\mathrm{std}(y)}$$
‣ Consider extreme cases:
‣ What happens if xi = yi?
‣ What happens if xi = − yi ?
‣ What is the unit of the correlation? It is unitless, with values in [-1, 1]
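The extreme cases can be explored with a few lines of NumPy. The helper functions cov and corr below are hypothetical names that implement the 1/n formulas from these slides, and the samples are made up.

```python
import numpy as np

def cov(x, y):
    # Mean of the product of deviations (1/n version).
    return np.mean((x - x.mean()) * (y - y.mean()))

def corr(x, y):
    # Covariance divided by the product of the standard deviations.
    return cov(x, y) / (np.std(x) * np.std(y))

x = np.array([1.0, 2.0, 4.0, 7.0])   # hypothetical samples

print(corr(x, x), corr(x, -x))               # x_i = y_i and x_i = -y_i
print(cov(x, x), np.var(x))                  # covariance of x with itself vs. its variance
print(cov(1000 * x, x), corr(1000 * x, x))   # effect of rescaling one variable
```

With x_i = y_i the correlation is 1 and the covariance equals the variance; with x_i = -y_i the correlation is -1. Rescaling x changes the covariance but not the correlation.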
Covariance and Correlations
‣ Calculations
[ 144 91 93 15 73 153 176 164 87 50 113 110 225 178 123 86 72 52
86 158 130 178 30 16 161 68 148 145 109 18]
[12022 9727 10551 7054 11141 9061 11053 9861 9154 7392 16056 8746 8760
12292 10053 8263 7068 8696 6669 7557 9726 7495 1555 1179 19950 5305 7143
9241 8033 1363]
Covariance = 115786 (hard to interpret), correlation = 0.55 (easier to interpret)
Correlation is more informative than covariance.
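A sketch of how such a calculation might be done with NumPy's built-in functions on the two sequences above. Whether the slide used the 1/n or the 1/(n-1) convention for the covariance is not stated; bias=True is an assumption that selects the 1/n version.

```python
import numpy as np

# Daily Covid deaths and cases in Sept. 2021 in California (from the slide).
deaths = np.array([144, 91, 93, 15, 73, 153, 176, 164, 87, 50, 113, 110, 225, 178, 123,
                   86, 72, 52, 86, 158, 130, 178, 30, 16, 161, 68, 148, 145, 109, 18])
cases = np.array([12022, 9727, 10551, 7054, 11141, 9061, 11053, 9861, 9154, 7392,
                  16056, 8746, 8760, 12292, 10053, 8263, 7068, 8696, 6669, 7557,
                  9726, 7495, 1555, 1179, 19950, 5305, 7143, 9241, 8033, 1363])

# np.cov returns a 2x2 matrix; entry [0, 1] is cov(deaths, cases).
cov_dc = np.cov(deaths, cases, bias=True)[0, 1]
# np.corrcoef returns the matrix of correlation coefficients.
corr_dc = np.corrcoef(deaths, cases)[0, 1]
print(cov_dc, corr_dc)
```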
Summary
‣ Data transformations
‣ Features, observations, data type, tidy data
‣ Data visualizations
‣ Markers, plots, comparisons
‣ Simple Statistics
‣ Mean, median, variance, correlation