Chapter 2.
Data
Wrangling
2
Outline
1. Data Analysis Process
2. Data Analysis Techniques
3. Data Wrangling
4. Data Wrangling using Pandas library
3
1. Data Analysis Process
The process of analyzing raw data to get meaningful insights
4
1. Data Analysis Process
5 steps of the process:
1. Identify: What to do
2. Collect: file, SQL, web, ....
3. Clean: Remove useless data
4. Analyze: Explore the meaningful data
5. Interpret: Communicate the result
5
1. Data Analysis Process
In detail:
6
2. Data Analysis Techniques
- Descriptive Analysis: summarizing and describing the
main features of a dataset
- Diagnostic Analysis: understand the cause-and-effect
relationships within the data
- Predictive Analysis: using historical data to make
predictions about future outcomes.
- Prescriptive Analysis: recommending actions based on the
predictions.
7
3. Data Wrangling
- Gathering Data
- Assessing Data
- Cleaning Data
8
3. Data Wrangling
Gathering Data:
Reading data from:
◦ flat file
◦ csv file
◦ API, web-scraping, JSON, SQL
9
3. Data Wrangling
Home work:
1. Write program to read tiny_tsv.tsv file or access data from
web
2. Submit .ipynb file to Lab4
Reference:
https://www.youtube.com/watch?v=bargNl2WeN4&list=PLUaB-
1hjhk8FE_XZ87vPPSfHqb6OcM0cF&index=54
10
3. Data Wrangling
Assessing Data:
5 Dimensions of Data Quality:
○ Completeness
○ Validity
○ Accuracy
○ Consistency
○ Uniqueness
11
3. Data Wrangling
12
3. Data Wrangling
13
3. Data Wrangling
14
3. Data Wrangling
15
3. Data Wrangling
16
3. Data Wrangling
Cleaning Data: fixing or removing problems within a dataset:
○ incorrect,
○ corrupted,
○ incorrectly formatted,
○ duplicate,
○ incomplete data.
(https://www.youtube.com/watch?v=bDhvCp3_lYw&list=PLUaB-
1hjhk8FE_XZ87vPPSfHqb6OcM0cF&index=63)
17
3. Data Wrangling
Home work:
1. Write a program to investigate a Dataset - [TMDb Movie
Data]
2. Submit .ipynb file to Lab5
Reference Videos:
https://www.youtube.com/watch?v=bDhvCp3_lYw&list=PL
UaB-1hjhk8FE_XZ87vPPSfHqb6OcM0cF&index=63
18
4. Pandas library
- A very popular library for working with data
- Data Frames are at the center of pandas
- Use the following import convention:
import pandas as pd
19
4. Pandas library
20
4. Pandas library
Example:
21
4. Pandas library
Popular functions:
- Reading data:
pd.read_csv(“abc.csv")
pd.read_excel(“abc.xlsx", sheet_name=“sheet1")
pd.read_sql(sql, connection)
22
4. Pandas library
23
4. Pandas library
24
4. Pandas library
25
4. Pandas library
26
4. Pandas library
Home work:
1. Write a program to investigate a Dataset - [TMDb Movie
Data]
2. Submit .ipynb file to Lab6
Reference Videos:
https://www.youtube.com/watch?v=bDhvCp3_lYw&list=PL
UaB-1hjhk8FE_XZ87vPPSfHqb6OcM0cF&index=63
std()
27
4. Pandas library
Descriptive statistics:
28
4. Pandas library
Indexing and selecting data:
29
4. Pandas library
Boolean Indexing:
df.iloc[rindex, cindex]: get cell data
df.loc[2][‘date_taken']: get cell data
df1 = (df['id'] > 2) & (df[‘age'] < 30)
df2 = df2[[x.startswith('t') for x in df2['a']]]
df20 = df[df[‘age'] > 20]
options = ['Bitcoin', 'Ethereum','Litecoin', 'Bitcoin Cash']
df = df[df['name'].isin(options)]
std()
30
4. Pandas library
Data type:
.dtypes: check data type
.astype(dtype): Casting data type
df = df.astype({"a": np.bool_, "c": float64})
df[‘a’, ‘b’] = df[‘a’, ‘b’].astype(int)
to_datetime(): convertto date time objects
df.date_taken = pd.to_datetime(df.date_taken)
std()
31
4. Pandas library
Data type:
.dtypes: check data type
.to_numeric(): convert to numeric
df = pd.to_numeric(df) or df= df.apply(pd.to_numeric)
df["a"] = pd.to_numeric(df["a"])
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
std()
32
4. Pandas library
Rename colums:
.rename(columns = {‘old name': ‘new name'}, inplace =
True)
Should use: small letters and hyphens
Reorder columns:
df = df.loc[:,[‘x', ‘y', ‘z']]
df = df[[‘y', ‘x', ‘z']]
std()
33
4. Pandas library
Add colums:
df[‘new col’] = data
df[type] = ‘’
Fill data to a column based on a condition:
df.loc[df [‘age'] > 90, ‘type'] = ‘Old person'
std()
34
4. Pandas library
Check null values:
.isna()
.notna()
Check duplicates rows:
.duplicated()
std()
35
4. Pandas library
Deal with null values:
.fillna(data): fill Null = data
.replace(r'^\s*$', np.nan, regex=True): replace white
spaces = na
Extract Number from String:
.str.extract('(\d+)')
df [‘x'] = df ['x'].str.extract('(\d+)')
std()
36
4. Pandas library
Delete:
.dropna() : Delete na rows
.dropna(axis=1): Delete na columns
.dropna(subset=[‘a', ‘b']): Delete na rows in columns a, b
.drop(index): Delete row index
.drop_duplicates(): Delete duplicated rows
.drop([‘a’,’b’], axis = 1): Delete a, b columns
std()
37
4. Pandas library
Split a colum:
.str.split('/',n, expand = True): Split a string to n parts from
the left
df[‘a’].str.split(':',n =2, expand = True)
df[‘x’, ‘y’, ’z’] = df[‘a’].str.split('/',n =2, expand = True):
split column a into 3 parts and put to 3 new colums x, y, z
std()
38
4. Pandas library
Replace values in a column:
df[‘x'] = df[x'].replace(['quote.USD.percent_change_1h',
'quote.USD.percent_change_24h',
'quote.USD.percent_change_7d'],
['1h', '24h', '7d'])
std()
39
4. Pandas library
Get columns with data type:
df.select_dtypes(include='object'): string columns
df.select_dtypes(include=‘number'): numeric columns
df.select_dtypes(include=‘float'): float columns
40
4. Pandas library
Home work:
1. Write a program to investigate a Dataset - [TMDb Movie
Data]
2. Submit .ipynb file to Lab7
Reference Videos:
https://www.youtube.com/watch?v=bDhvCp3_lYw&list=PL
UaB-1hjhk8FE_XZ87vPPSfHqb6OcM0cF&index=63
41
4. Pandas library
Merge, join, concatenate:
frames = [df1, df2, df3]
result = pd.concat(frames)
42
4. Pandas library
Merge, join, concatenate:
pd.concat([df1, df4], axis=1)
pd.concat([df1, df4], axis=1, join="inner")
43
4. Pandas library
Merge, join, concatenate:
44
4. Pandas library
Merge, join, concatenate:
result = pd.merge(left, right, on="key")
45
4. Pandas library
Merge, join, concatenate:
result = pd.merge(left, right, how="inner", on=["key1",
"key2"])
46
4. Pandas library
Merge, join, concatenate:
result = left.join(right, on=["key1", "key2"],
how="inner")
47
4. Pandas library
Home work:
1. Write a program to investigate a Dataset - [TMDb Movie
Data]
2. Submit .ipynb file to Lab8
Reference Videos:
https://www.youtube.com/watch?v=bDhvCp3_lYw&list=PL
UaB-1hjhk8FE_XZ87vPPSfHqb6OcM0cF&index=63
Thanks!
Any questions?