0% found this document useful (0 votes)
76 views48 pages

Data Wrangling & Pandas Guide

Uploaded by

nthai02092005
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
76 views48 pages

Data Wrangling & Pandas Guide

Uploaded by

nthai02092005
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 48

Chapter 2.

Data
Wrangling
2

Outline

1. Data Analysis Process


2. Data Analysis Techniques
3. Data Wrangling
4. Data Wrangling using Pandas library
3

1. Data Analysis Process

The process of analyzing raw data to get meaningful insights


4

1. Data Analysis Process

5 steps of the process:


1. Identify: What to do
2. Collect: file, SQL, web, ....
3. Clean: Remove useless data
4. Analyze: Explore the meaningful data
5. Interpret: Communicate the result
5

1. Data Analysis Process

In detail:
6

2. Data Analysis Techniques

- Descriptive Analysis: summarizing and describing the


main features of a dataset
- Diagnostic Analysis: understand the cause-and-effect
relationships within the data
- Predictive Analysis: using historical data to make
predictions about future outcomes.
- Prescriptive Analysis: recommending actions based on the
predictions.
7

3. Data Wrangling

- Gathering Data
- Assessing Data
- Cleaning Data
8

3. Data Wrangling

Gathering Data:
Reading data from:
◦ flat file
◦ csv file
◦ API, web-scraping, JSON, SQL
9

3. Data Wrangling

Home work:
1. Write program to read tiny_tsv.tsv file or access data from
web
2. Submit .ipynb file to Lab4
Reference:
https://www.youtube.com/watch?v=bargNl2WeN4&list=PLUaB-
1hjhk8FE_XZ87vPPSfHqb6OcM0cF&index=54
10

3. Data Wrangling

Assessing Data:
5 Dimensions of Data Quality:
○ Completeness
○ Validity
○ Accuracy
○ Consistency
○ Uniqueness
11

3. Data Wrangling
12

3. Data Wrangling
13

3. Data Wrangling
14

3. Data Wrangling
15

3. Data Wrangling
16

3. Data Wrangling

Cleaning Data: fixing or removing problems within a dataset:


○ incorrect,
○ corrupted,
○ incorrectly formatted,
○ duplicate,
○ incomplete data.
(https://www.youtube.com/watch?v=bDhvCp3_lYw&list=PLUaB-
1hjhk8FE_XZ87vPPSfHqb6OcM0cF&index=63)
17

3. Data Wrangling

Home work:
1. Write a program to investigate a Dataset - [TMDb Movie
Data]
2. Submit .ipynb file to Lab5
Reference Videos:
https://www.youtube.com/watch?v=bDhvCp3_lYw&list=PL
UaB-1hjhk8FE_XZ87vPPSfHqb6OcM0cF&index=63
18

4. Pandas library

- A very popular library for working with data


- Data Frames are at the center of pandas
- Use the following import convention:
import pandas as pd
19

4. Pandas library
20

4. Pandas library

Example:
21

4. Pandas library

Popular functions:
- Reading data:
pd.read_csv(“abc.csv")
pd.read_excel(“abc.xlsx", sheet_name=“sheet1")
pd.read_sql(sql, connection)
22

4. Pandas library
23

4. Pandas library
24

4. Pandas library
25

4. Pandas library
26

4. Pandas library

Home work:
1. Write a program to investigate a Dataset - [TMDb Movie
Data]
2. Submit .ipynb file to Lab6
Reference Videos:
https://www.youtube.com/watch?v=bDhvCp3_lYw&list=PL
UaB-1hjhk8FE_XZ87vPPSfHqb6OcM0cF&index=63
std()

27

4. Pandas library
Descriptive statistics:
28

4. Pandas library
Indexing and selecting data:
29

4. Pandas library
Boolean Indexing:
df.iloc[rindex, cindex]: get cell data
df.loc[2][‘date_taken']: get cell data
df1 = (df['id'] > 2) & (df[‘age'] < 30)
df2 = df2[[x.startswith('t') for x in df2['a']]]
df20 = df[df[‘age'] > 20]
options = ['Bitcoin', 'Ethereum','Litecoin', 'Bitcoin Cash']
df = df[df['name'].isin(options)]
std()

30

4. Pandas library
Data type:
.dtypes: check data type
.astype(dtype): Casting data type
df = df.astype({"a": np.bool_, "c": float64})
df[‘a’, ‘b’] = df[‘a’, ‘b’].astype(int)
to_datetime(): convertto date time objects
df.date_taken = pd.to_datetime(df.date_taken)
std()

31

4. Pandas library
Data type:
.dtypes: check data type
.to_numeric(): convert to numeric
df = pd.to_numeric(df) or df= df.apply(pd.to_numeric)
df["a"] = pd.to_numeric(df["a"])
df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
std()

32

4. Pandas library
Rename colums:
.rename(columns = {‘old name': ‘new name'}, inplace =
True)
Should use: small letters and hyphens
Reorder columns:
df = df.loc[:,[‘x', ‘y', ‘z']]
df = df[[‘y', ‘x', ‘z']]
std()

33

4. Pandas library
Add colums:
df[‘new col’] = data
df[type] = ‘’
Fill data to a column based on a condition:
df.loc[df [‘age'] > 90, ‘type'] = ‘Old person'
std()

34

4. Pandas library
Check null values:
.isna()
.notna()
Check duplicates rows:
.duplicated()
std()

35

4. Pandas library
Deal with null values:
.fillna(data): fill Null = data
.replace(r'^\s*$', np.nan, regex=True): replace white
spaces = na
Extract Number from String:
.str.extract('(\d+)')
df [‘x'] = df ['x'].str.extract('(\d+)')
std()

36

4. Pandas library
Delete:
.dropna() : Delete na rows
.dropna(axis=1): Delete na columns
.dropna(subset=[‘a', ‘b']): Delete na rows in columns a, b
.drop(index): Delete row index
.drop_duplicates(): Delete duplicated rows
.drop([‘a’,’b’], axis = 1): Delete a, b columns
std()

37

4. Pandas library
Split a colum:
.str.split('/',n, expand = True): Split a string to n parts from
the left
df[‘a’].str.split(':',n =2, expand = True)
df[‘x’, ‘y’, ’z’] = df[‘a’].str.split('/',n =2, expand = True):
split column a into 3 parts and put to 3 new colums x, y, z
std()

38

4. Pandas library
Replace values in a column:
df[‘x'] = df[x'].replace(['quote.USD.percent_change_1h',
'quote.USD.percent_change_24h',
'quote.USD.percent_change_7d'],
['1h', '24h', '7d'])
std()

39

4. Pandas library
Get columns with data type:
df.select_dtypes(include='object'): string columns
df.select_dtypes(include=‘number'): numeric columns
df.select_dtypes(include=‘float'): float columns
40

4. Pandas library

Home work:
1. Write a program to investigate a Dataset - [TMDb Movie
Data]
2. Submit .ipynb file to Lab7
Reference Videos:
https://www.youtube.com/watch?v=bDhvCp3_lYw&list=PL
UaB-1hjhk8FE_XZ87vPPSfHqb6OcM0cF&index=63
41

4. Pandas library
Merge, join, concatenate:
frames = [df1, df2, df3]
result = pd.concat(frames)
42

4. Pandas library
Merge, join, concatenate:
pd.concat([df1, df4], axis=1)
pd.concat([df1, df4], axis=1, join="inner")
43

4. Pandas library
Merge, join, concatenate:
44

4. Pandas library
Merge, join, concatenate:
result = pd.merge(left, right, on="key")
45

4. Pandas library
Merge, join, concatenate:
result = pd.merge(left, right, how="inner", on=["key1",
"key2"])
46

4. Pandas library
Merge, join, concatenate:
result = left.join(right, on=["key1", "key2"],
how="inner")
47

4. Pandas library

Home work:
1. Write a program to investigate a Dataset - [TMDb Movie
Data]
2. Submit .ipynb file to Lab8
Reference Videos:
https://www.youtube.com/watch?v=bDhvCp3_lYw&list=PL
UaB-1hjhk8FE_XZ87vPPSfHqb6OcM0cF&index=63
Thanks!
Any questions?

You might also like