0% found this document useful (0 votes)
14 views130 pages

DAA Prac

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views130 pages

DAA Prac

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 130

PMRP assignment 1

April 1, 2025

0.1 Question 1
Separate the given list based on the data types. List1 = [“Aakash”, 90, 77, “B”, 3.142,12]

[51]: List1 = ["Aakash", 90, 77, "B", 3.142,12]


string = []
inte = []
flo = []
for i in List1:
if type(i) == str:
string.append(i)
elif type(i) == int:
inte.append(i)
else:
flo.append(i)
print(f"strings are {string}\ninteger are {inte}\nfloat are {flo}")

strings are ['Aakash', 'B']


integer are [90, 77, 12]
float are [3.142]

0.2 Question 2
Consider you are collecting data from students on their heights (in cms) containing numbers as
140,145,153, etc. Use Numpy library and randomly generate 50 such numbers in the range 150 to
180. Which data type would you use list or array to store such data? Calculate measures of central
tendency of this data stored in list as well as array.

[4]: import numpy as np


arr = np.random.randint(150, 180, 50)
print(arr)
mean = np.mean(arr)
median = np.median(arr)
print(f"The mean is{mean}\nThe median is {median}")

[166 157 163 171 164 154 160 177 159 155 167 159 173 171 174 155 173 172
153 163 174 166 158 169 153 166 159 161 153 175 179 179 151 167 155 179
153 163 176 173 178 177 174 166 157 161 174 154 167 162]
The mean is165.3
The median is 166.0

1
0.3 Question 3
Part 1:-
Create the function that will plot simple line chart for any given data.

[33]: import matplotlib.pyplot as plt


data1 = np.random.randint(1, 50, 2)
data2 = np.random.randint(150, 200, 2)
plt.plot(data1, data2)

[33]: [<matplotlib.lines.Line2D at 0x215414697f0>]

0.4 Question 3
Part 2:-
Create the recursive function for finding out factorial of a given number

[12]: def fact(n):


if n == 1:
return 1
return n * fact(n - 1)

2
n = int(input())
print(fact(n))

10
3628800

0.5 Question 3
Part 3:-
Create generator function for Fibonacci series and print out first 10 numbers.

[14]: def fibonacci_generator(n):


x, y = 0, 1
for _ in range(n):
yield x
x,y=y,x+y
fib_gen = fibonacci_generator(10)
for num in fib_gen:
print(num)

0
1
1
2
3
5
8
13
21
34

0.6 Question 3
Part 4:-
Plot the graphs for trigonometric functions sin, cos, tan, cot, sec & cosec for the values pi to 2pi.

[17]: import math


x = np.linspace(math.pi, 2 * math.pi, 10000)
y = np.sin(x)
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Sin(x) Graph")
plt.plot(x, y)
plt.show()

3
[19]: import math
x = np.linspace(math.pi, 2 * math.pi, 10000)
y = np.cos(x)
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Cos(x) Graph")
plt.plot(x, y)
plt.show()

4
[21]: import math
x = np.linspace(math.pi, 2 * math.pi, 10000)
y = np.tan(x)
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Tan(x) Graph")
plt.ylim(-10, 10)
plt.plot(x, y)
plt.show()

5
[23]: import math
x = np.linspace(math.pi, 2 * math.pi, 10000)
y = 1 / np.cos(x)
y[np.abs(np.cos(x)) < 1e-5] = np.nan
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Sec(x) Graph")
plt.plot(x, y, linewidth = 1.1, color = 'blue', label = "Sec(x)")
plt.legend()
plt.ylim(-10, 10)
plt.show()

6
[35]: import math
x = np.linspace(math.pi, 2 * math.pi, 100)
y = 1 / np.sin(x)
y[np.abs(np.sin(x)) < 1e-5] = np.nan
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Cosec(x) Graph")
plt.plot(x, y, linewidth = 1.1, color = 'blue', label = "Cosec(x)")
plt.legend()
plt.ylim(-20, 20)
plt.show()

7
[27]: import math
x = np.linspace(math.pi, 2 * math.pi, 10000)
y = 1 / np.tan(x)
y[np.abs(np.tan(x)) < 1e-5] = np.nan
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Cot(x) Graph")
plt.plot(x, y, linewidth = 1.1, color = 'blue', label = "Cot(x)")
plt.legend()
plt.ylim(-10, 10)
plt.show()

8
0.7 Question 4
Consider you want create dataset with ages of people in your surroundings. Use input method to
ask user their age, store those ages in appropriate data type. Apply error handling that will not
accept more than 130 or less than 0 inputs, raise appropriate prompts to guide users.

[63]: def coll():


ages = []

while True:
usr = input("Enter Age of the person\nEnter q to exit")

if usr == 'q':
break
try:
usr = int(usr)
if(usr < 0 or usr > 130):
print("Invalid input")
else:
ages.append(usr)

9
except ValueError:
print("Invallid input")
return ages
use = coll()
print(use)

Enter Age of the person


Enter q to exit 5
Enter Age of the person
Enter q to exit q
[5]

0.8 Question 5
Create class as Employees with inputs as name, department and salary. Salary should be encap-
sulated.

[64]: class Employees:


def __init__(self, name, department, salary):
self.name = name
self.department = department
self.__salary = salary
def setsalary(self, slary):
self.__salary = slary
def getsalary(self):
return self.__salary
def print(self):
print(f"The Employee name is {self.name}\nThe department is {self.
↪department}\nThe salary is {self.__salary}")

e1 = Employees("Yash", "AIML", 100000000)


e1.print()

The Employee name is Yash


The department is AIML
The salary is 100000000

0.9 Question 6
Create two 3d arrays as matrices. Perform matrix operations (Addition, Multiplication, dot prod-
uct, inverse, determinant) on those matrices. Explain identity matrix, multiply each matrix with
identity matrix and record the observation. (All operations should be done with Numpy library)

[65]: import numpy as np

arr1 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])


arr2 = np.array([[[9, 10], [11, 12]], [[13, 14], [15, 16]]])

print("Matrix Addition:\n", arr1 + arr2)

10
print("\nElement-wise Multiplication:\n", arr1 * arr2)
print("\nDot Product:\n", np.matmul(arr1, arr2))
print("\nInverse of arr1:\n", np.linalg.inv(arr1))
print("\nDeterminants of arr1:\n", np.linalg.det(arr1))
print("\nIdentity Matrix:\n", np.eye(2))
print("\narr1 multiplied with Identity Matrix:\n", np.array([np.dot(np.eye(2),␣
↪mat) for mat in arr1]))

Matrix Addition:
[[[10 12]
[14 16]]

[[18 20]
[22 24]]]

Element-wise Multiplication:
[[[ 9 20]
[ 33 48]]

[[ 65 84]
[105 128]]]

Dot Product:
[[[ 31 34]
[ 71 78]]

[[155 166]
[211 226]]]

Inverse of arr1:
[[[-2. 1. ]
[ 1.5 -0.5]]

[[-4. 3. ]
[ 3.5 -2.5]]]

Determinants of arr1:
[-2. -2.]

Identity Matrix:
[[1. 0.]
[0. 1.]]

arr1 multiplied with Identity Matrix:


[[[1. 2.]
[3. 4.]]

[[5. 6.]

11
[7. 8.]]]

12
Assignment_2_D24AIML081

April 1, 2025

1 Lab-2
[4]: import pandas as pd

drug_df = pd.read_csv("/content/drug200 - drug200.csv")


drug_df

[4]: Age Sex BP Cholesterol Na_to_K Drug


0 23 F HIGH HIGH 25.355 drugY
1 47 M LOW HIGH 13.093 drugC
2 47 M LOW HIGH 10.114 drugC
3 28 F NORMAL HIGH 7.798 drugX
4 61 F LOW HIGH 18.043 drugY
.. … .. … … … …
195 56 F LOW HIGH 11.567 drugC
196 16 M LOW HIGH 12.006 drugC
197 52 M NORMAL HIGH 9.894 drugX
198 23 M NORMAL NORMAL 14.020 drugX
199 40 F LOW NORMAL 11.349 drugX

[200 rows x 6 columns]

1. Plot Distribution curve for Age along with histogram.

[5]: import numpy as np


import seaborn as sns
import matplotlib.pyplot as plt

plt.hist(drug_df["Age"], bins=30, color='skyblue', edgecolor='black')


plt.title("Age histogram")
plt.show()

1
[6]: sns.displot(drug_df['Age'], color='darkgreen')
plt.title('Distribution curve for age')
plt.show()

2
2. Calculate Q1,Q2,Q3 and IQR without using np.percentile function. Calculate lower
and upper bound values.

[7]: v=drug_df['Age']
#u=drug_df['Na_to_K']

q1=(len(v)+1)/4
q2=(len(v)+1)/2
q3=((len(v)+1)*3)/4

iqr= q3 - q1
lower = q1-1.5*iqr
upper = q1+1.5*iqr

print("Q1 = ",q1)
print("Q2 = ",q2)
print("Q3 = ",q3)

3
print("IQR = ",iqr)
print("Lower bound = ",lower)
print("Upper bound = ",upper)

plt.boxplot(v, vert = False)


plt.show()

Q1 = 50.25
Q2 = 100.5
Q3 = 150.75
IQR = 100.5
Lower bound = -100.5
Upper bound = 201.0

3. Calculate frequency table as well for age column. Ranges for this can be in multiple
of 10, e.g. 10-20,20-30,etc..

[8]: x = drug_df['Age']

for i in range(10,80,10):
count = 0
for age in x:
if age >= i and age < i+10:

4
count+=1
print(f'{i} - {i+10} : {count}')

10 - 20 : 12
20 - 30 : 35
30 - 40 : 37
40 - 50 : 38
50 - 60 : 33
60 - 70 : 32
70 - 80 : 13
1. What is a Gender distribution of data?
2. What percent of total population have high cholesterol & high BP?
3. What are the unique values of Drugs given in data? (df[“Drug”].unique)
4. How many people have high cholesterol before age of 30?

[9]: gen_dis = drug_df['Sex'].value_counts()


gen_dis

[9]: Sex
M 104
F 96
Name: count, dtype: int64

[10]: high_col= drug_df[drug_df['Cholesterol']=="HIGH"]


high_bp = drug_df[drug_df['BP']=="HIGH"]
print(high_col)
print(high_bp)

Age Sex BP Cholesterol Na_to_K Drug


0 23 F HIGH HIGH 25.355 drugY
1 47 M LOW HIGH 13.093 drugC
2 47 M LOW HIGH 10.114 drugC
3 28 F NORMAL HIGH 7.798 drugX
4 61 F LOW HIGH 18.043 drugY
.. … .. … … … …
193 72 M LOW HIGH 6.769 drugC
194 46 F HIGH HIGH 34.686 drugY
195 56 F LOW HIGH 11.567 drugC
196 16 M LOW HIGH 12.006 drugC
197 52 M NORMAL HIGH 9.894 drugX

[103 rows x 6 columns]


Age Sex BP Cholesterol Na_to_K Drug
0 23 F HIGH HIGH 25.355 drugY
11 34 F HIGH NORMAL 19.199 drugY
15 16 F HIGH NORMAL 15.516 drugY
17 43 M HIGH HIGH 13.972 drugA
19 32 F HIGH NORMAL 25.974 drugY

5
.. … .. … … … …
188 65 M HIGH NORMAL 34.997 drugY
189 64 M HIGH NORMAL 20.932 drugY
190 58 M HIGH HIGH 18.991 drugY
191 23 M HIGH HIGH 8.011 drugA
194 46 F HIGH HIGH 34.686 drugY

[77 rows x 6 columns]

[11]: drug_df['Drug'].unique()
drug_df['Drug'].value_counts()

[11]: Drug
drugY 91
drugX 54
drugA 23
drugC 16
drugB 16
Name: count, dtype: int64

[12]: count = 0
for i in range(len(drug_df['Age'])):
if drug_df['Cholesterol'][i] == 'HIGH' and drug_df['Age'][i] <30 :
count +=1
print(count)

26

2 Assignment-2
[13]: df = pd.read_csv('/content/user_behavior_dataset.csv')
df

[13]: User ID Device Model Operating System App Usage Time (min/day) \
0 1 Google Pixel 5 Android 393
1 2 OnePlus 9 Android 268
2 3 Xiaomi Mi 11 Android 154
3 4 Google Pixel 5 Android 239
4 5 iPhone 12 iOS 187
.. … … … …
695 696 iPhone 12 iOS 92
696 697 Xiaomi Mi 11 Android 316
697 698 Google Pixel 5 Android 99
698 699 Samsung Galaxy S21 Android 62
699 700 OnePlus 9 Android 212

Screen On Time (hours/day) Battery Drain (mAh/day) \

6
0 6.4 1872
1 4.7 1331
2 4.0 761
3 4.8 1676
4 4.3 1367
.. … …
695 3.9 1082
696 6.8 1965
697 3.1 942
698 1.7 431
699 5.4 1306

Number of Apps Installed Data Usage (MB/day) Age Gender \


0 67 1122 40 Male
1 42 944 47 Female
2 32 322 42 Male
3 56 871 20 Male
4 58 988 31 Female
.. … … … …
695 26 381 22 Male
696 68 1201 59 Male
697 22 457 50 Female
698 13 224 44 Male
699 49 828 23 Female

User Behavior Class


0 4
1 3
2 2
3 3
4 3
.. …
695 2
696 4
697 2
698 1
699 3

[700 rows x 11 columns]

1. Find out the outliers in each numerical column

[20]: v=df['App Usage Time (min/day)']


q1=(len(v)+1)/4
q2=(len(v)+1)/2
q3=((len(v)+1)*3)/4

7
iqr= q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = v[(v < lower) | (v > upper)]

print("Q1 = ",q1)
print("Q2 = ",q2)
print("Q3 = ",q3)
print("IQR = ",iqr)
print("Lower bound = ",lower)
print("Upper bound = ",upper)
print("Outliers = ", outliers.tolist())

Q1 = 175.25
Q2 = 350.5
Q3 = 525.75
IQR = 350.5
Lower bound = -350.5
Upper bound = 1051.5
Outliers = []
2. Find out gender distribution in this data.

[23]: gender_distribution = df['Gender'].value_counts(normalize=True) * 100

print(gender_distribution)

Gender
Male 52.0
Female 48.0
Name: proportion, dtype: float64
3. What is average daily usage of data? Explore gender wise and device wise variation in average
usage of data.

[29]: average_usage = df['Data Usage (MB/day)'].mean()


print(f"Overall average daily data usage: {average_usage:.2f} MB")

gender_usage = df.groupby('Gender')['Data Usage (MB/day)'].mean()


print("\nAverage daily data usage by gender:")
print(gender_usage)

device_usage = df.groupby('Device Model')['Data Usage (MB/day)'].mean()


print("\nAverage daily data usage by device:")
print(device_usage)

gender_device_usage = df.groupby(['Gender', 'Device Model'])['Data Usage (MB/


↪day)'].mean()

8
print("\nAverage daily data usage by gender and device:")
print(gender_device_usage)

Overall average daily data usage: 929.74 MB

Average daily data usage by gender:


Gender
Female 914.321429
Male 943.978022
Name: Data Usage (MB/day), dtype: float64

Average daily data usage by device:


Device Model
Google Pixel 5 897.704225
OnePlus 9 911.120301
Samsung Galaxy S21 931.872180
Xiaomi Mi 11 940.164384
iPhone 12 965.506849
Name: Data Usage (MB/day), dtype: float64

Average daily data usage by gender and device:


Gender Device Model
Female Google Pixel 5 834.101449
OnePlus 9 862.377049
Samsung Galaxy S21 992.888889
Xiaomi Mi 11 917.858974
iPhone 12 970.878378
Male Google Pixel 5 957.821918
OnePlus 9 952.416667
Samsung Galaxy S21 890.164557
Xiaomi Mi 11 965.750000
iPhone 12 959.986111
Name: Data Usage (MB/day), dtype: float64
4. Which device have highest popularity based on Age and Gender?

[30]: popularity = df.groupby(['Age', 'Gender'])['Device Model'].agg(lambda x: x.


↪mode()[0])

print(popularity)

Age Gender
18 Male Samsung Galaxy S21
19 Female iPhone 12
Male OnePlus 9
20 Female Google Pixel 5
Male Google Pixel 5

57 Male Samsung Galaxy S21

9
58 Female iPhone 12
Male Samsung Galaxy S21
59 Female iPhone 12
Male Samsung Galaxy S21
Name: Device Model, Length: 83, dtype: object

10
D24AIML081_ASS_3_PRMP

April 1, 2025

[34]: #Assignment 3
import numpy as np
import pandas as pd

df = pd.read_csv("C:/Users/User/Downloads/matches.csv")
df

[34]: id season city date match_type player_of_match \


0 335982 2007/08 Bangalore 2008-04-18 League BB McCullum
1 335983 2007/08 Chandigarh 2008-04-19 League MEK Hussey
2 335984 2007/08 Delhi 2008-04-19 League MF Maharoof
3 335985 2007/08 Mumbai 2008-04-20 League MV Boucher
4 335986 2007/08 Kolkata 2008-04-20 League DJ Hussey
… … … … … … …
1090 1426307 2024 Hyderabad 2024-05-19 League Abhishek Sharma
1091 1426309 2024 Ahmedabad 2024-05-21 Qualifier 1 MA Starc
1092 1426310 2024 Ahmedabad 2024-05-22 Eliminator R Ashwin
1093 1426311 2024 Chennai 2024-05-24 Qualifier 2 Shahbaz Ahmed
1094 1426312 2024 Chennai 2024-05-26 Final MA Starc

venue \
0 M Chinnaswamy Stadium
1 Punjab Cricket Association Stadium, Mohali
2 Feroz Shah Kotla
3 Wankhede Stadium
4 Eden Gardens
… …
1090 Rajiv Gandhi International Stadium, Uppal, Hyd…
1091 Narendra Modi Stadium, Ahmedabad
1092 Narendra Modi Stadium, Ahmedabad
1093 MA Chidambaram Stadium, Chepauk, Chennai
1094 MA Chidambaram Stadium, Chepauk, Chennai

team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
1 Kings XI Punjab Chennai Super Kings
2 Delhi Daredevils Rajasthan Royals
3 Mumbai Indians Royal Challengers Bangalore

1
4 Kolkata Knight Riders Deccan Chargers
… … …
1090 Punjab Kings Sunrisers Hyderabad
1091 Sunrisers Hyderabad Kolkata Knight Riders
1092 Royal Challengers Bengaluru Rajasthan Royals
1093 Sunrisers Hyderabad Rajasthan Royals
1094 Sunrisers Hyderabad Kolkata Knight Riders

toss_winner toss_decision winner \


0 Royal Challengers Bangalore field Kolkata Knight Riders
1 Chennai Super Kings bat Chennai Super Kings
2 Rajasthan Royals bat Delhi Daredevils
3 Mumbai Indians bat Royal Challengers Bangalore
4 Deccan Chargers bat Kolkata Knight Riders
… … … …
1090 Punjab Kings bat Sunrisers Hyderabad
1091 Sunrisers Hyderabad bat Kolkata Knight Riders
1092 Rajasthan Royals field Rajasthan Royals
1093 Rajasthan Royals field Sunrisers Hyderabad
1094 Sunrisers Hyderabad bat Kolkata Knight Riders

result result_margin target_runs target_overs super_over method \


0 runs 140.0 223.0 20.0 N NaN
1 runs 33.0 241.0 20.0 N NaN
2 wickets 9.0 130.0 20.0 N NaN
3 wickets 5.0 166.0 20.0 N NaN
4 wickets 5.0 111.0 20.0 N NaN
… … … … … … …
1090 wickets 4.0 215.0 20.0 N NaN
1091 wickets 8.0 160.0 20.0 N NaN
1092 wickets 4.0 173.0 20.0 N NaN
1093 runs 36.0 176.0 20.0 N NaN
1094 wickets 8.0 114.0 20.0 N NaN

umpire1 umpire2
0 Asad Rauf RE Koertzen
1 MR Benson SL Shastri
2 Aleem Dar GA Pratapkumar
3 SJ Davis DJ Harper
4 BF Bowden K Hariharan
… … …
1090 Nitin Menon VK Sharma
1091 AK Chaudhary R Pandit
1092 KN Ananthapadmanabhan MV Saidharshan Kumar
1093 Nitin Menon VK Sharma
1094 J Madanagopal Nitin Menon

2
[1095 rows x 20 columns]

[36]: #1) Find out count of unique records in each column.


unique = df.nunique()
print(unique)

id 1095
season 17
city 36
date 823
match_type 8
player_of_match 291
venue 58
team1 19
team2 19
toss_winner 19
toss_decision 2
winner 19
result 4
result_margin 98
target_runs 170
target_overs 15
super_over 2
method 1
umpire1 62
umpire2 62
dtype: int64

[38]: #2) Find if any outliers in data.


Q1 = df.select_dtypes(include=['float64', 'int64']).quantile(0.25)
Q3 = df.select_dtypes(include=['float64', 'int64']).quantile(0.75)
IQR = Q3 - Q1

outliers = ((df.select_dtypes(include=['float64', 'int64']) < (Q1 - 1.5 * IQR))␣


↪|

(df.select_dtypes(include=['float64', 'int64']) > (Q3 + 1.5 * IQR)))

outlier_counts = outliers.sum()
print(outlier_counts)

id 0
result_margin 121
target_runs 30
target_overs 30
dtype: int64

[40]: #3) Plot heatmap of correlation matrix and covariance matrix for the given␣
↪dataset.

3
import seaborn as sns
import matplotlib.pyplot as plt

numeric_df = df.select_dtypes(include=['float64', 'int64'])

correlation_matrix = numeric_df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

covariance_matrix = numeric_df.cov()
plt.figure(figsize=(10, 8))
sns.heatmap(covariance_matrix, annot=True, fmt=".2f")
plt.title('Covariance Matrix')
plt.show()

4
[42]: #4) Remove unnecessary or empty columns as well as any rows if required from␣
↪the dataset.

df_cleaned = df.dropna(thresh=len(df) * 0.5, axis=1)

df_cleaned = df_cleaned.dropna()
df_cleaned

[42]: id season city date match_type player_of_match \


0 335982 2007/08 Bangalore 2008-04-18 League BB McCullum
1 335983 2007/08 Chandigarh 2008-04-19 League MEK Hussey
2 335984 2007/08 Delhi 2008-04-19 League MF Maharoof
3 335985 2007/08 Mumbai 2008-04-20 League MV Boucher
4 335986 2007/08 Kolkata 2008-04-20 League DJ Hussey
… … … … … … …
1090 1426307 2024 Hyderabad 2024-05-19 League Abhishek Sharma
1091 1426309 2024 Ahmedabad 2024-05-21 Qualifier 1 MA Starc

5
1092 1426310 2024 Ahmedabad 2024-05-22 Eliminator R Ashwin
1093 1426311 2024 Chennai 2024-05-24 Qualifier 2 Shahbaz Ahmed
1094 1426312 2024 Chennai 2024-05-26 Final MA Starc

venue \
0 M Chinnaswamy Stadium
1 Punjab Cricket Association Stadium, Mohali
2 Feroz Shah Kotla
3 Wankhede Stadium
4 Eden Gardens
… …
1090 Rajiv Gandhi International Stadium, Uppal, Hyd…
1091 Narendra Modi Stadium, Ahmedabad
1092 Narendra Modi Stadium, Ahmedabad
1093 MA Chidambaram Stadium, Chepauk, Chennai
1094 MA Chidambaram Stadium, Chepauk, Chennai

team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
1 Kings XI Punjab Chennai Super Kings
2 Delhi Daredevils Rajasthan Royals
3 Mumbai Indians Royal Challengers Bangalore
4 Kolkata Knight Riders Deccan Chargers
… … …
1090 Punjab Kings Sunrisers Hyderabad
1091 Sunrisers Hyderabad Kolkata Knight Riders
1092 Royal Challengers Bengaluru Rajasthan Royals
1093 Sunrisers Hyderabad Rajasthan Royals
1094 Sunrisers Hyderabad Kolkata Knight Riders

toss_winner toss_decision winner \


0 Royal Challengers Bangalore field Kolkata Knight Riders
1 Chennai Super Kings bat Chennai Super Kings
2 Rajasthan Royals bat Delhi Daredevils
3 Mumbai Indians bat Royal Challengers Bangalore
4 Deccan Chargers bat Kolkata Knight Riders
… … … …
1090 Punjab Kings bat Sunrisers Hyderabad
1091 Sunrisers Hyderabad bat Kolkata Knight Riders
1092 Rajasthan Royals field Rajasthan Royals
1093 Rajasthan Royals field Sunrisers Hyderabad
1094 Sunrisers Hyderabad bat Kolkata Knight Riders

result result_margin target_runs target_overs super_over \


0 runs 140.0 223.0 20.0 N
1 runs 33.0 241.0 20.0 N
2 wickets 9.0 130.0 20.0 N

6
3 wickets 5.0 166.0 20.0 N
4 wickets 5.0 111.0 20.0 N
… … … … … …
1090 wickets 4.0 215.0 20.0 N
1091 wickets 8.0 160.0 20.0 N
1092 wickets 4.0 173.0 20.0 N
1093 runs 36.0 176.0 20.0 N
1094 wickets 8.0 114.0 20.0 N

umpire1 umpire2
0 Asad Rauf RE Koertzen
1 MR Benson SL Shastri
2 Aleem Dar GA Pratapkumar
3 SJ Davis DJ Harper
4 BF Bowden K Hariharan
… … …
1090 Nitin Menon VK Sharma
1091 AK Chaudhary R Pandit
1092 KN Ananthapadmanabhan MV Saidharshan Kumar
1093 Nitin Menon VK Sharma
1094 J Madanagopal Nitin Menon

[1028 rows x 19 columns]

[44]: #5) Plot histograms for each column and remove any skewness using␣
↪transformations.

df_cleaned.hist(figsize=(15, 10), bins=30)


plt.show()

for column in df_cleaned.select_dtypes(include=['float64', 'int64']).columns:


if df_cleaned[column].skew() > 1:
df_cleaned[column] = np.log1p(df_cleaned[column])

7
[46]: #6) Plot Yearly records for numerical columns (e.g. runs, trophies)
df_cleaned['year'] = df_cleaned['season'].str.split('/').str[0].astype(int)

yearly_records = df_cleaned.groupby('year').sum()

yearly_records.plot(kind='bar', figsize=(12, 6))


plt.title('Yearly Records for Numerical Columns')
plt.xlabel('Year')
plt.ylabel('Total')
plt.show()

8
[ ]:

[ ]:

[ ]:

9
ASSIGNMENT_4_D24AIML081

April 1, 2025

[20]: import pandas as pd

url = 'C:/Users/User/Downloads/creditcard.csv/creditcard.csv'
data = pd.read_csv(url)

data_cleaned = data.drop(columns=[f'V{i}' for i in range(1, 9)])

threshold = 100

total_transactions = len(data_cleaned)

total_fraudulent = data_cleaned[data_cleaned['Class'] == 1].shape[0]

total_high_amount = data_cleaned[data_cleaned['Amount'] > threshold].shape[0]

total_high_amount_fraudulent = data_cleaned[(data_cleaned['Amount'] >␣


↪threshold) & (data_cleaned['Class'] == 1)].shape[0]

P_fraudulent = total_fraudulent / total_transactions


P_high_amount = total_high_amount / total_transactions
P_high_amount_given_fraudulent = total_high_amount_fraudulent /␣
↪total_fraudulent if total_fraudulent > 0 else 0

if P_high_amount > 0:
P_fraudulent_given_high_amount = (P_high_amount_given_fraudulent *␣
↪P_fraudulent) / P_high_amount

else:
P_fraudulent_given_high_amount = 0

print(f"P(Fraudulent | High Amount) = {P_fraudulent_given_high_amount:.4f}")

P(Fraudulent | High Amount) = 0.0023

[ ]:

1
D24AIMl081_PR5

April 1, 2025

D24AIML081 DAHIYA MANDEEPSINH PMRP ASSIGNMENT 5 + CLASSWORK


IPL DATA ANALYTICS
1. Calculate the total number of matches played in each season
2. Find the most successful team (team with the most wins)
3. Find the average margin of victory by wickets and by runs
4. Which player won the most ’Player of the Match awards?
5. Find the number of matches where the toss winner won the match
6. Calculate the total number of runs scored in all matches for each team
7. Determine the average number of wickets taken by the winning team in each match
8. How many matches were decided by a Super Over?
9. Find the distribution of match results (runs vs wickets)
10. Find the top 5 venues with the most matches played
11. Find the match with the highest margin of victory (by wickets or runs)
12. Calculate the win percentage for each team
13. Find the average number of overs played in all matches
14. Find the most common match outcome (runs, wickets, or no result)
15. Find the total number of matches played at each venue by year
16. Analyze the win margin distribution by year
17. Calculate the total number of ‘no result’ matches and their impact on the tournament
18. How many matches were won by teams batting first vs. batting second?
19. Find out the average number of runs scored by the winning team
20. Identify the most successful captain (team with the most wins under a captain)

[77]: import pandas as pd


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

1
df=pd.read_csv('D:/SEM4/PMRP/RAW_CODE/PMRP_DAY_13/matches.csv')
df.head,df.tail,df.describe,df.info

[77]: (<bound method NDFrame.head of id season city date


match_type player_of_match \
0 335982 2007/08 Bangalore 2008-04-18 League BB McCullum
1 335983 2007/08 Chandigarh 2008-04-19 League MEK Hussey
2 335984 2007/08 Delhi 2008-04-19 League MF Maharoof
3 335985 2007/08 Mumbai 2008-04-20 League MV Boucher
4 335986 2007/08 Kolkata 2008-04-20 League DJ Hussey
… … … … … … …
1090 1426307 2024 Hyderabad 2024-05-19 League Abhishek Sharma
1091 1426309 2024 Ahmedabad 2024-05-21 Qualifier 1 MA Starc
1092 1426310 2024 Ahmedabad 2024-05-22 Eliminator R Ashwin
1093 1426311 2024 Chennai 2024-05-24 Qualifier 2 Shahbaz Ahmed
1094 1426312 2024 Chennai 2024-05-26 Final MA Starc

venue \
0 M Chinnaswamy Stadium
1 Punjab Cricket Association Stadium, Mohali
2 Feroz Shah Kotla
3 Wankhede Stadium
4 Eden Gardens
… …
1090 Rajiv Gandhi International Stadium, Uppal, Hyd…
1091 Narendra Modi Stadium, Ahmedabad
1092 Narendra Modi Stadium, Ahmedabad
1093 MA Chidambaram Stadium, Chepauk, Chennai
1094 MA Chidambaram Stadium, Chepauk, Chennai

team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
1 Kings XI Punjab Chennai Super Kings
2 Delhi Daredevils Rajasthan Royals
3 Mumbai Indians Royal Challengers Bangalore
4 Kolkata Knight Riders Deccan Chargers
… … …
1090 Punjab Kings Sunrisers Hyderabad
1091 Sunrisers Hyderabad Kolkata Knight Riders
1092 Royal Challengers Bengaluru Rajasthan Royals
1093 Sunrisers Hyderabad Rajasthan Royals
1094 Sunrisers Hyderabad Kolkata Knight Riders

toss_winner toss_decision winner \


0 Royal Challengers Bangalore field Kolkata Knight Riders
1 Chennai Super Kings bat Chennai Super Kings

2
2 Rajasthan Royals bat Delhi Daredevils
3 Mumbai Indians bat Royal Challengers Bangalore
4 Deccan Chargers bat Kolkata Knight Riders
… … … …
1090 Punjab Kings bat Sunrisers Hyderabad
1091 Sunrisers Hyderabad bat Kolkata Knight Riders
1092 Rajasthan Royals field Rajasthan Royals
1093 Rajasthan Royals field Sunrisers Hyderabad
1094 Sunrisers Hyderabad bat Kolkata Knight Riders

result result_margin target_runs target_overs super_over method \


0 runs 140.0 223.0 20.0 N NaN
1 runs 33.0 241.0 20.0 N NaN
2 wickets 9.0 130.0 20.0 N NaN
3 wickets 5.0 166.0 20.0 N NaN
4 wickets 5.0 111.0 20.0 N NaN
… … … … … … …
1090 wickets 4.0 215.0 20.0 N NaN
1091 wickets 8.0 160.0 20.0 N NaN
1092 wickets 4.0 173.0 20.0 N NaN
1093 runs 36.0 176.0 20.0 N NaN
1094 wickets 8.0 114.0 20.0 N NaN

umpire1 umpire2
0 Asad Rauf RE Koertzen
1 MR Benson SL Shastri
2 Aleem Dar GA Pratapkumar
3 SJ Davis DJ Harper
4 BF Bowden K Hariharan
… … …
1090 Nitin Menon VK Sharma
1091 AK Chaudhary R Pandit
1092 KN Ananthapadmanabhan MV Saidharshan Kumar
1093 Nitin Menon VK Sharma
1094 J Madanagopal Nitin Menon

[1095 rows x 20 columns]>,


<bound method NDFrame.tail of id season city date
match_type player_of_match \
0 335982 2007/08 Bangalore 2008-04-18 League BB McCullum
1 335983 2007/08 Chandigarh 2008-04-19 League MEK Hussey
2 335984 2007/08 Delhi 2008-04-19 League MF Maharoof
3 335985 2007/08 Mumbai 2008-04-20 League MV Boucher
4 335986 2007/08 Kolkata 2008-04-20 League DJ Hussey
… … … … … … …
1090 1426307 2024 Hyderabad 2024-05-19 League Abhishek Sharma
1091 1426309 2024 Ahmedabad 2024-05-21 Qualifier 1 MA Starc

3
1092 1426310 2024 Ahmedabad 2024-05-22 Eliminator R Ashwin
1093 1426311 2024 Chennai 2024-05-24 Qualifier 2 Shahbaz Ahmed
1094 1426312 2024 Chennai 2024-05-26 Final MA Starc

venue \
0 M Chinnaswamy Stadium
1 Punjab Cricket Association Stadium, Mohali
2 Feroz Shah Kotla
3 Wankhede Stadium
4 Eden Gardens
… …
1090 Rajiv Gandhi International Stadium, Uppal, Hyd…
1091 Narendra Modi Stadium, Ahmedabad
1092 Narendra Modi Stadium, Ahmedabad
1093 MA Chidambaram Stadium, Chepauk, Chennai
1094 MA Chidambaram Stadium, Chepauk, Chennai

team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
1 Kings XI Punjab Chennai Super Kings
2 Delhi Daredevils Rajasthan Royals
3 Mumbai Indians Royal Challengers Bangalore
4 Kolkata Knight Riders Deccan Chargers
… … …
1090 Punjab Kings Sunrisers Hyderabad
1091 Sunrisers Hyderabad Kolkata Knight Riders
1092 Royal Challengers Bengaluru Rajasthan Royals
1093 Sunrisers Hyderabad Rajasthan Royals
1094 Sunrisers Hyderabad Kolkata Knight Riders

toss_winner toss_decision winner \


0 Royal Challengers Bangalore field Kolkata Knight Riders
1 Chennai Super Kings bat Chennai Super Kings
2 Rajasthan Royals bat Delhi Daredevils
3 Mumbai Indians bat Royal Challengers Bangalore
4 Deccan Chargers bat Kolkata Knight Riders
… … … …
1090 Punjab Kings bat Sunrisers Hyderabad
1091 Sunrisers Hyderabad bat Kolkata Knight Riders
1092 Rajasthan Royals field Rajasthan Royals
1093 Rajasthan Royals field Sunrisers Hyderabad
1094 Sunrisers Hyderabad bat Kolkata Knight Riders

result result_margin target_runs target_overs super_over method \


0 runs 140.0 223.0 20.0 N NaN
1 runs 33.0 241.0 20.0 N NaN
2 wickets 9.0 130.0 20.0 N NaN

4
3 wickets 5.0 166.0 20.0 N NaN
4 wickets 5.0 111.0 20.0 N NaN
… … … … … … …
1090 wickets 4.0 215.0 20.0 N NaN
1091 wickets 8.0 160.0 20.0 N NaN
1092 wickets 4.0 173.0 20.0 N NaN
1093 runs 36.0 176.0 20.0 N NaN
1094 wickets 8.0 114.0 20.0 N NaN

umpire1 umpire2
0 Asad Rauf RE Koertzen
1 MR Benson SL Shastri
2 Aleem Dar GA Pratapkumar
3 SJ Davis DJ Harper
4 BF Bowden K Hariharan
… … …
1090 Nitin Menon VK Sharma
1091 AK Chaudhary R Pandit
1092 KN Ananthapadmanabhan MV Saidharshan Kumar
1093 Nitin Menon VK Sharma
1094 J Madanagopal Nitin Menon

[1095 rows x 20 columns]>,


<bound method NDFrame.describe of id season city
date match_type player_of_match \
0 335982 2007/08 Bangalore 2008-04-18 League BB McCullum
1 335983 2007/08 Chandigarh 2008-04-19 League MEK Hussey
2 335984 2007/08 Delhi 2008-04-19 League MF Maharoof
3 335985 2007/08 Mumbai 2008-04-20 League MV Boucher
4 335986 2007/08 Kolkata 2008-04-20 League DJ Hussey
… … … … … … …
1090 1426307 2024 Hyderabad 2024-05-19 League Abhishek Sharma
1091 1426309 2024 Ahmedabad 2024-05-21 Qualifier 1 MA Starc
1092 1426310 2024 Ahmedabad 2024-05-22 Eliminator R Ashwin
1093 1426311 2024 Chennai 2024-05-24 Qualifier 2 Shahbaz Ahmed
1094 1426312 2024 Chennai 2024-05-26 Final MA Starc

venue \
0 M Chinnaswamy Stadium
1 Punjab Cricket Association Stadium, Mohali
2 Feroz Shah Kotla
3 Wankhede Stadium
4 Eden Gardens
… …
1090 Rajiv Gandhi International Stadium, Uppal, Hyd…
1091 Narendra Modi Stadium, Ahmedabad
1092 Narendra Modi Stadium, Ahmedabad

5
1093 MA Chidambaram Stadium, Chepauk, Chennai
1094 MA Chidambaram Stadium, Chepauk, Chennai

team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
1 Kings XI Punjab Chennai Super Kings
2 Delhi Daredevils Rajasthan Royals
3 Mumbai Indians Royal Challengers Bangalore
4 Kolkata Knight Riders Deccan Chargers
… … …
1090 Punjab Kings Sunrisers Hyderabad
1091 Sunrisers Hyderabad Kolkata Knight Riders
1092 Royal Challengers Bengaluru Rajasthan Royals
1093 Sunrisers Hyderabad Rajasthan Royals
1094 Sunrisers Hyderabad Kolkata Knight Riders

toss_winner toss_decision winner \


0 Royal Challengers Bangalore field Kolkata Knight Riders
1 Chennai Super Kings bat Chennai Super Kings
2 Rajasthan Royals bat Delhi Daredevils
3 Mumbai Indians bat Royal Challengers Bangalore
4 Deccan Chargers bat Kolkata Knight Riders
… … … …
1090 Punjab Kings bat Sunrisers Hyderabad
1091 Sunrisers Hyderabad bat Kolkata Knight Riders
1092 Rajasthan Royals field Rajasthan Royals
1093 Rajasthan Royals field Sunrisers Hyderabad
1094 Sunrisers Hyderabad bat Kolkata Knight Riders

result result_margin target_runs target_overs super_over method \


0 runs 140.0 223.0 20.0 N NaN
1 runs 33.0 241.0 20.0 N NaN
2 wickets 9.0 130.0 20.0 N NaN
3 wickets 5.0 166.0 20.0 N NaN
4 wickets 5.0 111.0 20.0 N NaN
… … … … … … …
1090 wickets 4.0 215.0 20.0 N NaN
1091 wickets 8.0 160.0 20.0 N NaN
1092 wickets 4.0 173.0 20.0 N NaN
1093 runs 36.0 176.0 20.0 N NaN
1094 wickets 8.0 114.0 20.0 N NaN

umpire1 umpire2
0 Asad Rauf RE Koertzen
1 MR Benson SL Shastri
2 Aleem Dar GA Pratapkumar
3 SJ Davis DJ Harper

6
4 BF Bowden K Hariharan
… … …
1090 Nitin Menon VK Sharma
1091 AK Chaudhary R Pandit
1092 KN Ananthapadmanabhan MV Saidharshan Kumar
1093 Nitin Menon VK Sharma
1094 J Madanagopal Nitin Menon

[1095 rows x 20 columns]>,


<bound method DataFrame.info of id season city date
match_type player_of_match \
0 335982 2007/08 Bangalore 2008-04-18 League BB McCullum
1 335983 2007/08 Chandigarh 2008-04-19 League MEK Hussey
2 335984 2007/08 Delhi 2008-04-19 League MF Maharoof
3 335985 2007/08 Mumbai 2008-04-20 League MV Boucher
4 335986 2007/08 Kolkata 2008-04-20 League DJ Hussey
… … … … … … …
1090 1426307 2024 Hyderabad 2024-05-19 League Abhishek Sharma
1091 1426309 2024 Ahmedabad 2024-05-21 Qualifier 1 MA Starc
1092 1426310 2024 Ahmedabad 2024-05-22 Eliminator R Ashwin
1093 1426311 2024 Chennai 2024-05-24 Qualifier 2 Shahbaz Ahmed
1094 1426312 2024 Chennai 2024-05-26 Final MA Starc

venue \
0 M Chinnaswamy Stadium
1 Punjab Cricket Association Stadium, Mohali
2 Feroz Shah Kotla
3 Wankhede Stadium
4 Eden Gardens
… …
1090 Rajiv Gandhi International Stadium, Uppal, Hyd…
1091 Narendra Modi Stadium, Ahmedabad
1092 Narendra Modi Stadium, Ahmedabad
1093 MA Chidambaram Stadium, Chepauk, Chennai
1094 MA Chidambaram Stadium, Chepauk, Chennai

team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
1 Kings XI Punjab Chennai Super Kings
2 Delhi Daredevils Rajasthan Royals
3 Mumbai Indians Royal Challengers Bangalore
4 Kolkata Knight Riders Deccan Chargers
… … …
1090 Punjab Kings Sunrisers Hyderabad
1091 Sunrisers Hyderabad Kolkata Knight Riders
1092 Royal Challengers Bengaluru Rajasthan Royals
1093 Sunrisers Hyderabad Rajasthan Royals

7
1094 Sunrisers Hyderabad Kolkata Knight Riders

toss_winner toss_decision winner \


0 Royal Challengers Bangalore field Kolkata Knight Riders
1 Chennai Super Kings bat Chennai Super Kings
2 Rajasthan Royals bat Delhi Daredevils
3 Mumbai Indians bat Royal Challengers Bangalore
4 Deccan Chargers bat Kolkata Knight Riders
… … … …
1090 Punjab Kings bat Sunrisers Hyderabad
1091 Sunrisers Hyderabad bat Kolkata Knight Riders
1092 Rajasthan Royals field Rajasthan Royals
1093 Rajasthan Royals field Sunrisers Hyderabad
1094 Sunrisers Hyderabad bat Kolkata Knight Riders

result result_margin target_runs target_overs super_over method \


0 runs 140.0 223.0 20.0 N NaN
1 runs 33.0 241.0 20.0 N NaN
2 wickets 9.0 130.0 20.0 N NaN
3 wickets 5.0 166.0 20.0 N NaN
4 wickets 5.0 111.0 20.0 N NaN
… … … … … … …
1090 wickets 4.0 215.0 20.0 N NaN
1091 wickets 8.0 160.0 20.0 N NaN
1092 wickets 4.0 173.0 20.0 N NaN
1093 runs 36.0 176.0 20.0 N NaN
1094 wickets 8.0 114.0 20.0 N NaN

umpire1 umpire2
0 Asad Rauf RE Koertzen
1 MR Benson SL Shastri
2 Aleem Dar GA Pratapkumar
3 SJ Davis DJ Harper
4 BF Bowden K Hariharan
… … …
1090 Nitin Menon VK Sharma
1091 AK Chaudhary R Pandit
1092 KN Ananthapadmanabhan MV Saidharshan Kumar
1093 Nitin Menon VK Sharma
1094 J Madanagopal Nitin Menon

[1095 rows x 20 columns]>)

1.Calculate the total number of Matches Played in Each Session

[78]: matches_per_season = df['season'].value_counts().sort_index()


print(matches_per_season)

8
season
2007/08 58
2009 57
2009/10 60
2011 73
2012 74
2013 76
2014 60
2015 59
2016 60
2017 59
2018 60
2019 60
2020/21 60
2021 60
2022 74
2023 74
2024 71
Name: count, dtype: int64
2. Find the Most Successfull team (team with most runs)

[79]: # runs_df = df[df['result'] == 'runs']

# most_successful_team = runs_df.groupby('winner')['result_margin'].sum().
↪idxmax()

# print(f"The most successful team (team with most runs) is:␣


↪{most_successful_team}")

df["winner"].value_counts()

[79]: winner
Mumbai Indians 144
Chennai Super Kings 138
Kolkata Knight Riders 131
Royal Challengers Bangalore 116
Rajasthan Royals 112
Kings XI Punjab 88
Sunrisers Hyderabad 88
Delhi Daredevils 67
Delhi Capitals 48
Deccan Chargers 29
Gujarat Titans 28
Lucknow Super Giants 24
Punjab Kings 24
Gujarat Lions 13
Pune Warriors 12

9
Rising Pune Supergiant 10
Royal Challengers Bengaluru 7
Kochi Tuskers Kerala 6
Rising Pune Supergiants 5
Name: count, dtype: int64

3. Find the average margin of victory by wickets and runs

[80]: average_runs_margin = df[df['result'] == 'runs']['result_margin'].mean()


average_wickets_margin = df[df['result'] == 'wickets']['result_margin'].mean()
print(f'Average margin of victory by runs: {average_runs_margin}')
print(f'Average margin of victory by wickets: {average_wickets_margin}')

Average margin of victory by runs: 30.104417670682732


Average margin of victory by wickets: 6.192041522491349
4. Which player won the most ’Player of the Match awards?

[81]: most_player_of_match = df['player_of_match'].value_counts().idxmax()


print(f"The player who won the most 'Player of the Match' awards is:␣
↪{most_player_of_match}")

The player who won the most 'Player of the Match' awards is: AB de Villiers
5. Find the number of matches where the toss winner won the match

[82]: toss_winner_matches = df[df['toss_winner'] == df['winner']].shape[0]


print(f"The number of matches where the toss winner won the match:␣
↪{toss_winner_matches}")

The number of matches where the toss winner won the match: 554
6. Calculate the total number of runs scored in all matches for each team

[83]: total_runs_per_team = df.groupby('team1')['target_runs'].sum() + df.


↪groupby('team2')['target_runs'].sum()

print(total_runs_per_team)

team1
Chennai Super Kings 39503.0
Deccan Chargers 12047.0
Delhi Capitals 15930.0
Delhi Daredevils 25492.0
Gujarat Lions 5077.0
Gujarat Titans 7865.0
Kings XI Punjab 31391.0
Kochi Tuskers Kerala 2014.0
Kolkata Knight Riders 40557.0
Lucknow Super Giants 7835.0
Mumbai Indians 43728.0
Pune Warriors 6950.0

10
Punjab Kings 9787.0
Rajasthan Royals 36250.0
Rising Pune Supergiant 2571.0
Rising Pune Supergiants 1993.0
Royal Challengers Bangalore 39807.0
Royal Challengers Bengaluru 2986.0
Sunrisers Hyderabad 30071.0
Name: target_runs, dtype: float64
7. Determine the average number of wickets taken by the winning team in each match

[84]: average_wickets_taken = df[df['result'] == 'wickets']['result_margin'].mean()


print(f'The average number of wickets taken by the winning team in each match␣
↪is: {average_wickets_taken}')

The average number of wickets taken by the winning team in each match is:
6.192041522491349
8. How many matches were decided by a Super Over?

[85]: super_over_matches = df[df['super_over'] == 'Y'].shape[0]


print(f"The number of matches decided by a Super Over: {super_over_matches}")

The number of matches decided by a Super Over: 14


9. Find the distribution of match results (runs vs wickets)

[86]: result_distribution = df['result'].value_counts()


print(result_distribution)

# Plotting the distribution


result_distribution.plot(kind='bar', color=['blue', 'orange'])
plt.title('Distribution of Match Results (Runs vs Wickets)')
plt.xlabel('Result Type')
plt.ylabel('Number of Matches')
plt.show()

result
wickets 578
runs 498
tie 14
no result 5
Name: count, dtype: int64

11
10. Find the top 5 venues with the most matches played

[87]: top_venues = df['venue'].value_counts().head(5)


print(top_venues)

# # Plotting the top 5 venues


# top_venues.plot(kind='bar', color='green')
# plt.title('Top 5 Venues with the Most Matches Played')
# plt.xlabel('Venue')
# plt.ylabel('Number of Matches')
# plt.show()

venue
Eden Gardens 77
Wankhede Stadium 73
M Chinnaswamy Stadium 65
Feroz Shah Kotla 60
Rajiv Gandhi International Stadium, Uppal 49

12
Name: count, dtype: int64
11. Find the match with the highest margin of victory (by wickets or runs)

[88]: df[df['result_margin']==df['result_margin'].max()]

[88]: id season city date match_type player_of_match \


620 1082635 2017 Delhi 2017-05-06 League LMP Simmons

venue team1 team2 toss_winner \


620 Feroz Shah Kotla Delhi Daredevils Mumbai Indians Delhi Daredevils

toss_decision winner result result_margin target_runs \


620 field Mumbai Indians runs 146.0 213.0

target_overs super_over method umpire1 umpire2


620 20.0 N NaN Nitin Menon CK Nandan

[89]: # Find the match with the highest margin of victory (by wickets or runs)
df_wickets=df[df['result']=='wickets']
df_runs=df[df['result']=='runs']

max_margin_wicket=df_wickets.loc[df_wickets['result_margin'].idxmax()]

max_margin_run=df_runs.loc[df_runs['result_margin'].idxmax()]

max_margin_run,max_margin_wicket

[89]: (id 1082635


season 2017
city Delhi
date 2017-05-06
match_type League
player_of_match LMP Simmons
venue Feroz Shah Kotla
team1 Delhi Daredevils
team2 Mumbai Indians
toss_winner Delhi Daredevils
toss_decision field
winner Mumbai Indians
result runs
result_margin 146.0
target_runs 213.0
target_overs 20.0
super_over N
method NaN
umpire1 Nitin Menon
umpire2 CK Nandan

13
Name: 620, dtype: object,
id 335994
season 2007/08
city Mumbai
date 2008-04-27
match_type League
player_of_match AC Gilchrist
venue Dr DY Patil Sports Academy
team1 Mumbai Indians
team2 Deccan Chargers
toss_winner Deccan Chargers
toss_decision field
winner Deccan Chargers
result wickets
result_margin 10.0
target_runs 155.0
target_overs 20.0
super_over N
method NaN
umpire1 Asad Rauf
umpire2 SL Shastri
Name: 12, dtype: object)

12. Calculate the win percentage for each team

[90]: matches_played = df['team1'].value_counts() + df['team2'].value_counts()

matches_won = df['winner'].value_counts()
win_percentage = (matches_won / matches_played) * 100

print(win_percentage)

Chennai Super Kings 57.983193


Deccan Chargers 38.666667
Delhi Capitals 52.747253
Delhi Daredevils 41.614907
Gujarat Lions 43.333333
Gujarat Titans 62.222222
Kings XI Punjab 46.315789
Kochi Tuskers Kerala 42.857143
Kolkata Knight Riders 52.191235
Lucknow Super Giants 54.545455
Mumbai Indians 55.172414
Pune Warriors 26.086957
Punjab Kings 42.857143
Rajasthan Royals 50.678733
Rising Pune Supergiant 62.500000
Rising Pune Supergiants 35.714286

14
Royal Challengers Bangalore 48.333333
Royal Challengers Bengaluru 46.666667
Sunrisers Hyderabad 48.351648
Name: count, dtype: float64
13. Find the average number of overs played in all matches

[91]: average_overs_played = df['target_overs'].mean()

print(f'The average number of overs played in all matches is:␣


↪{average_overs_played}')

The average number of overs played in all matches is: 19.75934065934066


14. Find the most common match outcome (runs, wickets, or no result)

[92]: most_common_outcome = result_distribution.idxmax()


print(f'The most common match outcome is: {most_common_outcome}')

The most common match outcome is: wickets


15. Find the total number of matches played at each venue by year

[93]: matches_per_venue_year = df.groupby(['season','venue']).size()


print(matches_per_venue_year)

season venue
2007/08 Dr DY Patil Sports Academy 4
Eden Gardens 7
Feroz Shah Kotla 6
M Chinnaswamy Stadium 7
MA Chidambaram Stadium, Chepauk 7
..
2024 Maharaja Yadavindra Singh International Cricket Stadium, Mullanpur 5
Narendra Modi Stadium, Ahmedabad 8
Rajiv Gandhi International Stadium, Uppal, Hyderabad 6
Sawai Mansingh Stadium, Jaipur 5
Wankhede Stadium, Mumbai 7
Length: 175, dtype: int64
16. Analyze the win margin distribution by year

[94]: # Grouping the data by season and result type


win_margin_by_year = df.groupby(['season', 'result'])['result_margin'].
↪describe()

print(win_margin_by_year)
# Plotting the win margin distribution by year
plt.figure(figsize=(14, 8))
sns.boxplot(x='season', y='result_margin', hue='result', data=df)
plt.title('Win Margin Distribution by Year')
plt.xlabel('Season')

15
plt.ylabel('Win Margin')
plt.xticks(rotation=45)
plt.legend(title='Result Type')
plt.show()

count mean std min 25% 50% 75% max


season result
2007/08 runs 24.0 29.375000 34.291351 1.0 8.25 16.0 35.00 140.0
wickets 34.0 6.500000 2.078024 3.0 5.00 7.0 8.00 10.0
2009 runs 27.0 28.296296 28.894789 1.0 10.00 16.0 32.50 92.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 29.0 6.206897 1.820112 2.0 6.00 6.0 7.00 10.0
2009/10 runs 31.0 31.483871 20.990269 2.0 15.50 31.0 39.00 98.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 28.0 6.785714 1.571909 4.0 5.75 7.0 8.00 10.0
2011 no result 0.0 NaN NaN NaN NaN NaN NaN NaN
runs 33.0 33.272727 26.081929 2.0 17.00 25.0 43.00 111.0
wickets 39.0 6.794872 1.794428 3.0 6.00 7.0 8.00 10.0
2012 runs 34.0 28.235294 19.645431 1.0 14.25 26.0 37.75 86.0
wickets 40.0 6.025000 1.716996 2.0 5.00 5.5 7.00 10.0
2013 runs 37.0 33.540541 28.657551 2.0 14.00 24.0 48.00 130.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 37.0 6.135135 1.669367 3.0 5.00 6.0 7.00 10.0
2014 runs 22.0 29.272727 22.416367 2.0 15.25 24.5 33.50 93.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 37.0 6.081081 1.516179 3.0 5.00 6.0 7.00 9.0
2015 no result 0.0 NaN NaN NaN NaN NaN NaN NaN
runs 32.0 26.562500 28.598373 1.0 8.75 20.0 29.00 138.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 24.0 6.166667 2.219805 1.0 5.00 6.0 7.25 10.0
2016 runs 21.0 32.190476 36.347791 1.0 9.00 22.0 34.00 144.0
wickets 39.0 6.256410 1.772865 2.0 5.00 6.0 7.00 10.0
2017 runs 26.0 30.307692 33.638988 1.0 10.50 18.0 33.00 146.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 32.0 6.375000 1.896516 2.0 5.00 6.5 7.25 10.0
2018 runs 28.0 24.107143 23.850366 3.0 10.75 14.0 31.00 102.0
wickets 32.0 5.812500 2.206113 1.0 5.00 6.0 7.00 10.0
2019 no result 0.0 NaN NaN NaN NaN NaN NaN NaN
runs 22.0 30.227273 27.194068 1.0 12.50 25.0 39.75 118.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 35.0 5.771429 1.646488 2.0 5.00 6.0 7.00 9.0
2020/21 runs 27.0 39.370370 26.716673 2.0 15.50 37.0 58.00 97.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 29.0 6.965517 1.762360 4.0 5.00 7.0 8.00 10.0
2021 runs 22.0 26.454545 24.039110 1.0 6.00 19.0 41.00 86.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 37.0 5.918919 2.019053 2.0 4.00 6.0 7.00 10.0
2022 runs 37.0 27.945946 23.085525 2.0 12.00 18.0 44.00 91.0

16
wickets 37.0 6.000000 1.615893 3.0 5.00 6.0 7.00 9.0
2023 no result 0.0 NaN NaN NaN NaN NaN NaN NaN
runs 40.0 30.400000 27.554887 1.0 7.75 22.0 51.25 112.0
wickets 33.0 5.727273 1.908414 1.0 5.00 6.0 7.00 9.0
2024 runs 35.0 30.142857 25.994505 1.0 15.00 24.0 35.00 106.0
wickets 36.0 5.944444 1.999206 2.0 4.00 6.0 7.00 10.0

17. Calculate the total number of ‘no result’ matches and their impact on the tournament

[95]: # Calculate the total number of 'no result' matches


no_result_matches = df[df['result'] == 'no result'].shape[0]
print(f"The total number of 'no result' matches: {no_result_matches}")

# Analyze the distribution of 'no result' matches by season


no_result_by_season = df[df['result'] == 'no result']['season'].value_counts().
↪sort_index()

print("Distribution of 'no result' matches by season:")


print(no_result_by_season)

# Analyze the distribution of 'no result' matches by team


no_result_by_team = df[df['result'] == 'no result']['team1'].value_counts() +␣
↪df[df['result'] == 'no result']['team2'].value_counts()

print("Distribution of 'no result' matches by team:")


print(no_result_by_team)

17
The total number of 'no result' matches: 5
Distribution of 'no result' matches by season:
season
2011 1
2015 2
2019 1
2023 1
Name: count, dtype: int64
Distribution of 'no result' matches by team:
Chennai Super Kings NaN
Delhi Daredevils 2.0
Lucknow Super Giants NaN
Pune Warriors NaN
Rajasthan Royals NaN
Royal Challengers Bangalore NaN
Name: count, dtype: float64
18. How many matches were won by teams batting first vs. batting second?

[96]: # Matches won by teams batting first


batting_first_wins = df[(df['toss_decision'] == 'bat') & (df['toss_winner'] ==␣
↪df['winner'])].shape[0] + \

df[(df['toss_decision'] == 'field') & (df['toss_winner'] !


↪= df['winner'])].shape[0]

# Matches won by teams batting second


batting_second_wins = df[(df['toss_decision'] == 'field') & (df['toss_winner']␣
↪== df['winner'])].shape[0] + \

df[(df['toss_decision'] == 'bat') & (df['toss_winner'] !=␣


↪df['winner'])].shape[0]

print(f"Matches won by teams batting first: {batting_first_wins}")


print(f"Matches won by teams batting second: {batting_second_wins}")

Matches won by teams batting first: 504


Matches won by teams batting second: 591
19. Find out the average number of runs scored by the winning team

[97]: average_runs_scored_by_winning_team = df[df['result'] == 'runs']['target_runs'].


↪mean()

print(f'The average number of runs scored by the winning team is:␣


↪{average_runs_scored_by_winning_team}')

The average number of runs scored by the winning team is: 179.69678714859438
20. Identify the most unsuccessful team (team with lowest wins)

18
[98]: most_unsuccessful_team = matches_won.idxmin()
print(f"The most unsuccessful team (team with the lowest wins) is:␣
↪{most_unsuccessful_team}")

The most unsuccessful team (team with the lowest wins) is: Rising Pune
Supergiants
ASSIGNMENT QUESTIONS
Explore following for given dataset and also perform EDA. 1. Frequency Distribution of Wins by
Wickets 2. Relative Frequency Distribution 3. Cumulative Relative Frequency Graph 4. Probability
of Winning by 6 Wickets or Less 5. Normal Distribution of Wins by Wickets 6. Mean, Standard
Deviation, and Percentile Calculation 7. Find out outliers for the selective columns for lower range
outliers will be lower than mu - 2sigma, similarly for upper range outliers will be greater than
mu+2sigma.
1. Frequency Distribution of Wins by Wickets

[107]: # Frequency distribution of wins by wickets


wins_by_wickets = df_wickets['result_margin'].value_counts().sort_index()
print(wins_by_wickets)

# Plotting the frequency distribution


wins_by_wickets.plot(kind='bar', color='blue')
plt.title('Frequency Distribution of Wins by Wickets')
plt.xlabel('Number of Wickets')
plt.ylabel('Frequency')
plt.show()

result_margin
1.0 4
2.0 10
3.0 31
4.0 59
5.0 97
6.0 120
7.0 115
8.0 78
9.0 48
10.0 16
Name: count, dtype: int64

19
2. Relative Frequency Distribution

[109]: relative_frequency_wins_by_wickets = wins_by_wickets / wins_by_wickets.sum()


print(relative_frequency_wins_by_wickets)

# Plotting the relative frequency distribution


relative_frequency_wins_by_wickets.plot(kind='bar', color='orange')
plt.title('Relative Frequency Distribution of Wins by Wickets')
plt.xlabel('Number of Wickets')
plt.ylabel('Relative Frequency')
plt.show()

result_margin
1.0 0.006920
2.0 0.017301
3.0 0.053633
4.0 0.102076
5.0 0.167820
6.0 0.207612
7.0 0.198962

20
8.0 0.134948
9.0 0.083045
10.0 0.027682
Name: count, dtype: float64

3. Cumulative Relative Frequency Graph

[110]: # Calculate the cumulative relative frequency


cumulative_relative_frequency = relative_frequency_wins_by_wickets.cumsum()
print(cumulative_relative_frequency)

# Plotting the cumulative relative frequency graph


cumulative_relative_frequency.plot(kind='line', marker='o', color='purple')
plt.title('Cumulative Relative Frequency of Wins by Wickets')
plt.xlabel('Number of Wickets')
plt.ylabel('Cumulative Relative Frequency')
plt.grid(True)
plt.show()

result_margin

21
1.0 0.006920
2.0 0.024221
3.0 0.077855
4.0 0.179931
5.0 0.347751
6.0 0.555363
7.0 0.754325
8.0 0.889273
9.0 0.972318
10.0 1.000000
Name: count, dtype: float64

4. Probability of Winning by 6 Wickets or Less

[111]: # Calculate the total number of wins by wickets


total_wins_by_wickets = wins_by_wickets.sum()

# Calculate the number of wins by 6 wickets or less


wins_by_6_or_less = wins_by_wickets[wins_by_wickets.index <= 6].sum()

22
# Calculate the probability
probability_wins_by_6_or_less = wins_by_6_or_less / total_wins_by_wickets
print(f'The probability of winning by 6 wickets or less is:␣
↪{probability_wins_by_6_or_less}')

The probability of winning by 6 wickets or less is: 0.5553633217993079


5. Normal Distribution of Wins by Wickets

[112]: # Plotting the normal distribution of wins by wickets


plt.figure(figsize=(10, 6))
sns.histplot(df_wickets['result_margin'], kde=True, bins=10, color='blue')
plt.title('Normal Distribution of Wins by Wickets')
plt.xlabel('Number of Wickets')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

6. Mean, Standard Deviation, and Percentile Calculation

[116]: print(df.describe())

id result_margin target_runs target_overs


count 1.095000e+03 1076.000000 1092.000000 1092.000000
mean 9.048283e+05 17.259294 165.684066 19.759341
std 3.677402e+05 21.787444 33.427048 1.581108

23
min 3.359820e+05 1.000000 43.000000 5.000000
25% 5.483315e+05 6.000000 146.000000 20.000000
50% 9.809610e+05 8.000000 166.000000 20.000000
75% 1.254062e+06 20.000000 187.000000 20.000000
max 1.426312e+06 146.000000 288.000000 20.000000
7. Find out outliers for the selective columns for lower range outliers will be lower than mu -
2sigma, similarly for upper range outliers will be greater than mu+2sigma.

[118]: # Calculate the mean and standard deviation for the result_margin column
mu = df['result_margin'].mean()
sigma = df['result_margin'].std()

# Calculate the lower and upper bounds for outliers


lower_bound = mu - 2 * sigma
upper_bound = mu + 2 * sigma

# Find the outliers


outliers = df[(df['result_margin'] < lower_bound) | (df['result_margin'] >␣
↪upper_bound)]

print(outliers)

id season city date match_type player_of_match \


0 335982 2007/08 Bangalore 2008-04-18 League BB McCullum
9 335991 2007/08 Chandigarh 2008-04-25 League KC Sangakkara
39 336023 2007/08 Jaipur 2008-05-17 League GC Smith
55 336038 2007/08 Mumbai 2008-05-30 Semi Final SR Watson
59 392182 2009 Cape Town 2009-04-18 League R Dravid
… … … … … … …
1030 1422125 2024 Chennai 2024-03-26 League S Dube
1039 1422134 2024 Visakhapatnam 2024-04-03 League SP Narine
1058 1426273 2024 Delhi 2024-04-20 League TM Head
1069 1426284 2024 Chennai 2024-04-28 League RD Gaikwad
1077 1426292 2024 Lucknow 2024-05-05 League SP Narine

venue \
0 M Chinnaswamy Stadium
9 Punjab Cricket Association Stadium, Mohali
39 Sawai Mansingh Stadium
55 Wankhede Stadium
59 Newlands
… …
1030 MA Chidambaram Stadium, Chepauk, Chennai
1039 Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket St…
1058 Arun Jaitley Stadium, Delhi
1069 MA Chidambaram Stadium, Chepauk, Chennai
1077 Bharat Ratna Shri Atal Bihari Vajpayee Ekana C…

24
team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
9 Kings XI Punjab Mumbai Indians
39 Rajasthan Royals Royal Challengers Bangalore
55 Delhi Daredevils Rajasthan Royals
59 Royal Challengers Bangalore Rajasthan Royals
… … …
1030 Chennai Super Kings Gujarat Titans
1039 Kolkata Knight Riders Delhi Capitals
1058 Sunrisers Hyderabad Delhi Capitals
1069 Chennai Super Kings Sunrisers Hyderabad
1077 Kolkata Knight Riders Lucknow Super Giants

toss_winner toss_decision winner \


0 Royal Challengers Bangalore field Kolkata Knight Riders
9 Mumbai Indians field Kings XI Punjab
39 Royal Challengers Bangalore field Rajasthan Royals
55 Delhi Daredevils field Rajasthan Royals
59 Royal Challengers Bangalore bat Royal Challengers Bangalore
… … … …
1030 Gujarat Titans field Chennai Super Kings
1039 Kolkata Knight Riders bat Kolkata Knight Riders
1058 Delhi Capitals field Sunrisers Hyderabad
1069 Sunrisers Hyderabad field Chennai Super Kings
1077 Lucknow Super Giants field Kolkata Knight Riders

result result_margin target_runs target_overs super_over method \


0 runs 140.0 223.0 20.0 N NaN
9 runs 66.0 183.0 20.0 N NaN
39 runs 65.0 198.0 20.0 N NaN
55 runs 105.0 193.0 20.0 N NaN
59 runs 75.0 134.0 20.0 N NaN
… … … … … … …
1030 runs 63.0 207.0 20.0 N NaN
1039 runs 106.0 273.0 20.0 N NaN
1058 runs 67.0 267.0 20.0 N NaN
1069 runs 78.0 213.0 20.0 N NaN
1077 runs 98.0 236.0 20.0 N NaN

umpire1 umpire2
0 Asad Rauf RE Koertzen
9 Aleem Dar AM Saheba
39 BF Bowden SL Shastri
55 BF Bowden RE Koertzen
59 BR Doctrove RB Tiffin
… … …
1030 AG Wharf Tapan Sharma
1039 A Totre UV Gandhe

25
1058 J Madanagopal Navdeep Singh
1069 R Pandit MV Saidharshan Kumar
1077 MV Saidharshan Kumar YC Barde

[65 rows x 20 columns]

26
D24AIML081_A_6

April 1, 2025

####
D24AIML081 PMRP ASSIGNMENT 6 WITH CONCLUSION
CLASSWORK
QUESTIONS:- ->General Population and Gender Distribution
What is the total population in each county, and how does it vary by state? What is the gender
distribution (Men vs. Women) across different counties? What is the average population size
for census tracts in each state? How does the population of each race (White, Black, Hispanic,
etc.) differ across states? What is the proportion of the male population compared to the female
population in each census tract?
->Ethnicity and Race
What is the distribution of Hispanic population across various counties and states? How do different
racial groups (White, Black, Native, etc.) vary in terms of percentage of total population in different
counties? Which states have the highest percentage of Black or Hispanic populations?
->Employment and Work Type
What is the employment rate (Employed vs. Unemployed) for each census tract? How does the
rate of self-employed individuals compare to those working in private/public sectors across different
states? What percentage of the population works from home, and how does it vary by county and
state? How does the unemployment rate vary across different states and counties? What is the
distribution of employed individuals working in private vs. public sectors?
->Commuting and Transportation
What is the average commuting time across counties and states, and how does it differ for employed
individuals? What modes of transportation are most commonly used for commuting in different
states (e.g., car, public transportation, walking)? How does the percentage of people commuting
via walking or public transportation vary between urban and rural areas?
->Income and Housing
What is the average income (or median household income) in each state and county? How does
the distribution of housing type (e.g., owner-occupied vs. renter-occupied) vary across different
counties? How does the cost of living compare across different states based on average income and
housing costs?
-> Social Characteristics

1
What is the relationship between education levels (e.g., percentage with a high school diploma,
bachelor’s degree) and employment types across different states?

[48]: import matplotlib.pyplot as plt


import pandas as pd
import numpy as np

df=pd.read_csv("C:/Users/User/Downloads/acs2017_census_tract_data.csv")
df,df.head(),df.tail(),df.describe(),df.info(),df.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74001 entries, 0 to 74000
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TractId 74001 non-null int64
1 State 74001 non-null object
2 County 74001 non-null object
3 TotalPop 74001 non-null int64
4 Men 74001 non-null int64
5 Women 74001 non-null int64
6 Hispanic 73305 non-null float64
7 White 73305 non-null float64
8 Black 73305 non-null float64
9 Native 73305 non-null float64
10 Asian 73305 non-null float64
11 Pacific 73305 non-null float64
12 VotingAgeCitizen 74001 non-null int64
13 Income 72885 non-null float64
14 IncomeErr 72885 non-null float64
15 IncomePerCap 73256 non-null float64
16 IncomePerCapErr 73256 non-null float64
17 Poverty 73159 non-null float64
18 ChildPoverty 72891 non-null float64
19 Professional 73190 non-null float64
20 Service 73190 non-null float64
21 Office 73190 non-null float64
22 Construction 73190 non-null float64
23 Production 73190 non-null float64
24 Drive 73200 non-null float64
25 Carpool 73200 non-null float64
26 Transit 73200 non-null float64
27 Walk 73200 non-null float64
28 OtherTransp 73200 non-null float64
29 WorkAtHome 73200 non-null float64
30 MeanCommute 73055 non-null float64
31 Employed 74001 non-null int64
32 PrivateWork 73190 non-null float64

2
33 PublicWork 73190 non-null float64
34 SelfEmployed 73190 non-null float64
35 FamilyWork 73190 non-null float64
36 Unemployment 73191 non-null float64
dtypes: float64(29), int64(6), object(2)
memory usage: 20.9+ MB

[48]: ( TractId State County TotalPop Men Women \


0 1001020100 Alabama Autauga County 1845 899 946
1 1001020200 Alabama Autauga County 2172 1167 1005
2 1001020300 Alabama Autauga County 3385 1533 1852
3 1001020400 Alabama Autauga County 4267 2001 2266
4 1001020500 Alabama Autauga County 9965 5054 4911
… … … … … … …
73996 72153750501 Puerto Rico Yauco Municipio 6011 3035 2976
73997 72153750502 Puerto Rico Yauco Municipio 2342 959 1383
73998 72153750503 Puerto Rico Yauco Municipio 2218 1001 1217
73999 72153750601 Puerto Rico Yauco Municipio 4380 1964 2416
74000 72153750602 Puerto Rico Yauco Municipio 3001 1343 1658

Hispanic White Black Native … Walk OtherTransp WorkAtHome \


0 2.4 86.3 5.2 0.0 … 0.5 0.0 2.1
1 1.1 41.6 54.5 0.0 … 0.0 0.5 0.0
2 8.0 61.4 26.5 0.6 … 1.0 0.8 1.5
3 9.6 80.3 7.1 0.5 … 1.5 2.9 2.1
4 0.9 77.5 16.4 0.0 … 0.8 0.3 0.7
… … … … … … … … …
73996 99.7 0.3 0.0 0.0 … 0.5 0.0 3.6
73997 99.1 0.9 0.0 0.0 … 0.0 0.0 1.3
73998 99.5 0.2 0.0 0.0 … 3.4 0.0 3.4
73999 100.0 0.0 0.0 0.0 … 0.0 0.0 0.0
74000 99.2 0.8 0.0 0.0 … 4.9 0.0 8.9

MeanCommute Employed PrivateWork PublicWork SelfEmployed \


0 24.5 881 74.2 21.2 4.5
1 22.2 852 75.9 15.0 9.0
2 23.1 1482 73.3 21.1 4.8
3 25.9 1849 75.8 19.7 4.5
4 21.0 4787 71.4 24.1 4.5
… … … … … …
73996 26.9 1576 59.2 33.8 7.0
73997 25.3 666 58.4 35.4 6.2
73998 23.5 560 57.5 34.5 8.0
73999 24.1 1062 67.7 30.4 1.9
74000 21.6 759 75.9 19.1 5.0

FamilyWork Unemployment

3
0 0.0 4.6
1 0.0 3.4
2 0.7 4.7
3 0.0 6.1
4 0.0 2.3
… … …
73996 0.0 20.8
73997 0.0 26.3
73998 0.0 23.0
73999 0.0 29.5
74000 0.0 17.9

[74001 rows x 37 columns],


TractId State County TotalPop Men Women Hispanic \
0 1001020100 Alabama Autauga County 1845 899 946 2.4
1 1001020200 Alabama Autauga County 2172 1167 1005 1.1
2 1001020300 Alabama Autauga County 3385 1533 1852 8.0
3 1001020400 Alabama Autauga County 4267 2001 2266 9.6
4 1001020500 Alabama Autauga County 9965 5054 4911 0.9

White Black Native … Walk OtherTransp WorkAtHome MeanCommute \


0 86.3 5.2 0.0 … 0.5 0.0 2.1 24.5
1 41.6 54.5 0.0 … 0.0 0.5 0.0 22.2
2 61.4 26.5 0.6 … 1.0 0.8 1.5 23.1
3 80.3 7.1 0.5 … 1.5 2.9 2.1 25.9
4 77.5 16.4 0.0 … 0.8 0.3 0.7 21.0

Employed PrivateWork PublicWork SelfEmployed FamilyWork Unemployment


0 881 74.2 21.2 4.5 0.0 4.6
1 852 75.9 15.0 9.0 0.0 3.4
2 1482 73.3 21.1 4.8 0.7 4.7
3 1849 75.8 19.7 4.5 0.0 6.1
4 4787 71.4 24.1 4.5 0.0 2.3

[5 rows x 37 columns],
TractId State County TotalPop Men Women \
73996 72153750501 Puerto Rico Yauco Municipio 6011 3035 2976
73997 72153750502 Puerto Rico Yauco Municipio 2342 959 1383
73998 72153750503 Puerto Rico Yauco Municipio 2218 1001 1217
73999 72153750601 Puerto Rico Yauco Municipio 4380 1964 2416
74000 72153750602 Puerto Rico Yauco Municipio 3001 1343 1658

Hispanic White Black Native … Walk OtherTransp WorkAtHome \


73996 99.7 0.3 0.0 0.0 … 0.5 0.0 3.6
73997 99.1 0.9 0.0 0.0 … 0.0 0.0 1.3
73998 99.5 0.2 0.0 0.0 … 3.4 0.0 3.4
73999 100.0 0.0 0.0 0.0 … 0.0 0.0 0.0

4
74000 99.2 0.8 0.0 0.0 … 4.9 0.0 8.9

MeanCommute Employed PrivateWork PublicWork SelfEmployed \


73996 26.9 1576 59.2 33.8 7.0
73997 25.3 666 58.4 35.4 6.2
73998 23.5 560 57.5 34.5 8.0
73999 24.1 1062 67.7 30.4 1.9
74000 21.6 759 75.9 19.1 5.0

FamilyWork Unemployment
73996 0.0 20.8
73997 0.0 26.3
73998 0.0 23.0
73999 0.0 29.5
74000 0.0 17.9

[5 rows x 37 columns],
TractId TotalPop Men Women Hispanic \
count 7.400100e+04 74001.000000 74001.000000 74001.000000 73305.000000
mean 2.839113e+10 4384.716017 2157.710707 2227.005311 17.265444
std 1.647593e+10 2228.936729 1120.560504 1146.240218 23.073811
min 1.001020e+09 0.000000 0.000000 0.000000 0.000000
25% 1.303901e+10 2903.000000 1416.000000 1465.000000 2.600000
50% 2.804700e+10 4105.000000 2007.000000 2082.000000 7.400000
75% 4.200341e+10 5506.000000 2707.000000 2803.000000 21.100000
max 7.215375e+10 65528.000000 32266.000000 33262.000000 100.000000

White Black Native Asian Pacific \


count 73305.000000 73305.00000 73305.000000 73305.000000 73305.000000
mean 61.309043 13.28910 0.734047 4.753691 0.147341
std 30.634461 21.60118 4.554247 8.999888 1.029250
min 0.000000 0.00000 0.000000 0.000000 0.000000
25% 38.000000 0.80000 0.000000 0.200000 0.000000
50% 70.400000 3.80000 0.000000 1.500000 0.000000
75% 87.700000 14.60000 0.400000 5.000000 0.000000
max 100.000000 100.00000 100.000000 100.000000 71.900000

… Walk OtherTransp WorkAtHome MeanCommute \


count … 73200.000000 73200.000000 73200.000000 73055.000000
mean … 3.042825 1.894605 4.661466 26.056594
std … 5.805753 2.549374 4.014940 7.124524
min … 0.000000 0.000000 0.000000 1.000000
25% … 0.400000 0.400000 2.000000 21.100000
50% … 1.400000 1.200000 3.800000 25.400000
75% … 3.300000 2.500000 6.300000 30.300000
max … 100.000000 100.000000 100.000000 73.900000

5
Employed PrivateWork PublicWork SelfEmployed FamilyWork \
count 74001.000000 73190.000000 73190.000000 73190.000000 73190.000000
mean 2049.152052 79.494222 14.163342 6.171484 0.171164
std 1138.865457 8.126383 7.328680 3.932364 0.456580
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1276.000000 75.200000 9.300000 3.500000 0.000000
50% 1895.000000 80.600000 13.000000 5.500000 0.000000
75% 2635.000000 85.000000 17.600000 8.000000 0.000000
max 28945.000000 100.000000 100.000000 100.000000 22.300000

Unemployment
count 73191.000000
mean 7.246738
std 5.227624
min 0.000000
25% 3.900000
50% 6.000000
75% 9.000000
max 100.000000

[8 rows x 35 columns],
None,
Index(['TractId', 'State', 'County', 'TotalPop', 'Men', 'Women', 'Hispanic',
'White', 'Black', 'Native', 'Asian', 'Pacific', 'VotingAgeCitizen',
'Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr', 'Poverty',
'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction',
'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp',
'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork',
'SelfEmployed', 'FamilyWork', 'Unemployment'],
dtype='object'))

[50]: #What is the total population in each county, and how does it vary by state?
total_population_by_county = df.groupby(['State', 'County'])['TotalPop'].sum().
↪reset_index()

print(total_population_by_county)

total_population_by_state = df.groupby('State')['TotalPop'].sum().reset_index()
print(total_population_by_state)

State County TotalPop


0 Alabama Autauga County 55036
1 Alabama Baldwin County 203360
2 Alabama Barbour County 26201
3 Alabama Bibb County 22580
4 Alabama Blount County 57667
… … … …
3215 Wyoming Sweetwater County 44527
3216 Wyoming Teton County 22923

6
3217 Wyoming Uinta County 20758
3218 Wyoming Washakie County 8253
3219 Wyoming Weston County 7117

[3220 rows x 3 columns]


State TotalPop
0 Alabama 4850771
1 Alaska 738565
2 Arizona 6809946
3 Arkansas 2977944
4 California 38982847
5 Colorado 5436519
6 Connecticut 3594478
7 Delaware 943732
8 District of Columbia 672391
9 Florida 20278447
10 Georgia 10201635
11 Hawaii 1421658
12 Idaho 1657375
13 Illinois 12854526
14 Indiana 6614418
15 Iowa 3118102
16 Kansas 2903820
17 Kentucky 4424376
18 Louisiana 4663461
19 Maine 1330158
20 Maryland 5996079
21 Massachusetts 6789319
22 Michigan 9925568
23 Minnesota 5490726
24 Mississippi 2986220
25 Missouri 6075300
26 Montana 1029862
27 Nebraska 1893921
28 Nevada 2887725
29 New Hampshire 1331848
30 New Jersey 8960161
31 New Mexico 2084828
32 New York 19798228
33 North Carolina 10052564
34 North Dakota 745475
35 Ohio 11609756
36 Oklahoma 3896251
37 Oregon 4025127
38 Pennsylvania 12790505
39 Puerto Rico 3468963
40 Rhode Island 1056138
41 South Carolina 4893444

7
42 South Dakota 855444
43 Tennessee 6597381
44 Texas 27419612
45 Utah 2993941
46 Vermont 624636
47 Virginia 8365952
48 Washington 7169967
49 West Virginia 1836843
50 Wisconsin 5763217
51 Wyoming 583200

[52]: #What is the gender distribution (Men vs. Women) across different counties?

gender_distribution_by_county = df.groupby(['State', 'County'])[['Men',␣


↪'Women']].sum().reset_index()

fig, ax = plt.subplots(figsize=(12, 8))


gender_distribution_by_county.groupby('State')[['Men', 'Women']].sum().
↪plot(kind='bar', stacked=True, ax=ax)

ax.set_title('Gender Distribution (Men vs. Women) Across Different Counties')


ax.set_xlabel('State')
ax.set_ylabel('Population')
plt.legend(title='Gender')
plt.show()

8
[54]: #What is the average population size for census tracts in each state?

average_population_by_state = df.groupby('State')['TotalPop'].mean().
↪reset_index()

# print(average_population_by_state)
average_population_by_state.head()

[54]: State TotalPop


0 Alabama 4107.342083
1 Alaska 4422.544910
2 Arizona 4462.612058
3 Arkansas 4341.026239
4 California 4838.382400

[56]: #How does the population of each race (White, Black, Hispanic, etc.) differ␣
↪across states?

9
df['White_Percentage'] = (df['White'] / df['TotalPop']) * 100
df['Black_Percentage'] = (df['Black'] / df['TotalPop']) * 100
df['Native_Percentage'] = (df['Native'] / df['TotalPop']) * 100
df['Asian_Percentage'] = (df['Asian'] / df['TotalPop']) * 100
df['Pacific_Percentage'] = (df['Pacific'] / df['TotalPop']) * 100
df['Hispanic_Percentage'] = (df['Hispanic'] / df['TotalPop']) * 100

# Calculate weighted average percentages


race_population_by_state = df.groupby('State').apply(
lambda x: (x[['Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific']].
↪sum() / x['TotalPop'].sum()) * 100

).reset_index()

race_population_by_state.plot(x='State', kind='bar', stacked=True, figsize=(15,␣


↪10))

plt.title('Population Distribution by Race Across States')


plt.xlabel('State')
plt.ylabel('Percentage of Population')
plt.legend(title='Race')
plt.show()

C:\Users\User\AppData\Local\Temp\ipykernel_6148\2768742177.py:12:
DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns.
This behavior is deprecated, and in a future version of pandas the grouping
columns will be excluded from the operation. Either pass `include_groups=False`
to exclude the groupings or explicitly select the grouping columns after groupby
to silence this warning.
race_population_by_state = df.groupby('State').apply(

10
[57]: #What is the proportion of the male population compared to the female␣
↪population in each census tract?

df['MaleToFemaleRatio'] = df['Men'] / df['Women'].replace(0, np.nan)

# Weighted average Male-to-Female Ratio per state


male_to_female_ratio_by_state = df.groupby('State').apply(
lambda x: x['Men'].sum() / x['Women'].sum()
).reset_index(name='MaleToFemaleRatio')

# Plot the distribution of Male to Female ratio across states


plt.figure(figsize=(15, 10))
plt.bar(male_to_female_ratio_by_state['State'],␣
↪male_to_female_ratio_by_state['MaleToFemaleRatio'])

plt.title('Average Male to Female Ratio by State')


plt.xlabel('State')
plt.ylabel('Male to Female Ratio')
plt.xticks(rotation=90)
plt.show()

11
C:\Users\User\AppData\Local\Temp\ipykernel_6148\1348919082.py:6:
DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns.
This behavior is deprecated, and in a future version of pandas the grouping
columns will be excluded from the operation. Either pass `include_groups=False`
to exclude the groupings or explicitly select the grouping columns after groupby
to silence this warning.
male_to_female_ratio_by_state = df.groupby('State').apply(

1 Ethnicity and Race


What is the distribution of Hispanic population across various counties and states?
How do different racial groups (White, Black, Native, etc.) vary in terms of percentage of total
population in different counties?
Which states have the highest percentage of Black or Hispanic populations?

[59]: #What is the distribution of Hispanic population across various counties and␣
↪states?

12
df['Hispanic_Percentage'] = (df['Hispanic'] / df['TotalPop']) * 100

# Aggregate data at the county level


hispanic_by_county = df.groupby(['State', 'County'])[['Hispanic', 'TotalPop']].
↪sum().reset_index()

hispanic_by_county['Hispanic_Percentage'] = (hispanic_by_county['Hispanic'] /␣
↪hispanic_by_county['TotalPop']) * 100

# Aggregate data at the state level using a weighted average


hispanic_state_data = hispanic_by_county.groupby('State').apply(
lambda x: (x['Hispanic'].sum() / x['TotalPop'].sum()) * 100
).reset_index(name='Hispanic_Percentage')

# Plot the Hispanic population distribution across states


plt.figure(figsize=(15, 10))
plt.bar(hispanic_state_data['State'],␣
↪hispanic_state_data['Hispanic_Percentage'])

plt.title('Distribution of Hispanic Population Across States')


plt.xlabel('State')
plt.ylabel('Hispanic Population Percentage')
plt.xticks(rotation=90)
plt.show()

C:\Users\User\AppData\Local\Temp\ipykernel_6148\482559711.py:11:
DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns.
This behavior is deprecated, and in a future version of pandas the grouping
columns will be excluded from the operation. Either pass `include_groups=False`
to exclude the groupings or explicitly select the grouping columns after groupby
to silence this warning.
hispanic_state_data = hispanic_by_county.groupby('State').apply(

13
[60]: #How do different racial groups (White, Black, Native, etc.) vary in terms of␣
↪percentage of total population in different counties?

# Calculate the percentage of each racial group in each county


df['White_Percentage'] = (df['White'] / df['TotalPop']) * 100
df['Black_Percentage'] = (df['Black'] / df['TotalPop']) * 100
df['Native_Percentage'] = (df['Native'] / df['TotalPop']) * 100
df['Asian_Percentage'] = (df['Asian'] / df['TotalPop']) * 100
df['Pacific_Percentage'] = (df['Pacific'] / df['TotalPop']) * 100

# Aggregate racial population counts at the county level


racial_population_by_county = df.groupby(['State', 'County'])[['White',␣
↪'Black', 'Native', 'Asian', 'Pacific', 'TotalPop']].sum().reset_index()

# Calculate weighted racial percentages at the county level


racial_population_by_county['White_Percentage'] =␣
↪(racial_population_by_county['White'] /␣

↪racial_population_by_county['TotalPop']) * 100

14
racial_population_by_county['Black_Percentage'] =␣
↪(racial_population_by_county['Black'] /␣

↪racial_population_by_county['TotalPop']) * 100

racial_population_by_county['Native_Percentage'] =␣
↪(racial_population_by_county['Native'] /␣

↪racial_population_by_county['TotalPop']) * 100

racial_population_by_county['Asian_Percentage'] =␣
↪(racial_population_by_county['Asian'] /␣

↪racial_population_by_county['TotalPop']) * 100

racial_population_by_county['Pacific_Percentage'] =␣
↪(racial_population_by_county['Pacific'] /␣

↪racial_population_by_county['TotalPop']) * 100

racial_population_by_county

[60]: State County White Black Native Asian Pacific \


0 Alabama Autauga County 867.9 259.3 4.8 7.5 0.4
1 Alabama Baldwin County 2580.6 310.6 27.5 16.8 0.0
2 Alabama Barbour County 414.2 430.7 1.3 4.4 0.0
3 Alabama Bibb County 317.6 70.2 1.5 0.0 0.0
4 Alabama Blount County 779.8 12.0 3.2 1.2 0.0
… … … … … … … …
3215 Wyoming Sweetwater County 959.3 8.2 7.2 7.8 4.9
3216 Wyoming Teton County 330.3 1.8 1.1 8.6 0.0
3217 Wyoming Uinta County 263.8 0.4 2.6 0.3 0.0
3218 Wyoming Washakie County 245.3 0.8 1.1 0.4 0.0
3219 Wyoming Weston County 183.0 1.0 0.2 8.4 0.0

TotalPop White_Percentage Black_Percentage Native_Percentage \


0 55036 1.576968 0.471146 0.008722
1 203360 1.268981 0.152734 0.013523
2 26201 1.580856 1.643830 0.004962
3 22580 1.406554 0.310895 0.006643
4 57667 1.352247 0.020809 0.005549
… … … … …
3215 44527 2.154423 0.018416 0.016170
3216 22923 1.440911 0.007852 0.004799
3217 20758 1.270835 0.001927 0.012525
3218 8253 2.972253 0.009693 0.013328
3219 7117 2.571308 0.014051 0.002810

Asian_Percentage Pacific_Percentage
0 0.013627 0.000727
1 0.008261 0.000000
2 0.016793 0.000000
3 0.000000 0.000000
4 0.002081 0.000000

15
… … …
3215 0.017517 0.011005
3216 0.037517 0.000000
3217 0.001445 0.000000
3218 0.004847 0.000000
3219 0.118027 0.000000

[3220 rows x 13 columns]

[61]: #Which states have the highest percentage of Black or Hispanic populations?

df['Black_Percentage'] = (df['Black'] / df['TotalPop']) * 100


df['Hispanic_Percentage'] = (df['Hispanic'] / df['TotalPop']) * 100

# Compute state-level Black and Hispanic percentages using weighted average


black_hispanic_percentage_by_state = df.groupby('State').apply(
lambda x: pd.Series({
'Black_Percentage': (x['Black'].sum() / x['TotalPop'].sum()) * 100,
'Hispanic_Percentage': (x['Hispanic'].sum() / x['TotalPop'].sum()) * 100
})
).reset_index()

highest_black_percentage_states = black_hispanic_percentage_by_state.
↪sort_values(by='Black_Percentage', ascending=False).head(10)

print("States with the highest percentage of Black population:")


print(highest_black_percentage_states)

highest_hispanic_percentage_states = black_hispanic_percentage_by_state.
↪sort_values(by='Hispanic_Percentage', ascending=False).head(10)

print("States with the highest percentage of Hispanic population:")


print(highest_hispanic_percentage_states)

# Plot Black population


plt.figure(figsize=(10, 6))
plt.bar(highest_black_percentage_states['State'],␣
↪highest_black_percentage_states['Black_Percentage'])

plt.title('Top 10 States with Highest Percentage of Black Population')


plt.xlabel('State')
plt.ylabel('Black Population Percentage')
plt.xticks(rotation=45)
plt.show()

# Plot Hispanic population


plt.figure(figsize=(10, 6))

16
plt.bar(highest_hispanic_percentage_states['State'],␣
↪highest_hispanic_percentage_states['Hispanic_Percentage'])

plt.title('Top 10 States with Highest Percentage of Hispanic Population')


plt.xlabel('State')
plt.ylabel('Hispanic Population Percentage')
plt.xticks(rotation=45)
plt.show()

C:\Users\User\AppData\Local\Temp\ipykernel_6148\72117619.py:8:
DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns.
This behavior is deprecated, and in a future version of pandas the grouping
columns will be excluded from the operation. Either pass `include_groups=False`
to exclude the groupings or explicitly select the grouping columns after groupby
to silence this warning.
black_hispanic_percentage_by_state = df.groupby('State').apply(
States with the highest percentage of Black population:
State Black_Percentage Hispanic_Percentage
8 District of Columbia 1.321820 0.263329
24 Mississippi 0.928682 0.064349
18 Louisiana 0.879639 0.121680
0 Alabama 0.766280 0.093264
20 Maryland 0.712222 0.209424
41 South Carolina 0.638939 0.116409
10 Georgia 0.620468 0.165446
22 Michigan 0.489261 0.132382
7 Delaware 0.478197 0.204232
33 North Carolina 0.462836 0.189193
States with the highest percentage of Hispanic population:
State Black_Percentage Hispanic_Percentage
39 Puerto Rico 0.002410 2.526179
31 New Mexico 0.040919 1.101539
4 California 0.115160 0.775820
44 Texas 0.222349 0.739622
2 Arizona 0.085987 0.669387
28 Nevada 0.189703 0.656354
5 Colorado 0.080154 0.481415
9 Florida 0.310578 0.458349
30 New Jersey 0.318835 0.434905
32 New York 0.379462 0.431436

17
18
2 Employment and Work Type
What is the employment rate (Employed vs. Unemployed) for each census tract?
How does the rate of self-employed individuals compare to those working in private/public sectors
across different states?
What percentage of the population works from home, and how does it vary by county and state?
How does the unemployment rate vary across different states and counties?
What is the distribution of employed individuals working in private vs. public sectors?

[63]: #What is the employment rate (Employed vs. Unemployed) for each census tract?
df['EmploymentRate'] = df['Employed'] / df['TotalPop']
df['UnemploymentRate'] = df['Unemployment'] / df['TotalPop']
df[['State', 'County', 'TractId', 'Employed', 'Unemployment', 'EmploymentRate',␣
↪'UnemploymentRate']]

[63]: State County TractId Employed Unemployment \


0 Alabama Autauga County 1001020100 881 4.6
1 Alabama Autauga County 1001020200 852 3.4
2 Alabama Autauga County 1001020300 1482 4.7

19
3 Alabama Autauga County 1001020400 1849 6.1
4 Alabama Autauga County 1001020500 4787 2.3
… … … … … …
73996 Puerto Rico Yauco Municipio 72153750501 1576 20.8
73997 Puerto Rico Yauco Municipio 72153750502 666 26.3
73998 Puerto Rico Yauco Municipio 72153750503 560 23.0
73999 Puerto Rico Yauco Municipio 72153750601 1062 29.5
74000 Puerto Rico Yauco Municipio 72153750602 759 17.9

EmploymentRate UnemploymentRate
0 0.477507 0.002493
1 0.392265 0.001565
2 0.437814 0.001388
3 0.433326 0.001430
4 0.480381 0.000231
… … …
73996 0.262186 0.003460
73997 0.284372 0.011230
73998 0.252480 0.010370
73999 0.242466 0.006735
74000 0.252916 0.005965

[74001 rows x 7 columns]

[65]: #How does the rate of self-employed individuals compare to those working in␣
↪private/public sectors across different states?

df['SelfEmployedRate'] = df['SelfEmployed'] / df['Employed']


df['PrivateWorkRate'] = df['PrivateWork'] / df['Employed']
df['PublicWorkRate'] = df['PublicWork'] / df['Employed']

employment_rates_by_state = df.groupby('State')[['SelfEmployedRate',␣
↪'PrivateWorkRate', 'PublicWorkRate']].mean().reset_index()

employment_rates_by_state.head()

employment_rates_by_state.set_index('State').plot(kind='bar', stacked=True,␣
↪figsize=(15, 10))

plt.title('Comparison of Employment Rates Across Different States')


plt.xlabel('State')
plt.ylabel('Rate')
plt.legend(title='Employment Type')
plt.show()

20
[66]: #What percentage of the population works from home, and how does it vary by␣
↪county and state?

df['WorkAtHomePercentage'] = (df['WorkAtHome'] / df['TotalPop']) * 100

work_at_home_by_county = df.groupby(['State',␣
↪'County'])['WorkAtHomePercentage'].mean().reset_index()

work_at_home_by_state = df.groupby('State')['WorkAtHomePercentage'].mean().
↪reset_index()

plt.figure(figsize=(15, 10))
plt.bar(work_at_home_by_state['State'],␣
↪work_at_home_by_state['WorkAtHomePercentage'])

plt.title('Percentage of Population Working from Home by State')


plt.xlabel('State')
plt.ylabel('Work at Home Percentage')
plt.xticks(rotation=90)

21
plt.show()

[67]: #How does the unemployment rate vary across different states and counties?

unemployment_rate_by_county = df.groupby(['State',␣
↪'County'])['UnemploymentRate'].mean().reset_index()

unemployment_rate_by_state = df.groupby('State')['UnemploymentRate'].mean().
↪reset_index()

plt.figure(figsize=(15, 10))
plt.bar(unemployment_rate_by_state['State'],␣
↪unemployment_rate_by_state['UnemploymentRate'])

plt.title('Unemployment Rate by State')


plt.xlabel('State')
plt.ylabel('Unemployment Rate')
plt.xticks(rotation=90)
plt.show()

22
[68]: #What is the distribution of employed individuals working in private vs. public␣
↪sectors?

employment_distribution_by_sector = df.groupby('State')[['PrivateWork',␣
↪'PublicWork']].sum().reset_index()

print(employment_distribution_by_sector)

employment_distribution_by_sector.set_index('State').plot(kind='bar',␣
↪stacked=True, figsize=(15, 10))

plt.title('Distribution of Employed Individuals Working in Private vs. Public␣


↪Sectors')

plt.xlabel('State')
plt.ylabel('Number of Employed Individuals')
plt.legend(title='Employment Sector')
plt.show()

State PrivateWork PublicWork


0 Alabama 93013.8 18161.0
1 Alaska 10828.5 4559.1
2 Arizona 119351.9 22220.5
3 Arkansas 52857.1 11189.5

23
4 California 620398.1 109397.2
5 Colorado 97844.1 17353.8
6 Connecticut 66745.7 10529.6
7 Delaware 17486.0 3008.6
8 District of Columbia 12648.6 4495.2
9 Florida 339459.8 50039.8
10 Georgia 154053.7 30032.9
11 Hawaii 22185.5 6871.9
12 Idaho 22544.6 4713.8
13 Illinois 257640.6 38630.2
14 Indiana 127561.2 15807.4
15 Iowa 65720.6 10575.5
16 Kansas 59007.9 11671.1
17 Kentucky 87831.4 16510.3
18 Louisiana 88930.0 17008.1
19 Maine 26940.6 4891.5
20 Maryland 101853.0 29729.1
21 Massachusetts 119845.6 17640.1
22 Michigan 231237.5 29144.3
23 Minnesota 109311.0 15853.7
24 Mississippi 49870.3 12135.6
25 Missouri 113203.8 17254.2
26 Montana 18926.0 5122.4
27 Nebraska 41686.2 7305.3
28 Nevada 55822.1 8199.3
29 New Hampshire 23277.0 3879.4
30 New Jersey 162630.5 27280.9
31 New Mexico 34674.1 11583.9
32 New York 379901.5 75897.8
33 North Carolina 172143.5 31173.4
34 North Dakota 14928.3 3435.3
35 Ohio 244297.7 34850.6
36 Oklahoma 80218.8 17329.3
37 Oregon 64121.0 11690.7
38 Pennsylvania 269785.3 33345.0
39 Puerto Rico 60283.9 19938.7
40 Rhode Island 19867.9 2902.2
41 South Carolina 86280.9 16694.5
42 South Dakota 16150.8 3726.8
43 Tennessee 116793.3 20576.8
44 Texas 414609.8 70444.0
45 Utah 46872.6 8617.6
46 Vermont 13926.9 2583.8
47 Virginia 140180.9 37739.8
48 Washington 111172.3 24186.2
49 West Virginia 37205.6 9021.3
50 Wisconsin 114720.4 16816.3
51 Wyoming 9333.9 2849.7

24
3 Commuting and Transportation
What is the average commuting time across counties and states, and how does it differ for employed
individuals?
What modes of transportation are most commonly used for commuting in different states (e.g., car,
public transportation, walking)?
How does the percentage of people commuting via walking or public transportation vary between
urban and rural areas?

[71]: #What is the average commuting time across counties and states, and how does it␣
↪differ for employed individuals?

average_commute_by_county = df.groupby(['State', 'County'])['MeanCommute'].


↪mean().reset_index()

average_commute_by_state = df.groupby('State')['MeanCommute'].mean().
↪reset_index()

#see how differ

25
plt.figure(figsize=(15, 10))
plt.bar(average_commute_by_state['State'],␣
↪average_commute_by_state['MeanCommute'],color='y')

plt.title('Average Commuting Time by State')


plt.xlabel('State')
plt.ylabel('Average Commuting Time (minutes)')
plt.xticks(rotation=90)
plt.show()

print(average_commute_by_county)
print(average_commute_by_state)

State County MeanCommute


0 Alabama Autauga County 25.766667
1 Alabama Baldwin County 27.054839
2 Alabama Barbour County 22.744444
3 Alabama Bibb County 31.200000
4 Alabama Blount County 35.011111

26
… … … …
3215 Wyoming Sweetwater County 20.708333
3216 Wyoming Teton County 14.450000
3217 Wyoming Uinta County 20.233333
3218 Wyoming Washakie County 14.533333
3219 Wyoming Weston County 26.000000

[3220 rows x 3 columns]


State MeanCommute
0 Alabama 24.458638
1 Alaska 18.209639
2 Arizona 24.833444
3 Arkansas 21.739824
4 California 28.720396
5 Colorado 24.836812
6 Connecticut 26.018909
7 Delaware 24.965421
8 District of Columbia 31.087640
9 Florida 26.436147
10 Georgia 27.015984
11 Hawaii 26.475641
12 Idaho 20.638721
13 Illinois 28.584158
14 Indiana 23.149035
15 Iowa 19.248967
16 Kansas 19.061133
17 Kentucky 23.754430
18 Louisiana 24.555298
19 Maine 23.745584
20 Maryland 32.228974
21 Massachusetts 28.636557
22 Michigan 24.467458
23 Minnesota 22.994670
24 Mississippi 23.791450
25 Missouri 23.416955
26 Montana 18.123792
27 Nebraska 18.464839
28 Nevada 23.829056
29 New Hampshire 26.895890
30 New Jersey 31.191014
31 New Mexico 21.963454
32 New York 33.084997
33 North Carolina 24.020849
34 North Dakota 17.738537
35 Ohio 23.213692
36 Oklahoma 21.298177
37 Oregon 23.183981
38 Pennsylvania 26.470801

27
39 Puerto Rico 28.281087
40 Rhode Island 24.409167
41 South Carolina 24.292173
42 South Dakota 17.077477
43 Tennessee 24.576626
44 Texas 25.205923
45 Utah 21.286770
46 Vermont 23.095082
47 Virginia 27.695833
48 Washington 26.888989
49 West Virginia 25.428306
50 Wisconsin 22.135396
51 Wyoming 18.145802

[72]: #What modes of transportation are most commonly used for commuting in different␣
↪states (e.g., car, public transportation, walking)?

#Calcultions
transportation_modes_by_state = df.groupby('State')[['Drive', 'Carpool',␣
↪'Transit', 'Walk', 'OtherTransp']].sum().reset_index()

# Normalize the data to get percentages


transportation_modes_by_state[['Drive', 'Carpool', 'Transit', 'Walk',␣
↪'OtherTransp']] = transportation_modes_by_state[['Drive', 'Carpool',␣

↪'Transit', 'Walk', 'OtherTransp']].

↪div(transportation_modes_by_state[['Drive', 'Carpool', 'Transit', 'Walk',␣

↪'OtherTransp']].sum(axis=1), axis=0) * 100

# Sort the data by the 'Drive' column in descending order


transportation_modes_by_state = transportation_modes_by_state.
↪sort_values(by='Drive', ascending=False)

# HoriZontal Plot of the data


transportation_modes_by_state.set_index('State').plot(kind='barh',␣
↪stacked=True, figsize=(15, 10))

plt.title('Modes of Transportation for Commuting in Different States')


plt.xlabel('Percentage')
plt.ylabel('State')
plt.legend(title='Transportation Mode')
plt.show()

print(transportation_modes_by_state)

28
State Drive Carpool Transit Walk \
0 Alabama 87.237800 9.464023 0.586360 1.483553
43 Tennessee 86.129360 9.594789 1.083404 1.773425
24 Mississippi 86.086598 10.119360 0.424133 1.671821
29 New Hampshire 85.520991 8.814976 0.895838 3.211881
41 South Carolina 85.126841 10.013474 0.761312 2.265327
3 Arkansas 85.009745 11.042141 0.500057 2.015314
16 Kansas 84.992725 10.007144 0.596194 2.837533
36 Oklahoma 84.907412 10.795471 0.583061 2.210284
33 North Carolina 84.901295 10.254153 1.212449 2.128705
27 Nebraska 84.767305 9.650356 0.817301 3.337992
14 Indiana 84.761873 9.662309 1.482523 2.462258
35 Ohio 84.513781 8.589462 2.775086 2.761280
15 Iowa 84.454221 9.198036 1.067967 3.695741
25 Missouri 84.286308 9.674766 2.303347 2.313148
17 Kentucky 84.152657 10.172520 1.336560 2.625019
49 West Virginia 84.103448 9.925722 1.045823 3.613250
22 Michigan 83.925573 9.733940 2.229668 2.609457
31 New Mexico 83.907252 10.063634 1.180954 2.694058
7 Delaware 83.632857 8.266754 3.799419 2.954340
9 Florida 83.585773 9.656995 2.252973 1.895705
42 South Dakota 83.578985 9.608715 0.613847 4.691166
34 North Dakota 83.455019 9.613547 0.561868 5.015786
44 Texas 83.448608 11.148666 1.682065 1.891720
50 Wisconsin 83.272902 8.829600 2.647031 3.519228

29
18 Louisiana 82.752882 9.928719 2.361261 2.474868
19 Maine 82.560276 10.427820 0.708804 4.577618
10 Georgia 82.553645 10.903683 2.667940 1.965389
12 Idaho 82.548234 10.779917 0.809219 3.478675
39 Puerto Rico 82.061982 8.496652 2.871417 4.380004
40 Rhode Island 81.987316 9.117606 2.908490 4.454913
23 Minnesota 81.692781 9.298336 3.999825 3.195506
46 Vermont 81.644848 9.479608 1.221118 5.917226
47 Virginia 81.290401 9.740763 4.383332 2.682129
51 Wyoming 81.222942 10.773234 1.309205 4.608724
2 Arizona 81.052286 11.526215 2.146066 2.364155
5 Colorado 80.907299 10.085541 3.281061 3.280014
6 Connecticut 80.681142 8.678442 5.921638 3.466852
28 Nevada 80.267411 10.829486 3.976415 2.597373
45 Utah 80.194172 12.128832 2.663132 2.935868
26 Montana 79.196714 10.630047 0.787306 6.936447
38 Pennsylvania 77.786649 9.225055 6.836267 4.608836
4 California 77.665634 11.035172 5.574819 2.972786
48 Washington 77.643134 10.693114 5.855954 3.677865
37 Oregon 76.806406 10.869483 4.276663 4.449932
20 Maryland 75.852719 9.589507 10.078988 2.900839
13 Illinois 75.248824 8.560149 11.003480 3.269925
30 New Jersey 73.641468 8.522955 12.381831 3.447695
21 Massachusetts 72.916093 8.115665 10.971217 5.811922
11 Hawaii 69.115348 14.034894 6.945447 5.954481
1 Alaska 67.163846 12.427795 1.499818 11.904292
32 New York 55.260617 7.185068 29.055900 6.596035
8 District of Columbia 38.846324 6.042644 37.586481 11.469521

OtherTransp
0 1.228263
43 1.419022
24 1.698087
29 1.556314
41 1.833046
3 1.432743
16 1.566404
36 1.503772
33 1.503398
27 1.427046
14 1.631036
35 1.360391
15 1.584035
25 1.422432
17 1.713244
49 1.311757
22 1.501362
31 2.154102

30
7 1.346630
9 2.608554
42 1.507287
34 1.353779
44 1.828942
50 1.731240
18 2.482271
19 1.725482
10 1.909344
12 2.383955
39 2.189944
40 1.531675
23 1.813552
46 1.737200
47 1.903376
51 2.085894
2 2.911277
5 2.446085
6 1.251926
28 2.329315
45 2.077996
26 2.449487
38 1.543194
4 2.751590
48 2.129932
37 3.597516
20 1.577947
13 1.917621
30 2.006051
21 2.185103
11 3.949830
1 7.004248
32 1.902380
8 6.055030

[73]: #How does the percentage of people commuting via walking or public␣
↪transportation vary between urban and rural areas?

df['AreaType'] = np.where(df['TotalPop'] > 5000, 'Urban', 'Rural')

commute_modes_by_area = df.groupby('AreaType')[['Walk', 'Transit']].mean().


↪reset_index()

# Plot the data

31
commute_modes_by_area.set_index('AreaType').plot(kind='bar', stacked=True,␣
↪figsize=(10, 6))

plt.title('Percentage of People Commuting via Walking or Public Transportation␣


↪(Urban vs. Rural)')

plt.xlabel('Area Type')
plt.ylabel('Percentage')
plt.legend(title='Commute Mode')
plt.show()

commute_modes_by_area

[73]: AreaType Walk Transit


0 Rural 3.36198 5.855194
1 Urban 2.39732 4.464799

4 Income and Housing


What is the average income (or median household income) in each state and county?
How does the distribution of housing type (e.g., owner-occupied vs. renter-occupied) vary across
different counties?
How does the cost of living compare across different states based on average income and housing

32
costs?

[77]: #What is the average income (or median household income) in each state and␣
↪county?

average_income_by_county = df.groupby(['State', 'County'])['Income'].mean().


↪reset_index()

average_income_by_state = df.groupby('State')['Income'].mean().reset_index()

print(average_income_by_county.head())
print(average_income_by_state.head())

State County Income


0 Alabama Autauga County 53567.500000
1 Alabama Baldwin County 52732.225806
2 Alabama Barbour County 32717.777778
3 Alabama Bibb County 44677.000000
4 Alabama Blount County 46325.555556
State Income
0 Alabama 45938.212947
1 Alaska 73796.757576
2 Arizona 57815.571807
3 Arkansas 44245.267936
4 California 73070.965821

[78]: # Calculate the total number of owner-occupied and renter-occupied housing␣


↪units in each county

# housing_distribution_by_county = df.groupby(['State',␣
↪'County'])[['OwnerOccupied', 'RenterOccupied']].sum().reset_index()

# # Plot the distribution


# fig, ax = plt.subplots(figsize=(15, 10))
# housing_distribution_by_county.set_index(['State', 'County']).
↪plot(kind='bar', stacked=True, ax=ax)

# ax.set_title('Distribution of Housing Type (Owner-Occupied vs.␣


↪Renter-Occupied) Across Different Counties')

# ax.set_xlabel('County')
# ax.set_ylabel('Number of Housing Units')
# plt.legend(title='Housing Type')
# plt.show()
'''

(Current dataset does not include owner/renter data, so this needs additional␣
↪information.)

'''
print(df.columns)

Index(['TractId', 'State', 'County', 'TotalPop', 'Men', 'Women', 'Hispanic',


'White', 'Black', 'Native', 'Asian', 'Pacific', 'VotingAgeCitizen',

33
'Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr', 'Poverty',
'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction',
'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp',
'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork',
'SelfEmployed', 'FamilyWork', 'Unemployment', 'White_Percentage',
'Black_Percentage', 'Native_Percentage', 'Asian_Percentage',
'Pacific_Percentage', 'Hispanic_Percentage', 'MaleToFemaleRatio',
'EmploymentRate', 'UnemploymentRate', 'SelfEmployedRate',
'PrivateWorkRate', 'PublicWorkRate', 'WorkAtHomePercentage',
'AreaType'],
dtype='object')

[79]: #How does the cost of living compare across different states based on average␣
↪income and housing costs?

average_income_by_state = df.groupby('State')['Income'].mean().reset_index()
average_per_capita_income_by_state = df.groupby('State')['IncomePerCap'].mean().
↪reset_index()

cost_of_living_by_state = pd.merge(average_income_by_state,␣
↪average_per_capita_income_by_state, on='State')

cost_of_living_by_state.columns = ['State', 'AverageIncome', 'PerCapitaIncome']␣


↪ # Better naming

cost_of_living_by_state.set_index('State').plot(kind='bar', figsize=(15, 8),␣


↪colormap='coolwarm')

plt.title('Comparison of Income Levels Across Different States')


plt.xlabel('State')
plt.ylabel('Amount in USD')
plt.legend(title='Income Type')
plt.xticks(rotation=90)
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.show()

34
5 Social Characteristics
What is the relationship between education levels (e.g., percentage with a high school diploma,
bachelor’s degree) and employment types across different states?

[81]: #What is the relationship between education levels (e.g., percentage with a␣
↪high school diploma, bachelor’s degree) and employment types

# across different states?

'''
(Current dataset does not include direct education data, but employment types␣
↪are available.)

'''

employment_by_state = df.groupby('State')[['Professional', 'Service', 'Office',␣


↪'Construction', 'Production']].mean().reset_index()

# Plot employment distribution


employment_by_state.plot(x='State', kind='bar', stacked=True, figsize=(15, 10))
plt.title('Employment Type Distribution Across States')
plt.xlabel('State')
plt.ylabel('Average Employment Type Percentage')

35
plt.legend(title='Employment Type')
plt.xticks(rotation=90)
plt.show()

36
D24AIML081_AS7

April 1, 2025

[45]: import pandas as pd


import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Load the Iris dataset


iris = sns.load_dataset("iris")

# 1. Descriptive Statistics
print("Descriptive Statistics:")
print(iris.describe())
print("\nMedian values:")
print(iris.median(numeric_only=True))

Descriptive Statistics:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

Median values:
sepal_length 5.80
sepal_width 3.00
petal_length 4.35
petal_width 1.30
dtype: float64

[46]: #1. Calculate basic descriptive statistics such as the mean, median, standard␣
↪deviation, and more for each of the numeric columns.

1
[47]: import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
print("Dataset Preview:\n", df.head())
stats = df.describe().T
stats['median'] = df.median()
print("\nDescriptive Statistics:\n", stats)

Dataset Preview:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

Descriptive Statistics:
count mean std min 25% 50% 75% max median
sepal length (cm) 150.0 5.843333 0.828066 4.3 5.1 5.80 6.4 7.9 5.80
sepal width (cm) 150.0 3.057333 0.435866 2.0 2.8 3.00 3.3 4.4 3.00
petal length (cm) 150.0 3.758000 1.765298 1.0 1.6 4.35 5.1 6.9 4.35
petal width (cm) 150.0 1.199333 0.762238 0.1 0.3 1.30 1.8 2.5 1.30

[48]: #2. Normal Distribution (Check for Normality) check whether the `sepal_length`␣
↪follows a normal distribution using a histogram and a Q-Q plot.

[53]: plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.histplot(df['sepal length (cm)'], kde=True, bins=20)
plt.title("Histogram of Sepal Length")

plt.subplot(1, 2, 2)
import scipy.stats as stats
stats.probplot(df['sepal length (cm)'].values, dist="norm", plot=plt)

plt.title("Q-Q Plot of Sepal Length")

plt.show()

2
[55]: #3. Hypothesis Testing (One-Sample t-Test) perform a one-sample t-test to check␣
↪if the average `sepal_length` is different from 5.0.

[57]: t_stat, p_value = stats.ttest_1samp(df['sepal length (cm)'], 5.0)

print("One-Sample t-Test Results:")


print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
print("Reject the null hypothesis: The average sepal length is␣
↪significantly different from 5.0.")

else:
print("Fail to reject the null hypothesis: No significant difference from 5.
↪0.")

One-Sample t-Test Results:


T-Statistic: 12.4733
P-Value: 0.0000
Reject the null hypothesis: The average sepal length is significantly different
from 5.0.

[59]: #4. Correlation Analysis calculate the Pearson correlation coefficient between␣
↪`sepal_length` and `petal_length` to see if they are related.

[61]: correlation, p_value = stats.pearsonr(df['sepal length (cm)'], df['petal length␣


↪(cm)'])

print("Pearson Correlation Analysis:")

3
print(f"Correlation Coefficient: {correlation:.4f}")
print(f"P-Value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
print("The correlation is statistically significant.")
else:
print("The correlation is not statistically significant.")

Pearson Correlation Analysis:


Correlation Coefficient: 0.8718
P-Value: 0.0000
The correlation is statistically significant.

[63]: #5. Simple Linear Regression perform a simple linear regression to predict␣
↪`petal_length` based on `sepal_length`.

[65]: corr = stats.pearsonr(df["sepal length (cm)"], df["petal length (cm)"])


print(f"Pearson Correlation: {corr}")

Pearson Correlation: PearsonRResult(statistic=0.8717537758865832,


pvalue=1.0386674194497525e-47)

[67]: #6. ANOVA (One-Way Analysis of Variance)We will perform an ANOVA test to check␣
↪if there is a significant difference in the `sepal_length` between different␣

↪species.

[69]: df['species'] = [iris.target_names[i] for i in load_iris().target]

# Group data by species


species_groups = [df[df['species'] == species]['sepal length (cm)'] for species␣
↪in iris.target_names]

# Perform one-way ANOVA


f_stat, p_value = stats.f_oneway(*species_groups)

print("One-Way ANOVA Results:")


print(f"F-Statistic: {f_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
print("Reject the null hypothesis: There is a significant difference in␣
↪sepal length between species.")

else:
print("Fail to reject the null hypothesis: No significant difference in␣
↪sepal length between species.")

One-Way ANOVA Results:

4
F-Statistic: 119.2645
P-Value: 0.0000
Reject the null hypothesis: There is a significant difference in sepal length
between species.

[71]: #7. Chi-Square Test for Independence perform a Chi-Square test to see if there␣
↪is an association between `species` and `sepal_width`.

[73]: df['sepal_width_category'] = pd.cut(df['sepal width (cm)'], bins=3)


species_s_width = pd.crosstab(df['species'], df['sepal_width_category'])
chi2, p, dof, expected = stats.chi2_contingency(species_s_width)

print(f"Chi-Square Statistic: {chi2:.4f}")


print(f"P-value: {p:.4f}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies Table:")
print(pd.DataFrame(expected, index=species_s_width.index,␣
↪columns=species_s_width.columns))

alpha = 0.05
if p < alpha:
print("Reject the null hypothesis: There is an association between species␣
↪and sepal width.")

else:
print("Fail to reject the null hypothesis: No significant association␣
↪between species and sepal width.")

Chi-Square Statistic: 45.1247


P-value: 0.0000
Degrees of Freedom: 4
Expected Frequencies Table:
sepal_width_category (1.998, 2.8] (2.8, 3.6] (3.6, 4.4]
species
setosa 15.666667 29.333333 5.0
versicolor 15.666667 29.333333 5.0
virginica 15.666667 29.333333 5.0
Reject the null hypothesis: There is an association between species and sepal
width.

[75]: #8. Calculate the 95% confidence interval for the `petal_length` for each␣
↪species. Use the `petal_length` column and apply the `groupby()` function to␣

↪compute the confidence interval by species.

[77]: def confidence_interval(data, confidence=0.95):


n = len(data)
mean = np.mean(data)
std_err = stats.sem(data)
margin_of_error = std_err * stats.t.ppf((1 + confidence) / 2, n - 1)

5
return mean - margin_of_error, mean + margin_of_error

df['species'] = [iris.target_names[i] for i in load_iris().target]

ci_results = df.groupby('species')['petal length (cm)'].


↪apply(confidence_interval)

print("95% Confidence Intervals for Petal Length by Species:")


print(ci_results)

95% Confidence Intervals for Petal Length by Species:


species
setosa (1.4126452382875103, 1.51135476171249)
versicolor (4.126452777905478, 4.393547222094521)
virginica (5.395153262927524, 5.708846737072477)
Name: petal length (cm), dtype: object

[79]: #9. Find the correlation between `petal_length` and `petal_width`. Is it a␣


↪strong positive, weak positive, or negative correlation? Provide the␣

↪correlation coefficient and p-value.

[81]: correlation, p_value = stats.pearsonr(df['petal length (cm)'], df['petal width␣


↪(cm)'])

print("Pearson Correlation Analysis:")


print(f"Correlation Coefficient: {correlation:.4f}")
print(f"P-Value: {p_value:.4f}")

if correlation > 0.7:


strength = "strong positive"
elif 0.3 < correlation <= 0.7:
strength = "moderate positive"
elif 0 < correlation <= 0.3:
strength = "weak positive"
elif -0.3 <= correlation < 0:
strength = "weak negative"
elif -0.7 <= correlation < -0.3:
strength = "moderate negative"
else:
strength = "strong negative"

print(f"The correlation is {strength}.")

Pearson Correlation Analysis:


Correlation Coefficient: 0.9629
P-Value: 0.0000
The correlation is strong positive.

6
[83]: #10. Conduct a Chi-Square test to see if there is an association between the␣
↪`season` and `species`. You will need to categorize the `season` column␣

↪(Spring, Summer, Fall, Winter) and check if the distribution of species␣

↪varies by season.

[85]: np.random.seed(42)
seasons = ['Spring', 'Summer', 'Fall', 'Winter']
df['season'] = np.random.choice(seasons, size=len(df))

contingency_table = pd.crosstab(df['species'], df['season'])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)

print("Chi-Square Test for Independence Results:")


print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"P-Value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies Table:")
print(pd.DataFrame(expected, index=contingency_table.index,␣
↪columns=contingency_table.columns))

alpha = 0.05
if p_value < alpha:
print("Reject the null hypothesis: There is an association between species␣
↪and season.")

else:
print("Fail to reject the null hypothesis: No significant association␣
↪between species and season.")

Chi-Square Test for Independence Results:


Chi-Square Statistic: 2.6935
P-Value: 0.8462
Degrees of Freedom: 6
Expected Frequencies Table:
season Fall Spring Summer Winter
species
setosa 11.333333 11.333333 12.0 15.333333
versicolor 11.333333 11.333333 12.0 15.333333
virginica 11.333333 11.333333 12.0 15.333333
Fail to reject the null hypothesis: No significant association between species
and season.

[87]: #11. Calculate the Z-scores for `sepal_length` and identify if any values are␣
↪outliers (with a threshold of ±3). How many outliers do you find?

[89]: df['sepal_length_zscore'] = (df['sepal length (cm)'] - df['sepal length (cm)'].


↪mean()) / df['sepal length (cm)'].std()

7
outliers = df[np.abs(df['sepal_length_zscore']) > 3]

# Display results
print("Number of outliers in Sepal Length:", len(outliers))
print(outliers[['sepal length (cm)', 'sepal_length_zscore']])

Number of outliers in Sepal Length: 0


Empty DataFrame
Columns: [sepal length (cm), sepal_length_zscore]
Index: []

[91]: #12.Create a pair plot to visualize the relationships between `sepal_length`,␣


↪`sepal_width`, `petal_length`, and `petal_width`. Based on the plot,␣

↪describe any patterns or correlations you observe.

[93]: sns.pairplot(df, vars=['sepal length (cm)', 'sepal width (cm)', 'petal length␣
↪(cm)', 'petal width (cm)'], hue='species', diag_kind='kde')

plt.show()

8
[ ]:

[ ]:

9
PMRP PRACTICAL 8

April 1, 2025

[1]: import pandas as pd

[20]: import numpy as np


import matplotlib.pyplot as plt

[28]: data=pd.read_csv(r"/home/mandeep/Life Expectancy Data.csv")

1 Q1 : what are the columns and data types in the dataset:


[22]: data_types=data.dtypes
data_types

[22]: Country object


Year int64
Status object
Life expectancy float64
Adult Mortality float64
infant deaths int64
Alcohol float64
percentage expenditure float64
Hepatitis B float64
Measles int64
BMI float64
under-five deaths int64
Polio float64
Total expenditure float64
Diphtheria float64
HIV/AIDS float64
GDP float64
Population float64
thinness 1-19 years float64
thinness 5-9 years float64
Income composition of resources float64
Schooling float64
dtype: object

1
2 Q2 : How many missing values are present in the coulmns:
[23]: null_values = data.isnull().sum()
null_values

[23]: Country 0
Year 0
Status 0
Life expectancy 10
Adult Mortality 10
infant deaths 0
Alcohol 194
percentage expenditure 0
Hepatitis B 553
Measles 0
BMI 34
under-five deaths 0
Polio 19
Total expenditure 226
Diphtheria 19
HIV/AIDS 0
GDP 448
Population 652
thinness 1-19 years 34
thinness 5-9 years 34
Income composition of resources 167
Schooling 163
dtype: int64

3 Q3 : what are the unique values for catergorical columns ? what


is the distributaion of life expectancy across different counties:
[24]: County=data['Country'].unique()
status=data['Status'].unique()
print(County)
print(status)

['Afghanistan' 'Albania' 'Algeria' 'Angola' 'Antigua and Barbuda'


'Argentina' 'Armenia' 'Australia' 'Austria' 'Azerbaijan' 'Bahamas'
'Bahrain' 'Bangladesh' 'Barbados' 'Belarus' 'Belgium' 'Belize' 'Benin'
'Bhutan' 'Bolivia (Plurinational State of)' 'Bosnia and Herzegovina'
'Botswana' 'Brazil' 'Brunei Darussalam' 'Bulgaria' 'Burkina Faso'
'Burundi' "Côte d'Ivoire" 'Cabo Verde' 'Cambodia' 'Cameroon' 'Canada'
'Central African Republic' 'Chad' 'Chile' 'China' 'Colombia' 'Comoros'
'Congo' 'Cook Islands' 'Costa Rica' 'Croatia' 'Cuba' 'Cyprus' 'Czechia'
"Democratic People's Republic of Korea"
'Democratic Republic of the Congo' 'Denmark' 'Djibouti' 'Dominica'

2
'Dominican Republic' 'Ecuador' 'Egypt' 'El Salvador' 'Equatorial Guinea'
'Eritrea' 'Estonia' 'Ethiopia' 'Fiji' 'Finland' 'France' 'Gabon' 'Gambia'
'Georgia' 'Germany' 'Ghana' 'Greece' 'Grenada' 'Guatemala' 'Guinea'
'Guinea-Bissau' 'Guyana' 'Haiti' 'Honduras' 'Hungary' 'Iceland' 'India'
'Indonesia' 'Iran (Islamic Republic of)' 'Iraq' 'Ireland' 'Israel'
'Italy' 'Jamaica' 'Japan' 'Jordan' 'Kazakhstan' 'Kenya' 'Kiribati'
'Kuwait' 'Kyrgyzstan' "Lao People's Democratic Republic" 'Latvia'
'Lebanon' 'Lesotho' 'Liberia' 'Libya' 'Lithuania' 'Luxembourg'
'Madagascar' 'Malawi' 'Malaysia' 'Maldives' 'Mali' 'Malta'
'Marshall Islands' 'Mauritania' 'Mauritius' 'Mexico'
'Micronesia (Federated States of)' 'Monaco' 'Mongolia' 'Montenegro'
'Morocco' 'Mozambique' 'Myanmar' 'Namibia' 'Nauru' 'Nepal' 'Netherlands'
'New Zealand' 'Nicaragua' 'Niger' 'Nigeria' 'Niue' 'Norway' 'Oman'
'Pakistan' 'Palau' 'Panama' 'Papua New Guinea' 'Paraguay' 'Peru'
'Philippines' 'Poland' 'Portugal' 'Qatar' 'Republic of Korea'
'Republic of Moldova' 'Romania' 'Russian Federation' 'Rwanda'
'Saint Kitts and Nevis' 'Saint Lucia' 'Saint Vincent and the Grenadines'
'Samoa' 'San Marino' 'Sao Tome and Principe' 'Saudi Arabia' 'Senegal'
'Serbia' 'Seychelles' 'Sierra Leone' 'Singapore' 'Slovakia' 'Slovenia'
'Solomon Islands' 'Somalia' 'South Africa' 'South Sudan' 'Spain'
'Sri Lanka' 'Sudan' 'Suriname' 'Swaziland' 'Sweden' 'Switzerland'
'Syrian Arab Republic' 'Tajikistan' 'Thailand'
'The former Yugoslav republic of Macedonia' 'Timor-Leste' 'Togo' 'Tonga'
'Trinidad and Tobago' 'Tunisia' 'Turkey' 'Turkmenistan' 'Tuvalu' 'Uganda'
'Ukraine' 'United Arab Emirates'
'United Kingdom of Great Britain and Northern Ireland'
'United Republic of Tanzania' 'United States of America' 'Uruguay'
'Uzbekistan' 'Vanuatu' 'Venezuela (Bolivarian Republic of)' 'Viet Nam'
'Yemen' 'Zambia' 'Zimbabwe']
['Developing' 'Developed']

[29]: print(data.groupby('Country')['Life expectancy'].mean().


↪sort_values(ascending=False).head(10))

data.groupby('Country')['Life expectancy'].mean().sort_values(ascending=False).
↪head(10).plot(kind='bar',figsize=(20,5))

plt.title("Life Ecpectancy Across Different Countries : ")

Country
Japan 82.53750
Sweden 82.51875
Iceland 82.44375
Switzerland 82.33125
France 82.21875
Italy 82.18750
Spain 82.06875
Australia 81.81250
Norway 81.79375
Canada 81.68750

3
Name: Life expectancy, dtype: float64

[29]: Text(0.5, 1.0, 'Life Ecpectancy Across Different Countries : ')

4 Q4 : What is the correlation between life expectancy and other


numerical features? :
[41]: data.columns = data.columns.str.strip()
numeric_df = data.select_dtypes(include=["number"])
correlation_matrix = numeric_df.corr()
print(correlation_matrix["Life expectancy"].sort_values(ascending=False))

Life expectancy 1.000000


Schooling 0.751975
Income composition of resources 0.724776
BMI 0.567694
Diphtheria 0.479495
Polio 0.465556
GDP 0.461455
Alcohol 0.404877
percentage expenditure 0.381864
Hepatitis B 0.256762
Total expenditure 0.218086
Year 0.170033
Population -0.021538
Measles -0.157586
infant deaths -0.196557
under-five deaths -0.222529
thinness 5-9 years -0.471584
thinness 1-19 years -0.477183
HIV/AIDS -0.556556
Adult Mortality -0.696359
Name: Life expectancy, dtype: float64

4
5 Q5 : What are the top 10 Countries with the highest and lowest
GDP:
[30]: highest_gdp = data.groupby('Country')['GDP'].mean().
↪sort_values(ascending=False).head(10)

lowest_gdp = data.groupby('Country')['GDP'].mean().sort_values(ascending=True).
↪head(10)

print("TOP 10 Country WITH HIGHEST GDP ", highest_gdp)


print("TOP 10 Country WITH LOWEST GDP ",lowest_gdp)

TOP 10 Country WITH HIGHEST GDP Country


Switzerland 57362.874601
Luxembourg 53257.012741
Qatar 40748.444104
Netherlands 34964.719797
Australia 34637.565047
Ireland 33835.272005
Austria 33827.476309
Denmark 33067.407916
Singapore 32790.105907
Kuwait 31914.378339
Name: GDP, dtype: float64
TOP 10 Country WITH LOWEST GDP Country
Nauru 136.183210
Burundi 137.815321
Malawi 237.504042
Liberia 246.281748
Eritrea 259.395356
Niger 259.782441
Ethiopia 264.970950
Sierra Leone 271.505561
Senegal 274.611166
Guinea 279.464798
Name: GDP, dtype: float64

6 Q6 What is the trend of Life Expectancy over the years for


different regions:
[31]: data.groupby('Year')['Life expectancy'].mean().plot(kind='line',figsize=(10,5))
plt.title("What is the trend of Life Expectancy over the years for different␣
↪regions:")

[31]: Text(0.5, 1.0, 'What is the trend of Life Expectancy over the years for
different regions:')

5
7 Q7 : How does adult mortality impact expectancy across coun-
try :
[32]: data.groupby('Country')['Adult Mortality'].mean().sort_values(ascending=True).
↪plot(kind='bar',figsize=(20,10))

[32]: <AxesSubplot: xlabel='Country'>

6
8 Q8 : Is there a significant relationship between life expactancy
and gdp per capital :
[33]: data.groupby('GDP')['Life expectancy'].mean().sort_values(ascending=False).
↪head(15).plot(kind='bar',figsize=(20,5))

plt.title("Is there a significant relationship between life expactancy and gdp␣


↪per capital ")

[33]: Text(0.5, 1.0, 'Is there a significant relationship between life expactancy and
gdp per capital ')

7
9 Q9 : How does alcohol consumption relate to life expectancy :
[34]: data.groupby('Country')['Alcohol'].mean().sort_values().head(20).
↪plot(kind='bar',figsize=(20,5))

plt.title("Alcohol Consumption relate to Life Expectancy : ")

[34]: Text(0.5, 1.0, 'Alcohol Consumption relate to Life Expectancy : ')

10 Q10: What is the impact of BMI on life expectancy in different


countries.
[35]: print(data.groupby('Country')['BMI'].mean().sort_values().head(20))
data.groupby('Country')['BMI'].mean().sort_values().head(20).
↪plot(kind='bar',figsize=(20,5))

plt.xlabel("Countries")
plt.ylabel("BMI VALUE")

8
plt.title("BMI relate to Life Expectancy : ")

Country
Saint Kitts and Nevis 5.20000
Viet Nam 11.18750
Bangladesh 12.87500
Lao People's Democratic Republic 14.36250
Timor-Leste 14.55000
Rwanda 14.75000
Madagascar 14.76875
India 14.79375
Ethiopia 14.80000
Eritrea 15.15625
Nepal 15.17500
Burundi 15.31250
Cambodia 15.36250
Burkina Faso 15.50000
Afghanistan 15.51875
Uganda 15.52500
Kenya 15.56250
Democratic Republic of the Congo 15.83750
Mozambique 16.14375
Chad 16.31875
Name: BMI, dtype: float64

[35]: Text(0.5, 1.0, 'BMI relate to Life Expectancy : ')

9
11 Q11 : Does immunization coverage (Hepatitis B, Polio) affect
life expectancy :
[36]: print(data.groupby('Country')['Hepatitis B'].mean().
↪sort_values(ascending=False))

data.groupby('Country')['Hepatitis B'].mean().sort_values(ascending=False).
↪head(50).plot(kind='bar',figsize=(20,5))

plt.title("HEPATITIS B V LIFE EXPECTANCY")

Country
Palau 99.0000
Monaco 99.0000
Niue 99.0000
Fiji 98.8750
Oman 98.8125

Japan NaN
Norway NaN
Slovenia NaN
Switzerland NaN
United Kingdom of Great Britain and Northern Ireland NaN
Name: Hepatitis B, Length: 193, dtype: float64

[36]: Text(0.5, 1.0, 'HEPATITIS B V LIFE EXPECTANCY')

[37]: print(data.groupby('Country')['Polio'].mean().sort_values(ascending=False))
data.groupby('Country')['Polio'].mean().sort_values(ascending=False).head(50).
↪plot(kind='bar',figsize=(20,5))

plt.title("HEPATITIS B V LIFE EXPECTANCY")

Country
Niue 99.0000

10
Monaco 99.0000
Palau 99.0000
Hungary 98.9375
Cuba 98.6875

Nigeria 41.3125
Equatorial Guinea 36.8750
Chad 32.8750
Somalia 29.8125
Tuvalu 9.0000
Name: Polio, Length: 193, dtype: float64

[37]: Text(0.5, 1.0, 'HEPATITIS B V LIFE EXPECTANCY')

[38]: data.info

[38]: <bound method DataFrame.info of Country Year Status Life


expectancy Adult Mortality \
0 Afghanistan 2015 Developing 65.0 263.0
1 Afghanistan 2014 Developing 59.9 271.0
2 Afghanistan 2013 Developing 59.9 268.0
3 Afghanistan 2012 Developing 59.5 272.0
4 Afghanistan 2011 Developing 59.2 275.0
… … … … … …
2933 Zimbabwe 2004 Developing 44.3 723.0
2934 Zimbabwe 2003 Developing 44.5 715.0
2935 Zimbabwe 2002 Developing 44.8 73.0
2936 Zimbabwe 2001 Developing 45.3 686.0
2937 Zimbabwe 2000 Developing 46.0 665.0

11
infant deaths Alcohol percentage expenditure Hepatitis B Measles \
0 62 0.01 71.279624 65.0 1154
1 64 0.01 73.523582 62.0 492
2 66 0.01 73.219243 64.0 430
3 69 0.01 78.184215 67.0 2787
4 71 0.01 7.097109 68.0 3013
… … … … … …
2933 27 4.36 0.000000 68.0 31
2934 26 4.06 0.000000 7.0 998
2935 25 4.43 0.000000 73.0 304
2936 25 1.72 0.000000 76.0 529
2937 24 1.68 0.000000 79.0 1483

… Polio Total expenditure Diphtheria HIV/AIDS GDP \


0 … 6.0 8.16 65.0 0.1 584.259210
1 … 58.0 8.18 62.0 0.1 612.696514
2 … 62.0 8.13 64.0 0.1 631.744976
3 … 67.0 8.52 67.0 0.1 669.959000
4 … 68.0 7.87 68.0 0.1 63.537231
… … … … … … …
2933 … 67.0 7.13 65.0 33.6 454.366654
2934 … 7.0 6.52 68.0 36.7 453.351155
2935 … 73.0 6.53 71.0 39.8 57.348340
2936 … 76.0 6.16 75.0 42.1 548.587312
2937 … 78.0 7.10 78.0 43.5 547.358878

Population thinness 1-19 years thinness 5-9 years \


0 33736494.0 17.2 17.3
1 327582.0 17.5 17.5
2 31731688.0 17.7 17.7
3 3696958.0 17.9 18.0
4 2978599.0 18.2 18.2
… … … …
2933 12777511.0 9.4 9.4
2934 12633897.0 9.8 9.9
2935 125525.0 1.2 1.3
2936 12366165.0 1.6 1.7
2937 12222251.0 11.0 11.2

Income composition of resources Schooling


0 0.479 10.1
1 0.476 10.0
2 0.470 9.9
3 0.463 9.8
4 0.454 9.5
… … …
2933 0.407 9.2

12
2934 0.418 9.5
2935 0.427 10.0
2936 0.427 9.8
2937 0.434 9.8

[2938 rows x 22 columns]>

[ ]:

[ ]:

[ ]:

13
D24AIML081_PR9

April 1, 2025

[2]: import numpy as np


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st

[3]: df=pd.read_csv(r"C:\Users\User\Downloads\Housing.csv")
sns.histplot(df['price'],bins=30,kde= True,color='black')
plt.title("Housing Price Distribution")
plt.show()

1
[9]: x1=df[df['airconditioning']=="yes"]['price'].mean()
x2=df[df['airconditioning']=="no"]['price'].mean()
print("With Aircondtioning",x1)
print("without Airconditioning",x2)
s=df['price'].std()
n1=len(df[df['aircondtioning']=='yes'])
n2=len(df[df['aircondtioning']=='no'])
t = (x1-x2)/(s*np.sqrt(1/n1+1/n2))
print("T-statistics",t)
t_critical=st.t.ppf(0.025,n1+n2-2)
print("T-Criticial",t_critical)
if(t>t_critical):
print("Reject Null Hypothesis")
else:
print("fail to ")

With Aircondtioning 6013220.5813953485


without Airconditioning 4191939.678284182

---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:3805, in Index.
↪get_loc(self, key)

3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas\\_libs\\hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.


↪PyObjectHashTable.get_item()

File pandas\\_libs\\hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.


↪PyObjectHashTable.get_item()

KeyError: 'aircondtioning'

The above exception was the direct cause of the following exception:

KeyError Traceback (most recent call last)


Cell In[9], line 6
4 print("without Airconditioning",x2)
5 s=df['price'].std()
----> 6 n1=len(df[df['aircondtioning']=='yes'])
7 n2=len(df[df['aircondtioning']=='no'])
8 t = (x1-x2)/(s*np.sqrt(1/n1+1/n2))

2
File ~\anaconda3\Lib\site-packages\pandas\core\frame.py:4102, in DataFrame.
↪__getitem__(self, key)

4100 if self.columns.nlevels > 1:


4101 return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
4103 if is_integer(indexer):
4104 indexer = [indexer]

File ~\anaconda3\Lib\site-packages\pandas\core\indexes\base.py:3812, in Index.


↪get_loc(self, key)

3807 if isinstance(casted_key, slice) or (


3808 isinstance(casted_key, abc.Iterable)
3809 and any(isinstance(x, slice) for x in casted_key)
3810 ):
3811 raise InvalidIndexError(key)
-> 3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
3815 # InvalidIndexError. Otherwise we fall through and re-raise
3816 # the TypeError.
3817 self._check_indexing_error(key)

KeyError: 'aircondtioning'

[ ]:

[ ]:

3
D24AIML081_PR_10

April 1, 2025

[4]: import numpy as np


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

[5]: data=pd.read_csv(r"C:\Users\User\Downloads\global_energy_consumption.csv")
data.head()

[5]: Country Year Total Energy Consumption (TWh) Per Capita Energy Use (kWh) \
0 Canada 2018 9525.38 42301.43
1 Germany 2020 7922.08 36601.38
2 Russia 2002 6630.01 41670.20
3 Brazil 2010 8580.19 10969.58
4 Canada 2006 848.88 32190.85

Renewable Energy Share (%) Fossil Fuel Dependency (%) \


0 13.70 70.47
1 33.63 41.95
2 10.82 39.32
3 73.24 16.71
4 73.60 74.86

Industrial Energy Use (%) Household Energy Use (%) \


0 45.18 19.96
1 34.32 22.27
2 53.66 26.44
3 30.55 27.60
4 42.39 23.43

Carbon Emissions (Million Tons) Energy Price Index (USD/kWh)


0 3766.11 0.12
1 2713.12 0.08
2 885.98 0.26
3 1144.11 0.47
4 842.39 0.48

1
[15]: # Q1 : What is the average total energy consumption across all countries?
avg_con = data['Total Energy Consumption (TWh)'].mean()
print("Average of Total Energy Consumption(TWh) is ",avg_con)

Average of Total Energy Consumption(TWh) is 5142.564425

[19]: # Q2 : What is the median per capita energy use?


med_energy = data['Per Capita Energy Use (kWh)'].median()
print("Median of Per Capita Energy Use (kWh) is",med_energy)

Median of Per Capita Energy Use (kWh) is 25098.77

[21]: # Q3 : What is the correlation between fossil fuel dependency and carbon␣
↪emissions?

cor_rel = data['Fossil Fuel Dependency (%)'].corr(data['Carbon Emissions␣


↪(Million Tons)'])

print("Correlation between fossil fuel dependency and Carbon Emissions (Million␣


↪Tons) is ",cor_rel)

Correlation between fossil fuel dependency and Carbon Emissions (Million Tons)
is 0.004444006196321776

[31]: # Q4 : Which country has the highest average renewable energy share?
country = data.groupby('Country')['Renewable Energy Share (%)'].mean().idxmax()
high_avg_erg = data.groupby('Country')['Renewable Energy Share (%)'].mean().
↪max()

{
'Country': country,
'highest average renewable energy ' : high_avg_erg,
}

[31]: {'Country': 'USA', 'highest average renewable energy ': 48.19013295346629}

[37]: # Q5 What is the standard deviation of the energy price index across different␣
↪years?

data.groupby('Country')['Industrial Energy Use (%)'].mean().


↪sort_values(ascending=False).plot(kind ='line')

plt.title("Industrial Energy Use(%)")


plt.show()
data.groupby('Country')['Household Energy Use (%)'].mean().
↪sort_values(ascending=False).plot(kind ='line')

plt.title("Household Energy Use (%)")

2
3
[39]: # Q6 : How does industrial energy use compare to household energy use on␣
↪average?

[49]: # Q7 : Is there a statistically significant difference in per capita energy use␣


↪between developed and developing countries?

data['Country'].unique()
developed=['Australia','China','USA','UK','Germany','Russia']
developing=['Japan','India','Canada','Brazil']
#mean of per capita mean of every country
developed_country = data[data['Country'].isin(developed)]['Per Capita Energy␣
↪Use (kWh)']

developing_country = data[data['Country'].isin(developing)]['Per Capita Energy␣


↪Use (kWh)']

{
'Developed' : developed_country.mean(),

'Developing' : developing_country.mean()
}

[49]: {'Developed': 25151.206330625104, 'Developing': 24870.894376417233}

4
[51]: # Q8 : What is the distribution of total energy consumption? Is it normally␣
↪distributed?

[55]: # Q9 : Can we build a regression model to predict carbon emissions based on␣
↪energy consumption and fossil fuel dependency?

from sklearn.linear_model import LinearRegression


# Prepare Data For Regeression :
x=data[['Carbon Emissions (Million Tons)']] #To Predict
y=data['Fossil Fuel Dependency (%)'] #Target

#Fit the Regression Model


model = LinearRegression()
model.fit(x,y)
#Coefficient
print(f"Intercept:{model.intercept_},coffiecent:{model.coef_[0]}")

Intercept:44.772961929867485,coffiecent:6.304406118696378e-05

[57]: model.predict([[5.0]])

C:\Users\prade\anaconda3\Lib\site-packages\sklearn\base.py:493: UserWarning: X
does not have valid feature names, but LinearRegression was fitted with feature
names
warnings.warn(

[57]: array([44.77327715])

[59]: # Q10 : What is the impact of renewable energy share on energy price index?
impact= data['Renewable Energy Share (%)'].corr(data['Energy Price Index (USD/
↪kWh)'])

print(impact)

-0.0156399186330418

1 PART 2
[96]: #Q1 : What is the trend of total energy consumption over the years for␣
↪different countries?

data.groupby('Country')['Total Energy Consumption (TWh)'].sum().


↪sort_values(ascending=False).plot(kind='line')

[96]: <Axes: xlabel='Country'>

5
[80]: # Q2 : Which countries have the highest and lowest fossil fuel dependency?
country_min = data.groupby('Country')['Fossil Fuel Dependency (%)'].idxmin()
min_value = data.groupby('Country')['Fossil Fuel Dependency (%)'].min()
country_max = data.groupby('Country')['Fossil Fuel Dependency (%)'].idxmax()
max_value = data.groupby('Country')['Fossil Fuel Dependency (%)'].min()
{
'Country With Min Value' : country_min,
'Minimum Value ' : min_value,
'Country With Max Value' : country_max,
'Maximum Value ' : max_value

[80]: {'Country With Min Value': Country


Australia 5172
Brazil 7279
Canada 4039
China 5375
Germany 7192
India 3306

6
Japan 4572
Russia 6433
UK 4545
USA 3900
Name: Fossil Fuel Dependency (%), dtype: int64,
'Minimum Value ': Country
Australia 10.03
Brazil 10.03
Canada 10.02
China 10.11
Germany 10.04
India 10.04
Japan 10.07
Russia 10.05
UK 10.01
USA 10.04
Name: Fossil Fuel Dependency (%), dtype: float64,
'Country With Max Value': Country
Australia 9496
Brazil 7327
Canada 7775
China 5095
Germany 7194
India 9004
Japan 3191
Russia 8683
UK 140
USA 2502
Name: Fossil Fuel Dependency (%), dtype: int64,
'Maximum Value ': Country
Australia 10.03
Brazil 10.03
Canada 10.02
China 10.11
Germany 10.04
India 10.04
Japan 10.07
Russia 10.05
UK 10.01
USA 10.04
Name: Fossil Fuel Dependency (%), dtype: float64}

[100]: # Q3 : How has the share of renewable energy changed over time?
renewable_trend = data.groupby('Year')['Renewable Energy Share (%)'].mean()

# Plotting the trend of renewable energy share over time


plt.figure(figsize=(10, 6))

7
plt.plot(renewable_trend.index, renewable_trend.values, marker='o',␣
↪color='green')

plt.xlabel('Year')
plt.ylabel('Average Renewable Energy Share (%)')
plt.title('Trend of Renewable Energy Share Over Time')
plt.grid(True)
plt.show()

[102]: # Q4 : What are the top 5 countries with the highest carbon emissions?
data.groupby('Country')['Carbon Emissions (Million Tons)'].mean().nlargest(5)

[102]: Country
China 2596.863320
Australia 2580.429833
India 2544.816727
Brazil 2542.097661
UK 2540.094797
Name: Carbon Emissions (Million Tons), dtype: float64

[106]: # Q5 : What is the distribution of energy price index across different regions?
data.groupby('Country')['Energy Price Index (USD/kWh)'].mean().
↪plot(kind='line',figsize=(10,6))

[106]: <Axes: xlabel='Country'>

8
[116]: #Q6 : Is there a relationship between energy consumption and Energy Price␣
↪Index?

plt.scatter(data['Total Energy Consumption (TWh)'], data['Energy Price Index␣


↪(USD/kWh)'].mean(), color='magenta')

plt.xlabel('Total Energy Consumption (TWh)')


plt.ylabel('Energy Price Index (USD/kWh)')
plt.title('Energy Consumption vs Energy Price Index')
plt.grid(True)
plt.show()

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[116], line 2
1 #Q6 : Is there a relationship between energy consumption and Energy␣
↪Price Index?

----> 2 plt.scatter(data['Total Energy Consumption (TWh)'], data['Energy Price␣


↪Index (USD/kWh)'].mean(), color='magenta')

3 plt.xlabel('Total Energy Consumption (TWh)')


4 plt.ylabel('Energy Price Index (USD/kWh)')

File ~\anaconda3\Lib\site-packages\matplotlib\pyplot.py:3699, in scatter(x, y,␣


↪s, c, marker, cmap, norm, vmin, vmax, alpha, linewidths, edgecolors,␣
↪plotnonfinite, data, **kwargs)

3680 @_copy_docstring_and_deprecators(Axes.scatter)
3681 def scatter(

9
3682 x: float | ArrayLike,
(…)
3697 **kwargs,
3698 ) -> PathCollection:
-> 3699 __ret = gca().scatter(
3700 x,
3701 y,
3702 s=s,
3703 c=c,
3704 marker=marker,
3705 cmap=cmap,
3706 norm=norm,
3707 vmin=vmin,
3708 vmax=vmax,
3709 alpha=alpha,
3710 linewidths=linewidths,
3711 edgecolors=edgecolors,
3712 plotnonfinite=plotnonfinite,
3713 **({"data": data} if data is not None else {}),
3714 **kwargs,
3715 )
3716 sci(__ret)
3717 return __ret

File ~\anaconda3\Lib\site-packages\matplotlib\__init__.py:1465, in␣


↪_preprocess_data.<locals>.inner(ax, data, *args, **kwargs)

1462 @functools.wraps(func)
1463 def inner(ax, *args, data=None, **kwargs):
1464 if data is None:
-> 1465 return func(ax, *map(sanitize_sequence, args), **kwargs)
1467 bound = new_sig.bind(ax, *args, **kwargs)
1468 auto_label = (bound.arguments.get(label_namer)
1469 or bound.kwargs.get(label_namer))

File ~\anaconda3\Lib\site-packages\matplotlib\axes\_axes.py:4655, in Axes.


↪scatter(self, x, y, s, c, marker, cmap, norm, vmin, vmax, alpha, linewidths,␣
↪edgecolors, plotnonfinite, **kwargs)

4653 y = np.ma.ravel(y)
4654 if x.size != y.size:
-> 4655 raise ValueError("x and y must be the same size")
4657 if s is None:
4658 s = (20 if mpl.rcParams['_internal.classic_mode'] else
4659 mpl.rcParams['lines.markersize'] ** 2.0)

ValueError: x and y must be the same size

10
[ ]:

11

You might also like