
PMRP assignment 1

April 1, 2025

0.1 Question 1
Separate the given list based on the data types. List1 = ["Aakash", 90, 77, "B", 3.142, 12]

[51]: List1 = ["Aakash", 90, 77, "B", 3.142, 12]

string = []
inte = []
flo = []
for i in List1:
    if type(i) == str:
        string.append(i)
    elif type(i) == int:
        inte.append(i)
    else:
        flo.append(i)
print(f"strings are {string}\nintegers are {inte}\nfloats are {flo}")

strings are ['Aakash', 'B']
integers are [90, 77, 12]
floats are [3.142]

0.2 Question 2
Consider you are collecting data from students on their heights (in cms) containing numbers as
140,145,153, etc. Use Numpy library and randomly generate 50 such numbers in the range 150 to
180. Which data type would you use list or array to store such data? Calculate measures of central
tendency of this data stored in list as well as array.

[4]: import numpy as np

arr = np.random.randint(150, 180, 50)
print(arr)
mean = np.mean(arr)
median = np.median(arr)
print(f"The mean is {mean}\nThe median is {median}")

[166 157 163 171 164 154 160 177 159 155 167 159 173 171 174 155 173 172
153 163 174 166 158 169 153 166 159 161 153 175 179 179 151 167 155 179
153 163 176 173 178 177 174 166 157 161 174 154 167 162]
The mean is 165.3
The median is 166.0
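The question also asks whether a list or an array is the better container. A NumPy array is the natural choice for fixed-type numeric data, but the same measures of central tendency can be computed on a plain list with the standard-library statistics module. A minimal sketch (the heights_list/heights_arr names are illustrative):

```python
import random
import statistics

import numpy as np

# Hypothetical sample: 50 integer heights in the 150-180 cm range.
heights_list = [random.randint(150, 179) for _ in range(50)]
heights_arr = np.array(heights_list)

# List: standard-library statistics module, element by element.
print(statistics.mean(heights_list), statistics.median(heights_list))

# Array: vectorised NumPy equivalents of the same measures.
print(heights_arr.mean(), np.median(heights_arr))
```

Both give the same results; the array version stays fast as the sample grows, which is why an array is preferred here.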

0.3 Question 3
Part 1:-
Create the function that will plot simple line chart for any given data.

[33]: import matplotlib.pyplot as plt

def line_chart(x, y):
    """Plot a simple line chart for any given data."""
    plt.plot(x, y)
    plt.show()

data1 = np.random.randint(1, 50, 10)
data2 = np.random.randint(150, 200, 10)
line_chart(data1, data2)

0.4 Question 3
Part 2:-
Create the recursive function for finding out factorial of a given number

[12]: def fact(n):
    if n <= 1:
        return 1
    return n * fact(n - 1)

n = int(input())
print(fact(n))

10
3628800

0.5 Question 3
Part 3:-
Create generator function for Fibonacci series and print out first 10 numbers.

[14]: def fibonacci_generator(n):
    x, y = 0, 1
    for _ in range(n):
        yield x
        x, y = y, x + y

fib_gen = fibonacci_generator(10)
for num in fib_gen:
    print(num)

0
1
1
2
3
5
8
13
21
34

0.6 Question 3
Part 4:-
Plot the graphs for trigonometric functions sin, cos, tan, cot, sec & cosec for the values pi to 2pi.

[17]: import math


x = np.linspace(math.pi, 2 * math.pi, 10000)
y = np.sin(x)
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Sin(x) Graph")
plt.plot(x, y)
plt.show()

[19]: import math
x = np.linspace(math.pi, 2 * math.pi, 10000)
y = np.cos(x)
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Cos(x) Graph")
plt.plot(x, y)
plt.show()

[21]: import math
x = np.linspace(math.pi, 2 * math.pi, 10000)
y = np.tan(x)
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Tan(x) Graph")
plt.ylim(-10, 10)
plt.plot(x, y)
plt.show()

[23]: import math
x = np.linspace(math.pi, 2 * math.pi, 10000)
y = 1 / np.cos(x)
y[np.abs(np.cos(x)) < 1e-5] = np.nan
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Sec(x) Graph")
plt.plot(x, y, linewidth = 1.1, color = 'blue', label = "Sec(x)")
plt.legend()
plt.ylim(-10, 10)
plt.show()

[35]: import math
x = np.linspace(math.pi, 2 * math.pi, 100)
y = 1 / np.sin(x)
y[np.abs(np.sin(x)) < 1e-5] = np.nan
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Cosec(x) Graph")
plt.plot(x, y, linewidth = 1.1, color = 'blue', label = "Cosec(x)")
plt.legend()
plt.ylim(-20, 20)
plt.show()

[27]: import math
x = np.linspace(math.pi, 2 * math.pi, 10000)
y = 1 / np.tan(x)
y[np.abs(np.tan(x)) < 1e-5] = np.nan
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Cot(x) Graph")
plt.plot(x, y, linewidth = 1.1, color = 'blue', label = "Cot(x)")
plt.legend()
plt.ylim(-10, 10)
plt.show()
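The six cells above repeat the same plotting boilerplate. They could equally be drawn in one figure with a loop over a dict of the functions; masking values by magnitude handles every asymptote uniformly instead of testing each denominator separately. A sketch under that assumption:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(np.pi, 2 * np.pi, 1000)
funcs = {
    "sin": np.sin(x), "cos": np.cos(x), "tan": np.tan(x),
    "cot": 1 / np.tan(x), "sec": 1 / np.cos(x), "cosec": 1 / np.sin(x),
}

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for ax, (name, y) in zip(axes.flat, funcs.items()):
    y = np.where(np.abs(y) > 10, np.nan, y)  # hide blow-ups near asymptotes
    ax.plot(x, y)
    ax.set_title(f"{name}(x)")
    ax.grid(True)
plt.tight_layout()
plt.show()
```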

0.7 Question 4
Consider you want create dataset with ages of people in your surroundings. Use input method to
ask user their age, store those ages in appropriate data type. Apply error handling that will not
accept more than 130 or less than 0 inputs, raise appropriate prompts to guide users.

[63]: def coll():
    ages = []
    while True:
        usr = input("Enter Age of the person\nEnter q to exit: ")
        if usr == 'q':
            break
        try:
            usr = int(usr)
            if usr < 0 or usr > 130:
                print("Invalid input")
            else:
                ages.append(usr)
        except ValueError:
            print("Invalid input")
    return ages

use = coll()
print(use)

Enter Age of the person
Enter q to exit 5
Enter Age of the person
Enter q to exit q
[5]

0.8 Question 5
Create a class Employees with inputs name, department and salary. Salary should be encapsulated.

[64]: class Employees:
    def __init__(self, name, department, salary):
        self.name = name
        self.department = department
        self.__salary = salary

    def setsalary(self, salary):
        self.__salary = salary

    def getsalary(self):
        return self.__salary

    def print(self):
        print(f"The Employee name is {self.name}\nThe department is "
              f"{self.department}\nThe salary is {self.__salary}")

e1 = Employees("Yash", "AIML", 100000000)
e1.print()

The Employee name is Yash
The department is AIML
The salary is 100000000

0.9 Question 6
Create two 3D arrays as matrices. Perform matrix operations (addition, multiplication, dot product, inverse, determinant) on those matrices. Explain the identity matrix, multiply each matrix with the identity matrix and record the observation. (All operations should be done with the NumPy library.)

[65]: import numpy as np

arr1 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
arr2 = np.array([[[9, 10], [11, 12]], [[13, 14], [15, 16]]])

print("Matrix Addition:\n", arr1 + arr2)
print("\nElement-wise Multiplication:\n", arr1 * arr2)
print("\nDot Product:\n", np.matmul(arr1, arr2))
print("\nInverse of arr1:\n", np.linalg.inv(arr1))
print("\nDeterminants of arr1:\n", np.linalg.det(arr1))
print("\nIdentity Matrix:\n", np.eye(2))
print("\narr1 multiplied with Identity Matrix:\n",
      np.array([np.dot(np.eye(2), mat) for mat in arr1]))

Matrix Addition:
[[[10 12]
[14 16]]

[[18 20]
[22 24]]]

Element-wise Multiplication:
[[[ 9 20]
[ 33 48]]

[[ 65 84]
[105 128]]]

Dot Product:
[[[ 31 34]
[ 71 78]]

[[155 166]
[211 226]]]

Inverse of arr1:
[[[-2. 1. ]
[ 1.5 -0.5]]

[[-4. 3. ]
[ 3.5 -2.5]]]

Determinants of arr1:
[-2. -2.]

Identity Matrix:
[[1. 0.]
[0. 1.]]

arr1 multiplied with Identity Matrix:
[[[1. 2.]
[3. 4.]]

[[5. 6.]
[7. 8.]]]

Assignment_2_D24AIML081

April 1, 2025

1 Lab-2
[4]: import pandas as pd

drug_df = pd.read_csv("/content/drug200 - drug200.csv")
drug_df

[4]: Age Sex BP Cholesterol Na_to_K Drug


0 23 F HIGH HIGH 25.355 drugY
1 47 M LOW HIGH 13.093 drugC
2 47 M LOW HIGH 10.114 drugC
3 28 F NORMAL HIGH 7.798 drugX
4 61 F LOW HIGH 18.043 drugY
.. … .. … … … …
195 56 F LOW HIGH 11.567 drugC
196 16 M LOW HIGH 12.006 drugC
197 52 M NORMAL HIGH 9.894 drugX
198 23 M NORMAL NORMAL 14.020 drugX
199 40 F LOW NORMAL 11.349 drugX

[200 rows x 6 columns]

1. Plot Distribution curve for Age along with histogram.

[5]: import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

plt.hist(drug_df["Age"], bins=30, color='skyblue', edgecolor='black')
plt.title("Age histogram")
plt.show()

[6]: sns.displot(drug_df['Age'], color='darkgreen')
plt.title('Distribution curve for age')
plt.show()

2. Calculate Q1, Q2, Q3 and IQR without using the np.percentile function. Calculate lower and upper bound values.

[7]: v = drug_df['Age']

def quartile(s, q):
    # Linear interpolation between the two nearest order statistics
    # of the sorted values (no np.percentile).
    vals = sorted(s)
    pos = (len(vals) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(vals) - 1)
    return vals[lo] + (pos - lo) * (vals[hi] - vals[lo])

q1 = quartile(v, 0.25)
q2 = quartile(v, 0.50)
q3 = quartile(v, 0.75)

iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

print("Q1 = ", q1)
print("Q2 = ", q2)
print("Q3 = ", q3)
print("IQR = ", iqr)
print("Lower bound = ", lower)
print("Upper bound = ", upper)

plt.boxplot(v, vert=False)
plt.show()
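To sanity-check manual quartile arithmetic, it helps to cross-check a small worked example against np.percentile (used only for verification, not for the answer itself). The data values below are made up for illustration:

```python
import numpy as np

data = sorted([12, 7, 3, 9, 15, 21, 5, 18])

def quartile(sorted_vals, q):
    # Linear interpolation between the two nearest order statistics,
    # the same rule np.percentile applies by default.
    pos = (len(sorted_vals) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

q1, q2, q3 = (quartile(data, q) for q in (0.25, 0.5, 0.75))
print(q1, q2, q3, q3 - q1)  # → 6.5 10.5 15.75 9.25
```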

3. Calculate a frequency table for the age column as well. Ranges can be in multiples of 10, e.g. 10-20, 20-30, etc.

[8]: x = drug_df['Age']

for i in range(10, 80, 10):
    count = 0
    for age in x:
        if age >= i and age < i + 10:
            count += 1
    print(f'{i} - {i+10} : {count}')

10 - 20 : 12
20 - 30 : 35
30 - 40 : 37
40 - 50 : 38
50 - 60 : 33
60 - 70 : 32
70 - 80 : 13
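The nested loop above is fine for seven bins; pandas can also build the same table with pd.cut, where right=False reproduces the `age >= i and age < i + 10` half-open intervals. A self-contained sketch (synthetic ages stand in for drug_df['Age'] here):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for drug_df['Age'].
ages = pd.Series(np.random.randint(15, 75, 200))

bins = list(range(10, 90, 10))  # 10-20, 20-30, ..., 70-80
freq = pd.cut(ages, bins=bins, right=False).value_counts().sort_index()
print(freq)
```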
1. What is a Gender distribution of data?
2. What percent of total population have high cholesterol & high BP?
3. What are the unique values of Drugs given in data? (df["Drug"].unique())
4. How many people have high cholesterol before age of 30?

[9]: gen_dis = drug_df['Sex'].value_counts()


gen_dis

[9]: Sex
M 104
F 96
Name: count, dtype: int64

[10]: high_col = drug_df[drug_df['Cholesterol'] == "HIGH"]
high_bp = drug_df[drug_df['BP'] == "HIGH"]
both = drug_df[(drug_df['Cholesterol'] == "HIGH") & (drug_df['BP'] == "HIGH")]
print(high_col)
print(high_bp)
print(f"Percent with high cholesterol and high BP: {len(both) / len(drug_df) * 100:.1f}%")

Age Sex BP Cholesterol Na_to_K Drug


0 23 F HIGH HIGH 25.355 drugY
1 47 M LOW HIGH 13.093 drugC
2 47 M LOW HIGH 10.114 drugC
3 28 F NORMAL HIGH 7.798 drugX
4 61 F LOW HIGH 18.043 drugY
.. … .. … … … …
193 72 M LOW HIGH 6.769 drugC
194 46 F HIGH HIGH 34.686 drugY
195 56 F LOW HIGH 11.567 drugC
196 16 M LOW HIGH 12.006 drugC
197 52 M NORMAL HIGH 9.894 drugX

[103 rows x 6 columns]


Age Sex BP Cholesterol Na_to_K Drug
0 23 F HIGH HIGH 25.355 drugY
11 34 F HIGH NORMAL 19.199 drugY
15 16 F HIGH NORMAL 15.516 drugY
17 43 M HIGH HIGH 13.972 drugA
19 32 F HIGH NORMAL 25.974 drugY

.. … .. … … … …
188 65 M HIGH NORMAL 34.997 drugY
189 64 M HIGH NORMAL 20.932 drugY
190 58 M HIGH HIGH 18.991 drugY
191 23 M HIGH HIGH 8.011 drugA
194 46 F HIGH HIGH 34.686 drugY

[77 rows x 6 columns]

[11]: drug_df['Drug'].unique()
drug_df['Drug'].value_counts()

[11]: Drug
drugY 91
drugX 54
drugA 23
drugC 16
drugB 16
Name: count, dtype: int64

[12]: count = 0
for i in range(len(drug_df['Age'])):
    if drug_df['Cholesterol'][i] == 'HIGH' and drug_df['Age'][i] < 30:
        count += 1
print(count)

26

2 Assignment-2
[13]: df = pd.read_csv('/content/user_behavior_dataset.csv')
df

[13]: User ID Device Model Operating System App Usage Time (min/day) \
0 1 Google Pixel 5 Android 393
1 2 OnePlus 9 Android 268
2 3 Xiaomi Mi 11 Android 154
3 4 Google Pixel 5 Android 239
4 5 iPhone 12 iOS 187
.. … … … …
695 696 iPhone 12 iOS 92
696 697 Xiaomi Mi 11 Android 316
697 698 Google Pixel 5 Android 99
698 699 Samsung Galaxy S21 Android 62
699 700 OnePlus 9 Android 212

Screen On Time (hours/day) Battery Drain (mAh/day) \

0 6.4 1872
1 4.7 1331
2 4.0 761
3 4.8 1676
4 4.3 1367
.. … …
695 3.9 1082
696 6.8 1965
697 3.1 942
698 1.7 431
699 5.4 1306

Number of Apps Installed Data Usage (MB/day) Age Gender \


0 67 1122 40 Male
1 42 944 47 Female
2 32 322 42 Male
3 56 871 20 Male
4 58 988 31 Female
.. … … … …
695 26 381 22 Male
696 68 1201 59 Male
697 22 457 50 Female
698 13 224 44 Male
699 49 828 23 Female

User Behavior Class


0 4
1 3
2 2
3 3
4 3
.. …
695 2
696 4
697 2
698 1
699 3

[700 rows x 11 columns]

1. Find out the outliers in each numerical column

[20]: v = df['App Usage Time (min/day)']

q1 = v.quantile(0.25)
q2 = v.quantile(0.50)
q3 = v.quantile(0.75)

iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = v[(v < lower) | (v > upper)]

print("Q1 = ", q1)
print("Q2 = ", q2)
print("Q3 = ", q3)
print("IQR = ", iqr)
print("Lower bound = ", lower)
print("Upper bound = ", upper)
print("Outliers = ", outliers.tolist())
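The cell above handles one column; the question asks for every numerical column. One way to generalise, sketched on a tiny made-up frame (iqr_outliers and demo are illustrative names, not part of the original):

```python
import pandas as pd

def iqr_outliers(frame):
    """Return {column: list of outlier values} for each numeric column."""
    report = {}
    for col in frame.select_dtypes(include="number"):
        s = frame[col].dropna()
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
        report[col] = s[mask].tolist()
    return report

demo = pd.DataFrame({"a": [1, 2, 3, 4, 100],
                     "b": [10.0, 11.0, 9.0, 10.5, 10.2]})
print(iqr_outliers(demo))  # → {'a': [100], 'b': [9.0]}
```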
2. Find out gender distribution in this data.

[23]: gender_distribution = df['Gender'].value_counts(normalize=True) * 100

print(gender_distribution)

Gender
Male 52.0
Female 48.0
Name: proportion, dtype: float64
3. What is the average daily usage of data? Explore gender-wise and device-wise variation in average data usage.

[29]: average_usage = df['Data Usage (MB/day)'].mean()
print(f"Overall average daily data usage: {average_usage:.2f} MB")

gender_usage = df.groupby('Gender')['Data Usage (MB/day)'].mean()
print("\nAverage daily data usage by gender:")
print(gender_usage)

device_usage = df.groupby('Device Model')['Data Usage (MB/day)'].mean()
print("\nAverage daily data usage by device:")
print(device_usage)

gender_device_usage = df.groupby(['Gender', 'Device Model'])['Data Usage (MB/day)'].mean()
print("\nAverage daily data usage by gender and device:")
print(gender_device_usage)

Overall average daily data usage: 929.74 MB

Average daily data usage by gender:


Gender
Female 914.321429
Male 943.978022
Name: Data Usage (MB/day), dtype: float64

Average daily data usage by device:


Device Model
Google Pixel 5 897.704225
OnePlus 9 911.120301
Samsung Galaxy S21 931.872180
Xiaomi Mi 11 940.164384
iPhone 12 965.506849
Name: Data Usage (MB/day), dtype: float64

Average daily data usage by gender and device:


Gender Device Model
Female Google Pixel 5 834.101449
OnePlus 9 862.377049
Samsung Galaxy S21 992.888889
Xiaomi Mi 11 917.858974
iPhone 12 970.878378
Male Google Pixel 5 957.821918
OnePlus 9 952.416667
Samsung Galaxy S21 890.164557
Xiaomi Mi 11 965.750000
iPhone 12 959.986111
Name: Data Usage (MB/day), dtype: float64
4. Which device has the highest popularity based on Age and Gender?

[30]: popularity = df.groupby(['Age', 'Gender'])['Device Model'].agg(lambda x: x.mode()[0])

print(popularity)

Age Gender
18 Male Samsung Galaxy S21
19 Female iPhone 12
Male OnePlus 9
20 Female Google Pixel 5
Male Google Pixel 5

57 Male Samsung Galaxy S21

58 Female iPhone 12
Male Samsung Galaxy S21
59 Female iPhone 12
Male Samsung Galaxy S21
Name: Device Model, Length: 83, dtype: object

D24AIML081_ASS_3_PRMP

April 1, 2025

[34]: #Assignment 3
import numpy as np
import pandas as pd

df = pd.read_csv("C:/Users/User/Downloads/matches.csv")
df

[34]: id season city date match_type player_of_match \


0 335982 2007/08 Bangalore 2008-04-18 League BB McCullum
1 335983 2007/08 Chandigarh 2008-04-19 League MEK Hussey
2 335984 2007/08 Delhi 2008-04-19 League MF Maharoof
3 335985 2007/08 Mumbai 2008-04-20 League MV Boucher
4 335986 2007/08 Kolkata 2008-04-20 League DJ Hussey
… … … … … … …
1090 1426307 2024 Hyderabad 2024-05-19 League Abhishek Sharma
1091 1426309 2024 Ahmedabad 2024-05-21 Qualifier 1 MA Starc
1092 1426310 2024 Ahmedabad 2024-05-22 Eliminator R Ashwin
1093 1426311 2024 Chennai 2024-05-24 Qualifier 2 Shahbaz Ahmed
1094 1426312 2024 Chennai 2024-05-26 Final MA Starc

venue \
0 M Chinnaswamy Stadium
1 Punjab Cricket Association Stadium, Mohali
2 Feroz Shah Kotla
3 Wankhede Stadium
4 Eden Gardens
… …
1090 Rajiv Gandhi International Stadium, Uppal, Hyd…
1091 Narendra Modi Stadium, Ahmedabad
1092 Narendra Modi Stadium, Ahmedabad
1093 MA Chidambaram Stadium, Chepauk, Chennai
1094 MA Chidambaram Stadium, Chepauk, Chennai

team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
1 Kings XI Punjab Chennai Super Kings
2 Delhi Daredevils Rajasthan Royals
3 Mumbai Indians Royal Challengers Bangalore

4 Kolkata Knight Riders Deccan Chargers
… … …
1090 Punjab Kings Sunrisers Hyderabad
1091 Sunrisers Hyderabad Kolkata Knight Riders
1092 Royal Challengers Bengaluru Rajasthan Royals
1093 Sunrisers Hyderabad Rajasthan Royals
1094 Sunrisers Hyderabad Kolkata Knight Riders

toss_winner toss_decision winner \


0 Royal Challengers Bangalore field Kolkata Knight Riders
1 Chennai Super Kings bat Chennai Super Kings
2 Rajasthan Royals bat Delhi Daredevils
3 Mumbai Indians bat Royal Challengers Bangalore
4 Deccan Chargers bat Kolkata Knight Riders
… … … …
1090 Punjab Kings bat Sunrisers Hyderabad
1091 Sunrisers Hyderabad bat Kolkata Knight Riders
1092 Rajasthan Royals field Rajasthan Royals
1093 Rajasthan Royals field Sunrisers Hyderabad
1094 Sunrisers Hyderabad bat Kolkata Knight Riders

result result_margin target_runs target_overs super_over method \


0 runs 140.0 223.0 20.0 N NaN
1 runs 33.0 241.0 20.0 N NaN
2 wickets 9.0 130.0 20.0 N NaN
3 wickets 5.0 166.0 20.0 N NaN
4 wickets 5.0 111.0 20.0 N NaN
… … … … … … …
1090 wickets 4.0 215.0 20.0 N NaN
1091 wickets 8.0 160.0 20.0 N NaN
1092 wickets 4.0 173.0 20.0 N NaN
1093 runs 36.0 176.0 20.0 N NaN
1094 wickets 8.0 114.0 20.0 N NaN

umpire1 umpire2
0 Asad Rauf RE Koertzen
1 MR Benson SL Shastri
2 Aleem Dar GA Pratapkumar
3 SJ Davis DJ Harper
4 BF Bowden K Hariharan
… … …
1090 Nitin Menon VK Sharma
1091 AK Chaudhary R Pandit
1092 KN Ananthapadmanabhan MV Saidharshan Kumar
1093 Nitin Menon VK Sharma
1094 J Madanagopal Nitin Menon

[1095 rows x 20 columns]

[36]: #1) Find out count of unique records in each column.

unique = df.nunique()
print(unique)

id 1095
season 17
city 36
date 823
match_type 8
player_of_match 291
venue 58
team1 19
team2 19
toss_winner 19
toss_decision 2
winner 19
result 4
result_margin 98
target_runs 170
target_overs 15
super_over 2
method 1
umpire1 62
umpire2 62
dtype: int64

[38]: #2) Find if any outliers in data.

num = df.select_dtypes(include=['float64', 'int64'])
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1

outliers = (num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))

outlier_counts = outliers.sum()
print(outlier_counts)

id 0
result_margin 121
target_runs 30
target_overs 30
dtype: int64

[40]: #3) Plot heatmap of correlation matrix and covariance matrix for the given dataset.

import seaborn as sns
import matplotlib.pyplot as plt

numeric_df = df.select_dtypes(include=['float64', 'int64'])

correlation_matrix = numeric_df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

covariance_matrix = numeric_df.cov()
plt.figure(figsize=(10, 8))
sns.heatmap(covariance_matrix, annot=True, fmt=".2f")
plt.title('Covariance Matrix')
plt.show()

[42]: #4) Remove unnecessary or empty columns as well as any rows if required from the dataset.

df_cleaned = df.dropna(thresh=len(df) * 0.5, axis=1)

df_cleaned = df_cleaned.dropna()
df_cleaned

[42]: id season city date match_type player_of_match \


0 335982 2007/08 Bangalore 2008-04-18 League BB McCullum
1 335983 2007/08 Chandigarh 2008-04-19 League MEK Hussey
2 335984 2007/08 Delhi 2008-04-19 League MF Maharoof
3 335985 2007/08 Mumbai 2008-04-20 League MV Boucher
4 335986 2007/08 Kolkata 2008-04-20 League DJ Hussey
… … … … … … …
1090 1426307 2024 Hyderabad 2024-05-19 League Abhishek Sharma
1091 1426309 2024 Ahmedabad 2024-05-21 Qualifier 1 MA Starc

1092 1426310 2024 Ahmedabad 2024-05-22 Eliminator R Ashwin
1093 1426311 2024 Chennai 2024-05-24 Qualifier 2 Shahbaz Ahmed
1094 1426312 2024 Chennai 2024-05-26 Final MA Starc

venue \
0 M Chinnaswamy Stadium
1 Punjab Cricket Association Stadium, Mohali
2 Feroz Shah Kotla
3 Wankhede Stadium
4 Eden Gardens
… …
1090 Rajiv Gandhi International Stadium, Uppal, Hyd…
1091 Narendra Modi Stadium, Ahmedabad
1092 Narendra Modi Stadium, Ahmedabad
1093 MA Chidambaram Stadium, Chepauk, Chennai
1094 MA Chidambaram Stadium, Chepauk, Chennai

team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
1 Kings XI Punjab Chennai Super Kings
2 Delhi Daredevils Rajasthan Royals
3 Mumbai Indians Royal Challengers Bangalore
4 Kolkata Knight Riders Deccan Chargers
… … …
1090 Punjab Kings Sunrisers Hyderabad
1091 Sunrisers Hyderabad Kolkata Knight Riders
1092 Royal Challengers Bengaluru Rajasthan Royals
1093 Sunrisers Hyderabad Rajasthan Royals
1094 Sunrisers Hyderabad Kolkata Knight Riders

toss_winner toss_decision winner \


0 Royal Challengers Bangalore field Kolkata Knight Riders
1 Chennai Super Kings bat Chennai Super Kings
2 Rajasthan Royals bat Delhi Daredevils
3 Mumbai Indians bat Royal Challengers Bangalore
4 Deccan Chargers bat Kolkata Knight Riders
… … … …
1090 Punjab Kings bat Sunrisers Hyderabad
1091 Sunrisers Hyderabad bat Kolkata Knight Riders
1092 Rajasthan Royals field Rajasthan Royals
1093 Rajasthan Royals field Sunrisers Hyderabad
1094 Sunrisers Hyderabad bat Kolkata Knight Riders

result result_margin target_runs target_overs super_over \


0 runs 140.0 223.0 20.0 N
1 runs 33.0 241.0 20.0 N
2 wickets 9.0 130.0 20.0 N

3 wickets 5.0 166.0 20.0 N
4 wickets 5.0 111.0 20.0 N
… … … … … …
1090 wickets 4.0 215.0 20.0 N
1091 wickets 8.0 160.0 20.0 N
1092 wickets 4.0 173.0 20.0 N
1093 runs 36.0 176.0 20.0 N
1094 wickets 8.0 114.0 20.0 N

umpire1 umpire2
0 Asad Rauf RE Koertzen
1 MR Benson SL Shastri
2 Aleem Dar GA Pratapkumar
3 SJ Davis DJ Harper
4 BF Bowden K Hariharan
… … …
1090 Nitin Menon VK Sharma
1091 AK Chaudhary R Pandit
1092 KN Ananthapadmanabhan MV Saidharshan Kumar
1093 Nitin Menon VK Sharma
1094 J Madanagopal Nitin Menon

[1028 rows x 19 columns]

[44]: #5) Plot histograms for each column and remove any skewness using transformations.

df_cleaned.hist(figsize=(15, 10), bins=30)
plt.show()

for column in df_cleaned.select_dtypes(include=['float64', 'int64']).columns:
    if df_cleaned[column].skew() > 1:
        df_cleaned[column] = np.log1p(df_cleaned[column])

[46]: #6) Plot Yearly records for numerical columns (e.g. runs, trophies)
df_cleaned['year'] = df_cleaned['season'].str.split('/').str[0].astype(int)

yearly_records = df_cleaned.groupby('year').sum(numeric_only=True)

yearly_records.plot(kind='bar', figsize=(12, 6))
plt.title('Yearly Records for Numerical Columns')
plt.xlabel('Year')
plt.ylabel('Total')
plt.show()

ASSIGNMENT_4_D24AIML081

April 1, 2025

[20]: import pandas as pd

url = 'C:/Users/User/Downloads/creditcard.csv/creditcard.csv'
data = pd.read_csv(url)

data_cleaned = data.drop(columns=[f'V{i}' for i in range(1, 9)])

threshold = 100

total_transactions = len(data_cleaned)
total_fraudulent = data_cleaned[data_cleaned['Class'] == 1].shape[0]
total_high_amount = data_cleaned[data_cleaned['Amount'] > threshold].shape[0]
total_high_amount_fraudulent = data_cleaned[(data_cleaned['Amount'] > threshold) & (data_cleaned['Class'] == 1)].shape[0]

P_fraudulent = total_fraudulent / total_transactions
P_high_amount = total_high_amount / total_transactions
P_high_amount_given_fraudulent = total_high_amount_fraudulent / total_fraudulent if total_fraudulent > 0 else 0

if P_high_amount > 0:
    P_fraudulent_given_high_amount = (P_high_amount_given_fraudulent * P_fraudulent) / P_high_amount
else:
    P_fraudulent_given_high_amount = 0

print(f"P(Fraudulent | High Amount) = {P_fraudulent_given_high_amount:.4f}")

P(Fraudulent | High Amount) = 0.0023
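Because every probability here comes from the same set of counts, the Bayes computation must agree with the directly counted conditional probability P(Class = 1 | Amount > threshold); the two differ only by algebraic rearrangement. A self-contained check on synthetic data (the toy frame below is made up, mimicking the Amount/Class columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Amount": rng.exponential(80, 1000),   # synthetic transaction amounts
    "Class": rng.integers(0, 2, 1000),     # synthetic fraud labels
})
high = toy["Amount"] > 100

p_fraud = (toy["Class"] == 1).mean()
p_high = high.mean()
p_high_given_fraud = high[toy["Class"] == 1].mean()

bayes = p_high_given_fraud * p_fraud / p_high  # Bayes' rule
direct = (toy.loc[high, "Class"] == 1).mean()  # direct conditional count
print(bayes, direct)
```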

D24AIMl081_PR5

April 1, 2025

D24AIML081 DAHIYA MANDEEPSINH PMRP ASSIGNMENT 5 + CLASSWORK


IPL DATA ANALYTICS
1. Calculate the total number of matches played in each season
2. Find the most successful team (team with the most wins)
3. Find the average margin of victory by wickets and by runs
4. Which player won the most 'Player of the Match' awards?
5. Find the number of matches where the toss winner won the match
6. Calculate the total number of runs scored in all matches for each team
7. Determine the average number of wickets taken by the winning team in each match
8. How many matches were decided by a Super Over?
9. Find the distribution of match results (runs vs wickets)
10. Find the top 5 venues with the most matches played
11. Find the match with the highest margin of victory (by wickets or runs)
12. Calculate the win percentage for each team
13. Find the average number of overs played in all matches
14. Find the most common match outcome (runs, wickets, or no result)
15. Find the total number of matches played at each venue by year
16. Analyze the win margin distribution by year
17. Calculate the total number of ‘no result’ matches and their impact on the tournament
18. How many matches were won by teams batting first vs. batting second?
19. Find out the average number of runs scored by the winning team
20. Identify the most successful captain (team with the most wins under a captain)
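Several of these tasks reduce to one-line pandas expressions over the columns shown below ('season', 'winner', 'toss_winner', 'result', 'result_margin'). A hedged sketch of tasks 1, 2, 3 and 5, demonstrated on a few made-up rows in the same schema (summarize and demo are illustrative names):

```python
import pandas as pd

def summarize(df):
    return {
        "matches_per_season": df.groupby("season").size(),                     # task 1
        "most_wins": df["winner"].value_counts().idxmax(),                     # task 2
        "avg_margin_by_result": df.groupby("result")["result_margin"].mean(),  # task 3
        "toss_winner_won": int((df["toss_winner"] == df["winner"]).sum()),     # task 5
    }

demo = pd.DataFrame({
    "season": ["2008", "2008", "2009"],
    "winner": ["KKR", "CSK", "KKR"],
    "toss_winner": ["RCB", "CSK", "KKR"],
    "result": ["runs", "runs", "wickets"],
    "result_margin": [140.0, 33.0, 8.0],
})
print(summarize(demo))
```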

[77]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('D:/SEM4/PMRP/RAW_CODE/PMRP_DAY_13/matches.csv')
df.head(), df.tail(), df.describe(), df.info()

[77]: (<bound method NDFrame.head of id season city date


match_type player_of_match \
0 335982 2007/08 Bangalore 2008-04-18 League BB McCullum
1 335983 2007/08 Chandigarh 2008-04-19 League MEK Hussey
2 335984 2007/08 Delhi 2008-04-19 League MF Maharoof
3 335985 2007/08 Mumbai 2008-04-20 League MV Boucher
4 335986 2007/08 Kolkata 2008-04-20 League DJ Hussey
… … … … … … …
1090 1426307 2024 Hyderabad 2024-05-19 League Abhishek Sharma
1091 1426309 2024 Ahmedabad 2024-05-21 Qualifier 1 MA Starc
1092 1426310 2024 Ahmedabad 2024-05-22 Eliminator R Ashwin
1093 1426311 2024 Chennai 2024-05-24 Qualifier 2 Shahbaz Ahmed
1094 1426312 2024 Chennai 2024-05-26 Final MA Starc

venue \
0 M Chinnaswamy Stadium
1 Punjab Cricket Association Stadium, Mohali
2 Feroz Shah Kotla
3 Wankhede Stadium
4 Eden Gardens
… …
1090 Rajiv Gandhi International Stadium, Uppal, Hyd…
1091 Narendra Modi Stadium, Ahmedabad
1092 Narendra Modi Stadium, Ahmedabad
1093 MA Chidambaram Stadium, Chepauk, Chennai
1094 MA Chidambaram Stadium, Chepauk, Chennai

team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
1 Kings XI Punjab Chennai Super Kings
2 Delhi Daredevils Rajasthan Royals
3 Mumbai Indians Royal Challengers Bangalore
4 Kolkata Knight Riders Deccan Chargers
… … …
1090 Punjab Kings Sunrisers Hyderabad
1091 Sunrisers Hyderabad Kolkata Knight Riders
1092 Royal Challengers Bengaluru Rajasthan Royals
1093 Sunrisers Hyderabad Rajasthan Royals
1094 Sunrisers Hyderabad Kolkata Knight Riders

toss_winner toss_decision winner \


0 Royal Challengers Bangalore field Kolkata Knight Riders
1 Chennai Super Kings bat Chennai Super Kings

2 Rajasthan Royals bat Delhi Daredevils
3 Mumbai Indians bat Royal Challengers Bangalore
4 Deccan Chargers bat Kolkata Knight Riders
… … … …
1090 Punjab Kings bat Sunrisers Hyderabad
1091 Sunrisers Hyderabad bat Kolkata Knight Riders
1092 Rajasthan Royals field Rajasthan Royals
1093 Rajasthan Royals field Sunrisers Hyderabad
1094 Sunrisers Hyderabad bat Kolkata Knight Riders

result result_margin target_runs target_overs super_over method \


0 runs 140.0 223.0 20.0 N NaN
1 runs 33.0 241.0 20.0 N NaN
2 wickets 9.0 130.0 20.0 N NaN
3 wickets 5.0 166.0 20.0 N NaN
4 wickets 5.0 111.0 20.0 N NaN
… … … … … … …
1090 wickets 4.0 215.0 20.0 N NaN
1091 wickets 8.0 160.0 20.0 N NaN
1092 wickets 4.0 173.0 20.0 N NaN
1093 runs 36.0 176.0 20.0 N NaN
1094 wickets 8.0 114.0 20.0 N NaN

umpire1 umpire2
0 Asad Rauf RE Koertzen
1 MR Benson SL Shastri
2 Aleem Dar GA Pratapkumar
3 SJ Davis DJ Harper
4 BF Bowden K Hariharan
… … …
1090 Nitin Menon VK Sharma
1091 AK Chaudhary R Pandit
1092 KN Ananthapadmanabhan MV Saidharshan Kumar
1093 Nitin Menon VK Sharma
1094 J Madanagopal Nitin Menon

[1095 rows x 20 columns]>,


<bound method NDFrame.tail of id season city date
match_type player_of_match \
0 335982 2007/08 Bangalore 2008-04-18 League BB McCullum
1 335983 2007/08 Chandigarh 2008-04-19 League MEK Hussey
2 335984 2007/08 Delhi 2008-04-19 League MF Maharoof
3 335985 2007/08 Mumbai 2008-04-20 League MV Boucher
4 335986 2007/08 Kolkata 2008-04-20 League DJ Hussey
… … … … … … …
1090 1426307 2024 Hyderabad 2024-05-19 League Abhishek Sharma
1091 1426309 2024 Ahmedabad 2024-05-21 Qualifier 1 MA Starc

3
1092 1426310 2024 Ahmedabad 2024-05-22 Eliminator R Ashwin
1093 1426311 2024 Chennai 2024-05-24 Qualifier 2 Shahbaz Ahmed
1094 1426312 2024 Chennai 2024-05-26 Final MA Starc

venue \
0 M Chinnaswamy Stadium
1 Punjab Cricket Association Stadium, Mohali
2 Feroz Shah Kotla
3 Wankhede Stadium
4 Eden Gardens
… …
1090 Rajiv Gandhi International Stadium, Uppal, Hyd…
1091 Narendra Modi Stadium, Ahmedabad
1092 Narendra Modi Stadium, Ahmedabad
1093 MA Chidambaram Stadium, Chepauk, Chennai
1094 MA Chidambaram Stadium, Chepauk, Chennai

team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
1 Kings XI Punjab Chennai Super Kings
2 Delhi Daredevils Rajasthan Royals
3 Mumbai Indians Royal Challengers Bangalore
4 Kolkata Knight Riders Deccan Chargers
… … …
1090 Punjab Kings Sunrisers Hyderabad
1091 Sunrisers Hyderabad Kolkata Knight Riders
1092 Royal Challengers Bengaluru Rajasthan Royals
1093 Sunrisers Hyderabad Rajasthan Royals
1094 Sunrisers Hyderabad Kolkata Knight Riders

toss_winner toss_decision winner \


0 Royal Challengers Bangalore field Kolkata Knight Riders
1 Chennai Super Kings bat Chennai Super Kings
2 Rajasthan Royals bat Delhi Daredevils
3 Mumbai Indians bat Royal Challengers Bangalore
4 Deccan Chargers bat Kolkata Knight Riders
… … … …
1090 Punjab Kings bat Sunrisers Hyderabad
1091 Sunrisers Hyderabad bat Kolkata Knight Riders
1092 Rajasthan Royals field Rajasthan Royals
1093 Rajasthan Royals field Sunrisers Hyderabad
1094 Sunrisers Hyderabad bat Kolkata Knight Riders

result result_margin target_runs target_overs super_over method \


0 runs 140.0 223.0 20.0 N NaN
1 runs 33.0 241.0 20.0 N NaN
2 wickets 9.0 130.0 20.0 N NaN

4
3 wickets 5.0 166.0 20.0 N NaN
4 wickets 5.0 111.0 20.0 N NaN
… … … … … … …
1090 wickets 4.0 215.0 20.0 N NaN
1091 wickets 8.0 160.0 20.0 N NaN
1092 wickets 4.0 173.0 20.0 N NaN
1093 runs 36.0 176.0 20.0 N NaN
1094 wickets 8.0 114.0 20.0 N NaN

umpire1 umpire2
0 Asad Rauf RE Koertzen
1 MR Benson SL Shastri
2 Aleem Dar GA Pratapkumar
3 SJ Davis DJ Harper
4 BF Bowden K Hariharan
… … …
1090 Nitin Menon VK Sharma
1091 AK Chaudhary R Pandit
1092 KN Ananthapadmanabhan MV Saidharshan Kumar
1093 Nitin Menon VK Sharma
1094 J Madanagopal Nitin Menon

[1095 rows x 20 columns]>,


<bound method NDFrame.describe of id season city
date match_type player_of_match \
0 335982 2007/08 Bangalore 2008-04-18 League BB McCullum
1 335983 2007/08 Chandigarh 2008-04-19 League MEK Hussey
2 335984 2007/08 Delhi 2008-04-19 League MF Maharoof
3 335985 2007/08 Mumbai 2008-04-20 League MV Boucher
4 335986 2007/08 Kolkata 2008-04-20 League DJ Hussey
… … … … … … …
1090 1426307 2024 Hyderabad 2024-05-19 League Abhishek Sharma
1091 1426309 2024 Ahmedabad 2024-05-21 Qualifier 1 MA Starc
1092 1426310 2024 Ahmedabad 2024-05-22 Eliminator R Ashwin
1093 1426311 2024 Chennai 2024-05-24 Qualifier 2 Shahbaz Ahmed
1094 1426312 2024 Chennai 2024-05-26 Final MA Starc

venue \
0 M Chinnaswamy Stadium
1 Punjab Cricket Association Stadium, Mohali
2 Feroz Shah Kotla
3 Wankhede Stadium
4 Eden Gardens
… …
1090 Rajiv Gandhi International Stadium, Uppal, Hyd…
1091 Narendra Modi Stadium, Ahmedabad
1092 Narendra Modi Stadium, Ahmedabad

5
1093 MA Chidambaram Stadium, Chepauk, Chennai
1094 MA Chidambaram Stadium, Chepauk, Chennai

team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
1 Kings XI Punjab Chennai Super Kings
2 Delhi Daredevils Rajasthan Royals
3 Mumbai Indians Royal Challengers Bangalore
4 Kolkata Knight Riders Deccan Chargers
… … …
1090 Punjab Kings Sunrisers Hyderabad
1091 Sunrisers Hyderabad Kolkata Knight Riders
1092 Royal Challengers Bengaluru Rajasthan Royals
1093 Sunrisers Hyderabad Rajasthan Royals
1094 Sunrisers Hyderabad Kolkata Knight Riders

toss_winner toss_decision winner \


0 Royal Challengers Bangalore field Kolkata Knight Riders
1 Chennai Super Kings bat Chennai Super Kings
2 Rajasthan Royals bat Delhi Daredevils
3 Mumbai Indians bat Royal Challengers Bangalore
4 Deccan Chargers bat Kolkata Knight Riders
… … … …
1090 Punjab Kings bat Sunrisers Hyderabad
1091 Sunrisers Hyderabad bat Kolkata Knight Riders
1092 Rajasthan Royals field Rajasthan Royals
1093 Rajasthan Royals field Sunrisers Hyderabad
1094 Sunrisers Hyderabad bat Kolkata Knight Riders

result result_margin target_runs target_overs super_over method \


0 runs 140.0 223.0 20.0 N NaN
1 runs 33.0 241.0 20.0 N NaN
2 wickets 9.0 130.0 20.0 N NaN
3 wickets 5.0 166.0 20.0 N NaN
4 wickets 5.0 111.0 20.0 N NaN
… … … … … … …
1090 wickets 4.0 215.0 20.0 N NaN
1091 wickets 8.0 160.0 20.0 N NaN
1092 wickets 4.0 173.0 20.0 N NaN
1093 runs 36.0 176.0 20.0 N NaN
1094 wickets 8.0 114.0 20.0 N NaN

umpire1 umpire2
0 Asad Rauf RE Koertzen
1 MR Benson SL Shastri
2 Aleem Dar GA Pratapkumar
3 SJ Davis DJ Harper
4 BF Bowden K Hariharan
… … …
1090 Nitin Menon VK Sharma
1091 AK Chaudhary R Pandit
1092 KN Ananthapadmanabhan MV Saidharshan Kumar
1093 Nitin Menon VK Sharma
1094 J Madanagopal Nitin Menon

[1095 rows x 20 columns]>)

1. Calculate the total number of Matches Played in Each Season

[78]: matches_per_season = df['season'].value_counts().sort_index()
print(matches_per_season)

season
2007/08 58
2009 57
2009/10 60
2011 73
2012 74
2013 76
2014 60
2015 59
2016 60
2017 59
2018 60
2019 60
2020/21 60
2021 60
2022 74
2023 74
2024 71
Name: count, dtype: int64
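The season labels above mix two formats ('2007/08' vs. '2011'), which makes chronological sorting fragile. A minimal sketch (assuming every label starts with the four-digit opening year) normalizes them to integers:

```python
import pandas as pd

# Toy sample of the mixed 'season' labels seen above.
seasons = pd.Series(["2007/08", "2009", "2009/10", "2011"])

# Keep only the four-digit starting year so seasons sort numerically.
start_year = seasons.str.slice(0, 4).astype(int)
print(start_year.tolist())  # [2007, 2009, 2009, 2011]
```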
2. Find the Most Successful Team (team with the most wins)

[79]: # runs_df = df[df['result'] == 'runs']
# most_successful_team = runs_df.groupby('winner')['result_margin'].sum().idxmax()
# print(f"The most successful team (team with most runs) is: {most_successful_team}")

df["winner"].value_counts()

[79]: winner
Mumbai Indians 144
Chennai Super Kings 138
Kolkata Knight Riders 131
Royal Challengers Bangalore 116
Rajasthan Royals 112
Kings XI Punjab 88
Sunrisers Hyderabad 88
Delhi Daredevils 67
Delhi Capitals 48
Deccan Chargers 29
Gujarat Titans 28
Lucknow Super Giants 24
Punjab Kings 24
Gujarat Lions 13
Pune Warriors 12
Rising Pune Supergiant 10
Royal Challengers Bengaluru 7
Kochi Tuskers Kerala 6
Rising Pune Supergiants 5
Name: count, dtype: int64

3. Find the average margin of victory by wickets and runs

[80]: average_runs_margin = df[df['result'] == 'runs']['result_margin'].mean()
average_wickets_margin = df[df['result'] == 'wickets']['result_margin'].mean()
print(f'Average margin of victory by runs: {average_runs_margin}')
print(f'Average margin of victory by wickets: {average_wickets_margin}')

Average margin of victory by runs: 30.104417670682732
Average margin of victory by wickets: 6.192041522491349

4. Which player won the most 'Player of the Match' awards?

[81]: most_player_of_match = df['player_of_match'].value_counts().idxmax()
print(f"The player who won the most 'Player of the Match' awards is: {most_player_of_match}")

The player who won the most 'Player of the Match' awards is: AB de Villiers
5. Find the number of matches where the toss winner won the match

[82]: toss_winner_matches = df[df['toss_winner'] == df['winner']].shape[0]
print(f"The number of matches where the toss winner won the match: {toss_winner_matches}")

The number of matches where the toss winner won the match: 554
6. Calculate the total number of runs scored in all matches for each team

[83]: total_runs_per_team = df.groupby('team1')['target_runs'].sum() + df.groupby('team2')['target_runs'].sum()
print(total_runs_per_team)

team1
Chennai Super Kings 39503.0
Deccan Chargers 12047.0
Delhi Capitals 15930.0
Delhi Daredevils 25492.0
Gujarat Lions 5077.0
Gujarat Titans 7865.0
Kings XI Punjab 31391.0
Kochi Tuskers Kerala 2014.0
Kolkata Knight Riders 40557.0
Lucknow Super Giants 7835.0
Mumbai Indians 43728.0
Pune Warriors 6950.0
Punjab Kings 9787.0
Rajasthan Royals 36250.0
Rising Pune Supergiant 2571.0
Rising Pune Supergiants 1993.0
Royal Challengers Bangalore 39807.0
Royal Challengers Bengaluru 2986.0
Sunrisers Hyderabad 30071.0
Name: target_runs, dtype: float64
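One caveat with the cell above: adding two groupby sums with `+` aligns on team names and yields NaN for any team that only ever appears in one of the two columns. A safer variant (toy values, not real IPL totals) uses `Series.add` with `fill_value=0`:

```python
import pandas as pd

# Tiny stand-in for the matches table (toy data, not real IPL scores).
toy = pd.DataFrame({
    "team1": ["A", "A", "B"],
    "team2": ["B", "C", "A"],
    "target_runs": [150, 160, 170],
})

# Plain '+' would give NaN for team C, which never appears in team2;
# add(..., fill_value=0) treats the missing side as zero instead.
total = (toy.groupby("team1")["target_runs"].sum()
            .add(toy.groupby("team2")["target_runs"].sum(), fill_value=0))
print(total.sort_index())
```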
7. Determine the average number of wickets taken by the winning team in each match

[84]: average_wickets_taken = df[df['result'] == 'wickets']['result_margin'].mean()
print(f'The average number of wickets taken by the winning team in each match is: {average_wickets_taken}')

The average number of wickets taken by the winning team in each match is:
6.192041522491349
8. How many matches were decided by a Super Over?

[85]: super_over_matches = df[df['super_over'] == 'Y'].shape[0]
print(f"The number of matches decided by a Super Over: {super_over_matches}")

The number of matches decided by a Super Over: 14

9. Find the distribution of match results (runs vs wickets)

[86]: result_distribution = df['result'].value_counts()
print(result_distribution)

# Plotting the distribution
result_distribution.plot(kind='bar', color=['blue', 'orange'])
plt.title('Distribution of Match Results (Runs vs Wickets)')
plt.xlabel('Result Type')
plt.ylabel('Number of Matches')
plt.show()

result
wickets 578
runs 498
tie 14
no result 5
Name: count, dtype: int64

10. Find the top 5 venues with the most matches played

[87]: top_venues = df['venue'].value_counts().head(5)
print(top_venues)

# Plotting the top 5 venues
# top_venues.plot(kind='bar', color='green')
# plt.title('Top 5 Venues with the Most Matches Played')
# plt.xlabel('Venue')
# plt.ylabel('Number of Matches')
# plt.show()

venue
Eden Gardens 77
Wankhede Stadium 73
M Chinnaswamy Stadium 65
Feroz Shah Kotla 60
Rajiv Gandhi International Stadium, Uppal 49

Name: count, dtype: int64
11. Find the match with the highest margin of victory (by wickets or runs)

[88]: df[df['result_margin']==df['result_margin'].max()]

[88]: id season city date match_type player_of_match \


620 1082635 2017 Delhi 2017-05-06 League LMP Simmons

venue team1 team2 toss_winner \


620 Feroz Shah Kotla Delhi Daredevils Mumbai Indians Delhi Daredevils

toss_decision winner result result_margin target_runs \


620 field Mumbai Indians runs 146.0 213.0

target_overs super_over method umpire1 umpire2


620 20.0 N NaN Nitin Menon CK Nandan

[89]: # Find the match with the highest margin of victory (by wickets or runs)
df_wickets = df[df['result'] == 'wickets']
df_runs = df[df['result'] == 'runs']

max_margin_wicket = df_wickets.loc[df_wickets['result_margin'].idxmax()]
max_margin_run = df_runs.loc[df_runs['result_margin'].idxmax()]

max_margin_run, max_margin_wicket

[89]: (id 1082635
season 2017
city Delhi
date 2017-05-06
match_type League
player_of_match LMP Simmons
venue Feroz Shah Kotla
team1 Delhi Daredevils
team2 Mumbai Indians
toss_winner Delhi Daredevils
toss_decision field
winner Mumbai Indians
result runs
result_margin 146.0
target_runs 213.0
target_overs 20.0
super_over N
method NaN
umpire1 Nitin Menon
umpire2 CK Nandan

Name: 620, dtype: object,
id 335994
season 2007/08
city Mumbai
date 2008-04-27
match_type League
player_of_match AC Gilchrist
venue Dr DY Patil Sports Academy
team1 Mumbai Indians
team2 Deccan Chargers
toss_winner Deccan Chargers
toss_decision field
winner Deccan Chargers
result wickets
result_margin 10.0
target_runs 155.0
target_overs 20.0
super_over N
method NaN
umpire1 Asad Rauf
umpire2 SL Shastri
Name: 12, dtype: object)

12. Calculate the win percentage for each team

[90]: matches_played = df['team1'].value_counts() + df['team2'].value_counts()
matches_won = df['winner'].value_counts()
win_percentage = (matches_won / matches_played) * 100
print(win_percentage)

Chennai Super Kings 57.983193
Deccan Chargers 38.666667
Delhi Capitals 52.747253
Delhi Daredevils 41.614907
Gujarat Lions 43.333333
Gujarat Titans 62.222222
Kings XI Punjab 46.315789
Kochi Tuskers Kerala 42.857143
Kolkata Knight Riders 52.191235
Lucknow Super Giants 54.545455
Mumbai Indians 55.172414
Pune Warriors 26.086957
Punjab Kings 42.857143
Rajasthan Royals 50.678733
Rising Pune Supergiant 62.500000
Rising Pune Supergiants 35.714286
Royal Challengers Bangalore 48.333333
Royal Challengers Bengaluru 46.666667
Sunrisers Hyderabad 48.351648
Name: count, dtype: float64
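Note that the table above counts renamed franchises separately (e.g. both 'Royal Challengers Bangalore' and 'Royal Challengers Bengaluru'). If one row per franchise is wanted, the names can be unified before counting; the mapping below is an illustrative assumption, not an exhaustive list:

```python
import pandas as pd

# Illustrative rename map (assumed for this sketch, not exhaustive).
rename = {
    "Delhi Daredevils": "Delhi Capitals",
    "Royal Challengers Bangalore": "Royal Challengers Bengaluru",
    "Kings XI Punjab": "Punjab Kings",
}

# Toy winner column; replace() maps old franchise names to current ones.
winners = pd.Series(["Delhi Daredevils", "Delhi Capitals", "Punjab Kings"])
print(winners.replace(rename).value_counts().to_dict())
# {'Delhi Capitals': 2, 'Punjab Kings': 1}
```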
13. Find the average number of overs played in all matches

[91]: average_overs_played = df['target_overs'].mean()
print(f'The average number of overs played in all matches is: {average_overs_played}')

The average number of overs played in all matches is: 19.75934065934066


14. Find the most common match outcome (runs, wickets, or no result)

[92]: most_common_outcome = result_distribution.idxmax()
print(f'The most common match outcome is: {most_common_outcome}')

The most common match outcome is: wickets

15. Find the total number of matches played at each venue by year

[93]: matches_per_venue_year = df.groupby(['season','venue']).size()
print(matches_per_venue_year)

season venue
2007/08 Dr DY Patil Sports Academy 4
Eden Gardens 7
Feroz Shah Kotla 6
M Chinnaswamy Stadium 7
MA Chidambaram Stadium, Chepauk 7
..
2024 Maharaja Yadavindra Singh International Cricket Stadium, Mullanpur 5
Narendra Modi Stadium, Ahmedabad 8
Rajiv Gandhi International Stadium, Uppal, Hyderabad 6
Sawai Mansingh Stadium, Jaipur 5
Wankhede Stadium, Mumbai 7
Length: 175, dtype: int64
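The long (season, venue) Series above is hard to scan; `unstack` turns it into a season-by-venue table with one column per venue. A sketch on a toy frame:

```python
import pandas as pd

# Toy matches frame; the real df has one row per match.
toy = pd.DataFrame({
    "season": ["2008", "2008", "2009"],
    "venue": ["Eden Gardens", "Eden Gardens", "Wankhede Stadium"],
})

# size() gives the long counts; unstack spreads venues into columns,
# filling season/venue combinations with no matches with 0.
table = toy.groupby(["season", "venue"]).size().unstack(fill_value=0)
print(table)
```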
16. Analyze the win margin distribution by year

[94]: # Grouping the data by season and result type
import seaborn as sns  # ensure seaborn is available for the boxplot

win_margin_by_year = df.groupby(['season', 'result'])['result_margin'].describe()
print(win_margin_by_year)

# Plotting the win margin distribution by year
plt.figure(figsize=(14, 8))
sns.boxplot(x='season', y='result_margin', hue='result', data=df)
plt.title('Win Margin Distribution by Year')
plt.xlabel('Season')
plt.ylabel('Win Margin')
plt.xticks(rotation=45)
plt.legend(title='Result Type')
plt.show()

count mean std min 25% 50% 75% max
season result
2007/08 runs 24.0 29.375000 34.291351 1.0 8.25 16.0 35.00 140.0
wickets 34.0 6.500000 2.078024 3.0 5.00 7.0 8.00 10.0
2009 runs 27.0 28.296296 28.894789 1.0 10.00 16.0 32.50 92.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 29.0 6.206897 1.820112 2.0 6.00 6.0 7.00 10.0
2009/10 runs 31.0 31.483871 20.990269 2.0 15.50 31.0 39.00 98.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 28.0 6.785714 1.571909 4.0 5.75 7.0 8.00 10.0
2011 no result 0.0 NaN NaN NaN NaN NaN NaN NaN
runs 33.0 33.272727 26.081929 2.0 17.00 25.0 43.00 111.0
wickets 39.0 6.794872 1.794428 3.0 6.00 7.0 8.00 10.0
2012 runs 34.0 28.235294 19.645431 1.0 14.25 26.0 37.75 86.0
wickets 40.0 6.025000 1.716996 2.0 5.00 5.5 7.00 10.0
2013 runs 37.0 33.540541 28.657551 2.0 14.00 24.0 48.00 130.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 37.0 6.135135 1.669367 3.0 5.00 6.0 7.00 10.0
2014 runs 22.0 29.272727 22.416367 2.0 15.25 24.5 33.50 93.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 37.0 6.081081 1.516179 3.0 5.00 6.0 7.00 9.0
2015 no result 0.0 NaN NaN NaN NaN NaN NaN NaN
runs 32.0 26.562500 28.598373 1.0 8.75 20.0 29.00 138.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 24.0 6.166667 2.219805 1.0 5.00 6.0 7.25 10.0
2016 runs 21.0 32.190476 36.347791 1.0 9.00 22.0 34.00 144.0
wickets 39.0 6.256410 1.772865 2.0 5.00 6.0 7.00 10.0
2017 runs 26.0 30.307692 33.638988 1.0 10.50 18.0 33.00 146.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 32.0 6.375000 1.896516 2.0 5.00 6.5 7.25 10.0
2018 runs 28.0 24.107143 23.850366 3.0 10.75 14.0 31.00 102.0
wickets 32.0 5.812500 2.206113 1.0 5.00 6.0 7.00 10.0
2019 no result 0.0 NaN NaN NaN NaN NaN NaN NaN
runs 22.0 30.227273 27.194068 1.0 12.50 25.0 39.75 118.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 35.0 5.771429 1.646488 2.0 5.00 6.0 7.00 9.0
2020/21 runs 27.0 39.370370 26.716673 2.0 15.50 37.0 58.00 97.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 29.0 6.965517 1.762360 4.0 5.00 7.0 8.00 10.0
2021 runs 22.0 26.454545 24.039110 1.0 6.00 19.0 41.00 86.0
tie 0.0 NaN NaN NaN NaN NaN NaN NaN
wickets 37.0 5.918919 2.019053 2.0 4.00 6.0 7.00 10.0
2022 runs 37.0 27.945946 23.085525 2.0 12.00 18.0 44.00 91.0
wickets 37.0 6.000000 1.615893 3.0 5.00 6.0 7.00 9.0
2023 no result 0.0 NaN NaN NaN NaN NaN NaN NaN
runs 40.0 30.400000 27.554887 1.0 7.75 22.0 51.25 112.0
wickets 33.0 5.727273 1.908414 1.0 5.00 6.0 7.00 9.0
2024 runs 35.0 30.142857 25.994505 1.0 15.00 24.0 35.00 106.0
wickets 36.0 5.944444 1.999206 2.0 4.00 6.0 7.00 10.0

17. Calculate the total number of ‘no result’ matches and their impact on the tournament

[95]: # Calculate the total number of 'no result' matches
no_result_matches = df[df['result'] == 'no result'].shape[0]
print(f"The total number of 'no result' matches: {no_result_matches}")

# Analyze the distribution of 'no result' matches by season
no_result_by_season = df[df['result'] == 'no result']['season'].value_counts().sort_index()
print("Distribution of 'no result' matches by season:")
print(no_result_by_season)

# Analyze the distribution of 'no result' matches by team
no_result_by_team = df[df['result'] == 'no result']['team1'].value_counts() + df[df['result'] == 'no result']['team2'].value_counts()
print("Distribution of 'no result' matches by team:")
print(no_result_by_team)

The total number of 'no result' matches: 5
Distribution of 'no result' matches by season:
season
2011 1
2015 2
2019 1
2023 1
Name: count, dtype: int64
Distribution of 'no result' matches by team:
Chennai Super Kings NaN
Delhi Daredevils 2.0
Lucknow Super Giants NaN
Pune Warriors NaN
Rajasthan Royals NaN
Royal Challengers Bangalore NaN
Name: count, dtype: float64
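The NaNs in the team table above arise because `+` on two `value_counts` results aligns on team names and produces NaN wherever a team appears in only one of the two columns. `Series.add` with `fill_value=0` keeps every team's count; a sketch on toy columns:

```python
import pandas as pd

# Toy team1/team2 columns of 'no result' matches.
t1 = pd.Series(["Delhi Daredevils", "Pune Warriors"]).value_counts()
t2 = pd.Series(["Delhi Daredevils", "Chennai Super Kings"]).value_counts()

# fill_value=0 substitutes 0 where a team is missing on one side,
# so no team's count becomes NaN.
combined = t1.add(t2, fill_value=0)
print(combined.sort_index())
```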
18. How many matches were won by teams batting first vs. batting second?

[96]: # Matches won by teams batting first
batting_first_wins = df[(df['toss_decision'] == 'bat') & (df['toss_winner'] == df['winner'])].shape[0] + \
    df[(df['toss_decision'] == 'field') & (df['toss_winner'] != df['winner'])].shape[0]

# Matches won by teams batting second
batting_second_wins = df[(df['toss_decision'] == 'field') & (df['toss_winner'] == df['winner'])].shape[0] + \
    df[(df['toss_decision'] == 'bat') & (df['toss_winner'] != df['winner'])].shape[0]

print(f"Matches won by teams batting first: {batting_first_wins}")
print(f"Matches won by teams batting second: {batting_second_wins}")

Matches won by teams batting first: 504
Matches won by teams batting second: 591
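The four-filter sum above can also be expressed by first deriving which side batted first from the toss columns; a sketch on toy rows (this ignores rain-shortened and no-result edge cases):

```python
import numpy as np
import pandas as pd

# Toy rows mirroring the relevant columns of the matches table.
toy = pd.DataFrame({
    "team1": ["A", "A"],
    "team2": ["B", "B"],
    "toss_winner": ["A", "B"],
    "toss_decision": ["bat", "field"],
    "winner": ["A", "B"],
})

# The toss loser is whichever of team1/team2 did not win the toss.
toss_loser = np.where(toy["toss_winner"] == toy["team1"], toy["team2"], toy["team1"])
# The side batting first is the toss winner if it chose to bat, else the toss loser.
batting_first = np.where(toy["toss_decision"] == "bat", toy["toss_winner"], toss_loser)

# Count matches won by the side batting first.
first_wins = int((batting_first == toy["winner"]).sum())
print(first_wins)  # 1
```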
19. Find out the average number of runs scored by the winning team

[97]: average_runs_scored_by_winning_team = df[df['result'] == 'runs']['target_runs'].mean()
print(f'The average number of runs scored by the winning team is: {average_runs_scored_by_winning_team}')

The average number of runs scored by the winning team is: 179.69678714859438
20. Identify the most unsuccessful team (team with lowest wins)

[98]: most_unsuccessful_team = matches_won.idxmin()
print(f"The most unsuccessful team (team with the lowest wins) is: {most_unsuccessful_team}")

The most unsuccessful team (team with the lowest wins) is: Rising Pune Supergiants
ASSIGNMENT QUESTIONS
Explore the following for the given dataset and also perform EDA:
1. Frequency Distribution of Wins by Wickets
2. Relative Frequency Distribution
3. Cumulative Relative Frequency Graph
4. Probability of Winning by 6 Wickets or Less
5. Normal Distribution of Wins by Wickets
6. Mean, Standard Deviation, and Percentile Calculation
7. Find outliers for selected columns: lower-range outliers are values below mu - 2*sigma, and upper-range outliers are values above mu + 2*sigma.
1. Frequency Distribution of Wins by Wickets

[107]: # Frequency distribution of wins by wickets
wins_by_wickets = df_wickets['result_margin'].value_counts().sort_index()
print(wins_by_wickets)

# Plotting the frequency distribution
wins_by_wickets.plot(kind='bar', color='blue')
plt.title('Frequency Distribution of Wins by Wickets')
plt.xlabel('Number of Wickets')
plt.ylabel('Frequency')
plt.show()

result_margin
1.0 4
2.0 10
3.0 31
4.0 59
5.0 97
6.0 120
7.0 115
8.0 78
9.0 48
10.0 16
Name: count, dtype: int64

2. Relative Frequency Distribution

[109]: relative_frequency_wins_by_wickets = wins_by_wickets / wins_by_wickets.sum()
print(relative_frequency_wins_by_wickets)

# Plotting the relative frequency distribution
relative_frequency_wins_by_wickets.plot(kind='bar', color='orange')
plt.title('Relative Frequency Distribution of Wins by Wickets')
plt.xlabel('Number of Wickets')
plt.ylabel('Relative Frequency')
plt.show()

result_margin
1.0 0.006920
2.0 0.017301
3.0 0.053633
4.0 0.102076
5.0 0.167820
6.0 0.207612
7.0 0.198962
8.0 0.134948
9.0 0.083045
10.0 0.027682
Name: count, dtype: float64

3. Cumulative Relative Frequency Graph

[110]: # Calculate the cumulative relative frequency
cumulative_relative_frequency = relative_frequency_wins_by_wickets.cumsum()
print(cumulative_relative_frequency)

# Plotting the cumulative relative frequency graph
cumulative_relative_frequency.plot(kind='line', marker='o', color='purple')
plt.title('Cumulative Relative Frequency of Wins by Wickets')
plt.xlabel('Number of Wickets')
plt.ylabel('Cumulative Relative Frequency')
plt.grid(True)
plt.show()

result_margin
1.0 0.006920
2.0 0.024221
3.0 0.077855
4.0 0.179931
5.0 0.347751
6.0 0.555363
7.0 0.754325
8.0 0.889273
9.0 0.972318
10.0 1.000000
Name: count, dtype: float64

4. Probability of Winning by 6 Wickets or Less

[111]: # Calculate the total number of wins by wickets
total_wins_by_wickets = wins_by_wickets.sum()

# Calculate the number of wins by 6 wickets or less
wins_by_6_or_less = wins_by_wickets[wins_by_wickets.index <= 6].sum()

# Calculate the probability
probability_wins_by_6_or_less = wins_by_6_or_less / total_wins_by_wickets
print(f'The probability of winning by 6 wickets or less is: {probability_wins_by_6_or_less}')
The probability of winning by 6 wickets or less is: 0.5553633217993079
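Since the cumulative relative frequencies were already computed in step 3, the same probability can be read directly at margin 6. A toy check (made-up counts, not the real distribution):

```python
import pandas as pd

# Toy wicket-margin counts (not the real distribution).
counts = pd.Series({4: 2, 5: 2, 6: 4, 7: 8})

# P(margin <= 6) is the cumulative relative frequency evaluated at 6.
cum_rel = (counts / counts.sum()).cumsum()
print(cum_rel.loc[6])  # 0.5
```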


5. Normal Distribution of Wins by Wickets

[112]: # Plotting the normal distribution of wins by wickets
plt.figure(figsize=(10, 6))
sns.histplot(df_wickets['result_margin'], kde=True, bins=10, color='blue')
plt.title('Normal Distribution of Wins by Wickets')
plt.xlabel('Number of Wickets')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

6. Mean, Standard Deviation, and Percentile Calculation

[116]: print(df.describe())

id result_margin target_runs target_overs


count 1.095000e+03 1076.000000 1092.000000 1092.000000
mean 9.048283e+05 17.259294 165.684066 19.759341
std 3.677402e+05 21.787444 33.427048 1.581108
min 3.359820e+05 1.000000 43.000000 5.000000
25% 5.483315e+05 6.000000 146.000000 20.000000
50% 9.809610e+05 8.000000 166.000000 20.000000
75% 1.254062e+06 20.000000 187.000000 20.000000
max 1.426312e+06 146.000000 288.000000 20.000000
7. Find outliers for selected columns: lower-range outliers fall below mu - 2*sigma, and upper-range outliers exceed mu + 2*sigma.

[118]: # Calculate the mean and standard deviation for the result_margin column
mu = df['result_margin'].mean()
sigma = df['result_margin'].std()

# Calculate the lower and upper bounds for outliers
lower_bound = mu - 2 * sigma
upper_bound = mu + 2 * sigma

# Find the outliers
outliers = df[(df['result_margin'] < lower_bound) | (df['result_margin'] > upper_bound)]
print(outliers)

id season city date match_type player_of_match \


0 335982 2007/08 Bangalore 2008-04-18 League BB McCullum
9 335991 2007/08 Chandigarh 2008-04-25 League KC Sangakkara
39 336023 2007/08 Jaipur 2008-05-17 League GC Smith
55 336038 2007/08 Mumbai 2008-05-30 Semi Final SR Watson
59 392182 2009 Cape Town 2009-04-18 League R Dravid
… … … … … … …
1030 1422125 2024 Chennai 2024-03-26 League S Dube
1039 1422134 2024 Visakhapatnam 2024-04-03 League SP Narine
1058 1426273 2024 Delhi 2024-04-20 League TM Head
1069 1426284 2024 Chennai 2024-04-28 League RD Gaikwad
1077 1426292 2024 Lucknow 2024-05-05 League SP Narine

venue \
0 M Chinnaswamy Stadium
9 Punjab Cricket Association Stadium, Mohali
39 Sawai Mansingh Stadium
55 Wankhede Stadium
59 Newlands
… …
1030 MA Chidambaram Stadium, Chepauk, Chennai
1039 Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket St…
1058 Arun Jaitley Stadium, Delhi
1069 MA Chidambaram Stadium, Chepauk, Chennai
1077 Bharat Ratna Shri Atal Bihari Vajpayee Ekana C…

team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
9 Kings XI Punjab Mumbai Indians
39 Rajasthan Royals Royal Challengers Bangalore
55 Delhi Daredevils Rajasthan Royals
59 Royal Challengers Bangalore Rajasthan Royals
… … …
1030 Chennai Super Kings Gujarat Titans
1039 Kolkata Knight Riders Delhi Capitals
1058 Sunrisers Hyderabad Delhi Capitals
1069 Chennai Super Kings Sunrisers Hyderabad
1077 Kolkata Knight Riders Lucknow Super Giants

toss_winner toss_decision winner \


0 Royal Challengers Bangalore field Kolkata Knight Riders
9 Mumbai Indians field Kings XI Punjab
39 Royal Challengers Bangalore field Rajasthan Royals
55 Delhi Daredevils field Rajasthan Royals
59 Royal Challengers Bangalore bat Royal Challengers Bangalore
… … … …
1030 Gujarat Titans field Chennai Super Kings
1039 Kolkata Knight Riders bat Kolkata Knight Riders
1058 Delhi Capitals field Sunrisers Hyderabad
1069 Sunrisers Hyderabad field Chennai Super Kings
1077 Lucknow Super Giants field Kolkata Knight Riders

result result_margin target_runs target_overs super_over method \


0 runs 140.0 223.0 20.0 N NaN
9 runs 66.0 183.0 20.0 N NaN
39 runs 65.0 198.0 20.0 N NaN
55 runs 105.0 193.0 20.0 N NaN
59 runs 75.0 134.0 20.0 N NaN
… … … … … … …
1030 runs 63.0 207.0 20.0 N NaN
1039 runs 106.0 273.0 20.0 N NaN
1058 runs 67.0 267.0 20.0 N NaN
1069 runs 78.0 213.0 20.0 N NaN
1077 runs 98.0 236.0 20.0 N NaN

umpire1 umpire2
0 Asad Rauf RE Koertzen
9 Aleem Dar AM Saheba
39 BF Bowden SL Shastri
55 BF Bowden RE Koertzen
59 BR Doctrove RB Tiffin
… … …
1030 AG Wharf Tapan Sharma
1039 A Totre UV Gandhe
1058 J Madanagopal Navdeep Singh
1069 R Pandit MV Saidharshan Kumar
1077 MV Saidharshan Kumar YC Barde

[65 rows x 20 columns]
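The mu ± 2*sigma rule above implicitly assumes roughly normal data, but win margins are heavily right-skewed, so a quartile-based fence is a common alternative. A sketch on toy numbers (the real analysis would use df['result_margin']):

```python
import numpy as np

# Toy skewed sample standing in for the margin column.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Tukey's fences: outliers lie more than 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(outliers)  # [100]
```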

D24AIML081_A_6

April 1, 2025

D24AIML081 PMRP ASSIGNMENT 6 WITH CONCLUSION
CLASSWORK
QUESTIONS:
-> General Population and Gender Distribution
What is the total population in each county, and how does it vary by state?
What is the gender distribution (Men vs. Women) across different counties?
What is the average population size for census tracts in each state?
How does the population of each race (White, Black, Hispanic, etc.) differ across states?
What is the proportion of the male population compared to the female population in each census tract?
-> Ethnicity and Race
What is the distribution of the Hispanic population across various counties and states?
How do different racial groups (White, Black, Native, etc.) vary as a percentage of the total population in different counties?
Which states have the highest percentage of Black or Hispanic populations?
-> Employment and Work Type
What is the employment rate (Employed vs. Unemployed) for each census tract?
How does the rate of self-employed individuals compare to those working in private/public sectors across different states?
What percentage of the population works from home, and how does it vary by county and state?
How does the unemployment rate vary across different states and counties?
What is the distribution of employed individuals working in private vs. public sectors?
-> Commuting and Transportation
What is the average commuting time across counties and states, and how does it differ for employed individuals?
What modes of transportation are most commonly used for commuting in different states (e.g., car, public transportation, walking)?
How does the percentage of people commuting via walking or public transportation vary between urban and rural areas?
-> Income and Housing
What is the average income (or median household income) in each state and county?
How does the distribution of housing type (e.g., owner-occupied vs. renter-occupied) vary across different counties?
How does the cost of living compare across different states based on average income and housing costs?
-> Social Characteristics
What is the relationship between education levels (e.g., percentage with a high school diploma, bachelor's degree) and employment types across different states?

[48]: import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = pd.read_csv("C:/Users/User/Downloads/acs2017_census_tract_data.csv")
df, df.head(), df.tail(), df.describe(), df.info(), df.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74001 entries, 0 to 74000
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TractId 74001 non-null int64
1 State 74001 non-null object
2 County 74001 non-null object
3 TotalPop 74001 non-null int64
4 Men 74001 non-null int64
5 Women 74001 non-null int64
6 Hispanic 73305 non-null float64
7 White 73305 non-null float64
8 Black 73305 non-null float64
9 Native 73305 non-null float64
10 Asian 73305 non-null float64
11 Pacific 73305 non-null float64
12 VotingAgeCitizen 74001 non-null int64
13 Income 72885 non-null float64
14 IncomeErr 72885 non-null float64
15 IncomePerCap 73256 non-null float64
16 IncomePerCapErr 73256 non-null float64
17 Poverty 73159 non-null float64
18 ChildPoverty 72891 non-null float64
19 Professional 73190 non-null float64
20 Service 73190 non-null float64
21 Office 73190 non-null float64
22 Construction 73190 non-null float64
23 Production 73190 non-null float64
24 Drive 73200 non-null float64
25 Carpool 73200 non-null float64
26 Transit 73200 non-null float64
27 Walk 73200 non-null float64
28 OtherTransp 73200 non-null float64
29 WorkAtHome 73200 non-null float64
30 MeanCommute 73055 non-null float64
31 Employed 74001 non-null int64
32 PrivateWork 73190 non-null float64
33 PublicWork 73190 non-null float64
34 SelfEmployed 73190 non-null float64
35 FamilyWork 73190 non-null float64
36 Unemployment 73191 non-null float64
dtypes: float64(29), int64(6), object(2)
memory usage: 20.9+ MB

[48]: ( TractId State County TotalPop Men Women \


0 1001020100 Alabama Autauga County 1845 899 946
1 1001020200 Alabama Autauga County 2172 1167 1005
2 1001020300 Alabama Autauga County 3385 1533 1852
3 1001020400 Alabama Autauga County 4267 2001 2266
4 1001020500 Alabama Autauga County 9965 5054 4911
… … … … … … …
73996 72153750501 Puerto Rico Yauco Municipio 6011 3035 2976
73997 72153750502 Puerto Rico Yauco Municipio 2342 959 1383
73998 72153750503 Puerto Rico Yauco Municipio 2218 1001 1217
73999 72153750601 Puerto Rico Yauco Municipio 4380 1964 2416
74000 72153750602 Puerto Rico Yauco Municipio 3001 1343 1658

Hispanic White Black Native … Walk OtherTransp WorkAtHome \


0 2.4 86.3 5.2 0.0 … 0.5 0.0 2.1
1 1.1 41.6 54.5 0.0 … 0.0 0.5 0.0
2 8.0 61.4 26.5 0.6 … 1.0 0.8 1.5
3 9.6 80.3 7.1 0.5 … 1.5 2.9 2.1
4 0.9 77.5 16.4 0.0 … 0.8 0.3 0.7
… … … … … … … … …
73996 99.7 0.3 0.0 0.0 … 0.5 0.0 3.6
73997 99.1 0.9 0.0 0.0 … 0.0 0.0 1.3
73998 99.5 0.2 0.0 0.0 … 3.4 0.0 3.4
73999 100.0 0.0 0.0 0.0 … 0.0 0.0 0.0
74000 99.2 0.8 0.0 0.0 … 4.9 0.0 8.9

MeanCommute Employed PrivateWork PublicWork SelfEmployed \


0 24.5 881 74.2 21.2 4.5
1 22.2 852 75.9 15.0 9.0
2 23.1 1482 73.3 21.1 4.8
3 25.9 1849 75.8 19.7 4.5
4 21.0 4787 71.4 24.1 4.5
… … … … … …
73996 26.9 1576 59.2 33.8 7.0
73997 25.3 666 58.4 35.4 6.2
73998 23.5 560 57.5 34.5 8.0
73999 24.1 1062 67.7 30.4 1.9
74000 21.6 759 75.9 19.1 5.0

FamilyWork Unemployment
0 0.0 4.6
1 0.0 3.4
2 0.7 4.7
3 0.0 6.1
4 0.0 2.3
… … …
73996 0.0 20.8
73997 0.0 26.3
73998 0.0 23.0
73999 0.0 29.5
74000 0.0 17.9

[74001 rows x 37 columns],


TractId State County TotalPop Men Women Hispanic \
0 1001020100 Alabama Autauga County 1845 899 946 2.4
1 1001020200 Alabama Autauga County 2172 1167 1005 1.1
2 1001020300 Alabama Autauga County 3385 1533 1852 8.0
3 1001020400 Alabama Autauga County 4267 2001 2266 9.6
4 1001020500 Alabama Autauga County 9965 5054 4911 0.9

White Black Native … Walk OtherTransp WorkAtHome MeanCommute \


0 86.3 5.2 0.0 … 0.5 0.0 2.1 24.5
1 41.6 54.5 0.0 … 0.0 0.5 0.0 22.2
2 61.4 26.5 0.6 … 1.0 0.8 1.5 23.1
3 80.3 7.1 0.5 … 1.5 2.9 2.1 25.9
4 77.5 16.4 0.0 … 0.8 0.3 0.7 21.0

Employed PrivateWork PublicWork SelfEmployed FamilyWork Unemployment


0 881 74.2 21.2 4.5 0.0 4.6
1 852 75.9 15.0 9.0 0.0 3.4
2 1482 73.3 21.1 4.8 0.7 4.7
3 1849 75.8 19.7 4.5 0.0 6.1
4 4787 71.4 24.1 4.5 0.0 2.3

[5 rows x 37 columns],
TractId State County TotalPop Men Women \
73996 72153750501 Puerto Rico Yauco Municipio 6011 3035 2976
73997 72153750502 Puerto Rico Yauco Municipio 2342 959 1383
73998 72153750503 Puerto Rico Yauco Municipio 2218 1001 1217
73999 72153750601 Puerto Rico Yauco Municipio 4380 1964 2416
74000 72153750602 Puerto Rico Yauco Municipio 3001 1343 1658

Hispanic White Black Native … Walk OtherTransp WorkAtHome \


73996 99.7 0.3 0.0 0.0 … 0.5 0.0 3.6
73997 99.1 0.9 0.0 0.0 … 0.0 0.0 1.3
73998 99.5 0.2 0.0 0.0 … 3.4 0.0 3.4
73999 100.0 0.0 0.0 0.0 … 0.0 0.0 0.0
74000 99.2 0.8 0.0 0.0 … 4.9 0.0 8.9

MeanCommute Employed PrivateWork PublicWork SelfEmployed \


73996 26.9 1576 59.2 33.8 7.0
73997 25.3 666 58.4 35.4 6.2
73998 23.5 560 57.5 34.5 8.0
73999 24.1 1062 67.7 30.4 1.9
74000 21.6 759 75.9 19.1 5.0

FamilyWork Unemployment
73996 0.0 20.8
73997 0.0 26.3
73998 0.0 23.0
73999 0.0 29.5
74000 0.0 17.9

[5 rows x 37 columns],
TractId TotalPop Men Women Hispanic \
count 7.400100e+04 74001.000000 74001.000000 74001.000000 73305.000000
mean 2.839113e+10 4384.716017 2157.710707 2227.005311 17.265444
std 1.647593e+10 2228.936729 1120.560504 1146.240218 23.073811
min 1.001020e+09 0.000000 0.000000 0.000000 0.000000
25% 1.303901e+10 2903.000000 1416.000000 1465.000000 2.600000
50% 2.804700e+10 4105.000000 2007.000000 2082.000000 7.400000
75% 4.200341e+10 5506.000000 2707.000000 2803.000000 21.100000
max 7.215375e+10 65528.000000 32266.000000 33262.000000 100.000000

White Black Native Asian Pacific \


count 73305.000000 73305.00000 73305.000000 73305.000000 73305.000000
mean 61.309043 13.28910 0.734047 4.753691 0.147341
std 30.634461 21.60118 4.554247 8.999888 1.029250
min 0.000000 0.00000 0.000000 0.000000 0.000000
25% 38.000000 0.80000 0.000000 0.200000 0.000000
50% 70.400000 3.80000 0.000000 1.500000 0.000000
75% 87.700000 14.60000 0.400000 5.000000 0.000000
max 100.000000 100.00000 100.000000 100.000000 71.900000

… Walk OtherTransp WorkAtHome MeanCommute \


count … 73200.000000 73200.000000 73200.000000 73055.000000
mean … 3.042825 1.894605 4.661466 26.056594
std … 5.805753 2.549374 4.014940 7.124524
min … 0.000000 0.000000 0.000000 1.000000
25% … 0.400000 0.400000 2.000000 21.100000
50% … 1.400000 1.200000 3.800000 25.400000
75% … 3.300000 2.500000 6.300000 30.300000
max … 100.000000 100.000000 100.000000 73.900000

Employed PrivateWork PublicWork SelfEmployed FamilyWork \
count 74001.000000 73190.000000 73190.000000 73190.000000 73190.000000
mean 2049.152052 79.494222 14.163342 6.171484 0.171164
std 1138.865457 8.126383 7.328680 3.932364 0.456580
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1276.000000 75.200000 9.300000 3.500000 0.000000
50% 1895.000000 80.600000 13.000000 5.500000 0.000000
75% 2635.000000 85.000000 17.600000 8.000000 0.000000
max 28945.000000 100.000000 100.000000 100.000000 22.300000

Unemployment
count 73191.000000
mean 7.246738
std 5.227624
min 0.000000
25% 3.900000
50% 6.000000
75% 9.000000
max 100.000000

[8 rows x 35 columns],
None,
Index(['TractId', 'State', 'County', 'TotalPop', 'Men', 'Women', 'Hispanic',
'White', 'Black', 'Native', 'Asian', 'Pacific', 'VotingAgeCitizen',
'Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr', 'Poverty',
'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction',
'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp',
'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork',
'SelfEmployed', 'FamilyWork', 'Unemployment'],
dtype='object'))

[50]: #What is the total population in each county, and how does it vary by state?
total_population_by_county = df.groupby(['State', 'County'])['TotalPop'].sum().
↪reset_index()

print(total_population_by_county)

total_population_by_state = df.groupby('State')['TotalPop'].sum().reset_index()
print(total_population_by_state)

State County TotalPop


0 Alabama Autauga County 55036
1 Alabama Baldwin County 203360
2 Alabama Barbour County 26201
3 Alabama Bibb County 22580
4 Alabama Blount County 57667
… … … …
3215 Wyoming Sweetwater County 44527
3216 Wyoming Teton County 22923

3217 Wyoming Uinta County 20758
3218 Wyoming Washakie County 8253
3219 Wyoming Weston County 7117

[3220 rows x 3 columns]


State TotalPop
0 Alabama 4850771
1 Alaska 738565
2 Arizona 6809946
3 Arkansas 2977944
4 California 38982847
5 Colorado 5436519
6 Connecticut 3594478
7 Delaware 943732
8 District of Columbia 672391
9 Florida 20278447
10 Georgia 10201635
11 Hawaii 1421658
12 Idaho 1657375
13 Illinois 12854526
14 Indiana 6614418
15 Iowa 3118102
16 Kansas 2903820
17 Kentucky 4424376
18 Louisiana 4663461
19 Maine 1330158
20 Maryland 5996079
21 Massachusetts 6789319
22 Michigan 9925568
23 Minnesota 5490726
24 Mississippi 2986220
25 Missouri 6075300
26 Montana 1029862
27 Nebraska 1893921
28 Nevada 2887725
29 New Hampshire 1331848
30 New Jersey 8960161
31 New Mexico 2084828
32 New York 19798228
33 North Carolina 10052564
34 North Dakota 745475
35 Ohio 11609756
36 Oklahoma 3896251
37 Oregon 4025127
38 Pennsylvania 12790505
39 Puerto Rico 3468963
40 Rhode Island 1056138
41 South Carolina 4893444

42 South Dakota 855444
43 Tennessee 6597381
44 Texas 27419612
45 Utah 2993941
46 Vermont 624636
47 Virginia 8365952
48 Washington 7169967
49 West Virginia 1836843
50 Wisconsin 5763217
51 Wyoming 583200
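A quick way to rank such state totals is `DataFrame.nlargest`; a minimal sketch on toy numbers (the values below are made up, not the census figures above):

```python
import pandas as pd

# Toy state totals (hypothetical values, not the census output above).
totals = pd.DataFrame({"State": ["A", "B", "C"], "TotalPop": [5, 9, 7]})

# nlargest sorts by the given column and keeps the top rows.
top2 = totals.nlargest(2, "TotalPop")
print(top2["State"].tolist())  # → ['B', 'C']
```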

[52]: #What is the gender distribution (Men vs. Women) across different counties?

gender_distribution_by_county = df.groupby(['State', 'County'])[['Men', 'Women']].sum().reset_index()

fig, ax = plt.subplots(figsize=(12, 8))


gender_distribution_by_county.groupby('State')[['Men', 'Women']].sum().
↪plot(kind='bar', stacked=True, ax=ax)

ax.set_title('Gender Distribution (Men vs. Women) Across Different Counties')


ax.set_xlabel('State')
ax.set_ylabel('Population')
plt.legend(title='Gender')
plt.show()

[54]: #What is the average population size for census tracts in each state?

average_population_by_state = df.groupby('State')['TotalPop'].mean().
↪reset_index()

# print(average_population_by_state)
average_population_by_state.head()

[54]: State TotalPop


0 Alabama 4107.342083
1 Alaska 4422.544910
2 Arizona 4462.612058
3 Arkansas 4341.026239
4 California 4838.382400

[56]: #How does the population of each race (White, Black, Hispanic, etc.) differ across states?
# Note: the race columns are already tract-level percentages, so dividing them
# by TotalPop does not give population shares; the weighted sums below do.
df['White_Percentage'] = (df['White'] / df['TotalPop']) * 100
df['Black_Percentage'] = (df['Black'] / df['TotalPop']) * 100
df['Native_Percentage'] = (df['Native'] / df['TotalPop']) * 100
df['Asian_Percentage'] = (df['Asian'] / df['TotalPop']) * 100
df['Pacific_Percentage'] = (df['Pacific'] / df['TotalPop']) * 100
df['Hispanic_Percentage'] = (df['Hispanic'] / df['TotalPop']) * 100

# Calculate weighted average percentages


race_population_by_state = df.groupby('State').apply(
lambda x: (x[['Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific']].
↪sum() / x['TotalPop'].sum()) * 100

).reset_index()

race_population_by_state.plot(x='State', kind='bar', stacked=True, figsize=(15, 10))

plt.title('Population Distribution by Race Across States')


plt.xlabel('State')
plt.ylabel('Percentage of Population')
plt.legend(title='Race')
plt.show()

C:\Users\User\AppData\Local\Temp\ipykernel_6148\2768742177.py:12:
DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns.
This behavior is deprecated, and in a future version of pandas the grouping
columns will be excluded from the operation. Either pass `include_groups=False`
to exclude the groupings or explicitly select the grouping columns after groupby
to silence this warning.
race_population_by_state = df.groupby('State').apply(
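The warning above can be avoided by selecting the value columns before `apply`, so the lambda never sees the grouping column; a minimal sketch on toy data (column names mirror the census frame, the values are made up):

```python
import pandas as pd

# Toy stand-in for the census frame (hypothetical values).
toy = pd.DataFrame({
    "State": ["A", "A", "B"],
    "Hispanic": [10.0, 30.0, 50.0],
    "TotalPop": [100, 100, 200],
})

# Selecting the value columns first means the applied frame excludes 'State',
# which sidesteps the DeprecationWarning about grouping columns.
weighted = toy.groupby("State")[["Hispanic", "TotalPop"]].apply(
    lambda x: (x["Hispanic"].sum() / x["TotalPop"].sum()) * 100
).reset_index(name="Hispanic_Percentage")
print(weighted)
```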

[57]: #What is the proportion of the male population compared to the female␣
↪population in each census tract?

df['MaleToFemaleRatio'] = df['Men'] / df['Women'].replace(0, np.nan)

# Weighted average Male-to-Female Ratio per state


male_to_female_ratio_by_state = df.groupby('State').apply(
lambda x: x['Men'].sum() / x['Women'].sum()
).reset_index(name='MaleToFemaleRatio')

# Plot the distribution of Male to Female ratio across states


plt.figure(figsize=(15, 10))
plt.bar(male_to_female_ratio_by_state['State'],␣
↪male_to_female_ratio_by_state['MaleToFemaleRatio'])

plt.title('Average Male to Female Ratio by State')


plt.xlabel('State')
plt.ylabel('Male to Female Ratio')
plt.xticks(rotation=90)
plt.show()

C:\Users\User\AppData\Local\Temp\ipykernel_6148\1348919082.py:6:
DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns.
This behavior is deprecated, and in a future version of pandas the grouping
columns will be excluded from the operation. Either pass `include_groups=False`
to exclude the groupings or explicitly select the grouping columns after groupby
to silence this warning.
male_to_female_ratio_by_state = df.groupby('State').apply(

1 Ethnicity and Race


What is the distribution of Hispanic population across various counties and states?
How do different racial groups (White, Black, Native, etc.) vary in terms of percentage of total
population in different counties?
Which states have the highest percentage of Black or Hispanic populations?

[59]: #What is the distribution of Hispanic population across various counties and states?
df['Hispanic_Percentage'] = (df['Hispanic'] / df['TotalPop']) * 100

# Aggregate data at the county level


hispanic_by_county = df.groupby(['State', 'County'])[['Hispanic', 'TotalPop']].
↪sum().reset_index()

hispanic_by_county['Hispanic_Percentage'] = (hispanic_by_county['Hispanic'] /␣
↪hispanic_by_county['TotalPop']) * 100

# Aggregate data at the state level using a weighted average


hispanic_state_data = hispanic_by_county.groupby('State').apply(
lambda x: (x['Hispanic'].sum() / x['TotalPop'].sum()) * 100
).reset_index(name='Hispanic_Percentage')

# Plot the Hispanic population distribution across states


plt.figure(figsize=(15, 10))
plt.bar(hispanic_state_data['State'],␣
↪hispanic_state_data['Hispanic_Percentage'])

plt.title('Distribution of Hispanic Population Across States')


plt.xlabel('State')
plt.ylabel('Hispanic Population Percentage')
plt.xticks(rotation=90)
plt.show()

C:\Users\User\AppData\Local\Temp\ipykernel_6148\482559711.py:11:
DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns.
This behavior is deprecated, and in a future version of pandas the grouping
columns will be excluded from the operation. Either pass `include_groups=False`
to exclude the groupings or explicitly select the grouping columns after groupby
to silence this warning.
hispanic_state_data = hispanic_by_county.groupby('State').apply(

[60]: #How do different racial groups (White, Black, Native, etc.) vary in terms of␣
↪percentage of total population in different counties?

# Calculate the percentage of each racial group in each county


df['White_Percentage'] = (df['White'] / df['TotalPop']) * 100
df['Black_Percentage'] = (df['Black'] / df['TotalPop']) * 100
df['Native_Percentage'] = (df['Native'] / df['TotalPop']) * 100
df['Asian_Percentage'] = (df['Asian'] / df['TotalPop']) * 100
df['Pacific_Percentage'] = (df['Pacific'] / df['TotalPop']) * 100

# Aggregate racial population counts at the county level


racial_population_by_county = df.groupby(['State', 'County'])[['White',␣
↪'Black', 'Native', 'Asian', 'Pacific', 'TotalPop']].sum().reset_index()

# Calculate weighted racial percentages at the county level


# Note: the race columns are tract-level percentages, so these county figures
# are percentage-point sums over population, not true racial shares.
racial_population_by_county['White_Percentage'] = (racial_population_by_county['White'] / racial_population_by_county['TotalPop']) * 100
racial_population_by_county['Black_Percentage'] = (racial_population_by_county['Black'] / racial_population_by_county['TotalPop']) * 100
racial_population_by_county['Native_Percentage'] = (racial_population_by_county['Native'] / racial_population_by_county['TotalPop']) * 100
racial_population_by_county['Asian_Percentage'] = (racial_population_by_county['Asian'] / racial_population_by_county['TotalPop']) * 100
racial_population_by_county['Pacific_Percentage'] = (racial_population_by_county['Pacific'] / racial_population_by_county['TotalPop']) * 100

racial_population_by_county

[60]: State County White Black Native Asian Pacific \


0 Alabama Autauga County 867.9 259.3 4.8 7.5 0.4
1 Alabama Baldwin County 2580.6 310.6 27.5 16.8 0.0
2 Alabama Barbour County 414.2 430.7 1.3 4.4 0.0
3 Alabama Bibb County 317.6 70.2 1.5 0.0 0.0
4 Alabama Blount County 779.8 12.0 3.2 1.2 0.0
… … … … … … … …
3215 Wyoming Sweetwater County 959.3 8.2 7.2 7.8 4.9
3216 Wyoming Teton County 330.3 1.8 1.1 8.6 0.0
3217 Wyoming Uinta County 263.8 0.4 2.6 0.3 0.0
3218 Wyoming Washakie County 245.3 0.8 1.1 0.4 0.0
3219 Wyoming Weston County 183.0 1.0 0.2 8.4 0.0

TotalPop White_Percentage Black_Percentage Native_Percentage \


0 55036 1.576968 0.471146 0.008722
1 203360 1.268981 0.152734 0.013523
2 26201 1.580856 1.643830 0.004962
3 22580 1.406554 0.310895 0.006643
4 57667 1.352247 0.020809 0.005549
… … … … …
3215 44527 2.154423 0.018416 0.016170
3216 22923 1.440911 0.007852 0.004799
3217 20758 1.270835 0.001927 0.012525
3218 8253 2.972253 0.009693 0.013328
3219 7117 2.571308 0.014051 0.002810

Asian_Percentage Pacific_Percentage
0 0.013627 0.000727
1 0.008261 0.000000
2 0.016793 0.000000
3 0.000000 0.000000
4 0.002081 0.000000

… … …
3215 0.017517 0.011005
3216 0.037517 0.000000
3217 0.001445 0.000000
3218 0.004847 0.000000
3219 0.118027 0.000000

[3220 rows x 13 columns]

[61]: #Which states have the highest percentage of Black or Hispanic populations?

df['Black_Percentage'] = (df['Black'] / df['TotalPop']) * 100


df['Hispanic_Percentage'] = (df['Hispanic'] / df['TotalPop']) * 100

# Compute state-level Black and Hispanic percentages using weighted average


black_hispanic_percentage_by_state = df.groupby('State').apply(
lambda x: pd.Series({
'Black_Percentage': (x['Black'].sum() / x['TotalPop'].sum()) * 100,
'Hispanic_Percentage': (x['Hispanic'].sum() / x['TotalPop'].sum()) * 100
})
).reset_index()

highest_black_percentage_states = black_hispanic_percentage_by_state.
↪sort_values(by='Black_Percentage', ascending=False).head(10)

print("States with the highest percentage of Black population:")


print(highest_black_percentage_states)

highest_hispanic_percentage_states = black_hispanic_percentage_by_state.
↪sort_values(by='Hispanic_Percentage', ascending=False).head(10)

print("States with the highest percentage of Hispanic population:")


print(highest_hispanic_percentage_states)

# Plot Black population


plt.figure(figsize=(10, 6))
plt.bar(highest_black_percentage_states['State'],␣
↪highest_black_percentage_states['Black_Percentage'])

plt.title('Top 10 States with Highest Percentage of Black Population')


plt.xlabel('State')
plt.ylabel('Black Population Percentage')
plt.xticks(rotation=45)
plt.show()

# Plot Hispanic population


plt.figure(figsize=(10, 6))

plt.bar(highest_hispanic_percentage_states['State'],␣
↪highest_hispanic_percentage_states['Hispanic_Percentage'])

plt.title('Top 10 States with Highest Percentage of Hispanic Population')


plt.xlabel('State')
plt.ylabel('Hispanic Population Percentage')
plt.xticks(rotation=45)
plt.show()

C:\Users\User\AppData\Local\Temp\ipykernel_6148\72117619.py:8:
DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns.
This behavior is deprecated, and in a future version of pandas the grouping
columns will be excluded from the operation. Either pass `include_groups=False`
to exclude the groupings or explicitly select the grouping columns after groupby
to silence this warning.
black_hispanic_percentage_by_state = df.groupby('State').apply(
States with the highest percentage of Black population:
State Black_Percentage Hispanic_Percentage
8 District of Columbia 1.321820 0.263329
24 Mississippi 0.928682 0.064349
18 Louisiana 0.879639 0.121680
0 Alabama 0.766280 0.093264
20 Maryland 0.712222 0.209424
41 South Carolina 0.638939 0.116409
10 Georgia 0.620468 0.165446
22 Michigan 0.489261 0.132382
7 Delaware 0.478197 0.204232
33 North Carolina 0.462836 0.189193
States with the highest percentage of Hispanic population:
State Black_Percentage Hispanic_Percentage
39 Puerto Rico 0.002410 2.526179
31 New Mexico 0.040919 1.101539
4 California 0.115160 0.775820
44 Texas 0.222349 0.739622
2 Arizona 0.085987 0.669387
28 Nevada 0.189703 0.656354
5 Colorado 0.080154 0.481415
9 Florida 0.310578 0.458349
30 New Jersey 0.318835 0.434905
32 New York 0.379462 0.431436

2 Employment and Work Type
What is the employment rate (Employed vs. Unemployed) for each census tract?
How does the rate of self-employed individuals compare to those working in private/public sectors
across different states?
What percentage of the population works from home, and how does it vary by county and state?
How does the unemployment rate vary across different states and counties?
What is the distribution of employed individuals working in private vs. public sectors?

[63]: #What is the employment rate (Employed vs. Unemployed) for each census tract?
# Note: 'Unemployment' is already a tract-level percentage, so dividing it by
# TotalPop gives a per-person figure rather than a conventional unemployment rate.
df['EmploymentRate'] = df['Employed'] / df['TotalPop']
df['UnemploymentRate'] = df['Unemployment'] / df['TotalPop']
df[['State', 'County', 'TractId', 'Employed', 'Unemployment', 'EmploymentRate', 'UnemploymentRate']]

[63]: State County TractId Employed Unemployment \


0 Alabama Autauga County 1001020100 881 4.6
1 Alabama Autauga County 1001020200 852 3.4
2 Alabama Autauga County 1001020300 1482 4.7

3 Alabama Autauga County 1001020400 1849 6.1
4 Alabama Autauga County 1001020500 4787 2.3
… … … … … …
73996 Puerto Rico Yauco Municipio 72153750501 1576 20.8
73997 Puerto Rico Yauco Municipio 72153750502 666 26.3
73998 Puerto Rico Yauco Municipio 72153750503 560 23.0
73999 Puerto Rico Yauco Municipio 72153750601 1062 29.5
74000 Puerto Rico Yauco Municipio 72153750602 759 17.9

EmploymentRate UnemploymentRate
0 0.477507 0.002493
1 0.392265 0.001565
2 0.437814 0.001388
3 0.433326 0.001430
4 0.480381 0.000231
… … …
73996 0.262186 0.003460
73997 0.284372 0.011230
73998 0.252480 0.010370
73999 0.242466 0.006735
74000 0.252916 0.005965

[74001 rows x 7 columns]

[65]: #How does the rate of self-employed individuals compare to those working in private/public sectors across different states?
# Note: 'SelfEmployed', 'PrivateWork' and 'PublicWork' are already percentages
# of employed workers, so dividing by 'Employed' rescales them rather than
# producing a true share.
df['SelfEmployedRate'] = df['SelfEmployed'] / df['Employed']
df['PrivateWorkRate'] = df['PrivateWork'] / df['Employed']
df['PublicWorkRate'] = df['PublicWork'] / df['Employed']

employment_rates_by_state = df.groupby('State')[['SelfEmployedRate',␣
↪'PrivateWorkRate', 'PublicWorkRate']].mean().reset_index()

employment_rates_by_state.head()

employment_rates_by_state.set_index('State').plot(kind='bar', stacked=True,␣
↪figsize=(15, 10))

plt.title('Comparison of Employment Rates Across Different States')


plt.xlabel('State')
plt.ylabel('Rate')
plt.legend(title='Employment Type')
plt.show()

[66]: #What percentage of the population works from home, and how does it vary by county and state?
# Note: 'WorkAtHome' is already a tract-level percentage, so this rescales it
# by population rather than computing a share of people.
df['WorkAtHomePercentage'] = (df['WorkAtHome'] / df['TotalPop']) * 100

work_at_home_by_county = df.groupby(['State',␣
↪'County'])['WorkAtHomePercentage'].mean().reset_index()

work_at_home_by_state = df.groupby('State')['WorkAtHomePercentage'].mean().
↪reset_index()

plt.figure(figsize=(15, 10))
plt.bar(work_at_home_by_state['State'],␣
↪work_at_home_by_state['WorkAtHomePercentage'])

plt.title('Percentage of Population Working from Home by State')


plt.xlabel('State')
plt.ylabel('Work at Home Percentage')
plt.xticks(rotation=90)

plt.show()

[67]: #How does the unemployment rate vary across different states and counties?

unemployment_rate_by_county = df.groupby(['State',␣
↪'County'])['UnemploymentRate'].mean().reset_index()

unemployment_rate_by_state = df.groupby('State')['UnemploymentRate'].mean().
↪reset_index()

plt.figure(figsize=(15, 10))
plt.bar(unemployment_rate_by_state['State'],␣
↪unemployment_rate_by_state['UnemploymentRate'])

plt.title('Unemployment Rate by State')


plt.xlabel('State')
plt.ylabel('Unemployment Rate')
plt.xticks(rotation=90)
plt.show()
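The state figure above is an unweighted mean of tract-level rates; weighting by tract population (a judgment call, sketched here on made-up numbers) gives large tracts proportionally more influence:

```python
import numpy as np
import pandas as pd

# Toy tract data (hypothetical values): 'Unemployment' is a percentage per tract.
toy = pd.DataFrame({
    "State": ["A", "A", "B"],
    "TotalPop": [1000, 3000, 2000],
    "Unemployment": [10.0, 2.0, 5.0],
})

# np.average with weights yields the population-weighted state rate.
weighted = toy.groupby("State")[["Unemployment", "TotalPop"]].apply(
    lambda g: np.average(g["Unemployment"], weights=g["TotalPop"])
).reset_index(name="WeightedUnemployment")
print(weighted)
```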

[68]: #What is the distribution of employed individuals working in private vs. public sectors?
# Note: 'PrivateWork'/'PublicWork' are tract-level percentages, so these sums
# are accumulated percentage points rather than counts of workers.
employment_distribution_by_sector = df.groupby('State')[['PrivateWork', 'PublicWork']].sum().reset_index()

print(employment_distribution_by_sector)

employment_distribution_by_sector.set_index('State').plot(kind='bar',␣
↪stacked=True, figsize=(15, 10))

plt.title('Distribution of Employed Individuals Working in Private vs. Public Sectors')

plt.xlabel('State')
plt.ylabel('Number of Employed Individuals')
plt.legend(title='Employment Sector')
plt.show()

State PrivateWork PublicWork


0 Alabama 93013.8 18161.0
1 Alaska 10828.5 4559.1
2 Arizona 119351.9 22220.5
3 Arkansas 52857.1 11189.5

4 California 620398.1 109397.2
5 Colorado 97844.1 17353.8
6 Connecticut 66745.7 10529.6
7 Delaware 17486.0 3008.6
8 District of Columbia 12648.6 4495.2
9 Florida 339459.8 50039.8
10 Georgia 154053.7 30032.9
11 Hawaii 22185.5 6871.9
12 Idaho 22544.6 4713.8
13 Illinois 257640.6 38630.2
14 Indiana 127561.2 15807.4
15 Iowa 65720.6 10575.5
16 Kansas 59007.9 11671.1
17 Kentucky 87831.4 16510.3
18 Louisiana 88930.0 17008.1
19 Maine 26940.6 4891.5
20 Maryland 101853.0 29729.1
21 Massachusetts 119845.6 17640.1
22 Michigan 231237.5 29144.3
23 Minnesota 109311.0 15853.7
24 Mississippi 49870.3 12135.6
25 Missouri 113203.8 17254.2
26 Montana 18926.0 5122.4
27 Nebraska 41686.2 7305.3
28 Nevada 55822.1 8199.3
29 New Hampshire 23277.0 3879.4
30 New Jersey 162630.5 27280.9
31 New Mexico 34674.1 11583.9
32 New York 379901.5 75897.8
33 North Carolina 172143.5 31173.4
34 North Dakota 14928.3 3435.3
35 Ohio 244297.7 34850.6
36 Oklahoma 80218.8 17329.3
37 Oregon 64121.0 11690.7
38 Pennsylvania 269785.3 33345.0
39 Puerto Rico 60283.9 19938.7
40 Rhode Island 19867.9 2902.2
41 South Carolina 86280.9 16694.5
42 South Dakota 16150.8 3726.8
43 Tennessee 116793.3 20576.8
44 Texas 414609.8 70444.0
45 Utah 46872.6 8617.6
46 Vermont 13926.9 2583.8
47 Virginia 140180.9 37739.8
48 Washington 111172.3 24186.2
49 West Virginia 37205.6 9021.3
50 Wisconsin 114720.4 16816.3
51 Wyoming 9333.9 2849.7

3 Commuting and Transportation
What is the average commuting time across counties and states, and how does it differ for employed
individuals?
What modes of transportation are most commonly used for commuting in different states (e.g., car,
public transportation, walking)?
How does the percentage of people commuting via walking or public transportation vary between
urban and rural areas?

[71]: #What is the average commuting time across counties and states, and how does it␣
↪differ for employed individuals?

average_commute_by_county = df.groupby(['State', 'County'])['MeanCommute'].mean().reset_index()

average_commute_by_state = df.groupby('State')['MeanCommute'].mean().
↪reset_index()

# See how the average commute differs across states

plt.figure(figsize=(15, 10))
plt.bar(average_commute_by_state['State'],␣
↪average_commute_by_state['MeanCommute'],color='y')

plt.title('Average Commuting Time by State')


plt.xlabel('State')
plt.ylabel('Average Commuting Time (minutes)')
plt.xticks(rotation=90)
plt.show()

print(average_commute_by_county)
print(average_commute_by_state)

State County MeanCommute


0 Alabama Autauga County 25.766667
1 Alabama Baldwin County 27.054839
2 Alabama Barbour County 22.744444
3 Alabama Bibb County 31.200000
4 Alabama Blount County 35.011111

… … … …
3215 Wyoming Sweetwater County 20.708333
3216 Wyoming Teton County 14.450000
3217 Wyoming Uinta County 20.233333
3218 Wyoming Washakie County 14.533333
3219 Wyoming Weston County 26.000000

[3220 rows x 3 columns]


State MeanCommute
0 Alabama 24.458638
1 Alaska 18.209639
2 Arizona 24.833444
3 Arkansas 21.739824
4 California 28.720396
5 Colorado 24.836812
6 Connecticut 26.018909
7 Delaware 24.965421
8 District of Columbia 31.087640
9 Florida 26.436147
10 Georgia 27.015984
11 Hawaii 26.475641
12 Idaho 20.638721
13 Illinois 28.584158
14 Indiana 23.149035
15 Iowa 19.248967
16 Kansas 19.061133
17 Kentucky 23.754430
18 Louisiana 24.555298
19 Maine 23.745584
20 Maryland 32.228974
21 Massachusetts 28.636557
22 Michigan 24.467458
23 Minnesota 22.994670
24 Mississippi 23.791450
25 Missouri 23.416955
26 Montana 18.123792
27 Nebraska 18.464839
28 Nevada 23.829056
29 New Hampshire 26.895890
30 New Jersey 31.191014
31 New Mexico 21.963454
32 New York 33.084997
33 North Carolina 24.020849
34 North Dakota 17.738537
35 Ohio 23.213692
36 Oklahoma 21.298177
37 Oregon 23.183981
38 Pennsylvania 26.470801

39 Puerto Rico 28.281087
40 Rhode Island 24.409167
41 South Carolina 24.292173
42 South Dakota 17.077477
43 Tennessee 24.576626
44 Texas 25.205923
45 Utah 21.286770
46 Vermont 23.095082
47 Virginia 27.695833
48 Washington 26.888989
49 West Virginia 25.428306
50 Wisconsin 22.135396
51 Wyoming 18.145802

[72]: #What modes of transportation are most commonly used for commuting in different␣
↪states (e.g., car, public transportation, walking)?

# Calculations
transportation_modes_by_state = df.groupby('State')[['Drive', 'Carpool',␣
↪'Transit', 'Walk', 'OtherTransp']].sum().reset_index()

# Normalize the data to get percentages


transportation_modes_by_state[['Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp']] = (
    transportation_modes_by_state[['Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp']]
    .div(transportation_modes_by_state[['Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp']].sum(axis=1), axis=0) * 100
)

# Sort the data by the 'Drive' column in descending order


transportation_modes_by_state = transportation_modes_by_state.
↪sort_values(by='Drive', ascending=False)

# Horizontal plot of the data


transportation_modes_by_state.set_index('State').plot(kind='barh',␣
↪stacked=True, figsize=(15, 10))

plt.title('Modes of Transportation for Commuting in Different States')


plt.xlabel('Percentage')
plt.ylabel('State')
plt.legend(title='Transportation Mode')
plt.show()

print(transportation_modes_by_state)

State Drive Carpool Transit Walk \
0 Alabama 87.237800 9.464023 0.586360 1.483553
43 Tennessee 86.129360 9.594789 1.083404 1.773425
24 Mississippi 86.086598 10.119360 0.424133 1.671821
29 New Hampshire 85.520991 8.814976 0.895838 3.211881
41 South Carolina 85.126841 10.013474 0.761312 2.265327
3 Arkansas 85.009745 11.042141 0.500057 2.015314
16 Kansas 84.992725 10.007144 0.596194 2.837533
36 Oklahoma 84.907412 10.795471 0.583061 2.210284
33 North Carolina 84.901295 10.254153 1.212449 2.128705
27 Nebraska 84.767305 9.650356 0.817301 3.337992
14 Indiana 84.761873 9.662309 1.482523 2.462258
35 Ohio 84.513781 8.589462 2.775086 2.761280
15 Iowa 84.454221 9.198036 1.067967 3.695741
25 Missouri 84.286308 9.674766 2.303347 2.313148
17 Kentucky 84.152657 10.172520 1.336560 2.625019
49 West Virginia 84.103448 9.925722 1.045823 3.613250
22 Michigan 83.925573 9.733940 2.229668 2.609457
31 New Mexico 83.907252 10.063634 1.180954 2.694058
7 Delaware 83.632857 8.266754 3.799419 2.954340
9 Florida 83.585773 9.656995 2.252973 1.895705
42 South Dakota 83.578985 9.608715 0.613847 4.691166
34 North Dakota 83.455019 9.613547 0.561868 5.015786
44 Texas 83.448608 11.148666 1.682065 1.891720
50 Wisconsin 83.272902 8.829600 2.647031 3.519228

18 Louisiana 82.752882 9.928719 2.361261 2.474868
19 Maine 82.560276 10.427820 0.708804 4.577618
10 Georgia 82.553645 10.903683 2.667940 1.965389
12 Idaho 82.548234 10.779917 0.809219 3.478675
39 Puerto Rico 82.061982 8.496652 2.871417 4.380004
40 Rhode Island 81.987316 9.117606 2.908490 4.454913
23 Minnesota 81.692781 9.298336 3.999825 3.195506
46 Vermont 81.644848 9.479608 1.221118 5.917226
47 Virginia 81.290401 9.740763 4.383332 2.682129
51 Wyoming 81.222942 10.773234 1.309205 4.608724
2 Arizona 81.052286 11.526215 2.146066 2.364155
5 Colorado 80.907299 10.085541 3.281061 3.280014
6 Connecticut 80.681142 8.678442 5.921638 3.466852
28 Nevada 80.267411 10.829486 3.976415 2.597373
45 Utah 80.194172 12.128832 2.663132 2.935868
26 Montana 79.196714 10.630047 0.787306 6.936447
38 Pennsylvania 77.786649 9.225055 6.836267 4.608836
4 California 77.665634 11.035172 5.574819 2.972786
48 Washington 77.643134 10.693114 5.855954 3.677865
37 Oregon 76.806406 10.869483 4.276663 4.449932
20 Maryland 75.852719 9.589507 10.078988 2.900839
13 Illinois 75.248824 8.560149 11.003480 3.269925
30 New Jersey 73.641468 8.522955 12.381831 3.447695
21 Massachusetts 72.916093 8.115665 10.971217 5.811922
11 Hawaii 69.115348 14.034894 6.945447 5.954481
1 Alaska 67.163846 12.427795 1.499818 11.904292
32 New York 55.260617 7.185068 29.055900 6.596035
8 District of Columbia 38.846324 6.042644 37.586481 11.469521

OtherTransp
0 1.228263
43 1.419022
24 1.698087
29 1.556314
41 1.833046
3 1.432743
16 1.566404
36 1.503772
33 1.503398
27 1.427046
14 1.631036
35 1.360391
15 1.584035
25 1.422432
17 1.713244
49 1.311757
22 1.501362
31 2.154102

7 1.346630
9 2.608554
42 1.507287
34 1.353779
44 1.828942
50 1.731240
18 2.482271
19 1.725482
10 1.909344
12 2.383955
39 2.189944
40 1.531675
23 1.813552
46 1.737200
47 1.903376
51 2.085894
2 2.911277
5 2.446085
6 1.251926
28 2.329315
45 2.077996
26 2.449487
38 1.543194
4 2.751590
48 2.129932
37 3.597516
20 1.577947
13 1.917621
30 2.006051
21 2.185103
11 3.949830
1 7.004248
32 1.902380
8 6.055030
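The `.div(..., axis=0)` row-wise normalisation used in the cell above, isolated on a toy frame (made-up values):

```python
import pandas as pd

# Two rows of raw values (hypothetical); each row is scaled to sum to 100.
modes = pd.DataFrame({"Drive": [60.0, 40.0], "Walk": [20.0, 10.0]})
pct = modes.div(modes.sum(axis=1), axis=0) * 100
print(pct)
```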

[73]: #How does the percentage of people commuting via walking or public␣
↪transportation vary between urban and rural areas?

df['AreaType'] = np.where(df['TotalPop'] > 5000, 'Urban', 'Rural')

commute_modes_by_area = df.groupby('AreaType')[['Walk', 'Transit']].mean().reset_index()

# Plot the data

commute_modes_by_area.set_index('AreaType').plot(kind='bar', stacked=True,␣
↪figsize=(10, 6))

plt.title('Percentage of People Commuting via Walking or Public Transportation (Urban vs. Rural)')

plt.xlabel('Area Type')
plt.ylabel('Percentage')
plt.legend(title='Commute Mode')
plt.show()

commute_modes_by_area

[73]: AreaType Walk Transit


0 Rural 3.36198 5.855194
1 Urban 2.39732 4.464799
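The single 5000 cutoff behind `AreaType` is an ad-hoc threshold; `pd.cut` generalises it to several population bands (the bin edges and labels below are arbitrary choices, not from the source):

```python
import pandas as pd

# Hypothetical tract populations and arbitrary bin edges.
pop = pd.Series([800, 4200, 7600, 20000])
area = pd.cut(
    pop,
    bins=[0, 2500, 5000, 10000, float("inf")],
    labels=["Rural", "Small town", "Suburban", "Urban"],
)
print(area.tolist())  # → ['Rural', 'Small town', 'Suburban', 'Urban']
```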

4 Income and Housing


What is the average income (or median household income) in each state and county?
How does the distribution of housing type (e.g., owner-occupied vs. renter-occupied) vary across
different counties?
How does the cost of living compare across different states based on average income and housing costs?

[77]: #What is the average income (or median household income) in each state and␣
↪county?

average_income_by_county = df.groupby(['State', 'County'])['Income'].mean().reset_index()

average_income_by_state = df.groupby('State')['Income'].mean().reset_index()

print(average_income_by_county.head())
print(average_income_by_state.head())

State County Income


0 Alabama Autauga County 53567.500000
1 Alabama Baldwin County 52732.225806
2 Alabama Barbour County 32717.777778
3 Alabama Bibb County 44677.000000
4 Alabama Blount County 46325.555556
State Income
0 Alabama 45938.212947
1 Alaska 73796.757576
2 Arizona 57815.571807
3 Arkansas 44245.267936
4 California 73070.965821

# Calculate the total number of owner-occupied and renter-occupied housing units in each county

# housing_distribution_by_county = df.groupby(['State',␣
↪'County'])[['OwnerOccupied', 'RenterOccupied']].sum().reset_index()

# # Plot the distribution


# fig, ax = plt.subplots(figsize=(15, 10))
# housing_distribution_by_county.set_index(['State', 'County']).
↪plot(kind='bar', stacked=True, ax=ax)

# ax.set_title('Distribution of Housing Type (Owner-Occupied vs. Renter-Occupied) Across Different Counties')

# ax.set_xlabel('County')
# ax.set_ylabel('Number of Housing Units')
# plt.legend(title='Housing Type')
# plt.show()
'''
(Current dataset does not include owner/renter data, so this needs additional information.)
'''
print(df.columns)

Index(['TractId', 'State', 'County', 'TotalPop', 'Men', 'Women', 'Hispanic',


'White', 'Black', 'Native', 'Asian', 'Pacific', 'VotingAgeCitizen',

'Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr', 'Poverty',
'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction',
'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp',
'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork',
'SelfEmployed', 'FamilyWork', 'Unemployment', 'White_Percentage',
'Black_Percentage', 'Native_Percentage', 'Asian_Percentage',
'Pacific_Percentage', 'Hispanic_Percentage', 'MaleToFemaleRatio',
'EmploymentRate', 'UnemploymentRate', 'SelfEmployedRate',
'PrivateWorkRate', 'PublicWorkRate', 'WorkAtHomePercentage',
'AreaType'],
dtype='object')
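If a richer extract did carry tenure data, the commented-out analysis could look like this sketch ('OwnerOccupied' and 'RenterOccupied' are hypothetical column names, not present in this dataset):

```python
import pandas as pd

# Hypothetical tenure data; these columns do not exist in the census frame above.
housing = pd.DataFrame({
    "County": ["X", "X", "Y"],
    "OwnerOccupied": [500, 300, 900],
    "RenterOccupied": [200, 400, 100],
})
by_county = housing.groupby("County")[["OwnerOccupied", "RenterOccupied"]].sum()

# Convert unit counts to per-county percentage shares.
share = by_county.div(by_county.sum(axis=1), axis=0) * 100
print(share.round(1))
```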

[79]: #How does the cost of living compare across different states based on average␣
↪income and housing costs?

average_income_by_state = df.groupby('State')['Income'].mean().reset_index()
average_per_capita_income_by_state = df.groupby('State')['IncomePerCap'].mean().
↪reset_index()

cost_of_living_by_state = pd.merge(average_income_by_state,␣
↪average_per_capita_income_by_state, on='State')

cost_of_living_by_state.columns = ['State', 'AverageIncome', 'PerCapitaIncome']  # Better naming

cost_of_living_by_state.set_index('State').plot(kind='bar', figsize=(15, 8), colormap='coolwarm')

plt.title('Comparison of Income Levels Across Different States')


plt.xlabel('State')
plt.ylabel('Amount in USD')
plt.legend(title='Income Type')
plt.xticks(rotation=90)
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.show()
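The `pd.merge` join used in the cell above, isolated on toy frames (made-up values):

```python
import pandas as pd

# Two per-state summaries joined on the shared 'State' key.
a = pd.DataFrame({"State": ["A", "B"], "AverageIncome": [50000, 60000]})
b = pd.DataFrame({"State": ["A", "B"], "PerCapitaIncome": [25000, 30000]})
merged = pd.merge(a, b, on="State")
print(merged.columns.tolist())  # → ['State', 'AverageIncome', 'PerCapitaIncome']
```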

5 Social Characteristics
What is the relationship between education levels (e.g., percentage with a high school diploma,
bachelor’s degree) and employment types across different states?

[81]: #What is the relationship between education levels (e.g., percentage with a
# high school diploma, bachelor’s degree) and employment types across different states?

'''
(Current dataset does not include direct education data, but employment types are available.)
'''

employment_by_state = df.groupby('State')[['Professional', 'Service', 'Office',␣


↪'Construction', 'Production']].mean().reset_index()

# Plot employment distribution


employment_by_state.plot(x='State', kind='bar', stacked=True, figsize=(15, 10))
plt.title('Employment Type Distribution Across States')
plt.xlabel('State')
plt.ylabel('Average Employment Type Percentage')

35
plt.legend(title='Employment Type')
plt.xticks(rotation=90)
plt.show()

D24AIML081_AS7

April 1, 2025

[45]: import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# Load the Iris dataset
iris = sns.load_dataset("iris")

# 1. Descriptive Statistics
print("Descriptive Statistics:")
print(iris.describe())
print("\nMedian values:")
print(iris.median(numeric_only=True))

Descriptive Statistics:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

Median values:
sepal_length 5.80
sepal_width 3.00
petal_length 4.35
petal_width 1.30
dtype: float64

[46]: # 1. Calculate basic descriptive statistics such as the mean, median, standard deviation, and more for each of the numeric columns.
[47]: import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
print("Dataset Preview:\n", df.head())
stats = df.describe().T
stats['median'] = df.median()
print("\nDescriptive Statistics:\n", stats)

Dataset Preview:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

Descriptive Statistics:
count mean std min 25% 50% 75% max median
sepal length (cm) 150.0 5.843333 0.828066 4.3 5.1 5.80 6.4 7.9 5.80
sepal width (cm) 150.0 3.057333 0.435866 2.0 2.8 3.00 3.3 4.4 3.00
petal length (cm) 150.0 3.758000 1.765298 1.0 1.6 4.35 5.1 6.9 4.35
petal width (cm) 150.0 1.199333 0.762238 0.1 0.3 1.30 1.8 2.5 1.30

[48]: # 2. Normal Distribution (Check for Normality): check whether `sepal_length` follows a normal distribution using a histogram and a Q-Q plot.

[53]: plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.histplot(df['sepal length (cm)'], kde=True, bins=20)
plt.title("Histogram of Sepal Length")

plt.subplot(1, 2, 2)
stats.probplot(df['sepal length (cm)'].values, dist="norm", plot=plt)  # scipy.stats is already imported above as stats
plt.title("Q-Q Plot of Sepal Length")

plt.show()
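The histogram and Q-Q plot are visual checks; a formal test can complement them. A sketch using scipy's Shapiro-Wilk test, whose null hypothesis is that the sample is drawn from a normal distribution:

```python
import pandas as pd
import scipy.stats as stats
from sklearn.datasets import load_iris

iris = load_iris()
sepal_length = pd.DataFrame(iris.data, columns=iris.feature_names)['sepal length (cm)']

# Shapiro-Wilk: small p-values indicate departure from normality
stat, p_value = stats.shapiro(sepal_length)
print(f"W = {stat:.4f}, p = {p_value:.4f}")
print("Consistent with normality" if p_value > 0.05 else "Normality rejected at the 5% level")
```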
[55]: # 3. Hypothesis Testing (One-Sample t-Test): perform a one-sample t-test to check if the average `sepal_length` is different from 5.0.

[57]: t_stat, p_value = stats.ttest_1samp(df['sepal length (cm)'], 5.0)

print("One-Sample t-Test Results:")
print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The average sepal length is significantly different from 5.0.")
else:
    print("Fail to reject the null hypothesis: No significant difference from 5.0.")

One-Sample t-Test Results:
T-Statistic: 12.4733
P-Value: 0.0000
Reject the null hypothesis: The average sepal length is significantly different
from 5.0.

[59]: # 4. Correlation Analysis: calculate the Pearson correlation coefficient between `sepal_length` and `petal_length` to see if they are related.

[61]: correlation, p_value = stats.pearsonr(df['sepal length (cm)'], df['petal length (cm)'])

print("Pearson Correlation Analysis:")
print(f"Correlation Coefficient: {correlation:.4f}")
print(f"P-Value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("The correlation is statistically significant.")
else:
    print("The correlation is not statistically significant.")

Pearson Correlation Analysis:
Correlation Coefficient: 0.8718
P-Value: 0.0000
The correlation is statistically significant.

[63]: # 5. Simple Linear Regression: perform a simple linear regression to predict `petal_length` based on `sepal_length`.

[65]: # (This cell only reports the correlation between the two variables; the regression fit itself is not computed here.)
corr = stats.pearsonr(df["sepal length (cm)"], df["petal length (cm)"])
print(f"Pearson Correlation: {corr}")

Pearson Correlation: PearsonRResult(statistic=0.8717537758865832,
pvalue=1.0386674194497525e-47)
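Question 5 asks for the regression itself; a minimal fit with scikit-learn's `LinearRegression` (the same class imported at the top of the notebook), predicting petal length from sepal length:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Fit petal_length = slope * sepal_length + intercept
X = df[['sepal length (cm)']]  # predictor (2-D, as sklearn expects)
y = df['petal length (cm)']    # response

model = LinearRegression().fit(X, y)
print(f"slope: {model.coef_[0]:.4f}, intercept: {model.intercept_:.4f}")
print(f"R^2: {model.score(X, y):.4f}")  # for simple regression this equals the squared Pearson r above
```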

[67]: # 6. ANOVA (One-Way Analysis of Variance): perform an ANOVA test to check if there is a significant difference in the `sepal_length` between different species.

[69]: df['species'] = [iris.target_names[i] for i in load_iris().target]

# Group data by species
species_groups = [df[df['species'] == species]['sepal length (cm)'] for species in iris.target_names]

# Perform one-way ANOVA
f_stat, p_value = stats.f_oneway(*species_groups)

print("One-Way ANOVA Results:")
print(f"F-Statistic: {f_stat:.4f}")
print(f"P-Value: {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in sepal length between species.")
else:
    print("Fail to reject the null hypothesis: No significant difference in sepal length between species.")

One-Way ANOVA Results:
F-Statistic: 119.2645
P-Value: 0.0000
Reject the null hypothesis: There is a significant difference in sepal length
between species.
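ANOVA reports that at least one species mean differs, but not which pairs; a post-hoc Tukey HSD test answers that. A sketch using statsmodels (an extra dependency beyond what the cell above uses):

```python
import pandas as pd
from sklearn.datasets import load_iris
from statsmodels.stats.multicomp import pairwise_tukeyhsd

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = [iris.target_names[i] for i in iris.target]

# Pairwise species comparisons with a family-wise error rate of 0.05
tukey = pairwise_tukeyhsd(df['sepal length (cm)'], df['species'], alpha=0.05)
print(tukey.summary())
```

For sepal length, all three pairwise comparisons come out significant, which is consistent with the large F-statistic above.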

[71]: # 7. Chi-Square Test for Independence: perform a Chi-Square test to see if there is an association between `species` and `sepal_width`.

[73]: df['sepal_width_category'] = pd.cut(df['sepal width (cm)'], bins=3)

species_s_width = pd.crosstab(df['species'], df['sepal_width_category'])
chi2, p, dof, expected = stats.chi2_contingency(species_s_width)

print(f"Chi-Square Statistic: {chi2:.4f}")
print(f"P-value: {p:.4f}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies Table:")
print(pd.DataFrame(expected, index=species_s_width.index, columns=species_s_width.columns))

alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis: There is an association between species and sepal width.")
else:
    print("Fail to reject the null hypothesis: No significant association between species and sepal width.")

Chi-Square Statistic: 45.1247
P-value: 0.0000
Degrees of Freedom: 4
Expected Frequencies Table:
sepal_width_category (1.998, 2.8] (2.8, 3.6] (3.6, 4.4]
species
setosa 15.666667 29.333333 5.0
versicolor 15.666667 29.333333 5.0
virginica 15.666667 29.333333 5.0
Reject the null hypothesis: There is an association between species and sepal
width.

[75]: # 8. Calculate the 95% confidence interval for the `petal_length` for each species. Use the `petal_length` column and apply the `groupby()` function to compute the confidence interval by species.

[77]: def confidence_interval(data, confidence=0.95):
    n = len(data)
    mean = np.mean(data)
    std_err = stats.sem(data)
    margin_of_error = std_err * stats.t.ppf((1 + confidence) / 2, n - 1)
    return mean - margin_of_error, mean + margin_of_error

df['species'] = [iris.target_names[i] for i in load_iris().target]

ci_results = df.groupby('species')['petal length (cm)'].apply(confidence_interval)

print("95% Confidence Intervals for Petal Length by Species:")
print(ci_results)

95% Confidence Intervals for Petal Length by Species:
species
setosa (1.4126452382875103, 1.51135476171249)
versicolor (4.126452777905478, 4.393547222094521)
virginica (5.395153262927524, 5.708846737072477)
Name: petal length (cm), dtype: object
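The hand-rolled interval can be cross-checked with `scipy.stats.t.interval`, which wraps the same mean plus/minus t times SEM computation in a single call; a sketch for the setosa group:

```python
import pandas as pd
import scipy.stats as stats
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = [iris.target_names[i] for i in iris.target]

setosa = df[df['species'] == 'setosa']['petal length (cm)']
# t.interval(confidence, degrees of freedom, center, scale) reproduces the manual formula
low, high = stats.t.interval(0.95, len(setosa) - 1, loc=setosa.mean(), scale=stats.sem(setosa))
print(f"setosa petal length 95% CI: ({low:.4f}, {high:.4f})")
```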

[79]: # 9. Find the correlation between `petal_length` and `petal_width`. Is it a strong positive, weak positive, or negative correlation? Provide the correlation coefficient and p-value.

[81]: correlation, p_value = stats.pearsonr(df['petal length (cm)'], df['petal width (cm)'])

print("Pearson Correlation Analysis:")
print(f"Correlation Coefficient: {correlation:.4f}")
print(f"P-Value: {p_value:.4f}")

if correlation > 0.7:
    strength = "strong positive"
elif 0.3 < correlation <= 0.7:
    strength = "moderate positive"
elif 0 < correlation <= 0.3:
    strength = "weak positive"
elif -0.3 <= correlation < 0:
    strength = "weak negative"
elif -0.7 <= correlation < -0.3:
    strength = "moderate negative"
else:
    strength = "strong negative"

print(f"The correlation is {strength}.")

Pearson Correlation Analysis:
Correlation Coefficient: 0.9629
P-Value: 0.0000
The correlation is strong positive.
[83]: # 10. Conduct a Chi-Square test to see if there is an association between the `season` and `species`. You will need to categorize the `season` column (Spring, Summer, Fall, Winter) and check if the distribution of species varies by season.

[85]: np.random.seed(42)
seasons = ['Spring', 'Summer', 'Fall', 'Winter']
df['season'] = np.random.choice(seasons, size=len(df))

contingency_table = pd.crosstab(df['species'], df['season'])
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)

print("Chi-Square Test for Independence Results:")
print(f"Chi-Square Statistic: {chi2_stat:.4f}")
print(f"P-Value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies Table:")
print(pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns))

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is an association between species and season.")
else:
    print("Fail to reject the null hypothesis: No significant association between species and season.")

Chi-Square Test for Independence Results:
Chi-Square Statistic: 2.6935
P-Value: 0.8462
Degrees of Freedom: 6
Expected Frequencies Table:
season Fall Spring Summer Winter
species
setosa 11.333333 11.333333 12.0 15.333333
versicolor 11.333333 11.333333 12.0 15.333333
virginica 11.333333 11.333333 12.0 15.333333
Fail to reject the null hypothesis: No significant association between species
and season.

[87]: # 11. Calculate the Z-scores for `sepal_length` and identify if any values are outliers (with a threshold of ±3). How many outliers do you find?

[89]: df['sepal_length_zscore'] = (df['sepal length (cm)'] - df['sepal length (cm)'].mean()) / df['sepal length (cm)'].std()

outliers = df[np.abs(df['sepal_length_zscore']) > 3]

# Display results
print("Number of outliers in Sepal Length:", len(outliers))
print(outliers[['sepal length (cm)', 'sepal_length_zscore']])

Number of outliers in Sepal Length: 0
Empty DataFrame
Columns: [sepal length (cm), sepal_length_zscore]
Index: []
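Z-scores flag nothing here; the IQR rule (Tukey's fences) is a common alternative outlier check that does not assume normality. A sketch on the same column:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
s = pd.DataFrame(iris.data, columns=iris.feature_names)['sepal length (cm)']

# Tukey's fences: values beyond 1.5 * IQR from the quartiles are flagged
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print("IQR outliers in sepal length:", len(outliers))
```

Both rules agree for this column: no sepal-length values fall outside the fences.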

[91]: # 12. Create a pair plot to visualize the relationships between `sepal_length`, `sepal_width`, `petal_length`, and `petal_width`. Based on the plot, describe any patterns or correlations you observe.

[93]: sns.pairplot(df, vars=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], hue='species', diag_kind='kde')
plt.show()
PMRP PRACTICAL 8

April 1, 2025

[1]: import pandas as pd

[20]: import numpy as np
import matplotlib.pyplot as plt

[28]: data = pd.read_csv(r"/home/mandeep/Life Expectancy Data.csv")

1 Q1 : What are the columns and data types in the dataset?

[22]: data_types = data.dtypes
data_types

[22]: Country object


Year int64
Status object
Life expectancy float64
Adult Mortality float64
infant deaths int64
Alcohol float64
percentage expenditure float64
Hepatitis B float64
Measles int64
BMI float64
under-five deaths int64
Polio float64
Total expenditure float64
Diphtheria float64
HIV/AIDS float64
GDP float64
Population float64
thinness 1-19 years float64
thinness 5-9 years float64
Income composition of resources float64
Schooling float64
dtype: object

2 Q2 : How many missing values are present in the columns?
[23]: null_values = data.isnull().sum()
null_values

[23]: Country 0
Year 0
Status 0
Life expectancy 10
Adult Mortality 10
infant deaths 0
Alcohol 194
percentage expenditure 0
Hepatitis B 553
Measles 0
BMI 34
under-five deaths 0
Polio 19
Total expenditure 226
Diphtheria 19
HIV/AIDS 0
GDP 448
Population 652
thinness 1-19 years 34
thinness 5-9 years 34
Income composition of resources 167
Schooling 163
dtype: int64
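Columns such as GDP and Population have hundreds of gaps; a common first pass is median imputation, which is robust to the skew typical of economic data. A minimal sketch on a synthetic stand-in (the values below are made up; the real `data['GDP']` column would be filled the same way):

```python
import pandas as pd
import numpy as np

# Synthetic stand-in: a numeric column with gaps, like 'GDP' or 'Population' above
df = pd.DataFrame({'GDP': [584.3, np.nan, 631.7, np.nan, 547.4]})

# Fill missing values with the column median
df['GDP_filled'] = df['GDP'].fillna(df['GDP'].median())
print(df['GDP_filled'].isnull().sum())  # no missing values remain
```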

3 Q3 : What are the unique values for categorical columns? What is the distribution of life expectancy across different countries?

[24]: County = data['Country'].unique()
status = data['Status'].unique()
print(County)
print(status)

['Afghanistan' 'Albania' 'Algeria' 'Angola' 'Antigua and Barbuda'


'Argentina' 'Armenia' 'Australia' 'Austria' 'Azerbaijan' 'Bahamas'
'Bahrain' 'Bangladesh' 'Barbados' 'Belarus' 'Belgium' 'Belize' 'Benin'
'Bhutan' 'Bolivia (Plurinational State of)' 'Bosnia and Herzegovina'
'Botswana' 'Brazil' 'Brunei Darussalam' 'Bulgaria' 'Burkina Faso'
'Burundi' "Côte d'Ivoire" 'Cabo Verde' 'Cambodia' 'Cameroon' 'Canada'
'Central African Republic' 'Chad' 'Chile' 'China' 'Colombia' 'Comoros'
'Congo' 'Cook Islands' 'Costa Rica' 'Croatia' 'Cuba' 'Cyprus' 'Czechia'
"Democratic People's Republic of Korea"
'Democratic Republic of the Congo' 'Denmark' 'Djibouti' 'Dominica'

'Dominican Republic' 'Ecuador' 'Egypt' 'El Salvador' 'Equatorial Guinea'
'Eritrea' 'Estonia' 'Ethiopia' 'Fiji' 'Finland' 'France' 'Gabon' 'Gambia'
'Georgia' 'Germany' 'Ghana' 'Greece' 'Grenada' 'Guatemala' 'Guinea'
'Guinea-Bissau' 'Guyana' 'Haiti' 'Honduras' 'Hungary' 'Iceland' 'India'
'Indonesia' 'Iran (Islamic Republic of)' 'Iraq' 'Ireland' 'Israel'
'Italy' 'Jamaica' 'Japan' 'Jordan' 'Kazakhstan' 'Kenya' 'Kiribati'
'Kuwait' 'Kyrgyzstan' "Lao People's Democratic Republic" 'Latvia'
'Lebanon' 'Lesotho' 'Liberia' 'Libya' 'Lithuania' 'Luxembourg'
'Madagascar' 'Malawi' 'Malaysia' 'Maldives' 'Mali' 'Malta'
'Marshall Islands' 'Mauritania' 'Mauritius' 'Mexico'
'Micronesia (Federated States of)' 'Monaco' 'Mongolia' 'Montenegro'
'Morocco' 'Mozambique' 'Myanmar' 'Namibia' 'Nauru' 'Nepal' 'Netherlands'
'New Zealand' 'Nicaragua' 'Niger' 'Nigeria' 'Niue' 'Norway' 'Oman'
'Pakistan' 'Palau' 'Panama' 'Papua New Guinea' 'Paraguay' 'Peru'
'Philippines' 'Poland' 'Portugal' 'Qatar' 'Republic of Korea'
'Republic of Moldova' 'Romania' 'Russian Federation' 'Rwanda'
'Saint Kitts and Nevis' 'Saint Lucia' 'Saint Vincent and the Grenadines'
'Samoa' 'San Marino' 'Sao Tome and Principe' 'Saudi Arabia' 'Senegal'
'Serbia' 'Seychelles' 'Sierra Leone' 'Singapore' 'Slovakia' 'Slovenia'
'Solomon Islands' 'Somalia' 'South Africa' 'South Sudan' 'Spain'
'Sri Lanka' 'Sudan' 'Suriname' 'Swaziland' 'Sweden' 'Switzerland'
'Syrian Arab Republic' 'Tajikistan' 'Thailand'
'The former Yugoslav republic of Macedonia' 'Timor-Leste' 'Togo' 'Tonga'
'Trinidad and Tobago' 'Tunisia' 'Turkey' 'Turkmenistan' 'Tuvalu' 'Uganda'
'Ukraine' 'United Arab Emirates'
'United Kingdom of Great Britain and Northern Ireland'
'United Republic of Tanzania' 'United States of America' 'Uruguay'
'Uzbekistan' 'Vanuatu' 'Venezuela (Bolivarian Republic of)' 'Viet Nam'
'Yemen' 'Zambia' 'Zimbabwe']
['Developing' 'Developed']

[29]: print(data.groupby('Country')['Life expectancy'].mean().sort_values(ascending=False).head(10))

data.groupby('Country')['Life expectancy'].mean().sort_values(ascending=False).head(10).plot(kind='bar', figsize=(20,5))
plt.title("Life Expectancy Across Different Countries : ")

Country
Japan 82.53750
Sweden 82.51875
Iceland 82.44375
Switzerland 82.33125
France 82.21875
Italy 82.18750
Spain 82.06875
Australia 81.81250
Norway 81.79375
Canada 81.68750
Name: Life expectancy, dtype: float64

[29]: Text(0.5, 1.0, 'Life Expectancy Across Different Countries : ')

4 Q4 : What is the correlation between life expectancy and other numerical features?

[41]: data.columns = data.columns.str.strip()
numeric_df = data.select_dtypes(include=["number"])
correlation_matrix = numeric_df.corr()
print(correlation_matrix["Life expectancy"].sort_values(ascending=False))

Life expectancy 1.000000


Schooling 0.751975
Income composition of resources 0.724776
BMI 0.567694
Diphtheria 0.479495
Polio 0.465556
GDP 0.461455
Alcohol 0.404877
percentage expenditure 0.381864
Hepatitis B 0.256762
Total expenditure 0.218086
Year 0.170033
Population -0.021538
Measles -0.157586
infant deaths -0.196557
under-five deaths -0.222529
thinness 5-9 years -0.471584
thinness 1-19 years -0.477183
HIV/AIDS -0.556556
Adult Mortality -0.696359
Name: Life expectancy, dtype: float64
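Schooling tops the correlation list; a scatter plot makes such a relationship visible. A sketch on synthetic data shaped like the Schooling / Life expectancy pair (the slope and noise levels here are assumptions for illustration, not estimates from the dataset):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in shaped like the Schooling / Life expectancy pair above
rng = np.random.default_rng(0)
schooling = rng.uniform(5, 15, 100)                       # years of schooling
life_exp = 45 + 2.5 * schooling + rng.normal(0, 3, 100)   # assumed linear trend plus noise

r = np.corrcoef(schooling, life_exp)[0, 1]
plt.scatter(schooling, life_exp, s=10)
plt.xlabel('Schooling (years)')
plt.ylabel('Life expectancy')
plt.title(f'Schooling vs Life expectancy (r = {r:.2f})')
print(round(r, 2))
```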

5 Q5 : What are the top 10 countries with the highest and lowest GDP?
[30]: highest_gdp = data.groupby('Country')['GDP'].mean().sort_values(ascending=False).head(10)
lowest_gdp = data.groupby('Country')['GDP'].mean().sort_values(ascending=True).head(10)

print("TOP 10 Country WITH HIGHEST GDP ", highest_gdp)
print("TOP 10 Country WITH LOWEST GDP ", lowest_gdp)

TOP 10 Country WITH HIGHEST GDP Country


Switzerland 57362.874601
Luxembourg 53257.012741
Qatar 40748.444104
Netherlands 34964.719797
Australia 34637.565047
Ireland 33835.272005
Austria 33827.476309
Denmark 33067.407916
Singapore 32790.105907
Kuwait 31914.378339
Name: GDP, dtype: float64
TOP 10 Country WITH LOWEST GDP Country
Nauru 136.183210
Burundi 137.815321
Malawi 237.504042
Liberia 246.281748
Eritrea 259.395356
Niger 259.782441
Ethiopia 264.970950
Sierra Leone 271.505561
Senegal 274.611166
Guinea 279.464798
Name: GDP, dtype: float64

6 Q6 : What is the trend of Life Expectancy over the years for different regions?
[31]: data.groupby('Year')['Life expectancy'].mean().plot(kind='line', figsize=(10,5))
plt.title("What is the trend of Life Expectancy over the years for different regions:")

[31]: Text(0.5, 1.0, 'What is the trend of Life Expectancy over the years for
different regions:')

7 Q7 : How does adult mortality impact life expectancy across countries?
[32]: data.groupby('Country')['Adult Mortality'].mean().sort_values(ascending=True).plot(kind='bar', figsize=(20,10))

[32]: <AxesSubplot: xlabel='Country'>

8 Q8 : Is there a significant relationship between life expectancy and GDP per capita?
[33]: # note: grouping by the continuous GDP column yields one group per unique value; a scatter plot of GDP vs life expectancy would show the relationship more directly
data.groupby('GDP')['Life expectancy'].mean().sort_values(ascending=False).head(15).plot(kind='bar', figsize=(20,5))
plt.title("Is there a significant relationship between life expectancy and GDP per capita?")

[33]: Text(0.5, 1.0, 'Is there a significant relationship between life expectancy and GDP per capita?')

9 Q9 : How does alcohol consumption relate to life expectancy?
[34]: data.groupby('Country')['Alcohol'].mean().sort_values().head(20).plot(kind='bar', figsize=(20,5))
plt.title("Alcohol Consumption relates to Life Expectancy")

[34]: Text(0.5, 1.0, 'Alcohol Consumption relates to Life Expectancy')

10 Q10 : What is the impact of BMI on life expectancy in different countries?
[35]: print(data.groupby('Country')['BMI'].mean().sort_values().head(20))
data.groupby('Country')['BMI'].mean().sort_values().head(20).plot(kind='bar', figsize=(20,5))
plt.xlabel("Countries")
plt.ylabel("BMI VALUE")
plt.title("BMI relates to Life Expectancy")

Country
Saint Kitts and Nevis 5.20000
Viet Nam 11.18750
Bangladesh 12.87500
Lao People's Democratic Republic 14.36250
Timor-Leste 14.55000
Rwanda 14.75000
Madagascar 14.76875
India 14.79375
Ethiopia 14.80000
Eritrea 15.15625
Nepal 15.17500
Burundi 15.31250
Cambodia 15.36250
Burkina Faso 15.50000
Afghanistan 15.51875
Uganda 15.52500
Kenya 15.56250
Democratic Republic of the Congo 15.83750
Mozambique 16.14375
Chad 16.31875
Name: BMI, dtype: float64

[35]: Text(0.5, 1.0, 'BMI relates to Life Expectancy')

11 Q11 : Does immunization coverage (Hepatitis B, Polio) affect life expectancy?
[36]: print(data.groupby('Country')['Hepatitis B'].mean().sort_values(ascending=False))
data.groupby('Country')['Hepatitis B'].mean().sort_values(ascending=False).head(50).plot(kind='bar', figsize=(20,5))
plt.title("HEPATITIS B V LIFE EXPECTANCY")

Country
Palau 99.0000
Monaco 99.0000
Niue 99.0000
Fiji 98.8750
Oman 98.8125

Japan NaN
Norway NaN
Slovenia NaN
Switzerland NaN
United Kingdom of Great Britain and Northern Ireland NaN
Name: Hepatitis B, Length: 193, dtype: float64

[36]: Text(0.5, 1.0, 'HEPATITIS B V LIFE EXPECTANCY')

[37]: print(data.groupby('Country')['Polio'].mean().sort_values(ascending=False))
data.groupby('Country')['Polio'].mean().sort_values(ascending=False).head(50).plot(kind='bar', figsize=(20,5))
plt.title("POLIO V LIFE EXPECTANCY")  # original title was copied unchanged from the Hepatitis B cell

Country
Niue 99.0000
Monaco 99.0000
Palau 99.0000
Hungary 98.9375
Cuba 98.6875

Nigeria 41.3125
Equatorial Guinea 36.8750
Chad 32.8750
Somalia 29.8125
Tuvalu 9.0000
Name: Polio, Length: 193, dtype: float64

[37]: Text(0.5, 1.0, 'POLIO V LIFE EXPECTANCY')

[38]: data.info  # note: without parentheses this displays the frame repr; data.info() would print the column summary

[38]: <bound method DataFrame.info of Country Year Status Life


expectancy Adult Mortality \
0 Afghanistan 2015 Developing 65.0 263.0
1 Afghanistan 2014 Developing 59.9 271.0
2 Afghanistan 2013 Developing 59.9 268.0
3 Afghanistan 2012 Developing 59.5 272.0
4 Afghanistan 2011 Developing 59.2 275.0
… … … … … …
2933 Zimbabwe 2004 Developing 44.3 723.0
2934 Zimbabwe 2003 Developing 44.5 715.0
2935 Zimbabwe 2002 Developing 44.8 73.0
2936 Zimbabwe 2001 Developing 45.3 686.0
2937 Zimbabwe 2000 Developing 46.0 665.0

infant deaths Alcohol percentage expenditure Hepatitis B Measles \
0 62 0.01 71.279624 65.0 1154
1 64 0.01 73.523582 62.0 492
2 66 0.01 73.219243 64.0 430
3 69 0.01 78.184215 67.0 2787
4 71 0.01 7.097109 68.0 3013
… … … … … …
2933 27 4.36 0.000000 68.0 31
2934 26 4.06 0.000000 7.0 998
2935 25 4.43 0.000000 73.0 304
2936 25 1.72 0.000000 76.0 529
2937 24 1.68 0.000000 79.0 1483

… Polio Total expenditure Diphtheria HIV/AIDS GDP \


0 … 6.0 8.16 65.0 0.1 584.259210
1 … 58.0 8.18 62.0 0.1 612.696514
2 … 62.0 8.13 64.0 0.1 631.744976
3 … 67.0 8.52 67.0 0.1 669.959000
4 … 68.0 7.87 68.0 0.1 63.537231
… … … … … … …
2933 … 67.0 7.13 65.0 33.6 454.366654
2934 … 7.0 6.52 68.0 36.7 453.351155
2935 … 73.0 6.53 71.0 39.8 57.348340
2936 … 76.0 6.16 75.0 42.1 548.587312
2937 … 78.0 7.10 78.0 43.5 547.358878

Population thinness 1-19 years thinness 5-9 years \


0 33736494.0 17.2 17.3
1 327582.0 17.5 17.5
2 31731688.0 17.7 17.7
3 3696958.0 17.9 18.0
4 2978599.0 18.2 18.2
… … … …
2933 12777511.0 9.4 9.4
2934 12633897.0 9.8 9.9
2935 125525.0 1.2 1.3
2936 12366165.0 1.6 1.7
2937 12222251.0 11.0 11.2

Income composition of resources Schooling


0 0.479 10.1
1 0.476 10.0
2 0.470 9.9
3 0.463 9.8
4 0.454 9.5
… … …
2933 0.407 9.2

2934 0.418 9.5
2935 0.427 10.0
2936 0.427 9.8
2937 0.434 9.8

[2938 rows x 22 columns]>

D24AIML081_PR9

April 1, 2025

[2]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st

[3]: df=pd.read_csv(r"C:\Users\User\Downloads\Housing.csv")
sns.histplot(df['price'],bins=30,kde= True,color='black')
plt.title("Housing Price Distribution")
plt.show()
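Housing prices are typically right-skewed, so a log transform often gives a more symmetric histogram. A sketch on synthetic log-normal prices (the real `df['price']` column would be transformed the same way):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skew

# Synthetic right-skewed prices standing in for df['price'] above
rng = np.random.default_rng(42)
price = rng.lognormal(mean=15, sigma=0.5, size=1000)

log_price = np.log(price)
print(f"skewness before: {skew(price):.2f}, after: {skew(log_price):.2f}")

# The log transform pulls in the long right tail
plt.hist(log_price, bins=30, color='black')
plt.title("log(price) distribution")
```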

[9]: x1 = df[df['airconditioning'] == "yes"]['price'].mean()
x2 = df[df['airconditioning'] == "no"]['price'].mean()
print("With Airconditioning", x1)
print("Without Airconditioning", x2)
s = df['price'].std()
# original code misspelled the column as 'aircondtioning' on the next two lines, which raised the KeyError below
n1 = len(df[df['airconditioning'] == 'yes'])
n2 = len(df[df['airconditioning'] == 'no'])
t = (x1 - x2) / (s * np.sqrt(1/n1 + 1/n2))
print("T-statistic", t)
t_critical = st.t.ppf(0.975, n1 + n2 - 2)  # two-tailed critical value at alpha = 0.05
print("T-critical", t_critical)
if abs(t) > t_critical:
    print("Reject Null Hypothesis")
else:
    print("Fail to reject Null Hypothesis")

With Airconditioning 6013220.5813953485
Without Airconditioning 4191939.678284182
KeyError: 'aircondtioning'
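The manual t-statistic can be cross-checked with scipy's Welch test, which also drops the equal-variance assumption. A sketch on synthetic stand-ins for the two price groups (group sizes and spreads here are assumptions; the real `df['price']` selections would be passed directly):

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-ins for the "with AC" and "without AC" price groups
rng = np.random.default_rng(1)
with_ac = rng.normal(6_000_000, 1_500_000, 170)
without_ac = rng.normal(4_200_000, 1_200_000, 370)

# Welch's t-test: like the manual computation, but without pooling the variances
t_stat, p_value = st.ttest_ind(with_ac, without_ac, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
```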

D24AIML081_PR_10

April 1, 2025

[4]: import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

[5]: data=pd.read_csv(r"C:\Users\User\Downloads\global_energy_consumption.csv")
data.head()

[5]: Country Year Total Energy Consumption (TWh) Per Capita Energy Use (kWh) \
0 Canada 2018 9525.38 42301.43
1 Germany 2020 7922.08 36601.38
2 Russia 2002 6630.01 41670.20
3 Brazil 2010 8580.19 10969.58
4 Canada 2006 848.88 32190.85

Renewable Energy Share (%) Fossil Fuel Dependency (%) \


0 13.70 70.47
1 33.63 41.95
2 10.82 39.32
3 73.24 16.71
4 73.60 74.86

Industrial Energy Use (%) Household Energy Use (%) \


0 45.18 19.96
1 34.32 22.27
2 53.66 26.44
3 30.55 27.60
4 42.39 23.43

Carbon Emissions (Million Tons) Energy Price Index (USD/kWh)


0 3766.11 0.12
1 2713.12 0.08
2 885.98 0.26
3 1144.11 0.47
4 842.39 0.48

[15]: # Q1 : What is the average total energy consumption across all countries?
avg_con = data['Total Energy Consumption (TWh)'].mean()
print("Average of Total Energy Consumption(TWh) is ",avg_con)

Average of Total Energy Consumption(TWh) is 5142.564425

[19]: # Q2 : What is the median per capita energy use?
med_energy = data['Per Capita Energy Use (kWh)'].median()
print("Median of Per Capita Energy Use (kWh) is", med_energy)

Median of Per Capita Energy Use (kWh) is 25098.77

[21]: # Q3 : What is the correlation between fossil fuel dependency and carbon emissions?
cor_rel = data['Fossil Fuel Dependency (%)'].corr(data['Carbon Emissions (Million Tons)'])
print("Correlation between fossil fuel dependency and Carbon Emissions (Million Tons) is ", cor_rel)

Correlation between fossil fuel dependency and Carbon Emissions (Million Tons)
is 0.004444006196321776

[31]: # Q4 : Which country has the highest average renewable energy share?
country = data.groupby('Country')['Renewable Energy Share (%)'].mean().idxmax()
high_avg_erg = data.groupby('Country')['Renewable Energy Share (%)'].mean().max()
{
    'Country': country,
    'highest average renewable energy ': high_avg_erg,
}

[31]: {'Country': 'USA', 'highest average renewable energy ': 48.19013295346629}

[37]: # Q5 : What is the standard deviation of the energy price index across different years?
print(data.groupby('Year')['Energy Price Index (USD/kWh)'].std())

[39]: # Q6 : How does industrial energy use compare to household energy use on average?
data.groupby('Country')['Industrial Energy Use (%)'].mean().sort_values(ascending=False).plot(kind='line')
plt.title("Industrial Energy Use (%)")
plt.show()
data.groupby('Country')['Household Energy Use (%)'].mean().sort_values(ascending=False).plot(kind='line')
plt.title("Household Energy Use (%)")
plt.show()

[49]: # Q7 : Is there a statistically significant difference in per capita energy use between developed and developing countries?

data['Country'].unique()
developed = ['Australia', 'China', 'USA', 'UK', 'Germany', 'Russia']
developing = ['Japan', 'India', 'Canada', 'Brazil']
# mean per capita energy use for each group
developed_country = data[data['Country'].isin(developed)]['Per Capita Energy Use (kWh)']
developing_country = data[data['Country'].isin(developing)]['Per Capita Energy Use (kWh)']
{
    'Developed': developed_country.mean(),
    'Developing': developing_country.mean()
}

[49]: {'Developed': 25151.206330625104, 'Developing': 24870.894376417233}
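Q7 asks about statistical significance, but the cell only compares group means; a two-sample t-test completes the check. A sketch on synthetic stand-ins (the group means follow the printed output; the spreads and sample sizes are assumptions, and the real Series would be passed directly):

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-ins for per-capita energy use in the two country groups
rng = np.random.default_rng(7)
developed = rng.normal(25151, 9000, 600)
developing = rng.normal(24871, 9000, 400)

# Welch's two-sample t-test (no equal-variance assumption)
t_stat, p_value = stats.ttest_ind(developed, developing, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print("Significant at 5%" if p_value < 0.05 else "Not significant at 5%")
```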

[51]: # Q8 : What is the distribution of total energy consumption? Is it normally distributed?
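Q8 can be answered with a formal normality test alongside a histogram. A sketch using scipy's D'Agostino-Pearson test on a synthetic stand-in column (a uniform draw, which the test should reject as non-normal; the real consumption column would be passed the same way):

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-in for the total-energy-consumption column: a flat (uniform) draw
rng = np.random.default_rng(3)
consumption = rng.uniform(100, 10000, 1000)

# D'Agostino-Pearson omnibus test: combines skewness and kurtosis checks
stat, p_value = stats.normaltest(consumption)
print(f"statistic = {stat:.2f}, p = {p_value:.3g}")
print("Looks normal" if p_value > 0.05 else "Normality rejected")
```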

[55]: # Q9 : Can we build a regression model to predict carbon emissions based on energy consumption and fossil fuel dependency?

from sklearn.linear_model import LinearRegression

# Prepare data for regression
# (note: this fit predicts fossil fuel dependency from carbon emissions, the reverse of what the question asks)
x = data[['Carbon Emissions (Million Tons)']]  # predictor
y = data['Fossil Fuel Dependency (%)']  # target

# Fit the regression model
model = LinearRegression()
model.fit(x, y)
# Coefficients
print(f"Intercept:{model.intercept_},coefficient:{model.coef_[0]}")

Intercept:44.772961929867485,coefficient:6.304406118696378e-05

[57]: model.predict([[5.0]])

C:\Users\prade\anaconda3\Lib\site-packages\sklearn\base.py:493: UserWarning: X
does not have valid feature names, but LinearRegression was fitted with feature
names
warnings.warn(

[57]: array([44.77327715])

[59]: # Q10 : What is the impact of renewable energy share on energy price index?
impact = data['Renewable Energy Share (%)'].corr(data['Energy Price Index (USD/kWh)'])
print(impact)

-0.0156399186330418

1 PART 2
[96]: # Q1 : What is the trend of total energy consumption over the years for different countries?
# note: this groups totals by country; grouping by 'Year' (or ['Year', 'Country']) would show the trend over time
data.groupby('Country')['Total Energy Consumption (TWh)'].sum().sort_values(ascending=False).plot(kind='line')

[96]: <Axes: xlabel='Country'>

[80]: # Q2 : Which countries have the highest and lowest fossil fuel dependency?
country_min = data.groupby('Country')['Fossil Fuel Dependency (%)'].idxmin()  # row index where each country's minimum occurs
min_value = data.groupby('Country')['Fossil Fuel Dependency (%)'].min()
country_max = data.groupby('Country')['Fossil Fuel Dependency (%)'].idxmax()
max_value = data.groupby('Country')['Fossil Fuel Dependency (%)'].max()  # original used .min() here by mistake, so the output below repeats the minima
{
    'Country With Min Value': country_min,
    'Minimum Value ': min_value,
    'Country With Max Value': country_max,
    'Maximum Value ': max_value
}
[80]: {'Country With Min Value': Country


Australia 5172
Brazil 7279
Canada 4039
China 5375
Germany 7192
India 3306

6
Japan 4572
Russia 6433
UK 4545
USA 3900
Name: Fossil Fuel Dependency (%), dtype: int64,
'Minimum Value ': Country
Australia 10.03
Brazil 10.03
Canada 10.02
China 10.11
Germany 10.04
India 10.04
Japan 10.07
Russia 10.05
UK 10.01
USA 10.04
Name: Fossil Fuel Dependency (%), dtype: float64,
'Country With Max Value': Country
Australia 9496
Brazil 7327
Canada 7775
China 5095
Germany 7194
India 9004
Japan 3191
Russia 8683
UK 140
USA 2502
Name: Fossil Fuel Dependency (%), dtype: int64,
'Maximum Value ': Country
Australia 10.03
Brazil 10.03
Canada 10.02
China 10.11
Germany 10.04
India 10.04
Japan 10.07
Russia 10.05
UK 10.01
USA 10.04
Name: Fossil Fuel Dependency (%), dtype: float64}

[100]: # Q3 : How has the share of renewable energy changed over time?
renewable_trend = data.groupby('Year')['Renewable Energy Share (%)'].mean()

# Plotting the trend of renewable energy share over time
plt.figure(figsize=(10, 6))
plt.plot(renewable_trend.index, renewable_trend.values, marker='o', color='green')
plt.xlabel('Year')
plt.ylabel('Average Renewable Energy Share (%)')
plt.title('Trend of Renewable Energy Share Over Time')
plt.grid(True)
plt.show()

[102]: # Q4 : What are the top 5 countries with the highest carbon emissions?
data.groupby('Country')['Carbon Emissions (Million Tons)'].mean().nlargest(5)

[102]: Country
China 2596.863320
Australia 2580.429833
India 2544.816727
Brazil 2542.097661
UK 2540.094797
Name: Carbon Emissions (Million Tons), dtype: float64

[106]: # Q5 : What is the distribution of energy price index across different regions?
data.groupby('Country')['Energy Price Index (USD/kWh)'].mean().
↪plot(kind='line',figsize=(10,6))

[106]: <Axes: xlabel='Country'>

#Q6 : Is there a relationship between energy consumption and Energy Price Index?
# The original call passed the .mean() of the price column (a scalar) as y,
# which raised "ValueError: x and y must be the same size";
# scatter needs the full column so x and y have equal length.
plt.scatter(data['Total Energy Consumption (TWh)'],
            data['Energy Price Index (USD/kWh)'], color='magenta')
plt.xlabel('Total Energy Consumption (TWh)')
plt.ylabel('Energy Price Index (USD/kWh)')
plt.title('Energy Consumption vs Energy Price Index')
plt.grid(True)
plt.show()

`plt.scatter` requires x and y of the same length; calling `.mean()` on the price column collapsed y to a single scalar, producing the abridged traceback below.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[116], line 2
----> 2 plt.scatter(data['Total Energy Consumption (TWh)'],
                    data['Energy Price Index (USD/kWh)'].mean(), color='magenta')

File ~\anaconda3\Lib\site-packages\matplotlib\axes\_axes.py:4655, in Axes.scatter(...)
   4654     if x.size != y.size:
-> 4655         raise ValueError("x and y must be the same size")

ValueError: x and y must be the same size
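Beyond eyeballing a scatter plot, the strength of the relationship can be quantified with `Series.corr` (Pearson by default). A sketch on hypothetical data with a perfectly linear price–consumption link:

```python
import pandas as pd

# Hypothetical rows; price rises linearly with consumption here
df = pd.DataFrame({
    'Total Energy Consumption (TWh)': [100.0, 200.0, 300.0, 400.0],
    'Energy Price Index (USD/kWh)':   [0.10, 0.12, 0.14, 0.16],
})

# Pearson correlation between the two equal-length columns
r = df['Total Energy Consumption (TWh)'].corr(df['Energy Price Index (USD/kWh)'])
print(round(r, 3))  # 1.0 for this perfectly linear toy data
```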

[ ]:
