DAA Prac
April 1, 2025
0.1 Question 1
Separate the given list based on the data types. List1 = ["Aakash", 90, 77, "B", 3.142, 12]
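The code cell for this question did not survive the export. One straightforward approach, sketched below, walks the list once and buckets each element under the name of its Python type:

```python
# Separate a mixed list into per-type buckets (sketch for Question 1)
list1 = ["Aakash", 90, 77, "B", 3.142, 12]

separated = {}
for item in list1:
    # group items under the name of their type, e.g. 'str', 'int', 'float'
    separated.setdefault(type(item).__name__, []).append(item)

print(separated)
# → {'str': ['Aakash', 'B'], 'int': [90, 77, 12], 'float': [3.142]}
```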
0.2 Question 2
Consider you are collecting data from students on their heights (in cm), containing numbers such as 140, 145, 153, etc. Use the NumPy library to randomly generate 50 such numbers in the range 150 to 180. Which data type would you use, a list or an array, to store such data? Calculate measures of central tendency of this data stored in a list as well as an array.
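The generating cell is missing from the export; a sketch of how it might look is below. An ndarray suits homogeneous numeric data like heights, while the same values can also be held in a plain list. No random seed is fixed, so a run produces different numbers than the sample output shown after this block:

```python
import statistics
import numpy as np

# 50 random integer heights (cm) in the range [150, 180]
heights = np.random.randint(150, 181, size=50)   # ndarray: compact and vectorised
heights_list = heights.tolist()                  # the same data as a plain list

print(heights)
print("The mean is", round(float(heights.mean()), 1))
print("The median is", float(np.median(heights)))
# the list version works too, via the statistics module
print(statistics.mean(heights_list), statistics.median(heights_list))
```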
[166 157 163 171 164 154 160 177 159 155 167 159 173 171 174 155 173 172
153 163 174 166 158 169 153 166 159 161 153 175 179 179 151 167 155 179
153 163 176 173 178 177 174 166 157 161 174 154 167 162]
The mean is 165.3
The median is 166.0
0.3 Question 3
Part 1:-
Create the function that will plot simple line chart for any given data.
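The cell itself was lost in the export; a minimal sketch of such a reusable function follows. The function name and parameters are illustrative, not from the original notebook:

```python
import matplotlib.pyplot as plt

def plot_line_chart(x, y, title="Line chart", xlabel="x", ylabel="y"):
    # draw y against x as a simple line with labelled axes and a grid
    fig, ax = plt.subplots()
    ax.plot(x, y)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.grid(True)
    plt.show()
    return ax

plot_line_chart([1, 2, 3, 4], [10, 20, 15, 30], title="Sample data")
```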
0.4 Question 3
Part 2:-
Create the recursive function for finding out factorial of a given number
def fact(n):
    return 1 if n <= 1 else n * fact(n - 1)
n = int(input())
print(fact(n))
10
3628800
0.5 Question 3
Part 3:-
Create generator function for Fibonacci series and print out first 10 numbers.
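The generator cell did not survive the export; a minimal sketch that produces the ten numbers shown below:

```python
from itertools import islice

def fibonacci():
    # yield Fibonacci numbers indefinitely, starting 0, 1, 1, 2, ...
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

for n in islice(fibonacci(), 10):
    print(n)
```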
0
1
1
2
3
5
8
13
21
34
0.6 Question 3
Part 4:-
Plot the graphs for trigonometric functions sin, cos, tan, cot, sec & cosec for the values pi to 2pi.
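The export kept the cos, tan, sec, cosec and cot cells below but lost the sin cell; a sketch in the same pattern (imports are included here so the block stands alone):

```python
import math
import numpy as np
import matplotlib.pyplot as plt

# sin(x) over one half-period, pi to 2*pi, matching the cos cell below
x = np.linspace(math.pi, 2 * math.pi, 10000)
y = np.sin(x)
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Sin(x) Graph")
plt.plot(x, y)
plt.show()
```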
[19]: import math
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(math.pi, 2 * math.pi, 10000)
y = np.cos(x)
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Cos(x) Graph")
plt.plot(x, y)
plt.show()
[21]: import math
x = np.linspace(math.pi, 2 * math.pi, 10000)
y = np.tan(x)
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Tan(x) Graph")
plt.ylim(-10, 10)
plt.plot(x, y)
plt.show()
[23]: import math
x = np.linspace(math.pi, 2 * math.pi, 10000)
y = 1 / np.cos(x)
y[np.abs(np.cos(x)) < 1e-5] = np.nan
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Sec(x) Graph")
plt.plot(x, y, linewidth = 1.1, color = 'blue', label = "Sec(x)")
plt.legend()
plt.ylim(-10, 10)
plt.show()
[35]: import math
x = np.linspace(math.pi, 2 * math.pi, 100)
y = 1 / np.sin(x)
y[np.abs(np.sin(x)) < 1e-5] = np.nan
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Cosec(x) Graph")
plt.plot(x, y, linewidth = 1.1, color = 'blue', label = "Cosec(x)")
plt.legend()
plt.ylim(-20, 20)
plt.show()
[27]: import math
x = np.linspace(math.pi, 2 * math.pi, 10000)
y = 1 / np.tan(x)
y[np.abs(np.tan(x)) < 1e-5] = np.nan
plt.grid()
plt.xlabel("Radian")
plt.ylabel("Value")
plt.title("Cot(x) Graph")
plt.plot(x, y, linewidth = 1.1, color = 'blue', label = "Cot(x)")
plt.legend()
plt.ylim(-10, 10)
plt.show()
0.7 Question 4
Consider you want to create a dataset with the ages of people in your surroundings. Use the input method to ask users their age and store those ages in an appropriate data type. Apply error handling that rejects inputs greater than 130 or less than 0, and raise appropriate prompts to guide users.
def coll():
    ages = []
    while True:
        usr = input("Enter age of the person (q to exit): ")
        if usr == 'q':
            break
        try:
            usr = int(usr)
            if usr < 0 or usr > 130:
                print("Age must be between 0 and 130")
            else:
                ages.append(usr)
        except ValueError:
            print("Invalid input, enter a whole number")
    return ages

use = coll()
print(use)
0.8 Question 5
Create a class Employees with inputs name, department and salary. Salary should be encapsulated.
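The class cell is missing from the export. A sketch follows: a name-mangled attribute plus property-based access illustrate the encapsulation (the validation rule and method shapes are illustrative, not from the original notebook):

```python
class Employees:
    def __init__(self, name, department, salary):
        self.name = name
        self.department = department
        self.__salary = salary            # name-mangled: not reachable as emp.__salary

    @property
    def salary(self):                     # controlled read access
        return self.__salary

    @salary.setter
    def salary(self, value):              # controlled write access with validation
        if value < 0:
            raise ValueError("Salary cannot be negative")
        self.__salary = value

emp = Employees("Aakash", "AIML", 50000)
print(emp.name, emp.department, emp.salary)
```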
0.9 Question 6
Create two 3-D arrays as matrices. Perform matrix operations (addition, multiplication, dot product, inverse, determinant) on those matrices. Explain the identity matrix, multiply each matrix with the identity matrix and record the observation. (All operations should be done with the NumPy library.)
import numpy as np

# Two stacks of 2x2 matrices (3-D arrays); values reconstructed from the outputs below
arr1 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
arr2 = np.array([[[9, 10], [11, 12]], [[13, 14], [15, 16]]])
print("Matrix Addition:\n", arr1 + arr2)
print("\nElement-wise Multiplication:\n", arr1 * arr2)
print("\nDot Product:\n", np.matmul(arr1, arr2))
print("\nInverse of arr1:\n", np.linalg.inv(arr1))
print("\nDeterminants of arr1:\n", np.linalg.det(arr1))
print("\nIdentity Matrix:\n", np.eye(2))
print("\narr1 multiplied with Identity Matrix:\n", np.array([np.dot(np.eye(2), mat) for mat in arr1]))
Matrix Addition:
[[[10 12]
[14 16]]
[[18 20]
[22 24]]]
Element-wise Multiplication:
[[[ 9 20]
[ 33 48]]
[[ 65 84]
[105 128]]]
Dot Product:
[[[ 31 34]
[ 71 78]]
[[155 166]
[211 226]]]
Inverse of arr1:
[[[-2. 1. ]
[ 1.5 -0.5]]
[[-4. 3. ]
[ 3.5 -2.5]]]
Determinants of arr1:
[-2. -2.]
Identity Matrix:
[[1. 0.]
[0. 1.]]
[[[1. 2.]
[3. 4.]]
[[5. 6.]
[7. 8.]]]
Assignment_2_D24AIML081
April 1, 2025
1 Lab-2
[4]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
[6]: sns.displot(drug_df['Age'], color='darkgreen')
plt.title('Distribution curve for age')
plt.show()
2. Calculate Q1,Q2,Q3 and IQR without using np.percentile function. Calculate lower
and upper bound values.
[7]: v = drug_df['Age']
#u = drug_df['Na_to_K']
# (n+1)/4, (n+1)/2 and 3(n+1)/4 are the rank positions of Q1, Q2 and Q3
q1 = (len(v) + 1) / 4
q2 = (len(v) + 1) / 2
q3 = ((len(v) + 1) * 3) / 4
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
print("Q1 = ", q1)
print("Q2 = ", q2)
print("Q3 = ", q3)
print("IQR = ", iqr)
print("Lower bound = ", lower)
print("Upper bound = ", upper)
Q1 = 50.25
Q2 = 100.5
Q3 = 150.75
IQR = 100.5
Lower bound = -100.5
Upper bound = 301.5
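The figures above are rank positions in the sorted data, not the quartile values themselves. A generic sketch that interpolates the sorted values at those (n+1)-based positions follows; it uses standalone example data, since the drug dataset is not bundled with this export:

```python
def quartile(values, q):
    # value at rank position q*(n+1) in the sorted data, with linear interpolation
    s = sorted(values)
    pos = q * (len(s) + 1)                 # 1-based rank position
    lo = max(int(pos) - 1, 0)              # 0-based index of the lower neighbour
    hi = min(lo + 1, len(s) - 1)
    frac = pos - int(pos)
    return s[lo] + frac * (s[hi] - s[lo])

data = [150, 152, 155, 158, 160, 163, 167, 170]   # example heights
q1, q2, q3 = (quartile(data, p) for p in (0.25, 0.5, 0.75))
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(q1, q2, q3, iqr, lower, upper)
```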
3. Calculate frequency table as well for age column. Ranges for this can be in multiple
of 10, e.g. 10-20,20-30,etc..
[8]: x = drug_df['Age']
for i in range(10, 80, 10):
    count = 0
    for age in x:
        if age >= i and age < i + 10:
            count += 1
    print(f'{i} - {i+10} : {count}')
10 - 20 : 12
20 - 30 : 35
30 - 40 : 37
40 - 50 : 38
50 - 60 : 33
60 - 70 : 32
70 - 80 : 13
1. What is a Gender distribution of data?
2. What percent of total population have high cholesterol & high BP?
3. What are the unique values of Drugs given in data? (df[“Drug”].unique)
4. How many people have high cholesterol before age of 30?
[9]: Sex
M 104
F 96
Name: count, dtype: int64
.. … .. … … … …
188 65 M HIGH NORMAL 34.997 drugY
189 64 M HIGH NORMAL 20.932 drugY
190 58 M HIGH HIGH 18.991 drugY
191 23 M HIGH HIGH 8.011 drugA
194 46 F HIGH HIGH 34.686 drugY
[11]: drug_df['Drug'].unique()
drug_df['Drug'].value_counts()
[11]: Drug
drugY 91
drugX 54
drugA 23
drugC 16
drugB 16
Name: count, dtype: int64
[12]: count = 0
for i in range(len(drug_df['Age'])):
if drug_df['Cholesterol'][i] == 'HIGH' and drug_df['Age'][i] <30 :
count +=1
print(count)
26
2 Assignment-2
[13]: df = pd.read_csv('/content/user_behavior_dataset.csv')
df
[13]: User ID Device Model Operating System App Usage Time (min/day) \
0 1 Google Pixel 5 Android 393
1 2 OnePlus 9 Android 268
2 3 Xiaomi Mi 11 Android 154
3 4 Google Pixel 5 Android 239
4 5 iPhone 12 iOS 187
.. … … … …
695 696 iPhone 12 iOS 92
696 697 Xiaomi Mi 11 Android 316
697 698 Google Pixel 5 Android 99
698 699 Samsung Galaxy S21 Android 62
699 700 OnePlus 9 Android 212
0 6.4 1872
1 4.7 1331
2 4.0 761
3 4.8 1676
4 4.3 1367
.. … …
695 3.9 1082
696 6.8 1965
697 3.1 942
698 1.7 431
699 5.4 1306
iqr= q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = v[(v < lower) | (v > upper)]
print("Q1 = ",q1)
print("Q2 = ",q2)
print("Q3 = ",q3)
print("IQR = ",iqr)
print("Lower bound = ",lower)
print("Upper bound = ",upper)
print("Outliers = ", outliers.tolist())
Q1 = 175.25
Q2 = 350.5
Q3 = 525.75
IQR = 350.5
Lower bound = -350.5
Upper bound = 1051.5
Outliers = []
2. Find out gender distribution in this data.
gender_distribution = df['Gender'].value_counts(normalize=True) * 100
print(gender_distribution)
Gender
Male 52.0
Female 48.0
Name: proportion, dtype: float64
3. What is average daily usage of data? Explore gender wise and device wise variation in average
usage of data.
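The computation feeding the prints below was lost in the export. One way to get the overall, gender-wise and device-wise averages is a `groupby` over a usage column; the sketch below uses a toy frame, and the column names ("Gender", "Device Model", "Data Usage (MB/day)") are assumptions about the real dataset:

```python
import pandas as pd

# Toy frame standing in for user_behavior_dataset.csv; column names are assumed
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Device Model": ["OnePlus 9", "iPhone 12", "OnePlus 9", "iPhone 12"],
    "Data Usage (MB/day)": [1872, 1331, 761, 1676],
})

overall_avg = toy["Data Usage (MB/day)"].mean()
gender_usage = toy.groupby("Gender")["Data Usage (MB/day)"].mean()
gender_device_usage = toy.groupby(["Gender", "Device Model"])["Data Usage (MB/day)"].mean()
print(overall_avg)
print(gender_device_usage)
```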
print("\nAverage daily data usage by gender and device:")
print(gender_device_usage)
print(popularity)
Age Gender
18 Male Samsung Galaxy S21
19 Female iPhone 12
Male OnePlus 9
20 Female Google Pixel 5
Male Google Pixel 5
…
57 Male Samsung Galaxy S21
58 Female iPhone 12
Male Samsung Galaxy S21
59 Female iPhone 12
Male Samsung Galaxy S21
Name: Device Model, Length: 83, dtype: object
D24AIML081_ASS_3_PRMP
April 1, 2025
[34]: #Assignment 3
import numpy as np
import pandas as pd
df = pd.read_csv("C:/Users/User/Downloads/matches.csv")
df
venue \
0 M Chinnaswamy Stadium
1 Punjab Cricket Association Stadium, Mohali
2 Feroz Shah Kotla
3 Wankhede Stadium
4 Eden Gardens
… …
1090 Rajiv Gandhi International Stadium, Uppal, Hyd…
1091 Narendra Modi Stadium, Ahmedabad
1092 Narendra Modi Stadium, Ahmedabad
1093 MA Chidambaram Stadium, Chepauk, Chennai
1094 MA Chidambaram Stadium, Chepauk, Chennai
team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
1 Kings XI Punjab Chennai Super Kings
2 Delhi Daredevils Rajasthan Royals
3 Mumbai Indians Royal Challengers Bangalore
4 Kolkata Knight Riders Deccan Chargers
… … …
1090 Punjab Kings Sunrisers Hyderabad
1091 Sunrisers Hyderabad Kolkata Knight Riders
1092 Royal Challengers Bengaluru Rajasthan Royals
1093 Sunrisers Hyderabad Rajasthan Royals
1094 Sunrisers Hyderabad Kolkata Knight Riders
umpire1 umpire2
0 Asad Rauf RE Koertzen
1 MR Benson SL Shastri
2 Aleem Dar GA Pratapkumar
3 SJ Davis DJ Harper
4 BF Bowden K Hariharan
… … …
1090 Nitin Menon VK Sharma
1091 AK Chaudhary R Pandit
1092 KN Ananthapadmanabhan MV Saidharshan Kumar
1093 Nitin Menon VK Sharma
1094 J Madanagopal Nitin Menon
[1095 rows x 20 columns]
id 1095
season 17
city 36
date 823
match_type 8
player_of_match 291
venue 58
team1 19
team2 19
toss_winner 19
toss_decision 2
winner 19
result 4
result_margin 98
target_runs 170
target_overs 15
super_over 2
method 1
umpire1 62
umpire2 62
dtype: int64
outlier_counts = outliers.sum()
print(outlier_counts)
id 0
result_margin 121
target_runs 30
target_overs 30
dtype: int64
[40]: #3) Plot heatmap of correlation matrix and covariance matrix for the given dataset.
import seaborn as sns
import matplotlib.pyplot as plt
numeric_df = df.select_dtypes(include='number')   # keep only the numeric columns
correlation_matrix = numeric_df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f")
plt.title('Correlation Matrix')
plt.show()
covariance_matrix = numeric_df.cov()
plt.figure(figsize=(10, 8))
sns.heatmap(covariance_matrix, annot=True, fmt=".2f")
plt.title('Covariance Matrix')
plt.show()
[42]: #4) Remove unnecessary or empty columns as well as any rows if required from the dataset.
df_cleaned = df.drop(columns=['method']).dropna()  # 'method' is empty for most rows
df_cleaned
1092 1426310 2024 Ahmedabad 2024-05-22 Eliminator R Ashwin
1093 1426311 2024 Chennai 2024-05-24 Qualifier 2 Shahbaz Ahmed
1094 1426312 2024 Chennai 2024-05-26 Final MA Starc
venue \
0 M Chinnaswamy Stadium
1 Punjab Cricket Association Stadium, Mohali
2 Feroz Shah Kotla
3 Wankhede Stadium
4 Eden Gardens
… …
1090 Rajiv Gandhi International Stadium, Uppal, Hyd…
1091 Narendra Modi Stadium, Ahmedabad
1092 Narendra Modi Stadium, Ahmedabad
1093 MA Chidambaram Stadium, Chepauk, Chennai
1094 MA Chidambaram Stadium, Chepauk, Chennai
team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
1 Kings XI Punjab Chennai Super Kings
2 Delhi Daredevils Rajasthan Royals
3 Mumbai Indians Royal Challengers Bangalore
4 Kolkata Knight Riders Deccan Chargers
… … …
1090 Punjab Kings Sunrisers Hyderabad
1091 Sunrisers Hyderabad Kolkata Knight Riders
1092 Royal Challengers Bengaluru Rajasthan Royals
1093 Sunrisers Hyderabad Rajasthan Royals
1094 Sunrisers Hyderabad Kolkata Knight Riders
3 wickets 5.0 166.0 20.0 N
4 wickets 5.0 111.0 20.0 N
… … … … … …
1090 wickets 4.0 215.0 20.0 N
1091 wickets 8.0 160.0 20.0 N
1092 wickets 4.0 173.0 20.0 N
1093 runs 36.0 176.0 20.0 N
1094 wickets 8.0 114.0 20.0 N
umpire1 umpire2
0 Asad Rauf RE Koertzen
1 MR Benson SL Shastri
2 Aleem Dar GA Pratapkumar
3 SJ Davis DJ Harper
4 BF Bowden K Hariharan
… … …
1090 Nitin Menon VK Sharma
1091 AK Chaudhary R Pandit
1092 KN Ananthapadmanabhan MV Saidharshan Kumar
1093 Nitin Menon VK Sharma
1094 J Madanagopal Nitin Menon
[44]: #5) Plot histograms for each column and remove any skewness using transformations.
[46]: #6) Plot Yearly records for numerical columns (e.g. runs, trophies)
df_cleaned['year'] = df_cleaned['season'].str.split('/').str[0].astype(int)
yearly_records = df_cleaned.groupby('year').sum(numeric_only=True)
ASSIGNMENT_4_D24AIML081
April 1, 2025
url = 'C:/Users/User/Downloads/creditcard.csv/creditcard.csv'
data = pd.read_csv(url)
threshold = 100
total_transactions = len(data_cleaned)
if P_high_amount > 0:
    P_fraudulent_given_high_amount = (P_high_amount_given_fraudulent * P_fraudulent) / P_high_amount
else:
    P_fraudulent_given_high_amount = 0
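The cells that computed the input probabilities were lost in the export. A self-contained sketch of the same Bayes computation follows; the variable names mirror the cell above, but the counts are illustrative, not from the real creditcard.csv:

```python
# Illustrative counts (not from the actual dataset)
total_transactions = 1000
fraudulent = 20
high_amount = 100             # transactions above the amount threshold
high_amount_and_fraud = 10

P_fraudulent = fraudulent / total_transactions
P_high_amount = high_amount / total_transactions
P_high_amount_given_fraudulent = high_amount_and_fraud / fraudulent

# Bayes' rule: P(fraud | high amount) = P(high | fraud) * P(fraud) / P(high)
if P_high_amount > 0:
    P_fraudulent_given_high_amount = (P_high_amount_given_fraudulent * P_fraudulent) / P_high_amount
else:
    P_fraudulent_given_high_amount = 0
print(P_fraudulent_given_high_amount)
```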
D24AIMl081_PR5
April 1, 2025
df=pd.read_csv('D:/SEM4/PMRP/RAW_CODE/PMRP_DAY_13/matches.csv')
df.head(), df.tail(), df.describe(), df.info()
venue \
0 M Chinnaswamy Stadium
1 Punjab Cricket Association Stadium, Mohali
2 Feroz Shah Kotla
3 Wankhede Stadium
4 Eden Gardens
… …
1090 Rajiv Gandhi International Stadium, Uppal, Hyd…
1091 Narendra Modi Stadium, Ahmedabad
1092 Narendra Modi Stadium, Ahmedabad
1093 MA Chidambaram Stadium, Chepauk, Chennai
1094 MA Chidambaram Stadium, Chepauk, Chennai
team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
1 Kings XI Punjab Chennai Super Kings
2 Delhi Daredevils Rajasthan Royals
3 Mumbai Indians Royal Challengers Bangalore
4 Kolkata Knight Riders Deccan Chargers
… … …
1090 Punjab Kings Sunrisers Hyderabad
1091 Sunrisers Hyderabad Kolkata Knight Riders
1092 Royal Challengers Bengaluru Rajasthan Royals
1093 Sunrisers Hyderabad Rajasthan Royals
1094 Sunrisers Hyderabad Kolkata Knight Riders
2 Rajasthan Royals bat Delhi Daredevils
3 Mumbai Indians bat Royal Challengers Bangalore
4 Deccan Chargers bat Kolkata Knight Riders
… … … …
1090 Punjab Kings bat Sunrisers Hyderabad
1091 Sunrisers Hyderabad bat Kolkata Knight Riders
1092 Rajasthan Royals field Rajasthan Royals
1093 Rajasthan Royals field Sunrisers Hyderabad
1094 Sunrisers Hyderabad bat Kolkata Knight Riders
umpire1 umpire2
0 Asad Rauf RE Koertzen
1 MR Benson SL Shastri
2 Aleem Dar GA Pratapkumar
3 SJ Davis DJ Harper
4 BF Bowden K Hariharan
… … …
1090 Nitin Menon VK Sharma
1091 AK Chaudhary R Pandit
1092 KN Ananthapadmanabhan MV Saidharshan Kumar
1093 Nitin Menon VK Sharma
1094 J Madanagopal Nitin Menon
1092 1426310 2024 Ahmedabad 2024-05-22 Eliminator R Ashwin
1093 1426311 2024 Chennai 2024-05-24 Qualifier 2 Shahbaz Ahmed
1094 1426312 2024 Chennai 2024-05-26 Final MA Starc
venue \
0 M Chinnaswamy Stadium
1 Punjab Cricket Association Stadium, Mohali
2 Feroz Shah Kotla
3 Wankhede Stadium
4 Eden Gardens
… …
1090 Rajiv Gandhi International Stadium, Uppal, Hyd…
1091 Narendra Modi Stadium, Ahmedabad
1092 Narendra Modi Stadium, Ahmedabad
1093 MA Chidambaram Stadium, Chepauk, Chennai
1094 MA Chidambaram Stadium, Chepauk, Chennai
team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
1 Kings XI Punjab Chennai Super Kings
2 Delhi Daredevils Rajasthan Royals
3 Mumbai Indians Royal Challengers Bangalore
4 Kolkata Knight Riders Deccan Chargers
… … …
1090 Punjab Kings Sunrisers Hyderabad
1091 Sunrisers Hyderabad Kolkata Knight Riders
1092 Royal Challengers Bengaluru Rajasthan Royals
1093 Sunrisers Hyderabad Rajasthan Royals
1094 Sunrisers Hyderabad Kolkata Knight Riders
3 wickets 5.0 166.0 20.0 N NaN
4 wickets 5.0 111.0 20.0 N NaN
… … … … … … …
1090 wickets 4.0 215.0 20.0 N NaN
1091 wickets 8.0 160.0 20.0 N NaN
1092 wickets 4.0 173.0 20.0 N NaN
1093 runs 36.0 176.0 20.0 N NaN
1094 wickets 8.0 114.0 20.0 N NaN
umpire1 umpire2
0 Asad Rauf RE Koertzen
1 MR Benson SL Shastri
2 Aleem Dar GA Pratapkumar
3 SJ Davis DJ Harper
4 BF Bowden K Hariharan
… … …
1090 Nitin Menon VK Sharma
1091 AK Chaudhary R Pandit
1092 KN Ananthapadmanabhan MV Saidharshan Kumar
1093 Nitin Menon VK Sharma
1094 J Madanagopal Nitin Menon
season
2007/08 58
2009 57
2009/10 60
2011 73
2012 74
2013 76
2014 60
2015 59
2016 60
2017 59
2018 60
2019 60
2020/21 60
2021 60
2022 74
2023 74
2024 71
Name: count, dtype: int64
2. Find the Most Successful team (team with most runs)
# most_successful_team = runs_df.groupby('winner')['result_margin'].sum().idxmax()
df["winner"].value_counts()
[79]: winner
Mumbai Indians 144
Chennai Super Kings 138
Kolkata Knight Riders 131
Royal Challengers Bangalore 116
Rajasthan Royals 112
Kings XI Punjab 88
Sunrisers Hyderabad 88
Delhi Daredevils 67
Delhi Capitals 48
Deccan Chargers 29
Gujarat Titans 28
Lucknow Super Giants 24
Punjab Kings 24
Gujarat Lions 13
Pune Warriors 12
Rising Pune Supergiant 10
Royal Challengers Bengaluru 7
Kochi Tuskers Kerala 6
Rising Pune Supergiants 5
Name: count, dtype: int64
The player who won the most 'Player of the Match' awards is: AB de Villiers
5. Find the number of matches where the toss winner won the match
The number of matches where the toss winner won the match: 554
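The cell producing the count above is missing. A sketch of the usual approach, comparing the toss winner and match winner row by row, is below; it runs on a toy frame since matches.csv is not bundled here (so its printed count differs from the 554 above):

```python
import pandas as pd

# Toy stand-in for matches.csv with the two relevant columns
toy = pd.DataFrame({
    "toss_winner": ["CSK", "MI", "RCB", "KKR"],
    "winner":      ["CSK", "RR", "RCB", "MI"],
})
toss_winner_won = (toy["toss_winner"] == toy["winner"]).sum()
print("Matches where the toss winner also won:", toss_winner_won)
```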
6. Calculate the total number of runs scored in all matches for each team
print(total_runs_per_team)
team1
Chennai Super Kings 39503.0
Deccan Chargers 12047.0
Delhi Capitals 15930.0
Delhi Daredevils 25492.0
Gujarat Lions 5077.0
Gujarat Titans 7865.0
Kings XI Punjab 31391.0
Kochi Tuskers Kerala 2014.0
Kolkata Knight Riders 40557.0
Lucknow Super Giants 7835.0
Mumbai Indians 43728.0
Pune Warriors 6950.0
Punjab Kings 9787.0
Rajasthan Royals 36250.0
Rising Pune Supergiant 2571.0
Rising Pune Supergiants 1993.0
Royal Challengers Bangalore 39807.0
Royal Challengers Bengaluru 2986.0
Sunrisers Hyderabad 30071.0
Name: target_runs, dtype: float64
7. Determine the average number of wickets taken by the winning team in each match
The average number of wickets taken by the winning team in each match is:
6.192041522491349
8. How many matches were decided by a Super Over?
result
wickets 578
runs 498
tie 14
no result 5
Name: count, dtype: int64
10. Find the top 5 venues with the most matches played
venue
Eden Gardens 77
Wankhede Stadium 73
M Chinnaswamy Stadium 65
Feroz Shah Kotla 60
Rajiv Gandhi International Stadium, Uppal 49
Name: count, dtype: int64
11. Find the match with the highest margin of victory (by wickets or runs)
[88]: df[df['result_margin']==df['result_margin'].max()]
[89]: # Find the match with the highest margin of victory (by wickets or runs)
df_wickets=df[df['result']=='wickets']
df_runs=df[df['result']=='runs']
max_margin_wicket=df_wickets.loc[df_wickets['result_margin'].idxmax()]
max_margin_run=df_runs.loc[df_runs['result_margin'].idxmax()]
max_margin_run,max_margin_wicket
Name: 620, dtype: object,
id 335994
season 2007/08
city Mumbai
date 2008-04-27
match_type League
player_of_match AC Gilchrist
venue Dr DY Patil Sports Academy
team1 Mumbai Indians
team2 Deccan Chargers
toss_winner Deccan Chargers
toss_decision field
winner Deccan Chargers
result wickets
result_margin 10.0
target_runs 155.0
target_overs 20.0
super_over N
method NaN
umpire1 Asad Rauf
umpire2 SL Shastri
Name: 12, dtype: object)
matches_played = df['team1'].value_counts() + df['team2'].value_counts()
matches_won = df['winner'].value_counts()
win_percentage = (matches_won / matches_played) * 100
print(win_percentage)
Royal Challengers Bangalore 48.333333
Royal Challengers Bengaluru 46.666667
Sunrisers Hyderabad 48.351648
Name: count, dtype: float64
13. Find the average number of overs played in all matches
season venue
2007/08 Dr DY Patil Sports Academy 4
Eden Gardens 7
Feroz Shah Kotla 6
M Chinnaswamy Stadium 7
MA Chidambaram Stadium, Chepauk 7
..
2024 Maharaja Yadavindra Singh International Cricket Stadium, Mullanpur 5
Narendra Modi Stadium, Ahmedabad 8
Rajiv Gandhi International Stadium, Uppal, Hyderabad 6
Sawai Mansingh Stadium, Jaipur 5
Wankhede Stadium, Mumbai 7
Length: 175, dtype: int64
16. Analyze the win margin distribution by year
print(win_margin_by_year)
# Plotting the win margin distribution by year
plt.figure(figsize=(14, 8))
sns.boxplot(x='season', y='result_margin', hue='result', data=df)
plt.title('Win Margin Distribution by Year')
plt.xlabel('Season')
plt.ylabel('Win Margin')
plt.xticks(rotation=45)
plt.legend(title='Result Type')
plt.show()
wickets 37.0 6.000000 1.615893 3.0 5.00 6.0 7.00 9.0
2023 no result 0.0 NaN NaN NaN NaN NaN NaN NaN
runs 40.0 30.400000 27.554887 1.0 7.75 22.0 51.25 112.0
wickets 33.0 5.727273 1.908414 1.0 5.00 6.0 7.00 9.0
2024 runs 35.0 30.142857 25.994505 1.0 15.00 24.0 35.00 106.0
wickets 36.0 5.944444 1.999206 2.0 4.00 6.0 7.00 10.0
17. Calculate the total number of 'no result' matches and their impact on the tournament
The total number of 'no result' matches: 5
Distribution of 'no result' matches by season:
season
2011 1
2015 2
2019 1
2023 1
Name: count, dtype: int64
Distribution of 'no result' matches by team:
Chennai Super Kings NaN
Delhi Daredevils 2.0
Lucknow Super Giants NaN
Pune Warriors NaN
Rajasthan Royals NaN
Royal Challengers Bangalore NaN
Name: count, dtype: float64
18. How many matches were won by teams batting first vs. batting second?
The average number of runs scored by the winning team is: 179.69678714859438
20. Identify the most unsuccessful team (team with lowest wins)
[98]: most_unsuccessful_team = matches_won.idxmin()
print(f"The most unsuccessful team (team with the lowest wins) is: {most_unsuccessful_team}")
The most unsuccessful team (team with the lowest wins) is: Rising Pune
Supergiants
ASSIGNMENT QUESTIONS
Explore the following for the given dataset and also perform EDA:
1. Frequency Distribution of Wins by Wickets
2. Relative Frequency Distribution
3. Cumulative Relative Frequency Graph
4. Probability of Winning by 6 Wickets or Less
5. Normal Distribution of Wins by Wickets
6. Mean, Standard Deviation, and Percentile Calculation
7. Find outliers for selected columns: lower-range outliers are values below mu - 2*sigma, and upper-range outliers are values above mu + 2*sigma.
1. Frequency Distribution of Wins by Wickets
result_margin
1.0 4
2.0 10
3.0 31
4.0 59
5.0 97
6.0 120
7.0 115
8.0 78
9.0 48
10.0 16
Name: count, dtype: int64
2. Relative Frequency Distribution
result_margin
1.0 0.006920
2.0 0.017301
3.0 0.053633
4.0 0.102076
5.0 0.167820
6.0 0.207612
7.0 0.198962
8.0 0.134948
9.0 0.083045
10.0 0.027682
Name: count, dtype: float64
3. Cumulative Relative Frequency Graph
result_margin
1.0 0.006920
2.0 0.024221
3.0 0.077855
4.0 0.179931
5.0 0.347751
6.0 0.555363
7.0 0.754325
8.0 0.889273
9.0 0.972318
10.0 1.000000
Name: count, dtype: float64
4. Probability of Winning by 6 Wickets or Less
# Calculate the probability
probability_wins_by_6_or_less = wins_by_6_or_less / total_wins_by_wickets
print(f'The probability of winning by 6 wickets or less is: {probability_wins_by_6_or_less}')
[116]: print(df.describe())
min 3.359820e+05 1.000000 43.000000 5.000000
25% 5.483315e+05 6.000000 146.000000 20.000000
50% 9.809610e+05 8.000000 166.000000 20.000000
75% 1.254062e+06 20.000000 187.000000 20.000000
max 1.426312e+06 146.000000 288.000000 20.000000
7. Find outliers for selected columns: lower-range outliers are values below mu - 2*sigma, and upper-range outliers are values above mu + 2*sigma.
[118]: # Calculate the mean and standard deviation for the result_margin column
mu = df['result_margin'].mean()
sigma = df['result_margin'].std()
outliers = df[(df['result_margin'] < mu - 2 * sigma) | (df['result_margin'] > mu + 2 * sigma)]
print(outliers)
venue \
0 M Chinnaswamy Stadium
9 Punjab Cricket Association Stadium, Mohali
39 Sawai Mansingh Stadium
55 Wankhede Stadium
59 Newlands
… …
1030 MA Chidambaram Stadium, Chepauk, Chennai
1039 Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket St…
1058 Arun Jaitley Stadium, Delhi
1069 MA Chidambaram Stadium, Chepauk, Chennai
1077 Bharat Ratna Shri Atal Bihari Vajpayee Ekana C…
team1 team2 \
0 Royal Challengers Bangalore Kolkata Knight Riders
9 Kings XI Punjab Mumbai Indians
39 Rajasthan Royals Royal Challengers Bangalore
55 Delhi Daredevils Rajasthan Royals
59 Royal Challengers Bangalore Rajasthan Royals
… … …
1030 Chennai Super Kings Gujarat Titans
1039 Kolkata Knight Riders Delhi Capitals
1058 Sunrisers Hyderabad Delhi Capitals
1069 Chennai Super Kings Sunrisers Hyderabad
1077 Kolkata Knight Riders Lucknow Super Giants
umpire1 umpire2
0 Asad Rauf RE Koertzen
9 Aleem Dar AM Saheba
39 BF Bowden SL Shastri
55 BF Bowden RE Koertzen
59 BR Doctrove RB Tiffin
… … …
1030 AG Wharf Tapan Sharma
1039 A Totre UV Gandhe
1058 J Madanagopal Navdeep Singh
1069 R Pandit MV Saidharshan Kumar
1077 MV Saidharshan Kumar YC Barde
D24AIML081_A_6
April 1, 2025
#### D24AIML081 PMRP ASSIGNMENT 6 WITH CONCLUSION
CLASSWORK
QUESTIONS:
-> General Population and Gender Distribution
What is the total population in each county, and how does it vary by state? What is the gender
distribution (Men vs. Women) across different counties? What is the average population size
for census tracts in each state? How does the population of each race (White, Black, Hispanic,
etc.) differ across states? What is the proportion of the male population compared to the female
population in each census tract?
->Ethnicity and Race
What is the distribution of Hispanic population across various counties and states? How do different
racial groups (White, Black, Native, etc.) vary in terms of percentage of total population in different
counties? Which states have the highest percentage of Black or Hispanic populations?
->Employment and Work Type
What is the employment rate (Employed vs. Unemployed) for each census tract? How does the
rate of self-employed individuals compare to those working in private/public sectors across different
states? What percentage of the population works from home, and how does it vary by county and
state? How does the unemployment rate vary across different states and counties? What is the
distribution of employed individuals working in private vs. public sectors?
->Commuting and Transportation
What is the average commuting time across counties and states, and how does it differ for employed
individuals? What modes of transportation are most commonly used for commuting in different
states (e.g., car, public transportation, walking)? How does the percentage of people commuting
via walking or public transportation vary between urban and rural areas?
->Income and Housing
What is the average income (or median household income) in each state and county? How does
the distribution of housing type (e.g., owner-occupied vs. renter-occupied) vary across different
counties? How does the cost of living compare across different states based on average income and
housing costs?
-> Social Characteristics
What is the relationship between education levels (e.g., percentage with a high school diploma,
bachelor’s degree) and employment types across different states?
df=pd.read_csv("C:/Users/User/Downloads/acs2017_census_tract_data.csv")
df,df.head(),df.tail(),df.describe(),df.info(),df.columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74001 entries, 0 to 74000
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TractId 74001 non-null int64
1 State 74001 non-null object
2 County 74001 non-null object
3 TotalPop 74001 non-null int64
4 Men 74001 non-null int64
5 Women 74001 non-null int64
6 Hispanic 73305 non-null float64
7 White 73305 non-null float64
8 Black 73305 non-null float64
9 Native 73305 non-null float64
10 Asian 73305 non-null float64
11 Pacific 73305 non-null float64
12 VotingAgeCitizen 74001 non-null int64
13 Income 72885 non-null float64
14 IncomeErr 72885 non-null float64
15 IncomePerCap 73256 non-null float64
16 IncomePerCapErr 73256 non-null float64
17 Poverty 73159 non-null float64
18 ChildPoverty 72891 non-null float64
19 Professional 73190 non-null float64
20 Service 73190 non-null float64
21 Office 73190 non-null float64
22 Construction 73190 non-null float64
23 Production 73190 non-null float64
24 Drive 73200 non-null float64
25 Carpool 73200 non-null float64
26 Transit 73200 non-null float64
27 Walk 73200 non-null float64
28 OtherTransp 73200 non-null float64
29 WorkAtHome 73200 non-null float64
30 MeanCommute 73055 non-null float64
31 Employed 74001 non-null int64
32 PrivateWork 73190 non-null float64
33 PublicWork 73190 non-null float64
34 SelfEmployed 73190 non-null float64
35 FamilyWork 73190 non-null float64
36 Unemployment 73191 non-null float64
dtypes: float64(29), int64(6), object(2)
memory usage: 20.9+ MB
FamilyWork Unemployment
0 0.0 4.6
1 0.0 3.4
2 0.7 4.7
3 0.0 6.1
4 0.0 2.3
… … …
73996 0.0 20.8
73997 0.0 26.3
73998 0.0 23.0
73999 0.0 29.5
74000 0.0 17.9
[5 rows x 37 columns],
TractId State County TotalPop Men Women \
73996 72153750501 Puerto Rico Yauco Municipio 6011 3035 2976
73997 72153750502 Puerto Rico Yauco Municipio 2342 959 1383
73998 72153750503 Puerto Rico Yauco Municipio 2218 1001 1217
73999 72153750601 Puerto Rico Yauco Municipio 4380 1964 2416
74000 72153750602 Puerto Rico Yauco Municipio 3001 1343 1658
74000 99.2 0.8 0.0 0.0 … 4.9 0.0 8.9
FamilyWork Unemployment
73996 0.0 20.8
73997 0.0 26.3
73998 0.0 23.0
73999 0.0 29.5
74000 0.0 17.9
[5 rows x 37 columns],
TractId TotalPop Men Women Hispanic \
count 7.400100e+04 74001.000000 74001.000000 74001.000000 73305.000000
mean 2.839113e+10 4384.716017 2157.710707 2227.005311 17.265444
std 1.647593e+10 2228.936729 1120.560504 1146.240218 23.073811
min 1.001020e+09 0.000000 0.000000 0.000000 0.000000
25% 1.303901e+10 2903.000000 1416.000000 1465.000000 2.600000
50% 2.804700e+10 4105.000000 2007.000000 2082.000000 7.400000
75% 4.200341e+10 5506.000000 2707.000000 2803.000000 21.100000
max 7.215375e+10 65528.000000 32266.000000 33262.000000 100.000000
Employed PrivateWork PublicWork SelfEmployed FamilyWork \
count 74001.000000 73190.000000 73190.000000 73190.000000 73190.000000
mean 2049.152052 79.494222 14.163342 6.171484 0.171164
std 1138.865457 8.126383 7.328680 3.932364 0.456580
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1276.000000 75.200000 9.300000 3.500000 0.000000
50% 1895.000000 80.600000 13.000000 5.500000 0.000000
75% 2635.000000 85.000000 17.600000 8.000000 0.000000
max 28945.000000 100.000000 100.000000 100.000000 22.300000
Unemployment
count 73191.000000
mean 7.246738
std 5.227624
min 0.000000
25% 3.900000
50% 6.000000
75% 9.000000
max 100.000000
[8 rows x 35 columns],
None,
Index(['TractId', 'State', 'County', 'TotalPop', 'Men', 'Women', 'Hispanic',
'White', 'Black', 'Native', 'Asian', 'Pacific', 'VotingAgeCitizen',
'Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr', 'Poverty',
'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction',
'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp',
'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork',
'SelfEmployed', 'FamilyWork', 'Unemployment'],
dtype='object'))
[50]: #What is the total population in each county, and how does it vary by state?
total_population_by_county = df.groupby(['State', 'County'])['TotalPop'].sum().reset_index()
print(total_population_by_county)
total_population_by_state = df.groupby('State')['TotalPop'].sum().reset_index()
print(total_population_by_state)
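The two-level aggregation above can be sanity-checked on a tiny hand-built frame (toy data standing in for the census file; the values are made up):

```python
import pandas as pd

# Toy stand-in for the census data (made-up numbers).
toy = pd.DataFrame({
    'State':    ['Alabama', 'Alabama', 'Alabama', 'Wyoming'],
    'County':   ['Autauga', 'Autauga', 'Baldwin', 'Uinta'],
    'TotalPop': [1000, 2000, 3000, 500],
})

# Same two-level aggregation as the cell above: per county, then per state.
by_county = toy.groupby(['State', 'County'])['TotalPop'].sum().reset_index()
by_state = toy.groupby('State')['TotalPop'].sum().reset_index()

print(by_county)
print(by_state)
```

The county frame collapses duplicate tract rows (Autauga appears twice), and the state frame sums across counties.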
3217 Wyoming Uinta County 20758
3218 Wyoming Washakie County 8253
3219 Wyoming Weston County 7117
42 South Dakota 855444
43 Tennessee 6597381
44 Texas 27419612
45 Utah 2993941
46 Vermont 624636
47 Virginia 8365952
48 Washington 7169967
49 West Virginia 1836843
50 Wisconsin 5763217
51 Wyoming 583200
[52]: #What is the gender distribution (Men vs. Women) across different counties?
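No code survives for this cell; a minimal sketch of what it likely computed, using the dataset's `Men`/`Women` columns on toy data (the `Male_Share` column is my addition, not the author's confirmed code):

```python
import pandas as pd

# Toy stand-in for the census frame (made-up numbers).
df = pd.DataFrame({
    'State':  ['Alabama', 'Alabama', 'Wyoming'],
    'County': ['Autauga', 'Baldwin', 'Uinta'],
    'Men':    [900, 1400, 260],
    'Women':  [1100, 1600, 240],
})

# Total men and women per county, plus the male share of each county.
gender_by_county = df.groupby(['State', 'County'])[['Men', 'Women']].sum().reset_index()
gender_by_county['Male_Share'] = gender_by_county['Men'] / (
    gender_by_county['Men'] + gender_by_county['Women'])
print(gender_by_county)
```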
[54]: #What is the average population size for census tracts in each state?
average_population_by_state = df.groupby('State')['TotalPop'].mean().reset_index()
# print(average_population_by_state)
average_population_by_state.head()
[56]: #How does the population of each race (White, Black, Hispanic, etc.) differ across states?
df['White_Percentage'] = (df['White'] / df['TotalPop']) * 100
df['Black_Percentage'] = (df['Black'] / df['TotalPop']) * 100
df['Native_Percentage'] = (df['Native'] / df['TotalPop']) * 100
df['Asian_Percentage'] = (df['Asian'] / df['TotalPop']) * 100
df['Pacific_Percentage'] = (df['Pacific'] / df['TotalPop']) * 100
df['Hispanic_Percentage'] = (df['Hispanic'] / df['TotalPop']) * 100
race_population_by_state = df.groupby('State').apply(
    # apply body reconstructed; the original was cut off in conversion
    lambda g: g[['White_Percentage', 'Black_Percentage', 'Native_Percentage',
                 'Asian_Percentage', 'Pacific_Percentage',
                 'Hispanic_Percentage']].mean()
).reset_index()
C:\Users\User\AppData\Local\Temp\ipykernel_6148\2768742177.py:12:
DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns.
This behavior is deprecated, and in a future version of pandas the grouping
columns will be excluded from the operation. Either pass `include_groups=False`
to exclude the groupings or explicitly select the grouping columns after groupby
to silence this warning.
[57]: #What is the proportion of the male population compared to the female population in each census tract?
male_to_female_ratio_by_state = df.groupby('State').apply(
    # apply body reconstructed; the original was cut off in conversion
    lambda g: g['Men'].sum() / g['Women'].sum()
).reset_index(name='MaleToFemaleRatio')
[59]: #What is the distribution of Hispanic population across various counties and states?
df['Hispanic_Percentage'] = (df['Hispanic'] / df['TotalPop']) * 100
# hispanic_by_county aggregation inferred from its later use; the defining line was lost in conversion
hispanic_by_county = df.groupby(['State', 'County'])[['Hispanic', 'TotalPop']].sum().reset_index()
hispanic_by_county['Hispanic_Percentage'] = (hispanic_by_county['Hispanic'] / hispanic_by_county['TotalPop']) * 100
hispanic_state_data = hispanic_by_county.groupby('State').apply(
    # apply body reconstructed; the original was cut off in conversion
    lambda g: g['Hispanic_Percentage'].mean()
).reset_index(name='Hispanic_Percentage')
[60]: #How do different racial groups (White, Black, Native, etc.) vary in terms of percentage of total population in different counties?
racial_population_by_county['White_Percentage'] = (racial_population_by_county['White'] / racial_population_by_county['TotalPop']) * 100
racial_population_by_county['Black_Percentage'] = (racial_population_by_county['Black'] / racial_population_by_county['TotalPop']) * 100
racial_population_by_county['Native_Percentage'] = (racial_population_by_county['Native'] / racial_population_by_county['TotalPop']) * 100
racial_population_by_county['Asian_Percentage'] = (racial_population_by_county['Asian'] / racial_population_by_county['TotalPop']) * 100
racial_population_by_county['Pacific_Percentage'] = (racial_population_by_county['Pacific'] / racial_population_by_county['TotalPop']) * 100
racial_population_by_county
Asian_Percentage Pacific_Percentage
0 0.013627 0.000727
1 0.008261 0.000000
2 0.016793 0.000000
3 0.000000 0.000000
4 0.002081 0.000000
… … …
3215 0.017517 0.011005
3216 0.037517 0.000000
3217 0.001445 0.000000
3218 0.004847 0.000000
3219 0.118027 0.000000
[61]: #Which states have the highest percentage of Black or Hispanic populations?
black_hispanic_percentage_by_state = df.groupby('State').apply(
    # apply body reconstructed; the original was cut off in conversion
    lambda g: g[['Black_Percentage', 'Hispanic_Percentage']].mean()
).reset_index()
highest_black_percentage_states = black_hispanic_percentage_by_state.sort_values(by='Black_Percentage', ascending=False).head(10)
highest_hispanic_percentage_states = black_hispanic_percentage_by_state.sort_values(by='Hispanic_Percentage', ascending=False).head(10)
plt.bar(highest_hispanic_percentage_states['State'], highest_hispanic_percentage_states['Hispanic_Percentage'])
States with the highest percentage of Black population:
State Black_Percentage Hispanic_Percentage
8 District of Columbia 1.321820 0.263329
24 Mississippi 0.928682 0.064349
18 Louisiana 0.879639 0.121680
0 Alabama 0.766280 0.093264
20 Maryland 0.712222 0.209424
41 South Carolina 0.638939 0.116409
10 Georgia 0.620468 0.165446
22 Michigan 0.489261 0.132382
7 Delaware 0.478197 0.204232
33 North Carolina 0.462836 0.189193
States with the highest percentage of Hispanic population:
State Black_Percentage Hispanic_Percentage
39 Puerto Rico 0.002410 2.526179
31 New Mexico 0.040919 1.101539
4 California 0.115160 0.775820
44 Texas 0.222349 0.739622
2 Arizona 0.085987 0.669387
28 Nevada 0.189703 0.656354
5 Colorado 0.080154 0.481415
9 Florida 0.310578 0.458349
30 New Jersey 0.318835 0.434905
32 New York 0.379462 0.431436
2 Employment and Work Type
What is the employment rate (Employed vs. Unemployed) for each census tract?
How does the rate of self-employed individuals compare to those working in private/public sectors
across different states?
What percentage of the population works from home, and how does it vary by county and state?
How does the unemployment rate vary across different states and counties?
What is the distribution of employed individuals working in private vs. public sectors?
[63]: #What is the employment rate (Employed vs. Unemployed) for each census tract?
df['EmploymentRate'] = df['Employed'] / df['TotalPop']
df['UnemploymentRate'] = df['Unemployment'] / df['TotalPop']
df[['State', 'County', 'TractId', 'Employed', 'Unemployment', 'EmploymentRate', 'UnemploymentRate']]
3 Alabama Autauga County 1001020400 1849 6.1
4 Alabama Autauga County 1001020500 4787 2.3
… … … … … …
73996 Puerto Rico Yauco Municipio 72153750501 1576 20.8
73997 Puerto Rico Yauco Municipio 72153750502 666 26.3
73998 Puerto Rico Yauco Municipio 72153750503 560 23.0
73999 Puerto Rico Yauco Municipio 72153750601 1062 29.5
74000 Puerto Rico Yauco Municipio 72153750602 759 17.9
EmploymentRate UnemploymentRate
0 0.477507 0.002493
1 0.392265 0.001565
2 0.437814 0.001388
3 0.433326 0.001430
4 0.480381 0.000231
… … …
73996 0.262186 0.003460
73997 0.284372 0.011230
73998 0.252480 0.010370
73999 0.242466 0.006735
74000 0.252916 0.005965
[65]: #How does the rate of self-employed individuals compare to those working in private/public sectors across different states?
employment_rates_by_state = df.groupby('State')[['SelfEmployedRate', 'PrivateWorkRate', 'PublicWorkRate']].mean().reset_index()
employment_rates_by_state.head()
employment_rates_by_state.set_index('State').plot(kind='bar', stacked=True, figsize=(15, 10))
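`SelfEmployedRate`, `PrivateWorkRate`, and `PublicWorkRate` are not defined in any visible cell; one plausible construction — rescaling the dataset's percentage columns to fractions — is sketched below on toy data (an assumption, not the author's confirmed code):

```python
import pandas as pd

# Toy tract-level frame; in the real dataset PrivateWork/PublicWork/SelfEmployed
# are percentages of the employed population (made-up numbers here).
df = pd.DataFrame({
    'State': ['Alabama', 'Alabama'],
    'PrivateWork': [80.0, 70.0],
    'PublicWork': [15.0, 20.0],
    'SelfEmployed': [5.0, 10.0],
})

# Rescale percentages to fractions so stacked bars sum to roughly 1 per state.
for col, rate in [('SelfEmployed', 'SelfEmployedRate'),
                  ('PrivateWork', 'PrivateWorkRate'),
                  ('PublicWork', 'PublicWorkRate')]:
    df[rate] = df[col] / 100.0

print(df.groupby('State')[['SelfEmployedRate', 'PrivateWorkRate', 'PublicWorkRate']].mean())
```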
[66]: #What percentage of the population works from home, and how does it vary by county and state?
work_at_home_by_county = df.groupby(['State', 'County'])['WorkAtHomePercentage'].mean().reset_index()
work_at_home_by_state = df.groupby('State')['WorkAtHomePercentage'].mean().reset_index()
plt.figure(figsize=(15, 10))
plt.bar(work_at_home_by_state['State'], work_at_home_by_state['WorkAtHomePercentage'])
plt.show()
[67]: #How does the unemployment rate vary across different states and counties?
unemployment_rate_by_county = df.groupby(['State', 'County'])['UnemploymentRate'].mean().reset_index()
unemployment_rate_by_state = df.groupby('State')['UnemploymentRate'].mean().reset_index()
plt.figure(figsize=(15, 10))
plt.bar(unemployment_rate_by_state['State'], unemployment_rate_by_state['UnemploymentRate'])
[68]: #What is the distribution of employed individuals working in private vs. public sectors?
employment_distribution_by_sector = df.groupby('State')[['PrivateWork', 'PublicWork']].sum().reset_index()
print(employment_distribution_by_sector)
employment_distribution_by_sector.set_index('State').plot(kind='bar', stacked=True, figsize=(15, 10))
plt.xlabel('State')
plt.ylabel('Number of Employed Individuals')
plt.legend(title='Employment Sector')
plt.show()
4 California 620398.1 109397.2
5 Colorado 97844.1 17353.8
6 Connecticut 66745.7 10529.6
7 Delaware 17486.0 3008.6
8 District of Columbia 12648.6 4495.2
9 Florida 339459.8 50039.8
10 Georgia 154053.7 30032.9
11 Hawaii 22185.5 6871.9
12 Idaho 22544.6 4713.8
13 Illinois 257640.6 38630.2
14 Indiana 127561.2 15807.4
15 Iowa 65720.6 10575.5
16 Kansas 59007.9 11671.1
17 Kentucky 87831.4 16510.3
18 Louisiana 88930.0 17008.1
19 Maine 26940.6 4891.5
20 Maryland 101853.0 29729.1
21 Massachusetts 119845.6 17640.1
22 Michigan 231237.5 29144.3
23 Minnesota 109311.0 15853.7
24 Mississippi 49870.3 12135.6
25 Missouri 113203.8 17254.2
26 Montana 18926.0 5122.4
27 Nebraska 41686.2 7305.3
28 Nevada 55822.1 8199.3
29 New Hampshire 23277.0 3879.4
30 New Jersey 162630.5 27280.9
31 New Mexico 34674.1 11583.9
32 New York 379901.5 75897.8
33 North Carolina 172143.5 31173.4
34 North Dakota 14928.3 3435.3
35 Ohio 244297.7 34850.6
36 Oklahoma 80218.8 17329.3
37 Oregon 64121.0 11690.7
38 Pennsylvania 269785.3 33345.0
39 Puerto Rico 60283.9 19938.7
40 Rhode Island 19867.9 2902.2
41 South Carolina 86280.9 16694.5
42 South Dakota 16150.8 3726.8
43 Tennessee 116793.3 20576.8
44 Texas 414609.8 70444.0
45 Utah 46872.6 8617.6
46 Vermont 13926.9 2583.8
47 Virginia 140180.9 37739.8
48 Washington 111172.3 24186.2
49 West Virginia 37205.6 9021.3
50 Wisconsin 114720.4 16816.3
51 Wyoming 9333.9 2849.7
3 Commuting and Transportation
What is the average commuting time across counties and states, and how does it differ for employed
individuals?
What modes of transportation are most commonly used for commuting in different states (e.g., car,
public transportation, walking)?
How does the percentage of people commuting via walking or public transportation vary between
urban and rural areas?
[71]: #What is the average commuting time across counties and states, and how does it differ for employed individuals?
# average_commute_by_county is printed below; its definition was lost in conversion and is reconstructed here
average_commute_by_county = df.groupby(['State', 'County'])['MeanCommute'].mean().reset_index()
average_commute_by_state = df.groupby('State')['MeanCommute'].mean().reset_index()
plt.figure(figsize=(15, 10))
plt.bar(average_commute_by_state['State'], average_commute_by_state['MeanCommute'], color='y')
print(average_commute_by_county)
print(average_commute_by_state)
… … … …
3215 Wyoming Sweetwater County 20.708333
3216 Wyoming Teton County 14.450000
3217 Wyoming Uinta County 20.233333
3218 Wyoming Washakie County 14.533333
3219 Wyoming Weston County 26.000000
39 Puerto Rico 28.281087
40 Rhode Island 24.409167
41 South Carolina 24.292173
42 South Dakota 17.077477
43 Tennessee 24.576626
44 Texas 25.205923
45 Utah 21.286770
46 Vermont 23.095082
47 Virginia 27.695833
48 Washington 26.888989
49 West Virginia 25.428306
50 Wisconsin 22.135396
51 Wyoming 18.145802
[72]: #What modes of transportation are most commonly used for commuting in different states (e.g., car, public transportation, walking)?
#Calculations
transportation_modes_by_state = df.groupby('State')[['Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp']].sum().reset_index()
print(transportation_modes_by_state)
State Drive Carpool Transit Walk \
0 Alabama 87.237800 9.464023 0.586360 1.483553
43 Tennessee 86.129360 9.594789 1.083404 1.773425
24 Mississippi 86.086598 10.119360 0.424133 1.671821
29 New Hampshire 85.520991 8.814976 0.895838 3.211881
41 South Carolina 85.126841 10.013474 0.761312 2.265327
3 Arkansas 85.009745 11.042141 0.500057 2.015314
16 Kansas 84.992725 10.007144 0.596194 2.837533
36 Oklahoma 84.907412 10.795471 0.583061 2.210284
33 North Carolina 84.901295 10.254153 1.212449 2.128705
27 Nebraska 84.767305 9.650356 0.817301 3.337992
14 Indiana 84.761873 9.662309 1.482523 2.462258
35 Ohio 84.513781 8.589462 2.775086 2.761280
15 Iowa 84.454221 9.198036 1.067967 3.695741
25 Missouri 84.286308 9.674766 2.303347 2.313148
17 Kentucky 84.152657 10.172520 1.336560 2.625019
49 West Virginia 84.103448 9.925722 1.045823 3.613250
22 Michigan 83.925573 9.733940 2.229668 2.609457
31 New Mexico 83.907252 10.063634 1.180954 2.694058
7 Delaware 83.632857 8.266754 3.799419 2.954340
9 Florida 83.585773 9.656995 2.252973 1.895705
42 South Dakota 83.578985 9.608715 0.613847 4.691166
34 North Dakota 83.455019 9.613547 0.561868 5.015786
44 Texas 83.448608 11.148666 1.682065 1.891720
50 Wisconsin 83.272902 8.829600 2.647031 3.519228
18 Louisiana 82.752882 9.928719 2.361261 2.474868
19 Maine 82.560276 10.427820 0.708804 4.577618
10 Georgia 82.553645 10.903683 2.667940 1.965389
12 Idaho 82.548234 10.779917 0.809219 3.478675
39 Puerto Rico 82.061982 8.496652 2.871417 4.380004
40 Rhode Island 81.987316 9.117606 2.908490 4.454913
23 Minnesota 81.692781 9.298336 3.999825 3.195506
46 Vermont 81.644848 9.479608 1.221118 5.917226
47 Virginia 81.290401 9.740763 4.383332 2.682129
51 Wyoming 81.222942 10.773234 1.309205 4.608724
2 Arizona 81.052286 11.526215 2.146066 2.364155
5 Colorado 80.907299 10.085541 3.281061 3.280014
6 Connecticut 80.681142 8.678442 5.921638 3.466852
28 Nevada 80.267411 10.829486 3.976415 2.597373
45 Utah 80.194172 12.128832 2.663132 2.935868
26 Montana 79.196714 10.630047 0.787306 6.936447
38 Pennsylvania 77.786649 9.225055 6.836267 4.608836
4 California 77.665634 11.035172 5.574819 2.972786
48 Washington 77.643134 10.693114 5.855954 3.677865
37 Oregon 76.806406 10.869483 4.276663 4.449932
20 Maryland 75.852719 9.589507 10.078988 2.900839
13 Illinois 75.248824 8.560149 11.003480 3.269925
30 New Jersey 73.641468 8.522955 12.381831 3.447695
21 Massachusetts 72.916093 8.115665 10.971217 5.811922
11 Hawaii 69.115348 14.034894 6.945447 5.954481
1 Alaska 67.163846 12.427795 1.499818 11.904292
32 New York 55.260617 7.185068 29.055900 6.596035
8 District of Columbia 38.846324 6.042644 37.586481 11.469521
OtherTransp
0 1.228263
43 1.419022
24 1.698087
29 1.556314
41 1.833046
3 1.432743
16 1.566404
36 1.503772
33 1.503398
27 1.427046
14 1.631036
35 1.360391
15 1.584035
25 1.422432
17 1.713244
49 1.311757
22 1.501362
31 2.154102
7 1.346630
9 2.608554
42 1.507287
34 1.353779
44 1.828942
50 1.731240
18 2.482271
19 1.725482
10 1.909344
12 2.383955
39 2.189944
40 1.531675
23 1.813552
46 1.737200
47 1.903376
51 2.085894
2 2.911277
5 2.446085
6 1.251926
28 2.329315
45 2.077996
26 2.449487
38 1.543194
4 2.751590
48 2.129932
37 3.597516
20 1.577947
13 1.917621
30 2.006051
21 2.185103
11 3.949830
1 7.004248
32 1.902380
8 6.055030
[73]: #How does the percentage of people commuting via walking or public transportation vary between urban and rural areas?
commute_modes_by_area.set_index('AreaType').plot(kind='bar', stacked=True, figsize=(10, 6))
plt.xlabel('Area Type')
plt.ylabel('Percentage')
plt.legend(title='Commute Mode')
plt.show()
commute_modes_by_area
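`AreaType` and `commute_modes_by_area` are not derived in any visible cell. One common heuristic — classifying tracts by a population cutoff — is sketched below on toy data; the 5000-person threshold and the derivation are assumptions, not the author's confirmed method:

```python
import pandas as pd

# Toy tract frame (made-up numbers); Walk/Transit are commute-mode percentages.
df = pd.DataFrame({
    'TotalPop': [8000, 1200, 6500, 900],
    'Walk':     [5.0, 1.0, 4.0, 0.5],
    'Transit':  [10.0, 0.5, 8.0, 0.2],
})

# Arbitrary stand-in threshold: tracts with >= 5000 people count as urban.
df['AreaType'] = df['TotalPop'].apply(lambda p: 'Urban' if p >= 5000 else 'Rural')
commute_modes_by_area = df.groupby('AreaType')[['Walk', 'Transit']].mean().reset_index()
print(commute_modes_by_area)
```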
costs?
[77]: #What is the average income (or median household income) in each state and county?
# average_income_by_county is printed below; its definition was lost in conversion and is reconstructed here
average_income_by_county = df.groupby(['State', 'County'])['Income'].mean().reset_index()
average_income_by_state = df.groupby('State')['Income'].mean().reset_index()
print(average_income_by_county.head())
print(average_income_by_state.head())
# housing_distribution_by_county = df.groupby(['State', 'County'])[['OwnerOccupied', 'RenterOccupied']].sum().reset_index()
# ax.set_xlabel('County')
# ax.set_ylabel('Number of Housing Units')
# plt.legend(title='Housing Type')
# plt.show()
'''
(Current dataset does not include owner/renter data, so this needs additional information.)
'''
print(df.columns)
'Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr', 'Poverty',
'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction',
'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp',
'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork',
'SelfEmployed', 'FamilyWork', 'Unemployment', 'White_Percentage',
'Black_Percentage', 'Native_Percentage', 'Asian_Percentage',
'Pacific_Percentage', 'Hispanic_Percentage', 'MaleToFemaleRatio',
'EmploymentRate', 'UnemploymentRate', 'SelfEmployedRate',
'PrivateWorkRate', 'PublicWorkRate', 'WorkAtHomePercentage',
'AreaType'],
dtype='object')
[79]: #How does the cost of living compare across different states based on average income and housing costs?
average_income_by_state = df.groupby('State')['Income'].mean().reset_index()
average_per_capita_income_by_state = df.groupby('State')['IncomePerCap'].mean().reset_index()
cost_of_living_by_state = pd.merge(average_income_by_state, average_per_capita_income_by_state, on='State')
plt.show()
5 Social Characteristics
What is the relationship between education levels (e.g., percentage with a high school diploma,
bachelor’s degree) and employment types across different states?
[81]: #What is the relationship between education levels (e.g., percentage with a high school diploma, bachelor's degree) and employment types
'''
(Current dataset does not include direct education data, but employment types are available.)
'''
35
plt.legend(title='Employment Type')
plt.xticks(rotation=90)
plt.show()
D24AIML081_AS7
April 1, 2025
# 1. Descriptive Statistics
# `iris` is assumed to be seaborn's iris frame; the loading cell was lost in conversion
import seaborn as sns
iris = sns.load_dataset('iris')
print("Descriptive Statistics:")
print(iris.describe())
print("\nMedian values:")
print(iris.median(numeric_only=True))
Descriptive Statistics:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
Median values:
sepal_length 5.80
sepal_width 3.00
petal_length 4.35
petal_width 1.30
dtype: float64
[46]: #1. Calculate basic descriptive statistics such as the mean, median, standard deviation, and more for each of the numeric columns.
[47]: import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
print("Dataset Preview:\n", df.head())
stats = df.describe().T
stats['median'] = df.median()
print("\nDescriptive Statistics:\n", stats)
Dataset Preview:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
Descriptive Statistics:
count mean std min 25% 50% 75% max median
sepal length (cm) 150.0 5.843333 0.828066 4.3 5.1 5.80 6.4 7.9 5.80
sepal width (cm) 150.0 3.057333 0.435866 2.0 2.8 3.00 3.3 4.4 3.00
petal length (cm) 150.0 3.758000 1.765298 1.0 1.6 4.35 5.1 6.9 4.35
petal width (cm) 150.0 1.199333 0.762238 0.1 0.3 1.30 1.8 2.5 1.30
[48]: #2. Normal Distribution (Check for Normality): check whether the `sepal_length` follows a normal distribution using a histogram and a Q-Q plot.
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['sepal length (cm)'], kde=True, bins=20)
plt.title("Histogram of Sepal Length")
plt.subplot(1, 2, 2)
stats.probplot(df['sepal length (cm)'].values, dist="norm", plot=plt)
plt.show()
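A numeric test can complement the visual checks above; `scipy.stats.shapiro` is one option (this cell is an addition, not in the original notebook):

```python
from sklearn.datasets import load_iris
import pandas as pd
from scipy import stats

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Shapiro-Wilk test: the null hypothesis is that the sample is normally distributed.
stat, p = stats.shapiro(df['sepal length (cm)'])
print(f"W = {stat:.4f}, p = {p:.4f}")
```

A small p-value here would argue against normality even if the histogram looks roughly bell-shaped.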
[55]: #3. Hypothesis Testing (One-Sample t-Test): perform a one-sample t-test to check if the average `sepal_length` is different from 5.0.
# test-statistic computation reconstructed; the original line was lost in conversion
t_stat, p_value = stats.ttest_1samp(df['sepal length (cm)'], 5.0)
print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_value:.4f}")
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: The average sepal length is significantly different from 5.0.")
else:
    print("Fail to reject the null hypothesis: No significant difference from 5.0.")
[59]: #4. Correlation Analysis: calculate the Pearson correlation coefficient between `sepal_length` and `petal_length` to see if they are related.
# correlation computation reconstructed; the original line was lost in conversion
correlation, p_value = stats.pearsonr(df['sepal length (cm)'], df['petal length (cm)'])
print(f"Correlation Coefficient: {correlation:.4f}")
print(f"P-Value: {p_value:.4f}")
alpha = 0.05
if p_value < alpha:
    print("The correlation is statistically significant.")
else:
    print("The correlation is not statistically significant.")
[63]: #5. Simple Linear Regression: perform a simple linear regression to predict `petal_length` based on `sepal_length`.
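The code for this cell was lost in conversion; a sketch with `scipy.stats.linregress` (the original may well have used sklearn instead — this is a reconstruction, not the author's confirmed code):

```python
from sklearn.datasets import load_iris
import pandas as pd
from scipy import stats

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Fit petal length as a linear function of sepal length.
res = stats.linregress(df['sepal length (cm)'], df['petal length (cm)'])
print(f"slope={res.slope:.4f}, intercept={res.intercept:.4f}, r={res.rvalue:.4f}")
```

`res.rvalue` here matches the Pearson correlation from the previous cell, since simple linear regression and correlation share the same r.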
[67]: #6. ANOVA (One-Way Analysis of Variance): perform an ANOVA test to check if there is a significant difference in the `sepal_length` between different species.
# group construction and F-test reconstructed; the original lines were lost in conversion
groups = [df.loc[iris.target == i, 'sepal length (cm)'] for i in range(3)]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F-Statistic: {f_stat:.4f}")
print(f"P-Value: {p_value:.4f}")
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in sepal length between species.")
else:
    print("Fail to reject the null hypothesis: No significant difference in sepal length between species.")
F-Statistic: 119.2645
P-Value: 0.0000
Reject the null hypothesis: There is a significant difference in sepal length
between species.
[71]: #7. Chi-Square Test for Independence: perform a Chi-Square test to see if there is an association between `species` and `sepal_width`.
# contingency-table construction reconstructed (sepal width binned into 3 categories); the original lines were lost in conversion
sepal_width_binned = pd.cut(df['sepal width (cm)'], bins=3, labels=['low', 'medium', 'high'])
contingency_table = pd.crosstab(iris.target, sepal_width_binned)
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
print(f"Chi-Square: {chi2:.4f}, P-Value: {p:.4f}")
alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis: There is an association between species and sepal width.")
else:
    print("Fail to reject the null hypothesis: No significant association between species and sepal width.")
[75]: #8. Calculate the 95% confidence interval for the `petal_length` for each species. Use the `petal_length` column and apply the `groupby()` function to each species.
# function body above the final return was lost in conversion and is reconstructed here
def confidence_interval(series, confidence=0.95):
    mean = series.mean()
    margin_of_error = stats.sem(series) * stats.t.ppf((1 + confidence) / 2, len(series) - 1)
    return mean - margin_of_error, mean + margin_of_error
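A usage sketch of the interval helper, applied per group on toy data (the helper's body was partially reconstructed above; the toy species labels and values here are made up):

```python
import pandas as pd
from scipy import stats

def confidence_interval(series, confidence=0.95):
    # t-based confidence interval for the mean of a sample.
    mean = series.mean()
    margin = stats.sem(series) * stats.t.ppf((1 + confidence) / 2, len(series) - 1)
    return mean - margin, mean + margin

toy = pd.DataFrame({'species': ['a'] * 4 + ['b'] * 4,
                    'petal_length': [1.0, 1.2, 1.4, 1.6, 4.0, 4.2, 4.4, 4.6]})
result = toy.groupby('species')['petal_length'].apply(confidence_interval)
print(result)
```

Each group's interval brackets its sample mean (1.3 for `a`, 4.3 for `b`), with width driven by the group's spread and size.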
[83]: #10. Conduct a Chi-Square test to see if there is an association between the `season` and `species`. You will need to categorize the `season` column and check how the species distribution varies by season.
[85]: import numpy as np
np.random.seed(42)
seasons = ['Spring', 'Summer', 'Fall', 'Winter']
df['season'] = np.random.choice(seasons, size=len(df))
# contingency table and test reconstructed; the original lines were lost in conversion
contingency_table = pd.crosstab(iris.target, df['season'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"Chi-Square: {chi2:.4f}, P-Value: {p_value:.4f}")
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is an association between species and season.")
else:
    print("Fail to reject the null hypothesis: No significant association between species and season.")
[87]: #11. Calculate the Z-scores for `sepal_length` and identify if any values are outliers (with a threshold of ±3). How many outliers do you find?
# z-score computation reconstructed; the original line was lost in conversion
df['sepal_length_zscore'] = stats.zscore(df['sepal length (cm)'])
outliers = df[np.abs(df['sepal_length_zscore']) > 3]
# Display results
print("Number of outliers in Sepal Length:", len(outliers))
print(outliers[['sepal length (cm)', 'sepal_length_zscore']])
[93]: # hue='species' needs a species column; it is added here because the original setup line was lost in conversion
df['species'] = iris.target_names[iris.target]
sns.pairplot(df, vars=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'], hue='species', diag_kind='kde')
plt.show()
PMRP PRACTICAL 8
April 1, 2025
2 Q2 : How many missing values are present in the columns:
[23]: null_values = data.isnull().sum()
null_values
[23]: Country 0
Year 0
Status 0
Life expectancy 10
Adult Mortality 10
infant deaths 0
Alcohol 194
percentage expenditure 0
Hepatitis B 553
Measles 0
BMI 34
under-five deaths 0
Polio 19
Total expenditure 226
Diphtheria 19
HIV/AIDS 0
GDP 448
Population 652
thinness 1-19 years 34
thinness 5-9 years 34
Income composition of resources 167
Schooling 163
dtype: int64
'Dominican Republic' 'Ecuador' 'Egypt' 'El Salvador' 'Equatorial Guinea'
'Eritrea' 'Estonia' 'Ethiopia' 'Fiji' 'Finland' 'France' 'Gabon' 'Gambia'
'Georgia' 'Germany' 'Ghana' 'Greece' 'Grenada' 'Guatemala' 'Guinea'
'Guinea-Bissau' 'Guyana' 'Haiti' 'Honduras' 'Hungary' 'Iceland' 'India'
'Indonesia' 'Iran (Islamic Republic of)' 'Iraq' 'Ireland' 'Israel'
'Italy' 'Jamaica' 'Japan' 'Jordan' 'Kazakhstan' 'Kenya' 'Kiribati'
'Kuwait' 'Kyrgyzstan' "Lao People's Democratic Republic" 'Latvia'
'Lebanon' 'Lesotho' 'Liberia' 'Libya' 'Lithuania' 'Luxembourg'
'Madagascar' 'Malawi' 'Malaysia' 'Maldives' 'Mali' 'Malta'
'Marshall Islands' 'Mauritania' 'Mauritius' 'Mexico'
'Micronesia (Federated States of)' 'Monaco' 'Mongolia' 'Montenegro'
'Morocco' 'Mozambique' 'Myanmar' 'Namibia' 'Nauru' 'Nepal' 'Netherlands'
'New Zealand' 'Nicaragua' 'Niger' 'Nigeria' 'Niue' 'Norway' 'Oman'
'Pakistan' 'Palau' 'Panama' 'Papua New Guinea' 'Paraguay' 'Peru'
'Philippines' 'Poland' 'Portugal' 'Qatar' 'Republic of Korea'
'Republic of Moldova' 'Romania' 'Russian Federation' 'Rwanda'
'Saint Kitts and Nevis' 'Saint Lucia' 'Saint Vincent and the Grenadines'
'Samoa' 'San Marino' 'Sao Tome and Principe' 'Saudi Arabia' 'Senegal'
'Serbia' 'Seychelles' 'Sierra Leone' 'Singapore' 'Slovakia' 'Slovenia'
'Solomon Islands' 'Somalia' 'South Africa' 'South Sudan' 'Spain'
'Sri Lanka' 'Sudan' 'Suriname' 'Swaziland' 'Sweden' 'Switzerland'
'Syrian Arab Republic' 'Tajikistan' 'Thailand'
'The former Yugoslav republic of Macedonia' 'Timor-Leste' 'Togo' 'Tonga'
'Trinidad and Tobago' 'Tunisia' 'Turkey' 'Turkmenistan' 'Tuvalu' 'Uganda'
'Ukraine' 'United Arab Emirates'
'United Kingdom of Great Britain and Northern Ireland'
'United Republic of Tanzania' 'United States of America' 'Uruguay'
'Uzbekistan' 'Vanuatu' 'Venezuela (Bolivarian Republic of)' 'Viet Nam'
'Yemen' 'Zambia' 'Zimbabwe']
['Developing' 'Developed']
data.groupby('Country')['Life expectancy'].mean().sort_values(ascending=False).head(10).plot(kind='bar',figsize=(20,5))
Country
Japan 82.53750
Sweden 82.51875
Iceland 82.44375
Switzerland 82.33125
France 82.21875
Italy 82.18750
Spain 82.06875
Australia 81.81250
Norway 81.79375
Canada 81.68750
Name: Life expectancy, dtype: float64
5 Q5 : What are the top 10 Countries with the highest and lowest GDP:
[30]: highest_gdp = data.groupby('Country')['GDP'].mean().sort_values(ascending=False).head(10)
lowest_gdp = data.groupby('Country')['GDP'].mean().sort_values(ascending=True).head(10)
[31]: Text(0.5, 1.0, 'What is the trend of Life Expectancy over the years for different regions:')
7 Q7 : How does adult mortality impact life expectancy across countries:
[32]: data.groupby('Country')['Adult Mortality'].mean().sort_values(ascending=True).plot(kind='bar',figsize=(20,10))
8 Q8 : Is there a significant relationship between life expectancy and GDP per capita:
[33]: data.groupby('GDP')['Life expectancy'].mean().sort_values(ascending=False).head(15).plot(kind='bar',figsize=(20,5))
[33]: Text(0.5, 1.0, 'Is there a significant relationship between life expactancy and gdp per capital ')
9 Q9 : How does alcohol consumption relate to life expectancy:
[34]: data.groupby('Country')['Alcohol'].mean().sort_values().head(20).plot(kind='bar',figsize=(20,5))
plt.xlabel("Countries")
plt.ylabel("BMI VALUE")
plt.title("BMI relate to Life Expectancy : ")
Country
Saint Kitts and Nevis 5.20000
Viet Nam 11.18750
Bangladesh 12.87500
Lao People's Democratic Republic 14.36250
Timor-Leste 14.55000
Rwanda 14.75000
Madagascar 14.76875
India 14.79375
Ethiopia 14.80000
Eritrea 15.15625
Nepal 15.17500
Burundi 15.31250
Cambodia 15.36250
Burkina Faso 15.50000
Afghanistan 15.51875
Uganda 15.52500
Kenya 15.56250
Democratic Republic of the Congo 15.83750
Mozambique 16.14375
Chad 16.31875
Name: BMI, dtype: float64
11 Q11 : Does immunization coverage (Hepatitis B, Polio) affect life expectancy:
[36]: print(data.groupby('Country')['Hepatitis B'].mean().sort_values(ascending=False))
data.groupby('Country')['Hepatitis B'].mean().sort_values(ascending=False).head(50).plot(kind='bar',figsize=(20,5))
Country
Palau 99.0000
Monaco 99.0000
Niue 99.0000
Fiji 98.8750
Oman 98.8125
…
Japan NaN
Norway NaN
Slovenia NaN
Switzerland NaN
United Kingdom of Great Britain and Northern Ireland NaN
Name: Hepatitis B, Length: 193, dtype: float64
[37]: print(data.groupby('Country')['Polio'].mean().sort_values(ascending=False))
data.groupby('Country')['Polio'].mean().sort_values(ascending=False).head(50).plot(kind='bar',figsize=(20,5))
Country
Niue 99.0000
Monaco 99.0000
Palau 99.0000
Hungary 98.9375
Cuba 98.6875
…
Nigeria 41.3125
Equatorial Guinea 36.8750
Chad 32.8750
Somalia 29.8125
Tuvalu 9.0000
Name: Polio, Length: 193, dtype: float64
[38]: data.info
infant deaths Alcohol percentage expenditure Hepatitis B Measles \
0 62 0.01 71.279624 65.0 1154
1 64 0.01 73.523582 62.0 492
2 66 0.01 73.219243 64.0 430
3 69 0.01 78.184215 67.0 2787
4 71 0.01 7.097109 68.0 3013
… … … … … …
2933 27 4.36 0.000000 68.0 31
2934 26 4.06 0.000000 7.0 998
2935 25 4.43 0.000000 73.0 304
2936 25 1.72 0.000000 76.0 529
2937 24 1.68 0.000000 79.0 1483
2934 0.418 9.5
2935 0.427 10.0
2936 0.427 9.8
2937 0.434 9.8
D24AIML081_PR9
April 1, 2025
[3]: # imports reconstructed; the original setup cell was lost in conversion
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as st
df=pd.read_csv(r"C:\Users\User\Downloads\Housing.csv")
sns.histplot(df['price'],bins=30,kde=True,color='black')
plt.title("Housing Price Distribution")
plt.show()
[9]: x1=df[df['airconditioning']=="yes"]['price'].mean()
x2=df[df['airconditioning']=="no"]['price'].mean()
print("With Airconditioning", x1)
print("Without Airconditioning", x2)
s=df['price'].std()
# the original cell misspelled the column as 'aircondtioning', which raised a KeyError; fixed here
n1=len(df[df['airconditioning']=='yes'])
n2=len(df[df['airconditioning']=='no'])
t = (x1-x2)/(s*np.sqrt(1/n1+1/n2))
print("T-statistic", t)
# use the upper-tail critical value; ppf(0.025, ...) as originally written returns a negative number
t_critical=st.t.ppf(0.975, n1+n2-2)
print("T-Critical", t_critical)
if t > t_critical:
    print("Reject Null Hypothesis")
else:
    print("Fail to reject Null Hypothesis")
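The manual t-statistic in this cell pools the overall price standard deviation across both groups; `scipy.stats.ttest_ind` with `equal_var=False` (Welch's test) is a more standard route. A sketch on toy data, since the housing CSV itself is not included here:

```python
import pandas as pd
from scipy import stats

# Toy stand-in for the housing data (made-up prices).
df = pd.DataFrame({
    'airconditioning': ['yes', 'yes', 'yes', 'no', 'no', 'no'],
    'price': [9_000_000, 8_500_000, 9_500_000, 4_000_000, 4_500_000, 5_000_000],
})

ac = df.loc[df['airconditioning'] == 'yes', 'price']
no_ac = df.loc[df['airconditioning'] == 'no', 'price']

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(ac, no_ac, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```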
D24AIML081_PR_10
April 1, 2025
[5]: # imports reconstructed; the original setup cell was lost in conversion
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data=pd.read_csv(r"C:\Users\User\Downloads\global_energy_consumption.csv")
data.head()
[5]: Country Year Total Energy Consumption (TWh) Per Capita Energy Use (kWh) \
0 Canada 2018 9525.38 42301.43
1 Germany 2020 7922.08 36601.38
2 Russia 2002 6630.01 41670.20
3 Brazil 2010 8580.19 10969.58
4 Canada 2006 848.88 32190.85
[15]: # Q1 : What is the average total energy consumption across all countries?
avg_con = data['Total Energy Consumption (TWh)'].mean()
print("Average of Total Energy Consumption(TWh) is ",avg_con)
[21]: # Q3 : What is the correlation between fossil fuel dependency and carbon emissions?
# computation reconstructed from the printed output; column names inferred from later cells
corr = data['Fossil Fuel Dependency (%)'].corr(data['Carbon Emissions (Million Tons)'])
print("Correlation between fossil fuel dependency and Carbon Emissions (Million Tons) is", corr)
Correlation between fossil fuel dependency and Carbon Emissions (Million Tons) is 0.004444006196321776
[31]: # Q4 : Which country has the highest average renewable energy share?
country = data.groupby('Country')['Renewable Energy Share (%)'].mean().idxmax()
high_avg_erg = data.groupby('Country')['Renewable Energy Share (%)'].mean().max()
{
    'Country': country,
    'highest average renewable energy ' : high_avg_erg,
}
[37]: # Q5 : What is the standard deviation of the energy price index across different years?
# computation reconstructed; the original lines were lost in conversion
price_std_by_year = data.groupby('Year')['Energy Price Index (USD/kWh)'].std()
print(price_std_by_year)
[39]: # Q6 : How does industrial energy use compare to household energy use on average?
data['Country'].unique()
developed=['Australia','China','USA','UK','Germany','Russia']
developing=['Japan','India','Canada','Brazil']
#mean of per capita mean of every country
developed_country = data[data['Country'].isin(developed)]['Per Capita Energy Use (kWh)']
# developing_country selection reconstructed; the original line was lost in conversion
developing_country = data[data['Country'].isin(developing)]['Per Capita Energy Use (kWh)']
{
    'Developed' : developed_country.mean(),
    'Developing' : developing_country.mean()
}
[51]: # Q8 : What is the distribution of total energy consumption? Is it normally distributed?
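This cell's code is missing; a sketch of a numeric normality check on toy data (the real column is `Total Energy Consumption (TWh)`; the uniform sample below is a made-up stand-in):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy stand-in for the consumption column: uniform, so clearly non-normal.
consumption = rng.uniform(100, 10_000, size=500)

# D'Agostino-Pearson test: null hypothesis is that the sample is normal.
stat, p = stats.normaltest(consumption)
print(f"stat = {stat:.2f}, p = {p:.4g}")
```

A histogram (e.g. `plt.hist(consumption, bins=30)`) alongside the test gives a visual confirmation of the same answer.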
[55]: # Q9 : Can we build a regression model to predict carbon emissions based on energy consumption and fossil fuel dependency?
# model fit reconstructed from the printed output and the single-feature predict call below
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(data[['Total Energy Consumption (TWh)']], data['Carbon Emissions (Million Tons)'])
print(f"Intercept:{model.intercept_},coefficient:{model.coef_[0]}")
Intercept:44.772961929867485,coefficient:6.304406118696378e-05
[57]: model.predict([[5.0]])
C:\Users\prade\anaconda3\Lib\site-packages\sklearn\base.py:493: UserWarning: X
does not have valid feature names, but LinearRegression was fitted with feature
names
warnings.warn(
[57]: array([44.77327715])
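The fitting code for Q9 is missing, and the `predict` call above triggers a feature-names `UserWarning` because the model was fitted on a DataFrame but then queried with a plain nested list. A sketch of a single-feature fit (the printed output shows one coefficient) on synthetic data; passing a one-column DataFrame to `predict` keeps the feature names consistent and silences the warning:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: emissions roughly linear in consumption, plus noise
df = pd.DataFrame({'Total Energy Consumption (TWh)': rng.uniform(100, 10000, 200)})
df['Carbon Emissions (Million Tons)'] = (
    40 + 0.05 * df['Total Energy Consumption (TWh)'] + rng.normal(0, 5, 200)
)

X = df[['Total Energy Consumption (TWh)']]
y = df['Carbon Emissions (Million Tons)']

model = LinearRegression().fit(X, y)
print(f"Intercept: {model.intercept_}, coefficient: {model.coef_[0]}")

# A DataFrame with the same column name avoids the
# "X does not have valid feature names" UserWarning
pred = model.predict(pd.DataFrame({'Total Energy Consumption (TWh)': [5.0]}))
```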
[59]: # Q10 : What is the impact of renewable energy share on energy price index?
impact= data['Renewable Energy Share (%)'].corr(data['Energy Price Index (USD/
↪kWh)'])
print(impact)
-0.0156399186330418
1 PART 2
[96]: # Q1 : What is the trend of total energy consumption over the years for different countries?
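The body of this cell was lost. One way to show a per-country trend is to pivot the data into a Year x Country table and plot one line per country; a sketch on a synthetic frame (the `Agg` backend is selected only so the sketch runs headless):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless-safe backend for this sketch
import matplotlib.pyplot as plt

# Synthetic stand-in for the real CSV
df = pd.DataFrame({
    'Country': ['USA', 'USA', 'India', 'India'],
    'Year': [2018, 2019, 2018, 2019],
    'Total Energy Consumption (TWh)': [9000.0, 9100.0, 5000.0, 5300.0],
})

# One line per country: pivot to Year x Country, then plot each column
trend = df.pivot_table(index='Year', columns='Country',
                       values='Total Energy Consumption (TWh)', aggfunc='mean')
trend.plot(marker='o', figsize=(10, 6), title='Total Energy Consumption Over Years')
plt.xlabel('Year')
plt.ylabel('Total Energy Consumption (TWh)')
```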
[80]: # Q2 : Which countries have the highest and lowest fossil fuel dependency?
# Note: groupby(...).idxmin()/idxmax() return the row label of each country's
# extreme record, not the country name; min()/max() return the values themselves.
country_min = data.groupby('Country')['Fossil Fuel Dependency (%)'].idxmin()
min_value = data.groupby('Country')['Fossil Fuel Dependency (%)'].min()
country_max = data.groupby('Country')['Fossil Fuel Dependency (%)'].idxmax()
max_value = data.groupby('Country')['Fossil Fuel Dependency (%)'].max()
{
    'Country With Min Value' : country_min,
    'Minimum Value ' : min_value,
    'Country With Max Value' : country_max,
    'Maximum Value ' : max_value
}
Japan 4572
Russia 6433
UK 4545
USA 3900
Name: Fossil Fuel Dependency (%), dtype: int64,
'Minimum Value ': Country
Australia 10.03
Brazil 10.03
Canada 10.02
China 10.11
Germany 10.04
India 10.04
Japan 10.07
Russia 10.05
UK 10.01
USA 10.04
Name: Fossil Fuel Dependency (%), dtype: float64,
'Country With Max Value': Country
Australia 9496
Brazil 7327
Canada 7775
China 5095
Germany 7194
India 9004
Japan 3191
Russia 8683
UK 140
USA 2502
Name: Fossil Fuel Dependency (%), dtype: int64,
'Maximum Value ': Country
Australia 10.03
Brazil 10.03
Canada 10.02
China 10.11
Germany 10.04
India 10.04
Japan 10.07
Russia 10.05
UK 10.01
USA 10.04
Name: Fossil Fuel Dependency (%), dtype: float64}
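The cell above reports per-country minima and maxima. If the question instead wants a single highest and lowest country, one plausible reading is to rank countries by their average dependency and take `idxmax`/`idxmin` of that series; a sketch on synthetic data:

```python
import pandas as pd

# Synthetic stand-in for the real CSV
df = pd.DataFrame({
    'Country': ['USA', 'India', 'UK', 'UK', 'India'],
    'Fossil Fuel Dependency (%)': [78.0, 85.0, 40.0, 42.0, 88.0],
})

# Average dependency per country, then the countries at the extremes
avg = df.groupby('Country')['Fossil Fuel Dependency (%)'].mean()
highest = avg.idxmax()   # country with the highest average dependency
lowest = avg.idxmin()    # country with the lowest average dependency
print(highest, lowest)
```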
[100]: # Q3 : How has the share of renewable energy changed over time?
renewable_trend = data.groupby('Year')['Renewable Energy Share (%)'].mean()
plt.plot(renewable_trend.index, renewable_trend.values, marker='o', color='green')
plt.xlabel('Year')
plt.ylabel('Average Renewable Energy Share (%)')
plt.title('Trend of Renewable Energy Share Over Time')
plt.grid(True)
plt.show()
[102]: # Q4 : What are the top 5 countries with the highest carbon emissions?
data.groupby('Country')['Carbon Emissions (Million Tons)'].mean().nlargest(5)
[102]: Country
China 2596.863320
Australia 2580.429833
India 2544.816727
Brazil 2542.097661
UK 2540.094797
Name: Carbon Emissions (Million Tons), dtype: float64
[106]: # Q5 : What is the distribution of energy price index across different regions?
data.groupby('Country')['Energy Price Index (USD/kWh)'].mean().plot(kind='line', figsize=(10,6))
[116]: # Q6 : Is there a relationship between energy consumption and Energy Price Index?
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: x and y must be the same size
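The traceback ends in `ValueError: x and y must be the same size`, which `plt.scatter` raises when the two arrays passed to it have different lengths. Drawing both series from the same DataFrame guarantees matching lengths; a sketch on synthetic data (the headless `Agg` backend is only so the sketch runs anywhere):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless-safe backend for this sketch
import matplotlib.pyplot as plt

# Synthetic stand-in for the real CSV
df = pd.DataFrame({
    'Total Energy Consumption (TWh)': [1000.0, 4000.0, 7000.0, 9500.0],
    'Energy Price Index (USD/kWh)': [0.10, 0.14, 0.12, 0.17],
})

# Columns of one frame always have the same length, so scatter cannot
# raise the "x and y must be the same size" error
x = df['Total Energy Consumption (TWh)']
y = df['Energy Price Index (USD/kWh)']

plt.scatter(x, y)
plt.xlabel('Total Energy Consumption (TWh)')
plt.ylabel('Energy Price Index (USD/kWh)')
plt.title('Energy Consumption vs Energy Price Index')
```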