
Ds Pract 2 Vedanti

The document outlines an assignment for data wrangling involving the creation of an 'Academic performance' dataset for students. It includes tasks such as handling missing values, identifying and dealing with outliers, and applying data transformations. The document also provides code snippets demonstrating the use of Python libraries like pandas and seaborn for data manipulation and visualization.

In [1]: '''Assignment-2

Name : Shedage Vedanti Deepak


Class: TE(AI&DS)
Roll No:18
2) Data Wrangling II
Create an “Academic performance” dataset of students and perform the following operat
Python.
1. Scan all variables for missing values and inconsistencies. If there are missing va
inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the sui
to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for better
understanding of the variable, to convert a non-linear relation into a linear one, or
the skewness and convert the distribution into a normal distribution.'''
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]: dic = {
'Roll No': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'k', 'L'],
'Marathi': [99.0, 95.0, 85.0, 60.0, 98.0, np.nan, np.nan, 90.0, 81.0, 63.0, np.nan, 52.0],
'English': [np.nan, 84.0, 85.0, 65.0, 79.0, np.nan, 95.0, 91.0, np.nan, 93.0, np.nan, np.nan],
}

In [5]: df=pd.DataFrame(dic)

In [7]: df

Out[7]: Roll No Name Marathi English

0 1 A 99.0 NaN

1 2 B 95.0 84.0

2 3 C 85.0 85.0

3 4 D 60.0 65.0

4 5 E 98.0 79.0

5 6 F NaN NaN

6 7 G NaN 95.0

7 8 H 90.0 91.0

8 9 I 81.0 NaN

9 10 J 63.0 93.0

10 11 k NaN NaN

11 12 L 52.0 NaN

In [9]: df.head()
Out[9]: Roll No Name Marathi English

0 1 A 99.0 NaN

1 2 B 95.0 84.0

2 3 C 85.0 85.0

3 4 D 60.0 65.0

4 5 E 98.0 79.0

In [11]: df.tail()

Out[11]: Roll No Name Marathi English

7 8 H 90.0 91.0

8 9 I 81.0 NaN

9 10 J 63.0 93.0

10 11 k NaN NaN

11 12 L 52.0 NaN

In [13]: df.describe()

Out[13]: Roll No Marathi English

count 12.000000 9.000000 7.000000

mean 6.500000 80.333333 84.571429

std 3.605551 17.705931 10.293317

min 1.000000 52.000000 65.000000

25% 3.750000 63.000000 81.500000

50% 6.500000 85.000000 85.000000

75% 9.250000 95.000000 92.000000

max 12.000000 99.000000 95.000000

In [15]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Roll No 12 non-null int64
1 Name 12 non-null object
2 Marathi 9 non-null float64
3 English 7 non-null float64
dtypes: float64(2), int64(1), object(1)
memory usage: 512.0+ bytes
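`df.info()` shows the gaps only indirectly, through the non-null counts. A minimal sketch of a more direct scan for task 1 follows. The names `df_demo`, `missing_counts`, `rows_with_gaps` and `odd_names` are illustrative and not part of the original notebook, and treating the lowercase name 'k' as an inconsistency is an assumption about the intended naming convention.

```python
import numpy as np
import pandas as pd

# Same values as the notebook's DataFrame (see Out[7]).
df_demo = pd.DataFrame({
    'Roll No': list(range(1, 13)),
    'Name': list('ABCDEFGHIJkL'),
    'Marathi': [99.0, 95.0, 85.0, 60.0, 98.0, np.nan, np.nan,
                90.0, 81.0, 63.0, np.nan, 52.0],
    'English': [np.nan, 84.0, 85.0, 65.0, 79.0, np.nan, 95.0,
                91.0, np.nan, 93.0, np.nan, np.nan],
})

# Number of missing values in each column.
missing_counts = df_demo.isna().sum()
print(missing_counts)

# Rows in which at least one of the two marks is missing.
rows_with_gaps = df_demo[df_demo[['Marathi', 'English']].isna().any(axis=1)]
print(rows_with_gaps)

# Names that are not uppercase (assumed here to be an inconsistency).
odd_names = df_demo.loc[~df_demo['Name'].str.isupper(), 'Name']
print(odd_names)
```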

In [17]: # Preview only: with inplace=False, fillna returns a filled copy and df is unchanged
df['Marathi'].fillna(0,inplace=False)
Out[17]: 0 99.0
1 95.0
2 85.0
3 60.0
4 98.0
5 0.0
6 0.0
7 90.0
8 81.0
9 63.0
10 0.0
11 52.0
Name: Marathi, dtype: float64

In [19]: df

Out[19]: Roll No Name Marathi English

0 1 A 99.0 NaN

1 2 B 95.0 84.0

2 3 C 85.0 85.0

3 4 D 60.0 65.0

4 5 E 98.0 79.0

5 6 F NaN NaN

6 7 G NaN 95.0

7 8 H 90.0 91.0

8 9 I 81.0 NaN

9 10 J 63.0 93.0

10 11 k NaN NaN

11 12 L 52.0 NaN

In [21]: # Forward-fill gaps with the previous value; row 0 has no earlier value, so it stays NaN
df["English"] = df["English"].ffill()

In [23]: df

Out[23]: Roll No Name Marathi English

0 1 A 99.0 NaN

1 2 B 95.0 84.0

2 3 C 85.0 85.0

3 4 D 60.0 65.0

4 5 E 98.0 79.0

5 6 F NaN 79.0

6 7 G NaN 95.0

7 8 H 90.0 91.0

8 9 I 81.0 91.0

9 10 J 63.0 93.0

10 11 k NaN 93.0

11 12 L 52.0 93.0
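In the output above, row 0 of English is still NaN because a forward fill has no earlier value to copy. One possible way to cover a leading gap, sketched here on a standalone copy of the English column, is to follow the forward fill with a backward fill. The names `english`, `forward` and `both` are illustrative, and this is an alternative to the notebook's approach, not what the notebook does.

```python
import numpy as np
import pandas as pd

# English marks as they appear in Out[7].
english = pd.Series(
    [np.nan, 84.0, 85.0, 65.0, 79.0, np.nan, 95.0,
     91.0, np.nan, 93.0, np.nan, np.nan],
    name='English',
)

# Forward fill alone: the first entry has nothing before it and stays NaN.
forward = english.ffill()
print(forward.isna().sum())

# Forward fill followed by backward fill: the leading gap takes the next value.
both = english.ffill().bfill()
print(both)
```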

In [25]: df.dropna(inplace=True)

In [27]: df.isnull()

Out[27]: Roll No Name Marathi English

1 False False False False

2 False False False False

3 False False False False

4 False False False False

7 False False False False

8 False False False False

9 False False False False

11 False False False False

In [29]: df['Marathi'].mean()

Out[29]: 78.0

In [31]: # No NaN values remain after dropna, so this mean fill leaves df unchanged
df['Marathi'] = df['Marathi'].fillna(df['Marathi'].mean())

In [33]: df

Out[33]: Roll No Name Marathi English

1 2 B 95.0 84.0

2 3 C 85.0 85.0

3 4 D 60.0 65.0

4 5 E 98.0 79.0

7 8 H 90.0 91.0

8 9 I 81.0 91.0

9 10 J 63.0 93.0

11 12 L 52.0 93.0
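The mean of 78.0 above is computed after `dropna`, when no missing Marathi marks are left to fill. A sketch of mean imputation applied to the original column instead, so that no rows have to be dropped, could look like this. The names `marathi`, `col_mean` and `filled` are illustrative, and the resulting mean differs from 78.0 because it uses all nine recorded marks.

```python
import numpy as np
import pandas as pd

# Marathi marks as they appear in Out[7], before any rows are dropped.
marathi = pd.Series(
    [99.0, 95.0, 85.0, 60.0, 98.0, np.nan, np.nan,
     90.0, 81.0, 63.0, np.nan, 52.0],
    name='Marathi',
)

# Series.mean() skips NaN by default, so this averages the 9 recorded marks.
col_mean = marathi.mean()
print(col_mean)

# Replace each missing mark with that mean; all 12 rows are kept.
filled = marathi.fillna(col_mean)
print(filled)
```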

In [35]: df.isnull()

Out[35]: Roll No Name Marathi English

1 False False False False

2 False False False False

3 False False False False

4 False False False False

7 False False False False

8 False False False False

9 False False False False

11 False False False False

In [37]: import seaborn as sns

In [39]: sns.boxplot(y=df['Marathi'])
Out[39]: <Axes: ylabel='Marathi'>

In [41]: sns.boxplot(y=df['English'])

Out[41]: <Axes: ylabel='English'>

In [43]: # Quartiles and interquartile range (IQR) of the English marks
Q1 = df['English'].quantile(0.25)
Q3 = df['English'].quantile(0.75)
IQR = Q3 - Q1

In [45]: lower_limit = Q1 - 1.5 * IQR
upper_limit = Q3 + 1.5 * IQR

In [47]: IQR

Out[47]: 8.75

In [49]: lower_limit

Out[49]: 69.625

In [51]: upper_limit

Out[51]: 104.625
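The quartile and limit calculation is repeated for each subject, so it can be wrapped in a small helper. The function `iqr_limits` and its parameter `k` are hypothetical names introduced for this sketch; the marks are the English values left in `df` after `dropna` (see Out[33]).

```python
import pandas as pd


def iqr_limits(series, k=1.5):
    """Return (lower, upper) limits at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# English marks remaining after dropna.
english = pd.Series([84.0, 85.0, 65.0, 79.0, 91.0, 91.0, 93.0, 93.0])

low, high = iqr_limits(english)
print(low, high)

# Marks outside the limits are flagged as outliers.
outliers = english[(english < low) | (english > high)]
print(outliers)
```

With these marks the limits agree with Out[49] and Out[51], and the only flagged value is 65.0.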

In [53]: df.shape

Out[53]: (8, 4)

In [55]: # Remove outliers: keep rows whose English mark lies within the IQR limits
df_cleaned_English = df[
    ((df['English'] >= lower_limit) & (df['English'] <= upper_limit)) |
    (df['English'].isna())  # Keep NaN values
]

In [57]: print("\nCleaned DataFrame:")


print(df_cleaned_English)

Cleaned DataFrame:
Roll No Name Marathi English
1 2 B 95.0 84.0
2 3 C 85.0 85.0
4 5 E 98.0 79.0
7 8 H 90.0 91.0
8 9 I 81.0 91.0
9 10 J 63.0 93.0
11 12 L 52.0 93.0
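Removing the row is one way to deal with the outlier (the English mark of 65.0 in row 3). An alternative that keeps every student is to cap the value at the nearest limit. This is a sketch of that alternative, not the notebook's method; `english` and `capped` are illustrative names, and the limits are copied from Out[49] and Out[51].

```python
import pandas as pd

# English marks remaining after dropna, with the notebook's row labels.
english = pd.Series(
    [84.0, 85.0, 65.0, 79.0, 91.0, 91.0, 93.0, 93.0],
    index=[1, 2, 3, 4, 7, 8, 9, 11],
    name='English',
)

# Limits taken from Out[49] and Out[51].
lower_limit, upper_limit = 69.625, 104.625

# Values below the lower limit are raised to it; values above the upper
# limit would be lowered to it. Values inside the limits are unchanged.
capped = english.clip(lower=lower_limit, upper=upper_limit)
print(capped)
```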

In [59]: # Quartiles and IQR of the Marathi marks (this overwrites Q1, Q3 and IQR from the English step)
Q1 = df['Marathi'].quantile(0.25)
Q3 = df['Marathi'].quantile(0.75)
IQR = Q3 - Q1

In [61]: lower_limit_Marathi = Q1 - 1.5 * IQR
upper_limit_Marathi = Q3 + 1.5 * IQR

In [63]: IQR

Out[63]: 29.0

In [65]: lower_limit_Marathi

Out[65]: 18.75

In [67]: upper_limit_Marathi

Out[67]: 134.75

In [69]: df.shape

Out[69]: (8, 4)

In [71]: # Remove outliers: keep rows whose Marathi mark lies within the IQR limits
df_cleaned_Marathi = df[
    ((df['Marathi'] >= lower_limit_Marathi) & (df['Marathi'] <= upper_limit_Marathi)) |
    (df['Marathi'].isna())  # Keep NaN values
]

In [73]: print("\nCleaned DataFrame:")


print(df_cleaned_Marathi)

Cleaned DataFrame:
Roll No Name Marathi English
1 2 B 95.0 84.0
2 3 C 85.0 85.0
3 4 D 60.0 65.0
4 5 E 98.0 79.0
7 8 H 90.0 91.0
8 9 I 81.0 91.0
9 10 J 63.0 93.0
11 12 L 52.0 93.0

In [75]: # Box plots of all numeric columns side by side (seaborn was imported in In [37])
sns.boxplot(data=df)

Out[75]: <Axes: >
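The cells above cover tasks 1 and 2 of the assignment. For task 3, one simple option is min-max scaling, which changes the scale of a variable to the range 0 to 1. The sketch below applies it to the Marathi marks remaining after `dropna`; the names `marathi` and `scaled` are illustrative, and the choice of min-max scaling is an assumption, since the notebook does not show which transformation was intended.

```python
import pandas as pd

# Marathi marks remaining after dropna, with the notebook's row labels.
marathi = pd.Series(
    [95.0, 85.0, 60.0, 98.0, 90.0, 81.0, 63.0, 52.0],
    index=[1, 2, 3, 4, 7, 8, 9, 11],
    name='Marathi',
)

# Min-max scaling: the lowest mark maps to 0 and the highest to 1.
scaled = (marathi - marathi.min()) / (marathi.max() - marathi.min())
print(scaled.round(3))
```

Scaling of this kind changes only the units, not the shape of the distribution; reducing skewness would call for a different transformation.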
