Data Science
Assignment 1: SET-A
Q.1: Write a python program to create a data frame containing columns named Name, Age,
and Percentage. Add 10 rows to the data frame. View the data frame.
import pandas as pd
# create an empty DataFrame with the required columns, then add 10 rows with .loc
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['subhan',20,78]
df.loc[1]=['tofik',22,45]
df.loc[2]=['rayyan',21,15]
df.loc[3]=['sharif',21,65]
df.loc[4]=['alim',88,99]
df.loc[5]=['shoib',18,97]
df.loc[6]=['danish',19,49]
df.loc[7]=['mustakim',25,6]
df.loc[8]=['mosin',20,78]
df.loc[9]=['arbaz',22,15]
print(df)
Output:
        Name Age Percentage
0     subhan 20          78
1      tofik 22          45
2     rayyan 21          15
3     sharif 21          65
4       alim 88          99
5      shoib 18          97
6     danish 19          49
7   mustakim 25           6
8      mosin 20          78
9      arbaz 22          15
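The next Output block (shape, size, data types, column index, and the <bound method ...> line) appears without its code in the original; a minimal sketch of what likely produced it, assuming the same df built in Q.1 (note that print(df.describe) without parentheses prints the bound method rather than the summary table):
print(df.shape)      # (rows, columns) -> (10, 3)
print(df.size)       # total number of cells -> 30
print(df.dtypes)     # data type of each column
print(df.columns)    # the column index
print(df.describe)   # missing (): prints the bound method, not the statistics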
Output:
(10, 3)
30
Name          object
Age           object
Percentage    object
dtype: object
Index(['Name', 'Age', 'Percentage'], dtype='object')
<bound method NDFrame.describe of        Name Age Percentage
0       alim 20         78
1     tofik 22          45
2    rayyan 21          15
3    sharif 21          65
4    subhan 20          12
5     shoib 18          97
6    danish 19          49
7 mustakim 25            6
8     mosin 20          78
9     arbaz 22          15>
import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['subhan',20,78]
df.loc[1]=['tofik',22,45]
df.loc[2]=['rayyan',21,15]
df.loc[3]=['sharif',21,65]
df.loc[4]=['alim',20,12]
df.loc[5]=['shoib',18,97]
df.loc[6]=['danish',19,49]
df.loc[7]=['mustakim',25,6]
df.loc[8]=['mosin',20,78]
df.loc[9]=['arbaz',22,15]
print(df.describe())
Output:
             Name    Age   Percentage
count          10     10           10
unique         10      6            8
top         arbaz     20           15
freq            1      3            2
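Because every column of the manually built DataFrame has object dtype, describe() reports count, unique, top, and freq rather than numeric statistics such as mean and std.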
Q.4: Write a python program to add 5 rows with duplicate values and missing values. Add
a column Remark with empty values.
import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['alim',20,78]
df.loc[1]=['tofik',22,45]
df.loc[2]=['subhan',21,15]
df.loc[3]=[None,21,65]
df.loc[4]=['subhan',20,12]
df['Remarks']=None
print(df)
Output:
      Name Age Percentage Remarks
0     alim 20          78    None
1    tofik 22          45    None
2   subhan 21          15    None
3      NaN 21          65    None
4   subhan 20          12    None
Q.5: Write a python program to get the number of observations, missing values and
duplicate values.
import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['alim',20,15]
df.loc[1]=['tofik',22,45]
df.loc[2]=['subhan',20,15]
df.loc[3]=[None,21,65]
df.loc[4]=['subhan',20,15]
print(df['Name'].size)   # number of observations
missing=df.isnull()      # True where a value is missing
print(missing)
dup=df.duplicated()      # True for rows that repeat an earlier row
print(dup)
Output:
5
    Name    Age  Percentage
0  False  False       False
1  False  False       False
2  False  False       False
3   True  False       False
4  False  False       False
0    False
1    False
2    False
3    False
4     True
dtype: bool
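duplicated() keeps the first occurrence by default, so only row 4, which repeats row 2's values, is flagged as a duplicate.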
Q.6: Write a python program to drop the 'Remark' column from the dataframe. Also drop all
null and empty values. Print the modified data.
import pandas as pd
df=pd.DataFrame(columns=['Name','Age','Percentage'])
df.loc[0]=['ALIM',20,15]
df.loc[1]=['SHOAIB',22,45]
df.loc[2]=['SHARIF',20,15]
df.loc[3]=[None,21,65]
df.loc[4]=['ALIM',20,15]
df['Remarks']=None
print(df)
df.drop(labels=['Remarks'],axis=1,inplace=True)   # drop the Remarks column
print(df)
df.dropna(axis=0,inplace=True)                    # drop rows containing null values
print(df)
Output:
      Name Age Percentage Remarks
0     ALIM 20          15    None
1   SHOAIB 22          45    None
2   SHARIF 20          15    None
3      NaN 21          65    None
4     ALIM 20          15    None
      Name Age Percentage
0     ALIM 20          15
1   SHOAIB 22          45
2   SHARIF 20          15
3      NaN 21          65
4     ALIM 20          15
      Name Age Percentage
0     ALIM 20          15
1   SHOAIB 22          45
2   SHARIF 20          15
4     ALIM 20          15
import pandas as pd
import matplotlib.pyplot as plt
df=pd.DataFrame(columns=['Name','age','percentage'])
df.loc[0]=['kashish',19,95]
df.loc[1]=['Ramiza',20,91]
df.loc[2]=['naki',7,90]
df.loc[3]=['Faisal',18,85]
df.loc[4]=['Aman',23,80]
df.loc[5]=['Anas',24,75]
df.loc[6]=['Fazil',21,70]
df.loc[7]=['Mustaqim',22,65]
df.loc[8]=['Alfiya',20,89]
df.loc[9]=['Aqsa',21,86]
print(df)
df.plot(x="Name",y="percentage")
plt.title('Line plot name vs percentage')
plt.xlabel('name of student')
plt.ylabel('percentage')
plt.show()
print(df)
Output:
        Name age percentage
0    kashish 19          95
1     Ramiza 20          91
2       naki   7         90
3     Faisal 18          85
4       Aman 23          80
5       Anas 24          75
6      Fazil 21          70
7   Mustaqim 22          65
8     Alfiya 20          89
9       Aqsa 21          86
import pandas as pd
import matplotlib.pyplot as plt
df=pd.DataFrame(columns=['Name','age','percentage'])
df.loc[0]=['kashish',19,95]
df.loc[1]=['Ramiza',20,91]
df.loc[2]=['naki',7,90]
df.loc[3]=['Faisal',18,85]
df.loc[4]=['Aman',23,80]
df.loc[5]=['Anas',24,75]
df.loc[6]=['Fazil',21,70]
df.loc[7]=['Mustaqim',22,65]
df.loc[8]=['Alfiya',20,89]
df.loc[9]=['Aqsa',21,86]
print(df)
plt.scatter(x=df["Name"],y=df["percentage"])
plt.title('Scatter plot name vs percentage')
plt.xlabel('name of student')
plt.ylabel('percentage')
plt.show()
print(df)
Output:
        Name age percentage
0    kashish 19          95
1     Ramiza 20          91
2       naki   7         90
3     Faisal 18          85
4       Aman 23          80
5       Anas 24          75
6      Fazil 21          70
7   Mustaqim 22          65
8     Alfiya 20          89
9       Aqsa 21          86
                                       Assignment 1: SET-B
Q1) Download the heights and weights dataset and load the dataset from the given csv file
into a dataframe. Print the first 10 rows, last 10 rows and 20 random rows.
import pandas as pd
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
print('\nfirst 10 rows')
print(df.head(10))
print('\nlast 10 rows')
print(df.tail(10))
print('\nrandom 20 rows')
print(df.sample(20))   # 20 randomly sampled rows
Output:
first 10 rows
   Index Height(Inches) Weight(Pounds)
0      1        65.78331      112.9925
1      2        71.51521      136.4873
2      3        69.39874      153.0269
3      4        68.21660      142.3354
4      5        67.78781      144.2971
5      6        68.69784      123.3024
6      7        69.80204      141.4947
7      8        70.01472      136.4623
8      9        67.90265      112.3723
9     10        66.78236      120.6672
last 10 rows
       Index Height(Inches) Weight(Pounds)
24990 24991         69.97767      125.3672
24991 24992         71.91656      128.2840
24992 24993         70.96218      146.1936
24993 24994         66.19462      118.7974
24994 24995         67.21126      127.6603
24995 24996         69.50215      118.0312
24996 24997         64.54826      120.1932
24997 24998         64.69855      118.2655
24998 24999         67.52918      132.2682
24999 25000         68.87761      124.8742
random 20 rows
       Index  Height(Inches)  Weight(Pounds)
18515  18516        71.15912        143.7729
3550    3551        65.95300        130.0755
16400  16401        67.20032        130.9151
10718  10719        70.79804        125.7816
4830    4831        66.24238        121.9611
9121    9122        68.05361        137.0546
11516  11517        68.67632        115.3375
3126    3127        67.59507        126.1888
15670  15671        68.55083        137.6187
20293  20294        65.96939        139.4453
5842    5843        68.92916        129.1092
21409  21410        69.27081        124.6497
15365  15366        67.18395        137.6251
16889  16890        69.07788        131.0112
12382  12383        69.50005        135.9850
14220  14221        68.49492        132.0698
2701    2702        69.25709        142.5795
4578    4579        68.91069        103.7011
11468  11469        65.93881        125.8178
19312  19313        68.47651        117.6580
Q2) Write a python program to find the shape, size, and datatypes of the dataframe object.
import pandas as pd
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
print('\n shape of dataframe',df.shape)
print('\n size of dataframe',df.size)
print('\n datatype of dataframe',df.dtypes)
Output:
 shape of dataframe (25000, 3)

 size of dataframe 75000

 datatype of dataframe Index              int64
Height(Inches)    float64
Weight(Pounds)    float64
dtype: object
Q3) Write a python program to view basic statistical details of the data.
import pandas as pd
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
print("/n basic statistical details of a data:/n",df.describe())
Output:
 basic statistical details of the data:
               Index Height(Inches) Weight(Pounds)
count 25000.000000     25000.000000    25000.000000
mean   12500.500000       67.993114      127.079421
std     7217.022701        1.901679       11.660898
min        1.000000       60.278360       78.014760
25%     6250.750000       66.704397      119.308675
50%    12500.500000       67.995700      127.157750
75%    18750.250000       69.272958      134.892850
max    25000.000000       75.152800      170.924000
Q4) Write a python program to get the number of observations, missing values and nan
values.
import pandas as pd
import numpy as np
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
print('\n number of observations:',df['Index'].size)
missing=df.isnull()                                 # boolean mask: True where a value is missing
print('\n missing values:',missing.values.sum())    # total count of missing cells
nan_values=np.isnan(df)                             # NaN mask over the numeric columns
print('\n nan values:',nan_values.size)             # .size counts every cell (25000 rows x 3 columns)
Output:
 number of observations: 25000
 missing values: 0
 nan values: 75000
Q5) Write a python program to add a column “BMI” to the dataframe, which is
calculated as: weight/height^2.
import pandas as pd
data=pd.read_csv('HeightWeight.csv')
df=pd.DataFrame(data)
df['BMI']=(df['Weight(Pounds)']/df['Height(Inches)']**2)
print("after adding colom/n",df)
Output:
after adding column
        Index  Height(Inches)  Weight(Pounds)       BMI
0           1        65.78331        112.9925  2.950311
1           2        71.51521        136.4873  3.642400
2           3        69.39874        153.0269  4.862195
3           4        68.21660        142.3354  4.353572
4           5        67.78781        144.2971  4.531187
...       ...             ...             ...       ...
24995   24996        69.50215        118.0312  2.884013
24996   24997        64.54826        120.1932  3.467294
24997   24998        64.69855        118.2655  3.341389
24998   24999        67.52918        132.2682  3.836436
24999   25000        68.87761        124.8742  3.286921
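The code for the maximum-BMI output that follows is not shown in the original; a minimal sketch, assuming the df with the BMI column added above:
print("Maximum of BMI = ",df['BMI'].max())   # largest value in the BMI column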
Output:
 Maximum of BMI = 5.933879009339526
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('HeightWeight.csv')
df = pd.DataFrame(data)
plt.scatter(x=df['Height(Inches)'],y=df['Weight(Pounds)'],c='blue')
plt.title("Scatter Plot")
plt.xlabel("Height(Inches)")
plt.ylabel("Weight(Pounds)")
plt.show()
Output:
(A scatter plot of Height(Inches) versus Weight(Pounds) is displayed.)
                                        Assignment 2: SET-A
Q1) Create an array using numpy and display mean and median.
import numpy as np
demo = np.array([[30,75,70],[80,90,20],[50,95,60]])
print(demo)
print('\n')
print(np.mean(demo))     # mean of all elements
print('\n')
print(np.median(demo))   # median of all elements
print('\n')
Output:
[[30 75 70]
 [80 90 20]
 [50 95 60]]

63.333333333333336

70.0
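The Output block below (a column-wise sum) appears without its code in the original; a minimal sketch, assuming the same dictionary used in the describe() example that follows it (df.sum() concatenates the string column and adds the numeric ones):
import pandas as pd
md={'Name':pd.Series(['Ram','Sham','Meena','Seeta','Geeta','Rakesh','Madhav']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df=pd.DataFrame(md)
print(df.sum())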
Output::
      Name  RamShamMeenaSeetaGeetaRakeshMadhav
Age                                          181
Rating                                     25.61
dtype: object
import pandas as pd
import numpy as np
md={'Name':pd.Series(['Ram','Sham','Meena','Seeta','Geeta','Rakesh','Madhav']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df=pd.DataFrame(md)
print(df.describe())
Output:
       Age     Rating
count      7.000000 7.000000
mean      25.857143 3.658571
std        2.734262 0.698628
min       23.000000 2.560000
25%       24.000000 3.220000
50%       25.000000 3.800000
75%       27.500000 4.105000
max       30.000000 4.600000
import numpy as np
data=np.array([13,52,44,32,30,0,36,45])
print("Standard Deviation of sample is %s"%(np.std(data)))
Output::
Standard Deviation of sample is 16.263455967290593
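np.std uses the population formula (dividing by n rather than n - 1), so the printed value is sqrt(2116 / 8) = sqrt(264.5) ≈ 16.26.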
Virat    Rohit
92       89
97       87
85       67
74       55
71       47
55       72
85       76
63       79
42       44
32       92
71       99
55       47
import pandas as pd
import scipy.stats as s
score={'Virat':[92,97,85,74,71,55,85,63,42,32,71,55],'Rohit':
[89,87,67,55,47,72,76,79,44,92,99,47]}
df=pd.DataFrame(score)
print(df)
print("\nArithmetic Mean Values")
print("Score 1",s.tmean(df["Virat"]).round(2))
print("Score 2",s.tmean(df["Rohit"]).round(2))
Output:
     Virat Rohit
0        92    89
1        97    87
2        85    67
3        74    55
4        71    47
5        55    72
6        85    76
7        63    79
8        42    44
9        32    92
10       71    99
11       55    47
import numpy as np
mydata=np.array([24,29,20,22,24,26,27,30,20,31,26,38,44,47])
q3,q1=np.percentile(mydata,[75,25])
iqrvalue=q3-q1
print(iqrvalue)
Output:
6.75
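With the data sorted (20, 20, 22, 24, 24, 26, 26, 27, 29, 30, 31, 38, 44, 47), np.percentile interpolates to Q1 = 24 and Q3 = 30.75, so the interquartile range is 30.75 - 24 = 6.75, matching the output.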
Q7) Write a python program to find the maximum value of a given flattened array:
import numpy as np
arr=np.array([[25,26,45],[12,36,42],[8,50,65]])
print("\n Original flattened Array:\n",arr)
flat=arr.flatten()                                   # flatten the 2-D array into 1-D
max_val=np.max(flat)
print("\n Maximum value of flattened array:\n",max_val)
min_val=np.min(flat)
print("\n Minimum value of flattened array:\n",min_val)
Output:
 Original flattened Array:
 [[25 26 45]
 [12 36 42]
 [ 8 50 65]]
Q8) Write a python program to compute Euclidean Distance between two data points in a
dataset.
import numpy as np
point1= np.array((1,2,3))
point2= np.array((1,1,1))
dist = np.linalg.norm(point1 - point2)
print("Euclidian Distance between two pints: ",dist)
Output:
Euclidean Distance between two points:  2.23606797749979
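Here np.linalg.norm computes sqrt((1-1)^2 + (2-1)^2 + (3-1)^2) = sqrt(5) ≈ 2.236, which is the distance printed above.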
Q.9: Create a dataframe of data values. Find out the mean, range, and IQR for this data.
import pandas as pd
import numpy as np
import scipy.stats as s
d = [32,36,46,47,56,69,75,79,79,88,89,91,92,93,96,97,101,105,112,116]
data=pd.DataFrame(d)
print("\n DataFrame:\n",data)
print("\n Mean of dataframe : ",s.tmean(data))
data_range = np.max(data)-np.min(data)
print("\n Range of dataframe : ",data_range)
Q1 = np.median(data[:10])   # median of the lower half of the data
Q3 = np.median(data[10:])   # median of the upper half of the data
IQR = Q3 - Q1
print("\n Inter Quartile Range (IQR) of dataframe : ",IQR)
Output::
 DataFrame:
         0
0     32
1     36
2     46
3     47
4     56
5     69
6     75
7     79
8     79
9     88
10    89
11    91
12    92
13    93
14    96
15    97
16   101
17   105
18   112
19   116
 Range of dataframe :            0      84
dtype: int64
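The question, data, and code for the Manhattan-distance output that follows are not included in the original; a minimal sketch of the approach with hypothetical points (so the printed total will not match the 18 shown below):
import numpy as np
# hypothetical 2-D points; the original data set is not shown
points = np.array([[1, 2], [3, 4], [5, 6]])
total = 0
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        # Manhattan (L1) distance between point i and point j
        total += np.abs(points[i] - points[j]).sum()
print("sum of Manhattan distance between all pairs of points =", total)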
OUTPUT:
 sum of Manhattan distance between all pairs of points is = 18
Q.12: Create a Dataframe for students' information such as name, graduation percentage
and age. Display the average age of students and the average graduation percentage. Also
describe all basic statistics of the data. (Hint: use describe().)
import pandas as pd
import scipy.stats as s
data={'Name':['sharif','shoaib','nafisa','alim'],'Age':[20,22,23,21],
    'perc':[65.2,78.4,78.6,74.5]}
df=pd.DataFrame(data)
print("\n Average Age :",sum(df['Age']/len(df['Age'])))
print("\n Average Percentage : ",sum(df['perc']/len(df['perc'])))
print("\n Basic Stastistics of data :\n",df.describe())
Output:
 Average Age : 21.5
 Average Percentage :          74.17500000000001
Q.11: Write a Numpy program to compute the histogram of nums against bins.
Sample Output:
nums:[0.5 0.7 1.0 1.2 1.3 2.1]
bins:[0 1 2 3].
import numpy as np
import matplotlib.pyplot as plt
nums = np.array([0.5,0.7,1.0,1.2,1.3,2.1])
bins = np.array([0,1,2,3])
print("nums: ",nums)
print("bins: ",bins)
print("Result:", np.histogram(nums, bins))
plt.hist(nums, bins=bins)
plt.show()
Output:
nums: [0.5 0.7 1. 1.2 1.3 2.1]
bins: [0 1 2 3]
Result: (array([2, 3, 1]), array([0, 1, 2, 3]))
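The counts follow directly from the bin edges: 0.5 and 0.7 fall in [0, 1), 1.0, 1.2 and 1.3 fall in [1, 2), and 2.1 falls in [2, 3], giving the array [2, 3, 1].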
                   Assignment 2: SET-B
Q1) Download the iris dataset file. Read this csv file using the read_csv() function. Take a sample
from the entire dataset. Display the maximum and minimum values of all numeric attributes.
import pandas as pd
data=pd.read_csv("Iris.csv")
df=pd.DataFrame(data)
sample=df.sample()
print("\n Sample from Dataset:\n",sample)
print("\n Maximum of sepal Length:",max(df['SepalLengthCm']))
print("\n Minimum of sepal Length:",min(df['SepalLengthCm']))
print("\n Maximum of sepal Width:",max(df['SepalWidthCm']))
print("\n Minimum of sepal Width:",min(df['SepalWidthCm']))
Output:
Q.2: Continue with the above dataset; find the number of records for each distinct value of the class
attribute. Consider the entire dataset and not the sample.
import pandas as pd
data=pd.read_csv("Iris.csv")
df=pd.DataFrame(data)
print("\n DataFrame:\n",df)
cnt=df['Species'].value_counts()
print("\n number of records for each distinct value of species attribute:\n",cnt)
Output:
Q.3: Display column-wise mean and median for the iris dataset (Hint: use the mean() and
median() functions of the pandas dataframe).
import pandas as pd
import scipy.stats as s
import statistics as st
data=pd.read_csv('Iris.csv')
df=pd.DataFrame(data)
print("\n DataFrame :\n",df)
print("\n Mean of sepal Length:", s.tmean(df['SepalLengthCm']))
print("\n Median of sepal Length:", st.median(df['SepalLengthCm']))
print("\n Mean of sepal Width:", s.tmean(df['SepalWidthCm']))
print("\n Median of sepal Width:", st.median(df['SepalWidthCm']))
OUTPUT:
 DataFrame :
        Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  \
0        1            5.1           3.5            1.4           0.2
1        2            4.9           3.0            1.4           0.2
2        3            4.7           3.2            1.3           0.2
3        4            4.6           3.1            1.5           0.2
4        5            5.0           3.6            1.4           0.2
..     ...            ...           ...            ...           ...
145    146            6.7           3.0            5.2           2.3
146    147            6.3           2.5            5.0           1.9
147    148            6.5           3.0            5.2           2.0
148    149            6.2           3.4            5.4           2.3
149    150            5.9           3.0            5.1           1.8
             Species
0        Iris-setosa
1        Iris-setosa
2        Iris-setosa
3        Iris-setosa
4        Iris-setosa
..               ...
145   Iris-virginica
146   Iris-virginica
147   Iris-virginica
148   Iris-virginica
149   Iris-virginica
import pandas as pd
data=pd.read_csv("Data.csv")
df=pd.DataFrame(data)
print("\n Describing Dataset:\n",df.describe())
print("\n Shape of Dataset:\n",df.shape)
print("\n First three rows of Dataset:\n",df.head(3))
OUTPUT:
 Describing Dataset:
                 Age            Salary
count      9.000000          9.000000
mean      38.777778      63777.777778
std        7.693793      12265.579662
min       27.000000      48000.000000
25%       35.000000      54000.000000
50%       38.000000      61000.000000
75%       44.000000      72000.000000
max       50.000000      83000.000000
 Shape of Dataset:
 (10, 4)
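The describe() count of 9 (against the 10 rows reported by shape) reflects the one missing Age value and one missing Salary value visible in the Data.csv listing shown in the next output.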
import pandas as pd
data=pd.read_csv("Data.csv")
df=pd.DataFrame(data)
print("\n Displaying Dataset:\n",df)
# replace missing values with the column mean
df['Salary']=df['Salary'].fillna(df['Salary'].mean())
df['Age']=df['Age'].fillna(df['Age'].mean())
print("\n ****** Modified Dataset ******\n",df)
OUTPUT:
 Displaying Dataset:
     Country       Age      Salary Purchased
0    France      44.0     72000.0        No
1     Spain      27.0     48000.0       Yes
2   Germany      30.0     54000.0        No
3     Spain      38.0     61000.0        No
4   Germany      40.0         NaN       Yes
5    France      35.0     58000.0       Yes
6     Spain       NaN     52000.0        No
7    France      48.0     79000.0       Yes
8   Germany      50.0     83000.0        No
9    France      37.0     67000.0       Yes
Q3) Data.csv has two categorical columns (the Country column and the Purchased
column).
a. Apply one-hot encoding on the Country column.
b. Apply label encoding on the Purchased column.
import pandas as pd
from sklearn import preprocessing
data=pd.read_csv("Data.csv")
df=pd.DataFrame(data)
print("\n Describing Dataset:\n",df)
one_hot_encoded_data = pd.get_dummies(data, columns = ['Country'])
print("\n *******After applying OneHot coding on Country*******\n",one_hot_encoded_data)
label_encoder = preprocessing.LabelEncoder()
df['Purchased']= label_encoder.fit_transform(df['Purchased'])
df['Purchased'].unique()
print("\n *******After applying OneHot coding on Country**********\n",df)
OUTPUT:
 Describing Dataset:
     Country     Age     Salary Purchased
0    France    44.0    72000.0        No
1     Spain    27.0    48000.0       Yes
2   Germany    30.0    54000.0        No
3     Spain    38.0    61000.0        No
4   Germany    40.0        NaN       Yes
5    France    35.0    58000.0       Yes
6     Spain     NaN    52000.0        No
7    France    48.0    79000.0       Yes
8   Germany    50.0    83000.0        No
9    France    37.0    67000.0       Yes
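For reference, pd.get_dummies replaces Country with indicator columns (Country_France, Country_Germany, Country_Spain), while LabelEncoder assigns integer codes alphabetically, so Purchased becomes 0 for No and 1 for Yes.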
import pandas as pd
data=pd.read_csv("winequality_red.csv")
df=pd.DataFrame(data)
print("\n Dataset is : \n",df)
OUTPUT:
    Dataset is :
         fixed acidity      volatile acidity   citric acid   residual sugar    chlorides   \
0                 7.4                 0.700          0.00              1.9        0.076
1                 7.8                 0.880          0.00              2.6        0.098
2                 7.8                 0.760          0.04              2.3        0.092
3                11.2                 0.280          0.56              1.9        0.075
4                 7.4                 0.700          0.00              1.9        0.076
...               ...                   ...           ...              ...          ...
1594              6.2                 0.600          0.08              2.0        0.090
1595              5.9                 0.550          0.10              2.2        0.062
1596              6.3                 0.510          0.13              2.3        0.076
1597              5.9                 0.645          0.12              2.0        0.075
1598              6.0                 0.310          0.47              3.6        0.067
        alcohol    quality
0           9.4          5
1           9.8          5
2           9.8          5
3           9.8          6
4           9.4          5
...         ...         ...
1594       10.5           5
1595       11.2           6
1596       11.0           6
1597       10.2           5
1598       11.0           6
OUTPUT:
       alcohol     quality
0          9.4           5
1          9.8           5
2          9.8           5
3          9.8           6
4          9.4           5
...        ...         ...
1594      10.5           5
1595      11.2           6
1596       11.0           6
1597       10.2           5
1598       11.0           6
Q.3:) Standardizing Data (transform them into a standard Gaussian distribution with a mean
of 0 and a standard deviation of 1).
import pandas as pd
from sklearn import preprocessing
data=pd.read_csv("winequality-red.csv",sep=';')
df=pd.DataFrame(data)
print("\n Dataset is : \n",df)
standard = preprocessing.scale(df)
print("\n *********Standardized Data*********\n",standard)
OUTPUT:
  Dataset is :
        fixed acidity      volatile acidity       citric acid    residual sugar    chlorides   \
0                7.4                 0.700              0.00               1.9        0.076
1                7.8                 0.880              0.00               2.6        0.098
2                7.8                 0.760              0.04               2.3        0.092
3               11.2                 0.280              0.56               1.9        0.075
4                7.4                 0.700              0.00               1.9        0.076
...              ...                   ...               ...               ...          ...
1594             6.2                 0.600              0.08               2.0        0.090
1595             5.9                 0.550              0.10               2.2        0.062
1596             6.3                 0.510              0.13               2.3        0.076
1597             5.9                 0.645              0.12               2.0        0.075
1598             6.0                 0.310              0.47               3.6        0.067
       alcohol    quality
0          9.4          5
1          9.8          5
2          9.8          5
3          9.8          6
4          9.4          5
...        ...        ...
1594      10.5          5
1595      11.2          6
1596      11.0          6
1597      10.2          5
1598      11.0          6
 *********Standardized Data*********
 [[-0.52835961 0.96187667 -1.39147228 ... -0.57920652 -0.96024611
  -0.78782264]
 [-0.29854743 1.96744245 -1.39147228 ... 0.1289504 -0.58477711
  -0.78782264]
 [-0.29854743 1.29706527 -1.18607043 ... -0.04808883 -0.58477711
  -0.78782264]
 ...
 [-1.1603431 -0.09955388 -0.72391627 ... 0.54204194 0.54162988
   0.45084835]
 [-1.39015528 0.65462046 -0.77526673 ... 0.30598963 -0.20930812
  -0.78782264]
 [-1.33270223 -1.21684919 1.02199944 ... 0.01092425 0.54162988
   0.45084835]]
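Each column is standardized as z = (x - mean) / std. For example, the first fixed acidity value of 7.4, with a column mean of roughly 8.32 and a standard deviation of roughly 1.74, gives (7.4 - 8.32) / 1.74 ≈ -0.53, which matches the first entry of the standardized array (the column statistics quoted here are approximate).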
Q.4:) Normalizing Data (rescale each observation to a length of 1 (a unit norm). For this,
use the Normalizer class.)
import pandas as pd
import numpy as np
from sklearn import preprocessing
data=pd.read_csv("winequality-red.csv",sep=';')
df=pd.DataFrame(data)
print("\n Dataset is : \n",df)
normalized = preprocessing.normalize(df,norm='l2')
print("\n***********Normalized Data***********\n",normalized)
OUTPUT:
 Dataset is :
        fixed acidity     volatile acidity      citric acid     residual sugar   chlorides   \
0                7.4                0.700             0.00                1.9       0.076
1                7.8                0.880             0.00                2.6       0.098
2                7.8                0.760             0.04                2.3       0.092
3               11.2                0.280             0.56                1.9       0.075
4                7.4                0.700             0.00                1.9       0.076
...              ...                  ...              ...                ...         ...
1594             6.2                0.600             0.08                2.0       0.090
1595             5.9                0.550             0.10                2.2       0.062
1596             6.3                0.510             0.13                2.3       0.076
1597             5.9                0.645             0.12                2.0       0.075
1598             6.0                0.310             0.47                3.6       0.067
        alcohol      quality
0           9.4            5
1           9.8            5
2           9.8            5
3           9.8            6
4           9.4            5
...         ...          ...
1594       10.5            5
1595       11.2            6
1596       11.0            6
1597       10.2            5
1598       11.0            6
***********Normalized Data***********
 [[0.19347777 0.01830195 0.         ... 0.01464156 0.24576906 0.13072822]
 [0.10698874 0.01207052 0.         ... 0.00932722 0.13442175 0.06858252]
 [0.13494887 0.01314886 0.00069205 ... 0.01124574 0.16955114 0.08650569]
 ...
 [0.1222319 0.00989496 0.00252225 ... 0.01455142 0.21342078 0.11641133]
 [0.10524769 0.01150589 0.00214063 ... 0.0126654 0.18195363 0.08919296]
 [0.12491328 0.00645385 0.00978487 ... 0.01374046 0.22900768 0.12491328]]
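With norm='l2', every row is divided by its own Euclidean length, so each transformed row has unit length; for instance, dividing the first row's fixed acidity value of 7.4 by that row's length yields the 0.1934... seen as the first entry above.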
5. Binarizing Data using the Binarizer class (using a binary threshold, it is possible to
transform our data by marking the values above it as 1 and those equal to or below it as 0).
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import Binarizer
data_set = pd.read_csv('Data.csv')
data_set.head()
age = data_set['Age'].values
salary = data_set['Salary'].values
print("\n Original age data values : \n",age)
print("\n Original salary data values : \n",salary)
# note: if Salary contains NaN (as in the earlier Data.csv listing), fill it before binarizing
x = age.reshape(1,-1)
y = salary.reshape(1,-1)
binarizer_age = Binarizer(threshold=35)         # ages above 35 become 1, the rest 0
binarizer_salary = Binarizer(threshold=61000)   # salaries above 61000 become 1, the rest 0
print("\n Binarized age : \n",binarizer_age.fit_transform(x))
print("\n Binarized salary : \n",binarizer_salary.fit_transform(y))
OUTPUT:
Original age data values :
 [44 27 30 38 40 35 58 48 50 37]
 Binarized age :
 [[1 0 0 1 1 0 1 1 1 1]]
 Binarized salary :
 [[1 0 0 0 0 0 0 1 1 1]]
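Binarizer maps values strictly greater than the threshold to 1 and values less than or equal to it to 0, which is why the age of exactly 35 and the salary of exactly 61000 both stay 0.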