## MACHINE LEARNING
i) Supervised Learning
ii) Unsupervised Learning
iii) Reinforcement Learning
i) Supervised Learning
a) Regression Learning
b) Classification Learning
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv(r"D:\data set\video\loan.csv")
dataset.head(3)
dataset.shape
dataset.shape[0]
dataset.isnull()
dataset.isnull().sum()
dataset.isnull().sum().sum()
dataset.notnull().sum()
(dataset.isnull().sum().sum()/(dataset.shape[0]*dataset.shape[1]))*100 # overall missing data in percentage
(dataset.isnull().sum()/dataset.shape[0])*100
sns.heatmap(dataset.isnull())
plt.show()
dataset.drop(columns=["Credit_History"]) # drop a column by name
dataset.shape
dataset.dropna() # drops every row that contains a missing value
dataset.dropna(inplace=True) # with inplace=True the rows are deleted in the original dataset
dataset
dataset.isnull().sum()
sns.heatmap(dataset.isnull())
plt.show()
dataset.shape
# Handling Missing Values
dataset.fillna(10) # wrong way to fill the data
dataset.fillna(10).head(10)
dataset.info()
dataset.fillna(method="bfill") # backward filling
dataset.fillna(method="ffill",axis=1) # axis=0 means column wise or axis = 1 means
row wise filling
dataset["Gender"].mode()[0]
dataset["Gender"].fillna(dataset["Gender"].mode()[0])
dataset["Gender"].fillna(dataset["Gender"].mode()[0],inplace=True) # To fixed data
in original dataset (perticular column
dataset.select_dtypes(include="object")
dataset.select_dtypes(include="object").isnull().sum()
dataset.select_dtypes(include="object").columns
for i in dataset.select_dtypes(include="object").columns:
    print(i)
for i in dataset.select_dtypes(include="object").columns:
    dataset[i].fillna(dataset[i].mode()[0],inplace=True) # fill every object (categorical) column with its own mode
# To fill the numerical values
dataset.isnull().sum()
dataset.info()
dataset.select_dtypes(include="float64").columns
from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy="mean")
num_cols = dataset.select_dtypes(include="float64").columns
si.fit_transform(dataset[num_cols])
arr = si.fit_transform(dataset[num_cols])
pd.DataFrame(arr,columns=num_cols)
new_dataset = pd.DataFrame(arr,columns=num_cols)
new_dataset.isnull().sum()
new_dataset
dataset["LoanAmount"].mean()
# models generally need numerical input, which is why we convert categorical values to numerical values
# ONE HOT Encoding
import pandas as pd
dataset.head()
dataset.isnull().sum()
dataset["Gender"].fillna(dataset["Gender"].mode()[0],inplace=True)
dataset["Married"].fillna(dataset["Married"].mode()[0],inplace=True)
#1st method(ONE Hot Encoding) - get_dummies
en_data = dataset[["Gender","Married"]]
pd.get_dummies(en_data)
pd.get_dummies(en_data).info()
#2nd Method
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit_transform(en_data).toarray()
ar = ohe.fit_transform(en_data).toarray()
pd.DataFrame(ar,columns=["Gender_Female","Gender_Male","Married_No","Married_Yes"])
ohe1 = OneHotEncoder(drop="first")
ar1 = ohe1.fit_transform(en_data).toarray()
ar1
pd.DataFrame(ar,columns=["Gender_Male","Married_Yes"])
#Label Encoding
import pandas as pd
df = pd.DataFrame({"name":["wscube","cow","cat","dog","black"]})
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit_transform(df["name"])
df["en_name"] = le.fit_transform(df["name"])
df
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset["Property_Area].unique()
la = LabelEncoder()
la.fit(dataset["Property_Area"])
la.transform(dataset["Property_Area"])
dataset["Property_Area"] = la.transform(dataset["Property_Area"])
dataset["Property_Area"].unique()
#Ordinal Encoding
import pandas as pd
df = pd.DataFrame({"Size":["s","m","l","xl","s","m","l","s","s","l","xl","m"]})
df.head(3)
ord_data = [["s","m","l","xl"]] # we use double quotes because of 2 dimension
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=ord_data)
oe.fit(df[["Size"]])
oe.transform(df[["Size"]])
df["Size_encoding"] = oe.transform(df[["Size"]])
df
# 2nd method (Map function)
ord_data1 = {"s":0,"m":1,"l":2,"xl":3
df["Size"].map(ord_data1)
df["Size_encoding_map"] = df["Size"].map(ord_data1)
df
--------
# for example(learning purpose)
ord_data1 = {"s":5,"m":6,"l":7,"xl":8}
df["Size_encoding_map"] = df["Size"].map(ord_data1)
df
----------
#practical on big data
dataset = pd.read_csv("loan.csv")
dataset.head()
dataset["Property_Area"].unique() # for find name
dataset["Property_Area"].fillna(dataset["Property_Area"].model()[0],inplace=True)
en_data_ord = [['Rural','Semiurban','Urban']]
from sklearn.preprocessing import OrdinalEncoder
oen = OrdinalEncoder(categories=en_data_ord)
dataset["Property_Area"] = oen.fit_transform(dataset[["Property_Area"]] # use
fit_transform to permoform direct dataset
dataset.head(10)
#OUTLIER
# how to detect outlier
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.info()
dataset.describe()
# box plot
plt.figure(figsize=(15,5))
sns.boxplot(x = "CoapplicantIncome", data=dataset)
plt.show()
sns.boxplot(x = "ApplicantIncome", data=dataset)
plt.show()
#another method to find outlier
sns.distplot(dataset["ApplicantIncome"])
plt.show()
# outlier remover method
# using IQR (interquartile range)
# IQR = Q3-Q1
# min = Q1-(1.5*IQR)
# max = Q3 +(1.5 * IQR)
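The cells below filter only on max_range; a minimal sketch of a helper (the function name is my own) that applies both IQR bounds to any numeric column:
def remove_outliers_iqr(df, col):
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - (1.5*iqr), q3 + (1.5*iqr)
    return df[(df[col] >= low) & (df[col] <= high)] # keep only rows inside both bounds
# usage: clean = remove_outliers_iqr(dataset, "CoapplicantIncome")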
dataset.shape
q1 = dataset["CoapplicantIncome"].quantile(0.25)
q3 = dataset["CoapplicantIncome"].quantile(0.75)
q1
q3
IQR = q3-q1
min_range = q1 - (1.5*IQR)
max_range = q3 + (1.5*IQR)
min_range,max_range
dataset
dataset[dataset["CoapplicantIncome"]<=max_range ]
new_dataset = dataset[dataset["CoapplicantIncome"]<=max_range]
new_dataset.shape
sns.boxplot(x = "CoapplicantIncome", data=new_dataset)
plt.show()
# outlier removal using Z score
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.isnull().sum()
dataset.describe()
sns.boxplot(x= "CoapplicantIncome",data= dataset)
sns.distplot(dataset["CoapplicantIncome"]) # distribution plot
min_range = dataset["CoapplicantIncome"].mean() - 3*dataset["CoapplicantIncome"].std() # std - standard deviation
max_range = dataset["CoapplicantIncome"].mean() + 3*dataset["CoapplicantIncome"].std()
min_range,max_range
new_dataset = dataset[dataset["CoapplicantIncome"]<= max_range]
sns.boxplot(x= "CoapplicantIncome",data= new_dataset)
# Z score
z_score = (dataset["CoapplicantIncome"] - dataset["CoapplicantIncome"].mean())/(dataset["CoapplicantIncome"].std())
z_score
z_score>3
data["z_score"] = z_score # puting data in orignal dataset
dataset
# removing outlier
dataset[dataset["z_score"]<3]
new_dataset.shape
dataset[dataset["z_score"]<3].shape
# Feature Scaling(standardization)
Standardization - It is a very effective technique which re-scales a feature value
so that the resulting distribution has a mean of 0 and a variance equal to 1.
x(new) = (x(i) - mean(x)) / standard deviation of x
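As a quick check of the formula, the same scaling can be done directly in pandas (a sketch; df and "col" stand for any DataFrame and numeric column):
# z = (x - mean) / std  ->  the result has mean ~0 and std ~1
df["col_std"] = (df["col"] - df["col"].mean()) / df["col"].std()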
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.isnull().sum()
dataset["ApplicantIncome"].fillna(dataset["ApplicantIncome"].mean(),inplace=True)
sns.distplot(dataset["ApplicantIncome"])
plt.show()
dataset.describe()
# scaling through scikit-learn
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss.fit(dataset[["ApplicantIncome"]])
ss.transform(dataset[["ApplicantIncome"]])
dataset["ApplicantIncome_ss"] = pd.DataFrame(ss.transform(dataset[["ApplicantIncome"]]),columns=["x"])
dataset.head(3)
dataset.describe()
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.title("Before")
sns.distplot(dataset["ApplicantIncome_ss"])
plt.subplot(1,2,2)
plt.title("After")
sns.distplot(dataset["ApplicantIncome"])
plt.show()
# Feature Scaling(normalization)
#min-max scaler(normalization techniques)
Normalization - It is a scaling technique in which values are shifted and rescaled
so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.
x(new) = (x(i) - min(x)) / (max(x) - min(x))
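Again, as a quick check of the formula in pandas (a sketch; df and "col" stand for any DataFrame and numeric column):
# scaled = (x - min) / (max - min)  ->  the result lies between 0 and 1
df["col_minmax"] = (df["col"] - df["col"].min()) / (df["col"].max() - df["col"].min())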
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.isnull().sum()
dataset.describe()
sns.distplot(dataset["CoapplicantIncome"])
plt.show()
from sklearn.preprocessing import MinMaxScaler
ms = MinMaxScaler()
ms.fit(dataset[["CoapplicantIncome"]])
ms.transform(dataset[["CoapplicantIncome"]])
dataset["CoapplicantIncome_min"] = ms.transform(dataset[["CoapplicantIncome"]])
dataset.head(3)
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.title("Before")
sns.distplot(dataset["CoapplicantIncome"])
plt.subplot(1,2,2)
plt.title("After")
sns.distplot(dataset["CoapplicantIncome_min"])
plt.show()
# Handling Duplicate Data
import pandas as pd
data = {"name":["a","b","c","d","a","c"],"eng":[8,7,5,8,8,5],"hindi":[2,3,4,5,2,6]}
df = pd.DataFrame(data)
df
df.duplicated()
# df["duplicate"] = df.duplicated()
df
df.drop_duplicates()
df.drop_duplicates(inplace=True)
# duplicate data on big data
dataset = pd.read_csv("loan.csv")
dataset.duplicated()
dataset.shape
dataset.drop_duplicates(inplace=True)
dataset.shape
# Replace And Data Type change
import pandas as pd
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.info()
dataset.isnull().sum()
dataset["Dependents"].value_counts()
dataset["Dependents"].fillna(dataset["Dependents"].mode()[0],inplace=True)
dataset.isnull().sum()
dataset["Dependents"].replace("3+","3",inplace=True)
dataset["Dependents"].astype("int64") #converting type of object
dataset["Dependents"] = dataset["Dependents"].astype("int64")
dataset["Dependents"].value_counts()
dataset.info()
# Function Transformer
# without outlier
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.isnull().sum()
sns.distplot(dataset["CoapplicantIncome"])
plt.show()
#IQR
q1 = dataset["CoapplicantIncome"].quantile(0.25)
q3 = dataset["CoapplicantIncome"].quantile(0.75)
iqr = q3 -q1
min_r = q1 - (1.5*iqr)
max_r = q3 + (1.5*iqr)
min_r,max_r
dataset[dataset["CoapplicantIncome"]<=max_r]
dataset = dataset[dataset["CoapplicantIncome"]<=max_r]
sns.distplot(dataset["CoapplicantIncome"])
plt.show()
from sklearn.preprocessing import FunctionTransformer # function transforming here
ft = FunctionTransformer(func = np.log1p)
ft.fit(dataset[["CoapplicantIncome"]])
ft.transform(dataset[["CoapplicantIncome"]])
dataset["CoapplicantIncome_tf"] = ft.transform(dataset[["CoapplicantIncome"]])
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.distplot(dataset["CoapplicantIncome"])
plt.title("Before")
plt.subplot(1,2,2)
sns.distplot(dataset["CoapplicantIncome_tf"])
plt.title("After")
plt.show()
# Function Transformer
# with outlier
dataset.head(3)
dataset.isnull().sum()
sns.distplot(dataset["CoapplicantIncome"])
plt.show()
## IQR
q1 = dataset["CoapplicantIncome"].quantile(0.25)
q3 = dataset["CoapplicantIncome"].quantile(0.75)
iqr = q3 -q1
min_r = q1 - (1.5*iqr)
max_r = q3 + (1.5*iqr)
min_r,max_r
#dataset[dataset["CoapplicantIncome"]<=max_r]
#dataset = dataset[dataset["CoapplicantIncome"]<=max_r]
sns.distplot(dataset["CoapplicantIncome"])
plt.show()
from sklearn.preprocessing import FunctionTransformer # function transforming here
ft = FunctionTransformer(func = np.log1p)
ft.fit(dataset[["CoapplicantIncome"]])
ft.transform(dataset[["CoapplicantIncome"]])
dataset["CoapplicantIncome_tf"] = ft.transform(dataset[["CoapplicantIncome"]])
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.distplot(dataset["CoapplicantIncome"])
plt.title("Before")
plt.subplot(1,2,2)
sns.distplot(dataset["CoapplicantIncome_tf"])
plt.title("After")
plt.show()
## another method
ft1 = FunctionTransformer(func= lambda x : x**2)
ft1.fit(dataset[["CoapplicantIncome"]])
dataset["CoapplicantIncome_tf1"] = ft1.transform(dataset[["CoapplicantIncome"]])
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.distplot(dataset["CoapplicantIncome"])
plt.title("Before")
plt.subplot(1,2,2)
sns.distplot(dataset["CoapplicantIncome_tf1"])
plt.title("After")
plt.show()
## Feature Selection Techniques
## Backward Elimination and Forward Elimination (using mlxtend)
Feature_Selection :- A feature is an attribute that has an impact on a problem or
is useful for the problem, and choosing the important features for the model is
known as feature selection.
#Forward Elimination (using mlxtend) :-
import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector
dataset = pd.read_csv("diabetes.csv")
dataset.head(3)
x = dataset.iloc[:,:-1]
y = dataset["Outcome"]
x.shape
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
fs = SequentialFeatureSelector(lr,k_features=5,forward=True)
fs.fit(x,y)
fs.feature_names
fs.k_feature_names_
fs.k_score_
# for backward elimination
fs = SequentialFeatureSelector(lr,k_features=5,forward=False)
fs.fit(x,y)
fs.feature_names
fs.k_feature_names_
fs.k_score_
## Train Test Split in Data set ##
import pandas as pd
dataset = pd.read_csv("Boston.csv")
dataset.head(3)
dataset.shape
input_data = dataset.iloc[:,:-1]
output_data = dataset["House_Price"]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(input_data,output_data,test_size=0.25)
x_test
x_train
y_test
y_train
x_train.shape , y_train.shape
x_test.shape , y_test.shape
### Regression Analysis
## LINEAR REGRESSION ALGORITHM (simple linear)
# simple linear Regression - simple linear Regression is a type of Regression
algorithms that models the relationship between a dependent variable and a single
independent variable.
y = mx+c
where y = dependent variable
x = independent variable
m = slope/gradient/coefficient
c = intercept
m = +ve  => theta < 90 degrees
m = -ve  => theta > 90 degrees
m = 0    => theta = 0 degrees
#practical
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("placement.csv")
#dataset = pd.read_csv(r"D:\data set\video\placement.csv")
dataset.head(3)
plt.figure(figsize=(5,3))
sns.scatterplot(x="cgpa",y="package",data=dataset)
plt.show()
dataset.isnull().sum()
dataset.ndim # if the input is 1-dimensional, change it to 2 dimensions
x = dataset[["cgpa"]]
y = dataset["package"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
# y = m*x+c
lr.coef_ #array([0.57425647])
lr.intercept_ # -1.0270069374542108
# y = 0.57425647*x-1.0270069374542108
lr.score(x_test,y_test)*100
lr.predict([[6.89]])
# 0.57425647*6.89-1.0270069374542108 # 2.92962016
y_prd = lr.predict(x)
plt.figure(figsize=(5,4))
sns.scatterplot(x="cgpa",y="package",data=dataset)
plt.plot(dataset["cgpa"],y_prd,c="red" )
plt.legend(["org data","predict line"])
plt.savefig("predict.jpg")
plt.show()
## Multiple linear Regression
Multiple linear Regression is an extension of simple linear regression as it takes
more than one predictor variable to predict the response variable.
y = m1x1+m2x2+m3x3.....+c
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("regression_dataset.csv")
dataset.head(3)
dataset.shape
dataset.isnull().sum()
sns.pairplot(data=dataset)
plt.show()
sns.heatmap(data=dataset.corr(),annot=True)
plt.show()
x = dataset.iloc[:,:-1]
x
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("multiple_regression_dataset.csv")
dataset.head()
dataset.isnull().sum()
sns.pairplot(data=dataset)
plt.show()
sns.heatmap(data=dataset.corr(),annot=True)
plt.show()
x = dataset.iloc[:,:-1]
y = dataset["Salary"]
x.ndim
dataset.shape
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 # 63.65989707495038
# y = m1x1+m2x2+c
lr.coef_ #array([1676.38278101, -136.23367567])
lr.intercept_ #34875.404040507696
# y_prd = 1676.38278101*Age -136.23367567*Experience + 34875.404040507696
x.columns
lr.predict(x_test)
## POLYNOMIAL REGRESSION
Polynomial Regression is a regression algorithm that models the relationship
between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial.
y = b0 + b1*x1 + b2*x1^2 + ....... + bn*x1^n
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv("ploynomial.csv")
dataset.head(3)
dataset.corr()
plt.scatter(dataset["Level"],dataset["Salary"])
plt.xlabel("Level")
plt.ylabel("Salary")
plt.show()
x = dataset[["Level"]]
y = dataset[["Salary"]]
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=2)
pf.fit(x)
pf.transform(x)
x = pf.transform(x)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 # 99.999999
# y = m1x1+m2x2^2+c
# y = 1000.10647295*x1 + 500.00158031*x2^2 - 13.512174614
lr.coef_ #array([0. , 1000.10647295, 500.00158031])
lr.intercept_ # -13.512174614
prd =lr.predict(x)
plt.scatter(dataset["Level"],dataset["Salary"])
plt.plot(dataset["Level"],prd,c='red')
plt.xlabel("Level")
plt.ylabel("Salary")
plt.legend(["org","prd"])
plt.show()
test = pf.transform([[45]])
test # array([[1.000e+00, 4.500e+01, 2.025e+03]])
lr.predict(test) #([1057494.47922994])
# Cost Function
1 - A cost function is an important parameter that determines how well a machine
learning model performs for a given dataset.
2 - Cost function is a measure of how wrong the model is in estimating the
relationship between X(input) and Y(output) Parameter.
=> types of cost function
a) Regression Cost Function
b) Classification cost Function
a)Regression Cost Function
Regression models are used to make a prediction for the continuous variable.
1- MSE(Mean Square Error)
2- RMSE(Root Mean Square Error)
3- MAE- (Mean Absolute Error)
4- R^2 Accuracy
b) Classification cost Function
1- Binary Classification Cost Function:
Classification models are used to make predictions of categorical variables,
such as predictions for 0 or 1, Cat or dog, etc.
2- Multi-class Classification Cost Function:
A multi-class Classification cost function is used in the Classification
problems for which instances are allocated to one of more than two classes.
-> Binary Cross Entropy Cost Function or Log Loss Function
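For reference (not spelled out in these notes), the standard log loss formula is:
Log Loss = -(1/n) Σ [ y(i)*log(ŷ(i)) + (1 - y(i))*log(1 - ŷ(i)) ]
For example, if y = 1 and the model predicts ŷ = 0.9, the loss for that sample is -log(0.9) ≈ 0.105.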
=> Regression Cost Function
1. Mean Squared Error:
. Mean Squared Error(MSE) is the mean Squared difference between the actual
and predicted values. MSE penalizes high errors caused by outliers by squaring the
errors.
. Mean Squared error is also known as L2 Loss.
Mean Squared Error = (1/n) Σ (y(i) - ŷ(i))^2
2. Mean Absolute Error:
. Mean Absolute Error(MAE) is the mean absolute difference between the actual
Values and the predicted values.
. MAE is more robust to outliers. This insensitivity to outliers is because it
does not penalize high errors caused by them.
Mean Absolute Error = 1/n E |Y(i) - ^y(i)|
3. Root Mean Squared Error:
. Root Mean Squared Error (RMSE) is the square root of the mean of the squared
differences between actual and predicted values.
. RMSE can be used in situations where we want to penalize high errors but
not as much as MSE does.
Root Mean Squared Error = sqrt( (1/n) Σ (y(i) - ŷ(i))^2 )
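All three metrics are available in scikit-learn; a minimal sketch on made-up values:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.5, 8.0]
mse = mean_squared_error(y_true, y_pred)   # (0.25+0.25+1)/3 = 0.5
mae = mean_absolute_error(y_true, y_pred)  # (0.5+0.5+1)/3 ≈ 0.667
rmse = np.sqrt(mse)                        # sqrt(0.5) ≈ 0.707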
=> How To Find a Best Fit Line:-
     The best fit line is the one whose parameters (slope and intercept) minimise the
chosen cost function, typically found using Ordinary Least Squares or Gradient Descent.
=> L1(Lasso Regularization), L2(Ridge Regularization)
=> Regularization Techniques
. This is a form of regression, that constrains/regularizes or shrinks the
coefficient estimates towards zero.
. This technique discourages learning a more complex or flexible model, so as
to avoid the risk of overfitting.
-> Regularization can achieve this motive with 2 techniques:
- Ridge Regularization /L2
- Lasso Regularization /L1
-> Lasso Regularization /L1:(feature selection work)
. This is a regularization technique used in feature selection using a
Shrinkage method also referred to as the penalized regression method.
. In Lasso Regression the magnitude of the coefficients can shrink exactly to zero, which removes those features from the model.
cost function = Loss + lambda * Σ |w|
. Loss = sum of squared residual
. lambda = penalty
. w = slope of the curve
-> Ridge Regularization /L2:(overfitting reducing technique)
. Ridge Regression,also known as L2 regularization, is an extension to linear
Regression that introduces a regularization term to reduce model complexity and
help prevent overfitting.
. In Ridge Regression the magnitude of the coefficients shrinks close to, but not
exactly, zero.
cost function = Loss + lambda * Σ w^2
. Loss = sum of squared residual
. lambda = penalty
. w = slope of the curve
=> L1(Lasso Regularization), L2(Ridge Regularization) Practical
-> Regularization Techniques (Practical)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
dataset = pd.read_csv(r"houseprice.csv")
dataset.head(3)
plt.figure(figsize=(10,10))
sns.heatmap(data=dataset.corr(),annot=True)
plt.show()
x = dataset.iloc[:,:-1]
y = dataset["price"]
sc = StandardScaler()
sc.fit(x)
sc.transform(x)
x = pd.DataFrame(sc.transform(x),columns=x.columns)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LinearRegression , Lasso , Ridge
from sklearn.metrics import mean_absolute_error,mean_squared_error
import numpy as np
# LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 # 3.2286184...
lr.coef_
print(mean_squared_error(y_test,lr.predict(x_test))) #986919392751.0544
print(mean_absolute_error(y_test,lr.predict(x_test))) #210903.52141518658
print(np.sqrt(mean_squared_error(y_test,lr.predict(x_test)))) #993438.167552996
plt.figure(figsize=(15,5))
plt.bar(x.columns,lr.coef_)
plt.title("LinearRegression")
plt.xlabel("columns")
plt.ylabel("coef")
plt.show()
# Lasso(L1)
la =Lasso(alpha=0.5)
la.fit(x_train,y_train)
la.score(x_test,y_test)*100 # 3.228361
la.coef_
print(mean_squared_error(y_test,la.predict(x_test))) #986921772009.158
print(mean_absolute_error(y_test,la.predict(x_test))) #210908.17447564355
print(np.sqrt(mean_squared_error(y_test,la.predict(x_test)))) #993439.3650390335
plt.figure(figsize=(15,5))
plt.bar(x.columns,la.coef_)
plt.title("Lasso")
plt.xlabel("columns")
plt.ylabel("coef")
plt.show()
# Ridge(L2)
ri = Ridge(alpha = 10)
ri.fit(x_train,y_train)
ri.score(x_test,y_test)*100 #3.2401994171
ri.coef_
print(mean_squared_error(y_test,ri.predict(x_test))) #986801284919.7765
print(mean_absolute_error(y_test,ri.predict(x_test))) #210815.94787357954
print(np.sqrt(mean_squared_error(y_test,ri.predict(x_test)))) #993378.7217973699
plt.figure(figsize=(15,5))
plt.bar(x.columns,ri.coef_)
plt.title("Ridge")
plt.xlabel("columns")
plt.ylabel("coef")
plt.show()
df = pd.DataFrame({"col_name":x.columns,"LinearRegression":lr.coef_,"Lasso":la.coef_,"Ridge":ri.coef_})
## Classification
# Classification Algorithm
. The Classification algorithm is used to identify the category of a new observation
on the basis of training data.
. In Classification, a program learns from the given dataset or observations and
then classifies new observations into a number of classes or groups.
. Such as, Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. classes can be
called as targets/labels or categories.
-> There are two types of Classifications:
a) Binary Classifier: If the Classification problem has only two possible Outcomes,
then it is called as Binary Classifier.
ex- SPAM or NOT SPAM, CAT or DOG, etc.
b) Multi-class Classifier: If a Classification problem has more than two outcomes,
then it is called as Multi-class Classifier.
ex - Classifications of types of crops, Classification of types of music.
=> Types Of ML CLASSIFICATION ALGORITHM
Linear Models:
 . Logistic Regression
 . Support Vector Machines
Non-linear Models:
 . K-Nearest Neighbours
 . Kernel SVM
 . Naive Bayes
 . Decision Tree Classification
 . Random Forest Classification
=> Logistic Regression (Binary Classification)(Practical):
. Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique.
. It is used for predicting the categorical dependent variable using a given set of
independent variables.
. Therefore, the outcome must be a categorical or discrete value. it can be either
Yes or No,0 or 1,True or False, etc. but instead of giving the exact value as 0 and
1, it gives the probabilistic values which lie between 0 and 1.
-> Types of Logistic Regression
on the basis of the categories, Logistic Regression can be classified into three
types:
i) Binomial: In binomial Logistic regression, there can be only two possible types
of the dependent variables, such as 0 or 1,Pass or Fail,etc.
ii) Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as "cat","dogs", or
"sheep".
iii) Ordinal: In ordinal Logistic regression, there can be 3 or more possible
ordered types of dependent variables, such as "low","medium",or "High".
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv(r"Social_Network_Ads.csv")
dataset.drop(columns=["EstimatedSalary"],inplace=True)
dataset.head(5)
plt.figure(figsize=(4,3))
sns.scatterplot(x="Age",y="Purchased",data=dataset)
plt.show()
x = dataset[["Age"]]
y = dataset["Purchased"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 # 91.25
lr.predict([[40]])
plt.figure(figsize=(4,3))
sns.scatterplot(x="Age",y="Purchased",data=dataset)
sns.lineplot(x = "Age",y = lr.predict(x),data=dataset,color = 'red')
plt.show()
=> Logistic Regression (Binary Classification)(Multiple input)(Practical):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv(r"placement.csv")
dataset.head(3)
plt.figure(figsize=(5,4))
sns.scatterplot(x="cgpa",y="score",data=dataset,hue="placed")
plt.legend(loc=1)
plt.show()
x = dataset.iloc[:,:-1]
x.ndim
print(x)
y = dataset["placed"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100
lr.predict([[8.14,6.52]]) # array([1], dtype=int64)
lr.coef_
lr.intercept_
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(x.to_numpy(),y.to_numpy(),clf=lr)
plt.show()
=> Logistic Regression(Binary Classification)(Polynomial input)(practical):
-> Logistic Regression(Polynomial Features):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv(r"Polynomial_classification.csv")
dataset.head(5)
plt.figure(figsize=(5,4))
sns.scatterplot(x="data1",y="data2",data=dataset,hue="output")
plt.show()
x = dataset.iloc[:,:-1]
y = dataset["output"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 #90.0
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(x.to_numpy(),y.to_numpy(),clf=lr)
plt.show()
-> with Polynomial Features
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv(r"Polynomial_classification.csv")
dataset.head(5)
plt.figure(figsize=(5,4))
sns.scatterplot(x="data1",y="data2",data=dataset,hue="output")
plt.show()
x = dataset.iloc[:,:-1]
y = dataset["output"]
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=3)
pf.fit(x)
pf.transform(x)
x = pd.DataFrame(pf.transform(x))
x.shape
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 #95.0
=> Logistic Regression(Multiclass Classification)(practical):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv(r"iris.csv")
dataset.head(3)
dataset["species"].unique() #array(['setosa','versicolor','virginica'],
dtype=object)
sns.pairplot(data=dataset,hue="species")
plt.show()
x = dataset.iloc[:,:-1]
y = dataset["species"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
## OVR
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(multi_class="ovr")
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 #96.66666
# multinomial
lr1 = LogisticRegression(multi_class="multinomial")
lr1.fit(x_train,y_train)
lr1.score(x_test,y_test)*100
lr2 = LogisticRegression()
lr2.fit(x_train,y_train)
lr2.score(x_test,y_test)*100 #100.0
=> CONFUSION MATRIX:
. A Confusion matrix is a simple and useful tool for understanding the performance
of a Classification model, like one used in machine learning or statistics.
. It helps you evaluate how well your model is doing in categorizing things
correctly.
. It is also known as the error matrix.
. The matrix presents the prediction results in a summarized form, showing the total
number of correct predictions and incorrect predictions.
. Accuracy = (TP+TN)/n
. Error = (FP+FN)/n, where n is the total number of predictions
. False Negative: The model has predicted no, but the actual value was Yes, it is
also called as Type-II error.
. False Positive:The model has predicted Yes, but the actual value was No. it is
also called a Type-I error.
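A small worked example with made-up counts: suppose TP = 40, TN = 45, FP = 5 and FN = 10, so n = 100. Then Accuracy = (40+45)/100 = 0.85 and Error = (5+10)/100 = 0.15.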
=> CONFUSION MATRIX (Sensitivity, Precision, Recall, F1-score)
-> Precision: TP/(TP+FP)
     It measures how many of the samples predicted as positive are actually positive.
-> Recall (Sensitivity): TP/(TP+FN)
     It measures how many of the actual positive samples the model is able to identify.
-> F1-Score:
     It is the harmonic mean of precision and recall. It takes both false positives
and false negatives into account, so it performs well on an imbalanced dataset.
F1 Score = 2*Precision*Recall/(Precision+Recall)
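A small worked example with made-up values: if Precision = 0.8 and Recall = 0.6, then F1 Score = (2*0.8*0.6)/(0.8+0.6) = 0.96/1.4 ≈ 0.686.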
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv(r"placement.csv")
dataset.head(5)
x = dataset.iloc[:,:-1]
x
y = dataset["placed"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 #100.0
from sklearn.metrics import confusion_matrix,precision_score,recall_score,f1_score
cf = confusion_matrix(y_test,lr.predict(x_test))
cf # array([[10, 0], [0, 10]], dtype=int64)
sns.heatmap(cf,annot=True)
plt.show()
precision_score(y_test,lr.predict(x_test))*100 #100.0
recall_score(y_test,lr.predict(x_test))*100 #100.0
f1_score(y_test,lr.predict(x_test))*100 #100
=> IMBALANCED DATASET
-> Techniques to Handle IMBALANCED Data
i) Random Under Sampling:
     we reduce the majority class so that it has the same number of samples as
the minority class.
ii) Random Over Sampling:
     we increase the size of the minority (inactive) class until it matches the size
of the majority (active) class.
import pandas as pd
dataset = pd.read_csv("Social_Network_Ads.csv")
dataset.head(5)
dataset["Purchased"].value_counts()
x = dataset.iloc[:,:-1]
y = dataset["Purchased"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=42,test_size=0.2)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100
lr.predict([[45,26000]]) #array([0], dtype=int64)
=> imblearn
import pandas as pd
dataset = pd.read_csv("Social_Network_Ads.csv")
dataset.head(5)
x = dataset.iloc[:,:-1]
y = dataset["Purchased"]
dataset["Purchased"].value_counts()
# Random Under Sampling
from imblearn.under_sampling import RandomUnderSampler
ru = RandomUnderSampler()
ru.fit_resample(x,y)
ru_x, ru_y = ru.fit_resample(x,y)
ru_y.value_counts()
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(ru_x,ru_y,random_state=42,test_size=0.2)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100
lr.predict([[45,26000]]) #array([1], dtype=int64)
# Random Over Sampling
from imblearn.over_sampling import RandomOverSampler
ro = RandomOverSampler()
ro_x, ro_y = ro.fit_resample(x,y)
ro_y.value_counts()
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(ro_x,ro_y,random_state=42,test_size=0.2)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100
lr.predict([[45,26000]]) # array([1], dtype=int64)
## NAIVE BAYES
- Naive Bayes is a Classification algorithm based on Bayes theorem.
- Bayes theorem is a result in probability theory that describes the probability of an
event, based on prior knowledge of conditions that might be related to the event.
-> Naive: It is called Naive because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features.
-> Bayes: It is called bayes because it depends on the principle of Bayes' Theorem.
-> Bayes' Theorem: Bayes' theorem is also known as Bayes' Rule or Bayes' law, which
is used to determine the probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
P(a/b) = (P(b/a) * P(a)) / P(b)
where:
- p(a/b) is Posterior probability: probability of hypothesis A on the observed
event B.
- P(b/a) is Likelihood probability: probability of the evidence given that the
hypothesis is true.
- p(a) is Prior Probability: Probability of hypothesis before observing the
evidence.
- p(b) is Marginal Probability: Probability of evidence.
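A small worked example with made-up numbers: if p(a) = 0.3, p(b/a) = 0.8 and p(b) = 0.6, then p(a/b) = (0.8 * 0.3) / 0.6 = 0.4.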
=> Types of Naive Bayes Model:
There are three types of Naive Bayes Model,
which are given below
. Gaussian
. Multinomial
. Bernoulli
i) Gaussian Naive Bayes:
- Assumes that continuous features follow a Gaussian (normal) distribution.
- Suitable for features that are continuous and have a normal distribution.
ii) Bernoulli Naive Bayes:
- Assumes that the features are binary (Boolean) variables.
- Suitable for data that can be represented as binary features, such as document
Classification problems where each term is either present or absent.
iii) Multinomial Naive Bayes:
- Assumes that features follow a multinomial distribution.
- Typically used for discrete data, such as text data, where each feature
represents the frequency of a term.
-> Practical:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.plotting import plot_decision_regions
dataset = pd.read_csv(r"placement.csv")
dataset.head(5)
sns.kdeplot(data=dataset["cgpa"])
plt.show()
sns.kdeplot(data=dataset["score"])
plt.show()
plt.figure(figsize=(4,3))
sns.scatterplot(x="cgpa",y="score",data=dataset,hue="placed")
plt.show()
x = dataset.iloc[:,:-1]
y = dataset["placed"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=42,test_size=0.2)
from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB
gnb = GaussianNB()
gnb.fit(x_train,y_train)
gnb.score(x_test,y_test)*100 #100.0
gnb.score(x_train,y_train)*100 #97.5
gnb.predict([[6.17,5.17]]) # array([0],dtype=int64) # 6.17 5.17 0
plt.figure(figsize=(4,3))
plot_decision_regions(x.to_numpy(),y.to_numpy(),clf=gnb)
plt.show()
mnb = MultinomialNB()
mnb.fit(x_train,y_train)
mnb.score(x_test,y_test)*100 #75.0
mnb.score(x_train,y_train)*100 #73.75
plt.figure(figsize=(4,3))
plot_decision_regions(x.to_numpy(),y.to_numpy(),clf=mnb)
plt.show()
bnb = BernoulliNB()
bnb.fit(x_train,y_train)
bnb.score(x_test,y_test)*100 #50.0
bnb.score(x_train,y_train)*100 #50.0
plt.figure(figsize=(4,3))
plot_decision_regions(x.to_numpy(),y.to_numpy(),clf=bnb)
plt.show()
#==> DECISION TREE (REGRESSION):
. Decision Tree is a Supervised learning technique that can be used for both
Classification and Regression problems, but mostly it is preferred for solving
Classification problems.
. In order to build a tree, we use the CART algorithm, which stands for
Classification and Regression Tree algorithm.
-> important Terminology related to Decision Trees:
- Root Node: It represents the entire population or sample and this further
gets divided into two or more homogeneous sets.
- Splitting: It is a process of dividing a node into two or more sub-nodes.
- Decision Node: When a sub-node splits into further sub-nodes, then it is
called the decision node.
- Leaf / Terminal Node: Nodes that do not split further are called Leaf or Terminal nodes.
- Pruning: When we remove sub-nodes of a decision node , this process is
called pruning. You can say the opposite process of Splitting.
- Branch / sub-Tree: A subsection of the entire tree is called branch or sub-
tree.
- Parent and Child Node: A node which is divided into sub-nodes is called the
parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
-> Attribute Selection Measures (ASM)
     - Using these measures, we can easily select the best attribute for the nodes of
the tree. There are two popular techniques for ASM, which are:
. Information Gain
. Entropy / Gini Index
=> Entropy : Entropy is a metric to measure the impurity in a given attribute. It
specifies randomness in data.
Entropy(s) = -P(Yes)*log2(P(Yes)) - P(no)*log2(P(no))
where:
s = Total number of samples
P(Yes) = probability of Yes
P(no) = probability of no
=> Information Gain : Information gain is the measurement of changes in entropy
after the segmentation of a dataset based on an attribute. It calculates how much
information a feature provided us about a class.
Information Gain = Entropy(S) - [(Weighted Avg) *Entropy(each feature)]
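A minimal sketch of both measures in Python (the split counts are made-up for illustration):
import numpy as np
def entropy(p_yes, p_no):
    # entropy is 0 for a pure node and 1 bit for a 50/50 split
    return -sum(p*np.log2(p) for p in (p_yes, p_no) if p > 0)
# parent node: 9 Yes / 5 No ; a split gives children (6 Yes, 2 No) and (3 Yes, 3 No)
parent = entropy(9/14, 5/14)                          # ~0.940
child1 = entropy(6/8, 2/8)                            # ~0.811
child2 = entropy(3/6, 3/6)                            # 1.0
info_gain = parent - ((8/14)*child1 + (6/14)*child2)  # ~0.048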
## DECISION TREE (Classification)(Practical):