## MACHINE LEARNING
i) Supervised Learning
ii) Unsupervised Learning
iii) Reinforcement Learning
i) Supervised Learning
a) Regression Learning
b) Classification Learning
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv(r"D:\data set\video\loan.csv")
dataset.head(3)
dataset.shape
dataset.shape[0]
dataset.isnull()
dataset.isnull().sum()
dataset.isnull().sum().sum()
dataset.notnull().sum()
(dataset.isnull().sum().sum()/(dataset.shape[0]*dataset.shape[1]))*100 # overall missing data in percentage
(dataset.isnull().sum()/dataset.shape[0])*100
sns.heatmap(dataset.isnull())
plt.show()
dataset.drop(columns=["Credit_History"]) # drop a column by name
dataset.shape
dataset.dropna() # drops every row that contains a missing value
dataset.dropna(inplace=True) # with inplace=True the rows are deleted in the original dataset
dataset
dataset.isnull().sum()
sns.heatmap(dataset.isnull())
plt.show()
dataset.shape
# Handling Missing Values
dataset.fillna(10) # wrong way to fill the data
dataset.fillna(10).head(10)
dataset.info()
dataset.fillna(method="bfill") # backward filling
dataset.fillna(method="ffill",axis=1) # axis=0 means column wise or axis = 1 means
row wise filling
dataset["Gender"].mode()[0]
dataset["Gender"].fillna(dataset["Gender"].mode()[0])
dataset["Gender"].fillna(dataset["Gender"].mode()[0],inplace=True) # To fixed data
in original dataset (perticular column
dataset.select_dtypes(include="object")
dataset.select_dtypes(include="object").isnull().sum()
dataset.select_dtypes(include="object").columns
for i in dataset.select_dtypes(include="object").columns:
    print(i)
for i in dataset.select_dtypes(include="object").columns:
    dataset[i].fillna(dataset[i].mode()[0],inplace=True) # fill every object (categorical) column with its own mode
# To fill the numerical values
dataset.isnull().sum()
dataset.info()
dataset.select_dtypes(include="float64").columns
from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy="mean")
num_cols = dataset.select_dtypes(include="float64").columns
si.fit_transform(dataset[num_cols])
arr = si.fit_transform(dataset[num_cols])
pd.DataFrame(arr,columns=num_cols)
new_dataset = pd.DataFrame(arr,columns=num_cols)
new_dataset.isnull().sum()
new_dataset
dataset["LoanAmount"].mean()
# models generally need numerical input, which is why we convert categorical values to numerical values
# ONE HOT Encoding
import pandas as pd
dataset.head()
dataset.isnull().sum()
dataset["Gender"].fillna(dataset["Gender"].mode()[0],inplace=True)
dataset["Married"].fillna(dataset["Married"].mode()[0],inplace=True)
#1st method(ONE Hot Encoding) - get_dummies
en_data = dataset[["Gender","Married"]]
pd.get_dummies(en_data)
pd.get_dummies(en_data).info()
#2nd Method
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit_transform(en_data).toarray()
ar = ohe.fit_transform(en_data).toarray()
pd.DataFrame(ar,columns=["Gender_Female","Gender_Male","Married_No","Married_Yes"])
ohe1 = OneHotEncoder(drop="first")
ar1 = ohe1.fit_transform(en_data).toarray()
ar1
pd.DataFrame(ar,columns=["Gender_Male","Married_Yes"])
#Label Encoding
import pandas as pd
df = pd.DataFrame({"name":["wscube","cow","cat","dog","black"]})
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit_transform(df["name"])
df["en_name"] = le.fit_transform(df["name"])
df
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset["Property_Area].unique()
la = LabelEncoder()
la.fit(dataset["Property_Area"])
la.transform(dataset["Property_Area"])
dataset["Property_Area"] = la.transform(dataset["Property_Area"])
dataset["Property_Area"].unique()
#Ordinal Encoding
import pandas as pd
df = pd.DataFrame({"Size":["s","m","l","xl","s","m","l","s","s","l","xl","m"]})
df.head(3)
ord_data = [["s","m","l","xl"]] # we use double quotes because of 2 dimension
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=ord_data)
oe.fit(df[["Size"]])
oe.transform(df[["Size"]])
df["Size_encoding"] = oe.transform(df[["Size"]])
df
# 2nd method (Map function)
ord_data1 = {"s":0,"m":1,"l":2,"xl":3
df["Size"].map(ord_data1)
df["Size_encoding_map"] = df["Size"].map(ord_data1)
df
--------
# for example(learning purpose)
ord_data1 = {"s":5,"m":6,"l":7,"xl":8}
df["Size_encoding_map"] = df["Size"].map(ord_data1)
df
----------
#practical on big data
dataset = pd.read_csv("loan.csv")
dataset.head()
dataset["Property_Area"].unique() # for find name
dataset["Property_Area"].fillna(dataset["Property_Area"].model()[0],inplace=True)
en_data_ord = [['Rural','Semiurban','Urban']]
from sklearn.preprocessing import OrdinalEncoder
oen = OrdinalEncoder(categories=en_data_ord)
dataset["Property_Area"] = oen.fit_transform(dataset[["Property_Area"]] # use
fit_transform to permoform direct dataset
dataset.head(10)
#OUTLIER
# how to detect outlier
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.info()
dataset.describe()
# box plot
plt.figure(figsize=(15,5))
sns.boxplot(x = "CoapplicantIncome", data=dataset)
plt.show()
sns.boxplot(x = "ApplicantIncome", data=dataset)
plt.show()
#another method to find outlier
sns.distplot(dataset["ApplicantIncome"])
plt.show()
# outlier remover method
# using IQR (interquartile range)
# IQR = Q3-Q1
# min = Q1-(1.5*IQR)
# max = Q3 +(1.5 * IQR)
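The cells below filter only on max_range; a minimal sketch of a helper (the function name is my own) that applies both IQR bounds to any numeric column:
def remove_outliers_iqr(df, col):
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - (1.5*iqr), q3 + (1.5*iqr)
    return df[(df[col] >= low) & (df[col] <= high)] # keep only rows inside both bounds
# usage: clean = remove_outliers_iqr(dataset, "CoapplicantIncome")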
dataset.shape
q1 = dataset["CoapplicantIncome"].quantile(0.25)
q3 = dataset["CoapplicantIncome"].quantile(0.75)
q1
q3
IQR = q3-q1
min_range = q1 - (1.5*IQR)
max_range = q3 + (1.5*IQR)
min_range,max_range
dataset
dataset[dataset["CoapplicantIncome"]<=max_range ]
new_dataset = dataset[dataset["CoapplicantIncome"]<=max_range]
new_dataset.shape
sns.boxplot(x = "CoapplicantIncome", data=new_dataset)
plt.show()
# outlier removal using Z score
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.isnull().sum()
dataset.describe()
sns.boxplot(x= "CoapplicantIncome",data= dataset)
sns.distplot(dataset["CoapplicantIncome"]) # distribution plot
min_range = dataset["CoapplicantIncome"].mean() - 3*dataset["CoapplicantIncome"].std() # std - standard deviation
max_range = dataset["CoapplicantIncome"].mean() + 3*dataset["CoapplicantIncome"].std()
min_range,max_range
new_dataset = dataset[dataset["CoapplicantIncome"]<= max_range]
sns.boxplot(x= "CoapplicantIncome",data= new_dataset)
# Z score
z_score = (dataset["CoapplicantIncome"] - dataset["CoapplicantIncome"].mean())/(dataset["CoapplicantIncome"].std())
z_score
z_score>3
data["z_score"] = z_score # puting data in orignal dataset
dataset
# removing outlier
dataset[dataset["z_score"]<3]
new_dataset.shape
dataset[dataset["z_score"]<3].shape
# Feature Scaling(standardization)
Standardization - It is a very effective technique which re-scales a feature value
so that the resulting distribution has a mean of 0 and a variance equal to 1.
x(new) = (x(i) - mean(x)) / standard deviation of x
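As a quick check of the formula, the same scaling can be done directly in pandas (a sketch; df and "col" stand for any DataFrame and numeric column):
# z = (x - mean) / std  ->  the result has mean ~0 and std ~1
df["col_std"] = (df["col"] - df["col"].mean()) / df["col"].std()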
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.isnull().sum()
dataset["ApplicantIncome"].fillna(dataset["ApplicantIncome"].mean(),inplace=True)
sns.distplot(dataset["ApplicantIncome"])
plt.show()
dataset.describe()
# scaling through scikit-learn
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss.fit(dataset[["ApplicantIncome"]])
ss.transform(dataset[["ApplicantIncome"]])
dataset["ApplicantIncome_ss"] = pd.DataFrame(ss.transform(dataset[["ApplicantIncome"]]),columns=["x"])
dataset.head(3)
dataset.describe()
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.title("Before")
sns.distplot(dataset["ApplicantIncome_ss"])
plt.subplot(1,2,2)
plt.title("After")
sns.distplot(dataset["ApplicantIncome"])
plt.show()
# Feature Scaling(normalization)
#min-max scaler(normalization techniques)
Normalization - It is a scaling technique in which values are shifted and rescaled
so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.
x(new) = (x(i) - min(x)) / (max(x) - min(x))
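Again, as a quick check of the formula in pandas (a sketch; df and "col" stand for any DataFrame and numeric column):
# scaled = (x - min) / (max - min)  ->  the result lies between 0 and 1
df["col_minmax"] = (df["col"] - df["col"].min()) / (df["col"].max() - df["col"].min())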
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.isnull().sum()
dataset.describe()
sns.distplot(dataset["CoapplicantIncome"])
plt.show()
from sklearn.preprocessing import MinMaxScaler
ms = MinMaxScaler()
ms.fit(dataset[["CoapplicantIncome"]])
ms.transform(dataset[["CoapplicantIncome"]])
dataset["CoapplicantIncome_min"] = ms.transform(dataset[["CoapplicantIncome"]])
dataset.head(3)
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
plt.title("Before")
sns.distplot(dataset["CoapplicantIncome"])
plt.subplot(1,2,2)
plt.title("After")
sns.distplot(dataset["CoapplicantIncome_min"])
plt.show()
# Handling Duplicate Data
import pandas as pd
data = {"name":["a","b","c","d","a","c"],"eng":[8,7,5,8,8,5],"hindi":[2,3,4,5,2,6]}
df = pd.DataFrame(data)
df
df.duplicated()
# df["duplicate"] = df.duplicated()
df
df.drop_duplicates()
df.drop_duplicates(inplace=True)
# duplicate data on big data
dataset = pd.read_csv("loan.csv")
dataset.duplicated()
dataset.shape
dataset.drop_duplicates(inplace=True)
dataset.shape
# Replace And Data Type change
import pandas as pd
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.info()
dataset.isnull().sum()
dataset["Dependents"].value_counts()
dataset["Dependents"].fillna(dataset["Dependents"].mode()[0],inplace=True)
dataset.isnull().sum()
dataset["Dependents"].replace("3+","3",inplace=True)
dataset["Dependents"].astype("int64") #converting type of object
dataset["Dependents"] = dataset["Dependents"].astype("int64")
dataset["Dependents"].value_counts()
dataset.info()
# Function Transformer
# without outlier
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("loan.csv")
dataset.head(3)
dataset.isnull().sum()
sns.distplot(dataset["CoapplicantIncome"])
plt.show()
#IQR
q1 = dataset["CoapplicantIncome"].quantile(0.25)
q3 = dataset["CoapplicantIncome"].quantile(0.75)
iqr = q3 -q1
min_r = q1 - (1.5*iqr)
max_r = q3 + (1.5*iqr)
min_r,max_r
dataset[dataset["CoapplicantIncome"]<=max_r]
dataset = dataset[dataset["CoapplicantIncome"]<=max_r]
sns.distplot(dataset["CoapplicantIncome"])
plt.show()
from sklearn.preprocessing import FunctionTransformer # function transforming here
ft = FunctionTransformer(func = np.log1p)
ft.fit(dataset[["CoapplicantIncome"]])
ft.transform(dataset[["CoapplicantIncome"]])
dataset["CoapplicantIncome_tf"] = ft.transform(dataset[["CoapplicantIncome"]])
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.distplot(dataset["CoapplicantIncome"])
plt.title("Before")
plt.subplot(1,2,2)
sns.distplot(dataset["CoapplicantIncome_tf"])
plt.title("After")
plt.show()
# Function Transformer
# with outlier
dataset.head(3)
dataset.isnull().sum()
sns.distplot(dataset["CoapplicantIncome"])
plt.show()
## IQR
q1 = dataset["CoapplicantIncome"].quantile(0.25)
q3 = dataset["CoapplicantIncome"].quantile(0.75)
iqr = q3 -q1
min_r = q1 - (1.5*iqr)
max_r = q3 + (1.5*iqr)
min_r,max_r
#dataset[dataset["CoapplicantIncome"]<=max_r]
#dataset = dataset[dataset["CoapplicantIncome"]<=max_r]
sns.distplot(dataset["CoapplicantIncome"])
plt.show()
from sklearn.preprocessing import FunctionTransformer # function transforming here
ft = FunctionTransformer(func = np.log1p)
ft.fit(dataset[["CoapplicantIncome"]])
ft.transform(dataset[["CoapplicantIncome"]])
dataset["CoapplicantIncome_tf"] = ft.transform(dataset[["CoapplicantIncome"]])
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.distplot(dataset["CoapplicantIncome"])
plt.title("Before")
plt.subplot(1,2,2)
sns.distplot(dataset["CoapplicantIncome_tf"])
plt.title("After")
plt.show()
## another method
ft1 = FunctionTransformer(func= lambda x : x**2)
ft1.fit(dataset[["CoapplicantIncome"]])
dataset["CoapplicantIncome_tf1"] = ft1.transform(dataset[["CoapplicantIncome"]])
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.distplot(dataset["CoapplicantIncome"])
plt.title("Before")
plt.subplot(1,2,2)
sns.distplot(dataset["CoapplicantIncome_tf1"])
plt.title("After")
plt.show()
## Feature Selection Techniques
## Backward Elimination and Forward Elimination (using mlxtend)
Feature_Selection :- A feature is an attribute that has an impact on a problem or
is useful for the problem, and choosing the important features for the model is
known as feature selection.
#Forward Elimination (using mlxtend) :-
import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector
dataset = pd.read_csv("diabetes.csv")
dataset.head(3)
x = dataset.iloc[:,:-1]
y = dataset["Outcome"]
x.shape
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
fs = SequentialFeatureSelector(lr,k_features=5,forward=True)
fs.fit(x,y)
fs.feature_names
fs.k_feature_names_
fs.k_score_
# for backward elimination
fs = SequentialFeatureSelector(lr,k_features=5,forward=False)
fs.fit(x,y)
fs.feature_names
fs.k_feature_names_
fs.k_score_
## Train Test Split in Data set ##
import pandas as pd
dataset = pd.read_csv("Boston.csv")
dataset.head(3)
dataset.shape
input_data = dataset.iloc[:,:-1]
output_data = dataset["House_Price"]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(input_data,output_data,test_size=0.25)
x_test
x_train
y_test
y_train
x_train.shape , y_train.shape
x_test.shape , y_test.shape
### Regression Analysis
## LINEAR REGRESSION ALGORITHM (simple linear)
# simple linear Regression - simple linear Regression is a type of Regression
algorithms that models the relationship between a dependent variable and a single
independent variable.
y = mx+c
where y = dependent variable
x = independent variable
m = slope/gradient/coefficient
c = intercept
m = +ve  => theta < 90 degrees
m = -ve  => theta > 90 degrees
m = 0    => theta = 0 degrees
#practical
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("placement.csv")
#dataset = pd.read_csv(r"D:\data set\video\placement.csv")
dataset.head(3)
plt.figure(figsize=(5,3))
sns.scatterplot(x="cgpa",y="package",data=dataset)
plt.show()
dataset.isnull().sum()
dataset.ndim # if the input is 1-dimensional, change it to 2 dimensions
x = dataset[["cgpa"]]
y = dataset["package"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
# y = m*x+c
lr.coef_ #array([0.57425647])
lr.intercept_ # -1.0270069374542108
# y = 0.57425647*x-1.0270069374542108
lr.score(x_test,y_test)*100
lr.predict([[6.89]])
# 0.57425647*6.89-1.0270069374542108 # 2.92962016
y_prd = lr.predict(x)
plt.figure(figsize=(5,4))
sns.scatterplot(x="cgpa",y="package",data=dataset)
plt.plot(dataset["cgpa"],y_prd,c="red" )
plt.legend(["org data","predict line"])
plt.savefig("predict.jpg")
plt.show()
## Multiple linear Regression
Multiple linear Regression is an extension of simple linear regression as it takes
more than one predictor variable to predict the response variable.
y = m1x1+m2x2+m3x3.....+c
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("regression_dataset.csv")
dataset.head(3)
dataset.shape
dataset.isnull().sum()
sns.pairplot(data=dataset)
plt.show()
sns.heatmap(data=dataset.corr(),annot=True)
plt.show()
x = dataset.iloc[:,:-1]
x
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv("multiple_regression_dataset.csv")
dataset.head()
dataset.isnull().sum()
sns.pairplot(data=dataset)
plt.show()
sns.heatmap(data=dataset.corr(),annot=True)
plt.show()
x = dataset.iloc[:,:-1]
y = dataset["Salary"]
x.ndim
dataset.shape
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 # 63.65989707495038
# y = m1x1+m2x2+c
lr.coef_ #array([1676.38278101, -136.23367567])
lr.intercept_ #34875.404040507696
# y_prd = 1676.38278101*Age -136.23367567*Experience + 34875.404040507696
x.columns
lr.predict(x_test)
## POLYNOMIAL REGRESSION
Polynomial Regression is a regression algorithm that models the relationship
between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial.
y = b0 + b1*x1 + b2*x1^2 + ....... + bn*x1^n
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv("ploynomial.csv")
dataset.head(3)
dataset.corr()
plt.scatter(dataset["Level"],dataset["Salary"])
plt.xlabel("Level")
plt.ylabel("Salary")
plt.show()
x = dataset[["Level"]]
y = dataset[["Salary"]]
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=2)
pf.fit(x)
pf.transform(x)
x = pf.transform(x)
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 # 99.999999
# y = m1x1+m2x2^2+c
# y = 1000.10647295*x1 + 500.00158031*x2^2 - 13.512174614
lr.coef_ #array([0. , 1000.10647295, 500.00158031])
lr.intercept_ # -13.512174614
prd =lr.predict(x)
plt.scatter(dataset["Level"],dataset["Salary"])
plt.plot(dataset["Level"],prd,c='red')
plt.xlabel("Level")
plt.ylabel("Salary")
plt.legend(["org","prd"])
plt.show()
test = pf.transform([[45]])
test # array([[1.000e+00, 4.500e+01, 2.025e+03]])
lr.predict(test) #([1057494.47922994])
# Cost Function
1 - A cost function is an important parameter that determines how well a machine
learning model performs for a given dataset.
2 - Cost function is a measure of how wrong the model is in estimating the
relationship between X(input) and Y(output) Parameter.
=> types of cost function
a) Regression Cost Function
b) Classification cost Function
a)Regression Cost Function
Regression models are used to make a prediction for the continuous variable.
1- MSE(Mean Square Error)
2- RMSE(Root Mean Square Error)
3- MAE- (Mean Absolute Error)
4- R^2 Accuracy
b) Classification cost Function
1- Binary Classification Cost Function:
Classification models are used to make predictions of categorical variables,
such as predictions for 0 or 1, Cat or dog, etc.
2- Multi-class Classification Cost Function:
A multi-class Classification cost function is used in the Classification
problems for which instances are allocated to one of more than two classes.
-> Binary Cross Entropy Cost Function or Log Loss Function
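For reference (not spelled out in these notes), the standard log loss formula is:
Log Loss = -(1/n) Σ [ y(i)*log(ŷ(i)) + (1 - y(i))*log(1 - ŷ(i)) ]
For example, if y = 1 and the model predicts ŷ = 0.9, the loss for that sample is -log(0.9) ≈ 0.105.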
=> Regression Cost Function
1. Mean Squared Error:
. Mean Squared Error(MSE) is the mean Squared difference between the actual
and predicted values. MSE penalizes high errors caused by outliers by squaring the
errors.
. Mean Squared error is also known as L2 Loss.
Mean Squared Error = (1/n) Σ (y(i) - ŷ(i))^2
2. Mean Absolute Error:
. Mean Absolute Error(MAE) is the mean absolute difference between the actual
Values and the predicted values.
. MAE is more robust to outliers. This insensitivity to outliers is because it
does not penalize high errors caused by them.
Mean Absolute Error = 1/n E |Y(i) - ^y(i)|
3. Root Mean Squared Error:
. Root Mean Squared Error (RMSE) is the square root of the mean of the squared
differences between actual and predicted values.
. RMSE can be used in situations where we want to penalize high errors but
not as much as MSE does.
Root Mean Squared Error = sqrt( (1/n) Σ (y(i) - ŷ(i))^2 )
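All three metrics are available in scikit-learn; a minimal sketch on made-up values:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.5, 8.0]
mse = mean_squared_error(y_true, y_pred)   # (0.25+0.25+1)/3 = 0.5
mae = mean_absolute_error(y_true, y_pred)  # (0.5+0.5+1)/3 ≈ 0.667
rmse = np.sqrt(mse)                        # sqrt(0.5) ≈ 0.707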
=> How To Find a Best Fit Line:-
     The best fit line is the one whose parameters (slope and intercept) minimise the
chosen cost function, typically found using Ordinary Least Squares or Gradient Descent.
=> L1(Lasso Regularization), L2(Ridge Regularization)
=> Regularization Techniques
. This is a form of regression, that constrains/regularizes or shrinks the
coefficient estimates towards zero.
. This technique discourages learning a more complex or flexible model, so as
to avoid the risk of overfitting.
-> Regularization can achieve this motive with 2 techniques:
- Ridge Regularization /L2
- Lasso Regularization /L1
-> Lasso Regularization /L1:(feature selection work)
. This is a regularization technique used in feature selection using a
Shrinkage method also referred to as the penalized regression method.
. In Lasso Regression the magnitude of the coefficients can shrink exactly to zero, which removes those features from the model.
cost function = Loss + lambda * Σ |w|
. Loss = sum of squared residual
. lambda = penalty
. w = slope of the curve
-> Ridge Regularization /L2:(overfitting reducing technique)
. Ridge Regression,also known as L2 regularization, is an extension to linear
Regression that introduces a regularization term to reduce model complexity and
help prevent overfitting.
. In Ridge Regression the magnitude of the coefficients shrinks close to, but not
exactly, zero.
cost function = Loss + lambda * Σ w^2
. Loss = sum of squared residual
. lambda = penalty
. w = slope of the curve
=> L1(Lasso Regularization), L2(Ridge Regularization) Practical
-> Regularization Techniques (Practical)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
dataset = pd.read_csv(r"houseprice.csv")
dataset.head(3)
plt.figure(figsize=(10,10))
sns.heatmap(data=dataset.corr(),annot=True)
plt.show()
x = dataset.iloc[:,:-1]
y = dataset["price"]
sc = StandardScaler()
sc.fit(x)
sc.transform(x)
x = pd.DataFrame(sc.transform(x),columns=x.columns)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LinearRegression , Lasso , Ridge
from sklearn.metrics import mean_absolute_error,mean_squared_error
import numpy as np
# LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 # 3.2286184...
lr.coef_
print(mean_squared_error(y_test,lr.predict(x_test))) #986919392751.0544
print(mean_absolute_error(y_test,lr.predict(x_test))) #210903.52141518658
print(np.sqrt(mean_squared_error(y_test,lr.predict(x_test)))) #993438.167552996
plt.figure(figsize=(15,5))
plt.bar(x.columns,lr.coef_)
plt.title("LinearRegression")
plt.xlabel("columns")
plt.ylabel("coef")
plt.show()
# Lasso(L1)
la =Lasso(alpha=0.5)
la.fit(x_train,y_train)
la.score(x_test,y_test)*100 # 3.228361
la.coef_
print(mean_squared_error(y_test,la.predict(x_test))) #986921772009.158
print(mean_absolute_error(y_test,la.predict(x_test))) #210908.17447564355
print(np.sqrt(mean_squared_error(y_test,la.predict(x_test)))) #993439.3650390335
plt.figure(figsize=(15,5))
plt.bar(x.columns,la.coef_)
plt.title("Lasso")
plt.xlabel("columns")
plt.ylabel("coef")
plt.show()
# Ridge(L2)
ri = Ridge(alpha = 10)
ri.fit(x_train,y_train)
ri.score(x_test,y_test)*100 #3.2401994171
ri.coef_
print(mean_squared_error(y_test,ri.predict(x_test))) #986801284919.7765
print(mean_absolute_error(y_test,ri.predict(x_test))) #210815.94787357954
print(np.sqrt(mean_squared_error(y_test,ri.predict(x_test)))) #993378.7217973699
plt.figure(figsize=(15,5))
plt.bar(x.columns,ri.coef_)
plt.title("Ridge")
plt.xlabel("columns")
plt.ylabel("coef")
plt.show()
df = pd.DataFrame({"col_name":x.columns,"LinearRegression":lr.coef_,"Lasso":la.coef_,"Ridge":ri.coef_})
## Classification
# Classification Algorithm
. The Classification algorithm is used to identify the category of a new observation
on the basis of training data.
. In Classification, a program learns from the given dataset or observations and
then classifies new observations into a number of classes or groups.
. Such as, Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. classes can be
called as targets/labels or categories.
-> There are two types of Classifications:
a) Binary Classifier: If the Classification problem has only two possible Outcomes,
then it is called as Binary Classifier.
ex- SPAM or NOT SPAM, CAT or DOG, etc.
b) Multi-class Classifier: If a Classification problem has more than two outcomes,
then it is called as Multi-class Classifier.
ex - Classifications of types of crops, Classification of types of music.
=> Types Of ML CLASSIFICATION ALGORITHM
Linear Models:
 . Logistic Regression
 . Support Vector Machines
Non-linear Models:
 . K-Nearest Neighbours
 . Kernel SVM
 . Naive Bayes
 . Decision Tree Classification
 . Random Forest Classification
=> Logistic Regression (Binary Classification)(Practical):
. Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique.
. It is used for predicting the categorical dependent variable using a given set of
independent variables.
. Therefore, the outcome must be a categorical or discrete value. it can be either
Yes or No,0 or 1,True or False, etc. but instead of giving the exact value as 0 and
1, it gives the probabilistic values which lie between 0 and 1.
-> Types of Logistic Regression
on the basis of the categories, Logistic Regression can be classified into three
types:
i) Binomial: In binomial Logistic regression, there can be only two possible types
of the dependent variables, such as 0 or 1,Pass or Fail,etc.
ii) Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as "cat","dogs", or
"sheep".
iii) Ordinal: In ordinal Logistic regression, there can be 3 or more possible
ordered types of dependent variables, such as "low","medium",or "High".
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv(r"Social_Network_Ads.csv")
dataset.drop(columns=["EstimatedSalary"],inplace=True)
dataset.head(5)
plt.figure(figsize=(4,3))
sns.scatterplot(x="Age",y="Purchased",data=dataset)
plt.show()
x = dataset[["Age"]]
y = dataset["Purchased"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 # 91.25
lr.predict([[40]])
plt.figure(figsize=(4,3))
sns.scatterplot(x="Age",y="Purchased",data=dataset)
sns.lineplot(x = "Age",y = lr.predict(x),data=dataset,color = 'red')
plt.show()
=> Logistic Regression (Binary Classification)(Multiple input)(Practical):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv(r"placement.csv")
dataset.head(3)
plt.figure(figsize=(5,4))
sns.scatterplot(x="cgpa",y="score",data=dataset,hue="placed")
plt.legend(loc=1)
plt.show()
x = dataset.iloc[:,:-1]
x.ndim
print(x)
y = dataset["placed"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100
lr.predict([[8.14,6.52]]) # array([1], dtype=int64)
lr.coef_
lr.intercept_
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(x.to_numpy(),y.to_numpy(),clf=lr)
plt.show()
=> Logistic Regression(Binary Classification)(Polynomial input)(practical):
-> Logistic Regression(Polynomial Features):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv(r"Polynomial_classification.csv")
dataset.head(5)
plt.figure(figsize=(5,4))
sns.scatterplot(x="data1",y="data2",data=dataset,hue="output")
plt.show()
x = dataset.iloc[:,:-1]
y = dataset["output"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 #90.0
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(x.to_numpy(),y.to_numpy(),clf=lr)
plt.show()
-> with Polynomial Features
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv(r"Polynomial_classification.csv")
dataset.head(5)
plt.figure(figsize=(5,4))
sns.scatterplot(x="data1",y="data2",data=dataset,hue="output")
plt.show()
x = dataset.iloc[:,:-1]
y = dataset["output"]
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=3)
pf.fit(x)
pf.transform(x)
x = pd.DataFrame(pf.transform(x))
x.shape
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 #95.0
=> Logistic Regression(Multiclass Classification)(practical):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv(r"iris.csv")
dataset.head(3)
dataset["species"].unique() #array(['setosa','versicolor','virginica'],
dtype=object)
sns.pairplot(data=dataset,hue="species")
plt.show()
x = dataset.iloc[:,:-1]
y = dataset["species"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
## OVR
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(multi_class="ovr")
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 #96.66666
# multinomial
lr1 = LogisticRegression(multi_class="multinomial")
lr1.fit(x_train,y_train)
lr1.score(x_test,y_test)*100
lr2 = LogisticRegression()
lr2.fit(x_train,y_train)
lr2.score(x_test,y_test)*100 #100.0
=> CONFUSION MATRIX:
. A Confusion matrix is a simple and useful tool for understanding the performance
of a Classification model, like one used in machine learning or statistics.
. It helps you evaluate how well your model is doing in categorizing things
correctly.
. It is also known as the error matrix.
. The matrix presents the prediction results in a summarized form, showing the total
number of correct predictions and incorrect predictions.
. Accuracy = (TP+TN)/n
. Error = (FP+FN)/n, where n is the total number of predictions
. False Negative: The model has predicted no, but the actual value was Yes, it is
also called as Type-II error.
. False Positive:The model has predicted Yes, but the actual value was No. it is
also called a Type-I error.
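A small worked example with made-up counts: suppose TP = 40, TN = 45, FP = 5 and FN = 10, so n = 100. Then Accuracy = (40+45)/100 = 0.85 and Error = (5+10)/100 = 0.15.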
=> CONFUSION MATRIX (Sensitivity, Precision, Recall, F1-score)
-> Precision: TP/(TP+FP)
     It measures how many of the samples predicted as positive are actually positive.
-> Recall (Sensitivity): TP/(TP+FN)
     It measures how many of the actual positive samples the model is able to identify.
-> F1-Score:
     It is the harmonic mean of precision and recall. It takes both false positives
and false negatives into account, so it performs well on an imbalanced dataset.
F1 Score = 2*Precision*Recall/(Precision+Recall)
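A small worked example with made-up values: if Precision = 0.8 and Recall = 0.6, then F1 Score = (2*0.8*0.6)/(0.8+0.6) = 0.96/1.4 ≈ 0.686.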
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dataset = pd.read_csv(r"placement.csv")
dataset.head(5)
x = dataset.iloc[:,:-1]
x
y = dataset["placed"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100 #100.0
from sklearn.metrics import confusion_matrix,precision_score,recall_score,f1_score
cf = confusion_matrix(y_test,lr.predict(x_test))
cf # array([[10, 0], [0, 10]], dtype=int64)
sns.heatmap(cf,annot=True)
plt.show()
precision_score(y_test,lr.predict(x_test))*100 #100.0
recall_score(y_test,lr.predict(x_test))*100 #100.0
f1_score(y_test,lr.predict(x_test))*100 #100
=> IMBALANCED DATASET
-> Techniques to Handle IMBALANCED Data
i) Random Under Sampling:
     we reduce the majority class so that it has the same number of samples as
the minority class.
ii) Random Over Sampling:
     we increase the size of the minority (inactive) class until it matches the size
of the majority (active) class.
import pandas as pd
dataset = pd.read_csv("Social_Network_Ads.csv")
dataset.head(5)
dataset["Purchased"].value_counts()
x = dataset.iloc[:,:-1]
y = dataset["Purchased"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=42,test_size=0.2)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100
lr.predict([[45,26000]]) #array([0], dtype=int64)
=> imblearn
import pandas as pd
dataset = pd.read_csv("Social_Network_Ads.csv")
dataset.head(5)
x = dataset.iloc[:,:-1]
y = dataset["Purchased"]
dataset["Purchased"].value_counts()
# Random Under Sampling
from imblearn.under_sampling import RandomUnderSampler
ru = RandomUnderSampler()
ru.fit_resample(x,y)
ru_x, ru_y = ru.fit_resample(x,y)
ru_y.value_counts()
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(ru_x,ru_y,random_state=42,test_size=0.2)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100
lr.predict([[45,26000]]) #array([1], dtype=int64)
# Random Over Sampling
from imblearn.over_sampling import RandomOverSampler
ro = RandomOverSampler()
ro_x, ro_y = ro.fit_resample(x,y)
ro_y.value_counts()
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(ro_x,ro_y,random_state=42,test_size=0.2)
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)*100
lr.predict([[45,26000]]) # array([1], dtype=int64)
## NAIVE BAYES
- Naive Bayes is a Classification algorithm based on Bayes theorem.
- Bayes theorem is a result in probability theory that describes the probability of an
event, based on prior knowledge of conditions that might be related to the event.
-> Naive: It is called Naive because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features.
-> Bayes: It is called bayes because it depends on the principle of Bayes' Theorem.
-> Bayes' Theorem: Bayes' theorem is also known as Bayes' Rule or Bayes' law, which
is used to determine the probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
P(a/b) = (P(b/a) * P(a)) / P(b)
where:
- p(a/b) is Posterior probability: probability of hypothesis A on the observed
event B.
- P(b/a) is Likelihood probability: probability of the evidence given that the
hypothesis is true.
- p(a) is Prior Probability: Probability of hypothesis before observing the
evidence.
- p(b) is Marginal Probability: Probability of evidence.
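A small worked example with made-up numbers: if p(a) = 0.3, p(b/a) = 0.8 and p(b) = 0.6, then p(a/b) = (0.8 * 0.3) / 0.6 = 0.4.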
=> Types of Naive Bayes Model:
There are three types of Naive Bayes Model,
which are given below
. Gaussian
. Multinomial
. Bernoulli
i) Gaussian Naive Bayes:
- Assumes that continuous features follow a Gaussian (normal) distribution.
- Suitable for features that are continuous and have a normal distribution.
ii) Bernoulli Naive Bayes:
- Assumes that the features are binary (Boolean) variables.
- Suitable for data that can be represented as binary features, such as document
Classification problems where each term is either present or absent.
iii) Multinomial Naive Bayes:
- Assumes that features follow a multinomial distribution.
- Typically used for discrete data, such as text data, where each feature
represents the frequency of a term.
-> Practical:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.plotting import plot_decision_regions
dataset = pd.read_csv(r"placement.csv")
dataset.head(5)
sns.kdeplot(data=dataset["cgpa"])
plt.show()
sns.kdeplot(data=dataset["score"])
plt.show()
plt.figure(figsize=(4,3))
sns.scatterplot(x="cgpa",y="score",data=dataset,hue="placed")
plt.show()
x = dataset.iloc[:,:-1]
y = dataset["placed"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=42,test_size=0.2)
from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB
gnb = GaussianNB()
gnb.fit(x_train,y_train)
gnb.score(x_test,y_test)*100 #100.0
gnb.score(x_train,y_train)*100 #97.5
gnb.predict([[6.17,5.17]]) # array([0],dtype=int64) # 6.17 5.17 0
plt.figure(figsize=(4,3))
plot_decision_regions(x.to_numpy(),y.to_numpy(),clf=gnb)
plt.show()
mnb = MultinomialNB()
mnb.fit(x_train,y_train)
mnb.score(x_test,y_test)*100 #75.0
mnb.score(x_train,y_train)*100 #73.75
plt.figure(figsize=(4,3))
plot_decision_regions(x.to_numpy(),y.to_numpy(),clf=mnb)
plt.show()
bnb = BernoulliNB()
bnb.fit(x_train,y_train)
bnb.score(x_test,y_test)*100 #50.0
bnb.score(x_train,y_train)*100 #50.0
plt.figure(figsize=(4,3))
plot_decision_regions(x.to_numpy(),y.to_numpy(),clf=bnb)
plt.show()
#==> DECISION TREE (REGRESSION):
. Decision Tree is a Supervised learning technique that can be used for both
Classification and Regression problems, but mostly it is preferred for solving
Classification problems.
. In order to build a tree, we use the CART algorithm, which stands for
Classification and Regression Tree algorithm.
-> important Terminology related to Decision Trees:
- Root Node: It represents the entire population or sample and this further
gets divided into two or more homogeneous sets.
- Splitting: It is a process of dividing a node into two or more sub-nodes.
- Decision Node: When a sub-node splits into further sub-nodes, then it is
called the decision node.
- Leaf / Terminal Node: Nodes that do not split further are called Leaf or Terminal nodes.
- Pruning: When we remove sub-nodes of a decision node , this process is
called pruning. You can say the opposite process of Splitting.
- Branch / sub-Tree: A subsection of the entire tree is called branch or sub-
tree.
- Parent and Child Node: A node which is divided into sub-nodes is called the
parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
-> Attribute Selection Measures (ASM)
     - Using these measures, we can easily select the best attribute for the nodes of
the tree. There are two popular techniques for ASM, which are:
. Information Gain
. Entropy / Gini Index
=> Entropy : Entropy is a metric to measure the impurity in a given attribute. It
specifies randomness in data.
Entropy(s) = -P(Yes)*log2(P(Yes)) - P(no)*log2(P(no))
where:
s = Total number of samples
P(Yes) = probability of Yes
P(no) = probability of no
=> Information Gain : Information gain is the measurement of changes in entropy
after the segmentation of a dataset based on an attribute. It calculates how much
information a feature provided us about a class.
Information Gain = Entropy(S) - [(Weighted Avg) *Entropy(each feature)]
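A minimal sketch of both measures in Python (the split counts are made-up for illustration):
import numpy as np
def entropy(p_yes, p_no):
    # entropy is 0 for a pure node and 1 bit for a 50/50 split
    return -sum(p*np.log2(p) for p in (p_yes, p_no) if p > 0)
# parent node: 9 Yes / 5 No ; a split gives children (6 Yes, 2 No) and (3 Yes, 3 No)
parent = entropy(9/14, 5/14)                          # ~0.940
child1 = entropy(6/8, 2/8)                            # ~0.811
child2 = entropy(3/6, 3/6)                            # 1.0
info_gain = parent - ((8/14)*child1 + (6/14)*child2)  # ~0.048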
## DECISION TREE (Classification)(Practical):