
ASSIGNMENT-1

ABHISHEK SHRINGI

03-03-2021

CLL788

PROF. HARIPRASAD KODAMANA


Question 1:

PART-(A):
For “data_1”

Scatter plot

Inference: From the scatter plot we can infer that the given data forms a cluster within a certain range of values, and the points lying far away from this cluster are outliers.

Histograms

Boxplot showing x and y distribution

Heatmap
Question 1:

PART-(A):
For “data_3”

Scatter plot

Inference: From the scatter plot we can infer that the given data forms a cluster within a certain range of values, and the points lying far away from this cluster are outliers.

Histograms
Question 1:

PART-(C):
For “data_1”

Parameter   x                y
Mean        4.9397           5.0429
Median      4.9242           5.0747
Variance    0.9718           1.014
Range       (2.373, 8.117)   (2.181, 8.190)

Percentiles (10th to 100th):
10%    3.68476102   3.73504038
20%    4.08570612   4.21923318
30%    4.4123583    4.55248202
40%    4.700721     4.87424146
50%    4.9242784    5.07476839
60%    5.18017309   5.36513593
70%    5.46958487   5.57664436
80%    5.78852584   5.87166202
90%    6.20755249   6.21449875
100%   8.1170447    8.19010885

Quartiles (25th, 50th, 75th, 100th):
25%    4.30398749   4.33146421
50%    4.9242784    5.07476839
75%    5.60721395   5.6823799
100%   8.1170447    8.19010885

Skewness    -0.00136         -0.1207
Kurtosis    -0.2039          0.09080
For “data_3”

Parameter   x                  y
Mean        5.0824             4.9528
Median      5.0548             4.9788
Variance    2.6572             2.629488617717448
Range       (-1.458, 12.267)   (-1.004, 10.58)

Percentiles (10th to 100th):
10%    3.43512982    3.28478833
20%    4.09472424    3.95274238
30%    4.50069037    4.3584519
40%    4.81775713    4.69351167
50%    5.05487549    4.97884708
60%    5.38188447    5.25595864
70%    5.70739195    5.56424354
80%    6.04270545    5.94105513
90%    6.53338054    6.59228622
100%   12.26702453   10.58925244

Quartiles (25th, 50th, 75th, 100th):
25%    4.31898111    4.21810056
50%    5.05487549    4.97884708
75%    5.89160317    5.7826677
100%   12.2670245    10.58925244

Skewness    0.0413             -0.0825
Kurtosis    3.30098            2.417
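The parameters tabulated above can be reproduced with standard numpy and scipy routines. A minimal sketch follows; the file name "data_1" and its two-column layout are assumptions, and since the assignment does not state whether raw or excess kurtosis was reported, note that scipy defaults to excess (Fisher) kurtosis:

```python
import numpy as np
from scipy import stats

# Hypothetical loading; data_1 is assumed to be a two-column (x, y) text file.
x = np.loadtxt("data_1", usecols=0)

print("mean     ", np.mean(x))
print("median   ", np.median(x))
print("variance ", np.var(x, ddof=1))                      # sample variance
print("range    ", (x.min(), x.max()))
print("deciles  ", np.percentile(x, range(10, 101, 10)))   # 10th-100th percentiles
print("quartiles", np.percentile(x, [25, 50, 75, 100]))
print("skewness ", stats.skew(x))
print("kurtosis ", stats.kurtosis(x))                      # excess (Fisher) by default
```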
Question 1:

PART-(D):

By the Standard Deviation approach (for data_3):

Outliers in x          Outliers in y
10.10393747643514      10.43902165955905
10.67798532964827      10.55061243676815
12.2670245277286       10.58925243952102
11.44965456902049      10.11551641326384
10.28782093813039      -0.5334956955315495
-1.458402520742969     -0.2557338130229054
-1.171853375119153     -1.004100360593079
-0.763132811291513     -0.5528195925039532
-0.2198563833130491
-0.2439263484483771
By the Mean Deviation (MAD) approach (for data_3):

Outliers in x          Outliers in y
10.10393747643514      9.745989964192137
9.14841306110033       9.265264240840457
9.630313376450168      10.43902165955905
9.109233422770801      9.219071559753377
9.37144942459267       9.717401494534483
9.005201225564582      8.5540601585995
9.039507424436444      10.55061243676815
10.67798532964827      9.749970745029252
9.510497893324324      9.127858533076473
12.2670245277286       8.581245337617629
9.5456448855292        8.529730294461498
9.375609831291062      10.58925243952102
8.780482242073463      8.693157145943976
11.44965456902049      9.431652444355333
9.321302250247287      10.11551641326384
10.28782093813039      0.739593465779022
-1.458402520742969     1.028275835883979
-1.171853375119153     1.336612395334499
1.172063763666684      1.289365241997683
1.180680541173682      0.357106853679473
0.7916468945614012     0.303137254711518
1.400723237983714      0.580120633348619
-0.763132811291513     1.271821699448418
1.232122283140872      0.365306465834663
0.7725850670343857     0.432114885726590
1.179109482383718      0.468033364310671
-0.2198563833130491    -0.53349569553154
0.7398399056096148     1.154781275984739
-0.2439263484483771    -0.25573381302290
0.5680693847459939     0.434412418225085
1.083946032586385      -1.00410036059307
0.9053918140205244     0.747056391444499
1.286385738350685      1.047516482080888
                       -0.55281959250395
                       0.957823807905711
                       1.421603535576362

Inference:

From the results obtained we can see that the number of outliers detected by the MAD approach is greater. This is because the standard deviation approach relies on the mean of the data, which is itself pulled around by the outliers. The MAD approach, in contrast, uses the median of the data; it is therefore less affected by outliers and can detect them more easily. A sketch of the two rules is given below.
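As a rough illustration of the two detection rules (a minimal sketch; the exact cutoffs used in the assignment are not stated, so a conventional 3-sigma rule and a modified z-score cutoff of 3.5 are assumed):

```python
import numpy as np

def outliers_std(data, k=3.0):
    """Flag points lying more than k standard deviations from the mean."""
    mean, std = np.mean(data), np.std(data)
    return data[np.abs(data - mean) > k * std]

def outliers_mad(data, k=3.5):
    """Flag points whose modified z-score, based on the median absolute
    deviation (MAD), exceeds k; the median is robust to the outliers themselves."""
    median = np.median(data)
    mad = np.median(np.abs(data - median))
    modified_z = 0.6745 * (data - median) / mad
    return data[np.abs(modified_z) > k]

# Hypothetical usage on the x column of data_3:
# x = np.loadtxt("data_3", usecols=0)
# print(outliers_std(x), outliers_mad(x))
```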

Question 2: Batch Gradient Descent

STEPS DONE:

1) The data was read from the file “q1”.
2) This data was split into input (x) and output (y).
3) A value was assigned to the learning rate, and the parameters (theta) were initialized to 1.
4) Batch gradient descent was then implemented and iterated until the minimum cost was reached.
5) The cost function was plotted against the iteration count for different learning rates, as given below.

Inference: We see that the cost decreases with each iteration, showing that it gradually converges to a particular value. On reaching this value the cost becomes constant, which stops the batch gradient descent algorithm and yields the corresponding values of theta. These thetas are then used to fit a straight line to the data. As seen from the plot on the left, the line fits the data quite well, showing that the model works well. A sketch of the update rule is given below.
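A minimal sketch of the batch update described above (the learning rate, iteration count, and the layout of “q1” are assumptions; the assignment's own code is not reproduced here):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Fit h(x) = theta0 + theta1*x, updating theta with ALL examples per step."""
    m = len(y)
    X_b = np.c_[np.ones(m), X]            # prepend a bias column of ones
    theta = np.ones(X_b.shape[1])         # parameters initialized to 1
    costs = []
    for _ in range(n_iters):
        error = X_b @ theta - y
        theta -= alpha * (X_b.T @ error) / m
        costs.append((error @ error) / (2 * m))   # squared-error cost J(theta)
    return theta, costs

# Hypothetical usage, assuming "q1" holds two whitespace-separated columns:
# data = np.loadtxt("q1")
# theta, costs = batch_gradient_descent(data[:, 0], data[:, 1])
```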

Question 2: Stochastic Gradient Descent

STEPS DONE:

1) The data was read from the file “q1”.
2) This data was split into input (x) and output (y).
3) A value was assigned to the learning rate, and the parameters (theta) were initialized to 1.
4) Stochastic gradient descent was then implemented and iterated until the minimum cost was reached.
5) The cost function was plotted against the iteration count for different learning rates, as given below.

Inference: We see that the cost decreases with each iteration, showing that it gradually converges to a particular value. However, on comparing this with batch gradient descent, we see that the path followed by stochastic gradient descent is quite noisy. This is in accordance with the expected results, since in this algorithm we update the hypothesis after each example, and as a result we may not always follow the optimum path.

Once the algorithm converges we obtain the corresponding values of theta. These thetas are then used to fit a straight line to the data. As seen from the plot on the left, the line fits the data quite well, showing that the model created is working. A sketch of the per-example update is given below.
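A minimal sketch of the per-example update that produces the noisy convergence path (the hyperparameters and shuffling scheme are assumptions):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=50, seed=0):
    """Fit a line by updating theta after EACH example, not after a full pass."""
    rng = np.random.default_rng(seed)
    m = len(y)
    X_b = np.c_[np.ones(m), X]
    theta = np.ones(X_b.shape[1])
    costs = []
    for _ in range(n_epochs):
        for i in rng.permutation(m):           # visit examples in random order
            error = X_b[i] @ theta - y[i]
            theta -= alpha * error * X_b[i]    # single-example gradient step
        residual = X_b @ theta - y
        costs.append((residual @ residual) / (2 * m))
    return theta, costs
```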

Question 2: Least Squares Closed-Form Solution

STEPS DONE:

1) The data was read from the file “q1”.
2) This data was split into input (x) and output (y).
3) The least squares solution was then computed according to the following closed-form expression (the normal equation), directly giving the optimum value of the parameters:

   θ = (XᵀX)⁻¹ Xᵀy

4) The resulting straight-line fit was plotted as given below.

Inference: As expected, the model obtained through this method gives the best fit, since here we use a theoretical equation to arrive directly at the optimum value. This is also expected to be the fastest approach, owing to the small amount of computation involved and the absence of loops within the code. A sketch is given below.
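A minimal sketch of the normal-equation solution; np.linalg.solve is used here in place of an explicit matrix inverse, a standard numerically safer equivalent:

```python
import numpy as np

def least_squares_closed_form(X, y):
    """theta = (X^T X)^{-1} X^T y, computed without any iterative loop."""
    X_b = np.c_[np.ones(len(y)), X]                  # bias column
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)   # solves (X^T X) theta = X^T y
```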

Assignment PAGE 11
Question 2b:

x        y        w            h
6.2101   17.612   0.023960093  17.612
5.6277   9.2302   0.000504599  9.078502328
8.6186   13.762   0.113718462  13.52965597
7.1032   11.954   0.639492951  10.80446502

theta0        theta1
0             0
0.004219852   0.0262057
0.004265662   0.026463506
0.019651378   0.159066843
0.088745171   0.649853868

Question 2c:

Using suitable Python libraries, I implemented the following regression algorithms:

Ridge Regression:

Lasso Regression:

Elastic Net Regression:

Inference: From the three plots above we can see that each of the three algorithms has almost the same performance and seems to fit the data equally well. It was expected that lasso regression would do better; however, since the given dataset has only one input feature, lasso regression underperforms (as can be observed by calculating the model score). Lasso regression shrinks some coefficients to zero, effectively reducing the number of input parameters, and therefore works well for datasets with a large number of input features. Ridge regression, on the other hand, is more suitable for datasets with fewer data points. A sketch of the three models is given below.
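A minimal sketch of the three scikit-learn models (the regularization strengths and the synthetic data are assumptions; the assignment's actual dataset and values are not reproduced here). The model score mentioned above is the R² value returned by .score():

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Hypothetical single-feature data; replace with the assignment's dataset.
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0 + rng.normal(0.0, 1.0, 50)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: coef={model.coef_}, R^2={model.score(X, y):.4f}")
```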

Question 3:

STEPS DONE:

1) The data was read from the files “q2_train” and “q2_test”.
2) This data was split into input (x) and output (y).
3) Since the input involves more than one feature (independent variable), feature scaling was performed to avoid skewing the results. This was done using the ‘MinMaxScaler()’ function from the sklearn library.
4) A value was assigned to the learning rate, and the parameters (theta) were initialized to 1.
5) The sigmoid function was computed (see its definition and the sketch after this list).
6) Batch gradient descent was then implemented and iterated until the minimum cost was reached.
7) The cost function was plotted against the iteration count for different learning rates, as given below.
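The sigmoid referenced in step 5 is σ(z) = 1 / (1 + e⁻ᶻ). A minimal sketch of the training loop follows (the column layout of “q2_train” is an assumption; the learning rate matches the α = 0.1 reported below):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, alpha=0.1, n_iters=500):
    """Batch gradient descent on the logistic (cross-entropy) cost."""
    m = len(y)
    X_b = np.c_[np.ones(m), X]        # bias column
    theta = np.ones(X_b.shape[1])     # parameters initialized to 1
    for _ in range(n_iters):
        h = sigmoid(X_b @ theta)      # predicted probabilities
        theta -= alpha * (X_b.T @ (h - y)) / m
    return theta

# Hypothetical usage, assuming two feature columns followed by a 0/1 label:
# train = np.loadtxt("q2_train")
# X = MinMaxScaler().fit_transform(train[:, :2])
# theta = logistic_regression_gd(X, train[:, 2])
```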

OBSERVATIONS:

1) For learning rates less than or equal to 0.1, the cost function shows excellent convergence. The graph below shows cost versus iteration for the batch LMS implementation (α = 0.1).

Clearly from the above graph, it can be inferred that convergence occurs after around 50 iterations. The values of the parameters after the complete set of iterations were found to be θ = [−3.32681817, 5.82141106, 1.10280864].

2) In order to compare the actual and the predicted values, a contour was plotted between the input parameters for the training data, as shown below:

It can be inferred from the above graph that, for the training data, most of the predicted and actual labels coincide, and hence it can be said that the logistic regression model fairly predicts the correct output for a given set of inputs.

For the test data, predictions were made and the output is represented on a similar contour plot between the input parameters, as shown below:
