Revision

The document discusses various data visualization techniques suitable for different types of variables, including histograms, bar charts, and scatter plots. It also covers clustering methods, the Self-Organizing Map, evaluation metrics for clustering, and the Apriori algorithm for mining frequent itemsets. Additionally, it addresses model evaluation techniques, overfitting, and the significance of predictors in regression analysis.

Uploaded by

lawjavier3
Copyright: © All Rights Reserved

(A) A single continuous variable (e.g. height of a student): A histogram or
a box plot would be appropriate here. These plots can provide information
about the central tendency, dispersion, and skewness of the data.

(B) A single categorical variable (e.g. the days of a week): A bar chart or a
pie chart would work well in this case. These charts can show the
frequency or proportion of each category.

(C) A single continuous variable (e.g. personal income) and a single
categorical variable (e.g. gender): A box plot or a violin plot can be used
here. These plots can show the distribution of the continuous variable
across different categories.

(D) Two continuous variables (e.g. height and weight of a student): A
scatter plot would be the best choice for this type of data. It can show the
relationship or correlation between the two variables.

(E) Two categorical variables (e.g. highest qualification and gender): A
stacked bar chart or a mosaic plot can be used in this case. These plots
can show the relationship between two categorical variables.

(A) Partitional Clustering vs Hierarchical Clustering:


Connections: Both are methods used to group similar objects into clusters.
They aim to maximize the similarity within a cluster and minimize the
similarity between different clusters.

Differences:

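Initialization: Partitional clustering (like k-means) starts with a random
partitioning of data points into clusters and iteratively refines the clusters.
Hierarchical clustering builds its clusters step by step: the agglomerative
form starts by treating each data point as its own cluster and successively
merges the closest clusters, while the divisive form starts with all points
in one cluster and successively splits it.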

Number of Clusters: In partitional clustering, the number of clusters needs
to be specified in advance. In hierarchical clustering, it doesn’t need to be
specified at the start, and an entire hierarchy of clusters is created.

Flexibility: Hierarchical clustering provides more flexibility than partitional
clustering as it offers the ability to visualize the data clusters using a
dendrogram.

(B) Self-Organizing Map (SOM) and its “Topology Preserving Properties”:

A Self-Organizing Map (SOM) is a type of artificial neural network that is
trained using unsupervised learning to produce a low-dimensional
representation of the input space of the training samples, called a map.

The “topology preserving” property means that the SOM maintains the
topological properties of the input space. This means that if certain
patterns are close together in the input space, they are also close
together in the map, preserving the similarity relationships of the original
data.

(C) Parameters in Training SOM:

Grid size: This defines the dimensions of the grid on which the map is
created. It affects the granularity of the map.

Learning Rate: This controls how much the weights are adjusted at each
step. A high learning rate can speed up training but may lead to less
accurate results.

Neighborhood Radius: This determines the extent to which neighboring
nodes are updated for each training sample. A larger radius means more
nodes are considered neighbors.

(D) Determining Better Clustering Result:

To determine whose clustering result is better, we can use various
evaluation metrics such as Silhouette Score (higher is better), Davies-Bouldin
Index (lower is better), or Dunn Index (higher is better). These metrics
consider both the compactness of the clusters (how close the points within a
cluster are) and the separation between different clusters. The choice of
metric can depend on the specific requirements of the task. For example, if
we care more about the separation of clusters, we might choose a metric that
emphasizes that aspect.
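
As an illustration of how such a metric compares two results, here is a minimal sketch of the Silhouette Score using only the Python standard library. The four points and the two labelings are made up for the example, and Euclidean distance is an assumption.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(p, members, exclude_self=False):
        # Average distance from p to the members of one cluster.
        total = sum(math.dist(p, q) for q in members)
        count = len(members) - (1 if exclude_self else 0)
        return total / count

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean_dist(p, own, exclude_self=True)  # compactness
        b = min(mean_dist(p, members)             # separation
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical data: two tight groups far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
result_1 = [0, 0, 1, 1]   # groups nearby points together
result_2 = [0, 1, 0, 1]   # mixes the two groups

score_1 = silhouette_score(points, result_1)
score_2 = silhouette_score(points, result_2)
```

The labeling with the higher score (here `result_1`) is the one this metric prefers.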
(A) Apriori Algorithm: The Apriori algorithm is a popular algorithm for
mining frequent itemsets for boolean association rules. The main steps
are:

Set Generation: Start by forming the set of all individual items whose
support is at least the minimum support. These are the frequent 1-itemsets
(L1).

Candidate Generation: Generate the candidate k-itemsets (Ck) using the
frequent (k-1)-itemsets found in the previous step (Lk-1). This is done by
joining Lk-1 with itself and pruning candidates that have an infrequent
subset.

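Pruning: Any (k-1)-itemset that is not frequent cannot be a subset of a
frequent k-itemset, so remove every candidate that contains such a subset.
Then scan the database to count the support of the remaining candidates;
those that meet the minimum support form Lk.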

Iteration: Repeat the candidate generation and pruning steps until no
more frequent k-itemsets can be found.
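
The steps above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration rather than a reference implementation; the function name, the example transactions and the 0.6 threshold are all assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # L1: frequent 1-itemsets
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}
    frequent = {s: support(s) for s in current}

    k = 2
    while current:
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Count step: keep candidates that meet the minimum support.
        current = {c for c in candidates if support(c) >= min_support}
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

# Hypothetical transactions, for illustration only.
example = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
result = apriori(example, min_support=0.6)
```

With these transactions every single item and every pair is frequent, while {A, B, C} appears in only two of five transactions and is dropped.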

(B) Creating Transactions: Let’s assume we have 10 transactions in total.
For the rule {A, D} => {F, H} to have a support of 0.3, 30% of the
transactions must contain {A, D, F, H}, which is 3 transactions. For the
rule to have a confidence of 0.6, 60% of the transactions that contain
{A, D} must also contain {F, H}, so 3 / 0.6 = 5 transactions must contain
{A, D}. Here is one possible set of transactions:

{A, D, F, H}

{A, D, F, H}

{A, D, F, H}

{A, D, B}

{A, D}

{B, F, H}

{B, F}

{B, H}

{F, H}

{A, B, F, H}
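
A candidate answer can be checked by counting directly. The sketch below does this for one ten-transaction set in which exactly three transactions contain {A, D, F, H} and exactly five contain {A, D}; the helper names are chosen for the example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(lhs and rhs together) / support(lhs)."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)

transactions = [
    {"A", "D", "F", "H"},
    {"A", "D", "F", "H"},
    {"A", "D", "F", "H"},
    {"A", "D", "B"},
    {"A", "D"},
    {"B", "F", "H"},
    {"B", "F"},
    {"B", "H"},
    {"F", "H"},
    {"A", "B", "F", "H"},   # contains A but not D
]

rule_support = support({"A", "D", "F", "H"}, transactions)
rule_confidence = confidence({"A", "D"}, {"F", "H"}, transactions)
```

Note that a transaction containing all of A, D, F and H counts toward both the numerator and the denominator, so adding a fourth such transaction would raise the support to 0.4.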

(C) Issue with “Confidence” Measure: The confidence measure has a
drawback: it only considers the popularity of the itemset X and does not
take into account the popularity of Y. This could lead to misleading results
if Y is very common in the dataset. For example, if bread is bought
frequently, the rule {eggs} => {bread} might have a high confidence not
because eggs and bread are strongly associated but simply because bread
is bought frequently.

To address this issue, other measures like lift or leverage are used. The lift
of a rule is the ratio of the observed support of X and Y together to the
support expected if X and Y were independent, i.e. support(X ∪ Y) /
(support(X) * support(Y)). A lift value greater than 1 indicates that
itemsets X and Y are bought together more often than independence would
predict, which gives us more information than the confidence measure alone.
Similarly, leverage computes the difference between the observed frequency
of X and Y appearing together and the frequency that would be expected if X
and Y were independent, i.e. support(X ∪ Y) - support(X) * support(Y). A
leverage value of 0 indicates independence.
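
To make the eggs and bread example concrete, the following sketch computes all three measures from supports. The support values are hypothetical, picked so that eggs and bread are exactly independent.

```python
def rule_metrics(supp_x, supp_y, supp_xy):
    """Confidence, lift and leverage of X => Y from three supports."""
    confidence = supp_xy / supp_x
    lift = supp_xy / (supp_x * supp_y)
    leverage = supp_xy - supp_x * supp_y
    return confidence, lift, leverage

# Hypothetical supports: eggs in 20% of baskets, bread in 90%, both in 18%.
confidence, lift, leverage = rule_metrics(0.20, 0.90, 0.18)
```

Here the confidence is 0.9, yet the lift is 1 and the leverage is 0, so the high confidence only reflects how common bread is.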
(A) Procedure for k-NN Classifier:

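Preprocessing: Normalize the M training samples so that all features are on
a comparable scale, and keep the normalization parameters so that the same
transformation can be applied to the test samples.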

Choosing k: Use cross-validation on the training set to find an optimal ‘k’
value that minimizes error.

Distance Metric: Select an appropriate distance metric (e.g., Euclidean,
Manhattan) for measuring distances between samples.

Training: Use the M labeled samples to train the k-NN model by storing
them in memory.

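Testing: Apply the classifier to the N hidden samples and compare the
predicted labels with the actual labels to estimate its performance. The
hidden samples should not be used to adjust ‘k’ or other parameters, since
that would make the performance estimate optimistic; tuning belongs to the
cross-validation step on the training set.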
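
The procedure can be sketched as follows with the standard library only. Min-max scaling, Euclidean distance and majority voting are assumptions made for the example, and the four training rows are made up.

```python
import math
from collections import Counter

def fit_scaler(rows):
    """Per-feature (min, max), computed from the training rows only."""
    return [(min(col), max(col)) for col in zip(*rows)]

def scale(row, scaler):
    # Min-max scaling; a constant feature maps to 0.0.
    return [(x - lo) / (hi - lo) if hi > lo else 0.0
            for x, (lo, hi) in zip(row, scaler)]

def knn_predict(train_X, train_y, query, k):
    """Majority label among the k nearest training samples (Euclidean)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical training data with features on very different scales.
train_raw = [[1.0, 200], [1.2, 220], [3.0, 800], [3.2, 780]]
train_y = ["a", "a", "b", "b"]

scaler = fit_scaler(train_raw)                   # "training" = store the data
train_X = [scale(row, scaler) for row in train_raw]

# A test sample is scaled with the training parameters, then classified.
prediction = knn_predict(train_X, train_y, scale([1.1, 210], scaler), k=3)
```

In practice `k` would be chosen by cross-validation on the training rows before any hidden sample is scored.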

(B) Total Number of Weights in MLP: The total number of weights in this
MLP can be calculated as follows:

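Number of weights from input layer to hidden layer = 45 features * 20
neurons = 900

Number of weights from hidden layer to output layer = 20 neurons * 3
targets = 60

Total number of weights = 900 + 60 = 960 (excluding bias terms; counting one
bias per hidden and output neuron would add 20 + 3 = 23, for 983 parameters)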
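
The same count generalizes to any fully connected layer layout. In this small sketch the `include_bias` flag is an assumption, added for the case where bias terms are counted as parameters.

```python
def count_parameters(layer_sizes, include_bias=False):
    """Number of connection weights in a fully connected MLP."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out          # one weight per connection
        if include_bias:
            total += n_out             # one bias per receiving neuron
    return total

weights_only = count_parameters([45, 20, 3])
with_biases = count_parameters([45, 20, 3], include_bias=True)
```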

(C) Computing Output for MLP: Given the threshold function, the output for
each sample can be computed as follows (assuming the threshold µ is 0.5
for both neurons):

Sample1:

Hidden neuron input: (1.5*+1)+(1.2*+1)=2.7; Output: f(2.7)=1 (since
2.7 > µ=0.5)

Output neuron input: (1*-2)+(0*+1)= -2, where the first term uses the hidden
neuron's output f(2.7)=1 rather than its input 2.7; Output: f(-2)=0 (since
-2 < µ=0.5)

Repeat similar calculations for Sample2, Sample3, and Sample4.

(D) Linearity of MLP with Linear Activation Function: If all neurons in an
MLP network use a linear activation function f(x) = x, then the relationship
between the input and output of this network will remain linear. This is
because each layer then computes a linear function of its input (affine, if
bias terms are present), and the composition of linear functions is also a
linear function. Therefore, no matter how many hidden layers the MLP has, if
all the neurons use a linear activation function, the overall network
represents a linear transformation from the input to the output.
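
A small numerical check of this argument, assuming a bias-free network with made-up weight matrices `W1` and `W2`: applying the two layers in turn gives the same result as applying the single matrix `W2 W1`.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in A]

# Hypothetical weights: 2 inputs -> 3 hidden neurons -> 1 output.
W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
W2 = [[1.0, -2.0, 0.5]]
x = [2.0, -1.0]

two_layer = matvec(W2, matvec(W1, x))     # f(x) = x at every neuron
collapsed = matvec(matmul(W2, W1), x)     # one equivalent linear layer
```

Both computations give the same output, so the hidden layer adds no expressive power in this setting.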

(A) Hold-out and 10-fold Cross-Validation (CV):

Hold-out: This method involves splitting the dataset into two sets: a
training set and a testing set. The model is trained on the training set and
evaluated on the testing set. The strength of this method is its simplicity
and speed. However, its weakness is that the evaluation may vary
significantly depending on how the data is split.

10-fold CV: This method involves splitting the dataset into 10 equal parts,
or ‘folds’. The model is then trained 10 times, each time using 9 folds for
training and a different fold for testing. The final model performance is the
average of the performances of the 10 models. The strength of this
method is that it gives a more robust estimate of the model performance
than the hold-out method. However, it is computationally more expensive.
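
The fold bookkeeping behind k-fold CV can be sketched as below. The `evaluate` callback is hypothetical: it stands for "train on these indices, test on those, return a score".

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

def cross_val_score(n_samples, evaluate, k=10):
    """Average of evaluate(train_idx, test_idx) over the k folds."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(25, k=10))
```

A hold-out split corresponds to using just one of these (train, test) pairs, which is why its estimate depends more heavily on how the data happens to be divided.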

(B) Overfitting Problem: Overfitting occurs when a model learns the
training data too well. It captures not only the underlying patterns but also
the noise and outliers in the training data. As a result, while the model
performs well on the training data, it performs poorly on unseen data (test
data) because it fails to generalize.

(C) Random Forest Method: Random Forest is an ensemble learning method
that operates by constructing multiple decision trees at training time, each
grown on a bootstrap sample of the data with a random subset of predictors
considered at each split, and outputting the mean prediction of the
individual trees for regression problems (or the majority vote for
classification). It improves upon a single decision tree by reducing
overfitting and providing a more robust prediction by averaging the results
of many trees.

(D) Inappropriateness of Linear Regression for Binary/Categorical
Response Variable: Linear regression is not suitable for binary/categorical
response variables because it may predict values outside the range of 0
and 1 for binary response variables. This does not make sense for
binary/categorical outcomes. Logistic regression is a more appropriate
choice in this case as it predicts the probability of the outcome.
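
The range problem is easy to see numerically. In this sketch the coefficients `b0` and `b1` are hypothetical values chosen only for illustration; the same linear predictor is used directly and then passed through the logistic function.

```python
import math

def linear_prediction(x, b0=-0.5, b1=0.4):
    # Hypothetical coefficients, chosen only for illustration.
    return b0 + b1 * x

def logistic_prediction(x, b0=-0.5, b1=0.4):
    # Same linear predictor, passed through the logistic function.
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

for x in (-2, 0, 5):
    print(x, round(linear_prediction(x), 3), round(logistic_prediction(x), 3))
```

The linear predictions fall below 0 and above 1 for some inputs, while the logistic predictions always stay strictly between 0 and 1.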
(E) Optimisation Problem for Soft Margin Classifier:

(i) The variable M in the optimization problem for the soft margin classifier
represents the margin of the classifier. The goal of the optimization
problem is to maximize this margin.

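(ii) The variable C is a tuning parameter that controls the trade-off
between maximizing the margin (M) and limiting the slack variables (ϵ_i).
Its effect depends on how the problem is written. If C multiplies the sum of
the ϵ_i as a penalty in the objective, a larger C places more emphasis on
minimizing that sum, which can lead to a smaller margin if the data is not
linearly separable. If C instead bounds the sum of the ϵ_i in a constraint,
a larger C tolerates more margin violations and leads to a wider margin.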
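
The optimization problem from the question is not reproduced in these notes. One common way to write a soft margin classifier in which M is maximized directly is shown below; treating it as the intended formulation is an assumption, as are the symbols β_j, x_ij, y_i, n and p.

```latex
\begin{aligned}
\max_{\beta_0,\ldots,\beta_p,\;\epsilon_1,\ldots,\epsilon_n,\;M}\quad & M \\
\text{subject to}\quad & \sum_{j=1}^{p}\beta_j^{2} = 1, \\
& y_i\Bigl(\beta_0 + \sum_{j=1}^{p}\beta_j x_{ij}\Bigr) \ge M\,(1-\epsilon_i),
  \qquad i = 1,\ldots,n, \\
& \epsilon_i \ge 0, \qquad \sum_{i=1}^{n}\epsilon_i \le C .
\end{aligned}
```

In this form C acts as a budget on the total slack rather than as a penalty weight.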
1) root (n=706, deviance=139239800, yval=3266.356)
|--- 2) totwrk>=2256.5 (n=369, deviance=70902640, yval=3151.081)
|    |--- 4) totwrk>=3693.5 (n=20, deviance=4127513, yval=2747.500) *
|    |--- 5) totwrk<3693.5 (n=349, deviance=63330890, yval=3174.209)
|         |--- 10) totwrk>=2622.5 (n=174, deviance=35897580, yval=3122.690)
|         |    |--- 20) totwrk<2784.5 (n=47, deviance=11624900, yval=2929.553) *
|         |    |--- 21) totwrk>=2784.5 (n=127, deviance=21870690, yval=3194.165) *
|         |--- 11) totwrk<2622.5 (n=175, deviance=26512260, yval=3225.434) *
|--- 3) totwrk<2256.5 (n=337, deviance=58064950, yval=3392.576)
     |--- 6) educ>=7.5 (n=324, deviance=54736770, yval=3375.145)
     |    |--- 12) totwrk>=1174 (n=209, deviance=32660830, yval=3314.206) *
     |    |--- 13) totwrk<1174 (n=115, deviance=19889250, yval=3485.896) *
     |--- 7) educ<7.5 (n=13, deviance=776314, yval=3827.000) *

(B) Predicted Sleep Per Week: Using the regression tree, we can predict
the sleep per week for each record as follows:

Record 1: Follows path 2 -> 5 -> 11, predicted sleep = 3225.434 minutes
per week.

Record 2: Follows path 2 -> 4, predicted sleep = 2747.500 minutes per
week.

Record 3: Follows path 3 -> 6 -> 12, predicted sleep = 3314.206 minutes
per week.
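
Reading a prediction off the tree amounts to a chain of comparisons, which can be written out as a function. The split points and leaf values are taken from the tree printed above; the records themselves are not reproduced in these notes, so the inputs below are hypothetical values chosen to follow the same three paths.

```python
def predict_sleep(totwrk, educ):
    """Predicted sleep (minutes per week) from the regression tree."""
    if totwrk >= 2256.5:                                        # node 2
        if totwrk >= 3693.5:                                    # node 4
            return 2747.500
        if totwrk >= 2622.5:                                    # node 10
            return 2929.553 if totwrk < 2784.5 else 3194.165    # nodes 20, 21
        return 3225.434                                         # node 11
    if educ >= 7.5:                                             # node 6
        return 3314.206 if totwrk >= 1174 else 3485.896         # nodes 12, 13
    return 3827.000                                             # node 7

# Hypothetical inputs (totwrk in minutes per week, educ in years).
path_2_5_11 = predict_sleep(totwrk=2500, educ=12)
path_2_4 = predict_sleep(totwrk=4000, educ=12)
path_3_6_12 = predict_sleep(totwrk=1500, educ=12)
```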

(C) Mean Square Error (MSE) and Mean Absolute Error (MAE): To compute
the MSE and MAE, we need the actual sleep values for all records in the
test set, which are not provided in the question. The formulas for MSE and
MAE are as follows:

MSE = (1/n) * Σ(actual - predicted)²

MAE = (1/n) * Σ|actual - predicted|
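
Once the actual values are available, both formulas are one-liners. In the sketch below the predicted values are the three tree predictions quoted earlier, while the actual values are hypothetical placeholders, not data from the question.

```python
def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

predicted = [3225.434, 2747.500, 3314.206]
actual = [3300.0, 2700.0, 3400.0]   # hypothetical placeholders

test_mse = mse(actual, predicted)
test_mae = mae(actual, predicted)
```

Because MSE squares each error, a single large miss affects it more than it affects MAE.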

(D) Support Vectors vs Usual Observations: In Support Vector Machine
(SVM) classification, support vectors are the data points that lie closest to
the decision boundary (on the margin or, with a soft margin, on the wrong
side of it) and are the most difficult to classify, whereas usual
observations are those that are farther away from the decision boundary.
The support vectors alone determine the maximum-margin boundary. If these
points are moved, the position of the decision boundary would change.
However, moving the usual observations (those not on the margin) would
not affect the position of the decision boundary, as long as they do not
cross the margin.
(A) Precision and Recall for Each Classifier:

Linear SVM:

Precision = TP / (TP + FP) = 230 / (230 + 35) = 0.868

Recall = TP / (TP + FN) = 230 / (230 + 32) = 0.878

Nonlinear SVM with γ = 1:

Precision = 253 / (253 + 32) = 0.888

Recall = 253 / (253 + 9) = 0.966

Nonlinear SVM with γ = 500:

Precision = 180 / (180 + 12) = 0.938

Recall = 180 / (180 + 82) = 0.687

Based on these calculations, the nonlinear SVM with γ = 1 has the highest
recall and the nonlinear SVM with γ = 500 has the highest precision. If we
consider both precision and recall, the nonlinear SVM with γ = 1 is the
best overall: its F1 score (the harmonic mean of precision and recall) is
about 0.925, compared with 0.873 for the linear SVM and 0.793 for γ = 500.
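
The figures above can be reproduced from the confusion-matrix counts used in the calculations (TP, FP, FN for each classifier). The F1 score, the harmonic mean of precision and recall, is included here as one possible single-number summary; it is an addition to the original answer.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# (TP, FP, FN) as used in the calculations above.
classifiers = {
    "linear SVM": (230, 35, 32),
    "nonlinear SVM, gamma=1": (253, 32, 9),
    "nonlinear SVM, gamma=500": (180, 12, 82),
}

results = {name: precision_recall_f1(*counts)
           for name, counts in classifiers.items()}
best_by_f1 = max(results, key=lambda name: results[name][2])
```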

(B) Information Gain from the First Split: Information gain is calculated as
the entropy of the parent node minus the weighted sum of the entropy of
the child nodes. In this case, the entropy of the parent node is
-0.5 log2(0.5) - 0.5 log2(0.5) = 1. The entropy of the left child node is
-0.57 log2(0.57) - 0.43 log2(0.43) = 0.985, and the entropy of the right
child node is -0.46 log2(0.46) - 0.54 log2(0.54) = 0.997. The information
gain is therefore 1 - 0.7 * 0.985 - 0.3 * 0.997 = 1 - 0.6895 - 0.2991,
which is approximately 0.011.
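
The same calculation as a short sketch. The child weights 0.7 and 0.3 and the class proportions 0.57 and 0.46 are the rounded values quoted above, so the entropies computed here can differ from the quoted ones in the third decimal place.

```python
import math

def entropy(p):
    """Binary entropy in bits for class proportion p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(parent_p, children):
    """children: list of (weight, class proportion); weights sum to 1."""
    return entropy(parent_p) - sum(w * entropy(p) for w, p in children)

gain = information_gain(0.5, [(0.7, 0.57), (0.3, 0.46)])
```

A gain this close to zero means the split barely reduces the impurity of the parent node.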

(C) Predictor totwrk: The p-value for the predictor totwrk is less than 0.001,
which is less than the significance level of 0.10. Therefore, we reject the
null hypothesis that the coefficient of totwrk is zero. This suggests that
totwrk is a significant predictor of sleep at the 10% significance level.

(D) Differences in Sleep Between Men and Women: The p-value for the
predictor male is 0.0118, which is less than the significance level of 0.05.
Therefore, we reject the null hypothesis that the coefficient of male is
zero. This suggests that there is a significant difference in the minutes of
sleep at night per week between men and women at the 5% significance
level.

(E) MAE and MSE for the Test Set: To compute the Mean Absolute Error
(MAE) and Mean Squared Error (MSE), we need the actual and predicted
sleep values for all records in the test set, which are not provided in the
question. The formulas for MAE and MSE are as follows:

MAE = (1/n) * Σ|actual - predicted|

MSE = (1/n) * Σ(actual - predicted)²

Without the actual and predicted values, we cannot compute the MAE and
MSE, and therefore cannot compare the performance of the linear
regression model to the regression tree.
