Data Science Midterm for FINC 614

The document provides instructions for a midterm exam in a data science course. Students are asked to: [1] select a dataset from a public repository to build a classification model; [2] clean the data by removing missing values and duplicate rows; [3] derive a new feature; [4] create frequency tables and summary statistics; [5] make plots and inferences; [6] check for data imbalance; [7] split into train and test sets; [8] do cross validation; [9] fit decision tree, logistic regression, and naive bayes models; [10] evaluate the models' performance; and [11] identify the best performing model.

FINC 614 Introduction to Data Science

Mid Term Exam

Complete the following in R and submit, via Blackboard, a Word or PDF document generated with knitr.

1. Pick any dataset from the UCI Machine Learning Repository
(http://archive.ics.uci.edu/ml/index.php) suitable for building a classification model.
2. Count the number of rows that have missing data. Remove the missing data. (2 points)
3. Check if there are any duplicate rows in the data. If there are duplicates, report how many and
remove them. (3 points)
4. Create at least one derived feature (a derived data column). (5 points)
5. Create a frequency table for your data (choose appropriate data attributes for a frequency
table) (5 points)
6. Report summary statistics (mean, median, standard deviation, quartiles, and range) for your data.
(5 points)
7. Use ggplot2 to plot the following types of graphs with your data. Choose data attributes that
are meaningful to plot for each graph. Make inferences about your data based on the graphs
(e.g. correlation, shape of distribution, etc.).
a. Scatter plot (2 points)
b. Histogram (2 points)
c. Boxplot (2 points)
d. Line graph (2 points)
8. Check if there is an imbalance in your dataset (3 points)
9. Split your dataset into a training and test dataset choosing the percentages based on the size of
your dataset (2 points)
10. Use k-fold cross-validation. Choose k based on the size of your dataset and the time it would
take to fit the model. (2 points)
11. Fit the following models to your data to predict the class of a meaningful attribute of your
choice.
a. Decision tree (5 points)
b. Logistic regression (5 points)
c. Naive Bayes (5 points)
12. Plot the fitted decision tree. What attribute was used for the first split? (3 points)
13. Report the following performance measures for each model you fit above:
a. Confusion matrix (2 points)
b. Accuracy (2 points)
c. Sensitivity/Specificity (2 points)
d. Precision/Recall (2 points)
e. ROC curve (2 points)
14. Which model gives you the best results? (2 points)
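
The workflow in items 2 to 4, 8 to 10, and 13 might be sketched in base R as shown here. This is only an illustration under stated assumptions, not a model answer: the built-in iris data (reduced to two classes) stands in for whichever UCI dataset is chosen, and the derived feature, the 70/30 split, k = 5, the 0.5 cutoff, and the choice of "virginica" as the positive class are all assumptions. Only logistic regression is shown because it needs no add-on packages; the ggplot2 plots, the decision tree, the naive Bayes model, and the ROC curve are left out.

```r
# Assumed stand-in data: built-in iris, restricted to two classes so that
# a binary logistic regression applies directly.
df <- subset(iris, Species != "setosa")
df$Species <- droplevels(df$Species)

# Items 2-3: count and remove rows with missing values, then duplicate rows.
n_missing <- sum(!complete.cases(df))
df <- df[complete.cases(df), ]
n_dup <- sum(duplicated(df))
df <- df[!duplicated(df), ]

# Item 4: one derived feature (petal area is an assumed choice).
df$Petal.Area <- df$Petal.Length * df$Petal.Width

# Item 8: class proportions as a simple imbalance check.
class_share <- prop.table(table(df$Species))

# Item 9: 70/30 train/test split (the percentages are an assumption).
set.seed(614)
train_idx <- sample(nrow(df), size = floor(0.7 * nrow(df)))
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Helper (hypothetical name): turn predicted probabilities into class labels.
# glm() models the probability of the second factor level, "virginica".
to_class <- function(p) {
  factor(ifelse(p > 0.5, "virginica", "versicolor"), levels = levels(df$Species))
}

# Item 10: k-fold cross-validation on the training set, with k = 5 assumed.
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(train)))
cv_acc <- sapply(seq_len(k), function(i) {
  fit <- glm(Species ~ Sepal.Length + Sepal.Width,
             data = train[fold != i, ], family = binomial)
  p <- predict(fit, newdata = train[fold == i, ], type = "response")
  mean(to_class(p) == train$Species[fold == i])
})

# Item 13: confusion matrix and derived measures on the held-out test set.
fit <- glm(Species ~ Sepal.Length + Sepal.Width, data = train, family = binomial)
pred <- to_class(predict(fit, newdata = test, type = "response"))
cm <- table(Predicted = pred, Actual = test$Species)

tp <- cm["virginica", "virginica"]
fn <- cm["versicolor", "virginica"]
fp <- cm["virginica", "versicolor"]
tn <- cm["versicolor", "versicolor"]

accuracy <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)  # same quantity as recall
specificity <- tn / (tn + fp)
precision <- tp / (tp + fp)
```

Printing `cm`, `mean(cv_acc)`, and the four measures inside a knitr chunk would place them in the generated report. The same split, folds, and confusion-matrix code could be reused for the other two models so that the comparison in item 14 rests on identical data.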
