
MET CS688 OL

WEB ANALYTICS AND MINING


LIVE CLASSROOM 2
Discussion Grading Clarification
• Be responsive and engaged throughout the week

• One thoughtful, original post → 70 possible points

• Each thoughtful follow-up that replies to another student
→ 10 possible points each, up to 3 replies
– The best 3 replies will be scored

• Full points go to the highest quality content


Module 2 Objectives
• Get familiar with basics of machine learning
• Learn to evaluate the performance (efficacy and accuracy) of a
machine learning algorithm
• Apply principles of data visualization to create effective graphs
of data
Types of Machine Learning
• Unsupervised Learning
– We have input variables X but no output
– Descriptive Analytics:
• Clustering (understand how some X relate to other X)
• Associations (people who do A also appear to do B)
• Supervised Learning
– We have input variables X1, …, Xn and output variable Y
– Predictive Analytics: Use an algorithm to map the X to Y
• Regression, Classification (including boosted trees), Analysis of Variance, etc.
• Semi-supervised Learning
– We have input variables X for everybody, but outputs Y for only some
– Use both groups of techniques, often with idea of predicting Y where we don’t know it
Clustering (Unsupervised)
• Divide a set of data into similar sub-groups
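A minimal sketch of this idea, using scikit-learn's k-means (an illustration not taken from the slides; the toy data points are invented):

```python
# Cluster a small 2-D dataset into similar sub-groups with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of points
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # each point is assigned to one of the 2 sub-groups
```

Note that we had no output variable Y here; the algorithm found the sub-groups from the X values alone.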
Regression (Supervised) and Correlation (Both)
• Regression uses information in one or more input variables to
predict an outcome
– Simple regression uses just one input variable to predict an outcome
– Correlation is a measurement of how well that one input predicts the
outcome
(Example scatter plot, with correlation r = -0.86)
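As a sketch of how such an r value is computed (the toy x and y values below are invented; the slide's r = -0.86 comes from a different dataset):

```python
# Compute the Pearson correlation coefficient for a small dataset.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([10, 8, 7, 4, 2], dtype=float)  # y decreases as x increases

r = np.corrcoef(x, y)[0, 1]  # strongly negative, near -1
print(round(r, 2))
```

A strongly negative r means the single input predicts the outcome well, just in the downward direction.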
Association Analysis (Unsupervised)
• Suppose you inspect a 3-bedroom residence (plus living room and
dining/kitchen) for large but movable objects:
Room Item Set
1 Bed, Chair, Desk, Dresser, Mirror, Nightstand, TV
2 Bed, Dresser, Nightstand
3 Bed, Dresser, Nightstand
4 Chair, Table
5 Bookshelf, Couch, TV

• If you are placed at random in one of the rooms that has a TV,
how likely are you to be in a room that also has a Couch?
50%
In Association Analysis, this is “50% Confidence”
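The slide's confidence calculation can be sketched in plain Python, using the room item sets from the table above:

```python
# Confidence of the rule {TV} -> {Couch} over the 5 room item sets.
rooms = {
    1: {"Bed", "Chair", "Desk", "Dresser", "Mirror", "Nightstand", "TV"},
    2: {"Bed", "Dresser", "Nightstand"},
    3: {"Bed", "Dresser", "Nightstand"},
    4: {"Chair", "Table"},
    5: {"Bookshelf", "Couch", "TV"},
}

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that
    also contain the consequent."""
    with_a = [t for t in transactions.values() if antecedent <= t]
    with_both = [t for t in with_a if consequent <= t]
    return len(with_both) / len(with_a)

# Rooms 1 and 5 have a TV; only room 5 also has a Couch.
print(confidence({"TV"}, {"Couch"}, rooms))  # 0.5
```

The same function works for any rule, e.g. confidence({"Bed"}, {"Dresser"}, rooms) gives 1.0, since every room with a Bed also has a Dresser.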
Classification (Supervised, Semi-Supervised)
Which of these images are eyes?

After converting the images into numeric representations (there are
many ways to do this), classification techniques are used to predict
which ones are eyes.
Many approaches exist (some better than others), including logistic
regression, classification trees, random forests, neural nets, …
Classification (Supervised, Semi-Supervised)
Suppose each of these people is a different age with a different
medical history.

👨🦰👴👨🦱🧔👱👱👩🦰👵👩🦱👩

Predict whether each is most likely to die: before age 50,
between ages 51-65, between ages 66-80, or at age 81 or older.
(Perhaps to help set life insurance rates.)
Many ways to do this, including classification trees, random
forests, logistic regression, neural nets, …
Deviation and Anomaly Detection
Algorithm Efficiency
• Computational Complexity
– A high number of operations usually requires more:
• Time to process
• Energy (battery) to process
• Storage, memory, and network bandwidth
Algorithm Accuracy: Precision and Recall
• Precision: The fraction of the returned results that are relevant
to the information need
Precision = TP/(TP+FP)

• Recall (or Sensitivity): The fraction of the relevant documents in
the collection that were returned by the system
Recall = TP/(TP+FN)

• F-Score (or F-Measure) combines precision and recall; it is the
harmonic mean of the two:
F-measure = (2 × Precision × Recall) / (Precision + Recall)

• Accuracy: The fraction of all documents labeled correctly
Accuracy = (TP+TN)/(TP+TN+FP+FN)
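A worked example of all four metrics from confusion-matrix counts (the TP/FP/FN/TN values below are hypothetical, chosen to make the arithmetic easy to follow):

```python
# Precision, recall, F-measure, and accuracy from confusion counts.
TP, FP, FN, TN = 40, 10, 20, 30  # hypothetical counts

precision = TP / (TP + FP)                    # 40/50  = 0.8
recall    = TP / (TP + FN)                    # 40/60  ≈ 0.667
f_measure = 2 * precision * recall / (precision + recall)
accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 70/100 = 0.7

print(precision, recall, f_measure, accuracy)
```

Because the F-measure is a harmonic mean, it sits closer to the smaller of precision and recall, so a model cannot score well on F by excelling at only one of the two.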
Efficiency vs. Accuracy
• Quick answers can be valuable
– “I want to know whether I have COVID-19 or not.”
• An inefficient algorithm may take 2 weeks to produce an answer.
• Not useful for providing guidance on whether to quarantine.

• Accurate answers are valuable
– “You have cancer.”
• An inaccurate indication of cancer leads to a healthy person taking
chemotherapy and radiation, which weakens their immune system and
causes many side effects.
Algorithm Inputs
• Data
• Parameters
– Examples: how sensitive the algorithm is, how hard it looks for a
solution, how many solutions it produces, etc.
– Some parameters are set by the algorithm
– Parameters set by the user are sometimes called hyperparameters

• Scalars, vectors, matrices, tensors
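The last bullet can be made concrete with NumPy (an illustration, not from the slides):

```python
# Scalars, vectors, matrices, and tensors as NumPy arrays of
# increasing dimensionality.
import numpy as np

scalar = np.float64(3.5)                     # 0-dimensional: a single number
vector = np.array([1.0, 2.0, 3.0])           # 1-dimensional, shape (3,)
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])              # 2-dimensional, shape (2, 2)
tensor = np.zeros((2, 3, 4))                 # 3-dimensional, shape (2, 3, 4)

print(vector.shape, matrix.shape, tensor.shape)
```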


Working With Data: A Scientific Process

Define the Question → Experiment → Data Cleaning, Wrangling, &
Exploration → Prepare to Run Algorithm → Run Algorithm & Evaluate
Define the Question
• What is the problem to be solved?
• Who has this problem?
• What do I think is true about this problem?
• How might I prove that?
Experiment
• Run experiments to collect data to test the problem and
related hypothesis statements
Data Cleaning, Wrangling, and Exploration
• Address inconsistencies in data
– Decide what to do about missing values, outliers, etc.
• Get data in algorithm-ready state
– File formats, data types, etc.
• Visualize data, explore correlations, check for unusual results
• Assess whether collected data sufficiently cover the range of
decisions to be made
Prepare to Run Algorithm
• Feature engineering: selection, generation, extraction – know
what is relevant to send to algorithm
• Dimensionality reduction: a type of (usually) unsupervised
feature selection algorithm
• Choose hyperparameters
Run Algorithm and Evaluate
• If supervised learning: divide data into training (for running
algorithm) and testing (evaluate the algorithm’s performance
vs. ground truth)
• Based on performance, may need to adjust hyperparameters,
or do more data wrangling, or try new experiment
Algorithm Performance Depends on Good Data
Great lecture if you’re curious: https://www.youtube.com/watch?v=06-AZXmwHjo

A key slide from this lecture:

Good data:

• Is defined consistently (unambiguous definition of labels and y variables)
• Covers important cases (good coverage of the x input variables)
• Has timely feedback from production data (distribution covers data drift
and concept drift)
• Is sized appropriately
LIVE CLASSROOM 2, PART 2
Some algorithms to learn
• There are many algorithms we could learn. We will only introduce a few.

Classification Tree
Logistic Regression
K Nearest Neighbor
Coding for running & evaluating algorithms
                       R                                  Python
Classification Tree    tree::tree                         sklearn.tree
Logistic Regression    base::glm(…, family="binomial")    sklearn.linear_model
K Nearest Neighbor     class::knn                         sklearn.neighbors
Splitting test/train   base::sample                       sklearn.model_selection
Evaluation             caret::confusionMatrix             sklearn.metrics
Manual calculations

** sklearn is really nice!!!
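The Python column above can be combined into one short sketch: fit a classification tree, split test/train, and evaluate with a confusion matrix. (The Iris dataset and the 70/30 split are illustrative choices, not from the slides.)

```python
# Fit a classification tree on Iris and evaluate on a held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows as a test set (sklearn.model_selection)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Train only on the training rows (sklearn.tree)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Evaluate against ground truth (sklearn.metrics)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
```

Swapping DecisionTreeClassifier for sklearn.linear_model.LogisticRegression or sklearn.neighbors.KNeighborsClassifier changes only the model line; the split-and-evaluate scaffolding stays the same.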
Some Types of Graphs
Why Graphing?
• Graphs can be very useful tools!
• You can deceive if you make the wrong choices:
– Selecting incomplete data
– Labeling only part of the data
– Using the wrong chart for the job
– Incorrect axes
– Hard-to-read colors
• A good graph may provide more perspective
– As useful as a statistical analysis

Sources for graphs:


https://www.politifact.com/truth-o-meter/statements/2015/oct/01/jason-chaffetz/chart-shown-planned-parenthood-hearing-misleading-/
Principles of Data Visualization
• Data-to-ink ratio
– https://junkcharts.typepad.com/junk_charts/2012/12/english-donuts-rival-spanish-donuts.html

Principles of Data Visualization
• Software defaults are not always best
– Especially for data with a complicated
history, you may need to think more
about what you want to do
– Many examples online
Principles of Data Visualization
• Use titles, axis labels, and legends to your benefit
– Compare and contrast, for example, these plots – note no x-axis at all on left plot
(so harder to know what is being plotted!)

https://junkcharts.typepad.com/junk_charts/2019/06/clarifying-comparisons-in-censored-cohort-data-uk-housing-affordability.html
Principles of Data Visualization
• Color selection
– Watch for color blindness
• Most chart designers, me included, are very bad at this
• https://analyticsdemystified.com/excel-tips/data-visualization-that-is-color-blind-friendly-excel-2007/

– Watch for color optical illusions


• http://mentalfloss.com/article/54448/5-color-illusions-and-why-they-work
Principles of Data Visualization
• Know which charts are better used for multiplicative differences vs.
additive differences
Principles of Data Visualization
• In scatter plots, consider using plotting symbols besides default dots
– http://sprout038.sprout.yale.edu/imagefinder/Figure.external?sp=SPMC1472692%2F1471-2105-7-123-21&state:Figure=BrO0ABXcRAAAAAQAACmRvY3VtZW50SWRzcgARamF2YS5sYW5nLkludGVnZXIS4qCk94GHOAIAAUkABXZhbHVleHIAEGphdmEubGFuZy5OdW1iZXKGrJUdC5TgiwIAAHhwAAZhoA%3D%3D
Principles of Data Visualization
• Pie charts can be useful, but often require more thought to read
– Problems commonly cited with pies:
• What does it mean when percentages add to more than 100%?
• Can you identify which slice is biggest in each of the 3 below?
• Of the 3 pies below, which has the smallest green slice?

– https://medium.com/@clmentviguier/the-hate-of-pie-charts-harms-good-data-visualization-cc7cfed243b6
Principles of Data Visualization
• Aspect ratio can make a big difference!
Google Charts
• Google Charts is an interactive Web service that creates graphical charts from
user-supplied information.

• The charts are based on HTML5/SVG and hence can be used in web pages
without the need for plugins.

• A browser with an Internet connection is required to display the Google charts.

https://developers.google.com/chart/interactive/docs/
https://en.wikipedia.org/wiki/HTML5
https://en.wikipedia.org/wiki/Scalable_Vector_Graphics
googleVis R Package
• The googleVis R package provides the interface between R and the Google charts.
• The HTML5/SVG-based charts include the line, bar, column, area, combo, scatter, bubble,
candlestick, pie, organizational, tables, gauges, tree maps, maps, geo charts and intensity
maps. The Flash-based charts include motion charts, annotated time lines, and geo maps.
• Allows users to create interactive charts based on data frames.
• Charts are displayed locally, usually in your web browser.
• A modern browser with an Internet connection is required, and for some charts a Flash player.
• The data remains local and is not uploaded to Google.

https://cran.r-project.org/web/packages/googleVis/vignettes/googleVis_examples.html
https://cran.r-project.org/web/packages/googleVis/index.html
http://cran.r-project.org/web/packages/googleVis/googleVis.pdf
Plotting in googleVis
• Each chart has its own function
• There are simple implementations, and more complicated ones
• All the detail you need is in the modules
Reminder
• You do not have to use googleVis on this assignment
• If you understand ggplot2, plotly, or other graphing tools, you may use
them
