
MET CS688 OL

WEB ANALYTICS AND MINING


LIVE CLASSROOM 2
Discussion Grading Clarification
• Be responsive and engaged throughout the week

• One thoughtful, original post → 70 possible points

• Each thoughtful follow-up that replies to another student
→ 10 possible points each, up to 3 replies
– The best 3 replies will be scored

• Full points go to the highest quality content


Module 2 Objectives
• Get familiar with basics of machine learning
• Learn to evaluate the performance (efficacy and accuracy) of a
machine learning algorithm
• Apply principles of data visualization to create effective graphs
of data
Types of Machine Learning
• Unsupervised Learning
– We have input variables X but no output
– Descriptive Analytics:
• Clustering (understand how some X relate to other X)
• Associations (people who do A also appear to do B)
• Supervised Learning
– We have input variables X1, …, Xn and output variable Y
– Predictive Analytics: Use an algorithm to map the X to Y
• Regression, Classification (including boosted trees), Analysis of Variance, etc.
• Semi-supervised Learning
– We have input variables X for everybody, but outputs Y for only some
– Use both groups of techniques, often with idea of predicting Y where we don’t know it
Clustering (Unsupervised)
• Divide a set of data into similar sub-groups
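A minimal sketch of this idea, using scikit-learn's k-means (an illustration not taken from the slides; the toy data points are invented):

```python
# Cluster a small 2-D dataset into similar sub-groups with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of points
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # each point is assigned to one of the 2 sub-groups
```

Note that we had no output variable Y here; the algorithm found the sub-groups from the X values alone.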
Regression (Supervised) and Correlation (Both)
• Regression uses information in one or more input variables to
predict an outcome
– Simple regression uses just one input variable to predict an outcome
– Correlation is a measurement of how well that one input predicts the
outcome
(Example scatter plot, with correlation r = -0.86)
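As a sketch of how such an r value is computed (the toy x and y values below are invented; the slide's r = -0.86 comes from a different dataset):

```python
# Compute the Pearson correlation coefficient for a small dataset.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([10, 8, 7, 4, 2], dtype=float)  # y decreases as x increases

r = np.corrcoef(x, y)[0, 1]  # strongly negative, near -1
print(round(r, 2))
```

A strongly negative r means the single input predicts the outcome well, just in the downward direction.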
Association Analysis (Unsupervised)
• Suppose you inspect a 3-bedroom residence (plus living room and
dining/kitchen) for large but movable objects:
Room Item Set
1 Bed, Chair, Desk, Dresser, Mirror, Nightstand, TV
2 Bed, Dresser, Nightstand
3 Bed, Dresser, Nightstand
4 Chair, Table
5 Bookshelf, Couch, TV

• If you are placed at random in one of the rooms that has a TV,
how likely are you to be in a room that also has a Couch?
50%
In Association Analysis, this is “50% Confidence”
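The slide's confidence calculation can be sketched in plain Python, using the room item sets from the table above:

```python
# Confidence of the rule {TV} -> {Couch} over the 5 room item sets.
rooms = {
    1: {"Bed", "Chair", "Desk", "Dresser", "Mirror", "Nightstand", "TV"},
    2: {"Bed", "Dresser", "Nightstand"},
    3: {"Bed", "Dresser", "Nightstand"},
    4: {"Chair", "Table"},
    5: {"Bookshelf", "Couch", "TV"},
}

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that
    also contain the consequent."""
    with_a = [t for t in transactions.values() if antecedent <= t]
    with_both = [t for t in with_a if consequent <= t]
    return len(with_both) / len(with_a)

# Rooms 1 and 5 have a TV; only room 5 also has a Couch.
print(confidence({"TV"}, {"Couch"}, rooms))  # 0.5
```

The same function works for any rule, e.g. confidence({"Bed"}, {"Dresser"}, rooms) gives 1.0, since every room with a Bed also has a Dresser.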
Classification (Supervised, Semi-Supervised)
Which of these images are eyes?

After converting the images into numeric representations (there are
many ways to do this), classification techniques are used to predict
which ones are eyes.
Many approaches exist (some better than others), including logistic
regression, classification trees, random forests, neural nets, …
Classification (Supervised, Semi-Supervised)
Suppose each of these people is a different age with a different
medical history.

👨🦰👴👨🦱🧔👱👱👩🦰👵👩🦱👩

Predict whether each is most likely to die: before age 50,
between ages 51-65, between ages 66-80, or at age 81 or older.
(Perhaps to help set life insurance rates.)
Many ways to do this, including classification trees, random
forests, logistic regression, neural nets, …
Deviation and Anomaly Detection
Algorithm Efficiency
• Computational Complexity
– A high number of operations usually requires more:
• Time to process
• Energy (battery) to process
• Storage, memory, and network bandwidth
Algorithm Accuracy: Precision and Recall
• Precision: The fraction of the returned results that are relevant
to the information need
Precision = TP/(TP+FP)

• Recall (or Sensitivity): The fraction of the relevant documents in
the collection that were returned by the system
Recall = TP/(TP+FN)

• F-Score (or F-Measure) combines precision and recall; it is the
harmonic mean of the two:
F-measure = (2 × Precision × Recall) / (Precision + Recall)

• Accuracy: The fraction of all documents labeled correctly
Accuracy = (TP+TN)/(TP+TN+FP+FN)
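A worked example of all four metrics from confusion-matrix counts (the TP/FP/FN/TN values below are hypothetical, chosen to make the arithmetic easy to follow):

```python
# Precision, recall, F-measure, and accuracy from confusion counts.
TP, FP, FN, TN = 40, 10, 20, 30  # hypothetical counts

precision = TP / (TP + FP)                    # 40/50  = 0.8
recall    = TP / (TP + FN)                    # 40/60  ≈ 0.667
f_measure = 2 * precision * recall / (precision + recall)
accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 70/100 = 0.7

print(precision, recall, f_measure, accuracy)
```

Because the F-measure is a harmonic mean, it sits closer to the smaller of precision and recall, so a model cannot score well on F by excelling at only one of the two.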
Efficiency vs. Accuracy
• Quick answers can be valuable
– “I want to know whether I have COVID-19 or not.”
• An inefficient algorithm may take 2 weeks to produce an answer.
• Not useful for providing guidance on whether to quarantine.

• Accurate answers are valuable
– “You have cancer.”
• An inaccurate indication of cancer leads to a healthy person taking
chemotherapy and radiation, which weakens their immune system and
causes many side effects.
Algorithm Inputs
• Data
• Parameters
– Examples: how sensitive the algorithm is, how hard it looks for a
solution, how many solutions it produces, etc.
– Some parameters are set by the algorithm
– Parameters set by the user are sometimes called hyperparameters

• Scalars, vectors, matrices, tensors
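The last bullet can be made concrete with NumPy (an illustration, not from the slides):

```python
# Scalars, vectors, matrices, and tensors as NumPy arrays of
# increasing dimensionality.
import numpy as np

scalar = np.float64(3.5)                     # 0-dimensional: a single number
vector = np.array([1.0, 2.0, 3.0])           # 1-dimensional, shape (3,)
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])              # 2-dimensional, shape (2, 2)
tensor = np.zeros((2, 3, 4))                 # 3-dimensional, shape (2, 3, 4)

print(vector.shape, matrix.shape, tensor.shape)
```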


Working With Data: A Scientific Process

Define the Question → Experiment → Data Cleaning, Wrangling, &
Exploration → Prepare to Run Algorithm → Run Algorithm & Evaluate
Define the Question
• What is the problem to be solved?
• Who has this problem?
• What do I think is true about this problem?
• How might I prove that?
Experiment
• Run experiments to collect data to test the problem and
related hypothesis statements
Data Cleaning, Wrangling, and Exploration
• Address inconsistencies in data
– Decide what to do about missing values, outliers, etc.
• Get data in algorithm-ready state
– File formats, data types, etc.
• Visualize data, explore correlations, check for unusual results
• Assess whether collected data sufficiently cover the range of
decisions to be made
Prepare to Run Algorithm
• Feature engineering: selection, generation, extraction – know
what is relevant to send to algorithm
• Dimensionality reduction: a type of (usually) unsupervised
feature selection algorithm
• Choose hyperparameters
Run Algorithm and Evaluate
• If supervised learning: divide data into training (for running
algorithm) and testing (evaluate the algorithm’s performance
vs. ground truth)
• Based on performance, may need to adjust hyperparameters,
or do more data wrangling, or try new experiment
Algorithm Performance Depends on Good Data
Great lecture if you’re curious: https://www.youtube.com/watch?v=06-AZXmwHjo

A key slide from this lecture:

Good data:

• Is defined consistently (unambiguous definition of labels and y variables)
• Covers important cases (good coverage of the x input variables)
• Has timely feedback from production data (distribution covers data drift
and concept drift)
• Is sized appropriately
LIVE CLASSROOM 2, PART 2
Some algorithms to learn
• There are many algorithms we could learn. We will only introduce a few.

Classification Tree
Logistic Regression
K Nearest Neighbor
Coding for running & evaluating algorithms
                       R                                  Python
Classification Tree    tree::tree                         sklearn.tree
Logistic Regression    base::glm(…, family="binomial")    sklearn.linear_model
K Nearest Neighbor     class::knn                         sklearn.neighbors
Splitting test/train   base::sample                       sklearn.model_selection
Evaluation             caret::confusionMatrix             sklearn.metrics
Manual calculations

** sklearn is really nice!!!
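The Python column above can be combined into one short sketch: fit a classification tree, split test/train, and evaluate with a confusion matrix. (The Iris dataset and the 70/30 split are illustrative choices, not from the slides.)

```python
# Fit a classification tree on Iris and evaluate on a held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows as a test set (sklearn.model_selection)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Train only on the training rows (sklearn.tree)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Evaluate against ground truth (sklearn.metrics)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
```

Swapping DecisionTreeClassifier for sklearn.linear_model.LogisticRegression or sklearn.neighbors.KNeighborsClassifier changes only the model line; the split-and-evaluate scaffolding stays the same.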
Some Types of Graphs
Why Graphing?
• Graphs can be very useful tools!
• You can deceive if you make the wrong choices:
– Selecting incomplete data
– Labeling only part of the data
– Using the wrong chart for the job
– Incorrect axes
– Hard-to-read colors
• A good graph may provide more perspective
– As useful as a statistical analysis

Sources for graphs:


https://www.politifact.com/truth-o-meter/statements/2015/oct/01/jason-chaffetz/chart-shown-planned-parenthood-hearing-misleading-/
Principles of Data Visualization
• Data-to-ink ratio
– https://junkcharts.typepad.com/junk_charts/2012/12/english-donuts-rival-spanish-donuts.html

Principles of Data Visualization
• Software defaults are not always best
– Especially for data with a complicated
history, you may need to think more
about what you want to do
– Many examples online
Principles of Data Visualization
• Use titles, axis labels, and legends to your benefit
– Compare and contrast, for example, these plots – note no x-axis at all on left plot
(so harder to know what is being plotted!)

https://junkcharts.typepad.com/junk_charts/2019/06/clarifying-comparisons-in-censored-cohort-data-uk-housing-affordability.html
Principles of Data Visualization
• Color selection
– Watch for color blindness
• Most chart designers, me included, are very bad at this
• https://analyticsdemystified.com/excel-tips/data-visualization-that-is-color-blind-friendly-excel-2007/

– Watch for color optical illusions


• http://mentalfloss.com/article/54448/5-color-illusions-and-why-they-work
Principles of Data Visualization
• Know which charts are better used for multiplicative differences vs.
additive differences
Principles of Data Visualization
• In scatter plots, consider using plotting symbols besides default dots
– http://sprout038.sprout.yale.edu/imagefinder/Figure.external?sp=SPMC1472692%2F1471-2105-7-123-21&state:Figure=BrO0ABXcRAAAAAQAACmRvY3VtZW50SWRzcgARamF2YS5sYW5nLkludGVnZXIS4qCk94GHOAIAAUkABXZhbHVleHIAEGphdmEubGFuZy5OdW1iZXKGrJUdC5TgiwIAAHhwAAZhoA%3D%3D
Principles of Data Visualization
• Pie charts can be useful, but often require more thought to read
– Problems commonly cited with pies:
• What does it mean when percentages add to more than 100%?
• Can you identify which slice is biggest in each of the 3 below?
• Of the 3 pies below, which has the smallest green slice?

– https://medium.com/@clmentviguier/the-hate-of-pie-charts-harms-good-data-visualization-cc7cfed243b6
Principles of Data Visualization
• Aspect ratio can make a big difference!
Google Charts
• Google Charts is an interactive Web service that creates graphical charts from
user-supplied information.

• The charts are based on HTML5/SVG and hence can be used in web pages
without the need for plugins.

• A browser with an Internet connection is required to display the Google charts.

https://developers.google.com/chart/interactive/docs/
https://en.wikipedia.org/wiki/HTML5
https://en.wikipedia.org/wiki/Scalable_Vector_Graphics
googleVis R Package
• The googleVis R package provides the interface between R and the Google charts.
• The HTML5/SVG-based charts include the line, bar, column, area, combo, scatter, bubble,
candlestick, pie, organizational, tables, gauges, tree maps, maps, geo charts and intensity
maps. The Flash-based charts include motion charts, annotated time lines, and geo maps.
• Allows users to create interactive charts based on data frames.
• Charts are displayed locally, usually in your web browser.
• A modern browser with an Internet connection is required, and for some charts a Flash player.
• The data remains local and is not uploaded to Google.

https://cran.r-project.org/web/packages/googleVis/vignettes/googleVis_examples.html
https://cran.r-project.org/web/packages/googleVis/index.html
http://cran.r-project.org/web/packages/googleVis/googleVis.pdf
Plotting in googleVis
• Each chart has its own function
• There are simple implementations, and more complicated ones
• All the detail you need is in the modules
Reminder
• You do not have to use googleVis on this assignment
• If you understand ggplot2, plotly, or other graphing tools, you may use
them
