Data Science Insights for Students

This document provides an outline for a course on data science called CS109. It discusses the goals of data science which include gaining insights into data through computation, statistics, and visualization. It then discusses key concepts in data science like exploratory data analysis, prediction, modeling, and communicating results. It provides examples of course content like Netflix prize challenges and working with large neuroscience datasets. It introduces the instructors and teaching assistants for the course and discusses expectations around homework, projects, collaboration, and grading.

Uploaded by

Rahul Venkat


STAT121 / AC209 / E-109

CS109 Data Science

Hanspeter Pfister
pfister@seas.harvard.edu
Joe Blitzstein
blitzstein@stat.harvard.edu
Verena Kaynig
vkaynig@seas.harvard.edu

Outline
What?
Why?
Who?
How?

Data Science
To gain insights into data through
computation, statistics, and visualization

A Data Scientist Is...

A data scientist is someone who knows more
statistics than a computer scientist and more
computer science than a statistician.
- Josh Blumenstock
Data Scientist = statistician + programmer +
coach + storyteller + artist
- Shlomo Argamon

Nate Silver

Nate Silver won the election

Harvard Business Review

#natesilverfacts

http://techcrunch.com/2012/11/07/nate-silver-as-software/

Nate Silver on Pundits

Silver: Pundits are no
better than a coin toss.
Stewart: Do you foresee a
coin getting its own show?
The coin toss show?
http://www.thedailyshow.com/watch/wed-october-17-2012/nate-silver

Some Key Principles

use many data sources (the plural of anecdote is not data)
understand how the data were collected (sampling is essential)
weight the data thoughtfully (not all polls are equally good)
use statistical models (not just hacking around in Excel)
understand correlations (e.g., states that trend similarly)
think like a Bayesian, check like a frequentist (reconciliation)
have good communication skills (What does a 60%
probability even mean? How can we visualize, validate, and
understand the conclusions?)
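
As one way to make "weight the data thoughtfully" concrete, here is a minimal sketch of a weighted poll average. The poll numbers and the choice to weight by sample size are illustrative assumptions, not data or a method taken from the slides.

```python
# Hypothetical polls: (candidate share in percent, sample size).
# All numbers are made up for illustration.
polls = [(52.0, 1200), (48.0, 400), (51.0, 800)]

def weighted_average(polls):
    """Average poll shares, weighting each poll by its sample size."""
    total_weight = sum(n for _, n in polls)
    return sum(share * n for share, n in polls) / total_weight

def simple_average(polls):
    """Unweighted mean of the poll shares, for comparison."""
    return sum(share for share, _ in polls) / len(polls)

weighted = weighted_average(polls)
unweighted = simple_average(polls)
print(f"weighted: {weighted:.2f}%  unweighted: {unweighted:.2f}%")
```

The small poll pulls the unweighted mean down more than it pulls the weighted one. A real model would likely also account for pollster quality and recency, which this sketch ignores.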

Netflix Prize

Netflix Prize Progress

HBR, Oct 2012

3 Years Later
We evaluated some of the new
methods offline but the additional
accuracy gains that we measured did
not seem to justify the engineering
effort needed to bring them into a
production environment.

Xavier Amatriain and Justin Basilico, 2012

Some Challenges

massive data (500k users, 20k movies, 100m ratings)
missing data (99% of data missing; not missing at random)
extremely complicated set of factors that affect people's ratings of movies (actors, directors, genre, ...)
need to avoid overfitting (test data vs. training data)
curse of dimensionality (very high-dimensional problem)
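
The "99% missing" figure follows from the sizes quoted in the list above. A quick check, using the rounded counts from the slide (the variable names are only for illustration):

```python
# Rounded sizes from the slide.
users = 500_000
movies = 20_000
ratings = 100_000_000

# Every (user, movie) pair is a possible rating.
possible = users * movies
observed_fraction = ratings / possible
missing_fraction = 1 - observed_fraction

print(f"possible ratings: {possible:,}")
print(f"observed: {observed_fraction:.0%}, missing: {missing_fraction:.0%}")
```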

Kaggle

The Connectome
How is the mammalian brain wired?

~60 µm³
600 GB
Courtesy of
Bobby Kasthuri.
Harvard

The Data Challenge

Pixel resolution: 3-5 nm; Slice thickness: 30-50 nm
1 mm³: 40 Gpixels x 25,000 slices = ~1 PByte
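
The petabyte estimate for one cubic millimeter can be reproduced with a few lines of arithmetic. This sketch assumes 5 nm pixels and 40 nm slices (values inside the ranges quoted above) and one byte per pixel; the byte depth is an assumption, not stated on the slide.

```python
NM_PER_MM = 1_000_000

pixel_nm = 5         # assumed; the slide gives 3-5 nm
slice_nm = 40        # assumed; the slide gives 30-50 nm
bytes_per_pixel = 1  # assumption: 8-bit grayscale images

pixels_per_side = NM_PER_MM // pixel_nm    # pixels along one 1 mm edge
pixels_per_slice = pixels_per_side ** 2    # pixels in one 1 mm x 1 mm slice
num_slices = NM_PER_MM // slice_nm         # slices through 1 mm of depth

total_bytes = pixels_per_slice * num_slices * bytes_per_pixel
petabytes = total_bytes / 1e15

print(f"{pixels_per_slice / 1e9:.0f} Gpixels x {num_slices:,} slices "
      f"= {petabytes:.0f} PByte")
```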

Daniel Berger

Connectome Workflow
[Figure: workflow diagram with stages Cutting, Imaging, Alignment & Registration, Segmentation, Proof Reading, Visualization, Analysis]

K. Al-Awami, et al.,
NeuroLines: A Subway Map Metaphor for Visualizing Nanoscale Neuronal Connectivity,
IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 2369-2378,
2014

Data Science
[Figure: Venn diagram of Computer Science, Statistics, and Domain Science (after Drew Conway)]

[Figure: Data Science spanning Machine and Human. Labels: Data Management, Data Mining, Machine Learning, Business Intelligence, Statistics, Human Cognition, Perception, Visualization, Story Telling, Decision Making, Theory]
Inspired by Daniel Keim, Visual Analytics: Definition, Process, and Challenges

Outline

What?
Why?
Who?
How?

The Age of Big Data

BBC, 2013

Big Data
Between the dawn of civilization and
2003, we only created five exabytes of
information; now we're creating that
amount every two days.
Eric Schmidt, Google (and others)

http://onesecond.designly.com/

travers808,Visual.ly

Jim Gray, Microsoft

By 2018, the US could face a shortage
of up to 190,000 workers with analytical
skills
McKinsey Global Institute
The sexy job in the next 10 years will
be statisticians. Data Scientists?
Hal Varian, Prof. Emeritus UC Berkeley
Chief Economist, Google

Hal Varian Explains...

The ability to take data, to be able to
understand it, to process it, to extract
value from it, to visualize it, to
communicate it, that's going to be a hugely
important skill in the next decades, not
only at the professional level but even at
the educational level for elementary school
kids, for high school kids, for college kids.
Because now we really do have essentially
free and ubiquitous data.
- Hal Varian

Ask an interesting question.

What is the scientific goal?

What would you do if you had all the data?
What do you want to predict or estimate?

Get the data.

How were the data sampled?

Which data are relevant?
Are there privacy issues?

Explore the data.

Plot the data.

Are there anomalies?
Are there patterns?

Model the data.

Build a model.
Fit the model.
Validate the model.

Communicate and visualize the results.

What did we learn?

Do the results make sense?
Can we tell a story?
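
The "build, fit, validate" steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic (a made-up line plus noise), the model is a least-squares line, and validation is a single train/test split; none of these choices come from the slides.

```python
import random

random.seed(0)

# Get the data: synthetic points from y = 2x + 1 plus noise (made up).
xs = [i / 10 for i in range(100)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.5)) for x in xs]

# Hold out part of the data so the model is validated on unseen points.
random.shuffle(data)
train, test = data[:70], data[70:]

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(points, slope, intercept):
    """Average squared prediction error on the given points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)

# Fit on the training set, then compare errors on both sets.
slope, intercept = fit_line(train)
train_mse = mean_squared_error(train, slope, intercept)
test_mse = mean_squared_error(test, slope, intercept)

print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"train MSE={train_mse:.3f} test MSE={test_mse:.3f}")
```

A test error far above the training error would be one sign of overfitting; here the model is simple enough that the two should be close.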

IPython Notebooks
http://nbviewer.ipython.org/

Outline

What?
Why?
Who?
How?

Hanspeter Pfister

An Wang Professor of Computer Science, SEAS

Director, Institute for Applied Computational Science
pfister@seas.harvard.edu / @hpfister

Joe Blitzstein
Professor of the Practice in Statistics,
Co-Director of Undergraduate Studies in Statistics
blitz@fas.harvard.edu, twitter @stat110, SC 714

Verena Kaynig-Fittkau
Lecturer and research scientist at IACS
vkaynig@seas.harvard.edu, NW B164

Rahul Dave
Head TF and Lecturer at IACS
rahuldave@gmail.com, NW B164

CS 109 Staff
Andrew Reece
Antonio Coppola
Austen Novis
Brian Feeny
Dana Katzenelson
Giri Gopalan
Irma Nomani
Jacob Dorabialski
Joseph Song
Kathy Li
Lawrence Kim
Leandra King

Luis Campos
Marcus Way
Michael Ma
Michael Packer
Nelson Santos
Richard Kim
Rick Wei-Jong Lee
Sail Wu
Stephen Klosterman
Xintao Qiu
Yingzhuo (Diana) Zhang
Yuhao Zhu

About You

Outline
What?
Why?
Who?
How?

CS109 Key Facets

data munging/scraping/sampling/cleaning in order to get an informative, manageable data set;
data storage and management in order to be able to access data quickly and reliably during subsequent analysis;
exploratory data analysis to generate hypotheses and intuition about the data;
prediction based on statistical tools such as regression, classification, and clustering; and
communication of results through visualization, stories, and interpretable summaries.

Act I: Predictions
Data Collection, Munging, and Storage
Exploratory Data Analysis (EDA)
Classification & Regression
Cross Validation
Dimensionality Reduction
Effective Communication & Writing

Act II: Recommendations

Support Vector Machines
Decision Trees & Random Forests
Bagging & Boosting
Machine Learning Best Practices
MapReduce, Amazon's EC2, and Spark

Act III: Clustering & Text

Bayesian Thinking & Naive Bayes
Text Analysis: LDA & Topic Modeling
Clustering
Effective Presentations
Deep Learning
Guest Lecture: Experimental Design

cs109.org

Concepts...
Lectures

...and Skills
Sections

Sections
Introduce tools & skills; available as lab
notebooks and videos

Mandatory, except for DCE students

First (group) section this Friday!
10am-12pm in MD G115
Regular sections first week as office hours
to get help with Python, Git, and HW0

Section Schedule (TBD)
[Table: weekly grid, Monday through Friday, hourly slots from 9:00 AM to 8:00 PM, with two Lecture slots.
Section leaders: Rahul (NW-B150), Leandra, Ima, Steve, Luis, Diana, Lawrence, Joseph, Michael Packer, Michael Ma, Antonio, Sail and Nelson, Austen and Dana, Richard, Kathi.
Rooms: NW-B150, NW-B103, NW-B166.]

Homework
Real-World focus
Scrape and wrangle messy data
Apply sophisticated statistical analysis
Visualize and communicate results
Election data, music charts,
recommendations, etc.

Programming

xkcd

Piazza

Sign up by next Friday (HW0)

Announcements posted here
Questions, feedback, discussions, etc.
Help each other!

Grades

No exams!
50% Homework
40% Projects (3-4 person teams)
10% Participation (Piazza & Sections)
10 point scale, holistic grading
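
To see how the percentages combine, here is a small sketch that applies the weights above to component scores on the 10 point scale. The example scores are invented, and since the slide describes grading as holistic, a plain weighted sum is only an approximation of how a final grade might be formed.

```python
# Weights from the slide.
WEIGHTS = {"homework": 0.50, "projects": 0.40, "participation": 0.10}

def course_score(scores):
    """Weighted sum of component scores, all on the same scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Hypothetical component scores on a 10 point scale.
example = {"homework": 8.0, "projects": 9.0, "participation": 10.0}
total = course_score(example)
print(f"weighted score: {total:.1f} / 10")
```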

Projects

Policies

HWs due on Thursdays, 11:59 pm EST

6 late days for HW (no questions asked)
Cannot submit a HW more than 2 days late
Regrading requests within 7 days in writing

Grade may improve or go down
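
The late-day rules above (6 late days in total, at most 2 on any one homework) can be expressed as a small check. This sketch assumes lateness is counted in whole days and that the 6 days are a single budget shared across all homeworks; the function name is hypothetical.

```python
TOTAL_LATE_DAYS = 6   # from the slide
MAX_LATE_PER_HW = 2   # from the slide

def check_late_days(days_late_per_hw):
    """Return (ok, remaining) for a list of whole days late, one per homework.

    remaining is None when a single homework breaks the per-homework limit.
    """
    if any(d < 0 or d > MAX_LATE_PER_HW for d in days_late_per_hw):
        return False, None
    remaining = TOTAL_LATE_DAYS - sum(days_late_per_hw)
    return remaining >= 0, remaining

print(check_late_days([0, 2, 1, 0]))   # within both limits
print(check_late_days([3]))            # one homework too late
print(check_late_days([2, 2, 2, 1]))   # total budget exceeded
```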

Collaboration Policy

Work you turn in must be your own

Projects are a 3-4 person team effort

With project group peer assessment

Acknowledge all help and code you used

Harvard Honor Code

Is this course for me?

Prerequisites
Programming experience

CS50 and/or C, C++, Java, Python, etc.

Basic statistical knowledge

STAT100, ideally STAT110

Willingness to learn new software & tools

This can be time consuming

You will need to read online documentation

Be Patient
Be Flexible
Be Constructive

http://davidzinger.wordpress.com/2007/05/page/2/

Next Steps

HW 0, mandatory, needs to be submitted!

Good test of your basic skills

Installation of several Python frameworks

Complete the survey by tomorrow! Needed to be
able to submit HW 0

Not graded, do it as soon as possible

Read syllabus carefully

Important Links
Create a github account at http://github.com
Then fill in our survey at http://goo.gl/forms/bJwajS8zO8
HW 0 document at https://github.com/cs109/2015lab1/blob/master/hw0.ipynb

Week 1 notebooks at https://github.com/cs109/2015lab1

HW repositories will be created for you on github. See
HW 0 for details.
