Advanced Machine Learning

Prof. James Sharpnack

Winter 2019

Office hours: 10am-12pm MSB 4107 or by appointment

Introduction

Machine learning has undergone dramatic changes in the past decade. Empirical risk minimization, enabled by batch and online convex optimization, is well understood, and has been used in classification, regression, and unsupervised learning. Recently, stochastic optimization and a suite of tools has been used for non-convex optimization, which has enabled learning functions that are the composition of other simpler functions. This has enabled the learning of representations in deep neural networks, and has transformed the field of computer vision and natural language processing.

Decision making in Markovian environments is a separate field of research in machine learning. Passive learning in Markov models include the hidden markov models and recurrent neural networks. For active decision making, settings and methods range from multiarmed bandits to reinforcement learning. All of these developments have culminated to deep Q-networks and other deep reinforcement learning techniques. It is our ultimate goal to track these ideas from basic machine learning to deep reinforcement learning.

How to read this syllabus

You should read this syllabus as a list of topics that will take on average 30 minutes to go over, and some topics are dependent on other topics. The only rule for going through this material is that you must complete the dependencies before you go over a given topic. In this way the course forms a directed acyclic graph (DAG) and we will visit each vertex in an ordering that is consistent with the DAG structure. The format of the content section is the following:

topic (ABV) : dependency 1, dependency 2, ...

description
subtopics
references

The refences are some of my source material for a topic. You should in no way interpret this as required reading, and the course content are the lectures that I give and the content in this repository.

Prerequisites:

Linear Algebra (projections, matrix operations, singular value decomposition and spectrum)
Calculus and probability (differentiation, Newton's method, LLN, CLT, expectation, conditioning, Bayes theorem, etc.)
Statistical decision making (MLE, hypothesis testing, confidence intervals, linear and logistic regression)
Computer programming (Basic data structures, sorting, big-O notation, sequential programming language such as C, Python, R, Java)

Classification is an important subclass of learning, and it is the first example of a computational-statistical tradeoff with surrogate losses.
Hinge loss, logistic loss, surrogate losses, ROC and PR curves
"A Probabilistic Theory of Pattern Recognition", Devroye et al., Chapters 1-2
"Surrogate losses and F-divergences", Nguyen, Wainwright, Jordan, 2009.

Information Theory (IT) : Class

This is a continuation of Class with a look at F-divergences and other topics in information theory
F-divergences, differential entropy, and hypothesis testing
"Introduction to Non-parametric Estimation", Tsybakov, Chapter 2.
"Information Theory, Inference, and Learning Algorithms", Mackay, Chapter 1.
"A Probabilistic Theory of Pattern Recognition", Devroye et al., Chapter 3

Bias-Variance Tradeoff (BV) : IML

We will see a bias variance tradeoff in supervised learning with a toy model, which is the sparse normal means model. Also some strategies for model selection.
sparse normal means, soft-thresholding, lasso, greedy selection
"On Minimax Estimation of a Sparse Normal Mean Vector", Iain Johnstone, 1994
"Elements of Statistical Learning", Hastie, Tibshirani, Friedman, Chapter 3

Directed Graphical Models and Hidden Markov Models (HMM) : IML

We will look at directed graphical models and how to learn in Markovian environments. In particular, we will see hidden Markov models (HMMs) and how to learn parameters in this context.
Forward backward algorithm, EM algorithm, DAGS
"An Introduction to Hidden Markov Models and Bayesian Networks", Zoubin Ghahramani.

Convex Optimization (Conv) : IML

We will see convex functions, subgradients, and why convexity is nice.
convexity, subgradient, subdifferential, first order conditions
"Convex Optimization", Boyd and Vandenberghe, Unit 1
"Subgradients", Boyd, Duchi, Vandenberghe.

Online Learning (Online) : Conv

In online learning, we read the data sequentially and need to make predictions in sequence. We will introduce the basic idea of online convex optimization and regret and see how this fits into the ideas we have learned in IML. We will see some basics of learning from experts.
"Introduction to Online Convex Optimization", Hazan, Chapter 1.
"Prediction, learning, and games", Cesa-Bianchi, Lugosi, Introduction.

Online convex optimization (OCO) : Online, Class

We will see subgradient descent algorithms, and convergence guarantees. We will see that the perceptron is just SGD applied to the hinge loss.
Online optimization, subgradient descent, SGD, preceptron, regret
"Introduction to Online Convex Optimization", Hazan, Chapters 1-3.
"Online convex programming and generalized infinitesimal gradient ascent", Zinkevich, 2003.

Unsupervised Learning (UL): Conv, Class

PCA is the first solvable non-convex programs that we will encounter. PCA can be used for learning latent factors and dimension reduction. We will cast it as a contrained loss minimization problem, and show it's connections to K-means clustering.
PCA, Frobenius norm, SVD, K-means, Lloyds algorithm
"Elements of Statistical Learning", Hastie, Tibshirani, Friedman, Chapter 14

Kernel Machines (Kern) : Class, UL

The kernel trick is a simple way to go from linear methods to non-linear methods, while ensuring computational feasibility. We will see the general kernel trick and apply it to SVMs and PCA.
Kernel trick, Kernel SVMs, Kernel PCA
"Kernel Methods in Machine Learning", Hoffman, Scholkopf, Smola, 2008.

Decision Trees and Random Forests (Tree) : Class

We will look at algorithms like CART and other decision trees. We will recall the bootstrap and bagging, then go over random forests. We will talk about random forests in the context of the bias-variance tradeoff.
CART, bagging, random forests
"Elements of Statistical Learning", Hastie, Tibshirani, Friedman, Chapter 15

Boosting (Boost) : OCO

We will go over boosting stumps and how smart ensembles of weak learners can make strong learners. Specific attention will be paid to Adaboost and we will prove a bound on training error. Time permitting, we will go over gradient boosting.
Adaboost, weak learners, exponential weighting, gradient boosting
"Rapid Object Detection using a Boosted Cascade of Simple Features", Viola, Jones, 2001.
"Experiments with a New Boosting Algorithm", Freund, Schapire, 1996.
"Boosting notes", Schapire.
"Gradient tree boosting paper", Friedman

Neural Nets (NNets) : Class, Conv

Neural networks construct non-linear functions by composing simple convex functions, which produces non-convex functions. We will see that neural networks can learn non-linear separators like XOR, and look at how to optimize them.
Fully connected layers, backpropagation, computation graphs
"Learning representations by back-propagating errors", Rumelhart, Hinton, Williams, 1986.
"Elements of Statistical Learning", Hastie, Tibshirani, Friedman, Chapter 11
"Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping", Caruana, Lawrence, Giles, 2001.

Non-convex Optimization (NonConv) : UL, NNets, Conv

We will look at some examples of non-convex optimization problems in machine learning, such as neural networks and PCA. We will see that perturbation and SGD can help non-convex optimization escape local minima.
Local minima, Saddle points, convergence guarantees, random starts
"Nonconvex optimization lecture notes", De Sa
"Escaping from Saddle Points", Ge
"Saddles again", Recht.

Deep learning and optimization (Deep) : NonConv, NNets

We will see how deep neural nets are universal function approximators, but optimizing them can be challenging. We will look at optimization methods, such as Nesterov acceleration. We will also look at tricks like gradient clipping, dropouts, and variance reduction.
"Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Srivastava et al., 2014.
"On the importance of initialization and momentum in deep learning", Sutskever et al., 2013
"Universal Approximation Bounds for Superpositions of a Sigmoidal Function ", Barron, 1993.
"Optimization for Training Deep Models", Deep Learning, Goodfellow, Bengio, Courville.
"On the Number of Linear Regions of Deep Neural Networks", Montufar et al., 2014

Recurrent Neural Nets (RNN) : Deep, HMM

We will see that recurrent neural nets provide an alternative formulation to the HMM for prediction in Markov models. We will look at unravelling the computation graph and the DAG implied by RNNs.
RNN, Recurrent gradient calculation, RNN DAG, LSTM
"Sequence Modeling: Recurrent and Recursive Nets", Deep Learning, Goodfellow, Bengio, Courville.
"Neural Machine Translation by Jointly Learning to Align and Translate", Bahdanau et al., 2016
"Sequence to Sequence Learning with Neural Networks", Sutskever et al., 2014
Tech Report, "Probabilistic Interpretations of Recurrent Neural Networks", Choe et al., 2017
code
NMT code

Convolutional Neural Nets (ConvNets) : Deep

We will see how convolution can enforce parameter sharing, and look at this in the context of computer vision. We will see how convolution can be used with fixed low level features such as SIFT features and Gabor filters. This will lead to deep convolutional NNs.
Convolution, FFT, ConvNets
"Convolutional Networks", Deep Learning, Goodfellow, Bengio, Courville.
"Gradient-Based Learning Applied to Document Recognition", LeCun et al., 1998
"Going Deeper with Convolutions", Szegedy et al., 2015

Batch normalization (BatchNorm): Deep

In deep learning, at an intermediate layer, the input distribution will shift during training, causing learning rates to needs be small. Batch normalization is one way to get around this problem by normalizing the neuron in mini-batches.
batch normalization
"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", Ioffe, Szegedy, 2015.
"How Does Batch Normalization Help Optimization?", Santurkar et al., 2018

Deep Autoencoders (Auto) : Deep, UL

Much like adding layers to linear classifiers can form non-linear classifiers, autoencoders add layers to PCA to perform non-linear dimension reduction. Time permitting we will look at variants such as sparse, and convolutional.
autoencoders, convolutional autoencoders
"Autoencoders", Deep Learning, Goodfellow, Bengio, Courville.

Generalized Adversarial Networks (GAN) : Auto

GANs is another deep unsupervised learner that seeks to make it hard to distinguish between fake and training images. This is accomplished by simultaneously learning and adversary and a generative model.
GANs
"Generalized Adversarial Networks", Goodfellow et al., 2014

Bandits (Bandit) : OCO

Our first sequential decision making setting is the multi-armed and stochastic bandit setting. We will look at exponential weighting and UCB.
Multi-armed bandit, EXP3, UCB
"Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems", Bubeck, Cesa-Bianchi, 2012, Chapters 2-3.

Contextual Bandits (Context) : Bandit

A more realistic setting for recommendation systems is bandits with context features. We will look at contextual bandits in the stochastic framework and the linUCB algorithm.
Contextual Bandits, linUCB
"Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems", Bubeck, Cesa-Bianchi, 2012, Chapter 4.

Reinforcement learning (RL) : Bandit

The bandit setting is a specific Markov decision process. Finding a policy that can perform well in the MDP setting is the focus of reinforcement learning. We will see the basic definitions of RL, Bellman iteration, and Monto Carlo methods.
Bellman iteration, RL, MDP, on/off-policy
"Reinforcement Learning", Sutton, Barto, Chapters 3-4

Policy gradients and REINFORCE (PGrad) : RL

"Policy Gradient Methods for Reinforcement Learning with Function Approximation", Sutton et al., '00.
"http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.129.8871", Williams, '92.

Temporal difference learning (TD) : RL

"Reinforcement Learning", Sutton, Barto, Chapters 6
"Actor-Critic Algorithms", Konda and Tsitsiklis, '00.

Q-learning and SARSA (Qlearn) : TD

"Reinforcement Learning", Sutton, Barto, Chapter 6
"Q-learning", Watkins and Dayan, 1992
"Convergence of Stochastic Iterative Dynamic Programming Algoriths", Jaakola at al. 1994

Proximal policy optimization (PPO) : PGrad, IT

"Proximal Policy Optimization Algorithms", Schulman et al. 2017.
"Trust Region Policy Optimization", Schulman et al. 2015

Deep Q-learning (DQN) : Qlearn

Instruction Plan and Grading

the scribes have been assigned, you can find your lesson here
you will use LaTeX to write up the lectures, and should recieve an invite to a project on overleaf

We may not be able to get to every topic. My job is to present these methods and summarize the material in lecture. Each of you will act as scribe for one of these topics. Once this is completed to my satisfaction then you will pass the class with an A.

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
BV		BV
Class		Class
IML		IML
IT		IT
code		code
example		example
README.md		README.md

jsharpna/AML

Folders and files

Latest commit

History

Repository files navigation