Machine learning has undergone dramatic changes in the past decade. Empirical risk minimization, enabled by batch and online convex optimization, is well understood, and has been used in classification, regression, and unsupervised learning. Recently, stochastic optimization and a suite of tools has been used for non-convex optimization, which has enabled learning functions that are the composition of other simpler functions. This has enabled the learning of representations in deep neural networks, and has transformed the field of computer vision and natural language processing.
Decision making in Markovian environments is a separate field of research in machine learning. Passive learning in Markov models include the hidden markov models and recurrent neural networks. For active decision making, settings and methods range from multiarmed bandits to reinforcement learning. All of these developments have culminated to deep Q-networks and other deep reinforcement learning techniques. It is our ultimate goal to track these ideas from basic machine learning to deep reinforcement learning.
You should read this syllabus as a list of topics that will take on average 30 minutes to go over, and some topics are dependent on other topics. The only rule for going through this material is that you must complete the dependencies before you go over a given topic. In this way the course forms a directed acyclic graph (DAG) and we will visit each vertex in an ordering that is consistent with the DAG structure. The format of the content section is the following:
- description
- subtopics
- references
The refences are some of my source material for a topic. You should in no way interpret this as required reading, and the course content are the lectures that I give and the content in this repository.
- Linear Algebra (projections, matrix operations, singular value decomposition and spectrum)
- Calculus and probability (differentiation, Newton's method, LLN, CLT, expectation, conditioning, Bayes theorem, etc.)
- Statistical decision making (MLE, hypothesis testing, confidence intervals, linear and logistic regression)
- Computer programming (Basic data structures, sorting, big-O notation, sequential programming language such as C, Python, R, Java)
- Look into the basic terminology of machine learning, and a preview of what is to come
- Definition of learning machine, supervised learning, loss, risk, cross-validation, basic APIs, computational complexity, OLS, KNN
- "Machine Learning", Tom Mitchell, Chapter 1
- Classification is an important subclass of learning, and it is the first example of a computational-statistical tradeoff with surrogate losses.
- Hinge loss, logistic loss, surrogate losses, ROC and PR curves
- "A Probabilistic Theory of Pattern Recognition", Devroye et al., Chapters 1-2
- "Surrogate losses and F-divergences", Nguyen, Wainwright, Jordan, 2009.
- This is a continuation of Class with a look at F-divergences and other topics in information theory
- F-divergences, differential entropy, and hypothesis testing
- "Introduction to Non-parametric Estimation", Tsybakov, Chapter 2.
- "Information Theory, Inference, and Learning Algorithms", Mackay, Chapter 1.
- "A Probabilistic Theory of Pattern Recognition", Devroye et al., Chapter 3
- We will see a bias variance tradeoff in supervised learning with a toy model, which is the sparse normal means model. Also some strategies for model selection.
- sparse normal means, soft-thresholding, lasso, greedy selection
- "On Minimax Estimation of a Sparse Normal Mean Vector", Iain Johnstone, 1994
- "Elements of Statistical Learning", Hastie, Tibshirani, Friedman, Chapter 3
Directed Graphical Models and Hidden Markov Models (HMM) : IML
- We will look at directed graphical models and how to learn in Markovian environments. In particular, we will see hidden Markov models (HMMs) and how to learn parameters in this context.
- Forward backward algorithm, EM algorithm, DAGS
- "An Introduction to Hidden Markov Models and Bayesian Networks", Zoubin Ghahramani.
- We will see convex functions, subgradients, and why convexity is nice.
- convexity, subgradient, subdifferential, first order conditions
- "Convex Optimization", Boyd and Vandenberghe, Unit 1
- "Subgradients", Boyd, Duchi, Vandenberghe.
- In online learning, we read the data sequentially and need to make predictions in sequence. We will introduce the basic idea of online convex optimization and regret and see how this fits into the ideas we have learned in IML. We will see some basics of learning from experts.
- "Introduction to Online Convex Optimization", Hazan, Chapter 1.
- "Prediction, learning, and games", Cesa-Bianchi, Lugosi, Introduction.
- We will see subgradient descent algorithms, and convergence guarantees. We will see that the perceptron is just SGD applied to the hinge loss.
- Online optimization, subgradient descent, SGD, preceptron, regret
- "Introduction to Online Convex Optimization", Hazan, Chapters 1-3.
- "Online convex programming and generalized infinitesimal gradient ascent", Zinkevich, 2003.
- PCA is the first solvable non-convex programs that we will encounter. PCA can be used for learning latent factors and dimension reduction. We will cast it as a contrained loss minimization problem, and show it's connections to K-means clustering.
- PCA, Frobenius norm, SVD, K-means, Lloyds algorithm
- "Elements of Statistical Learning", Hastie, Tibshirani, Friedman, Chapter 14
- The kernel trick is a simple way to go from linear methods to non-linear methods, while ensuring computational feasibility. We will see the general kernel trick and apply it to SVMs and PCA.
- Kernel trick, Kernel SVMs, Kernel PCA
- "Kernel Methods in Machine Learning", Hoffman, Scholkopf, Smola, 2008.
- We will look at algorithms like CART and other decision trees. We will recall the bootstrap and bagging, then go over random forests. We will talk about random forests in the context of the bias-variance tradeoff.
- CART, bagging, random forests
- "Elements of Statistical Learning", Hastie, Tibshirani, Friedman, Chapter 15
- We will go over boosting stumps and how smart ensembles of weak learners can make strong learners. Specific attention will be paid to Adaboost and we will prove a bound on training error. Time permitting, we will go over gradient boosting.
- Adaboost, weak learners, exponential weighting, gradient boosting
- "Rapid Object Detection using a Boosted Cascade of Simple Features", Viola, Jones, 2001.
- "Experiments with a New Boosting Algorithm", Freund, Schapire, 1996.
- "Boosting notes", Schapire.
- "Gradient tree boosting paper", Friedman
- Neural networks construct non-linear functions by composing simple convex functions, which produces non-convex functions. We will see that neural networks can learn non-linear separators like XOR, and look at how to optimize them.
- Fully connected layers, backpropagation, computation graphs
- "Learning representations by back-propagating errors", Rumelhart, Hinton, Williams, 1986.
- "Elements of Statistical Learning", Hastie, Tibshirani, Friedman, Chapter 11
- "Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping", Caruana, Lawrence, Giles, 2001.
- We will look at some examples of non-convex optimization problems in machine learning, such as neural networks and PCA. We will see that perturbation and SGD can help non-convex optimization escape local minima.
- Local minima, Saddle points, convergence guarantees, random starts
- "Nonconvex optimization lecture notes", De Sa
- "Escaping from Saddle Points", Ge
- "Saddles again", Recht.
- We will see how deep neural nets are universal function approximators, but optimizing them can be challenging. We will look at optimization methods, such as Nesterov acceleration. We will also look at tricks like gradient clipping, dropouts, and variance reduction.
- "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Srivastava et al., 2014.
- "On the importance of initialization and momentum in deep learning", Sutskever et al., 2013
- "Universal Approximation Bounds for Superpositions of a Sigmoidal Function ", Barron, 1993.
- "Optimization for Training Deep Models", Deep Learning, Goodfellow, Bengio, Courville.
- "On the Number of Linear Regions of Deep Neural Networks", Montufar et al., 2014
- We will see that recurrent neural nets provide an alternative formulation to the HMM for prediction in Markov models. We will look at unravelling the computation graph and the DAG implied by RNNs.
- RNN, Recurrent gradient calculation, RNN DAG, LSTM
- "Sequence Modeling: Recurrent and Recursive Nets", Deep Learning, Goodfellow, Bengio, Courville.
- "Neural Machine Translation by Jointly Learning to Align and Translate", Bahdanau et al., 2016
- "Sequence to Sequence Learning with Neural Networks", Sutskever et al., 2014
- Tech Report, "Probabilistic Interpretations of Recurrent Neural Networks", Choe et al., 2017
- code
- NMT code
- We will see how convolution can enforce parameter sharing, and look at this in the context of computer vision. We will see how convolution can be used with fixed low level features such as SIFT features and Gabor filters. This will lead to deep convolutional NNs.
- Convolution, FFT, ConvNets
- "Convolutional Networks", Deep Learning, Goodfellow, Bengio, Courville.
- "Gradient-Based Learning Applied to Document Recognition", LeCun et al., 1998
- "Going Deeper with Convolutions", Szegedy et al., 2015
- In deep learning, at an intermediate layer, the input distribution will shift during training, causing learning rates to needs be small. Batch normalization is one way to get around this problem by normalizing the neuron in mini-batches.
- batch normalization
- "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", Ioffe, Szegedy, 2015.
- "How Does Batch Normalization Help Optimization?", Santurkar et al., 2018
- Much like adding layers to linear classifiers can form non-linear classifiers, autoencoders add layers to PCA to perform non-linear dimension reduction. Time permitting we will look at variants such as sparse, and convolutional.
- autoencoders, convolutional autoencoders
- "Autoencoders", Deep Learning, Goodfellow, Bengio, Courville.
- GANs is another deep unsupervised learner that seeks to make it hard to distinguish between fake and training images. This is accomplished by simultaneously learning and adversary and a generative model.
- GANs
- "Generalized Adversarial Networks", Goodfellow et al., 2014
- Our first sequential decision making setting is the multi-armed and stochastic bandit setting. We will look at exponential weighting and UCB.
- Multi-armed bandit, EXP3, UCB
- "Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems", Bubeck, Cesa-Bianchi, 2012, Chapters 2-3.
- A more realistic setting for recommendation systems is bandits with context features. We will look at contextual bandits in the stochastic framework and the linUCB algorithm.
- Contextual Bandits, linUCB
- "Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems", Bubeck, Cesa-Bianchi, 2012, Chapter 4.
- The bandit setting is a specific Markov decision process. Finding a policy that can perform well in the MDP setting is the focus of reinforcement learning. We will see the basic definitions of RL, Bellman iteration, and Monto Carlo methods.
- Bellman iteration, RL, MDP, on/off-policy
- "Reinforcement Learning", Sutton, Barto, Chapters 3-4
- "Policy Gradient Methods for Reinforcement Learning with Function Approximation", Sutton et al., '00.
- "http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.129.8871", Williams, '92.
- "Reinforcement Learning", Sutton, Barto, Chapters 6
- "Actor-Critic Algorithms", Konda and Tsitsiklis, '00.
- "Reinforcement Learning", Sutton, Barto, Chapter 6
- "Q-learning", Watkins and Dayan, 1992
- "Convergence of Stochastic Iterative Dynamic Programming Algoriths", Jaakola at al. 1994
- "Proximal Policy Optimization Algorithms", Schulman et al. 2017.
- "Trust Region Policy Optimization", Schulman et al. 2015
- "Playing Atari with Deep Reinforcement Learning", Mihn et al.
- "Rainbow: Combining Improvements in Deep Reinforcement Learning"
- the scribes have been assigned, you can find your lesson here
- you will use LaTeX to write up the lectures, and should recieve an invite to a project on overleaf
We may not be able to get to every topic. My job is to present these methods and summarize the material in lecture. Each of you will act as scribe for one of these topics. Once this is completed to my satisfaction then you will pass the class with an A.