Supervised Learning
● ML has largely focused on this setting …
● But lots of other problem settings are coming up:
○ What if we also have unlabeled data?
○ What if we only have unlabeled data?
○ What if we have poor-quality labels (e.g., coarse or potentially mistaken)?
○ What if we have many datasets, but each somehow differs from the others?
○ What if we only have one example, or a few, per (new) class?
○ ……
And wait, there are more!
• Transfer Learning
• Semi-supervised learning
• One/Few-shot learning
• Un/Self-Supervised Learning
• Domain adaptation
• Meta-Learning
• Zero-shot learning
• Continual / Lifelong learning
• Multi-modal learning
• Multi-task learning
• Active learning
• …

Setting               | Source           | Target             | Shift Type
----------------------|------------------|--------------------|-------------
Semi-supervised       | Single labeled   | Single unlabeled   | None
Domain Adaptation     | Single labeled   | Single unlabeled   | Non-semantic
Domain Generalization | Multiple labeled | Unknown            | Non-semantic
Cross-Task Transfer   | Single labeled   | Single unlabeled   | Semantic
Few-Shot Learning     | Single labeled   | Single few-labeled | Semantic
Un/Self-Supervised    | Single unlabeled | Many labeled       | Both/Task
Particularly Meaningful for CV …
[Figure: labeling tasks such as “Crystal”/“Needle”/“Empty”, digits “0”/“1”/“2”, or topics “Sports”/“News”/“Science” require a human expert, special equipment, or experiments. Unlabeled images are cheap and abundant; labels are expensive and scarce!]
Particularly Meaningful for CV …
[Figure: the same image (containing a horse and a person) annotated with each label type]

Label type          | Annotation time
--------------------|-----------------
image-level labels  | 1 s/class
points              | 2.4 s/instance
bounding boxes      | 10 s/instance
scribbles           | 17 s/instance
pixel-level labels  | 78 s/instance
A Whole Big Field! We try to cover a few …
• Semi-Supervised Learning
• Few-Shot Learning
• Active Learning
• Transfer and Multi-Task Learning
• Self-Supervised Learning
What is Semi-Supervised Learning?
○ Training data: both labeled data (image, label) and unlabeled data (image)
○ Goal: use unlabeled data to improve supervised learning
○ Note: if we have lots of labeled data, this goal is much harder
[Figure: supervised learning (labeled data only) vs. semi-supervised learning (labeled + unlabeled data)]
An Incomplete List of Methods ….
• Confidence & Entropy – “no matter what, be confident”
• Pseudo labeling
• Entropy minimization
• Virtual Adversarial Training
• Label Consistency – “label is robust to perturbations”
• Pseudo labeling, yet applying different sample augmentations
• Temporal Ensembling, Mean Teacher …
• Regularization
• Weight decay, Dropout …
• Strong/unsupervised data augmentation: MixUp, CutOut, MixMatch …
• Co-Training / Self-Training / Pseudo Labeling / Noisy Student
Pseudo Labeling
● Simple idea:
• Train on labeled data
• Make predictions on unlabeled data
• Pick confident predictions and add them to the training data
• Can be done end-to-end (no need for separate stages); a minimal sketch follows this slide
● Issues:
• “Under-confidence” or flatness – “sharpen” via entropy minimization
• “Over-confidence” – needs better uncertainty quantification
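A minimal pseudo-labeling training step, assuming a PyTorch classifier; `model`, `optimizer`, and the batches are placeholders, and the 0.95 confidence threshold is only illustrative.

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(model, optimizer, labeled_batch, unlabeled_batch,
                      threshold=0.95, unlabeled_weight=1.0):
    x_l, y_l = labeled_batch
    x_u = unlabeled_batch

    # Supervised loss on the labeled data.
    loss = F.cross_entropy(model(x_l), y_l)

    # Pseudo-labels: keep only confident predictions (no gradient through them).
    with torch.no_grad():
        probs = F.softmax(model(x_u), dim=1)
        conf, pseudo_y = probs.max(dim=1)
        mask = conf >= threshold

    # Add a cross-entropy term on the confident unlabeled examples.
    if mask.any():
        loss = loss + unlabeled_weight * F.cross_entropy(model(x_u[mask]), pseudo_y[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```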
Label Consistency with Data Augmentations
Make sure that the logits of differently augmented views are similar.
We can either “ensemble” or “compare” them, e.g., as in the sketch below.
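A hedged sketch of one way to “compare” them, in the Π-model style: augment the same unlabeled image twice and penalize disagreement between the two predictions with an MSE term. `model` and `augment` are placeholders.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x_unlabeled, augment):
    view1 = augment(x_unlabeled)
    view2 = augment(x_unlabeled)
    p1 = F.softmax(model(view1), dim=1)
    with torch.no_grad():                 # treat the second view as a fixed target
        p2 = F.softmax(model(view2), dim=1)
    return F.mse_loss(p1, p2)             # penalize disagreement between the views
```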
MixMatch: A Holistic Approach for Semi-Supervised Learning
MixUp
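A minimal MixUp sketch of the kind MixMatch uses internally: convex-combine random pairs of inputs and their (one-hot or guessed) targets with a Beta-distributed coefficient. The `alpha` value and the `max(lam, 1-lam)` step follow MixMatch’s convention; treat this as an illustrative sketch, not the full MixMatch pipeline.

```python
import numpy as np
import torch

def mixup(x, y_onehot, alpha=0.75):
    lam = np.random.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)              # keep the mix closer to the first input
    perm = torch.randperm(x.size(0))       # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix
```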
“Co-Training”
• (Blum & Mitchell, 1998) and (Mitchell, 1999) assume that
• features can be split into two sets;
• each sub-feature set is sufficient to train a good classifier.
• Initially, two separate classifiers are trained on the labeled data, one per sub-feature set.
• Each classifier then classifies the unlabeled data and “teaches” the other classifier with the few unlabeled examples (and their predicted labels) about which it is most confident.
• Each classifier is retrained with the additional training examples given by the other classifier, and the process repeats – see the sketch below.
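A schematic of that loop (not Blum & Mitchell’s original code), assuming the two feature views are given as NumPy arrays and using scikit-learn logistic regressions as the two classifiers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_training(Xl_v1, Xl_v2, y_l, Xu_v1, Xu_v2, rounds=10, k=5):
    clf1 = LogisticRegression(max_iter=1000)
    clf2 = LogisticRegression(max_iter=1000)
    X1, X2, y = Xl_v1.copy(), Xl_v2.copy(), y_l.copy()
    U1, U2 = Xu_v1.copy(), Xu_v2.copy()

    for _ in range(rounds):
        if len(U1) == 0:
            break
        clf1.fit(X1, y)
        clf2.fit(X2, y)

        # Each classifier labels the pool on its own view and picks its k most
        # confident examples to "teach" the other classifier.
        conf1 = clf1.predict_proba(U1).max(axis=1)
        conf2 = clf2.predict_proba(U2).max(axis=1)
        idx1 = np.argsort(-conf1)[:k]
        idx2 = np.argsort(-conf2)[:k]

        moved = np.concatenate([idx1, idx2])        # overlaps are fine for a sketch
        new_y = np.concatenate([clf1.predict(U1[idx1]), clf2.predict(U2[idx2])])

        X1 = np.vstack([X1, U1[moved]])
        X2 = np.vstack([X2, U2[moved]])
        y = np.concatenate([y, new_y])

        keep = np.setdiff1d(np.arange(len(U1)), moved)   # shrink the unlabeled pool
        U1, U2 = U1[keep], U2[keep]

    clf1.fit(X1, y)
    clf2.fit(X2, y)
    return clf1, clf2
```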
“Noisy Student”
Few-Shot Learning
Normal Approach?
• Do what we always do: fine-tuning
– Train a classifier on the base classes
– Freeze the features
– Learn classifier weights for the new classes using a small amount of labeled data (at “query” time!) – see the sketch below
Cons?
• The training we do on the base classes does not take the test-time task into account
• No notion that we will be performing a bunch of N-way tests
• Idea: simulate what we will see during test time – and we can do that many times!
A Closer Look at Few-shot Classification. Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, Jia-Bin Huang.
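A minimal sketch of this fine-tuning baseline: freeze a backbone trained on the base classes and fit only a new linear head on the few labeled support examples of the novel classes. `backbone`, `support_x`, `support_y`, `feat_dim`, and `n_way` are placeholders; the backbone is assumed to output (N, feat_dim) features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def finetune_new_classes(backbone, support_x, support_y, feat_dim, n_way,
                         steps=100, lr=0.01):
    backbone.eval()                                   # freeze the features
    with torch.no_grad():
        feats = backbone(support_x)                   # (N, feat_dim) support features

    head = nn.Linear(feat_dim, n_way)                 # new classifier head only
    opt = torch.optim.SGD(head.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(head(feats), support_y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return head
```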
Meta Learning Approach
• Set up a set of smaller tasks (episodes) during training that simulate what we will be doing during testing
– Can optionally pre-train features on held-out base classes (not typical)
• Testing stage is now the same, but with new classes
Model-Agnostic Meta-Learning (MAML)
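The figure for this slide is omitted; below is a hedged first-order MAML (FOMAML) sketch rather than the full second-order algorithm: adapt a copy of the model on each task’s support set, evaluate on that task’s query set, and apply the resulting gradients to the original parameters. `model`, `meta_opt`, and the task tuples are placeholders.

```python
import copy
import torch
import torch.nn.functional as F

def fomaml_step(model, meta_opt, tasks, inner_lr=0.01, inner_steps=1):
    meta_opt.zero_grad()
    for (support_x, support_y, query_x, query_y) in tasks:
        fast = copy.deepcopy(model)                         # task-specific copy
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                        # inner-loop adaptation
            loss = F.cross_entropy(fast(support_x), support_y)
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()

        # Outer loss on the query set; first-order trick: copy the adapted
        # model's gradients back onto the original parameters.
        query_loss = F.cross_entropy(fast(query_x), query_y)
        grads = torch.autograd.grad(query_loss, fast.parameters())
        for p, g in zip(model.parameters(), grads):
            p.grad = g if p.grad is None else p.grad + g
    meta_opt.step()
```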
Active Learning
From Education …
C. Bonwell and J. Eison [1]: In active learning, students participate in the process and
students participate when they are doing something besides passively listening. It is a model
of instruction or an education action that gives the responsibility of learning to learners
themselves.
… to Machine Learning:
Settles [2, p.5]: Active learning systems attempt to overcome the labeling bottleneck by
asking queries in the form of unlabeled instances to be labeled by an oracle. In this
way, the active learner aims to achieve high accuracy using as few labeled instances as
possible, thereby minimizing the cost of obtaining labeled data.
[1] Charles C. Bonwell and James A. Eison. Active learning: Creating excitement in the classroom. ASHE-ERIC Higher Education Report, 1, 1991.
[2] Burr Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, Madison, Wisconsin, USA, 2009.
Active Learning
Setting
• Some information is costly (some not)
• Active learner controls selection process
Objective
• Select the most valuable information
• Baseline: random selection
Historical Remarks
• Optimal experimental design
• Valerii V. Fedorov. “Theory of Optimal Experiments Design”, Academic Press, 1972.
• Learning with queries/query synthesis
• Dana Angluin. “Queries and concept learning”, Machine Learning, 2:319–342, 1988.
• Selective sampling
• David Cohn, L. Atlas, R. Ladner, M. El-Sharkawi, R. II Marks, M. Aggoune, and D. Park. “Training
connectionist networks with queries and selective sampling”, In Advances in Neural Information
Processing Systems (NIPS). Morgan Kaufmann, 1990.
Uncertainty sampling
Idea
• Select those instances where we are least
certain about the label
Approach
• 3 labels preselected
• Linear classifier
• Use the distance to the decision boundary as the uncertainty measure (a common variant is sketched after this slide)
“Training connectionist networks with queries and selective sampling”.
David Cohn, L. Atlas, R. Ladner, M. El-Sharkawi, R. II Marks, M. Aggoune, and D. Park.
In Advances in Neural Information Processing Systems (NIPS). Morgan Kaufmann, 1990.
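A minimal uncertainty-sampling sketch using the “least confident” variant (top-class probability) rather than the geometric distance to the boundary used in the slide’s example; `clf` is any classifier exposing `predict_proba` and `X_pool` is the unlabeled pool (both placeholders).

```python
import numpy as np

def least_confident_query(clf, X_pool, n_queries=10):
    probs = clf.predict_proba(X_pool)              # (N, n_classes)
    confidence = probs.max(axis=1)                 # probability of the top class
    return np.argsort(confidence)[:n_queries]      # indices of least-confident points
```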
Uncertainty sampling
+ easy to implement
+ fast
− no exploration (often combined with random sampling)
− impact not considered (density-weighted extensions exist)
− problems with complex structures (performance can even be worse than random)

Pure exploitation, does not explore
Can get stuck in regions with high Bayes error
Ensemble-based Sampling
“Query by committee”, H. Sebastian Seung, Manfred Opper, and Haim Sompolinsky.
Fifth workshop on computational learning theory. Morgan Kaufmann, 1992.
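A hedged query-by-committee sketch: here the committee consists of different classifier families trained on the same labeled data, and we query the pool points with the highest vote entropy (strongest disagreement). A classic alternative is to build the committee by resampling/bagging instead; all names below are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

def qbc_query(X_labeled, y_labeled, X_pool, n_queries=10):
    committee = [LogisticRegression(max_iter=1000),
                 DecisionTreeClassifier(max_depth=5),
                 KNeighborsClassifier(n_neighbors=3)]
    votes = np.stack([m.fit(X_labeled, y_labeled).predict(X_pool)
                      for m in committee], axis=1)          # (N_pool, committee size)

    def vote_entropy(row):                                   # disagreement measure
        _, counts = np.unique(row, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log(p)).sum()

    disagreement = np.apply_along_axis(vote_entropy, 1, votes)
    return np.argsort(-disagreement)[:n_queries]             # most-disagreed-upon indices
```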
Transfer Learning
Improve learning of a new task by leveraging previously learned tasks
Multi-Task Learning
Transfer Learning: Main Solutions
• Instance (Data) Transfer
• Reweight instances of the source data according to the target distribution
• Examples: importance sampling; some “style-transfer” for data adaptation
• Feature Transfer
• Map features of source and target data into a common space
• Examples: TCA; common pre-training + fine-tuning methods in DL (see the sketch after this list)
• Parameter Transfer
• Learn target model parameters according to source model
• Example: Multi-task learning; Net2Net
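A common pre-train + fine-tune sketch of feature/parameter transfer, assuming a recent torchvision is available; ResNet-18 is just one convenient pretrained backbone, not a choice prescribed by the slides.

```python
import torch.nn as nn
import torchvision

def build_transfer_model(num_target_classes, freeze_backbone=True):
    # Source parameters: ImageNet-pretrained backbone.
    model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    if freeze_backbone:
        for p in model.parameters():
            p.requires_grad = False                    # pure feature transfer
    # New head for the target task (parameter transfer if the backbone is also tuned).
    model.fc = nn.Linear(model.fc.in_features, num_target_classes)
    return model
```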
How transferable are deep learning features?
Net2Net Transfer
• Net2Net reuses information from an already-trained deep model to speed up training of a new model (potentially with a different topology)
Net2Net Transfer
Multi-Task Learning: Main Solutions
• Direct Parameter Sharing (straightforward)
• Examples: shared weights or activations in neural networks (see the sketch after this list); shared parameters in a Gaussian process
• Structural Regularization
• Can be designed to incorporate various assumptions and domain knowledge
• Can be trained using large-scale optimization algorithms on big data
• The key is to design the regularization term that couples the tasks
• Classical examples: group sparsity, low-rank structure, parameter grouping …
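A minimal hard-parameter-sharing sketch: one shared trunk with one small head per task, and the per-task losses simply summed during training (per-task loss weights are a common extension not shown here). Dimensions and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dim, hidden_dim, task_out_dims):
        super().__init__()
        # Shared trunk (direct parameter sharing across tasks).
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                   nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        # One lightweight head per task.
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, d) for d in task_out_dims])

    def forward(self, x):
        z = self.trunk(x)                        # shared representation
        return [head(z) for head in self.heads]  # one output per task
```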
General Multi-Task Learning Schematic in DNNs
• Can often help tasks with fewer labels, due to knowledge sharing … (“positive transfer”)
• But it can also hurt some tasks during collaboration, due to cross-task conflict … (“negative transfer”)
Now let’s get ambitious: learning with NO Labels!!
First category of unsupervised learning
● Generative modeling
○ Generate or otherwise model pixels in the input space
○ Pixel-level generation is computationally expensive
○ Generating images of high-fidelity may not be necessary for
representation learning
[Figures: Autoencoder; Generative Adversarial Nets. Image credit: Xifeng Guo, Thalles Silva.]
Second category of unsupervised learning
● Discriminative modeling
○ Train networks to perform pretext tasks where both the inputs and
labels are derived from an unlabeled dataset.
○ Heuristic-based pretext tasks: rotation prediction, relative patch location prediction, colorization, solving jigsaw puzzles.
○ Many heuristics seem ad-hoc and may be limiting.
Images: [Gidaris et al 2018, Doersch et al 2015]
Motivation and Methodology
Main Tasks in Use:
■ Reconstruct from a corrupted (or partial) version
■ Denoising Autoencoder
■ In-painting
■ Colorization
■ Visual common-sense tasks
■ Relative patch prediction
■ Jigsaw puzzles
■ Rotation (a minimal example is sketched after this list)
■ Contrastive Learning
■ word2vec
■ Contrastive Predictive Coding (CPC)
■ MoCo, simCLR …
[Figure: Yann LeCun’s “cake” analogy. Slide: LeCun]
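A hedged sketch of the rotation-prediction pretext task (Gidaris et al. 2018): rotate each unlabeled image by 0/90/180/270 degrees and train a classifier to predict which rotation was applied. `model` is a placeholder network assumed to output 4 logits.

```python
import torch
import torch.nn.functional as F

def rotation_pretext_batch(x):
    """x: (N, C, H, W). Returns rotated copies and their rotation labels (0..3)."""
    views, labels = [], []
    for k in range(4):                                    # k * 90 degrees
        views.append(torch.rot90(x, k, dims=(2, 3)))      # rotate the H,W plane
        labels.append(torch.full((x.size(0),), k, dtype=torch.long))
    return torch.cat(views), torch.cat(labels)

def rotation_loss(model, x_unlabeled):
    views, labels = rotation_pretext_batch(x_unlabeled)
    return F.cross_entropy(model(views), labels)          # predict the rotation
```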
Example: Solving Jigsaw Puzzles
Simple Contrastive Learning (simCLR)
• Simple idea: maximizing the agreement of representations
under data transformation, using a contrastive loss in the
latent/feature space
• Super effective: 10% relative improvement over the previous SOTA (CPC v2); outperforms AlexNet with 100× fewer labels
Simple Contrastive Learning (simCLR)
simCLR uses random crops and color distortion for augmentation.
Examples of augmentations applied to the left-most images:
Simple Contrastive Learning (simCLR)
f(x) is the base network that computes the internal representation.
Default simCLR uses an (unconstrained) ResNet, but it can be any other network.
Simple Contrastive Learning (simCLR)
g(h) is a projection network that projects the representation to a latent space.
simCLR uses a 2-layer non-linear MLP.
Simple Contrastive Learning (simCLR)
In the latent/feature space we do two things:
• “Pull” positive pairs closer together (two contrastive “views” generated from the same sample, only with different data augmentations)
• “Push” negative pairs further apart
Loss function (InfoNCE) – a minimal sketch follows below:
[Figure: the original image yields crop 1 and crop 2 as a positive pair; a different (contrastive) image provides negatives]
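A hedged sketch of the NT-Xent (InfoNCE) loss described above, written from the slide’s description rather than copied from the paper’s code: `z1` and `z2` are the projected embeddings of the two augmented views of the same batch, each view’s positive is its counterpart, and all other 2N−2 views act as negatives.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)     # (2N, d), unit-normalized
    sim = z @ z.t() / temperature                          # scaled cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float('-inf'))             # drop self-similarity
    # Positives: view i pairs with view i+n, and vice versa.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)
```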
simCLR algorithm in pseudo code
Take-home key points:
• Benefits from large batch sizes (at least 1k–2k per minibatch)
• The composition of augmentations is crucial; contrastive learning needs stronger data/color augmentation than supervised learning
• A nonlinear projection head improves the representation quality of the layer before it
• The temperature hyperparameter in the contrastive loss is critical
• simCLR can immediately be applied to few-shot, semi-supervised, and transfer learning
• Unsupervised contrastive learning benefits (more) from bigger models (simCLR v2)
simCLR as a strong semi-supervised learner
“Pre-train, Fine-tune, and Distill”
• Surprise: Bigger models are more label-efficient!
• Using pre-training + fine-tuning, “the fewer the labels, the bigger the model”
Momentum Contrast (MoCo)
Barlow Twins: “Another Dimension” of Contrast
VIC-Reg: A (more) Unified SSL Framework
Promoted a lot by LeCun et al., who argue that three essential terms constitute a good SSL loss (a combined sketch follows this list):
• Variance: keeps the variance of each component of the representations (measured over a batch) above a threshold, to prevent cross-sample collapse. [contrastive learning: “push” negatives]
• Invariance: makes the representations of two views of the same sample as close to each other as possible. [contrastive learning: “pull” positives]
• Covariance: decorrelates the variables of one sample’s embedding and prevents an informational collapse in which the variables would vary together or be highly correlated. [Barlow Twins; absent in contrastive learning]
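A hedged sketch combining the three terms above into one VICReg-style loss; the coefficients, the variance threshold of 1, and `eps` are illustrative defaults rather than necessarily the paper’s exact settings. `z1` and `z2` are the (N, d) embeddings of the two views.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z1, z2, inv_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    n, d = z1.shape
    inv = F.mse_loss(z1, z2)                                  # invariance term

    def variance_term(z):                                     # keep per-dim std above 1
        std = torch.sqrt(z.var(dim=0) + eps)
        return torch.relu(1.0 - std).mean()
    var = variance_term(z1) + variance_term(z2)

    def covariance_term(z):                                   # penalize off-diagonal covariance
        z = z - z.mean(dim=0)
        cov = (z.t() @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return (off_diag ** 2).sum() / d
    cov = covariance_term(z1) + covariance_term(z2)

    return inv_w * inv + var_w * var + cov_w * cov
```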
VIC-Reg (promoted a lot by LeCun, etc.)
• Joint embedding with variance, invariance and covariance regularization
Beyond Contrastive Learning:
Masked Autoencoder (MAE)
A more detailed tutorial: https://feichtenhofer.github.io/eccv2022-ssl-tutorial/Tutorial_files/slides/mae_tutorial_xinlei.pdf
How MAE works
MAE works by Reconstruction
MAE: More Take-Home Points
• BERT-like algorithm, but with crucial design changes for vision
• BERT: masking 15% of tokens is enough
• MAE: a high mask ratio of 75%–80% is optimal
• Very efficient when coupled with a high mask ratio (75%); the masking step is sketched after this list
• MAE has a large encoder that runs on the visible tokens only
• … + a small decoder on all tokens
• … + a projection layer to connect the two
• After pre-training, throw away the decoder
• Intriguing properties – better scalability
• Works with minimal data augmentation
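A hedged sketch of only the random-masking step described above (the ViT encoder, decoder, and reconstruction loss are omitted); `tokens` are assumed to be patch embeddings of shape (N, L, D), and the encoder would see only the returned visible subset.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """tokens: (N, L, D) patch embeddings. Returns visible tokens and their indices."""
    n, l, d = tokens.shape
    n_keep = int(l * (1 - mask_ratio))
    noise = torch.rand(n, l, device=tokens.device)   # random score per patch
    ids_shuffle = noise.argsort(dim=1)               # random permutation of patches
    ids_keep = ids_shuffle[:, :n_keep]               # the ~25% the encoder will see
    visible = torch.gather(tokens, 1,
                           ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, ids_keep, ids_shuffle
```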
Contrastive Language-Image Pre-training (CLIP)
https://openai.com/blog/clip/
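A hedged sketch of the contrastive image-text pre-training objective the name refers to (not OpenAI’s actual code): encode a batch of matched (image, text) pairs with separate encoders and apply a symmetric cross-entropy so each image matches its own caption and vice versa. `image_encoder` and `text_encoder` are placeholder modules producing same-dimensional embeddings.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_encoder, text_encoder, images, texts, temperature=0.07):
    img = F.normalize(image_encoder(images), dim=1)        # (N, d) image embeddings
    txt = F.normalize(text_encoder(texts), dim=1)          # (N, d) text embeddings
    logits = img @ txt.t() / temperature                   # (N, N) similarity matrix
    targets = torch.arange(images.size(0), device=logits.device)  # i-th image <-> i-th text
    return 0.5 * (F.cross_entropy(logits, targets) +       # image-to-text direction
                  F.cross_entropy(logits.t(), targets))    # text-to-image direction
```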
CLIP is highly data-efficient, flexible and general
Some Limitations:
• struggles on abstract or systematic tasks
• struggles on very fine-grained classification
• sometimes sensitive to wording/phrasing, needing “prompt engineering”
https://openai.com/blog/clip/
General Message about Self-Supervised Learning
• MAE has won most CV downstream tasks (from 2D to 3D, sparse to dense)
• MoCo/SimCLR still offer more competitive performance in the few-shot regime
• Maybe we should “hybridize” the two?
• Lots of open problems remain about when and why an SSL representation works (or not)