Learning with
Minimal Supervision
ELEC/COMP 447, ELEC/COMP 546
Spring 2025
Representation learning
“Coral”
“Fish”
Image
Compact mental
representation
[Serre, 2014]
CNNs learned the classical visual recognition pipeline!
Edges
Segments
Texture “clown fish”
Parts
Colors
im2vec
Image → layer 1 representation of image → layer 3 representation of image
Represent image as a neural embedding — a vector/tensor of neural activations
(perhaps representing a vector of detected texture patterns or object parts)
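A minimal sketch (not from the slides) of extracting such an embedding with a forward hook in PyTorch; the ResNet-18 backbone and the choice of layer3 are illustrative:

```python
import torch
import torchvision.models as models

# Minimal sketch: read out an intermediate-layer activation as an image embedding.
# ResNet-18 and 'layer3' are illustrative choices, not the course's model.
model = models.resnet18(weights=None).eval()

features = {}
def save_activation(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

model.layer3.register_forward_hook(save_activation("layer3"))

image = torch.randn(1, 3, 224, 224)      # stand-in for a preprocessed image
with torch.no_grad():
    model(image)

# Flatten the activation tensor into a single embedding vector ("im2vec").
embedding = features["layer3"].flatten(start_dim=1)
print(embedding.shape)                    # e.g. torch.Size([1, 50176])
```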
Investigating a representation via similarity analysis
How similar are these two images?
How about these two?
[Kriegeskorte et al. 2008]
Investigating a representation via similarity analysis
Representational Dissimilarity Matrix
Neural activation vector
[Kriegeskorte, Mur, Ruff, et al. 2008]
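A rough sketch of how an RDM can be computed from activation vectors (dissimilarity = 1 − Pearson correlation); the details are illustrative, not taken from the paper:

```python
import torch

# Minimal Representational Dissimilarity Matrix (RDM) sketch:
# each entry compares the activation vectors evoked by two images.
def rdm(activations: torch.Tensor) -> torch.Tensor:
    # activations: (num_images, num_units) matrix of neural activations
    centered = activations - activations.mean(dim=1, keepdim=True)
    normed = centered / centered.norm(dim=1, keepdim=True)
    corr = normed @ normed.T          # pairwise Pearson correlations
    return 1.0 - corr                  # dissimilarities in [0, 2]

acts = torch.randn(10, 4096)           # 10 images, 4096-unit representation
print(rdm(acts).shape)                  # torch.Size([10, 10])
```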
Investigating a representation via similarity analysis
IT neuronal units vs. deep net (in particular, HMO)
[Yamins, Hong, Cadieu, Solomon, Seibert, DiCarlo, PNAS 2014]
Good representations are…
1. Compact (minimal)
2. Explanatory (sufficient)
3. Disentangled (independent factors)
4. Interpretable
5. Make subsequent problem solving easy
[See “Representation Learning”, Bengio 2013, for more commentary]
Supervised object recognition
Learner “Fish”
image X label Y
Supervised object recognition
Learner “Duck”
…
image X label Y
Transfer learning
“Generally speaking, a good representation is one that makes a subsequent
learning task easier.” — Deep Learning, Goodfellow et al. 2016
Object recognition Place recognition
“Fish” ?
Often, what we will be “tested” on is to learn to do a new thing.
Object recognition Place recognition Place recognition
“Fish” bedroom ?
A lot of data A little data
Finetuning starts with the representation learned on a previous
task, and adapts it to perform well on a new task.
Finetuning in practice
Object recognition classes: dolphin, cat, grizzly bear, angel fish, chameleon, clown fish, iguana, elephant
Place recognition classes: bathroom, kitchen, bedroom, living room, hallway
• The “learned representation” is just the weights and biases, so that’s what we transfer.
• Which weights and biases do we need to finetune? Often just the final layer.
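A minimal PyTorch sketch of this recipe (the ResNet-18 backbone and the five hypothetical place classes are illustrative assumptions): freeze the transferred weights and train only a new final layer.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Transfer the pretrained representation, freeze it, and retrain only a new head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

for param in model.parameters():
    param.requires_grad = False           # keep the transferred weights fixed

num_places = 5                             # e.g. bathroom, kitchen, bedroom, living room, hallway
model.fc = nn.Linear(model.fc.in_features, num_places)   # new final layer, trainable

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a fake batch.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_places, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```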
If we keep on finetuning for every new datapoint or task that comes our way, we
get online learning. Humans seem to do this — we never stop learning.
…
Supervised vision: hand-curated training data (+ informative; − expensive; − limited to teacher’s knowledge)
Vision in nature: raw unlabeled training data (+ cheap; − noisy; − harder to interpret)
Autoencoder: A first self-supervised model
Image → Encoder → compressed image code (vector z) → Decoder → Reconstructed image
[e.g., Hinton & Salakhutdinov, Science 2006]
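A minimal autoencoder sketch in PyTorch (the layer sizes and 28×28 input are illustrative, not the slides' architecture):

```python
import torch
import torch.nn as nn

# Encode an image into a compact code z, then decode back to a reconstruction.
class Autoencoder(nn.Module):
    def __init__(self, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 256), nn.ReLU(),
            nn.Linear(256, code_dim),          # compressed image code z
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, 28 * 28), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z).view_as(x), z

model = Autoencoder()
images = torch.rand(16, 1, 28, 28)             # stand-in batch
recon, z = model(images)
loss = nn.functional.mse_loss(recon, images)    # reconstruction error drives learning
```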
Autoencoder
(Figure: Image → compressed image code (vector z) → Reconstructed image)
Is the code informative about object class? (Logistic regression)
Layer 1 representation vs. Layer 6 representation
[DeCAF, Donahue, Jia, et al. 2013]
[Visualization technique: t-SNE, van der Maaten & Hinton, 2008]
Can we learn better self-supervised features?
(Architecture in figure: input data → encoder: 4-layer conv → decoder: 4-layer upconv → reconstructed data)
Question: What may be bad about using autoencoders to learn self-supervised features?
Two answers:
• Autoencoders prioritize reconstruction error → they may not weigh small but semantically important features highly enough.
• Autoencoders can “cheat” by memorizing weird features associated with each image, instead of learning a well-behaved representation.
Self-supervised
learning
Escher, 1948
Self-supervised learning
Common trick:
• Convert the “unsupervised” problem into “supervised” empirical risk minimization
• Do so by cooking up “labels” (prediction targets) from the raw data itself
Escher, 1948
Self-supervised pretext tasks
rotation prediction “jigsaw puzzle” image completion colorization
1. Solving the pretext tasks allows the model to learn good features.
2. We can automatically generate labels for the pretext tasks.
Task 1: Relative patch location prediction
? ? ?
? ?
? ? ? [Slide credit: Carl Doersch]
Task 1: Relative patch location prediction
(Image source: Doersch et al., 2015)
Task 1: Relative patch location prediction
(Figure: patch pair → CNNs → patch embedding (representation) → classifier; right: an input patch and its nearest neighbors in embedding space)
Interesting: this representation places all cat faces together in feature space!
[Slide credit: Carl Doersch]
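A rough sketch of how the patch pairs and their free labels can be generated; patch size and sampling details are simplified relative to Doersch et al., 2015:

```python
import torch

# Pick a random center patch and one of its 8 neighbors; the neighbor's
# index (0..7) is the automatically generated "label".
def sample_patch_pair(image: torch.Tensor, patch: int = 32):
    # image: (C, H, W)
    _, h, w = image.shape
    cy = torch.randint(patch, h - 2 * patch, (1,)).item()
    cx = torch.randint(patch, w - 2 * patch, (1,)).item()
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    label = torch.randint(0, 8, (1,)).item()
    dy, dx = offsets[label]
    center   = image[:, cy:cy + patch, cx:cx + patch]
    neighbor = image[:, cy + dy * patch:cy + (dy + 1) * patch,
                        cx + dx * patch:cx + (dx + 1) * patch]
    return center, neighbor, label     # classifier predicts `label` from the two patches

img = torch.rand(3, 256, 256)
center, neighbor, label = sample_patch_pair(img)
```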
Task 2: Solving “jigsaw puzzles”
(Image source: Noroozi & Favaro, 2016)
Task 3: Predict missing pixels (inpainting)
Context Encoders: Feature Learning by Inpainting (Pathak et al., 2016)
Source: Pathak et al., 2016
Task 3: Predicting missing pixels (inpainting)
Learning to reconstruct the missing pixels
Source: Pathak et al., 2016
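A rough sketch of the inpainting objective (the mask size, loss, and stand-in model are illustrative; the actual Context Encoder uses an encoder-decoder and adds an adversarial term):

```python
import torch
import torch.nn.functional as F

# Mask out a central region and train a network to reconstruct the missing pixels.
def masked_inpainting_loss(model, images, hole=32):
    b, c, h, w = images.shape
    y0, x0 = (h - hole) // 2, (w - hole) // 2
    corrupted = images.clone()
    corrupted[:, :, y0:y0 + hole, x0:x0 + hole] = 0.0   # remove central pixels
    prediction = model(corrupted)                        # model predicts a full image
    # Only the missing region contributes to the reconstruction loss.
    return F.mse_loss(prediction[:, :, y0:y0 + hole, x0:x0 + hole],
                      images[:, :, y0:y0 + hole, x0:x0 + hole])

# Illustrative stand-in "model": identity; a real encoder-decoder would go here.
model = torch.nn.Identity()
loss = masked_inpainting_loss(model, torch.rand(4, 3, 128, 128))
```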
Task 4: Split-brain Autoencoder
Idea: cross-channel predictions
(Figure: predict the ab channels from the L channel, and L from ab)
Source: Richard Zhang / Phillip Isola
Task 4: Split-brain Autoencoder
Idea: cross-channel predictions
Source: Richard Zhang / Phillip Isola
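A minimal sketch of the cross-channel idea (network sizes are illustrative, and the Lab conversion is assumed to have happened already):

```python
import torch
import torch.nn as nn

# Split a Lab image into the L channel and the ab channels, and train two
# sub-networks that each predict the half they do not see.
net_L_to_ab = nn.Conv2d(1, 2, kernel_size=3, padding=1)   # L  -> ab
net_ab_to_L = nn.Conv2d(2, 1, kernel_size=3, padding=1)   # ab -> L

lab = torch.rand(8, 3, 64, 64)       # stand-in for images already converted to Lab
L, ab = lab[:, :1], lab[:, 1:]

loss = nn.functional.mse_loss(net_L_to_ab(L), ab) + \
       nn.functional.mse_loss(net_ab_to_L(ab), L)
# The full representation concatenates the features of both sub-networks.
```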
Autoencoder vs. colorization as pretext tasks:
• Autoencoder: raw data → reconstructed data
• Colorization: raw grayscale channel → predicted color channels
(Plot: classification accuracy on the ImageNet task [Russakovsky et al. 2015] vs. layer, for autoencoder and colorization features)
Task 5: Rotation prediction
Hypothesis: a model could recognize the correct rotation of an object
only if it has the “visual commonsense” of what the object should look
like unperturbed.
(Image source: Gidaris et al. 2018)
Task 5: Rotation prediction
Self-supervised learning by rotating the entire input images.
The model learns to predict which rotation is applied (4-way classification).
(Image source: Gidaris et al. 2018)
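A minimal sketch of generating the rotated inputs and their free 4-way labels (details illustrative, in the spirit of Gidaris et al., 2018):

```python
import torch

# Rotate each image by 0/90/180/270 degrees; labels 0..3 come "for free".
def make_rotation_batch(images: torch.Tensor):
    # images: (B, C, H, W)
    labels = torch.randint(0, 4, (images.shape[0],))
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                           for img, k in zip(images, labels)])
    return rotated, labels

rotated, labels = make_rotation_batch(torch.rand(8, 3, 96, 96))
# logits = cnn(rotated); loss = F.cross_entropy(logits, labels)
```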
Transfer learned features to supervised learning
(Plot legend: pretrained with full ImageNet supervision; no pretraining)
Self-supervised learning on ImageNet (entire training set) with AlexNet.
Finetune on labeled data from Pascal VOC 2007.
Self-supervised learning with rotation prediction
(Image source: Gidaris et al. 2018)
What about videos pretext tasks?
Analog of jigsaw puzzles?
Sort frames in time!
Unsupervised Representation Learning by Sorting
Sequences. Lee et al., 2017.
Video Pretext Task: Video colorization
Idea: model the temporal coherence of colors in videos
reference frame (t = 0); how should I color these frames (t = 1, 2, 3, ...)?
Source: Vondrick et al., 2018
Video colorization
Idea: model the temporal coherence of colors in videos
reference frame (t = 0); how should I color these frames (t = 1, 2, 3, ...)?
Should be the same color!
Hypothesis: learning to color video frames should allow model to
learn to track regions or objects without labels!
Source: Vondrick et al., 2018
Video colorization
Learning objective: establish mappings between reference and target frames in a learned feature space. Use the mapping as “pointers” to copy the correct color (Lab).
Source: Vondrick et al., 2018
Learning to color videos
(Figure annotations: attention map on the reference frame; predicted color = weighted sum of the reference colors; loss between predicted color and ground-truth color)
Source: Vondrick et al., 2018
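A rough sketch of the attention "pointer" that copies reference colors (feature extraction is omitted and the shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# Each target-frame location attends over reference-frame locations in a learned
# feature space and copies a weighted sum of the reference colors.
def copy_colors(ref_feats, tgt_feats, ref_colors):
    # ref_feats, tgt_feats: (N, D) features for N spatial locations
    # ref_colors:           (N, 2)  ab color channels of the reference frame
    attention = F.softmax(tgt_feats @ ref_feats.T, dim=1)   # (N, N) pointer weights
    return attention @ ref_colors                            # predicted target colors

N, D = 196, 64                                   # e.g. a 14x14 grid of locations
ref_feats, tgt_feats = torch.randn(N, D), torch.randn(N, D)
ref_colors = torch.rand(N, 2)
pred_colors = copy_colors(ref_feats, tgt_feats, ref_colors)
# loss = F.mse_loss(pred_colors, true_target_colors)   # ground-truth colors are free
```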
Colorizing videos (qualitative)
reference frame target frames (gray) predicted color
Source: Google AI Blog
Colorizing videos (qualitative)
reference frame target frames (gray) predicted color
Source: Google AI Blog
Tracking emerges from colorization
Propagate segmentation masks using learned attention
Source: Google AI Blog
Tracking emerges from colorization
Propagate skeletons using learned attention
Source: Google AI Blog
Problems with individual pretext tasks
● Coming up with individual pretext tasks is tedious.
● The learned representations may not be general.
Can we come up with a more general pretext task?
A more general pretext task?
same object
Contrastive Representation Learning
attract
repel
A formulation of contrastive learning
Loss function given 1 positive sample and N - 1 negative samples:
x: reference sample; x+: positive sample; x−: negative sample
That is: we aim to learn an encoder function f that yields a high score for positive pairs (x, x+) and low scores for negative pairs (x, x−).
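The equation itself appears only as an image on the slide; a standard way to write this loss (the InfoNCE form) is roughly:

```latex
\mathcal{L} \;=\; -\,\mathbb{E}\!\left[
  \log \frac{\exp\big(s(f(x),\, f(x^{+}))\big)}
            {\exp\big(s(f(x),\, f(x^{+}))\big) \;+\; \sum_{j=1}^{N-1} \exp\big(s(f(x),\, f(x^{-}_{j}))\big)}
\right]
```

where s(·, ·) is a similarity score, e.g. a dot product of encoder outputs; the slide's exact form may differ (SimCLR, for instance, adds a temperature).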
A formulation of contrastive learning
Loss function given 1 positive sample and N - 1 negative samples:
(numerator: score for the positive pair; denominator also sums the scores for the N - 1 negative pairs)
This seems familiar …
Cross-entropy loss for an N-way softmax classifier!
I.e., learn to find the positive sample from the N samples.
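A minimal PyTorch sketch of this N-way softmax view of the loss (the cosine similarity and temperature are illustrative choices, not necessarily the slide's):

```python
import torch
import torch.nn.functional as F

# Contrastive (InfoNCE-style) loss as an N-way classifier whose "correct class"
# is the positive sample among N candidates.
def contrastive_loss(query, positive, negatives, temperature=0.1):
    # query, positive: (D,)   negatives: (N-1, D)
    candidates = torch.cat([positive.unsqueeze(0), negatives], dim=0)      # (N, D)
    scores = F.cosine_similarity(query.unsqueeze(0), candidates) / temperature
    target = torch.tensor(0)               # index 0 is the positive sample
    return F.cross_entropy(scores.unsqueeze(0), target.unsqueeze(0))

D, N = 128, 16
loss = contrastive_loss(torch.randn(D), torch.randn(D), torch.randn(N - 1, D))
```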
SimCLR: generating positive samples
from data augmentation
Source: Chen et al., 2020
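A rough sketch of generating the two augmented views that form a positive pair (the transform parameters below are illustrative, not the paper's exact recipe):

```python
from torchvision import transforms

# Two random augmentations of the same image form a positive pair;
# other images in the batch act as negatives.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

def two_views(pil_image):
    return augment(pil_image), augment(pil_image)   # a positive pair
```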
SimCLR
[Chen, Kornblith, Norouzi, Hinton, ICML 2020]
[c.f. Becker & Hinton, Nature 1992]
Contrastive pre-training
Self-supervised contrastive learning (no labels) → new recognition task
(Classes in figure: dolphin, cat, grizzly bear, angel fish, chameleon, tiger, iguana, elephant)
Training linear classifier on SimCLR features
Train the feature encoder on ImageNet (entire training set) using SimCLR.
Freeze the feature encoder, then train a linear classifier on top with labeled data.
Source: Chen et al., 2020
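A minimal sketch of this linear-evaluation protocol (the encoder below is a tiny stand-in for a SimCLR-pretrained network; names and sizes are illustrative):

```python
import torch
import torch.nn as nn

# Freeze the self-supervised encoder and train only a linear classifier on its features.
feat_dim, num_classes = 2048, 1000
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, feat_dim))   # stand-in encoder
for p in encoder.parameters():
    p.requires_grad = False

linear_probe = nn.Linear(feat_dim, num_classes)
optimizer = torch.optim.SGD(linear_probe.parameters(), lr=0.1)

images = torch.rand(8, 3, 32, 32)
labels = torch.randint(0, num_classes, (8,))
with torch.no_grad():
    feats = encoder(images)                 # frozen features
loss = nn.functional.cross_entropy(linear_probe(feats), labels)
loss.backward()
optimizer.step()
```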
Semi-supervised learning on SimCLR features
Train the feature encoder on ImageNet (entire training set) using SimCLR.
Finetune the encoder with 1% / 10% of labeled data on ImageNet.
Source: Chen et al., 2020
Variations: DINO
DINO: Teacher-Student Paradigm
• An image x is transformed into two views x1 and x2.
• The student is encouraged to match the output probabilities of the teacher.
• The teacher slowly updates its parameters with an exponential moving average (EMA) of the student’s parameters.
DINO
The Teacher
• The teacher’s parameters are an exponentially weighted moving average of the student’s parameters over recent iterations.
• Teacher sees “global” views, while student sees local views.
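A minimal sketch of the EMA teacher update (the momentum value is illustrative):

```python
import torch

# The teacher's parameters track an exponential moving average of the student's.
@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

student = torch.nn.Linear(128, 10)
teacher = torch.nn.Linear(128, 10)
teacher.load_state_dict(student.state_dict())   # start identical
ema_update(teacher, student)                     # call after each student update step
```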
Recall: Self-Attention
Multisensory self-supervision
Virginia de Sa. Learning Classification with Unlabeled Data. NIPS 1994.
[see also “Six lessons from babies”, Smith and Gasser 2005]
(Figure: state → observations; observations → state)
[Slide credit: Andrew Owens]
Predicting ambient sound
[Slide credit: Andrew Owens]
What did the model learn?
Unit #90 of 256
Strongest responses in dataset
Visualization method from (Zhou 2015)
[Slide credit: Andrew Owens]
CLIP (Contrastive Language–Image Pre-training)
Radford et al., 2021
[Slide Credit: Yann LeCun]