
Learning with

Minimal Supervision
ELEC/COMP 447, ELEC/COMP 546
Spring 2025
Representation learning

(Figure: an image is mapped to a compact mental representation that supports labels such as “Coral” and “Fish”.) [Serre, 2014]
CNNs learned the classical visual recognition pipeline!

(Figure: intermediate features such as edges, segments, texture, colors, and parts, leading to the label “clown fish”.)
im2vec

(Figure: an image and its layer 1 and layer 3 representations inside the network.)
Represent an image as a neural embedding: a vector/tensor of neural activations (perhaps representing a vector of detected texture patterns or object parts).
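For concreteness, here is a minimal sketch of “im2vec”: reading out an intermediate representation of an image from a network. This assumes PyTorch and torchvision with a pretrained ResNet-18; the model, layer choice, and image path are illustrative, not from the slides.

```python
# Minimal sketch: represent an image as a neural embedding ("im2vec").
# Assumptions (not from the slides): torchvision's pretrained ResNet-18,
# reading out the global-average-pooled features before the classifier.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Drop the final classification layer; keep everything up to global pooling.
embedder = torch.nn.Sequential(*list(model.children())[:-1])

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("fish.jpg").convert("RGB")      # hypothetical image path
x = preprocess(img).unsqueeze(0)                 # shape: (1, 3, 224, 224)
with torch.no_grad():
    z = embedder(x).flatten(1)                   # shape: (1, 512) embedding
```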
Investigating a representation via similarity analysis

How similar are these two images?

How about these two?

[Kriegeskorte et al. 2008]


Investigating a representation via similarity analysis

Representational Dissimilarity Matrix

Neural activation vector

[Kriegeskorte, Mur, Ruff, et al. 2008]
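As a rough sketch of the analysis (my own illustration, not code from the paper), a representational dissimilarity matrix can be computed as 1 minus the pairwise correlation between activation vectors for each stimulus:

```python
# Sketch: build a Representational Dissimilarity Matrix (RDM) from activation
# vectors. Illustrative only; dissimilarity = 1 - Pearson correlation.
import numpy as np

def rdm(activations: np.ndarray) -> np.ndarray:
    """activations: (n_stimuli, n_units) array of neural activation vectors."""
    corr = np.corrcoef(activations)      # (n_stimuli, n_stimuli) correlations
    return 1.0 - corr                    # high value = dissimilar responses

# Usage with random stand-in data (92 stimuli, 256 units):
acts = np.random.randn(92, 256)
print(rdm(acts).shape)                   # (92, 92)
```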


Investigating a representation via similarity analysis
IT neuronal units vs. a deep net (in particular, HMO)

[Yamins, Hong, Cadieu, Solomon, Seibert, DiCarlo, PNAS 2014]


Good representations are…

1. Compact (minimal)

2. Explanatory (sufficient)

3. Disentangled (independent factors)

4. Interpretable

5. Make subsequent problem solving easy

[See “Representation Learning”, Bengio 2013, for more commentary]


Supervised object recognition

A learner maps image X to label Y (e.g., “Fish”, “Duck”).
Transfer learning
“Generally speaking, a good representation is one that makes a subsequent
learning task easier.” — Deep Learning, Goodfellow et al. 2016
Object recognition → place recognition: often, what we will be “tested” on is learning to do a new thing.

Object recognition (“Fish”, a lot of data) vs. place recognition (bedroom, a little data).

Finetuning starts with the representation learned on a previous task, and adapts it to perform well on a new task.
Finetuning in practice

Object recognition classes (dolphin, cat, grizzly bear, angel fish, chameleon, clown fish, iguana, elephant) → finetune to place recognition classes (bathroom, kitchen, bedroom, living room, hallway).

• The “learned representation” is just the weights and biases, so that’s what we transfer.
• Which weights and biases do we need to finetune? Often just the final layer (a minimal sketch follows below).
If we keep finetuning for every new datapoint or task that comes our way, we get online learning. Humans seem to do this: we never stop learning.
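A minimal sketch of finetuning only the final layer, assuming PyTorch; the pretrained model and the 5 place classes are placeholders matching the example above, not code from the lecture:

```python
# Sketch: transfer a pretrained object-recognition network to place recognition
# by freezing the backbone and finetuning only the final layer.
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all transferred weights and biases.
for p in model.parameters():
    p.requires_grad = False

# Replace the final layer with a new head for the place classes
# (5 classes here: bathroom, kitchen, bedroom, living room, hallway).
model.fc = torch.nn.Linear(model.fc.in_features, 5)

# Only the new head's parameters are updated during finetuning.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
```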


Supervised vision vs. vision in nature

Supervised vision: hand-curated training data (+ informative; - expensive; - limited to teacher’s knowledge).
Vision in nature: raw unlabeled training data (+ cheap; - noisy; - harder to interpret).
Autoencoder: A first self-supervised model
Image → Encoder → compressed image code (vector z) → Decoder → reconstructed image
[e.g., Hinton & Salakhutdinov, Science 2006]
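A minimal sketch of the idea (my own PyTorch illustration; the layer sizes and code dimension are arbitrary placeholders, not from Hinton & Salakhutdinov):

```python
# Sketch: a tiny autoencoder. The encoder compresses the image into a code z;
# the decoder reconstructs the image; training minimizes reconstruction error.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(),
            nn.Linear(256, code_dim),                  # compressed code z
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, 28 * 28), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z).view_as(x), z

model = Autoencoder()
x = torch.rand(16, 1, 28, 28)                          # stand-in image batch
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)                # reconstruction error
```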
Autoencoder

Image → Encoder → compressed image code (vector z) → Decoder → reconstructed image

Is the code z informative about object class? Test with logistic regression on the code.
Layer 1 representation Layer 6 representation

[DeCAF, Donahue, Jia, et al. 2013]


[Visualization technique: t-SNE, van der Maaten & Hinton, 2008]
Can we learn better self-supervised features?
Question: What may be bad about using autoencoders to learn self-supervised features?

(Example architecture: input data → encoder: 4-layer conv → decoder: 4-layer upconv → reconstructed data.)

Two answers:
• Autoencoders prioritize reconstruction error → they may not weigh small but semantically important features highly enough.
• Autoencoders can “cheat” by memorizing weird features associated with each image, instead of learning a well-behaved representation.
Self-supervised
learning
Escher, 1948
Self-supervised learning

Common trick:
• Convert the “unsupervised” problem into “supervised” empirical risk minimization.
• Do so by cooking up “labels” (prediction targets) from the raw data itself.

Escher, 1948
Self-supervised pretext tasks

(Examples: rotation prediction (θ = ?), “jigsaw puzzle”, image completion, colorization.)

1. Solving the pretext tasks allows the model to learn good features.
2. We can automatically generate labels for the pretext tasks.
Task 1: Relative patch location prediction

(Figure: a 3×3 grid of patches; given the center patch, predict where a second patch came from.) [Slide credit: Carl Doersch]
Task 1: Relative patch location prediction

(Image source: Doersch et al., 2015)

Task 1: Relative patch location prediction

(Figure: two patches pass through a shared CNN to produce patch embeddings (representations), which feed a classifier. Right: an input patch and its nearest neighbors in feature space.)

Interesting: this representation places all cat faces together in feature space!

[Slide credit: Carl Doersch]
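A rough sketch of the label-generation step (my own illustration of the idea from Doersch et al., 2015; the patch size and gap are placeholder values, and the image is assumed large enough to fit a full 3×3 grid):

```python
# Sketch: generate a training example for relative patch location prediction.
# Sample a center patch and one of its 8 neighbors; the label is which neighbor.
import random
import torch

def sample_patch_pair(image: torch.Tensor, patch=96, gap=16):
    """image: (3, H, W). Returns (center_patch, neighbor_patch, label in 0..7)."""
    _, H, W = image.shape
    step = patch + gap
    # Pick a center location far enough from the border for all 8 neighbors.
    cy = random.randint(step, H - step - patch)
    cx = random.randint(step, W - step - patch)
    offsets = [(-step, -step), (-step, 0), (-step, step),
               (0, -step),                 (0, step),
               (step, -step), (step, 0),   (step, step)]
    label = random.randrange(8)
    dy, dx = offsets[label]
    center = image[:, cy:cy + patch, cx:cx + patch]
    neighbor = image[:, cy + dy:cy + dy + patch, cx + dx:cx + dx + patch]
    return center, neighbor, label

# Both patches go through a shared CNN; an 8-way classifier predicts the label.
```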


Task 2: Solving “jigsaw puzzles”

(Image source: Noroozi & Favaro, 2016)


Task 3: Predict missing pixels (inpainting)

Context Encoders: Feature Learning by Inpainting (Pathak et al., 2016)


Source: Pathak et al., 2016



L
e
c Task 3: Predicting missing pixels (inpainting)
t
u
r
e
1
4
-

Learning to reconstruct the missing pixels


Source: Pathak et al., 2016
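As a loose sketch (not the actual Context Encoders code; the hole size and central-mask choice are my placeholders), the pretext data can be generated by blanking out a region and asking the network to fill it in:

```python
# Sketch: generate an inpainting training pair by masking out a central region.
# The model is trained to reconstruct the missing pixels (e.g., with an L2 loss).
import torch

def mask_center(images: torch.Tensor, hole=64):
    """images: (B, 3, H, W). Returns (masked images, binary mask of the hole)."""
    B, C, H, W = images.shape
    y0, x0 = (H - hole) // 2, (W - hole) // 2
    mask = torch.zeros(B, 1, H, W)
    mask[:, :, y0:y0 + hole, x0:x0 + hole] = 1.0
    masked = images * (1.0 - mask)             # zero out the hole
    return masked, mask

# Training step (pseudo): pred = generator(masked)
# loss = ((pred - images) ** 2 * mask).sum() / mask.sum()   # loss on the hole only
```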



Task 4: Split-brain Autoencoder
Idea: cross-channel predictions. Split the Lab image into its L and ab channels; predict ab from L, and L from ab.

Source: Richard Zhang / Phillip Isola


Autoencoder vs. colorization: classification performance on the ImageNet task [Russakovsky et al. 2015]

(Plot: classification accuracy vs. layer for autoencoder features and colorization features.)

Autoencoder: raw data → reconstructed data. Colorization: raw grayscale channel → predicted color channels.
Task 5: Rotation prediction

Hypothesis: a model could recognize the correct rotation of an object only if it has the “visual commonsense” of what the object should look like unperturbed.

(Image source: Gidaris et al. 2018)


Task 5: Rotation prediction

Self-supervised learning by rotating the entire input images. The model learns to predict which rotation is applied (4-way classification).
(Image source: Gidaris et al. 2018)
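A minimal sketch of the label generation and loss for this pretext task (my own PyTorch illustration of the 4-way setup described above; it assumes square images):

```python
# Sketch: rotation prediction pretext task. Rotate each image by 0/90/180/270
# degrees, and train a 4-way classifier to predict which rotation was applied.
import torch
import torch.nn.functional as F

def rotation_batch(images: torch.Tensor):
    """images: (B, C, H, W) with H == W. Returns rotated images and labels 0..3."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([
        torch.rot90(img, k=int(k), dims=(1, 2))    # rotate by k * 90 degrees
        for img, k in zip(images, labels)
    ])
    return rotated, labels

# Training step (model is any CNN with a 4-way output head):
# rotated, labels = rotation_batch(images)
# loss = F.cross_entropy(model(rotated), labels)
```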



Transfer learned features to supervised learning

Self-supervised learning with rotation prediction on ImageNet (entire training set) with AlexNet, then finetune on labeled data from Pascal VOC 2007.

(Plot: performance compared against pretraining with full ImageNet supervision and against no pretraining.)

(Image source: Gidaris et al. 2018)


What about videos pretext tasks?
Analog of jigsaw puzzles? Sort frames in time!

Unsupervised Representation Learning by Sorting Sequences. Lee et al., 2017.
Video Pretext Task: Video colorization
Idea: model the temporal coherence of colors in videos

(Figure: a reference frame at t = 0 and gray frames at t = 1, 2, 3: how should I color these frames?)

Source: Vondrick et al., 2018


Video colorization
Idea: model the temporal coherence of colors in videos

(Figure: corresponding regions across the reference frame (t = 0) and the target frames (t = 1, 2, 3) should be the same color!)

Hypothesis: learning to color video frames should allow the model to learn to track regions or objects without labels!

Source: Vondrick et al., 2018


Video colorization

Learning objective: establish mappings between reference and target frames in a learned feature space. Use the mapping as “pointers” to copy the correct color (Lab).

Source: Vondrick et al., 2018
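A rough sketch of the copy-by-attention mechanism (my own simplification of the idea in Vondrick et al., 2018; the shapes, temperature, and normalization are illustrative):

```python
# Sketch: copy colors from a reference frame to a target frame via attention
# ("pointers") computed in a learned feature space.
import torch
import torch.nn.functional as F

def copy_colors(f_ref, f_tgt, colors_ref, temperature=0.1):
    """
    f_ref:      (N, D)  features of N reference-frame locations
    f_tgt:      (M, D)  features of M target-frame locations
    colors_ref: (N, 2)  ab color channels at the reference locations
    Returns predicted ab colors at the target locations, shape (M, 2).
    """
    sim = f_tgt @ f_ref.t() / temperature          # (M, N) similarity scores
    attn = F.softmax(sim, dim=1)                   # attention over reference pixels
    return attn @ colors_ref                       # weighted sum of reference colors

# Training: compare predicted colors with the true colors of the target frame;
# at test time the same attention weights can propagate masks or skeletons.
```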


Learning to color videos

Attention map on the reference frame; predicted color = weighted sum of the reference colors; loss between predicted color and ground-truth color.

Source: Vondrick et al., 2018


Colorizing videos (qualitative)
reference frame target frames (gray) predicted color

Source: Google AI Blog
Colorizing videos (qualitative)
reference frame target frames (gray) predicted color

Source: Google AI Blog
Tracking emerges from colorization
Propagate segmentation masks using learned attention.
Source: Google AI Blog

Tracking emerges from colorization
Propagate skeletons using learned attention.
Source: Google AI Blog
Problems with individual pretext tasks

● Coming up with individual pretext tasks is tedious.
● The learned representations may not be general.

Can we come up with a more general pretext task?


A more general pretext task?
(Figure: different views of the same object, θ = ?)

Contrastive Representation Learning

(Figure: positive pairs attract, negative pairs repel in representation space.)
A formulation of contrastive learning

Loss function given 1 positive sample and N - 1 negative samples. x: reference sample; x+: positive sample; x-: negative sample.

That is: we aim to learn an encoder function f that yields a high score for positive pairs (x, x+) and low scores for negative pairs (x, x-).

The loss compares the score for the positive pair against the scores for the N - 1 negative pairs. This seems familiar: it is the cross-entropy loss for an N-way softmax classifier, i.e., learn to find the positive sample from the N samples.
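The loss itself appears only as a rendered equation in the slides. For reference, the standard InfoNCE-style form that the description above points to is (my reconstruction, with s(·, ·) denoting the pairwise score, e.g., a dot product of encoder outputs):

```latex
\mathcal{L} = -\,\mathbb{E}\left[
  \log \frac{\exp\!\big(s(x, x^{+})\big)}
            {\exp\!\big(s(x, x^{+})\big) + \sum_{j=1}^{N-1} \exp\!\big(s(x, x^{-}_{j})\big)}
\right]
```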
SimCLR: generating positive samples
from data augmentation

Source: Chen et al., 2020


SimCLR
[Chen, Kornblith, Norouzi, Hinton, ICML 2020]

[c.f. Becker & Hinton, Nature 1992]
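A compact sketch of a SimCLR-style training step (my own simplified version, not the official implementation; the encoder, projection head, and augmentations are placeholders):

```python
# Sketch: SimCLR-style contrastive step. Two augmented views of each image are
# positives for each other; all other images in the batch act as negatives.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1, z2: (B, D) projections of two augmented views of the same B images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)      # (2B, D)
    sim = z @ z.t() / temperature                            # (2B, 2B) similarities
    sim.fill_diagonal_(float("-inf"))                        # exclude self-pairs
    B = z1.size(0)
    # The positive for sample i is its other view: i + B (or i - B).
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)])
    return F.cross_entropy(sim, targets)

# Usage (encoder, projection_head, and augment are placeholder callables):
# z1 = projection_head(encoder(augment(images)))
# z2 = projection_head(encoder(augment(images)))
# loss = nt_xent_loss(z1, z2)
```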


Contrastive pre-training

Self-supervised contrastive learning, then transfer to a new recognition task (classes such as dolphin, cat, grizzly bear, angel fish, chameleon, tiger, iguana, elephant).
Training linear classifier on SimCLR features

Train a feature encoder on ImageNet (entire training set) using SimCLR.

Freeze the feature encoder, then train a linear classifier on top with labeled data.

Source: Chen et al., 2020
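A minimal sketch of the linear evaluation protocol described above (illustrative; `pretrained_simclr_encoder`, the feature dimension, and the class count are hypothetical placeholders):

```python
# Sketch: linear evaluation of SimCLR features. Freeze the pretrained encoder
# and train only a linear classifier on top, using labeled data.
import torch

encoder = pretrained_simclr_encoder()      # hypothetical: returns an nn.Module
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

linear = torch.nn.Linear(2048, 1000)       # feature dim / class count: placeholders
optimizer = torch.optim.SGD(linear.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

def train_step(images, labels):
    with torch.no_grad():                  # encoder stays fixed
        feats = encoder(images)
    loss = criterion(linear(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```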


Semi-supervised learning on SimCLR features

Train a feature encoder on ImageNet (entire training set) using SimCLR.

Finetune the encoder with 1% / 10% of the labeled data on ImageNet.

Source: Chen et al., 2020


Variations: DINO

DINO: Teacher-Student Paradigm
• An image x is transformed into two views x1 and x2.
• The student is encouraged to match the output probabilities of the teacher.
• The teacher slowly updates its parameters with an exponential moving average (EMA) of the student’s parameters.
DINO: The Teacher

• The teacher’s parameters are an exponentially weighted average of the student’s parameters over the recent past.
• The teacher sees “global” views, while the student sees local views.
(A minimal sketch of the EMA update follows below.)
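A minimal sketch of the EMA teacher update described above (my own PyTorch illustration; the momentum value and the tiny stand-in network are placeholders):

```python
# Sketch: DINO-style teacher update. The teacher's parameters are an exponential
# moving average (EMA) of the student's; no gradients flow into the teacher.
import copy
import torch

student = torch.nn.Linear(128, 64)         # stand-in for the student network
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# After each student optimizer step:
# ema_update(teacher, student)
```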
Recall: Self-Attention
Multisensory self-supervision

(Figure: a shared state gives rise to observations in two sensory modalities; one modality’s observations serve as the provided label for the other.)

Virginia de Sa. Learning Classification with Unlabeled Data. NIPS 1994.
[see also “Six lessons from babies”, Smith and Gasser 2005]
[Slide credit: Andrew Owens]
Predicting ambient sound

[Slide credit: Andrew Owens]


What did the model learn?

Unit #90 of 256

Strongest responses in dataset

Visualization method from (Zhou 2015)


[Slide credit: Andrew Owens]
[Slide credit: Andrew Owens]
CLIP (Contrastive Language–Image Pre-training)
Radford et al., 2021
[Slide Credit: Yann LeCun]
