Unsupervised Learning of Visual Features by Contrasting Cluster Assignments
1 Inria∗    2 Facebook AI Research
Abstract
Unsupervised image representations have significantly reduced the gap with su-
pervised pretraining, notably with the recent achievements of contrastive learning
methods. These contrastive methods typically work online and rely on a large
number of explicit pairwise feature comparisons, which is computationally chal-
lenging. In this paper, we propose an online algorithm, SwAV, that takes advantage
of contrastive methods without requiring pairwise comparisons to be computed. Specif-
ically, our method simultaneously clusters the data while enforcing consistency
between cluster assignments produced for different augmentations (or “views”) of
the same image, instead of comparing features directly as in contrastive learning.
Simply put, we use a “swapped” prediction mechanism where we predict the cluster
assignment of a view from the representation of another view. Our method can be
trained with large and small batches and can scale to unlimited amounts of data.
Compared to previous contrastive methods, our method is more memory efficient
since it does not require a large memory bank or a special momentum network.
In addition, we also propose a new data augmentation strategy, multi-crop, that
uses a mix of views with different resolutions in place of two full-resolution views,
without increasing the memory or compute requirements much. We validate our
findings by achieving 75.3% top-1 accuracy on ImageNet with ResNet-50, as well
as surpassing supervised pretraining on all the considered transfer tasks.
1 Introduction
Unsupervised visual representation learning, or self-supervised learning, aims at obtaining features
without using manual annotations and is rapidly closing the performance gap with supervised pre-
training in computer vision [10, 24, 42]. Many recent state-of-the-art methods build upon the instance
discrimination task that considers each image of the dataset (or “instance”) and its transformations as
a separate class [16]. This task yields representations that are able to discriminate between different
images, while achieving some invariance to image transformations. Recent self-supervised methods
that use instance discrimination rely on a combination of two elements: (i) a contrastive loss [23] and
(ii) a set of image transformations. The contrastive loss removes the notion of instance classes by
directly comparing image features while the image transformations define the invariances encoded
in the features. Both elements are essential to the quality of the resulting networks [10, 42] and our
work improves upon both the objective function and the transformations.
* Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
Correspondence to mathilde@fb.com
Code: https://github.com/facebookresearch/swav
In summary, this paper makes the following contributions:
• We propose a scalable online clustering loss that improves performance by +2% on ImageNet and
works in both large and small batch settings without a large memory bank or a momentum encoder.
• We introduce the multi-crop strategy to increase the number of views of an image with no
computational or memory overhead. We observe a consistent improvement of between 2% and 4%
on ImageNet with this strategy on several self-supervised methods.
• Combining both technical contributions into a single model, we improve the performance of self-
supervised methods by +4.2% on ImageNet with a standard ResNet-50 and outperform supervised
ImageNet pretraining on multiple downstream tasks. This is the first method to do so without
finetuning the features, i.e., only with a linear classifier on top of frozen features.
2 Related Work
Instance and contrastive learning. Instance-level classification considers each image in a dataset
as its own class [5, 16, 56]. Dosovitskiy et al. [16] assign a class explicitly to each image and learn a
linear classifier with as many classes as images in the dataset. As this approach becomes quickly
intractable, Wu et al. [56] mitigate this issue by replacing the classifier with a memory bank that stores
previously-computed representations. They rely on noise contrastive estimation [22] to compare
instances, which is a special form of contrastive learning [28, 45]. He et al. [24] improve the training
of contrastive methods by storing representations from a momentum encoder instead of the trained
network. More recently, Chen et al. [10] show that the memory bank can be entirely replaced with the
elements from the same batch if the batch is large enough. In contrast to this line of works, we avoid
comparing every pair of images by mapping the image features to a set of trainable prototype vectors.
[Figure 1: Contrastive instance learning (left) vs. SwAV (right). In contrastive learning, features Z1 and Z2 from different views of an image are compared directly; in SwAV, the features are assigned codes Q1 and Q2 by matching them to a set of prototypes C, and each code is predicted from the feature of the other view ("swapped" prediction).]
Clustering for deep representation learning. Our work is also related to clustering-based meth-
ods [2, 4, 7, 8, 19, 29, 57, 60, 61, 66]. Caron et al. [7] show that k-means assignments can be used as
pseudo-labels to learn visual representations. This method scales to large uncurated datasets and can
be used for pre-training of supervised networks [8]. However, their formulation is not principled;
recently, Asano et al. [2] showed how to cast the pseudo-label assignment problem as an instance of the
optimal transport problem. We consider a similar formulation to map representations to prototype
vectors, but unlike [2] we keep the soft assignment produced by the Sinkhorn-Knopp algorithm [13]
instead of approximating it into a hard assignment. Besides, unlike Caron et al. [7, 8] and Asano et
al. [2], we obtain online assignments which allows our method to scale gracefully to any dataset size.
Handcrafted pretext tasks. Many self-supervised methods manipulate the input data to extract a
supervised signal in the form of a pretext task [1, 14, 30, 32, 34, 40, 43, 46, 47, 53, 54, 64]. We refer
the reader to Jing et al. [31] for an exhaustive and detailed review of this literature. Of particular
interest, Misra and van der Maaten [42] propose to encode the jigsaw puzzle task [44] as an invariant
for contrastive learning. Jigsaw tiles are non-overlapping crops with small resolution that cover
only part (∼20%) of the entire image area. In contrast, our multi-crop strategy consists in simply
sampling multiple random crops with two different sizes: a standard size and a smaller one.
3 Method
Our goal is to learn visual features in an online fashion without supervision. To that effect, we
propose an online clustering-based self-supervised method. Typical clustering-based methods [2, 7]
are offline in the sense that they alternate between a cluster assignment step where image features of
the entire dataset are clustered, and a training step where the cluster assignments (or “codes”) are
predicted for different image views. Unfortunately, these methods are not suitable for online learning
as they require multiple passes over the dataset to compute the image features necessary for clustering.
In this section, we describe an alternative where we enforce consistency between codes from different
augmentations of the same image. This solution is inspired by contrastive instance learning [56] as
we do not consider the codes as a target, but only enforce consistent mapping between views of the
same image. Our method can be interpreted as a way of contrasting between multiple image views by
comparing their cluster assignments instead of their features.
More precisely, we compute a code from an augmented version of the image and predict this code
from other augmented versions of the same image. Given two image features zt and zs from two
different augmentations of the same image, we compute their codes qt and qs by matching these
features to a set of K prototypes {c1, . . . , cK}. We then set up a "swapped" prediction problem with
the following loss function:
L(zt , zs ) = `(zt , qs ) + `(zs , qt ), (1)
where the function `(z, q) measures the fit between features z and a code q, as detailed later.
Intuitively, our method compares the features zt and zs using the intermediate codes qt and qs . If
these two features capture the same information, it should be possible to predict the code from the
other feature. A similar comparison appears in contrastive learning where features are compared
directly [56]. In Fig. 1, we illustrate the relation between contrastive learning and our method.
Each image xn is transformed into an augmented view xnt by applying a transformation t sampled
from the set T of image transformations. The augmented view is mapped to a vector representation by
applying a non-linear mapping fθ to xnt. The feature is then projected to the unit sphere, i.e.,
$z_{nt} = f_\theta(x_{nt}) / \|f_\theta(x_{nt})\|_2$. We then compute a code qnt from this feature by mapping znt to a set of
K trainable prototype vectors, {c1, . . . , cK}. We denote by C the matrix whose columns are
c1, . . . , cK. We now describe how to compute these codes and update the prototypes online.
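For concreteness, the mapping from an augmented view to a normalized feature and its prototype scores can be sketched as follows in PyTorch-style code; the encoder, feature dimension, number of prototypes, and input size below are placeholders chosen for illustration, not the values used in the paper.

import torch
import torch.nn.functional as F

# Illustrative sizes only: D-dimensional features, K prototypes.
D, K = 128, 3000
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, D))  # stand-in for f_theta
C = torch.nn.Parameter(torch.randn(D, K))       # trainable prototype vectors c_1, ..., c_K (columns)

x_t = torch.randn(8, 3, 32, 32)                 # a batch of augmented views x_nt
z_t = F.normalize(encoder(x_t), dim=1, p=2)     # project features to the unit sphere
scores_t = z_t @ F.normalize(C, dim=0, p=2)     # dot products z_nt^T c_k used in Eq. (2) and (3)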
Swapped prediction problem. The loss function in Eq. (1) has two terms that set up the "swapped"
prediction problem of predicting the code qt from the feature zs, and qs from zt. Each term represents
the cross entropy loss between the code and the probability obtained by taking a softmax of the dot
products of zi and all prototypes in C, i.e.,
$$\ell(z_t, q_s) = -\sum_{k} q_s^{(k)} \log p_t^{(k)}, \quad \text{where} \quad p_t^{(k)} = \frac{\exp\!\big(\tfrac{1}{\tau}\, z_t^\top c_k\big)}{\sum_{k'} \exp\!\big(\tfrac{1}{\tau}\, z_t^\top c_{k'}\big)}, \tag{2}$$
where τ is a temperature parameter [56]. Taking this loss over all the images and pairs of data
augmentations leads to the following loss function for the swapped prediction problem:
$$-\frac{1}{N} \sum_{n=1}^{N} \sum_{s,t\sim\mathcal{T}} \left[ \frac{1}{\tau}\, z_{nt}^\top C q_{ns} + \frac{1}{\tau}\, z_{ns}^\top C q_{nt} - \log \sum_{k=1}^{K} \exp\!\Big(\frac{z_{nt}^\top c_k}{\tau}\Big) - \log \sum_{k=1}^{K} \exp\!\Big(\frac{z_{ns}^\top c_k}{\tau}\Big) \right].$$
This loss function is jointly minimized with respect to the prototypes C and the parameters θ of the
image encoder fθ used to produce the features (znt )n,t .
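Under these definitions, the swapped prediction loss of Eq. (1)-(2) can be written in a few lines; the sketch below assumes precomputed prototype scores z^T C and codes q for the two views (the function name is illustrative):

import torch
import torch.nn.functional as F

def swapped_prediction_loss(scores_t, scores_s, q_t, q_s, temp=0.1):
    # scores_*: z^T C for each view (B x K); q_*: codes for each view (B x K).
    log_p_t = F.log_softmax(scores_t / temp, dim=1)           # p_t^(k) of Eq. (2)
    log_p_s = F.log_softmax(scores_s / temp, dim=1)
    loss_ts = -torch.mean(torch.sum(q_s * log_p_t, dim=1))    # l(z_t, q_s)
    loss_st = -torch.mean(torch.sum(q_t * log_p_s, dim=1))    # l(z_s, q_t)
    return loss_ts + loss_st                                  # L(z_t, z_s) of Eq. (1)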
Computing codes online. In order to make our method online, we compute the codes using only the
image features within a batch. We compute codes using the prototypes C such that all the examples
in a batch are equally partitioned by the prototypes. This equipartition constraint ensures that the
codes for different images in a batch are distinct, thus preventing the trivial solution where every
image has the same code. Given B feature vectors Z = [z1 , . . . , zB ], we are interested in mapping
them to the prototypes C = [c1 , . . . , cK ]. We denote this mapping or codes by Q = [q1 , . . . , qB ],
and optimize Q to maximize the similarity between the features and the prototypes, i.e.,

$$\max_{Q\in\mathcal{Q}} \; \operatorname{Tr}\!\big(Q^\top C^\top Z\big) + \varepsilon H(Q), \tag{3}$$
where H is the entropy function, $H(Q) = -\sum_{ij} Q_{ij} \log Q_{ij}$, and ε is a parameter that controls the
smoothness of the mapping. Asano et al. [2] enforce an equal partition by constraining the matrix Q
to belong to the transportation polytope. They work on the full dataset, and we propose to adapt their
solution to work on minibatches by restricting the transportation polytope to the minibatch:
$$\mathcal{Q} = \left\{\, Q \in \mathbb{R}_{+}^{K\times B} \;\middle|\; Q\mathbf{1}_B = \frac{1}{K}\mathbf{1}_K,\; Q^\top \mathbf{1}_K = \frac{1}{B}\mathbf{1}_B \,\right\}, \tag{4}$$

where $\mathbf{1}_K$ denotes the vector of ones in dimension K. These constraints enforce that on average each
prototype is selected at least B/K times in the batch.
Once a continuous solution Q∗ to Prob. (3) is found, a discrete code can be obtained by using a
rounding procedure [2]. Empirically, we found that discrete codes work well when computing codes
in an offline manner on the full dataset as in Asano et al. [2]. However, in the online setting where
we use only minibatches, using the discrete codes performs worse than using the continuous codes.
An explanation is that the rounding needed to obtain discrete codes is a more aggressive optimization
step than gradient updates. While it makes the model converge rapidly, it leads to a worse solution.
We thus preserve the soft code Q∗ instead of rounding it. These soft codes Q∗ are the solution of
Prob. (3) over the set Q and take the form of a normalized exponential matrix [13]:

$$Q^{*} = \operatorname{Diag}(u)\, \exp\!\left(\frac{C^\top Z}{\varepsilon}\right) \operatorname{Diag}(v), \tag{5}$$
[Figure 2: Linear classification on ImageNet with frozen features. (left) Top-1 accuracy for a standard ResNet-50 (e.g., Supervised 76.5, Colorization [63] 39.6, Jigsaw [44] 45.7). (right) Top-1 accuracy for wider ResNet-50 variants, comparing supervised training with SwAV.]
where u and v are renormalization vectors in RK and RB respectively. The renormalization vectors
are computed using a small number of matrix multiplications using the iterative Sinkhorn-Knopp
algorithm [13]. In practice, we observe that using only 3 iterations is fast and sufficient to obtain
good performance. Indeed, this algorithm can be efficiently implemented on GPU, and the alignment
of 4K features to 3K codes takes 35ms in our experiments, see § 4.
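As an illustration, the soft codes of Eq. (5) can be computed with a few normalization steps; the sketch below mirrors the Sinkhorn-Knopp pseudo-code given in the Appendix, taking the scores matrix C^T Z as input (function name and defaults are illustrative):

import torch

def compute_codes(scores, eps=0.05, n_iters=3):
    # scores: C^T Z with shape K x B (prototypes x batch features).
    Q = torch.exp(scores / eps)
    Q = Q / Q.sum()                                # start from a globally normalized matrix
    K, B = Q.shape
    for _ in range(n_iters):
        Q = Q / (Q.sum(dim=1, keepdim=True) * K)   # rows sum to 1/K (prototype marginal of Eq. 4)
        Q = Q / (Q.sum(dim=0, keepdim=True) * B)   # columns sum to 1/B (sample marginal of Eq. 4)
    return (Q * B).t()                             # B x K soft codes, one row per feature summing to 1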
Working with small batches. When the number B of batch features is too small compared to
the number of prototypes K, it is impossible to equally partition the batch into the K prototypes.
Therefore, when working with small batches, we use features from the previous batches to augment
the size of Z in Prob. (3). Then, we only use the codes of the batch features in our training loss. In
practice, we store around 3K features, i.e., in the same range as the number of code vectors. This
means that we only keep features from the last 15 batches with a batch size of 256, while contrastive
methods typically need to store the last 65K instances obtained from the last 250 batches [24].
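One possible way to implement this feature queue is sketched below; the helper compute_codes follows the sketch above, and the sizes and names are illustrative rather than the exact implementation:

import torch

def codes_with_queue(z_batch, C, queue, compute_codes):
    # z_batch: B x D current features; C: D x K prototypes; queue: features stored from previous batches.
    z_all = torch.cat([z_batch, queue], dim=0)         # enlarge the set of features fed to Sinkhorn
    q_all = compute_codes(C.t() @ z_all.t())           # codes for batch + queued features
    q_batch = q_all[: z_batch.shape[0]]                # only the codes of the current batch enter the loss
    queue = torch.cat([z_batch.detach(), queue], dim=0)[: queue.shape[0]]  # FIFO update of the queue
    return q_batch, queue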
Multi-crop: Augmenting views with smaller images. As noted in prior works [10, 42], comparing random crops of an image plays a central role by capturing
information in terms of relations between parts of a scene or an object. Unfortunately, increasing
the number of crops or "views" quadratically increases the memory and compute requirements. We
propose a multi-crop strategy where we use two standard resolution crops and sample V additional
low resolution crops that cover only small parts of the image. Using low resolution images ensures
only a small increase in the compute cost. Specifically, we generalize the loss of Eq. (1):
$$L(z_{t_1}, z_{t_2}, \ldots, z_{t_{V+2}}) = \frac{1}{2(V+1)} \sum_{i\in\{1,2\}} \sum_{v=1}^{V+2} \mathbf{1}_{v\neq i}\; \ell(z_{t_v}, q_{t_i}). \tag{6}$$
Note that we compute codes using only the full resolution crops. Indeed, computing codes for all crops
increases the computational time and we observe in practice that it also alters the transfer performance
of the resulting network. An explanation is that using only partial information (small crops cover only
a small area of the image) degrades the assignment quality. Figure 3 shows that multi-crop improves
the performance of several self-supervised methods and is a promising augmentation strategy.
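A sketch of the corresponding multi-crop loss of Eq. (6) is given below; `scores` holds the prototype scores of all V+2 crops while `codes` holds the codes of the two full-resolution crops only (names are illustrative):

import torch
import torch.nn.functional as F

def multicrop_swav_loss(scores, codes, temp=0.1):
    # scores: list of V+2 tensors z_{t_v}^T C (B x K); codes: list of 2 code tensors (B x K).
    n_crops = len(scores)
    loss, n_terms = 0.0, 0
    for i in (0, 1):                                   # codes are computed from full-resolution crops only
        for v in range(n_crops):
            if v == i:
                continue
            log_p_v = F.log_softmax(scores[v] / temp, dim=1)
            loss = loss - torch.mean(torch.sum(codes[i] * log_p_v, dim=1))
            n_terms += 1
    return loss / n_terms                              # n_terms = 2(V + 1), matching Eq. (6)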
Table 1: Semi-supervised learning on ImageNet with a ResNet-50. We finetune the model with
1% and 10% labels and report top-1 and top-5 accuracies. *: uses RandAugment [12].
                                                1% labels          10% labels
Method                                        Top-1    Top-5     Top-1    Top-5
Supervised                                     25.4     48.4      56.4     80.4
Methods using label-propagation:
  UDA [58]                                       -        -       68.8*    88.5*
  FixMatch [49]                                  -        -       71.5*    89.1*
Methods using self-supervision only:
  PIRL [42]                                    30.7     57.2      60.4     83.8
  PCL [35]                                       -      75.6        -      86.2
  SimCLR [10]                                  48.3     75.5      65.6     87.8
  SwAV                                         53.9     78.5      70.2     89.9
4 Main Results
We analyze the features learned by SwAV by transfer learning on multiple datasets. We implement in
SwAV the improvements used in SimCLR, i.e., LARS [62], cosine learning rate [38, 42] and the MLP
projection head [10]. We provide the full details and hyperparameters for pretraining and transfer
learning in the Appendix.
We evaluate the features of a ResNet-50 [26] trained with SwAV on ImageNet by two experiments:
linear classification on frozen features and semi-supervised learning by finetuning with few labels.
When using frozen features (Fig. 2 left), SwAV outperforms the state of the art by +4.2% top-1
accuracy and is only 1.2% below the performance of a fully supervised model. Note that we train
SwAV during 800 epochs with large batches (4096). We refer to Fig. 3 for results with shorter
trainings and to Table 3 for experiments with small batches. On semi-supervised learning (Table 1),
SwAV outperforms other self-supervised methods and is on par with state-of-the-art semi-supervised
models [49], despite the fact that SwAV is not specifically designed for semi-supervised learning.
Variants of ResNet-50. Figure 2 (right) shows the performance of multiple variants of ResNet-50
with different widths [33]. The performance of our model increases with the width of the model, and
follows a similar trend to the one obtained with supervised learning. When compared with concurrent
work like SimCLR, we see that SwAV reduces the difference with supervised models even further.
Indeed, for large architectures, our method shrinks the gap with supervised training to 0.6%.
Table 2: Transfer learning on downstream tasks. Comparison between features from ResNet-50
trained on ImageNet with SwAV or supervised learning. We consider two settings. (1) Linear
classification on top of frozen features. We report top-1 accuracy on all datasets except VOC07 where
we report mAP. (2) Object detection with finetuned features on VOC07+12 trainval using Faster
R-CNN [48] and on COCO [36] using DETR [6]. We report the most standard detection metrics for
these datasets: AP50 on VOC07+12 and AP on COCO.
                     Linear Classification                      Object Detection
              Places205   VOC07   iNat18      VOC07+12 (Faster R-CNN)   COCO (DETR)
Supervised      53.2       87.5    46.7                81.3                 40.8
SwAV            56.7       88.9    48.6                82.6                 42.1
We test the generalization of ResNet-50 features trained with SwAV on ImageNet (without labels) by
transferring to several downstream vision tasks. In Table 2, we compare the performance of SwAV
features with ImageNet supervised pretraining. First, we report the linear classification performance
Table 3: Training in small batch setting. Top-1 accuracy on ImageNet with a linear classifier
trained on top of frozen features from a ResNet-50. All methods are trained with a batch size of 256.
We also report the number of stored features, the type of cropping used and the number of epochs.
Method    Mom. Encoder   Stored Features   multi-crop       epoch   batch   Top-1
SimCLR                          0           2×224             200     256    61.9
MoCov2         ✓             65,536         2×224             200     256    67.5
MoCov2         ✓             65,536         2×224             800     256    71.1
SwAV                          3,840         2×160 + 4×96      200     256    72.0
SwAV                          3,840         2×224 + 6×96      200     256    72.7
SwAV                          3,840         2×224 + 6×96      400     256    74.3
on the Places205 [65], VOC07 [17], and iNaturalist2018 [52] datasets. Our method outperforms
supervised features on all three datasets. Note that SwAV is the first self-supervised method to
surpass ImageNet supervised features on these datasets. Second, we report network finetuning on
object detection on VOC07+12 using Faster R-CNN [48] and on COCO [36] with DETR [6]. DETR
is a recent object detection framework that reaches competitive performance with Faster R-CNN
while being conceptually simpler and trainable end-to-end. We use DETR because, unlike Faster
R-CNN [25], using a pretrained backbone in this framework is crucial to obtain good results compared
to training from scratch [6]. In Table 2, we show that SwAV outperforms the supervised pretrained
model on both VOC07+12 and COCO datasets. Note that this is in line with previous works that also
show that self-supervision can outperform supervised pretraining on object detection [19, 24, 42].
We report more detection evaluation metrics and results from other self-supervised methods in the
Appendix. Overall, our SwAV ResNet-50 model surpasses supervised ImageNet pretraining on all
the considered transfer tasks and datasets. We will release this model so other researchers might also
benefit by replacing the ImageNet supervised network with our model.
We train SwAV with small batches of 256 images on 4 GPUs and compare with MoCov2 and SimCLR
trained in the same setup. In Table 3, we see that SwAV maintains state-of-the-art performance even
when trained in the small batch setting. Note that SwAV only stores a queue of 3,840 features. In
comparison, to obtain good performance, MoCov2 needs to store 65,536 features while keeping
an additional momentum encoder network. When SwAV is trained using 2×160 + 4×96 crops,
SwAV has a running time 1.2× higher than SimCLR with 2×224 crops and is around 1.4× slower
than MoCov2 due to the additional back-propagation [11]. However, as shown in Table 3, SwAV
learns much faster and reaches higher performance in 4× fewer epochs: 72% after 200 epochs while
MoCov2 needs 800 epochs to achieve 71.1%. Increasing the resolution and the number of epochs,
SwAV reaches 74.3% with a small number of stored features and no momentum encoder.
5 Ablation Study
Applying the multi-crop strategy to different methods. In Fig. 3 (left), we report the impact of
applying our multi-crop strategy on the performance of a selection of other methods. Besides
SwAV, we consider supervised learning, SimCLR and two clustering-based models, DeepCluster-v2
and SeLa-v2. The last two are obtained by applying the improvements of SimCLR to DeepCluster [7]
and SeLa [2] (see details in the Appendix). We see that the multi-crop strategy consistently
improves the performance for all the considered methods by a significant margin of 2−4% top-1
accuracy. Interestingly, multi-crop seems to benefit more clustering-based methods than contrastive
methods. We note that multi-crop does not improve the supervised model.
Figure 3 (left) also allows us to compare clustering-based and contrastive instance methods. First,
we observe that SwAV and DeepCluster-v2 outperform SimCLR by 2% both with and without
multi-crop. This suggests the learning potential of clustering-based methods over instance classifi-
cation. Finally, we see that SwAV performs on par with offline clustering-based approaches, that use
the entire dataset to learn prototypes and codes.
Method             2×224   2×160 + 4×96     ∆
Supervised          76.5       76.0        −0.5
SimCLR              68.2       70.6        +2.4
SeLa-v2             67.2       71.8        +4.6
DeepCluster-v2      70.2       74.3        +4.1
SwAV                70.1       74.1        +4.0

[Plot: top-1 accuracy on ImageNet versus number of training epochs for SwAV on 64 V100 16Gb GPUs: 72.1 (100 epochs, 6h15), 73.9 (200 epochs, 12h30), 74.6 (400 epochs, 25h), 75.3 (800 epochs, 50h).]
Figure 3: Top-1 accuracy on ImageNet with a linear classifier trained on top of frozen features
from a ResNet-50. (left) Impact of multi-crop and comparison between clustering-based and
contrastive instance methods. Self-supervised methods are trained for 400 epochs and supervised
models for 200 epochs. (right) Performance as a function of epochs. We compare SwAV models
trained with different number of epochs and report their running time based on our implementation.
Method    Frozen   Finetuned
Random     15.0       76.5
MoCo         -        77.3*
SimCLR     60.4       77.2
SwAV       66.5       77.8

[Plot: top-1 accuracy of finetuned models as a function of model capacity (in billions of Mult-Adds), comparing SwAV pretraining, no pretraining, and weak supervision*.]
Figure 4: Pretraining on uncurated data. Top-1 accuracy on ImageNet for pretrained models
on an uncurated set of 1B random Instagram images. (left) We compare ResNet-50 pretrained
with either SimCLR or SwAV on two downstream tasks: linear classification on frozen features or
finetuned features. (right) Performance of finetuned models as we increase the capacity of a ResNeXt
following [39]. The capacity is provided in billions of Mult-Add operations.
*: pretrained on a curated set of 1B Instagram images filtered with 1.5k hashtags similar to ImageNet classes.
Impact of longer training. In Fig. 3 (right), we show the impact of the number of training epochs
on performance for SwAV with multi-crop. We train separate models for 100, 200, 400 and 800
epochs and report the top-1 accuracy on ImageNet using the linear classification evaluation. We train
each ResNet-50 on 64 V100 16GB GPUs and a batch size of 4096. While SwAV benefits from longer
training, it already achieves strong performance after 100 epochs, i.e., 72.1% in 6h15.
Unsupervised pretraining on a large uncurated dataset. We test if SwAV can serve as a pre-
training method for supervised learning and also check its robustness on uncurated pretraining data.
We pretrain SwAV on an uncurated dataset of 1 billion random public non-EU images from Instagram.
In Fig. 4 (left), we measure the performance of ResNet-50 models when transferring to ImageNet
with frozen or finetuned features. We report the results from He et al. [24] but note that their setting
is different. They use a curated set of Instagram images, filtered by hashtags similar to ImageNet
labels [39]. We compare SwAV with a randomly initialized network and with a network pretrained on
the same data using SimCLR. We observe that SwAV maintains a similar gain of 6% over SimCLR
as when pretrained on ImageNet (Fig. 2), showing that our improvements do not depend on the
data distribution. We also see that pretraining with SwAV on random images significantly improves
over training from scratch on ImageNet (+1.3%) [8, 24]. In Fig. 4 (right), we explore the limits of
pretraining as we increase the model capacity. We consider the variants of the ResNeXt architec-
ture [59] as in Mahajan et al. [39]. We compare SwAV with supervised models trained from scratch
on ImageNet. For all models, SwAV outperforms training from scratch by a significant margin
showing that it can take advantage of the increased model capacity. For reference, we also include the
results from Mahajan et al. [39] obtained with a weakly-supervised model pretrained by predicting
hashtags filtered to be similar to ImageNet classes. Interestingly, SwAV performance is strong when
compared to this topline despite not using any form of supervision or filtering of the data.
6 Discussion
Self-supervised learning is progressing rapidly compared to supervised learning, and even surpasses
it on transfer learning, even though the current experimental settings are designed for supervised
learning. In particular, architectures have been designed for supervised tasks, and it is not clear if
the same models would emerge from exploring architectures with no supervision. Several recent
works have shown that exploring architectures with search [37] or pruning [9] is possible without
supervision, and we plan to evaluate the ability of our method to guide model explorations.
Acknowledgement. We thank Nicolas Carion, Kaiming He, Herve Jegou, Benjamin Lefaudeux,
Thomas Lucas, Francisco Massa, Sergey Zagoruyko, and the rest of Thoth and FAIR teams for their
help and fruitful discussions. Julien Mairal was funded by the ERC grant number 714381 (SOLARIS
project) and by ANR 3IA MIAI@Grenoble Alpes (ANR-19-P3IA-0003).
References
[1] Agrawal, P., Carreira, J., Malik, J.: Learning to see by moving. In: Proceedings of the Interna-
tional Conference on Computer Vision (ICCV) (2015) 3
[2] Asano, Y.M., Rupprecht, C., Vedaldi, A.: Self-labelling via simultaneous clustering and repre-
sentation learning. International Conference on Learning Representations (ICLR) (2020) 2, 3,
4, 5, 7, 19, 20, 21
[3] Bachman, P., Hjelm, R.D., Buchwalter, W.: Learning representations by maximizing mutual
information across views. In: Advances in Neural Information Processing Systems (NeurIPS)
(2019) 16
[4] Bautista, M.A., Sanakoyeu, A., Tikhoncheva, E., Ommer, B.: Cliquecnn: Deep unsupervised
exemplar learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2016) 3
[5] Bojanowski, P., Joulin, A.: Unsupervised learning by predicting noise. In: Proceedings of the
International Conference on Machine Learning (ICML) (2017) 2, 18
[6] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object
detection with transformers. arXiv preprint arXiv:2005.12872 (2020) 6, 7, 15, 17, 18
[7] Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning
of visual features. In: Proceedings of the European Conference on Computer Vision (ECCV)
(2018) 2, 3, 7, 18, 19, 20
[8] Caron, M., Bojanowski, P., Mairal, J., Joulin, A.: Unsupervised pre-training of image features
on non-curated data. In: Proceedings of the International Conference on Computer Vision
(ICCV) (2019) 3, 8
[9] Caron, M., Morcos, A., Bojanowski, P., Mairal, J., Joulin, A.: Pruning convolutional neural
networks with self-supervision. arXiv preprint arXiv:2001.03554 (2020) 9
[10] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning
of visual representations. arXiv preprint arXiv:2002.05709 (2020) 1, 2, 5, 6, 13, 14, 16, 17
[11] Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive
learning. arXiv preprint arXiv:2003.04297 (2020) 5, 7, 14, 15, 18
[12] Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: Randaugment: Practical data augmentation with no
separate search. arXiv preprint arXiv:1909.13719 (2019) 6
[13] Cuturi, M.: Sinkhorn distances: Lightspeed computation of optimal transport. In: Advances in
Neural Information Processing Systems (NeurIPS) (2013) 3, 4, 5, 19
[14] Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context
prediction. In: Proceedings of the International Conference on Computer Vision (ICCV) (2015)
3
[15] Donahue, J., Simonyan, K.: Large scale adversarial representation learning. In: Advances in
Neural Information Processing Systems (NeurIPS) (2019) 5, 16
[16] Dosovitskiy, A., Fischer, P., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative
unsupervised feature learning with exemplar convolutional neural networks. IEEE transactions
on pattern analysis and machine intelligence 38(9), 1734–1747 (2016) 1, 2
[17] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual
object classes (voc) challenge. International journal of computer vision 88(2), 303–338 (2010)
7
[18] Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: Liblinear: A library for large linear
classification. Journal of machine learning research (2008) 15
[19] Gidaris, S., Bursuc, A., Komodakis, N., Pérez, P., Cord, M.: Learning representations by
predicting bags of visual words. arXiv preprint arXiv:2002.12247 (2020) 3, 7, 17, 18
[20] Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image
rotations. In: International Conference on Learning Representations (ICLR) (2018) 16
[21] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A.,
Jia, Y., He, K.: Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint
arXiv:1706.02677 (2017) 14
[22] Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: A new estimation principle for
unnormalized statistical models. In: Proceedings of the Thirteenth International Conference on
Artificial Intelligence and Statistics (2010) 2
[23] Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping.
In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (2006)
1
[24] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual
representation learning. arXiv preprint arXiv:1911.05722 (2019) 1, 2, 5, 7, 8, 15, 16, 17, 18
[25] He, K., Girshick, R., Dollár, P.: Rethinking imagenet pre-training. In: Proceedings of the
International Conference on Computer Vision (ICCV) (2019) 7, 15
[26] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Pro-
ceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
6
[27] Hénaff, O.J., Razavi, A., Doersch, C., Eslami, S., Oord, A.v.d.: Data-efficient image recognition
with contrastive predictive coding. arXiv preprint arXiv:1905.09272 (2019) 5, 16
[28] Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A.,
Bengio, Y.: Learning deep representations by mutual information estimation and maximization.
International Conference on Learning Representations (ICLR) (2019) 2
[29] Huang, J., Dong, Q., Gong, S.: Unsupervised deep learning by neighbourhood discovery. In:
Proceedings of the International Conference on Machine Learning (ICML) (2019) 3
[30] Jenni, S., Favaro, P.: Self-supervised feature learning by learning to spot artifacts. In: Pro-
ceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
3
[31] Jing, L., Tian, Y.: Self-supervised visual feature learning with deep neural networks: A survey.
arXiv preprint arXiv:1902.06162 (2019) 3
[32] Kim, D., Cho, D., Yoo, D., Kweon, I.S.: Learning image representations by completing damaged
jigsaw puzzles. In: Winter Conference on Applications of Computer Vision (WACV) (2018) 3
[33] Kolesnikov, A., Zhai, X., Beyer, L.: Revisiting self-supervised visual representation learning. In:
Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 6
[34] Larsson, G., Maire, M., Shakhnarovich, G.: Learning representations for automatic colorization.
In: Proceedings of the European Conference on Computer Vision (ECCV) (2016) 3
[35] Li, J., Zhou, P., Xiong, C., Socher, R., Hoi, S.C.: Prototypical contrastive learning of unsuper-
vised representations. arXiv preprint arXiv:2005.04966 (2020) 5, 6, 14, 15, 17, 19
[36] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.:
Microsoft coco: Common objects in context. In: Proceedings of the European Conference on
Computer Vision (ECCV) (2014) 6, 7, 15, 17
[37] Liu, C., Dollár, P., He, K., Girshick, R., Yuille, A., Xie, S.: Are labels necessary for neural
architecture search? arXiv preprint arXiv:2003.12056 (2020) 9
[38] Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint
arXiv:1608.03983 (2016) 6, 14, 15
[39] Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., van der
Maaten, L.: Exploring the limits of weakly supervised pretraining. In: Proceedings of the
European Conference on Computer Vision (ECCV) (2018) 8
[40] Mahendran, A., Thewlis, J., Vedaldi, A.: Cross pixel optical flow similarity for self-supervised
learning. arXiv preprint arXiv:1807.05636 (2018) 3
[41] Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B.,
Houston, M., Kuchaiev, O., Venkatesh, G., et al.: Mixed precision training. arXiv preprint
arXiv:1710.03740 (2017) 14
[42] Misra, I., van der Maaten, L.: Self-supervised learning of pretext-invariant representations.
arXiv preprint arXiv:1912.01991 (2019) 1, 2, 3, 5, 6, 7, 14, 15, 17, 18
[43] Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal
order verification. In: Proceedings of the European Conference on Computer Vision (ECCV)
(2016) 3
[44] Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw
puzzles. In: Proceedings of the European Conference on Computer Vision (ECCV) (2016) 3, 5
[45] Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding.
arXiv preprint arXiv:1807.03748 (2018) 2
[46] Pathak, D., Girshick, R., Dollár, P., Darrell, T., Hariharan, B.: Learning features by watching
objects move. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
(CVPR) (2017) 3
[47] Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature
learning by inpainting. In: Proceedings of the Conference on Computer Vision and Pattern
Recognition (CVPR) (2016) 3
[48] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with
region proposal networks. In: Advances in Neural Information Processing Systems (NeurIPS)
(2015) 6, 7, 15, 17, 18
[49] Sohn, K., Berthelot, D., Li, C.L., Zhang, Z., Carlini, N., Cubuk, E.D., Kurakin, A., Zhang, H.,
Raffel, C.: Fixmatch: Simplifying semi-supervised learning with consistency and confidence.
arXiv preprint arXiv:2001.07685 (2020) 6
[50] Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. arXiv preprint arXiv:1906.05849
(2019) 16
[51] Touvron, H., Vedaldi, A., Douze, M., Jégou, H.: Fixing the train-test resolution discrepancy. In:
Advances in Neural Information Processing Systems (NeurIPS) (2019) 2
[52] Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P.,
Belongie, S.: The inaturalist species classification and detection dataset. In: Proceedings of the
Conference on Computer Vision and Pattern Recognition (CVPR) (2018) 7
[53] Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. In: Pro-
ceedings of the International Conference on Computer Vision (ICCV) (2015) 3
[54] Wang, X., He, K., Gupta, A.: Transitive invariance for self-supervised visual representation
learning. In: Proceedings of the International Conference on Computer Vision (ICCV) (2017) 3
[55] Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2. https://github.com/
facebookresearch/detectron2 (2019) 15
[56] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance
discrimination. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
(CVPR) (2018) 2, 3, 4, 5, 17, 19, 21
[57] Xie, J., Girshick, R., Farhadi, A.: Unsupervised deep embedding for clustering analysis. In:
Proceedings of the International Conference on Machine Learning (ICML) (2016) 3
[58] Xie, Q., Dai, Z.D., Hovy, E., Luong, M.T., Le, Q.V.: Unsupervised data augmentation for
consistency training. arXiv preprint arXiv:1904.12848 (2020) 6
[59] Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for
deep neural networks. In: Proceedings of the Conference on Computer Vision and Pattern
Recognition (CVPR) (2017) 8
[60] Yan, X., Misra, Ishan, I., Gupta, A., Ghadiyaram, D., Mahajan, D.: ClusterFit: Improving
generalization of visual representations. In: Proceedings of the Conference on Computer Vision
and Pattern Recognition (CVPR) (2020) 3
[61] Yang, J., Parikh, D., Batra, D.: Joint unsupervised learning of deep representations and image
clusters. In: Proceedings of the Conference on Computer Vision and Pattern Recognition
(CVPR) (2016) 3
[62] You, Y., Gitman, I., Ginsburg, B.: Large batch training of convolutional networks. arXiv preprint
arXiv:1708.03888 (2017) 6, 14, 15
[63] Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: Proceedings of the European
Conference on Computer Vision (ECCV) (2016) 5
[64] Zhang, R., Isola, P., Efros, A.A.: Split-brain autoencoders: Unsupervised learning by cross-
channel prediction. In: Proceedings of the Conference on Computer Vision and Pattern Recog-
nition (CVPR) (2017) 3
[65] Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene
recognition using places database. In: Advances in Neural Information Processing Systems
(NeurIPS) (2014) 7
[66] Zhuang, C., Zhai, A.L., Yamins, D.: Local aggregation for unsupervised learning of visual
embeddings. In: Proceedings of the International Conference on Computer Vision (ICCV)
(2019) 3, 5, 17, 19
Appendix
A Implementation Details
In this section, we provide the details and hyperparameters for SwAV pretraining and transfer learning.
Our code is publicly available at https://github.com/facebookresearch/swav.
# SwAV pseudo-code (PyTorch-style); augment, update, softmax, mm and cat are placeholder operations
# C: prototypes (DxK)
# model: convnet + projection head
# temp: temperature
for x in loader: # load a batch x with B samples
    x_t, x_s = augment(x), augment(x) # two random augmentations t, s ~ T
    z = model(cat(x_t, x_s)) # L2-normalized embeddings: 2BxD
    scores = mm(z, C) # prototype scores: 2BxK
    scores_t, scores_s = scores[:B], scores[B:]
    # compute assignments
    with torch.no_grad():
        q_t = sinkhorn(scores_t)
        q_s = sinkhorn(scores_s)
    # convert scores to probabilities
    p_t = softmax(scores_t / temp)
    p_s = softmax(scores_s / temp)
    # swapped prediction loss (Eq. 1)
    loss = -0.5 * mean(sum(q_t * log(p_s), dim=1) + sum(q_s * log(p_t), dim=1))
    # SGD update on the encoder and the prototypes
    loss.backward()
    update(model.params)
    update(C)
    # normalize prototypes
    with torch.no_grad():
        C = normalize(C, dim=0, p=2)

# Sinkhorn-Knopp
def sinkhorn(scores, eps=0.05, niters=3):
    Q = exp(scores / eps).T
    Q /= sum(Q)
    K, B = Q.shape
    u, r, c = zeros(K), ones(K) / K, ones(B) / B
    for _ in range(niters):
        u = sum(Q, dim=1)
        Q *= (r / u).unsqueeze(1)
        Q *= (c / sum(Q, dim=0)).unsqueeze(0)
    return (Q / sum(Q, dim=0, keepdim=True)).T
Most of our training hyperparameters are directly taken from the SimCLR work [10]. We train
SwAV with stochastic gradient descent using large batches of 4096 different instances. We distribute
the batches over 64 V100 16Gb GPUs, resulting in each GPU treating 64 instances. The temperature
parameter τ is set to 0.1 and the Sinkhorn regularization parameter ε is set to 0.05 for all runs. We
use a weight decay of 10−6 , LARS optimizer [62] and a learning rate of 4.8 which is linearly ramped
up during the first 10 epochs. After warmup, we use the cosine learning rate decay [38, 42] with
a final value of 0.0048. To help the very beginning of the optimization, we freeze the prototypes
during the first epoch of training. We synchronize batch-normalization layers across GPUs using
the optimized implementation with CUDA/C++ kernels from the apex library. We
also use apex for training with mixed precision [41]. Overall, thanks to these training
optimizations (mixed precision, kernel batch-normalization and use of large batches [21]), 100
epochs of training for our best SwAV model take approximately 6 hours (see Table 4). Similarly to
previous works [10, 11, 35], we use a projection head on top of the convnet features that consists of a
2-layer multi-layer perceptron (MLP) that projects the convnet output to a 128-D space.
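As an illustration, such a projection head could look as follows; the hidden width and the use of batch normalization are assumptions made for this sketch, while the 2048-D ResNet-50 output and the 128-D output space come from the text:

import torch.nn as nn

projection_head = nn.Sequential(
    nn.Linear(2048, 2048),        # ResNet-50 outputs 2048-D features
    nn.BatchNorm1d(2048),         # assumed; the exact head design is not specified here
    nn.ReLU(inplace=True),
    nn.Linear(2048, 128),         # project to the 128-D space mentioned above
)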
Note that SwAV is more suitable for a multi-node distributed implementation compared to contrastive
approaches SimCLR or MoCo. The latter methods require sharing the feature matrix across all
GPUs at every batch which might become a bottleneck when distributing across many GPUs. On
the contrary, SwAV requires sharing only matrix normalization statistics (sum of rows and columns)
during the Sinkhorn algorithm.
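To make this concrete, a distributed version of the Sinkhorn normalization only needs to all-reduce a K-dimensional vector of row sums, since each column of Q corresponds to a feature held by a single GPU; the sketch below is illustrative, assumes an initialized torch.distributed process group, and is not the exact implementation:

import torch
import torch.distributed as dist

def distributed_sinkhorn(Q, n_iters=3):
    # Q: local K x B_local block of exp(C^T Z / eps); columns live entirely on this GPU.
    B_total = Q.shape[1] * dist.get_world_size()
    total = Q.sum()
    dist.all_reduce(total)                             # single scalar shared across GPUs
    Q = Q / total
    for _ in range(n_iters):
        row_sums = Q.sum(dim=1, keepdim=True)
        dist.all_reduce(row_sums)                      # only K numbers are exchanged, never the features
        Q = Q / (row_sums * Q.shape[0])                # global rows sum to 1/K
        Q = Q / (Q.sum(dim=0, keepdim=True) * B_total) # local columns sum to 1/B
    return (Q / Q.sum(dim=0, keepdim=True)).t()        # local soft codes, B_local x K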
Figure 5: Multi-crop: the image xn is transformed into V + 2 views: two global views and V small
resolution zoomed views.
A.4 Implementation details of semi-supervised learning (finetuning with 1% or 10% labels)
We finetune with either 1% or 10% of ImageNet labeled images a ResNet-50 pretrained with SwAV.
We use the 1% and 10% splits specified in the official code release of SimCLR. We mostly follow
hyperparameters from PCL [35]: we train for 20 epochs with a batch size of 256, we use distinct
learning rates for the convnet weights and the final linear layer, and we decay the learning rates
by a factor of 0.2 at epochs 12 and 16. We do not apply any weight decay during finetuning. For 1%
finetuning, we use a learning rate of 0.02 for the trunk and 5 for the final layer. For 10% finetuning,
we use a learning rate of 0.01 for the trunk and 0.2 for the final layer.
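A sketch of this finetuning optimizer setup is shown below; it assumes a torchvision-style ResNet-50 whose classifier is `model.fc`, and the SGD momentum value is an assumption, while the learning rates, schedule, and absence of weight decay follow the text (1% split shown):

import torch

def build_finetune_optimizer(model):
    trunk = [p for n, p in model.named_parameters() if not n.startswith("fc.")]
    head = [p for n, p in model.named_parameters() if n.startswith("fc.")]
    optimizer = torch.optim.SGD(
        [{"params": trunk, "lr": 0.02},    # trunk learning rate for 1% finetuning
         {"params": head, "lr": 5.0}],     # final-layer learning rate for 1% finetuning
        momentum=0.9,                      # assumed value
        weight_decay=0.0,                  # no weight decay during finetuning
    )
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[12, 16], gamma=0.2)
    return optimizer, scheduler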
B Additional Results
In Table 4, we report compute and GPU memory requirements based on our implementation for
different settings. As described in § A.1, we train each method on 64 V100 16GB GPUs, with a batch
size of 4096, using mixed precision and apex optimized version of synchronized batch-normalization
layers. We report results with ResNet-50 for all methods. In Fig. 6, we report SwAV performance
for different training lengths measured in hours based on our implementation. We observe that after
only 6 hours of training, SwAV outperforms SimCLR trained for 1000 epochs (40 hours based on
our implementation) by a large margin. If we train SwAV for longer, we see that the performance gap
between the two methods increases even more.
Table 4: Computational cost. We report time and GPU memory requirements based on our imple-
mentation for different models trained during 100 epochs.
Method multi-crop time / 100 epochs peak memory / GPU
SimCLR 2 × 224 4h00 8.6G
SwAV 2 × 224 4h09 8.6G
SwAV 2 × 160 + 4 × 96 4h50 8.5G
SwAV 2 × 224 + 6 × 96 6h15 12.8G
[Plot: top-1 ImageNet accuracy versus running time in hours for SwAV and SimCLR.] Figure 6: ImageNet accuracy for linear models trained on frozen features. We report SwAV performance for different training lengths measured in hours based on our implementation. We train each ResNet-50 model on 64 V100 16GB GPUs with a batch size of 4096 (see § A.1 for implementation details).
In Table 5, we show results when training SwAV on large architectures. We observe that SwAV
benefits from training on large architectures and plan to explore this direction to further boost
self-supervised methods.
Table 5: Large architectures. Top-1 accuracy for linear models trained on frozen features from
different self-supervised methods on large architectures.
Method Arch. Param. Top1
Supervised EffNet-B7 66 84.4
Rotation [20] RevNet50-4w 86 55.4
BigBiGAN [15] RevNet50-4w 86 61.3
AMDIM [3] Custom-RN 626 68.1
CMC [50] R50-w2 188 68.4
MoCo [24] R50-w4 375 68.6
CPC v2 [27] R161 305 71.5
SimCLR [10] R50-w4 375 76.8
SwAV R50-w4 375 77.9
SwAV R50-w5 586 78.5
B.3 Transferring unsupervised features to downstream tasks
In Table 6, we expand results from the main paper by providing numbers from previously and
concurrently published self-supervised methods. In the left panel of Table 6, we show performance
after training a linear classifier on top of frozen representations on different datasets while on the right
panel we evaluate the features by finetuning a ResNet-50 on object detection with Faster R-CNN [48]
and DETR [6]. Overall, we observe on Table 6 that SwAV is the first self-supervised method to
outperform ImageNet supervised backbone on all the considered transfer tasks and datasets. Other
self-supervised learners are capable of surpassing the supervised counterpart but only for one type of
transfer (object detection with finetuning for MoCo/PIRL for example). We will release this model so
other researchers might also benefit by replacing the ImageNet supervised network with our model.
Table 6: Transfer learning on downstream tasks. Comparison between features from ResNet-50
trained on ImageNet with SwAV or supervised learning. We also report numbers from other self-
supervised methods († for numbers from other methods run by us). We consider two settings. (1)
Linear classification on top of frozen features. We report top-1 accuracy on Places205 and iNat18
datasets and mAP on VOC07. (2) Object detection with finetuned features on VOC07+12 trainval
using Faster R-CNN [48] and on COCO [36] using DETR [6]. In this table, we report the most
standard detection metrics for these datasets: AP50 on VOC07+12 and AP on COCO.
Linear Classification Object Detection
Places205 VOC07 iNat18 VOC07+12 (Faster R-CNN) COCO (DETR)
Supervised 53.2 87.5 46.7 81.3 40.8
RotNet [19] 45.0 64.6 - - -
NPID++ [42] 46.4 76.6 32.4 79.1 -
MoCo [24] 46.9† 79.8† 31.5† 81.5 -
PIRL [42] 49.8 81.1 34.1 80.7 -
PCL [35] 49.8 84.0 - - -
BoWNet [19] 51.1 79.3 - 81.3 -
SimCLR [10] 53.3† 86.4† 36.2† - -
MoCov2 [24] 52.9† 87.1† 38.9† 82.5 42.0†
SwAV 56.7 88.9 48.6 82.6 42.1
Table 7: More detection metrics for object detection on VOC07+12 with finetuned features
using Faster R-CNN [48].
Method APall AP50 AP75
Supervised 53.5 81.3 58.8
Random 28.1 52.5 26.2
NPID++ [42] 52.3 79.1 56.9
PIRL [42] 54.0 80.7 59.7
BoWNet [19] 55.8 81.3 61.1
MoCov1 [24] 55.9 81.5 62.6
MoCov2 [11] 57.4 82.5 64.0
SwAV 56.1 82.6 62.7
Table 8: More detection metrics for object detection on COCO with finetuned features using
DETR [6].
Method AP AP50 AP75 APS APM APL
ImageNet labels 40.8 61.2 42.9 20.1 44.5 60.3
MoCo-v2 42.0 62.7 44.4 20.8 45.6 60.9
SwAV 42.1 63.1 44.5 19.7 46.3 60.9
Table 9: Low-shot learning on ImageNet. Top-1 and top-5 accuracies when training with 13 or 128
examples per category.
# examples per class 13 128
top1 top5 top1 top5
No pretraining 25.4 48.4 56.4 80.4
SwAV IG-1B 38.2 67.1 64.7 87.2
Table 10: KNN classifiers on ImageNet. We report top-1 accuracy with 20 and 200 nearest neigh-
bors.
Method 20-NN 200-NN
NPID [56] - 46.5
LA [66] - 49.4
PCL [35] 54.5 -
SwAV 59.2 55.8
Using hard assignments performs worse than using the soft assignments. An explanation is that the rounding
needed to obtain discrete codes is a more aggressive optimization step than gradient updates. While it
makes the model converge rapidly (see Fig. 7), it leads to a worse solution.
[Figure 7: SwAV training loss as a function of epochs for soft versus hard assignments.]
Table 11: Impact of number of prototypes. Top-1 ImageNet accuracy for linear models trained on
frozen features.
Number of prototypes 300 1000 3000 10000 30000 100000
Top-1 72.8 73.6 73.9 74.1 73.8 73.8
Table 12: Ablation studies on clustering. Top-1 ImageNet accuracy for linear models trained on
frozen features. (left) Impact of learning the prototypes. (right) Hard versus soft assignments.
Prototypes Learned Fixed Assignment Soft Hard
Top-1 73.9 73.1 Top-1 73.9 73.3
DeepCluster alternates between clustering the features of the entire dataset into pseudo-labels ("assignment
phase") and training the network with a classification loss supervised by these pseudo-labels ("training
phase").
The pseudo-labels are kept fixed during training and updated for the entire dataset once per epoch
during the assignment phase.
Training phase in DeepCluster-v2. In the original DeepCluster work, both the classification head
c and the convnet weights are trained to classify the images into their corresponding pseudo-label
between two assignments. Intuitively, this classification head is optimized to represent prototypes
for the different pseudo-classes. However, since there is no mapping between two consecutive
assignments, the classification head learned during an assignment becomes irrelevant for the following
one. Thus, this classification head needs to be reset at each new assignment, which considerably
disrupts the convnet training. For this reason, we propose to simply use for classification head c the
centroids given by k-means clustering (Eq. 10). Overall, during training, DeepCluster-v2 optimizes
the following problem with mini-batch SGD:
$$\min_{z} \; \ell(z, c, q). \tag{8}$$
Training phase in SeLa-v2. In the SeLa work, the prototypes c are learned with stochastic gradient
descent during the training phase. Overall, during training, SeLa-v2 optimizes the objective ℓ(z, c, q) jointly with respect to the features z and the prototypes c.
Table 13: Impact of the number of iterations in Sinkhorn algorithm. Top-1 ImageNet accuracy
for linear models trained on frozen features.
Sinkhorn iterations 1 3 10 30
Top-1 fail 73.9 73.8 73.7
Using the original implementation, if assignments are updated at each epoch, then the assignment
phase represents one third of the total training time. Therefore, in order to speed up training, we
choose to use the features computed during the previous epoch instead of dedicating forward passes to
the assignments. This is similar to the memory bank introduced by Wu et al. [56], without momentum.
where zn and the columns of C are normalized. The original DeepCluster work uses tricks such as
cluster re-assignments and balanced batch sampling to avoid trivial solutions, but we found these
unnecessary and did not observe collapsing during training. As noted by Asano et al., this is
due to the fact that assignment and training are well separated phases.
Assignment phase in SeLa-v2. Unlike DeepCluster, SeLa uses the same loss during the training and
assignment phases. In particular, we use the Sinkhorn-Knopp algorithm to optimize the following
assignment problem (see details and derivations in the original SeLa paper [2]):

$$\min_{q} \; \ell(z, c, q). \tag{11}$$
Implementation details. We use the same hyperparameters as SwAV to train SeLa-v2 and
DeepCluster-v2: these are described in § A. Asano et al. [2] have shown that multi-clustering
boosts the performance of clustering-based approaches, and so we use 3 sets of 3000 prototypes c when
training SeLa-v2 and DeepCluster-v2. Note that unlike online methods (like SwAV, SimCLR and
MoCo), the clustering approaches SeLa-v2 and DeepCluster-v2 can be implemented with only a
single crop per image per batch. The major limitation of SeLa-v2 and DeepCluster-v2 is that these
methods are not online and therefore scaling them to very large datasets is not possible without
major adjustments.