Semantic Deduplication for Data Efficient Learning [technical report]

Jung Hwan Heo & Paul Chen

Extension to SemDeDup: Data-efficient learning at web-scale through semantic deduplication

Getting Started

To get the deduplicated dataloader and train a model, use the following command:

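python main.py \
    --ratio 25 \
    --n_clusters 20 \
    --rank_type random \
    --prune_type common \
    --dataset fmnist

The options are: --ratio, the percentage of data to use; --n_clusters, the number of clusters; --rank_type, random or cossim; --prune_type, common or diverse; --dataset, mnist or fmnist. Keep comments off the continuation lines: a comment after a trailing backslash ends the command early.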

To run a sweep over hyperparameters, edit the script run.sh and then launch:

sh run.sh

You can also explore the code interactively through two notebooks:

dedup.ipynb: deduplicate with fast k-means (with support for the MPS backend)
train.ipynb: train with the deduplicated dataset

Introduction

With the advent of big data, machine learning models are being trained on ever larger datasets, but with diminishing returns.
These datasets usually have a considerable amount of redundant or duplicate data, which can result in longer training times, increased storage needs, and unnecessary computational complexity.
To overcome this issue, our project proposes utilizing clustering techniques for semantic deduplication, which can significantly reduce data redundancy and storage requirements while enhancing training efficiency.

Objectives

Dataset Exploration: Develop a semantic deduplication method using k-means clustering to identify and remove redundant data samples.
Efficiency Tradeoff: Analyze the impact of semantic deduplication on storage requirements and training efficiency.
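
The first objective can be illustrated with a minimal NumPy sketch. It is not the repository's implementation: the small Lloyd's k-means below stands in for the "fast kmeans" used in the notebook, and the function names `kmeans` and `prune_clusters` are hypothetical. The `ratio` and `prune_type` parameters borrow their names from the command-line options, but the semantics shown here (keep the samples most or least similar to their cluster centroid) are an assumption.

```python
import numpy as np


def kmeans(x, n_clusters, n_iters=20, seed=0):
    """Plain Lloyd's k-means on an (N, D) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from distinct data points.
    centroids = x[rng.choice(len(x), size=n_clusters, replace=False)].copy()

    def assign(c):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(n_iters):
        labels = assign(centroids)
        for k in range(n_clusters):
            members = x[labels == k]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[k] = members.mean(axis=0)
    return centroids, assign(centroids)


def prune_clusters(x, centroids, labels, ratio, prune_type="common"):
    """Keep about `ratio` percent of each cluster, ranked by cosine
    similarity to the cluster centroid (assumed semantics, see lead-in)."""
    keep = []
    for k in range(len(centroids)):
        idx = np.flatnonzero(labels == k)
        if len(idx) == 0:
            continue
        members = x[idx]
        denom = np.linalg.norm(members, axis=1) * np.linalg.norm(centroids[k])
        sims = members @ centroids[k] / (denom + 1e-12)
        order = np.argsort(-sims)        # most centroid-like first
        if prune_type == "diverse":
            order = order[::-1]          # least centroid-like first
        n_keep = max(1, int(round(len(idx) * ratio / 100)))
        keep.extend(idx[order[:n_keep]])
    return np.sort(np.array(keep))


# Toy data standing in for flattened images or embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 8))
centroids, labels = kmeans(x, n_clusters=4)
kept = prune_clusters(x, centroids, labels, ratio=25, prune_type="common")
print(len(kept), "of", len(x), "samples kept")
```

The returned indices can then be used to subset a dataset before building a dataloader. Ranking within each cluster, rather than globally, keeps every cluster represented in the pruned set.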

Methods

Clustering techniques to be used

Models to be used for feature extraction

VGG16
~~ResNet~~

Datasets used

MNIST
Fashion-MNIST
CIFAR-10
~~Stanford Cars~~
~~Caltech101~~

Representation spaces to be used

R1. Pixel space
R2. Embedding space
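
As a rough illustration of the two representation spaces, the sketch below turns a batch of images into an (N, D) feature matrix in either way. The helper names `pixel_features` and `embedding_features` are hypothetical, and `toy_encoder` is only a random linear projection standing in for a real feature extractor such as VGG16; it is not what the project uses.

```python
import numpy as np


def pixel_features(images):
    """R1: flatten (N, H, W) or (N, H, W, C) uint8 images to (N, D) floats in [0, 1]."""
    images = np.asarray(images, dtype=np.float32)
    return images.reshape(len(images), -1) / 255.0


def embedding_features(images, encoder):
    """R2: map images to (N, D) vectors with any callable encoder."""
    feats = np.asarray(encoder(images), dtype=np.float32)
    return feats.reshape(len(feats), -1)


# Toy batch shaped like MNIST / Fashion-MNIST images.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(16, 28, 28), dtype=np.uint8)

# Hypothetical stand-in for a pretrained network: a fixed random projection.
projection = rng.normal(size=(28 * 28, 32)).astype(np.float32)


def toy_encoder(batch):
    return pixel_features(batch) @ projection


r1 = pixel_features(images)
r2 = embedding_features(images, toy_encoder)
print(r1.shape, r2.shape)  # (16, 784) (16, 32)
```

Either matrix can be passed to the same clustering step, so the choice of representation space stays independent of the deduplication logic.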

Experiments

E1. Pairwise cosine similarity per cluster
E2. Data pruning ratio vs. Training steps vs. Performance
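
One way to compute the quantity in E1 is sketched below, assuming features and cluster labels are already available as NumPy arrays. The function name `pairwise_cosine_per_cluster` is hypothetical, and summarising each cluster by the mean of its off-diagonal similarities is an assumption about how the experiment is reported.

```python
import numpy as np


def pairwise_cosine_per_cluster(x, labels):
    """Return {cluster_id: mean pairwise cosine similarity} for clusters
    with at least two members; self-similarities are excluded."""
    out = {}
    for k in np.unique(labels):
        members = x[labels == k]
        n = len(members)
        if n < 2:
            continue
        # Normalise rows so the Gram matrix holds cosine similarities.
        unit = members / (np.linalg.norm(members, axis=1, keepdims=True) + 1e-12)
        sim = unit @ unit.T
        # Drop the diagonal (each sample compared with itself).
        out[int(k)] = float((sim.sum() - np.trace(sim)) / (n * (n - 1)))
    return out


x = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
labels = np.array([0, 0, 1, 1])
result = pairwise_cosine_per_cluster(x, labels)
print(result)
```

In this toy input, cluster 0 holds two identical vectors (similarity 1), while cluster 1 holds vectors 45 degrees apart (similarity about 0.707). A high per-cluster mean suggests many near-duplicates and therefore more room for pruning.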
