Semantic Deduplication for Data Efficient Learning [technical report]
Jung Hwan Heo & Paul Chen
Extension to SemDeDup: Data-efficient learning at web-scale through semantic deduplication
To get the deduplicated dataloader and train a model, use the following command:
python main.py \
--ratio 25 \
--n_clusters 20 \
--rank_type random \
--prune_type common \
--dataset fmnist

The flags are:
- --ratio: percentage of the data to use
- --n_clusters: number of clusters
- --rank_type: random or cossim
- --prune_type: common or diverse
- --dataset: mnist or fmnist

To generate a sweep over hyperparameters, edit the script run.sh and then launch:
sh run.sh

You can also explore the code interactively through the two notebooks:
- dedeup.ipynb: deduplicate with fast k-means (with support for the MPS backend)
- train.ipynb: train with the deduplicated dataset
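
For readers who prefer to drive a sweep from Python rather than a shell script, the sketch below builds one main.py command line per hyperparameter combination. It is only an illustration: the contents of run.sh are not shown here, the grid values for the ratio are hypothetical, and build_commands is a name introduced for this example. Only the flags listed above are used.

```python
import itertools
import shlex

# Hypothetical grid; the option values for rank_type, prune_type and dataset
# are the ones named in the usage section, the ratios are made up.
RATIOS = [25, 50, 75]
RANK_TYPES = ["random", "cossim"]
PRUNE_TYPES = ["common", "diverse"]
DATASETS = ["mnist", "fmnist"]


def build_commands(n_clusters=20):
    """Return one shell-quoted main.py command per grid point."""
    commands = []
    grid = itertools.product(RATIOS, RANK_TYPES, PRUNE_TYPES, DATASETS)
    for ratio, rank_type, prune_type, dataset in grid:
        args = [
            "python", "main.py",
            "--ratio", str(ratio),
            "--n_clusters", str(n_clusters),
            "--rank_type", rank_type,
            "--prune_type", prune_type,
            "--dataset", dataset,
        ]
        commands.append(shlex.join(args))
    return commands


commands = build_commands()
for command in commands[:3]:
    print(command)
```

Each string can then be passed to a job scheduler or executed in turn; the sketch deliberately stops short of running anything.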
- With the advent of big data, machine learning models are trained on ever larger datasets, but with diminishing returns.
- These datasets usually have a considerable amount of redundant or duplicate data, which can result in longer training times, increased storage needs, and unnecessary computational complexity.
- To address this, our project proposes clustering-based semantic deduplication, which can reduce data redundancy and storage requirements while improving training efficiency.
- Dataset Exploration: Develop a semantic deduplication method using k-means clustering to identify and remove redundant data samples.
- Efficiency Tradeoff: Analyze the impact of semantic deduplication on storage requirements and training efficiency.
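
The deduplication idea above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the code in this repository: it clusters unit-normalized features with a plain k-means, scores each sample by its highest cosine similarity to any earlier sample in the same cluster, and keeps the least redundant fraction of every cluster. The function names, the keep_ratio parameter, and the choice to keep a fixed fraction per cluster are all assumptions, and the sketch does not try to reproduce the --rank_type or --prune_type options.

```python
import numpy as np


def kmeans(x, n_clusters, n_iters=20, seed=0):
    """Plain Lloyd's k-means; returns labels from the last assignment step."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=n_clusters, replace=False)].copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iters):
        # Squared Euclidean distance from every sample to every centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = x[labels == k]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[k] = members.mean(axis=0)
    return labels, centroids


def semantic_dedup(embeddings, n_clusters, keep_ratio, seed=0):
    """Return sorted indices of the samples kept after per-cluster pruning."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # dot product = cosine
    labels, _ = kmeans(unit, n_clusters, seed=seed)

    kept = []
    for k in range(n_clusters):
        idx = np.flatnonzero(labels == k)
        if len(idx) == 0:
            continue
        sims = unit[idx] @ unit[idx].T
        # Mask the diagonal and lower triangle so that column j only holds
        # similarities to samples that come before j in the cluster.
        sims[np.tril_indices(len(idx))] = -np.inf
        redundancy = sims.max(axis=0)
        n_keep = max(1, int(round(keep_ratio * len(idx))))
        order = np.argsort(redundancy, kind="stable")  # least redundant first
        kept.extend(idx[order[:n_keep]].tolist())
    return np.sort(np.array(kept, dtype=int))


rng = np.random.default_rng(0)
embeddings = rng.standard_normal((200, 32))
kept = semantic_dedup(embeddings, n_clusters=5, keep_ratio=0.25)
print(f"kept {len(kept)} of {len(embeddings)} samples")
```

Because a sample is only compared with the samples before it, the first copy of an exact duplicate gets a low redundancy score and survives, while later copies are pruned first.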
Clustering techniques to be used
Models to be used for feature extraction
- VGG16
- ResNet
Datasets used
- MNIST
- Fashion-MNIST
- CIFAR-10
- Stanford Cars
- Caltech101
Representation spaces to be used
- R1. Pixel space
- R2. Embedding space
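
To make the two representation spaces concrete, here is a small sketch of how each could be produced before clustering. It assumes images arrive as uint8 arrays with values in 0 to 255. The function names are hypothetical, and toy_extractor is only a stand-in (a fixed random linear projection) for a real feature extractor such as VGG16 or ResNet, which is not loaded here.

```python
import numpy as np


def pixel_representation(images):
    """R1: flatten each image into a vector of pixel values scaled to [0, 1]."""
    images = np.asarray(images, dtype=np.float32)
    return images.reshape(len(images), -1) / 255.0


def embedding_representation(images, extractor):
    """R2: map images to feature vectors with the given extractor callable."""
    return np.asarray(extractor(images), dtype=np.float32)


def toy_extractor(images, dim=16, seed=0):
    """Placeholder for a pretrained network: a fixed random projection."""
    flat = pixel_representation(images)
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((flat.shape[1], dim)).astype(np.float32)
    return flat @ projection


# Eight random 28x28 grayscale images, the shape used by MNIST-style data.
images = np.random.default_rng(1).integers(0, 256, size=(8, 28, 28), dtype=np.uint8)
r1 = pixel_representation(images)
r2 = embedding_representation(images, toy_extractor)
print(r1.shape, r2.shape)
```

Either array can be handed to the same clustering step, which is what makes the pixel-space and embedding-space comparison straightforward to set up.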
Experiments
- E1. Pairwise cosine similarity per cluster
- E2. Data pruning ratio vs. Training steps vs. Performance
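
One way to compute the quantity in E1 is sketched below: for every cluster, average the cosine similarity over all distinct pairs of members. This is an assumed reading of the experiment rather than the project's analysis code, and mean_pairwise_cosine is a name introduced for this example.

```python
import numpy as np


def mean_pairwise_cosine(features, labels):
    """Map each cluster id to the mean cosine similarity over its member pairs."""
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)

    result = {}
    for k in np.unique(labels):
        members = unit[labels == k]
        n = len(members)
        if n < 2:
            result[int(k)] = float("nan")  # a single sample has no pairs
            continue
        sims = members @ members.T
        upper = np.triu_indices(n, k=1)  # count each unordered pair once
        result[int(k)] = float(sims[upper].mean())
    return result


features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [0, 0, 1, 1]
per_cluster = mean_pairwise_cosine(features, labels)
print(per_cluster)
```

A cluster whose mean is close to 1 consists mostly of near-duplicates, so it is a natural candidate for heavier pruning.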