CS772 Project Proposal
Efficient Training on Large-Scale Datasets
Rishit Bhutra (210857), Sameer Ahmad (210912), Shubham Jangid (211022), Bhavesh Shukla (210266)
15th February 2025
Introduction
The use of web-scraped data in machine learning has significantly advanced the field, enabling
models to be trained on large, diverse datasets. However, these vast amounts of data also come with
challenges, particularly the extended time required for training. Much of this time is wasted learning
redundant or non-learnable data points. This project aims to address these inefficiencies by
developing methods to selectively choose data points that contribute most to training, thereby
reducing overall training time while maintaining, or even improving, model accuracy.
Related Work
Our work is motivated by the paper Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt (Mindermann et al., 2022). The authors propose techniques to rank and prioritize data points during training, showing that intelligently selected samples can significantly reduce training time without sacrificing performance.
Two prevalent approaches to sample prioritization are:
• Filtering out noisy data points, which are often not worth learning.
• Focusing on high-loss examples, which are deemed more informative (see the sketch after this list).
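To make the second heuristic concrete, here is a minimal sketch of loss-based selection, assuming a PyTorch classifier trained with cross-entropy; the function name and the keep_fraction parameter are illustrative choices of ours, not part of any cited implementation.

```python
import torch
import torch.nn.functional as F

def select_high_loss(model, inputs, targets, keep_fraction=0.1):
    # Score each candidate example by its individual cross-entropy loss
    # under the current model, without tracking gradients.
    with torch.no_grad():
        per_example_loss = F.cross_entropy(model(inputs), targets, reduction="none")
    # Keep only the highest-loss fraction of the candidate batch for the
    # actual gradient update.
    k = max(1, int(keep_fraction * inputs.size(0)))
    top_idx = per_example_loss.topk(k).indices
    return inputs[top_idx], targets[top_idx]
```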
However, both methods come with limitations, particularly in handling varying data distributions.
To address these limitations, the paper introduces a novel metric, the Reducible Holdout Loss (RHO loss), which identifies and ranks the most valuable training points. In experiments on the large-scale Clothing-1M dataset, selection by RHO loss led to an 18x reduction in training steps while improving accuracy by 2%.
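For intuition, the selection step based on RHO loss could look roughly like the sketch below, assuming a PyTorch setup: model is the network being trained, irreducible_model stands for a hypothetical auxiliary model trained only on holdout data, and keep_fraction is an illustrative hyperparameter. The original paper's implementation differs in details (for instance, irreducible losses can be precomputed once per point), so this is a sketch rather than a faithful reproduction.

```python
import torch
import torch.nn.functional as F

def select_by_rho_loss(model, irreducible_model, inputs, targets, keep_fraction=0.1):
    with torch.no_grad():
        # Training loss of the current model on each candidate point.
        training_loss = F.cross_entropy(model(inputs), targets, reduction="none")
        # "Irreducible" loss from an auxiliary model trained only on a
        # holdout set (assumed given here; precomputable in practice).
        irreducible_loss = F.cross_entropy(irreducible_model(inputs), targets, reduction="none")
    # Reducible holdout loss: high values flag points that are learnable,
    # worth learning, and not yet learnt.
    rho = training_loss - irreducible_loss
    k = max(1, int(keep_fraction * inputs.size(0)))
    top_idx = rho.topk(k).indices
    return inputs[top_idx], targets[top_idx]
```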
Proposed Contribution
Our project extends the ideas from the original paper and explores their application beyond computer vision tasks, aiming to make contributions in two key areas:
1. Adapting to New Domains
The original paper focuses on computer vision, where noisy data points are comparatively easy to identify and remove. We propose extending the approach to more complex domains, specifically Natural
Language Processing (NLP) and Signal Processing. This adaptation will involve testing on
standard datasets from these fields to evaluate the method’s performance in different contexts. Given
the computational cost, efficient batching and load balancing strategies will be integral to
the adaptation.
2. Exploring Alternative Metrics and Loss Functions
In addition to the RHO loss, we will explore the use of probabilistic autoencoders to develop more sophisticated metrics for filtering out non-learnable data points. This approach may offer finer-grained control over the ranking of data points, particularly in signal processing tasks. We also plan to experiment with modifications to the RHO loss function, assessing its effectiveness on datasets with highly diverse distributions through synthetic experiments.
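As a first illustration of the kind of metric we have in mind, the sketch below scores candidate points by the reconstruction error of a small autoencoder over fixed-length signal windows; a probabilistic variant (e.g. a variational autoencoder) would replace the squared error with a likelihood-based score, but the deterministic version keeps the sketch short. All class and function names here are hypothetical.

```python
import torch
import torch.nn as nn

class SmallAutoencoder(nn.Module):
    # A deliberately small autoencoder for fixed-length signal windows;
    # its reconstruction error is used below as a rough learnability score.
    def __init__(self, dim, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_scores(autoencoder, batch):
    # Per-example mean squared reconstruction error; candidates with
    # extreme scores could be flagged as non-learnable (e.g. corrupted
    # signals) and filtered out before ranking the rest by RHO loss.
    with torch.no_grad():
        recon = autoencoder(batch)
    return ((recon - batch) ** 2).mean(dim=1)
```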