CS772 Project Proposal

Efficient Training on Large-Scale Datasets


Rishit Bhutra Sameer Ahmad Shubham Jangid Bhavesh Shukla
210857 210912 211022 210266

15th February 2025

Introduction
The use of web-scraped data in machine learning has significantly advanced the field, enabling
models to be trained on large, diverse datasets. However, these vast amounts of data also come with
challenges, particularly the extended time required for training. Much of this time is wasted learning
redundant or non-learnable data points. This project aims to address these inefficiencies by
developing methods to selectively choose data points that contribute most to training, thereby
reducing overall training time while maintaining, or even improving, model accuracy.

Related Work
Our work is motivated by the paper Prioritized Training on Points that are Learnable,
Worth Learning, and Not Yet Learnt (Mindermann et al., 2022). The authors propose techniques
to rank and prioritize data points during training, showing that intelligently selected samples
can significantly reduce training time without sacrificing performance.

Two prevalent approaches to sample prioritization are:

• Filtering out noisy data points, which are often not worth learning.
• Focusing on high-loss examples, which are deemed more informative (a minimal sketch of this heuristic follows below).

However, both methods come with limitations, particularly in handling varying data distributions.
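
As a concrete reference point, the high-loss heuristic fits in a few lines. The sketch below is ours, not the paper's implementation: it assumes a PyTorch classifier with cross-entropy loss, and the names (select_high_loss, keep_frac) are illustrative.

```python
import torch
import torch.nn.functional as F

def select_high_loss(model, xb, yb, keep_frac=0.1):
    """Score a candidate batch by per-example loss and keep only the
    highest-loss fraction (the 'focus on hard examples' heuristic)."""
    with torch.no_grad():
        losses = F.cross_entropy(model(xb), yb, reduction="none")
    k = max(1, int(keep_frac * len(xb)))
    idx = losses.topk(k).indices  # indices of the hardest examples
    return xb[idx], yb[idx]
```

Its main failure mode is visible in the code: noisy or mislabeled points also incur high loss, so this rule preferentially selects exactly the points that are not worth learning.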

To address these limitations, the paper introduces a novel metric, the Reducible Holdout Loss
(RHO), which identifies and ranks the most valuable training points. A point's RHO loss is its
training loss under the current model minus its irreducible loss, estimated by a model trained on
holdout data; high-RHO points are learnable, worth learning, and not yet learnt. In experiments
on the large-scale Clothing-1M dataset, RHO-LOSS selection led to an 18x reduction in training
steps while achieving a 2% improvement in accuracy.
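
To make the selection rule concrete, here is a minimal PyTorch sketch of RHO-LOSS selection under our assumptions: an irreducible-loss model (il_model) has already been trained on holdout data, and the names rho_select and keep_frac are ours, not the authors'.

```python
import torch
import torch.nn.functional as F

def rho_select(model, il_model, xb, yb, keep_frac=0.1):
    """Rank a candidate batch by reducible holdout loss:
    RHO(x, y) = L(y | x; current model) - L(y | x; holdout model).
    High scores mark points that are learnable (low holdout loss)
    and not yet learnt (high training loss)."""
    with torch.no_grad():
        train_loss = F.cross_entropy(model(xb), yb, reduction="none")
        irreducible = F.cross_entropy(il_model(xb), yb, reduction="none")
    rho = train_loss - irreducible
    k = max(1, int(keep_frac * len(xb)))
    idx = rho.topk(k).indices  # top-scoring points form the training batch
    return xb[idx], yb[idx]
```

In this scheme a large candidate batch is scored cheaply under torch.no_grad() and only the top fraction is used for the gradient update, which is where the reduction in training steps comes from.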

Proposed Contribution
Our project extends the ideas from the original paper and explores their application beyond
computer vision tasks, aiming to make contributions in two key areas:

1. Adapting to New Domains


The original paper focuses on computer vision, where it is relatively easy to remove noisy
data points. We propose extending the approach to more complex domains, specifically Natural
Language Processing (NLP) and Signal Processing. This adaptation will involve testing on
standard datasets from these fields to evaluate the method's performance in different contexts.
Given the computational cost, efficient batching and load balancing strategies will be integral
to the adaptation.

2. Exploring Alternative Metrics and Loss Functions


In addition to RHO Loss, we will explore the use of probabilistic autoencoders to develop more
sophisticated metrics for filtering out non-learnable data points. This approach may offer more
fine-grained control over the ranking of data points, particularly in signal processing tasks. We
also plan to experiment with modifications to the RHO Loss function, assessing its effectiveness
on datasets with highly diverse distributions through synthetic experiments.
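
As a first concrete instantiation, the sketch below uses a plain autoencoder's reconstruction error as a learnability score. This is our illustration, not a method from the original paper: a probabilistic variant would replace the squared error with a likelihood, and all names (SignalAutoencoder, learnability_mask) are hypothetical.

```python
import torch
import torch.nn as nn

class SignalAutoencoder(nn.Module):
    """Small dense autoencoder; high reconstruction error on a point
    is treated as a proxy for it being atypical or non-learnable."""
    def __init__(self, dim, hidden=64, latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, latent))
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):
        return self.dec(self.enc(x))

def learnability_mask(ae, xb, quantile=0.95):
    """Keep points whose reconstruction error falls below the chosen
    quantile; the remainder are treated as likely noise and dropped."""
    with torch.no_grad():
        err = ((ae(xb) - xb) ** 2).mean(dim=1)
    return err <= torch.quantile(err, quantile)
```

Such a mask could be applied before RHO ranking, so that points flagged as non-learnable never compete for a slot in the training batch.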
