CS772 Project Proposal
Efficient Training on Large-Scale Datasets
Rishit Bhutra (210857), Sameer Ahmad (210912), Shubham Jangid (211022), Bhavesh Shukla (210266)
15th February 2025
Introduction
The use of web-scraped data in machine learning has significantly advanced the field, enabling
models to be trained on large, diverse datasets. However, these vast amounts of data also come with
challenges, particularly the extended time required for training. Much of this time is wasted learning
redundant or non-learnable data points. This project aims to address these inefficiencies by
developing methods to selectively choose data points that contribute most to training, thereby
reducing overall training time while maintaining, or even improving, model accuracy.
Related Work
Our work is motivated by the paper Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt (Mindermann et al., 2022). The authors propose techniques to rank and prioritize data points during training, showing that intelligently selected samples can significantly reduce training time without sacrificing performance.
Two prevalent approaches to sample prioritization are:
• Filtering out noisy data points, which are often not worth learning.
• Focusing on high-loss examples, which are deemed more informative (see the sketch after this list).
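To make the second heuristic concrete, here is a minimal sketch of loss-based selection, assuming a PyTorch classifier trained with cross-entropy; the function name and the keep_fraction parameter are illustrative choices of ours, not part of any cited implementation.

```python
import torch
import torch.nn.functional as F

def select_high_loss(model, inputs, targets, keep_fraction=0.1):
    # Score each candidate example by its individual cross-entropy loss
    # under the current model, without tracking gradients.
    with torch.no_grad():
        per_example_loss = F.cross_entropy(model(inputs), targets, reduction="none")
    # Keep only the highest-loss fraction of the candidate batch for the
    # actual gradient update.
    k = max(1, int(keep_fraction * inputs.size(0)))
    top_idx = per_example_loss.topk(k).indices
    return inputs[top_idx], targets[top_idx]
```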
However, both methods come with limitations, particularly in handling varying data distributions.
To address these limitations, the paper introduces a novel metric, the Reducible Holdout Loss (RHO loss), which identifies and ranks the most valuable training points. In experiments on the large-scale Clothing-1M dataset, selection by RHO loss led to an 18x reduction in training steps while improving accuracy by 2%.
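For intuition, the selection step based on RHO loss could look roughly like the sketch below, assuming a PyTorch setup: model is the network being trained, irreducible_model stands for a hypothetical auxiliary model trained only on holdout data, and keep_fraction is an illustrative hyperparameter. The original paper's implementation differs in details (for instance, irreducible losses can be precomputed once per point), so this is a sketch rather than a faithful reproduction.

```python
import torch
import torch.nn.functional as F

def select_by_rho_loss(model, irreducible_model, inputs, targets, keep_fraction=0.1):
    with torch.no_grad():
        # Training loss of the current model on each candidate point.
        training_loss = F.cross_entropy(model(inputs), targets, reduction="none")
        # "Irreducible" loss from an auxiliary model trained only on a
        # holdout set (assumed given here; precomputable in practice).
        irreducible_loss = F.cross_entropy(irreducible_model(inputs), targets, reduction="none")
    # Reducible holdout loss: high values flag points that are learnable,
    # worth learning, and not yet learnt.
    rho = training_loss - irreducible_loss
    k = max(1, int(keep_fraction * inputs.size(0)))
    top_idx = rho.topk(k).indices
    return inputs[top_idx], targets[top_idx]
```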
Proposed Contribution
Our project extends the ideas from the original paper and explores their application beyond computer vision tasks, aiming to make contributions in two key areas:
1. Adapting to New Domains
The original paper focuses on computer vision, where noisy data points are comparatively easy to identify and remove. We propose extending the approach to more complex domains, specifically Natural
Language Processing (NLP) and Signal Processing. This adaptation will involve testing on
standard datasets from these fields to evaluate the method’s performance in different contexts. Given
the computational cost, efficient batching and load balancing strategies will be integral to
the adaptation.
2. Exploring Alternative Metrics and Loss Functions
In addition to the RHO loss, we will explore the use of probabilistic autoencoders to develop more sophisticated metrics for filtering out non-learnable data points. This approach may offer finer-grained control over the ranking of data points, particularly in signal processing tasks. We also plan to experiment with modifications to the RHO loss function, assessing its effectiveness on datasets with highly diverse distributions through synthetic experiments.
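As a first illustration of the kind of metric we have in mind, the sketch below scores candidate points by the reconstruction error of a small autoencoder over fixed-length signal windows; a probabilistic variant (e.g. a variational autoencoder) would replace the squared error with a likelihood-based score, but the deterministic version keeps the sketch short. All class and function names here are hypothetical.

```python
import torch
import torch.nn as nn

class SmallAutoencoder(nn.Module):
    # A deliberately small autoencoder for fixed-length signal windows;
    # its reconstruction error is used below as a rough learnability score.
    def __init__(self, dim, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_scores(autoencoder, batch):
    # Per-example mean squared reconstruction error; candidates with
    # extreme scores could be flagged as non-learnable (e.g. corrupted
    # signals) and filtered out before ranking the rest by RHO loss.
    with torch.no_grad():
        recon = autoencoder(batch)
    return ((recon - batch) ** 2).mean(dim=1)
```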