awesome-LLM-DS

LLMs for Data Science Papers

Data Acquisition

Title	Venue	Date	Code	Project
Understanding HTML with LLMs	-	-	Github	Project
Web page Extraction using GPT	-	-	Github	Project
Extracting Financial News using Dolphin	-	-	Github	Project
Social Network Data Extraction via LLMs	-	-	Github	Project
Knowledge Extraction from Literature (ChatCite, KnowledgeFlow)	-	-	Github	Project
Profile Information Extraction using LLMs	-	-	Github	Project
POI Data Extraction with LLMs	-	-	Github	Project
Epidemiological Data Extraction with Text Analysis	-	-	Github	Project
Knowledge Graph Extraction (ExtractKG, LLM-KG)	-	-	Github	Project
Multimodal Data Extraction (LLM-MM, WorldGPT, MultimodalExtractor)	-	-	Github	Project
Multilingual Data Processing (LlamaLens, MultimodalLLM, QuantifyingLLMs)	-	-	Github	Project
IoT Data Extraction (LLMind, UnifiedIoT, EfficientIoT, LLM-IoT)	-	-	Github	Project
Medical Data Extraction (APrompt4EM, IterativeLLM, AutoMed-LLM)	-	-	Github	Project
Retrieval-Augmented Generation (Survey on RAG, Business-RAG, Enhanced-RAG)	-	-	Github	Project

Data Annotation

Title	Venue	Date	Code	Project
Human-LLM Collaboration (HumanLLM, AnnotationLLM)	-	-	Github	Project
HowToCaption: Automated Captioning with LLMs	-	-	Github	Project
Zero-Shot Data Annotation (ZeroShot-LLM)	-	-	Github	Project
LLMaAA: LLMs for Automated Annotation Assistance	-	-	Github	Project
GoLLIE: LLM-based Data Annotation Framework	-	-	Github	Project
Self-Correction in Data Annotation (LLM-SelfCorrect)	-	-	Github	Project
Eagle: Enhancing Annotation with LLMs	-	-	Github	Project
CoAnnotating: Collaborative Annotation with LLMs	-	-	Github	Project
PDFChatAnnotator: LLM-driven PDF Data Annotation	-	-	Github	Project

Data Aggregation

Title	Venue	Date	Code	Project
Multimodal LLM for Data Aggregation (MultimodalLLM, AggregatorGPT)	-	-	Github	Project
TableGPT2: LLMs for Table Data Processing	-	-	Github	Project
DataChat: Conversational AI for Data Analysis	-	-	Github	Project
InsightLens: AI-driven Data Aggregation	-	-	Github	Project
DataLab: LLMs for Data Curation	-	-	Github	Project

Data Generation

Title	Venue	Date	Code	Project
TSL: LLM-based Time-Series Data Generation	-	-	Github	Project
LLM-PTM: Pre-trained Model for Data Generation	-	-	Github	Project
LLM-Forest: Multi-agent Learning for Data Synthesis	-	-	Github	Project
IoT-LLM: Data Generation for Internet of Things	-	-	Github	Project
DPDA: Privacy-Preserving Data Generation with LLMs	-	-	Github	Project
UnIMP: Unsupervised Data Imputation using LLMs	-	-	Github	Project

Data Preparation

Feature Engineering

Title	Venue	Date	Code	Project
LLM-Select: Feature Selection with LLMs	-	-	Github	Project
LMPriors: Prior Knowledge for Feature Engineering	-	-	Github	Project
AltFS: Alternative Feature Selection with LLMs	-	-	Github	Project
ICE-SEARCH: Efficient Feature Extraction using LLMs	-	-	Github	Project

Data Cleaning

Title	Venue	Date	Code	Project
VIDS: AI-based Data Cleaning for Visual Datasets	-	-	Github	Project
Cocoon: LLM-powered Data Refinement	-	-	Github	Project
GIDCL: Graph-based Cleaning using LLMs	-	-	Github	Project
Multi-News+: Automated Data Cleaning for News Articles	-	-	Github	Project

Data Analysis

Descriptive Analysis

Title	Venue	Date	Code	Project
ClusterLLM: Large language models as a guide for text clustering	EMNLP	2023	Github	Project
Efficient Few-Shot Fine-Tuning for Opinion Summarization	NAACL	2023	Github	Project
Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization	EMNLP	2021	Github	Project
Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method	ACL	2023	Github	Project
Planning with Learned Entity Prompts for Abstractive Summarization	TACL	2021	Github	Project
Personalized Abstractive Summarization by Tri-agent Generation Pipeline	EACL	2023	Github	Project
Similar Data Points Identification with LLM: A Human-in-the-loop Strategy Using Summarization and Hidden State Insights	ArXiv	2024	Github	Project
TopicGPT: A Prompt-based Topic Modeling Framework	NAACL	2024	Github	Project
Neural Topic Modeling with Large Language Models in the Loop	ArXiv	2024	Github	Project
Addressing Topic Granularity and Hallucination in Large Language Models for Topic Modelling	ArXiv	2024	Github	Project
Designing Heterogeneous LLM Agents for Financial Sentiment Analysis	ACM TMIS	2024	Github	Project
WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge	ACMMM	2024	Github	Project
Unified Multi-modal Pre-training for Few-shot Sentiment Analysis with Prompt-based Learning	ACMMM	2022	Github	Project
A multimodal approach to cross‑lingual sentiment analysis with ensemble of transformer and LLM	Sci. Rep.	2024	Github	Project
Sentiment Analysis through LLM Negotiations	ArXiv	2023	Github	Project

Analytical Reasoning

Title	Venue	Date	Code	Project
PaLM: Scaling Language Modeling with Pathways	JMLR	2023	Github	Project
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding	NAACL	2019	Github	Project
Learning Transferable Visual Models From Natural Language Supervision	ICML	2021	Github	Project
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)	ArXiv	2023	Github	Project
I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification	CVPR	2023	Github	Project
CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations	EMNLP	2021	Github	Project
One for All: Towards Training One Graph Model for All Classification Tasks	ICLR	2024	Github	Project
Language is All a Graph Needs	EACL	2024	Github	Project
GPT4Graph: Can Large Language Models Understand Graph Structured Data? An Empirical Evaluation and Benchmarking	ArXiv	2023	Github	Project
Cross-Modal Learning for Chemistry Property Prediction: Large Language Models Meet Graph Machine Learning	NeurIPS Workshop	2023	Github	Project
Scaling Laws for Discriminative Classification in Large Language Models	Applied AI Letters	2025	Github	Project
Large Language Models for Text Classification: From Zero-Shot Learning to Instruction-Tuning	Open Science Foundation	2024	Github	Project
An Experimental Evaluation of LLM on Image Classification	ADC	2024	Github	Project
Contextual Speech Emotion Recognition with Large Language Models and ASR-Based Transcript	NeurIPS Workshop	2024	Github	Project
Enhancing Speech De-Identification with LLM-Based Data Augmentation	ICAICTA	2024	Github	Project
Rethinking VLMs and LLMs for Image Classification	ArXiv	2024	Github	Project
Music Genre Classification using Large Language Models	ArXiv	2024	Github	Project

Data Analysis

Descriptive Analysis

Title	Venue	Date	Code	Project
Lost in the middle: How language models use long contexts	-	-	Github	Project
Me-LLaMA: Foundation large language models for medical applications	-	-	Github	Project
TopicGPT: A prompt-based topic modeling framework	-	-	Github	Project
FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization	-	-	Github	Project
When2Ask: Learning to schedule human interaction questions with language models	-	-	Github	Project
TL;DR: Mining significant science from academic papers	-	-	Github	Project
Sentiment analysis through LLM negotiations	-	-	Github	Project
Efficient few-shot fine-tuning for opinion summarization	-	-	Github	Project
Vision guided generative pre-trained language models for multimodal abstractive summarization	-	-	Github	Project
Element-aware summarization with large language models: Expert-aligned evaluation and chain-of-thought method	-	-	Github	Project
Planning with learned entity prompts for abstractive summarization	-	-	Github	Project
Personalized abstractive summarization by tri-agent generation pipeline	-	-	Github	Project

Analytical Reasoning

Title	Venue	Date	Code	Project
PaLM: Scaling language modeling with pathways	JMLR	2023	-	-
BERT: Pre-training of deep bidirectional transformers for language understanding	NAACL	2019	-	-
Learning transferable visual models from natural language supervision	ICML	2021	-	-
The dawn of LMMs: Preliminary explorations with GPT-4V(ision)	arXiv	2023	-	-
I2MVFormer: Large language model generated multi-view document supervision for zero-shot image classification	CVPR	2023	-	-
CTAL: Pre-training cross-modal transformer for audio-and-language representations	arXiv	2021	-	-
GPT4Graph: Can large language models understand graph structured data? An empirical evaluation and benchmarking	arXiv	2023	-	-
Cross-modal learning for chemistry property prediction: Large language models meet graph machine learning	arXiv	2024	-	-

Interactive Analysis

Title	Venue	Date	Code	Project
Data Interpreter	-	2024	Github	Project
Reflexion: Language Agents with Verbal Reinforcement Learning	NeurIPS	2023	Github	Project
PlotGen: Multi-agent LLM-based Scientific Data Visualization via Multimodal Feedback	arXiv	2025	Github	Project
MemoCRS: Memory-Enhanced Sequential Conversational Recommender Systems with Large Language Models	CIKM	2024	Github	Project
ExpeL: LLM Agents are Experiential Learners	AAAI	2024	Github	Project
Large Language Models Can Self-Improve	arXiv	2022	Github	Project
G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment	EMNLP	2023	Github	Project
DRDT: Dynamic Reflection with Divergent Thinking for LLM-based Sequential Recommendation	arXiv	2023	Github	Project
AutoKaggle: A Multiagent Framework for Autonomous Data Science Competitions	arXiv	2024	Github	Project
ChartLlama: A Multimodal LLM for Chart Understanding and Generation	arXiv	2023	Github	Project
TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning	arXiv	2024	Github	Project
UniChart: A Universal Vision-Language Pretrained Model for Chart Comprehension and Reasoning	arXiv	2023	Github	Project
GraphOTTER: Evolving LLM-based Graph Reasoning for Complex Table Question Answering	arXiv	2024	Github	Project
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model	ACM MM	2024	Github	Project
CoG-DQA: Chain-of-guiding Learning with Large Language Models for Diagram Question Answering	CVPR	2024	Github	Project
NLGift: Graph LLM for Intelligent Finance Task	arXiv	2024	Github	Project
InsightPilot: An LLM-empowered Automated Data Exploration System	EMNLP	2023	Github	Project
Talk2Data: A Natural Language Interface for Exploratory Visual Analysis via Question Decomposition	ACM TOIS	2024	Github	Project
TiInsight: Comprehensive LLM-based Exploratory Analysis for Time Series	arXiv	2024	Github	Project
GenoAgent: A Baseline Method for LLM-based Exploration of Gene Expression Data	OpenReview	2025	Github	Project
Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow	arXiv	2023	Github	Project
QUIS: Question-guided Insights Generation for Automated Exploratory Data Analysis	arXiv	2024	Github	Project
LIDA: A Tool for Automatic Generation of Grammar-agnostic Visualizations and Infographics using Large Language Models	ACL	2023	Github	Project

Quantitative Analysis

Title	Venue	Date	Code	Project
UniChart: A Universal Vision-Language Pretrained Model for Chart Comprehension and Reasoning	EMNLP	2023	Github	Project
EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding	AAAI	2025	Github	Project
ChartInstruct: Instruction Tuning for Chart Comprehension and Reasoning	ACL	2024	Github	Project
ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training	ACL	2024	Github	Project
Position: What Can Large Language Models Tell Us About Time Series Analysis	arXiv	2024	Github	Project
Lambda: A Large Model Based Data Agent	arXiv	2024	Github	Project
Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding	ICLR	2024	Github	Project
TS-Reasoner: Compositional Time Series Reasoning for End-to-End Task Execution	arXiv	2024	Github	Project
ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding	arXiv	2024	Github	Project
Time-LLM: Time Series Forecasting by Reprogramming Large Language Models	ICLR	2024	Github	Project
LLM4TS: Two-Stage Fine-Tuning for Time-Series Forecasting with Pre-Trained LLMs	TIST	2025	Github	Project
GPT4MTS: Prompt-Based Large Language Model for Multimodal Time-Series Forecasting	AAAI	2024	Github	Project
Strada-LLM: Graph LLM for Traffic Prediction	arXiv	2024	Github	Project
LeMoLE: LLM-Enhanced Mixture of Linear Experts for Time Series Forecasting	arXiv	2024	Github	Project
ST-LLM: Spatial-Temporal Large Language Model for Traffic Prediction	MDM	2024	Github	Project
RealTCD: Temporal Causal Discovery from Interventional Data with Large Language Model	CIKM	2024	Github	Project
MATMCD: From Query Tools to Causal Architects	arXiv	2023	Github	Project
Order-of-Though: Are Large Language Models Capable of Causal Reasoning for Sensing Data Analysis	EdgeFM	2024	Github	Project

Special Domains

Medical

Title	Venue	Date	Code	Project
ClinicalGPT: AI for Clinical Diagnostics	-	-	Github	Project
Mole-BERT: Molecular Biology with Transformers	-	-	Github	Project
Me-LLaMA: Medical Large Language Model Applications	-	-	Github	Project

Finance

Title	Venue	Date	Code	Project
ECC Analyzer: AI for Risk Management	-	-	Github	Project
RiskLabs: Data-driven Risk Management	-	-	Github	Project
TradingAgents: AI-Powered Financial Trading	-	-	Github	Project
