SongW-SW/awesome-LLM-DS

awesome-LLM-DS

LLMs for Data Science Papers

Data Acquisition

Title Venue Date Code Project
Understanding HTML with LLMs - - Github Project
Web page Extraction using GPT - - Github Project
Extracting Financial News using Dolphin - - Github Project
Social Network Data Extraction via LLMs - - Github Project
Knowledge Extraction from Literature (ChatCite, KnowledgeFlow) - - Github Project
Profile Information Extraction using LLMs - - Github Project
POI Data Extraction with LLMs - - Github Project
Epidemiological Data Extraction with Text Analysis - - Github Project
Knowledge Graph Extraction (ExtractKG, LLM-KG) - - Github Project
Multimodal Data Extraction (LLM-MM, WorldGPT, MultimodalExtractor) - - Github Project
Multilingual Data Processing (LlamaLens, MultimodalLLM, QuantifyingLLMs) - - Github Project
IoT Data Extraction (LLMind, UnifiedIoT, EfficientIoT, LLM-IoT) - - Github Project
Medical Data Extraction (APrompt4EM, IterativeLLM, AutoMed-LLM) - - Github Project
Retrieval-Augmented Generation (Survey on RAG, Business-RAG, Enhanced-RAG) - - Github Project

Data Annotation

Title Venue Date Code Project
Human-LLM Collaboration (HumanLLM, AnnotationLLM) - - Github Project
HowToCaption: Automated Captioning with LLMs - - Github Project
Zero-Shot Data Annotation (ZeroShot-LLM) - - Github Project
LLMaAA: LLMs for Automated Annotation Assistance - - Github Project
GoLLIE: LLM-based Data Annotation Framework - - Github Project
Self-Correction in Data Annotation (LLM-SelfCorrect) - - Github Project
Eagle: Enhancing Annotation with LLMs - - Github Project
CoAnnotating: Collaborative Annotation with LLMs - - Github Project
PDFChatAnnotator: LLM-driven PDF Data Annotation - - Github Project

Data Aggregation

Title Venue Date Code Project
Multimodal LLM for Data Aggregation (MultimodalLLM, AggregatorGPT) - - Github Project
TableGPT2: LLMs for Table Data Processing - - Github Project
DataChat: Conversational AI for Data Analysis - - Github Project
InsightLens: AI-driven Data Aggregation - - Github Project
DataLab: LLMs for Data Curation - - Github Project

Data Generation

Title Venue Date Code Project
TSL: LLM-based Time-Series Data Generation - - Github Project
LLM-PTM: Pre-trained Model for Data Generation - - Github Project
LLM-Forest: Multi-agent Learning for Data Synthesis - - Github Project
IoT-LLM: Data Generation for Internet of Things - - Github Project
DPDA: Privacy-Preserving Data Generation with LLMs - - Github Project
UnIMP: Unsupervised Data Imputation using LLMs - - Github Project

Data Preparation

Feature Engineering

Title Venue Date Code Project
LLM-Select: Feature Selection with LLMs - - Github Project
LmPriors: Prior Knowledge for Feature Engineering - - Github Project
AltFS: Alternative Feature Selection with LLMs - - Github Project
ICE-SEARCH: Efficient Feature Extraction using LLMs - - Github Project

Data Cleaning

Title Venue Date Code Project
VIDS: AI-based Data Cleaning for Visual Datasets - - Github Project
Cocoon: LLM-powered Data Refinement - - Github Project
GIDCL: Graph-based Cleaning using LLMs - - Github Project
Multi-News+: Automated Data Cleaning for News Articles - - Github Project

Data Analysis

Descriptive Analysis

Title Venue Date Code Project
ClusterLLM: Large language models as a guide for text clustering EMNLP 2023 Github Project
Efficient Few-Shot Fine-Tuning for Opinion Summarization NAACL 2023 Github Project
Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization EMNLP 2021 Github Project
Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method ACL 2023 Github Project
Planning with Learned Entity Prompts for Abstractive Summarization TACL 2021 Github Project
Personalized Abstractive Summarization by Tri-agent Generation Pipeline EACL 2023 Github Project
Similar Data Points Identification with LLM: A Human-in-the-loop Strategy Using Summarization and Hidden State Insights arXiv 2024 Github Project
TopicGPT: A Prompt-based Topic Modeling Framework NAACL 2024 Github Project
Neural Topic Modeling with Large Language Models in the Loop arXiv 2024 Github Project
Addressing Topic Granularity and Hallucination in Large Language Models for Topic Modelling arXiv 2024 Github Project
Designing Heterogeneous LLM Agents for Financial Sentiment Analysis ACM TMIS 2024 Github Project
WisdoM: Improving Multimodal Sentiment Analysis by Fusing Contextual World Knowledge ACM MM 2024 Github Project
Unified Multi-modal Pre-training for Few-shot Sentiment Analysis with Prompt-based Learning ACM MM 2022 Github Project
A multimodal approach to cross-lingual sentiment analysis with ensemble of transformer and LLM Sci. Rep. 2024 Github Project
Sentiment Analysis through LLM Negotiations arXiv 2023 Github Project

Analytical Reasoning

Title Venue Date Code Project
PaLM: Scaling Language Modeling with Pathways JMLR 2023 Github Project
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding NAACL 2019 Github Project
Learning Transferable Visual Models From Natural Language Supervision ICML 2021 Github Project
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) arXiv 2023 Github Project
I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification CVPR 2023 Github Project
CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations EMNLP 2021 Github Project
One for All: Towards Training One Graph Model for All Classification Tasks ICLR 2024 Github Project
Language is All a Graph Needs EACL 2024 Github Project
GPT4Graph: Can Large Language Models Understand Graph Structured Data? An Empirical Evaluation and Benchmarking arXiv 2023 Github Project
Cross-Modal Learning for Chemistry Property Prediction: Large Language Models Meet Graph Machine Learning NeurIPS Workshop 2023 Github Project
Scaling Laws for Discriminative Classification in Large Language Models Applied AI Letters 2025 Github Project
Large Language Models for Text Classification: From Zero-Shot Learning to Instruction-Tuning Open Science Foundation 2024 Github Project
An Experimental Evaluation of LLM on Image Classification ADC 2024 Github Project
Contextual Speech Emotion Recognition with Large Language Models and ASR-Based Transcript NeurIPS Workshop 2024 Github Project
Enhancing Speech De-Identification with LLM-Based Data Augmentation ICAICTA 2024 Github Project
Rethinking VLMs and LLMs for Image Classification arXiv 2024 Github Project
Music Genre Classification using Large Language Models arXiv 2024 Github Project

Data Analysis

Descriptive Analysis

Title Venue Date Code Project
Lost in the middle: How language models use long contexts - - Github Project
Me-LLaMA: Foundation large language models for medical applications - - Github Project
TopicGPT: A prompt-based topic modeling framework - - Github Project
FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization - - Github Project
When2Ask: Learning to schedule human interaction questions with language models - - Github Project
TL;DR: Mining significant science from academic papers - - Github Project
Sentiment analysis through LLM negotiations - - Github Project
Efficient few-shot fine-tuning for opinion summarization - - Github Project
Vision guided generative pre-trained language models for multimodal abstractive summarization - - Github Project
Element-aware summarization with large language models: Expert-aligned evaluation and chain-of-thought method - - Github Project
Planning with learned entity prompts for abstractive summarization - - Github Project
Personalized abstractive summarization by tri-agent generation pipeline - - Github Project

Analytical Reasoning

Title Venue Date Code Project
PaLM: Scaling language modeling with pathways JMLR 2023 - -
BERT: Pre-training of deep bidirectional transformers for language understanding NAACL 2019 - -
Learning transferable visual models from natural language supervision ICML 2021 - -
The dawn of LMMs: Preliminary explorations with GPT-4V(ision) arXiv 2023 - -
I2MVFormer: Large language model generated multi-view document supervision for zero-shot image classification CVPR 2023 - -
CTAL: Pre-training cross-modal transformer for audio-and-language representations arXiv 2021 - -
GPT4Graph: Can large language models understand graph structured data? An empirical evaluation and benchmarking arXiv 2023 - -
Cross-modal learning for chemistry property prediction: Large language models meet graph machine learning arXiv 2024 - -

Interactive Analysis

Title Venue Date Code Project
Data Interpreter - 2024 Github Project
Reflexion: Language Agents with Verbal Reinforcement Learning NeurIPS 2024 Github Project
PlotGen: Multi-agent LLM-based Scientific Data Visualization via Multimodal Feedback arXiv 2025 Github Project
MemoCRS: Memory-Enhanced Sequential Conversational Recommender Systems with Large Language Models CIKM 2024 Github Project
ExpeL: LLM Agents are Experiential Learners AAAI 2024 Github Project
Large Language Models Can Self-Improve arXiv 2022 Github Project
G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment EMNLP 2023 Github Project
DRDT: Dynamic Reflection with Divergent Thinking for LLM-based Sequential Recommendation arXiv 2023 Github Project
AutoKaggle: A Multiagent Framework for Autonomous Data Science Competitions arXiv 2024 Github Project
ChartLlama: A Multimodal LLM for Chart Understanding and Generation arXiv 2023 Github Project
TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning arXiv 2024 Github Project
UniChart: A Universal Vision-Language Pretrained Model for Chart Comprehension and Reasoning arXiv 2023 Github Project
GraphOTTER: Evolving LLM-based Graph Reasoning for Complex Table Question Answering arXiv 2024 Github Project
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model ACM MM 2024 Github Project
CoG-DQA: Chain-of-guiding Learning with Large Language Models for Diagram Question Answering CVPR 2024 Github Project
NLGift: Graph LLM for Intelligent Finance Task arXiv 2024 Github Project
InsightPilot: An LLM-empowered Automated Data Exploration System EMNLP 2023 Github Project
Talk2Data: A Natural Language Interface for Exploratory Visual Analysis via Question Decomposition ACM TOIS 2024 Github Project
TiInsight: Comprehensive LLM-based Exploratory Analysis for Time Series arXiv 2024 Github Project
GenoAgent: A Baseline Method for LLM-based Exploration of Gene Expression Data OpenReview 2025 Github Project
Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow arXiv 2023 Github Project
QUIS: Question-guided Insights Generation for Automated Exploratory Data Analysis arXiv 2024 Github Project
LIDA: A Tool for Automatic Generation of Grammar-agnostic Visualizations and Infographics using Large Language Models ACL 2023 Github Project

Quantitative Analysis

Title Venue Date Code Project
UniChart: A Universal Vision-Language Pretrained Model for Chart Comprehension and Reasoning EMNLP 2024 Github Project
EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding AAAI 2025 Github Project
ChartInstruct: Instruction Tuning for Chart Comprehension and Reasoning ACL 2024 Github Project
ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training ACL 2024 Github Project
Position: What Can Large Language Models Tell Us About Time Series Analysis arXiv 2024 Github Project
LAMBDA: A Large Model Based Data Agent arXiv 2024 Github Project
Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding ICLR 2024 Github Project
TS-Reasoner: Compositional Time Series Reasoning for End-to-End Task Execution arXiv 2024 Github Project
ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding arXiv 2024 Github Project
Time-LLM: Time Series Forecasting by Reprogramming Large Language Models ICLR 2024 Github Project
LLM4TS: Two-Stage Fine-Tuning for Time-Series Forecasting with Pre-Trained LLMs TIST 2025 Github Project
GPT4MTS: Prompt-Based Large Language Model for Multimodal Time-Series Forecasting AAAI 2024 Github Project
Strada-LLM: Graph LLM for Traffic Prediction arXiv 2024 Github Project
LeMoLE: LLM-Enhanced Mixture of Linear Experts for Time Series Forecasting arXiv 2024 Github Project
ST-LLM: Spatial-Temporal Large Language Model for Traffic Prediction MDM 2024 Github Project
RealTCD: Temporal Causal Discovery from Interventional Data with Large Language Model CIKM 2024 Github Project
MATMCD: From Query Tools to Causal Architects arXiv 2023 Github Project
Order-of-Thought: Are Large Language Models Capable of Causal Reasoning for Sensing Data Analysis EdgeFM 2024 Github Project

Special Domains

Medical

Title Venue Date Code Project
ClinicalGPT: AI for Clinical Diagnostics - - Github Project
Mole-BERT: Molecular Biology with Transformers - - Github Project
Me-LLaMA: Medical Large Language Model Applications - - Github Project

Finance

Title Venue Date Code Project
ECC Analyzer: AI for Risk Management - - Github Project
RiskLabs: Data-driven Risk Management - - Github Project
TradingAgents: AI-Powered Financial Trading - - Github Project
