Stars
Python tool for converting files and office documents to Markdown.
RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to create a superior context layer for LLMs
Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.
🆓 List of free ChatGPT mirror sites, continuously updated.
Toolkit for linearizing PDFs for LLM datasets/training
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website …
A Comprehensive Toolkit for High-Quality PDF Content Extraction
⏰ Collaboratively track worldwide conference deadlines (website, Python CLI, WeChat applet) / If you find it useful, please star this project, thanks~
PyTorch implementation of MAE https://arxiv.org/abs/2111.06377
A lightweight LMM-based Document Parsing Model
Multilingual Document Layout Parsing in a Single Vision-Language Model
MS-Agent: Lightweight Framework for Empowering Agents with Autonomous Exploration in Complex Task Scenarios
🦛 CHONK docs with Chonkie ✨ — The no-nonsense RAG library
A simple tool to update bib entries with their official information (e.g., DBLP or the ACL anthology).
The hub for EleutherAI's work on interpretability and learning dynamics
UltraRAG 2.0: Less Code, Lower Barrier, Faster Deployment! MCP-based low-code RAG framework that lets researchers build complex pipelines and focus on creative innovation.
[CVPR 2025] A Comprehensive Benchmark for Document Parsing and Evaluation
TexTeller converts images to LaTeX formulas (image2latex, LaTeX OCR) with higher accuracy and superior generalization ability, enabling it to cover most usage scenarios.
UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition
[ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
Awesome Deep Research list! For more details, please refer to our survey paper -- A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications
[ACL 2024] This is the code repo for our ACL’24 paper "Cleaner Pretraining Corpus Curation with Neural Web Scraping".