44 results for source starred repositories

Python tool for converting files and office documents to Markdown.

Python 82,608 4,661 Updated Oct 20, 2025

RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to create a superior context layer for LLMs

TypeScript 67,198 7,160 Updated Nov 6, 2025

Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.

Python 48,192 3,986 Updated Nov 6, 2025

🆓 List of free ChatGPT mirror sites, continuously updated.

Python 20,550 1,396 Updated Jun 23, 2025

Toolkit for linearizing PDFs for LLM datasets/training

Python 15,815 1,196 Updated Nov 4, 2025

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website …

HTML 13,127 1,076 Updated Nov 5, 2025

Go ahead and axolotl questions

Python 10,739 1,183 Updated Nov 6, 2025

A Comprehensive Toolkit for High-Quality PDF Content Extraction

Python 8,883 670 Updated Jan 3, 2025

⏰ Collaboratively track worldwide conference deadlines (website, Python CLI, WeChat applet). If you find it useful, please star this project, thanks~

Rust 8,120 545 Updated Nov 5, 2025

A lightweight LMM-based Document Parsing Model

Python 6,159 428 Updated Oct 25, 2025

Multilingual Document Layout Parsing in a Single Vision-Language Model

Python 5,593 562 Updated Oct 31, 2025

Compute FID scores with PyTorch.

Python 3,772 524 Updated Jul 3, 2024

MS-Agent: Lightweight Framework for Empowering Agents with Autonomous Exploration in Complex Task Scenarios

Python 3,553 406 Updated Nov 3, 2025

Using GPT to parse PDF

Python 3,545 266 Updated Apr 17, 2025

🦛 CHONK docs with Chonkie ✨ — The no-nonsense RAG library

Python 3,159 198 Updated Nov 5, 2025

A simple tool to update bib entries with their official information (e.g., DBLP or the ACL anthology).

Python 2,941 163 Updated Jul 9, 2025

The hub for EleutherAI's work on interpretability and learning dynamics

Jupyter Notebook 2,660 195 Updated Jun 9, 2025

Python 1,839 61 Updated Jun 28, 2024

UltraRAG 2.0: Less Code, Lower Barrier, Faster Deployment! MCP-based low-code RAG framework, enabling researchers to build complex pipelines to creative innovation.

Python 1,789 153 Updated Nov 6, 2025

Jupyter Notebook 1,201 548 Updated May 13, 2024

[CVPR 2025] A Comprehensive Benchmark for Document Parsing and Evaluation

Python 1,139 107 Updated Nov 6, 2025

Python 965 353 Updated Sep 25, 2023

TexTeller converts images to LaTeX formulas (image2latex, LaTeX OCR) with higher accuracy and superior generalization ability, enabling it to cover most usage scenarios.

Python 628 67 Updated Aug 22, 2025

UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition

Python 423 33 Updated Sep 28, 2025

[ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Python 403 14 Updated Apr 25, 2025

DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis

395 23 Updated Feb 1, 2023

Awesome Deep Research list! For more details, please refer to our survey paper -- A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications

352 25 Updated Oct 22, 2025

[ACL 2024] This is the code repo for our ACL’24 paper "Cleaner Pretraining Corpus Curation with Neural Web Scraping".

Python 230 19 Updated Aug 28, 2024

Northeastern University campus network gateway client

Go 176 37 Updated Oct 7, 2024

Official code for paper "UniIR: Training and Benchmarking Universal Multimodal Information Retrievers" (ECCV 2024)

Python 167 17 Updated Oct 1, 2024