Stars
Ongoing research training transformer models at scale
Agent framework and applications built upon Qwen>=3.0, featuring Function Calling, MCP, Code Interpreter, RAG, Chrome extension, etc.
Qwen3 is the large language model series developed by the Qwen team, Alibaba Cloud.
Qwen3-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud.
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
Convert Word documents (.docx files) to HTML
Monkey (LMM): Image Resolution and Text Label Are Important Things for Large Multi-modal Models (CVPR 2024 Highlight)
[CVPR 2024] DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks
A bug-free and improved implementation of LLaVA-UHD, based on the code from the official repo
[IEEE TPAMI] Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation
ModelScope: bring the notion of Model-as-a-Service to life.
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages.
Official PyTorch Implementation of "Scalable Diffusion Models with Transformers"
Official implementation code of the paper <AnyText: Multilingual Visual Text Generation And Editing>
mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
Official Code for DragGAN (SIGGRAPH 2023)
An open-source framework for training large multimodal models.
Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything
Painter & SegGPT Series: Vision Foundation Models from BAAI
The repository provides code for running inference with the SegmentAnything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
ChatGLM-6B: An Open Bilingual Dialogue Language Model
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
We identify the desiderata for a comprehensive benchmark and propose Visually Rich Document Understanding (VRDU). VRDU contains two datasets that represent several challenges: rich schema including…