Stars
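WanJuan3.0 (“万卷·丝路”, "WanJuan Silk Road") is a comprehensive plain-text corpus that collects publicly available web content, literature, patents, and other materials from multiple countries and regions. The total data size exceeds 1.2 TB and the total token count exceeds 300B, placing it at an internationally leading level. The first open-source release consists mainly of five subsets (Thai, Russian, Arabic, Korean, and Vietnamese), each exceeding 150 GB.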
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
A Next-Generation Training Engine Built for Ultra-Large MoE Models
Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).
🔍 An LLM-based Multi-agent Framework of Web Search Engine (like Perplexity.ai Pro and SearchGPT)
Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.
A Collection of Foundation Driving Models by OpenDriveLab
WanJuan-CC is a high-quality dataset built on CommonCrawl through data extraction, rule-based cleaning, deduplication, safety filtering, and quality cleaning.
New ways of breaking app-integrated LLMs
MLLM-DataEngine: An Iterative Refinement Approach for MLLM
AAAI 2024: Visual Instruction Generation and Correction
Automated dense category annotation engine that serves as the initial semantic labeling for the Segment Anything dataset (SA-1B).
An open-source codebase for exploring autonomous driving pre-training
OpenPCSeg: Open Source Point Cloud Segmentation Toolbox and Benchmark
A data annotation toolbox that supports image, audio, and video data.
A data annotation component library, provided as NPM packages
Data Set Description Language Specification (DSDL, a next-generation description language for AI datasets)
SDK of OpenDataLab - https://opendatalab.org.cn