Stars
The repository provides code for running inference with the Meta Segment Anything Audio Model (SAM-Audio), links for downloading the trained model checkpoints, and example notebooks that show how t…
Fun-ASR is a large end-to-end speech recognition model launched by Tongyi Lab.
A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Recognition, Voice Activity Detection, Text Post-processing, etc.
GLM-4.6V/4.5V/4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
An Open Phone Agent Model & Framework. Unlocking the AI Phone for Everyone
Central repository for biomolecular foundation models with shared trainers and pipeline components
robosuite: A Modular Simulation Framework and Benchmark for Robot Learning
Long-form streaming TTS system for multi-speaker dialogue generation
Open-source industrial-grade ASR models supporting Mandarin, Chinese dialects and English, achieving a new SOTA on public Mandarin ASR benchmarks, while also offering outstanding singing lyrics rec…
EverMemOS is an open-source, enterprise-grade intelligent memory system. Our mission is to build AI memory that never forgets, making every conversation build on previous understanding.
Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
OpenOCR: A general OCR system with accuracy and efficiency. It supports 24 Scene Text Recognition methods trained from scratch on large-scale real datasets, and the latest methods will continue to be added.
A lightweight LMM-based Document Parsing Model
docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
A Lightweight Face Recognition and Facial Attribute Analysis (Age, Gender, Emotion and Race) Library for Python
OpenAI CLIP text encoders for multiple languages!
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second.
Core ML tools contain supporting tools for Core ML model conversion, editing, and validation.
Embedding Atlas is a tool that provides interactive visualizations for large embeddings. It allows you to visualize, cross-filter, and search embeddings and metadata.
This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" - CVPR 2025