Humane · Atlanta, Georgia · @ericlewisplease
Stars
[ICLR 2025] LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation
You like pytorch? You like micrograd? You love tinygrad! ❤️
A framework for unified personalized model, achieving mutual enhancement between personalized understanding and generation. Demonstrating the potential of cross-task information transfer in persona…
Research implementation to investigate methods of integrating the speech modality into pre-trained language models
Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities.
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
A flexible and efficient codebase for training visually-conditioned language models (VLMs)
A paper list of recent works on token compression for ViTs and VLMs
Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
Reading list for research topics in multimodal machine learning
PyTorch-native distributed training library for LLMs/VLMs with out-of-the-box Hugging Face support
BLIP-2 implementation for training vision-language models. Q-Former + frozen encoders + any LLM. Colab-ready notebooks with MoE variant.
An API-compatible, drop-in replacement for Apple's Foundation Models framework with support for custom language model providers.
From Chain-of-Thought prompting to OpenAI o1 and DeepSeek-R1 🍓
A Framework of Small-scale Large Multimodal Models
Studying the effect of different connectors (linear, MLP, and cross-attention) to analyze which paradigms LLMs use, or make a best guess; see the connector sketch after this list.
A curated list of vision-and-language pre-training (VLP). :-)
Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence
MMSearch-R1 is an end-to-end RL framework that enables LMMs to perform on-demand, multi-turn search with real-world multimodal search tools.
Turn Apple's CVPR-25 FastVLM encoder into a reproducible baseline for mobile apps. First complete implementation achieving <250ms multimodal inference on iPhone.
The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM", IJCV2025
Fully Open Framework for Democratized Multimodal Reinforcement Learning.
Famous Vision Language Models and Their Architectures
Run frontier LLMs and VLMs with day-0 model support across GPU, NPU, and CPU, with comprehensive runtime coverage for PC (Python/C++), mobile (Android & iOS), and Linux/IoT (Arm64 & x86 Docker). Su…
The Paper List of Large Multi-Modality Model (Perception, Generation, Unification), Parameter-Efficient Finetuning, Vision-Language Pretraining, Conventional Image-Text Matching for Preliminary Ins…
A simple, unified multimodal models training engine. Lean, flexible, and built for hacking at scale.
Open CoreUI - Open WebUI rewritten in Rust, significantly reducing memory and resource usage, requiring no dependency services or Docker, with both a server version and a Tauri-based desktop cli…
Unified LLM orchestration and gateway service for DGX Spark — dynamically manages vLLM, SGLang, and TensorRT-LLM backends under a single OpenAI-compatible API.
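As a companion to the connector-study entry above, here is a minimal, hypothetical PyTorch sketch of the three projector paradigms it names (linear, MLP, and cross-attention / Q-Former-style resampler). It is not taken from any of the listed repositories; all dimensions, class names, and hyperparameters are illustrative assumptions.

# Minimal sketch of three vision-to-LLM connector paradigms.
# All sizes and names are assumptions, not from any listed repo.
import torch
import torch.nn as nn

VIS_DIM, LLM_DIM, NUM_QUERIES = 1024, 4096, 64  # assumed dimensions

class LinearConnector(nn.Module):
    """Single linear projection from vision features to LLM embedding space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(VIS_DIM, LLM_DIM)

    def forward(self, vis_tokens):              # (B, N, VIS_DIM)
        return self.proj(vis_tokens)             # (B, N, LLM_DIM)

class MLPConnector(nn.Module):
    """Two-layer MLP projector (LLaVA-1.5-style)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(VIS_DIM, LLM_DIM), nn.GELU(), nn.Linear(LLM_DIM, LLM_DIM)
        )

    def forward(self, vis_tokens):
        return self.proj(vis_tokens)

class CrossAttentionConnector(nn.Module):
    """Learnable queries attend to vision tokens (Q-Former / resampler style)."""
    def __init__(self):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(NUM_QUERIES, LLM_DIM) * 0.02)
        self.kv_proj = nn.Linear(VIS_DIM, LLM_DIM)
        self.attn = nn.MultiheadAttention(LLM_DIM, num_heads=8, batch_first=True)

    def forward(self, vis_tokens):               # (B, N, VIS_DIM)
        kv = self.kv_proj(vis_tokens)             # (B, N, LLM_DIM)
        q = self.queries.unsqueeze(0).expand(vis_tokens.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)             # (B, NUM_QUERIES, LLM_DIM)
        return out

if __name__ == "__main__":
    feats = torch.randn(2, 256, VIS_DIM)          # fake ViT patch features
    for conn in (LinearConnector(), MLPConnector(), CrossAttentionConnector()):
        print(type(conn).__name__, conn(feats).shape)

The practical trade-off the study explores: linear and MLP connectors keep every visual token (cheap to train, long LLM context), while the cross-attention resampler compresses N patch tokens down to a fixed NUM_QUERIES, trading some detail for shorter sequences.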