Stars
Multilingual large voice generation model with full-stack support for inference, training, and deployment.
[NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
The official Python library for the OpenAI API
Course to get into Large Language Models (LLMs) with roadmaps and Colab notebooks.
[arXiv 2023] Set-of-Mark Prompting for GPT-4V and LMMs
Welcome to the Llama Cookbook! This is your go-to guide for building with Llama: getting started with inference, fine-tuning, and RAG. We also show you how to solve end-to-end problems using Llama mode…
🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
Code for Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
[NeurIPS 2023] Official implementation of the paper "Segment Everything Everywhere All at Once"
Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything
Instruction Tuning with GPT-4
The ChatGPT Retrieval Plugin lets you easily find personal or work documents by asking questions in natural language.
A playbook for systematically maximizing the performance of deep learning models.
[CVPR 2023] Official Implementation of X-Decoder for generalized decoding for pixel, image and language
Implementation of Dreambooth (https://arxiv.org/abs/2208.12242) with Stable Diffusion
A compilation of network architectures for vision and other domains that do not use self-attention mechanisms
Official implementation of AAAI 2023 paper "Parameter-efficient Model Adaptation for Vision Transformers"
[NeurIPS 2022] code for "K-LITE: Learning Transferable Visual Models with External Knowledge" https://arxiv.org/abs/2204.09222
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
A collection of papers on the topic of "Computer Vision in the Wild (CVinW)"
Toolkit for Elevater Benchmark
This is an official PyTorch/GPU implementation of SupMAE.
[CVPR 2022] Official code for "RegionCLIP: Region-based Language-Image Pretraining"