Institute of Computing Technology, CAS
Stars
Awesome Unified Multimodal Models
Automatically crawls arXiv papers daily, summarizes them using AI, and presents them via GitHub Pages.
Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.
[ICCV 2025] Code & Data for: SuperEdit - Rectifying and Facilitating Supervision for Instruction-Based Image Editing
[NeurIPS 2025] Image editing is worth a single LoRA! 0.1% training data for fantastic image editing! Surpasses GPT-4o in ID persistence! MoE checkpoint released! Only 4GB of VRAM is needed to run it!
VSCode extension that grammar-checks text through a local LLM
Official implementation of ICML 2024 Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation
[ICLR 2025] Diffusion Feedback Helps CLIP See Better
[NeurIPS'25] A work to improve CLIP's visual detail capturing ability by inverting the unCLIP generative model.
The official implementation of CVPR Workshop 2025 paper: Window Token Concatenation for Efficient Visual Large Language Models.
(ICCV 2025) Enhance the fine-grained visual representations of CLIP and MLLMs with generative models.
This repository collects papers on VLLM applications. New papers are added irregularly.
Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20h on one machine.
Diffusion Classifier leverages pretrained diffusion models to perform zero-shot classification without additional training
[ICLR'25 Oral] No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images
LaTeX Thesis Template for the University of Chinese Academy of Sciences
Official implementation of NeurIPS'24 paper Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features
Utilities intended for use with Llama models.
Collection of common code that's shared among different research projects in FAIR computer vision team.
High-resolution models for human tasks.
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
This repo contains the code for a 1D tokenizer and generator.
SEED-Voken: A Series of Powerful Visual Tokenizers
Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ
DiffSeg is an unsupervised zero-shot segmentation method using attention information from a stable-diffusion model. This repo implements the main DiffSeg algorithm and additionally includes an expe…