- Shanghai Jiao Tong University
- Shanghai
- www.wzk.plus
- https://scholar.google.com/citations?user=W0zVf-oAAAAJ
Starred repositories
The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
CLIP (Contrastive Language-Image Pretraining): predicts the most relevant text snippet given an image
Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable…
A Chinese-language reinforcement learning tutorial (the "Mushroom Book" 🍄); read online at: https://datawhalechina.github.io/easy-rl/
LAVIS - A One-stop Library for Language-Vision Intelligence
[NeurIPS 2024 Best Paper Award][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". A…
Taming Transformers for High-Resolution Image Synthesis
COCO API - Dataset @ http://cocodataset.org/
Acceptance rates for the major AI conferences
[ICLR 2023] ReAct: Synergizing Reasoning and Acting in Language Models
Official Repository for "Eureka: Human-Level Reward Design via Coding Large Language Models" (ICLR 2024)
Official repository of 'Visual-RFT: Visual Reinforcement Fine-Tuning' & 'Visual-ARFT: Visual Agentic Reinforcement Fine-Tuning'
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
ICCV 2021, Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet
Utility functions for handling MIDI data in a nice/intuitive way.
This is the official repository for M2UGen
Data and code for FreshLLMs (https://arxiv.org/abs/2310.03214)
Evaluating text-to-image/video/3D models with VQAScore
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
[TMLR] Public code repo for paper "A Single Transformer for Scalable Vision-Language Modeling"
Generate chatbots from a corpus
Train InternViT-6B in MMSegmentation and MMDetection with DeepSpeed
[IGARSS 2025 Oral] A Simple Aerial Detection Baseline of Multimodal Language Models.
Improving Contrastive Learning by Visualizing Feature Transformation, ICCV 2021 Oral
This is an official PyTorch/GPU implementation of SupMAE.
The first unified, efficient, and extensible evaluation toolkit for evaluating image generation and editing models across multiple benchmarks.