- FDU & SAIS
- Shanghai
- kobeshegu.github.io
- @kobeshegu
- https://www.zhihu.com/people/ke-ke-ke-ke-ke-da-xia
Stars
ARM: An AutoRegressive Large Multimodal Model with Discrete Representations
UniRL is a Framework for Unified Multimodal Model Reinforcement Learning
Ideogram 4: Open image model at the forefront of design
Bernini is a unified framework for video generation and editing that combines an MLLM-based semantic planner with a DiT-based renderer.
Are attention sinks necessary in diffusion transformers? Code for dynamic sink detection and causal suppression experiments in SD3/SDXL.
Our inference and training framework to run on the Cosmos Models
ERNIE-Image is an open text-to-image generation model developed by the ERNIE-Image team at Baidu. Built on a single-stream Diffusion Transformer (DiT) with only 8B DiT parameters, it reaches…
PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion
Lens is a 3.8B-parameter text-to-image diffusion model that achieves quality competitive with, and in several cases surpassing, models like FLUX and SD3, while requiring significantly less training c…
[CVPR 2026] Official repository for "Reviving ConvNeXt for Efficient Convolutional Diffusion Models"
HY-World 1.5: A Systematic Framework for Interactive World Modeling with Real-Time Latency and Geometric Consistency
A 3B-active-parameter native unified multimodal model for image and video understanding, generation, and editing.
A Minimal and Elegant Framework & Tutorial for Real-Time Interactive World Models
Official Implementation for RAEv2: Improved Baselines with Representation Autoencoders
awesome-native-multimodal-models
The agent harness performance optimization system. Skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, Cursor and beyond.
An agentic skills framework & software development methodology that works.
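Prompt as Code | GPT-Image2 industrial-grade prompt engine and template library: 470+ reverse-engineered cases, 20+ industrial-grade templates, distilled into Skills; continuously updated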
GPT-Image-2 PPT Generator Skill for Creating Image-Based PowerPoint Presentations in Codex and Other Skill-Compatible Agents
RL training framework for diffusion and omni-modality models
Official implementation of Tuna-2: Pixel Embeddings Beat Vision Encoders for Unified Understanding and Generation
Official Codebase for "DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos" (ICML 2026)
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
This repository is a collection of world model papers