Starred repositories
Vision-OPD is a regional-to-global on-policy self-distillation framework that transfers a model's own privileged crop-conditioned perception to its full-image policy, enabling fine-grained visual u…
[ACL 2026] LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety
We introduce 'Thinking with Video', a new paradigm leveraging video generation for multimodal reasoning. Our VideoThinkBench shows that Sora-2 surpasses GPT5 by 10% on eyeballing puzzles and reache…
Code for CVPR 2026 paper "MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning"
[CVPR 2025] DEIM: DETR with Improved Matching for Fast Convergence
[ICCV 2023] DETRs with Collaborative Hybrid Assignments Training
Andrej Karpathy's cognitive operating system. Not a collection of quotes, but a runnable thinking framework. Made with 女娲.skill
LLM Wiki is a cross-platform desktop application that turns your documents into an organized, interlinked knowledge base — automatically. Instead of traditional RAG (retrieve-and-answer from scratc…
A single CLAUDE.md file to improve Claude Code behavior, derived from Andrej Karpathy's observations on LLM coding pitfalls.
ICM-Assistant: Instruction-tuning Multimodal Large Language Models for Rule-based Explainable Image Content Moderation. AAAI, 2025
Clone of DeepSeek Thinking-with-Visual-Primitives
ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors [EMNLP 2024 Findings]
Streaming Thinking for VideoLLM Streaming Video Understanding
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
An agentic framework for omni-modal question-answer tasks.
Fully autonomous & self-evolving research from idea to paper. Chat an Idea. Get a Paper. 🦞
Keep tabs on your tabs. Turn your "New tabs" page into a mission control, so you can close them easily. Built for people who open too many tabs and never close them.
The agent that grows with you
AI coding assistant skill (Claude Code, Codex, OpenCode, Cursor, Gemini CLI, and more). Turn any folder of code, SQL schemas, R scripts, shell scripts, docs, papers, images, or videos into a querya…
Reverse engineering Gemini's SynthID detection
Video dataset dedicated to portrait-mode video recognition.