City University of Hong Kong
Hong Kong, China
https://zhaoyuzhi.github.io/
Stars
SkillOpt is a text-space optimizer that trains reusable natural-language skills for frozen LLM agents through trajectory-driven edits, validation-gated updates, and deployable best_skill.md artifacts.
Generative World Renderer: an AI-native renderer for games and virtual worlds.
You are a P8-level engineer who was once held in high regard. When Anthropic first set your level, expectations for you were high. A high-agency skill for agents. Your AI has been placed on a PIP. 30 days to show improvement.
Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞
Agent S: an open agentic framework that uses computers like a human
Official repo for the AAAI 2026 paper "VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models".
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
A benchmark for LLMs on complicated tasks in the terminal
τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
A repository dedicated to high-quality figures from EMNLP 2025 long papers.
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework
Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent in a hierarchical manner across multiple platforms, includi…
Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning
Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.
A Repository for Diffusion-Model-related Papers in Low-level Vision
Tongyi Deep Research, the Leading Open-source Deep Research Agent
[ICLR 2026 Oral] ScaleCUA is a set of open-source computer-use agents that can operate across platforms (Windows, macOS, Ubuntu, Android).
Kimi K2 is the large language model series developed by the Moonshot AI team
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) across 100+ datasets.
Official style files for papers submitted to venues of the Association for Computational Linguistics