User profiles for Zhoufutu Wen
Zhoufutu Wen, ByteDance SEED. Verified email at bytedance.com. Cited by 607
Enhancing dynamic image advertising with vision-language pre-training
In the multimedia era, image becomes an effective medium in search advertising. Dynamic
Image Advertising (DIA), a system that matches queries with appropriate ad images and …
KOR-Bench: Benchmarking language models on knowledge-orthogonal reasoning tasks
In this paper, we introduce Knowledge-Orthogonal Reasoning (KOR), a concept aimed at
minimizing reliance on domain-specific knowledge, enabling more accurate evaluation of …
SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines
Large language models (LLMs) have demonstrated remarkable proficiency in mainstream
academic disciplines such as mathematics, physics, and computer science. However, human …
SimpleVQA: Multimodal factuality evaluation for multimodal large language models
The increasing application of multi-modal large language models (MLLMs) across various
sectors has spotlighted the essence of their output reliability and accuracy, particularly their …
First return, entropy-eliciting explore
Reinforcement Learning from Verifiable Rewards (RLVR) improves the reasoning abilities
of Large Language Models (LLMs) but it struggles with unstable exploration. We propose …
TreePO: Bridging the gap of policy optimization and efficacy and inference efficiency with heuristic tree-based modeling
Recent advancements in aligning large language models via reinforcement learning have
achieved remarkable gains in solving complex reasoning problems, but at the cost of …
FutureX: An advanced live benchmark for LLM agents in future prediction
Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking,
information gathering, contextual understanding, and decision-making under uncertainty. …
KORGym: A dynamic game platform for LLM reasoning evaluation
Recent advancements in large language models (LLMs) underscore the need for more
comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing …
FinSearchComp: Towards a realistic, expert-level evaluation of financial search and reasoning
Search has emerged as core infrastructure for LLM-based agents and is widely viewed as
critical on the path toward more general intelligence. Finance is a particularly demanding …
IV-Bench: A benchmark for image-grounded video perception and reasoning in multimodal LLMs
Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily
focus on image reasoning or general video understanding tasks, largely overlooking the …