Stars
A dashboard for testing API availability across AI platforms and models.
A lightweight Code Agent orchestration hub with multi-repo management, task queuing, Git worktree automation, and real-time Web UI interaction.
An all-in-one enhancement suite for Google Gemini & AI Studio - timeline navigation, folder management, prompt library, and chat export in one powerful extension. / All-in-one enhancement plugin for Google Gemini & AI Studio…
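Official implementation of the paper "Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models"
An assistant for Tsinghua University's "Hetang Rain Classroom" (荷塘雨课堂), with features such as automatic sign-in and question answering.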
A fast, simple & powerful blog framework, powered by Node.js.
MiroTrain is an efficient, algorithm-first framework for research agents.
MiroRL is an MCP-first reinforcement learning framework for deep research agents.
MiroMind-M1 is a fully open-source series of reasoning language models built on Qwen-2.5, focused on advancing mathematical reasoning.
MiroThinker is a deep research agent optimized for complex research and prediction tasks. Our latest model, MiroThinker-1.7, achieves 74.0 and 75.3 on BrowseComp and BrowseComp Zh, respectively.
🏆 Top-1 on 5+ benchmarks | Web UI | Supports MiroThinker, Claude, Kimi, OpenAI
A demo project for beginners learning web development
Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent in a hierarchical manner across multiple platforms, includi…
Azur Lane bot and script (CN/EN/JP/TW) | Seamless commissions and research, fully automated Operation Siren (大世界)
[ICCV2025] V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
Class project for Media and Cognition 2023 Fall
📔 notes for Multi-hop Reading Comprehension and open-domain question answering
Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)
BABILong is a benchmark for LLM evaluation using the needle-in-a-haystack approach.
[ACL'24 Outstanding] Data and code for L-Eval, a comprehensive long context language models evaluation benchmark
ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models
LOFT: A 1 Million+ Token Long-Context Benchmark
This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?
Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718
Doing simple retrieval from LLMs at various context lengths to measure accuracy
Tests for long context window evaluation
[EMNLP 2024 (Oral)] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA