Benchmark for fine-grained machine-generated text detection: 6.5k texts written by humans, generated by ten open-source instruction-finetuned LLMs, and edited by expert annotators.
Official evaluation code for the U-MATH and μ-MATH benchmarks. These datasets are designed to test the mathematical reasoning and meta-evaluation capabilities of LLMs on university-level problems.