mukhal

Muhammad Khalifa mukhal

PhD student @umich. Previously @cohere-ai, @ai2, @aws, and @naverlabseurope.

52 followers · 15 following

Stars

FrontisAI / NatureBench

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

Python 78 5 Updated Jul 25, 2026

google-deepmind / alphaproof-nexus-results

Lean math proofs generated by AlphaProof Nexus and accompanying natural language prose proofs.

Lean 280 20 Updated Jul 21, 2026

huggingface / Repo2RLEnv

Convert any Repo into an RL Environment

Python 469 72 Updated Jul 23, 2026

NoviScl / Automated-AI-Researcher

Python 80 11 Updated Jan 20, 2026

radixark / miles

Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.

Python 1,789 325 Updated Jul 26, 2026

xhwang22 / Awesome-Reward-Hacking

A curated list of papers and resources on Reward Hacking, Emergent Misalignment, and Proxy Exploitation in Large Models

42 4 Updated Apr 17, 2026

duoan / TorchCode

🔥 LeetCode for PyTorch — practice implementing softmax, attention, GPT-2 and more from scratch with instant auto-grading. Jupyter-based, self-hosted or try online.

Jupyter Notebook 4,400 384 Updated May 25, 2026

zohaib-khan5040 / Countdown-Code

A testbed for studying the emergence and generalization of reward hacking

Python 9 4 Updated Mar 10, 2026

sgl-project / mini-sglang

A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.

Python 4,630 755 Updated May 17, 2026

mukhal / codebase-to-puzzles

generate coding exercises from any github repo

Python 2 Updated Oct 28, 2025

mjun0812 / flash-attention-prebuild-wheels

Provides pre-built flash-attention 2 and 3 package wheels for Linux and Windows using GitHub Actions

Python 1,638 74 Updated Jul 26, 2026

papercopilot / paperlists

Processed / Cleaned Data for Paper Copilot

Python 952 47 Updated Jul 1, 2026

ChenmienTan / RL2

Python 1,298 135 Updated May 20, 2026

yunx-z / ThinkLogit

Forked from alisawuffles/proxy-tuning

Eliciting Long CoT from a Short CoT Model

Python 8 Updated May 16, 2025

McGill-NLP / nano-aha-moment

Single File, Single GPU, From Scratch, Efficient, Full Parameter Tuning library for "RL for LLMs"

Jupyter Notebook 626 57 Updated Oct 7, 2025

RyanLiu112 / Awesome-Process-Reward-Models

A comprehensive collection of process reward models.

176 4 Updated Jun 6, 2026

mukhal / ThinkPRM

[TMLR] Process Reward Models That Think

Python 89 8 Updated Nov 29, 2025

yunx-z / MLRC-Bench

Forked from snap-stanford/MLAgentBench

Python 10 11 Updated Nov 14, 2025

kanishkg / cognitive-behaviors

Python 224 12 Updated Mar 26, 2025

google-research / rliable

[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.

Jupyter Notebook 880 49 Updated Aug 12, 2024

simplescaling / s1

s1: Simple test-time scaling

Python 6,660 757 Updated Jun 25, 2025

PrimeIntellect-ai / verifiers

Our library for RL environments + evals

Python 4,400 612 Updated Jul 25, 2026

llm-merging / LLM-Merging

LLM-Merging: Building LLMs Efficiently through Merging

Jupyter Notebook 208 44 Updated Sep 24, 2024

bndr / pipreqs

pipreqs - Generate pip requirements.txt file based on imports of any project. Looking for maintainers to move this project forward.

Python 7,462 420 Updated Mar 30, 2026

nerfies / nerfies.github.io

JavaScript 4,293 1,942 Updated Jun 21, 2024

QwenLM / ProcessBench

Official repository for ACL 2025 paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning"

Python 190 18 Updated May 20, 2025

huggingface / search-and-learn

Recipes to scale inference-time compute of open models

Python 1,130 131 Updated May 26, 2026

EnnengYang / Awesome-Model-Merging-Methods-Theories-Applications

Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities. ACM Computing Surveys, 2026.

769 47 Updated Jul 17, 2026

nolabs-ai / deepfabric

Generate High-Quality Synthetics, Train, Measure, and Evaluate in a Single Pipeline

Python 878 83 Updated Jul 20, 2026

bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.

Python 1,055 263 Updated Jul 22, 2025
