Starred repositories
Efficient operation implementation based on the Cambricon Machine Learning Unit (MLU).
A Python DSL to write Nvidia PTX for Hopper and Blackwell in JAX and PyTorch
A Claude Code plugin that shows what's happening: context usage, active tools, running agents, and todo progress
🤯 LobeHub is your Chief Agent Operator, organizing your agents into 7×24 operations by hiring, scheduling, and reporting on your entire AI team.
A configuration framework that enhances Claude Code with specialized commands, cognitive personas, and development methodologies.
A Claude Skill to give your agent the ability to use a web browser
Persistent file-based planning for AI coding agents and long-running agentic tasks. Crash-proof markdown plans that survive context loss and /clear, plus a deterministic completion gate and multi-a…
Give your AI agent eyes to see the entire internet. Read & search Twitter, Reddit, YouTube, GitHub, Bilibili, XiaoHongShu — one CLI, zero API fees.
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
Allow torch tensor memory to be released and resumed later
A unified architecture deep learning framework designed specifically for ultra-large-scale sparse models.
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Production-ready platform for agentic workflow development.
An Unbiased Sequential Recommendation Dataset with Randomly Exposed Videos
Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure
Examples for Recommenders - easy to train and deploy on accelerated infrastructure.
LibreCAD is a cross-platform 2D CAD program. It can read DXF/DWG, and write DXF/DWG/PDF/SVG files. It supports point/line/circle/ellipse/parabola/hyperbola/spline primitives. The GUI is highly cust…
cuVS - a library for vector search and clustering on the GPU
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
High-performance GEMM implementation optimized for NVIDIA H100 GPUs, leveraging the Hopper architecture's TMA, WGMMA, and Thread Block Clusters for near-peak theoretical performance.
Make huge neural nets fit in memory