Nanyang Technological University
- Singapore
Stars
Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs
A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.
⭐️ A cross-platform CLI All-in-One assistant tool for Claude Code, Codex & Gemini CLI.
mKernel: fast multi-node, multi-GPU fused kernels
Dynamic Memory Management for Serving LLMs without PagedAttention
An Efficient and Versatile Inference Engine for Distributed LLM Serving
Train the smallest LM you can that fits in 16MB. Best model wins!
A ChatGPT (GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems
MetaAttention: A Unified and Performant Attention Framework Across Hardware Backends (PPoPP'26)
Using a swizzled hierarchical layout for GEMM
Academic Research Skills for Claude Code: research → write → review → revise → finalize
Efficient Long-context Language Model Training by Core Attention Disaggregation
Open-source framework for the research and development of foundation models.
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Pure Rust + CUDA LLM inference engine
Foundry materializes CUDA graphs, along with their execution context, to disk to support fast cold starts of serving engines.