Tags: gty111/gLLM
Kernel refactor: use sgl_kernel (#181)
* Use sgl-kernel and flashinfer
* Simplify build
* Update requirements
* Add health check
* Fix shape
* Fix attention
* Fix
* Fix
* Fix
* Fix conv3d
* Support endpoint
* Fix stop string
* Abstract conv3d module
* Fix moe and add conv file
* Fix weight for conv3d
* Fix
* Fix fused moe
* Fix
* Fix for moe and model max len
* Fix padding block
* Bump up to v0.0.6
* Clean up

---------

Co-authored-by: instinctguo <instinctguo@tencent.com>

Support TP 🎉 (#72)
* Initial support for TP
* Use random initialization
* Fix PP forward
* Downgrade to torch 2.6.0
* Fix env setting for MAX_JOBS
* Downgrade to torch 2.5.1
* Fix TP group init
* Fix annotation
* Make llama compatible for TP
* Make chatglm compatible for TP
* Make Qwen3 compatible for TP
* Remove weight_loader in fused_moe
* Make fused_moe compatible for TP; Abstract weight load function
* Make qwen_moe compatible for TP
* Make mixtral compatible for TP
* Update readme
* Abstract module attention; Clean up code for TP attention; Clean up code for model weights loading for glm
* Add MoE tuning config for A100 PCIe 40GB
* Refactor scheduler.py and AllocatorID
* Refactor IDAllocator
* Refactor worker scheduler
* Update readme
* Make embed_tokens and lm_head compatible for TP
* Fix multi-node zmq_comm
* Bump version to 0.1.0