Stars
A low-latency & high-throughput serving engine for LLMs
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
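The core idea of attention sinks is to keep the first few tokens' KV entries forever and evict only from the middle, retaining a sliding window of recent tokens. A minimal sketch of that eviction policy (the helper name and signature are illustrative, not the repo's API):

```python
def evict_kv(cache_len, max_len, n_sink):
    """Return indices of KV-cache entries to keep: the first n_sink
    'attention sink' tokens plus a sliding window of the most recent
    tokens. Illustrative sketch of the StreamingLLM eviction policy."""
    if cache_len <= max_len:
        return list(range(cache_len))
    window = max_len - n_sink
    # keep sinks [0, n_sink) and the trailing window of recent tokens
    return list(range(n_sink)) + list(range(cache_len - window, cache_len))

# e.g. a 10-entry budget with 4 sinks, after 15 generated tokens:
print(evict_kv(15, 10, 4))  # [0, 1, 2, 3, 9, 10, 11, 12, 13, 14]
```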
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
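SmoothQuant's key observation is that activation outliers can be migrated into the weights by a per-input-channel scale, leaving the product unchanged: Y = XW = (X diag(s)^-1)(diag(s) W). A minimal NumPy sketch of that rescaling (not the released implementation):

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """Migrate activation outliers into weights per input channel:
    s_j = max|X[:, j]|^alpha / max|W[j, :]|^(1 - alpha), then
    Y = X @ W == (X / s) @ (s[:, None] * W), which makes X / s easier
    to quantize. Illustrative sketch of the SmoothQuant scaling."""
    s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=1) ** (1 - alpha))
    return X / s, W * s[:, None]

rng = np.random.default_rng(0)
X, W = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))
Xs, Ws = smooth(X, W)
print(np.allclose(X @ W, Xs @ Ws))  # the product is unchanged: True
```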
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
KAG is a knowledge-enhanced generation framework built on the OpenSPG engine, used to build rigorous decision-making and information-retrieval knowledge services
Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)
This includes the original implementation of Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.
[EMNLP'23, ACL'24] Compresses prompts and the KV-cache to speed up LLM inference and sharpen the model's perception of key information, achieving up to 20x compression with minimal performance loss.
Huazhong University of Science and Technology System Capability Training - DBMS
A collection of LLM papers, blogs, and projects, with a focus on OpenAI o1 and reasoning techniques.
Collecting awesome papers of RAG for AIGC. We propose a taxonomy of RAG foundations, enhancements, and applications in paper "Retrieval-Augmented Generation for AI-Generated Content: A Survey".
Awesome-LLM-RAG: a curated list of advanced retrieval augmented generation (RAG) in Large Language Models
🎉 Modern CUDA Learn Notes with PyTorch: fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, sgemv, warp/block reduce, dot, elementwise, softmax, layernorm, rmsnorm.
Graph-structured Indices for Scalable, Fast, Fresh and Filtered Approximate Nearest Neighbor Search
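Graph-structured ANN indices of this kind (DiskANN/HNSW-style) answer queries with a greedy best-first walk over a proximity graph, expanding the closest unexplored node until no frontier candidate beats the worst of the `ef` nearest found so far. A toy sketch of that traversal (the names and adjacency format are illustrative, not the library's API):

```python
import heapq
import numpy as np

def greedy_search(graph, vectors, query, start, ef=4):
    """Best-first search over a proximity graph: pop the closest frontier
    node, relax its neighbors, and stop once the frontier cannot improve
    the current ef nearest. Toy sketch of graph-based ANN search."""
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {start}
    cand = [(dist(start), start)]        # min-heap: exploration frontier
    best = [(-dist(start), start)]       # max-heap: current ef nearest
    while cand:
        d, u = heapq.heappop(cand)
        if d > -best[0][0]:              # frontier can no longer improve
            break
        for v in graph[u]:
            if v not in visited:
                visited.add(v)
                dv = dist(v)
                if len(best) < ef or dv < -best[0][0]:
                    heapq.heappush(cand, (dv, v))
                    heapq.heappush(best, (-dv, v))
                    if len(best) > ef:
                        heapq.heappop(best)  # drop the current worst
    return sorted((-d, v) for d, v in best)

# six points on a line, chained into a path graph:
vectors = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
result = greedy_search(graph, vectors, np.array([4.2]), start=0, ef=2)
print([v for _, v in result])  # walks from node 0 to the 2 nearest: [4, 5]
```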
Flash Attention in ~100 lines of CUDA (forward pass only)
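The trick that makes such a fused kernel possible is the online (streaming) softmax: process the score vector in blocks while tracking a running max and running sum, so no full pass over the scores is needed before normalizing. A NumPy sketch of just that recurrence, not the CUDA kernel itself (flash attention additionally fuses the value accumulation into the same loop):

```python
import numpy as np

def online_softmax(x, block=4):
    """One-pass streaming softmax: for each block, rescale the running
    sum s by exp(m - m_new) when the running max m grows, exactly as
    flash-attention kernels do per tile. Illustrative sketch."""
    m, s = -np.inf, 0.0
    for i in range(0, len(x), block):
        b = x[i:i + block]
        m_new = max(m, b.max())
        s = s * np.exp(m - m_new) + np.exp(b - m_new).sum()
        m = m_new
    return np.exp(x - m) / s  # final normalization

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.5])
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print(np.allclose(online_softmax(x), ref))  # matches two-pass softmax: True
```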
An LLM Based Diagnosis System (https://arxiv.org/pdf/2312.01454.pdf)
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
🤖 The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transf…
Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting yo…
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
Langchain-Chatchat (formerly Langchain-ChatGLM): RAG and Agent applications based on Langchain and language models such as ChatGLM, Qwen, and Llama, for local-knowledge-based LLM services
LlamaIndex is a data framework for your LLM applications
🦜🔗 Build context-aware reasoning applications
A Survey on Benchmarks of Multimodal Large Language Models
Ceph is a distributed object, block, and file storage platform
ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale