Efficient Triton Kernels for LLM Training
Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
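As a hedged illustration of that "single line of code" claim, the sketch below wraps a Hugging Face model with Kernl's advertised `optimize_model` entry point. Treat the exact import path and the surrounding setup as assumptions modeled on Kernl's README-style examples, not a verified API reference.

```python
import torch
from transformers import AutoModel

# Assumed import path for Kernl's documented entry point.
from kernl.model_optimization import optimize_model

model = AutoModel.from_pretrained("bert-base-uncased").eval().cuda()
optimize_model(model)  # the "single line" that swaps in fused Triton kernels

# Kernl targets half-precision CUDA inference, hence autocast + inference_mode.
with torch.inference_mode(), torch.cuda.amp.autocast():
    inputs = {
        "input_ids": torch.ones((1, 16), dtype=torch.long, device="cuda"),
        "attention_mask": torch.ones((1, 16), dtype=torch.long, device="cuda"),
    }
    outputs = model(**inputs)
```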
🎉 Modern CUDA Learn Notes with PyTorch: CUDA Cores, Tensor Cores, fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, hgemm, sgemv, warp/block reduce, elementwise, softmax, layernorm, rmsnorm.
A service for autodiscovery and configuration of applications running in containers
Playing with the Tigress software protection: breaking some of its protections and solving its reverse-engineering challenges. Automatic deobfuscation using symbolic execution, taint analysis, and LLVM.
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
FlagGems is an operator library for large language models, implemented in the Triton language.
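To make "written in Triton" concrete, here is a minimal, self-contained kernel in the style such operator libraries build on: a blocked element-wise add following the standard `triton.jit` tutorial pattern. It is a generic sketch, not a kernel taken from either repository above.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```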
Automatic ROP chain generation
SymGDB - symbolic execution plugin for gdb
LLVM-based static binary analysis framework
A performance library for machine learning applications.
ClearML - Model-Serving Orchestration and Repository Solution
(WIP) A deployment framework that aims to provide simple, lightweight, fast, pipelined deployment for algorithm services, ensuring reliability, high concurrency, and scalability.
NVIDIA-accelerated, deep learned model support for image space object detection
NVIDIA-accelerated DNN model inference ROS 2 packages using NVIDIA Triton/TensorRT for both Jetson and x86_64 with CUDA-capable GPU
Deploy DL/ML inference pipelines with minimal extra code.
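Several of the entries above target NVIDIA Triton Inference Server rather than the Triton language. As a rough sketch of how a deployed model is queried, the snippet below uses the `tritonclient` HTTP API; the model name and tensor names ("detector", "input__0", "output__0") are hypothetical placeholders for whatever the served model actually declares in its configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton Inference Server instance (default HTTP port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# "input__0" is a hypothetical tensor name; use the names declared in the
# served model's config.pbtxt.
image = np.zeros((1, 3, 224, 224), dtype=np.float32)
inp = httpclient.InferInput("input__0", list(image.shape), "FP32")
inp.set_data_from_numpy(image)

# "detector" and "output__0" are likewise placeholders.
result = client.infer(model_name="detector", inputs=[inp])
detections = result.as_numpy("output__0")
print(detections.shape)
```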