
A collection of libraries to optimise AI model performance

Python 8,366 632 Updated Jul 22, 2024

NVIDIA container runtime library

C 1,031 247 Updated Nov 4, 2025

🦄 🦄 🦄 Core smart contracts of Uniswap v3

TypeScript 4,860 2,992 Updated Nov 3, 2024

High performance distributed framework for training deep learning recommendation models based on PyTorch.

Rust 409 55 Updated Jun 14, 2025

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries

Python 7,325 1,090 Updated Sep 26, 2025

F-Stack is a high-performance user-space network development kit based on DPDK, the FreeBSD TCP/IP stack, and a coroutine API.

C 4,137 942 Updated Nov 5, 2025

Making large AI models cheaper, faster and more accessible

Python 41,221 4,536 Updated Oct 13, 2025

ComScribe is a tool to identify communication among all GPU-GPU and CPU-GPU pairs in a single-node multi-GPU system.

C++ 27 4 Updated Jul 6, 2023

Milvus is a high-performance, cloud-native vector database built for scalable vector ANN search

Go 38,280 3,495 Updated Nov 5, 2025

Repository for nvCOMP docs and examples. nvCOMP is a library for fast lossless compression/decompression on the GPU that can be downloaded from https://developer.nvidia.com/nvcomp.

C++ 598 89 Updated Sep 11, 2024

An Agile Chisel-Based SoC Design Framework

Scala 26 2 Updated Dec 29, 2021

Hot coffee

JavaScript 188 9 Updated Feb 2, 2023

Slicing a PyTorch Tensor Into Parallel Shards

Python 301 15 Updated Jun 7, 2025

A benchmark for testing PCIe and host/device memory bandwidth and communication contention on multi-GPU and multi-CPU systems.

C++ 9 1 Updated Jun 9, 2016

The X86 Encoder Decoder (XED) is a software library for encoding and decoding X86 (IA32 and Intel64) instructions

Python 1,523 164 Updated Jun 11, 2025

BLAS-like Library Instantiation Software Framework

C 2,547 402 Updated Oct 21, 2025

A 128 bit unsigned integer class for CUDA

C++ 46 17 Updated Jan 3, 2025

The Ceph Benchmarking Tool

Python 299 146 Updated Oct 9, 2025

ONNX-TensorRT: TensorRT backend for ONNX

C++ 3,165 543 Updated Sep 8, 2025

FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/

C++ 1,468 674 Updated Nov 5, 2025

Tensorflow Backend for ONNX

Python 1,325 298 Updated Mar 28, 2024

Samples for CUDA developers that demonstrate features in the CUDA Toolkit

C 8,388 2,170 Updated Sep 5, 2025

Triton Model Analyzer is a CLI tool to help with better understanding of the compute and memory requirements of the Triton Inference Server models.

Python 495 80 Updated Nov 5, 2025

Automatically generate a C++ header file including CUDA device-specific parameters

C++ 3 Updated Jul 1, 2020

A GPU-powered real-time analytics storage and query engine.

Go 3,067 235 Updated Jul 13, 2024

Rodinia benchmark

C 189 103 Updated Apr 14, 2023

Running BERT without Padding

C++ 475 53 Updated Mar 18, 2022

Virtual Kubelet is an open source Kubernetes kubelet implementation.

Go 4,439 651 Updated Nov 3, 2025

brpc is an industrial-grade RPC framework written in C++, often used in high-performance systems such as search, storage, machine learning, advertisement, and recommendation. "brpc" mea…

C++ 17,359 4,066 Updated Nov 5, 2025