apexedgesystems / vernier

C++23 benchmarking framework with 6 profiler backends, CUDA GPU support, statistical regression detection, cross-compilation for 5 architectures, and CLI tools for analysis and visualization.

Updated Apr 2, 2026
C++

CarlosArmeroMoneo / quantum-workload-architecture

Quantum workload planning and profiler-backed architecture analysis for exact tensor-network execution.

python quantum-computing profiling tensor-networks nsight cuquantum

Updated Mar 25, 2026
Python

Therad445 / cuda-rentsense-knn

CUDA-accelerated kNN regression for rent estimation with CPU baseline, shared-memory optimization, and profiling

machine-learning performance cmake cxx gpu parallel-computing cuda profiling knn nsight

Updated Mar 20, 2026
C++

kalyani-25 / Reimplementation_flash-attention-from-scratch

16-step CUDA optimization of FlashAttention-2 achieving 99.2% of official performance on A100 — Ampere architecture

deep-learning cuda pytorch ampere gpu-kernels nsight llm-inference flashattention

Updated Mar 6, 2026
Cuda

Hyeonjoon-Nam / Cuda-Study-Journey

High-Performance Computing (HPC) & Optimization studies using CUDA C++. Includes Grid-Stride Loops, Shared Memory tiling, and Nsight Compute profiling analysis.

gpu optimization cuda nsight

Updated Mar 23, 2026
C++

yasser1-0 / FP16-vs-FP32-A-GPU-Lab-in-Frames

🎬 Explore GPU training efficiency with FP32 vs FP16 in this modular lab, utilizing Tensor Core acceleration for deep learning insights.

performance-engineering deep-learning reproducible-research cuda pytorch fp16 cupy mixed-precision nsight gpu-benchmark nvtx fp32 tensor-core

Updated Feb 20, 2026
Python

Salik-Devv / edge-detection-using-cuda

High-performance Sobel edge detection using CUDA with CPU vs GPU benchmarking, roofline analysis, and Nsight profiling.

computer-vision parallel-computing cuda image-processing high-performance-computing performance-analysis gpu-computing sobel roofline-model nsight

Updated Jan 17, 2026
Python

FlosMume / cpp-cuda-deepvision-rtx-starter

CUDA C++ practice project for RTX 4070 SUPER — explore GPU concurrency, pinned memory, and Nsight profiling. Includes SAXPY and 2D blur kernels to train optimization, stream overlap, and timing analysis for NVIDIA Developer Technology Engineering skillset.

cpp gpu cuda nvidia high-performance-computing cuda-kernels gpu-optimization nsight parrallel-computing deep-learning-inference gpu-profiling cuda-streams pinned-memory

Updated Nov 18, 2025
Cuda

j3soon / hpc-samples

CUDA Samples and Nsight Guided Profiling Samples

cuda profiling nsight nsight-compute

Updated Nov 14, 2025
Cuda

Chirag005 / CUDA-Kernel-project

Custom PyTorch CUDA kernel implementing optimized ReLU activation with vectorization, performance profiling, and memory analysis on Tesla T4 GPU achieving 75% bandwidth efficiency.

gpu cuda pytorch cuda-kernels performance-analysis nsight cuda-programming kernel-profiler cuda-kernel

Updated Oct 27, 2025
Jupyter Notebook

UchihaIthachi / sssp-apsp-hpc-openmp-cuda

🚀 High-performance implementations and benchmarks of SSSP and APSP algorithms (Bellman–Ford, Dijkstra, Floyd–Warshall, Johnson) in Serial, OpenMP, CUDA, and Hybrid CPU+GPU. Includes profiling, speedup plots, and HPC notebooks

graph-algorithms hpc openmp parallel-computing cuda performance-analysis gpu-computing johnson-algorithm nvcc apsp bellman-ford floyd-warshall-algorithm shortest-path-algorithm sssp nsight dijikstra-algorithm

Updated Oct 17, 2025
Jupyter Notebook

Umer-Farooq-CS / MNIST-Classification

The MNIST classification problem is a fundamental machine learning task that involves recognizing handwritten digits (0-9) from a dataset of 70,000 grayscale images (28x28 pixels each). It serves as a benchmark for evaluating machine learning models, particularly neural networks.

benchmarking deep-learning parallel-computing cuda mnist neural-networks high-performance-computing gpu-acceleration profiling shared-memory openacc performance-optimization c-cpp nsight tensor-cores cuda-streams pinned-memory

Updated Sep 12, 2025
Cuda

Dartayous / FP16-vs-FP32-A-GPU-Lab-in-Frames

A reproducible GPU benchmarking lab that compares FP16 vs FP32 training on MNIST using PyTorch, CuPy, and Nsight profiling tools. This project blends performance engineering with cinematic storytelling—featuring NVTX-tagged training loops, fused CuPy kernels, and a profiler-driven README that narrates the GPU’s inner workings frame by frame.

performance-engineering deep-learning reproducible-research cuda pytorch fp16 cupy mixed-precision nsight gpu-benchmark nvtx fp32 tensor-core

Updated Sep 5, 2025
Python

K-Wu / HET_nsight_utils

cuda nvidia trace gspread profiling ncu nsight nsys

Updated Aug 12, 2024
Python

HROlive / Fundamentals-of-Accelerated-Computing-with-CUDA-C-Cpp

Accelerate and optimize existing C/C++ CPU-only applications using the most essential CUDA tools and techniques.

cpp cuda nvidia cuda-kernels profilling nsight cuda-programming

Updated May 23, 2024
Jupyter Notebook

itm-unipi / Parallelized-Nearest-Neighbor-Upscaler

University Project for "Computer Architecture" course (MSc Computer Engineering @ University of Pisa). Implementation of a Parallelized Nearest Neighbor Upscaler using CUDA.

gpu nvidia nvidia-cuda nvidia-gpu nsight image-upscaling parallelized nearest-neighbor-algorithm nsight-compute

Updated Dec 29, 2023
C

Juanx65 / yolov8test

Learning how to profile a YOLOv8 network using NVIDIA Nsight Compute.

profiling nsight yolov8

Updated Jul 5, 2023
Python

sharcnet / vscode-hpc

Remote development on HPC clusters with VSCode

python cxx jupyter cpp hpc cmake-examples vscode cuda hpc-clusters nsight vscode-remote

Updated Sep 19, 2022
Jupyter Notebook

mnicely / computeWorks_examples

Matrix multiplication example performed with OpenMP, OpenACC, BLAS, cuBLAS, and CUDA

docker openmp cuda eclipse-plugin cublas nvidia blas nvidia-docker pgi-compiler openacc nsight

Updated May 31, 2022
C++

BrainTwister / docker-devel-env

Fast, reproducible, and portable software development environments

docker jenkins development cmake eclipse gcc vscode cuda clang conan reproducibility portability nsight

Updated Dec 8, 2021
Dockerfile

nsight

Here are 23 public repositories matching this topic...
