Instant neural graphics primitives: lightning fast NeRF and more
Updated Oct 8, 2025 - CUDA
CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs.
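As a minimal sketch of the model described above, the host launches a grid of threads and each thread processes one array element. All names here (`add`, the array sizes) are illustrative, not from any of the repositories listed; compile with `nvcc`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element; the grid as a whole covers the array.
__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the sketch short; explicit cudaMemcpy also works.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```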
LeetCUDA: modern CUDA learning notes with PyTorch for beginners; 200+ CUDA kernels, Tensor Cores, HGEMM, FA-2 MMA.
[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
cuGraph - RAPIDS Graph Analytics Library
GPU-accelerated t-SNE for CUDA with Python bindings
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high performance applications.
Fuse multiple depth frames into a TSDF voxel volume.
CUDA Kernel Benchmarking Library
Graphics Processing Units Molecular Dynamics
cuVS - a library for vector search and clustering on the GPU
GPU accelerated decision optimization
A static, suckless, single-batch, CUDA-only Qwen3-0.6B mini inference engine
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions.
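To illustrate the WMMA path these HGEMM kernels build on, here is a hedged sketch (not code from the repository) in which one warp computes a single 16x16x16 tile D = A*B with fp16 inputs and fp32 accumulation. A real HGEMM tiles the full matrices and stages data through shared memory; this shows only the core Tensor Core call.

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 output tile; A is row-major, B is col-major,
// and both leading dimensions are 16 for simplicity (illustrative layout).
__global__ void wmma_tile(const half* A, const half* B, float* D) {
    // Per-warp fragments: fp16 operands, fp32 accumulator.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    // Load the operand tiles from memory into the fragments.
    wmma::load_matrix_sync(a, A, 16);
    wmma::load_matrix_sync(b, B, 16);

    // One Tensor Core matrix-multiply-accumulate over the whole tile.
    wmma::mma_sync(acc, a, b, acc);

    // Write the fp32 result tile back.
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}
```

The MMA PTX route mentioned in the description drops below this API to issue `mma.sync` instructions directly, trading portability for finer control over register and shared-memory layout.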
PopSift is an implementation of the SIFT algorithm in CUDA.
MegBA: A GPU-Based Distributed Library for Large-Scale Bundle Adjustment
State of the art sorting and segmented sorting, including OneSweep. Implemented in CUDA, D3D12, and Unity style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.
Created by NVIDIA - Released June 23, 2007