
Accelerating GEMM across kernels

2025-11-24
CUDA · GEMM · Matrix Multiplication

Introduction

Matrix multiplication is a fundamental operation at the heart of linear algebra. It also dominates the compute cost of deep learning and many other scientific computing workloads.

In this article, we will walk through optimizing a standard FP32 GEMM kernel on NVIDIA GPUs.

General Matrix Multiplication (GEMM)

The GEMM operation computes

D = \alpha AB + \beta C

where D \in \mathbb{R}^{m \times n}, A \in \mathbb{R}^{m \times k}, B \in \mathbb{R}^{k \times n}, C \in \mathbb{R}^{m \times n}, and \alpha and \beta are scalar floats.
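Before optimizing anything on the GPU, it helps to have a plain CPU reference to check kernel output against. The sketch below (gemm_ref is a hypothetical helper name, not from the original post or any library) implements the formula directly for row-major matrices:

```cpp
#include <vector>

// CPU reference for D = alpha * A * B + beta * C, written in place
// into C. A is MxK, B is KxN, C is MxN, all row-major.
// Illustrative only; a real check would also handle strides and tolerances.
void gemm_ref(const std::vector<float>& A, const std::vector<float>& B,
              std::vector<float>& C, int M, int N, int K,
              float alpha, float beta) {
    for (int row = 0; row < M; row++) {
        for (int col = 0; col < N; col++) {
            // Dot product of row `row` of A with column `col` of B.
            float acc = 0.0f;
            for (int i = 0; i < K; i++) {
                acc += A[row * K + i] * B[i * N + col];
            }
            C[row * N + col] = alpha * acc + beta * C[row * N + col];
        }
    }
}
```

Comparing each GPU kernel against this reference (with a small floating-point tolerance) catches indexing bugs early.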

Kernel I: Naive

__global__ void naive_gemm(const float* A, const float* B, float* C, int M, int N, int K, float alpha, float beta) {
    // Each thread computes one element of the output matrix C.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < M && col < N) {
        // Dot product of row `row` of A with column `col` of B.
        float temp = 0.0f;
        for (int i = 0; i < K; i++) {
            temp += A[row * K + i] * B[i * N + col];
        }
        // Scale the product by alpha and accumulate into beta-scaled C.
        C[row * N + col] = alpha * temp + beta * C[row * N + col];
    }
}
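On the host side, the grid must be sized so that every output element gets a thread, which is a ceiling division of the matrix dimensions by the block dimensions. A minimal sketch (the 16×16 block shape and `ceil_div` helper are illustrative choices, not from the post):

```cpp
// Number of blocks of size b needed to cover n elements.
constexpr int ceil_div(int n, int b) { return (n + b - 1) / b; }

// Example launch configuration for naive_gemm (host side, CUDA):
//   dim3 block(16, 16);                            // 256 threads per block
//   dim3 grid(ceil_div(N, 16), ceil_div(M, 16));   // x covers columns, y covers rows
//   naive_gemm<<<grid, block>>>(A, B, C, M, N, K, alpha, beta);
```

The bounds check inside the kernel (`row < M && col < N`) is what makes this over-provisioned grid safe when M or N is not a multiple of the block size.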