
From Megabytes to Megawatts: A Comprehensive Guide to High-Performance LLM and Diffusion Kernels with CUDA and Triton


All content is generated by an LLM; please exercise discretion.

The GPU Computing Paradigm: Unlocking Parallel Performance

The ascent of deep learning, particularly Large Language Models (LLMs) and Diffusion Models, is inextricably linked to the evolution of the Graphics Processing Unit (GPU). Originally designed for graphics rendering, GPUs have become the de facto hardware for AI due to their massively parallel architecture.¹ However, harnessing this power requires more than simply running code on a GPU; it demands a fundamental understanding of the hardware’s architecture and the programming models designed to exploit it. This section establishes the foundational principles of GPU computing, moving from the physical hardware to the abstract programming model, providing the necessary context for understanding why certain programming patterns are effective for high-performance computing.

The Anatomy of a Modern GPU: A Parallel Powerhouse

A modern GPU is not merely a faster version of a Central Processing Unit (CPU); its architecture is fundamentally different, optimized for data parallelism rather than sequential task execution.² Where a CPU has a few powerful cores designed for low-latency, single-threaded performance, a GPU contains thousands of simpler cores designed for high-throughput, parallel computation.³ To write efficient GPU code, one must understand its key components.

The CUDA Programming Model: Abstracting the Hardware

NVIDIA’s Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model that provides a software layer to abstract the physical GPU hardware.³ It allows developers to use languages like C++, Python, and Fortran to write programs that can harness the GPU’s parallel processing power.¹ The CUDA model is built on a few key abstractions that map directly to the hardware architecture.¹¹

Writing Your First CUDA Kernel: Vector Addition

The canonical “Hello, World!” of parallel computing is vector addition, C = A + B, where A, B, and C are large vectors.¹⁶ Implementing this in CUDA illustrates the fundamental workflow and programming model. The process involves the following steps ⁵:

  1. Host-Side Preparation: On the CPU, allocate memory for the host vectors h_A, h_B, and h_C using malloc() and initialize the input vectors h_A and h_B.
  2. Device Memory Allocation: On the GPU, allocate memory for the device vectors d_A, d_B, and d_C using cudaMalloc().¹²
  3. Host-to-Device Data Transfer: Copy the input data from the host vectors to their device counterparts using cudaMemcpy() with the direction cudaMemcpyHostToDevice.⁵
  4. Kernel Launch: Launch the add_kernel on the GPU. This is done using the triple-chevron syntax <<<gridDim, blockDim>>>. The blockDim is typically set to a multiple of 32 (e.g., 256 or 512) for efficiency. The gridDim is then calculated to ensure there are enough total threads to cover all N elements of the vectors, often using a ceiling division: (N + blockDim.x - 1) / blockDim.x.¹⁵
  5. Kernel Execution: The GPU executes the kernel code on all launched threads in parallel. Inside the kernel, each thread calculates its unique global index and performs one element-wise addition. A boundary check is essential to prevent threads from writing past the end of the arrays if N is not a perfect multiple of the block size.¹⁷
  6. Host-Device Synchronization: The host CPU must wait for the GPU to complete its computation before using the results. This is achieved by calling cudaDeviceSynchronize(), which blocks the CPU thread until all previously issued GPU commands have finished.¹⁸
  7. Device-to-Host Data Transfer: Copy the result vector d_C from device memory back to the host vector h_C using cudaMemcpy() with the direction cudaMemcpyDeviceToHost.⁵
  8. Cleanup: Free the allocated memory on both the device (using cudaFree()) and the host (using free()) to prevent memory leaks.¹²

The following code demonstrates a complete vector addition program in CUDA C++. Source code files with CUDA code typically use the .cu extension and are compiled with NVIDIA’s nvcc compiler (nvcc vector_add.cu -o vector_add).¹²

#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cmath>
// Error checking wrapper for CUDA API calls
void cudaCheck(cudaError_t error, const char *file, int line) {
    if (error != cudaSuccess) {
        printf("CUDA error at %s:%d: %s\n", file, line, cudaGetErrorString(error));
        exit(EXIT_FAILURE);
    }
}
#define CUDA_CHECK(err) (cudaCheck(err, __FILE__, __LINE__))
// CUDA Kernel to add two vectors
__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
    // Calculate the unique global index for this thread
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Boundary check: ensure the thread index is within the array bounds
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
int main() {
    int N = 1 << 20; // 1,048,576 elements
    size_t size = N * sizeof(float);
    // 1. Allocate host memory
    float* h_a = (float*)malloc(size);
    float* h_b = (float*)malloc(size);
    float* h_c = (float*)malloc(size);
    // Initialize host vectors
    for (int i = 0; i < N; ++i) {
        h_a[i] = sin(i) * sin(i);
        h_b[i] = cos(i) * cos(i);
    }
    // 2. Allocate device memory
    float *d_a, *d_b, *d_c;
    CUDA_CHECK(cudaMalloc(&d_a, size));
    CUDA_CHECK(cudaMalloc(&d_b, size));
    CUDA_CHECK(cudaMalloc(&d_c, size));
    // 3. Copy data from host to device
    CUDA_CHECK(cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice));
    // 4. Launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    add_kernel<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);
    // 5. Synchronize to wait for the kernel to finish
    CUDA_CHECK(cudaGetLastError()); // Check for kernel launch errors
    CUDA_CHECK(cudaDeviceSynchronize());
    // 6. Copy result from device to host
    CUDA_CHECK(cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost));
    // Verify the result on the host
    float maxError = 0.0f;
    for (int i = 0; i < N; ++i) {
        maxError = fmax(maxError, fabs(h_c[i] - 1.0f));
    }
    std::cout << "Max error: " << maxError << std::endl;
    // 7. Free memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);
    return 0;
}

Triton: High-Productivity GPU Programming in Python

While CUDA provides ultimate control over the GPU, its complexity presents a significant barrier. Writing efficient CUDA code requires deep hardware expertise and is a time-consuming, error-prone process.²¹ In the fast-paced world of AI research and development, a more productive solution is needed. OpenAI’s Triton emerges to fill this role, offering a high-level, Python-based path to high-performance GPU computing.

Bridging the Gap: Why Triton Exists

Triton is an open-source language and compiler designed to make GPU programming more accessible, especially for the AI community.²³ Its primary goal is to enable developers, even those with no prior CUDA experience, to write custom compute kernels that achieve performance comparable to that of expert-tuned CUDA code, but with a fraction of the effort.²¹ The magic of Triton lies in its just-in-time (JIT) compiler. A developer writes kernel logic in a high-level, Python-like syntax. The Triton compiler then takes this code and automatically handles many of the most challenging aspects of GPU programming, which would otherwise require painstaking manual optimization in CUDA, such as memory coalescing, shared-memory management, and the scheduling of work within each streaming multiprocessor.²¹

This high level of automation represents a democratizing force in high-performance computing. Historically, kernel optimization was the exclusive domain of a small number of HPC specialists. The explosion of LLMs and other large AI models created an insatiable demand for performance optimization to manage training and inference costs.²⁷ Triton empowers a much broader audience of data scientists and ML engineers, who are already proficient in Python, to write their own high-performance kernels, bridging the gap between high-level frameworks like PyTorch and low-level CUDA.²⁴ This fosters a more open and agile ecosystem where performance optimization is no longer a bottleneck reserved for a few large corporations.²⁵

The Triton Programming Model: Block-Level Abstraction

Triton’s programming model is a key differentiator from CUDA. It revisits the Single Program, Multiple Data (SPMD) paradigm but elevates the level of abstraction. Instead of writing a program for a single scalar thread, as in CUDA, a Triton developer writes a blocked program that operates on entire tiles (or blocks) of data at once.²¹ The notion of individual threads within a block is abstracted away and managed by the compiler.

The core concepts of the Triton programming model, all of which appear in the vector-addition kernel in the next section, are the program instance (identified at runtime via tl.program_id), block-level loads and stores (tl.load and tl.store over a tensor of pointers, guarded by a boundary mask), and compile-time meta-parameters declared as tl.constexpr, such as the block size.

A “Hello, World!” in Triton: Vector Addition Revisited

Reimplementing vector addition in Triton starkly illustrates its simplicity and higher level of abstraction compared to CUDA.

The workflow consists of two parts: a Python host function to launch the kernel and the Triton kernel itself.

  1. Python Host Wrapper (add function): This function orchestrates the kernel launch. It takes PyTorch tensors as input, pre-allocates the output tensor, defines the launch grid (calculating how many program instances are needed), and launches the kernel with the appropriate arguments and meta-parameters.³²
  2. Triton Kernel (add_kernel function): This function, decorated with @triton.jit, contains the core logic. It calculates its program ID, computes the offsets for its data block, creates a mask for boundary safety, and then uses tl.load, vectorized arithmetic, and tl.store to perform the addition on its assigned tile of data.²⁴

The complete, annotated Python code for vector addition in Triton is as follows:

import torch
import triton
import triton.language as tl
@triton.jit
def add_kernel(
    x_ptr,  # Pointer to the first input vector
    y_ptr,  # Pointer to the second input vector
    output_ptr,  # Pointer to the output vector
    n_elements,  # Total number of elements in the vector
    BLOCK_SIZE: tl.constexpr,  # Number of elements each program instance will process
):
    """
    Triton kernel for element-wise vector addition.
    Each instance of this kernel computes a block of the output vector.
    """
    # 1. Get the program ID (pid) for this instance.
    # This is the index of the program instance in the 1D grid.
    pid = tl.program_id(axis=0)
    # 2. Calculate the offsets for the elements this program will process.
    # block_start is the starting index for this program's block.
    block_start = pid * BLOCK_SIZE
    # offsets is a vector of indices for the entire block.
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # 3. Create a mask to prevent out-of-bounds memory access.
    # This is crucial when n_elements is not a multiple of BLOCK_SIZE.
    mask = offsets < n_elements
    # 4. Load the data blocks from global memory (DRAM) into registers (SRAM).
    # The mask ensures that we only load valid data.
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # 5. Perform the element-wise addition on the loaded blocks.
    output = x + y
    # 6. Store the result block back to global memory.
    # The mask ensures we only write to valid memory locations.
    tl.store(output_ptr + offsets, output, mask=mask)
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """
    Host-side wrapper function to launch the Triton kernel for vector addition.
    """
    # Pre-allocate the output tensor on the same device as the inputs.
    output = torch.empty_like(x)
    assert x.is_cuda and y.is_cuda and output.is_cuda
    n_elements = output.numel()
    # Define the launch grid. It's a 1D grid of programs.
    # The size of the grid is the number of blocks needed to cover all elements.
    # triton.cdiv performs ceiling division.
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    # Launch the kernel.
    # The grid is passed in, and meta-parameters like BLOCK_SIZE are keyword arguments.
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output
# --- Verification ---
torch.manual_seed(0)
size = 98432
x = torch.rand(size, device='cuda')
y = torch.rand(size, device='cuda')
output_torch = x + y
output_triton = add(x, y)
print(f"PyTorch output: {output_torch}")
print(f"Triton output:  {output_triton}")
print(f"The maximum difference is {torch.max(torch.abs(output_torch - output_triton))}")

The stark contrast in complexity between the CUDA and Triton implementations highlights Triton’s value. The Triton version is not only shorter and written in a more familiar Python syntax, but it also abstracts away the most difficult parts of GPU programming, such as manual memory allocation, thread indexing logic, and shared memory management.

Table 1: CUDA vs. Triton - A Comparative Overview

To summarize the relationship between these two powerful tools, the following table provides a high-level comparison. It frames them not as direct competitors for all tasks, but as complementary tools in a performance optimization toolkit.

| Feature | CUDA | Triton |
| --- | --- | --- |
| Programming Language | C/C++ with extensions | Python-like DSL |
| Abstraction Level | Low-level, hardware-centric | High-level, block-centric |
| Thread Management | Manual (threads, warps, blocks) | Automatic (managed by compiler) |
| Memory Management | Manual (cudaMalloc, shared memory via __shared__) | Automatic (compiler optimizes SRAM usage) |
| Learning Curve | Steep, requires deep hardware knowledge | Moderate, familiar for Python developers |
| Debugging | Complex (Nsight, printf) | Easier (CPU simulator via TRITON_INTERPRET=1) |
| Performance | Potential for absolute peak performance | Often on par with expert CUDA; easier to achieve good performance |
| Typical Use Case | Novel hardware features, complex non-standard kernels | Fusing common DL ops, rapid prototyping, general kernel development |

The Bottleneck: Why Custom Kernels are Essential for Modern AI

While deep learning frameworks like PyTorch and TensorFlow provide highly optimized building blocks, their standard execution model often leaves significant performance on the table. To understand why, one must differentiate between operations limited by computation and those limited by memory, and recognize how frameworks handle them. This understanding provides the critical motivation for learning to write custom CUDA and Triton kernels.

Memory-Bound vs. Compute-Bound Operations

The performance of any GPU operation is ultimately constrained by one of two factors: the rate at which it can perform calculations (compute-bound) or the rate at which it can fetch data from memory (memory-bound).³⁵
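
To make this concrete, a quick back-of-the-envelope calculation shows why an operation like vector addition is firmly memory-bound. The hardware numbers below are rough, A100-class figures and are used purely for illustration.

# Back-of-the-envelope arithmetic intensity for vector addition (C = A + B).
# Hardware numbers are approximate A100-class figures, for illustration only.
N = 1 << 20
flops = N                          # one add per element
bytes_moved = 3 * N * 4            # read A, read B, write C (float32)
intensity = flops / bytes_moved    # ~0.083 FLOP per byte

peak_flops = 19.5e12               # ~19.5 TFLOP/s FP32 (approximate)
peak_bw = 1.5e12                   # ~1.5 TB/s HBM bandwidth (approximate)
ridge_point = peak_flops / peak_bw # ~13 FLOP/byte needed to become compute-bound

print(f"Arithmetic intensity: {intensity:.3f} FLOP/byte")
print(f"Ridge point:          {ridge_point:.1f} FLOP/byte")
# Since 0.083 is far below 13, vector addition spends almost all of its time
# waiting on HBM rather than doing arithmetic: it is memory-bound.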

The fundamental performance problem with standard deep learning frameworks is that they typically execute each operation in a model as a separate, distinct kernel launch. Consider a simple sequence found in many models: a linear layer followed by a ReLU activation. A naive implementation would execute this as:

  1. Launch a GEMM kernel for the matrix multiplication. Write the result tensor to HBM.
  2. Launch an element-wise kernel for the ReLU activation. This kernel reads the result tensor from HBM, applies the ReLU, and writes the final output back to HBM.

This sequence is incredibly inefficient. The intermediate tensor is written all the way out to slow global memory only to be immediately read back in by the next kernel. This redundant memory traffic is a major source of performance loss.³⁷

The Tyranny of DRAM and Kernel Launch Overhead

The inefficiency of the standard execution model is rooted in two hardware realities. The first is the HBM bottleneck: global memory bandwidth is orders of magnitude lower than that of on-chip registers and shared memory, so every unnecessary round trip to HBM leaves the compute units idle. The second is kernel launch overhead: each launch carries a fixed CPU-side dispatch cost on the order of microseconds, which accumulates quickly when a model is expressed as thousands of small, short-lived kernels.

The Power of Kernel Fusion

Kernel fusion is the primary optimization technique used to combat these inefficiencies. It involves combining multiple sequential operations into a single, larger GPU kernel.³⁸

Instead of writing intermediate results to HBM, a fused kernel keeps them in the GPU’s fast on-chip registers or shared memory. In the fused Linear -> ReLU example, a thread block would compute a tile of the matrix multiplication result, immediately apply the ReLU activation to that tile while it is still in registers, and only then write the final, activated result to HBM.
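
The same principle can be shown with a much simpler pair of operations. The following Triton kernel is a minimal sketch (the kernel and wrapper names are illustrative, and the bias is assumed to have the same shape as the input) that fuses a bias addition with a ReLU into a single launch, so the intermediate sum only ever exists in registers.

import torch
import triton
import triton.language as tl

@triton.jit
def fused_bias_relu_kernel(x_ptr, bias_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    b = tl.load(bias_ptr + offsets, mask=mask)
    # The intermediate (x + b) lives only in registers; it is never written to HBM.
    y = tl.maximum(x + b, 0.0)
    tl.store(out_ptr + offsets, y, mask=mask)

def fused_bias_relu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_bias_relu_kernel[grid](x, bias, out, n, BLOCK_SIZE=1024)
    return out

An unfused version would launch two kernels and write the intermediate tensor to HBM between them; the fused version reads each input once and writes the output once.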

The benefits of this approach are profound and directly address the core bottlenecks ⁴⁰: intermediate results never leave on-chip memory, so HBM traffic drops sharply; fewer kernels are launched, so the fixed launch overhead shrinks; and the remaining kernel performs more useful work per byte loaded from memory, raising its arithmetic intensity.

Kernel fusion is the cornerstone of modern AI performance optimization. It is the fundamental principle that motivates the development of custom kernels. Sophisticated algorithms like Flash Attention are, at their core, an advanced application of kernel fusion for the specific sequence of operations in the attention mechanism.³⁶ Libraries like DeepSpeed’s Transformer Kernel and Liger-Kernel are essentially collections of pre-written, highly optimized fused kernels for common Transformer components.²⁷ Therefore, learning to write custom kernels in CUDA or Triton is fundamentally about learning how to identify and implement kernel fusion. It is the most critical optimization technique for the memory-bound workloads that constitute the majority of operations in LLMs and Diffusion models.

Optimizing the Transformer: From Multi-Head Attention to Flash Attention

The Transformer architecture, and specifically its self-attention mechanism, is the heart of modern LLMs. It is also a major performance bottleneck due to its computational and memory complexity. This section provides a practical, code-centric deep dive into optimizing this critical component, progressing from a naive implementation to the state-of-the-art Flash Attention algorithm.

Deconstructing Multi-Head Attention (MHA)

The attention mechanism allows a model to weigh the importance of different tokens in a sequence when producing a representation for a specific token. The most common variant is Scaled Dot-Product Attention, defined by the formula: $$ Attention(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right)V $$ Here, Q (Query), K (Key), and V (Value) are matrices derived from the input sequence, and d_k is the dimension of the key vectors. Multi-Head Attention (MHA) enhances this by running the attention mechanism multiple times in parallel with different, learned linear projections, and then concatenating the results. This allows the model to jointly attend to information from different representational subspaces.⁴⁵

A standard implementation in PyTorch using nn.Module typically involves the following inefficient steps:

  1. Three separate nn.Linear layers to project the input into Q, K, and V.
  2. Reshaping and transposing the tensors to separate the attention heads.
  3. A batched matrix multiplication (torch.bmm or @) to compute the raw scores, QKᵀ.
  4. Applying a causal or padding mask to the scores.
  5. Applying the softmax function to get attention weights.
  6. A second batched matrix multiplication to apply the weights to the V matrix.
  7. Transposing and reshaping the output back to the original tensor format.

The critical inefficiency here is the materialization of the intermediate attention score matrix S = QKᵀ. This matrix has dimensions (SequenceLength, SequenceLength). For a sequence of length N, this requires O(N²) memory, which becomes prohibitively large for the long contexts used in modern LLMs. Storing and retrieving this large matrix from HBM is a massive performance bottleneck.⁴⁷
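
For reference, the core of such a naive implementation can be sketched in a few lines of PyTorch (masking, dropout, and the head reshaping are omitted); the scores tensor below is exactly the O(N²) intermediate in question.

import math
import torch

def naive_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, d_head)
    d_k = q.size(-1)
    # The (seq_len, seq_len) score matrix is materialized in HBM for every
    # batch element and head -- this is the O(N^2) memory bottleneck.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v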

A Fused MHA Kernel in Triton

As a first optimization step, we can use Triton to fuse several MHA components, reducing HBM traffic and kernel launch overhead. This serves as a practical stepping stone before tackling the more complex Flash Attention.

Our fusion strategy involves two main parts:

  1. Fused QKV Projection: Instead of three separate linear layers for Q, K, and V, we can use a single, wider nn.Linear layer that projects the input to a tensor three times the size, and then split it. This reduces three separate GEMM operations into one larger, more efficient GEMM.⁵⁰ (A minimal sketch of this projection appears after this list.)
  2. Fused Attention Core: We can write a single Triton kernel that performs the core attention computation: softmax(QKᵀ)V. This kernel will load blocks of Q, K, and V, compute the dot product, apply softmax, and multiply by V, all within a single launch. This avoids materializing the full (N, N) attention matrix in global memory.
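
A minimal sketch of the first strategy, the fused QKV projection (the dimensions here are illustrative):

import torch
import torch.nn as nn

d_model = 768  # illustrative model dimension
qkv_proj = nn.Linear(d_model, 3 * d_model)   # one wide projection instead of three

x = torch.randn(2, 128, d_model)             # (batch, seq_len, d_model)
q, k, v = qkv_proj(x).chunk(3, dim=-1)       # a single GEMM, followed by a cheap split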

A Triton kernel for this fused attention core would look conceptually like this:

@triton.jit
def fused_attn_kernel(Q_ptr, K_ptr, V_ptr, O_ptr, ...):
    # 1. Use 2D program IDs to select an output block
    pid_m = tl.program_id(axis=0) # Block row index
    pid_n = tl.program_id(axis=1) # Block col index (not needed for this simplified version)
    # 2. Initialize accumulator for the output block
    acc = tl.zeros((BLOCK_SIZE_M, D_HEAD), dtype=tl.float32)
    # 3. Load a block of Q
    q_offsets = ...
    q = tl.load(Q_ptr + q_offsets)
    # 4. Loop over blocks of K and V
    for k_block_idx in range(0, N_CTX, BLOCK_SIZE_K):
        # Load a block of K
        k_offsets = ...
        k = tl.load(K_ptr + k_offsets)
        # Compute a block of the score matrix S = QK^T
        s_ij = tl.dot(q, k)
        # (This is where Flash Attention's online softmax would go)
        # For a simpler fused kernel, we might compute a block of P
        p_ij = tl.softmax(s_ij)
        # Load a block of V
        v_offsets = ...
        v = tl.load(V_ptr + v_offsets)
        # Compute a block of the output O = PV
        acc += tl.dot(p_ij, v)
    # 5. Store the final output block
    o_offsets = ...
    tl.store(O_ptr + o_offsets, acc)

While this simplified structure illustrates the concept, it still faces the challenge of computing softmax correctly without seeing the entire row of scores. This is precisely the problem that Flash Attention solves.

The Flash Attention Revolution

Flash Attention is an IO-aware, exact attention algorithm that fundamentally reorders the computation to avoid the O(N²) memory bottleneck.⁵¹ It is a canonical example of algorithm-hardware co-design, where the algorithm was explicitly created to match the memory hierarchy of modern GPUs.

The key techniques are tiling and online softmax. Q, K, and V are split into blocks sized to fit in on-chip SRAM, and each block of the output is accumulated as the kernel streams through the corresponding blocks of K and V, so the full (N, N) score matrix is never materialized in HBM. Because softmax normally requires an entire row of scores, the kernel instead maintains a running row-wise maximum m and normalizer l, updating them as each new block of scores S_ij is processed:

$$ m_{\text{new}} = \max\left( m_{\text{old}},\ \max(S_{ij}) \right) $$

$$ l_{\text{new}} = e^{(m_{\text{old}} - m_{\text{new}})} \cdot l_{\text{old}} + \sum e^{(S_{ij} - m_{\text{new}})} $$

This mathematical trick allows for a tiled, streaming computation of an operation that is inherently global.³⁶
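
The update rules above can be checked with a few lines of PyTorch. The sketch below is a toy, single-row illustration (not the kernel itself): it streams over blocks of scores while maintaining the running maximum m and normalizer l, and ends up with exactly the same result as a full softmax over the row.

import torch

def streaming_softmax(row: torch.Tensor, block: int = 4) -> torch.Tensor:
    """Compute softmax(row) one block at a time using running statistics."""
    m = torch.tensor(float("-inf"))   # running maximum (m_old)
    l = torch.tensor(0.0)             # running normalizer (l_old)
    for start in range(0, row.numel(), block):
        s = row[start:start + block]              # one block of scores S_ij
        m_new = torch.maximum(m, s.max())         # m_new = max(m_old, max(S_ij))
        l = torch.exp(m - m_new) * l + torch.exp(s - m_new).sum()
        m = m_new
    # A second pass (or rescaled accumulation, as in Flash Attention) yields the weights.
    return torch.exp(row - m) / l

row = torch.randn(10)
assert torch.allclose(streaming_softmax(row), torch.softmax(row, dim=0), atol=1e-6)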

This approach demonstrates that the largest performance gains often come not from micro-optimizing existing code, but from fundamentally redesigning the algorithm to align with the physical constraints and opportunities of the hardware.

Implementing Flash Attention in Triton

The official Triton documentation includes a tutorial for a fused attention kernel that implements the Flash Attention v2 algorithm.⁵⁶ A line-by-line analysis of this kernel reveals the practical application of the concepts above: the launch grid is two-dimensional, with one axis iterating over blocks of query rows and the other over the batch and head dimensions; an inner loop streams blocks of K and V through SRAM; the running maximum and normalizer are updated with the online softmax rules shown earlier, with the partially accumulated output rescaled whenever the maximum changes; and only the final, fully normalized output block is written back to HBM.

The performance impact of these optimizations is transformative, as summarized in the table below.

Table 2: Attention Implementation Performance Benchmark (Conceptual)

This table illustrates the conceptual performance progression across different attention implementations on a modern GPU like an A100. Actual numbers will vary, but the trend is representative.

| Implementation | Sequence Length | Latency (ms) | Peak Memory (GB) |
| --- | --- | --- | --- |
| Naive PyTorch MHA | 512 | ~1.5 | ~0.5 |
| Naive PyTorch MHA | 2048 | ~20 | ~8 |
| Naive PyTorch MHA | 8192 | OOM | > 80 (OOM) |
| Fused Triton MHA | 512 | ~1.2 | ~0.5 |
| Fused Triton MHA | 2048 | ~15 | ~8 |
| Fused Triton MHA | 8192 | OOM | > 80 (OOM) |
| Flash Attention (Triton) | 512 | ~0.5 | < 0.1 |
| Flash Attention (Triton) | 2048 | ~2.0 | < 0.1 |
| Flash Attention (Triton) | 8192 | ~8.0 | < 0.1 |
| Flash Attention (Triton) | 16384 | ~18.0 | < 0.1 |

Note: OOM = Out of Memory. Latency and memory are conceptual estimates for illustration.

The table clearly shows that while basic fusion offers a modest improvement, it does not solve the fundamental O(N²) memory problem. Flash Attention, by redesigning the algorithm, not only provides a dramatic speedup but also breaks the quadratic memory barrier, enabling transformers to process sequence lengths that were previously impossible.

End-to-End Model Optimization

While optimizing individual operators like attention is crucial, the greatest performance gains are realized by taking a holistic view and optimizing entire sections of a model. This involves fusing multiple layers of a Transformer block or key components of a Diffusion model’s U-Net. This section expands the scope from single-operator fusion to these end-to-end strategies.

Fusing the Full Transformer Block

A standard Transformer block is a sequence of operations: Multi-Head Attention, a residual connection, Layer Normalization, a Feed-Forward Network (FFN) or MLP, another residual connection, and a final LayerNorm.⁴⁷ A naive implementation executes each of these as a separate kernel, resulting in significant memory traffic and launch overhead.
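
For reference, the unfused sequence can be expressed as a standard post-norm PyTorch module (a minimal sketch; dropout and masking are omitted). Every sub-module in the forward pass corresponds to at least one separate kernel launch, with its intermediate activations round-tripping through HBM.

import torch.nn as nn

class NaiveTransformerBlock(nn.Module):
    """Unfused reference implementation of a post-norm Transformer block."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)      # residual add + LayerNorm
        x = self.norm2(x + self.ffn(x))   # FFN + residual add + LayerNorm
        return x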

The ultimate goal of optimization is the “Megakernel” approach, where as many of these sequential operations as possible are fused into a single, persistent GPU kernel.⁵⁸ This minimizes launch overhead to nearly zero and maximizes the time data spends in fast on-chip memory.

While a single monolithic kernel for the entire block is complex to write and maintain, a practical approach involves breaking the block into a few highly optimized fused kernels: a Flash Attention kernel for the attention core (with the Q, K, and V projections fused into a single GEMM), a fused residual-add and Layer Normalization kernel, and a fused MLP kernel that keeps the intermediate activation of the feed-forward network on-chip.

Below is an example of a fused MLP kernel in Triton, which is a common and highly effective optimization.

@triton.jit
def fused_mlp_kernel(
    x_ptr, w1_ptr, w2_ptr, output_ptr,
    M, N, K,
    stride_xm, stride_xn,
    stride_w1n, stride_w1k,
    stride_w2k, stride_w2n,
    stride_om, stride_on,
    BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr,
):
    # --- Fused Linear -> GeLU -> Linear ---
    pid = tl.program_id(axis=0)
    # --- First Linear Layer (GEMM 1) ---
    # This part is similar to a standard matrix multiplication kernel
    # It computes a block of the intermediate result: intermediate = x @ w1
    ...
    intermediate_block = ...  # Result of first GEMM
    # --- Fused GeLU Activation ---
    # The activation is applied element-wise to the intermediate result
    # while it is still in registers.
    # A fast approximation of GeLU is often used in kernels.
    activated_block = 0.5 * intermediate_block * (1 + tl.tanh(0.79788456 * intermediate_block * (1 + 0.044715 * intermediate_block * intermediate_block)))
    # --- Second Linear Layer (GEMM 2) ---
    # This part multiplies the activated block by the second weight matrix: output = activated_block @ w2
    ...
    output_block = ...  # Result of second GEMM
    # Store the final result
    tl.store(output_ptr + output_offsets, output_block)

Optimizing Diffusion Models

Diffusion models, which are becoming prevalent in image and video generation, typically rely on a U-Net architecture for the iterative denoising process.⁶⁰ This U-Net is composed of repeating blocks, which are prime targets for kernel fusion.

The most common computational pattern within a U-Net block is a sequence of Convolution -> Normalization -> Activation. For example, a block might contain a 1D convolution, followed by Group Normalization, and finally a Mish or SiLU activation function.³⁹ This sequence is memory-bound, as each step is relatively simple computationally but requires reading and writing large activation tensors.

A custom Triton kernel can fuse these three operations. A fused GroupNorm + Mish kernel is a non-trivial example that demonstrates more advanced parallel programming patterns. It requires a parallel reduction within each thread block to compute the mean and variance needed for the normalization step before applying the activation. This avoids writing the un-normalized and un-activated tensors to global memory.³⁹
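
The sketch below illustrates one way such a fused GroupNorm + Mish kernel can be structured in Triton. It is a simplified illustration rather than a production kernel: it assumes a contiguous float32 input in (batch, channels, length) layout, assumes all elements of one (sample, group) slice fit into a single block, and omits GroupNorm's learnable scale and shift. The kernel and wrapper names are illustrative.

import torch
import triton
import triton.language as tl

@triton.jit
def groupnorm_mish_kernel(
    x_ptr, out_ptr,
    group_size,                # number of elements in one (sample, group) slice
    eps,
    BLOCK_SIZE: tl.constexpr,  # power of two >= group_size
):
    # One program instance handles one contiguous (sample, group) slice.
    pid = tl.program_id(axis=0)
    offsets = pid * group_size + tl.arange(0, BLOCK_SIZE)
    mask = tl.arange(0, BLOCK_SIZE) < group_size
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0).to(tl.float32)

    # Parallel reduction within the block: mean and variance for this group.
    mean = tl.sum(x, axis=0) / group_size
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / group_size
    x_hat = (x - mean) / tl.sqrt(var + eps)

    # Mish applied while the normalized values are still in registers:
    # mish(x) = x * tanh(softplus(x)); tanh(z) is written as 2*sigmoid(2z) - 1.
    softplus = tl.log(1.0 + tl.exp(x_hat))
    y = x_hat * (2.0 * tl.sigmoid(2.0 * softplus) - 1.0)
    tl.store(out_ptr + offsets, y, mask=mask)

def groupnorm_mish(x: torch.Tensor, num_groups: int, eps: float = 1e-5) -> torch.Tensor:
    # x: (batch, channels, length), channels divisible by num_groups (assumed).
    b, c, l = x.shape
    group_size = (c // num_groups) * l
    x_flat = x.contiguous().view(-1)
    out = torch.empty_like(x_flat)
    grid = (b * num_groups,)
    groupnorm_mish_kernel[grid](x_flat, out, group_size, eps,
                                BLOCK_SIZE=triton.next_power_of_2(group_size))
    return out.view(b, c, l)

A single launch of this kind replaces the separate statistics, normalization, and activation kernels, so the un-normalized and un-activated tensors never touch global memory.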

Beyond kernel-level fusion, the overall denoising loop can also be optimized. In the standard process, timestep embeddings and various noise schedule constants are recomputed at every step of the reverse diffusion process. A higher-level optimization is to pre-compute all of these constants once at the beginning of inference and store them in a tensor on the GPU. The denoising kernel can then simply index into this tensor at each step, avoiding redundant computations and making the kernel itself lighter and faster.⁶²
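
A minimal sketch of this precomputation, assuming a simple DDPM-style linear beta schedule (the schedule and variable names are illustrative):

import torch

# Pre-compute every schedule constant once, before the denoising loop.
T = 1000
betas = torch.linspace(1e-4, 0.02, T, device="cuda")           # illustrative linear schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
sqrt_recip_alphas = torch.rsqrt(alphas)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)

# Inside the reverse-diffusion loop, each step becomes a cheap index lookup
# instead of a recomputation of the constants:
for t in reversed(range(T)):
    coef1 = sqrt_recip_alphas[t]
    coef2 = betas[t] / sqrt_one_minus_alphas_cumprod[t]
    # x = coef1 * (x - coef2 * predicted_noise) + ...  (denoising update, schematic)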

The Broader Toolkit: torch.compile and When to Go Custom

The landscape of performance optimization is evolving rapidly, and it is no longer a binary choice between a high-level framework and low-level CUDA. PyTorch's torch.compile sits in the middle of this spectrum: its TorchInductor backend automatically fuses many element-wise, normalization, and reduction operations (and itself emits Triton kernels), often capturing much of the benefit of hand-written kernels with a one-line change to the model code.

This leads to a clear, hierarchical workflow for the modern performance engineer. One should not default to writing custom kernels. The most productive path is to start with high-level tools like torch.compile, profile to find the true bottlenecks, write a Triton kernel when a specific fusion is not captured automatically, and drop down to CUDA only when Triton cannot express the required optimization. This workflow strikes a crucial balance between developer productivity and raw performance.

Table 3: Transformer Block Optimization Strategies

This table serves as a strategic map for optimizing a Transformer block, breaking it down by component and listing the state-of-the-art optimization strategy for each.

| Operation / Component | Fusion Strategy | Primary Benefit | Implementation Tool |
| --- | --- | --- | --- |
| Q, K, V Projections | Fused nn.Linear | Reduces 3 GEMMs to 1, improves GPU utilization | PyTorch (torch.nn.Linear(d_in, 3*d_out)) |
| Scaled Dot-Product Attention | Flash Attention | Eliminates O(N²) memory, drastically reduces HBM I/O | Triton/CUDA (e.g., flash-attn library) |
| MLP / FFN | Fused Linear-GELU-Linear | Reduces HBM I/O by keeping intermediate activations in SRAM | Triton, torch.compile |
| Layer Normalization | Fused with preceding op (e.g., residual add) | Reduces HBM I/O | Triton, torch.compile, DeepSpeed Kernels |
| Entire Transformer Layer | Megakernel / Persistent Kernel | Eliminates all kernel launch overhead within the layer | Custom CUDA/Triton, TensorRT-LLM |

Conclusion and Future Directions

The journey from understanding GPU hardware to implementing end-to-end fused kernels in Triton reveals a clear set of principles for achieving high performance in modern AI. The landscape is defined by the constant tension between immense computational power and the limitations of memory bandwidth. Custom kernels, particularly those written with a focus on fusion and data locality, are the primary tool for resolving this tension.

A Practical Roadmap for the ML Practitioner

For a data scientist or machine learning engineer looking to master high-performance computing, the path forward is not to abandon high-level frameworks for raw CUDA, but to adopt a tiered, strategic approach: stay within PyTorch and torch.compile for as long as possible, use profiling to identify the genuinely memory-bound hotspots, reach for Triton to fuse the operations the compiler misses, and treat hand-written CUDA as a last resort for problems that demand direct control over the hardware.

The Future of High-Performance AI

The field of AI performance engineering is rapidly evolving, with several key trends shaping its future: attention kernels that exploit new hardware features such as asynchrony and low-precision formats (as in FlashAttention-3), compiler stacks like torch.compile and TensorRT-LLM that automate ever more of the fusion work, and megakernel-style approaches that compile entire models into a handful of persistent kernels to drive launch overhead toward zero.

Ultimately, learning CUDA and Triton is an investment in understanding the fundamental principles of high-performance computing. This knowledge of how to reason about parallelism, data locality, and memory hierarchies is a durable skill that will remain valuable long after today’s specific models and frameworks have been superseded. As AI continues to push the limits of what is computationally feasible, the ability to bridge the gap between abstract algorithms and physical hardware will only grow in importance.

Works cited

  1. Understanding CUDA in Computing: A Comprehensive Guide | Lenovo US, accessed July 17, 2025, https://www.lenovo.com/us/en/glossary/what-is-the-cuba-toolkit/

  2. What is the best way for beginners to learn CUDA and parallel computing with GPUs?, accessed July 17, 2025, https://www.quora.com/What-is-the-best-way-for-beginners-to-learn-CUDA-and-parallel-computing-with-GPUs

  3. What Is CUDA? - Supermicro, accessed July 17, 2025, https://www.supermicro.com/en/glossary/cuda

  4. Introduction to CUDA Programming - GeeksforGeeks, accessed July 17, 2025, https://www.geeksforgeeks.org/electronics-engineering/introduction-to-cuda-programming/

  5. CUDA Refresher: The CUDA Programming Model | NVIDIA Technical Blog, accessed July 17, 2025, https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/

  6. Simplifying CUDA kernels with Triton: A Pythonic Approach to GPU Programming, accessed July 17, 2025, https://arunjitha.medium.com/simplifying-cuda-kernels-with-triton-a-pythonic-approach-to-gpu-programming-79bb7121e974

  7. Flash attention(Fast and Memory-Efficient Exact Attention with IO-Awareness): A deep dive, accessed July 17, 2025, https://towardsdatascience.com/flash-attention-fast-and-memory-efficient-exact-attention-with-io-awareness-a-deep-dive-724af489997b/

  8. FLASHATTENTION: Fast and Memory-Efficient Exact Attention with IO-Awareness - OpenReview, accessed July 17, 2025, https://openreview.net/pdf?id=H4DqfPSibmx

  9. 1 CUDA Programming Model, accessed July 17, 2025, https://www.eng.utah.edu/~cs5610/lectures/Programming_Models_for_GPU_Architecture%20CUDA.pdf

  10. en.wikipedia.org, accessed July 17, 2025, https://en.wikipedia.org/wiki/CUDA

  11. kst179/fused-attention: Fast and low-memory attention layer … - GitHub, accessed July 17, 2025, https://github.com/kst179/fused-attention

  12. CUDA Zone - Library of Resources - NVIDIA Developer, accessed July 17, 2025, https://developer.nvidia.com/cuda-zone

  13. What is the CUDA Programming Model? | GPU Glossary - Modal, accessed July 17, 2025, https://modal.com/gpu-glossary/device-software/cuda-programming-model

  14. Tutorial 01: Say Hello to CUDA, accessed July 17, 2025, https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial01/

  15. An Even Easier Introduction to CUDA (Updated) | NVIDIA Technical Blog, accessed July 17, 2025, https://developer.nvidia.com/blog/even-easier-introduction-cuda/

  16. CUDA Programming - Wolfram Language Documentation, accessed July 17, 2025, https://reference.wolfram.com/language/CUDALink/tutorial/Programming.html

  17. CUDA Programming Model — MolSSI GPU Programming Fundamentals documentation, accessed July 17, 2025, https://education.molssi.org/gpu_programming_beginner/03-cuda-program-model.html

  18. CUDA Basic Example - Vector Addition Explanation - eunomia, accessed July 17, 2025, https://eunomia.dev/others/cuda-tutorial/01-vector-addition/

  19. Vector Addition “Hello World!” Example with CUDA on Mac OSX …, accessed July 17, 2025, https://www.quantstart.com/articles/Vector-Addition-Hello-World-Example-with-CUDA-on-Mac-OSX/

  20. olcf-tutorials/vector_addition_cuda: A simple CUDA vector addition program - GitHub, accessed July 17, 2025, https://github.com/olcf-tutorials/vector_addition_cuda

  21. 4.4 Example: Vector Addition — Parallel Computing for Beginners, accessed July 17, 2025, https://www.learnpdc.org/PDCBeginners/4-cuda/4-VectorAdd.html

  22. Tutorial 02: CUDA in Actions, accessed July 17, 2025, https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial02/

  23. Introducing Triton: Open-Source GPU Programming for Neural Networks, accessed July 17, 2025, https://aimersociety.com/introducing-triton-open-source-gpu-programming-for-neural-networks/

  24. How Is OpenAI’s Triton Different From NVIDIA CUDA? - Analytics India Magazine, accessed July 17, 2025, https://analyticsindiamag.com/global-tech/how-is-openais-triton-different-from-nvidia-cuda/

  25. triton-lang.org, accessed July 17, 2025, https://triton-lang.org/#:~:text=Triton%20is%20a%20language%20and,throughput%20on%20modern%20GPU%20hardware.

  26. Democratizing AI Accelerators and GPU Kernel Programming using Triton, accessed July 17, 2025, https://next.redhat.com/2024/11/07/democratizing-ai-accelerators-and-gpu-kernel-programming-using-triton/

  27. Exploring Triton GPU programming for neural networks in Java - OpenJDK, accessed July 17, 2025, https://openjdk.org/projects/babylon/articles/triton

  28. Getting Started with Triton: A Step-by-Step Tutorial - Medium, accessed July 17, 2025, https://medium.com/ai-insights-cobet/getting-started-with-triton-a-step-by-step-tutorial-ddc18a186295

  29. Liger Kernel: Efficient Triton Kernels for LLM Training - arXiv, accessed July 17, 2025, https://arxiv.org/html/2410.10989v3

  30. [D] usefulness of learning CUDA/triton : r/MachineLearning - Reddit, accessed July 17, 2025, https://www.reddit.com/r/MachineLearning/comments/1kewrqc/d_usefulness_of_learning_cudatriton/

  31. Triton — GPU Programming for Neural Networks | by Dhananjay Kumar - Medium, accessed July 17, 2025, https://dhnanjay.medium.com/triton-gpu-programming-for-neural-networks-16271d729f78

  32. Introduction - Triton documentation, accessed July 17, 2025, https://triton-lang.org/main/programming-guide/chapter-1/introduction.html

  33. GPU MODE Lecture 14: Practitioners Guide to Triton - Christian Mills, accessed July 17, 2025, https://christianjmills.com/posts/cuda-mode-notes/lecture-014/

  34. triton_tutorial/02_vector_addition.ipynb at master - GitHub, accessed July 17, 2025, https://github.com/VikParuchuri/triton_tutorial/blob/master/02_vector_addition.ipynb

  35. Vector Addition — Triton documentation, accessed July 17, 2025, https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html

  36. Triton Vector Addition Kernel, part 1: Making the Shift to Parallel Programming - YouTube, accessed July 17, 2025, https://www.youtube.com/watch?v=MEZ7XhzTLEg

  37. lectures/lecture_014/A_Practitioners_Guide_to_Triton.ipynb at main - GitHub, accessed July 17, 2025, https://github.com/gpu-mode/lectures/blob/main/lecture_014/A_Practitioners_Guide_to_Triton.ipynb

  38. How Nvidia’s CUDA Monopoly In Machine Learning Is Breaking – OpenAI Triton And PyTorch 2.0 – SemiAnalysis, accessed July 17, 2025, https://semianalysis.com/2023/01/16/nvidiaopenaitritonpytorch/

  39. Flash Attention - Insu Jang, accessed July 17, 2025, https://insujang.github.io/2024-01-21/flash-attention/

  40. OpenAI’s Triton: An end to end example | by Michael Diggin | Medium, accessed July 17, 2025, https://medium.com/@michael.diggin/openais-triton-an-end-to-end-example-c6577d81e3d0

  41. Kernel Fusion - Steven Gong, accessed July 17, 2025, https://stevengong.co/notes/Kernel-Fusion

  42. Part VI - Kernel Fusion in CUDA - Vrushank Desai, accessed July 17, 2025, https://www.vrushankdes.ai/diffusion-policy-inference-optimization/part-vi---kernel-fusion-in-cuda

  43. 31. Kernel Fusion - Aussie AI, accessed July 17, 2025, https://www.aussieai.com/book/ch31-kernel-fusion

  44. Kernel Operator Fusion - Aussie AI, accessed July 17, 2025, https://www.aussieai.com/research/kernel-fusion

  45. Kernel Fusion: A Smart Way to Enhance Neural Networks Performance - abhik.xyz, accessed July 17, 2025, https://www.abhik.xyz/articles/kernel-fusion

  46. DeepSpeed Transformer Kernel - DeepSpeed, accessed July 17, 2025, https://www.deepspeed.ai/tutorials/transformer_kernel/

  47. FlashAttention: Implementing High-Performance Attention with CUDA and Triton - Medium, accessed July 17, 2025, https://medium.com/@kimdoil1211/flashattention-implementing-high-performance-attention-with-cuda-and-triton-9ee635ab1200

  48. Tutorial 6: Transformers and Multi-Head Attention — UvA DL Notebooks v1.2 documentation, accessed July 17, 2025, https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html

  49. Parallelizing Multi-Head Attention on GPUs - Hemil Desai, accessed July 17, 2025, https://hd10.dev/posts/my-interests-2/cs259.pdf

  50. Multi-Head Attention From Scratch - Sanjaya’s Blog, accessed July 17, 2025, https://sanjayasubedi.com.np/deeplearning/multihead-attention-from-scratch/

  51. Introduction to Flash Attention: A Breakthrough in Efficient Attention Mechanism | by Sthanikam Santhosh | Medium, accessed July 17, 2025, https://medium.com/@sthanikamsanthosh1994/introduction-to-flash-attention-a-breakthrough-in-efficient-attention-mechanism-3eb47e8962c3

  52. Flash Attention: Revolutionizing Transformer Efficiency - Unite.AI, accessed July 17, 2025, https://www.unite.ai/flash-attention-revolutionizing-transformer-efficiency/

  53. LLMs-from-scratch/ch03/02_bonus_efficient-multihead-attention/mha-implementations.ipynb at main - GitHub, accessed July 17, 2025, https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/02_bonus_efficient-multihead-attention/mha-implementations.ipynb

  54. FlashAttention: Fast and Memory-Efficient Exact … - deepsense.ai, accessed July 17, 2025, https://deepsense.ai/wp-content/uploads/2023/04/2205.14135.pdf

  55. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. - MIT MLSys Discussion Group, accessed July 17, 2025, https://www.mlsys.ai/papers/flash_attention.html

  56. Understanding Flash Attention: Writing the Algorithm from Scratch in Triton, accessed July 17, 2025, https://towardsdatascience.com/understanding-flash-attention-writing-the-algorithm-from-scratch-in-triton-5609f0b143ea/

  57. FLASH ATTENTION: Fast and Memory-Efficient Exact Attention with IO-Awareness: Paper Review | by Sulbha Jain | May, 2025 | Medium, accessed July 17, 2025, https://medium.com/@sulbha.jindal/flash-attention-fast-and-memory-efficient-exact-attention-with-io-awareness-paper-review-79639127c5de

  58. Fused Attention - Triton documentation, accessed July 17, 2025, https://triton-lang.org/main/getting-started/tutorials/06-fused-attention.html

  59. Tutorial #17: Transformers III Training - Research Blog | RBC Borealis, accessed July 17, 2025, https://rbcborealis.com/research-blogs/tutorial-17-transformers-iii-training/

  60. Compiling LLMs into a MegaKernel: A Path to Low-Latency …, accessed July 17, 2025, https://zhihaojia.medium.com/compiling-llms-into-a-megakernel-a-path-to-low-latency-inference-cf7840913c17

  61. flash-attention/training/README.md at main · Dao-AILab/flash …, accessed July 17, 2025, https://github.com/HazyResearch/flash-attention/blob/main/training/README.md

  62. Custom diffusion model with PyTorch — Tutorials for AI developers 4.0, accessed July 17, 2025, https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/pretrain/ddim_pretrain.html

  63. Ultimate guide to optimizing Stable Diffusion XL - Félix Sanz, accessed July 17, 2025, https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl

  64. Part VII - A Dive Into DDPMs & CUDA kernel for Denoising - Vrushank Desai, accessed July 17, 2025, https://www.vrushankdes.ai/diffusion-policy-inference-optimization/part-vii---a-dive-into-ddpms-cuda-kernel-for-denoising

  65. Liger Kernel: Efficient Triton Kernels for LLM Training - arXiv, accessed July 17, 2025, https://arxiv.org/html/2410.10989v2

  66. Accelerating PyTorch Transformers by replacing nn.Transformer with Nested Tensors and torch.compile() — PyTorch Tutorials 2.7.0+cu126 documentation, accessed July 17, 2025, https://docs.pytorch.org/tutorials/intermediate/transformer_building_blocks.html

  67. Accelerating Generative AI Part III: Diffusion, Fast – PyTorch, accessed July 17, 2025, https://pytorch.org/blog/accelerating-generative-ai-3/

  68. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision - PyTorch, accessed July 17, 2025, https://pytorch.org/blog/flashattention-3/

  69. How do you optimize GPU utilization during diffusion model training? - Milvus, accessed July 17, 2025, https://milvus.io/ai-quick-reference/how-do-you-optimize-gpu-utilization-during-diffusion-model-training

  70. What is Flash Attention? | Modal Blog, accessed July 17, 2025, https://modal.com/blog/flash-attention-article

  71. Optimizing Transformer-Based Diffusion Models for Video Generation with NVIDIA TensorRT, accessed July 17, 2025, https://developer.nvidia.com/blog/optimizing-transformer-based-diffusion-models-for-video-generation-with-nvidia-tensorrt/

  72. Run High-Performance LLM Inference Kernels from NVIDIA Using …, accessed July 17, 2025, https://developer.nvidia.com/blog/run-high-performance-llm-inference-kernels-from-nvidia-using-flashinfer/

