
From Megabytes to Megawatts: A Comprehensive Guide to High-Performance LLM and Diffusion Kernels with CUDA and Triton


All content is generated by an LLM; please exercise discretion.

The GPU Computing Paradigm: Unlocking Parallel Performance

The ascent of deep learning, particularly Large Language Models (LLMs) and Diffusion Models, is inextricably linked to the evolution of the Graphics Processing Unit (GPU). Originally designed for graphics rendering, GPUs have become the de facto hardware for AI due to their massively parallel architecture.¹ However, harnessing this power requires more than simply running code on a GPU; it demands a fundamental understanding of the hardware’s architecture and the programming models designed to exploit it. This section establishes the foundational principles of GPU computing, moving from the physical hardware to the abstract programming model, providing the necessary context for understanding why certain programming patterns are effective for high-performance computing.

The Anatomy of a Modern GPU: A Parallel Powerhouse

A modern GPU is not merely a faster version of a Central Processing Unit (CPU); its architecture is fundamentally different, optimized for data parallelism rather than sequential task execution.² Where a CPU has a few powerful cores designed for low-latency, single-threaded performance, a GPU contains thousands of simpler cores designed for high-throughput, parallel computation.³ To write efficient GPU code, one must understand its key components.

The CUDA Programming Model: Abstracting the Hardware

NVIDIA’s Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model that provides a software layer to abstract the physical GPU hardware.³ It allows developers to use languages like C++, Python, and Fortran to write programs that can harness the GPU’s parallel processing power.¹ The CUDA model is built on a few key abstractions that map directly to the hardware architecture.¹¹

Writing Your First CUDA Kernel: Vector Addition

The canonical “Hello, World!” of parallel computing is vector addition, C = A + B, where A, B, and C are large vectors.¹⁶ Implementing this in CUDA illustrates the fundamental workflow and programming model. The process involves the following steps ⁵:

  1. Host-Side Preparation: On the CPU, allocate memory for the host vectors h_A, h_B, and h_C using malloc() and initialize the input vectors h_A and h_B.
  2. Device Memory Allocation: On the GPU, allocate memory for the device vectors d_A, d_B, and d_C using cudaMalloc().¹²
  3. Host-to-Device Data Transfer: Copy the input data from the host vectors to their device counterparts using cudaMemcpy() with the direction cudaMemcpyHostToDevice.⁵
  4. Kernel Launch: Launch the add_kernel on the GPU. This is done using the triple-chevron syntax <<<gridDim, blockDim>>>. The blockDim is typically set to a multiple of 32 (e.g., 256 or 512) for efficiency. The gridDim is then calculated to ensure there are enough total threads to cover all N elements of the vectors, often using a ceiling division: (N + blockDim.x - 1) / blockDim.x.¹⁵
  5. Kernel Execution: The GPU executes the kernel code on all launched threads in parallel. Inside the kernel, each thread calculates its unique global index and performs one element-wise addition. A boundary check is essential to prevent threads from writing past the end of the arrays if N is not a perfect multiple of the block size.¹⁷
  6. Host-Device Synchronization: The host CPU must wait for the GPU to complete its computation before using the results. This is achieved by calling cudaDeviceSynchronize(), which blocks the CPU thread until all previously issued GPU commands have finished.¹⁸
  7. Device-to-Host Data Transfer: Copy the result vector d_C from device memory back to the host vector h_C using cudaMemcpy() with the direction cudaMemcpyDeviceToHost.⁵
  8. Cleanup: Free the allocated memory on both the device (using cudaFree()) and the host (using free()) to prevent memory leaks.¹²

The following code demonstrates a complete vector addition program in CUDA C++. Source code files with CUDA code typically use the .cu extension and are compiled with NVIDIA’s nvcc compiler (nvcc vector_add.cu -o vector_add).¹²

#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cmath>
// Error checking wrapper for CUDA API calls
void cudaCheck(cudaError_t error, const char *file, int line) {
    if (error != cudaSuccess) {
        printf("CUDA error at %s:%d: %s\n", file, line, cudaGetErrorString(error));
        exit(EXIT_FAILURE);
    }
}
#define CUDA_CHECK(err) (cudaCheck(err, __FILE__, __LINE__))
// CUDA Kernel to add two vectors
__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
    // Calculate the unique global index for this thread
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Boundary check: ensure the thread index is within the array bounds
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}
int main() {
    int N = 1 << 20; // 1,048,576 elements
    size_t size = N * sizeof(float);
    // 1. Allocate host memory
    float* h_a = (float*)malloc(size);
    float* h_b = (float*)malloc(size);
    float* h_c = (float*)malloc(size);
    // Initialize host vectors
    for (int i = 0; i < N; ++i) {
        h_a[i] = sin(i) * sin(i);
        h_b[i] = cos(i) * cos(i);
    }
    // 2. Allocate device memory
    float *d_a, *d_b, *d_c;
    CUDA_CHECK(cudaMalloc(&d_a, size));
    CUDA_CHECK(cudaMalloc(&d_b, size));
    CUDA_CHECK(cudaMalloc(&d_c, size));
    // 3. Copy data from host to device
    CUDA_CHECK(cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice));
    // 4. Launch the kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    add_kernel<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);
    // 5. Synchronize to wait for the kernel to finish
    CUDA_CHECK(cudaGetLastError()); // Check for kernel launch errors
    CUDA_CHECK(cudaDeviceSynchronize());
    // 6. Copy result from device to host
    CUDA_CHECK(cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost));
    // Verify the result on the host
    float maxError = 0.0f;
    for (int i = 0; i < N; ++i) {
        maxError = fmax(maxError, fabs(h_c[i] - 1.0f));
    }
    std::cout << "Max error: " << maxError << std::endl;
    // 7. Free memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);
    return 0;
}

Triton: High-Productivity GPU Programming in Python

While CUDA provides ultimate control over the GPU, its complexity presents a significant barrier. Writing efficient CUDA code requires deep hardware expertise and is a time-consuming, error-prone process.²¹ In the fast-paced world of AI research and development, a more productive solution is needed. OpenAI’s Triton emerges to fill this role, offering a high-level, Python-based path to high-performance GPU computing.

Bridging the Gap: Why Triton Exists

Triton is an open-source language and compiler designed to make GPU programming more accessible, especially for the AI community.²³ Its primary goal is to enable developers, even those with no prior CUDA experience, to write custom compute kernels that achieve performance comparable to that of expert-tuned CUDA code, but with a fraction of the effort.²¹ The magic of Triton lies in its just-in-time (JIT) compiler. A developer writes kernel logic in a high-level, Python-like syntax. The Triton compiler then takes this code and automatically handles many of the most challenging aspects of GPU programming, which would otherwise require painstaking manual optimization in CUDA, such as memory coalescing, shared-memory management, and the scheduling of work within each streaming multiprocessor.²¹

This high level of automation represents a democratizing force in high-performance computing. Historically, kernel optimization was the exclusive domain of a small number of HPC specialists. The explosion of LLMs and other large AI models created an insatiable demand for performance optimization to manage training and inference costs.²⁷ Triton empowers a much broader audience of data scientists and ML engineers, who are already proficient in Python, to write their own high-performance kernels, bridging the gap between high-level frameworks like PyTorch and low-level CUDA.²⁴ This fosters a more open and agile ecosystem where performance optimization is no longer a bottleneck reserved for a few large corporations.²⁵

The Triton Programming Model: Block-Level Abstraction

Triton’s programming model is a key differentiator from CUDA. It revisits the Single Program, Multiple Data (SPMD) paradigm but elevates the level of abstraction. Instead of writing a program for a single scalar thread, as in CUDA, a Triton developer writes a blocked program that operates on entire tiles (or blocks) of data at once.²¹ The notion of individual threads within a block is abstracted away and managed by the compiler.

The core concepts of the Triton programming model, all of which appear in the vector-addition kernel in the next section, are the program instance (identified at runtime via tl.program_id), block-level loads and stores (tl.load and tl.store over a tensor of pointers, guarded by a boundary mask), and compile-time meta-parameters declared as tl.constexpr, such as the block size.

A “Hello, World!” in Triton: Vector Addition Revisited

Reimplementing vector addition in Triton starkly illustrates its simplicity and higher level of abstraction compared to CUDA.

The workflow consists of two parts: a Python host function to launch the kernel and the Triton kernel itself.

  1. Python Host Wrapper (add function): This function orchestrates the kernel launch. It takes PyTorch tensors as input, pre-allocates the output tensor, defines the launch grid (calculating how many program instances are needed), and launches the kernel with the appropriate arguments and meta-parameters.³²
  2. Triton Kernel (add_kernel function): This function, decorated with @triton.jit, contains the core logic. It calculates its program ID, computes the offsets for its data block, creates a mask for boundary safety, and then uses tl.load, vectorized arithmetic, and tl.store to perform the addition on its assigned tile of data.²⁴

The complete, annotated Python code for vector addition in Triton is as follows:

import torch
import triton
import triton.language as tl
@triton.jit
def add_kernel(
    x_ptr,  # Pointer to the first input vector
    y_ptr,  # Pointer to the second input vector
    output_ptr,  # Pointer to the output vector
    n_elements,  # Total number of elements in the vector
    BLOCK_SIZE: tl.constexpr,  # Number of elements each program instance will process
):
    """
    Triton kernel for element-wise vector addition.
    Each instance of this kernel computes a block of the output vector.
    """
    # 1. Get the program ID (pid) for this instance.
    # This is the index of the program instance in the 1D grid.
    pid = tl.program_id(axis=0)
    # 2. Calculate the offsets for the elements this program will process.
    # block_start is the starting index for this program's block.
    block_start = pid * BLOCK_SIZE
    # offsets is a vector of indices for the entire block.
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # 3. Create a mask to prevent out-of-bounds memory access.
    # This is crucial when n_elements is not a multiple of BLOCK_SIZE.
    mask = offsets < n_elements
    # 4. Load the data blocks from global memory (DRAM) into registers (SRAM).
    # The mask ensures that we only load valid data.
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # 5. Perform the element-wise addition on the loaded blocks.
    output = x + y
    # 6. Store the result block back to global memory.
    # The mask ensures we only write to valid memory locations.
    tl.store(output_ptr + offsets, output, mask=mask)
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """
    Host-side wrapper function to launch the Triton kernel for vector addition.
    """
    # Pre-allocate the output tensor on the same device as the inputs.
    output = torch.empty_like(x)
    assert x.is_cuda and y.is_cuda and output.is_cuda
    n_elements = output.numel()
    # Define the launch grid. It's a 1D grid of programs.
    # The size of the grid is the number of blocks needed to cover all elements.
    # triton.cdiv performs ceiling division.
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    # Launch the kernel.
    # The grid is passed in, and meta-parameters like BLOCK_SIZE are keyword arguments.
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output
# --- Verification ---
torch.manual_seed(0)
size = 98432
x = torch.rand(size, device='cuda')
y = torch.rand(size, device='cuda')
output_torch = x + y
output_triton = add(x, y)
print(f"PyTorch output: {output_torch}")
print(f"Triton output:  {output_triton}")
print(f"The maximum difference is {torch.max(torch.abs(output_torch - output_triton))}")

The stark contrast in complexity between the CUDA and Triton implementations highlights Triton’s value. The Triton version is not only shorter and written in a more familiar Python syntax, but it also abstracts away the most difficult parts of GPU programming, such as manual memory allocation, thread indexing logic, and shared memory management.

Table 1: CUDA vs. Triton - A Comparative Overview

To summarize the relationship between these two powerful tools, the following table provides a high-level comparison. It frames them not as direct competitors for all tasks, but as complementary tools in a performance optimization toolkit.

| Feature | CUDA | Triton |
| --- | --- | --- |
| Programming Language | C/C++ with extensions | Python-like DSL |
| Abstraction Level | Low-level, hardware-centric | High-level, block-centric |
| Thread Management | Manual (threads, warps, blocks) | Automatic (managed by compiler) |
| Memory Management | Manual (cudaMalloc, shared memory via __shared__) | Automatic (compiler optimizes SRAM usage) |
| Learning Curve | Steep, requires deep hardware knowledge | Moderate, familiar for Python developers |
| Debugging | Complex (Nsight, printf) | Easier (CPU simulator via TRITON_INTERPRET=1) |
| Performance | Potential for absolute peak performance | Often on par with expert CUDA; easier to achieve good performance |
| Typical Use Case | Novel hardware features, complex non-standard kernels | Fusing common DL ops, rapid prototyping, general kernel development |

The Bottleneck: Why Custom Kernels are Essential for Modern AI

While deep learning frameworks like PyTorch and TensorFlow provide highly optimized building blocks, their standard execution model often leaves significant performance on the table. To understand why, one must differentiate between operations limited by computation and those limited by memory, and recognize how frameworks handle them. This understanding provides the critical motivation for learning to write custom CUDA and Triton kernels.

Memory-Bound vs. Compute-Bound Operations

The performance of any GPU operation is ultimately constrained by one of two factors: the rate at which it can perform calculations (compute-bound) or the rate at which it can fetch data from memory (memory-bound).³⁵
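
To make this concrete, a quick back-of-the-envelope calculation shows why an operation like vector addition is firmly memory-bound. The hardware numbers below are rough, A100-class figures and are used purely for illustration.

# Back-of-the-envelope arithmetic intensity for vector addition (C = A + B).
# Hardware numbers are approximate A100-class figures, for illustration only.
N = 1 << 20
flops = N                          # one add per element
bytes_moved = 3 * N * 4            # read A, read B, write C (float32)
intensity = flops / bytes_moved    # ~0.083 FLOP per byte

peak_flops = 19.5e12               # ~19.5 TFLOP/s FP32 (approximate)
peak_bw = 1.5e12                   # ~1.5 TB/s HBM bandwidth (approximate)
ridge_point = peak_flops / peak_bw # ~13 FLOP/byte needed to become compute-bound

print(f"Arithmetic intensity: {intensity:.3f} FLOP/byte")
print(f"Ridge point:          {ridge_point:.1f} FLOP/byte")
# Since 0.083 is far below 13, vector addition spends almost all of its time
# waiting on HBM rather than doing arithmetic: it is memory-bound.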

The fundamental performance problem with standard deep learning frameworks is that they typically execute each operation in a model as a separate, distinct kernel launch. Consider a simple sequence found in many models: a linear layer followed by a ReLU activation. A naive implementation would execute this as:

  1. Launch a GEMM kernel for the matrix multiplication. Write the result tensor to HBM.
  2. Launch an element-wise kernel for the ReLU activation. This kernel reads the result tensor from HBM, applies the ReLU, and writes the final output back to HBM.

This sequence is incredibly inefficient. The intermediate tensor is written all the way out to slow global memory only to be immediately read back in by the next kernel. This redundant memory traffic is a major source of performance loss.³⁷

The Tyranny of DRAM and Kernel Launch Overhead

The inefficiency of the standard execution model is rooted in two hardware realities. The first is the HBM bottleneck: global memory bandwidth is orders of magnitude lower than that of on-chip registers and shared memory, so every unnecessary round trip to HBM leaves the compute units idle. The second is kernel launch overhead: each launch carries a fixed CPU-side dispatch cost on the order of microseconds, which accumulates quickly when a model is expressed as thousands of small, short-lived kernels.

The Power of Kernel Fusion

Kernel fusion is the primary optimization technique used to combat these inefficiencies. It involves combining multiple sequential operations into a single, larger GPU kernel.³⁸

Instead of writing intermediate results to HBM, a fused kernel keeps them in the GPU’s fast on-chip registers or shared memory. In the fused Linear -> ReLU example, a thread block would compute a tile of the matrix multiplication result, immediately apply the ReLU activation to that tile while it is still in registers, and only then write the final, activated result to HBM.
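
The same principle can be shown with a much simpler pair of operations. The following Triton kernel is a minimal sketch (the kernel and wrapper names are illustrative, and the bias is assumed to have the same shape as the input) that fuses a bias addition with a ReLU into a single launch, so the intermediate sum only ever exists in registers.

import torch
import triton
import triton.language as tl

@triton.jit
def fused_bias_relu_kernel(x_ptr, bias_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    b = tl.load(bias_ptr + offsets, mask=mask)
    # The intermediate (x + b) lives only in registers; it is never written to HBM.
    y = tl.maximum(x + b, 0.0)
    tl.store(out_ptr + offsets, y, mask=mask)

def fused_bias_relu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_bias_relu_kernel[grid](x, bias, out, n, BLOCK_SIZE=1024)
    return out

An unfused version would launch two kernels and write the intermediate tensor to HBM between them; the fused version reads each input once and writes the output once.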

The benefits of this approach are profound and directly address the core bottlenecks ⁴⁰: intermediate results never leave on-chip memory, so HBM traffic drops sharply; fewer kernels are launched, so the fixed launch overhead shrinks; and the remaining kernel performs more useful work per byte loaded from memory, raising its arithmetic intensity.

Kernel fusion is the cornerstone of modern AI performance optimization. It is the fundamental principle that motivates the development of custom kernels. Sophisticated algorithms like Flash Attention are, at their core, an advanced application of kernel fusion for the specific sequence of operations in the attention mechanism.³⁶ Libraries like DeepSpeed’s Transformer Kernel and Liger-Kernel are essentially collections of pre-written, highly optimized fused kernels for common Transformer components.²⁷ Therefore, learning to write custom kernels in CUDA or Triton is fundamentally about learning how to identify and implement kernel fusion. It is the most critical optimization technique for the memory-bound workloads that constitute the majority of operations in LLMs and Diffusion models.

Optimizing the Transformer: From Multi-Head Attention to Flash Attention

The Transformer architecture, and specifically its self-attention mechanism, is the heart of modern LLMs. It is also a major performance bottleneck due to its computational and memory complexity. This section provides a practical, code-centric deep dive into optimizing this critical component, progressing from a naive implementation to the state-of-the-art Flash Attention algorithm.

Deconstructing Multi-Head Attention (MHA)

The attention mechanism allows a model to weigh the importance of different tokens in a sequence when producing a representation for a specific token. The most common variant is Scaled Dot-Product Attention, defined by the formula: $$ Attention(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right)V $$ Here, Q (Query), K (Key), and V (Value) are matrices derived from the input sequence, and d_k is the dimension of the key vectors. Multi-Head Attention (MHA) enhances this by running the attention mechanism multiple times in parallel with different, learned linear projections, and then concatenating the results. This allows the model to jointly attend to information from different representational subspaces.⁴⁵

A standard implementation in PyTorch using nn.Module typically involves the following inefficient steps:

  1. Three separate nn.Linear layers to project the input into Q, K, and V.
  2. Reshaping and transposing the tensors to separate the attention heads.
  3. A batched matrix multiplication (torch.bmm or @) to compute the raw scores, QKᵀ.
  4. Applying a causal or padding mask to the scores.
  5. Applying the softmax function to get attention weights.
  6. A second batched matrix multiplication to apply the weights to the V matrix.
  7. Transposing and reshaping the output back to the original tensor format.

The critical inefficiency here is the materialization of the intermediate attention score matrix S = QKᵀ. This matrix has dimensions (SequenceLength, SequenceLength). For a sequence of length N, this requires O(N²) memory, which becomes prohibitively large for the long contexts used in modern LLMs. Storing and retrieving this large matrix from HBM is a massive performance bottleneck.⁴⁷
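
For reference, the core of such a naive implementation can be sketched in a few lines of PyTorch (masking, dropout, and the head reshaping are omitted); the scores tensor below is exactly the O(N²) intermediate in question.

import math
import torch

def naive_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, d_head)
    d_k = q.size(-1)
    # The (seq_len, seq_len) score matrix is materialized in HBM for every
    # batch element and head -- this is the O(N^2) memory bottleneck.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v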

A Fused MHA Kernel in Triton

As a first optimization step, we can use Triton to fuse several MHA components, reducing HBM traffic and kernel launch overhead. This serves as a practical stepping stone before tackling the more complex Flash Attention.

Our fusion strategy involves two main parts:

  1. Fused QKV Projection: Instead of three separate linear layers for Q, K, and V, we can use a single, wider nn.Linear layer that projects the input to a tensor three times the size, and then split it. This reduces three separate GEMM operations into one larger, more efficient GEMM.⁵⁰ (A minimal sketch of this projection appears after this list.)
  2. Fused Attention Core: We can write a single Triton kernel that performs the core attention computation: softmax(QKᵀ)V. This kernel will load blocks of Q, K, and V, compute the dot product, apply softmax, and multiply by V, all within a single launch. This avoids materializing the full (N, N) attention matrix in global memory.
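
A minimal sketch of the first strategy, the fused QKV projection (the dimensions here are illustrative):

import torch
import torch.nn as nn

d_model = 768  # illustrative model dimension
qkv_proj = nn.Linear(d_model, 3 * d_model)   # one wide projection instead of three

x = torch.randn(2, 128, d_model)             # (batch, seq_len, d_model)
q, k, v = qkv_proj(x).chunk(3, dim=-1)       # a single GEMM, followed by a cheap split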

A Triton kernel for this fused attention core would look conceptually like this:

@triton.jit
def fused_attn_kernel(Q_ptr, K_ptr, V_ptr, O_ptr, ...):
    # 1. Use 2D program IDs to select an output block
    pid_m = tl.program_id(axis=0) # Block row index
    pid_n = tl.program_id(axis=1) # Block col index (not needed for this simplified version)
    # 2. Initialize accumulator for the output block
    acc = tl.zeros((BLOCK_SIZE_M, D_HEAD), dtype=tl.float32)
    # 3. Load a block of Q
    q_offsets = ...
    q = tl.load(Q_ptr + q_offsets)
    # 4. Loop over blocks of K and V
    for k_block_idx in range(0, N_CTX, BLOCK_SIZE_K):
        # Load a block of K
        k_offsets = ...
        k = tl.load(K_ptr + k_offsets)
        # Compute a block of the score matrix S = QK^T
        s_ij = tl.dot(q, k)
        # (This is where Flash Attention's online softmax would go)
        # For a simpler fused kernel, we might compute a block of P
        p_ij = tl.softmax(s_ij)
        # Load a block of V
        v_offsets = ...
        v = tl.load(V_ptr + v_offsets)
        # Compute a block of the output O = PV
        acc += tl.dot(p_ij, v)
    # 5. Store the final output block
    o_offsets = ...
    tl.store(O_ptr + o_offsets, acc)

While this simplified structure illustrates the concept, it still faces the challenge of computing softmax correctly without seeing the entire row of scores. This is precisely the problem that Flash Attention solves.

The Flash Attention Revolution

Flash Attention is an IO-aware, exact attention algorithm that fundamentally reorders the computation to avoid the O(N²) memory bottleneck.⁵¹ It is a canonical example of algorithm-hardware co-design, where the algorithm was explicitly created to match the memory hierarchy of modern GPUs.

The key techniques are tiling and online softmax. Q, K, and V are split into blocks sized to fit in on-chip SRAM, and each block of the output is accumulated as the kernel streams through the corresponding blocks of K and V, so the full (N, N) score matrix is never materialized in HBM. Because softmax normally requires an entire row of scores, the kernel instead maintains a running row-wise maximum m and normalizer l, updating them as each new block of scores S_ij is processed:

$$ m_{\text{new}} = \max\left( m_{\text{old}},\ \max(S_{ij}) \right) $$

$$ l_{\text{new}} = e^{(m_{\text{old}} - m_{\text{new}})} \cdot l_{\text{old}} + \sum e^{(S_{ij} - m_{\text{new}})} $$

This mathematical trick allows for a tiled, streaming computation of an operation that is inherently global.³⁶
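
The update rules above can be checked with a few lines of PyTorch. The sketch below is a toy, single-row illustration (not the kernel itself): it streams over blocks of scores while maintaining the running maximum m and normalizer l, and ends up with exactly the same result as a full softmax over the row.

import torch

def streaming_softmax(row: torch.Tensor, block: int = 4) -> torch.Tensor:
    """Compute softmax(row) one block at a time using running statistics."""
    m = torch.tensor(float("-inf"))   # running maximum (m_old)
    l = torch.tensor(0.0)             # running normalizer (l_old)
    for start in range(0, row.numel(), block):
        s = row[start:start + block]              # one block of scores S_ij
        m_new = torch.maximum(m, s.max())         # m_new = max(m_old, max(S_ij))
        l = torch.exp(m - m_new) * l + torch.exp(s - m_new).sum()
        m = m_new
    # A second pass (or rescaled accumulation, as in Flash Attention) yields the weights.
    return torch.exp(row - m) / l

row = torch.randn(10)
assert torch.allclose(streaming_softmax(row), torch.softmax(row, dim=0), atol=1e-6)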

This approach demonstrates that the largest performance gains often come not from micro-optimizing existing code, but from fundamentally redesigning the algorithm to align with the physical constraints and opportunities of the hardware.

Implementing Flash Attention in Triton

The official Triton documentation includes a tutorial for a fused attention kernel that implements the Flash Attention v2 algorithm.⁵⁶ A line-by-line analysis of this kernel reveals the practical application of the concepts above: the launch grid is two-dimensional, with one axis iterating over blocks of query rows and the other over the batch and head dimensions; an inner loop streams blocks of K and V through SRAM; the running maximum and normalizer are updated with the online softmax rules shown earlier, with the partially accumulated output rescaled whenever the maximum changes; and only the final, fully normalized output block is written back to HBM.

The performance impact of these optimizations is transformative, as summarized in the table below.

Table 2: Attention Implementation Performance Benchmark (Conceptual)

This table illustrates the conceptual performance progression across different attention implementations on a modern GPU like an A100. Actual numbers will vary, but the trend is representative.

| Implementation | Sequence Length | Latency (ms) | Peak Memory (GB) |
| --- | --- | --- | --- |
| Naive PyTorch MHA | 512 | ~1.5 | ~0.5 |
| Naive PyTorch MHA | 2048 | ~20 | ~8 |
| Naive PyTorch MHA | 8192 | OOM | > 80 (OOM) |
| Fused Triton MHA | 512 | ~1.2 | ~0.5 |
| Fused Triton MHA | 2048 | ~15 | ~8 |
| Fused Triton MHA | 8192 | OOM | > 80 (OOM) |
| Flash Attention (Triton) | 512 | ~0.5 | < 0.1 |
| Flash Attention (Triton) | 2048 | ~2.0 | < 0.1 |
| Flash Attention (Triton) | 8192 | ~8.0 | < 0.1 |
| Flash Attention (Triton) | 16384 | ~18.0 | < 0.1 |

Note: OOM = Out of Memory. Latency and memory are conceptual estimates for illustration.

The table clearly shows that while basic fusion offers a modest improvement, it does not solve the fundamental O(N²) memory problem. Flash Attention, by redesigning the algorithm, not only provides a dramatic speedup but also breaks the quadratic memory barrier, enabling transformers to process sequence lengths that were previously impossible.

End-to-End Model Optimization

While optimizing individual operators like attention is crucial, the greatest performance gains are realized by taking a holistic view and optimizing entire sections of a model. This involves fusing multiple layers of a Transformer block or key components of a Diffusion model’s U-Net. This section expands the scope from single-operator fusion to these end-to-end strategies.

Fusing the Full Transformer Block

A standard Transformer block is a sequence of operations: Multi-Head Attention, a residual connection, Layer Normalization, a Feed-Forward Network (FFN) or MLP, another residual connection, and a final LayerNorm.⁴⁷ A naive implementation executes each of these as a separate kernel, resulting in significant memory traffic and launch overhead.
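
For reference, the unfused sequence can be expressed as a standard post-norm PyTorch module (a minimal sketch; dropout and masking are omitted). Every sub-module in the forward pass corresponds to at least one separate kernel launch, with its intermediate activations round-tripping through HBM.

import torch.nn as nn

class NaiveTransformerBlock(nn.Module):
    """Unfused reference implementation of a post-norm Transformer block."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)      # residual add + LayerNorm
        x = self.norm2(x + self.ffn(x))   # FFN + residual add + LayerNorm
        return x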

The ultimate goal of optimization is the “Megakernel” approach, where as many of these sequential operations as possible are fused into a single, persistent GPU kernel.⁵⁸ This minimizes launch overhead to nearly zero and maximizes the time data spends in fast on-chip memory.

While a single monolithic kernel for the entire block is complex to write and maintain, a practical approach involves breaking the block into a few highly optimized fused kernels: a Flash Attention kernel for the attention core (with the Q, K, and V projections fused into a single GEMM), a fused residual-add and Layer Normalization kernel, and a fused MLP kernel that keeps the intermediate activation of the feed-forward network on-chip.

Below is an example of a fused MLP kernel in Triton, which is a common and highly effective optimization.

@triton.jit
def fused_mlp_kernel(
    x_ptr, w1_ptr, w2_ptr, output_ptr,
    M, N, K,
    stride_xm, stride_xn,
    stride_w1n, stride_w1k,
    stride_w2k, stride_w2n,
    stride_om, stride_on,
    BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr,
):
    # --- Fused Linear -> GeLU -> Linear ---
    pid = tl.program_id(axis=0)
    # --- First Linear Layer (GEMM 1) ---
    # This part is similar to a standard matrix multiplication kernel
    # It computes a block of the intermediate result: intermediate = x @ w1
    ...
    intermediate_block = ...  # Result of first GEMM
    # --- Fused GeLU Activation ---
    # The activation is applied element-wise to the intermediate result
    # while it is still in registers.
    # A fast approximation of GeLU is often used in kernels.
    activated_block = 0.5 * intermediate_block * (1 + tl.tanh(0.79788456 * intermediate_block * (1 + 0.044715 * intermediate_block * intermediate_block)))
    # --- Second Linear Layer (GEMM 2) ---
    # This part multiplies the activated block by the second weight matrix: output = activated_block @ w2
    ...
    output_block = ...  # Result of second GEMM
    # Store the final result
    tl.store(output_ptr + output_offsets, output_block)

Optimizing Diffusion Models

Diffusion models, which are becoming prevalent in image and video generation, typically rely on a U-Net architecture for the iterative denoising process.⁶⁰ This U-Net is composed of repeating blocks, which are prime targets for kernel fusion.

The most common computational pattern within a U-Net block is a sequence of Convolution -> Normalization -> Activation. For example, a block might contain a 1D convolution, followed by Group Normalization, and finally a Mish or SiLU activation function.³⁹ This sequence is memory-bound, as each step is relatively simple computationally but requires reading and writing large activation tensors.

A custom Triton kernel can fuse these three operations. A fused GroupNorm + Mish kernel is a non-trivial example that demonstrates more advanced parallel programming patterns. It requires a parallel reduction within each thread block to compute the mean and variance needed for the normalization step before applying the activation. This avoids writing the un-normalized and un-activated tensors to global memory.³⁹
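
The sketch below illustrates one way such a fused GroupNorm + Mish kernel can be structured in Triton. It is a simplified illustration rather than a production kernel: it assumes a contiguous float32 input in (batch, channels, length) layout, assumes all elements of one (sample, group) slice fit into a single block, and omits GroupNorm's learnable scale and shift. The kernel and wrapper names are illustrative.

import torch
import triton
import triton.language as tl

@triton.jit
def groupnorm_mish_kernel(
    x_ptr, out_ptr,
    group_size,                # number of elements in one (sample, group) slice
    eps,
    BLOCK_SIZE: tl.constexpr,  # power of two >= group_size
):
    # One program instance handles one contiguous (sample, group) slice.
    pid = tl.program_id(axis=0)
    offsets = pid * group_size + tl.arange(0, BLOCK_SIZE)
    mask = tl.arange(0, BLOCK_SIZE) < group_size
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0).to(tl.float32)

    # Parallel reduction within the block: mean and variance for this group.
    mean = tl.sum(x, axis=0) / group_size
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / group_size
    x_hat = (x - mean) / tl.sqrt(var + eps)

    # Mish applied while the normalized values are still in registers:
    # mish(x) = x * tanh(softplus(x)); tanh(z) is written as 2*sigmoid(2z) - 1.
    softplus = tl.log(1.0 + tl.exp(x_hat))
    y = x_hat * (2.0 * tl.sigmoid(2.0 * softplus) - 1.0)
    tl.store(out_ptr + offsets, y, mask=mask)

def groupnorm_mish(x: torch.Tensor, num_groups: int, eps: float = 1e-5) -> torch.Tensor:
    # x: (batch, channels, length), channels divisible by num_groups (assumed).
    b, c, l = x.shape
    group_size = (c // num_groups) * l
    x_flat = x.contiguous().view(-1)
    out = torch.empty_like(x_flat)
    grid = (b * num_groups,)
    groupnorm_mish_kernel[grid](x_flat, out, group_size, eps,
                                BLOCK_SIZE=triton.next_power_of_2(group_size))
    return out.view(b, c, l)

A single launch of this kind replaces the separate statistics, normalization, and activation kernels, so the un-normalized and un-activated tensors never touch global memory.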

Beyond kernel-level fusion, the overall denoising loop can also be optimized. In the standard process, timestep embeddings and various noise schedule constants are recomputed at every step of the reverse diffusion process. A higher-level optimization is to pre-compute all of these constants once at the beginning of inference and store them in a tensor on the GPU. The denoising kernel can then simply index into this tensor at each step, avoiding redundant computations and making the kernel itself lighter and faster.⁶²
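
A minimal sketch of this precomputation, assuming a simple DDPM-style linear beta schedule (the schedule and variable names are illustrative):

import torch

# Pre-compute every schedule constant once, before the denoising loop.
T = 1000
betas = torch.linspace(1e-4, 0.02, T, device="cuda")           # illustrative linear schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
sqrt_recip_alphas = torch.rsqrt(alphas)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)

# Inside the reverse-diffusion loop, each step becomes a cheap index lookup
# instead of a recomputation of the constants:
for t in reversed(range(T)):
    coef1 = sqrt_recip_alphas[t]
    coef2 = betas[t] / sqrt_one_minus_alphas_cumprod[t]
    # x = coef1 * (x - coef2 * predicted_noise) + ...  (denoising update, schematic)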

The Broader Toolkit: torch.compile and When to Go Custom

The landscape of performance optimization is evolving rapidly, and it is no longer a binary choice between a high-level framework and low-level CUDA. PyTorch's torch.compile sits in the middle of this spectrum: its TorchInductor backend automatically fuses many element-wise, normalization, and reduction operations (and itself emits Triton kernels), often capturing much of the benefit of hand-written kernels with a one-line change to the model code.

This leads to a clear, hierarchical workflow for the modern performance engineer. One should not default to writing custom kernels. The most productive path is to start with high-level tools like torch.compile, profile to find the true bottlenecks, write a Triton kernel when a specific fusion is not captured automatically, and drop down to CUDA only when Triton cannot express the required optimization. This workflow strikes a crucial balance between developer productivity and raw performance.

Table 3: Transformer Block Optimization Strategies

This table serves as a strategic map for optimizing a Transformer block, breaking it down by component and listing the state-of-the-art optimization strategy for each.

| Operation / Component | Fusion Strategy | Primary Benefit | Implementation Tool |
| --- | --- | --- | --- |
| Q, K, V Projections | Fused nn.Linear | Reduces 3 GEMMs to 1, improves GPU utilization | PyTorch (torch.nn.Linear(d_in, 3*d_out)) |
| Scaled Dot-Product Attention | Flash Attention | Eliminates O(N²) memory, drastically reduces HBM I/O | Triton/CUDA (e.g., flash-attn library) |
| MLP / FFN | Fused Linear-GELU-Linear | Reduces HBM I/O by keeping intermediate activations in SRAM | Triton, torch.compile |
| Layer Normalization | Fused with preceding op (e.g., residual add) | Reduces HBM I/O | Triton, torch.compile, DeepSpeed Kernels |
| Entire Transformer Layer | Megakernel / Persistent Kernel | Eliminates all kernel launch overhead within the layer | Custom CUDA/Triton, TensorRT-LLM |

Conclusion and Future Directions

The journey from understanding GPU hardware to implementing end-to-end fused kernels in Triton reveals a clear set of principles for achieving high performance in modern AI. The landscape is defined by the constant tension between immense computational power and the limitations of memory bandwidth. Custom kernels, particularly those written with a focus on fusion and data locality, are the primary tool for resolving this tension.

A Practical Roadmap for the ML Practitioner

For a data scientist or machine learning engineer looking to master high-performance computing, the path forward is not to abandon high-level frameworks for raw CUDA, but to adopt a tiered, strategic approach: stay within PyTorch and torch.compile for as long as possible, use profiling to identify the genuinely memory-bound hotspots, reach for Triton to fuse the operations the compiler misses, and treat hand-written CUDA as a last resort for problems that demand direct control over the hardware.

The Future of High-Performance AI

The field of AI performance engineering is rapidly evolving, with several key trends shaping its future: attention kernels that exploit new hardware features such as asynchrony and low-precision formats (as in FlashAttention-3), compiler stacks like torch.compile and TensorRT-LLM that automate ever more of the fusion work, and megakernel-style approaches that compile entire models into a handful of persistent kernels to drive launch overhead toward zero.

Ultimately, learning CUDA and Triton is an investment in understanding the fundamental principles of high-performance computing. This knowledge of how to reason about parallelism, data locality, and memory hierarchies is a durable skill that will remain valuable long after today’s specific models and frameworks have been superseded. As AI continues to push the limits of what is computationally feasible, the ability to bridge the gap between abstract algorithms and physical hardware will only grow in importance.

Works cited

  1. Understanding CUDA in Computing: A Comprehensive Guide | Lenovo US, accessed July 17, 2025, https://www.lenovo.com/us/en/glossary/what-is-the-cuba-toolkit/

  2. What is the best way for beginners to learn CUDA and parallel computing with GPUs?, accessed July 17, 2025, https://www.quora.com/What-is-the-best-way-for-beginners-to-learn-CUDA-and-parallel-computing-with-GPUs

  3. What Is CUDA? - Supermicro, accessed July 17, 2025, https://www.supermicro.com/en/glossary/cuda

  4. Introduction to CUDA Programming - GeeksforGeeks, accessed July 17, 2025, https://www.geeksforgeeks.org/electronics-engineering/introduction-to-cuda-programming/

  5. CUDA Refresher: The CUDA Programming Model | NVIDIA Technical Blog, accessed July 17, 2025, https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/

  6. Simplifying CUDA kernels with Triton: A Pythonic Approach to GPU Programming, accessed July 17, 2025, https://arunjitha.medium.com/simplifying-cuda-kernels-with-triton-a-pythonic-approach-to-gpu-programming-79bb7121e974

  7. Flash attention(Fast and Memory-Efficient Exact Attention with IO-Awareness): A deep dive, accessed July 17, 2025, https://towardsdatascience.com/flash-attention-fast-and-memory-efficient-exact-attention-with-io-awareness-a-deep-dive-724af489997b/

  8. FLASHATTENTION: Fast and Memory-Efficient Exact Attention with IO-Awareness - OpenReview, accessed July 17, 2025, https://openreview.net/pdf?id=H4DqfPSibmx

  9. 1 CUDA Programming Model, accessed July 17, 2025, https://www.eng.utah.edu/~cs5610/lectures/Programming_Models_for_GPU_Architecture%20CUDA.pdf

  10. en.wikipedia.org, accessed July 17, 2025, https://en.wikipedia.org/wiki/CUDA

  11. kst179/fused-attention: Fast and low-memory attention layer … - GitHub, accessed July 17, 2025, https://github.com/kst179/fused-attention

  12. CUDA Zone - Library of Resources - NVIDIA Developer, accessed July 17, 2025, https://developer.nvidia.com/cuda-zone

  13. What is the CUDA Programming Model? | GPU Glossary - Modal, accessed July 17, 2025, https://modal.com/gpu-glossary/device-software/cuda-programming-model

  14. Tutorial 01: Say Hello to CUDA, accessed July 17, 2025, https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial01/

  15. An Even Easier Introduction to CUDA (Updated) | NVIDIA Technical Blog, accessed July 17, 2025, https://developer.nvidia.com/blog/even-easier-introduction-cuda/

  16. CUDA Programming - Wolfram Language Documentation, accessed July 17, 2025, https://reference.wolfram.com/language/CUDALink/tutorial/Programming.html

  17. CUDA Programming Model — MolSSI GPU Programming Fundamentals documentation, accessed July 17, 2025, https://education.molssi.org/gpu_programming_beginner/03-cuda-program-model.html

  18. CUDA Basic Example - Vector Addition Explanation - eunomia, accessed July 17, 2025, https://eunomia.dev/others/cuda-tutorial/01-vector-addition/

  19. Vector Addition “Hello World!” Example with CUDA on Mac OSX …, accessed July 17, 2025, https://www.quantstart.com/articles/Vector-Addition-Hello-World-Example-with-CUDA-on-Mac-OSX/

  20. olcf-tutorials/vector_addition_cuda: A simple CUDA vector addition program - GitHub, accessed July 17, 2025, https://github.com/olcf-tutorials/vector_addition_cuda

  21. 4.4 Example: Vector Addition — Parallel Computing for Beginners, accessed July 17, 2025, https://www.learnpdc.org/PDCBeginners/4-cuda/4-VectorAdd.html

  22. Tutorial 02: CUDA in Actions, accessed July 17, 2025, https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial02/

  23. Introducing Triton: Open-Source GPU Programming for Neural Networks, accessed July 17, 2025, https://aimersociety.com/introducing-triton-open-source-gpu-programming-for-neural-networks/

  24. How Is OpenAI’s Triton Different From NVIDIA CUDA? - Analytics India Magazine, accessed July 17, 2025, https://analyticsindiamag.com/global-tech/how-is-openais-triton-different-from-nvidia-cuda/

  25. triton-lang.org, accessed July 17, 2025, https://triton-lang.org/#:~:text=Triton%20is%20a%20language%20and,throughput%20on%20modern%20GPU%20hardware.

  26. Democratizing AI Accelerators and GPU Kernel Programming using Triton, accessed July 17, 2025, https://next.redhat.com/2024/11/07/democratizing-ai-accelerators-and-gpu-kernel-programming-using-triton/

  27. Exploring Triton GPU programming for neural networks in Java - OpenJDK, accessed July 17, 2025, https://openjdk.org/projects/babylon/articles/triton

  28. Getting Started with Triton: A Step-by-Step Tutorial - Medium, accessed July 17, 2025, https://medium.com/ai-insights-cobet/getting-started-with-triton-a-step-by-step-tutorial-ddc18a186295

  29. Liger Kernel: Efficient Triton Kernels for LLM Training - arXiv, accessed July 17, 2025, https://arxiv.org/html/2410.10989v3

  30. [D] usefulness of learning CUDA/triton : r/MachineLearning - Reddit, accessed July 17, 2025, https://www.reddit.com/r/MachineLearning/comments/1kewrqc/d_usefulness_of_learning_cudatriton/

  31. Triton — GPU Programming for Neural Networks | by Dhananjay Kumar - Medium, accessed July 17, 2025, https://dhnanjay.medium.com/triton-gpu-programming-for-neural-networks-16271d729f78

  32. Introduction - Triton documentation, accessed July 17, 2025, https://triton-lang.org/main/programming-guide/chapter-1/introduction.html

  33. GPU MODE Lecture 14: Practitioners Guide to Triton - Christian Mills, accessed July 17, 2025, https://christianjmills.com/posts/cuda-mode-notes/lecture-014/

  34. triton_tutorial/02_vector_addition.ipynb at master - GitHub, accessed July 17, 2025, https://github.com/VikParuchuri/triton_tutorial/blob/master/02_vector_addition.ipynb

  35. Vector Addition — Triton documentation, accessed July 17, 2025, https://triton-lang.org/main/getting-started/tutorials/01-vector-add.html

  36. Triton Vector Addition Kernel, part 1: Making the Shift to Parallel Programming - YouTube, accessed July 17, 2025, https://www.youtube.com/watch?v=MEZ7XhzTLEg

  37. lectures/lecture_014/A_Practitioners_Guide_to_Triton.ipynb at main - GitHub, accessed July 17, 2025, https://github.com/gpu-mode/lectures/blob/main/lecture_014/A_Practitioners_Guide_to_Triton.ipynb

  38. How Nvidia’s CUDA Monopoly In Machine Learning Is Breaking – OpenAI Triton And PyTorch 2.0 – SemiAnalysis, accessed July 17, 2025, https://semianalysis.com/2023/01/16/nvidiaopenaitritonpytorch/

  39. Flash Attention - Insu Jang, accessed July 17, 2025, https://insujang.github.io/2024-01-21/flash-attention/

  40. OpenAI’s Triton: An end to end example | by Michael Diggin | Medium, accessed July 17, 2025, https://medium.com/@michael.diggin/openais-triton-an-end-to-end-example-c6577d81e3d0

  41. Kernel Fusion - Steven Gong, accessed July 17, 2025, https://stevengong.co/notes/Kernel-Fusion

  42. Part VI - Kernel Fusion in CUDA - Vrushank Desai, accessed July 17, 2025, https://www.vrushankdes.ai/diffusion-policy-inference-optimization/part-vi---kernel-fusion-in-cuda

  43. 31. Kernel Fusion - Aussie AI, accessed July 17, 2025, https://www.aussieai.com/book/ch31-kernel-fusion

  44. Kernel Operator Fusion - Aussie AI, accessed July 17, 2025, https://www.aussieai.com/research/kernel-fusion

  45. Kernel Fusion: A Smart Way to Enhance Neural Networks Performance - abhik.xyz, accessed July 17, 2025, https://www.abhik.xyz/articles/kernel-fusion

  46. DeepSpeed Transformer Kernel - DeepSpeed, accessed July 17, 2025, https://www.deepspeed.ai/tutorials/transformer_kernel/

  47. FlashAttention: Implementing High-Performance Attention with CUDA and Triton - Medium, accessed July 17, 2025, https://medium.com/@kimdoil1211/flashattention-implementing-high-performance-attention-with-cuda-and-triton-9ee635ab1200

  48. Tutorial 6: Transformers and Multi-Head Attention — UvA DL Notebooks v1.2 documentation, accessed July 17, 2025, https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html

  49. Parallelizing Multi-Head Attention on GPUs - Hemil Desai, accessed July 17, 2025, https://hd10.dev/posts/my-interests-2/cs259.pdf

  50. Multi-Head Attention From Scratch - Sanjaya’s Blog, accessed July 17, 2025, https://sanjayasubedi.com.np/deeplearning/multihead-attention-from-scratch/

  51. Introduction to Flash Attention: A Breakthrough in Efficient Attention Mechanism | by Sthanikam Santhosh | Medium, accessed July 17, 2025, https://medium.com/@sthanikamsanthosh1994/introduction-to-flash-attention-a-breakthrough-in-efficient-attention-mechanism-3eb47e8962c3

  52. Flash Attention: Revolutionizing Transformer Efficiency - Unite.AI, accessed July 17, 2025, https://www.unite.ai/flash-attention-revolutionizing-transformer-efficiency/

  53. LLMs-from-scratch/ch03/02_bonus_efficient-multihead-attention/mha-implementations.ipynb at main - GitHub, accessed July 17, 2025, https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/02_bonus_efficient-multihead-attention/mha-implementations.ipynb

  54. FlashAttention: Fast and Memory-Efficient Exact … - deepsense.ai, accessed July 17, 2025, https://deepsense.ai/wp-content/uploads/2023/04/2205.14135.pdf

  55. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. - MIT MLSys Discussion Group, accessed July 17, 2025, https://www.mlsys.ai/papers/flash_attention.html

  56. Understanding Flash Attention: Writing the Algorithm from Scratch in Triton, accessed July 17, 2025, https://towardsdatascience.com/understanding-flash-attention-writing-the-algorithm-from-scratch-in-triton-5609f0b143ea/

  57. FLASH ATTENTION: Fast and Memory-Efficient Exact Attention with IO-Awareness: Paper Review | by Sulbha Jain | May, 2025 | Medium, accessed July 17, 2025, https://medium.com/@sulbha.jindal/flash-attention-fast-and-memory-efficient-exact-attention-with-io-awareness-paper-review-79639127c5de

  58. Fused Attention - Triton documentation, accessed July 17, 2025, https://triton-lang.org/main/getting-started/tutorials/06-fused-attention.html

  59. Tutorial #17: Transformers III Training - Research Blog | RBC Borealis, accessed July 17, 2025, https://rbcborealis.com/research-blogs/tutorial-17-transformers-iii-training/

  60. Compiling LLMs into a MegaKernel: A Path to Low-Latency …, accessed July 17, 2025, https://zhihaojia.medium.com/compiling-llms-into-a-megakernel-a-path-to-low-latency-inference-cf7840913c17

  61. flash-attention/training/README.md at main · Dao-AILab/flash …, accessed July 17, 2025, https://github.com/HazyResearch/flash-attention/blob/main/training/README.md

  62. Custom diffusion model with PyTorch — Tutorials for AI developers 4.0, accessed July 17, 2025, https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/pretrain/ddim_pretrain.html

  63. Ultimate guide to optimizing Stable Diffusion XL - Félix Sanz, accessed July 17, 2025, https://www.felixsanz.dev/articles/ultimate-guide-to-optimizing-stable-diffusion-xl

  64. Part VII - A Dive Into DDPMs & CUDA kernel for Denoising - Vrushank Desai, accessed July 17, 2025, https://www.vrushankdes.ai/diffusion-policy-inference-optimization/part-vii---a-dive-into-ddpms-cuda-kernel-for-denoising

  65. Liger Kernel: Efficient Triton Kernels for LLM Training - arXiv, accessed July 17, 2025, https://arxiv.org/html/2410.10989v2

  66. Accelerating PyTorch Transformers by replacing nn.Transformer with Nested Tensors and torch.compile() — PyTorch Tutorials 2.7.0+cu126 documentation, accessed July 17, 2025, https://docs.pytorch.org/tutorials/intermediate/transformer_building_blocks.html

  67. Accelerating Generative AI Part III: Diffusion, Fast – PyTorch, accessed July 17, 2025, https://pytorch.org/blog/accelerating-generative-ai-3/

  68. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision - PyTorch, accessed July 17, 2025, https://pytorch.org/blog/flashattention-3/

  69. How do you optimize GPU utilization during diffusion model training? - Milvus, accessed July 17, 2025, https://milvus.io/ai-quick-reference/how-do-you-optimize-gpu-utilization-during-diffusion-model-training

  70. What is Flash Attention? | Modal Blog, accessed July 17, 2025, https://modal.com/blog/flash-attention-article

  71. Optimizing Transformer-Based Diffusion Models for Video Generation with NVIDIA TensorRT, accessed July 17, 2025, https://developer.nvidia.com/blog/optimizing-transformer-based-diffusion-models-for-video-generation-with-nvidia-tensorrt/

  72. Run High-Performance LLM Inference Kernels from NVIDIA Using …, accessed July 17, 2025, https://developer.nvidia.com/blog/run-high-performance-llm-inference-kernels-from-nvidia-using-flashinfer/

