About This Page

This page covers low-level graphics programming: GPU architecture, modern graphics APIs, shader languages, and advanced rendering techniques. For engine-level rendering, see Game Development. For a deep dive into path tracing, see PathTracer Learning. For engine-specific shaders, see Godot, Unity, and Unreal Engine.

History

  • How: Graphics APIs evolved from fixed-function pipelines (OpenGL 1.x, DirectX 7) through programmable shaders (introduced with DirectX 8, made general-purpose by DirectX 9 and OpenGL 2.0) to explicit low-overhead APIs (Vulkan, DirectX 12, Metal) that give developers direct control of the GPU.
  • Who: Key contributors include Microsoft (DirectX), Silicon Graphics and later the Khronos Group (OpenGL, Vulkan, WebGL), the W3C (WebGPU), Apple (Metal), and AMD and NVIDIA (GPU hardware).
  • Why: Games and simulations demand maximum GPU performance. Older high-level APIs hid memory management, synchronization, and state validation inside the driver, which cost CPU time and made performance hard to predict. Explicit APIs let developers control memory, synchronization, and command submission directly.

API Evolution Timeline

timeline
    title Graphics API Evolution
    1992 : OpenGL 1.0
         : Fixed-function pipeline
         : No shaders
    1995 : DirectX 1.0
         : Windows-only
         : Microsoft enters graphics
    2002 : DirectX 9
         : HLSL introduced
         : Shader Model 2.0
    2004 : OpenGL 2.0
         : GLSL in core
    2006 : DirectX 10
         : Geometry shaders
         : Unified shader model
    2007 : OpenGL ES 2.0
         : Programmable mobile graphics
         : WebGL 1.0 follows in 2011
    2014 : Metal (Apple)
         : Low-overhead API for iOS, macOS from 2015
         : Among the first modern explicit APIs
    2015 : DirectX 12
         : Explicit GPU control
    2016 : Vulkan 1.0
         : Cross-platform explicit API
         : Multi-threaded command recording
    2020 : DirectX 12 Ultimate
         : Mesh shaders and DXR 1.1
    2022 : Vulkan 1.3
         : Dynamic rendering and synchronization2 in core
    2023 : WebGPU ships in Chrome
         : Modern GPU API for browsers
         : Successor to WebGL
    2024 : D3D12 work graphs
         : GPU-scheduled work

Introduction

API Comparison

| API | Platform | Overhead | Learning Curve | Best For |
|---|---|---|---|---|
| Vulkan | Cross-platform | Minimal | Very High | Games, engines, cross-platform |
| DirectX 12 | Windows / Xbox | Minimal | Very High | Windows games, Xbox |
| Metal | Apple only | Minimal | High | iOS / macOS games |
| WebGPU | Browser + native | Low | Medium | Web games, tools |
| OpenGL | Cross-platform | High | Low | Learning, legacy |
| OpenGL ES | Mobile | Medium | Low | Mobile (legacy) |
| WebGL | Browser | High | Low | Web (legacy) |

Graphics Programming Knowledge Map

mindmap
  root((Advanced Graphics))
    GPU Architecture
      Hardware Pipeline
      Memory Model
      Execution Model
      Synchronization
    Graphics APIs
      Vulkan
      DirectX 12
      Metal
      WebGPU
    Shader Languages
      GLSL
      HLSL
      MSL
      WGSL
      SPIR-V
    Rendering Techniques
      Rasterization
      Ray Tracing
      Compute
      Mesh Shaders
    Advanced Topics
      Render Graphs
      Bindless Resources
      GPU Driven Rendering
      Multi-threading

GPU Architecture

GPU vs CPU Architecture

graph TD
    subgraph CPU["🖥️ CPU — Few Powerful Cores"]
        C1["Core 1\nComplex logic\nBranch prediction\nOut-of-order exec"]
        C2["Core 2"]
        C3["Core 3"]
        C4["Core 4 ... 32"]
        Cache["Large Cache\nL1/L2/L3"]
        C1 --- Cache
    end
    subgraph GPU["🎮 GPU — Thousands of Simple Cores"]
        SM1["SM / CU\n128 shader cores"]
        SM2["SM / CU\n128 shader cores"]
        SM3["SM / CU\n128 shader cores"]
        SMN["... dozens more SMs / CUs"]
        VRAM["VRAM\nHigh bandwidth\n~1TB/s"]
        SM1 --- VRAM
    end
    CPU -->|"Submits draw calls\nand commands"| GPU

| Feature | CPU | GPU |
|---|---|---|
| Core count | 4–64 | 1,000–18,000+ |
| Core complexity | Very high (out-of-order, branch prediction) | Simple (in-order) |
| Memory bandwidth | ~100 GB/s | ~1 TB/s |
| Best at | Sequential logic, branching | Parallel math, throughput |
| Latency | Low | High (hidden by parallelism) |

GPU Execution Model

graph TD
    Thread["Thread\nSingle shader invocation\n(one pixel, one vertex)"]
    Warp["Warp / Wavefront\n32 threads (NVIDIA)\n32 or 64 threads (AMD)\nExecute in lockstep"]
    Block["Thread Block / Workgroup\nMultiple warps\nShared memory access"]
    Grid["Grid / Dispatch\nAll blocks for one draw/dispatch"]
    Thread --> Warp --> Block --> Grid
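The thread, warp, workgroup, and grid hierarchy above determines how many workgroups a dispatch needs. The following is a minimal sketch of that arithmetic; the function names `groupCountFor` and `warpsPerGroup` are illustrative and not part of any graphics API.

```cpp
#include <cassert>
#include <cstdint>

// Workgroups needed to cover `itemCount` items when each workgroup
// handles `groupSize` items (ceiling division). Assumes the sum
// itemCount + groupSize does not overflow 32 bits.
inline std::uint32_t groupCountFor(std::uint32_t itemCount, std::uint32_t groupSize) {
    assert(groupSize > 0);
    return (itemCount + groupSize - 1) / groupSize;
}

// Warps (or wavefronts) occupied by one workgroup. A partially filled
// warp still occupies a whole hardware warp, so round up.
inline std::uint32_t warpsPerGroup(std::uint32_t groupSize, std::uint32_t warpSize) {
    assert(warpSize > 0);
    return (groupSize + warpSize - 1) / warpSize;
}
```

For a 1920×1080 image processed by 8×8 workgroups, this gives 240×135 groups. A workgroup of 48 threads on 32-wide hardware still occupies two warps, which is why workgroup sizes are usually chosen as multiples of the warp size.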

GPU Memory Hierarchy

| Memory Type | Location | Speed | Size | Scope |
|---|---|---|---|---|
| Registers | On-chip | Fastest | ~256 KB per SM | Per thread |
| Shared Memory / LDS | On-chip | Very fast | 32–128 KB per SM | Per workgroup |
| L1 Cache | On-chip | Fast | 32–128 KB per SM | Per SM |
| L2 Cache | On-chip | Medium | 4–80 MB | Whole GPU |
| VRAM (GDDR6/HBM) | Off-chip | ~1 TB/s | 8–80 GB | Whole GPU |
| System RAM | Off-chip | ~100 GB/s | GBs | CPU+GPU shared |

Synchronization Primitives

| Primitive | Scope | Use Case |
|---|---|---|
| Barrier (execution) | Workgroup | Wait for all threads before proceeding |
| Memory barrier | Workgroup / global | Ensure writes are visible |
| Semaphore | Queue level | Signal between GPU queues |
| Fence | CPU-GPU | CPU waits for GPU work to finish |
| Pipeline barrier (Vulkan) | Command buffer | Transition resource states |
| Event | Command buffer | Fine-grained sync within a queue |

Vulkan

Vulkan Architecture Overview

graph TD
    App["Your Application"]
    Instance["VkInstance\nVulkan context"]
    PhysDev["VkPhysicalDevice\nGPU hardware info"]
    LogDev["VkDevice\nLogical device\nQueues + features"]
    subgraph Memory["Memory Management"]
        Alloc["VkDeviceMemory\nAllocate GPU memory"]
        Buffer["VkBuffer\nVertex, index, uniform data"]
        Image["VkImage\nTextures, render targets"]
    end
    subgraph Commands["Command Recording"]
        Pool["VkCommandPool"]
        CmdBuf["VkCommandBuffer\nRecord draw calls"]
    end
    subgraph Rendering["Render Pipeline"]
        RenderPass["VkRenderPass\nAttachments, subpasses"]
        Pipeline["VkPipeline\nShaders + state"]
        Framebuf["VkFramebuffer\nRender targets"]
    end
    Swapchain["VkSwapchainKHR\nPresent to screen"]
    Queue["VkQueue\nSubmit commands to GPU"]

    App --> Instance --> PhysDev --> LogDev
    LogDev --> Memory
    LogDev --> Commands
    LogDev --> Rendering
    LogDev --> Swapchain
    CmdBuf --> Queue --> Swapchain

Vulkan Initialization

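#include <vulkan/vulkan.h>
#include <stdexcept>
#include <vector>

// 1. Create Instance
VkApplicationInfo appInfo{};
appInfo.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
appInfo.pApplicationName = "My Game";
appInfo.applicationVersion = VK_MAKE_VERSION(1, 0, 0);
appInfo.apiVersion = VK_API_VERSION_1_3;

VkInstanceCreateInfo createInfo{};
createInfo.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
createInfo.pApplicationInfo = &appInfo;

// Enable validation layers (debug builds only)
#ifndef NDEBUG
const char* validationLayers[] = {"VK_LAYER_KHRONOS_validation"};
createInfo.enabledLayerCount = 1;
createInfo.ppEnabledLayerNames = validationLayers;
#endif

VkInstance instance;
if (vkCreateInstance(&createInfo, nullptr, &instance) != VK_SUCCESS) {
    throw std::runtime_error("failed to create Vulkan instance");
}

// 2. Pick Physical Device (GPU)
uint32_t deviceCount = 0;
vkEnumeratePhysicalDevices(instance, &deviceCount, nullptr);
if (deviceCount == 0) {
    throw std::runtime_error("no Vulkan-capable GPU found");
}
std::vector<VkPhysicalDevice> devices(deviceCount);
vkEnumeratePhysicalDevices(instance, &deviceCount, devices.data());

// Simplification: take the first device. Real code should rank devices
// by type, features, and limits.
VkPhysicalDevice physicalDevice = devices[0];

// Find a queue family that supports graphics (used in step 3)
uint32_t familyCount = 0;
vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &familyCount, nullptr);
std::vector<VkQueueFamilyProperties> families(familyCount);
vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &familyCount, families.data());
uint32_t graphicsQueueFamily = 0;
for (uint32_t i = 0; i < familyCount; ++i) {
    if (families[i].queueFlags & VK_QUEUE_GRAPHICS_BIT) { graphicsQueueFamily = i; break; }
}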
 
// 3. Create Logical Device + Queue
float queuePriority = 1.0f;
VkDeviceQueueCreateInfo queueInfo{};
queueInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
queueInfo.queueFamilyIndex = graphicsQueueFamily;
queueInfo.queueCount = 1;
queueInfo.pQueuePriorities = &queuePriority;
 
VkDeviceCreateInfo deviceInfo{};
deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
deviceInfo.queueCreateInfoCount = 1;
deviceInfo.pQueueCreateInfos = &queueInfo;
 
VkDevice device;
vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);
 
VkQueue graphicsQueue;
vkGetDeviceQueue(device, graphicsQueueFamily, 0, &graphicsQueue);

Vulkan Memory Management

// Allocate GPU memory manually (unlike OpenGL which does it for you)
VkMemoryAllocateInfo allocInfo{};
allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
allocInfo.allocationSize = memRequirements.size;
allocInfo.memoryTypeIndex = findMemoryType(
    memRequirements.memoryTypeBits,
    VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT  // GPU-only memory (fastest)
    // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
    // ^ CPU-accessible memory (for staging buffers)
);
 
VkDeviceMemory bufferMemory;
vkAllocateMemory(device, &allocInfo, nullptr, &bufferMemory);
vkBindBufferMemory(device, buffer, bufferMemory, 0);

| Memory Type | Flag | Use Case |
|---|---|---|
| Device Local | DEVICE_LOCAL | GPU-only buffers, textures (fastest) |
| Host Visible + Coherent | HOST_VISIBLE + HOST_COHERENT | Staging buffers, uniform data |
| Host Visible + Cached | HOST_VISIBLE + HOST_CACHED | Readback from GPU to CPU |
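The allocation code above calls a `findMemoryType` helper that the page does not define. A minimal sketch of the usual selection logic follows. To keep it runnable without the Vulkan headers, it uses stand-in constants with the same numeric values as the `VK_MEMORY_PROPERTY_*` bits, and it takes the list of memory-type flags as an explicit parameter; in real code that list would come from `vkGetPhysicalDeviceMemoryProperties`. This signature is an assumption, not the helper the original code uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins with the same values as the Vulkan memory property bits.
constexpr std::uint32_t DEVICE_LOCAL  = 0x1; // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr std::uint32_t HOST_VISIBLE  = 0x2; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr std::uint32_t HOST_COHERENT = 0x4; // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
constexpr std::uint32_t HOST_CACHED   = 0x8; // VK_MEMORY_PROPERTY_HOST_CACHED_BIT

// Returns the index of the first memory type that is allowed by `typeBits`
// (bit i set means type i is usable for the resource) and has every flag
// in `required`. Returns -1 if none qualifies. Vulkan exposes at most 32
// memory types, so the shift below stays in range.
int findMemoryType(const std::vector<std::uint32_t>& typeFlags,
                   std::uint32_t typeBits, std::uint32_t required) {
    for (std::size_t i = 0; i < typeFlags.size() && i < 32; ++i) {
        const bool allowed = (typeBits & (1u << i)) != 0;
        const bool hasAll  = (typeFlags[i] & required) == required;
        if (allowed && hasAll) return static_cast<int>(i);
    }
    return -1;
}
```

Checking `(flags & required) == required` rather than `!= 0` matters: a staging buffer needs both HOST_VISIBLE and HOST_COHERENT, not either one.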

Vulkan Pipeline

graph LR
    IA["Input Assembly\nFetch vertices and indices"] --> VS["Vertex Shader\nTransform vertices"]
    VS --> TCS["Tessellation Control\n(optional)"]
    TCS --> TES["Tessellation Eval\n(optional)"]
    TES --> GS["Geometry Shader\n(optional)"]
    GS --> Rast["Rasterization\nTriangles → Fragments"]
    Rast --> FS["Fragment Shader\nCompute pixel color"]
    FS --> Blend["Color Blending\nAlpha compositing"]
    Blend --> FB["Framebuffer\nFinal image"]

// Create graphics pipeline (simplified)
VkGraphicsPipelineCreateInfo pipelineInfo{};
pipelineInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
 
// Shader stages
VkPipelineShaderStageCreateInfo shaderStages[] = {vertStage, fragStage};
pipelineInfo.stageCount = 2;
pipelineInfo.pStages = shaderStages;
 
// Vertex input
pipelineInfo.pVertexInputState = &vertexInputInfo;
pipelineInfo.pInputAssemblyState = &inputAssembly;
 
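// Rasterization
VkPipelineRasterizationStateCreateInfo rasterizer{};
rasterizer.sType = VK_STRUCTURE_TYPE_PIPELINE_RASTERIZATION_STATE_CREATE_INFO;
rasterizer.polygonMode = VK_POLYGON_MODE_FILL;
rasterizer.cullMode = VK_CULL_MODE_BACK_BIT;
rasterizer.frontFace = VK_FRONT_FACE_COUNTER_CLOCKWISE;
rasterizer.lineWidth = 1.0f; // must be 1.0 unless the wideLines feature is enabled
pipelineInfo.pRasterizationState = &rasterizer;

// Depth testing
VkPipelineDepthStencilStateCreateInfo depthStencil{};
depthStencil.sType = VK_STRUCTURE_TYPE_PIPELINE_DEPTH_STENCIL_STATE_CREATE_INFO;
depthStencil.depthTestEnable = VK_TRUE;
depthStencil.depthWriteEnable = VK_TRUE;
depthStencil.depthCompareOp = VK_COMPARE_OP_LESS;
pipelineInfo.pDepthStencilState = &depthStencil;

// Omitted for brevity but required in real code: viewport, multisample, and
// color-blend state, plus pipelineInfo.layout and pipelineInfo.renderPass.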
 
VkPipeline graphicsPipeline;
vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &pipelineInfo, nullptr, &graphicsPipeline);

Vulkan Render Pass & Synchronization

// Pipeline barrier — transition image layout
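VkImageMemoryBarrier barrier{};
barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
barrier.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;
barrier.newLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED; // no queue ownership transfer
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.image = image; // the VkImage being transitioned (assumed to exist)
barrier.subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1}; // mip 0, layer 0
barrier.srcAccessMask = 0;
barrier.dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;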
 
vkCmdPipelineBarrier(
    commandBuffer,
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,        // src stage
    VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // dst stage
    0, 0, nullptr, 0, nullptr,
    1, &barrier
);

| Image Layout | Use Case |
|---|---|
| UNDEFINED | Initial state, don’t care about contents |
| COLOR_ATTACHMENT_OPTIMAL | Writing to render target |
| SHADER_READ_ONLY_OPTIMAL | Sampling in shader |
| TRANSFER_SRC_OPTIMAL | Copy source |
| TRANSFER_DST_OPTIMAL | Copy destination |
| PRESENT_SRC_KHR | Ready to present to screen |

Vulkan Ray Tracing

graph TD
    BLAS["BLAS\nBottom-Level Acceleration Structure\nGeometry triangles/AABBs"]
    TLAS["TLAS\nTop-Level Acceleration Structure\nInstances of BLASes with transforms"]
    RGen["Ray Generation Shader\nSpawns rays from camera"]
    RInt["Intersection Shader\nCustom geometry intersection"]
    RAny["Any-Hit Shader\nTransparency, alpha test"]
    RClose["Closest-Hit Shader\nShading at hit point"]
    RMiss["Miss Shader\nBackground / sky when no hit"]
    BLAS --> TLAS
    TLAS --> RGen
    RGen --> RInt
    RGen --> RAny
    RGen --> RClose
    RGen --> RMiss
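A BLAS built from AABBs relies on an intersection shader to decide whether a ray hits the custom geometry, and acceleration-structure traversal itself is built on ray-box tests. As an illustration of that underlying test, here is a CPU-side sketch of the standard slab method. `Vec3` and `rayHitsAabb` are hypothetical names for this example only; real intersection shaders are written in GLSL or HLSL and report hits through the ray tracing built-ins.

```cpp
#include <algorithm>
#include <limits>
#include <utility>

struct Vec3 { float x, y, z; };

// Slab test: intersect the ray with the pair of planes on each axis and
// keep the overlap of the three parameter intervals. Hits behind the
// ray origin (t < 0) are rejected.
bool rayHitsAabb(const Vec3& origin, const Vec3& dir,
                 const Vec3& boxMin, const Vec3& boxMax) {
    const float o[3]  = {origin.x, origin.y, origin.z};
    const float d[3]  = {dir.x, dir.y, dir.z};
    const float lo[3] = {boxMin.x, boxMin.y, boxMin.z};
    const float hi[3] = {boxMax.x, boxMax.y, boxMax.z};

    float tNear = 0.0f;
    float tFar  = std::numeric_limits<float>::infinity();
    for (int axis = 0; axis < 3; ++axis) {
        if (d[axis] == 0.0f) {
            // Ray is parallel to this slab: it must start inside it.
            if (o[axis] < lo[axis] || o[axis] > hi[axis]) return false;
            continue;
        }
        float t0 = (lo[axis] - o[axis]) / d[axis];
        float t1 = (hi[axis] - o[axis]) / d[axis];
        if (t0 > t1) std::swap(t0, t1);
        tNear = std::max(tNear, t0);
        tFar  = std::min(tFar, t1);
        if (tNear > tFar) return false; // intervals no longer overlap
    }
    return true;
}
```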

DirectX 12

DX12 vs Vulkan Terminology

| Concept | Vulkan | DirectX 12 |
|---|---|---|
| Device | VkDevice | ID3D12Device |
| Command buffer | VkCommandBuffer | ID3D12GraphicsCommandList |
| Command pool | VkCommandPool | ID3D12CommandAllocator |
| Queue | VkQueue | ID3D12CommandQueue |
| Render pass | VkRenderPass | Render targets (render passes are optional, via ID3D12GraphicsCommandList4) |
| Pipeline | VkPipeline | ID3D12PipelineState |
| Descriptor set | VkDescriptorSet | Descriptor table in a descriptor heap |
| Buffer | VkBuffer | ID3D12Resource |
| Image | VkImage | ID3D12Resource |
| Swapchain | VkSwapchainKHR | IDXGISwapChain4 |
| Semaphore/Fence | VkSemaphore/VkFence | ID3D12Fence |
| Memory heap | VkDeviceMemory | ID3D12Heap (kind selected by D3D12_HEAP_TYPE) |

DX12 Initialization

#include <d3d12.h>
#include <dxgi1_6.h>
 
// 1. Enable debug layer (debug builds only)
UINT factoryFlags = 0;
#if defined(_DEBUG)
ID3D12Debug* debugController = nullptr;
if (SUCCEEDED(D3D12GetDebugInterface(IID_PPV_ARGS(&debugController)))) {
    debugController->EnableDebugLayer();
    factoryFlags |= DXGI_CREATE_FACTORY_DEBUG;
    debugController->Release();
}
#endif

// 2. Create DXGI Factory + enumerate adapters
IDXGIFactory7* factory;
CreateDXGIFactory2(factoryFlags, IID_PPV_ARGS(&factory));
 
IDXGIAdapter4* adapter;
factory->EnumAdapterByGpuPreference(0,
    DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE, IID_PPV_ARGS(&adapter));
 
// 3. Create D3D12 Device
ID3D12Device8* device;
D3D12CreateDevice(adapter, D3D_FEATURE_LEVEL_12_1, IID_PPV_ARGS(&device));
 
// 4. Create Command Queue
D3D12_COMMAND_QUEUE_DESC queueDesc{};
queueDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
queueDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;
 
ID3D12CommandQueue* commandQueue;
device->CreateCommandQueue(&queueDesc, IID_PPV_ARGS(&commandQueue));
 
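// 5. Create Swapchain (hwnd is the window handle, assumed to exist)
DXGI_SWAP_CHAIN_DESC1 swapchainDesc{};
swapchainDesc.Width = 1920;
swapchainDesc.Height = 1080;
swapchainDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
swapchainDesc.SampleDesc.Count = 1;  // required; flip-model swapchains do not support MSAA
swapchainDesc.BufferUsage = DXGI_USAGE_RENDER_TARGET_OUTPUT;
swapchainDesc.BufferCount = 3;  // triple buffering
swapchainDesc.SwapEffect = DXGI_SWAP_EFFECT_FLIP_DISCARD;

// CreateSwapChainForHwnd returns IDXGISwapChain1; query for the newer interface
IDXGISwapChain1* swapchain1;
factory->CreateSwapChainForHwnd(commandQueue, hwnd,
    &swapchainDesc, nullptr, nullptr, &swapchain1);

IDXGISwapChain4* swapchain;
swapchain1->QueryInterface(IID_PPV_ARGS(&swapchain));
swapchain1->Release();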

DX12 Resource Barriers

// Transition render target from present → render target state
D3D12_RESOURCE_BARRIER barrier{};
barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
barrier.Transition.pResource = renderTarget;
barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_PRESENT;
barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_RENDER_TARGET;
barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
 
commandList->ResourceBarrier(1, &barrier);
 
// ... draw calls ...
 
// Transition back to present
barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_RENDER_TARGET;
barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PRESENT;
commandList->ResourceBarrier(1, &barrier);

| Resource State | Usage |
|---|---|
| PRESENT | Ready to display on screen |
| RENDER_TARGET | Writing color output |
| DEPTH_WRITE | Writing depth buffer |
| PIXEL_SHADER_RESOURCE | Reading in pixel shader |
| NON_PIXEL_SHADER_RESOURCE | Reading in compute/vertex shader |
| COPY_SOURCE | Source of a copy operation |
| COPY_DEST | Destination of a copy operation |
| UNORDERED_ACCESS | Read/write in compute shader |

DX12 Descriptor Heaps

| Heap Type | Contains | Shader Visible |
|---|---|---|
| CBV_SRV_UAV | Constant buffers, textures, UAVs | Yes |
| SAMPLER | Texture samplers | Yes |
| RTV | Render target views | No |
| DSV | Depth stencil views | No |

// Create CBV/SRV/UAV descriptor heap
D3D12_DESCRIPTOR_HEAP_DESC heapDesc{};
heapDesc.Type = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
heapDesc.NumDescriptors = 1000;
heapDesc.Flags = D3D12_DESCRIPTOR_HEAP_FLAG_SHADER_VISIBLE;
 
ID3D12DescriptorHeap* srvHeap;
device->CreateDescriptorHeap(&heapDesc, IID_PPV_ARGS(&srvHeap));
 
// Create SRV for a texture
D3D12_SHADER_RESOURCE_VIEW_DESC srvDesc{};
srvDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
srvDesc.ViewDimension = D3D12_SRV_DIMENSION_TEXTURE2D;
srvDesc.Shader4ComponentMapping = D3D12_DEFAULT_SHADER_4_COMPONENT_MAPPING;
srvDesc.Texture2D.MipLevels = 1;
 
device->CreateShaderResourceView(texture, &srvDesc,
    srvHeap->GetCPUDescriptorHandleForHeapStart());
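The snippet above writes the view into the first slot of the heap. To place a descriptor in slot `i`, the heap start handle is offset by `i` times the descriptor increment size, which real code queries with `ID3D12Device::GetDescriptorHandleIncrementSize`. The sketch below shows only that arithmetic; `CpuDescriptorHandle` is a stand-in for `D3D12_CPU_DESCRIPTOR_HANDLE` (which wraps a single `ptr` field) so the example compiles without the Windows SDK, and `descriptorAt` is an illustrative name.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for D3D12_CPU_DESCRIPTOR_HANDLE.
struct CpuDescriptorHandle { std::size_t ptr; };

// Handle of the descriptor in slot `index`. `incrementSize` is
// hardware-dependent, so it must be queried from the device for the
// heap type rather than hard-coded.
CpuDescriptorHandle descriptorAt(CpuDescriptorHandle heapStart,
                                 std::uint32_t index,
                                 std::uint32_t incrementSize) {
    CpuDescriptorHandle handle;
    handle.ptr = heapStart.ptr + static_cast<std::size_t>(index) * incrementSize;
    return handle;
}
```

The same offset applies to the GPU handle when binding a descriptor table, which is what makes a single large heap usable as an index space for many textures.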

Metal (Apple)

Metal Architecture

graph TD
    App["Swift / Objective-C / C++ App"]
    Device["MTLDevice\nGPU abstraction"]
    CmdQueue["MTLCommandQueue\nSubmit work to GPU"]
    CmdBuf["MTLCommandBuffer\nRecord commands"]
    subgraph Encoders["Command Encoders"]
        Render["MTLRenderCommandEncoder\nDraw calls"]
        Compute["MTLComputeCommandEncoder\nCompute dispatches"]
        Blit["MTLBlitCommandEncoder\nCopy operations"]
    end
    Library["MTLLibrary\nCompiled shaders (.metallib)"]
    Pipeline["MTLRenderPipelineState\nShaders + render state"]

    App --> Device --> CmdQueue --> CmdBuf
    CmdBuf --> Encoders
    Library --> Pipeline
    Pipeline --> Render

Metal Initialization (Swift)

import Metal
import MetalKit
 
// Get GPU device
guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("Metal not supported")
}
 
// Create command queue
let commandQueue = device.makeCommandQueue()!
 
// Load shader library
let library = device.makeDefaultLibrary()!
let vertexFunction   = library.makeFunction(name: "vertex_main")!
let fragmentFunction = library.makeFunction(name: "fragment_main")!
 
// Create render pipeline
let pipelineDescriptor = MTLRenderPipelineDescriptor()
pipelineDescriptor.vertexFunction   = vertexFunction
pipelineDescriptor.fragmentFunction = fragmentFunction
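pipelineDescriptor.colorAttachments[0].pixelFormat = .bgra8Unorm
// vertex_main reads its input through [[stage_in]], so the pipeline needs a
// vertex descriptor for attributes 0-2 (vertexDescriptor is assumed to be built elsewhere)
pipelineDescriptor.vertexDescriptor = vertexDescriptor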
 
let pipelineState = try! device.makeRenderPipelineState(descriptor: pipelineDescriptor)
 
// Per frame — create command buffer and encode draw calls
let commandBuffer = commandQueue.makeCommandBuffer()!
let renderEncoder = commandBuffer.makeRenderCommandEncoder(descriptor: renderPassDescriptor)!
 
renderEncoder.setRenderPipelineState(pipelineState)
renderEncoder.setVertexBuffer(vertexBuffer, offset: 0, index: 0)
renderEncoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: 3)
renderEncoder.endEncoding()
 
commandBuffer.present(drawable)
commandBuffer.commit()

Metal Shading Language (MSL)

#include <metal_stdlib>
using namespace metal;
 
// Vertex input structure
struct VertexIn {
    float3 position [[attribute(0)]];
    float2 texCoord [[attribute(1)]];
    float3 normal   [[attribute(2)]];
};
 
// Vertex output / fragment input
struct VertexOut {
    float4 position [[position]];
    float2 texCoord;
    float3 worldNormal;
};
 
// Uniform buffer
struct Uniforms {
    float4x4 modelMatrix;
    float4x4 viewProjectionMatrix;
    float3   lightDirection;
};
 
// Vertex shader
vertex VertexOut vertex_main(
    VertexIn in [[stage_in]],
    constant Uniforms& uniforms [[buffer(1)]])
{
    VertexOut out;
    float4 worldPos = uniforms.modelMatrix * float4(in.position, 1.0);
    out.position  = uniforms.viewProjectionMatrix * worldPos;
    out.texCoord  = in.texCoord;
    out.worldNormal = (uniforms.modelMatrix * float4(in.normal, 0.0)).xyz;
    return out;
}
 
// Fragment shader
fragment float4 fragment_main(
    VertexOut in [[stage_in]],
    texture2d<float> albedoTexture [[texture(0)]],
    sampler texSampler [[sampler(0)]],
    constant Uniforms& uniforms [[buffer(1)]])
{
    float4 color = albedoTexture.sample(texSampler, in.texCoord);
    float  ndotl = max(dot(normalize(in.worldNormal), -uniforms.lightDirection), 0.0);
    return float4(color.rgb * ndotl, color.a);
}
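The `Uniforms` struct in the shader must match the bytes the CPU uploads. In MSL a `float3` occupies 16 bytes with 16-byte alignment, so a tightly packed CPU struct with a 12-byte vector would be misaligned. The following is a sketch of a matching CPU-side layout in C++; it is illustrative only, and a Swift app would typically use the `simd` types, which follow the same rules.

```cpp
#include <cstddef>

// CPU-side mirror of the MSL Uniforms struct.
struct Uniforms {
    float modelMatrix[16];               // float4x4: 64 bytes
    float viewProjectionMatrix[16];      // float4x4: 64 bytes
    alignas(16) float lightDirection[3]; // float3: padded to 16 bytes in MSL
};

// Fail at compile time if the layout drifts from what the shader expects.
static_assert(offsetof(Uniforms, lightDirection) == 128, "lightDirection offset mismatch");
static_assert(sizeof(Uniforms) == 144, "Uniforms size mismatch");
```

The buffer bound at index 1 therefore needs at least 144 bytes, not the 140 that a naive sum of the fields suggests.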

WebGPU

WebGPU vs WebGL

| Feature | WebGL | WebGPU |
|---|---|---|
| Based on | OpenGL ES 2.0/3.0 | Vulkan / DX12 / Metal |
| Compute shaders | No | Yes (full compute) |
| Multi-threading | No | Yes (workers) |
| Explicit memory | No | Yes |
| Shader language | GLSL | WGSL |
| Performance | Medium | High |
| Status | Legacy | Modern standard |

WebGPU Initialization (JavaScript)

// Check support
if (!navigator.gpu) throw new Error("WebGPU not supported");
 
// Get adapter (GPU) and device
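const adapter = await navigator.gpu.requestAdapter({
    powerPreference: "high-performance"
});
// requestAdapter resolves to null when no suitable GPU is available
if (!adapter) throw new Error("No suitable GPU adapter found");
const device = await adapter.requestDevice();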
 
// Get canvas context
const canvas  = document.querySelector("canvas");
const context = canvas.getContext("webgpu");
const format  = navigator.gpu.getPreferredCanvasFormat();
 
context.configure({ device, format });
 
// Create shader module (WGSL)
const shaderModule = device.createShaderModule({ code: `
    @vertex
    fn vs_main(@builtin(vertex_index) vi: u32) -> @builtin(position) vec4f {
        var pos = array<vec2f, 3>(
            vec2f( 0.0,  0.5),
            vec2f(-0.5, -0.5),
            vec2f( 0.5, -0.5)
        );
        return vec4f(pos[vi], 0.0, 1.0);
    }
 
    @fragment
    fn fs_main() -> @location(0) vec4f {
        return vec4f(1.0, 0.4, 0.1, 1.0); // orange
    }
`});
 
// Create render pipeline
const pipeline = device.createRenderPipeline({
    layout: "auto",
    vertex:   { module: shaderModule, entryPoint: "vs_main" },
    fragment: { module: shaderModule, entryPoint: "fs_main",
                targets: [{ format }] },
    primitive: { topology: "triangle-list" }
});
 
// Render frame
const encoder     = device.createCommandEncoder();
const renderPass  = encoder.beginRenderPass({
    colorAttachments: [{
        view: context.getCurrentTexture().createView(),
        clearValue: { r: 0, g: 0, b: 0, a: 1 },
        loadOp: "clear", storeOp: "store"
    }]
});
renderPass.setPipeline(pipeline);
renderPass.draw(3);
renderPass.end();
device.queue.submit([encoder.finish()]);

WGSL (WebGPU Shading Language)

// Uniform buffer binding
struct Uniforms {
    modelMatrix : mat4x4<f32>,
    viewProjMatrix : mat4x4<f32>,
}
@group(0) @binding(0) var<uniform> uniforms : Uniforms;
 
// Texture and sampler
@group(0) @binding(1) var myTexture : texture_2d<f32>;
@group(0) @binding(2) var mySampler : sampler;
 
// Vertex shader
struct VertexOutput {
    @builtin(position) position : vec4<f32>,
    @location(0) uv : vec2<f32>,
}
 
@vertex
fn vs_main(
    @location(0) position : vec3<f32>,
    @location(1) uv : vec2<f32>
) -> VertexOutput {
    var out : VertexOutput;
    out.position = uniforms.viewProjMatrix * uniforms.modelMatrix * vec4<f32>(position, 1.0);
    out.uv = uv;
    return out;
}
 
// Fragment shader
@fragment
fn fs_main(in : VertexOutput) -> @location(0) vec4<f32> {
    return textureSample(myTexture, mySampler, in.uv);
}

HLSL (High-Level Shading Language)

HLSL Basics

// Constant buffer (uniform data from CPU)
cbuffer PerFrameConstants : register(b0)
{
    float4x4 g_ModelMatrix;
    float4x4 g_ViewProjMatrix;
    float3   g_LightDir;
    float    g_Time;
};
 
// Texture and sampler
Texture2D    g_AlbedoTexture : register(t0);
SamplerState g_LinearSampler : register(s0);
 
// Vertex shader input
struct VSInput
{
    float3 Position : POSITION;
    float3 Normal   : NORMAL;
    float2 TexCoord : TEXCOORD0;
};
 
// Vertex shader output / pixel shader input
struct PSInput
{
    float4 Position    : SV_POSITION;
    float3 WorldNormal : NORMAL;
    float2 TexCoord    : TEXCOORD0;
};
 
// Vertex shader
PSInput VSMain(VSInput input)
{
    PSInput output;
    float4 worldPos = mul(g_ModelMatrix, float4(input.Position, 1.0));
    output.Position    = mul(g_ViewProjMatrix, worldPos);
    output.WorldNormal = mul((float3x3)g_ModelMatrix, input.Normal);
    output.TexCoord    = input.TexCoord;
    return output;
}
 
// Pixel shader
float4 PSMain(PSInput input) : SV_TARGET
{
    float4 albedo = g_AlbedoTexture.Sample(g_LinearSampler, input.TexCoord);
    float3 normal = normalize(input.WorldNormal);
    float  ndotl  = saturate(dot(normal, -g_LightDir));
    return float4(albedo.rgb * ndotl, albedo.a);
}

HLSL Data Types

| Type | Description | Example |
|---|---|---|
| float | 32-bit float | float x = 1.0; |
| float2/3/4 | Vector | float3 pos = float3(1,2,3); |
| float4x4 | 4x4 matrix | float4x4 mvp; |
| int/uint | Integer | int count = 5; |
| bool | Boolean | bool isLit = true; |
| half | 16-bit float (mobile perf) | half2 uv; |
| Texture2D | 2D texture resource | Texture2D albedo; |
| TextureCube | Cubemap texture | TextureCube envMap; |
| SamplerState | Texture sampler | SamplerState s; |
| RWTexture2D | Read/write texture (compute) | RWTexture2D<float4> output; |
| StructuredBuffer | Array of structs | StructuredBuffer<Particle> particles; |

HLSL Compute Shader

// Compute shader — runs on GPU without rasterization
// Used for: post-processing, physics simulation, particle systems, AI
 
RWTexture2D<float4> g_OutputTexture : register(u0);
Texture2D<float4>   g_InputTexture  : register(t0);
 
// Thread group size: 8x8 = 64 threads per group
[numthreads(8, 8, 1)]
void CSMain(
    uint3 dispatchID  : SV_DispatchThreadID,   // global thread ID
    uint3 groupID     : SV_GroupID,             // which group
    uint3 localID     : SV_GroupThreadID,       // thread within group
    uint  groupIndex  : SV_GroupIndex           // flat index within group
)
{
    uint2 pixel = dispatchID.xy;
 
    // Get texture dimensions
    uint width, height;
    g_InputTexture.GetDimensions(width, height);
 
    if (pixel.x >= width || pixel.y >= height) return;
 
    // Simple blur — sample 3x3 neighborhood
    float4 color = float4(0, 0, 0, 0);
    for (int dy = -1; dy <= 1; dy++)
    for (int dx = -1; dx <= 1; dx++)
    {
        int2 samplePos = clamp(int2(pixel) + int2(dx, dy),
                              int2(0,0), int2(width-1, height-1));
        color += g_InputTexture[samplePos];
    }
    g_OutputTexture[pixel] = color / 9.0;
}
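A CPU reference implementation is a common way to validate a compute shader like the blur above. The sketch below reproduces the same 3x3 box filter with clamped edge sampling on a single-channel image; `boxBlur3x3` is a hypothetical name, and the single channel is a simplification of the shader's `float4` texels.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// 3x3 box blur with clamp-to-edge addressing, matching the shader's
// clamp(int2(pixel) + int2(dx, dy), 0, size - 1) sampling.
// `src` holds width * height values in row-major order.
std::vector<float> boxBlur3x3(const std::vector<float>& src, int width, int height) {
    std::vector<float> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const int sx = std::min(std::max(x + dx, 0), width - 1);
                    const int sy = std::min(std::max(y + dy, 0), height - 1);
                    sum += src[static_cast<std::size_t>(sy) * width + sx];
                }
            }
            dst[static_cast<std::size_t>(y) * width + x] = sum / 9.0f;
        }
    }
    return dst;
}
```

With `[numthreads(8, 8, 1)]`, the matching dispatch covers the image with ceil(width / 8) by ceil(height / 8) groups; the bounds check at the top of `CSMain` discards the threads that fall outside the image.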

HLSL Semantic Reference

| Semantic | Stage | Description |
|---|---|---|
| SV_POSITION | VS out / PS in | Clip-space position |
| SV_TARGET | PS out | Render target output |
| SV_DEPTH | PS out | Depth output |
| SV_VertexID | VS in | Vertex index |
| SV_InstanceID | VS in | Instance index |
| SV_DispatchThreadID | CS in | Global compute thread ID |
| SV_GroupID | CS in | Thread group ID |
| SV_GroupThreadID | CS in | Thread ID within group |
| SV_GroupIndex | CS in | Flat index within group |
| POSITION | VS in | Vertex position |
| NORMAL | VS in | Vertex normal |
| TEXCOORD0-7 | VS in/out | Texture coordinates |
| COLOR0-1 | VS in/out | Vertex color |

GLSL Advanced

GLSL Advanced Features

#version 460 core
 
// Push constants (Vulkan — fast small data, no buffer needed)
layout(push_constant) uniform PushConstants {
    mat4 mvp;
    vec4 color;
    float time;
} pc;
 
// Descriptor set bindings
layout(set = 0, binding = 0) uniform sampler2D albedoMap;
layout(set = 0, binding = 1) uniform sampler2D normalMap;
layout(set = 0, binding = 2) uniform sampler2D roughnessMap;
 
// Subpass input (Vulkan deferred rendering)
layout(input_attachment_index = 0, set = 1, binding = 0)
    uniform subpassInput gBufferAlbedo;
 
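// Shader storage buffer (read/write from shader)
// Only the last member of a buffer block may be an unsized array,
// so positions and velocities are stored together per particle.
struct Particle {
    vec4 position;
    vec4 velocity;
};
layout(std430, set = 0, binding = 3) buffer ParticleBuffer {
    Particle particles[];
};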

GLSL Compute Shader

#version 460
 
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
 
// Shared memory — fast on-chip memory shared within workgroup
shared vec4 sharedData[64];
 
layout(set = 0, binding = 0) buffer InputBuffer  { vec4 input_data[]; };
layout(set = 0, binding = 1) buffer OutputBuffer { vec4 output_data[]; };
 
void main() {
    uint gid = gl_GlobalInvocationID.x;  // global thread index
    uint lid = gl_LocalInvocationID.x;   // local thread index
 
    // Load into shared memory
    sharedData[lid] = input_data[gid];
 
    // Synchronize — all threads must reach this before continuing
    barrier();
    memoryBarrierShared();
 
    // Process using shared memory (e.g., parallel reduction)
    for (uint stride = 32; stride > 0; stride >>= 1) {
        if (lid < stride) {
            sharedData[lid] += sharedData[lid + stride];
        }
        barrier();
    }
 
    if (lid == 0) {
        output_data[gl_WorkGroupID.x] = sharedData[0];
    }
}
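The reduction loop above can be checked on the CPU by simulating one workgroup. In the sketch below each iteration of the outer loop corresponds to one barrier-separated step of the shader, and the inner loop plays the role of the threads with `lid < stride`. Within a step, every thread reads from the upper half and writes to the lower half, so running the threads sequentially gives the same result as running them in parallel. `reduceWorkgroup` is an illustrative name, and scalar floats stand in for the shader's `vec4` values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulates the shared-memory tree reduction for one workgroup.
// `shared` is the workgroup's shared array after the initial load;
// its size must be a power of two (64 in the shader).
float reduceWorkgroup(std::vector<float> shared) {
    const std::size_t n = shared.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    for (std::size_t stride = n / 2; stride > 0; stride >>= 1) {
        for (std::size_t lid = 0; lid < stride; ++lid) {
            shared[lid] += shared[lid + stride];
        }
        // In the shader, barrier() sits here so the next step sees these sums.
    }
    return shared[0];
}
```

For 64 elements the loop runs 6 steps (strides 32, 16, 8, 4, 2, 1) instead of 63 sequential additions, which is the point of the tree shape.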

SPIR-V Pipeline

graph LR
    GLSL["GLSL source\n(.vert .frag .comp)"]
    HLSL2["HLSL source\n(.hlsl)"]
    MSL2["MSL source\n(.metal)"]
    SPIRV["SPIR-V bytecode\n(.spv)"]
    Vulkan2["Vulkan\nVkShaderModule"]
    GLSL -->|"glslc / glslangValidator"| SPIRV
    HLSL2 -->|"dxc -spirv"| SPIRV
    SPIRV --> Vulkan2
    MSL2 -->|"xcrun metal"| MetalLib["Metal Library\n(.metallib)"]

# Compile GLSL to SPIR-V
glslc shader.vert -o vert.spv
glslc shader.frag -o frag.spv
glslc shader.comp -o comp.spv
 
# Compile HLSL to SPIR-V (for Vulkan)
dxc -spirv -T vs_6_6 -E VSMain shader.hlsl -Fo vert.spv
dxc -spirv -T ps_6_6 -E PSMain shader.hlsl -Fo frag.spv
dxc -spirv -T cs_6_6 -E CSMain shader.hlsl -Fo comp.spv
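Before handing a compiled `.spv` file to `vkCreateShaderModule`, a loader can run a cheap sanity check. Every SPIR-V module starts with a five-word header whose first word is the magic number `0x07230203`. The sketch below checks only that; `looksLikeSpirv` is a hypothetical helper, it assumes the words were read in host byte order, and it is not a substitute for a real validator such as `spirv-val`.

```cpp
#include <cstdint>
#include <vector>

constexpr std::uint32_t kSpirvMagic = 0x07230203u;

// Cheap header check for a SPIR-V module held as 32-bit words.
// The header is 5 words: magic, version, generator, bound, schema.
bool looksLikeSpirv(const std::vector<std::uint32_t>& words) {
    return words.size() >= 5 && words[0] == kSpirvMagic;
}
```

Storing the module as 32-bit words also satisfies the Vulkan requirement that the code size passed in `VkShaderModuleCreateInfo` is a multiple of 4 bytes.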

Advanced Rendering Techniques

Render Graph

graph TD
    GBuf["G-Buffer Pass\nWrite: Albedo, Normal, Depth"]
    Shadow["Shadow Map Pass\nWrite: ShadowMap"]
    SSAO["SSAO Pass\nRead: Depth, Normal\nWrite: AO texture"]
    Lighting["Deferred Lighting Pass\nRead: GBuffer, ShadowMap, AO\nWrite: HDR color"]
    Bloom["Bloom Pass\nRead: HDR color\nWrite: Bloom texture"]
    Tonemap["Tonemap Pass\nRead: HDR color, Bloom\nWrite: LDR backbuffer"]
    Present["Present\nDisplay backbuffer"]

    GBuf --> SSAO
    GBuf --> Lighting
    Shadow --> Lighting
    SSAO --> Lighting
    Lighting --> Bloom
    Lighting --> Tonemap
    Bloom --> Tonemap
    Tonemap --> Present
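The core job of a render graph is to derive an execution order from the resources each pass reads and writes. The sketch below does this with a topological sort (Kahn's algorithm). `Pass` and `schedule` are hypothetical names, each resource is assumed to have exactly one writer, and a production render graph would also handle resource lifetimes, aliasing, and barrier placement, which are left out here.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct Pass {
    std::string name;
    std::vector<std::string> reads;
    std::vector<std::string> writes;
};

// Returns pass names in an order where every pass runs after the passes
// that produce its inputs. Returns an empty list if the graph has a cycle.
std::vector<std::string> schedule(const std::vector<Pass>& passes) {
    const std::size_t n = passes.size();
    std::map<std::string, std::size_t> producer; // resource -> writing pass
    for (std::size_t i = 0; i < n; ++i)
        for (const std::string& w : passes[i].writes) producer[w] = i;

    std::vector<std::vector<std::size_t> > dependents(n);
    std::vector<std::size_t> pending(n, 0); // unmet inputs per pass
    for (std::size_t i = 0; i < n; ++i) {
        for (const std::string& r : passes[i].reads) {
            std::map<std::string, std::size_t>::const_iterator it = producer.find(r);
            if (it == producer.end() || it->second == i) continue; // external input
            dependents[it->second].push_back(i);
            ++pending[i];
        }
    }

    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < n; ++i)
        if (pending[i] == 0) ready.push(i);

    std::vector<std::string> order;
    while (!ready.empty()) {
        const std::size_t i = ready.front();
        ready.pop();
        order.push_back(passes[i].name);
        for (std::size_t d : dependents[i])
            if (--pending[d] == 0) ready.push(d);
    }
    if (order.size() != n) order.clear(); // a cycle left some passes unscheduled
    return order;
}
```

Declaring the passes from the diagram above in any order yields a schedule in which the G-buffer and shadow passes come first and the tonemap pass comes last.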

Deferred Rendering

graph LR
    subgraph GPass["Geometry Pass"]
        Geo["Scene geometry"] --> GB1["Albedo buffer"]
        Geo --> GB2["Normal buffer"]
        Geo --> GB3["Depth buffer"]
        Geo --> GB4["Roughness/Metallic buffer"]
    end
    subgraph LPass["Lighting Pass"]
        GB1 --> Light["Lighting calculation\nfor ALL lights at once"]
        GB2 --> Light
        GB3 --> Light
        GB4 --> Light
        Light --> HDR["HDR color buffer"]
    end

| Technique | Cost | Lights | Transparency | Use Case |
|---|---|---|---|---|
| Forward | O(objects × lights) | Few | Yes | Simple scenes, mobile |
| Deferred | O(objects + lights) | Many | No | Complex scenes, many lights |
| Forward+ (Tiled) | O(objects + tiles×lights) | Many | Yes | Best of both worlds |
| Clustered Forward | O(objects + clusters×lights) | Very many | Yes | Modern AAA standard |

Bindless Resources

// Bindless textures — DX12 / Vulkan
// All textures in one heap, indexed by uint
Texture2D g_Textures[] : register(t0, space0);
SamplerState g_Sampler : register(s0);
 
struct DrawData {
    uint albedoIndex;
    uint normalIndex;
    uint roughnessIndex;
    uint materialFlags;
};
 
StructuredBuffer<DrawData> g_DrawData : register(t0, space1);
 
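// PSInput here is assumed to carry a per-draw index (drawID) and a UV (uv)
float4 PSMain(PSInput input) : SV_TARGET {
    DrawData data = g_DrawData[input.drawID];

    // Index into bindless texture array.
    // NonUniformResourceIndex is required when the index can differ between
    // threads of the same wave, as it can when draws are merged.
    float4 albedo    = g_Textures[NonUniformResourceIndex(data.albedoIndex)].Sample(g_Sampler, input.uv);
    float3 normal    = g_Textures[NonUniformResourceIndex(data.normalIndex)].Sample(g_Sampler, input.uv).xyz;
    float  roughness = g_Textures[NonUniformResourceIndex(data.roughnessIndex)].Sample(g_Sampler, input.uv).r;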
 
    return albedo; // simplified
}

GPU-Driven Rendering

graph TD
    subgraph CPU["CPU (minimal work)"]
        Upload["Upload scene data\nto GPU once"]
        Dispatch["Dispatch compute shader"]
    end
    subgraph GPU["GPU (does everything)"]
        Cull["Compute: Frustum + Occlusion Culling\nDetermines which objects are visible"]
        IndirectArgs["Write indirect draw arguments\nto GPU buffer"]
        Draw["ExecuteIndirect / DrawIndirect\nGPU issues its own draw calls"]
    end
    Upload --> Dispatch --> Cull --> IndirectArgs --> Draw
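The culling step in the diagram runs as a compute shader, but its logic is easy to prototype on the CPU. The sketch below tests bounding spheres against frustum planes and appends one indirect draw record per visible object. `DrawIndexedIndirect` uses the same field order as `VkDrawIndexedIndirectCommand` and `D3D12_DRAW_INDEXED_ARGUMENTS`; the other names (`Plane`, `Sphere`, `Object`, `buildIndirectArgs`) are hypothetical, and occlusion culling is omitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Plane  { float nx, ny, nz, d; };     // inside when n·p + d >= 0
struct Sphere { float x, y, z, radius; };

struct DrawIndexedIndirect {
    std::uint32_t indexCount;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  vertexOffset;
    std::uint32_t firstInstance;
};

struct Object {
    Sphere bounds;
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  vertexOffset;
};

// A sphere is culled only if it lies entirely behind some plane.
bool sphereVisible(const Sphere& s, const std::vector<Plane>& frustum) {
    for (const Plane& p : frustum) {
        const float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.radius) return false;
    }
    return true;
}

// Compacts visible objects into a draw-argument list. firstInstance
// carries the object index so the shader can fetch per-object data.
std::vector<DrawIndexedIndirect> buildIndirectArgs(const std::vector<Object>& objects,
                                                   const std::vector<Plane>& frustum) {
    std::vector<DrawIndexedIndirect> args;
    for (std::size_t i = 0; i < objects.size(); ++i) {
        if (!sphereVisible(objects[i].bounds, frustum)) continue;
        DrawIndexedIndirect cmd;
        cmd.indexCount    = objects[i].indexCount;
        cmd.instanceCount = 1;
        cmd.firstIndex    = objects[i].firstIndex;
        cmd.vertexOffset  = objects[i].vertexOffset;
        cmd.firstInstance = static_cast<std::uint32_t>(i);
        args.push_back(cmd);
    }
    return args;
}
```

On the GPU, the `push_back` becomes an atomic increment of a draw counter followed by a write into the argument buffer, and the final count feeds the indirect draw call.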

Mesh Shaders

graph LR
    subgraph Old["Old Pipeline"]
        IA["Input Assembler"] --> VS["Vertex Shader"] --> GS["Geometry Shader"] --> Rast1["Rasterizer"]
    end
    subgraph New["Mesh Shader Pipeline"]
        TS["Task Shader\n(Amplification)\nCulling, LOD selection"] --> MS["Mesh Shader\nGenerate vertices + primitives"] --> Rast2["Rasterizer"]
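    end

| Feature | Vertex Shader | Mesh Shader |
|---|---|---|
| Input | Fixed vertex buffer | Flexible (any data) |
| Output | One vertex per invocation | Up to 256 vertices and 256 primitives per group (D3D12 limits; some hardware allows more) |
| Culling | CPU or geometry shader | Task shader on GPU |
| LOD | CPU-side | Task shader on GPU |
| Procedural geo | Geometry shader (slow) | Native, fast |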

PBR — Physically Based Rendering

PBR Material Model

| Parameter | Range | Description |
|---|---|---|
| Albedo | 0–1 RGB | Base color, no lighting baked in |
| Metallic | 0–1 | 0 = dielectric (plastic/wood), 1 = metal |
| Roughness | 0–1 | 0 = mirror smooth, 1 = fully rough (broad, dim highlights) |
| Normal | XYZ | Surface detail without geometry |
| AO | 0–1 | Ambient occlusion (crevice darkening) |
| Emission | RGB | Self-illumination, ignores lighting |
| Height/Displacement | 0–1 | Surface displacement |

Cook-Torrance BRDF

// Cook-Torrance specular BRDF, a widely used PBR model
// f(l,v) = D(h) * F(v,h) * G(l,v,h) / (4 * dot(n,l) * dot(n,v))

const float PI = 3.14159265359;

// D — Normal Distribution Function (GGX/Trowbridge-Reitz)
float DistributionGGX(vec3 N, vec3 H, float roughness) {
    float a  = roughness * roughness;
    float a2 = a * a;
    float NdotH  = max(dot(N, H), 0.0);
    float NdotH2 = NdotH * NdotH;
    float denom = (NdotH2 * (a2 - 1.0) + 1.0);
    return a2 / (PI * denom * denom);
}
 
// F — Fresnel-Schlick approximation
vec3 FresnelSchlick(float cosTheta, vec3 F0) {
    return F0 + (1.0 - F0) * pow(clamp(1.0 - cosTheta, 0.0, 1.0), 5.0);
}
 
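// Schlick-GGX term for one direction, using the k = (r + 1)^2 / 8
// remapping commonly applied for direct lighting
float GeometrySchlickGGX(float NdotX, float roughness) {
    float r = roughness + 1.0;
    float k = (r * r) / 8.0;
    return NdotX / (NdotX * (1.0 - k) + k);
}

// G: Geometry function (Smith's method), combining view and light directions
float GeometrySmith(vec3 N, vec3 V, vec3 L, float roughness) {
    float NdotV = max(dot(N, V), 0.0);
    float NdotL = max(dot(N, L), 0.0);
    float ggx1  = GeometrySchlickGGX(NdotV, roughness);
    float ggx2  = GeometrySchlickGGX(NdotL, roughness);
    return ggx1 * ggx2;
}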
 
// Full PBR lighting calculation
vec3 PBR_Lighting(vec3 albedo, float metallic, float roughness,
                  vec3 N, vec3 V, vec3 L, vec3 lightColor) {
    vec3 H  = normalize(V + L);
    vec3 F0 = mix(vec3(0.04), albedo, metallic); // base reflectivity
 
    float D = DistributionGGX(N, H, roughness);
    vec3  F = FresnelSchlick(max(dot(H, V), 0.0), F0);
    float G = GeometrySmith(N, V, L, roughness);
 
    vec3  specular = (D * F * G) / (4.0 * max(dot(N,V),0.0) * max(dot(N,L),0.0) + 0.0001);
    vec3  kD = (vec3(1.0) - F) * (1.0 - metallic);
    vec3  diffuse = kD * albedo / PI;
 
    float NdotL = max(dot(N, L), 0.0);
    return (diffuse + specular) * lightColor * NdotL;
}
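Shader math like the terms above is easier to trust once a few known values have been checked on the CPU. The sketch below ports the Fresnel-Schlick and GGX terms to scalar C++ so their limiting cases can be tested: at normal incidence Fresnel returns F0, at grazing angles it approaches 1, and at roughness 1 the GGX distribution is the constant 1/π. The function names are illustrative and the scalar F0 is a simplification of the shader's `vec3`.

```cpp
#include <algorithm>
#include <cmath>

// Fresnel-Schlick for a single channel.
float fresnelSchlick(float cosTheta, float f0) {
    const float x  = std::min(std::max(1.0f - cosTheta, 0.0f), 1.0f);
    const float x5 = x * x * x * x * x;
    return f0 + (1.0f - f0) * x5;
}

// GGX / Trowbridge-Reitz normal distribution, taking dot(N, H) directly.
float distributionGGX(float NdotH, float roughness) {
    const float kPi   = 3.14159265358979f;
    const float a     = roughness * roughness;
    const float a2    = a * a;
    const float denom = NdotH * NdotH * (a2 - 1.0f) + 1.0f;
    return a2 / (kPi * denom * denom);
}
```

Checks like these catch the most common porting mistakes, such as squaring roughness once instead of twice or dropping the π in the denominator.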

Performance & Optimization

GPU Profiling Tools

| Tool | Platform | What It Shows |
|---|---|---|
| RenderDoc | Windows, Linux, Android | Frame capture, draw call inspection, shader debugging |
| NVIDIA Nsight | NVIDIA | GPU timeline, shader occupancy, memory bandwidth |
| AMD Radeon GPU Profiler | AMD | GPU timeline, shader analysis |
| Intel GPA | Intel | Frame analysis, GPU metrics |
| Xcode GPU Frame Capture | Apple | Metal frame debugging |
| PIX | Windows/Xbox | DX12 frame capture and analysis |
| Chrome DevTools | Browser | WebGPU timing |

Common GPU Bottlenecks

graph TD
    Bottleneck["GPU Bottleneck?"]
    CPU["CPU Bound\nDraw call submission\ntoo slow"]
    Vertex["Vertex Bound\nToo many vertices\ncomplex vertex shader"]
    Fragment["Fragment Bound\nToo many pixels\ncomplex pixel shader"]
    Memory["Memory Bound\nTexture bandwidth\nbuffer reads"]
    Bottleneck --> CPU
    Bottleneck --> Vertex
    Bottleneck --> Fragment
    Bottleneck --> Memory
    CPU -->|Fix| CPUFix["Instancing\nIndirect drawing\nBatching"]
    Vertex -->|Fix| VFix["LOD\nMesh simplification\nVertex shader optimization"]
    Fragment -->|Fix| FFix["Early-Z\nDepth prepass\nShader simplification"]
    Memory -->|Fix| MFix["Texture compression\nMipmaps\nCoalesced access"]

Shader Optimization Rules

| Rule | Why | How |
|---|---|---|
| Avoid divergent branching | Threads in a warp that take different paths run both paths one after the other | Use step(), mix(), select() for short branches |
| Use half precision on mobile | Up to 2x throughput on many mobile GPUs | half / mediump types |
| Minimize texture samples | Each sample costs memory bandwidth | Reuse sampled values, use fewer textures |
| Precompute in vertex shader | Usually runs fewer times than the fragment shader | Move invariant math to VS |
| Use MAD instructions | Multiply-add is a single instruction | a * b + c compiles to MAD |
| Avoid dynamic indexing | Can force arrays out of registers into slower memory | Use constant indices when possible |
| Pack data tightly | Better cache utilization | Use one vec4 instead of 4 separate floats |
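The first rule in the table replaces a short branch with arithmetic. The sketch below shows the equivalence on the CPU, with `glslStep` and `glslMix` as stand-ins that follow the GLSL definitions of `step` and `mix`; `selectBranchy` and `selectBranchless` are illustrative names. Shader compilers often turn short branches into select instructions on their own, so treat this as a pattern to measure rather than a guaranteed win.

```cpp
// GLSL step(edge, x): 0.0 if x < edge, else 1.0.
inline float glslStep(float edge, float x) { return x < edge ? 0.0f : 1.0f; }

// GLSL mix(a, b, t): linear blend a * (1 - t) + b * t.
inline float glslMix(float a, float b, float t) { return a * (1.0f - t) + b * t; }

// Branching form: pick `above` when x reaches the threshold.
float selectBranchy(float x, float threshold, float below, float above) {
    if (x >= threshold) return above;
    return below;
}

// Branchless form: step() yields 0 or 1, and mix() uses it as a selector.
float selectBranchless(float x, float threshold, float below, float above) {
    return glslMix(below, above, glslStep(threshold, x));
}
```

The branchless form evaluates both inputs, so it only pays off when both sides are cheap; for expensive sides, a coherent branch that most threads take the same way is usually the better choice.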

Logseq Graph Connections

  • Related pages:
    • Game Development — engine-level rendering concepts (rasterization, lighting, shaders overview)
    • PathTracer Learning — full GPU path tracer implementation with Vulkan RT
    • Godot — Godot shading language and RenderingDevice
    • Unity — Unity URP/HDRP shader graph and HLSL
    • Unreal Engine — Unreal material system and HLSL
    • Cpp — C++ language reference for graphics programming
    • Rust — Rust with wgpu for WebGPU native
    • Game Design — design concepts that drive rendering requirements

Learn More

Official Documentation

Books

Tools & Libraries

  • RenderDoc: free GPU frame debugger. Works with Vulkan, D3D11/12, OpenGL, and OpenGL ES (Metal is not supported; use Xcode GPU Frame Capture there).
  • VulkanMemoryAllocator (AMD): free. Simplifies Vulkan memory management.
  • SPIRV-Cross: free. Cross-compiles SPIR-V to GLSL/HLSL/MSL.
  • wgpu (Rust): free. WebGPU implementation in Rust. Works natively and in the browser.
  • bgfx: free. Cross-platform rendering library abstracting Direct3D 11/12, Metal, OpenGL, Vulkan, and other backends.