🟢 Chapter 1 — What is a GPU and Why Does It Exist?
The Rendering Problem
A 1080p screen has 1,920 × 1,080 = 2,073,600 pixels
At 60 FPS, we must compute ~124 million pixels per second
Each pixel needs lighting, texture sampling, shadow calculation…
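Split evenly across an 8-core CPU, that is ~15.5 million pixels per core per second, or about 64 ns per pixel (roughly 190 clock cycles at 3 GHz)
That budget is far too small for lighting, texture sampling, and shadows. We need a completely different kind of processor.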
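The arithmetic above can be checked with a few lines of C++. This is a minimal sketch: the function names are made up for illustration, and the 3 GHz clock used in the usage note is an assumption, not a figure for any particular CPU.

```cpp
#include <cassert>
#include <cstdint>

// Pixels that must be shaded every second for a given resolution and frame rate.
std::uint64_t pixelsPerSecond(std::uint64_t width, std::uint64_t height, std::uint64_t fps) {
    return width * height * fps;
}

// Time budget per pixel, in nanoseconds, if the work is split evenly over `cores`.
double nanosecondsPerPixel(std::uint64_t pixelsPerSec, unsigned cores) {
    assert(cores > 0 && pixelsPerSec > 0);
    return 1e9 * cores / static_cast<double>(pixelsPerSec);
}

// The same budget expressed in clock cycles at `clockGHz`.
double cyclesPerPixel(std::uint64_t pixelsPerSec, unsigned cores, double clockGHz) {
    return nanosecondsPerPixel(pixelsPerSec, cores) * clockGHz;
}
```

Calling pixelsPerSecond(1920, 1080, 60) gives 124,416,000, and nanosecondsPerPixel with 8 cores gives a budget of a little over 64 ns per pixel.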
The GPU Solution: Massive Parallelism
Instead of a few powerful cores, a GPU has thousands of simple cores
Each core is slower and simpler than a CPU core — but there are SO MANY of them
All cores run the SAME program (your shader) on different data (different pixels)
This pattern is called SIMD: Single Instruction, Multiple Data (NVIDIA calls the GPU flavor SIMT: Single Instruction, Multiple Threads)
CPU: 8 cores × fast = handles 8 things well
GPU: 10,000 cores × slower = handles 10,000 things at once
For rendering: we always need to do the SAME thing to MANY pixels
→ GPU wins overwhelmingly
CPU vs GPU — Side by Side
CPU Core (1 of 8–32): GPU Core (1 of thousands):
┌─────────────────────┐ ┌───────────┐
│ Large L1/L2/L3 Cache│ │ Tiny cache│
│ Branch Predictor │ │ Simple ALU│
│ Out-of-Order Exec │ │ No branch │
│ Multiple GHz │ │ prediction│
│ Complex logic unit │ └───────────┘
└─────────────────────┘
Memory: 100 GB/s bandwidth Memory: ~1 TB/s bandwidth (GDDR6X)
Feature | CPU | GPU
Core count | 4–64 | 1,000–18,000+
Clock speed | 3–5 GHz | 1–2 GHz
Memory bandwidth | ~100 GB/s | ~1 TB/s (GDDR6X)
Best at | Sequential code, branches, logic | Parallel math, throughput
Latency | Low | High (hidden by parallelism)
Context switching | Costly (the OS saves and restores thread state) | Near-free between resident warps (their registers stay on-chip)
🟢 Chapter 2 — How the GPU Executes Your Shader
Threads, Warps, and Workgroups
When you run a shader, the GPU doesn’t run one instance at a time
It runs thousands of instances simultaneously, organized into a hierarchy:
Thread → Single shader invocation (one pixel, one vertex)
↓
Warp / Wavefront → Group of threads that execute IN LOCKSTEP
NVIDIA: 32 threads per warp
AMD: 64 threads per wavefront
↓
Thread Block / Workgroup → Multiple warps sharing memory
↓
Grid / Dispatch → All workgroups launched by one draw or dispatch call
A fragment shader for a 1080p frame launches about 2 million threads; the GPU keeps as many in flight as its resources allow and schedules the rest as slots free up
Each thread independently computes one pixel
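To make the hierarchy concrete, here is a small sketch that splits a flat thread index into workgroup, warp, and lane. It assumes a simple linear numbering and a hypothetical workgroup of 8 warps; real hardware maps pixels to threads in its own way (fragment work is typically grouped in 2×2 quads), so treat this only as a mental model.

```cpp
#include <cassert>
#include <cstdint>

// Where one thread sits in the hierarchy (illustrative type, not a GPU API).
struct ThreadLocation {
    std::uint64_t workgroup;        // which workgroup the thread belongs to
    std::uint32_t warpInWorkgroup;  // which warp inside that workgroup
    std::uint32_t lane;             // position inside the warp (0..warpSize-1)
};

// Decomposes a flat thread index, assuming threads are numbered linearly.
ThreadLocation locate(std::uint64_t threadIndex, std::uint32_t warpSize, std::uint32_t warpsPerWorkgroup) {
    assert(warpSize > 0 && warpsPerWorkgroup > 0);
    const std::uint64_t threadsPerWorkgroup = std::uint64_t(warpSize) * warpsPerWorkgroup;
    const std::uint64_t local = threadIndex % threadsPerWorkgroup;
    ThreadLocation where{};
    where.workgroup = threadIndex / threadsPerWorkgroup;
    where.warpInWorkgroup = static_cast<std::uint32_t>(local / warpSize);
    where.lane = static_cast<std::uint32_t>(local % warpSize);
    return where;
}
```

With 32-thread warps and 8 warps per workgroup, the last pixel of a 1080p frame (index 2,073,599) lands in workgroup 8099, warp 7, lane 31.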
The Warp — The Fundamental Execution Unit
The GPU scheduler assigns one instruction to an entire warp at once
All 32 threads in the warp execute the SAME instruction on DIFFERENT data
This is what enables massive throughput — one instruction decode → 32 results
Warp Divergence — The Hidden Performance Killer
What happens if threads in the same warp take different if/else paths?
if (someCondition) {
    // Path A: 16 threads want to go here
    result = expensiveCalcA();
} else {
    // Path B: the other 16 threads want to go here
    result = expensiveCalcB();
}
Both paths execute! Threads not on the active path are masked off (results discarded)
The warp takes time(A) + time(B) instead of max(time(A), time(B))
This is called warp divergence: it can halve your throughput, or worse
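The time(A) + time(B) rule can be written down as a tiny cost model. This is a sketch under simplifying assumptions: it ignores scheduling overhead and represents a 32-thread warp as one bit per thread. The function name is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Cost model: a warp pays for every path that at least one of its threads takes.
// takesA holds one bit per thread (bit i set = thread i takes path A).
double warpBranchCost(std::uint32_t takesA, double timeA, double timeB) {
    assert(timeA >= 0.0 && timeB >= 0.0);
    const bool anyOnA = takesA != 0u;           // at least one thread on path A
    const bool anyOnB = takesA != 0xFFFFFFFFu;  // at least one thread on path B
    return (anyOnA ? timeA : 0.0) + (anyOnB ? timeB : 0.0);
}
```

Note that a single stray thread is enough: warpBranchCost(1, 10, 7) is 17, the same as an even 16/16 split, while a warp whose threads all agree pays for only one path.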
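Performance Rule: Avoid if/else in shaders when possible. Use mix(), step(), clamp() instead.
These select values without branching. Both inputs are always evaluated, so this pays off when each side is cheap.
float v = condition ? a : b; → may still branch (many compilers turn this into a select)
float v = mix(b, a, step(0.5, condition)); → no branch (condition is a float, 0.0 or 1.0)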
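The mix()/step() trick is plain arithmetic, which a C++ sketch makes visible. The two helpers below are stand-ins written to match the GLSL built-ins of the same name; selectByFlag is a hypothetical wrapper for illustration. On a CPU the ternary inside step() is still a comparison, so this models the math, not GPU performance.

```cpp
#include <cassert>

namespace glsl {
// step(edge, x): 0.0 if x < edge, otherwise 1.0 (same definition as GLSL).
inline float step(float edge, float x) { return x < edge ? 0.0f : 1.0f; }
// mix(a, b, t): linear blend a*(1-t) + b*t (same definition as GLSL).
inline float mix(float a, float b, float t) { return a * (1.0f - t) + b * t; }
}  // namespace glsl

// Returns whenTrue if flag >= 0.5, otherwise whenFalse, using only arithmetic.
float selectByFlag(float flag, float whenTrue, float whenFalse) {
    assert(flag >= 0.0f && flag <= 1.0f);  // flag is meant to be 0.0 or 1.0
    return glsl::mix(whenFalse, whenTrue, glsl::step(0.5f, flag));
}
```

The weight from step() is exactly 0 or 1, so the blend degenerates into picking one of the two inputs.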
Latency Hiding
Memory access (reading from VRAM) has high latency: ~500–1000 clock cycles
CPUs use large caches + out-of-order execution to hide this
GPUs use warp switching: while one warp is waiting for memory, run another warp
This is why GPUs need MANY warps in flight — to always have something ready to run
This is also why register usage matters: more registers per thread = fewer warps can be in flight
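The register trade-off is simple division. The sketch below assumes a 256 KB register file per SM (65,536 32-bit registers), a hardware cap of 64 resident warps, and 32-thread warps; these limits are illustrative assumptions and vary between GPUs, and real hardware also rounds register allocations and applies other limits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumed per-SM limits for illustration; real values differ per GPU.
struct SmLimits {
    std::uint32_t registerFile = 65536;  // 32-bit registers (256 KB)
    std::uint32_t maxWarps = 64;         // cap on resident warps
    std::uint32_t warpSize = 32;         // threads per warp
};

// How many warps fit on one SM when each thread needs `registersPerThread` registers.
std::uint32_t residentWarps(std::uint32_t registersPerThread, const SmLimits& sm = SmLimits{}) {
    assert(registersPerThread > 0);
    const std::uint32_t byRegisters = sm.registerFile / (registersPerThread * sm.warpSize);
    return std::min(byRegisters, sm.maxWarps);
}
```

Under these assumptions a shader using 32 registers per thread keeps all 64 warps resident, while one using 128 registers per thread drops to 16, leaving far fewer warps to switch to while memory requests are pending.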
🟢 Chapter 3 — GPU Memory Hierarchy
Understanding memory is critical for writing efficient shaders.
Memory Types (Fastest → Slowest)
Registers → Per thread, fastest, ~256KB per SM
↓ ~10× slower
Shared Memory → Per workgroup, very fast, 32–128KB per SM
↓ ~5× slower
L1 Cache → Per SM, automatic, 32–128KB
↓ ~5× slower
L2 Cache → Whole GPU, 4–80MB
↓ ~10× slower
VRAM (GDDR6) → Off-chip, ~1TB/s bandwidth, GBs
↓ ~10× slower
System RAM → CPU side, ~100GB/s for the CPU; the GPU reaches it over PCIe (~32GB/s for PCIe 4.0 ×16)
Memory | Location | Speed | Size | Scope
Registers | On-chip | Fastest | ~256KB per SM | Per thread
Shared / LDS | On-chip | Very fast | 32–128KB per SM | Per workgroup
L1 Cache | On-chip | Fast | 32–128KB per SM | Per SM
L2 Cache | On-chip | Medium | 4–80MB | Whole GPU
VRAM (GDDR6) | Off-chip | 1 TB/s | 8–80GB | Whole GPU
System RAM | Off-chip | 100 GB/s | GBs | Via PCIe
Coalesced Memory Access
When threads in a warp read from adjacent memory addresses → coalesced
The GPU fetches ONE large chunk of memory → full bandwidth utilization
When threads read random addresses → uncoalesced → many separate small fetches → terrible performance
// GOOD: thread i reads index i → adjacent → coalesced
float coalesced = data[gl_GlobalInvocationID.x];

// BAD: threads read scattered positions → uncoalesced
// (hash() stands for any function that scatters the indices)
float scattered = data[hash(gl_GlobalInvocationID.x)];
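One way to see the difference is to count how many memory segments a single warp touches. The sketch below assumes 32 threads per warp and 128-byte memory transactions; the transaction size is an assumption for illustration and differs between architectures. Both function names are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Counts how many aligned segments of `segmentBytes` one warp's reads fall into.
std::size_t segmentsTouched(const std::vector<std::uint64_t>& byteAddresses,
                            std::uint64_t segmentBytes = 128) {
    assert(segmentBytes > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t address : byteAddresses) {
        segments.insert(address / segmentBytes);  // index of the segment holding this address
    }
    return segments.size();
}

// Byte addresses read by a 32-thread warp where thread i reads float element i * stride.
std::vector<std::uint64_t> warpAddresses(std::uint64_t strideInFloats) {
    std::vector<std::uint64_t> addresses;
    for (std::uint64_t thread = 0; thread < 32; ++thread) {
        addresses.push_back(thread * strideInFloats * sizeof(float));
    }
    return addresses;
}
```

With stride 1 the whole warp is served by one segment; with stride 32 every thread lands in its own segment, so the same 32 floats cost 32 fetches.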
Shared Memory (LDS — Local Data Store)
A fast scratchpad shared by all threads in a workgroup
Access is 5–10× faster than VRAM
Key optimization: load from VRAM once to shared mem → use many times
// Compute shader using shared memory for matrix multiply
shared float tile[16][16];      // 16×16 tile in shared memory
shared float otherTile[16][16]; // matching tile of the second matrix, loaded the same way (not shown)

// Each thread loads one element from global memory into shared memory
tile[localY][localX] = globalMatrix[globalY * width + globalX];
barrier(); // wait for ALL threads in the workgroup to finish loading

// Now all threads can read the tiles very fast
for (int k = 0; k < 16; k++) {
    result += tile[localY][k] * otherTile[k][localX];
}
🟡 Chapter 4 — What is Vulkan? Why is it Hard?
The Old Way: OpenGL
OpenGL (1992) has a simple API — load a shader, bind a texture, draw
The GPU driver does A LOT of work behind the scenes:
Compiles shaders at draw time
Manages memory allocation
Infers synchronization (when can GPU start drawing?)
Validates your calls → slow
The driver overhead became a bottleneck — 10-20% of CPU time just in driver code
The New Way: Vulkan (2016)
Vulkan removes driver magic — YOU do everything explicitly:
More code, but full control and predictable performance
OpenGL "draw a triangle": ~10 lines
Vulkan "draw a triangle": ~800 lines
But: Vulkan is faster, more predictable, and scales to multi-threaded rendering
When to Use Vulkan
You’re building a game engine or custom renderer
You need maximum GPU performance
You need cross-platform (Windows, Linux, Android, macOS via MoltenVK)
You’re learning GPU architecture (Vulkan teaches you exactly how the GPU works)
NOT for: a quick graphics demo (use WebGPU or SDL), a first graphics project (use OpenGL)
Objects are created once (expensive), used many times (cheap)
Lifetime and dependencies are clear and explicit
🟠 Chapter 6 — Vulkan from Zero: Step by Step
Step 1: Instance — Connect to Vulkan
// Tell Vulkan about your application
VkApplicationInfo appInfo{};
appInfo.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
appInfo.pApplicationName = "My Game";
appInfo.applicationVersion = VK_MAKE_VERSION(1, 0, 0);
appInfo.apiVersion = VK_API_VERSION_1_3; // Vulkan 1.3

VkInstanceCreateInfo instanceInfo{};
instanceInfo.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
instanceInfo.pApplicationInfo = &appInfo;

// Enable validation layers (ONLY in debug builds!)
// Validation layers catch your mistakes and print helpful errors
const char* validationLayers[] = {"VK_LAYER_KHRONOS_validation"};
instanceInfo.enabledLayerCount = 1;
instanceInfo.ppEnabledLayerNames = validationLayers;

// Enable extensions needed for windowed rendering
// (VK_KHR_win32_surface is Windows-only; other platforms need their own surface extension)
const char* extensions[] = {"VK_KHR_surface", "VK_KHR_win32_surface"};
instanceInfo.enabledExtensionCount = 2;
instanceInfo.ppEnabledExtensionNames = extensions;

VkInstance instance;
VkResult result = vkCreateInstance(&instanceInfo, nullptr, &instance);
// ALWAYS check result == VK_SUCCESS in real code!
Step 2: Physical Device — Pick the GPU
// Enumerate all available GPUs
uint32_t deviceCount = 0;
vkEnumeratePhysicalDevices(instance, &deviceCount, nullptr);
std::vector<VkPhysicalDevice> devices(deviceCount);
vkEnumeratePhysicalDevices(instance, &deviceCount, devices.data());

// Check each GPU's properties and features
for (auto& device : devices) {
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(device, &props);
    // props.deviceType: DISCRETE_GPU, INTEGRATED_GPU, CPU...
    // props.limits: maxImageDimension2D, maxUniformBufferRange, etc.
    // props.deviceName: "NVIDIA RTX 4090"

    VkPhysicalDeviceFeatures features;
    vkGetPhysicalDeviceFeatures(device, &features);
    // features.geometryShader, features.samplerAnisotropy, etc.
}

// Simplest choice: take the first GPU (assumes deviceCount > 0).
// Real code should rank the devices using the properties above.
VkPhysicalDevice physicalDevice = devices[0];
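Picking the best GPU usually means scoring each device. The sketch below shows one possible rule, using a stand-in struct instead of the real Vulkan types so it builds without vulkan.h. GpuSummary, scoreGpu, and pickBestGpu are hypothetical names, and the rule itself (discrete first, then most device-local memory) is an assumption; a real engine would also check required features and queue families.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for fields one would read from VkPhysicalDeviceProperties and
// VkPhysicalDeviceMemoryProperties; this is not a Vulkan type.
struct GpuSummary {
    std::string name;
    bool isDiscrete;
    std::uint64_t deviceLocalBytes;
};

// Hypothetical scoring rule: discrete GPUs win, then more device-local memory wins.
std::uint64_t scoreGpu(const GpuSummary& gpu) {
    const std::uint64_t discreteBonus = gpu.isDiscrete ? (1ull << 62) : 0ull;
    return discreteBonus + gpu.deviceLocalBytes / (1024ull * 1024ull);  // memory in MiB
}

// Returns the index of the highest-scoring GPU (the first one wins ties).
std::size_t pickBestGpu(const std::vector<GpuSummary>& gpus) {
    assert(!gpus.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < gpus.size(); ++i) {
        if (scoreGpu(gpus[i]) > scoreGpu(gpus[best])) best = i;
    }
    return best;
}
```

The returned index would then be used to select from the devices vector instead of always taking element 0.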
Step 3: Logical Device and Queues
// Find queue families: groups of queues with specific capabilities
// Graphics queue: can render
// Compute queue: can run compute shaders
// Transfer queue: can copy data
// Present queue: can present to a window
uint32_t queueFamilyCount = 0;
vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &queueFamilyCount, nullptr);
std::vector<VkQueueFamilyProperties> queueFamilies(queueFamilyCount);
vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &queueFamilyCount, queueFamilies.data());

// Find a queue family that supports graphics
uint32_t graphicsFamily = UINT32_MAX; // sentinel: "not found"
for (uint32_t i = 0; i < queueFamilies.size(); i++) {
    if (queueFamilies[i].queueFlags & VK_QUEUE_GRAPHICS_BIT) {
        graphicsFamily = i;
        break;
    }
}
// Real code must stop here if graphicsFamily is still UINT32_MAX

// Create a logical device with one graphics queue
float priority = 1.0f;
VkDeviceQueueCreateInfo queueInfo{};
queueInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
queueInfo.queueFamilyIndex = graphicsFamily;
queueInfo.queueCount = 1;
queueInfo.pQueuePriorities = &priority;

VkDeviceCreateInfo deviceInfo{};
deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
deviceInfo.queueCreateInfoCount = 1;
deviceInfo.pQueueCreateInfos = &queueInfo;

VkDevice device;
vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);

VkQueue graphicsQueue;
vkGetDeviceQueue(device, graphicsFamily, 0, &graphicsQueue);
Step 4: Memory and Buffers
// Create a vertex buffer
VkBufferCreateInfo bufferInfo{};
bufferInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
bufferInfo.size = sizeof(Vertex) * vertexCount;
bufferInfo.usage = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT;
bufferInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

VkBuffer vertexBuffer;
vkCreateBuffer(device, &bufferInfo, nullptr, &vertexBuffer);

// Find out how much GPU memory it needs
VkMemoryRequirements memReqs;
vkGetBufferMemoryRequirements(device, vertexBuffer, &memReqs);

// Allocate GPU memory
// (In real projects: use VMA, the Vulkan Memory Allocator, for this)
VkMemoryAllocateInfo allocInfo{};
allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
allocInfo.allocationSize = memReqs.size;
allocInfo.memoryTypeIndex = findMemoryType(
    memReqs.memoryTypeBits,
    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |   // CPU can write
    VK_MEMORY_PROPERTY_HOST_COHERENT_BIT    // writes are visible to GPU
);

VkDeviceMemory bufferMemory;
vkAllocateMemory(device, &allocInfo, nullptr, &bufferMemory);
vkBindBufferMemory(device, vertexBuffer, bufferMemory, 0);

// Copy vertex data from CPU to GPU
void* data;
vkMapMemory(device, bufferMemory, 0, bufferInfo.size, 0, &data);
memcpy(data, vertices.data(), bufferInfo.size);
vkUnmapMemory(device, bufferMemory);
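The code above calls findMemoryType without showing it. A common way to write such a helper is sketched below, with plain integers standing in for the Vulkan types so it builds without vulkan.h. The constants mirror the values of the VK_MEMORY_PROPERTY_*_BIT flags, and typeFlags[i] plays the role of memoryTypes[i].propertyFlags from VkPhysicalDeviceMemoryProperties. The signature is an assumption and differs from the two-argument call above, which presumably reads the device's memory types from elsewhere.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-ins with the same values as VK_MEMORY_PROPERTY_*_BIT.
constexpr std::uint32_t kDeviceLocal  = 0x1;
constexpr std::uint32_t kHostVisible  = 0x2;
constexpr std::uint32_t kHostCoherent = 0x4;

// Returns the first memory type index that is allowed by `allowedTypeBits`
// (from VkMemoryRequirements::memoryTypeBits) and has every flag in
// `requiredFlags`, or -1 if none matches.
int findMemoryTypeIndex(const std::vector<std::uint32_t>& typeFlags,
                        std::uint32_t allowedTypeBits,
                        std::uint32_t requiredFlags) {
    assert(typeFlags.size() <= 32);  // Vulkan exposes at most 32 memory types
    for (std::uint32_t i = 0; i < typeFlags.size(); ++i) {
        const bool allowed = (allowedTypeBits & (1u << i)) != 0;
        const bool hasAllFlags = (typeFlags[i] & requiredFlags) == requiredFlags;
        if (allowed && hasAllFlags) return static_cast<int>(i);
    }
    return -1;
}
```

The two tests matter equally: the bit mask says which types the buffer may live in, and the flag comparison says which of those types the CPU can actually map.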
Use VMA in Real Projects
Call vkAllocateMemory sparingly: drivers cap the number of live allocations (maxMemoryAllocationCount, often 4096)
Use the Vulkan Memory Allocator library
It handles suballocation, memory type selection, and defragmentation automatically
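Suballocation means making one large allocation and handing out offsets inside it. The sketch below is the simplest possible version, a linear (bump) allocator; it is not how VMA is implemented, and the class name is made up. The offset it returns is the kind of value one would pass as the memoryOffset argument of vkBindBufferMemory, with the alignment taken from VkMemoryRequirements.

```cpp
#include <cassert>
#include <cstdint>

// Rounds `offset` up to the next multiple of `alignment` (must be a power of two).
std::uint64_t alignUp(std::uint64_t offset, std::uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Hands out offsets inside one big block, standing in for a single VkDeviceMemory.
class LinearSuballocator {
public:
    explicit LinearSuballocator(std::uint64_t capacityBytes) : capacity_(capacityBytes) {}

    // Writes the aligned offset to outOffset and returns true, or returns false if full.
    bool allocate(std::uint64_t size, std::uint64_t alignment, std::uint64_t& outOffset) {
        const std::uint64_t start = alignUp(next_, alignment);
        if (start > capacity_ || size > capacity_ - start) return false;
        outOffset = start;
        next_ = start + size;
        return true;
    }

    // Frees everything at once; individual frees are not supported in this sketch.
    void reset() { next_ = 0; }

private:
    std::uint64_t capacity_;
    std::uint64_t next_ = 0;
};
```

Hundreds of buffers can share one block this way, which keeps the number of vkAllocateMemory calls small.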
Step 5: Shaders and Pipeline
// Shaders must be compiled to SPIR-V bytecode
// glslc vertex.vert -o vertex.spv
// glslc fragment.frag -o fragment.spv

// Load compiled SPIR-V
auto vertCode = readFile("vertex.spv");
auto fragCode = readFile("fragment.spv");

VkShaderModuleCreateInfo vertInfo{};
vertInfo.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
vertInfo.codeSize = vertCode.size();
vertInfo.pCode = reinterpret_cast<const uint32_t*>(vertCode.data());

VkShaderModule vertModule, fragModule;
vkCreateShaderModule(device, &vertInfo, nullptr, &vertModule);
// (similar for fragment shader)

// Shader stages in the pipeline
VkPipelineShaderStageCreateInfo vertStage{};
vertStage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
vertStage.stage = VK_SHADER_STAGE_VERTEX_BIT;
vertStage.module = vertModule;
vertStage.pName = "main"; // entry point function name
// (similar for fragStage with VK_SHADER_STAGE_FRAGMENT_BIT)

VkPipelineShaderStageCreateInfo shaderStages[] = {vertStage, fragStage};
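The readFile helper used above is not shown in the original. One plausible version is sketched here, together with a cheap sanity check before the bytes go to vkCreateShaderModule. looksLikeSpirv is a hypothetical helper; it assumes the module is stored in the host's byte order and only checks the size and the SPIR-V magic number 0x07230203, not the rest of the module.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// One possible readFile: loads a whole binary file into memory.
std::vector<char> readFile(const std::string& path) {
    assert(!path.empty());
    std::ifstream file(path, std::ios::binary | std::ios::ate);  // open at the end to get the size
    if (!file) throw std::runtime_error("cannot open " + path);
    const std::streamsize size = file.tellg();
    std::vector<char> bytes(static_cast<std::size_t>(size));
    file.seekg(0);
    file.read(bytes.data(), size);
    return bytes;
}

// SPIR-V is a stream of 32-bit words whose first word is the magic number.
bool looksLikeSpirv(const std::vector<char>& bytes) {
    if (bytes.size() < sizeof(std::uint32_t)) return false;
    if (bytes.size() % sizeof(std::uint32_t) != 0) return false;
    std::uint32_t firstWord = 0;
    std::memcpy(&firstWord, bytes.data(), sizeof(firstWord));  // avoids unaligned access
    return firstWord == 0x07230203u;
}
```

A failed check usually means the GLSL source file was loaded by mistake, or the .spv file was read in text mode.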
Step 6: Recording Command Buffers
// Command buffers record your draw calls
// They are submitted to the GPU queue for execution
VkCommandBufferBeginInfo beginInfo{};
beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
vkBeginCommandBuffer(commandBuffer, &beginInfo);

    // Start a render pass (begins rendering to framebuffer)
    VkRenderPassBeginInfo renderPassInfo{};
    renderPassInfo.sType = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO;
    renderPassInfo.renderPass = renderPass;
    renderPassInfo.framebuffer = framebuffer;
    renderPassInfo.renderArea.offset = {0, 0};
    renderPassInfo.renderArea.extent = framebufferExtent; // assumed: VkExtent2D matching the framebuffer size
    VkClearValue clearColor = {{{0.0f, 0.0f, 0.0f, 1.0f}}};
    renderPassInfo.clearValueCount = 1;
    renderPassInfo.pClearValues = &clearColor;
    vkCmdBeginRenderPass(commandBuffer, &renderPassInfo, VK_SUBPASS_CONTENTS_INLINE);

    vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, pipeline);

    VkBuffer vertexBuffers[] = {vertexBuffer};
    VkDeviceSize offsets[] = {0};
    vkCmdBindVertexBuffers(commandBuffer, 0, 1, vertexBuffers, offsets);
    vkCmdBindIndexBuffer(commandBuffer, indexBuffer, 0, VK_INDEX_TYPE_UINT32);

    vkCmdDrawIndexed(commandBuffer, indexCount, 1, 0, 0, 0);
    // indexCount: number of indices
    // 1: number of instances
    // 0, 0, 0: first index, vertex offset, first instance

    vkCmdEndRenderPass(commandBuffer);
vkEndCommandBuffer(commandBuffer);

// Submit to GPU queue
VkSubmitInfo submitInfo{};
submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &commandBuffer;
vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
vkQueueWaitIdle(graphicsQueue); // wait for GPU to finish (simple, but stalls the CPU)
🔴 Chapter 7 — Synchronization
Synchronization is the hardest part of Vulkan. Understanding it requires understanding GPU parallelism.
Why Synchronization is Necessary
GPU operations are ASYNC — submitting commands doesn’t mean they execute immediately
Without synchronization: you might read a texture before it’s been written
Vulkan gives you tools to express: “Operation B must happen after Operation A”
Synchronization Primitives
Primitive | Scope | Use For
VkFence | CPU-GPU | CPU waits for GPU work to finish
VkSemaphore | Queue-Queue | Signal between GPU queues
Pipeline Barrier | Command buffer | Memory/execution dependencies in one queue
VkEvent | Command buffer | Fine-grained within one queue
Pipeline Barriers — Image Layout Transitions
Images must be in the correct layout for each use
// Transition image from UNDEFINED to COLOR_ATTACHMENT_OPTIMAL
VkImageMemoryBarrier barrier{};
barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
barrier.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;
barrier.newLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
barrier.srcAccessMask = 0;
barrier.dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED; // no queue ownership transfer
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.image = image;
barrier.subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1}; // aspect, first mip, mip count, first layer, layer count

vkCmdPipelineBarrier(
    commandBuffer,
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,             // src: nothing earlier to wait for
    VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // dst: color writes wait for the transition
    0,
    0, nullptr,   // no global memory barriers
    0, nullptr,   // no buffer barriers
    1, &barrier   // one image barrier
);
Image Layout | Used For
UNDEFINED | Don’t care about contents
COLOR_ATTACHMENT_OPTIMAL | Writing to render target
SHADER_READ_ONLY_OPTIMAL | Sampling in shader
TRANSFER_SRC_OPTIMAL | Source of copy
TRANSFER_DST_OPTIMAL | Destination of copy
PRESENT_SRC_KHR | Ready to show on screen
Fences — CPU Waiting for GPU
VkFenceCreateInfo fenceInfo{};
fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
VkFence fence;
vkCreateFence(device, &fenceInfo, nullptr, &fence);

// Submit work with fence signal
VkSubmitInfo submitInfo{...};
vkQueueSubmit(queue, 1, &submitInfo, fence); // fence signals when done

// CPU waits here until GPU is done
vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
vkResetFences(device, 1, &fence); // reset for next frame
🔴 Chapter 8 — Compute Shaders
Compute shaders let you run arbitrary parallel code on the GPU, not just rendering.
Used for: physics simulation, particle systems, image processing, AI inference, path tracing
A Compute Shader in GLSL
#version 450

// Workgroup size: each workgroup has 16×16 = 256 threads
layout(local_size_x = 16, local_size_y = 16) in;

// Storage image: we write pixels to this
layout(binding = 0, rgba8) uniform writeonly image2D outputImage;

// Storage buffer: read-write data array
layout(binding = 1) buffer DataBuffer {
    float data[];
};

void main() {
    // What pixel is this thread responsible for?
    ivec2 coord = ivec2(gl_GlobalInvocationID.xy);
    ivec2 size = imageSize(outputImage);
    if (coord.x >= size.x || coord.y >= size.y) return; // bounds check

    // UV in [0, 1]
    vec2 uv = vec2(coord) / vec2(size);

    // Compute something (mandelbrot as example)
    vec2 c = (uv - 0.5) * 3.0;
    vec2 z = vec2(0.0);
    int iter = 0;
    for (int i = 0; i < 100; i++) {
        if (dot(z, z) > 4.0) break;
        z = vec2(z.x*z.x - z.y*z.y + c.x, 2.0*z.x*z.y + c.y);
        iter++;
    }
    float t = float(iter) / 100.0;
    vec4 color = vec4(t, t*t, t*t*t, 1.0);

    // Write result to output image
    imageStore(outputImage, coord, color);
}
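When debugging a compute shader it helps to have a CPU version of the per-pixel math to compare against. The sketch below mirrors the shader's inner loop for a single pixel in C++; the function names are made up, and small floating-point differences between CPU and GPU are possible near the set's boundary.

```cpp
#include <cassert>

// Mirrors the shader's loop for one pixel; u and v are the UV coordinates in [0, 1].
int mandelbrotIterations(float u, float v, int maxIterations = 100) {
    assert(maxIterations > 0);
    const float cx = (u - 0.5f) * 3.0f;  // same mapping as c = (uv - 0.5) * 3.0
    const float cy = (v - 0.5f) * 3.0f;
    float zx = 0.0f;
    float zy = 0.0f;
    int iterations = 0;
    for (int i = 0; i < maxIterations; ++i) {
        if (zx * zx + zy * zy > 4.0f) break;  // dot(z, z) > 4.0: the point escaped
        const float nextX = zx * zx - zy * zy + cx;
        const float nextY = 2.0f * zx * zy + cy;
        zx = nextX;
        zy = nextY;
        ++iterations;
    }
    return iterations;
}

// Red channel the shader would write: t = iter / 100.
float shadeRed(float u, float v) {
    return static_cast<float>(mandelbrotIterations(u, v)) / 100.0f;
}
```

The image center (uv = 0.5, 0.5) maps to c = 0, which never escapes, so it runs all 100 iterations; the corner (0, 0) escapes after a single step.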
Dispatching a Compute Shader from C++
// Create compute pipeline (simpler than a graphics pipeline: no render pass!)
VkComputePipelineCreateInfo computeInfo{};
computeInfo.sType = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
computeInfo.stage = computeShaderStage; // the compiled compute shader
computeInfo.layout = pipelineLayout;

VkPipeline computePipeline;
vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &computeInfo, nullptr, &computePipeline);

// Dispatch: run the compute shader
vkCmdBindPipeline(cmdBuf, VK_PIPELINE_BIND_POINT_COMPUTE, computePipeline);
vkCmdBindDescriptorSets(cmdBuf, VK_PIPELINE_BIND_POINT_COMPUTE, ...);

// Dispatch 1920/16 × 1080/16 workgroups (to cover the full image)
vkCmdDispatch(cmdBuf,
    (1920 + 15) / 16, // x: ceil(width / workgroup_x)
    (1080 + 15) / 16, // y: ceil(height / workgroup_y)
    1                 // z: 1 for 2D dispatch
);
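The (n + 15) / 16 pattern is ceiling division, and it is worth wrapping in a helper so the workgroup size is not hard-coded. This is a minimal sketch with made-up names (DispatchSize, groupCount, dispatchFor2D, wastedThreads); the three numbers it produces are the ones that would be passed to vkCmdDispatch.

```cpp
#include <cassert>
#include <cstdint>

// Workgroup counts for the x, y, z arguments of a dispatch (illustrative type).
struct DispatchSize {
    std::uint32_t x, y, z;
};

// Ceiling division: the smallest group count whose threads cover `extent`.
std::uint32_t groupCount(std::uint32_t extent, std::uint32_t localSize) {
    assert(localSize > 0);
    return (extent + localSize - 1) / localSize;
}

DispatchSize dispatchFor2D(std::uint32_t width, std::uint32_t height,
                           std::uint32_t localX, std::uint32_t localY) {
    return DispatchSize{groupCount(width, localX), groupCount(height, localY), 1};
}

// Threads launched beyond the image edge; these rely on the shader's bounds check.
std::uint64_t wastedThreads(std::uint32_t width, std::uint32_t height,
                            std::uint32_t localX, std::uint32_t localY) {
    const DispatchSize d = dispatchFor2D(width, height, localX, localY);
    const std::uint64_t launched = std::uint64_t(d.x) * localX * d.y * localY;
    return launched - std::uint64_t(width) * height;
}
```

For 1920×1080 with 16×16 workgroups this gives 120 × 68 groups. Because 1080 is not a multiple of 16, the last row of groups launches 15,360 threads outside the image, which is why the shader needs its bounds check.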
🔴 Chapter 9 — GPU-Driven Rendering (Advanced)
Traditional rendering: CPU decides what to draw → submits draw calls → GPU executes
GPU-driven rendering: the GPU itself decides what to draw → far less CPU work per frame and no CPU round trip per draw
Multi-Draw Indirect
// Declared by vulkan.h; shown here for reference
struct VkDrawIndexedIndirectCommand {
    uint32_t indexCount;    // number of indices to draw
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
};

// Upload all possible draw calls to a GPU buffer
vkCmdDrawIndexedIndirect(
    cmdBuf,
    indirectBuffer,                       // buffer of draw commands on GPU
    0,                                    // offset
    drawCount,                            // how many draw calls (more than 1 needs the multiDrawIndirect feature)
    sizeof(VkDrawIndexedIndirectCommand)  // stride
);
// GPU reads draw commands from VRAM: no CPU roundtrip!
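To show what ends up in the indirect buffer, the sketch below fills an array of draw commands from a list of meshes and skips culled ones by setting instanceCount to 0. In a real GPU-driven renderer this loop would run in a compute shader writing straight into the buffer; here it runs on the CPU for clarity. The command struct is redeclared with the same field order so the example builds without vulkan.h, and MeshRange and buildIndirectCommands are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Same field order and types as VkDrawIndexedIndirectCommand.
struct DrawIndexedIndirectCommand {
    std::uint32_t indexCount;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  vertexOffset;
    std::uint32_t firstInstance;
};
static_assert(sizeof(DrawIndexedIndirectCommand) == 20, "five tightly packed 32-bit fields");

// Hypothetical per-mesh record: where the mesh lives in the shared index/vertex buffers.
struct MeshRange {
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  vertexOffset;
    bool visible;  // result of a culling test
};

// Writes one command per mesh; culled meshes keep their slot but draw zero instances.
std::vector<DrawIndexedIndirectCommand> buildIndirectCommands(const std::vector<MeshRange>& meshes) {
    std::vector<DrawIndexedIndirectCommand> commands;
    commands.reserve(meshes.size());
    for (const MeshRange& mesh : meshes) {
        commands.push_back({mesh.indexCount, mesh.visible ? 1u : 0u,
                            mesh.firstIndex, mesh.vertexOffset, 0u});
    }
    assert(commands.size() == meshes.size());
    return commands;
}
```

Keeping one slot per mesh means drawCount and the stride never change from frame to frame; only the instanceCount fields do.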
Mesh Shaders (VK_EXT_mesh_shader; originally NVIDIA's VK_NV_mesh_shader)
Replace the vertex, tessellation, and geometry stages with a programmable mesh pipeline
Generate geometry on the GPU (LOD, culling, procedural meshes)
Task shader (called an amplification shader in Direct3D 12): decide how many meshlets to spawn
Mesh shader: generate final vertices for rasterization
✅ Chapter 10 — Checklist
Beginner
TODO Understand why GPUs have thousands of simple cores instead of a few fast ones
TODO Can explain what a warp is and what lockstep execution means
TODO Know why warp divergence hurts performance
TODO Understand the GPU memory hierarchy (registers → shared → L2 → VRAM)
Intermediate
TODO Know what a VkInstance, VkPhysicalDevice, and VkDevice each represent
TODO Understand why Vulkan requires you to pre-allocate memory manually
TODO Can explain why shaders must be compiled to SPIR-V
TODO Understand the difference between VkFence and VkSemaphore
Advanced
TODO Can write a Vulkan hello triangle from scratch (all setup steps)
TODO Understand pipeline barriers and image layout transitions
TODO Can write a compute shader and dispatch it from C++
TODO Know what GPU-driven rendering means and why it’s faster
TODO Understand coalesced vs uncoalesced memory access