Concept: Async Compute


What Is Async Compute?

  • Running compute work in parallel with graphics work on the GPU
  • Many modern GPUs expose separate compute queues that can run alongside the graphics queue
  • Enables better GPU utilization by filling idle shader units

GPU Queue Types

  • Graphics queue: supports all operations (graphics, compute, transfer)
  • Compute queue: compute + transfer only (no rasterization)
  • Transfer queue: DMA transfers only
  • Multiple queues can run simultaneously on different hardware units
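
The queue roles above can be read off a family's capability flags. The following is a minimal sketch, not engine code: the constants are stand-ins that mirror the Vulkan VkQueueFlagBits values so the example compiles without the Vulkan SDK, and classifyQueueFamily is a hypothetical helper name.

```cpp
#include <cstdint>
#include <string>

// Stand-ins for VkQueueFlagBits; the numeric values match the Vulkan headers.
constexpr std::uint32_t kGraphicsBit = 0x1; // VK_QUEUE_GRAPHICS_BIT
constexpr std::uint32_t kComputeBit  = 0x2; // VK_QUEUE_COMPUTE_BIT
constexpr std::uint32_t kTransferBit = 0x4; // VK_QUEUE_TRANSFER_BIT

// Hypothetical helper: label a queue family by the most capable role it offers.
// Graphics and compute families can also perform transfers, even when the
// transfer bit is not reported explicitly.
std::string classifyQueueFamily(std::uint32_t queueFlags) {
    if (queueFlags & kGraphicsBit) return "graphics"; // general-purpose family
    if (queueFlags & kComputeBit)  return "compute";  // compute without rasterization
    if (queueFlags & kTransferBit) return "transfer"; // copy-only family
    return "other";
}
```

In a real application the flags would come from vkGetPhysicalDeviceQueueFamilyProperties; a family reporting compute and transfer but not graphics is the usual candidate for async compute.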

Why It Matters for Path Tracing

  • BLAS builds are compute-heavy, so they can overlap with rendering work that does not read the BLAS being built
  • Denoising passes can overlap with the next frame’s ray tracing
  • TLAS rebuild can overlap with shadow ray tracing
  • Typical frame timeline without async
    • [BLAS build] → [TLAS build] → [Ray trace] → [Denoise] → [Present]
  • With async compute
    • [BLAS build (async)] ↕ [TLAS build] → [Ray trace] → [Denoise (async)] ↕ [Present]
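
The two timelines can be compared with simple arithmetic. This sketch assumes an idealized model in which the compute queue overlaps perfectly with the graphics queue; the function names and any pass durations fed into it are hypothetical, not measurements.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Sum of pass durations in milliseconds.
double totalMs(const std::vector<double>& passes) {
    return std::accumulate(passes.begin(), passes.end(), 0.0);
}

// Single queue: every pass runs back to back.
double serialFrameMs(const std::vector<double>& graphicsPasses,
                     const std::vector<double>& computePasses) {
    return totalMs(graphicsPasses) + totalMs(computePasses);
}

// Idealized async: the two queues overlap fully, so the longer queue
// bounds the frame. Real gains are smaller because of dependencies,
// shared bandwidth, and synchronization cost.
double idealAsyncFrameMs(const std::vector<double>& graphicsPasses,
                         const std::vector<double>& computePasses) {
    return std::max(totalMs(graphicsPasses), totalMs(computePasses));
}
```

For example, with made-up durations of 1 ms (TLAS build) and 8 ms (ray trace) on the graphics queue, and 2 ms (BLAS build) and 3 ms (denoise) on the compute queue, the serial frame takes 14 ms while the ideal overlapped frame takes 9 ms. This is an upper bound on the benefit, which is why profiling matters.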

Vulkan Async Compute Setup

  • Find a compute-only queue family
    for (uint32_t i = 0; i < queueFamilies.size(); ++i) {
        // Compute-capable but not graphics-capable: a dedicated compute family
        if ((queueFamilies[i].queueFlags & VK_QUEUE_COMPUTE_BIT) &&
            !(queueFamilies[i].queueFlags & VK_QUEUE_GRAPHICS_BIT)) {
            computeQueueFamily = i;
            break;
        }
    }
  • Create separate command pools and queues for compute
  • Submit compute work to compute queue, graphics to graphics queue
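
Some devices expose no compute-only family, so the selection step usually needs a fallback. The sketch below assumes a stand-in QueueFamily struct (in place of VkQueueFamilyProperties) and a hypothetical pickComputeFamily helper, so it compiles without the Vulkan SDK; the flag values mirror the Vulkan headers.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Stand-ins for VkQueueFlagBits; the numeric values match the Vulkan headers.
constexpr std::uint32_t kGraphicsBit = 0x1; // VK_QUEUE_GRAPHICS_BIT
constexpr std::uint32_t kComputeBit  = 0x2; // VK_QUEUE_COMPUTE_BIT

// Minimal stand-in for VkQueueFamilyProperties (only the field used here).
struct QueueFamily {
    std::uint32_t queueFlags;
};

// Hypothetical helper: prefer a compute-only family, otherwise fall back to
// the first family that supports compute at all. Returns std::nullopt when
// no family supports compute.
std::optional<std::uint32_t> pickComputeFamily(const std::vector<QueueFamily>& families) {
    std::optional<std::uint32_t> fallback;
    for (std::uint32_t i = 0; i < families.size(); ++i) {
        const std::uint32_t flags = families[i].queueFlags;
        if (!(flags & kComputeBit)) continue;    // cannot run compute work
        if (!(flags & kGraphicsBit)) return i;   // dedicated compute family
        if (!fallback) fallback = i;             // first graphics+compute family
    }
    return fallback;
}
```

When the fallback is taken, compute and graphics submissions share a family, so overlap depends on whether that family offers more than one queue and on how the driver schedules them.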

Synchronization

  • Async compute requires careful synchronization
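  • Timeline semaphores (core in Vulkan 1.2): preferred
    VkSemaphoreTypeCreateInfo typeInfo{};
    typeInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO;
    typeInfo.semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE;
    typeInfo.initialValue = 0;
    // Chain typeInfo into VkSemaphoreCreateInfo::pNext before vkCreateSemaphore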
    • Signal from compute queue, wait on graphics queue
  • Pipeline barriers within a queue
  • Queue ownership transfers for shared resources
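
The signal/wait relationship of a timeline semaphore can be illustrated with a small CPU-side model. This is a sketch of the counter semantics only, not the Vulkan API: TimelineModel and its methods are hypothetical names, and no GPU work is involved.

```cpp
#include <cstdint>

// CPU-side model of a timeline semaphore's 64-bit counter.
class TimelineModel {
public:
    explicit TimelineModel(std::uint64_t initialValue) : value_(initialValue) {}

    // A signal must strictly increase the counter, mirroring the rule
    // Vulkan places on timeline semaphore signal values.
    bool signal(std::uint64_t newValue) {
        if (newValue <= value_) return false; // rejected: not monotonic
        value_ = newValue;
        return true;
    }

    // A wait on `target` is satisfied once the counter reaches or passes it.
    bool waitSatisfied(std::uint64_t target) const { return value_ >= target; }

    std::uint64_t value() const { return value_; }

private:
    std::uint64_t value_;
};
```

A typical pattern is to use the frame index as the counter value: the compute queue signals N when frame N's BLAS builds finish, and the graphics queue waits on N before tracing rays against them. Because waits are satisfied by any value at or above the target, one semaphore can serve every frame.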

Practical Considerations

  • Not all GPUs benefit equally
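    • Integrated GPUs: often expose a single queue family and share memory bandwidth with the CPU, so there is little or no benefit
    • Discrete GPUs: usually expose dedicated compute queue families; the benefit can be significant when graphics work leaves shader units idle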
  • Overhead: synchronization adds complexity and some latency
  • Profile first: measure actual GPU utilization before optimizing
  • NVIDIA Nsight, AMD Radeon GPU Profiler (RGP): tools for visualizing queue utilization

In Godot Context

  • Godot’s RenderingDevice exposes compute queues
  • BLAS builds for skinned meshes are good candidates for async
  • Denoising (OIDN compute) can run async with next frame’s RT