I thought I’d have a go at implementing some path tracing in CUDA. Let’s start simple: a classical path tracer with explicit direct lighting. Lots of hacks:
- No BVH yet: every ray tests all 30 triangles of the Cornell Box
- Every surface is Lambertian (so cosine-weighted hemisphere sampling for spawning rays)
- Hardcoded for a single area light (which the camera cannot see)
- Uses a Möller intersection test copy-pasted from CPU code
- Random number generation got moved to a texture read (with the texture data updated CPU-side) to avoid absurd register counts
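For reference, here is a minimal sketch of two of the building blocks from the list above: cosine-weighted hemisphere sampling and a ray/triangle test. It is plain C++ rather than my actual kernel code, and it assumes the "Möller" test means the usual Möller-Trumbore algorithm. The names `Vec3`, `sampleCosineHemisphere` and `intersectTriangle` are made up for illustration, and the sample direction is returned in a local frame where the surface normal is +z.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal vector type; in CUDA, float3 would play this role.
struct Vec3 { float x, y, z; };

inline Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Cosine-weighted direction in a local frame whose normal is +z.
// u1, u2 are uniform random numbers in [0, 1); the pdf is cos(theta) / pi.
inline Vec3 sampleCosineHemisphere(float u1, float u2) {
    assert(u1 >= 0.0f && u1 < 1.0f && u2 >= 0.0f && u2 < 1.0f);
    const float kPi = 3.14159265358979f;
    float r = std::sqrt(u1);            // radius on the unit disk
    float phi = 2.0f * kPi * u2;        // angle around the normal
    float z = std::sqrt(1.0f - u1);     // cos(theta)
    return {r * std::cos(phi), r * std::sin(phi), z};
}

// Möller-Trumbore ray/triangle test (assumed variant, two-sided).
// Returns true on a hit in front of the ray origin and writes the distance to t.
inline bool intersectTriangle(Vec3 orig, Vec3 dir,
                              Vec3 v0, Vec3 v1, Vec3 v2, float& t) {
    const float kEps = 1e-7f;
    Vec3 e1 = sub(v1, v0);
    Vec3 e2 = sub(v2, v0);
    Vec3 p = cross(dir, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < kEps) return false;   // ray parallel to the triangle plane
    float invDet = 1.0f / det;
    Vec3 s = sub(orig, v0);
    float u = dot(s, p) * invDet;              // first barycentric coordinate
    if (u < 0.0f || u > 1.0f) return false;
    Vec3 q = cross(s, e1);
    float v = dot(dir, q) * invDet;            // second barycentric coordinate
    if (v < 0.0f || u + v > 1.0f) return false;
    t = dot(e2, q) * invDet;
    return t > kEps;
}
```

With no BVH, the kernel would simply call something like `intersectTriangle` for each of the 30 triangles and keep the nearest hit. In CUDA both functions would be marked `__device__`.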
Some initial kernel stats and performance on my lowly GeForce 9500 GT at 512x512 for a single ray per pixel:
- 0 bounces: 34 registers: 21ms/frame (12.4 Mrays/s)
- 1 bounce: 45 registers: 51ms/frame (approx 10.2 Mrays/s)
- 2 bounces: 45 registers: 81ms/frame (?? Mrays/s)
I’ve no idea how many rays/s I get for 2+ bounces: many rays will have terminated by then, and I haven’t put any debug counters in to count them. Note the big jump in register count (34 to 45) at 1 bounce, which comes from adding a classic path tracing loop to the kernel.
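The Mrays/s figures above are just back-of-envelope arithmetic, roughly as sketched below. The helper `mraysPerSecond` is a made-up name, and it assumes every pixel traces exactly `raysPerPixel` rays per frame with shadow rays not counted, which is the assumption that seems to reproduce the 0- and 1-bounce numbers.

```cpp
#include <cassert>

// Millions of rays per second, assuming every pixel traces exactly
// raysPerPixel rays per frame (no early termination, shadow rays ignored).
inline double mraysPerSecond(int width, int height, int raysPerPixel, double frameMs) {
    assert(width > 0 && height > 0 && raysPerPixel > 0 && frameMs > 0.0);
    double raysPerFrame = double(width) * height * raysPerPixel;
    double seconds = frameMs * 1e-3;
    return raysPerFrame / seconds / 1e6;
}
```

For example, `mraysPerSecond(512, 512, 1, 21.0)` comes out at roughly 12.5 and `mraysPerSecond(512, 512, 2, 51.0)` at roughly 10.3. Plugging in 3 rays per pixel and 81ms gives about 9.7, but that is only an upper bound for the 2-bounce kernel, since it assumes no path terminates early.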
Performance for this simple scene is not good. My occupancy is an awful 17% for each of the kernels. This obviously needs improving, but I’m way over the register limit: to get to 50% occupancy I need to get down to 20 registers, and to get to 100%, down to 10. Switching to a more CUDA-friendly ray/triangle test will probably help a bit, but it isn’t going to perform miracles. The problem is the kernel structure itself: it’s trying to do everything in one loop. From reading the very nice CUDA-related papers from this year’s SIGGRAPH, I realise that I’d have to move to some job-based system eventually, but I found it surprising to suffer from register problems at such low complexity.
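To show where the occupancy numbers come from, here is a rough register-only estimate. It is a simplified sketch, not the official occupancy calculator: it uses the compute capability 1.1 limits (8192 registers, 768 threads and 8 blocks per multiprocessor), ignores register allocation granularity and shared memory, and assumes a block size of 128 threads, which is a guess that happens to match the 17% figure. `estimateOccupancy` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits for compute capability 1.0/1.1 devices.
constexpr int kRegistersPerSM = 8192;
constexpr int kMaxThreadsPerSM = 768;
constexpr int kMaxBlocksPerSM = 8;

// Fraction of the multiprocessor's thread slots that can be resident,
// considering only the register budget and the thread/block caps.
// Simplified: ignores register allocation granularity and shared memory.
inline double estimateOccupancy(int regsPerThread, int threadsPerBlock) {
    assert(regsPerThread > 0 && threadsPerBlock > 0);
    int blocksByRegs = kRegistersPerSM / (regsPerThread * threadsPerBlock);
    int blocksByThreads = kMaxThreadsPerSM / threadsPerBlock;
    int blocks = std::min({blocksByRegs, blocksByThreads, kMaxBlocksPerSM});
    return double(blocks * threadsPerBlock) / kMaxThreadsPerSM;
}
```

Under these assumptions, both 34 and 45 registers fit only one 128-thread block per multiprocessor (128 of 768 threads, about 17%), 20 registers fit three blocks (50%), and 10 registers fit six (100%).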
More on this topic soon (I don’t get paid to experiment with CUDA sadly). In the meantime, here are some “novel viewpoint” shots of the 2-bounce kernel as it accumulates 1/16/512 rays per pixel (you can see just how much my RNG sucks):