[X3D-Public] HW Occlusion Culling - was Re: Announcement: view3dscene 3.4

Wed Aug 26 10:49:22 PDT 2009

John A. Stewart wrote:
> Alan, Michalis;
> 
>>
>> How did you find the hardware occlusion query implementation?  Did it
>> help a lot with your framerate?  Did you compare it to a CPU culling
>> method?
> 

Here are some details about hardware occlusion query in my engine.
I actually implemented two separate hw occlusion culling algorithms;
both can be tried from the view3dscene menu:

- The 1st one is the basic approach, described at
http://http.developer.nvidia.com/GPUGems/gpugems_ch29.html . It just
sorts the objects and renders them with occlusion queries, using the
query results from the previous frame. So it has a "latency": rendering
errors are possible (a visible object may not be rendered for one frame).

- The 2nd algorithm is "Coherent Hierarchical Culling", described at
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter06.html . It
has some nice advantages over the basic approach: it never has any
rendering errors, and theoretically it is able to deal with very large
scenes (it does queries for the boxes of internal nodes of the collision
tree).
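
The one-frame latency of the basic approach can be illustrated with a
small CPU-only simulation. This is only a sketch, not the engine's actual
code: the names Shape, render_frame, truly_visible and was_visible are
made up for illustration, and the set truly_visible stands in for "the
GPU occlusion query returned more than zero samples".

```python
from dataclasses import dataclass


@dataclass
class Shape:
    name: str
    distance: float  # distance from the camera, used for front-to-back sorting


def render_frame(shapes, truly_visible, was_visible):
    """Simulate one frame of the basic approach.

    truly_visible: set of shape names whose query would pass this frame
                   (hypothetical stand-in for the GPU sample count > 0).
    was_visible:   dict name -> bool, query results from the previous frame.
    Returns (drawn, results): names really rendered this frame, and this
    frame's query results, to be used in the next frame.
    """
    drawn = []
    results = {}
    for shape in sorted(shapes, key=lambda s: s.distance):
        # A shape that was never queried before is assumed to be visible.
        if was_visible.get(shape.name, True):
            drawn.append(shape.name)  # real geometry, wrapped in a query
        # Otherwise only the bounding box would be drawn (with color and
        # depth writes off), also wrapped in a query.
        results[shape.name] = shape.name in truly_visible
    return drawn, results


shapes = [Shape("statue", 5.0), Shape("wall", 1.0)]
drawn, results = render_frame(shapes, {"wall"}, {})
# The camera moves and the statue becomes visible, but the previous
# frame's result still says "hidden", so it is skipped for one frame.
late, results = render_frame(shapes, {"wall", "statue"}, results)
fixed, results = render_frame(shapes, {"wall", "statue"}, results)
print(drawn, late, fixed)
```

In the second frame the statue is really visible but is not drawn; it
only shows up in the third frame. That is the rendering error mentioned
above.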

My results in practice can be summarized as "it depends" :) Both
approaches can give a huge rendering speedup (I was easily able to make
scenes where a 100x speedup is observed), but in common scenes they
can also have no effect at all, at least for most camera settings.
In really evil cases (and on evil GPUs :) ), they may even hurt.

- It greatly depends on the scene. For the basic approach, the number of
shapes is important. For the hierarchical approach, a nice (sparse) tree
is important.

- It also depends a lot on the GPU, just like John writes. On my old
NVidia GeForce 5200 the results were always better than, or at least the
same as, simple frustum culling. On a Radeon X1600, while the results
were sometimes much better (it's a newer GPU after all), I was sometimes
able to "choke" the GPU with queries for large boxes (this was with the
"hierarchical" approach). Both approaches use a lot of fill rate, which
is a problem. The 1st (basic) approach uses a lot of fill rate per shape,
so it has to be slow on really large scenes. The 2nd (hierarchical)
approach is theoretically much better and able to handle very large
scenes with few queries... but then the queries are done for very large
boxes (the bounding boxes of octree internal nodes), and so they still
eat a lot of fill rate.

- It also depends on how you divide your geometry into VRML/X3D Shapes.
A single Shape is the basic unit to be culled in my implementation.
Ideally every shape should contain small (small bbox) but complicated
geometry (many vertices, complicated texturing / shaders setup etc.).
Only then is the query for the shape really much faster than just
drawing the shape.
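
The trade-off described in the list above (fewer queries with a
hierarchy, but queries on much larger boxes) can be sketched with a toy
cost count. This is a simplified "query the box, descend only if it
passes" traversal, not the full Coherent Hierarchical Culling algorithm
(which also reuses results from previous frames to avoid stalls). The
Node class, the box_pixels numbers and the stats dictionary are all
hypothetical, chosen only to show the effect.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    box_pixels: int  # hypothetical screen area covered by the bounding box
    visible: bool    # stand-in for "the occlusion query passed"
    children: list = field(default_factory=list)  # empty for leaves (shapes)


def hierarchical_cull(node, stats):
    """Query the node's box; descend only if it passes. Returns leaves to draw."""
    stats["queries"] += 1
    stats["fill"] += node.box_pixels
    if not node.visible:
        return 0  # the whole subtree is rejected by a single query
    if not node.children:
        return 1
    return sum(hierarchical_cull(child, stats) for child in node.children)


def per_shape_cull(node, stats):
    """One query per leaf shape, ignoring the hierarchy. Returns leaves to draw."""
    if not node.children:
        stats["queries"] += 1
        stats["fill"] += node.box_pixels
        return 1 if node.visible else 0
    return sum(per_shape_cull(child, stats) for child in node.children)


# Left subtree: 8 small hidden shapes. Right subtree: one visible, one hidden.
left = Node(6000, False, [Node(100, False) for _ in range(8)])
right = Node(4000, True, [Node(100, True), Node(100, False)])
root = Node(10000, True, [left, right])

tree_stats = {"queries": 0, "fill": 0}
flat_stats = {"queries": 0, "fill": 0}
tree_drawn = hierarchical_cull(root, tree_stats)
flat_drawn = per_shape_cull(root, flat_stats)
print(tree_drawn, tree_stats)  # 1 {'queries': 5, 'fill': 20200}
print(flat_drawn, flat_stats)  # 1 {'queries': 10, 'fill': 1000}
```

With these made-up numbers the hierarchy halves the number of queries,
but the queried boxes cover about 20 times more pixels, which is the
fill rate problem on large internal nodes.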

I didn't compare with any CPU implementation, and I don't think I'll
make one. The GPU version isn't perfect, but it does the job, and with
newer GPUs it will hopefully be even better. Tests on more GPUs
(especially the really new ones) are still on my TODO list; hopefully
I'll get some reports from people testing the new release :)

Michalis