I've been thinking and reading a lot about noise recently and have come to some conclusions. I would like to offer a (mathematical) defense of the CGI guys who claim to need a large number of frames and small EV steps to accomplish their goals, and also a general scheme for determining an HDR bracketing sequence based on SNR requirements.
First of all, a distinction needs to be made between what these guys are trying to do and what a photographer is trying to do. The bottom line is that the SNR requirements for CGI are (probably, AFAIK) way higher than for photography. The next paragraph is a justification of why that is, based on my limited understanding of the subject; if you do CGI and can explain it mathematically, feel free to expand my justification and/or come up with some concrete SNR requirements. If you already believe that CGI has a need for a much higher SNR than photography, you can skip the next paragraph. (And if you can't believe a CGI artist would need such a high SNR, a scientist doing research might, e.g. an astronomer.)
Most of the people who come along and say "no way you'd ever need that many frames" are photographers (including myself), and you'd certainly be correct in saying that if your goal is photography. The goal of a photographer is to capture a scene and produce a realistic-looking reproduction of it. When a scene has a large dynamic range, photographers must use more frames to capture the additional range, and then in post they are going to compress that range to fit into the relatively small bit depth of the medium it will be displayed on (either print or screen). The fact that the HDR image is being compressed for photographic purposes means the minimum SNR required to achieve 'satisfactory' results is quite low; it doesn't matter that we have a noise level of plus or minus a handful of photons out of the hundreds or even thousands of photons of signal, especially when applying a curve to that linear data and then compressing it further to fit on a screen. That small error will compress down to less than one brightness level on the screen: totally imperceptible. Local toning methods may be used to achieve better results, and these cause 'local stretching', but in general the SNR requirements here are not much larger than the SNR requirements for some given area of a single low-DR scene.

CGI has a totally different goal. They are not simply displaying the image, but using it as a lighting source to accurately render 3D models to be inserted into the scene. I'm not fully familiar with this process, but it is certainly conceivable that it results in situations where the original HDRI is stretched really hard (it may be that it is stretched hard in saturation rather than brightness, or in both, but the SNR requirements would be similar). The rendered result is still going to be output on an 8-bit display, though, which is where I come up with a practical upper limit on SNR: 256:1 (i.e. I can think of no reason for an SNR requirement greater than 256). Should we stretch an image so hard that some given level is black and the next higher level is white on an 8-bit screen, then with an SNR of 256 there would still be no visible noise, since the noise level would be less than one screen brightness unit above black. So let's say that for some reason we do require this maximum SNR of 256 throughout the entire image. How does that translate into how many brackets we need and how they should be spaced?
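(Before the math, a quick sanity check of that "no visible noise at 256:1" claim, sketched with numpy; the numbers here are my own, not anything from the CGI side. Simulate a signal whose shot-noise SNR is exactly 256, then apply the worst-case stretch that maps it across the full 8-bit range:)

```python
import numpy as np

rng = np.random.default_rng(0)

signal = 256**2                # photons; shot-noise SNR = sqrt(65536) = 256
samples = rng.poisson(signal, size=100_000)

# Worst-case stretch: map this one signal level across the whole 8-bit range.
display = samples / signal * 255

print(display.std())           # ~0.99 -> fluctuations stay below 1 screen level
```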
***
Noise due to the quantum mechanical nature of photons is known as shot noise. Due to the nice way the statistics work out, it is equal to the square root of the number of photons in our signal. Photon noise far dominates the other noise sources in the well-exposed areas of the image (relative to the signal, shot noise only falls off as the square root of the signal, while fixed sources like read noise fall off linearly), so let's just consider our main noise source for now, that is, shot noise.
A search of the internets reveals that the full well capacity (FWC) of the 5D3 (what we'll use for this example, since its FWC is quite large) is something like 70,000 photons. This means the SNR of the brightest signal capturable with the 5D3 is sqrt(70,000) = 264.5... Good enough for gov't work.
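(If you'd rather check the square-root relationship than take my word for it, here's a one-off numpy simulation of a full well; the 70,000 figure is the internet-sourced number above, so treat it as approximate:)

```python
import numpy as np

rng = np.random.default_rng(1)
well = rng.poisson(70_000, size=100_000)   # simulated photon counts at full well
print(well.mean() / well.std())            # ~264.6, i.e. sqrt(70,000)
```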

(The read-noise-limited SNR would be related to the DR of 11 stops, since DR in EV is just log2 of the ratio between the brightest possible signal and the noise floor; converting back to a ratio, 2^11 = 2048, so clearly photon noise is dominating here.) Now, let's go down by 1 EV; that means 35,000 photons. The SNR is sqrt(35,000) = 187.1 -> not good enough for our extreme 256:1 SNR requirement. If our brackets are spaced 1 EV apart, everything in that top 1 EV will be overexposed in the next bracket, so we can only use this one frame for those areas. Ironically, assuming a well-designed algorithm that 'stacks' frames to average down photon noise, the lower we go, the more frames we can use, since more of them are unclipped down there. Meaning the shadows will on average have a better SNR than the highlights, and our minimum SNR will be in areas of the image right at that first EV step (assuming we take enough brackets that even our deepest shadows are only one EV step from the right of the histogram). This leads us to a little equation for the minimum SNR based simply on FWC and EV spacing: SNR = sqrt(FWC / 2^EvStep).
Note: this equation will get more and more wrong as the EV step size increases, since read noise will start to play a more important role, and we have neglected its contribution here. I know it's hard to believe that 187:1 is not a good enough SNR, but I'm trying to give these guys the benefit of the doubt that such extremely noise-free images are required for their purposes.
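Here is that equation as a few lines of Python (shot noise only, same caveat as above), evaluated for some common step sizes, along with the step size you'd need to hold 256:1 with only a single frame per level:

```python
import math

def min_snr(fwc, ev_step):
    """Worst-case shot-noise SNR for a bracket sequence: sqrt(FWC / 2^EvStep)."""
    return math.sqrt(fwc / 2**ev_step)

fwc = 70_000                                   # 5D3 full well capacity, photons
for step in (1, 2, 3, 4):
    print(step, round(min_snr(fwc, step), 1))  # 187.1, 132.3, 93.5, 66.1

# Solving sqrt(FWC / 2^step) >= 256 for the step size:
print(math.log2(fwc / 256**2))                 # ~0.095 EV between brackets(!)
```

That last number is the defense in a nutshell: holding 256:1 everywhere with single frames per level forces steps of about a tenth of a stop, hence a very large number of frames.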
This all leads to some interesting conclusions:
It is better not to use even EV spacing, but to gradually increase the spacing, assuming you have an HDR merge algorithm that stacks pixels to average down photon noise. The exact rate, I think, is going to be something related to a square-root function (I'm not sure exactly yet; feel free to work out the mathematics, it may involve a geometric series; one attempt is sketched below). Working out this rate would give you the most even SNR across the entire image, thus optimizing the number of frames needed (so that we don't take extra frames that disproportionately improve the SNR in areas of the image that don't need further improvement). But at some point (maybe around 4 or 5 EV) you will have to stop increasing the spacing, due to read noise becoming more of a factor. This all assumes you need a high and even SNR across your entire image, which might not necessarily be true for photographers, since non-linear curve transformations may put more or less emphasis on noise in various regions of the histogram.
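For what it's worth, here is one way that rate could be worked out, as a rough sketch (shot noise only, read noise ignored, perfect alignment assumed, and a merge algorithm that stacks every unclipped frame; the function and its greedy approach are mine, not anything standardized): pick each next step so that the worst-case stacked SNR in the newly covered band just meets the target.

```python
import math

def bracket_offsets(fwc, target_snr, total_range_ev):
    """Greedy EV offsets (shortest exposure first), each spacing chosen so the
    worst-case stacked shot-noise SNR in the new band meets target_snr."""
    offsets = [0.0]
    while offsets[-1] < total_range_ev:
        # Photon contribution of all frames so far to the next (darker) band,
        # relative to the longest exposure taken so far:
        s = sum(2**(d - offsets[-1]) for d in offsets)
        step = math.log2(fwc * s / target_snr**2)
        if step <= 0:
            raise ValueError("target too high: stack frames per level instead")
        offsets.append(offsets[-1] + step)
    return offsets

print([round(d, 2) for d in bracket_offsets(70_000, 256, 10)])
# steps start near 0.1 EV and converge to ~1 EV (a geometric series at work)
print([round(d, 2) for d in bracket_offsets(70_000, 66, 20)])
# steps settle just above 4 EV -- cf. the Zero Noise rule of thumb below
```

Note how the spacing does increase and then levels off, because the stacked contribution of previous frames converges like a geometric series, and how a target above sqrt(FWC) is impossible without stacking frames at the same level, which is exactly the point made further down.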
The spacing you should use depends only on the FWC of your camera and on your SNR requirements. (So the appropriate spacing may differ between FF and APS-C cameras, since they usually have quite different FWCs.)
You should keep taking frames, at whatever the increasing EV spacing works out to be, until you get to the point where the next image would be completely overexposed. So it's probably easiest to stop when you get to an image that is fully white and just throw that one out.
If you do not require an SNR greater than sqrt(FWC/2^4) ~= 66, which is probably true of most photographers, then Guillermo's Zero Noise idea of using 4 EV spacing is probably very good advice. Another way to think about it: are the bright mid-tones (-4 EV from max) of a normal ISO 100 image noise-free enough for you? I'll put it this way: I've never heard a photographer complain about the noise in the bright mid-tones of a properly exposed ISO 100 image.
Something else to note: should your SNR requirements be higher than sqrt(FWC), i.e. the highlights of a properly exposed ISO 100 image are too noisy (really!?), you are going to have to stack multiple exposures per bracket level to get there; if you stack 2 frames, you count twice as many photons, effectively doubling your FWC, and so on and so forth. Also, this is mathematical proof of my previous assertion that taking multiple copies of particular frames at some given EV, with wider EV spacing, is the same as using smaller EV spacing (e.g. a sequence of 10 frames at 1 EV spacing is the same as 2 sets of 5 frames at 2 EV spacing). Nobody really disputed that, though; I'm just pointing out that now I have some proof.
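And a quick numeric check of that equivalence claim, under the same shot-noise-only assumptions (the function below is my own toy model: each scene depth stacks every frame it isn't clipped in):

```python
import numpy as np

fwc = 70_000

def worst_snr(offsets_ev, range_ev=9):
    """Minimum stacked shot-noise SNR over the scene depths the set covers."""
    offsets = np.asarray(offsets_ev, dtype=float)
    depth = np.linspace(0, range_ev, 2000)[:, None]  # EV below base clipping
    counts = fwc * 2.0**(offsets - depth)            # photons in each frame
    counts[counts > fwc] = 0.0                       # clipped frames are unusable
    return np.sqrt(counts.sum(axis=1)).min()

print(round(worst_snr(range(10)), 1))                      # 10 at 1 EV  -> ~187.1
print(round(worst_snr([0, 0, 2, 2, 4, 4, 6, 6, 8, 8]), 1))  # 2x5 at 2 EV -> ~187.1
```

Both sequences bottom out at the same sqrt(FWC/2), which is the claim in numbers.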