If you ever looked at the comments in raw_rec.c, you may have noticed that I stated a little goal: 1920x1080 on 1000x cards (on the 5D3 at 24p, of course). Goal achieved and exceeded - I even got reports of 1920x1280 continuous.
During the last few days I took a closer look at the buffering strategy. While it was near-optimal for continuous recording (large contiguous chunks = faster write speeds), there was (and still is) room for improvement for the cases where you want to push the recording past the sustained write speed and squeeze out as many frames as possible.
So, I've designed a new buffering strategy (I'll call it variable buffering), with the following ideas in mind:
* Write speed varies with buffer size: small buffers are noticeably slower, and the curve flattens out for large ones, with anything between 16MB and 32MB giving the highest speeds (thanks to the testers who ran the benchmarks for hours on their cameras).
* Since that speed drop is fairly small, it's almost always better to start writing as soon as we have one frame captured. Therefore, the new strategy aims for a 100% duty cycle on the card writing task.
* Because large buffers are written faster than small ones, they are preferred. If the card is fast enough, only the largest buffers will be touched, so the method remains optimal for continuous recording. Even better: adding a bunch of small buffers will not slow it down at all.
* The algorithm will use every single memory buffer that can hold at least one frame (since small buffers no longer slow it down).
* Another cause of stopping: when the buffer is about to overflow, it's usually because the camera is trying to save a huge chunk (say a 32MB one), which takes a long time (around 1.5 seconds on slow SD cameras writing at 21MB/s). So I've added a heuristic that limits the write size: in this case, if we predict the buffer will overflow after only 1 second, we save just 20MB out of the 32, finishing at about 0.95 seconds; at that moment, the capturing task gets a 20MB free chunk for more frames (see the sketch after this list).
* Buffering is now done at frame level, not at chunk level. This finer granularity lets me split buffers on the fly, in whatever configuration seems best for the situation.
* The algorithm is designed to adjust itself on the fly; for this, it makes some predictions, such as when the buffer is likely to overflow. If it predicts well, it will squeeze out a few more frames. If not... not.
* More juicy details in the comments.
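To make the buffer-preference and overflow-prediction ideas concrete, here's a minimal C sketch of the kind of decision the writing task makes on each pass. This is my own illustration, not the actual raw_rec.c code: every name is hypothetical, and write_speed_for() is a placeholder curve standing in for the model fitted from the benchmark logs.

#include <stdint.h>

/* Illustrative constants - not the actual raw_rec.c values. */
#define FRAME_SIZE  (2 * 1024 * 1024)    /* ~2MB per raw frame (example) */
#define MAX_WRITE   (32 * 1024 * 1024)   /* largest single write         */

/* Placeholder speed model: large writes approach 40MB/s, small ones are
 * slower. NOT the curve fitted from the benchmark logs. */
static double write_speed_for(uint32_t bytes)
{
    double mb = bytes / (1024.0 * 1024.0);
    return 40e6 * mb / (mb + 4.0);       /* bytes per second */
}

/* Decide how many bytes to write on this pass:
 * - 100% duty cycle: write as soon as one full frame is queued;
 * - prefer the largest contiguous run of queued frames (faster writes);
 * - cap the write size so it finishes before the predicted overflow,
 *   freeing a chunk of memory for the capturing task just in time. */
uint32_t choose_write_size(
    uint32_t queued_bytes,   /* contiguous frames ready to be written */
    uint32_t free_bytes,     /* memory still free for capturing       */
    double   capture_rate    /* bytes/second produced by the sensor   */
)
{
    if (queued_bytes < FRAME_SIZE)
        return 0;                       /* nothing complete to write yet */

    /* Start from the largest write we can do. */
    uint32_t size = queued_bytes < MAX_WRITE ? queued_bytes : MAX_WRITE;

    /* Predict when capturing will overflow the remaining free memory. */
    double time_to_overflow = free_bytes / capture_rate;

    /* Shrink the write (whole frames only) until it finishes in time. */
    while (size > FRAME_SIZE &&
           size / write_speed_for(size) > time_to_overflow)
    {
        size -= FRAME_SIZE;
    }

    return size;
}

With the numbers from the overflow example above (32MB pending, overflow predicted in 1 second, a card writing around 21MB/s), the loop shrinks the write to about 20MB, which finishes near 0.95 seconds and frees that chunk for capturing.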
This is experimental. I've run a few tests and played back a few videos on the camera, but that was all. I didn't even check whether the frames are saved in the correct order.
Build notes:
- This breaks bolt_rec. Buffering is now done at frame level, not at chunk level, so bolt_rec has to be adjusted.
- The current source code has debug mode enabled - it prints funky graphs. You'll find them on the card.
- The debug code will slow down the write speed.
- I'd like you to run some test recordings and paste the graphs - this will allow me to check if there's any difference between theory and practice (you know, in theory there isn't any).
- I did not run any comparison with the older method on the camera (I only did so in simulation). It would be very nice if you could do this.
- It may achieve lower write speeds. This is normal, because it also uses smaller buffers. If you also consider the idle time, it should be better overall.
- For normal usage, disable the debug code (look at the top of raw_rec.c).
History
[2013-05-17] Experimented with finding the optimal buffer sizes. People ran the benchmarks for hours on their cameras and posted a bunch of logs. They pretty much confirmed my earlier theory that any buffer size between 16MB and 32MB should result in the highest speeds.
[2013-05-30] Noticed that file writes aligned at 512 bytes are a little faster (credits: CHDK). Rounded the image size to multiples of 64x32 or 128x16 pixels to ensure 512-byte alignment.
[2013-08-06] Figured out that I could just add some padding to each frame to ensure 512-byte alignment and keep the high write speeds without breaking the converters too badly. Also aligned everything at 4096 bytes, which solved some mysterious EDMAC lockups and brought back the highest speed in benchmarks (over 700MB/s).
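For the curious, the 512-byte arithmetic works out because Canon raw data is 14 bits per pixel. A standalone sketch (identifiers are mine, not from raw_rec.c):

#include <stdio.h>
#include <stdint.h>

/* At 14 bits per pixel, a 64x32 (or 128x16) pixel block takes
 * 64*32*14/8 = 3584 bytes = 7*512, so sizes rounded to such blocks
 * are automatically 512-byte aligned. */
static uint32_t raw_size(uint32_t w, uint32_t h)
{
    return w * h * 14 / 8;
}

/* Padding each frame up to the next 4096-byte boundary achieves the
 * same alignment without constraining the resolution. */
static uint32_t pad_to_4096(uint32_t size)
{
    return (size + 4095) & ~4095u;
}

int main(void)
{
    printf("64x32 block: %u bytes (512 x %u)\n",
           (unsigned) raw_size(64, 32),
           (unsigned) (raw_size(64, 32) / 512));
    printf("1920x1080 frame, padded: %u bytes\n",
           (unsigned) pad_to_4096(raw_size(1920, 1080)));
    return 0;
}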
[2013-05-30] speedsim.py - First attempt at a mathematical model of the recording process. Input: resolution, fps and the available buffers. Output: how many frames you will get, with detailed graphs. Also added an in-camera estimation of how many frames you will get with the current settings. (A toy sketch of such a simulation appears after the 550D example below.)
[2013-06-18] Took a closer look at these logs and fitted a mathematical model for the speed drop at small buffer sizes.
[2013-06-18] Does buffer ordering/splitting matter? 1% experimented with it before, but there was no clear conclusion.
* Is it better to take the largest buffer first or the smallest first? There's no clear answer; each choice is best for some cases and suboptimal for others.
* Since some cameras have very few memory chunks (e.g. 550D: 32+32+8 MB), what if each 32MB buffer is divided into 2x16 or 4x8 MB? This brought a significant improvement for resolutions just above the continuous recording threshold, but lowered performance for continuous recording.
* Optimization: updated speedsim.py so it finds the best memory configuration for one particular situation. Xaint confirmed the optimization results on the 550D.
* Problem: there was no one-size-fits-all solution.
[2013-06-19] The simulation now matches the real-world results perfectly. So, the mathematical model is accurate!
[2013-06-19] Started to sketch the variable buffering algorithm and already got some simulation results. There was a clear improvement for borderline cases (settings that require just a little more write speed than your camera+card can deliver).
Example: 550D, 1280x426, 23.976fps, 21.16MB/s, simulation:

- 8MB + 2x32MB (current method) - 317 frames
- 9x8MB - 1566 frames
- Variable buffering, starting from 8MB + 2x32MB - 1910 frames
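As a toy illustration of what such a simulation does, here's a minimal sketch in C (speedsim.py itself is a Python tool; the names here are mine). It models capture filling the buffers at one frame per 1/fps and the card draining them at a constant speed; it ignores the chunk granularity and the speed-vs-size curve, so it over-estimates the frame counts compared to the numbers above:

#include <stdio.h>

/* Count frames until the buffers overflow: each frame period, capture
 * adds one frame and the writer drains write_speed/fps worth of data. */
static int simulate(double frame_size_mb, double fps,
                    double write_speed_mb_s, double total_buffer_mb)
{
    double used_mb = 0;
    int frames = 0;

    while (frames < 100000)        /* bail out: effectively continuous */
    {
        used_mb += frame_size_mb;
        if (used_mb > total_buffer_mb)
            break;                 /* overflow: recording stops */
        frames++;

        used_mb -= write_speed_mb_s / fps;
        if (used_mb < 0)
            used_mb = 0;           /* writer caught up */
    }
    return frames;
}

int main(void)
{
    /* 550D example above: 1280x426 at 14 bits/pixel is ~0.91MB/frame,
     * and 8 + 2x32 MB gives 72MB of buffering. */
    printf("%d frames\n", simulate(0.91, 23.976, 21.16, 72));
    return 0;
}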
[2013-06-20] Ran a few more tests and noticed that it meets or exceeds the performance of the old algorithm with the sort/split optimization. There are still some cases where the sort/split method gives 2-3 more frames (no big deal).
Got rid of some spikes, which squeezed out a few more frames (1925 or something).

Implemented the algorithm in camera. It's a bit simplified, since I didn't include all the optimizations from the simulated version, but at least it seems to work.
Took me two hours just to write this post. Whooo 
Enjoy and let me know if the theory actually works!