12-bit (and 10-bit) RAW video development discussion

Started by d, May 22, 2013, 10:58:34 PM

g3gg0

@mucher:
not sure what you want to do with your calculation.
but the data has to be unpacked from the 32-bit words anyway, and doing a simple LSR #2 while building the 14-bit value
will give you the value as 12 bits, with the same result.
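
In C, the idea is roughly this (a minimal sketch, not the actual raw_rec code: it assumes the 14-bit samples are packed back-to-back LSB-first, which may not match the real EDMAC buffer layout):

#include <stdint.h>

/* unpack 14-bit samples from the packed stream and shift each one to 12 bit */
static void unpack14_shift_to12(const uint32_t *src, uint16_t *dst, int count)
{
    uint64_t acc = 0;   /* bit accumulator */
    int bits = 0;       /* number of valid bits currently in acc */

    for (int i = 0; i < count; i++)
    {
        if (bits < 14)                          /* refill from the packed stream */
        {
            acc |= (uint64_t)(*src++) << bits;
            bits += 32;
        }
        dst[i] = (acc & 0x3FFF) >> 2;           /* one 14-bit sample, LSR #2 -> 12 bit */
        acc >>= 14;
        bits -= 14;
    }
}
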
Help us with datasheets - Help us with register dumps
magic lantern: 1Magic9991E1eWbGvrsx186GovYCXFbppY, server expenses: [email protected]
ONLY donate for things we have done, not for things you expect!

savale

First step:

my suggestion: make it as fast as possible and see if it's feasible. So just truncate the 4 LSBs using a bitmask / bit shift,
something like this:


/* let's say the variable fourteenBitsValue contains the 14 bit value */
tenBitsValue = (fourteenBitsValue & 0x3FF0) >> 4; /* zero the 4 LSB bits and shift 4 bits */

KMikhail

Okay, back to rounding.

Let's simplify it for understanding: we'll work it out in decimal first.

We have 1001 and 1005 and want to shift them right by one digit, i.e. divide by 10. The floating-point results are 100.1 and 100.5, and the nearest integers after normal mathematical rounding are 100 and 101. However, if you operate in integer space you get 100 and... 100, since the CPU rounds DOWN to the next integer. To avoid this we add 5 (half of the base raised to the power of the shift) to get normal rounding: (1001+5)/10 = 100, (1005+5)/10 = 101. Now, if you wish, you can shift them back. I am not sure how you want to align the final result (left or right).

To shift right by 2 bits (divide by 4) you need to add 2^2 / 2 = 2 before you shift, to preserve correct mathematical rounding. Then, if you need to re-align the most significant bits to the left, you can shift back.
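
A tiny C illustration of the same thing (the decimal example above, then the 2-bit binary case):

#include <stdio.h>

int main(void)
{
    /* decimal: truncation vs. add-half-then-divide */
    int a = 1001, b = 1005;
    printf("%d %d\n", a / 10, b / 10);               /* 100 100  (rounds down) */
    printf("%d %d\n", (a + 5) / 10, (b + 5) / 10);   /* 100 101  (rounds to nearest) */

    /* binary: shifting right by 2 divides by 4, so add 4/2 = 2 first */
    int x = 1003;                                    /* 1003 / 4 = 250.75 */
    printf("%d %d\n", x >> 2, (x + 2) >> 2);         /* 250 251 */
    return 0;
}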

I hope it is clear now. But a look-up table might indeed be a better alternative, unless referencing memory via a pointer is expensive; I am not familiar with the efficiency of DIGIC 5+.

BTW, why was it so important to round so carefully? Perhaps for 10 bits?

mogs

The packed data looks pretty messy. Dealing with it in the most efficient way seems more important than rounding errors.

Once the packed data has been pulled apart, before repacking at a lower bit depth, it may be worth considering implementing something like this for a cleaner roll-off in the highlights.


[image from the apertus site, http://apertus.org]

If the knees were aligned with bit boundaries, it should be possible to implement this with only shifts and adds?
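
For example, something like this (the knee positions and slopes below are made up just to show the shifts-and-adds idea; they are not the apertus curve):

#include <stdint.h>

/* 14-bit in -> 10-bit out, knees on power-of-two boundaries, shifts and adds only */
static inline uint16_t knee_14_to_10(uint16_t x)
{
    if (x < 2048)  return x >> 2;                   /* shadows:    slope 1/4,    0..511  */
    if (x < 8192)  return 512 + ((x - 2048) >> 4);  /* midtones:   slope 1/16, 512..895  */
    return 896 + ((x - 8192) >> 6);                 /* highlights: slope 1/64, 896..1023 */
}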

KMikhail

The image is misleading.

1) Nowadays the DR of digital sensors is enormous compared to what it used to be, and outperforms many, if not most, film stocks (per unit area). Some B&W stocks might still be better, though their other properties (grain) are poor.
2) Both linear digital and enhanced digital should stretch from (0,0) to the same topmost, rightmost point; it is the allocation of levels per bit that makes them different (linear vs. curved), not the clipping point (which is defined at the moment the sensor is read).

At this moment I would prioritize the ability to compress the 14-bit RAW in some lossless/lossy way, and only then chopping some bits off and possibly applying a LUT. IMHO, of course.

g3gg0

Quote from: savale on May 27, 2013, 09:21:04 AM

/* zero the 4 LSB bits and shift 4 bits */


not necessary as the lowest bits are already cut off when shifting right...

seriously, i don't get what you all are trying to calculate.
if it's about proper rounding, well, then add 8 before shifting 4 bits right.
any additional operation like adding, masking, shifting, whatever makes the code half as fast.
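
in code, something like this (the clamp is only there because the top 8 input codes would otherwise overflow):

#include <stdint.h>

static inline uint16_t round_14_to_10(uint16_t v14)
{
    uint16_t v10 = (v14 + 8) >> 4;       /* round to nearest instead of down */
    return (v10 > 1023) ? 1023 : v10;    /* inputs >= 16376 would give 1024 */
}
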

vicnaum

Can someone make a test bit-shifting build? I guess it wouldn't even need to be based on d's object file.

Grunf

Quote from: a1ex on May 24, 2013, 05:54:58 PM
Instead of trying to catch this unicorn (see benchmarks), I'd say it's better to investigate how to enable 12-bit output directly from DIGIC.

You can select different raw types from c0f37014: denoised, with different byte order, compressed and so on. Some test code: http://www.magiclantern.fm/forum/index.php?topic=5614.msg39696#msg39696

This.

If we could make the CMOS output shifted RAW data, it would be much easier. It would be much easier even if it could output padded & aligned pixel data (so we don't have to "dig out" the RGB values from the packed Bayer data), thus making it easier to shift down to 10 bit on the CPU.

g3gg0

Quote from: vicnaum on May 27, 2013, 01:57:31 PM
Can someone make a test bit-shifting build? I guess it wouldn't even need to be based on d's object file.

what do you mean by that? that's exactly what d has done.

g3gg0

Quote from: Grunf on May 27, 2013, 02:05:43 PM
If we could make the CMOS output shifted RAW data, it would be much easier. It would be much easier even if it could output padded & aligned pixel data (so we don't have to "dig out" the RGB values from the packed Bayer data), thus making it easier to shift down to 10 bit on the CPU.

if you want padded and aligned data, this means you will have a higher data rate.

(assuming you want it to be 16 bit aligned)
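
To put numbers on it, for a 1920x1080 frame: 14-bit packed is 1920 * 1080 * 14 / 8 = 3,628,800 bytes, while 16-bit aligned is 1920 * 1080 * 2 = 4,147,200 bytes, i.e. roughly 14% more data to move and to write.
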

IliasG

Quote from: KMikhail on May 27, 2013, 10:42:53 AM
...

I hope it is clear now. But a look-up table might indeed be a better alternative, unless referencing memory via a pointer is expensive; I am not familiar with the efficiency of DIGIC 5+.

BTW, why was it so important to round so carefully? Perhaps for 10 bits?

In case that question was for me: yes, with 10-bit raw Bayer we need very careful rounding, and even with careful rounding a 10-bit depth can be inadequate. While 10 bits can be (and are) OK for RGB, with a raw Bayer "master" there is a need to apply some heavy manipulations that are not needed for a ready-to-use RGB file: demosaicing, WB (which is a 1+ stop digital amplification for the weak channel), and the color transformation from Bayer R-G1-B-G2 to RGB (once again, about 1 stop of amplification for the weak channel).

But with a lookup table we can leave the low values untouched (keeping the sampling at the same density as the 14-bit starting data), where the problem is significant, and gamma-encode the rest.

We can even develop a curve imitating the S-log ..

Say we go for 14 to 10 bit:
- we can clip at 32 (or even fewer) 14-bit levels below "Black Level" instead of 2048 as it is now. Just map any 14-bit level lower than 2016 to zero. The remaining levels are 16384 - 2016 = 14368, which will be mapped to the 1024 output levels by a curve like the following:
- the first (say 128) 14-bit values stay untouched, using a linear part with a slope of 16: y = 16*x
- then continue with a gamma, say y = x^(1/2)
- if we want the S-curve, then over the last 1/2 or 1/3 stop we invert the gamma: y = x^2

Try dumping Nikon's D5200 raw curve with RawDigger: it's as described above, linear / gamma / inverted gamma. Same proposal as AXIOM's Smart Dynamic Range.
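
A rough sketch of how such a table could be built and applied (the toe length, the gamma and the constants below are placeholders rather than the exact curve above; the table is 16384 entries of uint16_t, i.e. 32 KB):

#include <stdint.h>
#include <math.h>

#define BLACK 2016                /* clip everything below this 14-bit level to zero */
static uint16_t lut[16384];       /* 14-bit input -> 10-bit output */

static void build_lut(void)
{
    for (int x = 0; x < 16384; x++)
    {
        int v = x - BLACK;
        if (v <= 0)  { lut[x] = 0;            continue; }
        if (v < 128) { lut[x] = (uint16_t)v;  continue; }          /* linear toe: 14-bit steps kept 1:1 */

        double t = (double)(v - 128) / (16383.0 - BLACK - 128.0);  /* 0..1 over the rest of the range */
        lut[x] = (uint16_t)(128 + lround((1023 - 128) * sqrt(t))); /* gamma 1/2 section, ends at 1023 */
    }
}

static inline uint16_t encode(uint16_t raw14)
{
    return lut[raw14 & 0x3FFF];   /* one table lookup per sample */
}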

Let's hope the lookup table can be handled quickly by DIGIC; can any expert give an opinion on this?
I think on a PC (x86 architecture) a lookup table is almost always preferable/faster.

One more compression scheme crosses my mind... Can we, while writing the raw data, use a variable bit depth adapted to the real needs (values up to 256 only need 8 bits)? Then the same lookup table could be expanded with one more field giving this info... :):)

g3gg0

Quote from: g3gg0 on May 27, 2013, 01:11:43 PM
any additional operation like adding, masking, shifting, whatever makes the code half as fast.
and another memory access outside TCM memory will slow it down again a cycle or more.

so: 1/3rd the performance.

Grunf

Quote from: g3gg0 on May 27, 2013, 02:31:37 PM
if you want padded and aligned data, this means you will have a higher data rate.

(assuming you want it to be 16 bit aligned)

It would lead to more data being transferred internally, but it would also mean that the CPU doesn't need to do the unpacking of the values, right?

jgerstel

Currently CF card I/O is the bottleneck...
Imagine the 14-to-10-bit issue gets solved: that bottleneck eases and resolutions can explode. What about a 5D3 4K camera... just an idea :)

1%

I'd settle for acceptable resolutions on SD cards.

KMikhail

You know, mates, here's what really puzzles me.

With the 5D3 we have 6 fps at 22 MP raw with lossless compression. 1920*1280 is 9 times smaller, so the same codec/routines should be able to pull up to (and, who knows, maybe beyond) 54 fps, most likely 60 fps if you cut some fancy features. With 70% compression.
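
(Checking the numbers: 5760 x 3840 = 22,118,400 pixels and 1920 x 1280 = 2,457,600, so the ratio is exactly 9, and 6 fps x 9 = 54 fps of pixel throughput, in theory.)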

Canon could definitely pull this off, if it cared.

Is there any way to access RAW routines of original hardware/firmware?

Also, is the code multi-threaded? Apparently all these slowdowns with the GUI etc. are due to serial processing?

1%

2 different processes. LV raw/photo raw

It may be possible to shrink the actual read area of the sensor; then you get something like 640x480 at 120 fps, genuinely line-skipped. Some demo Canon firmware had this and it wasn't stable.

We don't know how to do that yet.

KMikhail

Quote from: 1% on May 28, 2013, 12:44:51 AM
2 different processes. LV raw/photo raw
We don't know how to do that yet.

You probably saw my tests of video raw vs. photo raw. Binning definitely averages several pixels, giving noise closer to a resized 22 MP raw. Personally, I don't consider 1:1 mode valuable: I want to use the full sensor and full coverage of my glass; that's what makes the 5D3 stand out from everything else, including in DR, according to my test (1-2 stops better noise in the blacks than 1:1).

So, if by any chance the photo raw engine could be used to compress the LV raw, it wouldn't slow things down significantly, and it would give us the outstanding ability to open these raws in DPP, where the color is so nice and the demosaicing works so well. Limitless possibilities.

This would be more than worthy of a good donation :)

mucher

I am not an expert in computing, but as far as I know, the LUT idea looks very good; the worry is that it sounds like we will have a lot of code like: if x<y0 and x>y1 then x=y2, and that kind of branching is known to stall the CPU. d's method should be more suitable for the CPU, and it is reportedly already working in real time for several people here who have actually loaded it into the camera and tried it; we just don't know why the program halts. Maybe d can mercifully share his source code, and maybe someone can look closely to see if there is something more we can do. d's method sounds more straightforward to me, and easier. To achieve better rounding and accuracy we probably only need to change from (x * 12) / 14 to (x * 2^12) / 2^14, and the CPU might be able to cope with that.

savale

Quote from: g3gg0 on May 27, 2013, 02:29:09 PM
what do you mean by that? that's exactly what d has done.

Sorry, I missed that... as I understand it now, the bits are truncated before being put into the buffer. Is that correct?

I have two new ideas that might be worth checking:

1) What about manipulating the pointer instead of shifting? Would it be faster?

something like this:

- write 14 bits
- move the write pointer backwards 4 positions
- write 14 bits (which overwrites 4 bits of the previous value)
- move the write pointer backwards 4 positions
etc...

Details must be sorted out to make this work correctly (LSB / MSB)

2) A second thought: what about doing the 14-to-10-bit conversion at the time the data is written to flash, instead of when it's written into the buffer? I can imagine the results will be different. (I am not sure it's better, but it might be worth a try.)

vicnaum

Quote from: mucher on May 28, 2013, 05:18:43 AM
I am not an expert in computing, but as far as I know, the LUT idea looks very good; the worry is that it sounds like we will have a lot of code like: if x<y0 and x>y1 then x=y2, and that kind of branching is known to stall the CPU. d's method should be more suitable for the CPU, and it is reportedly already working in real time for several people here who have actually loaded it into the camera and tried it; we just don't know why the program halts. Maybe d can mercifully share his source code, and maybe someone can look closely to see if there is something more we can do. d's method sounds more straightforward to me, and easier. To achieve better rounding and accuracy we probably only need to change from (x * 12) / 14 to (x * 2^12) / 2^14, and the CPU might be able to cope with that.

Nope, the LUT thing will be just y = lut[x], and that's all.

And about the calculations you mention - they are easily done (like g3gg0 said) with bit shifting:
14-bit raw: 13598 = 0011 0101 0001 1110
Shift 2 bits right => 12-bit: 0000 1101 0100 0111 = 3399
13598 * 2^12 / 2^14 = 13598 * 4096 / 16384 = 3399.5, which rounds to 3400

So there's no need for extra calculations at all; a right shift solves it at the CPU-instruction level.
If you need better rounding (as said), add half of the divisor (a 2-bit shift divides by 4, so add 2) to the number before shifting:
13598 + 2 = 13600
13600 >> 2 = 3400
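
In the inner copy loop both variants are one step per sample; here unpack_14() and lut12[] are just stand-ins for the real unpacking code and a 16384-entry curve table:

#include <stdint.h>

extern uint16_t unpack_14(int i);    /* hypothetical: yields one 14-bit sample   */
extern uint16_t lut12[16384];        /* hypothetical: precomputed curve table    */

static void convert(uint16_t *dst, int count)
{
    for (int i = 0; i < count; i++)
    {
        uint16_t x = unpack_14(i);
        dst[i] = (x + 2) >> 2;       /* shift variant: 14 -> 12 bit, rounded (top two codes would need a clamp) */
        /* dst[i] = lut12[x]; */     /* LUT variant: same cost per sample, arbitrary curve */
    }
}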

g3gg0

Quote from: savale on May 28, 2013, 08:11:42 AM
2) A second thought: what about doing the 14-to-10-bit conversion at the time the data is written to flash, instead of when it's written into the buffer? I can imagine the results will be different. (I am not sure it's better, but it might be worth a try.)

sorry, but read through the raw_rec code.
as soon as you understand how it all works, you will see that these ideas are obsolete ;)


Northernlight

Hi, I came across this article regarding the Canon C500 and its 10-bit RAW capabilities.
Not sure whether this belongs in this thread or in the ETTR discussion, but I found it interesting (and I am sure most of you already know this).

http://nofilmschool.com/2012/11/canon-c500-shipping-raw-4k/

"Adding gain adjustments at the sensor level produces a consistent stop range above and below the middle grey level, even at high ISOs, and reduces the overall noise in the image."

As I understand this, Canon's RAW format is special in the sense that the C500 adds gain at the sensor level before outputting the RAW stream. Is this, in a way, getting a little closer to ETTR? (to improve DR and SNR)

Q1: Would this "approach" also be possible with the RAW module from ML? (if you hopefully succeed with 10-bit RAW!)

BTW, I have no chance of following the technical discussion in this thread, but:
Q2: Do you at this point have any estimates of what the DNG file size could be, e.g. at 1920x1080, should you succeed with 10-bit RAW?
(I am just curious where 10 bit might take us in terms of fps/resolution)

tin2tin

Quote from: Northernlight on May 28, 2013, 01:01:30 PM
BTW, I have no chance of following the technical discussion in this thread, but:
Q2: Do you at this point have any estimates of what the DNG file size could be, e.g. at 1920x1080, should you succeed with 10-bit RAW?

Quote from: vicnaum on May 23, 2013, 11:30:48 AM
In theory, SD cards can give 22MB/s or 961194 bytes per frame. 1280x600 in 10 bits gives 960000 bytes.
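
By the same arithmetic, and assuming the same tight packing: 1920 x 1080 x 10 bits / 8 = 2,592,000 bytes per frame, so 24 fps would need about 59 MB/s, which is CF-card rather than SD-card territory.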