Question about in camera bit-depth reduction

BatchGordon · July 27, 2023, 12:20:15 AM

I'm looking at the source code but I cannot find the routines for the lossless compression.
Is the module mlv_lite using the lj92 code from mlv_rec or is it just calling internal functions of Canon firmware that we can't modify?
Thanks!

names_are_hard · July 27, 2023, 01:41:26 AM

Lossless compression itself is done via Canon hardware. What is it that you're interested in doing?

BatchGordon · July 27, 2023, 03:31:36 AM

Thank you!
So I suppose there's no way to do a 14 to 10 bit sample reduction by some very light computation?
I was thinking about trying to compress data with a log approximation using just a few bitwise operations.
Probably it still cannot be done in realtime, but if you know where I could put the code I would still like to try it.

names_are_hard · July 27, 2023, 03:43:05 AM

In what way would this be different from the 10-bit mode ML already has?

If you're thinking of video, you probably can't do this fast enough on the CPU. I won't stop you trying, though.

reddeercity · July 27, 2023, 07:14:19 AM

Well, I have some experimental code for the 5D2 (D4) from back when raw video was just getting started and was not easy to record on the 5D2
(the 5D3 wasn't even out yet, so the 5D2 was the best at the time, with no equal).
Anyway, I have code that hard-codes the 14-bit to 12-bit reduction in camera for raw video, so no hardware (encoder chip) is involved like there is now.

Code released on May 23, 2013
raw_rec.c

static int buffer_size_compressed = 0;
static int raw_bitdepth = 12; // make it an option in raw menu and export in raw footer?

extern void raw14_to_raw12(void *buffer, int size);
extern void raw14_to_raw10(void *buffer, int size);




if(raw_bitdepth==12)
		{
			buffer_size_compressed = (size_used * raw_bitdepth) / 14;
			raw14_to_raw12(ptr, size_used);
			written += FIO_WriteFile(f, ptr, buffer_size_compressed);

		}

raw2dng.c

if(raw_bitdepth==12)
	{
		struct raw12_pixblock * p = (void*)raw_info.buffer + y * raw_info.pitch + (x/8)*12;
		switch (x%8) {
			case 0: return p->a;
			case 1: return p->b;
			case 2: return p->c_lo | (p->c_hi << 4);
			case 3: return p->d;
			case 4: return p->e;
			case 5: return p->f_lo | (p->f_hi << 8);
			case 6: return p->g;
			case 7: return p->h;
		}
		return p->a;
	}

	struct raw_pixblock * p = (void*)raw_info.buffer + y * raw_info.pitch + (x/8)*14;
    switch (x%8) {
        case 0: return p->a;
        case 1: return p->b_lo | (p->b_hi << 12);
        case 2: return p->c_lo | (p->c_hi << 10);
        case 3: return p->d_lo | (p->d_hi << 8);
        case 4: return p->e_lo | (p->e_hi << 6);
        case 5: return p->f_lo | (p->f_hi << 4);
        case 6: return p->g_lo | (p->g_hi << 2);
        case 7: return p->h;
    }
    return p->a;
    }

Not sure if this will help you. I have the full C code for:
raw_rec.c
raw2dng.c
chdk-dng.c (the DNG writer for the hard-coded 12-bit on the 5D2)
An old developer, "Mayo", was or is the author of this code. If you would like all of it, let me know and I'll post a link to the full code.

BatchGordon · July 27, 2023, 01:39:30 PM

Quote from: names_are_hard on July 27, 2023, 03:43:05 AM
In what way would this be different from the 10-bit mode ML already has?

Good question! As far as I know, the current reduction from 14 to 10 bits is done by discarding the 4 least significant bits of each sample, which is reasonable but not optimal.
A color depth of 10-bit linear is sufficient for light grading, but many modern cameras record at least a 10-bit log profile which, while not raw, has better shadow behavior.

As for the speed problem with video... I also doubt that it can work in realtime, and I don't think I can write code optimized for this processor, but the algorithm I found looks very interesting and computationally very light:
https://lifecs.likai.org/2018/12/hybrid-log-float-14-bit-to-10-bit-pixel.html

Log conversion should be done before lossless compression, and could benefit it too by allowing a higher compression ratio.
Of course, a de-log is also required after decompressing and before demosaicing in MlvApp.
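
To make the comparison with truncation concrete, here is a minimal sketch of a float-style 14-to-10-bit encoding and its inverse. It is not the exact algorithm from the linked article: the 3-bit exponent / 7-bit mantissa split and the names truncate_14_to_10, encode_14_to_10 and decode_10_to_14 are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Plain truncation: drop the 4 least significant bits. */
static uint16_t truncate_14_to_10(uint16_t v)
{
    return (uint16_t)((v & 0x3FFF) >> 4);
}

/* Float-style companding (illustrative layout, an assumption of this sketch):
 * 3-bit exponent, 7-bit mantissa. Values below 256 are stored exactly;
 * above that, each doubling of the input doubles the quantization step. */
static uint16_t encode_14_to_10(uint16_t v)
{
    unsigned shift = 0;
    v &= 0x3FFF;
    /* Find how far v must be shifted to fit in 8 bits (0..6 for 14-bit input). */
    while ((v >> shift) > 0xFF)
        shift++;
    if (shift == 0)
        return v;                       /* linear region, codes 0..255 */
    /* The leading 1 is implied by the exponent; keep the next 7 bits. */
    return (uint16_t)(((shift + 1) << 7) | ((v >> shift) & 0x7F));
}

/* Inverse of encode_14_to_10 (returns the lower edge of each step). */
static uint16_t decode_10_to_14(uint16_t code)
{
    unsigned e = (code >> 7) & 0x7;
    assert(code < 1024);
    if (e <= 1)
        return code;                    /* linear region */
    /* Restore the implied leading 1, then undo the shift. */
    return (uint16_t)((0x80u | (code & 0x7F)) << (e - 1));
}
```

In this layout a shadow value such as 100 survives the round trip exactly, whereas truncation gives back 96. The price is a coarser step (up to 64) in the brightest stop, compared with a uniform step of 16 for truncation.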

Does anyone else (besides me) find the hypothesis at least theoretically feasible?

Quote from: reddeercity on July 27, 2023, 07:14:19 AM
Well, I have some experimental code for the 5D2 (D4) from back when raw video was just getting started and was not easy to record on the 5D2
(the 5D3 wasn't even out yet, so the 5D2 was the best at the time, with no equal).
Anyway, I have code that hard-codes the 14-bit to 12-bit reduction in camera for raw video, so no hardware (encoder chip) is involved like there is now.
...

Thanks so much for the code!
It appears to take a similar route to the old raw module (before mlv_lite) avoiding the hardware chip to do the encoding.
I don't know if we can still intercept the image samples in mlv_lite while keeping the hardware encoding functionality or not.
Going back to the old route would likely mean a performance and functionality loss that could be hard to accept.
Anyway, I'll let you know if I need the full code. Thanks again!

theBilalFakhouri · July 27, 2023, 01:58:24 PM

@BatchGordon

I split your initial question and the following replies into a separate topic, since they were off-topic in the crop mood thread.

names_are_hard · July 27, 2023, 03:07:55 PM

Quote from: BatchGordon on July 27, 2023, 01:39:30 PM
Does anyone else (besides me) find the hypothesis at least theoretically feasible?

The alg seems reasonable enough, but some points of note: older cams definitely don't have SIMD (NEON, on ARM). Newer cams report support for NEON but I haven't been able to initialise the NEON unit (you're supposed to do some asm magic to wake it up). Perhaps support is there, perhaps not, I have no prior experience trying to use it so could easily be doing something wrong.

ML code suggests the sensor has a 16-bit capture mode. This isn't 16 bits of data, the sensor remains 14 bit, so I assume this captures the data packed into 16 bits. Possibly useful for you. See MODE_16BIT.

Perf is going to be "challenging". Consider a 3000x1000 capture region at 24 fps. This is 72M sensels per second. Can you encode each of those in 10 instructions? If so you need to hit 720M ips. I don't know the exact clock on the CPU but I'm betting it's not GHz. So, while I think it is an interesting exercise, be prepared to do the work and not end up with something useful. I think you only gain something here if existing 14-bit raw is card speed limited, and the conversion step to 10-bit doesn't bottleneck write speed due to CPU overhead. Might be worth doing some benchmarks with dummy code before you get too invested.
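
The budget can be worked through with a few lines of C. This is only a back-of-the-envelope sketch: the function names are made up, and the cost of 10 instructions per sensel is the illustrative figure from the example, not a measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Sensels that must be processed per second for a given capture region. */
static uint64_t sensels_per_second(uint32_t width, uint32_t height, uint32_t fps)
{
    assert(width > 0 && height > 0 && fps > 0);
    return (uint64_t)width * height * fps;
}

/* Instructions per second needed if every sensel costs a fixed
 * number of instructions (loop overhead and memory stalls ignored). */
static uint64_t required_ips(uint64_t sensels_per_sec, uint32_t instr_per_sensel)
{
    return sensels_per_sec * instr_per_sensel;
}
```

For 3000x1000 at 24 fps, sensels_per_second() gives 72,000,000, and required_ips() with 10 instructions per sensel gives 720,000,000.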

compress_task() in modules/mlv_lite.c passes uncompressed buffers to compression, that should give you a place in the code to work from.

BatchGordon · July 28, 2023, 11:44:50 PM

Thank you for all your suggestions and the additional information; they really give me a better starting point!

I completely agree with you on the doubts about performance: I'm ready to do the work and not end up with something useful, after all this is what experimenting means!

I'm also sure my code won't be as optimized as this task requires, since I'm mostly a high-level software developer. Anyway, I will share the results and the code as soon as possible, so someone with more knowledge than me can improve my work.

Thanks again and yes, compress_task() really seems to be the best place to start with the code!

Skinny · July 29, 2023, 08:12:01 AM

By the way, what would be really interesting is if you could make 8-bit raw at least somewhat useful by lifting the shadow values somehow, maybe with a very simple algorithm.
It could be an interesting "low bitrate" mode for when you need longer recording times.

BatchGordon · July 31, 2023, 10:24:37 AM

I think the only way to make 8 bits per color usable is a precise log conversion, or at least an approximation using a Taylor series.
I'm sure both solutions are computationally too complex.

The proposed algorithm is the only one simple enough to have any chance of being applicable, but I don't think it can be extended to an 8-bit reduction, as the quantization would be too coarse.
In any case, if it works for 10 bits, a test with 8 could always be attempted.

BatchGordon · August 07, 2023, 08:14:13 PM

I have done some very basic testing and, unless I have done something wrong (which is possible), it seems able to process at the required speed.
Now I need some help from someone who knows something more than me about the code...

As suggested by names_are_hard the code can be put inside "compress_task()", probably just before the call to CreateMemorySuite(). That's where I've put my dummy code, anyway.

So... I suppose I can find the res_x*res_y samples in the data pointed to by ptr, after the header.
But now I have two questions:
- How long is the header?
- Are the 14-bit samples already "packed" at this stage, or do they get packed only after compression, with every sample being a WORD at this stage?

Sorry if the questions are silly, but it's my first time looking into the code of ML.

names_are_hard · August 07, 2023, 08:57:58 PM

I wasn't exactly suggesting you put it in compress_task() - I think this only runs if you're using compression. It was just a convenient place in the code where I happened to know the image buffer was available. Still, it should work fine for a proof of concept.

I'm guessing you're not very familiar with C? It's a typed language, so, to determine the size of a variable, first find the type, then, if it's not a fundamental type, find the definition of the type (often inside a .h file).
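
As a generic illustration of that lookup (the struct below is invented for the example and is not a real ML header), sizeof and offsetof report the size and layout once the type definition is known:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header type, made up for this example only. */
struct demo_header {
    uint8_t  magic[4];
    uint32_t block_size;
    uint64_t timestamp;
};

/* If the payload starts right after the header, the header length
 * is simply the size of its type. */
static const void *payload_of(const void *block)
{
    assert(block != NULL);
    return (const uint8_t *)block + sizeof(struct demo_header);
}
```

Here sizeof(struct demo_header) is 16 and offsetof(struct demo_header, timestamp) is 8; the same approach applies to whatever type the real buffer starts with.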

Which 14 bit samples are you talking about? If you give a variable name or line in the code your question would be easier to understand.

BatchGordon · August 10, 2023, 03:34:37 PM

Sorry for my late reply!

I understand compress_task() is not the right place to put the code, but as of now being "a convenient place" is all I need since I'm just doing preliminary tests.

As I previously said, I'm a Java developer and C is not my main language. Furthermore, my previous experience with C (and C++) is mostly in desktop application development. So I'm aware I'm not the right person to be involved with these tests, but I want to try anyway and perhaps, in case of some positive results, someone else could improve my work.

Yesterday I was able to dig a bit deeper into the code and I understood a few things.
In more detail, I was able to change the values of part of the acquired image by putting this code inside compress_task(), right before the call to lossless_compress_raw_rectangle(...):

            unsigned int s = 0; // will get the sample value
            for(int pos_y = 900; pos_y < 1000; pos_y++) { // should be extended from 0 to raw_info.height
               for(int pos_x = 900; pos_x < 1000; pos_x++) { // should be extended from 0 to raw_info.width
                  struct raw_pixblock * p = (void*)fullSizeBuffer + pos_y * raw_info.pitch + (pos_x/8)*14; 
                  
                  // get the value of the sample      
                  switch (pos_x%8) {
                      case 0: s = p->a; break;
                      case 1: s = p->b_lo | (p->b_hi << 12); break;
                      case 2: s = p->c_lo | (p->c_hi << 10); break;
                      case 3: s = p->d_lo | (p->d_hi << 8); break;
                      case 4: s = p->e_lo | (p->e_hi << 6); break;
                      case 5: s = p->f_lo | (p->f_hi << 4); break;
                      case 6: s = p->g_lo | (p->g_hi << 2); break;
                      case 7: s = p->h;
                  }

                  s |= 0x0FFF; // just an example of sample value changed

                  // write the new value of the sample
                  switch (pos_x%8) {
                      case 0: p->a = s; break;
                      case 1: p->b_lo = s; p->b_hi = s >> 12; break;
                      case 2: p->c_lo = s; p->c_hi = s >> 10; break;
                      case 3: p->d_lo = s; p->d_hi = s >> 8; break;
                      case 4: p->e_lo = s; p->e_hi = s >> 6; break;
                      case 5: p->f_lo = s; p->f_hi = s >> 4; break;
                      case 6: p->g_lo = s; p->g_hi = s >> 2; break;
                      case 7: p->h = s; break;
                  }
                }
            }

The code to access the sample value is very inefficient (I took it from the functions that read/write a sample in raw.c; it is fine for random access, not for sequential access) and must be heavily optimized (or, better, rewritten). But for now what I find surprising is something else:
I can extend the processing to the full frame with no delay if I skip writing the samples, whereas with the writes in place I can't go much beyond 100x100 pixels before it becomes choppy (even LiveView gets slow).
I mean: doing it in realtime was expected to be hard, but I don't understand why only the writing part of the code takes a lot of time. To me it looks pretty similar to the reading, yet it seems to take something like 100x longer to write a sample.
Any opinions about it?

names_are_hard · August 10, 2023, 07:09:30 PM

It'll interact with caching, might be that.

I don't want to review code unless I can be sure of what changes are made - can you host this as a repo somewhere with "fast" and "slow" being different commits?

(You also probably want to avoid division since that's done in software, and avoid unaligned memory access since a) it can be slow on ARM and b) it may return different data than you expect)
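
For the indexing in the posted loop, one way to sidestep division is to use shifts and masks, which are equivalent for powers of two. This is a sketch that assumes non-negative pixel coordinates and the 8-pixels-per-14-bytes blocks from the snippet; the helper names are made up. Compilers often do this already for constant divisors, so it is worth checking the generated code before assuming a gain.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset within a row of the 14-byte block holding pixel x: (x / 8) * 14. */
static size_t block_offset(unsigned x)
{
    return (size_t)(x >> 3) * 14;
}

/* Position of pixel x inside its block, 0..7: x % 8. */
static unsigned lane_in_block(unsigned x)
{
    return x & 7u;
}

/* Self-check: the shift/mask forms match / and % for every x in a row. */
static void check_equivalence(unsigned width)
{
    for (unsigned x = 0; x < width; x++) {
        assert(block_offset(x) == (size_t)(x / 8) * 14);
        assert(lane_in_block(x) == x % 8);
    }
}
```

Walking the row one block at a time, instead of one pixel at a time, would remove the per-pixel index computation altogether.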

names_are_hard · August 11, 2023, 03:43:52 PM

Thinking about it, you should always be able to unroll the access to p so you write an aligned amount into the struct in one go. This will avoid needing to do complicated shifted accesses through the p pointer, which is likely to be quite slow. Accumulate writes into a local var until you have some amount of contiguous bits that is a nice round number. So, 32-bit minimum, in which case your local var is a register, or, probably better, some multiple of sizeof(raw_pixblock) that is itself a multiple of 32 bits in length. Multiple of 2 would work, higher is probably more efficient. Your local var is then some buffer and you memcpy() into your dest buffer, no need to do any shifts during the write, no need to do any expensive sub-32-bit length writes, and you reduce loop overhead.
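
A rough sketch of the accumulate-then-write idea follows. It packs samples MSB-first into whole 32-bit words; that bit order is an assumption made for illustration and is not claimed to match the raw_pixblock layout, so the shifts would need adapting. The name pack14_words is made up, and the 64-bit accumulator is used for clarity (a 32-bit one would likely suit the camera CPU better).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack n 14-bit samples MSB-first into 32-bit words.
 * n must be a multiple of 16 (16 * 14 bits = 7 whole words),
 * so every write to dst is a full, aligned 32-bit store.
 * Returns the number of words written. */
static size_t pack14_words(const uint16_t *src, size_t n, uint32_t *dst)
{
    uint64_t acc = 0;   /* bit accumulator; new samples enter at the low end */
    int bits = 0;       /* number of not-yet-written bits held in acc */
    size_t out = 0;

    assert(n % 16 == 0);

    for (size_t i = 0; i < n; i++) {
        acc = (acc << 14) | (uint64_t)(src[i] & 0x3FFF);
        bits += 14;
        if (bits >= 32) {
            /* Emit the oldest 32 pending bits as one word. */
            bits -= 32;
            dst[out++] = (uint32_t)(acc >> bits);
        }
    }
    return out;
}
```

Only whole words are stored, so no sub-32-bit writes or read-modify-write cycles touch the destination buffer. Unrolling the loop by 16 samples would also turn every shift amount into a constant.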

Separately, you possibly could use the UNCACHED() address for p. This may avoid expensive sync operations in between writes. On the other hand, this may already be happening in which case it's pointless, or, it may be unsafe - depends what the code is doing.
