MLVFS - a FUSE based, "on the fly" MLV to CDNG converter

Started by dmilligan, August 31, 2014, 02:01:24 AM


Danne

MLRawViewer reads raw data pretty much in realtime, even on crappy MacBook Airs. Complex GPU code, but still. It amazed me how both DVR and Premiere struggled so with DNG files while Andrew Baldwin published stellar code for previewing.
I bet there are solutions for MLVFS to achieve realtime previewing. Probably hard work.

vastunghia

Don't think so.

MLVFS just exposes to your OS file system (via FUSE) the chunks of raw data contained in MLV files as single DNG files (that is, single frames) containing that raw data. This is a simple task: it is not debayering / demosaicing; it is not processing the raw data (unless you ask it to do the Dual ISO magic, or something else). So, however badly the code were written (and I don't think this is the case), it could not be much optimized anyway. MLVFS is pretty transparent in its most basic application (no Dual ISO etc.).
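To illustrate how thin that pass-through layer is: about the only "logic" such a virtual file system needs per frame is mapping a virtual per-frame filename back to a frame index, then serving the corresponding chunk of the MLV. A hypothetical sketch (the naming scheme and function are assumptions, not actual MLVFS code):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: map a virtual filename like "clip_000123.dng"
 * to its frame index, or return -1 if the name doesn't match.
 * A FUSE read handler would then just copy that frame's raw chunk. */
static int frame_index_from_name(const char *name)
{
    const char *dot = strrchr(name, '.');
    if (!dot || strcmp(dot, ".dng") != 0) return -1;
    const char *us = strrchr(name, '_');
    if (!us || us > dot) return -1;
    char buf[16];
    size_t len = (size_t)(dot - us - 1);
    if (len == 0 || len >= sizeof(buf)) return -1;
    memcpy(buf, us + 1, len);
    buf[len] = '\0';
    for (size_t i = 0; i < len; i++)           /* digits only */
        if (buf[i] < '0' || buf[i] > '9') return -1;
    return atoi(buf);
}
```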

The only thing it needs to do actively is decompressing, in the case of losslessly compressed MLV files. I think *this* is the CPU-consuming step which is making it difficult to reach real-time playback for high-res, high-bit-depth footage. This, and the fact that the editor asks for one frame at a time, instead of leveraging the parallelization potential of MLVFS.

MLRawViewer was another beast. It debayered the data and showed you the final image (whereas MLVFS delegates this to, say, DVR). It was fast because it used the GPU -- just as DVR does -- for debayering. And I think it didn't handle losslessly compressed MLV files. I'm pretty sure that it would stutter if you fed it the same compressed footage I'm trying to play back, because that wouldn't be a debayering GPU problem, but a decompression CPU one.

Evidence for the above: I *can* play back in DVR in real time if I just copy my DNG sequence locally. So DVR is using GPU acceleration to debayer, just as MLRawViewer did, and since there is no more decompression work to do on the CPU, everything is fine.

So in a nutshell: I think the culprit is decompression, weighing on the CPU, and there is no optimization that can be performed, in this respect, on the MLVFS code. We only need to devise a smarter way to *use* MLVFS. Will think / work on it when I find the time.
5D3 for video
70D for photo

Danne

Probably.
However, DVR is not even close when it comes to speed previewing DNG sequences. At least as of the last time I messed with this program.

"time, instead of leveraging on the parallelization potential of MLVFS."

How about looking into this  8)?

vastunghia

Can't say whether DVR is fast or not. I can only say that, after boosting my iMac with a powerful eGPU, it can now play back demanding DNG sequences like 3.5K @ 14-bit in real time with no problem -- provided that the MLV -> DNG conversion was performed beforehand (i.e. not on the fly via MLVFS).

Yeah, planning to study how DVR is accessing DNG sequences as a first step towards understanding whether we can somehow tweak the interface between DVR and MLVFS. Once again, caching could be a key factor.
5D3 for video
70D for photo

names_are_hard

When it comes to optimising, a bit of speculation is good: try to think of the overall shape of the problem.  But then you have to do the real step: profiling.  Should be fairly easy to profile this code, since it's open source.  Intuition is often very bad at guessing what performance is really doing.

Decompression is normally much faster than compression and parallelises well.  I would expect this not to be a bottleneck, and if it is, I'd expect it to be easy to fix.  Easy to measure with a profiler.

vastunghia

Quote from: names_are_hard on May 25, 2024, 05:03:23 PMBut then you have to do the real step: profiling.

100% agree. That's exactly what I meant when I wrote

Quote from: vastunghia on May 24, 2024, 09:41:51 PMCould be helpful to make MLVFS log single frame read requests with their corresponding timestamp, to try and understand the time sequence with which DVR asks for frames. And check if frame requests are repeated during the third playback, when all seems to be cached already. May give it a try as soon as I have 10 minutes.

Not an expert here; not sure logging to a text file with millisecond timestamps is the way to go for profiling. But that's what I'm currently doing: adding a log entry every time a frame read request is acknowledged by MLVFS, and again when it is successfully completed.

Quote from: names_are_hard on May 25, 2024, 05:03:23 PMDecompression is normally much faster than compression and parallelises well.  I would expect this not to be a bottleneck, and if it is, I'd expect it to be easy to fix.  Easy to measure with a profiler.

Mmh not so sure: according to FastCinemaDNG developers,

QuotePerformance of Lossless JPEG decoder is mostly limited by Huffman decoding. Actually, we need to read bitstream (bit after bit), to recover Huffman codes and data bits. Huffman decompression is essentially serial algorithm, so one can't implement it on GPU, that's why everything is done on CPU.

So it looks like this could be a bottleneck after all, at least to some extent. In particular as long as one keeps asking for single frames in sequence, one at a time (as DVR seems to do, naturally assuming that the bottleneck is hard drive read speed, and not the CPU), instead of asking for a bunch of them and then proceeding to the next bunch.

Anyway. Will dig into profiling. Which is simple in Python... but I never did it in C.
5D3 for video
70D for photo

names_are_hard

You could log, and it can help, and I do it in some cases when optimising.  It can be a quick check in the early stages especially if you have a suspicion some particular thing is slow.

Generally though, use a real profiler.  I tend to use callgrind, from the Valgrind suite of tools, first, because it works on anything and doesn't need recompiling.  Use kcachegrind to view the results.

Here, since it's open source and easy to build, and you're probably using gcc, gprof is another obvious tool to try.

Both of these will give you a lot of good info, without you needing to understand the code first.  You will need to learn the tools, but they're not hard and good tutorials exist.

Quote from: vastunghia on May 25, 2024, 06:53:45 PMSo it looks like this could be a bottleneck after all, at least to some extent. And in particular as long as one keeps asking for single frames in sequence, one at a time (as DVR seems to do, naturally assuming that the bottleneck is hard drive read speed, and not CPU), instead of asking for a bunch of them, and then proceed to the next bunch.

The Huffman stuff is interesting, I haven't dealt with this alg.  I bet it's still faster to decode than encode.  Also, it looks like you can speed it up fine using parallel techniques; there are a lot of easy-to-find papers on this, as well as repos with code they claim works.  Skim reading, it does sound like it presents some problems, so you won't get an 8x speedup from 8 cores, but you only need 2x or so, right?

And yeah, because here you can treat each frame / file as a separate decompression job, you can trivially parallelise the overall task.  8 cores?  Use say, 6 threads, and decode whatever frame someone requests, and the next 5.
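The per-frame parallelisation described above can be sketched with plain pthreads (a toy sketch: `decode_frame()` here just marks a job done, standing in for the real lj92 decode of one frame):

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NFRAMES 6   /* max jobs per batch in this sketch */

struct job { int frame; int done; };

/* Stand-in for the real per-frame decompression work: each frame is
 * an independent job, so no shared state is touched here. */
static void *decode_frame(void *arg)
{
    struct job *j = (struct job *)arg;
    j->done = 1;
    return NULL;
}

/* Decode up to NFRAMES frames concurrently, one thread per frame. */
static int decode_batch(struct job *jobs, int n)
{
    pthread_t tid[NFRAMES];
    if (n > NFRAMES) return -1;
    for (int i = 0; i < n; i++)
        if (pthread_create(&tid[i], NULL, decode_frame, &jobs[i]) != 0)
            return -1;
    for (int i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```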

And / or, assign some amount of cache (1/4 of available ram or whatever), and decode frames you think will be useful until cache is full, then serve from cache.  Don't wait for something to request the file first - fill the cache speculatively.  As you say, it should be quite predictable.  If the whole MLV file fits in cache (a common case?), MLVFS can just decode it all as fast as possible when you mount the file, then all work is done, and keep using the cache.
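The decode-once cache idea can be sketched like this (assumptions: a fixed frame count and one in-RAM buffer per frame; `decode_into()` stands in for the lj92 decompression):

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define FRAME_COUNT 4
#define FRAME_SIZE  16

static unsigned char cache[FRAME_COUNT][FRAME_SIZE];
static int cached[FRAME_COUNT];
static int decode_calls;   /* counter, just to observe cache behaviour */
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for decompressing one frame into a buffer. */
static void decode_into(int frame, unsigned char *dst)
{
    decode_calls++;
    memset(dst, frame, FRAME_SIZE);
}

/* First request for a frame decodes it; later requests are served
 * straight from memory. The mutex avoids decoding the same frame
 * twice from concurrent readers. */
static const unsigned char *get_frame(int frame)
{
    pthread_mutex_lock(&cache_lock);
    if (!cached[frame]) {
        decode_into(frame, cache[frame]);
        cached[frame] = 1;
    }
    pthread_mutex_unlock(&cache_lock);
    return cache[frame];
}
```

A speculative filler would just call `get_frame()` on upcoming frames from background threads before anything requests them.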

How to make MLVFS do that?  Well, I dunno :D

vastunghia

Thanks for the hints on profiling. Will check them out. I'm on macOS so kcachegrind won't do. But anyway, will find my way.

Quote from: names_are_hard on May 25, 2024, 07:26:36 PMAnd yeah, because here you can treat each frame / file as a separate decompression job, you can trivially parallelise the overall task.  8 cores?  Use say, 6 threads, and decode whatever frame someone requests, and the next 5.

That's *exactly* my point, thank you. And the part on the cache, as well.

This would be the most obvious -- and I also think the most effective -- way to optimize MLVFS. Like, ok, maybe the decompression algo can be fine-tuned a bit; but if you can parallelize it, who cares?

Will look into this.
5D3 for video
70D for photo

vastunghia

Unfortunately life is pretty busy at the moment, so I just had the time to understand how profiling works in Xcode and give it a first brief try.

The image below shows the results when trying to play back in DVR, for the first time since mounting the virtual FUSE drive (i.e. without any advantage from caching), a 48-second clip @ 3.5K / 23.976 fps / 14-bit / lossless compressed.

Not sure I'm reading the results correctly, but it does seem to me that my guess was confirmed and the culprit is the lj92_decode function, which implements decompression.

I'm a bit surprised by the CPU usage though. Hope there is room for further optimization. Maybe one day I will be able to answer this question...

5D3 for video
70D for photo

names_are_hard

Nice, it's working :)  I don't know how to interpret it, I've never used Xcode.  Getting good at profiling takes a while.

It doesn't look like any of the CPUs are working very hard.  And yet, I agree it does look like lj92_decode has a high time cost.  It's being called by get_image_data().  That might mean it's spending a lot of time waiting for disk access?  Slow, but low CPU.  That depends on what the profiler measures.  Disk accesses (and switches to kernel context) are often handled in special ways.

I can't give any better advice, you might want to look up how to show disk wait in Xcode profiling.  It should be possible.

EDIT: another possibility is multi-threading making your code *slower*.  CPU usage is about the same on all cores...  if they're spending a lot of time waiting on locks etc, that wouldn't show as CPU usage, but it would take time.

vastunghia

Thanks for your thoughts!

There is a very basic implementation of thread synchronization in MLVFS, from what I saw, using pthread mutex locks (for instance in get_or_create_image_buffer(), which is part of the heaviest stack trace reported by Xcode's profiler). My understanding is that we are just avoiding concurrent access to image buffers temporarily stored in RAM here.

I don't think there can be any issue in resource management on the virtual FUSE drive — each file is a single frame and no one is trying to access one frame multiple times in theory.

On the real hard drive, unless we are saying that there is something wrong with trying to read multiple chunks at once from the original MLV file, once again I can see no conflicting race for resources. And once again, I'm pretty sure my PCIe NVMe drive cannot be a bottleneck.

If you had something different in mind when citing resource locks, could you elaborate a little bit? I will look for useful instruments in Xcode's tools portfolio anyway, thanks.

Now, profiling was good to verify the main theory — namely, that decompression is the CPU killer.

But I also need to verify that DVR is currently asking for frames one by one, serially — which would be the obvious way to ask for frames from DVR's perspective, given that its developers assume they are hard-drive-limited rather than CPU-limited. To do so, I would need to log millisecond timestamps for 1. frame read request acknowledgment, 2. start of frame processing, 3. end of processing, 4. frame content delivery.
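A sketch of those four log points (the event names are my own invention; in MLVFS they would be emitted from the FUSE read handler and from the decode path):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The four moments in a frame's lifetime worth timestamping. */
enum frame_event {
    EV_READ_ACK,     /* 1. read request acknowledged */
    EV_PROC_START,   /* 2. start of frame processing */
    EV_PROC_END,     /* 3. end of processing */
    EV_DELIVERED     /* 4. frame content delivered */
};

static const char *event_name(enum frame_event e)
{
    static const char *names[] =
        { "read_ack", "proc_start", "proc_end", "delivered" };
    return names[e];
}

/* Format one log line: "<ms> <event> frame=<n>". */
static int format_log_line(char *dst, size_t cap, long ms,
                           enum frame_event e, int frame)
{
    return snprintf(dst, cap, "%ld %s frame=%d", ms, event_name(e), frame);
}
```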

Already tried logging to file, but results did not convince me. Will try to understand if Xcode tools can help me somehow, or will try again file logging.
5D3 for video
70D for photo

Danne


names_are_hard

Quote from: vastunghia on June 10, 2024, 09:20:24 AMI don't think there can be any issue in resource management on the virtual FUSE drive — each file is a single frame and no one is trying to access one frame multiple times in theory.

...

If you had something different in mind when citing resource locks, could you elaborate a little bit? I will look for useful instruments in Xcode's tools portfolio anyway, thanks.

Now, profiling was good to verify the main theory — namely, that decompression is the CPU killer.

If different threads are contending for a shared lock while decompressing, they may decompress *slower* than a single thread.  These kinds of problems are easy to hit by accident, and a profiler will be able to show you stats on mutex contention - but I don't know how to get Xcode to do this.

I don't think CPU is the killer.  Take a look at your results in detail:



CPU usage is not high overall: 40% max even when active.  You have 1m13s of trace, 73s.  I guess some of this is not really using MLVFS, but some setup time, you starting the trace, stopping it, etc.  But whatever it is, the total time profiled *for mlvfs* is only 11s.

A lot of your 73s is something other than lj92 decomp.  Of the *CPU* time, a lot is lj92.  But most of the time isn't lj92.  This means you may have something easy to fix.

vastunghia

You definitely know what you are talking about, while the same does not apply to me ;) That being said, in my mind one frame <-> one serial decompression <-> one thread, so I'm not sure how two threads could be contending for the same locked resource. AFAIK, in MLVFS mutex locks are applied only in one specific context, and only to single-image (single-frame) buffers.

Could MLVFS-related low CPU usage also be due to the fact that, apart from MLVFS running, there is also DVR reading DNG files, demosaicing them and rendering on screen (with all those real-time scopes as well)? Though DVR should be more GPU-based. Oh well.

Anyway. I found out that, once frames have been played back once, apparently they are cached by FUSE (not MLVFS, nor macOS): indeed, profiling during the following playbacks (which are smooth in DVR) shows a recurring call to fuse_session_process_buf(). This suggests that forcing FUSE to read all frames in advance could do the trick.

Though it would be leaner and more efficient to build a custom caching system inside MLVFS, like: when FUSE asks for whichever frame N of MLV file X, then in parallel 1. MLVFS processes and returns that frame, and 2. it starts processing and caching (once again in parallel) the frames starting from N+1 (where of course at each step processing is skipped if the frame was already cached), possibly up to the final frame, and then also from frame 0 up to N-1. Though this may also lead to unbearable resource conflicts...
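The read-ahead order described above (N+1 up to the final frame, then wrapping around from 0 to N-1) is trivial to spell out; a sketch, with already-cached frames left to be skipped by the caller:

```c
#include <assert.h>

/* Given a requested frame n out of `total` frames, fill `out` with the
 * read-ahead visiting order: n+1 .. total-1, then 0 .. n-1.
 * Returns the number of frames scheduled (total - 1). */
static int readahead_order(int n, int total, int *out)
{
    int k = 0;
    for (int f = n + 1; f < total; f++) out[k++] = f;
    for (int f = 0; f < n; f++)         out[k++] = f;
    return k;
}
```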

Ps: I also realized that simply forcing MLVFS to process and cache M frames in advance (i.e. frames N+1 up to N+M) is not a solution.

Dunno. This is tough ::)
5D3 for video
70D for photo

vastunghia

Ok, next step: instead of profiling MLVFS using DVR as a trigger for frame reads, build a simple script / program that just reads frames in sequence (without any further downstream processing like debayering), and profile MLVFS using this lightweight program. This should get rid of any DVR-induced background noise.

Also, modify the script / program so that it tries to read frames in parallel (say, 2 by 2, or 3 by 3, etc.) and repeat. Check if the resulting read time is compatible with real-time playback. This could provide a proof of concept, or invalidate it.
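The core of such a proof-of-concept reader could look like this (a sketch: it reads a list of files in batches of up to `batch` concurrent threads with no downstream processing; timing the sequential vs. parallel runs is left to the caller):

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

struct read_job { const char *path; long bytes; };

/* Read one whole file, recording how many bytes were read
 * (-1 on failure to open). */
static void *read_whole_file(void *arg)
{
    struct read_job *j = (struct read_job *)arg;
    j->bytes = -1;
    FILE *f = fopen(j->path, "rb");
    if (!f) return NULL;
    char buf[4096];
    long total = 0;
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        total += (long)n;
    fclose(f);
    j->bytes = total;
    return NULL;
}

/* Read n files, up to `batch` at a time in parallel.
 * batch = 1 gives the sequential baseline. */
static int read_batched(struct read_job *jobs, int n, int batch)
{
    for (int i = 0; i < n; i += batch) {
        pthread_t tid[16];
        int m = (n - i < batch) ? n - i : batch;
        if (m > 16) return -1;
        for (int t = 0; t < m; t++)
            if (pthread_create(&tid[t], NULL, read_whole_file, &jobs[i + t]) != 0)
                return -1;
        for (int t = 0; t < m; t++)
            pthread_join(tid[t], NULL);
    }
    return 0;
}
```

Pointing the job list at the virtual DNG files on the FUSE mount, and comparing `batch = 1` against `batch = 2, 3, ...`, would directly test whether parallel requests get MLVFS past real-time throughput.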

Then again, not sure how to pass from PoC to actual implementation in MLVFS ::)
5D3 for video
70D for photo