MLVFS - a FUSE based, "on the fly" MLV to CDNG converter

Started by dmilligan, August 31, 2014, 02:01:24 AM


Danne

MLRawViewer reads raw data pretty much in realtime, even on crappy MacBook Airs. Complex GPU code, but still. It amazed me how both DVR and Premiere struggled so much with DNG files while Andrew Baldwin had published stellar code for previewing.
I bet there are solutions for MLVFS to achieve realtime previewing. Probably hard work.

vastunghia

Don't think so.

MLVFS just exposes to your OS file system (via FUSE) chunks of raw data contained in MLV files as single DNG files (that is, single frames) containing that raw data. This is a simple task: it is not debayering / demosaicing; it is not processing the raw data (unless you ask it to do the Dual ISO magic, or something else). So however badly the code might be written (and I don't think that's the case), it could not be optimized much anyway. MLVFS is pretty transparent in its most basic application (no Dual ISO etc.).

The only thing it needs to do actively is decompressing, in the case of lossless compressed MLV files. I think *this* is the CPU-consuming step which makes it difficult to reach real-time playback for high-res, high-bit-depth footage. This, and the fact that the editor asks for one frame at a time, instead of leveraging the parallelization potential of MLVFS.

MLRawViewer was another beast. It debayered the data and showed you the final image (whereas MLVFS delegates this to, say, DVR). It was fast because it used the GPU for debayering -- just as DVR does. And I think it didn't handle lossless compressed MLV files. I'm pretty sure it would stutter if you fed it the same compressed footage I'm trying to play back, because that wouldn't be a debayering GPU problem, but a decompression CPU one.

Evidence for the above: I *can* play back in DVR in real time if I just copy my DNG sequence locally. So DVR is using GPU acceleration to debayer, just as MLRawViewer did, and since there is no more decompression work to do on the CPU, everything is fine.

So in a nutshell: I think the culprit is decompression, weighing on the CPU, and that there is no optimization that can be performed, in this respect, on the MLVFS code. We only need to devise a smarter way to *use* MLVFS. Will think / work on it when I find the time.
5D3 for video
70D for photo

Danne

Probably.
However, DVR is not even close when it comes to speed when previewing DNG sequences. At least it wasn't last time I messed with that program.

"...one frame at a time, instead of leveraging the parallelization potential of MLVFS."

How about looking into this  8)?

vastunghia

Can't say whether DVR is fast or not; I can only say that, after upgrading my iMac with a powerful eGPU, it can now play back demanding DNG sequences like 3.5K @ 14-bit in real time, no problem. Provided that the MLV -> DNG conversion was performed beforehand (i.e. not on the fly via MLVFS).

Yeah, planning to study how DVR accesses DNG sequences as a first step towards understanding whether we can somehow tweak the interface between DVR and MLVFS. Once again, caching could be a key factor.
5D3 for video
70D for photo

names_are_hard

When it comes to optimising, a bit of speculation is good: try to think of the overall shape of the problem.  But then you have to do the real step: profiling.  It should be fairly easy to profile this code, since it's open source.  Intuition is often very bad at guessing what performance is really doing.

Decompression is normally much faster than compression and parallelises well.  I would expect this not to be a bottleneck, and if it is, I'd expect it to be easy to fix.  Easy to measure with a profiler.

vastunghia

Quote from: names_are_hard on Yesterday at 05:03:23 PM
But then you have to do the real step: profiling.

100% agree. That's exactly what I meant when I wrote

Quote from: vastunghia on May 24, 2024, 09:41:51 PM
Could be helpful to make MLVFS log single frame read requests with their corresponding timestamp, to try and understand the time sequence with which DVR asks for frames. And check if frame requests are repeated during the third playback, when all seems to be cached already. May give it a try as soon as I have 10 minutes.

Not an expert here; not sure logging into a txt file with millisecond timestamps is the way to go for profiling. But that's what I'm currently doing: adding a log entry every time a frame read request is acknowledged by MLVFS, and again when it is successfully completed.

Quote from: names_are_hard on Yesterday at 05:03:23 PM
Decompression is normally much faster than compression and parallelises well.  I would expect this not to be a bottleneck, and if it is, I'd expect it to be easy to fix.  Easy to measure with a profiler.

Mmh not so sure: according to FastCinemaDNG developers,

Quote
Performance of Lossless JPEG decoder is mostly limited by Huffman decoding. Actually, we need to read bitstream (bit after bit), to recover Huffman codes and data bits. Huffman decompression is essentially serial algorithm, so one can't implement it on GPU, that's why everything is done on CPU.

So it looks like this could be a bottleneck after all, at least to some extent. And in particular as long as one keeps asking for single frames in sequence, one at a time (as DVR seems to do, naturally assuming that the bottleneck is hard-drive read speed, not the CPU), instead of asking for a bunch of them and then proceeding to the next bunch.

Anyway. Will dig into profiling. Which is simple in Python... but I've never done it in C.
5D3 for video
70D for photo

names_are_hard

You could log, and it can help; I do it in some cases when optimising.  It can be a quick check in the early stages, especially if you have a suspicion that some particular thing is slow.

Generally though, use a real profiler.  I tend to use callgrind, from the Valgrind suite of tools, first, because it works on anything and doesn't need recompiling.  Use kcachegrind to view the results.

Here, since it's open source and easy to build, and you're probably using gcc, gprof is another obvious tool to try.

Both of these will give you a lot of good info, without you needing to understand the code first.  You will need to learn the tools, but they're not hard and good tutorials exist.

Quote from: vastunghia on Yesterday at 06:53:45 PM
So it looks like this could be a bottleneck after all, at least to some extent. And in particular as long as one keeps asking for single frames in sequence, one at a time (as DVR seems to do, naturally assuming that the bottleneck is hard drive read speed, and not CPU), instead of asking for a bunch of them, and then proceed to the next bunch.

The Huffman stuff is interesting; I haven't dealt with this alg.  I bet it's still faster to decode than encode.  Also, it looks like you can speed it up fine using parallel techniques: there are a lot of easy-to-find papers on this, as well as repos with code they claim works.  Skim reading, it does sound like it presents some problems, so you won't get an 8x speedup from 8 cores, but you only need 2x or so, right?

And yeah, because here you can treat each frame / file as a separate decompression job, you can trivially parallelise the overall task.  8 cores?  Use, say, 6 threads, and decode whatever frame someone requests, plus the next 5.

And / or, assign some amount of cache (1/4 of available RAM or whatever), and decode frames you think will be useful until the cache is full, then serve from cache.  Don't wait for something to request the file first - fill the cache speculatively.  As you say, it should be quite predictable.  If the whole MLV file fits in cache (a common case?), MLVFS can just decode it all as fast as possible when you mount the file, then all work is done, and keep serving from the cache.

How to make MLVFS do that?  Well, I dunno :D

vastunghia

Thanks for the hints on profiling. Will check them out. I'm on macOS, so kcachegrind won't do. But anyway, will find my way.

Quote from: names_are_hard on Yesterday at 07:26:36 PM
And yeah, because here you can treat each frame / file as a separate decompression job, you can trivially parallelise the overall task.  8 cores?  Use say, 6 threads, and decode whatever frame someone requests, and the next 5.

That's *exactly* my point, thank you. And the part on the cache, as well.

This would be the most obvious -- and I also think the most effective -- way to optimize MLVFS. Like, ok, maybe the decompression algo can be fine-tuned a bit; but if you can parallelize it, who cares?

Will look into this.
5D3 for video
70D for photo