crop_rec_4k corrupted mlv's

Started by sylvain_c, June 18, 2017, 01:56:25 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

sylvain_c

I have some corrupted mlv with the new crop_rec_4k, about 1/10 of the files can not extract with the new mlv_dump. I tried putting those files in an hex editor to compare with others, but i don't know what to do next...

Thanks for any help
5D MKIII 1.1.3

Danne

And did you try mlv_dump --dng --relaxed input.MLV already?

g3gg0

Help us with datasheets - Help us with register dumps
magic lantern: 1Magic9991E1eWbGvrsx186GovYCXFbppY, server expenses: [email protected]
ONLY donate for things we have done, not for things you expect!

sylvain_c

@Danne:
For mlv_dump --dng --relaxed input.MLV, i got the exact same error for all the corrupted files:

[ERROR] invalid block size at position 0x000009fc

@g3gg0:
I tried the code in the same cmd prompt but maybe I don't understand something, "./mlv_dump is not recognized as internal command" (translated from french).
Also tried without the point at the beginning and other things. Sorry.

Edit: I succeed:
Last update:  37426e9 on 2017-06-01 21:58:33 UTC by g3gg0:
mlv_dump: fixed dependency order
Build date:   2017-06-02 22:06:19 UTC
5D MKIII 1.1.3

g3gg0

ok quite a severe error.
please one further step:

mlv_dump --visualize
Help us with datasheets - Help us with register dumps
magic lantern: 1Magic9991E1eWbGvrsx186GovYCXFbppY, server expenses: [email protected]
ONLY donate for things we have done, not for things you expect!

sylvain_c

With --visualize:

[ERROR] Invalid block size at position 0x000009fc

I have a total of 4 files on 26 with the same error, exactly the same setting 3072x1320 23.976 all the day.
5D MKIII 1.1.3

g3gg0

please post everything, not just the error message.
if that doesnt work use "--visualize --skip-xref"

and could you upload the smallest failing video somewhere?
Help us with datasheets - Help us with register dumps
magic lantern: 1Magic9991E1eWbGvrsx186GovYCXFbppY, server expenses: [email protected]
ONLY donate for things we have done, not for things you expect!

sylvain_c

With --visualize --skip-xref:

Mode of operation:
     - Input MLV file: 'M18-1317'
     -Verify file structure
File M18-1317.MLV opened
File M18-1317.M00 not existing
Processing...
[MLVI]
[RAWI][RAWC][IDNT][EXPO][LENS][RTCI][WBAL][VERS]+++++++++++[ERROR]  Invalid block size at position 0x000009fc
Processed 0 video frames at 23.98 FPS (0.00 s)
Done

And for my smallest file (295Mo):
https://mega.nz/#!Ho1mAJoA!0jBg5yctKbnn1_9MKifXKRlRC8Vc6-RAGp8zTdmgtIw
5D MKIII 1.1.3

g3gg0

Help us with datasheets - Help us with register dumps
magic lantern: 1Magic9991E1eWbGvrsx186GovYCXFbppY, server expenses: [email protected]
ONLY donate for things we have done, not for things you expect!

g3gg0

Help us with datasheets - Help us with register dumps
magic lantern: 1Magic9991E1eWbGvrsx186GovYCXFbppY, server expenses: [email protected]
ONLY donate for things we have done, not for things you expect!

sylvain_c

Thank you for your patience g3ggo, it works! You are my savior  8)
5D MKIII 1.1.3

a1ex

Fix prepared: https://bitbucket.org/hudson/magic-lantern/pull-requests/838/fix-for-clean_d_cache-and-fio_writefile/diff

This bug was present for many years (probably since the first versions of mlv_rec, and present in all the subsequent implementations - including mlv_lite and silent picture module). The side effect of this bug was some incorrect bytes in the MLV header, appearing with very low probability (you'd have to record many clips to get one such error). If these incorrect bytes were on metadata like exposure time or camera serial number or something like that, it's easy to see why the bug went unnoticed for so much time.

However, once the the erroneous byte appeared in the size field of a block (which is required to be correct in order to parse the MLV), invalid MLVs were written.

Side note: I had a pretty hard time creating a synthetic test that would reproduce this bug consistently - at first, I've even created 1000 files and checked all of them, and could not get a single erroneous byte.

After fiddling with it all day long, g3gg0 gave me a very good hint - the source buffer (that was going to be written to card) must be at different memory locations (different stack offsets) during the test, as the bug depends on where this buffer gets loaded into cache. Randomized the buffer location and suddenly got lots of errors, in other words, a consistent way to reproduce this bug. More details in the PR.

sylvain_c

If I understand correctly my bug was a very low probability one that help you discovering another really old one that now make ML even more stable?I'm glad to have been useful!
5D MKIII 1.1.3

a1ex

I'd consider the old one as the same bug; however, its exact side effects depend on many factors (in particular, how things are aligned in memory being one important factor). That probably explains why, with the build and the combination of settings you used, the probability of getting an invalid MLV (with the same type of error) was pretty high (about 10% from your report), while nobody else reported the exact same issue (so, either the issue occurred in places that did not affect the MLV structure, or, if it was, either it wasn't reported - likely if it happened once in 1000 clips or so, or, if it was reported, we had no idea what it could have been).

These "random" errors are usually very hard to track down, but in your case, the only hypothesis that made sense (how it could happen) was that CPU was aware of the size of the NULL block (and computed it correctly), but for some reason, what was written to card (which is performed using DMA from physical memory) was wrong. Therefore, there must have been some sort of mismatch between what CPU thinks about the physical memory, and what the memory actually contains (and that's likely because the has only written some changes - the contents of that NULL block - into its own cache, but these changes did not reach the physical memory by the time the NULL block was written to card).

I knew about this quirk, and most of the code doing file I/O uses memory buffers with caching disabled (fio_malloc), but since this wasn't enforced, these calls (those for writing the MLV headers, which were used incorrectly for many years) didn't ring a bell. Now, the PR either enforces correct usage (when reading files) or just ensures everything is written to main memory before writing files (so there will be no more restrictions about what memory addresses can be used for writing).

And yes, similar issues can be in other parts of the source code (although I didn't find any by looking at the code), or in code written by new developers, who may not be aware of this quirk (as there were no warnings, and so far, on the other platforms we were familiar with, this was a non-issue - afaik, the cache coherence issues, if any, are handled by the operating system). If there are such problems in other areas, after the fix, you'll get an error text all over the screen.

g3gg0

this is why using in production environment is still risky.
but we try to help where possible.

thanks for reporting.
Help us with datasheets - Help us with register dumps
magic lantern: 1Magic9991E1eWbGvrsx186GovYCXFbppY, server expenses: [email protected]
ONLY donate for things we have done, not for things you expect!

g3gg0

Quote from: a1ex on June 22, 2017, 06:08:21 PM
I knew about this quirk, and most of the code doing file I/O used memory buffers with caching disabled (fio_malloc), but since this wasn't enforced, these calls (those for writing the MLV headers, which were used incorrectly for many years) didn't ring a bell.

same here. knew that DMA isn't able to look into the caches.
but reverse engineering canons code revealed functions that also didn't care about cache flushing.
these were just some debug routines, not relevant for production use.

so i was like "yeah, FIO routines for sure flush caches, else they wouldn't do this. and works reliable so all fine".
wrong assumption then :)
Help us with datasheets - Help us with register dumps
magic lantern: 1Magic9991E1eWbGvrsx186GovYCXFbppY, server expenses: [email protected]
ONLY donate for things we have done, not for things you expect!

a1ex

Quote from: g3gg0 on June 22, 2017, 06:17:15 PM
works reliable so all fine


    for (int i = 0; i < 1000; i++)
    {
        char name[10];
        snprintf(name, sizeof(name), "%03d.txt", i);
        printf("%s\n", name);
        FILE * f = FIO_CreateFile(name);
        uint32_t buf = i + '0';
        FIO_WriteFile(f, &buf, sizeof(buf));
        FIO_CloseFile(f);
    }


This creates 1000 files, all of them with correct contents (tested a couple of times), although the usage of FIO_WriteFile - without the wrappers from the PR - is incorrect. So, even if we ran some simple file integrity test 1000 times, on a known-to-be-incorrect program, there is a chance that our tests won't reveal any issues, and then the same code may fail randomly on other cameras (with unknown probability).

g3gg0

yeah. assembly opcode level interfaces are poorly documented nowadays ;)
Help us with datasheets - Help us with register dumps
magic lantern: 1Magic9991E1eWbGvrsx186GovYCXFbppY, server expenses: [email protected]
ONLY donate for things we have done, not for things you expect!