EOSM Shutter-Bug reboot

Started by dfort, March 11, 2018, 05:04:43 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

dfort

True Confessions -- I'm only getting about 10% of what you're talking about but I do find it fascinating. Tried a few of your suggestions in QEMU and understand that the issue seems to be some sort of a delay between mpu events. I attempted to put in some delays in the boot up process but everything I tried either had no effect or the camera locked up.

Maybe there is still something that a mere mortal with an EOSM in hand can do to help a1ex resolve the shutter-bug?

Here's what I tried today. Startup logs using io_trace with these options:

src/trace_io.c
static ASM_VAR uint32_t protected_region = REGION(0xC0210000, 0x010000); // timers

Makefile.user
CONFIG_DEBUG_INTERCEPT_STARTUP = y
CONFIG_MMIO_TRACE=y


I've always had issues getting this working without crashing the camera but it does save startup logs so here's the tests I ran:

SanDisk 32GB card using Canon EF-M 11-22mm lens -- no shutter-bug.
Lexar 64GB card using Canon EF-M 11-22mm lens -- shutter-bug.
Lexar 64GB card using Canon EF-M 22mm prime lens -- no shutter-bug.

Got crash logs with different information on the zoom lens logs and on the Lexar card with the zoom lens (shutter-bug) it created two log files while all the others had only one log but that's just something worth mentioning, the juicy stuff is in these logs.

This is the second and third mpu_recv events when the shutter-bug is not present. (167D9 - 161A3 =1590μs ?)

161A3> **INT-36h*:000e14e4:00:00: *** mpu_recv(50 4e 02 0e 03 03 03 00 03 00 00 65 00 00 01 19 64 00 00 00 00 81 00 00 06 04 00 00 06 04 00 00 06 04 01 01 00 00 3c 00 68 3b 00 00 00 00 00 00 00 00 00 00 00 00 80 48 78 48 00 03 00 01 00 00 00 00 02 01 00 00 00 01 00 00 00 00 00 00 01), from 1dedc
161E9> **INT-36h*:ff0f382c:00:02: [SEQ] NotifyComplete (Startup, Flag = 0x2)
16213> **INT-36h*:ff0f3890:00:03: [SEQ] NotifyComplete (Cur = 1, 0x20000003, Flag = 0x2)
16276> **INT-50h*:ff137e38:00:01: [PM] DisablePowerSave (Counter = 2)
1629E> **INT-50h*:ff137ea8:00:01: [PM] EnablePowerSave (Counter = 1)
1657F> **INT-36h*:000e14e4:00:00: *** mpu_recv(24 23 02 0d 00 00 00 00 01 00 00 00 00 00 00 00 00 01 02 00 00 00 00 00 01 01 01 01 00 00 00 00 00 00 00 00), from 1dedc
165D9> **INT-50h*:ff137e38:00:01: [PM] DisablePowerSave (Counter = 2)
165FF> **INT-50h*:ff137ea8:00:01: [PM] EnablePowerSave (Counter = 1)
167D9>    PropMgr:ff132454:00:05: [RTC] ChangePropertyCBR 0x0, 0x0


These are the same mpu_recv events when the shutter-bug is present. (16D15 - 1656F = 1958μs ?)

1656F> **INT-36h*:000e14e4:00:00: *** mpu_recv(50 4e 02 0e 03 03 03 00 03 00 00 65 00 00 01 19 64 00 00 00 00 81 00 00 06 04 00 00 06 04 00 00 06 04 01 01 00 00 3c 00 68 3b 00 00 00 00 00 00 00 00 00 00 00 00 80 48 78 48 00 03 00 01 00 00 00 00 02 01 00 00 00 01 00 00 00 00 00 00 01), from 1dedc
165B6> **INT-36h*:ff0f382c:00:02: [SEQ] NotifyComplete (Startup, Flag = 0x2)
165E2> **INT-36h*:ff0f3890:00:03: [SEQ] NotifyComplete (Cur = 1, 0x20000003, Flag = 0x2)
16635> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
1665C> **INT-50h*:ff137e38:00:01: [PM] DisablePowerSave (Counter = 2)
16684> **INT-50h*:ff137ea8:00:01: [PM] EnablePowerSave (Counter = 1)
166A5> **INT-50h*:0001DB64:MMIO:  [0xC022006C] -> 0x00000047
166C4>    PropMgr:FF131D60:MMIO:  [0xC022C0C4] <- 0x0083DC00
166D5> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
166EC> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
1670C> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
1671A>    PropMgr:FF131CAC:MMIO:  [0xC022C0C4] <- 0x0093D800
1672A> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16743> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
1675F> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
1677C> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16796> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
167B3> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
167CE> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
167EA> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16806> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
1681A>    PropMgr:FF131D60:MMIO:  [0xC022C0C4] <- 0x0083DC00
16822> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
1683D> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
1685A> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
1687D> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16896> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16A35> **INT-36h*:000e14e4:00:00: *** mpu_recv(24 23 02 0d 00 00 00 00 01 00 00 00 00 00 00 00 00 01 02 00 00 00 00 00 01 01 01 01 00 00 00 00 00 00 00 00), from 1dedc
16A7E> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16A9E> **INT-50h*:ff137e38:00:01: [PM] DisablePowerSave (Counter = 2)
16AC3> **INT-50h*:ff137ea8:00:01: [PM] EnablePowerSave (Counter = 1)
16AE3> **INT-50h*:0001DB64:MMIO:  [0xC022006C] -> 0x00000047
16B11> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16B27> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16B43> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16B62> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16B7D> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16B9A> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16BBC> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16BD8> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16BF5> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16C09>    PropMgr:FF131CAC:MMIO:  [0xC022C0C4] <- 0x0093D800
16C10> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16C38> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16C4D> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16C51>    PropMgr:FF131D60:MMIO:  [0xC022C0C4] <- 0x0083DC00
16C68> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16C84> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16CA2> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16CAA>    PropMgr:FF131CAC:MMIO:  [0xC022C0C4] <- 0x0093D800
16CBE> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16CDA> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16CF9> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047
16D15> **INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047


The Lexar 64GB that is consistently showing the shutter-bug looks like this with the EF-M 22mm prime lens -- this is a lens that doesn't trigger the shutter-bug. . (167E8 - 1614D = 1691μs ?)

1614D> **INT-36h*:000e14e4:00:00: *** mpu_recv(50 4e 02 0e 03 03 03 00 03 00 00 65 00 00 01 19 64 00 00 00 00 81 00 00 06 04 00 00 06 04 00 00 06 04 01 01 00 00 3c 00 68 3b 00 00 00 00 00 00 00 00 00 00 00 00 80 48 78 48 00 03 00 01 00 00 00 00 02 01 00 00 00 01 00 00 00 00 00 00 01), from 1dedc
16193> **INT-36h*:ff0f382c:00:02: [SEQ] NotifyComplete (Startup, Flag = 0x2)
161BE> **INT-36h*:ff0f3890:00:03: [SEQ] NotifyComplete (Cur = 1, 0x20000003, Flag = 0x2)
1621D> **INT-50h*:ff137e38:00:01: [PM] DisablePowerSave (Counter = 2)
16245> **INT-50h*:ff137ea8:00:01: [PM] EnablePowerSave (Counter = 1)
16561> **INT-36h*:000e14e4:00:00: *** mpu_recv(24 23 02 0d 00 00 00 00 01 00 00 00 00 00 00 00 00 01 02 00 00 00 00 00 01 01 01 01 00 00 00 00 00 00 00 00), from 1dedc
165B6> **INT-50h*:ff137e38:00:01: [PM] DisablePowerSave (Counter = 2)
165DB> **INT-50h*:ff137ea8:00:01: [PM] EnablePowerSave (Counter = 1)
167E8>    PropMgr:ff132454:00:05: [RTC] ChangePropertyCBR 0x0, 0x0

g3gg0

@dfort:
when formatting, does the camera reboot?
the delay after that reboot when it works is then 1.6msec again, right?

edit:
probably a good idea to hook mpu_recv and check for the two messages, measuring the delay and show on screen "EXPECT [NO] BUG"
and run the procedure several times to get 100% confidence that this is correlated
Help us with datasheets - Help us with register dumps
magic lantern: 1Magic9991E1eWbGvrsx186GovYCXFbppY, server expenses: [email protected]
ONLY donate for things we have done, not for things you expect!

dfort

Quote from: g3gg0 on March 18, 2018, 01:44:37 AM
@dfort:
when formatting, does the camera reboot?

Yes, the camera reboots and there is quite a bit of LED activity--assume that it is saving a new firmware dump at this point?

Quote from: g3gg0 on March 18, 2018, 01:44:37 AM
the delay after that reboot when it works is then 1.6msec again, right?

I can't test for that with this trace_io build because it freezes up the camera. Note that reformatting the card doesn't always work on the 128GB SanDisk card and it doesn't work at all on the 64GB Lexar card. When I was doing the tests that showed the shutter-bug on the SanDisk 32GB card formatting the card worked on that narrow range that we found but I haven't done enough testing to conclude when it is 100% reproducible.

It looks like the time it takes to run just those two mpu_recv events takes about 1.7msec when the shutter-bug isn't present and nearly 2msec when it is.

In addition, The shutter-bug has a bunch of "**INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047" events while the logs that don't have the shutter-bug don't include these events.

Quote from: g3gg0 on March 18, 2018, 01:44:37 AM
probably a good idea to hook mpu_recv and check for the two messages, measuring the delay and show on screen "EXPECT [NO] BUG"
and run the procedure several times to get 100% confidence that this is correlated

Makes sense but not sure I can figure out how to do that. Tell me exactly how to set this up and I'll test it.

Curious if the card speed affecting the total boot time had anything to do with the shutter-bug but it doesn't look like it does. The 1GB Kodak card is super slow yet it doesn't show the shutter-bug.

Taking a couple of unique events in the logs and calculating the time between these events:

SanDisk 32GB zoom -> no shutter-bug
0A316 [SEQ] seqEventDispatch (Startup, 0)
FF2D1 gmtProperty ID=0x80030011(0x10502801)
1,003 msec

Lexar 64GB zoom -> shutter-bug
0A45D [SEQ] seqEventDispatch (Startup, 0)
ED71B gmtProperty ID=0x80030011(0x10502801)
930 msec

Lexar 64GB prime -> no shutter-bug
0A321 [SEQ] seqEventDispatch (Startup, 0)
D06D3 gmtProperty ID=0x80030011(0x10501801)
812 msec


Just wondering, is the "gmtProperty ID=0x80030011" the lens ID? Notice the difference between the 11-22mm zoom and the 22mm prime lens. The shutter-bug only shows up on EF-M zoom lenses -- they all have IS so it seems reasonable to search for some sort of a connection here?

dfort

Cleaning out some files and realized that I sort of just glossed over this:

Quote from: dfort on March 17, 2018, 08:38:05 PM
...on the Lexar card with the zoom lens (shutter-bug) it created two log files while all the others had only one log but that's just something worth mentioning...

Turns out that second log of the card with the shutter-bug and the EF-M zoom doesn't contain those "**INT-36h*:0001DBF4:MMIO:  [0xC022006C] -> 0x00000047" events.

So how does this work? The camera is started, code is copied to RAM then it reboots, right?

Lexar 64GB zoom -> shutter-bug -> dm-0001.log (2nd log)
0A303 [SEQ] seqEventDispatch (Startup, 0)
D0D30 gmtProperty ID=0x80030011(0x10502801)
813 msec


So this looks pretty much like the logs without the shutter-bug yet the bug is there.

Quote from: dfort on March 17, 2018, 08:38:05 PM
Maybe there is still something that a mere mortal with an EOSM in hand can do to help a1ex resolve the shutter-bug?

Maybe not?

a1ex

For some reason, the logs without shutter bug do not include any MMIO events. Were they created by the same binary? (I should probably include some version info in the logs). BTW, REGION(0xC0210000, 0x010000) does not match anything from C022nnnn, so it must have been a different binary.

Don't see any reason to log the MMIO activity for the timers; that range only covers DryOS heartbeat timer.

./run_canon_fw.sh 5D3,firmware="boot=0" -d io,int,v |& grep -a 0xC021


You can see these in my startup log, too (with the first debug_intercept call moved at "We're back!" - a modification not compatible with all models). Without that change, REGION(0xC0210000, 0x010000) will not log any events (because there's no Canon code that changes these registers after the early boot sequence).

My experiment was with the initialization before the "We're back!" stage (where the vanilla io_trace and most of other ML code cannot run). That register turned out not to be useful for our investigation (unless we can somehow reflash the bootloader or re-run it with caches patched during a software reboot).

0x80030011 is PROP_LENS (what it contains is in lens.c).

dfort

Same binary. Just io_trace with these minimal changes:

Magic Lantern Nightly.2018Mar18.EOSM202
Camera   : EOSM
Firmware : 202
Changeset: 2838aae45890+ (io_trace)
Built on : 2018-03-18 12:22:52 by rosiefort@RosieFoComputer

diff -r 2838aae45890 src/io_trace.c
--- a/src/io_trace.c Tue Feb 13 21:36:23 2018 +0100
+++ b/src/io_trace.c Sun Mar 18 05:22:52 2018 -0700
@@ -35,7 +35,7 @@
  * REGION(0xC0E20000, 0x010000): JPCORE (JP57, lossless)
  * REGION(0xC0000000, 0x1000000): everything except EEKO? (DIGIC <= 5)
  */
-static ASM_VAR uint32_t protected_region = REGION(0xC0220000, 0x010000);
+static ASM_VAR uint32_t protected_region = REGION(0xC0210000, 0x010000);

/* number of 32-bit integers recorded for one MMIO event (power of 2) */
#define RECORD_SIZE 8
diff -r 2838aae45890 platform/EOSM.202/internals.h
--- a/platform/EOSM.202/internals.h Tue Feb 13 21:36:23 2018 +0100
+++ b/platform/EOSM.202/internals.h Sun Mar 18 05:22:52 2018 -0700
@@ -138,7 +138,7 @@
//~ #define CONFIG_AUDIO_CONTROLS

/** Patch LiveViewApp dialog hander to hide Canon bottom bar */
-#define CONFIG_LVAPP_HACK_PATCH
+// #define CONFIG_LVAPP_HACK_PATCH

/** Workaround for menu timeout in LiveView */
#define CONFIG_MENU_TIMEOUT_FIX


CONFIG_DEBUG_INTERCEPT_STARTUP = y
CONFIG_MMIO_TRACE=y

dfort

Quote from: a1ex on March 18, 2018, 01:12:31 PM
BTW, REGION(0xC0210000, 0x010000) does not match anything from C022nnnn, so it must have been a different binary.

A screw up on my part is a possibility though I would swear under oath that I did everything possible to make sure that all tests were on the same binary.

Took another look at your startup log and noticed what looks like Russian characters in it:
0.000.000    юяякюяяк:ff0c7950:MMIO : [0xC0201100] <- 0x43210DCB

Ran it through Google Translate and it detected Belarusian:


My logs aren't showing yuyayakyuyayak.

Ok--enough of the fun and games.

Quote from: a1ex on March 18, 2018, 01:12:31 PM
My experiment was with the initialization before the "We're back!" stage (where the vanilla io_trace and most of other ML code cannot run). That register turned out not to be useful for our investigation (unless we can somehow reflash the bootloader or re-run it with caches patched during a software reboot).

That's the area that I was experimenting with by putting in some msleep calls but it would just freeze on startup.

One thing I noticed is that with the camera bootflag enabled when changing the lens with the camera turned off the LED would flash if there was a card with the bootflag enabled but not if the card didn't have the bootflag enabled.

Lens twist test with camera turned off:







Camera bootflag     Card bootflag     LED
ononflashes
onoffno activity
offonno activity
offoffno activity

I did a quick test on the 5D3 and it does the same thing.

So is the LED activity indicating that it is writing to the card or is it also active when reading the card? Whatever is going on with the lens is also happening even without a card in the camera because on the 22mm lens every time the lens is put onto the body it racks the focus even when the camera is turned off. That rack focus doesn't happen with any other lens that I tried including EF lenses with the EF-M adapter.

Of course ML requires both camera and card bootflags set so maybe whatever is causing the shutter-bug might be in this very early lens initiation stage before the camera is even turned on?

a1ex

Quote from: dfort on March 19, 2018, 03:21:31 PM
That's the area that I was experimenting with by putting in some msleep calls but it would just freeze on startup.

msleep only works after DryOS is up and running - long after that early boot stage. I've only tried it after calling Canon's init_task; might also work right before it, but it won't work earlier.

Quote
So is the LED activity indicating that it is writing to the card or is it also active when reading the card?

Generally speaking, LED activity is independent of card activity. There can be card access with LED turned on, card access with LED turned off, no card access with LED turned on and... no card access with led turned off.

For example, 5D2 bootloader reads from the card without ever bothering to turn on the LED.

To see the specific activity in some particular case, run in QEMU with -d io and look for SD/CF and LED activity.

Quote
Of course ML requires both camera and card bootflags set so maybe whatever is causing the shutter-bug might be in this very early lens initiation stage before the camera is even turned on?

Possible. Lens initialization (whether successful or not) might depend on the time it takes from physical power on until the MPU initialization (first mpu_send call, or maybe some subsequent mpu_send call). That's why I told you to place the msleeps on only one of the mpu_send_log calls - to verify this hypothesis.

dfort

Quote from: a1ex on March 16, 2018, 06:50:06 AM
The function you are looking for is in dm-spy-extra.c.

Ok--so I switched over to the dm-spy-experiments branch. Commented out CONFIG_LVAPP_HACK_PATCH in internals.h in order to get it to compile (adding the new stub might interfere) ran a few startup logs and--the "good" card is showing the shutter-bug!

Let me repeat that - The "good" card is showing the shutter-bug in the dm-spy-experiments branch with CONFIG_DEBUG_INTERCEPT_STARTUP enabled.

Eight out 10 logs showed the same sequence of mpu_recv events:

DM-0000
15F7D> **INT-36h*:000e1624:00:00: *** mpu_recv(50 4e 02 0e 03 03 03 00 03 00 00 68 00 00 01 19 64 00 00 00 00 81 00 00 06 04 00 00 06 04 00 00 06 04 01 01 00 00 00 00 68 3b 00 00 00 00 00 00 00 00 00 00 00 00 80 48 78 48 00 03 00 01 00 00 00 00 02 01 00 00 00 01 00 00 00 00 00 00 01), from 1dedc
16315> **INT-36h*:000e1624:00:00: *** mpu_recv(24 23 02 0d 00 00 00 00 01 00 00 00 00 00 00 00 00 01 02 00 00 00 00 00 01 01 01 01 00 00 00 00 00 00 00 00), from 1dedc
16685> **INT-36h*:000e1624:00:00: *** mpu_recv(28 27 02 0f 01 00 00 10 01 01 00 00 00 00 00 00 00 00 00 00 00 18 00 00 00 0c 00 00 00 00 00 00 00 02 00 00 03 01 00 00), from 1dedc
1727A> **INT-36h*:000e1624:00:00: *** mpu_recv(3c 3a 02 10 00 00 00 00 00 00 01 00 00 00 00 ff 00 00 ff 00 ff ff ff ff ff ff ff ff 00 00 00 00 00 01 ff 00 00 02 01 01 00 03 00 01 00 04 00 00 01 05 00 01 01 00 01 00 00 3f 00), from 1dedc
17738> **INT-36h*:000e1624:00:00: *** mpu_recv(3e 3c 02 11 01 ff 01 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 03 3b 00 ff 00 00 ff 00 ff ff ff ff ff ff ff ff 00 00), from 1dedc
17841> **INT-36h*:000e1624:00:00: *** mpu_recv(06 05 01 4f 01 00), from 1dedc
17973> **INT-36h*:000e1624:00:00: *** mpu_recv(0a 08 01 6e 80 48 00 00 00), from 1dedc


However, two of the logs showed a different sequence:

DM-0002
15840> **INT-36h*:000e1624:00:00: *** mpu_recv(06 05 01 00 03 00), from 1dedc
161D1> **INT-36h*:000e1624:00:00: *** mpu_recv(50 4e 02 0e 03 03 03 00 00 00 00 68 00 00 01 19 64 00 00 00 00 81 00 00 06 04 00 00 06 04 00 00 06 04 01 01 00 00 00 00 68 3b 00 00 00 00 00 00 00 00 00 00 00 00 80 48 78 48 00 03 00 01 00 00 00 00 02 01 00 00 00 01 00 00 00 00 00 00 01), from 1dedc
16573> **INT-36h*:000e1624:00:00: *** mpu_recv(24 23 02 0d 00 00 00 00 01 00 00 00 00 00 00 00 00 01 02 00 00 00 00 00 01 01 01 01 00 00 00 00 00 00 00 00), from 1dedc
168E1> **INT-36h*:000e1624:00:00: *** mpu_recv(28 27 02 0f 01 00 00 10 01 01 00 00 00 00 00 00 00 00 00 00 00 18 00 00 00 0c 00 00 00 00 00 00 00 02 00 00 03 01 00 00), from 1dedc
16D7A> **INT-36h*:000e1624:00:00: *** mpu_recv(3c 3a 02 10 00 00 00 00 00 00 01 00 00 00 00 ff 00 00 ff 00 ff ff ff ff ff ff ff ff 00 00 00 00 00 01 ff 00 00 02 01 01 00 03 00 01 00 04 00 00 01 05 00 01 01 00 01 00 00 3f 00), from 1dedc
17250> **INT-36h*:000e1624:00:00: *** mpu_recv(3e 3c 02 11 01 ff 01 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 3b 00 ff 00 00 ff 00 ff ff ff ff ff ff ff ff 00 00), from 1dedc
17441> **INT-36h*:000e1624:00:00: *** mpu_recv(0a 08 01 6e 80 48 00 00 00), from 1dedc


Note that the only way to startup the EOSM with this build is via the PLAY button. Starting up normally the camera never gets to LiveView though it does save a startup log.

Quote from: a1ex on March 14, 2018, 10:56:45 AM
You can also add a delay (msleep) for the first mpu_send call (but not for any other). In the logging function (mpu_send_log), which gets called right before running the mpu_send, try something like this (with the good card):

  static int call_num = 0;
  call_num++;
  if (call_num == 1) /* delay the first call only */
  {
    /* do the math after running memory benchmark, then round to multiples of 10 */
    /* let's say it's 53ms -> 50 should not cause the bug, 60 should cause it */
    msleep(50);
  }


That would move the delay right before mpu_send (where I believe it matters).

Then, try delaying some other mpu_send call (the second, the third and so on). At some point, the bug should disappear.

Alright, we need to run memory benchmarks but the bench module doesn't compile in the dm-spy-experiments branch so I switched over to the unified branch--not working there either so switched to the lua_fix branch...

Not very useful:


Shoot another boring video:


To come up with the test results:
Test function: Display on:        Display off:
memcpy cacheable 34.18 MB/s 39.68 MB/s OK
memcpy uncacheable 10.46 MB/s 11.90 MB/s OK
memcpy64 cacheable 40.04 MB/s 44.87 MB/s OK
memcpy64 uncacheable 15.96 MB/s 18.24 MB/s OK
dma_memcpy cacheable 34.20 MB/s 37.47 MB/s OK
dma_memcpy uncachab 33.55 MB/s 37.34 MB/s OK
edmac_memcpy 345.99 MB/s (test skipped) OK
edmac_copy_rectangle 175.15 MB/s (test skipped)
memset cacheable 95.11 MB/s 110.40 MB/s
memset uncacheable 40.97 MB/s 47.35 MB/s
memset64 cacheable 89.53 MB/s 104.35 MB/s
memset64 uncacheable 58.31 MB/s 65.16 MB/s
read32 cacheable 39.68 MB/s 45.50 MB/s
read32 uncacheable 12.55 MB/s 14.68 MB/s
read64 cacheable 53.03 MB/s 60.54 MB/s
read64 uncacheable 23.25 MB/s 27.27 MB/s
bpm_fill to idle buf 88.28 MB/s (test skipped)


It's not for a lack of trying but I'm still lost. Which of these values am I supposed to use to figure out how long to make the msleep call? In addition, what is going on with that code snippet? Aren't we looking for mpu_recv events? Where does it go?

[EDIT] I did try this but it isn't creating the shutter-bug on the good card (with startup log disabled) or removing the shutter-bug on the card that always shows the bug:
diff -r 912ba70b63dc src/dm-spy-extra.c
--- a/src/dm-spy-extra.c Sat Feb 03 21:55:07 2018 +0100
+++ b/src/dm-spy-extra.c Fri Mar 30 12:38:09 2018 -0700
@@ -55,6 +55,8 @@
static void StartEDmac_log(uint32_t* regs, uint32_t* stack, uint32_t pc);
static void TryPostEvent_log(uint32_t* regs, uint32_t* stack, uint32_t pc);

+static int call_num = 0;
+
struct logged_func
{
     uint32_t addr;                              /* Logged address (usually at the start of the function; will be passed to gdb_add_watchpoint) */
@@ -750,6 +752,14 @@
     mpu_decode(buf, msg, sizeof(msg));
     DryosDebugMsg(0, 0, "*** mpu_send(%02x %s), from %x", size_ex, msg, caller);
     last_mpu_timestamp = MEM(0xC0242014);
+   
+    call_num++;
+    if (call_num == 1) /* delay the first call only */
+    {
+        /* do the math after running memory benchmark, then round to multiples of 10 */
+        /* let's say it's 53ms -> 50 should not cause the bug, 60 should cause it */
+        msleep(60);
+    }
}

static void mpu_recv_log(uint32_t* regs, uint32_t* stack, uint32_t pc)


Quote from: a1ex on March 14, 2018, 10:56:45 AM
At some point, the bug should disappear.

Just not yet.

a1ex

Press PLAY before running the benchmark. Running the benchmarks in LiveView will give much lower numbers, not relevant for this experiment; the PLAY mode is a good approximation, as the DryOS tasks are mostly idle in this case. You may also disable the interrupts while running these benchmarks - that might give even higher speeds (but will prevent any kind of multitasking during the test).


/* modules/bench/mem_bench.c */
/* caveat: might interfere with EDMAC or DMA memcpy (comment them out if they lock up) */
int old = cli();
bench_fun(arg0, arg1, arg2, arg3);
sei(old);
/* might need some msleep here, to allow the Canon tasks to catch up - they could not do anything during the test */


Our call from boot-hack.c is memset64 with cacheable memory and display off.

You've got some good logs before - you could start from the same configuration. You can disable everything else, except for MPU messages, to minimize overhead. Disable MPU_DELAY_SEND too, as we don't want any slowdown for MPU (I slow them down to find the dependency relationships between them, e.g. which mpu_recv lines are likely replies to the previous mpu_send). The io_trace_full branch has minimal logging overhead, so you may use that one instead. MMIO messages won't work on that branch when logging the startup process, but we don't need them for this experiment.

dfort

Don't get how to work with that new code snippet but did get a good memory benchmark run in Play mode:



Quote from: a1ex on March 30, 2018, 09:41:10 PM
Our call from boot-hack.c is memset64 with cacheable memory and display off.

    /* do the math after running memory benchmark, then round to multiples of 10 */
    /* let's say it's 53ms -> 50 should not cause the bug, 60 should cause it */


So let's say it is 180, right? Am I on the right track with the [EDIT] I made on Reply #33?

Or am I getting ahead of myself and need to create some more startup logs showing only MPU messages with MPU_DELAY_SEND disabled.

Switching over to the io_trace_full branch and uhhhhhhhhhh....

dfort

Did lots of experimenting today. Disabled MPU_DELAY_SEND, tried to follow the instructions to add a delay (msleep) for the first mpu_send call, even used the MPU_DELAY_SEND trick of using the busy waiting code instead of msleep. At one point I had the shutter-bug on the card that reliably doesn't show the bug and vice versa but before posting how I did it I tried reproducing the results without success. Slippery little bastard.

I did make some new startup logs with the io_trace_full branch. Here's a shutter-bug log and a no shutter-bug log. Searching for anything that might offer a clue I grepped mpu_send and found something rather interesting.

no shutter-bug
0.088.758     Startup:000038b8:00:00: *** mpu_send(06 04 02 00 00)
0.116.269     PropMgr:0000393c:00:00: *** mpu_send(08 06 00 00 02 0e 00)
0.117.568     PropMgr:0000393c:00:00: *** mpu_send(08 06 00 00 01 97 00)
0.121.538     PropMgr:0000393c:00:00: *** mpu_send(0a 08 03 57 10 33 00 01 00)
0.146.158     PropMgr:0000393c:00:00: *** mpu_send(0a 08 03 06 00 00 00 00 00)
0.146.631     PropMgr:0000393c:00:00: *** mpu_send(06 04 03 10 00)
0.147.472     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 07 ff 00)
0.170.272     PropMgr:0000393c:00:00: *** mpu_send(06 05 01 2e 01 00)
0.172.456     PropMgr:0000393c:00:00: *** mpu_send(0a 08 03 0b 00 00 00 00 00)
0.233.858     PropMgr:0000393c:00:00: *** mpu_send(22 20 03 3b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
0.244.377     PropMgr:0000393c:00:00: *** mpu_send(06 05 04 0e 01 00)
0.253.125     PropMgr:0000393c:00:00: *** mpu_send(08 07 01 2a 17 d4 00 00)
0.253.721     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 07 05 00)
0.254.194     PropMgr:0000393c:00:00: *** mpu_send(0a 08 03 06 00 00 04 59 00)
0.255.082     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 11 01 00)
0.255.497     PropMgr:0000393c:00:00: *** mpu_send(06 05 02 0a 00 00)
0.262.666     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 0d 00 00)
0.262.971     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 0c 00 00)
0.290.910     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 19 01 00)
0.323.529     PropMgr:0000393c:00:00: *** mpu_send(06 05 09 11 01 00)
0.324.523     PropMgr:0000393c:00:00: *** mpu_send(12 11 09 15 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
0.325.260     PropMgr:0000393c:00:00: *** mpu_send(14 13 09 25 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
0.325.979     PropMgr:0000393c:00:00: *** mpu_send(06 05 01 5a 00 00)
0.326.420     PropMgr:0000393c:00:00: *** mpu_send(06 05 09 2f 01 00)
0.349.545     PropMgr:0000393c:00:00: *** mpu_send(14 13 09 25 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
0.350.631     PropMgr:0000393c:00:00: *** mpu_send(06 05 01 5a 01 00)
0.375.568     PropMgr:0000393c:00:00: *** mpu_send(0a 08 09 1a 00 00 00 00 00)
0.376.172     PropMgr:0000393c:00:00: *** mpu_send(0a 08 09 24 00 00 00 00 00)
0.388.969     PropMgr:0000393c:00:00: *** mpu_send(08 06 03 18 00 00 00)
0.389.683     PropMgr:0000393c:00:00: *** mpu_send(08 06 03 1f 00 00 00)
0.390.479     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 13 00 00)
0.390.880     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 1e 00 00)
0.399.998     PropMgr:0000393c:00:00: *** mpu_send(06 05 01 8a 00 00)
0.400.643     PropMgr:0000393c:00:00: *** mpu_send(08 06 03 1f 00 00 00)
0.401.637     PropMgr:0000393c:00:00: *** mpu_send(0a 08 01 3b ff ff 00 00 00)
0.405.589     PropMgr:0000393c:00:00: *** mpu_send(0a 08 01 3b ff 00 00 00 00)


shutter-bug
0.088.785     Startup:000038b8:00:00: *** mpu_send(06 04 02 00 00)
0.116.777     PropMgr:0000393c:00:00: *** mpu_send(08 06 00 00 02 0e 00)
0.118.057     PropMgr:0000393c:00:00: *** mpu_send(08 06 00 00 01 97 00)
0.122.037     PropMgr:0000393c:00:00: *** mpu_send(0a 08 03 57 10 33 00 01 00)
0.146.153     PropMgr:0000393c:00:00: *** mpu_send(0a 08 03 06 00 00 00 00 00)
0.146.618     PropMgr:0000393c:00:00: *** mpu_send(06 04 03 10 00)
0.147.482     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 07 ff 00)
0.170.538     PropMgr:0000393c:00:00: *** mpu_send(06 05 01 2e 01 00)
0.172.702     PropMgr:0000393c:00:00: *** mpu_send(0a 08 03 0b 00 00 00 00 00)
0.234.753     PropMgr:0000393c:00:00: *** mpu_send(22 20 03 3b 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
0.245.293     PropMgr:0000393c:00:00: *** mpu_send(06 05 04 0e 01 00)
0.254.129     PropMgr:0000393c:00:00: *** mpu_send(08 07 01 2a 17 d3 00 00)
0.254.905     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 07 05 00)
0.255.395     PropMgr:0000393c:00:00: *** mpu_send(0a 08 03 06 00 00 08 c7 00)
0.256.404     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 11 01 00)
0.256.811     PropMgr:0000393c:00:00: *** mpu_send(06 05 02 0a 00 00)
0.264.150     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 0d 00 00)
0.264.453     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 0c 00 00)
0.291.576     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 19 01 00)
0.324.192     PropMgr:0000393c:00:00: *** mpu_send(06 05 09 11 01 00)
0.325.238     PropMgr:0000393c:00:00: *** mpu_send(12 11 09 15 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
0.325.968     PropMgr:0000393c:00:00: *** mpu_send(14 13 09 25 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
0.326.703     PropMgr:0000393c:00:00: *** mpu_send(06 05 01 5a 00 00)
0.327.142     PropMgr:0000393c:00:00: *** mpu_send(06 05 09 2f 01 00)
0.347.417     PropMgr:0000393c:00:00: *** mpu_send(14 13 09 25 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
0.348.033     PropMgr:0000393c:00:00: *** mpu_send(06 05 01 5a 01 00)
0.372.658     PropMgr:0000393c:00:00: *** mpu_send(0a 08 09 1a 00 00 00 00 00)
0.373.331     PropMgr:0000393c:00:00: *** mpu_send(0a 08 09 24 00 00 00 00 00)
0.385.430     PropMgr:0000393c:00:00: *** mpu_send(08 06 03 18 00 00 00)
0.386.130     PropMgr:0000393c:00:00: *** mpu_send(08 06 03 1f 00 00 00)
0.386.919     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 13 00 00)
0.387.330     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 1e 00 00)
0.396.373     PropMgr:0000393c:00:00: *** mpu_send(06 05 01 8a 00 00)
0.397.026     PropMgr:0000393c:00:00: *** mpu_send(08 06 03 1f 00 00 00)
0.398.011     PropMgr:0000393c:00:00: *** mpu_send(0a 08 01 3b ff ff 00 00 00)
0.401.832     PropMgr:0000393c:00:00: *** mpu_send(0a 08 01 3b ff 00 00 00 00)


If I understand the way this log is laid out both these logs show the first mpu_send event (Startup) starting at approximately the same time but the shutter-bug log is running ahead of the no shutter-bug log. Something probably even more significant is that there is a difference between a few lines. I ran multiple startup logs and while the no shutter-bug logs were always the same, the logs with the shutter-bug showed some differences:

no shutter-bug - all iterations were the same
0.253.125     PropMgr:0000393c:00:00: *** mpu_send(08 07 01 2a 17 d4 00 00)
0.253.721     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 07 05 00)
0.254.194     PropMgr:0000393c:00:00: *** mpu_send(0a 08 03 06 00 00 04 59 00)


shutter-bug 1
0.254.129     PropMgr:0000393c:00:00: *** mpu_send(08 07 01 2a 17 d3 00 00)
0.254.905     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 07 05 00)
0.255.395     PropMgr:0000393c:00:00: *** mpu_send(0a 08 03 06 00 00 08 c7 00)


shutter-bug 2
0.253.759     PropMgr:0000393c:00:00: *** mpu_send(08 07 01 2a 17 d4 00 00)
0.254.294     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 07 05 00)
0.254.786     PropMgr:0000393c:00:00: *** mpu_send(0a 08 03 06 00 00 08 c8 00)


shutter-bug 3
0.313.787     PropMgr:0000393c:00:00: *** mpu_send(08 07 01 2a 17 d4 00 00)
0.314.456     PropMgr:0000393c:00:00: *** mpu_send(06 05 03 07 05 00)
0.315.267     PropMgr:0000393c:00:00: *** mpu_send(0a 08 03 06 00 00 08 c7 00)


I don't know, it just seems strange to me. Shouldn't the mpu_send events always be the same on startup like the no shutter-bug logs?

a1ex

Quote from: dfort on March 31, 2018, 03:31:36 AM
grepped mpu_send

That's a warning flag - if you disable everything else (including Canon messages) to have minimal overhead, you should get *only* mpu_send and mpu_recv lines in the log. Preparing one such build here (#2 will be the minimalist one):

https://builds.magiclantern.fm/jenkins/view/Experiments/job/startup-log-mpu/

In particular, logging all the interrupts has a high overhead.

The startup process is non-deterministic anyway, so you won't get the same sequence twice (if you look at all the messages, including non-MPU ones, some will be different).

To find out what's up with the above differences, try annotating the logs (it will "explain" what each MPU message is).

dfort

Quote from: a1ex on March 31, 2018, 08:55:57 AM
To find out what's up with the above differences, try annotating the logs (it will "explain" what each MPU message is).

So much for my great discovery--the differences between the logs are because of:


0.253.125     PropMgr:0000393c:00:00: *** mpu_send(08 07 01 2a 17 d4 00 00)                         ; PROP_CARD2_FILE_NUMBER
...
0.254.194     PropMgr:0000393c:00:00: *** mpu_send(0a 08 03 06 00 00 04 59 00)                      ; PROP_AVAIL_SHOT


Of course the logs are different, the cards are different sizes, 32GB for the no shutter-bug card and 64GB for the shutter-bug card thus the difference in PROP_AVAIL_SHOT. I probably checked to see if the shutter-bug was active a few times on the shutter-bug card so that might be why the file number changed.

Quote from: a1ex on March 31, 2018, 08:55:57 AM
Preparing one such build here (#2 will be the minimalist one):

That link points to the latest build which is currently build #3. I was able to find build #2 but in all cases there is no EOSM build. Easy on me you're making it not.

[EDIT] Without the new LiveViewApp_handler_BL_JudgeBottomInfoDispTimerState stub or disabling CONFIG_LVAPP_HACK_PATCH in internals.h the EOSM won't compile in these io_trace branches.

dfort

Ok--applied the patch and created new annotated logs. Hope this brings some light to the shutter-bug:

https://www.dropbox.com/sh/eq0k1qcyeijt1c2/AADbKjfbrIio1fflq6L1gNi8a?dl=0

Interesting side note, on one of my cards that always shows the shutter-bug this build didn't show the shutter-bug but that was only on the first boot. All subsequent startups, including after formatting the card, the shutter-bug would show up.

I know that I'm supposed to be experimenting with delaying various mpu calls but so far nothing I have tried has made any difference. Any hints? Of course the Coup De Grâce for the shutter-bug is what I really want.

a1ex

Start with a card that does not show the shutter bug with logging enabled.

Then delay the first mpu_send call by some large amount, e.g. 100ms. Does it show the shutter bug? (I assume yes)

Then try 10ms. Does it show the shutter bug? (I assume no)

If my assumptions were correct, keep reading, otherwise stop.

Binary search to find the critical delay value (where slightly higher values show the shutter bug and slightly lower values do not). Timer resolution for msleep is 10 ms. Cross-check the value with memset64 critical buffer size, which you already found. Use benchmark result to convert between the two, so you can compare.

Once that works, delay some other mpu_send call (not the first). At some point, I'd expect the shutter bug to disappear, even if you increase the delay.

Write down where that happens - it's going to be a very important clue.

dfort

Quote from: a1ex on March 31, 2018, 05:31:22 PM
Start with a card that does not show the shutter bug with logging enabled.

Check!

Quote from: a1ex on March 31, 2018, 05:31:22 PM
Then delay the first mpu_send call by some large amount, e.g. 100ms. Does it show the shutter bug? (I assume yes)

I did this before on the the dm-spy-experiments branch and it didn't work but it is doing exactly as you assumed it would with this trimmed down io_trace_full branch.

Magic Lantern Nightly.2018Mar31.EOSM202
Camera   : EOSM
Firmware : 202
Changeset: 90647297667d+ (io_trace_full_shutter-bug)
Built on : 2018-03-31 20:59:25 by rosiefort@RosieFoComputer

diff -r 90647297667d src/dm-spy-extra.c
--- a/src/dm-spy-extra.c Sat Mar 31 07:37:17 2018 -0700
+++ b/src/dm-spy-extra.c Sat Mar 31 13:59:25 2018 -0700
@@ -57,6 +57,8 @@
static void StartEDmac_log(uint32_t* regs, uint32_t* stack, uint32_t pc);
static void TryPostEvent_log(uint32_t* regs, uint32_t* stack, uint32_t pc);

+static int call_num = 0;
+
#define DM_OVERRIDE(field, new_value) \
     do { \
         struct debug_msg * dm = debug_get_last_block(); \
@@ -706,6 +708,14 @@

static void mpu_send_log(uint32_t* regs, uint32_t* stack, uint32_t pc)
{
+  call_num++;
+  if (call_num == 1) /* delay the first call only */
+  {
+    /* do the math after running memory benchmark, then round to multiples of 10 */
+    /* let's say it's 53ms -> 50 should not cause the bug, 60 should cause it */
+    msleep(100);
+  }
+
#ifdef MPU_DELAY_SEND
   if (MPU_DELAY_SEND)
   {


CONFIG_DEBUG_INTERCEPT_STARTUP = y


Quote from: a1ex on March 31, 2018, 05:31:22 PM
Then try 10ms. Does it show the shutter bug? (I assume no)

If my assumptions were correct, keep reading, otherwise stop.

Right again so let's continue.

Quote from: a1ex on March 31, 2018, 05:31:22 PM
Binary search to find the critical delay value (where slightly higher values show the shutter bug and slightly lower values do not). Timer resolution for msleep is 10 ms.

Now for the crazy part:


10  - no shutter-bug
100 - shutter-bug
50  - no shutter-bug
70  - no shutter-bug
80  - shutter-bug (format card) no shutter-bug
90  - no shutter-bug
100 - no shutter-bug (what the?)
110 - no shutter-bug
120 - no shutter-bug
130 - no shutter-bug
140 - no shutter-bug
150 - no shutter-bug
160 - shutter-bug (format card) no shutter-bug
170 - no shutter-bug
180 - no shutter-bug
190 - no shutter-bug
200 - no shutter-bug
210 - no shutter-bug
220 - no shutter-bug
230 - no shutter-bug
240 - no shutter-bug
250 - no shutter-bug
300 - shutter-bug (format card) shutter-bug
270 - shutter-bug (format card) shutter-bug
260 - shutter-bug (format card) shutter-bug
250 - shutter-bug (format card) no shutter-bug
260 - no shutter-bug (restart) shutter-bug
270 - shutter-bug (format card) no shutter-bug
280 - shutter-bug (format card) shutter-bug


Ok--so it looks like 280 is where the shutter-bug can be reliably reproduced on a SanDisk Extreme Pro 32GB card. I don't want to say that there is a lower end that is "safe" because I have experienced the shutter-bug on these cards, albeit on very rare instances. Why the card formatting? I've got a SanDisk Extreme Pro 128GB card that can be cleared of the shutter-bug by formatting the card but it will return on restart. The Lexar 64GB card that reliably shows the shutter-bug has in fact been clear of the shutter-bug on one of the tests that I ran in Reply #39. My hypothesis is that there are several cards that fall into a rather large zone that could be affected by the shutter-bug depending on the card's memory fragmentation.

Quote from: a1ex on March 31, 2018, 05:31:22 PM
Cross-check the value with memset64 critical buffer size, which you already found. Use benchmark result to convert between the two, so you can compare.

On Reply #35 I came up with 180 and on Reply #20 I came up with 0x0B50BBC -- is this the memset64 critical buffer size? How does this translate into some value that we can use?

Quote from: a1ex on March 31, 2018, 05:31:22 PM
Once that works, delay some other mpu_send call (not the first). At some point, I'd expect the shutter bug to disappear, even if you increase the delay.

I suppose you want me to keep using the same card now that we have it tuned into something that reliably shows the shutter-bug but how much to delay some other mpu_send call? It seems a little heavy handed to use 280ms, doesn't it?

Quote from: a1ex on March 31, 2018, 05:31:22 PM
Write down where that happens - it's going to be a very important clue.

Has an important clue emerged out of all this yet?

dfort

Perhaps some logs will help determine where to put the next mpu_send delay. This time the card was formatted before installing ML so these are first time startup logs with and without the shutter-bug on the same card. The only difference being the msleep(280) call in dm-spy-extra.c in order to induce the shutter-bug. After startup I checked to make sure that the shutter-bug was or wasn't present.

Annotated logs:

https://www.dropbox.com/sh/nskqhfj1am797et/AAA11okP24-gPpJwEHVGSVLAa?dl=0

A couple of quick observations: "PROP 80050044" is missing in the log with the shutter-bug and the positions of "PROP 8004004B" and "Complete WaitID = Mode group" are swapped between the two logs.

[EDIT] Holy smokes -- first thing I tried worked. Must be the April 1st effect!

static void mpu_send_log(uint32_t* regs, uint32_t* stack, uint32_t pc)
{
  call_num++;
  if (call_num == 1) /* delay the first call only */
  {
    /* do the math after running memory benchmark, then round to multiples of 10 */
    /* let's say it's 53ms -> 50 should not cause the bug, 60 should cause it */
    /* tests on camera show that 270ms is the least amount that will reliably reproduce the shutter-bug */
    msleep(280);
  }
  if (call_num == 53) /* PROP 80050044 */
  {
    msleep(280);
  }

#ifdef MPU_DELAY_SEND


Uploading the log and running some more tests to make sure this is repeatable but it looks promising.

[EDIT2] Took out the first delay and put it on the Lexar 64GB card that reliably shows the shutter-bug and -- no shutter-bug! Just to make sure this wasn't a fluke I formatted the card and re-installed ML just like on the previous tests. The shutter-bug came back, bummer. Ok, more testing to do.

a1ex

Yeah, 280 is a bit high. From the benchmark, memset64 uncacheable runs at 180MB/s. Since it has to fill 0x0B50BBC, that would take 0x0B50BBC / (1024*1024) / 180 = 0.062 seconds. In other words, I've expected 70 ms to show the shutter but and 60 ms not to show it. At least the order of magnitude is not too different.

Does the shutter bug disappear if you delay some other mpu_send call? (second, third...)

dfort

Looks like we were both posting at the same time. Check my Reply #42.

Of course I should test to make sure the results I got with the SanDisk card (no shutter-bug card) are repeatable and to see if I can find something on the Lexar card (shutter-bug card) that can be reliably reproduced.


dfort

Nothing seems to be reliably reproducible with the exception of being able to get the shutter-bug to reliably show up on a card that usually isn't affected by the shutter-bug. Ugh!

Maybe we're trying to hit a moving target? What I mean is that the mpu_send call that needs to be delayed might be showing up in different locations? Seems like I was on to something when I found out that "PROP 80050044" is missing in the log with the shutter-bug. Went back to previous logs and it seems to be consistant. Well, almost--the log with the secondary mpu_send delay is also missing that property yet it didn't show the shutter-bug.

[EDIT] Just an educated guess at  "PROP 80050044" -- PROP_LV_CONTINUOUS_STOP ?? Searched the disassembly for a match.

a1ex

PROP_LV_PROHIBIT_OUTPUT?


AB96A> **INT-36h*:000e14d4:00:00: *** mpu_recv(06 04 09 2d 01), from 1dedc
AB9F6>    PropMgr:ff31fc9c:01:03: Receive Size = 0 ( ID = 0x9, 0x2D )
ABA77>    PropMgr:ff31ce4c:01:03: Deliv WaitID = 0x80050044, 0xFF183400(1)
ABAE9>        Gmt:ff17c93c:9a:01: gmtProperty ID=0x80050044(0x1)
ABB26>        Gmt:ff1800e0:9a:16:  PROP_LV_PROHIBIT_OUTPUT 1 1

g3gg0

Quote from: dfort on April 01, 2018, 05:13:37 PM
Maybe we're trying to hit a moving target?

sporadic race conditions in a system you didn't design are the worst thing you could ever face :)
Help us with datasheets - Help us with register dumps
magic lantern: 1Magic9991E1eWbGvrsx186GovYCXFbppY, server expenses: [email protected]
ONLY donate for things we have done, not for things you expect!

dfort

Interesting. PROP_LV_PROHIBIT_OUTPUT sounds like something that would block the camera from doing something--like saving an image.

However, going over previous logs it looks like some of the shutter-bug logs have the same values for that property as the no shutter-bug logs so it is probably just another dead end.

g3gg0

yet the timing might play some role.
did it change during startup?
Help us with datasheets - Help us with register dumps
magic lantern: 1Magic9991E1eWbGvrsx186GovYCXFbppY, server expenses: [email protected]
ONLY donate for things we have done, not for things you expect!