EOSM Shutter-Bug reboot

Started by dfort, March 11, 2018, 05:04:43 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

dfort

EOS-M Shutter-Bug

One of the peskiest and long lived Magic Lantern bug is the EOSM shutter-bug. The first report of this bug that I could find was made in 2012. It has been debated to death in the EOS M Alpha shutter-bug discussion and Issue #1893, EOS-M shutter bug on Bitbucket.

Quick workaround - use the one SD card that has been verified to work consistently, the SanDisk 32GB Extreme PRO SDHC UHS-I. A smaller SanDisk Extreme PRO card is ok but don't use anything larger because pretty much all SDXC cards show the shutter-bug.

Better solution, let's try to find a way to squash this bug.

First of all, how to reproduce this bug.

Lens - Only EF-M zoom lenses, both Canon and third party like the Tamron 18-200mm exhibit the shutter-bug. It is probably not the zoom but more likely when the boot process is reading the image stabilization system that is triggering the bug. Maybe, we don't really know. Lenses that have the collapsing feature also exhibit the shutter-bug if the camera is started with the lens extended.

SD Card - Many cards exhibit the bug but there are a few surprises. For example I've got an old 1GB Kodak card that is super slow but works fine. Anything over 32GB seems to show the bug.

Photo Mode - The shutter bug prevents you from taking a still photo. It doesn't affect the silent still module, only regular JPEG and CR2 shots that need a full shutter press. It also has nothing to do with movie recording.

Starting the camera with the Play button makes no difference. Even holding down the INFO button which on the EOSM prevents ML from loading will show the shutter-bug. However, interrupting the startup process by turning the camera on/off/on curing startup will kill the bug if you do it just right. This puts the camera in some sort of a special "StartupCondition" - This is what we know about StartupCondition so far:

StartupCondition : 1, 0 = camera powered on normally, with the ON/OFF button
StartupCondition : 16, 0 = camera powered on via the Play button
StartupCondition : 4, 0 = on/off/on trick that somehow kills the shutter-bug

Another way to kill the shutter-bug is by either twisting the lens on and off in order to break the lens/body connection or if the lens has the collapsing feature simply collapsing and extending the lens also works.

Experienced EOSM ML users have learned to live with the shutter-bug but it continues to surprise new users. Developers have tried multiple times to figure this out but now we have some more powerful tools that we can use.

I'm not sure if this time we'll be able to do it. This will be more of a test of what we've learned over the past several years and what the new tools reveal. Here's a suggestion where to start:

Quote from: a1ex on March 10, 2018, 01:09:57 AM
If the issue can be reproduced outside LiveView - for example, with a call("Release") somewhere at startup - that's a good candidate for trying in QEMU. I have some local changes that emulate the photo capture process to some extent - at least it's making the screen black and starts capturing the image. That should be enough for diagnosing the issue, as my understanding is that, when this bug happens, the camera doesn't even start capturing the image (maybe it also locks up?)

We'll need two logs with the above scenario, with the camera starting outside LiveView and trying to capture an image. MPU messages should be enabled (they are by default) - that's the "driving force" behind photo capture (and that's what my local changes are all about - MPU messages from photo capture logs, assigned to different hotkeys). One log should be successful and the other should have the bug.

If there is some workaround (maybe removing the lens and putting it back?), that's also worth logging.

First test - using the io_trace branch I ran a startup log holding down both the Play button so it would start outside of LiveView and the INFO button so it wouldn't load ML. I ran this on both a card that is known to always show the shutter-bug and a SanDisk 32GB Extreme PRO SDHC UHS-I card. Then I looked up some addresses for lens properties and found something interesting:

src/property.h
#define PROP_LENS               0x80030011 // info about lens? flags?

card with shutter-bug log
PROP:0x80030011,10502801

card without shutter-bug log
PROP:0x80030011,10522a01

Don't know if this is significant. I was careful not to change anything between the tests and the logs are very similar. So if this register holds some lens information or flags, it looks like there is something going on here. Perhaps figuring out how to re-read the lens information/flags at the end of the boot process might be a way to eliminate the shutter-bug?

dfort

Tried anther test with the io_trace build only this time logging events instead of just a startup. What happens if the 11-22mm lens along with the card that exhibits the shutter-bug is started with the lens collapsed and we log the action of extending the lens. This kills the shutter-bug but what is going on with that interesting lens property? Well, it shows up several times on the log:

EOSM extend 11-22mm lens log
copyDataToStorage eventID(0x80030011)Data(11740044)size(31)

So it is copying lens information to storage? This showed up on once in each of the logs I have posted.

dfort

Ok--one more test before calling it a night.

Log shooting a CR2 on both a card without the shutter-bug and a card with the shutter-bug. Of course the card with the shutter-bug blocked the shot but it also logged some ptpPropertyChangeEvent events that didn't show up on the other log.

ptpPropertyChangeEvent[80000028][4][87][0]
ptpPropertyChangeEvent[80030013][4][0][0]


So what are these properties?

src/property.h
#define PROP_PICTURE_STYLE 0x80000028   // 0x81 = std, 82 = portrait, 83 = landscape, 84 = neutral, 85 = faithful, 86 = monochrome, 21 = user 1, 22 = user 2, 23 = user 3
#define PROP_BATTERY_CHECK 0x80030013


The camera was set to Auto Picture Style so it makes sense that it changes, but to 87? Looks like that one isn't a known picture style.

[EDIT] Just out of curiosity I ran the test again with the Picture Style set to Standard and it did come up with 81 so that's good--though why would it be a property change when taking a picture?

a1ex

Last night I've cleaned up a script that annotates this kind of logs, printing the property ID and GUI event names for most of the MPU messages.

Noticed a bug: the log files must have Unix-style newlines, otherwise the script won't work well. The logs from here seem to be with Windows-style newlines.

Tried to diff the CR2 logs, but LiveView is way too noisy. Please capture the same info with call("Release") somewhere at startup, without entering LiveView. For example, with CONFIG_DEBUG_INTERCEPT_STARTUP, locate the 20-second blinks that indicate the experiment is running:


#ifdef CONFIG_DEBUG_INTERCEPT_STARTUP
    info_led_blink(20,500,500);
    debug_intercept();
#endif


and change to:

#ifdef CONFIG_DEBUG_INTERCEPT_STARTUP
    info_led_blink(10,500,500);
    beep();
    DryosDebugMsg(0, 0, "*** Attempting to take a picture...");
    call("Release");
    info_led_blink(10,500,500);
    debug_intercept();
#endif


What I could tell from the two CR2 logs was that both logs have the SW1 (half-shutter) event, but only one has SW2 (full shutter press or capture start, an event initiated by the MPU).

cr2-nobug-vs-bug.html (too noisy)

Diff'ed the two startup logs:
startup-nobug-vs-bug.html

- 02 12 is a group of properties related to lens (sent by MPU to main CPU)
- 09 2e is some property related to LiveView (sent by MPU to main CPU); seems to arrive with some delay (edit: 80050045 PROP_LV_FOCAL_DISTANCE)
- PROP_AVAIL_SHOT: likely given by free space on the card (it's sent by main CPU to MPU)
- PROP_POWER_KIND appears in a different order (multitasking during EOS startup - PowerShot firmware doesn't do that)
- 03 25: no idea, also out of order
- 02 15 is PROP_LENS
- 06 05 06 12 00 00 is some unidentified GUI event (GUICMD_CLOSE_SLOT_COVER)

Note: both startup logs were created with: StartupCondition : 4, 0:

QuoteStartupCondition : 4, 0 = on/off/on trick that somehow kills the shutter-bug

Shouldn't they be created with this one instead?

QuoteStartupCondition : 16, 0 = camera powered on via the Play button




The above logs can be used in QEMU.

1) convert them with extract_init_spells.py


python extract_init_spells.py EOSM-start-bug-a.log > EOSM-bug.h
python extract_init_spells.py EOSM-start-nobug-a.log > EOSM-nobug.h


That way, I can switch between the two by renaming the header used in mpu.c .

This will allow comparing the different properties and figuring out what they might be.

2) run ML and try to print the contents of PROP_LENS, for example:

PROP_HANDLER(PROP_LENS)
{
    uint8_t* info = (uint8_t *) buf;

    qprintf("PROP_LENS: ");
    for (int i = 0; i < len; i++)
    {
        qprintf("%02x ", info[i]);
    }
    qprintf("\n");
    ...



(./run_canon_fw.sh EOSM,firmware="boot=1" -s -S & arm-none-eabi-gdb -x EOSM/patches.gdb) |& grep -a PROP_LENS

# good
[MPU] Sending : 20 1f 03 24 45 46 2d 4d 31 31 2d 32 32 6d 6d 20 66 2f 34 2d 35 2e 36 20 49 53 20 53 54 4d 00 00  (PROP_LENS_NAME)
[MPU] Sending : 24 23 03 15 01 2a 52 10 33 00 0b 00 16 90 53 9a 7c 08 ff 00 00 00 00 00 00 00 00 49 04 01 00 00 37 2a 01 00  (PROP_LENS)
[MPU] Sending : 24 23 03 15 01 2a 52 10 33 00 0b 00 16 90 53 9a 7c 08 ff 00 00 00 00 00 00 00 00 49 04 01 00 00 37 2a 01 00  (PROP_LENS)
PROP_LENS: 01 2a 52 10 33 00 0b 00 16 90 53 9a 7c 08 ff 00 00 00 00 00 00 00 00 49 04 01 00 00 37 2a 01

# bad
[MPU] Sending : 20 1f 03 24 45 46 2d 4d 31 31 2d 32 32 6d 6d 20 66 2f 34 2d 35 2e 36 20 49 53 20 53 54 4d 00 00  (PROP_LENS_NAME)
[MPU] Sending : 24 23 03 15 01 28 50 10 33 00 0b 00 16 90 53 9a 7c 08 ff 00 00 00 00 00 00 00 00 49 04 01 00 00 37 29 01 00  (PROP_LENS)
[MPU] Sending : 24 23 03 15 01 28 50 10 33 00 0b 00 16 90 53 9a 7c 08 ff 00 00 00 00 00 00 00 00 49 04 01 00 00 37 29 01 00  (PROP_LENS)
PROP_LENS: 01 28 50 10 33 00 0b 00 16 90 53 9a 7c 08 ff 00 00 00 00 00 00 00 00 49 04 01 00 00 37 29 01


The two differences are raw_aperture_min and raw_aperture_max (f4.3-f24 vs f4-f22). Likely different focal length - that one is taken from PROP_LV_LENS. Though, in QEMU, using the startup logs only, focal length is reported as 1408 (ignored by ML).

Focal length is reported in LiveView, in the CR2 logs, but there are a couple of different values (showing the lens was probably zoomed in and out during the experiment).

3) Try to identify properties from unknown MPU messages:


# properties
if 1
  b *0xFF11FB04
  prop_request_change_log

  b *0xFF31F6F0
  mpu_analyze_recv_data_log

  b *0xFF31CE8C
  prop_lookup_maybe_log

  b *0xFF325994
  mpu_prop_lookup_log
end


4) ... ??




StartupCondition is a property that comes from the MPU:

*** mpu_recv(06 05 03 20 04 00), from 1dedc                        ; PROP_STARTUP_CONDITION


Can be easily faked on the main CPU side. Guess: maybe it skips sensor cleaning if camera is powered often?


# edit PROP_STARTUP_CONDITION in mpu_spells/EOSM.h (values: 0x01, 0x04, 0x10)
# we need -icount for deterministic emulation (so the stuff that appears in logs will be in the same order)
(./run_canon_fw.sh EOSM,firmware="boot=1" -s -S -d debugmsg -icount 5 & arm-none-eabi-gdb -x EOSM/patches.gdb) &> startup-condition-01
(./run_canon_fw.sh EOSM,firmware="boot=1" -s -S -d debugmsg -icount 5 & arm-none-eabi-gdb -x EOSM/patches.gdb) &> startup-condition-04
(./run_canon_fw.sh EOSM,firmware="boot=1" -s -S -d debugmsg -icount 5 & arm-none-eabi-gdb -x EOSM/patches.gdb) &> startup-condition-10



startup-condition-01-vs-04.html (looks like - START_AS_CHECK is sensor cleaning)
startup-condition-01-vs-10.html (16 decimal)

Long shot: what if you disable sensor cleaning?




Some more tests I'd like to run:

1) to exclude differences in SD card contents (including, but not limited to, card size):
- prepare the smaller SD card out of these two (format, make bootable, copy logging code)
- create a card image (dd if=/dev/sdcard of=sd.img bs=1M)
- copy the image to the large card (dd if=sd.img of=/dev/sdcard bs=1M)
- run the logging experiment from both cards (starting with the same initial conditions)

2) to exclude timing differences while loading autoexec.bin:
- compile with CONFIG_MARK_UNUSED_MEMORY_AT_STARTUP
- this will slow down the boot process by a few seconds
- the slowdown happens before the MPU is initialized (06 04 02 00 00)
- it's likely some times larger than any timing difference caused by different card speeds
- I should probably think of a way to get repeatable timings (maybe by delaying certain debug messages until some preset value)

3) can the bug be reproduced without autofcus? (if yes, turn it off)

dfort

Quote from: a1ex on March 11, 2018, 12:18:19 PM
Noticed a bug: the log files must have Unix-style newlines, otherwise the script won't work well. The logs from here seem to be with Windows-style newlines.

Darn it, just found out that Pastebin is doing that. Can't find a way to change it--is it a problem? Should I find another service to post my logs?

Thanks for the detailed post. This will keep me busy for a while.

Quote from: a1ex on March 11, 2018, 12:18:19 PM
Note: both startup logs were created with: StartupCondition : 4, 0:

Interesting--I pressed both the Play and INFO button on startup so it would start outside of LiveView and not load ML yet the shutter-bug still showed up. So much for StartupCondition having something to do with killing the bug.

[EDIT] Thought I'd try this:

Quote from: a1ex on March 11, 2018, 12:18:19 PM
Long shot: what if you disable sensor cleaning?

Didn't kill the bug. I also tried "Clean now" because the EOSM displays the sensor cleaning only on shutdown but even though it triggered the shutter during cleaning the shutter-bug was still present after sensor cleaning.

a1ex

Replace "\n" with "\r\n" in the script (will commit it later).

dfort

Started creating logs and processing them. Got something worth sharing:

https://www.dropbox.com/sh/j59xlz1qd1yelen/AABiu9QYITsJ-8RYBQiQZFnta?dl=0

First of all, the changes I did to io_trace (love being able to read this from autoexec.bin):

Magic Lantern Nightly.2018Mar11.EOSM202
Camera   : EOSM
Firmware : 202
Changeset: 2838aae45890+ (io_trace)
Built on : 2018-03-11 20:56:53 by rosiefort@RosieFoComputer

diff -r 2838aae45890 src/boot-hack.c
--- a/src/boot-hack.c Tue Feb 13 21:36:23 2018 +0100
+++ b/src/boot-hack.c Sun Mar 11 13:56:54 2018 -0700
@@ -25,6 +25,7 @@
  * Boston, MA  02110-1301, USA.
  */

+#include "beep.h"
#include "dryos.h"
#include "config.h"
#include "version.h"
@@ -528,7 +529,11 @@


#ifdef CONFIG_DEBUG_INTERCEPT_STARTUP
-    info_led_blink(20,500,500);
+    info_led_blink(10,500,500);
+    beep();
+    DryosDebugMsg(0, 0, "*** Attempting to take a picture...");
+    call("Release");
+    info_led_blink(10,500,500);
     debug_intercept();
#endif
}
diff -r 2838aae45890 src/dm-spy.c
--- a/src/dm-spy.c Tue Feb 13 21:36:23 2018 +0100
+++ b/src/dm-spy.c Sun Mar 11 13:56:54 2018 -0700
@@ -112,7 +112,7 @@

     /* only log messages from LiveView tasks until the buffer is half-full
      * (these are very verbose) */
-    if (lv && len > buf_size / 2)
+    if (lv)
     {
         if (streq(current_task->name, "Evf") ||
             streq(current_task->name, "Epp") ||
diff -r 2838aae45890 platform/EOSM.202/internals.h
--- a/platform/EOSM.202/internals.h Tue Feb 13 21:36:23 2018 +0100
+++ b/platform/EOSM.202/internals.h Sun Mar 11 13:56:54 2018 -0700
@@ -138,7 +138,7 @@
//~ #define CONFIG_AUDIO_CONTROLS

/** Patch LiveViewApp dialog hander to hide Canon bottom bar */
-#define CONFIG_LVAPP_HACK_PATCH
+// #define CONFIG_LVAPP_HACK_PATCH

/** Workaround for menu timeout in LiveView */
#define CONFIG_MENU_TIMEOUT_FIX


CONFIG_DEBUG_INTERCEPT_STARTUP = y


The test wouldn't run when pressing Play and INFO so I used only the Play button. Lights flashed two logs were saved and in both cases the second log showed the "*** Attempting to take a picture..." message so I processed the second logs.

Found some interesting differences but that's about as far as I got.

a1ex

The SW1 event is near the end of the log; may be worth increasing the log size (default is 512K, but many models accept 2MB or more). May take some trial and error.

Comparing MPU messages gives pretty much the same pattern as before (e.g. 06 04 09 2d 00 appears early on the good log). First difference is on some mpu_recv (the MPU decides to send different data, for some reason).

dfort

Saw you made some tweaks to the io_trace_full branch so I jump on that and ran the same test again only this time pretending that the EOSM is a real grownup camera:

src/dm-spy.c
/* plenty of memory available */
-#if defined(CONFIG_5D2) || defined(CONFIG_5D3) || defined(CONFIG_50D) || defined(CONFIG_500D) || defined(CONFIG_100D) || defined(CONFIG_70D)
+#if defined(CONFIG_5D2) || defined(CONFIG_5D3) || defined(CONFIG_50D) || defined(CONFIG_500D) || defined(CONFIG_100D) || defined(CONFIG_70D) || defined(CONFIG_EOSM)
#define BUF_SIZE_MALLOC (2048*1024)
#endif


It worked! Well, sort of. It can run the test as long as the camera is started via the play button. Starting "normally" will hang the camera but that isn't anything new.

One thing I was interested in was "ALERT DELIVER ARRAY" because it shows up in the "good" log but not in the shutter-bug log on my earlier test. It looks like this is located around ff11fc00 and it is a part of the EOSM _prop_request_change stub that is apparently never called when the shutter-bug is present. However, the new logs are much larger and both logs (shutter-bug and no shutter-bug) have ALERT DELIVER ARRAY. So much for that theory.

The io_trace_full logs are in the same Dropbox location as the earlier logs.

By the way, it is easy to figure out which log belongs to which SD card because the 32GB card is an SDHC card formatted as FAT 32 and has all uppercase filenames while the card with the shutter-bug is a 64GB SDXC card formatted as exFAT and has lowercase filenames. Note that it isn't just the file format that is the problem though I'm pretty sure that all SDXC cards have the bug while only a few SDHC cards don't.

In any case, this seems like an interesting experiment:

Quote from: a1ex on March 11, 2018, 12:18:19 PM
1) to exclude differences in SD card contents (including, but not limited to, card size):
- prepare the smaller SD card out of these two (format, make bootable, copy logging code)
- create a card image (dd if=/dev/sdcard of=sd.img bs=1M)
- copy the image to the large card (dd if=sd.img of=/dev/sdcard bs=1M)
- run the logging experiment from both cards (starting with the same initial conditions)

I'm on a Mac which is slightly different but dd is running now. Don't know if it is a Mac thing but it is taking a long time.

a1ex

Yeah, I fixed some bugs related to what I was studying (emulation-related stuff). There were some missing messages in these huge logs, so I updated the first one.

Something interesting: both logs are showing LiveView frames (uppercase: 19 frames; lowercase: 14 frames). Look for ">>> INT-6Ah" (HEAD1), 6B (HEAD2, missing, present on 5D3), D9 (HEAD3), E0 (HEAD4) and check the timestamps.

For example:

cat DM-0001.LOG | grep ">>> INT-6Ah" | cut -c 1-6,8-10 | gnuplot -p -e 'plot "/dev/stdin"'


Average delta 23.248 ms (43 FPS?!) +/- 0.13 us.

(TODO: I should take the interrupt names from here and fill in the gaps; enabling the register_interrupt stub will also name some of them)

The switch to LiveView happens shortly after call("Release") - at MPU message 06 05 04 00 00 (half-shutter).

Can we find a way to take a picture without half-shutter? (maybe by triggering PROP_REMOTE_SW2 directly). Will look into it.

dfort

Got the 64GB exFAT card that reliably shows the shutter-bug formatted to 32GB FAT 32 using dd. Now I've got a 32GB card that reliably shows the shutter-bug. Yes, this card still has the shutter-bug. Ran the same io_trace_full test and saved the log files in the same Dropbox folder.

Something interesting here--the DM-0000.LOG file is much smaller than on the other tests and it is missing "mpu_recv" messages that are on the other logs.

Ran gnuplot on the various io_trace_full startup logs and this is what I got:

32GB card without the shutter-bug


64GB card with shutter-bug


64GB card formatted to 32GB shutter-bug


Don't know what they mean but graphs are always fun to look at.

dfort

Had a crazy idea -- is a firmware dump done while the shutter-bug is active different from a firmware dump done without the shutter-bug?

MD5 (shutter-bug/ROM0.BIN) = 39e8ee840f543fd2e5312bb9ea54a153
MD5 (no-shutter-bug/ROM0.BIN) = 99d5e249bdccaded63fa1acf669cc90f

MD5 (shutter-bug/ROM1.BIN) = b85842898bb3ff1328dbf83047bba1c4
MD5 (no-shutter-bug/ROM1.BIN) = 4a068761b41e61bd0e2741526d8795da


Same camera, same conditions so shouldn't they match? I'll see if this is repeatable then do a disassembly and see what changed. Uh--how do you do those cool html diff files? My diff fu is dismal.

diff -q shutter-bug/ROM1.BIN.dis no-shutter-bug/ROM1.BIN.dis
Files shutter-bug/ROM1.BIN.dis and no-shutter-bug/ROM1.BIN.dis differ

a1ex

ROMs are reflashed at each shutdown with updated Canon settings (from menu or internal), so I don't expect repeatability here.

https://bitbucket.org/hudson/magic-lantern/pull-requests/825/prevent-canon-settings-from-being-saved
https://www.magiclantern.fm/forum/index.php?topic=19369

Warning: huge can of worms.




The graphs show timestamps for each LiveView frame (X = frame ID, Y = timestamp in seconds, with microsecond resolution).

The frame rate might be bumped to 43 FPS during autofocus, and gets back to 30 (guess) when autofocus finishes. Does ML's frame rate indicator agree with this?




HTML diffs:

#!/usr/bin/python2
import difflib
import os, sys

a = open(sys.argv[1]).readlines()
b = open(sys.argv[2]).readlines()

a = [l[:100] for l in a]
b = [l[:100] for l in b]

c = open("diff.html", "w")

h = difflib.HtmlDiff()
print >> c, h.make_file(a, b)


Gets in trouble in some tricky cases (doesn't always match the unchanged lines well). Very slow on large files.

May require preprocessing (to remove unimportant stuff that may be different - timestamps, code addresses etc)

Alternatives:
colordiff -y (very fast; console only; can be converted, but...)
gvimdiff before.log after.log -c "colorscheme garden" -c TOhtml -c "w! diff.html" -c q! -c q! -c q!  (looks ugly)
kdiff3 and meld seem to work too (no idea how to save as HTML though)
Beyond Compare gives good results in tricky cases (matches unchanged lines very well), but it's not free.
Pretty sure there are a lot of other tools around.

dfort

Quote from: a1ex on March 12, 2018, 06:18:39 PM
Warning: huge can of worms.

Figures. Thought I had something there.

Here's a one-liner that works pretty good but of course I can't share the output:

diff -y -W250 shutter-bug/ROM1.BIN.dis no-shutter-bug/ROM1.BIN.dis | expand | grep -E -C3 '^.{123} [|<>]( |$)' | colordiff | aha >mydiff.html

Had to install colordiff and aha. Yeah aha is the name of the program. Well, at least my diff fu has improved.




Quote from: a1ex on March 12, 2018, 06:18:39 PM
The graphs show timestamps for each LiveView frame (X = frame ID, Y = timestamp in seconds, with microsecond resolution).

The frame rate might be bumped to 43 FPS during autofocus, and gets back to 30 (guess) when autofocus finishes. Does ML's frame rate indicator agree with this?

The frame rate indicator changes during autofocus? Wow it does! Never noticed that before. As you know on standby the fps indicator shows 29.97 (approx. it bounces around) on most settings except for Movie crop mode where it shows the setting that corresponds with the Canon menu (1920x1080 30/25/24).

Ok--so here it is in movie mode on standby:


During focus (bottom display trick working nicely):


Finally, it shows the "real" fps only when recording H.264:





Quote from: a1ex on March 11, 2018, 12:18:19 PM
The two differences are raw_aperture_min and raw_aperture_max (f4.3-f24 vs f4-f22). Likely different focal length - that one is taken from PROP_LV_LENS. Though, in QEMU, using the startup logs only, focal length is reported as 1408 (ignored by ML).

Focal length is reported in LiveView, in the CR2 logs, but there are a couple of different values (showing the lens was probably zoomed in and out during the experiment).

Maybe on the first tests. On my later tests I was very careful not to bump anything. I also made sure that WB and ISO were not on auto and set Focus on manual. That's why I was a little surprised when you mentioned the fps changing during autofocus.

dfort

Ran some more tests today. Log files located in this Dropbox folder:

https://www.dropbox.com/sh/hwfpnsz7ki1yowu/AABYKwFaP-UYtBzeiSFsKcOLa?dl=0

First of all:

Quote from: a1ex on March 11, 2018, 12:18:19 PM
3) can the bug be reproduced without autofcus? (if yes, turn it off)

Yes and all the recent tests have been run using the MF setting.

Quote from: a1ex on March 11, 2018, 12:18:19 PM
2) to exclude timing differences while loading autoexec.bin:
- compile with CONFIG_MARK_UNUSED_MEMORY_AT_STARTUP
- this will slow down the boot process by a few seconds
- the slowdown happens before the MPU is initialized (06 04 02 00 00)
- it's likely some times larger than any timing difference caused by different card speeds
- I should probably think of a way to get repeatable timings (maybe by delaying certain debug messages until some preset value)

Ran this with the io_trace_full branch so there's lots of information in that log. I only ran it on the card with the shutter-bug because I figured that the idea here was that if we slow down the boot process maybe that will kill the shutter-bug. It didn't but I do have another large startup log file to look at.

Another test I thought would help was twisting the lens off and on. This is one of the oldest and most reliable workarounds for cards that exhibit the shutter-bug. Tried it both on the card with the shutter-bug as well as the one without the shutter-bug. I wanted to see if there was any difference and of course every line is different on these logs. In fact I ran the test multiple times and every time it was different. Something interesting is the area around cls_task because when the lens is twisted back on the screen goes black for a moment. Maybe this is the point where the lens firmware is scanned? My prediction is that the solution is going to be figuring out a way to reload the lens firmware because the shutter-bug is only happening on CPU lenses that have IS so there's quite a bit of code in the lens firmware as opposed to the simpler EF-M 22mm prime lens that doesn't have IS and doesn't exhibit the shutter-bug.

a1ex

Quote from: dfort on March 13, 2018, 08:45:25 PM
I only ran it on the card with the shutter-bug because I figured that the idea here was that if we slow down the boot process maybe that will kill the shutter-bug.

Or maybe the large delay might cause the shutter bug on the good card - who knows.

Although I still believe there might be something in the bootloader (since there were minimal autoexec.bin and FIRs that did nothing other than jumping to Canon firmware - one instruction in ASM - but still showed the bug). One of my earlier tests - which IIRC failed - was to re-run the entire bootloader with boot flag disabled (to simulate a software reset). Maybe I should revisit it.

dfort

Quote from: a1ex on March 13, 2018, 09:18:42 PM
Or maybe the large delay might cause the shutter bug on the good card - who knows.

I didn't think of that. Ok, fair enough--ran the same test on the Sandisk 32GB card that has never exhibited the shutter-bug and surprise! Shutter-bug! Log uploaded to today's test folder.

[EDIT] Ran the test again, this time on the current unified branch with the only change being to define CONFIG_MARK_UNUSED_MEMORY_AT_STARTUP in config-defines.h. Shutter-bug. So now we know how to put the shutter-bug on the one card that has consistently worked for us.

So is this the step back we need to take before running forward? In animation we call this "anticipation."


a1ex

Wow - first sign that we might be getting closer?!

Binary search - reduce the size argument for memset64 until you find the threshold. There may be some range of values where the bug appears with some probability - finding that interval could be useful.

TODO: I should find a way to measure the time spent in bootloader (before loading autoexec.bin). ARMv7 has performance counters, but I couldn't find any on ARM946E-S; however, our processor appears to have some extensions borrowed from future CPUs, so it might be worth trying some stuff blindly. Re-reading autoexec.bin from bootloader code (including SD card re-initialization) might give a good approximation.

BTW - these two logs, with CONFIG_MARK_UNUSED_MEMORY_AT_STARTUP, don't have any interesting messages - they only have the interrupts (which are not very interesting btw - you can comment them out). The messages from MPU are the most important here - don't disable them.

dfort

Updated my previous post.  :D

Quote from: a1ex on March 13, 2018, 09:49:27 PM
The messages from MPU are the most important here - don't disable them.

Hum--I didn't disable them. Using the same options as the previous tests that showed the MPU messages.

So what am I supposed to do here?

src/boot-hack.c
#ifdef CONFIG_MARK_UNUSED_MEMORY_AT_STARTUP
    memset64(0x00D00000, 0x124B1DE0 /* RA(W)VIDEO*/, 0x1FE00000 - 0x00D00000);
#endif


Would it be best to work on the unified branch for this or is there a different branch I should be using?

a1ex

There is an option that filters out messages not starting with lowercase e - probably taken from here.

(0x1FE00000 - 0x00D00000) / 2 = 0xF880000.

http://www.101computing.net/my-first-python-game-guess-the-number/

Any branch should be fine; there might be a small difference in results if you are also logging the startup messages somehow, but otherwise I don't expect any difference.

dfort

I played that divide by 2 game many times and it has always worked for me but this time it was rather frustrating. Partially because of the interruptions during testing and partially because there seems to be this range where the shutter-bug sometimes shows up and sometime it doesn't. This is probably the zone where several people have reported the bug but couldn't reliably reproduce it.

Here are my notes to find the sweet spot--or rather the bitter spot:

0x0000000 - no shutter-bug
0xF880000 - shutter-bug
0x7C40000 - shutter-bug
0x3E20000 - shutter-bug
0x1F10000 - shutter-bug
0x0F88000 - shutter-bug
0x07C4000 - no shutter-bug
0x0BA6000 - shutter-bug
0x09B5000 - no shutter-bug
0x0AAD800 - no shutter-bug
0x0B29C00 - no shutter-bug
0x0B67E00 - shutter-bug
0x0B48D00 - no shutter-bug
0x0B58580 - shutter-bug
0x0B50940 - no shutter-bug
0x0B54760 - shutter-bug
0x0B52850 - shutter-bug
0x0B518C8 - shutter-bug
0x0B51104 - shutter-bug
0x0B50D22 - shutter-bug
0x0B50B31 - no shutter-bug
0x0B50C29 - shutter-bug
0x0B50BAD - no shutter-bug
0x0B50BEB - shutter-bug
0x0B50BCC - shutter-bug
0x0B50BBC - no shutter-bug
0x0B50BC3 - shutter-bug
0x0B50BBF - shutter-bug
0x0B50BBD - shutter-bug
0x0B50BBB - shutter-bug
0x0B50BBC - shutter-bug
0x0B50BBB - shutter-bug
0x0B50BBA - shutter-bug
0x0B50BB9 - no shutter-bug
0x0B50BBA - no shutter-bug
0x0B50BBB - Sometimes yes, sometimes no


So it looks like the shutter bug shows up somewhere around 0x0B50BB9 to 0x0B50BBC. At this point it was pretty much a coin toss whether or not the shutter-bug would show up.

Used the unified branch for this test. Only changes were to define CONFIG_MARK_UNUSED_MEMORY_AT_STARTUP in src/config-defines.h and change the value here:

src/boot-hack.c
Quote#ifdef CONFIG_MARK_UNUSED_MEMORY_AT_STARTUP
    memset64(0x00D00000, 0x124B1DE0 /* RA(W)VIDEO*/, 0x0B50BBB);
#endif

a1ex

That's quite a narrow range - I've expected to be a coin toss for some hundreds or thousands of units.

You can convert this to seconds after running the memory benchmark - you may need this PR in case of error.

5D3: 217MB/s; 70D: 213MB/s => about 50 ms?

Also run the following:

./run_canon_fw.sh EOSM,firmware="boot=1" -d sdcf     # any autoexec, doesn't matter, we will be analyzing the bootloader only
./run_canon_fw.sh EOSM,firmware="boot=1" -d sdcf,io  # look for "Card LED"


Hypothesis: when starting the camera with boot flag enabled, we've got the following delays:
- SD card initialization (unknown, the LED is off during this process)
- loading autoexec.bin (LED is on during this process)
- ML initialization (blob_memcpy, memset64)
- Canon initialization (until the first mpu_send) - there are some variations here as well, e.g. in the first startup logs (bug vs nobug) there is a 71-microsecond difference (minor compared to 50ms):

15609>    Startup:000e146c:00:00: *** mpu_send(06 04 02 00 00), from 38b8                            ; Init group
155C2>    Startup:000e146c:00:00: *** mpu_send(06 04 02 00 00), from 38b8                            ; Init group


- that also implies the autoexec.bin size matters - larger binaries might be more likely to cause the bug (therefore explaining the inconsistent reports with the minimal binaries - some showing the bug, others not)
- that also implies the filesystem usage matters - as the card fills up with many files, finding autoexec.bin takes more time
- therefore, formatting the card may also appear to fix this bug in some cases (if the total delay was slightly above 50ms before formatting, and slightly below afterwards).

Can you somehow time the startup process of the EOS M on both cards, since pressing the button?

Recording at 60 FPS would give a time slice of 16.7 ms. Probably good enough for our tests, as long as the button press event can be clearly seen in the video. I just want to check whether the startup with the bad card is slower by more than ~ 50 ms, compared with the good card (in vanilla configuration).

You can also add a delay (msleep) for the first mpu_send call (but not for any other). In the logging function (mpu_send_log), which gets called right before running the mpu_send, try something like this (with the good card):

  static int call_num = 0;
  call_num++;
  if (call_num == 1) /* delay the first call only */
  {
    /* do the math after running memory benchmark, then round to multiples of 10 */
    /* let's say it's 53ms -> 50 should not cause the bug, 60 should cause it */
    msleep(50);
  }


That would move the delay right before mpu_send (where I believe it matters).

Then, try delaying some other mpu_send call (the second, the third and so on). At some point, the bug should disappear.




Low-level question: when is the hardware timer (0xC0242014) started?

5D3: reading it before reloc_entry() (boot-hack.c, early startup) locks up the camera. Reading it right after that returns 0x3170 (repeatable), even if there's a CF card in the camera (in this case, the startup time is noticeably longer). Might be initialized here:


[*unk*]        at 0xFF0C008C:FFFF09C0 [0xC0242010] -> 0x0       : ???
[*unk*]        at 0xFF0C0094:FFFF09C0 [0xC0242010] <- 0x1       : ???


In other words, it cannot be used to measure the time required by the bootloader to do whatever it does (including SD card activity).




BTW, this works on QEMU, but not on the camera (autoexec.bin loaded twice):

diff -r 36acb0f9c3e2 src/reboot.c
--- a/src/reboot.c
+++ b/src/reboot.c
@@ -211,2 +211,25 @@
{
+    /* assuming memory contents are kept across software reboots */
+    /* pick some address that's not going to be overwritten during startup */
+    /* e.g. from https://www.magiclantern.fm/forum/index.php?topic=5601.msg196632#msg196632 */
+    volatile uint32_t * first_time = 0x4CDFFFFC;    /* 5D3 */
+    //volatile uint32_t * first_time = 0x47F24000;    /* EOSM */
+
+    if (*(first_time) != 0xABCD1234)  /* likely true after power on */
+    {
+        /* re-execute the bootloader, but next time take the other branch */
+        *(first_time) = 0xABCD1234;
+        void (*reboot)() = (void *) 0xFFFF0000;
+        reboot();
+
+        /* unreachable */
+        while (1);
+    }
+    else
+    {
+        /* we are back! */
+        /* clear this address for the next power on and continue loading ML */
+        *(first_time) = 0;
+    }
+
     #if !(CURRENT_CAMERA_SIGNATURE)

dfort

TL;DR

Just kidding  :D

Of course I was hoping for a, "Just change this one line and Bob's your uncle" solution. Somehow I knew that wasn't going to happen.

I also made a few discoveries. The one card that I have that is under 2GB is an old Kodak "SD Standard" card. According to the SD Association standard cards are formatted either FAT 12 or FAT 16. That SD Standard card doesn't exhibit the shutter-bug. Cards 2-32GB are "SDHC Standard" cards and formatted FAT 32. The SanDisk Extreme Pro card that works so well falls into this category but there are lots of other cards that don't work which also fall into this category. Finally, cards greater than 32GB are "SDXC Standard" formatted exFAT and all of these cards that I have tested show the shutter-bug.

That said, nothing seems to be 100% certain with the shutter-bug. When I was doing the prior test I found a zone where it may or may not show up. In this case I tried to shoot 5 stills:

0x0B50BBB
1 bug
2 no bug
3 no bug
4 no bug
5 no bug


So I went to the next lower value and darn it, didn't work at first. There are reports that formatting the card "fixes" the shutter-bug so:

0x0B50BBA
1 bug
2 bug
3 bug
4 bug
5 bug
format card
6 no bug
7 no bug
8 no bug
9 no bug


Interesting, so I gathered several cards and started shooting the video you asked for. First attempt--the 32GB SandDisk Extreme Pro that has never failed me showed the shutter-bug. What? That was the test card I've been using and at first I thought I got things mixed up but nope--all cards had the same unified build. So not even this card is 100% free of the shutter-bug. Another thing I discovered was that the 128GB SanDisk Extreme Pro card could be cleared of the shutter-bug using the in-camera formatting trick but the bug comes back if the camera is restarted. The Lexar 64GB card I tested reliably showed the shutter-bug throughout my tests.

Put the video on YouTube and also in this Dropbox folder. Hope you find what you're looking for. Note that at the end I showed the formatting trick. I also cut out the boring bits to keep the file manageable.



Anyway, still got some more homework to complete from your previous post.

dfort

I assume you want this test to be run in the camera but this is QEMU code, right?

Quote from: a1ex on March 14, 2018, 10:56:45 AM
In the logging function (mpu_send_log), which gets called right before running the mpu_send, try something like this (with the good card):

  static int call_num = 0;
  call_num++;
  if (call_num == 1) /* delay the first call only */
  {
    /* do the math after running memory benchmark, then round to multiples of 10 */
    /* let's say it's 53ms -> 50 should not cause the bug, 60 should cause it */
    msleep(50);
  }


That would move the delay right before mpu_send (where I believe it matters).

Then, try delaying some other mpu_send call (the second, the third and so on). At some point, the bug should disappear.

The only place I found mpu_send_log was in the QEMU GDB scripts.

Quote from: a1ex on March 14, 2018, 10:56:45 AM
Low-level question: when is the hardware timer (0xC0242014) started?

Is it possible find a low-level answer to this? Will a io_trace startup log show it?

[EDIT] I remember testing several minimal autoexec.bin files to see if the shutter-bug shows up so just for the heck of it I tried building a minimal autoexec.bin from the minimal/EOSM directory and put it on the card that always shows the shutter-bug -- no shutter-bug!

a1ex

The function you are looking for is in dm-spy-extra.c.

The vanilla io_trace won't work because:
1) reading the timer (to timestamp the MMIO messages) will crash the camera before the initial hardware setup (the first part of reloc_entry, until "We're back!")
2) the pointer to data abort function, at that early stage, is at 0x30, not at 0x2C (from FFFF0010 vs FF0C0D9C, both copied to 0x10 - the exception vector for data abort)

Both issues can be worked around and debugged in QEMU. After that, got a trace log for the initial hardware setup for 5D3 1.1.3:

; 000dd43c maps to FF0C0014
000dd43c:MMIO : [0xC0000010] <- 0xD9C5D9C5
000dd448:MMIO : [0xC020010C] <- 0x00000001
000dd450:MMIO : [0xC020000C] <- 0x000000FF
000dd454:MMIO : [0xC020001C] <- 0x000000FF
000dd458:MMIO : [0xC020002C] <- 0x000000FF
000dd45c:MMIO : [0xC020003C] <- 0x000000FF
000dd460:MMIO : [0xC020004C] <- 0x000000FF
000dd464:MMIO : [0xC020005C] <- 0x000000FF
000dd468:MMIO : [0xC020006C] <- 0x000000FF
000dd46c:MMIO : [0xC020007C] <- 0x000000FF
000dd470:MMIO : [0xC020008C] <- 0x000000FF
000dd474:MMIO : [0xC020009C] <- 0x000000FF
000dd478:MMIO : [0xC02000AC] <- 0x000000FF
000dd47c:MMIO : [0xC02000BC] <- 0x000000FF
000dd480:MMIO : [0xC02000CC] <- 0x000000FF
000dd484:MMIO : [0xC02000DC] <- 0x000000FF
000dd488:MMIO : [0xC02000EC] <- 0x000000FF
000dd48c:MMIO : [0xC02000FC] <- 0x000000FF
000dd494:MMIO : [0xC0400008] -> 0x00010003
000dd4a0:MMIO : [0xC0400008] <- 0x00430007
000dd4b4:MMIO : [0xC0242010] -> 0x40000400
000dd4bc:MMIO : [0xC0242010] <- 0x40000401

QEMU:
000dd494:MMIO : [0xC0400008] -> 0x00010003 (set in bootloader)
000dd4a0:MMIO : [0xC0400008] <- 0x00430007
000dd4b4:MMIO : [0xC0242010] -> 0x00000000 (0x40000400 comes from hardware)
000dd4bc:MMIO : [0xC0242010] <- 0x00000001


Then, tried this before reloc_entry():

    MEM(0xC0400008) |= 0x430005;    /* enable clocks, FF0C0068; locks up without it */
    MEM(0xC0242010) |= 1;           /* start digic timer, FF0C0088; prints 0 without it after adding memset64 */
    t0 = MEM(0xC0242014);           /* read digic timer */
    /* printed t0 later when boot finished => 0 */


Then, added this right before reading the timer into t0:

    memset64(0x00D00000, 0x124B1DE0 /* RA(W)VIDEO*/, 0x0B50BBA);


=> 51910 microseconds.

Some notes from g3gg0 on this register (Register_Map page):


Another Timer
0xC0242010 [32] Control register
                  ---------------1 enable flag
                  ----xxxx-------- predivider
                      1 = 3.000 MHz tick rate
                      2 = 2.250 MHz tick rate
                      3 = 1.125 MHz tick rate
                      4 = 1.000 MHz tick rate (default)
0xC0242014 [32] 12 bit counter. used for task load measurement.


=> let's try different values:

0x40000501 => 25955 (ratio 0.5)
0x40000401 => 51910 (reference)
0x40000301 => 58399 (ratio 1.125)
0x40000201 => 116798 (ratio 2.25)
0x40000101 => 155730 (ratio 3)
0x40000001 => 233595 (ratio 4.5)
0x10000401 => 51910 (no change)
0x40000421 => 710 (?!)