Intercepting DebugMsg with cache hacks

Started by a1ex, August 28, 2012, 10:56:48 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

dfort

Hum. It compiles fine on the EOSM2 merge I made but not on the original EOSM in an unadulterated io_trace branch. Maybe this is pointing to a possible problem?

[ LD       ]   magiclantern
liveview.o:(.rodata+0x0): undefined reference to `LiveViewApp_handler_BL_JudgeBottomInfoDispTimerState'
make: *** [magiclantern] Error 1



a1ex

Right, the vanilla startup-log build also failed to build for EOSM.

This comes from an experimental PR I've included in order to solve a lockup on 60D in LiveView (back then I wrote it to solve issue #2434 on 650D and 700D, but was reported to cause issues with 70D, so I left it aside for a while).

dfort

Found out that I have the DEBUGMSG variation of CONFIG_LVAPP_HACK instead of PATCH in the EOSM2. This change will compile on the original EOSM:

  /** Patch LiveViewApp dialog hander to hide Canon bottom bar */
- #define CONFIG_LVAPP_HACK_PATCH
+ #define CONFIG_LVAPP_HACK_DEBUGMSG


but is this correct?

FYI - When I hit a problem on the EOSM2 I try it out on the EOSM.

Going back to the touch screen test, here is a startup log on the EOSM using CONFIG_LVAPP_HACK_DEBUGMSG and REGION(0xC0090000, 0x001000). I was tapping on the screen during startup. Like the EOSM2 the EOSM hung on startup but it did save a log.

a1ex

The log from M looks good.

CONFIG_LVAPP_HACK_DEBUGMSG is no longer a valid option on the above PR (it has no effect, same as commenting out the old option). Without any bottom bar hack enabled, Canon's bottom bar will probably conflict with ML one (it does on all other cameras I tried).

To debug, first advice would be to start without LiveView, but I assume you already tried that.

Next, I'd comment out our logging stuff from the data abort handler, and leave just the bare minimum for tracing the MMIO reads/writes (without actually logging anything):


static void __attribute__ ((naked)) trap()
{
    /* data abort exception occurred. switch stacks, log the access,
     * enable permissions and re-execute trapping instruction */
    asm(
        /* save context */
        "STMFD  SP!, {R0-R12, LR}\n"

        /* prepare to re-execute the old instruction */
        "SUB    R0, LR, #8\n"
        "LDR    R0, [R0]\n"

        /* copy it to uncacheable memory - will this work? */
        "LDR    R1, =trapped_instruction\n"
        "ORR    R1, #0x40000000\n"
        "STR    R0, [R1]\n"

        /* sync caches - unsure */
        "mov    r0, #0\n"
        "mcr    p15, 0, r0, c7, c10, 4\n" // drain write buffer
        "mcr    p15, 0, r0, c7, c5, 0\n" // flush I cache

        /* enable full access to memory */
        "MOV    R0, #0x00\n"
        "MCR    p15, 0, r0, c6, c7, 0\n"

        /* restore context */
        "LDMFD  SP!, {R0-R12, LR}\n"

        /* placeholder for executing the old instruction */
        "trapped_instruction:\n"
        ".word 0x00000000\n"

        /* save context once again */
        "STMFD  SP!, {R0-R12, LR}\n"

        /* sync caches - unsure, try commenting out, see b771f5a0ed3f */
        "mov    r0, #0\n"
        "mcr    p15, 0, r0, c7, c10, 4\n" // drain write buffer
        "mcr    p15, 0, r0, c7, c5, 0\n" // flush I cache

        /* re-enable memory protection */
        "LDR    R0, protected_region\n"
        "MCR    p15, 0, R0, c6, c7, 0\n"

        /* restore context */
        "LDMFD  SP!, {R0-R12, LR}\n"

        /* continue the execution after the trapped instruction */
        "SUBS   PC, LR, #4\n"

        /* ------------------------------------------ */
    );
}


The above runs "step by step" the trapped instruction (the one that does the MMIO read/write), and allows one to log what happens before and after its execution (but doesn't actually log anything - this is just the framework for logging the MMIO accesses). This code should run without crashing, but I'm not sure what to do with the caching registers - if it doesn't work, that could benefit from some fiddling. I don't really know whether the cache is handled correctly - if it's not, it's likely to result in subtle side effects that may be hard to track down. If you know how to handle this stuff properly, please chime in.

On 5D3, with REGION(0xC0000000, 0x10000000), the above code sequence may have some side effects when capturing photos (distorted colors, non-deterministic, doesn't happen all the time) and LiveView doesn't start (black screen, camera locked up). Looking into it.

If that won't work, changeset b771f5a0ed3f was the first one that worked on the camera; it did not save any kind of logs, but it did not trap all the MMIO accesses either; only the first one from a DryOS time slice (a very small percentage of them). Still, useful for debugging. You don't have the REGION macro on that changeset - REGION(0xC0090000, 0x001000) evaluates to 0xC0090017.




Edit: narrowed down a LiveView crash to REGION(0xC0F19000, 0x001000). No idea what device uses this range, and emulation doesn't go far enough to access these registers.

The crash happens on the first exception from this range (which makes it easier to debug). Failed instruction:

0001C370  LDMIA   R1!, {R3,R4,R12,LR}


and it appears to be in some sort of memcpy, which gets called like this:

7DD90>        Evf:000afaa8:00:00: *** sub_1C364(0xfe4bd00, 0xc0f19000, 0x140, 0x0), from ff48f25c
7DDDF>        Evf:000afaa8:00:00: *** sub_1C364(0xfe4be40, 0xc0f19200, 0x140, 0x0), from ff48f270
7DE2C>        Evf:000afaa8:00:00: *** sub_1C364(0xfe4bf80, 0xc0f19400, 0x140, 0x0), from ff48f284
7DE78>        Evf:000afaa8:00:00: *** sub_1C364(0xfe4c0c0, 0xc0f19600, 0x140, 0x0), from ff48f298
...


If I re-enable the permissions and return from the data abort handler so it re-executes this instruction execute from its original location (SUBS PC, LR, #8), I get no crash.

If I execute this instruction from our handler (self-modifying code), and return from data abort to the next instruction (SUBS PC, LR, #4), it crashes.

This scenario should be reproducible in QEMU, outside LiveView.




Edit: indeed, it is. Somehow it attempts to execute an undefined instruction.



0x0001c370:  e8b15018      ldm r1!, {r3, r4, ip, lr}
Taking exception 4 [Data Abort]
...with DFSR 0x1 DFAR 0xc0f19000
0x00000010:  e59ff014      ldr pc, [pc, #20] ; 0x2c
0x000b0710:  e92d5fff      push {r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, sl, fp, ip, lr}
0x000b0714:  e59f05b4      ldr r0, [pc, #1460] ; 0xb0cd0
0x000b0718:  e2800001      add r0, r0, #1 ; 0x1
0x000b071c:  e58f05ac      str r0, [pc, #1452] ; 0xb0cd0
0x000b0720:  e24e0008      sub r0, lr, #8 ; 0x8
0x000b0724:  e5900000      ldr r0, [r0]
0x000b0728:  e59f15a8      ldr r1, [pc, #1448] ; 0xb0cd8
0x000b072c:  e3811101      orr r1, r1, #1073741824 ; 0x40000000
0x000b0730:  e5810000      str r0, [r1]
0x000b0734:  e3a00000      mov r0, #0 ; 0x0
0x000b0738:  ee070f9a      mcr 15, 0, r0, cr7, cr10, {4}
0x000b073c:  ee070f15      mcr 15, 0, r0, cr7, cr5, {0}
0x000b0740:  e3a00000      mov r0, #0 ; 0x0
0x000b0744:  ee060f17      mcr 15, 0, r0, cr6, cr7, {0}
000B073C: MCR p15, ...          : CACHEMAINT x261 (omitted)
000B0744: MCR p15,0,Rd,cr6,cr7,0:  946_PRBS7 <- 0x0       
0x000b0748:  e8bd5fff      pop {r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, sl, fp, ip, lr}
0x000b074c:  e8b15018      ldm r1!, {r3, r4, ip, lr}
[ENGIO] at run_test:000B074C:0001C378 [0xC0F19000] -> 0x0       : ???
[ENGIO] at run_test:000B074C:0001C378 [0xC0F19004] -> 0x0       : ???
[ENGIO] at run_test:000B074C:0001C378 [0xC0F19008] -> 0x0       : ???
[ENGIO] at run_test:000B074C:0001C378 [0xC0F1900C] -> 0x0       : ???
0x000b0750:  e25ef004      subs pc, lr, #4 ; 0x4
0xfffffffc:  ffffffff      undefined instruction 0xffffffff
Taking exception 1 [Undefined Instruction]


The trapped instruction modifies LR...

dfort

Interesting. That M log that looks good was done with CONFIG_LVAPP_HACK_DEBUGMSG which is not a valid option. The M won't compile with CONFIG_LVAPP_HACK_PATCH but the M2 will.

I didn't think of starting up without LiveView so I tried it on the M2 (start with the playback button) and with the REGION(0xC0090000, 0x001000) you were interested in. Here's the startup log for that.

Cameras that are using the invalid CONFIG_LVAPP_HACK_DEBUGMSG in the io_trace branch are the 70D, 100D and 5D3.123 -- though you already know that.

Ok--trying that block of code, the M2 hangs on normal LiveView startup but it starts up fine starting in Play mode. In fact it even saves a startup log.

One last test, going back to the original M2 merged version with CONFIG_LVAPP_HACK_PATCH. I switched from CONFIG_DEBUG_INTERCEPT_STARTUP to CONFIG_DEBUG_INTERCEPT but it hangs on startup. Darn it, I was hoping this would help point out what is going on with the M2. Can't use io_trace to compare with the M because that camera won't compile with CONFIG_LVAPP_HACK_PATCH.


a1ex

That LiveView hack doesn't affect the logs; it only affects the user interface in LiveView (in other words, you can still get valid logs by disabling it).

The M2 log looks good, too.

The logging backend depends a lot on the address where ML is loaded (actually, whether it can do short jumps into ROM, or not - when patching low-level code, we have to be careful with this tiny detail, while for regular C code, it doesn't really matter - the compiler handles it for us). On 60D and 7D IIRC, I had to change the boot method to "classic" AllocateMemory - this change fixed the logging hooks, but broke the LiveView hack (so 60D was crashing in LiveView). The LVAPP_HACK_PATCH method attempts to unify the LiveView hacks (same code working on all cameras) and also plays nice with the logging hooks. There are, indeed, a few things to clean up, as you've already noticed.

a1ex

Some logs (huge, previously impossible to capture) from 5D3.113:

- camera startup
- still photo capture
- image playback (with zoom)
- LiveView, entering from photo mode (for some unknown reason, frames were captured at about 22 fps)
- 1080p25, entering from photo mode (same FPS issue)
- 1080p25 (a few frames, FPS correct this time)

It is my understanding that the above logs include *all* MMIO activity performed by Canon code (or ML) in the 0xC0000000 range (there are a few other regions used for Eeko); however, I have no hard proof for that (or for the correctness of all these values). The logs are huge, but the information you are looking for (regarding how the hardware is configured for various tasks) is probably there.

Previously, to get the value of one (1) single register, I had to examine the disassembly, find out where that register is read or written (from QEMU logs), find a place to install a logging hook, find out in what register I can find the result, troubleshoot if it doesn't work out of the box etc (the MPU communication was reverse engineered this way). Then I've automated that process to some extent (mmio_log, example), but still, I could only place a few logging hooks in a few isolated places (so the information was scattered in many related logs that had to be re-assembled manually to understand the big picture). Now I get the complete interactions with ALL the devices our processor is talking to, in one (huge) log!

The above logs have nearly 5000 unique registers (3500 just from the photo capture log). Next: good luck understanding the meaning of all these numbers :D
(many of them are documented in various places, so these logs would benefit from some postprocessing to label the registers we already know - you can help with that!)

Code in io_trace_full (major rework of the entire logging backend, breaks some of the other features, for example CONFIG_DEBUG_INTERCEPT_STARTUP+CONFIG_MMIO_TRACE, so I created yet another branch for now). The "very simple" io_trace works very well with less demanding logging scenarios, such as GPIO, or for LiveView at reduced FPS, without breaking any compatibility (at startup or otherwise). The above startup log was done with a hardcoded buffer (hack not committed, trying to find a better way).

Thanks g3gg0 for the help (in particular, regarding cache coherence issues) and for lots of good advice!

For DIGIC 4, the news are a little less good: 60D and 500D seem to work fine with GPIO, but they crash when trying to log the full 0xC0000000 range. Didn't try to narrow down. Works in QEMU.

I'm very curious to see what this hack can do on DIGIC 6 and 7 (more details after getting it to work in QEMU).

dfort

I was doing some tests on the EOSM and came up with the issue of not being able to compile dm-spy-experiments. Ran hg bisect on dm-spy-experiments and it pointed to a merge with cleaner-lvapp-hack so I ran hg bisect on that branch and came up with this:

The first bad revision is:
changeset:   14357:6b83ca04841b
branch:      cleaner-lvapp-hack
user:        alex@thinkpad
date:        Sun May 15 23:09:21 2016 +0300
summary:     Disabled CONFIG_LVAPP_HACK_DEBUGMSG and replaced it with CONFIG_LVAPP_HACK_PATCH. 6D, 650D, 700D and EOSM will need a new stub. TBD on 70D/100D as well.


The EOSM doesn't have the LiveViewApp_handler_BL_JudgeBottomInfoDispTimerState stub. I looked for it but no luck. This isn't the new stub that is referred to in the commit is it? All the other platforms mentioned on the commit build fine.

a1ex

Yes, that's the new stub; there is one possible address mentioned in the PR (and looks sane at first sight).

However, you don't need it to get debug messages (and it might even interfere - possible cache conflict with some addresses). Comment out the bottom bar hacks in internals.h (you did this here - unused CONFIG definitions have no effect).

dfort

Thanks--looks like I've been here before but there are so many places to look. Now I understand what nikfreak and Licaon Kter were talking about in that PR. Tried it out and it seems to work fine.

Any tests to run to check on the possible cache conflicts? This would also be a problem on the other cameras, right? So why not just make the EOSM change on the cleaner-lvapp-hack branch?

a1ex

That depends on the addresses you are logging. If there's any issue, you should get an error message (although I remember it only worked if two logged functions were in conflict; this case is a bit different, as it involves some other patch).

TODO: I should just disable the hack when compiling with CONFIG_DEBUG_INTERCEPT.

Yes, the M stub can be committed.

dfort

Quote from: a1ex on March 10, 2018, 12:16:49 AM
Yes, the M stub can be committed.

Pull request submitted.

BTW--What I was trying to do was to solve the EOSM shutter-bug issue. With all the stuff I learned in the past several years maybe I'm ready to tackle this again? Maybe not?

a1ex

Good idea.

If the issue can be reproduced outside LiveView - for example, with a call("Release") somewhere at startup - that's a good candidate for trying in QEMU. I have some local changes that emulate the photo capture process to some extent - at least it's making the screen black and starts capturing the image. That should be enough for diagnosing the issue, as my understanding is that, when this bug happens, the camera doesn't even start capturing the image (maybe it also locks up?)

We'll need two logs with the above scenario, with the camera starting outside LiveView and trying to capture an image. MPU messages should be enabled (they are by default) - that's the "driving force" behind photo capture (and that's what my local changes are all about - MPU messages from photo capture logs, assigned to different hotkeys). One log should be successful and the other should have the bug.

If there is some workaround (maybe removing the lens and putting it back?), that's also worth logging.

a1ex

Quote from: a1ex on February 15, 2018, 11:48:28 AM
Next: good luck understanding the meaning of all these numbers :D

Quote from: Levas on February 26, 2018, 12:39:11 PM
O jus a few thousand registers  to try, I'm almost there ;D

When all else fails, brute force prevails.

Quote from: a1ex on February 15, 2018, 11:48:28 AM
- camera startup


[REPL]         at 0xFF0C7950:FF0C1428 [0xC0201100] <- 0x43210DCB: found @ 3bb
[REPL]         at 0xFF0C7958:FF0C1428 [0xC0201104] <- 0xFEA98765: found @ 3bc
[REPL]         at 0xFF0C7960:FF0C1428 [0xC0201200] <- 0x1       : found @ 3bf
[REPL]         at 0xFF0C796C:FF0C1428 [0xC0201200] -> 0x1       : replayed @ 3c0
[REPL]         at 0xFF143100:FF0C76B8 [0xC0800008] -> 0x8081    : replayed @ 3c1
[REPL]         at 0xFF143108:FF0C76B8 [0xC0800008] <- 0x81      : found @ 3c2
[REPL]         at 0xFF143110:FF0C76B8 [0xC0800018] -> 0x4       : replayed @ 3c3
[REPL]         at 0xFF143118:FF0C76B8 [0xC0800018] <- 0x3F      : found @ 3c4
[REPL]         at 0xFF31AC10:FF143128 [0xC0800008] -> 0x81      : replayed @ 3c5
[REPL]         at 0xFF31AC14:FF143128 [0xC0800008] -> 0x81      : replayed @ 3c6
[REPL]         at 0xFF31AC1C:FF143128 [0xC0800008] <- 0x81      : found @ 3c7
[REPL]         at 0xFF31AC34:FF31AC2C [0xC0800010] <- 0x19      : found @ 3c8
[REPL]         at 0xFF31AC3C:FF31AC2C [0xC0800008] <- 0x81      : found @ 3c9
[REPL]         at 0xFF31AC4C:FF143148 [0xC0800008] -> 0x81      : replayed @ 3ca
[REPL]         at 0xFF31AC70:FF143148 [0xC0800008] <- 0x81      : found @ 3cb
[REPL]         at 0xFF31AC80:FF143154 [0xC0800008] -> 0x81      : replayed @ 3cc
[REPL]         at 0xFF31ACA8:FF143154 [0xC0800008] <- 0x81      : found @ 3cd
[REPL]         at 0xFF143164:FF143154 [0xC0800008] -> 0x81      : replayed @ 3ce
[REPL]         at 0xFF14316C:FF143154 [0xC0800008] <- 0x8081    : found @ 3cf
[REPL]         at 0xFF143178:FF143154 [0xC0800014] <- 0x2       : found @ 3d0
[REPL]         at 0xFF143180:FF143154 [0xC0800000] <- 0x0       : found @ 3d1
[REPL]         at 0xFF143188:FF143154 [0xC0800008] -> 0x8081    : replayed @ 3d2
[REPL]         at 0xFF0C7A38:FF0C1680 [0xC0203008] <- 0x8       : found @ 3d3
[REPL]         at 0xFF0C79A8:FF0C1688 [0xC0201010] <- 0xA       : found @ 3d4
[REPL]         at 0xFF0C79AC:FF0C1688 [0xC0201200] -> 0x1       : replayed @ 3d5
[REPL]         at 0xFF0C168C:FF0C1688 [0xC0400018] <- 0x0       : found @ 3d6
[REPL]         at 0xFF0C7AD8:FF0C1698 [0xC0400008] -> 0x430007  : replayed @ 3d7
[REPL]         at 0xFF0C7AE4:FF0C1698 [0xC0400008] <- 0x431007  : found @ 3d8
[REPL]         at 0xFF0C1698:FF0C1698 [0xC0400000] -> 0x80000000: replayed @ 3d9
[REPL]         at 0xFF0C7A68:FF0C16A4 [0xC0210200] <- 0x80000000: found @ 3da
[REPL]         at 0xFF0C7A6C:FF0C16A4 [0xC0210200] -> 0x0       : replayed @ 3db
[REPL]         at 0xFF0C7A9C:FF0C16C8 [0xC0210204] <- 0x2       : found @ 3dc
[REPL]         at 0xFF0C7AB0:FF0C16C8 [0xC0210214] <- 0x3       : found @ 3dd
[REPL]         at 0xFF0C16D0:FF0C16C8 [0xC0210208] <- 0x270F    : found @ 3de
[REPL]         at 0xFF0C7AC4:FF0C16DC [0xC0210210] <- 0x1       : found @ 3df
[REPL]         at 0xFF0C7AC8:FF0C16DC [0xC0210200] <- 0x1       : found @ 3e0
[REPL]      at init:FF0C79A8:FF0C77F8 [0xC0201010] <- 0x2E      : found @ 3e1
[REPL]      at init:FF0C79AC:FF0C77F8 [0xC0201200] -> 0x1       : replayed @ 3e2
[REPL]      at init:FF0C79A8:FF0C7800 [0xC0201010] <- 0x3A      : found @ 3e3
[REPL]      at init:FF0C79AC:FF0C7800 [0xC0201200] -> 0x1       : replayed @ 3e4
[REPL]      at init:FF1432CC:FF31AE98 [0xC05000D0] -> 0x20      : replayed @ 3e5
[REPL]      at init:FF1432D4:FF31AE98 [0xC05000D0] <- 0x20      : found @ 3e6
[REPL]      at init:FF1432E0:FF31AE98 [0xC05000C0] <- 0x4003BF20: found @ 3e7
[REPL]      at init:FF1432EC:FF31AE98 [0xC05000C4] <- 0x80      : found @ 3e8
[REPL]      at init:FF1432F8:FF31AE98 [0xC05000C8] <- 0x4003BF20: found @ 3e9
[REPL]      at init:FF143304:FF31AE98 [0xC05000CC] <- 0x80      : found @ 3ea
[REPL]      at init:FF14330C:FF31AE98 [0xC05000D4] <- 0x0       : found @ 3eb
[REPL]      at init:FF143318:FF31AE98 [0xC05000D8] <- 0x17      : found @ 3ec
[REPL]      at init:FF143324:FF31AE98 [0xC05000D0] <- 0x10023   : found @ 3ed
[REPL]      at init:FF143334:FF31AE98 [0xC0800018] <- 0x3F      : found @ 3f3
[REPL]      at init:FF143340:FF31AE98 [0xC0800014] <- 0x1       : found @ 3f4
[REPL]      at init:FF143348:FF31AE98 [0xC0800018] -> 0x3F      : replayed @ 3f5
[REPL]      at init:FF143350:FF31AE98 [0xC0800018] <- 0x7       : found @ 3f6
[REPL]      at init:FF14335C:FF31AE98 [0xC0800008] <- 0x8881    : found @ 3f7
[REPL]      at init:FF143364:FF31AE98 [0xC0800008] -> 0x8081    : replayed @ 3f8
[REPL]      at init:FF14350C:FF0C78CC [0xC0800014] -> 0x16      : replayed @ 3f9
[REPL]      at init:FF14353C:FF0C78CC [0xC0800014] <- 0x2       : found @ 3fa
[REPL]      at init:FF143548:FF0C78CC [0xC0800000] <- 0x4B      : found @ 3fb
[REPL]      at init:FF14350C:FF0C78CC [0xC0800014] -> 0x14      : replayed @ 3fc
[REPL]      at init:FF143518:FF0C78CC [0xC0800018] -> 0x7       : replayed @ 3fd
[REPL]      at init:FF143520:FF0C78CC [0xC0800018] <- 0x5       : found @ 3fe


Next: interrupu.