EDMAC internals

Started by a1ex, November 26, 2016, 01:28:55 PM

a1ex

Until now, we didn't know much about how to configure the EDMAC. Recently we did some experiments that cleared up a large part of the mystery.

I'll start with the size parameters. They are labeled xa, xb, xn, ya, yb, yn, off1a, off1b, off2a, off2b, off3 (from debug strings). Their meaning was largely unknown, and so far we only used the following configuration:


xb = width
yb = height-1
off1b = padding after each line


Let's start with the simplest configuration (memcpy):


xb = size in bytes. 


Unfortunately, it doesn't work - the image height must be at least 2.



Simplest WxH



How Canon code sets it up:

  CalculateEDmacOffset(edmac_info, 720*480, 480):
     xb=0x1e0, yb=0x2cf


Transfer model (what the EDMAC does, in a compact notation):

xb * (yb+1)        (transfer xb bytes, then repeat yb more times)




WxH + padding (skip after each line)




(xb, skip off1b) * (yb+1)


Note: skipping does not change the contents of the memory,
so the above is pretty much the same as:


(xb, skip off1b) * yb
followed by xb (without skip)
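A minimal software sketch of the "WxH + padding" case for a write transfer, assuming plain byte buffers and a non-negative off1b. The function name and the dst_size bounds check are made up for illustration; they are not part of Canon's or Magic Lantern's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models "(xb, skip off1b) * (yb+1)" for a write transfer: the source is
 * read linearly, the destination gets a gap of off1b bytes after each line.
 * Returns the number of payload bytes copied. */
static size_t copy_lines_with_padding(uint8_t *dst, size_t dst_size,
                                      const uint8_t *src,
                                      size_t xb, size_t yb, size_t off1b)
{
    size_t src_pos = 0, dst_pos = 0;

    for (size_t line = 0; line <= yb; line++)   /* yb+1 lines in total */
    {
        assert(dst_pos + xb <= dst_size);       /* stay inside the buffer */
        memcpy(dst + dst_pos, src + src_pos, xb);
        src_pos += xb;
        dst_pos += xb + off1b;                  /* skipped bytes stay untouched */
    }
    return src_pos;
}
```

The skip after the last line never touches memory, which is why the two notations above cannot be told apart by looking at the destination buffer alone.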




xa, xb, xn (usual raw buffer configuration)




xa = xb = width
xn = height-1


To see what xa and xn do, let's look at some more examples (how Canon code configures them):

  edmac_setup_size(ch, 0x1000000):
    xn=0x1000, xa=0x1000

  edmac_setup_size(6, 76800):
    xa=0x1000, xb=0xC00, xn=0x12

  CalculateEDmacOffset(edmac_info, 0x100000, 0x20):
    xa=0x20, xb=0x20, yb=0xfff, xn=0x7

  CalculateEDmacOffset(edmac_info, 1920*1080, 240):
    xa=0xf0, xb=0xf0, yb=0xb3f, xn=0x2


The above can be explained by a transfer model like this:

(xa * xn + xb) * (yb+1)
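As a quick cross-check, the formula can be evaluated for the Canon configurations quoted above. This is only a sketch: the function names are hypothetical, and any parameter not listed in a log line is assumed to be 0.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes moved by "(xa * xn + xb) * (yb+1)"; offsets are not modeled here. */
static uint64_t edmac_bytes_xy(uint32_t xa, uint32_t xb, uint32_t xn, uint32_t yb)
{
    return ((uint64_t)xa * xn + xb) * ((uint64_t)yb + 1);
}

/* Re-checks the logged configurations against the requested sizes. */
static void check_canon_examples(void)
{
    /* edmac_setup_size(6, 76800); yb is not listed, assumed 0 */
    assert(edmac_bytes_xy(0x1000, 0xC00, 0x12, 0) == 76800);

    /* CalculateEDmacOffset(edmac_info, 0x100000, 0x20) */
    assert(edmac_bytes_xy(0x20, 0x20, 0x7, 0xfff) == 0x100000);

    /* CalculateEDmacOffset(edmac_info, 1920*1080, 240) */
    assert(edmac_bytes_xy(0xf0, 0xf0, 0x2, 0xb3f) == 1920 * 1080);
}
```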




Adding ya, yn (to xa, xb, xn, yb)



Some experiments (trial and error, 5D3):

  xa = 3276, xb = 1638, xn = 1055                  => 3276*1055 + 1638 transferred
  xa = 3276, xb = 32,   xn = 1055                  => 3276*1055 + 32
  xa = 3276, xb = 0,    xn = 1055                  => 3276*1056 - 20 (?!)
  xa = 3276, xb = 3276, xn = 95, yb = 10           => 3276*96*11
  xa = 3276, xb = 3276, xn = 95, yb = 7,  yn = 3   => 3276*96*11
  xa = 3276, xb = 3276, xn = 10, yb = 62, yn = 33  => 3276*11*96
  xa = 3276, xb = 3276, xn = 10, yb=3, yn=5, ya=2  => 3276*11*19
  xa = 3276, xb = 3276, xn = 10, yb=5, yn=3, ya=6  => 3276*11*27
  xa = 3276, xb = 3276, xn = 10, yb=5, yn=3, ya=7  => 3276*11*30
  xa = 3276, xb = 3276, xn = 10, yb=7, yn=8, ya=9  => 3276*11*88
  xa = 3276, xb = 3276, xn = 10, yb=8, yn=3, ya=28 => 3276*11*96



(xa * xn + xb) REP (yn REP ya + yb)


Here, a REP b means 'perform a, repeat b times' => a * (b+1).

So far, so good, the above model appears to explain the behavior
when there are no offsets, and looks pretty simple.

There is a quirk: if xb = 0, the behavior looks strange.
Let's ignore it for now.
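The REP notation expands to a simple product, which can be checked against some rows of the experiment table. A sketch with hypothetical function names; it covers only the case without offsets and does not model the xb = 0 quirk.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes moved by "(xa * xn + xb) REP (yn REP ya + yb)", no offsets. */
static uint64_t edmac_bytes_total(uint32_t xa, uint32_t xb, uint32_t xn,
                                  uint32_t ya, uint32_t yb, uint32_t yn)
{
    uint64_t line  = (uint64_t)xa * xn + xb;             /* one full line      */
    uint64_t lines = (uint64_t)yn * ((uint64_t)ya + 1)   /* yn blocks of ya+1  */
                   + (uint64_t)yb + 1;                   /* last block of yb+1 */
    return line * lines;
}

/* A few rows from the 5D3 experiments (arguments: xa, xb, xn, ya, yb, yn). */
static void check_5d3_experiments(void)
{
    assert(edmac_bytes_total(3276, 1638, 1055,  0, 0, 0) == 3276ull * 1055 + 1638);
    assert(edmac_bytes_total(3276, 3276,   95,  0, 7, 3) == 3276ull * 96 * 11);
    assert(edmac_bytes_total(3276, 3276,   10,  2, 3, 5) == 3276ull * 11 * 19);
    assert(edmac_bytes_total(3276, 3276,   10,  9, 7, 8) == 3276ull * 11 * 88);
    assert(edmac_bytes_total(3276, 3276,   10, 28, 8, 3) == 3276ull * 11 * 96);
}
```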



Adding off1b (to xa, xb, xn, ya, yb, yn)



What do we do about the offset off1b?

Experiment:

xa = 3276, xb = 3276, xn = 10, yb=95, off1b=100
=> copied 3276*10*96 + 3276, skipped 100,
   (CP 3276, SK 100) repeated 94 times (95 runs).


It copies a large block, then it starts skipping after each line.
Let's decompose our model and reorder the terms.
Then, let's skip off1b after each xb.


(xa * xn)        REP (yn REP ya + yb)
(xb, skip off1b) REP (yn REP ya + yb)


Let's check a more complex scenario:

xa = 3276, xb = 3276, xn = 10, yb=8, yn=3, ya=28, off1b=100
=> (CP 3276*10*29 + 3276,   SK 100), (CP 3276, SK 100) * 27,
   (CP 3276*10*29 + 3276*2, SK 100), (CP 3276, SK 100) * 27,
   (CP 3276*10*29 + 3276*2, SK 100), (CP 3276, SK 100) * 27,
   (CP 3276*10*9  + 3276*2, SK 100), (CP 3276, SK 100) * 8.


There's some big operation that appears repeated 3 times (yn),
although the copied block sizes are a little inconsistent (first is smaller).

After that, (xa * xn) is executed 9 times (yb+1).
At the end, (xb, skip off1b) is executed 9 times (also yb+1).

In the big operation, the 29 is clearly ya+1.

What if off1b is skipped after all xb iterations, but not the last one?
This could explain why we have an extra 3276 (the *2) on the last 3 log lines.

Regroup the terms like this:

  => ((CP 3276*10*29), (CP 3276, SK 100) * 28, CP 3276) * 3,
      (CP 3276*10*9 ), (CP 3276, SK 100) * 9.


Our model starts to look like this:

(
   (xa * xn)        * (ya+1)
   (xb, skip off1b) *  ya
    xb without skip
)
  * yn

followed by:

   (xa * xn)        * (yb+1)
   (xb, skip off1b) * (yb+1)


So far so good, it's a bit more complex,
but explains all the above observations.
Of course, the last line may be as well:

  (xb, skip off1b) * yb, xb without skip




Adding off1a



Let's try another offset: off1a = 44.
The log from this experiment is pretty long, so I'll simplify it by regrouping the terms.


xa = 3276, xb = 3276, xn = 10, yb=8, yn=3, ya=28, off1a=44, off1b=100
=> (
     ((CP 3276, SK 44)  * 28, CP 3276) * 10,
     ((CP 3276, SK 100) * 28, CP 3276),
   ) * 3,
   (
     ((CP 3276, SK 44)  * 8, CP 3276) * 10,
     ((CP 3276, SK 100) * 8, CP 3276)
   )


This gives good hints about what is happening when:

(
   ((xa, skip off1a) * ya, xa) * xn
    (xb, skip off1b) * ya, xb
) * yn,

(
   ((xa, skip off1a) * yb, xa) * xn
    (xb, skip off1b) * yb, xb
)




Adding the remaining offsets (all parameters are now used)



Let's add off2a, off2b and off3. They are pretty obvious now, so I'll skip the log file (which looks quite intimidating anyway).


(
   ((xa, skip off1a) * ya, xa, skip off2a) * xn
    (xb, skip off1b) * ya, xb,
     skip off3
) * yn,

(
   ((xa, skip off1a) * yb, xa, skip off2b) * xn
    (xb, skip off1b) * yb, xb
)


So, there is a pattern: perform N iterations with some settings, then perform the last iteration with slightly different parameters. The pattern repeats at all iteration levels (somewhat like fractals).

Just by looking at the memory contents, we can't tell which skip value is used for the very last iteration. However, by reading the memory address register (0x08) directly from hardware (not from the shadow memory), we can get the end address (after the EDMAC transfer has finished). For a write transfer, this includes the transferred data and also the skip offsets. Now it's straightforward to notice the last offset is off3, so our final model for EDMAC becomes:



EDMAC transfer model




(
   ((xa, skip off1a) * ya, xa, skip off2a) * xn
    (xb, skip off1b) * ya, xb, skip off3
) * yn,

(
   ((xa, skip off1a) * yb, xa, skip off2b) * xn
    (xb, skip off1b) * yb, xb, skip off3
)


The offset labels now start to make sense :)

C code (used in qemu):

for (int jn = 0; jn <= yn; jn++)
{
    int y     = (jn < yn) ? ya    : yb;
    int off2  = (jn < yn) ? off2a : off2b;
    for (int in = 0; in <= xn; in++)
    {
        int x     = (in < xn) ? xa    : xb;
        int off1  = (in < xn) ? off1a : off1b;
        int off23 = (in < xn) ? off2  : off3;
        for (int j = 0; j <= y; j++)
        {
            int off = (j < y) ? off1 : off23;
            cpu_physical_memory_write(dst, src, x);
            src += x;
            dst += x + off;
        }
    }
}


The above model is for write operations. For read, the skip offsets are applied to the source buffer - that's the only difference.
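To try the model outside QEMU, here is a self-contained sketch that runs it on plain buffers in either direction. The struct edmac_model, the function name and the mem_size bounds check are assumptions made for this example (the real struct edmac_info layout is not reproduced here), and memcpy stands in for cpu_physical_memory_write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical parameter set; field names follow the thread. */
struct edmac_model {
    uint32_t xa, xb, xn, ya, yb, yn;
    int32_t off1a, off1b, off2a, off2b, off3;
};

/* 'mem' is the side where skip offsets apply, 'stream' is the linear side.
 * is_write != 0: stream -> mem (write channel); otherwise mem -> stream.
 * Returns the final position in 'mem', including all skips. */
static ptrdiff_t edmac_model_run(const struct edmac_model *m,
                                 uint8_t *mem, size_t mem_size,
                                 uint8_t *stream, int is_write)
{
    ptrdiff_t mem_pos = 0;
    size_t stream_pos = 0;

    for (uint32_t jn = 0; jn <= m->yn; jn++)
    {
        uint32_t y   = (jn < m->yn) ? m->ya    : m->yb;
        int32_t off2 = (jn < m->yn) ? m->off2a : m->off2b;
        for (uint32_t in = 0; in <= m->xn; in++)
        {
            uint32_t x    = (in < m->xn) ? m->xa    : m->xb;
            int32_t off1  = (in < m->xn) ? m->off1a : m->off1b;
            int32_t off23 = (in < m->xn) ? off2     : m->off3;
            for (uint32_t j = 0; j <= y; j++)
            {
                int32_t off = (j < y) ? off1 : off23;
                assert(mem_pos >= 0 && (size_t)mem_pos + x <= mem_size);
                if (is_write)
                    memcpy(mem + mem_pos, stream + stream_pos, x);
                else
                    memcpy(stream + stream_pos, mem + mem_pos, x);
                stream_pos += x;
                mem_pos += (ptrdiff_t)x + off;
            }
        }
    }
    return mem_pos;
}
```

The return value plays the role of the end address read back from register 0x08, relative to the start address. With xb = 2, yb = 2, off1b = 2, off3 = 3 and everything else 0, it is 13: three 2-byte lines, two gaps of 2 and a final skip of 3.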

Offsets can be positive or negative. In particular, off1a and off1b only use 17 bits (digic 3 and 4) or 19 bits (digic 5), so we have to extend the sign.
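Sign extension of such a field can be sketched as below. The helper name is hypothetical, and the sketch assumes the raw register value holds a two's complement number in its low 17 or 19 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Interprets the low 'bits' bits of 'raw' as a two's complement number. */
static int32_t sign_extend(uint32_t raw, unsigned bits)
{
    assert(bits >= 1 && bits <= 32);

    int64_t sign = (int64_t)1 << (bits - 1);       /* weight of the sign bit */
    int64_t v    = raw & (((int64_t)1 << bits) - 1); /* drop the upper bits  */

    /* flipping the sign bit and subtracting its weight maps
     * [2^(bits-1), 2^bits) onto the negative range */
    return (int32_t)((v ^ sign) - sign);
}
```

For example, a 17-bit field reading 0x1FFFF decodes to -1, while 0x0FFFF stays 65535.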

The above model explained all the combinations that are not edge cases (such as yb=0 or odd values). Here are the tests I've run: 5D3 vs QEMU.

For more details, please have a look at the "edmac" and "qemu" branches.

To be continued.

g3gg0

After Alex found out how to correctly configure the EDMAC, it was easy to match it with patents.

https://www.google.de/patents/US7817297 see fig 11a
the description matches the reverse engineered information
Help us with datasheets - Help us with register dumps
magic lantern: 1Magic9991E1eWbGvrsx186GovYCXFbppY, server expenses: [email protected]
ONLY donate for things we have done, not for things you expect!

a1ex

Some pictures showing the EDMAC usage on 5D3 LiveView:




a1ex

5D3 photo mode:





TTJ = TwoInTwoOutJpegPath
TTL = TwoInTwoOutLosslessPath

a1ex

Committed the test code that outputs the raw logs used to figure out the EDMAC model, just in case anyone would like to play with it.

This code can be used for cross-checking the EDMAC behavior against our understanding of what it does, by running it on both a real camera and on QEMU - the logs should match.

PR ready.

a1ex

Something easier: playing back an image.



Translation:
- the graph shows only the steps performed on the image processing engine
- first step: some memcpy of size 3840x1079, using connection <6> which is pass-through (probably zeroing out some buffer); note the input is a single line repeated many times.
- second step: a quick JPEG read pass (reading the embedded JPEG from the CR2 to find its metadata).
- third step: decoding the JPEG (it reads the same JPEG again, but this time outputs a 2880x960 YUV image in some unusual order); input from connection <5>, output on <3>, using JPCORE.
- fourth step: resizing the 2880x960 YUV to 1440x480 (all these sizes were in bytes, so the end result is 720x480 pixels - the displayed image). Input and output on connection <3>.

The last configuration probably shows the data coming to some connection is not automatically forwarded to the other end of that connection (exception: connections 6 and 7 will simply copy the input data to output).

A real-time connection diagram is available on the "edmac" branch, but it only shows the latest state (so, in this case of playing back an image, it will only show the last processing step, because the EDMAC channels are reused).

After some playing around, I've got a few snapshots throughout the image playback process (captured one screenshot on each StartEDmac call):


names_are_hard

I am exploring EDMAC on modern cams. I found this thread useful, but hard to understand. Here's my attempt at a useful diagram:



xa and ya specify "regular" tiles, which can be repeated, with the number of repeats controlled by xn and yn.

xb and yb specify "irregular" / "remainder" tiles, only appearing in one column or row, or none if the respective xb or yb is 0.

Memory is linear, accessed left-to-right, top-to-bottom.

I would assume this design was chosen as it allows specifying regions of arbitrary size, while also allowing most of the transfers to occur using tiles of a convenient width for the underlying copy mechanism (which is presumably aligned to some power of two).

Note that I'm ignoring the offsets.  You don't need them if you're copying a contiguous region of memory, and they complicate the diagram significantly.  Check the patent for details, should you need this.  In broad terms, you can specify an offset between any of the rows or columns, and that offset can be positive (you are now copying a grid of disconnected regions), or negative (you are copying a grid of partially overlapping regions).

Because you can provide a different edmac_info struct for the dst and src, this allows some kinds of transforms to happen very efficiently (likely bottle-necked only by ram speed).  The example given in the patent is for handling data from a unit that produces RGB data as three separate channels (each at a different memory location), which can be combined together using such a copy operation.  For the source, specify 3 rows or columns, with offsets meaning you're reading from each channel, for the destination, choose a different geometry.
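One way the planar RGB case might be expressed in the transfer model from earlier in the thread is sketched below. It is a software model only, not tested on hardware. It assumes the three planes are stored back to back (R at 0, G at n, B at 2n), and the struct and function names are hypothetical. One-byte tiles like these may well violate hardware limits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter set; field names follow the thread. */
struct edmac_model {
    uint32_t xa, xb, xn, ya, yb, yn;
    int32_t off1a, off1b, off2a, off2b, off3;
};

/* Read-side model: skip offsets apply to src, dst is filled linearly.
 * Returns the number of bytes written to dst. */
static size_t edmac_model_read(const struct edmac_model *m,
                               uint8_t *dst, const uint8_t *src)
{
    ptrdiff_t s = 0;
    size_t d = 0;

    for (uint32_t jn = 0; jn <= m->yn; jn++)
    {
        uint32_t y   = (jn < m->yn) ? m->ya    : m->yb;
        int32_t off2 = (jn < m->yn) ? m->off2a : m->off2b;
        for (uint32_t in = 0; in <= m->xn; in++)
        {
            uint32_t x    = (in < m->xn) ? m->xa    : m->xb;
            int32_t off1  = (in < m->xn) ? m->off1a : m->off1b;
            int32_t off23 = (in < m->xn) ? off2     : m->off3;
            for (uint32_t j = 0; j <= y; j++)
            {
                for (uint32_t k = 0; k < x; k++)
                    dst[d++] = src[s + k];
                s += (ptrdiff_t)x + ((j < y) ? off1 : off23);
            }
        }
    }
    return d;
}

/* Source-side config that interleaves three n-byte planes into RGBRGB... */
static struct edmac_model planar_rgb_source(uint32_t n)
{
    assert(n >= 1);
    struct edmac_model m = {0};
    m.xa = m.xb = 1;                     /* one byte per plane and pixel      */
    m.yb = 2;                            /* yb+1 = 3 reads per pixel: R, G, B */
    m.xn = n - 1;                        /* n-1 regular pixels + 1 xb pixel   */
    m.off1a = m.off1b = (int32_t)n - 1;  /* same index in the next plane      */
    m.off2b = -2 * (int32_t)n;           /* from B[i] back to R[i+1]          */
    return m;
}
```

The negative off2b is what brings the read pointer back from the B plane to the next R byte; the destination side needs no offsets at all.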

Further complicating things, there are various restrictions on what edmac_info configs are valid.  I don't know all of these.  Alex implies some in his comments, but also talks about some configs that aren't valid on my test cam, a 200D.

Some examples:
You can specify a single rectangle with only xb and yb, and if so, xa and ya can be 0.  But yb must be at least 1, which means at least 2 rows: the row count is yb+1 (so 0 would normally mean 1 row), but a transfer where xb is the only non-zero value doesn't work.
If you have xa and xn non-zero, but everything else zero - your copy will silently fail.
If you have xa, ya, xn, yn all non-zero, but everything else zero - your copy will silently fail.
If you have xa, xn, xb non-zero, all else zero, this works!
Possibly this means you must have a non-zero xb.

There are limits on the maximums for each variable, which can be quite low, perhaps somewhere around 4000.  But it depends on which variable, and I haven't checked this thoroughly.  xb works up to at least 65536.  But in essence, each tile should be somewhat small.

names_are_hard

After further testing on real cam to confirm I understand it, here's an illustrative copy example.


    // Region size 128kB == 0x20000, we can factor to:
    // 0x4 * 0x10 grid, each tile 0x80 * 0x10;
    // 0x4 * 0x10 * 0x80 * 0x10 == 0x20000
    struct edmac_info region = {
        .off1a = 0,
        .off1b = 0,
        .off2a = 0,
        .off2b = 0,
        .off3 = 0,
        .xa = 0x80, // "regular" col width
        .xb = 0x80, // "irregular" col width
        .ya = 0xf, // implies 0x10 high
        .yb = 0xf, // 0x10 high
        .xn = 0xf, // 0xf regular cols + 1 xb col
        .yn = 0x3 // 3 normal rows + 1 yb row
    };


It might seem more obvious to not use xb or yb at all; just have 0x10 normal columns and 4 normal rows.  I agree, but the hardware doesn't.

If you have xb == 0, your copy will silently fail.

yb is more tricky.  In general, ya and yb are treated as holding a number 1 higher than their values, but if yn is 0, ya is not used. yb is always used, meaning if you keep it 0, you get a single row added to the end of your copy, of length (xa * xn) + xb.
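The off-by-one conventions are easy to get wrong, so here is a small sketch that turns a grid description into the size fields and computes the resulting byte count. The struct edmac_sizes and both function names are hypothetical (this is not the real struct edmac_info layout), and the sketch only covers the case where regular and remainder tiles have the same size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the size fields only; offsets are left out. */
struct edmac_sizes { uint32_t xa, xb, xn, ya, yb, yn; };

/* Grid of cols x rows tiles, each tile_w bytes wide and tile_h lines high. */
static struct edmac_sizes sizes_from_grid(uint32_t tile_w, uint32_t tile_h,
                                          uint32_t cols, uint32_t rows)
{
    assert(tile_w > 0 && tile_h > 0 && cols > 0 && rows > 0);

    struct edmac_sizes s;
    s.xa = s.xb = tile_w;       /* widths are stored as-is                  */
    s.ya = s.yb = tile_h - 1;   /* heights are stored minus one             */
    s.xn = cols - 1;            /* the last column is always the xb column  */
    s.yn = rows - 1;            /* the last row is always the yb row        */
    return s;
}

/* Total payload described by the size fields (no offsets). */
static uint64_t sizes_total_bytes(const struct edmac_sizes *s)
{
    uint64_t line  = (uint64_t)s->xa * s->xn + s->xb;
    uint64_t lines = (uint64_t)s->yn * ((uint64_t)s->ya + 1)
                   + (uint64_t)s->yb + 1;
    return line * lines;
}
```

Calling sizes_from_grid(0x80, 0x10, 0x10, 0x4) reproduces the values in the struct above, and sizes_total_bytes() then gives 0x20000. The sketch does not check the hardware restrictions mentioned earlier (for example, yb must be at least 1 when only xb and yb are used).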

heder

Here's the image from the patent.


We're using offsets with positive values (edmac_memcpy), but here they are shown with negative offsets. This is quite a powerful DMA engine.

  • After xn (XNUMA) copies of xa, the last xb can actually move inside the last xa copy if off1b is negative
  • After the first xa copy, you can overwrite the first xa if you make off1a negative
  • If xb is zero bytes but off1b is negative, the pitch will become new_pitch = old_pitch - off1b (I think so...)
  • I'm going to stop here .. too many options ..
... some text here ..

names_are_hard

Fun, isn't it!  Positive offsets have some fairly obvious uses; for negative ones I haven't had any very good ideas.  You could use them as a crude but fast form of image scaling, I guess.

I thought about using single-byte positive offsets and lots of regions to extract out luminance from YUV data (dst and src can have different "shapes", so you could use no offsets for the dst).  Haven't tested performance.

g3gg0

Fixed the graphics. Including http images from an https page is forbidden.