In this case, you have to wait *after* releasing the direction key. The window movement happens in the LiveView task (Evf on new cams, LiveViewMgr on old ones), and it needs quite a bit of time to switch the video mode. You could start with one second, and try lower values until it can no longer keep up.
Between the "press" and "unpress" events, on the other hand, you shouldn't need any delay, though 100 ms or so won't hurt either.
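To make the timing concrete, here is a minimal sketch of the sequence described above: press, optional short gap, release, then the wait that actually matters. It is not Magic Lantern code. `press`, `release` and `keeps_up` are hypothetical callbacks you would hook up to whatever sends the key events and checks that the window really moved, and the 0.1 s step is an arbitrary choice.

```python
import time


def move_zoom_window(press, release, settle_s, gap_s=0.0, sleep=time.sleep):
    """Send one direction-key press/release pair, then wait for LiveView to settle.

    press and release are hypothetical callbacks that send the key events.
    """
    press()
    if gap_s > 0:
        sleep(gap_s)   # optional: not needed, but ~0.1 s is harmless
    release()
    sleep(settle_s)    # the important wait comes *after* the release


def find_shortest_delay(keeps_up, start_s=1.0, step_s=0.1, floor_s=0.0):
    """Lower the delay from start_s until keeps_up(delay) reports a failure.

    keeps_up is a hypothetical check: it should perform a move with the given
    delay and return True if the camera followed it. Returns the last delay
    that worked, or None if even start_s was too short.
    """
    delay = start_s
    last_good = None
    while delay >= floor_s - 1e-9:
        if not keeps_up(delay):
            break
        last_good = delay
        delay = round(delay - step_s, 6)   # avoid float drift in the steps
    return last_good
```

In practice you would probably keep some margin above the value this returns, since the time needed for the mode switch may vary from one move to the next.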
Now, it is technically possible to move the zoom window around instantly (from one frame to another), but not by calling Canon functions. Overriding CMOS registers is a start (IIRC g3gg0 had a demo on this back in 2012), but it's not enough, as the black calibration (in particular, the column offsets) has to be redone. This is, according to my limited understanding, where most of the time goes during video mode switching. If one can figure out how to avoid the black calibration step (maybe by computing it in advance, and just loading the right values when needed), instant video mode switching, without any extra delay, might be doable.
You may get away without black calibration if you move the zoom window vertically (as the column offsets are likely to stay the same).
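The "compute in advance, load when needed" idea could be organized roughly like the sketch below. This is only an illustration of the caching scheme, not something that talks to the camera: `calibrate` is a hypothetical stand-in for the slow black calibration, and keying the cache on the horizontal position alone is an assumption taken from the guess above that vertical moves leave the column offsets unchanged.

```python
class ColumnOffsetCache:
    """Cache of column offsets, keyed by horizontal window position only."""

    def __init__(self, calibrate):
        # calibrate is a hypothetical slow routine: x -> list of column offsets
        self._calibrate = calibrate
        self._by_x = {}

    def precompute(self, xs):
        """Run the slow calibration ahead of time for the positions in xs."""
        for x in xs:
            self._by_x[x] = self._calibrate(x)

    def lookup(self, x, y):
        """Return the offsets for a window at (x, y).

        y is ignored on purpose (assumption: column offsets do not depend on
        the vertical position). An x that was not precomputed falls back to
        the slow path once and is then cached.
        """
        if x not in self._by_x:
            self._by_x[x] = self._calibrate(x)
        return self._by_x[x]
```

If that assumption turns out to be wrong, the key would have to become the full (x, y) position, and the table to precompute would grow accordingly.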