But the problem is you're asking for the near impossible: most cameras can't deal with writing the video alone, much less an audio stream with it. g3gg0 did this with MLV, and it has issues just from the metadata being written.
Sorry, I'm not making myself clear. If you could read a simple voltage spike from, say, the mic input, or the flash, etc., then you can read 0s and 1s, right? I don't know how high a sampling frequency you'd need, but let's say it's 10 times the frame rate for easy sampling, so at 24fps you'd need to read at 240 bits per second. Okay, so you're reading 0s and 1s from whatever, and in the MLV file you're writing either a 0 or a 1 for each frame, which gives you a 24-bit signature for each second.
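To make the sampling step concrete, here's a rough Python sketch of boiling 240 raw reads per second down to one bit per frame. The majority vote is my own assumption for how to pick the bit, and `bits_per_frame` is a made-up name, not anything that exists in Magic Lantern or MLV.

```python
SAMPLES_PER_FRAME = 10  # 240 reads per second at 24 fps


def bits_per_frame(samples, samples_per_frame=SAMPLES_PER_FRAME):
    """Collapse raw 0/1 reads into one bit per frame.

    Assumption: a frame counts as 1 when more than half of its reads are 1.
    A trailing partial frame is dropped.
    """
    bits = []
    last_start = len(samples) - samples_per_frame
    for start in range(0, last_start + 1, samples_per_frame):
        window = samples[start:start + samples_per_frame]
        bits.append(1 if sum(window) * 2 > len(window) else 0)
    return bits


# Three frames' worth of reads: all high, all low, mostly high.
raw = [1] * 10 + [0] * 10 + [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
frame_bits = bits_per_frame(raw)
```

Oversampling like this means a single misread spike doesn't flip the bit for the whole frame.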
At the same time, the recorder is also recording those 0s and 1s (a click track) to, say, channel 4, or to channel 2 if you can live with mono.
Now, you hack the recorder so you can plug in a cable that sends out these spikes 240 times per second (which get sampled down to 1 bit per frame). You also program the recorder to send out a specific "smoke signal" once it's started, so it sends batches of 12 0s and 1s for the first few seconds, in a pattern that you can match up later.
So when you load the MLV file, the software builds one signature from the encoded click track and another from the audio file, and then matches them up. Once they're matched up, other software would know where the audio starts, cut it to that point, and put the start point in the MLV file, which the NLE (thinking big here) would ultimately use to match up the video and audio tracks.
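The matching step could look something like this in Python. It's only a sketch under my own assumptions: both signatures are already lists of one bit per frame, the audio recording starts before the video and covers all of it, and `find_offset` is a hypothetical helper, not part of any existing MLV tool. It slides the video signature along the audio signature and keeps the position with the fewest mismatched bits.

```python
FPS = 24  # one signature bit per frame, as described above


def find_offset(video_bits, audio_bits):
    """Return (offset_in_frames, mismatches) for the best alignment.

    Tries every position where video_bits fits inside audio_bits and
    counts differing bits; the earliest position with the fewest wins.
    """
    n = len(video_bits)
    if n == 0 or n > len(audio_bits):
        raise ValueError("video signature must be non-empty and fit in the audio")
    best_offset, best_errors = 0, n + 1
    for offset in range(len(audio_bits) - n + 1):
        window = audio_bits[offset:offset + n]
        errors = sum(v != a for v, a in zip(video_bits, window))
        if errors < best_errors:
            best_offset, best_errors = offset, errors
    return best_offset, best_errors


# Made-up data: the audio track holds 20 frames of bits and the
# camera started rolling 7 frames in.
audio_sig = [0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0]
video_sig = audio_sig[7:15]
offset, errors = find_offset(video_sig, audio_sig)
audio_start_seconds = offset / FPS  # where to cut the audio
```

Counting mismatches instead of demanding an exact match means a dropped or misread bit here and there still lands on the right spot, as long as the signature is long enough to be unique.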
I'm extemporizing here, but that's the general idea. You mount the hacked audio recorder on a cold shoe, and maybe, if the dev is a real genius, the camera sends a signal to start and stop the audio recorder, and then we're really farting through silk.
