String localization?

Started by Marsu42, August 23, 2013, 05:10:56 PM



Marsu42

Are there any ideas, plans or schedules for localized ml menus/strings other than patching the whole source code and (this is the bad part) adjusting this patch for every changeset?

To people accustomed to the Internet and to participating in English forums this might be surprising, but at least in Germany a good part of the population in the east did not learn English in school (and forgot their Russian :-)), and any non-localized user interface is an objective, or at least subjective, barrier that is not to be underestimated.

To some, modifying their camera with a "hack" sounds obscure enough, but it's worse when the hack speaks a language you barely understand :-p ... so imho the potential ML userbase is narrower than it could be.

nanomad

We would need a way to dynamically load and allocate every ML string, though, and this could negatively impact ML memory usage (which sadly is a scarce resource on the camera).

As far as implementation goes, we just need to wrap around the printf functions and use a lookup table. We can even do it in a module.
EOS 1100D | EOS 650 (No, I didn't forget the D) | Ye Olde Canon EF Lenses ('87): 50 f/1.8 - 28 f/2.8 - 70-210 f/4 | EF-S 18-55 f/3.5-5.6 | Metz 36 AF-5

Marsu42

Quote from: nanomad on August 26, 2013, 03:30:26 PM
We would need a way to dynamically load and allocate every ML string, though, and this could negatively impact ML memory usage (which sadly is a scarce resource on the camera).

Yeah, I thought so, and of course it's much easier to hardcode everything in place w/o any wrappers - but your idea sounds good: just use an index and replace the string if a translation is available.

However, given the quality of ML and its potential worldwide userbase, imho this issue should be discussed. I hope one of the devs with deep insight into the code and memory can say whether this adds too much dead weight.

dmilligan

I think you could do this without using much extra memory. String literals are already taking up space in program memory, are they not? Taking them all out would result in using much less program memory. Then just load the strings from a file into dynamic memory. Whatever extra dynamic memory you use, you should be able to save in program memory. IDK the specifics of memory allocation on this platform, so someone correct me if I'm wrong, but this is how it would be on a normal computer.

Using this approach you could even sacrifice a little performance to make more memory available by only loading the strings on demand, when they are needed and then free() when you're done with them.

a1ex

String memory is only part of the problem.

Did you know ML has more than 350 menu items? (I've just counted them on 5D3). Menu items, not strings. Each one has name, most of them have one or two help strings, a few of them have more than two help strings, many of them have a list of values...

Many of these strings are changed often.

So, who's going to maintain the translations?

(hint: I didn't touch the user guide for around one year and nobody else touched it, and now it's outdated. I've proposed this to solve it, but was rejected.)

Also, what about special characters? Unicode is huge, code pages are ugly... choosing a good font for plain ASCII is hard enough.

I don't think we should go in this direction.

Marsu42

Quote from: a1ex on September 04, 2013, 07:00:32 AM
Did you know ML has more than 350 menu items? (I've just counted them on 5D3). Menu items, not strings. Each one has name, most of them have one or two help strings, a few of them have more than two help strings, many of them have a list of values...

I know there are a lot, but then again sitting down for a day and just translating them isn't impossible - these are usually small strings, and if you know what ML is about it takes little time. The problem with improving the _longer_ docs is that it bores your pants off, as there's less motivation - I don't think they're actually used that much(?) in-camera, since most of ML is more or less self-explanatory ... but a native menu language might lower the entry barrier a lot.

The real problem is change, as a1ex correctly stated: if anyone takes the time to translate the menus and messages, it has to be done in a way that doesn't make merging impossible with every source commit, or it will never happen, as there's no "stable" ML. But if it were just a matter of grabbing a string file, turning on your favorite audio book and sitting down for some hours, maybe in a shared effort with other people of the same language, you could count me in for German.

a1ex

That's the easy part. Who's going to update it when some string that is not used daily gets changed?

Also, a new class of bugs will appear (mostly alignment issues, cropped strings, special characters... or increased memory usage for languages for which 255 characters is not enough...)

I'm still surprised when I find stuff broken for 1, 2, 3 months (e.g. RGB spotmeter) and nobody has noticed it.

So, the real problem is to come up with something that will not turn into a maintenance nightmare. Sitting down and translating once is the easy part.

gotar

Quote from: a1ex on September 05, 2013, 08:51:09 AM
Who's going to update it when some string that is not used daily gets changed?

Somebody familiar with either the gettext tool or the https://www.transifex.com/ platform. A string with no translation remains in the original language (English) until Someone(TM) cares. If no one cares, then, well... it doesn't matter :) The entire Linux ecosystem uses loadable translations (po files, or qm for Qt apps); the ones that are not maintained simply become obsolete.

Quote
Also, a new class of bugs will appear (mostly alignment issues, cropped strings, special characters... or increased memory usage for languages for which 255 characters is not enough...)

As for the charset, most western languages can easily be transliterated to the base Latin alphabet. Such conversion, length checks etc. could be done at compile time. As for Asia - at least romanization of the Russian alphabet is also easy and standardized (ISO 9 or GOST 7.79/B). ASCII is enough to satisfy most people.

Quote
I'm still surprised when finding stuff broken for 1, 2, 3 months (e.g. RGB spotmeter) and nobody notices it, for example.

That's because ML with this bug was not released. Most people are scared of 'hacks' as such; bleeding-edge nightly builds are usually for developers and feature-hungry users.

dmilligan

Even if you're not going to translate, I think you will find that in the long run, moving string literals out of code and into data files will make your code more maintainable, not less. A major advantage is that you'll be able to modify these strings without recompiling. This makes it possible for a lot more people to contribute to ML - people who are not programmers and would otherwise have trouble getting the compiler up and running on their system.

Strings are data, code is logic, and IMO separation of data and logic is just good programming practice. I have done refactoring passes at work many times that were simply 'remove all string literals from code'. And translating is something that we have never done or even planned to do, but we still 'localize' all strings because it just makes the code more maintainable. Granted, this is in a higher-level language with the vast computing resources of a PC; still, I think the principle applies, and I think it is certainly possible to do it in a way that is not detrimental to performance.

UTF-8 is backwards compatible with ASCII, and is not terribly complicated to parse anyway. So you could perhaps have a module for UTF-8 capability that would only need to be loaded for languages that need more than 255 chars, and it is not something you, as an English-speaking developer, would have to care about at all. If a Chinese user wants to implement a UTF-8 parser module and create an appropriate font and translation, so be it. You have only put the framework in place for him to do so, and this framework benefits you regardless.

a1ex

I don't want to declare macros for every single string, and then have them in a separate file. Reason: you can no longer understand what exactly the program is printing there without having two files open.

400plus uses this technique. The programming style looks a lot cleaner than mine, but I have trouble browsing their code, because I need to switch back and forth between two files.


menuitem_t bramp_items[] = {
        MENUITEM_BOOLEAN(0, LP_WORD(L_I_DELAY),        &settings.bramp_delay,     NULL),
        MENUITEM_COUNTER(0, LP_WORD(L_I_SHOTS),        &settings.bramp_shots,     NULL),
        MENUITEM_TIMEOUT(0, LP_WORD(L_I_INTERVAL),     &settings.bramp_time,      NULL),
        MENUITEM_TIMEOUT(0, LP_WORD(L_I_EXPOSURE),     &settings.bramp_exp,       NULL),
        MENUITEM_BRTIME( 0, LP_WORD(L_I_RAMP_T),       &settings.bramp_ramp_t,    NULL),
        MENUITEM_BRSHOTS(0, LP_WORD(L_I_RAMP_S),       &settings.bramp_ramp_s,    NULL),
        MENUITEM_EVCOMP (0, LP_WORD(L_I_RAMPING_TIME), &settings.bramp_ramp_time, NULL),
        MENUITEM_EVCOMP (0, LP_WORD(L_I_RAMPING_EXP),  &settings.bramp_ramp_exp,  NULL),
};


and in another file:

        LANG_PAIR( I_MANUAL_L,           "Bulb min"                  ) \
        LANG_PAIR( I_MANUAL_R,           "Bulb max"                  ) \
        LANG_PAIR( I_INTERVAL,           "Interval"                  ) \
        LANG_PAIR( I_EXPOSURE,           "Exposure"                  ) \
        LANG_PAIR( I_RAMP_T,             "Ramp size (time)"          ) \
        LANG_PAIR( I_RAMP_S,             "Ramp size (shots)"         ) \
        LANG_PAIR( I_RAMPING_EXP,        "Ramping (exposure)"        ) \
        LANG_PAIR( I_RAMPING_TIME,       "Ramping (interval)"        ) \


ML version is more verbose, but I know exactly what I'm going to get:

    {
        .name = "Intervalometer",
        .priv       = &intervalometer_running,
        .max        = 1,
        .update     = intervalometer_display,
        .help = "Take pictures at fixed intervals (for timelapse).",
        .submenu_width = 700,
        .works_best_in = DEP_PHOTO_MODE,
        .children =  (struct menu_entry[]) {
            {
                .name = "Take a pic every",
                .priv       = &interval_timer_index,
                .update     = interval_timer_display,
                .select     = interval_timer_toggle,
                .icon_type  = IT_PERCENT,
                .help = "Duration between two shots.",
            },
            {
                .name = "Start after",
                .priv       = &interval_start_timer_index,
                .update     = interval_start_after_display,
                .select     = interval_timer_toggle,
                .icon_type  = IT_PERCENT,
                .help = "Start the intervalometer after X seconds / minutes / hours.",
            },
            {
                .name = "Stop after",
                .priv       = &interval_stop_after,
                .max = 5000, // 5000 shots
                .update     = interval_stop_after_display,
                .icon_type  = IT_PERCENT_LOG_OFF,
                .help = "Stop the intervalometer after taking X shots.",
            },
            #ifdef FEATURE_FOCUS_RAMPING
            {
                .name = "Manual FocusRamp",
                .priv       = &bramp_manual_speed_focus_steps_per_shot,
                .max = 100,
                .min = -100,
                .update = manual_focus_ramp_print,
                .help  = "Manual focus ramping, in steps per shot. LiveView only.",
                .help2 = "Tip: enable powersaving features from Prefs menu.",
                .depends_on = DEP_AUTOFOCUS,
                .works_best_in = DEP_LIVEVIEW,
            },
            #endif
            MENU_EOL
        },
    },


Of course, 400plus has only 180 strings, so it's a lot less work there.

So, where's the advantage?

Marsu42

Quote from: a1ex on September 05, 2013, 03:26:40 PM
So, where's the advantage?

For coding, I don't see any advantage, and of course ML code is much easier to read as it is. This should be weighed against the wider user base of a multilang version - imagine how and where Linux would be if it were English-only.

But this decision has to be made by the people who have to work with the increased code complexity on a daily basis. I can only point out that I see the demand for this (otherwise the Canon firmware could also be EN-only) and say that I'd do my small part, even if I personally am fine with the way it is now.

dmilligan

Quote from: a1ex on September 05, 2013, 03:26:40 PM
So, where's the advantage?

Separation of concerns. The point is: if you are writing the menu code, you don't really care what the string is, you just care that you are pointing to the correct resource. You don't care b/c you are wearing your "program flow" hat. You don't care about what the data is, just that the program flows correctly and the operations on the data occur as intended. Later on, you (or even someone else) come along with your "technical writer" hat on. This person only cares that the data is correct, and can easily fix it, all in one place, without even recompiling. Then perhaps later on you come along with your "tester" hat on, and you create special test-case versions of the string data files with unusual chars, or very short or very long strings, that you can use to test how well your menu display code, fonts, etc. hold up under a variety of circumstances.

You should rely on good, verbose naming conventions and good code comments to make source code clear, not string literals. When I'm browsing the ML code I need to have like 8+ files open at the same time b/c there are so many scattered, undocumented functions, macros and macro functions that I have no idea what they do or where they are defined, and I can't say that having strings in the code really helps me understand it at all. Most of these strings are in isolated menu initialization functions anyway.

All sorts of things you do in code are defined in other files, why should strings be a special case that aren't, esp. when they typically don't have any effect on the logical execution or flow of the program?

Besides this is what tabs are for (or are you a hard core vi or emacs user? ;))

gotar

Quote from: a1ex on September 05, 2013, 03:26:40 PM
I don't want to declare macros for every single string, and then have them in a separate file. Reason: you can no longer understand what exactly the program is printing there without having two files open.

So don't declare any macros; use the original (English) string as the key - to be replaced when a localized version is available - wrapped by an appropriate function:

printf (_("Foo"));

This way one can easily disable the entire i18n at build time by defining _(s) as a simple pass-through that returns s (if i18n brings some performance penalty).

dmilligan

Quote from: gotar on September 05, 2013, 05:05:13 PM
So don't declare any macros; use the original (English) string as the key - to be replaced when a localized version is available - wrapped by an appropriate function:

printf (_("Foo"));

This way one can easily disable the entire i18n at build time by defining _(s) as a simple pass-through that returns s (if i18n brings some performance penalty).

But then if you wanted to simply change the English wording a little bit, you'd have to change the files for all translations, even if you don't need to change the actual translation. And you'd have a big performance hit, b/c you'd have to keep in memory both the English version of the strings and whatever translation you are using, and you'd have to do string comparisons to look up the appropriate translated string.

Another potential advantage of doing it the way I described is that you could easily create a "minimal" translation that doesn't include any help texts and uses short abbreviations for power users who are familiar with ML, minimizing memory usage and freeing up as much as possible for other things.

gotar

Quote from: dmilligan on September 05, 2013, 06:11:11 PM
But then if you wanted to simply change the English wording a little bit, you'd have to change the file for all translations even if you don't need to change the actual translation.

That's the point - how do you know that you don't need to change the translations for the languages you don't know? If you are sure (like when fixing a typo), nothing prevents you from using some one-liner (sed, perl, whatever you know) and correcting all the translations in 3 seconds. On the other hand, if some function's behaviour has changed and you only fix the original MACRO text, you'll end up with misleading or simply wrong translations.

Let me repeat: this is not my idea, but something that is used all around the world: the gettext library. It works this way for good reasons; sooner or later you'd step into the same issues. Instead, KISS and do not reinvent the wheel.

Quote
And you have a big performance hit, b/c you'd have to have the English version of the strings in memory and whatever translation you are using, and you'd have to do string comparisons to look up the appropriate translated string.

Not necessarily - that depends on the implementation only. You could prepare the lookup table at build time (replace the actual strings with identifiers, move the text to a separate loadable file, the same as the translations). Why do you insist on doing manually something that could be handled in preprocessing?

Another thing: what should be displayed for missing texts when you load a partial translation? Or when you don't have any translation file at all?

Quote
Another potential advantage of doing it the way I described is that you could easily create a "minimal" translation that doesn't include any help texts and uses short abbreviations for power users who are familiar with ML, minimizing memory usage, and freeing up as much as possible for other things.

And why couldn't you do the same with 'my' way? I see no difference here. And is it even true that these strings have such a serious memory impact?

dmilligan

Quote from: gotar on September 05, 2013, 07:02:01 PM
Not necessarily - that depends on the implementation only. You could prepare the lookup table at build time (replace the actual strings with identifiers, move the text to a separate loadable file, the same as the translations). Why do you insist on doing manually something that could be handled in preprocessing?
Because you were suggesting using gettext as your implementation, which does this at runtime, not compile time, and requires libintl, which you would have to port to DryOS and which would further bloat ML.

https://www.gnu.org/software/gettext/manual/gettext.html#Importing

Quote from: gotar on September 05, 2013, 07:02:01 PM
You could prepare the lookup table at build time (replace the actual strings with identifiers, move the text to a separate loadable file, the same as the translations)
This sounds like it would require modifying the ARM compiler; I don't see how that is KISS.

Marsu42

I'm not much of a programmer geek, but wouldn't it be possible to write some small python/whatever script that quickly adds the current English string as a comment behind all the macro-ized multilang statements? That way, the programmers would know what it's all about w/o opening two text files, and everyone would be happy. If these comments are a problem for tracking changes in the source, another script could also quickly remove them before pushing changes via Mercurial.

gotar

Quote from: dmilligan on September 05, 2013, 09:07:26 PM
Beacuse you were suggesting using gettext as your implementation which does do this at runtime

Native (original) strings are required anyway for missing translations. I gave the gettext example mostly as an argument against using SOME_MACRO_STRING instead of the actual text; whether gettext itself is suitable or needs its own (probably much simplified) implementation is a different question.

Quote
This sounds like it would require modifying the ARM compiler, I don't see how that is KISS.

It just needs some 30-line perl preprocessor to be run on the source tree at most (i.e. if there is a real need for folding these constants to save memory), so that's not the real issue here.

dmilligan

Quote from: gotar on September 06, 2013, 01:42:10 PM
Native (original) strings are required anyway for missing translations.

I assume by this you mean that 'native strings' are required to be compiled into the binary as string literals for fallback purposes. Why? You could simply fall back to looking a string up in your default English translation file if it is missing from a certain translation. If a string is missing altogether, that's a bug, not a situation you need to figure out how to handle gracefully.

I'm still of the opinion that code that is not littered with string literals is much cleaner: all of your code is algorithmic and not sprinkled with data. Pretty much all the code I write professionally is like this. Like Marsu42 said, you can always add code comments for clarification.

gotar

Quote from: dmilligan on September 06, 2013, 02:56:20 PM
I assume by this you mean that 'native strings' are required to be compiled into the binary as string literals for fallback purposes. Why? you could simply revert to looking it up in your default english translation file if a particular string is missing from a certain translation.

Of course; you can even set up a chain of fallbacks by specifying multiple languages in the order the user is familiar with them (as the LANGUAGES env variable does). But why do you insist on removing the strings from the binary? I see no confirmation that this is a real problem (unless all the languages would have to be loaded at once).

Quote
I'm still of the opinion that code that is not litered with string literals is much cleaner, all of your code is algorithmic and not sprinkled with data. Pretty much all the code I write professionally is like this. Like Marsu42 said, you can always add code comments for clarification.

Do you name objects in your code by their function, or with some plain sequence? Do you export the literal names of symbols, or identifiers shortened to compress the binary? You know what this technique is called? Obfuscation.

Like a1ex said, it's not comfortable to maintain or read. Like I said, it leads to outdated translations (consider inverting a function's effect). If you don't like literals, simply fold every printf in your editor, but do not obfuscate the sources.

nanomad

Indeed. If we are ever going to do this, it will probably be a gettext-like system. It's quick, portable and doesn't add much clutter.

stevefal

In my Android development all strings are provided by localized resources file, and looked up dynamically at runtime. Here's the same excerpt in 3 languages.


<string name="connection_connected">Connected</string>
<string name="connection_connecting">Connecting...</string>
<string name="connection_default_ssid_name">The wifi network</string>
<string name="connection_disconnected">Not connected</string>
<string name="connection_reconnecting">Reconnect required</string>
<string name="connection_server_unreachable">%s is stopped or unreachable. Retrying momentarily.</string>

<string name="connection_connected">Connexion</string>
<string name="connection_connecting">Connexion en cours...</string>
<string name="connection_default_ssid_name">Le réseau Wi-Fi</string>
<string name="connection_disconnected">Non connecté</string>
<string name="connection_reconnecting">Connexion obligatoire</string>
<string name="connection_server_unreachable">%s est arrêtée ou inaccessible. Nouvel essai dans un instant.</string>

<string name="connection_connected">接続済</string>
<string name="connection_connecting">接続中</string>
<string name="connection_default_ssid_name">Wi-Fi ネットワーク</string>
<string name="connection_disconnected">不接続</string>
<string name="connection_reconnecting">要再接続</string>
<string name="connection_server_unreachable">%s は稼働停止あるいは接続不能、即再試行 </string>


This is convenient for localization because the translator can edit the string resource file directly, using the string name and other translations as hints.

Of course having all the character sets is another question.
Steve Falcon

Licaon_Kter

The Android way with strings is pretty nice, just drop a strings-XX.xml and be done.

Is that possible (read: easy) to implement in C?

a1ex

Reading XML is no problem (though I'd prefer something less verbose).

The easiest way is probably the one proposed by nanomad: a module that hooks into bmp_printf. That way, existing code with English strings will be kept unchanged, and since ML menu entries are sometimes looked up by name (e.g. from Lua), this method won't break the current functionality. Memory restrictions are less of a problem with modules, so it should be fine.

Looking up each string on the fly might be a bit slow (to be tested), but one can imagine methods for speed-up (e.g. hash table, caching the lookup results or building a look-up table at startup).

Then there's Unicode support. ML supports 2 types of fonts: RBF (borrowed from CHDK) and Canon font format (currently used for icons). The former is limited to 256 characters, though apparently it was successfully adapted for Chinese. The latter supports Unicode characters and can be generated from bitmaps. In both cases, importing from a desktop font is a bit painful, but doable. Suggestions for a small Unicode library can be found here.

ML font is a custom design, so languages with accents (e.g. French) may have some trouble on the aesthetics side (for best look, the accented characters would need to be created from scratch).

Canon does it by embedding only the characters they actually use, so in their built-in font, for certain languages (e.g. Japanese), only some characters are available, and the exact set differs between camera models. Unfortunately, it's larger than the ML menu font, so we can't simply use it everywhere.

QEMU can be used for development and testing (e.g. menu screenshots), as it's currently able to run unmodified ML binaries for a few cameras, to some extent. So, checking for obvious errors such as string overlaps should be easy.

Still, the main problem is maintenance. Reusing some tools that make translators' life easier (e.g. highlighting entries that need attention, or showing when a string edit breaks some translations, or warnings if certain strings are too long) would be great, but I have no experience with any of them.

Just some suggestions, if anyone wants to implement it.

GutterPump

I wonder if it would be possible to translate with something like Poedit, as used for WordPress module/theme translation.

It's not automatic translation, but maybe it would be possible to create different variables in the ML code for the native language and the translated language (if one exists), in a single file with all the translated words.