java .properties with Unicode/bidi chars?
Thread poster: Jan Sundström
Jan Sundström  Identity Verified
Sweden
Local time: 11:38
English to Swedish
+ ...
Mar 12, 2009

Hi all,

I read this very interesting problem description about Unicode characters in .properties files:
http://www.aldridge.de/2009/01/fun-with-character-encodings/

Is it technically possible to just open a .properties file in Trados and translate it straight into Hebrew or Japanese? Or will Trados run into the same obstacles as described below?!

What is your preferred approach to translate these files?

TIA;

/J


Fun with character encodings

What do ASCII, ANSI, Latin-1, Windows-1252, Unicode and UTF have in common?

They are a pain in the neck for translators - but they are also ways to encode characters in files, even in plain text files that usually seem as “un-encoded” as possible. Most of the time you don’t have a problem with it: you open a txt file and you don’t really know (or need to know) what character encoding it has. The only reason most people even know about this at all is the “bush hid the facts” trick in Notepad (see below). I am not going into the history and details of the various formats; at the bottom are some links to other pages that deal with that if you want to learn more. I am merely looking at the consequences it can have for me during translation.

What I care more about is the fact that it can really break your neck during the translation of string files. I run into it on and off, and every time it happens I learn a little bit more about it. I have wanted to write about it for quite a while, and since the whole thing came up again earlier this week, I think it is time now.

We have a little update tool for an application that is written in Java. Java programs usually have their strings in .properties files. Those files are traditionally read in the 8-bit ISO 8859-1 encoding (aka Latin-1), and to keep them safe in any ASCII-based workflow, everything beyond plain ASCII - accented characters like ü, Ü, é or ñ, and certainly anything outside Latin-1 - has to be converted into Unicode escape characters, sometimes referred to as Java escape characters. I think most of us have come across other escape characters before, for example \n for a new line or \t for a tab. Unicode escape characters are a little more involved, using a \uHHHH notation, where HHHH is the hex index of the character in the Unicode character set. So, for example, the ß in a Java properties file has to be encoded as \u00df. To convert those characters, I use Rainbow, which is part of the Okapi Framework. It has a handy Encoding Conversion Utility that allows you to convert files from one encoding to another.
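To make that conversion concrete, here is a minimal Java sketch of the idea - the helper below is just an illustration, not code from Rainbow or any other tool; it escapes everything outside plain ASCII into the \uHHHH form:

```java
public class PropEscape {
    // Convert anything outside printable ASCII into the \uHHHH escape form
    // used in .properties files, so the result survives any ASCII-based encoding.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x20 && c < 0x7F) {
                sb.append(c);                            // plain ASCII stays as it is
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Straße"));            // prints: Stra\u00dfe
    }
}
```

(The JDK used to ship a little native2ascii tool that performed essentially this conversion.)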

Sounds really easy, right? Right? Now what is this woman complaining about again? Well, it’s not that easy. The conversion tool is designed to work with 8-bit, ASCII-based encodings. So what IS the problem then, if Java properties files use ASCII-based encodings anyway? Well, TagEditor takes the ASCII file and, when you “Save as Target” after translation, converts the file into UTF-8. And even that is still not the problem; the problem is that it writes UTF-8 without a BOM (Byte Order Mark). The BOM is an (invisible) byte sequence at the very beginning of a file - three bytes, EF BB BF, in the case of UTF-8 - which basically tells a program “This is a Unicode file”. Without the BOM, some programs do not recognize the encoding of the file and assume ASCII - and that is the problem with Rainbow (and also with Passolo, a program that just got bought by SDL).

If you try to convert the encoding of a BOMless Unicode file, it goes terribly wrong. As I mentioned, the correct conversion of ß will give you \u00df. Converting a BOMless file will “double escape” the extended characters, and you get \u00c3\u0178 - clearly not the same. The “double escape” is actually a good indicator that something went wrong: if you check your file and see that your extended characters are represented by two escape sequences each, you know the conversion failed. Of course, that check can be difficult with Greek, Russian or Asian languages, simply because every single character is escaped. I usually try to find a short string and count.
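To see the mechanics of that mis-reading, here is a small Java sketch (my own illustration, not code from TagEditor or Rainbow) that reproduces the double escape for ß:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DoubleEscapeDemo {
    public static void main(String[] args) {
        // "ß" encoded as UTF-8 is the two bytes C3 9F.
        byte[] utf8Bytes = "ß".getBytes(StandardCharsets.UTF_8);

        // A tool that assumes windows-1252 sees two separate characters:
        // 0xC3 -> Ã (U+00C3) and 0x9F -> Ÿ (U+0178).
        String misread = new String(utf8Bytes, Charset.forName("windows-1252"));

        // Escaping each misread character gives the "double escape" described above.
        StringBuilder sb = new StringBuilder();
        for (char c : misread.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        System.out.println(sb);      // prints: \u00c3\u0178 instead of \u00df
    }
}
```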

Now, how do you know how a file is encoded? Right now, I use Notepad++ to check. It has a handy little Format menu that shows which encoding is used and also lets you convert from one encoding to another. Supported formats are Windows, UNIX, Mac, ANSI, UTF-8 w/o BOM, UTF-8 and UCS-2 Big and Little Endian. Surprisingly, Windows Notepad is one of the few programs that actually manages to decipher the Unicode encoding even without a BOM: just open the BOMless file in Windows Notepad and save it without changes. Unfortunately, you usually just don’t know, and usually it isn’t even an issue.
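If you would rather check programmatically than in an editor, sniffing the first three bytes for the UTF-8 BOM (EF BB BF) covers this particular case. The little Java sketch below is just an illustration, not tied to any of the tools mentioned:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BomCheck {
    // Report whether a file starts with the UTF-8 BOM (EF BB BF).
    // Without it, many tools fall back to guessing and often assume ANSI/windows-1252.
    static boolean hasUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = in.readNBytes(3);
            return head.length == 3
                    && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        System.out.println(hasUtf8Bom(file) ? "UTF-8 BOM present" : "no UTF-8 BOM");
    }
}
```

Keep in mind that the absence of a BOM does not prove the file is not UTF-8 - that ambiguity is exactly what causes the trouble described above.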

I actually happened to talk to Yves Savourel, who works at ENLASO and on the Okapi Framework (and on about a gazillion other things related to localization), and he has been very helpful. He explained a few things to me in a little more detail.

* The issue: a BOMless UTF-8 file is recognized as “windows-1252” encoding
* a UTF-8 file uses two or more bytes to encode the extended characters
* the application thinks each of those bytes is a separate character and converts each into a Unicode escape sequence

* The solution: in Rainbow, manually force the encoding of the source file to UTF-8
* in Rainbow, use the Add/Remove BOM utility to set the BOM properly (what that amounts to is sketched below)
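For the curious, here is roughly what setting the BOM amounts to for UTF-8 - a sketch of the idea only, not Rainbow's actual implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class AddBom {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Prepend a UTF-8 BOM to a file that lacks one, so that tools which
    // sniff the first bytes recognise it as Unicode instead of ASCII/ANSI.
    static void addUtf8Bom(Path file) throws IOException {
        byte[] content = Files.readAllBytes(file);
        if (content.length >= 3
                && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB
                && (content[2] & 0xFF) == 0xBF) {
            return;                      // BOM already present, nothing to do
        }
        byte[] withBom = new byte[UTF8_BOM.length + content.length];
        System.arraycopy(UTF8_BOM, 0, withBom, 0, UTF8_BOM.length);
        System.arraycopy(content, 0, withBom, UTF8_BOM.length, content.length);
        Files.write(file, withBom);
    }

    public static void main(String[] args) throws IOException {
        addUtf8Bom(Paths.get(args[0]));
    }
}
```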

If you got through all this stuff, you may now wonder whether you will ever run into this issue. It is also not just a question of BOM or no BOM; file encoding in general raises issues in other applications too. To be honest, I don’t know how often freelance translators are confronted with these types of files, but here are the situations where I keep my eyes peeled:

* Java files (.properties)
This was the most recent issue that triggered this post.
* String export files (often XML files or even plain txt)
I tend to get the strings for REALBasic applications in XML files, though I believe they are created by RegexBuddy.
* Non-Windows files or Windows files that will be used on other OSs
We run into this issue with txt files that were created on a Mac and that will be used in InstallShield-type applications, for example to display the license agreement or a readme file.
* All files
Haha, very funny - I know. What I mean is that I have experienced various issues with files when I have to process them through different applications in order to get CAT-translatable files, for example when we receive a weird string file that Trados doesn’t understand and we need to find a manageable way to extract the translatable text.

Anyway, maybe this will help someone else in the situation where the client comes back and claims the files are corrupt or something like that. Otherwise, I apologize for boring the heck out of you. You should have stopped reading my post a long time ago.

Some interesting links with related information:

Okapi Framework
Notepad++
Bush hid the facts hoax and Bush hid the facts on Wikipedia
Mojibake
How to Determine Text File Encoding
Cast of Characters: ASCII, ANSI, UTF-8 and all that


 
esperantisto  Identity Verified
Local time: 12:38
Member (2006)
English to Russian
+ ...
SITE LOCALIZER
OmegaT Mar 12, 2009

OmegaT is itself a Java program, and it is localized into multiple languages, including Arabic. As far as I know, the Arabic localization was made with, yes, OmegaT. If you are interested, you may ask in the OmegaT user group.

Hebrew or Japanese?


Japanese and Chinese (traditional and simplified) localizations of OmegaT are also available.


 
Samuel Murray  Identity Verified
Netherlands
Local time: 11:38
Member (2006)
English to Afrikaans
+ ...
Translating .properties files Mar 12, 2009

Jan Sundström wrote:
Is it possible technically to just open a .properties file in Trados and translate it straight away to Hebrew or Japanese?


I don't know. Does Trados recognise a .properties file as a .properties file? If Trados is worth its salt, it will likely convert extended characters into escaped ones when it saves the file. The same is true of many translation programs that claim support for .properties.

OmT certainly escapes them. In fact, if you give OmT a UTF-8 file with a BOM and without escapes, and translate it, the translation will be a BOM-less file with escapes (but I'm not sure about the encoding -- my two most trusty Unicode editors report different things).

What is your preferred approach to translate these files?


The best thing is to ask the client. If for some reason your client wants unescaped files, he's going to be very unhappy if you escape the characters (even though escaping is the norm in .properties files).

If you use the Translate Toolkit with prop2po and po2prop, you can choose at the po2prop stage whether you want extended characters escaped or not (the default is escaped, if I'm not mistaken).


 
RieM  Identity Verified
United States
Local time: 05:38
English to Japanese
+ ...
DejaVu Mar 12, 2009

Hi,

I do this all the time -- into Japanese. With DV, all you have to do is select the right encoding when exporting. I have never had a complaint, never had an issue.

I also have Trados and I know it has a properties file filter via Synergy. If what's described above is always true, then it's a bug that should be, or should have been, fixed (fingers crossed!). It's not that SDL is ignorant of the BOM. Indeed, the Tag Settings Manager has a variety of options regarding the BOM. So I don't know why they cannot handle properties files....

Regards,
Rie


 

