Java, OS X Terminal, and Unicode - a Sordid Tale

Sat, 01/27/2007, 01:46

I'm pissed off, and I'm not sure who or what to blame.

See, right now I'm trying to build a front-end to the UNIHAN database, which I've found profoundly useful for my East Asian language research, specifically because it contains Tang Dynasty-era pronunciations for quite a few CJK (Chinese/Japanese/Korean) characters. The Mandarin, Japanese, and Sino-Korean pronunciations are also very nice to have. Unfortunately, the database itself is a massive UTF-8 flatfile, keyed by each character's hexadecimal Unicode code point (U+4F60 and the like), so it's not the easiest thing in the world to use.
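To give a rough idea of the lookup involved, here's a minimal sketch. It assumes each data line in the flatfile is tab-separated as code point, field name, value, and that comment lines start with "#". The class and method names (UnihanKey, keyFor, parseLine) are made up for illustration, and the sample line is only there to show the layout.

```java
class UnihanKey {
    // Format a character's code point the way the flatfile keys it, e.g. "U+4F60".
    static String keyFor(String ch) {
        return String.format("U+%04X", ch.codePointAt(0));
    }

    // Split one line into {key, field, value}.
    // Returns null for blank lines, comment lines, and anything without three fields.
    static String[] parseLine(String line) {
        if (line.isEmpty() || line.startsWith("#")) {
            return null;
        }
        String[] parts = line.split("\t", 3);
        return parts.length == 3 ? parts : null;
    }

    public static void main(String[] args) {
        // Illustrative line in the assumed layout (hypothetical, not copied from the file).
        String sample = "U+4F60\tkMandarin\tn\u01d0";
        String wanted = keyFor("\u4f60");
        String[] parts = parseLine(sample);
        if (parts != null && parts[0].equals(wanted)) {
            // Print only the ASCII parts so the terminal's encoding doesn't matter here.
            System.out.println(parts[0] + " has a " + parts[1] + " entry");
        }
    }
}
```

A real front-end would read the file with an explicit UTF-8 reader and index the lines by key once, rather than rescanning the whole thing for every lookup.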

So I'm using Java to build an application that lets me type in a Han character and get the UNIHAN data. What pisses me off, though, is the way that (Java/OS X/OS X's Terminal) handles Unicode output. It's messed up as hell, and I don't. know. why.

Here's an example. Suppose I have code that looks like this:

System.out.println("你好");
System.out.println("\u4f60\u597d");

The first println works, and outputs the characters correctly to OS X's Terminal. The second println, using the \u escapes for those two characters' code points (U+4F60 and U+597D, respectively), prints out two question marks. Cut-and-pasted, straight from Terminal:

你好
??

I shit you not.

This isn't exactly something anyone I know can help me with, either. The most helpful answer I've gotten is that there're some very rare variant characters that have a codepoint associated with them, but no glyph representation in a font. This I knew, but there's a slight issue with that explanation. Namely, that the characters "你好" are roughly the Chinese equivalent of "hello".

Well, that, and the fact that the characters printed out just fine ON THE LINE RIGHT ABOVE.

So I don't know whether to blame Java, OS X, or Terminal, all of which are purported to be Unicode-friendly. Actually, I think there's a strong chance that it's my fault, but we can pretend I didn't admit that. What I may just end up doing here is declaring a "screwit" situation and jumping to the Swing app I was thinking about making, since that supposedly solves many issues.
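One way to narrow down the culprit would be to take the platform default encoding out of the picture. The sketch below is an experiment, not a diagnosis: it assumes Terminal is set to display UTF-8, and the class and helper names (EncodingDemo, roundTrip) are made up. The first part shows that encoding these characters in a charset that can't represent them turns each one into a question mark. The second part wraps System.out in a PrintStream that is told explicitly to write UTF-8.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Encode the text to bytes in the given charset, then decode it again.
    // Characters the charset can't represent come back as its replacement, '?'.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String hello = "\u4f60\u597d";

        // US-ASCII has no way to encode either character, so this prints two question marks.
        System.out.println(roundTrip(hello, StandardCharsets.US_ASCII));

        // Same string, but the bytes written are UTF-8 no matter what the default is.
        // Whether it displays correctly still depends on the terminal's own setting.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(hello);
    }
}
```

If the second line shows up correctly, the default output encoding is the likely suspect. Running with -Dfile.encoding=UTF-8 and compiling with javac -encoding UTF-8 are the other two knobs worth trying.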



Comments about "Java, OS X Terminal, and Unicode - a Sordid Tale" :



You should add little plussies + after your u's. Like: System.out.println("\u+4f60\u+597d"); That's what I'd do. U R SOFA KING WEE TODD ID.
-left by kythri ( http://www.kythri.net)


Dammit, why is my comment all unformatted? I put several hard returns in there.
-left by kythri ( http://www.kythri.net)


Haha! He said "hard".
-left by Carolinian ( gregorypeck@water.planet.0rg)


Holy crap, Carolinian is Gregory Peck?!
-left by kythri ( http://www.kythri.net)


This is a 2 year old post I know, but anyway: I'd put money on the OS X terminal being UTF-8 and your System.out.println("") outputting UTF-16; the line above it is outputting characters straight from your source code file, which would be saved in UTF-8 and therefore outputting UTF-8 to the terminal. I could be wrong though.
-left by bloopletech ( http://i.bloople.net)
