Pay for Hesitation: Creating a java UTF-8 string?

Thursday, November 18, 2010

Creating a java UTF-8 string?

I remember running into a similar problem back when we were writing a search engine together,
but in the end we just let it drop.. XD

Today I ran into the same problem again..

It turns out the explanation is as follows:


======================================================================
There is no such thing as a "UTF-8 String". A String is composed of characters, whereas UTF-8 is a method of converting between characters and bytes.

======================================================================
What do you mean by "a String in UTF-8 format"? Java Strings are composed of 16-bit chars, so they are UTF-16 (although Unicode surrogates aren't handled properly until 1.5). UTF-8 is an appropriate encoding for an array of bytes, which you already have.

===============================================================================
There is no such thing as a UTF-8 string. A String is a string of characters, each one of which can be returned by the charAt(int pos) method.


======================================================================
So, you want to store a byte-array (in this case it contains characters in UTF-8 format) into a String in such a way that the byte-array does not get changed/encoded? You want to circumvent the UTF-8 to UTF-16 encoding? I don't think that is possible.

A String contains an array of 'char', not an array of 'byte'. And a char is a UTF-16 character.... A 'byte' is not a 'char', so conversion is necessary. Any String-constructor taking a byte-array will do some kind of conversion on the input byte-array (to properly convert it into a char-array).
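
To make the char/byte distinction from these replies concrete, here is a minimal sketch. The class name StringVsBytes and its helper methods are my own, purely for illustration. It shows that the same String has a length in chars, and only gets a length in bytes once you pick a charset.

```java
import java.nio.charset.StandardCharsets;

final class StringVsBytes {
    // The two-character greeting from this post, spelled with Unicode escapes
    // so the encoding of the source file does not matter.
    static final String HELLO = "\u54c8\u56c9";

    // chars -> bytes: encoding with an explicitly named charset.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // bytes -> chars: the String constructor performs the conversion.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(HELLO.length());        // 2 chars (UTF-16 code units)
        System.out.println(toUtf8(HELLO).length);  // 6 bytes once encoded as UTF-8
        System.out.println(fromUtf8(toUtf8(HELLO)).equals(HELLO)); // true
    }
}
```

So "UTF-8" only ever describes the byte array, never the String itself.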


===============================================================================



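So when A.java contains String str = new String("哈囉");
the compiler (javac, not the JVM) stores the "哈囉" literal in A.class, and at run time the JVM holds it in memory as a UTF-16 String.
Then, whenever the running A.class turns that String into bytes without naming a charset (getBytes(), a default Writer or PrintStream, and so on), "哈囉" is encoded with the platform default charset, which on the JVMs of the time followed the operating system's locale.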

So when A.class runs on Chinese Windows, the bytes of "哈囉" come out in Big5,
whereas on Linux (with the usual UTF-8 locale) they come out in UTF-8.
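
A quick way to see this platform dependence is the sketch below. The class name DefaultCharsetDemo is made up, and I am assuming the JVM ships the Big5 charset, which the standard JDK builds do. It prints the default charset and compares the byte counts.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetDemo {
    // The greeting from this post, as Unicode escapes.
    static final String HELLO = "\u54c8\u56c9";

    // Number of bytes the string occupies in the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // Whatever this JVM picked up as its default.
        System.out.println("default charset: " + Charset.defaultCharset());
        // getBytes() with no argument silently uses that default.
        System.out.println("getBytes(): " + HELLO.getBytes().length + " bytes");
        // Naming the charset gives the same answer on every machine.
        System.out.println("Big5:  " + encodedLength(HELLO, Charset.forName("Big5")) + " bytes");
        System.out.println("UTF-8: " + encodedLength(HELLO, StandardCharsets.UTF_8) + " bytes");
    }
}
```

The Big5 and UTF-8 lines are identical everywhere; only the getBytes() line changes from machine to machine.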

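Suppose A.class is a server program running on Linux, and this "哈囉" gets sent to a client.
If the client reads the stream as UTF-8, the text displays correctly.
(If the client program is launched from an IDE, the IDE's console charset must also be UTF-8 for "哈囉" to show up correctly in the console; otherwise it will still be garbled.)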

But if A.class runs on Chinese Windows, the same client reading "哈囉" as UTF-8 will see garbled text.
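
The mismatch can be reproduced without any network at all. This is only a sketch: the class MojibakeDemo and its receive method are hypothetical names, and a plain byte array stands in for the socket stream. It again assumes Big5 is available in the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    static final String HELLO = "\u54c8\u56c9";
    static final Charset BIG5 = Charset.forName("Big5");

    // What a client ends up with when it decodes the received bytes
    // using the given charset.
    static String receive(byte[] wire, Charset clientCharset) {
        return new String(wire, clientCharset);
    }

    public static void main(String[] args) {
        // The "server" encoded the text as Big5, as on Chinese Windows.
        byte[] wire = HELLO.getBytes(BIG5);

        // A UTF-8 client decodes the wrong way and gets garbage
        // (typically U+FFFD replacement characters).
        System.out.println(receive(wire, StandardCharsets.UTF_8).equals(HELLO)); // false

        // A client that agrees with the server recovers the text.
        System.out.println(receive(wire, BIG5).equals(HELLO)); // true
    }
}
```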

To debug this, print the string's byte array as raw data and check whether the encoding is Big5 or UTF-8.
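
For the raw dump, something like the following is enough. It is a sketch: HexDump and toHex are names I made up, and in real debugging you would pass in the actual bytes received rather than encoding a literal.

```java
import java.nio.charset.StandardCharsets;

final class HexDump {
    // Render each byte as two upper-case hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes do not print as FFFFFFxx.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String hello = "\u54c8\u56c9";
        // Six bytes for two characters: this is what UTF-8 looks like.
        System.out.println(toHex(hello.getBytes(StandardCharsets.UTF_8))); // E5 93 88 E5 9B 89
    }
}
```

If the dump shows three bytes per character starting with E5, the data is UTF-8; two bytes per character means a double-byte encoding such as Big5.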

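The way to solve the encoding problem (localization) is to keep Chinese text out of the program and put it in a text file instead, saved as UTF-8.
At run time A.class reads the "哈囉" string from that file as UTF-8 and sends it out as UTF-8, naming the charset explicitly in both directions. That guarantees A.class will not emit differently encoded "哈囉" strings depending on the platform it runs on.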
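
A minimal sketch of that approach, under a few assumptions of mine: the class Utf8Message and its two methods are hypothetical, a temp file stands in for the real message file, and a ByteArrayOutputStream stands in for the socket's output stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Utf8Message {
    // Read the whole file, decoding it as UTF-8 regardless of the platform
    // default. trim() drops a trailing newline left by the text editor.
    static String readMessage(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
    }

    // Send the text as UTF-8 bytes, again naming the charset explicitly.
    static void send(String message, OutputStream out) throws IOException {
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(message);
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a message file that was saved as UTF-8.
        Path file = Files.createTempFile("message", ".txt");
        Files.write(file, "\u54c8\u56c9\n".getBytes(StandardCharsets.UTF_8));

        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        send(readMessage(file), wire);
        System.out.println(wire.size() + " bytes sent"); // 6 bytes sent

        Files.delete(file);
    }
}
```

Because neither method falls back to the default charset, the bytes on the wire are the same on Windows and on Linux.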


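Writing Chinese directly in source code is a bad habit,
whether in comments or in strings...
See the "許功蓋" issue (Big5 characters whose second byte is 0x5C, the backslash).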
