jinfeng_wang

Notes regarding character sets

One of the problems that occurs when different machine models are connected by using Java, or when a database is connected, is the generation of garbled characters. For example, when a Solaris OE machine sends an em-sized (full-width) tilde symbol ('〜') to a Windows(R) machine via the network and the Windows(R) machine displays this character, a garbled character may be displayed.

This problem arises from the fact that different vendors use different conversion rules to convert the existing Japanese character codes (JIS, EUC, and shift-JIS) into Unicode. Using Java does not by itself solve this problem. Java system developers therefore need to take some protective measures against it.

To solve this problem, it is necessary to understand its background, which is explained below.

Features of Unicode

Unicode is a character code system laid down by the Unicode Consortium. It was formulated so that a single character code system, covering most of the world's major languages, would be able to process multiple languages. Unicode is based on ISO/IEC 10646, a standard of the International Organization for Standardization (ISO).

In the process of its development as a character code including most of the world's major languages, Unicode has come to have the following features:

Inclusion of numerous similar characters

As a policy, Unicode is formulated in such a way that characters that are similar in appearance but different in terms of their roles or uses are included as different characters. For example, as many as six characters resembling an em-sized tilde ('〜', WAVE DASH (01-33) of JIS X 0208) have been registered as shown below.

[Table: Unicode characters resembling JIS X 0208 01-33 (WAVE DASH, '〜')]

Code point    Registered name
U+301c        WAVE DASH
U+223c        TILDE OPERATOR
U+223d        REVERSED TILDE
U+223e        INVERTED LAZY S
U+223f        SINE WAVE
U+ff5e        FULLWIDTH TILDE
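
As a small illustration, the six code points in the table can be looked up from Java itself. This is only a sketch: the class name `TildeLookalikes` and the helper `nameOf` are made up for this example, and the names printed are whatever Unicode data the running JDK ships with.

```java
class TildeLookalikes {
    // The six look-alike code points listed in the table (hypothetical holder class).
    static final int[] CODE_POINTS = {0x301c, 0x223c, 0x223d, 0x223e, 0x223f, 0xff5e};

    /** Returns the Unicode character name known to the running JDK. */
    static String nameOf(int codePoint) {
        return Character.getName(codePoint);
    }

    public static void main(String[] args) {
        // Each code point is a separate character with its own registered name.
        for (int cp : CODE_POINTS) {
            System.out.printf("U+%04x  %s%n", cp, nameOf(cp));
        }
    }
}
```

Although the glyphs look alike on screen, a comparison such as `"\u301c".equals("\uff5e")` is false, which is what makes the choice of code point matter during conversion.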

 

Em-sized characters and en-sized characters registered as different characters

To assure compatibility with existing character sets, em-sized (full-width) characters and en-sized (half-width) characters are registered in Unicode as different characters. For example, the em-sized 'Ａ' and the en-sized 'A' are registered in Unicode as different characters.

However, even for some characters that are em-sized characters and do not have corresponding en-sized characters in the Japanese language, such as '£' and '¢', two types of characters, em-sized and en-sized, are registered in Unicode.

[Table: Unicode characters corresponding to JIS X 0208 01-82 (POUND SIGN, '£')]

Code point    Registered name
U+00a3        POUND SIGN
U+ffe1        FULLWIDTH POUND SIGN

[Table: Unicode characters corresponding to JIS X 0208 01-81 (CENT SIGN, '¢')]

Code point    Registered name
U+00a2        CENT SIGN
U+ffe0        FULLWIDTH CENT SIGN

Vendor dependency of code conversion

Since Unicode includes many similar characters as described above, there can be many interpretations about which Unicode character should be used for a character in an existing Japanese character set. Vendors may use interpretations that differ from each other, with the result that different code conversion rules are followed by different vendors. Typical examples are shown below.

The Unicode Consortium provides the JIS kanji-Unicode conversion rule.

JIS also defines a JIS kanji-Unicode conversion rule in the appendix of JIS X 0221. However, its rule for converting the em-sized dash symbol (JIS X 0208 01-29, '―') is different from the rule established by the Unicode Consortium.

[Table: Difference in conversion rules between the Unicode Consortium and JIS]

JIS X 0208              Conversion to Unicode        Conversion to Unicode
                        (Unicode Consortium)         (JIS)
01-29 EM DASH ('―')     U+2015 (HORIZONTAL BAR)      U+2014 (EM DASH)

Microsoft Corporation uniquely defined a shift-JIS-Unicode mutual conversion rule for Windows(R).

In the conversion rule of Windows(R), as a basic rule, an em-sized character of a Japanese code corresponds to an em-sized character of Unicode. Therefore, for some characters, a conversion rule that differs from the JIS conversion rule is used.

[Table: Differences between JIS conversion rules and Windows(R) conversion rules]

JIS X 0208                Conversion to Unicode      Conversion to Unicode
                          (JIS)                      (Windows(R))
01-33 WAVE DASH ('〜')    U+301c (WAVE DASH)         U+ff5e (FULLWIDTH TILDE)
01-82 POUND SIGN ('£')    U+00a3 (POUND SIGN)        U+ffe1 (FULLWIDTH POUND SIGN)
01-81 CENT SIGN ('¢')     U+00a2 (CENT SIGN)         U+ffe0 (FULLWIDTH CENT SIGN)

None of these conversion rules can be said to be right or wrong. The conversion rule defined in the JIS serves merely as an informative rule and is not mandatory. Currently, the final decision on which character is to correspond to which character depends on individual vendors.
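
The difference between the two rules can be observed by decoding the same shift-JIS bytes twice. The sketch below assumes that the JDK's "Shift_JIS" charset follows the JIS-style rule and that "windows-31j" follows the Windows(R) rule, which is how common JDK builds behave; these charset names and the class `VendorMappings` are assumptions for illustration and do not come from the text above. The byte values are the shift-JIS encodings of JIS X 0208 01-33, 01-82, and 01-81.

```java
import java.nio.charset.Charset;

class VendorMappings {
    // Assumed stand-ins for the two conversion rules (both need the JDK's extended charsets).
    static final Charset JIS_STYLE = Charset.forName("Shift_JIS");
    static final Charset WINDOWS_STYLE = Charset.forName("windows-31j");

    /** Decodes one two-byte shift-JIS character and returns the resulting code point. */
    static int decode(int lead, int trail, Charset charset) {
        byte[] bytes = {(byte) lead, (byte) trail};
        return new String(bytes, charset).codePointAt(0);
    }

    public static void main(String[] args) {
        // Wave dash (01-33), pound sign (01-82), cent sign (01-81) as shift-JIS bytes.
        int[][] samples = {{0x81, 0x60}, {0x81, 0x92}, {0x81, 0x91}};
        for (int[] s : samples) {
            System.out.printf("0x%02x%02x  Shift_JIS: U+%04x  windows-31j: U+%04x%n",
                    s[0], s[1],
                    decode(s[0], s[1], JIS_STYLE),
                    decode(s[0], s[1], WINDOWS_STYLE));
        }
    }
}
```

The same two bytes yield different Unicode code points depending only on which charset name the program passes, so the choice of charset name is effectively a choice of vendor rule.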

Mechanism of garbled character generation

As mentioned above, since different vendors use different conversion rules, some characters become garbled when systems from different vendors are linked using Java.

Consider, for example, a system in which Solaris OE and Windows(R) environments are connected via a network to exchange text data.

[Table: Example of garbled characters due to the inconsistency of conversion tables (in network communication)]

EUC data          Java (Unicode)        Shift-JIS data
'〜'         ->   U+301c           ->   Undefined
Undefined    <-   U+ff5e           <-   '～'

Java reads character data of a character set that is specific to a platform, converts it to Unicode data, and then processes it. When the data is transferred over the network, it is transferred in Unicode, and, on the other platform, is converted into the character data of a character set specific to that other platform.

In this example, it is supposed that the Japanese EUC code is used as a character set on Solaris OE. The mutual code conversion between the Japanese EUC code on Solaris OE and Unicode is performed in accordance with the JIS conversion rule. Java also performs mutual conversion between the Japanese EUC code and Unicode basically in accordance with the JIS conversion rule.

It is supposed that the shift-JIS code is used as a character set on the Windows(R) side. Since Windows(R) has its own code conversion rule, Java performs mutual conversion between the shift-JIS code and Unicode in accordance with this code conversion rule of Windows(R).

Described below is the case where the em-sized tilde character ('〜') is transferred from Solaris OE to Windows(R).

  1. Java reads the tilde character of the Japanese EUC code '〜' and converts it to Unicode on Solaris OE. Because the conversion on Solaris OE is performed in accordance with the JIS conversion rule, '〜' is converted to U+301c.

  2. The '〜' converted into Unicode data (U+301c) is transferred to Windows(R) via the network.

  3. The transferred Unicode data is again converted into shift-JIS data on Windows(R). At this time, the conversion is performed on Windows(R) in accordance with the conversion rule of Windows(R).

  4. However, the Unicode code point corresponding to '〜' is U+ff5e according to the conversion rule of Windows(R). The conversion rule of Windows(R) does not provide a corresponding character for the Unicode data received from Solaris OE (U+301c). Therefore, if U+301c is converted by using the conversion rule of Windows(R), it becomes an undefined character in the shift-JIS code.

The same can be said of the reverse operation; when '～' on Windows(R) (U+ff5e) is transferred to Solaris OE, since the JIS conversion rule does not define a character corresponding to U+ff5e, the conversion into the Japanese EUC code generates an undefined character.
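
The forward transfer in steps 1 to 4 can be reproduced in a single JVM. This is a minimal sketch under assumptions: the charset names "EUC-JP" and "windows-31j" stand in for the Solaris OE and Windows(R) sides, the class and method names are hypothetical, and the network transfer of step 2 is reduced to passing a `String`.

```java
import java.nio.charset.Charset;

class WaveDashTransfer {
    // Assumed stand-ins for the platform character sets on each side.
    static final Charset SOLARIS_SIDE = Charset.forName("EUC-JP");
    static final Charset WINDOWS_SIDE = Charset.forName("windows-31j");

    /** Step 1: decode Japanese EUC bytes into Unicode. */
    static String readOnSolaris(byte[] eucBytes) {
        return new String(eucBytes, SOLARIS_SIDE);
    }

    /** Steps 3 and 4: encode Unicode into shift-JIS using the Windows(R) rule. */
    static byte[] writeOnWindows(String text) {
        // getBytes silently substitutes the charset's replacement bytes for unmappable characters.
        return text.getBytes(WINDOWS_SIDE);
    }

    /** Reports whether every character has a mapping under the Windows(R) rule. */
    static boolean survivesOnWindows(String text) {
        return WINDOWS_SIDE.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        byte[] eucWaveDash = {(byte) 0xA1, (byte) 0xC1}; // JIS X 0208 01-33 in EUC
        String unicode = readOnSolaris(eucWaveDash);      // step 1
        byte[] sjis = writeOnWindows(unicode);            // steps 3 and 4
        System.out.printf("Unicode: U+%04x%n", (int) unicode.charAt(0));
        System.out.println("Mappable on Windows side: " + survivesOnWindows(unicode));
        for (byte b : sjis) {
            System.out.printf("shift-JIS byte: 0x%02x%n", b & 0xff);
        }
    }
}
```

Because `getBytes` substitutes a replacement byte without raising an error, a check such as `survivesOnWindows` is one way to detect the loss before the data is written.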

Neither the code conversion rule of Solaris OE nor that of Windows(R) is to blame for this outcome. Moreover, Java performed the conversion correctly in accordance with these rules. However, because the conversion rules of the two platforms differ, their combined use causes incorrect conversion.

Corrective action

It may seem that these problems can be solved by making one platform use the same conversion rule as the other platform. However, this method introduces another inconsistency on the side whose conversion rule is changed.

In the first example, for instance, if the conversion rule followed on the Windows(R) side is changed to the JIS conversion rule, it becomes possible to correctly convert Unicode data received from Solaris OE but it is no longer possible to correctly convert existing Unicode data stored in the Windows(R) system.

The simplest and surest measure is to design the system so that it avoids the use of characters that can become garbled. Only a limited number of multibyte characters can become garbled, so it is necessary to exercise caution during the system design stage to ensure that these characters are not used.

Alternatively, where the platforms to be used can be identified (for example, where the server uses Solaris OE and the clients use Windows(R)), system developers can have the system filter the data at the time of data transfer.

For example, in data transfer from Solaris OE to Windows(R), if all the U+301c character values appearing in the data are converted into U+ff5e character values before the transfer, the '〜' data can be transferred from Solaris OE to Windows(R) correctly.
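
Such a filter might look like the following sketch. The class and method names are hypothetical, and the pound and cent pairs are taken from the table of differences between the JIS and Windows(R) rules given earlier; a real system would need the complete list of affected characters for its own platforms.

```java
class TransferFilter {
    /** Rewrites JIS-rule code points to their Windows(R)-rule counterparts before sending. */
    static String solarisToWindows(String text) {
        return text.replace('\u301c', '\uff5e')   // WAVE DASH  -> FULLWIDTH TILDE
                   .replace('\u00a3', '\uffe1')   // POUND SIGN -> FULLWIDTH POUND SIGN
                   .replace('\u00a2', '\uffe0');  // CENT SIGN  -> FULLWIDTH CENT SIGN
    }

    /** Inverse filter for data travelling from Windows(R) to Solaris OE. */
    static String windowsToSolaris(String text) {
        return text.replace('\uff5e', '\u301c')
                   .replace('\uffe1', '\u00a3')
                   .replace('\uffe0', '\u00a2');
    }

    public static void main(String[] args) {
        String fromSolaris = "10\u301c20";
        String filtered = solarisToWindows(fromSolaris);
        System.out.printf("U+%04x%n", (int) filtered.charAt(2));
    }
}
```

The filter operates on Unicode strings, so it can sit at the transfer boundary without touching the platform conversion rules on either side. It also rewrites characters that were deliberately entered as the other variant, which is acceptable only when the two platforms are known in advance, as the text above assumes.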

In any case, when developing systems that connect platforms from different vendors, system developers need to take protective measures in advance to prevent garbled character problems.

The following table lists the characters that cause garbled characters if the above example system (a system performing network communication between Solaris OE and Windows(R)) is implemented by using Java:

posted on 2006-02-04 11:08 jinfeng_wang  Category: ZZ
