singapore: the smallest big galery
WHAT IS UTF-8?

#1 2009-11-27 09:41:03



The Unicode Standard, which is kept in step with ISO/IEC 10646, defines a universal character set that encompasses most of the world's writing systems. It was originally designed as a 16-bit set; its code space now runs from U+0000 to U+10FFFF. Characters wider than 8 bits, however, are not compatible with many current applications and protocols that assume 8-bit characters (such as the Web) or even 7-bit characters (such as mail), and this has led to the development of several UCS transformation formats (UTF), each with different characteristics. UTF-8 is the byte-oriented one, designed for ease of use with existing ASCII-based systems. It serializes each Unicode code point as a unique sequence of one to four bytes, and it encodes the ASCII characters (U+0000 to U+007F) as the same single bytes they have in ASCII. This allows Unicode to be used in a convenient and backwards-compatible way in environments that, like Unix, were designed entirely around ASCII.
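To make the "one to four bytes" rule concrete, here is a minimal sketch of the bit layout in Python. The function name utf8_encode is made up for this illustration; in real code you would simply call str.encode("utf-8"), which does the same job.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes (illustrative only)."""
    # Surrogates (U+D800..U+DFFF) and values above U+10FFFF are not encodable.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx (identical to ASCII)
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

For example, utf8_encode(0x41) gives the single ASCII byte 41, while utf8_encode(0x01A1) gives the two bytes C6 A1. Every byte of a multi-byte sequence has its high bit set, so it can never be mistaken for an ASCII character.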

The UTF-8 format of Unicode and ISO/IEC 10646 is the preferred default character encoding for the internationalization of Internet application protocols, and it is expected to become the most common encoding on the World Wide Web. Being a multi-byte format built from 8-bit units, it is a natural fit for the web, which is itself based on 8-bit protocols. UTF-8 is, in fact, the only Unicode format that is commonly supported by web browsers. It is being adopted and deployed by many major Vietnamese online media and publications.
UTF-8 and Unicode FAQ
UTF-8: What is It and Why is It Important?

A Vietnamese-language file in UTF-8 encoding is roughly 1.2 times larger than a file with the same content in a legacy encoding (e.g. VPS, VISCII, TCVN3), because Vietnamese characters (mostly vowels) usually require two to three bytes each in UTF-8. Following are some examples of Vietnamese characters in UTF-8 format. The last column shows the same bytes as they appear when misread as Windows-1252.

Vietnamese Character    16-bit Unicode    UTF-8 Bytes    Misread as Windows-1252
Ồ    U+1ED2    E1 BB 92    á»’
ồ    U+1ED3    E1 BB 93    á»“
Ờ    U+1EDC    E1 BB 9C    á»œ
ơ    U+01A1    C6 A1       Æ¡
ư    U+01B0    C6 B0       Æ°
ứ    U+1EE9    E1 BB A9    á»©
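
The table entries can be checked with a few lines of Python. This is only a sketch: the names describe and utf8_overhead are made up here, and the overhead figure assumes the legacy encoding stores every character in exactly one byte, which is a simplification.

```python
# Code points taken from the table above.
CODE_POINTS = [0x1ED2, 0x1ED3, 0x1EDC, 0x01A1, 0x01B0, 0x1EE9]


def describe(code_point):
    """Return (code point label, UTF-8 bytes in hex, bytes misread as Windows-1252)."""
    raw = chr(code_point).encode("utf-8")
    # bytes.hex(sep) requires Python 3.8 or later.
    return (f"U+{code_point:04X}", raw.hex(" ").upper(), raw.decode("cp1252"))


def utf8_overhead(text):
    """Size of text in UTF-8 divided by its size at one byte per character.

    Assumption: the legacy encoding uses exactly one byte per character.
    """
    return len(text.encode("utf-8")) / len(text)


# One tuple per table row, in the same order as CODE_POINTS.
TABLE = [describe(cp) for cp in CODE_POINTS]
```

Here describe(0x01A1) returns the label U+01A1, the bytes C6 A1, and the two characters a Windows-1252 viewer would show instead of the intended letter. The ratio from utf8_overhead depends heavily on the text: plain ASCII gives 1.0, and text dense with accented vowels gives more than the rough 1.2 quoted above.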




#2 2010-05-11 14:28:06


Ohhh you are great...thank you so much for this help

best regards