In article <[email protected]> you wrote:
> (implication of utf8 and not utf16 goes here)
>
> Very few Unicode characters require three bytes, instead of the usual one or
> two.
Two-byte UTF-8 sequences end at U+07FF, which covers only scripts such as
Latin, Greek, Cyrillic, Hebrew and Arabic.
All CJK Unified Ideographs (U+4E00-) and Extension A (U+3400-) need 3 bytes
in UTF-8. Extension B (U+20000-) even needs 4 bytes.
> For one byte you just have the byte.
For ASCII you have one byte.
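
To check the byte counts per range yourself, here is a minimal Python sketch. It
relies only on the built-in str.encode; the helper name utf8_len and the sample
code points are my own picks for illustration, not anything from the thread.

```python
# Sample code points chosen for illustration (assumed, not from the thread).
samples = {
    "LATIN CAPITAL LETTER A": 0x0041,
    "last 2-byte code point": 0x07FF,
    "first 3-byte code point": 0x0800,
    "CJK Extension A start": 0x3400,
    "CJK Unified Ideographs start": 0x4E00,
    "CJK Extension B start": 0x20000,
}


def utf8_len(cp):
    """Return how many bytes UTF-8 needs for a single code point."""
    return len(chr(cp).encode("utf-8"))


for name, cp in samples.items():
    print(f"U+{cp:04X} {name}: {utf8_len(cp)} byte(s)")
```

Anything below U+0080 comes out as 1 byte, U+0080 to U+07FF as 2, the rest of
the BMP as 3, and everything from U+10000 up as 4.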
> For two bytes, you really have three: a control code stating "the following
> two bytes are a two byte character", and then the two bytes.
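Umm, that's a bit misleading. UTF-8 works with bit prefixes, not byte prefixes:
there is no separate control byte, the leading bits of the first byte tell you
how long the sequence is.
Unicode code points are integers and, depending on the encoding, are
represented as one or more code units, which in turn can be represented as bytes.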
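
To make the bit prefixes visible, here is a hand-rolled encoder as a rough
sketch. The function name utf8_encode is hypothetical and the code is only meant
to show the layout; in real code you would just call str.encode("utf-8").

```python
def utf8_encode(cp):
    """Encode one code point by hand to show the UTF-8 bit prefixes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        return bytes([cp])                          # 0xxxxxxx
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),             # 110xxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),            # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    return bytes([0xF0 | (cp >> 18),                # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),       # 10xxxxxx
                  0x80 | ((cp >> 6) & 0x3F),        # 10xxxxxx
                  0x80 | (cp & 0x3F)])              # 10xxxxxx


for cp in (0x41, 0x7FF, 0x4E00, 0x20000):
    encoded = utf8_encode(cp)
    # Cross-check the hand-rolled result against Python's own encoder.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}:", " ".join(f"{b:08b}" for b in encoded))
```

The first byte of a multi-byte sequence starts with as many 1 bits as there are
bytes in the sequence, and every continuation byte starts with 10. The payload
bits of the code point fill the remaining positions.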
> Unless I've completely misunderstood the Unicode specification, this is what
> is going on.
You might want to look up Joel's Tutorial or just browse the Unihan Database:
http://www.joelonsoftware.com/articles/Unicode.html
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=3400
http://www.unicode.org/cgi-bin/UnihanGrid.pl?codepoint=U+07F1&useutf8=false
Greetings
Bernd