Re: A Great Idea (tm) about reimplementing NLS.

Patrick McFarland <[email protected]> writes:

> On Friday 17 June 2005 04:21 am, Måns Rullgård wrote:
>> Patrick McFarland <[email protected]> writes:
>> > On Thursday 16 June 2005 11:04 am, Lennart Sorensen wrote:
>> >>  Most people seem happy with 50 or so being a good limit even though
>> >> many systems support much longer.
>> >
>> > 50 characters or 50 bytes? Because in the case of UTF-8, if you do a lot
>> > of three byte characters (which require four bytes to encode), 50 bytes
>> > is very short.
>>
>> What do you mean by three-byte characters requiring four bytes to
>> encode?  Is a three-byte character not a character encoded using three
>> bytes?
>
> (implication of utf8 and not utf16 goes here)
>
> Very few Unicode characters require three bytes, instead of the
> usual one or two.

I wouldn't call the Chinese, Japanese, and Korean characters "very few",
and those all require (at least) three bytes.

> For one byte you just have the byte. 

Correct.

> For two bytes, you really have three: a control code stating "the
> following two bytes are a two byte character", and then the two
> bytes.
>
> For three bytes, you really have four bytes: a control code stating
> "the following three bytes are a three byte character" and then the
> three bytes.

Wrong.  The first byte indicates the total size of the character, but
it also contains data, like this:

  0xxxxxxx
  110xxxxx 10xxxxxx
  1110xxxx 10xxxxxx 10xxxxxx
  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Refer to the Unicode standard, section 3.9 for the full details.
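
To make the bit layout above concrete, here is a minimal Python sketch
that packs a code point into those patterns by hand.  The function name
utf8_encode is made up for this illustration; in practice the built-in
str.encode("utf-8") does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point by hand (hypothetical helper)."""
    # Surrogates and values above U+10FFFF are not encodable in UTF-8.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# The lead byte carries data bits as well as the length marker, so a
# "three-byte character" occupies exactly three bytes, not four.
for cp in (0x41, 0xE5, 0x4E2D, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "))
```

Note that no separate control byte appears anywhere: the length marker
and the first few data bits share the lead byte.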

>> As for 50 bytes being too short, many of the multibyte characters are
>> equivalent to several English characters, so fewer of them are
>> required.  You have a point, though.
>
> Any English characters (ie, the first 127 ascii characters) map
> directly to the first 127 Unicode characters (if that's what you
> meant).

Let me clarify with an example.  The common Korean name Kim consists
of three ASCII characters.  The Hangul spelling, 김, is a single
character encoded in UTF-8 using three bytes.  Even though a
three-byte character was used, the number of bytes is the same.
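
The byte counts are easy to check.  A small sketch, assuming the
Hangul syllable meant here is U+AE40 (the usual spelling of Kim):

```python
ascii_name = "Kim"
hangul_name = "\uAE40"  # assumed: the single Hangul syllable for "Kim"

ascii_bytes = ascii_name.encode("utf-8")
hangul_bytes = hangul_name.encode("utf-8")

# Characters differ (3 vs 1), bytes do not (3 vs 3).
print(len(ascii_name), len(ascii_bytes))    # 3 3
print(len(hangul_name), len(hangul_bytes))  # 1 3
```

So a limit stated in bytes costs this name nothing, while a limit
stated in characters would treat the two spellings very differently.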

-- 
Måns Rullgård
[email protected]