Re: A Great Idea (tm) about reimplementing NLS.

Patrick McFarland <[email protected]> writes:

> On Friday 17 June 2005 04:21 am, Måns Rullgård wrote:
>> Patrick McFarland <[email protected]> writes:
>> > On Thursday 16 June 2005 11:04 am, Lennart Sorensen wrote:
>> >>  Most people seem happy with 50 or so being a good limit even though
>> >> many systems support much longer.
>> >
>> > 50 characters or 50 bytes? Because in the case of UTF-8, if you do a lot
>> > of three byte characters (which require four bytes to encode), 50 bytes
>> > is very short.
>>
>> What do you mean by three-byte characters requiring four bytes to
>> encode?  Is a three-byte character not a character encoded using three
>> bytes?
>
> (implication of utf8 and not utf16 goes here)
>
> Very few Unicode characters require three bytes, instead of the
> usual one or two.

I wouldn't call the Chinese, Japanese, and Korean characters "very few",
and those all require (at least) three bytes.

> For one byte you just have the byte. 

Correct.

> For two bytes, you really have three: a control code stating "the
> following two bytes are a two byte character", and then the two
> bytes.
>
> For three bytes, you really have four bytes: a control code stating
> "the following three bytes are a three byte character" and then the
> three bytes.

Wrong.  The first byte indicates the total size of the character, but
it also contains data, like this:

  0xxxxxxx
  110xxxxx 10xxxxxx
  1110xxxx 10xxxxxx 10xxxxxx
  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Refer to the Unicode standard, section 3.9 for the full details.
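
To make the layout above concrete, here is a rough sketch in Python of
encoding a single code point by hand.  The function name utf8_encode is
made up for illustration and is not part of any library; the only thing
it assumes is the four bit patterns shown above plus the usual range
limits (no surrogates, nothing above U+10FFFF).

```python
def utf8_encode(cp):
    """Encode one Unicode code point into UTF-8 bytes by hand.

    Hypothetical helper for illustration: the lead byte carries the
    length marker *and* the top data bits; each continuation byte
    carries six more data bits behind a 10 prefix.
    """
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate code points are not encodable")
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")


if __name__ == "__main__":
    # Compare the hand-rolled encoder with Python's built-in one.
    for cp in (0x41, 0xE5, 0xAE40, 0x1F600):
        assert utf8_encode(cp) == chr(cp).encode("utf-8")
        print("U+%04X" % cp, utf8_encode(cp).hex(" "))
```

Note that a three-byte sequence never has a separate "control" byte in
front of it: the 1110 marker shares the first byte with four data bits.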

>> As for 50 bytes being too short, many of the multibyte characters are
>> equivalent to several English characters, so fewer of them are
>> required.  You have a point, though.
>
> Any English characters (ie, the first 127 ascii characters) map
> directly to the first 127 Unicode characters (if that's what you
> meant).

Let me clarify with an example.  The common Korean name Kim consists
of three ASCII characters.  The Hangul spelling, 김, is encoded in
UTF-8 using three bytes.  Even though a three-byte character was used,
the number of bytes is the same.
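
The comparison is easy to check in Python.  This is only a quick
sketch, and it assumes the Hangul syllable in question is U+AE40; the
variable names are mine.

```python
# Romanized spelling: three ASCII characters, one byte each.
ascii_name = "Kim"

# Hangul spelling, assumed here to be the single syllable U+AE40.
hangul_name = "\uae40"

ascii_bytes = ascii_name.encode("utf-8")
hangul_bytes = hangul_name.encode("utf-8")

# Character count differs, byte count does not.
print(len(ascii_name), "characters,", len(ascii_bytes), "bytes")
print(len(hangul_name), "character,", len(hangul_bytes), "bytes")
```

So a limit expressed in bytes does not automatically penalize the
Hangul spelling, although that will not hold for every name or script.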

-- 
Måns Rullgård
[email protected]