James, thanks for the detailed explanation. I actually understand Unicode
pretty well, but I am curious why it doesn't "just work" by now in Fedora
and probably other Linux distributions. I use Japanese on my desktop and I
still have some problems, such as tagging MP3s or exchanging files with
people on Japanese Windows. To be fair, it has improved immensely, but I
will be very happy when I no longer have to use smbchartool
(http://www.samba.gr.jp/project/contrib/smbchartool.html).

Nadeem

On Sun, 19 Dec 2004 20:52:30 +0000, James Wilkinson <james@xxxxxxxxxxxxxxxxxxx> wrote:
> Nadeem Bitar wrote:
> > I'm interested to know why en_US works but en_US.UTF-8 doesn't.
> (The context was id3 tags in MP3s etc)
>
> Computers store letters as binary numbers.
>
> The standard way of encoding Latin letters is the ASCII encoding. In
> anything ASCII based, for example, A is (decimal) 65. ASCII covers the
> symbols on a standard US keyboard, and uses numbers up to 127.
>
> Historically, Western computers have stored each character in one byte.
> That gives you up to 256 characters.
>
> Many people want to use other symbols. For example, I might want to use
> the £ and € signs for currency. Greeks and Russians will want to use
> their own letters (Ω or Я). People speaking French or Spanish will want
> to use àçcéñts. And you want to properly tag your MP3s.
>
> In fact, there are *way* more symbols than can be encoded in one byte.
> So a number of "character sets" were invented: some for Greek letters,
> some for Russian, some for Western European, etc. Usually the first half
> was ASCII, and the rest character-set specific.
>
> And the problem is that it isn't always clearly specified which
> character set you're using. I suspect that's what's happening here: the
> encoder and the player are using different character sets.
>
> UTF-8 is a way of encoding practically any character, possibly in more
> than one byte.
> If and when it becomes universal, then character set
> problems should go away. But it's also another character set, so for
> now, if an encoding program encodes symbols in UTF-8, but the readers
> expect them to be in ISO 8859-1 ("Western Europe"), you'll have trouble.
>
> Now the LANG variable, among other things, sets which character set is
> in use. en_US uses ISO 8859-1, while en_US.UTF-8 uses UTF-8 (not
> surprisingly). So using en_US gets your MP3s using the ISO 8859-1
> encoding that the MP3 players expect (because the encoder works that way
> but the decoders presumably don't...)
>
> I have not been able to find if there is a character set specification
> in id3 tags that one program or another is ignoring, or whether the
> standard is simply deficient.
>
> With e-mails, for example, there's a MIME-Version and a Content-Type
> header that specify that this e-mail is using UTF-8 (because that's the
> only character set that covers everything I've used).
>
> James.
>
> Yes, I know, I've massively simplified in places.
>
> --
> E-mail address: james | DON'T be put off by "horror stories" spread by
> @westexe.demon.co.uk  | others. People who talk about death and serious
>                       | injury are very rarely the ones who have actually
>                       | suffered such things.   -- Adrian Plass
>
> --
> fedora-list mailing list
> fedora-list@xxxxxxxxxx
> To unsubscribe: http://www.redhat.com/mailman/listinfo/fedora-list
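
For what it's worth, the mismatch James describes is easy to reproduce.
The following is only a minimal sketch in Python, not taken from any
tagging tool: the sample title is made up, and I am assuming the codec
names "utf-8" and "iso-8859-1" stand in for what en_US.UTF-8 and en_US
would select.

```python
# Hypothetical tag text; any non-ASCII string shows the same effect.
title = "Café £5"

# An encoder running under a UTF-8 locale would write UTF-8 bytes.
utf8_bytes = title.encode("utf-8")

# A reader that assumes ISO 8859-1 maps every byte to one character,
# so each two-byte UTF-8 sequence shows up as two wrong characters.
misread = utf8_bytes.decode("iso-8859-1")

# When writer and reader agree on the character set, the text survives.
latin1_bytes = title.encode("iso-8859-1")
roundtrip = latin1_bytes.decode("iso-8859-1")

print(len(title), len(utf8_bytes), len(latin1_bytes))  # characters vs. bytes
print(repr(misread))                                   # the garbled title
print(roundtrip == title)
```

The damage in this direction is reversible (re-encode the garbled string
as ISO 8859-1 and decode it as UTF-8), which is roughly the kind of repair
a conversion tool has to do. Japanese text is a harder case, since
ISO 8859-1 cannot represent it at all.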