Re: flac/mp3 tagging Latin characters

Nadeem Bitar wrote:
> I'm interested to know why en_US works but en_US.UTF-8 doesn't.
(The context was id3 tags in MP3s etc)

Computers store letters as binary numbers.

The standard way of encoding Latin letters is the ASCII encoding. In
anything ASCII based, for example, A is (decimal) 65. ASCII covers the
symbols on a standard US keyboard, and uses numbers up to 127.
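
Those numbers are easy to check. A minimal sketch in Python (the language is only chosen for illustration, and the sample word "Hello" is arbitrary):

```python
# ASCII assigns each character a number below 128.
code_for_A = ord("A")                   # numeric code of the letter A
ascii_bytes = "Hello".encode("ascii")   # one byte per character
all_below_128 = all(b < 128 for b in ascii_bytes)

print(code_for_A, list(ascii_bytes), all_below_128)
```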

Historically, Western computers have stored each character in one byte.
That gives you up to 256 characters.

Many people want to use other symbols. For example, I might want to use
the £ and € signs for currency. Greeks and Russians will want to use
their own letters (Ω or Д). People speaking French or Spanish will want
to use áçcéñts. And you want to properly tag your MP3s.

In fact, there are *way* more symbols than can be encoded in one byte.
So a number of "character sets" were invented: some for Greek letters,
some for Russian, some for Western European, etc. Usually the first half
was ASCII, and the rest character-set specific.
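
To make that concrete, here is a small Python sketch that decodes the same byte under three of those one-byte character sets. The byte value 0xE9 and the three sets are picked purely as examples:

```python
# One byte from the upper half, read under three legacy character sets.
raw = bytes([0xE9])

as_western = raw.decode("iso8859_1")    # Western European
as_greek = raw.decode("iso8859_7")      # Greek
as_cyrillic = raw.decode("iso8859_5")   # Cyrillic

# A byte from the lower (ASCII) half means the same thing in all three.
charsets = ("iso8859_1", "iso8859_7", "iso8859_5")
ascii_same = {bytes([0x41]).decode(cs) for cs in charsets}

print(as_western, as_greek, as_cyrillic, ascii_same)
```

The upper-half byte comes out as a different letter each time, while byte 0x41 is "A" everywhere.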

And the problem is that it isn't always clearly specified which
character set you're using. I suspect that's what's happening here: the
encoder and the player are using different character sets.

UTF-8 is a way of encoding practically any character, possibly in more
than one byte. If and when it becomes universal, then character set
problems should go away.  But it's also another character set, so for
now, if an encoding program encodes symbols in UTF-8, but the readers
expect them to be in ISO 8859-1 ("Western Europe"), you'll have trouble.
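
That kind of trouble can be reproduced in a few lines. A sketch in Python, assuming a made-up tag value "café": the writer encodes it as UTF-8 and the reader decodes the same bytes as ISO 8859-1.

```python
title = "café"                          # hypothetical tag value

stored = title.encode("utf-8")          # what a UTF-8 encoder writes: 5 bytes
misread = stored.decode("iso8859_1")    # what an ISO 8859-1 reader displays
round_trip = stored.decode("utf-8")     # what a UTF-8 reader displays

print(misread)      # cafÃ©
print(round_trip)   # café
```

The single "é" became two bytes in UTF-8, and the ISO 8859-1 reader shows each of those bytes as its own character.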

Now the LANG variable, among other things, sets which character set is
in use. en_US uses ISO 8859-1, while en_US.UTF-8 uses UTF-8 (not
surprisingly). So with en_US the encoder writes your tags in the
ISO 8859-1 encoding that the MP3 players expect. Presumably the encoder
follows LANG, but the players don't and always assume ISO 8859-1.
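
As a rough illustration of how the character set is carried in the LANG value, here is a Python sketch. The helper name codeset_of is hypothetical, and its fallback to ISO 8859-1 is only an assumption based on the en_US case described here; a real system looks this up in its locale database.

```python
import os

def codeset_of(lang, default="ISO-8859-1"):
    """Return the codeset part of a LANG-style value such as 'en_US.UTF-8'.

    The default is an assumption (en_US without a suffix meaning
    ISO 8859-1); it is not how a real C library resolves locales.
    """
    without_modifier = lang.split("@", 1)[0]    # drop e.g. '@euro'
    if "." in without_modifier:
        return without_modifier.split(".", 1)[1]
    return default

# Inspect the current environment, falling back to plain en_US.
current = codeset_of(os.environ.get("LANG", "en_US"))
print(current)
```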

I have not been able to find if there is a character set specification
in id3 tags that one program or another is ignoring, or whether the
standard is simply deficient.

With e-mails, for example, there's a MIME-Version and a Content-Type
header that specify that this e-mail is using UTF-8 (because that's the
only character set that covers everything I've used).
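
For comparison, this is roughly how those headers come about. A sketch using Python's standard email package; the subject and body text are invented for the example.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "flac/mp3 tagging Latin characters"
# The body contains a pound sign and a euro sign, written as escapes.
msg.set_content("Prices in \u00a3 and \u20ac", charset="utf-8")

mime_version = msg["MIME-Version"]      # added by set_content
charset = msg.get_content_charset()     # taken from the Content-Type header

print(mime_version)
print(msg["Content-Type"])
```

A reader that honours the Content-Type header knows how to decode the body without guessing.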

James.

Yes, I know, I've massively simplified in places.

-- 
E-mail address: james | DON'T be put off by "horror stories" spread by
@westexe.demon.co.uk  | others.  People who talk about death and serious
                      | injury are very rarely the ones who have actually
                      | suffered such things.  -- Adrian Plass

