Fedora Users — Re: Where is the $LANG variable defined?

Matthew Miller wrote:

On Sun, Nov 07, 2004 at 07:50:00PM +0100, Björn Persson wrote:
And something does -- set the LANG variable right, and it'll work, right?
If it's that easy, why doesn't SSH set LANG? Why should I have a lot of trouble doing this manually every time? And how does that help with filenames?
Answers to your questions in order. :)
1. Because it has no idea what to set it to.

It doesn't know because it doesn't bother to look, but that's the wrong answer anyway. The right answer is that it's *not* that easy. The encoding has to be set for the terminal program where I run SSH, and to do that through LANG, LANG must be set before the terminal program is started. So instead of just typing "ssh otherbox", I have to set LANG, launch a new terminal window and run SSH there.
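For reference, the encoding is the part of the LANG value after the dot. A minimal Python sketch, assuming the usual POSIX-style layout `language[_territory][.codeset][@modifier]`; the function name is made up for illustration:

```python
from typing import Optional


def codeset_from_locale(name: str) -> Optional[str]:
    """Return the codeset part of a locale name such as 'sv_SE.ISO-8859-1'.

    Returns None when the name carries no explicit codeset (e.g. 'sv_SE').
    """
    # Drop an optional modifier such as '@euro'.
    name = name.split("@", 1)[0]
    if "." not in name:
        return None
    # Everything after the first dot is the codeset.
    return name.split(".", 1)[1]
```

With this, `codeset_from_locale("sv_SE.ISO-8859-1")` gives `"ISO-8859-1"` and `codeset_from_locale("en_US.UTF-8")` gives `"UTF-8"`. The terminal program only sees whatever value was in its environment when it started, which is why changing LANG afterwards doesn't help.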

2. You shouldn't have to go to a lot of trouble. And you shouldn't need to
   do it manually each time -- at the very least, you can script it.

So everyone and their dog should write a specialized script for every combination of local and remote box? Fat chance! We need a solution that will work out of the box so it can be packaged and distributed with operating systems.

I *could* write a program that connects with SSH, looks up the remote system's encoding, disconnects, opens a new terminal window with the right encoding set, and runs SSH in that window, but it wouldn't work in text mode, it wouldn't work with chained SSH sessions, and it still wouldn't help with file transfers.
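Roughly, such a program might look like the Python sketch below. It is only an illustration under several assumptions: the function names are hypothetical, the remote box is assumed to have the `locale` command, xterm is assumed as the terminal program, and a local locale named `<language>.<charmap>` (for example `sv_SE.ISO-8859-1`) is assumed to be installed.

```python
import os
import subprocess
from typing import Dict, List, Optional, Tuple


def remote_charmap_command(host: str) -> List[str]:
    """Command that asks the remote box which character encoding it uses."""
    # `locale charmap` prints the codeset of the remote login's locale.
    return ["ssh", host, "locale", "charmap"]


def terminal_launch(
    host: str,
    charmap: str,
    language: str = "sv_SE",  # assumed local language/territory
    base_env: Optional[Dict[str, str]] = None,
) -> Tuple[Dict[str, str], List[str]]:
    """Build the environment and command line for a new terminal window."""
    env = dict(os.environ if base_env is None else base_env)
    # LANG has to be in place before the terminal program starts.
    env["LANG"] = f"{language}.{charmap.strip()}"
    # LC_ALL and LC_CTYPE take precedence over LANG, so clear them.
    env.pop("LC_ALL", None)
    env.pop("LC_CTYPE", None)
    argv = ["xterm", "-e", "ssh", host]
    return env, argv


def connect(host: str) -> None:
    """Connect once to look up the encoding, then again in a new window."""
    result = subprocess.run(
        remote_charmap_command(host),
        capture_output=True, text=True, check=True,
    )
    env, argv = terminal_launch(host, result.stdout)
    subprocess.Popen(argv, env=env)
```

Calling `connect("otherbox")` would need two logins and an X display, which is exactly where the limitations come from: nothing here works in text mode, through chained SSH sessions, or for file transfers.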

3. The filename encoding problem is kinda sticky. Devising a workable
   on-the-fly transcoding solution seems like a lot of work on the
   _symptoms_. Instead, let's work on getting everything to work well with
   UTF-8.

You don't seem to fully understand the extent of the problem. That's not surprising as you're apparently a USian and seldom see the character encoding problems I see daily. If you had been regularly forced to spell your name "Mutthew" because "a" wasn't a valid character, you might have a different view.

What you're suggesting is that everyone should use UTF-8 everywhere so that there would be only one character encoding. That just isn't going to happen. I'd love to go UTF-8 myself and get access to all the world's written languages, but it's not feasible. That's not because of the filenames; I could transcode all my filenames easily enough. The real problem is the files' contents. Since there's only one big global locale setting, I have to convert everything or nothing.

I've got heaps of text files full of non-English letters, and they're all encoded in Latin 1. (Actually, many of them are probably in Windows-1252, but the extra characters in that encoding don't seem to have been used very often, so they can pass as Latin 1.) Some of them are plain text. They could, and would have to, be transcoded. Others are XML or HTML. They could be left as they are, but they should be transcoded so that they can be opened in text editors, and if they are transcoded, the embedded encoding specifications have to be updated. Still others are source code. Transcoding those would constitute changes to the programs and could require several other changes. Then there are the files that aren't text and mustn't be transcoded by mistake. There's no reliable way of recognizing the different kinds of files automatically, so I'd have to go through them all manually and decide what to do with each one. No thanks!
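The transcoding step on its own is the easy part. A minimal Python sketch, with hypothetical function names, assuming the input really is text in Latin 1 or Windows-1252 (or already UTF-8):

```python
def looks_like_cp1252(data: bytes) -> bool:
    """Guess Windows-1252: bytes 0x80-0x9F are control codes in Latin 1
    but mostly printable characters (quotes, dashes, the euro sign)
    in Windows-1252."""
    return any(0x80 <= b <= 0x9F for b in data)


def to_utf8(data: bytes) -> bytes:
    """Transcode Latin 1 / Windows-1252 bytes to UTF-8."""
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8 (plain ASCII included)
    except UnicodeDecodeError:
        pass
    if looks_like_cp1252(data):
        try:
            return data.decode("cp1252").encode("utf-8")
        except UnicodeDecodeError:
            pass  # a few byte values are undefined in Windows-1252
    # Latin 1 maps every byte to a character, so this cannot fail.
    return data.decode("latin-1").encode("utf-8")
```

Note what the sketch cannot do: because Latin 1 assigns a character to every byte value, a JPEG or a tarball "decodes" just as happily as a text file and gets silently mangled. Nor does it touch an `encoding="ISO-8859-1"` declaration inside an XML file, or know that a source file's string literals matter to the program. Those decisions are the manual part.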

Then there are the various people and computers I need to cooperate with and share files with: coworkers, SourceForge projects and the like. These people share files with other people, who in turn cooperate with still other people. What's the chance of getting all these people to switch character encodings at the same time?

While I'm typing this, BitTorrent is downloading Fedora Core 3 for me. I'm going to do a fresh install on an unused partition. The first thing I'll do after installing is to edit /etc/sysconfig/i18n to change from UTF-8 to Latin 1. Sure, it would be nice if everyone used the same character encoding, but Unicode was created some 50 years too late and now we have to live with the consequences. As for me, I'm stuck with Latin 1.
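The edit itself amounts to changing one line. As a hedged Python sketch: the function name is hypothetical, the file is assumed to contain a `LANG="..."` line, and `sv_SE.ISO-8859-1` is only an example of a Latin 1 locale name.

```python
import re


def set_lang(config_text: str, new_lang: str) -> str:
    """Return the i18n file contents with the LANG line set to new_lang."""
    line = f'LANG="{new_lang}"'
    pattern = re.compile(r"^LANG=.*$", re.MULTILINE)
    if pattern.search(config_text):
        # Replace the existing LANG line, leaving other settings alone.
        return pattern.sub(lambda m: line, config_text, count=1)
    # No LANG line yet: append one.
    sep = "" if (not config_text or config_text.endswith("\n")) else "\n"
    return config_text + sep + line + "\n"
```

Applied to a file containing `LANG="en_US.UTF-8"`, `set_lang(text, "sv_SE.ISO-8859-1")` returns the same contents with only that line changed. The new value takes effect for sessions started afterwards.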

Björn Persson