Re: [Patch] Support UTF-8 scripts

Bernd Petrovitsch <[email protected]> wrote:
> On Thu, 2005-09-15 at 20:39 +0200, "Martin v. Löwis" wrote:
>> H. Peter Anvin wrote:

>> > In Unix, it's a hideously bad idea.  The reason is that Unix inherently
>> > assumes that text streams can be merged, split, and modified.  In other
>> > words, unless you can guarantee that EVERY program can handle BOM
>> > EVERYWHERE, it's broken.

You can't sort /bin/ls into /tmp/ls and expect /tmp/ls to be meaningful,
but /bin/ls works as expected. You can't usually concatenate perl scripts
and shell scripts either, but both kinds of script run quite well.

And if you do "cat /bin/cat /bin/cp > /bin/catcp", what's "catcp foo bar"
supposed to do? First output foo and bar to stdout, then copy foo to bar?
Is execve() broken if it doesn't do what I described? Is the ELF header
broken because it's not recognized EVERYWHERE? I don't think so.

>> This argument is bogus. We are talking about scripts here, which cannot
>> be merged, split, and modified. You don't cat(1) or sort(1) them - it's
> 
> Sure they can since they are plain text files.
> How do you think one merges scripts?
> Just `cat`ing them all into one new file and edit that new file is much
> faster and simpler than to open an empty new file with your editor, then
> you open all the other scripts in your editor and copy them by hand.

What's supposed to happen if you concatenate a script from your French
user and one from your Russian user, both using localized text, into one
file? Unless you can guarantee that every editor handles this case
correctly, all usage of 8-bit characters should be disabled - NOT!

If you concatenate two plain text files, you will use cat.
If you concatenate two pnm image files, you will use pnmcat.
If you concatenate two utf-8 files, you will use utf8cat.
If you concatenate two binaries, you will shoot your feet.
That's easy, isn't it?

BTW: I think decent utf-8 capable programs SHOULD ignore extra BOM markers.
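
To make the utf8cat idea concrete, here is a rough sketch of its core loop.
No such tool is assumed to exist; the name copy_utf8 and the keep_bom flag
are made up for illustration. The assumed policy is to keep the signature
of the first input only and to drop a leading signature from every later
one:

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy `in` to `out`. If the input starts with the UTF-8 signature
 * EF BB BF, drop it unless keep_bom is nonzero.
 * Returns 0 on success, -1 on a read or write error.
 * (Hypothetical helper, not part of any existing tool.)
 */
int copy_utf8(FILE *in, FILE *out, int keep_bom)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    unsigned char buf[4096];
    size_t n;

    /* Look at the first three bytes before copying anything else. */
    n = fread(buf, 1, sizeof(bom), in);
    if (n == sizeof(bom) && memcmp(buf, bom, sizeof(bom)) == 0 && !keep_bom)
        n = 0;  /* swallow the signature */
    if (n > 0 && fwrite(buf, 1, n, out) != n)
        return -1;

    /* The rest of the stream is copied unchanged. */
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
        if (fwrite(buf, 1, n, out) != n)
            return -1;

    return ferror(in) ? -1 : 0;
}
```

A utf8cat built on this would call copy_utf8(f, stdout, 1) for the first
file and copy_utf8(f, stdout, 0) for all the others. It only looks at the
start of each input; a signature buried in the middle of a file is left
alone.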

> And you (or at least I) do `grep`/`egrep`/`fgrep`, `wc` them.

You can *grep utf-8 scripts, but you can't *grep binaries. Shouldn't
this be fixed by implementing an in-kernel ASCII assembler and converting
all binaries to assembler text?

> And
> probably with several other tools too - think of `find <dir> -type f
> -print0 | xargs -0r <cmd>`.

utf-8 filenames will work correctly (unless used as an extended BASIC
script with non-ASCII variable names, but that would be insane).

>> just pointless to do that. You create them with text editors, and those
>> can handle the UTF-8 signature.
> 
> It is not uncommon to create scripts and the like with other programs,
> other scripts, what-else.

It's not uncommon to create binaries using other programs. So what?

> Apart from the fact that a "script" is merely a plain text file with the
> eXecutable bit set.

And a utf-8 script is a utf-8 encoded text file with its executable bit
set.

> And that is the only difference, so you have to at
> least (all instances of) `chmod` to insert and remove the BOM.
[...]

In order to make it harder for the interpreter to correctly detect utf-8?
You can have DOS executables run in dosboxes, Windows applications run
in Windows, Java archives run in Java, but utf-8 scripts should be
mangled in order to work "correctly", and mangled back in order to be
editable? *That*'s insane!

Just make execve ignore the BOM marker before "#!" as the patch does, and
you're done. The rest is somebody else's not-a-problem.



BTW2: However, I don't like the patch.

I'd first check for a utf-8 signature, and if it's found, adjust the
buffer offset by 3. Then I'd run the old code checking for the sh_bang.
OTOH, I have only read the patch and not the .c file, so maybe (though
unlikely) my idea wouldn't work correctly.
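
Roughly what I have in mind, as a userspace sketch and not the actual
kernel code: the function names utf8_bom_len and find_interp_line are
invented for this mail, and the real .c file may be structured quite
differently. The signature check only yields an offset (0 or 3), and the
"#!" test then runs at that offset, otherwise unchanged:

```c
#include <stddef.h>
#include <string.h>

/* The three bytes of the UTF-8 signature (U+FEFF encoded as UTF-8). */
static const unsigned char UTF8_BOM[3] = { 0xEF, 0xBB, 0xBF };

/* Return 3 if buf starts with the UTF-8 signature, else 0. */
size_t utf8_bom_len(const unsigned char *buf, size_t len)
{
    if (len >= sizeof(UTF8_BOM) &&
        memcmp(buf, UTF8_BOM, sizeof(UTF8_BOM)) == 0)
        return sizeof(UTF8_BOM);
    return 0;
}

/*
 * Return a pointer just past "#!" (the start of the interpreter line),
 * or NULL if the buffer does not look like a script. A leading
 * signature is skipped first, so the "#!" test itself stays as it was.
 */
const unsigned char *find_interp_line(const unsigned char *buf, size_t len)
{
    size_t off = utf8_bom_len(buf, len);  /* 0 or 3, never more than len */

    if (len - off < 2 || buf[off] != '#' || buf[off + 1] != '!')
        return NULL;
    return buf + off + 2;
}
```

With this split, a file without a signature takes exactly the old path
(offset 0), and everything after the "#!" check does not need to know
whether a signature was present.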

-- 
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.