Re: [Patch] Support UTF-8 scripts

Bernd Petrovitsch <bernd@firmix.at> wrote:
> On Sun, 2005-09-18 at 09:23 +0200, "Martin v. Löwis" wrote:
> [...]
>> >>Hmm. What does that have to do with the patch I'm proposing? This
>> >>patch does *not* interfere with all text files. It is only relevant
>> >>for executable files starting with the #! magic.
>> > 
>> > It *does* interfere since scripts are also text files in every aspect.
>> > So every feature you want for "scripts" you also get for text files (and
>> > vice versa BTW).
>> 
>> The specific feature I get is that when I pass a file starting
>> with <utf8sig>#! to execve, Linux will execute the file following
>> the #!. In what way do I get this feature for text in general?
>> And if I do, why is that a problem?
> 
> After applying this patch it seems that "Linux" is supporting this
> marker officially in general - especially if the kernel supports it.

It will be the first POSIX kernel to correctly support utf-8 scripts.
It's 2005, and according to other(?) posters, this should be standard.

> I
> suppose the next kernel patch is to support Win-like CR-LF sequences
> (which is not the case AFAIK).

Maybe it should, maybe it shouldn't. If I used MAC or DOS, I'd be sure it
should.-)

> BTW even some standards body thinks that this is the way to go,

Not surprisingly the Unicode Consortium is one of them.

> it
> raises more problems and questions than resolves anything.

The problem of ow to handle BOM is solved by reading the standard.

> And though scripts are usually edited/changed/"parsed"/... with an text
> editor, it is not always the case. Therefore the automatic extension to
> *all text files* (especially as the marker basically applies to all text
> files, not only scripts).
> You want to focus just on your patch and ignore the directly implied
> potential problems arising ...

There is no problem arising from the patch, it solves one.
To solve the rest, use recode.

[...]
> Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> a.txt and b.txt have this marker, then c.txt have the marker of b.txt
> somewhere in the middle. Does this make sense in anyway?
> How do I get rid of the marker in the middle transparently?

The unicode standard defines how to handle them.

>> > Let alone the confusion why the size of a file with `ls -l` is different
>> > from the size in the editor or a marker-aware `wc -c`.
>> 
>> This is true for any UTF-8 file, or any multibyte encoding. For any
>> multibyte encoding, the number of bytes in the file is different from
>> the number of characters. That doesn't (and shouldn't) stop people from
>> using multi-byte encodings.
> 
> It is different even if a pure ASCII file is marked as UTF-8.

No pure ASCII file will be marked, since a marked file will be no
ASCII file.

> And sure, the problem exists in general with multi-byte encodings.

ACK, but that's not a kernel problem nor a specific unicode problem.
Fix it by making China, Greece an Japan convert to ASCII and by making
all mathematicans stop using strange characters. All other users will
follow.

>> What the editor displays as the number of "things" is up to its own.
>> The output of wc -c will always be the same as the one of ls -l,
>> as wc -c does *not* give you characters:
>> 
>>        -c, --bytes
>>               print the byte counts
>> 
>> You might have been thinking of 'wc -m'.
> 
> It depends on the definition of "character". There are other standards
> which define "character" as "byte".

There are architectures defining a byte to be 32 bit.
They are irrelevant, too.

[...]
>> Not sure what this has
>> to do with the specific patch, though.
> 
> It is not supported by the kernel. So either you remove it or you make
> some compatibility hack (like an appropriate sym-link

-EDOESNOTWORK

#!/usr/bin/perl -T -s -w

>, etc.). Since the
> kernel can start java classes directly, you can probably make a similar
> thing for the UTF-8 stuff.

If MSDOS text files are text files are legal scripts, the kernel
should recognize [\x0D\x0A] as valid line breaks.

(The real reason would be unicode allowing NEL to be encoded as 0x0D
 or 0x0A.)

This compile-tested patch adds 32 bytes to binfmt_script:

--- ./fs/binfmt_script.c.old    2005-09-18 20:28:32.000000000 +0200
+++ ./fs/binfmt_script.c        2005-09-18 20:29:44.000000000 +0200
@@ -18,7 +18,7 @@

 static int load_script(struct linux_binprm *bprm,struct pt_regs *regs)
 {
-       char *cp, *i_name, *i_arg;
+       char *cp, *cp2, *i_name, *i_arg;
        struct file *file;
        char interp[BINPRM_BUF_SIZE];
        int retval;
@@ -47,6 +47,9 @@ static int load_script(struct linux_binp
        bprm->buf[BINPRM_BUF_SIZE - 1] = '\0';
        if ((cp = strchr(bprm->buf, '\n')) == NULL)
                cp = bprm->buf+BINPRM_BUF_SIZE-1;
+       if ((cp2 = strchr(bprm->buf, '\x0D')) != NULL
+       &&  cp2 < cp)
+               cp = cp2;
        *cp = '\0';
        while (cp > bprm->buf) {
                cp--;
-- 
Ich danke GMX dafür, die Verwendung meiner Adressen mittels per SPF
verbreiteten Lügen zu sabotieren.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Follow-Ups:
- Re: [Patch] Support UTF-8 scripts
  - From: "H. Peter Anvin" <hpa@zytor.com>
- Re: [Patch] Support UTF-8 scripts
  - From: Valdis.Kletnieks@vt.edu
- Re: [Patch] Support UTF-8 scripts
  - From: Bernd Petrovitsch <bernd@firmix.at>

Prev by Date: Re: I request inclusion of reiser4 in the mainline kernel
Next by Date: [PATCH] ext[23]: fix missing DQUOT_DROP in error paths
Previous by thread: Re: [Patch] Support UTF-8 scripts
Next by thread: Re: [Patch] Support UTF-8 scripts
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind]