Re: [Patch] Support UTF-8 scripts

Bernd Petrovitsch wrote:
> Most of the text editors have ways to markup the source files. Not even
> the various editors are able to agree on one method for all, so why
> could the (Linux) world agree on one for all text files?

You are ignoring the role of standardization. People invent their own
mechanism if a standard is missing (or virtually unimplementable). For
declaring encodings, there is no standard (except for ISO 2022, which
is really hard to implement correctly). Therefore, editor authors
create their own standards.

At least Python abstained from creating yet another standard, and instead
supports both the declarations from Emacs and vim. To some degree, it
also supports notepad (namely through the UTF-8 signature).
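
A minimal sketch of the three declaration styles mentioned above,
assuming a current Python 3 and its standard tokenize module (the
helper name declared_encoding and the sample sources are made up for
illustration):

```python
import io
import tokenize


def declared_encoding(source: bytes) -> str:
    """Return the encoding Python's tokenizer detects for these source bytes."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# Emacs-style declaration on the second line.
emacs_style = b"#!/usr/bin/python\n# -*- coding: latin-1 -*-\nprint('hi')\n"
# vim-style declaration on the second line.
vim_style = b"#!/usr/bin/python\n# vim: set fileencoding=utf-8 :\nprint('hi')\n"
# No declaration at all, only the UTF-8 signature at the start of the file.
signature = b"\xef\xbb\xbf#!/usr/bin/python\nprint('hi')\n"

for name, src in [("emacs", emacs_style), ("vim", vim_style), ("signature", signature)]:
    print(name, declared_encoding(src))
```

The tokenizer reports a normalized codec name, so the Emacs-style
latin-1 declaration comes back as iso-8859-1, and a file that carries
only the signature is reported as utf-8-sig.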

However, people are much more likely to agree on a technology when it
is defined by a recognized standards body. This is the case for the
UTF-8 signature, which is defined by the Unicode consortium, for
precisely this purpose. Therefore, editors *will* agree on that
mechanism, while keeping their own mechanism for the more general
problem.

>>Even for the programming language, it is a pain to implement: what
>>if you have non-ASCII characters before the pragma that declares the
>>encoding? and so on.
> 
> 
> That's the problem of the language definers who absolutely want such
> (IMHO absolutely superfluous) features.

It's not the language designers who absolutely want this feature. It's
the language users. Of course, you'd have to be a language designer to
know that fact - language users go to the language designers asking for
the feature, not to the kernel developers.

>>Hmm. What does that have to do with the patch I'm proposing? This
>>patch does *not* interfere with all text files. It is only relevant
>>for executable files starting with the #! magic.
> 
> 
> It *does* interfere since scripts are also text files in every aspect.
> So every feature you want for "scripts" you also get for text files (and
> vice versa BTW).

The specific feature I get is that when I pass a file starting
with <utf8sig>#! to execve, Linux will execute the file following
the #!. In what way do I get this feature for text in general?
And if I do, why is that a problem?

> If you think "script" and "text file" are different, define both of
> them, please, otherwise a discussion is pointless.

A script file (in the context of this discussion) is a text file
that is executable (i.e. has the appropriate subset of
S_IXUSR|S_IXGRP|S_IXOTH set), starts with #!, and has the path
name of an executable file after the #!.
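
The definition above can be sketched as a user-space check. This is
only an illustration, not the kernel's code: the function name
script_interpreter, the allow_signature flag, and the 256-byte read
are assumptions made for the example.

```python
import os
import stat

UTF8_SIGNATURE = b"\xef\xbb\xbf"
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def script_interpreter(path, allow_signature=False):
    """Return the interpreter named after #!, or None if `path` is not a
    script in the sense defined above."""
    # At least one of the execute permission bits must be set.
    if not os.stat(path).st_mode & EXEC_BITS:
        return None
    with open(path, "rb") as f:
        head = f.read(256)
    # Optionally accept a UTF-8 signature in front of the #! magic.
    if allow_signature and head.startswith(UTF8_SIGNATURE):
        head = head[len(UTF8_SIGNATURE):]
    if not head.startswith(b"#!"):
        return None
    # The interpreter name is the first space/tab-delimited field of the line.
    line = head[2:].split(b"\n", 1)[0].strip(b" \t")
    name = line.replace(b"\t", b" ").split(b" ", 1)[0]
    if not name:
        return None
    interpreter = os.fsdecode(name)
    # The name after #! must refer to an executable file.
    if not (os.path.isfile(interpreter) and os.access(interpreter, os.X_OK)):
        return None
    return interpreter
```

With allow_signature left at False, a file starting with
<utf8sig>#! is rejected; with it set to True, the same file is
treated like any other #! script.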

More generally, a script file is a text file written in a scripting
language. A scripting language is a programming language which
supports "direct" execution of source code. So in the more
general definition, a script file does not need to start with
#!; for the context of this discussion, we should restrict
attention to files actually affected by the patch.

>>This conclusion is false. Many tools that don't understand the file
>>structure still can do their job on the files. So the fact that a tool
>>does not understand the structure does not necessarily imply that
>>the tool breaks when the structure changes.
> 
> 
> It *may* break just because of some to-be-ignored inline marking due to
> some questionable feature.

Be more specific. For what specific kind of file will cat(1) break?
Unless cat(1) has a 2GB limitation, I very much doubt it will break
(i.e. fail to do its job, "concatenate files and print on the standard
output") for any kind of input - whether this is text files, binary
files, images, sound files, HTML files. cat always does what it is
designed to do.

> Let alone the confusion why the size of a file with `ls -l` is different
> from the size in the editor or a marker-aware `wc -c`.

This is true for any UTF-8 file, or any multibyte encoding. For any
multibyte encoding, the number of bytes in the file is different from
the number of characters. That doesn't (and shouldn't) stop people from
using multi-byte encodings.

What the editor displays as the number of "things" is up to the editor.
The output of wc -c will always match the size shown by ls -l,
as wc -c does *not* give you characters:

       -c, --bytes
              print the byte counts

You might have been thinking of 'wc -m'.
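
The difference between bytes and characters can be shown in a few
lines. This is a small illustration in Python 3; the sample text is
arbitrary.

```python
import codecs

text = "Grüße\n"  # 6 characters, two of them outside ASCII
data = codecs.BOM_UTF8 + text.encode("utf-8")  # file contents with signature

# Number of bytes: this is what a byte count of the file measures.
byte_count = len(data)
# Number of characters if the signature is decoded as U+FEFF.
char_count = len(data.decode("utf-8"))
# Number of characters if the decoder strips the signature.
char_count_without_signature = len(data.decode("utf-8-sig"))

print(byte_count, char_count, char_count_without_signature)
```

The byte count exceeds the character count even without the
signature, because each of the two non-ASCII characters takes two
bytes in UTF-8; the signature adds three more bytes.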

>>For a Python script, I don't need to guess: It will just work.
> 
> 
> Then write a short python script (with a "#!/usr/bin/python" line at the
> start [without parameters]) natively on a Win*-system, copy it binary
> over to an arbitrary Linux system and see what's happening.

It depends on the editor I use, of course: the kernel will treat any
CR at the end of the #! line as part of the interpreter name. Not sure
what this has to do with the specific patch, though.
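
A simplified model of that behaviour, assuming the #! line ends at
the first LF and that only spaces and tabs are trimmed or treated as
separators (the function interpreter_name is hypothetical and is not
taken from the kernel source):

```python
def interpreter_name(first_bytes: bytes) -> bytes:
    """Extract the interpreter name from the start of a #! script,
    ending the line at LF only and trimming spaces and tabs only."""
    if not first_bytes.startswith(b"#!"):
        raise ValueError("not a #! script")
    line = first_bytes[2:].split(b"\n", 1)[0].strip(b" \t")
    # The name runs up to the first space or tab; a CR is not a separator.
    return line.replace(b"\t", b" ").split(b" ", 1)[0]


unix_script = b"#!/usr/bin/python\nprint('hi')\n"
dos_script = b"#!/usr/bin/python\r\nprint('hi')\r\n"

print(interpreter_name(unix_script))
print(interpreter_name(dos_script))
```

Under these assumptions the file saved with CRLF line endings yields
an interpreter name with a trailing CR, which does not name an
existing file.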

Regards,
Martin
