Re: [Patch] Support UTF-8 scripts

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[email protected] wrote:
> For the benefit of those of us who are interested in the problem, but aren't
> in the mood to wade through a long standard looking for the answer to a
> specific question, can you elaborate?

See

http://www.unicode.org/faq/utf_bom.html#38

> It isn't as obvious as all that, because of all the nasty corner cases...

It really depends on the specific structure of the text file. For Python
scripts, the Python interpreter will reject a U+FEFF in the middle of
the file as a syntax error (*). This is, IMO, a reasonable reaction: you
just shouldn't concatenate Python scripts blindly. They may have
different source encodings, so any concatenation of Python scripts
needs to convert them both into a common encoding. The first script
may also fail to terminate with a newline, so concatenating Python
scripts also needs to insert a line break. In edition, you would
also typically want to remove the docstring in the second file.

The same holds for many other formats: for example, you cannot blindly
concatenate XML files, either (the result often won't be an XML file).
So that the BOM is treated as an error would give no problem.

> Given a file "a.txt" that's pure ASCII, and a file "b.txt" that has the BOM
> marker on it, what happens when you do "cat a.txt b.txt > c.txt"?

You answer the question yourself correctly:

> 'cat' doesn't know, and has no way of knowing, that c.txt needs a BOM at the
> *front* of the file until it's already written past the point in c.txt where
> the BOM has to go.
> 
> What does the Unicode standard say to do in this case?

The point is that the BOM *also* is a regular character, U+FEFF. It used
to have a specific function, too, but now U+2060 (WORD JOINER) should
be used for that function. So U+FEFF is exclusively used for the BOM
now. If you see it in the middle of a file, you know it doesn't belong
there (*). In processing the file, you can complain, you can ignore it,
and you can chose to strip it off. Which of these you do depends on
the application; if you don't know better, treating it as ZERO WIDTH
NON-BREAKING SPACE is the recommended reaction.

Regards,
Martin

(*) unless it occurs in a string literal, in which case it becomes
part of the string. In the case of concatenating two Python files,
it won't be part of a string literal, though, but instead occur
at the beginning of a line.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]
  Powered by Linux