[email protected] wrote:
> For the benefit of those of us who are interested in the problem, but aren't
> in the mood to wade through a long standard looking for the answer to a
> specific question, can you elaborate?
See
http://www.unicode.org/faq/utf_bom.html#38
> It isn't as obvious as all that, because of all the nasty corner cases...
It really depends on the specific structure of the text file. For Python
scripts, the Python interpreter will reject a U+FEFF in the middle of
the file as a syntax error (*). This is, IMO, a reasonable reaction: you
just shouldn't concatenate Python scripts blindly. They may have
different source encodings, so any concatenation of Python scripts
needs to convert them both into a common encoding. The first script
may also fail to terminate with a newline, so concatenating Python
scripts also needs to insert a line break. In edition, you would
also typically want to remove the docstring in the second file.
The same holds for many other formats: for example, you cannot blindly
concatenate XML files, either (the result often won't be an XML file).
So that the BOM is treated as an error would give no problem.
> Given a file "a.txt" that's pure ASCII, and a file "b.txt" that has the BOM
> marker on it, what happens when you do "cat a.txt b.txt > c.txt"?
You answer the question yourself correctly:
> 'cat' doesn't know, and has no way of knowing, that c.txt needs a BOM at the
> *front* of the file until it's already written past the point in c.txt where
> the BOM has to go.
>
> What does the Unicode standard say to do in this case?
The point is that the BOM *also* is a regular character, U+FEFF. It used
to have a specific function, too, but now U+2060 (WORD JOINER) should
be used for that function. So U+FEFF is exclusively used for the BOM
now. If you see it in the middle of a file, you know it doesn't belong
there (*). In processing the file, you can complain, you can ignore it,
and you can chose to strip it off. Which of these you do depends on
the application; if you don't know better, treating it as ZERO WIDTH
NON-BREAKING SPACE is the recommended reaction.
Regards,
Martin
(*) unless it occurs in a string literal, in which case it becomes
part of the string. In the case of concatenating two Python files,
it won't be part of a string literal, though, but instead occur
at the beginning of a line.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Index of Archives]
[Kernel Newbies]
[Netfilter]
[Bugtraq]
[Photo]
[Gimp]
[Yosemite News]
[MIPS Linux]
[ARM Linux]
[Linux Security]
[Linux RAID]
[Video 4 Linux]
[Linux for the blind]
|
|