Re: [Patch] Support UTF-8 scripts

On Sun, 18 Sep 2005 21:23:42 +0200, Bodo Eggert said:
> Bernd Petrovitsch <bernd@firmix.at> wrote:
> > Apparently I have to repeat: If you do `cat a.txt b.txt >c.txt` where
> > a.txt and b.txt have this marker, then c.txt has the marker of b.txt
> > somewhere in the middle. Does this make sense in any way?
> > How do I get rid of the marker in the middle transparently?
> 
> The Unicode standard defines how to handle them.

For the benefit of those of us who are interested in the problem, but aren't
in the mood to wade through a long standard looking for the answer to a
specific question, can you elaborate?

It isn't as obvious as all that, because of all the nasty corner cases...

> > It is different even if a pure ASCII file is marked as UTF-8.
> 
> No pure ASCII file will be marked, since a marked file is no longer an
> ASCII file.

Given a file "a.txt" that's pure ASCII, and a file "b.txt" that starts with
a BOM, what happens when you do "cat a.txt b.txt > c.txt"?

'cat' doesn't know, and has no way of knowing, that c.txt needs a BOM at the
*front* of the file until it's already written past the point in c.txt where
the BOM has to go.
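
To make that concrete, here is a minimal Python sketch of the situation. It
is not what 'cat' actually runs; naive_cat() and the file contents are
hypothetical stand-ins chosen for illustration. The only assumption is that
concatenation copies bytes through without inspecting them.

```python
import codecs

# Hypothetical file contents, for illustration only.
a_txt = b"echo hello\n"                     # pure ASCII, no marker
b_txt = codecs.BOM_UTF8 + b"echo world\n"   # UTF-8 with a leading BOM (EF BB BF)


def naive_cat(*chunks: bytes) -> bytes:
    """Concatenate byte streams without inspecting or rewriting them."""
    return b"".join(chunks)


c_txt = naive_cat(a_txt, b_txt)

# The result does not start with a BOM ...
starts_with_bom = c_txt.startswith(codecs.BOM_UTF8)
# ... but one sits in the middle, exactly where b.txt began.
bom_offset = c_txt.find(codecs.BOM_UTF8)

# The 'utf-8-sig' codec strips only a *leading* BOM, so the embedded
# one survives decoding as the character U+FEFF.
decoded = c_txt.decode("utf-8-sig")

print(starts_with_bom, bom_offset, "\ufeff" in decoded)
```

By the time the BOM from b.txt shows up, everything from a.txt has already
been written out, so a streaming tool would have to either buffer all of
its input or go back and rewrite the start of c.txt.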

What does the Unicode standard say to do in this case?


