Fedora Users — Re: Ampersands in XML (was: UTF-8 editing problems with GEdit)

On Wed, Apr 28, 2004 at 03:32:47AM +0200, Björn Persson wrote:
> Paul M. Bucalo wrote:
> 
> >I believe the problem was the use of
> >one or more illegal characters (like adding "&" to a title) and
> >that was causing it to fail. I seem to recall that you need to precede
> >each of these with an "\" to keep it from being interpreted, just like
> >at a BASH console.
> 
> In XML and HTML (and I would guess all sorts of SGML), ampersand starts 
> an entity reference (which ends with a semicolon). If you want a literal 
> ampersand you must write it as "&amp;", that is, an entity reference 
> that references the ampersand entity. Likewise, if you want a less-than 
> sign you can't write just "<", because that starts a tag, so you write 
> it as "&lt;".
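
A quick way to check this is to feed the three spellings to a parser.
The sketch below uses Python's standard xml.etree.ElementTree; the
helper name parses() and the sample titles are made up for
illustration. It suggests that a backslash does not protect the
ampersand in XML, while the entity reference does.

```python
import xml.etree.ElementTree as ET

def parses(doc):
    """Return True if doc is well-formed XML, False otherwise."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

raw = "<title>Tom & Jerry</title>"          # bare ampersand
backslash = "<title>Tom \\& Jerry</title>"  # shell-style backslash escape
entity = "<title>Tom &amp; Jerry</title>"   # entity reference

print(parses(raw))        # False: the bare "&" is rejected
print(parses(backslash))  # False: the backslash is just another character
print(parses(entity))     # True
print(ET.fromstring(entity).text)  # the parser hands back "Tom & Jerry"
```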

Which characters are syntax-significant depends on the context, so it
pays to know which language is going to interpret the text.  Character
encodings complicate things further.  Each of bash, sh, csh, sed, grep,
awk, XML and HTML has its own metacharacters and its own way to escape them.
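
To make the context dependence concrete, here is a small sketch that
protects the same string for three different consumers using only the
Python standard library. The sample string is invented; shlex.quote
targets POSIX shells, re.escape targets Python's own regex syntax (not
sed or grep), and escape targets XML character data.

```python
import re
import shlex
from xml.sax.saxutils import escape

text = "AT&T <R&D>"

for_shell = shlex.quote(text)  # safe as one word on an sh/bash command line
for_regex = re.escape(text)    # matches the text literally in a Python regex
for_xml = escape(text)         # safe as XML character data

print(for_shell)
print(for_regex)
print(for_xml)  # AT&amp;T &lt;R&amp;D&gt;
```

Each result is only correct for its own consumer; pasting the shell
form into an XML file would still fail to parse.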

Syntax-aware editors such as vim, and some emacs modes, can help.

Combining a programming language with a natural language (and so with
a character encoding) currently requires people to declare the
environment correctly, using encoding comments or mode lines.

Five predefined entities seem to be important in XML....
   &amp;, &gt;, &lt;, &quot; and &apos;.
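
As a rough illustration, the five entities listed above can be applied
with xml.sax.saxutils from the Python standard library. Its escape()
handles &, < and > by default; the two quote mappings below (QUOTES and
UNQUOTES are names chosen here) are passed in explicitly.

```python
from xml.sax.saxutils import escape, unescape

# escape() covers &, < and > on its own; the quotes are extra mappings.
QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {"&quot;": '"', "&apos;": "'"}

original = """<a href="x">Tom & Jerry's</a>"""
encoded = escape(original, QUOTES)
decoded = unescape(encoded, UNQUOTES)

print(encoded)
print(decoded == original)  # True: the round trip is lossless
```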

Comment syntax and white space rules add more complexity.  In
  http://www.w3.org/TR/REC-xml/#dt-character
see sections
   2.4 Character Data and Markup
   2.5 Comments

And,

   http://recode.progiciels-bpi.ca/manual/HTML.html

"XML-standalone
    "This charset is available in recode under the name XML-standalone,
    with h0 as an acceptable alias. It is documented in section 4.1 of
    http://www.w3.org/TR/REC-xml. It only knows &amp;, &gt;, &lt;,
    &quot; and &apos;.
...
"HTML_2.0
    "This charset is available in recode under the name HTML_2.0, and
    has RFC1866, 1866 and h2 for aliases. HTML 2.0 entities are listed
    in RFC 1866. Basically, there is an entity for each alphabetical
    character in the right part of ISO 8859-1. In addition, there are
    four entities for syntax-significant ASCII characters: &amp;,
    &gt;, &lt; and &quot;."
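
The difference between the two charsets shows up as soon as a named
entity such as &eacute; appears. This sketch uses Python's html and
xml.etree modules rather than recode itself; note that the html module
knows the much larger HTML5 entity list, of which the HTML 2.0 names are
assumed here to be a subset.

```python
import html
from html.entities import name2codepoint
import xml.etree.ElementTree as ET

# HTML side: &eacute; is a known named entity.
codepoint = name2codepoint["eacute"]
as_text = html.unescape("caf&eacute;")

# XML side: without a DTD only the five predefined entities exist.
try:
    ET.fromstring("<t>caf&eacute;</t>")
    xml_accepts = True
except ET.ParseError:
    xml_accepts = False

print(codepoint)    # 233
print(as_text)
print(xml_accepts)  # False: undefined entity
```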


This all gets more interesting with multibyte (Unicode) characters,
as another thread is discovering.
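
One way to see the multibyte side of it, sketched in Python with a
single sample character: the same character can travel either as raw
UTF-8 bytes or as an ASCII-only numeric character reference, and an XML
parser gives back the same text for the reference form.

```python
import xml.etree.ElementTree as ET

ch = "\u00e9"  # e with an acute accent, one character

utf8 = ch.encode("utf-8")                          # two bytes on disk
charref = ch.encode("ascii", "xmlcharrefreplace")  # ASCII-only spelling

# Wrap the reference in an element and let the parser resolve it.
decoded = ET.fromstring(b"<t>" + charref + b"</t>").text

print(len(utf8))      # 2
print(charref)        # b'&#233;'
print(decoded == ch)  # True
```

An editor that guesses the wrong encoding sees those two UTF-8 bytes as
two separate characters, while the numeric reference survives as
ordinary ASCII.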

Interesting stuff.



-- 
	T o m  M i t c h e l l 
	/dev/null the ultimate in secure storage.