On Wed, Apr 28, 2004 at 03:32:47AM +0200, Björn Persson wrote:
>> Subject: Ampersands in XML (was: UTF-8 editing problems with GEdit)
> Paul M. Bucalo wrote:
>
> >I believe the problem was the use
> >one or more illegal character entries (like adding "&" to a title) and
> >that was causing it to fail. I seem to recall that you need to precede
> >each of these with an "\" to keep it from being interpreted, just like
> >at a BASH console.
>
> In XML and HTML (and I would guess all sorts of SGML), ampersand starts
> an entity reference (which ends with a semicolon). If you want a literal
> ampersand you must write it as "&amp;", that is, an entity reference
> that references the ampersand entity. Likewise, if you want a less-than
> sign you can't write just "<", because that starts a tag, so you write
> it as "&lt;".

The list of syntax-significant characters depends on context, so it
pays to check it for each tool. Language-specific character encoding
complicates things further. Compare the meta characters of bash, sed,
grep, sh, csh, XML, HTML, awk....

Programming-language-sensitive editors like vim and some emacs modes
can help. The combination of programming language and natural language
(character encoding) currently requires people to set up the
environment correctly with comments or mode lines.

Five special characters seem to be important in XML: &, >, <, " and '.
Comment escapes and white space rules seem to add complexity.

In:
http://www.w3.org/TR/REC-xml/#dt-character
See...
2.4 Character Data and Markup
2.5 Comments

And,
http://recode.progiciels-bpi.ca/manual/HTML.html

"XML-standalone
"This charset is available in recode under the name XML-standalone,
with h0 as an acceptable alias. It is documented in section 4.1 of
http://www.w3.org/TR/REC-xml. It only knows &amp;, &gt;, &lt;, &quot;
and &apos;.
...
"HTML_2.0
"This charset is available in recode under the name HTML_2.0, and has
RFC1866, 1866 and h2 for aliases. HTML 2.0 entities are listed in RFC 1866.
Basically, there is an entity for each alphabetical character in the
right part of ISO 8859-1. In addition, there are four entities for
syntax-significant ASCII characters: &amp;, &gt;, &lt; and &quot;."

This all gets more interesting in the context of multi-byte (Unicode)
characters, as another thread is discovering.

Interesting stuff.

-- 
T o m  M i t c h e l l
/dev/null the ultimate in secure storage.
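
For anyone who wants to try the escaping discussed above, here is a minimal sketch in Python. It assumes the standard library modules xml.sax.saxutils and html. The helper names xml_escape and xml_unescape and the sample title are made up for this illustration and do not come from the thread or from recode.

```python
import html
from xml.sax.saxutils import escape, unescape

# saxutils.escape() handles &, < and > by default; the two quote
# characters have to be passed in explicitly as extra entities.
XML_QUOTES = {'"': "&quot;", "'": "&apos;"}


def xml_escape(text):
    """Replace the five XML special characters with entity references."""
    return escape(text, XML_QUOTES)


def xml_unescape(text):
    """Reverse xml_escape() for the five predefined XML entities."""
    reverse = {entity: char for char, entity in XML_QUOTES.items()}
    return unescape(text, reverse)


# Hypothetical title containing all five special characters.
title = "Tom & Jerry's <\"live\"> show"

print(xml_escape(title))   # XML-style escaping
print(html.escape(title))  # HTML-style escaping from the html module
```

Note that a backslash plays no part here: unlike at a shell prompt, the only way to get a literal ampersand into XML or HTML text is the entity reference.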