Fedora Users — Re: developing using the firefox engine

On Sat, 2006-07-01 at 12:00 -0400, fedora-list-request@xxxxxxxxxx wrote:

> On 01/07/06, bruce <bedouglas@xxxxxxxxxxxxx> wrote:
> > hi...
> >
> > i've been trying (unsuccessfully) to parse/process html files. i'm almost
> > certain that the issue has to do with the fact that the html is not valid
> > html.. running the html through various apps "tidy/html validator/etc..."
> > complains with warnings.
> 
> I have been having a similar problem with html, though this time the
> guilty party is mshtml. That piece of dog vomit *can not be made* to
> produce xhtml!!! I get the pseudo-html from mshtml, run it through
> sgmlreader+converter class and get (x)html out. I can then parse +
> process the file with standard xml/xsl tools. It took me an age to
> find good things for .net though - you shouldn't have nearly as many
> problems on linux/fedora.
> I suggest you pass it through tidy and get xhtml out. It may give you
> some junk but you don't really have many other options...
> Cheers
> Antoine

I think you're far better off avoiding XHTML entirely unless there's
some specific reason (MathML and the like) that *requires* that you have
it. 

If you are simply desirous of strictness, you're far better off using
HTML 4.01 Strict, and doing it correctly, validating with the w3.org
validators for HTML and CSS. 

Internet Explorer (by far the most prevalent browser) does not understand
XHTML properly, neither in the current incarnation NOR in the forthcoming
version 7. You are forced to serve it as text/html instead of
application/xhtml+xml, and IE then parses it as if it were HTML tag soup.

You might as well be handing it doctypeless html 3.2 tag soup from 1995
for all the good passing XHTML as text/html is doing you. 
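
To make the parser difference concrete, here is a minimal sketch in Python.
It is only an illustration: the standard library's html.parser stands in for
a forgiving text/html parser (it is not how IE works internally),
xml.etree.ElementTree stands in for a strict XML parser, and the sample
document and function names are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical document: XHTML syntax, but the <p> is never closed.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>demo</title></head>
<body><p>first<br />second</body>
</html>"""


class TagCollector(HTMLParser):
    """Records start tags as a forgiving HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Any well-formedness error is fatal under XML rules.
        return False


def parse_as_html(text):
    """Return the start tags seen by the lenient HTML parser."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags
```

The strict parser rejects the document outright because of the unclosed
element, while the lenient one carries on and reports every tag. Served as
text/html, a browser takes the lenient path, so none of the XML strictness
is ever applied.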

IE6 and IE7 _do not understand XHTML_. Period. There are NO plans for
incorporating an XHTML/XML parser into IE version 7. 

If you're having to tell IE that no, it's not really XHTML, it's
text/html, then I have to ask one simple question:

"Where is the benefit?"
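
For reference, "telling IE it's text/html" usually comes down to picking the
media type from the request's Accept header. The following is a rough
sketch, not a complete HTTP content negotiation implementation: the function
name is hypothetical, and it only checks whether the client lists
application/xhtml+xml explicitly with a non-zero q value.

```python
XHTML = "application/xhtml+xml"
HTML = "text/html"


def choose_media_type(accept_header):
    """Return XHTML only if the client lists it explicitly with q > 0.

    Wildcards such as */* are deliberately ignored, so a client that
    never names application/xhtml+xml falls back to text/html.
    """
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != XHTML:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        if q > 0:
            return XHTML
    return HTML
```

A client whose Accept header never mentions application/xhtml+xml gets
text/html from this function, which is exactly the tag-soup case described
above.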

Use HTML 4.01 Strict. EVERYTHING understands it. It's still "strict".
You don't need ridiculous CDATA-in-comment escapes to wrap inline
CSS/JavaScript.

For further details, read carefully the document at
http://hixie.ch/advocacy/xhtml and also note
http://www.ietf.org/rfc/rfc3236.txt

It's additionally curious to note that Appendix C of the XHTML 1.0
Recommendation states "Note that this recommendation does not define how HTML
conforming user agents should process HTML documents. Nor does it define
the meaning of the Internet Media Type text/html. For these definitions,
see [HTML4] <http://www.w3.org/TR/xhtml1/#ref-html4> and [RFC2854]
<http://www.w3.org/TR/xhtml1/#ref-rfc2854> respectively."

And RFC 2854 in turn says that XHTML 1.0 defines a profile of XHTML that
may be served as text/html. So each document defers to the other, and it's
not very clear. One reading is that nothing actually authorises any version
of XHTML 1.0 to be served as text/html!