On Sat, 2006-07-01 at 12:00 -0400, fedora-list-request@xxxxxxxxxx wrote:
> On 01/07/06, bruce <bedouglas@xxxxxxxxxxxxx> wrote:
> > hi...
> >
> > i've been trying (unsuccessfully) to parse/process html files. i'm almost
> > certain that the issue has to do with the fact that the html is not valid
> > html.. running the html through various apps "tidy/html validator/etc..."
> > complain with warnings.
>
> I have been having a similar problem with html, though this time the
> guilty party is mshtml. That piece of dog vomit *can not be made* to
> produce xhtml!!! I get the pseudo-html from mshtml, run it through
> sgmlreader+converter class and get (x)html out. I can then parse +
> process the file with standard xml/xsl tools. It took me an age to
> find good things for .net though - you shouldn't have nearly as many
> problems on linux/fedora.
> I suggest you pass it through tidy and get xhtml out. It may give you
> some junk but you don't really have many other options...
> Cheers
> Antoine

I think you're far better off avoiding XHTML entirely unless there's some
specific reason (MathML and the like) that *requires* that you have it. If
you simply want strictness, you're far better off using HTML 4.01 Strict,
doing it correctly, and validating with the w3.org validators for HTML and
CSS.

Internet Explorer (by far the most prevalent browser) understands XHTML
properly neither in its current incarnation NOR in the forthcoming version 7.
It only copes when the document is served as text/html instead of
application/xhtml+xml, in which case it parses it as if it were HTML
tag soup. You might as well be handing it doctypeless HTML 3.2 tag soup from
1995 for all the good passing XHTML as text/html is doing you.

IE6 and IE7 _do not understand XHTML_. Period. There are NO plans for
incorporating an XHTML/XML parser into IE version 7. If you're having to
tell IE that, no, it's not really XHTML, it's text/html, then I have to ask
one simple question: "Where is the benefit?"
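To make the tag-soup point concrete, here is a rough sketch in Python. It is only an analogy: Python's standard-library parsers stand in for a browser's text/html and application/xhtml+xml code paths, and say nothing about how IE is actually implemented. The sample markup and the helper names are made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample: the unclosed <p> and <br> make this ill-formed as XML.
SAMPLE = "<html><body><p>first<br><p>second</body></html>"


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_as_tag_soup(markup):
    """Roughly what 'text/html' handling means: accept whatever arrives."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def is_well_formed_xml(markup):
    """Roughly what 'application/xhtml+xml' handling means: reject on error."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(parse_as_tag_soup(SAMPLE))    # the lenient parser carries on regardless
    print(is_well_formed_xml(SAMPLE))   # the XML parser refuses the same input
```

The lenient parser never complains about the broken markup, so under text/html nothing ever enforces the well-formedness that XHTML is supposed to buy you.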
Use HTML 4.01 Strict. EVERYTHING understands it. It's still "strict". You
don't need ridiculous CDATA comment escapes to comment out inline
CSS/JavaScript.

For further details, read carefully the document at
http://hixie.ch/advocacy/xhtml
and also note
http://www.ietf.org/rfc/rfc3236.txt

It's additionally curious to note that Appendix C of the XHTML 1.0
Recommendation states:

"Note that this recommendation does not define how HTML conforming user
agents should process HTML documents. Nor does it define the meaning of the
Internet Media Type text/html. For these definitions, see [HTML4]
<http://www.w3.org/TR/xhtml1/#ref-html4> and [RFC2854]
<http://www.w3.org/TR/xhtml1/#ref-rfc2854> respectively."

And RFC 2854 says that XHTML 1.0 defines a profile of XHTML that may be
served as text/html. So the two documents point at each other and it's not
very clear. One reading is that nothing actually authorises any version of
XHTML 1.0 to be served as text/html!