On 01/07/06, bruce <bedouglas@xxxxxxxxxxxxx> wrote:
hi... i've been trying (unsuccessfully) to parse/process html files. i'm almost certain that the issue has to do with the fact that the html is not valid html. running the html through various apps (tidy, html validator, etc.) produces warnings.
I have been having a similar problem with html, though this time the guilty party is mshtml. That piece of dog vomit *can not be made* to produce xhtml!!! I get the pseudo-html from mshtml, run it through sgmlreader+converter class and get (x)html out. I can then parse + process the file with standard xml/xsl tools.

It took me an age to find good things for .net though - you shouldn't have nearly as many problems on linux/fedora. I suggest you pass it through tidy and get xhtml out. It may give you some junk but you don't really have many other options...

Cheers
Antoine

--
This is where I should put some witty comment.
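
For what it's worth, the tidy route suggested above could look roughly like the Python sketch below. It is only an illustration under some assumptions: the `tidy` executable is on the PATH, the input is treated as UTF-8, and the helper names (`tidy_command`, `tidy_to_xhtml`, `extract_links`) are made up for this example. Pulling out link targets just stands in for whatever processing is actually needed.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET

# Tidy's XHTML output puts every element in this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def tidy_command():
    """Command line asking tidy for XHTML on stdout (hypothetical helper)."""
    return [
        "tidy",
        "-quiet",                  # suppress the informational banner
        "-asxhtml",                # convert the input to XHTML
        "-numeric",                # numeric entities, so a plain XML parser copes
        "-utf8",                   # read and write UTF-8
        "--force-output", "yes",   # still emit output when tidy reports errors
    ]


def tidy_to_xhtml(html):
    """Pipe messy HTML through tidy and return its cleaned-up output."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy executable not found on PATH")
    result = subprocess.run(
        tidy_command(),
        input=html,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    # tidy exits with 0 (clean), 1 (warnings) or 2 (errors);
    # only give up if it produced nothing at all.
    if not result.stdout:
        raise RuntimeError(result.stderr)
    return result.stdout


def extract_links(xhtml):
    """Parse XHTML with a standard XML parser and collect href values."""
    root = ET.fromstring(xhtml)
    return [a.get("href") for a in root.iter(XHTML_NS + "a") if a.get("href")]
```

Usage would be along the lines of `extract_links(tidy_to_xhtml(open("page.html", encoding="utf-8").read()))`, where `page.html` is a placeholder file name. As noted above, tidy may still leave some junk in the output, so the XML parse can fail on badly broken pages.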