hi...

i've been trying (unsuccessfully) to parse/process html files. i'm almost
certain that the issue has to do with the fact that the html is not valid
html.. running the html through various apps (tidy, html validator, etc.)
produces warnings.

i'm not sure why the LibXML functions (perl) and tidy complain about the
structure of the page. i also can't see a way to get LibXML to ignore the
warnings.

as such, i've been wondering if anybody else has had this issue, and how you
managed to resolve this in an automated manner...

using firefox, and the XPath plugin with the DOM Inspector, I'm able to
traverse the DOM for the web page. I can also create an XPath query that I
can use in the XPath window of the firefox plugin to extract/display the
correct elements/section of the page...
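
to make the goal concrete, here is a rough sketch of the kind of lenient
parse-then-query step described above. it is in python rather than perl and
uses only the standard library (html.parser plus the limited XPath subset in
ElementTree), so it is not LibXML and not the firefox engine. the names
LenientTreeBuilder and parse_html, the synthetic "document" root and the
sample markup are all made up for illustration. it also does not apply a
browser's implied-end-tag rules, so an unclosed <td> ends up nested inside
the previous one instead of beside it.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class LenientTreeBuilder(HTMLParser):
    """Hypothetical helper: build an ElementTree without rejecting bad markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # synthetic root (assumption)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; store them as empty strings.
        elem = ET.SubElement(self.stack[-1], tag,
                             {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        cur = self.stack[-1]
        if len(cur):  # text following a child element belongs in its tail
            last = cur[-1]
            last.tail = (last.tail or "") + data
        else:
            cur.text = (cur.text or "") + data


def parse_html(text):
    builder = LenientTreeBuilder()
    builder.feed(text)
    builder.close()  # flushes any buffered trailing text
    return builder.root


# Unquoted attributes, unclosed <td>, <tr>, <p>, <body> and <html>.
broken = ("<html><body><table><tr><td class=price>10"
          "<td class=price>12</table><p>unclosed")
root = parse_html(broken)
prices = [td.text for td in root.findall(".//td[@class='price']")]
print(prices)  # ['10', '12']
```

the query string passed to findall plays the role of the XPath typed into
the firefox plugin window, although ElementTree only understands a small
subset of XPath.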

i'm curious as to whether it might be possible to use the firefox engine,
coupled with the DOM/XPath plugin functionality to parse the file from a
perl/command line app...

has anyone ever done anything like this, or heard of anyone who has? is it
even possible to programmatically call the firefox app?

thoughts/comments/etc...

-bruce

and yeah.. i've also posted to the firefox email list.. but thought it might
be useful to post here as well!!


