Walking the HTML DOM without a Browser

23:15 Tue 30 Jan 2007

I had to do some screen-scraping today, and found myself doing it in JavaScript, just because JavaScript is able to understand the DOM. I would prefer to be able to walk the HTML DOM in another scripting language, like Perl, Ruby, or Python.

However, I’ve had trouble thus far in finding tools that do this. Granted, I haven’t spent that much time looking, but the time I did spend indicated that most of the libraries out there expect valid XML. Valid XHTML isn’t that common in the wild, and it’s more or less guaranteed that whatever you’re scraping is going to be badly written. So parsers that insist on valid XML aren’t that useful.
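To make the problem concrete, here is a small sketch in Python, using only the standard library's xml.etree.ElementTree as a stand-in for "a parser that insists on valid XML". The sample markup and the helper name parses_as_xml are made up for illustration; the point is only that one unquoted attribute or unclosed tag is enough for a strict parser to give up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of "in the wild" markup: an unquoted attribute,
# an unclosed <br>, and a <p> that is never closed.
messy_html = "<html><body><p class=intro>Hello<br>world</body></html>"

# The same content written as well-formed XHTML.
tidy_xhtml = "<html><body><p class='intro'>Hello<br/>world</p></body></html>"


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text, else False."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(messy_html))
print(parses_as_xml(tidy_xhtml))
```

The first call reports failure and the second success, which is the whole difficulty: the markup worth scraping usually looks like the first string.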

I looked at HTML::TagParser for Perl, but while it would be great for some things, it doesn’t support firstChild(), nextSibling(), or lastChild(), all of which are extremely useful when trying to grab nodes in HTML documents.

Does anyone have any suggestions for a scripting-language library that has the same kind of depth with HTML that in-browser JavaScript has? I know that this is a somewhat dubious request, since I suspect that the depth comes not from JavaScript itself but from the browser’s rendering engine, which transforms the page into a well-formed document before it produces the DOM. That being said, some kind of interface to a stripped-down Gecko engine would still be useful.

Again, looking around for that doesn’t give me quite what I’m looking for… I’m finding links to people using Gecko (outside Firefox/Mozilla) for full graphical rendering, but what I want is more like a way to instantiate a geckoDocument object, pass it text that it parses as HTML, and then have access to all the JavaScript DOM methods with it. Along the lines of:

geckoDocument myDoc = new geckoDocument(HTMLstring);
String myNode = myDoc.getElementById("someID").firstChild.nodeValue;

Anyone encountered something like this?
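For comparison, here is a rough sketch of how far the same idea gets in Python with nothing but the standard library's html.parser module. This is not Gecko and does not follow a browser's error-recovery rules; the Node and TreeBuilder classes, the parse_html helper, and the sample markup are all invented for this example. It only shows that a forgiving tokenizer plus a small tree gives firstChild/nextSibling-style access to sloppy markup.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (partial list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """A tiny DOM-like node: an element, or a text node when tag is None."""

    def __init__(self, tag=None, attrs=None, text=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.node_value = text      # text content for text nodes
        self.parent = parent
        self.children = []

    @property
    def first_child(self):
        return self.children[0] if self.children else None

    @property
    def last_child(self):
        return self.children[-1] if self.children else None

    @property
    def next_sibling(self):
        if self.parent is None:
            return None
        siblings = self.parent.children
        i = siblings.index(self)
        return siblings[i + 1] if i + 1 < len(siblings) else None

    def get_element_by_id(self, wanted):
        """Depth-first search of descendants for a matching id attribute."""
        for child in self.children:
            if child.attrs.get("id") == wanted:
                return child
            found = child.get_element_by_id(wanted)
            if found is not None:
                return found
        return None


class TreeBuilder(HTMLParser):
    """Builds a Node tree, tolerating unclosed and stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node(tag="#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag=tag, attrs=attrs, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(Node(text=data, parent=self.current))


def parse_html(text):
    builder = TreeBuilder()
    builder.feed(text)
    builder.close()
    return builder.root


# Deliberately sloppy markup: <b> and both <p> elements are never closed.
doc = parse_html('<div id="someID">first<b>bold<p>unclosed</div><p>after')
node = doc.get_element_by_id("someID")
print(node.first_child.node_value)
print(node.next_sibling.tag)
```

The tree this builds will not always match what a browser would build from the same tag soup, which is exactly why an interface to a real engine would be preferable.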

Digressing slightly, the ability to parse documents with XML parsers seems like a good reason to use XHTML over HTML. At the moment I prefer HTML, but this might swing me back towards XHTML, as “parsability” seems rather important.

On the other hand, any web application whose developers care about valid, well-formed code and parsability is likely to include ways to access XML versions of its output, something I certainly intend to do with sfmagic.org once I get around to working on it again.

6 Responses to “Walking the HTML DOM without a Browser”

  1. kevintel Says:

    You noticeably left out PHP. While often crufty, it’s still extremely powerful and easy to work with, and the odds are that someone has already written what you’re looking for. And in the unlikely event that they haven’t, it would be easy enough to make one.

  2. Tadhg Says:

    I looked for some things in PHP too, and didn’t find anything to my liking. And I don’t agree that it would be “easy enough” to write a proper DOM interpreter, particularly not one that accepts non-standard code. I suspect that it’s a deceptively deep problem, in fact, which is why I think the better approach would be to take one that’s out there already (like Gecko) and use that. But I could be wrong, and if you know of a PHP DOM tool that fits the bill, let me know. (Also let me know if you write one…)

  3. Peter Bengtsson Says:

    Have you looked at pykhtml?
    It does exactly what you ask for, but I’ve been quite disappointed with it in execution. It seems to crash quite a lot.

  4. Tadhg Says:

    Hmm, that’s not the most ringing of endorsements, but as I’m trying to get into Python more I might try it out. Thanks for the tip!

  5. Paul Giannaros Says:

    Hi there,
    Peter Bengtsson: I’d like to suggest you try PyKHTML from the development repository — it is significantly more stable and I haven’t seen it crash (by fault of the library) for a while now. I would also say in its defence that it is fairly new! It’d be great if you could report any problems with it that you’ve had.
    Tadhg: I would suggest PyKHTML; it does pretty much exactly what you’re looking for. You can feed it a bit of HTML like in your example and walk the DOM.

    (disclaimer: I am the author of PyKHTML)

  6. Tadhg Says:

    Paul: Thanks for the comment! I’ll investigate PyKHTML when I resume work on one of the various projects that need something to walk the DOM.
