html5lib rocks (and a patch to preserve attribute order)

Thursday, 18 October 2007

I've been playing with the Python html5lib package, which I came across while reading Sam Ruby's blog. What a fantastically useful library!

My original interest in it came from the discussion surrounding sanitization, and I expect to use it for that later. Today, though, I've been playing with some general parse/filter/serialize code to support preprocessing of the HTML documentation for Open Komodo.

My code looks like this:

import sys
import html5lib  # needed for html5lib.XHTMLParser below
from html5lib import treebuilders, treewalkers
from html5lib.serializer.xhtmlserializer import XHTMLSerializer

def filter_play(path):
    p = html5lib.XHTMLParser(tree=treebuilders.getTreeBuilder("simpletree"))
    # Close the input file as soon as parsing is done.
    with open(path) as f:
        dom = p.parse(f)

    walker = treewalkers.getTreeWalker("simpletree")
    stream = walker(dom)
    # Optionally wrap the token stream in a custom filter here:
    #stream = MyPreprocessingFilter(stream)

    s = XHTMLSerializer()
    outputter = s.serialize(stream)

    for item in outputter:
        sys.stdout.write(item)

filter_play(sys.argv[1])
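
A minimal sketch of what such a preprocessing filter could look like, assuming the tree walker yields dict tokens carrying a "type" key and that a filter is simply an iterable wrapping another token stream. The class name and the comment-dropping behavior are made up for illustration, and the sample tokens are hand-written rather than real html5lib output:

```python
class DropCommentsFilter(object):
    """Hypothetical filter: wraps a token stream and skips comment tokens."""

    def __init__(self, source):
        # `source` is any iterable of token dicts (assumed shape, see above).
        self.source = source

    def __iter__(self):
        for token in self.source:
            if token.get("type") == "Comment":
                continue  # drop comments, pass everything else through
            yield token


# Hand-written stand-in for a walker's token stream.
tokens = [
    {"type": "StartTag", "name": "p", "data": []},
    {"type": "Comment", "data": " internal note "},
    {"type": "Characters", "data": "Hello"},
    {"type": "EndTag", "name": "p"},
]
filtered = list(DropCommentsFilter(tokens))
```

Because the filter only relies on iteration, it could sit between the walker and the serializer without either of them knowing about it.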

One thing that bugged me a little about the output is that attributes on HTML elements get sorted, i.e. their source order is not preserved. That is totally fine for correctness, but it makes diff and similar tools less useful for comparing input with output. I also work on the Komodo IDE/editor and would like to consider using html5lib for an HTML reflow/beautifier feature at some point, where preserving attribute order will be important.
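
To make the diff problem concrete, here is a small stand-alone illustration using only the standard library's difflib. The two HTML lines are invented for the example; the second is what a serializer that sorts attributes alphabetically would plausibly emit for the first:

```python
import difflib

# One line of input markup, attributes in the order the author wrote them.
original = ['<a href="x.html" class="nav" id="home">Home</a>\n']

# The same element after a round trip that sorts attributes by name.
roundtripped = ['<a class="nav" href="x.html" id="home">Home</a>\n']

diff = list(difflib.unified_diff(original, roundtripped,
                                 "input.html", "output.html"))

# Keep only the removed/added content lines, skipping the ---/+++ headers.
changed = [line for line in diff if line.startswith(("-<", "+<"))]

for line in diff:
    print(line, end="")
```

The element is semantically unchanged, yet the diff reports the whole line as removed and re-added, which buries any real edits made by a filter.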

To that end, here is a small patch that adds the ability to preserve attribute order in serialized output. To use it:

  1. You need odict.py.
  2. You need to change the above code to:

    ...
    s = XHTMLSerializer(preserve_attr_order=True)
    ...
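
The patch itself is not reproduced here, but the idea behind the flag can be sketched in isolation. This is an assumption-laden illustration, not html5lib code: `serialize_start_tag` is a hypothetical helper, the standard library's `collections.OrderedDict` stands in for odict.py, and attribute values are not escaped:

```python
from collections import OrderedDict


def serialize_start_tag(name, attrs, preserve_attr_order=False):
    """Hypothetical helper: render a start tag from a mapping of attributes.

    Values are written out verbatim; a real serializer would escape them.
    """
    items = list(attrs.items())
    if not preserve_attr_order:
        items.sort()  # default behavior: normalize by sorting on attribute name
    rendered = "".join(' %s="%s"' % (key, value) for key, value in items)
    return "<%s%s>" % (name, rendered)


# An ordered mapping remembers the order in which attributes were added.
attrs = OrderedDict([("href", "x.html"), ("class", "nav")])

sorted_tag = serialize_start_tag("a", attrs)
ordered_tag = serialize_start_tag("a", attrs, preserve_attr_order=True)
```

With the flag off, `sorted_tag` is `<a class="nav" href="x.html">`; with it on, `ordered_tag` keeps the source order, `<a href="x.html" class="nav">`. The flag only helps if the tree stored the attributes in an ordered mapping in the first place, which is why the tree builder has to be involved too.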
    

Obviously this isn't ready to check in to html5lib. Reasons why:

  • It only works for the "simpletree" treebuilder/treewalker. I'm not sure if it is feasible/practical to get it to work with some of the others (e.g. dom).
  • It unconditionally requires an external non-standard module (odict.py).
  • It should be optional on the parser because (a) using OrderedDict instead of dict would presumably have an undesired perf impact and (b) the attribute order normalization could be desirable for many users.

Maybe a better solution would be a custom "roundtriptree" tree type? Anyway, just throwing this up here to perhaps come back to later. I have to dig into the html5lib discussion list to see if this has come up before.

Tagged: python, komodo, programming