Python snippet: SAX parser with internal entity expansion disabled

This was a rough one. I’m using Python’s SAX modules to parse JMdict, and ran into trouble with its use of XML entities. Nothing wrong with JMdict, but the expanded entity text is rather verbose and does not lend itself well to what I’m doing. I don’t want “word containing irregular kanji usage”; I want the “iK” code itself.

Unfortunately, the default ExpatParser doesn’t have an obvious way to disable entity expansion. But if you read the pyexpat docs closely enough, you’ll find that you can set a “default handler” (DefaultHandler) on the underlying parser, which has the side effect of disabling expansion of internal entities.
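To see that side effect in isolation, here is a minimal sketch using pyexpat directly rather than SAX. The sample document and the parse() helper are made up for illustration; only the “iK” entity mimics what JMdict declares.

```python
import xml.parsers.expat

# Hypothetical sample document; the "iK" entity mimics a JMdict declaration.
DOC = b"""<?xml version="1.0"?>
<!DOCTYPE entry [
<!ELEMENT entry (#PCDATA)>
<!ENTITY iK "word containing irregular kanji usage">
]>
<entry>&iK;</entry>"""


def parse(disable_expansion):
    """Parse DOC and return (character data, data seen by the default handler)."""
    chars, leftovers = [], []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chars.append
    if disable_expansion:
        # Setting DefaultHandler (as opposed to DefaultHandlerExpand)
        # inhibits expansion of internal entities.
        parser.DefaultHandler = leftovers.append
    parser.Parse(DOC, True)
    return "".join(chars), leftovers


expanded_text, _ = parse(disable_expansion=False)
raw_text, leftovers = parse(disable_expansion=True)
```

Without the default handler, the character data should contain the full expanded text. With it, the unexpanded reference should land in the default handler's data instead, along with the XML declaration and DTD, since nothing else handles those here.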

This isn’t a perfect fix, but here’s the class I used to get this done:

# Code is Copyright 2009 by Paul Goins
# Released into the public domain - but let me know if this is useful!

from xml.sax.expatreader import ExpatParser

class ExpatParserNoEntityExp(ExpatParser):

    """An overridden Expat parser class which disables entity expansion."""

    def reset(self):
        ExpatParser.reset(self)
        self._parser.DefaultHandler = self.dummy_handler

    def dummy_handler(self, *args, **kwargs):
        pass

So, create an instance of this class and call its parse() method, instead of using a parser created by xml.sax.make_parser (or calling xml.sax.parse, for that matter). Then, in your xml.sax.handler.ContentHandler, define skippedEntity(self, name), since that’s where your internal entities are going to pop up. By the way, built-ins such as &amp; and &lt; still seem to be expanded; at least in the case of JMdict, this seems to affect only the entities declared in the inline DTD.
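Here is a rough sketch of how the pieces fit together. The parser class is repeated so the snippet runs on its own; EntityCodeHandler, SAMPLE and collect() are hypothetical names made up for this example, and the sample document is a tiny stand-in for JMdict, not real data.

```python
import io
from xml.sax.expatreader import ExpatParser
from xml.sax.handler import ContentHandler


class ExpatParserNoEntityExp(ExpatParser):
    """Same class as above, repeated so this snippet is self-contained."""

    def reset(self):
        ExpatParser.reset(self)
        self._parser.DefaultHandler = self.dummy_handler

    def dummy_handler(self, *args, **kwargs):
        pass


class EntityCodeHandler(ContentHandler):
    """Hypothetical handler that records entity names and character data."""

    def __init__(self):
        super().__init__()
        self.codes = []
        self.text = []

    def skippedEntity(self, name):
        # Unexpanded internal entities arrive here by name, e.g. "iK".
        self.codes.append(name)

    def characters(self, content):
        self.text.append(content)


# Hypothetical stand-in for a JMdict entry with an inline DTD.
SAMPLE = b"""<?xml version="1.0"?>
<!DOCTYPE entry [
<!ELEMENT entry (ke_inf, gloss)>
<!ELEMENT ke_inf (#PCDATA)>
<!ELEMENT gloss (#PCDATA)>
<!ENTITY iK "word containing irregular kanji usage">
]>
<entry><ke_inf>&iK;</ke_inf><gloss>salt &amp; pepper</gloss></entry>"""


def collect(xml_bytes):
    """Parse xml_bytes and return (skipped entity names, joined text)."""
    handler = EntityCodeHandler()
    parser = ExpatParserNoEntityExp()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.codes, "".join(handler.text)


codes, text = collect(SAMPLE)
```

With this sample, codes should hold the bare entity name, while the built-in &amp; still comes through characters() as a plain ampersand.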

Hope this helps someone!
