Decoding HTML Entities to Text in Python
A while ago, I had to import some HTML into a Python script and found out that—while there is cgi.escape() for encoding to HTML—there did not seem to be an easy or well-documented way for decoding HTML entities in Python.
Silly, right?
Turns out, there are at least three ways of doing it, and which one you use probably depends on your particular app’s needs.
1) Overkill: BeautifulSoup
BeautifulSoup is an HTML parser that will also decode entities for you, like this:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
The advantage is its fault-tolerance. If your input document is malformed, it will do its best to extract a meaningful DOM tree from it. The disadvantage is that, if you just have a few short strings to convert, introducing a dependency on an entire HTML parsing library into your project seems like overkill.
2) Duct Tape: htmlentitydefs
Python comes with a list of known HTML entity names and their corresponding unicode codepoints. You can use that together with a simple regex to replace entities with unicode characters:
import htmlentitydefs, re
mystring = re.sub('&([^;]+);', lambda m: unichr(htmlentitydefs.name2codepoint[m.group(1)]), mystring)
print mystring.encode('utf-8')
Of course, this works. But I hear you saying, how in the world is this not in the standard library? And the geeks among you have also noticed that this will not work with numerical entities. While &copy; will give you ©, &#169; will fail miserably. If you’re handling random, user-entered HTML, this is not a great option.
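If you want to stay with the regex approach anyway, here is a minimal sketch of how numeric references could be handled alongside named ones. It is written for Python 3, so it assumes html.entities in place of htmlentitydefs and chr() in place of unichr(). The names ENTITY_RE, decode_entity and decode_entities are made up for this illustration, and leaving unknown entities untouched is my own design choice, not something the snippet above does.

```python
import re
from html.entities import name2codepoint

# Matches &name;, decimal (&#169;) and hexadecimal (&#xA9;) references.
ENTITY_RE = re.compile(r'&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);')


def decode_entity(match):
    """Return the character for one matched entity (hypothetical helper)."""
    body = match.group(1)
    try:
        if body[:2].lower() == '#x':
            return chr(int(body[2:], 16))   # hexadecimal reference
        if body.startswith('#'):
            return chr(int(body[1:]))       # decimal reference
        return chr(name2codepoint[body])    # named entity
    except (KeyError, ValueError, OverflowError):
        # Unknown name or out-of-range code point: keep the original text.
        return match.group(0)


def decode_entities(text):
    """Replace every entity the regex recognizes in text."""
    return ENTITY_RE.sub(decode_entity, text)


print(decode_entities('&copy; 2010, &#169; 2010, &#xA9; 2010, AT&T'))
```

A bare ampersand such as the one in "AT&T" is left alone because the pattern requires a terminating semicolon.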
3) Standard library to the rescue: HTMLParser
After all this, I’ll give you the option I like best. The standard lib’s very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> s = h.unescape('&copy; 2010')
>>> s
u'\xa9 2010'
>>> print s
© 2010
>>> s = h.unescape('&#169; 2010')
>>> s
u'\xa9 2010'
So unless you need the advanced parsing capabilities of BeautifulSoup or want to show off your mad regex skills, this might be your best bet for squeezing unicode out of HTML snippets in Python.
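A note for anyone on Python 3: the session above is Python 2. In Python 3.4 and later the same decoding is available as the documented function html.unescape(), and the old unescape() method on the parser class is no longer there in recent versions. A short sketch, where the variable names are only for illustration:

```python
import html

# html.unescape() handles named, decimal and hexadecimal references.
named = html.unescape('&copy; 2010')
decimal = html.unescape('&#169; 2010')
hexadecimal = html.unescape('&#xA9; 2010')

print(named)
print(decimal)
print(hexadecimal)
```

All three calls should give the same string, a copyright sign followed by " 2010".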


Luckily, I have the Apache Commons Lang libraries for that purpose.
You’d want to use StringEscapeUtils.unescapeHtml(...) if you were coding in Java. Needless to say, I expect they do not handle entities from other vocabularies like MathML…
I agree with Kimber that DTDs and entities were a mistake in XML. They are really annoying legacy junk that should never have been included.
Yes, I think it only decodes HTML. Entities in XML are sufficiently useless indeed, except maybe for the &lt; one, without which you’d have to wrap every piece of text in a CDATA block if it contains such a character. Everything else can be done by properly choosing a character set (mostly: UTF-8) for the document.
There’s a bunch of UTF-8 characters which are not allowed in XML:
http://www.w3.org/TR/unicode-xml/#Suitable
I’ve seen a particular database product from a company based in Armonk, NY, (won’t name names here) getting the hiccups when using XML columns after converting texts from Eastern European character sets to UTF-8.
And I bet some of my colleagues can tell even uglier stories. After all… we’re working at “the company formerly known as ‘The XML Company’”… :-p