A while ago, I had to import some HTML into a Python script and found out that—while there is cgi.escape() for encoding to HTML—there did not seem to be an easy or well-documented way for decoding HTML entities in Python.

Silly, right?

Turns out, there are at least three ways of doing it, and which one you use probably depends on your particular app's needs.

1) Overkill: BeautifulSoup

BeautifulSoup is an HTML parser that will also decode entities for you, like this:

soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

The advantage is its fault-tolerance. If your input document is malformed, it will do its best to extract a meaningful DOM tree from it. The disadvantage is, if you just have a few short strings to convert, introducing the dependency on an entire HTML parsing library into your project seems overkill.

2) Duct Tape: htmlentitydefs

Python comes with a list of known HTML entity names and their corresponding unicode codepoints. You can use that together with a simple regex to replace entities with unicode characters:

import htmlentitydefs, re
mystring = re.sub('&([^;]+);', lambda m: unichr(htmlentitydefs.name2codepoint[m.group(1)]), mystring)
print mystring.encode('utf-8')
Of course, this works. But I hear you saying, how in the world is this not in the standard library? And the geeks among you have also noticed that this will not work with numerical entities. While © will give you ©, © will fail miserably. If you're handling random, user-entered HTML, this is not a great option.

3) Standard library to the rescue: HTMLParser

After all this, I'll give you the option I like best. The standard lib's very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> s = h.unescape('© 2010')
>>> s
u'\xa9 2010'
>>> print s
© 2010
>>> s = h.unescape('© 2010')
>>> s
u'\xa9 2010'

So unless you need the advanced parsing capabilities of BeautifulSoup or want to show off your mad regex skills, this might be your best bet for squeezing unicode out of HTML snippets in Python.

Was this helpful? Buy me a coffee with Bitcoin! (What is this?)

SQlite error: Unable to Open Database File

When writing a web app using an SQLite database file, you might run into an error where you can read from the file but not write to it......… Continue reading

Micro-tipping with Bitcoin

Published on November 27, 2014

setTimeout in Python

Published on November 14, 2014