A while ago, I had to import some HTML into a Python script and found out that—while there is cgi.escape() for encoding to HTML—there did not seem to be an easy or well-documented way for decoding HTML entities in Python.

Silly, right?

Turns out, there are at least three ways of doing it, and which one you use probably depends on your particular app's needs.

1) Overkill: BeautifulSoup

BeautifulSoup is an HTML parser that will also decode entities for you, like this:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3
soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

The advantage is its fault tolerance: if your input document is malformed, it will do its best to extract a meaningful DOM tree from it. The disadvantage is that if you just have a few short strings to convert, adding a dependency on an entire HTML parsing library to your project seems like overkill.

2) Duct Tape: htmlentitydefs

Python comes with a list of known HTML entity names and their corresponding unicode codepoints. You can use that together with a simple regex to replace entities with unicode characters:

import htmlentitydefs, re
mystring = re.sub('&([^;]+);', lambda m: unichr(htmlentitydefs.name2codepoint[m.group(1)]), mystring)
print mystring.encode('utf-8')

Of course, this works. But I hear you saying, how in the world is this not in the standard library? And the geeks among you have also noticed that this will not work with numeric entities. While &copy; will give you ©, &#169; will fail miserably: the lookup for "#169" in name2codepoint raises a KeyError, as does any entity name the table does not know. If you're handling random, user-entered HTML, this is not a great option.
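
For what it's worth, the regex approach can be stretched to cover numeric entities too. Here is a minimal sketch, assuming Python 3 (where htmlentitydefs is called html.entities and unichr() is plain chr()). The names ENTITY_RE, decode_entity and decode_entities are made up for illustration, and leaving unknown entities untouched is my own choice, not something the standard library prescribes.

```python
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

# Matches &name;, &#123; and &#x7B; style entities.
ENTITY_RE = re.compile(r'&(#[xX][0-9a-fA-F]+|#[0-9]+|\w+);')


def decode_entity(match):
    """Return the character for one matched entity, or the entity itself."""
    body = match.group(1)
    try:
        if body[:2].lower() == '#x':       # hexadecimal, e.g. &#xA9;
            return chr(int(body[2:], 16))
        if body.startswith('#'):           # decimal, e.g. &#169;
            return chr(int(body[1:]))
        return chr(name2codepoint[body])   # named, e.g. &copy;
    except (KeyError, ValueError, OverflowError):
        # Unknown name or out-of-range code point: leave the text as it was.
        return match.group(0)


def decode_entities(text):
    """Replace every recognizable HTML entity in text."""
    return ENTITY_RE.sub(decode_entity, text)


print(decode_entities('&copy; 2010 and &#169; 2010'))
```

It is still duct tape, just more of it: the sketch only handles well-formed entities that end in a semicolon.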

3) Standard library to the rescue: HTMLParser

After all this, I'll give you the option I like best. The standard lib's very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> s = h.unescape('&copy; 2010')
>>> s
u'\xa9 2010'
>>> print s
© 2010
>>> s = h.unescape('&#169; 2010')
>>> s
u'\xa9 2010'

So unless you need the advanced parsing capabilities of BeautifulSoup or want to show off your mad regex skills, this might be your best bet for squeezing unicode out of HTML snippets in Python.
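
One note if you are on Python 3: the same functionality is available there as the documented function html.unescape() (Python 3.4 and later), and HTMLParser.unescape() has since been removed. A minimal sketch, assuming Python 3.4 or newer; the variable names are just for illustration:

```python
import html

# Named, decimal and hexadecimal forms of the copyright sign.
samples = ['&copy; 2010', '&#169; 2010', '&#xA9; 2010']

# html.unescape() returns a str with the entities decoded.
decoded = [html.unescape(s) for s in samples]

for original, result in zip(samples, decoded):
    print(original, '->', result)
```

All three inputs should decode to the same string, so the named-versus-numeric problem from the regex approach does not come up.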
