A while ago, I had to import some HTML into a Python script and found out that—while there is
cgi.escape() for encoding to HTML—there did not seem to be an easy or well-documented way for decoding HTML entities in Python.
Turns out, there are at least three ways of doing it, and which one you use probably depends on your particular app's needs.
1) Overkill: BeautifulSoup
BeautifulSoup is an HTML parser that will also decode entities for you, like this:
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x
soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
The advantage is its fault-tolerance. If your input document is malformed, it will do its best to extract a meaningful DOM tree from it. The disadvantage is, if you just have a few short strings to convert, introducing the dependency on an entire HTML parsing library into your project seems overkill.
2) Duct Tape: htmlentitydefs
Python comes with a list of known HTML entity names and their corresponding unicode codepoints. You can use that together with a simple regex to replace entities with unicode characters:
import htmlentitydefs, re
mystring = re.sub('&([^;]+);', lambda m: unichr(htmlentitydefs.name2codepoint[m.group(1)]), mystring)
print mystring.encode('utf-8')
Of course, this works. But I hear you asking: how in the world is this not in the standard library? And the geeks among you will have noticed that this does not work with numeric entities. While &copy; will give you ©, &#169; will fail miserably: the regex captures "#169", which is not a key in name2codepoint, so the lookup raises a KeyError. If you're handling random, user-entered HTML, this is not a great option.
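If you want to stay with the duct-tape approach, the numeric-entity gap can be closed with a slightly longer regex. The following is only a sketch, not a complete HTML decoder: the names ENTITY_RE, decode_entities and unichr_ are made up for this example, and the import fallback is there so the same code runs on Python 3, where the module is called html.entities and unichr() is spelled chr().

```python
import re

try:  # Python 2, as used in this post
    from htmlentitydefs import name2codepoint
    unichr_ = unichr
except ImportError:  # Python 3 renamed the module and dropped unichr()
    from html.entities import name2codepoint
    unichr_ = chr

# Matches &name;, decimal &#169; and hexadecimal &#xA9; references.
ENTITY_RE = re.compile(r'&(#[xX]?[0-9a-fA-F]+|\w+);')


def decode_entities(text):
    """Replace known HTML entities in text; leave unknown ones untouched."""
    def replace(match):
        body = match.group(1)
        try:
            if body[:2].lower() == '#x':      # hexadecimal reference
                return unichr_(int(body[2:], 16))
            if body.startswith('#'):          # decimal reference
                return unichr_(int(body[1:]))
            return unichr_(name2codepoint[body])  # named entity
        except (KeyError, ValueError, OverflowError):
            # Unknown name or invalid number: keep the original text.
            return match.group(0)
    return ENTITY_RE.sub(replace, text)
```

Unlike the one-liner above, this version returns unrecognised entities such as &bogus; unchanged instead of raising a KeyError, which seems like the safer default for user-entered HTML.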
3) Standard library to the rescue: HTMLParser
After all this, I'll give you the option I like best. The standard lib's very own HTMLParser has an undocumented method, unescape(), which does exactly what you think it does:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> s = h.unescape('&copy; 2010')
>>> s
u'\xa9 2010'
>>> print s
© 2010
>>> s = h.unescape('&#169; 2010')
>>> s
u'\xa9 2010'
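A note for anyone on Python 3: there the same functionality is a documented module-level function, html.unescape(), available since 3.4, and the HTMLParser method was removed in 3.9. A minimal sketch that picks whichever one is available, using only the standard library (the module-level name unescape is just a local alias chosen for this example):

```python
try:
    # Python 3.4 and later: documented, module-level function.
    from html import unescape
except ImportError:
    # Python 2: fall back to the undocumented HTMLParser method.
    import HTMLParser
    unescape = HTMLParser.HTMLParser().unescape

# Named and numeric references decode to the same character.
named = unescape('&copy; 2010')
numeric = unescape('&#169; 2010')
```

Both named and numeric end up holding the same string, a copyright sign followed by " 2010".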
So unless you need the advanced parsing capabilities of BeautifulSoup or want to show off your mad regex skills, this might be your best bet for squeezing unicode out of HTML snippets in Python.