A while ago, I had to import some HTML into a Python script and found out that—while there is cgi.escape() for encoding to HTML—there did not seem to be an easy or well-documented way to decode HTML entities in Python.

Silly, right?

Turns out, there are at least three ways of doing it, and which one you use probably depends on your particular app's needs.

1) Overkill: BeautifulSoup

BeautifulSoup is an HTML parser that will also decode entities for you, like this:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)

The advantage is its fault-tolerance. If your input document is malformed, it will do its best to extract a meaningful DOM tree from it. The disadvantage is, if you just have a few short strings to convert, introducing the dependency on an entire HTML parsing library into your project seems overkill.
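For illustration, here is a self-contained version of the above (BeautifulSoup 3 API; the sample string is mine):

from BeautifulSoup import BeautifulSoup

html = '&copy; 2010, Fred'
soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
print unicode(soup)  # © 2010, Fred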

2) Duct Tape: htmlentitydefs

Python comes with a list of known HTML entity names and their corresponding unicode codepoints. You can use that together with a simple regex to replace entities with unicode characters:

import htmlentitydefs, re

# Replace each named entity with its corresponding unicode character.
mystring = re.sub('&([^;]+);',
    lambda m: unichr(htmlentitydefs.name2codepoint[m.group(1)]), mystring)
print mystring.encode('utf-8')

Of course, this works. But I hear you saying: how in the world is this not in the standard library? And the geeks among you have also noticed that this will not work with numerical entities: while &copy; will give you ©, &#169; will fail miserably. If you're handling random, user-entered HTML, this is not a great option.
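For the curious: the numerical case can be patched in by hand. A quick sketch (decimal entities only; hex entities like &#xA9; would need yet another branch):

import htmlentitydefs, re

def decode_entities(s):
    def _repl(m):
        name = m.group(1)
        if name.startswith('#'):  # numerical entity, e.g. &#169;
            return unichr(int(name[1:]))
        return unichr(htmlentitydefs.name2codepoint[name])
    return re.sub('&([^;]+);', _repl, s)

At that point, though, the duct tape label really starts to stick.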

3) Standard library to the rescue: HTMLParser

After all this, I'll give you the option I like best. The standard lib's very own HTMLParser has an undocumented method unescape() which does exactly what you think it does:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> s = h.unescape('&copy; 2010')
>>> s
u'\xa9 2010'
>>> print s
© 2010
>>> s = h.unescape('&#169; 2010')
>>> s
u'\xa9 2010'

So unless you need the advanced parsing capabilities of BeautifulSoup or want to show off your mad regex skills, this might be your best bet for squeezing unicode out of HTML snippets in Python.


Last week, I secretly released version 1.2 of my Copy ShortURL add-on. It contains a lot of improvements based on your feedback! Here's the 411 on the new features and how to use them:

is.gd is the new default

I switched to is.gd (from tinyurl) as the default shortening service. I am affiliated with neither of them, but I thought the point of short URLs is, well, being short. So is.gd wins on that front. If you don't like that, don't fret, because...

You can pick your own short URL service now

If you have a short URL service that you like more than the default, you can pick your own now. Instructions are in the README file on github (towards the bottom). By setting the preference extensions.copyshorturl.serviceURL in about:config, you can, for example, use tinyurl, bit.ly (requires an API key), and lots of other URL shorteners. If you have additional service URLs to share with the class, please leave a comment!

Notifications

Initially, there was no way to tell whether the add-on had already done its job, except by checking your clipboard contents (hint: if in doubt, yes, it worked). So I added unobtrusive Growl notifications for platforms that support them.

If you don't have Growl, a Firefox notification bar is shown instead.

Finally, Copy ShortURL is now compatible with Firefox versions 3.6 to 4.0b5pre.

Hope you like it, and feel free to leave a comment here or file issues on github if anything is not working as expected.


Since I last blogged about the Copy Short URL add-on, I stumbled across another, very popular example of automatically exposed short URLs:

Wordpress.com as well as self-hosted Wordpress instances have automatic short URLs now, starting with Wordpress version 3.0.

For example, this blog post on wordpress.com about a possible proof for P != NP has the shiny short URL http://wp.me/pr9Ir-1lN.

A recent post on my own blog, in turn, has: http://fredericiana.com/?p=2921.

Of course, it's a little sad that the auto-generated short URLs on self-hosted Wordpress instances are so ugly, and they are not really short enough to be used easily on twitter or with other character-sensitive applications. But considering how long your average blog post URL is in the first place, it seems like a great win nonetheless.

An unrelated side note: I filed a bug to expose bugzil.la URLs on Mozilla's bugzilla instance. It hasn't been picked up or resolved yet, so if you want to see this supported as much as I do, feel free to comment on or CC yourself on the bug!


The Copy ShortURL Add-on has been on AMO for a week now and was recently approved to be public, so now I have a user base to please ;)

I am inclined to drop the code onto github, where I'd get a proper version history along with a bug tracker. Update: It's on github now!

For now though, here are a few ideas I have for the add-on, in no particular order and with no promise that I'm about to implement any of this right away:

  • Allow other URL shortening services. tinyURL is all fun and games, and I chose it over bit.ly because it does not require an API key -- but if you have one at hand, you should be able to use any service you like. Even if only by setting an about:config preference.
  • Incorporate selected sites that support short URLs but do not publish them as a header. Zappos (zapp.me), for example. Others seem to have a short URL available (such as: NY Times (nyti.ms), Amazon (amzn.to), ESPN (es.pn)) but only use them on their twitter account and not on every webpage, so there might be nothing we can do :(.
  • When shortening, make sure not to use the current URL but the canonical URL, if such a header exists. (Fixed!)

Let me know what you think! I'd like to know if any other things come to your minds, or which of the above you'd find especially useful.


Update: The add-on is now on AMO! Check it out! Also, feedback is greatly appreciated!


This week during the Mozilla Summit in Whistler, British Columbia, there was a "Rocket Your Firefox" Jetpack contest. The idea: make a new add-on using the Jetpack SDK, submit it, win a prize.

So I went ahead and made a jetpack called "Copy Short URL" and it does what it sounds like:

On any webpage, you get a new item in the right click menu called "copy short URL". When you click it, the add-on looks for a canonical short URL exposed in the page header. Currently, a number of major websites expose their own short URLs for any entry on their webpages, among these: youtube ("youtu.be/..."), flickr ("flic.kr/..."), Arstechnica, Techcrunch, and many more. If, however, the site does not name its own short URL, the add-on automatically falls back to making a short URL using tinyurl.com. Either way, after a fraction of a second, you end up with a short URL in your clipboard, ready to be used in forum posts, tweets, or wherever else you please.
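For illustration, here is a rough Python sketch of the logic the add-on implements (not its actual code -- the add-on is written in JavaScript against the Jetpack SDK; the regex and the rel="shortlink"/"shorturl" attribute values are simplifying assumptions):

import re, urllib, urllib2

def short_url(url):
    html = urllib2.urlopen(url).read()
    # Prefer a short URL the site itself exposes in its page head.
    match = re.search(
        r'<link[^>]+rel=["\']short(?:link|url)["\'][^>]+href=["\']([^"\']+)',
        html, re.IGNORECASE)
    if match:
        return match.group(1)
    # Otherwise, fall back to tinyurl's creation API.
    return urllib2.urlopen(
        'http://tinyurl.com/api-create.php?url=' + urllib.quote(url)).read()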

My add-on won the contest in the "most useful" category. The prize was an awesome jetpack sweatshirt.

If you want to check out the add-on, it is currently available (open source, of course) on the add-ons builder website. I also uploaded the add-on to AMO.

Hope you find it useful!


Note: Several people asked where the link is to actually add feedback to the site. This is, of course, a good point. As mentioned in the comments: The designated entry point for the feedback application is going to be an extension bundled with Firefox 4 Beta. For more information, please read Aakash's blog post. To try out the application already, feel free to add happy or sad feedback to the test site.


This morning, we published the Firefox Input application. It is a little web application soliciting feedback from our Firefox Beta Program users. The aim is to make it as easy as possible for people to tell us what specifically they like or dislike about an upcoming version of Firefox.

The application was, as far as software goes, developed very rapidly: We made it from requirements to production in a mere three weeks. What made this possible was a number of reusable components that allowed us to avoid reinventing the wheel and stay focused on making the application awesome.

A few key components of the Input application:

  • Django. I can't stress this enough, but Django is a fantastic web application framework. It makes it incredibly easy to set up a web application quickly and securely. Its built-in admin pages saved me days of work that I would otherwise have had to spend allowing project admins to edit the application data.
  • Jinja2 and Jingo. The only big drawback of Django is its template language: The instant you make nontrivial web applications, it gets in your way. Luckily, like all parts of Django, it is replaceable: Jinja2 and Jeff Balogh's jingo interface come to the rescue. The two of them are already in use over at AMO and also serve us well on Input.
  • Term extraction. Firefox Input extracts key words from all feedback. Sure, you can just split the sentences into words, but if you want to avoid collecting all sorts of meaningless particles ("the", "a", "if", ...), it becomes a little more complicated. We are using the topia.termextract library, which gladly does the heavy lifting for us (see the sketch after this list). Only caveat: It only works for English, so once the application is localized, we need a different solution for the other languages.
  • Search. For the longest time, there was no generic way to do search in a Django app (other than straight SQL queries). In the meantime, haystack has started to fill that gap. We use it on Input in conjunction with Whoosh, a pure-Python search library. That is very easy to set up, at the expense of scalability -- if we outgrow it, however, it will be easy to switch search engines with virtually no code changes at all (a settings sketch follows below). Thumbs up!
  • Product details. Only very recently, we released a Mozilla product details library for Django, and this is the first application to rely intimately on up-to-date product data: Input only lets users of the latest beta versions of Firefox add feedback, so it auto-updates its product data periodically to gather feedback for the newest versions as quickly as possible.
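To give you an idea of the term extraction, here is a minimal sketch of how topia.termextract is typically used (the sample sentence is mine, not Input's actual code):

from topia.termextract import extract

extractor = extract.TermExtractor()
# Yields (term, number of occurrences, strength) tuples.
terms = extractor("Firefox crashes when I open too many tabs.")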
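And here is roughly what the haystack-with-Whoosh wiring looks like in settings.py (haystack 1.x style; the module path and index location are hypothetical, not Input's actual configuration):

# settings.py
HAYSTACK_SITECONF = 'input.search_sites'  # module defining the search sites
HAYSTACK_SEARCH_ENGINE = 'whoosh'         # swapping engines means changing this
HAYSTACK_WHOOSH_PATH = '/path/to/whoosh_index'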

As always, the source code of Firefox Input is openly and freely available. If you notice any problems with it, feel free to fork it on github, or file a bug in our bug tracker.


I just watched the pilot episode of Pioneer One, the "first ever made-for-torrent" TV series, and I liked it a lot!

The story is intriguing:

An object in the sky spreads radiation over North America. Fearing terrorism, U.S. Homeland Security agents are dispatched to investigate and contain the damage. What they discover will have implications for the entire world.

The pilot episode was filmed on a budget of a mere 6,000 dollars (all of it funded by private donations), and for that, the idea is very well executed. I suggest you all see it; the video is freely and legally available through VODO. Pioneer One is licensed under a Creative Commons Attribution Noncommercial Sharealike license.

What I find very impressive about the show is that, unlike traditional producers, its makers embrace rather than demonize P2P file sharing. Pioneer One thus has the chance to show that grassroots film-making (or rather, TV-series-making) can succeed beyond a tiny scale by actively engaging the Internet community in both funding and distribution, rather than using the Internet as a simple, tightly controlled broadcasting medium -- a glorified TV set.

When the article on Pioneer One faced (and fenced off) a deletion request due to alleged irrelevance on Wikipedia, I wrote the following in the deletion discussion:

Keep. Not for it being a low-funds TV series, as it is not exceptional in that respect, but for its attempt at being successful through Torrent distribution. [...] The main reason for its notability is that we see a huge effort on the side of traditional media distribution groups against P2P networking as a concept. They essentially argue that P2P is [not] tightly controllable and therefore it must be objectionable. Making an active effort to legally distribute media content via P2P is much more a political statement for the legitimacy of P2P as a cultural phenomenon than it is a way to keep distribution cost low. Compare this to other attempts at making a (mini-) series popular on the Internet (Dr. Horrible, for example) that while being free-as-in-beer (initially) did not use P2P technology (or any free-as-in-speech distribution channel), and you'll see how radically different Pioneer One is in that respect. [...]

Sure, it is not the first free-to-torrent project. But it's the first free-to-torrent series that might actually become successful. And it is a way for the filesharing community to show what it is really about: Free speech, not free beer.


Need to add a robots.txt file to your Django project to tell Google and friends what and what not to index on your site?

Here are three ways to add a robots.txt file to Django.

1) The (almost) one-liner

In an article on e-scribe.com, Paul Bissex suggests adding this rule to your urls.py file:

from django.conf.urls.defaults import patterns
from django.http import HttpResponse

urlpatterns = patterns('',
    # ...
    (r'^robots\.txt$', lambda r: HttpResponse(
        "User-agent: *\nDisallow: /", mimetype="text/plain")),
)

The advantage of this solution is, it is a simple one-liner disallowing all bots, with no extra files to be created, and no clutter anywhere. It's as simple as it gets.

The disadvantage, obviously, is the lack of scalability. The instant you have more than one rule to add, this approach quickly gets out of hand. Also, one could argue that urls.py is not the right place for content of any kind.

2) Direct to template

This one is the most intuitive approach: Just drop a robots.txt file into your main templates directory and serve it via direct_to_template:

from django.views.generic.simple import direct_to_template

urlpatterns = patterns('',
    ...
    (r'^robots\.txt$', direct_to_template,
     {'template': 'robots.txt', 'mimetype': 'text/plain'}),
)

Just remember to set the MIME type appropriately to text/plain, and off you go.

The advantage is its simplicity, and if you already have a robots.txt file you want to reuse, there's no overhead.

Disadvantage: If your robots file changes somewhat frequently, you need to push changes to your web server every time. That can get tedious. Also, this approach does not save you from typos or the like.

3) The django-robots app

Finally, there's a full-blown django app available that you can install and drop into your INSTALLED_APPS: It is called django-robots.

For small projects, this would be overkill, but if you have a lot of rules, or if you need a site admin to change them without pushing changes to the web server, this is your app of choice.
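Setup is the usual two steps -- add the app, then route requests to it (a sketch; django-robots stores its rules in the database, editable through the admin):

# settings.py
INSTALLED_APPS = (
    # ...
    'robots',
)

# urls.py
from django.conf.urls.defaults import patterns, include

urlpatterns = patterns('',
    # ...
    (r'^robots\.txt$', include('robots.urls')),
)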

Which one is right for me?

Depending on how complicated your rule set is, either one of the solutions may be the best fit for you. Just choose the one that you are the most comfortable with and that fits the way you are using robots.txt in your application.


Note: Several commenters have provided valuable feedback that I am responding to with updates to this post and in the comments. Make sure to read both!


At yesterday's WWDC keynote, Steve Jobs introduced "FaceTime" and explained its underlying technologies with the following words:

"Now it's based on a handful of standards... but this is going to be an open industry standard."

I wish Steve Jobs would cease calling H.264 and similar standards "open". Technologies that cost millions of dollars to use are, by definition, not open. He can hope it'll become an "industry standard" (as in, used by companies other than Apple), fair enough. But he can't say it is going to be "open". That's like a college kid calling the grocery store "free" just because daddy gave them an unlimited credit card. And it's doing a huge disservice to the open standards community by misusing the term in the worst possible way.

Quote and photograph courtesy of Engadget. Thanks!


Update, 4/9/2010: For a while, I removed the above text in order not to express unwarranted criticism towards the speaker. After several rounds of user feedback, however, I decided to keep the original text and update/annotate this post as necessary.

Update on the definition of openness:

As Sandy pointed out in the comments, there are many definitions of what makes a standard "open", and depending on which one you follow, a standard that charges licensing fees can still be called "open" as long as it does not exclude anyone with enough money to pay them. I disagree with that view, but it is a possible interpretation.

Commenter Dave mentions that Steve Jobs usually makes sure to call actual open standards "open" and calls H.264 and similar technologies "industry standards" instead. He is therefore likely to know the difference between the two -- even though calling an entire stack "open" when some of its components do not match that definition is a strange, or even misleading, point of view.

Finally, Jo argues that the mere fact that other vendors can build devices that connect to the FaceTime stack (rather than it being limited to Apple products) makes it "open". In other words, this use of "open" would be a synonym for "standards compliant". I believe that is still a very limited view of openness, but at least it is more open than the alternative: a locked-down proprietary solution.

Peter also reminds us of the technical limitations: Since all mobile devices need hardware support for video encoding and decoding, Apple had to settle on H.264 a long time ago, and even if they wanted to, they could not simply switch over to a different codec. Most people (myself included) also seem to agree that H.264 is -- from a purely technical standpoint -- a good choice for the FaceTime stack.


Over on the Mozilla Webdev blog, I just posted about a new library of ours, django-mozilla-product-details. This tongue twister allows you to periodically update the latest Mozilla product version information as well as language details from our SVN server.

The geeks among you are surely wondering, isn't that going to lead to a lot of useless traffic if the data does not change as frequently as it is being updated?

You are right. Because re-downloading unchanged data is evil and because we like our servers, we are using a fun little trick to keep the amount of data transferred as small as possible:

Every time the update script is run, we first issue a HEAD request to the SVN server: A HEAD request is a type of HTTP request that asks the server for some location, but instead of the actual data (an HTML document, for example, or some binary data), the server returns only the response headers.

From these headers, which are very small, we can read the Last-Modified timestamp and compare that to the time we last updated our local copy of the product data. If the timestamp hasn't changed since then, there's no need for us to download further data.
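In Python, such a HEAD request is only a few lines (a sketch; the host and path are made up, not our actual SVN layout):

import httplib

conn = httplib.HTTPConnection('svn.example.com')
conn.request('HEAD', '/libs/product-details/')
# Only the headers come back; the response body stays on the server.
last_modified = conn.getresponse().getheader('Last-Modified')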

Even better (and what the library does now, see the update below): Instead of blindly downloading the data files on every update, we can send the time of our last successful update along with the GET request itself, in an If-Modified-Since HTTP request header. If the files have changed since then, the server will send us the updated list, but if nothing has changed in the meantime, the server will just return a "304 Not Modified" status.
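Again as a sketch, here is what such a conditional GET looks like with urllib2 (the URL and function name are hypothetical; note that urllib2 surfaces the 304 as an HTTPError):

import urllib2

def fetch_if_changed(url, last_modified=None):
    request = urllib2.Request(url)
    if last_modified:
        # Ask the server to only send data newer than our local copy.
        request.add_header('If-Modified-Since', last_modified)
    try:
        response = urllib2.urlopen(request)
    except urllib2.HTTPError, e:
        if e.code == 304:  # Not Modified -- nothing to download.
            return None, last_modified
        raise
    return response.read(), response.headers.get('Last-Modified')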

This is how we ensure that (almost) no matter how often you choose to update the product data, neither your nor our resources will be wasted.

This is not only a good idea for this specific library: Next time you consume RSS feeds or other "pull" data from various places on the Internet, make sure to query for updates before downloading unnecessary data. Caveat: This method only works if the server can handle an If-Modified-Since header. Servers that serve bogus timestamps or no such header at all leave you no choice but to download and investigate the feed itself.

Update: A few readers pointed out that the If-Modified-Since request header would be an even better method to update the data conditionally than an initial HEAD request. They are, of course, right, which is why I updated the library accordingly. Thanks, everyone!
