Three ways to add a robots.txt to your Django project

Need to add a robots.txt file to your Django project to tell Google and friends what to index on your site and what to leave alone?

Here are three ways to add a robots.txt file to Django.

1) The (almost) one-liner

In an article on e-scribe.com, Paul Bissex suggests adding this rule to your urls.py file:

from django.http import HttpResponse

urlpatterns = patterns('',
    ...
    (r'^robots\.txt$', lambda r: HttpResponse("User-agent: *\nDisallow: /", mimetype="text/plain"))
)

The advantage of this solution is that it is a simple one-liner that disallows all bots, with no extra files to create and no clutter anywhere. It’s as simple as it gets.

The disadvantage, obviously, is the lack of scalability. The instant you have more than one rule to add, this approach quickly gets out of hand. Also, one could argue that urls.py is not the right place for content of any kind.
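If you like the one-liner but need a few more rules, one way to keep urls.py readable is to build the response body from a small rule list. The following is only a sketch in plain Python: the names build_robots_txt and ROBOTS_RULES are made up for this example, and the paths are placeholders. The resulting string would be passed to HttpResponse just like the literal above (on current Django versions the keyword is content_type="text/plain" rather than mimetype).

```python
def build_robots_txt(rules):
    """Render (user_agent, [disallowed paths]) pairs as robots.txt text.

    Hypothetical helper for illustration; not part of Django.
    """
    groups = []
    for user_agent, disallowed_paths in rules:
        lines = ["User-agent: " + user_agent]
        if disallowed_paths:
            for path in disallowed_paths:
                lines.append("Disallow: " + path)
        else:
            # An empty Disallow value means nothing is blocked for this agent.
            lines.append("Disallow:")
        groups.append("\n".join(lines))
    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join(groups) + "\n"


# Example rule set (placeholder paths, adjust to your site).
ROBOTS_RULES = [
    ("*", ["/admin/", "/private/"]),
    ("BadBot", ["/"]),
]

ROBOTS_TXT = build_robots_txt(ROBOTS_RULES)
```

The lambda in urls.py then only has to return ROBOTS_TXT, so the rule list can grow without the URL pattern getting any longer.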

2) Direct to template

This one is the most intuitive approach: Just drop a robots.txt file into your main templates directory and link to it via direct_to_template:

from django.views.generic.simple import direct_to_template

urlpatterns = patterns('',
    ...
    (r'^robots\.txt$', direct_to_template,
     {'template': 'robots.txt', 'mimetype': 'text/plain'}),
)

Just remember to set the MIME type appropriately to text/plain, and off you go.

The advantage is its simplicity, and if you already have a robots.txt file you want to reuse, there’s no extra work involved.

Disadvantage: If your robots file changes somewhat frequently, you need to push changes to your web server every time. That can get tedious. Also, this approach does not save you from typos or the like.

3) The django-robots app

Finally, there’s a full-blown Django app available that you can install and drop into your INSTALLED_APPS. It is called django-robots.

For small projects, this would be overkill, but if you have a lot of rules, or if you need a site admin to change them without pushing changes to the web server, this is your app of choice.

Which one is right for me?

Depending on how complicated your rule set is, any one of the three solutions may be the best fit for you. Just choose the one that you are the most comfortable with and that fits the way you are using robots.txt in your application.

Everything but “open”

Note: Several commenters have provided valuable feedback that I am responding to with updates to this post and in the comments. Make sure to read both!


In yesterday’s WWDC keynote, Steve Jobs introduced “FaceTime” and explained its base technologies with the following words:

“Now it’s based on a handful of standards… but this is going to be an open industry standard.”

I wish Steve Jobs would cease calling H.264 and similar standards “open”. Technologies that cost millions of dollars to use are, by definition, not open. He can hope it’ll become an “industry standard” (as in, used by companies apart from Apple), fair enough. But he can’t say it is going to be “open”. That’s like a college kid calling the grocery store “free” just because daddy gave them an unlimited credit card. And it’s doing a huge disservice to the open standards community by misusing the term in the worst possible way.

Quote and photograph courtesy of Engadget. Thanks!


Update, 4/9/2010: For a while, I removed the above text in order not to express unwarranted criticism towards the speaker. After several rounds of user feedback, however, I decided to keep the original text and update/annotate this post as necessary.

Update on the definition of openness:

As Sandy pointed out in the comments, there are many definitions of which standards are considered “open”. Depending on which of these you follow, a standard with licensing fees can still be called “open”, as long as nobody with enough money to pay them is excluded. I disagree with that view, but it is a possible interpretation.

Commenter Dave mentions that Steve Jobs usually makes sure to call actual open standards “open” and calls H.264 and similar technologies “industry standards” instead. He is therefore likely to know the difference between the two, even though calling an entire stack “open” in spite of some of its components not matching that definition is a strange, or even misleading, point of view.

Finally, Jo argues that the mere fact that other vendors can build devices that connect to the FaceTime stack, instead of it being limited to Apple products, makes it “open”. In other words, this use of “open” would be a synonym for “standards compliant”. I believe that is still a very limited view of openness, but at least it is more open than the alternative: a locked-down proprietary solution.

Peter also reminds us of the technical limitations: Since all mobile devices need hardware support for video encoding and decoding, Apple had to settle for H.264 a long time ago, and even if they wanted to, they could not simply switch over to a different codec. Most people (me included) also seem to agree that H.264 is, from a purely technical standpoint, a good choice for the FaceTime stack.

Easy on the bandwidth, easy on the server: Pull-style updates in django-mozilla-product-details

Over on the Mozilla Webdev blog, I just posted about a new library of ours, django-mozilla-product-details. This tongue twister allows you to periodically update the latest Mozilla product version information as well as language details from our SVN server.

The geeks among you are surely wondering, isn’t that going to lead to a lot of useless traffic if the data does not change as frequently as it is being updated?

You are right. Because re-downloading unchanged data is evil and because we like our servers, we are using a fun little trick to keep the amount of data transferred as small as possible:

In the first version of the library, every time the update script ran, it first issued a HEAD request to the SVN server. A HEAD request is a type of HTTP request that asks a server for some location, but instead of returning the actual data (an HTML document, for example, or some binary data), the server returns only the response headers.

From these headers, which are very small, we could read the Last-Modified timestamp and compare it to the time we last updated our local copy of the product data. If the timestamp hadn’t changed since then, there was no need to download further data.
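For illustration, the HEAD-based check described above could look roughly like this in Python. This is a sketch using only the standard library, not the library’s actual code; the function names remote_last_modified and needs_download are made up for this example, and the last update time is assumed to be stored as a Unix timestamp.

```python
import email.utils
import urllib.request


def remote_last_modified(url, timeout=10):
    """Send a HEAD request and return the Last-Modified header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")


def needs_download(last_modified_header, last_update):
    """Decide whether the remote copy is newer than our local one.

    last_update is a Unix timestamp. If the header is missing or cannot
    be parsed, we play it safe and download.
    """
    if last_modified_header is None:
        return True
    try:
        modified = email.utils.parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        return True
    return modified.timestamp() > last_update
```

The drawback of this variant is that a changed file costs two requests: the HEAD request plus the actual download.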

Instead of blindly downloading the data files on every update, we send the time of our last successful update along to the server, in an If-Modified-Since HTTP request header. If the files have changed since then, the server will send us the updated list, but if nothing has changed in the meantime, the server will just return a “304 Not Modified” status.

This is how we ensure that (almost) no matter how often you choose to update the product data, neither your resources nor ours will be wasted.

This is not only a good idea for this specific library: Next time you consume RSS feeds or other “pull” data from various places on the Internet, make sure to query for updates before downloading unnecessary data. Caveat: This method only works if the server can handle an If-Modified-Since header. Servers that serve bogus timestamps or no such header at all leave you no choice but to download and investigate the feed itself.
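If you want to try the conditional request described above in your own code, a minimal sketch with Python’s standard library might look like this. It is not the code from django-mozilla-product-details; the function names are made up for this example, and the last update time is assumed to be a Unix timestamp.

```python
import email.utils
import urllib.error
import urllib.request


def build_conditional_request(url, last_update):
    """Build a GET request that asks only for data newer than last_update."""
    request = urllib.request.Request(url)
    # HTTP dates are always expressed in GMT,
    # e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    request.add_header(
        "If-Modified-Since", email.utils.formatdate(last_update, usegmt=True)
    )
    return request


def fetch_if_modified(url, last_update, timeout=10):
    """Return the response body, or None if the server answers 304."""
    request = build_conditional_request(url, last_update)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        if error.code == 304:
            # Not Modified: our local copy is still current.
            return None
        raise
```

Note that urllib reports the 304 status as an HTTPError, which is why the “nothing changed” case is handled in the except branch.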

Update: A few readers pointed out that the If-Modified-Since request header would be an even better method to update the data conditionally than an initial HEAD request. They are, of course, right, which is why I updated the library accordingly. Thanks, everyone!

udevinfo on Ubuntu 10.04 “Lucid”

The latest versions of Ubuntu no longer appear to ship the tool udevinfo, which is vital for finding information about devices connected to the computer.

There is, however, a new tool called udevadm, and with a little syntax trick you can get it to produce the familiar udevinfo output:

udevadm info -a -p `udevadm info -q path -n /dev/sdb`

shows:

Udevadm info starts with the device specified by the devpath and then
walks up the chain of parent devices. It prints for every device
found, all possible attributes in the udev rules key format.
A rule to match, can be composed by the attributes of the device
and the attributes from one single parent device.

  looking at device '/devices/pci0000:00/0000:00:13.2/usb2/2-1/2-1:1.0/host5/target5:0:0/5:0:0:0/block/sdb':
    KERNEL=="sdb"
    SUBSYSTEM=="block"
    DRIVER==""
    ATTR{range}=="16"
    ATTR{ext_range}=="256"
    ATTR{removable}=="1"
(...)

  looking at parent device '/devices/pci0000:00':
    KERNELS=="pci0000:00"
    SUBSYSTEMS==""
    DRIVERS==""

If you use this more often and don’t like the idea of entering a huge line of code for such a simple command, drop the following into your .bashrc file (all in one line):

udevinfo () { udevadm info -a -p "$(udevadm info -q path -n "$1")"; }

Now (after starting a new session or typing source ~/.bashrc), a simple udevinfo /dev/sdb will do the trick.

Also helpful: A long time ago, I wrote a blog post about udev rules, showing what rules I used at the time to have consistent device names for my USB drives, no matter in what order I connect or disconnect them. The devices I mention there are long gone, but I keep going back to that post every time I need to write a new udev rule.