Easy on the bandwidth, easy on the server: Pull-style updates in django-mozilla-product-details
Over on the Mozilla Webdev blog, I just posted about a new library of ours, django-mozilla-product-details. This tongue twister allows you to periodically update the latest Mozilla product version information as well as language details from our SVN server.
The geeks among you are surely wondering, isn’t that going to lead to a lot of useless traffic if the data does not change as frequently as it is being updated?
You are right. Because re-downloading unchanged data is evil and because we like our servers, we are using a fun little trick to keep the amount of data transferred as small as possible:
Originally, every time the update script ran, we first issued a HEAD request to the SVN server. A HEAD request is a type of HTTP request that asks a server for some location, but instead of sending the actual data in return (an HTML document, for example, or some binary data), the server returns only the response headers.
From these headers, which are very small, we could read the Last-Modified timestamp and compare it to the time we last updated our local copy of the product data. If the timestamp hadn’t changed since then, there was no need for us to download further data.
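The HEAD-and-compare check described above can be sketched roughly as follows. This is a minimal illustration using Python 3's standard library, not the library's actual code: the function names `head_last_modified` and `needs_update` are made up for this example, and `last_update` is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import urllib.request


def head_last_modified(url):
    """Issue a HEAD request and return the Last-Modified header, or None."""
    # method="HEAD" asks for the response headers only, no body.
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get("Last-Modified")


def needs_update(last_modified_header, last_update):
    """Return True if the remote data is newer than our last update.

    last_update must be a timezone-aware datetime (assumption of this sketch).
    """
    if not last_modified_header:
        # No timestamp to compare against: play it safe and download.
        return True
    remote = parsedate_to_datetime(last_modified_header)
    return remote > last_update
```

A caller would pass the result of `head_last_modified(url)` to `needs_update()` and only start the real download when it returns True.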
Instead of blindly downloading the data files on every update, we send the time of our last successful update along to the server, in an If-Modified-Since HTTP request header. If the files have changed since then, the server will send us the updated data, but if nothing has changed in the meantime, the server will just return a “304 Not Modified” status.
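A conditional GET of this kind might look like the sketch below. It is written against Python 3's `urllib.request` and is only an illustration under stated assumptions, not the library's implementation: `build_conditional_request` and `fetch_if_modified` are hypothetical names, and the last update time is assumed to be stored as a POSIX timestamp.

```python
from email.utils import formatdate
import urllib.error
import urllib.request


def build_conditional_request(url, last_update_ts):
    """Build a GET request carrying an If-Modified-Since header.

    last_update_ts is the POSIX timestamp of the last successful update.
    """
    # HTTP dates are expressed in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    stamp = formatdate(last_update_ts, usegmt=True)
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def fetch_if_modified(url, last_update_ts):
    """Return the new body as bytes, or None if the server answers 304."""
    request = build_conditional_request(url, last_update_ts)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not Modified: keep using the local copy.
            return None
        raise
```

Note that urllib reports a 304 as an `HTTPError`, which is why the "nothing changed" case is handled in the `except` branch rather than by inspecting a normal response.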
This is how we ensure that (almost) no matter how often you choose to update the product data, neither your resources nor ours will be wasted.
This is not only a good idea for this specific library: Next time you consume RSS feeds or other “pull” data from various places on the Internet, make sure to query for updates before downloading unnecessary data. Caveat: This method only works if the server honors the If-Modified-Since header. Servers that send bogus Last-Modified timestamps, or none at all, leave you no choice but to download and investigate the feed itself.
Update: A few readers pointed out that the If-Modified-Since request header would be an even better method to update the data conditionally than an initial HEAD request. They are, of course, right, which is why I updated the library accordingly. Thanks, everyone!


Why the HEAD request? Doesn’t the server support If-Modified-Since?
The JSON feeds are available on mozilla.com, which will allow better performance than the SVN server, as it is inside the CDN. Why should we use your solution instead of pulling JSON from server-side or using JavaScript directly from the client?
Any tips for how to support good HEAD requests from within a django app? I.e., not consuming them but implementing them?
Isn’t this exactly what HTTP status 304 (“Not Modified”) is for? You issue a conditional GET. If the resource has been modified, you get it; otherwise, you know it’s not, thanks to said status.
Very good points, guys. Perhaps I shall change the implementation to using If-Modified-Since and just catch the 304 response code. Saves an HTTP request! Thanks!
Tomer: Relying on a specific web application’s deployment kind of sucks. It’s like an img-tag pointing to the Google logo on the front page of google.com — you *can* do that, but it’s not good practice. Also, this library does not solve client problems.
Axel: urllib2 does not come with a HEAD request by itself, but you can easily subclass its Request object like I did and just change the request type to HEAD.
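For illustration, such a subclass could look roughly like this. The sketch uses Python 3's `urllib.request`, the successor to urllib2, and `HeadRequest` is a name chosen for this example; it is not necessarily what the library itself used.

```python
import urllib.request


class HeadRequest(urllib.request.Request):
    """A Request that is always sent with the HEAD method."""

    def get_method(self):
        # urllib asks the request object which HTTP verb to use.
        return "HEAD"
```

Passing `HeadRequest(url)` to `urllib.request.urlopen()` then returns a response with headers but no body. On current Python versions, `urllib.request.Request(url, method="HEAD")` achieves the same without a subclass.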
All right, I updated the library and the blog post. Thanks, AndersH and Jan!