Under the Hood of Firefox Input

Note: Several people asked where the link is to actually add feedback to the site. This is, of course, a good point. As mentioned in the comments: The designated entry point for the feedback application is going to be an extension bundled with Firefox 4 Beta. For more information, please read Aakash's blog post. To try out the application already, feel free to add happy or sad feedback to the test site.

This morning, we published the Firefox Input application. It is a little web application soliciting feedback from our Firefox Beta Program users. The aim is to make it as easy as possible for people to tell us what specifically they like or dislike about an upcoming version of Firefox.

The application was, as far as software goes, developed very rapidly: We made it from requirements to production in a mere three weeks. What made this possible was a number of reusable components that allowed us to avoid reinventing the wheel and stay focused on making the application awesome.

A few key components of the Input application:

Django. I can't stress this enough: Django is a fantastic web application framework. It makes it incredibly easy to set up a web application quickly and securely. Its built-in admin pages save me days of work that I would otherwise have to spend to allow project admins to edit the application data.
Jinja2 and Jingo. The only big drawback of Django is its template language: The instant you make nontrivial web applications, it gets in your way. Luckily, like all parts of Django, it is replaceable: Jinja2 and Jeff Balogh's jingo interface come to the rescue. The two of them are already in use over at AMO and also serve us well on Input.
Term extraction. Firefox Input extracts key words from all feedback. Sure, you can just split the sentences into words, but if you want to avoid collecting all sorts of meaningless particles ("the", "a", "if", ...), it becomes a little more complicated. We are using the topia.termextract library, which gladly does the heavy lifting for us. Only caveat: It only works for English, so once the application is localized, we need a different solution for the other languages.
Search. For the longest time, there was no generic way to do search in a Django app (other than straight SQL queries). In the meantime, haystack has started to fill that gap. We use it on Input in conjunction with Whoosh, a pure-Python search library. That is very easy to set up, at the expense of scalability -- if we outgrow it, however, it will be easy to switch search engines with virtually no code changes at all. Thumbs up!
Product details. Only very recently we released a Mozilla product details library for Django, and this is the first application to rely intimately on up-to-date product data: Input only lets users of the latest beta versions of Firefox add feedback, so it auto-updates its product data periodically to gather feedback for the newest versions as quickly as possible.
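To illustrate the term extraction point above: the naive approach of splitting feedback into words and dropping meaningless particles can be sketched in a few lines. This is only an illustration, not what topia.termextract does internally; the stop-word list and the function name extract_terms are assumptions made up for this example.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list (an assumption for illustration only;
# a real list would be much longer and language-specific).
STOP_WORDS = {
    "the", "a", "an", "if", "and", "or", "but", "is", "it", "to", "of",
    "in", "on", "i", "my", "this", "that", "when",
}


def extract_terms(text, limit=5):
    """Return the most frequent non-stop-words in text, most common first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(
        word for word in words if word not in STOP_WORDS and len(word) > 1
    )
    return [word for word, _count in counts.most_common(limit)]
```

Calling extract_terms on a piece of feedback returns its most frequent remaining words. Anything beyond that, such as recognizing multi-word terms, is where a dedicated library earns its keep.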

As always, the source code of Firefox Input is openly and freely available. If you notice any problems with it, feel free to fork it on github, or file a bug in our bug tracker.

June 15, 2010 by Fred Wenzel

Hot-plugging a SATA drive under Linux

Hard drives (or controllers, rather) capable of hot-swapping (that is, plugging and un-plugging a drive into a running system) used to be a feature reserved for expensive professional RAID installations.

With the advent of SATA in the mainstream, that has changed. Supposedly any SATA hard drive can be hot-plugged now. But what if you actually try and nothing happens? Chances are your controller doesn't let the OS know about the newly found drive on its own.

Try this to rescan the SCSI hosts (each SATA port appears as a SCSI bus):

echo "0 0 0" >/sys/class/scsi_host/host<n>/scan

and to remove a drive:

echo x > /sys/bus/scsi/devices/<n>:0:0:0/delete

Replace <n> with the right numbers for your system, respectively.
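If you do not know which host number belongs to which port, one option is to rescan all of them. The following is a minimal sketch of that idea in Python, assuming the sysfs layout shown above; the function name rescan_scsi_hosts and its parameters are made up for this example, and on a real system it has to run as root.

```python
import glob
import os


def rescan_scsi_hosts(sysfs_root="/sys", pattern="0 0 0"):
    """Write the scan pattern to every SCSI host's scan file.

    Returns the names of the hosts that were rescanned (e.g. "host0").
    """
    rescanned = []
    scan_files = glob.glob(
        os.path.join(sysfs_root, "class", "scsi_host", "host*", "scan")
    )
    for scan_file in sorted(scan_files):
        # Same effect as: echo "0 0 0" > /sys/class/scsi_host/host<n>/scan
        with open(scan_file, "w") as handle:
            handle.write(pattern + "\n")
        rescanned.append(os.path.basename(os.path.dirname(scan_file)))
    return rescanned
```

The sysfs_root parameter exists only so the function can be tried against a scratch directory before pointing it at the real /sys.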

Also, just to state the obvious, don't do that to a mounted drive, ever. Especially not the one that holds your system partition ;)

(via.)

June 09, 2010 by Fred Wenzel

Three ways to add a robots.txt to your Django project

Need to add a robots.txt file to your Django project to tell Google and friends what and what not to index on your site?

Here are three ways to add a robots.txt file to Django.

1) The (almost) one-liner

In an article on e-scribe.com, Paul Bissex suggests adding this rule to your urls.py file:

from django.http import HttpResponse

urlpatterns = patterns('',
    ...
    (r'^robots\.txt$', lambda r: HttpResponse("User-agent: *\nDisallow: /", mimetype="text/plain"))
)

The advantage of this solution is that it is a simple one-liner disallowing all bots, with no extra files to create and no clutter anywhere. It's as simple as it gets.

The disadvantage, obviously, is the missing scalability. The instant you have more than one rule to add, this approach quickly balloons out of hand. Also, one could argue that urls.py is not the right place for content of any kind.
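If you want to double-check what a rule set like the one above actually tells crawlers, Python's standard library ships a robots.txt parser. The following sketch assumes Python 3 (where the module is called urllib.robotparser); the helper name is_allowed and the example URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rule set served by the one-liner above.
RULES = "User-agent: *\nDisallow: /"


def is_allowed(rules, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the rules from the one-liner, is_allowed(RULES, "Googlebot", "http://example.com/page") comes out false for every path, which is exactly the "disallow all bots" behavior described above.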

2) Direct to template

This one is the most intuitive approach: Just drop a robots.txt file into your main templates directory and link to it via direct_to_template:

from django.views.generic.simple import direct_to_template

urlpatterns = patterns('',
    ...
    (r'^robots\.txt$', direct_to_template,
     {'template': 'robots.txt', 'mimetype': 'text/plain'}),
)

Just remember to set the MIME type appropriately to text/plain, and off you go.

The advantage is its simplicity, and if you already have a robots.txt file you want to reuse, there's no overhead for that.

Disadvantage: If your robots file changes somewhat frequently, you need to push changes to your web server every time. That can get tedious. Also, this approach does not save you from typos or the like.

3) The django-robots app

Finally, there's a full-blown Django app available that you can install and drop into your INSTALLED_APPS: It is called django-robots.

For small projects, this would be overkill, but if you have a lot of rules, or if you need a site admin to change them without pushing changes to the web server, this is your app of choice.

Which one is right for me?

Depending on how complicated your rule set is, any one of the three solutions may be the best fit for you. Just choose the one that you are the most comfortable with and that fits the way you are using robots.txt in your application.

June 08, 2010 by Fred Wenzel

Everything but "open"

Note: Several commenters have provided valuable feedback that I am responding to with updates to this post and in the comments. Make sure to read both!

In yesterday's WWDC keynote, Steve Jobs introduced "FaceTime" and explained its base technologies with the following words:

"Now it's based on a handful of standards... but this is going to be an open industry standard."

I wish Steve Jobs would cease calling H.264 and similar standards "open". Technologies that cost millions of dollars to use are, by definition, not open. He can hope it'll become an "industry standard" (as in, used by companies apart from Apple), fair enough. But he can't say it is going to be "open". That's like a college kid calling the grocery store "free" just because daddy gave them an unlimited credit card. And it's doing a huge disservice to the Open ~~Source~~ standards community by misusing the term in the worst possible way.

Quote and photograph courtesy of Engadget. Thanks!

Update, 4/9/2010: For a while, I removed the above text in order not to express unwarranted criticism towards the speaker. After several rounds of user feedback, however, I decided to keep the original text and update/annotate this post as necessary.

Update on the definition of openness:

As Sandy pointed out in the comments, there are many definitions of what standards are considered "open", and depending on which of these you follow, varying licensing fees, as long as you don't exclude anyone with enough money to pay them, are still compatible with calling a standard "open". I disagree with that view, but it is a possible interpretation.

Commenter Dave mentions that Steve Jobs usually makes sure to call actual open standards "open" and calls H.264 and similar technologies "industry standards" instead. He is therefore likely to know the difference between the two, even though calling an entire stack "open" in spite of some of its components not matching that definition is a strange, or even misleading, point of view.

Finally, Jo argues that the mere fact that other vendors can build devices to connect to the FaceTime stack instead of it being limited to Apple products only makes it "open". In other words, this use of "open" would be a synonym of "standards compliant". I believe that is still a very limited view on openness, but at least it is more open than the alternative: a locked-down proprietary solution.

Peter also reminds us of the technical limitations: Since all mobile devices need hardware support for video encoding and decoding, Apple had to settle for H.264 a long time ago, and even if they wanted, they could not simply switch over to a different codec. Most people (me included) also seem to agree that H.264 is -- from a purely technical standpoint -- a good choice for the FaceTime stack.

June 01, 2010 by Fred Wenzel

Easy on the bandwidth, easy on the server: Pull-style updates in django-mozilla-product-details

Over on the Mozilla Webdev blog, I just posted about a new library of ours, django-mozilla-product-details. This tongue twister allows you to periodically update the latest Mozilla product version information as well as language details from our SVN server.

The geeks among you are surely wondering, isn't that going to lead to a lot of useless traffic if the data does not change as frequently as it is being updated?

You are right. Because re-downloading unchanged data is evil and because we like our servers, we are using a fun little trick to keep the amount of data transferred as small as possible:

Every time the update script is run, we first issue a HEAD request to the SVN server: A HEAD request is a type of HTTP request that asks for some location from a server, but instead of receiving the actual data in return (an HTML document, for example, or some binary data), the server only returns the response headers, not the actual data.

From these headers, which are very small, we can read the Last-Modified timestamp and compare that to the time we last updated our local copy of the product data. If the timestamp hasn't changed since then, there's no need for us to download further data.

Instead of blindly downloading the data files on every update, we send the time of our last successful update along to the server, in an If-Modified-Since HTTP request header. If the files have changed since then, the server will send us the updated list, but if nothing has changed in the meantime, the server will just return a "304 Not Modified" status.

This is how we ensure that (almost) no matter how often you choose to update the product data, neither your nor our resources will be wasted.

This is not only a good idea for this specific library: Next time you consume RSS feeds or other "pull" data from various places on the Internet, make sure to query for updates before downloading unnecessary data. Caveat: This method only works if the server can handle an If-Modified-Since header. Servers that serve bogus timestamps or no such header at all leave you no choice but to download and investigate the feed itself.
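For your own pull-style updates, a conditional request along these lines can be sketched with nothing but Python's standard library. This is a minimal sketch assuming Python 3, not the code used in django-mozilla-product-details; the function names are made up for this example.

```python
import urllib.error
import urllib.request
from datetime import timezone
from email.utils import format_datetime


def build_conditional_request(url, last_update):
    """Build a GET request asking only for data newer than last_update.

    last_update is expected to be a timezone-aware datetime.
    """
    # HTTP dates are always expressed in GMT, e.g. "... 28 May 2010 12:00:00 GMT".
    stamp = format_datetime(last_update.astimezone(timezone.utc), usegmt=True)
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def fetch_if_modified(url, last_update):
    """Return the response body, or None if the server answers 304 Not Modified."""
    request = build_conditional_request(url, last_update)
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        # urllib reports a 304 as an HTTPError; here it just means "nothing new".
        if error.code == 304:
            return None
        raise
```

A caller would remember the time of its last successful download, pass it in on the next run, and keep its local copy whenever the function returns None.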

Update: A few readers pointed out that the If-Modified-Since request header would be an even better method to update the data conditionally than an initial HEAD request. They are, of course, right, which is why I updated the library accordingly. Thanks, everyone!

May 28, 2010 by Fred Wenzel

udevinfo on Ubuntu 10.04 "Lucid"

The latest versions of Ubuntu do not appear to have the tool udevinfo anymore, which is vital to find information about devices connected to the computer.

There is, however, a new tool called udevadm, and with a little syntax trick you can get it to spit out your familiar udevinfo syntax:

udevadm info -a -p `udevadm info -q path -n /dev/sdb`

shows:

Udevadm info starts with the device specified by the devpath and then
walks up the chain of parent devices. It prints for every device
found, all possible attributes in the udev rules key format.
A rule to match, can be composed by the attributes of the device
and the attributes from one single parent device.

  looking at device '/devices/pci0000:00/0000:00:13.2/usb2/2-1/2-1:1.0/host5/target5:0:0/5:0:0:0/block/sdb':
    KERNEL=="sdb"
    SUBSYSTEM=="block"
    DRIVER==""
    ATTR{range}=="16"
    ATTR{ext_range}=="256"
    ATTR{removable}=="1"
(...)

  looking at parent device '/devices/pci0000:00':
    KERNELS=="pci0000:00"
    SUBSYSTEMS==""
    DRIVERS==""

If you use this more often and don't like the idea of entering a huge line of code for such a simple command, drop the following into your .bashrc file (all in one line):

udevinfo () { udevadm info -a -p `udevadm info -q path -n "$1"`; }

Now (after starting a new session or typing source ~/.bashrc), a simple udevinfo /dev/sdb will do the trick.

Also helpful: A long time ago, I wrote a blog post about udev rules, showing what rules I used at the time to have consistent device names for my USB drives, no matter in what order I connect or disconnect them. The devices I mention there are long gone, but I keep going back to that post every time I need to write a new udev rule.

May 07, 2010 by Fred Wenzel

What happens when you click the Firefox download button?

Everybody knows Mozilla makes Firefox. But there is a lot more software at work here at Mozilla that you might not be aware of. For example: What happens when you go to getfirefox.com and click on the download button?

By clicking on the button, you ask our servers to send you a specific file, for example: Firefox 3.6.3, for Windows, in German. On a small website, the server would just fetch the file and hand it to you. But if you need to handle millions of downloads a day like we do, a single server can't handle it all by itself, so it gets more complicated. In order to provide you with downloads, updates, etc., as fast and conveniently as possible, Mozilla collaborates with a number of mirror providers that have volunteered to host Firefox and other downloads on our behalf, thus sharing the load of our numerous downloads between a number of servers all over the world.

For some years now, we have been running a bundle of software called "Bouncer" to handle our downloads for us.

Bouncer consists of three components: The user-facing bounce script, an administrative interface called Tuxedo, and a mirror checker called Sentry.

First, the bounce script. It is the only component the "ordinary user" gets to interact with. It essentially does the following after you click on a download link:

It determines if the product you asked for exists.
Out of our list of mirrors, it picks one that has your file. Initially, it would pick one at random. Over the years, the logic has become more elaborate though: By now, it takes into account which country you are currently in, as well as how strong the mirrors are (stronger mirrors serve more downloads, weaker ones serve fewer).
A split-second later, Bouncer refers you to the server it decided on, and that server will send you the file you asked for.
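The selection logic described in these steps can be sketched roughly as follows. This is not Bouncer's actual code: the mirror dictionaries, their keys, and the function name pick_mirror are assumptions made up for illustration.

```python
import random


def pick_mirror(mirrors, filename, country, rng=random):
    """Pick a mirror that has the file, preferring the user's country.

    Each mirror is a dict with the (hypothetical) keys "name", "country",
    "weight" and "files". Returns None if no mirror has the file.
    """
    candidates = [mirror for mirror in mirrors if filename in mirror["files"]]
    if not candidates:
        return None
    # Prefer mirrors in the user's country; otherwise fall back to all of them.
    local = [mirror for mirror in candidates if mirror["country"] == country]
    pool = local or candidates
    # Weighted random choice: stronger mirrors are picked more often.
    weights = [mirror["weight"] for mirror in pool]
    return rng.choices(pool, weights=weights, k=1)[0]
```

The weighted random choice is what spreads the load: a mirror with weight 10 is picked about ten times as often as one with weight 1, without the weaker mirror ever being left out entirely.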

But wait, there is more! How does Bouncer know what products are available, for what operating systems, and in what languages? That's where the admin interface comes in. We have a release engineering team who work hard every day to deliver the newest software versions to you in handy little packages. Previously, during every release, an engineer would manually tell Bouncer that a new version was available for download. But just last week, we improved this process by introducing a new interface to Bouncer, with a project called Tuxedo. The release engineering team can now, fully automatically, feed new versions into Bouncer at the time of release, with no manual intervention. With less time spent on repetitive tasks, we can spend more time making Firefox awesome.

Finally, the Sentry component is a script that periodically checks the health of our mirrors and adjusts our settings accordingly. This ensures that you are very rarely forwarded to a mirror that is currently unavailable. So far, these mirror checks happen from Mozilla Headquarters, and therefore reflect the connectivity we get to the mirrors from here. In the future, we want to improve that by taking greater account of our users' connectivity to the specific mirrors (for the geeks out there: Network proximity != geographical proximity), which has the potential to result in faster download times, lower expenses for mirror providers, and general happiness.

As you can see, there are a lot of things happening behind the scenes before Firefox makes its way onto your computer at home, and we are constantly working on improving the way we are doing things. Plus, as always: Bouncer is completely open source, and we have a public bug tracker, so if you notice any problems or see room for improvement, make sure to let us know.

Photo credit: "directions", CC-by licensed by Phillie Casablanca.

March 30, 2010 by Fred Wenzel

Don't Forget to Clean Up After Yourself

On a growing number of projects at Mozilla, we use a tool called Hudson that runs a complete set of tests on the code with every check-in. The beauty of this is that if you accidentally break something, you (and everyone else) will know immediately, so you can fix it quickly. We also use a bunch of plugins with Hudson, one of which assigns points to every check-in: For example, if all tests pass, you get a positive number of points, or if you broke something, you get a negative score.

An innocent little commit of mine gained me a whopping -100 points (yes, that is minus 100) today.

How did that happen? The build broke badly, not because I wrote a pile of horrendous code, or because I didn't test before committing. In fact, I've made it a habit to commit like this:

./manage.py test && git push origin master

This fun little one-liner will result in my code being pushed to the origin repository if and only if all tests pass.

So in my case, all tests passed locally, and then horribly broke once the server ran the tests again. After a little research, it turned out that when I deleted a now unneeded Python file, I did not remove its compiled cousin, the .pyc file, along with it. Sadly, this module was still imported somewhere else, and because Python still found the .pyc file locally, it did not mind the original .py file being gone, so all tests passed. On the server, however, with a completely clean environment, the file wasn't found and resulted in the failures of dozens of tests (all of which threw an ImportError).

What's the lesson? In the short term, I should wipe my .pyc files before running tests. One way to do that would be adding something like

find . -type f -name '*.pyc' -delete

to my ever-growing commit one-liner, but a more general solution might want to perform this inside the test running script. On the other hand, since that script is written in Python, some of the imports that could break have already been performed by the time the script runs.
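A gentler variant is to delete only the .pyc files whose source file is gone, since those are the ones that can hide a broken import. Here is a minimal sketch of that idea, assuming .pyc files sit next to their .py sources as described above; the function name remove_orphaned_pyc is made up for this example.

```python
import os


def remove_orphaned_pyc(root="."):
    """Delete every .pyc file under root that has no .py file next to it.

    Returns the sorted list of paths that were removed.
    """
    removed = []
    for directory, _subdirs, files in os.walk(root):
        # Newer Python versions keep bytecode in __pycache__ under different
        # file names; this sketch leaves those directories alone.
        if os.path.basename(directory) == "__pycache__":
            continue
        for name in files:
            if not name.endswith(".pyc"):
                continue
            source = os.path.join(directory, name[:-1])  # foo.pyc -> foo.py
            if not os.path.exists(source):
                path = os.path.join(directory, name)
                os.remove(path)
                removed.append(path)
    return sorted(removed)
```

Because this runs as its own small script before the test runner starts, it sidesteps the problem of imports having already happened by the time cleanup code executes.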

In general, run your tests in as clean an environment as possible. While any useful test framework will take care of your database having a consistent state for every test run, you also need to ensure that you start with a clean baseline of your code -- especially if Hudson, the merciless butler, will rub it in your face if you don't ;) .

March 10, 2010 by Fred Wenzel

Updating the Mozilla Public License

Today, Mozilla is starting the public process on revising its signature code license, the Mozilla Public License or MPL. Mitchell Baker, chair of the board of the Mozilla Foundation and author of the original MPL 1.0, has more information about the process on her blog.

The discussion is happening on the website mpl.mozilla.org.

I am happy about this for a number of reasons. Of course, I made the website (the design is borrowed from mozilla.org), so I am naturally happy to see it being available to a wider audience.

But I also hope that the revision process itself will be successful. While the MPL has been a remarkable help in Mozilla desktop projects' success, it is unpleasant (to say the least) to use in web applications, for a number of reasons:

The hideous license block. The MPL is a file-based license. It allows any file in the project, even in the same directory, to be licensed differently. Therefore, each MPL-licensed code file must have a comment block of more than 30 lines at the top. For big code modules, that's fine. For web applications, whose files often have only a handful of lines, this balloons the whole code base and makes files horribly unreadable. Sadly, the current license only allows an exception from that rule if that's impossible "due to [the file's] structure", which would essentially only be the case if that file type did not allow comments.

The copyleft. This one is debatable, but it's a fact that some open source communities (one prominent example is the Python community) do not appreciate strong copyleft provisions. While the MPL (unlike the GNU GPL) does not have a tendency to "taint" other code, this is not at all compatible with the BSD or MIT licenses' notion of "take it and do (almost) whatever you please with it". (As you may have noticed, the file-based MPL is both a curse and a blessing here). I hope that the revision process can make it clearer how this applies to hosted applications (i.e., mostly web applications).

I am excited to see what the broad community discussion will bring to light over the next few months.

March 01, 2010 by Fred Wenzel

pdftk 1.41 for Mac OS X 10.6

Update: The author of pdftk, Sid Steward, left the following comment:

A new version of pdftk is available (1.43) that fixes many bugs. This release also features an installer [for] OS X 10.6. Please visit to learn more and download: www.pdflabs.com.

This blog post will stick around for the time being, but I (the author of this blog) advise you to always run the latest version so that you can enjoy the latest bug fixes.

OS X Leopard users: Sorry, neither this version nor the installer offered on pdflabs.com works on OS X before 10.6. You might be able to compile from source though. Let us know if you are successful.

Due to my being a remote employee, I get to juggle with PDF files quite a bit. A great tool for common PDF manipulations (changing page order, combining files, rotating pages etc) has proven to be pdftk. Sadly, a current version for Mac OS X is not available on their homepage. In addition, it is annoying (to say the least) to compile, which is why all three third-party package management systems that I know of (MacPorts, fink, as well as homebrew), last time I checked, did not have it at all, or their versions were broken.

Now I wouldn't be a geek if that kept me from compiling it myself. I took some hints from anoved.net who was nice enough to also provide a compiled binary, but sadly did not include the shared libraries it relies on.

Instead, I made an installer package that'll install pdftk itself as well as the handful of libraries you need into /usr/local. Once you have run it, open Terminal.app, and typing pdftk should greet you as follows:

$ pdftk
SYNOPSIS
       pdftk <input PDF files | - | PROMPT>
            [input_pw <input PDF owner passwords | PROMPT>]
            [<operation> <operation arguments>]
            [output <output filename | - | PROMPT>]
            [encrypt_40bit | encrypt_128bit]
(...)

You can download the updated package here: pdftk1.41_OSX10.6.dmg

(MD5 hash: ea945c606b356305834edc651ddb893d)
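To compare the downloaded file against that hash, something like the following sketch works with a stock Python installation. The function name md5_of_file is made up for this example.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large disk images need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If md5_of_file("pdftk1.41_OSX10.6.dmg") does not return the hash given above, the download is incomplete or has been altered.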

I only tested it on OS X 10.6.2. If you use it on older versions, please let me know in the comments whether it worked.