Dates and Time Zones in Rails

2014-04-18

I work on a video education app that needs to report how many lessons were assigned each day. I have a time-tracking app that needs to invoice based on when one month starts and another ends. And I’ve got another app for watching cronjobs that counts how many failures happened in a given day. All these apps share a problem: if they get time zones wrong, they will give incorrect results. For instance, if a video lesson was assigned on Tuesday after 5pm PDT, its UTC time will be Wednesday. Even though all three apps only show dates, not times, time zones still matter. So here are some lessons I’ve learned dealing with Rails time zones, both from writing the code myself and from leading a team.

But first, here are two articles that lay some groundwork:

Both those posts describe some nice methods provided by ActiveSupport/Rails, as well as why you’d want to use them instead of built-in Ruby methods. For instance, Time.now will get you the time in your OS’s default time zone, whereas Time.zone.now will get you the time in whatever time zone you set in config/application.rb. It’s a good idea to use the Rails time zone, since you have more control over it and it lives in your source control.
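For example, assuming config/application.rb sets the zone to Pacific time, the difference looks like this (the outputs shown are only illustrative):

# config/application.rb
config.time_zone = 'Pacific Time (US & Canada)'

# Anywhere in the app:
Time.now       # => 2014-04-18 20:05:00 +0000          (whatever the OS happens to use)
Time.zone.now  # => Fri, 18 Apr 2014 13:05:00 PDT -07:00  (the Rails setting)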

If you want to explore, take a look at ActiveSupport::TimeZone. You can get a specific time zone like this:

tz = ActiveSupport::TimeZone['Pacific Time (US & Canada)']

That tz will be the start of just about everything you do with time. An ActiveSupport::TimeZone is also what you get from Time.zone.
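For instance, here are a few methods you’ll likely use a lot (values are illustrative):

tz = ActiveSupport::TimeZone['Pacific Time (US & Canada)']

tz.now                        # the current instant, expressed in Pacific time
tz.local(2014, 4, 18, 9, 30)  # build a time from local wall-clock parts
tz.parse('2014-04-18 09:30')  # parse a string as a Pacific time
tz.at(1397838600)             # convert seconds since the Unix epoch to Pacific time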

Those other articles are great, but this post talks about time zones from a higher level, aspiring to share best practices rather than helpful methods. In my examples I’ll assume a Postgres database, although there’s nothing here that won’t work on MySQL as well.

Scope Your Time Zones

The first thing to remember is that all times are relative to somebody. Probably that’s a user, but maybe it’s something else. For my video education app, sometimes it’s a teacher and sometimes it’s a student. For my cronjob app, it’s a job (and sometimes a user). But there is always a reference point. In that sense, the articles above that recommend Time.zone.now are wrong. Time.zone.now is no better than Time.now if you need many time zones—and who doesn’t? So it should really be user.time_zone.now. I’d recommend adding this method to your User class right from the beginning, so that even if you haven’t yet implemented user-specific time zones, you are still writing the rest of your app correctly:

def time_zone
  ActiveSupport::TimeZone['Pacific Time (US & Canada)']
end
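Later, when you do add per-user time zones, this method can grow without touching any callers. Here is a sketch, assuming a hypothetical time_zone_name column on users holding names like "Pacific Time (US & Canada)":

def time_zone
  # time_zone_name is a hypothetical column; fall back to a default if it's blank or bogus
  ActiveSupport::TimeZone[time_zone_name.to_s] ||
    ActiveSupport::TimeZone['Pacific Time (US & Canada)']
end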

Ignore Time Zones as Much as Possible

Now that you’ve got time zones everywhere, the next step is to get rid of them. Thinking about time zones is hard, and finding time zone bugs is hard. You want to avoid them as much as possible!

It’s been said that in C, if you start thinking about big endian vs little endian, you’re probably doing things wrong. Oftentimes when programmers half-understand something, they do more than they should, and that’s very true with time zones. Your code will be a lot easier to understand if you tackle time zones with a few well-placed strokes, rather than lots of fiddling all over the place.

One thing to remember is that 07:05:00 PDT and 14:05:00 UTC are the same instant. If you converted both to seconds since the epoch, you’d get the same number. So in that sense, changing the time zone doesn’t change anything: it’s just a bit of extra metadata attached to the time, representing someone’s perspective. Knowing that x PDT and y UTC are the same instant is really helpful when you feel the urge to fiddle with time zones. Is your fiddling just a no-op?
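Here is a quick sanity check that they really are the same instant, using the parse method from above:

pacific = ActiveSupport::TimeZone['Pacific Time (US & Canada)']
utc     = ActiveSupport::TimeZone['UTC']

a = pacific.parse('2014-04-18 07:05:00')
b = utc.parse('2014-04-18 14:05:00')

a.to_i == b.to_i  # => true: same seconds since the epoch
a == b            # => true: TimeWithZone compares instants, not zones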

By default, Rails comes configured with its default time zone as UTC. Leave it that way! Since every time needs a reference point anyway, your code shouldn’t care about the Rails-wide setting. UTC is a good neutral choice. For one thing, it doesn’t have daylight saving time. And if you see it, you know you’re dealing with a time-zone-less value.

You should also leave your OS time zone as UTC, if possible. Keeping it consistent with Rails removes one chance for ambiguity. And again, it’s a good neutral choice.

You also want UTC in your database. If you use a migration to create columns with t.timestamps or t.datetime :foo, Rails will make a TIMESTAMP WITHOUT TIME ZONE column. You can think of this as a time in UTC if you like. Really it’s an int of (micro)seconds since the epoch. Whenever you give a time to ActiveRecord, it will convert it to UTC before it hits Postgres. Or more correctly, it will strip off the timezone part and give Postgres the int. Remember, it’s the same instant! But it’s nice to imagine the column as UTC. If you’re in psql and type SELECT created_at FROM lessons, that’s what you’re seeing.

When everything in your stack is UTC, it’s like looking through nice clear glass. You don’t have to think about conversions at each layer.

You should also strive to make your app code as time-zone-less as possible. The big principle here is to handle time zones once, hopefully at the beginning of the HTTP request, when you decide the correct reference point for all times. Usually that’s current_user.time_zone.

The second article above suggests you add this to your controller:

around_filter :user_time_zone, if: :current_user

def user_time_zone(&block)
  Time.use_zone(current_user.time_zone, &block)
end

That makes me pretty uncomfortable. The idea is that you can say Time.zone everywhere else in your app and always get the current user’s time zone. But I’d rather be explicit about where the time zone is coming from. Write code to take a tz argument if necessary (emphasis on the if necessary). This will make your code less surprising, less coupled, and easier to test.

Even better is to write your code to take a time. Almost always that’s really what you want. Remember that no matter the time zone, it’s all the same instant. Usually a single time is sufficient, because it can be a reference point for creating other times, using t + 1.day or t + 2.weeks or whatever. Use a (user|cronjob|foo)-scoped time zone to get your first time, and then forget about time zones for the rest of the stack. If you implemented the User#time_zone method above, your time zone code is already well-encapsulated, so there’s no need for an around_filter to further abbreviate things.
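For example, a hypothetical reporting method can take a plain time and never mention zones at all; only the caller decides the reference point:

# Hypothetical helper: everything below the call site just deals in instants.
def lessons_assigned_since(t)
  Lesson.where("created_at >= ?", t)
end

lessons_assigned_since(current_user.time_zone.now.midnight - 1.week)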

Here is another approach I don’t like. This blog post suggests you deal with time zones in Postgres like so:

created_at AT TIME ZONE 'UTC' AT TIME ZONE 'US/Pacific'

Does that look strange to you? What’s happening is that you start with a TIMESTAMP WITHOUT TIME ZONE, so first you tack on a time zone (just metadata), then you convert it to Pacific time. In other words, you’re doing this:

time without zone -> assumed to be UTC -> converted to Pacific Time

It’s good to know that this is how to get a Postgres timestamp converted to whatever time zone you want, but I wouldn’t recommend it in a Rails app (which is the article’s context). For one thing, Rails and Postgres don’t use the same names for all timezones, so you have to maintain your own mapping between the two. This is one of those chores that will keep nagging you for the life of your app, so I’d rather just avoid it. But more important, this approach means you have to pass your timezone all the way down to the database layer. I’d rather deal with time zones early, get to a TimeWithZone (or even an int), and then forget about time zones for the rest of the code.

Remember Daylight Saving Time

This is a small tip, but be careful about DST. You should avoid ever writing a fixed offset or a string like “PDT”. This is wrong:

Time.new(2014, 4, 8, 3, 15, 0, '-07:00')

If you were after PDT, your code is broken during PST. If you were after MST, your code is broken during MDT. Similarly in SQL if you say AT TIME ZONE 'PDT', you’ve broken PST.

If you stick with the ActiveSupport::TimeZone instances, you can forget about DST, and things will just work. That’s why there is no such thing as this:

ActiveSupport::TimeZone['PDT']

only this:

ActiveSupport::TimeZone['Pacific Time (US & Canada)']
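That one instance handles the PST/PDT switch for you. For example:

tz = ActiveSupport::TimeZone['Pacific Time (US & Canada)']

tz.local(2014, 1, 15, 12, 0)  # => Wed, 15 Jan 2014 12:00:00 PST -08:00
tz.local(2014, 7, 15, 12, 0)  # => Tue, 15 Jul 2014 12:00:00 PDT -07:00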

Even Dates are Times

When you think you’re just dealing with dates, time zones probably still matter. Was the lesson assigned on Tuesday or Wednesday? It depends. From the teacher’s perspective? The student’s? The principal’s? Unless you’re really sure, I’d recommend always storing a full date+time in your database. Also, in Ruby avoid converting things to Date. Of course if you read the articles above you know you don’t want this:

Date.today

But this isn’t any better:

Time.zone.today

Or even this:

current_user.time_zone.today

Here is a query I’ve seen:

Lesson.where("date(created_at) >= ?", tz.today - 3.days)

But that’s wrong, because you’re stripping off all the time zone information. It’s going to report some of Tuesday’s lessons as Wednesday’s. (It also means you need a database index on the expression date(created_at), which is probably less often useful than a normal index on just the column.)

To improve this query, I’d write it like this:

Lesson.where("created_at >= ?", tz.now.midnight - 3.days)

If today is Wednesday (for you), that will give all the lessons created since the beginning of Sunday. Because we’re carrying a full time all the way down to the database query, we don’t have problems with the definition of “today.”

If you are ever tempted to use today or to_date, you probably want these methods instead:

tz.now.midnight     # the start of today, 00:00:00
tz.now.end_of_day   # 23:59:59
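For example, “today’s lessons” from the current user’s perspective becomes a simple range query:

tz = current_user.time_zone
Lesson.where(created_at: tz.now.midnight..tz.now.end_of_day)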

Displaying Times

So far we’ve dealt with times as inputs. Times as outputs are a lot easier. Your rule should be to ignore time zones until the last minute, when you actually format the value for rendering. So the conversion can go right into your view:

= lesson.created_at.in_time_zone(current_user.time_zone).strftime("%A, %B %-d, %Y")

You might want a helper for this though, something like:

def format_time(time, tz, format)
  time.in_time_zone(tz).strftime(format)
end

Then your Haml can be:

= format_time(lesson.created_at, current_user.time_zone, "%A, %B %-d, %Y")

Or de-parameterize that as much as you like:

def format_time(time)
  time.in_time_zone(current_user.time_zone).strftime("%A, %B %-d, %Y")
end

and:

= format_time(lesson.created_at)

Conclusion

So in general, my principles for handling times in Rails are:

  • Even dates are times.
  • Set “global” time zones to UTC everywhere you can.
  • Don’t ever use global time zones; scope them to a user or whatever else is appropriate.
  • To most of your code, a time is just an instant.
  • Push time zone code to the very beginning of your request, to the top of your stack.
  • Push time zone code to the very end of your request, when rendering the view.

Good luck!

Basics of Web Architecture

2014-03-07

This post is adapted from an email I wrote several years ago and have since reused many times to help explain the “moving pieces” of the web to non-programmers. Someone on Hacker News asked about this kind of thing, so I thought I’d finally post it, with some images reused from my old talk at Wharton about the same thing. This article won’t make you a programmer, but hopefully it can give a big-picture overview of how the major parts fit together. It’s like Web 101. In some places it oversimplifies (“lies a little,” as I like to put it), but in ways that I think are helpful for a beginner.

HTML

HTML hit the world in 1994. The way it works is your web browser, based on a URL, asks a web server somewhere out on the Internet for an HTML file, and the web server sends it back. So there are two actors: the web browser and the web server.

Basic web architecture

The web server is just a program running on someone else’s computer. The most popular web server is called Apache, but there are many others. A web server listens for people requesting URLs, and it sends back HTML files, images, etc. There are lots of companies out there that will host your website for you. Basically they provide a web server, and you upload whatever files you want the world to see.

The web browser is Firefox or Safari or IE or whatever. You type a URL into the address bar, and it finds the web server out on the Internet and asks it for the file in the URL. The server sends it some HTML, and the browser is responsible for rendering the HTML file. It may also request supporting files like images. Different web browsers render HTML slightly differently, making many headaches for web designers.

IE5

HTML stands for HyperText Markup Language. The HyperText means it has links you can click to visit new pages, and the Markup means you use angle-bracket tags to “mark up” certain parts of the text. In markup, there is an opening tag and (usually) a closing tag. The closing tag repeats the name of the opening tag, but with a forward slash at the beginning. This markup can indicate structure or display. For instance, <i>ibid</i> will render “ibid” in italics; this is an example of using markup for display. On the other hand, you might have this:

<ul>
  <li>Walk the dog</li>
  <li>Do the laundry</li>
  <li>Pay the bills</li>
</ul>

This produces a bulleted list with three items (ul = unordered list, li = list item):

  • Walk the dog.
  • Do the laundry.
  • Pay the bills.

A tag may also have attributes, which look like name="value". This is how you write links, which are “anchor tags”: <a href="http://www.google.com/">Click here to visit Google</a>.

HTML may reference other files. When this happens, the browser downloads those files too and adds them to the display. The main example is images, which work like this: <img src="my-portrait.png">. (Notice there is no closing tag for images.) And here are some other cases of HTML referencing outside files:

CSS

CSS stands for Cascading Style Sheets. It’s an attempt to separate display-style information from the HTML code, so the HTML just contains structural markup. A CSS file is a separate file, referenced by the HTML file, which tells how each type of element should be rendered (e.g. in big font, with a 2-pixel red border, or whatever). It’s considered good form to use CSS for display and HTML for structure, at least insofar as that’s possible. This is partly because it reduces typing and makes changes easier, partly because it makes the HTML easier to read, and partly because it lets different people work on different things. CSS also supports more advanced display options than plain HTML, so some effects you can only achieve with CSS.

Again, each browser interprets CSS in its own slightly incompatible way. People who write really slick-looking web pages spend a ton of time discovering and applying tricks to get an acceptable display on each browser type. One way is to write separate HTML+CSS files for each browser type, detect what a user is using, and send the appropriate files. But no one really does this. At most people write a special CSS file for IE (which is the most incompatible with the others), and then use special codes to serve it only to IE browsers. There are other less drastic techniques as well.

Because this gets so expensive, it’s common to decide up front which browsers you’ll support, and which will get a slightly wonky-looking site. Your choice will depend on your expected audience. Nowadays some sites will just refuse to support older browsers or even not-so-old versions of IE. I like to aim for IE 8+ or even 7+. If you’re somebody like Amazon, then you add browsers with less market share like IE 5+, Safari, Opera, Konqueror, and maybe more. (Amazon is perhaps a bad example because most of their site uses a fairly loose, flexible design, so slight inconsistencies aren’t noticeable. A better example would be something with an artsy look like a winery.) Nowadays a big new challenge is supporting mobile devices like Androids and iPhones.

Javascript

Javascript is a full-fledged programming language (unlike HTML & CSS), with sequential execution of commands, variables, functions, etc. It runs on the user’s browser, which means a web server sends a bunch of Javascript code over the Internet, and the user runs it on his local machine. You can put Javascript straight into an HTML page like this:

<script language="javascript">
alert("Hello World!");
</script>

Or you can have a separate .js file referenced by the HTML page.

Either way, the point of Javascript is to cause dynamic effects in the user’s browser. You can’t accomplish this with HTML or CSS, because those just tell how to lay out the page. With Javascript you can make things change and move around. You can add popups like “Do you really want to delete that?” or “Call today for your free sample!”

Note that Javascript has nothing to do with Java. It was invented by Netscape when Java was a hip new programming language (invented by Sun), and as a marketing gimmick they called it Javascript. Really!

Again, different browsers treat Javascript in different ways. This is where the browser incompatibilities are the worst. It can be truly maddening, and to get good results you really need someone with a lot of experience.

jQuery

jQuery is an extension to Javascript. It’s pretty new, but it’s wonderful. It lets you easily get lots of trendy effects like fade-ins or rollovers that used to be more expensive to build. It also encapsulates a lot of the cross-browser boilerplate people used to write themselves to support all the different browsers out there. Basically, instead of calling native Javascript functions, you call the jQuery functions, which automatically test for the browser’s functionality and do the right thing based on the result. That sort of library has been around for many years, but jQuery is an exceptionally good implementation of it, and it’s quickly become the de facto standard. jQuery makes it much less expensive to get a pretty site and fancy effects.

In fact nowadays (2014) jQuery is old hat! People are doing more and more with Javascript. It’s no longer about rollovers and popups; many people build their whole interface with Javascript. To manage this extra complexity, there are new frameworks out there like Backbone.js, Angular.js, and Ember.js. These frameworks provide lots of common functionality and help you keep things organized.

Flash

Like Javascript, this is a programming language that gets run browser-side. But whereas Javascript can touch any element on the page, a Flash program is confined to a given square. Flash is particularly aimed at graphics, so the big ads are all written in Flash, and there are lots of Flash games, and YouTube videos are Flash. Unlike with Javascript, Flash doesn’t suffer from cross-browser incompatibilities because it is owned by Adobe (originally by Macromedia), who distributes browser “plugins” that execute the Flash code. Because the browser doesn’t run the Flash directly, it can’t be inconsistent about it. This is why you need a plugin to run Flash.

There are several disadvantages to Flash: it requires users to download the plugin (although most have it already); it is opaque to search engines, so Google can’t index what you put in Flash; and it is unsupported on iPhones. Sometimes it is used to create an intro movie on the front page of sites, but many people find that annoying, so I don’t recommend it. In general, Flash is being replaced with Javascript+HTML5, so learning it may not be a good investment, unless you want to make games or dynamic advertisements.

PHP

Let’s go back to the beginning: the way the web works is your web browser asks a web server for an HTML file, and the web server sends the file across the Internet. But what if the web page is something like a shopping cart, where the HTML is different for each user? In this case, the web server doesn’t just have a static HTML file, but it runs a program to generate an HTML file on the spot, and it sends the result to your web browser. So here the web server is basically a middle-man: it delegates the task of generating the HTML to a separate program, and it passes the result along to the web browser. The old term for this is CGI, and while it isn’t always correct any more to call this sort of thing a CGI, it’s still useful as a catch-all term for any server-side HTML generation. You can build a CGI in whatever language you like. Java and C# are common choices, as is PHP. Most modern techniques, PHP included, involve writing a file that is mostly HTML, with some special code embedded inside to output the variable parts. Generally people choose PHP for smaller sites and Java/C#/Python/Ruby for bigger sites, but that distinction seems less valid each year.

CGI (in any language) is quite different from browser-side programming like Javascript and Flash. First of all, it all happens on your own web server, so you don’t have to deal with browser incompatibilities. CGI is invisible to the user. By the time the browser sees anything, it’s just got plain old HTML. On the other hand, this means that CGI can’t give you effects like animations and popups, which require executing code on the browser. CGI is just a way to serve different HTML to each user. Note that CGI is not incompatible with Javascript and Flash. Any website of moderate complexity will employ CGI on the server side to generate custom HTML, plus Javascript/Flash on the browser side to get flashy effects. (You can even use CGI to dynamically generate Javascript, CSS, or images, but this is rare.)

Using CGI also imposes a larger burden on your web server than just serving static HTML files. Maybe the CGI has to access a database to find a user’s favorite genre of books, for example. If you have a lot of traffic, it’s important to consider the efficiency of your CGI code. The ability to support lots of traffic is called scalability. This isn’t something you really ought to worry too much about right away, but someone experienced will make fewer mistakes here.

An aside about performance

On the web, “performance” has two components: latency and scalability. Latency is how snappy your site is: how quickly you can serve a page and get all the images etc. loaded. Big players like Google and Facebook have done studies showing that engagement and conversions drop off if you have page load times over 600ms or so, hence the saying “speed is a feature.”

Scalability, on the other hand, refers to how well your site degrades as you add more users. So instead of snappiness, it’s about capacity. You can also think of scalability as how hard/expensive it is to add more capacity.

People talk about “vertical” vs. “horizontal” scalability. Vertical scalability means buying faster hardware. Horizontal scalability means buying more cheap hardware and distributing the load. Vertical scalability is expensive and has hard limits, but horizontal scalability requires careful up-front design. Rails and Django are nice frameworks because they enable or even in some ways force a more scalable architecture. One thing that is hard (by “hard” I mean it’s a current research topic for Ph.D.s) is horizontal scaling for databases. There are worthwhile techniques, but they are complex to implement. Often it’s better to just buy a bigger box.

Latency and scalability don’t always come together. You can be snappy but crash when a media hit causes traffic to spike, or you could have a sluggish site that keeps chugging no matter how many visitors arrive. Sometimes optimizations will help both categories, but not always.

Java+J2EE, Ruby+Rails, Python+Django

These are all server-side technologies like PHP. They all go beyond CGI in that they are “application servers”, so they do more work for you, like tracking sessions and pooling database connections. (Actually PHP can do that, too.) They encourage a pattern of web development called Model-View-Controller (MVC), which helps separate your code into different concerns. The Model defines the structure of your data and your “business logic.” It is the back end of your site. Its job is to talk to the database and provide a convenient, intelligible API for the other layers. The Controller is what handles incoming web requests. Each request goes to the Controller code first. The Controller decides what to do. It asks things like “Is this person logged in?” and “What page do I show for this URL?” The View is what renders the actual HTML. Usually you write your view in some kind of templating mini-language, like JSP or ERB or HAML. All those languages let you write a bunch of HTML and then embed Java/Ruby/etc code into it to inject things like user names, product images, or whatever. In general, the Controller sets up whatever data the View will need by loading Model objects, and then the View reads those Model objects to fill in its blanks.
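As a rough Rails-flavored sketch of those three layers (the names here are made up for illustration):

# Model: talks to the database and holds the business logic.
class Product < ActiveRecord::Base
  def on_sale?
    sale_price.present? && sale_price < price
  end
end

# Controller: handles the incoming request and loads model objects for the view.
class ProductsController < ApplicationController
  def show
    @product = Product.find(params[:id])
  end
end

# View (an ERB template): mostly HTML, with blanks filled in from the model.
#   <h1><%= @product.name %></h1>
#   <p><%= @product.on_sale? ? "On sale!" : "Regular price" %></p>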

To new coders, MVC probably seems like a lot of extraneous infrastructure. It does increase the learning curve, but it’s basically the standard in professional web development. It’s the pattern adopted by J2EE, Rails, and Django as the right way to build websites. By separating your code into layers and giving each layer its own responsibility, it becomes much easier to manage complexity. Otherwise it’s all but impossible to reason correctly about what your code will do. Without MVC, it’s easy to make a small change and inadvertently break something that you thought was unrelated. MVC protects you from this kind of thing. It also makes it easier to separate tasks among multiple developers.

Why not distribute public keys via SMTP?

2013-07-15

One of the hurdles to end-to-end email encryption is key distribution. Doing this manually ensures that only technical people will bother, so why not automate it via some protocol? A natural method would be to extend SMTP so the sender can ask whether a recipient’s key exists, and if so, use it to encrypt the message it is about to send, e.g. like this:

HELO relay.example.org
250
MAIL FROM:<bob@example.org>
250
KEY FOR:<alice@example.com>
250
--- BEGIN PGP PUBLIC KEY BLOCK ---
...
--- END PGP PUBLIC KEY BLOCK ---
RCPT TO:<alice@example.com>
...

The KEY FOR capability could also be discoverable via the common EHLO command:

EHLO relay.example.org
250-smtp.example.org 
250-KEYS    
...

Why hasn’t this been suggested before? Googling I found only one link. Is there some flaw with it?

Paperclip with Server-Side Files

2013-06-07

Several times in the last few years I’ve built Rails sites that needed to store files via Paperclip that were not uploaded by a user, but generated programmatically on the server side, for instance PDF or Excel reports built from something in the database. There’s not much documentation on this, and my first effort led to this StackOverflow question. The first solution is not perfect, but it sometimes works:

user.photo = photo_bytes
user.photo.instance_write(:file_name, photo_file_name)
user.save!

But sometimes that won’t do, because Paperclip uses the file name to determine what kind of file you’ve got. For instance, suppose you’re generating a spreadsheet with the WriteExcel gem:

io = StringIO.new
xls = WriteExcel.new(io)
# ...
xls.close
report.spreadsheet = io.string

This will raise a NoHandlerError in the last line. The solution is to make sure you give Paperclip something with a filename, not just the raw binary string.

The easy way out would be to write your spreadsheet to a temp file, then point Paperclip at it. But writing and reading a temp file adds risk of failure, increases disk IO, and slows things down. I’m stubborn, so I really wanted to keep everything in memory.

My solution is based on the second answer to that StackOverflow question above. If we give Paperclip an IO object with a method called original_filename, it will do the right thing. So let’s define this class:

# A StringIO that knows its filename, which is all Paperclip needs
# to pick a content type and storage path.
class NamedStringIO < StringIO

  def initialize(data, filename)
    super(data)
    @filename = filename
  end

  def original_filename
    @filename
  end

end

Now we can change our Paperclip code to this:

io = StringIO.new
xls = WriteExcel.new(io)
# ...
xls.close
report.spreadsheet = NamedStringIO.new(io.string, xls_filename)

That code makes Paperclip happy, and it lets us generate a named “file” that never hits the disk.

It would be even better to create our NamedStringIO at the top, so we don’t have that extra StringIO we give to WriteExcel. For some reason I couldn’t get WriteExcel to accept my NamedStringIO, so I had to copy things around a bit. Oh well, good enough for me!

Rules for Rails Migrations

2013-03-26

I see a lot of developers, even smart and experienced ones, get frustrated by Rails migrations—or cause frustration for others on their team. So here is a quick overview of how they work, plus a few rules for handling them without headaches. If you follow these rules, your migrations can be easy and smooth. Perhaps you’ll even stop wishing you could use Mongo.

Overview

A migration is a Ruby script with an up method containing the database changes you’d like applied, along with a down method to undo those changes and restore the database to its old structure. In newer versions of Rails, you can also write a single change method, and if you restrict yourself to the right methods, Rails will automatically understand how to undo your changes.

Migration scripts live in db/migrate, and they are named with a timestamp plus description, like this:

20121024000909_create_users.rb

or this:

20121102162841_add_address_to_profile.rb

You can create a new migration by saying rails g migration create_users, and Rails will automatically give it a timestamp and start the file. (Tip: If you’re using the command line, you can follow up that command with vi db/migrate/*!$* to open the migration in a text editor.)

By convention, migrations are named create_foo when creating a brand-new table, and add_foo_to_bar when adding the column foo to the table bar. But you can name them whatever you like. Note each migration needs a unique name, despite the timestamp, because each migration lives in a Ruby class whose name is the migration name camelized.
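For example, that add_address_to_profile migration might look like this (a sketch using the newer change style):

# db/migrate/20121102162841_add_address_to_profile.rb
class AddAddressToProfile < ActiveRecord::Migration
  def change
    add_column :profiles, :address, :string  # Rails knows how to reverse this on rollback
  end
end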

You can run all pending migrations by saying rake db:migrate, and you can undo the last-run migration with rake db:rollback. (You can roll back several at once with rake db:rollback STEP=3.) Rails automatically keeps track of which migrations you’ve applied in the schema_migrations table, which has just a single column listing the timestamps of all applied migrations.

Rules

Always implement the down method.

You might be tempted to leave out the down method, but it’s very useful to fill it in. Everyone makes mistakes, and the down method ensures you can recover. It also means that if your migration is not quite right the first time, you can db:rollback, make your fixes, and then db:migrate again. (But see below about pushed migrations.)

You may run into trouble if a migration fails halfway through. If this happens, the migration is not recorded in schema_migrations, but it might leave tables/columns in your database. Rather than dropping these, I prefer to simply comment out the lines that ran successfully, then restore them once the migration has succeeded. This works both when migrating and when rolling back.

A corollary to “implement the down method” is “test the down method.” I don’t mean write unit tests. I just mean when you think everything is correct, run rake db:rollback db:migrate and make sure it all works. It’s easy to have errors in your down methods if you never run them.

Don’t edit pushed migrations.

It’s fine to roll back, edit, and retry when you’re just working on new code that is private to your repo, but never edit a migration after sharing it with other developers, e.g. after doing a git push. Even safer would be to never edit a migration after committing it. If you do, you’re very likely to create out-of-sync databases for other developers and on production. This is because they may have already run the migration before your changes got pushed, and since they’ve already run it, they’ll never run it again to pick up your changes. Now your database looks different than everyone else’s.

I’ve seen this happen many times, and it’s probably the top cause of frustrations with Rails migrations. Recovering from it generally involves surgery, and unless the surgeon is patient and careful, he’s likely to make things even worse. To avoid the problem, if you’ve pushed a migration that you find is not quite right, always make your changes by adding a new migration, not by editing the old one.

Surgery that can help (if done right) is to delete rows from schema_migrations and manually add/remove/alter tables and columns to get back on track. Whatever the techniques, your goal should be to bring the databases into line with the accepted version history, so that running the migrations from scratch would produce the same database that comes out of your operating room. Otherwise you’re creating a time-bomb that will produce more out-of-sync databases somewhere down the line.
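As a hedged sketch of that kind of surgery (the version number and column here are made up), run by hand in a console rather than in a migration:

# Pretend the bad version never ran, then adjust the schema by hand
# so it matches the accepted migration history.
ActiveRecord::Base.connection.execute(<<-EOQ)
  DELETE FROM schema_migrations WHERE version = '20130301120000'
EOQ
ActiveRecord::Base.connection.execute(<<-EOQ)
  ALTER TABLE profiles DROP COLUMN address
EOQ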

Stick to SQL.

Opinions differ on how to create a new database from scratch, for instance when a new developer joins or you decide to add that CI server. Some people like to create it from schema.rb, like Athena springing full-grown from Zeus’s head. Others like to start with an empty database and then run all the migrations from the beginning of the project (which could be years old). Others just make a dump from somewhere and import it where needed. Whatever your approach, there is value in at least striving for a non-broken migration history, so that ideally you can run all the migrations against an empty database. The closer you are to this ideal, the easier it will be to go back in time or handle branches.

This goal means your migration scripts should not rely on the rest of your source code. It’s common for migrations to use model classes, but don’t! Here is a place you should defy the DRY imperative. The reason is that when you run a 6-month-old migration, your model classes have today’s code, so it’s very easy to have missing methods, changed methods, renamed methods, even missing classes.

The best way to avoid problems is to keep your migrations entirely self-contained. For this I recommend writing data changes (i.e. DML changes) in pure SQL. You can do it like this:

ActiveRecord::Base.connection.execute(<<-EOQ)
    UPDATE  foo
    SET     bar = baz
    WHERE   ick = ack
EOQ

If you don’t like SQL, you’ll just have to deal. It’s good for you. Or change jobs and write Javascript for a living. :-P

Conclusion

Follow these rules to get frustration-free Rails migrations. To sum up, they are:

  • Write and test the down method.
  • Don’t edit pushed migrations.
  • Keep your migrations self-contained (ideally just SQL).

Defy them at your peril!

Fun Postgres Puzzle

2013-03-21

Someone posted a question recently to the Postgres mailing list that makes for a great puzzle. He had a table (we’ll call it m) like this:

       d       |       v       
---------------+---------------
 geography     | north
 geography     | south
 industry type | retail
 industry type | manufacturing
 industry type | wholesale
(5 rows)

Basically this table was a list of “dimensions” and the possible values for each dimension. So there is a “geography” dimension with possible values “north” and “south”, and an “industry type” dimension with values “retail”, “manufacturing”, and “wholesale”. The table could hold more than two dimensions or have more possible values for each dimension.

His goal was to find a query that would give him all possible combinations, assuming that each combination has exactly one value along each dimension. This is roughly the idea behind a cartesian product, so it seems like the thing you’d use a CROSS JOIN for, but here the data is all in one table, with n possible partitions. So how to write a query that works regardless of how many unique dimensions the table holds?

If you want to stop reading here and go work on the puzzle yourself, I wouldn’t blame you. And I’d love to hear what people come up with. I think it’s a really hard puzzle. If you want to see my solution, keep reading.

You can recreate the table with these commands:

CREATE TABLE m (
  d VARCHAR(255) NOT NULL,
  v VARCHAR(255) NOT NULL,
  PRIMARY KEY (d, v)
);

INSERT INTO m
(d, v)
VALUES
('geography', 'north'),
('geography', 'south'),
('industry type', 'retail'),
('industry type', 'manufacturing'),
('industry type', 'wholesale');

My solution may not be the only one, but I’m proud of doing it in pure SQL (no plpgsql), and I think it’s a neat demonstration of several special features in Postgres. Here is the query:

WITH RECURSIVE t(combo, n) AS (
  WITH dims AS (SELECT DISTINCT d, row_number() OVER () AS n FROM m GROUP BY d)
  SELECT '{}'::text[], 1
  UNION ALL
  SELECT array_append(t2.combo::text[], m.v::text), t2.n+1
  FROM  t t2, dims
  CROSS JOIN m
  WHERE m.d = dims.d AND dims.n = t2.n
) 
SELECT combo
FROM t
WHERE n = (SELECT COUNT(DISTINCT d) + 1 FROM m);

The first thing to note is that we’re using a WITH expression. This is called a Common Table Expression (CTE), and it’s basically a way to create a sub-query and give it a name. In our case the “table” is named t, with columns combo and n.

But this CTE is special, because it uses the RECURSIVE keyword. With a recursive CTE, your definition is actually two queries, separated by UNION ALL. The query on top is the starting condition, in our case just a single row. The query on bottom is executed repeatedly until it returns no more rows, and it is allowed to access the results of the previous execution. The full results of the CTE are the starting condition plus all the rows ever returned by the recursive part.

To work through our CTE, let’s look more closely at the starting condition. Here it is:

SELECT '{}'::text[], 1

That first column is a special Postgres feature called an array. An array is pretty much what you’d expect, although unlike in most programming languages they are 1-indexed. They have similar syntax for accessing their elements (although we don’t use it here):

some_array[4]

An array literal is actually a string cast to an array. Here is a literal for the first three positive integers:

'{1, 2, 3}'::int[]

So you can see that our CTE’s initial SELECT contains an empty text array.

The idea is for each array to represent one combination, so we might have {wholesale,north} or {retail,south}. We will grow the array one dimension at a time until all the dimensions are included. We are using arrays because they can have as many elements as we want, whereas a SQL query must have a fixed number of columns.

The second half of our CTE is how we grow the array. Here is the code:

  SELECT array_append(t2.combo::text[], m.v::text), t2.n+1
  FROM  t t2, dims
  CROSS JOIN m
  WHERE m.d = dims.d AND dims.n = t2.n

So t (aliased to t2) is whatever the previous iteration produced. For each row in t, we want to create several new rows, one for each possible value of the next dimension, tacking those values onto the end of the array. In other words, if t has the row {wholesale}, we want to produce the new rows {wholesale,north} and {wholesale,south}. The CROSS JOIN and array_append accomplish this for us.

We also need some way to stop the recursion. Postgres will quit re-executing our query if it ever produces zero rows. To make sure this happens, we include a second “counter” column, which also serves as a way to grab a different dimension on each iteration. This is what our nested CTE named dims is all about. It’s just a table with one row per dimension, and the rows numbered. Our WHERE clause makes sure that we process a different dimension on each iteration, and stop when there are none left.

The other thing of note is that in our nested CTE (dims), we are using a fancy Postgres feature called a window function. Our window function is the row_number() OVER () part, which, as window functions go, is actually a crashingly uninteresting specimen. Rather than turning this post into a window function tutorial, I’ll just say that you should go read about them, because they are wonderful. Basically they let you get values into a non-GROUP-BY query that you’d normally need aggregate functions to compute. They are like a SQL superpower.

Finally, you can see that in the outermost query, we are filtering out the “intermediate” rows from our CTE, so that we only get combinations that include all dimensions.

Here are the results:

         combo         
-----------------------
 {wholesale,north}     
 {wholesale,south}     
 {retail,north}        
 {retail,south}        
 {manufacturing,north} 
 {manufacturing,south} 
(6 rows)

I hope you enjoyed this puzzle! I’m delighted how it does something impossible in ordinary SQL (return results with a dynamic number of “columns”), and combines three exotic Postgres features you may not have seen before.
