Postgres custom range types for Geoip

2016-07-05

I had to add a geoip lookup table to one of my projects recently, and it was a perfect application for the new(ish) range types in Postgres. A range is a column type with a start and end point. The endpoints can be “closed” (includes the endpoint) or “open” (includes everything up to the endpoint, but not the endpoint itself): for instance for integers [1,3] is 1, 2, and 3, but [1,3) is just 1 and 2. Postgres comes with a few built-in range types for numbers and timestamps, but you can also define your own. Since I wanted to use the Postgres inet datatype, that’s what I had to do.
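
For instance, you can see the difference between closed and open endpoints with the built-in int4range type (a quick psql illustration, not from my geoip project):

db=> select 3 <@ '[1,3]'::int4range, 3 <@ '[1,3)'::int4range;
 ?column? | ?column? 
----------+----------
 t        | f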

My goal was to have a table like this:

db=> \d geoips
        Table "public.geoips"
    Column    |   Type    | Modifiers
--------------+-----------+-----------
 ips          | inetrange | not null
 country_code | text      | not null
 latitude     | real      | not null
 longitude    | real      | not null

Also I wanted to get fast lookups based on which row contained a given IP. I would have about 4 million rows, so I would need an effective index. In fact, part of the reason for using a range type is that ranges support GiST indexes, where you can get fast lookups based on both the start and end values. A B-Tree index for something like ip BETWEEN start_ip AND end_ip can quickly get you to start_ip, but then it needs to scan forward for the end_ip check. So I expected a GiST index would do better.
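
For comparison, the two-column design I wanted to avoid would look something like this (just a sketch for illustration, not a table from the project):

CREATE TABLE geoips_btree (
  start_ip     inet NOT NULL,
  end_ip       inet NOT NULL,
  country_code text NOT NULL
);

CREATE INDEX ON geoips_btree (start_ip, end_ip);

-- The index can seek to the right start_ip quickly, but the end_ip
-- condition still has to be checked while scanning forward:
SELECT * FROM geoips_btree WHERE '1.2.3.4'::inet BETWEEN start_ip AND end_ip;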

Defining the range type can be as simple as this:

CREATE TYPE inetrange AS RANGE (
  subtype = inet
);

Then we can start using them:

db=> select '[1.2.3.4,1.2.3.8]'::inetrange;
     inetrange     
-------------------
 [1.2.3.4,1.2.3.8]

You also get a free constructor function:

db=> select inetrange('1.2.3.4', '1.2.3.8', '[]');
     inetrange     
-------------------
 [1.2.3.4,1.2.3.8]

And by the way, since the inet type supports IPv6 addresses, you can make ranges with those too. The only caveat is I haven’t found a way to prevent you from mixing them:

db=> select inetrange('1.2.3.4', '2001:0db8:0000:0042:0000:8a2e:0370:7334', '[]');
                inetrange                
-----------------------------------------
 [1.2.3.4,2001:db8:0:42:0:8a2e:370:7334]

So don’t do that! :-)
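
One possible guard, once the geoips table below exists, is a CHECK constraint comparing the address families of the two endpoints. This is my own sketch, not something from the original setup:

-- family() returns 4 or 6, and lower()/upper() pull the endpoints out of a range:
ALTER TABLE geoips
  ADD CONSTRAINT geoips_ips_same_family
  CHECK (family(lower(ips)) = family(upper(ips)));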

You can also use the <@ operator to see if an IP falls within a range:

db=> select '1.2.3.6'::inet <@ '[1.2.3.4,1.2.3.8]'::inetrange;
 ?column? 
----------
 t

Another nice thing about using a range is we can define an exclusion constraint to ensure that no two ranges overlap. That will guarantee that our lookups never return more than one row. We can include the exclusion constraint when we create the table:

CREATE TABLE geoips (
  ips inetrange NOT NULL,
  country_code TEXT NOT NULL,
  latitude REAL NOT NULL,
  longitude REAL NOT NULL,
  CONSTRAINT geoips_dont_overlap
    EXCLUDE USING gist (ips WITH &&)
    DEFERRABLE INITIALLY IMMEDIATE
);

That tells Postgres to forbid any new value x where x && ips is true, when checked against all old ips. It also automatically creates a GiST index for us—the same index we’ll use when we query the table.
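
Here is a quick sanity check that the constraint does its job (the values are made up; the second insert should be rejected for violating geoips_dont_overlap):

INSERT INTO geoips VALUES ('[5.0.0.0,5.0.0.255]', 'XX', 0, 0);

-- This range overlaps the row above, so the exclusion constraint should reject it:
INSERT INTO geoips VALUES ('[5.0.0.128,5.0.1.0]', 'XX', 0, 0);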

Now we can load our data—and by the way if you use COPY or \copy with a CSV file, Postgres has no trouble with range literals:

"[1.0.0.0,1.0.0.255]","AU","-27.467940","153.028090"

But there is still one problem: our GiST index isn’t so fast after all:

db=> select * from geoips where '1.2.3.4'::inet <@ ips;
         ips         | country_code | latitude | longitude 
---------------------+--------------+----------+-----------
 [1.2.3.0,1.2.3.255] | AU           | -27.4679 |   153.028 
(1 row)

Time: 1.120 ms
db=> select * from geoips where '10.2.3.4'::inet <@ ips;
            ips            | country_code | latitude | longitude 
---------------------------+--------------+----------+-----------
 [10.0.0.0,10.255.255.255] | -            |        0 |         0 
(1 row)

Time: 54.669 ms
db=> select * from geoips where '100.2.3.4'::inet <@ ips;
           ips           | country_code | latitude | longitude 
-------------------------+--------------+----------+-----------
 [100.2.3.0,100.2.3.255] | US           |  40.8681 |  -73.4257 
(1 row)

Time: 322.719 ms

As you can see, it gets slower and slower for higher-numbered IPs. What is going on here?

The problem is that we didn’t implement a function to tell GiST indexes how “far apart” two inetranges are. That means we have to start all over! So drop the table, drop the type, and try again. First, we’ll implement the diff function:

CREATE OR REPLACE FUNCTION inet_diff(x inet, y inet)
  RETURNS DOUBLE PRECISION AS
$$
BEGIN
  RETURN x - y;
END;
$$
LANGUAGE plpgsql STRICT IMMUTABLE;

As you can see, all this does is subtract one inet from the other inet. Pretty simple!
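
For example (not output from the original post, but what you should see):

db=> select inet_diff('1.2.3.8', '1.2.3.4');
 inet_diff 
-----------
         4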

Now create the inetrange again:

CREATE TYPE inetrange AS RANGE (
  subtype = inet,
  subtype_diff = inet_diff
);

Now create the table, load it, and try querying again:

db=> select * from geoips where '1.2.3.4'::inet <@ ips;
         ips         | country_code | latitude | longitude 
---------------------+--------------+----------+-----------
 [1.2.3.0,1.2.3.255] | AU           | -27.4679 |   153.028 
(1 row)

Time: 1.004 ms
db=> select * from geoips where '10.2.3.4'::inet <@ ips;
            ips            | country_code | latitude | longitude 
---------------------------+--------------+----------+-----------
 [10.0.0.0,10.255.255.255] | -            |        0 |         0 
(1 row)

Time: 0.678 ms
db=> select * from geoips where '100.2.3.4'::inet <@ ips;
           ips           | country_code | latitude | longitude 
-------------------------+--------------+----------+-----------
 [100.2.3.0,100.2.3.255] | US           |  40.8681 |  -73.4257 
(1 row)

Time: 0.618 ms

Those times are much better! So that is how you can use custom Postgres ranges to implement a fast geoip lookup table.

Now there is one last thing: technically if your range elements are discrete rather than continuous (as ours are), you should define a canonical function for the type, so that Postgres can resolve ambiguities between ranges with open and closed endpoints. For instance, [1,3) is the same thing as [1,2], but Postgres doesn’t know that unless we tell it.

Unfortunately, because the function depends on the type and the type depends on the function, there are complications. You have to first create a “shell type”, which requires superuser privileges, and also your canonical function must be written in C, because plpgsql functions can’t return shell types. And loading a C function requires superuser privileges too. To make this all a lot simpler, I created a Postgres extension for inetrange. That will create the inetrange type as described here, but also with a canonical function. So for real work, it is probably easiest to just use that! Or you could use the non-superuser commands above and skip the extension.
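
If you do go the extension route, installing it is the usual one-liner (assuming the extension’s files are already built and installed on the server):

db=> CREATE EXTENSION inetrange;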

Ledger with Autosync

2015-08-01

A while back I started tracking my finances with ledger-cli, a command-line tool for double-entry bookkeeping. It’s been a lot of fun, but one of the stumbling blocks is recording each transaction by hand. Fortunately almost every bank nowadays offers an API so you can download your recent account activity in a format like OFX (Open Financial Exchange) or QFX (Quicken’s extended OFX), and ledger has tools to import that data.

For this I use ledger-autosync, which is built on top of ofxclient. First you run ofxclient to create a ~/ofxclient.ini file with the details of all your accounts. Then you run ledger-autosync, which will first read your ledger file (whatever $LEDGER_FILE points to), then contact the banks and print out any new transactions. It tries to guess expense categories, but often you need to correct these. Still, that’s a lot easier than typing in everything!

Because of this step, I like to have ledger-autosync append to a review.ldg file, which I periodically edit to correct expense categories, then cut-and-paste into my regular $LEDGER_FILE. I also do some extra tweaking to format the entries how I like them. Here is the command I use:

ledger-autosync \
  -l ~/share/accounting/in-progress.ldg \
  -i 2 \
  | sed 's,^\([0-9]\{4\}\)/\([0-9]\{2\}\)/\([0-9]\{2\}\),\1-\2-\3,' \
  >> ~/share/accounting/review.ldg

That uses two spaces for indents (-i 2) and rewrites the dates.
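
After those tweaks, an entry waiting in review.ldg looks something like this (a made-up example; the expense account is just ledger-autosync’s guess, which I might still correct):

2015-07-28 AMAZON.COM
  Expenses:Misc                          $23.45
  Assets:Checking                       -$23.45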

Also note how I override $LEDGER_FILE with the -l option. That’s a trick to keep ledger-autosync from entering the same transactions twice into review.ldg. The contents of in-progress.ldg are just this:

# This file exists so that ledger-autosync will read review.ldg
# and not download entries it's already pulled down
# but we haven't yet reviewed.

include pj.ldg
include review.ldg

And that’s it! This setup makes it pretty painless to keep track of everything. I don’t have to do a lot of typing, and nothing slips through the cracks. I actually kind of like that categorizing forces me to look at expenses every week or so. If you are thinking about using ledger, hopefully this helps you out!

Paperclip expiring_url and "Request has expired"

2015-04-28

I hit a weird error today. My project stores user uploads on S3 with private permissions, and we use the Paperclip expiring_url method to let people download them. This started failing with an error like this:

<Error>
  <Code>AccessDenied</Code>
  <Message>Request has expired</Message>
  <Expires>2015-04-28T18:30:51Z</Expires>
  <ServerTime>2015-04-28T18:32:52Z</ServerTime>
  <RequestId>BLAHBLAHBLAHBLAH</RequestId>
  <HostId>
    MOREBLAHMOREBLAHMOREBLAHMOREBLAHMOREBLAHMOREBLAHMOREBLAH
  </HostId>
</Error>

The problem was that this error occurred immediately, even though we were generating the URLs with a 10 minute expiration time. Well, it turns out S3 was living in the future by 7 minutes (according to ServerTime above), and our own app was living in the past by another 5. So 5 + 7 = 12 and our URLs were expired before we even generated them. 10 minutes is the example time in the Paperclip documentation linked above, but I guess it’s too short. Watch out for clock skew out there folks!

Another reason to use generate_series for time series queries

2015-03-16

Whenever I have to write an aggregate SQL query over time-series data, I start with generate_series. Mostly that’s to avoid missing rows in the output. For instance, if I’m writing a query to show total sales in every hour, I don’t want to write this:

SELECT  extract(hour FROM order_placed_at) h,
        SUM(total_price)
FROM    sales
WHERE   order_placed_at BETWEEN ? AND ?
GROUP BY h
ORDER BY h

because then, if there is an hour with no sales, that row is simply missing from the result.

Instead I use generate_series to generate all the hours, and join my real table to that:

SELECT  h,
        COALESCE(SUM(total_price), 0) AS revenue_per_hour
FROM    generate_series(0, 23) s(h)
LEFT OUTER JOIN sales
ON      extract(hour FROM order_placed_at) = h
AND     order_placed_at BETWEEN ? AND ?
GROUP BY h
ORDER BY h

That way I get 24 rows every time.

But another interesting use case for generate_series appeared on the Postgres mailing list: Suppose you have a table of events with start_time and end_time, and you want to report how many events were “ongoing” for each hour of the day. That means a row of data can be counted in multiple output rows. Without generate_series this is hard to write, but if you use that as your “core” table then join to your data table, it just works:

SELECT  h,
        COUNT(events.id) events_per_hour,
        array_remove(array_agg(events.id), NULL) active_events
FROM    generate_series(0, 23) s(h)
LEFT OUTER JOIN events
ON      h BETWEEN extract(hour FROM events.start_time) AND extract(hour FROM events.end_time)
GROUP BY h
ORDER BY h

That gives you one output row per hour, and an event can be counted in multiple rows, once for each hour that includes it. I’ve thrown in an array_agg so you can see the IDs of each event, just to double-check my work.
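
By the way, to run these examples yourself you need an events table; something shaped like this would do (my own sketch, since the post doesn’t show the DDL):

CREATE TABLE events (
  id         SERIAL PRIMARY KEY,
  start_time TIMESTAMP NOT NULL,
  end_time   TIMESTAMP NOT NULL
);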

But there are still problems. What if our query is supposed to include multiple days? How will it handle events that cross midnight? What about events that start on Monday and run until Wednesday? For these kinds of things, we should avoid extract in our BETWEEN, and instead of generating a series of integers, we should generate a series of timestamps. This is also a great chance to use Postgres’s new(ish) range datatype, to avoid our own bug-prone interval comparisons:

SELECT  extract(hour FROM s.h) AS hour,
        COUNT(DISTINCT events.id) events_per_hour,
        array_remove(array_agg(events.id), NULL) active_events
FROM    generate_series(
          date_trunc('hour', '2015-03-01 00:00:00'::timestamp),
          date_trunc('hour', '2015-03-31 23:00:00'::timestamp),
          '1 hour') s(h)
LEFT OUTER JOIN events
ON      tsrange(events.start_time, events.end_time) && tsrange(s.h, s.h + interval '1 hour')
GROUP BY hour
ORDER BY hour

Here we are using the && (overlaps) operator to check whether each event overlaps with each window. If we didn’t want to use ranges, we could also use the SQL OVERLAPS operator, which accomplishes the same thing.
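
For instance, OVERLAPS takes two (start, end) pairs directly (a quick standalone illustration, not from the original query):

-- Returns true: the second hour starts inside the first window.
SELECT ('2015-03-16 10:30'::timestamp, '2015-03-16 11:30'::timestamp)
       OVERLAPS
       ('2015-03-16 11:00'::timestamp, '2015-03-16 12:00'::timestamp);

-- In the query above, the join condition would then read:
--   ON (events.start_time, events.end_time) OVERLAPS (s.h, s.h + interval '1 hour')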

Now suppose we also want to report on the total time each event was “active” during each window. That is just a SUM() we can add to our query, using the * (intersection) range operator. But be careful about NULL values coming from our LEFT JOIN! In ranges, a null at either end represents unbounded, so tsrange(NULL, NULL) is not NULL, but all time from negative infinity to positive infinity! Here is a query that works around that:

SELECT  extract(hour FROM s.h) AS hour,
        COUNT(DISTINCT events.id) events_per_hour,
        array_remove(array_agg(events.id), NULL) active_events,
        SUM(CASE WHEN events.id IS NULL
            THEN NULL
            ELSE date_trunc('minute', range_to_interval(
                tsrange(s.h, s.h + interval '1 hour') *
                tsrange(events.start_time, events.end_time)
              ))
            END) active_hours
FROM    generate_series(
          date_trunc('hour', '2015-03-01 00:00:00'::timestamp),
          date_trunc('hour', '2015-03-31 23:00:00'::timestamp),
          '1 hour') s(h)
LEFT OUTER JOIN events
ON      tsrange(events.start_time, events.end_time) && tsrange(s.h, s.h + interval '1 hour')
GROUP BY hour
ORDER BY hour
;

Oh, one more thing: there is no such thing as range_to_interval. A surprising omission! But here is a function that defines it:

CREATE OR REPLACE FUNCTION range_to_interval(tsrange) 
RETURNS interval
AS
$$
SELECT upper($1) - lower($1)
$$
LANGUAGE sql
STABLE;
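
For example, with a 45-minute range:

db=> select range_to_interval(tsrange('2015-03-16 10:00', '2015-03-16 10:45'));
 range_to_interval 
-------------------
 00:45:00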

It might be even more proper to create a user-defined cast here, but I’ll leave that as an exercise for the reader. :-)
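
If you did want the cast, I believe it would look something like this (an untested sketch, and note that creating a cast between two built-in types requires owning one of them, so probably superuser again):

CREATE CAST (tsrange AS interval)
  WITH FUNCTION range_to_interval(tsrange);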

Lateral Join Alternatives

2015-02-23

In the comments to my post about Postgres lateral joins, someone asked about getting the same results with better performance using a GROUP BY query. Actually you can’t get the same results with a straight GROUP BY clause, but there are three more ways to write the query that are worth considering. So I’d like to talk first about why (top-level) GROUP BY doesn’t work, and then look at the other ways to do it (including a subquery GROUP BY).

Recall that the problem was: for each restaurant, show its info, plus the score of its most recent inspection (if any), plus the number of violations from that inspection. When you hear “most recent” it’s natural to think about GROUP BY. Maybe you want to write something like this:

SELECT  r.id, r.name, MAX(i.inspected_at)
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
GROUP BY r.id
;

The problem is that whenever you want to pick a row using one column but then select additional columns from that same row, GROUP BY lets you down. In the above query, you can’t add i.score to the SELECT list, because that’s an error. You have to do something fancier.

One way to get there is a lateral join, as we’ve seen already; I sketch a version of it again right after this list. Three other ways are:

  1. Use GROUP BY in a subquery to get the latest inspection date, then use that date to get the rest of the inspection info.

  2. Use a window function.

  3. Use DISTINCT ON.
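
For reference, a lateral-join version of the query looks roughly like this (my reconstruction, not copied from the earlier post, and without the join to violations):

SELECT  r.id, r.name, i.score, i.inspected_at
FROM    restaurants r
LEFT OUTER JOIN LATERAL (
        SELECT  *
        FROM    inspections
        WHERE   restaurant_id = r.id
        ORDER BY inspected_at DESC
        LIMIT 1) i
ON      true
;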

A couple things to keep in mind:

  • We want to show all the restaurants, even if they have no inspections yet. That’s why I didn’t write the above query FROM inspections GROUP BY restaurant_id, and that’s why all our queries below are FROM restaurants with an outer join to something else.

  • We also want to include a count of the violations from whatever inspection we choose. I’ve left that out of all my queries because it mostly just adds noise, but remember that whatever solution we consider must be compatible with that requirement. It must be possible to add one more outer join to violations and get the count. One approach this rules out is putting a whole subquery into the SELECT list, because that won’t be available for joining to violations.

Using GROUP BY in a subquery

For the first alternative, we might write the query like this:

SELECT  r.id, r.name, i.score, i.inspected_at
FROM    restaurants r
LEFT OUTER JOIN (
        SELECT  restaurant_id, MAX(inspected_at) AS max_inspection_date
        FROM    inspections
        GROUP BY restaurant_id) i2
ON      i2.restaurant_id = r.id
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
AND     i.inspected_at = i2.max_inspection_date
;

Here we’re using GROUP BY to get the most recent inspection of each restaurant, and then we add a separate join based on that inspection date to get the rest of the inspection info.

I think this is borderline incorrect. There’s no guarantee that inspected_at is unique. If we have millisecond precision it might be, but what if it’s just a date? Probably no restaurant gets inspected twice in the same day, but it’d be nice if we didn’t have to rely on that assumption. Also pedagogically it’s a bad approach because who knows if your application will have the same guarantees? I’d rather find a pattern that is valid always, not just for this restaurants example.

Also it’s hard to believe this approach is very fast. Indeed, here is the query plan:

Hash Right Join  (cost=1162.54..2645.04 rows=1000 width=20)
  Hash Cond: ((i.restaurant_id = r.id) AND (i.inspected_at = (max(inspections.inspected_at))))
  ->  Seq Scan on inspections i  (cost=0.00..1256.00 rows=30000 width=16)
  ->  Hash  (cost=1147.54..1147.54 rows=1000 width=16)
        ->  Hash Right Join  (cost=45.79..1147.54 rows=1000 width=16)
              Hash Cond: (inspections.restaurant_id = r.id)
              ->  GroupAggregate  (cost=0.29..1078.29 rows=1000 width=12)
                    ->  Index Only Scan using idx_inspections_rest_and_date on inspections  (cost=0.29..918.29 rows=30000 width=12)
              ->  Hash  (cost=33.00..33.00 rows=1000 width=8)
                    ->  Seq Scan on restaurants r  (cost=0.00..33.00 rows=1000 width=8)

On my machine I get query times around 20ms, so it’s more than twice as slow as the lateral join. We already have an index on inspections for restaurant_id + inspected_at, so I don’t think there’s an easy way to speed things up here.
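
For reference, that index (its name shows up in the plans) is presumably defined something like this; the exact definition is my assumption:

CREATE INDEX idx_inspections_rest_and_date
  ON inspections (restaurant_id, inspected_at);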

Using a window function

One great feature in Postgres is window functions. This lets you compute aggregates (and more!) without turning your whole query into a GROUP BY. I’ve written about window functions a bit here, but it’s a big topic, so if they’re new to you, the Postgres docs are the place to start. With window functions, we can get what we want like this:

SELECT  r.id, r.name, i.score, i.inspected_at
FROM    restaurants r
LEFT OUTER JOIN (
        SELECT  *,
                row_number() OVER (PARTITION BY restaurant_id ORDER BY inspected_at DESC) AS row_number
        FROM    inspections) i
ON      i.restaurant_id = r.id
AND     i.row_number = 1
;

This approach is more correct than the previous one. There are no risky assumptions about unique inspection dates. But it performs a little worse. Here is the query plan:

Merge Right Join  (cost=0.56..4096.38 rows=1000 width=20)
  Merge Cond: (i.restaurant_id = r.id)
  ->  Subquery Scan on i  (cost=0.29..4030.72 rows=150 width=16)
        Filter: (i.row_number = 1)
        ->  WindowAgg  (cost=0.29..3655.72 rows=30000 width=20)
              ->  Index Scan using idx_inspections_rest_and_date on inspections  (cost=0.29..3130.72 rows=30000 width=20)
  ->  Index Scan using restaurants_pkey on restaurants r  (cost=0.28..61.27 rows=1000 width=8)

Results for me take about 25ms. I’ve gotta say I love window functions, but it looks like in this case at least lateral join is a better choice.

Using DISTINCT ON

There is a special feature in Postgres called DISTINCT ON, which is another way to get grouping-like behavior without a GROUP BY clause. If you’ve heard of DISTINCT but not DISTINCT ON, you should check it out. Basically you name one or more columns whose combination should be unique in the result, and then you give an ORDER BY that includes those columns plus (usually) something else, and for each distinct combo, Postgres will include only the first row. That lets us write our query like so:

SELECT  r.id, r.name, i.score, i.inspected_at
FROM    restaurants r
LEFT OUTER JOIN (
        SELECT  DISTINCT ON (restaurant_id)
                *
        FROM    inspections i2
        ORDER BY restaurant_id, inspected_at DESC) i
ON      i.restaurant_id = r.id
;

Here is the query plan I get with that query:

Merge Right Join  (cost=0.56..3292.00 rows=1000 width=20)
  Merge Cond: (i2.restaurant_id = r.id)
  ->  Unique  (cost=0.29..3205.72 rows=1000 width=20)
        ->  Index Scan using idx_inspections_rest_and_date on inspections i2  (cost=0.29..3130.72 rows=30000 width=20)
  ->  Index Scan using restaurants_pkey on restaurants r  (cost=0.28..61.27 rows=1000 width=8)

And results take about 13ms. That is pretty nice! Not quite as fast as the lateral join, but maybe close enough. It is arguably easier to read too. DISTINCT ON is really a nice feature that is insufficiently known. As far as I know it is a Postgres extension, so you won’t find it in MS SQL Server, Oracle, or MySQL—which may or may not matter to you. Personally I’m fine committing to one database vendor and leveraging it to the hilt, especially when that vendor is free, open source, and as mature as Postgres.

So anyway, I think this is an interesting lineup of alternatives to a lateral join, at least for my specific example. It’s fascinating to me that lateral join wound up the fastest of all, as that wasn’t why I originally selected it. But if you’re worried about performance, I wouldn’t depend too much on this blog post. I’d test the alternatives against your own real situation, because you might get different results. To me the worthwhile lesson is that these are all options (except maybe GROUP BY).

What do you think? Could I have written any of these queries in a more performant way? Is there yet another approach that would work?

Btw, if you want a real hard-core example of a lateral join that is hard to write any other way, here is an example used in funnel analysis.

Too Many Outer Joins

2015-02-16

My last post about lateral joins reminded me of a rule I have when using LEFT OUTER JOIN. Recall that in an outer join, every row in the original table will yield one or more rows in the result, depending on how many rows match in the new table. So if you say this:

SELECT  *
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
;

then you’ll get one or more rows for each restaurant.

You could say this to show the average inspection score for each restaurant:

SELECT  r.name, AVG(i.score) avg_score
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
GROUP BY r.name
;

Now suppose you also had an employees table like this:

restaurants -< employees
-----------    ---------
id             id
name           restaurant_id
               name

You can count the employees in each restaurant like so:

SELECT  r.name, COUNT(e.id) employees
FROM    restaurants r
LEFT OUTER JOIN employees e
ON      e.restaurant_id = r.id
GROUP BY r.name
;

Easy! So can you combine those two queries? How about like this:

SELECT  r.name, AVG(i.score) avg_score, COUNT(e.id) AS employees
FROM    restaurants r
LEFT OUTER JOIN inspections i
ON      i.restaurant_id = r.id
LEFT OUTER JOIN employees e
ON      e.restaurant_id = r.id
GROUP BY r.name
;

Unfortunately that is totally wrong! Whenever you have one central table, and you want to outer join it to two separate tables, you are in danger. To see why, think through what happens step-by-step as these tables get joined. All the joins happen before any grouping, so first we make a bunch of rows combining restaurants and inspections. Now for each of those rows, we join to the employees table. That means if we have a restaurant with two inspections and three employees, we’ll create six rows for it: each employee appears once per inspection, so COUNT(e.id) reports 6 rather than 3. So when we handle the grouping, our aggregate functions will compute wrong results. For the COUNT, we can solve the problem with COUNT(DISTINCT e.id), but it’s not so easy to solve for the AVG.

My rule is that a query can only have one outer join that produces more than one match. If two outer joins both produce multiple matches, then you’ll have trouble.

The only way I know to generate reports like this is to run a subquery for each outer-joined table, so that you can ensure one row per restaurant in the subquery results:

SELECT  r.name, x.avg_score, y.employees
FROM    restaurants r
LEFT OUTER JOIN (SELECT restaurant_id, AVG(score) avg_score
                 FROM   inspections
                 GROUP BY restaurant_id) x
ON      r.id = x.restaurant_id
LEFT OUTER JOIN (SELECT restaurant_id, COUNT(id) employees
                 FROM   employees
                 GROUP BY restaurant_id) y
ON      r.id = y.restaurant_id
;

That approach ensures that your aggregate functions will compute correct results, unpolluted by duplicate rows caused by outer joins to the sibling table(s).
