Analytics FTW

Today I was presented with this query:

SELECT *
FROM sid_batch
WHERE(loc, from_sid, sid_bid, type, add_date, batch_no) IN
(SELECT loc,
from_sid,
sid_bid,
type,
add_date,
MAX(batch_no)
FROM sid_batch
WHERE(loc, from_sid, sid_bid, type, add_date) IN
(SELECT loc,
  from_sid,
  sid_bid,
  type,
  MAX(add_date)
FROM sid_batch
WHERE loc = '004'
AND from_sid = '60056'
AND status <> 'I'
AND type <> 'A'
GROUP BY loc,
  from_sid,
  sid_bid,
  type)
AND loc = '004'
AND from_sid = '60056'
AND status <> 'I'
AND type <> 'A'
GROUP BY loc,
from_sid,
sid_bid,
type,
add_date)
AND loc = '004'
AND from_sid = '60056'
AND status <> 'I'
AND type <> 'A'
ORDER BY loc,
from_sid DESC,
sid_bid DESC,
type DESC,
batch_no DESC;

Now I know almost nothing of analytics, but I did recognize that this sort of thing is exactly what they are for. In this case, the user wants to get the records that contain the highest batch_no for the highest add_date for a given loc, from_sid, sid_bid and type.

After a bit of research and chatting with hali, I trial-and-errored my way to this:

SELECT *
FROM
(SELECT sid_batch.*,
      rank() over(PARTITION BY loc,from_sid,sid_bid,type
        ORDER BY add_date DESC, batch_no DESC) rank
FROM sid_batch
WHERE loc = '004'
AND from_sid = '60056'
AND status <> 'I'
AND type <> 'A')
WHERE rank = 1
ORDER BY loc,
from_sid DESC,
sid_bid DESC,
type DESC,
batch_no DESC;

The first thing you should notice is that it is a LOT cleaner looking. Basically all the work is done in the one query, I just use the outer query to pick the #1 ranked rows and sort them.

The key is the rank() function. What rank() does is provide a numerical ranking based on a group of criteria (the PARTITION BY clause) and an order (the ORDER BY clause). In this case I say that for every set of loc, from_sid, sid_bid and type, rank the records of that set first by most recent add_date, and then by highest batch_no. This means the the highest batch_no for the highest add_date in that set will be ordered first, with a rank of 1. In my outer query I can then just choose the records with rank=1 and be on my way.

The query plan is MUCH cleaner as well. No messy hashes or nested loops, and, while the cost wasn't too bad at all (6), I cut it in half.