Thursday, October 21, 2010

Thinking about self-join? Well, think again!

This week, the greenplum people came on-site and talked about a few tricks that I am very new to. The ones I have most feeling about are

(1) about function overloading

So PostgreSQL allows you to have functions with exact names but different inputs or outputs, even different number of inputs. This is called function overloading. It's a feature in various programming languages like C, Java etc. And since Greenplum is written in C, it has this feature by birth. And this is exactly WHY you have to specify the inputs when deleting a function in greenplum!

I know, this could be very annoying! So they should really have a switch, so you can choose whatever options you want, either follow the overloading rule or not. So if you refuse to follow overloading and you try to name a new function the same as an existing one, greenplum will throw an error on you. C language has the overloading thing. That's fine. But greenplum could have given user options.

(2) about "STRICT"

When writing user defined functions, by adding "STRICT" option at the end, you are telling PostgreSQL to return a null automatically when someone inputs a null to the function. The function will not even try to run with null inputs, it will return null immediately. This could be a nice option to speed things up.

(3) about windowing function

Windowing function is not new concept to me. But the instructor's comment about all self-join could be replaced by windowing function is quite new. It's kind of surprising to me. Of course, I believe it's not hard, just takes time to get used to, particularly when one has been so familiar with table self-joins.

Windowing function allows user to access a set of rows associated with the current row. It can perform various calculations, like sum, rank, average, max/min etc. PostgreSql has a very good tutorial on windowing functions. There are two things I like particularly, running totals and naming windows.

So now the running total revenue calculation of campaigns is

SELECT campaign_id
, revenue
, SUM(revenue) OVER (w) as running_total
, SUM(revenue) OVER() AS overall_total
,(SUM(revenue) OVER (w)) / (SUM(revenue) OVER()) AS pct
FROM rev_table
WINDOW w AS (ORDER BY revenue DESC)

Very nice and clean, right?

Last but not least, first_value, last_value or nth_value could be extremely useful sometime, for example, I'd like to know that's the last campaign a user have clicked on in the past week.

SELECT enc_email
, FIRST_VALUE(campaign_id) OVER(PARTITION BY enc_email ORDER BY click_date DESC) AS last_campaign
FROM user_history
LIMIT 1;

The "limit 1" at the end is very important. Without it, the end result will has as many rows as the original table filling with exact same rows.