Chapter 9 Analytical functions
9.1 Over the partitions and far away
Even without aggregating, it is possible to create an average of a column. A new, computed column can be added to an existing selection by using the analytical average function. This is in fact the same function as the aggregate function we covered in last chapter, but with an added keyword to distinguish it from the aggregate function and add a host of new features.
Let's start with the simplest example possible, taking the average of stock prices.
SELECT symbol, tstamp, price, AVG(price) OVER() FROM ticker
WHERE symbol='ACME'
ORDER BY tstamp
SYMBOL | TSTAMP | PRICE | AVG(PRICE)OVER() |
---|---|---|---|
ACME | 2011-04-01 | 12 | 19 |
ACME | 2011-04-02 | 17 | 19 |
ACME | 2011-04-03 | 19 | 19 |
ACME | 2011-04-04 | 21 | 19 |
ACME | 2011-04-05 | 25 | 19 |
ACME | 2011-04-06 | 12 | 19 |
ACME | 2011-04-07 | 15 | 19 |
ACME | 2011-04-08 | 20 | 19 |
ACME | 2011-04-09 | 24 | 19 |
ACME | 2011-04-10 | 25 | 19 |
The average stock price for the ACME corporation between April 1st and April 20th is added as a new column, and repeated on all rows. The OVER()
statement is new, and used exclusively on these types of analytical functions. An empty OVER()
statement means that we are taking the average over all the rows.
But what if we don't want to restrict ourselves to ACME? What if we want to do this for all stocks in the table? We make use of the previously empty OVER()
statement. Inside the paranthesis, you can add a PARTITION
statement that takes the average over distinct values of some other column - just like the aggregate functions do with GROUP BY
. Let's take a look, but this time we limit the query a little, to avid getting the whole 20 days for each stock.
SELECT symbol, tstamp, price, AVG(price) OVER(PARTITION BY symbol) FROM ticker
WHERE tstamp<=TO_DATE('05042011', 'DDMMYYYY')
ORDER BY symbol, tstamp
SYMBOL | TSTAMP | PRICE | AVG(PRICE)OVER(PARTITIONBYSYMBOL) |
---|---|---|---|
ACME | 2011-04-01 | 12 | 19 |
ACME | 2011-04-02 | 17 | 19 |
ACME | 2011-04-03 | 19 | 19 |
ACME | 2011-04-04 | 21 | 19 |
ACME | 2011-04-05 | 25 | 19 |
GLOBEX | 2011-04-01 | 3 | 6 |
GLOBEX | 2011-04-02 | 7 | 6 |
GLOBEX | 2011-04-03 | 8 | 6 |
GLOBEX | 2011-04-04 | 4 | 6 |
GLOBEX | 2011-04-05 | 8 | 6 |
As the name suggests, partitioning the data by symbol lets us create separate average for each stock.
Better yet, we can compute a moving average of the price by adding just a few more statements.
SELECT symbol, tstamp, price,
AVG(price) OVER(PARTITION BY symbol ORDER BY tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS moving_avg
FROM ticker
WHERE tstamp<=TO_DATE('05042011', 'DDMMYYYY')
ORDER BY symbol, tstamp
SYMBOL | TSTAMP | PRICE | MOVING_AVG |
---|---|---|---|
ACME | 2011-04-01 | 12 | 12.0 |
ACME | 2011-04-02 | 17 | 14.5 |
ACME | 2011-04-03 | 19 | 16.0 |
ACME | 2011-04-04 | 21 | 17.2 |
ACME | 2011-04-05 | 25 | 18.8 |
GLOBEX | 2011-04-01 | 3 | 3.0 |
GLOBEX | 2011-04-02 | 7 | 5.0 |
GLOBEX | 2011-04-03 | 8 | 6.0 |
GLOBEX | 2011-04-04 | 4 | 5.5 |
GLOBEX | 2011-04-05 | 8 | 6.0 |
Now we're talking. A moving (cumulative) average for each stock. These averages are a great way to smooth out curves that otherwise would be too noisy to make sense of. If an average of all preceding rows are a bit too much, limit the preceding rows in the same way you limit following rows. Here are two more examples, one limiting to 5 rows preceding, and another taking the average from two preceding and two following.
SELECT symbol, tstamp, price,
AVG(price) OVER(PARTITION BY symbol ORDER BY tstamp ROWS BETWEEN 5 PRECEDING AND CURRENT ROW ) AS moving_avg_1,
AVG(price) OVER(PARTITION BY symbol ORDER BY tstamp ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING ) AS moving_avg_2
FROM ticker
WHERE tstamp<=TO_DATE('05042011', 'DDMMYYYY')
ORDER BY symbol, tstamp
SYMBOL | TSTAMP | PRICE | MOVING_AVG_1 | MOVING_AVG_2 |
---|---|---|---|---|
ACME | 2011-04-01 | 12 | 12.0 | 16.0 |
ACME | 2011-04-02 | 17 | 14.5 | 17.2 |
ACME | 2011-04-03 | 19 | 16.0 | 18.8 |
ACME | 2011-04-04 | 21 | 17.2 | 20.5 |
ACME | 2011-04-05 | 25 | 18.8 | 21.7 |
GLOBEX | 2011-04-01 | 3 | 3.0 | 6.0 |
GLOBEX | 2011-04-02 | 7 | 5.0 | 5.5 |
GLOBEX | 2011-04-03 | 8 | 6.0 | 6.0 |
GLOBEX | 2011-04-04 | 4 | 5.5 | 6.8 |
GLOBEX | 2011-04-05 | 8 | 6.0 | 6.7 |
Finally, you don't have to specify the number of rows. Instead, you can specify the time interval given that you order by some variable of the date
-variety. Let's make a 3-day moving average. Even though this is no different from specifying 3 rows in this specific dataset, it comes in handy when you are dealing with missing values, weekends, and arbitrary number of rows per day.
SELECT symbol, tstamp, price,
AVG(price) OVER(PARTITION BY symbol ORDER BY tstamp RANGE BETWEEN INTERVAL '3' DAY PRECEDING AND CURRENT ROW ) AS moving_avg_1
FROM ticker
WHERE tstamp<=TO_DATE('05042011', 'DDMMYYYY')
ORDER BY symbol, tstamp
SYMBOL | TSTAMP | PRICE | MOVING_AVG_1 |
---|---|---|---|
ACME | 2011-04-01 | 12 | 12.0 |
ACME | 2011-04-02 | 17 | 14.5 |
ACME | 2011-04-03 | 19 | 16.0 |
ACME | 2011-04-04 | 21 | 17.2 |
ACME | 2011-04-05 | 25 | 20.5 |
GLOBEX | 2011-04-01 | 3 | 3.0 |
GLOBEX | 2011-04-02 | 7 | 5.0 |
GLOBEX | 2011-04-03 | 8 | 6.0 |
GLOBEX | 2011-04-04 | 4 | 5.5 |
GLOBEX | 2011-04-05 | 8 | 6.8 |
Instead of days, keywords like 'MONTH'
and 'YEAR'
can also be used.
Many of the usual aggregate functions can be used as analutical functions, but notably median and quantiles are missing. These must be computed in R instead.