Chapter 10 Statistical functions
Usually, statistics is best suited to do in R. But if your data is large enough and your needs aren't too advanced, Oracle packs an impressive array of statistical functions and can often execute them blazingly fast.
10.1 Covariance: Revisitng quantity vs unitprice
In ??, we looked at the average quantity sold of the cheapest products, compared to the average quantity sold of the most expensive product in each country - the theory being that cheap products sell in higher quantities. Not the most provocative theory, but good enough for a teaching exercise.
This question is probably better answered with some more advanced (and standard) statistical functions. We can start by taking the simple covariance between unitprice and quantity.
SELECT country, COVAR_POP(unitprice, quantity) as price_quantity_covar FROM trx
GROUP BY country
ORDER BY COUNT(1) DESC
FETCH FIRST 10 ROWS ONLY
COUNTRY | PRICE_QUANTITY_COVAR |
---|---|
United Kingdom | -22 |
Germany | -26 |
France | -42 |
EIRE | -71 |
Spain | -31 |
Netherlands | -111 |
Belgium | -21 |
Switzerland | -23 |
Portugal | -72 |
Australia | -105 |
To no surprise, we find a negative correlation between unit price and quantity sold in every country we cared to list.
There are two covariance functions: COVAR_POP
(which we used here), and COVAR_SAMP
, which are asymptotically equal with high sample size.
Covariance in small samples need to adjust for the one degree of freedom used, by removing 1 from \(n\):
\[\frac{\sum(expr1 * expr2) - \sum expr1 * \sum expr2}{n * (n-1)}\]
For COVAR_POP
, the denominator is \(n^2\) instead of \(n*(n-1)\).
10.2 Regression
Hopefully, the fact that Oracle databases does regression is impressive enough in itself, because the functionality is really limited to superficially simple cases. But if you need to do a regression with a single independent variable, Oracle is there for you. But seriously, do this in R instead if possible.
SELECT REGR_INTERCEPT(quantity, unitprice) AS beta_0,
REGR_SLOPE(quantity, unitprice) AS beta_1
FROM trx
BETA_0 | BETA_1 |
---|---|
9.6 | 0 |
The regression consists of running two different functions, one to calculate the intercept, and one to calculate the regression coefficient (and since we are limited to a single independent variable, there is only one slope to calculate). Still this syntax requires us to specify the model twice, and happily spits out a result even if we are unfortunate enough to, say, switch the arguments in one of the functions.
The only redeeming feature of regression in Oracle is that it plays nicely with the GROUP BY
statement, so that we with hardly any effort can create separate regression models for each country.
SELECT country, COUNT(1) AS goods_sold, REGR_INTERCEPT(quantity, unitprice) AS beta_0,
REGR_SLOPE(quantity, unitprice) AS beta_1
FROM trx
GROUP BY country
COUNTRY | GOODS_SOLD | BETA_0 | BETA_1 |
---|---|---|---|
Poland | 341 | 13.5 | -0.68 |
Denmark | 389 | 25.3 | -1.30 |
Brazil | 32 | 19.1 | -1.78 |
United Kingdom | 495478 | 8.6 | 0.00 |
Channel Islands | 758 | 13.2 | -0.15 |
United Arab Emirates | 68 | 16.9 | -0.74 |
Norway | 1086 | 18.2 | -0.08 |
Iceland | 182 | 19.8 | -2.38 |
Austria | 401 | 14.1 | -0.48 |
Saudi Arabia | 10 | 14.4 | -2.88 |
10.3 T-tests
A lot of statistical tests assume that the data is normally distributed, which is far from the case in most real-world data. This example is meant to demonstrate an SQL funtion, and is unfortunately not sound statistical advice.
Let's take a closer look at Germany and France in the sales data. These two countries have a similar number of sales (7-8000 each), and we might want to explore wether the customers buy similarly-priced items.
First, a simple summary table to give a clue about the sales figures for the countries:
SELECT COUNTRY, COUNT(1), median(unitprice), avg(unitprice), STDDEV(unitprice), MAX(unitprice) FROM trx
WHERE country IN('Germany','France', 'EIRE')
GROUP BY country
COUNTRY | COUNT(1) | MEDIAN(UNITPRICE) | AVG(UNITPRICE) | STDDEV(UNITPRICE) | MAX(UNITPRICE) |
---|---|---|---|---|---|
EIRE | 8196 | 2.1 | 5.9 | 54 | 1917 |
France | 8557 | 1.8 | 5.0 | 80 | 4161 |
Germany | 9495 | 1.9 | 4.0 | 17 | 600 |
Now, for the t-test comparing France and Germany. Note that only two countries can be compared at the same time.
SELECT STATS_T_TEST_INDEP(COUNTRY, unitprice, 'TWO_SIDED_SIG' ) AS two_sided_p_value
FROM trx
WHERE country IN('Germany','France')
TWO_SIDED_P_VALUE |
---|
0.21 |
The two-sided T-test for independence cannot lead us reject our null-hypothesis that the average unitprice for products sold in Germany is the same as the average unitprice for products sold in France. Looking at the summary statistics this conclusion is a little strange, given that there is a large number of observations and the average unitprice in Germany and France are 4 and 5 (Euros?) respectively. But the standard deviation for france is very large - 80.
Just for fun, let's take a look at the difference between Germany and Ireland as well.
SELECT STATS_T_TEST_INDEP(COUNTRY, unitprice, 'TWO_SIDED_SIG' ) AS two_sided_p_value
FROM trx
WHERE country IN('Germany','EIRE')
TWO_SIDED_P_VALUE |
---|
0 |
This time, we can safely reject the null-hypothesis that the average unit prices are equal.
10.4 Kolmogorov-Smirnoff tests
As we noted at the top of the previous chapter, the t-test is really not the right test for this type of data, because the distribution is anything but normal. In these cases, we either have to find what distribution fits the data, or we can opt for a test that does not require a given distribution. Kolmogorov-Smirnoff (KS for short) is that type of test.
The syntax for KS tests are beautifully similar to the t-test. We also print the KS-statistic. The third argument defaults to 'SIG' (significance), but we included it for clarity.
SELECT
STATS_KS_TEST(COUNTRY, unitprice, 'STATISTIC') AS KS_value,
STATS_KS_TEST(COUNTRY, unitprice, 'SIG') AS significance
FROM trx
WHERE country IN('Germany','France')
KS_VALUE | SIGNIFICANCE |
---|---|
0.03 | 0 |
The KS-test does not test for equality of the mean, but wether the two distributions a whole are sufficiently similar to believe that they were drawn from the same population. Intuitively, the function looks at the area with the biggest difference between the two samples and calculates wether that could be due to random choice.
As we see, the KS-test leaves little doubt that there is a difference in the price of the goods bought in France and Germany, while we could not make that conclusion using the T-statistic. The difference is probably due to the long tail of high-price items. France had some much more expensive sales than Germany, which the T-test couldn't properly account for.
10.5 And more...
These are two quite random examples of statistical tests in Oracle, but the list includes F-test (STATS_F_TEST
), Mann-Whitney tests (STATS_MW_TEST
), Wilcox (STATS_WSR_TEST
) and ANOVA (STATS_ONE_WAY_ANOVA
).