Chapter 6 Basic SQL

The simplest queries are only a line or two of code, containing two or three keywords. All queries that read data from a table follows this structure:

  • First, the keyword SELECT, followed by the variables (consistently referred to as columns in database jargon) you wish to select.

  • Second, the keyword FROM followed by the name of the table or the view you want to select from. A database table is the analogous to a dataframe in R. Details differ, but it is a rectangular sheet of data consisting of rows and columns/variables. Unlike in normal R dataframes there is no inherent row index in SQL tables, but most database tables have an index created explicitly in order to ensure data integrety.

  • You might want to subset the rows, not just the columns. This is done with the WHERE keyword, followed by whatever conditions you like. Unlike R, SQL uses single equal-signs to check for equality.

SQL is case-insensitive, but a certain coding standard has evolved.

6.1 SELECT FROM WHERE? Starting our exploration

We will be working with the movie table that comes with the BigData Lite VM. This table contains the movie title, year, budget, revenue and plot summary from some fairly recent movies. Usually you would probably do the initial exploration in a standalone SQL client, but if you want to explore the data with R, perhaps to give some basic description to your readers, it can easily be done.

6.1.1 A quick glance at our data

No matter how advanced, most data analysis starts with a simple look at the data just to get a feel for some real values. Let's select the first 10 rows from the table.

SELECT * FROM trx
FETCH FIRST 10 ROWS ONLY
Table 6.1: Displaying records 1 - 10
INVOICENO STOCKCODE DESCRIPTION QUANTITY INVOICEDATE UNITPRICE CUSTOMERID COUNTRY
536384 22189 CREAM HEART CARD HOLDER 4 2010-12-01 09:53:00 4.0 18074 United Kingdom
536384 22427 ENAMEL FLOWER JUG CREAM 3 2010-12-01 09:53:00 6.0 18074 United Kingdom
536384 22428 ENAMEL FIRE BUCKET CREAM 6 2010-12-01 09:53:00 7.0 18074 United Kingdom
536384 22424 ENAMEL BREAD BIN CREAM 8 2010-12-01 09:53:00 10.9 18074 United Kingdom
536385 22783 SET 3 WICKER OVAL BASKETS W LIDS 1 2010-12-01 09:56:00 19.9 17420 United Kingdom
536385 22961 JAM MAKING SET PRINTED 12 2010-12-01 09:56:00 1.4 17420 United Kingdom
536385 22960 JAM MAKING SET WITH JARS 6 2010-12-01 09:56:00 4.2 17420 United Kingdom
536385 22663 JUMBO BAG DOLLY GIRL DESIGN 10 2010-12-01 09:56:00 1.9 17420 United Kingdom
536385 85049A TRADITIONAL CHRISTMAS RIBBONS 12 2010-12-01 09:56:00 1.2 17420 United Kingdom
536385 22168 ORGANISER WOOD ANTIQUE WHITE 2 2010-12-01 09:56:00 8.5 17420 United Kingdom

InvoiceNo, StockCode and CustomerID seems to be random ID numbers, which will be useful later on. First, let's note that the description is a freetext column, there is no column for total salesprice, but this is easily calculated from UnitPrice and Quantity.

Note that FETCH FIRST 10 ROWS ONLY is Oracle-specific, other databases have different syntaxes, e.g. LIMIT 10 in MySQL.

Since we have customer IDs, we might be able to track a single customer over several purchases. To check this, we can select only rows from a single customer.

SELECT * FROM trx
WHERE CUSTOMERID='13058'
Table 6.2: 8 records
INVOICENO STOCKCODE DESCRIPTION QUANTITY INVOICEDATE UNITPRICE CUSTOMERID COUNTRY
559636 47590B PINK HAPPY BIRTHDAY BUNTING 3 2011-07-11 12:01:00 5.5 13058 United Kingdom
559636 47590A BLUE HAPPY BIRTHDAY BUNTING 3 2011-07-11 12:01:00 5.5 13058 United Kingdom
559636 23298 SPOTTY BUNTING 3 2011-07-11 12:01:00 5.0 13058 United Kingdom
570832 47590A BLUE HAPPY BIRTHDAY BUNTING 6 2011-10-12 13:28:00 5.5 13058 United Kingdom
570832 47590B PINK HAPPY BIRTHDAY BUNTING 6 2011-10-12 13:28:00 5.5 13058 United Kingdom
570832 23313 VINTAGE CHRISTMAS BUNTING 5 2011-10-12 13:28:00 5.0 13058 United Kingdom
576539 23313 VINTAGE CHRISTMAS BUNTING 25 2011-11-15 12:43:00 5.0 13058 United Kingdom
576539 22776 SWEETHEART 3 TIER CAKE STAND 1 2011-11-15 12:43:00 9.9 13058 United Kingdom

Indeed, we see this customer returning on three different dates between July and November 2011.

You might have noticed that we have no information about the denomination for either the quantity or price, which is actually surprisingly common in real life. We suspect that the dataset comes from the US, which would make dollar the right currency. Amount-wise this makes sense, but it might be british pounds, or a combination. All the quantities are whole numbers, and the product names suggests that these are separate items, as opposed to kilos, meters, liters or some other more esoteric continuous measurement.

6.1.2 Aggregating your results

First off, let's check the total sales at this shop.

SELECT SUM(unitprice*quantity) AS totalSales FROM trx
Table 6.3: 1 records
TOTALSALES
9747748

The shop has sold for almost $10 millon, which is a good amount. But this triggers even more questions. How many units have it sold, and over which time period is this? This is where the date format pays off. We can simply subtract the maximum date from the minimum date in the table, and Oracle returns the number of days inbetween. Note that since we are dealing with timestamps, we are returned a decimal number which we can just round off.

SELECT 
SUM(quantity) AS units_sold,
SUM(unitprice*quantity) AS totalSales, 
ROUND(MAX(INVOICEDATE)-MIN(INVOICEDATE)) AS timeperiod,
ROUND(SUM(unitprice*quantity)/SUM(quantity), 2) AS avg_price
FROM trx
Table 6.4: 1 records
UNITS_SOLD TOTALSALES TIMEPERIOD AVG_PRICE
5176450 9747748 373 1.9

A lot is happening here, so we will go through each column individually:

  • SUM(quantity) AS units_sold returns the sum of the quantity column, and uses the AS keyword to give the resulting column a new name units_sold.
  • SUM(unitprice*quantity) AS totalSales multiplies the columns unitprice and quantity, and sums the product. The result is given the new column name totalSales.

  • ROUND(MAX(INVOICEDATE)-MIN(INVOICEDATE)) AS timeperiod subtracts the latest (e.g. maximum) invoice date from the first (e.g. minimum) invoice date. The result is the number of days between the two dates, but since these columns are actually timestamps we are presented with a long decimal number which we round off using the ROUND() function.

  • ROUND(SUM(unitprice*quantity)/SUM(quantity), 2) AS avg_price divides the total sales by the total units sold, which gives us the average unit price. To avoid overly long decimal numbers, we round it off to two decimal points.

Lastly, let's just check the total number of rows in the table, using the COUNT(*) function.

SELECT COUNT(1) AS number_of_rows FROM trx
Table 6.5: 1 records
NUMBER_OF_ROWS
541909

The number of rows is fairly unimportant from a business perspective, as it neither reflects the number of units sold or the number of transactions.

The table also contains a Country column, which we have seen contains United Kingdom and we suspect contains the United States. Let's check how many countries there are, and the sales in each. In fact, we can reuse and extend the large query above to get statistics per country.

SELECT country,
SUM(quantity) AS units_sold,
SUM(unitprice*quantity) AS totalSales, 
ROUND(MAX(INVOICEDATE)-MIN(INVOICEDATE)) AS timeperiod,
ROUND(SUM(unitprice*quantity)/SUM(quantity), 2) AS avg_price
FROM trx
GROUP BY country
ORDER BY 2 DESC
Table 6.6: Displaying records 1 - 10
COUNTRY UNITS_SOLD TOTALSALES TIMEPERIOD AVG_PRICE
United Kingdom 4263829 8187806 373 1.9
Netherlands 200128 284662 372 1.4
EIRE 142637 263277 372 1.9
Germany 117448 221698 373 1.9
France 110480 197404 373 1.8
Australia 83653 137077 358 1.6
Sweden 35637 36596 356 1.0
Switzerland 30325 56385 361 1.9
Spain 26824 54775 369 2.0
Japan 25218 35341 366 1.4

Note the final line, ORDER BY 2 DESC. This sorts the output data by the second column, in descending order. This way, the countries with the largest sales appear first. It is possible to order by multiple columns (we could say ORDER BY 2 DESC, country) to make sure any ties are sorted alphabetically by country. Passing column numbers instead of column nicknames is a nice shortcut, but we could just as well have written ORDER BY units_sold DESC. The ORDER BY clause is one of the few commands in Oracle that accepts the nicknames we create in the SELECT statement.

This is indeed a nice overview, and properly sorted. Before we continue though, let's improve our understanding further with some visualizations.

In order to visualize, we have to first load the data into an R data frame. We will show how to pass a query into a data frame using the ROracle library. This process consists of passing a query (and a database connection) into the function dbSendQuery. This, however, does not return the data. In order to return the data set, you have to run fetch on the result of dbSendQuery. It is this resulting data frame that we will be plotting, using ggplot.

dfq <- dbSendQuery(con, "
  SELECT country, SUM(quantity*unitprice) AS sales_total FROM trx
  WHERE country IN(
                     SELECT country FROM trx
                     GROUP BY country
                     ORDER BY COUNT(*) desc
                     FETCH FIRST 4 ROWS ONLY
  )
  GROUP BY COUNTRY, INVOICENO, CUSTOMERID
")
dfd2 <- fetch(dfq)

library(ggplot2)
library(dplyr)
dfd2 %>% ggplot(aes(COUNTRY, SALES_TOTAL)) +
  geom_jitter(alpha=0.08, stroke=0) + 
  scale_y_log10()
## Warning in scale$trans$trans(x): NaNs produced
## Warning: Removed 3661 rows containing missing values (geom_point).

This figure plots the different reciept totals, for the four biggest markets of the store. We can see that we are dealing with totals centered in the 100 - 10 000 dollar range (1e+02 to 1e+04), suggesting that this store is probably into wholesale.

The plot (or rather the warning messages) shows us something else as well: A good portion of the values are removed because the log value was a NaN, meaning Not a Number. What this means in practice, is that some of the reciept totals must be negative. This is quite unexpected, and warrants investigation.

As for the code, there are some points to be made. Firstly, the SQL statement is sent to the server with the dbQuery function, which takes as arguments our Oracle connection and the SQL statement we want to excute. The output from this function (stored in the dfq variable) is simply some summary information about the query, not the actual result returned from the query. The result is returned from the next statement, the fetch(dfq) function. The data frame dfd2 is now a standard R dataframe containing the result of the query as we would expect. Note that all column names returned from Oracle are uppercase, regardless of how you capitalized them in the SQL statement.

Secondly, the dataframe that is returned can be fed directly into the ggplot function from the excellent ggplot2 library. Books have been written about ggplot, and we are not going to go into the library in details. But the code basically does the following: 1. It passes the two variables we want in our plot into the function aes which stands for aesthetic. 2. It then adds an element to the visualization, geom_jitter, that, based on the two variables we specified in the line above, creates a scatterplot of country and sales. Because country is a category variable, we don't want all our points to fall on a single line. That is why we use the geom_jitter function instead of the geom_point function. We are adding some random offset for each point on the X-axis, resulting in wide strokes that gives a better visual indication of the distribution. 3. Lastly, because the values varies so widely on the Y-axis, we add a logarithmic Y-axis. By default a linear Y-axis would have been added (without us having to specify anything), but adding our own axis overwrites the default.

ggplot2 is a great library for visualizations, and if you want to learn more, one starting point is the documentation (littered with examples) at ggplot2.tidyverse.org.

Another thing that happened in the code above, was a subquery. Instead of specifying that the country name should be EIRE or France or Germany or United Kingdom, we use an in() function, which usually takes a comma-separated list of accepted values, and replace the list with an entire SQL query that returns the country names we want. We can even run the subquery by itself, to see that it indeed is entirely valid and returns the name of the 4 largest countries in our dataset.

SELECT country FROM trx
GROUP BY country
ORDER BY COUNT(*) desc
FETCH FIRST 4 ROWS ONLY
Table 6.7: 4 records
COUNTRY
United Kingdom
Germany
France
EIRE

Combining queries like this can be powerful, and the query and subquery could be from different tables - as long as the countries are spelled the same, our database won't care.

Other ways of combining data will be covered when we start grappling with some of the sample data that comes with Oracle.

6.2 More ways to JOIN tables

Subqueries are great, but using joins to combine the content of multiple tables is just as frequent a task. In order to demonstrate a good use case for joins, we will take a quick detour from our transaction data and visit some of the sample data that came with the BigData Lite VM. Using the moviedemo user, you will find three tables named movie,movie_cast and cast. These tables are laid out in a way quite typical for relational databases. The movie table contains basic information about movies, such as their name, budget, and year produced as well as a movie ID to uniquely identify the movie. The cast table contains the name and a unique ID of every actor in these movies. Because one movie uses many actors, and one actor can star in several movies, the movie-cast table functions as a lookup table listing which actor played in which movie.

Using joins, we can select movie titles (from the movie table) and cast names (from the cast table), using the movie_cast table:

SELECT title, year, name FROM movie m
INNER JOIN movie_cast mc ON m.movie_id=mc.movie_id
INNER JOIN cast c ON c.cast_id=mc.cast_id
WHERE year BETWEEN 1900 AND 1910
ORDER BY title
Table 6.8: Displaying records 1 - 10
TITLE YEAR NAME
A Corner in Wheat 1909 James Kirkwood
A Corner in Wheat 1909 Linda Arvidson
A Corner in Wheat 1909 Grace Henderson
A Corner in Wheat 1909 H. B. Walthall
A Corner in Wheat 1909 Frank Powell
A Corner in Wheat 1909 W. C. Miller
A Corner in Wheat 1909 Gladys Egan
A Corner in Wheat 1909 Blanche Sweet
Le voyage dans la lune 1902 Jeanne d'Alcy
Le voyage dans la lune 1902 Georges Melies

We are not displaying any information from the movie_cast table, insted we are simply using this table to join the movie table with the cast table.

A word of caution about joins: If the columns you are joining on not uniquely identifies rows in two or more of the tables, you end up with what is known as many-to-many joins. SQL handles these situations by creating a carthesian join, meaning it creates all possible combinations of the rows. If you happen to have two rows with identical ID in both tables, the result will contain 2 x 2 = 4 rows with that ID. Although this behavior can be useful at times, it is normally not what you want to see happen.

6.3 A few common gotchas

Before diving deep into data exploration, there are a couple of novelties in how SQL works that is worth mentioning. These are not unique to Oracle or SQL, but depending on the language you come from these behaviours might trip you. So take some time now to make sure you are aware of them.

6.3.1 Missing values

You might be familiar with how R handles missing values, SQL has a similar but somewhat different approach, but both R and SQL are based on set theory.

Missing values in SQL are referred to as NULLs. NULL values are different from all other values, and neither smaller, larger nor equal any other value. NULLs are not even equal NULL. If you want to select rows where a given variable is NULL, you have to use the special condition IS NULL. For example, we can count the number of customers we don't have an address of:

SELECT COUNT(1) AS missing_addresses FROM customer
WHERE street_address IS NULL
Table 6.9: 1 records
MISSING_ADDRESSES
2191

Out of our almost 5 000 customers, we don't have an address on almost 3 000.

In aggregate functions though, NULLs are ignored by default and do not usually require special treatment. We can show this by selecting the number of street addresses in the customer table, which will ignore the missing addresses:

SELECT COUNT(street_address) AS num_addr FROM customer
Table 6.10: 1 records
NUM_ADDR
2657

This will be identical to the number of rows where the address is not missing:

SELECT COUNT(1) AS num_addr FROM customer
WHERE street_address IS NOT NULL
Table 6.11: 1 records
NUM_ADDR
2657

The same behavior will be true for other aggregate functions, such as AVG().

6.3.2 Execution order

When you program in R or most other languages, you expect the machine to throw an error and exit after the first error in the code. After fixing this one, subsequent errors may manifest as the executor reaches them.

SQL is not most other languages. The execution order of the script is not linear from top to bottom, and as a result the order in which errors are thrown might seem arbitrary. The appendix gives a little more detail on the execution order, but for now just know that errors are not necessarily given to you in a left-to-right fashion.

6.3.3 ANSI and the dialects

For better and worse, SQL is not one language. Much of the basic syntax and behavior is specified by ANSI (American National Standards Institute), and is frequently referred to as ANSI SQL. But any given database vendor can choose to leave out certain features, and also to implement additional features not in ANSI.

In the later years, Oracle has stived to implement all of ANSI. But Oracle also implements a host of other features that are not in ANSI. Some of these features are extremely powerful, and we will spend time discussing many of them.