Chapter 11 Row pattern matching

So far, we have mostly been doing things that are equally possible in R, so the reason for doing them in SQL instead is usually performance.

Row pattern matching is something quite different, for which there currently is no comparable functionality in any R library. That is bound to change sooner or later, but for the time being, Oracle has a big selling point in being able to do a type of processing that is novel, useful, and difficult to implement from scratch.

Row pattern matching is, basically, a WHERE clause on steroids. Instead of selecting individual rows, you can select whole sets of rows, if they, as a group, satisfy the criteria you have written. The canonical example is stocks, where you may not be interested in selecting the days where the stock is above, say, 20, but you really want to find periods where the price was increasing. With row pattern matching you can do just that: define what an increase means, and then say you want all rows that are part of, for example, periods of five or more consecutive days of price increases.

In addition to the amped-up WHERE clause, pattern matching lets you calculate new columns, called MEASURES, from the patterns that you define - things like the difference between the start and end price in the aforementioned five-or-more days of stock price increases.

Also, pattern matching can do aggregations, leaving you with only one row summarizing each pattern found.
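To make both of these ideas concrete, here is a sketch of what that stock query could look like, using the ticker table we introduce in the next section. The names first_day, last_day and price_gain are our own. It returns one output row per run of five or more consecutive price increases, with the total gain over the run as a measure. We will unpack all of this syntax below.

SELECT symbol, first_day, last_day, price_gain
FROM ticker
MATCH_RECOGNIZE(
  PARTITION BY symbol
  ORDER BY tstamp
  -- measures: the first and last day of the run, and the total increase
  MEASURES FIRST(tstamp) AS first_day,
           LAST(tstamp) AS last_day,
           LAST(up.price) - FIRST(up.price) AS price_gain
  ONE ROW PER MATCH
  PATTERN ( up{5,} )
  DEFINE up AS PREV(price) < price
)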

11.1 A simple(ish) example

Although stocks make a good and relatable introduction to what pattern matching can do, it is probably more powerful in other areas, with more feature-rich data. Nevertheless, we will start with a simple stock-price example where we select periods of stock price increases, however long or short.


SELECT * FROM ticker
MATCH_RECOGNIZE(
  PARTITION BY symbol
  ORDER BY tstamp
  ALL ROWS PER MATCH
  PATTERN ( up+ )
  DEFINE up AS PREV(price) < price
)
Table 11.1: Displaying records 1 - 10
SYMBOL TSTAMP PRICE
ACME 2011-04-02 17
ACME 2011-04-03 19
ACME 2011-04-04 21
ACME 2011-04-05 25
ACME 2011-04-07 15
ACME 2011-04-08 20
ACME 2011-04-09 24
ACME 2011-04-10 25
ACME 2011-04-13 25
ACME 2011-04-17 14

Although the query was short, there is a lot of new syntax here, starting with the MATCH_RECOGNIZE command that encompasses the whole circus. Inside the MATCH_RECOGNIZE command, we start off by partitioning by symbol, which, just like with the OVER clause, says that we want to analyze patterns in different stocks in isolation. Secondly, we order by TSTAMP, the time column, which denotes the order in which things happened; the default ascending ordering usually agrees with how we want to analyze patterns in time.

Now, for the really new stuff, which is probably easier to understand if we start with the last line.

DEFINE is where we define the basic building blocks of what we want to match. We say that a row can be classified as "UP" if the stock price is higher that day than it was on the previous day. This is where the ORDER BY statement becomes important - without it, the previous row might not be the previous day, and our results wouldn't make much sense.

In the PATTERN statement, we use the definition(s) we created in the DEFINE statement to specify what combinations of rows we are looking for. This step uses a lightweight version of regex to let us specify the patterns, and the + sign means the same here as in regex: one or more occurrences.
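To illustrate, here is a variation we haven't run against our data, using two more regex-style quantifiers: {3,} means three or more, and ? means zero or one.

SELECT * FROM ticker
MATCH_RECOGNIZE(
  PARTITION BY symbol
  ORDER BY tstamp
  ALL ROWS PER MATCH
  -- three or more consecutive increases, optionally followed by a single drop
  PATTERN ( up{3,} down? )
  DEFINE
    up AS PREV(price) < price,
    down AS PREV(price) > price
)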

Finally, ALL ROWS PER MATCH is an explicit way of telling Oracle not to aggregate the results. In this query, we use match_recognize simply as a filter.

To summarize, we use the DEFINE statement to create definitions/labels/classifications of rows, with the help of functions such as PREV (previous row), NEXT, and some other methods that we will return to. We use these labels in the PATTERN statement, where we specify which combinations of these rows we are looking for, and finally we specify that we don't want to aggregate the results.
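As a taste of NEXT, here is a sketch of our own (not part of the analysis) that classifies a day as a local peak by looking both backwards and forwards:

SELECT * FROM ticker
MATCH_RECOGNIZE(
  PARTITION BY symbol
  ORDER BY tstamp
  ALL ROWS PER MATCH
  -- a single-row pattern: days with a higher price than both neighbours
  PATTERN ( peak )
  DEFINE peak AS PREV(price) < price AND NEXT(price) < price
)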

The output of our first query, however, isn't easy to make sense of. The first four rows show that the price increases from 17 to 25, but if we read the code carefully, we see that 17 must already be an increase from the day before - otherwise it wouldn't have satisfied the UP definition. But we don't see what it increased from.

Then, it drops from 25 to 15, and skips a day. Presumably, it found the pattern multiple times in the same stock, right next to each other. But we have a hard time reading this from the output, so let's add some more to the query.

Firstly, we want to add the day the pattern starts. To do this we must add a new label to the pattern, but since we don't need this first row to satisfy any condition other than coming right before the first UP, we don't actually have to DEFINE it. If that sounds strange, you'll understand when you see it.

Secondly, we want to see when one pattern ends and another starts. We can do this by adding a computed variable, called a MEASURE, using a function called MATCH_NUMBER() that enumerates which occurrence of the pattern the row is part of.

Additionally, since we will now have two different labels as part of our pattern, we want to see what each row is classified as. This, too, we can do in the MEASURES clause, with the special function CLASSIFIER(), which returns the name of the definition that matched.

SELECT * FROM ticker
MATCH_RECOGNIZE(
  PARTITION BY symbol
  ORDER BY tstamp
  MEASURES MATCH_NUMBER() AS match_num,
           CLASSIFIER() AS clf
  ALL ROWS PER MATCH
  PATTERN ( strt up+ )
  DEFINE up AS PREV(price) < price
)
Table 11.2: Displaying records 1 - 10
SYMBOL TSTAMP MATCH_NUM CLF PRICE
ACME 2011-04-01 1 STRT 12
ACME 2011-04-02 1 UP 17
ACME 2011-04-03 1 UP 19
ACME 2011-04-04 1 UP 21
ACME 2011-04-05 1 UP 25
ACME 2011-04-06 2 STRT 12
ACME 2011-04-07 2 UP 15
ACME 2011-04-08 2 UP 20
ACME 2011-04-09 2 UP 24
ACME 2011-04-10 2 UP 25

This helps immensely with readability. We can see that the start (nicknamed strt in the code) of the pattern is the 1st of April 2011, when the price is 12, before it increases to 17, which was the first observation we saw in the previous query. Also, we get a confirmation that the drop we saw when the price was 25 was the end of the first occurrence of the pattern and the start of the second occurrence. When we see a new ticker symbol, the match number counter resets.

11.2 A more involved example

In order to really show the power of pattern matching, we have to import some more data. The New York City payroll data is public, rich, and has a panel-data structure that allows us to study changes over time. As of this writing, you can find the data at https://data.cityofnewyork.us/City-Government/Citywide-Payroll-Data-Fiscal-Year-/k397-673e.

11.2.1 Importing and cleaning

You can download the dataset as a CSV file and use the SQL Developer import wizard to create a table for it. If this were a dataset we expected to add more data to regularly, we would set up a proper import script, probably in R, which would make the process automatic and reproducible. But in this case, just right-click on the Tables folder in the left sidebar and choose Import Data. The wizard will guide you through the rest.

library(dplyr)
library(DBI)  # provides dbWriteTable(); 'con' is the connection we set up earlier

# read the CSV, giving the columns names that work well as Oracle column names
df <- read.csv('~/citywide-payroll-data-fiscal-year.csv', sep = ",",
               col.names = c('YEAR', 'PAYROLL_NO', 'AGENCY_NAME', 'LAST_NAME', 'FIRST_NAME', 'MID_INIT',
                             'AGENCY_START_DATE', 'WORK_LOCATION_BOROUGH', 'TITLE_DESCRIPTION',
                             'LEAVE_STATUS_AS_OF_JUNE_30', 'BASE_SALARY', 'PAY_BASIS', 'REGULAR_HOURS',
                             'REGULAR_GROSS_PAID', 'OT_HOURS', 'TOTAL_OT_PAID', 'TOTAL_OTHER_PAY'),
               header = TRUE, stringsAsFactors = FALSE)

# write the data frame to a new table NYCPAY in the database
dbWriteTable(con, "NYCPAY", df)

After the data is imported, we want to do a little bit of exploratory analysis to find out what we are dealing with. We can already see there are a few different ways to measure pay (the PAY_BASIS column), which we might have to deal with in order to compare pay. But more importantly, since we want to analyze the career development of employees, we need to make sure that we have some way of uniquely identifying each employee.
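For instance, a quick query (a sketch; we don't show its output here) lists the distinct pay bases and how common each one is:

SELECT pay_basis, COUNT(1) AS num_rows
FROM nycpay
GROUP BY pay_basis
ORDER BY 2 DESC;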

Before we tackle identification, as a matter of good practice, let's add an anonymous ID for each person so that we don't have to use full names everywhere.

ALTER TABLE nycpay
ADD PERS_ID VARCHAR2(32);

UPDATE nycpay
SET PERS_ID = ORA_HASH(FIRST_NAME || MID_INIT || LAST_NAME);

COMMIT;

It is easier to have one column for identifying a person, and if you have worked with personal data before, you will hopefully appreciate working with data that doesn't contain directly personally identifiable information (PII). Since the dataset is public, this is in principle an empty gesture, but removing PII is nevertheless valuable as a general practice.

Back to uniquely identifying one person. Names are of course not unique, but we also have another challenge: one person may have more than one job with the city. This is a real challenge, because the data won't be unique by person and year - and that is exactly what we want to track: how each person fares.

In practice we have two choices: Aggregate the data so that it is unique per person per year, or simply remove persons who have had multiple jobs. First, let's check how many people actually have multiple jobs:

SELECT COUNT(1), COUNT(DISTINCT pers_id), SUM(occ) FROM (
  SELECT year, pers_id, COUNT(1) AS occ FROM nycpay
  GROUP BY year, pers_id
  HAVING COUNT(1) > 1
)
Table 11.3: 1 records
COUNT(1) COUNT(DISTINCT PERS_ID) SUM(OCC)
834680 259922 1831579

We can see that out of 3.9 million rows, over 1.8 million belong to people who have had more than one job in one year, and that 259,922 people have had more than one job. If our goal was to conduct a thorough review of the city's pay data, we would have to aggregate the data in some way in order to construct a dataset that contained every person and a meaningful unique identifier.
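Such an aggregation could look something like the following sketch (the table name AGGPAY and the choice of columns are our own, and we won't use it further):

CREATE TABLE aggpay AS (
  SELECT year,
         pers_id,
         COUNT(1) AS num_jobs,
         -- note: blindly summing mixes hourly and annual pay bases
         SUM(base_salary) AS base_salary,
         SUM(regular_gross_paid) AS regular_gross_paid,
         SUM(total_ot_paid) AS total_ot_paid
  FROM nycpay
  GROUP BY year, pers_id
);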

Further exploration shows that a large number of these people are election workers, which we might not find all that interesting to analyze from a career perspective. We take the opportunity to create a second table without the names, without the people causing the duplicates, and with some other data tidying thrown in.

CREATE TABLE SIMPLEPAY AS (
  SELECT year, 
  pers_id,
  TRIM(upper(AGENCY_NAME)) AS agency, 
  TO_DATE(SUBSTR(AGENCY_START_DATE, 1, 10), 'YYYY-MM-DD') AS job_start,
  TRIM(WORK_LOCATION_BOROUGH) AS job_location,
  TRIM(TITLE_DESCRIPTION) AS job_title,
  TRIM(LEAVE_STATUS_AS_OF_JUNE_30) AS leave_status,
  base_salary,
  TRIM(pay_basis) AS pay_basis,
  regular_hours,
  regular_gross_paid,
  ot_hours,
  total_ot_paid,
  TOTAL_OTHER_PAY
  FROM nycpay
  WHERE 
  TRIM(upper(TITLE_DESCRIPTION)) != 'ELECTION WORKER' AND
  pers_id NOT IN(
    SELECT DISTINCT pers_id FROM nycpay
    WHERE TRIM(upper(TITLE_DESCRIPTION)) = 'ELECTION WORKER'
    GROUP BY year, pers_id
    HAVING COUNT(1)>1
  )
);

This leaves us with only about 250,000 records (people) per year, but these are hopefully people with fairly regular employment contracts, for whom we can do meaningful analysis.
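As a sanity check (a sketch; we don't show the output), we can compare row counts to distinct person counts per year in the new table - the two should be close to identical:

SELECT year,
       COUNT(1) AS num_rows,
       COUNT(DISTINCT pers_id) AS num_persons
FROM simplepay
GROUP BY year
ORDER BY year;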

Before we go any further, let's check a column that looks promising: base salary. This seems to always have a value, which might make it a good indicator of what a person actually makes (some of the other pay-related columns have 0 values, which is not a likely wage level).

We can use the WIDTH_BUCKET function to create bins, and count the number of people in the respective wage brackets. We create 20 bins between $0 and $200,000, which leaves us with neat $10,000 increments. Any values outside the $0-$200,000 interval will be grouped in "overflow" buckets at the low and high end.

SELECT WIDTH_BUCKET(BASE_SALARY, 0, 200000, 20) AS pay,
MIN(BASE_SALARY) AS from_pay,
MAX(BASE_SALARY) AS to_pay,
COUNT(1) AS employee_count
FROM simplepay
GROUP BY WIDTH_BUCKET(BASE_SALARY, 0, 200000, 20)
ORDER BY 1
Table 11.4: Displaying records 1 - 10
PAY FROM_PAY TO_PAY EMPLOYEE_COUNT
1 0.01 9910 1299806
2 10000 19928 2048
3 20000 29999 65973
4 30000 39999 267240
5 40000 49999 318872
6 50000 59998 257477
7 60000 69999 238067
8 70000 79999 303499
9 80000 89999 317832
10 90000 99999 183706

There are no negative values, but there are a lot of employees with less than $10,000 in base pay. A quick look at the job titles for these low-paying jobs may reveal something about what types of jobs these are.

SELECT job_title, COUNT(1) AS num_employees FROM simplepay
WHERE base_salary<10000
GROUP BY job_title
ORDER BY 2 DESC
FETCH FIRST 20 ROWS ONLY
Table 11.5: Displaying records 1 - 10
JOB_TITLE NUM_EMPLOYEES
TEACHER- PER SESSION 589323
TEACHER-GENERAL ED 82246
F/T SCHOOL AIDE 57402
JOB TRAINING PARTICIPANT 52810
SUBSTITUTE ED PARA 50308
COLLEGE ASSISTANT 37153
ADJUNCT LECTURER 28939
F/T SCHOOL LUNCH HELPER 28515
STUDENT AIDE 22667
CITY SEASONAL AIDE 22412

Per-session teachers form by far the largest group, and school-related roles such as aides and substitutes dominate the rest of the list. Many of these roles arguably have an element of volunteering or civic engagement to them - pay might not be the main reason these people are doing the job.

The job training participants are a good test case: let's look at how they fared. Did any of them end up with a job with the city?

SELECT * FROM (
  SELECT year, pers_id, agency, job_title, job_start, base_salary, pay_basis, leave_status FROM simplepay
)
MATCH_RECOGNIZE(
  PARTITION BY pers_id
  ORDER BY year
  MEASURES MONTHS_BETWEEN(FIRST(fulljob.job_start), training.job_start) AS time_to_job,
           COUNT(fulljob.*) AS fulljob_years
  ALL ROWS PER MATCH
  PATTERN ( ^ training fulljob+ )
  DEFINE
    training AS job_title = 'JOB TRAINING PARTICIPANT',
    fulljob AS job_title != 'JOB TRAINING PARTICIPANT'
)
Table 11.6: Displaying records 1 - 10
PERS_ID YEAR TIME_TO_JOB FULLJOB_YEARS AGENCY JOB_TITLE JOB_START BASE_SALARY PAY_BASIS LEAVE_STATUS
1000843506 2016 NA 0 DEPT OF PARKS & RECREATION JOB TRAINING PARTICIPANT 2016-05-04 11.8 per Hour ACTIVE
1000843506 2017 14 1 DEPT OF PARKS & RECREATION CITY SEASONAL AIDE 2017-06-30 15.2 per Hour ACTIVE
1002309329 2015 NA 0 DEPT OF PARKS & RECREATION JOB TRAINING PARTICIPANT 2013-04-24 9.4 per Hour CEASED
1002309329 2015 19 1 HRA/DEPT OF SOCIAL SERVICES ELIGIBILITY SPECIALIST 2014-12-08 33284.0 per Annum CEASED
1006085312 2016 NA 0 DEPT OF PARKS & RECREATION JOB TRAINING PARTICIPANT 2015-06-29 11.8 per Hour CEASED
1006085312 2020 49 1 DEPT OF HEALTH/MENTAL HYGIENE COMMUNITY COORDINATOR 2019-07-20 36.6 per Hour ACTIVE
1010330046 2018 NA 0 DEPT OF PARKS & RECREATION JOB TRAINING PARTICIPANT 2017-12-13 13.5 per Hour ACTIVE
1010330046 2019 17 1 DEPT OF PARKS & RECREATION CITY PARK WORKER 2019-05-20 16.1 per Hour ACTIVE
1010330046 2020 17 2 DEPT OF PARKS & RECREATION CITY PARK WORKER 2019-05-20 16.1 per Hour CEASED
1011149826 2015 NA 0 DEPT OF PARKS & RECREATION JOB TRAINING PARTICIPANT 2011-07-05 9.3 per Hour CEASED

Here, we have selected people who start out as job training participants and who, the next time we see them, are regular employees (well, not job training participants, anyway). We also measure the number of months from when they start job training until they start a paid job, and count the number of rows (years) of regular employment.

From here, we can continue our analysis by counting how many people this is, comparing it to the number of people who have been in job training programmes, and looking at the time between the job training programme and their job start. A practical way to do this is to remove the ALL ROWS PER MATCH clause, which makes Oracle fall back to the default, ONE ROW PER MATCH, and leaves us with one row per person.

SELECT * FROM (
  SELECT year, pers_id, agency, job_title, job_start, base_salary, pay_basis, leave_status FROM simplepay
)
MATCH_RECOGNIZE(
  PARTITION BY pers_id
  ORDER BY year
  MEASURES MONTHS_BETWEEN(FIRST(fulljob.job_start), FIRST(training.job_start)) AS time_to_job,
           COUNT(fulljob.*) AS fulljob_years
  PATTERN ( ^ training+ fulljob+ )
  DEFINE
    training AS job_title = 'JOB TRAINING PARTICIPANT',
    fulljob AS job_title != 'JOB TRAINING PARTICIPANT'
)
Table 11.7: Displaying records 1 - 10
PERS_ID TIME_TO_JOB FULLJOB_YEARS
1000843506 14 1
1002309329 19 1
1002319496 21 4
1004555801 0 3
1006085312 49 1
1006328677 16 2
1010330046 17 2
1011280536 37 2
101235340 77 1
1020333191 0 2

Just by deleting a line, we have an aggregate with the partition variable (PERS_ID) and the MEASURES we defined. This is a great starting point to take into R for further analysis. We have also changed the query slightly, opening up for multiple years of job training (the training+ part), and changed our measure to make sure we get the months between the start date of the first year in job training and the start date of the first job.

q <- dbSendQuery(con, "
SELECT * FROM (
  SELECT year, pers_id, agency, job_title, job_start, base_salary, pay_basis, leave_status FROM simplepay
)
MATCH_RECOGNIZE(
  PARTITION BY pers_id
  ORDER BY year
  MEASURES MONTHS_BETWEEN(FIRST(fulljob.job_start), FIRST(training.job_start)) AS time_to_job,
           COUNT(fulljob.*) AS fulljob_years
  PATTERN ( ^ training+ fulljob+ )
  DEFINE
    training AS job_title = 'JOB TRAINING PARTICIPANT',
    fulljob AS job_title != 'JOB TRAINING PARTICIPANT'
)
")
# pull the result set into an R data frame
df <- fetch(q)

library(ggplot2)
library(dplyr)
df %>% ggplot(aes(TIME_TO_JOB)) +
  geom_histogram() 
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

A large portion of the people who get a job with the city after job training seem to keep the job they had during job training, but transition to a normal position. For the group of people who didn't get a job straight out of their job training, the curve looks roughly linear. The 5 years of data we have corresponds to 60 months, so the non-negligible number of people getting a job close to 100 months out is probably due to job training spells that started long before 2014, when our dataset starts. The long tail of both negative and positive values is most likely a result of data entry mistakes. This is real-world data, and there will be real-world errors.
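If we wanted to push the analysis further, a natural next step would be to filter out the implausible values first. Here is a sketch, with cut-offs (0 to 120 months) that are entirely our own choice:

SELECT * FROM (
  SELECT * FROM (
    SELECT year, pers_id, agency, job_title, job_start, base_salary, pay_basis, leave_status FROM simplepay
  )
  MATCH_RECOGNIZE(
    PARTITION BY pers_id
    ORDER BY year
    MEASURES MONTHS_BETWEEN(FIRST(fulljob.job_start), FIRST(training.job_start)) AS time_to_job,
             COUNT(fulljob.*) AS fulljob_years
    PATTERN ( ^ training+ fulljob+ )
    DEFINE
      training AS job_title = 'JOB TRAINING PARTICIPANT',
      fulljob AS job_title != 'JOB TRAINING PARTICIPANT'
  )
)
-- keep only gaps between zero and ten years; anything outside is suspect
WHERE time_to_job BETWEEN 0 AND 120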