Skip to content

Instantly share code, notes, and snippets.

@dannguyen
Last active November 3, 2016 00:27
Show Gist options
  • Save dannguyen/5271753f3d1363b8ec686619a929c672 to your computer and use it in GitHub Desktop.
Save dannguyen/5271753f3d1363b8ec686619a929c672 to your computer and use it in GitHub Desktop.
An example of how to explore and filter a non-normalized SQL database table using subqueries and joins using cross-platform syntax (e.g no DISTINCT or WINDOW)

A SQL tutorial on using subqueries to filter a table against itself

This is a short SQL tutorial on how to do multiple transformation/aggregates of a table in order to perform analyses of the kind:

Given a list of grocery stores that includes how much fruit they've sold, per type of fruit, e.g.:

|    store     |  fruit  | price | sold |
|--------------|---------|-------|------|
| Safeway      | Apples  |  2.00 |   50 |
| Safeway      | Oranges |  1.50 |   30 |
| Trader Joe's | Apples  |  0.99 |  100 |
| Whole Foods  | Apples  |  5.99 |    0 |
| Whole Foods  | Oranges |  2.50 |    7 |

Find all stores that:

  • sold at least one apple
  • sold only apples
  • sold apples at a price below $1
  • sold apples but not oranges

There are lots of ways to do such calculations, so I want to keep the SQL syntax as basic as possible, i.e. limited to the SQL syntax that I personally understand, and that is consistent among the major flavors of SQL, including SQLite, MySQL, and PostgreSQL.

So, basically, no DISTINCT or WINDOW. Just: GROUP BY, COUNT, and the concepts of subqueries and joins.

The queries in this tutorial work on MySQL, SQlite, and PostgreSQL alike, and you can use the following SQLFiddle to practice queries against:

http://sqlfiddle.com/#!9/24b2e4

Instead of grocery stores and fruit sales, let's use a simple mockup of criminal court data, in which criminal cases have multiple charges, e.g. a defendant can be charged with assault and DUI in a single case:

| rowid | casenum |   charge   | disposition |
|-------|---------|------------|-------------|
|     1 | A       | assault    | dismissed   |
|     2 | B       | battery    | convicted   |
|     3 | C       | conspiracy | dismissed   |
|     4 | C       | conspiracy | dismissed   |
|     5 | D       | dui        | dismissed   |
|     6 | D       | dui        | convicted   |
|     7 | D       | assault    | convicted   |
|     8 | E       | evilness   | dismissed   |
|     9 | E       | assault    | dismissed   |
|    10 | E       | assault    | dismissed   |

Given that data format, how do we find all cases...:

| casenum |
|---------|
| B       |
| D       |
| casenum | total_charges | total_dismissals |
|---------|---------------|------------------|
| A       |             1 |                1 |
| C       |             2 |                2 |
| E       |             3 |                3 |
| rowid | casenum |  charge  | disposition |
|-------|---------|----------|-------------|
|     1 | A       | assault  | dismissed   |
|     5 | D       | dui      | dismissed   |
|     6 | D       | dui      | convicted   |
|     7 | D       | assault  | convicted   |
|     8 | E       | evilness | dismissed   |
|     9 | E       | assault  | dismissed   |
|    10 | E       | assault  | dismissed   |
| rowid | casenum |  charge  | disposition |
|-------|---------|----------|-------------|
|     5 | D       | dui      | dismissed   |
|     6 | D       | dui      | convicted   |
|     7 | D       | assault  | convicted   |
|     8 | E       | evilness | dismissed   |
|     9 | E       | assault  | dismissed   |
|    10 | E       | assault  | dismissed   |
| casenum | total_charges | total_convictions |
|---------|---------------|-------------------|
| A       |             1 | (null)            |
| C       |             2 | (null)            |
| D       |             3 | 2                 |

Create and seed the table of charges

If you don't want to use the SQLFiddle, here's the code to create and seed the example table. This statement (and most of the subsequent SELECT queries) should work in MySQL, PostgreSQL, and SQLite:

CREATE TABLE charges ( 
    rowid INTEGER PRIMARY KEY,
    casenum VARCHAR(255),
    charge VARCHAR(255),
    disposition VARCHAR(255)
);

INSERT INTO charges(rowid, casenum, charge, disposition) 
  VALUES
    (1, 'A','assault','dismissed'),
    (2, 'B','battery','convicted'),
    (3, 'C','conspiracy','dismissed'),
    (4, 'C','conspiracy','dismissed'),
    (5, 'D','dui','dismissed'),
    (6, 'D','dui','convicted'),
    (7, 'D','assault','convicted'),
    (8, 'E','evilness','dismissed'),
    (9, 'E','assault','dismissed'),
    (10, 'E','assault','dismissed')
    ;

Cases with at least one conviction

This requires a GROUP BY clause but not necessarily a COUNT column, as every case without a conviction will be filtered out of the results by the WHERE clause before the aggregate is done:

SELECT casenum
FROM charges
WHERE disposition = 'convicted'
GROUP BY casenum;
| casenum |
|---------|
| B       |
| D       |

However, if we want to include the number of convictions per case:

SELECT casenum, 
  COUNT(1) AS total_convictions
FROM charges
WHERE disposition = 'convicted'
GROUP BY casenum;
| casenum | total_convictions |
|---------|-------------------|
| B       |                 1 |
| D       |                 2 |

We don't need the HAVING clause -- i.e. HAVING total_convictions > 0 -- because the WHERE clause automatically eliminates all cases without a single convictions (by selecting only the rows that have a conviction before grouping them). If we wanted to find all cases with more than one conviction (which seems of little utility in a real-world analysis), we would have to use a HAVING clause to filter the aggregated results:

SELECT casenum, 
  COUNT(1) AS total_convictions
FROM charges
WHERE disposition = 'convicted'
GROUP BY casenum
HAVING total_convictions > 1;
| casenum | total_convictions |
|---------|-------------------|
| D       |                 2 |

2. All cases in which all charges were dismissed

In other words, all cases in which the total number of charges is equal to the number of charges that were dismissed.

Conceptually, we create two aggregate tables:

  1. A table with 2 columns: case_num and a count of all charges per case: total_charges
  2. A table with 2 columns: case_num and a count of all dismissed charges per case: total_dismissals

The query that solves our problem is a INNER JOIN between the two tables on case_num, in which total_dismissals is equal to total_charges.

Total charge counts per case

SELECT casenum, 
     COUNT(1) AS total_charges
FROM charges
GROUP BY casenum;
| casenum | total_charges |
|---------|---------------|
| A       |             1 |
| B       |             1 |
| C       |             2 |
| D       |             3 |
| E       |             3 |

Total dismissal counts per case

SELECT casenum, 
     COUNT(1) AS total_dismissals
FROM charges
WHERE disposition = 'dismissed'
GROUP BY casenum;

Note that case B does not show up because none of its charges resulted in a dismissal:

| casenum | total_dismissals |
|---------|------------------|
| A       |                1 |
| C       |                2 |
| D       |                1 |
| E       |                3 |

All together

You could create two new tables, temporary or not, and then do the join on those two tables. But since the result of each of the SELECT statements effectively results in a new table, we can just execute both of the aforementioned statements, alias and then join their results, to save us the tedium of creating and deleting temp tables.

Here's how to do it with subqueries:

SELECT tc.casenum, 
     tc.total_charges, 
     td.total_dismissals
FROM
  (SELECT casenum, COUNT(1) AS total_charges
    FROM charges
    GROUP BY casenum) 
  AS tc

INNER JOIN 
   (SELECT casenum, COUNT(1) AS total_dismissals
    FROM charges
    WHERE disposition = 'dismissed'
    GROUP BY casenum)
  AS td
  ON tc.casenum = td.casenum
WHERE tc.total_charges = td.total_dismissals;

Note that including tc.total_charges and td.total_dismissals is unnecessary, but we include it here for a sanity check in the results:

| casenum | total_charges | total_dismissals |
|---------|---------------|------------------|
| A       |             1 |                1 |
| C       |             2 |                2 |
| E       |             3 |                3 |

Another side note: it's possible to eliminate the WHERE clause by putting its condition into the ON clause:

\\ ...

ON tc.casenum = td.casenum
   AND tc.total_charges = td.total_dismissals;

The result would be the same as the previous query, but in my opinion, the code does not describe what we want conceptually, and it may be code that is harder to maintain in terms of preventing errors if we incorporate it into a non-trivial analysis.

And I believe it would screw up the efficiency of the query, as casenum would presumably be indexed, but total_charges and total_dismissals would not be.

Refactor subqueries with the WITH common-table expression

If you're using a SQL variant that is not MySQL, you can achieve the same result using a more declarative (and arguably more readable) style with the WITH common table expression -- hat-tip to this great listicle of SQL tricks:

WITH  
   tc AS (SELECT casenum, 
          COUNT(1) AS total_charges
          FROM charges
          GROUP BY casenum),
   td AS (SELECT casenum, 
          COUNT(1) AS total_dismissals
          FROM charges
          WHERE disposition = 'dismissed'
          GROUP BY casenum)

SELECT tc.casenum, 
     tc.total_charges, 
     td.total_dismissals
FROM tc
INNER JOIN td
  ON tc.casenum = td.casenum
WHERE tc.total_charges = td.total_dismissals;

Make it simpler with a subquery in the WHERE clause

Hoa Nguyen on the NICAR-L mailing list pointed out a much simpler method that requires no JOINs and far less code:

SELECT casenum
FROM charges
WHERE casenum
  NOT IN (SELECT casenum FROM charges WHERE disposition != 'dismissed')
GROUP BY casenum;

My main objection was that I had thought this would always be less optimal than a join, but according to EXPLAIN on MySQL 5.6.x, I would be very wrong. And the semantics of Hoa's query is likely more intuitive to more people. That said, there are some limitations (specifically with MySQL) when it coems to subqueries (see documentation here). And, in my opinion, the immediate clarity of this code will become much more muddled when doing more complicated JOINs. (but I don't have an examples at the moment to clearly illustrate this)

3. All cases, with charge-per-case detail, in which at least one of the charges is as assault

This seems like it'd be similar to the first example, except with a different WHERE condition:

SELECT casenum
FROM charges
WHERE charge = 'assault'
GROUP BY casenum;

But the aggregate destroys the individual charge-per-case detail, e.g. how cases D and E each have more than one charge:

| casenum |
|---------|
| A       |
| D       |
| E       |

Sometimes for research and reporting purposes, we want a result table with as much detail as possible, to make it easier to browse the filtered cases.

So to get all cases that include at least one assault charge, but include all other charges for each case, as well as all the original columns, we'll need to do a subquery and INNER JOIN.

Note how the main SELECT statement has no GROUP BY clause -- this allows us to list each charge per case:

SELECT charges.*
FROM charges
INNER JOIN
   (SELECT casenum
    FROM charges
    WHERE charge = 'assault'
    GROUP BY casenum)
  AS assault_cases
  ON assault_cases.casenum = charges.casenum;
| rowid | casenum |  charge  | disposition |
|-------|---------|----------|-------------|
|     1 | A       | assault  | dismissed   |
|     5 | D       | dui      | dismissed   |
|     6 | D       | dui      | convicted   |
|     7 | D       | assault  | convicted   |
|     8 | E       | evilness | dismissed   |
|     9 | E       | assault  | dismissed   |
|    10 | E       | assault  | dismissed   |

4. All cases that have multiple charges and at least one assault charge, and include the charge-per-case detail

This example is meant to show why we might be interested in doing the third example in the first place. It uses virtually the same code as in the second and third examples, stacked together.

It requires joining together 3 tables:

  • charges - this is the original charges table, becausue we need access to the original rows and columns
  • ta - this table is the result of a subquery that filters the charges include only charges of 'assault', then does a GROUP BY casenum to return only rows with unique casenum values.
  • tc - this subquery returns the unique casenum for each case from charges, along with total count of charges per case, i.e. a GROUP BY casenum.
SELECT charges.*
FROM charges
INNER JOIN
  (SELECT casenum
   FROM charges
   WHERE charge = 'assault'
   GROUP BY casenum)
  AS ta
  ON ta.casenum = charges.casenum
INNER JOIN
  (SELECT casenum, 
    COUNT(1) AS total_charges
   FROM charges
   GROUP BY casenum)
  AS tc
  ON tc.casenum = ta.casenum
WHERE tc.total_charges > 1;

Note how case A is eliminated from the results, as it has only one charge total:

| rowid | casenum |  charge  | disposition |
|-------|---------|----------|-------------|
|     5 | D       | dui      | dismissed   |
|     6 | D       | dui      | convicted   |
|     7 | D       | assault  | convicted   |
|     8 | E       | evilness | dismissed   |
|     9 | E       | assault  | dismissed   |
|    10 | E       | assault  | dismissed   |

Using a HAVING clause

It's probably more conceptually clearer to eliminate the WHERE clause in the main SELECT and use a HAVING clause in the subquery that finds total charges per case:

SELECT charges.*
FROM charges
INNER JOIN
  (SELECT casenum
   FROM charges
   WHERE charge = 'assault'
   GROUP BY casenum)
  AS ta
  ON ta.casenum = charges.casenum
INNER JOIN
  (SELECT casenum, 
    COUNT(1) AS total_charges
   FROM charges
   GROUP BY casenum
   HAVING total_charges > 1)
  AS tc
  ON tc.casenum = ta.casenum
  ;

All cases in which not all charges were dismissed

This is just a join between two tables created by subqueries. But it's different than when we wanted to find cases in which every charge was dismissed, i.e.

  1. Cases in which total number of dismissals is equal to total number of charges.

By definition, such cases have at least one dismissal.

However, a case in which not all charges were dismissed includes:

  1. Cases in which total number dismissals is less than total number of charges.
  2. Cases in which there were no dismissals

In order to capture the second condition, we need to use LEFT JOIN instead of a INNER JOIN, as well as add a NULL check in the WHERE clause:

SELECT 
  tc.casenum, 
  tc.total_charges, 
  td.total_dismissals
FROM
  (SELECT casenum, 
     COUNT(1) AS total_charges
   FROM charges
   GROUP BY casenum) 
  AS tc

LEFT JOIN 
   (SELECT casenum, 
       COUNT(1) AS total_dismissals
    FROM charges
    WHERE disposition = 'dismissed'
    GROUP BY casenum)
  AS td

ON tc.casenum = td.casenum
WHERE 
  tc.total_charges != td.total_dismissals 
  OR td.total_dismissals IS NULL;
| casenum | total_charges | total_convictions |
|---------|---------------|-------------------|
| A       |             1 | (null)            |
| C       |             2 | (null)            |
| D       |             3 | 2                 |

All cases with no convictions

TODO: (fill this out later)

Finding cases that have at least one conviction is much easier than finding cases without a conviction:

SELECT allcases.casenum
FROM 
  (SELECT casenum
   FROM charges
   GROUP BY casenum)
  AS allcases
LEFT JOIN
  (SELECT casenum
   FROM charges
   WHERE disposition = 'convicted'
   GROUP BY casenum)
  AS convictions
  ON allcases.casenum = convictions.casenum
WHERE convictions.casenum IS NULL;
| casenum |
|---------|
| A       |
| C       |
| E       |
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment