Using SQL Union

I had the chance recently to use SQL Union in my everyday work, so I figured I'd write this to remind myself of why it's so useful.

First, though, some hot takes.

  1. Building dashboards and CSVs for internal business tools into your apps is a little silly and cumbersome. For one, every time someone wants another column added, they have to ask engineering to make a code change. And also, once the data set gets big enough, requests start timing out.

  2. It's actually really difficult to generate quality queries with ActiveRecord. When you're working with more complex joins across the codebase, it's probably better to dip into writing raw SQL. Although it's not as maintainable, it's often more efficient, and you end up with tabular data rather than objects hydrated with data from your database. Plus you can actually transform the data into the shape that you want by using CASE statements and so on.

Okay, hot takes aside, you can probably see that in some cases, it's better to expose your app's data (probably with a data pipeline) to allow it to be queryable with a business intelligence tool. So...that is what we did in this case.

Using Postico, I connected to a replica in a data warehouse and was able to query data from across applications, which is how SQL Union came into play.

I wanted to put data from different sources into one table to export it as a CSV. To do this, I used the UNION operator in SQL.

Before you use union, you'll need to make sure that the number of columns in each data set is equal. The data will just get mashed together independent of any column names.

Here's a basic example and then I'll show off the query I actually used.

If for some reason you had two tables and wanted to make a single list of kids and parents, you could use union operator.

KIDS

name email age
tracy tracy@tracy.com 30
eric eric@eric.com 28
rachel rachel@rachel.com 25

PARENTS

name email
mom mom@mom.com
dad dad@dad.com

But lo and behold, the number of columns differs in each table.

That's cool: we can make them the same by adding a value to represent that column. Or we could omit that column if we wanted to. I'll do the example where we list a value.


select *
  from kids
union
select
  name,
  email,
  'n/a' as age
from
  parents

RESULTS

name email age
tracy tracy@tracy.com 30
eric eric@eric.com 28
rachel rachel@rachel.com 25
mom mom@mom.com n/a
dad dad@dad.com n/a

And that should allow us to get the right data in the right columns. Simple, silly example, but this sets the stage for some more complex querying.

Here's a cleaned-up version of the query I used to retrieve the set of online students who were in both our self-paced and structured programs. Data was split across many different tables from two separate apps that had their data piped into Redshift.

select
  users.uuid,
  users.email,
  users.first_name,
  users.last_name,
  cohort_batches.*
from
  users
join
  user_roles
on
  user_roles.user_id = users.id
join
  roles
on
  user_roles.role_id = roles.id
join
 (select
	batches.id as batch_id,
	batches.uuid as batch_uuid,
	batches.iteration as batch_iteration,
	cohorts.name as cohort_name,
	campuses.name as campus_name,
	cohorts.start_date as start_date
  from
    cohorts
  join
    batches
  on
    cohorts.batch_id = batches._uuid
  join
    campuses
  on
    cohorts.campus_id = campuses.id
  join
    courses
  on
    courses.id = cohorts.course_id
  where
    campuses.name = 'Online'
  and
    courses.name in ('Software Engineering', 'Data Science')
  union
  select
    batches.id as batch_id,
    batches.uuid as batch_uuid,
    batches.iteration as batch_iteration,
    null as cohort_name, # wasn't available, so fill in a default value
    'Online' as campus_name, # campus name all the same for these
    null as start_date # no start date since self-paced
  from
    batches
  where
    batches.iteration in ('v-000', 'online-web-sp-000', 'online-ds-sp-000')
  ) as cohort_batches
on
  cohort_batches.batch_id = user_roles.roleable_id
where
  roles.name = 'student'
and
  user_roles.roleable_type = 'Batch'

It's not the craziest query I've ever seen, but as a person unaccustomed to writing raw SQL all day, it was definitely a doozy to construct. What I found interesting was the ability to join two sets of "cohort batches" together derived from different sources and then join that set of data to our users table based on a role and roleable type.

So, lots of querying, subquerying, union-ing and yeah. I suppose it was fun? For what it's worth, dipping down into raw SQL wasn't my favorite activity, but it got progressively more fun over time. So happy querying, I guess.