Using SQL Union
I had the chance recently to use SQL Union
in my everyday work, so I figured
I'd write this to remind myself of why it's so useful.
First, though, some hot takes.
-
Building dashboards and CSVs for internal business tools into your apps is a little silly and cumbersome. For one, every time someone wants another column added, they have to ask engineering to make a code change. And also, once the data set gets big enough, requests start timing out.
-
It's actually really difficult to generate quality queries with
ActiveRecord
. When you're working with more complex joins across the codebase, it's probably better to dip into writing raw SQL. Although it's not as maintainable, it's often more efficient, and you end up with tabular data rather than objects hydrated with data from your database. Plus you can actually transform the data into the shape that you want by usingCASE
statements and so on.
Okay, hot takes aside, you can probably see that in some cases, it's better to expose your app's data (probably with a data pipeline) to allow it to be queryable with a business intelligence tool. So...that is what we did in this case.
Using Postico, I connected to a replica in a data warehouse and was able to
query data from across applications, which is how SQL Union
came into play.
I wanted to put data from different sources into one table to export it as a
CSV. To do this, I used the UNION
operator in SQL.
Before you use union
, you'll need to make sure that the number of columns
in each data set is equal. The data will just get mashed together
independent of any column names.
Here's a basic example and then I'll show off the query I actually used.
If for some reason you had two tables and wanted to make a single list of kids and parents, you could use union
operator.
KIDS
name | age | |
---|---|---|
tracy | tracy@tracy.com | 30 |
eric | eric@eric.com | 28 |
rachel | rachel@rachel.com | 25 |
PARENTS
name | |
---|---|
mom | mom@mom.com |
dad | dad@dad.com |
But lo and behold, the number of columns differs in each table.
That's cool: we can make them the same by adding a value to represent that column. Or we could omit that column if we wanted to. I'll do the example where we list a value.
select *
from kids
union
select
name,
email,
'n/a' as age
from
parents
RESULTS
name | age | |
---|---|---|
tracy | tracy@tracy.com | 30 |
eric | eric@eric.com | 28 |
rachel | rachel@rachel.com | 25 |
mom | mom@mom.com | n/a |
dad | dad@dad.com | n/a |
And that should allow us to get the right data in the right columns. Simple, silly example, but this sets the stage for some more complex querying.
Here's a cleaned-up version of the query I used to retrieve the set of online students who were in both our self-paced and structured programs. Data was split across many different tables from two separate apps that had their data piped into Redshift.
select
users.uuid,
users.email,
users.first_name,
users.last_name,
cohort_batches.*
from
users
join
user_roles
on
user_roles.user_id = users.id
join
roles
on
user_roles.role_id = roles.id
join
(select
batches.id as batch_id,
batches.uuid as batch_uuid,
batches.iteration as batch_iteration,
cohorts.name as cohort_name,
campuses.name as campus_name,
cohorts.start_date as start_date
from
cohorts
join
batches
on
cohorts.batch_id = batches._uuid
join
campuses
on
cohorts.campus_id = campuses.id
join
courses
on
courses.id = cohorts.course_id
where
campuses.name = 'Online'
and
courses.name in ('Software Engineering', 'Data Science')
union
select
batches.id as batch_id,
batches.uuid as batch_uuid,
batches.iteration as batch_iteration,
null as cohort_name, # wasn't available, so fill in a default value
'Online' as campus_name, # campus name all the same for these
null as start_date # no start date since self-paced
from
batches
where
batches.iteration in ('v-000', 'online-web-sp-000', 'online-ds-sp-000')
) as cohort_batches
on
cohort_batches.batch_id = user_roles.roleable_id
where
roles.name = 'student'
and
user_roles.roleable_type = 'Batch'
It's not the craziest query I've ever seen, but as a person unaccustomed to writing raw SQL all day, it was definitely a doozy to construct. What I found interesting was the ability to join two sets of "cohort batches" together derived from different sources and then join that set of data to our users table based on a role and roleable type.
So, lots of querying, subquerying, union-ing and yeah. I suppose it was fun? For what it's worth, dipping down into raw SQL wasn't my favorite activity, but it got progressively more fun over time. So happy querying, I guess.