Hints for Optimizing a SQL Query

A few weeks ago, I fielded a bug report about an API endpoint that was timing out and showing a 500 error to a client. For one of our seller sites, there just seemed to be too much data to return efficiently in one request. The bug was affecting a handful of other sellers, but only this one was consistently timing out.

Since diving more deeply into query optimization was new to me, I thought it would be interesting and instructive to document my process.

1. Gather some metrics about the timeout

The first step was to replicate and understand the bug. We located the endpoint and looked at its performance metrics in our tools -- New Relic for APM, Datadog for additional logs, Honeybadger for exceptions.

Datadog and Honeybadger both reported timeout exceptions. Looking at the transaction in New Relic, we viewed a flamegraph breakdown of the entire request and identified two slow queries within the same request as the cause of the bottleneck.

Two queries were each taking about 18 seconds to complete! Wild.

2. Read through the code and convert it to SQL

Our next step was to dive deeper into the code to understand where the two queries were coming from. The code was pretty gnarly -- part Active Record, part parameterized SQL fragments -- so we converted it to raw SQL, plugging in the parameters from the most problematic request.

As part of investigating the code, we noticed that one query was required to return the paginated event data for the request, but the other query was returning a count of the total events.

Apparently, with the volume of data we had, the query to count the events took just as long as the query to return the events themselves!
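In shape, the pair looked something like this (table and column names here are hypothetical, not our actual schema):

```sql
-- Query 1: one page of event data
SELECT events.*
FROM events
JOIN orders ON orders.id = events.order_id
WHERE events.seller_id = $1
  AND events.created_at BETWEEN $2 AND $3
ORDER BY events.created_at DESC
LIMIT 100 OFFSET 1000;

-- Query 2: the total count, used to compute total pages
SELECT COUNT(*)
FROM events
JOIN orders ON orders.id = events.order_id
WHERE events.seller_id = $1
  AND events.created_at BETWEEN $2 AND $3;
```

Both queries have to walk the same joined, filtered row set, which is why the count was just as slow as the data query.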

3. Run the SQL query planner against a production replica

Next, we attempted to optimize the query using Postgres's EXPLAIN ANALYZE command. For every query it receives, PostgreSQL devises a query plan to get the data; EXPLAIN ANALYZE actually executes the query and reports that plan along with real row counts and timings. Sometimes you can rearrange a query to help Postgres find a better plan.

So, that was our next step, getting the query plan. We generated it through the PostgreSQL client, Postico, which I use to read from the databases in our various environments.
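If you're following along, you just prefix the query with EXPLAIN. One note of caution: the ANALYZE option runs the query for real, which is part of why we used a production replica rather than the primary. JSON output is handy if you plan to feed the plan into a visualizer:

```sql
-- Runs the query and returns the plan with actual timings and buffer usage
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT COUNT(*)
FROM events              -- hypothetical table standing in for our real query
WHERE events.seller_id = $1;
```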

4. Visualize the bottlenecks

Once we had obtained the query plan, we pasted it into Dalibo's query plan visualizer to see which indexes Postgres was actually using and whether there was something we could do to speed up the queries.

There were a ton of joins on massive tables and a lot of filters based on date ranges, offsets, limits, and so on.

5. Experiment and evaluate the costs and benefits of each approach

With these tools, we continued experimenting with different approaches. By cutting out some of the cruft in the count query, we were able to speed it up by 44%, but that still wasn't enough to bring the entire request consistently below our 29-second Rack timeout.
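To illustrate the kind of cruft we cut (with a hypothetical, simplified example): a count query copied from a data query often drags along joins that only existed to populate SELECTed columns. If a join cannot change which rows match -- say, an inner join on a NOT NULL foreign key -- the count doesn't need it:

```sql
-- Before: joins inherited from the data query
SELECT COUNT(*)
FROM events
JOIN orders   ON orders.id = events.order_id      -- order_id is NOT NULL, join never drops rows
JOIN products ON products.id = orders.product_id  -- only needed for displayed columns
WHERE events.seller_id = $1;

-- After: same count, far less work for the planner
SELECT COUNT(*)
FROM events
WHERE events.seller_id = $1;
```

This rewrite is only safe when the dropped joins genuinely can't filter or duplicate rows, so it's worth verifying the counts match before shipping.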

So instead, we ended up changing the interface of the API endpoint. It was a private, undocumented endpoint used by an analytics provider, so we worked closely with them to make the change.

The solution for this bug was to introduce an optional query parameter that skips the count query. Rather than including the count and total pages in our response, the client hits the endpoint repeatedly until it receives an empty result set. Eliminating the count query entirely effectively cut the response time in half.
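Under the new scheme, each request is just the data query for one page, and the stopping condition lives on the client side (again with hypothetical names):

```sql
-- One page of results; no COUNT query at all.
-- The client increments the page number and stops when zero rows come back.
SELECT events.*
FROM events
WHERE events.seller_id = $1
ORDER BY events.created_at DESC
LIMIT 100 OFFSET $2;  -- client passes page_number * 100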

Now, our clients can get the data they need, which is pretty exciting. And...this should buy us time for a while. In the future, more work may be required to pre-process this data somewhere else.