ActiveRecord is a great tool for making SQL a more readable. However, there
are some things that are really difficult to do in pure ActiveRecord. One of those is
a SELECT * FROM (subquery)
style of query. Before writing it in a raw
SQL query, check out the following tip for preserving some ActiveRecord goodness.
Our database structure
The way the above works is that a User is able to attend the Concert of a particular
Artist. An Artist has_many
Concerts (since they perform in multiple locations) and a
User can Attend as many Concerts as they want. A User is assumed as having attended a particular Artist’s Concert if they have a valid Attendance record. The resulting models might look
like the following code:
class User < ApplicationRecord
has_many :attendances
has_many :attended_concerts, through: :attendances, source: :concert
has_many :attended_artists, through: :attended_concerts, source: :artist
end
class Attendance < ApplicationRecord
belongs_to :user
belongs_to :concert
end
class Concert < ApplicationRecord
has_many :attendances
belongs_to :artist
end
class Artist < ApplicationRecord
has_many :concerts
has_many :attendances, through: :concerts
end
Now that we’ve defined a simple database structure and model code, let’s say our task is to return a table that contains the following:
Association source option
The :source option specifies the source association name for a has_many :through association. You only need to use this option if the name of the source association cannot be automatically inferred from the association name.
- guides.rubyonrails.org
- The first column should display the Artist of the Concert
- Then the user whom attended
- (Here’s the tricky part) The user’s latest attendance if they have seen an artist more than once.
- We want Attendances sorted by newest first in the results
Here’s an example of desired output:
Artist | Attendee | Latest Attendance |
---|---|---|
Pretty Lights | Sean Bean | Mon, 16 Jul 2018 15:39:36 EDT -04:00 |
Bassnectar | Emilia Clarke | Wed, 06 Jun 2018 15:42:02 EDT -04:00 |
Kodomo | Kit Harington | Fri, 27 Apr 2018 15:42:17 EDT -04:00 |
For our above example, let’s also specify that Sean Bean has attended a Pretty Lights concert three times in the last year. We only want to see his most latest attendance which happens to be July 16th, 2018 above.
Getting into the SQL
With the above specifications we could write some SQL code to query for the results like so:
SELECT art.name as artist,
u.name as attendee,
MAX(c.start_time) as latest_attendance
FROM users u
INNER JOIN attendances a ON a.user_id = u.id
INNER JOIN concerts c ON c.id = a.concert_id
INNER JOIN artists art ON art.id = c.artist_id
GROUP BY art.name, u.name
ORDER BY latest_attendance DESC
Here’s the above SQL translated to ActiveRecord:
User
.all
.joins(attendances: { concert: :artist })
.select("
users.name as attendee,
artists.name as artist,
MAX(concerts.start_time) as latest_attendance
")
.group(:id, "artist")
.order("latest_time DESC")
We’re done! Right? Well not quite. As with any real world project, there’s a new client request.
There’s always another requirement
The client now wants to display the Concert’s location along with latest attendance for each User / Artist combination. That seems like it would be as easy as adding location into the select clause and grouping by it with .group(:id, "artist, latest_attendance")
but unfortunately that would only give you latest_attendance
specified by unique locations
as you can see below.
Artist | Attendee | Latest Attendance | Location |
---|---|---|---|
Pretty Lights | Sean Bean | Mon, 16 Jul 2018 15:39:36 EDT -04:00 | Los Angeles, CA |
Pretty Lights | Sean Bean | Sat, 14 Jul 2018 19:00:18 EDT -04:00 | Atlanta, GA |
Pretty Lights | Sean Bean | Tue, 10 Jul 2018 19:00:46 EDT -04:00 | Indianapolis, IN |
Bassnectar | Emilia Clarke | Wed, 06 Jun 2018 15:42:02 EDT -04:00 | Houston, TX |
Kodomo | Kit Harington | Fri, 27 Apr 2018 15:42:17 EDT -04:00 | Miami, FL |
This is great information but isn’t quite accurate to the client’s requirements. (Also, apparently Sean Bean has been flying all over the country and is a huge Pretty Lights fan. Who knew?)
PostgreSQL DISTINCT ON
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal.
- postgresql.org
In order to properly select the latest_attendance
while keeping it in sync with the location we’ll need to use a subquery to keep them associated with each other while filtering it to only return a single row. We can accomplish this with PostgreSQL’s DISTINCT ON
clause.
Using DISTINCT ON
we can eliminate returned rows by making the results unique to the selected content. Let’s see if we can write the proper SQL for this.
SELECT * FROM (
SELECT DISTINCT ON(art.id)
art.name as artist,
a.name as attendee,
c.start_time as latest_attendance
FROM users u
INNER JOIN attendances a ON a.user_id = u.id
INNER JOIN concerts c ON c.id = a.concert_id
INNER JOIN artists art ON art.id = c.artist_id
ORDER BY art.id, latest_attendance DESC
) inner_query
ORDER BY inner_query.latest_attendance DESC
The above works by ensuring that all the results are distinct by the artists.id
. We then order the subquery by latest_attendance descending in order to grab the latest start time. However, because we’re using the DISTINCT ON
clause for filtering the order by clause has to contain the same constraint of art.id
as the first argument like so: ORDER BY art.id, latest_attendance DESC
. This means that the records aren’t quite in the right order because the Artist’s id takes precedence over the latest_attendance in the sort order. The magic happens when we also order the subquery in the outer query by its latest_attendance descending ORDER BY inner_query.latest_attendance DESC
. This ensures that we only are looking at the latest result for a given Artist.
An alternative syntax for the above query could be to use a Common Table Expression via the WITH
keyword in PostgreSQL. It could be considered a more readable query.
WITH inner_query AS (
SELECT DISTINCT ON(art.id)
art.name as artist,
a.name as attendee,
c.start_time as latest_attendance
FROM users u
INNER JOIN attendances a ON a.user_id = u.id
INNER JOIN concerts c ON c.id = a.concert_id
INNER JOIN artists art ON art.id = c.artist_id
ORDER BY art.id, latest_attendance DESC
)
SELECT *
FROM inner_query
ORDER BY inner_query.latest_attendance DESC
At this point you could just run the above raw SQL directly using ActiveRecord::Base.connection.execute(raw_sql)
and that would be a perfectly reasonable way of accomplishing our goal.
Here we have the final dataset. Note that while Sean Bean attended 3 Pretty Lights Concerts we only wanted the information regarding his latest attendance with a matching Concert location.
Artist | Attendee | Latest Attendance | Location |
---|---|---|---|
Pretty Lights | Sean Bean | Mon, 16 Jul 2018 15:39:36 EDT -04:00 | Los Angeles, CA |
Bassnectar | Emilia Clarke | Wed, 06 Jun 2018 15:42:02 EDT -04:00 | Houston, TX |
Kodomo | Kit Harington | Fri, 27 Apr 2018 15:42:17 EDT -04:00 | Miami, FL |
However, by doing this in raw SQL we lose ActiveRecord's
built in helpers and chainability as we are instead returning a PG::Result object
.
raw_sql = <<~SQL
SELECT * FROM (
SELECT DISTINCT ON(art.id)
art.name as artist,
u.name as attendee,
c.start_time as latest_attendance
FROM users u
INNER JOIN attendances a ON a.user_id = u.id
INNER JOIN concerts c ON c.id = a.concert_id
INNER JOIN artists art ON art.id = c.artist_id
ORDER BY art.id, latest_attendance DESC
) inner_query
ORDER BY inner_query.latest_attendance DESC
SQL
result = ActiveRecord::Base.connection.execute(raw_sql)
result #=> #<PG::Result:0x000000000db346a8 status=PGRES_TUPLES_OK ntuples=5 nfields=29 cmd_tuples=5>
PG::Result object’s are still pretty useful giving you access to a implementation of #each
, #values
, and #to_a
. However, we’d like the whole slew of ActiveRecord collection methods and helper available. Let’s fix that!
Converting this back into ActiveRecord
Making this work in ActiveRecord is hard. There isn’t a lot of clear guidance of how to accomplish this. I was able to dig through enough forums and documentation until I happened upon the following entry on ActiveRecord from.
The ActiveRecord
from method generally just specifies the table you
are querying against. However, it also allows you to pass in built ActiveRecord collections and query from those. This is accomplishes by building the subquery and passing it into the from
method on a valid ActiveRecord scope.
# This example builds equivalent SQL to the previous raw_sql example
# with the SELECT * FROM (DISTINCT ON ...)
inner_query = User
.all
.joins(attendances: { concert: :artist })
.select("DISTINCT ON(artists.id)
artists.name as artist,
users.name as attendee,
concerts.start_time as latest_time
")
.order("
artists.id,
latest_time DESC
")
User
.unscoped
.select("*")
.from(inner_query, :inner_query)
.order("inner_query.latest_time DESC")
Note that in the above query we use the second argument in the from clause to specify
the alias for the table .from(inner_query, :inner_query)
. This allows us to more easily order by the latest_time in the order by clause.
Additionally, the unscoped
just gives us a basic
ActiveRecord Association to use for querying. I used User
as the starting
point above but really you can use any model to achieve the same results. This is
because we’re only querying from the data within the subquery and not the actual User
model directly.
Conclusion
And there you have it. We’ve fully moved our subquery into ActiveRecord and can now enjoy all the conveniences that it provides. It took a little tweaking to the built in ActiveRecord from clause by passing in a subquery but it works like a charm. Rock on!
What did you think about this technique? Let me know by leaving a comment below and thanks for reading.
Join the conversation