Qualify
Snowflake ·This is the second post about some Snowflake features, that I use often and find useful to write a better SQL. You can find the first of the series about aliasing here.
What is it?
Have you ever been in the situation where you want to keep all rows of your query but than filter the result set based on a condition you would get from a subset of it? Let me clarify this with some examples:
- Suppose you are tracking all the interactions with a website and you want to see what happened after a specific action took place: then you would want to keep all events for those customers that have a specific event in their event stream.
- Suppose you sell multiple products per customers and you want to see all the purchase history of the customers that have a very specific product as last purchase.
Let’s stick with this second example. In that case you would have a table looking like this, let’s call it purchases
:
customer_id | product | purchased_date (YYYY-MM-DD) |
---|---|---|
1 | A | 2020-04-01 |
1 | B | 2020-05-03 |
2 | A | 2020-03-01 |
3 | B | 2020-04-01 |
So customer 1 and 3 last purchase was product B, and customer 2 last purchase was product A. Now suppose you want to filter in your data only the customers that purchased B as last product, for an analysis you’re doing.
What you would do, without using the qualify statement, would be something like this:
with product_purchased_order as (
select
customer_id
, product
, row_number() over (partition by customer_id
order by purchased_date desc) as n_last_order_nr
from purchases
)
select customer_id
from product_purchased_order
where n_last_order_nr = 1 and product ='B'
You see that I have to use at least two queries: first the window statement and the where clause to filter the results I want. There are many other way to get to the same result, but let’s see how to simplify the query.
The qualify
expression allows to rewrite the statement above to read nicely and in one single query. Here you go:
select
customer_id
, product
, last_value( product ) over (partition by customer_id
order by purchased_date) as last_product_purchased
from purchases
qualify last_product_purchased = 'B'
Note the benefits of using this second option:
- I can write only one query
- I can keep in the same query all the fields in the base table: in this case all the purchase history, even though I am just keeping the customer that have bought B as last product – I can still see all the products they purchased.
When analysing something, you often want to filter by a condition on a partition, but keep all the records at a lower granularity – that is where the qualify
come in handy. Note also how using the alias in the query above made the statement more clear.
Where does the qualify clause this fit?
From the Snowflake general query syntax reference you get this general outline:
[ WITH ... ]
SELECT
[ TOP <n> ]
...
[ FROM ...
[ AT | BEFORE ... ]
[ CHANGES ... ]
[ CONNECT BY ... ]
[ JOIN ... ]
[ MATCH_RECOGNIZE ... ]
[ PIVOT | UNPIVOT ... ]
[ VALUES ... ]
[ SAMPLE ... ] ]
[ WHERE ... ]
[ GROUP BY ...
[ HAVING ... ] ]
[ QUALIFY ... ]
[ ORDER BY ... ]
[ LIMIT ... ]
So, the qualify
sit in the query order right after the having
(if present) but before the order by
and the limit
clauses.
It’s also possible to not include the window function you are filtering by in the select list, which is good in case you don’t want to show the value you are filtering on, for example in the previous query something like:
select
customer_id
, product
from purchases
qualify last_value( product ) over (partition by customer_id
order by purchased_date) = 'B'
Which could be good if the additional field is redundant or used just for filtering.
That’s it for today! Hope you’ll start getting rid of some extra CTEs / sub-queries, thanks to the qualify
or, why not, have better time writing the logic to get some fancy slice of your data, based on whatever complex condition.
Have you used the qualify
statement? What are the pros and cons about it? Any warnings that you want to share? Let me know!