Handling time series window functions in data science interviews

Data scientists work with time series data on a daily basis, and being able to manipulate and analyze it is a necessary part of the job. SQL window functions let you do just that, and they come up often in data science interviews. So let's talk about what time series data is, where it shows up, and how to use window functions to work with it.

What is time series data?

Time series data are variables in your dataset that have a time component: each value is associated with a date, a time, or both. Here are some examples of time series data:

• The daily share price of a company, because each price is associated with a specific day

• The average daily value of a stock index over the past few years, because each value is assigned to a specific day

• Unique visits to a website during a month

• Daily platform logs

• Monthly sales and revenue

• Daily logins for an application

LAG and LEAD window functions

When dealing with time series data, a common task is to calculate growth or averages over time. To do that, you need to pair each row's value with the value from an earlier or a later date.

Two window functions that let you do this are LAG and LEAD, which are extremely useful for handling time-related data. The main difference is direction: LAG gets data from previous rows, while LEAD gets data from following rows.
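To make the difference concrete, here is a minimal sketch that runs both functions side by side. It uses Python's built-in sqlite3 module rather than the PostgreSQL syntax used later in this article, and it assumes the bundled SQLite is version 3.25 or newer (the first release with window functions). The monthly_sales table and its values are made up for illustration.

```python
import sqlite3

# Hypothetical toy data: the table name and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?)",
    [("2023-01", 100), ("2023-02", 120), ("2023-03", 90)],
)

# LAG looks one row back and LEAD looks one row ahead,
# both following the ORDER BY inside the OVER clause.
rows = conn.execute(
    """
    SELECT month,
           sales,
           LAG(sales, 1)  OVER (ORDER BY month) AS previous_sales,
           LEAD(sales, 1) OVER (ORDER BY month) AS next_sales
    FROM monthly_sales
    ORDER BY month
    """
).fetchall()

for row in rows:
    print(row)
# ('2023-01', 100, None, 120)
# ('2023-02', 120, 100, 90)
# ('2023-03', 90, 120, None)
```

The first row has no previous row and the last row has no next row, so LAG and LEAD return NULL there (shown as None in Python).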

We can use either function to compute month-over-month growth, for example. As a data analytics professional you'll most likely work with time-related data, and being comfortable with LAG and LEAD will make you much more productive.

A data science interview question that requires a window function

Let's discuss an advanced SQL data science interview question that relies on a window function. Window functions are often part of interview questions, and you'll also see them a lot in your day-to-day work, so it's important to know how to use them.

Let’s look at an Airbnb question called Airbnb growth. If you want to follow it interactively, you can do it here.

The question asks you to estimate Airbnb's growth each year, using the number of registered hosts as the growth metric. The growth rate is calculated as ((number of hosts registered in the current year - number of hosts registered in the previous year) / number of hosts registered in the previous year) * 100.

Output the year, the number of hosts in the current year, the number of hosts in the previous year, and the growth rate. Round the growth rate to the nearest percent and sort the result by year in ascending order.
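Before writing any SQL, it can help to check the formula on a couple of numbers. The sketch below is plain Python arithmetic, not part of the interview solution. The growth_rate helper and the host counts are hypothetical.

```python
def growth_rate(current_hosts, previous_hosts):
    """Year-over-year growth in percent, rounded to a whole number.

    Returns None when there is no previous year to compare against
    (or when the previous count is zero), mirroring a NULL in SQL.
    """
    if not previous_hosts:
        return None
    return round((current_hosts - previous_hosts) / previous_hosts * 100)


# Hypothetical host counts, chosen so the results are easy to verify by hand.
print(growth_rate(150, 100))   # 50
print(growth_rate(90, 120))    # -25
print(growth_rate(100, None))  # None
```

Going from 100 to 150 hosts is a 50 percent increase, and going from 120 to 90 is a 25 percent decrease, which matches the formula above.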

Approach Step 1: Count the hosts for the current year

The first step is to count hosts by year, so we’ll need to extract the year from the date values.

SELECT extract(year FROM host_since::date) AS year,
       count(id) AS host_current_year
FROM airbnb_search_details
WHERE host_since IS NOT NULL
GROUP BY extract(year FROM host_since::date)
ORDER BY year

Approach Step 2: Count the hosts from the previous year

This is where you use the LAG window function. Here you'll create a view of the data with the year, the number of hosts in that year, and the number of hosts in the previous year. LAG takes last year's count and puts it in the same row as this year's count, so you end up with three columns: year, current-year host count, and previous-year host count. With all the values you need in one row, a metric such as a growth rate becomes easy to calculate in your query. Here is the code for it:

SELECT year,
       host_current_year,
       LAG(host_current_year, 1) OVER (ORDER BY year) AS host_previous_year
FROM
  (SELECT extract(year FROM host_since::date) AS year,
          count(id) AS host_current_year
   FROM airbnb_search_details
   WHERE host_since IS NOT NULL
   GROUP BY extract(year FROM host_since::date)
   ORDER BY year) t1

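Approach Step 3: Implement the growth metric

As mentioned above, it's much easier to implement a metric like this when all the values are in one row, which is why we applied the LAG function. Implement the growth rate calculation as round(((host_current_year - host_previous_year) / (cast(host_previous_year AS numeric))) * 100) estimated_growth. The cast to numeric avoids integer division, which would truncate the ratio. The first year has no previous row, so LAG returns NULL there and the growth rate for that year is NULL as well.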

SELECT year,
       host_current_year,
       host_previous_year,
       round(((host_current_year - host_previous_year) / (cast(host_previous_year AS numeric))) * 100) estimated_growth
FROM
  (SELECT year,
          host_current_year,
          LAG(host_current_year, 1) OVER (ORDER BY year) AS host_previous_year
   FROM
     (SELECT extract(year FROM host_since::date) AS year,
             count(id) AS host_current_year
      FROM airbnb_search_details
      WHERE host_since IS NOT NULL
      GROUP BY extract(year FROM host_since::date)
      ORDER BY year) t1) t2
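If you want to try the whole approach without a PostgreSQL database, the following sketch runs an equivalent query on a tiny in-memory SQLite table through Python's sqlite3 module. Several things here are assumptions and not part of the original question: the rows are invented, strftime('%Y', ...) stands in for PostgreSQL's extract(year FROM ...), CAST(... AS REAL) stands in for the numeric cast, and SQLite 3.25 or newer is required for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airbnb_search_details (id INTEGER, host_since TEXT)")

# Invented sample rows: 2 hosts in 2012, 3 in 2013, 6 in 2014,
# plus one row with a missing host_since that the query must ignore.
sample = [(1, "2012-03-01"), (2, "2012-07-15"),
          (3, "2013-01-10"), (4, "2013-05-20"), (5, "2013-11-02"),
          (6, "2014-02-14"), (7, "2014-03-01"), (8, "2014-06-09"),
          (9, "2014-08-21"), (10, "2014-09-30"), (11, "2014-12-05"),
          (12, None)]
conn.executemany("INSERT INTO airbnb_search_details VALUES (?, ?)", sample)

query = """
SELECT year,
       host_current_year,
       host_previous_year,
       CAST(round(((host_current_year - host_previous_year)
                   / CAST(host_previous_year AS REAL)) * 100) AS INTEGER)
           AS estimated_growth
FROM
  (SELECT year,
          host_current_year,
          LAG(host_current_year, 1) OVER (ORDER BY year) AS host_previous_year
   FROM
     (SELECT CAST(strftime('%Y', host_since) AS INTEGER) AS year,
             count(id) AS host_current_year
      FROM airbnb_search_details
      WHERE host_since IS NOT NULL
      GROUP BY strftime('%Y', host_since)) t1) t2
ORDER BY year
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
# (2012, 2, None, None)
# (2013, 3, 2, 50)
# (2014, 6, 3, 100)
```

The first year has no previous count, so both the lagged column and the growth rate come back as NULL. The outer ORDER BY is added here so that the final ordering does not depend on the order of the inner subquery.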

About Me

Dashy
