top of page

Breaking into Amazon by passing SQL interview - Customer Orders

  • Writer: Christina
    Christina
  • Jun 12, 2022
  • 4 min read

Updated: May 8, 2024

Regularly challenging oneself is a habit I'm trying to build. So what better first post for this blog than starting right away. Today, I will be solving a SQL exercise coming from a technical interview at Amazon.


As a Data Engineer, SQL is my mother tongue. I use it every day for a lot of different tasks. Like any language, practice is the key the keep fluency and accuracy.


Our question today is: Identify the customers who placed more than three transactions each in both 2019 and 2020. The database looks like the following:


sql modelisation example

The expected output is a single column containing the list of customers' names.


I am going to build a SQL script that is going to run directly on the database to get the expected result.


Let's begin coding!


A quick mention about Interview Query for SQL


For a couple of years, I've subscribed to Interview Query (this post is not sponsored by them). It's an awesome learning platform to prepare for technical interviews. From Amazon to Google, they have a huge database of questions and exercises for you to use. This topic came from their platform and they also provide the database for you to use which is very convenient. You can directly run the queries and see the end result.


I just wanted to share with you the platform as they are doing an amazing job for supporting the tech community. Now let's go back to our main problem.


Let's get started for real this time!


Step 1: Let's keep the transactions of 2019 and 2020


The volume of data is a relevant value to have. But why? Because I need to have a high-level knowledge of the size of my database. I'm not thinking right away about optimization but these numbers will give me the right instinct about the answer. If the volume is incoherent with this then I know that I must have made a mistake somewhere. To know the number of transactions per year, use the following query:


SELECT YEAR(created_at) as year, count(*) as nb_transactions
FROM transactions
WHERE YEAR(created_at) IN (2019,2020)
GROUP BY YEAR(created_at)
ORDER BY YEAR(created_at) ASC;

Step 2: Let's focus on 2019 and retrieve the customer

Let's observe closely how many transactions our customers did in 2019. To do so, let's add the customer's name (because it is better than an ID) and count how many transactions there are for each customer.


SELECT name, count(*) as nb_transactions
FROM transactions as t
INNER JOIN users as u ON t.user_id = u.id
WHERE YEAR(created_at) = 2019
GROUP BY name
ORDER BY count(*) ASC;

Now let's focus on the second part of the question. Because the logic is the same for each year, you can copy/paste the query of 2019 and change the year.


-- Situation in 2019
SELECT name 
FROM transactions as t
INNER JOIN users as u ON t.user_id = u.id
WHERE YEAR(created_at) = 2019
GROUP BY name, YEAR(created_at)
HAVING COUNT(*) > 3 
ORDER BY name ASC;

-- Situation in 2020
SELECT name 
FROM transactions as t
INNER JOIN users as u ON t.user_id = u.id
WHERE YEAR(created_at) = 2020
GROUP BY name, YEAR(created_at)
HAVING COUNT(*) > 3 
ORDER BY name ASC;

Each query will give me the list of customers who did more than three transactions. Below find the results based on my data sample (I ran them directly on a made-up database but if you use InterviewQuery the result is going to be different). If you have yours, you can skip this part.


For 2019, we find the following result:

  • Cammy Whitelaw

  • Darby Rayman

  • Erminia Cowterd

  • Everlyn Kembley

  • Jewell Karim

  • Ronnie Simper

For 2020, we find the following result:

  • Darby Rayman

  • Erminia Cowterd

  • Huberto Bradnum


Now the ultimate step consists of selecting only the customers who appear in both results. So based on my example, the final answer should be:

  • Darby Rayman

  • Erminia Cowterd


Step 3: Identify the customers and answer the question


We have several options to store the results. I'm assuming that we only need the immediate answer so I'm building a one-shot query for this one. Otherwise, I would have built a view for making that information available and reusable.


-- List of customers with more than 3 transactions in both 2019 and 2020
SELECT cust_2019.name as name
FROM (
  -- Situation in 2019
  SELECT name --, count(*) as nb_transactions
  FROM transactions as t
  INNER JOIN users as u ON t.user_id = u.id
  WHERE YEAR(created_at) = 2019
  GROUP BY name, YEAR(created_at)
  HAVING COUNT(*) > 3
  -- ORDER BY name ASC;
) as cust_2019
INNER JOIN (
  -- Situation in 2020
  SELECT name --, count(*) as nb_transactions
  FROM transactions as t
  INNER JOIN users as u ON t.user_id = u.id
  WHERE YEAR(created_at) = 2020
  GROUP BY name, YEAR(created_at)
  HAVING COUNT(*) > 3 
  -- ORDER BY name ASC;
) AS cust_2020 
ON cust_2019.name = cust_2020.name;

Final step: Review and optimize


Awesome! We answered the question but let's watch closely our query again. Maybe doing the JOIN statement so soon is not relevant. Based on the same logic, we can work with the table transactions and manipulate the user_id instead of the name. So a possible alternative could be the following:


SELECT u.name as customer_name
FROM (
  -- Situation in 2019
  SELECT user_id 
  FROM transactions 
  WHERE YEAR(created_at) = 2019
  GROUP BY user_id, YEAR(created_at)
  HAVING COUNT(*) > 3
) as cust_2019

INNER JOIN (
  -- Situation in 2020
  SELECT user_id
  FROM transactions 
  WHERE YEAR(created_at) = 2020
  GROUP BY user_id, YEAR(created_at)
  HAVING COUNT(*) > 3 
) AS cust_2020 ON cust_2019.user_id = cust_2020.user_id

INNER JOIN users as u ON cust_2019.user_id = u.id ;

The final result is the list of customers below. Congratulation, you've finished this exercise given by Amazon.

CUSTOMER

Darby Rayman

Erminia Cowterd

See you at the next practice!


source: direct link

bottom of page