SQL is the language of data engineering interviews. Not Python. Not Spark. SQL. Even roles that are 80% pipeline work will open with SQL questions to establish a baseline, and the advanced questions — window functions, query optimization, recursive CTEs — separate strong candidates from the rest.
This guide covers 50 SQL questions organized by difficulty, with complete working answers. Every query is written for readability. Adapt syntax as needed for your target database (Postgres, BigQuery, Snowflake, etc.).
Basic (Questions 1–10)
These questions establish fundamentals. You should answer them instantly — hesitation here is a yellow flag.
Q1. What is the difference between WHERE and HAVING?
WHERE filters rows before aggregation. HAVING filters after aggregation.
-- WHERE filters individual rows
SELECT department, AVG(salary)
FROM employees
WHERE status = 'active'
GROUP BY department;
-- HAVING filters aggregated groups
SELECT department, AVG(salary) AS avg_sal
FROM employees
GROUP BY department
HAVING AVG(salary) > 80000;
Q2. What does GROUP BY do, and what columns must appear in SELECT?
GROUP BY collapses rows with the same value into one row per group. Every column in SELECT must either appear in GROUP BY or be wrapped in an aggregate function (SUM, COUNT, MAX, etc.).
SELECT department, COUNT(*) AS headcount, MAX(salary) AS top_salary
FROM employees
GROUP BY department;
Q3. Write a query to find all employees who earn more than $100,000.
SELECT employee_id, name, salary
FROM employees
WHERE salary > 100000
ORDER BY salary DESC;
Q4. What is the difference between INNER JOIN and LEFT JOIN?
INNER JOIN returns only rows that match in both tables. LEFT JOIN returns all rows from the left table, with NULLs for non-matching rows on the right.
-- INNER JOIN: only employees who have a department
SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.id;
-- LEFT JOIN: all employees, even those with no department
SELECT e.name, d.department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.id;
Q5. How do you remove duplicate rows from a result set?
-- Remove duplicates using DISTINCT
SELECT DISTINCT department_id
FROM employees;
-- Or using GROUP BY
SELECT department_id
FROM employees
GROUP BY department_id;
Q6. Write a query to count the number of employees per department.
SELECT department_id, COUNT(*) AS employee_count
FROM employees
GROUP BY department_id
ORDER BY employee_count DESC;
Q7. What is the difference between TRUNCATE and DELETE?
DELETE removes rows one at a time, fires triggers, and can be rolled back. TRUNCATE removes all rows at once, does not fire row-level triggers, is faster, and in most databases cannot be rolled back inside a transaction.
Q8. Write a query to find the second highest salary in the employees table.
SELECT MAX(salary) AS second_highest
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
Q9. What does COALESCE do?
COALESCE returns the first non-NULL value from a list of expressions. It's the standard null-handling function.
SELECT name, COALESCE(phone, email, 'no contact') AS contact_method
FROM users;
Q10. Write a query to get the total salary per department, only showing departments with more than 5 employees.
SELECT department_id, SUM(salary) AS total_salary, COUNT(*) AS headcount
FROM employees
GROUP BY department_id
HAVING COUNT(*) > 5
ORDER BY total_salary DESC;
Intermediate (Questions 11–30)
These questions test real SQL fluency — joins, subqueries, and the patterns that appear constantly in data pipelines.
Q11. What is a self join? Write an example.
A self join joins a table to itself. Common use: employee-manager hierarchy.
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id;
Q12. What is a CROSS JOIN and when would you use it?
A cross join returns the Cartesian product — every row in the left table combined with every row in the right table. Use cases: generating all combinations, date spine generation.
-- Generate all product-region combinations
SELECT p.product_name, r.region_name
FROM products p
CROSS JOIN regions r;
Q13. Write a query to find employees who exist in one table but not another (anti-join).
-- Using NOT EXISTS (generally fastest)
SELECT e.employee_id, e.name
FROM employees e
WHERE NOT EXISTS (
SELECT 1 FROM payroll p WHERE p.employee_id = e.employee_id
);
-- Using LEFT JOIN + NULL check
SELECT e.employee_id, e.name
FROM employees e
LEFT JOIN payroll p ON e.employee_id = p.employee_id
WHERE p.employee_id IS NULL;
Q14. What is a CTE (Common Table Expression) and how does it differ from a subquery?
A CTE is a named temporary result set defined with WITH. It improves readability, can be referenced multiple times in the same query, and in some databases enables recursion. Subqueries are inline and cannot be reused.
WITH dept_totals AS (
SELECT department_id, SUM(salary) AS total_salary
FROM employees
GROUP BY department_id
),
dept_avg AS (
SELECT AVG(total_salary) AS company_avg FROM dept_totals
)
SELECT d.department_id, d.total_salary, a.company_avg
FROM dept_totals d, dept_avg a
WHERE d.total_salary > a.company_avg;
Q15. Find all employees whose salary is above their department average.
SELECT e.name, e.salary, e.department_id
FROM employees e
WHERE e.salary > (
SELECT AVG(salary)
FROM employees
WHERE department_id = e.department_id
);
Q16. Write a query to find duplicate email addresses in a users table.
SELECT email, COUNT(*) AS occurrences
FROM users
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;
Q17. What is the difference between UNION and UNION ALL?
UNION deduplicates results (slower). UNION ALL returns all rows including duplicates (faster). Use UNION ALL by default unless you specifically need deduplication.
-- UNION: unique rows only
SELECT user_id FROM paid_users
UNION
SELECT user_id FROM trial_users;
-- UNION ALL: includes duplicates
SELECT user_id FROM paid_users
UNION ALL
SELECT user_id FROM trial_users;
Q18. Write a query to pivot monthly sales data (rows to columns).
SELECT
product_id,
SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 1 THEN amount ELSE 0 END) AS jan,
SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 2 THEN amount ELSE 0 END) AS feb,
SUM(CASE WHEN EXTRACT(MONTH FROM sale_date) = 3 THEN amount ELSE 0 END) AS mar
FROM sales
WHERE EXTRACT(YEAR FROM sale_date) = 2026
GROUP BY product_id;
Q19. Write a query to get a running total of sales by date.
SELECT
sale_date,
daily_total,
SUM(daily_total) OVER (ORDER BY sale_date) AS running_total
FROM (
SELECT sale_date, SUM(amount) AS daily_total
FROM sales
GROUP BY sale_date
) daily_sales
ORDER BY sale_date;
Q20. Explain the difference between correlated and non-correlated subqueries.
A non-correlated subquery executes once and returns a fixed result. A correlated subquery references columns from the outer query, executing once per row — it's slower but sometimes necessary.
-- Non-correlated: runs once
SELECT name FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- Correlated: runs once per employee row
SELECT name FROM employees e1
WHERE salary > (
SELECT AVG(salary) FROM employees e2
WHERE e2.department_id = e1.department_id
);
Q21. Write a query to find the top 3 earners per department.
WITH ranked AS (
SELECT
name,
department_id,
salary,
ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rn
FROM employees
)
SELECT name, department_id, salary
FROM ranked
WHERE rn <= 3;
Q22. What is the difference between ROW_NUMBER, RANK, and DENSE_RANK?
ROW_NUMBER: unique sequential number, no tiesRANK: tied rows get the same rank, next rank skips (1, 2, 2, 4)DENSE_RANK: tied rows get the same rank, next rank does not skip (1, 2, 2, 3)
SELECT
name,
salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
RANK() OVER (ORDER BY salary DESC) AS rnk,
DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
FROM employees;
Q23. Write a query to calculate month-over-month revenue change.
WITH monthly AS (
SELECT
DATE_TRUNC('month', sale_date) AS month,
SUM(amount) AS revenue
FROM sales
GROUP BY 1
)
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month,
ROUND(
(revenue - LAG(revenue) OVER (ORDER BY month)) /
NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100,
2
) AS pct_change
FROM monthly
ORDER BY month;
Q24. Write a query to delete duplicate rows, keeping only the one with the highest ID.
DELETE FROM users
WHERE id NOT IN (
SELECT MAX(id)
FROM users
GROUP BY email
);
Q25. What is a window function? Name five common ones.
A window function performs a calculation across a set of rows related to the current row, without collapsing the result like GROUP BY does. The window is defined with OVER.
Common window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM, AVG, FIRST_VALUE, LAST_VALUE, NTILE.
Q26. Write a query to calculate a 7-day rolling average of daily signups.
SELECT
signup_date,
daily_signups,
AVG(daily_signups) OVER (
ORDER BY signup_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_avg
FROM daily_signup_counts
ORDER BY signup_date;
Q27. What does PARTITION BY do inside a window function?
PARTITION BY divides the result set into groups (partitions) for the window calculation, resetting the window function for each group — similar to GROUP BY but without collapsing rows.
-- Rank employees within their department (not globally)
SELECT name, department_id, salary,
RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS dept_rank
FROM employees;
Q28. Write a query to find users who signed up in the last 30 days but have never made a purchase.
SELECT u.user_id, u.email, u.created_at
FROM users u
WHERE u.created_at >= CURRENT_DATE - INTERVAL '30 days'
AND NOT EXISTS (
SELECT 1 FROM orders o WHERE o.user_id = u.user_id
);
Q29. How do you handle NULL values in aggregate functions?
Most aggregate functions (SUM, AVG, COUNT(column)) ignore NULLs automatically. COUNT(*) counts all rows including NULLs. Use COALESCE or NULLIF to control behavior explicitly.
-- NULLIF prevents division by zero
SELECT department_id,
SUM(salary) / NULLIF(COUNT(*), 0) AS avg_salary
FROM employees
GROUP BY department_id;
Q30. Write a query to find the nth highest salary (generalized).
-- Find the 5th highest salary using OFFSET
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 4;
-- Using DENSE_RANK for exact tie handling
SELECT salary
FROM (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
FROM employees
) ranked
WHERE rnk = 5
LIMIT 1;
Advanced (Questions 31–50)
These questions come up at senior and staff-level interviews and at companies with mature data stacks.
Q31. Write a recursive CTE to traverse an employee hierarchy.
WITH RECURSIVE org_chart AS (
-- Anchor: start with the CEO (no manager)
SELECT employee_id, name, manager_id, 0 AS depth
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive: join subordinates
SELECT e.employee_id, e.name, e.manager_id, oc.depth + 1
FROM employees e
INNER JOIN org_chart oc ON e.manager_id = oc.employee_id
)
SELECT employee_id, name, depth
FROM org_chart
ORDER BY depth, name;
Q32. Explain what an index is. When would you not use one?
An index is a data structure (typically a B-tree) that allows the database to find rows without scanning every row in the table. Don't add indexes when: the table is small (full scan is faster), you write far more than you read (indexes slow writes), or the column has very low cardinality (like a boolean) and the selectivity is too low to be useful.
Q33. What is the difference between a clustered and non-clustered index?
A clustered index defines the physical order of data on disk — there can only be one per table. The primary key is typically clustered. A non-clustered index is a separate structure that stores pointers to the physical rows.
Q34. Write a query to detect gaps in a sequential ID column.
-- Find missing IDs between min and max
WITH sequence AS (
SELECT generate_series(MIN(id), MAX(id)) AS expected_id
FROM orders
)
SELECT expected_id AS missing_id
FROM sequence
WHERE expected_id NOT IN (SELECT id FROM orders)
ORDER BY missing_id;
Q35. How does EXPLAIN (or EXPLAIN ANALYZE) help with query optimization?
EXPLAIN shows the query execution plan — the sequence of operations the database will perform. EXPLAIN ANALYZE actually runs the query and shows real timing. Look for: sequential scans on large tables (may need an index), nested loop joins on large datasets (consider hash joins), high row estimate errors (outdated statistics), and operations happening in unexpected order.
EXPLAIN ANALYZE
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.id
WHERE e.salary > 100000;
Q36. Write a query to find the first and last purchase date for each customer.
SELECT
customer_id,
MIN(purchase_date) AS first_purchase,
MAX(purchase_date) AS last_purchase,
MAX(purchase_date) - MIN(purchase_date) AS customer_lifespan_days
FROM orders
GROUP BY customer_id;
Q37. What is a materialized view and how does it differ from a regular view?
A regular view is a saved query that executes on every access — no data is stored. A materialized view stores the query results physically. It's faster to query but requires a refresh strategy to stay current. Use materialized views for expensive aggregations that don't need real-time freshness.
Q38. Write a query to calculate user retention by cohort (weekly cohorts).
WITH cohorts AS (
SELECT user_id, DATE_TRUNC('week', created_at) AS cohort_week
FROM users
),
activity AS (
SELECT DISTINCT user_id, DATE_TRUNC('week', activity_date) AS activity_week
FROM user_events
)
SELECT
c.cohort_week,
a.activity_week,
DATE_PART('week', a.activity_week - c.cohort_week) AS weeks_since_signup,
COUNT(DISTINCT c.user_id) AS cohort_size,
COUNT(DISTINCT a.user_id) AS retained_users,
ROUND(COUNT(DISTINCT a.user_id)::numeric / COUNT(DISTINCT c.user_id) * 100, 1) AS retention_pct
FROM cohorts c
LEFT JOIN activity a ON c.user_id = a.user_id AND a.activity_week >= c.cohort_week
GROUP BY 1, 2, 3
ORDER BY 1, 3;
Q39. How would you find the most recent record per user when the table has millions of rows?
-- Method 1: ROW_NUMBER (clean, predictable)
SELECT user_id, event_type, created_at
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
FROM events
) t
WHERE rn = 1;
-- Method 2: DISTINCT ON (Postgres-specific, often fastest)
SELECT DISTINCT ON (user_id) user_id, event_type, created_at
FROM events
ORDER BY user_id, created_at DESC;
Q40. What is a lateral join and when is it useful?
LATERAL allows a subquery on the right side of a join to reference columns from the left side — like a correlated subquery but returning multiple rows. Useful for top-N per group, applying a function to each row, or row expansion.
-- Get the 3 most recent orders for each customer
SELECT c.customer_id, c.name, o.order_id, o.order_date
FROM customers c
LEFT JOIN LATERAL (
SELECT order_id, order_date
FROM orders
WHERE customer_id = c.customer_id
ORDER BY order_date DESC
LIMIT 3
) o ON true;
Q41. Write a query to calculate a cumulative distribution of salary ranges.
SELECT
CASE
WHEN salary < 60000 THEN 'under 60k'
WHEN salary < 100000 THEN '60k-100k'
WHEN salary < 150000 THEN '100k-150k'
ELSE '150k+'
END AS salary_band,
COUNT(*) AS headcount,
SUM(COUNT(*)) OVER (ORDER BY MIN(salary)) AS cumulative_count,
ROUND(SUM(COUNT(*)) OVER (ORDER BY MIN(salary)) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS cumulative_pct
FROM employees
GROUP BY 1
ORDER BY MIN(salary);
Q42. How do you handle slowly changing dimensions (SCD Type 2) in SQL?
SCD Type 2 tracks historical changes by adding new rows rather than overwriting. Each row has valid_from, valid_to, and is_current columns.
-- Get the current record for each customer
SELECT customer_id, name, address, plan_type
FROM customer_dim
WHERE is_current = true;
-- Get the record as of a specific date
SELECT customer_id, name, address, plan_type
FROM customer_dim
WHERE '2025-01-01' BETWEEN valid_from AND COALESCE(valid_to, '9999-12-31');
Q43. Write a query to detect session boundaries in clickstream data (sessions gap by 30 minutes).
WITH flagged AS (
SELECT
user_id,
event_time,
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_event_time,
CASE
WHEN event_time - LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) > INTERVAL '30 minutes'
OR LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) IS NULL
THEN 1 ELSE 0
END AS is_new_session
FROM events
),
sessions AS (
SELECT *,
SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
FROM flagged
)
SELECT user_id, session_id, MIN(event_time) AS session_start, MAX(event_time) AS session_end,
COUNT(*) AS events_in_session
FROM sessions
GROUP BY user_id, session_id
ORDER BY user_id, session_id;
Q44. What are the isolation levels in SQL transactions?
From weakest to strongest: READ UNCOMMITTED (dirty reads possible), READ COMMITTED (no dirty reads, default in most DBs), REPEATABLE READ (no phantom reads for rows already read), SERIALIZABLE (full isolation, transactions behave as if serial). Higher isolation = more correctness, less concurrency.
Q45. Write a query to generate a date spine (every day in a range) and fill in missing data.
WITH date_spine AS (
SELECT generate_series(
'2026-01-01'::date,
'2026-06-30'::date,
'1 day'::interval
)::date AS dt
),
daily_signups AS (
SELECT DATE(created_at) AS signup_date, COUNT(*) AS signups
FROM users
GROUP BY 1
)
SELECT d.dt AS date, COALESCE(s.signups, 0) AS signups
FROM date_spine d
LEFT JOIN daily_signups s ON d.dt = s.signup_date
ORDER BY d.dt;
Q46. How would you optimize a query that scans a 500M-row events table?
Key strategies: partition the table by date (so queries only scan relevant partitions), add composite indexes on (user_id, created_at) for common filter patterns, cluster on the partition key, use columnar storage (Parquet, BigQuery, Redshift) which only reads needed columns, avoid SELECT *, push filters as early as possible, and consider pre-aggregated summary tables for dashboard queries.
Q47. Write a query to calculate the median salary per department.
SELECT
department_id,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees
GROUP BY department_id
ORDER BY department_id;
Q48. What is the difference between NTILE and PERCENT_RANK?
NTILE(n) divides rows into n roughly equal buckets and assigns each row to a bucket number (1 through n). PERCENT_RANK calculates the relative rank of each row as a percentage from 0.0 to 1.0.
SELECT
name,
salary,
NTILE(4) OVER (ORDER BY salary) AS salary_quartile,
ROUND(PERCENT_RANK() OVER (ORDER BY salary)::numeric, 3) AS pct_rank
FROM employees;
Q49. Write a query to find customers who bought product A but not product B.
SELECT DISTINCT customer_id
FROM orders
WHERE product_id = 'A'
AND customer_id NOT IN (
SELECT customer_id FROM orders WHERE product_id = 'B'
);
-- Alternative using NOT EXISTS (handles NULLs more safely)
SELECT DISTINCT o.customer_id
FROM orders o
WHERE o.product_id = 'A'
AND NOT EXISTS (
SELECT 1 FROM orders o2
WHERE o2.customer_id = o.customer_id AND o2.product_id = 'B'
);
Q50. You have a query that used to run in 2 seconds and now takes 2 minutes. How do you diagnose it?
Start with EXPLAIN ANALYZE to see where time is spent. Check if statistics are stale (ANALYZE table_name). Look for new table scans where index scans used to be (index dropped? index no longer selective?). Check for parameter sniffing / plan cache issues. Look at lock waits — the query may be blocked, not slow. Check for data skew (a PARTITION BY key with one dominant value). Review recent data volume changes — an index that worked at 1M rows may not work at 50M. Finally, check for schema changes (new columns, type changes) that may have invalidated execution plans.
Common Interview Patterns to Memorize
Find duplicates: GROUP BY + HAVING COUNT(*) > 1
Top N per group: ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) + WHERE rn <= N
Running totals: SUM(...) OVER (ORDER BY ...)
Period-over-period change: LAG(value) OVER (ORDER BY date)
Gap detection: generate a sequence, left join your table, filter for NULLs
Deduplication: ROW_NUMBER() or DISTINCT ON (Postgres)
Anti-join: NOT EXISTS or LEFT JOIN + WHERE right.key IS NULL
SQL fluency in data engineering interviews is built through deliberate practice, not passive reading. Work through these queries by typing them, running them against real data, and explaining each step aloud.
CareerLift.ai includes SQL interview simulations with AI-powered feedback on your query logic and explanations. Practice the full range — basic through advanced — at careerlift.ai.