Data engineering is the plumbing that makes data-driven companies work. The interview tests your ability to design, build, and operate reliable data pipelines at scale β not just write SQL queries.
The Data Engineer Interview Loop
- Technical screen β SQL + one coding problem (Python/Scala)
- Loop (4β5 rounds):
- SQL (complex queries, optimization)
- Coding (Python/Scala data processing)
- Data pipeline / system design
- Distributed systems concepts
- Behavioral
SQL Mastery
SQL is tested more heavily in data engineering interviews than in any other role. You need:
Advanced SQL (must know):
- Window functions:
ROW_NUMBER,RANK,DENSE_RANK,LAG,LEAD,NTILE,FIRST_VALUE - CTEs and recursive CTEs (hierarchical data, graph traversal)
- Self-joins (consecutive events, user retention)
- PIVOT/UNPIVOT patterns
- Date manipulation:
DATE_TRUNC,DATE_DIFF,EXTRACT - Performance: query plans (
EXPLAIN), partitioning, indexing, avoiding full table scans
Classic hard SQL problems:
- "Find users active on 3+ consecutive days" (session gap analysis with window functions)
- "Compute 7-day rolling average of daily revenue" (LAG + window frame)
- "Find the second highest salary per department" (DENSE_RANK)
- "Build a retention cohort table" (self-join on user first seen date)
Python for Data Engineering
You'll be tested on data processing in Python:
- Pandas:
groupby,merge,pivot_table,apply,explode, complexagg - Performance: when to use vectorized ops vs
apply(always vectorize) - File I/O: reading/writing Parquet, JSON, CSV efficiently
- Error handling in pipelines: try/except, logging, dead letter queues
- Type handling: schema validation, null handling, type coercion
Pipeline Design (Data System Design)
This is the most important round for senior roles:
Common prompts:
- "Design a data pipeline to track user events from mobile apps to a data warehouse"
- "Design a real-time dashboard for business metrics with < 30-second latency"
- "Design a data quality monitoring system"
- "How would you migrate a 10TB PostgreSQL database to BigQuery with zero downtime?"
Framework for pipeline design:
- Sources: What data, what format, what frequency, what volume?
- Ingestion: Batch vs streaming? Kafka, Kinesis, Pub/Sub, Fivetran?
- Processing: Spark, Flink, dbt, Dataflow β what are the latency requirements?
- Storage: Raw (data lake), processed (data warehouse), serving (OLAP)?
- Quality: Schema validation, null checks, freshness monitors, anomaly detection
- Orchestration: Airflow, Prefect, Dagster β DAG design, retry logic, alerting
Spark and Distributed Processing
- Spark internals: DAG execution, lazy evaluation, wide vs narrow transformations
- Shuffle: The most expensive Spark operation β how to minimize it
- Partitioning: Why partition count matters, how to repartition correctly
- Spark SQL vs DataFrame API: Trade-offs, catalyst optimizer
- Performance tuning: Broadcast joins, caching, speculative execution, skew handling
Spark interview questions:
- "What is the difference between
mapandflatMap?" - "When would you use
coalescevsrepartition?" - "How do you handle data skew in a Spark job?"
- "What's the difference between
cacheandpersist?"
Kafka and Streaming
- Producers, consumers, consumer groups, topics, partitions
- Offset management: auto-commit vs manual commit, at-least-once vs exactly-once
- Rebalancing: what triggers it, how to minimize impact
- Kafka Streams vs Flink vs Spark Streaming β when to use each
- Common question: "How do you guarantee exactly-once processing in a Kafka consumer?"
Data Modeling
- Star schema vs snowflake schema β trade-offs in a data warehouse
- Slowly changing dimensions (SCD): Type 1, Type 2, Type 3
- Fact vs dimension tables
- dbt: models, tests, macros, incremental models, sources
6-Week Data Engineer Prep Plan
| Week | Focus | |------|-------| | 1 | Advanced SQL: 30 problems including window functions + CTEs | | 2 | Python data processing: Pandas + data pipeline coding | | 3 | Spark: core concepts + 10 Spark interview questions | | 4 | Pipeline system design: 4 full designs | | 5 | Kafka + streaming: concepts + real-time pipeline design | | 6 | Mock full loops + behavioral stories |
Practice explaining your pipeline designs and SQL reasoning out loud with CareerLift.ai β data engineering interviews reward clear communication of complex data flow decisions.