🔧

Data Engineer Interview Prep

Strong SQL + Python + cloud + pipeline tools. Highest-paying data role.

12 questions·60 min, often with coding test·9 technical, 2 behavioural, 1 scenario

General tips for this role

Be ready for a live SQL exercise. Practise window functions and complex joins.
Know one orchestration tool deeply (Airflow, dbt, Prefect). Naming all three is fine but be ready to go deep on one.
Mention testing and data quality often. It separates senior engineers from juniors.
Cloud platform matters: know the warehouse the company uses. AWS shop = Redshift, GCP = BigQuery, Microsoft = Synapse or Fabric.
Salary tip: companies underpay data engineers in the UK relative to ML engineers. Negotiate.

What is the difference between ETL and ELT?

easytechnical

Show model answer

Model answer

ETL (Extract-Transform-Load): transform data BEFORE loading into the warehouse. Traditional approach, used when storage was expensive. ELT (Extract-Load-Transform): load raw data first, transform inside the warehouse using SQL. Modern approach with cloud warehouses (BigQuery, Snowflake) where compute is cheap and storage is cheaper. dbt is the dominant ELT tool today.

What is a data lake vs data warehouse vs lakehouse?

mediumtechnical

Show model answer

Model answer

Data warehouse: structured, optimised for analytics, schema-on-write, expensive (Snowflake, BigQuery, Redshift). Data lake: raw data in any format, cheap storage, schema-on-read (S3, Azure Data Lake). Lakehouse: combines both — structured tables on top of object storage with ACID transactions (Databricks Delta Lake, Apache Iceberg). The lakehouse pattern is now dominant for new builds.

What is idempotency in a data pipeline and why does it matter?

mediumtechnical

Show model answer

Model answer

Idempotent = running the same task multiple times produces the same result. Critical because pipelines fail and need to retry. Achieved by using MERGE (upsert) instead of INSERT, deterministic partition keys, and replaceable batch IDs. Without idempotency you get duplicate rows or partial updates on retry.

Walk me through how you would design a daily batch pipeline for sales data.

mediumtechnical

Show model answer

Model answer

1) Source: extract from production DB via CDC (change data capture) or read replica. 2) Land in raw zone (S3) as Parquet, partitioned by date. 3) Transform with dbt: stage (clean types and names), intermediate (joins, dedupe), mart (business-ready tables). 4) Load to warehouse. 5) Run tests with dbt: not null, unique keys, accepted values, foreign keys. 6) Trigger downstream: Power BI refresh, Slack alert if test fails. 7) Schedule with Airflow or dbt Cloud. 8) Monitor freshness, run time, row counts.

Tip

Mention dbt tests. Many candidates focus on extraction and skip data quality.

How do you handle late-arriving data?

hardtechnical

Show model answer

Model answer

Several patterns: (1) Use event_time vs processing_time and watermarks. (2) Allow a grace window in batch jobs (e.g. always reprocess last 3 days). (3) For streaming, use Kafka with time-windowed aggregations. (4) For dimensions, use Type 2 slowly-changing-dimensions with effective_from/effective_to columns. (5) Always store the original event with its timestamp so you can reprocess.

Your pipeline is slow. Walk me through how you would optimise.

hardtechnical

Show model answer

Model answer

Profile to find the bottleneck. Common fixes: (1) Push compute to warehouse (ELT not ETL). (2) Use columnar formats (Parquet). (3) Partition wisely — match the typical query pattern. (4) Avoid SELECT * — only read columns you need. (5) Cluster/sort by frequent filter columns. (6) Parallelise — break one big job into N smaller ones. (7) Materialise frequently-joined CTEs as tables. (8) Pre-aggregate where possible. (9) Cache intermediate results. (10) Right-size your compute (Snowflake warehouse size, EMR cluster).

How would you implement Slowly Changing Dimensions Type 2?

mediumtechnical

Show model answer

Model answer

SCD2 keeps history of changes. Each row has: business_key, all attributes, effective_from, effective_to, is_current. On update: close the current row (set effective_to = now, is_current = false) and insert a new row (effective_from = now, is_current = true). Implement with MERGE or with a combination of INSERT and UPDATE. dbt has built-in snapshot functionality that does this for you.

What is the difference between Apache Spark and Apache Airflow?

mediumtechnical

Show model answer

Model answer

Different categories. Spark is a distributed compute engine — for processing large datasets in parallel (batch or streaming). Airflow is a workflow orchestrator — schedules and monitors pipelines but does not do the heavy compute itself. You often use them together: Airflow schedules a Spark job. Modern alternatives: Spark replaced by Snowflake/BigQuery for many workloads; Airflow has competitors like Dagster and Prefect.

What is a partition in a data table and why partition?

mediumtechnical

Show model answer

Model answer

A partition is a logical division of a table based on a column (commonly date). Each partition is stored separately. Benefits: queries that filter on the partition column scan only relevant partitions (much faster), partitions can be loaded/dropped independently, easier to manage retention. Common partitions: date, region, customer_id (only if cardinality is right). Avoid too many small partitions (file fragmentation).

Tell me about a data quality issue you found and how you handled it.

mediumbehavioural

Show model answer

Model answer

STAR. Show: detection (how you noticed), investigation (where the issue came from), short-term fix (stop the bleeding), long-term fix (test or monitoring to prevent recurrence). Quantify the impact and the resolution.

Tip

Data engineers who proactively find issues are gold. Demonstrate that.

Why data engineering and not data science?

easybehavioural

Show model answer

Model answer

Honest answer + a technical reason. 'I love the building, the systems thinking, the satisfaction of a pipeline that just works.' Or 'Modelling is fun but most ML projects fail because the data is bad. I want to be upstream of that.' Pick what is actually true for you.

Your team's dbt project has 200 models and runs are slow. How do you scale it?

hardscenario

Show model answer

Model answer

Analyse the DAG: find long-running models, models with many dependencies, models that other models wait on. Split into smaller sub-projects if needed. Use incremental models for large fact tables. Avoid SELECT * in CTEs. Run jobs by tag or by recently-changed models in CI. Use dbt Cloud or a managed Airflow setup with parallelism. Materialise expensive intermediate views as tables. Set up SLAs and monitoring per model.

Help someone else find this

This is free, no ads. Share with anyone preparing for the test.

WhatsApp Post on X LinkedIn Facebook

Telegram Email