Machine Learning Engineer Interview Prep

Build and deploy ML models in production. Strong Python + ML + cloud + system design.

9 questions · 60+ min, often 2-3 rounds with coding and ML design · 7 technical, 1 behavioural, 1 scenario

General tips for this role

Have one ML project end-to-end on GitHub: data, model, evaluation, deployment, README.
Know one cloud ML platform deeply: SageMaker, Vertex AI, or Azure ML.
Be ready to discuss trade-offs: model size vs latency, accuracy vs cost.
Don't oversell deep learning — sometimes logistic regression is enough.
Mention monitoring and drift detection in EVERY production-related answer.

What is the difference between supervised, unsupervised, and reinforcement learning?

easy · technical

Model answer

Supervised: train on labeled data, predict labels for new data. Classification (spam/not spam) and regression (predict price). Unsupervised: no labels, find patterns. Clustering (customer segments), anomaly detection (fraud). Reinforcement learning: agent learns by trial and error, receiving rewards. Used for games (AlphaGo), robotics, recommendation systems.

Explain overfitting and how to prevent it.

medium · technical

Model answer

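Overfitting: the model memorises the training data and fails on new data. Signs: training accuracy high, validation accuracy much lower. Prevention: (1) More data. (2) Simpler model. (3) Regularisation (L1/L2 penalties; dropout for neural networks). (4) Cross-validation, which does not prevent overfitting by itself but detects it and gives a more reliable basis for choosing model complexity. (5) Early stopping. (6) Data augmentation (for images, text). (7) Ensemble methods such as bagging. Always split data into train/val/test and use the test set only for the final evaluation.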
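
Early stopping from the list above can be sketched in a few lines. This is a minimal illustration, not a framework API: the function name, the `patience` and `min_delta` parameters, and the idea of passing a pre-recorded list of validation losses are all assumptions made for the example. In a real training loop the check would run after each epoch.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering the stop condition.
    return len(val_losses) - 1, best_epoch
```

With validation losses [0.9, 0.7, 0.6, 0.62, 0.65, 0.7, 0.8] and patience=3, the function returns (5, 2): training stops at epoch 5 and the weights from epoch 2 are the ones to keep.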

What is the bias-variance trade-off?

medium · technical

Model answer

Bias: error from wrong assumptions (model too simple, underfitting). Variance: error from sensitivity to small fluctuations in the training data (model too complex, overfitting). For squared-error loss, expected error = bias² + variance + irreducible noise. Trade-off: reducing one often increases the other. Goal: find the sweet spot, usually by using cross-validation to compare models of different complexity.
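
Written out, the decomposition looks as follows. The notation is an assumption for illustration: the data follow y = f(x) + ε with zero-mean noise of variance σ², f̂ is the model fitted on a random training set, and the expectation is taken over training sets and noise. It holds for squared-error loss.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```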

How would you deploy an ML model to production?

hard · technical

Model answer

(1) Containerise the model (Docker). (2) Build a prediction API (FastAPI, Flask) — receive input, return prediction. (3) Add input validation and error handling. (4) Deploy to managed service (SageMaker, Vertex AI, Azure ML) or Kubernetes. (5) Add monitoring: latency, error rate, prediction distribution. (6) Set up data drift detection — alert if input distribution shifts. (7) Set up retraining pipeline — schedule or trigger when drift is detected. (8) A/B test new model versions against current before full rollout.
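
Steps (2), (3) and (5) can be sketched without committing to a web framework. Everything named here is hypothetical: the feature names, the `Metrics` class, `handle_predict`, and `toy_model` are stand-ins, and in practice the handler would sit behind FastAPI or Flask with metrics exported to a monitoring system rather than kept in memory.

```python
import time
from dataclasses import dataclass, field

EXPECTED_FEATURES = ("age", "income")  # hypothetical feature names


class ValidationError(ValueError):
    """Raised when a request payload does not match the expected schema."""


def validate_payload(payload):
    """Check every expected feature is present and numeric; return values in model order."""
    if not isinstance(payload, dict):
        raise ValidationError("payload must be a JSON object")
    values = []
    for name in EXPECTED_FEATURES:
        if name not in payload:
            raise ValidationError(f"missing feature: {name}")
        value = payload[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValidationError(f"feature {name} must be numeric")
        values.append(float(value))
    return values


@dataclass
class Metrics:
    """In-memory counters standing in for a real monitoring backend."""
    requests: int = 0
    errors: int = 0
    latencies_ms: list = field(default_factory=list)
    predictions: list = field(default_factory=list)


def handle_predict(payload, model, metrics):
    """Validate, predict, and record monitoring signals. Returns (status_code, body)."""
    start = time.perf_counter()
    metrics.requests += 1
    try:
        features = validate_payload(payload)
        prediction = model(features)
        # Keeping predictions lets us watch the prediction distribution over time.
        metrics.predictions.append(prediction)
        return 200, {"prediction": prediction}
    except ValidationError as exc:
        metrics.errors += 1
        return 422, {"error": str(exc)}
    finally:
        metrics.latencies_ms.append((time.perf_counter() - start) * 1000.0)


def toy_model(features):
    """Stand-in for a trained model: a fixed linear score."""
    age, income = features
    return 0.01 * age + 0.00001 * income
```

Keeping validation separate from the model call means malformed requests are rejected with a clear error before they reach the model, and they still count towards the error rate and latency figures.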

Walk me through how you would design a recommendation system.

hard · technical

Model answer

Two main approaches. Collaborative filtering: 'users who liked X also liked Y'. Matrix factorisation (ALS, SVD). Pros: discovers latent preferences. Cons: cold start for new users and new items with no interactions. Content-based: recommend items similar to ones the user liked. Item embeddings + similarity. Pros: handles new items, since it needs only item features. Cons: limited to the user's history, so little novelty. Hybrid (most production systems): combine both, typically as candidate retrieval followed by a ranking layer with user, item, and context features. Common model choices: a two-tower neural network for retrieval, a gradient-boosted tree model for ranking. Online: A/B test models against engagement metrics.
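
The content-based approach ("item embeddings + similarity") can be illustrated with plain cosine similarity. The item names and three-dimensional embeddings below are invented for the example; real embeddings would come from a trained model and be searched with an approximate nearest-neighbour index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(liked_items, item_embeddings, top_k=2):
    """Rank unseen items by similarity to the mean embedding of the liked items."""
    if not liked_items:
        # New user with no history: content-based similarity has nothing to work with.
        return []
    liked_vectors = [item_embeddings[item] for item in liked_items]
    dim = len(liked_vectors[0])
    profile = [sum(vec[d] for vec in liked_vectors) / len(liked_vectors) for d in range(dim)]
    scored = [
        (cosine_similarity(profile, vec), item)
        for item, vec in item_embeddings.items()
        if item not in liked_items
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]


# Hypothetical catalogue: dimensions loosely mean (sci-fi, romance, food).
ITEM_EMBEDDINGS = {
    "space_opera": [0.9, 0.1, 0.0],
    "alien_thriller": [0.8, 0.3, 0.1],
    "romcom": [0.0, 0.9, 0.2],
    "cooking_show": [0.1, 0.2, 0.9],
    "robot_drama": [0.7, 0.2, 0.1],
}
```

Calling `recommend(["space_opera"], ITEM_EMBEDDINGS)` returns `["robot_drama", "alien_thriller"]`. The empty-history branch shows why a user with no interactions still needs a fallback such as popularity-based recommendations.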

What metrics would you use to evaluate a classification model?

medium · technical

Model answer

Depends on the problem. Accuracy = correct / total: misleading for imbalanced classes. Precision = TP / (TP + FP): use when false positives are costly. Recall = TP / (TP + FN): use when false negatives are costly (cancer detection). F1 = harmonic mean of precision and recall. ROC AUC: measures how well the model ranks positives above negatives across all thresholds; it does not describe performance at one fixed threshold. PR AUC: more informative than ROC AUC when the positive class is rare. Log loss: for probabilistic outputs, penalises confident wrong predictions. Always pick metrics that match the business goal.
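
The threshold-based formulas above translate directly into code. This is a minimal sketch for the binary case, with hypothetical function names; returning 0.0 when a denominator is zero is a convention chosen for the example, and in practice a library such as scikit-learn would normally be used.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1; ratios with a zero denominator are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

On a dataset with 2 positives out of 10, a model that always predicts the negative class scores 0.8 accuracy but 0.0 recall, which is the imbalanced-class problem in miniature.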

What is feature engineering and give an example.

medium · technical

Model answer

Creating new features from raw data to help the model learn. Examples: from a timestamp, extract day_of_week, hour, is_weekend. From text, extract word counts, TF-IDF, embeddings. Domain knowledge matters: a fraud model might use 'transactions per hour from this IP' as an engineered feature. Good features often improve a model more than switching algorithms does.
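
Both examples from the answer can be sketched with the standard library. The function names and the (ip, timestamp) input shape are assumptions for illustration, and the per-IP count uses fixed clock-hour buckets as a simplification of the rolling one-hour window a fraud system would more likely use.

```python
from collections import Counter
from datetime import datetime


def timestamp_features(ts):
    """Derive calendar features from a datetime."""
    day_of_week = ts.weekday()  # Monday = 0 ... Sunday = 6
    return {
        "day_of_week": day_of_week,
        "hour": ts.hour,
        "is_weekend": day_of_week >= 5,
    }


def transactions_per_hour_by_ip(transactions):
    """Count transactions per (ip, clock-hour) bucket.

    `transactions` is an iterable of (ip, datetime) pairs.
    """
    counts = Counter()
    for ip, ts in transactions:
        hour_bucket = ts.replace(minute=0, second=0, microsecond=0)
        counts[(ip, hour_bucket)] += 1
    return counts
```

For example, `timestamp_features(datetime(2024, 3, 9, 14, 30))` gives day_of_week 5 (a Saturday), hour 14, and is_weekend True.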

Tell me about an ML project that did not work as expected.

medium · behavioural

Model answer

Use the STAR structure (Situation, Task, Action, Result). Pick a real failure. Cover: what you tried, why it didn't work, what you learned, what you did instead. Showing you can fail honestly and iterate is more valuable than only sharing successes. Common honest answers: data quality killed the model, the metric we optimised was wrong, the model couldn't beat a simple baseline.

Tip

Saying 'all my projects worked' is a red flag.

Your model is in production but accuracy is decreasing over time. What do you do?

hard · scenario

Model answer

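Most likely cause: drift. The world has changed and the training data is stale. Two kinds: data drift (the input distribution changed) and concept drift (the relationship between inputs and target changed). Investigate: compare input distributions (training vs recent) for each feature. Look at the prediction distribution. Look at the target distribution if labels are available. Also rule out an upstream pipeline or schema change, which can look like drift. Once confirmed, options: (1) Retrain on recent data. (2) Set up scheduled retraining. (3) For non-stationary problems, use online learning or shorter training windows. (4) Add concept drift detection to monitoring, for example by tracking live accuracy as labels arrive. Document the incident and add it to the monitoring playbook.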
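
One way to "compare input distributions (training vs recent)" for a numeric feature is the Population Stability Index (PSI). PSI is a choice made for this sketch, not something the answer prescribes; a Kolmogorov-Smirnov test or a divergence measure would serve the same purpose. The function name, the equal-width binning, and the `eps` floor for empty bins are assumptions.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample (e.g. training) and a recent sample of one feature.

    Bins are equal-width over the range of `expected`; values outside that
    range fall into the first or last bin.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins
    if width == 0.0:
        width = 1.0  # constant reference feature: everything lands in one bin

    def bin_fractions(values):
        counts = [0] * n_bins
        for value in values:
            index = int((value - lo) / width)
            index = min(max(index, 0), n_bins - 1)
            counts[index] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(count / len(values), eps) for count in counts]

    expected_frac = bin_fractions(expected)
    actual_frac = bin_fractions(actual)
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected_frac, actual_frac)
    )
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these thresholds are conventions and should be tuned per feature. The check would run per feature on a schedule and raise an alert when the threshold is crossed.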
