AI Engineer (LLM / GenAI) Interview Prep
Building applications with LLMs. One of the most in-demand roles of 2026. A heavy mix of Python, ML basics, prompt engineering, and product sense.
General tips for this role
- Build at least one project before applying. A working RAG bot over your own docs is enough.
- Know one model deeply (e.g. GPT-4o pricing, limits, strengths) rather than naming many.
- Understand WHY transformers work, not just that they do. The self-attention concept is interview gold.
- Practise reading and explaining a research paper abstract. Senior interviews often include this.
- Be honest about the limits of LLMs. Interviewers love grounded realism over hype.
What is an LLM and how does it work at a high level?
A Large Language Model is a neural network trained on huge text datasets to predict the next token (roughly a word or word fragment) given a context. By repeatedly predicting tokens and appending each one to the context, it generates coherent text. Modern LLMs use the transformer architecture with self-attention, which lets the model weigh different parts of the input against each other when predicting each token. Examples: GPT-4, Claude, Gemini, Llama 3.
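The "repeatedly predict the next token" loop can be illustrated with a toy model. This is only a sketch: a bigram counter stands in for the neural network, and the corpus and function names are made up for illustration. A real LLM conditions on the whole context, not just the last token.

```python
from collections import Counter, defaultdict

def train_bigram(text: str) -> dict:
    """Count which token follows which in the training text."""
    tokens = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows

def generate(follows: dict, start: str, max_tokens: int = 5) -> list:
    """Greedy decoding: keep appending the most frequent next token."""
    out = [start]
    for _ in range(max_tokens):
        candidates = follows.get(out[-1])
        if not candidates:
            break  # no known continuation for this token
        out.append(candidates.most_common(1)[0][0])
    return out

# Tiny illustrative corpus (whitespace-separated "tokens").
corpus = "the cat sat on the mat . the cat sat on the rug"
model = train_bigram(corpus)
print(" ".join(generate(model, "the", max_tokens=4)))
```

The same loop structure (predict, append, repeat) is what produces text in a real model; only the predictor is different.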
What is the difference between fine-tuning and prompt engineering?
Prompt engineering: changing the input to get better output. Cheap, fast, no training needed. Fine-tuning: training the model further on your own data to specialise it. More expensive and slower, but it can give more consistent behaviour on a narrow task. Rule: try prompt engineering first. If that does not work, try RAG. Only fine-tune if you have lots of high-quality domain data and the first two are insufficient.
Explain RAG and when you would use it.
Retrieval-Augmented Generation: when a user asks a question, you first retrieve relevant documents from your own knowledge base (using vector search) and pass them as context to the LLM. The LLM then answers based on those documents. Use RAG when: you need up-to-date info, you have proprietary data, you need citations, or you want to reduce hallucination. Most enterprise AI chatbots use RAG.
Mention that RAG does NOT replace the need for a good knowledge base. Garbage in, garbage out.
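A minimal sketch of the retrieve-then-prompt flow described above. It assumes a toy word-overlap score in place of real vector search, and the documents, function names, and prompt wording are all illustrative. The final prompt would then be sent to whichever LLM you use.

```python
def score(question: str, doc: str) -> int:
    """Toy relevance: number of shared lowercase words.
    A real system would compare embeddings instead."""
    return len(set(question.lower().split()) & set(doc.lower().split()))

def retrieve(question: str, docs: list, k: int = 2) -> list:
    """Return the k highest-scoring documents for the question."""
    return sorted(docs, key=lambda d: score(question, d), reverse=True)[:k]

def build_prompt(question: str, context_docs: list) -> str:
    """Number the retrieved passages so the model can cite them."""
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(context_docs))
    return (
        "Answer only from the context below and cite sources as [n]. "
        "If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

docs = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Refunds require the original order number.",
]
question = "How long do refunds take?"
top = retrieve(question, docs, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

Numbering the passages is one simple way to support citations, since the model can refer back to [1] or [2] in its answer.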
What is a vector database and why does it matter for AI?
A database optimised for storing and searching high-dimensional vectors (embeddings). Used in RAG: documents are converted to embeddings, stored in the vector DB, and at query time we find the closest matches. Examples: Pinecone, Weaviate, Chroma, Qdrant, pgvector (Postgres extension). The matching algorithm is approximate nearest neighbour (ANN), usually HNSW under the hood.
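To make "find the closest matches" concrete, here is a brute-force sketch using cosine similarity over a tiny in-memory index. The three-dimensional vectors and document ids are invented for illustration; real embeddings have hundreds or thousands of dimensions, and an ANN index such as HNSW approximates this exact search to stay fast at scale.

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest(query: list, index: dict, k: int = 2) -> list:
    """Exact search: score every stored vector, return the top-k ids."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical embeddings keyed by document id.
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "office-hours": [0.0, 0.2, 0.9],
    "returns-faq": [0.8, 0.3, 0.1],
}
print(nearest([1.0, 0.0, 0.0], index, k=2))
```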
How do you reduce hallucinations in an LLM-based system?
Multiple layers: (1) Use RAG so the model has accurate context. (2) Add explicit instructions in the prompt: 'Answer only based on the provided context. If the answer is not there, say so.' (3) Lower the temperature (e.g. 0.1) for factual tasks. (4) Add a fact-checking step that re-asks the model to verify its claims. (5) Use structured output (JSON schema) so the model cannot wander. (6) Evaluate with a test set of expected answers and measure hallucination rate.
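Layers (4) and (5) can be combined into a cheap post-check. The sketch below assumes the model was asked to reply as JSON with an "answer" field and a supporting "quote" field (a hypothetical schema, not a standard one) and flags replies whose quote does not appear in the retrieved context.

```python
import json

def check_reply(raw_reply: str, context: str) -> dict:
    """Parse a JSON reply and mark whether its quote is found in the context."""
    data = json.loads(raw_reply)
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = {"answer", "quote"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Grounded only if a non-empty quote occurs verbatim (case-insensitive).
    quote = str(data["quote"])
    data["grounded"] = bool(quote) and quote.lower() in context.lower()
    return data

context = "Refunds are processed within 5 business days."
good = '{"answer": "5 business days", "quote": "within 5 business days"}'
bad = '{"answer": "24 hours", "quote": "within 24 hours"}'
print(check_reply(good, context)["grounded"])
print(check_reply(bad, context)["grounded"])
```

A verbatim-quote check is crude (paraphrased but correct answers fail it), so it works best as a signal to route replies for review rather than as a hard filter.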
Walk me through how you would build a customer-support chatbot from scratch.
1) Gather requirements: what questions does it need to answer? What is the success criterion? 2) Build a knowledge base: scrape docs, FAQs, past tickets. 3) Chunk the docs into 500-token pieces. 4) Embed each chunk and store in a vector DB. 5) Build the API: receive user question, embed it, retrieve top-5 similar chunks, send to LLM with the question, return answer. 6) Add safety: profanity filter, PII detection, escalation to human for sensitive topics. 7) Evaluate on a held-out test set. 8) Deploy with rate limiting and monitoring. 9) Set up feedback collection (thumbs up/down). 10) Iterate.
Always mention the eval set: most candidates skip evaluation entirely.
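Step 3 (chunking) is easy to get subtly wrong, so here is a small sketch. It assumes the text is already tokenised into a list, and it adds an overlap between neighbouring chunks, which is a common choice but not something the answer above specifies. The function name and default values are illustrative.

```python
def chunk_tokens(tokens: list, size: int = 500, overlap: int = 50) -> list:
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end
    return chunks

# Stand-in for a tokenised document of 1,200 tokens.
tokens = list(range(1200))
chunks = chunk_tokens(tokens, size=500, overlap=50)
print([len(c) for c in chunks])
```

Overlap helps when an answer straddles a chunk boundary; set `overlap=0` to get strict 500-token pieces.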
How would you evaluate the quality of an LLM output?
Multiple methods are needed. (1) Automated metrics: BLEU for translation, ROUGE for summarisation, BERTScore for semantic similarity, exact match for QA. (2) Custom rubrics: write 20 test prompts, define 3-5 criteria (accuracy, helpfulness, tone), score each output 1-5. (3) LLM-as-judge: use a stronger model to score outputs against criteria. Calibrate it against human scores first. (4) Human evaluation: gold standard, expensive. (5) Production monitoring: thumbs-up/down, conversation length, escalation rate.
Mention LLM-as-judge but caveat: it has its own biases.
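Two of the methods above are simple enough to sketch directly: exact match for QA and an averaged rubric score. The normalisation rule (lowercase, collapse whitespace) and the sample data are assumptions for illustration; real evaluations usually need more careful normalisation.

```python
def normalise(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())

def exact_match_rate(predictions: list, references: list) -> float:
    """Fraction of predictions equal to their reference after normalising."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(normalise(p) == normalise(r) for p, r in zip(predictions, references))
    return hits / len(references)

def mean_rubric_score(scores: dict) -> float:
    """Average of the 1-5 scores given for each rubric criterion."""
    return sum(scores.values()) / len(scores)

predictions = ["Paris", " 5 days ", "blue"]
references = ["paris", "5 days", "red"]
print(exact_match_rate(predictions, references))
print(mean_rubric_score({"accuracy": 4, "helpfulness": 5, "tone": 3}))
```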
What is temperature in an LLM and when would you change it?
Temperature controls randomness in token selection by rescaling the model's probabilities before sampling. 0 = effectively deterministic (always picks the highest-probability token). 1 = samples from the model's unmodified distribution. Above 1 (some APIs allow up to 2) = a flatter distribution and more random output. Use low temp (0 to 0.3) for factual tasks (Q&A, classification, code). Use higher (0.7 to 1.0) for creative tasks (story generation, brainstorming). Different from top_p, which limits the candidate set to the most probable tokens.
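The effect is easiest to see numerically. This sketch applies the standard temperature-scaled softmax to three made-up logits; how a given provider treats temperature 0 exactly is an implementation detail, so the function simply rejects it.

```python
import math

def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # illustrative scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top token; higher ones spread it across the alternatives.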
What is the context window and why does it matter?
The max number of tokens the model can process in one request (input + output). GPT-4o: 128k tokens. Claude 3.5: 200k. Gemini 1.5 Pro: 1M+. Matters because: if you have a long doc, you may need to chunk and use RAG instead of fitting it all in. You pay per token, so longer prompts cost more. Models often lose accuracy on info in the middle of very long contexts ('lost in the middle' problem).
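A quick budget check before sending a request might look like the sketch below. It assumes the rough rule of thumb of about 4 characters per token for English text; a real application should count tokens with the provider's own tokenizer. The function names and the default window are illustrative.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (assumption)."""
    return max(1, len(text) // 4)

def fits_in_window(prompt: str, max_output_tokens: int, window: int = 128_000) -> bool:
    """True if the prompt plus the reserved output budget fits in the window."""
    return estimate_tokens(prompt) + max_output_tokens <= window

long_doc = "word " * 200_000  # about 1,000,000 characters
print(fits_in_window("Summarise this ticket.", max_output_tokens=500))
print(fits_in_window(long_doc, max_output_tokens=500))
```

When the check fails, that is the cue to chunk the document and retrieve only the relevant pieces.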
How would you handle PII (personal info) when sending data to an LLM?
Several layers: (1) Pre-process: redact or replace PII (names, emails, phone numbers) with placeholders before sending to the LLM. Tools: Microsoft Presidio, Amazon Comprehend. (2) Use a provider with no-training agreements (OpenAI, Anthropic and Azure OpenAI default to no training on their enterprise tiers; verify the current terms). (3) Self-host an open model (Llama, Mistral) for highly sensitive data. (4) Log only essential metadata, not full prompts. (5) Consider regulation: HIPAA for health data, GDPR for EU users, etc. Pick the provider accordingly.
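Layer (1) can be sketched with two regular expressions. This is a deliberately simple illustration: the patterns are assumptions that catch common email and phone formats only, and they do not detect names, which is where dedicated tools such as Presidio come in.

```python
import re

# Illustrative patterns; real-world PII detection needs far broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace emails and phone-like numbers with placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)

message = "Contact jane.doe@example.com or +1 415-555-0100."
print(redact(message))
```

Using stable placeholders keeps the sentence readable for the model while the original values stay on your side.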
Tell me about an AI project you built end to end.
Use the STAR structure (Situation, Task, Action, Result). Walk through: problem, why AI was right for it, your approach, technical choices and trade-offs, evaluation, deployment, what you learned. End with: what would you do differently next time? Showing humility and growth is more valuable than perfection.
If you have not built one yet: build one before applying. Even a small RAG over Wikipedia is enough.
How do you stay current with the AI space when it changes every week?
Specific habits beat generic claims. 'I follow the Anthropic and OpenAI blogs. I read the weekly TLDR AI newsletter. I subscribe to the LangChain Discord. I build one small project per month using something new. Last month I tried Gemini's grounding feature.' Show curiosity and practical engagement.
Your LLM application is too slow. How do you speed it up?
Profile first to find the bottleneck. Common fixes: (1) Use a smaller/faster model for the use case (e.g. GPT-4o-mini instead of GPT-4o). (2) Cache common queries. (3) Stream the response so users see partial results. (4) Reduce prompt size, for example with fewer few-shot examples. (5) Parallelise calls where possible. (6) For RAG: reduce retrieved chunks, use smaller embedding model. (7) Use a faster inference provider (Groq, Together AI). End-to-end latency budget: under 2 seconds for chatbots, ideally.
Your stakeholders want to use AI for everything. How do you decide what to actually build?
Push back politely. For each proposal ask: (1) What is the manual process today? (2) What would success look like (metric)? (3) Is AI the right tool, or is a database query / business rule enough? (4) What is the cost of a wrong AI answer? Prioritise based on impact vs effort. Build a prototype in a week to validate before committing to production.
Shows you can manage expectations, not just build.
What concerns you most about working with AI?
Show you take ethics seriously. 'Hallucination in customer-facing settings.' 'Bias in training data leaking into output.' 'The pace of change making it hard to make stable architectural decisions.' Pick something real you have actually thought about, not a textbook answer.
Saying 'nothing concerns me' is a red flag.