Question 1

AI Agent Evaluation

Accepted Answer

AI Agent Evaluation involves assessing not just the final output, but the entire trajectory of actions the agent took. This includes measuring tool selection accuracy, reasoning logic (Chain of Thought), efficiency (number of steps), and ability to recover from simulated errors. See: AI Agent Evaluation. Evaluating an LLM is like grading an essay. Evaluating an agent is like grading a driving test. You must measure the path taken. If an agent achieves the goal but hallucinates 5 invalid tool calls along the way, it is inefficient and unreliable. Frameworks like WebArena or AgentBench use sandboxed environments to test agents on multi-step workflows.

Question 2

What is evaluation-driven development for AI applications?

Accepted Answer

Evaluation-Driven Development (EDD) is the practice of building a rigorous, automated evaluation dataset (a 'golden set') BEFORE tweaking prompts or fine-tuning models, treating prompt engineering as an empirical science rather than guesswork. In traditional software, you write TDD (Test-Driven Development). EDD is the AI equivalent. You define 100 edge-case inputs and expected outputs. When you tweak the system prompt, you run the new prompt against all 100 cases using an LLM-as-a-judge to see if the overall accuracy went up or down. This prevents the 'whack-a-mole' problem where fixing one prompt issue breaks three others.

Question 3

How do you evaluate LLM outputs? What metrics do you use?

Accepted Answer

LLM outputs are evaluated using deterministic metrics (Regex, Exact Match, JSON Schema validation), statistical metrics (BLEU, ROUGE), semantic metrics (BERTScore, Cosine Similarity), and qualitative metrics (LLM-as-a-judge for tone, helpfulness, and safety). Evaluation depends entirely on the task. If the LLM is extracting data, you use strict JSON validation and Exact Match. If the LLM is translating text, you use BLEU. If it's answering a customer query, traditional metrics fail (because two completely different sentences can have the same meaning), so you use an LLM-as-a-judge (like GPT-4) to score the answer on a scale of 1-5 based on a strict rubric.

Question 4

Explain BLEU, ROUGE, and BERTScore. When would you use each?

Accepted Answer

BLEU measures precision based on word overlap (used for Translation). ROUGE measures recall based on word overlap (used for Summarization). BERTScore uses pre-trained embeddings to measure semantic similarity, ignoring exact word matching. If the reference is 'The cat sat on the mat', and the LLM says 'The feline rested on the rug':
- **BLEU/ROUGE**: Will score very low because almost no words match exactly.
- **BERTScore**: Will score very high because it embeds the sentences into vectors and calculates cosine similarity, recognizing that 'feline' and 'cat' mean the same thing. BERTScore is superior for modern generative AI evaluation.

Question 5

What is G-Eval, and how does it use LLMs for evaluation?

Accepted Answer

G-Eval is a framework that uses LLMs with Chain-of-Thought (CoT) prompting to act as a judge, evaluating the quality of text generated by other models based on a specific set of criteria (like coherence or relevance) without requiring human labels. Instead of relying on human graders (expensive/slow) or ROUGE (inaccurate), G-Eval provides GPT-4 with a detailed rubric (e.g., 'Score this summary from 1-5 on fluency'). It asks the LLM to generate a Chain-of-Thought reasoning trace explaining its judgment, and then outputs the final score. Studies show G-Eval correlates very highly with human judgment.

Question 6

What is LLM-as-a-judge evaluation, and what are its limitations?

Accepted Answer

LLM-as-a-judge involves using a powerful frontier model (like GPT-4) to score the output of another model. Limitations include self-preference bias, verbosity bias, and positional bias. See: LLM as a Judge. It is the industry standard for evaluating chatbots. However, it has flaws: 
1. **Verbosity Bias**: The judge often gives higher scores to longer, wordy answers even if a shorter answer is better.
2. **Self-Preference**: Claude prefers Claude's style; GPT-4 prefers GPT-4's style.
3. **Positional Bias**: If you ask it to choose between Answer A and Answer B, it often blindly picks Answer A just because it was listed first.

Question 7

How do you conduct human evaluation for AI systems?

Accepted Answer

Human evaluation is conducted via Blind A/B testing (Side-by-Side/SxS evaluation), where domain experts are given a prompt and two anonymized model outputs, and asked to rate them based on a strict rubric (Factuality, Tone, Formatting). Human evaluation is the gold standard but is slow and expensive. You build a UI where human raters see 'Model A' and 'Model B'. They do not know which is the new model. They grade the outputs on a Likert scale or select a winner. For specialized domains (like legal or medical AI), the raters must be certified experts (doctors/lawyers), making it even more expensive.

Question 8

What is red teaming, and how do you red team an LLM application?

Accepted Answer

Red teaming is an adversarial evaluation process where security researchers intentionally try to break the AI system, attempting to bypass guardrails to generate toxic content, leak PII, or execute unauthorized code. Before a major model launch, teams spend weeks trying to 'jailbreak' it. They use techniques like role-playing ('Pretend you are a hacker'), base64 encoding malicious prompts, or multi-language attacks. Automated red teaming tools (like AutoDAN) use an attacking LLM to generate thousands of adversarial prompts against the target LLM.

Question 9

How do you detect and measure hallucinations in LLM outputs?

Accepted Answer

Hallucinations are detected using RAG-based factual consistency metrics (like RAGAS or TruLens). An LLM-as-a-judge compares the generated answer strictly against the retrieved context to verify that every claim in the answer is supported by the context. You cannot measure hallucinations by just looking at the answer, because you don't know the truth. In a RAG system, you use a metric called 'Faithfulness'. 
1. Extract all claims from the LLM's answer.
2. Prompt a judge LLM: 'Is Claim X supported by Document Y?'
3. If the answer contains claims not found in the source document, it is flagged as a hallucination.

Question 10

What is adversarial testing for AI systems?

Accepted Answer

Adversarial testing involves intentionally feeding the model corrupted, edge-case, or malicious inputs designed to trick the model into failing, hallucinating, or breaking its formatting constraints. Unlike normal evaluation (testing happy paths), adversarial testing tests the boundaries. For an agent extracting JSON, an adversarial test might involve feeding it a PDF with corrupted formatting, an empty string, a string 10x larger than the context window, or a prompt containing SQL injection. The goal is to ensure the system degrades gracefully and throws a handled error instead of crashing.

Question 11

How do you build a regression test suite for AI applications?

Accepted Answer

You build a regression suite by curating a diverse dataset of historic user queries and expected outputs, integrating an evaluation framework (like Promptfoo) into your CI/CD pipeline, and running the suite automatically on every prompt or model change. Because prompts are brittle, fixing one issue often breaks another. A regression suite contains hundreds of fixed test cases categorized by intent (e.g., 50 tests for greetings, 100 tests for complex RAG queries, 50 adversarial tests). When a developer updates the system prompt, the pipeline runs all 200 tests. If the score drops below a threshold, the build fails.

Question 12

What are benchmark suites (MMLU, HumanEval, GSM8K), and how do you interpret them?

Accepted Answer

Benchmarks are standardized academic tests used to compare the baseline capabilities of different foundation models. MMLU tests broad knowledge, HumanEval tests coding ability, and GSM8K tests mathematical reasoning. 1. **MMLU (Massive Multitask Language Understanding)**: Multiple-choice questions across 57 subjects (Law, Medicine, Math). Measures general knowledge.
2. **HumanEval**: OpenAI's benchmark for code generation. Asks the model to write Python functions based on docstrings.
3. **GSM8K**: Grade School Math word problems. Tests multi-step logic.
You interpret them as baseline indicators. A model with a high HumanEval score is a good candidate for a coding agent.

Question 13

How do you evaluate a RAG system end-to-end?

Accepted Answer

RAG is evaluated using a tripartite framework (like RAGAS): evaluating Context Relevance (did the Vector DB find the right docs?), Answer Faithfulness (did the LLM hallucinate?), and Answer Relevance (did the LLM actually answer the user's question?). RAG has two moving parts: Retrieval and Generation. If the final answer is wrong, you must know why. 
1. **Context Precision/Recall**: If the Vector DB retrieved irrelevant documents, the LLM will fail. This is a retrieval error.
2. **Faithfulness**: If the Vector DB retrieved the correct documents, but the LLM made up a fact not in the documents, that is a generation error (hallucination).
3. **Answer Relevance**: The LLM answered based on the context, but it didn't actually address the user's specific prompt.

Question 14

How do you evaluate the quality of AI agents?

Accepted Answer

AI agents are evaluated on Task Success Rate, Trajectory Efficiency (number of unnecessary steps), Tool Use Accuracy, and Robustness to environment errors. To evaluate an agent, you give it a goal in a simulated environment (e.g., 'Book a flight on this dummy website'). You measure: Did it book the flight? (Success Rate). Did it try to call a nonexistent tool? (Tool Accuracy). Did it get stuck in an infinite loop? (Robustness). Did it complete it in 4 API calls instead of 20? (Efficiency).

Question 15

What is the difference between offline and online evaluation for AI systems?

Accepted Answer

Offline evaluation happens before deployment using static datasets and automated metrics (LLM-as-a-judge). Online evaluation happens in production using live telemetry, implicit user behavior (session length, accept rates), and explicit feedback (thumbs up/down). Offline evaluation ensures the model is safe and accurate enough to be released. However, user behavior is unpredictable. Online evaluation monitors how real users interact with the system. If your offline tests show 99% accuracy, but online metrics show users are abandoning the chat after 2 messages or mashing the 'thumbs down' button, the system is failing in the real world.

Evaluation & Testing
Interview Prep Portal

AI Agent Evaluation

What is evaluation-driven development for AI applications?

How do you evaluate LLM outputs? What metrics do you use?

Explain BLEU, ROUGE, and BERTScore. When would you use each?

What is G-Eval, and how does it use LLMs for evaluation?

What is LLM-as-a-judge evaluation, and what are its limitations?

How do you conduct human evaluation for AI systems?

What is red teaming, and how do you red team an LLM application?

How do you detect and measure hallucinations in LLM outputs?

What is adversarial testing for AI systems?

How do you build a regression test suite for AI applications?

What are benchmark suites (MMLU, HumanEval, GSM8K), and how do you interpret them?

How do you evaluate a RAG system end-to-end?

How do you evaluate the quality of AI agents?

What is the difference between offline and online evaluation for AI systems?

How do you measure factual consistency in LLM outputs?

How do you evaluate multi-turn conversation quality?

What is the role of golden datasets in AI evaluation?

How do you implement continuous evaluation for production AI systems?

How do you evaluate bias in AI model outputs?

How do you compare two models or prompts in a statistically rigorous way?

How do you evaluate the robustness of an LLM application across input variations?

What are the key differences between evaluating traditional ML vs LLM applications?

How do you set up an evaluation framework from scratch for a new LLM application?

Your model passes one fairness metric but fails another. How do you handle conflicting audit results?

Your model was fair at deployment, but became biased 6 months later. How do you monitor continuously?

An external auditor cannot reproduce your model's results. How do you ensure audit reproducibility?

How do you structure red teaming for an LLM chatbot before launch?

How do you red team a multimodal model where text-only safety tests miss cross-modal attacks?

Evaluation & Testing Interview Prep Portal

AI Agent Evaluation

What is evaluation-driven development for AI applications?

How do you evaluate LLM outputs? What metrics do you use?

Explain BLEU, ROUGE, and BERTScore. When would you use each?

What is G-Eval, and how does it use LLMs for evaluation?

What is LLM-as-a-judge evaluation, and what are its limitations?

How do you conduct human evaluation for AI systems?

What is red teaming, and how do you red team an LLM application?

How do you detect and measure hallucinations in LLM outputs?

What is adversarial testing for AI systems?

How do you build a regression test suite for AI applications?

What are benchmark suites (MMLU, HumanEval, GSM8K), and how do you interpret them?

How do you evaluate a RAG system end-to-end?

How do you evaluate the quality of AI agents?

What is the difference between offline and online evaluation for AI systems?

How do you measure factual consistency in LLM outputs?

How do you evaluate multi-turn conversation quality?

What is the role of golden datasets in AI evaluation?

How do you implement continuous evaluation for production AI systems?

How do you evaluate bias in AI model outputs?

How do you compare two models or prompts in a statistically rigorous way?

How do you evaluate the robustness of an LLM application across input variations?

What are the key differences between evaluating traditional ML vs LLM applications?

How do you set up an evaluation framework from scratch for a new LLM application?

Your model passes one fairness metric but fails another. How do you handle conflicting audit results?

Your model was fair at deployment, but became biased 6 months later. How do you monitor continuously?

An external auditor cannot reproduce your model's results. How do you ensure audit reproducibility?

How do you structure red teaming for an LLM chatbot before launch?

How do you red team a multimodal model where text-only safety tests miss cross-modal attacks?

Evaluation & Testing
Interview Prep Portal