Evaluation & Testing: Large Language Models

Evaluation & Testing
Interview Prep Portal

Master Large Language Models (LLMs), RAG pipelines, vector semantic search, embedding geometries, prompt engineering methodologies, and autonomous tool-calling AI agents.

Evaluation & Testing | Intermediate | Q1

AI Agent Evaluation

Evaluation & Testing | Beginner | Q2

What is evaluation-driven development for AI applications?

Evaluation & Testing | Intermediate | Q3

How do you evaluate LLM outputs? What metrics do you use?

Evaluation & Testing | Intermediate | Q4

Explain BLEU, ROUGE, and BERTScore. When would you use each?

Evaluation & Testing | Advanced | Q5

What is G-Eval, and how does it use LLMs for evaluation?

Evaluation & Testing | Advanced | Q6

What is LLM-as-a-judge evaluation, and what are its limitations?

Evaluation & Testing | Intermediate | Q7

How do you conduct human evaluation for AI systems?

Evaluation & Testing | Advanced | Q8

What is red teaming, and how do you red team an LLM application?

Evaluation & Testing | Advanced | Q9

How do you detect and measure hallucinations in LLM outputs?

Evaluation & Testing | Intermediate | Q10

What is adversarial testing for AI systems?

Evaluation & Testing | Intermediate | Q11

How do you build a regression test suite for AI applications?

Evaluation & Testing | Beginner | Q12

What are benchmark suites (MMLU, HumanEval, GSM8K), and how do you interpret them?

Evaluation & Testing | Advanced | Q13

How do you evaluate a RAG system end-to-end?

Evaluation & Testing | Advanced | Q14

How do you evaluate the quality of AI agents?

Evaluation & Testing | Intermediate | Q15

What is the difference between offline and online evaluation for AI systems?

Evaluation & Testing | Advanced | Q16

How do you measure factual consistency in LLM outputs?

Evaluation & Testing | Intermediate | Q17

How do you evaluate multi-turn conversation quality?

Evaluation & Testing | Beginner | Q18

What is the role of golden datasets in AI evaluation?

Evaluation & Testing | Advanced | Q19

How do you implement continuous evaluation for production AI systems?

Evaluation & Testing | Advanced | Q20

How do you evaluate bias in AI model outputs?

Evaluation & Testing | Advanced | Q21

How do you compare two models or prompts in a statistically rigorous way?

Evaluation & Testing | Intermediate | Q22

How do you evaluate the robustness of an LLM application across input variations?

Evaluation & Testing | Beginner | Q23

What are the key differences between evaluating traditional ML vs LLM applications?

Evaluation & Testing | Intermediate | Q24

How do you set up an evaluation framework from scratch for a new LLM application?

Evaluation & Testing | Advanced | Q25

Your model passes one fairness metric but fails another. How do you handle conflicting audit results?

Evaluation & Testing | Advanced | Q26

Your model was fair at deployment, but became biased 6 months later. How do you monitor continuously?

Evaluation & Testing | Intermediate | Q27

An external auditor cannot reproduce your model's results. How do you ensure audit reproducibility?

Evaluation & Testing | Advanced | Q28

How do you structure red teaming for an LLM chatbot before launch?

Evaluation & Testing | Advanced | Q29

How do you red team a multimodal model where text-only safety tests miss cross-modal attacks?