Question 1

LLM optimization techniques

Accepted Answer

LLM optimization combines algorithmic improvements (FlashAttention, speculative decoding), memory management (PagedAttention, KV cache optimization), compression (quantization, pruning), and hardware-specific compilation (TensorRT-LLM) to maximize inference throughput and minimize latency. Serving a 70B-parameter model naively in PyTorch is too slow for production, so you optimize the model and the serving stack first. The four most common techniques:
1. **Quantization**: Shrink the weights from 16-bit to 8-bit or 4-bit.
2. **KV Caching**: Cache the attention states of previous tokens so they aren't recalculated.
3. **FlashAttention**: Compute attention in tiles that stay in fast on-chip memory, cutting reads and writes to GPU main memory and speeding up the attention calculation.
4. **Continuous Batching**: Pack multiple user requests into the GPU simultaneously to maximize utilization.
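
Item 3 is the least intuitive of the four, so here is a minimal NumPy sketch of the idea behind it, assuming a single attention head and no masking. It processes keys and values one block at a time with a running ("online") softmax, so the full N x N score matrix is never materialized. It runs on the CPU and only illustrates the math; the real kernels are fused GPU code. The function names and the block size are illustrative.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference attention that builds the full (N x N) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block=4):
    """Same result, but keys/values are visited one block at a time."""
    n, d = q.shape
    out = np.zeros((n, v.shape[1]))
    running_max = np.full((n, 1), -np.inf)  # largest score seen so far per query
    running_sum = np.zeros((n, 1))          # softmax denominator so far
    for start in range(0, k.shape[0], block):
        k_blk = k[start:start + block]
        v_blk = v[start:start + block]
        s = q @ k_blk.T / np.sqrt(d)
        new_max = np.maximum(running_max, s.max(axis=-1, keepdims=True))
        # Rescale what was accumulated under the old maximum.
        correction = np.exp(running_max - new_max)
        p = np.exp(s - new_max)
        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        running_max = new_max
    return out / running_sum
```

Both functions return the same values up to floating-point error; the tiled version only ever holds an N x block slice of scores at once.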

Question 2

How do you select GPUs for LLM inference?

Accepted Answer

GPU selection is driven first by VRAM (Video RAM) capacity, which must hold the model weights plus the KV cache, then by memory bandwidth, which sets token generation speed, and finally by compute throughput (TFLOPS), which matters most when processing the prompt. For LLM inference, memory is the biggest bottleneck, not raw compute. A 70B model in 16-bit precision requires ~140GB of VRAM just to load the weights. An NVIDIA A100 tops out at 80GB of VRAM, so you need at least two A100s connected via NVLink and running Tensor Parallelism, and in practice more headroom for the KV cache. Consumer GPUs (like the RTX 4090 with 24GB VRAM) are excellent for single-user local inference if the model is quantized, but they lack the VRAM capacity and the NVLink interconnect needed for high-throughput multi-user servers.
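
The sizing arithmetic above can be sketched in a few lines. This is a rough estimate, not a capacity planner: `overhead_fraction` is a hypothetical knob standing in for KV cache and activation memory, and the 20% default is an assumption, not a measured figure.

```python
import math

def weight_memory_gb(params_billion, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def min_gpus(params_billion, bits_per_weight, gpu_vram_gb, overhead_fraction=0.2):
    """Smallest GPU count whose combined VRAM holds weights plus overhead.

    overhead_fraction is an assumed allowance for KV cache and activations.
    """
    needed = weight_memory_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return math.ceil(needed / gpu_vram_gb)
```

For example, `weight_memory_gb(70, 16)` gives 140 GB, matching the figure in the answer. `min_gpus(70, 16, 80, overhead_fraction=0.0)` gives the bare minimum of two 80GB cards, while the default 20% allowance pushes it to three.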

Question 3

What is model parallelism vs data parallelism in distributed training?

Accepted Answer

Data Parallelism copies the entire model onto every GPU and splits the training data across them. Model Parallelism splits a single model across multiple GPUs when it is too large to fit in the VRAM of one GPU. If you have an 8B model (about 16GB of 16-bit weights) and four 80GB GPUs, use Data Parallelism, provided the gradients and optimizer state also fit on each GPU. Each GPU gets a full copy of the model; GPU 1 trains on batch A, GPU 2 on batch B, and so on, and the gradients are averaged across GPUs after every step. Training speeds up by close to 4x, minus the cost of that synchronization.
If you have a 100B model (about 200GB of 16-bit weights) and four 80GB GPUs, use Model Parallelism. The model cannot fit on one GPU, so you must slice the neural network (e.g., layers 1-10 on GPU 1, layers 11-20 on GPU 2) and pass the activations between them.
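
Data parallelism works because averaging the gradients from equal-sized shards gives the same result as one gradient over the whole batch. A minimal NumPy sketch, assuming a linear model with mean-squared-error loss and simulating the "GPUs" as loop iterations (the function names are illustrative):

```python
import numpy as np

def mse_gradient(w, x, y):
    """Gradient of mean squared error for a linear model y ~ x @ w."""
    residual = x @ w - y
    return 2.0 * x.T @ residual / len(x)

def data_parallel_gradient(w, x, y, num_workers):
    """Each worker holds the full w and sees only its shard of the batch."""
    x_shards = np.split(x, num_workers)  # requires equal-sized shards
    y_shards = np.split(y, num_workers)
    local_grads = [mse_gradient(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    # Stand-in for the all-reduce that averages gradients across GPUs.
    return np.mean(local_grads, axis=0)
```

With 16 samples and 4 workers, `data_parallel_gradient` matches `mse_gradient` on the full batch up to floating-point error, so every replica applies the same update and the copies stay in sync.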

Question 4

What is tensor parallelism, and how does it help serve large models?

Accepted Answer

Tensor Parallelism is a specific type of model parallelism that slices individual matrix operations (like attention or linear layers) across multiple GPUs, which compute their shards simultaneously and then sync the results. Instead of putting Layer 1 on GPU A and Layer 2 on GPU B (Pipeline Parallelism), Tensor Parallelism splits a single weight matrix in Layer 1 in half. GPU A does the math for the left half of the matrix; GPU B does the right half. They then communicate over NVLink to combine the results: a column split concatenates the partial outputs, while a row split sums them with an all-reduce. This sharply reduces the VRAM requirement per GPU and speeds up the math for large models like Llama 3 70B during inference.
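
The two ways of splitting a matrix multiply can be checked with a small NumPy sketch. The GPUs are simulated as list entries and the function names are illustrative; a real implementation would replace the concatenate and sum with collective operations over the interconnect.

```python
import numpy as np

def column_parallel_matmul(x, w, num_gpus):
    """Split W by columns: each GPU produces a slice of the output columns."""
    w_shards = np.split(w, num_gpus, axis=1)
    partial = [x @ w_shard for w_shard in w_shards]
    return np.concatenate(partial, axis=1)  # stand-in for an all-gather

def row_parallel_matmul(x, w, num_gpus):
    """Split W by rows (and x by columns): each GPU produces a partial sum."""
    x_shards = np.split(x, num_gpus, axis=1)
    w_shards = np.split(w, num_gpus, axis=0)
    partial = [x_s @ w_s for x_s, w_s in zip(x_shards, w_shards)]
    return np.sum(partial, axis=0)  # stand-in for an all-reduce
```

Both functions reproduce `x @ w`, but each simulated GPU only ever holds `1 / num_gpus` of the weight matrix.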

Question 5

What is pipeline parallelism?

Accepted Answer

Pipeline Parallelism is a type of model parallelism that slices a neural network by layer across multiple GPUs. For example, GPU A holds layers 1-10, processes the data, and passes the intermediate results to GPU B, which holds layers 11-20. Imagine an assembly line: GPU A does the initial processing and passes its output to GPU B, which passes its output to GPU C. The problem is 'pipeline bubbles': while GPU A is working on the first batch, GPUs B and C sit idle waiting for data. Advanced frameworks use micro-batching to keep all GPUs busy, feeding a new micro-batch into GPU A as soon as it passes the previous one to GPU B.
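
The size of the bubble can be estimated with a short calculation. This sketch assumes a simplified forward-only schedule in which every stage takes exactly one time step per micro-batch; real schedules also interleave backward passes during training, so treat the numbers as illustrative.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Return (total_steps, idle_fraction) for a simple forward-only pipeline.

    Assumes each stage spends one time step on each micro-batch.
    """
    # The last micro-batch leaves the last stage at this step.
    total_steps = num_stages + num_microbatches - 1
    busy_slots = num_stages * num_microbatches
    all_slots = num_stages * total_steps
    return total_steps, 1 - busy_slots / all_slots
```

With three GPUs and a single batch, two thirds of the GPU time is idle. Splitting the same work into eight micro-batches drops the idle share to about 20%.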

Question 6

How does continuous batching improve LLM inference throughput?

Accepted Answer

Continuous batching dynamically inserts new user requests into the GPU's processing batch the moment a previous request finishes its generation, rather than waiting for all requests in a static batch to complete. In older static batching, if you batch 4 requests together, and Request A needs 10 tokens but Request B needs 100 tokens, Request A finishes in 1 second and its slot then sits idle for 9 seconds waiting for B to finish. Continuous batching (iteration-level scheduling) ejects Request A as soon as it finishes and immediately slots Request C into the batch, keeping the GPU close to fully utilized.
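
A toy scheduler makes the difference measurable. This is a sketch under simplifying assumptions: every request is reduced to the number of tokens it will generate, each decoding step produces one token for every active request, and prompt processing is ignored. The function names are illustrative.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decoding step."""
    queue = deque(lengths)
    active = []  # remaining tokens for each request currently in the batch
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Requests on their last token finish now and leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps
```

For the request lengths `[10, 100, 10, 10, 100, 10, 10, 10]` and a batch size of 4, the static scheduler needs 200 decoding steps while the continuous one needs 110.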

Question 7

What is speculative decoding, and how does it speed up inference?

Accepted Answer

Speculative decoding speeds up inference by using a small, fast 'draft' model to guess the next several tokens, and then using the large, slow 'target' model to verify all those guessed tokens simultaneously in a single forward pass. LLM generation is memory-bandwidth bound: reading the massive weights of a 70B model from VRAM into the compute cores to generate one token takes time. With speculative decoding, a tiny 1B model guesses the next 5 words: 'The cat sat on the'. The 70B model reads its weights ONCE, processes all 5 words in parallel, and confirms 'Yes, those 5 words are exactly what I would have generated.' In that best case you generated 5 tokens for roughly the time-cost of 1, plus the small cost of running the draft model. If the target disagrees at some position, it keeps the guesses before that point and substitutes its own token there, so the output is the same as if the target had generated alone.
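
The accept-or-correct logic can be sketched with toy models. This is the greedy variant only, and everything here is a stand-in: `target_next` and `draft_next` are hypothetical functions that map a token list to the next token, and the per-position calls to the target simulate what a real model computes in one batched forward pass.

```python
SENTENCE = "the cat sat on the mat and purred".split()

def target_next(context):
    """Toy 'large' model: recites a fixed sentence, then emits <eos>."""
    return SENTENCE[len(context)] if len(context) < len(SENTENCE) else "<eos>"

def draft_next(context):
    """Toy 'small' model: agrees with the target except on one word."""
    word = target_next(context)
    return "rug" if word == "mat" else word

def speculative_step(target, draft, context, k):
    """One round of greedy speculative decoding; returns the new tokens."""
    # 1. The draft model proposes k tokens, one after another.
    guesses = []
    for _ in range(k):
        guesses.append(draft(context + guesses))
    # 2. The target scores every prefix (one forward pass in a real model).
    verified = [target(context + guesses[:i]) for i in range(k + 1)]
    # 3. Keep guesses while they match, then append the target's own token.
    accepted = []
    for guess, truth in zip(guesses, verified):
        if guess != truth:
            break
        accepted.append(guess)
    accepted.append(verified[len(accepted)])
    return accepted
```

Starting from an empty context with `k=5`, all five guesses are accepted and the target contributes a sixth token. Starting from `["the", "cat"]`, the draft's "rug" is rejected and replaced by the target's "mat", so the round still yields four correct tokens.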

Question 8

What is KV cache, and how do you manage memory for it?

Accepted Answer

The KV (Key-Value) Cache stores the attention keys and values of previously processed tokens so the model doesn't have to recalculate the entire sequence every time it generates a new token. Managing it requires allocating large chunks of VRAM. When generating token 100, the model needs to know how token 100 relates to token 1. Instead of recalculating the math for tokens 1-99 on every single step (which makes generation O(N^2) overall), it saves the Key and Value matrices of tokens 1-99 in the GPU's memory. As context windows grow to 128k+, the KV cache size explodes, sometimes consuming more memory than the model weights themselves.
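
The cache size follows directly from the model shape. The sketch below assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token; the example shape (80 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumption chosen to resemble a 70B-class model with grouped-query attention.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every token in the batch."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len * batch_size

# Assumed 70B-class shape, one sequence of 128k (131072) tokens, 16-bit values.
one_sequence_gib = kv_cache_bytes(80, 8, 128, 131072) / 2**30  # 40.0 GiB
```

At that shape a single 128k-token sequence needs 40 GiB of cache, and a batch of four such sequences needs 160 GiB, which is more than the ~140GB of 16-bit weights.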

Question 9

What is Paged Attention?

Accepted Answer

PagedAttention is a memory management algorithm that stores the KV cache in non-contiguous blocks of memory (pages) rather than a single large contiguous block, virtually eliminating memory fragmentation and allowing much larger batch sizes. Historically, frameworks pre-allocated a contiguous block of VRAM sized for the maximum possible sequence length (e.g., 8k tokens). If the request only used 10 tokens, over 99% of that VRAM was wasted (internal fragmentation). PagedAttention, introduced by vLLM, borrows the concept of 'Virtual Memory Paging' from operating systems. It allocates VRAM dynamically in small 'pages' as the tokens are generated, so the only waste is the unfilled part of each sequence's last page.
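
The bookkeeping side of this can be sketched as a tiny block allocator. This is not vLLM's implementation or API; the class and method names are hypothetical, and the sketch tracks only which physical blocks belong to which sequence, not the tensors themselves.

```python
class PagedKVCache:
    """Toy block table: maps each sequence to a list of physical block ids."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> physical block ids, in logical order
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, grabbing a new block if needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self):
        """Allocated token slots that hold no token yet."""
        allocated = sum(len(b) for b in self.block_tables.values()) * self.block_size
        return allocated - sum(self.seq_lens.values())
```

With 16-token blocks, a 10-token sequence occupies one block and wastes 6 slots, instead of reserving thousands of slots up front. Blocks freed by a finished sequence can be handed to any other sequence, regardless of where they sit in memory.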

Question 10

How do you optimize inference for edge and mobile deployment?

Accepted Answer

Edge deployment relies on aggressive quantization (e.g., 4-bit weights stored in the GGUF format), pruning, and specialized mobile hardware (NPUs such as the Apple Neural Engine) via frameworks like MLX, Core ML, or ExecuTorch. You cannot run a 70B model on an iPhone. You must use a Small Language Model (SLM) like Llama 3 8B or Phi-3. You convert the model into a deployment format (like GGUF or Core ML) and quantize the weights down to 4-bit. For an 8B model this shrinks the weights to ~4GB, allowing them to fit into the unified memory of a high-end iPhone or a Mac. Inference is then executed using specialized C++ engines (like llama.cpp) optimized for on-device CPU and GPU vector math.

Question 11

What is model quantization (INT8, INT4, FP16, BF16), and how does it affect quality?

Accepted Answer

Quantization converts model weights from high-precision formats (FP16/BF16) to lower-precision integers (INT8/INT4). It drastically reduces memory usage and speeds up inference, usually with only minor degradation in reasoning quality.
- **FP16/BF16 (16-bit Float)**: Standard precision. High accuracy, massive VRAM requirement.
- **INT8 (8-bit Integer)**: Halves the memory. Almost zero noticeable quality loss for general tasks.
- **INT4 (4-bit Integer)**: Quarters the memory. Noticeable quality loss on complex coding or deep logic tasks, but excellent for chat and summarization.
Advanced algorithms (like AWQ) protect the most 'important' weights from being quantized too aggressively, preserving the model's accuracy better than naive rounding.
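
For reference, the "naive rounding" baseline looks like the sketch below: symmetric absmax quantization with a single scale for the whole tensor. This is not AWQ; it is the simple scheme that methods like AWQ improve on. The function names are illustrative, the tensor is assumed to contain at least one non-zero value, and 4-bit values are kept in an int8 array for simplicity rather than packed two per byte.

```python
import numpy as np

def quantize_absmax(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    scale = np.abs(weights).max() / qmax       # assumes a non-zero tensor
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return q.astype(np.float64) * scale
```

The rounding error per weight is at most half the scale, and the scale grows as the bit width shrinks, which is why INT4 loses more quality than INT8 on the same tensor.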

AI Infrastructure & Scalability Interview Prep Portal