
Metrics to understand for LLM production

Throughput

  • defined as the number of queries processed per second
  • Maximise throughput to make the best use of GPU resources

Latency

  • defined as the time taken per generated token
  • Minimise latency to suit the user experience

Cost

  • the cost of processing each token
  • Minimise; a sketch computing all three metrics follows below
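
As a quick illustration, the minimal sketch below derives all three metrics from one benchmark run. The function and every number in it are hypothetical; substitute your own measurements and GPU price.

```python
# Serving metrics from a single benchmark run (all numbers hypothetical).
def serving_metrics(num_queries: int, tokens_generated: int,
                    wall_clock_s: float, gpu_cost_per_hour: float):
    throughput = num_queries / wall_clock_s             # queries / second
    latency = wall_clock_s / tokens_generated           # seconds / token
    run_cost = gpu_cost_per_hour / 3600 * wall_clock_s  # dollars for the run
    cost_per_token = run_cost / tokens_generated        # dollars / token
    return throughput, latency, cost_per_token

# e.g. 1,000 queries producing 200,000 tokens in 5 minutes on a $2/hour GPU
qps, s_per_tok, usd_per_tok = serving_metrics(1_000, 200_000, 300.0, 2.0)
print(f"{qps:.2f} qps, {s_per_tok * 1e3:.2f} ms/token, ${usd_per_tok:.2e}/token")
```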

What affects the LLM metrics

  • Time spent on computation during inference
  • Time spent loading the model weights into GPU memory
  • A break-even point exists in the batch size between the cost of loading the model and the cost of computing the tokens
  • Below this break-even, latency is dominated by loading the model weights (memory-bound)
  • Above this break-even, latency is dominated by computing the tokens (compute-bound)
  • Choosing the batch size is therefore an important decision; a rough estimate of the break-even point is sketched after this list
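
As a rough sketch of where that break-even sits: assuming each decoding step reads the weights once and spends about 2 FLOPs per parameter per token (a common simplification, not from the source), the break-even batch size is roughly the GPU's compute-to-bandwidth ratio. The hardware numbers below are published A100 80GB specs; swap in your own GPU's figures.

```python
# Memory-vs-compute break-even batch size, assuming ~2 FLOPs per parameter
# per token and fp16 weights (2 bytes per parameter). A100 80GB specs below.
PEAK_FLOPS = 312e12      # fp16 tensor-core peak, FLOPs/s
MEM_BANDWIDTH = 2.0e12   # HBM bandwidth, bytes/s
BYTES_PER_PARAM = 2      # fp16 weights

def decode_step_latency(batch_size: int, n_params: float) -> float:
    """Latency of one decoding step: max of weight-loading and compute time."""
    load_time = n_params * BYTES_PER_PARAM / MEM_BANDWIDTH  # stream weights once
    compute_time = batch_size * 2 * n_params / PEAK_FLOPS   # 2*P FLOPs per token
    return max(load_time, compute_time)

# Break-even: the batch size where compute time catches up with loading time.
break_even = PEAK_FLOPS * BYTES_PER_PARAM / (2 * MEM_BANDWIDTH)
print(f"break-even batch size ~= {break_even:.0f}")  # ~156 on an A100

for b in (1, 64, 256):  # below break-even, latency barely changes with batch
    print(b, f"{decode_step_latency(b, 7e9) * 1e3:.1f} ms")  # 7B-parameter model
```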

GPU memory utilisation

  • Model weights
  • Activation space to run calculations in
  • KV cache (cached attention keys and values for previous tokens); a memory estimate is sketched after this list
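
As a back-of-envelope illustration, the sketch below estimates the weights and KV-cache components. The KV-cache formula is the standard per-layer, per-head accounting; the model dimensions used here are hypothetical, loosely 7B-class.

```python
# GPU memory estimate, assuming fp16 storage (2 bytes per value).
BYTES = 2  # fp16

def weights_gb(n_params: float) -> float:
    return n_params * BYTES / 1e9

def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch_size) -> float:
    # 2x for keys and values, stored per layer, per head, per cached token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * BYTES / 1e9

print(f"weights:  {weights_gb(7e9):.1f} GB")                    # ~14 GB
print(f"KV cache: {kv_cache_gb(32, 32, 128, 4096, 8):.1f} GB")  # ~17 GB
# Activation workspace comes on top of these and varies with the kernels used.
```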

References

  1. LLMs in Production
  2. MLOps.community