# The KV Cache in Transformer Models

Table of Contents

Transformers and GPU Memory

  • OpenAI charges roughly twice as much per input token for its longer-context GPT-3 models.
  • Economic consequences of high memory consumption.
  • Most memory, especially with larger context lengths, goes towards the KV cache.

Understanding Self-Attention Mechanism

  • Each token corresponds to an embedding vector X.
  • X is multiplied by learned weight matrices to form the Query (Q), Key (K), and Value (V) vectors.
  • Q represents the current (new) token; K and V represent the previous context.
  • Attention mechanism: Attention(Q, K, V) = softmax(QK^T / sqrt(d)) · V.
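
The attention formula above translates directly into code. Below is a minimal NumPy sketch of single-head scaled dot-product attention; the shapes, sizes, and variable names are illustrative choices, not taken from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V for a single head.

    Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v)
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V                  # (n_queries, d_v)

# Example: one new token (a single query) attending to 4 previous tokens.
rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(attention(Q, K, V).shape)  # (1, 8)
```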

The Role of KV Cache

  • In autoregressive decoding, the Q vector for the new token is generated, and the cached K and V matrices from previous steps are fetched.
  • The model only computes the new token's key and value, which add one new column to K^T and one new row to V; the rest is reused from the cache.

Pivotal Function of KV Cache within Transformer’s Architecture

  • KV cache works seamlessly with the self-attention layer.
  • The self-attention layer takes the cached K and V from previous tokens together with the embedding of the current token.
  • It computes new K and V vectors for the current token and appends them to the KV cache.
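
To make the interplay between the self-attention layer and the cache concrete, here is a simplified single-layer, single-head sketch. The projection matrices and token embeddings are random stand-ins, so treat this as an illustration of the caching pattern rather than any real model's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def decode_step(x, k_cache, v_cache):
    """Process one new token embedding x of shape (d,)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v    # only the new token is projected
    k_cache.append(k)                      # the cache grows by one entry per step
    v_cache.append(v)
    K = np.stack(k_cache)                  # (tokens_so_far, d)
    V = np.stack(v_cache)
    weights = softmax(q @ K.T / np.sqrt(d))
    return weights @ V                     # attention output for the new token

k_cache, v_cache = [], []
for step in range(5):
    x = rng.normal(size=(d,))              # stand-in for the current token's embedding
    out = decode_step(x, k_cache, v_cache)
print(len(k_cache), out.shape)             # 5 cached keys, output of shape (8,)
```

The point to notice is that the per-step projection cost stays constant, while the cache (and hence the attention computation) grows linearly with the number of generated tokens.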

Memory Usage Calculation

  • Formula (result in bytes): 2 × precision × layers × dimension × sequence_length × batch
  • Elements:
    • 2 for K and V matrices.
    • Precision: number of bytes per cached value (e.g., 2 for FP16).
    • Layers: total number of layers.
    • Dimension: the model's hidden (embedding) size.
    • Sequence_length: length of the sequence to generate.
    • Batch: batch size.
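
Plugging representative numbers into the formula gives a sense of scale. This is a back-of-the-envelope sketch assuming an OPT-13B-like configuration (40 layers, hidden dimension 5120) with FP16 values; exact figures vary by model.

```python
# KV cache size from the formula: 2 x precision x layers x dimension x seq_len x batch.
def kv_cache_bytes(precision_bytes, layers, dimension, sequence_length, batch):
    return 2 * precision_bytes * layers * dimension * sequence_length * batch

per_token = kv_cache_bytes(precision_bytes=2, layers=40, dimension=5120,
                           sequence_length=1, batch=1)
per_sequence = kv_cache_bytes(2, 40, 5120, sequence_length=2048, batch=1)

print(f"{per_token / 1e3:.0f} KB per token")                   # ~819 KB per token
print(f"{per_sequence / 1e9:.2f} GB per 2048-token sequence")  # ~1.68 GB
```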

Drawbacks and Constraints of KV Cache

  • Memory allocation breakdown for serving a 13B-parameter language model (LM) on an NVIDIA A100 GPU with 40 GB of memory.
  • Approximately 65% goes to model weights, about 30% to dynamic state (the KV cache), and the remainder to other data such as activations.
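
As a rough sanity check on these proportions, assuming the weights are stored in FP16: 13B parameters × 2 bytes ≈ 26 GB, which is 65% of the A100's 40 GB. That leaves roughly 14 GB for everything else, most of it consumed by the KV cache, which is why the cache becomes the limiting factor for batch size and sequence length.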

KV Cache and Latency

  • Latency is higher when processing the prompt, since K and V must be computed for every prompt token.
  • For subsequent tokens, latency is lower because the cached K and V are reused and only the current token's K and V need to be computed.

Conclusion

  • KV Cache is a core component of Transformer models.
  • It demands careful management due to its significant impact on memory usage, especially in large models.
  • The challenge is balancing powerful NLP models against optimizations that keep them within resource constraints.
