Multihead Attention Mechanism

Key points

  • Batch of 30 sentences
  • Sequence length is 50
  • Embedding dimension is 512
  • Number of layers: 64 (the decoder block is repeated 64 times)
    • Stacking the blocks gives layered, progressively refined learning and makes the embeddings more context-aware
  • Attention heads: 8 heads compute attention in parallel (see the first code sketch after this list)
    • Gives parallel, independent learning in each head
  • Concat
    • Combines the outputs of all attention heads; the concatenated result is then passed through a linear projection layer
  • Add residual connection
    • Allows gradients to flow through the deep network and speeds up training (see the second sketch after this list)
  • Layer normalization
    • Normalizes each token's representation across the embedding dimension
  • Feedforward
    • Adds nonlinearity, enabling more complex learning
  • Linear layer
    • Projects the embedding to the vocabulary size
  • Decode
    • Softmax converts the logits into per-token probabilities over the vocabulary, from which the next token is decoded
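
Below is a minimal PyTorch sketch of the multi-head attention path from the list above, using the same example sizes (batch 30, sequence length 50, embedding dimension 512, 8 heads). The module name MultiHeadSelfAttention and the weight layout are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads           # 512 / 8 = 64 dims per head
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)  # linear projection after concat

    def forward(self, x):                                # x: (batch, seq_len, embed_dim)
        b, t, d = x.shape
        # Project and split into heads: (batch, heads, seq_len, head_dim)
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        # Each head attends independently and in parallel
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        attn = torch.softmax(scores, dim=-1)
        heads = attn @ v                                 # (batch, heads, seq_len, head_dim)
        # Concat: merge the heads back into one 512-dim vector per token
        concat = heads.transpose(1, 2).reshape(b, t, d)
        return self.out_proj(concat)


x = torch.randn(30, 50, 512)                             # batch 30, seq len 50, dim 512
print(MultiHeadSelfAttention()(x).shape)                 # torch.Size([30, 50, 512])
```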
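The remaining bullets (residual connection, layer normalization, feedforward, linear layer, softmax decoding) can be tied together in one decoder-style block. This is a rough sketch under stated assumptions: nn.MultiheadAttention stands in for the attention above, and the vocabulary size of 32000 is an assumed value.

```python
import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    def __init__(self, embed_dim=512, num_heads=8, ff_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)             # normalizes each token's 512 dims
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(                        # feedforward adds nonlinearity
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)                     # add residual, then layer norm
        x = self.norm2(x + self.ffn(x))                  # add residual, then layer norm
        return x


vocab_size = 32000                                       # assumed vocabulary size
block = DecoderBlock()
to_vocab = nn.Linear(512, vocab_size)                    # projects embedding to vocab size

x = torch.randn(30, 50, 512)
h = block(x)                                             # stack this block N times in practice
logits = to_vocab(h)                                     # (30, 50, 32000)
probs = torch.softmax(logits, dim=-1)                    # per-token probability distribution
next_token = probs[:, -1, :].argmax(dim=-1)              # greedy decode of the next token
print(next_token.shape)                                  # torch.Size([30])
```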