Key points
- Batch size is 30 sentences
- Sequence length is 50
- Embedding dimension is 512
- Number of layers: the decoder block is repeated 64 times
- Stacking layers gives overlapping learning and makes the embeddings more context-aware
- Attention heads: 8 heads compute attention in parallel
- Gives parallel, independent learning per head
- Concat
- Combines the outputs of the attention heads; the result can optionally be passed through a linear output projection (see the attention sketch after this list)
- Add residual
- Allows gradients to flow through the deep network and speeds up training (see the add & norm sketch after this list)
- Layer normalization
- Normalizes across the embedding dimensions for each token
- Feed-forward network
- Adds nonlinearity for more complex learning (see the feed-forward sketch after this list)
- Linear layer
- Projects each token's embedding to the vocabulary size
- Decode
- Decodes tokens from the per-token probabilities produced by the softmax (see the decoding sketch after this list)
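The sketch below illustrates the multi-head attention step with the numbers above (batch 30, sequence length 50, embedding dimension 512, 8 heads), assuming PyTorch; the tensor names, the single fused QKV projection, and the omission of the causal mask are illustrative choices, not taken from these notes.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model, num_heads = 30, 50, 512, 8
head_dim = d_model // num_heads                      # 64 dimensions per head

x = torch.randn(batch, seq_len, d_model)             # token embeddings (30, 50, 512)

# Project to queries, keys, values and split into 8 parallel heads.
qkv_proj = nn.Linear(d_model, 3 * d_model)
q, k, v = qkv_proj(x).chunk(3, dim=-1)
q = q.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)   # (30, 8, 50, 64)
k = k.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
v = v.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

# Each head attends independently and in parallel (causal mask omitted for brevity).
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5   # (30, 8, 50, 50)
weights = torch.softmax(scores, dim=-1)
heads = weights @ v                                  # (30, 8, 50, 64)

# Concat: merge the 8 heads back into one 512-dim vector per token,
# then optionally mix them with a linear output projection.
concat = heads.transpose(1, 2).reshape(batch, seq_len, d_model)   # (30, 50, 512)
out_proj = nn.Linear(d_model, d_model)
attn_out = out_proj(concat)                          # (30, 50, 512)
```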
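A minimal sketch of the add & norm step, again assuming PyTorch; the random stand-in for the attention output is only for illustration.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 30, 50, 512
x = torch.randn(batch, seq_len, d_model)              # input to the sub-layer
sublayer_out = torch.randn(batch, seq_len, d_model)   # stand-in for the attention output

# The residual (skip) connection keeps a direct gradient path through deep stacks,
# and LayerNorm normalizes each token across its 512 embedding values.
norm = nn.LayerNorm(d_model)
y = norm(x + sublayer_out)                            # (30, 50, 512)
```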
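A minimal sketch of the position-wise feed-forward step, assuming PyTorch; the 2048 hidden size and the ReLU nonlinearity are common choices, not stated in these notes.

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048                      # d_ff is an assumed hidden size
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),                                 # nonlinearity enabling more complex learning
    nn.Linear(d_ff, d_model),
)

x = torch.randn(30, 50, d_model)
out = ffn(x)                                   # shape stays (30, 50, 512)
```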
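A minimal sketch of the final linear projection and decoding step, assuming PyTorch; the vocabulary size of 32000 and greedy (argmax) decoding are illustrative assumptions.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model, vocab_size = 30, 50, 512, 32000
hidden = torch.randn(batch, seq_len, d_model)      # output of the decoder stack

to_vocab = nn.Linear(d_model, vocab_size)          # projects embeddings to vocabulary size
logits = to_vocab(hidden)                          # (30, 50, 32000)
probs = torch.softmax(logits, dim=-1)              # per-token probabilities over the vocabulary
next_token = probs[:, -1, :].argmax(dim=-1)        # greedy pick of the next token per sentence
```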