Key points
- Batch size is 30 sentences
- Sequence length is 50
- Embedding dimension is 512
- Number of layers: the decoder block is repeated 64 times
- Stacking layers gives overlapping learning and makes the embeddings more context-aware
- Attention heads: 8 heads compute attention in parallel
- Gives parallel, independent learning per head
- Concat
- Combines the outputs of the attention heads; the result can optionally be passed through a linear output projection (see the attention sketch after this list)
- Add residual
- Allows gradients to flow through the deep network and speeds up training (see the add & norm sketch after this list)
- Layer normalization
- Normalizes across the embedding dimensions for each token
- Feed-forward network
- Adds nonlinearity for more complex learning (see the feed-forward sketch after this list)
- Linear layer
- Projects each token's embedding to the vocabulary size
- Decode
- Decodes tokens from the per-token probabilities produced by the softmax (see the decoding sketch after this list)
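The sketch below illustrates the multi-head attention step with the numbers above (batch 30, sequence length 50, embedding dimension 512, 8 heads), assuming PyTorch; the tensor names, the single fused QKV projection, and the omission of the causal mask are illustrative choices, not taken from these notes.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model, num_heads = 30, 50, 512, 8
head_dim = d_model // num_heads                      # 64 dimensions per head

x = torch.randn(batch, seq_len, d_model)             # token embeddings (30, 50, 512)

# Project to queries, keys, values and split into 8 parallel heads.
qkv_proj = nn.Linear(d_model, 3 * d_model)
q, k, v = qkv_proj(x).chunk(3, dim=-1)
q = q.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)   # (30, 8, 50, 64)
k = k.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
v = v.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

# Each head attends independently and in parallel (causal mask omitted for brevity).
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5   # (30, 8, 50, 50)
weights = torch.softmax(scores, dim=-1)
heads = weights @ v                                  # (30, 8, 50, 64)

# Concat: merge the 8 heads back into one 512-dim vector per token,
# then optionally mix them with a linear output projection.
concat = heads.transpose(1, 2).reshape(batch, seq_len, d_model)   # (30, 50, 512)
out_proj = nn.Linear(d_model, d_model)
attn_out = out_proj(concat)                          # (30, 50, 512)
```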
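A minimal sketch of the add & norm step, again assuming PyTorch; the random stand-in for the attention output is only for illustration.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 30, 50, 512
x = torch.randn(batch, seq_len, d_model)              # input to the sub-layer
sublayer_out = torch.randn(batch, seq_len, d_model)   # stand-in for the attention output

# The residual (skip) connection keeps a direct gradient path through deep stacks,
# and LayerNorm normalizes each token across its 512 embedding values.
norm = nn.LayerNorm(d_model)
y = norm(x + sublayer_out)                            # (30, 50, 512)
```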
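A minimal sketch of the position-wise feed-forward step, assuming PyTorch; the 2048 hidden size and the ReLU nonlinearity are common choices, not stated in these notes.

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048                      # d_ff is an assumed hidden size
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),                                 # nonlinearity enabling more complex learning
    nn.Linear(d_ff, d_model),
)

x = torch.randn(30, 50, d_model)
out = ffn(x)                                   # shape stays (30, 50, 512)
```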
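A minimal sketch of the final linear projection and decoding step, assuming PyTorch; the vocabulary size of 32000 and greedy (argmax) decoding are illustrative assumptions.

```python
import torch
import torch.nn as nn

batch, seq_len, d_model, vocab_size = 30, 50, 512, 32000
hidden = torch.randn(batch, seq_len, d_model)      # output of the decoder stack

to_vocab = nn.Linear(d_model, vocab_size)          # projects embeddings to vocabulary size
logits = to_vocab(hidden)                          # (30, 50, 32000)
probs = torch.softmax(logits, dim=-1)              # per-token probabilities over the vocabulary
next_token = probs[:, -1, :].argmax(dim=-1)        # greedy pick of the next token per sentence
```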