
“Encoder”: Bidirectional + Self-attention
“Decoder”: Unidirectional + Auto-regression + Masked self-attention
- Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
- Decoder-only models: Good for generative tasks such as text generation.
- Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.
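A quick illustration of the three families using Hugging Face `transformers` pipelines (a sketch, assuming the library is installed; the default checkpoints are downloaded on first use):

```python
from transformers import pipeline

# Encoder-only (BERT-style): understanding tasks such as classification
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers are remarkably flexible."))

# Decoder-only (GPT-style): open-ended text generation
generator = pipeline("text-generation")
print(generator("The encoder reads the input and", max_new_tokens=20))

# Encoder-decoder / seq2seq (T5/BART-style): translation, summarization
translator = pipeline("translation_en_to_fr")
print(translator("Attention is all you need."))
```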
[ Encoder ] Attention layer: The attention mechanism in transformers is designed to weigh the importance of different tokens (words or subwords) in a sequence differently, depending on the context.
- Tokenization: the process of breaking text down into smaller pieces called tokens, usually words or subwords, making it easier to manage and analyze.
- Input Embedding: the process of converting tokens into numerical vectors in a way that preserves the relationships and properties of the text, enabling the model to process the text data (see the sketch below).
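A minimal sketch of tokenization + input embedding with a toy whitespace tokenizer and a randomly initialized PyTorch embedding table (the vocabulary and sizes are illustrative, not from any real model):

```python
import torch
import torch.nn as nn

# Toy vocabulary and whitespace tokenizer (illustrative only)
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def tokenize(text):
    # Split text into tokens and map each token to its vocabulary id
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]

token_ids = torch.tensor(tokenize("the cat sat on the mat"))

# Input embedding: each token id becomes a dense vector of size d_model
d_model = 8
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)
x = embedding(token_ids)      # shape: (sequence_length, d_model)
print(x.shape)                # torch.Size([6, 8])
```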
Position encoding: sinusoidal (sin/cos) position encoding, using a different frequency for each pair of dimensions.
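A sketch of the sinusoidal encoding from the original Transformer paper; even dimensions use sine and odd dimensions use cosine, each pair at its own frequency (sizes illustrative):

```python
import torch

def sinusoidal_position_encoding(seq_len, d_model):
    # positions 0..seq_len-1, one row per position
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    # one frequency per pair of dimensions: 1 / 10000^(2i/d_model)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    freq = 1.0 / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(pos * freq)   # odd dimensions: cosine
    return pe

pe = sinusoidal_position_encoding(seq_len=6, d_model=8)
# Added element-wise to the input embeddings of the same shape
```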
Multi-Head Attention (self-attention):
Query · Key ⇒ attention filter; the filter is then applied to the Values (like overlaying an attention filter on the original image) — sketched below.
(different heads learn different attention filters that focus on different content.)
Cosine similarity: ranges from +1 (most similar) to -1 (most dissimilar); the Query·Key dot product plays the same role, scoring how similar each pair of tokens is.
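A sketch of scaled dot-product attention: the Query·Key similarity scores form the attention filter, which is softmax-normalized row-wise and applied to the Values (shapes illustrative):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # Similarity scores between every query and every key (the "attention filter")
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)   # each row sums to 1
    # Weighted sum of values: the filter applied to the content
    return weights @ V, weights

# Toy example: 4 tokens, head dimension 8
Q = K = V = torch.randn(4, 8)
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)   # torch.Size([4, 8]) torch.Size([4, 4])
```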
Residual Connections: preservation of earlier information
Add & norm ⇒ makes convergence faster and more stable
Feed Forward ⇒ enhances the model’s ability to represent complex functions
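A sketch of how these pieces combine in one encoder block: a residual connection with Add & Norm around self-attention, then another around the feed-forward network (PyTorch, layer sizes illustrative):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=8, n_heads=2, d_ff=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Position-wise feed-forward: expand to d_ff, then project back to d_model
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Residual connection around self-attention, then Add & Norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Residual connection around feed-forward, then Add & Norm
        x = self.norm2(x + self.ffn(x))
        return x

block = EncoderBlock()
x = torch.randn(1, 6, 8)       # (batch, sequence_length, d_model)
print(block(x).shape)          # torch.Size([1, 6, 8])
```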
[ Decoder ]
Cross-attention: the encoder output supplies K & V to the decoder; Q comes from the decoder itself.
Linear + softmax ⇒ projects the decoder output to vocabulary logits and turns them into probabilities over the next token.
Masked multi-head attention ⇒ each position can only attend to the tokens generated so far (no access to future tokens).
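A sketch of the causal (look-ahead) mask behind masked self-attention, plus the final linear + softmax that turns decoder outputs into next-token probabilities (vocabulary size illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

seq_len, d_model, vocab_size = 4, 8, 6

# Causal mask: position i may attend only to positions <= i (lower triangle)
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])

# Final projection: decoder output -> vocabulary logits -> probabilities
decoder_output = torch.randn(seq_len, d_model)
to_vocab = nn.Linear(d_model, vocab_size)
probs = F.softmax(to_vocab(decoder_output), dim=-1)
next_token = probs[-1].argmax()   # greedy pick of the next token
```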