“Encoder”: Bidirectional + Self-attention

“Decoder”: Unidirectional + Auto-regression + Masked self-attention

  • Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
  • Decoder-only models: Good for generative tasks such as text generation.
  • Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.

[ Encoder ] Attention layer: The attention mechanism in transformers weighs the importance of different tokens (words or subwords) in a sequence, depending on the context.

  • Tokenization: the process of breaking text into smaller pieces called tokens, usually words or subwords, making it easier to manage and analyze.

  • Input Embedding: the process of converting tokens into numerical vectors in a way that maintains the relationships and properties of the text, so the model can process the text data (a minimal sketch of both steps follows below).
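
A minimal sketch of these two steps, using a toy whitespace tokenizer and a randomly initialized embedding table (both hypothetical; real models use learned subword tokenizers such as BPE and trained embedding matrices):

```python
import numpy as np

# Toy whitespace "tokenizer" + random embedding table (hypothetical).
# Real models use learned subword tokenizers (e.g. BPE/WordPiece) and
# embedding matrices trained together with the rest of the network.
vocab = {"i": 0, "like": 1, "transformers": 2, "<unk>": 3}
d_model = 8
embedding_table = np.random.randn(len(vocab), d_model)  # placeholder weights

def tokenize(text):
    # split on whitespace and map unknown words to <unk>
    return [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]

token_ids = tokenize("I like transformers")
embeddings = embedding_table[token_ids]   # shape: (seq_len, d_model)
print(token_ids, embeddings.shape)
```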

Position encoding: sinusoidal (sin/cos) position encoding, with each dimension pair using a different frequency so every position gets a unique pattern.
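
A small numpy sketch of the sinusoidal encoding from "Attention Is All You Need" (toy sizes, assuming an even d_model):

```python
import numpy as np

# Sinusoidal position encoding: even dimensions use sin, odd dimensions
# use cos, each pair at a different frequency.
def sinusoidal_position_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_position_encoding(seq_len=4, d_model=8)
print(pe.shape)  # (4, 8) -- added to the input embeddings
```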

Multi-Head Attention (self-attention):

Query × Key ⇒ attention filter; attention filter × Value ⇒ output (like applying an attention filter to the original image)

(different attention filters focus on different content.)
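
A single-head, numpy-only sketch of scaled dot-product attention (random toy tensors; multi-head attention runs several of these in parallel with different learned projections and concatenates the results):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product attention: Q and K build the "attention filter"
# (a weight over positions), which is then applied to V.
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (seq, seq)
    weights = softmax(scores)                       # the attention filter
    return weights @ V                              # filter applied to values

seq_len, d_k = 4, 8
Q, K, V = (np.random.randn(seq_len, d_k) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)  # (4, 8)
```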

Cosine similarity: +1 (most similar) to -1 (most dissimilar)
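
A quick numpy check of that range:

```python
import numpy as np

# Cosine similarity: +1 = same direction, 0 = orthogonal, -1 = opposite.
def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))   #  1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0
```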

Residual Connections: preservation of earlier information

Add & Norm (residual add + layer normalization) ⇒ makes convergence faster and more stable

Feed Forward ⇒ enhances the model’s ability to represent complex functions
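
A rough numpy sketch of the Add & Norm and feed-forward pieces around one sublayer (weights are random placeholders, not trained parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position's vector to zero mean / unit variance
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # position-wise two-layer MLP with ReLU
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 8, 32, 4
x = np.random.randn(seq_len, d_model)          # e.g. attention output
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

sublayer_out = feed_forward(x, W1, b1, W2, b2)
out = layer_norm(x + sublayer_out)             # residual "Add" + "Norm"
```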


[ Decoder ] Cross-attention: K & V come from the encoder output; Q comes from the decoder's own (masked self-attention) states :

Linear + softmax ⇒ projects the decoder output to vocabulary logits and turns them into probabilities, telling us which token to generate next
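
A toy sketch of that final projection, assuming a hypothetical 5-word vocabulary and random weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Final linear layer maps the decoder state to vocabulary logits;
# softmax turns them into a probability distribution over the next token.
vocab_size, d_model = 5, 8                  # toy sizes (hypothetical)
W_out = np.random.randn(d_model, vocab_size)
decoder_state = np.random.randn(d_model)    # output at the last position

probs = softmax(decoder_state @ W_out)
next_token_id = int(np.argmax(probs))       # greedy decoding
```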

Masked multi-head attention ⇒ attends only to the tokens generated so far (the causal mask blocks attention to future positions)
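
A small sketch of the causal mask that enforces this (positions above the diagonal are set to -inf before softmax, so their attention weight becomes 0):

```python
import numpy as np

# Causal (look-ahead) mask: position i may only attend to positions <= i.
def causal_mask(seq_len):
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

seq_len = 4
scores = np.random.randn(seq_len, seq_len)   # raw Q·K^T scores
masked = scores + causal_mask(seq_len)
print(masked)                                # upper triangle is -inf
```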


#transformer