“Encoder”: Bidirectional + Self-attention

“Decoder”: Unidirectional + Auto-regression + Masked self-attention

  • Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
  • Decoder-only models: Good for generative tasks such as text generation.
  • Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.

[ Encoder ] Attention layer: The attention mechanism in transformers weighs the importance of different tokens (words or subwords) in a sequence, depending on the context.

  • Tokenization: the process of breaking text into smaller pieces called tokens, usually words or subwords, making it easier to manage and analyze.

  • Input Embedding: the process of converting tokens into numerical vectors in a way that maintains the relationships and properties of the text, so the model can process the text data (a minimal sketch of both steps follows below).
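
A minimal sketch of these two steps, using a toy whitespace tokenizer and a randomly initialized embedding table (both hypothetical; real models use learned subword tokenizers such as BPE and trained embedding matrices):

```python
import numpy as np

# Toy whitespace "tokenizer" + random embedding table (hypothetical).
# Real models use learned subword tokenizers (e.g. BPE/WordPiece) and
# embedding matrices trained together with the rest of the network.
vocab = {"i": 0, "like": 1, "transformers": 2, "<unk>": 3}
d_model = 8
embedding_table = np.random.randn(len(vocab), d_model)  # placeholder weights

def tokenize(text):
    # split on whitespace and map unknown words to <unk>
    return [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]

token_ids = tokenize("I like transformers")
embeddings = embedding_table[token_ids]   # shape: (seq_len, d_model)
print(token_ids, embeddings.shape)
```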

Position encoding: sinusoidal (sin/cos) position encoding, with each dimension pair using a different frequency so every position gets a unique pattern.
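
A small numpy sketch of the sinusoidal encoding from "Attention Is All You Need" (toy sizes, assuming an even d_model):

```python
import numpy as np

# Sinusoidal position encoding: even dimensions use sin, odd dimensions
# use cos, each pair at a different frequency.
def sinusoidal_position_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_position_encoding(seq_len=4, d_model=8)
print(pe.shape)  # (4, 8) -- added to the input embeddings
```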

Multi-Head Attention (self-attention):

Query × Key ⇒ attention filter; attention filter × Value ⇒ output (like applying an attention filter to the original image)

(different attention filters focus on different content.)
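
A single-head, numpy-only sketch of scaled dot-product attention (random toy tensors; multi-head attention runs several of these in parallel with different learned projections and concatenates the results):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product attention: Q and K build the "attention filter"
# (a weight over positions), which is then applied to V.
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (seq, seq)
    weights = softmax(scores)                       # the attention filter
    return weights @ V                              # filter applied to values

seq_len, d_k = 4, 8
Q, K, V = (np.random.randn(seq_len, d_k) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)  # (4, 8)
```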

Cosine similarity: +1 (most similar) to -1 (most dissimilar)
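
A quick numpy check of that range:

```python
import numpy as np

# Cosine similarity: +1 = same direction, 0 = orthogonal, -1 = opposite.
def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))   #  1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0
```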

Residual Connections: preservation of earlier information

Add & Norm (residual add + layer normalization) ⇒ makes convergence faster and more stable

Feed Forward ⇒ enhances the model’s ability to represent complex functions
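
A rough numpy sketch of the Add & Norm and feed-forward pieces around one sublayer (weights are random placeholders, not trained parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position's vector to zero mean / unit variance
    mean = x.mean(-1, keepdims=True)
    std = x.std(-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # position-wise two-layer MLP with ReLU
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, seq_len = 8, 32, 4
x = np.random.randn(seq_len, d_model)          # e.g. attention output
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

sublayer_out = feed_forward(x, W1, b1, W2, b2)
out = layer_norm(x + sublayer_out)             # residual "Add" + "Norm"
```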


[ Decoder ] Cross-attention: K & V come from the encoder output; Q comes from the decoder's own (masked self-attention) states :

Linear + softmax ⇒ projects the decoder output to vocabulary logits and turns them into probabilities, telling us which token to generate next
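
A toy sketch of that final projection, assuming a hypothetical 5-word vocabulary and random weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Final linear layer maps the decoder state to vocabulary logits;
# softmax turns them into a probability distribution over the next token.
vocab_size, d_model = 5, 8                  # toy sizes (hypothetical)
W_out = np.random.randn(d_model, vocab_size)
decoder_state = np.random.randn(d_model)    # output at the last position

probs = softmax(decoder_state @ W_out)
next_token_id = int(np.argmax(probs))       # greedy decoding
```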

Masked multi-head attention ⇒ attends only to the tokens generated so far (the causal mask blocks attention to future positions)
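
A small sketch of the causal mask that enforces this (positions above the diagonal are set to -inf before softmax, so their attention weight becomes 0):

```python
import numpy as np

# Causal (look-ahead) mask: position i may only attend to positions <= i.
def causal_mask(seq_len):
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

seq_len = 4
scores = np.random.randn(seq_len, seq_len)   # raw Q·K^T scores
masked = scores + causal_mask(seq_len)
print(masked)                                # upper triangle is -inf
```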


#transformer