Playground: Transformers
Transformers and Attention Mechanisms
Objectives
- Understand the evolution from RNNs to Transformers and the motivation behind attention mechanisms
- Learn the mathematical foundations of scaled dot-product attention and multi-head attention
- Implement a complete Transformer model from scratch including embeddings, positional encoding, and attention layers
- Explore tokenization, cross-attention, and self-attention mechanisms
- Understand evaluation metrics for sequence-to-sequence tasks
From RNNs to Transformers: The Attention Revolution
The Problem with RNNs:
Recurrent Neural Networks process sequences sequentially, which creates two major limitations:
- Sequential Processing: Can't parallelize computation across time steps
- Vanishing Gradients: Long-range dependencies are difficult to capture
The Transformer Solution:
The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), eliminates recurrence entirely and relies solely on attention mechanisms to capture dependencies in sequences.
"Attention is All You Need" - The paper showed that attention mechanisms alone, without recurrence or convolutions, can achieve state-of-the-art results in sequence transduction tasks.
Mathematical Foundations
1. Attention Mechanism
An attention function maps a query and a set of key-value pairs to an output:
\[ \text{Attention}(Q, K, V) = \text{softmax}(\text{score}(Q, K))V \]
Where:
- \(Q\) (Query): What we're looking for
- \(K\) (Key): What we're comparing against
- \(V\) (Value): The actual information we want to retrieve
2. Scaled Dot-Product Attention
The core attention mechanism used in Transformers:
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
Components:
- \(QK^T\): Compute similarity scores between queries and keys
- \(\sqrt{d_k}\): Scaling factor that divides the scores to prevent softmax saturation
- \(\text{softmax}(\cdot)\): Normalize the scores into attention probabilities
- \(V\): Weight the values by the attention probabilities
Why Scale by \(\sqrt{d_k}\)?
For large \(d_k\), dot products grow large in magnitude, pushing softmax into regions with extremely small gradients. Scaling prevents this saturation.
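As a quick sanity check, here is a small PyTorch sketch (not from the original notebook) showing that raw dot products of random \(d_k\)-dimensional vectors have standard deviation roughly \(\sqrt{d_k}\), while the scaled scores stay near 1:

```python
import torch

# Illustrative check: for random vectors with unit-variance components,
# q . k has variance d_k, so raw scores grow like sqrt(d_k);
# dividing by sqrt(d_k) keeps the softmax logits at a stable scale.
torch.manual_seed(0)
for d_k in (16, 64, 256, 1024):
    q = torch.randn(10_000, d_k)
    k = torch.randn(10_000, d_k)
    raw = (q * k).sum(dim=-1)          # raw dot products
    scaled = raw / d_k ** 0.5          # scaled dot products
    print(f"d_k={d_k:4d}  std(raw)={raw.std().item():6.2f}  "
          f"std(scaled)={scaled.std().item():4.2f}")
```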
3. Multi-Head Attention
Instead of using a single attention function, Multi-Head Attention runs multiple attention heads in parallel:
\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O \]
where \( \text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i) \)
Parameters:
- \( W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d_k} \): Query projection matrix for head \(i\)
- \( W^K_i \in \mathbb{R}^{d_{\text{model}} \times d_k} \): Key projection matrix for head \(i\)
- \( W^V_i \in \mathbb{R}^{d_{\text{model}} \times d_v} \): Value projection matrix for head \(i\)
- \( W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}} \): Output projection matrix
4. Positional Encoding
Since Transformers have no inherent notion of sequence order, positional encodings are added to the input embeddings. The original paper uses sinusoidal functions (a code sketch follows the definitions below):
\[ PE_{(pos,\, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \qquad PE_{(pos,\, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \]
Where:
- \( pos \): Position in the sequence
- \( i \): Dimension index
- \( d_{\text{model}} \): Model dimension
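A minimal PyTorch sketch of sinusoidal positional encoding as a module; the class name, dropout, and max_len defaults are illustrative assumptions rather than the notebook's original code:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal positional encodings to token embeddings."""

    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        position = torch.arange(max_len).unsqueeze(1)                 # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)                  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                  # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))                   # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)
```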
Scaled Dot-Product Attention Implementation
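One possible implementation of the formula above; this is a sketch, and the function name and signature are assumptions rather than the notebook's original cell:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Computes softmax(Q K^T / sqrt(d_k)) V.

    Q, K: (..., seq_len, d_k) tensors; V: (..., seq_len, d_v).
    mask: optional tensor broadcastable to the score shape; positions
    where mask == 0 are excluded from attention.
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # similarity scores
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)                        # attention weights
    return attn @ V, attn
```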
Multi-Head Attention Implementation
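A sketch of multi-head attention that reuses the scaled_dot_product_attention function above. Packing all heads' projections into single linear layers (W_q, W_k, W_v, W_o) is one common design choice, assumed here rather than taken from the original notebook:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Runs h scaled dot-product attention heads in parallel and concatenates them."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = nn.Linear(d_model, d_model)   # packs all W^Q_i projections
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # output projection W^O

    def forward(self, query, key, value, mask=None):
        batch = query.size(0)

        def split_heads(x):
            # (batch, seq, d_model) -> (batch, heads, seq, d_k)
            return x.view(batch, -1, self.num_heads, self.d_k).transpose(1, 2)

        Q = split_heads(self.W_q(query))
        K = split_heads(self.W_k(key))
        V = split_heads(self.W_v(value))
        out, _ = scaled_dot_product_attention(Q, K, V, mask)
        # Concatenate heads back into (batch, seq, d_model) and project
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.num_heads * self.d_k)
        return self.W_o(out)
```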
Encoder Layer Implementation
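A sketch of a post-norm encoder layer: self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and LayerNorm. Hyperparameter names such as d_ff are assumptions:

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer (self-attention + feed-forward)."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        # Self-attention sublayer with residual connection
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, src_mask)))
        # Position-wise feed-forward sublayer with residual connection
        x = self.norm2(x + self.dropout(self.feed_forward(x)))
        return x
```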
Decoder Layer Implementation
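A matching decoder-layer sketch, adding masked self-attention over the target sequence and cross-attention over the encoder output:

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One Transformer decoder layer (masked self-attention, cross-attention, feed-forward)."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask=None, tgt_mask=None):
        # Masked self-attention over the target sequence
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, tgt_mask)))
        # Cross-attention: queries from the decoder, keys/values from the encoder
        x = self.norm2(x + self.dropout(self.cross_attn(x, enc_out, enc_out, src_mask)))
        x = self.norm3(x + self.dropout(self.feed_forward(x)))
        return x
```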
Complete Transformer Implementation
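A sketch assembling the pieces above into an encoder-decoder model. The default hyperparameters (d_model=512, 8 heads, 6 layers, d_ff=2048) follow the original paper, and the device-selection print reproduces the output shown below; the exact class structure is an assumption, not the notebook's original code:

```python
import torch
import torch.nn as nn

class Transformer(nn.Module):
    """Encoder-decoder Transformer built from the layers sketched above."""

    def __init__(self, src_vocab, tgt_vocab, d_model=512, num_heads=8,
                 num_layers=6, d_ff=2048, max_len=5000, dropout=0.1):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.pos_enc = PositionalEncoding(d_model, max_len, dropout)
        self.encoder = nn.ModuleList(
            [EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder = nn.ModuleList(
            [DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.generator = nn.Linear(d_model, tgt_vocab)   # projects to target vocabulary logits

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        enc = self.pos_enc(self.src_embed(src))
        for layer in self.encoder:
            enc = layer(enc, src_mask)
        dec = self.pos_enc(self.tgt_embed(tgt))
        for layer in self.decoder:
            dec = layer(dec, enc, src_mask, tgt_mask)
        return self.generator(dec)

# Device selection (reproduces the "Using device: ..." output below)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```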
Using device: cuda
Tokenization and Embeddings
Tokenization is the process of converting text into numerical tokens that can be processed by the model:
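A minimal character-level tokenizer sketch that reproduces the output below; the helper names encode/decode and char_to_idx/idx_to_char are illustrative:

```python
# Character-level tokenization: one token per unique character
text = "hello world this is a transformer model"

# Build a sorted character vocabulary and bidirectional lookup tables
vocab = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(vocab)}
idx_to_char = {i: ch for i, ch in enumerate(vocab)}

def encode(s: str) -> list[int]:
    return [char_to_idx[ch] for ch in s]

def decode(ids: list[int]) -> str:
    return "".join(idx_to_char[i] for i in ids)

encoded = encode(text)
print(f"Vocabulary: {vocab}")
print(f"Vocab size: {len(vocab)}")
print(f"Original text: {text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decode(encoded)}")
```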
Vocabulary: [' ', 'a', 'd', 'e', 'f', 'h', 'i', 'l', 'm', 'n', 'o', 'r', 's', 't', 'w']
Vocab size: 15
Original text: hello world this is a transformer model
Encoded: [5, 3, 7, 7, 10, 0, 14, 10, 11, 7, 2, 0, 13, 5, 6, 12, 0, 6, 12, 0, 1, 0, 13, 11, 1, 9, 12, 4, 10, 11, 8, 3, 11, 0, 8, 10, 2, 3, 7]
Decoded: hello world this is a transformer model