Playground: Transformers

Transformers and Attention Mechanisms

Objectives

  • Understand the evolution from RNNs to Transformers and the motivation behind attention mechanisms
  • Learn the mathematical foundations of scaled dot-product attention and multi-head attention
  • Implement a complete Transformer model from scratch including embeddings, positional encoding, and attention layers
  • Explore tokenization, cross-attention, and self-attention mechanisms
  • Understand evaluation metrics for sequence-to-sequence tasks

From RNNs to Transformers: The Attention Revolution

The Problem with RNNs:

Recurrent Neural Networks process sequences sequentially, which creates two major limitations:

  1. Sequential Processing: Can't parallelize computation across time steps
  2. Vanishing Gradients: Long-range dependencies are difficult to capture

The Transformer Solution:

The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), eliminates recurrence entirely and relies solely on attention mechanisms to capture dependencies in sequences.

Key Innovation:
"Attention is All You Need" - The paper showed that attention mechanisms alone, without recurrence or convolutions, can achieve state-of-the-art results in sequence transduction tasks.

Mathematical Foundations

1. Attention Mechanism

An attention function maps a query and a set of key-value pairs to an output:

General Attention Formula:
\[ \text{Attention}(Q, K, V) = \text{softmax}(\text{score}(Q, K))V \]

Where:

  • \(Q\) (Query): What we're looking for
  • \(K\) (Key): What we're comparing against
  • \(V\) (Value): The actual information we want to retrieve

2. Scaled Dot-Product Attention

The core attention mechanism used in Transformers:

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V \]

Components:

  • \(QK^T\): Compute similarity scores between queries and keys
  • \( \sqrt{d_k} \): Scaling factor to prevent softmax saturation
  • \( \text{softmax}(\cdot) \): Normalize scores to probabilities
  • \(V\): Weight the values by attention probabilities

Why Scale by \( \sqrt{d_k} \)?

For large \(d_k\), dot products grow large in magnitude, pushing softmax into regions with extremely small gradients. Scaling prevents this saturation.
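
A quick numerical sketch of this effect (illustrative only, not code from the original page): for a large \(d_k\), the raw dot products have a standard deviation of roughly \(\sqrt{d_k}\), so the softmax output is typically close to one-hot, while the scaled scores give a much smoother distribution.

import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(d_k)      # one query vector
K = torch.randn(10, d_k)  # ten key vectors

scores = K @ q            # raw dot products; standard deviation ~ sqrt(d_k)
print(torch.softmax(scores, dim=-1))               # typically near one-hot -> tiny gradients
print(torch.softmax(scores / d_k ** 0.5, dim=-1))  # scaled -> smoother distribution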

3. Multi-Head Attention

Instead of using a single attention function, Multi-Head Attention runs multiple attention heads in parallel:

\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O \]
where \( \text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i) \)

Parameters:

  • \( W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d_k} \): Query projection matrix for head \(i\)
  • \( W^K_i \in \mathbb{R}^{d_{\text{model}} \times d_k} \): Key projection matrix for head \(i\)
  • \( W^V_i \in \mathbb{R}^{d_{\text{model}} \times d_v} \): Value projection matrix for head \(i\)
  • \( W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}} \): Output projection matrix

4. Positional Encoding

Since Transformers have no inherent notion of sequence order, positional encodings are added to input embeddings:

\[ PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right) \]
\[ PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right) \]

Where:

  • \( pos \): Position in the sequence
  • \( i \): Dimension index
  • \( d_{\text{model}} \): Model dimension
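
A minimal PyTorch sketch of this encoding (the class and argument names are illustrative assumptions, not code reproduced from the page):

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))   # shape (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encoding for the first seq_len positions
        return x + self.pe[:, : x.size(1)]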

Scaled Dot-Product Attention Implementation
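
A minimal sketch of scaled dot-product attention following the formula above (class and argument names are illustrative assumptions):

import math
import torch
import torch.nn as nn

class ScaledDotProductAttention(nn.Module):
    def forward(self, Q, K, V, mask=None):
        # Q, K, V: (batch, heads, seq_len, d_k)
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # similarity scores
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)               # attention weights
        return attn @ V, attn                              # weighted values + weights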

Multi-Head Attention Implementation
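
A sketch of multi-head attention that reuses the ScaledDotProductAttention module above; as is common in from-scratch implementations, the per-head projections \(W^Q_i\), \(W^K_i\), \(W^V_i\) are packed into single linear layers (an assumption, since the original cell is not shown):

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = nn.Linear(d_model, d_model)  # packs all W^Q_i
        self.W_k = nn.Linear(d_model, d_model)  # packs all W^K_i
        self.W_v = nn.Linear(d_model, d_model)  # packs all W^V_i
        self.W_o = nn.Linear(d_model, d_model)  # W^O
        self.attention = ScaledDotProductAttention()

    def split_heads(self, x):
        # (batch, seq_len, d_model) -> (batch, heads, seq_len, d_k)
        batch, seq_len, _ = x.size()
        return x.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        Q = self.split_heads(self.W_q(query))
        K = self.split_heads(self.W_k(key))
        V = self.split_heads(self.W_v(value))
        out, attn = self.attention(Q, K, V, mask)
        # (batch, heads, seq_len, d_k) -> (batch, seq_len, d_model)
        batch, _, seq_len, _ = out.size()
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.W_o(out)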

Encoder Layer Implementation
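
A sketch of one encoder layer, assuming the post-norm layout of the original paper: multi-head self-attention and a position-wise feed-forward network, each wrapped in a residual connection and layer normalization. It reuses MultiHeadAttention from above; d_ff is the feed-forward hidden size.

import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention sub-layer with residual connection + layer norm
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, mask)))
        # Feed-forward sub-layer with residual connection + layer norm
        return self.norm2(x + self.dropout(self.ffn(x)))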

Decoder Layer Implementation
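
A sketch of one decoder layer under the same assumptions: masked self-attention over the target sequence, cross-attention whose keys and values come from the encoder output, then the feed-forward sub-layer.

import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask=None, tgt_mask=None):
        # Masked self-attention: each position attends only to earlier target positions
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, tgt_mask)))
        # Cross-attention: queries from the decoder, keys/values from the encoder output
        x = self.norm2(x + self.dropout(self.cross_attn(x, enc_out, enc_out, src_mask)))
        return self.norm3(x + self.dropout(self.ffn(x)))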

Complete Transformer Implementation
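
A sketch that composes the modules defined above (PositionalEncoding, EncoderLayer, DecoderLayer) into a full encoder-decoder model; hyperparameter defaults follow the base configuration of Vaswani et al. (2017). The last two lines show the kind of device check that prints the output reported below.

import torch
import torch.nn as nn

class Transformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512, num_heads=8,
                 num_layers=6, d_ff=2048, dropout=0.1):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.pos_enc = PositionalEncoding(d_model)
        self.encoder = nn.ModuleList(
            [EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder = nn.ModuleList(
            [DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.out_proj = nn.Linear(d_model, tgt_vocab)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        # Encode the source sequence
        x = self.pos_enc(self.src_embed(src))
        for layer in self.encoder:
            x = layer(x, src_mask)
        # Decode the target sequence, attending to the encoder output
        y = self.pos_enc(self.tgt_embed(tgt))
        for layer in self.decoder:
            y = layer(y, x, src_mask, tgt_mask)
        return self.out_proj(y)  # (batch, tgt_len, tgt_vocab) logits

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")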

Using device: cuda

Tokenization and Embeddings

Tokenization is the process of converting text into numerical tokens that can be processed by the model:
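
A character-level tokenizer is the simplest example. The sketch below (variable names are illustrative) builds a vocabulary from the unique characters of a sentence and maps text to integer ids and back, matching the output shown next:

text = "hello world this is a transformer model"

vocab = sorted(set(text))                          # unique characters, sorted
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for i, ch in enumerate(vocab)}

encoded = [char_to_id[ch] for ch in text]          # text -> token ids
decoded = "".join(id_to_char[i] for i in encoded)  # token ids -> text

print("Vocabulary:", vocab)
print("Vocab size:", len(vocab))
print("Original text:", text)
print("Encoded:", encoded)
print("Decoded:", decoded)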

Vocabulary: [' ', 'a', 'd', 'e', 'f', 'h', 'i', 'l', 'm', 'n', 'o', 'r', 's', 't', 'w']
Vocab size: 15
Original text: hello world this is a transformer model
Encoded: [5, 3, 7, 7, 10, 0, 14, 10, 11, 7, 2, 0, 13, 5, 6, 12, 0, 6, 12, 0, 1, 0, 13, 11, 1, 9, 12, 4, 10, 11, 8, 3, 11, 0, 8, 10, 2, 3, 7]
Decoded: hello world this is a transformer model
                        
