Playground: Transformers
Transformers and Attention Mechanisms
Objectives
- Understand the evolution from RNNs to Transformers and the motivation behind attention mechanisms
- Learn the mathematical foundations of scaled dot-product attention and multi-head attention
- Implement a complete Transformer model from scratch including embeddings, positional encoding, and attention layers
- Explore tokenization, cross-attention, and self-attention mechanisms
- Understand evaluation metrics for sequence-to-sequence tasks
From RNNs to Transformers: The Attention Revolution
The Problem with RNNs:
Recurrent Neural Networks process sequences sequentially, which creates two major limitations:
- Sequential Processing: Can't parallelize computation across time steps
- Vanishing Gradients: Long-range dependencies are difficult to capture
The Transformer Solution:
The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), eliminates recurrence entirely and relies solely on attention mechanisms to capture dependencies in sequences.
"Attention is All You Need" - The paper showed that attention mechanisms alone, without recurrence or convolutions, can achieve state-of-the-art results in sequence transduction tasks.
Mathematical Foundations
1. Attention Mechanism
An attention function maps a query and a set of key-value pairs to an output:
\[ \text{Attention}(Q, K, V) = \text{softmax}(\text{score}(Q, K))V \]
Where:
- \(Q\) (Query): What we're looking for
- \(K\) (Key): What we're comparing against
- \(V\) (Value): The actual information we want to retrieve
2. Scaled Dot-Product Attention
The core attention mechanism used in Transformers:
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
Components:
- \(QK^T\): Compute similarity scores between queries and keys
- \(\sqrt{d_k}\): Scaling factor that divides the scores to prevent softmax saturation
- \(\text{softmax}(\cdot)\): Normalize the scores into attention probabilities
- \(V\): Weight the values by the attention probabilities
Why Scale by \(\sqrt{d_k}\)?
For large \(d_k\), dot products grow large in magnitude, pushing softmax into regions with extremely small gradients. Scaling prevents this saturation.
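As a quick sanity check, here is a small PyTorch sketch (not from the original notebook) showing that raw dot products of random \(d_k\)-dimensional vectors have standard deviation roughly \(\sqrt{d_k}\), while the scaled scores stay near 1:

```python
import torch

# Illustrative check: for random vectors with unit-variance components,
# q . k has variance d_k, so raw scores grow like sqrt(d_k);
# dividing by sqrt(d_k) keeps the softmax logits at a stable scale.
torch.manual_seed(0)
for d_k in (16, 64, 256, 1024):
    q = torch.randn(10_000, d_k)
    k = torch.randn(10_000, d_k)
    raw = (q * k).sum(dim=-1)          # raw dot products
    scaled = raw / d_k ** 0.5          # scaled dot products
    print(f"d_k={d_k:4d}  std(raw)={raw.std().item():6.2f}  "
          f"std(scaled)={scaled.std().item():4.2f}")
```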
3. Multi-Head Attention
Instead of using a single attention function, Multi-Head Attention runs multiple attention heads in parallel:
\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O \]
where \( \text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i) \)
Parameters:
- \( W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d_k} \): Query projection matrix for head \(i\)
- \( W^K_i \in \mathbb{R}^{d_{\text{model}} \times d_k} \): Key projection matrix for head \(i\)
- \( W^V_i \in \mathbb{R}^{d_{\text{model}} \times d_v} \): Value projection matrix for head \(i\)
- \( W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}} \): Output projection matrix
4. Positional Encoding
Since Transformers have no inherent notion of sequence order, positional encodings are added to the input embeddings. The original paper uses sinusoidal functions (a code sketch follows the definitions below):
\[ PE_{(pos,\, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \qquad PE_{(pos,\, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \]
Where:
- \( pos \): Position in the sequence
- \( i \): Dimension index
- \( d_{\text{model}} \): Model dimension
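A minimal PyTorch sketch of sinusoidal positional encoding as a module; the class name, dropout, and max_len defaults are illustrative assumptions rather than the notebook's original code:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal positional encodings to token embeddings."""

    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        position = torch.arange(max_len).unsqueeze(1)                 # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)                  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                  # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))                   # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)
```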
Scaled Dot-Product Attention Implementation
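One possible implementation of the formula above; this is a sketch, and the function name and signature are assumptions rather than the notebook's original cell:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Computes softmax(Q K^T / sqrt(d_k)) V.

    Q, K: (..., seq_len, d_k) tensors; V: (..., seq_len, d_v).
    mask: optional tensor broadcastable to the score shape; positions
    where mask == 0 are excluded from attention.
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # similarity scores
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)                        # attention weights
    return attn @ V, attn
```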
Multi-Head Attention Implementation
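A sketch of multi-head attention that reuses the scaled_dot_product_attention function above. Packing all heads' projections into single linear layers (W_q, W_k, W_v, W_o) is one common design choice, assumed here rather than taken from the original notebook:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Runs h scaled dot-product attention heads in parallel and concatenates them."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = nn.Linear(d_model, d_model)   # packs all W^Q_i projections
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # output projection W^O

    def forward(self, query, key, value, mask=None):
        batch = query.size(0)

        def split_heads(x):
            # (batch, seq, d_model) -> (batch, heads, seq, d_k)
            return x.view(batch, -1, self.num_heads, self.d_k).transpose(1, 2)

        Q = split_heads(self.W_q(query))
        K = split_heads(self.W_k(key))
        V = split_heads(self.W_v(value))
        out, _ = scaled_dot_product_attention(Q, K, V, mask)
        # Concatenate heads back into (batch, seq, d_model) and project
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.num_heads * self.d_k)
        return self.W_o(out)
```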
Encoder Layer Implementation
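A sketch of a post-norm encoder layer: self-attention followed by a position-wise feed-forward network, each wrapped in a residual connection and LayerNorm. Hyperparameter names such as d_ff are assumptions:

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer (self-attention + feed-forward)."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask=None):
        # Self-attention sublayer with residual connection
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, src_mask)))
        # Position-wise feed-forward sublayer with residual connection
        x = self.norm2(x + self.dropout(self.feed_forward(x)))
        return x
```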
Decoder Layer Implementation
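A matching decoder-layer sketch, adding masked self-attention over the target sequence and cross-attention over the encoder output:

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One Transformer decoder layer (masked self-attention, cross-attention, feed-forward)."""

    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_mask=None, tgt_mask=None):
        # Masked self-attention over the target sequence
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x, tgt_mask)))
        # Cross-attention: queries from the decoder, keys/values from the encoder
        x = self.norm2(x + self.dropout(self.cross_attn(x, enc_out, enc_out, src_mask)))
        x = self.norm3(x + self.dropout(self.feed_forward(x)))
        return x
```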
Complete Transformer Implementation
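A sketch assembling the pieces above into an encoder-decoder model. The default hyperparameters (d_model=512, 8 heads, 6 layers, d_ff=2048) follow the original paper, and the device-selection print reproduces the output shown below; the exact class structure is an assumption, not the notebook's original code:

```python
import torch
import torch.nn as nn

class Transformer(nn.Module):
    """Encoder-decoder Transformer built from the layers sketched above."""

    def __init__(self, src_vocab, tgt_vocab, d_model=512, num_heads=8,
                 num_layers=6, d_ff=2048, max_len=5000, dropout=0.1):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.pos_enc = PositionalEncoding(d_model, max_len, dropout)
        self.encoder = nn.ModuleList(
            [EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder = nn.ModuleList(
            [DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.generator = nn.Linear(d_model, tgt_vocab)   # projects to target vocabulary logits

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        enc = self.pos_enc(self.src_embed(src))
        for layer in self.encoder:
            enc = layer(enc, src_mask)
        dec = self.pos_enc(self.tgt_embed(tgt))
        for layer in self.decoder:
            dec = layer(dec, enc, src_mask, tgt_mask)
        return self.generator(dec)

# Device selection (reproduces the "Using device: ..." output below)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```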
Using device: cuda
Tokenization and Embeddings
Tokenization is the process of converting text into numerical tokens that can be processed by the model:
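A minimal character-level tokenizer sketch that reproduces the output below; the helper names encode/decode and char_to_idx/idx_to_char are illustrative:

```python
# Character-level tokenization: one token per unique character
text = "hello world this is a transformer model"

# Build a sorted character vocabulary and bidirectional lookup tables
vocab = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(vocab)}
idx_to_char = {i: ch for i, ch in enumerate(vocab)}

def encode(s: str) -> list[int]:
    return [char_to_idx[ch] for ch in s]

def decode(ids: list[int]) -> str:
    return "".join(idx_to_char[i] for i in ids)

encoded = encode(text)
print(f"Vocabulary: {vocab}")
print(f"Vocab size: {len(vocab)}")
print(f"Original text: {text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decode(encoded)}")
```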
Vocabulary: [' ', 'a', 'd', 'e', 'f', 'h', 'i', 'l', 'm', 'n', 'o', 'r', 's', 't', 'w']
Vocab size: 15
Original text: hello world this is a transformer model
Encoded: [5, 3, 7, 7, 10, 0, 14, 10, 11, 7, 2, 0, 13, 5, 6, 12, 0, 6, 12, 0, 1, 0, 13, 11, 1, 9, 12, 4, 10, 11, 8, 3, 11, 0, 8, 10, 2, 3, 7]
Decoded: hello world this is a transformer model