Playground: SSRL
On Self-Supervised Representation Learning
Objectives
- Understand self-supervised representation learning and its advantages over supervised learning
- Learn the fundamental concepts of positive pairs and negative samples in contrastive learning
- Explore SimCLR (Simple Framework for Contrastive Learning) and its contrastive approach
- Understand BYOL (Bootstrap Your Own Latent) and how it learns without negative pairs
- Compare and contrast different self-supervised learning strategies
From Supervised to Self-Supervised: No Labels Needed
The Problem with Supervised Learning:
Traditional supervised learning requires vast amounts of labeled data, which is:
- Expensive: Manual annotation requires significant time and resources
- Limited: Only a fraction of available data can be realistically labeled
- Biased: Human labeling can introduce subjective biases
The Self-Supervised Solution:
Self-supervised learning creates supervisory signals from the data itself, without requiring manual labels. By learning to solve pretext tasks (like predicting augmentations or reconstructing inputs), models learn powerful representations that transfer to downstream tasks.
Self-supervised methods leverage data augmentation to create multiple views of the same image, then train models to recognize that these views represent the same underlying content. This simple idea has revolutionized representation learning.
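To make this concrete, here is a minimal sketch of a two-view augmentation pipeline, assuming PyTorch and torchvision are available; the specific transforms and their strengths are illustrative choices, not the exact recipe of any particular method.

```python
# Minimal sketch of a two-view augmentation pipeline (torchvision assumed).
# The transforms and their magnitudes below are illustrative, not prescriptive.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),            # random crop + resize
    transforms.RandomHorizontalFlip(),            # random flip
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),   # brightness/contrast/saturation/hue jitter
    transforms.RandomGrayscale(p=0.2),            # occasional grayscale
    transforms.ToTensor(),
])

class TwoViews:
    """Return two independently augmented views of the same image."""
    def __init__(self, transform):
        self.transform = transform

    def __call__(self, image):
        return self.transform(image), self.transform(image)  # view A, view B
```

An instance of `TwoViews(augment)` maps one input image to the pair of views (A, B) that the methods below are trained on.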
Key Concepts: Positive Pairs vs Negatives
Understanding the fundamental building blocks of self-supervised contrastive learning:
Positive Pairs:
Two augmented views of the same original image. For example, if we have image I1 and create two different augmentations (crop, color jitter, flip, etc.), we get:
- A1: First augmented view of I1
- B1: Second augmented view of I1
- Result: (A1, B1) form a positive pair
The model should learn that A1 and B1 represent the same semantic content.
Negative Samples:
Augmented views from different original images. For example:
- A1: Augmented view from I1
- B2: Augmented view from I2 (different image)
- Result: A1 and B2 are negatives
The model should learn that A1 and B2 represent different semantic content and push their representations apart.
Example with Batch Size 8 (see the code sketch after this list):
Starting with 8 images (I1, I2, I3, I4, I5, I6, I7, I8), we create two augmented views of each:
- Positive pairs: (A1, B1), (A2, B2), (A3, B3), ..., (A8, B8)
- For A1: B1 is positive, all others (A2, B2, A3, B3, ..., A8, B8) are negatives
- Total views in batch: 16 (8 images × 2 augmentations)
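The following sketch, assuming PyTorch, makes this bookkeeping concrete: with the 16 views stacked as [A1..A8, B1..B8], each view has exactly one positive and 14 negatives.

```python
# Sketch: positive/negative bookkeeping for a batch of N = 8 images.
# Views are stacked as [A1..A8, B1..B8], so view i's positive sits at (i + N) % 2N.
import torch

N = 8                               # original images per batch
num_views = 2 * N                   # 16 augmented views

idx = torch.arange(num_views)
positive_of = (idx + N) % num_views          # A_i <-> B_i pairing

# A view is never compared with itself, and its positive is excluded from the negatives.
negatives_mask = torch.ones(num_views, num_views, dtype=torch.bool)
negatives_mask.fill_diagonal_(False)         # remove self-comparisons
negatives_mask[idx, positive_of] = False     # remove the positive

print(positive_of[0].item())            # 8  -> B1 is the positive for A1
print(negatives_mask[0].sum().item())   # 14 -> negatives per view
```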
SimCLR: Simple Framework for Contrastive Learning
What SimCLR Does:
SimCLR is a contrastive learning framework that uses both positive and negative samples. Given a batch of N images:
- Takes N images (e.g., I1-I8 for batch size 8)
- Creates two augmented views of each image → 2N total views in the batch
- For each view (e.g., A1 from I1):
- Positive: B1 (the other view of I1) — only 1 positive
- Negatives: All other views from different images — 2(N-1) negatives
- Applies contrastive loss to pull positives together and push negatives apart
SimCLR with Batch Size 8 - Detailed Breakdown:
- Input: 8 images (I1, I2, I3, I4, I5, I6, I7, I8)
- Augmentation: Create pairs (A1, B1), (A2, B2), ..., (A8, B8)
- Total views: 16 images in the augmented batch
- For A1:
- Positive: B1 (1 positive)
- Negatives: A2, B2, A3, B3, A4, B4, A5, B5, A6, B6, A7, B7, A8, B8 (14 negatives)
- For A2:
- Positive: B2 (1 positive)
- Negatives: A1, B1, A3, B3, A4, B4, ... (14 negatives)
The NT-Xent loss for a view \(i\) whose positive is \(j\):
\[ \mathcal{L}_i = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)} \]
Where:
- \(z_i, z_j\): Representations of positive pair
- \(\text{sim}(u, v)\): Cosine similarity = \(u^T v / (\|u\| \|v\|)\)
- \(\tau\): Temperature parameter (commonly in the range 0.1–0.5)
- Numerator: Similarity to positive sample
- Denominator: Similarity to all samples (positive + negatives)
The loss pulls A1 close to B1 while pushing it away from the other 14 views in the batch. This creates a discriminative representation space where semantically similar images cluster together.
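Below is a compact sketch of this NT-Xent loss, assuming PyTorch and the same [A1..AN, B1..BN] stacking as above; the function name and the temperature default are illustrative, not a reference implementation.

```python
# Sketch of the NT-Xent loss from the formula above (PyTorch assumed).
# z is the stacked, projected batch of shape (2N, d), ordered [A1..AN, B1..BN].
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.5):
    two_n = z.shape[0]
    n = two_n // 2

    z = F.normalize(z, dim=1)                 # cosine similarity via dot products
    sim = z @ z.t() / temperature             # (2N, 2N) similarity matrix

    # Mask out self-similarity so it never appears in the denominator.
    sim.fill_diagonal_(float('-inf'))

    # View i's positive is the other augmentation of the same image.
    idx = torch.arange(two_n, device=z.device)
    positive_of = (idx + n) % two_n

    # Cross-entropy over all non-self views implements -log(exp(pos) / sum exp(all)).
    return F.cross_entropy(sim, positive_of)

# Example: random projections for a batch of 8 images (16 views, 128-dim embeddings).
loss = nt_xent_loss(torch.randn(16, 128))
```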
BYOL: Bootstrap Your Own Latent (Non-Contrastive)
What BYOL Actually Does:
BYOL is fundamentally different from SimCLR — it's a non-contrastive method that doesn't use negative samples at all. Here's how it works:
- Takes N images (e.g., I1-I8 for batch size 8)
- Creates two augmented views of each: (A1, B1), (A2, B2), ..., (A8, B8)
- Critical difference: Each pair is processed independently:
- A1 → online network, B1 → target network → compute similarity loss
- A2 → online network, B2 → target network → compute similarity loss
- ... and so on for all N pairs
- No cross-image comparisons: BYOL never compares A1 with B2, or A4 with B6, etc.
⚠️ Common Misconception:
BYOL does NOT create patterns like "A1 ~ B1, A2 ~ B2, A3 ~ B3, A4 ≠ B4". This misunderstands the algorithm.
The truth: BYOL wants ALL pairs to be similar:
- A1 similar to B1 ✓
- A2 similar to B2 ✓
- A3 similar to B3 ✓
- A4 similar to B4 ✓ (not dissimilar!)
- ... all pairs should match
Each pair's loss is computed independently, with no cross-image interactions; the sketch below makes this explicit.
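A minimal sketch of this per-pair structure, assuming PyTorch; `online_forward` and `target_forward` are hypothetical callables standing in for the online branch (encoder + projector + predictor) and the EMA target branch (encoder + projector).

```python
# Sketch emphasizing BYOL's per-pair independence: the total loss is a sum of one
# term per image, and no term ever mixes views of different images.
# online_forward and target_forward are hypothetical placeholder callables.
import torch
import torch.nn.functional as F

def byol_loss_for_batch(views_a, views_b, online_forward, target_forward):
    losses = []
    for a_i, b_i in zip(views_a, views_b):            # pair i uses only (A_i, B_i)
        p = F.normalize(online_forward(a_i.unsqueeze(0)), dim=-1)
        with torch.no_grad():                         # stop-gradient on the target side
            z = F.normalize(target_forward(b_i.unsqueeze(0)), dim=-1)
        losses.append(2 - 2 * (p * z).sum(dim=-1))    # no other image appears here
    return torch.cat(losses).mean()
```

Note that the loop body for pair i touches only \(A_i\) and \(B_i\); there is no denominator over the other views in the batch, as there is in SimCLR.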
BYOL Architecture Components:
- Online Network: Trainable network with encoder + projector + predictor
- Target Network: Slowly updated copy (EMA) of online encoder + projector
- Predictor: Extra MLP only in online network (key asymmetry)
- No negatives: Loss only involves the positive pair
The BYOL loss for one pair of views:
\[ \mathcal{L}_{\text{BYOL}} = \left\| \frac{q_\theta(z_\theta)}{\|q_\theta(z_\theta)\|_2} - \frac{z'_\xi}{\|z'_\xi\|_2} \right\|_2^2 \]
Where:
- \(q_\theta(z_\theta)\): Prediction from online network on view A
- \(z'_\xi\): Representation from target network on view B
- Symmetric loss: Also compute loss with roles reversed (B → online, A → target)
- EMA update: \(\xi \leftarrow \tau \xi + (1-\tau)\theta\), where the target decay rate \(\tau\) (distinct from SimCLR's temperature) starts around 0.996 and is annealed toward 1 during training
The asymmetry between online and target networks (predictor + EMA updates) prevents collapse to trivial solutions. The target network provides stable, slowly evolving targets that guide the online network's learning.
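A sketch of the target update and symmetric loss, assuming PyTorch modules for the online and target branches with identical architectures; the function names and the 0.996 decay default are illustrative.

```python
# Sketch of the BYOL target update and symmetric loss (PyTorch assumed).
# online and target are assumed to be nn.Modules (encoder + projector) with the
# same architecture; predictor is the extra MLP that only the online branch has.
import torch
import torch.nn.functional as F

def regression_loss(q, z):
    """Squared L2 distance between L2-normalized vectors, as in the formula above."""
    q = F.normalize(q, dim=-1)
    z = F.normalize(z, dim=-1)
    return (2 - 2 * (q * z).sum(dim=-1)).mean()   # equals ||q - z||^2 for unit vectors

@torch.no_grad()
def ema_update(target, online, decay=0.996):
    # xi <- decay * xi + (1 - decay) * theta
    for t_param, o_param in zip(target.parameters(), online.parameters()):
        t_param.mul_(decay).add_(o_param, alpha=1 - decay)

def byol_step(view_a, view_b, online, target, predictor):
    # Symmetric loss: each view passes through both branches once.
    with torch.no_grad():
        z_a, z_b = target(view_a), target(view_b)   # stop-gradient targets
    loss = regression_loss(predictor(online(view_a)), z_b) \
         + regression_loss(predictor(online(view_b)), z_a)
    return loss
```

In practice the decay rate is typically annealed toward 1 over training, and only the online network's parameters receive gradients; the target is updated only through `ema_update`.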
Summary: SimCLR vs BYOL
SimCLR (Contrastive):
- Uses: Positive pairs + negative samples
- For each view: 1 positive, 2×(batch_size-1) negatives
- Goal: Pull positives together, push negatives apart
- Requires: Large batch sizes for many negatives (typical: 256-4096)
- Loss: NT-Xent (normalized temperature-scaled cross-entropy)
BYOL (Non-Contrastive):
- Uses: Only positive pairs (no negatives)
- Processing: Each pair processed independently
- Goal: Match online prediction to target representation
- Key mechanism: Online-target asymmetry + EMA updates
- Loss: Mean squared error between normalized representations
Performance Characteristics:
- SimCLR: Benefits from large batches; sensitive to augmentation strength
- BYOL: Works with smaller batches; more robust to augmentations
- Both: Achieve strong results on ImageNet linear evaluation and transfer well to downstream tasks
Implementation Coming Soon
In the next sections, we'll implement both SimCLR and BYOL from scratch, including:
- Data augmentation pipelines for creating positive pairs
- Encoder architectures (ResNet backbone)
- Projection heads and predictors
- Contrastive loss implementation (NT-Xent for SimCLR)
- Target network with EMA updates (for BYOL)
- Training loops and evaluation on downstream tasks
Publications
Coming soon