On Self-Supervised Representation Learning

Objectives

  • Understand self-supervised representation learning and its advantages over supervised learning
  • Learn the fundamental concepts of positive pairs and negative samples in contrastive learning
  • Explore SimCLR (a Simple Framework for Contrastive Learning of Visual Representations) and its contrastive approach
  • Understand BYOL (Bootstrap Your Own Latent) and how it learns without negative pairs
  • Compare and contrast different self-supervised learning strategies

From Supervised to Self-Supervised: No Labels Needed

The Problem with Supervised Learning:

Traditional supervised learning requires vast amounts of labeled data, which is:

  1. Expensive: Manual annotation requires significant time and resources
  2. Limited: Only a fraction of available data can be realistically labeled
  3. Biased: Human labeling can introduce subjective biases

The Self-Supervised Solution:

Self-supervised learning creates supervisory signals from the data itself, without requiring manual labels. By learning to solve pretext tasks (like predicting augmentations or reconstructing inputs), models learn powerful representations that transfer to downstream tasks.

Key Innovation:
Self-supervised methods leverage data augmentation to create multiple views of the same image, then train models to recognize that these views represent the same underlying content. This simple idea has revolutionized representation learning.

Key Concepts: Positive Pairs vs Negatives

Understanding the fundamental building blocks of self-supervised contrastive learning:

Positive Pairs:

Two augmented views of the same original image. For example, if we have image I1 and create two different augmentations (crop, color jitter, flip, etc.), we get:

  • A1: First augmented view of I1
  • B1: Second augmented view of I1
  • Result: (A1, B1) form a positive pair

The model should learn that A1 and B1 represent the same semantic content.

Negative Samples:

Augmented views from different original images. For example:

  • A1: Augmented view from I1
  • B2: Augmented view from I2 (different image)
  • Result: A1 and B2 are negatives

The model should learn that A1 and B2 represent different semantic content and push their representations apart.

Example with Batch Size 8:

Starting with 8 images (I1, I2, I3, I4, I5, I6, I7, I8), we create two augmented views of each:

  • Positive pairs: (A1, B1), (A2, B2), (A3, B3), ..., (A8, B8)
  • For A1: B1 is positive, all others (A2, B2, A3, B3, ..., A8, B8) are negatives
  • Total views in batch: 16 (8 images × 2 augmentations)
[Diagram: Positive Pairs vs Negatives] Visual representation of augmentation and pair formation
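
The view-creation step can be sketched in a few lines of PyTorch. This is a minimal illustration assuming torchvision is available; the augmentation parameters below are placeholders, not the exact recipe from any particular paper.

```python
from torchvision import transforms

# Illustrative augmentation pipeline (crop, flip, color jitter);
# the parameters are assumptions, not a published recipe.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

class TwoViews:
    """Return two independently augmented views (A_i, B_i) of the same image."""
    def __init__(self, transform):
        self.transform = transform

    def __call__(self, image):
        return self.transform(image), self.transform(image)

# Usage: wrap any image dataset's transform, e.g.
#   dataset = torchvision.datasets.ImageFolder("path/to/images", transform=TwoViews(augment))
# Each sample then yields one positive pair; views of *other* images in the
# same batch serve as the negatives for contrastive methods such as SimCLR.
```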

SimCLR: Simple Framework for Contrastive Learning

What SimCLR Does:

SimCLR is a contrastive learning framework that uses both positive and negative samples. Given a batch of N images:

  1. Takes N images (e.g., I1-I8 for batch size 8)
  2. Creates two augmented views of each image → 2N total views in the batch
  3. For each view (e.g., A1 from I1):
    • Positive: B1 (the other view of I1) — only 1 positive
    • Negatives: All other views from different images — 2(N-1) negatives
  4. Applies contrastive loss to pull positives together and push negatives apart
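
Steps 1-3 can be sketched as a PyTorch module: a ResNet encoder f(·) followed by a small MLP projection head g(·). This is a minimal sketch under assumed sizes (ResNet-18 backbone, 128-dimensional projection), not the reference implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SimCLRModel(nn.Module):
    """Encoder f(.) plus projection head g(.), producing embeddings z for the loss."""
    def __init__(self, projection_dim=128):
        super().__init__()
        backbone = resnet18()
        feature_dim = backbone.fc.in_features   # 512 for ResNet-18
        backbone.fc = nn.Identity()             # drop the classification layer
        self.encoder = backbone
        self.projector = nn.Sequential(         # 2-layer MLP projection head
            nn.Linear(feature_dim, feature_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feature_dim, projection_dim),
        )

    def forward(self, x):
        h = self.encoder(x)      # representation h, reused for downstream tasks
        z = self.projector(h)    # embedding z, fed to the contrastive loss
        return z

# For a batch of N images with views a, b (each of shape [N, C, H, W]),
# stacking them as torch.cat([a, b], dim=0) makes rows i and i + N a positive pair.
```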

SimCLR with Batch Size 8 - Detailed Breakdown:

  • Input: 8 images (I1, I2, I3, I4, I5, I6, I7, I8)
  • Augmentation: Create pairs (A1, B1), (A2, B2), ..., (A8, B8)
  • Total views: 16 images in the augmented batch
  • For A1:
    • Positive: B1 (1 positive)
    • Negatives: A2, B2, A3, B3, A4, B4, A5, B5, A6, B6, A7, B7, A8, B8 (14 negatives)
  • For A2:
    • Positive: B2 (1 positive)
    • Negatives: A1, B1, A3, B3, A4, B4, ... (14 negatives)
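
With the 16 views stacked as [A1, ..., A8, B1, ..., B8], this bookkeeping reduces to simple index arithmetic, as the following sketch shows:

```python
N = 8                           # number of original images in the batch
num_views = 2 * N               # 16 views: indices 0..7 are A1..A8, 8..15 are B1..B8

for i in range(num_views):
    positive = (i + N) % num_views   # the other view of the same image
    num_negatives = num_views - 2    # everything except the view itself and its positive
    # e.g. i = 0 (A1): positive = 8 (B1), num_negatives = 14
```
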
SimCLR Loss (NT-Xent):
\[ \mathcal{L}_i = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)} \]

Where:

  • \(z_i, z_j\): Representations of positive pair
  • \(\text{sim}(u, v)\): Cosine similarity = \(u^T v / (\|u\| \|v\|)\)
  • \(\tau\): Temperature parameter (typically 0.5)
  • Numerator: Similarity to positive sample
  • Denominator: Similarity to all samples (positive + negatives)
The Goal:
Pull A1 close to B1 while pushing it away from the other 14 views in the batch. This creates a discriminative representation space where similar images cluster together.
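
A compact implementation of this loss, consistent with the formula above, might look as follows (a sketch, not the reference code; it assumes the 2N projections are stacked as [A1, ..., AN, B1, ..., BN]):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent over 2N projections stacked as [A1..AN, B1..BN]."""
    two_n = z.size(0)
    n = two_n // 2
    z = F.normalize(z, dim=1)                 # so dot products are cosine similarities
    sim = z @ z.t() / temperature             # (2N, 2N) similarity matrix

    # Remove self-similarity from the denominator (the 1[k != i] indicator).
    self_mask = torch.eye(two_n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))

    # Index of each view's positive: A_i at row i pairs with B_i at row i + N.
    positives = (torch.arange(two_n, device=z.device) + n) % two_n

    # Cross-entropy with the positive as the target class is exactly
    # -log( exp(sim_pos) / sum_k exp(sim_k) ), averaged over all 2N views.
    return F.cross_entropy(sim, positives)
```
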
[Diagram: SimCLR Architecture] Encoder → Projection Head → Contrastive Loss

BYOL: Bootstrap Your Own Latent (Non-Contrastive)

What BYOL Actually Does:

BYOL is fundamentally different from SimCLR — it's a non-contrastive method that doesn't use negative samples at all. Here's how it works:

  1. Takes N images (e.g., I1-I8 for batch size 8)
  2. Creates two augmented views of each: (A1, B1), (A2, B2), ..., (A8, B8)
  3. Critical difference: Each pair is processed independently:
    • A1 → online network, B1 → target network → compute similarity loss
    • A2 → online network, B2 → target network → compute similarity loss
    • ... and so on for all N pairs
  4. No cross-image comparisons: BYOL never compares A1 with B2, or A4 with B6, etc.

⚠️ Common Misconception:

BYOL does NOT set up an objective like "A1 ~ B1, A2 ~ B2, A3 ~ B3, A4 ≠ B4", where some pairs match and others do not; that reading misunderstands the algorithm.

The truth: BYOL wants ALL pairs to be similar:

  • A1 similar to B1 ✓
  • A2 similar to B2 ✓
  • A3 similar to B3 ✓
  • A4 similar to B4 ✓ (not dissimilar!)
  • ... all pairs should match

Each pair's loss is computed independently, with no cross-image interactions.
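
The following toy sketch makes this concrete. Random tensors stand in for the network outputs (shapes and values are purely illustrative); the point is that each row of A is compared only with the corresponding row of B and nothing else:

```python
import torch
import torch.nn.functional as F

# Stand-ins for network outputs on a batch of 8 images:
# predictions q(z) from the online network on views A,
# targets z' from the target network on views B.
preds_a   = torch.randn(8, 256)
targets_b = torch.randn(8, 256)

# The loss is computed row by row: A_i is only ever matched against B_i.
# There is no term involving (A_i, B_j) for i != j, i.e. no negatives.
per_pair = ((F.normalize(preds_a, dim=1) - F.normalize(targets_b, dim=1)) ** 2).sum(dim=1)
loss = per_pair.mean()   # every pair is pulled together; nothing is pushed apart
```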

BYOL Architecture Components:

  • Online Network: Trainable network with encoder + projector + predictor
  • Target Network: Slowly updated copy (EMA) of online encoder + projector
  • Predictor: Extra MLP only in online network (key asymmetry)
  • No negatives: Loss only involves the positive pair
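
A sketch of these components in PyTorch, with assumed sizes (ResNet-18 encoder, 256-dimensional projections, 4096-unit hidden layers); the original BYOL setup used a larger backbone, so treat the numbers as placeholders:

```python
import copy
import torch.nn as nn
from torchvision.models import resnet18

def mlp(in_dim, hidden_dim=4096, out_dim=256):
    """MLP used for both the projector and the predictor (sizes are illustrative)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

# Online network: encoder + projector + predictor, all trained by gradient descent.
online_encoder = resnet18()
online_encoder.fc = nn.Identity()            # 512-dim features from ResNet-18
online_projector = mlp(512)
predictor = mlp(256)                         # the asymmetry: only the online side has this

# Target network: a copy of the online encoder + projector that receives
# no gradients and is updated only by an exponential moving average (EMA).
target_encoder = copy.deepcopy(online_encoder)
target_projector = copy.deepcopy(online_projector)
for p in list(target_encoder.parameters()) + list(target_projector.parameters()):
    p.requires_grad = False
```
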
BYOL Loss:
\[ \mathcal{L}_{\text{BYOL}} = \left\| \frac{q_\theta(z_\theta)}{\|q_\theta(z_\theta)\|_2} - \frac{z'_\xi}{\|z'_\xi\|_2} \right\|_2^2 \]

Where:

  • \(q_\theta(z_\theta)\): Prediction from online network on view A
  • \(z'_\xi\): Representation from target network on view B
  • Symmetric loss: Also compute loss with roles reversed (B → online, A → target)
  • EMA update: \(\xi \leftarrow \tau \xi + (1-\tau)\theta\), where the decay rate \(\tau\) is close to 1 (e.g., 0.99–0.999); this \(\tau\) is the EMA decay, not SimCLR's temperature
Why BYOL Works Without Negatives:
The asymmetry between online and target networks (predictor + EMA updates) prevents collapse to trivial solutions. The target network provides stable, slowly evolving targets that guide the online network's learning.
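
Putting the pieces together, here is a sketch of the loss and the EMA update, building on the assumed components above; the training-step outline in the comments is illustrative, not the official recipe.

```python
import torch
import torch.nn.functional as F

def byol_loss(prediction, target):
    """Normalized MSE between online prediction and target projection.

    Equivalent to 2 - 2 * cosine similarity; computed per pair, with no negatives.
    """
    prediction = F.normalize(prediction, dim=1)
    target = F.normalize(target, dim=1)
    return ((prediction - target) ** 2).sum(dim=1).mean()

@torch.no_grad()
def ema_update(online_net, target_net, decay=0.99):
    """Move each target parameter slowly toward its online counterpart."""
    for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
        p_target.mul_(decay).add_(p_online, alpha=1 - decay)

# One training step, in outline (view_a, view_b are the two augmented batches):
#   q_a = predictor(online_projector(online_encoder(view_a)))
#   with torch.no_grad():
#       t_b = target_projector(target_encoder(view_b))
#   loss = byol_loss(q_a, t_b)        # plus the symmetric term with the views swapped
#   loss.backward(); optimizer.step()
#   ema_update(online_encoder, target_encoder); ema_update(online_projector, target_projector)
```
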
[Diagram: BYOL Architecture] Online Network (trainable) ↔ Target Network (EMA)

Summary: SimCLR vs BYOL

SimCLR (Contrastive):

  • Uses: Positive pairs + negative samples
  • For each view: 1 positive, 2×(batch_size-1) negatives
  • Goal: Pull positives together, push negatives apart
  • Requires: Large batch sizes for many negatives (typical: 256-4096)
  • Loss: NT-Xent (normalized temperature-scaled cross-entropy)

BYOL (Non-Contrastive):

  • Uses: Only positive pairs (no negatives)
  • Processing: Each pair processed independently
  • Goal: Match online prediction to target representation
  • Key mechanism: Online-target asymmetry + EMA updates
  • Loss: Mean squared error between normalized representations

Performance Characteristics:

  • SimCLR: Benefits from large batches; sensitive to augmentation strength
  • BYOL: Works with smaller batches; more robust to augmentations
  • Both: Achieve state-of-the-art results on ImageNet and transfer tasks
[Table: Detailed Comparison] Feature-by-feature comparison of SimCLR and BYOL

Implementation Coming Soon

In the next sections, we'll implement both SimCLR and BYOL from scratch, including:

  • Data augmentation pipelines for creating positive pairs
  • Encoder architectures (ResNet backbone)
  • Projection heads and predictors
  • Contrastive loss implementation (NT-Xent for SimCLR)
  • Target network with EMA updates (for BYOL)
  • Training loops and evaluation on downstream tasks
[Code Implementation Section] Complete PyTorch implementations coming soon
