Playground: SSRL
On Self-Supervised Representation Learning
Objectives
- Understand self-supervised representation learning and its advantages over supervised learning
- Learn the fundamental concepts of positive pairs and negative samples in contrastive learning
- Explore SimCLR (Simple Framework for Contrastive Learning) and its contrastive approach
- Understand BYOL (Bootstrap Your Own Latent) and how it learns without negative pairs
- Compare and contrast different self-supervised learning strategies
From Supervised to Self-Supervised: No Labels Needed
The Problem with Supervised Learning:
Traditional supervised learning requires vast amounts of labeled data, which is:
- Expensive: Manual annotation requires significant time and resources
- Limited: Only a fraction of available data can be realistically labeled
- Biased: Human labeling can introduce subjective biases
The Self-Supervised Solution:
Self-supervised learning creates supervisory signals from the data itself, without requiring manual labels. By learning to solve pretext tasks (like predicting augmentations or reconstructing inputs), models learn powerful representations that transfer to downstream tasks.
Self-supervised methods leverage data augmentation to create multiple views of the same image, then train models to recognize that these views represent the same underlying content. This simple idea has revolutionized representation learning.
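To make this concrete, here is a minimal sketch of a two-view augmentation pipeline, assuming PyTorch and torchvision are available; the specific transforms and their strengths are illustrative choices, not the exact recipe of any particular method.

```python
# Minimal sketch of a two-view augmentation pipeline (torchvision assumed).
# The transforms and their magnitudes below are illustrative, not prescriptive.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),            # random crop + resize
    transforms.RandomHorizontalFlip(),            # random flip
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),   # brightness/contrast/saturation/hue jitter
    transforms.RandomGrayscale(p=0.2),            # occasional grayscale
    transforms.ToTensor(),
])

class TwoViews:
    """Return two independently augmented views of the same image."""
    def __init__(self, transform):
        self.transform = transform

    def __call__(self, image):
        return self.transform(image), self.transform(image)  # view A, view B
```

An instance of `TwoViews(augment)` maps one input image to the pair of views (A, B) that the methods below are trained on.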
Key Concepts: Positive Pairs vs Negatives
Understanding the fundamental building blocks of self-supervised contrastive learning:
Positive Pairs:
Two augmented views of the same original image. For example, if we have image I1 and create two different augmentations (crop, color jitter, flip, etc.), we get:
- A1: First augmented view of I1
- B1: Second augmented view of I1
- Result: (A1, B1) form a positive pair
The model should learn that A1 and B1 represent the same semantic content.
Negative Samples:
Augmented views from different original images. For example:
- A1: Augmented view from I1
- B2: Augmented view from I2 (different image)
- Result: A1 and B2 are negatives
The model should learn that A1 and B2 represent different semantic content and push their representations apart.
Example with Batch Size 8 (see the code sketch after this list):
Starting with 8 images (I1, I2, I3, I4, I5, I6, I7, I8), we create two augmented views of each:
- Positive pairs: (A1, B1), (A2, B2), (A3, B3), ..., (A8, B8)
- For A1: B1 is positive, all others (A2, B2, A3, B3, ..., A8, B8) are negatives
- Total views in batch: 16 (8 images × 2 augmentations)
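The following sketch, assuming PyTorch, makes this bookkeeping concrete: with the 16 views stacked as [A1..A8, B1..B8], each view has exactly one positive and 14 negatives.

```python
# Sketch: positive/negative bookkeeping for a batch of N = 8 images.
# Views are stacked as [A1..A8, B1..B8], so view i's positive sits at (i + N) % 2N.
import torch

N = 8                               # original images per batch
num_views = 2 * N                   # 16 augmented views

idx = torch.arange(num_views)
positive_of = (idx + N) % num_views          # A_i <-> B_i pairing

# A view is never compared with itself, and its positive is excluded from the negatives.
negatives_mask = torch.ones(num_views, num_views, dtype=torch.bool)
negatives_mask.fill_diagonal_(False)         # remove self-comparisons
negatives_mask[idx, positive_of] = False     # remove the positive

print(positive_of[0].item())            # 8  -> B1 is the positive for A1
print(negatives_mask[0].sum().item())   # 14 -> negatives per view
```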
SimCLR: Simple Framework for Contrastive Learning
What SimCLR Does:
SimCLR is a contrastive learning framework that uses both positive and negative samples. Given a batch of N images:
- Takes N images (e.g., I1-I8 for batch size 8)
- Creates two augmented views of each image → 2N total views in the batch
- For each view (e.g., A1 from I1):
- Positive: B1 (the other view of I1) — only 1 positive
- Negatives: All other views from different images — 2(N-1) negatives
- Applies contrastive loss to pull positives together and push negatives apart
SimCLR with Batch Size 8 - Detailed Breakdown:
- Input: 8 images (I1, I2, I3, I4, I5, I6, I7, I8)
- Augmentation: Create pairs (A1, B1), (A2, B2), ..., (A8, B8)
- Total views: 16 images in the augmented batch
- For A1:
- Positive: B1 (1 positive)
- Negatives: A2, B2, A3, B3, A4, B4, A5, B5, A6, B6, A7, B7, A8, B8 (14 negatives)
- For A2:
- Positive: B2 (1 positive)
- Negatives: A1, B1, A3, B3, A4, B4, ... (14 negatives)
The NT-Xent loss for a view \(i\) whose positive is \(j\):
\[ \mathcal{L}_i = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)} \]
Where:
- \(z_i, z_j\): Representations of positive pair
- \(\text{sim}(u, v)\): Cosine similarity = \(u^T v / (\|u\| \|v\|)\)
- \(\tau\): Temperature parameter (commonly in the range 0.1–0.5)
- Numerator: Similarity to positive sample
- Denominator: Similarity to all samples (positive + negatives)
The loss pulls A1 close to B1 while pushing it away from the other 14 views in the batch. This creates a discriminative representation space where semantically similar images cluster together.
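Below is a compact sketch of this NT-Xent loss, assuming PyTorch and the same [A1..AN, B1..BN] stacking as above; the function name and the temperature default are illustrative, not a reference implementation.

```python
# Sketch of the NT-Xent loss from the formula above (PyTorch assumed).
# z is the stacked, projected batch of shape (2N, d), ordered [A1..AN, B1..BN].
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.5):
    two_n = z.shape[0]
    n = two_n // 2

    z = F.normalize(z, dim=1)                 # cosine similarity via dot products
    sim = z @ z.t() / temperature             # (2N, 2N) similarity matrix

    # Mask out self-similarity so it never appears in the denominator.
    sim.fill_diagonal_(float('-inf'))

    # View i's positive is the other augmentation of the same image.
    idx = torch.arange(two_n, device=z.device)
    positive_of = (idx + n) % two_n

    # Cross-entropy over all non-self views implements -log(exp(pos) / sum exp(all)).
    return F.cross_entropy(sim, positive_of)

# Example: random projections for a batch of 8 images (16 views, 128-dim embeddings).
loss = nt_xent_loss(torch.randn(16, 128))
```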
BYOL: Bootstrap Your Own Latent (Non-Contrastive)
What BYOL Actually Does:
BYOL is fundamentally different from SimCLR — it's a non-contrastive method that doesn't use negative samples at all. Here's how it works:
- Takes N images (e.g., I1-I8 for batch size 8)
- Creates two augmented views of each: (A1, B1), (A2, B2), ..., (A8, B8)
- Critical difference: Each pair is processed independently:
- A1 → online network, B1 → target network → compute similarity loss
- A2 → online network, B2 → target network → compute similarity loss
- ... and so on for all N pairs
- No cross-image comparisons: BYOL never compares A1 with B2, or A4 with B6, etc.
⚠️ Common Misconception:
BYOL does NOT create patterns like "A1 ~ B1, A2 ~ B2, A3 ~ B3, A4 ≠ B4". This misunderstands the algorithm.
The truth: BYOL wants ALL pairs to be similar:
- A1 similar to B1 ✓
- A2 similar to B2 ✓
- A3 similar to B3 ✓
- A4 similar to B4 ✓ (not dissimilar!)
- ... all pairs should match
Each pair's loss is computed independently, with no cross-image interactions; the sketch below makes this explicit.
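A minimal sketch of this per-pair structure, assuming PyTorch; `online_forward` and `target_forward` are hypothetical callables standing in for the online branch (encoder + projector + predictor) and the EMA target branch (encoder + projector).

```python
# Sketch emphasizing BYOL's per-pair independence: the total loss is a sum of one
# term per image, and no term ever mixes views of different images.
# online_forward and target_forward are hypothetical placeholder callables.
import torch
import torch.nn.functional as F

def byol_loss_for_batch(views_a, views_b, online_forward, target_forward):
    losses = []
    for a_i, b_i in zip(views_a, views_b):            # pair i uses only (A_i, B_i)
        p = F.normalize(online_forward(a_i.unsqueeze(0)), dim=-1)
        with torch.no_grad():                         # stop-gradient on the target side
            z = F.normalize(target_forward(b_i.unsqueeze(0)), dim=-1)
        losses.append(2 - 2 * (p * z).sum(dim=-1))    # no other image appears here
    return torch.cat(losses).mean()
```

Note that the loop body for pair i touches only \(A_i\) and \(B_i\); there is no denominator over the other views in the batch, as there is in SimCLR.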
BYOL Architecture Components:
- Online Network: Trainable network with encoder + projector + predictor
- Target Network: Slowly updated copy (EMA) of online encoder + projector
- Predictor: Extra MLP only in online network (key asymmetry)
- No negatives: Loss only involves the positive pair
The BYOL loss for one pair of views:
\[ \mathcal{L}_{\text{BYOL}} = \left\| \frac{q_\theta(z_\theta)}{\|q_\theta(z_\theta)\|_2} - \frac{z'_\xi}{\|z'_\xi\|_2} \right\|_2^2 \]
Where:
- \(q_\theta(z_\theta)\): Prediction from online network on view A
- \(z'_\xi\): Representation from target network on view B
- Symmetric loss: Also compute loss with roles reversed (B → online, A → target)
- EMA update: \(\xi \leftarrow \tau \xi + (1-\tau)\theta\), where the target decay rate \(\tau\) (distinct from SimCLR's temperature) starts around 0.996 and is annealed toward 1 during training
The asymmetry between online and target networks (predictor + EMA updates) prevents collapse to trivial solutions. The target network provides stable, slowly evolving targets that guide the online network's learning.
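A sketch of the target update and symmetric loss, assuming PyTorch modules for the online and target branches with identical architectures; the function names and the 0.996 decay default are illustrative.

```python
# Sketch of the BYOL target update and symmetric loss (PyTorch assumed).
# online and target are assumed to be nn.Modules (encoder + projector) with the
# same architecture; predictor is the extra MLP that only the online branch has.
import torch
import torch.nn.functional as F

def regression_loss(q, z):
    """Squared L2 distance between L2-normalized vectors, as in the formula above."""
    q = F.normalize(q, dim=-1)
    z = F.normalize(z, dim=-1)
    return (2 - 2 * (q * z).sum(dim=-1)).mean()   # equals ||q - z||^2 for unit vectors

@torch.no_grad()
def ema_update(target, online, decay=0.996):
    # xi <- decay * xi + (1 - decay) * theta
    for t_param, o_param in zip(target.parameters(), online.parameters()):
        t_param.mul_(decay).add_(o_param, alpha=1 - decay)

def byol_step(view_a, view_b, online, target, predictor):
    # Symmetric loss: each view passes through both branches once.
    with torch.no_grad():
        z_a, z_b = target(view_a), target(view_b)   # stop-gradient targets
    loss = regression_loss(predictor(online(view_a)), z_b) \
         + regression_loss(predictor(online(view_b)), z_a)
    return loss
```

In practice the decay rate is typically annealed toward 1 over training, and only the online network's parameters receive gradients; the target is updated only through `ema_update`.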
Summary: SimCLR vs BYOL
SimCLR (Contrastive):
- Uses: Positive pairs + negative samples
- For each view: 1 positive, 2×(batch_size-1) negatives
- Goal: Pull positives together, push negatives apart
- Requires: Large batch sizes for many negatives (typical: 256-4096)
- Loss: NT-Xent (normalized temperature-scaled cross-entropy)
BYOL (Non-Contrastive):
- Uses: Only positive pairs (no negatives)
- Processing: Each pair processed independently
- Goal: Match online prediction to target representation
- Key mechanism: Online-target asymmetry + EMA updates
- Loss: Mean squared error between normalized representations
Performance Characteristics:
- SimCLR: Benefits from large batches; sensitive to augmentation strength
- BYOL: Works with smaller batches; more robust to augmentations
- Both: Achieve strong results on ImageNet linear evaluation and transfer well to downstream tasks
Implementation Coming Soon
In the next sections, we'll implement both SimCLR and BYOL from scratch, including:
- Data augmentation pipelines for creating positive pairs
- Encoder architectures (ResNet backbone)
- Projection heads and predictors
- Contrastive loss implementation (NT-Xent for SimCLR)
- Target network with EMA updates (for BYOL)
- Training loops and evaluation on downstream tasks
Publications
Coming soon