CLIP (Contrastive Language-Image Pretraining)

Beginner Explanation

Imagine you have a magical book that learns which words go with which pictures. If you show it a picture of a cat and say ‘This is a cat’, the book learns that the word ‘cat’ relates to that picture. CLIP is like that magical book, but it learns from millions of pictures and their descriptions. So, if you ask it to find pictures that match a certain word or phrase, it can quickly show you the best matches, even if it has never seen those exact images before!

Technical Explanation

CLIP (Contrastive Language-Image Pretraining) is a model developed by OpenAI that learns to associate images and text through contrastive learning. It is trained on a large dataset of image-text pairs, learning to maximize the similarity between the embeddings of matching pairs while minimizing it for non-matching pairs. The architecture consists of two encoders: an image encoder (a ResNet or a Vision Transformer) and a text encoder (a Transformer). Here’s a simplified code example using PyTorch and the Hugging Face transformers library:

```python
import torch
from transformers import CLIPProcessor, CLIPModel

# Load model and processor
model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch16')

# Prepare inputs
texts = ['a photo of a cat', 'a photo of a dog']
images = ...  # Placeholder: load your images here

# Process inputs
inputs = processor(text=texts, images=images, return_tensors='pt', padding=True)

# Forward pass (no gradients are needed for inference)
with torch.no_grad():
    outputs = model(**inputs)

# Get similarity scores, shape (num_images, num_texts)
logits_per_image = outputs.logits_per_image
```

Because images and text share one embedding space, CLIP can perform zero-shot classification: it categorizes images by comparing them against text prompts, without any additional training.
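To make the zero-shot step concrete, here is a minimal sketch of what happens to the two sets of embeddings once the encoders have run. It uses only the Python standard library. The 3-dimensional embeddings and the helper names (`l2_normalize`, `softmax`, `zero_shot_classify`) are made up for illustration and are not part of any library. The `logit_scale` of 100 is an assumption, chosen because the CLIP paper caps the learned scale at that value.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, text_embeddings, labels, logit_scale=100.0):
    """Return (best_label, probabilities) for one image against candidate prompts."""
    image = l2_normalize(image_embedding)
    logits = []
    for text_embedding in text_embeddings:
        text = l2_normalize(text_embedding)
        cosine = sum(i * t for i, t in zip(image, text))
        logits.append(logit_scale * cosine)  # assumed scale, see lead-in
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs

# Hypothetical toy embeddings; real CLIP embeddings have hundreds of dimensions.
labels = ["a photo of a cat", "a photo of a dog"]
text_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
image_embedding = [0.8, 0.2, 0.1]

best_label, probs = zero_shot_classify(image_embedding, text_embeddings, labels)
print(best_label, [round(p, 3) for p in probs])
```

The predicted label is simply the prompt whose embedding points in the direction closest to the image embedding, which is why changing the prompt list changes the classifier without any retraining.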

Academic Context

CLIP (Contrastive Language-Image Pretraining) was introduced in the paper ‘Learning Transferable Visual Models From Natural Language Supervision’ by Radford et al. (2021). The model is trained on 400 million image-text pairs collected from the internet to learn a joint embedding space for images and text. The objective is contrastive: within each batch, it maximizes the cosine similarity of the matched image-text pairs while minimizing it for every unmatched pairing. The loss is a symmetric cross-entropy over the batch similarity matrix, a form of the InfoNCE loss, which derives from noise-contrastive estimation. This approach allows CLIP to transfer to tasks such as image classification and retrieval without fine-tuning on task-specific datasets.
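The contrastive objective described above can be sketched in a few lines of standard-library Python. This is an illustrative reimplementation under stated assumptions, not the authors' training code. The function names and toy embeddings are hypothetical, and the default `logit_scale` of 1/0.07 mirrors the temperature initialization reported in the paper. Row i of the similarity matrix is treated as a classification problem whose correct answer is column i, and the same is done for the transposed matrix.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_total for v in row]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    losses = [-log_softmax(row)[i] for i, row in enumerate(logits)]
    return sum(losses) / len(losses)

def clip_style_loss(image_embeddings, text_embeddings, logit_scale=1 / 0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) pairs."""
    images = [l2_normalize(v) for v in image_embeddings]
    texts = [l2_normalize(v) for v in text_embeddings]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_transposed)
    return (image_to_text + text_to_image) / 2

# Hypothetical toy batch: pair i of images and texts point in similar directions.
images = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
texts = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

aligned_loss = clip_style_loss(images, texts)
# Rotating the texts by one position breaks every pairing.
mismatched_loss = clip_style_loss(images, texts[1:] + texts[:1])
print(aligned_loss, mismatched_loss)
```

The loss is small when each image is most similar to its own caption and grows when the pairings are scrambled, which is the signal that pulls matched pairs together and pushes unmatched pairs apart during training.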

Code Examples

Example 1:

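import torch
from transformers import CLIPProcessor, CLIPModel

# Load the pretrained model and its matching processor
model = CLIPModel.from_pretrained('openai/clip-vit-base-patch16')
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch16')

# Prepare inputs
texts = ['a photo of a cat', 'a photo of a dog']
images = ...  # Placeholder: load your images here

# Tokenize the text and preprocess the images into tensors
inputs = processor(text=texts, images=images, return_tensors='pt', padding=True)

# Forward pass (no gradients are needed for inference)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity scores, shape (num_images, num_texts)
logits_per_image = outputs.logits_per_image

# Softmax over the text prompts gives per-image label probabilities
probs = logits_per_image.softmax(dim=1)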

View Source: https://arxiv.org/abs/2511.16674v1

Pre-trained Models

sentence-transformers/clip-ViT-B-32-multilingual-v1 (sentence-similarity): 82,256 downloads
openai/clip-vit-base-patch32 (zero-shot-image-classification): 18,760,366 downloads
openai/clip-vit-large-patch14 (zero-shot-image-classification): 9,677,420 downloads
laion/CLIP-ViT-H-14-laion2B-s32B-b79K (zero-shot-image-classification): 947,414 downloads
zer0int/CLIP-GmP-ViT-L-14 (zero-shot-image-classification): 7,069 downloads
OFA-Sys/chinese-clip-vit-base-patch16 (zero-shot-image-classification): 81,918 downloads
laion/CLIP-ViT-bigG-14-laion2B-39B-b160k (zero-shot-image-classification): 92,682 downloads
openai/clip-vit-large-patch14-336 (zero-shot-image-classification): 4,457,796 downloads
