CLIP

Beginner Explanation

Imagine you have a friend who can look at a picture and pick the words that describe it best, like ‘a cat’ or ‘a sunset’. CLIP is like that friend, but it’s a computer program. It learns to connect words and pictures by studying a huge number of pictures paired with their captions. So when you show it a new picture along with a few possible descriptions, it can pick the best match, even if it has never seen that exact picture before. This is useful because it lets computers recognize new kinds of images without being specially trained for each new task.

Technical Explanation

CLIP (Contrastive Language–Image Pre-training) is a model developed by OpenAI that learns a joint representation of images and text. It uses a dual-encoder architecture, where one encoder processes images and the other processes text. Both encoders project their inputs into a shared embedding space. The training objective is to maximize the similarity between the embeddings of matched image-text pairs while minimizing the similarity of unmatched pairs within the same batch. This is achieved with a symmetric contrastive loss over the batch's image-text similarity matrix. Here’s a simple usage example with PyTorch and OpenAI's `clip` package:

```python
import torch
import clip  # OpenAI's package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pre-trained CLIP model together with its matching preprocessing
# pipeline (resize/crop to 224x224, tensor conversion, CLIP mean/std normalization)
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()

# Preprocess the image and tokenize the candidate captions
input_image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text_tokens = clip.tokenize(["a cat", "a dog"]).to(device)

# Encode image and text
with torch.no_grad():
    image_features = model.encode_image(input_image)
    text_features = model.encode_text(text_tokens)

# Cosine similarity between the image and each caption (one score per caption)
similarity = torch.cosine_similarity(image_features, text_features)
```
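The contrastive objective described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the original training code: the function name `clip_contrastive_loss` is hypothetical, random tensors stand in for encoder outputs, and the temperature is fixed at 0.07 (the value CLIP uses to initialize a temperature that it then learns during training).

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the in-batch image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matched pair;
    every other combination in the batch is treated as a negative.
    """
    # L2-normalize so that dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) matrix of scaled similarities
    logits = image_emb @ text_emb.t() / temperature

    # The correct "class" for row i is column i (the matched pair)
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.t(), targets)
    return (loss_image_to_text + loss_text_to_image) / 2


torch.manual_seed(0)
image_emb = torch.randn(8, 512)  # stand-in for image encoder outputs
text_emb = torch.randn(8, 512)   # stand-in for text encoder outputs

loss_random = clip_contrastive_loss(image_emb, text_emb)
loss_matched = clip_contrastive_loss(image_emb, image_emb.clone())
```

When the two sets of embeddings are perfectly aligned (`loss_matched`), the loss is much lower than for unrelated embeddings (`loss_random`), which is the behavior the training objective rewards.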

Academic Context

CLIP, introduced by Radford et al. in ‘Learning Transferable Visual Models From Natural Language Supervision’ (2021), represents a significant advance in bridging vision and language. The model is trained on roughly 400 million image-text pairs collected from the internet, which lets it learn visual concepts without task-specific annotations. The mathematical foundation is contrastive learning, which optimizes the model to score correct image-text pairs higher than incorrect ones in a shared high-dimensional embedding space. This allows CLIP to perform zero-shot transfer: a new task is specified through text prompts rather than additional training, which makes the model applicable to tasks such as image classification and retrieval. Key papers include ‘Contrastive Learning of Generalized Visual Representations’ and ‘Zero-Shot Learning with Semantic Output Codes’.
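Retrieval with a shared embedding space reduces to a nearest-neighbor search: embed the text query, then rank images by cosine similarity. The sketch below illustrates the idea under stated assumptions: the helper `retrieve_top_k` is hypothetical, and random tensors stand in for real CLIP embeddings so the example runs without downloading a model.

```python
import torch
import torch.nn.functional as F


def retrieve_top_k(query_emb, gallery_emb, k=3):
    """Return indices and cosine scores of the k gallery items closest to the query."""
    query_emb = F.normalize(query_emb, dim=-1)
    gallery_emb = F.normalize(gallery_emb, dim=-1)
    scores = gallery_emb @ query_emb  # shape: (num_items,)
    top = torch.topk(scores, k=k)     # sorted from highest to lowest score
    return top.indices.tolist(), top.values.tolist()


torch.manual_seed(0)
gallery = torch.randn(100, 512)               # stand-in for 100 image embeddings
query = gallery[42] + 0.1 * torch.randn(512)  # stand-in for a text embedding near image 42

indices, scores = retrieve_top_k(query, gallery, k=3)
```

In a real pipeline the gallery would hold outputs of the image encoder and the query would come from the text encoder; the ranking step itself stays the same.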

Code Examples

Example 1:

import torch
import clip  # OpenAI's package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pre-trained CLIP model together with its matching preprocessing
# pipeline (resize/crop to 224x224, tensor conversion, CLIP mean/std normalization)
model, preprocess = clip.load('ViT-B/32', device=device)
model.eval()

# Preprocess the image and tokenize the candidate captions
input_image = preprocess(Image.open('example.jpg')).unsqueeze(0).to(device)
text_tokens = clip.tokenize(['a cat', 'a dog']).to(device)

# Encode image and text
with torch.no_grad():
    image_features = model.encode_image(input_image)
    text_features = model.encode_text(text_tokens)

# Cosine similarity between the image and each caption (one score per caption)
similarity = torch.cosine_similarity(image_features, text_features)
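Raw cosine similarities are hard to read as predictions, so a common follow-up is to scale them and apply a softmax to get a probability per caption. The sketch below assumes a fixed scale of 100, as used in OpenAI's usage examples, and uses random stand-in embeddings (the image embedding is built to lie near the first caption) so it runs without model weights.

```python
import torch
import torch.nn.functional as F

labels = ["a cat", "a dog"]

# Stand-in embeddings: in practice these come from encode_text / encode_image
torch.manual_seed(0)
text_features = torch.randn(2, 512)
image_features = text_features[0:1] + 0.5 * torch.randn(1, 512)

# Normalize so the matrix product below yields cosine similarities
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)

# Scale the similarities and convert them to probabilities over the captions
logit_scale = 100.0  # assumed fixed scale; the trained model stores its own value
probs = (logit_scale * image_features @ text_features.t()).softmax(dim=-1)

predicted = labels[probs.argmax(dim=-1).item()]
```

Each row of `probs` sums to 1, and `predicted` is the caption with the highest probability for the image.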

View Source: https://arxiv.org/abs/2511.16674v1

Pre-trained Models

sentence-transformers/clip-ViT-B-32-multilingual-v1 (sentence-similarity, 82,256 downloads)
openai/clip-vit-base-patch32 (zero-shot-image-classification, 18,760,366 downloads)
openai/clip-vit-large-patch14 (zero-shot-image-classification, 9,677,420 downloads)
laion/CLIP-ViT-H-14-laion2B-s32B-b79K (zero-shot-image-classification, 947,414 downloads)
zer0int/CLIP-GmP-ViT-L-14 (zero-shot-image-classification, 7,069 downloads)
OFA-Sys/chinese-clip-vit-base-patch16 (zero-shot-image-classification, 81,918 downloads)
laion/CLIP-ViT-bigG-14-laion2B-39B-b160k (zero-shot-image-classification, 92,682 downloads)
openai/clip-vit-large-patch14-336 (zero-shot-image-classification, 4,457,796 downloads)
