Beginner Explanation
Imagine you have a friend who can look at a picture and then tell you what it is in words, like ‘a cat’ or ‘a sunset’. CLIP is like that friend, but it’s a computer program. It learns to connect words and pictures by studying lots of examples. So when you show it a new picture, it can guess what it is, even if it has never seen that exact picture before. This is super useful because it can help computers understand images without needing to be specially trained for each new task.
Technical Explanation
CLIP (Contrastive Language–Image Pre-training) is a model developed by OpenAI that learns a joint representation of images and text. It uses a dual-encoder architecture, where one encoder processes images and the other processes text. Both encoders project their inputs into a shared embedding space. The training objective is to maximize the similarity between the embeddings of matched image-text pairs while minimizing the similarity of unmatched pairs in the same batch. This is achieved with a contrastive loss computed over the pairwise cosine similarities. Here’s a simple inference example using PyTorch and OpenAI’s clip package (CLIP is not part of torchvision, so the model is loaded through clip.load):
```python
import torch
import clip  # OpenAI's CLIP package (github.com/openai/CLIP), not part of torchvision
from torchvision import transforms
from PIL import Image

# Load pre-trained CLIP model; clip.load also returns a preprocessing
# pipeline, which is ignored here in favor of the explicit one below
model, _ = clip.load('ViT-B/32', device='cpu')
model.eval()

# Preprocess image with CLIP's normalization statistics
image = Image.open('example.jpg').convert('RGB')
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                         std=[0.26862954, 0.26130258, 0.27577711]),
])
input_image = preprocess(image).unsqueeze(0)

# Encode image and text
with torch.no_grad():
    image_features = model.encode_image(input_image)
    text_features = model.encode_text(clip.tokenize(['a cat', 'a dog']))

# Cosine similarity between the image and each caption (shape: [2])
similarity = torch.cosine_similarity(image_features, text_features)
```
Academic Context
CLIP, introduced by Radford et al. in ‘Learning Transferable Visual Models From Natural Language Supervision’ (2021), represents a significant advance in bridging vision and language. The model leverages the vast amount of image-text pairs available on the internet to learn visual concepts without task-specific annotations. The mathematical foundation is contrastive learning: within each training batch, the model is optimized to score every correct image-text pair above the incorrect pairings in the shared embedding space. This allows CLIP to perform zero-shot transfer, in which a new task is specified only through text prompts, across applications such as image classification and retrieval. Key papers include ‘Contrastive Learning of Generalized Visual Representations’ and ‘Zero-Shot Learning with Semantic Output Codes’.
Code Examples
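The contrastive objective described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than CLIP's actual training code: the function names are hypothetical, the embeddings are toy 2-D vectors, and the temperature is fixed at 0.07 (in CLIP it is a learned parameter). A real training loop would use PyTorch tensors and large batches.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against a target index (stable log-softmax)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: pair k of images matches pair k of texts.

    temperature is an assumption here; it is fixed for illustration only.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each row should peak on the diagonal
    loss_i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image: each column should peak on the diagonal
    loss_t2i = sum(cross_entropy([logits[r][k] for r in range(n)], k)
                   for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch: matched pairs point in similar directions
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
# Same embeddings with the texts swapped, so every pair is mismatched
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(aligned, shuffled)
```

The loss is low when each image is most similar to its own caption and high when the pairing is scrambled, which is the signal that pulls matched embeddings together during training.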
Example 1:
import torch
import clip  # OpenAI's CLIP package (github.com/openai/CLIP), not part of torchvision
from torchvision import transforms
from PIL import Image

# Load pre-trained CLIP model; clip.load also returns a preprocessing
# pipeline, which is ignored here in favor of the explicit one below
model, _ = clip.load('ViT-B/32', device='cpu')
model.eval()

# Preprocess image with CLIP's normalization statistics
image = Image.open('example.jpg').convert('RGB')
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                         std=[0.26862954, 0.26130258, 0.27577711]),
])
input_image = preprocess(image).unsqueeze(0)

# Encode image and text
with torch.no_grad():
    image_features = model.encode_image(input_image)
    text_features = model.encode_text(clip.tokenize(['a cat', 'a dog']))

# Cosine similarity between the image and each caption (shape: [2])
similarity = torch.cosine_similarity(image_features, text_features)
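The similarity scores from Example 1 become a zero-shot prediction once they are turned into probabilities over the candidate captions. The sketch below shows that step in plain Python. The similarity values are hypothetical placeholders rather than real model outputs, the helper name is made up for illustration, and the scaling factor of 100 is an assumption that mirrors the value used in OpenAI's published usage example.

```python
import math

def zero_shot_probs(similarities, logit_scale=100.0):
    """Softmax over scaled cosine similarities (one score per candidate caption)."""
    scaled = [logit_scale * s for s in similarities]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

labels = ['a cat', 'a dog']
similarities = [0.27, 0.21]  # hypothetical cosine similarities, not real outputs

probs = zero_shot_probs(similarities)
prediction = labels[probs.index(max(probs))]
print(prediction, probs)
```

Because the classes are defined only by their text prompts, changing the task amounts to changing the list of captions; no retraining is involved.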
View Source: https://arxiv.org/abs/2511.16674v1