CLIP

Beginner Explanation

Imagine you have a friend who can look at a picture and describe it in words, like ‘a cat’ or ‘a sunset’. CLIP is like that friend, but it’s a computer program. It learns to connect words and pictures by studying a very large number of pictures paired with captions. So when you show it a new picture, it can pick the description that fits best, even if it has never seen that exact picture before. This is useful because it lets computers recognize new kinds of images without being specially retrained for each new task.

Technical Explanation

CLIP (Contrastive Language–Image Pre-training) is a model developed by OpenAI that learns a joint representation of images and text. It uses a dual-encoder architecture: one encoder processes images and the other processes text, and both project their inputs into a shared embedding space. The training objective maximizes the similarity between the embeddings of matched image-text pairs while minimizing the similarity of unmatched pairs in the same batch. This is achieved with a contrastive loss. Here’s a simple inference example using PyTorch and OpenAI’s clip package:

```python
import torch
import clip  # OpenAI's CLIP package: pip install git+https://github.com/openai/CLIP.git
from torchvision import transforms
from PIL import Image

# Load pre-trained CLIP model (clip.load also returns a matching preprocess transform, unused here)
device = 'cpu'
model, _ = clip.load('ViT-B/32', device=device)
model.eval()

# Preprocess image using CLIP's normalization statistics
image = Image.open('example.jpg').convert('RGB')
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073], std=[0.26862954, 0.26130258, 0.27577711]),
])
input_image = preprocess(image).unsqueeze(0).to(device)

# Encode image and text
text_tokens = clip.tokenize(['a cat', 'a dog']).to(device)
with torch.no_grad():
    image_features = model.encode_image(input_image)
    text_features = model.encode_text(text_tokens)

# Calculate similarity: one cosine score per caption, shape (2,)
similarity = torch.cosine_similarity(image_features, text_features)
```

Academic Context

CLIP, introduced by Radford et al. in ‘Learning Transferable Visual Models From Natural Language Supervision’ (2021), showed that natural-language supervision alone can produce visual representations that transfer to many downstream tasks. The model was trained on roughly 400 million image-text pairs collected from the internet, so it learns visual concepts without task-specific annotations. The mathematical foundation is contrastive learning: for a batch of N pairs, the model scores all N × N image-text combinations and is optimized with a symmetric cross-entropy loss so that the N correct pairs score higher than the N² − N incorrect ones. Because class names can be written as text prompts at inference time, CLIP can perform zero-shot transfer across various tasks, including image classification and retrieval. Key papers include ‘Contrastive Learning of Generalized Visual Representations’ and ‘Zero-Shot Learning with Semantic Output Codes’.
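To make the contrastive objective concrete, here is a minimal, dependency-free sketch of a symmetric cross-entropy loss over a batch of paired embeddings. It is an illustration under stated assumptions, not the original training code: the function names are hypothetical, the temperature is fixed at 0.07 (CLIP learns this value during training), and a real implementation would use batched tensor operations rather than Python lists.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix.

    Assumes image_embs[i] and text_embs[i] form a matched pair.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, divided by the temperature
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    # Image-to-text direction: row i should concentrate on column i
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should concentrate on row j
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy 2-D embeddings (hypothetical values, for illustration only)
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = clip_contrastive_loss(aligned, aligned)
loss_mismatched = clip_contrastive_loss(aligned, swapped)
```

With these toy inputs the loss is close to zero when each image embedding lines up with its own caption and large when the captions are swapped, which is the behavior the objective is meant to reward and penalize.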

Code Examples

Example 1:

import torch
import clip  # OpenAI's CLIP package: pip install git+https://github.com/openai/CLIP.git
from torchvision import transforms
from PIL import Image

# Load pre-trained CLIP model (clip.load also returns a matching preprocess transform, unused here)
device = 'cpu'
model, _ = clip.load('ViT-B/32', device=device)
model.eval()

# Preprocess image using CLIP's normalization statistics
image = Image.open('example.jpg').convert('RGB')
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073], std=[0.26862954, 0.26130258, 0.27577711]),
])
input_image = preprocess(image).unsqueeze(0).to(device)

# Encode image and text
text_tokens = clip.tokenize(['a cat', 'a dog']).to(device)
with torch.no_grad():
    image_features = model.encode_image(input_image)
    text_features = model.encode_text(text_tokens)

# Calculate similarity: one cosine score per caption, shape (2,)
similarity = torch.cosine_similarity(image_features, text_features)
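Example 1 stops at raw cosine similarities. For zero-shot classification, those scores are usually scaled and passed through a softmax so they can be read as probabilities over the candidate captions. The sketch below shows that step in plain Python. The function name and the similarity values are hypothetical, and the scale of 100 is an assumption that follows the usage example in OpenAI's CLIP repository.

```python
import math

def zero_shot_probs(similarities, logit_scale=100.0):
    """Softmax over scaled cosine similarities, one score per candidate caption."""
    scaled = [logit_scale * s for s in similarities]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

labels = ['a cat', 'a dog']
# Hypothetical cosine similarities for one image; real values come from the model
similarities = [0.27, 0.21]
probs = zero_shot_probs(similarities)
predicted = labels[probs.index(max(probs))]
```

With a similarity tensor like the one computed in Example 1, the same function could be called as zero_shot_probs(similarity.tolist()). The large scale factor matters because CLIP's cosine similarities tend to sit in a narrow range, so an unscaled softmax would give nearly uniform probabilities.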

View Source: https://arxiv.org/abs/2511.16674v1

Pre-trained Models

sentence-transformers/clip-ViT-B-32-multilingual-v1

sentence-transformers/clip-ViT-B-32

openai/clip-vit-base-patch32

openai/clip-vit-large-patch14

laion/CLIP-ViT-H-14-laion2B-s32B-b79K

zer0int/CLIP-GmP-ViT-L-14

mlunar/clip-variants

OFA-Sys/chinese-clip-vit-base-patch16

Marqo/onnx-open_clip-ViT-H-14

laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

openai/clip-vit-large-patch14-336

Relevant Datasets

nousr/laion5b-subset-and-cliph-embeddings

CodedotAI/code-clippy-tfrecords

CodedotAI/code_clippy

CodedotAI/code_clippy_github

clips/mfaq

clips/mqa

flax-community/code_clippy_data

fuyun1107/clip-for-vlp

rocca/clip-keyphrase-embeddings

M-CLIP/ImageCaptions-7M-Translations

M-CLIP/ImageCaptions-7M-Embeddings
