Beginner Explanation
Imagine you have a coloring book where some pictures are missing parts, like a cat with a missing tail. To fill in the tail, you would have to guess what it looks like from the rest of the cat. Masked prediction is like that! In machine learning, we cover up parts of the input data and train the computer to figure out what's missing. This helps the computer learn because it has to use clues from the visible parts.
Technical Explanation
Masked prediction is a self-supervised learning technique, best known from models such as BERT in natural language processing. A fraction of the input tokens (words or subword pieces) is randomly masked, and the model's objective is to predict the masked tokens from the context provided by the unmasked ones. In Python with PyTorch, training can be implemented by building a dataset in which tokens are randomly selected for masking, then using a cross-entropy loss to compare the predicted tokens with the original ones. The snippet below shows the prediction step with a pretrained BERT model:
```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

input_text = "The cat sat on the [MASK]."
tokens = tokenizer(input_text, return_tensors='pt')

with torch.no_grad():
    outputs = model(**tokens)
predictions = outputs.logits

# Locate the [MASK] position instead of hard-coding it: the tokenizer
# prepends [CLS], so the index differs from the word's place in the sentence.
mask_index = (tokens['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
predicted_index = torch.argmax(predictions[0, mask_index]).item()  # Predicting the masked token
predicted_token = tokenizer.decode([predicted_index])
print(predicted_token)  # Output should be a word that fits the context
```
Academic Context
Masked prediction is rooted in self-supervised learning, where models learn to predict parts of the input from other parts without requiring labeled data. In natural language processing, the technique is most closely associated with the BERT model (Devlin et al., 2018), which introduced masked language modeling (MLM) as a pre-training objective. Training minimizes the negative log-likelihood of the masked tokens given the unmasked context: L(θ) = -Σ log P(x_m | x_u; θ), where the sum runs over the masked positions, x_m are the masked tokens, x_u are the unmasked tokens, and θ represents the model parameters. This approach has been pivotal in advancing natural language understanding and has influenced many later architectures.
Code Examples
Example 1:
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

input_text = "The cat sat on the [MASK]."
tokens = tokenizer(input_text, return_tensors='pt')

with torch.no_grad():
    outputs = model(**tokens)
predictions = outputs.logits

# Find the [MASK] position rather than hard-coding an index; [CLS] is
# prepended by the tokenizer, which shifts every word by one or more slots.
mask_index = (tokens['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
predicted_index = torch.argmax(predictions[0, mask_index]).item()  # Predicting the masked token
predicted_token = tokenizer.decode([predicted_index])
print(predicted_token)  # Output should be a word that fits the context
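The random masking step described in the technical explanation can be sketched in plain Python. This is a minimal illustration, not BERT's actual preprocessing: the helper name `mask_tokens`, the `mask_prob` default, and the whitespace tokenization are assumptions made for the example, and BERT's published recipe additionally replaces some selected tokens with random or unchanged tokens rather than always using `[MASK]`.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder symbol, following BERT's convention


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Hypothetical helper: return (masked_tokens, labels).

    labels[i] holds the original token where position i was masked and
    None elsewhere, so a loss would be computed on masked positions only.
    """
    rng = random.Random(seed)  # local RNG keeps the example reproducible
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the token from the model
            labels.append(token)       # remember what it should predict
        else:
            masked.append(token)       # visible context stays unchanged
            labels.append(None)        # no prediction target here
    return masked, labels


tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

Keeping the labels aligned with the input positions is what lets a training loop ignore the unmasked tokens when computing the loss.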
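The negative log-likelihood objective from the academic context section can be worked through on toy numbers. The sketch below is only illustrative: the function names `softmax` and `masked_nll`, the three-word vocabulary, and the logit values are assumptions, and a real implementation would typically use a framework's cross-entropy loss instead.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_nll(logits_per_position, target_ids, masked_positions):
    """Sum of -log P(x_m | x_u) over the masked positions only."""
    loss = 0.0
    for pos in masked_positions:
        probs = softmax(logits_per_position[pos])
        loss -= math.log(probs[target_ids[pos]])
    return loss


# Toy setup: 4 positions, vocabulary of 3 token ids, positions 1 and 3 masked.
logits = [
    [2.0, 0.1, 0.1],
    [0.2, 1.5, 0.3],
    [0.1, 0.1, 2.0],
    [1.0, 1.0, 1.0],
]
targets = [0, 1, 2, 0]
print(masked_nll(logits, targets, masked_positions=[1, 3]))
```

Only positions 1 and 3 contribute to the total; the logits at the unmasked positions are never read, which mirrors the sum over masked tokens in the formula.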
View Source: https://arxiv.org/abs/2511.16639v1