Beginner Explanation
Imagine you’re trying to find the lowest point in a hilly landscape while blindfolded. You can only feel the ground beneath your feet. To find your way down, you take small steps in the direction that feels steepest downwards. Each step is based on what you feel right under your feet, rather than looking at the entire landscape. Stochastic Gradient Descent (SGD) works similarly: it helps a computer learn by making small adjustments to its guesses based on just a few examples at a time, rather than using all the data at once. This makes it faster and helps it avoid getting stuck in small dips that aren’t the true lowest point.

Technical Explanation
Stochastic Gradient Descent (SGD) is an optimization algorithm used to minimize the loss function in machine learning models. Instead of calculating the gradient using the entire dataset, SGD updates the model parameters using a randomly selected subset (mini-batch) of data. This is particularly useful for large datasets. The update rule for a parameter θ is given by:

θ ← θ − η ∇L(θ; x_i, y_i)

where η is the learning rate, ∇L is the gradient of the loss function with respect to the parameters, and (x_i, y_i) is a randomly chosen data point. This process is repeated for many iterations until convergence. Here’s a simple code example:

```python
import numpy as np

# Sample loss function (mean squared error)
def loss_function(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Stochastic Gradient Descent: one parameter update per training example
def stochastic_gradient_descent(X, y, learning_rate=0.01, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    for epoch in range(epochs):
        # Note: samples are visited in order here; shuffling the indices
        # each epoch would make the updates truly stochastic.
        for i in range(m):
            y_pred = np.dot(X[i], theta)
            # Gradient of the squared error for sample i
            gradient = -2 * (y[i] - y_pred) * X[i]
            theta -= learning_rate * gradient
    return theta
```

Academic Context
Stochastic Gradient Descent (SGD) is a variation of the gradient descent optimization algorithm, with roots in the stochastic approximation methods of the 1950s. It is particularly effective for large-scale and online learning scenarios. The convergence properties of SGD have been extensively studied, with key papers such as ‘A Stochastic Approximation Method’ by Robbins and Monro (1951) laying the groundwork. Theoretical aspects of SGD, including its convergence rates and variance reduction techniques, are discussed in ‘Optimization Methods for Large-Scale Machine Learning’ by Bottou et al. (2018). Mathematically, the update step is derived from the principle of minimizing the expected loss by sampling from the data distribution, which introduces noise but allows for faster convergence in practice.

Code Examples
Example 1:

```python
import numpy as np

# Sample loss function (mean squared error)
def loss_function(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Stochastic Gradient Descent function
def stochastic_gradient_descent(X, y, learning_rate=0.01, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    for epoch in range(epochs):
        for i in range(m):
            y_pred = np.dot(X[i], theta)
            gradient = -2 * (y[i] - y_pred) * X[i]
            theta -= learning_rate * gradient
    return theta
```
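As a quick usage sketch (the synthetic data and recovered weights below are illustrative, not from the original), Example 1 can be checked on a small noiseless linear problem, where SGD should recover the true weight vector; the function is repeated so the snippet runs on its own:

```python
import numpy as np

# stochastic_gradient_descent as defined in Example 1, repeated so this
# snippet is self-contained
def stochastic_gradient_descent(X, y, learning_rate=0.01, epochs=100):
    m, n = X.shape
    theta = np.zeros(n)
    for epoch in range(epochs):
        for i in range(m):
            y_pred = np.dot(X[i], theta)
            gradient = -2 * (y[i] - y_pred) * X[i]
            theta -= learning_rate * gradient
    return theta

# Noiseless linear data: y = X @ [2.0, -3.0]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta

theta = stochastic_gradient_descent(X, y)
print(theta)  # should be very close to [2.0, -3.0]
```

Because the data are noiseless, every per-sample gradient vanishes at the true weights, so the iterates settle there rather than oscillating.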
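The expected-loss sampling view described in the Academic Context section is usually implemented with shuffled mini-batches rather than single samples visited in order. A minimal sketch under those assumptions (the function and variable names here are illustrative, not from the original):

```python
import numpy as np

# Mini-batch SGD for linear least squares (illustrative sketch)
def minibatch_sgd(X, y, learning_rate=0.01, epochs=100, batch_size=16, seed=0):
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(seed)
    for epoch in range(epochs):
        perm = rng.permutation(m)  # reshuffle so each batch is a random sample
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            residual = y[idx] - X[idx] @ theta
            # MSE gradient averaged over the mini-batch
            gradient = -2 * X[idx].T @ residual / len(idx)
            theta -= learning_rate * gradient
    return theta

# Noiseless linear data: the optimizer should recover [2.0, -3.0]
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0])
theta = minibatch_sgd(X, y)
print(theta)  # close to [2.0, -3.0]
```

Averaging the gradient over a batch reduces the variance of each update, which is one of the variance reduction ideas the theoretical literature analyzes.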
View Source: https://arxiv.org/abs/2511.16340v1