Stochastic Gradient Descent

May 28, 2025

Gradient Descent Motivation

Machine Learning has 2 phases: Learning and Prediction.

  • Learning: The learning algorithm creates a hypothesis function.

  • Prediction: The hypothesis function's job is to predict outputs.

The cost function measures the error between predicted and actual outputs. We can tune the learning parameters to minimize this error, and gradient descent is a popular way of doing so.
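For example, a common cost function for regression is the mean squared error between predictions and targets. A minimal sketch (the function name below is illustrative, not part of the model code later in this post):

import numpy as np

def mean_squared_error(predictions: np.ndarray, targets: np.ndarray) -> float:
    # Average squared difference between predicted and actual outputs
    return float(np.mean((predictions - targets) ** 2))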

Gradient Descent Algorithm

Goal: Find a local minimum of the cost function.

  1. Start with any values for the learning parameters.

    1. Can pick random values or initialize to all 0s.

Example:

parameters = np.zeros(dimensions + 1)

  2. Update parameters to go downhill on the cost surface.

Gradient descent update rule:

parameters += learning_rate * (y_i - predict(x_i)) * x_i
  3. Repeat until you reach a local optimum.

    1. If the cost function stops significantly decreasing (and the learning rate α is small), you're typically at a local minimum (see the stopping check sketched below).
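A minimal sketch of that stopping check, assuming hypothetical cost(parameters) and update(parameters) helpers that stand in for the cost function and a single gradient descent step:

def run_until_converged(parameters, cost, update, tolerance=1e-6, max_iter=10_000):
    # Stop once the cost no longer decreases by more than `tolerance`
    previous_cost = cost(parameters)
    for _ in range(max_iter):
        parameters = update(parameters)  # one gradient descent update
        current_cost = cost(parameters)
        if previous_cost - current_cost < tolerance:
            break
        previous_cost = current_cost
    return parameters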

Gradient Descent Example

Linear regression model using a gradient-descent-based learning algorithm
"""Machine Learning Models"""

import numpy as np


class LinearRegression:
    """Linear Regression"""

    def __init__(self, dimensions: int = 1):
        """init

        Parameters
        ----------
        dimensions
            number of dimensions for linear regression, by default 1
        """
        # Initialize parameters
        self.parameters = np.zeros(dimensions + 1)

    def learn(self, x: np.ndarray, y: np.ndarray, learning_rate: float, max_iter: int):
        """learn maps input-output training dataset to a hypothesis function

        Parameters
        ----------
        x
            input data
        y
            output data
        learning_rate
            learning rate
        max_iter
            maximum number of passes (epochs) over the training data; there is no early stopping, so this equals the total number of passes run
        """
        # Stochastic gradient descent: one parameter update per training example
        for _ in range(max_iter):
            for x_raw, y_i in zip(x, y):
                x_i = self.reconstruct_input_vec(x_raw)
                self.parameters += learning_rate * (y_i - self.predict(x_i)) * x_i

    def reconstruct_input_vec(self, x: np.ndarray | float):
        """Create a dummy input, x_0, always defined as 1

        Parameters
        ----------
        x
            input

        Returns
        -------
            (x_0,...x_n), x_0 = 1
        """
        return np.concatenate(([1], np.atleast_1d(x)))

    def predict(self, x: np.ndarray):
        """Predict using linear regression hypothesis function

        Parameters
        ----------
        x
            input

        Returns
        -------
            prediction
        """
        return np.dot(self.parameters, x)
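A quick usage sketch (the data below is made up for illustration; it is generated from the line y = 1 + 2x, so the learned parameters should end up close to [1, 2]):

import numpy as np

model = LinearRegression(dimensions=1)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x  # training outputs from a known line
model.learn(x, y, learning_rate=0.01, max_iter=1000)

# predict expects the dummy x_0 = 1 to already be prepended
print(model.predict(model.reconstruct_input_vec(5.0)))  # roughly 11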

Batch Vs Stochastic

Batch gradient descent scans through the entire training set before making a single update, whereas stochastic gradient descent updates the parameters after each input-output pair. Stochastic gradient descent is usually preferred over batch gradient descent when the dataset is large, since it starts making progress before the whole training set has been scanned (see the sketch below).
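To make the contrast concrete, here is a minimal sketch of one epoch of each, assuming a hypothetical predict(parameters, x_i) helper and inputs that already include the dummy x_0 = 1:

import numpy as np

def predict(parameters, x_i):
    # Hypothetical linear hypothesis: dot product of parameters and input vector
    return np.dot(parameters, x_i)

def batch_epoch(parameters, x, y, learning_rate):
    # Batch: accumulate the gradient over the entire training set, then update once
    gradient = sum((y_i - predict(parameters, x_i)) * x_i for x_i, y_i in zip(x, y))
    return parameters + learning_rate * gradient

def stochastic_epoch(parameters, x, y, learning_rate):
    # Stochastic: update immediately after each input-output pair
    for x_i, y_i in zip(x, y):
        parameters = parameters + learning_rate * (y_i - predict(parameters, x_i)) * x_i
    return parameters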