
Gradient Descent: Minimizing The Loss Function

February 2, 2025 • 5 min read


Backpropagation & Gradient Descent

Key Terms

  • Gradient = Rate of Change = Derivative
  • In deep learning, the gradient tells us how to adjust weights to reduce error
  • Backpropagation computes gradients using the chain rule
  • Gradient descent moves in the opposite direction of the gradient to improve predictions

Visualizing Gradient Descent

This plot shows gradient descent in action as we try to minimize a simple loss function (a short code sketch that reproduces this kind of figure follows the list below):

What's Happening in the Graph?

  • The blue curve represents the loss function, which is a simple U-shaped curve
  • The goal is to reach the lowest point (where $x = 0$)
  • The red points show the path of gradient descent:
    • We start at a point far from the minimum
    • We compute the gradient (derivative) at that point
    • We move in the opposite direction of the gradient to decrease the loss
    • This process repeats until we reach the lowest point
  • The red dashed line shows the step-by-step movement as we adjust towards the minimum
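
The original figure isn't reproduced here, but a minimal sketch like the following generates the same kind of plot. The specific loss $L(x) = x^2$, starting point, and learning rate are illustrative assumptions, not the post's exact settings:

import numpy as np
import matplotlib.pyplot as plt

loss = lambda x: x ** 2     # assumed loss: a simple U-shaped curve, L(x) = x^2
grad = lambda x: 2 * x      # its derivative

x, lr = 4.0, 0.3            # illustrative starting point and learning rate
path = [x]
for _ in range(10):
    x = x - lr * grad(x)    # move opposite to the gradient
    path.append(x)

xs = np.linspace(-5, 5, 200)
plt.plot(xs, loss(xs), label="loss L(x) = x^2")                                   # the U-shaped curve
plt.plot(path, [loss(p) for p in path], "ro--", label="gradient descent steps")   # red points + dashed path
plt.xlabel("x")
plt.ylabel("loss")
plt.legend()
plt.show()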

How This Relates to Deep Learning

  • In a neural network, each weight (parameter) has a gradient that tells us how to adjust it
  • Backpropagation computes gradients for all weights using the chain rule
  • Gradient descent updates the weights, just like in the example, moving towards the best solution

Putting it Together

  • Gradient descent is just moving step by step downhill to minimize the loss
  • The gradient (derivative) tells us which direction to move
  • Backpropagation computes these gradients for every weight in the neural network
  • By following the gradient, the model "learns" better predictions

Step-by-Step Walkthrough of Backpropagation and Gradients

Let's break it down into simple, clear steps so you understand exactly what's happening in the training process.

Step 1: Define the Simple Neural Network

import torch

class SimpleNN(torch.nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.weight = torch.nn.Parameter(torch.tensor([2.0], requires_grad=True))  # Start with weight = 2

    def forward(self, x):
        return self.weight * x  # Simple linear model: y = weight * x

model = SimpleNN()  # Instantiate the network so we can call model(x) later

  • 🔹 We create a simple neural network with one weight
  • 🔹 The goal is for the weight to learn the correct value (which should be 4)
  • 🔹 requires_grad=True tells PyTorch to track gradients so we can apply backpropagation

Step 2: Define the Loss Function

loss_fn = torch.nn.MSELoss()  # Mean Squared Error loss function
  • 🔹 We use Mean Squared Error (MSE), which measures how far off our predictions are
  • 🔹 Formula for MSE: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$, which for our single training example reduces to $L = (\hat{y} - y)^2$ (a quick numerical check follows this list)

  • If our model predicts badly, the loss is high
  • If our model predicts well, the loss is low
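
As a quick sanity check, torch.nn.MSELoss computes exactly this formula. The prediction and target values below are just illustrative:

import torch

loss_fn = torch.nn.MSELoss()
prediction = torch.tensor([2.0])    # example prediction
target = torch.tensor([4.0])        # example target
print(loss_fn(prediction, target))  # tensor(4.) — the same as (2 - 4)^2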

Step 3: Define the Training Data

x_train = torch.tensor([1.0])  # Input value
y_train = torch.tensor([4.0])  # Expected output (goal is for weight to be 4)
  • 🔹 We input $x = 1$ into the model
  • 🔹 We want the model to output $y = 4$
  • 🔹 The model will adjust the weight until it learns to predict 4

Step 4: Forward Pass (Prediction)

y_pred = model(x_train)  # Forward pass
  • 🔹 The model makes a prediction using: $\hat{y} = w \cdot x$

  • Initially, the weight is $w = 2$, so: $\hat{y} = 2 \times 1 = 2$

  • But the correct answer is 4, so the model is wrong

Step 5: Compute the Loss

loss = loss_fn(y_pred, y_train)  # Compute loss
  • 🔹 We compute the MSE loss: $L = (\hat{y} - y)^2 = (2 - 4)^2 = 4$

  • The loss is high because the prediction is far from 4
  • We need to reduce the loss by updating the weight

Step 6: Compute the Gradient (Backpropagation)

loss.backward()  # Compute gradients
  • 🔹 PyTorch automatically calculates the derivative of the loss function with respect to the weight

How does this work?

  • Loss Function: $L = (\hat{y} - y)^2$
  • Since $\hat{y} = w \cdot x$, we take the derivative: $\frac{\partial L}{\partial w} = 2(w \cdot x - y) \cdot x$

  • Substituting $w = 2$, $x = 1$, and $y = 4$: $\frac{\partial L}{\partial w} = 2(2 \cdot 1 - 4) \cdot 1 = -4$

  • This gradient tells us that if we increase the weight, the loss will decrease (the snippet below verifies this value numerically)
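
As a sketch of how you could check this by hand, reusing the model, loss_fn, x_train, and y_train defined in the earlier steps, PyTorch stores the computed gradient in the weight's .grad attribute after loss.backward():

y_pred = model(x_train)          # forward pass: 2 * 1 = 2
loss = loss_fn(y_pred, y_train)  # loss: (2 - 4)^2 = 4
loss.backward()                  # backpropagation fills in .grad
print(model.weight.grad)         # tensor([-4.]) — matches the hand calculation above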

Step 7: Update the Weight (Gradient Descent)

  • 🔹 Gradient Descent Formula: $w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{\partial L}{\partial w}$, where $\eta$ is the learning rate

  • Initially, weight = 2
  • The computed gradient is -4
  • We update the weight using learning rate $\eta = 0.1$: $w_{\text{new}} = 2 - 0.1 \times (-4) = 2.4$ (see the code sketch after this list)

  • The weight moves closer to 4
  • The model is getting better!
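
The post doesn't show the update code itself; a minimal sketch of this step, applying the formula directly to the weight from the earlier snippets, looks like this (in practice you would usually let torch.optim.SGD do this for you):

learning_rate = 0.1  # the learning rate used in the worked example
with torch.no_grad():                                   # pause gradient tracking while we edit the weight
    model.weight -= learning_rate * model.weight.grad   # w_new = w_old - lr * gradient = 2 - 0.1 * (-4)
model.weight.grad.zero_()                               # reset the gradient before the next iteration
print(model.weight)                                     # Parameter containing: tensor([2.4000], requires_grad=True)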

Step 8: Repeat the Process

  • 🔹 We repeat steps 4-7 for multiple iterations (a full training-loop sketch follows this list):
    • The weight gets closer and closer to 4
    • The loss gets smaller and smaller
    • Eventually, the model learns the correct weight
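
Putting steps 4-7 into a loop gives the full training process. This is a minimal sketch assuming torch.optim.SGD with a learning rate of 0.1 and 20 iterations (the post doesn't specify the optimizer or iteration count):

model = SimpleNN()                                        # weight starts at 2.0
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # lr matches the worked example

x_train = torch.tensor([1.0])
y_train = torch.tensor([4.0])

for step in range(20):                     # number of iterations is an arbitrary choice
    y_pred = model(x_train)                # Step 4: forward pass
    loss = loss_fn(y_pred, y_train)        # Step 5: compute loss
    optimizer.zero_grad()                  # clear the previous gradient
    loss.backward()                        # Step 6: backpropagation
    optimizer.step()                       # Step 7: gradient descent update
    print(step, round(model.weight.item(), 4), round(loss.item(), 4))

The printed weight approaches 4 and the loss approaches 0 as the loop runs.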

Full Process

  1. Forward Pass: The model makes a prediction
  2. Compute Loss: We check how wrong the model is
  3. Backpropagation: We compute gradients (derivatives) using the chain rule
  4. Gradient Descent: We update the weight opposite to the gradient
  5. Repeat: The model keeps improving with each step

How This Connects to the Chain Rule

  • Each layer in a neural network is a function inside another function
  • The chain rule lets us calculate the gradient layer by layer
  • Backpropagation applies the chain rule recursively to update all weights (a tiny autograd example follows)
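
As an illustrative sketch (not from the post), you can see the chain rule at work by composing two simple functions and letting autograd differentiate through both:

import torch

x = torch.tensor(3.0, requires_grad=True)
inner = x ** 2         # inner function: f(x) = x^2, so df/dx = 2x
outer = 5 * inner + 1  # outer function: g(f) = 5f + 1, so dg/df = 5

outer.backward()       # chain rule: dg/dx = dg/df * df/dx = 5 * 2x
print(x.grad)          # tensor(30.) — exactly 5 * (2 * 3)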