AI/ML | Beginner
Gradient Descent: Minimizing The Loss Function
February 2, 2025 • 5 min read

Backpropagation & Gradient Descent
Key Terms
- Gradient = Rate of Change = Derivative
- In deep learning, the gradient tells us how to adjust weights to reduce error
- Backpropagation computes gradients using the chain rule
- Gradient descent moves in the opposite direction of the gradient to improve predictions
Visualizing Gradient Descent
This plot shows gradient descent in action, where we try to minimize a simple loss function:
What's Happening in the Graph?
- The blue curve represents the loss function, which is a simple U-shaped curve
- The goal is to reach the lowest point (where $x = 0$)
- The red points show the path of gradient descent:
- We start at an initial point far from the minimum
- We compute the gradient (derivative) at that point
- We move in the opposite direction of the gradient to decrease the loss
- This process repeats until we reach the lowest point
- The red dashed line shows the step-by-step movement as we adjust towards the minimum (reproduced in the sketch below)
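If you want to reproduce a plot like this yourself, here is a minimal sketch; the starting point, learning rate, and number of steps are assumptions, not the exact values behind the original figure:

import numpy as np
import matplotlib.pyplot as plt

def loss(x):
    return x ** 2          # simple U-shaped loss

def grad(x):
    return 2 * x           # its derivative

x = 4.0                    # assumed starting point, far from the minimum at x = 0
lr = 0.2                   # assumed learning rate
path = [x]
for _ in range(15):
    x = x - lr * grad(x)   # move opposite to the gradient
    path.append(x)

xs = np.linspace(-5, 5, 200)
plt.plot(xs, loss(xs), label="loss")                                   # the blue curve
plt.plot(path, [loss(p) for p in path], "ro--", label="descent path")  # red points and dashed line
plt.xlabel("x")
plt.ylabel("loss")
plt.legend()
plt.show()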
How This Relates to Deep Learning
- In a neural network, each weight (parameter) has a gradient that tells us how to adjust it
- Backpropagation computes gradients for all weights using the chain rule
- Gradient descent updates the weights, just like in the example, moving towards the best solution
Putting it Together
- Gradient descent is just moving step by step downhill to minimize the loss
- The gradient (derivative) tells us which direction to move
- Backpropagation computes these gradients for every weight in the neural network
- By following the gradient, the model "learns" better predictions
Step-by-Step Walkthrough of Backpropagation and Gradients
Let's break it down into simple, clear steps so you understand exactly what's happening in the training process.
Step 1: Define the Simple Neural Network
import torch

class SimpleNN(torch.nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        # Start with weight = 2
        self.weight = torch.nn.Parameter(torch.tensor([2.0], requires_grad=True))

    def forward(self, x):
        return self.weight * x  # Simple linear model: y = weight * x

model = SimpleNN()  # Instantiate the model so we can use it in the later steps

- We create a simple neural network with one weight
- The goal is for the weight to learn the correct value (which should be 4)
- requires_grad=True tells PyTorch to track gradients so we can apply backpropagation
Step 2: Define the Loss Function
loss_fn = torch.nn.MSELoss()  # Mean Squared Error loss function

- We use Mean Squared Error (MSE), which measures how far off our predictions are
- Formula for MSE: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$, which for our single sample is just $(y_{pred} - y_{true})^2$ (a quick numeric check follows this list)
- If our model predicts badly, the loss is high
- If our model predicts well, the loss is low
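To convince yourself that MSELoss matches the formula above, here is a minimal check reusing loss_fn from this step (the tensor values are just illustrative):

y_pred = torch.tensor([2.0])
y_true = torch.tensor([4.0])

print(loss_fn(y_pred, y_true))          # tensor(4.) -- PyTorch's MSELoss
print(((y_pred - y_true) ** 2).mean())  # tensor(4.) -- the formula computed by hand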
Step 3: Define the Training Data
x_train = torch.tensor([1.0])  # Input value
y_train = torch.tensor([4.0])  # Expected output (goal is for weight to be 4)

- We input $x = 1$ into the model
- We want the model to output $y = 4$
- The model will adjust the weight until it learns to predict 4
Step 4: Forward Pass (Prediction)
y_pred = model(x_train)  # Forward pass

- The model makes a prediction using: $y_{pred} = w \cdot x$
- Initially, weight $w = 2$, so: $y_{pred} = 2 \times 1 = 2$
- But the correct answer is 4, so the model is wrong
Step 5: Compute the Loss
loss = loss_fn(y_pred, y_train)  # Compute loss

- We compute the MSE loss: $(y_{pred} - y_{true})^2 = (2 - 4)^2 = 4$
- The loss is high because the prediction (2) is far from 4
- We need to reduce the loss by updating the weight
Step 6: Compute the Gradient (Backpropagation)
loss.backward()  # Compute gradients

- PyTorch automatically calculates the derivative of the loss function with respect to the weight

How does this work?
- Loss Function: $L = (y_{pred} - y_{true})^2$
- Since $y_{pred} = w \cdot x$, we take the derivative: $\frac{\partial L}{\partial w} = 2 (y_{pred} - y_{true}) \cdot x$
- Substituting $y_{pred} = 2$ and $y_{true} = 4$: $\frac{\partial L}{\partial w} = 2 (2 - 4) \cdot 1 = -4$
- This gradient tells us that if we increase the weight, the loss will decrease (you can confirm the value with the short check below)
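If you run the snippets from Steps 1-6 in order, you can read the stored gradient directly; this is just a sanity check, not part of the original code:

print(model.weight.grad)  # tensor([-4.]) -- matches the hand-computed gradient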
Step 7: Update the Weight (Gradient Descent)
- Gradient Descent Formula: $w_{new} = w_{old} - \eta \cdot \frac{\partial L}{\partial w}$, where $\eta$ is the learning rate
- Initially, weight $w = 2$
- The computed gradient is $-4$
- We update the weight using learning rate $\eta = 0.1$: $w_{new} = 2 - 0.1 \times (-4) = 2.4$
- The weight moves from 2 to 2.4, closer to 4
- The model is getting better! (One way to write this update in code is sketched below)
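The original walkthrough doesn't show the update code itself; one common way to perform it, assuming a plain SGD optimizer with learning rate 0.1, is:

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# ...after loss.backward() from Step 6...
optimizer.step()          # weight <- weight - 0.1 * (-4) = 2.4
optimizer.zero_grad()     # clear the stored gradient before the next iteration
print(model.weight.data)  # tensor([2.4000])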
Step 8: Repeat the Process
- We repeat steps 4-7 for multiple iterations (a complete training loop is sketched after this list):
- The weight gets closer and closer to 4
- The loss gets smaller and smaller
- Eventually, the model learns the correct weight
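Putting Steps 4-7 into a loop gives a complete, runnable version of the example; the iteration count of 20 and the learning rate of 0.1 are assumptions, not values from the original post:

import torch

class SimpleNN(torch.nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.weight = torch.nn.Parameter(torch.tensor([2.0]))  # Start with weight = 2

    def forward(self, x):
        return self.weight * x

model = SimpleNN()
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x_train = torch.tensor([1.0])
y_train = torch.tensor([4.0])

for step in range(20):
    y_pred = model(x_train)          # Step 4: forward pass
    loss = loss_fn(y_pred, y_train)  # Step 5: compute loss
    optimizer.zero_grad()
    loss.backward()                  # Step 6: backpropagation
    optimizer.step()                 # Step 7: gradient descent update
    print(f"step {step:2d}  weight = {model.weight.item():.4f}  loss = {loss.item():.4f}")

# The printed weight approaches 4 and the loss approaches 0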
Full Process
- Forward Pass: The model makes a prediction
- Compute Loss: We check how wrong the model is
- Backpropagation: We compute gradients (derivatives) using the chain rule
- Gradient Descent: We update the weight opposite to the gradient
- Repeat: The model keeps improving with each step
How This Connects to the Chain Rule
- Each layer in a neural network is a function inside another function
- The chain rule lets us calculate the gradient layer by layer
- Backpropagation applies the chain rule recursively to update all weights (a tiny worked example follows)
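As a small illustration of this idea (not part of the original example), here is a two-step composition where autograd's result matches the chain rule computed by hand:

import torch

x = torch.tensor(2.0, requires_grad=True)
g = x ** 2     # inner function: g(x) = x^2
f = 3 * g + 1  # outer function: f(g) = 3g + 1

f.backward()
# Chain rule by hand: df/dx = (df/dg) * (dg/dx) = 3 * 2x = 6x, which is 12 at x = 2
print(x.grad)  # tensor(12.)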
