Softmax

Softmax Regression Algorithm

Multiclass Classification

Introduction to Softmax

Softmax regression is a generalization of logistic regression
Extends binary classification to multiclass classification contexts

Review of Logistic Regression

For binary classification (y ∈ {0, 1}):
Calculate z = w·x + b
Compute a = g(z) using the sigmoid function
Interpret a as P(y=1|x)
P(y=0|x) = 1 - P(y=1|x) = 1 - a
Alternative view (to set up for softmax):
a₁ = P(y=1|x) = sigmoid(z)
a₂ = P(y=0|x) = 1 - a₁
a₁ + a₂ = 1 (probabilities must sum to 1)
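
The two-output view of logistic regression above can be sketched in plain Python. The weights, bias, and input here are hypothetical values chosen only for illustration; they do not come from these notes.

```python
import math

def sigmoid(z):
    """Logistic sigmoid g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters and input (illustration only).
w = [0.5, -1.0]
b = 0.25
x = [2.0, 1.0]

z = sum(wi * xi for wi, xi in zip(w, x)) + b  # z = w·x + b
a1 = sigmoid(z)   # a₁ = P(y=1|x)
a2 = 1.0 - a1     # a₂ = P(y=0|x)

print(a1, a2, a1 + a2)  # the two probabilities sum to 1
```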

Softmax Regression Formula

Example: 4 Classes

For y ∈ {1, 2, 3, 4}:

Calculate each class score:

z₁ = w₁·x + b₁
z₂ = w₂·x + b₂
z₃ = w₃·x + b₃
z₄ = w₄·x + b₄

Convert scores to probabilities:

a₁ = e^z₁ / (e^z₁ + e^z₂ + e^z₃ + e^z₄)
a₂ = e^z₂ / (e^z₁ + e^z₂ + e^z₃ + e^z₄)
a₃ = e^z₃ / (e^z₁ + e^z₂ + e^z₃ + e^z₄)
a₄ = e^z₄ / (e^z₁ + e^z₂ + e^z₃ + e^z₄)
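
As a small numeric illustration of the four formulas above, the sketch below converts four scores into probabilities. The scores are hypothetical; in practice they would come from zⱼ = wⱼ·x + bⱼ.

```python
import math

# Hypothetical class scores z₁..z₄ (illustration only).
z = [2.0, 1.0, 0.0, -1.0]

exp_z = [math.exp(v) for v in z]   # e^z₁, ..., e^z₄
total = sum(exp_z)                 # shared denominator
a = [e / total for e in exp_z]     # a₁, ..., a₄

print(a)        # larger scores get larger probabilities
print(sum(a))   # sums to 1 up to floating-point rounding
```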

General Formula

For y ∈ {1, 2, …, n}:

For each class j, calculate:

zⱼ = wⱼ·x + bⱼ

Convert to probabilities:

aⱼ = e^zⱼ / (∑ₖ₌₁ⁿ e^zₖ)

Parameters:

w₁, w₂, …, wₙ
b₁, b₂, …, bₙ
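
A minimal sketch of the general formula for n classes follows. The function names `softmax` and `softmax_regression` and the parameter values are assumptions made for illustration. Subtracting the largest score before exponentiating is a common numerical safeguard that is not part of these notes; it leaves the probabilities unchanged because the same factor cancels in the numerator and denominator.

```python
import math

def softmax(z):
    """Convert a list of scores into probabilities that sum to 1."""
    m = max(z)  # stabilisation step (an addition, not from the notes)
    exp_z = [math.exp(v - m) for v in z]
    total = sum(exp_z)
    return [e / total for e in exp_z]

def softmax_regression(W, b, x):
    """Return [a₁, ..., aₙ] given one weight vector W[j] and one bias b[j] per class."""
    z = [sum(wi * xi for wi, xi in zip(w_j, x)) + b_j
         for w_j, b_j in zip(W, b)]   # zⱼ = wⱼ·x + bⱼ
    return softmax(z)

# Hypothetical parameters: n = 3 classes, 2 features.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.5, 0.0]
x = [1.0, 2.0]

a = softmax_regression(W, b, x)
print(a)
```

With these values the scores are z = [1.0, 2.5, -3.0], so the second class receives the largest probability.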

Cost Function for Softmax Regression

Logistic Regression Cost (for comparison):

Loss = -y log(a₁) - (1-y)log(a₂)
If y=1: Loss = -log(a₁)
If y=0: Loss = -log(a₂)
Cost = average of losses over training set
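
The logistic loss and cost above can be sketched as follows. The predictions and labels are made-up values, and `logistic_loss` is a name chosen for this example.

```python
import math

def logistic_loss(a1, y):
    """Loss = -y log(a₁) - (1-y) log(a₂), with a₂ = 1 - a₁."""
    a2 = 1.0 - a1
    return -y * math.log(a1) - (1 - y) * math.log(a2)

# Hypothetical predicted probabilities P(y=1|x) and true labels.
preds = [0.9, 0.2, 0.7]
labels = [1, 0, 1]

losses = [logistic_loss(a1, y) for a1, y in zip(preds, labels)]
cost = sum(losses) / len(losses)   # average loss over the training set
print(losses, cost)
```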

Softmax Regression Cost:

If y=j: Loss = -log(aⱼ)
In general: Loss = -log(aᵧ)
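
A short sketch of the softmax loss for a single example. The probability vector is hypothetical, and the class labels run from 1 to n as in these notes, so the code shifts by one to match Python's 0-based indexing.

```python
import math

def softmax_loss(a, y):
    """Loss = -log(a_y), where a is the list [a₁, ..., aₙ] and y is in 1..n."""
    return -math.log(a[y - 1])   # shift label to a 0-based index

# Hypothetical output probabilities for 4 classes.
a = [0.1, 0.6, 0.2, 0.1]

print(softmax_loss(a, 2))   # true class 2 has a₂ = 0.6, giving a small loss
print(softmax_loss(a, 1))   # true class 1 has a₁ = 0.1, giving a larger loss
```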

Intuition

The negative log loss rewards high confidence in the correct class
When aⱼ approaches 1, loss approaches 0
When aⱼ is small, loss becomes large
This pushes the model to assign high probability to the correct class
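
To make the intuition concrete, the sketch below evaluates -log(aⱼ) for a few hypothetical probabilities assigned to the correct class.

```python
import math

# Hypothetical values of aⱼ for the correct class, from confident to very unsure.
probs = [0.99, 0.9, 0.5, 0.1, 0.01]
losses = [-math.log(a_j) for a_j in probs]

for a_j, loss in zip(probs, losses):
    print(f"a_j = {a_j:<5} loss = {loss:.4f}")
```

The loss is close to 0 at aⱼ = 0.99 and grows steadily as aⱼ shrinks toward 0.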

Important Note

For each training example, y takes only one value
Only compute -log(aⱼ) for the actual value j = y
Don’t compute loss terms for other classes
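
The sketch below shows this for a small batch: for each example only the probability of its true class enters the loss. Averaging the losses into a cost mirrors the logistic regression cost and is an assumption here, as are the name `softmax_cost` and the data.

```python
import math

def softmax_cost(A, Y):
    """Average of -log(a_y) over examples.

    A[i] is the list of class probabilities for example i;
    Y[i] is its true label in 1..n.
    """
    # Only the entry for the true class is read; other classes add no terms.
    losses = [-math.log(a[y - 1]) for a, y in zip(A, Y)]
    return sum(losses) / len(losses)

# Hypothetical model outputs for 3 examples and 3 classes.
A = [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.3, 0.3, 0.4]]
Y = [1, 2, 3]

cost = softmax_cost(A, Y)
print(cost)
```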

Summary

Softmax regression extends logistic regression to handle multiple classes by computing a separate score for each class and then converting these scores to probabilities using the softmax function. The probabilities sum to 1, and the cost function encourages the model to assign high probability to the correct class. This forms the foundation for multiclass classification in neural networks.