Pablo Rodriguez

Softmax

Multiclass Classification
  • Softmax regression is a generalization of logistic regression
  • Extends binary classification to multiclass classification contexts
  • For binary classification (y ∈ {0, 1}):

  • Calculate z = w·x + b

  • Compute a = g(z) using the sigmoid function

  • Interpret a as P(y=1|x)

  • P(y=0|x) = 1 - P(y=1|x) = 1 - a

  • Alternative view (to set up for softmax):

  • a₁ = P(y=1|x) = sigmoid(z)

  • a₂ = P(y=0|x) = 1 - a₁

  • a₁ + a₂ = 1 (probabilities must sum to 1)
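
The two-output view above can be checked numerically. This is a minimal sketch in plain Python; the helper name `sigmoid` and the values of `w`, `x`, and `b` are illustrative assumptions, not taken from the notes.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters and input, chosen only for illustration.
w = [0.5, -1.0]
x = [2.0, 1.0]
b = 0.25

z = sum(wi * xi for wi, xi in zip(w, x)) + b  # z = w·x + b
a1 = sigmoid(z)   # interpreted as P(y=1|x)
a2 = 1.0 - a1     # interpreted as P(y=0|x)

print(a1, a2, a1 + a2)  # the two probabilities sum to 1
```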

Example: 4 Classes

For y ∈ {1, 2, 3, 4}:

  1. Calculate each class score:
  • z₁ = w₁·x + b₁
  • z₂ = w₂·x + b₂
  • z₃ = w₃·x + b₃
  • z₄ = w₄·x + b₄
  2. Convert scores to probabilities:
  • a₁ = e^z₁ / (e^z₁ + e^z₂ + e^z₃ + e^z₄)
  • a₂ = e^z₂ / (e^z₁ + e^z₂ + e^z₃ + e^z₄)
  • a₃ = e^z₃ / (e^z₁ + e^z₂ + e^z₃ + e^z₄)
  • a₄ = e^z₄ / (e^z₁ + e^z₂ + e^z₃ + e^z₄)
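
To make the four formulas concrete, here is a small sketch that applies them to one set of scores. The score values in `z` are made up for illustration; in practice they would come from zⱼ = wⱼ·x + bⱼ.

```python
import math

# Hypothetical scores z1..z4 for a single input x.
z = [2.0, 1.0, 0.0, -1.0]

exp_z = [math.exp(zj) for zj in z]   # e^z1, e^z2, e^z3, e^z4
total = sum(exp_z)                   # shared denominator
a = [e / total for e in exp_z]       # a1..a4

for j, aj in enumerate(a, start=1):
    print(f"a{j} = {aj:.4f}")
print("sum =", sum(a))
```

The largest score gets the largest probability, and the four values sum to 1 up to floating-point rounding.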

For y ∈ {1, 2, …, n}:

  1. For each class j, calculate:
  • zⱼ = wⱼ·x + bⱼ
  2. Convert to probabilities:
  • aⱼ = e^zⱼ / (∑ₖ₌₁ⁿ e^zₖ)
  3. Parameters:
  • w₁, w₂, …, wₙ
  • b₁, b₂, …, bₙ
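
The general n-class computation might be sketched as follows. The function names `softmax` and `predict_proba` and the sample parameters are assumptions made for this example. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not part of the notes; it leaves the probabilities unchanged because the same factor cancels in numerator and denominator.

```python
import math

def softmax(z):
    """Convert a list of scores into probabilities that sum to 1."""
    m = max(z)  # stability shift (an addition to the notes' formula)
    exp_z = [math.exp(zj - m) for zj in z]
    total = sum(exp_z)
    return [e / total for e in exp_z]

def predict_proba(W, b, x):
    """W: list of n weight vectors, b: list of n biases, x: feature vector."""
    z = [sum(wi * xi for wi, xi in zip(wj, x)) + bj for wj, bj in zip(W, b)]
    return softmax(z)

# Hypothetical parameters for n = 3 classes and 2 features.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x = [1.0, 2.0]

a = predict_proba(W, b, x)  # scores here are z = [1, 2, -3]
print(a)
```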

Logistic Regression Cost (for comparison):

  • Loss = -y log(a₁) - (1-y)log(a₂)
  • If y=1: Loss = -log(a₁)
  • If y=0: Loss = -log(a₂)
  • Cost = average of losses over training set
  • Softmax regression generalizes this: if y=j, Loss = -log(aⱼ)
  • In general: Loss = -log(aᵧ), where aᵧ is the probability assigned to the true class
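
As a quick check that the two forms agree, the sketch below writes both losses as functions. The names `logistic_loss` and `softmax_loss` are hypothetical, and classes are assumed to be numbered from 1 as in the notes.

```python
import math

def logistic_loss(a1, y):
    """Binary loss with a1 = P(y=1|x), a2 = 1 - a1, and y in {0, 1}."""
    a2 = 1.0 - a1
    return -y * math.log(a1) - (1 - y) * math.log(a2)

def softmax_loss(a, y):
    """Loss for one example: a is the list of class probabilities,
    y is the true class, numbered from 1."""
    return -math.log(a[y - 1])

print(logistic_loss(0.8, 1), -math.log(0.8))   # y=1 keeps only -log(a1)
print(logistic_loss(0.8, 0), -math.log(0.2))   # y=0 keeps only -log(a2)
print(softmax_loss([0.1, 0.7, 0.2], 2))        # -log(0.7)
```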

Intuition

  • Negative log loss incentivizes high confidence in correct class
  • When aⱼ approaches 1, loss approaches 0
  • When aⱼ is small, loss becomes large
  • Pushes model to assign high probability to correct class
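
A few sample values illustrate how quickly the loss grows as the probability of the correct class shrinks. The probabilities listed are arbitrary illustration points.

```python
import math

# Probability the model assigns to the correct class.
probs = [0.99, 0.9, 0.5, 0.1, 0.01]
losses = {a: -math.log(a) for a in probs}

for a in probs:
    print(f"a = {a:<5} loss = {losses[a]:.3f}")
# Approximate losses: 0.010, 0.105, 0.693, 2.303, 4.605
```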

Important Note

  • For each training example, y takes only one value
  • Only compute -log(aⱼ) for the actual value j = y
  • Don’t compute loss terms for other classes
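
Combining this note with the earlier "cost = average of losses" rule, the cost over a training set could be sketched as below. The function name `softmax_cost` and the sample data are assumptions, with labels numbered from 1.

```python
import math

def softmax_cost(probs, labels):
    """Average loss over a training set.

    probs[i]  : list of class probabilities for example i
    labels[i] : true class of example i, numbered from 1
    """
    total = 0.0
    for a, y in zip(probs, labels):
        total += -math.log(a[y - 1])  # only the true-class term is used
    return total / len(labels)

# Hypothetical model outputs for two training examples, three classes.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
labels = [1, 2]

print(softmax_cost(probs, labels))  # (-log(0.7) - log(0.8)) / 2
```

The probabilities of the other classes never enter the sum directly; they matter only through the softmax denominator that produced the true-class probability.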

Softmax regression extends logistic regression to handle multiple classes by computing a separate score for each class and then converting these scores to probabilities using the softmax function. The probabilities sum to 1, and the cost function encourages the model to assign high probability to the correct class. This forms the foundation for multiclass classification in neural networks.