Pablo Rodriguez

Softmax Lab

Optional Lab
  • Softmax function is used in:
    • Softmax Regression
    • Neural Networks when solving Multiclass Classification problems
  • Converts linear outputs into a probability distribution
Mathematical Foundation
  • In both softmax regression and neural networks with softmax outputs:
    • N outputs are generated
    • One output is selected as the predicted category
    • Vector z is generated by a linear function then passed to softmax
  • Softmax function:
    • Converts z into a probability distribution
    • Each output will be between 0 and 1
    • All outputs sum to 1
    • Larger inputs correspond to larger output probabilities
  • Mathematical formula:
    • a_j = e^(z_j) / Σ_{k=1}^{N} e^(z_k)
  • Vector form interpretation:
    • Output a(x) is a vector of probabilities:
      • P(y=1|x;w,b)
      • …
      • P(y=N|x;w,b)
Softmax Implementation
import numpy as np

def my_softmax(z):
    ez = np.exp(z)           # element-wise exponential
    sm = ez / np.sum(ez)     # normalize so the outputs sum to 1
    return sm
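A quick usage sketch (the z values here are illustrative): the outputs of my_softmax lie between 0 and 1, sum to one, and agree with TensorFlow's built-in softmax.

import numpy as np
import tensorflow as tf

z = np.array([1., 2., 3., 4.])      # illustrative input vector
a = my_softmax(z)
print(a, np.sum(a))                 # probabilities that sum to 1
print(tf.nn.softmax(z).numpy())     # agrees with my_softmax(z)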
  • The exponential in the numerator magnifies small differences in values
  • Output values sum to one
  • Softmax spans all outputs: a change in one input (z0) changes all output values (a0–a3), as shown in the sketch after this list
  • This differs from ReLU or Sigmoid, which map a single input to a single output
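A small sketch of this property, using my_softmax from above with illustrative values: perturbing only z0 changes every output probability.

import numpy as np

z = np.array([1., 2., 3., 4.])
z_perturbed = z.copy()
z_perturbed[0] += 1.0               # change only z0

print(my_softmax(z))                # original probabilities
print(my_softmax(z_perturbed))      # every a_j changes, not just a_0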
Cross-Entropy Loss
  • Cross-entropy loss function:
    • L(a, y) = -log(a_y), where y is the target category
    • Only the probability of the correct class contributes to the loss
  • Complete cost function (over all examples):
    • J(w,b) = -(1/m) Σ_{i=1}^{m} Σ_{j=1}^{N} 1{y^(i) == j} log( e^(z_j^(i)) / Σ_{k=1}^{N} e^(z_k^(i)) )
    • m is the number of examples
    • N is the number of outputs
    • This is the average of all losses (a small numpy sketch follows after this list)
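A minimal numpy sketch of these definitions, using my_softmax from earlier (the logits and targets below are illustrative):

import numpy as np

# Illustrative logits for m = 2 examples and N = 4 classes
Z = np.array([[2.0, 1.0, 0.1, -1.0],
              [0.5, 2.2, 0.3,  0.0]])
y = np.array([0, 1])                        # integer target categories

A = np.array([my_softmax(z) for z in Z])    # per-example probability vectors
losses = -np.log(A[np.arange(len(y)), y])   # L = -log(a_y) for each example
J = np.mean(losses)                         # cost = average of the losses
print(losses, J)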

Two Approaches
Standard Implementation
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(25, activation='relu'),
    Dense(15, activation='relu'),
    Dense(4, activation='softmax')    # softmax activation here
])
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.001),
)
  • Softmax is an activation in the final Dense layer
  • Loss function (SparseCategoricalCrossentropy) specified separately
  • Model output is a vector of probabilities (see the usage sketch below)
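A hedged usage sketch: assuming a training set X_train with integer labels y_train in {0, 1, 2, 3} already exists, fitting this model and calling predict yields rows that are already probabilities.

import numpy as np

model.fit(X_train, y_train, epochs=10)     # X_train / y_train assumed to exist

p_standard = model.predict(X_train)
print(p_standard[:2])                      # each row is a probability vector
print(np.sum(p_standard[:2], axis=1))      # each row sums to ~1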
Preferred Implementation
preferred_model = Sequential([
    Dense(25, activation='relu'),
    Dense(15, activation='relu'),
    Dense(4, activation='linear')    # no activation here
])
preferred_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # note the from_logits=True
    optimizer=tf.keras.optimizers.Adam(0.001),
)
Important Distinction
  • In the preferred model, outputs are not probabilities:
    • Values can range from large negative to large positive numbers
    • These raw outputs are called “logits”
  • To get probabilities during prediction:
    • Pass the outputs through tf.nn.softmax()

Converting Logits to Probabilities
p_preferred = preferred_model.predict(X_train)       # raw logits
sm_preferred = tf.nn.softmax(p_preferred).numpy()    # converted to probabilities
  • To simply get the predicted category:
    • Use np.argmax() directly on the logits
    • Softmax is monotonic, so argmax gives the same result with or without applying softmax (checked in the sketch below)
Getting Predicted Categories
for i in range(5):
    print(f"{p_preferred[i]}, category: {np.argmax(p_preferred[i])}")
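A small check of the monotonicity point above, using p_preferred and sm_preferred from the earlier snippets:

import numpy as np

cat_from_logits = np.argmax(p_preferred, axis=1)    # argmax on raw logits
cat_from_probs = np.argmax(sm_preferred, axis=1)    # argmax on probabilities
print(np.all(cat_from_logits == cat_from_probs))    # True: identical categories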

SparseCategoricalCrossentropy vs CategoricalCrossentropy

SparseCategoricalCrossentropy

  • Expects target to be an integer corresponding to the index
  • Example: For 10 potential classes, y would be between 0 and 9

CategoricalCrossentropy

  • Expects the target to be one-hot encoded
  • Example: For 10 potential classes, a target of class 2 would be [0,0,1,0,0,0,0,0,0,0] (the sketch below contrasts the two losses)
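A minimal sketch of the difference, using illustrative logits for a 4-class example; both calls compute the same loss value, they just take the target in different forms.

import tensorflow as tf

logits = [[2.0, 1.0, 0.1, -1.0]]    # illustrative logits: one example, 4 classes

# SparseCategoricalCrossentropy: the target is the class index
sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
print(sparse_loss([2], logits).numpy())

# CategoricalCrossentropy: the target is one-hot encoded
onehot_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
print(onehot_loss([[0, 0, 1, 0]], logits).numpy())   # same value as above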

The softmax function transforms linear outputs into a probability distribution, enabling neural networks to perform multiclass classification. While the standard implementation puts softmax in the output layer, the preferred implementation uses linear activation with from_logits=True for numerical stability. Unlike other activation functions, softmax spans multiple outputs, making it uniquely suited for classification problems with multiple possible categories.
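A brief numerical sketch of why the from_logits route is more stable (the z values are illustrative): exponentiating large logits directly overflows, while the mathematically equivalent max-shifted form stays finite; TensorFlow's loss uses a stable formulation of this kind internally.

import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])    # illustrative, very large logits

naive = np.exp(z) / np.sum(np.exp(z))                            # overflows to nan
stable = np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))   # shift by max(z) first

print(naive)     # [nan nan nan]
print(stable)    # [0.09003057 0.24472847 0.66524096]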