Pablo Rodriguez

Softmax Lab

Optional Lab
  • Softmax function is used in:
    • Softmax Regression
    • Neural Networks when solving Multiclass Classification problems
  • Converts linear outputs into a probability distribution
Mathematical Foundation
  • In both softmax regression and neural networks with softmax outputs:
    • N outputs are generated
    • One output is selected as the predicted category
    • Vector z is generated by a linear function then passed to softmax
  • Softmax function:
    • Converts z into a probability distribution
    • Each output will be between 0 and 1
    • All outputs sum to 1
    • Larger inputs correspond to larger output probabilities
  • Mathematical formula:
    • a_j = e^(z_j) / Σ_{k=1}^{N} e^(z_k)
  • Vector form interpretation:
    • Output a(x) is a vector of probabilities:
      • P(y=1|x;w,b)
      • …
      • P(y=N|x;w,b)
Softmax Implementation
import numpy as np

def my_softmax(z):
    ez = np.exp(z)           # element-wise exponential
    sm = ez / np.sum(ez)     # normalize so the outputs sum to 1
    return sm
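A quick usage sketch (the z values here are illustrative): the outputs of my_softmax lie between 0 and 1, sum to one, and agree with TensorFlow's built-in softmax.

import numpy as np
import tensorflow as tf

z = np.array([1., 2., 3., 4.])      # illustrative input vector
a = my_softmax(z)
print(a, np.sum(a))                 # probabilities that sum to 1
print(tf.nn.softmax(z).numpy())     # agrees with my_softmax(z)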
  • The exponential in the numerator magnifies small differences in values
  • Output values sum to one
  • Softmax spans all outputs: a change in one input (z0) changes all output values (a0–a3), as shown in the sketch after this list
  • This differs from ReLU or Sigmoid, which map a single input to a single output
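A small sketch of this property, using my_softmax from above with illustrative values: perturbing only z0 changes every output probability.

import numpy as np

z = np.array([1., 2., 3., 4.])
z_perturbed = z.copy()
z_perturbed[0] += 1.0               # change only z0

print(my_softmax(z))                # original probabilities
print(my_softmax(z_perturbed))      # every a_j changes, not just a_0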
Cross-Entropy Loss
  • Cross-entropy loss function:
    • L(a, y) = -log(a_y), where y is the target category
    • Only the probability of the correct class contributes to the loss
  • Complete cost function (over all examples):
    • J(w,b) = -(1/m) Σ_{i=1}^{m} Σ_{j=1}^{N} 1{y^(i) == j} log( e^(z_j^(i)) / Σ_{k=1}^{N} e^(z_k^(i)) )
    • m is the number of examples
    • N is the number of outputs
    • This is the average of all losses (a small numpy sketch follows after this list)
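A minimal numpy sketch of these definitions, using my_softmax from earlier (the logits and targets below are illustrative):

import numpy as np

# Illustrative logits for m = 2 examples and N = 4 classes
Z = np.array([[2.0, 1.0, 0.1, -1.0],
              [0.5, 2.2, 0.3,  0.0]])
y = np.array([0, 1])                        # integer target categories

A = np.array([my_softmax(z) for z in Z])    # per-example probability vectors
losses = -np.log(A[np.arange(len(y)), y])   # L = -log(a_y) for each example
J = np.mean(losses)                         # cost = average of the losses
print(losses, J)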

Two Approaches
Standard Implementation
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(25, activation='relu'),
    Dense(15, activation='relu'),
    Dense(4, activation='softmax')    # softmax activation here
])
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.001),
)
  • Softmax is an activation in the final Dense layer
  • Loss function (SparseCategoricalCrossentropy) specified separately
  • Model output is a vector of probabilities (see the usage sketch below)
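A hedged usage sketch: assuming a training set X_train with integer labels y_train in {0, 1, 2, 3} already exists, fitting this model and calling predict yields rows that are already probabilities.

import numpy as np

model.fit(X_train, y_train, epochs=10)     # X_train / y_train assumed to exist

p_standard = model.predict(X_train)
print(p_standard[:2])                      # each row is a probability vector
print(np.sum(p_standard[:2], axis=1))      # each row sums to ~1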
Preferred Implementation
preferred_model = Sequential([
    Dense(25, activation='relu'),
    Dense(15, activation='relu'),
    Dense(4, activation='linear')    # no activation here
])
preferred_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # note the from_logits=True
    optimizer=tf.keras.optimizers.Adam(0.001),
)
Important Distinction
  • In the preferred model, outputs are not probabilities:
    • Values can range from large negative to large positive numbers
    • These raw outputs are called “logits”
  • To get probabilities during prediction:
    • Pass the outputs through tf.nn.softmax()

Converting Logits to Probabilities
p_preferred = preferred_model.predict(X_train)       # raw logits
sm_preferred = tf.nn.softmax(p_preferred).numpy()    # converted to probabilities
  • To simply get the predicted category:
    • Use np.argmax() directly on the logits
    • Softmax is monotonic, so argmax gives the same result with or without applying softmax (checked in the sketch below)
Getting Predicted Categories
for i in range(5):
    print(f"{p_preferred[i]}, category: {np.argmax(p_preferred[i])}")
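A small check of the monotonicity point above, using p_preferred and sm_preferred from the earlier snippets:

import numpy as np

cat_from_logits = np.argmax(p_preferred, axis=1)    # argmax on raw logits
cat_from_probs = np.argmax(sm_preferred, axis=1)    # argmax on probabilities
print(np.all(cat_from_logits == cat_from_probs))    # True: identical categories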

SparseCategoricalCrossentropy vs CategoricalCrossentropy

SparseCategoricalCrossentropy

  • Expects target to be an integer corresponding to the index
  • Example: For 10 potential classes, y would be between 0 and 9

CategoricalCrossentropy

  • Expects the target to be one-hot encoded
  • Example: For 10 potential classes, a target of class 2 would be [0,0,1,0,0,0,0,0,0,0] (the sketch below contrasts the two losses)
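A minimal sketch of the difference, using illustrative logits for a 4-class example; both calls compute the same loss value, they just take the target in different forms.

import tensorflow as tf

logits = [[2.0, 1.0, 0.1, -1.0]]    # illustrative logits: one example, 4 classes

# SparseCategoricalCrossentropy: the target is the class index
sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
print(sparse_loss([2], logits).numpy())

# CategoricalCrossentropy: the target is one-hot encoded
onehot_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
print(onehot_loss([[0, 0, 1, 0]], logits).numpy())   # same value as above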

The softmax function transforms linear outputs into a probability distribution, enabling neural networks to perform multiclass classification. While the standard implementation puts softmax in the output layer, the preferred implementation uses linear activation with from_logits=True for numerical stability. Unlike other activation functions, softmax spans multiple outputs, making it uniquely suited for classification problems with multiple possible categories.
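A brief numerical sketch of why the from_logits route is more stable (the z values are illustrative): exponentiating large logits directly overflows, while the mathematically equivalent max-shifted form stays finite; TensorFlow's loss uses a stable formulation of this kind internally.

import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])    # illustrative, very large logits

naive = np.exp(z) / np.sum(np.exp(z))                            # overflows to nan
stable = np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))   # shift by max(z) first

print(naive)     # [nan nan nan]
print(stable)    # [0.09003057 0.24472847 0.66524096]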