Choosing Activation Functions for Neural Networks

Choosing Output Layer Activation Functions

  • Different neurons in a neural network can use different activation functions
  • For the output layer, there’s often one natural choice depending on what the target label (y) is

Binary Classification

  • When y is either 0 or 1
  • Use sigmoid activation
  • Neural network learns to predict probability that y equals 1
  • Similar to logistic regression

Regression (Positive or Negative Values)

  • When y can be positive or negative
  • Example: predicting stock price changes
  • Use linear activation function
  • Allows output to take on positive or negative values

Non-negative Regression

  • When y can only take non-negative values
  • Example: predicting house prices
  • Use ReLU activation function
  • Only outputs zero or positive values
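
To make this concrete, here is a minimal Keras-style sketch (not code from the original notes; the single-unit output layers are placeholders) showing how the target type maps to the output-layer activation string:

```python
from tensorflow.keras.layers import Dense

# The target label y determines the natural output activation:
out_binary      = Dense(1, activation='sigmoid')  # y is 0 or 1 -> predicted probability that y = 1
out_regression  = Dense(1, activation='linear')   # y may be negative or positive (e.g., stock price change)
out_nonnegative = Dense(1, activation='relu')     # y >= 0 only (e.g., house price)
```
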
Industry Standard

  • ReLU is by far the most common choice for hidden layers
  • Evolution from sigmoid to ReLU:
    • Early neural networks used sigmoid functions
    • Modern practice heavily favors ReLU
    • Sigmoid now rarely used (except for binary classification output)
Why ReLU is preferred over sigmoid for hidden layers:

  1. Computational Efficiency:
    • ReLU only requires computing max(0, z)
    • Sigmoid requires computing an exponential and a division: 1 / (1 + e^(-z))
  2. Better Gradient Flow (the more important reason):
    • ReLU is flat in only one region of its graph (the left side, where z < 0)
    • Sigmoid is flat in two regions (both extremes)
    • Flat regions produce near-zero gradients, which makes gradient descent slow
    • ReLU therefore lets neural networks learn faster
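
The gradient-flow point is easy to check numerically. The sketch below (plain NumPy, added here for illustration) evaluates the derivative of each activation and shows that sigmoid's gradient collapses toward zero at both extremes, while ReLU's gradient is zero only for negative inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)            # near zero when z is very negative OR very positive

def relu_grad(z):
    return (z > 0).astype(float)    # zero only on the left side (z <= 0)

z = np.array([-10.0, -1.0, 1.0, 10.0])
print(sigmoid_grad(z))  # [~0.00005, 0.197, 0.197, ~0.00005] -> flat at both extremes
print(relu_grad(z))     # [0.0, 0.0, 1.0, 1.0]               -> flat only for negative z
```
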
  • For hidden layers:
    • activation='relu' (recommended default)
  • For the output layer:
    • Binary classification: activation='sigmoid'
    • Regression (positive or negative values): activation='linear'
    • Non-negative outputs: activation='relu'
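
Putting both rules together, a binary classifier built with these defaults might look like the following sketch (hypothetical layer sizes; not a model from the original notes):

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(25, activation='relu'),     # hidden layer: ReLU as the default choice
    Dense(15, activation='relu'),     # hidden layer: ReLU as the default choice
    Dense(1,  activation='sigmoid'),  # output layer: sigmoid because y is 0 or 1
])

model.compile(loss=tf.keras.losses.BinaryCrossentropy(), optimizer='adam')
```

For a regression target that can be negative, the last layer would instead use activation='linear'; for a non-negative target, activation='relu'.
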

Advanced

  • Research literature mentions other activation functions:
    • tanh (hyperbolic tangent)
    • LeakyReLU
    • Swish
  • New activation functions emerge periodically
  • They sometimes perform slightly better in specific cases
  • Example: “I’ve used the LeakyReLU activation function a few times in my work, and sometimes it works a little bit better than the ReLU”
  • For most applications, sigmoid/ReLU/linear are sufficient
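
If you do want to experiment with one of these alternatives, recent Keras/TensorFlow versions expose most of them directly; a minimal sketch (hypothetical layer sizes):

```python
from tensorflow.keras.layers import Dense, LeakyReLU

# tanh is available as a built-in activation string, just like 'relu'
hidden_tanh = Dense(25, activation='tanh')

# LeakyReLU ships as its own layer; apply it after a Dense layer with no activation
hidden_leaky = [Dense(25), LeakyReLU()]
```

Recent TensorFlow versions also accept activation='swish' as a string; as noted above, these alternatives are optional, and the sigmoid/ReLU/linear trio covers most applications.
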

Choosing the right activation function is essential for neural network performance. For output layers, select based on your prediction target type (binary, unbounded, or non-negative). For hidden layers, ReLU is the standard choice due to its computational efficiency and better gradient properties.