from typing import override
import dnnlpy
import matplotlib.pyplot as plt
import numpy as np
from dnnlpy.models.mlp import Module
rng = np.random.default_rng(42)
dnnlpy.set_matplotlib_format('svg')
print('NumPy version:', np.__version__)Chapter 3.2 Activation Functions: Adding Nonlinearity to Neural Networks
In the previous section, we saw that if we only stack multiple linear layers, the whole model is still equivalent to one linear layer:
\[ Z = XW' + b' \]
In other words, the number of layers increases, but the model can still only represent a linear transformation. To let a neural network represent more complex functions, we need to insert nonlinear operations between linear layers. This nonlinear operation is the activation function.
For a two-layer MLP, we can write:
\[ \begin{aligned} H &= XW_1 + b_1 \\ A &= \phi(H) \\ Z &= AW_2 + b_2 \end{aligned} \]
Here, \(\phi\) is the activation function. It usually acts elementwise on \(H\), which means each element in \(H\) passes through the same function.
This section introduces several classic activation functions and derives their gradients during backpropagation. Later, we will implement each activation function as a separate module with its own forward and backward, so that it can be reused directly in the MLP implementation.
3.2.1 Background: Jacobian Matrix and VJP
Before introducing activation functions, we first need one concept that is often overlooked in backpropagation: the Jacobian matrix.
Start with an ordinary scalar function. Suppose:
\[ y = f(x) \]
where both \(x\) and \(y\) are scalars. The derivative is:
\[ \frac{dy}{dx} \]
It describes how the output \(y\) changes when the input \(x\) changes by a small amount.
Variables in neural networks, however, are usually not scalars. They are vectors or matrices. For example, one layer can be viewed as a function from a vector to a vector:
\[ y = f(x), \quad x \in \mathbb{R}^n, \quad y \in \mathbb{R}^m \]
That is, the input has \(n\) components:
\[ x = [x_1, x_2, \dots, x_n] \]
and the output has \(m\) components:
\[ y = [y_1, y_2, \dots, y_m] \]
In this case, a single derivative is not enough to describe how \(f\) changes, because each output \(y_i\) may depend on each input \(x_j\). We therefore collect all partial derivatives into one matrix:
\[ J_f(x) = \frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n} \]
This matrix is the Jacobian matrix. The entry in row \(i\) and column \(j\) represents the local influence of input component \(x_j\) on output component \(y_i\).
Backpropagation needs the Jacobian matrix because it repeatedly applies the chain rule through a composite function.
Suppose the loss function \(L\) does not depend on \(x\) directly, but depends on \(x\) through an intermediate variable \(y\):
\[ x \rightarrow y = f(x) \rightarrow L \]
During backpropagation, we usually already know the upstream gradient:
\[ \frac{\partial L}{\partial y} \]
Now we need to continue propagating backward and compute:
\[ \frac{\partial L}{\partial x} \]
By the chain rule, \(\frac{\partial L}{\partial x}\) is built from two parts:
- the gradient of \(L\) with respect to \(y\), which is the upstream gradient;
- the derivative of \(y\) with respect to \(x\), which is the Jacobian matrix of the current module.
If we treat gradients as row vectors, we can write:
\[ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial x} = \frac{\partial L}{\partial y} J_f(x) \]
This form is worth making explicit:
The
backwardof each module is essentially the upstream gradient multiplied by the Jacobian matrix of the current module.
In actual deep learning code, however, we almost never construct the full Jacobian matrix explicitly. The reason is direct: it is too large. If a function maps a 1000-dimensional vector to a 1000-dimensional vector, its Jacobian is \(1000 \times 1000\), already one million entries. For images, sequences, or hidden states in large models, the full Jacobian becomes even larger.
Therefore, what deep learning frameworks usually compute is not the complete \(J_f(x)\), but the result of multiplying the upstream gradient by the Jacobian:
\[ \frac{\partial L}{\partial y} J_f(x) \]
This is called a VJP (vector-Jacobian product). Backpropagation does not care about materializing the whole Jacobian; it cares about efficiently propagating the upstream gradient to the input.
Activation functions provide a clean example. They usually act elementwise:
\[ y_i = \phi(x_i) \]
The word elementwise is the key condition: the \(i\)-th output depends only on the \(i\)-th input, not on the other input components. Therefore:
\[ \frac{\partial y_i}{\partial x_j} = 0, \quad i \ne j \]
So the Jacobian matrix of an elementwise activation function is diagonal:
\[ \frac{\partial y}{\partial x} = \begin{bmatrix} \phi'(x_1) & 0 & \cdots & 0 \\ 0 & \phi'(x_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \phi'(x_n) \end{bmatrix} \]
Multiplying the upstream gradient by this diagonal matrix is equivalent to elementwise multiplication:
\[ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \odot \phi'(x) \]
This is why the backward of an activation function is usually written as:
grad_input = grad_output * local_derivativeMathematically, this is still computing “upstream gradient \(\times\) Jacobian matrix”; in code, because the Jacobian of an elementwise activation is diagonal, we can implement it directly with elementwise multiplication.
3.2.2 What Does an Activation Function Do?
The previous subsection established one fact: the backward of an activation function is not an arbitrary elementwise multiplication, but a simplification of the VJP under a diagonal Jacobian matrix. Now we return to the forward pass and discuss why activation functions are placed after linear layers.
Suppose a linear layer outputs a scalar \(h\). Without an activation function, the next layer receives this linear output directly:
\[ a = h \]
With an activation function, it becomes:
\[ a = \phi(h) \]
In the forward pass, the activation function changes the hidden representation. For example, ReLU turns negative values into 0 and keeps positive values:
\[ \operatorname{ReLU}(h) = \max(0, h) \]
This means the hidden layer is no longer just a linear transformation of the input, but a nonlinear filtering of it. Without this nonlinearity, multiple stacked linear layers are still equivalent to a single linear layer.
In the backward pass, the activation function controls how gradients continue to propagate to the previous layer. In the scalar case, if:
\[ a = \phi(h) \]
and the upstream gradient is \(\frac{\partial L}{\partial a}\), then by the chain rule:
\[ \frac{\partial L}{\partial h} = \frac{\partial L}{\partial a} \frac{\partial a}{\partial h} = \frac{\partial L}{\partial a}\phi'(h) \]
For the matrix form that is more common in MLPs, let:
\[ H \in \mathbb{R}^{B \times D}, \quad A = \phi(H) \]
Here, \(B\) is the batch size and \(D\) is the hidden dimension. The activation function acts elementwise, so:
\[ A_{b,d} = \phi(H_{b,d}) \]
If we flatten \(H\) and \(A\) into vectors, the output at position \((b,d)\) depends only on the input at position \((b,d)\), not on any other position:
\[ \frac{\partial A_{b,d}}{\partial H_{b',d'}} = 0, \quad (b,d) \ne (b',d') \]
Therefore, the Jacobian of \(A\) with respect to \(H\) is diagonal, and the diagonal entries are the local derivatives at the corresponding positions:
\[ \frac{\partial A_{b,d}}{\partial H_{b,d}} = \phi'(H_{b,d}) \]
So the VJP in backpropagation simplifies to elementwise multiplication:
\[ \frac{\partial L}{\partial H} = \frac{\partial L}{\partial A} \odot \phi'(H) \]
Here, \(\odot\) denotes elementwise multiplication.
This is the shared structure of the next three activation functions: derive the scalar derivative, then use “elementwise operation \(\Rightarrow\) diagonal Jacobian” to write backward as the elementwise product of the upstream gradient and the local derivative.
3.2.3 Sigmoid Activation Function
Sigmoid was a common activation function in early neural networks:
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
It compresses any real number into the interval \((0, 1)\). Larger inputs produce outputs closer to 1; smaller inputs produce outputs closer to 0.
Implement sigmoid with NumPy:
def sigmoid(x: np.ndarray) -> np.ndarray:
return 1 / (1 + np.exp(-x))
x = np.linspace(-10, 10, 100)
y = sigmoid(x)
fig = plt.figure(1, figsize=(5, 3.5))
ax = fig.add_subplot(1, 1, 1)
ax.plot(x, y)
ax.grid(linestyle='--')
ax.set_xlabel('x')
ax.set_ylabel(r'$\sigma(x)$')
ax.set_title('Sigmoid Activation Function')
plt.show()The scalar derivative of sigmoid can be expressed using its output value. Let:
\[ y = \sigma(x) \]
Then:
\[ \frac{\partial y}{\partial x} = y(1 - y) \]
If the input is a matrix \(X\), the output is:
\[ Y = \sigma(X) \]
Because sigmoid acts elementwise:
\[ Y_{b,d} = \sigma(X_{b,d}) \]
Therefore, when \((b,d) \ne (b',d')\):
\[ \frac{\partial Y_{b,d}}{\partial X_{b',d'}} = 0 \]
That is, the Jacobian matrix of sigmoid has nonzero entries only on the diagonal. The diagonal entry is:
\[ \sigma'(X_{b,d}) = Y_{b,d}(1 - Y_{b,d}) \]
So if the upstream gradient is \(\frac{\partial L}{\partial Y}\), backpropagation gives:
\[ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \odot Y \odot (1 - Y) \]
Written as a NumPy module:
class Sigmoid(Module):
def __init__(self):
super().__init__()
@override
def forward(self, x: np.ndarray) -> np.ndarray:
self.ctx = sigmoid(x)
return self.ctx
@override
def backward(self, grad: np.ndarray) -> np.ndarray:
assert self.ctx is not None, 'Must call forward before backward.'
return grad * self.ctx * (1 - self.ctx)Here, grad is the upstream gradient. In forward, we save self.ctx, which is the sigmoid output \(Y\). This is because backward needs \(Y(1-Y)\) to compute the local derivative.
The problem with sigmoid appears in the saturation regions. When the input is very large or very small, the output becomes very close to 1 or 0, and the derivative approaches 0:
\[ \sigma'(x) = \sigma(x)(1 - \sigma(x)) \approx 0 \]
From the Jacobian perspective, this means the diagonal entries become very small. After the upstream gradient is multiplied by such a diagonal Jacobian, it is also reduced. This is the common vanishing gradient problem.
3.2.4 Tanh Activation Function
Tanh is also an S-shaped activation function:
\[ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]
It compresses the input into \((-1, 1)\).
def tanh(x: np.ndarray) -> np.ndarray:
return np.tanh(x)
x = np.linspace(-10, 10, 100)
y = tanh(x)
fig = plt.figure(2, figsize=(5, 3.5))
ax = fig.add_subplot(1, 1, 1)
ax.plot(x, y)
ax.grid(linestyle='--')
ax.set_xlabel('x')
ax.set_ylabel(r'$\tanh(x)$')
ax.set_title('Tanh Activation Function')
plt.show()Compared with sigmoid, tanh is centered at 0. It can output both positive and negative values. In many settings, this makes tanh more suitable than sigmoid as a hidden-layer activation function.
The scalar derivative of tanh can also be expressed using its output value. Let:
\[ y = \tanh(x) \]
Then:
\[ \frac{\partial y}{\partial x} = 1 - y^2 \]
For a matrix input \(X\), the output is:
\[ Y = \tanh(X) \]
Again:
\[ Y_{b,d} = \tanh(X_{b,d}) \]
So when \((b,d) \ne (b',d')\):
\[ \frac{\partial Y_{b,d}}{\partial X_{b',d'}} = 0 \]
This means the Jacobian matrix of tanh is also diagonal. Its diagonal entries are:
\[ 1 - Y_{b,d}^2 \]
Therefore, backpropagation gives:
\[ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \odot (1 - Y^2) \]
The corresponding NumPy implementation is:
class Tanh(Module):
def __init__(self):
super().__init__()
@override
def forward(self, x: np.ndarray) -> np.ndarray:
self.ctx = tanh(x)
return self.ctx
@override
def backward(self, grad: np.ndarray) -> np.ndarray:
assert self.ctx is not None, 'Must call forward before backward.'
return grad * (1 - np.square(self.ctx))However, tanh still has a saturation problem. When the absolute value of the input is large, the output approaches 1 or -1, and the local derivative \(1-y^2\) approaches 0. In other words, the diagonal entries of its Jacobian also become very small, so the gradient can still be reduced when it passes through this layer.
Therefore, although sigmoid and tanh are historically important, modern hidden layers more often use ReLU and its variants.
3.2.5 ReLU Activation Function
ReLU (Rectified Linear Unit) (Agarap 2018) is one of the commonly used activation functions. Its definition is simple:
\[ \operatorname{ReLU}(x) = \max(0, x) \]
That is:
\[ \operatorname{ReLU}(x) = \begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases} \]
The NumPy implementation is:
def relu(x: np.ndarray) -> np.ndarray:
return np.maximum(0, x)
x = np.linspace(-10, 10, 100)
y = relu(x)
fig = plt.figure(3, figsize=(5, 3.5))
ax = fig.add_subplot(1, 1, 1)
ax.plot(x, y)
ax.grid(linestyle='--')
ax.set_xlabel('x')
ax.set_ylabel('ReLU(x)')
ax.set_title('ReLU Activation Function')
plt.show()The scalar derivative of ReLU is also simple. Let:
\[ y = \operatorname{ReLU}(x) \]
Then, when \(x \ne 0\):
\[ \frac{\partial y}{\partial x} = \begin{cases} 1, & x > 0 \\ 0, & x < 0 \end{cases} \]
Strictly speaking, ReLU is not differentiable at \(x=0\). In actual implementations, the gradient at \(x=0\) is usually set to 0. This is an implementation convention and usually does not affect neural network training.
For a matrix input \(X\), the output is:
\[ Y = \operatorname{ReLU}(X) \]
ReLU still acts elementwise:
\[ Y_{b,d} = \operatorname{ReLU}(X_{b,d}) \]
So when \((b,d) \ne (b',d')\):
\[ \frac{\partial Y_{b,d}}{\partial X_{b',d'}} = 0 \]
This is the same as sigmoid and tanh. Although ReLU is not invertible, that does not prevent its Jacobian from being diagonal. The reason the off-diagonal entries are 0 is elementwise dependence, not invertibility.
The diagonal entries of the ReLU Jacobian are:
\[ \mathbb{1}(X_{b,d} > 0) \]
Here, \(\mathbb{1}(X_{b,d} > 0)\) is the indicator function. It is 1 when \(X_{b,d}>0\), and 0 otherwise. Therefore, backpropagation gives:
\[ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \odot \mathbb{1}(X > 0) \]
The corresponding module implementation is:
class ReLU(Module):
def __init__(self):
super().__init__()
@override
def forward(self, x: np.ndarray) -> np.ndarray:
self.ctx = x
return relu(x)
@override
def backward(self, grad: np.ndarray) -> np.ndarray:
assert self.ctx is not None, 'Must call forward before backward.'
return grad * (self.ctx > 0)Notice that we save the input x, not the output. ReLU’s backward needs to know which positions were greater than 0 during the forward pass.
We can also save a mask:
class ReLUWithMask(Module):
def __init__(self):
super().__init__()
@override
def forward(self, x: np.ndarray) -> np.ndarray:
self.ctx = x > 0
return x * self.ctx
@override
def backward(self, grad: np.ndarray) -> np.ndarray:
assert self.ctx is not None, 'Must call forward before backward.'
return grad * self.ctxThese two versions are essentially the same. Neither constructs the full Jacobian matrix. They directly compute the VJP: the upstream gradient multiplied by the diagonal Jacobian of ReLU.
ReLU is simpler than sigmoid and tanh, but it works well in deep networks. One reason is that when \(x>0\), the local derivative of ReLU is 1:
\[ \operatorname{ReLU}'(x) = 1, \quad x > 0 \]
From the Jacobian perspective, this means that at positive activations, the diagonal entry is 1, so the gradient can pass through this layer more directly. It does not shrink quickly because of saturation as in sigmoid or tanh.
On the other hand, when \(x \le 0\), the ReLU output is 0 and the local derivative is also 0. This makes the hidden representation sparse: not every neuron is activated for every sample.
ReLU still has its own problem. If a neuron stays in the negative half-axis for a long time, the gradient at the corresponding position remains 0, and its parameters may become difficult to update. This is usually called the dead ReLU problem.
Even so, ReLU remains one of the most basic and commonly used activation functions in MLPs, CNNs, and many deep models. In the MLP implementation in this chapter, we will also use ReLU by default.
3.2.6 Activation Functions Usually Do Not Change Tensor Shape
In neural networks, activation functions usually keep the input and output shapes the same. That is, if a linear layer outputs:
\[ H \in \mathbb{R}^{B \times d} \]
then after the activation function, we usually still get:
\[ A \in \mathbb{R}^{B \times d} \]
However, two concepts need to be separated here:
- whether the shape is preserved;
- whether the operation is elementwise.
For activation functions such as sigmoid, tanh, and ReLU, both are true: they preserve shape and act elementwise. That is:
\[ A_{bj} = \phi(H_{bj}) \]
The output at position \((b,j)\) depends only on the input at position \((b,j)\), not on other positions. Therefore, if we flatten \(H\) and \(A\) into vectors, their Jacobian matrix is diagonal:
\[ \frac{\partial A_i}{\partial H_j}=0,\quad i\ne j \]
So their backpropagation can be written as a simple elementwise multiplication:
\[ \frac{\partial L}{\partial H} = \frac{\partial L}{\partial A} \odot \phi'(H) \]
But preserving shape does not necessarily mean acting elementwise. For example, softmax usually also preserves shape:
\[ A = \mathrm{softmax}(H) \]
If softmax is applied along the last dimension, then:
\[ A_{bj} = \frac{\exp(H_{bj})} {\sum_k \exp(H_{bk})} \]
Now \(A_{bj}\) depends not only on \(H_{bj}\), but also on other elements \(H_{bk}\) in the same row. Therefore, although softmax does not change tensor shape, it is not an elementwise function, and its Jacobian matrix is not diagonal.
The more accurate statement is:
The reason elementwise activation functions such as sigmoid, tanh, and ReLU can write their backward pass as elementwise multiplication is that their Jacobian matrices are diagonal. Whether the shape is preserved is only a surface property, not the real criterion for whether
backwardcan be computed by elementwise multiplication.
3.2.7 Summary
This section first explained the relation between the Jacobian matrix and backpropagation, then introduced the role of activation functions in an MLP.
For a function from a vector to a vector, the derivative is no longer a scalar. It is a Jacobian matrix. Each module in backpropagation is essentially computing:
\[ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}J_f(x) \]
which is the VJP between the upstream gradient and the Jacobian matrix of the current module.
The special property of an activation function is that it usually acts elementwise:
\[ A_{b,d} = \phi(H_{b,d}) \]
Therefore, each output position depends only on the corresponding input position and not on other positions:
\[ \frac{\partial A_{b,d}}{\partial H_{b',d'}} = 0, \quad (b,d) \ne (b',d') \]
So the Jacobian matrix of an activation function is diagonal, and backward can be simplified to elementwise multiplication:
\[ \frac{\partial L}{\partial H} = \frac{\partial L}{\partial A}\odot \phi'(H) \]
We introduced sigmoid, tanh, and ReLU. They are all elementwise activation functions, so the off-diagonal entries of their Jacobian matrices are 0; the difference lies in their diagonal local derivatives. Sigmoid and tanh compress inputs into bounded intervals, but when the input magnitude is large, they saturate and make the diagonal entries of the Jacobian close to 0, causing gradients to shrink. ReLU is simpler: when the input is positive, the local derivative is 1, which makes it more commonly used in deep networks. But when the input stays negative for a long time, the gradient may also stay 0, producing the dead ReLU problem.
In the next section, we will discuss the output layer of a classification model. The final layer of an MLP outputs logits, but logits are not probabilities. To train a classification model, we need to combine logits with the true labels, obtain the softmax cross entropy loss, and derive its backward pass.