
Softmax to MNIST: Building a Tiny Autograd and Classifier

The goal here is to compile all the findings at both a low level (the math, so we can go deep down the rabbit hole) and a high level (the intuition, so we can generalize more easily about the computation).

This post walks through the math and implementation of axis-aware, numerically stable softmax and log-softmax, their vector–Jacobian products (backward pass), and the Negative Log-Likelihood loss. We finish with a simple MLP trained on MNIST using our own autograd engine.

Notation

  • $z$: logits (pre-softmax scores) from the final linear layer. In an MLP, $z = W\,a_{\text{prev}} + b$.
  • $a$: softmax probabilities, $a = \mathrm{softmax}(z)$, with components $a_i = e^{z_i}/\sum_j e^{z_j}$.
  • $y$: target vector (assumed one-hot here), $y_i \in \{0,1\}$ and $\sum_i y_i = 1$.
  • $k$: class axis (the dimension along which probabilities sum to 1 and reductions occur).
  • $m$: per-axis maximum for stability, $m = \max_k(z)$.
  • $\ell$: log-probabilities (log-softmax outputs), $\ell = \log\operatorname{softmax}(z)$.
  • $g$: upstream gradient from the loss (same shape as the quantity it differentiates, e.g., $g = \partial L/\partial a$ or $g = \partial L/\partial \ell$).
  • $W, b, a_{\text{prev}}$: weights, bias, and previous-layer activations feeding the final linear layer.

Softmax: definition, stability, and gradient

For logits $x$ and class axis $k$, softmax is

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}\,.$$

Intuition note

Softmax turns arbitrary scores into a probability distribution: all entries become non-negative and sum to 1 along the class axis. Imagine classes "competing" for a fixed 100% of belief: raising one score necessarily lowers others.

To avoid overflow we use a per-axis shift (the log-sum-exp trick): let $m = \max_k(x)$. Then

$$\mathrm{softmax}(x) = \frac{e^{x-m}}{\sum_k e^{x-m}}\,.$$
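For reference, here is a minimal NumPy sketch of this stable, axis-aware forward pass (the function name and the `axis`/`keepdims` conventions are illustrative, not the exact API of the autograd engine):

```python
import numpy as np

def softmax(x, axis=-1):
    # Shift by the per-axis max so exp() cannot overflow; the shift cancels
    # in the ratio, so the probabilities are unchanged.
    m = np.max(x, axis=axis, keepdims=True)
    e = np.exp(x - m)
    return e / np.sum(e, axis=axis, keepdims=True)
```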

Vector–Jacobian product. With upstream gradient $g = \partial L/\partial a$ and $a = \mathrm{softmax}(x)$:

$$\frac{\partial L}{\partial x} = a \odot \big(g - \langle g, a \rangle_k\big), \qquad \text{where } \langle g, a \rangle_k = \sum_k g \odot a\,.$$

Here $\odot$ denotes elementwise multiplication. This captures the Jacobian $J = \mathrm{diag}(a) - a a^\top$ without materializing it.
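In code, the VJP is just a couple of axis-aware reductions. A sketch in NumPy (hypothetical names, matching the formula above):

```python
import numpy as np

def softmax_vjp(a, g, axis=-1):
    # a: softmax output, g: upstream gradient dL/da (same shape as a).
    # Computes a * (g - <g, a>_k) without building diag(a) - a a^T.
    inner = np.sum(g * a, axis=axis, keepdims=True)
    return a * (g - inner)
```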

Intuition note

Subtracting the per-axis max $m$ doesn't change the probabilities (the $e^{m}$ factor cancels). It only prevents $e^{x}$ from blowing up.

Let $z$ denote the logits (pre-softmax scores) from the last linear layer. In an MLP, $z = W\,a_{\text{prev}} + b$ and the predicted class probabilities are $a = \mathrm{softmax}(z)$. We will later instantiate this with our MNIST MLP, but the notation applies to any model producing pre-softmax scores.

Softmax partial derivatives (component-wise sensitivities):

$$\frac{\partial a_i}{\partial z_i} = a_i(1-a_i), \qquad \frac{\partial a_j}{\partial z_i} = -a_j a_i \quad (j \neq i).$$

Intuition note

Increasing one logit $z_i$ boosts its own probability $a_i$, but because all probabilities must still sum to 1, the other $a_j$ must decrease; hence the negative cross terms.

Derivation

Using the softmax partials above and cross-entropy $C = -\sum_i y_i \ln a_i$ (one-hot $y$), the output-layer error simplifies neatly. For one example with $a = \mathrm{softmax}(z)$:

$$\delta_i \equiv \frac{\partial C}{\partial z_i} = -\sum_j \frac{y_j}{a_j}\,\frac{\partial a_j}{\partial z_i} = -\frac{y_i}{a_i}\,a_i(1-a_i) - \sum_{j\ne i}\frac{y_j}{a_j}\,(-a_j a_i) = a_i - y_i\,.$$

Intuition note

If the model assigns too much probability to a wrong class, the gradient there is positive (push it down). If it assigns too little probability to the true class, the gradient there is negative (push it up). The signal is literally "what you predicted minus what it should be."

Small numeric example: for logits $z = [2, 0, -1]$,

$$\mathrm{softmax}(z) \approx [0.84,\; 0.11,\; 0.04].$$

If the one-hot target is $y = [0,1,0]$, then $\delta = a - y \approx [0.84,\; -0.89,\; 0.04]$.
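A quick standalone check of these numbers in plain NumPy (independent of the autograd engine):

```python
import numpy as np

z = np.array([2.0, 0.0, -1.0])
e = np.exp(z - z.max())
a = e / e.sum()
print(np.round(a, 2))      # [0.84 0.11 0.04]

y = np.array([0.0, 1.0, 0.0])
print(np.round(a - y, 2))  # [ 0.84 -0.89  0.04]
```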

Log-softmax: stable forward, simple backward

Log-softmax normalizes in log space:

$$\log\operatorname{softmax}(x)_i = x_i - \log\sum_j e^{x_j}.$$

Stable form with $m = \max_k(x)$:

$$\log\operatorname{softmax}(x) = (x - m) - \log\Big(\sum_k e^{x-m}\Big).$$

Let $\ell = \log\operatorname{softmax}(x)$ and $a = e^{\ell} = \mathrm{softmax}(x)$. The vector–Jacobian product is:

$$\frac{\partial L}{\partial x} = g - a \sum_k g\,.$$

Intuition: we subtract the component of $g$ that lies along the all-ones direction on the simplex (hence $\sum_k \partial L/\partial x = 0$).
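Putting the stable forward pass and the VJP together, a minimal NumPy sketch might look like this (function names are illustrative, not the engine's actual API):

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Stable form: (x - m) - log(sum(exp(x - m))), reducing along the class axis.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))

def log_softmax_vjp(ell, g, axis=-1):
    # ell: log-softmax output, g: upstream gradient dL/d(ell).
    # exp(ell) recovers softmax(x), so this is g - softmax(x) * sum_k g.
    return g - np.exp(ell) * np.sum(g, axis=axis, keepdims=True)
```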

Negative Log-Likelihood (NLL) and Cross-Entropy

For one-hot targets $y$ and log-probabilities $\ell$,

$$\mathrm{NLL}(\ell, y) = -\sum_k y \odot \ell\,; \qquad \frac{\partial L}{\partial \ell} = -y\,.$$

Cross-Entropy with logits is just

$$\text{CE-with-logits}(x, y) = \mathrm{NLL}\big(\log\operatorname{softmax}(x), y\big),$$

which yields the classic gradient

$$\frac{\partial L}{\partial x} = \mathrm{softmax}(x) - y\,.$$

All formulas above are axis-aware (we reduce along the chosen class axis $k$ with keepdims=True), which makes broadcasting correct in both forward and backward passes.
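Under the same assumptions (one-hot targets, hypothetical function names), a sketch of the loss and its gradients:

```python
import numpy as np

def nll_loss(ell, y, axis=-1):
    # One-hot y selects -log p of the true class along the class axis.
    return -np.sum(y * ell, axis=axis)

def nll_loss_vjp(y):
    # dL/d(ell) = -y for one-hot targets.
    return -y

def ce_with_logits_grad(x, y, axis=-1):
    # Combined gradient of NLL(log_softmax(x), y) w.r.t. the logits.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    a = e / np.sum(e, axis=axis, keepdims=True)
    return a - y  # the classic softmax(x) - y
```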

The code for these operations can be found in my autograd repository:

  1. Softmax
  2. Log Softmax
  3. Negative Log-Likelihood Loss

Tests-first: numerical checks and invariants

During development I tried my best to follow TDD. I tested quite a few cases, among them:

  • Finite-difference checks for softmax/log-softmax/NLL along both $k \in \{0, 1\}$ (a minimal sketch of such a check is shown below).
  • Stability: probabilities sum to 1; log-softmax stays finite for extreme logits.
  • Invariants: the softmax/log-softmax gradients sum to zero along the class axis.

You can find all of them here.
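For illustration, a minimal central-difference check of the softmax VJP along $k = 1$ might look like the following sketch (not the actual test code from the repository):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def finite_diff_grad(f, x, eps=1e-5):
    # Central differences for a scalar-valued f(x).
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + eps; f_plus = f(x)
        x[idx] = orig - eps; f_minus = f(x)
        x[idx] = orig
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
g = rng.normal(size=(4, 3))                      # fixed upstream gradient

loss = lambda t: np.sum(softmax(t, axis=1) * g)  # scalar proxy loss <a, g>
a = softmax(x, axis=1)
analytic = a * (g - np.sum(g * a, axis=1, keepdims=True))
numeric = finite_diff_grad(loss, x.copy())
assert np.allclose(analytic, numeric, atol=1e-6)
```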

MNIST: a minimal MLP baseline

Off to training. I decided to give it a try on the MNIST dataset: a one-hidden-layer MLP (784→256→10, tanh) trained with cross-entropy (log-softmax + NLL). I tried many configurations, including tweaking the learning rate and swapping the activation for ReLU, but the MLP with tanh activation got the best results.
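For orientation, here is a back-of-the-envelope NumPy sketch of one training step for this 784→256→10 tanh network. It mirrors the math above (the output gradient is $a - y$), but it is not the actual autograd-based demo, and the initialization and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.01, size=(784, 256)); b1 = np.zeros(256)
W2 = rng.normal(0, 0.01, size=(256, 10));  b2 = np.zeros(10)
lr = 0.1  # illustrative learning rate

def train_step(x, y):
    """x: (B, 784) images, y: (B, 10) one-hot targets."""
    global W1, b1, W2, b2
    h = np.tanh(x @ W1 + b1)                      # hidden layer, (B, 256)
    z = h @ W2 + b2                               # logits, (B, 10)
    z = z - z.max(axis=1, keepdims=True)          # stable softmax shift
    a = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    loss = -np.mean(np.sum(y * np.log(a + 1e-12), axis=1))

    B = x.shape[0]
    dz = (a - y) / B                              # softmax + cross-entropy gradient
    dW2 = h.T @ dz; db2 = dz.sum(axis=0)
    dh = dz @ W2.T
    du = dh * (1 - h ** 2)                        # tanh'(u) = 1 - tanh(u)^2
    dW1 = x.T @ du; db1 = du.sum(axis=0)

    W2 -= lr * dW2; b2 -= lr * db2                # plain SGD update
    W1 -= lr * dW1; b1 -= lr * db1
    return loss
```

The global-variable style keeps the sketch short; the real demo goes through the autograd engine instead of hand-written backprop.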

A Kaggle submission from this MLP achieved 0.955 accuracy, which is a good result for such a small model.

You can find the training demo in my autograd repository.

Resources (Standing on the shoulders of giants)