
Softmax to MNIST: Building a Tiny Autograd and Classifier

The goal here is to compile all the findings at both a low level (the math, so we can go deep down the rabbit hole) and a high level (the intuition, so we can generalize more easily about the computation).

This post walks through the math and implementation of axis-aware, numerically stable softmax and log-softmax, their vector–Jacobian products (backward pass), and the Negative Log-Likelihood loss. We finish with a simple MLP trained on MNIST using our own autograd engine.

Notation

  • $z$: logits (pre-softmax scores) from the final linear layer. In an MLP, $z = W\,a_{\text{prev}} + b$.
  • $a$: softmax probabilities, $a = \mathrm{softmax}(z)$, with components $a_i = e^{z_i}/\sum_j e^{z_j}$.
  • $y$: target vector (assumed one-hot here), $y_i \in \{0,1\}$ and $\sum_i y_i = 1$.
  • $k$: class axis (the dimension along which probabilities sum to 1 and reductions occur).
  • $m$: per-axis maximum for stability, $m = \max_k(z)$.
  • $\ell$: log-probabilities (log-softmax outputs), $\ell = \log\operatorname{softmax}(z)$.
  • $g$: upstream gradient from the loss (same shape as the quantity it differentiates, e.g., $g = \partial L/\partial a$ or $g = \partial L/\partial \ell$).
  • $W, b, a_{\text{prev}}$: weights, bias, and previous-layer activations feeding the final linear layer.

Softmax: definition, stability, and gradient

For logits $x$ and class axis $k$, softmax is

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}\,.$$

Intuition note

Softmax turns arbitrary scores into a probability distribution: all entries become non-negative and sum to 1 along the class axis. Imagine classes "competing" for a fixed 100% of belief: raising one score necessarily lowers others.

To avoid overflow we use a per-axis shift (the log-sum-exp trick): let $m = \max_k(x)$. Then

$$\mathrm{softmax}(x) = \frac{e^{x-m}}{\sum_k e^{x-m}}\,.$$
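For reference, here is a minimal NumPy sketch of this stable, axis-aware forward pass (the function name and the `axis`/`keepdims` conventions are illustrative, not the exact API of the autograd engine):

```python
import numpy as np

def softmax(x, axis=-1):
    # Shift by the per-axis max so exp() cannot overflow; the shift cancels
    # in the ratio, so the probabilities are unchanged.
    m = np.max(x, axis=axis, keepdims=True)
    e = np.exp(x - m)
    return e / np.sum(e, axis=axis, keepdims=True)
```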

Vector–Jacobian product. With upstream gradient $g = \partial L/\partial a$ and $a = \mathrm{softmax}(x)$:

$$\frac{\partial L}{\partial x} = a \odot \big(g - \langle g, a \rangle_k\big), \qquad \text{where } \langle g, a \rangle_k = \sum_k g \odot a\,.$$

Here $\odot$ denotes elementwise multiplication. This captures the Jacobian $J = \mathrm{diag}(a) - a a^\top$ without materializing it.
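In code, the VJP is just a couple of axis-aware reductions. A sketch in NumPy (hypothetical names, matching the formula above):

```python
import numpy as np

def softmax_vjp(a, g, axis=-1):
    # a: softmax output, g: upstream gradient dL/da (same shape as a).
    # Computes a * (g - <g, a>_k) without building diag(a) - a a^T.
    inner = np.sum(g * a, axis=axis, keepdims=True)
    return a * (g - inner)
```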

Intuition note

Subtracting the per-axis max $m$ doesn't change the probabilities (the $e^{m}$ factor cancels). It only prevents $e^{x}$ from blowing up.

Let $z$ denote the logits (pre-softmax scores) from the last linear layer. In an MLP, $z = W\,a_{\text{prev}} + b$ and the predicted class probabilities are $a = \mathrm{softmax}(z)$. We will later instantiate this with our MNIST MLP, but the notation applies to any model producing pre-softmax scores.

Softmax partial derivatives (component-wise sensitivities):

$$\frac{\partial a_i}{\partial z_i} = a_i(1-a_i), \qquad \frac{\partial a_j}{\partial z_i} = -a_j a_i \quad (j \neq i).$$

Intuition note

Increasing one logit $z_i$ boosts its own probability $a_i$, but because all probabilities must still sum to 1, the other $a_j$ must decrease; hence the negative cross terms.

Derivation

Using the softmax partials above and cross-entropy $C = -\sum_i y_i \ln a_i$ (one-hot $y$), the output-layer error simplifies neatly. For one example with $a = \mathrm{softmax}(z)$:

$$\delta_i \equiv \frac{\partial C}{\partial z_i} = -\sum_j \frac{y_j}{a_j}\,\frac{\partial a_j}{\partial z_i} = -\frac{y_i}{a_i}\,a_i(1-a_i) - \sum_{j\ne i}\frac{y_j}{a_j}\,(-a_j a_i) = a_i - y_i\,.$$

Intuition note

If the model assigns too much probability to a wrong class, the gradient there is positive (push it down). If it assigns too little probability to the true class, the gradient there is negative (push it up). The signal is literally "what you predicted minus what it should be."

Small numeric example: for logits $z = [2, 0, -1]$,

$$\mathrm{softmax}(z) \approx [0.84,\; 0.11,\; 0.04].$$

If the one-hot target is $y = [0,1,0]$, then $\delta = a - y \approx [0.84,\; -0.89,\; 0.04]$.
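A quick standalone check of these numbers in plain NumPy (independent of the autograd engine):

```python
import numpy as np

z = np.array([2.0, 0.0, -1.0])
e = np.exp(z - z.max())
a = e / e.sum()
print(np.round(a, 2))      # [0.84 0.11 0.04]

y = np.array([0.0, 1.0, 0.0])
print(np.round(a - y, 2))  # [ 0.84 -0.89  0.04]
```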

Log-softmax: stable forward, simple backward

Log-softmax normalizes in log space:

$$\log\operatorname{softmax}(x)_i = x_i - \log\sum_j e^{x_j}.$$

Stable form with $m = \max_k(x)$:

$$\log\operatorname{softmax}(x) = (x - m) - \log\Big(\sum_k e^{x-m}\Big).$$

Let $\ell = \log\operatorname{softmax}(x)$ and $a = e^{\ell} = \mathrm{softmax}(x)$. The vector–Jacobian product is:

$$\frac{\partial L}{\partial x} = g - a \sum_k g\,.$$

Intuition: we subtract the component of $g$ that lies along the all-ones direction on the simplex (hence $\sum_k \partial L/\partial x = 0$).
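Putting the stable forward pass and the VJP together, a minimal NumPy sketch might look like this (function names are illustrative, not the engine's actual API):

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Stable form: (x - m) - log(sum(exp(x - m))), reducing along the class axis.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))

def log_softmax_vjp(ell, g, axis=-1):
    # ell: log-softmax output, g: upstream gradient dL/d(ell).
    # exp(ell) recovers softmax(x), so this is g - softmax(x) * sum_k g.
    return g - np.exp(ell) * np.sum(g, axis=axis, keepdims=True)
```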

Negative Log-Likelihood (NLL) and Cross-Entropy

For one-hot targets $y$ and log-probabilities $\ell$,

$$\mathrm{NLL}(\ell, y) = -\sum_k y \odot \ell\,; \qquad \frac{\partial L}{\partial \ell} = -y\,.$$

Cross-Entropy with logits is just

$$\text{CE-with-logits}(x, y) = \mathrm{NLL}\big(\log\operatorname{softmax}(x), y\big),$$

which yields the classic gradient

$$\frac{\partial L}{\partial x} = \mathrm{softmax}(x) - y\,.$$

All formulas above are axis-aware (we reduce along the chosen class axis $k$ with keepdims=True), which makes broadcasting correct in both forward and backward passes.
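Under the same assumptions (one-hot targets, hypothetical function names), a sketch of the loss and its gradients:

```python
import numpy as np

def nll_loss(ell, y, axis=-1):
    # One-hot y selects -log p of the true class along the class axis.
    return -np.sum(y * ell, axis=axis)

def nll_loss_vjp(y):
    # dL/d(ell) = -y for one-hot targets.
    return -y

def ce_with_logits_grad(x, y, axis=-1):
    # Combined gradient of NLL(log_softmax(x), y) w.r.t. the logits.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    a = e / np.sum(e, axis=axis, keepdims=True)
    return a - y  # the classic softmax(x) - y
```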

The code for these operations can be found in my autograd repository:

  1. Softmax
  2. Log Softmax
  3. Negative Log-Likelihood Loss

Tests-first: numerical checks and invariants

During development I tried my best to follow TDD. I tested quite a few cases, among them:

  • Finite-difference checks for softmax/log-softmax/NLL along both $k \in \{0, 1\}$ (a minimal sketch of such a check is shown below).
  • Stability: probabilities sum to 1; log-softmax stays finite for extreme logits.
  • Invariants: the softmax/log-softmax gradients sum to zero along the class axis.

You can find all of them here.
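For illustration, a minimal central-difference check of the softmax VJP along $k = 1$ might look like the following sketch (not the actual test code from the repository):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def finite_diff_grad(f, x, eps=1e-5):
    # Central differences for a scalar-valued f(x).
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + eps; f_plus = f(x)
        x[idx] = orig - eps; f_minus = f(x)
        x[idx] = orig
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
g = rng.normal(size=(4, 3))                      # fixed upstream gradient

loss = lambda t: np.sum(softmax(t, axis=1) * g)  # scalar proxy loss <a, g>
a = softmax(x, axis=1)
analytic = a * (g - np.sum(g * a, axis=1, keepdims=True))
numeric = finite_diff_grad(loss, x.copy())
assert np.allclose(analytic, numeric, atol=1e-6)
```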

MNIST: a minimal MLP baseline

Off to training. I decided to give it a try on the MNIST dataset: a one-hidden-layer MLP (784→256→10, tanh) trained with cross-entropy (log-softmax + NLL). I tried many configurations, including tweaking the learning rate and swapping the activation for ReLU, but the MLP with tanh activation got the best results.
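For orientation, here is a back-of-the-envelope NumPy sketch of one training step for this 784→256→10 tanh network. It mirrors the math above (the output gradient is $a - y$), but it is not the actual autograd-based demo, and the initialization and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.01, size=(784, 256)); b1 = np.zeros(256)
W2 = rng.normal(0, 0.01, size=(256, 10));  b2 = np.zeros(10)
lr = 0.1  # illustrative learning rate

def train_step(x, y):
    """x: (B, 784) images, y: (B, 10) one-hot targets."""
    global W1, b1, W2, b2
    h = np.tanh(x @ W1 + b1)                      # hidden layer, (B, 256)
    z = h @ W2 + b2                               # logits, (B, 10)
    z = z - z.max(axis=1, keepdims=True)          # stable softmax shift
    a = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    loss = -np.mean(np.sum(y * np.log(a + 1e-12), axis=1))

    B = x.shape[0]
    dz = (a - y) / B                              # softmax + cross-entropy gradient
    dW2 = h.T @ dz; db2 = dz.sum(axis=0)
    dh = dz @ W2.T
    du = dh * (1 - h ** 2)                        # tanh'(u) = 1 - tanh(u)^2
    dW1 = x.T @ du; db1 = du.sum(axis=0)

    W2 -= lr * dW2; b2 -= lr * db2                # plain SGD update
    W1 -= lr * dW1; b1 -= lr * db1
    return loss
```

The global-variable style keeps the sketch short; the real demo goes through the autograd engine instead of hand-written backprop.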

A Kaggle submission from this MLP achieved 0.955 accuracy, which is a good result for such a small model.

You can find the training demo in my autograd repository.

Resources (Standing on the shoulders of giants)