Linear Separability of NN-induced Embeddings

A particular case of implicit margin maximization by cross-entropy loss in supervised learning.

For the connection between this phenomenon, OOD (out-of-distribution) data, and NNs, see Cha and Lee 2025.

Cross-Entropy Loss and Gradient Direction

In supervised learning for classification, we typically minimize the cross-entropy loss over model parameters $\theta$:

$$ L = -\frac{1}{N} \sum_i \log p_\theta(y_i | x_i) $$

For a linear classifier $f(x) = W^\top x$ with softmax outputs:

$$ p_\theta(y|x) = \frac{e^{w_y^\top x}}{\sum_k e^{w_k^\top x}} $$

the loss for a single example becomes:

$$ L_i = -\log \frac{e^{w_{y_i}^\top x_i}}{\sum_k e^{w_k^\top x_i}} = -w_{y_i}^\top x_i + \log\sum_k e^{w_k^\top x_i} $$

Taking the gradient with respect to the feature $x_i$:

$$ \nabla_{x_i} L_i = -w_{y_i} + \sum_k p_\theta(k|x_i) w_k $$

This means:

  • $x_i$ is pushed toward its correct class weight vector $w_{y_i}$.
  • $x_i$ is pushed away from competing class weight vectors $w_k$ proportionally to their predicted probability.

Hence, the gradient dynamics explicitly increase the projection of features onto the correct class vector and decrease projections onto incorrect ones.
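
To make this concrete, here is a minimal NumPy sketch (not part of the original derivation; the dimensions and the names `W`, `x`, `y` are arbitrary illustrative choices) that checks the analytic feature gradient above against a central finite-difference estimate.

```python
# Verify  grad_x L_i = -w_{y_i} + sum_k p(k|x_i) w_k  numerically.
import numpy as np

rng = np.random.default_rng(0)
D, K = 5, 3                      # feature dimension, number of classes
W = rng.normal(size=(D, K))      # columns are the class weight vectors w_k
x = rng.normal(size=D)           # one feature vector x_i
y = 1                            # its true class

def loss(x):
    # single-example cross-entropy:  -w_y^T x + log sum_k exp(w_k^T x)
    logits = W.T @ x
    return -logits[y] + np.log(np.sum(np.exp(logits)))

# analytic gradient from the derivation above
p = np.exp(W.T @ x); p /= p.sum()
grad_analytic = -W[:, y] + W @ p

# central finite differences along each coordinate
eps = 1e-6
grad_fd = np.array([
    (loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
    for e in np.eye(D)
])

print(np.allclose(grad_analytic, grad_fd, atol=1e-5))  # expected: True
```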

Connection to Margin Maximization

As training continues and the model approaches perfect classification (low loss regime):

  • The predicted probabilities $p_\theta(y_i | x_i)$ become highly peaked at the correct class.
  • The norm of the weights $||W||$ tends to grow (in linear models without explicit regularization).

For linearly separable data, it can be shown (Soudry et al., Implicit Bias of Gradient Descent on Separable Data, 2018) that:

$$ w(t) / ||w(t)|| \to w^*, $$

where $w^*$ is the maximum-margin separator — the same direction as the hard-margin SVM solution.

This happens without any explicit margin term: gradient descent on cross-entropy loss implicitly drives the weights toward a direction that maximizes the margin between classes.
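
This can be checked numerically. Below is a small sketch, assuming a toy 2-D linearly separable dataset, that runs plain gradient descent on the logistic loss and compares the resulting weight direction with a near-hard-margin SVM direction obtained from scikit-learn with a very large `C`. The dataset, learning rate, and iteration count are arbitrary choices for illustration.

```python
import numpy as np
from scipy.special import expit          # numerically stable sigmoid
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two well-separated Gaussian clusters -> linearly separable, labels in {+1, -1}
X = np.vstack([rng.normal([2.0, 2.0], 0.3, size=(50, 2)),
               rng.normal([-2.0, -2.0], 0.3, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

# plain gradient descent on the logistic loss  L(w) = mean_i log(1 + exp(-y_i w^T x_i))
w = np.zeros(2)
lr = 0.1
for _ in range(100_000):
    margins = y * (X @ w)
    grad = -(X * (y * expit(-margins))[:, None]).mean(axis=0)
    w -= lr * grad

# hard-margin SVM direction, approximated by a linear SVC with a very large C
svm_w = SVC(kernel="linear", C=1e6).fit(X, y).coef_.ravel()
svm_dir = svm_w / np.linalg.norm(svm_w)
gd_dir = w / np.linalg.norm(w)

# cosine similarity creeps toward 1.0 as the number of GD steps grows
print("cosine(gd_dir, svm_dir) =", float(gd_dir @ svm_dir))
```

Note that the convergence in direction is logarithmically slow, consistent with the Soudry et al. analysis, so the cosine similarity approaches 1 only gradually as training continues.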

Intuitive Interpretation via Feature Geometry

Because the loss is dominated by small logit gaps between the true class and its most competitive incorrect class, minimizing it encourages:

$$ (w_{y_i}^\top x_i) - \max_{k \ne y_i} (w_k^\top x_i) $$

to be as large as possible — i.e., to maximize the margin between classes in logit space.
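
As a quick illustration, the sketch below computes this logit margin for a batch of random features and random class weights; all names and shapes (`X`, `W`, `y`, `N`, `D`, `K`) are assumptions made for the example.

```python
# Compute  m_i = w_{y_i}^T x_i - max_{k != y_i} w_k^T x_i  for a batch.
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 8, 5, 3
X = rng.normal(size=(N, D))          # features x_i as rows
W = rng.normal(size=(D, K))          # class weight vectors w_k as columns
y = rng.integers(0, K, size=N)       # true labels y_i

logits = X @ W                                   # shape (N, K)
true_logit = logits[np.arange(N), y]
others = logits.copy()
others[np.arange(N), y] = -np.inf                # mask out the true class
margin = true_logit - others.max(axis=1)

print(margin)   # positive entries are classified correctly, with that much margin
```

Cross-entropy training tends to push these per-example margins up, which is exactly the logit-space separation described above.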

In deep networks, the same intuition extends to learned feature representations:

  • Early layers adjust so that the features $x_i$ become more linearly separable.
  • Later layers (or the final linear classifier) align these features with distinct class weight vectors.

Hence, the cross-entropy gradient naturally shapes the feature space toward configurations that maximize separation between classes — even without explicit regularization enforcing it.
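
A hedged sketch of this effect, using PyTorch on a toy dataset of two concentric rings (not linearly separable in input space): after training a small MLP with cross-entropy, a plain linear probe fits the learned penultimate features far better than the raw inputs. The architecture, dataset, and training lengths are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def rings(n=500):
    # two concentric noisy rings: class 0 at radius 1, class 1 at radius 3
    theta = 2 * torch.pi * torch.rand(2 * n)
    r = torch.cat([torch.full((n,), 1.0), torch.full((n,), 3.0)])
    X = torch.stack([r * torch.cos(theta), r * torch.sin(theta)], dim=1)
    X += 0.1 * torch.randn_like(X)
    y = torch.cat([torch.zeros(n, dtype=torch.long), torch.ones(n, dtype=torch.long)])
    return X, y

X, y = rings()

# small MLP backbone + linear classification head, trained with cross-entropy
backbone = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
head = nn.Linear(64, 2)
opt = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(head(backbone(X)), y)
    loss.backward()
    opt.step()

def linear_probe_acc(feats, y, steps=500):
    """Fit a fresh linear classifier on frozen features and report accuracy."""
    probe = nn.Linear(feats.shape[1], 2)
    popt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(steps):
        popt.zero_grad()
        nn.functional.cross_entropy(probe(feats), y).backward()
        popt.step()
    return (probe(feats).argmax(1) == y).float().mean().item()

with torch.no_grad():
    feats = backbone(X)
print("linear probe on raw inputs:      ", linear_probe_acc(X, y))      # ~chance
print("linear probe on learned features:", linear_probe_acc(feats, y))  # near 1.0
```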

Summary

| Concept | Description |
| --- | --- |
| Mechanism | Gradient descent on cross-entropy aligns features with class vectors and repels them from others. |
| Result | Implicitly increases inter-class margins (logit separation). |
| Theoretical Support | Proven for linear models (Soudry et al., 2018): the convergence direction equals the max-margin separator. |
| Deep Learning Analogy | Network layers learn representations where classes become linearly separable in feature space. |

The take-home message is this:

In supervised learning with cross-entropy loss, the gradient dynamics implicitly bias the model toward feature configurations that maximize class margins — analogous to seeking linear separability.

Nice-to-read

Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., & Srebro, N. (2018). The Implicit Bias of Gradient Descent on Separable Data. Journal of Machine Learning Research, 19(70), 1–57.
