This post is mainly based on the paper "Measuring the Intrinsic Dimension of Objective Landscapes" (Li et al., ICLR 2018, https://arxiv.org/abs/1804.08838).

Overview

  • Question
    • How to measure the capacity of a model?
    • How many parameters are really needed?
  • “Suggestive” Conclusion
    • Many problems have smaller intrinsic dimensions than one might suspect
    • The intrinsic dimension for a given dataset varies little across a family of models with vastly different sizes
    • Implication: once a parameter space is large enough to solve a problem, extra parameters serve directly to increase the dimensionality of the solution manifold
  • Intrinsic dimension allows quantitative comparison across different types of learning (e.g., supervised, reinforcement)
    • Solving the inverted pendulum problem is 100 times easier than classifying digits from MNIST
    • Playing Atari Pong from pixels is about as hard as classifying CIFAR-10
  • Intrinsic dimension can be used to obtain an upper bound on the minimum description length of a solution
    • Byproduct: a compressed, low-dimensional representation of the solution

NN Optimization

  • Steps
    • Step 1: Fix dataset, network architecture, loss function
    • Step 2: Initialize network
    • Step 3: Adjust the weights to minimize the loss
  • “Training NN” = traversing some path along an objective landscape (loss landscape)
  • As soon as Step 1 is completed, the loss landscape is fixed
  • Steps 2 and 3 only affect how the loss landscape is explored
  • Dimension of the optimization
    • Given a network parameterized by $D$ parameters
    • The optimization problem is $D$ dimensional
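As a quick illustration, $D$ is simply the total parameter count; a minimal PyTorch snippet (assuming the 784-200-200-10 FC architecture used for MNIST later in this post):

```python
import torch.nn as nn

# FC baseline for MNIST: 784-200-200-10
model = nn.Sequential(
    nn.Linear(784, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 10),
)
D = sum(p.numel() for p in model.parameters())
print(D)  # 199210 -> direct training is a 199,210-dimensional problem
```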

Background

Toy Example 1

  • $\theta_1 + \theta_2 + \theta_3 = 1$
  • Parameter space: $\theta \in \mathbb{R}^3$
  • Intrinsic dimension: $d_{int} = 1$ (a single random direction, i.e., a line $\mathbb{R}$, suffices)
  • Why?
    • The manifold of the solution space is a hyperplane passing through $(0,0,1)$, $(0,1,0)$ and $(1,0,0)$
    • This manifold is 2 dimensional
    • Denoting by $s$ the dimensionality of the solution set, $D = d_{int} + s$
    • Hence $d_{int}$ is the codimension: $d_{int} = 3 - 2 = 1$
  • On the manifold, there are 2 orthogonal directions one can move and remain at zero cost
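To make the "almost surely" intuition concrete: train along a random line $\theta = \theta_0 + t\,p$ with a frozen random direction $p \in \mathbb{R}^3$ (assuming a squared-error loss on the constraint):

$$\mathcal{L}(t) = \Big(\sum_i \theta_{0,i} + t\sum_i p_i - 1\Big)^2, \qquad t^* = \frac{1 - \sum_i \theta_{0,i}}{\sum_i p_i}$$

An exact solution $t^*$ exists unless $\sum_i p_i = 0$, an event of probability zero for a random $p$, so a $d = 1$ random subspace almost surely contains a solution.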

Toy Example 2

  • $D = 1000$
  • Loss: a squared error cost function that requires each consecutive group of 100 elements to sum to a different value:
    • The first 100 elements must sum to 1
    • The second 100 elements must sum to 2
    • … and so on, until the tenth group of 100 elements sums to 10
  • The manifold of solutions is a 990 dimensional hyperplane
  • $d_{int} = 10$
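A quick numerical check of this claim (a minimal NumPy sketch; since the constraints are linear, the best point in the random subspace can be found in closed form by least squares instead of gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_groups = 1000, 10

# Constraint matrix: row g sums the g-th group of 100 parameters.
A = np.zeros((n_groups, D))
for g in range(n_groups):
    A[g, g * 100:(g + 1) * 100] = 1.0
b = np.arange(1, n_groups + 1, dtype=float)  # targets 1, 2, ..., 10

theta0 = rng.normal(size=D)                  # frozen random init
for d in (9, 10, 11):
    P = rng.normal(size=(D, d))              # frozen random projection
    # Minimize ||A (theta0 + P v) - b||^2 over v in R^d.
    v, *_ = np.linalg.lstsq(A @ P, b - A @ theta0, rcond=None)
    loss = np.sum((A @ (theta0 + P @ v) - b) ** 2)
    print(f"d={d:2d}  loss={loss:.3e}")      # ~0 once d >= 10
```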

$d_{int}$ Approximation

  • For hard optimization problems, we need an algorithmic method to approximate $d_{int}$
  • Notations
    • Parameter: $\theta^{(D)} \in \mathbb{R}^D$
    • Initial parameter: $\theta^{(D)}_0$
    • Final parameter: $\theta^{(D)}_*$
    • Projection matrix: $P \in \mathbb{R}^{D \times d}$, randomly generated and frozen
    • Subspace parameter: $\theta^{(d)} \in \mathbb{R}^d$, the low-dimensional vector that is actually trained
  • Training: $\theta^{(D)} = \theta^{(D)}_0 + P \theta^{(d)}$
  • Choice of $P$
    • Dense random projection: $P = W$
    • Sparse random projection: $P = W_{sparse}$
    • Fastfood transform: $P = M$ (sketched after this list)
      • $M = HG \Pi HB$, fixed
      • $H$: a Hadamard matrix
      • $G$: a random diagonal matrix with independent standard normal entries
      • $B$: a random diagonal matrix with equal probability $\pm 1$ entries
      • $\Pi$: a random permutation matrix
      • Using the Fast Walsh-Hadamard Transform for $H$, the projection can be computed in $O(D \log d)$ time and $O(D)$ space, versus $O(Dd)$ for a dense $P$
  • Training proceeds by computing gradients with respect to $\theta^{(d)}$ and stepping in the low-dimensional space
    • Since the mapping is linear, the chain rule needs only one extra factor: $\partial \theta^{(D)} / \partial \theta^{(d)} = P$
    • See the training sketch after this list
  • Normalization of $P$
    • Columns of $P$ are normalized to unit length
    • Columns may also be orthogonalized if desired; the paper instead relies on the approximate orthogonality of random vectors in high dimensions
  • Adjustment to optimizers
    • SGD is “rotation-invariant”: its trajectory does not depend on the orientation chosen for the subspace basis
    • For RMSProp & Adam, per-parameter scaling means the path taken through $\theta^{(D)}$ space depends on the rotation chosen
  • Can we find solution in subspace $\theta^{(d)}$?
    • If $d < D-s$, solutions will almost surely not be found (with probability 1)
    • If $d \geq D-s$,
      • If solution set is a hyperplane, the solution will almost surely intersect the subspace
      • If solution set has arbitrary topology, intersection is not guaranteed
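Putting the pieces together, a minimal PyTorch sketch of random subspace training with a dense $P$ (the architecture, learning rate, and use of `torch.func.functional_call` are illustrative assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn as nn
from torch.func import functional_call  # PyTorch >= 2.0

model = nn.Sequential(
    nn.Linear(784, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 10),
)

# Freeze theta_0^(D) as one flat vector; remember each tensor's name/shape.
names = [n for n, _ in model.named_parameters()]
shapes = [p.shape for p in model.parameters()]
theta0 = torch.cat([p.detach().flatten() for p in model.parameters()])

D, d = theta0.numel(), 750
P = torch.randn(D, d)
P /= P.norm(dim=0, keepdim=True)              # unit-length columns
theta_d = torch.zeros(d, requires_grad=True)  # the only trainable tensor
opt = torch.optim.SGD([theta_d], lr=0.1)

def subspace_params():
    # theta^(D) = theta_0^(D) + P theta^(d); autograd supplies the extra
    # chain-rule factor d(theta^(D)) / d(theta^(d)) = P automatically.
    theta = theta0 + P @ theta_d
    out, i = {}, 0
    for name, shape in zip(names, shapes):
        n = shape.numel()
        out[name] = theta[i:i + n].view(shape)
        i += n
    return out

def train_step(x, y):
    opt.zero_grad()
    logits = functional_call(model, subspace_params(), (x,))
    loss = nn.functional.cross_entropy(logits, y)
    loss.backward()
    opt.step()
    return loss.item()
```

Initializing $\theta^{(d)}$ to zero makes training start exactly at $\theta^{(D)}_0$, and only $d$ numbers ever receive gradient updates.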

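For large $D$, even materializing a dense $P$ ($D \times d$ floats) is prohibitive, which is what the Fastfood option avoids. A rough NumPy sketch of the $M = HG\Pi HB$ projection, assuming $d$ is a power of two and divides $D$ (the actual Fastfood transform also includes scaling factors omitted here):

```python
import numpy as np

def fwht(x):
    """In-place unnormalized Fast Walsh-Hadamard Transform, O(n log n)."""
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def make_fastfood(D, d, rng):
    """Build v -> P v from stacked blocks of M = H G Pi H B (all frozen)."""
    blocks = [(rng.choice([-1.0, 1.0], size=d),  # B: random +-1 diagonal
               rng.permutation(d),               # Pi: random permutation
               rng.normal(size=d))               # G: random Gaussian diagonal
              for _ in range(D // d)]
    def project(v):
        out = np.empty(D)
        for k, (B, Pi, G) in enumerate(blocks):
            y = fwht(B * v)                      # H B v
            y = G * y[Pi]                        # G Pi H B v
            out[k * d:(k + 1) * d] = fwht(y)     # H G Pi H B v
        return out / np.sqrt(D)                  # rough normalization
    return project
```

Each block costs $O(d \log d)$ via the FWHT and there are $D/d$ blocks, giving the $O(D \log d)$ total mentioned above; only $O(D)$ random numbers need to be stored instead of the $Dd$ entries of a dense $P$.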
_Figure (toy-ex-2): Performance vs. random subspace dimension for toy example 2. The problem becomes both 90% and 100% solvable at subspace dimension 10, so $d_{int90}$ and $d_{int100}$ are both 10._

$d_{int}$ Details

  • $d_{int100}$: the smallest $d$ that yields solutions whose performance is statistically indistinguishable from the baseline
    • $d_{int100}$ tends to vary widely
    • $d_{int100} \rightarrow D$ when the task requires matching a very well-tuned baseline model
  • $d_{int90}$: the smallest $d$ that yields solutions with at least 90% of the baseline performance
    • More practical and useful
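Operationally, both thresholds are read off a sweep over subspace dimensions; a trivial helper (names and data layout assumed for illustration):

```python
def d_int(results, baseline, fraction=0.9):
    """Smallest subspace dim reaching `fraction` of baseline performance.

    results: dict mapping subspace dimension d -> validation performance.
    Use fraction=0.9 for d_int90, fraction=1.0 (with a tolerance or a
    statistical test) for d_int100. Returns None if no d qualifies.
    """
    hits = [d for d, perf in sorted(results.items())
            if perf >= fraction * baseline]
    return hits[0] if hits else None
```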

Experiments: MNIST

Compression

  • Trained FC and CNN networks are highly compressible
  • To store a solution, keep only (see the reconstruction sketch below):
    • The random seed used to generate the frozen initial parameters $\theta^{(D)}_0$
    • The random seed used to generate $P$
    • The $d$ floating point numbers in $\theta^{(d)}_*$ (e.g., 750 for the FC network)
  • FC
    • 750 degrees of freedom (0.4% params) for 90% of the baseline performance
    • Assuming fp32, compression factor = 260x (from 793kB to only 3.2kB)
  • CNN (LeNet)
    • 290 degrees of freedom (0.65% params) for 90% of the baseline performance
    • Assuming fp32, compression factor = 150x
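A sketch of the decoding side of this scheme (hypothetical helper: real initializers are per-layer schemes such as Xavier rather than a single Gaussian, so this only illustrates the seeds-plus-$\theta^{(d)}_*$ idea):

```python
import numpy as np

def reconstruct(seed_init, seed_P, theta_d_star, D):
    # Regenerate both frozen random quantities from their seeds, so only
    # the d floats of theta_d_star ever need to be stored or transmitted.
    theta0 = np.random.default_rng(seed_init).standard_normal(D)
    P = np.random.default_rng(seed_P).standard_normal((D, len(theta_d_star)))
    P /= np.linalg.norm(P, axis=0, keepdims=True)  # unit-length columns
    return theta0 + P @ theta_d_star               # theta_*^(D)
```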

_Figure (mnist-dim): Performance (validation accuracy) vs. subspace dimension $d$ for two networks trained on MNIST. Left: FC; right: CNN._

Robustness of Intrinsic Dimension

  • How intrinsic dimension varies across FC networks with a varying number of layers and varying layer width
  • Grid sweep on FC
    • Layers $L \in \{1, 2, 3, 4, 5\}$
    • Width $W \in \{50, 100, 200, 400\}$
  • Results
    • $D$ changes by a factor of 24.1
    • $d_{int90}$ changes by a factor of 1.33

_Figure (robust): Intrinsic dimension vs. number of native parameters for 20 FC models. Though $D$ varies by a factor of 24.1, $d_{int90}$ varies by only a factor of 1.33, indicating that $d_{int90}$ is a fairly robust measure across a model family._

Random Subspaces Parameter Efficiency

  • If the intrinsic dimension for MNIST is low, can we directly train a small network instead of training in a subspace of a large network?
  • 1000 small networks
    • Layers $L \in \{1, 2, 3, 4, 5\}$
    • Width $W \in \{2, 3, 5, 8, 10, 15, 20, 25\}$
  • Results
    • FC: at a given level of validation accuracy, there is a persistent horizontal gap; directly trained small networks need more trainable parameters than random subspace training
    • CNN: generally more parameter efficient than FC

_Figure (param-eff): Performance vs. number of trainable parameters for (left) FC networks and (right) CNNs trained on MNIST. Gray circles: randomly generated directly trained networks; blue points: random subspace training. For CNNs, parameter efficiency varies: gray points can be significantly to the right of, or close to, the blue manifold._

Measuring $d_{int90}$ on Shuffled Data

  • Compared to FC networks, CNNs impose an inductive bias (local spatial structure). So are CNNs always better than FC on MNIST?
  • Shuffled Data: pixels are subject to a random permutation, chosen once for the entire dataset
  • Results
    • FC: $d_{int90}$ remains at 750
    • CNN: $d_{int90}$ jumps from 290 to 1400
  • Catch
    • CNNs are better suited to classifying digits given images with local structure
    • When this structure is removed, violating the convolutional assumptions, more degrees of freedom are required to model the underlying distribution
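The shuffling itself is trivial; one fixed permutation is drawn and applied to every image (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
perm = rng.permutation(28 * 28)  # chosen once for the entire dataset

def shuffle_pixels(images):
    """images: MNIST array of shape (N, 28, 28); same shuffle for all."""
    return images.reshape(-1, 28 * 28)[:, perm].reshape(-1, 28, 28)
```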

Experiments: Various Learning Problems

_Figure (param-eff-rl): Intrinsic dimensions on InvertedPendulum-v1, Humanoid-v1, and Pong-v0. This places the walking humanoid task on a similar level of difficulty as modeling MNIST with an FC network, and Pong from pixels on the same order as modeling CIFAR-10._

_Figure (param-eff-comp): $d_{int90}$ measured on various supervised and reinforcement learning problems._