This post is mainly based on two papers: *Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning* and *LoRA: Low-Rank Adaptation of Large Language Models*.

The goal of Parameter-Efficient Fine-Tuning (PEFT) is to avoid full fine-tuning of all model parameters, which is prohibitively expensive for LLMs with billions of parameters. Background: Measuring Neural Network’s Intrinsic Dimensionality.

Differences between the two papers

  • First paper
    • Empirically shows that LMs have a low intrinsic dimension (measured across all parameters)
    • Parameters are treated as one flattened vector, without matrix structure, and projected to a low-dimensional subspace
    • The projection matrix is constructed via the Fastfood transform
  • LoRA
    • Assumes each weight update matrix $\Delta W$ has low rank, and estimates it with a low-rank decomposition
    • Parameters are grouped according to their matrix structure, and each update is projected to a low rank
    • The low-rank decomposition is learned directly

Intrinsic Dimensionality of LLM

  • It is unclear why LLMs with hundreds of millions of parameters can be fine-tuned (FT) effectively on only hundreds or thousands of labeled examples
  • Based on the Intrinsic Dimension method, this paper empirically shows
    • LLMs admit a low-dimensional reparameterization (FT on the reduced parameters is as effective as FT on the full parameters)
    • By optimizing only about 200 parameters, RoBERTa can reach 90% of the full-parameter FT performance on MRPC
    • Pre-training implicitly minimizes intrinsic dimension, which could reveal insights into how to compress the average NLP task

Background

  • Pre-trained models such as BERT have redundant capacity, allowing significant sparsification without much performance degradation
  • Fine-tuning only the top layers of pre-trained models is not effective; FT with Adapter modules is parameter-efficient
  • Standard FT seems to affect the generalization of pre-trained representations

Direct Intrinsic Dimension (DID)

\[\theta^{(D)} = \theta^{(D)}_0 + P\theta^{(d)}\]

where,

  • $\theta^{(D)}_0$: the frozen pre-trained parameters, with $D$ the total number of parameters
  • $\theta^{(d)}$: the $d$ trainable parameters in the low-dimensional subspace
  • $P \in \mathbb{R}^{D \times d}$: a fixed random projection matrix

Selection of Projection Matrix

  • At this scale, the only computationally feasible choice for the projection is the Fastfood transform (a minimal sketch of the DID reparameterization follows below)
  • Reason: projecting the RoBERTa-Large parameters into a $d = 1000$ subspace with a dense projection matrix would require about 1.42 TB of memory
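
Below is a minimal PyTorch sketch of the DID reparameterization on a toy model. It uses a dense random projection purely for readability; the paper relies on the Fastfood transform precisely so that $P$ never has to be materialized. The class name `DIDReparam` and every implementation detail here are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
from torch.func import functional_call


class DIDReparam(nn.Module):
    """Direct Intrinsic Dimension: theta^(D) = theta0^(D) + P @ theta^(d).

    Dense random projection for readability only; at real-model scale the
    paper uses the Fastfood transform so that P (D x d) is never stored.
    """

    def __init__(self, model: nn.Module, d: int, seed: int = 0):
        super().__init__()
        self.model = model
        # Remember the pre-trained parameters theta0 and freeze them.
        self.names = [n for n, _ in model.named_parameters()]
        self.theta0 = [p.detach().clone() for p in model.parameters()]
        for p in model.parameters():
            p.requires_grad_(False)
        D = sum(p.numel() for p in self.theta0)
        # Fixed random projection, regenerable from the seed alone.
        g = torch.Generator().manual_seed(seed)
        self.register_buffer("P", torch.randn(D, d, generator=g) / D**0.5)
        # The only trainable parameters: the d-dimensional theta^(d).
        self.theta_d = nn.Parameter(torch.zeros(d))

    def forward(self, *args, **kwargs):
        flat = self.P @ self.theta_d                    # project back to R^D
        params, offset = {}, 0
        for name, p0 in zip(self.names, self.theta0):
            n = p0.numel()
            params[name] = p0 + flat[offset:offset + n].view_as(p0)
            offset += n
        # Run the frozen model with the reparameterized weights while keeping
        # the gradient path from the output back to theta_d.
        return functional_call(self.model, params, args, kwargs)
```

Training then amounts to optimizing only `theta_d` (e.g., `torch.optim.Adam([reparam.theta_d])`), i.e. exactly $d$ scalars.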

Structure Aware Intrinsic Dimension (SAID)

  • Existing literature has found that individual Transformer layers specialize in different functions
  • See, e.g., What Does BERT Look At? An Analysis of BERT’s Attention
  • However, the original DID method ignores the layer-wise structure of the Transformer
  • SAID injects structure into $\theta^{(d)}$ when computing $d_{90}$, the smallest $d$ at which subspace training reaches 90% of full fine-tuning performance
\[\theta^{(D)}_i = \theta^{(D)}_{0,i} + \lambda_i P\theta^{(d-m)}\]

where,

  • $i$: layer index
  • $m$: total number of layers
  • $d - m$: dimension of the shared projected parameters, so that the subspace’s total degrees of freedom = $(d - m) + m = d$, where $m$ parameters are reserved for the per-layer scales $\lambda_i$ (see the sketch below)
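
A small sketch of the SAID mapping under the definitions above. The per-layer projections `P_per_layer` and the scales `lambdas` are illustrative names; in the paper the projections come from the Fastfood transform.

```python
import torch


def said_update(theta_d, P_per_layer, lambdas):
    """SAID: theta_i^(D) = theta0_i^(D) + lambda_i * (P_i @ theta^(d-m)).

    theta_d     -- shared subspace parameters, shape (d - m,)
    P_per_layer -- list of m fixed random projections, P_i of shape (D_i, d - m)
    lambdas     -- trainable per-layer scales, shape (m,)
    Returns the per-layer updates in the full parameter space; adding them to
    the frozen theta0_i gives the reparameterized weights of layer i.
    """
    return [lam * (P_i @ theta_d) for lam, P_i in zip(lambdas, P_per_layer)]
```

The trainable degrees of freedom are the $(d - m)$ entries of `theta_d` plus the $m$ scales, i.e. $d$ in total.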

Experiments

Intrinsic Dimension of LLM

  • From the GLUE benchmark, choose MRPC and QQP
    • MRPC: small dataset, ~3,700 samples
    • QQP: large dataset, ~363k samples
  • For each dataset and model, run 100 subspace trainings with $d \in [10, 10000]$ spaced on a log scale (a sketch of the resulting $d_{90}$ sweep follows this list)
  • The sentence classification head is randomly initialized and kept fixed
  • This ensures we optimize exactly $d$ parameters
  • Results
    • The dimensionality of viable solutions for LLMs is incredibly low
      • RoBERTa-Large # of parameters: 354M
      • RoBERTa-Large MRPC subspace dimension: 207
    • RoBERTa consistently outperforms BERT across subspace dimensions $d$, despite having more parameters
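
A sketch of how such a sweep could be turned into a $d_{90}$ estimate. The callback `eval_at_dim` is a hypothetical stand-in for "train in a $d$-dimensional subspace (e.g., with the `DIDReparam` sketch above) and return validation accuracy".

```python
import numpy as np


def estimate_d90(eval_at_dim, full_ft_accuracy, d_min=10, d_max=10_000, n=100):
    """Smallest subspace dimension d reaching 90% of full fine-tuning accuracy.

    eval_at_dim      -- hypothetical callback: d -> validation accuracy
    full_ft_accuracy -- accuracy of standard full-parameter fine-tuning
    """
    dims = np.unique(np.logspace(np.log10(d_min), np.log10(d_max), n).astype(int))
    for d in dims:
        if eval_at_dim(int(d)) >= 0.9 * full_ft_accuracy:
            return int(d)
    return None  # d_90 not reached within the sweep
```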

llm-dim

Estimated \(d_{90}\) intrinsic dimension via DID and SAID

Intrinsic Dimension, Pre-Training, And Generalization Gap

  • Interpretation of the intrinsic parameter vector
    • The low dimensional vector encodes the target task with respect to the pre-trained representations
    • Can interpret $d$ as the minimal description length of the target task via pre-trained representations
  • Hypothesis
    • Pre-training reduces the intrinsic dimensionality of the average NLP task
    • Pre-training compresses the minimal description length of target tasks
  • The compression point of view
    • Consider a classification network with pre-trained representation $f$ and classification head $g$, with respective weights $w_f, w_g$
    • Fine-tuning with SAID only requires storing the $d$ subspace parameters plus a single random seed used to regenerate the random matrices of the Fastfood transform (sketched below)
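
An illustration of that storage argument, reusing the hypothetical `DIDReparam` sketch from earlier; the toy model and file name are placeholders.

```python
import torch
import torch.nn as nn

# Toy stand-in for a shared pre-trained model, fine-tuned via DIDReparam above.
pretrained_model = nn.Linear(16, 2)
reparam = DIDReparam(pretrained_model, d=207, seed=0)

# The task-specific checkpoint is only theta_d plus the seed (and d) needed to
# regenerate the projection -- the projection matrices themselves are not stored.
torch.save({"theta_d": reparam.theta_d.detach().cpu(), "seed": 0, "d": 207},
           "task_subspace.pt")

# Restoring: rebuild the projection from the seed, then load theta_d.
ckpt = torch.load("task_subspace.pt")
restored = DIDReparam(pretrained_model, d=ckpt["d"], seed=ckpt["seed"])
with torch.no_grad():
    restored.theta_d.copy_(ckpt["theta_d"])
```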

Pre-Training & Intrinsic Dimension

  • Steps
    • Retrain RoBERTa-Base from scratch, checkpointing the model every 10,000 pre-training steps
    • The pre-training data does not contain the downstream datasets
    • At each checkpoint, use SAID to estimate $d_{90}$ across 6 datasets
  • Results
    • The intrinsic dimensionality of RoBERTa-Base monotonically decreases as pre-training continues
    • Tasks that are traditionally hard for RoBERTa tend to have larger intrinsic dimension, hinting at a connection between generalization and intrinsic dimension
  • Hypothesis
    • Masked Language Modeling (MLM) learns representations of language that are generic and distributed enough to facilitate downstream learning of highly compressed task representations
    • Pre-training learns representations that form a compression framework with respect to various NLP tasks

llm-dim-train

Estimated \(d_{90}\) intrinsic dimension of RoBERTa-Base when training from scratch. \(d_{90}\) is estimated on 6 datasets.

Parameter Count & Intrinsic Dimension

  • Steps
    • For each model, use SAID to estimate $d_{90}$ on the MRPC dataset
  • Results
    • Strong general trend: as # parameters increases, intrinsic dimension of fine-tuning on MRPC decreases
    • Experiments on other datasets showed the same trend (see paper Appendix)

llm-dim-scaling

Scaling of intrinsic dimension w.r.t # params using the SAID method on the MRPC dataset.

Generalization Gap & Intrinsic Dimension

  • A lower intrinsic dimension is strongly correlated with better evaluation performance

llm-dim-acc

Evaluation accuracy vs intrinsic dimensionalities on 6 datasets.

TBD: Generalization Bounds

LoRA

  • “Low-Rank Adaptation”
  • How
    • Freezes the pretrained model weights
    • Injects trainable rank decomposition matrices into each Transformer layer
    • Updates only the parameters of the rank decomposition matrices
  • Greatly reduces the number of trainable parameters and increases training throughput
  • Performance: on-par or better than traditional finetuning on RoBERTa, DeBERTa, GPT-2, and GPT-3

Rank-deficiency in language model adaptation

  • Hypothesis: the change in weights during model adaptation also has a low “intrinsic rank”
  • Using GPT-3 175B as an example, the paper shows that a very low-rank $\Delta W$ is enough, even though the full rank ($d_{model}$) for GPT-3 is as high as 12,288 (a quick way to probe this on a fine-tuned checkpoint is sketched below)
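
One simple way to probe this on a fully fine-tuned checkpoint is to look at the singular-value spectrum of an observed update $\Delta W = W_{\text{ft}} - W_0$. This is an illustrative check, not the paper's exact subspace-similarity analysis.

```python
import torch


def effective_rank(delta_w: torch.Tensor, energy: float = 0.99) -> int:
    """Number of singular values needed to capture `energy` of the squared
    spectral mass of a weight update delta_w = W_ft - W_0. A value much
    smaller than min(d, k) supports the low intrinsic-rank hypothesis."""
    s = torch.linalg.svdvals(delta_w.float())
    cum = torch.cumsum(s ** 2, dim=0) / (s ** 2).sum()
    return int((cum < energy).sum().item()) + 1
```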

Background

  • Over-parametrized models in fact reside on a low intrinsic dimension
  • Approach 1: Adapter Modules
    • Two adapter layers are inserted into each Transformer block (one after the attention sub-layer, one after the FFN)
    • Advantage: only store and load a small number of task-specific parameters
    • Disadvantage
      • Increases inference latency due to sequential processing (20-30% in some settings; see Table 1 of the LoRA paper)
      • With model sharding, the problem gets worse because of synchronous GPU operations such as AllReduce and Broadcast
  • Approach 2: Prefix Tuning
    • Disadvantage
      • Difficult to optimize
      • Often fails to match the fine-tuning baselines
      • Reserving part of the sequence length for the prefix reduces the length available for the actual input
  • Approach 3: SAID
    • Disadvantage
      • The random matrices in the Fastfood transform still require large storage
      • Slow to train

LoRA

lora-arch

Reparametrization. The pre-trained weights are frozen; only $A$ and $B$ are trained.

LoRA keeps the pre-trained weight matrices $W_0$ in each Transformer layer frozen and adds a low-rank update $\Delta W$ on top of them:

\[W_0 + \Delta W = W_0 + BA\]

where,

  • $W_0 \in \mathbb{R}^{d \times k}$: a pre-trained weight matrix
  • $B \in \mathbb{R}^{d \times r}$
  • $A \in \mathbb{R}^{r \times k}$
  • $d$: output dimension of $W_0$
  • $k$: input dimension of $W_0$
  • $r$: rank of the update, with $r \ll \min(d, k)$

During training and inference, the matrix products of $x$ with $W_0$ and with $\Delta W$ can be computed in parallel (a minimal PyTorch sketch follows the equation):

\[h = W_0x + \Delta Wx = W_0x + BAx\]
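
A minimal PyTorch sketch of such a layer, assuming a standard `nn.Linear` as the frozen base. The `alpha / r` scaling and the zero-initialization of $B$ follow common practice; this is a sketch, not the reference `loralib` implementation.

```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer W0 plus a trainable low-rank update:
    h = W0 x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze W0 (and its bias)
            p.requires_grad_(False)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.empty(r, k))  # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d, r))  # B in R^{d x r}, zero init
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scale = alpha / r

    def forward(self, x):
        # Both branches consume the same input and can run in parallel.
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```

Because $B$ starts at zero, $\Delta W = 0$ at initialization, so training begins exactly at the pre-trained model; only $A$ and $B$ receive gradients.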

LoRA advantages

  • One pre-trained model + many small LoRA modules ($A, B$ in the figure above) for different tasks
  • Lower hardware barrier + higher throughput (no need to calculate gradients or maintain optimizer states for most parameters)
  • On deployment: merge the trained matrices into the frozen weights => no extra inference latency (see the merge sketch after this list)
  • Orthogonal to previous approaches and can be combined with them (e.g., prefix-tuning)
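
A sketch of the deployment-time merge, assuming the hypothetical `LoRALinear` class from the previous snippet.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def merge_lora(layer: "LoRALinear") -> nn.Linear:
    """Fold Delta W = scale * (B @ A) into the frozen weight so that inference
    uses a single plain nn.Linear, i.e. no extra latency."""
    merged = nn.Linear(layer.base.in_features, layer.base.out_features,
                       bias=layer.base.bias is not None)
    merged.weight.copy_(layer.base.weight + layer.scale * (layer.B @ layer.A))
    if layer.base.bias is not None:
        merged.bias.copy_(layer.base.bias)
    return merged
```

Switching tasks then amounts to subtracting one $BA$ and adding another, which is cheap relative to the size of $W_0$.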

LoRA is A Generalization of Full Fine-tuning

Why? As we increase the number of trainable parameters:

  • LoRA: roughly converges to training the original model (full fine-tuning)
  • Adapter-based methods: converge to an MLP
  • Prefix-based methods: converge to a model that cannot take long input sequences

Applying LoRA to Transformer

  • Transformer: 4 weight matrices in the self-attention module $(W_q, W_k, W_v, W_o)$ and two in the MLP module
  • This study
    • Only adapt the attention weights for downstream tasks
    • Freeze the MLP modules (a sketch of wrapping only $W_q$ and $W_v$ follows below)
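
For example, a hypothetical helper that wraps only the query and value projections of an attention module with the `LoRALinear` sketch above; the attribute names `q_proj` / `v_proj` are assumptions, and real implementations name these differently.

```python
import torch.nn as nn


def add_lora_to_attention(attn: nn.Module, r: int = 4, alpha: float = 8.0) -> nn.Module:
    """Replace W_q and W_v with LoRA-wrapped versions; W_k, W_o and the MLP
    stay frozen, mirroring the configuration studied for downstream tasks."""
    attn.q_proj = LoRALinear(attn.q_proj, r=r, alpha=alpha)
    attn.v_proj = LoRALinear(attn.v_proj, r=r, alpha=alpha)
    return attn
```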

Practical Benefits and Limitations

  • Most significant benefit: reduction in memory and storage usage
  • VRAM
    • If $r \ll d_{model}$, VRAM usage can be reduced by up to 2/3, since optimizer states are not needed for the frozen parameters
    • On GPT-3 175B, training VRAM drops from 1.2 TB to 350 GB
    • With $r = 4$ and only the query and value projection matrices adapted, the checkpoint size shrinks by roughly 10,000x (350GB -> 35MB; a back-of-envelope check follows this list)
  • Training speed
    • ~25% speedup during training on GPT-3 175B compared to full fine-tuning
    • Reason: gradients do not need to be computed for the vast majority of the parameters
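
A rough back-of-envelope check of the checkpoint-size claim, assuming $d_{model} = 12{,}288$, 96 Transformer layers, and fp16 storage; this is an illustration, not the paper's exact accounting.

```python
# LoRA on W_q and W_v only, rank r = 4, for a GPT-3-175B-sized model.
d_model, n_layers, r, adapted_per_layer = 12288, 96, 4, 2

params_per_matrix = d_model * r + r * d_model        # B (d x r) + A (r x d)
trainable = params_per_matrix * adapted_per_layer * n_layers
print(f"trainable parameters: {trainable / 1e6:.1f}M")        # ~18.9M
print(f"fp16 checkpoint size: {trainable * 2 / 1e6:.0f} MB")  # ~38 MB vs. 350 GB
```

The exact figure depends on which matrices are adapted and on storage precision, but it lands in the tens of megabytes, consistent with the roughly 10,000x reduction quoted above.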

Experiments

  • LoRA applied to GPT-3 175B
  • Competitive with full fine-tuning

lora-acc

Performance of different adaptation methods on GPT-3 175B. Metrics: logical-form validation accuracy on WikiSQL, validation accuracy on MultiNLI-matched, and ROUGE-1/2/L on SAMSum.

Is LoRA Robust Against Number of Trainable Parameters?

lora-robust

GPT-3 175B validation accuracy vs. number of trainable parameters of several adaptation methods on WikiSQL and MNLI-matched. LoRA exhibits better scalability and task performance.

Ablation study: the effect of adapting different types of attention weight matrices in a Transformer.

Which Weight Matrix in Transformer Should We Apply LoRA To?

lora-which-matrix

Validation accuracy on WikiSQL and MultiNLI after applying LoRA to different types of attention weights in GPT-3, given the same number of trainable parameters. Adapting both $W_q$ and $W_v$ gives the best performance overall.

What is the Optimal Rank $r$ for LoRA?

lora-what-rank

Validation accuracy on WikiSQL and MultiNLI with different rank $r$. To our surprise, a rank as small as one suffices for adapting both $W_q$ and $W_v$ on these datasets, while training $W_q$ alone needs a larger $r$. The paper conducts a similar experiment on GPT-2.