EMNLP 2025 Main Conference

Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers

Fine-tune only router-related parameters to enable dynamic-depth inference — capturing the efficiency of Mixture of Depths at a fraction of full-model training cost.

1 University of Maryland, College Park · 2 Tencent AI Lab, Bellevue, WA
Router-Tuning overview: only router parameters are updated during fine-tuning
Overview

Router-Tuning in one pass

Router-Tuning is a lightweight recipe for enabling dynamic-depth inference in Transformers. It fine-tunes only router-related parameters, leaving the backbone model frozen.

Compared with full-model dynamic-depth tuning, this keeps adaptation cost low while preserving the efficiency-quality tradeoff needed for deployment.

Updates

News

Aug 2025 Router-Tuning accepted to EMNLP 2025 main conference.
Oct 2024 arXiv preprint and code released.
Motivation

Why Router-Tuning?

Traditional transformers execute a fixed number of layers for every token, wasting computation on easy tokens. Mixture of Depths (MoD) addresses this by dynamically skipping less important computations, but two practical issues remain:

01

High Training Cost

Existing methods typically fine-tune the entire model, making dynamic-depth adaptation expensive.

02

Quality Degradation

Aggressive skipping can hurt quality if routing is not well calibrated, especially with higher compute budgets.

Router-Tuning tackles both issues: it restricts optimization to the routing components and introduces routing strategies that better preserve the efficiency-quality tradeoff.

Approach

Core Method

1 · Router-Only Fine-Tuning

Tune only router-related parameters instead of full-model updates.

The backbone weights remain frozen, dramatically reducing optimization cost for dynamic-depth adaptation — making it practical for models already fine-tuned on downstream tasks.
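The recipe is simple to express in code. Below is a minimal PyTorch sketch, assuming the backbone has already been augmented with router modules whose parameter names contain "router"; the naming convention and the example checkpoint are illustrative assumptions, not the released implementation.

import torch
from transformers import AutoModelForCausalLM

# Example checkpoint; any causal LM augmented with per-layer routers would do.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Freeze the backbone: only parameters belonging to router modules stay trainable.
for name, param in model.named_parameters():
    param.requires_grad = "router" in name

trainable = [p for p in model.parameters() if p.requires_grad]
print(f"trainable parameters: {sum(p.numel() for p in trainable):,}")

# Standard fine-tuning loop from here on; gradients only flow into the small router weights.
optimizer = torch.optim.AdamW(trainable, lr=1e-4)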

2 · Attention-Based Dynamic Depth

Applies dynamic-depth routing at the attention level, with token-level (attn_token) or sequence-level (attn_sequence) granularity, to improve both compute and memory efficiency.

Preserves output quality under dynamic-depth execution while enabling flexible deployment at different compute budgets.
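As a rough illustration, a decoder block with an attention-level router might look like the following PyTorch sketch. The class name, gating scheme, and granularity handling are assumptions made for exposition, not the paper's exact implementation.

import torch
import torch.nn as nn

class AttentionSkipBlock(nn.Module):
    """Hypothetical decoder sub-block whose attention sublayer can be skipped by a router."""

    def __init__(self, attn: nn.Module, hidden_size: int, granularity: str = "attn_token"):
        super().__init__()
        self.attn = attn                          # frozen attention sublayer
        self.router = nn.Linear(hidden_size, 1)   # the only part trained by Router-Tuning
        self.granularity = granularity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = torch.sigmoid(self.router(x))           # (batch, seq, 1) keep-probabilities
        if self.granularity == "attn_sequence":
            scores = scores.mean(dim=1, keepdim=True)     # one decision per sequence
        # Soft gating keeps training differentiable; at inference, positions with
        # low scores can skip the attention sublayer entirely.
        return x + scores * self.attn(x)

Sequence-level routing trades some flexibility for simpler batching, since an entire sequence either runs or skips the attention sublayer.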

Experiments

Results

Router-Tuning improves the efficiency-quality tradeoff over full-parameter dynamic-depth baselines while keeping routing behavior stable.

Main Benchmark Results

Router-Tuning achieves strong speedups with small quality degradation across evaluated settings.

Main benchmark results of Router-Tuning
Figure 1 — Efficiency vs. quality tradeoff across benchmarks. Router-Tuning (RT) vs. full-parameter baselines.

Expert Routing Analysis

Router specialization becomes clearer after tuning: the model learns more stable token-to-layer routing patterns, enabling dynamic-depth execution with less unnecessary computation.

Expert-level routing behavior analysis
Figure 2 — Routing pattern visualization before and after Router-Tuning.

LoRA Compatibility

Router-Tuning composes naturally with LoRA-based adaptation, enabling lightweight deployment recipes without full-model retraining.
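For example, with the PEFT library one could attach LoRA adapters to the frozen backbone while keeping the router modules fully trainable. The module names below (q_proj, v_proj, router) are assumptions about a Llama-style backbone, not the paper's exact configuration.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # typical LoRA targets in Llama-style models
    modules_to_save=["router"],           # assumed router module name; trained in full
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()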

LoRA and Router-Tuning combined results
Figure 3 — LoRA + Router-Tuning combined results. Efficiency-performance balance under joint adaptation.
Reference

Citation

BibTeX
@misc{he2024routertuningsimpleeffectiveapproach,
  title         = {Router-Tuning: A Simple and Effective Approach
                   for Enabling Dynamic-Depth in Transformers},
  author        = {Shwai He and Tao Ge and Guoheng Sun and
                   Bowei Tian and Xiaoyang Wang and Dong Yu},
  year          = {2024},
  eprint        = {2410.13184},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2410.13184}
}