ICLR 2026

Capacity-Aware Inference for Mixture-of-Experts Models

Mitigating the straggler effect in sparse MoE inference by bounding per-expert load at inference time, with no retraining.

Problem

Expert parallelism waits for the slowest experts.

Sparse MoE models activate only a subset of experts per token. Under expert parallelism, imbalanced routing can overload a few experts, and the full system waits for those stragglers to finish.

The bottleneck is not average work.

End-to-end latency is dominated by tail expert load. A small set of overloaded experts can erase the expected efficiency gains of sparse activation.

Highlights: tail load dominates synchronization; the model stays fixed, with inference-time control only; evaluated across language and multimodal MoE models.

Straggler effect in sparse MoE inference

A few overloaded experts determine the global step time while the other devices sit idle.

Method

Capacity bounds make overload explicit.

Capacity-Aware Inference regulates expert load through a capacity factor and then either drops overflow tokens or redirects candidates to underused local experts before dropping.

C = γ · N,   N = T · k / E

C is the per-expert capacity, γ is the capacity factor, T is the token count, k is the number of experts selected per token (top-k routing), and E is the number of experts.
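
As a concrete check of the formula, a minimal sketch (function and variable names are ours, not the repository's API):

```python
def expert_capacity(num_tokens: int, top_k: int, num_experts: int,
                    capacity_factor: float) -> int:
    """Per-expert token budget C = gamma * N, with N = T * k / E."""
    average_load = num_tokens * top_k / num_experts  # N: mean tokens per expert
    return int(capacity_factor * average_load)       # C: hard cap per expert

# Example: 1024 tokens, top-2 routing, 8 experts, gamma = 1.25
print(expert_capacity(1024, 2, 8, 1.25))  # -> 320
```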

Capacity-Aware Token Drop

Bound each expert by capacity and drop overflow tokens routed to already overloaded experts.
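
A minimal sketch of this rule (our own simplified code, not the repository's implementation), assuming the router has already resolved each token to a single expert id:

```python
from collections import defaultdict

def capacity_aware_token_drop(expert_ids, capacity):
    """Return a keep-mask: at most `capacity` tokens per expert, in routing
    order. Overflow tokens are dropped, i.e. they skip the expert and pass
    through the residual path unchanged."""
    counts = defaultdict(int)
    keep = []
    for e in expert_ids:
        under = counts[e] < capacity
        counts[e] += under  # bool adds as 0/1
        keep.append(under)
    return keep

# Tokens 0, 1, and 3 all target expert 0; with capacity 2, token 3 is dropped.
print(capacity_aware_token_drop([0, 0, 1, 0, 1], capacity=2))
# -> [True, True, True, False, True]
```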

Capacity-Aware Expanded Drop

Expand each token's candidate expert set toward low-load local experts first, then apply capacity constraints, yielding a stronger throughput-quality tradeoff.
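
The redirect-before-drop behavior can be sketched as follows (simplified illustration, not the repository's code; the real routing also weighs gating scores and device locality):

```python
def capacity_aware_expanded_drop(candidates, capacity, num_experts):
    """Assign each token to its first candidate expert with spare capacity;
    drop it (None) only if every candidate is full. `candidates[i]` is that
    token's score-ordered, expanded list of local expert ids."""
    load = [0] * num_experts
    assignment = []
    for ranked in candidates:
        chosen = next((e for e in ranked if load[e] < capacity), None)
        if chosen is not None:
            load[chosen] += 1
        assignment.append(chosen)
    return assignment

# Three tokens all prefer expert 0; with capacity 1, the second token is
# redirected to expert 1 and only the third is dropped.
print(capacity_aware_expanded_drop([[0, 1], [0, 1], [0, 1]], 1, 2))
# -> [0, 1, None]
```

Compared with plain Token Drop, the same workload here loses one token instead of two, which is the source of the better quality at a given capacity.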

Results

Expanded Drop preserves quality while reducing stragglers.

The reported experiments compare baseline routing, Token Drop, and Expanded Drop across language and multimodal MoE models.

Main results comparing baseline, Token Drop, and Expanded Drop

Layer-level speedup

Capacity-aware control reduces overloaded expert work across MoE layers.

End-to-end speedup

Capacity-aware inference reduces the wall-clock latency impact of overloaded experts.

Multimodal results on MMBench

Multimodal applicability

The same inference-time idea transfers to multimodal MoE evaluation.

Practical controls

Set EXPERT_CAPACITY and STRATEGY in the evaluation scripts to choose the latency-quality operating point.

Lower capacity factors cap expert load more aggressively. Expanded Drop first redirects tokens to underloaded experts and only drops what still exceeds capacity.
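
To see how the capacity factor moves the operating point, a toy sweep (the loads, helper name, and numbers are illustrative, not measured results):

```python
def drop_rate(loads, capacity):
    """Fraction of routed tokens exceeding a hard per-expert capacity,
    i.e. the worst case with no redirection (plain Token Drop)."""
    overflow = sum(max(0, load - capacity) for load in loads)
    return overflow / sum(loads)

loads = [600, 300, 80, 44]  # imbalanced routing of 1024 tokens over 4 experts
for gamma in (1.0, 1.25, 1.5):
    cap = int(gamma * sum(loads) / len(loads))  # C = gamma * N
    print(f"gamma={gamma}: capacity={cap}, drop_rate={drop_rate(loads, cap):.3f}")
```

Smaller γ trims more tail load but drops more tokens under plain Token Drop; Expanded Drop recovers part of that loss by filling experts that still have spare capacity.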

Resources

Run the capacity-aware evaluation.

The repository includes modified Hugging Face MoE modeling files, language evaluation scripts, and a multimodal evaluation pipeline through VLMEvalKit.

conda create -n capacity-moe python=3.10 -y
conda activate capacity-moe
pip install -r requirements.txt

cd lm-evaluation-harness
bash runs_prune/eval_baseline.sh
bash runs_prune/eval_capacity.sh

Citation

Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts.

Shwai He, Weilin Cai, Jiayi Huang, Ang Li. ICLR 2026.