TMLR 2026

Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping

A unified study of architectural redundancy in Transformer-based LLMs, with practical pipelines for block dropping, attention/MLP layer dropping, joint layer dropping, and post-training quantization.

Shwai He*, Guoheng Sun*, Zheyu Shen, Ang Li

University of Maryland, College Park

[Figure] LLM-Drop layer dropping overview: LLM-Drop studies Transformer redundancy through block drop, layer drop, joint layer drop, and quantization-aware evaluation.

Motivation

Not every Transformer component is equally necessary.

LLM-Drop examines architectural redundancy across Transformer blocks and sublayers, then turns that analysis into reproducible dropping and benchmarking pipelines.

Unified dropping study: The project compares block dropping, attention-layer dropping, MLP-layer dropping, and joint layer dropping under one framework.
Practical model editing: Dropped configurations are represented through custom model files and updated configs, making the resulting checkpoints loadable with Transformers.
Efficiency evaluation: The repo includes task performance, inference speed, and optional AWQ/GPTQ quantization workflows.
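To make the dropping idea concrete, here is a minimal, dependency-free sketch of importance-based sublayer selection. The page does not state which importance metric the repo uses, so this example assumes a common proxy: a sublayer whose output is nearly identical to its input (high input/output cosine similarity) acts like an identity map and is a candidate for dropping. The `records` layout and sublayer names are illustrative, not the repo's actual data structures.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def importance(hidden_in, hidden_out):
    # Assumed proxy: a sublayer whose output closely matches its input
    # behaves like an identity map, so high similarity -> low importance.
    return 1.0 - cosine_similarity(hidden_in, hidden_out)

def select_drops(records, num_drop):
    # records: {sublayer_name: (input_vector, output_vector)},
    # captured once on a small calibration set (hypothetical format).
    scores = {name: importance(x, y) for name, (x, y) in records.items()}
    return sorted(scores, key=scores.get)[:num_drop]

records = {
    "layer3.attn": ([1.0, 2.0, 3.0], [1.0, 2.0, 3.1]),   # near-identity
    "layer3.mlp":  ([1.0, 2.0, 3.0], [-2.0, 0.5, 1.0]),  # transforms strongly
}
print(select_drops(records, 1))  # → ['layer3.attn']
```

With real hidden states the same ranking procedure would run over every attention and MLP sublayer, which is what lets one framework cover block, layer, and joint dropping: they differ only in which sublayers enter the candidate pool.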

Pipeline

From importance estimation to dropped checkpoints.

The pipeline estimates module importance, selects layers or blocks to remove, writes updated model configs, and benchmarks the resulting model.

Block Drop: Remove full Transformer blocks when both attention and MLP sublayers are selected.
Layer Drop: Drop attention or MLP sublayers independently to study subcomponent redundancy.
Joint Layer Drop: Combine attention and MLP dropping decisions for a broader compression schedule.
Quantization: Evaluate dropped models with optional AWQ and GPTQ post-training quantization.
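The "writes updated model configs" step above can be sketched as follows. The field names (`drop_attn_layers`, `drop_mlp_layers`, `drop_blocks`) are hypothetical stand-ins for whatever schema the repo's custom model classes actually read; the point is that a dropped checkpoint is just a base config plus a record of which sublayer indices to skip, with indices dropped in both lists amounting to a full block drop.

```python
import json

def write_drop_config(base_config, dropped):
    """Record dropped sublayers in a config dict.

    Field names are illustrative, not the repo's actual schema.
    `dropped` holds names like "layer5.attn" or "layer5.mlp".
    """
    config = dict(base_config)
    config["drop_attn_layers"] = sorted(
        int(name.split(".")[0].removeprefix("layer"))
        for name in dropped if name.endswith(".attn")
    )
    config["drop_mlp_layers"] = sorted(
        int(name.split(".")[0].removeprefix("layer"))
        for name in dropped if name.endswith(".mlp")
    )
    # An index appearing in both lists corresponds to a full block drop.
    config["drop_blocks"] = sorted(
        set(config["drop_attn_layers"]) & set(config["drop_mlp_layers"])
    )
    return config

base = {"model_type": "llama", "num_hidden_layers": 32}
updated = write_drop_config(base, ["layer5.attn", "layer5.mlp", "layer11.attn"])
print(json.dumps(updated, indent=2))
```

A custom model class would then consult these lists in its forward pass, replacing the listed sublayers with identity so the edited checkpoint loads and runs through the standard Transformers API.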

Resources

Paper, code, checkpoints, and benchmarks.

The repository includes the dropping pipeline, benchmark scripts, custom dropped-model classes, and released checkpoints.

News

Project updates.

  • Feb 2026: The paper is published in Transactions on Machine Learning Research.
  • May 2025: The project received the Qualcomm Innovation Fellowship North America award for efficiency-optimized Transformer architectures.
  • Sep 2024: Dropped-model checkpoints were released on Hugging Face.

Citation

Cite this work.

If LLM-Drop helps your research, please cite the corresponding paper.

@article{he2026uncovering,
  title={Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping},
  author={Shwai He and Guoheng Sun and Zheyu Shen and Ang Li},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2026},
  url={https://openreview.net/forum?id=1I7PCbOPfe}
}