FloE: On-the-Fly MoE Inference on Memory-constrained GPU

Published in ICML, 2025

Recommended citation: Zhou, Y., Li, Z., Zhang, J., Wang, J., Wang, Y., Xie, Z., Chen, K., & Shou, L. (2025). FloE: On-the-Fly MoE Inference on Memory-constrained GPU. arXiv. https://arxiv.org/abs/2505.05950 https://arxiv.org/pdf/2505.05950v2

FloE is an on-the-fly MoE inference system for memory-constrained GPUs, founded on the insight that substantial untapped redundancy exists within sparsely activated experts.
