Mixture of Experts in Large-scale and Multimodal Models
Speaker: Huy Minh Nguyen (UT Austin)
Date: 4/22/25
Abstract: The mixture of experts (MoE) framework has recently emerged as an effective approach to enhancing the efficiency and scalability of machine learning models by aggregating the power of multiple sub-models, called experts, through an adaptive gating network. In this talk, I will present our investigation into two MoE components that are central to the success of the DeepSeek-V3 language model. In particular, I will first demonstrate the benefits of the normalized sigmoid gating mechanism over the conventional softmax gating. Then, I will examine the effects of the shared expert structure on expert convergence behavior. Next, I will introduce the connection between MoE and the self-attention mechanism in the Transformer architecture, as well as its applications to parameter-efficient fine-tuning methods. Finally, in the context of multimodal learning, where data consist of different modalities such as time series, text, and images, I will highlight our MoE router designs for integrating these modalities, followed by an empirical comparison with previous methods in the literature.
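
For readers unfamiliar with the gating comparison mentioned above, the sketch below contrasts conventional softmax gating with a normalized sigmoid gating rule (score each expert independently, keep the top-k, renormalize the kept scores), alongside an always-active shared expert. This is a minimal illustrative toy, not the speaker's implementation: the layer sizes, expert definitions, `top_k`, and the dense per-expert computation are assumptions chosen for clarity.

```python
# Toy top-k MoE layer: softmax gating vs. normalized sigmoid gating,
# plus a single always-active shared expert. Illustrative sketch only;
# all dimensions and expert structures are assumptions, not the talk's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def ffn(d_model):
    # Simple feed-forward expert block.
    return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                         nn.Linear(4 * d_model, d_model))


class ToyMoE(nn.Module):
    def __init__(self, d_model=32, n_experts=8, top_k=2, gating="sigmoid"):
        super().__init__()
        self.top_k = top_k
        self.gating = gating
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([ffn(d_model) for _ in range(n_experts)])
        self.shared_expert = ffn(d_model)  # always contributes, bypasses routing

    def forward(self, x):                     # x: (batch, d_model)
        logits = self.router(x)               # (batch, n_experts)
        if self.gating == "softmax":
            # Conventional softmax gating: scores are coupled across experts
            # and already sum to 1 before the top-k selection.
            scores = F.softmax(logits, dim=-1)
            weights, idx = scores.topk(self.top_k, dim=-1)
        else:
            # Normalized sigmoid gating: each expert is scored independently,
            # then the selected top-k scores are renormalized to sum to 1.
            scores = torch.sigmoid(logits)
            weights, idx = scores.topk(self.top_k, dim=-1)
            weights = weights / weights.sum(dim=-1, keepdim=True)

        out = self.shared_expert(x)
        for e, expert in enumerate(self.experts):
            sel = (idx == e)                                   # (batch, top_k) bool
            if sel.any():
                w = (weights * sel).sum(dim=-1, keepdim=True)  # 0 for unrouted tokens
                out = out + w * expert(x)   # dense compute for clarity, not efficiency
        return out


if __name__ == "__main__":
    x = torch.randn(4, 32)
    print(ToyMoE(gating="softmax")(x).shape)   # torch.Size([4, 32])
    print(ToyMoE(gating="sigmoid")(x).shape)   # torch.Size([4, 32])
```

One way to read the difference: under softmax, raising one expert's score necessarily suppresses the others, whereas the sigmoid scores are decoupled and only the selected subset is renormalized; the talk examines the consequences of this choice, along with the role of the shared expert, in detail.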