1⃣Routing weights (RW) in MoE LLMs provide a training-free embedding that complements the widely used hidden states (HS)
2⃣MoEE (RW + HS) beats standalone HS by +23%
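
A minimal sketch of what "RW + HS" could look like in practice, assuming each input already has a pooled hidden-state vector and a pooled routing-weight vector. The function names, the `alpha` weight, and the two combination modes (concatenation and a weighted sum of similarities) are illustrative assumptions, not the authors' confirmed implementation.

```python
import numpy as np

def l2_normalize(x):
    # Scale vectors to unit length along the last axis.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def moee_concat(hs, rw):
    # Hypothetical combination 1: concatenate the normalized HS and RW vectors
    # into a single embedding.
    return np.concatenate([l2_normalize(hs), l2_normalize(rw)], axis=-1)

def moee_sum_similarity(hs_a, rw_a, hs_b, rw_b, alpha=1.0):
    # Hypothetical combination 2: score a pair by HS cosine similarity plus
    # an alpha-weighted RW cosine similarity (alpha is an assumed knob).
    def cos(u, v):
        return float(l2_normalize(u) @ l2_normalize(v))
    return cos(hs_a, hs_b) + alpha * cos(rw_a, rw_b)

# Toy vectors standing in for pooled hidden states and routing weights.
hs = np.array([3.0, 4.0])
rw = np.array([1.0, 0.0, 0.0])
emb = moee_concat(hs, rw)
```

Normalizing each part before combining keeps the larger-magnitude HS vector from drowning out the RW signal.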