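Training compute scales roughly as O(n^2) with parameter count n when the training data grows in proportion to the model, but MoE models activate far fewer parameters per token, so their compute tracks the active count rather than the total.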
[1] www.interconnects.ai/p/deepseek-v...
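
As a rough sketch of the arithmetic behind this claim, the snippet below uses the common rule of thumb that training costs about 6 FLOPs per active parameter per token. The function names, the parameter counts, the token budget, and the 20-tokens-per-parameter ratio are illustrative assumptions, not figures from the linked post.

```python
def training_flops(active_params: float, tokens: float) -> float:
    """Approximate training FLOPs with the rule of thumb C ~ 6 * N * D."""
    return 6.0 * active_params * tokens


def scaled_data_flops(params: float, tokens_per_param: float = 20.0) -> float:
    """Compute when tokens grow in proportion to parameters (assumed ratio).

    With D = k * N, the estimate becomes C ~ 6 * k * N^2, i.e. quadratic in N.
    """
    return training_flops(params, tokens_per_param * params)


# Quadratic growth: doubling the parameter count quadruples the estimate.
growth = scaled_data_flops(2e9) / scaled_data_flops(1e9)

# Hypothetical MoE model: large total size, small active subset per token.
total_params = 600e9   # assumed total parameters
active_params = 40e9   # assumed parameters used per token
tokens = 10e12         # assumed training tokens

dense_cost = training_flops(total_params, tokens)  # if every parameter were active
moe_cost = training_flops(active_params, tokens)   # only active parameters count
savings = dense_cost / moe_cost

print(f"doubling params multiplies compute by {growth:.1f}")
print(f"dense-equivalent: {dense_cost:.2e} FLOPs, MoE: {moe_cost:.2e} FLOPs")
print(f"MoE uses about 1/{savings:.0f} of the dense-equivalent compute")
```

Under these assumptions the saving is simply the ratio of total to active parameters; the estimate ignores routing overhead, memory, and communication costs.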