But stage 1 looks to me like structured pruning on the activations. What I am curious about is whether this approach helps improve inference speed. I presume we would then not need to do the compute for certain heads.