David Espejo
davidmirror.bsky.social
Avid learner. OSS maintainer. Books, running, DistSys, and dogs. Proud father and husband.
2) A more efficient, though slightly more complicated, approach is to start with one large MHA layer, compute the Q, K, and V vectors in a single operation, and then split them so they resemble multiple heads, using PyTorch's .view and .transpose methods.
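A rough sketch of that idea, in case it helps. The sizes, the names (qkv_proj, split_heads), and the bias-free projection are my own assumptions for illustration, not a reference implementation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumed, not from any particular model).
batch, seq_len, d_model, num_heads = 2, 5, 16, 4
head_dim = d_model // num_heads

# One large linear layer computes Q, K and V in a single matmul.
qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)

x = torch.randn(batch, seq_len, d_model)
qkv = qkv_proj(x)                    # (batch, seq_len, 3 * d_model)
q, k, v = qkv.chunk(3, dim=-1)       # each (batch, seq_len, d_model)

def split_heads(t):
    # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, head_dim)
    return t.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

q, k, v = split_heads(q), split_heads(k), split_heads(v)

# Scaled dot-product attention, computed for all heads at once.
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5   # (batch, heads, seq, seq)
weights = torch.softmax(scores, dim=-1)
context = weights @ v                                # (batch, heads, seq, head_dim)

# Merge the heads back into a single d_model-sized vector per token.
out = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
```

The .contiguous() call before the final .view is needed because .transpose returns a non-contiguous tensor; a real layer would usually also apply an output projection after merging the heads.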
November 21, 2024 at 10:39 AM