https://www.rpisoni.dev/
I tried replacing dot-product attention with the negative squared KQ-distance and was able to remove the softmax without issues or loss in performance!
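A minimal PyTorch sketch of the score computation described above (my reading, not code from the thread): the dot-product logits are swapped for -||q - k||², expanded so the pairwise difference tensor is never materialised. How the scores become weights once the softmax is gone isn't spelled out in the post, so the normalisation in the usage snippet is only a placeholder assumption.

```python
import torch

def neg_sq_dist_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Attention scores as the negative squared query-key distance.

    Uses -||q - k||^2 = 2*q.k - ||q||^2 - ||k||^2 so the
    (len_q, len_k, head_dim) difference tensor is never built.
    """
    qk = q @ k.transpose(-2, -1)                                  # (..., len_q, len_k)
    q_sq = q.pow(2).sum(dim=-1, keepdim=True)                     # (..., len_q, 1)
    k_sq = k.pow(2).sum(dim=-1, keepdim=True).transpose(-2, -1)   # (..., 1, len_k)
    return 2.0 * qk - q_sq - k_sq

# Example shapes: (batch, heads, seq_len, head_dim)
q, k, v = (torch.randn(2, 8, 16, 64) for _ in range(3))
scores = neg_sq_dist_scores(q, k)
# No softmax here; dividing by the number of keys is just a placeholder
# for whatever normalisation (if any) the original experiment used.
out = (scores / k.shape[-2]) @ v
```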
As you can see, the model really cares about the semantics of the background too. Hope this gets better in the second half of training.🤞
I thought the flowers were important too, but the model doesn't think so.
🧵
But I think it has value to be able to balance losses with changing magnitudes explicitly, so I came up with a variant that leaves the grads unaffected while fixing the loss magnitude to 1.0.
WDYT now?
bsky.app/profile/4rte...
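A minimal PyTorch sketch of what "grads unaffected, magnitude fixed to 1.0" could look like (one reading of the post above, not its actual code): the detached term cancels the raw loss value in the forward pass, so the reported loss is always 1.0, while the backward pass only sees the raw loss.

```python
import torch

def unit_magnitude(loss: torch.Tensor) -> torch.Tensor:
    # Forward: loss - loss + 1.0 == 1.0, so the reported magnitude is constant.
    # Backward: the detached term carries no gradient, so the grads are
    # exactly those of the raw loss.
    return loss - loss.detach() + 1.0

x = torch.tensor(2.0, requires_grad=True)
raw = x ** 2                  # raw loss value: 4.0
out = unit_magnitude(raw)     # reported value: 1.0
out.backward()
print(out.item(), x.grad)     # 1.0 tensor(4.)  -> gradient of x**2 at x=2
```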
BTW thanks to @merve.bsky.social and Niels for the pic!🤗
It's the padding! Let me show you how to fix it!🧵 #mlsky