bram
bwasti.bsky.social
meta ai, focused on efficient inference (and reasoning for the next couple months)
rough tensor core (TC) times

1 ns - 1000 TC operations
10 ns - load 1000 TCs from shared memory
100 ns - load a single TC from CPU

1 us - 1B float operations
10 us - launch a CUDA kernel
100 us - pytorch call

1 ms - move 1GB from GPU RAM
10 ms - move 1GB btwn 2 GPUs
100 ms - move 1GB from CPU
October 30, 2024 at 10:28 PM
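The rough times in the post above imply some throughput and bandwidth figures, which the sketch below works out. It only does arithmetic on the numbers as posted; the dictionary keys and function names are made up for illustration, and nothing here is measured on real hardware.

```python
# Times copied from the post, stored as integer nanoseconds so the math is exact.
US, MS = 1_000, 1_000_000

TIMES_NS = {
    "1B float operations": 1 * US,
    "launch a CUDA kernel": 10 * US,
    "pytorch call": 100 * US,
    "move 1GB from GPU RAM": 1 * MS,
    "move 1GB between 2 GPUs": 10 * MS,
    "move 1GB from CPU": 100 * MS,
}


def implied_gb_per_s(time_ns: int) -> int:
    """Bandwidth (GB/s) implied by moving 1 GB in `time_ns` nanoseconds."""
    return 1_000_000_000 // time_ns


def implied_ops_per_s(time_ns: int, ops: int = 1_000_000_000) -> int:
    """Operations per second implied by doing `ops` operations in `time_ns`."""
    return ops * 1_000_000_000 // time_ns


def ops_during(name: str) -> int:
    """Float operations that fit in the named event, at the implied rate."""
    rate = implied_ops_per_s(TIMES_NS["1B float operations"])
    return rate * TIMES_NS[name] // 1_000_000_000


if __name__ == "__main__":
    for name in ("move 1GB from GPU RAM", "move 1GB between 2 GPUs", "move 1GB from CPU"):
        print(f"{name}: {implied_gb_per_s(TIMES_NS[name])} GB/s")
    print("float ops per second:", implied_ops_per_s(TIMES_NS["1B float operations"]))
    print("float ops per kernel launch:", ops_during("launch a CUDA kernel"))
```

Read this way, one kernel launch costs about as much time as 10B float operations, and a pytorch call about ten kernel launches.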
bf16 is to fp16 as fp8-e5m2 is to fp8-e4m3
October 30, 2024 at 1:41 PM
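One way to read the analogy above: in each pair, the first format spends more bits on the exponent (range) and fewer on the mantissa (precision). The sketch below makes that concrete. The `FloatFormat` class is a hypothetical helper, and it assumes "fp8-e4m3" means the OCP variant that has no infinities and reserves only one NaN pattern per sign, which is why its maximum is 448.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    name: str
    exp_bits: int
    man_bits: int
    ieee_like: bool = True  # False: top exponent code is also used for finite values

    @property
    def bias(self) -> int:
        return 2 ** (self.exp_bits - 1) - 1

    @property
    def max_value(self) -> float:
        if self.ieee_like:
            # Top exponent code is reserved for inf/NaN.
            max_exp = (2 ** self.exp_bits - 2) - self.bias
            max_frac = 2 - 2.0 ** -self.man_bits
        else:
            # Only the all-ones mantissa at the top exponent code is NaN.
            max_exp = (2 ** self.exp_bits - 1) - self.bias
            max_frac = 2 - 2 * 2.0 ** -self.man_bits
        return max_frac * 2.0 ** max_exp

    @property
    def epsilon(self) -> float:
        """Gap between 1.0 and the next representable value."""
        return 2.0 ** -self.man_bits


BF16 = FloatFormat("bf16", exp_bits=8, man_bits=7)
FP16 = FloatFormat("fp16", exp_bits=5, man_bits=10)
E5M2 = FloatFormat("fp8-e5m2", exp_bits=5, man_bits=2)
E4M3 = FloatFormat("fp8-e4m3", exp_bits=4, man_bits=3, ieee_like=False)

if __name__ == "__main__":
    for wide, precise in ((BF16, FP16), (E5M2, E4M3)):
        for fmt in (wide, precise):
            print(f"{fmt.name}: max={fmt.max_value:g} eps={fmt.epsilon:g}")
```

In both pairs the first format reaches larger values while the second has the smaller epsilon.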
when it comes to adding data to an LLM, i tend to follow this token count rule

one comma: put it in the prompt
two commas: use RAG
three commas: finetune
four commas: pretrain (i’ll be honest i’ve never actually hit this one 😜)
October 30, 2024 at 12:06 AM
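The comma rule above can be sketched as a small function that formats the token count with thousands separators and counts the commas. The function name is hypothetical, and two edge cases are assumptions the post does not cover: counts under 1,000 (zero commas) are treated like the prompt case, and anything beyond four commas falls through to pretraining.

```python
def strategy_for(token_count: int) -> str:
    """Map a token count to the rule of thumb, based on commas in the number."""
    if token_count < 0:
        raise ValueError("token_count must be non-negative")

    commas = f"{token_count:,}".count(",")

    if commas <= 1:  # zero commas is an assumption; the rule starts at one
        return "put it in the prompt"
    if commas == 2:
        return "use RAG"
    if commas == 3:
        return "finetune"
    return "pretrain"  # four commas; more than four is an assumption


if __name__ == "__main__":
    for n in (50_000, 2_000_000, 3_000_000_000, 1_000_000_000_000):
        print(f"{n:,} tokens -> {strategy_for(n)}")
```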