1 ns - 1000 TC operations
10 ns - load 1000 TCs from shared memory
100 ns - load a single TC from CPU
1 us - 1B float operations
10 us - launch a CUDA kernel
100 us - PyTorch call
1 ms - move 1GB from GPU RAM
10 ms - move 1GB between 2 GPUs
100 ms - move 1GB from CPU
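
To sanity-check the transfer and compute rows, here is a rough sketch that turns them into implied throughput. It assumes "1GB" means 1e9 bytes and "1B" means 1e9 operations; the names `transfers_s` and `implied_bandwidth_gb_s` are made up for illustration, and the figures are only the order-of-magnitude numbers from the list, not measurements.

```python
# Implied throughput from the order-of-magnitude figures in the list.
# Assumption: "1GB" = 1e9 bytes, "1B" = 1e9 operations.

# Time in seconds to move 1 GB, taken from the list.
transfers_s = {
    "from GPU RAM": 1e-3,      # 1 ms
    "between 2 GPUs": 10e-3,   # 10 ms
    "from CPU": 100e-3,        # 100 ms
}


def implied_bandwidth_gb_s(seconds_per_gb: float) -> float:
    """Bandwidth in GB/s implied by moving 1 GB in the given time."""
    return 1.0 / seconds_per_gb


for name, seconds in transfers_s.items():
    print(f"move 1GB {name}: ~{implied_bandwidth_gb_s(seconds):.0f} GB/s")

# "1 us - 1B float operations" implies this many operations per second.
implied_flops = 1e9 / 1e-6
print(f"float ops: ~{implied_flops:.0e} per second")
```

Each step down the list costs about 10x more time, so each transfer path has about 10x less implied bandwidth than the one before it.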
one comma: put it in the prompt
two commas: use RAG
three commas: finetune
four commas: pretrain (i’ll be honest i’ve never actually hit this one 😜)
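
A minimal sketch of the comma rule as code. The list does not say what is being counted, so this assumes the commas are the thousands separators in your token count (1,000 has one comma, 1,000,000 has two, and so on). The function names are hypothetical, and the handling of zero commas and of more than four commas is a guess, not something the list covers.

```python
def count_commas(n_tokens: int) -> int:
    """Number of thousands separators when n_tokens is written like 1,000,000."""
    return f"{n_tokens:,}".count(",")


def suggest_approach(n_tokens: int) -> str:
    """Map a token count to the rule of thumb from the list.

    Assumption: fewer than one comma is treated like one comma, and
    more than four commas is treated like four.
    """
    commas = count_commas(n_tokens)
    if commas <= 1:
        return "put it in the prompt"
    if commas == 2:
        return "use RAG"
    if commas == 3:
        return "finetune"
    return "pretrain"


for n in (5_000, 2_000_000, 3_000_000_000, 1_000_000_000_000):
    print(f"{n:,} tokens -> {suggest_approach(n)}")
```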