Leviathan Gamer
@leviathangamer.bsky.social
It is for loading assets; the game freezes until it is done. On the original Xbox this is fairly quick most of the time, loading from a DVD disc. On faster storage like an HDD it is just short frame drops.
October 25, 2025 at 7:20 PM
It is pretty nice of Microsoft to switch everyone to Linux for free.
October 16, 2025 at 2:03 PM
AMD leaked a driver that runs dp4a, which supports RDNA2 GPUs.
October 14, 2025 at 5:20 PM
What will probably happen is DX13 (or alternatively DX12_3) will add it, AMD will use DGF to implement it, and Nvidia will use RTX Mega Geometry.
August 7, 2025 at 12:06 PM
it grows dramatically, which impacts the Switch 2. Does that make sense?
June 26, 2025 at 2:48 AM
Tensor Core throughput, however, is just one portion of DLSS (the math-heavy portion). As we scale up, the Tensor Cores plow through the math-heavy portion and we get scheduling-limited in the non-Tensor-Core part, but if we scale down, the portion of DLSS that is Tensor Core heavy...
June 26, 2025 at 2:47 AM
is that we are trying to determine whether the Tensor Cores are Compute or Memory Limited. Since we can properly feed them, we are very Compute Limited across all Nvidia GPUs (that is why we do both per Clock and per SM, to make sure we can rule on all of them)...
June 26, 2025 at 2:45 AM
So for context: I got a Bachelor's in Computer Engineering and a Master's in Electrical Engineering, so I am coming at this from a chip-design perspective. Maybe referring to it as Bytes per Clock per SM would make it easier to understand. The reason we are doing this...
June 26, 2025 at 2:43 AM
You can see there are 4 sub-partitions in the block diagram. Nvidia's peak throughput assumes all 4 instructions issued each clock go to either the CUDA Cores or the Tensor Cores. So at 2 to the CUDA Cores and 2 to the Tensor Cores, we are at half peak for both.
June 26, 2025 at 2:42 AM
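The issue-slot arithmetic in that post can be sketched as a toy model: each SM has 4 sub-partitions, each issuing one instruction per clock, and the quoted peak numbers assume all 4 slots go to one unit type. Splitting the slots halves both peaks. This is a simplified illustration of the post's claim, not a scheduler simulation:

```python
ISSUE_SLOTS_PER_SM = 4  # one instruction issued per sub-partition per clock

def fraction_of_peak(slots_to_unit):
    """Fraction of a unit's peak throughput achieved when that unit
    receives `slots_to_unit` of the SM's 4 issue slots each clock."""
    return slots_to_unit / ISSUE_SLOTS_PER_SM

# All 4 slots to one unit type: 100% of that unit's peak, 0% of the other's.
print(fraction_of_peak(4))  # 1.0
# Split 2/2 between CUDA Cores and Tensor Cores: half peak for both.
print(fraction_of_peak(2))  # 0.5
```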
I never said that Ada/Blackwell don't change the "MMA Ops Compute Capability". I stated they don't change SM Bytes per Clock, which they don't. The reason I mention this is because we are talking about whether it is Memory Limited, and it isn't.
June 26, 2025 at 1:56 AM
cycles to the Tensor Cores on the Switch at 60fps (or 8ms of frametime).
June 26, 2025 at 1:53 AM
1) I never said that you were right. Quote me or give up.
2) The peak throughput listed for Nvidia GPUs assumes you issue to the CUDA Cores 100% of the time and to the Tensor Cores 0% of the time. The peak Tensor Core throughput assumes the opposite.
3) This results in giving up half your clock...
June 26, 2025 at 1:52 AM
FP8 and FP4 are not used in the CNN model; FP8 is used in the Transformer model. Also, FP8 is half the size of FP16, so you get twice the throughput: FP16 is 2 Bytes while FP8 is 1 Byte. The same applies to FP4.
June 26, 2025 at 1:39 AM
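The byte-size arithmetic in that post can be sketched directly: halving the element size doubles how many elements fit through the same bytes-per-clock pipe. A minimal sketch (element sizes as stated in the post; FP4 packing is an assumption of two values per byte):

```python
# Element sizes in bytes.
FP16_BYTES = 2.0
FP8_BYTES = 1.0
FP4_BYTES = 0.5  # assumed: two FP4 values packed per byte

def throughput_multiplier(baseline_bytes, dtype_bytes):
    """How many more elements per clock fit through the same byte
    bandwidth, relative to the baseline element size."""
    return baseline_bytes / dtype_bytes

print(throughput_multiplier(FP16_BYTES, FP8_BYTES))  # 2.0 -> FP8 doubles FP16 throughput
print(throughput_multiplier(FP16_BYTES, FP4_BYTES))  # 4.0 -> FP4 quadruples it
```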
for frametime cost if the Tensor Cores were "free" now in Ampere. That would be an apples-to-oranges comparison, so why would they confuse their devs?
3) The RTX 2050 Mobile doesn't get Memory Bandwidth Limited in the DF video. It got crippled by post-processing applied post-upscale.
June 26, 2025 at 1:37 AM
not allow you to co-issue to the CUDA Cores and Tensor Cores. This is why they mention frametime cost in the DLSS documentation. Think logically: if this were truly different, why would Nvidia have Turing and Ampere GPUs in the same chart...
June 26, 2025 at 1:36 AM
On Turing, you can't use Async Compute simultaneously with Ray Tracing Shaders, which means the RT Cores can't have workloads scheduled at the same time as the Tensor Cores, which require Compute Shaders. Ampere now allows you to run the RT Cores and Tensor Cores at the same time, but does...
June 26, 2025 at 1:34 AM
to read whitepapers. In Turing (and all subsequent Nvidia GPUs), you can't schedule an instruction to both the CUDA Cores and the Tensor Cores at the same time. People get this confused because of what the whitepaper says, but they are just reading it wrong...
June 26, 2025 at 1:32 AM
per Clock per SM to work with for any data not currently loaded into your register file. You could probably reach close to 90% Tensor Core utilization before you might hit Memory Bandwidth limits. 2) As for the Async Compute thing, I actually get this one a lot, because people don't know how...
June 26, 2025 at 1:31 AM
I will just answer everything here instead of branching at each point:
1) Blackwell/Ada do not improve FP16 Tensor Core throughput, so they consume the same amount, which is 384 Bytes per Clock, and the Register Files can feed up to 384 Bytes per Clock. You also get an additional 128 Bytes...
June 26, 2025 at 1:28 AM
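The feed-vs-consume check implied by the numbers in that post (and the continuation above it) can be sketched as a simple comparison: Tensor Cores demand 384 Bytes per Clock per SM, the register files supply 384, and an extra 128 Bytes per Clock is available for data not yet in registers. All figures are taken from the posts; this is an illustrative balance check, not a hardware model:

```python
TENSOR_CONSUME_B_PER_CLK = 384  # FP16 Tensor Core demand per SM per clock (from the post)
REGFILE_FEED_B_PER_CLK = 384    # register file supply per SM per clock (from the post)
EXTRA_FEED_B_PER_CLK = 128      # additional bytes/clock for data not yet in registers (from the post)

def compute_limited(consume, regfile_feed, extra_feed):
    """True when byte supply meets or exceeds demand, i.e. the Tensor
    Cores stay fed and the workload is compute-limited, not memory-limited."""
    return regfile_feed + extra_feed >= consume

print(compute_limited(TENSOR_CONSUME_B_PER_CLK,
                      REGFILE_FEED_B_PER_CLK,
                      EXTRA_FEED_B_PER_CLK))  # True
```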
I never said you can't see ghosting. I said you aren't seeing a 33ms pixel response time, which, if you counted the 5 separate frames in the soccer ball picture, you would know necessitates a minimum of 33ms of pixel response at that instant (4 frames / 120Hz).
June 26, 2025 at 1:22 AM
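The arithmetic behind that 33ms figure: if N distinct frames are visible at once on a display refreshing at rate R, the oldest one must still be decaying (N − 1) refresh intervals later. A minimal sketch using the numbers from the post:

```python
def min_pixel_response_ms(frames_visible, refresh_hz):
    """Minimum pixel response implied when `frames_visible` distinct
    frames are simultaneously visible on a `refresh_hz` display:
    (frames_visible - 1) refresh intervals, converted to milliseconds."""
    return (frames_visible - 1) / refresh_hz * 1000.0

# 5 overlapping frames on a 120 Hz panel -> 4 intervals -> ~33.3 ms
print(round(min_pixel_response_ms(5, 120), 1))  # 33.3
```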
You didn't answer my question. Is it possible you don't know that pixel response times are highly variable depending on the content displayed? Also, your screenshot is a poor one for judging pixel response times; every LCD could look like that if captured at the right moment.
June 26, 2025 at 12:45 AM
They trained their main CNN model around using DLSS before post-processing, which covers all the presets the Switch 2 will use.
June 26, 2025 at 12:43 AM
anything else.
June 26, 2025 at 12:41 AM
You don't get memory limited with DLSS on any platform. SM Bytes per Clock for the Tensor Cores has not changed since Ampere, and it obviously scales with SMs. Sure, you can use it like that, but now you have allocated 8ms of your 16.66ms frame time just to DLSS, which means you can't use it for...
June 26, 2025 at 12:41 AM
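The frame-budget point in that post can be made concrete: at 60fps you get 16.66ms per frame, so an 8ms DLSS pass leaves barely half the budget for everything else. A minimal sketch using the post's numbers:

```python
def frame_budget_ms(fps):
    """Total frame time available at a given frame rate."""
    return 1000.0 / fps

def remaining_after(fps, cost_ms):
    """Frame time left for the rest of the frame after a fixed-cost pass."""
    return frame_budget_ms(fps) - cost_ms

budget = frame_budget_ms(60)     # ~16.67 ms per frame at 60fps
left = remaining_after(60, 8.0)  # ~8.67 ms left after an 8 ms DLSS pass
print(round(budget, 2), round(left, 2))
```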
That isn't the intended use that they train their presets on, which makes it look even worse.
June 26, 2025 at 12:15 AM