Nish Tahir
nish.nishtahir.com.ap.brid.gy

🌉 bridged from https://nishtahir.com/ on the fediverse by https://fed.brid.gy/
GMKTec Evo-X2 Ryzen AI Max 395+ Benchmarks
I recently got my hands on a GMKTec Evo-X2 for local model inference. Here are the hardware details:

```
nish@gmktec-evo-x2:~$ sudo lshw -short
H/W path      Device  Class      Description
=========================================================
                      system     NucBox_EVO-X2 (EVO-X2-001)
/0                    bus        GMKtec
/0/0                  memory     64KiB BIOS
/0/b                  memory     1280KiB L1 cache
/0/c                  memory     16MiB L2 cache
/0/d                  memory     64MiB L3 cache
/0/e                  processor  AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
/0/11                 memory     128GiB System Memory
```

The box came with Windows 11 Pro preinstalled, which I didn't bother with and quickly replaced with Ubuntu Server.

```
nish@gmktec-evo-x2:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.3 LTS
Release:        24.04
Codename:       noble
```

# Out of the box performance

I installed `ollama` and tested a few models using the verbose option. It's worth noting these were out-of-the-box runs with no additional drivers or tooling installed. My prompt was "What is the distance between the earth and the Sun?"

I started with `gpt-oss:20b`.

```
total duration:       11.558212329s
load duration:        98.524563ms
prompt eval count:    76 token(s)
prompt eval duration: 39.185462ms
prompt eval rate:     1939.49 tokens/s
eval count:           270 token(s)
eval duration:        11.346835974s
eval rate:            23.80 tokens/s
```

`gpt-oss:120b` was next, which showed decent performance.

```
total duration:       24.107760366s
load duration:        218.341745ms
prompt eval count:    77 token(s)
prompt eval duration: 4.975562972s
prompt eval rate:     15.48 tokens/s
eval count:           277 token(s)
eval duration:        18.757952199s
eval rate:            14.77 tokens/s
```

I tested `qwen3:32b` and was quite disappointed with the performance.

```
total duration:       5m16.79871007s
load duration:        47.442582ms
prompt eval count:    19 token(s)
prompt eval duration: 924.354931ms
prompt eval rate:     20.55 tokens/s
eval count:           1393 token(s)
eval duration:        5m15.38993452s
eval rate:            4.42 tokens/s
```

# ROCm and AMD GPU driver installation

Next I installed ROCm and the AMD GPU driver by following the instructions here. I was quite surprised that the ROCm installation required 23GB of disk space.

```
$ sudo apt install rocm
...
Need to get 5345 MB of archives.
After this operation, 23.0 GB of additional disk space will be used.
Do you want to continue? [Y/n]
```

I verified the installation using `rocm-smi`.

```
$ rocm-smi
======================================== ROCm System Management Interface ========================================
================================================== Concise Info ==================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK  MCLK  Fan  Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)
==================================================================================================================
0       1     0x1586,   40251  27.0°C  5.083W    N/A, N/A, 0         N/A   N/A   0%   auto  N/A     0%     0%
==================================================================================================================
============================================== End of ROCm SMI Log ===============================================
```

Testing `qwen3:32b` showed improved performance. I assume this is the result of the updated drivers.

```
total duration:       1m49.043363047s
load duration:        51.078806ms
prompt eval count:    20 token(s)
prompt eval duration: 202.439512ms
prompt eval rate:     98.79 tokens/s
eval count:           1021 token(s)
eval duration:        1m48.316184545s
eval rate:            9.43 tokens/s
```

`gpt-oss:120b` also showed improved performance.
```
total duration:       9.300572016s
load duration:        100.106345ms
prompt eval count:    77 token(s)
prompt eval duration: 144.640986ms
prompt eval rate:     532.35 tokens/s
eval count:           295 token(s)
eval duration:        8.925786695s
eval rate:            33.05 tokens/s
```

Just under 50 tps for `gpt-oss:20b`!

```
total duration:       7.016576027s
load duration:        96.902471ms
prompt eval count:    77 token(s)
prompt eval duration: 159.437642ms
prompt eval rate:     482.95 tokens/s
eval count:           305 token(s)
eval duration:        6.602954724s
eval rate:            46.19 tokens/s
```

# Llama.cpp

To build llama.cpp from source, I first installed the build dependencies.

```
sudo apt install build-essential cmake libcurl4-openssl-dev
```

When building llama.cpp for AMD GPUs, the HIP build instructions require an `AMDGPU_TARGET` to be set. I found this using `rocminfo`.

```
$ rocminfo
ROCk module version 6.14.14 is loaded
...
Agent 2
*******
  Name:                    gfx1151
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon Graphics
```

Then I ran a build using:

```
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build \
    -DGGML_HIP=ON \
    -DAMDGPU_TARGETS=gfx1151 \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_SHARED_LIBS=OFF \
    -DCMAKE_POSITION_INDEPENDENT_CODE=ON \
    -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
    && cmake --build build --config Release -- -j 16
```

Then I tested it against `ggml-org/gemma-3-1b-it-GGUF`.

```
$ ./build/bin/llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

llama_perf_sampler_print:    sampling time =      26.93 ms /   366 runs   (    0.07 ms per token, 13590.29 tokens per second)
llama_perf_context_print:        load time =     491.51 ms
llama_perf_context_print: prompt eval time =      34.05 ms /    19 tokens (    1.79 ms per token,   557.99 tokens per second)
llama_perf_context_print:        eval time =    2141.20 ms /   347 runs   (    6.17 ms per token,   162.06 tokens per second)
llama_perf_context_print:       total time =   14215.85 ms /   366 tokens
llama_perf_context_print:    graphs reused =        345
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self  model  context  compute  unaccounted |
llama_memory_breakdown_print: |   - ROCm0 (Graphics)   | 65536 = 63876 + (1314 =  762 +     38 +     514) +      345 |
llama_memory_breakdown_print: |   - Host               |                  318 =  306 +      0 +      12              |
```

I pulled the same model using ollama to compare. I believe llama.cpp uses the `Q4_K_M` quant by default, so it should be a fair comparison.
```
$ ollama run gemma3:1b --verbose
total duration:       1.979890959s
load duration:        124.143692ms
prompt eval count:    19 token(s)
prompt eval duration: 35.132589ms
prompt eval rate:     540.81 tokens/s
eval count:           271 token(s)
eval duration:        1.718437609s
eval rate:            157.70 tokens/s
```

`gpt-oss:120b` managed to hit 45 tps.

```
llama_perf_sampler_print:    sampling time =      27.84 ms /   281 runs   (    0.10 ms per token, 10092.67 tokens per second)
llama_perf_context_print:        load time =   11317.00 ms
llama_perf_context_print: prompt eval time =     138.78 ms /    16 tokens (    8.67 ms per token,   115.29 tokens per second)
llama_perf_context_print:        eval time =    5828.50 ms /   264 runs   (   22.08 ms per token,    45.29 tokens per second)
llama_perf_context_print:       total time =  593306.52 ms /   280 tokens
llama_perf_context_print:    graphs reused =        262
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free   self   model  context  compute  unaccounted |
llama_memory_breakdown_print: |   - ROCm0 (Graphics)   | 65536 = 4734 + (60421 = 59851 +  171 +     398) +       380 |
llama_memory_breakdown_print: |   - Host               |                  601 =   586 +    0 +      15              |
```

`gpt-oss:20b` reports 65 tps.

```
llama_perf_sampler_print:    sampling time =      12.83 ms /   263 runs   (    0.05 ms per token, 20495.64 tokens per second)
llama_perf_context_print:        load time =    1017.53 ms
llama_perf_context_print: prompt eval time =      72.33 ms /    16 tokens (    4.52 ms per token,   221.20 tokens per second)
llama_perf_context_print:        eval time =    3754.42 ms /   246 runs   (   15.26 ms per token,    65.52 tokens per second)
llama_perf_context_print:       total time =  186207.11 ms /   262 tokens
llama_perf_context_print:    graphs reused =        244
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self    model  context  compute  unaccounted |
llama_memory_breakdown_print: |   - ROCm0 (Graphics)   | 65536 = 53694 + (11461 = 10949 +  114 +     398) +       380 |
llama_memory_breakdown_print: |   - Host               |                   601 =   586 +    0 +      15              |
```

# iGPU tweaks

To ensure that all of the VRAM (128GB) is addressable for bigger models, I made a few adjustments in the BIOS sourced from here.

1. Set UMA frame buffer size to 1G (this was the minimum in my BIOS). Interestingly, this is the value that gets reported by `rocm-smi --showmeminfo vram`[1]

   ```
   ============================ ROCm System Management Interface ============================
   ================================== Memory Usage (Bytes) ==================================
   GPU[0]          : VRAM Total Memory (B): 1073741824
   GPU[0]          : VRAM Total Used Memory (B): 163188736
   ==========================================================================================
   ================================== End of ROCm SMI Log ===================================
   ```

2. Disable IOMMU.

Next I added the following kernel boot options to GRUB to set the GTT and TTM sizes.
```
$ sudo nano /etc/default/grub

# Update GRUB_CMDLINE_LINUX_DEFAULT to
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432"
```

I verified this using

```
$ sudo dmesg | grep -i gtt
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.0-86-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 vt.handoff=7
[    0.068142] Kernel command line: BOOT_IMAGE=/vmlinuz-6.8.0-86-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 vt.handoff=7
[    3.527604] amdgpu 0000:c5:00.0: amdgpu: [drm] Configuring gttsize via module parameter is deprecated, please use ttm.pages_limit
[    3.527604] amdgpu 0000:c5:00.0: amdgpu: [drm] GTT size has been set as 137438953472 but TTM size has been set as 66813538304, this is unusual
[    3.527605] [drm] amdgpu: 131072M of GTT memory ready.
```

and

```
$ sudo dmesg | grep -i ttm
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.0-86-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 vt.handoff=7
[    0.068142] Kernel command line: BOOT_IMAGE=/vmlinuz-6.8.0-86-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=off amdgpu.gttsize=131072 ttm.pages_limit=33554432 vt.handoff=7
[    3.527604] amdgpu 0000:c5:00.0: amdgpu: [drm] Configuring gttsize via module parameter is deprecated, please use ttm.pages_limit
[    3.527604] amdgpu 0000:c5:00.0: amdgpu: [drm] GTT size has been set as 137438953472 but TTM size has been set as 66813538304, this is unusual
```
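As a quick sanity check on those numbers (a sketch; the only assumption is the stock 4 KiB x86-64 page size): `amdgpu.gttsize` is specified in MiB and `ttm.pages_limit` in memory pages, so both values work out to the machine's full 128 GiB, which matches the `137438953472` GTT size reported in the dmesg output above.

```python
gtt_bytes = 131072 * 1024 * 1024  # amdgpu.gttsize=131072 (MiB)
ttm_bytes = 33554432 * 4096       # ttm.pages_limit=33554432 (4 KiB pages, assumed page size)

print(gtt_bytes, ttm_bytes)       # 137438953472 137438953472
assert gtt_bytes == ttm_bytes == 128 * 1024**3
```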
# Llama bench

To compare performance with the DGX Spark, I ran `llama-bench` with params I found here.

```
./build/bin/llama-bench -m model.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 --mmap 0
```

They all have the preamble

```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
```

and end with

```
build: 03792ad9 (6816)
```

## gpt-oss:20b

model | size | params | backend | ngl | n_ubatch | fa | test | t/s
---|---|---|---|---|---|---|---|---
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | pp2048 | 1621.75 ± 122.61
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | tg32 | 65.73 ± 0.07
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | pp2048 @ d4096 | 1172.54 ± 1.82
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | tg32 @ d4096 | 59.53 ± 0.06
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | pp2048 @ d8192 | 950.99 ± 1.95
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | tg32 @ d8192 | 57.25 ± 0.06
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | pp2048 @ d16384 | 695.44 ± 0.78
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | tg32 @ d16384 | 53.79 ± 0.05
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | pp2048 @ d32768 | 451.42 ± 0.54
gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | 2048 | 1 | tg32 @ d32768 | 47.71 ± 0.06

## gpt-oss:120b

model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s
---|---|---|---|---|---|---|---|---|---
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 | 818.11 ± 9.03
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 | 46.05 ± 0.18
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 650.83 ± 2.16
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 42.45 ± 0.03
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 542.66 ± 1.71
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 40.88 ± 0.04
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 411.87 ± 1.60
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 38.39 ± 0.06
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 274.69 ± 0.65
gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 34.15 ± 0.01

## Qwen3 Coder 30B A3B

model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s
---|---|---|---|---|---|---|---|---|---
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 | 773.45 ± 44.44
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 | 50.16 ± 0.22
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d4096 | 534.51 ± 1.19
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d4096 | 44.28 ± 0.03
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d8192 | 407.36 ± 0.54
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d8192 | 40.25 ± 0.03
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d16384 | 274.46 ± 0.34
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d16384 | 34.89 ± 0.03
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | pp2048 @ d32768 | 166.77 ± 0.24
qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | ROCm | 99 | 2048 | 1 | 0 | tg32 @ d32768 | 27.59 ± 0.01

* * *

1. From what I can tell `rocm-smi` doesn't report VRAM usage correctly. A more accurate reflection of VRAM consumption seems to be given by `free -h`. ↩︎
nishtahir.com
October 23, 2025 at 9:38 PM
On the DORA 2025 AI Report - AI Adoption and use
The DORA State of AI-Assisted Software Development report came out recently. It's a massive 142-page report that details analysis done by the DORA team, capturing trends and observations from a 5,000-participant study focused on AI adoption and tool use in the software industry. It's a long and detailed report, so I'll focus on areas I think are interesting and summarize as I go, following the flow of the original report and leaving commentary along the way. These notes follow my progress in trying to understand this stuff myself, and I invite any thoughts or perspectives on the topic. The original report is available here. The survey questions are published here.

# Foreword

The foreword highlights a decade-long evolution in software development practices, emphasizing Google's DORA research on DevOps and its recent pivot to address AI's impact. The author is bullish on vibe coding, as evidenced by their upcoming book of the same name, and takes the stance of having seen AI produce extremely positive outcomes, going so far as to label last year's report, which showed a correlation between increased AI use and reduced software stability/throughput, the "2024 DORA anomaly".

> Steve and I have seen how using vibe coding can go wrong, resulting in deleted tests, outages, and even deleted code repositories. But we've concluded that this was because the engineering instincts that served us well for decades were now proving woefully insufficient.

They blame the issues they've experienced with vibe coding on current engineering instincts being "woefully insufficient". The author comes from the position that this is a paradigm shift that needs new patterns to keep up with.

> Suppose the fastest you've ever traveled is walking at four miles per hour, and someone asks you to drive a car at 50 miles per hour. Without practice and training, you will undoubtedly wreck the car.

I think the broader technical industry has seen its stance evolve on vibe coding over the year. There remain unresolved issues with surrendering your understanding and your solution becoming a black box[1]. It's possible that the author has solutions that they will be revealing in their book. They support their claim that training and patterns are the issue through two case studies, starting with Adidas:

> Fernando Cornago, global vice-president, Digital and E-Commerce Technology, Adidas, oversees nearly a thousand developers. In their generative AI (gen AI) pilot, they found that teams who worked in loosely coupled architectures and had fast feedback loops "experienced productivity gains of 20% to 30%, as measured by increases in commits, pull requests, and overall feature-delivery velocity," and had a "50% increase in 'Happy Time'"—more hands-on coding and less administrative toil.

And Booking.com:

> We also appreciated the case study from Bruno Passos, group product manager, Developer Experience, Booking.com, which has a team of more than 3,000 developers. In their gen AI innovation efforts, they found that, "developer uptake of vibe coding and coding assistant tools was uneven ... Bruno's team soon realized the missing ingredient was training. When developers learned how to give their coding assistant more explicit instructions and more effective context, they found up to 30% increases in merge requests and higher job satisfaction.

They conclude by pointing out that this report includes data from 5,000 participants and aims to uncover groundbreaking insights similar to past DevOps breakthroughs.
# AI adoption and use

The report defines AI adoption as the intersection between reliance, trust, and reflexive use, and tweaked the survey questions to measure those key facets. The results show that AI has seen overwhelming adoption: 90% of respondents say they use AI at work in some capacity. It's worth noting that this is roughly in line with the 84% reported in the Stack Overflow developer survey[2]. Unfortunately my excitement for this is somewhat tempered by the AI tool-use mandates that have become more common[3]. It's difficult to say how much of the adoption is purely organic and how much is the result of mandates pushing for greater use.

The next section points out that, in aggregate, 60% of users report reflexively using AI half the time or more.

> Although AI use is nearly ubiquitous in our sample, reflexive use—the default employment of AI when facing a problem—is not. Among AI users, only 7% report "always" using AI when faced with a problem to solve or a task to complete, while 39% only "sometimes" seek AI for help. Still, a full 60% of AI users in our survey employ AI "about half the time" or more when encountering a problem to solve or task to complete, suggesting that AI has become a frequent part of the development process.

## Perception of productivity

Some of the stats focus on perception. Respondents report a perception of increased productivity and code quality.

> More than 80% of this year's survey respondents report a perception that AI has increased their productivity. Although more than 40% report that their productivity has increased only "slightly," fewer than 10% of respondents perceive AI contributing to any decrease in their productivity.

> In addition to perceiving positive impacts on their productivity, a majority (59%) of survey respondents also observe that AI has positively impacted their code quality. 31% perceive this increase to be only "slight" and another 30% observe neither positive nor negative impacts. However, just 10% of respondents perceive any negative impacts on their code quality as a result of AI use.

While the data here is interesting, I think its weakness is that it primarily relies on self-reported data, which makes it difficult to establish a causal relationship to the tools. Do the devs feel more productive because the tool is actually making them more productive, or is there an illusion of productivity because they are typing messages to an LLM? The METR study[4] that came out this year made an attempt to measure this.

> Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower[4:1]

There is of course nuance to it, but I argue that it's enough evidence to cast at least some doubt on self-reported metrics in this context[5]. Speaking anecdotally, whether or not I see gains from AI depends on the task, how much the requirements have been figured out, how much of the solution is "cookie cutter", and so on. What I have found it consistently does is save me from typing as much. Anecdotes published by others seem to mirror my experience[6][7], but they also point out some of the dangers that come from not understanding the nuance:

> These claims wouldn't matter if the topic weren't so deadly serious. Tech leaders everywhere are buying into the FOMO, convinced their competitors are getting massive gains they're missing out on.
> This drives them to rebrand as AI-First companies, justify layoffs with newfound productivity narratives, and lowball developer salaries under the assumption that AI has fundamentally changed the value equation.[6:1]

## Trust

Overall, 46% of developers "somewhat" trust AI-generated output, 20% say "a lot", and 4% say "a great deal". There isn't a direct analog in the Stack Overflow survey; however, there is a section on "Accuracy of AI tools" we can use as a reference.

> More developers actively distrust the accuracy of AI tools (46%) than trust it (33%), and only a fraction (3%) report "highly trusting" the output. Experienced developers are the most cautious, with the lowest "highly trust" rate (2.6%) and the highest "highly distrust" rate (20%), indicating a widespread need for human verification for those in roles with accountability.[2:1]

The results line up quite well, pointing to shared frustrations with the reliability of the tools. They offer some advice on building trust in AI[8].

> Importantly, developers who trust gen AI more reap more positive productivity benefits from its use. In a logs-based exploration of Google developers' trust in AI code completion, our EPR team found that developers who frequently accepted suggestions from a gen AI-assisted coding tool submitted more change lists (CLs) and spent less time seeking information than developers who infrequently accepted suggestions from the same tool. This was true even when controlling for confounding factors, including job level, tenure, development type, programming language, and CL count. Put simply, **developers who trust gen AI more are more productive**.

Something that stands out to me is that this comes from the perspective that using gen AI is a fixed productivity gain. That seems true based on the self-reported data, but it still depends heavily on who you ask[9][10]. The five pieces of advice they offer to increase trust all seem like good ideas regardless of how you feel about the tech or adoption.

1. Establish a policy about acceptable gen AI use, even if your developers are good corporate citizens.

   > ... establishing clear guidelines encouraging acceptable use of gen AI will likely also promote cautious and responsible developers to use gen AI, by assuaging fears of unknowingly acting irresponsibly

2. Double-down on fast, high-quality feedback, like code reviews and automated testing, using gen AI as appropriate.

   > ... appropriate safeguards assuring them that any errors that may be introduced by gen AI-generated code will be detected before it is deployed to production.

3. Provide opportunities for developers to gain exposure to gen AI, especially those which support using their preferred programming language.

   > Providing opportunities to gain exposure to gen AI, like training, unstructured activities, or slack time devoted to trying gen AI, will help increase trust, _especially if such activities can be performed in developers' preferred programming language_ in which they are best equipped to evaluate gen AI's quality

4. Encourage gen AI use, but don't force it.

   > One approach to encouraging gen AI use in a manner that prioritizes developers' sense of control is to promote the spread of knowledge organically, by building community structures to foster conversations about gen AI

5. Help developers think beyond automating their day-to-day work and envision what the future of their role might look like.
   > ... without a clear vision for what the transformed role of a developer working at a higher level of abstraction in which these repetitive tasks are delegated to gen AI resembles, it will be hard to assuage fears of unemployment.

I think trust here covers the fact that there is a learning curve to the tools. How and when you use them can be the difference between good and bad outcomes; however, there is pressure to adopt them whenever and wherever possible, which can backfire and erode trust in the tools.

# Conclusion

The report concludes this section by opining that although respondents express concerns about trust in AI-generated code, they report positive impacts to productivity and code quality.

> But, whether social pressure is a logical motivation to adopt a new technology is debatable. While our data shows many positive outcomes of AI adoption, we have also documented notable drawbacks.
>
> For this reason, we caution against interpreting these findings of AI's ubiquity as an indication that all organizations should rapidly move to adopt AI, regardless of their specific needs. Rather, we interpret these findings as a strong signal that everyone engaged in software development—whether an individual contributor, team manager, or executive leader—should think deeply about whether, where, and how AI can and should be applied in their work.

They note that their data also points to considerable drawbacks when exploring the impact of AI adoption, most notably that higher rates of AI adoption predict increased software delivery instability and developer burnout.

Overall, I think this is a great report, but one with deep nuance that is hurt by the fact that there are a lot of mixed signals coming from different sources that often show opposing data. It acknowledges a lot of things and behaviors we don't yet understand, but it also makes sweeping generalizations about productivity that the reader must approach with that nuance in mind. If you haven't already, I recommend taking the time to read the report.

* * *

1. Goel, N. (2025) Karpathy's 'vibe coding' movement considered harmful. nmn.gl. Available at: https://nmn.gl/blog/dangers-vibe-coding (Accessed: 2025-10-21). ↩︎
2. (no date) 2025 Stack Overflow developer survey. survey.stackoverflow.co. Available at: https://survey.stackoverflow.co/2025/ai (Accessed: 2025-10-21). ↩︎ ↩︎
3. (no date) www.reddit.com. Available at: https://www.reddit.com/r/ExperiencedDevs/comments/1j7aqsx/ai_coding_mandates_at_work/ (Accessed: 2025-10-21). ↩︎
4. METR. (2025) Measuring the impact of early-2025 AI on experienced open-source developer productivity. metr.org. Available at: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/ (Accessed: 2025-10-21). ↩︎ ↩︎
5. Reading further, it looks like they agree: "These mixed signals indicate to us that more evidence-based work should be done to evaluate the true impact of AI on product development, especially given the sheer scale of AI investment and adoption. We believe that the developer community and employers should be setting realistic expectations, and gaining a clear perspective on AI's actual impact is the first step toward managing those expectations responsibly". ↩︎
6. Judge, M. (2025) Where's the shovelware? Why AI coding claims don't add up. mikelovesrobots.substack.com. Available at: https://mikelovesrobots.substack.com/p/wheres-the-shovelware-why-ai-coding (Accessed: 2025-10-21). ↩︎ ↩︎
7. (no date) Where's the shovelware? Why AI coding claims don't add up. news.ycombinator.com. Available at: https://news.ycombinator.com/item?id=45120517 (Accessed: 2025-10-21). ↩︎
8. Storer, KM. et al. (no date) Fostering trust in AI. dora.dev. Available at: https://dora.dev/research/ai/trust-in-ai/ (Accessed: 2025-10-21). ↩︎
9. Kapani, C. (2025) AI coding assistants aren't really making devs feel more productive. leaddev.com. Available at: https://leaddev.com/velocity/ai-coding-assistants-arent-really-making-devs-feel-more-productive (Accessed: 2025-10-21). ↩︎
10. (no date) www.reddit.com. Available at: https://www.reddit.com/r/ExperiencedDevs/comments/1lml3ti/did_ai_increase_productivity_in_your_company/ (Accessed: 2025-10-21). ↩︎
nishtahir.com
October 21, 2025 at 3:06 AM
Notes on OpenAI's AppSDK
OpenAI's dev day was today. While I wrote up a short summary of what was announced on Bluesky, one of the major announcements was the AppSDK for ChatGPT. It looks like OpenAI plans to position ChatGPT as a platform for the future, not unlike the Google Play and Apple App Stores, except within ChatGPT. The platform builds on MCP, encouraging developers to expose MCP servers that ChatGPT can discover for capabilities, but goes further in allowing developers to inject custom UI components that customers can interact with.

The general workflow appears to be:

1. Your MCP server backend exposes tools ChatGPT can call. Each tool has a JSON schema interface that defines inputs and outputs, along with additional widget metadata.
2. A user interacts with ChatGPT and invokes your app (usually by name), which causes ChatGPT to make a tool call to your MCP server. This is where you are expected to handle the business logic.
3. Your MCP server now has the option to respond with widget output data, which ChatGPT can embed inline in the conversation.

OpenAI provides some design guidelines they expect from apps:

> **Conversational**: Experiences should feel like a natural extension of ChatGPT, fitting seamlessly into the conversational flow and UI.
>
> **Intelligent**: Tools should be aware of conversation context, supporting and anticipating user intent. Responses and UI should feel individually relevant.
>
> **Simple**: Each interaction should focus on a single clear action or outcome. Information and UI should be reduced to the absolute minimum to support the context.
>
> **Responsive**: Tools should feel fast and lightweight, enhancing conversation rather than overwhelming it.
>
> **Accessible**: Designs must support a wide range of users, including those who rely on assistive technologies.

It's worth noting that this comes right after the announcement of the Agentic Commerce Protocol, which I assume this builds on in some way, although I didn't see the reference when browsing through. That creates an incentive for developers to build new experiences on the platform. This is interesting and legitimizes MCP in a way that we haven't seen yet. Before you run off and begin rolling out your own MCP server (a minimal sketch of what one looks like is at the end of this post), it's worth noting that MCP has a pretty large attack surface[1] and deployments must be designed with security in mind.

* * *

1. Bithead, XL. (2025) MCP security exposed: What You Need to know now. live.paloaltonetworks.com. Available at: https://live.paloaltonetworks.com/t5/community-blogs/mcp-security-exposed-what-you-need-to-know-now/ba-p/1227143 (Accessed: 2025-10-7). ↩︎
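To make the MCP side of the workflow a bit more concrete, here is a minimal, hypothetical tool server in Python using the official MCP SDK's `FastMCP` helper. The server name and tool are made up for illustration, and the Apps SDK-specific widget metadata is deliberately omitted since I haven't verified its exact shape.

```python
from mcp.server.fastmcp import FastMCP

# Hypothetical server; "pizza-finder" and its tool are made up for illustration.
mcp = FastMCP("pizza-finder")

@mcp.tool()
def find_pizzerias(city: str, max_results: int = 5) -> list[dict]:
    """Return pizzerias for a city (stubbed data for illustration)."""
    return [{"name": f"Pizzeria {i}", "city": city} for i in range(max_results)]

if __name__ == "__main__":
    # An MCP client (ChatGPT, in the AppSDK case) connects to this server,
    # reads the JSON schema derived from the type hints above, and calls the
    # tool by name; the return value is what a widget would render.
    mcp.run()
```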
nishtahir.com
October 7, 2025 at 2:39 AM
Vector Norms
A norm of a vector \\(\vec{v}\\) describes the magnitude or size of the vector. It is usually denoted as \\(||\vec{v}||\\). There are a few common norms worth discussing.

## Euclidean Norm (L2 Norm)

Let's consider a vector `v = [3, 4]`. We can calculate the Euclidean norm as

\\[ ||\vec{v}|| = \sqrt{3^2 + 4^2} = \sqrt{9 + 16} = \sqrt{25} = 5 \\]

The intuition here is that the norm calculates the hypotenuse of a right triangle, or in this case the shortest path to travel from `[0, 0]` to `[3, 4]`. We can formalize the process as

\\[ ||\vec{v}|| = \sqrt{\sum_{i=1}^{n} v_i^2} \\]

where

\\[ \vec{v} = [v_1, v_2, ..., v_n] \\]

which allows us to define a function for it. In code, this is quite straightforward with Python's math library.

```python
import math

def euclidean_norm(v):
    return math.sqrt(sum(x**2 for x in v))

v = [3, 4]
norm = euclidean_norm(v)
print(norm)  # 5.0
```

## Manhattan Norm (L1 Norm)

Let's consider the vector `v = [3, 4]`. We can calculate the Manhattan norm as

\\[ ||\vec{v}|| = |3| + |4| = 7 \\]

Intuitively, this computes how many lines we have to traverse in the `x` direction (3) and the `y` direction (4) in order to move from `[0, 0]` to `[3, 4]`. This is often called the taxicab distance. We can formalize the process as

\\[ ||\vec{v}|| = \sum_{i=1}^{n} |v_i| \\]

where

\\[ \vec{v} = [v_1, v_2, ..., v_n] \\]

```python
def manhattan_norm(v):
    return sum(abs(x) for x in v)

v = [3, 4]
norm = manhattan_norm(v)
print(norm)  # 7
```

## Generalized Norm (Lp Norm)

We can generalize the norm for any `p >= 1` as

\\[ ||\vec{v}|| = \left(\sum_{i=1}^{n} |v_i|^p\right)^{\frac{1}{p}} \\]

where

\\[ \vec{v} = [v_1, v_2, ..., v_n] \\]

In code this can be expressed as

```python
def generalized_norm(v, p):
    return (sum(abs(x)**p for x in v))**(1/p)

v = [3, 4]
norm = generalized_norm(v, 2)
print(norm)  # 5.0
```

## Axioms

### 1. Positive Definiteness

The norm of a vector is always non-negative. It is equal to zero if and only if the vector itself is the zero vector.

\\[ ||\vec{v}|| \ge 0 \text{, and } ||\vec{v}|| = 0 \iff \vec{v} = \vec{0} \\]

The intuition here is that a vector can't have a negative length. The only vector with a length of zero is the zero vector, which has no magnitude. We can test our function against this.

```python
zero_v = [0, 0]
norm = generalized_norm(zero_v, 2)
print(norm)  # 0.0
```

### 2. Absolute Homogeneity

If you scale a vector by a scalar value \\(\alpha\\), its norm is scaled by the absolute value of that scalar. Intuitively, it means that the hypotenuse of a triangle should scale with its sides.

\\[ ||\alpha \vec{v}|| = |\alpha| \, ||\vec{v}|| \\]

We can test our function against this. To scale our vector `v = [3, 4]` by a scalar, we multiply each element by the scalar value. We then compute the norm of the scaled vector and verify that it equals the original norm multiplied by the absolute value of the scalar.

```python
v = [3, 4]
alpha = 2

v_norm = generalized_norm(v, 2)
print(v_norm)  # 5.0

alpha_v = [alpha * x for x in v]
print(alpha_v)  # [6, 8]

alpha_v_norm = generalized_norm(alpha_v, 2)
print(alpha_v_norm)  # 10.0

assert abs(alpha) * v_norm == alpha_v_norm  # True
```

### 3. Triangle Inequality

The triangle inequality states that the norm of the sum of two vectors is less than or equal to the sum of their individual norms.

\\[ ||\vec{v} + \vec{w}|| \le ||\vec{v}|| + ||\vec{w}|| \\]

This can be visualized as the idea that the shortest distance between two points is a straight line. To add two vectors together, we sum the items at each position.
We can then take the norm of the result and compare it to the sum of the norms of each vector.

```python
v = [1, 2, 3]
w = [4, 5, 6]

# ||v + w||
v_plus_w = [v[i] + w[i] for i in range(len(v))]  # [5, 7, 9]
norm_v_plus_w = generalized_norm(v_plus_w, 2)
print(norm_v_plus_w)  # 12.449899597988733

# ||v|| + ||w||
norm_v_plus_norm_w = generalized_norm(v, 2) + generalized_norm(w, 2)
print(norm_v_plus_norm_w)  # 12.516621774166063
```

A fun exercise might be testing these axioms to see if they hold for higher `p` norms.
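Here is a quick sketch of that exercise with `p = 3` chosen arbitrarily (the `generalized_norm` definition is repeated so the snippet runs on its own):

```python
import math

def generalized_norm(v, p):
    return (sum(abs(x)**p for x in v))**(1/p)

p = 3
v = [1, 2, 3]
w = [4, 5, 6]
alpha = -2

# 1. Positive definiteness
assert generalized_norm(v, p) > 0
assert generalized_norm([0, 0, 0], p) == 0

# 2. Absolute homogeneity (compared with a tolerance to allow for float error)
scaled_norm = generalized_norm([alpha * x for x in v], p)
assert math.isclose(scaled_norm, abs(alpha) * generalized_norm(v, p))

# 3. Triangle inequality
v_plus_w = [a + b for a, b in zip(v, w)]
assert generalized_norm(v_plus_w, p) <= generalized_norm(v, p) + generalized_norm(w, p)
```

The same checks should pass for any `p >= 1`; for `p < 1` it is the triangle inequality that breaks down, which is why those are not true norms.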
nishtahir.com
October 6, 2025 at 3:42 AM
Notes on - Why do LLMs freak out over the seahorse emoji?
A fantastic deep dive into the seahorse emoji phenomenon[1] was recently published by Theia[2]. It's engaging, well presented, and worth reading. The post presents its case using `meta-llama/Llama-3.3-70B-Instruct`. However, I wanted to verify this behavior with smaller models, which unsurprisingly fail the challenge as well. I specifically tested `microsoft/Phi-4-mini-Instruct` and `HuggingFaceTB/SmolLM2-135M-Instruct`.

I started with the code sample Theia provided here and made some modifications:

1. Added a CLI using `typer` to make it easier to iterate
2. Printed tables using `rich` for nicer formatting
3. Saved and compared activations

A few things stood out to me when testing. Using `microsoft/Phi-4-mini-Instruct`, it's kind of interesting to see the early activations tend a bit more toward unsafe content before converging on the output[3]. This is a bit more obvious when contrasted against SmolLM2, whose output looks a bit more arbitrary to me.

We can compare the layer activations of a few different queries generated using `microsoft/Phi-4-mini-Instruct` to see how they differ (a minimal sketch of this kind of comparison is at the end of this post). My goal was to determine whether the queries are processed the same way. Intuitively, we can see that at first the queries are processed similarly, but they diverge around layer 20 as the model begins to converge on an output.

Here's a plot generated with SmolLM2. Similarly, it all starts off the same and begins to diverge around layer 23.

I made a git repo that anyone can build off of here.

* * *

1. (no date) www.reddit.com. Available at: https://www.reddit.com/r/GeminiAI/comments/1nglzed/gemini_loses_its_mind_after_failing_to_produce_a/ (Accessed: 2025-10-5). ↩︎
2. (no date) Why do LLMs freak out over the seahorse emoji?. vgel.me. Available at: http://vgel.me/posts/seahorse/ (Accessed: 2025-10-5). ↩︎
3. COVID being mentioned is not where I expected this experiment to go. It's fun surprises like this that keep things interesting. ↩︎
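As referenced above, this is the kind of comparison I mean: a minimal sketch (not the code from the linked repo) that computes per-layer cosine similarity of the last-token hidden states for two prompts. The model choice and prompts are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def last_token_states(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states is a tuple of (num_layers + 1) tensors of shape [batch, seq, hidden]
    return [h[0, -1] for h in out.hidden_states]

a = last_token_states("Is there a seahorse emoji?")
b = last_token_states("Is there a dolphin emoji?")

for layer, (ha, hb) in enumerate(zip(a, b)):
    sim = torch.nn.functional.cosine_similarity(ha, hb, dim=0).item()
    print(f"layer {layer:2d}: cosine similarity {sim:.3f}")
```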
nishtahir.com
October 5, 2025 at 6:37 PM
I've been thinking about what it would take to migrate to this as my primary Mastodon account. I don't know if this is a supported use case or how well it would work.
September 23, 2025 at 12:27 AM
How LLM Structured Decoding works
Last week I happened to be in a discussion that involved getting an LLM to generate JSON reliably. A major frustration expressed was that, no matter how much they tried, the model would often fail to follow instructions during generation. I pointed out that most major vendors support some variant of "structured output" parsing, which allows the user to provide an output schema. That happened to be a good solution to the problem, but I wanted to take a moment to write up some notes about how and why it works so well.

All language models have a vocabulary, which is essentially a map of _token_ to _token ID_. Before making a prediction, strings are broken up into these tokens and mapped to numbers the model can work with. A snippet of the Phi-4-mini-instruct vocabulary looks like this.

```json
{
    "\u0120NSError": 85268,
    "\u0120filtro": 85269,
    "\u0120vyt": 85270,
    "\u0120Prefeitura": 85271,
    "*sizeof": 85272,
    "\u0120Continental": 85273,
    "\u0120Enfin": 85274,
    "???\u010a\u010a": 85275,
    "-best": 85276,
    "\u0120tolle": 85277,
    "\u00e8\u012d\u00b9\u00e6\u0140\u013e\u00e7\u012b\u012a": 85278,
    "\u0120\u00d8\u00a7\u00d9\u0126\u00d8\u00b5\u00d9\u012a\u00d8\u00b1": 85279,
    "\u0120\u00c3\u00a9nerg": 85280,
    "icester": 85281,
    "\u0120abbiamo": 85282,
    ...
}
```

We can tokenize a string using an instance of the tokenizer, which will give us a sequence of token IDs.

```python
from transformers import AutoTokenizer

prompt = "Write a json object with the following keys: name, age, city must be an object that starts with { and ends with }"

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4-mini-instruct")
inputs = tokenizer(prompt, return_tensors="pt")
print(inputs)
```

Output:

```
{'input_ids': tensor([[10930, 261, 5701, 2817, 483, 290, 3992, 12994, 25, 1308, 11, 5744, 11, 5030, 2804, 413, 448, 2817, 484, 13217, 483, 354, 326, 17095, 483, 388]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
```

A prediction from the model outputs a probability distribution over the entire vocabulary. This means that for each possible token, you get a score for how likely that token is to appear next in the sequence. We can make a prediction to visualize this.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-4-mini-instruct")
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
```

Now let's grab the logits for the next token and see what the highest-probability candidates are.

```python
next_token_logits = logits[0, -1]
torch.topk(next_token_logits.softmax(dim=-1), 10)
```

Output:

```
torch.return_types.topk(
values=tensor([0.1901, 0.1205, 0.0866, 0.0475, 0.0461, 0.0414, 0.0316, 0.0183, 0.0180, 0.0171]),
indices=tensor([ 326,  483,   13,  887, 1366, 2804,  558, 2238,  350,  290]))
```

The highest-probability token here is 326, which happens to be `and`. Next is 483, which is `with`. So it's clear the model is really just trying to "complete" the prompt.

Since we have access to the predictions, structured decoding in this context means making more intelligent decisions about which predictions to accept from the model, based on the rules or criteria we wish to apply to our output. For example, a string must follow a strict grammar in order to be valid JSON[1].

1. A valid JSON object must start with a `{`
2. A valid JSON array must start with a `[`
3. A valid JSON primitive can be a string, number, `true`, `false`, `null`

So the first token of a valid JSON value can only come from a finite set of possibilities.
This means that when sampling the next token from the model's predictions, we can reject any token that would not be valid and only sample from a pool of valid alternatives[2].

```python
valid_starts = ["{", "["]
valid_ids = [tokenizer.encode(tok, add_special_tokens=False, return_tensors="pt")[0] for tok in valid_starts]
print(valid_ids)

# Mask out everything except our valid ids
mask = torch.full_like(next_token_logits, float("-inf"))
for vid in valid_ids:
    mask[vid] = next_token_logits[vid]

# Take the highest probability token from the pool of valid tokens
next_token_id = torch.argmax(mask).item()
next_token = tokenizer.decode([next_token_id])

# Print for visualization
print("Chosen token:", next_token)
```

Output:

```
Chosen token: {
```

In this example, I'm constraining the valid tokens to `{` and `[`. We assume every other token is invalid and mask them out. Then we take the highest-probability token from the pool of valid tokens.

For more elaborate control over what is valid, we need a way to define a grammar and partially match the completion against it. Most of the major LLM vendors provide some way to define a JSON schema[3][4][5], but more sophisticated APIs allow some sort of BNF-like notation or regexes for matching on the output (a sketch of the vendor-API route is at the end of this post). A manual regex-based implementation might look something like this.

```python
# Note: the original snippet relied on `pattern`, `token_ids`, and `token_strs`
# being defined elsewhere; the definitions below are my assumptions about what
# they looked like. The third-party `regex` module is used because it supports
# partial matching, which the standard `re` module does not.
import regex

# A simplified pattern for a single-key JSON object.
pattern = regex.compile(r'\s*\{\s*"[^"]*"\s*:\s*"[^"]*"\s*\}\s*')

# The string form of every token in the vocabulary, so we can check which
# tokens keep the completion a valid partial match. (Slow, but illustrative.)
token_ids = list(range(len(tokenizer)))
token_strs = [tokenizer.decode([tid]) for tid in token_ids]

prompt = "Write a valid json object with a single test key and value"
completion = ""

for i in range(20):
    inputs = tokenizer(f"<|user|>{prompt}<|end|><|assistant|>{completion}", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

    next_token_logits = logits[0, -1]
    mask = torch.full_like(next_token_logits, float("-inf"))

    for token_id, token_str in zip(token_ids, token_strs):
        expected_completion = completion + token_str
        if pattern.fullmatch(expected_completion, partial=True):
            mask[token_id] = next_token_logits[token_id]

    if torch.all(mask == float("-inf")):
        # No valid alternative
        break

    next_token_id = torch.argmax(mask).item()
    next_token = tokenizer.decode([next_token_id])
    print(next_token)
    completion += next_token
```

Output:

```
{ " test ": " This " }
```

Hopefully this shows why structured output is actually guaranteed to conform to the schema, barring any bugs that occur during sampling. This is a great option if you have information about the expected structure that might not be immediately clear to the model, or if you intend for the output to be consumed by other tools.

* * *

1. This is an example. For full details on the grammar, the full standard is available here: https://www.json.org/json-en.html. ↩︎
2. It's not a coincidence that this is extremely similar to DFAs and other parsing techniques. You are effectively streaming lexemes and need to make decisions about what fits and what does not. ↩︎
3. (no date) Structured output. ai.google.dev. Available at: https://ai.google.dev/gemini-api/docs/structured-output (Accessed: 2025-9-13). ↩︎
4. (no date) OpenAI platform. platform.openai.com. Available at: https://platform.openai.com/docs/guides/structured-outputs (Accessed: 2025-9-13). ↩︎
5. (no date) Structured outputs. docs.vllm.ai. Available at: https://docs.vllm.ai/en/v0.9.2/features/structured_outputs.html (Accessed: 2025-9-13). ↩︎
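As referenced above, here is roughly what the vendor-managed route looks like: a sketch based on OpenAI's structured outputs documentation[4]. I haven't run this exact snippet, and the schema and model name are illustrative; under the hood the provider is presumably doing the same kind of constrained sampling shown earlier, just against a grammar compiled from the schema.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": "Write a json object with the following keys: name, age, city"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "city": {"type": "string"},
                },
                "required": ["name", "age", "city"],
                "additionalProperties": False,
            },
        },
    },
)

# The content is constrained to parse as the schema above.
print(response.choices[0].message.content)
```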
nishtahir.com
September 13, 2025 at 5:17 PM
This is incredible: Genie 3 generates interactive worlds in real time with persistent memory. I can't find a paper on it yet, but the preview is stunning.

https://www.theverge.com/news/718723/google-ai-genie-3-model-video-game-worlds-real-time
August 5, 2025 at 3:39 PM