lhl
@lhl.bsky.social
Easily distracted, currently building open source AI. Living online since FidoNet
I started watching this epic 3.5h investigative journalism piece by Gamers Nexus on Chinese GPU smuggling; the work this independent YouTube gaming channel is doing is really amazing: www.youtube.com/watch?v=1H3x...
THE NVIDIA AI GPU BLACK MARKET | Investigating Smuggling, Corruption, & Governments
YouTube video by Gamers Nexus
www.youtube.com
August 18, 2025 at 6:18 AM
Over the past couple of weeks I've been working on some Strix Halo testing in my spare time. This includes bringing up a harness for doing full sweeps of pp/tg (prompt processing / token generation) performance across a variety of different model architectures, backends, and flags. Writeup just posted to r/LocalLLaMA: www.reddit.com/r/LocalLLaMA...
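(For the curious, the harness is basically just a loop over llama-bench runs. A minimal sketch, assuming llama.cpp's llama-bench is on PATH; the model paths, flag combos, and JSON field names below are illustrative, not the actual sweep config.)

```python
# Minimal sketch of a pp/tg sweep harness (illustrative, not the actual harness).
# Assumes llama.cpp's llama-bench is on PATH and the GGUF paths below exist.
import itertools, json, subprocess

models = ["llama-3.1-8b-q4_k_m.gguf", "qwen3-30b-a3b-q4_k_m.gguf"]  # hypothetical paths
flag_sets = [
    ["-ngl", "99"],              # full GPU offload
    ["-ngl", "99", "-fa", "1"],  # + flash attention
]

results = []
for model, flags in itertools.product(models, flag_sets):
    cmd = ["llama-bench", "-m", model, "-p", "512", "-n", "128", "-o", "json", *flags]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    for row in json.loads(out):
        # each row is one test (pp512 or tg128); avg_ts is average tokens/sec
        results.append({
            "model": model,
            "flags": " ".join(flags),
            "n_prompt": row.get("n_prompt"),
            "n_gen": row.get("n_gen"),
            "t/s": row.get("avg_ts"),
        })

print(json.dumps(results, indent=2))
```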
July 22, 2025 at 11:05 AM
One neat thing: experimenting with using Shisa V2 405B to regenerate our datasets, I'm seeing gains with the new DPO chosen responses (a slight boost on Qwen 3 vs the original DPO), and for SFT+DPO, close to a 0.5 point gain on Shaberi averages for Llama 3.1 8B.
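(For context, "regen" here means keeping the same prompts and rejected responses but re-sampling the chosen side from the bigger model. A minimal sketch of that step, assuming an OpenAI-compatible endpoint is serving the 405B; the endpoint, served model name, and file paths are placeholders.)

```python
# Sketch: regenerate the "chosen" side of a DPO dataset with a stronger model.
# Endpoint/model/file names are placeholders; assumes an OpenAI-compatible server (e.g. vLLM).
from datasets import load_dataset
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def regen_chosen(example):
    resp = client.chat.completions.create(
        model="shisa-v2-llama3.1-405b",  # hypothetical served model name
        messages=[{"role": "user", "content": example["prompt"]}],
        temperature=0.7,
    )
    # keep prompt and rejected as-is; only swap in the stronger model's answer as chosen
    example["chosen"] = resp.choices[0].message.content
    return example

ds = load_dataset("json", data_files="dpo_pairs.jsonl", split="train")  # prompt/chosen/rejected
ds = ds.map(regen_chosen)
ds.to_json("dpo_pairs_regen.jsonl")
```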
June 20, 2025 at 6:24 PM
Recently I started doing some Qwen3 testing (Shaberi, GPT-4.1 judge), and interestingly, for almost all models reasoning yielded worse performance. Note: I need to stand multieval back up - even though the Qwen3 8B tunes appear to match the Shisa V2 12B/14B tunes, they are much worse at translation.
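(The reasoning on/off comparison is easy to reproduce since Qwen3 exposes a thinking switch in its chat template. A minimal sketch; the model name and prompt are just examples, and the enable_thinking kwarg follows Qwen3's published usage.)

```python
# Sketch: generate with Qwen3 reasoning enabled vs disabled via the chat template switch.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# example JA prompt ("please explain the difference between ramen and udon")
messages = [{"role": "user", "content": "ラーメンとうどんの違いを説明してください。"}]

for thinking in (True, False):
    text = tok.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=thinking
    )
    inputs = tok(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    print(f"--- enable_thinking={thinking} ---")
    print(tok.decode(out[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```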
June 15, 2025 at 5:03 AM
I had a chat w/ o3 chatgpt.com/share/6846ff... about Apple's new "Illusion of Thinking" paper machinelearning.apple.com/research/ill... - based on the researchers' definition, neither reasoning LLMs nor humans are true reasoners, but the Python script I had o3 write to solve the logic puzzles is.
ChatGPT - Illusion of Thinking Summary
Shared via ChatGPT
chatgpt.com
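(For context, the paper's puzzles are things like Tower of Hanoi, and the kind of script o3 produced is essentially the textbook recursive solver. A minimal sketch of that sort of solver; the code below is illustrative, not o3's actual output.)

```python
# Sketch: the classic recursive Tower of Hanoi solver, the sort of "true reasoner"
# script referenced above (illustrative only).
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Return the optimal move list for an n-disk Tower of Hanoi instance."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # park n-1 disks on the spare peg
    moves.append((src, dst))             # move the largest disk
    hanoi(n - 1, aux, src, dst, moves)   # bring the n-1 disks back on top
    return moves

if __name__ == "__main__":
    n = 10
    moves = hanoi(n)
    assert len(moves) == 2**n - 1        # optimal solution length
    print(f"{n} disks solved in {len(moves)} moves")
```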
June 9, 2025 at 3:43 PM
Today we launched one more addition to the Shisa V2 models: Shisa V2 405B. This is a new Llama 3.1 405B post-tune that is the strongest model ever trained in Japan! It matches GPT-4o and DeepSeek-V3 on JA MT-Bench. Read more here: shisa.ai/posts/shisa-...
June 3, 2025 at 4:59 AM
OK, first JA slide deck in the books. 😅 (Thanks, ChatGPT 4.5.)
May 27, 2025 at 4:19 AM
BTW, in case anyone wants to kick the tires or test their 日本語, I have our Shisa V2 405B model up and running temporarily (just a day or two until I finish evals/start training again): chat.shisa.ai
Shisa V2 405B
chat.shisa.ai
May 24, 2025 at 9:19 PM
When your model is sufficiently better than the judge model, the judge may just start throwing a lot of 10s in its scoring 😂 (based on our overall eval battery, shisa-v2 70b is a fair amount better than gpt-4 and gpt-4-turbo, but that's the standard judge used for 1:1 comparisons...)
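(For anyone unfamiliar with the setup: the judge scores each answer on a 1-10 scale from a grading prompt, so a sufficiently strong model just saturates the scale. A stripped-down sketch of that absolute-scoring loop; the prompt wording and score parsing are simplified placeholders, not the actual eval prompts.)

```python
# Sketch: minimal LLM-as-judge absolute scoring loop (simplified; not the real eval prompts).
import re
from openai import OpenAI

client = OpenAI()  # judge model via the OpenAI API

JUDGE_PROMPT = (
    "You are a strict grader. Rate the assistant's answer to the question on a 1-10 scale.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Reply with 'Rating: [[N]]' where N is an integer from 1 to 10."
)

def judge(question: str, answer: str, judge_model: str = "gpt-4-turbo"):
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    # parse the [[N]] rating out of the judge's reply; None if it didn't follow the format
    m = re.search(r"\[\[(\d+)\]\]", resp.choices[0].message.content)
    return int(m.group(1)) if m else None
```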
May 23, 2025 at 5:34 AM
I've recently been poking at Strix Halo. For those interested in using it for inference, it's about what you'd expect (except for surprisingly bad llama.cpp HIP perf): www.reddit.com/r/LocalLLaMA... - but for those looking to do work (PyTorch, etc)... the current state is not good.
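(On the PyTorch side, a quick sanity check of the ROCm stack is the first thing to run on a box like this; a minimal sketch, assuming a ROCm build of PyTorch is installed at all.)

```python
# Sketch: quick sanity check of the ROCm/PyTorch stack (e.g. on Strix Halo).
import torch

print("torch:", torch.__version__)
print("HIP runtime:", torch.version.hip)            # None on a CUDA/CPU-only build
print("GPU visible:", torch.cuda.is_available())    # ROCm devices show up through the cuda API

if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    y = x @ x                                        # trivial GEMM to confirm kernels actually run
    torch.cuda.synchronize()
    print("matmul OK:", y.shape)
```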
May 14, 2025 at 5:46 PM
For those curious, like with Llama 4, I've run Qwen 3 through some Japanese language evals. Writeup here: shisa.ai/posts/qwen3-...
Qwen 3 Japanese Performance – Shisa.AI
shisa.ai
May 1, 2025 at 5:36 AM
Over the weekend, I finished up our Llama 405B run (4th group I know of to do an FFT?). It was a real beast to train, but it beats our Shisa V2 70B (as well as GPT-4 and GPT-4 Turbo) using basically our Shisa V2 recipe. It is, I believe, the best-performing LLM (JA and EN) ever trained in Japan.
April 28, 2025 at 12:25 PM
Our small team (of 2!) has just released some of the strongest open Japanese LLMs, Shisa V2 (7-70B). We tried quite a few new techniques (most failed to replicate), so in the end it was largely about grinding out better datasets over the past few months: shisa.ai/posts/shisa-...
Shisa V2 – Shisa.AI
shisa.ai
April 15, 2025 at 5:51 PM
For those interested in how Llama 4's Japanese capabilities stack up, I've just published a set of evals I've run here (better than Llama 3, pretty good for their active parameter counts): shisa.ai/posts/llama4...
Llama 4 Japanese Performance – Shisa.AI
shisa.ai
April 10, 2025 at 4:05 PM
I asked OpenAI Deep Research to do an analysis of the Trump tariffs, their walkback due to the bond market exploding, etc. (including a document someone pointed to as the Trump tariff playbook, and China's 4/9 published response). It's a long but pretty digestible read: chatgpt.com/share/67f778...
ChatGPT - Trump Trade Strategy Analysis
Shared via ChatGPT
chatgpt.com
April 10, 2025 at 8:29 AM
The new Llama 4 release has been a bit of a mess. I've been busy, so I waited for a vLLM stable release blog.vllm.ai/2025/04/05/l... (w/ inference accuracy validation) to see if it's really that bad... Run on an H100 node, the models do OK on EN/JA benchmarks (including some unreleased/just-created ones).
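(For anyone reproducing: the inference side was vLLM on the H100 node. A minimal sketch of loading one of the Llama 4 models with the offline Python API; the HF repo id, parallelism, and context length here are assumptions, so check the vLLM post for the recommended settings.)

```python
# Sketch: offline Llama 4 inference with vLLM on an 8x H100 node.
# Model ID / context length are illustrative; see the vLLM blog post for blessed settings.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed HF repo id
    tensor_parallel_size=8,    # spread across the 8 GPUs on the node
    max_model_len=32768,       # keep context modest to fit comfortably
)

params = SamplingParams(temperature=0.2, max_tokens=256)
prompts = ["富士山について簡単に説明してください。"]  # example JA benchmark-style prompt
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```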
April 7, 2025 at 10:02 AM
quasar-alpha looks... quite good
April 5, 2025 at 6:25 PM
Daniel Kokotajlo et al.'s latest essay, published at ai-2027.com, is definitely worth a full read, but I also asked OpenAI Deep Research to do an in-depth critique of it and some earlier essays. It spent almost 30 minutes on the task: chatgpt.com/share/67efd3...
ChatGPT - AI Projection Analysis 2025
Shared via ChatGPT
chatgpt.com
April 4, 2025 at 1:32 PM
A lot of fun, great stuff with the new ChatGPT image generation (r/ChatGPT is a nice party), but this is probably the most interesting thing I've seen so far: threadreaderapp.com/thread/19057...
Thread by @Josikinz on Thread Reader App
@Josikinz: This just in: Claude expresses significantly less existential distress than chatGPT 4o when presented with the same prompt asking it to script comics about its life (more detail in thread)....
threadreaderapp.com
March 30, 2025 at 12:39 PM
Finally at a point where I can just kick back and wait for results...
March 29, 2025 at 4:04 AM
I never noticed this before. OpenAI Deep Research has some new tricks up its sleeve?
March 28, 2025 at 3:50 PM
Holy crap, I burnt like 8 hours this week banging my head trying to fix things because my big runs were blowing up - turns out DeepSpeed 0.16.4 is super borked (the fix is to disable gradient checkpointing or downgrade to 0.15.0): github.com/deepspeedai/...
[BUG] OOM when train 70B models using deepspeed 0.16.4 · Issue #7116 · deepspeedai/DeepSpeed
We found that using OpenRLHF + DeepSpeed 0.15.0, SFT + Adam Offload can train a 70B model with 8 A100 70G + ZeRO3, whereas DeepSpeed 0.16.4 results in OOM. You can try the script https://github.com...
github.com
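(For anyone hitting the same thing, the two workarounds boil down to pinning DeepSpeed back or flipping off gradient checkpointing. A minimal sketch of the latter with a plain HF Trainer setup; the DeepSpeed config path is a placeholder.)

```python
# Sketch of the two workarounds for the DeepSpeed 0.16.4 OOM (see issue #7116):
#   1) pin DeepSpeed back:  pip install "deepspeed==0.15.0"
#   2) or keep 0.16.4 but turn off gradient checkpointing in the trainer config.
# Example of (2) with a vanilla HF Trainer setup:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_checkpointing=False,   # the toggle that avoids the 0.16.4 blowup
    deepspeed="ds_zero3.json",      # hypothetical ZeRO-3 config path
    bf16=True,
)
```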
March 27, 2025 at 3:21 PM
I've been going through some of the RL releases from last year that I've been meaning to try out, like SPIN github.com/uclaml/SPIN - I implemented a DPO version w/ tuned hyperparameters, and despite decent trajectories, it fails hard (each iteration eval'd worse than the last).
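(Roughly what the "DPO version" of one SPIN iteration looks like: the model's own generations from the previous iteration go in as rejected, the gold SFT responses as chosen, then train with TRL's DPOTrainer. A simplified sketch; the data, hyperparameters, and kwarg names (per recent TRL) are placeholders, not the tuned setup.)

```python
# Sketch: one SPIN-style iteration implemented as DPO (simplified; not the tuned setup).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "spin_iter_0_checkpoint"  # hypothetical path to the previous iteration's model
model = AutoModelForCausalLM.from_pretrained(model_id)
tok = AutoTokenizer.from_pretrained(model_id)

# Tiny illustrative data; in practice prompts/gold come from the SFT set and
# self_generations are sampled from the previous iteration's checkpoint.
prompts          = ["What is the capital of Japan?"]
gold_responses   = ["The capital of Japan is Tokyo."]
self_generations = ["Tokyo, I think, or maybe Kyoto."]

pairs = Dataset.from_dict({
    "prompt":   prompts,
    "chosen":   gold_responses,     # ground-truth targets
    "rejected": self_generations,   # the model's own outputs from the last iteration
})

cfg = DPOConfig(output_dir="spin_iter_1", beta=0.1, num_train_epochs=1)  # beta is a placeholder
trainer = DPOTrainer(model=model, args=cfg, train_dataset=pairs, processing_class=tok)
trainer.train()
```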
March 17, 2025 at 6:39 PM
Recently tested SimPO vs DPO and got results similar to others': DPO was better even when using (grey line) the "V2" optimized hyperparams w/ the same ArmoRM dataset on a similar model (a Llama 3.1 8B SFT) - used trl 0.13.0 since there's a multi-GPU bug w/ CPOTrainer: github.com/huggingface/...
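(The SimPO runs go through TRL's CPOTrainer, which is how TRL exposes the SimPO loss. A minimal sketch; the model/dataset paths and hyperparameters are placeholders rather than the "V2" values, and kwarg names follow recent TRL.)

```python
# Sketch: SimPO via TRL's CPOTrainer (loss_type="simpo", cpo_alpha=0 gives the pure SimPO loss).
# Model/dataset ids and hyperparameters are placeholders, not the "V2" tuned values.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model_id = "my-llama3.1-8b-sft"                       # hypothetical SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tok = AutoTokenizer.from_pretrained(model_id)

# Preference pairs built from the ArmoRM-scored dataset (prompt/chosen/rejected columns).
ds = load_dataset("json", data_files="armorm_pairs.jsonl", split="train")

cfg = CPOConfig(
    output_dir="simpo_out",
    loss_type="simpo",          # SimPO objective
    cpo_alpha=0.0,              # drop the BC term so it's pure SimPO
    beta=2.0,                   # placeholder; SimPO tends to use larger betas than DPO
    simpo_gamma=0.5,            # target reward margin (placeholder)
    num_train_epochs=1,
    per_device_train_batch_size=2,
)

trainer = CPOTrainer(model=model, args=cfg, train_dataset=ds, processing_class=tok)
trainer.train()
```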
March 14, 2025 at 5:28 AM
This has been pretty good (and on theme, lol) while doing training/data cleaning: www.youtube.com/watch?v=JRnD... (the work is mysterious and important)
Severance — Music To Refine To feat. ODESZA | Apple TV+
YouTube video by Apple TV
www.youtube.com
March 14, 2025 at 5:15 AM