Lightnews — Scholar-powered news

jayk56.bsky.social

@jayk56.bsky.social

the awkward part of participating in #nokings when you're from sac..

June 15, 2025 at 4:16 PM

jayk56.bsky.social

@jayk56.bsky.social

help! i just asked claude opus to build bloxorz and it's begun optimizing it in a dev loop. it hasn't let me test since version 4...

May 22, 2025 at 7:46 PM

jayk56.bsky.social

@jayk56.bsky.social

Another fun fact is the new Gemini 2.5 Flash (non-thinking) preview model performs on par with Claude 3.7 Sonnet (non-thinking) on this benchmark for 1/10th the price...

Bar chart titled “Aider Results Gemini 2.5 Flash (no thinking)” showing percent correct on the y-axis and number of attempts (1 to 8) on the x-axis. Three color-coded bars per attempt represent Run 1 (blue), Run 2 (green), and Run 3 (gray). The chart shows consistent improvement in percent correct across attempts, with performance stabilizing above 85% after attempt 5.

May 22, 2025 at 5:57 AM

jayk56.bsky.social

@jayk56.bsky.social

Gemini 2.5 Pro is a beast for the price and, given 8 tries per problem, scores 93% on the Aider benchmark. o4-mini (high) was not far behind, but cost about 30% more to score 91%. Sonnet 3.7 without thinking averaged 87% but cost about $45 per run (more than double the cost of Gemini).

May 13, 2025 at 7:58 PM

jayk56.bsky.social

@jayk56.bsky.social

Almost... follow here to get the next installment in this saga

A line chart titled “Aider Benchmark Results (May 7 ’25) – Almost…” plots model pass rates (in percent) on the vertical axis (20–100%) against increasing Pass@ thresholds (Pass@1, Pass@2, Pass@4, Pass@8) on the horizontal axis. Three solid lines and one highlighted point are shown:
• Blue line (o3 (high) + gpt-4.1): rises from about 35% at Pass@1 to about 85% at Pass@2, then stops.
• Green line (Gemini 2.5 Pro Preview 05-06): climbs from 40% at Pass@1 to 75% at Pass@2, then to about 90% at Pass@4, and flattens near 93% at Pass@8.
• Gray line (o3-mini (high) diff edits): rises from 20% at Pass@1 to about 60% at Pass@2, then stops.
• Black dotted extension and dot: a dashed black line continues the green curve from Pass@2 to a single black circle at roughly 98% for Pass@4, highlighting an almost-complete result.

Overall, the chart shows each model's pass rate rising as more attempts are allowed, with the largest gain between Pass@1 and Pass@2.

May 8, 2025 at 2:17 AM

jayk56.bsky.social

@jayk56.bsky.social

what do we think? line go up? #aider

A graph showing three of the top-performing models in the Aider LLM benchmark: OpenAI's o3, Google's Gemini 2.5 Pro Preview, and OpenAI's o3-mini. It is titled "Aider Benchmark Results (May 6th 2025) - Line go up??" and shows the performance of the models at different numbers of attempts. With a single attempt, the top models get about 40% correct. When they get a 2nd attempt to correct any mistakes, they reach up to 80%. The chart author added a dotted line going up and to the right to indicate a possible score if the models are given up to 4 attempts after receiving unit test failure information. The point the author chose is near the 100% mark.

May 7, 2025 at 12:01 AM
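
For readers wondering how a "percent correct within N attempts" curve like the one described above can be tallied, here is a minimal Python sketch. It assumes each problem is recorded as the attempt number on which it first passed (or `None` if it never passed); the function name `solved_within` and the sample data are hypothetical, and this is not the Aider benchmark's own scoring code.

```python
from typing import Optional, Sequence


def solved_within(first_pass_attempt: Sequence[Optional[int]], k: int) -> float:
    """Percent of problems whose first passing attempt is at most k."""
    if not first_pass_attempt:
        raise ValueError("need at least one problem")
    solved = sum(1 for a in first_pass_attempt if a is not None and a <= k)
    return 100.0 * solved / len(first_pass_attempt)


# Hypothetical results for 10 problems (illustrative only, not real benchmark data):
# each entry is the attempt on which the problem first passed, None = never passed.
attempts = [1, 2, None, 1, 4, 2, 3, None, 1, 2]

# Cumulative percent solved when the model is allowed 1, 2, or 4 attempts.
curve = {k: solved_within(attempts, k) for k in (1, 2, 4)}
print(curve)
```

Because a problem solved on attempt 2 also counts as solved within 4 attempts, a curve built this way can only stay flat or rise as more attempts are allowed.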

jayk56.bsky.social

@jayk56.bsky.social

Hundreds of SacTown locals showing up today to protect veterans, retirees, and science and protest oligarchs carving up the American public's wealth. let's turn those honks into action with the Hands OFF! protest April 5th.

#SacTeslaTakedown #TeslaTakedown #StandUpForScience #HandsOff

March 29, 2025 at 10:56 PM

jayk56.bsky.social

@jayk56.bsky.social

I might be struggling with prompting, but o3-mini-high provided documents 1-3 and o1 provided document 4. Here is a summary of what I was noticing (as stated by o3-mini-high grading the responses)

January 31, 2025 at 9:31 PM

jayk56.bsky.social

@jayk56.bsky.social

OpenAI's Realtime API feels head and shoulders above Gemini 2.0 Flash live, but it's also possible it's just easier to get working examples going. I was comparing an example React component with 4o-mini against Google's example Python cookbook with Gemini 2.0 Flash, and Flash just felt shallow...

December 24, 2024 at 11:58 PM
