vishal
@vishal-learner.bsky.social
Machine Learning. https://fast.ai community member. Will post about sports occasionally. #FlyEaglesFly

https://www.youtube.com/@vishal_learner
Initial tests with Claude Haiku show promise - it successfully distinguished between stronger and weaker models. Next step: building an evaluator with FastHTML to grade 450 completions (150 prompts × 3 models). Stay tuned!
April 29, 2025 at 5:42 AM
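A minimal sketch of what that FastHTML grader could look like, assuming FastHTML's fast_app/serve API; the completions.json file name, its schema, and the routes are illustrative placeholders, not the actual app:

```python
import json
from fasthtml.common import *

app, rt = fast_app()

# Assumed schema: [{"prompt": ..., "model": ..., "story": ...}, ...]
completions = json.load(open("completions.json"))  # 450 = 150 prompts x 3 models
scores = {}  # idx -> score; in-memory only for this sketch

@rt("/{idx}")
def get(idx: int):
    c = completions[idx]
    return Titled(
        f"Completion {idx + 1} of {len(completions)}",
        P(Strong("Model: "), c["model"]),
        P(Strong("Prompt: "), c["prompt"]),
        P(c["story"]),
        # One link per score value; clicking records it and advances.
        *[A(f" {s} ", href=f"/score/{idx}/{s}") for s in (0, 0.5, 1.0)],
    )

@rt("/score/{idx}/{s}")
def get(idx: int, s: float):
    scores[idx] = s
    nxt = min(idx + 1, len(completions) - 1)
    return RedirectResponse(f"/{nxt}", status_code=303)

serve()
```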
My scoring system uses 0 (failure), 0.5 (partial success), or 1.0 (success) points per criterion for each category. Different categories have different numbers of criteria: Grammar (5), Creativity (4), Plot (4), Context-tracking (3), and Factual/Reasoning (1 each), so I'll need to normalize.
April 29, 2025 at 5:42 AM
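A quick sketch of that normalization: divide each category's raw sum by its criterion count so every category lands on the same 0-1 scale (criterion counts from the post above; the example scores are made up):

```python
CRITERIA_PER_CATEGORY = {
    "grammar": 5, "creativity": 4, "plot": 4,
    "context_tracking": 3, "factual": 1, "reasoning": 1,
}

def normalize(raw_scores: dict[str, float]) -> dict[str, float]:
    # raw score per category = sum of its 0 / 0.5 / 1.0 criterion scores
    return {cat: raw / CRITERIA_PER_CATEGORY[cat] for cat, raw in raw_scores.items()}

print(normalize({"grammar": 4.5, "creativity": 2.0}))
# {'grammar': 0.9, 'creativity': 0.5}
```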
Good creativity/plot prompts create room for invention without sacrificing consistency. Example: "Once upon a time, there was a tiger who liked to play the guitar" offers fertile ground for creativity without losing coherence, while other prompts might force models to choose between the two.
April 29, 2025 at 5:42 AM
When analyzing the 44 TinyStories eval prompts, I discovered factual prompts were the easiest to isolate, context-tracking prompts were a dime a dozen, and reasoning prompts were hard to distinguish from context tracking. This led me to curate category-specific prompts.
April 29, 2025 at 5:42 AM
Project goals: Study both training dynamics of tiny models and their language capabilities (grammar, context tracking, factual knowledge, reasoning, creativity, and plot construction). Looking forward to sharing more progress soon!

/end
April 27, 2025 at 7:51 AM
Next steps: Pausing training to build evaluation infrastructure. Will use Gemini 2.5 Flash or Claude 3.5 Haiku as LLM judges (on TinyStories-1M/8M/28M/33M-generated stories), comparing against manual evaluation to refine scoring prompts for six capability categories.
April 27, 2025 at 7:51 AM
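For the LLM-judge piece, a hedged sketch using the Anthropic Python SDK; the rubric wording and model alias here are placeholders, not the actual scoring prompt:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(story: str, category: str, criteria: list[str]) -> str:
    rubric = "\n".join(f"- {c}" for c in criteria)
    msg = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Score this story on each {category} criterion "
                       f"(0, 0.5, or 1.0 per criterion):\n{rubric}\n\n"
                       f"Story:\n{story}",
        }],
    )
    return msg.content[0].text  # parse per-criterion scores out of this downstream
```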
Details on the architectures I'm using: the LlamaConfigs are shared in the blog post above. I'm loosely referencing the official TinyStories models (intermediate dim = 4 x hidden dim) while intentionally undershooting the named model sizes.
April 27, 2025 at 7:51 AM
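For flavor, one plausible config in that spirit (intermediate dim = 4 x hidden dim, undershooting the named size); the exact dims below are illustrative guesses, not the published configs:

```python
from transformers import LlamaConfig

# Hypothetical tiny-model config; every value here is a guess for illustration.
cfg_tiny = LlamaConfig(
    hidden_size=128,
    intermediate_size=4 * 128,   # the 4x ratio from the official TinyStories models
    num_hidden_layers=4,
    num_attention_heads=4,
    vocab_size=8192,             # a small vocab keeps the embedding matrix tiny
)
```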
Cost analysis: L4 GPU is more efficient for 5M model (~$0.20/epoch), while A100 is better for larger models. 125M model costs ~$0.84/epoch. This gives me a baseline to plan my budget for longer training runs.
April 27, 2025 at 7:51 AM
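The arithmetic behind figures like these is just rate x time; a toy check with assumed numbers (the hourly rate and epoch time below are placeholders, only the ~$0.20 and ~$0.84 figures come from the post):

```python
def cost_per_epoch(gpu_hourly_usd: float, epoch_hours: float) -> float:
    return gpu_hourly_usd * epoch_hours

# e.g. an L4 at an assumed ~$0.40/hr finishing a 5M-model epoch in ~0.5 hr:
print(f"${cost_per_epoch(0.40, 0.5):.2f}/epoch")  # $0.20, matching the 5M figure
```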
P.S. if you are unfamiliar, here are my main takeaways from the TinyStories (Eldan/Li) and Small-scale proxies (Wortsman et al.) papers. Incredibly inspiring work. I am giddy to jump into this project. LFG!!!
April 26, 2025 at 6:27 AM
I'll end with one of my favorite quotes. I am standing on the shoulders of giants to even consider taking on this research project!
April 26, 2025 at 6:27 AM
Here's a recap of my presentation, highlighting my goals for the TinyScale Lab research project!
April 26, 2025 at 6:27 AM
My project timeline consists of 4 phases. I expect this project to take 8-12 months (which means it will probably take two years 😅). First order of business is building the eval and logging setup, and then running initial training runs. Phase 2 involves core experimentation!
April 26, 2025 at 6:27 AM
My rough back-of-the-envelope budget is $2000. I'll closely monitor this each week. If spending consistently trends toward that ceiling, I'll have to seriously consider building my own GPU rig. But time will tell!
April 26, 2025 at 6:27 AM
Following fastai principles, I'll build in public: sharing code, models, datasets, weekly updates, and interactive visualizations. If this work saves someone time, money, or gives them insight, that would be truly the best reward.
April 26, 2025 at 6:27 AM
What excites me most: I cannot wait to watch these capabilities emerge with model size or training steps - grammar first, then consistency, and finally creativity, just as the TinyStories paper observed.
April 26, 2025 at 6:27 AM
My plan: extensive logging of training dynamics + evaluating capabilities with LLM judge scoring. I'll train 100+ model variations across different learning rates and stability techniques (QK-layernorm and z-loss).
April 26, 2025 at 6:27 AM
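Both stability techniques are compact enough to sketch (shapes and the z-loss coefficient below are illustrative; see Wortsman et al. for the originals):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def z_loss(logits: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    # Penalize the squared log-partition function so output logits can't drift.
    return coeff * torch.logsumexp(logits, dim=-1).pow(2).mean()

class QKNormAttention(nn.Module):
    """Self-attention with QK-layernorm: normalize queries and keys
    per head before the dot product to keep attention logits bounded."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(B, T, D))
```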
Making ML research accessible to resource-constrained environments isn't trivial - it's essential for the field's diversity and progress! I'm using modest computational resources (but substantial for me) to conduct what I think is meaningful research.
April 26, 2025 at 6:27 AM
I believe this approach—using tiny models as proxies to study phenomena relevant to models of all sizes—represents an underexplored path that could benefit other resource-constrained researchers. I think this is how most of the world's potential researchers would need to work.
April 26, 2025 at 6:27 AM
My hypothesis: training stability directly affects specific model capabilities in predictable ways. I'll train models from 3M to 120M params, analyzing how logits, gradients, parameters, and loss relate to capabilities like grammar, consistency, and reasoning.
April 26, 2025 at 6:27 AM
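A sketch of the kind of per-step record this implies, called right after loss.backward(); the metric choices here are illustrative, not the project's actual logging code:

```python
import torch

def training_dynamics(model: torch.nn.Module, logits: torch.Tensor,
                      loss: torch.Tensor, step: int) -> dict:
    # Global gradient and parameter norms, plus a logit stat; spikes in
    # these often precede loss instability.
    grads = [p.grad.norm() for p in model.parameters() if p.grad is not None]
    return {
        "step": step,
        "loss": loss.item(),
        "grad_norm": torch.stack(grads).norm().item(),
        "param_norm": torch.stack([p.norm() for p in model.parameters()]).norm().item(),
        "max_logit": logits.max().item(),
    }
```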