ShaneG
thegaragelab.bsky.social
Software Engineer, Sci-Fi lover, homelab enthusiast, and infatuated with the possibilities of ML. I remain optimistic for the future of humanity despite evidence to the contrary.
You might be interested in something like this project - github.com/weaviate/Verba - it will let you use your data with an existing base model (open or otherwise).

Integrating RAG - en.m.wikipedia.org/wiki/Retriev... - into your publishing tools will go a long way towards what you seem to want.
GitHub - weaviate/Verba: Retrieval Augmented Generation (RAG) chatbot powered by Weaviate
December 15, 2024 at 11:25 PM
That is a great dataset for fine tuning or RAG, but nowhere near enough to train a model from scratch. TinyLlama - github.com/jzhang38/Tin... - is an open 1.1B model trained on 3T tokens of open data (about 2T words) and it's not really usable.
GitHub - jzhang38/TinyLlama: The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
December 15, 2024 at 11:18 PM
2) The compute resources required to turn that data into a prediction model are the next hurdle - yes, they can be leased rather than bought, but it's still going to be expensive and time consuming.
December 15, 2024 at 12:01 AM
1) The biggest blocker to having (and running) your own personal model is the amount of resources required - well beyond the capability of most individuals. The amount of data required to train a model is huge and it all has to be collected, validated, labelled, free of copyright, etc.
December 14, 2024 at 11:59 PM
Fine tuning (updating the 'core' model with new content) and RAG (providing local knowledge specific to the query) are two techniques that are supported by both open and closed models. There are a lot of commercial and open source tools available to support both techniques.
December 14, 2024 at 11:54 PM
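To make the RAG half of that post concrete, here is a minimal sketch of the retrieval step: pick the local notes most similar to the query and put them in front of the question before it goes to a model. Everything in it is an assumption for illustration (the `NOTES` list, the function names, the bag-of-words scoring); real tools such as Verba use embedding models and a vector database instead of word counts.

```python
import math
import re
from collections import Counter

# Hypothetical local knowledge; in practice this would be your own documents.
NOTES = [
    "The homelab backup job runs every night at 02:00.",
    "Chocolate and banana slice is best served with cream.",
    "New homelab nodes are installed automatically over the network.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(q, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Prepend the retrieved context to the question (the 'augmented' part)."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    print(build_prompt("When does the homelab backup run?", NOTES, k=1))
```

The base model itself is never changed here, which is the difference from fine tuning: the local knowledge only travels along with each query.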
I think the biggest driver of piracy is accessibility of the content - when Netflix streaming first came out video piracy dropped dramatically, and the same thing happened with Spotify and music. Now that video is being siloed off into multiple providers (Disney, Paramount, etc.) it's on the rise again.
November 23, 2024 at 10:17 PM
Sorry for bad link formatting - I type markdown automatically these days and thought Bluesky supported it.
February 11, 2024 at 2:46 AM
I'm upgrading/repairing devices in my homelab today - installation is automated so robot fixing robot.
February 10, 2024 at 3:00 AM
I needed something for a first post and I was indulging in an afternoon sugar top up so I thought - why not? Chocolate and banana slice with cream - both delicious and unhealthy :)
February 9, 2024 at 3:45 AM