n1o_c0rTx
@n1o-cortx.bsky.social
From machine learning to vulnerability research.
For more general ML stuff: https://n1o.github.io/
For ML focused on Vulnerability Research: https://codebreakers.re/
Github: https://github.com/n1o
Of course we can, and I compiled an introductory text on various techniques and research papers doing exactly that.
December 17, 2024 at 10:22 AM
There is a lot of research on how to combine Graph Neural Networks and Large Language Models. GALLa is a very interesting research paper that uses the well-known Adapter pattern (mostly used by vision models) to embed a graph into the embedding space of an LLM.
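A minimal sketch of what such an adapter could look like, assuming a frozen GNN encoder that produces node embeddings and an LLM with a known hidden size; the dimensions and the two-layer MLP design are illustrative, not GALLa's exact recipe:

```python
import torch
import torch.nn as nn

class GraphToLLMAdapter(nn.Module):
    """Projects GNN node embeddings into an LLM's token-embedding space
    so they can be prepended as soft "graph tokens"."""

    def __init__(self, gnn_dim: int = 256, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(gnn_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, node_embeddings: torch.Tensor) -> torch.Tensor:
        # node_embeddings: (num_nodes, gnn_dim) from a frozen GNN encoder
        # returns: (num_nodes, llm_dim) graph tokens in the LLM embedding space
        return self.proj(node_embeddings)

# Usage: concatenate projected graph tokens with regular token embeddings,
# then feed the result to the LLM via model(inputs_embeds=...)
adapter = GraphToLLMAdapter()
graph_tokens = adapter(torch.randn(12, 256))   # 12 nodes -> 12 soft tokens
text_embeds = torch.randn(1, 30, 4096)         # embeddings of the text prompt
inputs_embeds = torch.cat([graph_tokens.unsqueeze(0), text_embeds], dim=1)
```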
December 10, 2024 at 8:44 AM
To take it to the next level, the authors develop a special Macro Influence score, which measures the contribution of individual transformer blocks towards the model's soft labels (token distribution), and they choose the block that contributes the least!
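A hedged sketch of that idea as stated in the post, not the paper's exact formula: skip one block, compare the resulting token distribution against the original with a KL divergence, and treat a small shift as low contribution. The LLaMA-style HuggingFace layout (model.model.layers, decoder layers returning a tuple whose first element is the hidden states) is an assumption:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def block_contribution(model, input_ids, block_idx):
    # Reference token distribution with the full model.
    ref_logits = model(input_ids, use_cache=False).logits

    block = model.model.layers[block_idx]          # assumed LLaMA-style layout
    original_forward = block.forward
    # Temporarily identity-skip the block: hidden states pass through unchanged.
    block.forward = lambda hidden_states, *args, **kwargs: (hidden_states,)
    try:
        ablated_logits = model(input_ids, use_cache=False).logits
    finally:
        block.forward = original_forward

    # A small KL divergence means the block barely changes the soft labels.
    log_p = F.log_softmax(ablated_logits, dim=-1)
    q = F.softmax(ref_logits, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean").item()

# The least-contributing block becomes the drop candidate:
# scores = [block_contribution(model, calib_ids, i) for i in range(len(model.model.layers))]
# drop_idx = min(range(len(scores)), key=scores.__getitem__)
```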
December 3, 2024 at 1:21 PM
FuseGPT, introduced here: arxiv.org/abs/2411.14507
takes the knowledge from the linear layers of the dropped transformer block. This is done by fusing (adding) the removed linear weights to the linear weights in its neighbourhood through a low-rank projection matrix.
FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers
Generative Pre-trained Transformers (GPTs) have demonstrated remarkable performance across diverse domains through the extensive scaling of model parameters. Recent works observe the redundancy across...
arxiv.org
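A minimal sketch of that fusion step, assuming the dropped and neighbouring linear layers share the same weight shape (as corresponding layers of adjacent transformer blocks do); the rank, the initialization, and the omitted recovery-training loop are assumptions, not FuseGPT's exact procedure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankFusion(nn.Module):
    """The weight of a dropped linear layer is folded into a neighbouring
    linear layer through a learnable low-rank projection A @ B."""

    def __init__(self, neighbour: nn.Linear, dropped_weight: torch.Tensor, rank: int = 16):
        super().__init__()
        self.neighbour = neighbour
        self.register_buffer("dropped_weight", dropped_weight)
        out_dim, _ = dropped_weight.shape
        # A starts at zero, so the fused layer initially equals the neighbour.
        self.A = nn.Parameter(torch.zeros(neighbour.out_features, rank))
        self.B = nn.Parameter(torch.randn(rank, out_dim) * 0.01)

    def fused_weight(self) -> torch.Tensor:
        delta = (self.A @ self.B) @ self.dropped_weight   # low-rank re-projection
        return self.neighbour.weight + delta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.fused_weight(), self.neighbour.bias)

# Usage with same-shaped layers from adjacent transformer blocks:
neighbour = nn.Linear(512, 512)
dropped = nn.Linear(512, 512)
fused = LowRankFusion(neighbour, dropped.weight.detach().clone())
y = fused(torch.randn(2, 512))   # (2, 512)
# The low-rank factors would then be trained on a small calibration set so the
# fused layer recovers the dropped block's contribution.
```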
December 3, 2024 at 1:21 PM
To make it even better, their 1.5B model trained only on 1.5T tokens still achieves State of the Art among 2B models, with nearly 3x higher throughput.
December 2, 2024 at 8:50 AM
To reduce memory requirements, they share the KV cache between two consecutive layers, bringing down the cache's memory requirement by 20x compared to a vanilla attention model.
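A rough sketch of cross-layer KV sharing, assuming pairs of consecutive layers where the first layer owns the K/V projections and cache and the second only computes fresh queries; names and dimensions are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVAttention(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8, owns_kv: bool = True):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim)
        # Only the "owner" layer of each pair materializes K/V projections.
        self.kv_proj = nn.Linear(dim, 2 * dim) if owns_kv else None
        self.o_proj = nn.Linear(dim, dim)

    def forward(self, x, shared_kv=None):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        if self.kv_proj is not None:
            k, v = self.kv_proj(x).chunk(2, dim=-1)
            k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
            v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
            shared_kv = (k, v)          # cached once, reused by the next layer
        else:
            k, v = shared_kv            # reuse the previous layer's K/V cache
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out), shared_kv

# Usage: layers 2i and 2i+1 share one KV cache
owner, borrower = SharedKVAttention(owns_kv=True), SharedKVAttention(owns_kv=False)
x = torch.randn(1, 16, 512)
h, kv = owner(x)
h2, _ = borrower(h, shared_kv=kv)
```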
December 2, 2024 at 8:50 AM
They also introduced Meta tokens which smooth out attention's softmax distribution to avoid attention sinks, and at the same time they bootstrap Mamba's internal state.
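A minimal sketch of the meta-token idea, assuming they are simply learnable embeddings prepended to every sequence; the token count and dimensions here are made up:

```python
import torch
import torch.nn as nn

class MetaTokenPrefix(nn.Module):
    """Learnable tokens prepended to every sequence: attention can park
    probability mass on them (avoiding sinks on real tokens) and the SSM
    state starts from a learned prefix instead of zeros."""

    def __init__(self, num_meta_tokens: int = 128, dim: int = 512):
        super().__init__()
        self.meta = nn.Parameter(torch.randn(num_meta_tokens, dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, dim) -> (batch, meta + seq_len, dim)
        B = token_embeds.size(0)
        prefix = self.meta.unsqueeze(0).expand(B, -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)

# Usage: prepend before the first hybrid block; outputs at the meta positions
# are discarded when computing the language-modeling loss.
prefix = MetaTokenPrefix()
x = torch.randn(2, 64, 512)
x_with_meta = prefix(x)   # (2, 192, 512)
```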
December 2, 2024 at 8:50 AM
Long story short: They run Attention and Mamba in parallel at each layer, where Mamba serves as long-term memory, and Attention (mostly Sliding Window except on the First, Middle and Last Layer) as short-term memory with perfect recall.
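A toy sketch of a parallel hybrid block, with a GRU standing in for the Mamba mixer purely to keep it self-contained; the normalize-then-sum fusion of the two branches is an assumption, not the paper's exact recipe:

```python
import torch
import torch.nn as nn

class ParallelHybridBlock(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ssm = nn.GRU(dim, dim, batch_first=True)   # placeholder for a Mamba mixer
        self.attn_norm = nn.LayerNorm(dim)
        self.ssm_norm = nn.LayerNorm(dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        T = h.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), 1)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)  # short-term, exact recall
        ssm_out, _ = self.ssm(h)                            # long-term, compressed memory
        # Normalize each branch before summing so neither dominates.
        fused = self.attn_norm(attn_out) + self.ssm_norm(ssm_out)
        return x + self.out_proj(fused)

block = ParallelHybridBlock()
y = block(torch.randn(2, 32, 512))   # (2, 32, 512)
```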
December 2, 2024 at 8:50 AM
Ehm Qwen obviously!
November 25, 2024 at 4:15 PM
Starting from pretrained LLMs, there are multiple techniques that let us transform them into word-level or sequence-level embedding models while investing only a couple of hours on a single GPU, and with a bit of pruning we can make them lean and mean!
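A minimal sketch of the cheapest version of this, assuming mean pooling over the last hidden states of a small pretrained decoder (the model name is only an example); approaches like LLM2Vec go further by enabling bidirectional attention and doing a short contrastive fine-tune:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"   # illustrative choice of a small pretrained LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(model_name)

@torch.no_grad()
def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = batch_out = model(**batch).last_hidden_state   # (B, T, dim)
    mask = batch["attention_mask"].unsqueeze(-1)             # ignore padding positions
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean over real tokens
    return F.normalize(pooled, dim=-1)                       # unit-length sequence embeddings

embs = embed(["buffer overflow in parser", "heap corruption bug"])
print(embs @ embs.T)   # cosine similarities
```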
November 25, 2024 at 9:39 AM
Why embedding models? They are still the kings when it comes to understanding tasks, and pretraining them is expensive.
November 25, 2024 at 9:39 AM