Jaswanth Reddy
@jasreddy.bsky.social
AI Engineer | Exploring LLMs, RAG, and Fine-Tuning
Reposted by Jaswanth Reddy
Microsoft's MarkItDown

The MarkItDown library is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc.).

Repo: github.com/microsoft/ma...
GitHub - microsoft/markitdown: Python tool for converting files and office documents to Markdown.
Python tool for converting files and office documents to Markdown. - microsoft/markitdown
github.com
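A minimal usage sketch based on the repo's basic example (the filename here is just illustrative):

```python
from markitdown import MarkItDown

md = MarkItDown()
# Convert an Office file (or PDF, HTML, CSV, ...) to Markdown text.
result = md.convert("report.xlsx")  # filename is illustrative
print(result.text_content)
```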
December 12, 2024 at 9:56 PM
Reposted by Jaswanth Reddy
For anyone interested in fine-tuning or aligning LLMs, I’m running this free and open course called smol course. It’s not a big deal, it’s just smol.

🧵>>
December 3, 2024 at 9:21 AM
Reposted by Jaswanth Reddy
Struggling with RAG over PDF files?

You might want to give Docling a try.

𝗪𝗵𝗮𝘁'𝘀 𝗗𝗼𝗰𝗹𝗶𝗻𝗴?
• Python package by IBM
• Open source (MIT license)
• PDF, DOCX, PPTX → Markdown, JSON

𝗪𝗵𝘆 𝘂𝘀𝗲 𝗗𝗼𝗰𝗹𝗶𝗻𝗴?
• Doesn’t require fancy gear, lots of memory, or cloud services
• Works on regular computers or Google Colab Pro (see the usage sketch below)
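A minimal usage sketch along the lines of the Docling README (the file path is illustrative):

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
# Convert a PDF (or DOCX/PPTX) and export the parsed document as Markdown.
result = converter.convert("paper.pdf")  # path is illustrative
print(result.document.export_to_markdown())
```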
November 28, 2024 at 1:34 PM
Reposted by Jaswanth Reddy
Improve LLM inference over long contexts by up to 11x while preserving 95-100% of accuracy.

Nvidia's Star Attention: Efficient LLM Inference over Long Sequences
November 27, 2024 at 5:58 PM
Reposted by Jaswanth Reddy
My deep learning course at the University of Geneva is available on-line. 1000+ slides, ~20h of screen-casts. Full of examples in PyTorch.

fleuret.org/dlc/

And my "Little Book of Deep Learning" is available as a phone-formatted pdf (nearing 700k downloads!)

fleuret.org/lbdl/
November 26, 2024 at 6:15 AM
Reposted by Jaswanth Reddy
Excited to announce "BALROG: a Benchmark for Agentic LLM and VLM Reasoning On Games" led by UCL DARK's @dpaglieri.bsky.social! Douwe Kiela's plot below is maybe the scariest for AI progress — LLM benchmarks are saturating at an accelerating rate. BALROG to the rescue. This will keep us busy for years.
November 22, 2024 at 11:27 AM
Reposted by Jaswanth Reddy
LLMs generate novel word sequences not contained in their pretraining data. However, compared to humans, models generate significantly fewer novel n-grams.

RLHF = 30% *more* copying than base!

Awesome work from the awesome Ximing Lu (gloriaximinglu.github.io) et al. 🤩

arxiv.org/pdf/2410.04265
November 22, 2024 at 6:14 AM
Reposted by Jaswanth Reddy
Hot take: if you believe that talk therapy is useful, you have to believe that LLMs will eventually be the best and most available therapists
November 22, 2024 at 12:45 PM
Reposted by Jaswanth Reddy
Just realized Bluesky allows sharing valuable stuff because it doesn't punish links. 🤩

Let's start with "What are embeddings" by @vickiboykis.com

The book is a great summary of embeddings, from history to modern approaches.

The best part: it's free.

Link: vickiboykis.com/what_are_emb...
November 22, 2024 at 11:13 AM
Reposted by Jaswanth Reddy
Why are some LLMs better at chess than others?

Part 1: dynomight.net/chess/

Part 2: dynomight.net/more-chess/
Something weird is happening with LLMs and chess
are they good or bad?
dynomight.net
November 22, 2024 at 4:16 PM
Reposted by Jaswanth Reddy
What's the secret sauce that lets SmolLM2 beat LLM titans like Llama 3.2 and Qwen2.5?

Unsurprisingly: data, data, data!

The SmolTalk dataset is open and available here: huggingface.co/datasets/Hug...
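A hedged loading sketch: the dataset ID, config name, and column below are assumptions (the link above is truncated), so check the dataset card before running:

```python
from datasets import load_dataset

# Dataset ID, config name, and column are assumptions based on the
# truncated link above; adjust to whatever the dataset card specifies.
ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
print(ds[0]["messages"])
```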
November 21, 2024 at 2:17 PM
Reposted by Jaswanth Reddy
Free eBook: Machine Learning Systems by Vijay Janapa Reddi

Principles and Practices of Engineering Artificially Intelligent Systems

mlsysbook.ai
November 21, 2024 at 3:05 AM
Reposted by Jaswanth Reddy
Oh wow!

A surprising result from Databricks when measuring embeddings and rerankers on internal evals.

1- Reranking few docs improves recall (expected).
2- Reranking many docs degrades quality (!).
3- Reranking too many documents is quite often worse than using the embedding model alone (!!); see the depth-capping sketch below.
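A hedged sketch of the pattern this suggests: cap how many retrieved candidates you hand to the reranker. The cross-encoder model name and depth values are illustrative, not taken from the Databricks evals:

```python
from sentence_transformers import CrossEncoder

def rerank(query, candidates, model, rerank_depth=50, top_n=10):
    """Rerank only the first `rerank_depth` retrieved candidates.

    Capping the depth reflects the finding above: scoring too many
    documents with the reranker can end up worse than embeddings alone.
    """
    pool = candidates[:rerank_depth]
    scores = model.predict([(query, doc) for doc in pool])
    ranked = sorted(zip(pool, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# Model name and depth values are illustrative.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
```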
November 20, 2024 at 8:46 PM
Reposted by Jaswanth Reddy
DeepSeek-R1-Lite-Preview Test Number 1
November 20, 2024 at 3:24 PM
Reposted by Jaswanth Reddy
All the things you need to know to pretrain an LLM at home*!

Gave a workshop at Uni Bern: it starts with scaling laws, moves on to web-scale data processing, and finishes with training via 4D parallelism and ZeRO.

*assuming your home includes an H100 cluster
November 19, 2024 at 8:37 PM