Louis Teitelbaum
@louisteitelbaum.bsky.social
Computational Social Psychology @ Ben-Gurion University, Distributional Semantics × Spread of Ideas. Co-author of https://ds4psych.com/
Just as dplyr is “a grammar of data manipulation”, embedplyr is a grammar of embeddings manipulation, designed to facilitate the use of word and text embeddings in common analysis workflows without introducing new syntax or unfamiliar data structures. This makes it perfect for teaching students.
November 11, 2025 at 7:40 AM
e.g. Analyzing texts with an anchored Distributed Dictionary Representation (DDR) used to take ~100 lines of code plus an hour of figuring out how to load the pretrained model you like. With embedplyr you can do that in ~6 lines of code, and your favorite pretrained model is loaded automatically.
November 11, 2025 at 7:40 AM
I'm only a year into my PhD, and I already have a steady stream of grad students coming to me for help analyzing text with semantic embeddings. Sometimes they have legitimate methodological questions, but often they just don't know where to start with the analysis code. Enter embedplyr...
November 11, 2025 at 7:40 AM
9/
Finally—cosine is not the only similarity metric out there. We go through the pros and cons of each, with advice about when e.g. dot product is more effective.
June 24, 2025 at 2:10 PM
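A minimal numpy sketch of the difference the post is pointing at (toy vectors of my own, not from the paper): cosine similarity normalizes away vector magnitude, while the dot product keeps it, so the two metrics can disagree about "how similar" two vectors are.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine ignores vector magnitude: only direction matters
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = 2 * a  # same direction, twice the magnitude

# Cosine treats a and b as identical (similarity 1.0)...
cos_ab = cosine_sim(a, b)
# ...while the dot product is sensitive to magnitude
dot_aa = np.dot(a, a)  # 14.0
dot_ab = np.dot(a, b)  # 28.0
```

Magnitude often carries real information (e.g. word frequency or model confidence), which is one reason the dot product can sometimes be the better choice.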
8/
You may think good and evil are opposites, but your embedding model might think: “Those are both moral judgements! Very similar!” If your construct has an opposite, consider using an anchored vector.
June 24, 2025 at 2:10 PM
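A toy numpy illustration of the anchored-vector idea (all numbers here are made up for demonstration): "good" and "evil" share a strong moral-topic component, so their raw cosine similarity is high; subtracting one pole from the other isolates the axis that actually separates them.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# Toy 3-d "embeddings": dim 0 ~ moral topic, dim 1 ~ valence (assumed)
good = np.array([0.9,  0.2, 0.1])
evil = np.array([0.9, -0.2, 0.1])

# Plain cosine: good and evil look very similar (both "moral")
sim = unit(good) @ unit(evil)  # ~0.91

# Anchored vector: the good-minus-evil difference isolates valence
anchor = unit(good - evil)

text = np.array([0.8, 0.3, 0.2])  # a hypothetical "praising" text
score = unit(text) @ anchor       # positive -> closer to the "good" pole
```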
7/
CAV = learn a vector representation from labeled examples. Humans rate a few posts; you apply the pattern to analyze new texts! This new method gives precise, interpretable scores if you have relevant training data on hand.
June 24, 2025 at 2:10 PM
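The paper's exact CAV procedure may differ, but the core move (learn a direction in embedding space from human ratings, then project new texts onto it) can be sketched with simulated data and ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 50 human-rated texts with 8-d embeddings
X = rng.normal(size=(50, 8))   # text embeddings
true_v = rng.normal(size=8)    # latent construct direction (simulated)
y = X @ true_v + rng.normal(scale=0.1, size=50)  # noisy human ratings

# Learn the vector from the labeled examples via least squares
v_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Score new, unrated texts by projecting onto the learned vector
new_texts = rng.normal(size=(3, 8))
scores = new_texts @ v_hat
```

With relevant training data, the learned vector recovers the rated construct closely, which is what makes the resulting scores precise and interpretable.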
6/
CCR = embed a questionnaire. Very powerful when your texts are similar to questionnaire scale items (e.g. open-ended responses). We point out a risk—if you aren’t careful, you might measure how much your texts sound like psychological questionnaires—but there are solutions!
June 24, 2025 at 2:10 PM
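In outline (using toy pre-computed embeddings; in practice the item and text embeddings would come from a sentence-embedding model), CCR averages the embeddings of the scale items and scores each text by its cosine similarity to that average:

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# Toy embeddings of three questionnaire scale items (made-up numbers)
item_embs = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.0],
    [0.8, 0.1, 0.2],
])
construct = unit(item_embs.mean(axis=0))  # CCR construct representation

text_emb = unit(np.array([0.5, 0.4, 0.1]))  # an open-ended response
ccr_score = text_emb @ construct            # cosine similarity
```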
5/
DDR = average embedding of a word list. Great for summarizing abstract dimensions (emotion, morality) across genres. Not good for more complex constructs. NEW important tip: weight words by frequency to reduce noise from rare words.
June 24, 2025 at 2:10 PM
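A small numpy sketch of DDR and the frequency-weighting tip (toy vectors and counts of my own): a rare word with a noisy embedding can drag the plain average off course, while frequency weighting keeps the representation close to the well-estimated common words.

```python
import numpy as np

# Toy word embeddings and corpus frequencies (hypothetical numbers)
emb = {
    "happy":    np.array([0.8, 0.1]),
    "joyful":   np.array([0.7, 0.2]),
    "mirthful": np.array([0.9, -0.5]),  # rare word, noisier vector
}
freq = {"happy": 5000, "joyful": 800, "mirthful": 3}

words = list(emb)
V = np.stack([emb[w] for w in words])
w = np.array([freq[x] for x in words], dtype=float)

ddr_plain = V.mean(axis=0)                        # unweighted DDR
ddr_weighted = (w[:, None] * V).sum(axis=0) / w.sum()  # frequency-weighted
```

Here the weighted DDR stays near the reliable "happy"/"joyful" vectors instead of being pulled toward the rare word's noisy estimate.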
4/
We review 3 ways to improve on traditional methods: Distributed Dictionary Representation (DDR), Contextualized Construct Representation (CCR) & our new Correlational Anchored Vectors (CAV).
Each has advantages and disadvantages.
June 24, 2025 at 2:10 PM
3/
Your trusty Likert scale questionnaire could be free response instead.
Your validated word list could be leveraged to analyze words that aren’t included.
Your painstaking MTurk-rated dataset could be extended to analyze 10,000 social media posts.
June 24, 2025 at 2:10 PM
2/
What’s an embedding?
Why choose one model over another?
Why do you need embeddings when you can ask ChatGPT to rate your texts?
Take a look: doi.org/10.31234/osf...
June 24, 2025 at 2:10 PM