Finally: cosine is not the only similarity metric out there. We go through the pros and cons of each metric, with advice on when, e.g., dot product is more effective.
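For intuition, here is the difference in a few lines of numpy (toy vectors, not real embeddings):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])   # embedding of text A (toy values)
b = np.array([2.0, 4.0, 6.0])   # embedding of text B: same direction, twice as long

dot = a @ b                                              # sensitive to vector length
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # length-invariant

print(dot)     # 28.0 -- grows with magnitude
print(cosine)  # 1.0  -- identical direction, regardless of magnitude
```

If your model outputs normalized embeddings, the two metrics rank texts identically; dot product only behaves differently when the model encodes meaningful information in vector length.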
You may think good and evil are opposites, but your embedding model might think: “Those are both moral judgements! Very similar!” If your construct has an opposite, consider using an anchored vector.
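A minimal sketch of one way to anchor an axis between two poles; the model name is just an example, and the hand-picked pole words are my illustration, not necessarily the paper's exact recipe:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here

# Anchor the axis with BOTH poles, so "good" and "evil" land on opposite ends
# instead of both scoring as generically "moral" text.
axis = model.encode("good") - model.encode("evil")
axis = axis / np.linalg.norm(axis)

def anchored_score(text: str) -> float:
    v = model.encode(text)
    return float(v @ axis / np.linalg.norm(v))  # near +1: good pole; near -1: evil pole

print(anchored_score("She devoted her life to helping strangers."))
print(anchored_score("He betrayed everyone who trusted him."))
```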
CAV = learn a vector representation from labeled examples. Humans rate a few posts; you apply the pattern to analyze new texts! This new method gives precise, interpretable scores if you have relevant training data on hand.
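The exact CAV recipe is in the paper; as a rough illustration of the general "learn a vector from rated examples" idea only, here is one possible version. The texts, ratings, and ridge-regression estimator below are all my stand-ins, not the authors' method:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import RidgeCV

model = SentenceTransformer("all-MiniLM-L6-v2")

# A few human-rated posts (texts and 1-7 empathy ratings invented for illustration).
rated_texts = [
    "I'm so sorry you went through that.",
    "That sounds incredibly hard, I'm here if you need to talk.",
    "Get over it already.",
    "Not my problem.",
]
ratings = [6.5, 6.0, 1.5, 1.0]

X = model.encode(rated_texts)     # one embedding per rated post
reg = RidgeCV().fit(X, ratings)   # learn a direction that tracks the human ratings

# Apply the learned pattern to unrated texts.
new_posts = ["That must have been awful for you.", "Why are you still whining?"]
print(reg.predict(model.encode(new_posts)))
```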
CCR = embed a questionnaire. Very powerful when your texts are similar to questionnaire scale items (e.g. open-ended responses). We point out a risk—if you aren’t careful, you might measure how much your texts sound like psychological questionnaires—but there are solutions!
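Roughly, CCR scoring looks like this (the items below are illustrative wording, not a validated instrument):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Scale items (use a validated instrument in practice, and flip or drop
# reverse-keyed items before averaging).
items = [
    "I feel compassion for people less fortunate than me.",
    "I am often quite touched by things that I see happen.",
]

construct = model.encode(items).mean(axis=0)  # one vector for the whole scale

def ccr_score(text: str) -> float:
    v = model.encode(text)
    return float(v @ construct / (np.linalg.norm(v) * np.linalg.norm(construct)))

print(ccr_score("Seeing someone struggle really gets to me."))
```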
DDR = average embedding of a word list. Great for summarizing abstract dimensions (emotion, morality) across genres. Not good for more complex constructs. NEW important tip: weight words by frequency to reduce noise from rare words.
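A DDR sketch with the frequency-weighting tip applied (word list and corpus frequencies are made up for illustration):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Tiny illustrative dictionary; frequencies are hypothetical corpus counts.
words = ["happy", "joy", "delighted", "mirthful"]
freqs = np.array([120_000, 45_000, 8_000, 150], dtype=float)

vecs = model.encode(words)                        # one embedding per dictionary word
weights = freqs / freqs.sum()                     # the tip: downweight rare, noisy words
construct = (weights[:, None] * vecs).sum(axis=0)

def ddr_score(text: str) -> float:
    v = model.encode(text)
    return float(v @ construct / (np.linalg.norm(v) * np.linalg.norm(construct)))

print(ddr_score("What a wonderful, cheerful morning!"))
```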
We review 3 ways to improve on traditional methods: Distributed Dictionary Representation (DDR), Contextualized Construct Representation (CCR) & our new Correlational Anchored Vectors (CAV).
Each has advantages and disadvantages.
Your trusty Likert scale questionnaire could be free response instead.
Your validated word list could be leveraged to analyze words that aren’t included.
Your painstaking MTurk-rated dataset could be extended to analyze 10,000 social media posts.
What’s an embedding?
Why choose one model over another?
Why do you need embeddings when you can ask ChatGPT to rate your texts?
Take a look: doi.org/10.31234/osf...