The bag-of-words (BOW) representation is
- extremely sparse (99.92% of entries are zero)
- high-dimensional (98,235 features)
Source: Rodriguez & Spirling, 2022
Additional considerations
Do we need “deeper” embeddings?
Do we need to scale up the level of analysis?
Why do we use them?
Neural network-based techniques
library(tidyverse)
library(tidytext)

# tokenize the speeches, drop stop words and rare words (< 50 uses),
# then nest the tokens within each speech
nested_data <- readRDS("data114.RDS") %>%
  unnest_tokens(word, speech) %>%
  anti_join(get_stopwords(), by = "word") %>%
  group_by(word) %>%
  filter(n() >= 50) %>%
  ungroup() %>%
  nest(words = c(word))
# create overlapping skip-gram windows of `window_size` words;
# each window gets an id so co-occurrences can be counted within it
slide_windows <- function(tbl, window_size) {
  skipgrams <- slider::slide(
    tbl, ~.x,
    .after = window_size - 1, .step = 1, .complete = TRUE
  )

  safe_mutate <- safely(mutate)
  out <- map2(skipgrams, seq_along(skipgrams),
              ~ safe_mutate(.x, window_id = .y))

  out %>%
    transpose() %>%
    pluck("result") %>%
    compact() %>%
    bind_rows()
}
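The pmi object used below is built by sliding a context window over each speech and computing pointwise mutual information for words that co-occur within a window. A minimal sketch of that step, following the tidy skip-gram/PMI approach in widyr; it assumes data114.RDS carries a speech_id column, and the 4-word window is illustrative:

library(widyr)

# slide a 4-word window over each speech, then compute PMI of word pairs
# that co-occur within a window (`speech_id` is an assumed id column)
pmi <- nested_data %>%
  mutate(words = map(words, slide_windows, 4L)) %>%
  unnest(words) %>%
  unite(window_id, speech_id, window_id) %>%
  pairwise_pmi(word, window_id)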
The most similar words to “tax”:
# reduce the word-window PMI matrix to 100-dimensional word vectors via SVD
word_vectors <- pmi %>%
  widely_svd(item1, item2, pmi, nv = 100, maxit = 1000)
# cosine similarity between `token` and every other word in the vocabulary
nearest_neighbors <- function(df, token) {
  df %>%
    widely(
      ~ {
        y <- .[rep(token, nrow(.)), ]
        res <- rowSums(. * y) /
          (sqrt(rowSums(. ^ 2)) * sqrt(sum(.[token, ] ^ 2)))
        matrix(res, ncol = 1, dimnames = list(x = names(res)))
      },
      sort = TRUE
    )(item1, dimension, value) %>%
    select(-item2)
}
word_list <- word_vectors %>%
nearest_neighbors("tax")
word_list
The first six principal components, with the top ten contributing words for each:
# plot the ten words with the largest loadings on each of the first six dimensions
word_vectors %>%
  filter(dimension <= 6) %>%
  group_by(dimension) %>%
  top_n(10, abs(value)) %>%
  ungroup() %>%
  mutate(dimension = as.factor(dimension),
         item1 = reorder_within(item1, value, dimension)) %>%
  ggplot(aes(item1, value, fill = dimension)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ dimension, scales = "free_y", ncol = 3) +
  coord_flip() +
  scale_x_reordered() +
  theme_bw()
Do politicians discuss “tax” in different ways?
Democrats vs. Republicans: nearest neighbors of “tax” estimated separately for each party (a sketch of the per-party refit follows)
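A minimal sketch of the per-party comparison: it assumes data114.RDS has a party column (and the speech_id column assumed above) and simply reruns the PMI + SVD pipeline on each party's speeches.

# refit the embedding pipeline on one party's speeches (column names assumed)
fit_party_vectors <- function(party_label) {
  readRDS("data114.RDS") %>%
    filter(party == party_label) %>%
    unnest_tokens(word, speech) %>%
    anti_join(get_stopwords(), by = "word") %>%
    group_by(word) %>%
    filter(n() >= 50) %>%
    ungroup() %>%
    nest(words = c(word)) %>%
    mutate(words = map(words, slide_windows, 4L)) %>%
    unnest(words) %>%
    unite(window_id, speech_id, window_id) %>%
    pairwise_pmi(word, window_id) %>%
    widely_svd(item1, item2, pmi, nv = 100, maxit = 1000)
}

# compare the nearest neighbors of "tax" across parties
nearest_neighbors(fit_party_vectors("D"), "tax")
nearest_neighbors(fit_party_vectors("R"), "tax")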
The most similar words to “tax”, this time using GloVe embeddings:
# `glove` and `word_vectors` come from fitting text2vec::GlobalVectors on the
# congressional corpus (the full GloVe pipeline appears later for the Chinese
# corpus); the final embedding sums the main and context vectors
word_vectors_context <- glove$components
glove_embedding <- word_vectors + t(word_vectors_context)

words <- glove_embedding["tax", , drop = FALSE]
cos_sim <- text2vec::sim2(x = glove_embedding, y = words,
                          method = "cosine", norm = "l2")

glove_similar_words <- cos_sim %>%
  as.data.frame() %>%
  rownames_to_column("item1") %>%
  arrange(desc(tax))
| SVD | GloVe |
|---|---|
| tax | tax |
| taxes | taxes |
| deductions | credits |
| exercise | breaks |
| deduction | pay |
| code | reform |
| millionaires | code |
| earnedincome | corporate |
| taxation | expenditures |
| taxed | income |
Correlations between nearest-neighbors’ rankings (Rodriguez and Spirling, 2022)
Compute a similarity measure (e.g., cosine similarity) between a single word and the entire common vocabulary under each model
Compute the correlation (Pearson or Spearman) between these similarity scores across the two models (a sketch follows)
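A minimal sketch of this correlation for the single cue word “tax”, reusing word_list (SVD cosine similarities) and glove_similar_words (GloVe cosine similarities) from the earlier slides; in practice this would be repeated over many cue words.

# join the two models' similarity-to-"tax" scores on the common vocabulary
# and correlate them (Pearson on the scores, Spearman on the rankings)
inner_join(
  word_list %>% rename(svd_sim = value),
  glove_similar_words %>% select(item1, glove_sim = tax),
  by = "item1"
) %>%
  summarise(pearson  = cor(svd_sim, glove_sim, method = "pearson"),
            spearman = cor(svd_sim, glove_sim, method = "spearman"))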
library(sweater)

# target concepts and two sets of gendered attribute words
target_words <- c("achievement", "success", "excellence", "leadership",
                  "partnership", "collaboration", "innovation", "initiative")
attribute_words_female <- c("woman", "female", "she", "her", "hers",
                            "mom", "daughter", "girl")
attribute_words_male <- c("man", "male", "he", "him", "his",
                          "father", "son", "boy")

# measure the association of the target words with the two attribute sets;
# method = "guess" lets sweater pick a test based on the word sets supplied
bias <- query(glove_embedding,
              S_words = target_words,
              A_words = attribute_words_female,
              B_words = attribute_words_male,
              method = "guess")
plot(bias)
Feed into downstream NLP tasks
Populate your dictionaries (see the sketch after this list)
Validate word embeddings using a Turing test (Rodriguez and Spirling, 2022)
Improve performance of word embeddings
Apply to different language contexts
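As an example of the dictionary point above, nearest neighbors can expand a small hand-picked seed list. A minimal sketch using the SVD word vectors and the nearest_neighbors() helper defined earlier; the seed words are illustrative.

# expand each seed word with its 10 nearest neighbors, then deduplicate
seed_words <- c("tax", "deduction", "taxation")

tax_dictionary <- map_dfr(
  seed_words,
  ~ word_vectors %>% nearest_neighbors(.x) %>% slice_max(value, n = 10)
) %>%
  distinct(item1) %>%
  pull(item1)

tax_dictionary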
Corpus: Annual reports of the Chinese national government (1998-2021)
GloVe word vectors could help when the corpus is small
Chinese word segmenter: jiebaR package
# prepare data and packages
library(text2vec)
library(jiebaR)
library(tidyverse)

report <- read.delim("reports.txt")
colnames(report) <- "text"

# create a Chinese word segmenter (jiebaR), removing HIT stop words
text_seg <- worker(bylines = TRUE,
                   stop_word = "hit_stopwords.txt",
                   symbol = TRUE)

# tokenize each report with the segmenter
it <- itoken(report$text,
             tokenizer = function(x) sapply(x, segment, text_seg))
# create vocabulary, dropping single characters and purely numeric tokens
vocab <- create_vocabulary(it) %>%
  filter(nchar(term) > 1) %>%
  filter(!str_detect(term, "\\d+"))
vocab_pruned <- prune_vocabulary(vocab, term_count_min = 4)

# create term-co-occurrence matrix with a symmetric 4-word window
vectorizer <- vocab_vectorizer(vocab_pruned)
tcm <- create_tcm(it,
                  vectorizer,
                  skip_grams_window = 4,
                  skip_grams_window_context = "symmetric",
                  weights = rep(1, 4))  # equal weight for all positions in the window
# fit GloVe embeddings
set.seed(98105)
glove <- GlobalVectors$new(rank = 50, x_max = 100, learning_rate = 0.05)
word_vectors <- glove$fit_transform(tcm,
                                    n_iter = 100,
                                    convergence_tol = 0.001,
                                    n_threads = RcppParallel::defaultNumThreads())

# final embedding = main word vectors + transposed context vectors
word_vectors_context <- glove$components
glove_embedding <- word_vectors + t(word_vectors_context)
How does the Chinese government talk about “people” (人民) and “democracy” (民主)?
# cosine similarity between one word and the entire vocabulary
similar_words <- function(input_word) {
  words <- glove_embedding[input_word, , drop = FALSE]
  cos_sim <- text2vec::sim2(x = glove_embedding, y = words,
                            method = "cosine", norm = "l2")
  cos_sim %>%
    as.data.frame() %>%
    rownames_to_column("term") %>%
    arrange(desc(.data[[input_word]]))
}
similar_words("人民") %>% head(5)
similar_words("民主") %>% head(5)
The most similar words to “people” are: people (人民), the masses (群众), safeguard (保障), life (生活), earnestly (切实)
The most similar words to “democracy” are: democracy (民主), grassroots (基层), facilities (设施), civilization (文明), become (成为)
Can word embeddings be debiased? When do we want to control for bias? Evidence indicates that debiasing is itself a modeling decision, and one that can introduce stereotypes of its own.
How have you used, or how will you use, word embeddings in your own research? What challenges have you confronted, or do you expect to confront, when using word embeddings?
How can we integrate word embeddings to perform document-level analyses? How do we validate document embeddings, and how can they be used?
Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
Rodriguez, Pedro L., and Arthur Spirling. 2022. “Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research.” Journal of Politics 84(1): 101-115.
Lena Voita’s animations of word embeddings
Chris Moody’s tutorial on building word embeddings from scratch
Dmitriy Selivanov’s tutorial on GloVe Word Embeddings
Julia Silge’s blog post on word vectors with tidy data principles
Chris Bail’s tutorial on word embeddings
Hvitfeldt, Emil, and Julia Silge. 2022. Supervised Machine Learning for Text Analysis in R. Chapman and Hall/CRC.