Word Embeddings with R

Zhaowen Guo

University of Washington

Why use word embeddings?

  • The corpus comes from the Congressional Record, which collects all remarks made by senators and representatives on the floor of the US Congress
  • We use the record for the 114th Congress as an example (a quick look at the data follows)
    • Two variables: speech, speech_id
    • 67,557 observations
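Before tokenizing, it helps to confirm the structure just described. A quick look (a sketch; assumes data114.RDS is in the working directory):

library(tidyverse)

data <- readRDS("data114.RDS")
glimpse(data)  # expect 67,557 rows and two columns: speech_id, speech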

Represent the text using BOW

library(tidytext)
library(tidyverse)
library(SnowballC)  # provides wordStem()

data <- readRDS("data114.RDS")

dfm <- data %>%
  unnest_tokens(word, speech) %>%       # one row per word token
  anti_join(get_stopwords()) %>%        # remove stop words
  mutate(stem = wordStem(word)) %>%     # stem each word
  count(speech_id, stem) %>%            # count stems within each speech
  bind_tf_idf(stem, speech_id, n) %>%   # add tf-idf weights
  cast_dfm(speech_id, stem, n)          # cast counts to a quanteda dfm

What do we find?

The BOW representation is

  • extremely sparse (99.92% of cells are zero; checked below)

  • high-dimensional (98,235 features)
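Both figures can be read off the dfm directly. A minimal check, assuming the dfm object created above and the quanteda package:

library(quanteda)

dim(dfm)       # documents x features; expect 98,235 features
sparsity(dfm)  # share of zero cells, roughly 0.9992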

Represent the text using word embeddings

[Figure: word embedding representation of text; source: Rodriguez & Spirling, 2022]

Additional considerations

  • Do we need “deeper” embeddings?

    • Contextual embeddings (e.g., BERT) may help
  • Do we need to scale up the level of analysis?

    • Document-level embeddings or word mover’s distance may be next steps (a simple averaging sketch follows this list)
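On the second point, the simplest document-level embedding averages the word vectors of a document's in-vocabulary tokens; word mover's distance needs more machinery (text2vec ships a RelaxedWordMoversDistance class). A minimal averaging sketch, assuming a fitted words-by-dimensions matrix such as the glove_embedding built later in these slides:

# sketch: a document embedding as the mean of its words' vectors
doc_embedding <- function(tokens, embedding) {
  tokens <- intersect(tokens, rownames(embedding))   # drop out-of-vocabulary words
  if (length(tokens) == 0) return(rep(NA_real_, ncol(embedding)))
  colMeans(embedding[tokens, , drop = FALSE])        # element-wise mean across words
}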

Represent the text using word embeddings

Why do we use them?

  • Understand language use
    • across time
    • across groups
    • stereotypes and bias
  • Feed into downstream NLP tasks

How to find word embeddings?

Implementation of SVD

nested_data <- readRDS("data114.RDS") %>%
  unnest_tokens(word, speech) %>%
  anti_join(get_stopwords()) %>%
  group_by(word) %>%
  filter(n() >= 50) %>%    # keep words that appear at least 50 times
  ungroup() %>%
  nest(words = c(word))    # nest each speech's words into a list-column
  
slide_windows <- function(tbl, window_size) {
  # all complete sliding windows of `window_size` consecutive words
  skipgrams <- slider::slide(
    tbl, ~.x, .after = window_size - 1, .step = 1, .complete = TRUE
  )
  safe_mutate <- safely(mutate)

  # tag each window with an id; incomplete (NULL) windows fail safely
  out <- map2(skipgrams, seq_along(skipgrams), ~safe_mutate(.x, window_id = .y))

  out %>%
    transpose() %>%
    pluck("result") %>%
    compact() %>%     # drop the failed, incomplete windows
    bind_rows()
}
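To see what slide_windows() produces, here is a toy run on a made-up four-word "speech" (illustrative input only): with a window of 3, only the two complete windows survive.

slide_windows(tibble(word = c("we", "raise", "the", "tax")), 3)
# window_id 1: "we", "raise", "the"; window_id 2: "raise", "the", "tax"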
library(widyr)
library(furrr)
plan(multisession)   # parallelize across R sessions

pmi <- nested_data %>%
  mutate(words = future_map(words, slide_windows, 4)) %>%  # windows of 4 words
  unnest(words) %>%
  unite(window_id, speech_id, window_id) %>%   # make window ids unique across speeches
  pairwise_pmi(word, window_id)                # PMI for word pairs sharing a window
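A quick sanity check before the SVD step is to inspect the highest-PMI pairs, which should look like plausible collocations:

pmi %>%
  arrange(desc(pmi)) %>%
  head(10)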

Explore word vectors

The most similar words to “tax”:

word_vectors <- pmi %>%
  widely_svd(item1, item2, pmi, nv = 100, maxit = 1000)  # 100-dimensional SVD of the PMI matrix
  
nearest_neighbors <- function(df, token) {
  df %>%
    widely(
      ~ {
        # cosine similarity between `token` and every word in the matrix
        y <- .[rep(token, nrow(.)), ]
        res <- rowSums(. * y) /
          (sqrt(rowSums(. ^ 2)) * sqrt(sum(.[token, ] ^ 2)))
        matrix(res, ncol = 1, dimnames = list(x = names(res)))
        },
      sort = TRUE
      )(item1, dimension, value) %>%
    select(-item2)
}

word_list <- word_vectors %>%
  nearest_neighbors("tax")
word_list

Explore word vectors

The first 6 principal components with top 10 contributing words:

word_vectors %>%
  filter(dimension <= 6) %>%
  group_by(dimension) %>%
  slice_max(abs(value), n = 10) %>%   # top 10 words by absolute loading per dimension
  ungroup() %>%
  mutate(dimension = as.factor(dimension),
         item1 = reorder_within(item1, value, dimension)) %>%
  ggplot(aes(item1, value, fill = dimension)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ dimension, scales = "free_y", ncol = 3) +
  coord_flip() +
  scale_x_reordered() + 
  theme_bw()

Explore word vectors

Do politicians discuss “tax” in different ways?

[Figure: nearest neighbors of “tax”, shown separately for Democrats and Republicans]
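One way to produce such a comparison is to rerun the same pipeline on party-specific subsets. A sketch only: data114.RDS carries just speech and speech_id, so the raw_data frame with a party column below is hypothetical and would come from merging speaker metadata.

# hypothetical sketch: party-specific SVD embeddings, reusing the pipeline above
fit_svd <- function(speeches) {
  speeches %>%
    unnest_tokens(word, speech) %>%
    anti_join(get_stopwords()) %>%
    group_by(word) %>% filter(n() >= 50) %>% ungroup() %>%
    nest(words = c(word)) %>%
    mutate(words = future_map(words, slide_windows, 4)) %>%
    unnest(words) %>%
    unite(window_id, speech_id, window_id) %>%
    pairwise_pmi(word, window_id) %>%
    widely_svd(item1, item2, pmi, nv = 100, maxit = 1000)
}

dem_vectors <- fit_svd(filter(raw_data, party == "D"))  # raw_data is hypothetical
rep_vectors <- fit_svd(filter(raw_data, party == "R"))
nearest_neighbors(dem_vectors, "tax")
nearest_neighbors(rep_vectors, "tax")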

Implementation of GloVe

library(text2vec)

WINDOW_SIZE <- 4  # context window size
DIM <- 100        # length of the word vector
ITERS <- 100      # maximum number of iterations for model convergence
COUNT_MIN <- 10   # minimum word count for a term to be kept
set.seed(123)
speech <- sample(data$speech, replace = TRUE)  # bootstrap sample of speeches

tokens <- space_tokenizer(speech)
it <- itoken(tokens, progressbar = TRUE)
vocab <- create_vocabulary(it)
vocab_pruned <- prune_vocabulary(vocab, term_count_min = COUNT_MIN)
vectorizer <- vocab_vectorizer(vocab_pruned)
tcm <- create_tcm(it, 
                  vectorizer, 
                  skip_grams_window = WINDOW_SIZE, 
                  skip_grams_window_context = "symmetric", 
                  weights = rep(1, WINDOW_SIZE))  # equal weight for every position in the window
glove <- GlobalVectors$new(rank = DIM, x_max = 100, learning_rate = 0.05)
word_vectors <- glove$fit_transform(tcm, 
                                    n_iter = ITERS, 
                                    convergence_tol = 0.001, 
                                    n_threads = RcppParallel::defaultNumThreads())

Explore word vectors

The most similar words to “tax”:

word_vectors_context <- glove$components
glove_embedding <- word_vectors + t(word_vectors_context)  # combine main and context vectors

words <- glove_embedding["tax", , drop = FALSE]
cos_sim <- text2vec::sim2(x = glove_embedding, y = words, method = "cosine", norm = "l2")
glove_word_list <- cos_sim %>%
  as.data.frame() %>%
  rownames_to_column("item1") %>%  # capture row names before sorting (arrange() drops them)
  arrange(desc(tax))

How similar are these word embeddings for “tax”?

Top 10 most similar words to “tax”

  SVD            GloVe
  ------------   ------------
  tax            tax
  taxes          taxes
  deductions     credits
  exercise       breaks
  deduction      pay
  code           reform
  millionaires   code
  earnedincome   corporate
  taxation       expenditures
  taxed          income

How similar are these word embeddings for “tax”?

Correlations between nearest-neighbors’ rankings (Rodriguez and Spirling, 2022)

  • For each model, compute a similarity measure (e.g., cosine similarity) between a single word and every word in the common vocabulary

  • Compute correlations (Pearson or Spearman) between these similarity measures across models

How similar are these word embeddings for “tax”?

similar_words <- glove_word_list %>%
  inner_join(word_list, by = "item1")  # common vocabulary across the two models
  
ggplot(similar_words, aes(x = tax, y = value)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm") + 
  labs(x = "GloVe", y = "SVD") +
  annotate("text", x = -0.2, y = 0.045, label = "Pearson's correlation = 0.37") + 
  theme_bw() 
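The annotated value need not be hard-coded; both correlations from the Rodriguez and Spirling procedure can be computed directly:

cor(similar_words$tax, similar_words$value, method = "pearson")   # roughly 0.37
cor(similar_words$tax, similar_words$value, method = "spearman")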

Is there any gender bias inherent in the speech corpus?

library(sweater)

target_words <- c("achievement", "success", "excellence", "leadership", 
                  "partnership", "collaboration", "innovation", "initiative")
attribute_words_female <- c("woman", "female", "she", "her", "hers", 
                            "mom", "daughter", "girl")
attribute_words_male <- c("man", "male", "he", "him", "his", 
                          "father", "son", "boy")

bias <- query(glove_embedding, 
              S_words = target_words,            # name the arguments: positional matching
              A_words = attribute_words_female,  # would misroute the attribute sets to T_words
              B_words = attribute_words_male, 
              method = "guess")                  # sweater picks a method from the supplied sets
plot(bias)
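As a rough cross-check (a sketch, not sweater's own method), compare each target word's cosine similarity to the centroid of the female versus the male attribute words; positive values lean female-associated:

centroid <- function(words) {
  words <- intersect(words, rownames(glove_embedding))  # guard against out-of-vocabulary words
  matrix(colMeans(glove_embedding[words, , drop = FALSE]), nrow = 1)
}
targets <- glove_embedding[intersect(target_words, rownames(glove_embedding)), , drop = FALSE]
lean <- text2vec::sim2(targets, centroid(attribute_words_female), method = "cosine") -
  text2vec::sim2(targets, centroid(attribute_words_male), method = "cosine")
sort(lean[, 1])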

What else can we do with word embeddings?

Application in a Chinese corpus

Corpus: Annual reports of the Chinese national government (1998-2021)

  • GloVe word vectors could help when the corpus is small

  • Chinese word segmenter: jiebaR package

# prepare data and packages
library(text2vec)
library(jiebaR)
library(tidyverse)
report <- read.delim("reports.txt")
colnames(report) <- "text"

# create text tokenizer
text_seg <- worker(bylines = TRUE, 
                   stop_word = "hit_stopwords.txt",  # stop-word list, not a user dictionary
                   symbol = TRUE)
it <- itoken(report$text, tokenizer = function(x) sapply(x, segment, text_seg))

# create vocabulary
vocab <- create_vocabulary(it) %>%
  filter(nchar(term) > 1) %>%          # drop single-character tokens
  filter(!str_detect(term, "\\d+"))    # drop tokens containing digits
vocab_pruned <- prune_vocabulary(vocab, term_count_min = 4)

# create term co-occurrence matrix
vectorizer <- vocab_vectorizer(vocab_pruned)
tcm <- create_tcm(it, 
                  vectorizer, 
                  skip_grams_window = 4, 
                  skip_grams_window_context = "symmetric", 
                  weights = rep(1, 4))

# fit glove embeddings
set.seed(98105)
glove <- GlobalVectors$new(rank = 50, x_max = 100, learning_rate = 0.05)
word_vectors <- glove$fit_transform(tcm, 
                                    n_iter = 100, 
                                    convergence_tol = 0.001, 
                                    n_threads = RcppParallel::defaultNumThreads())
word_vectors_context <- glove$components
glove_embedding <- word_vectors + t(word_vectors_context)  # combine main and context vectors

Application in a Chinese corpus

How does the Chinese government talk about “people” (人民) and “democracy” (民主)?

similar_words <- function(input_word) {
  word <- glove_embedding[input_word, , drop = FALSE]
  cos_sim <- text2vec::sim2(x = glove_embedding, y = word, method = "cosine", norm = "l2")
  cos_sim %>%
    as.data.frame() %>%
    rownames_to_column("term") %>%
    arrange(desc(.data[[input_word]]))  # sort by cosine similarity to the input word
}

similar_words("人民") %>% head(5)
similar_words("民主") %>% head(5)
  • The most similar words to “people” are: people (人民), the masses (群众), safeguard (保障), life (生活), earnestly (切实)

  • The most similar words to “democracy” are: democracy (民主), grassroots (基层), facilities (设施), civilization (文明), become (成为)

Discussion

  • Can word embeddings be debiased? When do we want to control for bias? Existing evidence suggests that debiasing rests on normative decisions and can itself introduce, or merely mask, stereotypes.

  • How did you, or will you, use word embeddings in your own research? What challenges have you confronted, or do you expect, when using them?

  • How can word embeddings be integrated into document-level analyses? How do we validate document embeddings, and how can they be used?

References

Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.

Rodriguez, Pedro L., and Arthur Spirling. 2022. "Word Embeddings: What Works, What Doesn't, and How to Tell the Difference for Applied Research." Journal of Politics 84(1): 101-115.

Hvitfeldt, Emil, and Julia Silge. 2022. Supervised Machine Learning for Text Analysis in R. Chapman & Hall/CRC.

Lena Voita's animations of word embeddings

Chris Moody's tutorial on building word embeddings from scratch

Dmitriy Selivanov's tutorial on GloVe word embeddings

Julia Silge's blog post on word vectors with tidy data principles

Chris Bail's tutorial on word embeddings