Word Embeddings with R

Zhaowen Guo

University of Washington

Why use word embeddings?

  • The corpus comes from the Congressional Record, which collects all remarks made by senators and representatives on the floor of the US Congress
  • We use the record for the 114th Congress as an example (a quick look at the data follows)
    • Two variables: speech, speech_id
    • 67,557 observations
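Before tokenizing, it helps to confirm the structure just described. A quick look (a sketch; assumes data114.RDS is in the working directory):

library(tidyverse)

data <- readRDS("data114.RDS")
glimpse(data)  # expect 67,557 rows and two columns: speech_id, speech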

Represent the text using BOW

library(tidytext)
library(tidyverse)
library(SnowballC)  # provides wordStem()

data <- readRDS("data114.RDS")

dfm <- data %>%
  unnest_tokens(word, speech) %>%       # one row per word token
  anti_join(get_stopwords()) %>%        # remove stop words
  mutate(stem = wordStem(word)) %>%     # stem each word
  count(speech_id, stem) %>%            # count stems within each speech
  bind_tf_idf(stem, speech_id, n) %>%   # add tf-idf weights
  cast_dfm(speech_id, stem, n)          # cast counts to a quanteda dfm

What do we find?

The BOW representation is

  • extremely sparse (99.92% of cells are zero; checked below)

  • high-dimensional (98,235 features)
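Both figures can be read off the dfm directly. A minimal check, assuming the dfm object created above and the quanteda package:

library(quanteda)

dim(dfm)       # documents x features; expect 98,235 features
sparsity(dfm)  # share of zero cells, roughly 0.9992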

Represent the text using word embeddings

[Figure: word embedding representation of text; source: Rodriguez & Spirling, 2022]

Additional considerations

  • Do we need “deeper” embeddings?

    • Contextual embeddings (e.g., BERT) may help
  • Do we need to scale up the level of analysis?

    • Document-level embeddings or word mover’s distance may be next steps (a simple averaging sketch follows this list)
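On the second point, the simplest document-level embedding averages the word vectors of a document's in-vocabulary tokens; word mover's distance needs more machinery (text2vec ships a RelaxedWordMoversDistance class). A minimal averaging sketch, assuming a fitted words-by-dimensions matrix such as the glove_embedding built later in these slides:

# sketch: a document embedding as the mean of its words' vectors
doc_embedding <- function(tokens, embedding) {
  tokens <- intersect(tokens, rownames(embedding))   # drop out-of-vocabulary words
  if (length(tokens) == 0) return(rep(NA_real_, ncol(embedding)))
  colMeans(embedding[tokens, , drop = FALSE])        # element-wise mean across words
}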

Represent the text using word embeddings

Why do we use them?

  • Understand language use
    • across time
    • across groups
    • stereotypes and bias
  • Feed into downstream NLP tasks

How to find word embeddings?

Implementation of SVD

nested_data <- readRDS("data114.RDS") %>%
  unnest_tokens(word, speech) %>%
  anti_join(get_stopwords()) %>%
  group_by(word) %>%
  filter(n() >= 50) %>%    # keep words that appear at least 50 times
  ungroup() %>%
  nest(words = c(word))    # nest each speech's words into a list-column
  
slide_windows <- function(tbl, window_size) {
  # all complete sliding windows of `window_size` consecutive words
  skipgrams <- slider::slide(
    tbl, ~.x, .after = window_size - 1, .step = 1, .complete = TRUE
  )
  safe_mutate <- safely(mutate)

  # tag each window with an id; incomplete (NULL) windows fail safely
  out <- map2(skipgrams, seq_along(skipgrams), ~safe_mutate(.x, window_id = .y))

  out %>%
    transpose() %>%
    pluck("result") %>%
    compact() %>%     # drop the failed, incomplete windows
    bind_rows()
}
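To see what slide_windows() produces, here is a toy run on a made-up four-word "speech" (illustrative input only): with a window of 3, only the two complete windows survive.

slide_windows(tibble(word = c("we", "raise", "the", "tax")), 3)
# window_id 1: "we", "raise", "the"; window_id 2: "raise", "the", "tax"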
library(widyr)
library(furrr)
plan(multisession)   # parallelize across R sessions

pmi <- nested_data %>%
  mutate(words = future_map(words, slide_windows, 4)) %>%  # windows of 4 words
  unnest(words) %>%
  unite(window_id, speech_id, window_id) %>%   # make window ids unique across speeches
  pairwise_pmi(word, window_id)                # PMI for word pairs sharing a window
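A quick sanity check before the SVD step is to inspect the highest-PMI pairs, which should look like plausible collocations:

pmi %>%
  arrange(desc(pmi)) %>%
  head(10)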

Explore word vectors

The most similar words to “tax”:

word_vectors <- pmi %>%
  widely_svd(item1, item2, pmi, nv = 100, maxit = 1000)  # 100-dimensional SVD of the PMI matrix
  
nearest_neighbors <- function(df, token) {
  df %>%
    widely(
      ~ {
        # cosine similarity between `token` and every word in the matrix
        y <- .[rep(token, nrow(.)), ]
        res <- rowSums(. * y) /
          (sqrt(rowSums(. ^ 2)) * sqrt(sum(.[token, ] ^ 2)))
        matrix(res, ncol = 1, dimnames = list(x = names(res)))
        },
      sort = TRUE
      )(item1, dimension, value) %>%
    select(-item2)
}

word_list <- word_vectors %>%
  nearest_neighbors("tax")
word_list

Explore word vectors

The first 6 principal components with top 10 contributing words:

word_vectors %>%
  filter(dimension <= 6) %>%
  group_by(dimension) %>%
  slice_max(abs(value), n = 10) %>%   # top 10 words by absolute loading per dimension
  ungroup() %>%
  mutate(dimension = as.factor(dimension),
         item1 = reorder_within(item1, value, dimension)) %>%
  ggplot(aes(item1, value, fill = dimension)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ dimension, scales = "free_y", ncol = 3) +
  coord_flip() +
  scale_x_reordered() + 
  theme_bw()

Explore word vectors

Do politicians discuss “tax” in different ways?

[Figure: nearest neighbors of “tax”, shown separately for Democrats and Republicans]
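One way to produce such a comparison is to rerun the same pipeline on party-specific subsets. A sketch only: data114.RDS carries just speech and speech_id, so the raw_data frame with a party column below is hypothetical and would come from merging speaker metadata.

# hypothetical sketch: party-specific SVD embeddings, reusing the pipeline above
fit_svd <- function(speeches) {
  speeches %>%
    unnest_tokens(word, speech) %>%
    anti_join(get_stopwords()) %>%
    group_by(word) %>% filter(n() >= 50) %>% ungroup() %>%
    nest(words = c(word)) %>%
    mutate(words = future_map(words, slide_windows, 4)) %>%
    unnest(words) %>%
    unite(window_id, speech_id, window_id) %>%
    pairwise_pmi(word, window_id) %>%
    widely_svd(item1, item2, pmi, nv = 100, maxit = 1000)
}

dem_vectors <- fit_svd(filter(raw_data, party == "D"))  # raw_data is hypothetical
rep_vectors <- fit_svd(filter(raw_data, party == "R"))
nearest_neighbors(dem_vectors, "tax")
nearest_neighbors(rep_vectors, "tax")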

Implementation of GloVe

library(text2vec)

WINDOW_SIZE <- 4  # context window size
DIM <- 100        # length of the word vector
ITERS <- 100      # maximum number of iterations for model convergence
COUNT_MIN <- 10   # minimum word count for a term to be kept
set.seed(123)
speech <- sample(data$speech, replace = TRUE)  # bootstrap sample of speeches

tokens <- space_tokenizer(speech)
it <- itoken(tokens, progressbar = TRUE)
vocab <- create_vocabulary(it)
vocab_pruned <- prune_vocabulary(vocab, term_count_min = COUNT_MIN)
vectorizer <- vocab_vectorizer(vocab_pruned)
tcm <- create_tcm(it, 
                  vectorizer, 
                  skip_grams_window = WINDOW_SIZE, 
                  skip_grams_window_context = "symmetric", 
                  weights = rep(1, WINDOW_SIZE))  # equal weight for every position in the window
glove <- GlobalVectors$new(rank = DIM, x_max = 100, learning_rate = 0.05)
word_vectors <- glove$fit_transform(tcm, 
                                    n_iter = ITERS, 
                                    convergence_tol = 0.001, 
                                    n_threads = RcppParallel::defaultNumThreads())

Explore word vectors

The most similar words to “tax”:

word_vectors_context <- glove$components
glove_embedding <- word_vectors + t(word_vectors_context)  # combine main and context vectors

words <- glove_embedding["tax", , drop = FALSE]
cos_sim <- text2vec::sim2(x = glove_embedding, y = words, method = "cosine", norm = "l2")
glove_word_list <- cos_sim %>%
  as.data.frame() %>%
  rownames_to_column("item1") %>%  # capture row names before sorting (arrange() drops them)
  arrange(desc(tax))

How similar are these word embeddings for “tax”?

Top 10 most similar words to “tax”

  SVD            GloVe
  ------------   ------------
  tax            tax
  taxes          taxes
  deductions     credits
  exercise       breaks
  deduction      pay
  code           reform
  millionaires   code
  earnedincome   corporate
  taxation       expenditures
  taxed          income

How similar are these word embeddings for “tax”?

Correlations between nearest-neighbors’ rankings (Rodriguez and Spirling, 2022)

  • For each model, compute a similarity measure (e.g., cosine similarity) between a single word and every word in the common vocabulary

  • Compute correlations (Pearson or Spearman) between these similarity measures across models

How similar are these word embeddings for “tax”?

similar_words <- glove_word_list %>%
  inner_join(word_list, by = "item1")  # common vocabulary across the two models
  
ggplot(similar_words, aes(x = tax, y = value)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm") + 
  labs(x = "GloVe", y = "SVD") +
  annotate("text", x = -0.2, y = 0.045, label = "Pearson's correlation = 0.37") + 
  theme_bw() 
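The annotated value need not be hard-coded; both correlations from the Rodriguez and Spirling procedure can be computed directly:

cor(similar_words$tax, similar_words$value, method = "pearson")   # roughly 0.37
cor(similar_words$tax, similar_words$value, method = "spearman")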

Is there any gender bias inherent in the speech corpus?

library(sweater)

target_words <- c("achievement", "success", "excellence", "leadership", 
                  "partnership", "collaboration", "innovation", "initiative")
attribute_words_female <- c("woman", "female", "she", "her", "hers", 
                            "mom", "daughter", "girl")
attribute_words_male <- c("man", "male", "he", "him", "his", 
                          "father", "son", "boy")

bias <- query(glove_embedding, 
              S_words = target_words,            # name the arguments: positional matching
              A_words = attribute_words_female,  # would misroute the attribute sets to T_words
              B_words = attribute_words_male, 
              method = "guess")                  # sweater picks a method from the supplied sets
plot(bias)
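As a rough cross-check (a sketch, not sweater's own method), compare each target word's cosine similarity to the centroid of the female versus the male attribute words; positive values lean female-associated:

centroid <- function(words) {
  words <- intersect(words, rownames(glove_embedding))  # guard against out-of-vocabulary words
  matrix(colMeans(glove_embedding[words, , drop = FALSE]), nrow = 1)
}
targets <- glove_embedding[intersect(target_words, rownames(glove_embedding)), , drop = FALSE]
lean <- text2vec::sim2(targets, centroid(attribute_words_female), method = "cosine") -
  text2vec::sim2(targets, centroid(attribute_words_male), method = "cosine")
sort(lean[, 1])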

What else can we do with word embeddings?

Application in a Chinese corpus

Corpus: Annual reports of the Chinese national government (1998-2021)

  • GloVe word vectors could help when the corpus is small

  • Chinese word segmenter: jiebaR package

# prepare data and packages
library(text2vec)
library(jiebaR)
library(tidyverse)
report <- read.delim("reports.txt")
colnames(report) <- "text"

# create text tokenizer
text_seg <- worker(bylines = TRUE, 
                   stop_word = "hit_stopwords.txt",  # stop-word list, not a user dictionary
                   symbol = TRUE)
it <- itoken(report$text, tokenizer = function(x) sapply(x, segment, text_seg))

# create vocabulary
vocab <- create_vocabulary(it) %>%
  filter(nchar(term) > 1) %>%          # drop single-character tokens
  filter(!str_detect(term, "\\d+"))    # drop tokens containing digits
vocab_pruned <- prune_vocabulary(vocab, term_count_min = 4)

# create term co-occurrence matrix
vectorizer <- vocab_vectorizer(vocab_pruned)
tcm <- create_tcm(it, 
                  vectorizer, 
                  skip_grams_window = 4, 
                  skip_grams_window_context = "symmetric", 
                  weights = rep(1, 4))

# fit glove embeddings
set.seed(98105)
glove <- GlobalVectors$new(rank = 50, x_max = 100, learning_rate = 0.05)
word_vectors <- glove$fit_transform(tcm, 
                                    n_iter = 100, 
                                    convergence_tol = 0.001, 
                                    n_threads = RcppParallel::defaultNumThreads())
word_vectors_context <- glove$components
glove_embedding <- word_vectors + t(word_vectors_context)  # combine main and context vectors

Application in a Chinese corpus

How does the Chinese government talk about “people” (人民) and “democracy” (民主)?

similar_words <- function(input_word) {
  word <- glove_embedding[input_word, , drop = FALSE]
  cos_sim <- text2vec::sim2(x = glove_embedding, y = word, method = "cosine", norm = "l2")
  cos_sim %>%
    as.data.frame() %>%
    rownames_to_column("term") %>%
    arrange(desc(.data[[input_word]]))  # sort by cosine similarity to the input word
}

similar_words("人民") %>% head(5)
similar_words("民主") %>% head(5)
  • The most similar words to “people” are: people (人民), the masses (群众), safeguard (保障), life (生活), earnestly (切实)

  • The most similar words to “democracy” are: democracy (民主), grassroots (基层), facilities (设施), civilization (文明), become (成为)

Discussion

  • Can word embeddings be debiased? When do we want to control for bias? Existing evidence suggests that debiasing rests on normative decisions and can itself introduce, or merely mask, stereotypes.

  • How did you, or will you, use word embeddings in your own research? What challenges have you confronted, or do you expect, when using them?

  • How can word embeddings be integrated into document-level analyses? How do we validate document embeddings, and how can they be used?

References

Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.

Rodriguez, Pedro L., and Arthur Spirling. 2022. "Word Embeddings: What Works, What Doesn't, and How to Tell the Difference for Applied Research." Journal of Politics 84(1): 101-115.

Hvitfeldt, Emil, and Julia Silge. 2022. Supervised Machine Learning for Text Analysis in R. Chapman & Hall/CRC.

Lena Voita's animations of word embeddings

Chris Moody's tutorial on building word embeddings from scratch

Dmitriy Selivanov's tutorial on GloVe word embeddings

Julia Silge's blog post on word vectors with tidy data principles

Chris Bail's tutorial on word embeddings