Visualizing Data with R

Zhaowen Guo

University of Washington

Agenda

  • ggplot2 and elements of information visualization

  • Visualizing tabular data

    • Information from one variable
    • Information from multiple variables
  • Practice

  • ggplot2 extensions

Graphic systems in R

  • Imperative programming
    • Step-by-step instructions to control the exact construction of output
    • Hands-on, but more work
    • base, grid
  • Declarative programming
    • Allows the software to apply a standard solution
    • Customize with a stylesheet
    • ggplot2
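
As a rough illustration of the two styles, the sketch below draws the same scatter plot twice. It uses the built-in mtcars dataset, which is an assumption made for the example and not part of these slides.

```r
library(ggplot2)

# Imperative (base graphics): each call draws one more piece onto the device
plot(mtcars$wt, mtcars$mpg, type = "n",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon") # empty canvas with axes
points(mtcars$wt, mtcars$mpg, pch = 19, col = "steelblue")  # add the points
abline(lm(mpg ~ wt, data = mtcars), lty = 2)                # add a dashed fit line

# Declarative (ggplot2): describe data, mappings, and layers; ggplot2 handles the rest
p_declarative <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed") +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
p_declarative
```

In the base version the order of calls matters because each one paints onto the existing plot; in the ggplot2 version the plot is an object that is only rendered when printed.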

What is ggplot2

  • A package for making graphics in R
    • Originally based on Leland Wilkinson’s The Grammar of Graphics
    • Extended to R by Hadley Wickham
  • Declarative programming
    • Users supply the data and tell ggplot2 how to map variables to aesthetics and which graphical primitives to use, leaving the remaining details to the software
  • Layered grammar of graphics
    • Layers: components of a graph connected by +
    • Aesthetics: specified inside layers to control how each layer appears
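
A minimal sketch of the layered idea, assuming the mpg dataset that ships with ggplot2 (not used elsewhere in these slides). Each + adds one layer to the plot object, and an aesthetic can be mapped inside a single layer.

```r
library(ggplot2)

# Start with data and global mappings only; no layer has been added yet
base_layer <- ggplot(mpg, aes(x = displ, y = hwy))

# Add a point layer; the color aesthetic is mapped inside this layer only
with_points <- base_layer + geom_point(aes(color = class))

# Add a second layer; it inherits x and y but not the layer-specific color mapping
with_trend <- with_points + geom_smooth(method = "loess", se = FALSE, color = "black")

length(base_layer$layers) # number of layers before any geom is added
length(with_trend$layers) # number of layers after two geoms are added
with_trend
```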

Steps to create a ggplot2 graphic

  • All ggplots start with ggplot()
  • The data = argument declares a global data frame that every layer uses by default
  • Aesthetics are wrapped in aes() and often require variables mapped to x and y
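
The steps above can be sketched as follows. The mtcars dataset and the heavy_cars subset are hypothetical stand-ins chosen for the example; the point is that layers inherit the global data and aes() mappings unless a layer supplies its own.

```r
library(ggplot2)

# Hypothetical subset, used only to show a layer-level data override
heavy_cars <- subset(mtcars, wt > 5)

p_steps <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +  # global data and mappings
  geom_point() +                                           # inherits both
  geom_point(data = heavy_cars, color = "red", size = 3)   # own data, inherited x and y
p_steps
```

Setting color = "red" outside aes() fixes the color for the whole layer, whereas placing a variable inside aes() maps the data to the aesthetic.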

Elements of information visualization

Data

  • Load the tidyverse package

    • ggplot2: visualize data

    • dplyr: manage data

    • lubridate: manipulate date objects

    • stringr: manipulate strings
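
A small sketch of how these packages can work together before plotting. The visits data frame, its column names, and its values are invented for the example.

```r
library(dplyr)
library(lubridate)
library(stringr)
library(ggplot2)

# Hypothetical raw data with messy site names and dates stored as text
visits <- data.frame(
  site  = c(" seattle", "Seattle ", "TACOMA", "tacoma"),
  date  = c("2021-01-15", "2021-02-20", "2021-01-05", "2021-02-11"),
  count = c(10, 14, 7, 9)
)

clean <- visits %>%
  mutate(site  = str_to_title(str_trim(site)),        # stringr: tidy the labels
         month = month(ymd(date), label = TRUE)) %>%  # lubridate: parse dates, extract month
  group_by(site, month) %>%                           # dplyr: aggregate
  summarise(total = sum(count), .groups = "drop")

# ggplot2: visualize the cleaned data
ggplot(clean, aes(x = month, y = total, fill = site)) +
  geom_col(position = "dodge")
```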

Elements of information visualization

Objects

  • Known as geoms in ggplot2; a geom specifies how the data are drawn on the plot
  • Each geom is a layer added to the base plot with +
  • Geom functions are named geom_XXXX
  • Common geoms:
    • Points or text can represent location: geom_point(), geom_text(), geom_label()
    • Lines can represent numerical values and relationships: geom_line(), geom_smooth()
    • Polygons can represent area or size: geom_rect(), geom_bar(), geom_histogram()
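
To see how the choice of geom changes the presentation, the sketch below draws one set of mappings three ways. It assumes the built-in pressure dataset, which is not part of these slides.

```r
library(ggplot2)

# Shared data and mappings; only the geom differs below
base_plot <- ggplot(pressure, aes(x = temperature, y = pressure))

as_points <- base_plot + geom_point() # each observation as a location
as_line   <- base_plot + geom_line()  # observations connected to show the relationship
as_bars   <- base_plot + geom_col()   # each value as the height of a rectangle

as_points
as_line
as_bars
```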

Elements of information visualization

Aesthetics

  • Colors: use the default colors or a palette (ColorBrewer palettes, R palettes) and apply it through the fill or color argument

  • Line types: specify the linetype argument with an integer or a name such as "dashed"

  • Dot types: specify the shape argument with an integer or a character

  • Stacked bar plots or histograms: map a grouping variable to the fill argument
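
The aesthetics listed above might be combined as in this sketch. The mtcars dataset and the particular palettes, shape codes, and line type are choices made for the example.

```r
library(ggplot2)

# Color and shape mapped to a variable; palette and shape codes set through scales
p_points <- ggplot(mtcars, aes(x = wt, y = mpg,
                               color = factor(cyl), shape = factor(cyl))) +
  geom_point(size = 3) +
  scale_color_brewer(palette = "Set1") +     # a ColorBrewer palette
  scale_shape_manual(values = c(16, 17, 15)) # shapes given as integer codes

# Line type fixed by name for the whole layer (the integer 4 is equivalent)
p_line <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_smooth(method = "lm", se = FALSE, linetype = "dotdash")

# Mapping a variable to fill makes geom_bar() stack the groups by default
p_bars <- ggplot(mtcars, aes(x = factor(gear), fill = factor(cyl))) +
  geom_bar() +
  scale_fill_brewer(palette = "Set2")

p_points
p_line
p_bars
```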

Elements of information visualization

Components

  • Title
  • Legend
  • Annotations
  • Labels
  • Background
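
A minimal sketch that touches each of these components once, assuming the built-in mtcars dataset; the title, annotation text, and annotation position are invented for the example.

```r
library(ggplot2)

p_components <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  ggtitle("Fuel economy by weight") +                    # title
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",  # axis labels
       color = "Cylinders") +                            # legend title
  annotate("text", x = 4.5, y = 30,
           label = "Heavier cars use more fuel") +       # annotation
  theme_minimal() +                                      # background
  theme(legend.position = "bottom")                      # legend placement
p_components
```

annotate() adds a layer of its own, so the plot ends up with two layers: the points and the text.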

An illustration

How did GDP per capita change over time in Oceanian countries?

options(scipen = 999) # avoid scientific notation such as 1e+05 in printed numbers and axis labels
library(gapminder) # load gapminder dataset from gapminder package
library(tidyverse) # load tidyverse package for data wrangling and visualization
str(gapminder) # examine gapminder dataset
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
oceania <- gapminder %>%
  filter(continent == "Oceania")

An illustration

ggplot(oceania, # input the data
  aes(x = year, 
      y = gdpPercap, 
      color = country,
      linetype = country)) + # establish aesthetic mappings
  geom_line(linewidth = 1) + # draw the lines (linewidth replaces size for lines in ggplot2 >= 3.4.0)
  ggtitle("GDP per capita in Oceanian countries over time") + # add title
  labs(x = "Year", y = "GDP per capita") + # add labels
  theme_bw() # change background to white background with grey gridlines

ggplot(oceania, # input the data
  aes(x = year, 
      y = gdpPercap, 
      color = country,
      linetype = country)) + # establish aesthetic mappings
  geom_line(linewidth = 1) + # draw the lines (linewidth replaces size for lines in ggplot2 >= 3.4.0)
  geom_vline(xintercept = 1992, linetype = "dashed") + # add a dashed vertical line at 1992, after which GDP per capita rises faster
  scale_y_continuous(labels = scales::label_dollar()) + # format y-axis labels as dollar amounts
  ggtitle("GDP per capita in Oceanian countries over time") + # add title
  labs(x = "Year", y = "GDP per capita") + # add labels for x and y axes
  theme_bw() + # change background to white background with grey gridlines
  theme(plot.title = element_text(hjust = 0.5)) + # center the plot title
  guides(color = guide_legend("Country"), linetype = "none") # title the color legend "Country" and hide the duplicate linetype legend

knitr::kable(oceania)
|country     |continent | year| lifeExp|      pop| gdpPercap|
|:-----------|:---------|----:|-------:|--------:|---------:|
|Australia   |Oceania   | 1952|  69.120|  8691212|  10039.60|
|Australia   |Oceania   | 1957|  70.330|  9712569|  10949.65|
|Australia   |Oceania   | 1962|  70.930| 10794968|  12217.23|
|Australia   |Oceania   | 1967|  71.100| 11872264|  14526.12|
|Australia   |Oceania   | 1972|  71.930| 13177000|  16788.63|
|Australia   |Oceania   | 1977|  73.490| 14074100|  18334.20|
|Australia   |Oceania   | 1982|  74.740| 15184200|  19477.01|
|Australia   |Oceania   | 1987|  76.320| 16257249|  21888.89|
|Australia   |Oceania   | 1992|  77.560| 17481977|  23424.77|
|Australia   |Oceania   | 1997|  78.830| 18565243|  26997.94|
|Australia   |Oceania   | 2002|  80.370| 19546792|  30687.75|
|Australia   |Oceania   | 2007|  81.235| 20434176|  34435.37|
|New Zealand |Oceania   | 1952|  69.390|  1994794|  10556.58|
|New Zealand |Oceania   | 1957|  70.260|  2229407|  12247.40|
|New Zealand |Oceania   | 1962|  71.240|  2488550|  13175.68|
|New Zealand |Oceania   | 1967|  71.520|  2728150|  14463.92|
|New Zealand |Oceania   | 1972|  71.890|  2929100|  16046.04|
|New Zealand |Oceania   | 1977|  72.220|  3164900|  16233.72|
|New Zealand |Oceania   | 1982|  73.840|  3210650|  17632.41|
|New Zealand |Oceania   | 1987|  74.320|  3317166|  19007.19|
|New Zealand |Oceania   | 1992|  76.330|  3437674|  18363.32|
|New Zealand |Oceania   | 1997|  77.550|  3676187|  21050.41|
|New Zealand |Oceania   | 2002|  79.110|  3908037|  23189.80|
|New Zealand |Oceania   | 2007|  80.204|  4115771|  25185.01|

Information from one variable

# show frequencies of a variable
gapminder %>%
  filter(year == 1952) %>%
  ggplot(aes(x = lifeExp)) + 
  geom_histogram(binwidth = 2) +
  theme_light() +
  labs(x = "Life Expectancy", y = "Count", title = "Life Expectancy in 1952")

# show the estimated density of a variable
gapminder %>%
  filter(year == 1952) %>%
  ggplot(aes(x = lifeExp)) + 
  geom_density(linewidth = 1.5, alpha = 0.2, fill = "red") + # linewidth sets the outline thickness
  theme_light() +
  labs(x = "Life Expectancy", y = "Density", title = "Life Expectancy in 1952")

# show the distribution of a variable (median, 1st and 3rd quartiles, outliers)
gapminder %>% filter(year == 1952, continent=="Europe") %>%
  ggplot(aes(y = lifeExp)) + 
  geom_boxplot(fill = "grey", color = "blue", outlier.shape = 1) + # adjust aesthetics
  theme_light() +
  labs(title = "Life Expectancy in 1952 (Europe)", 
       y = "Life Expectancy", 
       x = "") 

# show the distribution of a discrete variable
gapminder %>%
  filter(year == 2007) %>% # keep one year so that each country is counted once
  ggplot(aes(x = continent,
             fill = continent)) + # differentiate the filled colors
  geom_bar() + 
  theme_classic() + 
  labs(y = "Number of countries", x = "Continent")

Information from multiple variables

gapminder %>%
  ggplot(aes(x = lifeExp)) + 
  geom_histogram(bins = 30) + # set the number of bins explicitly (30 is the default) to silence the stat_bin() message
  facet_wrap(~ continent, ncol = 3) +
  labs(title = "Life expectancy distribution by continent") +
  theme_minimal()

gapminder %>%
  filter(year == 2007 & continent != "Oceania") %>%
  ggplot(aes(x = lifeExp)) + 
  geom_density(aes(fill = continent), linewidth = 0.1, alpha = 0.5) + # thin outlines, semi-transparent fills so overlaps stay visible
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Life expectancy distribution in 2007") +
  theme_minimal()

gapminder %>% 
  filter(year > 1990) %>%
  group_by(year, continent) %>%
  summarise(totalpop = sum(as.double(pop)), .groups = "drop") %>% # as.double() avoids integer overflow; drop the grouping afterwards
  ggplot(aes(x = year, y = totalpop, fill = continent)) + 
  geom_col(position = "dodge", linewidth = 0.2, alpha = 0.8) + # dodge overlapping objects side by side 
  scale_x_continuous(breaks = seq(1992, 2007, 5), expand = c(0, 0)) +
  scale_y_continuous(labels = scales::comma, expand = c(0, 0)) +
  scale_fill_brewer(palette = "Set1") +
  theme_bw()

Practice

Exercise 1: Make a scatter plot of annual mean GDP per capita across all countries.

Exercise 2: Break down the plot from Exercise 1 by continent, using colors to distinguish the points and plotting mean GDP per capita on a log scale.

Exercise 3: Make a collection of bar plots faceted by year that compare mean GDP per capita across continents in a given year. Orient the plots to make it easier to read the continent labels.

Exercise 4: What is the relationship between life expectancy and GDP per capita in 2007 within each continent other than Oceania?

Solutions

# Exercise 1
gapminder %>%
  group_by(year) %>%
  summarize(meanGDPpc = mean(gdpPercap)) %>%
  ggplot(aes(x = year, y = meanGDPpc)) +
  geom_point()

# Exercise 2
gapminder %>%
  group_by(year, continent) %>% # aggregate the information by year by continent
  summarize(meanGDPpc = mean(gdpPercap), .groups = "drop") %>%
  ggplot(aes(x = year, y = meanGDPpc, color = continent)) +
  geom_point() +
  scale_y_log10() # apply the log scale to GDP per capita

# Exercise 3
gapminder %>%
  group_by(year, continent) %>%
  summarize(meanGDPpc = mean(gdpPercap), .groups = "drop") %>%
  ggplot(aes(x = continent, y = meanGDPpc)) +
  geom_col() +
  facet_wrap(~ year) +
  coord_flip() # flip the coordinates so that the continent names are visible

# Exercise 4
gapminder %>%
  filter(year == 2007 & continent != "Oceania") %>% 
  ggplot(aes(x = log(gdpPercap), # natural log of GDP per capita
             y = lifeExp,
             color = continent)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") + # linear fit within each continent
  facet_wrap(~ continent)

Extensions: interactive visuals

library(plotly)
animate_gapminder <- ggplot(gapminder, aes(gdpPercap, lifeExp, color = continent)) +
  geom_point(aes(size = pop, ids = country, frame = year)) +
  geom_smooth(se = FALSE, method = "lm") +
  scale_x_log10() + 
  theme_bw() + 
  labs(x = "GDP per capita", y = "Life expectancy") +
  theme(legend.position = "none") # remove legend

ggplotly(animate_gapminder) %>% 
  highlight("plotly_hover") %>%
  animation_slider(
    currentvalue = list(prefix = "Year ", font = list(color="black"))
  )

Extensions: spatial data

# dc_tes and dc_gunshot are sf objects prepared beforehand (not shown); geom_sf() requires the sf package
ggplot(dc_tes) + # input data on tree equity score in DC
  geom_sf(aes(fill = tes), color = "#e7e5cc") + # draw polygons with colors representing tree equity scores (tes)
  scale_fill_continuous(low = "#c2d6a4", high = "#1e3d14", na.value = "grey90", name = "Tree Equity Score") + # adjust aesthetics of polygons
  geom_sf(data = dc_gunshot, color = "#c62320", size = 0.1) + # add points representing gunshot incidents
  coord_sf(xlim = c(-77.12, -76.90), ylim = c(38.79, 39.01), expand = FALSE) + # remove empty space on the map
  labs(x = "", y = "", title = "GUNSHOT DETECTION MAP",
       subtitle = "Recorded shooting incidents in Washington D.C. during 2021",
       caption = str_wrap("Data come from the ShotSpotter gunshot detection system. Incidents of probable gunfire and firecrackers are excluded. Green shading represents tree equity scores, which compute how much tree canopy and surface temperature align with income, employment, race, age and health factors at each block | Visualization by Zhaowen Guo", width = 200))