9  Exploratory Data Analysis

9.1 Objectives

The objective of this module to begin exploring data using the summary functions and graphing abilities of R.

9.2 Preliminaries

9.3 Backstory

Exploratory data analysis (EDA) is typically the first step in any kind of data analysis in which you examine your dataset, often using visual methods, to summarize the main characteristics of your variables (central tendency, distributions) and check for various issues that may complicate further analysis (e.g., missing data). The influential statistician, John Tukey, wrote a classic book on the subject, Exploratory Data Analysis (Addison-Wesley) in 1977, in which he advocated the use of the five-number summary to quickly understand the distributions of numeric variables. These include:

  • the sample minimum (i.e., the value of the smallest observation)
  • the lower or first quartile
  • the median (i.e., the middle value in the distribution of observation)
  • the upper quartile or third quartile
  • the sample maximum (i.e., the value of the largest observation)

R has some very easy to use functions for taking a quick tour of your data. We have seen some of these already (e.g., head(), tail(), and str()), and you should always use these right after loading in a dataset to work with. Also useful are dim() to return the number of rows and columns in a data frame, names(), colnames(), and sometimes rownames().

NOTE: You can use the attach() function to make variables within data frames accessible in R with fewer keystrokes. The attach() function binds the variables from data frame named as an argument to the local namespace so that as long as the data frame is attached, variables can be called by their names without explicitly referring to the data frame. That is, if you attach() a data frame, then you do not need to use the $ operator or double bracket notation to refer to a particular variable or column vector from your data frame.

It is important to remember to detach() data frames when finished. It is also possible to attach multiple data frames (or the same data frame multiple times), and, if these share variable names, then the more recently attached one will mask the other. Thus, it is best to attach only one data frame at a time. You can also attach() and detach() packages from the namespace. In general, I do not recommend using attach() and detach() for dataframes.

EXAMPLE:

f <- "https://raw.githubusercontent.com/difiore/ada-datasets/main/Country-Data-2016.csv"
d <- read_csv(f, col_names = TRUE)
names(d)
##  [1] "country"        "population"     "area"           "govt_form"     
##  [5] "birthrate"      "deathrate"      "life_expect"    "mammals"       
##  [9] "birds"          "reptiles"       "amphibians"     "fishes"        
## [13] "mollucs"        "other_inverts"  "plants"         "fungi_protists"
attach(d)
mean(life_expect, na.rm = TRUE)
## [1] 72.19083
detach(d)
# mean(life_expect, na.rm=TRUE) # this throws an error!
mean(d$life_expect, na.rm = TRUE)
## [1] 72.19083

NOTE: The with() function accomplishes much the same thing as attach() but is self-contained and cleaner, especially for use in functions. If you use with(), however, all code to be run should be included as an argument of the function.

EXAMPLE:

with(d, mean(life_expect, na.rm = TRUE))
## [1] 72.19083

9.4 “Tidy” Data

The tabular data we have worked with thus far has all been presented in a format where each row represents a single case or observation or record, and each column contains a different variable that is scored for each of these observations. Data in this format is referred to as “tidy”, and the {tidyverse} set of R packages (which we have used a bit already) provides many tools for manipulating tabular data in this format.

Having data in a “tidy” format is useful for many kinds of analyses (although “tidy” data is not the ONLY useful data format). Often, then, the first manipulations we need to do with data in order to work with it efficiently are to reformat it to make it “tidy”.

There are multiple ways that data can be “non-tidy”, but one common one way is because data is organized in such a way to make data entry in a spreadsheet easier. For example, consider the following table of data representing body weight in grams for three individual titi monkeys, a species of South American primate, collected in each of two years:

Individual Year_1 Year_2
Lily 580 600
Leia 550 575
Loki 600 620

This data is easy to ENTER, but it is non-tidy… Why? Because the latter two columns actually represent a combinations of two variables (“Year” and “Weight”) that pertain to a given individual. Each row thus represents two observations.

Below is the same data in tidy format, where each row represents a single observation:

Individual Year Weight
Lily Year_1 580
Lily Year_2 600
Leia Year_1 550
Leia Year_2 575
Loki Year_1 600
Loki Year_2 620

We can convert data from the former format to the latter using the function pivot_longer(), where the first argument is the tabular data of interest, the next argument is a vector of columns to collect data from, the names_to= argument is the name we will assign to a new variable that indicates from which column values are being collected from in the non-tidy dataset, and the final argument, values_to=, is the name for the new variable into which we are collecting values. This process has the effect, often, of making “wide” tables narrower and longer, thus is sometime referred to as converting data to “long” format, thus the name for the function, pivot_longer(). Note that this function is essentially an update of an older {tidyr} function, gather(), which had a less intuitive name and set of argument names.

d <-
  data.frame(
    "Individual" = c("Lily", "Leia", "Loki"),
    "Year_1" = c(580, 550, 600),
    "Year_2" = c(600, 575, 620)
  )
d
##   Individual Year_1 Year_2
## 1       Lily    580    600
## 2       Leia    550    575
## 3       Loki    600    620
d <- pivot_longer(d, c("Year_1", "Year_2"), names_to = "Year", values_to = "Weight")
d
## # A tibble: 6 × 3
##   Individual Year   Weight
##   <chr>      <chr>   <dbl>
## 1 Lily       Year_1    580
## 2 Lily       Year_2    600
## 3 Leia       Year_1    550
## 4 Leia       Year_2    575
## 5 Loki       Year_1    600
## 6 Loki       Year_2    620

Alternatively, sometimes we will have data that is non-tidy because variables for the same observation are presented in different rows. Consider the following example of “non-tidy” data:

Species Variable Value
Orangutans Body Size (kg) 37
Orangutans Brain Size (cc) 340
Chimpanzees Body Size (kg) 38
Chimpanzees Brain Size (cc) 350
Gorillas Body Size (kg) 80
Gorillas Brain Size (cc) 470

The same data could be represented in “tidy” format as:

Species Body Size (kg) Brain Size (cc)
Orangutans 37 340
Chimpanzees 38 350
Gorillas 80 470

The pivot_wider() function lets us easily reformat. Here, names_from= is the column containing the different variables, and values_from= is the column containing the measured values of those variables. Note that this function is essentially an update of an older {tidyr} function, spread().

d <-
  data.frame(
    "Species" = c(
      "Orangutans",
      "Orangutans",
      "Chimpanzees",
      "Chimpanzees",
      "Gorillas",
      "Gorillas"
    ),
    "Variable" = rep(c("Body Size (kg)", "Brain Size (cc)"), 3),
    "Value" = c(37, 340, 38, 350, 80, 470)
  )
d
##       Species        Variable Value
## 1  Orangutans  Body Size (kg)    37
## 2  Orangutans Brain Size (cc)   340
## 3 Chimpanzees  Body Size (kg)    38
## 4 Chimpanzees Brain Size (cc)   350
## 5    Gorillas  Body Size (kg)    80
## 6    Gorillas Brain Size (cc)   470
d <- pivot_wider(d, names_from = "Variable", values_from = "Value")
d
## # A tibble: 3 × 3
##   Species     `Body Size (kg)` `Brain Size (cc)`
##   <chr>                  <dbl>             <dbl>
## 1 Orangutans                37               340
## 2 Chimpanzees               38               350
## 3 Gorillas                  80               470

This process has the effect, often, of making long and narrow tables wider, thus is sometime referred to as converting data to “wide” format.

The image below, pulled from a cheatsheet on Data Wrangling with {dplyr} and {tidyr}, summarizes the basics of these processes for “reshaping” data.

For the remainder of this module, all of the example datasets will comprise data already in the “tidy” format.

9.5 Exploring Single Variables

Variable Summaries

The summary() function from {base} R provides a quick overview of each variable in a data frame. For numeric variables, this includes the five-number summary and the mean, as well as a count of NA (missing values). For factors, it includes a count of each factor.

CHALLENGE

  • Load the “Country-Data-2016” dataset into a data frame variable, d, and summarize the variables in that data frame. You can load the file any way you want, e.g., load from a local file or access the data straight from GitHub, as in the code below.
# using {base} R
f <- curl("https://raw.githubusercontent.com/difiore/ada-datasets/main/Country-Data-2016.csv")
d <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
d <- as_tibble(d) # I like tibbles!
head(d)
## # A tibble: 6 × 16
##   country    population   area govt_form birthrate deathrate life_expect mammals
##   <chr>           <int>  <dbl> <chr>         <dbl>     <dbl>       <dbl>   <int>
## 1 Afghanist…   32564342 6.52e5 islamic …      38.6      13.9        50.9      11
## 2 Albania       3029278 2.87e4 republic       12.9       6.6        78.1       3
## 3 Algeria      39542166 2.38e6 republic       23.7       4.3        76.6      14
## 4 American …      54343 1.99e2 territor…      22.9       4.8        75.1       1
## 5 Andorra         85580 4.68e2 constitu…       8.1       7          82.7       2
## 6 Angola       19625353 1.25e6 republic       38.8      11.5        55.6      17
## # ℹ 8 more variables: birds <int>, reptiles <int>, amphibians <int>,
## #   fishes <int>, mollucs <int>, other_inverts <int>, plants <int>,
## #   fungi_protists <int>
names(d)
##  [1] "country"        "population"     "area"           "govt_form"     
##  [5] "birthrate"      "deathrate"      "life_expect"    "mammals"       
##  [9] "birds"          "reptiles"       "amphibians"     "fishes"        
## [13] "mollucs"        "other_inverts"  "plants"         "fungi_protists"
# or, using {readr}
f <- "https://raw.githubusercontent.com/difiore/ada-datasets/main/Country-Data-2016.csv"
d <- read_csv(f, col_names = TRUE)
## Rows: 248 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): country, govt_form
## dbl (14): population, area, birthrate, deathrate, life_expect, mammals, bird...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(d)
## # A tibble: 6 × 16
##   country    population   area govt_form birthrate deathrate life_expect mammals
##   <chr>           <dbl>  <dbl> <chr>         <dbl>     <dbl>       <dbl>   <dbl>
## 1 Afghanist…   32564342 6.52e5 islamic …      38.6      13.9        50.9      11
## 2 Albania       3029278 2.87e4 republic       12.9       6.6        78.1       3
## 3 Algeria      39542166 2.38e6 republic       23.7       4.3        76.6      14
## 4 American …      54343 1.99e2 territor…      22.9       4.8        75.1       1
## 5 Andorra         85580 4.68e2 constitu…       8.1       7          82.7       2
## 6 Angola       19625353 1.25e6 republic       38.8      11.5        55.6      17
## # ℹ 8 more variables: birds <dbl>, reptiles <dbl>, amphibians <dbl>,
## #   fishes <dbl>, mollucs <dbl>, other_inverts <dbl>, plants <dbl>,
## #   fungi_protists <dbl>
names(d)
##  [1] "country"        "population"     "area"           "govt_form"     
##  [5] "birthrate"      "deathrate"      "life_expect"    "mammals"       
##  [9] "birds"          "reptiles"       "amphibians"     "fishes"        
## [13] "mollucs"        "other_inverts"  "plants"         "fungi_protists"
  • What are the median area and population size of all countries in the dataset?

HINT: There are a couple of ways to do this… try summary() and median() (for the latter, you’ll need to use the na.rm=TRUE argument).

Show Code
summary(d)
Show Output
##    country            population             area            govt_form        
##  Length:248         Min.   :3.000e+01   Min.   :1.000e-01   Length:248        
##  Class :character   1st Qu.:2.991e+05   1st Qu.:1.769e+03   Class :character  
##  Mode  :character   Median :4.912e+06   Median :6.970e+04   Mode  :character  
##                     Mean   :2.999e+07   Mean   :6.110e+05                     
##                     3rd Qu.:1.803e+07   3rd Qu.:3.988e+05                     
##                     Max.   :1.367e+09   Max.   :1.710e+07                     
##                     NA's   :6           NA's   :1                             
##    birthrate       deathrate      life_expect       mammals      
##  Min.   : 0.00   Min.   : 0.00   Min.   :49.80   Min.   :  0.00  
##  1st Qu.:11.40   1st Qu.: 5.65   1st Qu.:67.40   1st Qu.:  3.00  
##  Median :16.40   Median : 7.40   Median :74.70   Median :  8.00  
##  Mean   :18.95   Mean   : 7.61   Mean   :72.19   Mean   : 13.85  
##  3rd Qu.:24.35   3rd Qu.: 9.40   3rd Qu.:78.40   3rd Qu.: 15.00  
##  Max.   :45.50   Max.   :14.90   Max.   :89.50   Max.   :188.00  
##  NA's   :17      NA's   :17      NA's   :19      NA's   :3       
##      birds           reptiles         amphibians          fishes      
##  Min.   :  0.00   Min.   :  0.000   Min.   :  0.000   Min.   :  0.00  
##  1st Qu.:  6.00   1st Qu.:  2.000   1st Qu.:  0.000   1st Qu.: 11.00  
##  Median : 12.00   Median :  5.000   Median :  0.000   Median : 25.00  
##  Mean   : 17.82   Mean   :  8.331   Mean   :  9.849   Mean   : 32.84  
##  3rd Qu.: 19.00   3rd Qu.:  8.000   3rd Qu.:  4.000   3rd Qu.: 43.00  
##  Max.   :165.00   Max.   :139.000   Max.   :215.000   Max.   :249.00  
##  NA's   :3        NA's   :3         NA's   :3         NA's   :3       
##     mollucs       other_inverts        plants        fungi_protists   
##  Min.   :  0.00   Min.   :  0.00   Min.   :   0.00   Min.   : 0.0000  
##  1st Qu.:  0.00   1st Qu.:  3.00   1st Qu.:   2.00   1st Qu.: 0.0000  
##  Median :  1.00   Median : 11.00   Median :  10.00   Median : 0.0000  
##  Mean   :  9.62   Mean   : 32.57   Mean   :  60.78   Mean   : 0.6082  
##  3rd Qu.:  6.00   3rd Qu.: 33.00   3rd Qu.:  44.00   3rd Qu.: 0.0000  
##  Max.   :301.00   Max.   :340.00   Max.   :1856.00   Max.   :12.0000  
##  NA's   :3        NA's   :3        NA's   :3         NA's   :3
Show Code
median(d$area, na.rm = TRUE)
Show Output
## [1] 69700
Show Code
median(d$population, na.rm = TRUE)
Show Output
## [1] 4911766
  • Create a new density variable in your data frame which is population / area. What are the 10 most dense countries? The 10 least dense?

HINT: Check out the order() function.

Show Code
d$density <- d$population / d$area
d <- d[order(d$density, decreasing = TRUE), ] # or
d <- d[order(-d$density), ]
d[1:10, ]
Show Output
## # A tibble: 10 × 17
##    country   population   area govt_form birthrate deathrate life_expect mammals
##    <chr>          <dbl>  <dbl> <chr>         <dbl>     <dbl>       <dbl>   <dbl>
##  1 Macau         592731   28   special …       8.9       4.2        84.5       0
##  2 Monaco         30535    2   constitu…       6.7       9.2        89.5       3
##  3 Holy See…        842    0.1 monarchy       NA        NA          NA         1
##  4 Singapore    5674472  697   republic        8.3       3.4        84.7      13
##  5 Hong Kong    7141106 1108   special …       9.2       7.1        82.9       3
##  6 Gibraltar      29258    7   British …      14.1       8.4        79.3       4
##  7 Bahrain      1346613  760   constitu…      13.7       2.7        78.7       3
##  8 Maldives      393253  298   republic       15.8       3.9        75.4       2
##  9 Malta         413965  316   republic       10.2       9.1        80.2       2
## 10 Bermuda        70196   54   British …      11.3       8.2        81.2       4
## # ℹ 9 more variables: birds <dbl>, reptiles <dbl>, amphibians <dbl>,
## #   fishes <dbl>, mollucs <dbl>, other_inverts <dbl>, plants <dbl>,
## #   fungi_protists <dbl>, density <dbl>
Show Code
d <- d[order(d$density), ]
d[1:10, ]
Show Output
## # A tibble: 10 × 17
##    country   population   area govt_form birthrate deathrate life_expect mammals
##    <chr>          <dbl>  <dbl> <chr>         <dbl>     <dbl>       <dbl>   <dbl>
##  1 South Ge…         30 3.90e3 British …      NA        NA          NA         3
##  2 Greenland      57733 2.17e6 autonomo…      14.5       8.5        72.1       9
##  3 Falkland…       3361 1.22e4 British …      NA        NA          NA         4
##  4 Pitcairn…         48 4.7 e1 British …      NA        NA          NA         1
##  5 Mongolia     2992908 1.56e6 republic       20.3       6.4        69.3      11
##  6 Western …     570866 2.66e5 autonomo…      30.2       8.3        62.6      10
##  7 French G…     181000 8.35e4 overseas…       0         0          76.1       8
##  8 Namibia      2212307 8.24e5 republic       19.8      13.9        51.6      14
##  9 Australia   22751014 7.74e6 parliame…      12.2       7.1        82.2      63
## 10 Iceland       331918 1.03e5 republic       13.9       6.3        83         6
## # ℹ 9 more variables: birds <dbl>, reptiles <dbl>, amphibians <dbl>,
## #   fishes <dbl>, mollucs <dbl>, other_inverts <dbl>, plants <dbl>,
## #   fungi_protists <dbl>, density <dbl>
  • Use subsetting or filtering to extract data from the 20 largest countries into a new variable, s. What are the median area and population size of these countries?
Show Code
s <- d[order(-d$population), ]
s <- s[1:20, ]
med_area <- median(s$area, na.rm = TRUE)
med_area
Show Output
## [1] 1052875
Show Code
med_pop <- median(s$population, na.rm = TRUE)
med_pop
Show Output
## [1] 124328234
  • Extract data from all countries beginning with the letters “A” through “F”. What are the mean area and population size of these countries?

NOTE: Single bracket notation subsetting is used here to return the rows of the d data frame where the country name (d$country) begins with the capital letters “A” through “F”. The grep() function is a pattern recognition function that uses a regular expression to pull out the country names of interest.

Show Code
s <- d[grep(pattern = "^[A-F]", d$country), ]
summary(s)
Show Output
##    country            population             area           govt_form        
##  Length:78          Min.   :5.960e+02   Min.   :      14   Length:78         
##  Class :character   1st Qu.:2.991e+05   1st Qu.:    4066   Class :character  
##  Mode  :character   Median :4.785e+06   Median :   51148   Mode  :character  
##                     Mean   :3.507e+07   Mean   :  918248                     
##                     3rd Qu.:1.469e+07   3rd Qu.:  466498                     
##                     Max.   :1.367e+09   Max.   :14000000                     
##                     NA's   :4                                                
##    birthrate       deathrate       life_expect       mammals    
##  Min.   : 0.00   Min.   : 0.000   Min.   :49.80   Min.   : 0.0  
##  1st Qu.:11.65   1st Qu.: 5.850   1st Qu.:68.75   1st Qu.: 3.0  
##  Median :15.90   Median : 7.700   Median :75.50   Median : 7.0  
##  Mean   :18.77   Mean   : 7.861   Mean   :72.25   Mean   :13.4  
##  3rd Qu.:23.30   3rd Qu.: 9.500   3rd Qu.:78.40   3rd Qu.:14.0  
##  Max.   :42.00   Max.   :14.400   Max.   :82.70   Max.   :81.0  
##  NA's   :7       NA's   :7        NA's   :7                     
##      birds           reptiles        amphibians         fishes      
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  6.00   1st Qu.: 2.000   1st Qu.:  0.00   1st Qu.: 10.00  
##  Median : 11.00   Median : 5.000   Median :  0.00   Median : 24.50  
##  Mean   : 18.62   Mean   : 7.397   Mean   : 11.86   Mean   : 29.54  
##  3rd Qu.: 18.00   3rd Qu.: 8.000   3rd Qu.:  3.00   3rd Qu.: 41.50  
##  Max.   :165.00   Max.   :43.000   Max.   :215.00   Max.   :133.00  
##                                                                     
##     mollucs       other_inverts        plants        fungi_protists  
##  Min.   :  0.00   Min.   :  0.00   Min.   :   0.00   Min.   :0.0000  
##  1st Qu.:  0.00   1st Qu.:  4.00   1st Qu.:   2.25   1st Qu.:0.0000  
##  Median :  1.00   Median : 11.00   Median :  10.00   Median :0.0000  
##  Mean   : 10.17   Mean   : 23.63   Mean   :  70.64   Mean   :0.6026  
##  3rd Qu.:  5.00   3rd Qu.: 25.25   3rd Qu.:  41.75   3rd Qu.:0.0000  
##  Max.   :174.00   Max.   :340.00   Max.   :1856.00   Max.   :8.0000  
##                                                                      
##     density         
##  Min.   :   0.2761  
##  1st Qu.:  24.5932  
##  Median :  75.0297  
##  Mean   : 162.4785  
##  3rd Qu.: 140.7140  
##  Max.   :1771.8592  
##  NA's   :4

Doing the same thing with {dplyr} verbs (see Module 10) is easier…

Show Code
s <- filter(d, grepl(pattern = "^[A-F]", country))
summary(s)
Show Output
##    country            population             area           govt_form        
##  Length:78          Min.   :5.960e+02   Min.   :      14   Length:78         
##  Class :character   1st Qu.:2.991e+05   1st Qu.:    4066   Class :character  
##  Mode  :character   Median :4.785e+06   Median :   51148   Mode  :character  
##                     Mean   :3.507e+07   Mean   :  918248                     
##                     3rd Qu.:1.469e+07   3rd Qu.:  466498                     
##                     Max.   :1.367e+09   Max.   :14000000                     
##                     NA's   :4                                                
##    birthrate       deathrate       life_expect       mammals    
##  Min.   : 0.00   Min.   : 0.000   Min.   :49.80   Min.   : 0.0  
##  1st Qu.:11.65   1st Qu.: 5.850   1st Qu.:68.75   1st Qu.: 3.0  
##  Median :15.90   Median : 7.700   Median :75.50   Median : 7.0  
##  Mean   :18.77   Mean   : 7.861   Mean   :72.25   Mean   :13.4  
##  3rd Qu.:23.30   3rd Qu.: 9.500   3rd Qu.:78.40   3rd Qu.:14.0  
##  Max.   :42.00   Max.   :14.400   Max.   :82.70   Max.   :81.0  
##  NA's   :7       NA's   :7        NA's   :7                     
##      birds           reptiles        amphibians         fishes      
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  6.00   1st Qu.: 2.000   1st Qu.:  0.00   1st Qu.: 10.00  
##  Median : 11.00   Median : 5.000   Median :  0.00   Median : 24.50  
##  Mean   : 18.62   Mean   : 7.397   Mean   : 11.86   Mean   : 29.54  
##  3rd Qu.: 18.00   3rd Qu.: 8.000   3rd Qu.:  3.00   3rd Qu.: 41.50  
##  Max.   :165.00   Max.   :43.000   Max.   :215.00   Max.   :133.00  
##                                                                     
##     mollucs       other_inverts        plants        fungi_protists  
##  Min.   :  0.00   Min.   :  0.00   Min.   :   0.00   Min.   :0.0000  
##  1st Qu.:  0.00   1st Qu.:  4.00   1st Qu.:   2.25   1st Qu.:0.0000  
##  Median :  1.00   Median : 11.00   Median :  10.00   Median :0.0000  
##  Mean   : 10.17   Mean   : 23.63   Mean   :  70.64   Mean   :0.6026  
##  3rd Qu.:  5.00   3rd Qu.: 25.25   3rd Qu.:  41.75   3rd Qu.:0.0000  
##  Max.   :174.00   Max.   :340.00   Max.   :1856.00   Max.   :8.0000  
##                                                                      
##     density         
##  Min.   :   0.2761  
##  1st Qu.:  24.5932  
##  Median :  75.0297  
##  Mean   : 162.4785  
##  3rd Qu.: 140.7140  
##  Max.   :1771.8592  
##  NA's   :4

NOTE: Here, we use grepl() instead of grep() because we want to return a logical vector for all of the elements of “country” to filter() on the basis of. grep() returns a vector of index positions.

Or, instead of summary(), we can use mean() with the na.rm=TRUE argument…

Show Code
mean(s$population, na.rm = TRUE)
Show Output
## [1] 35065172
Show Code
mean(s$area, na.rm = TRUE)
Show Output
## [1] 918247.7

There are a number of other packages and associated functions we might also use to generate nice summaries of our data.

For example, with the skim() function from {skimr}, you can get for numeric variables a count of observations, the number of missing observations, the mean and standard deviation, and the five-number summary. The select(), kable() and kable_styling() functions will format the output nicely!

NOTE: The following code block chains together a set of data wrangling commands using the native pipe operator (|>), which we have not covered yet, but is discussed in detail in Module 10.

library(skimr)
library(kableExtra)
s <- skim(d) # the main `skimr()` function
s |>
  filter(skim_type == "numeric") |>
  rename(
    variable = skim_variable, missing = n_missing, mean = numeric.mean,
    sd = numeric.sd, min = numeric.p0, p25 = numeric.p25, median = numeric.p50,
    p75 = numeric.p75, max = numeric.p100, hist = numeric.hist
  ) |>
  select(variable, missing, mean, sd, min, median, max, hist) |>
  # drop p25 and p75 for purposes of display
  kable() |>
  kable_styling(font_size = 10, full_width = FALSE)
variable missing mean sd min median max hist
population 6 2.998647e+07 1.241345e+08 30.0000000 4.911766e+06 1.367485e+09 ▇▁▁▁▁
area 1 6.109518e+05 1.927077e+06 0.1000000 6.970000e+04 1.709824e+07 ▇▁▁▁▁
birthrate 17 1.894892e+01 9.925912e+00 0.0000000 1.640000e+01 4.550000e+01 ▂▇▅▂▁
deathrate 17 7.610390e+00 3.135197e+00 0.0000000 7.400000e+00 1.490000e+01 ▁▅▇▃▂
life_expect 19 7.219083e+01 8.587024e+00 49.8000000 7.470000e+01 8.950000e+01 ▂▂▃▇▂
mammals 3 1.385306e+01 2.058178e+01 0.0000000 8.000000e+00 1.880000e+02 ▇▁▁▁▁
birds 3 1.781633e+01 2.213342e+01 0.0000000 1.200000e+01 1.650000e+02 ▇▁▁▁▁
reptiles 3 8.330612e+00 1.408903e+01 0.0000000 5.000000e+00 1.390000e+02 ▇▁▁▁▁
amphibians 3 9.848980e+00 2.869408e+01 0.0000000 0.000000e+00 2.150000e+02 ▇▁▁▁▁
fishes 3 3.283673e+01 3.552576e+01 0.0000000 2.500000e+01 2.490000e+02 ▇▂▁▁▁
mollucs 3 9.620408e+00 2.734627e+01 0.0000000 1.000000e+00 3.010000e+02 ▇▁▁▁▁
other_inverts 3 3.256735e+01 5.362966e+01 0.0000000 1.100000e+01 3.400000e+02 ▇▁▁▁▁
plants 3 6.077551e+01 1.618624e+02 0.0000000 1.000000e+01 1.856000e+03 ▇▁▁▁▁
fungi_protists 3 6.081633e-01 1.847000e+00 0.0000000 0.000000e+00 1.200000e+01 ▇▁▁▁▁
density 6 4.230267e+02 1.882632e+03 0.0076864 8.389234e+01 2.116896e+04 ▇▁▁▁▁
detach(package:kableExtra)
detach(package:skimr)

With descr(), dfSummary(), and view() from the {summarytools} package, you can also return nicely formatted tables of summary statistics. Note that the output of the code below is not shown.

CAUTION: Q1 and Q3 (the equivalent of “numeric.p25” and “numeric.p75”) are calculated slightly differently for descr() and dfSummary() than for summary() and skim()!

library(summarytools)
s <- descr(d, style = "rmarkdown", transpose = TRUE)
# |> to view() print nicely formatted table to viewer
s |> summarytools::view()
s <- dfSummary(d, style = "grid", plain.ascii = FALSE)
s |> summarytools::view()
detach(package:summarytools)

With makeDataReport() from the {dataMaid} package, you can generate a nicely formated report that gets written out to a location of your choice. Again, the output of this code is not shown as it generates a report in an external file.

library(dataMaid)
# this code below produces a formated report, with the type of report
# specified by `output=`
makeDataReport(d,
  output = "html",
  file = "~/Desktop/dataMaid-output.Rmd",
  replace = TRUE
)
detach(package:dataMaid)

Additionally, the packages {psych}, {pastecs}, and {Hmisc} also all contain functions for producing variable summaries:

psych::describe(d)

pastecs::stat.desc(d)

Hmisc::describe(d)

Boxplots, Barplots, and Dotcharts

The boxplot() function from {base} R provides a box-and-whiskers visual representation of the five-number summary, sometimes noting outliers that go beyond the bulk of the data. The function balks if you pass it nonnumeric data, so you will want to reference columns of interest specifically using either double bracket notation or the $ operator.

NOTE: Actually, it is technically not true that boxplot() always shows you the five-number summary. One of the default arguments for boxplot() includes range=1.5, which means that the “whiskers” of the boxplot extend to 1.5 times the interquartile range, and then values more extreme than that are indicated as single points. If range=0, then, the “whiskers” extend to the minimum and maximum values of the data.

The barplot() function is also useful taking a quick look at crude data, with bar height proportional to the value of the variable. The function dotchart() provides a similar graphical summary.

CHALLENGE

  • Make boxplots, barplots, and dotcharts of the raw population and area data, then log() transform the variables and repeat the boxplots.

NOTE: The par() command in the code below lets you set up a grid of panels in which to plot. Here, we set up a 1 row x 2 column grid.

Show Code
par(mfrow = c(1, 2))
d$log_population <- log(d$population)
d$log_area <- log(d$area)
boxplot(d$population, ylab = "Population")
boxplot(d$area, ylab = "Area")

Show Code
barplot(d$population, xlab = "Case", ylab = "Population")
dotchart(d$population, xlab = "Population", ylab = "Case")

Show Code
barplot(d$area, xlab = "Case", ylab = "Area")
dotchart(d$area, xlab = "Area", ylab = "Case")

Show Code
boxplot(d$log_population, ylab = "log(Population)")
boxplot(d$log_area, ylab = "log(Area)")

Plotting with {ggplot2}

As an alternative for plotting, we can use the “grammar of graphics” {ggplot2} package. In the code below, we first use tidyr::pivot_longer() to convert our data from wide to long format… this is so we can then use facet.grid().

# uncomment this to look at a different set of variables
# d_long <- pivot_longer(d, c("log_population","log_area"),
# names_to="Variable", values_to="Value")

# uncomment this to look at a different set of variables
# d_long <- pivot_longer(d, c("mammals","birds","reptiles","fishes"),
# names_to="Variable", values_to="Value")

d_long <-
  pivot_longer(d,
    c("birthrate", "deathrate", "life_expect"),
    names_to = "Variable",
    values_to = "Value"
  )

p <- ggplot(data = d_long, aes(x = factor(0), y = Value)) +
  geom_boxplot(na.rm = TRUE, outlier.shape = NA) +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  ) +
  geom_dotplot(
    binaxis = "y",
    stackdir = "center",
    stackratio = 0.2,
    alpha = 0.3,
    dotsize = 0.5,
    color = NA,
    fill = "red",
    na.rm = TRUE
  ) +
  facet_grid(. ~ Variable) +
  geom_rug(sides = "l")
p

p <- ggplot(data = d_long, aes(x = factor(0), y = Value)) +
  geom_violin(na.rm = TRUE, draw_quantiles = c(0.25, 0.5, 0.75)) +
  theme(
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  ) +
  geom_dotplot(
    binaxis = "y",
    stackdir = "center",
    stackratio = 0.2,
    alpha = 0.3,
    dotsize = 0.5,
    color = NA,
    fill = "red",
    na.rm = TRUE
  ) +
  facet_grid(. ~ Variable) +
  geom_rug(sides = "l")
p

Histograms

The hist() function returns a histogram showing the complete empirical distribution of the data in binned categories, which is useful for checking skewwness of the data, symmetry, multi-modality, etc. Setting the argument freq=FALSE will scale the Y axis to represent the proportion of observations falling into each bin rather than the count.

CHALLENGE

  • Make histograms of the log() transformed population and area data from the “Country-Data-2016” file. Explore what happens if you set freq=FALSE versus the default of freq=TRUE. Try looking at other variables as well.
Show Code
par(mfrow = c(1, 2)) # sets up two panels side by side
attach(d) # lets us use variable names without specifying the data frame!
hist(
  log(population),
  freq = FALSE,
  col = "red",
  main = "Plot 1",
  xlab = "log(population size)",
  ylab = "density",
  ylim = c(0, 0.2)
)
hist(
  log(area),
  freq = FALSE,
  col = "red",
  main = "Plot 2",
  xlab = "log(area)",
  ylab = "density",
  ylim = c(0, 0.2)
)

NOTE: You can add lines to your histograms (e.g., to show the mean value for a variable) using the abline() command, with arguments. For example, to show a single vertical line representing the mean log(population size), you would add the argument v=mean(log(area), na.rm=TRUE). We need to set na.rm to be TRUE otherwise the mean() function will not run the way we expect.

Show Code
hist(
  log(area),
  freq = FALSE,
  col = "red",
  main = "Plot 2",
  xlab = "log(area)",
  ylab = "density",
  ylim = c(0, 0.2)
)
abline(v = mean(log(area), na.rm = TRUE))

Density Plots

The density() function computes a non-parametric estimate of the distribution of a variable, which can be combined with other plots to also yield a graphical view of the distribution of the data. If your data have missing values, then you need to add the argument na.rm=TRUE to the density() function. To superimpose a density() curve on a histogram, you can use the lines(density()) function.

par(mfrow = c(1, 1)) # set up one panel and redraw the log(population) histogram
hist(
  log(population),
  freq = FALSE,
  col = "white",
  main = "Density Plot with Mean",
  xlab = "log(population size)",
  ylab = "density",
  ylim = c(0, 0.2)
)
abline(v = mean(log(population), na.rm = TRUE), col = "blue")
lines(density(log(population), na.rm = TRUE), col = "green")

detach(d)

Tables

The table() function can be used to summarize counts and proportions for categorical variables in your dataset.

CHALLENGE

  • Using the table() function, find what is the most common form of government in the “Country-Data-2016” dataset. How many countries have that form?

HINT: We can combine table() with sort() and the argument decreasing=TRUE to get the desired answered straight away.

Show Code
t <- sort(table(d$govt_form), decreasing = TRUE)
t
Show Output
## 
##                                republic                 constitutional monarchy 
##                                     127                                      33 
##              British overseas territory            overseas territory of France 
##                                      12                                       7 
##           presidential federal republic                                monarchy 
##                                       7                                       6 
##                  parliamentary monarchy          territory of the United States 
##                                       5                                       5 
##                 parliamentary democracy                  territory of Australia 
##                                       4                                       4 
## autonomous territory of the Netherlands                        federal republic 
##                                       3                                       3 
##                        islamic republic                      socialist republic 
##                                       3                                       3 
##                       absolute monarchy            autonomous region of Denmark 
##                                       2                                       2 
##     autonomous territory of New Zealand            overseas community of France 
##                                       2                                       2 
##          parliamentary federal republic  special administrative region of China 
##                                       2                                       2 
##                     territory of Norway                       autonomous region 
##                                       2                                       1 
##             autonomous region of France            autonomous region of Morocco 
##                                       1                                       1 
##                              federation         foreign-administrated territory 
##                                       1                                       1 
##      islamic-socialist peoples republic                  parliamentary republic 
##                                       1                                       1 
##                        peoples republic                     territory of France 
##                                       1                                       1 
##                territory of New Zealand            territory of the Netherlands 
##                                       1                                       1

Doing the same thing with {dplyr} verbs (see Module 10) provides nicer output…

Show Code
t <-
  group_by(d, govt_form) |>
  summarize(count = n()) |>
  arrange(desc(count))
t
Show Output
## # A tibble: 33 × 2
##    govt_form                      count
##    <chr>                          <int>
##  1 republic                         127
##  2 constitutional monarchy           33
##  3 British overseas territory        12
##  4 overseas territory of France       7
##  5 presidential federal republic      7
##  6 monarchy                           6
##  7 parliamentary monarchy             5
##  8 territory of the United States     5
##  9 parliamentary democracy            4
## 10 territory of Australia             4
## # ℹ 23 more rows

9.6 Exploring Multiple Variables

Plotting Several Variables

  • Multiple boxplots or histograms can be laid out side-by-side or overlaid.

CHALLENGE

Read in the dataset “KamilarAndCooperData”, which contains a host of summary information about 213 primate species.

Spend some time exploring the data on your own and then make boxplots of log(female body mass) ~ family. Try doing this with {base} graphics and then look at how we might do in in {ggplot2}, which provides a standard “grammar of graphics” (see the {ggplot2} documentation)

Show Code
f <- "https://raw.githubusercontent.com/difiore/ada-datasets/main/KamilarAndCooperData.csv"
d <- read_csv(f, col_names = TRUE)
head(d)
Show Output
## # A tibble: 6 × 45
##   Scientific_Name        Superfamily Family Genus Species Brain_Size_Species_M…¹
##   <chr>                  <chr>       <chr>  <chr> <chr>                    <dbl>
## 1 Allenopithecus_nigrov… Cercopithe… Cerco… Alle… nigrov…                   58.0
## 2 Allocebus_trichotis    Cercopithe… Cerco… Allo… tricho…                   NA  
## 3 Alouatta_belzebul      Ceboidea    Ateli… Alou… belzeb…                   52.8
## 4 Alouatta_caraya        Ceboidea    Ateli… Alou… caraya                    52.6
## 5 Alouatta_guariba       Ceboidea    Ateli… Alou… guariba                   51.7
## 6 Alouatta_palliata      Ceboidea    Ateli… Alou… pallia…                   49.9
## # ℹ abbreviated name: ¹​Brain_Size_Species_Mean
## # ℹ 39 more variables: Brain_Size_Female_Mean <dbl>, Brain_size_Ref <chr>,
## #   Body_mass_male_mean <dbl>, Body_mass_female_mean <dbl>,
## #   Mass_Dimorphism <dbl>, Mass_Ref <chr>, MeanGroupSize <dbl>,
## #   AdultMales <dbl>, AdultFemale <dbl>, AdultSexRatio <dbl>,
## #   Social_Organization_Ref <chr>, InterbirthInterval_d <dbl>, Gestation <dbl>,
## #   WeaningAge_d <dbl>, MaxLongevity_m <dbl>, LitterSz <dbl>, …

We can also do quick summaries of and visualizations of data distributions for multiple variables at the same time using the {skimr} and {kableExtra} packages.

library(skimr)
library(kableExtra)
Show Code
s <- skim(d) # formats results to a wide table
# here we make use of the `|>` operator and {dplyr} verbs... see below
s |>
  filter(skim_variable == "Scientific_Name" | skim_type == "numeric") |>
  rename(
    variable = skim_variable, missing = n_missing, mean = numeric.mean,
    sd = numeric.sd, min = numeric.p0, p25 = numeric.p25, median = numeric.p50,
    p75 = numeric.p75, max = numeric.p100, hist = numeric.hist
  ) |>
  select(variable, missing, mean, sd, min, median, max, hist) |>
  # drop p25 and p75 for purposes of display
  kable() |>
  kable_styling(font_size = 10, full_width = FALSE)
variable missing mean sd min median max hist
Scientific_Name 0 NA NA NA NA NA NA
Brain_Size_Species_Mean 42 68.109649 7.768403e+01 1.630 61.45000 491.2700 ▇▂▁▁▁
Brain_Size_Female_Mean 48 65.235879 7.371151e+01 1.660 57.20000 480.1500 ▇▂▁▁▁
Body_mass_male_mean 18 8111.800513 1.920858e+04 31.000 4290.00000 170400.0000 ▇▁▁▁▁
Body_mass_female_mean 18 5396.473846 1.008896e+04 30.000 3039.00000 97500.0000 ▇▁▁▁▁
Mass_Dimorphism 18 1.246005 3.212141e-01 0.841 1.10900 2.6880 ▇▃▂▁▁
MeanGroupSize 60 15.055065 1.759245e+01 1.000 7.80000 90.0000 ▇▂▁▁▁
AdultMales 68 2.515621 2.373255e+00 0.900 1.50000 16.0000 ▇▁▁▁▁
AdultFemale 68 5.048655 5.428615e+00 1.000 2.90000 25.2000 ▇▂▁▁▁
AdultSexRatio 84 2.304729 2.192575e+00 0.500 1.45000 15.6000 ▇▁▁▁▁
InterbirthInterval_d 103 572.058546 3.549477e+02 144.470 476.93000 2007.5000 ▇▅▁▁▁
Gestation 75 163.465362 3.740063e+01 59.990 165.04000 256.0000 ▁▃▇▃▂
WeaningAge_d 95 310.042881 2.546776e+02 40.000 237.73500 1260.8100 ▇▅▂▁▁
MaxLongevity_m 66 327.331701 1.259496e+02 103.000 303.60000 720.0000 ▅▇▆▂▁
LitterSz 47 1.180542 3.601098e-01 0.990 1.01000 2.5200 ▇▁▁▁▁
GR_MidRangeLat_dd 38 -1.796457 1.362270e+01 -24.500 -0.76000 35.8800 ▆▆▇▂▁
Precip_Mean_mm 38 1543.134286 5.158597e+02 419.000 1541.90000 2794.3000 ▂▅▇▅▂
Temp_Mean_degC 38 23.132000 3.405405e+00 2.600 24.30000 27.4000 ▁▁▁▂▇
AET_Mean_mm 38 1253.112000 2.635696e+02 453.100 1291.10000 1828.3000 ▁▃▆▇▂
PET_Mean_mm 38 1553.158857 1.567096e+02 842.500 1566.90000 1927.3000 ▁▁▂▇▂
HomeRange_km2 65 1.937905 5.210604e+00 0.002 0.27500 28.2400 ▇▁▁▁▁
DayLength_km 104 1.551449 1.439277e+00 0.250 1.21200 11.0000 ▇▁▁▁▁
Territoriality 109 2.228885 2.201259e+00 0.225 1.59235 15.5976 ▇▁▁▁▁
Fruit 98 47.736956 2.408200e+01 1.000 48.00000 97.0000 ▅▇▇▇▃
Canine_Dimorphism 92 1.617438 6.722173e-01 0.880 1.56000 5.2630 ▇▃▁▁▁
Feed 141 33.075833 1.366899e+01 9.000 33.30000 63.9000 ▇▇▇▇▂
Move 143 21.668857 1.059124e+01 3.000 21.00000 70.6000 ▅▇▂▁▁
Rest 143 34.259571 1.967175e+01 4.000 30.48000 78.5000 ▇▇▅▆▂
Social 136 7.368701 5.242077e+00 0.900 5.40000 23.5000 ▇▃▃▁▁
detach(package:kableExtra)
detach(package:skimr)

As another type of exporatory data visualization involving multiple variables, it is often useful to make boxplots summarizing a numerical variable in relation to a categorical variable.

Plotting using {base} graphics… the ~ operator can be read as “by”.

boxplot(log(d$Body_mass_female_mean) ~ d$Family)

Alternatively, plotting using {ggplot2}…

p <- ggplot(data = d, aes(x = Family, y = log(Body_mass_female_mean)))
p <- p + geom_boxplot(na.rm = TRUE)
p <- p + theme(axis.text.x = element_text(angle = 90))
p <- p + ylab("log(Female Body Mass)")
p

Bivariate Scatterplots

Scatterplots are a natural tool for visualizing two continuous variables and can be made easily with the plot(x=<variable 1>, y=<variable 2>) function in {base} graphics (where <variable 1> and <variable 2> denote the names of the two variables you wish to plot). Transformations of the variables, e.g., log or square root (sqrt()), may be necessary for effective visualization.

CHALLENGE

  • Again using data from the “KamilarAndCooperData” dataset, plot the relationship between female body size and female brain size. Then, play with log transforming the data and plot again.
Show Code
par(mfrow = c(1, 2)) # sets up two panels side by side
plot(x = d$Body_mass_female_mean, y = d$Brain_Size_Female_Mean)
plot(x = log(d$Body_mass_female_mean), y = log(d$Brain_Size_Female_Mean))

The grammar for {ggplot2} is a bit more complicated… see if you can follow it in the example below.

p <- ggplot(data = d, aes(
  x = log(Body_mass_female_mean),
  y = log(Brain_Size_Female_Mean),
  color = factor(Family)
)) # first, we build a plot object and color points by Family
# then we modify the axis labels
p <- p + xlab("log(Female Body Mass)") + ylab("log(Female Brain Size)")
# then we make a scatterplot
p <- p + geom_point(na.rm = TRUE)
# then we modify the legend
p <- p + theme(legend.position = "bottom", legend.title = element_blank())
# and, finally, we plot the object
p

We can use the cool package {ggExtra} to add marginal univariate plots to our bivariate scatterplots, too!

library(ggExtra)
ggMarginal(p, type = "densigram") # try with other types, too!

detach(package:ggExtra)

Using {ggplot2}, we can also easily set up a grid for “faceting”” by a grouping variable…

p <- p +
  facet_wrap(~Family, ncol = 4) # wrap data "by" family into 4 columns
p <- p + theme(legend.position = "none")
p

In {ggplot2} can easily add regression lines to our plot. Here, we add a linear model to each facet.

p <- p + geom_smooth(method = "lm", fullrange = FALSE, na.rm = TRUE)
p

The scatterplot() function from the {car} package also produces nice bivariate plots and includes barplots along the margins.

library(car)
scatterplot(
  data = d,
  log(Brain_Size_Female_Mean) ~ log(Body_mass_female_mean),
  xlab = "log(Female Body Mass",
  ylab = "log(Female Brain Size",
  boxplots = "xy",
  regLine =
    list(
      method = lm,
      lty = 1,
      lwd = 2,
      col = "red"
    )
)

detach(package:car)

CHALLENGE

  • Build your own bivariate scatterplot of log(MaxLongevity_m) by log(Body_mass_female_mean) using the “KamilarAndCooperData” dataset.
Show Code
p <- ggplot(data = d, aes(
  x = log(Body_mass_female_mean),
  y = log(MaxLongevity_m)
))
p <- p + geom_point(na.rm = TRUE)
p <- p + geom_smooth(method = "lm", na.rm = TRUE)
p
## `geom_smooth()` using formula = 'y ~ x'

We can also use either the pairs() function from {base} R, the scatterplotMatrix() function from the {car} package, or the pairs.panel() function from the {psych} package to do multiple scatterplots simulaneously. The ggpairs() function from the {GGally} package can also be used, but it is slower.

CHALLENGE

Select the variables Brain_Size_Female_Mean, Body_mass_female_mean, MeanGroupSize, WeaningAge_d, MaxLongevity_m, HomeRange_km2, and DayLength_km from data frame d and plot scatterplots of all pairs of variables.

NOTE: Here we are using the select() function from the {dplyr} package, which we cover in more detail in Module 10.

Show Code
s <-
  select(
    d,
    c(
      "Brain_Size_Female_Mean",
      "Body_mass_female_mean",
      "MeanGroupSize",
      "WeaningAge_d",
      "MaxLongevity_m",
      "HomeRange_km2",
      "DayLength_km"
    )
  )
pairs(s[, 1:ncol(s)]) # or

Show Code
pairs(data = s, ~.) # NAs are ignored by default
# adding argument `upper.panel=NULL` will suppress plots above the diagonal

The {car} package function scatterplotMatrix() provides a more customizable interface to the pairs() function from {base} R. It includes a neat rug plot for each variable.

library(car)
scatterplotMatrix(s,
  smooth = TRUE,
  regLine = list(method = lm, lty = 1, lwd = 2),
  ellipse = TRUE,
  upper.panel = NULL
)

detach(package:car)

The {psych} package also makes it easy to construct a scatterplot matrix with customizable output using the pairs.panel() function. Here, method= specifies the correlation method, density= specifies whether to superimpose a density curve on the histogram, and elipses= specifies whether to show correlation elipses.

library(psych)
pairs.panels(s[],
  smooth = FALSE,
  lm = TRUE,
  method = "pearson",
  hist.col = "#00AFBB",
  density = TRUE,
  ellipses = TRUE
)

# NAs are ignored by default
detach(package:psych)

The {GGally} package lets us do something similar…

library(GGally)
ggpairs(s, columns = 1:ncol(s)) # NAs produce a warning

detach(package:GGally)

Finally, we can do a nice correlation plot using the {corrplot} package, where we get a matrix of correlations among variables and a graphical representation of the strength and direction of the bivariate correlation coefficients among them.

library(corrplot)
cor <- cor(s, use = "pairwise.complete.obs")
corrplot.mixed(cor, lower.col = "black", number.cex = .7)

detach(package:corrplot)

9.7 Custom Layouts for Multiple Plots

Above, we saw the typical {base} R method for laying out plots, using the par() command with the mfrow= argument to set up a grid of panels in which to plot to the current graphics output device. There are, however, several easier ways for combining plots into custom layouts. Three alternatives are provided in the {patchwork}, {cowplot}, and {gridExtra} packages. Each starts by generating plot objects, typically using {ggplot2}. First, we create 3 plot objects using the KamilarAndCooper dataset:

p1 <- ggplot(data = d, aes(
  x = log(Body_mass_female_mean),
  y = log(MaxLongevity_m)
))
p1 <- p1 + geom_point(na.rm = TRUE)
p1 <- p1 + geom_smooth(method = "lm", na.rm = TRUE)
p1 <- p1 + xlab("log(Female Body Size)") + ylab("log(Lifespan)")

p2 <- ggplot(data = d, aes(
  x = log(Body_mass_male_mean),
  y = log(MaxLongevity_m)
))
p2 <- p2 + geom_point(na.rm = TRUE)
p2 <- p2 + geom_smooth(method = "lm", na.rm = TRUE)
p2 <- p2 + xlab("log(Male Body Size)") + ylab("log(Lifespan)")

p3 <- ggplot(data = d, aes(x = Family, y = log(Body_mass_female_mean)))
p3 <- p3 + geom_boxplot(na.rm = TRUE)
p3 <- p3 + theme(axis.text.x = element_text(angle = 90))
p3 <- p3 + xlab("Family") + ylab("log(Female Body Mass)")

Once we have these plot objects, we can arrange them in custom ways using these different packages. Some alternative ways of arranging these three plots in 2 rows, with 2 panels in the top row and 1 in the bottom row, are given below…

The {patchwork} package is perhaps the easiest to use!

library(patchwork)
(p1 | p2) / p3 + plot_annotation(tag_levels = "A")

detach(package:patchwork)

We can also use {cowplot}…

library(cowplot)
plot_grid(
  plot_grid(
    p1,
    p2,
    labels = c("A", "B"),
    label_size = 12,
    nrow = 1
  ),
  p3,
  labels = c("", "C"),
  label_size = 12,
  nrow = 2
)

detach(package:cowplot)

… or {gridExtra}

library(gridExtra)
grid.arrange(grid.arrange(p1, p2, nrow = 1), p3, nrow = 2)

detach(package:gridExtra)

9.8 Aggregate Statistics

To calculate summary statistics for groups of observations in a data frame, there are many different approaches. One is to use the aggregate() function from the {stats} package (a {base} R standard package), which provides a quick way to look at summary statistics for sets of observations, though it requires a bit of clunky code. Here, we apply a particular function (FUN = "mean") to mean female body mass, grouped by Family.

aggregate(d$Body_mass_female_mean ~ d$Family, FUN = "mean", na.rm = TRUE)
##           d$Family d$Body_mass_female_mean
## 1         Atelidae               6616.2000
## 2          Cebidae                876.3323
## 3  Cercopithecidae               6327.8247
## 4    Cheirogalidae                186.0286
## 5    Daubentonidae               2490.0000
## 6        Galagidae                371.6143
## 7        Hominidae              53443.7167
## 8      Hylobatidae               6682.1200
## 9        Indriidae               3886.5333
## 10       Lemuridae               1991.1200
## 11   Lepilemuridae                813.5000
## 12       Lorisidae                489.8625
## 13     Pitheciidae               1768.5000
## 14       Tarsiidae                120.0000

Or, alternatively…

aggregate(x = d["Body_mass_female_mean"], by = d["Family"], FUN = "mean", na.rm = TRUE)
##             Family Body_mass_female_mean
## 1         Atelidae             6616.2000
## 2          Cebidae              876.3323
## 3  Cercopithecidae             6327.8247
## 4    Cheirogalidae              186.0286
## 5    Daubentonidae             2490.0000
## 6        Galagidae              371.6143
## 7        Hominidae            53443.7167
## 8      Hylobatidae             6682.1200
## 9        Indriidae             3886.5333
## 10       Lemuridae             1991.1200
## 11   Lepilemuridae              813.5000
## 12       Lorisidae              489.8625
## 13     Pitheciidae             1768.5000
## 14       Tarsiidae              120.0000
# | include: false
detach(package:curl)
detach(package:tidyverse)

Concept Review

  • Summary statistics: summary(), skim()
  • Basic plotting: boxplot(), barplot(), histogram()
  • Plotting with {ggplot2}:
    • ggplot(data = , mapping = aes()) + geom() + theme() + ...
  • Summarizing by groups: aggregrate()
  • Arranging plots with {patchwork}, {cowplot}, and {gridExtra}