10 Data Wrangling

10.1 Objectives

The objective of this module is to introduce you to data wrangling and transformation using the {dplyr} and {tidyr} packages, which are part of the {tidyverse} family.

10.2 Preliminaries

Install (but do not load yet) this package in R: {tidylog}
Install and load this package: {magrittr}
Load {tidyverse}
Load in the KamilarAndCooper dataset we used in Module 09 as a “tibble” named d

f <- "https://raw.githubusercontent.com/difiore/ada-datasets/main/KamilarAndCooperData.csv"
d <- read_csv(f, col_names = TRUE) # creates a "tibble"
head(d)

## # A tibble: 6 × 45
##   Scientific_Name        Superfamily Family Genus Species Brain_Size_Species_M…¹
##   <chr>                  <chr>       <chr>  <chr> <chr>                    <dbl>
## 1 Allenopithecus_nigrov… Cercopithe… Cerco… Alle… nigrov…                   58.0
## 2 Allocebus_trichotis    Cercopithe… Cerco… Allo… tricho…                   NA  
## 3 Alouatta_belzebul      Ceboidea    Ateli… Alou… belzeb…                   52.8
## 4 Alouatta_caraya        Ceboidea    Ateli… Alou… caraya                    52.6
## 5 Alouatta_guariba       Ceboidea    Ateli… Alou… guariba                   51.7
## 6 Alouatta_palliata      Ceboidea    Ateli… Alou… pallia…                   49.9
## # ℹ abbreviated name: ¹Brain_Size_Species_Mean
## # ℹ 39 more variables: Brain_Size_Female_Mean <dbl>, Brain_size_Ref <chr>,
## #   Body_mass_male_mean <dbl>, Body_mass_female_mean <dbl>,
## #   Mass_Dimorphism <dbl>, Mass_Ref <chr>, MeanGroupSize <dbl>,
## #   AdultMales <dbl>, AdultFemale <dbl>, AdultSexRatio <dbl>,
## #   Social_Organization_Ref <chr>, InterbirthInterval_d <dbl>, Gestation <dbl>,
## #   WeaningAge_d <dbl>, MaxLongevity_m <dbl>, LitterSz <dbl>, …

10.3 Data Wrangling Using {dplyr}

The {dplyr} package, included in the {tidyverse}, provides “a flexible grammar of data manipulation” that makes many of the manipulations that we explored in Module 07 and Module 09 much easier and much more intuitive!

Among other functions, {dplyr} introduces a set of verbs (filter(), select(), arrange(), rename(), mutate(), summarize(), and group_by()) that can be used to perform useful operations on “tibbles” and related tabular data structures (e.g., normal data frames and data tables). Before using {dplyr} for summarizing data and producing aggregate statistics, let’s look in general at what we can do with these verbs…

`filter()`

The filter() function lets us pull out rows from a data frame that meet a particular criterion or set of criteria:

FROM: Baumer et al. (2017). Modern Data Science with R. Chapman and Hall/CRC.

# selecting rows..
s <- filter(d, Family == "Hominidae" & Mass_Dimorphism > 2)
head(s)

## # A tibble: 3 × 45
##   Scientific_Name Superfamily Family    Genus   Species  Brain_Size_Species_Mean
##   <chr>           <chr>       <chr>     <chr>   <chr>                      <dbl>
## 1 Gorilla_gorilla Hominoidea  Hominidae Gorilla gorilla                     490.
## 2 Pongo_abelii    Hominoidea  Hominidae Pongo   abelii                      390.
## 3 Pongo_pygmaeus  Hominoidea  Hominidae Pongo   pygmaeus                    377.
## # ℹ 39 more variables: Brain_Size_Female_Mean <dbl>, Brain_size_Ref <chr>,
## #   Body_mass_male_mean <dbl>, Body_mass_female_mean <dbl>,
## #   Mass_Dimorphism <dbl>, Mass_Ref <chr>, MeanGroupSize <dbl>,
## #   AdultMales <dbl>, AdultFemale <dbl>, AdultSexRatio <dbl>,
## #   Social_Organization_Ref <chr>, InterbirthInterval_d <dbl>, Gestation <dbl>,
## #   WeaningAge_d <dbl>, MaxLongevity_m <dbl>, LitterSz <dbl>,
## #   Life_History_Ref <chr>, GR_MidRangeLat_dd <dbl>, Precip_Mean_mm <dbl>, …

NOTE: The first argument of any of the {dplyr} verbs is the .data argument. That is, the line of code above is equivalent to s <- filter(.data = d, Family == "Hominidae" & Mass_Dimorphism > 2)
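
As a minimal sketch of how multiple criteria combine in filter(), the code below uses a small made-up tibble (`toy` and its values are hypothetical, not part of the module's dataset). Conditions separated by commas are combined with a logical AND, so they should behave the same as joining them with `&`, while `|` keeps rows meeting either condition. Rows for which the test evaluates to NA are dropped.

```r
library(dplyr)

# hypothetical toy data, for illustration only
toy <- tibble(
  Family = c("Hominidae", "Hominidae", "Cebidae", "Cebidae"),
  Mass_Dimorphism = c(2.4, 1.1, NA, 1.0)
)

# comma-separated conditions are combined with AND...
a <- filter(toy, Family == "Hominidae", Mass_Dimorphism > 2)
# ... so this is the same as using `&` explicitly
b <- filter(toy, Family == "Hominidae" & Mass_Dimorphism > 2)
all.equal(a, b)

# `|` keeps rows meeting either condition; the row where the
# combined test evaluates to NA (the third row) is dropped
either <- filter(toy, Family == "Hominidae" | Mass_Dimorphism > 2)
nrow(either)
```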

`select()`

The select() function lets us pull out only particular columns from a data frame:

FROM: Baumer et al. (2017). Modern Data Science with R. Chapman and Hall/CRC.

# selecting specific columns...
s <- select(d, Family, Genus, Body_mass_male_mean)
head(s)

## # A tibble: 6 × 3
##   Family          Genus          Body_mass_male_mean
##   <chr>           <chr>                        <dbl>
## 1 Cercopithecidae Allenopithecus                6130
## 2 Cercopithecidae Allocebus                       92
## 3 Atelidae        Alouatta                      7270
## 4 Atelidae        Alouatta                      6525
## 5 Atelidae        Alouatta                      5800
## 6 Atelidae        Alouatta                      7150
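
Beyond naming columns one by one, select() also accepts helper functions and negation. The sketch below assumes a small made-up tibble (`toy`, with invented values) and shows three common patterns: selecting by name prefix with starts_with(), dropping a column with `-`, and renaming a column while selecting it.

```r
library(dplyr)

# hypothetical toy data, for illustration only
toy <- tibble(
  Genus = c("Pongo", "Gorilla"),
  Body_mass_male_mean = c(78000, 170000),
  Body_mass_female_mean = c(36000, 71000),
  Mass_Ref = c("ref1", "ref2")
)

# keep Genus plus every column whose name starts with a given prefix
masses <- select(toy, Genus, starts_with("Body_mass"))
names(masses)

# drop a single column by negating it
no_ref <- select(toy, -Mass_Ref)
names(no_ref)

# rename a column as it is selected (new_name = old_name)
renamed <- select(toy, Genus, Male_Mass = Body_mass_male_mean)
names(renamed)
```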

`arrange()`

The arrange() function lets us sort a data frame based on a selected variable or set of variables:

FROM: Baumer et al. (2017). Modern Data Science with R. Chapman and Hall/CRC.

# reordering a data frame by a set of variables...
s <- arrange(d, Family, Genus, Body_mass_male_mean)
head(s)

## # A tibble: 6 × 45
##   Scientific_Name    Superfamily Family   Genus   Species Brain_Size_Species_M…¹
##   <chr>              <chr>       <chr>    <chr>   <chr>                    <dbl>
## 1 Alouatta_guariba   Ceboidea    Atelidae Alouat… guariba                   51.7
## 2 Alouatta_caraya    Ceboidea    Atelidae Alouat… caraya                    52.6
## 3 Alouatta_seniculus Ceboidea    Atelidae Alouat… senicu…                   55.2
## 4 Alouatta_palliata  Ceboidea    Atelidae Alouat… pallia…                   49.9
## 5 Alouatta_belzebul  Ceboidea    Atelidae Alouat… belzeb…                   52.8
## 6 Alouatta_pigra     Ceboidea    Atelidae Alouat… pigra                     51.1
## # ℹ abbreviated name: ¹Brain_Size_Species_Mean
## # ℹ 39 more variables: Brain_Size_Female_Mean <dbl>, Brain_size_Ref <chr>,
## #   Body_mass_male_mean <dbl>, Body_mass_female_mean <dbl>,
## #   Mass_Dimorphism <dbl>, Mass_Ref <chr>, MeanGroupSize <dbl>,
## #   AdultMales <dbl>, AdultFemale <dbl>, AdultSexRatio <dbl>,
## #   Social_Organization_Ref <chr>, InterbirthInterval_d <dbl>, Gestation <dbl>,
## #   WeaningAge_d <dbl>, MaxLongevity_m <dbl>, LitterSz <dbl>, …

We can also specify the direction in which we want the data frame to be sorted:

# `desc()` can be used to reverse the order
s <- arrange(d, desc(Family), Genus, Species, desc(Body_mass_male_mean))
head(s)

## # A tibble: 6 × 45
##   Scientific_Name        Superfamily Family Genus Species Brain_Size_Species_M…¹
##   <chr>                  <chr>       <chr>  <chr> <chr>                    <dbl>
## 1 Tarsius_bancanus       Tarsioidea  Tarsi… Tars… bancan…                   3.16
## 2 Tarsius_dentatus       Tarsioidea  Tarsi… Tars… dentat…                  NA   
## 3 Tarsius_syrichta       Tarsioidea  Tarsi… Tars… syrich…                   3.36
## 4 Cacajao_calvus         Ceboidea    Pithe… Caca… calvus                   76   
## 5 Cacajao_melanocephalus Ceboidea    Pithe… Caca… melano…                  68.8 
## 6 Callicebus_donacophil… Ceboidea    Pithe… Call… donaco…                  NA   
## # ℹ abbreviated name: ¹Brain_Size_Species_Mean
## # ℹ 39 more variables: Brain_Size_Female_Mean <dbl>, Brain_size_Ref <chr>,
## #   Body_mass_male_mean <dbl>, Body_mass_female_mean <dbl>,
## #   Mass_Dimorphism <dbl>, Mass_Ref <chr>, MeanGroupSize <dbl>,
## #   AdultMales <dbl>, AdultFemale <dbl>, AdultSexRatio <dbl>,
## #   Social_Organization_Ref <chr>, InterbirthInterval_d <dbl>, Gestation <dbl>,
## #   WeaningAge_d <dbl>, MaxLongevity_m <dbl>, LitterSz <dbl>, …
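
One detail worth knowing when sorting columns that contain missing data: arrange() places NA values at the end of the result, whether or not desc() is used. The sketch below illustrates this with a small made-up tibble (`toy` is hypothetical).

```r
library(dplyr)

# hypothetical toy data, for illustration only
toy <- tibble(
  Genus = c("A", "B", "C", "D"),
  Mass = c(10, NA, 30, 20)
)

# ascending sort: the row with the missing Mass comes last
asc <- arrange(toy, Mass)
asc$Genus

# descending sort: the row with the missing Mass still comes last
dsc <- arrange(toy, desc(Mass))
dsc$Genus
```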

`rename()`

The rename() function allows us to change the names of particular columns in a data frame:

# renaming columns...
s <- rename(d, "Female_Mass" = Body_mass_female_mean)
head(s$Female_Mass)

## [1] 3180   84 5520 4240 4550 5350

`mutate()`

The mutate() function allows us to add new columns to a data frame:

FROM: Baumer et al. (2017). Modern Data Science with R. Chapman and Hall/CRC.

# and adding new columns...
s <- mutate(d, "Binomial" = paste(Genus, Species, sep = " "))
head(s$Binomial) # or head(s[["Binomial"]])

## [1] "Allenopithecus nigroviridis" "Allocebus trichotis"        
## [3] "Alouatta belzebul"           "Alouatta caraya"            
## [5] "Alouatta guariba"            "Alouatta palliata"
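
A single call to mutate() can create several columns at once, and later columns can refer to ones created earlier in the same call. The sketch below uses a small made-up tibble (`toy`, with invented masses) to build a binomial name, a male-to-female mass ratio, and a logical flag derived from that ratio.

```r
library(dplyr)

# hypothetical toy data, for illustration only
toy <- tibble(
  Genus = c("Gorilla", "Pongo"),
  Species = c("gorilla", "abelii"),
  male = c(170, 78),
  female = c(71, 40)
)

s <- mutate(
  toy,
  Binomial = paste(Genus, Species, sep = " "),
  dimorphism = male / female,  # new column computed from existing ones
  dimorphic = dimorphism > 2   # uses the column created on the line above
)
s
```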

`summarize()` and `group_by()`

The {dplyr} package makes it easy to summarize data using more convenient functions than the {base} function aggregate(), which we looked at in Module 09. The summarize() function specifies a list of summary variables that will appear in the output, along with the operations that will be performed on vectors in the data frame to produce those summary variables:

FROM: Baumer et al. (2017). Modern Data Science with R. Chapman and Hall/CRC.

# the n() function returns the number of rows in the data frame
s <- summarize(
  d,
  n_cases = n(),
  avgF = mean(Body_mass_female_mean, na.rm = TRUE),
  avgM = mean(Body_mass_male_mean, na.rm = TRUE)
)
s

## # A tibble: 1 × 3
##   n_cases  avgF  avgM
##     <int> <dbl> <dbl>
## 1     213 5396. 8112.

Additionally, the group_by() function allows us to construct these summary variables for sets of observations defined by a particular categorical variable, as we did in Module 09 with aggregate().

by_family <- group_by(d, Family)
# here, n() returns the number of rows in the group being considered
s <- summarise(
  by_family,
  n_cases = n(),
  avgF = mean(Body_mass_female_mean, na.rm = TRUE),
  avgM = mean(Body_mass_male_mean, na.rm = TRUE)
)
s

## # A tibble: 14 × 4
##    Family          n_cases   avgF   avgM
##    <chr>             <int>  <dbl>  <dbl>
##  1 Atelidae             12  6616.  7895.
##  2 Cebidae              37   876.  1012.
##  3 Cercopithecidae      79  6328.  9543.
##  4 Cheirogalidae         7   186.   193.
##  5 Daubentonidae         1  2490   2620 
##  6 Galagidae             7   372.   395.
##  7 Hominidae             6 53444. 98681.
##  8 Hylobatidae          11  6682.  6926.
##  9 Indriidae             9  3887.  3638.
## 10 Lemuridae            17  1991.  2077.
## 11 Lepilemuridae         6   814.   792 
## 12 Lorisidae             8   490.   512.
## 13 Pitheciidae          10  1768.  1955.
## 14 Tarsiidae             3   120    131
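
Note that n() counts every row in a group, including rows where the variable being averaged is missing. The sketch below, which assumes a small made-up tibble (`toy`) and R 4.1.0 or later for the native pipe, shows one way to report both the group size and the number of non-missing values that actually contribute to each mean.

```r
library(dplyr)

# hypothetical toy data, for illustration only
toy <- tibble(
  Family = c("Atelidae", "Atelidae", "Cebidae", "Cebidae", "Cebidae"),
  Mass = c(6000, NA, 900, 1100, 1000)
)

s <- toy |>
  group_by(Family) |>
  summarize(
    n_cases = n(),               # counts every row in the group, NA or not
    n_mass = sum(!is.na(Mass)),  # counts only rows with a non-missing Mass
    avg = mean(Mass, na.rm = TRUE)
  )
s
```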

10.4 Other Useful {dplyr} Functions

ungroup(): clears group metadata put in place on a table by group_by()
bind_rows() and bind_cols(): add rows and columns, respectively, to a dataframe or tibble; when binding rows, if the column names do not match, the column will still be added and missing values filled with NA; when binding columns, the number of rows in each dataframe needs to be the same
pull(): pulls a single variable out of a dataframe as a vector
sample_n(): randomly samples a set of “size=” rows from a dataframe with (“replace=TRUE”) or without (“replace=FALSE”) replacement; this function has been superseded by slice_sample(), where an argument (n= or prop=) allows you to specify the number or proportion of rows, respectively, to sample randomly
drop_na(): drops rows from a dataframe that have NA values in any of the variables passed as arguments to the function (or in any variable at all, if none are named); note that this function actually comes from {tidyr} rather than {dplyr}
rowwise(): allows you to explicitly perform functions on a data frame on a row-at-a-time basis, which is useful if a vectorized function does not exist
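
The functions listed above can be sketched together on two small made-up tibbles (`a` and `b` are hypothetical). This assumes {dplyr} and {tidyr} are installed and R 4.1.0 or later for the native pipe.

```r
library(dplyr)
library(tidyr) # drop_na() lives in {tidyr}

# hypothetical toy data, for illustration only
a <- tibble(Taxon = c("Gorilla", "Human"), Brain = c(470, 1100))
b <- tibble(Taxon = "Macaque", Weight = 6)

# bind_rows(): columns that do not match are kept and filled with NA
stacked <- bind_rows(a, b)
stacked

# drop_na(): drop rows that have NA in the named column
complete_brain <- drop_na(stacked, Brain)

# pull(): extract one column as a plain vector
taxa <- pull(complete_brain, Taxon)
taxa

# slice_sample(): draw n rows at random (without replacement by default)
set.seed(1)
one_row <- slice_sample(stacked, n = 1)

# rowwise(): compute a value separately for each row,
# then ungroup() to remove the row-wise grouping
row_max <- stacked |>
  rowwise() |>
  mutate(biggest = max(Brain, Weight, na.rm = TRUE)) |>
  ungroup()
row_max
```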

A full list of {dplyr} functions and their descriptions is available here.

10.5 Joining Tables

One of the other major forms of data wrangling that we often need to do is to combine variables from different tabular data structures into a new table. This process is often referred to as performing a “mutating join” or simply a “join”.

NOTE: For those with experience with other database systems, it is related to the “JOIN” commands in SQL.

The process works by matching observations in two different tables by a common key variable and then selecting additional variables of interest to pull from each of the tables. A simple example is the following… suppose we have two tables, one that contains average brain sizes for particular species of primates and one that contains individual body sizes for some of the same species, plus others. In the latter table, too, we may have data from multiple individuals of the same species represented.

table1 <-
  tibble(
    Taxon = c("Gorilla", "Human", "Chimpanzee", "Orangutan", "Baboon"),
    Avg_Brain_Size = c(470, 1100, 350, 340, 140)
  )
table1

## # A tibble: 5 × 2
##   Taxon      Avg_Brain_Size
##   <chr>               <dbl>
## 1 Gorilla               470
## 2 Human                1100
## 3 Chimpanzee            350
## 4 Orangutan             340
## 5 Baboon                140

table2 <-
  tibble(
    Taxon = c(
      "Gorilla",
      "Gorilla",
      "Gorilla",
      "Human",
      "Human",
      "Chimpanzee",
      "Orangutan",
      "Orangutan",
      "Macaque",
      "Macaque",
      "Macaque"
    ),
    Body_Weight = c(80, 81, 77, 48, 49, 38, 37, 36, 6, 7, 6)
  )
table2

## # A tibble: 11 × 2
##    Taxon      Body_Weight
##    <chr>            <dbl>
##  1 Gorilla             80
##  2 Gorilla             81
##  3 Gorilla             77
##  4 Human               48
##  5 Human               49
##  6 Chimpanzee          38
##  7 Orangutan           37
##  8 Orangutan           36
##  9 Macaque              6
## 10 Macaque              7
## 11 Macaque              6

Inner Joins

An inner join or equijoin matches up sets of observations between two tables whenever their keys are equal. The output of an inner join is a new data frame that contains all rows from the left-hand (x) and right-hand (y) tables where there are matching values in the key column, plus all columns from x and y. If there are multiple matches between the tables, all combinations of the matches are returned. This is represented schematically below:

FROM: Wickham & Grolemund (2017). R for Data Science. O’Reilly Media, Inc.

inner_join(table1, table2, by = "Taxon")

## # A tibble: 8 × 3
##   Taxon      Avg_Brain_Size Body_Weight
##   <chr>               <dbl>       <dbl>
## 1 Gorilla               470          80
## 2 Gorilla               470          81
## 3 Gorilla               470          77
## 4 Human                1100          48
## 5 Human                1100          49
## 6 Chimpanzee            350          38
## 7 Orangutan             340          37
## 8 Orangutan             340          36

Outer Joins

While an inner join keeps only observations that appear in both tables, different flavors of outer joins keep observations that appear in at least one of the tables. There are three types of outer joins:

A left join returns all rows from the left-hand table, x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns from the y table. If there are multiple matches between x and y, all combinations of the matches are returned.
A right join returns all rows from the right-hand table, y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns from the x table. If there are multiple matches between x and y, all combinations of the matches are returned.
A full join returns all rows and all columns in both the left-hand (x) and right-hand (y) tables, joining them where there are matches. Where there are no matching values, the join returns NA for the columns from the table in which the match is missing.

The following figure shows a schematic representation of these various types of outer joins:

FROM: Wickham & Grolemund (2017). R for Data Science. O’Reilly Media, Inc.

left_join(table1, table2, by = "Taxon")

## # A tibble: 9 × 3
##   Taxon      Avg_Brain_Size Body_Weight
##   <chr>               <dbl>       <dbl>
## 1 Gorilla               470          80
## 2 Gorilla               470          81
## 3 Gorilla               470          77
## 4 Human                1100          48
## 5 Human                1100          49
## 6 Chimpanzee            350          38
## 7 Orangutan             340          37
## 8 Orangutan             340          36
## 9 Baboon                140          NA

right_join(table1, table2, by = "Taxon")

## # A tibble: 11 × 3
##    Taxon      Avg_Brain_Size Body_Weight
##    <chr>               <dbl>       <dbl>
##  1 Gorilla               470          80
##  2 Gorilla               470          81
##  3 Gorilla               470          77
##  4 Human                1100          48
##  5 Human                1100          49
##  6 Chimpanzee            350          38
##  7 Orangutan             340          37
##  8 Orangutan             340          36
##  9 Macaque                NA           6
## 10 Macaque                NA           7
## 11 Macaque                NA           6

full_join(table1, table2, by = "Taxon")

## # A tibble: 12 × 3
##    Taxon      Avg_Brain_Size Body_Weight
##    <chr>               <dbl>       <dbl>
##  1 Gorilla               470          80
##  2 Gorilla               470          81
##  3 Gorilla               470          77
##  4 Human                1100          48
##  5 Human                1100          49
##  6 Chimpanzee            350          38
##  7 Orangutan             340          37
##  8 Orangutan             340          36
##  9 Baboon                140          NA
## 10 Macaque                NA           6
## 11 Macaque                NA           7
## 12 Macaque                NA           6

Other Joins

There are also two additional join types that may sometimes be useful… note that these joins only return columns from the left-hand table, x.

A semi_join returns rows from the left-hand table, x, where there are matching values in y, but keeping just the columns from x. A semi_join differs from an inner_join because an inner_join will return a row of x for every matching row of y (so some x rows can be duplicated), whereas a semi_join will never duplicate rows of x.
An anti_join returns all rows from the left-hand table, x, where there are no matching values in y, keeping just the columns from x.

semi_join(table1, table2, by = "Taxon")

## # A tibble: 4 × 2
##   Taxon      Avg_Brain_Size
##   <chr>               <dbl>
## 1 Gorilla               470
## 2 Human                1100
## 3 Chimpanzee            350
## 4 Orangutan             340

anti_join(table1, table2, by = "Taxon")

## # A tibble: 1 × 2
##   Taxon  Avg_Brain_Size
##   <chr>           <dbl>
## 1 Baboon            140
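
In all of the joins above, the key column has the same name (Taxon) in both tables. When the key is named differently in each table, the `by` argument can take a named vector that maps the column in x to the column in y. The sketch below assumes two small made-up tibbles (`brains` and `weights`, with the hypothetical key name `Species_Name` in the second).

```r
library(dplyr)

# hypothetical toy data, for illustration only
brains <- tibble(
  Taxon = c("Gorilla", "Human", "Baboon"),
  Avg_Brain_Size = c(470, 1100, 140)
)
weights <- tibble(
  Species_Name = c("Gorilla", "Gorilla", "Human", "Macaque"),
  Body_Weight = c(80, 81, 48, 6)
)

# match Taxon in the left-hand table to Species_Name in the right-hand table;
# the key column in the result takes its name from the left-hand table
j <- left_join(brains, weights, by = c("Taxon" = "Species_Name"))
j
```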

The cheatsheet on Data Transformation with {dplyr} provides a nice overview of these and additional data wrangling functions included in the {dplyr} package.

10.6 Chaining and Piping

One other cool thing about the {dplyr} package is that it provides a convenient way to chain together operations on a data frame using the “forward pipe” operator (%>%). The %>% operator takes what is on the left-hand side (LHS) of the operator and “pipes” it directly into the first argument of the function call on the right-hand side (RHS). This process allows us to build chains of successive operations, each one being applied to the outcome of the previous operation in the chain.

Because of how useful “piping” is for data wrangling, newer versions of R (4.1.0 and later) include a “native” forward pipe operator (|>) as part of {base} R. In the code below, you could use |> in lieu of %>%.

EXAMPLE:

# this...
d %>%
  select(Scientific_Name, Body_mass_female_mean) |>
  head()

## # A tibble: 6 × 2
##   Scientific_Name             Body_mass_female_mean
##   <chr>                                       <dbl>
## 1 Allenopithecus_nigroviridis                  3180
## 2 Allocebus_trichotis                            84
## 3 Alouatta_belzebul                            5520
## 4 Alouatta_caraya                              4240
## 5 Alouatta_guariba                             4550
## 6 Alouatta_palliata                            5350

# is equivalent to...
head(select(d, Scientific_Name, Body_mass_female_mean))

## # A tibble: 6 × 2
##   Scientific_Name             Body_mass_female_mean
##   <chr>                                       <dbl>
## 1 Allenopithecus_nigroviridis                  3180
## 2 Allocebus_trichotis                            84
## 3 Alouatta_belzebul                            5520
## 4 Alouatta_caraya                              4240
## 5 Alouatta_guariba                             4550
## 6 Alouatta_palliata                            5350

The forward pipe is useful because it allows us to write and follow code from left to right (as when writing in English), instead of right to left with many nested parentheses.

CHALLENGE

In a single chained statement, do the following:
- Add a variable, Binomial, to our data frame d, which is a concatenation of the Genus and Species…
- Trim the data frame to only include the variables Binomial, Family, Body_mass_female_mean, Body_mass_male_mean and Mass_Dimorphism…
- Group these variables by Family…
- Calculate the average value for female body mass, male body mass, and mass dimorphism (remember, you will need to specify na.rm = TRUE…)
- And arrange by decreasing average mass dimorphism.


s <- mutate(d, Binomial = paste(Genus, Species, sep = " ")) %>%
  select(
    Binomial,
    Family,
    Body_mass_female_mean,
    Body_mass_male_mean,
    Mass_Dimorphism
  ) %>%
  group_by(Family) |>
  summarise(
    avgF = mean(Body_mass_female_mean, na.rm = TRUE),
    avgM = mean(Body_mass_male_mean, na.rm = TRUE),
    avgBMD = mean(Mass_Dimorphism, na.rm = TRUE)
  ) %>%
  arrange(desc(avgBMD))
s


## # A tibble: 14 × 4
##    Family            avgF   avgM avgBMD
##    <chr>            <dbl>  <dbl>  <dbl>
##  1 Hominidae       53444. 98681.  1.81 
##  2 Cercopithecidae  6328.  9543.  1.49 
##  3 Atelidae         6616.  7895.  1.23 
##  4 Tarsiidae         120    131   1.09 
##  5 Pitheciidae      1768.  1955.  1.09 
##  6 Cebidae           876.  1012.  1.07 
##  7 Lemuridae        1991.  2077.  1.06 
##  8 Daubentonidae    2490   2620   1.05 
##  9 Lorisidae         490.   512.  1.05 
## 10 Galagidae         372.   395.  1.05 
## 11 Hylobatidae      6682.  6926.  1.03 
## 12 Cheirogalidae     186.   193.  1.02 
## 13 Lepilemuridae     814.   792   0.980
## 14 Indriidae        3887.  3638.  0.950

There are several other, very cool, “special case” pipe operators that are useful in particular situations. These are available from the {magrittr} package. [Actually, the original forward pipe operator (%>%) also comes from the {magrittr} package, but it is re-exported by {dplyr}.]

The “tee” pipe (%T>%) operator pipes the value on its LHS into the expression on its RHS (just like the forward pipe operator does), but it then returns the original LHS value rather than the result of the RHS expression. This is useful, for example, for printing out or plotting intermediate results, where the RHS is run only for its side effect. In the example below, we filter our data frame for just observations of the genus Alouatta, print those to the screen as an intermediate side effect, and pass the filtered data to the summarise() function.

s <- filter(d, Genus == "Alouatta") %T>% print() |>
  summarise(avgF = mean(Body_mass_female_mean, na.rm = TRUE))

## # A tibble: 6 × 45
##   Scientific_Name    Superfamily Family   Genus   Species Brain_Size_Species_M…¹
##   <chr>              <chr>       <chr>    <chr>   <chr>                    <dbl>
## 1 Alouatta_belzebul  Ceboidea    Atelidae Alouat… belzeb…                   52.8
## 2 Alouatta_caraya    Ceboidea    Atelidae Alouat… caraya                    52.6
## 3 Alouatta_guariba   Ceboidea    Atelidae Alouat… guariba                   51.7
## 4 Alouatta_palliata  Ceboidea    Atelidae Alouat… pallia…                   49.9
## 5 Alouatta_pigra     Ceboidea    Atelidae Alouat… pigra                     51.1
## 6 Alouatta_seniculus Ceboidea    Atelidae Alouat… senicu…                   55.2
## # ℹ abbreviated name: ¹Brain_Size_Species_Mean
## # ℹ 39 more variables: Brain_Size_Female_Mean <dbl>, Brain_size_Ref <chr>,
## #   Body_mass_male_mean <dbl>, Body_mass_female_mean <dbl>,
## #   Mass_Dimorphism <dbl>, Mass_Ref <chr>, MeanGroupSize <dbl>,
## #   AdultMales <dbl>, AdultFemale <dbl>, AdultSexRatio <dbl>,
## #   Social_Organization_Ref <chr>, InterbirthInterval_d <dbl>, Gestation <dbl>,
## #   WeaningAge_d <dbl>, MaxLongevity_m <dbl>, LitterSz <dbl>, …

## # A tibble: 1 × 1
##    avgF
##   <dbl>
## 1 5217.

The “assignment” pipe (%<>%) operator pipes the object on the left-hand side into the expression on the right-hand side and simultaneously assigns the resulting value back to the left-hand side object.

s <- filter(d, Genus == "Alouatta")
s %<>% select(Genus, Species)
s

## # A tibble: 6 × 2
##   Genus    Species  
##   <chr>    <chr>    
## 1 Alouatta belzebul 
## 2 Alouatta caraya   
## 3 Alouatta guariba  
## 4 Alouatta palliata 
## 5 Alouatta pigra    
## 6 Alouatta seniculus

Finally, the “exposition” pipe (%$%) operator exposes the names within the object on the left-hand side of the pipe to the right-hand side expression.

s <- filter(d, Genus == "Alouatta") %$% paste0(Genus, " ", Species)
s

## [1] "Alouatta belzebul"  "Alouatta caraya"    "Alouatta guariba"  
## [4] "Alouatta palliata"  "Alouatta pigra"     "Alouatta seniculus"

10.7 Dot Syntax and the Pipe

Normally, when we use the forward pipe operator, the LHS of the operator is passed to the first argument of the function on the RHS. Thus, the following are all equivalent:

s <- filter(d, Genus == "Alouatta")
s <- d %>% filter(Genus == "Alouatta")
d %>% filter(Genus == "Alouatta") -> s

The behavior of the forward pipe operator means we can use it to do something like the following, where d is implicitly piped into the data argument for ggplot():

d %>% ggplot(aes(x = log(Body_mass_female_mean), y = log(Brain_Size_Species_Mean))) +
  geom_point()

We can also use dot (.) syntax with the {magrittr} forward pipe operator to pass the LHS of a statement to somewhere other than the first argument of the function on the RHS. Thus…

y %>% function(x, .) is equivalent to function(x, y)

… which means we can do something like this to pipe d into a function such as lm() (“linear model”), where the data frame that the function is run on is not the first argument:

d %>% lm(log(Body_mass_female_mean) ~ log(Brain_Size_Species_Mean), data = .)

## 
## Call:
## lm(formula = log(Body_mass_female_mean) ~ log(Brain_Size_Species_Mean), 
##     data = .)
## 
## Coefficients:
##                  (Intercept)  log(Brain_Size_Species_Mean)  
##                        3.713                         1.135

We can also use the {magrittr} pipe’s curly brace ({}) syntax to wrap the RHS of a statement and pass the LHS into several places:

d %>%
  {
    plot(log(.$Brain_Size_Species_Mean), log(.$Body_mass_female_mean))
  }

NOTE: As mentioned above, newer versions of R incorporate a “native” pipe operator (|>) into their {base} syntax. This operator behaves very similarly to the {dplyr}/{magrittr} forward pipe, but it does not support dot syntax in the same way. It also requires an explicit function call on the RHS, which means that it is necessary to append () to the end of the function name, rather than just using the name. The first version of the code below, using %>%, takes the log of all Brain_Size_Female_Mean values in d and would also work with the bare function names log and head, while the second version, using |>, needs to have () appended to the function names to work properly.

d %>%
  select(Brain_Size_Female_Mean) %>%
  log() %>%
  head()

## # A tibble: 6 × 1
##   Brain_Size_Female_Mean
##                    <dbl>
## 1                   3.98
## 2                  NA   
## 3                   3.94
## 4                   3.87
## 5                   3.89
## 6                   3.87

d |>
  select(Brain_Size_Female_Mean) |>
  log() |>
  head()

## # A tibble: 6 × 1
##   Brain_Size_Female_Mean
##                    <dbl>
## 1                   3.98
## 2                  NA   
## 3                   3.94
## 4                   3.87
## 5                   3.89
## 6                   3.87
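
Although the native pipe does not understand the dot (.), R 4.2.0 and later provide an underscore (_) placeholder that plays a similar role, with the restriction that it must be supplied to a named argument. The sketch below is a minimal illustration that assumes R 4.2.0 or later and uses the built-in mtcars dataset (not part of this module's data) so that it runs on its own.

```r
# the `_` placeholder requires R >= 4.2.0 and must be passed to a named argument
fit_native <- mtcars |> lm(mpg ~ wt, data = _)

# the same model, written as an ordinary function call
fit_nested <- lm(mpg ~ wt, data = mtcars)

# both approaches should give the same coefficients
all.equal(coef(fit_native), coef(fit_nested))
```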

10.8 Useful Related Packages

The {tidylog} package provides wrappers around many {dplyr} and {tidyr} package functions that provide logged feedback on the outcome of those functions, which can be useful for understanding the effects of whatever data wrangling processes we run. For example, running filter() will provide feedback on the number of rows removed and kept as part of a filtering operation…

library(tidylog)

## 
## Attaching package: 'tidylog'

## The following objects are masked from 'package:dplyr':
## 
##     add_count, add_tally, anti_join, count, distinct, distinct_all,
##     distinct_at, distinct_if, filter, filter_all, filter_at, filter_if,
##     full_join, group_by, group_by_all, group_by_at, group_by_if,
##     inner_join, left_join, mutate, mutate_all, mutate_at, mutate_if,
##     relocate, rename, rename_all, rename_at, rename_if, rename_with,
##     right_join, sample_frac, sample_n, select, select_all, select_at,
##     select_if, semi_join, slice, slice_head, slice_max, slice_min,
##     slice_sample, slice_tail, summarise, summarise_all, summarise_at,
##     summarise_if, summarize, summarize_all, summarize_at, summarize_if,
##     tally, top_frac, top_n, transmute, transmute_all, transmute_at,
##     transmute_if, ungroup

## The following objects are masked from 'package:tidyr':
## 
##     drop_na, fill, gather, pivot_longer, pivot_wider, replace_na,
##     separate_wider_delim, separate_wider_position,
##     separate_wider_regex, spread, uncount

## The following object is masked from 'package:stats':
## 
##     filter

# compare...
s <- dplyr::filter(d, Family == "Hominidae" & Mass_Dimorphism > 2)
# to...
s <- filter(d, Family == "Hominidae" & Mass_Dimorphism > 2)

## filter: removed 210 rows (99%), 3 rows remaining

Similarly, running sample_n() will give us logged output on that random record sampling process…

# compare...
s <- dplyr::sample_n(d, size = 100, replace = FALSE)
# to...
s <- sample_n(d, size = 100, replace = FALSE)

## sample_n: removed 113 rows (53%), 100 rows remaining

detach(package:tidylog)

NOTE: Loading {tidylog} will “mask” the functions of the same name from {dplyr} and {tidyr}. In the examples above, then, to run the {dplyr} versions of filter() and sample_n(), it was necessary to use the :: notation to call those versions of the functions explicitly. The {tidylog} versions of these functions run a bit more slowly, so if speed is important, you have a few options: you may not want to use {tidylog} at all; you may want to call the {dplyr} or {tidyr} functions explicitly after loading {tidylog}; or you may want to leave {tidylog} unloaded and simply call the {tidylog} version of a function explicitly where you want logged feedback.

s <- tidylog::filter(d, Family == "Hominidae" & Mass_Dimorphism > 2)

## filter: removed 210 rows (99%), 3 rows remaining


Concept Review

Using {dplyr}: select(), filter(), arrange(), rename(), mutate(), summarise(), group_by()
Chaining and piping with |> or %>% and other pipe operators
Joining tables: inner_join(), left_join(), right_join(), full_join()
Using the {tidylog} package
