Exercise 10 Solution

• Solution

Preliminaries

Using the {tidyverse} read_csv() function, load the “AVONETdataset1.csv” dataset as a “tibble” named d.

library(tidyverse)
library(skimr)
library(cowplot)
library(broom)
library(infer)
library(mosaic)
library(ggpubr) # for ggqqplot()

f <- "https://raw.githubusercontent.com/difiore/ada-datasets/main/AVONETdataset1.csv"
d <- read_csv(f, col_names = TRUE)

Winnow the dataset to variables of interest.

d <- d |> select(
  Species1, Family1, Order1,
  Beak.Length_Culmen, Beak.Width, Beak.Depth,
  Tarsus.Length, Wing.Length, Tail.Length, Mass,
  Habitat, Migration, Trophic.Level, Trophic.Niche,
  Min.Latitude, Max.Latitude, Centroid.Latitude, Range.Size,
  Primary.Lifestyle
)

Do a bit of exploratory data analysis.

skim(d)

Data summary
Name	d
Number of rows	11009
Number of columns	19
_______________________
Column type frequency:
character	7
numeric	12
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
Species1	0	1.00	9	36	11009
Family1	0	1.00	7	18	243
Order1	0	1.00	10	19	36
Habitat	98	0.99	4	14	11
Trophic.Level	5	1.00	8	9	4
Trophic.Niche	10	1.00	8	21	10
Primary.Lifestyle	0	1.00	6	11	5

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Beak.Length_Culmen	0	1.00	26.36	24.39	4.50	14.70	19.90	28.50	414.20	▇▁▁▁▁
Beak.Width	0	1.00	6.58	5.15	0.70	3.60	5.00	7.70	88.90	▇▁▁▁▁
Beak.Depth	0	1.00	8.06	7.59	1.00	3.80	5.80	9.40	110.90	▇▁▁▁▁
Tarsus.Length	0	1.00	28.73	24.84	2.50	17.40	22.00	31.30	481.20	▇▁▁▁▁
Wing.Length	0	1.00	124.78	93.44	0.10	66.80	91.50	145.50	789.90	▇▂▁▁▁
Tail.Length	0	1.00	86.65	61.08	0.10	50.20	68.70	99.90	812.80	▇▁▁▁▁
Mass	0	1.00	267.15	1883.03	1.90	15.00	35.50	121.00	111000.00	▇▁▁▁▁
Migration	23	1.00	1.29	0.62	1.00	1.00	1.00	1.00	3.00	▇▁▁▁▁
Min.Latitude	57	0.99	-6.44	22.37	-85.58	-21.22	-7.15	8.07	68.08	▁▃▇▃▁
Max.Latitude	57	0.99	11.51	23.32	-65.12	-3.33	9.00	22.07	85.01	▁▃▇▂▁
Centroid.Latitude	57	0.99	2.95	22.07	-71.04	-9.73	-0.22	15.28	78.43	▁▃▇▂▁
Range.Size	57	0.99	2578859.38	7629310.06	0.88	54052.87	416076.61	2187040.21	136304432.20	▇▁▁▁▁

The taxomic variables, along with Habitat, Trophic.Level, Trophic.Niche, and Migration, are categorical, although Migration is scored using integer factors as “1”, “2”, or “3”. The remaining variables are numeric.

Challenge 1

Step 1

Make boxplots of log(Mass) in relation to Trophic.Level and Migration and convert the variable Migration into a factor.

d <- d |> mutate(
  logMass = log(Mass),
  logRS = log(Range.Size),
  logBeak = log(Beak.Length_Culmen),
  logTarsus = log(Tarsus.Length),
  Migration = as.factor(Migration)
)

p1 <- ggplot(data = d |> drop_na(Trophic.Level), aes(x = Trophic.Level, y = log(Mass))) +
  geom_boxplot() +
  geom_jitter(alpha = 0.05)

p2 <- ggplot(data = d |> drop_na(Migration), aes(x = Migration, y = log(Mass))) +
  geom_boxplot() +
  geom_jitter(alpha = 0.05)
plot_grid(p1, p2, nrow = 1)

Step 2

Run linear models using the lm() function to look at the relationship between log(Mass) and Trophic.Level and between log(Mass) and Migration.

m1 <- lm(logMass ~ Trophic.Level, data = d)
m2 <- lm(logMass ~ Migration, data = d)

Examine the output of the resultant linear models. Is log(Mass) associated with either Trophic.Level or Migration category? That is, in the global test of significance, is the F statistic large enough to reject the null hypothesis of an F value of zero?

summary(m1)

## 
## Call:
## lm(formula = logMass ~ Trophic.Level, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4229 -1.1551 -0.3028  0.8982  7.5526 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             3.80834    0.01967 193.632  < 2e-16 ***
## Trophic.LevelHerbivore  0.25639    0.03406   7.528 5.54e-14 ***
## Trophic.LevelOmnivore   0.01422    0.04116   0.345     0.73    
## Trophic.LevelScavenger  4.63189    0.34447  13.446  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.538 on 11000 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.02094,    Adjusted R-squared:  0.02067 
## F-statistic: 78.42 on 3 and 11000 DF,  p-value: < 2.2e-16

summary(m2)

## 
## Call:
## lm(formula = logMass ~ Migration, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8924 -1.1769 -0.3088  0.9152  7.8427 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.77457    0.01636 230.710  < 2e-16 ***
## Migration2   0.75971    0.04731  16.059  < 2e-16 ***
## Migration3   0.37647    0.05155   7.303 3.02e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.535 on 10983 degrees of freedom
##   (23 observations deleted due to missingness)
## Multiple R-squared:  0.02563,    Adjusted R-squared:  0.02546 
## F-statistic: 144.5 on 2 and 10983 DF,  p-value: < 2.2e-16

In both cases, the global test is significant.

Given the regression coefficients returned for your Migration model, which Migration categor(ies) are different than the reference level? What level is the reference level?

In m2, Migration levels 2 and 3 are both different from Migration level 1, which is the reference level.

Relevel and assess differences among the remaining pair of Migration categories.

d <- d |> mutate(Migration = relevel(Migration, ref = "3"))
m2 <- lm(logMass ~ Migration, data = d)
summary(m2)

## 
## Call:
## lm(formula = logMass ~ Migration, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8924 -1.1769 -0.3088  0.9152  7.8427 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.15104    0.04889  84.909  < 2e-16 ***
## Migration1  -0.37647    0.05155  -7.303 3.02e-13 ***
## Migration2   0.38324    0.06603   5.804 6.67e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.535 on 10983 degrees of freedom
##   (23 observations deleted due to missingness)
## Multiple R-squared:  0.02563,    Adjusted R-squared:  0.02546 
## F-statistic: 144.5 on 2 and 10983 DF,  p-value: < 2.2e-16

Based on the re-leveled model results, Migration levels 2 and 3 are also different from one another.

Step 3

Conduct a post-hoc Tukey Significant Differences test to also evaluate which Migration categories differ “significantly” from one another.

# we need an ANOVA object to run a Tukey test
m2 <- aov(logMass ~ Migration, data = d)
(posthoc <- TukeyHSD(m2, which = "Migration", conf.level = 0.95))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = logMass ~ Migration, data = d)
## 
## $Migration
##           diff        lwr        upr p adj
## 1-3 -0.3764693 -0.4973105 -0.2556282     0
## 2-3  0.3832374  0.2284536  0.5380211     0
## 2-1  0.7597067  0.6488157  0.8705977     0

plot(posthoc)

Comparisons among all pairs of levels are significant.

Step 4

Use a permutation approach to inference to generate a null distribution of F statistic values for the model of log(Mass) in relation to Trophic.Level and calculate a p value for your original F statistic.

original.F <- aov(logMass ~ Trophic.Level, data = d) |>
  tidy() |>
  filter(term == "Trophic.Level") |>
  pull(statistic)
# show aov results for F statistic and p value for omnibus F test
original.F

## [1] 78.42283

# using {infer}
permuted.F <- d |>
  specify(logMass ~ Trophic.Level) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "F")

visualize(permuted.F) + shade_p_value(obs_stat = original.F, direction = "greater")

p.value <- permuted.F |>
  get_p_value(obs_stat = original.F, direction = "greater")

# or...
# using {mosaic}
reps <- 1000
permuted.F <- do(reps) * {
  d |>
    mutate(Trophic.Level = sample(Trophic.Level)) %>%
    # note the use of the %>% pipe operator in the line above... for some reason, the native pipe operator (`|>`) throws an error!
    aov(logMass ~ Trophic.Level, data = .) |>
    tidy() |>
    filter(term == "Trophic.Level") |>
    pull(statistic)
}

permuted.F <- as_tibble(permuted.F) |>
  rename(stat = "result")

p.value <- permuted.F |>
  get_p_value(obs_stat = original.F, direction = "greater")
p.value

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

Challenge 2

Step 1

Create the following two new variables and add them to AVONET dataset:
- Relative beak length, which you should calculate as the residual of log(Beak.Length_Culmen) on log(Mass).
- Relative tarsus length, which you should calculate as the residual of log(Tarsus.Length) on log(Mass).

relBeak <- lm(logBeak ~ logMass, data = d)
relTarsus <- lm(logTarsus ~ logMass, data = d)
d <- d |> mutate(
  relBeak = relBeak$residuals,
  relTarsus = relTarsus$residuals
)

Step 2

Make a boxplot or violin plot of your new relative tarsus length variable in relation to Primary.Lifestyle and of your new relative beak length variable in relation to Trophic.Niche

d <- d |> mutate(
  Primary.Lifestyle = factor(Primary.Lifestyle,
    levels = c("Aerial", "Aquatic", "Insessorial", "Terrestrial", "Generalist")
  )
)
p1 <- ggplot(data = d |> drop_na(Primary.Lifestyle), aes(x = Primary.Lifestyle, y = relTarsus)) +
  geom_boxplot() +
  # geom_jitter(alpha = 0.05) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

d <- d |> mutate(
  Trophic.Niche = factor(Trophic.Niche,
    levels = c("Nectarivore", "Herbivore aquatic", "Frugivore", "Granivore", "Herbivore terrestrial", "Aquatic predator", "Invertivore", "Vertivore", "Scavenger", "Omnivore")
  )
)
p2 <- ggplot(data = d |> drop_na(Trophic.Niche), aes(x = Trophic.Niche, y = relBeak)) +
  geom_boxplot() +
  # geom_jitter(alpha = 0.05) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
plot_grid(p1, p2, nrow = 1)

Step 3

Run ANOVA analyses to look at the association between geographic range size and the variable Migration. You should first drop those observations for which Migration is not scored and also look at the distribution of the variable Range.Size to decide whether and how it might need to be transformed.

migration <- d |> drop_na(Migration)
histogram(migration$Range.Size)

histogram(migration$logRS)

It looks like Range.Size should be log transformed!

# look at distribution of the data across levels
ggplot(data = migration, aes(x = Migration, y = logRS)) +
  geom_violin() +
  geom_jitter(alpha = 0.05, width = 0.5)

m3 <- lm(logRS ~ Migration, data = migration)
summary(m3)

## 
## Call:
## lm(formula = logRS ~ Migration, data = migration)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5710  -1.4521   0.4357   1.9763   5.9271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.55082    0.08896 163.568  < 2e-16 ***
## Migration1  -2.51702    0.09380 -26.834  < 2e-16 ***
## Migration2  -0.73233    0.12015  -6.095 1.13e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.785 on 10934 degrees of freedom
##   (49 observations deleted due to missingness)
## Multiple R-squared:  0.0869, Adjusted R-squared:  0.08674 
## F-statistic: 520.3 on 2 and 10934 DF,  p-value: < 2.2e-16

Based on the global model, is range size associated with form of migration? How much of the variance in your measure of range size is associated with Migration behavior level?

From the model summary(), we can see that the F statistic is significant, although the R-squared value suggests that only ~9% of the variation in log(range size) is associated with Migration behavior style.

NOTE: We can calculate the amount of variance explained by the model (i.e., R-squared, or, a.k.a. \(\eta\)-squared) ourselves from the ANOVA table output from running aov(). To do so, we calculate the ratio of the regression sum of squares to the total sum of squares. The code below pulls these value from a tidy() table of the model coefficients…

m3 <- aov(logRS ~ Migration, data = migration)
summary(m3)

##                Df Sum Sq Mean Sq F value Pr(>F)    
## Migration       2   8071    4035   520.3 <2e-16 ***
## Residuals   10934  84798       8                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 49 observations deleted due to missingness

coefs <- m3 |>
  tidy() %>%
  select(term, sumsq)
(r2 <- 1 - (coefs |> filter(term == "Residuals") |> pull(sumsq)) / (coefs |> summarize(totalSS = sum(sumsq)) |> pull(totalSS)))

## [1] 0.0869031

Given the regression coefficients returned in the output of the model, which Migration categor(ies) are different than the reference level? What level is the reference level?

In the summary() of the m3 linear model, Migration levels 1 and 2 are both different from Migration level 3, which is the reference level (because that is what we releveled to above).

Re-level and assess differences among the remaining pair of Migration categories.

migration <- migration |> mutate(Migration = relevel(Migration, ref = "1"))
m3 <- aov(logRS ~ Migration, data = migration)
summary(m3)

##                Df Sum Sq Mean Sq F value Pr(>F)    
## Migration       2   8071    4035   520.3 <2e-16 ***
## Residuals   10934  84798       8                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 49 observations deleted due to missingness

After re-leveling, we can see that Migration levels 1 and 2 are also different from one another.

Also conduct a post-hoc Tukey Honest Significant Differences test to also evaluate which Migration categories differ “significantly” from one another.

(posthoc <- TukeyHSD(m3, which = "Migration", conf.level = 0.95))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = logRS ~ Migration, data = migration)
## 
## $Migration
##           diff       lwr       upr p adj
## 3-1  2.5170168  2.297150  2.736883     0
## 2-1  1.7846901  1.582952  1.986428     0
## 2-3 -0.7323266 -1.013964 -0.450689     0

plot(posthoc)

Step 4

Winnow your original data to just consider birds from the Infraorder “Passeriformes” (song birds).

pass <- d |> filter(Order1 == "Passeriformes")

Run separate one-factor ANOVA analyses to look at the association between [1] relative beak length and Primary.Lifestyle and between [2] relative beak length and Trophic.Level. In doing so…

Make boxplots of the response variable by each predictor and by the combination of predictors…

p1 <- ggplot(data = pass, aes(x = Primary.Lifestyle, y = relBeak)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.05) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
p2 <- ggplot(data = pass, aes(x = Trophic.Level, y = relBeak)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.05) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
p3 <- ggplot(data = pass, aes(x = Primary.Lifestyle, y = relBeak)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.05) +
  facet_wrap(~Trophic.Level) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
p4 <- ggplot(data = pass, aes(x = Trophic.Level, y = relBeak)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.05) +
  facet_wrap(~Primary.Lifestyle) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
plot_grid(plot_grid(p1, p2, nrow = 1), p3, p4, nrow = 3)

… and run linear models for each predictor separately and interpret the model output.

m5 <- aov(relBeak ~ Primary.Lifestyle, data = pass)
# or
# m5 <- lm(relBeak ~ Primary.Lifestyle, data = pass)
summary(m5)

##                     Df Sum Sq Mean Sq F value Pr(>F)    
## Primary.Lifestyle    3   18.2   6.067   130.2 <2e-16 ***
## Residuals         6610  307.9   0.047                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

m6 <- aov(relBeak ~ Trophic.Level, data = pass)
# or
# m6 <- lm(relBeak ~ Trophic.Level, data = pass)
summary(m6)

##                 Df Sum Sq Mean Sq F value Pr(>F)    
## Trophic.Level    2  16.31   8.154     174 <2e-16 ***
## Residuals     6611 309.81   0.047                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Relative beak length is significantly associated with both variables, but the R-squared (or eta- squared) values are quite modest.

Step 5

Run a two-factor model to look at the association between relative beak length and both Primary.Lifestyle and Trophic.Level among the passeriforms. Based on the model output, what would you conclude about how relative beak length is related to these two variables?

m7 <- aov(relBeak ~ Primary.Lifestyle + Trophic.Level, data = pass)
# or
# m7 <- lm(relBeak ~ Primary.Lifestyle + Trophic.Level, data = pass)
summary(m7)

##                     Df Sum Sq Mean Sq F value Pr(>F)    
## Primary.Lifestyle    3  18.20   6.067   138.1 <2e-16 ***
## Trophic.Level        2  17.68   8.838   201.2 <2e-16 ***
## Residuals         6608 290.24   0.044                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both variables are important predictors of relative beak length.

Step 6

Finally, run an additional two-way model with the same dataset and predictors, but adding the possibility of an interaction term. Based on the model output, what would you now conclude about how relative beak length is related to these two variables?

m8 <- aov(relBeak ~ Primary.Lifestyle + Trophic.Level + Primary.Lifestyle:Trophic.Level, data = pass)
# or
# m8 <- lm(relBeak ~ Primary.Lifestyle + Trophic.Level + Primary.Lifestyle:Trophic.Level, data = pass)
# or
# m8 <- lm(relBeak ~ Primary.Lifestyle * Trophic.Level, data = pass)
summary(m8)

##                                   Df Sum Sq Mean Sq F value Pr(>F)    
## Primary.Lifestyle                  3  18.20   6.067  142.15 <2e-16 ***
## Trophic.Level                      2  17.68   8.838  207.11 <2e-16 ***
## Primary.Lifestyle:Trophic.Level    4   8.41   2.102   49.26 <2e-16 ***
## Residuals                       6604 281.83   0.043                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Primary.Lifestyle, Trophic.Level, and the interaction of these two variables are all significant predictors of relative beak length.

Step 7

Use the interaction.plot() function to visualize the interaction between Primary.Lifestyle and Trophic.Level.

par(mfrow = c(2, 1))
interaction.plot(
  x.factor = pass$Primary.Lifestyle,
  xlab = "Primary Lifestyle",
  trace.factor = pass$Trophic.Level,
  trace.label = "Trophic Level",
  response = pass$relBeak,
  fun = base::mean, # make sure we use {base} version
  ylab = "Mean Relative Beak Length"
)

interaction.plot(
  x.factor = pass$Trophic.Level,
  xlab = "Trophic Level",
  trace.factor = pass$Primary.Lifestyle,
  trace.label = "Primary Lifestyle",
  response = pass$relBeak,
  fun = base::mean, # make sure we use {base} version
  ylab = "Mean Relative Beak Length"
)

Step 8

Let’s look at the model of relative beak size ~ trophic level.

First, we check the ratio of variances among different levels of the predictor…

sd_ratio <- pass |>
  group_by(Trophic.Level) |>
  summarize(sd = sd(relBeak, na.rm = TRUE)) |>
  pull(sd)
(sd_ratio <- max(sd_ratio) / min(sd_ratio))

## [1] 1.336194

Then, we do visual checks for normality of observations and residuals within groups… all seem to deviate from normality!

p1 <- ggplot(data = pass, aes(x = relBeak)) +
  geom_histogram() +
  facet_wrap(~Trophic.Level)
m6data <- tibble(residuals = m6$residuals, Trophic.Level = m6$model$Trophic.Level)
p2 <- ggqqplot(data = m6data, x = "residuals") + facet_wrap(~Trophic.Level)
plot_grid(p1, p2, nrow = 2)

We can do the same for our model of log(range size) ~ migration level.

Ratio of variances…

sd_ratio <- migration |>
  group_by(Migration) |>
  summarize(sd = sd(logRS, na.rm = TRUE)) |>
  pull(sd)
(sd_ratio <- max(sd_ratio) / min(sd_ratio))

## [1] 1.332764

Visual checks… again all seem to deviate from normality!

p1 <- ggplot(data = migration, aes(x = logRS)) +
  geom_histogram() +
  facet_wrap(~Migration)
m3data <- tibble(residuals = m3$residuals, Migration = m3$model$Migration)
p2 <- ggqqplot(data = m3data, x = "residuals") + facet_wrap(~Migration)
plot_grid(p1, p2, nrow = 2)

Although the variance ratios are under 2, given the clear deviation from normality for the residuals, we are not really justified in using standard (or general) linear regression for the analyses done above.