Exercise 11

Practice Model Selection

Learning Objectives

Run exploratory multivariate regression to evaluate different models

Preliminaries

Set up a new GitHub repo in your GitHub workspace named “exercise-11” and clone that down to your computer as a new RStudio project. The instructions outlined as Method 1 in Module 6 will be helpful.
Using the {tidyverse} read_tsv() function, load the “Mammal_lifehistories_v2.txt” dataset from this URL as a “tibble” named d. As discussed in class, this is data set that compiles life history and other variables for over 1400 species of placental mammals from 17 different Orders.

Data source:

Ernest SKM. (2003). Life history characteristics of placental nonvolant mammals. Ecology 84: 3402–3402.

Do a bit of exploratory data analysis with this dataset, e.g., using the {skimr} package. Which of the variables are categorical and which are numeric?

Challenge

Step 1

Replace all values of -999 (the authors’ code for missing data) with NA.

HINT: This is easy to do in base {R}, but you can also check out the replace_with_na_all() function from the {naniar} package.

Step 2

Drop the variables litter size and refs.
Rename the variable max. life(mo) to maxlife(mo).
Rename the variable wean mass(g) to weanmass(g).

Step 3

Log transform all of the other numeric variables.

HINT: There are lots of ways to do this… look into mutate(across(where(), .funs)) for an efficient motif.

Step 4

Regress the (now log transformed) age [gestation(mo), weaning(mo), AFR(mo) (i.e., age at first reproduction), and maxlife(mo) (i.e., maximum lifespan)] and mass [newborn(g) and weanmass(g)] variables on (now log transformed) overall body mass(g) and add the residuals to the dataframe as new variables [relGest, relWean, relAFR, relLife, relNewbornMass, and relWeaningMass].

HINT: Use “na.action=na.exclude” in yourlm() calls. With this argument set, the residuals will be padded to the correct length by inserting NAs for cases with missing data. To access these correctly, however, where the length of the vector of residuals is equal in length to the length of your original vector (i.e., where missing residuals are padded out as NAs) you will need to call the resid(m) function on the model object (assuming the model is named m) rather than by calling m$residuals. The former function returns a vector of residuals with “NA” for cases where the value of one of the formula variables is missing, while the latter returns a vector with the NAs dropped, which may be shorter than the length of the original data frame!

Step 5

Plot residuals of max lifespan (relLife) in relation to Order. Which mammalian orders have the highest residual lifespan?
Plot residuals of newborn mass (relNewbornMass) in relation to Order. Which mammalian orders have the have highest residual newborn mass?
Plot residuals of weaning mass (relWeaningMass) in relation to Order. Which mammalian orders have the have highest residual weaning mass?

NOTE: There will be lots of missing data for the latter two variables!

Step 6

Run models and a model selection process to evaluate what (now log transformed) variables best predict each of the two response variables, maxlife(mo) and AFR(mo), from the set of the following predictors: gestation(mo), newborn(g), weaning(mo), weanmass(g), litters/year, and overall body mass(g).

HINT: Before running models, winnow your dataset to drop rows that are missing the respective response variable or any of the predictors, e.g., by using drop_na().

For each of the two response variables, indicate what is the best model overall based on AICc and how many models have a delta AICc of 4 or less?
What variables, if any, appear in all of this set of “top” models?
Calculate and plot the model-averaged coefficients and their CIs across this set of top models.

# Exercise 11 {.unnumbered} # Practice Model Selection {.unnumbered} ## Learning Objectives {.unnumbered} - Run exploratory multivariate regression to evaluate different models ### Preliminaries {.unnumbered} - Set up a new ***GitHub*** repo in your ***GitHub*** workspace named "exercise-11" and clone that down to your computer as a new ***RStudio*** project. The instructions outlined as **Method 1** in [**Module 6**](#module-06) will be helpful. - Using the {tidyverse} `read_tsv()` function, load the "Mammal_lifehistories_v2.txt" dataset from [this URL](https://raw.githubusercontent.com/difiore/ada-datasets/main/Mammal_lifehistories_v2.txt) as a "tibble" named **d**. As discussed in class, this is data set that compiles life history and other variables for over 1400 species of placental mammals from 17 different Orders. > **Data source**: > > Ernest SKM. (2003). Life history characteristics of placental nonvolant mammals. *Ecology* 84: 3402–3402. - Do a bit of exploratory data analysis with this dataset, e.g., using the {skimr} package. Which of the variables are categorical and which are numeric? ## Challenge {.unnumbered} #### Step 1 {.unnumbered} - Replace all values of -999 (the authors' code for missing data) with `NA`. > **HINT:** This is easy to do in base {R}, but you can also check out the `replace_with_na_all()` function from the {naniar} package. #### Step 2 {.unnumbered} - Drop the variables **litter size** and **refs**. - Rename the variable **max. life(mo)** to **maxlife(mo)**. - Rename the variable **wean mass(g)** to **weanmass(g)**. #### Step 3 {.unnumbered} - Log transform all of the other numeric variables. > **HINT:** There are lots of ways to do this... look into `mutate(across(where(), .funs))` for an efficient motif. #### Step 4 {.unnumbered} - Regress the (now log transformed) *age* [**gestation(mo)**, **weaning(mo)**, **AFR(mo)** (i.e., age at first reproduction), and **maxlife(mo)** (i.e., maximum lifespan)] and *mass* [**newborn(g)** and **weanmass(g)**] variables on (now log transformed) overall body **mass(g)** and add the residuals to the dataframe as new variables [**relGest**, **relWean**, **relAFR**, **relLife**, **relNewbornMass**, and **relWeaningMass**]. > **HINT:** Use "na.action=na.exclude" in your`lm()` calls. With this argument set, the residuals will be padded to the correct length by inserting NAs for cases with missing data. To access these correctly, however, where the length of the vector of residuals is equal in length to the length of your original vector (i.e., where missing residuals are padded out as NAs) you will need to call the `resid(m)` function on the model object (assuming the model is named **m**) rather than by calling `m$residuals`. The former function returns a vector of residuals with "NA" for cases where the value of one of the formula variables is missing, while the latter returns a vector with the NAs dropped, which may be shorter than the length of the original data frame! #### Step 5 {.unnumbered} - Plot residuals of max lifespan (**relLife**) in relation to **Order**. Which mammalian orders have the highest residual lifespan? - Plot residuals of newborn mass (**relNewbornMass**) in relation to **Order**. Which mammalian orders have the have highest residual newborn mass? - Plot residuals of weaning mass (**relWeaningMass**) in relation to **Order**. Which mammalian orders have the have highest residual weaning mass? > **NOTE:** There will be lots of missing data for the latter two variables! #### Step 6 {.unnumbered} - Run models and a model selection process to evaluate what (now log transformed) variables best predict each of the two response variables, **maxlife(mo)** and **AFR(mo)**, from the set of the following predictors: **gestation(mo)**, **newborn(g)**, **weaning(mo)**, **weanmass(g)**, **litters/year**, and overall body **mass(g)**. > **HINT:** Before running models, winnow your dataset to drop rows that are missing the respective response variable or any of the predictors, e.g., by using `drop_na()`. - For each of the two response variables, indicate what is the best model overall based on AICc and how many models have a delta AICc of 4 or less? - What variables, if any, appear in all of this set of "top" models? - Calculate and plot the model-averaged coefficients and their CIs across this set of top models.