Exercise 11
Practice Model Selection
Learning Objectives
- Run exploratory multivariate regression to evaluate different models
Preliminaries
- Set up a new GitHub repo in your GitHub workspace named “exercise-11” and clone that down to your computer as a new RStudio project. The instructions outlined as Method 1 in Module 6 will be helpful.
- Using the {tidyverse} read_tsv() function, load the “Mammal_lifehistories_v2.txt” dataset from this URL as a “tibble” named d. As discussed in class, this is a dataset that compiles life history and other variables for over 1400 species of placental mammals from 17 different Orders.
Data source:
Ernest SKM. (2003). Life history characteristics of placental nonvolant mammals. Ecology 84: 3402–3402.
- Do a bit of exploratory data analysis with this dataset, e.g., using the {skimr} package. Which of the variables are categorical and which are numeric?
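As a minimal sketch of the loading and exploration steps, the block below parses a tiny inline tab-separated table in place of the real file. The inline rows and values are made up for illustration; in the exercise, the first argument to read_tsv() would be the dataset URL instead.

```r
library(readr)

# Hypothetical stand-in for the real file: a few tab-separated rows
# with illustrative values only
toy_tsv <- I("order\tspecies\tmass(g)\tgestation(mo)\nRodentia\tsp1\t20\t0.7\nPrimates\tsp2\t5000\t5.5\n")

d <- read_tsv(toy_tsv, show_col_types = FALSE)

# Classify columns by type
numeric_vars <- names(d)[sapply(d, is.numeric)]
categorical_vars <- names(d)[sapply(d, is.character)]

# With {skimr} installed, skimr::skim(d) gives a fuller per-column summary
```

Columns that read_tsv() parses as character are the categorical ones here; everything parsed as double or integer is numeric.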
Challenge
Step 1
- Replace all values of -999 (the authors’ code for missing data) with NA.
HINT: This is easy to do in base {R}, but you can also check out the replace_with_na_all() function from the {naniar} package.
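One possible approach, sketched here with {dplyr} rather than base R or {naniar}, is to apply na_if() across the numeric columns. The toy tibble and its values are assumptions for illustration only.

```r
library(dplyr)

# Toy stand-in for the real dataset (illustrative values)
d <- tibble(
  order = c("Artiodactyla", "Carnivora", "Rodentia"),
  `mass(g)` = c(45375, -999, 21.5),
  `gestation(mo)` = c(8.13, 2.1, -999)
)

# Convert the -999 missing-data code to NA in every numeric column
d <- d |>
  mutate(across(where(is.numeric), ~ na_if(.x, -999)))
```

Restricting the replacement to numeric columns avoids comparing the character columns against a number.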
Step 2
- Drop the variables litter size and refs.
- Rename the variable max. life(mo) to maxlife(mo).
- Rename the variable wean mass(g) to weanmass(g).
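A minimal sketch of this step with {dplyr}, assuming a toy tibble that has the same column names as the real data. Names containing spaces or parentheses need backticks.

```r
library(dplyr)

# Toy stand-in with the relevant column names (values are illustrative)
d <- tibble(
  species = c("a", "b"),
  `max. life(mo)` = c(120, 60),
  `wean mass(g)` = c(1000, 5),
  `litter size` = c(1, 4),
  refs = c("1;2", "3")
)

d <- d |>
  select(-c(`litter size`, refs)) |>       # drop the two unwanted variables
  rename(
    `maxlife(mo)` = `max. life(mo)`,       # new name = old name
    `weanmass(g)` = `wean mass(g)`
  )
```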
Step 3
- Log transform all of the other numeric variables.
HINT: There are lots of ways to do this… look into mutate(across(where(), .funs)) for an efficient motif.
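The motif in the hint might look like the following sketch, where where(is.numeric) selects the columns and log is the function applied to each. The toy data are illustrative.

```r
library(dplyr)

# Toy stand-in (illustrative values); NA represents missing data
d <- tibble(
  order = c("Rodentia", "Primates"),
  `mass(g)` = c(20, 5000),
  `gestation(mo)` = c(0.7, NA)
)

# Natural-log transform every numeric column in place;
# character columns are untouched and NA stays NA
d <- d |>
  mutate(across(where(is.numeric), log))
```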
Step 4
- Regress the (now log transformed) age [gestation(mo), weaning(mo), AFR(mo) (i.e., age at first reproduction), and maxlife(mo) (i.e., maximum lifespan)] and mass [newborn(g) and weanmass(g)] variables on (now log transformed) overall body mass(g) and add the residuals to the dataframe as new variables [relGest, relWean, relAFR, relLife, relNewbornMass, and relWeaningMass].
HINT: Use “na.action=na.exclude” in your lm() calls. With this argument set, the residuals will be padded to the correct length by inserting NAs for cases with missing data. To get this padded vector, whose length equals the number of rows in your original data, call the resid(m) function on the model object (assuming the model is named m) rather than accessing m$residuals. The former returns a vector of residuals with NA for cases where the value of one of the formula variables is missing, while the latter returns a vector with the NAs dropped, which may be shorter than the number of rows in the original data frame!
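The difference between resid(m) and m$residuals can be seen in a small sketch. The data below are made up, with one missing gestation value, and only one of the six regressions is shown.

```r
library(dplyr)

# Toy, already log-transformed data with one missing response value
d <- tibble(
  `mass(g)` = log(c(10, 100, 1000, 10000, 500)),
  `gestation(mo)` = log(c(1, 2, NA, 8, 3))
)

m <- lm(`gestation(mo)` ~ `mass(g)`, data = d, na.action = na.exclude)

n_raw <- length(m$residuals)  # only the complete cases
n_padded <- length(resid(m))  # one entry per row of d, NA where data were missing

# The padded vector lines up with the rows of d, so it can be added directly
d <- d |> mutate(relGest = resid(m))
```

Trying to add m$residuals as a column instead would fail here, because its length does not match the number of rows.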
Step 5
- Plot residuals of max lifespan (relLife) in relation to Order. Which mammalian orders have the highest residual lifespan?
- Plot residuals of newborn mass (relNewbornMass) in relation to Order. Which mammalian orders have the highest residual newborn mass?
- Plot residuals of weaning mass (relWeaningMass) in relation to Order. Which mammalian orders have the highest residual weaning mass?
NOTE: There will be lots of missing data for the latter two variables!
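One way to approach these plots, sketched with {ggplot2} boxplots on made-up residuals, is shown below. The column name order and all values are assumptions; ranking orders by median residual is just one reasonable way to answer the "highest" question.

```r
library(ggplot2)

# Toy residuals for three orders (illustrative values, with some NAs)
d <- data.frame(
  order = rep(c("Carnivora", "Primates", "Rodentia"), each = 4),
  relLife = c(0.1, -0.2, 0.0, NA, 0.6, 0.4, 0.5, 0.7, -0.3, -0.5, NA, -0.4)
)

# Median residual per order, sorted from highest to lowest
med <- sort(tapply(d$relLife, d$order, median, na.rm = TRUE), decreasing = TRUE)

# Boxplot of residuals by order; rows with missing residuals are dropped silently
p <- ggplot(d, aes(x = order, y = relLife)) +
  geom_boxplot(na.rm = TRUE) +
  labs(x = "Order", y = "Residual max lifespan (relLife)") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

Printing p draws the plot, and names(med)[1] gives the order with the highest median residual in the toy data.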
Step 6
- Run models and a model selection process to evaluate what (now log transformed) variables best predict each of the two response variables, maxlife(mo) and AFR(mo), from the set of the following predictors: gestation(mo), newborn(g), weaning(mo), weanmass(g), litters/year, and overall body mass(g).
HINT: Before running models, winnow your dataset to drop rows that are missing the respective response variable or any of the predictors, e.g., by using drop_na().
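A minimal sketch of the winnowing step for one response variable, using a toy tibble with only a subset of the predictors. Passing the column names through all_of() limits drop_na() to the variables that matter for that particular model set.

```r
library(dplyr)
library(tidyr)

# Toy stand-in (illustrative values)
d <- tibble(
  `maxlife(mo)` = c(5.1, NA, 4.2),
  `AFR(mo)` = c(NA, 3.0, 3.2),
  `mass(g)` = c(8, 7, NA),
  `gestation(mo)` = c(1.2, 0.9, 1.0)
)

# Response plus predictors for the max lifespan models
vars_life <- c("maxlife(mo)", "mass(g)", "gestation(mo)")

# Keep only rows that are complete for these variables;
# a missing AFR(mo) value does not cause a row to be dropped here
d_life <- d |> drop_na(all_of(vars_life))
```

Building a separate winnowed dataset per response variable keeps as many rows as possible for each analysis.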
For each of the two response variables, indicate which model is best overall based on AICc and how many models have a delta AICc of 4 or less.
Which variables, if any, appear in every model in this set of “top” models?
Calculate and plot the model-averaged coefficients and their CIs across this set of top models.
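The exercise does not name a package for model selection. The sketch below assumes {MuMIn}, using dredge() and model.avg(), together with {ggplot2} for the coefficient plot. The data are simulated and the variable names mass, gestation, litters, and maxlife are hypothetical simplifications of the real column names.

```r
library(MuMIn)    # assumed package choice; not specified in the exercise
library(ggplot2)

# Simulated data: maxlife depends on mass; the other predictors are noise
set.seed(42)
n <- 60
toy <- data.frame(mass = rnorm(n), gestation = rnorm(n), litters = rnorm(n))
toy$maxlife <- 0.6 * toy$mass + rnorm(n, sd = 0.5)

# dredge() requires a global model fit with na.action = na.fail
full <- lm(maxlife ~ mass + gestation + litters, data = toy, na.action = na.fail)

ms <- dredge(full)              # all predictor subsets, ranked by AICc (default)
top <- subset(ms, delta <= 4)   # "top" models with delta AICc of 4 or less
n_top <- nrow(top)

# Predictors present (non-NA coefficient) in every top model
preds <- c("mass", "gestation", "litters")
present <- !is.na(as.data.frame(top)[, preds, drop = FALSE])
in_all <- preds[colSums(present) == n_top]

# Model-averaged coefficients and confidence intervals across the top set
avg <- model.avg(ms, subset = delta <= 4, fit = TRUE)
est <- coef(avg)
ci <- confint(avg)[names(est), , drop = FALSE]
coefs <- data.frame(term = names(est), estimate = est,
                    lower = ci[, 1], upper = ci[, 2])

p <- ggplot(subset(coefs, term != "(Intercept)"),
            aes(x = term, y = estimate, ymin = lower, ymax = upper)) +
  geom_pointrange() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  coord_flip()
```

The first row of ms is the best model by AICc, n_top answers the "how many models" question, and in_all lists the variables shared by every top model. The same steps would be repeated with AFR(mo) as the response.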