Which models the relationship between a continuous dependent variable and a single independent variable?

Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables:

  • One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.
  • The other variable, denoted y, is regarded as the response, outcome, or dependent variable.

Because the other terms are used less frequently today, we'll use the "predictor" and "response" terms to refer to the variables encountered in this course. The other terms are mentioned only to make you aware of them should you encounter them. Simple linear regression gets its adjective "simple," because it concerns the study of only one predictor variable. In contrast, multiple linear regression, which we study later in this course, gets its adjective "multiple," because it concerns the study of two or more predictor variables.

Types of relationships

Before proceeding, we must clarify what types of relationships we won't study in this course, namely, deterministic (or functional) relationships. Here is an example of a deterministic relationship.

Note that the observed (x, y) data points fall directly on a line. As you may remember, the relationship between degrees Fahrenheit and degrees Celsius is known to be:

\[\text{F} = \frac{9}{5}\text{C}+32\]

That is, if you know the temperature in degrees Celsius, you can use this equation to determine the temperature in degrees Fahrenheit exactly.

Here are some examples of other deterministic relationships that students from previous semesters have shared:

  • Circumference = π × diameter
  • Hooke's Law: Y = α + βX, where Y = amount of stretch in a spring, and X = applied weight.
  • Ohm's Law: I = V/r, where V = voltage applied, r = resistance, and I = current.
  • Boyle's Law: For a constant temperature, P = α/V, where P = pressure, α = constant for each gas, and V = volume of gas.

For each of these deterministic relationships, the equation exactly describes the relationship between the two variables. This course does not examine deterministic relationships. Instead, we are interested in statistical relationships, in which the relationship between the variables is not perfect.

Here is an example of a statistical relationship. The response variable y is the mortality due to skin cancer (number of deaths per 10 million people) and the predictor variable x is the latitude (degrees North) at the center of each of 49 states in the U.S. (skincancer.txt) (The data were compiled in the 1950s, so Alaska and Hawaii were not yet states, and Washington, D.C. is included in the data set even though it is not technically a state.)

You might anticipate that if you lived in the higher latitudes of the northern U.S., the less exposed you'd be to the harmful rays of the sun, and therefore, the less risk you'd have of death due to skin cancer. The scatter plot supports such a hypothesis. There appears to be a negative linear relationship between latitude and mortality due to skin cancer, but the relationship is not perfect. Indeed, the plot exhibits some "trend," but it also exhibits some "scatter." Therefore, it is a statistical relationship, not a deterministic one.

Some other examples of statistical relationships might include:

  • Height and weight — as height increases, you'd expect weight to increase, but not perfectly.
  • Alcohol consumed and blood alcohol content — as alcohol consumption increases, you'd expect one's blood alcohol content to increase, but not perfectly.
  • Vital lung capacity and pack-years of smoking — as amount of smoking increases (as quantified by the number of pack-years of smoking), you'd expect lung function (as quantified by vital lung capacity) to decrease, but not perfectly.
  • Driving speed and gas mileage — as driving speed increases, you'd expect gas mileage to decrease, but not perfectly.

Okay, so let's study statistical relationships between one response variable y and one predictor variable x!

Overview

Teaching: 15 min
Exercises: 20 min

Questions

  • How can we visualise the relationship between three variables, two of which are continuous and one of which is categorical, in R?

  • How can we fit a linear regression model to this type of data in R?

  • How can we obtain and interpret the model parameters?

  • How can we visualise the linear regression model in R?

Objectives

  • Explore the relationship between a continuous dependent variable and two explanatory variables, one continuous and one categorical, using ggplot2.

  • Fit a linear regression model with one continuous explanatory variable and one categorical explanatory variable using lm().

  • Use the jtools package to interpret the model output.

  • Use the interactions package to visualise the model output.

In this episode we will learn to fit a linear regression model when we have two explanatory variables: one continuous and one categorical. Before we fit the model, we can explore the relationship between our variables graphically. We are asking the question: does the relationship between the continuous explanatory variable and the response variable differ between the levels of the categorical variable?

Exploratory plots

Let us take the response variable BMI, the continuous explanatory variable Weight and the categorical explanatory variable Sex as an example. The code below subsets our data for individuals who are older than 17 years with filter(). The plotting is then initiated using ggplot(). Inside aes(), we select the response variable with y = BMI, the continuous explanatory variable with x = Weight and the categorical explanatory variable with colour = Sex. As a consequence of the final argument, the points produced by geom_point() are coloured by Sex. Finally, we include opacity using alpha = 0.2, which allows us to distinguish areas dense in points from sparser areas. Note that RStudio returns the warning “Removed 320 rows containing missing values (geom_point).” - this reflects that 320 participants had missing BMI and/or Weight data.

The plot suggests that on average, participants classed as female have a higher BMI than participants classed as male, for any particular value of Weight. We might therefore want to account for this in our model by having separate intercepts for the levels of Sex.

dat %>% filter(Age > 17) %>% ggplot(aes(x = Weight, y = BMI, colour = Sex)) + geom_point(alpha = 0.2)

Exercise

You have been asked to model the relationship between FEV1 and Age in the NHANES data, splitting observations by whether participants are active or ex smokers (SmokeNow). Use the ggplot2 package to create an exploratory plot, ensuring that:

  1. NAs are discarded from the SmokeNow variable.
  2. FEV1 (FEV1) is on the y-axis and Age (Age) is on the x-axis.
  3. These data are shown as a scatterplot, with points coloured by SmokeNow and an opacity of 0.4.

Solution

dat %>% drop_na(SmokeNow) %>% ggplot(aes(x = Age, y = FEV1, colour = SmokeNow)) + geom_point(alpha = 0.4)

Fitting and interpreting a multiple linear regression model

We fit the model using lm(). With formula = BMI ~ Weight + Sex we specify that our model has two explanatory variables, Weight and Sex.

BMI_Weight_Sex <- dat %>% filter(Age > 17) %>% lm(formula = BMI ~ Weight + Sex)

The linear model equation associated with this model has the general form

\[E(y) = \beta_0 + \beta_1 \times x_1 + \beta_2 \times x_2.\]

Notice that we now have the additional parameters, $\beta_2$ and $x_2$, in comparison to the simple linear regression model equation. In the context of our model, $\beta_1$ is the parameter for the effect of Weight and $\beta_2$ is the parameter for the effect of Sex. Recall from the simple linear regression lesson that a categorical variable has a baseline level in R. The parameter associated with the categorical variable then estimates the difference in the outcome variable in a group different from the baseline. Since “f” precedes “m” in the alphabet, R takes female as the baseline level. Therefore, $\beta_2$ estimates the difference in BMI between males and females, given a particular value for Weight. $\beta_2$ is only included when $x_2 = 1$, i.e. when the Sex of a participant is male. This results in separate intercepts for the levels of Sex: the intercept for females is given by $\beta_0$ and the intercept for males is given by $\beta_0 + \beta_2$. In this model, the effect of Weight is given by $\beta_1$, irrespective of the Sex of a participant. This results in equal slopes across the levels of Sex.

The model parameters can be obtained using summ() from the jtools package. We include 95% confidence intervals for the parameter estimates using confint = TRUE and three digits after the decimal point using digits = 3. The intercept when Sex = female equals $5.538$ and the intercept when Sex = male equals $5.538 - 4.202 = 1.336$. Notice that in this model, the effect of Weight is $0.310$, regardless of Sex. The model equation can therefore be updated to be:

\[E(\text{BMI}) = 5.538 + 0.310 \times \text{Weight} -4.202 \times x_2,\]

where $x_2 = 1$ for Sex = male and $0$ otherwise.

summ(BMI_Weight_Sex, confint = TRUE, digits = 3)

MODEL INFO: Observations: 6177 (320 missing obs. deleted) Dependent Variable: BMI Type: OLS linear regression MODEL FIT: F(2,6174) = 19431.981, p = 0.000 R² = 0.863 Adj. R² = 0.863 Standard errors: OLS -------------------------------------------------------------- Est. 2.5% 97.5% t val. p ----------------- -------- -------- -------- --------- ------- (Intercept) 5.538 5.289 5.787 43.656 0.000 Weight 0.310 0.307 0.313 196.965 0.000 Sexmale -4.202 -4.333 -4.071 -63.075 0.000 --------------------------------------------------------------

Exercise

  1. Using the lm() command, fit a multiple linear regression of FEV1 (FEV1) as a function of Age (Age), grouped by SmokeNow. Name this lm object FEV1_Age_SmokeNow.
  2. Using the summ function from the jtools package, answer the following questions:

A) What FEV1 does the model predict, on average, for an individual, belonging to the baseline level of SmokeNow, with an Age of 0?
B) What is R taking as the baseline level of SmokeNow and why?
C) By how much is FEV1 expected to differ, on average, between the baseline and the alternative level of SmokeNow?
D) By how much is FEV1 expected to differ, on average, for a one-unit difference in Age?
E) Given these values and the names of the response and explanatory variables, how can the general equation $E(y) = \beta_0 + {\beta}_1 \times x_1 + {\beta}_2 \times x_2$ be adapted to represent the model?

Solution

FEV1_Age_SmokeNow <- dat %>% lm(formula = FEV1 ~ Age + SmokeNow) summ(FEV1_Age_SmokeNow, confint = TRUE, digits=3)

MODEL INFO: Observations: 2265 (7735 missing obs. deleted) Dependent Variable: FEV1 Type: OLS linear regression MODEL FIT: F(2,2262) = 560.807, p = 0.000 R² = 0.331 Adj. R² = 0.331 Standard errors: OLS -------------------------------------------------------------------- Est. 2.5% 97.5% t val. p ----------------- ---------- ---------- ---------- --------- ------- (Intercept) 4928.155 4805.493 5050.816 78.787 0.000 Age -35.686 -37.799 -33.573 -33.117 0.000 SmokeNowYes -249.305 -317.045 -181.564 -7.217 0.000 --------------------------------------------------------------------

A) 4928.155 mL
B) No, because “n” precedes “y” in the alphabet.
C) On average, individuals from the two categories are expected to differ by 249.305 mL.
D) It is expected to decrease by 35.686 mL.
E) $E(\text{FEV1}) = 4928.155 - 35.686 \times \text{Age} - 249.305 \times x_2$,
where $x_2 = 1$ for current smokers and $0$ otherwise.

Visualising a multiple linear regression model

The model can be visualised using two lines, representing the levels of Sex. Rather than using effect_plot() from the jtools package, we use interact_plot from the interactions package. We specify the model by its name, the continuous explanatory variable using pred = Weight and the categorical explanatory variable using modx = Sex. We include the data points using plot.points = TRUE and introduce opacity into the data points using point.alpha = 0.2.

Note that R returns the warning “Weight and Sex are not included in an interaction with one another in the model.” - we will learn what interactions are and when they might be appropriate in the next episode. In this scenario we can safely ignore the warning.

interact_plot(BMI_Weight_Sex, pred = Weight, modx = Sex, plot.points = TRUE, point.alpha = 0.2)

Exercise

To help others interpret the FEV1_Age_SmokeNow model, produce a figure. Make this figure using the interactions package.

Solution

interact_plot(FEV1_Age_SmokeNow, pred = Age, modx = SmokeNow, plot.points = TRUE, point.alpha = 0.4)

Key Points

  • A scatterplot, with points coloured by the levels of a categorical variable, can be used to explore the relationship between two continuous variables and a categorical variable.

  • The categorical variable can be added to the formula in lm() using a +.

  • The model output shows separate intercepts for the levels of the categorical variable. The slope across the levels of the categorical variable is held constant.

  • Parallel lines can be added to the exploratory scatterplot to visualise the linear regression model.

Toplist

Neuester Beitrag

Stichworte