The Correlation Coefficient Between Two Variables X And Y Is R = 0.121. What Conclusion Can We Draw?

Linear regression
- First footstep: some plotting and summary statistics
- Constructing a regression model
  - Exploring the lm object
  - Plotting the lm object
  - Diagnostic plots for diamonds information.
  - Collinearity and pairs plots
Thinking more critically nearly linear regression
- What does it mean for a coefficient to be statistically meaning?
- What happens when we have collinearity?
- Practical considerations in linear regression
Factors in linear regression
- Interpreting coefficients of cistron variables
  - Why is one of the levels missing in the regression?
- Interaction terms

###Packages

Let's brainstorm past loading the packages we'll need to get started

          library(tidyverse)

          ## ── Attaching packages ────────

          ## ✔ ggplot2 3.2.1     ✔ purrr   0.iii.2 ## ✔ tibble  2.one.three     ✔ dplyr   0.8.3 ## ✔ tidyr   ane.0.0     ✔ stringr one.iv.0 ## ✔ readr   i.3.1     ✔ forcats 0.four.0

          ## ── Conflicts ───────────────── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag()    masks stats::lag()

          library(GGally)

          ## Registered S3 method overwritten by 'GGally': ##   method from    ##   +.gg   ggplot2

          ##  ## Attaching packet: 'GGally'

          ## The following object is masked from 'package:dplyr': ##  ##     nasa

          library(knitr)

Linear regression

Linear regression is just a more full general grade of ANOVA, which itself is a generalized t-exam. In each instance, nosotros're assessing if and how the mean of our event $y$ varies with other variables. Unlike t-tests and ANOVA, which are restricted to the case where the factors of interest are all categorical, regression allows you to also model the effects of continuous variables.

linear regression is used to model linear human relationship betwixt an consequence variable, $y$, and a prepare of covariates or predictor variables $x_1, x_2, \ldots, x_p$.

For our first example we'll look at a small information prepare in which we're interested in predicting the crime rate per million population based on socio-economic and demographic data at the land level.

Let'south get-go import the data set up and see what we're working with.

            # Import data fix criminal offence <- read_delim("http://world wide web.andrew.cmu.edu/user/achoulde/94842/data/crime_simple.txt", delim = "\t")

            ## Parsed with column specification: ## cols( ##   R = col_double(), ##   Historic period = col_double(), ##   S = col_double(), ##   Ed = col_double(), ##   Ex0 = col_double(), ##   Ex1 = col_double(), ##   LF = col_double(), ##   G = col_double(), ##   N = col_double(), ##   NW = col_double(), ##   U1 = col_double(), ##   U2 = col_double(), ##   West = col_double(), ##   X = col_double() ## )

The variable names that this data set comes with are very confusing, and even misleading.

R: Crime charge per unit: # of offenses reported to police per million population

Age: The number of males of age 14-24 per g population

S: Indicator variable for Southern states (0 = No, i = Yep)

Ed: Mean # of years of schooling 10 ten for persons of age 25 or older

Ex0: 1960 per capita expenditure on law by country and local government

Ex1: 1959 per capita expenditure on police past state and local regime

LF: Labor force participation charge per unit per 1000 civilian urban males age 14-24

Grand: The number of males per chiliad females

North: State population size in hundred thousands

NW: The number of non-whites per 1000 population

U1: Unemployment rate of urban males per 1000 of age xiv-24

U2: Unemployment rate of urban males per thou of age 35-39

W: Median value of transferable goods and avails or family unit income in tens of $

X: The number of families per 1000 earning below i/two the median income

Nosotros really demand to give these variables amend names

            # Assign more than meaningful variable names, likewise # Convert is.south to a factor # Dissever average.ed by ten so that the variable is really average education # Catechumen median avails to 1000's of dollars instead of 10's criminal offense <- crime %>%   rename(crime.per.meg = R,          immature.males = Age,          is.southward = South,          average.ed = Ed,          exp.per.cap.1960 = Ex0,          exp.per.cap.1959 = Ex1,          labour.part = LF,          male person.per.fem = Grand,          population = N,          nonwhite = NW,          unemp.youth = U1,          unemp.developed = U2,          median.assets = W,          num.low.salary = X) %>%   mutate(is.south = equally.factor(is.due south),          boilerplate.ed = boilerplate.ed / 10,          median.avails = median.assets / 100) # print summary of the data summary(crime)

            ##  crime.per.million  immature.males    is.south   average.ed    ##  Min.   : 34.xx    Min.   :119.0   0:31     Min.   : 8.lxx   ##  1st Qu.: 65.85    1st Qu.:130.0   1:sixteen     1st Qu.: 9.75   ##  Median : 83.10    Median :136.0            Median :10.80   ##  Mean   : 90.51    Mean   :138.6            Mean   :10.56   ##  third Qu.:105.75    3rd Qu.:146.0            3rd Qu.:11.45   ##  Max.   :199.30    Max.   :177.0            Max.   :12.twenty   ##  exp.per.cap.1960 exp.per.cap.1959  labour.part     male person.per.fem    ##  Min.   : 45.0    Min.   : 41.00   Min.   :480.0   Min.   : 934.0   ##  1st Qu.: 62.5    1st Qu.: 58.l   1st Qu.:530.5   1st Qu.: 964.5   ##  Median : 78.0    Median : 73.00   Median :560.0   Median : 977.0   ##  Hateful   : 85.0    Mean   : 80.23   Mean   :561.two   Mean   : 983.0   ##  3rd Qu.:104.5    3rd Qu.: 97.00   3rd Qu.:593.0   3rd Qu.: 992.0   ##  Max.   :166.0    Max.   :157.00   Max.   :641.0   Max.   :1071.0   ##    population        nonwhite      unemp.youth      unemp.adult    ##  Min.   :  three.00   Min.   :  2.0   Min.   : 70.00   Min.   :twenty.00   ##  1st Qu.: 10.00   1st Qu.: 24.0   1st Qu.: 80.50   1st Qu.:27.50   ##  Median : 25.00   Median : 76.0   Median : 92.00   Median :34.00   ##  Mean   : 36.62   Mean   :101.1   Mean   : 95.47   Mean   :33.98   ##  3rd Qu.: 41.50   3rd Qu.:132.five   3rd Qu.:104.00   3rd Qu.:38.50   ##  Max.   :168.00   Max.   :423.0   Max.   :142.00   Max.   :58.00   ##  median.assets   num.low.salary  ##  Min.   :2.880   Min.   :126.0   ##  1st Qu.:4.595   1st Qu.:165.five   ##  Median :5.370   Median :176.0   ##  Mean   :5.254   Mean   :194.0   ##  tertiary Qu.:5.915   3rd Qu.:227.5   ##  Max.   :vi.890   Max.   :276.0

Starting time stride: some plotting and summary statistics

Y'all can start by feeding everything into a regression, but it'due south oft a meliorate idea to construct some uncomplicated plots (e.g., scatterplots and boxplots) and summary statistics to get some sense of how the data behaves.

              # Scatter plot of outcome (criminal offence.per.million) against average.ed qplot(boilerplate.ed, crime.per.million, data = criminal offence)

              # correlation betwixt education and crime with(crime, cor(average.ed, crime.per.million))

              ## [1] 0.3228349

This seems to advise that higher levels of average didactics are associated with higher crime rates. Can you come upwardly with an explanation for this miracle?

              # Besprinkle plot of outcome (crime.per.meg) against median.assets qplot(median.assets, law-breaking.per.million, information = criminal offense)

              # correlation betwixt educational activity and crime with(criminal offense, cor(median.avails, criminal offense.per.million))

              ## [1] 0.4413199

There besides appears to be a positive association between median avails and offense rates.

              # Boxplots showing crime rate broken down by southern vs non-southern state qplot(is.due south, law-breaking.per.meg, geom = "boxplot", information = crime)

Amalgam a regression model

To construct a linear regression model in R, we use the lm() function. You tin specify the regression model in various ways. The simplest is oft to use the formula specification.

The kickoff model nosotros fit is a regression of the outcome (crimes.per.1000000) against all the other variables in the data set. Y'all can either write out all the variable names. or employ the shorthand y ~ . to specify that you lot desire to include all the variables in your regression.

              criminal offense.lm <- lm(crime.per.1000000 ~ ., data = offense) # Summary of the linear regression model crime.lm

              ##  ## Call: ## lm(formula = crime.per.million ~ ., information = crime) ##  ## Coefficients: ##      (Intercept)       immature.males         is.south1        average.ed   ##       -vi.918e+02         1.040e+00        -8.308e+00         1.802e+01   ## exp.per.cap.1960  exp.per.cap.1959       labour.part      male person.per.fem   ##        1.608e+00        -6.673e-01        -4.103e-02         ane.648e-01   ##       population          nonwhite       unemp.youth       unemp.adult   ##       -4.128e-02         7.175e-03        -6.017e-01         1.792e+00   ##    median.avails    num.low.bacon   ##        1.374e+01         vii.929e-01

              summary(law-breaking.lm)

              ##  ## Call: ## lm(formula = law-breaking.per.million ~ ., data = crime) ##  ## Residuals: ##     Min      1Q  Median      3Q     Max  ## -34.884 -11.923  -one.135  13.495  50.560  ##  ## Coefficients: ##                    Estimate Std. Error t value Pr(>|t|)     ## (Intercept)      -6.918e+02  1.559e+02  -4.438 nine.56e-05 *** ## young.males       1.040e+00  four.227e-01   2.460  0.01931 *   ## is.south1        -8.308e+00  1.491e+01  -0.557  0.58117     ## boilerplate.ed        ane.802e+01  half dozen.497e+00   ii.773  0.00906 **  ## exp.per.cap.1960  ane.608e+00  1.059e+00   1.519  0.13836     ## exp.per.cap.1959 -6.673e-01  i.149e+00  -0.581  0.56529     ## labour.part      -4.103e-02  1.535e-01  -0.267  0.79087     ## male.per.fem      one.648e-01  2.099e-01   0.785  0.43806     ## population       -4.128e-02  one.295e-01  -0.319  0.75196     ## nonwhite          7.175e-03  6.387e-02   0.112  0.91124     ## unemp.youth      -6.017e-01  4.372e-01  -1.376  0.17798     ## unemp.adult       i.792e+00  8.561e-01   2.093  0.04407 *   ## median.assets     1.374e+01  1.058e+01   1.298  0.20332     ## num.depression.salary    7.929e-01  2.351e-01   3.373  0.00191 **  ## --- ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ##  ## Residuum standard error: 21.94 on 33 degrees of freedom ## Multiple R-squared:  0.7692, Adjusted R-squared:  0.6783  ## F-statistic: 8.462 on 13 and 33 DF,  p-value: 3.686e-07

R's default is to output values in scientific annotation. This can make information technology difficult to interpret the numbers. Here's some code that can be used to force full printout of numbers.

              options(scipen=4)  # Ready scipen = 0 to get back to default

              summary(crime.lm)

              ##  ## Call: ## lm(formula = crime.per.1000000 ~ ., data = law-breaking) ##  ## Residuals: ##     Min      1Q  Median      3Q     Max  ## -34.884 -11.923  -1.135  xiii.495  l.560  ##  ## Coefficients: ##                     Guess  Std. Error t value  Pr(>|t|)     ## (Intercept)      -691.837588  155.887918  -4.438 0.0000956 *** ## immature.males         1.039810    0.422708   2.460   0.01931 *   ## is.south1          -8.308313   14.911588  -0.557   0.58117     ## boilerplate.ed         18.016011    6.496504   2.773   0.00906 **  ## exp.per.cap.1960    1.607818    1.058667   i.519   0.13836     ## exp.per.cap.1959   -0.667258    1.148773  -0.581   0.56529     ## labour.function        -0.041031    0.153477  -0.267   0.79087     ## male.per.fem        0.164795    0.209932   0.785   0.43806     ## population         -0.041277    0.129516  -0.319   0.75196     ## nonwhite            0.007175    0.063867   0.112   0.91124     ## unemp.youth        -0.601675    0.437154  -i.376   0.17798     ## unemp.adult         1.792263    0.856111   2.093   0.04407 *   ## median.assets      thirteen.735847   x.583028   1.298   0.20332     ## num.low.salary      0.792933    0.235085   3.373   0.00191 **  ## --- ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.ane ' ' 1 ##  ## Residual standard mistake: 21.94 on 33 degrees of liberty ## Multiple R-squared:  0.7692, Adjusted R-squared:  0.6783  ## F-statistic: 8.462 on 13 and 33 DF,  p-value: 0.0000003686

Looking at the p-values, information technology looks like num.low.salary (number of families per 1000 earning beneath one/ii the median income), unemp.developed (Unemployment charge per unit of urban males per 1000 of age 35-39), average.ed (Mean # of years of schooling 25 or older), and young.males (number of males of age fourteen-24 per 1000 population) are all statistically pregnant predictors of offense rate.

The coefficients for these predictors are all positive, so crime rates are positively associated with wealth inequality, developed unemployment rates, average didactics levels, and high rates of immature males in the population.

Exploring the lm object

What kind of output do we get when we run a linear model (lm) in R?

                # List all attributes of the linear model attributes(criminal offence.lm)

                ## $names ##  [i] "coefficients"  "residuals"     "furnishings"       "rank"          ##  [5] "fitted.values" "assign"        "qr"            "df.residual"   ##  [9] "contrasts"     "xlevels"       "call"          "terms"         ## [13] "model"         ##  ## $course ## [1] "lm"

                # coefficients crime.lm$coef

                ##      (Intercept)      young.males        is.south1       average.ed  ##   -691.837587905      1.039809653     -8.308312889     18.016010601  ## exp.per.cap.1960 exp.per.cap.1959      labour.office     male.per.fem  ##      one.607818377     -0.667258285     -0.041031047      0.164794968  ##       population         nonwhite      unemp.youth      unemp.developed  ##     -0.041276887      0.007174688     -0.601675298      1.792262901  ##    median.assets   num.depression.bacon  ##     13.735847285      0.792932786

None of the attributes seem to give you p-values. Here's what y'all tin can do to go a tabular array that allows y'all to extract p-values.

                # Pull coefficients element from summary(lm) object round(summary(crime.lm)$coef, iii)

                ##                  Estimate Std. Error t value Pr(>|t|) ## (Intercept)      -691.838    155.888  -4.438    0.000 ## young.males         one.040      0.423   2.460    0.019 ## is.south1          -8.308     14.912  -0.557    0.581 ## boilerplate.ed         xviii.016      six.497   2.773    0.009 ## exp.per.cap.1960    1.608      ane.059   one.519    0.138 ## exp.per.cap.1959   -0.667      1.149  -0.581    0.565 ## labour.part        -0.041      0.153  -0.267    0.791 ## male person.per.fem        0.165      0.210   0.785    0.438 ## population         -0.041      0.130  -0.319    0.752 ## nonwhite            0.007      0.064   0.112    0.911 ## unemp.youth        -0.602      0.437  -1.376    0.178 ## unemp.adult         i.792      0.856   ii.093    0.044 ## median.avails      13.736     x.583   one.298    0.203 ## num.low.salary      0.793      0.235   3.373    0.002

If you want a particular p-value, y'all can get it by doing the post-obit

                # Pull the coefficients table from summary(lm) offense.lm.coef <- round(summary(crime.lm)$coef, three) # See what this gives class(crime.lm.coef)

                ## [one] "matrix"

                attributes(crime.lm.coef)

                ## $dim ## [one] fourteen  4 ##  ## $dimnames ## $dimnames[[1]] ##  [1] "(Intercept)"      "young.males"      "is.south1"        ##  [4] "average.ed"       "exp.per.cap.1960" "exp.per.cap.1959" ##  [seven] "labour.part"      "male.per.fem"     "population"       ## [10] "nonwhite"         "unemp.youth"      "unemp.adult"      ## [13] "median.assets"    "num.depression.salary"   ##  ## $dimnames[[2]] ## [i] "Estimate"   "Std. Error" "t value"    "Pr(>|t|)"

                crime.lm.coef["average.ed", "Pr(>|t|)"]

                ## [1] 0.009

The coefficients table is a matrix with named rows and columns. Yous tin therefore access detail cells either by numeric index, or past proper name (as in the example in a higher place).

Plotting the lm object

                plot(law-breaking.lm)

These four plots are of import diagnostic tools in assessing whether the linear model is appropriate. The showtime 2 plots are the nearly of import, merely the terminal 2 can also help with identifying outliers and non-linearities.

Residuals vs. Fitted When a linear model is advisable, we expect

the residuals volition take abiding variance when plotted against fitted values; and
the residuals and fitted values will be uncorrelated.

If there are articulate trends in the residual plot, or the plot looks similar a funnel, these are articulate indicators that the given linear model is inappropriate.

Normal QQ plot You can use a linear model for prediction even if the underlying normality assumptions don't hold. Still, in guild for the p-values to be believable, the residuals from the regression must look approximately normally distributed.

Scale-location plot This is some other version of the residuals vs fitted plot. There should be no discernible trends in this plot.

Residuals vs Leverage. Leverage is a measure of how much an observation influenced the model fit. It's a 1-number summary of how different the model fit would be if the given ascertainment was excluded, compared to the model fit where the observation is included. Points with high rest (poorly described by the model) and loftier leverage (loftier influence on model fit) are outliers. They're skewing the model fit abroad from the rest of the data, and don't actually seem to fit with the rest of the data.

The residual vs fitted and scale-location diagnostic plots for the criminal offense data aren't specially insightful, largely due to the very small sample size. Below we look at the diamonds data to see what a more typical anaylsis of linear model diagnostic plots might reveal.

Diagnostic plots for diamonds data.

                diamonds.lm <- lm(price ~ carat + cut + clarity + color, data = diamonds)  plot(diamonds.lm)

Residuals vs. Fitted

There is a clear indication of non-linearity present in this plot. Furthermore, we see that the variance appears to be increasing in fitted value.

Normal QQ plot The residuals appear highly non-normal. Both the lower tail and upper tail are heavier than we would expect under normality. This may exist due to the non-constant variance issue nosotros observed in the Residuals vs. Fitted plot.

Scale-location plot We encounter a articulate increasing trend in residual variance that runs through near of the plot. This is indicated by the upward gradient of the red line, which we can translate as the standard deviation of the residuals at the given level of fitted value.

Residuals vs Leverage. None of the points appear to be outliers.

Here's what happens if we log-transform both the price and carat variables.

                diamonds.lm2 <- lm(log(price) ~ I(log(carat)) + cut + clarity + color, data = diamonds)  plot(diamonds.lm2)

While there remains a very slight indication of not-linearity in the Rest vs Fitted plot, the non-constant variance issue appears to have been addressed past the variable transformations. The Normal QQ plot indicates that the residuals have a heavier tailed distribution, but since we accept a very large sample size this should not cause problems for inference. There practice non appear to be whatever clear outliers in the data.

Collinearity and pairs plots

In your regression class you probably learned that collinearity tin throw off the coefficient estimates. To diagnose collinearity, we can do a plot matrix. In base graphics, this tin can be accomplished via the pairs function.

As a demo, let's look at some of the economic indicators in our information set.

                economic.var.names <- c("exp.per.cap.1959", "exp.per.cap.1960", "unemp.adult", "unemp.youth", "labour.part", "median.assets") pairs(law-breaking[,economic.var.names])

                round(cor(law-breaking[,economical.var.names]), 3)

                ##                  exp.per.cap.1959 exp.per.cap.1960 unemp.adult unemp.youth ## exp.per.cap.1959            ane.000            0.994       0.169      -0.052 ## exp.per.cap.1960            0.994            1.000       0.185      -0.044 ## unemp.adult                 0.169            0.185       1.000       0.746 ## unemp.youth                -0.052           -0.044       0.746       1.000 ## labour.part                 0.106            0.121      -0.421      -0.229 ## median.avails               0.794            0.787       0.092       0.045 ##                  labour.part median.avails ## exp.per.cap.1959       0.106         0.794 ## exp.per.cap.1960       0.121         0.787 ## unemp.adult           -0.421         0.092 ## unemp.youth           -0.229         0.045 ## labour.part            1.000         0.295 ## median.assets          0.295         ane.000

Since the in a higher place-diagonal and below-diagonal plots contain substantially the same data, it'southward oftentimes more useful to brandish some other values in one of the spaces. In the instance below, nosotros apply the panel.cor function from the pairs() documentation to add text below the diagonal.

                # Function taken from ?pairs Case section.   console.cor <- part(x, y, digits = 2, prefix = "", cex.cor, ...) {     usr <- par("usr"); on.go out(par(usr))     par(usr = c(0, 1, 0, 1))     r <- abs(cor(10, y))     txt <- format(c(r, 0.123456789), digits = digits)[i]     txt <- paste0(prefix, txt)     if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)     text(0.5, 0.5, txt, cex = pmax(1, cex.cor * r)) }  # Utilise panel.cor to display correlations in lower panel. pairs(criminal offence[,economic.var.names], lower.console = panel.cor)

                # ggpairs from GGally library # Dissimilar pairs(), ggpairs() works with non-numeric # predictors in addition to numeric ones. # Consider ggpairs() for your last projection ggpairs(criminal offense[,c(economic.var.names, "is.south")], axisLabels = "internal")

                ## `stat_bin()` using `bins = thirty`. Option better value with `binwidth`. ## `stat_bin()` using `bins = thirty`. Choice better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick amend value with `binwidth`. ## `stat_bin()` using `bins = 30`. Option better value with `binwidth`. ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Looking at the plot, we see that many of the variables are very strongly correlated. In particular, police expenditures are pretty much identical in 1959 and 1960. This is an extreme case of collinearity. Also, unsurprisingly, youth unemployment and developed unemployment are also highly correlated.

Let'due south but include the 1960 police expenditure variable, and too drop the youth unemployment variable. We'll practise this using the update() office. Here's what happens.

                crime.lm.2 <- update(law-breaking.lm, . ~ . - exp.per.cap.1959 - unemp.youth) summary(law-breaking.lm.2)

                ##  ## Call: ## lm(formula = crime.per.1000000 ~ young.males + is.south + boilerplate.ed +  ##     exp.per.cap.1960 + labour.part + male.per.fem + population +  ##     nonwhite + unemp.adult + median.assets + num.depression.salary,  ##     data = crime) ##  ## Residuals: ##    Min     1Q Median     3Q    Max  ## -35.82 -11.57  -1.51  10.63  55.02  ##  ## Coefficients: ##                     Guess  Std. Error t value  Pr(>|t|)     ## (Intercept)      -633.438828  145.470340  -4.354  0.000111 *** ## young.males         ane.126883    0.418791   2.691  0.010853 *   ## is.south1          -0.556600   thirteen.883248  -0.040  0.968248     ## average.ed         15.328028    6.202516   2.471  0.018476 *   ## exp.per.cap.1960    1.138299    0.226977   5.015 0.0000153 *** ## labour.part         0.068716    0.133540   0.515  0.610087     ## male.per.fem        0.003021    0.173041   0.017  0.986172     ## population         -0.064477    0.128278  -0.503  0.618367     ## nonwhite           -0.013794    0.061901  -0.223  0.824960     ## unemp.adult         0.931498    0.541803   ane.719  0.094402 .   ## median.assets      15.158975   x.524458   i.440  0.158653     ## num.low.bacon      0.825936    0.234189   3.527  0.001197 **  ## --- ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ##  ## Residue standard error: 21.98 on 35 degrees of liberty ## Multiple R-squared:  0.7543, Adjusted R-squared:  0.6771  ## F-statistic: ix.769 on 11 and 35 DF,  p-value: 0.00000009378

                criminal offense.lm.summary.2 <- summary(crime.lm.two)

When outputting regression results, it'south ever skilful to employ the kable() function to make things wait a fiddling nicer.

                kable(crime.lm.summary.2$coef,        digits = c(3, 3, 3, 4), format = 'markdown')

	Estimate	Std. Mistake	t value	Pr(>\|t\|)
(Intercept)	-633.439	145.470	-4.354	0.0001
young.males	1.127	0.419	2.691	0.0109
is.south1	-0.557	13.883	-0.040	0.9682
boilerplate.ed	15.328	6.203	2.471	0.0185
exp.per.cap.1960	1.138	0.227	5.015	0.0000
labour.part	0.069	0.134	0.515	0.6101
male.per.fem	0.003	0.173	0.017	0.9862
population	-0.064	0.128	-0.503	0.6184
nonwhite	-0.014	0.062	-0.223	0.8250
unemp.developed	0.931	0.542	i.719	0.0944
median.assets	15.159	ten.524	1.440	0.1587
num.depression.bacon	0.826	0.234	3.527	0.0012

Thinking more critically about linear regression

And then far nosotros accept seen how to run a linear regression using the lm() function and how to use the plot() and pairs() commands to diagnose common problems such as non-abiding variance, outliers, and collinearity among predictors. In this section we'll delve deeper into linear regression to better understand how to interpret the output. Our word will focus on interpreting factors (chiselled variables) and interaction terms.

Permit'southward choice up where we only left off. At the last stage, we had a regression with a couple of variable removed to address collinearity problems.

            law-breaking.lm <- lm(offense.per.one thousand thousand ~ ., information = crime)  # Remove 1959 expenditure and youth unemployment crime.lm2 <- update(criminal offence.lm, . ~ . - exp.per.cap.1959 - unemp.youth)

Here'south a comparison of the regression models (with and without the collinearity problem).

            kable(summary(criminal offence.lm)$coef,        digits = c(3, 3, 3, 4), format = 'markdown')

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	-691.838	155.888	-4.438	0.0001
young.males	1.040	0.423	2.460	0.0193
is.south1	-8.308	14.912	-0.557	0.5812
average.ed	18.016	6.497	ii.773	0.0091
exp.per.cap.1960	one.608	ane.059	1.519	0.1384
exp.per.cap.1959	-0.667	ane.149	-0.581	0.5653
labour.function	-0.041	0.153	-0.267	0.7909
male.per.fem	0.165	0.210	0.785	0.4381
population	-0.041	0.130	-0.319	0.7520
nonwhite	0.007	0.064	0.112	0.9112
unemp.youth	-0.602	0.437	-1.376	0.1780
unemp.adult	i.792	0.856	2.093	0.0441
median.avails	13.736	x.583	1.298	0.2033
num.low.salary	0.793	0.235	three.373	0.0019

            offense.lm.summary2 <- summary(crime.lm2) kable(offense.lm.summary2$coef,        digits = c(3, 3, 3, four), format = 'markdown')

	Estimate	Std. Fault	t value	Pr(>\|t\|)
(Intercept)	-633.439	145.470	-4.354	0.0001
young.males	ane.127	0.419	two.691	0.0109
is.south1	-0.557	xiii.883	-0.040	0.9682
average.ed	15.328	vi.203	2.471	0.0185
exp.per.cap.1960	1.138	0.227	5.015	0.0000
labour.part	0.069	0.134	0.515	0.6101
male.per.fem	0.003	0.173	0.017	0.9862
population	-0.064	0.128	-0.503	0.6184
nonwhite	-0.014	0.062	-0.223	0.8250
unemp.adult	0.931	0.542	i.719	0.0944
median.assets	15.159	10.524	1.440	0.1587
num.depression.bacon	0.826	0.234	3.527	0.0012

Observe that the coefficient of 1960 expenditure went from beingness non-signficant to significant (p-value is now very small).

What does information technology mean for a coefficient to be statistically significant?

Allow's expect at the coefficient of average.ed in the crime.lm2 model. This coefficient is 15.3280278. We might interpret it by saying that:

All else being equal between two states, a one-year increase in average teaching appears to exist associated with a fifteen.3 increase in crime rates per million.

In addition to the coefficient estimate, nosotros besides have a standard error gauge and a p-value. The standard error tells us how uncertain our guess of the coefficient of average.ed actually is. In this instance, our estimate is fifteen.3, but the standard fault is six.203. Using the "2 standard fault rule" of thumb, we could refine our before statement to say:

Based on the data, we estimate a 1-yr increase in average pedagogy is associated with a 15.3 +/- 12.iv increment in crimes per meg population

In other words, our estimate is quite uncertain (has a large standard error).

The "2 standard fault rule" is a nice quick way of putting together approximate 95% confidence intervals for regression coefficients. Here'due south a more principled arroyo, which works for any desired confidence level. This approach uses the confint command.

              # all 95% confidence intervals confint(law-breaking.lm2)

              ##                         2.5 %       97.five % ## (Intercept)      -928.7593182 -338.1183387 ## young.males         0.2766914    1.9770739 ## is.south1         -28.7410920   27.6278928 ## average.ed          ii.7362499   27.9198056 ## exp.per.cap.1960    0.6775118    1.5990864 ## labour.part        -0.2023846    0.3398163 ## male.per.fem       -0.3482706    0.3543119 ## population         -0.3248958    0.1959409 ## nonwhite           -0.1394591    0.1118719 ## unemp.developed        -0.1684209    2.0314168 ## median.assets      -6.2068096   36.5247604 ## num.low.salary      0.3505063    1.3013656

              # Only for educational activity confint(law-breaking.lm2, parm = "average.ed")

              ##              2.5 %   97.v % ## boilerplate.ed ii.73625 27.91981

              # 75% confidence interval confint(crime.lm2, parm = "boilerplate.ed", level = 0.75)

              ##              12.five %   87.5 % ## average.ed 8.072542 22.58351

              # How does 2 SE rule compare to confint output? #  lower endpoint coef(crime.lm2)["average.ed"] - 2* summary(crime.lm2)$coef["average.ed", "Std. Fault"]

              ## boilerplate.ed  ##   ii.922995

              # upper endpoint coef(crime.lm2)["average.ed"] + ii* summary(crime.lm2)$coef["average.ed", "Std. Error"]

              ## boilerplate.ed  ##   27.73306

The p-value of 0.0184764 is less than 0.05, so this tells us that the coefficient estimate is statistically significantly different from 0. What does this hateful? It means that the information suggests the actual association between average education and crime rates is non-zero. i.e., the data shows testify that the coefficient is non-zero.

One of the exercises on Homework 5 will walk you through running a simulation experiment to better sympathize what signficance means in a regression setting.

Here's a preview. The red line is the truthful regression line. The greyness points prove a random realization of the data. The various black curves evidence 100 estimates of the regression line based on repeated random realizations of the data.

Estimated regression lines

What happens when we have collinearity?

Hither'southward an extreme example of perfectly collinear data.

              my.data <- data.frame(y =  c(12, 13, 10, five, 7, 12, 15),                       x1 = c(6, half-dozen.5, five, 2.5, three.5, 6, 7.5),                       x2 = c(half-dozen, six.five, 5, 2.v, 3.5, vi, vii.5)) my.data

              ##    y  x1  x2 ## 1 12 vi.0 six.0 ## 2 13 6.5 six.5 ## three x 5.0 5.0 ## four  v 2.five 2.v ## 5  7 3.five 3.5 ## half dozen 12 half-dozen.0 6.0 ## 7 xv 7.5 7.5

What do you notice?

By construction, x1 and x2 are exactly the same variable, and the upshot y is perfectly modelled equally $y = x_1 + x_2$.

Just at that place'due south a trouble… because the post-obit are too true

$y = 2 x_1$

$y = three x_1 - x_2$

$y = -400x_1 + 402 x_2$

In other words, based on the data, there's no way of figuring out which of these models is "correct". However, if we drib i of the variables from the model, we know exactly what the coefficient of the other should be.

Colinearity amid predictors causes problems for regression precisely because the model is unable to accurately distinguish between many nearly equally plausible linear combinations of colinear variables. This can pb to large standard errors on coefficients, and even coefficient signs that don't make sense.

Applied considerations in linear regression

Afterwards dealing with the colinearity event by removing the 1959 expenditure variable, nosotros see that exp.per.cap.1960 is now highly significant.

              crime.lm.summary2$coef["exp.per.cap.1960",]

              ##      Estimate    Std. Error       t value      Pr(>|t|)  ## 1.13829907170 0.22697675756 5.01504684417 0.00001532994

This is interesting. It's substantially saying that, all else beingness equal, every dollar per capita increment in police expenditure is on boilerplate associated with an increase in criminal offence of i.thirteen per one thousand thousand population.

              criminal offense.lm.summary2$coef["average.ed",]

              ##    Approximate  Std. Error     t value    Pr(>|t|)  ## 15.32802778  6.20251646  ii.47125951  0.01847635

Besides, for every unit increase in boilerplate educational activity, we find that the number of reported crimes increases past near fifteen.3 per one thousand thousand.

Ane of my primary reasons for selecting this data fix is that it illustrates some of the more than common pitfalls in interpreting regression models.

But because a coefficient is significant, doesn't mean your covariate causes your response

This is the one-time adage that correlation does not imply causation. In this example, we take strong show that higher police expenditures are positively associated with crime rates. This doesn't mean that decreasing police expenditure will lower criminal offence charge per unit. The relationship is non causal – at to the lowest degree not in that direction. A more reasonable explanation is that higher law-breaking rates promt policy makers to increase law expenditure.

At that place's a deviation between practical significance and statistical significance

Both average.ed and exp.per.cap.1960 are statistically significant. exp.per.cap.1960 has a much more significant p-value, but also a much smaller coefficient. When looking at your regression model, you shouldn't just look at the p-value column. The really interesting covariates are the ones that are pregnant, but as well take the largest issue.

Note also that the units of measurement should exist taken into account when thinking almost coefficient estimates and upshot sizes. Suppose, for example, that we regressed income (measured in $) on height and got a coefficient estimate of 100, with a standard error of 20. Is 100 a large effect? The respond depends on the units of measurement. If superlative had been measured in metres, we would be maxim that every 1m increase in height is associated on average with a $100 increase in income. That'due south also pocket-size an effect for us to care well-nigh. At present what if height was measured in mm? Then we'd be saying that every 1mm increment in height is associated on average with a $100 increase in income. Since 1inch = 25.4mm, this means that every 1inch difference in summit is on average associated with a $2540 difference in income. This would be a tremendously large effect. Moral of the story: Whether an effect is 'practically significant' depends a lot on the unit of measurement of measurement.

Factors in linear regression

Interpreting coefficients of cistron variables

In the case of quantitative predictors, we're more than or less comfortable with the interpretation of the linear model coefficient as a "slope" or a "unit of measurement increase in outcome per unit increase in the covariate". This isn't the right interpretation for factor variables. In particular, the notion of a slope or unit modify no longer makes sense when talking about a categorical variable. E.thousand., what does it even mean to say "unit increase in major" when studying the effect of college major on future earnings?

To understand what the coefficients really mean, allow'south get back to the birthwt data and try regressing birthweight on mother'due south race and mother's age.

              # Fit regression model birthwt.lm <- lm(birthwt.grams ~ race + mother.age, information = birthwt)  # Regression model summary summary(birthwt.lm)

              ##  ## Call: ## lm(formula = birthwt.grams ~ race + mother.age, information = birthwt) ##  ## Residuals: ##      Min       1Q   Median       3Q      Max  ## -2131.57  -488.02    -1.16   521.87  1757.07  ##  ## Coefficients: ##             Gauge Std. Error t value Pr(>|t|)     ## (Intercept) 2949.979    255.352  xi.553   <2e-16 *** ## raceblack   -365.715    160.636  -two.277   0.0240 *   ## raceother   -285.466    115.531  -two.471   0.0144 *   ## mother.age     6.288     ten.073   0.624   0.5332     ## --- ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.one ' ' 1 ##  ## Residual standard error: 715.seven on 185 degrees of freedom ## Multiple R-squared:  0.05217,    Adjusted R-squared:  0.0368  ## F-statistic: 3.394 on three and 185 DF,  p-value: 0.01909

Notation that there are two coefficients estimated for the race variable (raceother and racewhite). What's happening here?

When you lot put a cistron variable into a regression, you're allowing a different intercept at every level of the cistron. In the present example, yous're saying that yous want to model birthwt.grams as

Baby's birthweight = Intercept(based on mother'southward race) + $\beta$ * mother's age

Essentially y'all're saying that your information is broken down into 3 racial groups, and you want to model your data as having the aforementioned gradient governing how birthweight changes with mother's age, but potentially unlike intercepts. Here's a picture of what's happening.

              # Calculate race-specific intercepts intercepts <- c(coef(birthwt.lm)["(Intercept)"],                 coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["raceother"],                 coef(birthwt.lm)["(Intercept)"] + coef(birthwt.lm)["racewhite"])  lines.df <- data.frame(intercepts = intercepts,                        slopes = rep(coef(birthwt.lm)["mother.historic period"], 3),                        race = levels(birthwt$race))  qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) +    geom_abline(aes(intercept = intercepts,                    slope = slopes,                    color = race), data = lines.df)

              ## Alert: Removed one rows containing missing values (geom_abline).

How do we interpret the 2 race coefficients? For chiselled variables, the estimation is relative to the given baseline. The baseline is merely whatever level comes first (here, "black"). E.g., the estimate of raceother means that the estimated intercept is -285.5 higher among "other" race mothers compared to black mothers. Similarly, the estimated intercept is NA higher for white mothers than black mothers.

Some other fashion of putting it: Among mothers of the same age, babies of white mothers are born on average weighing NAg more than babies of black mothers.

Why is one of the levels missing in the regression?

As you've already noticed, in that location is no coefficient called "raceblack" in the estimated model. This is considering this coefficient gets absorbed into the overall (Intercept) term.

Let'southward peek under the hood. Using the model.matrix() part on our linear model object, we tin can get the information matrix that underlies our regression. Here are the get-go 20 rows.

                head(model.matrix(birthwt.lm), twenty)

                ##    (Intercept) raceblack raceother mother.historic period ## i            1         one         0         19 ## two            1         0         ane         33 ## iii            1         0         0         20 ## four            1         0         0         21 ## five            1         0         0         18 ## six            1         0         one         21 ## 7            ane         0         0         22 ## 8            i         0         1         17 ## 9            1         0         0         29 ## 10           ane         0         0         26 ## xi           1         0         ane         19 ## 12           one         0         ane         19 ## 13           ane         0         i         22 ## 14           1         0         i         30 ## fifteen           1         0         0         eighteen ## 16           1         0         0         18 ## 17           1         ane         0         15 ## xviii           1         0         0         25 ## 19           ane         0         1         twenty ## 20           1         0         0         28

Fifty-fifty though nosotros remember of the regression birthwt.grams ~ race + mother.historic period as being a regression on two variables (and an intercept), information technology's really a regression on 3 variables (and an intercept). This is because the race variable gets represented every bit two dummy variables: one for race == other and the other for race == white.

Why isn't there a column for representing the indicator of race == black? This gets back to our colinearity consequence. By definition, nosotros have that

raceblack + raceother + racewhite = 1 = (Intercept)

This is because for every ascertainment, one and only one of the race dummy variables volition equal 1. Thus the group of 4 variables {raceblack, raceother, racewhite, (Intercept)} is perfectly colinear, and we tin can't include all 4 of them in the model. The default behavior in R is to remove the dummy corresponding to the first level of the factor (here, raceblack), and to continue the residue.

Interaction terms

Permit's go back to the regression line plot we generated above.

              qplot(x = mother.age, y = birthwt.grams, color = race, data = birthwt) +    geom_abline(aes(intercept = intercepts,                    slope = slopes,                    color = race), data = lines.df)

              ## Warning: Removed 1 rows containing missing values (geom_abline).

We have seen similar plots earlier by using the geom_smooth or stat_smooth commands in ggplot. Compare the plot in a higher place to the post-obit.

              qplot(ten = mother.age, y = birthwt.grams, color = race, data = birthwt) + stat_smooth(method = "lm", se = Faux, fullrange = Truthful)

In this case we have non only race-specific intercepts, but also race-specific slopes. The plot above corresponds to the model:

Baby'due south birthweight = Intercept(based on mother'south race) + $\beta$(based on mother'due south race) * female parent's age

To specify this interaction model in R, we use the following syntax

              birthwt.lm.interact <- lm(birthwt.grams ~ race * female parent.age, data = birthwt)  summary(birthwt.lm.interact)

              ##  ## Telephone call: ## lm(formula = birthwt.grams ~ race * mother.historic period, data = birthwt) ##  ## Residuals: ##      Min       1Q   Median       3Q      Max  ## -2182.35  -474.23    xiii.48   523.86  1496.51  ##  ## Coefficients: ##                      Estimate Std. Error t value Pr(>|t|)     ## (Intercept)           2583.54     321.52   8.035 1.11e-xiii *** ## raceblack             1022.79     694.21   one.473   0.1424     ## raceother              326.05     545.thirty   0.598   0.5506     ## female parent.age              21.37      12.89   1.658   0.0991 .   ## raceblack:mother.age   -62.54      30.67  -2.039   0.0429 *   ## raceother:mother.age   -26.03      23.20  -1.122   0.2633     ## --- ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.one ' ' 1 ##  ## Residual standard error: 710.seven on 183 degrees of freedom ## Multiple R-squared:  0.07541,    Adjusted R-squared:  0.05015  ## F-statistic: 2.985 on 5 and 183 DF,  p-value: 0.01291

We now have new terms appearing. Terms like racewhite:mother.age are deviations from the baseline slope (the coefficient of female parent.age in the model) in the same way that terms like racewhite are deviations from the baseline intercept. This models says that:

On average amid black mothers, every boosted yr of age is associated with a 21.4g decrease in the birthweight of the baby.

To get the slope for white mothers, nosotros need to add the interaction term to the baseline.

slope(racewhite) = slope(raceblack) + racewhite:mother.age = mother.historic period + racewhite:mother.age = NA

This gradient estimate is positive, which agrees with the regression plot above.