Linear Regression

Visualizing the relationship between 2 or more variables

  • Scatterplot - the position of each point is determined by the values of two numerical variables: one on the horizontal (x) axis, the other on the vertical (y) axis (one point for each observation)

Features of the association between the two numerical variables

  • Form - describes the pattern that the two variables follow together

    Ex. Linear, non-linear, quadratic, exponential…

  • Direction

    • Positive association - values of one variable tend to increase as the other’s increase
    • Negative association - values of one variable tend to decrease as the other’s increase
  • Strength - describes how concentrated the values of the variable are around the pattern

    • Strong, moderate, weak
library(tidyverse)  # provides ggplot2 and the %>% pipe

<data_set> %>% ggplot(aes(x=<variable_1>, y=<variable_2>)) + 
	geom_point() +
	labs(x = "Name of x-axis (unit)",
         y = "Name of y-axis (unit)") +
	theme_minimal()  # optional, makes the background white

Example of the heights dataset

  • heights - the name of the dataset in use (<data_set>)
  • shoePrint - the name of the numerical variable you want to display on the x-axis (<variable_1>)
  • height - the name of the numerical variable you want to display on the y-axis (<variable_2>)
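
For example, filling in the template with the heights dataset (a sketch; assumes heights is already loaded as a tibble and that both variables are measured in cm):

heights %>% ggplot(aes(x = shoePrint, y = height)) +
	geom_point() +
	labs(x = "Shoe print length (cm)",
	     y = "Height (cm)") +
	theme_minimal()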

Visualizing the numerical variables together with a categorical variable

  1. Add ggplot(aes(..., color = <variable_3>)) to colour each point based on the value of the categorical variable

    Ex. ggplot(aes(..., color = sex))

  2. Use facet_wrap() and specify the name of a categorical variable to get a separate plot for each value of this variable (2 cases here) (can also be used to create side-by-side histograms or barplots…)

    Ex. facet_wrap(~sex)

  • It’s a good idea to try both options and see which one is the more effective representation (both are sketched below)
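
Both options, sketched for the heights data (assumes heights contains the categorical variable sex):

# option 1: colour each point by sex
heights %>% ggplot(aes(x = shoePrint, y = height, color = sex)) +
	geom_point()

# option 2: a separate panel for each value of sex
heights %>% ggplot(aes(x = shoePrint, y = height)) +
	geom_point() +
	facet_wrap(~sex)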

Quantifying association: correlation

  • Correlation summarizes the strength and direction of the linear relationship between two numerical variables (a non-linear relationship is not captured by correlation)

  • Sample correlation between variables $x$ and $y$ for observations $(x_1, y_1), \dots, (x_n, y_n)$:

    $r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$

  • The sign of $r$ gives the direction ($r > 0$ - positive association, $r < 0$ - negative association)

    • The magnitude $|r|$ is a measure of the strength of the linear association
      • $|r| = 1$ if and only if there is a perfect linear relationship between $x$ and $y$
cor(x = heights$shoePrint,  # '$' extracts a column vector from a tibble
 	y = heights$height)

## [1] 0.812948

The correlation between shoe print length and height in this sample of 40 individuals is 0.81
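
As a sanity check on the formula above, the same value can be computed by hand (a sketch; assumes the heights tibble is loaded):

x <- heights$shoePrint
y <- heights$height
sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))  # same value as cor(x, y)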

Linear Regression Models

1: Numerical predictor

  • Goal of linear regression: understand variation and predict patterns (both need a model)

  • Simple linear regression model - assumes there is a “best” straight line that explains the real relationship between $x$ and $y$, and that the observed values randomly deviate from this line: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$

  • $y_i$ - response variable (or dependent variable, target variable…) for observation $i$

    • Values are random, and observed in the sample data
  • $x_i$ - independent variable (or predictor, covariate, feature, input…) for observation $i$

    • Fixed (constant) and observed in sample data
  • $\beta_0$ - intercept parameter (its estimate has a closed-form math expression)

  • $\beta_1$ - slope parameter (its estimate has a closed-form math expression)

    • Both are fixed (constants) but unknown
  • $\epsilon_i$ - random error term for observation $i$ (random deviation)

    • Random, but cannot be calculated directly (we don’t know the true values of $\beta_0$ and $\beta_1$)
  • The population regression line is unknown; we need to estimate a line that is as close as possible to as many points as possible in the sample

    • The most common approach is to minimize the sum of squared vertical differences between each observation and the fitted (estimated) line, $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
    • Least Squares Regression Line - the straight line which minimizes the sum of squared vertical distances between each point and the fitted line (compared to all other possible straight lines)
  • Use the lm() function to fit a linear regression model

    model1 <- lm(height ~ shoePrint, data= heights)
    summary(model1)$coefficients
    
    ##				Estimate	Std. Error	t-value		Pr(>|t|)
    ## (Intercept)	80.930409	10.8933945	7.429310	6.504368e-09
    ## shoePrint	3.218561	0.3740081	8.605591	1.863474e-10
    • (Intercept) is the estimate of $\beta_0$ (i.e., $\hat{\beta}_0 \approx 80.93$)
    • shoePrint is the estimate of $\beta_1$ (i.e., $\hat{\beta}_1 \approx 3.22$)
  • geom_smooth(method="lm", se=FALSE)
    • To add the fitted regression line to a plot
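
For example, a sketch for the heights data (assumes heights is loaded and the tidyverse is attached):

heights %>% ggplot(aes(x = shoePrint, y = height)) +
	geom_point() +
	geom_smooth(method = "lm", se = FALSE)  # adds the least-squares regression line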

Interpretation of regression coefficients

  • The estimated simple regression of $y$ on $x$ is: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ (the fitted regression line or fitted regression equation)

  • $\hat{y}$ is the fitted value or predicted value: the estimated average value of $y$ when the predictor is equal to $x$

    • The slope $\hat{\beta}_1$ is the average change in $\hat{y}$ for a 1-unit change in $x$
    • The intercept $\hat{\beta}_0$ is the average value of $y$ when $x = 0$ (often this doesn’t make sense in context, but it tells us the height of the line)
    • The difference between the observed and predicted value of $y$ for the $i$th observation is called the residual ($e_i = y_i - \hat{y}_i$) [the vertical distance between the point and the estimated line]
  • In general, it is not OK to say that a change in the predictor causes a change in $y$. It is only OK to talk about the association observed

    Suppose you wake up to find that your bike has been stolen, but there is a fresh shoeprint in the mud nearby. You measure it and it is 30cm long. Based on this fitted regression model, how tall would you predict that the person who left the shoeprint was?

    General equation for the fitted line: $\widehat{\text{height}} = 80.93 + 3.22 \times \text{shoePrint}$, so the predicted height is $80.93 + 3.22 \times 30 \approx 177.5$ cm (see the predict() sketch after this list)

  • Extrapolation - means trying to predict the response variable for values of the explanatory variable beyond those contained in the data

    • A model is only as good as the data it was trained on
    • NO reason to think that the trend for the observed range would be valid outside of the range
  • The coefficient of determination ($R^2$) is the proportion of the variability in $y$ which is explained by the fitted regression model

    • $R^2$ close to 1 indicates that most of the variability in $y$ is explained by the regression model
    • $R^2$ close to 0 indicates that very little of the variability in $y$ is explained by the regression model
    • Conveniently, $R^2$ is equal to the square of the correlation ($R^2 = r^2$)
    summary(model1)$r.squared
    ## [1] 0.6608845
    cor(x = heights$shoePrint, y = heights$height)^2
    ## [1] 0.6608845
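
A sketch of the shoeprint prediction from the example above using predict() (assumes model1 was fit as shown earlier and the tidyverse is attached):

new_obs <- tibble(shoePrint = 30)   # the 30cm shoeprint from the example
predict(model1, newdata = new_obs)  # predicted height, about 80.93 + 3.22 * 30 = 177.5 cm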

2: Categorical predictor

  • The equation for a simple regression line with one numerical predictor is $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$; for a categorical predictor, the value $x_i$ needs to be an indicator (dummy) variable that encodes the categorical data (e.g., $x_i = 1$ if observation $i$ is male and $x_i = 0$ otherwise)
    This also needs a baseline level (the level corresponding to $x_i = 0$); here F (female) is the baseline level
model2 <- lm(height ~ sex, data= heights)
summary(model2)$coefficients

##				Estimate	Std. Error	t-value		Pr(>|t|)
## (Intercept)	166.82381	1.357760	122.866909	5.085412e-51
## sexM			15.79198	1.970046	8.016048	1.085391e-09
  • (Intercept) is the estimate of $\beta_0$ (i.e., $\hat{\beta}_0 \approx 166.82$)
  • sexM is the estimate of $\beta_1$ (i.e., $\hat{\beta}_1 \approx 15.79$)
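
To see the indicator variable R creates behind the scenes, one option is to inspect the model (design) matrix (a sketch; assumes heights is loaded and the tidyverse is attached):

model.matrix(~ sex, data = heights) %>% head()  # the column sexM is 1 for males and 0 for females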

Interpreting $\hat{\beta}_0$ and $\hat{\beta}_1$

  • Combining the fitted equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ with the indicator coding ($x = 1$ for M, $x = 0$ for F), we get:
    • When $x = 0$, $\hat{y} = \hat{\beta}_0$, which implies this is the predicted value of $y$ for individuals with $x = 0$ (in the example, the predicted height for women)
    • When $x = 1$, $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1$, which implies this is the predicted value of $y$ for individuals with $x = 1$ (in the example, the predicted height for men)
    • $\hat{\beta}_1$ is the average difference in the response variable between the two categories (a quick check is sketched below)
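
A sketch of checking this interpretation against the group means (assumes heights and the tidyverse are loaded); the F mean should match the intercept and the M mean should match intercept + slope:

heights %>%
	group_by(sex) %>%
	summarise(mean_height = mean(height))
# F mean should be about 166.82 and M mean about 166.82 + 15.79 = 182.61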

Inference for Simple Linear Regression

  • Fitted regression line: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ (similar in form to the population model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$)

  • However, the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are not equal to the true parameter values $\beta_0$ and $\beta_1$

    • The estimates are based on the sample data, so they are subject to sampling variability
  • For the linear model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, write a pair of hypotheses to test if the slope parameter in this regression model is different from 0

    • $H_0: \beta_1 = 0$ vs $H_A: \beta_1 \neq 0$
    • Under the linear model, if the slope parameter is 0, then $x$ does not predict $y$, so predicting the sample mean of $y$ for all observations is fine; if the slope ($\beta_1$) is different from 0, then knowing $x$ does help to better predict $y$

Assumptions for statistical inference on regression coefficients

In STA130, these assumptions did not need to be verified before doing the inference

  • The p-values in the output of lm() are based on the Student’s $t$ distribution (a continuous probability distribution). So for these to be valid, we need to make a few assumptions

    1. There is a linear association between $x$ and $y$
    2. Constant variance in $y$ for all values of $x$ (the scatterplot is not cone-shaped)
    3. The observations are independent
    4. The residuals follow a normal distribution
  • If one or more of the assumptions above is not reasonable, then the inference may not be valid (an informal residual check is sketched below)
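
Not required in STA130, but an informal way to eyeball assumptions 2 and 4 is to plot the residuals, for example (a sketch; assumes model1 was fit as above and the tidyverse is attached):

tibble(fitted = fitted(model1), resid = resid(model1)) %>%
	ggplot(aes(x = fitted, y = resid)) +
	geom_point() +                                   # look for roughly constant spread around 0
	geom_hline(yintercept = 0, linetype = "dashed")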

Using R for hypothesis testing

  • R automatically gives the p-value for a hypothesis test of the form $H_0: \beta_1 = 0$ vs $H_A: \beta_1 \neq 0$

    • The p-value is the Pr(>|t|) column in the lm() output above (see the output for model1)

    In the example for model1, the p-value for the slope is very small (Pr(>|t|) $\approx 1.9 \times 10^{-10}$), which indicates there is very strong evidence against the null hypothesis that the slope is 0 (a sketch of extracting this value is shown below).
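
A sketch of pulling that p-value directly out of the coefficient table (assumes model1 was fit as above); the value matches the Pr(>|t|) entry shown earlier:

summary(model1)$coefficients["shoePrint", "Pr(>|t|)"]
## [1] 1.863474e-10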


  • Advantage of randomization test
    • NO assumptions about the distribution of the data
    • More flexible (can be used to compare any statistic across two groups, not just the mean)
  • Advantage of linear regression approach
    • Only requires 1 (or 2) lines of code (but only valid if the assumptions are valid)