Multiple Linear Regression and Assessing Prediction Models
Branches of statistical inference
- Testing - a hypothesis test evaluates evidence against a particular value for a parameter
- Hypothesis test for one proportion
- Randomization test to compare the values of a parameter across two groups
- Estimation - confidence interval estimating a parameter (gives a range of plausible values for a parameter)
- Bootstrap confidence interval
- Prediction - predict value of a variable for an observation using a statistical model based on other variables
- Simple linear regression (one predictor) — useful when response is numerical
- Multiple linear regression (multiple predictors) — useful when response is numerical
- Classification trees - [next lecture] — useful when response is categorical
Measuring and assessing prediction accuracy
Strategy
- Randomly divide the sample data into “training” and “testing” datasets
  - Ex: 80% for training and 20% for testing
- Fit the prediction model based on the training data
- Run the “test” data through the fitted prediction model & look at how accurate the prediction model was
- When a prediction model is fit, it uses the pairs of observations to estimate the regression coefficients. Therefore it is fairer to use a different set of observations for testing than the one used for fitting (building) the prediction model
- The Root Mean Squared Error (RMSE) helps summarize the “accuracy” of the predictions. More specifically, it measures prediction error for predictions from a linear regression model
- It can be used to compare datasets of different sizes (as long as they are all in the same units) and to compare different models on the same dataset (see the formula below)
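For $n$ observations in the test data, the RMSE is (standard definition, matching the R code below):

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$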
- rowid_to_column() - adds an “index” column (the row id) to a dataset; a built-in tidyverse function
- %in% - an operator that checks whether an element is found in a set of values (used here to filter rows)
- predict(<model_name>, newdata = <prediction_data>) - makes predictions for new observations
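A tiny illustration of what %in% returns (toy values, just for intuition):

ids <- c(2, 5)
c(1, 2, 3, 4, 5) %in% ids    # FALSE TRUE FALSE FALSE TRUE
# so filter(rowid %in% ids) keeps only the rows whose rowid appears in ids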
R Code to create training and testing datasets & make predictions using a fitted regression model
library(tidyverse)
heights <- read.csv("heights.csv") %>% select(-X)
set.seed(123)
n <- nrow(heights) # number of observations in heights data (40)
# Pick 80% of observations to go into the training dataset (32 out of 40)
training_indices <- sample(1:n, size = round(0.8*n))
# Add a column called "rowid" to our heights tibble
heights <- heights %>% rowid_to_column()
# Create training dataset
train <- heights %>% filter(rowid %in% training_indices)
# Testing dataset includes all observations NOT in the training data
test <- heights %>% filter(!(rowid %in% training_indices))
# Build model using the training data only
model <- lm(height ~ shoePrint, data = train)
# Make predictions for test data using the training model
y_hat_test <- predict(model, newdata = test)
y_test <- test$height
# Calculate RMSE for prediction in the test dataset
sqrt(sum((y_test - y_hat_test)^2) / nrow(test))
Interpretation and purpose of the RMSE
- A small value of RMSE indicates that on average, the predictions for the response are close to the true (observed) values. (small implies good prediction accuracy)
- The coefficient of determination ($R^2$) only tells us the proportion of variability in the response explained by the model; it does not directly indicate the accuracy of the predictions
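A quick way to look at both quantities, assuming the heights objects model, y_test, and y_hat_test created in the code above (a sketch):

summary(model)$r.squared               # R^2: proportion of variability in height explained by the model
sqrt(mean((y_test - y_hat_test)^2))    # RMSE on the test data, in the same units as height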
| $R^2$ (coefficient of determination) | RMSE |
| --- | --- |
| Range: 0 to 1 | Range: 0 to infinity |
| Larger value indicates a larger proportion of variation of the response explained by the model | Smaller value indicates better prediction accuracy |
| No units | Same units as $y$ (the response variable) |
Multiple linear regression: Building richer prediction models
Parallel lines
- Linear regression models can have 2 (or more) predictors; these predictors can be numerical or categorical
- Equation for the regression line: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$, where $x_1$ is the first predictor and $x_2$ is the second predictor
- Fitted line when $x_2 = 0$ (assume $x_2$ is categorical with two levels): $\hat{y} = b_0 + b_1 x_1$
- Fitted line when $x_2 = 1$: $\hat{y} = (b_0 + b_2) + b_1 x_1$
- The second (categorical) variable produces multiple parallel lines (common slope of $b_1$) with different intercepts
- The intercept for the second group is $b_0 + b_2$; the difference between the two intercepts is $b_2$
- New functions to plot multiple lines in R
  - The augment(model_name) function from the broom library creates a data frame with predicted values (in the .fitted column), residuals, etc.
# Fitted model for 2 predictors
model2 <- lm(<target_variable> ~ <x_1> + <x_2>, data = <dataset>)
summary(model2)$coefficients
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) ... ... ... ...
## variable_1 ... ... ... ...
## variable_2 ... ... ... ...
library(broom)
augment(model2)
## A tibble: (num of rows) x (number of columns)
## <target_variable> <x_1> <x_2> .fitted .se.fit .resid .hat .sigma .cooksd
## <dbl> <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ... ... ... ... ... ... ... ... ...
## 2 ... ... ... ... ... ... ... ... ...
## 3 ... ... ... ... ... ... ... ... ...
## ... with (num of rows - 3) more rows, and (number of columns - 9) more variables
# Plot the parallel lines model
<dataset> %>% ggplot(aes(x=<x_1>, y=<target_variable>, color=<x_2>)) +
geom_point(alpha=0.5) +
geom_line(data=augment(model2), aes(y=.fitted, colour=<x_2>)) +
labs(x="name of x_1 varaible(unit)", y="name of target variable (unit)")
# This would output a graph
Parallel regression lines created for the mariokart dataset from the openintro library
- Use the geom_line() function, with the template from above, for the regression lines
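A sketch of reading the common slope and the two intercepts off a fitted parallel-lines model, assuming the mariokart variables used later in these notes (total_pr, duration, cond); the coefficient name condused assumes cond is a factor with levels new and used:

library(openintro)   # assumed source of the mariokart dataset
model_parallel <- lm(total_pr ~ duration + cond, data = mariokart)
b <- coef(model_parallel)
common_slope   <- b[["duration"]]                         # shared slope b1
intercept_new  <- b[["(Intercept)"]]                      # intercept b0 for the baseline group (cond = new)
intercept_used <- b[["(Intercept)"]] + b[["condused"]]    # intercept b0 + b2 for the other group (cond = used)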
Non-parallel lines
- Simpson’s paradox - adding an extra predictor changes the direction of the association in a linear regression model
- If the new predictor modifies the relationship between the first predictor and the target variable, create a new model that represents this with an interaction term
- Interaction term - a new explanatory variable which is the product of $x_1$ and $x_2$; the regression line becomes $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2$
- Fitted line when $x_2 = 0$ (assume $x_2$ is categorical with two levels): $\hat{y} = b_0 + b_1 x_1$ [does not change]
- Fitted line when $x_2 = 1$: $\hat{y} = (b_0 + b_2) + (b_1 + b_3) x_1$ [change of slope]
# Fit the linear model with an interaction term (use * instead of + between predictors)
model3 <- lm(<target_variable> ~ <x_1> * <x_2>, data = <dataset>)
summary(model3)$coefficients
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) ... ... ... ...
## variable_1 ... ... ... ...
## variable_2 ... ... ... ...
## variable_1: variable_2 ... ... ... ...
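A similar sketch for reading the two slopes off a fitted interaction model, assuming it was fit on the mariokart data as total_pr ~ duration * cond (the coefficient names condused and duration:condused assume cond is a factor with levels new and used):

library(openintro)   # assumed source of the mariokart dataset
model_interaction <- lm(total_pr ~ duration * cond, data = mariokart)
b <- coef(model_interaction)
slope_new  <- b[["duration"]]                                # slope b1 for the baseline group (cond = new)
slope_used <- b[["duration"]] + b[["duration:condused"]]     # slope b1 + b3 for the other group (cond = used)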
Non-parallel regression lines created for the mariokart dataset from the openintro library
- Use the geom_smooth() function, with the same template as for simple linear regression plots
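A sketch of that plot for the mariokart data (variable names as used later in these notes; the axis labels and units are assumptions):

library(tidyverse)
library(openintro)   # assumed source of the mariokart dataset
mariokart %>%
  ggplot(aes(x = duration, y = total_pr, color = cond)) +
  geom_point(alpha = 0.5) +
  # geom_smooth(method = "lm") fits a separate line within each colour group,
  # so the lines are allowed to have different slopes (non-parallel)
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Auction duration (days)", y = "Total price (USD)")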
Comparing prediction models (criteria for choosing the best)
- A model with good prediction accuracy on average (small RMSE)
- A model which doesn’t exhibit evidence of overfitting (RMSE for training and test data are similar)
- Overfitting - the model follows the training data too closely: it predicts well for the training data but doesn’t predict new data well; the RMSE for the training dataset is much smaller than for the testing dataset
- A model which is as simple as possible (not too many predictors), while still capturing the association between the predictors and the response
Steps to choose between different models
- Input the data and add a new ID column called rowid
- Create training dataset
- Testing dataset includes all observations NOT in the training data
- Fit the models to training data
- Make predictions for testing and training data using training model
- Calculate RMSE for testing and training data
- Comparing the different models
- Prediction accuracy from the RMSE for test data
- Evidence for overfitting (compare RMSE for training and testing data)
- Simplicity of the model
- Conclusion
Comparing prediction models for the mariokart dataset
Models
- A: total_pr ~ duration
- B: total_pr ~ cond
- C: total_pr ~ duration + cond
- D: total_pr ~ duration * cond (includes the duration × cond interaction)
Creating training/testing datasets & fitting models
# Set up
set.seed(1211)
n <- nrow(mariokart2)
training_indices <- sample(1:n, size = round(0.8*n))
mariokart2 <- mariokart2 %>% rowid_to_column()
# Create training dataset
train <- mariokart2 %>% filter(rowid %in% training_indices)
y_train <- train$total_pr
# Testing dataset includes all observations NOT in the training data
test <- mariokart2 %>% filter(!(rowid %in% training_indices))
y_test <- test$total_pr
# Fit models to training data
modA_train <- lm(total_pr ~ duration, data = train)
modB_train <- lm(total_pr ~ cond, data = train)
modC_train <- lm(total_pr ~ duration + cond, data = train)
modD_train <- lm(total_pr ~ duration * cond, data = train)
Making predictions for observations using either training or testing data (separately)
# Make predictions for training data using training model
yhat_modA_train <- predict(modA_train, newdata = train)
yhat_modB_train <- predict(modB_train, newdata = train)
yhat_modC_train <- predict(modC_train, newdata = train)
yhat_modD_train <- predict(modD_train, newdata = train)
# Make predictions for testing data using training model
yhat_modA_test <- predict(modA_train, newdata = test)
yhat_modB_test <- predict(modB_train, newdata = test)
yhat_modC_test <- predict(modC_train, newdata = test)
yhat_modD_test <- predict(modD_train, newdata = test)
Calculating RMSE
# Calculate RMSE for training data
modA_train_RMSE <- sqrt(sum((y_train - yhat_modA_train)^2) / nrow(train))
modB_train_RMSE <- sqrt(sum((y_train - yhat_modB_train)^2) / nrow(train))
modC_train_RMSE <- sqrt(sum((y_train - yhat_modC_train)^2) / nrow(train))
modD_train_RMSE <- sqrt(sum((y_train - yhat_modD_train)^2) / nrow(train))
# Calculate RMSE for testing data
modA_test_RMSE <- sqrt(sum((y_test - yhat_modA_test)^2) / nrow(test))
modB_test_RMSE <- sqrt(sum((y_test - yhat_modB_test)^2) / nrow(test))
modC_test_RMSE <- sqrt(sum((y_test - yhat_modC_test)^2) / nrow(test))
modD_test_RMSE <- sqrt(sum((y_test - yhat_modD_test)^2) / nrow(test))
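The repeated RMSE lines above could also be written with a small helper function (a sketch, not part of the original code):

rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))    # same as sqrt(sum(...) / n)
# e.g. modA_test_RMSE <- rmse(y_test, yhat_modA_test)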
Comparing the four models
mytable <- tibble(
  Model = c("A", "B", "C", "D"),
  RMSE_testdata = c(modA_test_RMSE, modB_test_RMSE, modC_test_RMSE, modD_test_RMSE),
  RMSE_traindata = c(modA_train_RMSE, modB_train_RMSE, modC_train_RMSE, modD_train_RMSE),
  ratio_of_RMSEs = RMSE_testdata / RMSE_traindata
)
(Output: mytable, showing RMSE_testdata, RMSE_traindata, and ratio_of_RMSEs for models A to D)
Analysis of the results and conclusion
- Prediction accuracy: Model D has a much lower RMSE for the test data than the other models. Models B and C are similar to each other in terms of RMSE, and model A has a much higher RMSE.
- Evidence of overfitting: The RMSE for model D is about 15% higher for the test data than the training data, while for the other models, this ranges from 17% to 21%. Model D shows the least evidence of overfitting among these models.
- Simplicity of the model: Models A and B each have one predictor, while model C has two predictors and model D has two predictors and one interaction term (total of 4 regression coefficients).
- Conclusion: Although model D is the most complex of these models, with four regression coefficients, it exhibits the best prediction accuracy (i.e. lowest RMSE for test data) and shows the least evidence of overfitting. Thus, it is reasonable to choose model D to predict the sale price of Mario Kart games sold on eBay around October 2019.