**Additional Topics in Regression Analysis**

Chapter Goals

**After completing this chapter, you should be able to:**

§ Explain regression model-building methodology

§ Apply dummy variables for categorical variables with more than two categories

§ Explain how dummy variables can be used in experimental design models

§ Incorporate lagged values of the dependent variable as regressors

§ Describe specification bias and multicollinearity

§ Examine residuals for heteroscedasticity and autocorrelation

The Stages of Model Building

§ Understand the problem to be studied

§ Select dependent and independent variables

§ Identify model form (linear, quadratic…)

§ Determine required data for the study


Dummy Variable Models

(More than 2 Levels)

§ Dummy variables can be used in situations in which the categorical variable of interest has more than two categories

§ Dummy variables can also be useful in experimental design

§ Experimental design is used to identify possible causes of variation in the value of the dependent variable

§ Y outcomes are measured at specific combinations of levels for treatment and blocking variables

§ The goal is to determine how the different treatments influence the Y outcome

Dummy Variable Models

(More than 2 Levels)

§ Consider a categorical variable with K levels

§ The number of dummy variables needed is **one less than the number of levels, K – 1**


§ Example:

y = house price ; x_{1} = square feet

§ If style of the house is also thought to matter:

Style = ranch, split level, condo

Dummy Variable Models

(More than 2 Levels)

§ Example: Let “condo” be the default category, and let x_{2} and x_{3} be used for the other two categories:

y = house price

x_{1} = square feet

x_{2} = 1 if ranch, 0 otherwise

x_{3} = 1 if split level, 0 otherwise

The multiple regression equation is:

y = β_{0} + β_{1}x_{1} + β_{2}x_{2} + β_{3}x_{3} + ε
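A quick sketch of this K – 1 dummy coding in pandas; the prices, square footages, and style labels are made-up illustration data:

```python
import pandas as pd

# Hypothetical house data: y = price (in $1000s), x1 = square feet,
# plus a 3-level categorical style variable
df = pd.DataFrame({
    "price": [210, 245, 180, 305, 220],
    "sqft":  [1400, 1800, 1100, 2400, 1500],
    "style": ["ranch", "split_level", "condo", "ranch", "condo"],
})

# drop_first=True drops the alphabetically first level ("condo"),
# making condo the default category -- 2 dummies for 3 levels
dummies = pd.get_dummies(df["style"], prefix="style", drop_first=True)
X = pd.concat([df[["sqft"]], dummies], axis=1)
print(list(X.columns))  # sqft plus K - 1 = 2 dummy columns
```

Including all three dummies alongside an intercept would make the columns linearly dependent, which is why one level must serve as the baseline.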

Interpreting the Dummy Variable Coefficients (with 3 Levels)

§ Holding square feet constant, β_{2} is the average price difference between a ranch and a condo

§ Holding square feet constant, β_{3} is the average price difference between a split level and a condo

Experimental Design

§ Consider an experiment in which

§ four treatments will be used, and

§ the outcome also depends on an environmental factor with three levels that cannot be controlled by the experimenter

§ Let variable z_{1 }denote the treatment, where z_{1} = 1, 2, 3, or 4. Let z_{2} denote the environment factor (the “blocking variable”), where z_{2} = 1, 2, or 3

§ To model the four treatments, three dummy variables are needed

§ To model the three levels of the environmental factor, two dummy variables are needed

Experimental Design

§ Define five dummy variables, x_{1}, x_{2}, x_{3}, x_{4}, and x_{5}

§ Let treatment level 1 be the default (z_{1} = 1)

§ Define x_{1} = 1 if z_{1} = 2, x_{1} = 0 otherwise

§ Define x_{2} = 1 if z_{1} = 3, x_{2} = 0 otherwise

§ Define x_{3} = 1 if z_{1} = 4, x_{3} = 0 otherwise

§ Let environment level 1 be the default (z_{2} = 1)

§ Define x_{4} = 1 if z_{2} = 2, x_{4} = 0 otherwise

§ Define x_{5} = 1 if z_{2} = 3, x_{5} = 0 otherwise
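The five definitions above can be written directly as a small helper function (a sketch; the function name is ours):

```python
# Map treatment z1 in {1, 2, 3, 4} and environment z2 in {1, 2, 3}
# to the dummies (x1, ..., x5); level 1 is the default in each case
def design_dummies(z1, z2):
    x1 = 1 if z1 == 2 else 0
    x2 = 1 if z1 == 3 else 0
    x3 = 1 if z1 == 4 else 0
    x4 = 1 if z2 == 2 else 0
    x5 = 1 if z2 == 3 else 0
    return (x1, x2, x3, x4, x5)

print(design_dummies(3, 1))  # treatment 3, environment 1 -> (0, 1, 0, 0, 0)
```

Note that the default combination (z1 = 1, z2 = 1) maps to all zeros, so its mean is carried entirely by the intercept.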

Experimental Design:

Dummy Variable Tables

§ The dummy variable values can be summarized in a table:

| z_{1} (treatment) | x_{1} | x_{2} | x_{3} |
|---|---|---|---|
| 1 (default) | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 |

| z_{2} (environment) | x_{4} | x_{5} |
|---|---|---|
| 1 (default) | 0 | 0 |
| 2 | 1 | 0 |
| 3 | 0 | 1 |

Experimental Design Model

§ The experimental design model can be estimated using the equation

y = β_{0} + β_{1}x_{1} + β_{2}x_{2} + β_{3}x_{3} + β_{4}x_{4} + β_{5}x_{5} + ε

§ The estimated value for β_{2}, for example, shows the amount by which the y value for treatment 3 exceeds the value for treatment 1

Lagged Values of the

Dependent Variable

§ In time series models, data are collected over time (weekly, quarterly, etc.)

§ The value of y in time period t is denoted y_{t}

§ The value of y_{t} often depends on the value y_{t-1}, as well as other independent variables x_{j} :

y_{t} = β_{0} + β_{1}x_{1t} + β_{2}x_{2t} + . . . + β_{K}x_{Kt} + γy_{t-1} + ε_{t}

Interpreting Results

in Lagged Models

§ An increase of 1 unit in the independent variable x_{j} in time period t (all other variables held fixed) will lead to an expected increase in the dependent variable of

§ b_{j} in period t

§ b_{j}γ in period (t+1)

§ b_{j}γ^{2} in period (t+2)

§ b_{j}γ^{3} in period (t+3), and so on

§ The total expected increase over all current and future time periods is b_{j}/(1 – γ)

§ The coefficients b_{0}, b_{1}, . . . , b_{K}, γ are estimated by least squares in the usual manner
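A numeric check of this decay pattern and the total-multiplier formula, with hypothetical values b_j = 2 and γ = 0.5:

```python
# Hypothetical coefficient estimates; |g| < 1 so the effects die out
b_j, g = 2.0, 0.5

# Expected impact of a one-unit change in x_j, period by period:
# b_j, b_j*g, b_j*g^2, ...
effects = [b_j * g**t for t in range(50)]

total = sum(effects)            # partial sum of the geometric series
print(total, b_j / (1 - g))     # the sum converges to b_j/(1 - g) = 4
```

The 50-term partial sum already matches b_j/(1 – γ) to machine precision, which is the point of the closed-form total.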

§ Confidence intervals and hypothesis tests for the regression coefficients are computed the same as in ordinary multiple regression

§ (When the regression equation contains lagged variables, these procedures are only approximately valid. The approximation quality improves as the number of sample observations increases.)

§ Caution should be used when using confidence intervals and hypothesis tests with time series data

§ There is a possibility that the equation errors ε_{i} are no longer independent from one another

§ When errors are correlated the coefficient estimates are unbiased, but not efficient. Thus confidence intervals and hypothesis tests are no longer valid.

Specification Bias

§ Suppose an important independent variable z is omitted from a regression model

§ If z is uncorrelated with all other included independent variables, the influence of z is left unexplained and is absorbed by the error term, ε

§ But if there is any correlation between z and any of the included independent variables, some of the influence of z is captured in the coefficients of the included variables

Specification Bias

§ If some of the influence of omitted variable z is captured in the coefficients of the included independent variables, then those coefficients are biased…

§ …and the usual inferential statements from hypothesis tests or confidence intervals can be seriously misleading

§ In addition, the estimated model error will include the effect of the missing variable(s) and will therefore be larger

Multicollinearity

§ Collinearity: High correlation exists among two or more independent variables

§ This means the correlated variables contribute redundant information to the multiple regression model

Multicollinearity

§ Including two highly correlated explanatory variables can adversely affect the regression results

§ No new information provided

§ Can lead to unstable coefficients (large standard errors and low t-values)

§ Coefficient signs may not match prior expectations

Some Indications of

Strong Multicollinearity

§ Incorrect signs on the coefficients

§ Large change in the value of a previous coefficient when a new variable is added to the model

§ A previously significant variable becomes insignificant when a new independent variable is added

§ The estimate of the standard deviation of the model increases when a variable is added to the model

Detecting Multicollinearity

§ Examine the simple correlation matrix to determine if strong correlation exists between any of the model independent variables

§ Multicollinearity may be present if the model appears to explain the dependent variable well (high F statistic and low s_{e}) but the individual coefficient t statistics are insignificant
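The first check can be sketched with numpy on synthetic data, where x2 is deliberately constructed to be nearly a copy of x1:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)  # almost collinear with x1
x3 = rng.normal(size=200)                   # independent regressor

# Pairwise correlations among the regressors
corr = np.corrcoef(np.vstack([x1, x2, x3]))
print(np.round(corr, 2))
# An off-diagonal |r| near 1 (here corr[0, 1]) flags redundant information
```

The correlation matrix only catches pairwise collinearity; collinearity among three or more variables jointly can hide behind modest pairwise correlations, which is why the F-versus-t symptom above is also worth checking.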

Assumptions of Regression

§ Normality of Error

§ Error values (ε) are normally distributed for any given value of X

§ Homoscedasticity

§ The probability distribution of the errors has constant variance

§ Independence of Errors

§ Error values are statistically independent

Residual Analysis

§ The residual for observation i, e_{i} = y_{i} – ŷ_{i}, is the difference between its observed and predicted value

§ Check the assumptions of regression by examining the residuals

§ Examine for linearity assumption

§ Examine for constant variance for all levels of X (homoscedasticity)

§ Evaluate normal distribution assumption

§ Evaluate independence assumption

§ Graphical Analysis of Residuals

§ Can plot residuals vs. X
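A minimal sketch of computing the residuals to be plotted, fitting a line with numpy's polyfit to synthetic straight-line data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=50)  # true line plus noise

b1, b0 = np.polyfit(x, y, 1)        # slope, intercept of the fitted line
residuals = y - (b0 + b1 * x)       # e_i = y_i - yhat_i

# With an intercept in the model, OLS residuals sum to (numerically) zero.
# Plotting residuals against x would reveal curvature (nonlinearity) or a
# fanning spread (heteroscedasticity).
print(residuals.sum())
```

Feeding `x` and `residuals` to any plotting library gives the residual-vs-X plot the bullets above describe.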

Residual Analysis for Linearity

Residual Analysis for

Homoscedasticity

Excel Residual Output

Heteroscedasticity

§ Homoscedasticity

§ The probability distribution of the errors has constant variance

§ Heteroscedasticity

§ The error terms do not all have the same variance

§ The size of the error variances may depend on the size of the dependent variable value, for example

§ When heteroscedasticity is present

§ least squares is not the most efficient procedure to estimate regression coefficients

§ The usual procedures for deriving confidence intervals and tests of hypotheses are not valid

Tests for Heteroscedasticity

§ To test the null hypothesis that the error terms, ε_{i}, all have the same variance against the alternative that their variances depend on the expected values

§ Estimate the simple regression of the squared residuals on the predicted values:

e_{i}^{2} = a_{0} + a_{1}ŷ_{i}

§ Let R^{2} be the coefficient of determination of this new regression

§ The null hypothesis is rejected if nR^{2} is greater than χ^{2}_{1,α}

§ where χ^{2}_{1,α} is the critical value of the chi-square random variable with 1 degree of freedom and probability of error α
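A sketch of this test on synthetic data whose error variance grows with x, so the null should be rejected; the 5% chi-square(1) critical value 3.841 is hard-coded to keep the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(1, 10, n)
y = 1.0 + 2.0 * x + rng.normal(scale=x, size=n)  # error sd grows with x

# Fit the original regression and form the squared residuals
b1, b0 = np.polyfit(x, y, 1)
yhat = b0 + b1 * x
e2 = (y - yhat) ** 2

# Auxiliary simple regression of e^2 on yhat: its R^2 equals the
# squared sample correlation between e^2 and yhat
r = np.corrcoef(e2, yhat)[0, 1]
nR2 = n * r**2

crit = 3.841   # chi-square(1) upper 5% point
print(nR2 > crit)   # exceeds the critical value -> reject equal variances
```

For a simple (one-regressor) auxiliary regression, R² is exactly the squared correlation, which is what the `corrcoef` shortcut exploits.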

Autocorrelated Errors

§ Independence of Errors

§ Error values are statistically independent

§ Autocorrelated Errors

§ Residuals in one time period are related to residuals in another period

§ Autocorrelation violates a least squares regression assumption

§ Leads to s_{b} estimates that are too small (i.e., biased)

§ Thus t-values are too large and some variables may appear significant when they are not

Autocorrelation

§ Autocorrelation is correlation of the errors (residuals) over time

The Durbin-Watson Statistic

§ The Durbin-Watson statistic is used to test for autocorrelation:

d = Σ_{t=2..n} (e_{t} – e_{t-1})^{2} / Σ_{t=1..n} e_{t}^{2}

Testing for Positive Autocorrelation

§ Calculate the Durbin-Watson test statistic = d

§ d can be approximated by d = 2(1 – r) , where r is the sample correlation of successive errors

§ Find the values d_{L} and d_{U} from the Durbin-Watson table

§ (for sample size n and number of independent variables K)

§ Decision rule: reject H_{0} (conclude positive autocorrelation exists) if d < d_{L}; do not reject H_{0} if d > d_{U}; the test is inconclusive if d_{L} ≤ d ≤ d_{U}
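The statistic itself is a one-liner over the residual series (the comparison against d_L and d_U still uses the printed table):

```python
import numpy as np

def durbin_watson(e):
    """d = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2, roughly 2(1 - r)."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# A slowly varying residual series is strongly positively autocorrelated,
# so d comes out near 0; independent residuals would give d near 2, and
# sign-alternating residuals push d toward 4.
trend = np.sin(np.linspace(0, 3, 100))
print(durbin_watson(trend))   # close to 0
```

Because successive values of `trend` barely change, each squared difference in the numerator is tiny relative to the denominator, which is exactly the d ≈ 2(1 – r) approximation with r near 1.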

Negative Autocorrelation

§ Negative autocorrelation exists if successive errors are negatively correlated

§ This can occur if successive errors alternate in sign

§ To test for negative autocorrelation, compare (4 – d) with d_{L} and d_{U}

Testing for Positive Autocorrelation

§ Example: n = 25 observations, with K = 1 independent variable

§ Using the Durbin-Watson table, d_{L} = 1.29 and d_{U} = 1.45

§ d = 1.00494 < d_{L} = 1.29, so reject H_{0} and conclude that significant positive autocorrelation exists

§ Therefore the linear model is not the appropriate model to forecast sales

Dealing with Autocorrelation

§ Suppose that we want to estimate the coefficients of the regression model

y_{t} = β_{0} + β_{1}x_{1t} + . . . + β_{K}x_{Kt} + ε_{t}

where the error term ε_{t} is autocorrelated

§ Two steps:

(i) Estimate the model by least squares, obtaining the Durbin-Watson statistic d, and then estimate the autocorrelation parameter using r = 1 – d/2

Dealing with Autocorrelation

(ii) Estimate by least squares a second regression with

§ dependent variable (y_{t} – ry_{t-1})

§ independent variables (x_{1t} – rx_{1,t-1}) , (x_{2t} – rx_{2,t-1}) , . . . , (x_{Kt} – rx_{K,t-1})

§ The parameters b_{1}, b_{2}, . . . , b_{K} are estimated regression coefficients from the second model

§ An estimate of b_{0} is obtained by dividing the estimated intercept for the second model by (1-r)

§ Hypothesis tests and confidence intervals for the regression coefficients can be carried out using the output from the second model
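The two steps can be sketched for a single regressor with numpy; the data are synthetic, with an AR(1) error (ρ = 0.7) simulated so the correction has something to fix:

```python
import numpy as np

def two_step(y, x):
    # Step (i): OLS fit, Durbin-Watson d, then r = 1 - d/2
    b1, b0 = np.polyfit(x, y, 1)
    e = y - (b0 + b1 * x)
    d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
    r = 1 - d / 2

    # Step (ii): OLS on the quasi-differenced series
    y_star = y[1:] - r * y[:-1]
    x_star = x[1:] - r * x[:-1]
    g1, g0 = np.polyfit(x_star, y_star, 1)

    # Slope carries over directly; recover b0 by dividing the second
    # model's intercept by (1 - r)
    return g0 / (1 - r), g1, r

# Synthetic data: y = 1 + 2x with AR(1) errors, rho = 0.7
rng = np.random.default_rng(3)
n = 300
x = rng.uniform(0, 10, n)
u = rng.normal(scale=0.5, size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + u[t]
y = 1.0 + 2.0 * x + e

b0_hat, b1_hat, r_hat = two_step(y, x)
print(b1_hat)   # slope estimate near the true value 2
```

With K regressors the same quasi-differencing is applied to every column of X; only the intercept needs the extra (1 – r) adjustment.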

Chapter Summary

§ Discussed regression model building

§ Introduced dummy variables for more than two categories and for experimental design

§ Used lagged values of the dependent variable as regressors

§ Discussed specification bias and multicollinearity

§ Described heteroscedasticity

§ Defined autocorrelation and used the Durbin-Watson test to detect positive and negative autocorrelation