Additional Topics in Regression Analysis

Chapter Goals

After completing this chapter, you should be able to:

§     Explain regression model-building methodology

§     Apply dummy variables for categorical variables with more than two categories

§     Explain how dummy variables can be used in experimental design models

§     Incorporate lagged values of the dependent variable as regressors

§     Describe specification bias and multicollinearity

§     Examine residuals for heteroscedasticity and autocorrelation

The Stages of Model Building

§      Understand the problem to be studied

§      Select dependent and independent variables

§      Identify model form (linear, quadratic…)

§      Determine required data for the study

Dummy Variable Models
(More than 2 Levels)

§     Dummy variables can be used in situations in which the categorical variable of interest has more than two categories

§     Dummy variables can also be useful in experimental design

§    Experimental design is used to identify possible causes of variation in the value of the dependent variable

§    Y outcomes are measured at specific combinations of levels for treatment and blocking variables

§    The goal is to determine how the different treatments influence the Y outcome

Dummy Variable Models
(More than 2 Levels)

§    Consider a categorical variable with K levels

§    The number of dummy variables needed is one less than the number of levels, K – 1

§    Example:

y = house price ;  x1 = square feet

§    If style of the house is also thought to matter:

Style = ranch,  split level,  condo

Dummy Variable Models
(More than 2 Levels)

§     Example: Let “condo” be the default category, and let x2 and x3 be used for the other two categories:

y = house price

x1 = square feet

x2 = 1 if ranch, 0 otherwise

x3 = 1 if split level, 0 otherwise

The multiple regression equation is:

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3$

Interpreting the Dummy Variable Coefficients (with 3 Levels)
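
A minimal sketch of how the coefficients on x2 and x3 can be read, assuming the house-price model above with "condo" as the default category. The coefficient values are hypothetical and chosen only for illustration; they are not estimates from the source.

```python
# Hypothetical coefficient estimates (price in $1000s); not from the source.
b0, b1, b2, b3 = 20.0, 0.045, 23.5, 18.8

def dummies(style):
    """Return (x2, x3), with condo as the default category."""
    if style not in ("ranch", "split level", "condo"):
        raise ValueError(f"unknown style: {style}")
    return (1 if style == "ranch" else 0,
            1 if style == "split level" else 0)

def predicted_price(square_feet, style):
    """Evaluate y-hat = b0 + b1*x1 + b2*x2 + b3*x3."""
    x2, x3 = dummies(style)
    return b0 + b1 * square_feet + b2 * x2 + b3 * x3

sqft = 2000  # same size for all three styles
condo = predicted_price(sqft, "condo")
ranch = predicted_price(sqft, "ranch")
split = predicted_price(sqft, "split level")

print(ranch - condo)  # equals b2, up to floating-point rounding
print(split - condo)  # equals b3, up to floating-point rounding
```

With square feet held fixed, b2 is the expected price difference between a ranch and a condo, and b3 is the expected difference between a split level and a condo.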

Experimental Design

§     Consider an experiment in which

§    four treatments will be used, and

§    the outcome also depends on an environmental factor with three levels that cannot be controlled by the experimenter

§     Let variable z1 denote the treatment, where z1 = 1, 2, 3, or 4.  Let z2 denote the environment factor (the “blocking variable”), where z2 = 1, 2, or 3

§     To model the four treatments, three dummy variables are needed

§     To model the three levels of the environmental factor, two dummy variables are needed

Experimental Design

§ Define five dummy variables, x1, x2, x3, x4, and x5

§    Let treatment level 1 be the default (z1 = 1)

§    Define x1 = 1 if z1 = 2, x1 = 0 otherwise

§    Define x2 = 1 if z1 = 3, x2 = 0 otherwise

§    Define x3 = 1 if z1 = 4, x3 = 0 otherwise

§    Let environment level 1 be the default (z2 = 1)

§    Define x4 = 1 if z2 = 2, x4 = 0 otherwise

§    Define x5 = 1 if z2 = 3, x5 = 0 otherwise

Experimental Design:
Dummy Variable Tables

§    The dummy variable values can be summarized in a table:

Treatment (z1)   x1   x2   x3
      1           0    0    0
      2           1    0    0
      3           0    1    0
      4           0    0    1

Environment (z2)   x4   x5
      1             0    0
      2             1    0
      3             0    1

Experimental Design Model

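§    The experimental design model can be estimated using the equation

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \beta_5 x_{5i} + \varepsilon_i$

§    The estimated value for  $\beta_2$ , for example, shows the amount by which the  y  value for treatment 3 exceeds the value for treatment 1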
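
The coding scheme defined above can be sketched as a small function. The name `design_dummies` is hypothetical; the mapping itself follows the definitions of x1 through x5, with level 1 as the default for both the treatment and the environment.

```python
def design_dummies(z1, z2):
    """Map treatment z1 in {1,2,3,4} and environment z2 in {1,2,3}
    to (x1, x2, x3, x4, x5); level 1 is the default in both cases."""
    if z1 not in (1, 2, 3, 4) or z2 not in (1, 2, 3):
        raise ValueError("z1 must be 1-4 and z2 must be 1-3")
    treatment = [1 if z1 == level else 0 for level in (2, 3, 4)]  # x1, x2, x3
    environment = [1 if z2 == level else 0 for level in (2, 3)]   # x4, x5
    return tuple(treatment + environment)

# Print one design-matrix row for every treatment/environment combination.
for z1 in (1, 2, 3, 4):
    for z2 in (1, 2, 3):
        print(z1, z2, design_dummies(z1, z2))
```

Each printed row is one line of the design matrix; the default combination (z1 = 1, z2 = 1) has all five dummies equal to zero.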

Lagged Values of the
Dependent Variable

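§    In time series models, data are collected over time (weekly, quarterly, etc.)

§    The value of  y  in time period  t  is denoted  $y_t$

§    The value of  $y_t$  often depends on the value  $y_{t-1}$, as well as on other independent variables $x_j$ :

$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \cdots + \beta_K x_{Kt} + \gamma y_{t-1} + \varepsilon_t$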

Interpreting Results
in Lagged Models

§           An increase of 1 unit in the independent variable $x_j$ in time period t (all other variables held fixed) will lead to an expected increase in the dependent variable of

§         $\beta_j$ in period t

§         $\beta_j \gamma$ in period (t+1)

§         $\beta_j \gamma^2$ in period (t+2)

§         $\beta_j \gamma^3$ in period (t+3)    and so on

§           The total expected increase over all current and future time periods is $\beta_j / (1 - \gamma)$, provided $|\gamma| < 1$

§           The coefficients $\beta_0, \beta_1, \ldots, \beta_K, \gamma$ are estimated by least squares in the usual manner

§           Confidence intervals and hypothesis tests for the regression coefficients are computed the same as in ordinary multiple regression

§          (When the regression equation contains lagged variables, these procedures are only approximately valid.  The approximation quality improves as the number of sample observations increases.)

§           Caution should be used when using confidence intervals and hypothesis tests with time series data

§         There is a possibility that the equation errors $\varepsilon_t$ are no longer independent of one another.

§         When errors are correlated, the coefficient estimates are unbiased but not efficient, and the usual confidence intervals and hypothesis tests are no longer valid.
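
The period-by-period effects listed above can be checked numerically. The sketch below assumes a model with one independent variable, hypothetical parameter values, and the error term set to zero; it compares a baseline path with a path in which x is one unit higher in a single period.

```python
# Hypothetical parameter values; |gamma| < 1 so the effects die out.
beta0, beta_j, gamma = 1.0, 0.5, 0.6

def simulate(x_path, y_start):
    """Iterate y_t = beta0 + beta_j * x_t + gamma * y_{t-1} (no error term)."""
    y_prev, path = y_start, []
    for x in x_path:
        y_prev = beta0 + beta_j * x + gamma * y_prev
        path.append(y_prev)
    return path

periods = 60
baseline = [2.0] * periods
shocked = [3.0] + [2.0] * (periods - 1)  # x is one unit higher in period t only

y_start = 5.0
effects = [s - b for s, b in zip(simulate(shocked, y_start),
                                 simulate(baseline, y_start))]

print(effects[:4])                         # close to beta_j, beta_j*gamma, beta_j*gamma**2, ...
print(sum(effects), beta_j / (1 - gamma))  # the two totals agree closely
```

The differences between the two paths shrink geometrically, and their sum approaches the total effect given by the formula above.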

Specification Bias

§    Suppose an important independent variable  z  is omitted from a regression model

§    If  z  is uncorrelated with all other included independent variables, the influence of  z  is left unexplained and is absorbed by the error term,  $\varepsilon$

§    But if there is any correlation between  z  and any of the included independent variables, some of the influence of  z  is captured in the coefficients of the included variables

Specification Bias

§    If some of the influence of omitted variable  z  is captured in the coefficients of the included independent variables, then those coefficients are biased…

§    …and the usual inferential statements from hypothesis tests or confidence intervals can be seriously misleading

§    In addition, the estimated model error will include the effect of the missing variable(s) and will therefore be larger
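
A small simulation can illustrate the bias. The sketch below uses made-up data in which the true coefficient on x is 2 and the omitted variable z (true coefficient 3) is correlated with x. All names and values are assumptions for illustration.

```python
import random

def slope(x, y):
    """Least squares slope from a simple regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(0)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
z = [0.5 * xi + random.gauss(0, 1) for xi in x]   # z is correlated with x
y = [1 + 2 * xi + 3 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

b_x_short = slope(x, y)   # regression of y on x alone, omitting z
delta = slope(x, z)       # how strongly z moves with x

print(b_x_short)          # lands near 2 + 3 * delta rather than near the true value 2
print(2 + 3 * delta)
```

The coefficient on x picks up part of the influence of z: its estimate is pushed away from 2 by roughly 3 times the slope of z on x.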


Multicollinearity

§    Collinearity:  High correlation exists among two or more independent variables

§    This means the correlated variables contribute redundant information to the multiple regression model


§    Including two highly correlated explanatory variables can adversely affect the regression results

§   No new information provided

§   Can lead to unstable coefficients (large standard errors and low t-values)

§   Coefficient signs may not match prior expectations

Some Indications of
Strong Multicollinearity

§     Incorrect signs on the coefficients

§     Large change in the value of a previous coefficient when a new variable is added to the model

§     A previously significant variable becomes insignificant when a new independent variable is added

§     The estimate of the standard deviation of the model increases when a variable is added to the model

Detecting Multicollinearity

§     Examine the simple correlation matrix to determine if strong correlation exists between any of the model's independent variables

§     Multicollinearity may be present if the model appears to explain the dependent variable well (high  F  statistic and low  $s_e$ ) but the individual coefficient  t  statistics are insignificant
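
The correlation-matrix check can be sketched as follows. The data are hypothetical (x2 is built to be close to 2 times x1), and the 0.9 cutoff is an illustrative assumption, not a rule from the source.

```python
from itertools import combinations
from math import sqrt

def pearson(a, b):
    """Sample correlation coefficient between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / sqrt(saa * sbb)

# Hypothetical independent variables.
columns = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2.1, 3.9, 6.2, 7.8, 10.1, 11.9],  # nearly 2 * x1
    "x3": [5, 1, 4, 2, 6, 3],
}

threshold = 0.9  # illustrative cutoff only
flagged = []
for p, q in combinations(columns, 2):
    r = pearson(columns[p], columns[q])
    print(p, q, round(r, 3))
    if abs(r) > threshold:
        flagged.append((p, q))

print(flagged)  # pairs whose correlation exceeds the cutoff
```

Only the pair built to be nearly proportional is flagged. A pairwise check of this kind cannot reveal a variable that is close to a combination of several others, so it is a first screen rather than a complete diagnostic.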

Assumptions of Regression

§    Normality of Error

§    Error values ($\varepsilon$) are normally distributed for any given value of  X

§    Homoscedasticity

§    The probability distribution of the errors has constant variance

§    Independence of Errors

§    Error values are statistically independent

Residual Analysis

§     The residual for observation i, $e_i$, is the difference between its observed and predicted values: $e_i = y_i - \hat{y}_i$

§     Check the assumptions of regression by examining the residuals

§    Examine for linearity assumption

§    Examine for constant variance for all levels of X (homoscedasticity)

§    Evaluate normal distribution assumption

§    Evaluate independence assumption

§     Graphical Analysis of Residuals

§    Can plot residuals vs. X

[Figure: Residual Analysis for Linearity]

[Figure: Residual Analysis for Homoscedasticity]

[Figure: Excel Residual Output]


§    Homoscedasticity

§    The probability distribution of the errors has constant variance

§    Heteroscedasticity

§    The error terms do not all have the same variance

§    The size of the error variances may depend on the size of the dependent variable value, for example

§    When heteroscedasticity is present

§    Least squares is not the most efficient procedure to estimate regression coefficients

§    The usual procedures for deriving confidence intervals and tests of hypotheses are not valid

Tests for Heteroscedasticity

§     To test the null hypothesis that the error terms, $\varepsilon_i$, all have the same variance against the alternative that their variances depend on the expected values:

§     Estimate the simple regression

$e_i^2 = a_0 + a_1 \hat{y}_i$

where $e_i$ are the residuals and $\hat{y}_i$ the predicted values from the original regression

§     Let $R^2$ be the coefficient of determination of this new regression

The null hypothesis is rejected if  $nR^2$  is greater than  $\chi^2_{1,\alpha}$

§    where  n  is the number of observations and  $\chi^2_{1,\alpha}$  is the critical value of the chi-square random variable with 1 degree of freedom and probability of error $\alpha$
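
The test above can be sketched in a few lines. The example assumes a simple regression on simulated data whose error spread grows with x, and it uses the standard 5% chi-square critical value with 1 degree of freedom (3.841). The data and helper names are hypothetical.

```python
import random

def simple_ols(x, y):
    """Return (intercept, slope) from least squares of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - b1 * mx, b1

def r_squared(x, y):
    """Coefficient of determination for the simple regression of y on x."""
    b0, b1 = simple_ols(x, y)
    my = sum(y) / len(y)
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return 1 - sse / sst

random.seed(1)
n = 200
x = [float(i) for i in range(1, n + 1)]
y = [2 + 3 * xi + random.gauss(0, 0.5 * xi) for xi in x]  # error spread grows with x

# Original regression: residuals and predicted values.
b0, b1 = simple_ols(x, y)
fitted = [b0 + b1 * xi for xi in x]
squared_resid = [(yi - fi) ** 2 for yi, fi in zip(y, fitted)]

# Auxiliary regression of squared residuals on predicted values.
n_r2 = n * r_squared(fitted, squared_resid)
CHI2_1_05 = 3.841  # chi-square critical value, 1 df, alpha = 0.05

print(n_r2, n_r2 > CHI2_1_05)  # True means reject equal variances
```

For a model with several independent variables, only the original regression changes; the auxiliary regression still has the predicted values as its single regressor.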

Autocorrelated Errors

§    Independence of Errors

§    Error values are statistically independent

§    Autocorrelated Errors

§    Residuals in one time period are related to residuals in another period

§    Autocorrelation violates a least squares regression assumption

§    Leads to  $s_b$  estimates that are too small (i.e., biased)

§    Thus t-values are too large and some variables may appear significant when they are not


§    Autocorrelation is correlation of the errors (residuals) over time

The Durbin-Watson Statistic

§     The Durbin-Watson statistic is used to test for autocorrelation

$d = \dfrac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$

Testing for Positive Autocorrelation

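§      Calculate the Durbin-Watson test statistic = d

§     d can be approximated by  d = 2(1 – r) , where r is the sample correlation of successive errors

§      Find the values $d_L$ and $d_U$ from the Durbin-Watson table

§     (for sample size n and number of independent variables K)

§      Decision rule: reject H0 (no autocorrelation) if d < $d_L$; do not reject H0 if d > $d_U$; the test is inconclusive if $d_L$ ≤ d ≤ $d_U$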

Negative Autocorrelation

§     Negative autocorrelation exists if successive errors are negatively correlated

§    This can occur if successive errors alternate in sign

Testing for Positive Autocorrelation

§     Example with  n = 25:

Testing for Positive Autocorrelation

§     Here, n = 25 and there is K = 1 independent variable

§     Using the Durbin-Watson table, $d_L$ = 1.29  and  $d_U$ = 1.45

§     d = 1.00494 < $d_L$ = 1.29, so reject H0 and conclude that significant positive autocorrelation exists

§     Therefore the linear model is not the appropriate model to forecast sales
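
A minimal sketch of the calculation and the decision rule for positive autocorrelation follows. The residual series is hypothetical and is used only to show how d is computed. The values d = 1.00494, dL = 1.29 and dU = 1.45 are the ones quoted in the example above.

```python
def durbin_watson(residuals):
    """d = sum of squared successive differences / sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

def positive_autocorrelation_test(d, d_lower, d_upper):
    """Apply the Durbin-Watson decision rule for positive autocorrelation."""
    if d < d_lower:
        return "reject H0: positive autocorrelation"
    if d > d_upper:
        return "do not reject H0"
    return "inconclusive"

# Hypothetical residuals that drift slowly from positive to negative.
residuals = [1.0, 1.5, 1.2, 0.8, 0.3, -0.4, -0.9, -1.3, -1.0, -0.5]
d_example = durbin_watson(residuals)
print(d_example)  # well below 2, as expected for slowly drifting residuals

# Values quoted in the n = 25 example.
print(positive_autocorrelation_test(1.00494, 1.29, 1.45))
```

Residuals that change sign rarely give small successive differences and hence a small d; residuals that alternate in sign push d above 2.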

Dealing with Autocorrelation

§     Suppose that we want to estimate the coefficients of the regression model

$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \cdots + \beta_k x_{kt} + \varepsilon_t$

where the error term  $\varepsilon_t$  is autocorrelated

§     Two steps:

(i) Estimate the model by least squares, obtaining the Durbin-Watson statistic, d, and then estimate the autocorrelation parameter using

$r = 1 - \dfrac{d}{2}$

Dealing with Autocorrelation

(ii) Estimate by least squares a second regression with

§    dependent variable  $(y_t - r y_{t-1})$

§    independent variables $(x_{1t} - r x_{1,t-1})$ , $(x_{2t} - r x_{2,t-1})$ , . . ., $(x_{kt} - r x_{k,t-1})$

§     The parameters $\beta_1, \beta_2, \ldots, \beta_k$ are estimated by the regression coefficients from the second model

§     An estimate of  $\beta_0$  is obtained by dividing the estimated intercept for the second model by (1 – r)

§     Hypothesis tests and confidence intervals for the regression coefficients can be carried out using the output from the second model
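
The two steps above can be sketched for a model with a single independent variable. The data are simulated with hypothetical true values (intercept 2, slope 3, first-order autocorrelated errors), so the output only illustrates the mechanics of the procedure.

```python
import random

def simple_ols(x, y):
    """Return (intercept, slope) from least squares of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - b1 * mx, b1

def durbin_watson(e):
    """d = sum of squared successive differences / sum of squared residuals."""
    return (sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
            / sum(v ** 2 for v in e))

random.seed(2)
n, rho = 200, 0.7
x = [random.uniform(0, 10) for _ in range(n)]
eps, prev = [], 0.0
for _ in range(n):
    prev = rho * prev + random.gauss(0, 1)  # autocorrelated errors
    eps.append(prev)
y = [2 + 3 * xi + ei for xi, ei in zip(x, eps)]

# Step (i): least squares on the original data, then r = 1 - d/2.
a0, a1 = simple_ols(x, y)
resid = [yi - (a0 + a1 * xi) for xi, yi in zip(x, y)]
r = 1 - durbin_watson(resid) / 2

# Step (ii): least squares on the transformed variables.
y_star = [y[t] - r * y[t - 1] for t in range(1, n)]
x_star = [x[t] - r * x[t - 1] for t in range(1, n)]
c0, b1 = simple_ols(x_star, y_star)
b0 = c0 / (1 - r)  # recover the original intercept

print(r, b0, b1)
```

The transformation uses up the first observation, so the second regression is fitted on n – 1 data points.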

Chapter Summary

§     Discussed regression model building

§     Introduced dummy variables for more than two categories and for experimental design

§     Used lagged values of the dependent variable as regressors

§     Discussed specification bias and multicollinearity

§     Described heteroscedasticity

§     Defined autocorrelation and used the Durbin-Watson test to detect positive and negative autocorrelation