Additional Topics in Regression Analysis
Chapter Goals
After completing this chapter, you should be able to:
§ Explain regression model-building methodology
§ Apply dummy variables for categorical variables with more than two categories
§ Explain how dummy variables can be used in experimental design models
§ Incorporate lagged values of the dependent variable as regressors
§ Describe specification bias and multicollinearity
§ Examine residuals for heteroscedasticity and autocorrelation
The Stages of Model Building
§ Understand the problem to be studied
§ Select dependent and independent variables
§ Identify model form (linear, quadratic…)
§ Determine required data for the study
Dummy Variable Models
(More than 2 Levels)
§ Dummy variables can be used when the categorical variable of interest has more than two categories
§ Dummy variables can also be useful in experimental design
§ Experimental design is used to identify possible causes of variation in the value of the dependent variable
§ Y outcomes are measured at specific combinations of levels for treatment and blocking variables
§ The goal is to determine how the different treatments influence the Y outcome
Dummy Variable Models
(More than 2 Levels)
§ Consider a categorical variable with K levels
§ The number of dummy variables needed is one less than the number of levels, K – 1
§ Example:
y = house price; x1 = square feet
§ If style of the house is also thought to matter:
Style = ranch, split level, condo
Dummy Variable Models
(More than 2 Levels)
§ Example: Let “condo” be the default category, and let x2 and x3 be used for the other two categories:
y = house price
x1 = square feet
x2 = 1 if ranch, 0 otherwise
x3 = 1 if split level, 0 otherwise
The multiple regression equation is:
ŷ = b0 + b1x1 + b2x2 + b3x3
Interpreting the Dummy Variable Coefficients (with 3 Levels)
§ For a condo (x2 = 0, x3 = 0): ŷ = b0 + b1x1
§ For a ranch (x2 = 1, x3 = 0): ŷ = b0 + b1x1 + b2
§ For a split level (x2 = 0, x3 = 1): ŷ = b0 + b1x1 + b3
§ Holding square feet fixed, b2 estimates the average price difference between a ranch and a condo, and b3 the average difference between a split level and a condo
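A minimal sketch of this model in Python (hypothetical data; the pandas and statsmodels libraries are assumed):

```python
# Sketch: house-price model with K - 1 = 2 dummies for the 3 styles,
# "condo" as the default. Hypothetical data; pandas/statsmodels assumed.
import pandas as pd
import statsmodels.api as sm

houses = pd.DataFrame({
    "price": [245, 312, 279, 308, 199, 219],          # y, $1000s (made up)
    "sqft":  [1400, 2100, 1600, 2000, 1100, 1300],    # x1
    "style": ["ranch", "split", "condo", "ranch", "condo", "split"],
})

houses["x2"] = (houses["style"] == "ranch").astype(int)   # x2 = 1 if ranch
houses["x3"] = (houses["style"] == "split").astype(int)   # x3 = 1 if split level

X = sm.add_constant(houses[["sqft", "x2", "x3"]])
model = sm.OLS(houses["price"], X).fit()
print(model.params)    # estimates of b0, b1, b2, b3
```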
Experimental Design
§ Consider an experiment in which
§ four treatments will be used, and
§ the outcome also depends on an environmental factor, with three levels, that cannot be controlled by the experimenter
§ Let variable z1 denote the treatment, where z1 = 1, 2, 3, or 4. Let z2 denote the environmental factor (the “blocking variable”), where z2 = 1, 2, or 3
§ To model the four treatment levels, three dummy variables are needed
§ To model the three environmental levels, two dummy variables are needed
Experimental Design
§ Define five dummy variables, x1, x2, x3, x4, and x5
§ Let treatment level 1 be the default (z1 = 1)
§ Define x1 = 1 if z1 = 2, x1 = 0 otherwise
§ Define x2 = 1 if z1 = 3, x2 = 0 otherwise
§ Define x3 = 1 if z1 = 4, x3 = 0 otherwise
§ Let environment level 1 be the default (z2 = 1)
§ Define x4 = 1 if z2 = 2, x4 = 0 otherwise
§ Define x5 = 1 if z2 = 3, x5 = 0 otherwise
Experimental Design:
Dummy Variable Tables
§ The dummy variable values can be summarized in a table:

Treatment (z1)   x1  x2  x3        Block (z2)   x4  x5
z1 = 1            0   0   0        z2 = 1        0   0
z1 = 2            1   0   0        z2 = 2        1   0
z1 = 3            0   1   0        z2 = 3        0   1
z1 = 4            0   0   1
Experimental Design Model
§ The experimental design model can be estimated using the equation
y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + ε
§ The estimated value of β2, for example, shows the amount by which the y value for treatment 3 exceeds the value for treatment 1
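A minimal sketch (plain Python; design_dummies is a hypothetical helper name) of how the five dummies map from z1 and z2:

```python
# Sketch: map treatment z1 (levels 1-4) and blocking variable z2 (levels
# 1-3) to the five dummies, with level 1 as the default in each case.
def design_dummies(z1: int, z2: int) -> tuple:
    x1, x2, x3 = int(z1 == 2), int(z1 == 3), int(z1 == 4)
    x4, x5 = int(z2 == 2), int(z2 == 3)
    return (x1, x2, x3, x4, x5)

print(design_dummies(3, 2))   # treatment 3, environment 2 -> (0, 1, 0, 1, 0)
```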
Lagged Values of the
Dependent Variable
§ In time series models, data are collected over time (weekly, quarterly, etc.)
§ The value of y in time period t is denoted yt
§ The value of yt often depends on the value yt-1, as well as on other independent variables xj:
yt = β0 + β1x1t + β2x2t + … + βKxKt + γyt-1 + εt
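A minimal sketch of fitting such a model (simulated data; numpy, pandas, and statsmodels are assumed):

```python
# Sketch: regression with one exogenous x and a lagged dependent variable.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    # simulated model: y_t = 2 + 0.5*x_t + 0.6*y_{t-1} + e_t
    y[t] = 2 + 0.5 * x[t] + 0.6 * y[t - 1] + rng.normal(scale=0.5)

df = pd.DataFrame({"y": y, "x": x})
df["y_lag1"] = df["y"].shift(1)          # lagged dependent variable y_{t-1}
df = df.dropna()                         # first row has no lagged value

X = sm.add_constant(df[["x", "y_lag1"]])
print(sm.OLS(df["y"], X).fit().params)   # estimates of beta_0, beta_j, gamma
```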
Interpreting Results
in Lagged Models
§ An increase of 1 unit in the independent variable xj in time period t (all other variables held fixed) leads to an expected increase in the dependent variable of
§ bj in period t
§ bjγ in period (t + 1)
§ bjγ² in period (t + 2)
§ bjγ³ in period (t + 3), and so on
§ The total expected increase over all current and future time periods is bj/(1 – γ)
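§ For example (hypothetical numbers), if bj = 2 and γ = 0.5, the expected increases are 2, 1, 0.5, 0.25, …, for a total of 2/(1 – 0.5) = 4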
§ The coefficients β0, β1, . . . , βK, γ are estimated by least squares in the usual manner
§ Confidence intervals and hypothesis tests for the regression coefficients are computed the same as in ordinary multiple regression
§ (When the regression equation contains lagged variables, these procedures are only approximately valid. The approximation quality improves as the number of sample observations increases.)
§ Caution should be used when using confidence intervals and hypothesis tests with time series data
§ There is a possibility that the equation errors ei are no longer independent from one another.
§ When errors are correlated the coefficient estimates are unbiased, but not efficient. Thus confidence intervals and hypothesis tests are no longer valid.
Specification Bias
§ Suppose an important independent variable z is omitted from a regression model
§ If z is uncorrelated with all other included independent variables, the influence of z is left unexplained and is absorbed by the error term, ε
§ But if there is any correlation between z and any of the included independent variables, some of the influence of z is captured in the coefficients of the included variables
Specification Bias
§ If some of the influence of omitted variable z is captured in the coefficients of the included independent variables, then those coefficients are biased…
§ …and the usual inferential statements from hypothesis test or confidence intervals can be seriously misleading
§ In addition, the estimated model error will include the effect of the missing variable(s) and will therefore be larger
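A minimal simulation of this effect (made-up coefficients; numpy and statsmodels are assumed):

```python
# Sketch: omitted-variable bias. True model y = 1 + 2x + 3z + u with z
# correlated with x; dropping z biases the coefficient on x.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)     # z is correlated with x
y = 1 + 2 * x + 3 * z + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
short = sm.OLS(y, sm.add_constant(x)).fit()

print(full.params[1])    # close to 2: unbiased when z is included
print(short.params[1])   # close to 2 + 3*0.8 = 4.4: biased when z is omitted
```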
Multicollinearity
§ Collinearity: High correlation exists among two or more independent variables
§ This means the correlated variables contribute redundant information to the multiple regression model
Multicollinearity
§ Including two highly correlated explanatory variables can adversely affect the regression results
§ No new information provided
§ Can lead to unstable coefficients (large standard errors and low t-values)
§ Coefficient signs may not match prior expectations
Some Indications of
Strong Multicollinearity
§ Incorrect signs on the coefficients
§ Large change in the value of a previous coefficient when a new variable is added to the model
§ A previously significant variable becomes insignificant when a new independent variable is added
§ The estimate of the standard deviation of the model increases when a variable is added to the model
Detecting Multicollinearity
§ Examine the simple correlation matrix to determine if strong correlation exists between any of the model independent variables
§ Multicollinearity may be present if the model appears to explain the dependent variable well (high F statistic and low se ) but the individual coefficient t statistics are insignificant
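A minimal sketch of the correlation-matrix check (simulated data; numpy and pandas are assumed):

```python
# Sketch: scan the correlation matrix of candidate regressors for
# near-collinear pairs.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)

df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
print(df.corr().round(2))   # |r| near 1 between regressors flags collinearity
```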
Assumptions of Regression
§ Normality of Error
§ Error values (ε) are normally distributed for any given value of X
§ Homoscedasticity
§ The probability distribution of the errors has constant variance
§ Independence of Errors
§ Error values are statistically independent
Residual Analysis
§ The residual for observation i is the difference between its observed and predicted values: ei = yi – ŷi
§ Check the assumptions of regression by examining the residuals
§ Examine for linearity assumption
§ Examine for constant variance for all levels of X (homoscedasticity)
§ Evaluate normal distribution assumption
§ Evaluate independence assumption
§ Graphical Analysis of Residuals
§ Can plot residuals vs. X
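A minimal sketch of a residual-vs-X plot (simulated data; numpy, statsmodels, and matplotlib are assumed):

```python
# Sketch: plot residuals against X to check linearity and constant variance.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 3 + 2 * x + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(x, fit.resid)        # e_i = y_i - yhat_i versus x
plt.axhline(0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()                       # a patternless band supports the assumptions
```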
Residual Analysis for Linearity
[residual-vs-x plots omitted: a random scatter is consistent with linearity, a curved pattern suggests nonlinearity]
Residual Analysis for Homoscedasticity
[residual-vs-x plots omitted: a constant spread is consistent with homoscedasticity, a fan shape suggests heteroscedasticity]
Excel Residual Output
[Excel residual output table omitted]
Heteroscedasticity
§ Homoscedasticity
§ The probability distribution of the errors has constant variance
§ Heteroscedasticity
§ The error terms do not all have the same variance
§ The size of the error variances may depend on the size of the dependent variable value, for example
§ When heteroscedasticity is present
§ least squares is not the most efficient procedure to estimate regression coefficients
§ The usual procedures for deriving confidence intervals and tests of hypotheses are not valid
Tests for Heteroscedasticity
§ To test the null hypothesis that the error terms εi all have the same variance against the alternative that their variances depend on the expected values:
§ Estimate the simple regression of the squared residuals on the fitted values:
ei² = a0 + a1ŷi
§ Let R² be the coefficient of determination of this new regression
§ The null hypothesis is rejected if nR² is greater than χ²1,α
§ where χ²1,α is the critical value of the chi-square random variable with 1 degree of freedom and probability of error α
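A minimal sketch of this test (simulated heteroscedastic data; numpy, scipy, and statsmodels are assumed):

```python
# Sketch: regress squared residuals on fitted values, then compare
# n*R^2 with the chi-square (1 df) critical value.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(1, 10, size=n)
y = 5 + 2 * x + rng.normal(scale=x)   # error spread grows with x

fit = sm.OLS(y, sm.add_constant(x)).fit()
aux = sm.OLS(fit.resid ** 2, sm.add_constant(fit.fittedvalues)).fit()

stat = n * aux.rsquared               # test statistic n*R^2
crit = stats.chi2.ppf(0.95, df=1)     # critical value at alpha = 0.05
print(stat, crit, stat > crit)        # True -> reject equal variances
```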
Autocorrelated Errors
§ Independence of Errors
§ Error values are statistically independent
§ Autocorrelated Errors
§ Residuals in one time period are related to residuals in another period
§ Autocorrelation violates a least squares regression assumption
§ Leads to sb estimates that are too small (i.e., biased)
§ Thus t-values are too large and some variables may appear significant when they are not
Autocorrelation
§ Autocorrelation is correlation of the errors (residuals) over time
The Durbin-Watson Statistic
§ The Durbin-Watson statistic is used to test for autocorrelation:
d = Σ(et – et-1)² / Σet²
(numerator summed over t = 2, …, n; denominator over t = 1, …, n)
Testing for Positive Autocorrelation
§ Calculate the Durbin-Watson test statistic d
§ d can be approximated by d ≈ 2(1 – r), where r is the sample correlation of successive errors
§ Find the values dL and dU from the Durbin-Watson table
§ (for sample size n and number of independent variables K)
§ Reject H0 (conclude positive autocorrelation) if d < dL; do not reject H0 if d > dU; the test is inconclusive if dL ≤ d ≤ dU
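A minimal sketch of computing d from a residual series (numpy assumed; the residual values are made up):

```python
# Sketch: compute the Durbin-Watson statistic from residuals in time order.
import numpy as np

def durbin_watson(e):
    # d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

e = np.array([0.5, 0.7, 0.6, 0.9, 1.1, 0.8, 1.0])   # hypothetical residuals
print(durbin_watson(e))   # d near 2 suggests no autocorrelation; d < dL -> positive
```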
Negative Autocorrelation
§ Negative autocorrelation exists if successive errors are negatively correlated
§ This can occur if successive errors alternate in sign
Testing for Positive Autocorrelation
§ Example with n = 25:
[time-series plot of the sales data and regression output omitted]
Testing for Positive Autocorrelation
§ Here, n = 25 and there is K = 1 independent variable
§ Using the Durbin-Watson table, dL = 1.29 and dU = 1.45
§ d = 1.00494 < dL = 1.29, so reject H0 and conclude that significant positive autocorrelation exists
§ Therefore the linear model is not the appropriate model to forecast sales
Dealing with Autocorrelation
§ Suppose that we want to estimate the coefficients of the regression model
yt = β0 + β1x1t + … + βKxKt + εt
where the error term εt is autocorrelated
§ Two steps:
(i) Estimate the model by least squares, obtaining the Durbin-Watson statistic d, and then estimate the autocorrelation parameter using r = 1 – d/2
Dealing with Autocorrelation
(ii) Estimate by least squares a second regression with
§ dependent variable (yt – ryt-1)
§ independent variables (x1t – rx1,t-1), (x2t – rx2,t-1), . . . , (xKt – rxK,t-1)
§ The parameters b1, b2, . . . , bK are estimated regression coefficients from the second model
§ An estimate of b0 is obtained by dividing the estimated intercept for the second model by (1-r)
§ Hypothesis tests and confidence intervals for the regression coefficients can be carried out using the output from the second model
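A minimal sketch of the two-step procedure above (simulated AR(1) errors; numpy and statsmodels are assumed; one regressor for brevity):

```python
# Sketch: two-step correction for autocorrelated errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):            # autocorrelated errors: e_t = 0.7 e_{t-1} + u_t
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 10 + 3 * x + e

# Step (i): least squares, Durbin-Watson d, and r = 1 - d/2
res = sm.OLS(y, sm.add_constant(x)).fit().resid
d = np.sum(np.diff(res) ** 2) / np.sum(res ** 2)
r = 1 - d / 2

# Step (ii): least squares on the quasi-differenced variables
fit2 = sm.OLS(y[1:] - r * y[:-1],
              sm.add_constant(x[1:] - r * x[:-1])).fit()
b0 = fit2.params[0] / (1 - r)    # recover the intercept estimate
print(b0, fit2.params[1])        # roughly 10 and 3
```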
Chapter Summary
§ Discussed regression model building
§ Introduced dummy variables for more than two categories and for experimental design
§ Used lagged values of the dependent variable as regressors
§ Discussed specification bias and multicollinearity
§ Described heteroscedasticity
§ Defined autocorrelation and used the Durbin-Watson test to detect positive and negative autocorrelation