Additional Topics in Regression Analysis
Chapter Goals
After completing this chapter, you should be able to:
§ Explain regression model-building methodology
§ Apply dummy variables for categorical variables with more than two categories
§ Explain how dummy variables can be used in experimental design models
§ Incorporate lagged values of the dependent variable as regressors
§ Describe specification bias and multicollinearity
§ Examine residuals for heteroscedasticity and autocorrelation
The Stages of Model Building
§ Understand the problem to be studied
§ Select dependent and independent variables
§ Identify model form (linear, quadratic…)
§ Determine required data for the study
Dummy Variable Models
(More than 2 Levels)
§ Dummy variables can be used when the categorical variable of interest has more than two categories
§ Dummy variables can also be useful in experimental design
§ Experimental design is used to identify possible causes of variation in the value of the dependent variable
§ Y outcomes are measured at specific combinations of levels for treatment and blocking variables
§ The goal is to determine how the different treatments influence the Y outcome
Dummy Variable Models
(More than 2 Levels)
§ Consider a categorical variable with K levels
§ The number of dummy variables needed is one less than the number of levels, K – 1
§ Example:
y = house price; x1 = square feet
§ If style of the house is also thought to matter:
Style = ranch, split level, condo
Dummy Variable Models
(More than 2 Levels)
§ Example: Let “condo” be the default category, and let x2 and x3 be used for the other two categories:
y = house price
x1 = square feet
x2 = 1 if ranch, 0 otherwise
x3 = 1 if split level, 0 otherwise
The multiple regression equation is:
ŷ = b0 + b1x1 + b2x2 + b3x3
Interpreting the Dummy Variable Coefficients (with 3 Levels)
§ For a condo (x2 = 0, x3 = 0): ŷ = b0 + b1x1
§ For a ranch (x2 = 1, x3 = 0): ŷ = b0 + b1x1 + b2
§ For a split level (x2 = 0, x3 = 1): ŷ = b0 + b1x1 + b3
§ Holding square feet fixed, b2 estimates the average price difference between a ranch and a condo, and b3 the average difference between a split level and a condo
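A minimal sketch of this model in Python (hypothetical data; the pandas and statsmodels libraries are assumed):

```python
# Sketch: house-price model with K - 1 = 2 dummies for the 3 styles,
# "condo" as the default. Hypothetical data; pandas/statsmodels assumed.
import pandas as pd
import statsmodels.api as sm

houses = pd.DataFrame({
    "price": [245, 312, 279, 308, 199, 219],          # y, $1000s (made up)
    "sqft":  [1400, 2100, 1600, 2000, 1100, 1300],    # x1
    "style": ["ranch", "split", "condo", "ranch", "condo", "split"],
})

houses["x2"] = (houses["style"] == "ranch").astype(int)   # x2 = 1 if ranch
houses["x3"] = (houses["style"] == "split").astype(int)   # x3 = 1 if split level

X = sm.add_constant(houses[["sqft", "x2", "x3"]])
model = sm.OLS(houses["price"], X).fit()
print(model.params)    # estimates of b0, b1, b2, b3
```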
Experimental Design
§ Consider an experiment in which
§ four treatments will be used, and
§ the outcome also depends on an environmental factor, with three levels, that cannot be controlled by the experimenter
§ Let variable z1 denote the treatment, where z1 = 1, 2, 3, or 4. Let z2 denote the environmental factor (the “blocking variable”), where z2 = 1, 2, or 3
§ To model the four treatment levels, three dummy variables are needed
§ To model the three environmental levels, two dummy variables are needed
Experimental Design
§ Define five dummy variables, x1, x2, x3, x4, and x5
§ Let treatment level 1 be the default (z1 = 1)
§ Define x1 = 1 if z1 = 2, x1 = 0 otherwise
§ Define x2 = 1 if z1 = 3, x2 = 0 otherwise
§ Define x3 = 1 if z1 = 4, x3 = 0 otherwise
§ Let environment level 1 be the default (z2 = 1)
§ Define x4 = 1 if z2 = 2, x4 = 0 otherwise
§ Define x5 = 1 if z2 = 3, x5 = 0 otherwise
Experimental Design:
Dummy Variable Tables
§ The dummy variable values can be summarized in a table:

Treatment (z1)   x1  x2  x3        Block (z2)   x4  x5
z1 = 1            0   0   0        z2 = 1        0   0
z1 = 2            1   0   0        z2 = 2        1   0
z1 = 3            0   1   0        z2 = 3        0   1
z1 = 4            0   0   1
Experimental Design Model
§ The experimental design model can be estimated using the equation
y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + ε
§ The estimated value of β2, for example, shows the amount by which the y value for treatment 3 exceeds the value for treatment 1
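A minimal sketch (plain Python; design_dummies is a hypothetical helper name) of how the five dummies map from z1 and z2:

```python
# Sketch: map treatment z1 (levels 1-4) and blocking variable z2 (levels
# 1-3) to the five dummies, with level 1 as the default in each case.
def design_dummies(z1: int, z2: int) -> tuple:
    x1, x2, x3 = int(z1 == 2), int(z1 == 3), int(z1 == 4)
    x4, x5 = int(z2 == 2), int(z2 == 3)
    return (x1, x2, x3, x4, x5)

print(design_dummies(3, 2))   # treatment 3, environment 2 -> (0, 1, 0, 1, 0)
```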
Lagged Values of the
Dependent Variable
§ In time series models, data are collected over time (weekly, quarterly, etc.)
§ The value of y in time period t is denoted yt
§ The value of yt often depends on the value yt-1, as well as on other independent variables xj:
yt = β0 + β1x1t + β2x2t + … + βKxKt + γyt-1 + εt
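A minimal sketch of fitting such a model (simulated data; numpy, pandas, and statsmodels are assumed):

```python
# Sketch: regression with one exogenous x and a lagged dependent variable.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    # simulated model: y_t = 2 + 0.5*x_t + 0.6*y_{t-1} + e_t
    y[t] = 2 + 0.5 * x[t] + 0.6 * y[t - 1] + rng.normal(scale=0.5)

df = pd.DataFrame({"y": y, "x": x})
df["y_lag1"] = df["y"].shift(1)          # lagged dependent variable y_{t-1}
df = df.dropna()                         # first row has no lagged value

X = sm.add_constant(df[["x", "y_lag1"]])
print(sm.OLS(df["y"], X).fit().params)   # estimates of beta_0, beta_j, gamma
```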
Interpreting Results
in Lagged Models
§ An increase of 1 unit in the independent variable xj in time period t (all other variables held fixed) leads to an expected increase in the dependent variable of
§ bj in period t
§ bjγ in period (t + 1)
§ bjγ² in period (t + 2)
§ bjγ³ in period (t + 3), and so on
§ The total expected increase over all current and future time periods is bj/(1 – γ)
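§ For example (hypothetical numbers), if bj = 2 and γ = 0.5, the expected increases are 2, 1, 0.5, 0.25, …, for a total of 2/(1 – 0.5) = 4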
§ The coefficients β0, β1, . . . , βK, γ are estimated by least squares in the usual manner
§ Confidence intervals and hypothesis tests for the regression coefficients are computed the same as in ordinary multiple regression
§ (When the regression equation contains lagged variables, these procedures are only approximately valid. The approximation quality improves as the number of sample observations increases.)
§ Caution should be used when using confidence intervals and hypothesis tests with time series data
§ There is a possibility that the equation errors ei are no longer independent from one another.
§ When errors are correlated the coefficient estimates are unbiased, but not efficient. Thus confidence intervals and hypothesis tests are no longer valid.
Specification Bias
§ Suppose an important independent variable z is omitted from a regression model
§ If z is uncorrelated with all other included independent variables, the influence of z is left unexplained and is absorbed by the error term, ε
§ But if there is any correlation between z and any of the included independent variables, some of the influence of z is captured in the coefficients of the included variables
Specification Bias
§ If some of the influence of omitted variable z is captured in the coefficients of the included independent variables, then those coefficients are biased…
§ …and the usual inferential statements from hypothesis test or confidence intervals can be seriously misleading
§ In addition, the estimated model error will include the effect of the missing variable(s) and will therefore be larger
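A minimal simulation of this effect (made-up coefficients; numpy and statsmodels are assumed):

```python
# Sketch: omitted-variable bias. True model y = 1 + 2x + 3z + u with z
# correlated with x; dropping z biases the coefficient on x.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)     # z is correlated with x
y = 1 + 2 * x + 3 * z + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
short = sm.OLS(y, sm.add_constant(x)).fit()

print(full.params[1])    # close to 2: unbiased when z is included
print(short.params[1])   # close to 2 + 3*0.8 = 4.4: biased when z is omitted
```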
Multicollinearity
§ Collinearity: High correlation exists among two or more independent variables
§ This means the correlated variables contribute redundant information to the multiple regression model
Multicollinearity
§ Including two highly correlated explanatory variables can adversely affect the regression results
§ No new information provided
§ Can lead to unstable coefficients (large standard errors and low t-values)
§ Coefficient signs may not match prior expectations
Some Indications of
Strong Multicollinearity
§ Incorrect signs on the coefficients
§ Large change in the value of a previous coefficient when a new variable is added to the model
§ A previously significant variable becomes insignificant when a new independent variable is added
§ The estimate of the standard deviation of the model increases when a variable is added to the model
Detecting Multicollinearity
§ Examine the simple correlation matrix to determine if strong correlation exists between any of the model independent variables
§ Multicollinearity may be present if the model appears to explain the dependent variable well (high F statistic and low se ) but the individual coefficient t statistics are insignificant
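A minimal sketch of the correlation-matrix check (simulated data; numpy and pandas are assumed):

```python
# Sketch: scan the correlation matrix of candidate regressors for
# near-collinear pairs.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)

df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
print(df.corr().round(2))   # |r| near 1 between regressors flags collinearity
```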
Assumptions of Regression
§ Normality of Error
§ Error values (ε) are normally distributed for any given value of X
§ Homoscedasticity
§ The probability distribution of the errors has constant variance
§ Independence of Errors
§ Error values are statistically independent
Residual Analysis
§ The residual for observation i is the difference between its observed and predicted values: ei = yi – ŷi
§ Check the assumptions of regression by examining the residuals
§ Examine for linearity assumption
§ Examine for constant variance for all levels of X (homoscedasticity)
§ Evaluate normal distribution assumption
§ Evaluate independence assumption
§ Graphical Analysis of Residuals
§ Can plot residuals vs. X
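A minimal sketch of a residual-vs-X plot (simulated data; numpy, statsmodels, and matplotlib are assumed):

```python
# Sketch: plot residuals against X to check linearity and constant variance.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 3 + 2 * x + rng.normal(size=100)

fit = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(x, fit.resid)        # e_i = y_i - yhat_i versus x
plt.axhline(0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()                       # a patternless band supports the assumptions
```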
Residual Analysis for Linearity
[residual-vs-x plots omitted: a random scatter is consistent with linearity, a curved pattern suggests nonlinearity]
Residual Analysis for Homoscedasticity
[residual-vs-x plots omitted: a constant spread is consistent with homoscedasticity, a fan shape suggests heteroscedasticity]
Excel Residual Output
[Excel residual output table omitted]
Heteroscedasticity
§ Homoscedasticity
§ The probability distribution of the errors has constant variance
§ Heteroscedasticity
§ The error terms do not all have the same variance
§ The size of the error variances may depend on the size of the dependent variable value, for example
§ When heteroscedasticity is present
§ least squares is not the most efficient procedure to estimate regression coefficients
§ The usual procedures for deriving confidence intervals and tests of hypotheses are not valid
Tests for Heteroscedasticity
§ To test the null hypothesis that the error terms εi all have the same variance against the alternative that their variances depend on the expected values:
§ Estimate the simple regression of the squared residuals on the fitted values:
ei² = a0 + a1ŷi
§ Let R² be the coefficient of determination of this new regression
§ The null hypothesis is rejected if nR² is greater than χ²1,α
§ where χ²1,α is the critical value of the chi-square random variable with 1 degree of freedom and probability of error α
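A minimal sketch of this test (simulated heteroscedastic data; numpy, scipy, and statsmodels are assumed):

```python
# Sketch: regress squared residuals on fitted values, then compare
# n*R^2 with the chi-square (1 df) critical value.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(1, 10, size=n)
y = 5 + 2 * x + rng.normal(scale=x)   # error spread grows with x

fit = sm.OLS(y, sm.add_constant(x)).fit()
aux = sm.OLS(fit.resid ** 2, sm.add_constant(fit.fittedvalues)).fit()

stat = n * aux.rsquared               # test statistic n*R^2
crit = stats.chi2.ppf(0.95, df=1)     # critical value at alpha = 0.05
print(stat, crit, stat > crit)        # True -> reject equal variances
```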
Autocorrelated Errors
§ Independence of Errors
§ Error values are statistically independent
§ Autocorrelated Errors
§ Residuals in one time period are related to residuals in another period
§ Autocorrelation violates a least squares regression assumption
§ Leads to sb estimates that are too small (i.e., biased)
§ Thus t-values are too large and some variables may appear significant when they are not
Autocorrelation
§ Autocorrelation is correlation of the errors (residuals) over time
The Durbin-Watson Statistic
§ The Durbin-Watson statistic is used to test for autocorrelation:
d = Σ(et – et-1)² / Σet²
(numerator summed over t = 2, …, n; denominator over t = 1, …, n)
Testing for Positive Autocorrelation
§ Calculate the Durbin-Watson test statistic d
§ d can be approximated by d ≈ 2(1 – r), where r is the sample correlation of successive errors
§ Find the values dL and dU from the Durbin-Watson table
§ (for sample size n and number of independent variables K)
§ Reject H0 (conclude positive autocorrelation) if d < dL; do not reject H0 if d > dU; the test is inconclusive if dL ≤ d ≤ dU
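A minimal sketch of computing d from a residual series (numpy assumed; the residual values are made up):

```python
# Sketch: compute the Durbin-Watson statistic from residuals in time order.
import numpy as np

def durbin_watson(e):
    # d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

e = np.array([0.5, 0.7, 0.6, 0.9, 1.1, 0.8, 1.0])   # hypothetical residuals
print(durbin_watson(e))   # d near 2 suggests no autocorrelation; d < dL -> positive
```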
Negative Autocorrelation
§ Negative autocorrelation exists if successive errors are negatively correlated
§ This can occur if successive errors alternate in sign
Testing for Positive Autocorrelation
§ Example with n = 25:
[time-series plot of the sales data and regression output omitted]
Testing for Positive Autocorrelation
§ Here, n = 25 and there is K = 1 independent variable
§ Using the Durbin-Watson table, dL = 1.29 and dU = 1.45
§ d = 1.00494 < dL = 1.29, so reject H0 and conclude that significant positive autocorrelation exists
§ Therefore the linear model is not the appropriate model to forecast sales
Dealing with Autocorrelation
§ Suppose that we want to estimate the coefficients of the regression model
yt = β0 + β1x1t + … + βKxKt + εt
where the error term εt is autocorrelated
§ Two steps:
(i) Estimate the model by least squares, obtaining the Durbin-Watson statistic d, and then estimate the autocorrelation parameter using r = 1 – d/2
Dealing with Autocorrelation
(ii) Estimate by least squares a second regression with
§ dependent variable (yt – ryt-1)
§ independent variables (x1t – rx1,t-1), (x2t – rx2,t-1), . . . , (xKt – rxK,t-1)
§ The parameters b1, b2, . . . , bK are estimated regression coefficients from the second model
§ An estimate of b0 is obtained by dividing the estimated intercept for the second model by (1-r)
§ Hypothesis tests and confidence intervals for the regression coefficients can be carried out using the output from the second model
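A minimal sketch of the two-step procedure above (simulated AR(1) errors; numpy and statsmodels are assumed; one regressor for brevity):

```python
# Sketch: two-step correction for autocorrelated errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):            # autocorrelated errors: e_t = 0.7 e_{t-1} + u_t
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 10 + 3 * x + e

# Step (i): least squares, Durbin-Watson d, and r = 1 - d/2
res = sm.OLS(y, sm.add_constant(x)).fit().resid
d = np.sum(np.diff(res) ** 2) / np.sum(res ** 2)
r = 1 - d / 2

# Step (ii): least squares on the quasi-differenced variables
fit2 = sm.OLS(y[1:] - r * y[:-1],
              sm.add_constant(x[1:] - r * x[:-1])).fit()
b0 = fit2.params[0] / (1 - r)    # recover the intercept estimate
print(b0, fit2.params[1])        # roughly 10 and 3
```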
Chapter Summary
§ Discussed regression model building
§ Introduced dummy variables for more than two categories and for experimental design
§ Used lagged values of the dependent variable as regressors
§ Discussed specification bias and multicollinearity
§ Described heteroscedasticity
§ Defined autocorrelation and used the Durbin-Watson test to detect positive and negative autocorrelation