Replication of Examples in Chapter 5
1 Introduction
In this document, I will show you how to perform hypothesis testing for a single coefficient in a simple linear regression model. I will do this by replicating the examples that appear in Chapter 5.
2 The OLS estimation
The linear model is
\begin{equation} \label{eq:testscr-str-1} TestScore_i = \beta_0 + \beta_1 STR_i + u_i \end{equation}
We first read the data from the Stata file, caschool.dta, into R.

library(AER)
library(foreign)
classdata <- read.dta("caschool.dta")
Then, we estimate the linear regression model with the function lm() and get the estimation results with the function summary().

df2use <- classdata[c("testscr", "str")]
mod1 <- lm(testscr ~ str, data = df2use)
summary(mod1)
Call:
lm(formula = testscr ~ str, data = df2use)

Residuals:
    Min      1Q  Median      3Q     Max
-47.727 -14.251   0.483  12.822  48.540

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.9330     9.4675  73.825  < 2e-16 ***
str          -2.2798     0.4798  -4.751 2.78e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared:  0.05124,   Adjusted R-squared:  0.04897
F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06
The estimation results reported by summary(mod1) include the estimated coefficients, their standard errors, t-statistics, and p-values. The output also contains other statistics that we will learn about in the next two lectures.
By default, the reported standard errors are computed with the homoskedasticity-only formula, and these are the standard errors used to compute the t-statistics. The p-values are based on the Student's t distribution with 418 degrees of freedom.
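As a quick check, we can recompute the reported p-value for str by hand. The name t_default below is introduced only for illustration; coef(summary(mod1)) returns the coefficient table as a matrix.

## Recompute the default p-value for str from the t distribution
t_default <- coef(summary(mod1))["str", "t value"]
2 * pt(-abs(t_default), df = df.residual(mod1))  # should match 2.783e-06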
3 Hypothesis tests
Now we get into testing the hypothesis regarding \(\beta_1\), that is, \[ H_0: \beta_1 = \beta_{1,0} \text{ vs. } H_1: \beta_1 \neq \beta_{1,0} \]
In this example, we are testing the null hypothesis \(H_0: \beta_1 = 0\).
Compute the t-statistic
We compute the t-statistic based on the following formula,
\begin{equation} \label{eq:t-stat-b1} t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)} \end{equation}

Upon computing the t-statistic, we compare it with the critical value at the desired significance level, say 5%, which is 1.96 for a two-sided test based on the standard normal distribution. We can also use the t-statistic to get the p-value.
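For example, using the default output above with \(\beta_{1,0} = 0\), we have \(t = (-2.2798 - 0)/0.4798 \approx -4.75\), so \(|t| > 1.96\) and we would reject \(H_0\) at the 5% level.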
How can we get all the quantities used in this formula? Of course, you can simply copy them from the output of summary(mod1). But doing so is cumbersome and prone to mistakes. More importantly, we should use the heteroskedasticity-robust standard error of \(\hat{\beta}_1\) instead of the one reported by default.
Next, I will show you two ways to get the t-statistic using the heteroskedasticity-robust standard errors. The first way is to compute all the quantities in Equation (\ref{eq:t-stat-b1}) using R functions, and the second way is to get the appropriate t-statistic through the function coeftest(). Although the second method is much easier than the first, the first teaches us how to obtain all the elements in the output of the function lm().
Get all elements from an lm object
The coefficients
The output of the function lm() is an lm object that has the same structure as a list object.
class(mod1)
str(mod1, max.level=1, give.attr = FALSE)
[1] "lm"
List of 12
 $ coefficients : Named num [1:2] 698.93 -2.28
 $ residuals    : Named num [1:420] 32.7 11.3 -12.7 -11.7 -15.5 ...
 $ effects      : Named num [1:420] -13406.2 88.3 -14 -12.6 -16.8 ...
 $ rank         : int 2
 $ fitted.values: Named num [1:420] 658 650 656 659 656 ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
 $ df.residual  : int 418
 $ xlevels      : Named list()
 $ call         : language lm(formula = testscr ~ str, data = df2use)
 $ terms        :Classes 'terms', 'formula'  language testscr ~ str
 $ model        :'data.frame': 420 obs. of 2 variables:
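Because an lm object behaves like a list, individual elements can also be pulled out with the $ operator. Two quick examples (shown only for illustration):

mod1$df.residual          # the residual degrees of freedom, 418
head(mod1$fitted.values)  # the first few fitted values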
From an lm object, all estimated coefficients can be extracted using the function coef(), which returns a named vector of the estimated coefficients. By default, the first element in the vector is the estimated intercept, and the slope is the second.
b <- coef(mod1)
b
(Intercept) str 698.932952 -2.279808
b1 <- b[2]
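Since the vector returned by coef() is named, we can equivalently extract the slope by name, which is less error-prone than indexing by position:

b1 <- coef(mod1)["str"]  # the same as b[2]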
The standard errors
The homoskedasticity-only standard errors are reported in the output of summary() by default in R. They can also be obtained from the function vcov(), which returns the covariance matrix of the estimated coefficients; the diagonal elements are the variances of the estimated coefficients. Thus, the standard errors are the square roots of the diagonal elements.

V <- vcov(mod1)
se_b1 <- sqrt(V[2, 2]); se_b1
[1] 0.4798256
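As an aside, we can obtain both standard errors in one step by taking the square roots of all the diagonal elements with diag(); this should reproduce the Std. Error column reported by summary(mod1):

sqrt(diag(vcov(mod1)))  # homoskedasticity-only SEs for both coefficients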
The heteroskedasticity-robust standard errors are the square roots of the diagonal elements of the heteroskedasticity-consistent covariance matrix, which is obtained with the function vcovHC() in the sandwich package, loaded automatically with the AER package. There are several versions of the heteroskedasticity-consistent covariance matrix. What we use is the type HC1.

htV <- vcovHC(mod1, type = "HC1")
se_b1_rb <- sqrt(htV[2, 2]); se_b1_rb
[1] 0.5194892
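To see how much the robust correction matters here, we can display the two sets of standard errors side by side (a small sketch using the matrices V and htV defined above):

cbind(default = sqrt(diag(V)), robust = sqrt(diag(htV)))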
The t-statistic, the critical value, and the p-value
The t-statistic using the heteroskedasticity-robust standard error is then computed as follows (recall that \(\beta_{1,0} = 0\) under the null, so the numerator is simply \(\hat{\beta}_1\)).
(t_b1_rb <- b1 / se_b1_rb)
str -4.388557
Although we know the critical value at the 5% significance level for a two-sided test is 1.96 with a large sample, we prefer getting the value from a function in R. The critical value at the 5% significance level is in fact the 97.5th percentile of the standard normal distribution, which can be obtained from the qnorm() function.
(c.5 <- qnorm(0.975))
[1] 1.959964
The p-value associated with the actual t-statistic is \(\mathrm{Pr}\left(|t| > |t^{act}| \right) = 2 \Phi(-|t^{act}|)\). We can compute the p-value in R, following this definition, with the pnorm() function.
(pval <- 2 * pnorm(-abs(t_b1_rb)))
str 1.141051e-05
Use coeftest()
Since hypothesis testing is a very common task in statistics, many R functions have been developed to do it. Here I introduce the function coeftest(), which is in the lmtest package, loaded automatically when the AER package is loaded.
coeftest(mod1)
t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)
(Intercept) 698.93295    9.46749 73.8245 < 2.2e-16 ***
str          -2.27981    0.47983 -4.7513 2.783e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
By default, it reports the homoskedasticity-only standard errors, the corresponding t-statistics, and the p-values. To get the heteroskedasticity-robust results, we need to add an argument to this function to specify the heteroskedasticity-consistent covariance matrix, which has been defined above as htV <- vcovHC(mod1, type = "HC1").
# coeftest returns a matrix
t_tst <- coeftest(mod1, vcov. = htV); t_tst
t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)
(Intercept) 698.93295   10.36436 67.4362 < 2.2e-16 ***
str          -2.27981    0.51949 -4.3886 1.447e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## Get the t-statistic for STR
t_b1 <- t_tst["str", "t value"]; t_b1
[1] -4.388557
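The robust p-value can be extracted from the same matrix in the same way. Note that it differs slightly from our pnorm()-based value, because coeftest() computes p-values from the t distribution with 418 degrees of freedom rather than the standard normal. The name pval_b1 is introduced here only for illustration.

pval_b1 <- t_tst["str", "Pr(>|t|)"]; pval_b1  # 1.447e-05, vs 1.141e-05 from pnorm()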
Confidence interval
Finally, we can construct the 95% confidence interval of \(\beta_1\) using the function confint(), which uses the homoskedasticity-only standard errors.
# confidence interval with the default homoskedasticity-only SE
confint(mod1, "str")
       2.5 %    97.5 %
str -3.22298 -1.336637
Since confint() does not let us supply a different covariance matrix, we can write a user-defined function to report the confidence interval with the heteroskedasticity-robust SE.
get_confint_rb <- function(lm_obj, param, vcov_ = vcov(lm_obj), level = 0.05) {
    ## This function generates a two-sided confidence interval for a
    ## parameter in the linear regression model with a specified
    ## covariance matrix.
    ## Inputs: lm_obj (an lm object), param (the parameter's name),
    ##         vcov_ (a covariance matrix), level (the significance level).
    ## Output: a named vector with the lower and upper bounds.
    ## get all the parameters' names and select one based on param
    all_param <- names(coef(lm_obj))
    which_param <- grep(param, all_param)
    ## get the estimated parameter and its standard error
    bhat_param <- coef(lm_obj)[which_param]
    sd_param <- sqrt(vcov_[which_param, which_param])
    ## get the critical value
    cv <- qnorm(1 - level/2)
    ## calculate the confidence interval
    lower <- bhat_param - cv * sd_param
    upper <- bhat_param + cv * sd_param
    conf_interval <- c(lower, upper)
    names(conf_interval) <- c("lower", "upper")
    return(conf_interval)
}
Note that in get_confint_rb() we define the default value of vcov_ to be the homoskedasticity-only covariance matrix, and the default significance level is 5%. When computing the confidence interval with the heteroskedasticity-robust standard errors, we need to change vcov_ to a heteroskedasticity-consistent covariance matrix, which is htV.
get_confint_rb(mod1, param = "str", vcov_ = htV)
lower upper -3.297988 -1.261628
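As a sanity check, calling the function with its default covariance matrix should nearly reproduce the confint() result above; the small discrepancy arises because get_confint_rb() uses the normal critical value (1.96) while confint() uses the t distribution with 418 degrees of freedom.

get_confint_rb(mod1, param = "str")  # close to confint(mod1, "str")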
4 Dummy variable
Create a dummy variable
We define a dummy variable representing small classes.
\begin{equation*} D_i = \begin{cases} 1,\; &\text{ if } STR_i < 20 \\ 0,\; &\text{ if } STR_i \geq 20 \end{cases} \end{equation*}

smallclass <- ifelse(df2use$str < 20, 1, 0)
The function ifelse() creates a vector consisting of 1s and 0s. The first argument in this function is a condition, df2use$str < 20. If the condition is satisfied for an element in df2use$str, the corresponding element in smallclass is 1, otherwise 0.
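A quick way to check the dummy (just a sketch) is to tabulate it, which shows how many districts fall into each group:

table(smallclass)  # counts of large (0) and small (1) classes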
Regression with a dummy variable
Then we can estimate the linear regression of test scores on the dummy variable, and test the null hypothesis that its coefficient is zero.
mod2 <- lm(testscr ~ smallclass, data = df2use)
coeftest(mod2, vcov. = vcovHC(mod2, type = "HC1"))
t test of coefficients:

            Estimate Std. Error  t value  Pr(>|t|)
(Intercept) 649.9788     1.3229 491.3317 < 2.2e-16 ***
smallclass    7.3724     1.8236   4.0428 6.288e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
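In a regression on a single dummy variable, the intercept is the sample mean of test scores for the \(D_i = 0\) group, and the coefficient on the dummy is the difference between the two group means. We can verify this with a short sketch (group_means is a name introduced only for illustration):

group_means <- tapply(df2use$testscr, smallclass, mean)
group_means        # mean test score in large (0) and small (1) classes
diff(group_means)  # equals the coefficient on smallclass, 7.3724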
Create a scatterplot
We can create a scatterplot to visualize the relationship between test scores and the dummy variable.
plot(smallclass, df2use$testscr, xlab = "class size", ylab = "test score")
Figure 1: The scatterplot without jittering
In Figure 1, since the variable smallclass takes only the value of 1 or 0, many points overlap. To make these points more visible, we can create a scatterplot with jittered points. I also adjust the limits of the x-axis so that the clusters of points appear closer to the center of the plot.
smallclass_jittered <- jitter(smallclass, amount = 0.05)
plot(smallclass_jittered, df2use$testscr, xlim = c(-0.5, 1.5),
     xlab = "class size", ylab = "test score")
abline(coef(mod2)[1], coef(mod2)[2], col = "blue")
Figure 2: The scatterplot with jittering