
Monday, July 11, 2016

The gains to US GDP from a Doing Business score of 100

by Dhananjay Ghei and Nikita Singh.

Can a country achieve growth by implementing large pro-business reforms? If yes, then how much growth is really possible from such reforms? In a recent WSJ op-ed, Cochrane takes a stab at this question for the United States. Using data from the World Bank's ease of doing business index, Cochrane claims there is a log-linear relationship between GDP per person and business climate. By extrapolating this relationship out of sample, he predicts that the US would register a 209% improvement in per capita income (or 6% additional annual growth, if the required reforms are implemented over the next 20 years) by achieving an ease of doing business score of 100.

Brad DeLong disagrees. He fits a fourth-degree polynomial on the same data, justifying this on the grounds that the third-degree coefficient is negative and statistically significant. His forecast shows that an increase in the index value beyond 90 would actually lead to a lower GDP per person. Figure 1 juxtaposes the log-linear and polynomial regression fits, and we can see how sharply the two views differ: the straight line yields higher and higher GDP as you go to 100, while the polynomial droops off at the end.

Figure 1: The analysis of John Cochrane and Brad DeLong
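As a rough sketch of how such fits could be produced in R (the data frame db and its columns score and gdp_pc are hypothetical stand-ins for the World Bank data, not the exact code behind the figures):

# db is a hypothetical data frame with one row per country:
#   score  = ease of doing business score
#   gdp_pc = GDP per person

# Cochrane-style log-linear fit
fit.loglin <- lm(log(gdp_pc) ~ score, data=db)
# DeLong-style fourth-degree polynomial fit
fit.poly <- lm(log(gdp_pc) ~ poly(score, 4, raw=TRUE), data=db)

# Out-of-sample prediction at a score of 100, with 95% prediction intervals
# (the interval for the polynomial is very wide at the edge of the data)
at100 <- data.frame(score=100)
exp(predict(fit.loglin, at100, interval="prediction"))
exp(predict(fit.poly, at100, interval="prediction"))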

Areas of concern


There are many areas of concern with this analysis:

  1. Assuming linearity is surely a stretch. But polynomial regressions are a bad way to deal with nonlinearity. In particular, polynomial regressions are very fragile at the end points. This is easily seen in Figure 2: the prediction interval widens at the edges of the data. In addition, extrapolating with a polynomial is almost certain to give a wrong answer, as the curvature of the polynomial is unidentified outside the sample.
  2. Using a cross-sectional regression with one variable is a poor guide to the causal relationships. Labour and capital matter to GDP per capita. There are stark differences in law and governance, institutions and culture across countries; it is unlikely that the doing business score is a sufficient statistic.
  3. Hallward-Driemeier and Pritchett (2015) show that the "doing business index" is not a good reflection of how the laws on paper are implemented in reality. The main point of their argument is that better de jure regulations do not necessarily imply improved de facto outcomes, especially when a country has weak governmental capabilities for implementation and enforcement. Even if the US does well on the rule of law, so that this gap between rules and deals is absent, it is a serious issue for many (most?) observations in the dataset.

Figure 2: The 95% prediction interval for the polynomial regression

Can we do better?


Criticisms 2 and 3 are hard to handle. But a little bit of statistics helps us do better on the first. We use non-parametric regression as a way to have nonlinearity in the relationship between business climate and GDP per person without having to take a stand on a particular functional form. This involves three steps:

  1. Selecting an optimum bandwidth using cross-validation
  2. Estimating a nonparametric model using the chosen bandwidth
  3. Tests of statistical significance and specification

We use a second-order Gaussian kernel and fit a local linear estimator to identify the functional form in sample. Business climate is significant at the 1% level in the local linear non-parametric model. Moreover, the non-parametric regression is favoured on the basis of a lower cross-validation score. In addition, we run a number of robustness checks, changing the type of kernel and regression; the results do not change much in either case. These calculations were done in R using the np package.
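A minimal sketch of these steps with the np package, again using the hypothetical db data frame from above (the argument values shown reflect the settings described here, not necessarily the exact call behind the figures):

library(np)

# Step 1: choose the bandwidth by least-squares cross-validation for a
# local linear estimator with a second-order Gaussian kernel
bw <- npregbw(log(gdp_pc) ~ score, data=db, regtype="ll",
              bwmethod="cv.ls", ckertype="gaussian", ckerorder=2)

# Step 2: estimate the nonparametric model at the chosen bandwidth
fit.np <- npreg(bws=bw)
summary(fit.np)

# Step 3: significance test for the regressor (npcmstest() offers a
# consistent specification test against a parametric null)
npsigtest(fit.np)

# Out-of-sample prediction at a score of 100
exp(predict(fit.np, newdata=data.frame(score=100)))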

Figure 3: Non-parametric regression gives us the best of both worlds

The results, shown above, indicate that there is nonlinearity in the data, so the linear model used by Cochrane is not appropriate. But we are better off than with a polynomial regression: the confidence interval is tighter at the edges.

Figure 4 superposes the three models. The coloured dots show the predicted value of GDP per person using the three different specifications when the doing business index takes the value of 100.

Figure 4: Comparing the three predictions

Our nonparametric estimate shows that the gains from achieving a score beyond 90 are increasing, and lie somewhere between Cochrane's and DeLong's numbers. Cochrane predicts that the US would achieve 6% additional annual growth for 20 years by moving to a score of 100. Going out of sample with the nonparametric fit instead yields annual growth of 2.22% for the next 20 years. This is nothing to laugh at, but it is a smaller, and we think more plausible, estimate.

References


Hallward-Driemeier, Mary and Lant Pritchett. 2015. "How Business Is Done in the Developing World: Deals versus Rules." Journal of Economic Perspectives, 29(3): 121-40.

Hayfield, Tristen and Jeffrey S. Racine. 2008. "Nonparametric Econometrics: The np Package." Journal of Statistical Software, 27(5). URL http://www.jstatsoft.org/v27/i05/.


Dhananjay Ghei is a researcher at the National Institute of Public Finance and Policy. Nikita Singh is an MRes student at the London School of Economics and Political Science. The authors thank Ajay Shah for valuable discussions and feedback.

Wednesday, June 15, 2016

Sophisticated clustered standard errors using recent R tools

by Dhananjay Ghei

Many blog articles have demonstrated clustered standard errors in R, either by writing a function, by manually adjusting the degrees of freedom, or both (example, example, example and example). These methods give close approximations to the standard Stata results, but they do not apply the small-sample correction that Stata does.

In recent months, elegant solutions have come about in R which push the envelope on functionality and yield substantial improvements in speed. I use Petersen's test dataset, which is the workhorse of this field.

The problem

In regression analysis, getting accurate standard errors is as crucial as obtaining unbiased and consistent estimates of the regression coefficients. Standard errors determine the accuracy of the coefficient estimates and thereby affect hypothesis testing.

The correct standard errors depend on the underlying structure of the data. For our purposes, we consider cases where the error terms of the model are independent across groups but correlated within groups: for instance, studies with cross-sectional data on individuals clustered at the village, state or hospital level, or difference-in-differences regressions with clustering at the group level. Clustered standard errors allow for a general structure of the variance-covariance matrix by allowing errors to be correlated within clusters but not across clusters. In such cases, ignoring the clustering can lead to misleadingly small standard errors, narrow confidence intervals and small p-values.

Clustered standard errors can be obtained in two steps. First, estimate the regression model without any clustering; then obtain the clustered standard errors using the residuals. Clustered standard errors are estimated consistently as the number of clusters goes to infinity. However, the variance-covariance matrix is downward-biased with a finite number of clusters. One commonly used correction for this bias is a degrees-of-freedom adjustment for the finite number of clusters.
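For reference, assuming the usual Stata-style formula, this adjustment scales the clustered covariance matrix by the factor c = (G/(G-1)) x ((N-1)/(N-K)), where G is the number of clusters, N the number of observations and K the number of regressors; the factor shrinks towards 1 as the number of clusters grows.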

R and Stata codes

The code below shows how to compute clustered standard errors in R, using the plm and lmtest packages. Petersen's dataset can be loaded directly from the multiwayvcov package. Pooled OLS and fixed effect (FE) models are estimated using the plm package.

# Loading the required libraries
library(plm)
library(lmtest)
library(multiwayvcov)

# Loading Petersen's dataset
data(petersen)
# Pooled OLS model
pooled.ols <- plm(formula=y~x, data=petersen, model="pooling", index=c("firmid", "year")) 
# Fixed effects model
fe.firm <- plm(formula=y~x, data=petersen, model="within", index=c("firmid", "year")) 

Clustered standard errors can be computed in R using the vcovHC() function from the plm package. vcovHC.plm() estimates the robust covariance matrix for panel data models; its output can be passed to coeftest(), waldtest() and other methods in the lmtest package. Clustering is controlled by the cluster argument, which allows clustering on either group or time. The type argument selects the form of heteroskedasticity correction. Recently, the plm package introduced the small-sample correction as an option to the type argument of the vcovHC.plm() function; it is switched on by specifying type="sss".

# OLS with SE clustered by firm (Petersen's Table 3)
coeftest(pooled.ols, vcov=vcovHC(pooled.ols, type="sss", cluster="group"))  

# OLS with SE clustered by time (Petersen's Table 4)
coeftest(pooled.ols, vcov=vcovHC(pooled.ols, type="sss", cluster="time")) 


# FE regression with SE clustered by firm
coeftest(fe.firm, vcov=vcovHC(fe.firm, type="sss", cluster="group")) 

# FE regression with SE clustered by time
coeftest(fe.firm, vcov=vcovHC(fe.firm, type="sss", cluster="time")) 

Stata makes it easy to cluster: add the vce(cluster ...) option at the end of any routine regression command (such as reg or xtreg). The code below shows how to cluster in the OLS and fixed effect models:

webuse set http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/
webuse test_data.dta, clear

* OLS with SE clustered by firm (Petersen's Table 3)
reg y x, vce(cluster firmid)
* OLS with SE clustered by time (Petersen's Table 4)
reg y x, vce(cluster year)

* Declaring dataset to be a panel
xtset firmid year
* FE regression with SE clustered by firm
xtreg y x, fe vce(cluster firmid)
* FE regression with SE clustered by time
xtreg y x, fe vce(cluster year) nonest

The table below compares the standard errors computed by R and Stata; they agree up to the fifth decimal place.

Model                                      SE (in R)   SE (in Stata)
OLS with SE clustered by firm              0.05059     0.05059
OLS with SE clustered by time              0.03338     0.03338
FE regression with SE clustered by firm    0.03014     0.03014
FE regression with SE clustered by time    0.02668     0.02668

Performance comparison

I run benchmarks comparing the speed of Stata MP and R for each of these models on a quad-core processor. For parallelisation, I set the number of processors that Stata MP uses to 4. The results show that R is faster than Stata. An example of the benchmarking code in Stata is given below:

* Stata benchmarking program : Example
set processors 4
timer clear
timer on 1
bs, nodrop reps(1000) seed(1): reg y x
timer off 1
timer list

Parallelisation in R is done using standard R packages. An example of the benchmarking code in R is given below:

# R benchmarking program : Example
library(parallel)      # mcparallel() and mccollect() for fork-based parallelism
library(rbenchmark)    # benchmark() for timing
set.seed(1)
# Run the OLS benchmark (1000 replications) in a forked child process
ols.benchmark <- mcparallel(benchmark(lm(y~x, petersen), replications=1000))
# Collect the timing results from the child process
mccollect(ols.benchmark)

The table below compares R and Stata MP for each of these models. The average time is calculated as the ratio of elapsed time to the number of replications. Relative efficiency is defined as the ratio of the average time taken by Stata MP to the average time taken by R. It turns out that R is faster.

Model                                      Replications   Average time (R, 4 cores)   Average time (Stata MP, 4 cores)   Relative efficiency
OLS with SE clustered by firm              1000           0.0737                      0.1635                             2.22
OLS with SE clustered by time              1000           0.0557                      0.0742                             1.33
FE regression with SE clustered by firm    1000           0.0880                      0.3176                             3.61
FE regression with SE clustered by time    1000           0.0729                      0.1118                             1.53

Multi-level clustering in R

Two-way clustering does not have a routine estimation procedure with most Stata commands (except for ivreg2 and xtivreg2); a few user-written programs are available online (see, for example, here and here). This is easily handled in R using the vcovDC.plm() function, which can be used in the same fashion as vcovHC.plm().

# OLS with SE clustered by firm and time (Petersen's Table 5)
coeftest(pooled.ols, vcov=vcovDC(pooled.ols, type="sss"))

A more recent addition, the multiwayvcov package, is useful for clustering on multiple levels and for computing bootstrapped clustered standard errors. The package supports parallelisation, making it easier to work with large datasets. Two functions are exported from the package: cluster.vcov(), which computes clustered standard errors, and cluster.boot(), which calculates bootstrapped clustered standard errors. The code for replicating Petersen's results is available in the reference manual of the package. One limitation of cluster.vcov() is that it does not work with plm objects, because the package relies on estfun() from the sandwich package, which is not compatible with plm objects.
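As a minimal sketch of how this might look on the Petersen data (using an lm() fit, since cluster.vcov() does not accept plm objects; see the package's reference manual for the full replication):

# Two-way clustered standard errors with multiwayvcov on an lm() fit
# (multiwayvcov and lmtest were loaded above)
ols.lm <- lm(y~x, data=petersen)
# Cluster on both firm and time (Petersen's Table 5)
vcov.firm.year <- cluster.vcov(ols.lm, petersen[, c("firmid", "year")])
coeftest(ols.lm, vcov.firm.year)
# Bootstrapped standard errors clustered by firm (1000 replications)
vcov.boot <- cluster.boot(ols.lm, petersen$firmid, R=1000)
coeftest(ols.lm, vcov.boot)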

R code

Here's the R code to reproduce the results.

Dhananjay Ghei is a researcher at the National Institute of Public Finance and Policy. He thanks Ajay Shah, Vimal Balasubramaniam and Apoorva Gupta for valuable discussions and feedback.