This repository contains the R package scul that is used in Hollingsworth and Wing (2020), “Tactics for design and inference in synthetic control studies: An applied example using high-dimensional data.” https://doi.org/10.31235/osf.io/fc9xt
# Install development version from GitHub (CRAN coming soon) using these two lines of code
if (!require("devtools")) install.packages("devtools")
devtools::install_github("hollina/scul")
An in-depth tutorial of the package and overview of every function in the package is available here, https://hollina.github.io/scul/articles/scul-tutorial.html. The tutorial uses publicly available data and discusses many features of the SCUL procedure. The tutorial also provides a simple comparison of the SCUL method to the traditional synthetic control method.
More details on SCUL can be found in our working paper. The paper, joint with Coady Wing, describes identification assumptions and recommendations for key (and normally ad hoc) decisions that arise in most synthetic control studies. The paper then describes the SCUL method and uses the procedure to estimate how recreational marijuana legalization affects sales of alcohol and over-the-counter painkillers, finding reductions in alcohol sales.
The synthetic control methodology is a strategy for estimating causal treatment effects for idiosyncratic historical events. In the typical application, developed by Abadie, Diamond, and Hainmueller (2010), researchers observe time series outcomes for both a treated unit and a number of untreated units. A weighted average of the untreated series, referred to as a synthetic comparison group, is used as a counterfactual estimate of the treated series. Weights are chosen to minimize discrepancies between the synthetic comparison group and the treated unit in the pre-treatment time period. Treatment effect estimates are usually the difference between observed outcomes and the synthetic counterfactual. Statistical inference is normally organized around a placebo analysis, in which pseudo-treatment effects are estimated for many untreated placebo units and the distribution of pseudo-estimates represents the null distribution of no treatment effect.
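The description above can be written compactly. The notation below is introduced here for illustration only and is not taken from the package or the working paper; the squared pre-period discrepancy is one common choice of loss, and the set of admissible weights depends on the method (for example, non-negative weights that sum to one in the traditional approach).

```latex
% Assumed notation (illustrative):
%   y_{0t}      outcome of the treated unit in period t
%   y_{jt}      outcome of untreated donor j = 1, ..., J in period t
%   T_0         last pre-treatment period
%   \mathcal{W} set of admissible weight vectors
\begin{align}
\hat{w} &= \arg\min_{w \in \mathcal{W}} \sum_{t \le T_0}
           \Big( y_{0t} - \sum_{j=1}^{J} w_j \, y_{jt} \Big)^2 \\
\hat{y}_{0t} &= \sum_{j=1}^{J} \hat{w}_j \, y_{jt} \\
\hat{\tau}_t &= y_{0t} - \hat{y}_{0t}, \qquad t > T_0
\end{align}
```

Here the first line picks the weights on pre-treatment data, the second builds the synthetic series, and the third is the estimated treatment effect in each post-treatment period.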
A useful way to think about synthetic controls is as a procedure that attempts to match donor series to target series based on the unobserved factors that determine the data generating process. When framed in this manner, identification assumptions and strategies for model selection and inference become more salient.
Recent methodological work has proposed a number of innovative strategies for estimating synthetic control weights (Arkhangelsky et al. 2018; Doudchenko and Imbens 2017; Powell 2019). In a similar vein, we construct donor weights using a method we call Synthetic Control Using Lasso (SCUL).
This method is a flexible, data-driven way to construct synthetic control groups. It relies on lasso regressions, which are popular in the machine-learning literature, and favors weights that predict well out of sample.
Our working paper highlights identification assumptions and recommendations that are relevant for any synthetic control study.
In general, our approach allows for:
We refer to the combination of this statistical approach and these recommendations as the SCUL procedure.
We outline two simple identification assumptions required for a synthetic control design to identify causal treatment parameters:
While neither of these assumptions is directly testable, our working paper offers perspectives and strategies that may help in interpreting the validity of such assumptions in applied work.
Our recommendations for decisions that commonly appear in synthetic control studies include:
We implement versions of the recommendations in our tutorial and outline each in more detail in our working paper.
Lasso regressions penalize specifications with numerous variables and large coefficients. This drives the value of many coefficients to zero and allows for the inclusion of very large donor pools. A benefit of this is that, so long as a donor is theoretically valid, a researcher will not need to decide whether to include one donor over another. A cost of this is the concern that the procedure could overfit the data. SCUL weights are created using cross-validated lasso regressions that ensure the weights do not “overfit” the data and that favor out-of-sample prediction.
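To make the shrinkage behaviour concrete, here is a minimal base-R sketch of a lasso fit by coordinate descent on simulated data. It is not the package's implementation: the function names (`soft_threshold`, `lasso_cd`), the simulated donors, and the hand-picked penalty values are all assumptions made for illustration, and the cross-validation step that SCUL uses to choose the penalty is deliberately left out.

```r
# Illustrative only: a bare-bones lasso via coordinate descent in base R.
# Assumed objective: (1 / (2n)) * sum((y - X %*% beta)^2) + lambda * sum(abs(beta))

soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

lasso_cd <- function(X, y, lambda, n_iter = 500) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  col_ms <- colSums(X^2) / n   # mean square of each donor column
  resid <- y                   # residual when every coefficient is zero
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      resid <- resid + X[, j] * beta[j]   # take donor j's contribution out of the fit
      rho <- sum(X[, j] * resid) / n      # covariance of donor j with the partial residual
      beta[j] <- soft_threshold(rho, lambda) / col_ms[j]
      resid <- resid - X[, j] * beta[j]   # put the updated contribution back
    }
  }
  beta
}

# Simulated data (hypothetical): 40 pre-treatment periods, 10 donors, and a
# target that truly depends on only the first two donors. Everything is
# centered so that no intercept is needed.
set.seed(1)
n_periods <- 40
n_donors <- 10
donors <- scale(matrix(rnorm(n_periods * n_donors), n_periods, n_donors),
                scale = FALSE)
target <- 0.7 * donors[, 1] + 0.3 * donors[, 2] + rnorm(n_periods, sd = 0.1)
target <- target - mean(target)

# Penalty level at which all weights become zero
lambda_max <- max(abs(crossprod(donors, target))) / n_periods

weights_none  <- lasso_cd(donors, target, lambda = 0)                 # no penalty: least squares
weights_some  <- lasso_cd(donors, target, lambda = 0.1 * lambda_max)  # moderate penalty
weights_large <- lasso_cd(donors, target, lambda = 1.01 * lambda_max) # just above the threshold

print(round(cbind(weights_none, weights_some, weights_large), 3))
```

With no penalty the weights match ordinary least squares, and with a penalty just above `lambda_max` every weight is exactly zero. Intermediate penalties trade fit against the total size of the weights, which is the choice that cross-validation is meant to make in a data-driven way.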
By automating model selection and allowing for a large number of donors, we reduce “researcher degrees of freedom.” Without automation, it is easy to imagine a researcher carefully tuning the synthetic prediction for each target series while spending less time perfecting the model for each placebo series. If the automated model selection results in better fit for placebo series, we also improve the statistical power. This occurs if better fit in the pre-treatment period results in less deterioration (i.e., better fit) in the post-period. This improves statistical power because statistical inference is done by comparing deviations of the treated series to the distribution of placebo deviations. Therefore, reducing the spread of the placebo null distribution allows for smaller deviations of the treated unit to be considered statistically rare.
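The link between placebo fit and power can be illustrated with a small, fully made-up example. The helper `placebo_p_value` and the deviation values below are hypothetical and are not the package's test statistic; the sketch only shows how a rank-based p-value reacts when the placebo distribution tightens.

```r
# Permutation-style p-value: the share of series (placebos plus the treated
# series itself) whose absolute deviation is at least as large as the
# treated series' absolute deviation.
placebo_p_value <- function(treated_dev, placebo_devs) {
  (1 + sum(abs(placebo_devs) >= abs(treated_dev))) / (1 + length(placebo_devs))
}

# Hypothetical post-period deviation for the treated series
treated_dev <- -1.0

# Hypothetical placebo deviations: a wide null distribution, and the same
# distribution with half the spread (as if the placebo fits were better).
wide_placebos  <- seq(-1.9, 1.9, by = 0.2)   # 20 placebo series
tight_placebos <- wide_placebos / 2

p_wide  <- placebo_p_value(treated_dev, wide_placebos)   # 10 of 20 placebos are as extreme
p_tight <- placebo_p_value(treated_dev, tight_placebos)  # no placebo is as extreme

print(c(wide = p_wide, tight = p_tight))
```

The treated deviation is identical in both cases. Only the spread of the placebo deviations changes, and that alone moves the p-value from 11/21 to 1/21.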
We frame synthetic controls as a way of matching on unobserved underlying factors that form the data generating process. When viewed in this context, using donor units from a wide range of variable types makes sense because different variable types may help pin down different underlying factors or features of the data generating process for the treated unit. As such, we use a wide range of donor variables to construct our synthetic control groups, not just the same variable type as the target variable, as is common practice.
The traditional synthetic control method restricts weights to be non-negative and to sum to one. These restrictions force the synthetic control group to remain within the support (i.e., convex hull) of the donor pool, preventing extrapolation. This can certainly be a desirable property. However, there are some situations where these restrictions can prevent the synthetic control group from closely matching the target series.
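A toy example shows what the convex hull restriction rules out. The two donor series, the target, and the grid search below are all invented for illustration; the unrestricted fit uses plain least squares rather than lasso, simply to isolate the effect of the weight restrictions.

```r
# Hypothetical data: two donor series and a target that lies above both of
# them in every period (it is 2 * donor_a - 0.5 * donor_b by construction).
periods <- 1:10
donor_a <- 10 + 0.5 * periods
donor_b <- 12 + 0.2 * periods
donors  <- cbind(donor_a, donor_b)
target  <- 2 * donor_a - 0.5 * donor_b

rmse <- function(actual, fitted) sqrt(mean((actual - fitted)^2))

# Convex weights: w on donor_a and (1 - w) on donor_b, with w in [0, 1].
# Any such combination lies between the two donors, so it stays below the target.
grid <- seq(0, 1, by = 0.01)
convex_rmse <- sapply(grid, function(w) {
  rmse(target, w * donor_a + (1 - w) * donor_b)
})
best_convex_rmse <- min(convex_rmse)

# Unrestricted least squares weights (may be negative and need not sum to one)
free_weights <- unname(qr.solve(donors, target))
free_rmse <- rmse(target, drop(donors %*% free_weights))

print(round(c(best_convex_rmse = best_convex_rmse, free_rmse = free_rmse), 4))
print(round(free_weights, 4))
```

The unrestricted fit recovers the weights 2 and -0.5 and reproduces the target, while no convex combination can reach it. Whether that kind of extrapolation is appropriate in a given study is a substantive judgment, not something the code can settle.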
The package is made for R and was developed on a Unix machine using R 3.6.1. See the session info in the vignette for the exact version of every package used. Documentation was made using roxygen2, pkgdown, and RStudio.
Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association 105 (490): 493–505. https://doi.org/10.1198/jasa.2009.ap08746.
Arkhangelsky, Dmitry, Susan Athey, David A. Hirshberg, Guido W. Imbens, and Stefan Wager. 2018. “Synthetic Difference in Differences.” http://arxiv.org/abs/1812.09970.
Doudchenko, Nikolay, and Guido W. Imbens. 2017. “Balancing, Regression, Difference-In-Differences and Synthetic Control Methods: A Synthesis.” http://arxiv.org/abs/1610.07748.
Powell, David. 2019. “Imperfect Synthetic Controls,” no. May 2017: 1–38. https://sites.google.com/site/davidmatthewpowell/imperfect-synthetic-controls.