Difference-in-differences | Synthetic Control | Causal Inference in Data Science Part 2

This video is the second part of our mini course on applying causal inference in data science. We discuss which methods you can use for causal inference when only a few units are treated. Two methods are introduced: difference-in-differences and synthetic control.
🔗 Regression and Matching • Regression and Matchin...
📃Yuan's blog post on causal inference www.yuan-meng.com/posts/causa...
📚 Resources recommended by Yuan
- Abadie, A. (2021). Using synthetic controls: Feasibility, data requirements, and methodological aspects. Journal of Economic Literature, 59(2), 391-425.
- Jones, N., & Barrows, S. (2019, July 24). Uber’s synthetic control. PyData Amsterdam 2019. • Nick Jones, Sam Barrow...
- Python/R/Stata code from The Effect: An Introduction to Research Design and Causality: github.com/NickCH-K/causaldata
- The “synth” package for synthetic control: rpubs.com/danilofreire/synth
====================
Contents of this video:
====================
00:00 How to measure COVID's Impact on the Economy
08:13 Difference-in-Differences
14:47 Synthetic Control
24:17 Summary

Comments: 31

  • @junqichen6241
    2 years ago

    This video clears a lot of questions in my mind. Thank you!

  • @sophial.4488
    2 years ago

    This channel and this video is sooo under-rated!

  • @itsBlu4e
    2 years ago

    Oh my god was this useful. Thank you so much for planning it out and recording it! Amazing job.

  • @emma_ding

    2 years ago

    You're so welcome!

  • @pavlobilinskyi775
    8 months ago

    Fantastic lecture! The introduction to the DiD method is really intuitive. It is one of the best explanations, in my experience.

  • @dadmehrdidgar4971
    1 year ago

    Loved this video! Thank you both! :)

  • @xinyaohui1919
    3 months ago

    fantastic lecture! thanks Yuan and Emma!

  • @yungetong634
    2 years ago

    great great video! Thank you guys!

  • @houmlackmbp3075
    1 year ago

    Thanks for your good information

  • @escargot8854
    2 years ago

    Wouldn't COVID be a bad use case for DD because it was worldwide? Few economies were unaffected and could serve as a counterfactual.

  • @TheBjjninja

    1 year ago

    You would use an event study design to measure effects of COVID

  • @jaredgreathouse3672
    2 years ago

    Synthetic controls are pretty much the big brother of difference-in-differences. You can do so much more with SCM that you can't really do with DD. For example, I'm writing a synthetic control command for Stata that uses LASSO or ridge to automate donor/variable selection, and this method already outperforms classic SCM. I've even gotten it to do staggered implementation as well as placebo inference, and the best thing is that you only need outcome data; you don't need a long list of covariates to estimate the counterfactual.

  • @brotherbig4651

    2 years ago

    It seems you are using endogenous outcome variables on the right hand side of your regression.

  • @brotherbig4651

    2 years ago

    The variables you choose to construct the synthetic group are subjective. Endogeneity, omitted variable bias, the pre-treatment trend you have are all hidden in the process.

  • @jaredgreathouse3672

    2 years ago

    @brotherbig4651 Yeah, you're right, the variables we choose are subjective. And you're also right that the pre-treatment regression uses the donor outcomes to predict the outcomes of the actually treated unit. In fact, the algorithm can also use other covariates; it just doesn't need to. The cross-validation procedure, in addition to combating overfitting, also attempts to ensure we have the best out-of-sample predictors "k" time periods ahead of a point in the training data. Initially, I was super skeptical about the approach when I read about it for Python and R; I pretty much couldn't believe it. Well, I wrote the routine myself for Stata, roughly based on their code, and it works pretty well, even under suboptimal conditions (short pre-intervention periods, hundreds or thousands of donors) and that kind of thing.
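
The outcome-only approach described in this thread (regress the treated unit's pre-treatment outcomes on the donors' outcomes with a ridge penalty, choose the penalty by holding out the last few pre-treatment periods, then project forward) might look roughly like the sketch below. This is not the commenter's Stata routine: the data, the function names, and the candidate penalties are made up for illustration, and a small pure-Python solver stands in for a real linear algebra library.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ridge_weights(donors, treated, lam):
    """Ridge weights (no intercept) mapping donor outcomes to treated outcomes."""
    k, t = len(donors), len(treated)
    gram = [[sum(donors[i][s] * donors[j][s] for s in range(t)) + (lam if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(donors[i][s] * treated[s] for s in range(t)) for i in range(k)]
    return solve(gram, rhs)


def predict(donors, weights):
    """Weighted combination of donor outcomes, one value per period."""
    periods = len(donors[0])
    return [sum(w * d[s] for w, d in zip(weights, donors)) for s in range(periods)]


def choose_lambda(donors_pre, treated_pre, candidates, holdout):
    """Pick the penalty with the lowest squared error on the last `holdout` pre-periods."""
    split = len(treated_pre) - holdout
    train = [d[:split] for d in donors_pre]
    valid = [d[split:] for d in donors_pre]

    def holdout_error(lam):
        w = ridge_weights(train, treated_pre[:split], lam)
        return sum((y - p) ** 2 for y, p in zip(treated_pre[split:], predict(valid, w)))

    return min(candidates, key=holdout_error)


# Hypothetical outcome series: two donor cities and one treated city.
donors_pre = [[10, 12, 11, 13, 12, 14], [20, 19, 21, 20, 22, 21]]
treated_pre = [15.0, 15.5, 16.0, 16.5, 17.0, 17.5]
donors_post = [[15, 16], [23, 22]]
treated_post = [22.0, 22.0]

best_lam = choose_lambda(donors_pre, treated_pre, candidates=[0.01, 1.0, 100.0], holdout=2)
weights = ridge_weights(donors_pre, treated_pre, best_lam)
synthetic_post = predict(donors_post, weights)
effects = [y - s for y, s in zip(treated_post, synthetic_post)]
avg_effect = sum(effects) / len(effects)
print(best_lam, weights, avg_effect)
```

Unlike classic synthetic control, these weights are not forced to be non-negative or to sum to one, which is part of what the reply about endogeneity and subjectivity is questioning.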

  • @jacobdsk1381
    2 months ago

    amazing thank you!

  • @PeakWuNeverSurrender
    1 year ago

    By using synthetic control, we aim to satisfy the common trend assumption required by difference-in-differences.
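
For reference, the difference-in-differences estimate that leans on this common (parallel) trend assumption is the treated group's before/after change minus the control group's before/after change. A minimal sketch with made-up numbers (all values and names are hypothetical):

```python
def mean(values):
    return sum(values) / len(values)


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences: treated change minus control change."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical weekly metric for a treated city and a control city.
treated_pre, treated_post = [50, 52, 54], [60, 62, 64]
control_pre, control_post = [40, 42, 44], [44, 46, 48]

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(effect)
```

The control city's change is read as what would have happened to the treated city without treatment, which is only credible if the two series moved together before treatment. Synthetic control tries to build a comparison unit for which that holds.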

  • @the_teemo1
    2 years ago

    For the Uber case, what is the argument for NOT using an A/B test? (Or is it just for the sake of the example?) Thanks!

  • @dataseance4041

    2 years ago

    Because riders in the same market share drivers. Only treatment users had to walk (if they requested Express Pool), but that would reduce the average trip duration for all pool riders, even the control users who didn't walk. As a result, an A/B test can't detect the treatment effect.
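
The interference argument above can be illustrated with a toy calculation. The numbers are invented, and the sketch assumes the extreme case where the time saving spills over fully to control riders in the same market:

```python
def mean(values):
    return sum(values) / len(values)


BASELINE_MINUTES = 20.0  # hypothetical trip duration in a market without the feature
SHARED_SAVING = 2.0      # hypothetical saving every pool rider gets once some riders walk

# Within one market, treatment and control riders share the same cars,
# so both arms see the shorter trips.
treatment_trips = [BASELINE_MINUTES - SHARED_SAVING] * 100
control_trips = [BASELINE_MINUTES - SHARED_SAVING] * 100

# Rider-level A/B comparison inside the market.
ab_estimate = mean(treatment_trips) - mean(control_trips)

# Comparison against a market where nobody has the feature.
market_level_effect = mean(treatment_trips + control_trips) - BASELINE_MINUTES

print(ab_estimate, market_level_effect)
```

The rider-level comparison shows no difference even though the whole market changed, which is why the comparison has to be made between markets (treated city versus comparison cities) rather than between riders.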

  • @andreaxue376
    2 years ago

    One question I had is why we need to do the counterfactual prediction on the donor pool (similar cities) instead of using the treatment city's own historical data from before the treatment to predict the counterfactuals for the period of interest.

  • @yangsong7864

    2 years ago

    Hey Andrea, I think it's mainly because the donor pool can better capture seasonality, trend, and environment changes, which makes the counterfactual prediction for the treated unit more accurate (especially for irregular time series). Imagine the pandemic starts: there is no way for the treatment city to estimate its own counterfactual using only its own historical data (from before the pandemic). The donor pool, on the other hand, is also affected by the pandemic, so its weighted post-treatment metric values would be a better counterfactual for the treated unit.

  • @nanlinr

    1 year ago

    You need both is my understanding. Donor pool data should represent a world where treatment isn't implemented and you find it by modeling prior data of donor pool to best represent the treatment city's. Then you track how that synthetic control performs after treatment started and use that as a baseline to see how the treatment city's behavior differs from it

  • @rikki146

    1 year ago

    I guess you can, but comparison of this kind is hardly convincing. Sometimes temporal data make better predictions and sometimes cross-sectional data make better predictions. For example, say I am interested in the effect of tax on investment gains in the US market, I would rather base my estimation on counterfactual derived from JP/EU market etc than historical data.
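
The procedure discussed in this thread (fit donor weights on pre-treatment data, then use the weighted donors' post-treatment outcomes as the baseline) can be sketched as below. The weights are non-negative and sum to one, as in classic synthetic control, but they are found by a coarse grid search over exactly three donors rather than the constrained optimization a real package would use. That shortcut, the data, and the function names are all assumptions for illustration.

```python
def predict(donors, weights):
    """Weighted average of donor outcomes, one value per period."""
    periods = len(donors[0])
    return [sum(w * d[t] for w, d in zip(weights, donors)) for t in range(periods)]


def squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def fit_weights(donors_pre, treated_pre, steps=20):
    """Search non-negative weights that sum to one on a coarse grid (three donors only)."""
    best_weights, best_error = None, float("inf")
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            weights = (i / steps, j / steps, (steps - i - j) / steps)
            error = squared_error(treated_pre, predict(donors_pre, weights))
            if error < best_error:
                best_weights, best_error = weights, error
    return best_weights


# Hypothetical pre-treatment outcomes for three donor cities and the treated city.
donors_pre = [
    [10, 12, 11, 13, 12, 14],
    [20, 19, 21, 20, 22, 21],
    [5, 7, 6, 9, 8, 10],
]
treated_pre = [15.0, 15.5, 16.0, 16.5, 17.0, 17.5]

# Post-treatment outcomes; the donors carry any shock shared by all cities.
donors_post = [[15, 16], [23, 22], [11, 12]]
treated_post = [22.0, 22.5]

weights = fit_weights(donors_pre, treated_pre)
synthetic_post = predict(donors_post, weights)
effects = [y - s for y, s in zip(treated_post, synthetic_post)]
print(weights, synthetic_post, effects)
```

Because the baseline is built from what the donors actually did after treatment started, a shock that hits every city shows up in the synthetic control too, which a forecast from the treated city's own history would miss.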

  • @TheBjjninja
    1 year ago

    I think we forgot to answer the original question of "how did COVID impact our economy"? I'd probably not use Diff-in-diff to answer that but use an event study design. The whole world was impacted by COVID so it's difficult to find an appropriate control. For example what country is matchable to USA that was not impacted by COVID? An event study allows us to predict the counterfactual in this case and then compare with actual. The residual is our effect size.

  • @rikki146

    1 year ago

    yeah i think it is unanswered in the video. found your comment when i was looking for answer in comment section
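
The idea in this thread (predict the counterfactual from the pre-event data, compare it with what actually happened, and read the residual as the effect) can be sketched as follows. A straight-line fit to the pre-period stands in for whatever forecasting model one would actually use, and all numbers are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope


# Hypothetical index values before and after the event.
pre_periods = [0, 1, 2, 3, 4, 5]
pre_outcomes = [100, 102, 104, 106, 108, 110]
post_periods = [6, 7, 8]
post_outcomes = [102, 101, 105]

intercept, slope = fit_line(pre_periods, pre_outcomes)
counterfactual = [intercept + slope * t for t in post_periods]
effects = [actual - predicted for actual, predicted in zip(post_outcomes, counterfactual)]
print(counterfactual, effects)
```

The estimate is only as good as the forecast: anything else that changed after the event is absorbed into the residual, which is the weakness the donor-pool methods try to address.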

  • @jaden2582
    2 years ago

    I have a question that may confuse many other people as well: apart from cases where the event being estimated already happened in the past, in what other cases is it better to use DiD than A/B testing to estimate an effect?

  • @jaredgreathouse3672

    2 years ago

    We usually can't do controlled experiments/AB testing for an intervention; using DD is what we do practically when experiments aren't possible, and they have quite a lot of pitfalls that many economists don't address when writing about their methods. SCM however is the supreme variant of DD, a generalized version of it which offers a principled way to select donors. My variant of SCM explicitly combats overfitting and noise, for example, with machine learning estimators. DD isn't quite as capable of this, yet

  • @jaden2582

    2 years ago

    @jaredgreathouse3672 Thanks for the reply!

  • @McDreamyn_mdphd
    9 months ago

    I would kindly argue that DiD and Synthetic Controls suffer from the same pitfalls as standard statistical controls. When these two methods are employed within observational designs, confounding can be introduced if the two groups of interest are not balanced on key covariates. We employ methods like Counterfactuals (Propensity Score adjustments) as a way to balance or equalize the two groups, which can then be analyzed with an eye toward providing supportive or disconfirming evidence. Synthetic controls can also suffer from confounding, likely unobserved. Because the confounding is unobserved, you cannot use Propensity methods, and instead must use something more like instrumental variable methods.

  • @percytaabazuing4554
    1 year ago

    Good job, guys! Is it possible for you to do a video on the commands used in SCM?

  • @emma_ding

    1 year ago

    Hi, Percy! Thanks for your comment. I've added your suggestion to my list of potential content ideas. 😊