Part II: Estimating impacts
As you progress, you’ll see that causal questions often require more than just a snapshot. That’s why, in the later chapters on difference-in-differences, we’ll explore panel (also called longitudinal) data, which tracks the same individuals across multiple time points. We’ll also introduce time series data as a last resort, for when there is no other option, where the focus shifts to following a single entity (like a company or product) over time. By gradually expanding the types of data we use, you’ll build a solid foundation before tackling more complex scenarios.
Chapter 4: Experiments I – designing and running reliable experiments
Key message: Randomization is the most reliable way to establish causality because it ensures treatment and control groups are comparable by design.
- Experiments (A/B tests) rule out confounding variables, making them the gold standard for causal inference.
- We cover the mechanics of randomization, how to check if it worked (balance checks), and common pitfalls like peeking at results early, multiple testing, and sample ratio mismatches (see the sketch below).
Why it matters: If you can run an experiment, you should. It’s the only method that guarantees unbiased estimates without relying on strong, untestable assumptions.
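To make this concrete, here is a minimal Python sketch of the workflow: randomize, check balance on a pre-experiment covariate, and test for a sample ratio mismatch. The data and the column names (`user_id`, `pre_spend`) are made up for illustration:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Toy user table; `pre_spend` stands in for any pre-experiment covariate.
users = pd.DataFrame({
    "user_id": range(10_000),
    "pre_spend": rng.gamma(shape=2.0, scale=25.0, size=10_000),
})

# Randomize: each user is independently assigned with probability 0.5.
users["treated"] = rng.random(len(users)) < 0.5

# Balance check: pre-experiment covariates should not differ between arms.
t_stat, p_balance = stats.ttest_ind(
    users.loc[users["treated"], "pre_spend"],
    users.loc[~users["treated"], "pre_spend"],
)
print(f"Balance check on pre_spend: p = {p_balance:.3f}")  # a large p is expected

# Sample ratio mismatch (SRM) check: observed split vs. the intended 50/50.
n_treated = users["treated"].sum()
chi2, p_srm = stats.chisquare([n_treated, len(users) - n_treated])
print(f"SRM check: p = {p_srm:.3f}")  # a tiny p signals a broken assignment
```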
Chapter 5: Experiments II – Sample size, power, and detecting real effects
Key message: Running an experiment isn’t enough; you need enough data to detect the effect you’re looking for.
- “No significant effect” doesn’t mean “no effect”; it may mean your sample was too small (a Type II error).
- We cover how to calculate sample size before starting an experiment (a worked example follows), and why post-hoc power calculations are meaningless once the test is over.
Why it matters: Underpowered experiments are a waste of resources. Power calculations help you prioritize higher-impact tests and manage stakeholder expectations about what can (and cannot) be measured.
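As a taste of the pre-experiment calculation, here is a sketch using statsmodels' power functions; the effect size, alpha, and power are illustrative choices, not recommendations:

```python
from statsmodels.stats.power import tt_ind_solve_power

# Minimum detectable effect expressed as a standardized (Cohen's d) effect size.
# Here: we want to detect a lift of 0.5 units on a metric with std dev 10.
effect_size = 0.5 / 10  # d = 0.05, a small effect

# Leaving nobs1 unspecified tells statsmodels to solve for it.
n_per_arm = tt_ind_solve_power(
    effect_size=effect_size,
    alpha=0.05,   # false positive rate (Type I error)
    power=0.8,    # 1 - Type II error rate: 80% chance of detecting a true effect
    alternative="two-sided",
)
print(f"Required sample size per arm: {n_per_arm:,.0f}")
```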
Chapter 6: Causal assumptions: think first, regress later
Key message: When you can’t experiment, simple comparisons are unfair because treated units are already different from untreated ones. Regression tries to fix this by comparing “apples to apples”, but you must know how to select the right variables to control for.
- These methods rely on the selection-on-observables assumption: once we control for the right variables, treatment is effectively as good as random.
- We explore how to use Directed Acyclic Graphs (DAGs) to reason about confounders, mediators, and colliders, then apply linear regression (OLS) to adjust for differences, as sketched below.
Why it matters: Most business data is observational. These are the first tools you reach for when you need to answer causal questions from historical data, provided you have good data on potential confounders.
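A simplified sketch of the adjustment step, using simulated data in which a DAG would identify `age` as the only confounder (all numbers and column names are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5_000

# Simulated observational data: `age` confounds the comparison because
# older users are both more likely to adopt the feature and spend more.
age = rng.integers(18, 70, size=n)
treated = (rng.random(n) < (age - 18) / 80).astype(int)
spend = 50 + 0.8 * age + 5.0 * treated + rng.normal(0, 10, size=n)
df = pd.DataFrame({"age": age, "treated": treated, "spend": spend})

# The naive comparison is confounded by age; conditioning on it in OLS
# recovers the true effect (5.0) under selection on observables.
naive = smf.ols("spend ~ treated", data=df).fit()
adjusted = smf.ols("spend ~ treated + age", data=df).fit()
print(f"Naive estimate:    {naive.params['treated']:.2f}")
print(f"Adjusted estimate: {adjusted.params['treated']:.2f}")
```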
Chapter 7: Instrumental variables – when users don’t do what they were assigned to do
Key message: Sometimes you try to treat people, but they don’t comply. Or you can’t treat them directly, but you can nudge them. Instrumental variables allow you to estimate causal effects in these “messy” situations.
- We categorize users into compliers, always-takers, never-takers, and defiers, and explain why IV recovers the Local Average Treatment Effect (LATE) for compliers only.
- The chapter walks through an email frequency optimization example using two-stage least squares (2SLS); see the sketch below.
Why it matters: Perfect experiments are rare. IV gives you a rigorous way to handle non-compliance and measure the impact of programs where users self-select into the treatment.
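A sketch of 2SLS on simulated encouragement-design data, assuming the `linearmodels` package; `assigned` is the random nudge (the instrument) and `opened` the self-selected treatment:

```python
import numpy as np
import pandas as pd
from linearmodels.iv import IV2SLS

rng = np.random.default_rng(1)
n = 10_000

# Encouragement design: only some users comply with the nudge, and an
# unobserved trait (engagement) drives both take-up and the outcome,
# so naive OLS on `opened` would be confounded.
assigned = rng.integers(0, 2, size=n)
engagement = rng.normal(size=n)
opened = ((assigned + engagement + rng.normal(size=n)) > 0.5).astype(int)
purchases = 1.0 + 2.0 * opened + 1.5 * engagement + rng.normal(size=n)
df = pd.DataFrame({"assigned": assigned, "opened": opened, "purchases": purchases})

# 2SLS: instrument `opened` with the random assignment. The estimate is
# the LATE, i.e. the effect for compliers (true value here: 2.0).
iv = IV2SLS.from_formula("purchases ~ 1 + [opened ~ assigned]", data=df).fit()
print(iv.params["opened"])
```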
Chapter 8: Regression discontinuity design: cutoffs as natural experiments
Key message: Arbitrary rules, like “users scoring above 50 get free shipping,” create natural experiments at the cutoff.
- People just above and just below the cutoff are virtually identical, except for the treatment.
- We distinguish between Sharp RDD (deterministic cutoff) and Fuzzy RDD (probability jump at cutoff), and cover validation checks like manipulation testing and covariate balance at the threshold (see the sketch below).
Why it matters: RDD is considered the most credible quasi-experimental method after randomized experiments. Many business policies (tiers, limits, qualifications) naturally create these cutoffs, offering a goldmine for rigorous causal analysis.
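A bare-bones sharp RDD sketch on simulated data: center the running variable at the cutoff, keep a window around it, and estimate the jump with a local linear fit. The bandwidth of 10 is an arbitrary illustrative choice (real analyses select it in a data-driven way):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 20_000

# Running variable: a score; users at or above 50 get the treatment.
score = rng.uniform(0, 100, size=n)
treated = (score >= 50).astype(int)
orders = 2 + 0.03 * score + 0.7 * treated + rng.normal(0, 1, size=n)
df = pd.DataFrame({"score": score, "treated": treated, "orders": orders})

# Sharp RDD: keep observations within a bandwidth of the cutoff and fit
# a local linear model, letting the slope differ on each side.
cutoff, bandwidth = 50, 10
local = df[(df["score"] - cutoff).abs() <= bandwidth].copy()
local["centered"] = local["score"] - cutoff
rdd = smf.ols("orders ~ treated * centered", data=local).fit()

# The coefficient on `treated` is the jump at the cutoff (true value: 0.7).
print(rdd.params["treated"])
```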
Chapter 9: Two-way fixed effects – the old difference-in-differences
Key message: By comparing the change over time in a treated group to the change in a control group, we cancel out fixed differences and global trends.
- The classic Two-Way Fixed Effects (TWFE) model works well for the canonical “two groups, two time periods” setting, as sketched below.
- We cover the parallel trends assumption (and how to check it with event studies), SUTVA, and practical concerns like anticipation effects and selection bias (Ashenfelter’s Dip).
Why it matters: DiD is the workhorse of policy evaluation. It lets you measure the impact of product launches, marketing campaigns, or regional policies using the panel data you likely already have.
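A minimal sketch of the canonical two-group, two-period DiD on simulated data; the interaction term in the regression is the DiD estimate:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 4_000

# Canonical 2x2 DiD: two groups, two periods; only one group is treated
# in the post period. Groups start at different levels (fixed differences)
# and both drift upward over time (a common trend).
g = rng.integers(0, 2, size=n)          # 1 = treated group
post = rng.integers(0, 2, size=n)       # 1 = after the launch
y = 10 + 3 * g + 2 * post + 1.5 * g * post + rng.normal(0, 1, size=n)
df = pd.DataFrame({"treated_group": g, "post": post, "y": y})

# The interaction coefficient is the DiD estimate: group gaps and the
# common time trend cancel out, leaving the treatment effect (true: 1.5).
did = smf.ols("y ~ treated_group * post", data=df).fit()
print(did.params["treated_group:post"])
```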
Chapter 10: Staggered treatment – the new difference-in-differences
Key message: When treatment rolls out at different times, the classic TWFE estimator can produce biased or even wrong-signed estimates. Modern DiD methods fix this problem.
- We explain “forbidden comparisons” (the issue of using already-treated units as comparison groups) and why heterogeneous effects across cohorts may break TWFE.
- We introduce the Callaway & Sant’Anna approach, which builds group-time ATTs and aggregates them into clean event studies (a simplified sketch follows).
Why it matters: Staggered rollouts are the norm in business (regional product launches, phased marketing campaigns). Using modern methods prevents you from drawing the wrong conclusions from your data.
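A deliberately stripped-down sketch of the group-time ATT idea on simulated data. This is not the full Callaway & Sant’Anna estimator (which adds covariates, proper inference, and aggregation), but it shows the key move: each cohort is compared only to never-treated units, from its own pre-treatment base period:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Staggered rollout: cohorts adopt in period 2 or 3; cohort 0 never does.
units = pd.DataFrame({"unit": range(300), "cohort": rng.choice([0, 2, 3], size=300)})
panel = units.merge(pd.DataFrame({"period": range(1, 5)}), how="cross")
treated = (panel["cohort"] > 0) & (panel["period"] >= panel["cohort"])
panel["y"] = panel["period"] + 2.0 * treated + rng.normal(0, 1, size=len(panel))

def att_gt(panel, g, t):
    """2x2 DiD for cohort g at period t vs. never-treated, base period g-1.
    Using only never-treated units avoids the 'forbidden comparison'
    with already-treated units that breaks TWFE."""
    sub = panel[panel["cohort"].isin([g, 0]) & panel["period"].isin([g - 1, t])]
    means = sub.groupby(["cohort", "period"])["y"].mean()
    return (means[g, t] - means[g, g - 1]) - (means[0, t] - means[0, g - 1])

# Group-time ATTs (the true effect is 2.0 for every cohort and period).
for g in (2, 3):
    for t in range(g, 5):
        print(f"ATT(g={g}, t={t}) = {att_gt(panel, g, t):.2f}")
```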
Chapter 11: Time series methods: measuring impact without a control group
Key message: Sometimes you launch everywhere at once (like Big Brother Brasil or a Super Bowl ad), so there’s no control group. Interrupted Time Series methods use the past to predict the counterfactual.
- We cover Causal ARIMA (using autoregressive patterns) and CausalImpact (using auxiliary series unaffected by the intervention) to construct counterfactuals, as sketched below.
- These methods require strict assumptions: stable historical patterns, no other shocks at the intervention date, and no anticipation effects.
Why it matters: ITS is the “method of last resort” when valid control groups are impossible. It helps you answer “did this nationwide launch work?” while being honest about the uncertainty.
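A sketch of the Causal ARIMA idea on simulated data: fit a model on the pre-intervention series only, forecast forward, and read the effect off the gap between actuals and the forecast:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(5)

# Daily sales: an AR(1) pattern around a stable level, with a nationwide
# launch on day 200 that lifts sales by 5 (so no control group exists).
n_pre, n_post, lift = 200, 30, 5.0
y = np.zeros(n_pre + n_post)
for t in range(1, len(y)):
    y[t] = 0.7 * y[t - 1] + rng.normal(0, 1)
y += 100
y[n_pre:] += lift

# Fit ARIMA on the pre-period only, then forecast the post-period:
# the forecast is the counterfactual "what would have happened anyway".
model = ARIMA(y[:n_pre], order=(1, 0, 0)).fit()
counterfactual = model.forecast(steps=n_post)

# The pointwise difference (and its running sum) is the estimated impact.
effect = y[n_pre:] - counterfactual
print(f"Estimated average lift: {effect.mean():.2f} (true: {lift})")
```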
Chapter 12: Heterogeneous treatment effects: different people, different reactions
Key message: The average treatment effect hides the fact that an intervention might work wonders for some users and backfire for others.
- We introduce Causal Forests to estimate conditional average treatment effects (CATEs) and discover who responds best to an intervention (see the sketch below).
- The chapter covers honest estimation, sample splitting, and how to validate heterogeneous effect estimates.
Why it matters: Modern businesses don’t just want to know if something works “on average”; they want to know for whom. This unlocks personalization and targeted optimization.
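A sketch of CATE estimation, assuming the `econml` package and its `CausalForestDML` estimator; the data are simulated so the true heterogeneity is known:

```python
import numpy as np
from econml.dml import CausalForestDML

rng = np.random.default_rng(6)
n = 5_000

# Simulated experiment where the effect varies with one feature:
# the treatment helps users with x0 > 0 and does nothing for the rest.
X = rng.normal(size=(n, 3))
T = rng.integers(0, 2, size=n)
tau = np.where(X[:, 0] > 0, 2.0, 0.0)          # true CATE
Y = X[:, 1] + tau * T + rng.normal(size=n)

# Causal Forest: estimates the conditional average treatment effect
# CATE(x) = E[Y(1) - Y(0) | X = x], with honest estimation built in.
cf = CausalForestDML(discrete_treatment=True, random_state=0)
cf.fit(Y, T, X=X)

cate = cf.effect(X)
print(f"Mean CATE when x0 > 0:  {cate[X[:, 0] > 0].mean():.2f} (true: 2.0)")
print(f"Mean CATE when x0 <= 0: {cate[X[:, 0] <= 0].mean():.2f} (true: 0.0)")
```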
