Glossary
Keywords
A/B Testing, Causal Inference, Causal Time Series Analysis, Data Science, Difference-in-Differences, Directed Acyclic Graphs, Econometrics, Impact Evaluation, Instrumental Variables, Heterogeneous Treatment Effects, Potential Outcomes, Power Analysis, Sample Size Calculation, Python and R Programming, Randomized Experiments, Regression Discontinuity, Treatment Effects
A
- A/A test
- A diagnostic experiment where both groups receive the identical experience (no actual treatment difference). Used to validate that the experimentation infrastructure is working correctly — if you find a statistically significant difference between two identical groups, something may be broken in your randomization or data pipeline.
- A/B test
- A randomized experiment where two or more variants (A and B) are compared to determine which one performs better. In causal inference, it is the gold standard for estimating treatment effects because randomization eliminates selection bias.
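As a minimal sketch of the analysis behind an A/B test, the following compares simulated outcomes for two variants; all numbers, including the assumed lift of 0.5, are made up for illustration:

```python
import math
import random
import statistics

random.seed(42)

# Simulated data: the treatment is assumed to lift the mean outcome by 0.5.
control = [random.gauss(10.0, 2.0) for _ in range(5000)]
treatment = [random.gauss(10.5, 2.0) for _ in range(5000)]

# Difference in means and its large-sample standard error.
diff = statistics.mean(treatment) - statistics.mean(control)
se = math.sqrt(statistics.variance(treatment) / len(treatment)
               + statistics.variance(control) / len(control))
z = diff / se  # z-statistic for the two-sample comparison

print(f"estimated lift: {diff:.3f} (z = {z:.2f})")
```

Because assignment is randomized, the difference in means is an unbiased estimate of the treatment effect.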
- Alternative hypothesis (\(H_1\))
- The hypothesis that there is a real effect or difference (e.g., “the feature generates incremental revenue”). Validated only when data provides strong evidence to reject the null hypothesis.
- Always-takers
- In instrumental variables settings, people who always take the treatment regardless of their assignment. They get treated whether assigned to control or treatment and contribute zero to the first stage effect.
- Ashenfelter’s Dip
- A phenomenon where individuals who participate in a program (like job training) experience a temporary decline in their outcome (like earnings) just before the program starts. This pre-program dip is a classic sign of selection bias.
- Attrition bias
- Bias caused when participants drop out of a study non-randomly (e.g., only unsatisfied users leave), rendering the remaining treatment and control groups incomparable.
- Average Treatment Effect (ATE)
- The conceptual average of the individual treatment effects across the entire population. It answers the question: “What would be the average difference in outcomes if everyone were treated versus if no one were treated?”
- Average Treatment Effect on the Treated (ATT)
- The average causal effect of the treatment specifically for those who actually received it. It answers: “How much did the program help the people who actually participated?”
B
- Backdoor path
- In a causal diagram (DAG), a non-causal path connecting treatment and outcome through common causes (confounders). These paths must be blocked to identify the true effect.
- Bad control
- A variable that, when controlled for in a regression, introduces bias rather than removing it. Common examples include mediators (which block the causal path) and colliders (which open spurious paths).
- Bandwidth
- The specific range of the running variable around the cutoff used for analysis in a Regression Discontinuity Design (RDD). Because the assumption of comparability only holds near the threshold, bandwidth selection triggers a trade-off between bias (too wide) and variance (too narrow).
- Bias
- Systematic error that causes the estimated effect to differ from the true causal effect. Common sources include selection bias, omitted variable bias, and measurement error.
C
- Cannibalization
- A form of interference where treating one unit reduces outcomes for another unit. Common in marketplaces (e.g., boosting Seller X’s visibility reduces sales for Seller Y) and product lines (e.g., a new product steals customers from an existing one). Cannibalization violates SUTVA because the control group’s outcomes are affected by the treatment.
- Causal forest
- A machine learning algorithm for estimating heterogeneous treatment effects. It adapts random forests by choosing splits that maximize treatment-effect heterogeneity and by using “honest” estimation, in which separate subsamples are used to build the splits and to estimate the effects.
- Causal inference
- The science of determining whether and how much a specific action (cause) affects an outcome (effect). It goes beyond correlation to understand “what would happen if” we intervened.
- Central Limit Theorem (CLT)
- A statistical theorem stating that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, regardless of the shape of the population distribution.
- Cluster randomization
- A randomization scheme where entire groups (clusters) — such as cities, stores, or driver pools — are assigned to treatment or control rather than individual units. Used to prevent contamination when interference between units is expected, but dramatically reduces effective sample size.
- Clustered standard errors
- Standard errors that account for the correlation of observations within groups (clusters). Required when the unit of analysis differs from the unit of randomization, or when errors are correlated within groups over time.
- Cohen’s d
- A standardized measure of effect size expressing the difference between two means in terms of standard deviations. In experiments, it helps compare effects across studies and determine practical significance beyond statistical significance.
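A quick sketch of the calculation, using the pooled standard deviation; the group values below are invented for illustration:

```python
import math
import statistics

# Hypothetical outcomes for two groups (illustrative numbers only).
group_a = [12.1, 11.8, 12.5, 13.0, 12.2, 11.9, 12.7, 12.4]
group_b = [13.0, 13.4, 12.9, 13.8, 13.2, 13.5, 12.8, 13.6]

n_a, n_b = len(group_a), len(group_b)
# Pooled variance weights each group's variance by its degrees of freedom.
pooled_var = ((n_a - 1) * statistics.variance(group_a)
              + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
d = (statistics.mean(group_b) - statistics.mean(group_a)) / math.sqrt(pooled_var)
print(f"Cohen's d = {d:.2f}")
```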
- Collider
- A variable that is caused by both the treatment and the outcome (or by two other variables in a path). Controlling for a collider creates a spurious association between the variables that cause it, leading to collider bias (a form of selection bias).
- Common support (overlap)
- The assumption that every unit in the population has a non-zero probability of receiving either the treatment or the control. Without common support, we cannot reliably estimate causal effects for the entire population.
- Compliance rate
- The percentage of units assigned to a treatment group who actually receive or consume the treatment. Low compliance dilutes the intention-to-treat estimate.
- Compliers
- In instrumental variables settings, people who follow their assignment: they take the treatment if assigned to it and skip it if assigned to control. The LATE estimates the treatment effect specifically for this persuadable segment.
- Conditional Average Treatment Effect (CATE)
- The average treatment effect for a specific subgroup defined by observed characteristics (e.g., the effect for “mobile users” vs. “desktop users”).
- Conditional Independence Assumption (CIA)
- The assumption that, after controlling for a set of observed covariates, the treatment assignment is effectively random (i.e., independent of potential outcomes). Also known as “selection on observables” or “unconfoundedness.”
- Confidence interval (CI)
- A range of values derived from sample data that is likely to contain the true population parameter with a certain level of confidence (e.g., 95%). It quantifies the uncertainty of an estimate.
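For example, a large-sample 95% interval for a mean can be sketched as follows; the sample values are hypothetical, and with only 10 observations a t critical value would give a slightly wider interval:

```python
import math
import statistics

# Hypothetical sample of outcomes.
sample = [4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0, 5.1, 4.9]

mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(len(sample))
# Large-sample 95% CI: mean +/- 1.96 standard errors.
ci = (mean - 1.96 * se, mean + 1.96 * se)
print(f"mean = {mean:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```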
- Confounder
- A variable that causes both the treatment and the outcome. If not controlled for, it creates a “backdoor path” that introduces omitted variable bias, making it look like the treatment caused the outcome when it was actually the confounder.
- Control group
- The group of units that does not receive the treatment or intervention. They serve as the benchmark for comparison to estimate the counterfactual (what would have happened to the treated group without the treatment).
- Correlation
- A statistical measure of the strength and direction of the linear relationship between two variables. While correlation can indicate a pattern, “correlation does not imply causation.”
- Counterfactual
- A hypothetical scenario representing what would have happened to a specific unit if they had received a different treatment than they actually did. Since we can never observe the counterfactual, causal inference is often defined as a “missing data problem.”
- Covariance
- A statistical measure of how two variables change together. Positive covariance means they move in the same direction; negative means opposite directions.
- Covariate
- A variable that is observed and can be included in a statistical model to adjust for confounding or improve precision. Also called a control variable, regressor, or feature.
- Covariate balance test
- A diagnostic check to verify that treatment and control groups are comparable on observable characteristics. In RCTs, it confirms successful randomization; in quasi-experiments (like RDD), it supports the validity of the design by ensuring no systematic differences exist between groups other than the treatment itself.
- CUPED (Controlled-experiment Using Pre-Experiment Data)
- A variance reduction technique that uses pre-experiment data to reduce noise in experiment analysis. By subtracting each user’s predicted outcome (based on their pre-experiment behavior), CUPED can dramatically shrink confidence intervals, allowing detection of smaller effects with the same sample size.
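A minimal sketch of the adjustment on simulated data, where the coefficient \(\theta\) is the OLS slope of the outcome on the pre-experiment metric (all distributions below are assumptions for illustration):

```python
import random
import statistics

random.seed(0)

# Simulated users: pre-experiment metric x, in-experiment outcome y = x + noise.
n = 10_000
x = [random.gauss(100, 10) for _ in range(n)]
y = [xi + random.gauss(0, 5) for xi in x]

x_bar = statistics.mean(x)
y_bar = statistics.mean(y)
# theta = Cov(x, y) / Var(x), the OLS slope of y on x.
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
theta = cov_xy / statistics.variance(x)

# CUPED-adjusted outcome: subtract the part of y predicted by pre-period data.
y_cuped = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
print(statistics.variance(y), statistics.variance(y_cuped))
```

Because pre-period behavior is unaffected by the treatment, the adjustment shrinks variance without biasing the estimated effect.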
- Customer Lifetime Value (LTV)
- The total predicted revenue or profit a customer will generate throughout their entire relationship with a business. Since it is a long-term metric, experiments often utilize short-term predictive metrics (surrogates) to estimate impact on LTV.
- Cutoff
- The specific threshold value of the running variable in a Regression Discontinuity Design (RDD). Crossing this line determines whether a unit is assigned to the treatment group or the control group.
D
- DAG (Directed Acyclic Graph)
- A visual map of causal assumptions using nodes (variables) and arrows (causal links). It helps brainstorm which variables are confounders, colliders, or mediators.
- Defiers
- In instrumental variables settings, people who do the opposite of their assignment: they refuse treatment when assigned to it and take treatment when assigned to control. Most IV analyses assume defiers do not exist (the monotonicity assumption).
- Density test
- A statistical test (e.g., the classic McCrary test) used in Regression Discontinuity Design to detect manipulation of the running variable. A sudden spike or drop in the number of observations at the cutoff suggests that units may be gaming their scores to qualify for treatment, violating key assumptions.
- Difference-in-Differences (DiD)
- A method that estimates the causal effect of a treatment by comparing the change in outcomes over time for a treated group to the change in outcomes for a control group. It relies on the parallel trends assumption.
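The 2x2 version reduces to simple arithmetic on four group means (the numbers below are hypothetical):

```python
# Hypothetical average outcomes before and after the intervention.
treated_pre, treated_post = 20.0, 28.0
control_pre, control_post = 18.0, 22.0

# DiD: the change for the treated group minus the change for the control group.
did = (treated_post - treated_pre) - (control_post - control_pre)
print(did)  # 4.0
```

Under parallel trends, the control group's change (+4) stands in for what the treated group would have experienced without treatment, so the remaining +4 is attributed to the treatment.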
- Donut hole RDD
- A robustness check for Regression Discontinuity Design that excludes observations immediately surrounding the cutoff. This is done to verify that the estimated effect isn’t driven by data errors or manipulative sorting occurring at the precise threshold.
- Double Machine Learning (DML)
- A method using machine learning models to “clean” the treatment and outcome of confounding from high-dimensional controls, isolating the causal effect of interest.
- Dummy variable
- A binary variable that takes on the value 0 or 1 to indicate the absence or presence of some categorical effect (e.g., “Treatment = 1” vs “Control = 0”).
E
- Econometrics
- The application of statistical methods to economic data to test hypotheses and estimate causal relationships.
- Encouragement design
- An experimental design where participants are randomly encouraged to take a treatment (e.g., via an invitation or incentive) rather than being forced into it. The encouragement serves as an instrument for estimating the treatment effect among compliers.
- Endogeneity
- A condition where an explanatory variable (like treatment) is correlated with the error term in a regression model, typically due to omitted variables, reverse causality, or measurement error. It renders OLS estimates biased.
- Error term
- The random variable in a regression model (often \(\varepsilon\)) representing all unobserved factors affecting the outcome. Ideally, it should be random and uncorrelated with the predictors.
- Estimand
- The precise quantity we want to estimate (the target), such as the Average Treatment Effect (ATE). It is defined based on the population and the causal question, independent of the data or method used.
- Estimate
- The numerical value obtained from applying an estimator to a specific sample of data. It is our “best guess” of the true parameter.
- Estimator
- The statistical rule or formula used to calculate an estimate from data (e.g., the formula for the sample mean, or the OLS algorithm).
- Event study
- A dynamic version of difference-in-differences that estimates treatment effects for each time period relative to the treatment start. It allows for checking parallel trends (pre-treatment) and seeing how effects evolve over time (post-treatment).
- Exclusion restriction
- A critical assumption in instrumental variables (IV) requiring that the instrument affects the outcome only through the treatment, and not via any other channel.
- Exogeneity
- The condition where an explanatory variable is not correlated with the error term. Using exogenous variation (like in an RCT or IV) is key to identifying causal effects.
- Expected value
- The theoretical long-run average of a random variable. In data analysis, the sample mean is our best estimate of the population expected value.
- External validity
- The extent to which the results of a study can be generalized to other populations, settings, or time periods.
F
- First stage effect
- In instrumental variables, the effect of the instrument on treatment uptake. It measures how much the instrument changes the probability of receiving treatment. A strong first stage is required for reliable IV estimates.
- Fixed effects
- Parameters in a regression model that control for unobserved time-invariant characteristics of units (individual fixed effects) or period-specific shocks shared by all units (time fixed effects).
- Frontdoor path
- A causal path flowing from treatment to outcome, potentially through mediators. It represents the mechanism by which the treatment actually works.
G
- Guardrail metrics
- Essential business metrics monitored during experiments (e.g., latency, crash rate) to ensure a new feature doesn’t accidentally harm the user experience.
H
- HARKing
- “Hypothesizing After the Results are Known.” The misleading practice of presenting a post-hoc hypothesis (created after seeing the data) as if it were the original plan.
- Heterogeneous treatment effects
- When the effect of a treatment is not the same for everyone but varies across different individuals or subgroups (e.g., a drug works better for younger patients).
- HiPPO
- “Highest Paid Person’s Opinion.” A decision-making culture where the intuition of senior leaders overrides data and experimental evidence.
- Hypothesis testing
- A statistical method used to decide whether to reject a null hypothesis (usually “no effect”) based on sample data, assessing whether an observed result is likely due to chance.
I
- Identification strategy
- The research design and set of assumptions allowing a researcher to claim that an estimated correlation is truly a causal effect (e.g., “we identified the effect using a randomized experiment”).
- Imperfect compliance
- When participants do not follow their assigned treatment protocol (e.g., people assigned to a drug don’t take it), complicating the estimation of treatment effects.
- Independent variable
- The variable that is manipulated or used to predict the outcome (the \(X\) in \(Y = \beta X\)). In experiments, this is often the treatment.
- Individual Treatment Effect (ITE)
- The treatment effect for a single, specific unit — the most granular level of causal analysis. Fundamentally unobservable because we can never see both potential outcomes for the same individual, but serves as the theoretical foundation for other estimands like CATE.
- Instrumental variable (IV)
- A method used when the treatment is endogenous (e.g., due to unobserved confounding). An “instrument” is a variable that affects the treatment but has no direct effect on the outcome and is not correlated with confounders.
- Intention-to-Treat (ITT)
- The average effect of being assigned to a treatment, regardless of whether the subject actually received it. It reflects the real-world impact of a policy rollout.
- Interaction term
- A variable added to a regression (e.g., \(Treatment \times Gender\)) to test if the treatment effect varies depending on another characteristic.
- Internal validity
- The extent to which a study accurately establishes a causal relationship between the treatment and the outcome within the context of the study itself (i.e., is the estimated effect unbiased?).
- Inverse Probability Weighting (IPW)
- A method that weights observations by the inverse of their probability of treatment (propensity score) to create a synthetic sample where treatment and control are balanced.
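A toy sketch of the Horvitz-Thompson form of the estimator, with propensity scores assumed known rather than estimated (all values are invented):

```python
# Each record: (treated indicator d, outcome y, propensity score e).
# Propensity scores are assumed known here; in practice they are estimated.
data = [
    (1, 10.0, 0.6), (1, 12.0, 0.5), (1, 11.0, 0.4),
    (0, 6.0, 0.6), (0, 5.0, 0.5), (0, 7.0, 0.4), (0, 4.0, 0.5),
]

n = len(data)
# Treated units are weighted by 1/e, control units by 1/(1 - e).
treated_term = sum(d * y / e for d, y, e in data) / n
control_term = sum((1 - d) * y / (1 - e) for d, y, e in data) / n
ate_ipw = treated_term - control_term
print(round(ate_ipw, 3))
```

The weights up-weight units that were unlikely to end up in their observed group, rebalancing the sample as if treatment had been assigned independently of covariates.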
K
- Kernel function
- A weighting scheme used in local regression (like RDD) that assigns more importance to observations closer to the cutoff. Common types include triangular (decreasing weight with distance) and uniform (equal weight within the bandwidth).
- Kitchen sink regression
- A regression approach where the analyst includes every available variable, hoping to control for confounding. Often counterproductive because it may include bad controls (mediators, colliders, treatment predictors) that introduce bias or inflate standard errors.
L
- Lagged effect
- An impact that manifests some time after the intervention occurred (e.g., an ad campaign today increasing sales next month).
- Law of Large Numbers (LLN)
- A theorem stating that as the sample size increases, the sample mean gets closer and closer to the true population mean.
- Leads and lags
- In event study designs, leads are the pre-treatment coefficients (\(\delta_k\) for \(k < 0\)) that test whether treated and control groups were on similar trajectories before the intervention. Lags are the post-treatment coefficients (\(\delta_k\) for \(k \geq 0\)) that capture how the treatment effect evolves over time. Leads near zero support the parallel trends assumption; significant leads signal a potential violation.
- Local Average Treatment Effect (LATE)
- The treatment effect specifically for “compliers” — people who are induced to take the treatment by the instrument or assignment.
M
- Machine learning (ML)
- A field of AI focused on building algorithms that learn patterns from data to make predictions. While powerful for prediction, standard ML does not inherently solve causal problems.
- Mechanism
- The process or pathway through which a cause produces an effect (the “why” or “how” it happens).
- Mediator
- A variable that lies on the causal path between the treatment and the outcome (Treatment -> Mediator -> Outcome). It explains how the treatment works.
- Meta-learners
- Frameworks (like T-learner, S-learner) that use standard machine learning models to estimate heterogeneous treatment effects by modeling potential outcomes separately.
- Minimum Detectable Effect (MDE)
- The smallest true effect size an experiment can reliably detect with a given sample size and statistical power.
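The standard large-sample approximation for a two-sample comparison of means can be inverted to get the required sample size for a chosen MDE; the sigma and MDE values below are assumptions:

```python
# Per-arm sample size for a two-sample comparison of means (standard
# approximation): n = 2 * (z_alpha/2 + z_beta)^2 * sigma^2 / mde^2.
z_alpha = 1.96   # two-sided alpha = 0.05
z_beta = 0.84    # power = 0.80
sigma = 10.0     # assumed standard deviation of the outcome
mde = 1.0        # smallest effect worth detecting

n_per_arm = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2
print(round(n_per_arm))  # about 1568 users per arm
```

Halving the MDE quadruples the required sample size, which is why tiny effects are expensive to detect.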
- Monotonicity
- An IV assumption (no “defiers”) meaning the instrument makes everyone more (or no less) likely to take the treatment; it doesn’t discourage anyone who would have otherwise taken it.
- Multicollinearity
- A situation in which two or more explanatory variables in a regression are highly correlated, making it difficult to isolate their individual effects.
N
- Never-takers
- In instrumental variables settings, people who never take the treatment regardless of assignment. Even when encouraged or invited, they ignore the treatment and contribute zero to the first stage effect.
- Network effects
- A phenomenon where one user’s behavior or treatment status affects the outcomes of other users. In causal inference, network effects violate the SUTVA assumption of “no interference” (e.g., if a discounted user tells a friend about a product, that friend’s behavior is influenced by someone else’s treatment).
- Novelty effect
- A temporary spike in user engagement caused by the “newness” of a feature, which often wears off as users get accustomed to it.
- Null hypothesis (\(H_0\))
- The default assumption that there is no effect or no difference. Statistical tests aim to see if there is enough evidence to reject this hypothesis.
O
- Omitted variable bias (OVB)
- Bias in the estimated causal effect that occurs when a relevant confounder is left out of the regression model.
- One-sided non-compliance
- A scenario where only one group can deviate from assignment. Typically, users in the control group cannot access treatment, but users in the treatment group can decline it.
- Ordinary Least Squares (OLS)
- The standard method for estimating linear regression coefficients by minimizing the sum of the squared errors between observed and predicted values.
- Outcome variable
- The variable we are trying to change or explain (the effect). Also called the dependent variable or \(Y\).
- Overall Evaluation Criterion (OEC)
- The single primary metric that defines success for an experiment. Having a clear OEC forces stakeholders to agree on what “winning” means before the test runs and prevents post-hoc cherry-picking of favorable metrics.
- Overfitting
- When a model matches the training data too closely, capturing random noise rather than the signal, leading to poor performance on new data.
P
- P-value
- The probability of observing a result at least as extreme as the one found in the sample, assuming the null hypothesis is true. A low p-value (typically < 0.05) suggests the result is statistically significant.
- Parallel trends assumption
- The critical assumption in difference-in-differences designs: that in the absence of treatment, the average outcomes of the treated and control groups would have changed at the same rate over time.
- Parameter
- The true, unknown value in the population that we are trying to estimate (e.g., the true average height of all adults).
- Per-protocol analysis
- A naive analysis comparing outcomes based on actual treatment received rather than assignment. In experiments with imperfect compliance, this approach reintroduces selection bias and typically overestimates treatment effects.
- Perfect compliance
- When all units follow their assignment exactly: everyone assigned to treatment receives it, and everyone assigned to control does not. This ideal scenario makes ITT equal to the true ATE.
- Placebo test
- A falsification check where a “fake” treatment or outcome is tested (e.g., testing the effect on data from before the intervention). Finding a significant effect indicates a problem with the design.
- Population
- The entire group of individuals or units that we are interested in studying.
- Potential outcomes framework
- A conceptual framework (Rubin Causal Model) where causal effects are defined as the difference between potential outcomes under different treatment scenarios (e.g., \(Y(1) - Y(0)\)).
- Power
- The probability that a statistical test will correctly reject a false null hypothesis (i.e., find an effect if it really exists).
- Pre-registration
- Documenting the hypothesis, design, and analysis plan before running a study. This transparent practice prevents p-hacking and HARKing.
- Propensity Score
- The probability that a unit receives the treatment given its observed characteristics. Used to match or weight control units to look like treated units.
R
- R-squared (\(R^2\))
- A measure (0 to 1) of how well the regression model explains the variance in the outcome data.
- Randomization
- The process of assigning units to treatment and control groups purely by chance. This ensures that the groups are comparable on average, eliminating selection bias.
- Reduced form effect
- In instrumental variables, the direct effect of the instrument on the outcome. For binary instruments, this is equivalent to the ITT. The LATE is calculated by dividing the reduced form by the first stage.
- Regression (Linear)
- A statistical method modeling the linear relationship between a dependent variable (\(Y\)) and one or more independent variables (\(X\)).
- Regression discontinuity design (RDD)
- A causal method that exploits a cutoff or threshold assignment rule (e.g., scholarship for grades > 80%). It compares units just above and just below the cutoff, who are assumed to be effectively randomized.
- Residual
- The difference between the observed value and the predicted value in a regression model. It represents the “unexplained” part of the outcome for a specific unit.
- Reverse causality
- A bias where the direction of cause and effect is flipped or bidirectional (e.g., “do sales drive ad spend, or does ad spend drive sales?”).
- Robustness check
- An additional analysis run to verify that the main results remain stable under different assumptions or model specifications.
- Running variable
- The continuous variable (often a score, income, etc.) that determines treatment assignment in a Regression Discontinuity Design. Also known as the forcing variable.
S
- Sample
- A subset of the population selected for study. We use the sample to make inferences about the population.
- Sample Ratio Mismatch (SRM)
- A randomization failure where the ratio of users in treatment vs. control differs significantly from the design (e.g., 50/50 becomes 48/52), often indicating a bug.
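A common check is a chi-square goodness-of-fit test against the designed split; the counts below are hypothetical, and a 50/50 design is assumed:

```python
# Chi-square goodness-of-fit check for an assumed 50/50 design ratio.
observed = {"treatment": 50_912, "control": 49_088}
total = sum(observed.values())
expected = total / 2

chi2 = sum((o - expected) ** 2 / expected for o in observed.values())
# Critical value for 1 degree of freedom at alpha = 0.001 is about 10.83.
srm_detected = chi2 > 10.83
print(f"chi2 = {chi2:.2f}, SRM detected: {srm_detected}")
```

A very strict alpha (0.001 rather than 0.05) is typical here, since SRM checks run on every experiment and a flagged mismatch halts analysis.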
- Selection bias
- Bias introduced when the individuals in the treatment group differ systematically from those in the control group in ways that affect the outcome (e.g., sicker patients choosing a new treatment).
- Serial correlation
- When errors/residuals for the same unit are correlated over time (common in time-series data). Failing to account for it can lead to underestimating standard errors.
- Significance level (\(\alpha\))
- The probability of rejecting the null hypothesis when it is true (Type I error). Typically set at 0.05.
- Simpson’s Paradox
- A phenomenon where a trend appears in different groups of data but disappears or reverses when these groups are combined. It often highlights the importance of controlling for confounders.
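A numeric sketch in the style of the classic kidney-stone example (the counts are illustrative): treatment A wins within every stratum yet loses in the pooled comparison.

```python
# Success counts per severity stratum: (successes, total). Illustrative data.
a = {"small": (81, 87), "large": (192, 263)}   # treatment A
b = {"small": (234, 270), "large": (55, 80)}   # treatment B

for stratum in a:
    rate_a = a[stratum][0] / a[stratum][1]
    rate_b = b[stratum][0] / b[stratum][1]
    print(stratum, round(rate_a, 2), round(rate_b, 2))  # A wins in each stratum

overall_a = sum(s for s, _ in a.values()) / sum(t for _, t in a.values())
overall_b = sum(s for s, _ in b.values()) / sum(t for _, t in b.values())
print(round(overall_a, 2), round(overall_b, 2))  # yet B wins overall
```

The reversal happens because severity confounds the comparison: treatment A is given mostly to the hard (large) cases, dragging its pooled success rate down.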
- Simultaneity
- A type of endogeneity where \(X\) and \(Y\) influence each other at the same time, making it hard to disentangle the causal effect of one on the other.
- Spurious correlation
- A statistical association between two variables driven by coincidence or a third unobserved variable (confounder) rather than a causal link.
- Stable Unit Treatment Value Assumption (SUTVA)
- The assumption that (1) there is no interference between units (one person’s treatment doesn’t affect another’s outcome) and (2) there is only one version of the treatment.
- Staggered adoption
- A setting where different units receive the treatment at different times rather than all at once. Common in policy rollouts and product launches. Traditional TWFE DiD may produce biased estimates under staggered adoption, motivating modern DiD estimators.
- Standard deviation
- A measure of the amount of variation or dispersion in a set of values. Low SD means values are close to the mean.
- Standard error
- A measure of the statistical accuracy of an estimate. It tells us how much the estimate would vary if we repeated the study with different samples.
- Statistical significance
- A determination that an observed result is unlikely to be due to random chance, usually indicated by a p-value below a preset threshold (e.g., 0.05).
- Stratified randomization
- A randomization scheme that first divides users into strata (groups with similar characteristics like platform or country), then randomizes within each stratum. Ensures balanced representation across key segments and can improve precision by reducing variance.
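A minimal sketch of the scheme, which groups users by stratum, shuffles within each, and splits evenly (the user IDs and strata are fabricated):

```python
import random

random.seed(7)

# Fabricated users tagged with a stratum (platform).
users = [("u%03d" % i, "mobile" if i % 3 else "desktop") for i in range(90)]

# Group users by stratum, shuffle within each stratum, then split 50/50.
strata = {}
for uid, stratum in users:
    strata.setdefault(stratum, []).append(uid)

assignment = {}
for stratum, ids in strata.items():
    random.shuffle(ids)
    half = len(ids) // 2
    for uid in ids[:half]:
        assignment[uid] = "treatment"
    for uid in ids[half:]:
        assignment[uid] = "control"
```

By construction, each stratum is split exactly in half, so no platform can end up over-represented in one arm by chance.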
- Synthetic control
- A method constructing a weighted combination of multiple control units to create a “synthetic” counterfactual that mimics the treated unit’s pre-intervention trend.
T
- Treatment assignment
- The specific rule or mechanism that determines which units receive the treatment and which receive the control. For instance, in experiments, this is random; in RDD, it is based on a cutoff.
- Treatment effect
- The causal impact of the intervention on the outcome.
- Triggering
- An experimental analysis technique that restricts analysis to users who actually encountered the treatment (e.g., those who visited the modified page). Reduces noise by excluding users who were technically in the experiment but never had a chance to be affected. Must be done carefully to avoid reintroducing selection bias.
- Trimming
- A technique for addressing lack of common support by excluding units with extreme propensity scores (very close to 0 or 1). By restricting the analysis to the region where treated and control groups overlap, trimming reduces extrapolation bias at the cost of narrowing the population for which causal effects can be estimated.
- Two-sided non-compliance
- A scenario where both treatment and control groups can deviate from assignment. Some treated users decline treatment, and some control users access it anyway (e.g., by finding the feature on their own).
- Two-Stage Least Squares (2SLS)
- The standard regression implementation of instrumental variables. In the first stage, treatment is predicted using the instrument; in the second stage, outcomes are regressed on the predicted treatment to estimate the LATE.
- Two-way fixed effects (TWFE)
- A standard regression specification for difference-in-differences that includes both unit fixed effects and time fixed effects.
- Type I error
- A “false positive”—mistakenly finding an effect when there is none (rejecting a true null hypothesis).
- Type II error
- A “false negative”—mistakenly finding no effect when there actually is one (failing to reject a false null hypothesis).
V
- Variance
- The average of the squared differences from the mean, measuring how spread out the data is.
W
- Wald estimator
- An IV estimator calculated as the ratio of the reduced form effect (instrument on outcome) to the first stage effect (instrument on treatment).
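The calculation is just a ratio of two differences in means; the numbers below are hypothetical, from an imagined encouragement design:

```python
# Sample means by instrument value (z = 1 encouraged, z = 0 not). Assumed data.
y_z1, y_z0 = 24.0, 22.5   # mean outcome when encouraged vs. not
d_z1, d_z0 = 0.55, 0.25   # treatment take-up when encouraged vs. not

reduced_form = y_z1 - y_z0   # effect of the instrument on the outcome (ITT)
first_stage = d_z1 - d_z0    # effect of the instrument on take-up
late = reduced_form / first_stage
print(round(late, 2))  # 5.0
```

Intuitively, the 1.5-unit outcome gain is produced entirely by the 30% of compliers the encouragement moved into treatment, so their effect is 1.5 / 0.3 = 5.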
- Weak instrument
- An instrument with only a small first stage effect. Weak instruments produce unstable LATE estimates with large standard errors and can bias results toward naive OLS comparisons.
- Winsorization
- An outlier handling technique that caps extreme values at a specified percentile (e.g., the 99th) rather than removing them entirely. Reduces the influence of outliers on experiment results while preserving sample size.
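A simple sketch using nearest-rank percentiles; the revenue values and the 90th-percentile cap are arbitrary choices for illustration:

```python
# Cap values at the given percentiles (simple nearest-rank implementation).
def winsorize(values, lower_pct=1, upper_pct=99):
    ordered = sorted(values)
    lo = ordered[int(len(ordered) * lower_pct / 100)]
    hi = ordered[int(len(ordered) * upper_pct / 100) - 1]
    return [min(max(v, lo), hi) for v in values]

revenue = [5, 7, 6, 8, 5, 9, 6, 7, 5, 4000]  # one extreme outlier
print(winsorize(revenue, lower_pct=0, upper_pct=90))
```

The outlier is capped at the 90th-percentile value rather than dropped, so the sample size and the ordering of the other observations are preserved.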
