<h1>Causal Inference for LLM Features: Overcoming the Opt-In Bias with Propensity Scores in Python</h1>
<h2 id="problem">The Opt-In Trap: Why Your AI Feature Metrics Mislead</h2>
<p>When you ship a new AI-powered feature behind a user toggle, the numbers can look impressive at first. Users who click “Try our AI assistant” or “Enable smart replies” often show dramatically better outcomes—say, 21% more tasks completed. But this comparison is flawed from the start. The volunteers who opt in are not a random sample; they're typically your most engaged power users. Any naive metric comparing opt-in users to non-users conflates the feature's true causal effect with pre-existing differences between these groups. This is the <strong>Opt-In Trap</strong>, a persistent challenge in product experimentation for generative AI features.</p><figure style="margin:20px 0"><img src="https://cdn.hashnode.com/uploads/covers/5e1e335a7a1d3fcc59028c64/6a8936be-7f43-4977-9baf-6021dc892b2d.png" alt="Causal Inference for LLM Features: Overcoming the Opt-In Bias with Propensity Scores in Python" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: www.freecodecamp.org</figcaption></figure>
<h2 id="psm-intro">How Propensity Scores Break the Bias</h2>
<p>Propensity score methods offer a statistical remedy. A <strong>propensity score</strong> is the probability that a user chooses to opt in, estimated from observable characteristics (e.g., past engagement, account age, feature usage). By weighting or matching users based on these scores, we can create comparable groups that mimic a randomized experiment. The goal is to isolate the feature's causal effect from the bias introduced by self-selection.</p>
<h3 id="pipeline">The Full Pipeline: From Estimation to Inference</h3>
<p>This walkthrough uses a synthetic SaaS dataset of 50,000 users, where the ground truth causal effect is known. You'll follow these steps:</p>
<ol>
<li>Estimate propensity scores</li>
<li>Apply inverse-probability weighting (IPW)</li>
<li>Perform nearest-neighbor matching</li>
<li>Check covariate balance</li>
<li>Compute bootstrap confidence intervals</li>
</ol>
<p>All code runs end-to-end in the companion notebook at <a href="https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/02_propensity_opt_in">GitHub</a> (file <code>psm_demo.ipynb</code>). Pre-executed outputs let you follow along before running locally.</p>
<h2 id="setup">Setting Up the Working Example</h2>
<p>We work with a synthetic dataset containing user-level features: <em>past_engagement_score</em>, <em>account_age_months</em>, <em>feature_usage_count</em>, and a binary <em>opt_in</em> flag. The outcome is <em>tasks_completed</em>. A logistic regression model estimates the propensity score for each user.</p>
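<p>For readers who want a runnable stand-in before opening the notebook, here is a minimal sketch of such a dataset. The column names come from this article; the generating coefficients and the built-in ground-truth effect (roughly two extra tasks for opt-in users) are illustrative assumptions, not the notebook's actual values:</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 50_000

df = pd.DataFrame({
    "past_engagement_score": rng.beta(2, 5, n),
    "account_age_months": rng.integers(1, 60, n),
    "feature_usage_count": rng.poisson(10, n),
})

# Self-selection: engaged, heavy users are more likely to opt in
logit = -2 + 3 * df["past_engagement_score"] + 0.05 * df["feature_usage_count"]
df["opt_in"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Ground truth: opting in adds ~2 tasks, but engagement also drives the outcome
df["tasks_completed"] = (
    5 + 10 * df["past_engagement_score"] + 2 * df["opt_in"]
    + rng.normal(0, 1, n)
).round().clip(0)
```

<p>Because engagement drives both <code>opt_in</code> and <code>tasks_completed</code>, a naive group comparison on this data will overstate the true effect of 2, which is exactly the Opt-In Trap the rest of the pipeline corrects for.</p>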
<h2 id="step1">Step 1: Estimate the Propensity Score</h2>
<p>We train a logistic regression model (or any classifier) using user features as predictors and the opt-in decision as the target. The resulting predicted probabilities are the propensity scores. In Python:</p>
<pre><code>from sklearn.linear_model import LogisticRegression

# Predictors of the opt-in decision; the opt-in flag is the target
X = df[["past_engagement_score", "account_age_months", "feature_usage_count"]]
y = df["opt_in"]

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Each user's predicted probability of opting in is their propensity score
propensity_scores = model.predict_proba(X)[:, 1]
</code></pre>
<h2 id="step2">Step 2: Inverse-Probability Weighting</h2>
<p>IPW assigns each user a weight: <code>1 / propensity_score</code> for treated users, <code>1 / (1 - propensity_score)</code> for control users. The weighted average difference in outcomes estimates the average treatment effect (ATE). Large weights can inflate variance, so trimming extreme scores is common.</p><figure style="margin:20px 0"><img src="https://cdn.hashnode.com/uploads/covers/69cc82ffe4688e4edd796adb/df8f4e49-98f3-4cd2-b4a8-f9b49d18f60a.png" alt="Causal Inference for LLM Features: Overcoming the Opt-In Bias with Propensity Scores in Python" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: www.freecodecamp.org</figcaption></figure>
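<p>The weighting scheme above can be sketched as a small self-contained estimator. The helper name <code>ipw_ate</code> and the toy data are illustrative, but the weights and the trimming of extreme scores follow the definitions in this step:</p>

```python
import numpy as np

def ipw_ate(outcome, treated, ps, clip=(0.01, 0.99)):
    """Inverse-probability-weighted ATE with propensity-score trimming."""
    ps = np.clip(ps, *clip)  # trim extreme scores to tame the variance
    w_treat = treated / ps
    w_ctrl = (1 - treated) / (1 - ps)
    return np.average(outcome, weights=w_treat) - np.average(outcome, weights=w_ctrl)

# Toy data: the propensity confounds the outcome; the true effect is 2
rng = np.random.default_rng(0)
ps = rng.uniform(0.2, 0.8, 10_000)
treated = (rng.random(10_000) < ps).astype(float)
outcome = 5 + 10 * ps + 2 * treated + rng.normal(0, 1, 10_000)

print(round(ipw_ate(outcome, treated, ps), 2))  # close to the true effect of 2
```

<p>Note that the naive difference in means on this toy data is biased upward, because high-propensity users have higher outcomes regardless of treatment; the weights undo exactly that imbalance.</p>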
<h2 id="step3">Step 3: Nearest-Neighbor Matching</h2>
<p>Instead of weighting, you can match each treated user with one or more control users who have a similar propensity score. Nearest-neighbor matching (with a caliper) ensures close matches. The average difference within matched pairs estimates the treatment effect on the treated (ATT).</p>
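<p>A minimal sketch of 1:1 nearest-neighbor matching with a caliper, using scikit-learn's <code>NearestNeighbors</code>. The helper name <code>match_att</code> and the toy data are illustrative assumptions; this variant matches with replacement and drops treated users with no control inside the caliper:</p>

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def match_att(outcome, treated, ps, caliper=0.05):
    """ATT via 1:1 nearest-neighbor matching on the propensity score."""
    t_idx = np.where(treated == 1)[0]
    c_idx = np.where(treated == 0)[0]
    nn = NearestNeighbors(n_neighbors=1).fit(ps[c_idx].reshape(-1, 1))
    dist, match = nn.kneighbors(ps[t_idx].reshape(-1, 1))
    keep = dist.ravel() <= caliper  # discard pairs with no close control
    diffs = outcome[t_idx[keep]] - outcome[c_idx[match.ravel()[keep]]]
    return diffs.mean()

# Toy data with a known effect of 2
rng = np.random.default_rng(1)
ps = rng.uniform(0.2, 0.8, 5_000)
treated = (rng.random(5_000) < ps).astype(int)
outcome = 5 + 10 * ps + 2 * treated + rng.normal(0, 1, 5_000)

print(round(match_att(outcome, treated, ps), 2))  # close to the true ATT of 2
```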
<h2 id="step4">Step 4: Check Covariate Balance</h2>
<p>After weighting or matching, check that covariates are similar across groups. Use standardized mean differences (SMD); values below 0.1 indicate good balance. Visualization with Love plots helps identify remaining bias.</p>
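<p>The balance diagnostic can be sketched in a few lines. The helper name <code>smd</code> is an assumption; the formula is the standard one (difference in means over the pooled standard deviation), with optional weights so the same function works before and after IPW:</p>

```python
import numpy as np

def smd(x_treat, x_ctrl, w_treat=None, w_ctrl=None):
    """Standardized mean difference; |SMD| < 0.1 is the usual balance threshold."""
    mean_t = np.average(x_treat, weights=w_treat)
    mean_c = np.average(x_ctrl, weights=w_ctrl)
    pooled_sd = np.sqrt((np.var(x_treat) + np.var(x_ctrl)) / 2)
    return (mean_t - mean_c) / pooled_sd

# Toy covariate: treated users were more engaged before any adjustment
rng = np.random.default_rng(2)
x_t = rng.normal(0.6, 1, 1_000)
x_c = rng.normal(0.0, 1, 1_000)

print(round(smd(x_t, x_c), 2))  # well above 0.1: clear imbalance
```

<p>In practice you would compute this for every covariate, before and after adjustment, and plot both sets of values side by side, which is exactly what a Love plot shows.</p>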
<h2 id="step5">Step 5: Bootstrap Confidence Intervals</h2>
<p>To quantify uncertainty, bootstrap the entire estimation process: resample users with replacement, re-estimate the propensity scores on each resample, and recalculate the treatment effect. The 2.5th and 97.5th percentiles of the bootstrapped effects form a 95% confidence interval.</p>
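<p>A compact sketch of that percentile bootstrap, pairing a logistic propensity model with the IPW estimator from Step 2. The helper names, the toy data, and the choice of 100 resamples are illustrative assumptions; note that the propensity model is refit inside the loop, so the interval reflects estimation uncertainty in the scores themselves:</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(outcome, treated, ps):
    ps = np.clip(ps, 0.01, 0.99)
    return (np.average(outcome, weights=treated / ps)
            - np.average(outcome, weights=(1 - treated) / (1 - ps)))

def bootstrap_ci(X, treated, outcome, n_boot=100, seed=0):
    """Resample users, refit the propensity model, recompute the IPW ATE."""
    rng = np.random.default_rng(seed)
    n, effects = len(outcome), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample users with replacement
        ps = (LogisticRegression(max_iter=1000)
              .fit(X[idx], treated[idx])
              .predict_proba(X[idx])[:, 1])
        effects.append(ipw_ate(outcome[idx], treated[idx], ps))
    return np.percentile(effects, [2.5, 97.5])

# Toy data: one confounder drives both opt-in and outcome; true effect is 2
rng = np.random.default_rng(3)
x = rng.normal(size=(3_000, 1))
p_opt_in = 1 / (1 + np.exp(-x[:, 0]))
treated = (rng.random(3_000) < p_opt_in).astype(float)
outcome = 1 + 2 * x[:, 0] + 2 * treated + rng.normal(0, 1, 3_000)

lo, hi = bootstrap_ci(x, treated, outcome)
print(round(lo, 2), round(hi, 2))  # the interval should sit near the true effect of 2
```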
<h2 id="failure">When Propensity Score Methods Fail</h2>
<p>Propensity score methods rely on the <em>unconfoundedness assumption</em>: no unmeasured confounders that affect both treatment and outcome. If a hidden variable (like user motivation) drives both opt-in and outcomes, the estimate remains biased. Also, extreme propensity scores (close to 0 or 1) can cause instability, and matching may fail if no similar controls exist. Always perform sensitivity analyses (e.g., E-value) to assess robustness.</p>
<h2 id="next">What to Do Next</h2>
<p>Propensity score methods are powerful but not a silver bullet. Combine them with other causal techniques (e.g., instrumental variables, difference-in-differences) when appropriate. For AI features behind toggles, always consider a randomized staged rollout (A/B test) if feasible. The companion notebook at <a href="https://github.com/RudrenduPaul/product-experimentation-causal-inference-genai-llm/tree/main/02_propensity_opt_in">GitHub</a> includes more advanced diagnostics and variations.</p>
<h2 id="conclusion">Conclusion</h2>
<p>When your product team celebrates a 21% lift from an AI feature, be skeptical—the Opt-In Trap may be inflating the numbers. Propensity score methods, applied correctly, can disentangle selection bias from true causal effects. This Python tutorial provides a reproducible framework for product experimentation teams to make better decisions about LLM-based features.</p>