The answer is Bayesian

Sid Ravinutala
Chief Data Scientist at IDinsight.
Jan 8, 2026 5 min read

Usually, I’d say the question doesn’t even matter. But here’s one that is topical:

“How do you measure impact when the treatment itself is a moving target, evolving alongside the control?”

This comes up constantly in AI evaluation conversations. Traditionally, A/B testing assumes your treatment is consistent across the population. But good tech solutions, and AI products in particular, are constantly evolving. You are (and should be) always experimenting to improve your product.

Let’s say you have built an AI-powered personal tutor app. In Release 1, you add gamification rewards. In Release 2, you add multimodal responses. In Release 3, you switch the backend from Gemini 2.5 to 3.0. The treatment effect of each release is likely quite different.

In addition to the treatment changing, your control group is also evolving. Your users have increasing access to other tools. While you are testing, the student population might get cheaper access to ChatGPT, or Gemini might launch a free guided learning feature.

Bayesian modeling shines when you have a complex data-generating process like this. Instead of forcing data into a rigid test, you can write out the process in terms of distributions and let the magic of the sampler do the heavy lifting. Let’s set up a toy problem to demonstrate this.

The Setup

We can simulate this scenario with the following parameters:

import numpy as np

# 1. Configuration
n_periods = 30        # Number of time periods you are measuring outcomes for
start_baseline = 50   # Starting value for learning levels

# Volatility parameters (standard deviations)
sigma_trend = 1.0   # How much the baseline (control) drifts each time period
sigma_effect = 0.5  # How much the treatment effect drifts each time period
sigma_obs = 2.0     # Measurement noise (scatter around the truth)

# 2. Generate Evolving Control (Random Walk)
trend_shocks = np.random.normal(0, sigma_trend, n_periods)
true_baseline = start_baseline + np.cumsum(trend_shocks)

# 3. Generate Evolving Treatment Effect (Random Walk)
# We start at 0. The effect will drift up and down randomly over time.
effect_shocks = np.random.normal(0, sigma_effect, n_periods)
true_tr_effect = np.cumsum(effect_shocks)

# 4. Generate Observed Data with Noise
n_samples = 5   # The number of kids whose learning levels you are measuring
# Baseline / Control
control_obs = np.random.normal(true_baseline.reshape(-1, 1), sigma_obs,
                               (n_periods, n_samples))
# Treatment = Control + Treatment Effect + Noise
treatment_obs = np.random.normal((true_baseline + true_tr_effect).reshape(-1, 1),
                                 sigma_obs, (n_periods, n_samples))

Here’s what our data looks like when plotted:

[Figure: simulated data — observed scores for treatment and control with the true underlying levels (top), and the true treatment effect (bottom)]

Note that we only observe the red and blue dots. These measurements might come from an in-app test you give your users, a proxy measure, or direct data collection. You collect them frequently, but each round involves only a handful of observations, so the collection burden stays light.

The top chart shows the true levels for the treatment and control, and the bottom chart shows the true treatment effect that we’d like to recover.

The Naive Approach

You could assume that each of your experiments is independent. In this view, you use only the measurements from a specific time period to calculate the effect. A naive estimator is simply the difference between the two sample means:

$$ \hat{\tau}_t = \bar{y}_{treatment,t} - \bar{y}_{control,t} $$

With a standard error calculated using Welch’s approach:

$$ SE_{\bar{y}_{treatment,t} - \bar{y}_{control,t}} = \sqrt{\frac{s_{treatment,t}^2}{n_{treatment,t}} + \frac{s_{control,t}^2}{n_{control,t}}}$$
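In code, this is a couple of lines of NumPy over the simulated arrays from the setup (a quick sketch; `naive_effect` and `naive_se` are names I'm introducing here):

# Naive per-period estimate: difference in sample means
naive_effect = treatment_obs.mean(axis=1) - control_obs.mean(axis=1)

# Welch standard error: both arms have n_samples observations per period
naive_se = np.sqrt(treatment_obs.var(axis=1, ddof=1) / n_samples
                   + control_obs.var(axis=1, ddof=1) / n_samples)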

The Bayesian model

Let’s simply write out the data-generating process as a state-space model. We assume what we observe comes from an underlying control level and treatment effect, plus some noise:

$$y_{treatment,t} \sim N \left( \mu_t + \tau_t, \sigma^2_{obs} \right) $$ $$y_{control,t} \sim N \left( \mu_t, \sigma^2_{obs} \right) $$

And the underlying treatment effect and control are evolving over time:

$$\mu_{t} \sim N \left( \mu_{t-1}, \sigma^2_{trend} \right) $$ $$\tau_{t} \sim N \left( \tau_{t-1}, \sigma^2_{effect} \right) $$

So what we really want to infer is the time-varying effect $\tau_t$.

You might cry foul because I’m modeling this with a Gaussian Random Walk—which is exactly how I generated the data. That’s fair. This is a toy example. In a real workflow, we would use more realistic assumptions (e.g., handling missing data points) and compare different process models.

I used PyMC to model this (check out the notebook for the code).
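The model translates almost line for line into PyMC. Here is a minimal sketch of that translation; the priors on the volatility parameters are my own illustrative choices, and the notebook may differ in its details:

import pymc as pm

with pm.Model() as model:
    # Priors on the volatility parameters (illustrative assumptions)
    sigma_trend_h = pm.HalfNormal("sigma_trend", 2.0)
    sigma_effect_h = pm.HalfNormal("sigma_effect", 1.0)
    sigma_obs_h = pm.HalfNormal("sigma_obs", 5.0)

    # Latent states: the evolving control level and treatment effect
    mu = pm.GaussianRandomWalk("mu", sigma=sigma_trend_h,
                               init_dist=pm.Normal.dist(50, 10),
                               shape=n_periods)
    tau = pm.GaussianRandomWalk("tau", sigma=sigma_effect_h,
                                init_dist=pm.Normal.dist(0, 1),
                                shape=n_periods)

    # Likelihood: observed scores scatter around the latent states
    pm.Normal("control", mu=mu[:, None], sigma=sigma_obs_h,
              observed=control_obs)
    pm.Normal("treatment", mu=(mu + tau)[:, None], sigma=sigma_obs_h,
              observed=treatment_obs)

    idata = pm.sample()

The posterior over tau then gives us the time-varying treatment effect, with credible intervals for free.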

Results

Here is what we get:

[Figure: estimated treatment effect over time — naive estimates with confidence intervals vs. Bayesian posterior with credible intervals, against the true effect]

The Bayesian estimates are not only more accurate, but the credible intervals are also significantly tighter.

Discussion

You can think of this as a dynamic Difference-in-Differences. Instead of assuming a constant fixed gap between treatment and control (the rigid parallel trends assumption), we allow that gap to evolve stochastically over time.

While the title is a little click-baitey, I don’t really want to step into the Bayesian vs. Frequentist debate. I want to highlight that there are statistical methods that allow treatment and control to evolve over time, and Bayesian models are an elegant way of doing that.

You could also fit a Kalman filter by Maximum Likelihood Estimation and stay solidly in the Frequentist domain. As with Kalman filters, the Bayesian model does well because it borrows strength from neighboring time periods: it knows that the app’s performance on Tuesday is likely similar to Monday’s, whereas the naive estimator treats every day as a completely new universe.
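As a sketch of that Frequentist route (not the method used in this post), one option is a local-level model fit to the per-period naive differences with statsmodels, which runs the Kalman filter under the hood and estimates the variances by MLE:

import statsmodels.api as sm

# Local-level (random walk + noise) model on the naive per-period
# differences computed earlier; the latent level is the treatment effect
kf_model = sm.tsa.UnobservedComponents(naive_effect, level="local level")
kf_res = kf_model.fit(disp=False)

# Smoothed (all-data) estimate of the time-varying treatment effect
smoothed_effect = kf_res.smoothed_state[0]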

None of this is strictly “new.” However, with the current energy around AI evals and the challenge of rapidly evolving treatments, I think the iron is hot for Bayesian models.