Three Stage Sampling

by: Doug Johnson

An earlier version of this blog post appeared on my personal blog.

Household surveys often involve more than one “stage” of sampling – e.g. in the first stage, we might randomly sample villages and in the second stage we might randomly sample households within these villages. Most often, we use two stages when sampling. Accounting for two sampling stages is pretty straightforward. In some cases, we might want to consider using three stages. Unfortunately, to my knowledge, there aren’t a lot of good resources on how to account for more than two stages when sampling. In this post, I’ll try to answer four questions:

When do you need to take into account both stages of clustering in a survey or evaluation?
How do you properly account for a three stage design when performing sample size / power calculations?
How should you estimate the inputs required for these calculations?
How do you properly account for a three stage design when analyzing data?

While this blog post is written mainly from the perspective of someone designing a household survey, most of the lessons are equally applicable to randomized evaluations, as I show at the end.

When do you need to take into account both stages?

Just because units exhibit some sort of clustering doesn’t mean that you need to account for clustering by adding a stage in your analysis. For example, if you randomly select households from a list of all households, it doesn’t matter that households are clustered in villages. Similarly, if you randomize students to receive an education intervention, it doesn’t matter that students are clustered in classes. You only need to take clustering into account when you randomly select (or assign) clusters. Even then, you can sometimes get away with ignoring the lower level of clustering and focusing just on the highest level (e.g. clustering only at the district level when you sample districts and then villages). We’ll see below why that makes sense. But generally, if you randomly select large clusters, then randomly select subclusters from within those clusters, and, finally, randomly select units from within the subclusters, you should take into account both stages of clustering. For example, if you randomly select districts, then randomly select villages within districts, then randomly select households within villages, you should take into account both the district and village stages.

How do you account for a three stage design?

Let’s first recap how one stage of clustering affects the variance of your estimator. Let’s say that you will use a two stage sampling strategy in which you will first randomly sample J clusters and then randomly sample K units from each cluster and want to estimate the mean of some variable y. Further assume that the total number of units per cluster does not vary and is pretty large. If values of y are correlated within each cluster, we can think of the values for y as being made up of a cluster component and an independent within-cluster component, i.e.

$$y_{j,k}=\eta_j+\phi_{j,k}$$

This allows us to calculate the variance of y as:

$$\sigma^2_y=\sigma^2_{\eta}+\sigma^2_{\phi}$$

And the variance of the mean as:

$$Var(\bar{y})=\frac{\sigma^2_{\eta}}{J}+\frac{\sigma^2_{\phi}}{JK}=\sigma^2_y\left(\frac{\rho}{J}+\frac{(1-\rho)}{JK}\right)$$

Where $\rho=\frac{\sigma^2_{\eta}}{\sigma^2_y}$ is the intra-cluster correlation (ICC). It’s also useful to calculate the design effect, or the ratio of the variance of this estimator to the variance of the estimator if the sample had been collected using simple random sampling (SRS). Since the variance under SRS would be $\frac{\sigma^2_y}{JK}$, the design effect is $1+(K-1)\rho$.
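To make the formula concrete, here is a minimal Stata sketch of the two stage design effect. The values of J, K, and rho are made up for illustration and do not come from any real survey.

```stata
* Hypothetical inputs: 50 clusters, 20 households per cluster, ICC of 0.05
* (run with no variables named J, K, rho, or deff in memory)
scalar J   = 50
scalar K   = 20
scalar rho = 0.05

* Design effect for a two stage design: 1 + (K - 1) * rho
scalar deff = 1 + (K - 1)*rho
display "Design effect: " deff

* Sample size under SRS that would give the same variance
display "Equivalent SRS sample size: " (J*K)/deff
```

With these inputs the design effect is 1.95, so the 1,000 sampled households carry roughly the same information as about half that many households drawn by SRS.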

Let’s now suppose that we have a higher level sampling stage. We first pick Q mega-clusters, then J clusters from each mega-cluster, and then K households from each cluster. As before, we can think of the values of y as being made up of three components:

$$y_{q,j,k}=\gamma_q+\eta_{q,j}+\phi_{q,j,k}$$

The variance of y is then:

$$\sigma^2_y=\sigma^2_{\gamma}+\sigma^2_{\eta}+\sigma^2_{\phi}$$

And the variance of the mean is:

$$Var(\bar{y})=\frac{\sigma^2_{\gamma}}{Q}+\frac{\sigma^2_{\eta}}{QJ}+\frac{\sigma^2_{\phi}}{QJK}=\sigma^2_y\left( \frac{\rho_{\gamma}}{Q}+\frac{\rho_{\eta}}{QJ}+\frac{(1-\rho_{\gamma}-\rho_{\eta})}{QJK} \right)$$

Where $\rho_{\eta}=\frac{\sigma^2_{\eta}}{\sigma^2_y}$ and $\rho_{\gamma}=\frac{\sigma^2_{\gamma}}{\sigma^2_y}$. For our three stage sampling design, the design effect is:

$$DEFF=1+(JK-1)\rho_{\gamma}+(K-1)\rho_{\eta}$$
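For readers who want the intermediate step, dividing the variance of the mean by the SRS variance $\frac{\sigma^2_y}{QJK}$ and collecting terms gives:

$$DEFF=\frac{Var(\bar{y})}{\sigma^2_y/(QJK)}=JK\rho_{\gamma}+K\rho_{\eta}+(1-\rho_{\gamma}-\rho_{\eta})=1+(JK-1)\rho_{\gamma}+(K-1)\rho_{\eta}$$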

This also shows why just looking at the most aggregate level of clustering is usually pretty reasonable – assuming the two ICCs are relatively similar in size, the adjustment to the variance will be driven primarily by the most aggregate level of clustering.
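A quick numerical illustration of this point, again as a Stata sketch with made-up inputs (5 clusters per mega-cluster, 10 households per cluster, and both ICCs set to 0.05):

```stata
* Hypothetical design: J clusters per mega-cluster, K households per cluster
scalar J = 5
scalar K = 10

* Hypothetical ICCs, deliberately set equal at the two levels
scalar rho_gamma = 0.05
scalar rho_eta   = 0.05

* Contribution of each level of clustering to the design effect
scalar mega_term    = (J*K - 1)*rho_gamma
scalar cluster_term = (K - 1)*rho_eta

display "Mega-cluster term: " mega_term
display "Cluster term:      " cluster_term
display "DEFF:              " 1 + mega_term + cluster_term
```

Here the mega-cluster term is 2.45 and the cluster term is 0.45, for a design effect of 3.9. Even with identical ICCs, the most aggregate level accounts for most of the adjustment because it is multiplied by JK rather than K.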

The formula above ignores the finite population correction (FPC). With multi-stage sampling, we often want to take into account the finite population correction for at least one stage. We can do this using the following formula:

$$DEFF=1+\left(f_{\gamma}JK-f_{\phi}\right)\rho_{\gamma}+\left(f_{\eta}K-f_{\phi}\right)\rho_{\eta}$$

Where $f_{\gamma}$, $f_{\eta}$, and $f_{\phi}$ are the finite population corrections at each stage. For example, $f_{\gamma}=1-\frac{Q}{\sum{Q}}$ – i.e. 1 minus the proportion of total mega-clusters sampled. If you wish to ignore any of the FPCs, you can replace them with 1.
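As a rough sketch of how this plays out, the Stata snippet below applies the FPC at the mega-cluster stage only and sets the other two corrections to 1. All of the numbers, including the assumed population of 40 mega-clusters, are hypothetical.

```stata
* Hypothetical design: 20 of 40 mega-clusters sampled, 5 clusters each, 10 households each
scalar Q       = 20
scalar Q_total = 40
scalar J       = 5
scalar K       = 10

* Hypothetical ICCs
scalar rho_gamma = 0.05
scalar rho_eta   = 0.05

* FPC at the mega-cluster stage; the other two stages are ignored (set to 1)
scalar f_gamma = 1 - Q/Q_total
scalar f_eta   = 1
scalar f_phi   = 1

* Design effect with the finite population corrections
scalar deff_fpc = 1 + (f_gamma*J*K - f_phi)*rho_gamma + (f_eta*K - f_phi)*rho_eta
display "DEFF with FPC at the mega-cluster stage: " deff_fpc
```

With half of the mega-clusters sampled, the design effect falls from 3.9 (no correction) to 2.65, which shows how much the first stage FPC can matter when the sampling fraction is large.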

How should you estimate the inputs required for these calculations?

Estimating ICCs at each level of clustering is a bit tricky. You need to find a dataset that has both levels of clustering that you are interested in as well as the variable you are interested in (or a similar variable).

If you have such a dataset, you can estimate the ICC components using Stata’s xtmixed command followed by the user-written iccvar command. Specify a random effect for each level of clustering you are interested in, as well as for each level of clustering used in the sampling design of the survey that produced the data. For example, suppose your data come from a survey that used a two stage sampling design in which PSUs were selected and households were then randomly selected within PSUs. Suppose also that the data include the mega-cluster ID and cluster ID, and that PSUs are nested within clusters. Then you could estimate these components using the following commands:

xtmixed y_var || megacluster_id: || cluster_id: || psu_id:

iccvar

Note that the standard errors on the estimates of the components are probably going to be pretty large so it may be useful to run some sensitivity analyses on your estimates.
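One simple way to run such a sensitivity analysis is to recompute the design effect over a grid of plausible ICC values. The sketch below does this in Stata. The grid values and the design (J = 5, K = 10) are placeholders that you would replace with your own estimates and their confidence bounds.

```stata
* Hypothetical design: J clusters per mega-cluster, K households per cluster
local J = 5
local K = 10

* Loop over a grid of candidate ICC values at each level
foreach rho_g in 0.01 0.05 0.10 {
    foreach rho_e in 0.01 0.05 0.10 {
        * Three stage design effect for this pair of ICCs
        local deff = 1 + (`J'*`K' - 1)*`rho_g' + (`K' - 1)*`rho_e'
        display "rho_gamma = `rho_g'  rho_eta = `rho_e'  DEFF = " %5.2f `deff'
    }
}
```

If the design effect (and hence the required sample size) swings widely across the grid, it is probably worth budgeting for the more pessimistic values.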

How do you properly account for a three stage design when analyzing data?

Unfortunately, most Stata commands only allow for a single stage of clustering. To account for two or more stages of clustering, you need to first “svyset” your data and then use the “svy” prefix before running any command.
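As a minimal sketch, declaring a three stage design and estimating a mean might look like the following. The ID and outcome variable names follow the xtmixed example above, while wt, n_mega, and n_clus are hypothetical variables that you would need to construct yourself.

```stata
* wt     = sampling weight (hypothetical variable)
* n_mega = total number of mega-clusters in the population (hypothetical variable)
* n_clus = total number of clusters in each mega-cluster (hypothetical variable)
* _n in the last stage means households are the final sampling units
svyset megacluster_id [pweight = wt], fpc(n_mega) || cluster_id, fpc(n_clus) || _n

* The svy: prefix makes the estimation command use the declared design
svy: mean y_var

* Report the design effect implied by the declared design
estat effects
```

One caveat: Stata's documentation for svyset indicates that when no fpc() is supplied for the first stage, that stage is treated as sampled with replacement and the later stages are ignored for variance estimation. If you leave the FPCs out, check the note that svyset prints to see which stages are actually being used.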

Dealing with three stage random assignment

The advice above is tailored to someone performing sample size calcs for a survey with a three stage design. All of the advice holds true for power calcs as well. You just need to multiply the final variance by 2 (since you are comparing the means of 2 groups, treatment and control, and, assuming equal-sized groups, the variance of the difference is the sum of the two variances) and then use the standard adjustment to the standard error for power calcs – i.e. instead of multiplying the standard error by 1.96 to create a 95% confidence interval, you multiply it by ~2.8 (1.96 + 0.84) to calculate an MDE for alpha .05 and power .8.
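Putting the pieces together, here is a rough Stata sketch of an MDE calculation for a three stage design. It assumes two equal-sized arms, ignores the FPCs, and uses made-up inputs, with the outcome standardized to have variance 1.

```stata
* Hypothetical design per arm: Q mega-clusters, J clusters each, K households each
scalar Q = 20
scalar J = 5
scalar K = 10

* Hypothetical variance of y (standardized) and ICCs
scalar sigma2_y  = 1
scalar rho_gamma = 0.05
scalar rho_eta   = 0.05

* Variance of the mean within one arm: SRS variance times the design effect
scalar deff     = 1 + (J*K - 1)*rho_gamma + (K - 1)*rho_eta
scalar var_mean = sigma2_y*deff/(Q*J*K)

* Variance of the difference in means with two equal-sized arms
scalar var_diff = 2*var_mean

* Multiplier for alpha .05 (two-sided) and power .8, roughly 2.8
scalar mult = invnormal(0.975) + invnormal(0.80)

display "MDE in standard deviations of y: " mult*sqrt(var_diff)
```

With these inputs the MDE comes out at roughly 0.25 standard deviations.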
