Cluster Sampling — AnalyticsScholar

Definition

Cluster sampling is a probability sampling technique in which the population is divided into clusters, often natural groupings such as schools, neighbourhoods, or hospitals, and a random sample of clusters is selected. Either all individuals within the selected clusters are included in the study (one-stage cluster sampling), or a further random sample is drawn from each selected cluster (two-stage cluster sampling).
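The selection procedure can be sketched in a few lines of Python. This is a minimal illustration, not a production sampler: the function name cluster_sample, its parameters, and the toy population of schools are all hypothetical, and clusters are assumed to be drawn with equal probability and without replacement.

```python
import random

def cluster_sample(clusters, n_clusters, n_per_cluster=None, seed=None):
    """Select clusters at random; optionally subsample within each.

    clusters: dict mapping a cluster id to a list of its members.
    n_per_cluster: None keeps every member (one-stage); an integer draws
    a simple random sample of up to that many members (two-stage).
    """
    rng = random.Random(seed)
    # Stage 1: simple random sample of cluster ids, without replacement.
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = {}
    for cid in chosen:
        members = clusters[cid]
        if n_per_cluster is None:
            # One-stage: take everyone in the selected cluster.
            sample[cid] = list(members)
        else:
            # Stage 2: simple random sample of members within the cluster.
            k = min(n_per_cluster, len(members))
            sample[cid] = rng.sample(members, k)
    return sample

# Hypothetical population: 50 schools with 20 pupils each.
schools = {s: [f"school{s}-pupil{p}" for p in range(20)] for s in range(50)}

one_stage = cluster_sample(schools, n_clusters=10, seed=1)
two_stage = cluster_sample(schools, n_clusters=10, n_per_cluster=5, seed=1)
```

In the one-stage call every pupil in each of the 10 selected schools is kept; in the two-stage call only 5 pupils per selected school are kept.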

Why It Matters

When a population is geographically dispersed or lacks a complete sampling frame, individually identifying and contacting every member is impractical or prohibitively expensive. Cluster sampling reduces logistical costs by concentrating data collection within selected clusters, and it needs only a list of clusters rather than a list of individuals. It is widely used for national household surveys, school-based studies, and healthcare audits in developing regions.

Example

A public-health team wants to assess vaccination coverage across a rural province with 500 villages. Instead of attempting to reach every household, they randomly select 30 villages and survey all households within those villages. This one-stage cluster design dramatically reduces travel time and cost. However, the team must account for intra-cluster correlation in their analysis: households within the same village tend to have similar vaccination behaviours, so treating them as independent observations would understate the standard errors.
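The cost of intra-cluster correlation is often summarised by the design effect, commonly approximated as 1 + (m - 1) * rho, where m is the average cluster size and rho is the intra-cluster correlation coefficient. The sketch below applies that approximation to the village example. The figures of 40 households per village and rho = 0.05 are assumptions made purely for illustration and do not come from the example, and the formula itself assumes clusters of roughly equal size.

```python
def design_effect(avg_cluster_size, icc):
    """Approximate design effect: 1 + (m - 1) * rho."""
    return 1 + (avg_cluster_size - 1) * icc

def effective_sample_size(n_total, avg_cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(avg_cluster_size, icc)

# Assumed values for illustration only.
n_villages = 30
households_per_village = 40   # hypothetical average cluster size
icc = 0.05                    # hypothetical intra-cluster correlation

n_total = n_villages * households_per_village
deff = design_effect(households_per_village, icc)
n_eff = effective_sample_size(n_total, households_per_village, icc)

print(f"design effect: {deff:.2f}")
print(f"effective sample size: {n_eff:.0f} of {n_total}")
```

Under these assumed inputs the design effect is 2.95, so the 1,200 surveyed households carry roughly as much information as 407 independently sampled ones.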

Related Terms

Random Sampling
Stratified Sampling
Multilevel Modelling (used to analyse clustered data)
Sample Size

Software Notes

SPSS: Cluster sampling is a design feature, not a built-in selection command. For analysis of cluster-sampled data, use the Complex Samples module: Analyze > Complex Samples > Prepare for Analysis to define clusters and weights.
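R: survey::svydesign(ids = ~cluster, data = df) declares a cluster-sampled design; supply weights = ~weight as well when sampling weights are available, otherwise equal selection probabilities are assumed. survey::svymean(~variable, design) computes means with design-based standard errors. Use survey::svyglm() for regression.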
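Stata: svyset cluster declares the primary sampling unit. svyset cluster [pw=weight], strata(strata) adds sampling weights and stratification; for multi-stage designs, later-stage sampling units are listed after ||. svy: mean variable or svy: regress y x then produce design-based standard errors that account for clustering.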
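To show what a design-based standard error involves, the following Python sketch estimates a mean from a one-stage cluster sample and compares its standard error with the naive one. It is an illustration under stated assumptions, not a reproduction of any package's internals: clusters are assumed to be drawn with equal probability, no finite population correction is applied, the variance uses a with-replacement linearisation of the ratio estimator, and the function names and data are hypothetical.

```python
import math

def cluster_mean_se(cluster_values):
    """Mean and standard error from a one-stage cluster sample.

    cluster_values: list of lists, one inner list of observations per
    sampled cluster. Assumes equal-probability selection of clusters and
    ignores any finite population correction.
    """
    n = len(cluster_values)                     # number of sampled clusters
    totals = [sum(c) for c in cluster_values]   # cluster totals
    sizes = [len(c) for c in cluster_values]    # cluster sizes
    mean = sum(totals) / sum(sizes)             # ratio estimator of the mean
    # Residual of each cluster total around what the overall mean predicts.
    resid = [y - mean * m for y, m in zip(totals, sizes)]
    var = n / (n - 1) * sum(e * e for e in resid) / sum(sizes) ** 2
    return mean, math.sqrt(var)

def naive_se(cluster_values):
    """Standard error that wrongly treats all observations as independent."""
    flat = [x for c in cluster_values for x in c]
    k = len(flat)
    mean = sum(flat) / k
    var = sum((x - mean) ** 2 for x in flat) / (k - 1)
    return math.sqrt(var / k)

# Hypothetical data: 1 = vaccinated, 0 = not, in four sampled villages.
data = [[1, 1, 1, 1], [0, 0, 0, 1], [1, 1, 1, 0], [0, 0, 0, 0]]

mean, se = cluster_mean_se(data)
print(f"mean = {mean:.2f}, cluster SE = {se:.3f}, naive SE = {naive_se(data):.3f}")
```

Because vaccination status in this toy data is similar within villages, the cluster-based standard error is noticeably larger than the naive one, which is the pattern the survey procedures listed above are designed to capture.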