# Methods for selecting clusters

Clusters must be selected randomly. The first stage in selecting clusters is generally based on a comprehensive listing of all primary sampling units (PSUs). For household surveys, PSUs often have the same boundaries as census enumeration areas (EAs). The PSUs are often referred to as “clusters” because the survey elements, namely households, are clustered within the PSU.

For surveys that concern attendees of government facilities, PSUs can be defined as all government-run health facilities. For primary school-based surveys, PSUs could be all primary schools, including private and religious schools; the estimates obtained would then represent only children who attend school.

In household-based surveys, the comprehensive listing of PSUs would require the population size or the number of households within each cluster. In clinic-based surveys, the comprehensive listing of PSUs would require the number of clinic enrollees, while in school-based surveys, the number of students enrolled in and regularly attending each school is necessary for the PSU listing.

If relatively accurate data on population size are available, then the preferred method for selecting clusters is the probability proportional to size (PPS) method. If reasonably reliable population data are not available, then either simple random sampling or systematic sampling (SS) of clusters could be used. Each of these methods is described in more detail in the following sections.

## PPS method

Using the PPS method, the likelihood of a PSU being selected is proportional to the size of its population (the number of individuals or households). Thus, larger PSUs are more likely to be selected than smaller ones.

The PPS method starts by obtaining the “best available” census data for all the PSUs in the geographic area to be surveyed. This information is usually available from the government agency responsible for the census, such as a national bureau of statistics. The list from which survey PSUs are selected must cover the whole area intended to be represented by the survey estimates. If it is a nationally representative survey, all national households must be represented in the list.

Depending on the country, PSUs may cover relatively small geographic areas, with a population size between 100 and 1000 individuals or between 20 and 200 households. It is important to confirm the sizes of PSUs, as there may be circumstances in which there are not enough potential survey units in one PSU to meet the required sample size. In those cases, two nearby PSUs should be combined to form a single one, prior to the selection process.
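The size check described above can be sketched as follows. This is only an illustration under stated assumptions: the PSU list is taken to be in geographic order, so that neighbours on the list are physically nearby, and the function name `combine_small_psus`, the threshold `min_size`, and the example sizes are all hypothetical.

```python
def combine_small_psus(psus, min_size):
    """Merge each undersized PSU with the next PSU on the list.

    psus: list of (name, size) tuples in geographic order (assumption:
    list neighbours are nearby PSUs). min_size is a hypothetical threshold.
    """
    combined = []
    pending = None  # an undersized PSU waiting to be merged forward
    for name, size in psus:
        if pending is not None:
            name, size = f"{pending[0]}+{name}", pending[1] + size
            pending = None
        if size < min_size:
            pending = (name, size)          # still too small, carry forward
        else:
            combined.append((name, size))
    if pending is not None:                 # last PSU on the list is too small
        if combined:
            last_name, last_size = combined.pop()
            combined.append((f"{last_name}+{pending[0]}", last_size + pending[1]))
        else:
            combined.append(pending)
    return combined


# Hypothetical listing: B and D are below the threshold of 20 households
print(combine_small_psus([("A", 30), ("B", 10), ("C", 25), ("D", 5)], min_size=20))
```

Merging should happen before the selection process so that the combined unit enters the listing as a single PSU with its combined size.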

To use the PPS method to select PSUs, first create a table with four columns, as is shown in Box 6.1:

- The first column lists the name or code of each PSU. As a general rule, it is best for the list to be in geographic order and organized by urban, rural, district, and province (implicit stratification).
- The second column contains the population size of each PSU.
- The third column contains the cumulative population that is obtained by adding the population of each PSU to the cumulative population of PSUs preceding it on the list.
- The fourth column indicates selected clusters within PSUs.
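As a minimal sketch, the first three columns of such a table can be built from a list of PSU names and population sizes. The function name `build_pps_table` is an illustrative assumption; the three rows are the first three PSUs listed in Box 6.1.

```python
from itertools import accumulate


def build_pps_table(psus):
    """Return rows of (name, population, cumulative population)."""
    names = [name for name, _ in psus]
    populations = [pop for _, pop in psus]
    cumulative = list(accumulate(populations))   # running total down the list
    return list(zip(names, populations, cumulative))


rows = build_pps_table([("Utural", 600), ("Mina", 700), ("Bolama", 350)])
for name, pop, cum in rows:
    print(f"{name:<10}{pop:>6}{cum:>8}")
```

The fourth column (selected clusters) is filled in only after the sampling interval and random start have been applied.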

A sampling interval (labeled as “*k*”) is obtained by dividing the total population size by the number of PSUs to be selected for the survey. A random number between 1 and the sampling interval (*k*) is chosen, and the PSU whose cumulative population range contains this number is the first PSU selected. The value of *k* is then added to this number to identify the second PSU, and *k* continues to be added cumulatively until the desired number of clusters has been chosen. Note that the last selection number should fall less than *k* away from the end of the cumulative population listing.
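The selection procedure can be sketched in a few lines of Python. This is an illustration rather than a reference implementation: the function name `pps_select` and the five-PSU listing are assumptions, and the random start is fixed in the example only to make the result reproducible.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def pps_select(populations, n_clusters, start=None, rng=random):
    """Systematic PPS selection over a list of PSU population sizes.

    Returns (k, hits), where hits is a list of (selection_number, psu_index)
    pairs. A PSU index can appear more than once if the PSU is large.
    """
    cumulative = list(accumulate(populations))
    k = cumulative[-1] // n_clusters           # sampling interval, rounded down
    if k < 1:
        raise ValueError("more clusters requested than individuals listed")
    if start is None:
        start = rng.randint(1, k)              # random start, 1..k inclusive
    hits = []
    for i in range(n_clusters):
        number = start + i * k                 # add k cumulatively
        # first PSU whose cumulative population reaches the selection number
        hits.append((number, bisect_left(cumulative, number)))
    return k, hits


# Hypothetical five-PSU listing, three clusters, fixed start
k, hits = pps_select([600, 700, 350, 680, 430], n_clusters=3, start=710)
print(k, hits)
```

Looking up each selection number with a binary search over the cumulative column mirrors reading down the third column of the table by hand.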

Where there is a large number of PSUs, the selection process is usually performed using a computer. For SAS users, the PROC SURVEYSELECT command has an option to select data using PPS. With SPSS, the optional Complex Samples module has a “Select Sample…” option. Use of spreadsheets and appropriate formulae is another method for performing the selection.

The “How to select PSUs” online tool contains an extract from a spreadsheet with instructions on how to select PSUs using the PPS method. Box 6.1 also shows an example of how this is done.

## Box 6.1. Small-scale example of PPS selection of clusters from a listing of PSUs

Step 1: Calculate the sampling interval (k) by dividing the total population by the number of clusters to be surveyed. In this example, the total population is 24 940 and the number of clusters to be surveyed is 30, so the sampling interval is 24 940 ÷ 30 ≈ 831.3, which is rounded down to 831 people. Always round down to the nearest whole number.

Step 2: Use a random number table or generator to determine a random starting point between 1 and the sampling interval (k). In this example, where the sampling interval is 831, the number 710 was randomly selected as the starting point.

Step 3: Using the cumulative population column, find the PSU that contains individual number 710; this is the first cluster. In this example, the first cluster is in the PSU of Mina because its cumulative range covers individual 601 to individual 1300.

Step 4: Continue to assign clusters by adding 831 (k) cumulatively. For example, the second cluster will be in the PSU where the value 1541 is located (710 + 831 = 1541), which is Bolama. The third cluster is where the value 2372 is located (1541 + 831 = 2372), and so on. In PSUs with large populations, more than one cluster could be selected. Note that if two clusters are selected in the same PSU (in this case Hilandia), the survey team will divide the PSU area into two sections of approximately equal population size and treat each section as an independent cluster. Similarly, if three or more clusters fall in one PSU (for example, Cococopa), the PSU would be divided into three or more sections (clusters) of approximately equal population size.

| PSU no. | PSU name | Population | Cumulative population | Cluster(s) selected |
|---|---|---|---|---|
| 1 | Utural | 600 | 600 | |
| 2 | Mina | 700 | 1,300 | 1 |
| 3 | Bolama | 350 | 1,650 | 2 |
| 4 | Taluma | 680 | 2,330 | |
| 5 | War-Yali | 430 | 2,760 | 3 |
| 6 | Galey | 220 | 2,980 | |
| 7 | Tarum | 40 | 3,020 | |
| 8 | Hamtato | 150 | 3,170 | 4 |
| 9 | Nayjaff | 90 | 3,260 | |
| 10 | Nuviya | 300 | 3,560 | |
| 11 | Cattical | 430 | 3,990 | 5 |
| 12 | Paralai | 150 | 4,140 | 5 |
| 13 | Egala-Kuru | 380 | 4,520 | |
| 14 | Uwanarpo | 310 | 4,830 | 6 |
| 15 | Hilandia | 2,000 | 6,830 | 7, 8 |
| 16 | Assosa | 750 | 7,580 | 9 |
| 17 | Dimma | 250 | 7,830 | |
| 18 | Aisha | 420 | 8,250 | 10 |
| 19 | Nam Yao | 180 | 8,430 | |
| 20 | Mai Jarim | 300 | 8,730 | |
| 21 | Pua | 100 | 8,830 | |
| 22 | Gambela | 710 | 9,540 | 11 |
| 23 | Fugnido | 190 | 9,730 | 12 |
| 24 | Degeh Bur | 150 | 9,880 | 12 |
| 25 | Mezan | 450 | 10,330 | |
| 26 | Banvinai | 400 | 10,730 | 13 |
| 27 | Purantna | 220 | 10,950 | |
| 28 | Kegalni | 140 | 11,090 | |
| 29 | Hamali-Ura | 80 | 11,170 | |
| 30 | Kameni | 410 | 11,580 | 14 |
| 31 | Kiroya | 280 | 11,860 | |
| 32 | Yanwela | 330 | 12,190 | |
| 33 | Bagvi | 440 | 12,630 | 15 |
| 34 | Atota | 320 | 12,950 | |
| 35 | Kogouva | 120 | 13,070 | 16 |
| 36 | Ahekpa | 60 | 13,130 | |
| 37 | Yondot | 320 | 13,450 | |
| 38 | Nozop | 1,780 | 15,230 | 17, 18 |
| 39 | Mapazko | 390 | 15,620 | 19 |
| 40 | Lotohah | 1,500 | 17,120 | 20 |
| 41 | Voattigan | 960 | 18,080 | 21, 22 |
| 42 | Plitok | 420 | 18,500 | |
| 43 | Dopoltan | 270 | 18,770 | |
| 44 | Cococopa | 3,500 | 22,270 | 23, 24, 25, 26, 27 |
| 45 | Famegzi | 400 | 22,670 | |
| 46 | Jigpelay | 210 | 22,880 | |
| 47 | Mewoah | 50 | 22,930 | |
| 48 | Odigla | 350 | 23,280 | 28 |
| 49 | Sanbati | 1,440 | 24,720 | 29 |
| 50 | Andidwa | 220 | 24,940 | 30 |

In this example, there are only 50 PSUs in the listing. In practice, the number of PSUs will be much larger. The spreadsheet and formula used to generate the table above are shown in the “How to select PSUs” online tool.
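The arithmetic in Steps 1 to 4 of Box 6.1 can be checked with a short script. It is a sketch that covers only the first three selection numbers and the first five PSUs of the listing; the variable names are illustrative.

```python
from bisect import bisect_left
from itertools import accumulate

total_population = 24940
n_clusters = 30
k = total_population // n_clusters        # 24 940 / 30 = 831.33..., rounded down
start = 710                               # random start used in Box 6.1

# First five PSUs of the Box 6.1 listing (name, population)
psus = [("Utural", 600), ("Mina", 700), ("Bolama", 350),
        ("Taluma", 680), ("War-Yali", 430)]
cumulative = list(accumulate(pop for _, pop in psus))

# Selection numbers for the first three clusters, then the PSU holding each one
first_three = [start + i * k for i in range(3)]
located = [psus[bisect_left(cumulative, n)][0] for n in first_three]
print(k, first_three, located)
# 831 [710, 1541, 2372] ['Mina', 'Bolama', 'War-Yali']
```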

## Random and systematic selection of clusters (where population size is inaccurate or unknown)

When a list of PSUs is available but the population size for each PSU is not known or could be very inaccurate, simple random sampling (SRS) or systematic sampling (SS) may be used. SRS means that the predetermined number of PSUs is randomly selected from the total list of PSUs. In SS, sampling is based on the sequential numbering of PSUs rather than on population size, and selection proceeds according to a random starting point and a fixed sampling interval (*k*). In this method of sampling, *k* is calculated by dividing the total number of PSUs by the desired number of PSUs. The value of *k* need not be a whole number: it is used unrounded to generate the selection numbers, and it is the selection numbers that are then rounded to identify the PSUs (see Box 6.2). Many software packages are available that can easily select the number of PSUs desired.
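For simple random sampling, a minimal sketch using Python's standard library is shown here. The function name `simple_random_sample` and the `seed` argument are assumptions added for illustration; the figures of 50 PSUs and 20 to be selected are borrowed from Box 6.2.

```python
import random


def simple_random_sample(n_psus, n_select, seed=None):
    """Draw n_select distinct PSU numbers from 1..n_psus without replacement."""
    rng = random.Random(seed)              # seed only makes the draw repeatable
    return sorted(rng.sample(range(1, n_psus + 1), n_select))


selected = simple_random_sample(50, 20, seed=1)
print(selected)
```

Recording the seed (or the drawn list itself) lets the selection be documented and reproduced later.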

As is done in the PPS method, a random integer between 1 and the sampling interval (*k*) is chosen as the initial PSU, and the value of the sampling interval (*k*) is added to this PSU number to select the second PSU number. Once the list of selected PSU numbers is completed, they should be rounded down as needed to identify the actual PSU to select. See Box 6.2 for an example.
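Systematic selection with a fractional interval can be sketched as follows. The function name `systematic_sample` is an assumption; `Fraction` is used so that repeated addition of a non-integer *k* stays exact. The example call uses the figures from Box 6.2 (50 PSUs, 20 to select, random start of 2).

```python
import math
import random
from fractions import Fraction


def systematic_sample(n_psus, n_select, start=None, rng=random):
    """Return the PSU numbers chosen by systematic sampling."""
    k = Fraction(n_psus, n_select)             # sampling interval, may be fractional
    if start is None:
        start = rng.randint(1, math.floor(k))  # random integer between 1 and k
    # add k cumulatively to the unrounded value, then round each result down
    return [math.floor(start + i * k) for i in range(n_select)]


print(systematic_sample(50, 20, start=2))
# [2, 4, 7, 9, 12, 14, 17, 19, 22, 24, 27, 29, 32, 34, 37, 39, 42, 44, 47, 49]
```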

To be able to analyse the data collected with some adjustment for population size, an estimate of the population size in each selected PSU should be collected when the survey team arrives on site. Typically, a mini census is conducted to determine this number. Definitions of households and an explanation of how to select households and individuals from within selected clusters are provided in more detail in **Module 7: Selecting households and participants**.

If households (or other survey units) are randomly selected using the same method within a cluster, they all have equal weight within that cluster. If, in addition, PSUs are selected using the PPS method at the first stage and an equal number of households is selected in each cluster, the result is a self-weighted (equally weighted) sample of units within the stratum. All households in a stratum then have the same probability of selection regardless of which PSU they are located in.
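The self-weighting property can be illustrated numerically. The sketch below assumes that PSU size is measured in households, that no PSU is larger than the sampling interval (so its first-stage selection probability is the number of clusters times its share of the total), and that a fixed number of households is drawn per cluster. The function name and the PSU sizes are hypothetical.

```python
from fractions import Fraction


def selection_probability(psu_size, total_size, n_clusters, per_cluster):
    """Overall probability that one household ends up in the sample."""
    p_psu = Fraction(n_clusters * psu_size, total_size)  # first stage: PPS
    p_household = Fraction(per_cluster, psu_size)        # second stage: equal take
    return p_psu * p_household                           # psu_size cancels out


sizes = [120, 80, 150, 50]        # hypothetical households per PSU
total = sum(sizes)
probs = {selection_probability(s, total, n_clusters=2, per_cluster=10) for s in sizes}
print(probs)                      # a single value: every household has the same chance
```

Because the PSU size appears in the numerator at the first stage and in the denominator at the second, it cancels, which is why the sample is self-weighted.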

Implicit stratification spreads the sample evenly among geographically important subgroups of the population, such as urban or rural areas, or administrative regions. The process involves arranging the PSUs in geographic order, such as urban by province, within each province by district, followed by rural by province, then within each province by district before systematically applying the PPS method.
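The ordering step can be sketched as a sort on three keys. The field names (`area`, `province`, `district`) and the PSU records are hypothetical; only the ordering rule (urban before rural, then province, then district) comes from the description above.

```python
AREA_ORDER = {"urban": 0, "rural": 1}   # urban PSUs are listed first

# Hypothetical PSU records
psus = [
    {"name": "P1", "area": "rural", "province": "North", "district": "B"},
    {"name": "P2", "area": "urban", "province": "South", "district": "A"},
    {"name": "P3", "area": "urban", "province": "North", "district": "B"},
    {"name": "P4", "area": "rural", "province": "North", "district": "A"},
    {"name": "P5", "area": "urban", "province": "North", "district": "A"},
]

# Sort by area type, then province, then district within province
ordered = sorted(
    psus,
    key=lambda p: (AREA_ORDER[p["area"]], p["province"], p["district"]),
)
print([p["name"] for p in ordered])
# ['P5', 'P3', 'P2', 'P4', 'P1']
```

Applying the systematic PPS selection to a list sorted in this way is what spreads the selected clusters across the subgroups.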

## Box 6.2. Small-scale example of systematic sampling of clusters from a listing of PSUs

Step 1: Obtain the list of the PSUs and number them from 1 to the total number of PSUs. In this example there are 50 PSUs.

Step 2: The number of PSUs to sample should have already been determined. In this example it is 20.

Step 3: Calculate the sampling interval (k) by dividing the total number of PSUs by the number to be sampled. In this example, there are 50 PSUs, of which 20 should be sampled, thus the sampling interval k is 50 ÷ 20 = 2.5.

Step 4: Using a random number table or generator, select an integer between 1 and k. Go to the PSU list and include the PSU with that number as the first selected PSU. In this example, the first selected PSU is number 2.

Step 5: Select the subsequent PSUs by adding k cumulatively to the previous unrounded value, then rounding down to the nearest whole number. In this example, the second value is 2 + 2.5 = 4.5, which is rounded down to PSU number 4, and the third is 4.5 + 2.5 = 7, giving PSU number 7. The fourth is 7 + 2.5 = 9.5, rounded down to PSU number 9. This table shows the 20 PSUs selected from PSUs numbered 1 to 50.

| Number | PSU name | Selected |
|---|---|---|
| 1 | Utural | |
| 2 | Mina | x-1 |
| 3 | Bolama | |
| 4 | Taluma | x-2 |
| 5 | War-Yelo | |
| 6 | Galey | |
| 7 | Tarum | x-3 |
| 8 | Hamtato | |
| 9 | Nayjaff | x-4 |
| 10 | Nuviya | |
| 11 | Cattical | |
| 12 | Paralia | x-5 |
| 13 | Egala-Kuru | |
| 14 | Uwanarpol | x-6 |
| 15 | Hilandia | |
| 16 | Assosa | |
| 17 | Dimma | x-7 |
| 18 | Aisha | |
| 19 | Nam Yao | x-8 |
| 20 | Mai Jarim | |
| 21 | Ppua | |
| 22 | Gambela | x-9 |
| 23 | Fugnido | |
| 24 | Degeh Bur | x-10 |
| 25 | Mezan | |
| 26 | Ban Vinai | |
| 27 | Puratna | x-11 |
| 28 | Kegalni | |
| 29 | Hamali-Ura | x-12 |
| 30 | Kameni | |
| 31 | Kiroya | |
| 32 | Yanwela | x-13 |
| 33 | Bagvi | |
| 34 | Atota | x-14 |
| 35 | Kogouva | |
| 36 | Ahekpa | |
| 37 | Yondot | x-15 |
| 38 | Nozop | |
| 39 | Mapazoko | x-16 |
| 40 | Lotohah | |
| 41 | Voattigan | |
| 42 | Plitok | x-17 |
| 43 | Dopltan | |
| 44 | Cococopa | x-18 |
| 45 | Famegzi | |
| 46 | Jigpley | |
| 47 | Mewoah | x-19 |
| 48 | Odigla | |
| 49 | Sanbita | x-20 |
| 50 | Andidwa | |

In this example, there are only 50 PSUs in the listing. In practice, the number of PSUs will be much larger. You can find the spreadsheet and formula used to generate the table above in the “How to select PSUs” online tool.