Module 6. Selecting clusters
Determining the appropriate number of clusters and the number of units per cluster
Methods for selecting clusters
Examples of cluster selection using probability proportional to size (PPS) and systematic sampling (SS)
This module provides examples and information on sampling. It is essential that an experienced sampling statistician be included on the team to develop and implement the sampling plan, and to undertake quality control measures as the survey progresses.
Determining the appropriate number of clusters and the number of units per cluster
There are two main issues to consider when applying sample size calculations to fieldwork plans for cluster surveys: how many clusters are needed, and how many units (individuals or households) are needed per cluster. The two values are interrelated, meaning that decisions about one affect the value of the other.
As noted in Module 5: Sample size, the number of units per cluster affects the DEFF and therefore the required overall sample size. The more clusters it is possible to include, the fewer units are needed per cluster and the more diverse the sample will be. As a general rule, up to around 40 clusters per stratum is a good estimate, with the aim of at least 10 observations per cluster. A greater number of clusters and fewer units per cluster will decrease the DEFF and either improve precision for the same sample size, or maintain precision with a smaller sample size. In the example given in Module 5: Sample size, for a sample of 1200 households, higher precision can be achieved by selecting 60 clusters of 20 households each as opposed to 40 clusters of 30 households each. On the other hand, visiting 60 clusters rather than 40 increases the cost of the survey. This underscores the need to weigh the cost of collecting data and specimens against programmatic needs for a specific level of precision.
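The trade-off between the two designs can be illustrated numerically. The sketch below uses the common approximation DEFF = 1 + (m - 1) × ICC, where m is the number of units per cluster and ICC is the intracluster correlation. The ICC value of 0.05 and the function names are assumptions made for illustration only; they are not taken from this manual.

```python
def design_effect(units_per_cluster: int, icc: float) -> float:
    """Approximate DEFF = 1 + (m - 1) * ICC for equal-sized clusters."""
    return 1 + (units_per_cluster - 1) * icc


def effective_sample_size(clusters: int, units_per_cluster: int, icc: float) -> float:
    """Total sample size divided by the design effect."""
    total = clusters * units_per_cluster
    return total / design_effect(units_per_cluster, icc)


ASSUMED_ICC = 0.05  # illustrative value only, not a recommendation

# Compare the two 1200-household designs mentioned in the text.
for clusters, per_cluster in [(60, 20), (40, 30)]:
    deff = design_effect(per_cluster, ASSUMED_ICC)
    ess = effective_sample_size(clusters, per_cluster, ASSUMED_ICC)
    print(f"{clusters} clusters x {per_cluster} households: "
          f"DEFF = {deff:.2f}, effective sample size = {ess:.0f}")
```

Under this assumed ICC, the 60-cluster design has the smaller DEFF and the larger effective sample size, which is the pattern described above. The size of the gain depends on the true ICC for the indicator of interest.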
General guidelines for deciding on the number of clusters and number of units per cluster are provided in the section on DEFF in Module 5: Sample size. Other factors that influence this decision are geography and time per cluster:
- Geography: In larger countries, the cost and time required for teams to move from one cluster to another can be substantial. In this case it might be better to select a smaller number of clusters, but never fewer than 25. If the country is very small, or if the survey is being conducted in a region or province only, having a larger number of clusters is a reasonable way to improve precision.
- Time per cluster: The number of household visits and the amount of data collection that can be completed in a single day vary. In some surveys, the questionnaire and specimen collection might be brief, whereas in others they may be much more time-consuming. In some surveys, the specimen collection and interviews are conducted in the household, while in others, survey participants may be asked to go to a central laboratory set up in the cluster. Depending on the survey design, the size of the team, the traveling distance between households, and the complexity of the survey, a single team can typically complete visits to five to ten households in one day.
The logistics required for cold chain management are also crucial to consider. In some harder-to-reach clusters with limited access to electricity, it may be necessary to minimize the number of days spent in the cluster if there are specimens that require processing and freezing in the field. Portable freezers and centrifuges that can be plugged into a car or portable generator are available, but the less time spent in areas without a direct power source, the more likely it is that the cold chain can be maintained. There are several ways to minimize the time spent in one cluster: increasing the number of enumerators per team, increasing the number of teams per cluster and increasing the total number of clusters so that there are fewer households per cluster.
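The effect of these options on the time spent in a cluster can be estimated with simple arithmetic. The following is a minimal sketch; the function name and all input values are hypothetical, with the daily throughput taken from the range of five to ten households per team per day mentioned above.

```python
import math


def days_in_cluster(households: int, teams: int, visits_per_team_per_day: int) -> int:
    """Whole days needed to visit every sampled household in one cluster."""
    if teams < 1 or visits_per_team_per_day < 1:
        raise ValueError("teams and visits_per_team_per_day must be at least 1")
    return math.ceil(households / (teams * visits_per_team_per_day))


# Illustrative scenarios only (inputs are assumptions, not recommendations).
baseline = days_in_cluster(households=30, teams=1, visits_per_team_per_day=5)
two_teams = days_in_cluster(households=30, teams=2, visits_per_team_per_day=5)
smaller_clusters = days_in_cluster(households=20, teams=2, visits_per_team_per_day=5)
print(baseline, two_teams, smaller_clusters)
```

A calculation of this kind can help to check whether a planned design keeps field time in clusters without a power source within what the cold chain can tolerate.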
In some circumstances, the collection of specimens for a specific biomarker may be complicated and time-consuming. If this is the case, it may be possible to collect data only from a random subsample of the population group of interest within each stratum, and focus on generating a reliable national estimate. The modified relative dose response (MRDR) test for assessing vitamin A deficiency is an example of a biomarker that requires a random subsample. Specimen collection for the MRDR test requires the survey participant to avoid vitamin A rich foods for at least two hours before the initial blood draw, consume a dose of vitamin A2 mixed with oil, continue to avoid vitamin A rich foods, and then have a second blood specimen drawn four to six hours later. In addition, sample analysis for the MRDR test is costly. For reasons of feasibility and cost, it may be sufficient to collect specimens for the MRDR test from one single household per cluster.
Methods for selecting clusters
Clusters must be selected randomly. The first stage in selecting clusters is generally based on a comprehensive listing of all primary sampling units (PSUs). For household surveys, PSUs often have the same boundaries as census enumeration areas (EAs). The PSUs are often referred to as “clusters” because the survey elements, namely households, are clustered within the PSU.
For surveys that concern attendees of government facilities, PSUs can be defined as all government-run health facilities. For primary school-based surveys, PSUs could be all primary schools, including private and religious schools. The estimates obtained would represent only children who attend school.
In household-based surveys, the comprehensive listing of PSUs would require the population size or the number of households within each cluster. In clinic-based surveys, the comprehensive listing of PSUs would require the number of clinic enrollees, while in school-based surveys, the number of students enrolled in and regularly attending each school is necessary for the PSU listing.
If relatively accurate data on population size are available, then the preferred method for selecting clusters is the probability proportional to size (PPS) method. If reasonably reliable population data are not available, then either random or systematic sampling (SS) of clusters could be used. Each of these methods is described in more detail in the following sections.
PPS method
Using the PPS method, the likelihood of a PSU being selected is proportional to the size of its population (the number of individuals or households). Thus, larger PSUs are more likely to be selected than smaller ones.
The PPS method starts by obtaining the “best available” census data for all the PSUs in the geographic area to be surveyed. This information is usually available from the government agency responsible for the census, such as a national bureau of statistics. The list from which survey PSUs are selected must cover the whole area intended to be represented by the survey estimates. If it is a nationally representative survey, all national households must be represented in the list.
Depending on the country, PSUs may cover relatively small geographic areas, with a population size between 100 and 1000 individuals or between 20 and 200 households. It is important to confirm the sizes of PSUs, as there may be circumstances in which there are not enough potential survey units in one PSU to meet the required sample size. In those cases, two nearby PSUs should be combined to form a single one, prior to the selection process.
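One possible way to apply this check before selection is sketched below. It assumes that the PSU list is already in geographic order, so that neighbouring entries in the list are nearby on the ground, and it combines an undersized PSU with the next one in the list. The function name, the threshold and the example data are hypothetical; in practice, the decision on which PSUs to combine should be confirmed with maps and local knowledge.

```python
from typing import List, Optional, Tuple

Psu = Tuple[str, int]  # (name, number of households)


def merge_small_psus(psus: List[Psu], min_households: int) -> List[Psu]:
    """Combine any PSU below min_households with a neighbouring PSU in the list."""
    merged: List[Psu] = []
    pending: Optional[Psu] = None  # undersized PSU waiting to be combined
    for name, size in psus:
        if pending is not None:
            name, size = f"{pending[0]}+{name}", pending[1] + size
            pending = None
        if size < min_households:
            pending = (name, size)  # still too small: carry forward
        else:
            merged.append((name, size))
    if pending is not None:
        if merged:
            # Last PSU in the list is too small: attach it to the previous one.
            prev_name, prev_size = merged.pop()
            merged.append((f"{prev_name}+{pending[0]}", prev_size + pending[1]))
        else:
            merged.append(pending)
    return merged


example = [("A", 40), ("B", 12), ("C", 25), ("D", 8)]
print(merge_small_psus(example, min_households=20))
```

Note that this sketch keeps combining until the threshold is met, so more than two PSUs may end up in one combined unit if they are all very small.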
To use the PPS method to select PSUs, first create a table with four columns, as is shown in Box 6.1:
- The first column lists the name or code of each PSU. As a general rule, it is best for the list to be in geographic order and organized by urban, rural, district, and province (implicit stratification).
- The second column contains the population size of each PSU.
- The third column contains the cumulative population that is obtained by adding the population of each PSU to the cumulative population of PSUs preceding it on the list.
- The fourth column indicates selected clusters within PSUs.
A sampling interval (labeled as “k”) is obtained by dividing the total population size by the number of PSUs to be selected for the survey. A random number between 1 and the sampling interval (k) is chosen, and the PSU whose cumulative population range contains this number is the initial PSU. The value of the sampling interval (k) is added to this number to select the second PSU. This continues, adding the value of k cumulatively to the previous number, until the desired number of clusters is chosen. Note that the last selection number should be less than the value of k away from the end of the cumulative population listing.
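The procedure described above can be sketched in a few lines of code. This is an illustration under stated assumptions, not the formula used in the online tool: the function name, its parameters and the five-PSU sampling frame are hypothetical, and the interval is rounded down to a whole number.

```python
import random
from bisect import bisect_left
from itertools import accumulate
from typing import List, Optional, Sequence, Tuple


def pps_systematic_select(
    psus: Sequence[Tuple[str, int]],
    n_clusters: int,
    start: Optional[int] = None,
) -> List[str]:
    """Return the PSU name for each selected cluster (a large PSU may repeat)."""
    cumulative = list(accumulate(size for _, size in psus))
    total = cumulative[-1]
    k = total // n_clusters  # sampling interval, rounded down
    if k < 1:
        raise ValueError("more clusters requested than population units")
    if start is None:
        start = random.randint(1, k)  # random start between 1 and k inclusive
    if not 1 <= start <= k:
        raise ValueError("start must be between 1 and k")
    selected = []
    for i in range(n_clusters):
        target = start + i * k
        # Index of the first PSU whose cumulative population reaches the target.
        index = bisect_left(cumulative, target)
        selected.append(psus[index][0])
    return selected


# Hypothetical frame: cumulative populations are 600, 1300, 1650, 2330, 2760.
frame = [("A", 600), ("B", 700), ("C", 350), ("D", 680), ("E", 430)]
# k = 2760 // 3 = 920; selection numbers are 710, 1630 and 2550.
print(pps_systematic_select(frame, n_clusters=3, start=710))  # ['B', 'C', 'E']
```

Passing `start` explicitly makes a selection reproducible for documentation and quality control; leaving it out draws the random start on each run.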
Where there is a large number of PSUs, the selection process is usually performed using a computer. For SAS users, the PROC SURVEYSELECT command has an option to select data using PPS. With SPSS, the optional Complex Samples module has a “Select Sample…” option. Use of spreadsheets and appropriate formulae is another method for performing the selection.
The “How to select PSUs” online tool contains an extract from a spreadsheet with instructions on how to select PSUs using the PPS method. Box 6.1 also shows an example of how this is done.
Random and systematic selection of clusters (where population size is inaccurate or unknown)
When a list of PSUs is available but the population size for each PSU is not known or could be very inaccurate, simple random sampling (SRS) or systematic sampling (SS) may be used. SRS means that the predetermined number of PSUs is randomly selected from the total list of PSUs. In systematic sampling, selection is based on the sequential numbering of PSUs rather than on population size, and proceeds according to a random starting point and a fixed sampling interval (k). In this method of sampling, k is calculated by dividing the total number of PSUs by the desired number of PSUs. This value of k should be used to select the PSUs by rounding up the value of k. Many software packages are available that can easily select the number of PSUs desired.
As is done in the PPS method, a random integer between 1 and the sampling interval (k) is chosen as the initial PSU, and the value of the sampling interval (k) is added to this PSU number to select the second PSU number. Once the list of selected PSU numbers is completed, they should be rounded down as needed to identify the actual PSU to select. See Box 6.2 for an example.
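The steps above can be sketched as follows. The sketch follows the approach used in Box 6.2, in which the unrounded interval is added repeatedly and each running total is rounded down. The function name and parameters are hypothetical, and exact fractions are used so that rounding is not affected by floating-point error.

```python
import math
import random
from fractions import Fraction
from typing import List, Optional


def systematic_select(n_psus: int, n_sample: int, start: Optional[int] = None) -> List[int]:
    """Return the 1-based PSU numbers chosen by systematic sampling."""
    if not 1 <= n_sample <= n_psus:
        raise ValueError("n_sample must be between 1 and n_psus")
    k = Fraction(n_psus, n_sample)  # sampling interval, may be fractional
    if start is None:
        start = random.randint(1, math.floor(k))  # whole number between 1 and k
    if not 1 <= start <= k:
        raise ValueError("start must be between 1 and k")
    # Add k cumulatively to the start, rounding each running total down.
    return [math.floor(start + i * k) for i in range(n_sample)]


# 50 PSUs, 20 to be sampled, random start of 2 (the values used in Box 6.2).
print(systematic_select(50, 20, start=2))
```

With these inputs the first four selected PSU numbers are 2, 4, 7 and 9, matching the worked example in Box 6.2.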
To be able to analyse the data collected with some adjustment for population size, an estimate of the population size in each selected PSU should be collected when the survey team arrives on site. Typically, a mini census is conducted to determine this number. Definitions of households and an explanation of how to select households and individuals from within selected clusters are provided in more detail in Module 7: Selecting households and participants.
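One common way to make this adjustment is to weight each household by the inverse of its probability of selection. The sketch below assumes that PSUs were selected with equal probability and that households were selected with equal probability within each PSU, using the household count from the mini census. The function name and the numbers are hypothetical, and the weighting approach actually used should be agreed with the sampling statistician.

```python
def household_weight(
    total_psus: int,
    sampled_psus: int,
    households_in_psu: int,
    households_sampled: int,
) -> float:
    """Inverse of the overall selection probability for a household.

    Stage 1: the PSU is selected with probability sampled_psus / total_psus.
    Stage 2: the household is selected with probability
             households_sampled / households_in_psu.
    """
    inverse_p_psu = total_psus / sampled_psus
    inverse_p_household = households_in_psu / households_sampled
    return inverse_p_psu * inverse_p_household


# Two selected PSUs of different size, each with 20 households sampled.
print(household_weight(50, 20, households_in_psu=120, households_sampled=20))
print(household_weight(50, 20, households_in_psu=60, households_sampled=20))
```

Households in the larger PSU receive a larger weight because each one had a smaller chance of being selected.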
If equal numbers of households (or other survey units) are randomly selected using the same method within each cluster, then they can have equal weight. Using the PPS sampling method at the first stage, as described above, would then result in a self-weighted (equally weighted) sample of units within the stratum: all households in a stratum have the same probability of selection regardless of which PSU they are located in.
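The reason the sample is self-weighted can be shown with a short derivation. The symbols are introduced here for illustration and are not used elsewhere in this module: a is the number of clusters selected in the stratum, b is the number of households selected per cluster, M_i is the number of households in PSU i, and M is the total number of households in the stratum. The derivation assumes that no PSU is so large that it is certain to be selected more than once.

```latex
P(\text{household in PSU } i \text{ is selected})
  = \underbrace{\frac{a \, M_i}{M}}_{\text{PSU } i \text{ selected by PPS}}
    \times
    \underbrace{\frac{b}{M_i}}_{\text{household selected within PSU } i}
  = \frac{a \, b}{M}
```

The PSU size M_i cancels, so the probability of selection is the same for every household in the stratum.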
Implicit stratification spreads the sample evenly among geographically important subgroups of the population, such as urban or rural areas, or administrative regions. The process involves arranging the PSUs in geographic order, such as urban by province, within each province by district, followed by rural by province, then within each province by district before systematically applying the PPS method.
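The ordering step can be sketched as a sort on several keys applied before the systematic PPS selection. The record fields, the function name and the example values below are hypothetical; the ordering itself follows the sequence described above (urban by province and district, then rural by province and district).

```python
from typing import List, NamedTuple


class PsuRecord(NamedTuple):
    name: str
    residence: str  # "urban" or "rural"
    province: str
    district: str
    population: int


def order_for_implicit_stratification(frame: List[PsuRecord]) -> List[PsuRecord]:
    """Sort urban PSUs first, then rural; within each, by province then district."""
    residence_rank = {"urban": 0, "rural": 1}
    return sorted(
        frame,
        key=lambda p: (residence_rank[p.residence], p.province, p.district, p.name),
    )


frame = [
    PsuRecord("P1", "rural", "North", "D2", 300),
    PsuRecord("P2", "urban", "South", "D1", 500),
    PsuRecord("P3", "urban", "North", "D3", 200),
    PsuRecord("P4", "rural", "North", "D1", 150),
]
for psu in order_for_implicit_stratification(frame):
    print(psu.residence, psu.province, psu.district, psu.name)
```

Once the list is in this order, the cumulative population column is calculated and the sampling interval is applied down the list, so that the selected clusters are spread across the subgroups.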
Examples of cluster selection using probability proportional to size (PPS) and systematic sampling (SS)
Box 6.1. Small-scale example of selecting clusters from a listing of PSUs using the PPS method
Step 1: Calculate the sampling interval (k) by dividing the total population by the number of clusters to be surveyed. In this example, the total population is 24 940, and the number of clusters to be surveyed is 30, thus the sampling interval is 24 940 ÷ 30 = 831.3, rounded down to 831 people. Always round down to the nearest whole number.
Step 2: Use a random number table or generator to determine a random starting point between 1 and the sampling interval (k). In this example, where the sampling interval is 831, the number 710 was randomly selected as the starting point.
Step 3: Use the cumulative population column to find the PSU that contains individual number 710. In this example, the first cluster is in the PSU of Mina because it includes the population from individual 601 to individual 1300.
Step 4: Continue to assign clusters by adding 831 (k) cumulatively. For example, the second cluster will be in the PSU where the value 1541 is located (710 + 831 = 1541), which is Bolama. The third cluster is where the value 2372 is located (1541 + 831 = 2372), and so on. In PSUs with large populations, more than one cluster could be selected. Note that if two clusters are selected in the same PSU (in this case Hilandia), the survey team will divide the PSU area into two sections of approximately equal population size and treat each area as an independent cluster. Similarly, if three or more clusters were in a PSU (for example, Cococopa), the PSU would be divided into three or more sections (clusters) of approximately equal population size.
PSU no.  PSU name     Population  Cumulative population  Cluster(s)
1        Utural              600                    600
2        Mina                700                  1,300  1
3        Bolama              350                  1,650  2
4        Taluma              680                  2,330
5        War-Yali            430                  2,760  3
6        Galey               220                  2,980
7        Tarum                40                  3,020
8        Hamtato             150                  3,170  4
9        Nayjaff              90                  3,260
10       Nuviya              300                  3,560
11       Cattical            430                  3,990  5
12       Paralai             150                  4,140  5
13       Egala-Kuru          380                  4,520
14       Uwanarpo            310                  4,830  6
15       Hilandia          2,000                  6,830  7, 8
16       Assosa              750                  7,580  9
17       Dimma               250                  7,830
18       Aisha               420                  8,250  10
19       Nam Yao             180                  8,430
20       Mai Jarim           300                  8,730
21       Pua                 100                  8,830
22       Gambela             710                  9,540  11
23       Fugnido             190                  9,730  12
24       Degeh Bur           150                  9,880  12
25       Mezan               450                 10,330
26       Banvinai            400                 10,730  13
27       Purantna            220                 10,950
28       Kegalni             140                 11,090
29       Hamali-Ura           80                 11,170
30       Kameni              410                 11,580  14
31       Kiroya              280                 11,860
32       Yanwela             330                 12,190
33       Bagvi               440                 12,630  15
34       Atota               320                 12,950
35       Kogouva             120                 13,070  16
36       Ahekpa               60                 13,130
37       Yondot              320                 13,450
38       Nozop             1,780                 15,230  17, 18
39       Mapazko             390                 15,620  19
40       Lotohah           1,500                 17,120  20
41       Voattigan           960                 18,080  21, 22
42       Plitok              420                 18,500
43       Dopoltan            270                 18,770
44       Cococopa          3,500                 22,270  23, 24, 25, 26, 27
45       Famegzi             400                 22,670
46       Jigpelay            210                 22,880
47       Mewoah               50                 22,930
48       Odigla              350                 23,280  28
49       Sanbati           1,440                 24,720  29
50       Andidwa             220                 24,940  30
In this example, there are only 50 PSUs in the listing. In practice, the number of PSUs will be much larger. The spreadsheet and formula used to generate the table above are shown in the “How to select PSUs” online tool.
Box 6.2. Small-scale example of systematic sampling of clusters from a listing of PSUs
Step 1: Obtain the list of the PSUs and number them from 1 to the total number of PSUs. In this example there are 50 PSUs.
Step 2: The number of PSUs to sample should have already been determined. In this example it is 20.
Step 3: Calculate the sampling interval (k) by dividing the total number of PSUs by the number to be sampled. In this example, there are 50 PSUs, of which 20 should be sampled, thus the sampling interval k is 50 ÷ 20 = 2.5.
Step 4: Using a random number table or generator, select an integer between 1 and k. Go to the PSU list and include the PSU with that number as the first selected PSU. In this example, the first selected PSU is number 2.
Step 5: Select the subsequent PSUs by adding k to the previous unrounded value, then rounding down to the nearest whole number. In this example the second value is 2 + 2.5 = 4.5, which rounds down to PSU number 4; the third is 4.5 + 2.5 = 7; and the fourth is 7 + 2.5 = 9.5, rounded down to 9. The table below shows the 20 PSUs selected from PSUs numbered 1 to 50.
Number  PSU name     Selected
1       Utural
2       Mina         x-1
3       Bolama
4       Taluma       x-2
5       War-Yelo
6       Galey
7       Tarum        x-3
8       Hamtato
9       Nayjaff      x-4
10      Nuviya
11      Cattical
12      Paralia      x-5
13      Egala-Kuru
14      Uwanarpol    x-6
15      Hilandia
16      Assosa
17      Dimma        x-7
18      Aisha
19      Nam Yao      x-8
20      Mai Jarim
21      Ppua
22      Gambela      x-9
23      Fugnido
24      Degeh Bur    x-10
25      Mezan
26      Ban Vinai
27      Puratna      x-11
28      Kegalni
29      Hamali-Ura   x-12
30      Kameni
31      Kiroya
32      Yanwela      x-13
33      Bagvi
34      Atota        x-14
35      Kogouva
36      Ahekpa
37      Yondot       x-15
38      Nozop
39      Mapazoko     x-16
40      Lotohah
41      Voattigan
42      Plitok       x-17
43      Dopltan
44      Cococopa     x-18
45      Famegzi
46      Jigpley
47      Mewoah       x-19
48      Odigla
49      Sanbita      x-20
50      Andidwa
In this example, there are only 50 PSUs in the listing. In practice, the number of PSUs will be much larger. You can find the spreadsheet and formula used to generate the table above in the “How to select PSUs” online tool.