Cleaning data to finalize the database for analysis

Whether collected on paper or electronically, data require checks and cleaning. Data cleaning is intended to identify potentially erroneously recorded data. For paper-based collection, checking and cleaning take place after data entry errors have been corrected by comparing and reconciling double-entered data. For electronic data that are continually uploaded to the server, checking and cleaning can be done on a regular basis during the field work period.

With electronic data collection, a data review exercise can be done several times a week. This allows feedback to be sent to the Survey coordinator on progress toward the expected numbers of interviews and specimens. For example, it might be found that consent for blood collection among children under 5 years of age is lower than expected, which may prompt follow-up to find the cause and advise teams accordingly. Again, any changes to procedures should be documented to avoid biasing the sample.

Duplicate Entries

Unexpected duplicate entries need to be fixed immediately. Household and individual ID numbers are unique and should occur only once in a data file. Duplicate ID numbers may have several causes, for example a single questionnaire may have been entered twice, two different individuals were assigned the same ID number or one of the two ID numbers was entered incorrectly. This last issue can be avoided by using barcode labels and barcode readers for unique ID numbers in the field and at the laboratory.

Dates and other identifiers are very useful in data cleaning processes. Data tracking documents that describe the clusters, dates and individuals should be used as management tools to help disentangle duplicate data entries and additional types of irregular data findings. Within the data tracking system, there can be a list of ID numbers and the date that data were collected for each individual. For example, if you know that person 1234 belongs to cluster 567 and you have two entries for person 1234, then you can check the cluster number to determine which data was entered incorrectly. This part of data cleaning—determining which of the duplicate entries is correct—requires time and attention to detail.

Implausible Values

The most common method of checking data is to produce a frequency table for every variable and to identify values that fall outside of a normally acceptable range. This range should be defined in the data dictionary (see Box 14.1). Where outlying values are found, they should be traced back to the original questionnaire to see if it could be a simple data entry error, due to handwriting that is difficult to read or to an incorrect decimal place. In general, valid data entry types and ranges are pre-set in any electronic data collection form in order to reduce the likelihood of “out of normal range” errors. Where there appears to be a consistent unexpected value for a specific cluster, the Team leader should be notified and he or she should verify whether the value reflects something unusual about that cluster. If the checks are being conducted on an ongoing basis during data collection and there is a consistently unexpected value produced by one interviewer, the Team leader should follow up and monitor the performance of that person. Often the outlying values cannot be verified and corrected, and a decision needs to be made regarding changing the variable outcome to a ‘missing’ value. Any such findings and changes need to be documented.

Logical errors found during the data cleaning should be investigated and, when possible, corrected. This is relevant for electronic or paper-based data collection.

Examples of logical errors include:

  • the date of birth is recorded as after the survey date;
  • the date of of birth does not fit with the expected age of the individual, for example the age calculated from the date of survey and date of birth is not the same as (or within an acceptable range of) the stated age;
  • the designation of “urban” and “rural” is inconsistent among households within the same cluster;
  • body mass index (BMI) values indicate that the height and weight measurements may have been entered in the wrong boxes, or that a decimal place has been entered incorrectly.

Logical error checks should be pre-set in electronic data collection forms. By correctly programming the electronic data collection system, it is possible to ensure that these errors cannot be entered, and such values are immediately flagged and can be rectified. For example, if a BMI is outside of the expected range, the participant’s weight and height can be measured again.

All errors must be either corrected or deleted from the database, and the process should continue until the data are considered “clean.”

Missing data

Missing data may have been entered as 99.9 or 999.99, depending on the questionnaire instructions. Missing data, including refusal codes, need to be appropriately recoded so they do not skew the summary statistics. In addition, the number of missing responses for each variable needs to be investigated. If there are many missing values, check that these are not a result of a database or data entry error.

Merged data

Laboratory data that are not measured during the data collection period (for example, haemoglobin levels or the presence of malaria) usually become available well after the final database has been approved. These laboratory data will need to be merged with the questionnaire data, using the household or individual’s unique ID number.

Here is an example of how to verify merged data: If the survey data file shows 800 women of reproductive age eligible for specimen collection, and the corresponding laboratory data file includes only 700, you would expect that 100 eligible women refused consent to provide a specimen or that the specimen volume was insufficient. However, after merging data by unique ID, it might be that only 650 lines of data match. In this case investigations to resolve the discrepancy may include:

  • reviewing the Specimen transfer form to compare the IDs of women of reproductive age against specimens collected and specimens sent to the laboratory; or
  • verifying the use of a barcode reader to enter ID numbers at the laboratory (on arrival and during recording of analysis results). If the barcode reader was not used, it is possible that IDs were incorrectly entered at the laboratory and that some specimen ID numbers do not match with a corresponding ID of women of reproductive age;
  • reviewing the response to specimen collection for all individual IDs where laboratory data are missing to assess the reason. Reasons include a declined test, inadequate specimen collected, unable to be measured (for example if the blood was haemolyzed), or lost specimen.

Creating individual-level and household-level data sets

There will be individual data sets and household data sets. The individual data sets may need to be cross-linked, for example, mother and child pair. They will also be linked to the household. During the planning stages, unique IDs and linking variables were ideally created in the data entry form to enable linking at the data management stage. Software should be selected that allows for hierarchical linking of data where needed.

Managing the database

All data, collected on paper or electronically, should be entered and maintained securely in a central database. Typically, a Database manager is responsible for developing the database and maintaining backups. However, there will be multiple people that work on managing the survey database, depending on the complexity of the survey. A Data coordinator needs to work with the Database manager as well as software programmers, statisticians, and other specialists to ensure that the data are entered, linked, and maintained in a secure, organized way. Saving data in two different networks or servers, with different access permissions, helps ensure that there is no loss of data or risk of files being deleted or manipulated by error. The Database manager and Data coordinator need to have strong experience working with large databases.