Module 14: Processes for data entry, examples of data entry software and management of the database

Constructing the data entry system

A strong data entry system is required to ensure high-quality data, whether data are collected electronically in the field or entered from paper-based questionnaires. To improve data quality, the data entry program should have preprogrammed skips, correctly formatted fields for variables such as dates, and validation checks that set appropriate limits for certain variables, such as dates of birth for children under 5 years of age and haemoglobin values. It should also include cross-checks for consistency between related variables, for example comparing the ID code of a woman of reproductive age with the household ID and with the stated number of women in this age group in the household. In this way, the system rejects unexpected values and flags the variable for further review.
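
The range checks and cross-checks described above could be sketched as follows in Python. This is only an illustration under stated assumptions: the field names (`hb`, `dob`, `hhid`, `n_women`) and the haemoglobin limits are hypothetical, and a real survey would take its limits from the agreed data dictionary.

```python
from datetime import date

# Assumed plausible haemoglobin limits in g/dL (illustrative only).
HB_MIN, HB_MAX = 4.0, 18.0


def age_in_years(dob, on):
    """Completed years of age on a given date."""
    return on.year - dob.year - ((on.month, on.day) < (dob.month, dob.day))


def check_child(record, survey_date):
    """Range checks for one child record; an empty list means no flags."""
    flags = []
    if not HB_MIN <= record["hb"] <= HB_MAX:
        flags.append(f"HB outside {HB_MIN}-{HB_MAX}: {record['hb']}")
    if not 0 <= age_in_years(record["dob"], survey_date) < 5:
        flags.append(f"DOB not consistent with a child under 5: {record['dob']}")
    return flags


def check_women(household, women):
    """Cross-checks between women's records and their household record."""
    flags = []
    for woman in women:
        if woman["hhid"] != household["hhid"]:
            flags.append(f"Woman {woman['id']}: household ID {woman['hhid']} "
                         f"does not match {household['hhid']}")
    if len(women) != household["n_women"]:
        flags.append(f"Household {household['hhid']}: {household['n_women']} women "
                     f"stated, {len(women)} records entered")
    return flags
```

A flagged value would not be corrected automatically here; the list of flags is simply returned so that the record can be reviewed.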

Construction of a data entry system requires a complete data collection tool (questionnaire). The electronic data collection system, or the data entry system for paper-based tools, is commonly developed after the cognitive interviewing process (see Module 11: Data collection tools, field manual, and database). This version can be used to train data entry staff (where applicable), who can practise with completed forms during the training and pilot test. Minor adjustments to finalize the tools may still be expected during training and piloting, and it is important to make sure that all changes are carried through to the final software version, whether uploaded for electronic collection or used to enter data from paper-based forms.

For double data entry of paper-based forms, a discrepancy check program needs to be developed to compare the independent entries.
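
As a rough illustration of what such a discrepancy check might do, the Python sketch below compares two independent entries held as dictionaries keyed by questionnaire ID. The function name and data layout are assumptions made for this example; packages such as those listed later in this module may provide their own comparison tools.

```python
def find_discrepancies(entry1, entry2):
    """Compare two independent entries of the same questionnaires.

    Each argument maps a questionnaire ID to a dict of variable values.
    Returns a list of (questionnaire_id, variable, value1, value2) tuples.
    """
    discrepancies = []
    for rec_id in sorted(set(entry1) | set(entry2)):
        rec1, rec2 = entry1.get(rec_id), entry2.get(rec_id)
        if rec1 is None or rec2 is None:
            # The questionnaire was entered by only one of the two staff.
            discrepancies.append((rec_id, "<whole record>",
                                  "entered" if rec1 is not None else "missing",
                                  "entered" if rec2 is not None else "missing"))
            continue
        for variable in sorted(set(rec1) | set(rec2)):
            value1, value2 = rec1.get(variable), rec2.get(variable)
            if value1 != value2:
                discrepancies.append((rec_id, variable, value1, value2))
    return discrepancies


# Example with made-up values: one haemoglobin value was keyed differently.
first = {"001": {"SEX": 1, "HB": 11.2}, "002": {"SEX": 2, "HB": 12.0}}
second = {"001": {"SEX": 1, "HB": 12.1}, "002": {"SEX": 2, "HB": 12.0}}
differences = find_discrepancies(first, second)
```

Each tuple returned would then be resolved against the paper questionnaire.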

The steps required to develop the system and enter data are illustrated in Fig. 14.1.

With electronic data collection, there is no need for double data entry, nor for the related discrepancy checks and reconciliation. It is possible to move straight to data checks, cleaning, and analysis. This is one of the principal advantages of electronic data collection.

Choosing software

Software for data entry from paper-based forms

Programs that can be used for data entry from paper-based questionnaires include Epi Info (http://www.cdc.gov/epiinfo), EpiData (http://www.epidata.dk), and the Census and Survey Processing System, CSPro (https://www.census.gov/data/software/cspro.html). Several Microsoft® Office programs, including Microsoft® Access (https://products.office.com/en-us/access), offer additional options.

Software for data entry for electronic data collection

Programs available for electronic data entry include Epi Info (http://www.cdc.gov/epiinfo) and Open Data Kit (ODK) (https://opendatakit.org/), a frequently used free and open-source tool. Factors to consider in choosing software include cost, the capacity to generate relational (hierarchical) data files (for example, linking a woman of reproductive age to the household she lives in and to a child she may have), and whether open access is an important feature.
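
To make the idea of relational (hierarchical) data files concrete, the sketch below links child records to their household and, where available, to their mother. It is written in plain Python with hypothetical identifiers (`hhid`, `woman_id`, `mother_id`, `child_id`) and is not tied to any of the packages named above.

```python
def link_children(households, women, children):
    """Attach household and mother fields to each child record.

    households: dict keyed by household ID
    women:      dict keyed by woman ID
    children:   list of child records carrying 'hhid' and 'mother_id'
    Returns (linked_rows, unlinked_child_ids).
    """
    rows, unlinked = [], []
    for child in children:
        household = households.get(child["hhid"])
        if household is None:
            # A child without a matching household points to an ID error.
            unlinked.append(child["child_id"])
            continue
        # The mother may not be part of the survey sample, so she is optional.
        mother = women.get(child.get("mother_id"))
        rows.append({
            "child_id": child["child_id"],
            "hhid": child["hhid"],
            "cluster": household["cluster"],
            "mother_id": mother["woman_id"] if mother else None,
            "mother_hb": mother["hb"] if mother else None,
            "child_hb": child["hb"],
        })
    return rows, unlinked


# Made-up example records.
households = {"0101": {"cluster": 1}}
women = {"010101": {"woman_id": "010101", "hhid": "0101", "hb": 12.4}}
children = [
    {"child_id": "A1", "hhid": "0101", "mother_id": "010101", "hb": 10.8},
    {"child_id": "A2", "hhid": "0199", "mother_id": None, "hb": 11.5},
]
rows, unlinked = link_children(households, women, children)
```

Keeping the household ID on every woman and child record is what makes this kind of linkage possible at the analysis stage.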

In either case (paper-based or electronic), a program that can be modified by others relatively easily should be used in case the primary developer becomes unavailable.

Developing a data dictionary

A data dictionary defines all variables included in the survey questionnaire. It is required for developing the data entry program, so that the type, field width and validation checks (agreed-upon acceptable values) can be programmed for each variable. The data dictionary also needs to define all variables created from the original data; for example, the variable “anaemia” may be defined from the result of the haemoglobin test together with the individual’s age and pregnancy status. A data dictionary is also essential for developing the data analysis syntax. Box 14.1 provides an example of a data dictionary.
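
A derived variable such as “anaemia” might be defined along the following lines. This is a sketch only: the cut-off values are assumptions based on commonly cited haemoglobin thresholds (in g/dL) and must be replaced by the definitions agreed for the survey, including any adjustment for altitude. The sex coding (1=Male, 2=Female) follows Box 14.1; the function and group names are hypothetical.

```python
# Assumed cut-offs in g/dL; haemoglobin below the cut-off is classed as anaemia.
ANAEMIA_CUTOFFS = {
    "child_6_59_months": 11.0,
    "pregnant_woman": 11.0,
    "nonpregnant_woman": 12.0,
}


def population_group(age_months, sex, pregnant):
    """Assign the group whose cut-off applies, or None if no group is defined here."""
    if 6 <= age_months < 60:
        return "child_6_59_months"
    if sex == 2 and age_months >= 15 * 12:
        return "pregnant_woman" if pregnant else "nonpregnant_woman"
    return None


def derive_anaemia(hb, age_months, sex, pregnant=False):
    """Return 1 (anaemia), 0 (no anaemia) or None (cannot be derived)."""
    group = population_group(age_months, sex, pregnant)
    if hb is None or group is None:
        return None
    return 1 if hb < ANAEMIA_CUTOFFS[group] else 0
```

Recording the rule in the data dictionary in this explicit form means that the same definition can be reused in the analysis syntax.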

Box 14.1 Example Data Dictionary

Variable	Variable name	Variable type	Variable width	Example values/notes
Participant number	ID	Numeric	3	001–999
Household number	HHID	Numeric	2	01–25
Residence	URBAN_RURAL	Numeric	1	1=Urban, 2=Rural
Region	STRATA	Numeric	1	1–3
Cluster number	CLUSTER	Numeric	2	01–30
Age in months	AGE	Numeric	2.1	06.0–59.9
Date of birth	DOB	dd/mm/yyyy	10	[values set according to survey date and expected age of respondent]
Sex	SEX	Numeric	1	1=Male, 2=Female
Date of survey	SURVEY	dd/mm/yyyy	10	15/06/2004–20/08/2004
Haemoglobin	HB	Numeric	2.1	04.0–18.0 g/dL^a
Urinary iodine concentration	UIC	Numeric	4.1	0000.0–1000.0 µg/L
Retinol binding protein concentration	RBP	Numeric	2.2	00.00–90.00 µmol/L
Iodine in salt based on rapid test kit	SALT_RTK	Numeric	1	1=Yes, 0=No
Iodine level in salt based on titration	SALT_QUANT	Numeric	3	000–120 mg/kg

^a Note: These may not be correct minimum and maximum values for use in populations living at high altitudes.
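
One way to put a data dictionary to work is to hold it in a machine-readable form and let the entry program check each record against it. The sketch below encodes a few rows of Box 14.1 as a Python dictionary; the structure (`type`, `min`, `max`, `codes`) and the `validate` function are assumptions made for illustration, not the format used by any particular package.

```python
# A few entries from Box 14.1 in an assumed machine-readable layout.
DATA_DICTIONARY = {
    "ID": {"label": "Participant number", "type": int, "min": 1, "max": 999},
    "HHID": {"label": "Household number", "type": int, "min": 1, "max": 25},
    "URBAN_RURAL": {"label": "Residence", "type": int, "codes": {1: "Urban", 2: "Rural"}},
    "SEX": {"label": "Sex", "type": int, "codes": {1: "Male", 2: "Female"}},
    "AGE": {"label": "Age in months", "type": float, "min": 6.0, "max": 59.9},
    "HB": {"label": "Haemoglobin", "type": float, "min": 4.0, "max": 18.0},
}


def validate(record):
    """Check one record against the dictionary; return {variable: problem}."""
    problems = {}
    for name, spec in DATA_DICTIONARY.items():
        value = record.get(name)
        if value is None:
            problems[name] = "missing"
        elif not isinstance(value, spec["type"]):
            problems[name] = f"expected {spec['type'].__name__}"
        elif "codes" in spec and value not in spec["codes"]:
            problems[name] = f"unexpected code {value}"
        elif "min" in spec and not spec["min"] <= value <= spec["max"]:
            problems[name] = f"outside {spec['min']}-{spec['max']}"
    return problems
```

Because the limits live in one place, a change agreed after piloting only needs to be made in the dictionary rather than in each check.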

Testing the data entry system

The data entry system requires extensive testing, preferably by a number of people entering different options that will, for example, test different skip patterns. After this testing, the system should be piloted among different groups, to assess:

- validation checks (expected data ranges/exclusion of implausible values and cross-checks with values for other entered, related variables)
- data entry formats
- skip patterns
- logical, user-friendly variable names, labels, format, and flow
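
Testing skip patterns by hand can be complemented by a small automated check that walks through every combination of the answers that trigger a skip. The example below is a sketch built on an invented five-question flow; the question names and skip rules are hypothetical and only illustrate the approach.

```python
from itertools import product

# Hypothetical question order and skip rules:
# question -> (answer that triggers the skip, question to jump to).
FLOW = ["Q1_HAS_SALT", "Q2_SALT_RTK", "Q3_PREGNANT", "Q4_MONTHS_PREGNANT", "Q5_END"]
SKIPS = {"Q1_HAS_SALT": (0, "Q3_PREGNANT"), "Q3_PREGNANT": (0, "Q5_END")}


def questions_asked(answers):
    """Return the questions presented for a given set of answers."""
    asked, i = [], 0
    while i < len(FLOW):
        question = FLOW[i]
        asked.append(question)
        rule = SKIPS.get(question)
        if rule and answers.get(question) == rule[0]:
            i = FLOW.index(rule[1])  # jump forward past the skipped questions
        else:
            i += 1
    return asked


def check_all_paths():
    """Try every answer combination and list any path that breaks the rules."""
    failures = []
    for has_salt, pregnant in product([0, 1], repeat=2):
        answers = {"Q1_HAS_SALT": has_salt, "Q3_PREGNANT": pregnant}
        asked = questions_asked(answers)
        if ("Q2_SALT_RTK" in asked) != (has_salt == 1):
            failures.append((answers, "Q2_SALT_RTK"))
        if ("Q4_MONTHS_PREGNANT" in asked) != (pregnant == 1):
            failures.append((answers, "Q4_MONTHS_PREGNANT"))
    return failures
```

An empty list from `check_all_paths()` indicates that each dependent question appears only when its triggering answer was given.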

Results of the pilot test may reveal that the data dictionary needs adjustment. Piloting of the data collection and data entry system should be done prior to training, so that enumerators use the best available version of the system during the training.

Data entry requirements

Data entry should start as soon as possible after the initiation of fieldwork. This will allow common errors to be identified early, reasons for errors to be determined and corrective action to be taken.

A micronutrient survey may require that a large amount of data be entered. For paper-based survey questionnaires, data can be entered into the electronic database either:

- At the end of the day by the survey team. This approach requires significant time in the field that could otherwise be spent on data collection. On the other hand, it allows erroneous data to be corrected quickly, because the team can still return to the cluster. It also enables data to be backed up onto a separate device, to avoid the loss of information that would result if the paper version of a completed questionnaire were lost.
- By double data entry at the central data management location. This is the most commonly used method when data collection is paper-based. It may improve data quality by reducing the rate of errors and inter-individual variability, because a limited number of experienced data entry personnel enter the data. This approach requires strong supervision and detailed checks in the field to ensure the legibility and quality of the data entered. This method requires:
  - a minimum of two data entry staff assigned by the database manager to enter the data;
  - entry of information from each questionnaire by each of these two people (double data entry);
  - comparison of the two data files by the database manager using the discrepancy check program;
  - reconciliation of any differences based on the paper version of the questionnaire; and
  - monitoring of personnel performance and retraining where needed.

For accountability, the final and complete set of data files should include:

- A clean final master version of the data to be used for data analysis. The final master dataset will have a data dictionary with variable labels that link to specific questionnaires.
- The two sets of raw data entries (to confirm double entry).
- A log of any discrepancies found. The log of discrepancies per variable could be presented as a table with the following headings:

Variable	Data entry staff n°1 value	Data entry staff n°2 value	Resolved value
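
A log with these headings could be written out, and used to monitor the performance of data entry staff, roughly as follows. The sketch assumes each log row is a tuple in the order of the four headings and that the resolved value comes from the paper questionnaire; the function names and example values are invented for illustration.

```python
import csv
import io

HEADINGS = ["Variable", "Data entry staff n°1 value",
            "Data entry staff n°2 value", "Resolved value"]


def log_as_csv(rows):
    """Render the discrepancy log as CSV text with the four headings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADINGS)
    writer.writerows(rows)
    return buffer.getvalue()


def error_counts(rows):
    """Count, per staff member, the entries that differ from the resolved value."""
    staff1 = sum(1 for _, value1, _, resolved in rows if value1 != resolved)
    staff2 = sum(1 for _, _, value2, resolved in rows if value2 != resolved)
    return {"staff_1": staff1, "staff_2": staff2}


# Made-up log rows: (variable, staff 1 value, staff 2 value, resolved value).
log_rows = [
    ("HB", 11.2, 12.1, 12.1),
    ("SEX", 1, 2, 1),
    ("AGE", 24.0, 42.0, 24.0),
]
counts = error_counts(log_rows)
```

Counts that are consistently higher for one person may indicate a need for retraining, although the totals should be read against the number of questionnaires each person entered.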