Duplicate household numbers

Discussions about editing and cleaning data
arijd
Posts: 2
Joined: March 29th, 2014, 1:39 am

Duplicate household numbers

Post by arijd »

My question is as follows:

I have a data file in which some households have exactly the same id. This is because, for example, enumerators erroneously used the same household number twice in their area. Since the census forms were scanned (not keyboarded) this problem was not caught at the time of data capture.
After sorting, these households are merged, producing largish households with several heads of household. How can such cases be split back into their original two (or more) constituent parts? I can think of several solutions involving secondary household number variables, several dictionaries and so forth. But I would like to keep it simple, since counterparts may find this too complex.

Ideas will be much appreciated. No doubt I'm not the first to face this issue.

Arij Dekker
Gregory Martin
Posts: 1777
Joined: December 5th, 2011, 11:27 pm
Location: Washington, DC

Re: Duplicate household numbers

Post by Gregory Martin »

When I've worked with scanned data, I generally included as an ID a code from the form, usually a barcode number, which prevented CSPro from treating two households as a single household. However, if you don't have that, I think there is a solution that may not be too confusing for your counterparts.

1) Edit your dictionary, adding a large numeric field as the last ID field. Then turn off relative positioning and delete this new ID field. This will create a gap in your data file between the IDs and the record contents.

2) Concatenate your data using the original dictionary and then reformat this data file using this new dictionary. The IDs must match for the Reformat Data tool to work, which is why we deleted the ID field in step 1.

3) Add the new ID field back to the dictionary.

4) Create a batch application that assigns an incrementing value (1, 2, 3, ...) to this ID field.

Now when you sort the data, you shouldn't run into any duplicate household problems.
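
The effect of step 4 can be sketched outside CSPro. The following is a minimal Python illustration, not CSPro logic; the function name `add_serial_id` and the sample data are hypothetical, and it assumes each household is already available as one separate unit before sorting.

```python
from itertools import count

def add_serial_id(households):
    """Append a running serial number (1, 2, 3, ...) to each household's ID.

    households: list of (id_tuple, members) in original file order.
    """
    serial = count(1)
    return [(hh_id + (next(serial),), members) for hh_id, members in households]

# Hypothetical sample: household number 12 was used twice in EA01.
households = [
    (("EA01", 12), ["head A", "spouse A"]),
    (("EA01", 12), ["head B"]),
    (("EA01", 13), ["head C"]),
]

keyed = add_serial_id(households)
keyed.sort(key=lambda h: h[0])  # sorting no longer merges the two households
```

Because the serial number is the last part of the ID, the sort order by the original ID items is unchanged and the two households numbered 12 stay apart.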
arijd
Posts: 2
Joined: March 29th, 2014, 1:39 am

Re: Duplicate household numbers

Post by arijd »

Thank you for your response, Gregory. Very insightful.

1. Indeed our data are scanned. Unfortunately the scanner generates multiple household records for large households. Households with more than 8 members require a second questionnaire, which automatically produces a second household record, and so forth. The questionnaires carry preprinted bar-codes, but these may not be in any particular sequence, so they serve no purpose for identifying duplicate household records resulting from large households where several questionnaires are involved. However, the scanner also generates a "scan form sequence number". Where these scan form sequence numbers are sequential and household id's are the same, that is obviously a case of a large household with multiple household records. In that case it is not too difficult to eliminate the superfluous household records. I should add that we do not control the scanning process ourselves; it has been tendered out.
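
The rule in point 1 could be sketched roughly as follows. This is Python for illustration only, not CSPro logic; the field names `hh_id`, `scan_seq` and `members` and the function name are assumptions, and the forms are assumed to arrive in scan order.

```python
def merge_continuation_forms(forms):
    """Fold continuation questionnaires into the household they belong to.

    A form is treated as a continuation when it has the same household id as
    the household being built AND its scan form sequence number directly
    follows that of the previous form.
    """
    merged = []
    last_seq = None
    for form in forms:
        prev = merged[-1] if merged else None
        if (prev is not None
                and form["hh_id"] == prev["hh_id"]
                and form["scan_seq"] == last_seq + 1):
            # Same id, consecutive scan number: second (third, ...) questionnaire.
            prev["members"].extend(form["members"])
        else:
            merged.append({"hh_id": form["hh_id"],
                           "scan_seq": form["scan_seq"],
                           "members": list(form["members"])})
        last_seq = form["scan_seq"]
    return merged

# Hypothetical sample: household 5 fills two forms (scans 100 and 101);
# a different household that reused number 5 was scanned much later.
forms = [
    {"hh_id": 5, "scan_seq": 100, "members": ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]},
    {"hh_id": 5, "scan_seq": 101, "members": ["p9", "p10"]},
    {"hh_id": 5, "scan_seq": 250, "members": ["q1", "q2", "q3"]},
]
result = merge_continuation_forms(forms)
```

After this step, any household id that still occurs more than once is a genuine duplicate from enumerator error rather than a scanning artefact.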

2. The problem comes when the duplicate household id's result from enumerator error, using the same id twice (or even more often). You suggest an approach, which I'll check out. However, it does not seem as simple as one would wish. I have now also figured out a solution myself, which runs as follows:
a. Where duplicate household numbers continue to exist (not resulting from scanning), keep only one household in the master file and delete the other households from there. Store these excess households away in a "Write" file.
b. When writing out the excess households, assign them new unique household numbers. For each EA these start at 999, decreasing by 1 for each applicable case in the EA.
c. Concatenate the master and the "Write" files.
d. Sort the resulting complete data file.
e. Pass this new data file again through the editing program, making sure there are no more duplicate household numbers.
I have put these operations together in a batch (*.bat) application.
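
For readers who want to see steps a to e in one place, here is a minimal Python sketch of the same idea. It is not the CSPro batch application itself; the function name and sample data are hypothetical, and it assumes no genuine household number in an EA comes anywhere near 999.

```python
def split_duplicates(households, first_new_number=999):
    """households: list of (ea, hh_number, members) in file order."""
    seen = set()
    master, excess = [], []
    next_number = {}  # next free replacement number, tracked per EA
    for ea, hh, members in households:
        if (ea, hh) not in seen:
            # Step a: the first household with this number stays in the master file.
            seen.add((ea, hh))
            master.append((ea, hh, members))
        else:
            # Steps a/b: later ones go to the "Write" file with a new number.
            new_hh = next_number.get(ea, first_new_number)
            next_number[ea] = new_hh - 1
            excess.append((ea, new_hh, members))
    combined = master + excess                   # step c: concatenate
    combined.sort(key=lambda h: (h[0], h[1]))    # step d: sort
    ids = [(ea, hh) for ea, hh, _ in combined]
    assert len(ids) == len(set(ids))             # step e: no duplicates remain
    return combined

# Hypothetical sample: number 12 used three times in EA01, once in EA02.
sample = [
    ("EA01", 12, ["head A"]),
    ("EA01", 12, ["head B"]),
    ("EA01", 12, ["head C"]),
    ("EA02", 12, ["head D"]),
]
cleaned = split_duplicates(sample)
```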

This works ok, but, alas, it also suffers from a certain level of complexity.

I would prefer a solution where the problem is corrected by CSBatchEdit in one pass without further ado. That's perhaps asking for too much.

Thank you for thinking along!

Arij Dekker
Gregory Martin
Posts: 1777
Joined: December 5th, 2011, 11:27 pm
Location: Washington, DC

Re: Duplicate household numbers

Post by Gregory Martin »

Another possibility would be to add an external dictionary to your application. This dictionary would only contain the IDs of your questionnaire. As you process your file, you would write out the IDs to the file associated with the external dictionary. First you would loadcase with the ID sequence. If it exists in the file (loadcase returns 1), then it means that the current case is a duplicate, and at that point you will assign a new ID. If it isn't in the file, you will writecase out the ID and output the case with the original ID. At the end of this operation, the external file will contain all the IDs in your data files.
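
The lookup logic can be sketched like this. It is a Python illustration only, not CSPro logic: a set stands in for the external ID file, the membership test plays the role of loadcase and `add` the role of writecase. The helper `next_free_number` and its count-down-from-999 rule are assumptions borrowed from the earlier post, and recording the newly assigned ID in the set as well is a choice made here so that it cannot be handed out twice.

```python
def next_free_number(case_id, seen_ids, start=999):
    """Hypothetical rule: count down from 999 within the EA until unused."""
    ea, _ = case_id
    number = start
    while (ea, number) in seen_ids:
        number -= 1
    return (ea, number)

def assign_unique_ids(cases, new_id=next_free_number):
    """cases: list of ((ea, hh_number), data) in file order; one pass only."""
    seen_ids = set()  # stands in for the file behind the external dictionary
    output = []
    for case_id, data in cases:
        if case_id in seen_ids:            # like loadcase returning 1: duplicate
            case_id = new_id(case_id, seen_ids)
        seen_ids.add(case_id)              # like writecase: remember this ID
        output.append((case_id, data))
    return output

# Hypothetical sample: household number 12 used three times in EA01.
cases = [
    (("EA01", 12), "household A"),
    (("EA01", 12), "household B"),
    (("EA01", 12), "household C"),
]
fixed = assign_unique_ids(cases)
```

This handles the duplicates in a single pass over the data, with no separate concatenate step; the output would still need to be sorted afterwards.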
peterlee

Re: Duplicate household numbers

Post by peterlee »

Hi, Arij Dekker.
Thanks for your nice sharing. I wonder whether it is the latest version of the Barcode Scanner? I am also looking for a fine barcode scanner whose way of processing is simple and fast to help me scan Aztec barcodes. It would be better if it offers free trials for users to check. Any suggestion will be appreciated. Thanks in advance.



Best regards,
Pan