Duplicate cases

Post by **faci2012** » October 4th, 2012, 1:24 pm

Dear Gregory Martin,

I received a file with more than 60,000 cases, and it is not possible to load it in CSPro because the file has 1,700 duplicate cases (I used SAS to check for the duplicates). My question: is it possible to separate only the duplicate cases into another file? If so, how can I do this?

Thanks for your attention.

Post by **Gregory Martin** » October 5th, 2012, 12:07 am

Sure, you can do this with a batch application, though it will require some programming. Let me first state that you can use the Index Files tool to identify duplicate cases, and the tool can also automatically or manually remove the duplicates. But if you want to create a file containing only the duplicates, you can write a batch application and take advantage of save arrays.

On the first run of the program you will populate the save array with all the keys and indicate how often they occur in the file. On the second run you will only write out the duplicate cases. See attached for an example, and the code follows:

PROC GLOBAL

numeric runNumber = 1; // set to 1 for the first run, then change to 2 and run the program again

array alpha (10) keys(80000) save; // will store all the keys

array freqs(80000) save = 0 ...; // will store information on how often they occur in the file

alpha (10) thisKey;

numeric numKeys;

PROC OUTPUTDUPLICATES_FF

preproc

    if runNumber = 2 then

        numKeys = 80000; // so we search all the keys

    endif;

PROC OUTPUTDUPLICATES_QUEST

preproc

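    thisKey = maketext("%d%d",ID1,ID2); // note: unpadded IDs can collide (1,23 and 12,3 both give "123"); a fixed, zero-padded width for each ID avoids this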

    numeric idx,found;

    do idx = 1 while not found and idx <= numKeys

        if keys(idx) = thisKey then

            found = 1;

            if runNumber = 1 then

                inc(freqs(idx));

            elseif runNumber = 2 and freqs(idx) = 1 then // not a duplicate, so don't write it out

                skip case;

            endif;

        endif;

    enddo;

    if not found then

        inc(numKeys);

        keys(numKeys) = thisKey;

        freqs(numKeys) = 1;

    endif;
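For readers who want to see the two-pass idea outside CSPro, here is a minimal Python sketch of the same approach: the first pass counts how often each key occurs, and the second pass keeps only the cases whose key occurs more than once. It is an illustration, not CSPro logic. The names `make_key` and `split_duplicates`, the `width` parameter, and the representation of a case as a dictionary with `ID1` and `ID2` entries are all assumptions made for the example. Because everything runs in one process, an in-memory `Counter` stands in for the save arrays that carry the counts between the two CSPro runs.

```python
from collections import Counter


def make_key(id1: int, id2: int, width: int = 5) -> str:
    """Build a case key from two IDs (hypothetical helper).

    Each ID is zero-padded to a fixed width (an assumed value of 5)
    so that (1, 23) and (12, 3) do not produce the same key.
    """
    return f"{id1:0{width}d}{id2:0{width}d}"


def split_duplicates(cases: list[dict]) -> list[dict]:
    """Return only the cases whose key appears more than once."""
    # Pass 1: count how often each key occurs in the file.
    counts = Counter(make_key(c["ID1"], c["ID2"]) for c in cases)
    # Pass 2: keep every case whose key was seen at least twice.
    return [c for c in cases if counts[make_key(c["ID1"], c["ID2"])] > 1]


if __name__ == "__main__":
    sample = [
        {"ID1": 1, "ID2": 23, "value": "a"},
        {"ID1": 12, "ID2": 3, "value": "b"},
        {"ID1": 1, "ID2": 23, "value": "c"},
    ]
    for case in split_duplicates(sample):
        print(case)
```

As in the CSPro program, every copy of a duplicated case is written out, including the first occurrence, so the duplicates can be compared side by side.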

CSPro Users Forum
