Duplicate cases

faci2012
Posts: 4
Joined: August 28th, 2012, 4:05 pm

Duplicate cases

Post by faci2012 »

Dear Gregory Martin,

I received a file with more than 60,000 cases, and it cannot be loaded in CSPro because the file contains 1,700 duplicate cases (I used SAS to check for these duplicates). My questions: is it possible to separate only the duplicate cases into another file? And if so, how can I do this?

Thanks for your attention.
Gregory Martin
Posts: 1792
Joined: December 5th, 2011, 11:27 pm
Location: Washington, DC

Re: Duplicate cases

Post by Gregory Martin »

Sure, you can do this with a batch application, though it will require some programming. Let me first state that you can use the Index Files tool to identify duplicate cases, and the tool can also automatically or manually remove the duplicates. But if you want to create a file containing only the duplicates, you can write a batch application and take advantage of save arrays.

On the first run of the program you will populate the save array with all the keys and indicate how often they occur in the file. On the second run you will only write out the duplicate cases. See attached for an example, and the code follows:
PROC GLOBAL

numeric runNumber = 1; // 1 for the first run, 2 for the second

array alpha (10) keys(80000) save; // will store all the keys
array freqs(80000) save = 0 ...; // will store information on how often they occur in the file

alpha (10) thisKey;
numeric numKeys;


PROC OUTPUTDUPLICATES_FF

preproc

    
    if runNumber = 2 then
        numKeys = 80000; // so we search all the keys
    endif;


PROC OUTPUTDUPLICATES_QUEST

preproc

    thisKey = maketext("%d%d",ID1,ID2);

    numeric idx,found;

    do idx = 1 while not found and idx <= numKeys

        if keys(idx) = thisKey then
            found = 1;

            if runNumber = 1 then
                inc(freqs(idx));

            elseif runNumber = 2 and freqs(idx) = 1 then // not a duplicate, so don't write it out
                skip case;

            endif;

        endif;

    enddo;

    if not found then

        inc(numKeys);
        keys(numKeys) = thisKey;
        freqs(numKeys) = 1;

    endif;
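
For anyone who wants to check the logic outside CSPro, the same two-pass idea (count every key first, then keep only the cases whose key occurs more than once) can be sketched in Python. This is only an illustration and not part of the attached application: the function name split_duplicates, the key_of parameter, and the sample cases are made up for the example.

```python
from collections import Counter


def split_duplicates(cases, key_of):
    """Return (duplicates, uniques) using two passes over a list of cases."""
    # Pass 1: count how often each key occurs (the role of the freqs array).
    freqs = Counter(key_of(case) for case in cases)

    # Pass 2: a case is a duplicate if its key was seen more than once.
    duplicates = [case for case in cases if freqs[key_of(case)] > 1]
    uniques = [case for case in cases if freqs[key_of(case)] == 1]
    return duplicates, uniques


# Hypothetical sample data: each case is (ID1, ID2, payload).
cases = [
    (1, 1, "a"),
    (1, 2, "b"),
    (1, 1, "c"),
    (2, 5, "d"),
]

# The key is the (ID1, ID2) pair, kept as a tuple rather than joined text.
duplicates, uniques = split_duplicates(cases, key_of=lambda c: (c[0], c[1]))
print(duplicates)
print(uniques)
```

In this sketch the key is a tuple, so the pairs (1, 12) and (11, 2) stay distinct, and the counter is a dictionary lookup rather than a linear search through an array.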
Attachments
outputDuplicates.zip
(11.55 KiB) Downloaded 522 times