Dear Gregory Martin,
I received a file with more than 60,000 cases, but it cannot be loaded by CSPro because the file has 1,700 duplicate cases (I used SAS to check for these duplicates). My questions: Is it possible to separate only the duplicate cases into another file? And if so, how can this be done?
Thanks for your attention.
Duplicate cases
Re: Duplicate cases
Sure, you can do this with a batch application, though it will require some programming. Let me first state that you can use the Index Files tool to identify duplicate cases, and the tool can also automatically or manually remove the duplicates. But if you want to create a file containing only the duplicates, you can write a batch application and take advantage of save arrays.
On the first run of the program you will populate the save array with all the keys and indicate how often they occur in the file. On the second run you will only write out the duplicate cases. See attached for an example, and the code follows:
PROC GLOBAL

numeric runNumber = 1; // 1 for the first run, 2 for the second
array alpha (10) keys(80000) save; // will store all the keys
array freqs(80000) save = 0 ...; // will store information on how often they occur in the file
alpha (10) thisKey;
numeric numKeys;

PROC OUTPUTDUPLICATES_FF

preproc

  if runNumber = 2 then
    numKeys = 80000; // so we search all the keys
  endif;

PROC OUTPUTDUPLICATES_QUEST

preproc

  thisKey = maketext("%d%d",ID1,ID2);

  numeric idx,found;

  do idx = 1 while not found and idx <= numKeys
    if keys(idx) = thisKey then
      found = 1;
      if runNumber = 1 then
        inc(freqs(idx));
      elseif runNumber = 2 and freqs(idx) = 1 then // not a duplicate, so don't write it out
        skip case;
      endif;
    endif;
  enddo;

  if not found then
    inc(numKeys);
    keys(numKeys) = thisKey;
    freqs(numKeys) = 1;
  endif;
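If it helps to see the two-pass idea outside of CSPro, here is a minimal Python sketch of the same logic: the first pass counts how often each key occurs, and the second pass keeps only the cases whose key occurs more than once. This is only an illustration, not part of the attached application. The function name split_duplicates, the sample cases, and the use of an (ID1, ID2) tuple as the key are assumptions made for the example.

```python
from collections import Counter


def split_duplicates(cases, key_of):
    """Split cases into (duplicates, uniques) using two passes over the data."""
    # Pass 1: count how often each key occurs
    # (the role played by the saved keys/freqs arrays in the batch program).
    counts = Counter(key_of(case) for case in cases)

    # Pass 2: route each case by the frequency of its key.
    duplicates = [case for case in cases if counts[key_of(case)] > 1]
    uniques = [case for case in cases if counts[key_of(case)] == 1]
    return duplicates, uniques


# Hypothetical sample cases; ID1 and ID2 stand in for the case ID items.
cases = [
    {"ID1": 1, "ID2": 1, "value": "a"},
    {"ID1": 1, "ID2": 2, "value": "b"},
    {"ID1": 1, "ID2": 1, "value": "c"},
]

duplicates, uniques = split_duplicates(cases, lambda c: (c["ID1"], c["ID2"]))
```

Note that every copy of a duplicated case ends up in the duplicates list, including the first one, which matches what the batch program writes out on its second run.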
- Attachments
- outputDuplicates.zip (11.55 KiB)