SAS and R perform Merge differently

3

What is the difference between the way R and SAS execute the merge? The SAS Merge command returns 205546 rows, and the R returns 207208 rows. Here's an example.

I'm working with the IBGE file available at:
ftp: // ftp. ibge.gov.br/PNS/2013/microdados/pns_2013_microdados.zip

The bases DOMPNS2013.txt and PESPNS2013.txt will be used

SAS:
1) Assignment of variables: execute the files "input DOMPNS2013" and "input PESPNS2013"
2) Selecting an Interest Value and Merge:

data dompns2013v3;  
set dompns2013;  
if V0015 = 1;  
run;  
/*NOTE: There were 81187 observations read from the data set WORK.DOMPNS2013.
NOTE: The data set WORK.DOMPNS2013V2 has 64348 observations and 20 variables.*/  

data arq.dompes2013v3;  
merge dompns2013v3 pespns2013;   
by v0001 v0024 upa_pns v0006;  
run;  
/*NOTE: There were 64348 observations read from the data set WORK.DOMPNS2013V2.
NOTE: There were 205546 observations read from the data set WORK.PESPNS2013.
NOTE: The data set ARQ.DOMPES2013V2 has 205546 observations and 388 variables.
NOTE: DATA statement used (Total process time):*/  
#

A: 1) assignment of variables:

d2013 = read.fwf(file='DOMPNS2013.txt',widths=c(2,8,7,4,2,6,1,1))  

names(d2013) = c("v0001","v0024","upa_pns","v0006","v0015","skip1","v0026","v0031")  

d2013 = subset(d2013,select=c("v0001","v0024","upa_pns","v0006","v0015","v0026","v0031"))  

p2013 = read.fwf(file='PESPNS2013.txt',widths=c(2,8,7,4,1,2,2,2,1,8,3))  

names(p2013)=c("v0001","v0024","upa_pns","v0006","v0025","skip1","c00301","c004","c006","skip2","c008")  

p2013=subset(p2013,select=c("v0001","v0024","upa_pns","v0006","v0025","c00301","c004","c006","c008"))  

2) Selecting an Interest and Merge Value:

dim(d2013)  
[1] 81187     7  

d2013 = subset(d2013, d2013$v0015 == 1)  
dim(d2013)  
[1] 64348     7  

dim(p2013)  
[1] 205546      9  

dpmerge = merge( p2013,d2013,by=c("v0001","v0024","upa_pns","v0006"))  
dim(dpmerge)  
[1] 207208     12  
    
asked by anonymous 26.01.2015 / 19:03

2 answers

1

SAS is removing duplicate DOMPNS records before doing the merge.

If you make d2013 <- unique(d2013) before doing the merge in R, the number of observations will be equal.

    
16.04.2015 / 15:25
0

Hello, Henry! I do not know how to merge function in R, but in SAS, yes. To use the merge in SAS you need to pre-order the two datasets by the crossover key, because if you happen to have a smaller key value in later values, they will be disregarded.

In your case, the following code (placed just before using the merge) would solve the problem;

proc sort data dompns2013v3; by v0001 v0024 upa_pns v0006; run; proc sort data pespns2013; by v0001 v0024 upa_pns v0006; run;

What I would recommend is to use the sql proc to cross tables. The merge is the most efficient cross-over option, but the tables need to be ordered. Using proc sql , on the other hand, you do not need to sort the bases, other than the fact that using the SQL language is something more versatile because you can use it in several databases. As I do not know the structure of the tables you are using, I can not pass you an equivalent SQL code, but you can easily find it on the internet.

Some considerations / tips that are not directly related to the doubt but which may be useful:

- To filter records, use the WHERE clause instead of if, as it is more efficient. In your case you change this code snippet ( if V0015 = 1; ) by this one ( where V0015 = 1; ); - if there is a possibility of having duplicate keys and also assuming that you want only the unique records, add the nodupkey option in proc sort (after the date = entry_base ). Remembering that it will retain only the first value that the key appears in the base; so if the order in which the key appears, and not just the key itself, is important, you first use proc sort without the nodupkey option - clause by - the keys and the criteria) and then use proc sort with the nodupkey option (by placing in the clause by only the key).

    
02.12.2018 / 04:43