What is the difference between how R and SAS perform a merge? The SAS MERGE statement returns 205546 rows, while R's merge() returns 207208 rows. Here is an example.
I'm working with the IBGE file available at:
ftp://ftp.ibge.gov.br/PNS/2013/microdados/pns_2013_microdados.zip
The data files DOMPNS2013.txt and PESPNS2013.txt are used.
SAS:
1) Assigning variables: run the files "input DOMPNS2013" and "input PESPNS2013"
2) Selecting the value of interest and merging:
data dompns2013v3;
set dompns2013;
if V0015 = 1;
run;
/*NOTE: There were 81187 observations read from the data set WORK.DOMPNS2013.
NOTE: The data set WORK.DOMPNS2013V2 has 64348 observations and 20 variables.*/
data arq.dompes2013v3;
merge dompns2013v3 pespns2013;
by v0001 v0024 upa_pns v0006;
run;
/*NOTE: There were 64348 observations read from the data set WORK.DOMPNS2013V2.
NOTE: There were 205546 observations read from the data set WORK.PESPNS2013.
NOTE: The data set ARQ.DOMPES2013V2 has 205546 observations and 388 variables.
NOTE: DATA statement used (Total process time):*/
#
R:
1) Assigning variables:
d2013 = read.fwf(file='DOMPNS2013.txt',widths=c(2,8,7,4,2,6,1,1))
names(d2013) = c("v0001","v0024","upa_pns","v0006","v0015","skip1","v0026","v0031")
d2013 = subset(d2013,select=c("v0001","v0024","upa_pns","v0006","v0015","v0026","v0031"))
p2013 = read.fwf(file='PESPNS2013.txt',widths=c(2,8,7,4,1,2,2,2,1,8,3))
names(p2013)=c("v0001","v0024","upa_pns","v0006","v0025","skip1","c00301","c004","c006","skip2","c008")
p2013=subset(p2013,select=c("v0001","v0024","upa_pns","v0006","v0025","c00301","c004","c006","c008"))
2) Selecting the value of interest and merging:
dim(d2013)
[1] 81187 7
d2013 = subset(d2013, d2013$v0015 == 1)
dim(d2013)
[1] 64348 7
dim(p2013)
[1] 205546 9
dpmerge = merge( p2013,d2013,by=c("v0001","v0024","upa_pns","v0006"))
dim(dpmerge)
[1] 207208 12
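One thing that might be worth checking is whether the four key columns identify each household uniquely. R's merge() is a database-style join: it defaults to an inner join and returns every combination of matching rows, whereas the SAS DATA step MERGE pairs observations one by one within each BY group. The sketch below uses toy data (the data frames dom and pes and their values are made up for illustration, not taken from the PNS files) to show how a duplicated key inflates the row count in R:

```r
# Toy data (hypothetical values, not from the PNS files).
# The household table has a duplicated key: id = 1 appears twice.
dom <- data.frame(id = c(1, 1, 2), dom_val = c("a", "b", "c"))
pes <- data.frame(id = c(1, 1, 1, 2, 3), pes_val = 1:5)

# Number of household rows whose key already appeared in an earlier row.
n_dup <- sum(duplicated(dom["id"]))

# Inner join (the default): 2 x 3 combinations for id = 1,
# 1 x 1 for id = 2, and nothing for id = 3, which has no household row.
inner <- merge(pes, dom, by = "id")

# all.x = TRUE also keeps the unmatched person row (id = 3),
# with NA in the household columns.
left <- merge(pes, dom, by = "id", all.x = TRUE)

print(n_dup)
print(nrow(inner))
print(nrow(left))
```

On the real data the same check would be sum(duplicated(d2013[, c("v0001","v0024","upa_pns","v0006")])). If it returns anything above zero, the extra rows in R may come from these repeated keys. That is only one possible explanation, and it assumes the fixed-width columns were read with the same layout in both programs.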