Understanding the survey package

Here goes ...

I'm studying the survey package

I started by studying this page

link

But my questions are more basic

I have already loaded the following data set:

> mydata
  id str clu     wt hou85 ue91 lab91
1  1   2   1  0.500 26881 4123 33786
2  2   2   1  0.500 26881 4123 33786
3  3   1  10  1.004  9230 1623 13727
4  4   1   4  1.893  4896  760  5919
5  5   1   7  2.173  4264  767  5823
6  6   1  32  2.971  3119  568  4011
7  7   1  26  4.762  1946  331  2543
8  8   1  18  6.335  1463  187  1448
9  9   1  13 13.730   675  129   927
> 

I would like to understand exactly what the following code does:

mydesign <- 
svydesign(
    id = ~clu ,
    data = mydata ,
    weight = ~wt ,
    strata = ~str
)

What is the role of the id = ~ clu argument?

And what is the role of the argument strata = ~ str?

From the little I have read, it seems that mydata is being divided or partitioned in some way, but I cannot see how.

Now look at the following sequence of commands:

> summary(mydata$ue91)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     129     331     760    1401    1623    4123 
>
> options(survey.lonely.psu = "adjust")
> svymean(~ue91, mydesign)
       mean     SE
ue91 445.18 185.56

First the average is 1401 and then the average is 445.18. Why?

What does SE mean?

Folks, those are my questions for now.

Thank you

    
asked by anonymous 09.06.2014 / 06:16

1 answer

The survey package is for the analysis of complex samples, that is, samples in which not all elements are equally likely to be selected. That is where the strata, weights and ids parameters come in.

It is hard to explain what is what using your data set, because I did not fully understand it, so I will explain using the IBGE sample data instead. The sample works as follows: 5% of the households are sampled (this value can change from one city to another), and all the residents of these households answer the sample questionnaire, which is more complete than the universe questionnaire. These households are then grouped into AEDs (Sample Data Expansion Areas). The data (available here) have, among many others, the following variables:

V0010 - Sample weight (Peso amostral)
V0011 - AED
V0300 - Control (Controle)

The variable V0010 (sample weight) is calculated after both the sample and the census have been collected, using variables common to the sample and universe questionnaires. With survey, we must declare the following parameters:

 svydesign(ids = ~ V0300,  strata = ~ V0011, weights = ~ V0010, data = dados)

In this case, the ids parameter receives the variable V0300, since it is the code of the sampled household and all household members are interviewed (so the household is a cluster, not a stratum). The strata are the AEDs (V0011), because only a percentage of the households in each one was sampled. The weights parameter receives the sample weight, V0010.
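
To see the same kind of structure in the mydata from the question, a small base R sketch may help. It rebuilds the data frame by hand from the printed output (the numeric column types are an assumption) and counts how many distinct clusters fall in each stratum; psu_per_stratum and rows_per_cluster are just illustrative names.

```r
# mydata retyped from the output printed in the question
mydata <- data.frame(
  id    = 1:9,
  str   = c(2, 2, 1, 1, 1, 1, 1, 1, 1),
  clu   = c(1, 1, 10, 4, 7, 32, 26, 18, 13),
  wt    = c(0.500, 0.500, 1.004, 1.893, 2.173, 2.971, 4.762, 6.335, 13.730),
  hou85 = c(26881, 26881, 9230, 4896, 4264, 3119, 1946, 1463, 675),
  ue91  = c(4123, 4123, 1623, 760, 767, 568, 331, 187, 129),
  lab91 = c(33786, 33786, 13727, 5919, 5823, 4011, 2543, 1448, 927)
)

# Number of distinct clusters (primary sampling units) inside each stratum
psu_per_stratum <- tapply(mydata$clu, mydata$str,
                          function(x) length(unique(x)))
print(psu_per_stratum)

# Number of rows (observations) inside each cluster
rows_per_cluster <- table(mydata$clu)
print(rows_per_cluster)
```

Stratum 1 has seven clusters of one row each, while stratum 2 has a single cluster (clu = 1) with two rows. A stratum with only one cluster is presumably the reason options(survey.lonely.psu = "adjust") was needed before calling svymean.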

The difference in your results arises because the second one is a weighted mean, using wt as the weight. You can get the same value with the command:

with(mydata, sum((wt * ue91)/sum(wt)))
[1] 445.1821
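
The same comparison can also be written with base R's weighted.mean(). A minimal sketch, with the two columns retyped from the data in the question (the vector names wt and ue91 simply mirror the column names):

```r
# Columns retyped from the mydata printed in the question
wt   <- c(0.500, 0.500, 1.004, 1.893, 2.173, 2.971, 4.762, 6.335, 13.730)
ue91 <- c(4123, 4123, 1623, 760, 767, 568, 331, 187, 129)

# Unweighted mean: every row counts the same (what summary() reports)
plain_mean <- mean(ue91)

# Weighted mean: each row counts in proportion to its sampling weight
weighted <- weighted.mean(ue91, w = wt)

print(plain_mean)
print(weighted)
```

The rows with large ue91 values have small weights, and the rows with small ue91 values have large weights, which is why the weighted mean ends up so much lower than the plain one.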

As for SE, it is the standard error of the estimated mean.
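
If it helps, the estimate and its standard error can be pulled out of the svymean result separately. This is only a sketch: it assumes the survey package is installed and rebuilds just the columns the design needs from the output printed in the question.

```r
library(survey)

# Only the columns used by the design, retyped from the question
mydata <- data.frame(
  str  = c(2, 2, 1, 1, 1, 1, 1, 1, 1),
  clu  = c(1, 1, 10, 4, 7, 32, 26, 18, 13),
  wt   = c(0.500, 0.500, 1.004, 1.893, 2.173, 2.971, 4.762, 6.335, 13.730),
  ue91 = c(4123, 4123, 1623, 760, 767, 568, 331, 187, 129)
)

# Stratum 2 contains a single cluster, so choose how a lonely PSU is handled
options(survey.lonely.psu = "adjust")

mydesign <- svydesign(ids = ~clu, strata = ~str, weights = ~wt, data = mydata)

est <- svymean(~ue91, mydesign)
coef(est)     # point estimate (the weighted mean)
SE(est)       # standard error of that estimate
confint(est)  # approximate 95% confidence interval based on the SE
```

The point estimate depends only on the weights, whereas the SE also depends on how the clusters and strata are declared, so changing ids or strata changes the SE but not the mean.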

    
09.06.2014 / 18:05