Sunday, April 13, 2014

Survey: Computing Your Own Post-Stratification Weights in R

Social Science Goes R: Weighted Survey Data

Survey Data: Computing Your Own Weights

The second installment in my series on working with survey data in R explains how to compute your own post-stratification weights to use with survey data. For a more detailed overview on why you might need post-stratification weights, look at my previous post on survey weights.

Working with survey data has been a focal point of quantitative social sciences for most of the 20^th century. The idea behind a survey is to take a sample of the general population and generalize the sample’s answers. For this, the sample needs to be representative, i.e. reflect the characteristics of the population it was drawn from. Broadly speaking, population characteristics are important because they define strata within the population that behave differently with respect to social science questions. For instance, workers tend to harbor different views than managers, and young people lead different lives than pensioners. Maintaining the population’s mix in the sample is key in coming up with generalizable findings.

What precisely a good characteristic for a population is, depends on the population, the questionnaire and the research interest. A lot of consideration needs to be put into selecting appropriate population characteristics for aligning it with the sample. Good starters, however, are the distributions of age and gender or employment status.

Post-stratification weights are actually a very important tool to generalize findings from a sample to a larger population. See for instance Andrew Gelman’s piece in the Washington Post on the subject.

Computing your own weights

When you conduct a survey yourself, you need to come up with weights. Already when sketching the questionnaire, one needs to know which characteristics will later be used to align sample and population. Each one of them needs to be covered by a question in the questionnaire. The idea behind post stratification is to make sure you have as many pensioners and as many workaholics in your data set, as there are in the general population, to extend the example from last time. For this to work, you need the number of workaholics and pensioners in the general population. We call these figures the marginal distributions. Again, which marginal distribution you need, depends on your data set and your research questions.

Once you have the marginal distributions, you can use survey’s rake() function to compute the weights. This is done by an algorithm called Iterative Proportional Fitting (IPF). See Wikipedia’s entry on IPF for all its gory details. For a easier description, let’s consider that this algorithm tries to find weights, so that the actual distributions in your data set, when weighted, match the specified marginal distributions of the general population as closely as possible. Sometimes, this can get a little out of hand, and weights become too large. In that case you need to trim the weights back to a reasonable amount. Having weights between 0.3 and 3 is a good rule of thumb.

There are three steps involved: (a) you need to make a survey design object out of your data, without any weights associated, (b) you need to rake this object, so that you now have weights, and optionally (c), if the weights become too small or too large, you need to trim them.

Create unweighted survey design object

Let’s use the small artificial data set from the previous post again. In order to have self-contained examples, let’s load the data set (again).

load(url("http://knutur.at/wsmt/R/RData/small.RData"))

This data set contains a weight variable, but for now let’s assume it had come without any. This would be the case, if e.g. you had conducted the survey yourself.

Now the first step is to create a survey design object without any weights. survey’s syntax here is quite straight forward. The ids argument is used to tell survey that the data came all from on single primary sampling unit. If we were to survey students nested in classes in schools, or institutions nested in countries, then our respondents would not all come from the same sampling unit. But since our data set small is to mimic a simple, run-of-the-mill survey, all respondents were chosen at random from the same list.

In order to create a survey object, let’s first load the survey package.

Then we create the unweighted survey object:

small.svy.unweighted <- svydesign(ids=~1, data=small)

## Warning: No weights or probabilities supplied, assuming equal probability

This just works as before, but now you don’t specify any weights. We still need to compute them (and R tells us that in a warning).

Rake unweighted survey design

To rake, you first need marginal distributions. Once you obtained them, you need to tell the rake function, which variables correspond with what distribution. R does everything else.

Specifying these relationships looks a little bit daunting at first, but is actually quite easy. rake() uses two arguments for this, sample.margins and population.margins. Sample.margins is a list of formulas, that tell R where to look for the variables you want to weight with. These formulas can be as simple as ~sex. But they can also be more complicated, for instance ~sex+edu, if you have population marginal distributions for education obtained separated for the two sexes.

The argument population.margins takes a list of just that, i.e. data frames specifying how often each level occurs in the population. You need one data frame for each of the formulas above.

Here, we want to compute weights based on the distribution of gender (variable sex) and education (edu). These variables have 2 and 3 levels, respectively. We are going to need the relative frequencies for each of these levels.
So let’s set up the population data frames:

sex.dist <- data.frame(sex = c("M", "F"),
                       Freq = nrow(small) * c(0.45, 0.55))
edu.dist <- data.frame(edu = c("Lo", "Mid", "Hi"),
                       Freq = nrow(small) * c(0.30, 0.50, 0.20))

Here, each data frame consists of two vectors: one describing the levels of the associated factor and the other the corresponding frequencies. Note that we multiply the theoretically known relative frequencies as we have obtained them from our reference population with the number of rows in the data set we compute the weights for, to obtain absolute frequencies.

In real life, we would not use fantasy distributions but frequencies others have obtained for our population. Where to obtain these frequencies from depends on the reference population. But if you are working with samples that should represent the general population, then checking with your national statistical offices might get you started.

Now it’s time to compute the weights by putting together the marginal distributions from before with our data.

small.svy.rake <- rake(design = small.svy.unweighted,
                   sample.margins = list(~sex, ~edu),
                   population.margins = list(sex.dist, edu.dist))

This creates a new survey design object, that now contains weights. The argument sample.margins contains a list of simple formulas, that tell R that the marginal distributions of interest to you are sex and edu. The argument population.margins then specifies the marginal distributions as they are to be found in the general population.

Trim the weights

As stated above, sometimes it is necessary to trim weights, if they have grown too large or too small. This will make your data fit less well the population marginal distributions, but inflating a few cases to too much weight, is rarely ever sensible: Perhaps that one person that now counts as 50 is somewhat deranged, or otherwise not representative. So it is best to keep an eye on your weights.

summary(weights(small.svy.rake))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.431   0.745   0.945   1.000   1.620   1.630

These weights actually fit quite well, with none of them exceeding our rule of thumb bounds. If they were to be smaller than 0.3 or larger than 3, then we should trim the weights. Survey provides us with the handy trimWeights() function for this. This function takes three arguments: a survey design object, and a lower and upper bound for the weights. It works by setting weights below and above the limits to the limits’ values. The weight that is now “missing”, is divided equally among the cases that were not affected by the trimming operation. This can sometimes push cases above the limit. To prevent this, specify the option strict=TRUE.

small.svy.rake.trim <- trimWeights(small.svy.rake, lower=0.3, upper=3,
                                  strict=TRUE)

This would then create a new survey design object that now has its weights trimmed to fit into the interval \((0.3,3)\).

After this, we now have a survey design object with (trimmed) weights applied to it.

With that we conclude this little demonstration on how to compute your own sampling weights. While the main application is certainly with proper random samples, you can also use weights to get potentially biased surveys, like online surveys, to better fit the underlying population. But bear in mind that the only thing that weights can do, is ensure that your sample composition better mimics the general population’s characteristics. Weights will never help you if the process governing non-response is part of the puzzle you want to solve.

Perhaps noteworthy is also the relation between post-stratified random samples and quota samples. In a random sample, we define a population, draw from that population at random and then compute and apply weights to align the sample with the population. This weighting is necessary because some people originally sampled might be e.g. harder to reach than others, thereby biasing the sample. Once the post-stratification weights have been applied, the random sample is representative of the population it was drawn from. Statistics gives us a method to tell just how accurately the findings from the sample can be generalized.

Quota samples, on the other hand work differently. There, not only a population is being defined, but also how many persons of a certain strata are to be sampled. For instance, a quota sample might prescribe that there are 250 women and 200 men to be sampled, in a bid to generate a sample that is representative of the population in the first place. At a first look, this might make quota samples more attractive, as no post-stratification is required to ensure representativeness. There is a problem, however. And its quite a serious one: A quota sample might be representative of the population (if quotas actually do work, which they not always do). But a quota sample will never satisfy the strict randomness requirements that statistics require. Only if we are working with a random sample can we make inferences from the sample to the population. In quota samples, there is not sufficient randomness, as the interviewer selects the interviewees actively. Therefore, quota samples cannot be used to reason about the general population.

In the next post of this series we are going to dive deeper into the survey package and compute descriptive statistics for survey-weighted data. Again, if you have any thoughts on computing your own post-stratification weights, please do leave a comment.

Wednesday, April 2, 2014

Social Science Goes R: Weighted Survey Data

To get this blog started, I'll be rolling out a series of posts relating to the use of survey data in R. Most content comes from the ECPR Winter School in Methods and Techniques R course, that I had the pleasure of teaching this February. This post is going to be focused on some of the practical problems of sampling in social science contexts and on an introduction to the survey package by Thomas Lumley and on how to set up data with predefined sampling weights.

In the next few weeks, I'll be covering more topics related to survey data, like computing your own weights for a survey, doing statistical analyses and visualizing survey data. All posts will deal with R and survey and be example driven. Enough of a prelude, let's get to the meat.

Survey data

Much of social science quantitative methods still rely on survey data. Surveys are questionnaires that are distributed to interviewees in hopes of measuring their views on certain topics. Instead of asking everyone – which is too costly, usually – only a subset, the sample, is being interviewed. The findings from this subset are then generalized on the entire population, using statistical methods. Most famously perhaps, surveys are used to predict the outcomes of elections.

For this generalization from sample to population to work, everyone in the population has to have the same chances of ending up in the sample. We call this simple random sampling. This is easy to achieve for small populations: Let's say you want to estimate the age distribution of all students in a (large) class. For small classes, you could simply ask everyone. If you find a day where everyone is present, you can then simply pick people at random. If your aim is as bad as mine, you can just throw things at people. The person that catches it, is sampled. This should make for a simple random sample. Everyone had the same chance to be hit by my random shots. If you don't like throwing things at people, you can throw things at a list of people instead. This leads us to how simple random sampling is done in practice: taking a list of everyone in the population and selecting people at random from it. Alternatively but to the same effect if (and only if unfortunately) everyone has a unique number, we can just randomly create numbers, until we have enough people in the sample.

Sampling from larger groups becomes more complex. In theory we could use one of the simple random sampling approaches outlined above. But for larger groups, there are usually no lists. And even if there are complete and up-to-date lists, it might not guarantee that you can reach that person. Or that the person on the list wants to even talk to you.

Non response

In social science, we generally group non-response into two broad categories: Item and unit non-response. The former refers to certain questions not being answered by someone who was already sampled successfully. This effectively boils down to missing values and is another topic that is generally ignored in social sciences. There will be a series on missing value treatment later on.

Here, I want to focus on unit non-response. That is, people not ending up in the sample, for various reasons. One possibility is that they were not on the list from the example above. If our list is made up of legal telephone numbers, but someone in the population does not have a telephone, than that person cannot end up in the sample. Similarly, if we contact people by phone (and they indeed do have a phone), but they are too busy to participate in the survey, they won't end up in the sample either.

As long as this unit non-response is purely random, it is not a problem. It is also not a problem, if it is not random but unconnected to the phenomenon we want to investigate. For instance, if you investigate favorite colors, it is plausible to assume that people without phones or people that refuse to participate in the survey don't have systematically different color preferences. Unfortunately, this is about the most realistic example for the independence of a social science investigative topic and sample inclusion. Most questions in social science would be answered differently by those without phones and those that do not participate in surveys.

Representativeness

In order to be able to conclude from our sample to the entire population, that sample needs to be representative of the population. Here representative means that traits that can be found in the population are also found in the sample. And at the same proportions. So if the population contains 54% women, so should the sample. The odds that simple random sampling will produce a representative sample are very high. Whether a sample is representative or not is also context dependent. A sample that is representative for social sciences might not be representative in a medical context.

Because some people are very hard or even impossible to reach, the sample we end up with might not be representative. It might be biased. Typically, in a survey sample we end up with too many senior citizens, too many women and too few young people and too few workaholics. Using that sample as is permits conclusions only from the sample to the typical survey participating population. Usually, that population is not the one we are interested in. Rather, we would like for our results to apply to the general public.

So, because not every one in the population has the same probability of being included in our sample, we don't end up with a simple random sample. As it is not a simple random sample, we need to check if it is still representative. And usually, it isn't.

Post-stratification weights

The solution to this dilemma is quite easy. We assign everyone that did make it into the sample a weight \( w \). So people that turn out too often in the sample receive a weight of less than 1. And those that we were not able to reach enough of are upweighted with a weight larger than 1. Respondents that belong to groups that have been sampled perfectly receive a weight of 1. This solution is called post-stratification, because it computes weights based on group (or stratum) characteristics, like the distribution of age or gender proportions.

This method of course only works to correct minor biases. If we did not manage to include a single person from a group, no weight can correct for that. Even if we managed to sample only very few, the resulting large weights could be problematic. If that three young men suddenly count for 100, the results might not be representative either.

In the next weeks we'll go into more detail here and discuss how these weights can be computed. For now, we'll focus on data sets that come with weights supplied. This is typical for large social science data sets, like the European Social Survey.

Survey data in R

In R, working with survey data that requires post-stratification weights is made possible by the survey package. The survey package not only allows for adjusting the composition of a sample to the characteristics of the general population. Most base packages would allow you to do that by specifying a weights argument. The survey package goes further by correcting the design effect introduced by the application of post-stratification weights.

I've prepared a small data set to demonstrate the usage of survey. You can find it under: http://knutur.at/wsmt/R/RData/small.RData. It's basically just random data with some creative column names.

To start using this data set, we first need to load it into R and then attach the survey package.

load(url("http://knutur.at/wsmt/R/RData/small.RData"))

Let's look at the data briefly.

summary(small)

##  sex     edu        partDemo       nJobsYear        intell    
##  M:57   Lo :24   Min.   :0.000   Min.   :0.00   Min.   :40.5  
##  F:43   Mid:42   1st Qu.:0.000   1st Qu.:1.00   1st Qu.:46.9  
##         Hi :34   Median :0.500   Median :2.00   Median :50.6  
##                  Mean   :0.735   Mean   :2.25   Mean   :50.0  
##                  3rd Qu.:1.000   3rd Qu.:3.00   3rd Qu.:52.6  
##                  Max.   :4.000   Max.   :8.00   Max.   :58.7  
##                  NA's   :2                      NA's   :1     
##       emo             age           weight     
##  Min.   :-64.5   Min.   :19.0   Min.   :0.341  
##  1st Qu.: 11.2   1st Qu.:27.0   1st Qu.:1.046  
##  Median : 79.4   Median :44.0   Median :1.707  
##  Mean   : 54.2   Mean   :41.7   Mean   :1.716  
##  3rd Qu.:100.3   3rd Qu.:54.0   3rd Qu.:2.436  
##  Max.   :118.1   Max.   :63.0   Max.   :2.995  
##                  NA's   :1

There are a number of variables in this artificial data set. The one of most interest now is perhaps the weight variable. It contains the weight \( w \) per respondent.

Let's attach the survey package now:

library(survey)

## 
## Attaching package: 'survey'
## 
## The following object is masked from 'package:graphics':
## 
##     dotchart

The message that the dotchart function from the graphics package is now being masked is not really dramatic. It simply tells us that both packages provide a function by that name, and that after attaching survey the one from survey is going to be used. If you were dependent on using graphics' dotchart function, you would need to be careful here. But since we just demo the creation of weighted data sets, this message can simply be ignored.

Inside survey, there is the svydesign function that can be used to link a data frame with a weight. This function takes many arguments, three are of importance at this point: ids, data, and weights. The latter two are easily explained as they specify the data frame we seek to apply weights on and the vector containing the weights.

The first argument, ids is a little bit more complex. By default, survey assumes we use data generated in a multistage sampling procedure. Then the ids argument would be used to specify which variables encode sampling cluster membership. Clustered sampling is another quite frequent social science sampling method, but not one that is used (nor usable) for large general population surveys. Therefore, we set ids = ~ 1 to indicate that all respondents originated from the same cluster.

small.w <- svydesign(ids = ~1, data = small, weights = small$weight)

The resulting object is not a simple data frame anymore. Using summary on it produces different results:

summary(small.w)

## Independent Sampling design (with replacement)
## svydesign(ids = ~1, data = small, weights = small$weight)
## Probabilities:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.334   0.410   0.586   0.801   0.956   2.930 
## Data variables:
## [1] "sex"       "edu"       "partDemo"  "nJobsYear" "intell"   
## [6] "emo"       "age"       "weight"

To further use this object, we need to use survey's own set of analytical functions. They will be the topics of the next few posts. As a little teaser, here is a comparison of the sex ratios in the unweighted and the weighted data frames:

prop.table(table(small$sex))

## 
##    M    F 
## 0.57 0.43

prop.table(svytable(~sex, design = small.w))

## sex
##      M      F 
## 0.5857 0.4143

Not a dramatic change, but one that could prove decisive in the more complex analyses of the weeks to come. This concludes the first entry in our series on survey data in R. The next post in this series will deal with how we can compute post stratification survey weights in the first place. If you have any thoughts on survey data in R or would like to see something specific covered, please leave a comment.