Surveys are not all created equal. A good survey reveals important insights, while a bad one is money down the drain. Even worse, when a bad survey is mistaken for a good one, the cost of errors resulting from the flawed data can easily dwarf the money wasted on the poll.
So, how do you tell good surveys from bad? Good surveys are, in a word, scientific. Most people understand that a good scientific survey is based on a sample that is representative of the population being studied. But there is one more critical ingredient for proper polling that far fewer are aware of: weights.
Weighted Surveys Are Accurate Surveys
Weights determine how much each person in a sample counts. They are required because real-world surveys differ from the simplified ideal taught in Statistics 101.
In the ideal world of introductory statistics classes, surveys use simple random samples to select respondents. In this ideal, a survey sample represents the population because each member of the population has an equal probability of being included in it, and each person sampled represents an equal slice of the population. If you sample 1,000 people in a state with a population of ten million, the reasoning goes, each respondent represents 10,000 people.
In the real world, surveys of the general population of the United States and most other countries never use simple random samples where everyone has an equal chance of inclusion, because there is no comprehensive list of all the people in the population from which a pollster can randomly draw respondents. Instead, real surveys differ from the ideal in three ways: 1) They start by selecting households, often by calling phone numbers at random; 2) Within households, they select one respondent; and 3) More often than not, no one in the selected household will take part in the survey, and the pollster has to try the next one.
Weights at Work: A Simple Illustration
Weights account for these peculiarities of sampling by determining how much each respondent counts—that is, the number of people (or the portion of the whole, however tiny) each person represents. Suppose women are more cooperative on a given survey than men, so when it is finished, there are 600 responses from women and 400 from men. If we know the population is 52 percent female and 48 percent male, it’s clear that this 60-40 split isn’t a representative sample. Weights fix this by counting the male responses a little more heavily than the female ones. In this example, the women would each be assigned a weight of 52 ÷ 60 = .867, and the men would each get a weight of 48 ÷ 40 = 1.20. Once weighted, the sample comes out to 52 percent female and 48 percent male, matching the general population.
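The arithmetic in this illustration can be sketched in a few lines of Python. This is only a minimal sketch of the example above; the variable names are made up for illustration, and the numbers are the ones given in the text.

```python
# Sample counts and population benchmarks from the illustration above.
sample_counts = {"female": 600, "male": 400}
population_share = {"female": 0.52, "male": 0.48}

n = sum(sample_counts.values())

# Each group's weight is its population share divided by its sample share.
weights = {
    group: population_share[group] / (sample_counts[group] / n)
    for group in sample_counts
}

# Weighted totals: how many "effective" responses each group contributes.
weighted_totals = {g: sample_counts[g] * weights[g] for g in sample_counts}
grand_total = sum(weighted_totals.values())

# After weighting, the sample composition matches the population benchmarks.
weighted_share = {g: weighted_totals[g] / grand_total for g in sample_counts}

for g in sample_counts:
    print(f"{g}: weight {weights[g]:.3f}, weighted share {weighted_share[g]:.2f}")
```

Note that the weighted totals still add up to the original 1,000 responses; the weights only redistribute how much each response counts.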
Weights Are Not Magic
Weights are a wonderful tool, but they are not magic. They fix minor biases or random peculiarities in survey numbers, but they cannot make surveys representative of population groups that are missing altogether. (In technical parlance, they can’t fix noncoverage or extreme nonresponse bias.) For instance, if a survey is conducted only in English, and the sample ends up with too few immigrants, weights cannot make the survey properly representative of immigrants who do not speak English.
How to Weight: From Design Effects to Poststratification
Creating weights is moderately technical, but fortunately it’s not rocket science. Here are the top ten things that any consumer or producer of polling data should know about how weights are properly constructed. (These are the basic concepts. For detailed step-by-step technical instructions on weighting, see “Computing Weights for American National Election Study Survey Data,” a report available online.)
Weighting is not optional. The only kind of survey worth paying for is a scientific survey, and with few exceptions real-world scientific surveys require weights to accurately describe the population.
There’s no such thing as a free lunch. Weights improve survey accuracy, but this accuracy typically comes at the cost of higher variance in survey estimates. In simple terms, weighting increases the survey’s margin of error.
Weights can be optimized differently. There is more than one reasonable way to create weights for a survey. Two statisticians might come up with different weights for the same survey, with each set more accurate for some purposes than others. This doesn’t mean weighting is unscientific or that one set of weights is as good as another; it just means that weights optimized for one purpose, such as accurately representing the relative proportions of men and women, will not necessarily be the same as weights optimized for another purpose.
Account for household selection. Weights are variables that determine how much each respondent counts. They are calculated by multiplying several weighting factors together. There may be factors for household selection, respondent selection, and nonresponse as well as a poststratification adjustment (all described below)—and the final weight is the product of all of these parts. The first component of most weighting describes the relative probability of selection of each household in the sample. In a telephone survey, the major component of household selection probability is the number of telephone numbers that could be used to reach the household (including cell phones, if cell phone numbers are included in the sample). The weight is the inverse of the number of phone numbers, because households with many numbers have many chances to be included in the survey. A household with only one phone number would have a weighting factor of 1 at this stage, while a household with two numbers would have a factor of ½. Other aspects of sample design can also affect household selection probability, such as samples that are stratified to include certain types of households. The overall household selection weight adjustment is the product of all of these design factors multiplied together.
Account for respondent selection. In most surveys, only one person in a selected household will be interviewed. The major component of respondent selection probability is the number of people in the household who are eligible for the survey. For example, in a survey of registered voters, the weighting factor for respondent selection would be the number of registered voters living in the household. It is important to weight by this factor to account for the fact that people living in households with several registered voters are less likely to be selected than people living alone.
Account for unequal response rates. The response rate for a survey is the percentage of eligible people in the sample who completed the survey. Response rates vary because some kinds of people participate in surveys at higher rates than others. When survey researchers know some characteristics of nonrespondents, it is appropriate to weight to adjust for these differences in response rates. Typically, very little is known about people who fail to complete a survey, but one thing that is routinely known is where they live. To adjust for unequal response rates, weight by the inverse of the response rate in each area covered by the survey—and do the same for each other group for which response rates can be separately calculated.
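The household selection, respondent selection, and nonresponse factors described above multiply together into a single weight for each respondent. The following is a minimal sketch of that product for a telephone survey, not a complete weighting procedure. The `Respondent` class, the area names, and the response rates are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Respondent:
    phone_numbers: int    # phone numbers that could reach the household
    eligible_people: int  # people in the household eligible for the survey
    area: str             # area where the household is located


# Hypothetical response rates by area (completed surveys / eligible sample).
response_rate = {"urban": 0.25, "rural": 0.40}


def base_weight(r: Respondent) -> float:
    """Product of the three selection and nonresponse factors."""
    household_factor = 1 / r.phone_numbers          # more numbers, more chances
    respondent_factor = r.eligible_people           # one of k eligible people chosen
    nonresponse_factor = 1 / response_rate[r.area]  # inverse of area response rate
    return household_factor * respondent_factor * nonresponse_factor


sample = [
    Respondent(phone_numbers=1, eligible_people=1, area="urban"),
    Respondent(phone_numbers=2, eligible_people=1, area="urban"),
    Respondent(phone_numbers=1, eligible_people=3, area="rural"),
]

weights = [base_weight(r) for r in sample]
print(weights)
```

Other design factors, such as stratification, would enter as further multipliers in `base_weight`, and a poststratification adjustment would then be applied on top of this product.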
Compare to benchmarks. Poststratification is weighting to make survey estimates match known population benchmarks, such as the percentage of men and women in the earlier example. The first step in poststratification is choosing which “benchmark” characteristics to match. The best way to do this is to compare the composition of your survey sample to that of the greater population using every statistic you can find for which you know the population’s characteristics with reasonable certainty. Typically, these will be statistics for which there are official population data, such as demographic characteristics from the decennial census or from major Census Bureau surveys such as the Current Population Survey or American Community Survey. If your weighted survey sample differs from known population characteristics by more than a few percentage points in terms of age, sex, race/ethnicity, education, or other key variables, then it is a good idea to adjust the weights to correct the largest of those errors.
Rake. Raking is a method to perform this adjustment. To rake sand smooth, you repeatedly draw a rake across it in different directions until the lumps are all smoothed out. To rake data, you do something analogous: Apply the correction factor for one variable, then another, then another, and then repeat the whole process until the errors are gone. Typically, fixing one variable (such as the sex distribution) will introduce new small errors in another (such as the age distribution). But if we fix sex, then age, then sex again, then age again, repeating five or ten or fifty times, the differences from benchmarks will get smaller and smaller until they fade away, like ripples raked away in the sand.
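The repeated fix-one-variable-then-the-next loop described above can be sketched as follows. This is an illustrative sketch rather than a production implementation: the `rake` function, the age categories, the sample counts, and the age benchmark are all hypothetical, and the code assumes every benchmark category appears at least once in the sample.

```python
def rake(rows, targets, max_iter=50, tol=1e-6):
    """Rake weights so weighted shares match the target shares.

    rows:    list of dicts, one per respondent, e.g. {"sex": "female", ...}
    targets: {variable: {category: population share}}, shares summing to 1
    """
    weights = [1.0] * len(rows)
    for _ in range(max_iter):
        max_gap = 0.0
        for var, shares in targets.items():
            total = sum(weights)
            # Current weighted share of each category of this variable.
            current = {cat: 0.0 for cat in shares}
            for row, w in zip(rows, weights):
                current[row[var]] += w / total
            # Largest error seen on this pass, measured before correcting it.
            gap = max(abs(current[c] - shares[c]) for c in shares)
            max_gap = max(max_gap, gap)
            # Apply the correction factor for this variable.
            weights = [
                w * shares[row[var]] / current[row[var]]
                for row, w in zip(rows, weights)
            ]
        if max_gap < tol:  # errors have faded away; stop raking
            break
    return weights


def weighted_share(rows, weights, var, category):
    """Weighted proportion of respondents in one category."""
    in_category = sum(w for row, w in zip(rows, weights) if row[var] == category)
    return in_category / sum(weights)


# Hypothetical sample of 1,000 respondents cross-classified by sex and age.
counts = {
    ("female", "18-44"): 200,
    ("female", "45+"): 400,
    ("male", "18-44"): 150,
    ("male", "45+"): 250,
}
rows = [{"sex": s, "age": a} for (s, a), k in counts.items() for _ in range(k)]

# Benchmarks: sex from the earlier example; the age split is made up.
targets = {
    "sex": {"female": 0.52, "male": 0.48},
    "age": {"18-44": 0.45, "45+": 0.55},
}

weights = rake(rows, targets)
print(round(weighted_share(rows, weights, "sex", "female"), 4))
print(round(weighted_share(rows, weights, "age", "18-44"), 4))
```

In practice the starting weights would be the selection and nonresponse weights rather than 1.0 for everyone, so that raking adjusts the design weights instead of replacing them.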
One controversial topic in weighting is the occasional practice of raking to match a target distribution of party identification. Matching a known population characteristic is appropriate, but raking to match party identification is unwise unless you are absolutely certain of the distribution of party identification in the population being surveyed. Party identification changes over time, is subject to measurement error, and is far more volatile than population demographics. For these reasons, academic survey researchers are skeptical about weighting to party identification.
Review and adjust. After raking, it is important to scrutinize your weights carefully. Repeat the benchmark comparisons and make sure that the survey gives more accurate results than before. Check for extreme values in the weights, which might cause statistical outliers, and cap extremely high weight values, such as those more than five or six times the average. Also, importantly, a statistician should examine the design effect—the increase in variance (or margin of error) caused by weighting. If weights have produced an unreasonably large margin of error, then the weights should be re-computed using fewer variables in raking, using fewer categories in those variables, or imposing a lower cap on the weights.
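As a rough illustration of this review step, the sketch below caps weights at five times the average and estimates the design effect. The article does not specify a formula, so the design effect here uses Kish's common approximation, n times the sum of squared weights divided by the square of the summed weights; that choice, the function names, and the example weights are all assumptions made for illustration.

```python
import math


def cap_weights(weights, max_ratio=5.0):
    """Cap each weight at max_ratio times the average weight."""
    mean_w = sum(weights) / len(weights)
    cap = max_ratio * mean_w
    return [min(w, cap) for w in weights]


def kish_design_effect(weights):
    """Kish's approximate design effect due to unequal weighting."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


# Hypothetical weights: nine ordinary respondents and one extreme value.
raw = [1.0] * 9 + [11.0]
capped = cap_weights(raw)

deff_raw = kish_design_effect(raw)
deff_capped = kish_design_effect(capped)

# The margin of error grows with the square root of the design effect,
# and the effective sample size shrinks to n / deff.
print(f"raw:    deff {deff_raw:.2f}, MOE x{math.sqrt(deff_raw):.2f}")
print(f"capped: deff {deff_capped:.2f}, MOE x{math.sqrt(deff_capped):.2f}")
print(f"effective n after capping: {len(capped) / deff_capped:.1f}")
```

Capping changes the weighted composition of the sample, so the benchmark comparisons should be repeated afterward, as the text recommends.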
Use design-consistent statistics. Finally, it is important to analyze weighted data with appropriate statistical methods. Because real-world surveys do not use simple random samples, statistical methods developed for simple random samples are usually not optimal for analyzing survey data. Consumers of poll data should ask analysts if they have used design-consistent estimation methods that account for weights and design effects in producing analytical results, from topline statistics to sophisticated models. An outline of appropriate methods for design-consistent analysis can be found in “How to Analyze ANES Survey Data,” which is available online.
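To give a sense of what accounting for weights means for a topline statistic, here is a minimal sketch of a weighted proportion with a Taylor-linearized standard error. It is a simplification, not the method from the ANES report: it treats respondents as independent draws and ignores strata and clusters, which dedicated survey software would handle. The function name and the data are hypothetical.

```python
import math


def weighted_proportion(values, weights):
    """Weighted proportion of 0/1 values and its linearized standard error.

    Assumes independent respondents with no stratification or clustering.
    """
    n = len(values)
    total_w = sum(weights)
    p = sum(w * y for w, y in zip(weights, values)) / total_w
    # Linearized residuals: each respondent's weighted deviation from p.
    residuals = [w * (y - p) / total_w for w, y in zip(weights, values)]
    variance = n / (n - 1) * sum(r * r for r in residuals)
    return p, math.sqrt(variance)


# Hypothetical data: 1 = approves, 0 = does not approve.
values = [1, 1, 0, 0]
weights = [2.0, 1.0, 1.0, 1.0]

p, se = weighted_proportion(values, weights)
print(f"weighted estimate {p:.2f}, standard error {se:.3f}")

# With equal weights the formula reduces to the familiar simple random
# sample result, s squared over n.
p_eq, se_eq = weighted_proportion(values, [1.0] * 4)
print(f"unweighted estimate {p_eq:.2f}, standard error {se_eq:.3f}")
```

The point of the comparison is that the weighted estimate and its standard error both differ from what a formula that ignores weights would report.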
The bottom line is that most real-world surveys need properly constructed weights in order to be accurate. Survey researchers, data analysts, and clients alike are increasingly insisting on these methods to improve the accuracy, and thus the value, of survey data.
Matthew DeBell, Ph.D., is director of Stanford operations for American National Election Studies.